M.S. Applied Data Science - Capstone Chronicles 2025
into three subsets using stratified sampling: 70% for training (14,725,127 observations), 15% for validation (3,155,385 observations), and 15% for testing (3,155,384 observations). Stratification ensured that each subset maintained the original class distribution, with DDOS attacks dominating at approximately 58.5% and BRUTE_FORCE attacks representing only 0.06% of the data. Training models directly on such imbalanced data often results in poor minority-class performance because algorithms tend to optimize for the majority classes. To address this, random undersampling was applied exclusively to the training set. The smallest class (BRUTE_FORCE, with 8,764 samples) served as the reference: exactly 8,764 observations were randomly sampled from each of the other seven classes. This procedure created a perfectly balanced training set of 70,112 samples (8,764 per class), giving the models equal exposure to patterns from all attack families during training. The validation and test sets were intentionally kept in their original imbalanced state to reflect real-world operational conditions, where DDOS attacks are far more common than BRUTE_FORCE attempts. Evaluating on imbalanced data yields performance estimates that account for the actual prevalence of each attack type, while training on the balanced set helps the models learn discriminative patterns for minority families under realistic deployment scenarios.

5 Results and Findings

This section provides an overview of the performance of the seven supervised models trained on the CIC-IoT 2023 dataset. Their task was to detect the type of intrusion by classifying traffic into seven attack families and a benign activity category. Because the original dataset was highly imbalanced, the training set was balanced by random undersampling so that each attack type had the same number of samples, while the models were evaluated on the original imbalanced validation and test sets to reflect how rare some attacks are in the real world. The goal of this project was to test whether a fast, lightweight LightGBM model would outperform the others, with the null hypothesis being that there is no difference in performance between the models.

After testing, the ensemble models (LightGBM, XGBoost, CatBoost, and Random Forest) were the top performers. As shown in Table 1, LightGBM achieved the highest test accuracy (78%) and macro F1 (63.3%), with the other ensemble models following closely behind. The LinearSVM, AdaBoost, and Logistic Regression models did not perform as well, especially on the rarer attack types. On benign traffic and the common attack categories (DDOS, DOS, and MIRAI), the ensemble models achieved high precision and recall. All models struggled with the rare WEB and BRUTE_FORCE attacks, indicating that these remain challenging to detect and may require more samples to improve accuracy. However, the validation and test results were highly similar across all models, suggesting the models did not overfit and generalize well.

5.1 Evaluation of Results

This study evaluated these models to determine whether they would provide useful information and detect intrusions in IoT networks effectively. The evaluation measured accuracy, precision, recall, and F1-score (both macro and weighted). Macro scores treat each attack type equally, while weighted scores give more importance to common attack types. Confusion matrices (Figures 7–13) and classification reports were used to understand how well each model detected different attack types. LightGBM was found to perform slightly better yet still had trouble identifying minority classes.
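The random-undersampling step used to balance the training set can be sketched with pandas. The original code is not shown in the text, so the function name, the DataFrame, and the label column are assumptions for illustration; the idea matches the procedure described above (sample every class down to the size of the smallest one):

```python
import pandas as pd

def undersample_to_smallest(df: pd.DataFrame, label_col: str, seed: int = 42) -> pd.DataFrame:
    """Randomly sample every class down to the size of the smallest class,
    producing a perfectly balanced training set (e.g., 8,764 per class here)."""
    n_min = df[label_col].value_counts().min()
    return (
        df.groupby(label_col)
          .sample(n=n_min, random_state=seed)   # same count drawn from each class
          .reset_index(drop=True)
    )
```

As in the study, such a step would be applied only to the training split; the validation and test splits keep their original imbalanced distributions.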
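The difference between macro and weighted averaging can be seen on a toy example with scikit-learn's `f1_score`; the labels below are illustrative, not the study's data. A model that misses the single rare sample is penalized heavily under macro averaging but barely under weighted averaging:

```python
from sklearn.metrics import f1_score

# Toy example: four majority-class samples plus one rare sample the model misses.
y_true = ["DDOS", "DDOS", "DDOS", "DDOS", "BRUTE_FORCE"]
y_pred = ["DDOS", "DDOS", "DDOS", "DDOS", "DDOS"]

# Per-class F1: DDOS = 8/9 ≈ 0.889, BRUTE_FORCE = 0.
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)        # (8/9 + 0) / 2 ≈ 0.444
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)  # (4/5) * 8/9 ≈ 0.711
```

This is why macro F1 is the more informative score for rare families such as WEB and BRUTE_FORCE, while weighted F1 largely tracks performance on the dominant DDOS traffic.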