M.S. Applied Data Science - Capstone Chronicles 2025


classes, but these misclassifications are driven by the decision threshold in an imbalanced setting rather than by an inability to distinguish the classes at the probability level.

The model interpretability results further strengthen the case for LightGBM as a practical IDS candidate. Global SHAP analysis identified packet rate and TCP-related features as the most influential in driving predictions across all classes and families. Class-level SHAP visualizations revealed that different attack families emphasize different aspects of these features. These insights give network defenders more than “black box” scores: they provide concrete, inspectable signals that map to attack behavior, which builds trust and supports incident investigations.

Overall, this study found that a tuned LightGBM model trained on a balanced subset and evaluated on imbalanced traffic offers a strong balance of accuracy, minority-class sensitivity, computational efficiency, and interpretability. When properly tuned, the LightGBM model is well suited as a core classifier in IDS pipelines for large-scale IoT environments.

6.2 Recommended Next Steps

While this work demonstrates that LightGBM is an effective and deployable option for IoT intrusion detection on the CIC-IoT2023 dataset, several directions could improve its performance and operational usefulness.

The first is to perform threshold tuning and calibration for minority classes and families. The ROC and AUC results revealed that LightGBM ranked the minority classes well despite their lower F1 scores. The natural next step is to move beyond the uniform 0.5 decision threshold and construct class-specific thresholds or calibrated class probabilities.

A second direction for further study is to implement a different strategy to handle the

presence of extreme class imbalance. In this study, random undersampling successfully produced a balanced training set and, with five-fold cross-validation, prevented overfitting. A limitation is that it discarded a significant number of majority-class samples. Future studies should compare this approach with techniques such as the Synthetic Minority Oversampling Technique (SMOTE), other synthetic oversampling methods, and class-weighted loss functions. These approaches may better leverage the full dataset while further improving minority-class generalization.

A final suggestion is to validate the results of this study on different cybersecurity datasets. The models in this study were evaluated using only the CIC-IoT2023 dataset and may therefore perform differently on other research datasets. It will be important to track the performance of the LightGBM model across different security settings to establish whether it generalizes across datasets.

Taken together, these suggestions point toward the tuning, extensions, and interpretability techniques needed to make the LightGBM model operational in real-world IDS deployments. These findings establish a strong baseline and a clear path for future researchers seeking to build practical, interpretable, and computationally efficient machine learning models for cybersecurity, with the goal of protecting IoT networks.

ACKNOWLEDGMENTS

We would like to acknowledge and express our gratitude to the Canadian Institute for Cybersecurity for providing access to the CIC-IoT2023 dataset and for their continued work in developing datasets that reflect real-world cybersecurity conditions. Their published research helped establish the foundation upon which this project was built. We also extend our appreciation to the faculty of the
