M.S. Applied Data Science - Capstone Chronicles 2025


classes in the dataset. However, the results among the top performers, XGBoost, CatBoost, and Random Forest, showed no significant difference. Since all models detected common attacks but struggled with rare ones, additional steps such as adjusting decision thresholds or adding anomaly-detection components are recommended to improve rare-attack detection. The differences between validation and test scores were small, indicating the tuned models generalize well and the training strategy was appropriate for this dataset.

5.1.1 Multiclass AUC-ROC and Error Tradeoff

While F1 scores provide insight into threshold-based classification performance, they alone do not reflect how well the LightGBM model ranks attack classes before a decision threshold is applied. To evaluate separability and probabilistic discrimination, one-vs-rest multiclass ROC curves for the tuned LightGBM model were generated using the stratified imbalanced validation set (Figure 14). This ROC analysis examines the sensitivity-specificity tradeoff across all thresholds, which captures a more complete view of class separability, especially for minority families where F1 is suppressed by class imbalance.

The results demonstrated strong ranking capability across all eight attack families. Of note, MIRAI attacks achieved perfect separation (AUC = 1.000), indicating the tuned LightGBM model nearly always assigned a higher attack probability to MIRAI samples than to non-MIRAI samples. The Benign, Recon, Spoofing, Web, and Brute Force families also exhibited near-perfect discrimination, with AUC values between 0.990 and 0.998, despite the lower F1 scores for minority attacks. The takeaway is that the LightGBM model successfully learned meaningful signals for these minority classes; misclassification occurs when the default probability threshold is applied in this highly imbalanced environment. This indicates that threshold tuning may substantially improve recall for the minority families. Two of the attack families, DDOS and DOS, showed weaker but still strong separability, with AUC scores of 0.940 and 0.909, respectively. These families are behaviorally similar, both involving high-volume packet floods, which makes their boundary distinctions more sensitive. Overall, the tuned LightGBM model achieved a 0.984 micro-average AUC and a 0.977 macro-average AUC, confirming strong ranking performance even when class sizes are highly imbalanced.

5.2 Feature Importance - SHAP Analysis

To establish interpretability of the tuned LightGBM attack-detection model, the team evaluated feature importances using Shapley Additive Explanations (SHAP). SHAP values quantify the degree to which each feature increases or decreases the predicted probability of a specific attack family. This feature importance analysis enables us to understand the tuned LightGBM model’s internal decision structure beyond the performance metrics reported in the summary results table across all models.

5.2.1 Global Feature Importance

Global feature importance was assessed using the mean absolute SHAP contribution across the imbalanced validation set. Figure 15 highlights the top seven predictive features calculated from the tuned LightGBM model. Of note, the ‘Rate’ feature was the strongest overall driver of predictions across all classes, with high packet-rate activity indicating increased attack classification probability. Additionally, features representative of Transmission Control Protocol (TCP) traffic dominated the model’s reasoning, providing insight into session establishment, reset behavior, and push/acknowledge sequencing
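The one-vs-rest ROC analysis described in Section 5.1.1 can be sketched as follows. This is a minimal, self-contained illustration on synthetic imbalanced data, using scikit-learn's GradientBoostingClassifier as a stand-in for the tuned LightGBM model; the three-class setup, class weights, and model choice are illustrative assumptions, not the project's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for LightGBM
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Imbalanced 3-class toy problem (stand-in for the eight attack families).
X, y = make_classification(
    n_samples=2000, n_classes=3, n_informative=6,
    weights=[0.85, 0.10, 0.05], random_state=42,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)

# One-vs-rest: each class is scored as "this class" vs. "all others".
per_class_auc = {
    c: roc_auc_score((y_val == c).astype(int), proba[:, c])
    for c in range(proba.shape[1])
}

# Micro/macro averages over the binarized label matrix.
y_bin = label_binarize(y_val, classes=[0, 1, 2])
macro_auc = roc_auc_score(y_bin, proba, average="macro")
micro_auc = roc_auc_score(y_bin, proba, average="micro")
print(per_class_auc, macro_auc, micro_auc)
```

The macro average weights every class equally, so it is the more honest summary under heavy imbalance; the micro average pools all decisions and is dominated by the majority class.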
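The threshold tuning suggested in Section 5.1.1 for minority-family recall can be illustrated with a per-class decision rule: instead of always taking the argmax over predicted probabilities, a sample is routed to a rare class whenever that class's probability clears a lowered cutoff. A minimal sketch on synthetic imbalanced data; the MINORITY index and the 0.20 cutoff are hypothetical values for illustration, and in practice the cutoff would be tuned on the validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for LightGBM
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_classes=3, n_informative=6,
    weights=[0.85, 0.10, 0.05], random_state=0,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)
proba = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_val)

MINORITY = 2      # hypothetical rare attack family
THRESHOLD = 0.20  # lowered cutoff, below the implicit argmax bar

# Default decision: highest predicted probability wins.
default_pred = proba.argmax(axis=1)
# Tuned decision: assign the minority class whenever its probability
# clears the lowered threshold; otherwise fall back to the argmax.
tuned_pred = np.where(proba[:, MINORITY] >= THRESHOLD, MINORITY, default_pred)

rec_default = recall_score(y_val, default_pred, labels=[MINORITY], average=None)[0]
rec_tuned = recall_score(y_val, tuned_pred, labels=[MINORITY], average=None)[0]
print(rec_default, rec_tuned)
```

Lowering the cutoff can only add samples to the minority prediction set, so minority recall never decreases; the cost is potential precision loss, which is why the cutoff is chosen on validation data.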
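The mean-absolute-SHAP aggregation behind Figure 15 reduces per-sample, per-feature SHAP values to a single global importance score per feature. The sketch below shows only that aggregation step on randomly generated stand-in values; in practice the shap_values array would come from applying shap.TreeExplainer to the tuned LightGBM model on the validation set, and the feature names other than 'Rate' are hypothetical TCP-style names for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical feature names echoing the TCP-centric features discussed above.
features = ["Rate", "syn_flag_number", "rst_flag_number",
            "psh_flag_number", "ack_flag_number", "Duration", "Header_Length"]

# Stand-in for the (n_samples, n_features) signed SHAP contributions that
# shap.TreeExplainer would return for one attack class.
shap_values = rng.normal(size=(500, len(features)))
shap_values[:, 0] *= 3.0  # make 'Rate' dominate, mirroring Figure 15

# Global importance = mean absolute SHAP contribution per feature.
global_importance = np.abs(shap_values).mean(axis=0)
ranking = [features[i] for i in np.argsort(global_importance)[::-1]]
print(ranking)
```

Taking the absolute value before averaging matters: a feature that pushes some predictions up and others down would otherwise cancel to near zero despite being highly influential.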
