M.S. Applied Data Science - Capstone Chronicles 2025
model. The initial configuration employed 200 decision trees (estimators) with a learning rate of 0.1, allowing the model to make moderate updates at each boosting iteration. Five-fold stratified cross-validation on the balanced training set was used to obtain stable performance estimates and to ensure all attack families were represented in each fold. The baseline LightGBM configuration achieved a mean cross-validation accuracy of 75.72% with a low standard deviation (±0.22%), indicating consistent performance across folds. The macro F1-score, which weights each class equally regardless of its frequency, reached 75.78% (±0.20%). This metric was particularly important because the goal was not only to detect common attacks but also to maintain reasonable performance on minority classes. Weighted metrics, which account for class frequencies, were very similar to the macro scores, suggesting LightGBM provided balanced performance across attack families.

To further refine the model, hyperparameter tuning was conducted using RandomizedSearchCV. This approach randomly samples from predefined parameter distributions, enabling efficient exploration of the hyperparameter space without exhaustively testing all combinations. Fifty candidate configurations were evaluated, varying the number of estimators (150–300), maximum tree depth (10–30), number of leaves per tree (31–255), minimum samples per leaf (20–80), subsample ratio (70–100%), column subsample ratio (70–100%), and learning rate (0.01–0.1). The tuning process identified an improved configuration with 250 estimators, a learning rate of 0.05, a maximum depth of 10, 63 leaves per tree, a minimum of 40 samples per leaf, an 80% subsample ratio, and a 70% column subsample ratio. This model achieved a cross-validation macro F1-score of 76.05%, representing a meaningful gain over the
baseline. The lower learning rate, combined with more estimators, allowed finer-grained updates, while the constraints on depth and leaves helped control model complexity and reduce overfitting. The tuned LightGBM model was then selected as a primary candidate for final evaluation on the held-out validation and test sets.

4.4.2.3 Random Forest Model

Random Forest, an ensemble method that builds multiple independent decision trees and aggregates their predictions, served as a second non-linear baseline. The model was initially configured with 300 trees using Gini impurity as the splitting criterion. Five-fold stratified cross-validation on the balanced training set yielded a mean accuracy of 74.33% (±0.16%) and a macro F1-score of 74.43% (±0.18%), slightly below LightGBM's baseline performance. Hyperparameter tuning for Random Forest was also conducted using RandomizedSearchCV. The search space included tree depth (10, 20, 30, or unlimited), the minimum number of samples required for a split (2, 5, or 10), the minimum samples per leaf (1, 2, or 4), the feature selection strategy (square root, log2, or all features), and whether bootstrap sampling was enabled. The optimization process identified an effective configuration using 500 trees with a maximum depth of 20, a minimum of 2 samples per split and per leaf, square-root feature selection, and bootstrap sampling disabled. This tuned model achieved a cross-validation macro F1-score of 75.35% and was carried forward as a key ensemble candidate for evaluation on the validation and test sets discussed in Section 5.

4.4.2.4 CatBoost Model

CatBoost, a gradient boosting implementation optimized for handling categorical features, was included as an additional ensemble baseline. The initial configuration used 300 iterations with a learning
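The LightGBM workflow described above (a fixed baseline scored with stratified five-fold cross-validation, then a randomized search over the stated parameter distributions) can be sketched as follows. This is a minimal illustration, not the study's code: the synthetic dataset is a stand-in for the balanced training set, and the search is reduced to 10 candidates here, whereas the study evaluated 50.

```python
from lightgbm import LGBMClassifier
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the balanced training set (hypothetical data).
X, y = make_classification(n_samples=500, n_classes=3, n_informative=8, random_state=42)

# Stratified folds keep every class represented in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline: 200 estimators, learning rate 0.1, scored with macro F1.
baseline = LGBMClassifier(n_estimators=200, learning_rate=0.1, random_state=42, verbose=-1)
scores = cross_val_score(baseline, X, y, cv=cv, scoring="f1_macro")
print(f"baseline macro F1: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Parameter distributions matching the ranges reported in the text.
param_dist = {
    "n_estimators": randint(150, 301),      # 150-300
    "max_depth": randint(10, 31),           # 10-30
    "num_leaves": randint(31, 256),         # 31-255
    "min_child_samples": randint(20, 81),   # 20-80
    "subsample": uniform(0.7, 0.3),         # 70-100%
    "colsample_bytree": uniform(0.7, 0.3),  # 70-100%
    "learning_rate": uniform(0.01, 0.09),   # 0.01-0.1
}
search = RandomizedSearchCV(
    baseline, param_dist,
    n_iter=10,            # the study evaluated 50 candidate configurations
    scoring="f1_macro", cv=cv, random_state=42, n_jobs=-1,
)
search.fit(X, y)
print("tuned macro F1:", round(search.best_score_, 4))
```

Using `scoring="f1_macro"` makes the search optimize the same class-balanced metric the study reports, rather than plain accuracy.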
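The Random Forest search space in the text maps directly onto scikit-learn's discrete parameter lists; a minimal sketch is shown below. The candidate estimator counts are an assumption (the text reports 500 trees in the tuned configuration but does not list the counts searched), and the data is again a synthetic placeholder.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the balanced training set (hypothetical data).
X, y = make_classification(n_samples=600, n_classes=3, n_informative=8, random_state=0)

# Search space from the text; n_estimators candidates are assumed.
param_dist = {
    "n_estimators": [100, 300, 500],
    "max_depth": [10, 20, 30, None],         # None = unlimited depth
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],  # sqrt, log2, or all features
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist,
    n_iter=10,                # reduced here for brevity
    scoring="f1_macro",       # macro F1 weights all attack families equally
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```

Because every dimension here is a finite list rather than a continuous distribution, RandomizedSearchCV simply samples parameter combinations uniformly from the grid, which is why a modest `n_iter` can still cover the space reasonably well.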