M.S. Applied Data Science - Capstone Chronicles 2025


ensured each model operated on an identical feature space, and all transformations were fitted only on the training data. Given the class imbalance in the original dataset, the modeling strategy distinguished between the data used to fit model parameters and the data used to approximate real-world performance. The initial stratified split yielded a 70% training subset and two 15% holdout subsets (validation and test), each preserving the observed class distribution. Within the 70% training portion, a balanced training set was created via random undersampling: all flows belonging to each family were grouped, and the number of observations per class was reduced to the size of the smallest class. This yielded a balanced training subset (X_train_bal, y_train_bal) used for model fitting and cross-validation, while the validation and test sets remained imbalanced to reflect operational conditions.

Across all candidate models, a stratified 5-fold cross-validation scheme (StratifiedKFold) on the balanced training data was used to estimate generalization performance and guide hyperparameter tuning. The macro-averaged F1 score served as the primary selection metric, supplemented by accuracy, weighted F1, and macro- and weighted-averaged precision and recall, to ensure that performance on minority families was not obscured by the majority classes.

4.4.1 Selection of modeling techniques

Because the task involves multiclass classification with imbalanced labels and non-linear relationships among network-flow features, the modeling strategy relied on a combination of tree-based ensembles and regularized linear classifiers. Boosting frameworks such as LightGBM, XGBoost, CatBoost, and AdaBoost were selected to capture complex feature interactions and provide strong performance on tabular data, while Random Forest served as a bagging-based

baseline. In addition, multinomial logistic regression and a linear support vector machine were included as simpler, more interpretable baselines that are attractive for deployment when computational constraints are strict.

4.4.2 Model Configuration and Development

4.4.2.1 Feature preprocessing pipeline

All models required preprocessing to transform the raw features into appropriate formats. A standardized preprocessing pipeline was implemented using scikit-learn's ColumnTransformer, which applies different transformations to numerical and categorical features. The feature set consisted of 37 numerical variables, including metrics such as header length, time-to-live, packet rate, TCP flag counts, and binary protocol indicators, plus one categorical variable (Protocol_Type).

The preprocessing approach varied by algorithm type. For tree-based models (LightGBM, Random Forest, CatBoost, XGBoost, and AdaBoost), numerical features were passed through without scaling, because decision trees do not depend on the absolute scale of the input variables. For these algorithms, only the categorical Protocol_Type variable required transformation via one-hot encoding, which creates a separate binary feature for each protocol category. In contrast, linear models (logistic regression and linear SVM) are sensitive to feature scales, so RobustScaler was applied to the numerical features before one-hot encoding the categorical variable. RobustScaler was chosen over StandardScaler because it is less affected by outliers, which are common in network traffic data.

4.4.2.2 LightGBM Model

LightGBM, a gradient boosting framework designed for speed and efficiency on large datasets, was used as the primary ensemble
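The splitting, balancing, and cross-validation procedure described in Section 4.4 can be sketched as follows. This is a minimal illustration on synthetic data: the feature matrix, class proportions, and the logistic-regression stand-in are assumptions, not the study's actual dataset or tuned models.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# Synthetic imbalanced multiclass data standing in for the flow features.
X = rng.normal(size=(1000, 5))
y = rng.choice([0, 1, 2], size=1000, p=[0.7, 0.2, 0.1])

# Stratified 70% / 15% / 15% split (train / validation / test).
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

# Random undersampling within the training portion only:
# cap every class at the size of the smallest class.
min_count = np.bincount(y_train).min()
idx = np.concatenate([
    rng.choice(np.where(y_train == c)[0], size=min_count, replace=False)
    for c in np.unique(y_train)])
X_train_bal, y_train_bal = X_train[idx], y_train[idx]

# Stratified 5-fold CV on the balanced subset, scored with macro-F1.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_train_bal, y_train_bal, cv=cv, scoring="f1_macro")
```

The validation and test sets are never undersampled, so the metrics computed on them reflect the imbalanced operational class distribution.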
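The two preprocessing pipelines of Section 4.4.2.1 can be sketched with ColumnTransformer as below. The column names are illustrative stand-ins (only three of the 37 numerical features are shown), though Protocol_Type is taken from the text.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Stand-in column names; the real feature set has 37 numerical variables.
num_cols = ["Header_Length", "Time_To_Live", "Rate"]
cat_cols = ["Protocol_Type"]

# Tree-based models: numerical features pass through unscaled.
tree_prep = ColumnTransformer([
    ("num", "passthrough", num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)])

# Linear models: robust scaling first, since they are scale-sensitive.
linear_prep = ColumnTransformer([
    ("num", RobustScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)])

df = pd.DataFrame({
    "Header_Length": [20, 40, 20],
    "Time_To_Live": [64, 128, 255],
    "Rate": [1.5, 300.0, 0.2],
    "Protocol_Type": ["TCP", "UDP", "ICMP"]})
Xt = tree_prep.fit_transform(df)    # 3 numeric columns + 3 one-hot columns
Xl = linear_prep.fit_transform(df)
```

Because RobustScaler centers on the median and scales by the interquartile range, the extreme Rate value of 300.0 influences the linear models' inputs far less than it would under StandardScaler.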

