M.S. Applied Data Science - Capstone Chronicles 2025
datatype consistency (f64, i64, or str). The raw_df.describe() command showed that most of the variables had fewer than ten missing values. A check for null values then made clear that only the “Label” and “Std” features contained null entries. Since the target variable is essential for supervised learning, these rows were dropped using the drop_nulls() function. After many files are combined, the same record can appear more than once. These duplicate observations were removed using the unique() function to avoid repeated information and spurious patterns that exist only because of the duplicates. The is_infinite() function showed that the “Rate” column contained infinite values, which were removed because they can make the model less stable and less reliable. Once the cleaning process was complete, the remaining variables were checked again, confirming that no nulls, duplicates, or infinite values remained. This resulted in a dataset with 21 million rows and 40 columns.

Following these essential data-quality steps helped ensure that the data was complete, consistent, accurate, and valid. Ehrlinger and Wöß (2022) highlighted that completeness and consistency checks are basic requirements for any data pipeline. Zhou et al. (2024) noted that validity and accuracy are critical for achieving reliable machine-learning performance. Ogrizović et al. (2024) added that systematic data-quality validation is essential for ensuring reproducibility and robustness in big-data machine-learning systems. Rahm and Do (2000) describe the removal of missing, duplicate, and anomalous values as a standard best practice for preparing large datasets for analysis.

4.2.1 Data Quality Issues. The main data quality issues identified during the cleaning process were: (1) missing values, which occurred only in the “Label” and “Std” columns, (2) duplicate rows that appeared after concatenating all the files, and (3) infinite values in the “Rate” feature. These problems did not occur very
often, but they could still have introduced some bias in the model. Once these issues were resolved, the dataset met the quality needed for this project. Ehrlinger and Wöß (2022) described resolving these inconsistencies as essential for dataset reliability. Rahm and Do (2000) note that systematic data cleaning is necessary to prevent errors in analytical workflows.

4.4 Modeling

The modeling section focused on learning a multiclass classifier that maps network flows to one of the traffic families defined in the cleaned dataset. Following thorough cleaning and exploration of the network dataset, the modeling phase aimed to build predictive models capable of distinguishing between benign traffic and seven attack families: BRUTE_FORCE, DDOS, DOS, MIRAI, RECON, SPOOFING, and WEB. This section details the systematic procedure used to develop, tune, and evaluate multiple machine learning algorithms for this multiclass classification task.

All modeling was implemented using scikit-learn pipelines to ensure reproducible and leakage-free preprocessing. The cleaned Polars data were converted to a pandas DataFrame, and the target “Label_Family” was encoded to integers using “LabelEncoder”, yielding a consistent mapping across the training, validation, and test sets. Features were partitioned into numeric and categorical subsets: all integer and floating-point network-flow variables were treated as numeric predictors, while the “Protocol_Type” field was treated as a categorical predictor. For tree-based and boosting models, numeric variables were passed through unchanged and the protocol variable was one-hot encoded via a “ColumnTransformer”. For linear models (logistic regression and linear SVM), numeric features were additionally scaled with “RobustScaler” to reduce the influence of outliers, and the categorical field was one-hot encoded in the same way. Wrapping these steps inside scikit-learn “Pipeline” objects
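The preprocessing split described in this section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the toy data, the numeric columns chosen, the protocol values, the "BENIGN" label string, the helper make_preprocessor, and the LogisticRegression settings are all hypothetical. Only the names “Protocol_Type”, “Label_Family”, “Rate”, and “Std” and the scikit-learn classes come from the text.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, RobustScaler

# Toy stand-in for the cleaned flows (all values are made up for illustration).
flows = pd.DataFrame({
    "Rate": [0.5, 120.0, 3.2, 950.0, 1.1, 640.0],
    "Std": [0.1, 4.0, 0.3, 9.5, 0.2, 7.1],
    "Protocol_Type": ["TCP", "UDP", "TCP", "UDP", "ICMP", "UDP"],
    "Label_Family": ["BENIGN", "DDOS", "BENIGN", "DDOS", "RECON", "DDOS"],
})

numeric_cols = ["Rate", "Std"]
categorical_cols = ["Protocol_Type"]

# Encode the target once so that every split shares the same integer mapping.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(flows["Label_Family"])
X = flows[numeric_cols + categorical_cols]


def make_preprocessor(scale_numeric: bool) -> ColumnTransformer:
    """Hypothetical helper: build the column-wise preprocessing step.

    Numeric columns are robust-scaled for linear models and passed through
    unchanged for tree-based models; the protocol column is one-hot encoded.
    """
    numeric_step = RobustScaler() if scale_numeric else "passthrough"
    return ColumnTransformer([
        ("num", numeric_step, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])


# Variant for tree-based and boosting models (no scaling).
tree_preprocessor = make_preprocessor(scale_numeric=False)

# Variant for linear models: scaling and encoding are fitted inside the
# pipeline, so they only ever see the data passed to fit().
linear_pipeline = Pipeline([
    ("preprocess", make_preprocessor(scale_numeric=True)),
    ("model", LogisticRegression(max_iter=1000)),
])

linear_pipeline.fit(X, y)
predicted = label_encoder.inverse_transform(linear_pipeline.predict(X))
```

In a real workflow the pipeline would be fitted on the training split only and then applied to the validation and test splits, which is what keeps the scaler and encoder statistics from leaking across splits.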