M.S. Applied Data Science - Capstone Chronicles 2025


redundancy among highly correlated variables (|r| > 0.95). The subsequent methodological steps build supervised classification models that assign each attack to one of the eight categories. Models are evaluated on accuracy, precision, recall, and F1-score, complemented by ROC-AUC and confusion matrices. This study aims to demonstrate that gradient-boosting models can be more effective at detecting potential intrusions within IoT networks.

4.1 Data Acquisition and Aggregation

The analysis began with the previously introduced CIC-IoT 2023 dataset released by the Canadian Institute for Cybersecurity (Neto et al., 2023). The dataset was downloaded and stored locally under the project directory ../Data/Raw/. It comprises 63 CSV files with pre-engineered features. At this stage, no manual relabeling or feature reconstruction was needed. Instead, the study used the features already extracted by the CIC traffic analyzer so that the results remain comparable with the original benchmark.

Data ingestion was performed in Python 3.10.11 using the Polars library in the Data_Preparation notebook. All 63 CSV files followed the naming pattern “merged*.csv”; they were lazily scanned with pl.scan_csv(), concatenated with pl.concat(), and materialized in memory with collect(). This approach, known as lazy evaluation (Mozzillo et al., 2023), defers computation until results are needed, which reduces memory usage and speeds up execution on the available local machine (Intel i9-8950HK CPU, 32 GB RAM). Columnar, vectorized query execution, as implemented in libraries such as Polars, has been shown to improve performance and scalability for analytical tasks (Zeng et al., 2023). The resulting merged dataset contained about 45 million observations and 40 columns.

To improve efficiency in later data operations, the entire dataset was stored in the Apache Parquet format (../Data/Raw/Raw_Dataset) using the write_parquet() function. Parquet was chosen because columnar formats are especially beneficial for iterative and large-scale data-science or ML workflows (Liao et al., 2024). According to Zeng et al. (2023), columnar formats provide better compression, fewer I/O operations, and faster performance in repeated read/write cycles. This Parquet file was used in all subsequent data-cleaning, feature-selection, and modeling tasks in this project.

4.1.1 Exploratory Data Analysis

A 200,000-row sample was used to avoid memory issues when creating graphs. Using this sample, the EDA presents distribution plots to profile key variables, class-balance visuals for the target, univariate summaries of numeric features with skew-aware transformations, and a correlation heatmap to screen for multicollinearity.

The protocol histogram in Figure 1 shows traffic concentrated in a small number of protocol IDs: ID 6 (typically TCP) dominates, ID 17 (UDP) forms a secondary tier, and several protocols appear rarely. Uneven protocol usage may affect downstream feature importance and model decision boundaries.

The class histogram in Figure 2 shows a clear head-tail pattern. High-volume flooding attacks (DDOS-UDP_FLOOD, DDOS-ICMP_FLOOD, DOS-UDP_FLOOD, DDOS-SYN_FLOOD, DDOS_PSHACK_FLOOD, DDOS-TCP_FLOOD) dominate the dataset, each near or above the million-record range. Counts then decline steadily through categories such as DDOS-RSTFINFLOOD,

