M.S. Applied Data Science - Capstone Chronicles 2025


DDOS-SYNONYMOUSIP_FLOOD, DOS-SYN_FLOOD, and DOS-TCP_FLOOD, with BENIGN sitting mid-pack. The long tail contains specialized events (e.g., VULNERABILITYSCAN, MITM-ARPSPOOFING, DNS_SPOOFING, RECON variants) and very rare web attacks (COMMANDINJECTION, SQLINJECTION, XSS, UPLOADING_ATTACK). Overall, the target is strongly imbalanced, with a few head classes and many minority classes.

As Figure 3 shows, grouping the fine-grained labels into families softens the extreme tail, but the imbalance remains clear. DDOS dominates by a wide margin, with DOS and MIRAI forming the next tier. BENIGN and reconnaissance/spoofing appear far less often, and web or brute-force categories are rare. This view suggests reporting results at both levels: individual labels for detail and families for a steadier, higher-level signal.

The binary target (attack vs. benign) is strongly skewed toward attacks, as seen in Figure 4. Because the positive class dominates, plain accuracy will overstate performance; precision, recall, F1, and PR-AUC offer a more faithful view of model quality, especially under class imbalance. Training and validation should incorporate stratified sampling and/or class weighting, with cost-sensitive metrics considered where appropriate.

The boxplots by class in Figure 5 highlight systematic differences in numeric features. Rate and Tot_sum show higher medians and longer upper tails for the Malicious class, consistent with bursty, high-volume attack activity. In contrast, Max, Std, and Tot_size run higher for Benign traffic, indicating fewer but larger packets and greater per-flow variability under normal conditions. Variance also skews higher for Benign, whereas Malicious flows appear more uniform. Overall, these patterns point to complementary signals: attack flows emphasize intensity and accumulation (high Rate/Tot_sum) with smaller packet sizes, while benign flows feature larger packets and broader dispersion. For modeling, log scaling is advisable for skewed features, and combining intensity features (Rate, Tot_sum) with size/variability features (Max, Tot_size, Std/Variance) should strengthen class separation.

The correlation heatmap shown in Figure 6 was used to examine potential multicollinearity and to understand relationships between features prior to modeling. Overall, most variables showed weak linear correlations, suggesting that the dataset contains many largely independent variables capturing different aspects of IoT traffic. Some columns were positively correlated, particularly “syn_flag_number”, “ack_flag_number”, and “fin_flag_number”, which is expected given that they all derive from the TCP header. Traffic-volume metrics such as “Tot_sum”, “Max”, “AVG”, and “IAT” appear to share underlying mechanisms, which may introduce redundant information. Negative correlations appeared when comparing TCP to UDP or ICMP protocols; these arise because only one of the protocols can be in use at a time, not because they influence each other.

4.2 Data Quality

Ensuring data quality was a critical step for model development in this project. Because the CIC-IoT 2023 dataset was produced by capturing all network traffic with Wireshark and then extracting features, a structured validation process was necessary to ensure the consistency, completeness, and validity of the data before modeling. All data quality checks were performed in the Data_Preparation notebook using Polars, which improved RAM efficiency and enabled fast column-wise profiling and anomaly detection. During data aggregation, once all 63 CSV files were combined into a unified dataset, it contained around 45 million observations and 40 features, and each column was verified for
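The exploratory analysis earlier in this section notes that plain accuracy overstates performance on the attack-heavy binary target and recommends class weighting. A minimal sketch of both points follows, in plain Python with synthetic labels. The 95/5 split, the function names, and the "balanced" weighting heuristic (n_samples / (n_classes * class_count)) are illustrative assumptions, not the counts or the method used in this project.

```python
from collections import Counter

# Synthetic, illustrative labels (NOT the CIC-IoT 2023 counts): 95 attack, 5 benign.
y_true = ["attack"] * 95 + ["benign"] * 5
# A trivial baseline that always predicts the majority class.
y_pred = ["attack"] * len(y_true)


def accuracy(truth, pred):
    """Share of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def recall(truth, pred, cls):
    """Share of true `cls` samples that the model labeled as `cls`."""
    hits = sum(t == cls and p == cls for t, p in zip(truth, pred))
    total = sum(t == cls for t in truth)
    return hits / total if total else 0.0


def balanced_class_weights(labels):
    """Assumed heuristic: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


acc = accuracy(y_true, y_pred)
benign_recall = recall(y_true, y_pred, "benign")
weights = balanced_class_weights(y_true)

print(f"accuracy={acc:.2f}, benign recall={benign_recall:.2f}")
print(weights)
```

Under these assumed counts, the always-attack baseline reaches an accuracy of 0.95 while its recall on the benign class is 0.0, which is the kind of gap that per-class metrics expose. The weighting heuristic assigns the rare benign class a weight of 10.0 against roughly 0.53 for the attack class, so errors on the minority class count more during training.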
