M.S. Applied Data Science - Capstone Chronicles 2025
Logistic Regression, Support Vector Machines (SVM), Decision Trees, and AdaBoost within the 33-attack class experiment.
3. Evaluate model performance, focusing on accuracy, precision, and F1-Score as our primary metrics.
4. Interpret our model outputs through feature importance (SHAP) analysis to identify the most influential indicators of malicious network traffic, an analysis that has not been documented in previous literature.
5. Explore potential integration paths for implementing LightGBM models within IDS frameworks to improve operational response and resource allocation for cybersecurity analysts.

This project aims to contribute to the growing body of literature on cybersecurity and IoT networks by testing the LightGBM model against previous benchmarks as a lightweight, scalable, and interpretable algorithm for identifying malicious network traffic. The findings will serve as a benchmark for the LightGBM algorithm and provide practical insights into real-world deployment and integration within corporate cybersecurity infrastructures.

3 Literature Review

3.1 A Real Time Dataset and Benchmark for Large-Scale Attacks in IoT Environment

Neto et al. (2023) developed the CIC-IoT2023 dataset in this study, which stands as a large-scale benchmark intrusion detection dataset for Internet of Things (IoT) network traffic. The study was designed to address the limitations of previous literature by constructing a realistic smart home topology, which consisted of 105 IoT
devices, and executing 33 unique malicious cyber-attacks spread across seven overarching attack categories: Distributed Denial of Service (DDoS), Denial of Service (DoS), Reconnaissance, Web-Based, Spoofing, Brute Force, and Mirai botnets. The attacks were launched from compromised devices within the secured network and targeted other devices on the same network. The result is a realistic representation of malicious network traffic that can be used to develop machine learning models capable of detecting malicious activity. The authors captured the network traffic using Wireshark, a packet capture tool, stored it as raw packet capture (.pcap) files, and then aggregated it into comma-separated value (.csv) files of feature-level, machine-learning-ready data for quick and efficient analysis. The dataset contained 63 CSV files with 41 statistical and protocol-level features and 41 million observations, totaling roughly 9 GB of data.

During benchmark analysis, the models evaluated were Logistic Regression, Perceptron, AdaBoost, Random Forest, and Deep Neural Network classifiers. The authors explored these models across three experimental set-ups: binary classification (malicious vs. benign), 8-category classification (seven attack categories plus benign), and a detailed 34-class classification (the 33 individual attacks plus benign network traffic). The authors found that the Random Forest model performed best across all three experiments, producing 99.7% accuracy, 96.5% precision, and an F1-Score of 0.97 in the binary classification experiment; 99.4% accuracy, 70.5% precision, and an F1-Score of 0.72 in the 8-category classification experiment; and finally, 99.2% accuracy, 70.4% precision, along