M.S. Applied Data Science - Capstone Chronicles 2025
Machine Learning for IoT Intrusion Detection: A Realistic Evaluation on the CIC-IoT2023 Dataset
Graham Ward
Applied Data Science Master’s Program
Shiley Marcos School of Engineering / University of San Diego
grahamward@sandiego.edu

Anahit Shekikyan
Applied Data Science Master’s Program
Shiley Marcos School of Engineering / University of San Diego
ashekikyan@sandiego.edu

Gerard Corrales Fernandez
Applied Data Science Master’s Program
Shiley Marcos School of Engineering / University of San Diego
gcorralesfernandez@sandiego.edu
ABSTRACT
Internet of Things (IoT) networks are expanding rapidly, increasing the attack surface and putting pressure on security teams to detect malicious traffic in real time. Traditional rule-based intrusion detection systems struggle to keep pace with evolving threats, often producing high false-positive rates and operational inefficiencies. This capstone project evaluates seven supervised machine learning models for multiclass IoT attack detection using the CIC-IoT2023 dataset, a large-scale benchmark with over 21 million network flow records from 105 IoT devices, covering 33 attacks across seven families plus benign traffic. The dataset exhibits extreme class imbalance, with distributed denial-of-service attacks dominating and some families appearing only rarely. A cleaned and deduplicated version of the data was used to construct a balanced training set via majority undersampling, while the validation and test sets preserved realistic imbalanced distributions. Models were compared using five-fold stratified cross-validation and hyperparameter optimization, focusing on accuracy and on macro and weighted F1-scores. Tree-based ensemble models consistently outperformed linear baselines, and a tuned LightGBM model achieved the strongest performance on the imbalanced test set (Accuracy = 0.7799; macro F1 = 0.6329; weighted F1 = 0.7963; micro-average ROC-AUC = 0.9838; macro-average ROC-AUC = 0.9771), performing especially well on common families such as MIRAI and DDoS while still struggling with rare attack types. Multiclass ROC analysis and SHAP feature importance showed that packet rate and TCP-related features are key predictive signals. Overall, the findings indicate that LightGBM offers a practical, efficient, and interpretable baseline for anomaly-based IoT network security, while highlighting the need for improved handling of minority attack classes.

KEYWORDS
Internet of Things (IoT), intrusion detection system (IDS), Light Gradient Boosting Machine (LightGBM), CIC-IoT2023, network security, unbalanced multiclass classification

1 Introduction
The digital landscape has changed rapidly as the number of Internet of Things (IoT) devices continues to grow. By 2030, an estimated 39 billion devices are expected to be connected to the internet (Sinha, 2024). Across industries, countries, and companies, billions of endpoints transmit data continuously, and the nature of that traffic is constantly evolving. While this unprecedented level of connectivity has brought many benefits for communication and technological growth, it also presents a greatly expanded attack surface for cybercriminals and other malicious actors who intend to exploit this data. Cybercriminals leverage the interconnectedness of these devices to conduct attacks, which can range from financial crimes, unauthorized surveillance, theft of personally identifiable