M.S. Applied Data Science - Capstone Chronicles 2025
6. Brute Force Attacks - The systematic guessing of passwords and/or credentials through sheer computational power or time.
7. Mirai Botnet Attacks - The compromise of IoT devices to create large-scale botnets that a host or Command and Control (C2) server can control to automate commands across a range of devices.

The CIC-IoT dataset serves as a critical benchmarking dataset for researchers developing ML and DL algorithms, as it reflects a real-world, large-scale IoT network with a wide array of attacks, supporting the development of realistic models to be applied specifically within an organization's IDS.

2.1 Problem Identification and Motivation

Despite the use of anomaly-based IDS frameworks to identify malicious network traffic through ML and DL algorithms, corporations continue to face inefficiencies stemming from high false-positive rates, computational overhead, and a frequent lack of interpretability regarding the features that drive detection outcomes. These issues place significant strain on security operations centers (SOCs), where human analysts manually triage false alerts, limiting the time and resources available to address malicious threats in real time. As network traffic continues to grow, even the smallest inefficiencies can result in substantial delays in response times, unnecessary hours spent triaging false positives, and increased operational costs for the corporations attempting to safeguard their networks. The challenge, then, is to develop algorithms that are not only accurate but also scalable, interpretable, and efficient with limited computational resources.

This study is motivated by the need for lightweight, interpretable and computationally
efficient algorithms capable of handling real-world, large-scale, and diverse IoT network infrastructures. The authors propose that Light Gradient Boosting Machine (LightGBM) algorithms provide a promising solution through their memory-efficient handling of large datasets with minimal resource consumption, automatic feature selection, use of histogram-based optimization, and leaf-wise tree growth (M. Mohtasim Hossain, 2024). By applying LightGBM to the CIC-IoT2023 Dataset, this research aims to determine whether it is possible to enhance intrusion detection accuracy while simultaneously reducing false positives and computational costs, and to give insight into the features most predictive of malicious cyberthreats within IoT networks.

2.2 Definition of Objectives

The goal of this project is to develop and evaluate machine-learning-based intrusion detection models on the real-world, large-scale CIC-IoT2023 Dataset, focusing on the performance of the LightGBM algorithm in comparison to previously benchmarked models. The aim is to improve detection accuracy while simultaneously minimizing the false-positive rate and demonstrating computational efficiency and model interpretability (something which DL models inherently lack) for integration into anomaly-based IDS frameworks. Our framework for conducting this study will follow these objectives:

1. Preprocess and perform exploratory data analysis (EDA) on the CIC-IoT2023 Dataset. This dataset consists of 63 comma-separated values (.csv) files totaling roughly 9 gigabytes (GB) of data. Our team will leverage polars for efficient and scalable preprocessing.
2. Develop supervised ML models including LightGBM along with previously benchmarked models, Random Forest,