M.S. Applied Data Science - Capstone Chronicles 2025

3.6 Temporal-Spatial Feature Extraction in IoT-Based SCADA System Security: Hybrid CNN-LSTM and Attention-Based Architectures for Malware Classification and Attack Detection

Multi-stage attacks in IoT/SCADA environments unfold over time, so effective detectors benefit from models that capture both local flow structure and longer temporal dependencies (Jony & Arnob, 2024; Neto et al., 2023). A hybrid extractor stacks CNN blocks to learn local flow motifs and LSTM layers to capture order and duration. A lightweight attention head highlights the time steps and channels that most influence a decision, improving both accuracy and transparency (Kohli & Chhabra, 2025). Using fixed windows of flow features (bytes/packets, inter-arrival statistics, duration, flags, service indicators), the model outputs attack probabilities together with attention weights. In SCADA-like traffic, attention frequently centers on phase changes such as unexpected command sequences and abnormal polling. Reported results include accuracy, macro-F1, per-class precision/recall, confusion matrices for critical families (e.g., DDoS, brute force, spoofing), and per-window latency.

On CICIoT2023, temporal models achieve strong baselines, with an accuracy of approximately 0.9875 and an F1 score of about 0.9859. This supports a progression from a CNN baseline to a combined CNN-LSTM architecture and allows analysis of whether attention or semantics add value beyond sequence modeling (Jony & Arnob, 2024). The benchmark's breadth (33 attacks across seven classes on 105 devices) provides a rigorous setting for the per-class precision/recall, macro-F1, latency, and confusion-matrix analysis central to publishable IDS evaluations (Neto et al., 2023; Kohli & Chhabra, 2025).

4 Methodology

The large CICIoT2023 dataset created by Neto et al. (2023) was chosen for its realism, its variety of 33 different attacks, and its coverage of 105 devices. The Polars library was used for performance and data manipulation; each CSV file was imported as a LazyFrame and merged into a unified dataset. The 33 distinct attacks were then grouped into 7 categories plus benign traffic; the combined data contained more than 40 million observations collected from 105 IoT devices.

This project builds upon the original work of Neto et al. (2023) by introducing machine learning (ML) algorithms not considered in the original study: Light Gradient Boosting Machine (LightGBM), CatBoost, and AdaBoost. Their performance will be compared with that of previously used ML models such as Random Forest, Decision Tree, Logistic Regression, XGBoost, and Multi-Layer Perceptron. Deep-learning architectures (CNN, LSTM, and RNN) will also be considered. The objective is to increase detection accuracy and minimize false positives.

All preprocessing was done with the Polars library, which offers high-performance, memory-efficient, and parallelized operations. Since the original dataset was split into 63 CSV files, each file was imported as a LazyFrame, the lazily evaluated Polars interface suited to large datasets, and merged into a single dataset, which was then saved as Parquet, a columnar file format that supports fast reads. The team used a GitHub repository, dividing the code for this project into three main Jupyter notebooks (Data_Exploration, Data_Preparation, and Modeling) as specified in the project guidelines.

The first steps in data preparation were initial exploration, cleansing, feature selection, and transformation to prepare the data for modeling. After null and infinite values were dropped and duplicate records were removed, the dataset was left with 21 million valid instances.

When analyzing feature importance, two different approaches were used to identify the most relevant columns for modeling.
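
As an illustration of correlation-based screening, the sketch below flags pairs of columns whose absolute Pearson correlation exceeds a cutoff. It is a minimal example under stated assumptions, not the project's code: the text does not say which correlation measure or cutoff was used, so the 0.9 threshold, the function names, and the toy column names and values are all hypothetical. Plain Python is used here only to keep the example self-contained; the project itself worked in Polars.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, treat as 0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for column pairs with |r| >= threshold.

    `threshold` is an assumed cutoff, not a value taken from the paper.
    """
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Toy flow features with made-up values (not taken from CICIoT2023).
flows = {
    "total_bytes": [100, 200, 300, 400, 500],
    "packet_count": [10, 20, 30, 40, 50],  # exactly proportional to total_bytes
    "duration": [5, 1, 4, 2, 3],
}

print(correlated_pairs(flows))
```

In a screen of this kind, one column from each flagged pair is a candidate for removal, since it carries largely redundant information.
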
The first approach identified highly correlated pairs of columns, while the second used a preliminary LightGBM feature-importance ranking to select the most predictive features and eliminate
