ADS Capstone Chronicles Revised
the curse of dimensionality and overfitting were addressed alongside improving computational efficiency.

4.5 Modeling

The modeling process focused on predicting accident severity using advanced machine-learning techniques. The dataset for this study integrated features derived from weather data, spatial attributes, and traffic-related metrics, offering a rich set of predictors for severity estimation. Weather-related features encompassed continuous variables such as temperature, wind speed, humidity, and precipitation, capturing external conditions influencing accident risk. Spatial attributes included latitude, longitude, and road-specific factors, while traffic-related metrics covered elements such as traffic volumes, lane counts, and speed limits, serving as critical indicators of road usage and conditions. The selection of modeling techniques was guided by the context of the final fifty features.

4.5.1 Selection of modeling techniques

Several characteristics of the dataset significantly influenced the modeling approach, including notable feature interactions and its high-dimensional nature. Preprocessing involved applying PCA for dimensionality reduction, highlighting underlying patterns and reducing the dataset's complexity. This step, combined with the presence of non-linear relationships in the data, was pivotal in determining the appropriate modeling techniques. Consequently, the final models selected were CatBoost, a gradient-boosting algorithm, and a Multi-Layer Perceptron (MLP), a neural-network algorithm tailored to handle the dataset's refined structure. Baseline versions of the CatBoost and MLP models were first created using default parameters to establish a performance benchmark. As suspected, these models
performed very poorly, failing to capture the intricate relationships within the data. The results underscored the limitations of the baseline configurations and highlighted the necessity of further tuning to enhance predictive accuracy and generalization.

To address these shortcomings, a hyperparameter optimization framework was implemented using RandomizedSearchCV to explore a predefined range of hyperparameters for each model. For CatBoost, the tuning process focused on parameters such as learning rate, depth, L2 leaf regularization, and the number of iterations, leveraging its ability to handle tabular datasets effectively, process categorical variables automatically, and reduce overfitting through L2 regularization (Prokhorenkova et al., 2018). Similarly, the MLP optimization targeted hidden layer sizes, activation functions, learning rate, and the regularization parameter (α), enhancing its capacity to learn complex non-linear patterns through its multiple hidden layers and adaptability (LeCun et al., 2015). To ensure robust performance, 5-fold cross-validation was employed during the tuning process, minimizing the risk of overfitting by evaluating the models across multiple training-validation splits. The results demonstrated that hyperparameter tuning significantly improved the predictive performance of both models, transforming them into effective tools for capturing complex patterns in the data.

4.5.2 Test design, i.e., training and validation datasets

The test design employed a carefully planned dataset-splitting strategy and preprocessing pipeline to provide the machine-learning models with high-quality input data. Given the substantial size of the final dataset, it was strategically divided into three subsets with stratification of the target variable. This approach
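The tuning workflow described above can be sketched as follows. This is a minimal illustration using scikit-learn, assuming the reader's environment has it installed: a PCA step feeds an MLP whose hyperparameters (hidden layer sizes, activation, learning rate, and α) are searched with RandomizedSearchCV under 5-fold cross-validation. The synthetic data, parameter ranges, and scoring metric are illustrative placeholders, not the study's actual dataset or grid; the CatBoost side follows the same pattern (with learning rate, depth, L2 leaf regularization, and iterations) and is omitted to keep the sketch self-contained.

```python
# Illustrative sketch only: synthetic data and assumed parameter ranges,
# not the study's actual dataset or search grid.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the engineered feature set (placeholder dimensions).
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)

# Scale, reduce dimensionality with PCA, then classify with an MLP.
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=10),
                     MLPClassifier(max_iter=300, random_state=42))

# Hypothetical search space mirroring the tuned MLP parameters:
# hidden layer sizes, activation, learning rate, and alpha (L2 penalty).
param_distributions = {
    "mlpclassifier__hidden_layer_sizes": [(50,), (100,), (50, 50)],
    "mlpclassifier__activation": ["relu", "tanh"],
    "mlpclassifier__learning_rate_init": [1e-3, 1e-2],
    "mlpclassifier__alpha": [1e-4, 1e-3, 1e-2],
}

# Randomized search with 5-fold cross-validation, as in the text.
search = RandomizedSearchCV(pipe, param_distributions, n_iter=5, cv=5,
                            scoring="f1_macro", random_state=42, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```

Evaluating each sampled configuration across five training-validation splits is what guards against selecting hyperparameters that only fit one particular split.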
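A stratified three-way split of the kind described can be sketched with two chained calls to scikit-learn's train_test_split. The 70/15/15 proportions and the synthetic imbalanced labels below are assumptions for illustration, not the study's stated ratios or data.

```python
# Illustrative sketch: stratified train/validation/test split.
# The 70/15/15 proportions and synthetic labels are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # placeholder features
y = rng.choice([0, 1, 2], size=1000, p=[0.6, 0.3, 0.1])  # imbalanced target

# First carve off 15% as the held-out test set, stratifying on the target.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)

# Split the remainder into train (70% overall) and validation (15% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, stratify=y_tmp, random_state=42)

# Stratification keeps each class's share nearly identical across subsets.
for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, np.round(np.bincount(labels) / len(labels), 2))
```

Stratifying on the target at each split preserves the severity-class distribution in every subset, which matters for an imbalanced target because a naive random split could leave a rare class underrepresented in validation or test data.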