ADS Capstone Chronicles Revised
the curse of dimensionality and overfitting were addressed alongside improving computational efficiency.

4.5 Modeling

The modeling process focused on predicting accident severity using advanced machine-learning techniques. The dataset for this study integrated features derived from weather data, spatial attributes, and traffic-related metrics, offering a rich set of predictors for severity estimation. Weather-related features encompassed continuous variables such as temperature, wind speed, humidity, and precipitation, capturing external conditions influencing accident risk. Spatial attributes included latitude, longitude, and road-specific factors, while traffic-related metrics covered elements such as traffic volumes, lane counts, and speed limits, serving as critical indicators of road usage and conditions. The selection of modeling techniques was guided by the context of the final fifty features.

4.5.1 Selection of modeling techniques

Several characteristics of the dataset significantly influenced the modeling approach, including notable feature interactions and its high-dimensional nature. Preprocessing involved applying PCA for dimensionality reduction, highlighting underlying patterns and reducing the dataset's complexity. This step, combined with the presence of non-linear relationships in the data, was pivotal in determining the appropriate modeling techniques. Consequently, the final models selected were CatBoost, a gradient-boosting algorithm, and a Multi-Layer Perceptron (MLP), a neural-network algorithm tailored to handle the dataset's refined structure. Baseline versions of the CatBoost and MLP models were first created using default parameters to establish a performance benchmark. As suspected, these models
performed very poorly, failing to capture the intricate relationships within the data. The results underscored the limitations of the baseline configurations and highlighted the necessity of further tuning to enhance predictive accuracy and generalization.

To address these shortcomings, a hyperparameter optimization framework was implemented using RandomizedSearchCV to explore a predefined range of hyperparameters for each model. For CatBoost, the tuning process focused on parameters such as learning rate, depth, L2 leaf regularization, and the number of iterations, leveraging its ability to handle tabular datasets effectively, process categorical variables automatically, and reduce overfitting through L2 regularization (Prokhorenkova et al., 2018). Similarly, the MLP optimization targeted hidden layer sizes, activation functions, learning rate, and the regularization parameter (α), enhancing its capacity to learn complex non-linear patterns through its multiple hidden layers and adaptability (LeCun et al., 2015). To ensure robust performance, 5-fold cross-validation was employed during the tuning process, minimizing the risk of overfitting by evaluating the models across multiple training-validation splits. The results demonstrated that hyperparameter tuning significantly improved the predictive performance of both models, transforming them into effective tools for capturing complex patterns in the data.

4.5.2 Test design, i.e., training and validation datasets

The test design employed a carefully planned dataset-splitting strategy and preprocessing pipeline to provide the machine-learning models with high-quality input data. Given the substantial size of the final dataset, it was strategically divided into three subsets with stratification of the target variable. This approach
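The tuning workflow described above can be sketched as follows. This is a minimal illustration using scikit-learn, assuming the reader's environment has it installed: a PCA step feeds an MLP whose hyperparameters (hidden layer sizes, activation, learning rate, and α) are searched with RandomizedSearchCV under 5-fold cross-validation. The synthetic data, parameter ranges, and scoring metric are illustrative placeholders, not the study's actual dataset or grid; the CatBoost side follows the same pattern (with learning rate, depth, L2 leaf regularization, and iterations) and is omitted to keep the sketch self-contained.

```python
# Illustrative sketch only: synthetic data and assumed parameter ranges,
# not the study's actual dataset or search grid.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the engineered feature set (placeholder dimensions).
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)

# Scale, reduce dimensionality with PCA, then classify with an MLP.
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=10),
                     MLPClassifier(max_iter=300, random_state=42))

# Hypothetical search space mirroring the tuned MLP parameters:
# hidden layer sizes, activation, learning rate, and alpha (L2 penalty).
param_distributions = {
    "mlpclassifier__hidden_layer_sizes": [(50,), (100,), (50, 50)],
    "mlpclassifier__activation": ["relu", "tanh"],
    "mlpclassifier__learning_rate_init": [1e-3, 1e-2],
    "mlpclassifier__alpha": [1e-4, 1e-3, 1e-2],
}

# Randomized search with 5-fold cross-validation, as in the text.
search = RandomizedSearchCV(pipe, param_distributions, n_iter=5, cv=5,
                            scoring="f1_macro", random_state=42, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```

Evaluating each sampled configuration across five training-validation splits is what guards against selecting hyperparameters that only fit one particular split.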
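A stratified three-way split of the kind described can be sketched with two chained calls to scikit-learn's train_test_split. The 70/15/15 proportions and the synthetic imbalanced labels below are assumptions for illustration, not the study's stated ratios or data.

```python
# Illustrative sketch: stratified train/validation/test split.
# The 70/15/15 proportions and synthetic labels are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # placeholder features
y = rng.choice([0, 1, 2], size=1000, p=[0.6, 0.3, 0.1])  # imbalanced target

# First carve off 15% as the held-out test set, stratifying on the target.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)

# Split the remainder into train (70% overall) and validation (15% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, stratify=y_tmp, random_state=42)

# Stratification keeps each class's share nearly identical across subsets.
for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, np.round(np.bincount(labels) / len(labels), 2))
```

Stratifying on the target at each split preserves the severity-class distribution in every subset, which matters for an imbalanced target because a naive random split could leave a rare class underrepresented in validation or test data.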