ADS Capstone Chronicles Revised


random forest model had 125 estimators, with the most accurate cross-validation fold scoring 82.3%.

4.6.1.4 Boosted Tree. A boosted tree classifier was generated using the `GradientBoostingClassifier` class from Scikit-learn (Pedregosa et al., 2011). Boosted tree models are built from multiple decision trees. However, unlike random forest classifiers, which are built from multiple independent trees, they use a boosting ensemble method. This involves the sequential development of trees, where each new tree is trained to predict and correct the residuals (errors) made by the previous trees (Aliyev, 2020). This iterative process focuses on improving performance on data points that were previously mispredicted. The hyperparameters tuned for the boosted tree were the learning rate, max_depth, and the number of estimators. The best-performing boosted tree had a learning rate of 0.1, a max_depth of 10, and 150 estimators, with the most accurate cross-validation fold scoring 81.4%.

4.6.1.5 K-Nearest Neighbors. A K-Nearest Neighbors (KNN) classifier was generated using the `KNeighborsClassifier` class from Scikit-learn (Pedregosa et al., 2011). KNN is a lazy learning algorithm that predicts classes by identifying the closest data points to each input test point (IBM, n.d.). The hyperparameter tuned for KNN was the k-value, which was searched over a list of odd numbers from 1 to 21 using 5-fold cross-validation. The best-performing KNN model had a k-value of 1, with the most accurate cross-validation fold scoring 75.9%.

5 Results, Findings, and End-User Tools

5.1 Model Performance
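The cross-validated hyperparameter searches described in Section 4.6 can be sketched with Scikit-learn's `GridSearchCV`. This is a minimal illustration, not the study's actual pipeline: the synthetic data and the exact grid values below are placeholders, with the reported best settings (learning rate 0.1, max_depth 10, 150 estimators; odd k from 1 to 21) included in the grids.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the study's training split.
X_train, y_train = make_classification(
    n_samples=300, n_classes=3, n_informative=5, random_state=0
)

# Boosted tree: tune learning rate, max_depth, and number of estimators
# with 5-fold cross-validation.
gbt_search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={
        "learning_rate": [0.05, 0.1],
        "max_depth": [3, 10],
        "n_estimators": [50, 150],
    },
    cv=5,
)
gbt_search.fit(X_train, y_train)

# KNN: tune k over odd values from 1 to 21 with 5-fold cross-validation.
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 22, 2))},
    cv=5,
)
knn_search.fit(X_train, y_train)

print(gbt_search.best_params_, gbt_search.best_score_)
print(knn_search.best_params_, knn_search.best_score_)
```

`best_score_` reports the mean cross-validation accuracy of the winning grid cell; the per-fold scores quoted in the text correspond to `cv_results_["split*_test_score"]`.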

Testing set performance was assessed with specificity, precision, recall, accuracy, and F1 scores on the 10% testing set (Table 5). Receiver Operating Characteristic (ROC) curves were plotted and the Area Under the Curve (AUC) was calculated (Figure 12). Models offer value if they outperform baseline rates of model performance (Table 5). Class 2 recall performance was prioritized in final model selection because it represents the most severe outcome, death. Correctly identifying these cases is crucial for forecasting the potential for a fatal side effect.

5.1.1 Top Model Selection. The random forest and gradient boosted decision tree models had the overall best performance (Table 5). While the gradient boosted decision tree had higher accuracy (+2.6%), the random forest model had higher class 2 recall (+0.9%). Computational time was also assessed; the random forest model's training time was 193 times faster than the gradient boosted decision tree's. Therefore, the random forest model was chosen as the top model due to its combination of high performance and low computational load. The associated classification matrix for the selected random forest model is shown in Figure 13, and the feature importance scores are displayed in Figure 14. The features "age" and "weight" were the two most informative, with respective feature importance scores of 0.18 and 0.17. Notably, the other individual-difference features, sex, price, and report source, were also within the top ten features. The performance of the selected random forest model was validated using the 10% validation data split (Table 6). Results were comparable to the test results; validation accuracy was 0.4% higher, and recall for Class 2 was 0.3% lower, indicating that the model performs consistently and generalizes well.
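The evaluation described above, per-class precision, recall, and F1, per-class specificity, and multiclass ROC/AUC, can be sketched with Scikit-learn. This is a hedged illustration on synthetic placeholder data, not the study's dataset; the 125-estimator random forest matches the setting reported in Section 4.6.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the study's train/test split.
X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=125, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, F1, and overall accuracy (Class 2 recall
# is the selection criterion emphasized in the text).
print(classification_report(y_test, y_pred))

# Per-class specificity, TN / (TN + FP), derived from the confusion matrix.
cm = confusion_matrix(y_test, y_pred)
for cls in range(cm.shape[0]):
    tn = cm.sum() - cm[cls, :].sum() - cm[:, cls].sum() + cm[cls, cls]
    fp = cm[:, cls].sum() - cm[cls, cls]
    print(f"class {cls} specificity: {tn / (tn + fp):.3f}")

# Multiclass AUC via one-vs-rest on predicted class probabilities.
auc = roc_auc_score(y_test, model.predict_proba(X_test), multi_class="ovr")
print(f"macro OvR AUC: {auc:.3f}")
```

The same calls applied to a held-out validation split would reproduce the test-versus-validation comparison in Table 6; feature importances as in Figure 14 are available from `model.feature_importances_`.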

