ADS Capstone Chronicles Revised
random forest model had 125 estimators, with the most accurate cross-validation fold scoring 82.3%.

4.6.1.4 Boosted Tree. A boosted tree classifier was generated using the `GradientBoostingClassifier` class from Scikit-learn (Pedregosa et al., 2011). Boosted tree models are built from multiple decision trees. However, unlike random forest classifiers, which are built from multiple independent trees, they use a boosting ensemble method: trees are developed sequentially, with each new tree trained to predict and correct the residuals (errors) of the previous trees (Aliyev, 2020). This iterative process focuses on improving performance on data points that were previously mispredicted. The hyperparameters tuned for the boosted tree were the learning rate, max_depth, and the number of estimators. The best-performing boosted tree had a learning rate of 0.1, a max_depth of 10, and 150 estimators, with the most accurate cross-validation fold scoring 81.4%.

4.6.1.5 K-Nearest Neighbors. A K-Nearest Neighbors (KNN) classifier was generated using the `KNeighborsClassifier` class from Scikit-learn (Pedregosa et al., 2011). KNN is a lazy learning algorithm that predicts classes by identifying the closest data points to an input test point (IBM, n.d.). The hyperparameter tuned for KNN was the k-value, searched over the odd numbers from 1 to 21 using 5-fold cross-validation. The best-performing KNN model had a k-value of 1, with the most accurate cross-validation fold scoring 75.9%.

5 Results, Findings, and End-User Tools

5.1 Model Performance
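For reference, the 5-fold cross-validated hyperparameter searches described in the preceding model sections can be sketched as below. This is a minimal sketch, not the authors' actual code: the dataset is synthetic, and the grid values other than those explicitly reported in the text (learning rate 0.1, max_depth 10, 150 estimators; odd k from 1 to 21) are illustrative assumptions.

```python
# Minimal sketch of the 5-fold grid searches described above.
# Synthetic data and the unreported grid values are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (three outcome classes, as in the paper's setting)
X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=5, random_state=42)

# Boosted tree: tune learning rate, max_depth, and number of estimators
gbt_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={"learning_rate": [0.05, 0.1],
                "max_depth": [3, 10],
                "n_estimators": [50, 150]},
    cv=5,
)
gbt_search.fit(X, y)

# KNN: tune k over the odd numbers from 1 to 21
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 22, 2))},
    cv=5,
)
knn_search.fit(X, y)

print("boosted tree:", gbt_search.best_params_)
print("knn:", knn_search.best_params_)
```

`GridSearchCV` refits the best configuration on the full training data, so `best_params_` and `best_score_` (the mean cross-validation score) are available directly after `fit`.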
Testing set performance was assessed with specificity, precision, recall, accuracy, and F1 scores on the 10% testing set (Table 5). Receiver Operating Characteristic (ROC) curves were plotted, and the Area Under the Curve (AUC) was calculated (Figure 12). Models offer value if they outperform baseline rates of model performance (Table 5). Class 2 recall performance is prioritized in final model selection because Class 2 represents the most severe outcome, death; correctly identifying these cases is crucial to forecasting the potential for a deadly side effect.

5.1.1 Top Model Selection. The random forest and gradient boosted decision tree models had the best overall performance (Table 5). While the gradient boosted decision tree had higher accuracy (+2.6%), the random forest model had higher Class 2 recall (+0.9%). Computational time was also assessed; the random forest model trained 193 times faster than the gradient boosted decision tree. Therefore, the random forest model was chosen as the top model due to its combination of high performance and low computational load. The classification matrix for the selected random forest model is shown in Figure 13, and the feature importance scores are displayed in Figure 14. The features "age" and "weight" were the two most informative, with respective feature importance scores of 0.18 and 0.17. Notably, the remaining individual-difference features (sex, price, and report source) also ranked within the top ten features. The performance of the selected random forest model was validated using the 10% validation data split (Table 6). Results were comparable to the test results; validation accuracy was 0.4% higher, and recall for Class 2 was 0.3% lower, indicating that the model performs consistently and generalizes well.
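The evaluation described above can be sketched with standard Scikit-learn metrics. This is an illustrative sketch, not the authors' code: the data are synthetic, the 80/10/10 train/test/validation proportions are an assumption consistent with the stated 10% splits, and specificity (which Scikit-learn does not report directly) is derived from the confusion matrix.

```python
# Sketch of the test-set evaluation: accuracy, per-class precision/recall/F1,
# one-vs-rest ROC AUC, specificity for Class 2, and feature importances.
# Synthetic data; split proportions are an assumption.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=6, random_state=0)

# Assumed 80/10/10 train/test/validation split
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_test, X_val, y_test, y_val = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0, stratify=y_hold)

model = RandomForestClassifier(n_estimators=125, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"accuracy: {acc:.3f}")
print(classification_report(y_test, y_pred))  # per-class precision/recall/F1

# One-vs-rest AUC from predicted class probabilities
auc = roc_auc_score(y_test, model.predict_proba(X_test), multi_class="ovr")
print(f"one-vs-rest AUC: {auc:.3f}")

# Specificity for Class 2 = TN / (TN + FP), derived from the confusion matrix
cm = confusion_matrix(y_test, y_pred)
fp2 = cm[:, 2].sum() - cm[2, 2]                      # non-2 cases predicted as 2
tn2 = cm.sum() - cm[2].sum() - cm[:, 2].sum() + cm[2, 2]
spec2 = tn2 / (tn2 + fp2)
print(f"Class 2 specificity: {spec2:.3f}")

# Feature importances, highest first (cf. the reported age/weight ranking)
ranking = model.feature_importances_.argsort()[::-1]
print("top features by importance:", ranking[:3])
```

The same metrics applied to `X_val`/`y_val` would reproduce the validation check described above; similar test and validation scores are what indicate consistent, generalizable performance.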