M.S. Applied Data Science - Capstone Chronicles 2025


4.5.9 Final Model Selection

The F1-score was chosen as the performance metric because it balances precision and recall, with higher values (closer to 1.0) indicating better model performance. The scores reported in Figure 11 represent each model's best performance following cross-validation, which offers a more reliable estimate of how well the models generalize to unseen data. The F1-scores, from highest to lowest, are: random forest (0.9215), decision tree (0.8986), MLP (0.8810), XGBoost (0.8781), and logistic regression (0.6894).
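The cross-validated comparison described above can be sketched as follows. This is a minimal illustration using scikit-learn and a synthetic dataset; the capstone's actual features, full model set, and tuned hyperparameters are not reproduced here.

```python
# Sketch of comparing models by cross-validated weighted F1-score.
# The dataset and model settings below are illustrative assumptions,
# not the capstone's actual data or grid-searched hyperparameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    "random_forest": RandomForestClassifier(random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Averaging F1 over cross-validation folds gives a more reliable
# estimate of generalization than a single train/test split.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1_weighted").mean()
    for name, model in models.items()
}

for name, f1 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {f1:.4f}")
```

Ranking the mean fold scores, as done here, mirrors the ordering reported in Figure 11.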

Figure 11. Model Comparison - Best CV F1-scores

4.5.10 Learning Curve Analysis

A learning curve analysis was performed to assess whether increasing the training data size would improve model performance, focusing on the random forest model. The model was initialized with the best hyperparameters obtained from grid search, and training was conducted on progressively larger subsets of the training data, ranging from 10% to 99%. For each subset, both log loss and weighted F1-score were calculated on the training and validation sets. This analysis aimed to evaluate how the model's generalization ability evolves with the size of the training data.

Figure 12 presents the log loss as a function of training size. The decreasing trend in validation loss suggests that the model benefits from more data, continuing to generalize better as additional samples are added. In contrast, Figure 13 displays the F1-score, which improves steadily with increasing training size. As the validation F1-score approaches the training F1-score, overfitting is reduced and the model's robustness is enhanced. These findings suggest that the random forest model may benefit from additional data to improve prediction accuracy on unseen data.

Figure 12. Log Loss vs. Training Size (Random Forest)

The figure shows clear evidence of overfitting in the random forest model under examination. It plots log loss against training size, expressed as a fraction of the available training data from 0.1 to 1.0. Two distinct performance curves are presented: training loss and validation loss.
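The learning-curve procedure described above can be sketched as follows: fit the model on progressively larger fractions of the training set and record log loss and weighted F1 on a held-out validation set. The dataset, split, and model settings are illustrative assumptions; the capstone's grid-searched hyperparameters would replace the default constructor.

```python
# Sketch of a learning curve over training fractions from 10% to 99%.
# Synthetic data stands in for the capstone's actual dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = []
for frac in np.linspace(0.1, 0.99, 10):
    n = int(frac * len(X_tr))
    # Best grid-search hyperparameters would be passed here.
    model = RandomForestClassifier(random_state=42)
    model.fit(X_tr[:n], y_tr[:n])
    results.append({
        "fraction": round(float(frac), 2),
        "val_log_loss": log_loss(y_val, model.predict_proba(X_val)),
        "val_f1": f1_score(y_val, model.predict(X_val), average="weighted"),
    })

for r in results:
    print(r)
```

Plotting `val_log_loss` and `val_f1` against `fraction`, alongside the same metrics computed on the training subset, yields curves of the kind shown in Figures 12 and 13; a narrowing gap between the training and validation curves indicates reduced overfitting.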

