M.S. Applied Data Science - Capstone Chronicles 2025


Figure 17 Confusion Matrix of Final Model on Test Set

several models demonstrated imbalanced precision-recall relationships, with recall values exceeding precision for MLP and logistic regression. This reversed imbalance indicates that these models tend toward false positive classifications for Class III instances. Logistic regression exhibited particularly poor precision (approximately 0.20) while maintaining moderate recall (approximately 0.65), suggesting indiscriminate classification of instances as Class III. The consistently lower performance across all models highlights the inherent complexity of discriminating Class III instances, potentially indicating greater feature overlap with other classes or higher within-class variability.

5.3 Final Model Evaluation on Test Set

To measure the generalizability of the final model, predictions were made on the holdout test set using the optimized classification pipeline. The classification report indicated robust performance across multiple metrics. The model achieved an overall accuracy of 93.2%, with macro-averaged precision, recall, and F1-scores all exceeding 93%. Among individual classes, Class II exhibited the strongest predictive performance, with a precision of 94.4%, recall of 96.8%, and an F1-score of 95.6%. Class I also demonstrated strong results, reflected by an F1-score of 92.2%. In contrast, Class III showed reduced predictive performance, with a precision of 76.1%, recall of 62.8%, and an F1-score of 68.8%, suggesting the model experienced difficulty in correctly identifying this class.

The confusion matrix presented in Figure 17 illustrates the distribution of misclassifications. Most misclassifications were concentrated in Class III, often being confused with Classes I and II. This pattern aligns with the metrics from the classification report, indicating lower model confidence and precision for this class.
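As an illustration of how the reported metrics relate to a confusion matrix, the following minimal sketch derives per-class precision, recall, and F1-score, together with macro and weighted averages, from a small three-class example. The labels and predictions are hypothetical toy values, not the project data, and the helper names are illustrative only. Libraries such as scikit-learn offer equivalent functionality through `confusion_matrix` and `classification_report`.

```python
# Toy illustration only: labels and predictions are hypothetical, not project data.
LABELS = ["I", "II", "III"]
y_true = ["I"] * 4 + ["II"] * 4 + ["III"] * 2
y_pred = ["I", "I", "I", "II", "II", "II", "II", "II", "III", "I"]


def confusion_matrix(y_true, y_pred, labels):
    """Count outcomes; rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def per_class_metrics(matrix, labels):
    """Derive precision, recall, F1, and support for each class."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp  # predicted as class i, but wrong
        fn = sum(matrix[i]) - tp                 # true class i, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": sum(matrix[i]),
        }
    return metrics


def average(metrics, key, weighted=False):
    """Macro average (equal class weight) or support-weighted average."""
    if weighted:
        total = sum(m["support"] for m in metrics.values())
        return sum(m[key] * m["support"] for m in metrics.values()) / total
    return sum(m[key] for m in metrics.values()) / len(metrics)


matrix = confusion_matrix(y_true, y_pred, LABELS)
metrics = per_class_metrics(matrix, LABELS)
macro_recall = average(metrics, "recall")
weighted_recall = average(metrics, "recall", weighted=True)
```

In this toy example the weighted recall equals the overall accuracy, while the macro average is pulled down by the weakest class. This is why macro and weighted summaries can diverge when one class underperforms.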

Further analysis of weighted averages supported these findings, with the model achieving a weighted precision of 92.95%, recall of 93.18%, and an F1-score of 93.01%. These results affirm that the model maintained balanced classification performance across the dataset, although disparities in class-level performance highlight potential limitations, particularly in differentiating minority or less distinct classes.

To further assess discriminative capacity, receiver operating characteristic curves were generated for each class using predicted probabilities. As shown in Figure 18, the model exhibited high discriminative ability for Classes I and II, with receiver operating characteristic curves approaching the upper-left corner and large areas under the curve. Class III, although showing relatively lower performance, still achieved a moderate area under the curve, indicating that the model was capable of assigning useful probability estimates even in cases with higher misclassification rates.
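The per-class curves described above can be sketched with a one-vs-rest construction: each class in turn is treated as the positive class, its predicted probability serves as the score, and the area under the resulting curve is computed with the trapezoidal rule. The snippet below is a minimal sketch under that assumption. The probabilities are hypothetical toy values, not the model's outputs, and the function names are illustrative. scikit-learn's `roc_curve` and `roc_auc_score` provide comparable functionality.

```python
# Toy illustration only: true labels and predicted probabilities are hypothetical.
CLASSES = ["I", "II", "III"]
y_true = ["I", "I", "II", "II", "III", "III"]
y_prob = [
    {"I": 0.7, "II": 0.1, "III": 0.2},
    {"I": 0.5, "II": 0.1, "III": 0.4},
    {"I": 0.1, "II": 0.8, "III": 0.1},
    {"I": 0.2, "II": 0.45, "III": 0.35},
    {"I": 0.3, "II": 0.1, "III": 0.6},
    {"I": 0.6, "II": 0.1, "III": 0.3},
]


def roc_points(y_true, scores, positive):
    """Return (FPR, TPR) points for `positive` versus all other classes.

    Assumes at least one positive and one negative sample are present.
    """
    binary = [1 if label == positive else 0 for label in y_true]
    n_pos = sum(binary)
    n_neg = len(binary) - n_pos
    # Lower the threshold step by step, from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for rank, i in enumerate(order):
        tp += binary[i]
        fp += 1 - binary[i]
        # Tied scores share one threshold, so emit a point only after the tie ends.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


auc_by_class = {}
for cls in CLASSES:
    class_scores = [probs[cls] for probs in y_prob]
    auc_by_class[cls] = auc(roc_points(y_true, class_scores, cls))
```

A class whose positives consistently receive higher scores than its negatives yields a curve that hugs the upper-left corner and an area near 1, whereas a class that is often outranked by other classes produces a smaller area.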
