M.S. Applied Data Science - Capstone Chronicles 2025


By integrating resampling, feature selection, and model training into one cohesive flow, the pipeline ensured consistent preprocessing during both training and eventual deployment. This comprehensive approach increases confidence that the model will generalize well to unseen data while maintaining fairness across all target classes.

5 Results and Findings

5.1 Model's Performance

The comparative analysis of classification models revealed considerable variability in performance across algorithms. Tree-based methods, particularly ensemble and decision tree models, demonstrated superior classification capabilities based on accuracy. Table 3 shows that random forest achieved the highest F1-score (0.9215), followed by decision tree (0.8986), a marginal difference of 0.0229. MLP and XGBoost exhibited moderate F1-scores of 0.8810 and 0.8781, respectively, while logistic regression underperformed with an F1-score of 0.6894. The tree-based models' superior performance suggests that the classification problem involves complex, non-linear decision boundaries that these models are well-suited to capture.
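The comparison above can be sketched in code. This is an illustrative example on synthetic data, not the project's actual dataset or tuned models: it fits several of the classifiers from Table 3 and scores each with the weighted F1 metric used in the study.

```python
# Illustrative sketch (synthetic data, not the project's dataset):
# comparing classifiers by weighted F1, the metric reported in Table 3.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

models = {
    "logistic_regression": LogisticRegression(C=0.1, penalty="l2", max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Weighted F1 averages per-class F1 by class support, so minority
    # classes still contribute proportionally to the final score.
    scores[name] = f1_score(y_test, model.predict(X_test), average="weighted")
```

Weighted F1 is a sensible comparison metric here because it balances precision and recall per class while weighting each class by its frequency, which matters when the target classes are imbalanced.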

Table 3
Optimal Hyperparameter Configurations for Each Model

Logistic regression: C = 0.1, penalty = L2
Decision tree: max_depth = None, min_samples_leaf = 1, min_samples_split = 2
Random forest: n_estimators = 200, max_depth = None, min_samples_leaf = 1, min_samples_split = 2
XGBoost: n_estimators = 200, learning_rate = 0.1, max_depth = 6, subsample = 0.8, colsample_bytree = 1.0
MLPClassifier: hidden_layer_sizes = (50, 50), activation = tanh, alpha = 0.001, learning_rate = constant

Note. Optimal configurations were determined using grid search with F1-weighted score as the optimization metric.
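The tuning procedure described in the note can be sketched as follows. This is a minimal, hedged example on synthetic data: the parameter grid, feature selector, and dataset are illustrative stand-ins, not the project's actual configuration.

```python
# Hedged sketch of grid search with F1-weighted scoring over a
# pipeline (feature selection + classifier). Grid and data are
# illustrative, not the project's actual setup.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)

pipe = Pipeline([
    # A resampling step (e.g. imbalanced-learn's SMOTE inside an
    # imblearn Pipeline) would slot in before feature selection so it
    # runs only on training folds, never on validation data.
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", RandomForestClassifier(random_state=0)),
])

param_grid = {
    "select__k": [5, 10],
    "clf__n_estimators": [100, 200],
    "clf__max_depth": [None, 6],
}

# scoring="f1_weighted" makes the search optimize the same metric
# reported in the results, rather than plain accuracy.
search = GridSearchCV(pipe, param_grid, scoring="f1_weighted", cv=3)
search.fit(X, y)
best_params = search.best_params_
```

Running the search inside a single pipeline is what guarantees the consistent preprocessing described earlier: every candidate configuration sees identically transformed data in every cross-validation fold.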

