M.S. Applied Data Science - Capstone Chronicles 2025
rate of 0.1, tree depth of 8, L2 regularization of 3.0, and Bernoulli bootstrap sampling. On the balanced training set, this baseline achieved a mean cross-validation accuracy of 75.37% (±0.28%) and a macro F1-score of 75.41% (±0.27%), performing comparably to LightGBM's baseline. Hyperparameter optimization for CatBoost was conducted using RandomizedSearchCV. The search space included the number of iterations (200–500), tree depth (4–10), learning rate (0.01–0.1), L2 regularization strength (1.0–9.0), subsample ratio (70–100%), and bootstrap type (Bayesian or Bernoulli). The optimal configuration used 500 iterations, a learning rate of 0.1, a depth of 8, L2 regularization of 1.0, an 80% subsample ratio, and Bernoulli bootstrap. This tuned model achieved a cross-validation macro F1-score of 75.77% and was retained as one of the top-performing gradient boosting candidates for the validation and test-set evaluation reported in Section 5.

4.4.2.5 XGBoost Model

XGBoost, one of the most widely adopted gradient boosting implementations, was configured using the histogram-based tree construction method with 300 estimators, a learning rate of 0.1, a maximum depth of 8, and 80% subsample ratios for both rows and columns. Cross-validation on the balanced training set yielded a mean accuracy of 75.56% (±0.34%) and a macro F1-score of 75.63% (±0.33%). The hyperparameter search for XGBoost was intentionally comprehensive: RandomizedSearchCV explored the number of estimators (200–500), tree depth (3–9), learning rate (0.01–0.1), subsample ratios (70–100%), minimum child weight (1–7), gamma values for minimum loss reduction (0.0–0.5), and both L1 and L2 regularization terms (0.0–2.0). The optimal configuration used 400 estimators, a learning rate of 0.1, a depth of 7, a minimum child weight of 3, a gamma of 0.5, an 80% row subsample, a 70% column subsample, L2 regularization of 1.0, and L1 regularization of 1.0. This tuned model achieved a cross-validation macro F1-score of 75.91%, the highest among the gradient boosting methods during cross-validation, and was therefore retained as a key ensemble candidate for the validation and test-set evaluation in Section 5.

4.4.2.6 AdaBoost Model

AdaBoost represents a different boosting paradigm: it sequentially trains weak learners, with each subsequent learner focusing on examples misclassified by its predecessors. The initial configuration used 200 estimators with a learning rate of 0.1. Cross-validation on the balanced training set, however, revealed substantial performance problems: the model achieved a mean accuracy of only 29.07% (±0.24%) and a macro F1-score of 16.80% (±0.29%), suggesting the base AdaBoost setup struggled with this task. Hyperparameter optimization was conducted using RandomizedSearchCV, exploring the number of estimators (100–300), learning rate (0.01–0.3), and algorithm variant (SAMME.R for probability-based boosting versus SAMME for discrete predictions). The optimal configuration used 150 estimators, a learning rate of 0.3, and the SAMME algorithm, improving the cross-validation macro F1-score to 52.43%. Even with tuning, this performance remained substantially below that of the other gradient boosting and ensemble methods, so AdaBoost was retained primarily as a comparative baseline rather than a leading deployment candidate.

4.4.2.7 Logistic Regression Model
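To make the tuning procedure used in the preceding subsections concrete, the following is a minimal, runnable sketch of the RandomizedSearchCV pattern applied to the boosting models, shown here with scikit-learn's AdaBoostClassifier and the estimator/learning-rate ranges reported in the text. The synthetic dataset, fold count, and search budget are illustrative assumptions, not the study's actual pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the balanced training set (illustrative only).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10,
    n_classes=3, random_state=42,
)

# Search space mirroring the ranges reported above:
# 100-300 estimators and learning rates of 0.01-0.3.
param_distributions = {
    "n_estimators": [100, 150, 200, 250, 300],
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.3],
}

search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # small budget for illustration; the study sampled more
    scoring="f1_macro",  # macro F1, the selection metric used throughout
    cv=3,                # fold count is an assumption for this sketch
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 4))
```

The same pattern extends to the CatBoost and XGBoost searches by swapping in the corresponding estimator and its parameter grid (e.g. `max_depth`, `subsample`, `min_child_weight`, `gamma`, and the L1/L2 regularization terms for XGBoost).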