M.S. Applied Data Science - Capstone Chronicles 2025


In this study, logistic regression was implemented with L2 regularization to mitigate overfitting. Features were standardized to place all predictors on a common scale. Hyperparameter tuning was conducted via grid search over the regularization strength. To address class imbalance, SMOTE was applied during training, and model evaluation employed stratified 5-fold cross-validation to ensure robust performance assessment across data subsets.

4.5.3 Selection of Modeling Techniques - Random Forest

In the modeling phase, the random forest algorithm was implemented as a foundational ensemble learning method for its balance of accuracy, interpretability, and resilience to overfitting. Random forest constructs multiple decision trees and aggregates their predictions to improve generalization performance. This ensemble approach is particularly effective in multiclass classification tasks with class imbalance, as it reduces the risk of overfitting to dominant classes. In our case, the model was configured with class_weight="balanced" to ensure fair treatment of the minority class, Class III. According to DataCamp (n.d.), random forest models are robust to outliers and perform well even when key statistical assumptions are unmet: "Random forest is an ensemble learning method that builds multiple decision trees and aggregates their predictions" for better performance and robustness. The model was tuned over key parameters such as the number of trees (n_estimators), tree depth (max_depth), and the minimum samples required to split a node or remain at a leaf. As with decision trees, feature scaling is not necessary. Feature selection is supported through
model-based selection (SelectFromModel), but again, the final configuration did not reduce the feature set (Pinto et al., 2025). As with the other models, training used five-fold stratified k-fold cross-validation with SMOTE applied within each training fold to correct class imbalance, and each fold performed its own grid search to find optimal parameters. This layered validation approach helps ensure reliable model evaluation. Random forest typically handles feature interactions and noise well, making it resilient to overfitting; the ensemble approach improves generalizability and often outperforms a single decision tree on complex datasets.

4.5.4 Selection of Modeling Techniques - XGBoost

Incorporating multiple modeling approaches is a strategic decision to capture varied data patterns and relationships. XGBoost (extreme gradient boosting) was selected for its ability to model complex, non-linear interactions and its robustness against overfitting. Its inclusion ensures that the modeling framework benefits from both simple, interpretable models and complex, high-performance algorithms, facilitating a comprehensive analysis of the data (Chen & Guestrin, 2016). XGBoost represents the ensemble learning paradigm of gradient boosting, which sequentially builds models to correct the errors of previous models. This approach contrasts with other paradigms such as generalized linear models (e.g., logistic regression) and bagging methods (e.g., random forests), providing a diverse set of algorithms that can capture different aspects of the data structure (Kuhn & Johnson, 2013).

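The sequential error-correction idea behind gradient boosting can be shown with a toy sketch. This is not the project's XGBoost model: it is a minimal regression illustration using scikit-learn decision stumps on synthetic data, where each new tree is fit to the residual errors of the current ensemble.

```python
# Toy illustration of gradient boosting: each shallow tree is fit to the
# residuals (errors) of the ensemble built so far. Synthetic data only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)  # noisy target

prediction = np.zeros_like(y)  # start from a constant (zero) model
learning_rate = 0.1
for _ in range(100):
    residual = y - prediction                      # errors of current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # correct those errors

mse = np.mean((y - prediction) ** 2)  # training error shrinks as trees are added
```

XGBoost applies the same principle with regularized objectives and efficient tree construction, which is why it handles complex, non-linear interactions while resisting overfitting.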
