M.S. Applied Data Science - Capstone Chronicles 2025


The inclusion of XGBoost complements simpler models by providing a benchmark for high predictive performance. While models such as logistic regression offer interpretability, XGBoost captures complex feature interactions and non-linear relationships. This combination allows a balanced evaluation of predictive performance against interpretability, ensuring that the final model selection aligns with the specific needs of the application domain (Shwartz-Ziv & Armon, 2021). In this study, XGBoost was configured with the scale_pos_weight parameter to address class imbalance by up-weighting errors on minority classes. The model used a softmax-based objective (multi:softprob) for multiclass classification and was tuned over hyperparameters including the learning rate, tree depth, number of estimators, and sampling parameters (subsample, colsample_bytree). Feature scaling was not required because of the model's tree-based structure, and optional feature selection was performed with SelectFromModel. Training used 5-fold stratified cross-validation with SMOTE applied to each training fold, and a nested 3-fold grid search was conducted for hyperparameter tuning. Tracking feature importance during the selection process helped identify variables that consistently contributed to model performance.

4.5.5 Selection of Modeling Techniques - MLP

The inclusion of the MLP reflects a deliberate strategy to broaden the diversity of learning algorithms and incorporate a deep learning paradigm into the modeling pipeline. MLPs are particularly adept at capturing complex, non-linear relationships, making them suitable when simpler models fail to fully explain the underlying data structure. In the context of model diversity, a neural network complements linear, ensemble, and kernel-based models by leveraging representation learning to extract patterns that are not explicitly defined in the data (Raschka et al., 2022). MLP architectures offer a sophisticated approach to high-dimensional medical data analysis through their capacity to model complex non-linear relationships within hierarchical processing structures. When integrated with dimension-reduction techniques, MLPs demonstrate strong efficacy in medical decision support applications by systematically identifying the most relevant features while preserving essential diagnostic information (Lee et al., 2020). This methodological framework addresses the computational challenges of medical datasets in which feature dimensionality frequently exceeds sample size, offering "a mathematically sound framework that balances model complexity with generalization capability" while maintaining predictive accuracy with optimized parametrization (Lee et al., 2020). MLPs complement simpler, more interpretable models by offering higher expressive capacity. While logistic regression serves as a strong baseline because of its transparency, the MLP provides a contrast in flexibility and learning capacity, allowing the analysis to benefit from a broader performance spectrum. This approach follows machine learning best practice, which recommends comparing high-complexity models against interpretable baselines to balance predictive performance with model transparency (Molnar, 2022). Additionally, while ensemble models like random forest or XGBoost capture feature interactions through
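The XGBoost training protocol described earlier (5-fold stratified cross-validation, resampling applied only to each training fold, and a nested 3-fold grid search) can be sketched as follows. This is a minimal illustration on synthetic data, not the study's implementation: GradientBoostingClassifier stands in for XGBoost and simple random oversampling stands in for SMOTE so the sketch depends only on scikit-learn and NumPy; the study itself used the xgboost and imbalanced-learn libraries.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

# Synthetic imbalanced 3-class problem standing in for the study's data.
X, y = make_classification(n_samples=400, n_classes=3, n_informative=6,
                           weights=[0.6, 0.3, 0.1], random_state=42)

def oversample(X_tr, y_tr):
    """Random oversampling of minority classes (stand-in for SMOTE)."""
    classes, counts = np.unique(y_tr, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([rng.choice(np.flatnonzero(y_tr == c),
                                     size=n_max, replace=True)
                          for c in classes])
    return X_tr[idx], y_tr[idx]

param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]}
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in outer.split(X, y):
    # Resample the training fold only, so the test fold stays untouched.
    X_tr, y_tr = oversample(X[train_idx], y[train_idx])
    # Nested 3-fold grid search for hyperparameter tuning.
    inner = GridSearchCV(
        GradientBoostingClassifier(n_estimators=50, random_state=42),
        param_grid, cv=3, scoring="f1_macro")
    inner.fit(X_tr, y_tr)
    scores.append(f1_score(y[test_idx], inner.predict(X[test_idx]),
                           average="macro"))
```

Resampling inside each training fold (rather than before splitting) prevents synthetic minority samples from leaking into the held-out fold and inflating the cross-validated scores.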
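The pairing of an MLP with a dimension-reduction step discussed above can be sketched as a scikit-learn pipeline. The specifics here are assumptions for illustration: PCA is used as the reduction technique and the layer sizes (32, 16) are arbitrary, since this excerpt does not specify the study's architecture or reduction method.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# High-dimensional synthetic data: more features than informative signal,
# mimicking the "dimensionality exceeds sample size" regime discussed above.
X, y = make_classification(n_samples=300, n_features=50, n_informative=8,
                           random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),       # unlike trees, MLPs need scaled inputs
    ("reduce", PCA(n_components=10)),  # keep the most informative directions
    ("mlp", MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500,
                          random_state=0)),
])

acc = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
```

Bundling scaling and reduction into the pipeline ensures both are fit on each training fold only, mirroring the leakage-avoidance logic used for resampling.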
