M.S. Applied Data Science - Capstone Chronicles 2025


4.4.1 Class Imbalance and SMOTE Application

The target variable (has_metabolic_syndrome) exhibited a notable class imbalance, with 3,941 individuals meeting the criteria for metabolic syndrome and 2,219 without. Because imbalanced datasets can bias models toward the majority class, the Synthetic Minority Oversampling Technique (SMOTE) was applied to the training set only. SMOTE generates synthetic minority-class samples by interpolating between existing cases rather than duplicating them, which helps classifiers learn better decision boundaries and reduces the risk of overfitting to oversampled points. Applying SMOTE exclusively to the training data prevents information leakage into the test set, ensuring an unbiased evaluation of model performance.

4.4.2 Selection of Modeling Techniques

A diverse set of five classifiers was used to capture a range of statistical and algorithmic approaches. Logistic regression served as a transparent baseline model widely used in clinical and epidemiological research, offering interpretable coefficients and odds ratios. Random forest was chosen for its ability to model nonlinear relationships and interactions while remaining relatively robust to noisy features. Support vector machine (SVM) was included for its capacity to construct optimal separating hyperplanes in high-dimensional spaces, making it well suited for binary classification with potentially complex decision boundaries. Extreme Gradient Boosting (XGBoost), a high-performance gradient-boosted tree algorithm, was incorporated for its proven success in capturing subtle nonlinearities and maximizing predictive accuracy. Finally, a multi-layer perceptron (MLP) neural network was selected to evaluate the potential of deep learning to identify latent feature patterns not easily detected by traditional methods.
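The interpolation step that SMOTE performs (described in Section 4.4.1) can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the implementation used in the study; in practice a library implementation such as imblearn's SMOTE would be applied, and the function name `smote_sample` and its parameters here are hypothetical.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Generate synthetic minority-class points by interpolating between a
    random minority point and one of its k nearest minority neighbors.
    Sketch of the SMOTE idea only; use imblearn.over_sampling.SMOTE in practice."""
    rng = np.random.default_rng(rng)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        # Pick a random minority-class point.
        x = X_min[rng.integers(len(X_min))]
        # Find its k nearest minority neighbors (excluding itself at index 0).
        dists = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        neighbor = X_min[rng.choice(neighbors)]
        # Place the synthetic point a random fraction of the way to the neighbor.
        u = rng.random()
        synthetic[i] = x + u * (neighbor - x)
    return synthetic
```

Because each synthetic point is a convex combination of two real minority cases, it lies on the line segment between them rather than duplicating either, which is what lets classifiers learn smoother decision boundaries around the minority class.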

All models were trained and tuned using 5-fold stratified cross-validation on the training set, ensuring that each fold preserved the original class distribution. This approach allowed every observation to contribute to both training and validation, improving generalizability while mitigating overfitting. Hyperparameter tuning was conducted via grid search, optimizing key parameters for each algorithm to achieve the best possible predictive performance under consistent evaluation criteria.

5 Results and Findings

5.1 Optimal Hyperparameters

Hyperparameter optimization was conducted using grid search with 5-fold cross-validation, separately for Group A (lifestyle + medications) and Group B (lifestyle only). The tuning process aimed to identify parameter configurations that maximized model generalization performance while avoiding overfitting.

For Group A, the optimal parameters varied considerably by model type (Table 2). Logistic Regression performed best with a relatively high regularization parameter (C=10), while Random Forest achieved strong results with moderate tree depth (max_depth=20) and a standard number of estimators (n_estimators=100). XGBoost benefited from a higher learning rate (0.2) and moderately deep trees (max_depth=7).

For Group B, several models retained similar parameter settings to Group A (Table 3), suggesting that excluding medication data did not require drastically different model complexity. Notably, Random Forest in Group B used double the number of estimators (n_estimators=200) compared to Group A, likely to compensate for the reduced feature set.
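The tuning protocol described above (grid search under 5-fold stratified cross-validation, fit on the training split only) might look like the following sketch with scikit-learn. The synthetic data and the parameter grids are illustrative assumptions, not the study's actual grids; only individual values such as C=10 and n_estimators in {100, 200} echo settings reported in the results.

```python
# Sketch of grid search with 5-fold stratified CV; grids are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Stand-in data with a majority/minority split (not the study's dataset).
X, y = make_classification(n_samples=300, weights=[0.36, 0.64], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Each fold preserves the original class distribution.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grids = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "rf": (RandomForestClassifier(random_state=42),
           {"n_estimators": [100, 200], "max_depth": [10, 20]}),
}

best = {}
for name, (model, grid) in grids.items():
    search = GridSearchCV(model, grid, cv=cv, scoring="roc_auc")
    search.fit(X_tr, y_tr)  # tuning touches the training split only
    best[name] = search.best_params_
```

Holding out X_te entirely during tuning mirrors the leakage-avoidance rationale given for SMOTE: the test set is seen only once, at final evaluation.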

