M.S. Applied Data Science - Capstone Chronicles 2025


variables alone can yield strong predictive performance—and partially support the second part, which anticipated additional gains from including medication-related variables. In Group A, the inclusion of medication indicators provided a modest performance boost, particularly for non-linear models, suggesting that clinical context may act as a proxy for prior diagnosis or disease severity. However, SHAP analyses revealed that these variables rarely outranked top lifestyle predictors such as avg_kcal, avg_fat, avg_sugar, avg_fiber, and physically_active. This indicates that the incremental gains from medication data are supplementary rather than central to predictive accuracy.

Our findings align with prior research demonstrating the value of machine learning in chronic disease prediction. For example, Chen et al. (2021) found that incorporating electronic health record data improved hypertension risk prediction, while Sun et al. (2022) highlighted the predictive strength of lifestyle metrics in cardiometabolic outcomes. Extending this work, our results show that even in the absence of clinical medication data, advanced non-linear models such as MLP and XGBoost can achieve high predictive performance, capturing complex interactions among behavioral variables that simpler models such as Logistic Regression may overlook.

Although cross-validation and careful hyperparameter tuning were used to reduce overfitting, the near-perfect performance of Random Forest in some cases suggests that residual overfitting cannot be entirely ruled out. Furthermore, because NHANES is U.S.-based and relies partly on self-reported measures, generalizability to other populations or healthcare settings may be limited. Future
research could address this by validating models on international cohorts and incorporating longitudinal data to improve prediction of incident, rather than prevalent, metabolic syndrome.

6.1 Conclusion

This study demonstrates that integrating clinical indicators such as medication use can enhance the predictive performance of machine learning models for metabolic syndrome. However, models trained solely on lifestyle and behavioral variables, particularly MLP and XGBoost, also delivered high accuracy, recall, and ROC-AUC scores, reinforcing the potential for non-clinical screening tools. Among all models tested, MLP provided balanced and robust performance with minimal overfitting indicators, making it a strong candidate for generalization. Random Forest achieved similarly strong results but may require further regularization to maintain stability across diverse populations. The findings suggest that while combining lifestyle and clinical data provides the best predictive value, behavioral data alone can support effective early detection strategies, especially in resource-limited settings where clinical data may be unavailable.

6.1.1 Limitations

Several limitations should be acknowledged. First, the NHANES dataset is cross-sectional, limiting causal inference and preventing the assessment of changes over time. Second, key lifestyle and behavioral variables, such as dietary intake and physical activity, were self-reported, introducing potential recall and reporting bias. Third, medication variables were broadly classified into three categories, omitting
