M.S. Applied Data Science - Capstone Chronicles 2025
Predicting Metabolic Syndrome Risk: The Role of Lifestyle and Medication in NHANES Data
Patricio Martinez
Applied Data Science Master’s Program
Shiley Marcos School of Engineering / University of San Diego
patriciomartinez@sandiego.edu
ABSTRACT
This study examines the predictive value of lifestyle, behavioral, and socioeconomic indicators in identifying individuals at risk for metabolic syndrome, using data from the 2017–March 2020 pre-pandemic National Health and Nutrition Examination Survey (NHANES). Two modeling approaches were compared: Model A, which incorporated medication variables alongside non-clinical predictors, and Model B, which relied solely on lifestyle, behavioral, and socioeconomic features. Five supervised machine learning algorithms, namely logistic regression, random forest, support vector machine (SVM), XGBoost, and multilayer perceptron (MLP), were trained using a consistent pipeline with numerical encoding, variance filtering, scaling, SMOTE oversampling, and stratified cross-validation. Model A achieved its strongest performance with XGBoost (accuracy 0.90, ROC-AUC 0.96), followed closely by random forest and MLP. Model B retained strong predictive power despite excluding medication data, with random forest and MLP achieving accuracies of 0.90 and ROC-AUCs of 0.97 and 0.90, respectively. SHAP analysis revealed that dietary intake variables and physical activity were consistently influential in both models, while medication-related features ranked highest only in Model A. These results suggest that while clinical indicators enhance predictive accuracy, lifestyle-based models remain effective tools for early risk detection and population health screening, particularly in settings with limited access to clinical data.

KEYWORDS
Metabolic Syndrome, NHANES, Machine Learning, Random Forest, MLP, Predictive Modeling, Health Analytics

1 Introduction
Metabolic syndrome, a cluster of conditions including high blood pressure, abdominal obesity, elevated glucose, insulin resistance, and abnormal lipid levels, is a major risk factor for cardiovascular disease and type 2 diabetes. In the United States, rising rates of metabolic dysfunction have sparked growing interest in how lifestyle choices and pharmaceutical interventions contribute to these outcomes. Although medications such as antihypertensives, lipid-lowering drugs, and insulin-sensitizing agents play a role in disease management, lifestyle factors like diet, physical activity, and smoking status remain central to prevention efforts. Public health leaders and clinicians alike face an important question: How well can current health behaviors and treatment status predict who is already at risk?

This study uses data from the National Health and Nutrition Examination Survey (NHANES) 2017–March 2020 to build predictive models for metabolic syndrome using both lifestyle-related variables and medication use. Unlike longitudinal