M.S. Applied Data Science - Capstone Chronicles 2025

Predicting Metabolic Syndrome Risk: The Role of Lifestyle and Medication in NHANES Data

Patricio Martinez
Applied Data Science Master’s Program
Shiley Marcos School of Engineering / University of San Diego
patriciomartinez@sandiego.edu

ABSTRACT

This study examines the predictive value of lifestyle, behavioral, and socioeconomic indicators in identifying individuals at risk for metabolic syndrome, using data from the 2017–March 2020 pre-pandemic National Health and Nutrition Examination Survey (NHANES). Two modeling approaches were compared: Model A, which incorporated medication variables alongside non-clinical predictors, and Model B, which relied solely on lifestyle, behavioral, and socioeconomic features. Five supervised machine learning algorithms (logistic regression, random forest, support vector machine (SVM), XGBoost, and multilayer perceptron (MLP)) were trained using a consistent pipeline with numerical encoding, variance filtering, scaling, SMOTE oversampling, and stratified cross-validation. Model A achieved its strongest performance with XGBoost (accuracy 0.90, ROC-AUC 0.96), followed closely by random forest and MLP. Model B retained strong predictive power despite excluding medication data, with random forest and MLP each achieving an accuracy of 0.90 and ROC-AUCs of 0.97 and 0.90, respectively. SHAP analysis revealed that dietary intake variables and physical activity were consistently influential in both models, while medication-related features ranked highest only in Model A. These results suggest that while clinical indicators enhance predictive accuracy, lifestyle-based models remain effective tools for early risk detection and population health screening, particularly in settings with limited access to clinical data.

KEYWORDS

Metabolic Syndrome, NHANES, Machine Learning, Random Forest, MLP, Predictive Modeling, Health Analytics

1 Introduction

Metabolic syndrome, a cluster of conditions including high blood pressure, abdominal obesity, elevated glucose, insulin resistance, and abnormal lipid levels, is a major risk factor for cardiovascular disease and type 2 diabetes. In the United States, rising rates of metabolic dysfunction have sparked growing interest in how lifestyle choices and pharmaceutical interventions contribute to these outcomes. Although medications such as antihypertensives, lipid-lowering drugs, and insulin-sensitizing agents play a role in disease management, lifestyle factors like diet, physical activity, and smoking status remain central to prevention efforts. Public health leaders and clinicians alike face an important question: How well can current health behaviors and treatment status predict who is already at risk?

This study uses data from the 2017–March 2020 pre-pandemic cycle of the National Health and Nutrition Examination Survey (NHANES) to build predictive models for metabolic syndrome using both lifestyle-related variables and medication use. Unlike longitudinal

