M.S. Applied Data Science - Capstone Chronicles 2025
medication data in their models, an angle this study specifically addresses.

3.5 Prediction of Metabolic Syndrome with ML Models

Zhou et al. (2023) constructed predictive models for metabolic syndrome using NHANES data and several machine learning algorithms, including LASSO and XGBoost. Their models showed high performance (AUC ≈ 0.91) and validated the use of routine health variables such as triglycerides, BMI, and glucose in predicting chronic disease risk. However, the study treated lifestyle and clinical variables equally without teasing apart their independent predictive contributions. The current project fills this gap by explicitly contrasting models that use lifestyle features alone with those that incorporate medication use.

3.6 Synthesis of Themes and Research Gaps

Across the reviewed literature, a clear consensus emerges: both lifestyle and medication interventions are effective in managing obesity and metabolic dysfunction, and ML methods are effective in modeling these outcomes using NHANES data. However, a key gap is the lack of studies that explicitly compare the predictive strength of lifestyle-only models versus models including medication. Additionally, few studies frame the predictive task as a tool for prevention rather than treatment allocation. Importantly, most prior work has not evaluated whether medication use serves merely as a treatment indicator or also acts as a proxy for prior diagnosis, disease severity, or unobserved clinical risk. By applying interpretable ML models to lifestyle and medication data from NHANES, this project addresses these gaps and contributes to the growing field of preventive health analytics.
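The comparison described above, between a lifestyle-only model and a model that also includes medication use, is typically summarized with a discrimination metric such as the AUC reported by Zhou et al. (2023). The following is a minimal sketch of that comparison, not the project's actual implementation. The function name `roc_auc`, the labels, and both score lists are hypothetical, made-up values used only for illustration; they are not NHANES data or study results.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Return the AUC: the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case.
    Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: 1 = metabolic syndrome, 0 = no metabolic syndrome.
labels = [1, 1, 1, 0, 0, 0, 1, 0]

# Hypothetical predicted risks from a lifestyle-only model.
lifestyle_scores = [0.8, 0.4, 0.6, 0.5, 0.2, 0.3, 0.7, 0.45]

# Hypothetical predicted risks from a lifestyle + medication model.
combined_scores = [0.9, 0.6, 0.7, 0.5, 0.1, 0.3, 0.8, 0.4]

auc_lifestyle = roc_auc(labels, lifestyle_scores)
auc_combined = roc_auc(labels, combined_scores)

# The difference in AUC is one simple way to express the added
# predictive value of the medication features.
auc_gain = auc_combined - auc_lifestyle
print(f"lifestyle-only AUC: {auc_lifestyle:.3f}")
print(f"lifestyle + medication AUC: {auc_combined:.3f}")
print(f"gain from medication features: {auc_gain:.3f}")
```

A gain in AUC on its own would not distinguish medication acting as a treatment indicator from medication acting as a proxy for prior diagnosis or disease severity, which is why the interpretation of such a difference matters as much as its size.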
4 Methodology

This study aimed to evaluate the predictive power of lifestyle and behavioral factors in identifying individuals with metabolic syndrome, and to assess whether incorporating pharmaceutical use would enhance model performance. The methodology followed a structured pipeline: importing NHANES 2017–2020 data, filtering and merging relevant components, cleaning and engineering features, and preparing the dataset for supervised machine learning. Raw .xpt data files were imported into Python using pandas, with each file representing a different NHANES module: demographics, dietary intake, physical activity, body composition, laboratory biomarkers, and prescription drug use. Variables were selected based on clinical relevance to metabolic health, such as caloric intake, fasting glucose and insulin, triglycerides, HDL cholesterol, BMI, waist circumference, blood pressure, and self-reported behaviors.

4.1 Data Acquisition and Aggregation

Data were sourced from the publicly available NHANES 2017–2020 cycle. Eighteen separate datasets were downloaded, each representing a different health-related module. All datasets contained the variable SEQN, a unique respondent identifier, which was used as the key to merge the datasets with an inner join. Feature selection was guided by established clinical guidelines and previous literature on metabolic syndrome. Variables were retained if they were known or hypothesized to contribute to metabolic risk, were available across the full cycle, and had sufficient data completeness. Table 1 summarizes the NHANES modules used in this