M.S. Applied Data Science - Capstone Chronicles 2025


study and highlights the key variables selected from each.

Table 1. NHANES Modules and Key Variables Used in This Study

| Module | File Name | Key Variables / Description |
|---|---|---|
| Demographics | P_DEMO.xpt | Age, sex, race/ethnicity, education, income |
| Dietary Intake | P_DR1TOT.xpt, P_DR2TOT.xpt | Day 1 and Day 2 nutrient consumption |
| Blood Pressure | P_BPXO.xpt | Systolic and diastolic blood pressure readings |
| Body Measures | P_BMX.xpt | Height, weight, BMI, waist circumference |
| Lab Results | P_GLU.xpt, P_INS.xpt, P_HDL.xpt, P_TRIGLY.xpt, P_HSCRP.xpt | Glucose, insulin, HDL, triglycerides, inflammation markers |
| Physical Activity | P_PAQ.xpt | Survey-based movement and activity levels |
| Prescription Drug Use | P_RXQ_RX.xpt | Medication usage, drug type/class |
| Health Insurance and Survey Questionnaires | P_INQ.xpt, P_HIQ.xpt, P_DIQ.xpt, P_DBQ.xpt, P_BPQ.xpt | Behavioral indicators, healthcare access, comorbid conditions |
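As a rough illustration of how modules like those in Table 1 might be combined, the sketch below joins per-module records on the NHANES respondent identifier SEQN, drops pregnant respondents and those under 20, and keeps only complete cases. It is not the study's actual code. It is written in plain Python so it runs without dependencies; in practice the `.xpt` files would more likely be read with `pandas.read_sas(path, format="xport")` and joined with `DataFrame.merge`. The left join onto the demographics rows, the coding `RIDEXPRG == 1` for pregnancy, the age variable `RIDAGEYR`, the example features `BMXWAIST` and `LBDHDD`, and all data values are assumptions made for the example.

```python
import math
from typing import Dict, List, Optional, Sequence

Record = Dict[str, Optional[float]]


def is_missing(value: Optional[float]) -> bool:
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def merge_on_seqn(base: List[Record], *modules: List[Record]) -> List[Record]:
    """Left-join each module onto the base rows by SEQN (join type is assumed)."""
    merged = {row["SEQN"]: dict(row) for row in base}
    for module in modules:
        for row in module:
            target = merged.get(row["SEQN"])
            if target is not None:  # skip respondents absent from the base module
                target.update(row)
    return list(merged.values())


def complete_cases(rows: List[Record], key_features: Sequence[str],
                   min_age: int = 20) -> List[Record]:
    """Apply the exclusions, then keep rows with no missing key feature."""
    kept = []
    for row in rows:
        if row.get("RIDEXPRG") == 1:  # assumed coding: 1 = pregnant
            continue
        age = row.get("RIDAGEYR")  # assumed age-in-years variable
        if is_missing(age) or age < min_age:
            continue
        if any(is_missing(row.get(name)) for name in key_features):
            continue
        kept.append(row)
    return kept


# Tiny made-up tables standing in for P_DEMO, P_BMX and P_HDL.
demo = [
    {"SEQN": 1, "RIDAGEYR": 45, "RIDEXPRG": None},
    {"SEQN": 2, "RIDAGEYR": 30, "RIDEXPRG": 1},
    {"SEQN": 3, "RIDAGEYR": 17, "RIDEXPRG": None},
    {"SEQN": 4, "RIDAGEYR": 60, "RIDEXPRG": 2},
]
bmx = [
    {"SEQN": 1, "BMXWAIST": 98.0},
    {"SEQN": 2, "BMXWAIST": 80.0},
    {"SEQN": 3, "BMXWAIST": 75.0},
    {"SEQN": 4, "BMXWAIST": float("nan")},
]
hdl = [
    {"SEQN": 1, "LBDHDD": 52.0},
    {"SEQN": 4, "LBDHDD": 61.0},
    {"SEQN": 5, "LBDHDD": 48.0},
]

merged = merge_on_seqn(demo, bmx, hdl)
analytic = complete_cases(merged, key_features=["BMXWAIST", "LBDHDD"])
print(len(merged), "merged rows;", len(analytic), "complete cases")
```

Dropping any row with a missing key feature is the simplest way to avoid imputation, but it can discard a large share of records, so the complete-case count is worth checking against the merged count at each step.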

Pregnant individuals were excluded using the variable RIDEXPRG from the demographics module. After all relevant modules were merged on SEQN, the initial dataset consisted of 12,143 records and 65 selected variables. To ensure data quality without resorting to imputation, records with a missing value in any key feature were removed, yielding a final analytic dataset of 6,160 complete cases suitable for further analysis. Participants were limited to U.S. adults aged 20 and older. The final dataset was demographically diverse, with representation across gender, ethnicity, education, and income levels, which supports the generalizability of model findings. Metabolic syndrome status was derived using established clinical guidelines that require at least three of five criteria: elevated waist circumference, high blood pressure, elevated fasting glucose, elevated triglycerides, and low HDL cholesterol. A binary classification target was created, where individuals meeting
