M.S. Applied Data Science - Capstone Chronicles 2025
study and highlights the key variables selected from each.

Table 1
NHANES Modules and Key Variables Used in This Study

| Module | File Name | Key Variables / Description |
|---|---|---|
| Demographics | P_DEMO.xpt | Age, sex, race/ethnicity, education, income |
| Dietary Intake | P_DR1TOT.xpt, P_DR2TOT.xpt | Day 1 and Day 2 nutrient consumption |
| Blood Pressure | P_BPXO.xpt | Systolic and diastolic blood pressure readings |
| Body Measures | P_BMX.xpt | Height, weight, BMI, waist circumference |
| Lab Results | P_GLU.xpt, P_INS.xpt, P_HDL.xpt, P_TRIGLY.xpt, P_HSCRP.xpt | Glucose, insulin, HDL, triglycerides, inflammation markers |
| Physical Activity | P_PAQ.xpt | Survey-based movement and activity levels |
| Prescription Drug Use | P_RXQ_RX.xpt | Medication usage, drug type/class |
| Health Insurance and Survey Questionnaires | P_INQ.xpt, P_HIQ.xpt, P_DIQ.xpt, P_DBQ.xpt, P_BPQ.xpt | Behavioral indicators, healthcare access, comorbid conditions |
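
The files in Table 1 are SAS transport (.xpt) files that share the respondent sequence number SEQN. The sketch below shows one way such modules might be combined with pandas. The helper name `merge_modules`, the toy values, and the left join are assumptions made for illustration, since the study does not state the join type it used. The column names follow NHANES conventions (RIDAGEYR for age, BMXWAIST for waist circumference, LBDHDD for HDL cholesterol).

```python
from functools import reduce

import pandas as pd


def merge_modules(frames, key="SEQN", how="left"):
    """Join module DataFrames one after another on the respondent identifier.

    Hypothetical helper; the join type is an assumption, not the study's
    documented choice.
    """
    return reduce(lambda left, right: left.merge(right, on=key, how=how), frames)


# With the real files, each frame would be read with something like
# pd.read_sas("P_DEMO.xpt", format="xport"). Toy frames stand in here.
demo = pd.DataFrame({"SEQN": [1, 2, 3], "RIDAGEYR": [34, 51, 67]})
bmx = pd.DataFrame({"SEQN": [1, 2, 3], "BMXWAIST": [88.0, 104.5, None]})
hdl = pd.DataFrame({"SEQN": [1, 3], "LBDHDD": [55.0, 38.0]})

# A left join keeps every respondent in the first (demographics) frame and
# leaves NaN where a later module has no record for that SEQN.
merged = merge_modules([demo, bmx, hdl])
```

Placing the demographics frame first keeps one row per respondent, so gaps in the other modules appear as missing values that can be handled in a later step.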
Pregnant individuals were excluded using the variable RIDEXPRG from the demographics module. After merging all relevant modules on SEQN, the initial dataset consisted of 12,143 records and 65 selected variables. To ensure data quality and avoid the use of imputation, records with missing values in any key feature were removed. This yielded a final analytic dataset of 6,160 complete cases suitable for further analysis.

Participants were limited to U.S. adults aged 20 and older. The final dataset was demographically diverse, reflecting representation across gender, ethnicity, education, and income levels, which supports the generalizability of model findings.

Metabolic syndrome status was derived using established clinical guidelines requiring the presence of at least three of five criteria: elevated waist circumference, high blood pressure, elevated fasting glucose, elevated triglycerides, and low HDL cholesterol. A binary classification target was created, where individuals meeting