M.S. Applied Data Science - Capstone Chronicles 2025
reductions in energy and sugar consumption. These change scores were used, along with BMI status and self-reported intent to lose weight, to identify participants who were likely dieting. Additional variables reflected eating-out frequency, with respondents classified as either eating out often (every day or most days) or rarely (once or twice per week or never). Physical activity was represented both as a binary indicator for engaging in any of four queried activities and as an activity score reflecting the total number of activities reported. These components were aggregated into a continuous lifestyle_effort score ranging from 0 to 5, representing the sum of dieting behavior, reduced caloric intake, reduced sugar intake, rare eating-out habits, and reported physical activity.

Prescription medication use was processed into a categorical variable, med_class, by matching drug names from the RXDDRUG variable against a curated list of antihypertensives, lipid-lowering agents, and glucose-regulating drugs. The resulting categories were: no medication, medications related to metabolic syndrome, unrelated medications, and both metabolic syndrome–related and other medications. This categorization was intended to capture both the type of treatment and polypharmacy patterns.

Finally, socioeconomic status was represented by income_level, derived from the income-to-poverty ratio (PIR). This variable was discretized into low income (PIR ≤ 1.3), middle income (1.3 < PIR ≤ 3.5), and high income (PIR > 3.5).

Several binary indicators were also converted to numeric form for ease of modeling and plotting, such as transforming the likely dieting variable into a 0/1 numeric format. Together, these engineered features provided a comprehensive set of predictors spanning clinical, behavioral, and socioeconomic domains for subsequent modeling.

4.4 Modeling

To evaluate the predictive value of lifestyle and behavioral indicators in identifying individuals with metabolic syndrome, a structured modeling pipeline was implemented. This pipeline incorporated multiple stages: numerical encoding of categorical variables, variance filtering, feature scaling, class balancing with SMOTE, model training, hyperparameter tuning, and validation. This ensured consistency, reproducibility, and fairness in performance comparisons across all models.

The final feature sets used for modeling were derived directly from the engineered variables described earlier. Model A (Lifestyle + Medications) included lifestyle and behavioral variables, income level, and medication class. Model B (Lifestyle Only) excluded medication class, relying solely on lifestyle, behavioral, and socioeconomic predictors.

● Model A Predictors: avg_kcal, avg_sugar, avg_fat, avg_fiber, reduced_calories, reduced_sugar, likely_dieting, eats_out_often, eats_out_rarely, physically_active, lifestyle_effort, income_level, plus one-hot encoded indicators for med_class.
● Model B Predictors: same as above, excluding med_class.

Before training, categorical variables were numerically encoded: income_level was ordinally mapped to integer values (0 = Low Income, 1 = Middle Income, 2 = High Income), and med_class was one-hot encoded with the first category dropped to avoid multicollinearity.
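The encoding step can be illustrated with a minimal, dependency-free Python sketch. It is not the study's actual code. The PIR cut points and the 0/1/2 ordinal mapping come from the text, while the med_class label strings, the function names, and the handling of missing PIR values are assumptions made for illustration only.

```python
import math

# Ordinal mapping stated in the text: 0 = Low, 1 = Middle, 2 = High.
INCOME_ORDER = {"Low Income": 0, "Middle Income": 1, "High Income": 2}

# Hypothetical label strings for the four medication categories.
# The first entry acts as the dropped reference category.
MED_CLASSES = ("none", "metabolic", "unrelated", "both")


def income_level_from_pir(pir):
    """Discretize the income-to-poverty ratio with the cut points from the text."""
    if pir is None or math.isnan(pir):
        # Assumption: the source does not say how missing PIR was handled.
        raise ValueError("PIR is missing")
    if pir <= 1.3:
        return "Low Income"
    if pir <= 3.5:
        return "Middle Income"
    return "High Income"


def encode_row(pir, med_class):
    """Return the numeric income_level and med_class indicators for one participant."""
    if med_class not in MED_CLASSES:
        raise ValueError(f"unknown med_class: {med_class!r}")
    features = {"income_level": INCOME_ORDER[income_level_from_pir(pir)]}
    # One-hot encode med_class, skipping the first category so that the
    # indicator columns are not perfectly collinear.
    for category in MED_CLASSES[1:]:
        features[f"med_class_{category}"] = int(med_class == category)
    return features


if __name__ == "__main__":
    print(encode_row(2.0, "both"))
```

With the first category dropped, a participant on no medication is represented by zeros in all three indicator columns. In a pandas workflow, `pandas.get_dummies(..., drop_first=True)` would give the same drop-first behavior.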