M.S. Applied Data Science - Capstone Chronicles 2025
4.2 Data Quality

A key challenge when working with NHANES data is the presence of missing values, often resulting from participant nonresponse or age-based ineligibility for specific assessments (e.g., fasting bloodwork). Missingness was systematically evaluated across all variables prior to modeling. Because many biomedical variables were later used to engineer derived features (such as obesity flags, average blood pressure measures, and metabolic syndrome risk indicators), retaining incomplete cases would have introduced gaps into multiple dependent features at once. To ensure analytical consistency and avoid compounding missingness through feature engineering, a complete-case strategy was implemented: only participants with complete data for all selected variables were retained. Variables with high levels of missingness or limited relevance to the research objectives were excluded. This approach preserved data integrity, maintained internal consistency among engineered features, reduced potential bias, and supported reliable model training and evaluation.

4.2.1 Class Imbalance Handling

In the cleaned dataset, the class representing individuals without metabolic syndrome was smaller than the class representing those with the condition. To address this imbalance and improve the model’s ability to detect the minority class, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to the training set only prior to model fitting. SMOTE generates synthetic samples by interpolating between existing minority class instances, thereby increasing class representation without simply duplicating observations. This approach helps prevent the model from being biased toward the majority class and supports more balanced performance metrics, particularly recall for the underrepresented group.

4.3 Feature Engineering

Feature engineering was performed to create clinically relevant variables and behavioral indicators from the NHANES data. For metabolic syndrome, binary flags were generated for each diagnostic component using standard clinical thresholds. Central adiposity was defined using gender-specific waist circumference cutoffs (greater than 102 cm for men and 88 cm for women). Blood pressure was calculated as the mean of up to three systolic and diastolic measurements, with a high blood pressure flag assigned to individuals with systolic values of at least 130 mmHg or diastolic values of at least 85 mmHg. Fasting glucose levels of 100 mg/dL or higher were classified as elevated, while triglyceride levels of 150 mg/dL or greater indicated hypertriglyceridemia. HDL cholesterol was considered low if it fell below 40 mg/dL for men or 50 mg/dL for women. In addition, the HOMA-IR index was calculated to estimate insulin resistance, with values exceeding 2.5 flagged as elevated. A composite binary target variable, has_metabolic_syndrome, was assigned to participants meeting three or more of these criteria, consistent with established clinical guidelines. The number of criteria met was also summed as met_syndrome_count for descriptive analysis.

Behavioral and lifestyle indicators were derived to capture dietary patterns, physical activity, and other health-related behaviors. Average daily intake for calories, sugar, fat, and fiber was computed by taking the mean of the two 24-hour dietary recall days, reducing the effect of day-to-day variability. Changes in intake between the two recall days were calculated to detect
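The component flags and composite target described in Section 4.3 can be illustrated with a minimal sketch. It is not the project's actual code: the `Participant` fields and function names are hypothetical (the NHANES variable codes are not given in the text), and the example participant is invented. The sketch also assumes that only the five standard components count toward has_metabolic_syndrome, with the HOMA-IR flag kept as a separate feature; the text does not state whether HOMA-IR was included in the count. HOMA-IR is computed with the conventional formula for glucose in mg/dL and insulin in µU/mL.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class Participant:
    # Hypothetical field names, chosen for illustration only.
    sex: str                       # "M" or "F"
    waist_cm: float
    systolic: Sequence[float]      # up to three readings, mmHg
    diastolic: Sequence[float]     # up to three readings, mmHg
    fasting_glucose_mgdl: float
    triglycerides_mgdl: float
    hdl_mgdl: float
    fasting_insulin_uuml: float


def component_flags(p: Participant) -> dict:
    """One boolean per diagnostic component, using the thresholds stated in the text."""
    is_male = p.sex == "M"
    return {
        "central_adiposity": p.waist_cm > (102 if is_male else 88),
        # Flag is based on the mean of the available readings.
        "high_bp": mean(p.systolic) >= 130 or mean(p.diastolic) >= 85,
        "elevated_glucose": p.fasting_glucose_mgdl >= 100,
        "high_triglycerides": p.triglycerides_mgdl >= 150,
        "low_hdl": p.hdl_mgdl < (40 if is_male else 50),
    }


def homa_ir(p: Participant) -> float:
    """HOMA-IR = fasting glucose (mg/dL) * fasting insulin (µU/mL) / 405."""
    return p.fasting_glucose_mgdl * p.fasting_insulin_uuml / 405


def engineer_features(p: Participant) -> dict:
    flags = component_flags(p)
    # Assumption: only the five components above are counted toward the target.
    count = sum(flags.values())
    index = homa_ir(p)
    return {
        **flags,
        "homa_ir": index,
        "elevated_homa_ir": index > 2.5,
        "met_syndrome_count": count,
        "has_metabolic_syndrome": count >= 3,
    }


# Invented example participant, not NHANES data.
example = Participant(
    sex="M",
    waist_cm=105,
    systolic=[128, 132, 136],
    diastolic=[80, 82, 84],
    fasting_glucose_mgdl=95,
    triglycerides_mgdl=160,
    hdl_mgdl=45,
    fasting_insulin_uuml=10,
)
row = engineer_features(example)
```

For this invented participant, three components are flagged (central adiposity, high blood pressure, and high triglycerides), so `row["has_metabolic_syndrome"]` is `True` while the HOMA-IR flag is not raised. If HOMA-IR was in fact counted as a criterion in the study, the `count` line would need to include it.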