M.S. Applied Data Science - Capstone Chronicles 2025
three or more conditions were labeled as having metabolic syndrome.
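The labeling rule stated above (three or more conditions present) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual code: the flag column names and the helper `label_metabolic_syndrome` are hypothetical, and each flag is assumed to have already been derived as a boolean from its threshold. The `flag_trig` column is included only because ATP III lists a triglyceride criterion.

```python
import pandas as pd

# Hypothetical names for the binary risk-factor flags; the study's actual
# column names are not given in the text.
FLAG_COLUMNS = ["flag_waist", "flag_bp", "flag_glucose", "flag_hdl", "flag_trig"]


def label_metabolic_syndrome(df, flag_columns=FLAG_COLUMNS, min_conditions=3):
    """Return a 0/1 Series: 1 where at least `min_conditions` flags are set."""
    n_conditions = df[flag_columns].astype(int).sum(axis=1)
    return (n_conditions >= min_conditions).astype(int)


# Toy rows for illustration only (not study data).
toy = pd.DataFrame(
    {
        "flag_waist": [True, True, False],
        "flag_bp": [True, False, False],
        "flag_glucose": [True, True, False],
        "flag_hdl": [False, False, True],
        "flag_trig": [False, False, False],
    }
)
toy["metabolic_syndrome"] = label_metabolic_syndrome(toy)
```

Only the first toy row has three flags set, so it is the only one labeled positive. Rows with missing flag values would need explicit handling before the sum, which this sketch does not attempt.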
4.1.1 Exploratory Data Analysis

Exploratory Data Analysis (EDA) was conducted to better understand the relationships and distributions among the key features used to predict metabolic syndrome. The cleaned dataset included lifestyle, behavioral, clinical, and medication-related variables. Several variables were engineered to reflect clinically relevant risk factors, including obesity flags based on waist circumference, average systolic and diastolic blood pressure, fasting glucose, and HDL cholesterol thresholds. These indicators were selected based on criteria aligned with the National Cholesterol Education Program's ATP III guidelines for diagnosing metabolic syndrome.

For the purposes of this analysis, the EDA focused primarily on non-clinical and non-biomedical variables that might serve as proxies for lifestyle and socioeconomic status. The numerical variables examined in detail were:
● avg_kcal (average daily caloric intake)
● avg_sugar (average daily sugar intake)
● avg_fat (average daily fat intake)
● avg_fiber (average daily fiber intake)
● lifestyle_effort (composite score of self-reported health behaviors)
● income_level (ordinally encoded socioeconomic indicator)

Box plots were generated to compare the distributions of the four dietary intake variables (avg_kcal, avg_sugar, avg_fat, avg_fiber) between individuals with and without metabolic syndrome (Figure 1). The working hypothesis was that individuals with metabolic syndrome would show noticeably different distributions (for example, higher caloric and sugar intake and lower fiber intake) than those without. Contrary to this hypothesis, the box plots revealed very similar distributions between the two groups across all four variables. This suggests that, when considered individually, these intake measures may not strongly differentiate metabolic syndrome status. While not definitive on its own, this pattern raises the possibility that models built on lifestyle and behavioral features alone may be less competitive with medication-inclusive models than initially expected, a question further addressed in the modeling phase.

However, this does not necessarily imply that these features lack predictive power. Univariate visual comparisons can mask subtle but important multivariate patterns. These features may interact in non-linear ways or gain relevance when modeled alongside other lifestyle and clinical variables, an effect often captured by machine learning models but not easily visible in simple group comparisons.

To further examine relationships among the lifestyle variables, a Pearson correlation matrix was computed (Figure 2). Several moderate to strong positive correlations were observed, such as:
● avg_fat and avg_kcal (0.89)
● avg_sugar and avg_kcal (0.68)
● avg_fiber and avg_kcal (0.56)
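A correlation screen like the one listed above might be computed along the following lines. This is a hedged sketch rather than the study's code: the helper `strong_pairs` and the 0.5 cutoff are assumptions (the cutoff is chosen only because the smallest coefficient reported above is 0.56), and the data frame is synthetic, so its coefficients will not match those in Figure 2.

```python
import numpy as np
import pandas as pd

# Dietary variable names as given in the text.
DIETARY_COLUMNS = ["avg_kcal", "avg_sugar", "avg_fat", "avg_fiber"]


def strong_pairs(df, columns=DIETARY_COLUMNS, threshold=0.5):
    """List (col_a, col_b, r) for pairs with |Pearson r| >= threshold."""
    corr = df[columns].corr(method="pearson")
    pairs = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:  # upper triangle only, skip self-pairs
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 2)))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Synthetic data for illustration only: fat intake is built to track
# calories, while sugar and fiber are generated independently.
rng = np.random.default_rng(0)
kcal = rng.normal(2000, 400, size=500)
demo = pd.DataFrame(
    {
        "avg_kcal": kcal,
        "avg_fat": kcal * 0.04 + rng.normal(0, 8, size=500),
        "avg_sugar": rng.normal(100, 30, size=500),
        "avg_fiber": rng.normal(16, 5, size=500),
    }
)
pairs = strong_pairs(demo)
```

On the synthetic frame, only the avg_kcal and avg_fat pair clears the cutoff, because that is the only dependence built into the toy data.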
These results indicate that average caloric intake may function as a central variable that captures much of the variance in fat intake (r = 0.89 corresponds to roughly 79% shared variance) and a smaller share of the variance in sugar and fiber intake (roughly 46% and 31%, respectively). Given the study's primary goal of comparing predictive performance between models with and without medication data, all variables were retained to preserve potential