ADS Capstone Chronicles Revised
4.1.2.2 Individual Food Data

The individual food dataset was similarly assessed for missing values and inconsistencies. Missing data were predominantly observed in the “brand” column and several nutritional columns, including calories, carbohydrates, fiber, sugars, fats, and proteins. The “brand” column had the most missing values (209), while the “fats” column had the fewest (9). For nutritional features, missing values were imputed using the median within each food category, as the right-skewed distributions shown in Figures 6 and 7 made the median a more appropriate measure than the mean. This category-based approach leveraged the similarity of items within the same group to provide more contextually relevant imputations. Any values still missing after this step were filled with the global median of the corresponding column. Missing values in the “brand” column were imputed with “Unknown” to maintain consistency.

To address outliers, threshold-based filtering was applied to key columns such as calories, carbohydrates, and fiber. Values in the top 1% of each feature were capped to preserve the overall dataset structure while minimizing the impact of extreme values, so that the data remained representative without being skewed by outliers.

After the scoring process was completed and the scores were added to the dataset, duplicate foods were removed to further refine the data. This was achieved by grouping the records by the “food_name” column and aggregating the values for each group. Numeric columns, such as calories and proteins, were averaged, while non-numeric columns retained the first value within each group. During this process, the “brand” column was excluded to generalize the
dataset and focus on individual food items rather than specific brands. This step ensured that each unique food name was represented by a single, aggregated record, improving the dataset's clarity and usability.

4.2 Data Acquisition and Aggregation - Patient Data

Patient data were sourced and simulated to create comprehensive records representative of diabetic patients. Simulated diabetic patient demographics and health data were obtained from Kaggle Inc., and simulated real-time glucose values were obtained through access to a Dexcom API. The Kaggle Inc. patient data were ingested into Python with the pandas read_csv function. In contrast, the simulated glucose values were extracted via the Dexcom API and loaded with the pandas package. To construct a unified dataset where each record represents a diabetic patient’s vital statistics, general health information, and glucose measurements, the two datasets were merged using the pd.concat function. Glucose values were repeated across patient records to align with their demographic and health information, resulting in a dataset of 17,118 simulated patient records.

4.2.1 Exploratory Data Analysis - Patient Data

Exploratory data analysis of the patient records begins with understanding the structure of the simulated data. Table 1 lists all features in the original dataframe, along with their data types and a brief description of each. The data consist of numerous binary, ordinal, and numeric features. The structure is further analyzed with the pandas .isnull method, which confirms that there are no missing values within the 17,118 records
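The merge and missing-value check described in Sections 4.2 and 4.2.1 can be illustrated with a minimal sketch. The column names, the toy values, and the exact way the glucose readings are repeated are assumptions made for illustration; the source states only that pd.concat was used and that glucose values were repeated across patient records.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle patient table
# (in practice this would come from pd.read_csv on the downloaded file).
patients = pd.DataFrame({
    "age": [54, 61, 47, 70, 58],
    "bmi": [31.2, 28.4, 35.0, 26.1, 29.9],
    "high_bp": [1, 0, 1, 1, 0],
})

# Hypothetical stand-in for the simulated glucose readings.
glucose = pd.DataFrame({"glucose_mg_dl": [110, 145, 98]})

# Assumption: the readings are tiled until there is one value per patient row.
repeats = int(np.ceil(len(patients) / len(glucose)))
glucose_repeated = (
    pd.concat([glucose] * repeats, ignore_index=True)
    .iloc[: len(patients)]
    .reset_index(drop=True)
)

# Column-wise concatenation aligns rows on the (reset) integer index.
merged = pd.concat([patients.reset_index(drop=True), glucose_repeated], axis=1)

# Count missing values per column; all zeros means no gaps.
missing_per_column = merged.isnull().sum()
print(merged.shape)
print(missing_per_column)
```

Resetting the index on both frames before the column-wise concat matters: pd.concat aligns on index labels, so mismatched indexes would introduce missing values instead of pairing the rows one to one.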
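For reference, the food-data cleaning steps from Section 4.1.2.2 (category-median imputation with a global-median fallback, capping of extreme values, and duplicate removal by food name) can be sketched as follows. The toy data, the "category" column name, and the reading of "capping the top 1%" as clipping at the 99th percentile are assumptions, not details confirmed by the source.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only "food_name" and "brand" are column names from the text.
foods = pd.DataFrame({
    "food_name": ["apple", "apple", "banana", "cola", "cola", "lemonade"],
    "category": ["fruit", "fruit", "fruit", "beverage", "beverage", "beverage"],
    "brand": ["A", None, None, "B", "C", None],
    "calories": [52.0, np.nan, 89.0, 140.0, 150.0, np.nan],
    "proteins": [0.3, 0.3, 1.1, 0.0, 0.0, 0.1],
})
nutrient_cols = ["calories", "proteins"]

# 1) Impute with the median of the food's category, then fall back to the global median.
for col in nutrient_cols:
    category_median = foods.groupby("category")[col].transform("median")
    foods[col] = foods[col].fillna(category_median).fillna(foods[col].median())

# Missing brands are labeled rather than estimated.
foods["brand"] = foods["brand"].fillna("Unknown")

# 2) Cap extreme values (assumed here to mean clipping at the 99th percentile).
for col in nutrient_cols:
    foods[col] = foods[col].clip(upper=foods[col].quantile(0.99))

# 3) Drop "brand", then collapse duplicates: numeric columns are averaged,
#    non-numeric columns keep the first value in each group.
agg_rules = {"category": "first", "calories": "mean", "proteins": "mean"}
deduped = (
    foods.drop(columns="brand")
    .groupby("food_name", as_index=False)
    .agg(agg_rules)
)
print(deduped)
```

Using transform("median") returns a series aligned with the original rows, which lets the category medians be passed directly to fillna without a separate merge step.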