ADS Capstone Chronicles Revised
Table 2. Point-Biserial Correlation: Binary Features vs. Glucose Value
4.2.2 Data Quality - Patient Data

Overall, the patient data frame is clean and provides the necessary health data for the simulated patients. It contains no missing values and no outliers that would heavily skew the results when the data are input into the food modeling system. Ordinal features are scaled so that they are represented consistently across their designated scales. As discussed previously, all binary features must be oversampled to reduce the disparity between majority and minority classes. Figure 12 presents the binary features after resampling is completed. The resampling process is intended to give each group a more equal opportunity to influence the model outcomes, thus promoting fairness and mitigating the risk of biased predictions. Some disparities remain; however, none is as drastic as in the raw data. Lastly, numeric variables are standardized with a StandardScaler, which rescales each variable to zero mean and unit variance, to create a cohesive data frame in which every feature is either standardized or on a predefined ordinal scale.

Because patient data are inherently sensitive, privacy and confidentiality were primary considerations in the data processing and analysis stages. Although the data used in this study are simulated, they were treated as if they were real patient data, following the principles of the Health Insurance Portability and Accountability Act (HIPAA) in the U.S. To ensure compliance, any identifiable information was either anonymized or simulated, patient IDs were removed, and all data processing was done in a secure environment.
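
The resampling and standardization steps described in this section can be illustrated with a minimal sketch. It is not the pipeline used in the study: it assumes simple random oversampling of one binary feature at a time, and the record layout, the column names (`has_condition`, `age`), and the helper functions are hypothetical. The `standardize` helper applies the same per-column transformation that scikit-learn's StandardScaler performs (mean removal and division by the population standard deviation), written here with the standard library only so the example is self-contained.

```python
import random
from statistics import fmean, pstdev


def oversample_minority(rows, feature, seed=0):
    """Randomly duplicate minority-class rows of a binary feature
    until both classes have the same number of rows."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[feature] == 1]
    negatives = [r for r in rows if r[feature] == 0]
    minority, majority = sorted((positives, negatives), key=len)
    if not minority:
        return list(rows)  # no minority rows to draw from
    # Sample with replacement to fill the gap between the two classes.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(rows) + extra


def standardize(values):
    """Rescale values to zero mean and unit variance
    (population standard deviation, as StandardScaler does)."""
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - mean) / std for v in values]


# Hypothetical toy records; the names and values are illustrative only.
patients = [
    {"has_condition": 1, "age": 71},
    {"has_condition": 0, "age": 34},
    {"has_condition": 0, "age": 45},
    {"has_condition": 0, "age": 52},
    {"has_condition": 0, "age": 60},
]

balanced = oversample_minority(patients, "has_condition")
scaled_age = standardize([p["age"] for p in balanced])
```

Balancing one binary feature in this way does not, in general, balance the other binary features at the same time. Standardized values are centered on zero and are not confined to a zero-to-one range.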