M.S. Applied Data Science - Capstone Chronicles 2025
“classification day of week.” The column names were reviewed for clarity, and the first five records were displayed to observe sample data points and their characteristics.

A key component of EDA involved inspecting the data types of each column to determine necessary type conversions. The dataset contains a mix of categorical, numerical, and datetime fields. The “center classification date” was appropriately recognized as a datetime64 object, while other categorical variables, such as “product type” and “event classification,” were maintained in their respective string formats. Additionally, a missing value analysis was conducted, revealing only one missing value in the “distribution pattern” column. Given the low occurrence of missing data, appropriate techniques, such as imputation or omission, were considered based on the analysis requirements.

The distribution of the target variable, “event classification,” was examined to understand the prevalence of different recall event types. The dataset contained three distinct classes:
● Class I (21.15%) represents the most serious type of recall, indicating products that could cause severe health consequences.
● Class II (70.81%) is a moderate-level recall where exposure to the product may lead to temporary or medically reversible health effects.
● Class III (8.04%) is the least severe classification, involving products unlikely to cause adverse health effects.
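The inspection steps described above (data types, missing values, and class shares of the target) can be sketched in pandas. This is a minimal illustration, not the authors' code: the five toy records are invented, and the column names are taken from the text and assumed to match the dataset's actual headers.

```python
import pandas as pd

# Hypothetical toy records; column names follow the text, values are made up.
df = pd.DataFrame({
    "product type": ["Devices", "Food/Cosmetics", "Drugs", "Devices", "Biologics"],
    "event classification": ["Class II", "Class I", "Class II", "Class II", "Class III"],
    "distribution pattern": ["Nationwide", None, "CA, NV", "Nationwide", "TX"],
    "center classification date": [
        "2024-01-15", "2024-02-03", "2024-02-20", "2024-03-11", "2024-04-02",
    ],
})

# Parse the date column so it is stored as a datetime64 dtype.
df["center classification date"] = pd.to_datetime(df["center classification date"])

# Data type of each column, used to decide which conversions are needed.
dtypes = df.dtypes

# Number of missing values per column.
missing_counts = df.isna().sum()

# Share of each recall class in percent, rounded to two decimals.
class_share = (
    df["event classification"].value_counts(normalize=True) * 100
).round(2)

print(df.head())
print(dtypes)
print(missing_counts)
print(class_share)
```

On the real dataset, the same `value_counts(normalize=True)` call would be expected to yield the Class I, II, and III percentages reported above.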
Figure 1 Distribution of Event Classification
Note. This figure shows the relative frequency of Class I, II, and III recalls.

It is evident that Class II recalls predominate in the dataset, suggesting that moderate-level recalls warrant particular attention. However, the significance of Class I and Class III recalls should not be overlooked.

Subsequent analysis focused on independent variables that may influence the target variable. A key variable, Product Type, displayed distinct patterns across different recall event classifications and appeared to influence the type and severity of recalls. Figure 2 shows the frequency distribution of product types, indicating that devices represented the most frequently recalled category (37.3%), followed by food/cosmetics (28.9%), drugs (17.6%), biologics (12.7%), veterinary products (3.6%), and tobacco (0.09%).
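One way to examine how product type relates to recall class is a frequency table alongside a row-normalized cross-tabulation. The sketch below uses six invented records and assumes the column names “product type” and “event classification” from the text. It illustrates the general technique and is not necessarily the analysis the authors ran.

```python
import pandas as pd

# Hypothetical toy records; values are invented for illustration only.
df = pd.DataFrame({
    "product type": [
        "Devices", "Devices", "Devices",
        "Food/Cosmetics", "Food/Cosmetics", "Drugs",
    ],
    "event classification": [
        "Class II", "Class II", "Class I",
        "Class I", "Class II", "Class III",
    ],
})

# Share of recalls per product type, in percent.
product_share = (df["product type"].value_counts(normalize=True) * 100).round(1)

# Within each product type, the percentage of recalls falling in each class.
class_by_product = (
    pd.crosstab(
        df["product type"],
        df["event classification"],
        normalize="index",
    )
    * 100
).round(1)

print(product_share)
print(class_by_product)
```

Normalizing by row (`normalize="index"`) makes product types with very different recall counts comparable, since each row then sums to roughly 100%.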