M.S. Applied Data Science - Capstone Chronicles 2025


top predictors included variables like reason_word_count, month_sin, and has_listeria. To better understand the relationship between specific features and recall severity, a correlation analysis was conducted using a subset of selected variables. These features included pathogen indicators such as the presence of Listeria, Salmonella, and E. coli; allergen indicators like peanuts, nuts, shellfish, fish, milk, egg, wheat, and soy; and several manufacturing-related issues including mislabeling, foreign material contamination, and overall quality concerns. Other variables included risk factors such as possible illness or injury, different product types (devices, drugs, food/cosmetics, and veterinary products), distribution scope (nationwide, regional, or limited), and the word count of the reason for recall. After encoding the target variable, a correlation matrix was generated and visualized to identify patterns across these selected features. As shown in Figure 9, the heatmap highlights both positive and negative associations, with stronger correlations appearing in darker shades. Additionally, the features were ranked by their absolute correlation with the encoded event classification to identify the most relevant predictors. Figure 10 displays a bar plot of these sorted correlations, making it easier to see which features are most strongly associated with the severity of the recall. This correlation analysis serves as a foundation for further model development and feature selection.
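The ranking step described above can be illustrated with a minimal sketch. It is not the project's actual code: the ordinal encoding in `SEVERITY_CODES`, the column name `has_mislabeling`, the helper names, and the toy values are all assumptions made for illustration, and Pearson correlation is assumed because the text does not name the coefficient used.

```python
import math

# Hypothetical ordinal encoding of the event classification (assumption:
# Class I is the most severe recall class, so it gets the largest code).
SEVERITY_CODES = {"Class III": 0, "Class II": 1, "Class I": 2}


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_by_abs_correlation(features, labels):
    """Return (feature, r) pairs sorted by |r| against the encoded target."""
    target = [SEVERITY_CODES[label] for label in labels]
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy data for illustration only; these are not values from the recall dataset.
features = {
    "has_listeria": [1, 1, 0, 0, 0, 1],
    "has_mislabeling": [0, 0, 1, 1, 0, 0],
    "reason_word_count": [30, 25, 12, 10, 14, 28],
}
labels = ["Class I", "Class I", "Class II", "Class III", "Class II", "Class I"]

ranking = rank_by_abs_correlation(features, labels)
for name, r in ranking:
    print(f"{name}: {r:+.3f}")
```

Sorting on the absolute value keeps strongly negative associations near the top of the ranking, while the signed coefficient is retained so the direction of each relationship can still be read from a bar plot.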
