M.S. Applied Data Science - Capstone Chronicles 2025
Capstone Chronicles 2025 Selections
MS-Applied Data Science University of San Diego
Image generated with OpenAI's DALL·E, facilitated by ChatGPT.
Dear Reader,
It is our great pleasure to welcome you to the 2025 edition of Capstone Chronicles, our second annual publication showcasing exemplary projects from the MS in Applied Data Science program at the University of San Diego. This volume reflects not only another year of exceptional student achievement but also the continued evolution of a program dedicated to preparing data science professionals for meaningful, responsible, and impactful work.

The University of San Diego’s online Master of Science in Applied Data Science is committed to training current and future leaders in this important and transformative field. Our program places strong emphasis on real-world applications, ethical responsibility, and the pursuit of social good in the design and deployment of data science and AI-enabled systems. Developed by data science experts in close collaboration with key industry and government stakeholders, the Applied Data Science curriculum provides rigorous technical preparation and in-depth, practical training.

Each graduating cohort represented in this edition—Spring, Summer, and Fall 2025—includes approximately 25–30 students. In the Applied Data Science Capstone course, students synthesize the knowledge and skills acquired throughout the master’s program; the capstone serves as a culminating experience, enabling students to apply their theoretical knowledge to a research-driven, code-intensive original project. Students lead an end-to-end data science workflow, including data acquisition, processing, and analysis, while deploying appropriate analytical techniques. The project is documented in an academic journal-style article and presented in a recorded technical presentation.

We hope the Capstone Chronicles serve as both an inspiration and a resource for future students, researchers, and practitioners in data science.
By sharing these exemplary projects, we aim to celebrate our amazing students' accomplishments and their contributions to the broader discourse on applied data science. We extend our sincere appreciation to the students whose hard work and dedication fill these pages, as well as to the faculty mentors who guide them, and the industry partners who help ensure our curriculum remains relevant and forward-looking.
Thank you for joining us in celebrating the achievements of our 2025 graduates. We hope this edition of Capstone Chronicles inspires current and future students, collaborators, and leaders in data science.
Sincerely, The 2025 Capstone Chronicles Editorial Team
Anna Marbut, Ebrahim Tarshizi, & Erin Cooke
This letter was composed with the assistance of OpenAI’s ChatGPT.
Table of Contents

Spring 2025

Early Detection of High-Risk Product Recalls: A Comparative Study of Multiclass Classification Approaches ........ 5
Lorena Dorado, Parisa Kamizi

Summer 2025

From Game Theory to Goal Theory: A Shapley Value Approach to Tactical Intelligence in Elite Soccer ........ 48
Mauricio Espinoza Acevedo, Maria Mora Mora, Gabriel Mancillas Gallardo

Spatial-Temporal and Predictive Modeling of Chemical Contaminant Exceedances in California Public Water Systems ........ 73
Tarane Javaherpour, Davood Aein

A Predictive Model to Strengthen Retention in Government Agencies: Sentiment Factors Driving Employee Exits Using A Predictive Model Approach Segment Risk Level ........ 95
Sophia Jensen, Duy Nguyen

Analyzing YouTube Trends Using Metadata and NLP ........ 113
Jose Guarneros, Tysir Shehadey

Deep Learning Based Plant Identification for Automated Agricultural Weed Control ........ 126
Edgar Rosales, Marinela Inguito, Bobby Marriott

Predicting Metabolic Syndrome Risk: The Role of Lifestyle and Medication in NHANES Data ........ 152
Patricio Martinez

Fall 2025

Mapping Education & Disability Inequities in Poverty Across Illinois Communities ........ 176
Madeline Chang, Matt Ammirati, Gabriel Duffy

Developing a Simplified Early Warning System for Predicting Graduation Outcomes in California Public Schools ........ 190
Jun Clemente, Tanya Ortega, Amayrani Balbuena

Mapping the Market: Uncovering Brand Alliances from Consumer Cross-Shopping Networks ........ 208
Christian Lee, Nolan Peters

Machine Learning for IoT Intrusion Detection: A Realistic Evaluation of the CIC-IoT2023 Dataset ........ 238
Graham Ward, Anahit Shekikyan, Gerard Corrales Fernandez
Spring 2025
Image generated with OpenAI's DALL·E, facilitated by ChatGPT.
Early Detection of High-Risk Product Recalls: A Comparative Study of Multiclass Classification Approaches

Lorena Dorado
Applied Data Science Master’s Program
Shiley-Marcos School of Engineering
University of San Diego
ldorado@sandiego.edu

Parisa Kamizi
Applied Data Science Master’s Program
Shiley-Marcos School of Engineering
University of San Diego
pkamizi@sandiego.edu
ABSTRACT

Timely identification and classification of product recalls are essential to safeguarding public health. This study explores the application of machine learning and natural language processing techniques to predict the severity of product recalls issued by the U.S. Food and Drug Administration (FDA). Using a dataset of over 95,000 FDA recall records, the study developed a multiclass classification system that categorizes recalls into Class I, II, or III based on structured features and textual recall descriptions. Feature engineering incorporated temporal patterns, categorical variables, and text-based features such as term frequency-inverse document frequency and word counts. Several classification models—including random forest, XGBoost, decision tree, multilayer perceptron, and logistic regression—were evaluated using metrics such as precision, recall, and F1-score. The random forest model achieved the best overall performance with an F1-score above 0.93. While the model effectively distinguished Class I and II recalls, Class III predictions proved more complex due to overlapping features. A Streamlit dashboard was deployed to demonstrate real-time classification capability. The findings highlight the potential for artificial intelligence-driven tools to enhance
regulatory decision-making, improve recall timeliness, and strengthen consumer protection.

KEYWORDS

product recalls, recall classification, risk prediction, machine learning, natural language processing, FDA, public health, model evaluation, regulatory analytics, classification modeling

1 Introduction

Product recalls serve a critical role in protecting public health and safety across various industries, including food, pharmaceuticals, medical devices, and consumer goods. The process of identifying, classifying, and managing recalls is complex and involves regulatory bodies, manufacturers, and consumers. With the increasing frequency and complexity of recalls, there is a rising demand for efficient and proactive approaches to recall management and risk prediction.

The U.S. FDA categorizes recalls based on the severity of health risks posed by defective products:

● Class I: Products that could cause serious adverse health consequences or death.
● Class II: Products that might cause temporary or medically reversible adverse health consequences, with a remote probability of serious outcomes.
● Class III: Products unlikely to cause adverse health consequences but that violate FDA labeling or manufacturing laws (U.S. FDA, n.d.).

A recall event occurs when a company withdraws one or more of its products from the market. Each product involved receives an individual safety rating, referred to as the product classification (Class I, II, or III). The overall event classification is determined by the most severe classification assigned to any of the associated products (U.S. FDA, n.d.). For example, if an event involves three products—two classified as Class III and one as Class I—the event is designated Class I, reflecting the most severe classification among its products. The time between FDA awareness of an event and final classification varies, but it typically occurs within a few days.

Timely classification and initiation of recalls are crucial to protecting public health. However, delays in this process can lead to prolonged exposure to hazardous products, increasing the risk of adverse health outcomes. For instance, a study analyzing FDA medical device recalls from 2018 to 2022 found that only 26.5% of Class I recalls were terminated within a median of 24 months, indicating prolonged periods during which unsafe devices remained on the market (Darby et al., 2023).

The automobile industry is also affected by recalls. A study examining the relationship between recall frequency and manufacturer innovation found an inverted U-shaped relationship, suggesting that a moderate number of recalls can facilitate innovation but too many may stifle it (Ni et al., 2023). Other research has shown that lobbying
activities by firms may influence the FDA’s recall classifications in the pharmaceutical industry, raising concerns about the impact on public safety and prompting calls for more objective classification methods (Y. Zhou, 2023). These findings highlight the complexity of the recall process and emphasize the need for improvement through data-driven approaches. The FDA’s ongoing recall database provides valuable information for developing analytical tools to enhance recall management and risk prediction.

Given the complexity of product recalls and their implications for public health, industry operations, and regulatory efficiency, there is a need for advanced data science techniques to address several challenges:

● Accurate and efficient classification of recall severity
● Identification of risk factors and early signs of risk across different product types
● Analysis of time-related trends and rising issues in product safety
● Establishment of a standardized language and risk assessment for improved communication

To address these challenges, this study proposes the development of a multiclass classification and risk prediction system for recalls. This system applies machine learning, natural language processing (NLP), and statistical techniques to create a comprehensive framework for predicting recall severity, identifying risk factors, and providing actionable insights for proactive interventions. The working hypothesis is that by integrating diverse data sources—such as product descriptions, manufacturer information, recall reasons, and historical patterns—a predictive
model can be developed that improves the accuracy and efficiency of recall classification. This model is also expected to offer early indicators of potential safety issues. The project has the potential to reduce health impacts, lower manufacturer costs, improve regulatory efficiency, and enhance transparency in public risk awareness. These goals support public health and safety management efforts across industries and institutions.

2 Background

Product recalls are a critical component of consumer safety, necessitating prompt action from manufacturers and regulatory agencies to identify and mitigate potential hazards. The U.S. FDA oversees recalls across various sectors, including biologics, medical devices, drugs, food and cosmetics, tobacco, and veterinary products. The FDA classifies recalls by the severity of the health risks involved:

● Class I (most severe; life-threatening risks)
● Class II (moderate risks)
● Class III (least severe; unlikely to cause harm)

The recall process typically begins when a firm identifies a defect and notifies the FDA, providing relevant information about the product and the nature of the problem. The FDA then assesses the health hazard to determine the appropriate recall classification. This process involves evaluating the level of risk, determining the scope of the recall, notifying the public, and monitoring the effectiveness of the recall (FDA, 2024).

In the current recall process, manufacturers voluntarily provide the necessary information to the FDA, which can delay public awareness of
the product’s risk severity. In addition, each industry maintains independent supply chain structures and recall management protocols, even though industries may share the same distribution networks. FDA recall records contain cross-category distribution data, offering an opportunity for researchers to develop predictive classification models that generalize across industries.

2.1 Problem Identification and Motivation

Recall classification accuracy and timeliness are major concerns, with studies suggesting that external factors, such as lobbying activities, may influence recall classifications. This raises questions about the objectivity and consistency of the recall process. Additionally, the prolonged duration between recall initiation and termination, as highlighted by recent research, indicates a need for more efficient identification and management of recalls, especially for high-severity cases.

The lack of proactive risk identification and inconsistent communication across different product types further complicate the recall process. Current systems often react to safety issues after they occur, rather than proactively identifying potential risks. The absence of standardized language and risk assessments can lead to confusion and poor communication with consumers and stakeholders. Additionally, the breakdown of recall management protocols across industries, despite shared distribution networks, can lead to prolonged circulation of hazardous products. Quality assurance professionals face significant challenges in mitigating defects that lead to costly and potentially hazardous recalls.

This research is driven by the need to enhance public health protection, streamline resource allocation,
and enable proactive recall management. By developing methods to accurately identify high-risk recalls, the study aims to reduce the circulation time of hazardous products. In doing so, it supports regulatory agencies and manufacturers in prioritizing limited resources to expedite critical responses. The proposed approach has the potential to save lives by accelerating the identification of Class I recalls and shifting the industry focus from reactive problem-solving to proactive safety assurance.

2.2 Definition of Objectives

The multiclass classification and risk prediction system for recalls will require several key actions to address challenges in recall management. These include developing a machine learning model for recall severity classification, creating an early warning system for high-risk products, utilizing NLP techniques to analyze recall reasons, implementing a time-based pattern analysis system, and building a risk assessment tool for manufacturers. The project will analyze text patterns in product descriptions and recall reasons and combine this analysis with categorical features to determine their effectiveness across class categories.

Through these actions, the project aims to provide more accurate and consistent recall severity classification, efficient identification of high-risk recalls, early detection of potential high-risk products, and standardized analysis of recall reasons across product types. It will also identify emerging trends and seasonal variations in product safety issues, offer customized risk scores and preventive measures for manufacturers, and determine which classification method performs better under different class balance scenarios. The expected outcomes include improved accuracy and
timeliness in recall classification, reduction in Class I recalls, more efficient resource allocation, enhanced communication, and potential economic benefits. Even if not all objectives are fully met, the project will still contribute valuable insights into recall classification complexities and help identify current limitations in the field.

3 Literature Review

The purpose of this literature review is to position this research project in relation to existing studies on recall systems and data-driven approaches for improving recall efficiency and accuracy. By reviewing recent studies on FDA recall processes, predictive modeling, and emerging technologies such as NLP, this study identifies gaps in the literature and establishes the rationale for its analysis. This review covers several key areas: the evolution of recall research, challenges within current recall systems, data-driven approaches for prediction, industry-specific recall findings, and the potential of NLP to enhance recall prediction across various sectors.

3.1 Recall Classification and Regulatory Oversight

Y. Zhou (2023) and Dubin et al. (2021) examined the factors influencing FDA recall classifications and associated regulatory risks. Y. Zhou identifies lobbying as a potential influence on the classification process, suggesting that external pressures may compromise objectivity. Dubin et al. (2021) found that medical devices approved through the more rigorous premarket approval pathway had a higher risk of recall than those cleared through the 510(k) pathway, challenging the assumption that stricter approval leads to safer products.
Both studies point to shortcomings in current recall classification systems, with Y. Zhou focusing on external influences and Dubin et al. (2021) emphasizing the inherent risks tied to different approval pathways. However, neither study offers a data-driven, objective method for recall classification, leaving a gap in addressing these issues in a more systematic and unbiased way.

3.2 Characteristics and Trends in Medical Device Recalls

Mooghali et al. (2023) provided a detailed analysis of Class I medical device recalls from 2018 to 2022, revealing that such recalls are frequent and affect millions of devices annually. Their findings highlight inefficiencies in the recall process, noting that Class I recalls lasted a median of 24 months from initiation to termination. The research identifies trends in recall frequency, the types of devices most affected, and the length of time recalls take to resolve. However, it does not address the potential for predictive models that could identify high-risk recalls before they occur, indicating a lack of proactive tools for managing recall risks in advance.

3.3 Impact of Recalls on Firm Innovation

The study by Ni et al. (2023) emphasized the complex interplay between product failures and organizational learning. It identifies an inverted U-shaped relationship, where moderate levels of recalls may encourage innovation, but excessive recalls can have a negative effect on innovation and firm performance. The research suggests that recalls can both drive
and hinder innovation depending on their frequency. However, the study is limited to the automotive sector and does not explore whether these findings generalize to other industries, leaving a gap in understanding the broader implications of recalls on innovation.

3.4 Recall Prediction

J. An (2024) applied structural topic modeling to analyze FDA recalls in the plant-based food industry, revealing two dominant themes: market actors’ opportunism and food culture practices. This approach demonstrates the utility of advanced text analysis techniques for uncovering recall-related trends. However, the study’s focus on a single industry limits the generalizability of its findings across broader product categories.

3.5 Regulatory Approaches and Public Health Impact

The study by Barbosa-Slivinskis et al. (2024) developed a machine learning algorithm to predict FDA medical device recalls, achieving high sensitivity and specificity with lead times of up to 12 months. This research highlights the potential of machine learning for proactive recall management, enabling earlier detection of risks and better resource allocation. However, the study focuses exclusively on medical devices and does not explore the broader application of these techniques across different industries.

Several gaps are identified, including the need for a comprehensive, multi-industry approach to recall prediction, the use of machine learning and NLP techniques across product categories, and the development of standardized methods for assessing recall severity. Additionally, there is limited exploration of supply chain data integration and the long-term impacts of recalls on public health and industry innovation.
By addressing these gaps, this study contributes to the development of a proactive, cross-sector recall system that integrates insights from multiple industries to anticipate and mitigate product failures before they occur. This holistic approach can potentially improve regulatory precision, enhance public safety, and streamline recall decision-making across industries.

4 Methodology

This study utilizes a dataset sourced from the U.S. FDA’s (n.d.) publicly available recall database. The dataset contains 95,082 records across 17 variables, including firm information, product classification, recall status, geographic distribution, recall dates, and descriptive text. Exploratory data analysis (EDA) revealed a decline in monthly recall volume starting in 2020, stabilizing between 400 and 600 recalls per month. Therefore, only data from 2020 onward were selected for modeling to ensure consistency and representativeness. As this dataset is publicly accessible and free of personally identifiable information, there are no ethical concerns regarding data privacy.

The methodology follows a structured approach that includes EDA, statistical analysis, and data transformation. Visualizations are employed to explore categorical variable distributions, temporal trends, and geographic patterns. Additionally, statistical techniques, including chi-square tests, are utilized to examine
associations between key categorical features, thereby building a foundational understanding of recall dynamics and preparing the dataset for analytical modeling.

All code used for data analysis, figure generation, and machine learning in this paper is available in the following GitHub repository: https://github.com/PareesaK/Improving-Recall-Effectiveness/ For any inquiries, please contact the authors of the paper.

4.1 Data Acquisition and Aggregation

The preprocessing and analysis were conducted using Python, utilizing libraries such as Pandas and NumPy for data manipulation, and Matplotlib and Seaborn for EDA visualizations. Scikit-learn was used for data partitioning and machine learning applications, ensuring efficient handling of the dataset.

4.1.1 EDA

EDA serves as a critical initial step in the examination of the dataset, aiding in the identification of patterns and the detection of potential data issues. A systematic approach was followed to ensure the dataset was suitable for further modeling. The dataset consists of 95,082 records and 21 columns, which include the following variables: “FEI number,” “recalling firm name,” “product type,” “product classification,” “status,” “distribution pattern,” “recalling firm city,” “recalling firm state,” “recalling firm country,” “center classification date,” “reason for recall,” “product description,” “event ID,” “event classification,” “product ID,” “center,” “recall details,” “classification year,” “classification month,” “classification day,” and
“classification day of week.” The column names were reviewed for clarity, and the first five records were displayed to observe sample data points and their characteristics.

A key component of EDA involved inspecting the data types of each column to determine necessary type conversions. The dataset contains a mix of categorical, numerical, and datetime fields. The “center classification date” column was appropriately recognized as a datetime64 object, while other categorical variables, such as “product type” and “event classification,” were maintained in their respective string formats. Additionally, a missing value analysis was conducted, revealing only one missing value in the “distribution pattern” column. Given the low occurrence of missing data, appropriate techniques, such as imputation or omission, were considered based on the analysis requirements.

The distribution of the target variable, “event classification,” was examined to understand the prevalence of different recall event types. The dataset contained three distinct classes:

● Class I (21.15%) represents the most serious type of recall, indicating products that could cause severe health consequences.
● Class II (70.81%) is a moderate-level recall where exposure to the product may lead to temporary or medically reversible health effects.
● Class III (8.04%) is the least severe classification, involving products unlikely to cause adverse health effects.
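The event-level rule described in the introduction (a recall event inherits the most severe classification among its products) can be sketched in pandas, along with the class-share computation behind percentages like those above. This is an illustrative sketch only: the toy records and the exact column names are assumptions, not an excerpt from the study's code.

```python
import pandas as pd

# Toy extract; the real data come from the FDA recall database.
products = pd.DataFrame({
    "event ID": [101, 101, 101, 102, 103],
    "product classification": ["Class III", "Class III", "Class I",
                               "Class II", "Class III"],
})

# Encode severity as an ordered categorical so Class I sorts first.
severity = pd.CategoricalDtype(
    categories=["Class I", "Class II", "Class III"], ordered=True)
products["product classification"] = (
    products["product classification"].astype(severity))

# Event classification = most severe product classification in the event:
# sort by severity, then take the first product per event.
events = (products.sort_values("product classification")
          .groupby("event ID")["product classification"]
          .first()
          .rename("event classification"))
print(events)
# Event 101 holds two Class III products and one Class I product,
# so the event as a whole is designated Class I.

# Relative frequency of event classes, as reported for the full dataset.
print(events.value_counts(normalize=True).round(4))
```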
Figure 1
Distribution of Event Classification
Note. This figure shows the relative frequency of Class I, II, and III recalls. It is evident that Class II recalls predominate in the dataset, suggesting that moderate-level recalls warrant particular attention. However, the significance of Class I and Class III recalls should not be overlooked.

Subsequent analysis focused on independent variables that may influence the target variable. A key variable, Product Type, displayed distinct patterns across different recall event classifications and appeared to influence the type and severity of recalls. Figure 2 shows the frequency distribution of product types, indicating that devices represented the most frequently recalled category (37.3%), followed by food/cosmetics (28.9%), drugs (17.6%), biologics (12.7%), veterinary products (3.6%), and tobacco (0.09%).
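The chi-square tests mentioned in the methodology, used to examine associations between categorical features such as product type and event classification, can be illustrated with scipy on a contingency table. The counts below are invented for illustration and do not come from the FDA data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical recall counts by product type and event class.
table = pd.DataFrame(
    {"Class I": [120, 60, 40],
     "Class II": [400, 300, 150],
     "Class III": [30, 50, 20]},
    index=["Devices", "Food/cosmetics", "Drugs"])

# Test independence of product type and event classification.
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.4f}")
# A small p-value suggests that recall severity is not distributed
# uniformly across product types, i.e., the two variables are associated.
```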
Figure 2
Distribution of Product Types

Note. This figure shows the distribution of product types.

Further analysis revealed that the center variable had a perfect one-to-one correspondence with product type. Since each product type maps to one specific FDA center, this redundancy led to the exclusion of the center variable to enhance interpretability. The analysis of the distribution pattern variable also revealed significant trends, with certain distribution types more frequent in high-risk recalls. The distribution of recalls by FDA center is presented in Table 1. The Center for Devices and Radiological Health (devices) accounted for the largest share at 37.3%, followed by the Center for Food Safety and Applied Nutrition (food/cosmetics, 28.9%), Center for Drug Evaluation and Research (drugs, 17.6%), Center for Biologics Evaluation and Research (biologics, 12.7%), Center for Veterinary Medicine (veterinary, 3.6%), and Center for Tobacco Products (tobacco, <1%). This breakdown aligns precisely with the distribution of product types, further confirming the perfect correspondence between product type and center. Additionally, an analysis of products per recall event revealed considerable variation, with the number of products per event ranging from 1 to 470, a mean of 2.77, and a median of 1.0. As each product type is uniquely associated with an FDA center, only the Product Type variable will be retained for modeling to eliminate redundancy and preserve interpretability.

Table 1
Product Type and Its Center Association

Product type: Center
Devices: Center for Devices and Radiological Health (CDRH)
Food/cosmetics: Center for Food Safety and Applied Nutrition (CFSAN)
Drugs: Center for Drug Evaluation and Research (CDER)
Biologics: Center for Biologics Evaluation and Research (CBER)
Veterinary: Center for Veterinary Medicine (CVM)
Tobacco: Center for Tobacco Products (CTP)
The dataset also included several ID variables, such as Event ID, Product ID, FEI Number, and Recall Details, which were excluded from the analysis due to their uniqueness and lack of analytical value (Kuhn & Johnson, 2013).
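The exclusions described here (the ID variables and, earlier, the redundant center column) reduce to a simple column drop. A minimal sketch, assuming column names matching those listed in the EDA section; the single row of values is a placeholder.

```python
import pandas as pd

# Minimal frame containing only the columns relevant to this step.
df = pd.DataFrame({
    "product type": ["Devices"],
    "center": ["CDRH"],
    "event ID": [1], "product ID": [10],
    "FEI number": [999], "recall details": ["..."],
    "event classification": ["Class II"],
})

# Drop the center column (one-to-one with product type) and the
# identifier columns, which carry no analytical value for classification.
drop_cols = ["center", "event ID", "product ID",
             "FEI number", "recall details"]
df = df.drop(columns=drop_cols)
print(list(df.columns))  # ['product type', 'event classification']
```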
Figure 3 depicts the trend of monthly recalls over time, which shows a general downward trend, with recall volumes ranging from 600 to 1,000 per month between 2012 and 2018. Starting in 2020, recall numbers stabilized at a lower, more consistent level. A notable spike occurred in early 2014, reaching nearly 1,800 recalls, followed by cyclical fluctuations.

Figure 3
Recall Trend Over Time

Note. This figure shows the trend of monthly recalls, highlighting both cyclical and downward patterns. The cyclical patterns suggest periodic fluctuations, but they are not consistent. The early years (2012-2018) show greater fluctuations compared to 2019-2025, where recall numbers have stabilized at a lower and more consistent level.

The stacked bar chart in Figure 4 illustrates the annual distribution of recall classifications—Class I, Class II, and Class III—from 2012 to 2025. These classifications, defined by their severity, provide insight into the nature and seriousness of product recalls over time. Class I recalls, depicted in purple at the bottom of each bar, represent the most critical cases, involving products that may cause serious health consequences or death. Their proportion fluctuates throughout the observed period, with notable declines around 2017–2018 and a resurgence in 2022–2023. Class II recalls, shown in teal/green in the middle of the bars, consistently account for the largest share, typically comprising approximately 60–70% of total recalls annually. These involve products that may lead to temporary or medically reversible adverse health effects. Class III recalls, shown in yellow at the top, represent the smallest portion each year—generally 5–15%—and pertain to products unlikely to cause harm. Overall, the distribution of recall classes remains relatively stable over the years, with Class II consistently dominating. This visualization provides valuable insights for regulatory agencies, manufacturers, and public health stakeholders by highlighting trends in the severity of product safety issues. Monitoring these patterns aids in strategic resource allocation and policy development aimed at enhancing consumer protection and product safety.

Figure 4
Proportion of Recall Classes by Year

Note. This figure illustrates the annual distribution of recall classifications (Class I, II, and III) from 2012 to 2025.

The heatmap in Figure 5 shows the proportion of Class I recalls from 2012 to early 2025, highlighting key trends over time. Several “hotspot” months include December 2013 (0.62), November 2014 (0.51), February 2015 (0.56), March 2019 (0.57), September 2019 (0.51), and November 2023 (0.63), where the recall proportions peaked. Periods with high recall activity, such as mid-2023, were marked by multiple months with elevated recall rates. Early 2025 also saw high proportions in January and February. Notably, mid-2019 exhibited significant spikes in March and September, pointing to ongoing recall risks. In contrast, months like January 2014 (0.03), April 2020 (0.04), and various months in 2018 consistently recorded proportions below 0.10, indicating lower severity in recall events during those times.
Figure 5 Proportion of Class I recalls by Year-Month
Note. This figure illustrates the Heatmap of Class I recall by Year-Month.
Descriptive statistics and textual patterns in the “reason for recall” field of the FDA dataset were examined to explore how textual content may correlate with recall severity. The first step involved calculating the word count for each recall reason to assess the length and verbosity of these descriptions. The mean word count was approximately 21.7 words, with a standard
deviation of 17.5 and a maximum of 327 words, indicating substantial variance in the level of detail provided. This word count was subsequently used as a numerical feature in modeling recall severity. To enhance the feature set beyond raw text, we built a structured modeling dataset combining categorical dummies (e.g., product type, status,
recalling firm country), time-based variables (e.g., classification year, month, day of week), and term frequency-inverse document frequency (TF-IDF) features from cleaned recall descriptions. To reduce dimensionality and limit overfitting, the TF-IDF matrix was restricted to the top 100 components. An additional feature capturing the word count of the recall reason (reason_word_count) was also included as a potential signal of severity. To further explore recall class patterns, word clouds were generated for each recall class. Figure 6 illustrates the most frequently used words in Class I recalls, where terms such as “listeria,” “contaminated,” and “undeclared” prominently appear, emphasizing critical health risks. Figure 7 shows the word cloud for Class II recalls, revealing terms like “device,” “failure,” and “product,” indicative of mechanical or procedural issues. Finally, Figure 8 depicts the dominant vocabulary in Class III recalls, where terms such as “labeling,” “error,” and “incorrect” point to less severe but still significant issues related to compliance and documentation. These visualizations provide deeper insights into class-specific language trends, helping to understand how recall severity is conveyed in the FDA dataset.

Figure 6
Wordcloud of Class I Recalls
Figure 7
Wordcloud of Class II Recalls
Figure 8
Wordcloud of Class III Recalls
4.1.2 Categorical Features
Categorical variables were processed using one-hot encoding, particularly for “product type,” “status,” and “recalling firm country.” The target variable “event classification” was encoded into a binary feature “is_Class_I” for potential future binary classification tasks. The product type distribution showed that devices were the most frequently recalled, making up almost 37% of all recalls, followed by food/cosmetics and drugs. Recall status revealed that most recalls were “terminated” (84.43%), while “ongoing” recalls made up 13.83%, and “completed” recalls were the least common (1.74%). The analysis of recalls by state revealed that California had the highest number of recalls, followed by Illinois and Florida. Most recalls originated from the United States, with Canada, Germany, and the United Kingdom contributing significantly fewer recalls. Regarding event classification, Class II recalls were the most frequent (70.81%), while Class I recalls accounted for 21.15%, and Class III recalls were the least common (8.04%). A significant association was found between product type and event classification, indicating that certain product types are more likely to have specific recall classifications.
4.2 Data Quality
Ensuring data quality was a critical step before modeling. Missing values across all variables were examined. There was only one missing value, in “distribution pattern,” which was imputed as “unknown.” After reviewing recall trends over time in the EDA, the data was filtered to focus on the period starting in 2019, when recall fluctuations began to stabilize. Lastly, variables that could cause data leakage were dropped. For example, the raw “event classification” column was dropped because it deterministically determines the binary target “is_Class_I” derived from it.
4.2.1 Class Imbalance Strategy
The synthetic minority over-sampling technique (SMOTE) is applied to the training dataset during modeling to balance the under-represented class, Class I, which is the focus for accurately predicting severe events.
4.3 Feature Engineering
The feature engineering methodology of the project adheres to standard machine learning best practices by performing data cleaning before splitting the data, thereby preventing data leakage. The train-test split is conducted prior to any transformations, ensuring that feature engineering is applied only to the training data and then consistently replicated on the holdout test set. The holdout test set is preserved for final evaluation, providing an unbiased assessment of model performance. As implemented in the Data Preparation notebook, different data types—temporal, categorical, and text—are processed individually and then integrated, supporting a robust and valid model development framework.
4.3.1 Temporal Feature Engineering
This section begins by initializing empty DataFrames, X_train_processed and X_test_processed, to store the processed features. The focus here is on extracting useful information from the “center classification date” column. The raw date is converted into a proper datetime format and then split into three components: year, month, and day of the week. These components are then used to derive cyclical features for month and weekday using
sine and cosine transformations, which help capture the periodic nature of time-related patterns in the data (Lewinson, 2022). To avoid scaling issues, the year is normalized into a new feature called “years_since_first.” The original temporal features are then dropped, leaving only the transformed components for modeling.
4.3.2 Categorical Feature Engineering
In this part, the script processes important categorical variables such as “product type,” “status,” “recalling firm country,” “recalling firm state,” and “distribution pattern.” One-hot encoding is used to convert “product type” and “status” into binary dummy variables, omitting the first category to prevent multicollinearity. For “recalling firm country,” a binary “is_US” feature is created to indicate whether the recall originated from the United States. Then, using a predefined mapping of US states to regions, recalls from the United States are classified into broader geographic regions such as Northeast, Midwest, South, and West. These regions are also one-hot encoded. The “is_US” column is subsequently dropped after region encoding to avoid redundancy.
4.3.3 Distribution and Text Feature Engineering
This section focuses on simplifying and encoding the “distribution pattern” and preparing text data for later processing. The distribution pattern is mapped into broader categories such as “nationwide,” “international,” “regional,” “limited,” and “other” based on keywords found in the text. These categories are then one-hot encoded to make them suitable for modeling. Additionally, the script introduces a text-cleaning function for later use, designed to normalize and standardize textual data (e.g., replacing variations of pathogen names with consistent tokens). This step lays the groundwork for extracting insights from unstructured text fields in a clean and uniform way.
4.3.4 Baseline Approach
The baseline approach involves traditional classification using only structured data features, without incorporating unstructured text fields or advanced NLP techniques. A performance benchmark is established using encoded categorical and numerical variables, allowing an understanding of the predictive power of the structured information alone. This approach is later compared with more complex models that incorporate text-driven features. The input features include 41 variables that capture various aspects of the recall events. These include cyclical representations of the month and day of the week, product type classifications, recall status, distribution scope, and region of recall. Additional binary indicators reflect the presence of specific contaminants (e.g., salmonella, listeria), allergens (e.g., milk, soy, peanut), and reasons for recall such as mislabeling or foreign material. Text data is not directly analyzed; instead, a simple word count of the recall reason (reason_word_count) serves as a proxy for information density in that field. The training and test sets consist of 31,492 and 7,874 samples, respectively, each with the same 42-column structure. No dimensionality reduction or advanced feature selection is performed at this stage—the aim is to preserve as much relevant structured information as possible. Basic models such as logistic regression and random forest classifiers are applied to assess baseline performance. The results from this stage will later serve as a comparison point for models
that incorporate text-based features or embeddings.
4.3.5 Hybrid Approach
The hybrid approach integrates both traditional structured features and basic text-based predictors. It builds on a richer feature set that includes temporal, categorical, continuous, and textual data. Temporal variables were extracted from the “center classification date” column, including cyclically encoded features such as classification year, month, day, and day of the week. These features help capture seasonal or time-related patterns in recall events. The textual fields “reason for recall” and “product description” were preprocessed through a standard NLP pipeline, including lowercasing, removal of special characters, tokenization, stop word removal, and lemmatization. The cleaned versions of these texts were saved as new variables. Additionally, the reason_word_count feature was created to represent the length of the recall explanation, and TF-IDF vectorization was applied to the cleaned “reason for recall” field, generating 50 new numerical features that quantify the importance of specific terms. Categorical features, such as “product type,” “status,” and “recalling firm country,” were one-hot encoded to ensure compatibility with machine learning models. The target variable, “event classification,” was transformed into a binary classification problem through the creation of a new variable, “is_Class_I,” which distinguishes Class I recalls from all others. With a total of 92 columns in the processed dataset, this hybrid model offers a more nuanced representation of the data than the baseline model and sets the stage for exploring the predictive value of text-based features in combination with structured inputs.
4.3.6 Assessing All Predictors
All predictors, including continuous variables, categorical variables, and text data, are assessed for modeling. Temporal features were created from the “center classification date,” including classification year, month, day, and day of the week. Text features were also processed for “reason for recall” and “product description” through text cleaning (lowercasing, special character removal, tokenization), stop word removal, and lemmatization. These cleaned texts were saved in new columns. Further features were generated, including “reason_word_count,” which counts the number of words in the “reason for recall” field.
4.4 Feature Selection
The modeling notebook uses a flexible feature selection framework tailored to each model type. For most models (logistic regression, decision trees, random forests, and XGBoost), it applies SelectFromModel, which chooses features based on model-derived importance scores. The multilayer perceptron (MLP) neural network instead uses SelectKBest with statistical tests (F-statistic) to select the top features. Multiple feature subset sizes (5, 10, 15, 20, and all features) are tested. For each configuration, the notebook builds a pipeline that includes feature selection and cross-validation, while also tracking which features are consistently selected across folds. Despite testing various subsets, the final models for all algorithms performed best using all available features, indicating that each feature provided useful information for classification. However, the notebook still records and visualizes the most frequently selected features, especially for the random forest model, where the
top included variables like reason_word_count, month_sin, and has_listeria. To better understand the relationship between specific features and recall severity, a correlation analysis was conducted using a subset of selected variables. These features included pathogen indicators such as the presence of Listeria, Salmonella, and E. coli; allergen indicators like peanuts, nuts, shellfish, fish, milk, egg, wheat, and soy; and several manufacturing-related issues including mislabeling, foreign material contamination, and overall quality concerns. Other variables included risk factors such as possible illness or injury, different product types (devices, drugs, food/cosmetics, and veterinary products), distribution scope (nationwide, regional, or limited), and the word count of the reason for recall. After encoding the target variable, a correlation matrix was generated and visualized to identify patterns across these selected features. As shown in Figure 9, the heatmap highlights both positive and negative associations, with stronger correlations appearing in darker shades. Additionally, the features were ranked by their absolute correlation with the encoded event classification to identify the most relevant predictors. Figure 10 displays a bar plot of these sorted correlations, making it easier to interpret which features are most strongly associated with the severity of the recall. This correlation analysis serves as a foundation for further model development and feature selection.
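The correlation step can be reproduced in a few lines of pandas. The toy frame below uses hypothetical binary indicators and an encoded target (1 = Class I), not the project's actual values.

```python
import pandas as pd

# Hypothetical encoded features plus target (1 = Class I, 0 = other).
df = pd.DataFrame({
    "has_listeria":      [1, 0, 0, 1, 0, 0],
    "is_mislabeling":    [0, 1, 1, 0, 0, 1],
    "reason_word_count": [25, 12, 9, 31, 14, 11],
    "target_encoded":    [1, 0, 0, 1, 1, 0],
})

# Full Pearson correlation matrix, rendered as a heatmap for Figure 9
# (e.g., seaborn.heatmap(corr)).
corr = df.corr()

# Rank features by absolute correlation with the target, as in the
# Figure 10 bar plot.
ranking = (corr["target_encoded"].drop("target_encoded")
           .abs().sort_values(ascending=False))
print(ranking)
```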
Figure 9
Correlation Matrix of Selected Features with Target
Figure 10
Feature Correlation with Target Variable
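The per-model feature-selection framework described in Section 4.4 can be illustrated with a scikit-learn pipeline. This is a sketch on synthetic data standing in for the 41 structured features, not the notebook's own code; combining `max_features` with `threshold=-inf` forces SelectFromModel to keep exactly the top-k features by importance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the 41 structured recall features.
X, y = make_classification(n_samples=500, n_features=41, n_informative=10,
                           random_state=42)

# Importance-based selection wrapped with the classifier; running it
# inside cross-validation means selection happens per fold (no leakage).
pipe = Pipeline([
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        max_features=15, threshold=-float("inf"))),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
```

Repeating this over subset sizes (5, 10, 15, 20, all) and recording which features survive each fold mirrors the selection-stability tracking described above.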
4.5 Modeling
This section implements advanced machine learning techniques to classify product recalls using an imbalanced dataset. The target variable represents distinct recall categories, requiring strategies such as class weighting and sampling adjustments to address class imbalance. Multiple classification models—including logistic regression, decision trees, random forest, XGBoost, and an MLP—are employed to determine the most effective predictive approach. Model performance is evaluated using key metrics such as F1-score, recall, and precision to ensure a balanced assessment of classification effectiveness. An iterative experimentation and
validation process is followed to enhance accuracy and ensure robust recall classification.
4.5.1 Train-Test Split Validation
To ensure data integrity and modeling readiness, class distribution was analyzed in both the training and test datasets. In the training data, the majority class (Class II) accounted for approximately 71.8% of the samples, while Class I represented 21.2% and Class III comprised 7%. A nearly identical distribution was observed in the test set, confirming the success of stratification during the train-test split. The dataset contains 41 features, which can be categorized into several groups. The temporal features include five elements, such as month, day, and year. There are 16 categorical features related to product type, status, region, and distribution scope. Additionally, there are 19 binary flag indicators, reflecting conditions like allergen presence, illness, or manufacturing issues. One feature does not fall into these categories and is classified separately. Data validation checks revealed no missing or infinite values, confirming the dataset’s cleanliness and suitability for modeling.
4.5.2 Selection of Modeling Techniques
Employing a diverse set of modeling techniques is a strategic approach to capture various data patterns and relationships.
Logistic Regression. Logistic regression serves as a foundational model within this diverse ensemble, offering a benchmark for evaluating the performance of more complex algorithms. This strategy aligns with best practices in machine learning, where baseline models are
essential for contextualizing the efficacy of advanced models (MarkovML, 2023). Logistic regression represents the class of generalized linear models, providing a probabilistic framework for binary classification tasks. Its inclusion ensures coverage of linear modeling paradigms, complementing non-linear models such as decision trees and ensemble methods. This breadth allows for comprehensive analysis across different algorithmic approaches, facilitating a more robust understanding of the data (Kuhn & Johnson, 2013). The simplicity and interpretability of logistic regression make it an ideal baseline model. It enables clear insights into feature contributions and decision boundaries, which is particularly valuable in domains requiring transparency. When used alongside complex models like random forests or gradient boosting machines, logistic regression helps in diagnosing overfitting and understanding the marginal gains in predictive performance, thereby informing model selection and deployment decisions (Arshad et al., 2023). While logistic regression may not capture complex non-linear relationships as effectively as advanced models, its high interpretability and computational efficiency offer significant advantages. In scenarios where model transparency is paramount, such as healthcare or finance, logistic regression provides a balance between performance and explainability. Moreover, studies have demonstrated that in certain contexts, simpler models like logistic regression can outperform complex models, challenging the notion that increased complexity always leads to better performance (Arshad et al., 2023).
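As one concrete illustration of this baseline setup (a sketch on synthetic data, not the notebook's code; `class_weight="balanced"` stands in here for the SMOTE resampling described in Section 4.2.1):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced stand-in for the structured recall features
# (~21% positive class, mirroring the share of Class I recalls).
X, y = make_classification(n_samples=2000, n_features=41, weights=[0.79],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Interpretable linear baseline alongside a non-linear ensemble;
# class_weight counteracts the under-represented positive class.
models = {
    "logreg": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "rf": RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                 random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name,
          f"precision={precision_score(y_test, pred):.2f}",
          f"recall={recall_score(y_test, pred):.2f}",
          f"f1={f1_score(y_test, pred):.2f}")
```

Comparing the linear baseline's metrics against the random forest's quantifies the marginal gain from added model complexity, which is exactly the diagnostic role described above.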