M.S. Applied Data Science - Capstone Chronicles 2025
Furthermore, this research could contribute to the development of cross-sector models that integrate insights from various industries. Such a holistic approach could ultimately enhance recall precision and regulatory compliance, leading to faster, more effective interventions that improve public safety. By addressing these gaps, this study contributes to the advancement of a proactive, cross-sector recall system that anticipates and mitigates future product failures before they occur, and that could streamline recall decision-making across industries.

4 Methodology

This study utilizes a dataset sourced from the U.S. FDA’s (n.d.) publicly available recall database. The dataset contains 95,082 records across 17 variables, including firm information, product classification, recall status, geographic distribution, recall dates, and descriptive text. Exploratory data analysis (EDA) revealed a decline in the monthly recall volume starting in 2020, after which the volume stabilized between 400 and 600 recalls per month. Therefore, only data from 2020 onward were selected for modeling to ensure consistency and representativeness. As this dataset is publicly accessible and free of personally identifiable information, there are no ethical concerns regarding data privacy.

The methodology follows a structured approach that includes EDA, statistical analysis, and data transformation over time. Visualizations are employed to explore categorical variable distributions, temporal trends, and geographic patterns. Additionally, statistical techniques, including chi-square tests, are utilized to examine
associations between key categorical features, thereby building a foundational understanding of recall dynamics and preparing the dataset for future analytical modeling.

All code used for data analysis, figure generation, and machine learning in this paper is available in the following GitHub repository: https://github.com/PareesaK/Improving-Recall-Effectiveness/ For any inquiries, please contact the authors of the paper.

4.1 Data Acquisition and Aggregation

The preprocessing and analysis were conducted in Python, using Pandas and NumPy for data manipulation and Matplotlib and Seaborn for EDA visualizations. Scikit-learn was used for data partitioning and machine learning applications, ensuring efficient handling of the dataset.

4.1.1 EDA

EDA serves as a critical initial step in the examination of the dataset, aiding in the identification of patterns and the detection of potential data issues. A systematic approach was followed to ensure the dataset was suitable for further modeling. The dataset consists of 95,082 records and 21 columns, which include the following variables: “FEI number,” “recalling firm name,” “product type,” “product classification,” “status,” “distribution pattern,” “recalling firm city,” “recalling firm state,” “recalling firm country,” “center classification date,” “reason for recall,” “product description,” “event ID,” “event classification,” “product ID,” “center,” “recall details,” “classification year,” “classification month,” “classification day,” and