M.S. Applied Data Science - Capstone Chronicles 2025


4.1.2 Categorical Features

Categorical variables were processed using one-hot encoding, particularly “product type,” “status,” and “recalling firm country.” The target variable “event classification” was also encoded into a binary feature, “is_Class_I,” for potential future binary classification tasks. The product type distribution showed that devices were the most frequently recalled category, making up almost 37% of all recalls, followed by food/cosmetics and drugs. Recall status revealed that most recalls were “terminated” (84.43%), while “ongoing” recalls made up 13.83% and “completed” recalls were the least common (1.74%). The analysis of recalls by state revealed that California had the highest number of recalls, followed by Illinois and Florida. Most recalls originated from the United States, with Canada, Germany, and the United Kingdom contributing significantly fewer recalls. Regarding event classification, Class II recalls were the most frequent (70.81%), while Class I recalls accounted for 21.15% and Class III recalls were the least common (8.04%). A significant association was found between product type and event classification, indicating that certain product types are more likely to receive specific recall classifications.

4.2 Data Quality

Ensuring data quality was a critical step before modeling. Missing values across all variables were examined; the only missing value, in “distribution pattern,” was imputed as “unknown.” After reviewing recall trends over time in the EDA, the data was filtered to the period starting in 2019, when recall fluctuations began to stabilize. Lastly, variables that could cause data leakage were dropped. For example, “event classification” was dropped because the binary target “is_Class_I” is derived deterministically from it.

4.2.1 Class Imbalance Strategy

The synthetic minority over-sampling technique (SMOTE) will be applied to the training dataset during modeling to balance the under-represented class, Class I, which is the focus for accurately predicting severe events.

4.3 Feature Engineering

The project's feature engineering methodology adheres to standard machine learning best practices by performing data cleaning before splitting the data, thereby preventing data leakage. The train-test split is conducted prior to any transformations, ensuring that feature engineering is fit only on the training data and then consistently replicated on the holdout test set. The holdout test set is preserved for final evaluation, providing an unbiased assessment of model performance. As implemented in the Data Preparation notebook, different data types—temporal, categorical, and text—are processed individually and then integrated, supporting a robust and valid model development framework.

4.3.1 Temporal Feature Engineering

This section begins by initializing empty DataFrames, X_train_processed and X_test_processed, to store the processed features. The focus here is on extracting useful information from the “center classification date” column. The raw date is converted into a proper datetime format and then split into three components: year, month, and day of the week. These components are then used to derive cyclical features for month and weekday using
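The temporal processing in Section 4.3.1 can be sketched as follows. The column name `center_classification_date` is a stand-in for the report's “center classification date” field, and the sine/cosine encoding is the standard way to derive cyclical features, assumed here since the notebook's exact formula is not shown:

```python
import numpy as np
import pandas as pd

def add_temporal_features(df, date_col="center_classification_date"):
    """Extract year/month/weekday plus cyclical encodings from a date column."""
    out = pd.DataFrame(index=df.index)
    dates = pd.to_datetime(df[date_col], errors="coerce")
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["weekday"] = dates.dt.dayofweek  # Monday = 0
    # Cyclical encoding maps month/weekday onto the unit circle, so that
    # December sits next to January and Sunday next to Monday.
    out["month_sin"] = np.sin(2 * np.pi * out["month"] / 12)
    out["month_cos"] = np.cos(2 * np.pi * out["month"] / 12)
    out["weekday_sin"] = np.sin(2 * np.pi * out["weekday"] / 7)
    out["weekday_cos"] = np.cos(2 * np.pi * out["weekday"] / 7)
    return out

# The transformation is deterministic, so applying the same function to the
# train and test splits replicates the feature engineering without leakage.
X_train = pd.DataFrame({"center_classification_date": ["2019-01-15", "2021-12-03"]})
X_train_processed = add_temporal_features(X_train)
```

Because the encoding involves no fitted parameters, calling the same function on the holdout set is exactly the “consistently replicated” step described in Section 4.3.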


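The categorical encoding from Section 4.1.2 can be illustrated with pandas. The column names and example values below are stand-ins for the report's “product type,” “status,” and “event classification” fields:

```python
import pandas as pd

df = pd.DataFrame({
    "product_type": ["Devices", "Drugs", "Food/Cosmetics", "Devices"],
    "status": ["Terminated", "Ongoing", "Terminated", "Completed"],
    "event_classification": ["Class I", "Class II", "Class III", "Class I"],
})

# One-hot encode the nominal categorical variables.
encoded = pd.get_dummies(df, columns=["product_type", "status"])

# Binary target: 1 if the recall is a severe Class I event, else 0.
encoded["is_Class_I"] = (df["event_classification"] == "Class I").astype(int)

# Drop the raw label, since the target is derived deterministically from it
# and keeping it would leak the answer into the feature matrix.
encoded = encoded.drop(columns=["event_classification"])
print(encoded["is_Class_I"].tolist())  # → [1, 0, 0, 1]
```

Dropping the raw “event classification” column after deriving `is_Class_I` mirrors the leakage removal described in Section 4.2.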