M.S. Applied Data Science - Capstone Chronicles 2025


sine and cosine transformations, which help capture the periodic nature of time-related patterns in the data (Lewinson, 2022). To avoid scaling issues, the year is normalized into a new feature called "years_since_first." The original temporal features are then dropped, leaving only the transformed components for modeling.

4.3.2 Categorical Feature Engineering

In this part, the script processes important categorical variables such as "product type," "status," "recalling firm country," "recalling firm state," and "distribution pattern." One-hot encoding converts "product type" and "status" into binary dummy variables, omitting the first category to prevent multicollinearity. For "recalling firm country," a binary "is_US" feature indicates whether the recall originated in the United States. Then, using a predefined mapping of US states to regions, US recalls are classified into broader geographic regions such as Northeast, Midwest, South, and West; these regions are also one-hot encoded. The "is_US" column is dropped after region encoding to avoid redundancy.

4.3.3 Distribution and Text Feature Engineering

This section focuses on simplifying and encoding the "distribution pattern" and preparing text data for later processing. The distribution pattern is mapped into broader categories such as "nationwide," "international," "regional," "limited," and "other" based on keywords found in the text. These categories are then one-hot encoded to make them suitable for modeling. Additionally, the script introduces a text-cleaning function for later use, designed to normalize and standardize textual data (e.g., replacing variations of pathogen names with consistent tokens). This step lays the groundwork for extracting insights from unstructured text fields in a clean and uniform way.

4.3.4 Baseline Approach

The baseline approach involves traditional classification using only structured data features, without incorporating unstructured text fields or advanced NLP techniques. A performance benchmark is established using the encoded categorical and numerical variables, giving a measure of the predictive power of the structured information alone. This benchmark will later be compared against more complex models that incorporate text-driven features.

The input features include 41 variables that capture various aspects of the recall events. These include cyclical representations of the month and day of the week, product type classifications, recall status, distribution scope, and region of recall. Additional binary indicators reflect the presence of specific contaminants (e.g., salmonella, listeria), allergens (e.g., milk, soy, peanut), and reasons for recall such as mislabeling or foreign material. Text data is not directly analyzed; instead, a simple word count of the recall reason ("reason_word_count") serves as a proxy for the information density of that field.

The training and test sets consist of 31,492 and 7,874 samples, respectively, each with the same 42-column structure. No dimensionality reduction or advanced feature selection is performed at this stage; the aim is to preserve as much relevant structured information as possible. Basic models such as logistic regression and random forest classifiers are applied to assess baseline performance. The results from this stage will later serve as a comparison point for models
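The cyclical temporal encoding described in Section 4.3.1 can be sketched as follows. This is a minimal illustration, not the project's actual script; the raw column names (`month`, `day_of_week`, `year`) are assumptions, since the report does not list them.

```python
import numpy as np
import pandas as pd

def add_cyclical_features(df: pd.DataFrame) -> pd.DataFrame:
    """Encode month and day-of-week as sine/cosine pairs so that
    December sits next to January and Sunday next to Monday."""
    out = df.copy()
    out["month_sin"] = np.sin(2 * np.pi * out["month"] / 12)
    out["month_cos"] = np.cos(2 * np.pi * out["month"] / 12)
    out["dow_sin"] = np.sin(2 * np.pi * out["day_of_week"] / 7)
    out["dow_cos"] = np.cos(2 * np.pi * out["day_of_week"] / 7)
    # Normalize the year so its raw magnitude does not dominate scaling.
    out["years_since_first"] = out["year"] - out["year"].min()
    # Drop the original temporal features, keeping only the transforms.
    return out.drop(columns=["month", "day_of_week", "year"])
```

The sine/cosine pair places each period on a unit circle, so the model sees month 12 and month 1 as neighbors rather than as the extremes of a linear scale.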
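The categorical and distribution-pattern encodings of Sections 4.3.2 and 4.3.3 can be sketched like this. The state-to-region dictionary is a small illustrative subset of the full mapping, and the column names and keyword rules are assumptions, not the report's exact definitions.

```python
import pandas as pd

# Hypothetical subset of the US state-to-region mapping described in the text.
STATE_TO_REGION = {
    "NY": "Northeast", "MA": "Northeast",
    "IL": "Midwest", "OH": "Midwest",
    "TX": "South", "FL": "South",
    "CA": "West", "WA": "West",
}

def map_distribution(pattern: str) -> str:
    """Collapse free-text distribution patterns into broad buckets."""
    text = str(pattern).lower()
    if "nationwide" in text:
        return "nationwide"
    if "international" in text or "export" in text:
        return "international"
    if "region" in text:
        return "regional"
    if "limited" in text:
        return "limited"
    return "other"

def engineer_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["is_US"] = (out["recalling_firm_country"] == "United States").astype(int)
    out["region"] = out["recalling_firm_state"].map(STATE_TO_REGION)
    out.loc[out["is_US"] == 0, "region"] = "non_US"
    out["region"] = out["region"].fillna("other")
    out["dist_bucket"] = out["distribution_pattern"].apply(map_distribution)
    # drop_first=True omits one dummy per variable to avoid multicollinearity.
    out = pd.get_dummies(
        out,
        columns=["product_type", "status", "region", "dist_bucket"],
        drop_first=True,
    )
    # Drop the raw fields and the now-redundant is_US indicator.
    return out.drop(columns=["recalling_firm_country", "recalling_firm_state",
                             "distribution_pattern", "is_US"])
```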
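The baseline fit of Section 4.3.4 could look roughly like the sketch below, assuming scikit-learn as the implementation (the report names the models but not the library). The synthetic 41-feature matrix is a stand-in; the real inputs come from the encoding steps above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def baseline_scores(X: np.ndarray, y: np.ndarray) -> dict:
    """Fit the two baseline classifiers and return test-set accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    }
    return {
        name: accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
        for name, model in models.items()
    }

# Synthetic stand-in for the 41 structured features (hypothetical data,
# not the recall dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 41))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
```

Holding the split and random seeds fixed keeps the structured-only scores directly comparable with the later text-augmented models.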

