M.S. Applied Data Science - Capstone Chronicles 2025
4.3.6 Assessing All Predictors

All predictors, including continuous variables, categorical variables, and text data, were assessed for modeling. Temporal features were created from the “center classification date” field: classification year, month, day, and day of the week. Text features were derived from the “reason for recall” and “product description” fields through text cleaning (lowercasing, special-character removal, and tokenization), stop-word removal, and lemmatization; the cleaned texts were saved in new columns. Additional features were generated, including “reason_word_count,” which counts the number of words in the “reason for recall” field.

4.4 Feature Selection

The modeling notebook uses a flexible feature-selection framework tailored to each model type. For most models (logistic regression, decision trees, random forests, and XGBoost), it applies SelectFromModel, which chooses features based on model-derived importance scores. The multilayer perceptron (MLP) neural network instead uses SelectKBest with a statistical test (the ANOVA F-statistic) to select the top features. Multiple feature-subset sizes (5, 10, 15, 20, and all features) are tested. For each configuration, the notebook builds a pipeline that combines feature selection with cross-validation while tracking which features are selected consistently across folds. Despite testing these subsets, the final models for all algorithms performed best with all available features, indicating that each feature contributed useful information for classification. The notebook nonetheless records and visualizes the most frequently selected features, especially for the random forest model, where the
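The per-model selection setup described in Section 4.4 can be sketched as follows. This is a minimal illustration, not the notebook’s actual configuration: the synthetic dataset, model hyperparameters, and subset sizes shown are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the processed recall dataset.
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

results = {}
for k in (5, 10, 15, 20):
    # Importance-based selection for tree-based models via SelectFromModel;
    # threshold=-np.inf makes it keep exactly the top max_features.
    rf_pipe = Pipeline([
        ("select", SelectFromModel(RandomForestClassifier(random_state=42),
                                   max_features=k, threshold=-np.inf)),
        ("model", RandomForestClassifier(random_state=42)),
    ])
    # Statistical selection for the MLP via SelectKBest (ANOVA F-statistic).
    mlp_pipe = Pipeline([
        ("select", SelectKBest(f_classif, k=k)),
        ("model", MLPClassifier(max_iter=300, random_state=42)),
    ])
    results[k] = (cross_val_score(rf_pipe, X, y, cv=5).mean(),
                  cross_val_score(mlp_pipe, X, y, cv=5).mean())

# Inspect which features the random forest selector kept on the full data;
# repeating this per fold gives the selection-frequency tracking described above.
rf_pipe.fit(X, y)
selected_mask = rf_pipe.named_steps["select"].get_support()
print(results)
print("RF-selected feature indices (k=20):", np.flatnonzero(selected_mask))
```

Wrapping selection inside the Pipeline keeps it within each cross-validation fold, so the selector is refit on each training split and cannot leak information from the held-out fold.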
that incorporate text-based features or embeddings.

4.3.5 Hybrid Approach

The hybrid approach integrates traditional structured features with basic text-based predictors, building on a richer feature set that spans temporal, categorical, continuous, and textual data. Temporal variables were extracted from the “center classification date” column, including cyclically encoded versions of the classification year, month, day, and day of the week; these features help capture seasonal or other time-related patterns in recall events.

The textual fields “reason for recall” and “product description” were preprocessed through a standard NLP pipeline: lowercasing, removal of special characters, tokenization, stop-word removal, and lemmatization. The cleaned versions of these texts were saved as new variables. Additionally, the reason_word_count feature was created to represent the length of the recall explanation, and TF-IDF vectorization was applied to the cleaned “reason for recall” field, generating 50 new numerical features that quantify the importance of specific terms.

Categorical features, such as “product type,” “status,” and “recalling firm country,” were one-hot encoded to ensure compatibility with the machine learning models. The target variable, “event classification,” was transformed into a binary classification problem through a new variable, “is_Class_I,” which distinguishes Class I recalls from all others. With a total of 92 columns in the processed dataset, this hybrid model offers a more nuanced representation of the data than the baseline model and sets the stage for exploring the predictive value of text-based features in combination with structured inputs.