M.S. Applied Data Science - Capstone Chronicles 2025


Figure 6

recalling firm country), time-based variables (e.g., classification year, month, day of week), and term frequency-inverse document frequency (TF-IDF) features derived from the cleaned recall descriptions. To reduce dimensionality and limit overfitting, the TF-IDF matrix was restricted to the top 100 components. An additional feature, reason_word_count, captured the word count of the recall reason as a potential signal of severity.

To further explore recall class patterns, word clouds were generated for each recall class. Figure 6 shows the most frequent words in Class I recalls, where terms such as “listeria,” “contaminated,” and “undeclared” appear prominently, underscoring critical health risks. Figure 7 shows the word cloud for Class II recalls, where terms like “device,” “failure,” and “product” indicate mechanical or procedural issues. Finally, Figure 8 shows the dominant vocabulary in Class III recalls, where terms such as “labeling,” “error,” and “incorrect” point to less severe but still significant compliance and documentation issues. Together, these visualizations show how vocabulary differs across recall classes and how recall severity is conveyed in the FDA dataset.
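The text features described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes that "top 100 components" means keeping the 100 most frequent terms (a dimensionality-reduction step such as truncated SVD would be another reading), it uses a simple lowercase alphabetic tokenizer as the cleaning step, and the sample recall reasons and the helper names `tokenize`, `tfidf_features`, and `reason_word_count` are hypothetical. Only the standard library is used so the example stays self-contained.

```python
import math
import re
from collections import Counter

# Hypothetical recall reasons; the real FDA data is not reproduced here.
reasons = [
    "Product may be contaminated with Listeria monocytogenes",
    "Undeclared milk allergen in product",
    "Device failure due to battery defect",
    "Labeling error: incorrect expiration date",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (an assumed cleaning step)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_features(docs, max_terms=100):
    """Return (vocabulary, rows), where each row is an L2-normalised TF-IDF vector."""
    tokenized = [tokenize(d) for d in docs]
    # Keep only the max_terms most frequent terms across the whole corpus.
    corpus_counts = Counter(t for toks in tokenized for t in toks)
    vocab = sorted(t for t, _ in corpus_counts.most_common(max_terms))
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    n_docs = len(docs)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


def reason_word_count(text):
    """Number of whitespace-separated words in a recall reason."""
    return len(text.split())


vocab, tfidf_rows = tfidf_features(reasons, max_terms=100)
word_counts = [reason_word_count(r) for r in reasons]
# Final text features: TF-IDF columns followed by the word count.
features = [row + [wc] for row, wc in zip(tfidf_rows, word_counts)]
```

Capping the vocabulary keeps the feature matrix narrow relative to the number of recalls, which is the overfitting concern the text raises. With a library such as scikit-learn, a comparable cap is available through the `max_features` argument of `TfidfVectorizer`.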

Word Cloud of Class I Recalls

Figure 7

Word Cloud of Class II Recalls

Figure 8

Word Cloud of Class III Recalls
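The per-class vocabularies summarized in Figures 6 through 8 rest on term frequencies computed separately for each recall class. The sketch below shows one way such counts might be obtained. It is an assumption-laden illustration: the sample records, the small stopword set, and the helper name `top_terms_by_class` are hypothetical, and the source does not state which tool produced the figures.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (recall class, reason) pairs standing in for the FDA records.
recalls = [
    ("Class I", "Product contaminated with listeria"),
    ("Class I", "Undeclared peanut allergen, product contaminated"),
    ("Class II", "Device failure during use"),
    ("Class II", "Device battery failure"),
    ("Class III", "Labeling error on carton"),
    ("Class III", "Incorrect labeling of lot number"),
]

# A tiny illustrative stopword set; a real analysis would use a fuller list.
STOPWORDS = {"with", "on", "of", "during", "the", "a", "and"}


def top_terms_by_class(rows, k=3):
    """Return the k most frequent non-stopword terms for each recall class."""
    counts = defaultdict(Counter)
    for recall_class, reason in rows:
        tokens = re.findall(r"[a-z]+", reason.lower())
        counts[recall_class].update(t for t in tokens if t not in STOPWORDS)
    return {c: counter.most_common(k) for c, counter in counts.items()}


top_terms = top_terms_by_class(recalls)
for recall_class, terms in top_terms.items():
    print(recall_class, terms)
```

A word cloud is a visual encoding of exactly these counts, with font size scaled by frequency, so the same dictionary of term frequencies could be passed to a plotting tool to reproduce figures of this kind.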
