ADS Capstone Chronicles Revised


normalized distribution (see Figure 3). However, the phrases appearing most frequently are names such as president obama, new york, or bbc news (see Table 3).

Figure 3 Bigram Distribution After Removing Stopwords and Climate Phrases

Table 3 Top-5 Bigram Frequencies After Removing Stopwords and Climate Phrases

Phrases            Frequency
president obama    2595
united states      2573
president trump    2305
donald trump       2206
white house        1872

Our fifth test removed the stopwords from the four-gram distribution and left the climate phrases. Climate phrases appear so frequently that the four-grams become variations of the same or similar phrases. This is expected, because the bag-of-words feature extraction method treats text as an unstructured assortment of known phrases defined solely by frequency and ignores word order and context. Lastly, we removed the climate phrases from the corpus, which provided more contextualized phrases for classifying the news snippets (see Figure 4 and Table 4).

Figure 4 Four-gram Distribution After Removing Stopwords and Climate Phrases

Table 4 Top-5 Four-gram Frequencies After Removing Stopwords and Climate Phrases

Phrases                          Frequency
melting pot impacted species     415
pot impacted species going       415
impacted species going touched   415
potential just keeps growing     313
cars talk road sensors           242
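The n-gram frequency counts described in this section can be illustrated with a minimal sketch. It is not the study's actual pipeline: the stopword set, the climate-phrase list, the helper names tokenize and top_ngrams, and the sample snippets are all hypothetical and chosen only to show the counting step.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lists; the study's actual stopword and
# climate-phrase lists are not shown in this section.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "at"}
CLIMATE_PHRASES = ["climate change", "global warming"]


def tokenize(text, remove_climate_phrases=True):
    """Lowercase, optionally strip climate phrases, then drop stopwords."""
    text = text.lower()
    if remove_climate_phrases:
        for phrase in CLIMATE_PHRASES:
            text = text.replace(phrase, " ")
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOPWORDS]


def top_ngrams(snippets, n=2, k=5, remove_climate_phrases=True):
    """Return the k most frequent n-grams across all snippets."""
    counts = Counter()
    for snippet in snippets:
        tokens = tokenize(snippet, remove_climate_phrases)
        # Slide a window of length n over the filtered tokens.
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(k)


# Made-up snippets, for illustration only.
snippets = [
    "President Obama spoke on climate change at the White House",
    "The White House said President Obama is in New York",
    "Global warming is on the agenda in New York",
]
top_bigrams = top_ngrams(snippets, n=2, k=3)
print(top_bigrams)
```

Because stopwords are dropped before the window is applied, words that were not adjacent in the original text can end up in the same n-gram. Passing remove_climate_phrases=False corresponds to the tests that left the climate phrases in place.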

4.5 Modeling

4.5.1 Latent Dirichlet Allocation

LDA is a powerful algorithm for uncovering hidden structures within text data and has numerous applications in natural language processing and information retrieval. For our analysis, we use LDA with an online variational
