ADS Capstone Chronicles Revised
normalized distribution (see Figure 3). However, the phrases appearing most frequently are names such as president obama, new york, or bbc news (see Table 3).

Figure 3 Bigram Distribution After Removing Stopwords and Climate Phrases
Table 3 Top-5 Bigram Frequencies After Removing Stopwords and Climate Phrases

Phrases                          Frequency
president obama                  2595
united states                    2573
president trump                  2305
donald trump                     2206
white house                      1872

Our fifth test removed the stopwords from the four-gram distribution and left the climate phrases. Climate phrases appear so frequently that the four-grams become variations of the same or similar phrases. This is expected, as the bag-of-words feature extraction method treats text as an unstructured assortment of known phrases defined solely by frequency and ignores word order and context. Lastly, we removed the climate phrases from the corpus, which provided more contextualized phrases for classifying the news snippets (see Figure 4 and Table 4).

Figure 4 Four-gram Distribution After Removing Stopwords and Climate Phrases

Table 4 Top-5 Four-gram Frequencies After Removing Stopwords and Climate Phrases

Phrases                          Frequency
melting pot impacted species     415
pot impacted species going       415
impacted species going touched   415
potential just keeps growing     313
cars talk road sensors           242
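The n-gram counts discussed in this section can be illustrated with a minimal sketch. It is not the pipeline used in this study. The stopword list, the climate-phrase list, the example snippets, and the function name `top_ngrams` are all hypothetical and chosen only for illustration. The sketch removes the climate phrases and stopwords first and then counts n-grams over the remaining tokens, so words that were not adjacent in the original text can end up in the same n-gram.

```python
import re
from collections import Counter

# Hypothetical lists for illustration; the study's actual lists are not shown here.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "on", "at", "will"}
CLIMATE_PHRASES = ("climate change", "global warming")


def top_ngrams(snippets, n, k=5, stopwords=STOPWORDS, drop_phrases=CLIMATE_PHRASES):
    """Return the k most frequent n-grams after dropping phrases and stopwords."""
    counts = Counter()
    for text in snippets:
        text = text.lower()
        # Remove whole climate phrases before tokenizing.
        for phrase in drop_phrases:
            text = text.replace(phrase, " ")
        # Keep alphabetic tokens that are not stopwords.
        tokens = [t for t in re.findall(r"[a-z]+", text) if t not in stopwords]
        # Slide a window of n tokens over the filtered token list.
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)


# Made-up example snippets, not drawn from the study's corpus.
snippets = [
    "President Obama spoke on climate change at the White House",
    "The White House said President Obama will address global warming",
]
print(top_ngrams(snippets, n=2, k=2))
```

Passing `drop_phrases=()` keeps the climate phrases in the text, which corresponds to the test configuration in which only stopwords are removed.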
4.5 Modeling
4.5.1 Latent Dirichlet Allocation
LDA is a powerful algorithm for uncovering hidden structures within text data and has numerous applications in natural language processing and information retrieval. For our analysis, we use LDA with an online variational