ADS Capstone Chronicles Revised


Figure 5 Topic Modeling Term Importance

Bayes algorithm from the sklearn.decomposition package (Scikit-learn, n.d.). Each iteration was performed on a CountVectorizer of unigrams, bigrams, trigrams, or a combination, with ‘max_df = 0.7’ and ‘min_df = 500’. Thus, only words that appear in at least five snippets and no more than 70% of all snippets are included in the matrix of term frequencies. CountVectorizer was run six times, leading to six distinct LDA models: unigrams (a) with and (b) without climate phrases, bigrams (c) with and (d) without climate phrases, and a mixture of unigrams and bigrams (e) with and (f) without climate phrases. The final model chosen used a CountVectorizer of bigrams and trigrams with climate phrases included (see Figure 5). For each topic, five keywords are displayed along with their weights, normalized to sum to 100%. For example, a weight of 61.20 for the bigram ‘global warming’ in topic 0 means that this bigram accounts for about 61% of the normalized keyword weight in that topic. Key phrases used to discuss climate change include global warming (61.20), greenhouse gas (12.73), gas emissions (9.14), al gore (8.80), and paris climate (7.16). Along with climate-related phrases, political entities were also identified, including donald trump (5.99) in topic 0, the white house (10.89) in topic 2, and president obama (14.19) in topic 3.

4.5.2 GPT-3.5 Turbo

4.5.2.1 Prompt Engineering

We use an OpenAI pretrained large language model (LLM), gpt-3.5-turbo-0125. The 0125 version of GPT-3.5 Turbo was released on January 25, 2024, with increased accuracy at

