M.S. AAI Capstone Chronicles 2024
Detecting Fake News Using Natural Language Processing

user-friendly interaction. Future plans involve expanding to a user interface or mobile app for real-time verification, empowering users to combat misinformation effectively.

Data Summary

To diversify our training data, we aggregated five datasets from Kaggle and the University of Victoria, spanning domains such as general fake news detection, the Syrian war, and the Egyptian Football League. Each dataset contains between approximately 7,000 and 20,000 rows of text. Our cleaning and preprocessing pipeline removes duplicates, null values, and special characters, and normalizes the text through stemming and lowercase conversion. The same core steps are applied to every dataset, with adjustments for each dataset's specific characteristics, such as the tweet-based nature of the Egyptian Football League dataset. This ensures consistency and quality across all datasets, which is essential for subsequent analysis and modeling. The resulting combined dataset totals around 94,000 rows and two columns (text and class) and has a balanced distribution between fake news (0) and real news (1).
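The cleaning steps described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the function names (`crude_stem`, `clean_text`, `clean_rows`) are hypothetical, the suffix stripper is a simplified stand-in for a real stemmer such as Porter's, and the choice to remove duplicates after normalization rather than before is an assumption.

```python
import re

# Hypothetical stand-in for a real stemmer (e.g. Porter): strips a few common suffixes.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_text(text: str) -> str:
    text = text.lower()                        # lowercase conversion
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace special characters with spaces
    return " ".join(crude_stem(tok) for tok in text.split())


def clean_rows(rows):
    """rows: iterable of (text, label) pairs, with label 0 = fake and 1 = real."""
    seen = set()
    cleaned = []
    for text, label in rows:
        if text is None or label is None:
            continue  # drop null values
        norm = clean_text(text)
        if not norm or norm in seen:
            continue  # drop empty texts and duplicates (compared after normalization)
        seen.add(norm)
        cleaned.append((norm, label))
    return cleaned


# Illustrative rows only, not taken from the datasets described above.
sample = [
    ("Breaking!!! Markets crashed", 0),
    ("breaking markets crashed", 0),
    (None, 1),
    ("Officials confirmed the results.", 1),
]
cleaned_sample = clean_rows(sample)
```

Comparing texts after normalization means that entries differing only in case or punctuation are treated as duplicates. Dataset-specific adjustments, such as stripping mentions or URLs from tweets, would be added as extra steps in `clean_text`.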
Figure 1: Class balance of labels for the combined dataset and word count in real and fake texts