ADS Capstone Chronicles Revised
Capstone Chronicles 2024 Selections
MS-Applied Data Science University of San Diego
Image generated with OpenAI's DALL·E, facilitated by ChatGPT.
Dear Reader,
It is with great pleasure that we introduce the inaugural edition of Capstone Chronicles, a collection of outstanding Capstone projects from the MS in Applied Data Science (ADS) program at the University of San Diego (USD) in 2024. This publication serves as a testament to the dedication, creativity, and analytical expertise of our students as they tackle real-world challenges through data-driven and analytical solutions.

The University of San Diego’s innovative online Applied Data Science master’s degree program is committed to training current and future data science and engineering professionals for the important and fascinating work ahead. The strengths of our program include a significant emphasis on real-world applications, ethics, moral responsibility, and social good in designing data science projects, and it has been developed by data science professionals in close collaboration with key industry and government stakeholders to provide in-depth practical and technical training. Each graduating cohort (Spring, Summer, and Fall) included in this magazine consists of 25-30 students.

In the Capstone course, students apply the knowledge and skills acquired in this master’s program. The Capstone project serves as a culminating experience for students in the program, enabling them to apply their theoretical knowledge to a research-driven, code-intensive project. Students lead an end-to-end data science workflow, including data acquisition, processing, and analysis, while deploying appropriate analytical techniques. The project is documented in an academic journal-style article and presented in a recorded technical presentation.

We hope that Capstone Chronicles serves as both an inspiration and a resource for future students, researchers, and practitioners in the field of data science. By sharing these exemplary projects, we aim to celebrate the accomplishments of our students and contribute to the broader discourse on applied data science. We extend our gratitude to the students whose hard work is showcased in these pages, as well as to the faculty and mentors who have guided them throughout their journeys.

Thank you for your interest in Capstone Chronicles and the MS-Applied Data Science program at the University of San Diego.
The 2024 Capstone Chronicles Editorial Team
Anna Marbut
Ebrahim Tarshizi
This letter was composed with the assistance of OpenAI’s ChatGPT.
Table of Contents

Spring 2024
Detecting and Removing Personal Identifiable Information Using Machine Learning ........ 5
  Ebad Akhter, Jiaqi He, Jacqueline Vo
Emotionality Analysis of Climate Change Communication in News Media ........ 17
  Vivian Do, Bryan Flores
NFL Teams Should Focus On Passing ........ 28
  Caleb McCurdy, UE Wang

Summer 2024
Retail Analytics: Understanding Customer Behavior through Transaction Data ........ 45
  Jesse Gutierrez, Sultan Mahmud Rahat, Verity Pierson
Guardians of the Crypto: A Streamlit Application for Enhanced Price Prediction and Informed Decision-Making ........ 67
  Mirna Philip, Justin Farnan, Arya Shahbaz
Comparing the Effects of Various Demographic, Socioeconomic, and Health Disparity Metrics on Stomach Cancer Mortality Rates in 2019 Across U.S. Counties ........ 92
  Shailja Somani, Yicong Qiu
Uncovering Healthcare Inefficiencies: A Data-Driven Solution for Market Saturation and Fraud ........ 125
  Jessica Hin, Samantha Rivas, Amy Ou
Adverse Drug Reaction Surveillance: A Precision Public Health Model ........ 151
  Halee Staggs, Vicky van der Wagt

Fall 2024
Artificial Intelligence-Driven Automation of Flow Cytometry Gating ........ 186
  Gabriella Rivera, John Vincent Deniega
Smart Meal Choices: A Data Science Approach to Personalized Diabetes-Friendly Restaurant Meal Recommendations ........ 203
  Claire Bentzen, Tara Dehdari, Logan Van Dine
Predictive Modeling for Risk-Based Premiums and Real-Time Safety Guidance in Auto Insurance Customer Retention ........ 241
  Marvin Moran, Ben Ogle, Katie Mears
Satellite Intelligence for Catastrophic Natural Disaster Recovery ........ 267
  Jeremiah Fa’atiliga, Ravita Kartawinata, Sowmiya Kanmani
Spring 2024
Detecting and Removing Personal Identifiable Information Using Machine Learning

Ebad Akhter
Applied Data Science Master’s Program, Shiley Marcos School of Engineering, University of San Diego
eakhter@sandiego.edu

Jiaqi He
Applied Data Science Master’s Program, Shiley Marcos School of Engineering, University of San Diego
jhe@sandiego.edu

Jacqueline Vo
Applied Data Science Master’s Program, Shiley Marcos School of Engineering, University of San Diego
jvo@sandiego.edu
ABSTRACT
This research explored the development and evaluation of several machine learning algorithms designed to perform text classification tasks. These models targeted the detection of personal identifiable information (PII) within academic documents written by students. The objective of this project was to create a machine learning model that would accurately and precisely classify words considered PII while remaining efficient. Several algorithms were used, including logistic regression, random forest, extreme gradient boosting (XGBoost), k-nearest neighbors (K-NN), and Presidio. These models were all trained and tested on pre-processed text data, using tokenization and feature engineering in addition to hyperparameter tuning methods such as randomized search and grid search. Ultimately, this study found that the random forest model performed the best with regard to precision. This study demonstrated the usefulness of machine learning for enhancing data privacy.

KEYWORDS
personal identifiable information, modeling, machine learning, detection, anonymization, privacy
1 Introduction
Protecting personal identifiable information (PII) as data science evolves is paramount. As companies store data on millions of individuals, the prevalence of data breaches and the exchange of information among third-party vendors underscore the vulnerability of individuals' data privacy. Within the realm of education technology, commonly referred to as the ed-tech industry, PII poses an obstacle to developing open datasets to advance educational outcomes, as the public release of such data exposes students to potential risks. To mitigate these risks effectively, it is imperative to implement rigorous screening and cleansing procedures for institutional data to identify and remove PII before its public dissemination. Applying data science methodologies such as tokenizing and vectorizing text data can significantly facilitate this task.

2 Background
Educational institutions commonly store large amounts of personal information on students and faculty to fulfill many tasks. In June 2023, Progress, a business application software company, announced that its large data file transfer service, MOVEit Transfer, was susceptible to security vulnerabilities. The
company was not able to address these issues in time, and the vulnerability affected many of the organizations using Progress’ MOVEit Transfer service. As a result, nearly 900 colleges and universities, and more than 51,000 individuals, experienced a data breach that compromised information such as Social Security numbers, birthdates, and school records (Donadel, 2023). As technology advances, cybersecurity practices need to evolve to better protect PII. The U.S. Government Accountability Office (2020) notes that safeguarding PII involves everyone, across data collection, storage, and cybersecurity. Data breaches and leaked PII can lead to physical, emotional, and financial harm to an individual. Thus, when an individual discloses personal information to a large institution such as a university, a heavy burden of trust and security is placed on the university and its stakeholders to prevent such information from getting into the wrong hands.

2.1 Purpose
Data continues to evolve, as do the security practices needed to safeguard it. The education sector currently faces a growing responsibility to maintain the confidentiality of student and faculty personal information while fostering an environment instrumental to progressing research and science. Organizations such as The Learning Agency Lab rely on real student submissions, such as essays, to develop learning-based tools and programs that benefit both students and teachers. However, educational datasets are difficult to acquire due to concerns regarding the exposure of PII. Datasets are normally reviewed manually to remove PII, which is costly and time-consuming. By implementing a solution through data science and machine learning, a
more reliable method to identify and remove PII could significantly improve data privacy and allow public educational datasets to be more readily available.

2.2 Definition of Objectives
This project aims to train and evaluate several robust machine learning models to detect and efficiently remove PII from large datasets. A final model will be selected based on the expectation that it can maintain high recall and precision scores, minimizing false positives and negatives. Additionally, the model should have a low runtime to be more scalable in future applications. A successful model would significantly alleviate the education sector's challenge in maintaining data privacy. This would enable researchers to use high-quality public datasets and enhance student privacy. Should the proposed model fail to meet the expectations of this project, further hyperparameter tuning, refinement, and exploration would be necessary to address PII detection.

3 Literature Review
Multiple studies have examined various PII risks and management strategies across multiple domains. Many studies have investigated approaches such as machine learning and deep learning models for detecting and anonymizing PII in medical health records, the significance of privacy measures in online learning environments, techniques for identifying PII in unstructured text corpora, cybersecurity risks in higher education, and the use of federated learning for safeguarding PII financial data. These studies emphasize the need to protect PII data and implement robust privacy measures to reduce risks and maintain individuals' privacy across different contexts and industries.
3.1 Identification and Processing of PII Data, Applying Deep Learning Models with Improved Accuracy and Efficiency
One study explored the use of deep learning models in maintaining data privacy for large enterprises. A natural language processing (NLP) based large language model was developed to automatically detect PII data and mask such information. Additionally, support vector machine, random forest (RF), logistic regression (LR), long short-term memory, and multilayer perceptron models were trained to detect and anonymize data. These models used text that was converted into vectors to detect PII data. The final results suggested that neural network-based models were the most proficient at precisely identifying PII data when building NLP-based large language models (Mitra & Roy, 2018).

3.2 Anonymization of Sensitive Information in Medical Health Records
Protected health information (PHI) is any information that contains sensitive medical records identifying an individual, such as health care services, diagnoses, treatments, and billing information. This information cannot be directly shared outside of the hospital; thus, exhaustive deidentification of all PII and PHI is required. One study explored using NLP to remove PHI from Spanish clinical records. Given the dataset this study worked with, its neural network model performed the best, using token-level features and static dictionaries of Spanish names and locations (Saluja et al., 2019).
3.3 Supporting Privacy, Trust, and Personalization in Online Learning
Online learning environments require robust privacy measures to protect against data breaches. Experts like Jim Greer emphasize the importance of privacy, trust, and personalization, with three significant privacy theories (limitation theory, control theory, and contextual integrity theory) essential to addressing these concerns. Greer's team has developed privacy preferences and identity management features to maintain privacy and trust. However, as data technologies advance, privacy becomes increasingly difficult to maintain, and service providers must uphold strict privacy standards to ensure the integrity of online learning environments (Anwar, 2021).

3.4 Personally Identifiable Information (PII) Detection in the Unstructured Large Text Corpus Using Natural Language Processing and Unsupervised Learning Technique
Recognizing the significance of safeguarding PII data privacy, numerous research endeavors have produced a plethora of valuable approaches for implementing robust privacy measures for PII data. Although contrasting facts highlight divergent perspectives on the effectiveness of rule-based approaches versus machine learning models, using models such as the clustering-based PII detection model to validate facts underscores the potential of hybrid deep learning techniques, which can help enhance accuracy (Kulkarni & Cauvery, 2021).

3.5 A Systematic Review of Cybersecurity Risks in Higher Education
Higher education faces unique cybersecurity challenges due to academic freedom and collaborative research environments. There is a lack of empirical research on security practices within academia. Higher educational institutions manage valuable assets such as personal information, financial data, research, and intellectual property. Strategic cybersecurity risks in higher education include data leakage, financial fraud, and attacks on data integrity. Security operations centers and computer emergency response teams should prioritize information sharing and incident data collaboration to enhance cybersecurity resilience. Maturity modeling and baseline studies can address critical gaps in empirical research and improve security practices in academia (Ulven & Wangen, 2021).

3.6 Federated Learning for Privacy-Preserving: A Review of PII Data Analysis in Fintech
Facing the challenges of protecting PII data and addressing cybersecurity issues, new methods involving artificial intelligence have emerged. Federated learning is a recently developed method to protect confidential data analysis involving privacy or sensitive information. The solution involves identifying PII data via named entity recognition and using supervised machine learning to verify the relationships between entities. Several automated solutions are provided by this study, such as locating, tracking, and securing personal data in different situations, which guards against data leakage (Dash et al., 2022).

4 Methodology
This project used Jupyter Notebooks and Python (Version 3.9) to load datasets, preprocess text data, and train, test, and evaluate machine learning models. Two datasets were used to build these models: train.json had five columns and 6,807 rows, while test.json had four columns and 10 rows.

4.1 Data Acquisition and Aggregation
The datasets were found through a Kaggle competition hosted by Vanderbilt University and The Learning Agency Lab, an Arizona-based independent nonprofit in the education sector (Holmes et al., 2024). Both the train and test datasets had common variables such as document number, full text, tokens, and trailing whitespace. The train set also included PII labels such as names, emails, usernames, identification numbers, phone numbers, personal URL addresses, and street addresses that might be associated with a student.

4.1.1 Exploratory Data Analysis. The train dataset had PII labels that were categorized according to the BILUO scheme tag format. This format is generally used to create a formatted spaCy library text string that saves tokens from documents to be more performance-driven (Prakash, 2020). The PII labels either start with the tag “B” for the beginning (first) token within a multi-token entity, or “I” for an inner token within a multi-token entity, as illustrated in the sketch below. Tables 1 and 2 provide a more detailed depiction of the train dataset and the PII label distribution. Table 1 counts the documents in which each PII label appears. The proportion of each label implies that Name_Student is the most often seen PII label in documents, both at the beginning and in-between an entity. Overall, the Name_Student label is seen in 24.05% of documents. On the other hand, the Username and Email labels are least represented within documents, especially when located in-between entities. In fact, only 24 documents contain email addresses and 5 documents usernames. These labels are uncommon and were only seen at the beginning of a multi-token entity.
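To make the tagging scheme concrete, the following is a minimal sketch of how tokens pair with the B/I-prefixed PII labels; the sentence, token boundaries, and label spellings are invented for illustration and are not drawn from the competition data.

```python
# Hypothetical tokens and BILUO-style labels: "B-" marks the first token of a
# multi-token entity, "I-" marks inner tokens, and "O" marks non-PII tokens.
tokens = ["My", "name", "is", "Jane", "Doe", "and", "my", "email", "is", "jdoe@example.com"]
labels = ["O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "O", "O", "O", "B-EMAIL"]

# Print the token/label alignment the way the train set stores it.
for token, label in zip(tokens, labels):
    print(f"{token:20s}{label}")
```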
Table 1
Distribution of PII Labels in Documents

Label               Count   Percent (%)
B-name_student        891     12.09
I-name_student        814     11.96
B-URL_personal         72      1.06
B-ID_number            33      0.48
B-email                24      0.35
B-username              5      0.07
B-phone_number          4      0.06
I-phone_number          3      0.04
B-street_address        2      0.03
I-street_address        2      0.03
I-ID_number             1      0.01
I-URL_personal          1      0.01
I-email                 0      0.00
I-username              0      0.00

Table 2 illustrates the named entity recognition (NER) frequency of each label occurring within individual documents, providing further granular information on PII labels. This was done by creating a pipeline that used spaCy to analyze the number of times each of these PII labels is presented in each document. This analysis indicated that name_student had the highest frequency and was seen multiple times within a single document. It also implied that documents would rarely contain information on emails, usernames, or personal websites.

Table 2
NER Frequency Distribution of PII Labels in Documents

Label               Count   Percent (%)
B-name_student      1,365     49.84
I-name_student      1,096     40.01
B-URL_personal        110      4.02
B-ID_number            78      2.85
B-email                39      1.42
B-username              6      0.22
B-phone_number          6      0.22
I-phone_number         15      0.55
B-street_address        2      0.07
I-street_address       20      0.73
I-ID_number             1      0.04
I-URL_personal          1      0.04
I-email                 0      0.00
I-username              0      0.00

4.2 Data Quality
Given that this project delves into text data analysis, the pre-processing phase consisted of inspecting the data for signs of bias, imbalance, and noise.

4.2.1 PII Labels. The dataset had a total of 14 labels that were categorized as either at the beginning of an entity (‘B’) or in-between (‘I’). This meant that in actuality, seven PII labels existed, with the sub-category defining the token placement within an entity. As seen in Figure 1, there was a slightly larger distribution of PII tokens found at the beginning of an entity compared to in-between an entity.

Figure 1
Distribution of Train Dataset Token Scheme Tag

4.3 Feature Engineering
The project expected to use text classification models that could handle processing multiple labels.

4.3.1 Tokenization. A pipeline was created to preprocess the list of tokens to
lowercase all tokens, remove punctuation, remove stopwords, remove empty tokens, and remove newline characters. Each token was associated with a label identifying whether or not the token was considered PII, and if so, to which specific PII label the token corresponded. Thus, the pipeline preprocessed both of these features simultaneously. This made tokens more consistent across all documents and eliminated symbols or text that might create additional noise within the dataset.
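A minimal sketch of such a preprocessing step is shown below, assuming NLTK's English stopword list; the helper name and sample tokens are illustrative, not the project's actual code.

```python
import string

from nltk.corpus import stopwords  # assumes nltk.download("stopwords") has been run

STOPWORDS = set(stopwords.words("english"))

def preprocess(tokens, labels):
    """Lowercase tokens and drop punctuation, stopwords, and empty tokens,
    keeping each surviving token aligned with its PII label."""
    cleaned = []
    for token, label in zip(tokens, labels):
        token = token.lower().replace("\n", "").strip()
        if not token or token in STOPWORDS:
            continue
        if all(ch in string.punctuation for ch in token):
            continue
        cleaned.append((token, label))
    return cleaned

# Hypothetical input showing that token/label alignment is preserved.
print(preprocess(["My", "name", "is", "Jane", "\n", "!"],
                 ["O", "O", "O", "B-NAME_STUDENT", "O", "O"]))
```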
4.3.2 Descriptive Statistics of Tokens. The descriptive statistics of both the train and test datasets after preprocessing are illustrated in Table 3. The train dataset had a total of 2,230,310 tokens, of which 1.98% were unique. Additionally, the overall tokens did not appear to be lexically diverse, indicating similar language used throughout all the documents. The training dataset contained a large corpus of text for model training, with extensive textual content present in the dataset. Conversely, the test dataset was much smaller, with only 3,377 total tokens. The calculated proportion of unique tokens was much higher, at 44.45%, given that there are fewer tokens overall. The test dataset is much more lexically diverse, but only due to the smaller dataset size and use of fewer documents.

Table 3
Pre-Processed Tokens Descriptive Statistics

Metric               Train        Test
Tokens               2,230,310    3,377
Unique tokens        44,088       1,501
Character count      14,626,600   22,651
Lexical diversity    0.02         0.44

4.3.3 Feature Transformations. The labels were transformed using a multi-label binarizer to indicate to models that a document can contain more than one PII label. Additionally, the tokens were transformed using a term frequency-inverse document frequency (TF-IDF) vectorizer to convert the text data into a numerical format that would better fit the models. This vectorizer weighs the number of occurrences of a word within a document against the number of documents in which it appears.

4.4 Modeling
4.4.1 Selection of Modeling Techniques. The project sought machine learning algorithms that could perform text processing and classification tasks with minimal processing time. A baseline model was created to identify baseline predictors and metrics that would be monitored. This was then followed by more complex NLP models that would undergo hyperparameter tuning.

4.4.1.1 Training, Validation, and Test Partitioned Datasets. The original data included a train and a test set, but for the purpose of training models, the train dataset was further partitioned. A train and validation set were created with an 80/20 split from the preprocessed train dataset.

4.4.1.2 Logistic Regression. A baseline model was important in evaluating the performance of all models being trained for this project. More specifically, a logistic regression model was used as a baseline because it is able to effectively handle high-dimensional data while being a fairly basic model that is easy to understand. This model was trained using a maximum of 1,000 iterations.
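The sketch below illustrates this baseline under stated assumptions: a TF-IDF vectorizer over whole documents, a multi-label binarizer over per-document label sets, and a one-vs-rest wrapper around logistic regression. The toy documents and label sets are invented, and the paper does not state that a one-vs-rest wrapper was the exact mechanism used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for the preprocessed documents and their PII label sets.
docs = [
    "jane doe wrote this reflection essay",
    "contact jdoe at jdoe@example.com for details",
    "this document contains no personal information",
]
doc_labels = [{"B-name_student", "I-name_student"}, {"B-email"}, set()]

# One indicator column per PII label, allowing multiple labels per document.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(doc_labels)

# TF-IDF turns each document into a sparse numeric feature vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Baseline logistic regression with a maximum of 1,000 iterations, as in the paper.
baseline = OneVsRestClassifier(LogisticRegression(max_iter=1000))
baseline.fit(X, y)

# Predict PII labels for an unseen snippet.
new_doc = vectorizer.transform(["email jdoe@example.com about the essay"])
print(mlb.inverse_transform(baseline.predict(new_doc)))
```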
4.4.1.3 Random Forest (RF). RF models are widely used for machine learning tasks such as image processing, health care, and text processing, as in the case of this project. This model was selected for the project due to the algorithm’s simplicity, ability to handle imbalance, and ease of interpretation. As the name suggests, the model builds several decision trees that are visually interpretable by a wide audience. An initial RF model was created with 100 estimators. This model was then tuned using grid search to identify which combination of hyperparameters would yield the best performance. In this hyperparameter tuning, the model was fitted in 3 folds for 288 candidates each, totalling 864 fits of the RF model.

4.4.1.4 Extreme Gradient Boosting (XGBoost). Much like the RF model, this is a variation of the decision tree model that uses gradient-boosted decision trees. This model generally has a longer runtime and requires more resources to tune, but yields high-performance results. Due to the longer runtime of the XGBoost model, random search was chosen over grid search for hyperparameter tuning because it randomly selects a limited set of candidates, using less processing time. This fitted 3 folds of 10 candidates each, resulting in 30 total fits.

4.4.1.5 K-Nearest Neighbors (K-NN). This is a powerful algorithm that is often applied to classification or regression tasks. K-NN is easy to understand and implement, as it groups similar data based on the number of neighbors. K-NN can learn directly from the instances in the training data because it is a non-parametric algorithm that does not make assumptions about the distribution of the data. The initial K-NN model ran through several scenarios in which the value k varied from two to 29, and found the optimal k was either ten or eleven. An additional K-NN model went through further grid search parameter tuning in which five folds of 28 candidates were trained, totalling 140 fits. This parameter tuning found that the optimal k value could be six.
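The following sketch contrasts the two tuning strategies on toy data; the parameter grids and fold counts are illustrative rather than the paper's actual search spaces, and the randomized-search example assumes the xgboost package is available.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# Toy single-label data standing in for the vectorized token features.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Grid search (used for the RF and K-NN models): exhaustively fits every
# combination in the grid with 3-fold cross-validation.
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 10, 20]},
    cv=3,
)
rf_search.fit(X, y)
print("Best RF params:", rf_search.best_params_)

# Randomized search (used for XGBoost): samples only n_iter candidates from
# the parameter space, trading exhaustiveness for shorter tuning time.
xgb_search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_distributions={
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.1, 0.3],
        "n_estimators": [100, 200],
        "subsample": [0.7, 1.0],
    },
    n_iter=10,
    cv=3,
    random_state=42,
)
xgb_search.fit(X, y)
print("Best XGBoost params:", xgb_search.best_params_)
```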
4.4.1.6 Presidio Model. Presidio, a software development kit built by Microsoft, was also selected because it is able to facilitate PII detection and anonymization at an organizational scale. This model is able to use regex to recognize patterns, leverage NLP to detect entities, validate patterns, and apply anonymization techniques that would be scalable for this project. It also has the flexibility to be expanded with other types of custom recognizers, like the BILOU scheme-based labels used in the dataset.
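A minimal sketch of Presidio's analyze-then-anonymize flow is shown below; it assumes the presidio-analyzer and presidio-anonymizer packages and a spaCy English model are installed, and the sample text is invented.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Invented sample text containing obvious PII.
text = "Contact Jane Doe at jane.doe@example.com or 212-555-0123."

# The analyzer combines regex-based pattern recognizers with NLP-based
# entity detection to locate PII spans in the text.
analyzer = AnalyzerEngine()
findings = analyzer.analyze(text=text, language="en")

# The anonymizer replaces each detected span; by default the replacement
# is the entity type in angle brackets, e.g. <PERSON> or <EMAIL_ADDRESS>.
anonymizer = AnonymizerEngine()
result = anonymizer.anonymize(text=text, analyzer_results=findings)
print(result.text)
```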
5 Results and Findings
Both statistical and performance metrics were used to assess each model. Precision, recall, accuracy, and F1 scores were used to evaluate the statistical performance of each model, in addition to the efficiency of each model. Efficiency was used to determine the suitability of a model in a real-world context in which thousands of documents would need to be evaluated.

5.1 Evaluation of Results
One of the most important metrics used for evaluation was the efficiency with which the model was able to run. As seen in Formula 1, the model's F1 score was considered in addition to the number of seconds it takes for the submission to be evaluated:

Efficiency = (max F1 − F1) / (max F1 − Benchmark) + RuntimeSeconds / 324,000    (1)

Figure 2 illustrates the runtime of each model, including variations of models that went through several hyperparameter tuning iterations. The baseline logistic regression model had the fastest runtime, most likely due to its small number of parameters.

5.1.1 Model Runtime. Among hyperparameter tuning methods, randomized search is generally considered more efficient because it randomly selects a limited number of parameter combinations to run, while grid search iterates over the entire parameter grid. The XGBoost model used randomized search to fit several parameters such as max depth, learning rate, number of estimators, and subsample. On the other hand, the K-NN and RF models were tuned using grid search. Even with the more efficient tuning method, the XGBoost model had a longer runtime.

Figure 2
Model Runtime (seconds)
5.1.2 Validation Performance. As seen in Tables 4 and 5, the initial validation performance of the models was calculated taking into account classification metrics such as accuracy, precision, recall, and F1. The models had similar performance in the accuracy of predicting PII labels; more variation can be seen in the precision, recall, and F1 scores.

Both K-NN models had similar performance despite using different hyperparameter tuning methods and identifying a different number of optimal neighbors. While the initial K-NN found k = 10 to be the best, the grid search found k = 5 to perform the best. This model performed the best during the validation stage when looking at the precision of evaluating PII labels. The baseline logistic regression model performed best when looking at the recall performance of all models, with the exception of the XGBoost model, which was marginally better than the baseline. The XGBoost model also performed the best when evaluating the F1 score. The results from the Presidio analyzer were the lowest performing: its F1 and precision scores were much lower in comparison to the other models.

Table 4
Model Validation Performance Precision and Recall Metrics

Model                  Precision   Recall
Logistic regression    0.7405      0.8605
RF                     0.8859      0.7815
RF - grid search       0.7810      0.7810
K-NN                   0.9897      0.7827
K-NN - grid search     0.9150      0.7844
XGBoost                0.8118      0.8642
Presidio               0.0727      0.2582

Table 5
Model Validation Performance F1 and Accuracy Metrics

Model                  F1       Accuracy
Logistic regression    0.7960   0.8605
RF                     0.7821   0.8605
RF - grid search       0.7810   0.8605
K-NN                   0.7843   0.8612
K-NN - grid search     0.7876   0.8605
XGBoost                0.8088   0.8642
Presidio               0.1104   0.2582
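A small sketch of how these four metrics can be computed for multi-label predictions is shown below; the toy label matrices are invented, and micro-averaging is an assumption, since the paper does not state which multi-label average it reports.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy binarized label matrices standing in for validation labels/predictions.
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 0, 0], [1, 0, 1]])

scores = {
    "precision": precision_score(y_true, y_pred, average="micro", zero_division=0),
    "recall": recall_score(y_true, y_pred, average="micro", zero_division=0),
    "f1": f1_score(y_true, y_pred, average="micro", zero_division=0),
    # For multi-label data, sklearn's accuracy is exact-match (subset) accuracy.
    "accuracy": accuracy_score(y_true, y_pred),
}
print(scores)
```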
5.1.3 Test Performance. The models were applied to the validation and test datasets to observe the various performance metrics, as seen in Figures 3 through 6. The evaluation metrics used include accuracy, precision, recall, and F1 scores, providing a comprehensive assessment of each model's performance.

Figure 3 illustrates the precision performance. This metric identifies the proportion of true positive labels out of all the positive predictions made. The K-NN model performed the best during the validation stage, but dropped significantly in performance when applied to the test set. The same occurred with the XGBoost and Presidio models. On the other hand, both the logistic regression and RF models computed zero false positives in the test set, improving in performance from the validation to the test set.

Figure 3
Precision Score of Validation and Test Set

In Figure 4, the recall of each model is evaluated on both the validation and test sets. This score evaluates the proportion of true positives out of all the actual positives and determines whether a model can identify the most relevant labels. All of the models did well during the validation phase but dropped in performance when applied to the test set. The baseline logistic regression model had a minimal drop in performance, but the RF, K-NN, and XGBoost models saw a significant decrease. This implies that these models made few false positive predictions but were not able to identify many actual positive instances.

Figure 4
Recall Score of Validation and Test Set
The F1 score of the models, the harmonic mean between precision and recall, is illustrated in Figure 5. This metric gives equal weight to both precision and recall and reflects the class distribution. The baseline logistic regression model was the only model that performed better on the test set. This was followed by the RF models, with similar performance between the K-NN and XGBoost models. This score might have been impacted by the class imbalance of the PII labels during the modeling stage, as the training dataset saw large occurrences of name_student labels compared to other labels.

Figure 5
F1 Score of Validation and Test Set

Lastly, the accuracy of the model performances is seen in Figure 6. This metric evaluates the proportion of correct predictions within all predictions made. The K-NN and XGBoost models were highly inaccurate on the test set, despite having high performance on the validation set. On the other hand, the logistic regression model appeared to perform the best, followed by the RF models. Together, the four classification metrics imply that the models may have been overfitted on the training data, which is why they performed well on the validation set but not on the test set. Additionally, there may have been random variation causing a discrepancy between the validation and test sets. The baseline logistic regression model and RF models performed the best, followed by the K-NN and XGBoost models.

Figure 6
Accuracy Score of Validation and Test Set

6 Discussion
Text classification tasks require an understanding of the context in which words are being used. In the context of detecting PII, one priority is to prevent false negative and false positive detections; otherwise, sensitive information on an individual could be disseminated to the public. Several studies have used machine learning algorithms to automate the process of detecting and removing PII in contexts such as the medical, financial, and educational industries. In this study, the RF model appeared to perform the best in detecting PII. It was important to use supervised learning models for this study to maintain a high level of understanding and interpretability, as sensitive data was being used to train these models. Given the limited timeframe of this project, more hyperparameter tuning and evaluation of the models used in this study could have been done to provide a better understanding of what
indicators might detect PII. This is especially true in the case of the Presidio model, in which NLP was conducted using a broader range of text data.

6.1 Conclusion
This study evaluated several supervised machine learning models for detecting and removing PII from large academic documents. After evaluating the performance metrics of all models, the RF model illustrated better performance compared to the other models. On the validation set, the K-NN model had the highest precision score and fastest runtime. Setting aside runtime performance, the XGBoost model outperformed the K-NN model in both the recall and F1 scores in the validation phase. However, after testing the models on the test dataset, the RF model was able to exceed its initial validation results. Additional hyperparameter tuning and optimization could have helped improve the performance and robustness of both models. Overall, this study enhanced the comprehension and utilization of machine learning on PII data through text classification. By understanding the strengths and weaknesses of each algorithm, organizations can make better decisions to enhance data privacy and security practices in the future.

6.2 Recommended Next Steps
PII detection is critical in safeguarding sensitive information and ensuring compliance with privacy regulations. While machine learning models were used for PII detection, there is still room to improve the effectiveness and reliability of such models. Additional feature engineering and additional training data could be implemented to help
improve model performance. Through extracting or formulating more information out of a broad range of PII datasets, the project could have potentially captured more accurate underlying patterns between the documents and PII labels. Additionally, other feature scaling and selection techniques could have been used to streamline the feature space. Utilizing advanced models can further optimize project performance. Ensemble methods provide a robust means of boosting overall predictive accuracy by harnessing the strengths of individual models and combining their predictions. Artificial intelligence has grown exponentially over the past years, and more modern models have appeared, such as large language models (LLMs). Variations of LLMs, such as bidirectional encoder representations from transformers (BERT) and generative pretrained transformers, could have been implemented to better understand the context and classification of large documents. These LLMs require extensive computational resources; however, they could have better predicted target labels with pre-trained models and predefined labels.

References

Anwar, M. (2021). Supporting privacy, trust, and personalization in online learning. International Journal of Artificial Intelligence in Education, 31, 769–783. https://doi.org/10.1007/s40593-020-00216-0

Dash, B., Sharma, P., & Ali, A. (2022, July). Federated learning for privacy-preserving: A review of PII data analysis in fintech. International Journal of Software Engineering & Applications, 13(4), 1–13. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4323967
Donadel, A. (2023, September 28). This data breach has compromised nearly 900 institutions. University Business. https://universitybusiness.com/in-just-3-months-this-data-breach-has-compromised-nearly-900-institutions/

Ersinesen, A. (2023). Discovery, classification, and protection of PII: Current state, problems, solution proposals. Journal of Privacy and Security, 15(2), 127–142. https://medium.com/@ersinesen/discovery-classification-and-protection-of-pii-current-state-problems-and-solution-proposals-3627e16f2d8b

Holmes, L., Crossley, S., Baffour, P., King, J., Burleigh, L., Demkin, M., Holbrook, R., Reade, W., & Howard, A. (2024). The Learning Agency Lab - PII data detection. Kaggle. https://kaggle.com/competitions/pii-detection-removal-from-educational-data

Kulkarni, P., & Cauvery, N. K. (2021). Personally identifiable information (PII) detection in the unstructured large text corpus using natural language processing and unsupervised learning technique. International Journal of Advanced Computer Science & Applications (Online), 12(9). https://doi.org/10.14569/ijacsa.2021.0120957

Nowicki, J., & Young, C. (2020, October 15). Data security: Recent K-12 data breaches show that students are vulnerable to harm. U.S. Government Accountability Office. https://www.gao.gov/products/gao-20-644

Poornima, K., & Cauvery, N. K. (2021). Personally identifiable information (PII) detection in the unstructured large text corpus using natural language processing and unsupervised learning technique. International Journal of Advanced Computer Science and Applications, 12(9), 508–517. https://thesai.org/Downloads/Volume12No9/Paper_57-Personally_Identifiable_Information_PII_Detection.pdf

Prakash, P. (2020, March 18). Extend named entity recogniser (NER) to label new entities with spaCy. Towards Data Science. https://towardsdatascience.com/extend-named-entity-recogniser-ner-to-label-new-entities-with-spacy-339ee5979044

Ratinov, L., & Roth, D. (2009). Design challenges and misconceptions in named entity recognition. In S. Stevenson & X. Carreras (Eds.), Proceedings of the thirteenth conference on computational natural language learning (CoNLL-2009; pp. 147–155). https://aclanthology.org/W09-1119

Roy, S., & Mitra, M. (2018). Identification and processing of PII data, applying deep learning models with improved accuracy and efficiency.

Saluja, B., Kumar, G. S., Sedoc, J., & Callison-Burch, C. (2019). Anonymization of sensitive information in medical health records. IberLEF@SEPLN, 2421, 647–653. http://ceur-ws.org/Vol-2421/MEDDOCAN_paper_2.pdf

Ulven, J. B., & Wangen, G. (2021). A systematic review of cybersecurity risks in higher education. Future Internet, 13(2), 39. https://doi.org/10.3390/fi13020039
Emotionality Analysis of Climate Change Communication in News Media

Vivian Do
Applied Data Science Master’s Program, Shiley Marcos School of Engineering, University of San Diego
vdo@sandiego.edu

Bryan Flores
Applied Data Science Master’s Program, Shiley Marcos School of Engineering, University of San Diego
bryanflores@sandiego.edu
ABSTRACT
Despite mounting evidence, the dissemination of climate change-related information through news channels is frequently mired in political ideologies. This often leads to conflicting messages regarding the validity of climate change and its underlying causes. Given the influential role of the media in shaping public opinion and steering policy discourse, analyzing how climate change is framed within media narratives is crucial to understanding public attitudes and sentiments. Our methodology leveraged the GDELT Project's Climate Change Television transcription dataset, encompassing over 95,000 media snippets. We utilized topic modeling and sentiment analysis to identify key themes and sentiments regarding climate change. Furthermore, we conducted a comparative analysis between traditional Latent Dirichlet Allocation (LDA) and contemporary (ChatGPT GPT-3.5-turbo-0125) topic modeling techniques. We categorized 250 randomly selected snippets into five distinct topic categories: climate change impact, climate crisis, carbon emissions reduction, climate action efforts, and global warming impact. We evaluated GPT model outputs against LDA-derived topic labels using BERTScore, which returns precision, recall, and F1 scores. The evaluation yielded average F1 scores as follows: 0.87 for carbon emissions reduction, 0.85 for climate action efforts, 0.84 for climate change impact, 0.88 for climate crisis, and 0.89 for global warming impact. Our study, using modern methods, aids readers and listeners in identifying the subjects of the media they are consuming, thereby enhancing their understanding of how climate change is portrayed across different platforms.

1 Introduction
Climate change, also often referred to as the climate crisis, has garnered significant attention in recent decades due to its profound impact on the environment, ecosystems, and human societies worldwide. Scientific consensus overwhelmingly supports that climate change is primarily driven by human activities. The American Association for the Advancement of Science and 17 other scientific associations concluded “the scientific evidence is clear: global climate change caused by human activities is occurring now, and it is a growing threat to society” (American Association for the Advancement of Science, 2009, para. 1).

2 Background
Despite the scientific evidence supporting anthropogenic causes of climate change, media messaging is often clouded with political ideologies and economic interests. The politicization of climate change results in many
conflicting messages being disseminated throughout the public sphere, with partisan agendas overshadowing scientific facts and evidence. For example, while visiting California amidst a series of devastating wildfires in September 2020, former President Donald Trump expressed doubts that climate change was to blame, stating “It’ll start getting cooler. You just – you just watch” (Jacobo, 2020, para. 2). During his presidency, Donald Trump also left the Paris Agreement and repealed many Obama-era regulations regarding coal production, fracking, and emission rules (Welch & Gibbens, 2020). Across the United States, climate change has become highly polarized, with political divisions over its validity, impact on human lives, and appropriate policy responses. Despite the overwhelming scientific consensus on the reality and severity of climate change, a significant portion of the population remains skeptical and even dismissive of the issue. Recent studies indicate that approximately one in seven Americans do not believe that a climate crisis exists (Dewan, 2024). This skepticism can be attributed, at least in part, to the diverse and often contradictory messaging surrounding climate change in the media. The framing of climate change narratives and the selection of topics for coverage contribute to shaping public perceptions and attitudes.

2.1 Problem Identification and Motivation
Uncovering how climate change is portrayed and perceived in the media is essential for several reasons. Media representations play a significant role in shaping public opinion and guiding policy discussions. By analyzing media coverage of climate change, researchers can gain
insights into the dominant narratives and biases that may be at play. Understanding public attitudes and sentiments toward climate change is essential for the development of outreach efforts to raise awareness and foster collective action.

2.2 Definition of Objectives
There are two primary objectives for this study. First, we aim to identify the framing of climate change and associated subtopics using topic modeling. By categorizing and labeling different topics, we can gain a nuanced understanding of the specific subtopics emphasized in the media. Second, we intend to conduct sentiment analysis to discern the overall sentiment. Through these analytical approaches, we aim to contribute to a deeper understanding of how climate change is communicated and perceived in the media, with implications for public discourse and climate advocacy efforts.

3 Literature Review (Related Works)
3.1 Communicating Climate Change and Health in the Media
Depoux et al. (2017) analyzed the evolution of discourse on climate change disseminated to the public in two different forms of media: the French newspaper Le Monde and Twitter tweets. Depoux et al. found that framing climate change as a public health concern, rather than an environmental issue, has become more pertinent in climate change reporting. Furthermore, highlighting the health risks associated with climate change in conjunction with potential solutions was more effective in eliciting a response and increasing involvement in climate change (Depoux et al., 2017). Effective communication plays a crucial role in prompting a response to climate change.
However, scientific voices often struggle to convey information to diverse audiences, including the general public, politicians, and key stakeholders in climate change. The conclusions from this study suggest that individuals are more inclined to act when they perceive a direct relevance to themselves.

3.2 Rethinking Climate Communications and the “Psychological Climate Paradox”
There are five psychological barriers that prevent the facts about climate change from being internalized and influencing behavior: (a) climate seems distant in time, space, and influence; (b) incorrect framings backfire on the message; (c) dissonance (i.e., lack of meaningful action weakens attitudes); (d) doubt and dissonance strengthen denial; and (e) the climate message is filtered through cultural identity (Stoknes, 2014). Current factual scientific information campaigns and economic cost-effectiveness arguments have not been sufficient to convince the public to support climate policies, and although most countries have access to the necessary solutions, documents, and resources to solve the climate problem, politicians have been reluctant to bear the costs and prefer stronger demands from citizens. We must develop a multidisciplinary approach to climate communication that incorporates evidence-based practical communication and actively addresses the five psychological barriers to create a more personal message (Stoknes, 2014). Our work seeks to address these concerns in the media’s climate communication through topic modeling and sentiment analysis. Through topic modeling, we can detect phrase patterns to characterize the excerpts into one of the stated psychological
barriers, and sentiment analysis will provide us with subjectivity and objectivity identification. In other words, sentiment analysis allows us to determine if the author was writing to appeal to the audience’s emotions.

3.3 Trend and Thoughts: Understanding Climate Change Concerns Using Machine Learning and Social Media Data
Shangguan et al. (2021) analyzed the number of tweets during major climate events using a Twitter dataset of tweets discussing climate change. The team performed two primary methods of analysis: topic modeling and sentiment analysis. For topic modeling, the team used a latent Dirichlet allocation (LDA) approach to summarize climate topics and calculate the probabilities of various words appearing in each topic. Shangguan et al. found that many tweets either discussed the importance of climate change, aspects of climate change, or possible solutions to climate change. Their sentiment analysis, using a pretrained RoBERTa-base model to classify sentiments as negative, neutral, or positive, produced results that were heavily skewed toward negative and neutral, with very few positive sentiments. Though we are not analyzing Twitter data (our data are climate change news excerpts), we will use a similar approach for topic modeling. However, LDA and pretrained models will serve as a baseline for evaluating large language model (LLM) results. LLMs provide new flexibility in text-based analysis, and with prompt engineering becoming more prevalent, we will be able to accomplish both tasks with a single input instead of two separate methods (Shangguan et al., 2021).
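As a concrete illustration of the LDA approach described above, the sketch below fits a tiny topic model with scikit-learn; the snippets are invented stand-ins for the transcript data, and the topic count is arbitrary.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented snippets standing in for the television transcript excerpts.
snippets = [
    "rising sea levels threaten coastal communities worldwide",
    "lawmakers debate carbon tax and emissions reduction targets",
    "wildfires intensify as global warming accelerates",
    "new policy aims to cut greenhouse gas emissions by 2030",
]

# LDA models each document as a mixture of topics and each topic as a
# distribution over words, so the top-weighted words summarize a topic.
counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(snippets)
lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(X)

vocab = counts.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top_words = [vocab[j] for j in weights.argsort()[::-1][:4]]
    print(f"Topic {i}: {top_words}")
```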
3.4 A Survey on Sentiment Analysis Methods, Applications, and Challenges
Wankhade et al. (2022) discussed the various methodologies for sentiment analysis, such as lexicon-based, machine learning, and hybrid approaches. Lexicon-based sentiment analysis is an unsupervised technique and can be applied to many industries. The main disadvantage is domain dependence, but it can be overcome by the development of a domain-specific lexicon dictionary or the adaptation of an existing library. The machine learning approach can be applied to unsupervised and supervised problems and includes commonly used algorithms such as naive Bayes, support vector machines, and logistic regression, among others. Hybrid approaches combine machine learning and lexicon-based methodologies and can be used for polarity recognition (Wankhade, 2022). To analyze news article excerpts, we will apply an unsupervised hybrid approach to identify overall sentiments for each news station.

3.5 Topic Modeling: Perspectives From a Literature Review
Grisales et al. (2023) analyzed the evolution of topic modeling, the main areas in which it is applied, and recommended models for specific types of data. Their study had three main objectives: map scientific production using topic modeling, identify prominent authors and journal articles, and identify main applications and emerging trends. Four clusters were identified for the main applications of topic modeling: social media, information sciences, sentiment analysis, and short text. Furthermore, sentiment analysis and short text make up 24% and 26% of applications, respectively. Topic modeling is a versatile and well-received technique across academic and research contexts (Grisales et al., 2023). Our research will use topic modeling on excerpts of news articles, together with sentiment analysis, to categorize underlying themes within different types of climate communication.
4 Methodology
4.1 Data Overview
The data consists of 90,863 instances of television news coverage of climate change across BBC News, CNN, MSNBC, and FOX News between July 2009 and January 2020. Data were obtained through GDELT's Television Explorer interface to the Internet Archive's Television News Archive by using the following keywords: climate change, global warming, climate crisis, greenhouse gas, greenhouse gases, or carbon tax. Each observation contains the news snippet where climate change was mentioned, along with the time (in the UTC timezone), station, show, and a URL link to a 15-second video clip of the mention on the Internet Archive website. In total, there were 25,593 mentions of climate change for MSNBC, 23,837 for FOX News, 22,693 for BBC News, and 18,740 for CNN. Notably, there was a spike in coverage toward the end of 2009, followed by a significant decline at the beginning of the decade (see Figure 1). Over time, there was a gradual increase in media attention, with an all-time high of 3,003 mentions in December 2019.

Figure 1
Volume of News Coverage
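A sketch of the monthly aggregation behind Figure 1 is shown below, assuming a simple two-column schema (timestamp and station) for the exported mentions; the column names and sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows mirroring the GDELT export: one row per climate mention.
df = pd.DataFrame({
    "datetime": pd.to_datetime(["2009-12-07", "2009-12-08", "2019-12-02"]),
    "station": ["MSNBC", "CNN", "BBC News"],
})

# Monthly mention volume per station ("MS" = month-start frequency),
# the aggregation plotted in Figure 1.
monthly = (
    df.set_index("datetime")
      .groupby("station")
      .resample("MS")
      .size()
)
print(monthly)
```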