ADS Capstone Chronicles Revised

Detecting and Removing Personal Identifiable Information Using Machine Learning

Ebad Akhter
Applied Data Science Master's Program, Shiley Marcos School of Engineering, University of San Diego
eakhter@sandiego.edu

Jiaqi He
Applied Data Science Master's Program, Shiley Marcos School of Engineering, University of San Diego
jhe@sandiego.edu

Jacqueline Vo
Applied Data Science Master's Program, Shiley Marcos School of Engineering, University of San Diego
jvo@sandiego.edu

ABSTRACT

This research explored the development and evaluation of several machine learning algorithms designed to perform text classification tasks. These models targeted the detection of personal identifiable information (PII) within academic documents written by students. The objective of this project was to create a machine learning model that would classify words considered PII accurately, precisely, and efficiently. Several algorithms were evaluated, including logistic regression, random forest, extreme gradient boosting (XGBoost), k-nearest neighbors (K-NN), and Presidio. These models were all trained and tested on text data pre-processed with tokenization and feature engineering, with hyperparameter tuning performed via randomized search and grid search. Ultimately, this study found that the random forest model performed best with respect to precision. This study demonstrated the usefulness of machine learning for enhancing data privacy.

KEYWORDS

personal identifiable information, modeling, machine learning, detection, anonymization, privacy

1 Introduction

Protecting personal identifiable information (PII) as data science evolves is paramount. As companies store data by the millions of records, the prevalence of data breaches and the exchange of information among third-party vendors underscore the vulnerability of individuals' data privacy. Within the realm of education technology, commonly referred to as the ed-tech industry, PII poses an obstacle to developing open datasets to advance educational outcomes, as the public release of such data exposes students to potential risks. To mitigate these risks effectively, it is imperative to implement rigorous screening and cleansing procedures for institutional data to identify and remove PII before its public dissemination. Applying data science methodologies such as tokenizing and vectorizing text data can significantly facilitate this task.

2 Background

Educational institutions commonly store large amounts of personal information on students and faculty to fulfill many tasks. In June 2023, Progress, a business application software company, announced that their large data file transfer service, MOVEit Transfer, was susceptible to security vulnerabilities. The

