ADS Capstone Chronicles Revised


company was not able to address these issues in time, which affected many of the organizations using Progress’ MOVEit Transfer service. As a result, nearly 900 colleges and universities and more than 51,000 individuals experienced a data breach that compromised information such as Social Security numbers, birthdates, and school records (Donadel, 2023). As technology advances, cybersecurity practices need to grow and better protect PII. The U.S. Government Accountability Office (2020) notes that everyone is involved in safeguarding PII, from data collection and storage to cybersecurity. Data breaches and leaked PII can lead to physical, emotional, and financial harm to an individual. Thus, when an individual discloses personal information to large institutions such as universities, a heavy burden of trust and security is placed on the university and its stakeholders to prevent that information from getting into the wrong hands.

2.1 Purpose

Data continues to evolve, as do the security practices needed to safeguard it. The education sector faces a growing responsibility to maintain the confidentiality of student and faculty personal information while fostering an environment conducive to advancing research and science. Organizations such as The Learning Agency Lab rely on real student submissions, such as essays, to develop learning-based tools and programs that benefit both students and teachers. However, educational datasets are difficult to acquire due to concerns regarding the exposure of PII. Datasets are normally reviewed manually to remove PII, which is costly and time-consuming. A solution built on data science and machine learning could provide a more reliable method to identify and remove PII, significantly improving data privacy and allowing public educational datasets to be more readily available.

2.2 Definition of Objectives

This project aims to train and evaluate several robust machine-learning models to detect and efficiently remove PII from large datasets. A final model will be selected based on the expectation that it can maintain high recall and precision scores, minimizing both false positives and false negatives. Additionally, the model should have a low runtime so that it scales to future applications. A successful model would significantly alleviate the education sector's challenge in maintaining data privacy. This would enable researchers to use high-quality public datasets and enhance student privacy. Should the proposed model fail to meet the expectations of this project, further hyperparameter tuning, refinement, and exploration would be necessary to address PII detection.

3 Literature Review

Studies have examined PII risks and management strategies across multiple domains. They have investigated machine learning and deep learning models for detecting and anonymizing PII in medical health records, the significance of privacy measures in online learning environments, techniques for identifying PII in unstructured text corpora, cybersecurity risks in higher education, and the use of federated learning for safeguarding PII in financial data. These studies emphasize the need to protect PII and implement robust privacy measures to reduce risks and maintain individuals' privacy across different contexts and industries.

