ADS Capstone Chronicles Revised

indicators might detect PII. This is especially true of the Presidio model, in which NLP was conducted using a broader range of text data.

6.1 Conclusion

This study evaluated several supervised machine learning models for detecting and removing PII from large academic documents. Across the performance metrics of all models, the RF model performed best overall. On the validation set, the K-NN model had the highest precision score and the fastest runtime. Setting runtime aside, the XGBoost model outperformed the K-NN model in both recall and F1 score during the validation phase. However, on the test datasets, the RF model exceeded its initial validation results. Additional hyperparameter tuning and optimization could also have improved the performance and robustness of both models.

Overall, this study advanced the understanding and use of machine learning on PII data through text classification. By understanding the strengths and weaknesses of each algorithm, organizations can make better decisions to enhance data privacy and security practices in the future.

6.2 Recommended Next Steps

PII detection is critical in safeguarding sensitive information and ensuring compliance with privacy regulations. While machine learning models were used for PII detection in this study, there is still room to improve the effectiveness and reliability of such models. Additional feature engineering and additional training data could help improve model performance. By extracting or formulating more information from a broad range of PII datasets, the project could have captured the underlying patterns between the documents and PII labels more accurately. Additionally, other feature scaling and selection techniques could have been used to streamline the feature space.

Utilizing more advanced models could further improve project performance. Ensemble methods provide a robust means of boosting overall predictive accuracy by harnessing the strengths of individual models and combining their predictions. Artificial intelligence has grown rapidly over the past years, and more modern models have appeared, such as large language models (LLMs). Variations of LLMs, such as bidirectional encoder representations from transformers (BERT) and generative pretrained transformers, could have been implemented to better understand the context and classification of large documents. These LLMs require extensive computational resources; however, they could have predicted target labels better by drawing on pre-trained models and predefined labels.

References

Anwar, M. (2021). Supporting privacy, trust, and personalization in online learning. International Journal of Artificial Intelligence in Education, 31, 769–783. https://doi.org/10.1007/s40593-020-00216-0

Dash, B., Sharma, P., & Ali, A. (2022, July). Federated learning for privacy-preserving: A review of PII data analysis in fintech. International Journal of Software Engineering & Applications, 13(4), 1–13. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4323967
