ADS Capstone Chronicles Revised
4.1 Data Acquisition and Aggregation

The datasets came from a Kaggle competition hosted by Vanderbilt University and The Learning Agency Lab, an Arizona-based independent nonprofit in the education sector (Holmes et al., 2024). The train and test datasets shared four variables: document number, full text, tokens, and trailing whitespace. The train set also included PII labels covering names, emails, usernames, identification numbers, phone numbers, personal URLs, and street addresses that might be associated with a student.

4.1.1 Exploratory Data Analysis. The PII labels in the train dataset were categorized according to the BILUO tagging scheme. This format is generally used with the spaCy library to store token-level annotations from documents in a form intended to improve performance (Prakash, 2020). Each PII label starts with either the tag “B”, marking the first token of a multi-token entity, or the tag “I”, marking an inner token of a multi-token entity. Tables 1 and 2 describe the train dataset and the PII label distribution in more detail. Table 1 counts the documents in which each PII label appears. The proportions show that Name_Student is the most common PII label in documents, both at the beginning of an entity and inside it. Overall, the Name_Student label appears in 24.05% of documents. In contrast, the Username and Email labels are the least represented in documents, especially inside entities. Only 24 documents contain email addresses and 5 documents contain usernames. These labels are uncommon and were only seen at the beginning of a multi-token entity.
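The document-level counts described for Table 1 can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes each record stores its token labels under a `labels` key, that labels are written as strings such as `B-NAME_STUDENT` and `I-NAME_STUDENT`, and that non-PII tokens carry the label `O`. The toy records are invented for the example and do not come from the dataset.

```python
from collections import Counter


def document_label_counts(records):
    """Count, for each PII label, how many documents contain it at least once."""
    counts = Counter()
    for record in records:
        # Use a set so a label repeated inside one document is counted once.
        # "O" (assumed to mark non-PII tokens) is excluded.
        labels_in_doc = set(record["labels"]) - {"O"}
        counts.update(labels_in_doc)
    return counts


def label_proportions(records):
    """Share of documents in which each PII label appears."""
    total = len(records)
    return {
        label: n / total
        for label, n in document_label_counts(records).items()
    }


# Hypothetical toy records, used only to illustrate the calculation.
toy_records = [
    {"document": 1, "labels": ["B-NAME_STUDENT", "I-NAME_STUDENT", "O", "O"]},
    {"document": 2, "labels": ["O", "B-EMAIL", "O"]},
    {"document": 3, "labels": ["O", "O"]},
    {"document": 4, "labels": ["B-NAME_STUDENT", "O", "B-NAME_STUDENT"]},
]

counts = document_label_counts(toy_records)
proportions = label_proportions(toy_records)
print(counts["B-NAME_STUDENT"], proportions["B-NAME_STUDENT"])  # 2 0.5
```

Counting each label once per document, rather than once per token, matches the way Table 1 is described: a document with several student names still contributes a single count to that label.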
manage valuable assets such as personal information, financial data, research, and intellectual property. Strategic cybersecurity risks in higher education include data leakage, financial fraud, and attacks on data integrity. Security operations centers and computer emergency response teams should prioritize information sharing and incident data collaboration to enhance cybersecurity resilience. Maturity modeling and baseline studies can address critical gaps in empirical research and improve security practices in academia (Ulven & Wangen, 2021).

3.6 Federated Learning for Privacy-Preserving: A Review of PII Data Analysis in Fintech

Facing the challenges of protecting PII data and addressing cybersecurity issues, this study explores newly emerged methods such as artificial intelligence. Federated learning is a recently developed method for protecting confidential data in analyses that involve private or sensitive information. The solution involves identifying PII data via named entity recognition and using supervised machine learning to determine the relationships between entities. The study provides several automated solutions, such as locating, tracking, and securing personal data in different situations, which guard against data leakage (Dash et al., 2022).

4 Methodology

This project used Jupyter Notebooks and Python (Version 3.9) to load datasets, preprocess text data, and train, test, and evaluate machine learning models. Two datasets were used to build these models: train.json had five columns and 6,807 rows, while test.json had four columns and 10 rows.
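As a rough sketch of the loading step, the snippet below reads a JSON file and reports its row and column counts. It assumes each file holds a JSON array with one object per document; the helper names `load_records` and `describe` are hypothetical and are not taken from the project's notebooks.

```python
import json
from pathlib import Path


def load_records(path):
    """Read a JSON file assumed to contain a list of document objects."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def describe(records):
    """Return (row count, sorted column names) for a list of records."""
    columns = sorted({key for record in records for key in record})
    return len(records), columns
```

Under these assumptions, `describe(load_records("train.json"))` would report the 6,807 rows and five columns stated above, and the same call on test.json would report 10 rows and four columns.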