ADS Capstone Chronicles Revised
Table 1 Distribution of PII Labels in Documents

Label               Count    Percent (%)
B-name_student        891    12.09
I-name_student        814    11.96
B-URL_personal         72     1.06
B-ID_number            33     0.48
B-email                24     0.35
B-username              5     0.07
B-phone_number          4     0.06
I-phone_number          3     0.04
B-street_address        2     0.03
I-street_address        2     0.03
I-ID_number             1     0.01
I-URL_personal          1     0.01
I-email                 0     0.00
I-username              0     0.00

4.2 Data Quality
Because this project analyzes text data, the pre-processing phase consisted of inspecting the data for signs of bias, imbalance, and noise.

4.2.1 PII Labels. The dataset had a total of 14 labels, each categorized by whether the token begins an entity (‘B’) or falls inside one (‘I’). In practice, this meant that seven PII labels existed, with the prefix defining the token's position within the entity. As seen in Figure 1, slightly more PII tokens were found at the beginning of an entity than inside one.

Figure 1 Distribution of Train Dataset Token Scheme Tag

Table 2 shows the named entity recognition (NER) frequency of each label within individual documents, providing more granular information on the PII labels. A pipeline built with spaCy counted the number of times each PII label appears in each document. The analysis indicated that name_student had the highest frequency and often appeared multiple times within a single document. It also implies that documents rarely contain emails, usernames, or personal websites.

Table 2 NER Frequency Distribution of PII Labels in Documents

Label               Count    Percent (%)
B-name_student      1,365    49.84
I-name_student      1,096    40.01
B-URL_personal        110     4.02
B-ID_number            78     2.85
B-email                39     1.42
B-username              6     0.22
B-phone_number          6     0.22
I-phone_number         15     0.55
B-street_address        2     0.07
I-street_address       20     0.73
I-ID_number             1     0.04
I-URL_personal          1     0.04
I-email                 0     0.00
I-username              0     0.00
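The per-document label counting described for Table 2 can be illustrated with a minimal sketch. It is not the authors' spaCy pipeline: it uses plain Python instead, and it assumes each document is a dict with a `labels` list holding one tag per token, with `O` marking non-PII tokens. The function names and the data layout are hypothetical. Percentages are computed as each label's share of all PII-labeled tokens, which appears consistent with the figures reported in Table 2.

```python
from collections import Counter


def entity_type(label):
    """Strip the B-/I- prefix: 'B-email' -> 'email'; 'O' is returned as-is."""
    prefix, sep, rest = label.partition("-")
    return rest if sep and prefix in ("B", "I") else label


def label_frequencies(docs):
    """Count PII labels per document and overall, ignoring the 'O' tag.

    `docs` is assumed to be a list of dicts, each with a 'labels' list
    (hypothetical layout, not taken from the original project).
    Returns (per_document_counts, total_counts, percent_by_label).
    """
    per_doc = [Counter(l for l in d["labels"] if l != "O") for d in docs]
    total = sum(per_doc, Counter())
    n = sum(total.values())
    percent = {lab: round(100 * c / n, 2) for lab, c in total.items()} if n else {}
    return per_doc, total, percent


# Toy input for illustration only; these are not rows from the real dataset.
docs = [
    {"labels": ["B-name_student", "I-name_student", "O", "B-email"]},
    {"labels": ["O", "B-name_student", "O", "O"]},
]
per_doc, total, percent = label_frequencies(docs)
```

Keeping the per-document counters alongside the totals makes it possible to check claims such as a label appearing several times within a single document, while `entity_type` collapses the 14 prefixed labels to the seven underlying PII types.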
4.3 Feature Engineering
The project expected to use text classification models that could handle processing multiple labels.

4.3.1 Tokenization. A pipeline was created to preprocess the list of tokens to