ADS Capstone Chronicles Revised


Table 1 Distribution of PII Labels in Documents

Label               Count   Percent (%)
B-name_student        891   12.09
I-name_student        814   11.96
B-URL_personal         72    1.06
B-ID_number            33    0.48
B-email                24    0.35
B-username              5    0.07
B-phone_number          4    0.06
I-phone_number          3    0.04
B-street_address        2    0.03
I-street_address        2    0.03
I-ID_number             1    0.01
I-URL_personal          1    0.01
I-email                 0    0.00
I-username              0    0.00

4.2 Data Quality

Because this project analyzes text data, the pre-processing phase consisted of inspecting the data for signs of bias, imbalance, and noise.

4.2.1 PII Labels. The dataset had a total of 14 labels, each prefixed to mark whether a token begins an entity ('B') or falls inside one ('I'). In practice, this meant that seven PII labels existed, with the prefix defining the token's position within the entity. As seen in Figure 1, slightly more PII tokens were tagged as the beginning of an entity than as inside one.

Figure 1 Distribution of Train Dataset Token Scheme Tag

Table 2 shows how often each label occurs within individual documents, based on named entity recognition (NER), and gives a more granular view of the PII labels. The counts were produced by a pipeline that used spaCy to count the number of times each PII label appears in each document. The analysis indicated that name_student had the highest frequency and often appeared multiple times within a single document. The counts also imply that documents rarely contain emails, usernames, or personal websites.

Table 2 NER Frequency Distribution of PII Labels in Documents

Label               Count   Percent (%)
B-name_student      1,365   49.84
I-name_student      1,096   40.01
B-URL_personal        110    4.02
B-ID_number            78    2.85
B-email                39    1.42
B-username              6    0.22
B-phone_number          6    0.22
I-phone_number         15    0.55
B-street_address        2    0.07
I-street_address       20    0.73
I-ID_number             1    0.04
I-URL_personal          1    0.04
I-email                 0    0.00
I-username              0    0.00
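As an illustration of the kind of counting behind Table 1 and Table 2, the following is a minimal sketch in plain Python. It is not the spaCy pipeline used in the project. It assumes a hypothetical input format in which each document is a dictionary with a "labels" list holding one B-/I- tag (or "O" for non-PII) per token; the names `label_frequencies`, `entity_type`, and `percent` are illustrative only. The sketch tallies how many times each label occurs and how many documents contain it, and shows how the B-/I- prefix can be stripped to recover the underlying PII type.

```python
from collections import Counter


def entity_type(label):
    """Strip the 'B-' or 'I-' prefix, e.g. 'B-email' -> 'email'."""
    return label.split("-", 1)[1]


def label_frequencies(documents):
    """Return (token_counts, doc_counts) over all non-'O' labels.

    token_counts: total occurrences of each label across all documents.
    doc_counts:   number of documents that contain each label at least once.
    """
    token_counts = Counter()
    doc_counts = Counter()
    for doc in documents:
        pii = [lab for lab in doc["labels"] if lab != "O"]
        token_counts.update(pii)
        doc_counts.update(set(pii))  # count each label once per document
    return token_counts, doc_counts


def percent(count, total):
    """Share of `total` as a percentage rounded to two decimals."""
    return round(100 * count / total, 2) if total else 0.0


# Toy data in the assumed (hypothetical) format.
docs = [
    {"labels": ["B-name_student", "I-name_student", "O", "B-name_student"]},
    {"labels": ["O", "B-email", "O"]},
    {"labels": ["O", "O"]},
]

token_counts, doc_counts = label_frequencies(docs)
total_tokens = sum(token_counts.values())
types_seen = {entity_type(lab) for lab in token_counts}

for label, n in token_counts.most_common():
    print(label, n, percent(n, total_tokens),
          doc_counts[label], percent(doc_counts[label], len(docs)))
```

Keeping the two counters separate makes the difference between "how often a label occurs" and "how many documents contain it" explicit, which matters when a label such as name_student appears several times in one document.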

4.3 Feature Engineering

The project expected to use text classification models that could handle multiple labels.

4.3.1 Tokenization. A pipeline was created to preprocess the list of tokens to

