lowercase all tokens, remove punctuation, remove stopwords, remove empty tokens, and remove newline characters. Each token was associated with a label identifying whether or not the token was considered PII and, if so, the specific PII label to which the token corresponded; the pipeline therefore preprocessed both of these features simultaneously. This made tokens consistent across all documents and eliminated symbols or text that might introduce additional noise into the dataset.

4.3.2 Descriptive Statistics of Tokens. Descriptive statistics of the preprocessed train and test datasets are shown in Table 3. The train dataset has a total of 2,230,310 tokens, of which only 1.98% are unique. This low lexical diversity indicates similar language used throughout the documents, while the sheer size of the corpus provides extensive textual content for model training. Conversely, the test dataset is much smaller, with only 3,377 tokens, of which 44.45% are unique. The test dataset is therefore far more lexically diverse, but only because of its smaller size and use of fewer documents.

Table 3
Pre-Processed Tokens Descriptive Statistics

                        Train         Test
Tokens                  2,230,310     3,377
Unique tokens           44,088        1,501
Character count         14,626,600    22,651
Lexical diversity       0.02          0.44
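As a minimal sketch (not the authors' code) of the preprocessing steps and the lexical diversity statistic reported in Table 3, assuming tokens arrive pre-split and paired with per-token PII labels (the label names here are hypothetical):

```python
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(tokens, labels):
    """Lowercase tokens, strip punctuation and newlines, and drop stopwords
    and empty tokens, keeping each surviving token aligned with its PII
    label so both features are preprocessed simultaneously."""
    clean_tokens, clean_labels = [], []
    for token, label in zip(tokens, labels):
        token = token.lower().strip().translate(PUNCT_TABLE)
        if token and token not in ENGLISH_STOP_WORDS:
            clean_tokens.append(token)
            clean_labels.append(label)
    return clean_tokens, clean_labels

# Lexical diversity, as in Table 3: unique tokens / total tokens
tokens, labels = preprocess(["An", "email,", "\n", "to", "John!"],
                            ["O", "B-EMAIL", "O", "O", "B-NAME_STUDENT"])
print(len(tokens), len(set(tokens)) / len(tokens))
```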
4.3.3 Feature Transformations. The labels were transformed using a multi-label binarizer to indicate to the models that a document can contain more than one PII label. Additionally, the tokens were transformed with a term frequency-inverse document frequency (TF-IDF) vectorizer to convert the text into a numerical format better suited to the models. This vectorizer weights how often a word appears within a document against the number of documents in the corpus that contain it.
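A hedged sketch of these two transformations using scikit-learn's MultiLabelBinarizer and TfidfVectorizer; the toy documents, the label names, and the choice to rejoin tokens into strings for the vectorizer are assumptions:

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the preprocessed documents and their PII label sets
train_token_lists = [["student", "email", "reflection"], ["course", "reflection"]]
train_label_sets = [{"NAME_STUDENT", "EMAIL"}, set()]

# One indicator column per PII label, so a document can carry several labels
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(train_label_sets)

# TF-IDF weights a term's frequency within a document against the number
# of documents in the corpus that contain the term
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(" ".join(toks) for toks in train_token_lists)
```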
4.4 Modeling

4.4.1 Selection of Modeling Techniques. The project sought machine learning algorithms that could perform text processing and classification tasks with minimal processing time. A baseline model was created first to establish baseline predictors and the metrics to be monitored; it was then followed by more complex NLP models that would undergo hyperparameter tuning.

4.4.1.1 Training, Validation, and Test Partitioned Datasets. The original data included a train and a test dataset, but for the purpose of training models the train dataset was partitioned further: a train and a validation set were created with an 80/20 split of the preprocessed train dataset.

4.4.1.2 Logistic Regression. A baseline model was important for evaluating the performance of all models trained in this project. Logistic regression was chosen as the baseline because it handles high-dimensional data effectively while remaining a fairly basic, easily interpreted model. It was trained using a maximum of 1,000 iterations.
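A minimal sketch of the 80/20 split and the logistic regression baseline; the synthetic stand-in data, the random_state, and the one-vs-rest wrapper for multi-label targets are assumptions, while max_iter=1000 comes from the text:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Stand-in for the TF-IDF features and binarized PII labels built above
X, y = make_multilabel_classification(n_samples=200, n_features=50,
                                      n_classes=5, random_state=42)

# 80/20 split of the preprocessed train data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

# Logistic regression baseline trained with a maximum of 1,000 iterations
baseline = OneVsRestClassifier(LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print(baseline.score(X_val, y_val))  # subset accuracy on the validation set
```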
4.4.1.3 Random Forest (RF). RF models are widely used for machine learning tasks such as image processing, health care, and text processing, as in the case of this project. This model was selected for the project due to