ADS Capstone Chronicles Revised
the algorithm’s simplicity, ability to handle class imbalance, and ease of interpretation. As its name suggests, the model builds an ensemble of decision trees, each of which is visually interpretable by a wide audience. An initial RF model was created with 100 estimators (trees). This model was then tuned using grid search to identify which combination of hyperparameters would yield the best performance. The tuning fitted each of 288 candidate combinations on 3 folds, totalling 864 fits of the RF model.

4.4.1.4. Extreme Gradient Boosting (XGBoost). Like the RF model, this is a variation of the decision tree model that uses gradient-boosted decision trees. It generally has a longer runtime and requires more resources to tune, but yields high performance. Due to the longer runtime of the XGBoost model, random search was chosen over grid search for hyperparameter tuning because it randomly samples a fixed number of candidate configurations, using less processing time. The search fitted each of 10 candidates on 3 folds, resulting in 30 total fits.

4.4.1.5. K-Nearest Neighbors (K-NN). This is a powerful algorithm that is often applied to classification or regression tasks. K-NN is easy to understand and implement, as it predicts from the labels of the k most similar training instances. Because K-NN is a non-parametric algorithm that makes no assumptions about the distribution of the data, it can learn directly from the instances in the training data. The initial K-NN model was run over several scenarios in which the value of k varied from 2 to 29, and the optimal k was found to be either 10 or 11. An additional K-NN model went through further grid search parameter tuning
in which each of 28 candidates was trained on 5 folds, totalling 140 fits. This parameter tuning found that the optimal k value could be 6.

4.4.1.6. Presidio Model. Presidio, a software development kit built by Microsoft, was also selected because it is able to facilitate PII detection and anonymization at an organizational scale. This model is able to use regular expressions to recognize patterns, leverage NLP to detect entities, validate patterns, and apply anonymization techniques that would be scalable for this project. It also has the flexibility to be extended with other types of custom recognizers, such as ones based on the BILOU scheme labels used in the dataset.

5 Results and Findings
Both statistical and performance metrics were used to assess each model. Precision, recall, accuracy, and F1 scores were used to evaluate the statistical performance of each model, in addition to each model’s efficiency. The efficiency metric was used to determine the suitability of a model in a real-world context in which thousands of documents would need to be evaluated.

5.1 Evaluation of Results
One of the most important metrics used for evaluation was the efficiency with which the model was able to run. As seen in Formula 1, the model’s F1 score is considered together with the number of seconds it takes for the submission to be evaluated:

Efficiency = F1 − RuntimeSeconds / 324,000 (1)

Figure 2 illustrates the runtime of each model, including variations of models that went through several hyperparameter tuning iterations. The
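The cross-validated tuning described above (e.g., 3 folds for each of 288 candidates, giving 864 fits of the RF model) can be sketched with scikit-learn's GridSearchCV. The small parameter grid and synthetic dataset below are illustrative assumptions, not the study's actual grid or data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data (stand-in for the PII dataset)
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Hypothetical grid: 2 x 2 = 4 candidates; the study's grid had 288
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
}

# cv=3 mirrors the 3-fold setup; total fits = folds x candidates
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
total_fits = 3 * n_candidates  # here 12; 3 x 288 = 864 in the study
print(search.best_params_, total_fits)
```

Swapping GridSearchCV for RandomizedSearchCV with `n_iter=10` reproduces the XGBoost setup (3 folds x 10 sampled candidates = 30 fits) without enumerating the full grid.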