M.S. Applied Data Science - Capstone Chronicles 2025


nodes, and leaf nodes (terminating nodes; Kelleher et al., 2020). Like logistic regression, decision trees are highly interpretable and structurally simple, which makes them particularly useful for understanding feature importance and the algorithm's decision-making process (Kelleher et al., 2020). In this project, the decision tree model was set to the following parameters:

● max_depth was set to 15 to prevent the model from overfitting the training data.
● min_samples_leaf was set to 50 to ensure the leaf nodes had an adequate sample size.
● class_weight was set to 'balanced' to address class imbalance.
● random_state was set to 42 to ensure reproducibility.

4.4.4 Selection of Modeling Techniques - XGBoost

The extreme gradient boosting (XGBoost) model was selected as a modeling technique for its speed and its ability to handle large datasets effectively. The algorithm implements a technique known as gradient boosting, in which multiple decision trees are trained and combined into a strong final prediction model (Kelleher et al., 2020). Each new tree is trained to correct the mistakes made by earlier trees, improving the overall predictions of the ensemble (Kelleher et al., 2020). This ensemble learning approach improves the model's overall accuracy and robustness. Because the FEVS training dataset consists of over 2 million entries, XGBoost can process the large volume of data efficiently while maintaining high performance.
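The decision tree configuration listed above can be sketched with scikit-learn. This is a minimal sketch under assumptions: the synthetic dataset below is only a stand-in for the FEVS feature matrix and turnover labels, which are not shown in this excerpt.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the FEVS data (hypothetical; the real
# project used survey responses with an employee-turnover label).
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Decision tree with the parameters reported in the section above.
tree = DecisionTreeClassifier(
    max_depth=15,              # limit depth to curb overfitting
    min_samples_leaf=50,       # require an adequate sample size per leaf
    class_weight='balanced',   # reweight classes to offset imbalance
    random_state=42,           # reproducibility
)
tree.fit(X_train, y_train)
print(f"Held-out accuracy: {tree.score(X_test, y_test):.3f}")
```

Because max_depth and min_samples_leaf jointly cap tree complexity, the fitted tree stays small enough to inspect node by node, which is the interpretability advantage discussed above.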

In this project, XGBoost was trained using the following parameters:

● use_label_encoder was set to False.
● eval_metric was set to 'logloss'.
● random_state was set to 42 to ensure reproducibility.

No hyperparameter tuning was applied; instead, default values were used to establish a baseline. This baseline allowed for a better understanding of general feature performance and model behavior. The model could benefit from future optimization of its parameters to improve overall performance.

4.4.5 Evaluation Metrics

Models were trained and evaluated on accuracy, precision, recall, and F1-score, and compared against the logistic regression model as a baseline. Accuracy measures how well the model predicted the expected outcome of employee turnover and is one factor in evaluating model performance (Kuhn & Johnson, 2013). However, a model with high accuracy can be overfit to the training and testing data, and such a model does not adapt well when new variables are introduced. In addition to accuracy, precision, recall, and F1-score were assessed to determine not only whether the model is accurate, but also how well it predicted actual positives; the F1-score captures the balance between precision and recall. Table 3 reports the accuracy, precision, recall, and F1-score of the decision tree classifier, XGBoost, and logistic regression models. The XGBoost model was more accurate than the baseline and decision tree models, at 81% compared to 68% for the baseline and 69% for the decision tree.

4.4.6 Final Model Selection

