M.S. Applied Data Science - Capstone Chronicles 2025
Figure 5 Distribution of Target Variable before SMOTE
4.4.2 Selection of Modeling Techniques - Logistic Regression

A logistic regression model was selected as one of the modeling techniques for its effectiveness in binary classification problems, low computational complexity, interpretability, and simplicity. The goal was to predict an employee’s intent to stay or leave based on survey data. Logistic regression can model the probability of a binary outcome (Kuhn & Johnson, 2013). Because of its simple and stable nature, the model also served as a strong baseline against which to compare more complex machine learning techniques, such as decision trees and XGBoost (Kuhn & Johnson, 2013). In this project, logistic regression was implemented with L2 regularization to mitigate overfitting and multicollinearity. Prior to training, the data was standardized so that all features were on the same scale. A grid search with 5-fold cross-validation was used for hyperparameter tuning: the parameter grid explored different values of the regularization strength (C) while keeping the penalty set to ‘l2’ and the solver to ‘lbfgs’. The model achieved its best performance with C = 0.1, as determined by optimizing the F1 score.

4.4.3 Selection of Modeling Techniques - Decision Tree

The decision tree classifier was selected as a modeling technique because it provides an interpretable, highly structured method that makes predictions by performing a sequence of tests on the values of descriptive features. At each node, a question is asked about one of the features, and the data is split based on the answer. This process continues down the tree until a final prediction is made at a leaf node. A decision tree consists of a root node (starting node), interior
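The tuning procedure described in 4.4.2 can be sketched with scikit-learn. The synthetic stand-in data, the specific C grid values, and the train/test split below are illustrative assumptions, not the project's actual survey data:

```python
# Sketch of the logistic regression tuning in 4.4.2 (assumed details noted above).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey data, with a roughly 80/20 class split.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Standardize features, then fit an L2-penalized logistic regression.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l2", solver="lbfgs", max_iter=1000)),
])

# 5-fold grid search over the regularization strength C, optimizing F1.
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, scoring="f1", cv=5)
grid.fit(X_train, y_train)
print("best C:", grid.best_params_["clf__C"])
print("test F1:", grid.score(X_test, y_test))
```

Wrapping the scaler and classifier in a single `Pipeline` ensures the standardization is re-fit inside each cross-validation fold, avoiding leakage from the validation folds.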
The stay class (Class 0) comprised approximately 80% of the data, while the leave class (Class 1) accounted for only about 19%. To address this class imbalance, the SMOTE oversampling technique was applied to the training data, resulting in a more balanced class distribution. This helped improve model performance by ensuring the models were not biased toward the majority class. Figure 6 illustrates the balanced distribution after applying SMOTE.

Figure 6 Distribution of Target Variable after SMOTE