M.S. Applied Data Science - Capstone Chronicles 2025


4.3 Feature Engineering

This research focused on how important features were related to turnover rather than on the magnitude of each feature relative to the others. Figure 2 shows the Likert distribution of the 28 questions that were asked consistently between 2018 and 2024. Respondents rated every question favorably, with most agreeing or strongly agreeing. Suárez‑García et al. (2024) concluded that although the Likert scale offers more richness and gives rise to a probability mass function, a binary scale is a more common approach and yields a single figure for determining the success rate of a study. To prepare the data for binary modeling, responses from strongly disagree to neutral (1–3) were recoded to 0, and responses of agree or strongly agree were recoded to 1. In addition, missing survey responses were replaced with 1, because the mean response to each question was more favorable than unfavorable.

The target variable, Intent to Leave, was recoded into a binary classification to simplify the modeling process and improve predictive performance. Responses A (Staying) and B (Yes, to another federal job) were coded as 0, representing the Stay class. Responses C (Yes, to a job outside federal government) and D (Other, which may include retirement or career changes) were coded as 1, representing the Leave class. This grouping reflects a logical distinction: respondents selecting B still plan to remain in the federal workforce, aligning more closely with A than with C or D. Similarly, response D indicates a departure from the organization, justifying its inclusion in the Leave class.
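The recoding rules above can be illustrated with a minimal Python sketch. It assumes a five-point scale in which agree and strongly agree are stored as 4 and 5, that missing answers appear as `None`, and that intent-to-leave responses are stored as the letters A to D. The function and constant names are hypothetical and are not taken from the study's code.

```python
# Hypothetical helper names; the survey's real column names and codes
# are not given in the text.

# Assumed storage of the target: letters A-D as described in Section 4.3.
INTENT_TO_CLASS = {
    "A": 0,  # Staying -> Stay
    "B": 0,  # Yes, to another federal job -> Stay
    "C": 1,  # Yes, to a job outside federal government -> Leave
    "D": 1,  # Other (e.g., retirement) -> Leave
}


def recode_likert(response):
    """Map a Likert response to a binary favorability flag.

    1-3 (strongly disagree to neutral) -> 0
    4-5 (agree, strongly agree; assumed coding) -> 1
    missing (None) -> 1, the favorable default described in the text
    """
    if response is None:
        return 1
    return 0 if response <= 3 else 1


def recode_intent(response):
    """Map an intent-to-leave letter (A-D) to 0 = Stay or 1 = Leave."""
    return INTENT_TO_CLASS[response]


if __name__ == "__main__":
    print([recode_likert(v) for v in (1, 2, 3, 4, 5, None)])
    print([recode_intent(r) for r in "ABCD"])
```

If the responses are held in a pandas DataFrame, the same mappings could be applied column by column, for example with `Series.map`.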

To prepare the dataset for modeling, all demographic variables were one-hot encoded, ensuring that all input features were numerical and suitable for machine learning algorithms.

4.4 Modeling

This phase of the project used machine learning techniques to classify employees’ intent to leave based on survey responses and demographic information. The target variable indicates whether an employee intends to leave their job or stay. Multiple classification models were used, including logistic regression, decision tree, and XGBoost. All models were evaluated using key metrics such as accuracy, F1-score, recall, and precision to determine the best-performing model overall.

4.4.1 Training and Test Split

The dataset was split into training and test sets using train_test_split, with 80% of the data in the training set and 20% in the test set. To confirm that the split was performed correctly, the shapes of the two sets were examined: the training set contained 2,521,731 entries and the test set contained 630,433 entries. Before the models were trained, the training set was balanced using SMOTE so that the classes of the target variable were evenly distributed. Figure 5 shows the distribution of the target variable, Intent to Leave (‘dleaving’).
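The split sizes and the balancing step in Section 4.4.1 can be illustrated with a dependency-free Python sketch. It does not reproduce the study's code, and all function names are hypothetical. The first function assumes the test share is rounded up, which matches the reported counts. The rest is a simplified version of the idea behind SMOTE: each synthetic minority sample is an interpolation between a minority sample and one of its nearest minority neighbors. The text does not say which SMOTE implementation was used; the `SMOTE` class in the imbalanced-learn package is a common choice.

```python
import math
import random


def split_sizes(n_samples, test_fraction=0.2):
    """Return (n_train, n_test), rounding the test share up (assumed rule)."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


def smote_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority points.

    Requires at least two minority samples. Simplified: neighbors are
    found by brute-force Euclidean distance.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen point, excluding itself
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have equal counts."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = (len(y) - len(minority)) - len(minority)
    synthetic = smote_samples(minority, n_new, k, seed)
    return X + synthetic, y + [minority_label] * n_new


if __name__ == "__main__":
    # Total entries reported in the text: 2,521,731 + 630,433
    print(split_sizes(3_152_164))
    X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
    y = [0, 0, 0, 0, 1, 1]
    X_bal, y_bal = balance(X, y, k=1)
    print(y_bal.count(0), y_bal.count(1))
```

As in the text, balancing is applied only to the training set, so the test set keeps the original class distribution.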

