M.S. Applied Data Science - Capstone Chronicles 2025


which had a lower representation of leaders. These distributions provide foundational context for interpreting Likert-scale engagement outcomes and help inform subgroup analyses conducted later in the study.

4.2 Data Quality

Ensuring that the data were clean, and that missing values and potential bias were addressed, was an important part of the cleaning process. The FEVS survey data contained missing values in most columns. This pattern is expected for survey data, because some questions do not apply to a given employee, and some employees choose not to answer a question or to disclose demographic characteristics such as gender, race, or ethnicity.

4.2.1 Data Quality Issues.

Each year, survey questions were added, removed, or changed. To model features that impact employee turnover over time, only questions that were asked consistently between 2020 and 2024 were selected for modeling (see Appendix). The dataset contains missing values in each variable. Records with a missing value for the employee’s likelihood of leaving were removed from the dataset, since this is the dependent variable. The demographic variables minority and race were excluded because they had over 2,000,000 missing records. For the remaining demographic columns (sex, federal tenure, and supervisory status), missing values were filled with each column's mode.

4.2.1 Class Imbalance

To address class imbalance, the synthetic minority over-sampling technique (SMOTE) was applied after splitting the data into training and validation sets. Applying SMOTE to the training data helped the models predict the underrepresented class, Class 1 (Leave Class), more accurately during the modeling phase.

4.2.2 Correlation Analysis

To assess potential multicollinearity, a Pearson correlation matrix was computed for the survey questions. It revealed several moderate to strong correlations between questions. A correlation heatmap visualized these relationships and helped ensure that features with extremely high correlation (e.g., > 0.7) were flagged for further review in modeling. Figure 4 illustrates the correlation heatmap between the survey questions. Some notable correlations are visible:

● Q21 and Q22 had a high correlation of 0.76; these questions relate to supervisor leadership and the employee’s experience with their supervisor.

● Q30 through Q32 had moderate to strong correlations ranging from 0.67 to 0.74. These questions relate to a common theme of managerial and/or supervisor support.

No features were removed at this stage. Instead, the correlation matrix informed later decisions regarding feature selection and regularization.
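The flagging step described above can be illustrated with a minimal sketch. It is not the study's code: the column names (`QA`, `QB`, `QC`), the toy Likert responses, and the helper `flag_correlated_pairs` are hypothetical, and only the 0.7 threshold is taken from the text.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def flag_correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 2)))
    return flagged


# Toy Likert responses (1-5), invented for illustration only.
responses = {
    "QA": [1, 2, 3, 4, 5, 4, 3, 2],
    "QB": [1, 2, 3, 5, 5, 4, 3, 1],
    "QC": [5, 1, 4, 2, 3, 3, 1, 5],
}

# Pairs above the threshold are only flagged for review, not dropped.
flagged = flag_correlated_pairs(responses)
print(flagged)
```

Returning the flagged pairs rather than dropping a column mirrors the decision to defer removal to the feature selection and regularization stage.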
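The order of operations in the class imbalance step (split first, then oversample the training portion) can be sketched as follows. This is a simplified, SMOTE-style interpolation written for illustration; it is not the study's implementation or a library API. The function `smote_like_oversample`, the toy two-feature data, the 80/20 split, and `k=3` are all assumptions.

```python
import random
from math import dist


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row, excluding itself.
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: dist(base, row),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy feature rows, invented for illustration: 40 "stay" and 10 "leave".
rng = random.Random(42)
stay = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(40)]
leave = [(rng.uniform(2, 3), rng.uniform(2, 3)) for _ in range(10)]

# 1) Split first (80/20 within each class), so validation holds real rows only.
train_stay, valid_stay = stay[:32], stay[32:]
train_leave, valid_leave = leave[:8], leave[8:]
valid = [(x, 0) for x in valid_stay] + [(x, 1) for x in valid_leave]

# 2) Oversample Class 1 in the training split only, until classes are balanced.
synthetic = smote_like_oversample(train_leave, len(train_stay) - len(train_leave))
train_balanced = (
    [(x, 0) for x in train_stay]
    + [(x, 1) for x in train_leave]
    + [(x, 1) for x in synthetic]
)
```

Because the synthetic rows are generated after the split, none of them reach the validation set, so validation metrics reflect the original class distribution.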

