M.S. Applied Data Science - Capstone Chronicles 2025
Learning Evidence That Perceived Comprehensive HR Practices Predict Turnover Intention

The Classification and Regression Trees (CART) prediction model identified six at-risk subgroups characterized by satisfaction with the organization, loyalty, accomplishment, involvement in decisions, liking of the job, satisfaction with promotion, skill development opportunities, organizational tenure, and pay satisfaction, with job satisfaction as the strongest predictor of employee turnover. Gaps in this research included the absence of demographic information about the population and of comparisons with other models shown to produce more accurate results (Kang et al., 2021).

4 Methodology

This project used publicly available data from the Federal Employee Viewpoint Survey (FEVS), published on the U.S. Office of Personnel Management (OPM) website. Survey data from federal employees are available annually from 2016 to 2024. Each file contained between 292,520 and 674,207 survey records. These data represent real-world survey responses; the FEVS is an “organizational climate survey and assess[es] how employees jointly experience the policies, practices, and procedures characteristic of their agency and its leadership” (U.S. Office of Personnel Management, n.d.).

The FEVS is voluntary. Employees are invited to complete the survey via email, and reminders are sent to increase response rates. Collected data are weighted so that the survey estimates accurately represent the size and scope of different agency populations, reducing bias in estimates of population statistics.

After the data were acquired, the research methodology involved conducting exploratory data analysis (EDA), performing statistical analysis, visualizing trends, and comparing annual trends in overall responses with those across different employee demographics. Understanding the statistical relationships between variables informed data cleaning and feature engineering, which in turn supported model selection and performance optimization.

All code for this paper can be found in the GitHub repository: https://github.com/sophiaajensen/ADS599_CapstoneProject
For inquiries, please contact the authors of the paper.

4.1 Data Acquisition and Aggregation

For this project, FEVS data were downloaded from the OPM website as comma-separated values (CSV) files, one per year. Data from 2020 through 2024 were selected; earlier years were excluded. In addition to the raw data, a codebook and a readme file are available for each year to define terms and explain changes made to the survey year over year. Processing was done in Python within Jupyter Notebook. Libraries such as sqlite3, pandas, Matplotlib, and seaborn were imported to store the data in databases and to analyze and visualize the information.

4.1.1 Exploratory Data Analysis

EDA was the first step in examining the datasets. EDA is the process of understanding and visualizing the data to uncover patterns, clean the data, and prepare for the modeling phase. Familiarity with the data provided insight into how the data should be cleaned, such as how missing values should be handled and whether features need to be engineered (Liu, 2020).

The combined FEVS survey database, covering 2020 to 2024, contains 2,774,873 survey responses. Across all response files, there were 47 common columns and 215 unique columns. The common demographic columns included (a) age group, (b) disability, (c) Hispanic, Latino, or Spanish origin,