M.S. Applied Data Science - Capstone Chronicles 2025


4.2 Data Quality

Reliable school-level predictions depend on high-quality data. Building on the insights from the EDA, the next step was to assess the dataset's quality to understand its strengths, limitations, and readiness for the modeling phase. After cleaning and merging, the CDE datasets were generally high in quality and suitable for modeling. Formatting was consistent across variables, percentage fields were standardized, CDS codes were aligned, and aggregate and subgroup rows had been removed. Most features fell within their expected numerical ranges, and the key features related to absenteeism, poverty, school climate, and staffing aligned with expected statewide patterns.
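The format and range checks described above can be illustrated with a minimal sketch. It is not the project's actual validation code: the field names (`cds_code`, `chronic_absent_pct`, `frpm_pct`) and the sample values are hypothetical, and plain Python is used only to keep the example dependency-free. It assumes CDS codes follow the standard 14-digit county-district-school layout and that percentage fields are stored on a 0 to 100 scale.

```python
import re

# Hypothetical field names; the paper does not list its actual column names.
PERCENT_FIELDS = ("chronic_absent_pct", "frpm_pct")
# Assumed layout: 14 digits = county (2) + district (5) + school (7).
CDS_PATTERN = re.compile(r"\d{14}")


def quality_report(rows):
    """Return (row_index, problem) tuples for rows failing basic checks."""
    problems = []
    seen_codes = set()
    for i, row in enumerate(rows):
        code = row.get("cds_code", "")
        if not CDS_PATTERN.fullmatch(code):
            problems.append((i, "malformed CDS code"))
        elif code in seen_codes:
            problems.append((i, "duplicate CDS code"))
        else:
            seen_codes.add(code)
        for field in PERCENT_FIELDS:
            value = row.get(field)
            # None marks a missing value; missingness is assessed separately.
            if value is not None and not 0.0 <= value <= 100.0:
                problems.append((i, f"{field} outside 0-100"))
    return problems


# Made-up rows for illustration only (not real school records).
sample = [
    {"cds_code": "01611190130229", "chronic_absent_pct": 12.5, "frpm_pct": 48.0},
    {"cds_code": "0161119", "chronic_absent_pct": 9.1, "frpm_pct": 30.2},
    {"cds_code": "19647330000001", "chronic_absent_pct": 140.0, "frpm_pct": None},
]
issues = quality_report(sample)
```

Here `issues` flags the truncated code in the second row and the out-of-range percentage in the third, while the missing `frpm_pct` value is deliberately left for the missingness assessment.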

The dataset was mostly complete; however, there was some missingness in the CalSCHLS climate and safety measures and in some absenteeism categories, reflecting limitations in reporting. The missing climate and safety data were handled by creating two modeling datasets: the first excluded the climate index variable (n = 958 schools), and the second excluded the seven counties with unreported data (n = 806 schools). The remaining missing values, in staff ratios, teacher characteristics, and grade retention, were imputed using the median, which affected 14 to 59 values per variable. This approach preserved the maximum sample size while maintaining the integrity of the data for analysis. There were also a small number of outliers,

