M.S. Applied Data Science - Capstone Chronicles 2025

7

determined how to handle each variable during modeling. 4.1.1.2 Univariate Analysis Univariate analysis was used to review the distribution and variability of each feature. The primary target variable, regular_hs_diploma_graduates_rate, was left-skewed, showing that most schools report high graduation rates and, therefore, confirming a strong class imbalance (see Figure 2). Several predictors were also skewed, suggesting that transformation would be needed during modeling. Figure 2 Density Plot of Graduation Rates (%)

Further inspection revealed that the missingness patterns were attributable to data source limitations rather than random data loss. The student-staff ratio field is inconsistently reported in the CBEDS staff assignment dataset, particularly among smaller or newly established schools. The School Safety and Climate indicators, derived from county-level CalSCHLS aggregates, contained structural gaps because some counties did not report all perception measures for all grade bands, especially during the 2017-2019 school years. Because these missing values primarily reflected systematic reporting practices rather than random omissions, they were retained at this stage and revisited in the data quality and feature engineering steps, where we

196

Made with FlippingBook flipbook maker