M.S. Applied Data Science - Capstone Chronicles 2025


such as unusually high student-teacher ratios or absenteeism that had zero reported values for the whole school. A number of features also measured similar quantities, including overlapping school climate indicators and teacher experience categories. Overall, the dataset provided a reliable foundation for modeling, with the remaining limitations clearly identified and manageable during later preprocessing steps. Those steps included addressing outliers by removing them or capping extreme values, removing redundant and highly correlated variables through feature selection, and standardizing numerical features so that scales were comparable for model training.

Additionally, all publicly accessible CDE datasets are pre-suppressed for privacy under FERPA regulations and do not include any personally identifiable information. Values for small subgroups, such as graduation outcomes, absenteeism categories, discipline incidents, or demographic breakouts with very few students, are intentionally masked by the state to prevent re-identification. This suppression creates some of the structural missingness observed during EDA, particularly in the absenteeism subgroups. In contrast, the limited availability of certain CalSCHLS climate indicators reflects systemic differences in county reporting rather than privacy masking under FERPA. Because our analysis uses only school- and county-level aggregates, the risk of re-identification is eliminated, and no additional suppression was needed during data preparation.

4.3 Feature Engineering

Feature engineering was performed to prepare a cleaned dataset for modeling and to ensure alignment with the ABC framework. Several variables identified during EDA as non-informative were removed prior to modeling because they contributed no meaningful variance or predictive value:

● Derived variables: high_conn, low_conn, and conn_ratio. These variables showed low variance.
● Categorical features: all were excluded because EDA showed that each had minimal variation and limited predictive contribution.
● Identification-only variables: latitude, longitude, county, CDS Code. These variables were used only for initial identification and for merging the datasets, and were removed to ensure generalizability.

Since a number of behavioral indicators came from the CalSCHLS climate surveys, the raw data required additional engineering to convert student perception measures into numeric features that could be incorporated into the model. The original safety indicators were reported as distributions across a five-point Likert scale: “Very Safe”, “Safe”, “Neither Safe nor Unsafe”, “Unsafe”, and “Very Unsafe.” These ordinal categories were converted into numeric values and combined into an expected-value safety_score, weighting each category on a 1-5 scale according to its favorability. This created a continuous measure of perceived school safety at the county level. Another engineered feature, avg_safety_score, was created by averaging safety scores across available grades, providing a stable and interpretable climate feature aligned with the behavior component of the ABC framework. Connectedness-based climate variables were evaluated but ultimately removed due to a lack of variance and limited predictive value.

Several numeric predictors exhibited strong right skew, including absenteeism rates, staffing-related percentages, and climate measures. These variables were left untransformed in the final dataset but were standardized later within the modeling pipeline to prevent data leakage and ensure consistent scaling during training.

The target variable was engineered as a binary label using a 90% graduation rate threshold to serve as an early warning system to flag connectedness
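
As a rough illustration of the expected-value safety_score and avg_safety_score described in this section, the sketch below weights each Likert category and averages across grades. The exact weights, the function names, and the sample percentages are assumptions for illustration only: the text states a 1-5 scale ordered by favorability, so the sketch assumes 5 for "Very Safe" down to 1 for "Very Unsafe".

```python
# Assumed mapping: the source only says categories were weighted 1-5 by favorability.
LIKERT_WEIGHTS = {
    "Very Safe": 5,
    "Safe": 4,
    "Neither Safe nor Unsafe": 3,
    "Unsafe": 2,
    "Very Unsafe": 1,
}


def safety_score(distribution):
    """Expected value of the Likert weights under a response distribution.

    `distribution` maps each category to its share of responses (percentages
    or counts); shares are normalized so they need not sum to 100.
    Returns None when no responses are reported.
    """
    total = sum(distribution.values())
    if total <= 0:
        return None
    weighted = sum(LIKERT_WEIGHTS[cat] * share for cat, share in distribution.items())
    return weighted / total


def avg_safety_score(scores_by_grade):
    """Average the per-grade safety scores, skipping grades with no data."""
    available = [s for s in scores_by_grade.values() if s is not None]
    if not available:
        return None
    return sum(available) / len(available)


# Hypothetical county-level response percentages (not taken from the dataset).
grade7 = {"Very Safe": 20, "Safe": 40, "Neither Safe nor Unsafe": 25,
          "Unsafe": 10, "Very Unsafe": 5}
grade9 = {"Very Safe": 30, "Safe": 40, "Neither Safe nor Unsafe": 20,
          "Unsafe": 5, "Very Unsafe": 5}

scores = {
    "grade7": safety_score(grade7),
    "grade9": safety_score(grade9),
    "grade11": None,  # grade not reported for this county
}
county_avg = avg_safety_score(scores)
```

Normalizing by the total keeps the score on the 1-5 scale even when suppressed or rounded percentages do not sum exactly to 100, and skipping missing grades mirrors the "available grades" wording above.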

