M.S. Applied Data Science - Capstone Chronicles 2025


EWS compared to the sole measure of accuracy.

3.3 Geospatial Analysis and Educational Equity

Cobb (2020) defined geospatial analysis as an analytical strategy for understanding geographical disparities in educational opportunity and access through geographic information systems (GIS) methodology. Reviewing 42 GIS-based studies conducted from 1995 to 2019, Cobb explained how geospatial data can be visualized to reveal inequities in school quality, resource allocation, and student outcomes that may be invisible or obscured in aggregate reports. Cobb concluded that geospatial analysis can not only identify where these inequities occur but also serve as a basis for evidence-based policymaking by connecting the social, economic, and geographic dimensions of educational equity.

We build on Cobb's (2020) findings by adding a spatial dimension to our analysis and mapping predicted risk at the school level throughout California. This spatial dimension provides context for the model output and allows policymakers to see which areas of the state have the lowest high school graduation rates, enabling targeted interventions and a fair distribution of educational resources.

4 Methodology

To develop a statewide, school-level EWS that predicts high school graduation outcomes, we needed a unified dataset built from publicly available indicators aligned with the ABC framework. This step was necessary because California’s education data is not stored in a single repository. It is spread across multiple state reporting systems that differ in structure, file format, reporting categories, and reporting years. Several datasets, such as the graduation, absenteeism, staffing, and socioeconomic files, were available for the 2021–22 academic year. However, the school safety and climate data were publicly accessible only for earlier years. Therefore, all datasets required filtering and temporal alignment before they could be merged into a single analytic file for modeling.

4.1 Data Acquisition and Aggregation

Data acquisition and merging were performed using Python (pandas) within a Jupyter Notebook environment. Publicly accessible datasets were acquired from the CDE for the 2021–22 academic year. Each dataset was downloaded in its native format (CSV, XLSX, or TXT) and imported into Python for initial cleaning and merging. We used the 14-digit County-District-School (CDS) code as the unique identifier for joining each dataset into the combined dataset.

Each dataset was filtered for high schools only. Rows reporting subgroup or aggregate data were removed. Schools that did not report at least 90% of their expected cohort enrollment (pc_hs_enrollment < 90%) were removed to ensure graduation rate validity. Additionally, any rows missing the target variable (graduation_rate) were excluded, since these records could not be used in supervised modeling.

To build the unified dataset, we combined a number of publicly available sources from the CDE. These included:

● Adjusted Cohort Graduation Rate (ACGR)
● Chronic Absenteeism
● FRPM (Free/Reduced-Price Meals)
● CBEDS Staff Assignment
● School Directory / School Characteristics
● CalSCHLS School Safety and Climate (2017–2019)

These datasets collectively provide school-level measures aligned with the ABC framework.
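The filtering and joining steps described above can be sketched in pandas as follows. This is a minimal illustration on synthetic rows, not the project's actual notebook code. Only pc_hs_enrollment and graduation_rate are names taken from the text. Every other column name (county_code, district_code, school_code, aggregate_level, reporting_category, chronic_absent_rate), the category codes "S" and "TA", and all data values are assumptions made for the example. The sketch also assumes the CDS code is formed from zero-padded 2-digit county, 5-digit district, and 7-digit school codes.

```python
import pandas as pd


def build_cds_code(df: pd.DataFrame) -> pd.Series:
    """Build a 14-character CDS key from zero-padded county, district, and school codes."""
    return (
        df["county_code"].astype(str).str.zfill(2)
        + df["district_code"].astype(str).str.zfill(5)
        + df["school_code"].astype(str).str.zfill(7)
    )


def clean_graduation(acgr: pd.DataFrame) -> pd.DataFrame:
    """Apply the row filters described in the text to a graduation table."""
    out = acgr.copy()
    out["cds_code"] = build_cds_code(out)
    # Drop subgroup and aggregate rows (hypothetical codes: "S" = school, "TA" = all students).
    out = out[(out["aggregate_level"] == "S") & (out["reporting_category"] == "TA")]
    # Keep schools reporting at least 90% of expected cohort enrollment.
    out = out[out["pc_hs_enrollment"] >= 90]
    # Rows without the target cannot be used for supervised modeling.
    out = out.dropna(subset=["graduation_rate"])
    return out[["cds_code", "pc_hs_enrollment", "graduation_rate"]]


def merge_sources(base: pd.DataFrame, others: list) -> pd.DataFrame:
    """Left-join each source onto the base table by CDS code."""
    merged = base
    for other in others:
        # validate raises an error if a CDS code is duplicated on either side.
        merged = merged.merge(other, on="cds_code", how="left", validate="one_to_one")
    return merged


# Synthetic example rows (values are made up for illustration).
acgr = pd.DataFrame({
    "county_code": [1, 1, 19, 19],
    "district_code": [12345, 12345, 54321, 54321],
    "school_code": [130229, 130229, 1930924, 1931476],
    "aggregate_level": ["S", "S", "S", "S"],
    "reporting_category": ["TA", "RH", "TA", "TA"],
    "pc_hs_enrollment": [95.0, 95.0, 85.0, 97.0],
    "graduation_rate": [91.2, 88.0, 79.5, None],
})
absenteeism = pd.DataFrame({
    "cds_code": ["01123450130229", "19543211930924"],
    "chronic_absent_rate": [12.5, 30.1],
})

unified = merge_sources(clean_graduation(acgr), [absenteeism])
print(unified)
```

Keeping the CDS code as a zero-padded string, rather than an integer, avoids losing leading zeros when files are read from CSV or XLSX. A left join from the graduation table retains every school that has a valid target, even where another source has no matching record.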

