M.S. Applied Data Science - Capstone Chronicles 2025


2 Background

California’s Safe Drinking Water Information System (SDWIS) contains millions of laboratory records for dozens of chemical analytes measured by public water systems. While Maximum Contaminant Levels (MCLs) are strictly enforced for public safety, monitoring data often exhibit heavy censoring, with over 70% of measurements falling below laboratory detection limits, as well as irregular sampling frequencies across counties. These characteristics complicate straightforward statistical summaries and hinder early detection of emerging hazards. Moreover, water systems differ in the populations they serve, their treatment infrastructure, and their geochemical conditions, leading to geographically heterogeneous contamination patterns. Rural counties may produce only sparse monthly records, whereas urban centers generate thousands of samples annually. Recognizing these challenges, our project focuses on designing robust analysis pipelines that harmonize non-detect substitution, unit conversions, and stratified modeling, so that results reflect true environmental dynamics rather than artifacts of regulatory sampling protocols.

2.1 Context of Chemical Monitoring in California

Safe drinking water regulations (Title 22 CCR) require periodic testing for radiological isotopes. The SDWIS extract includes over 600,000 records, with variables spanning system identifiers, sample dates, result values, and regulatory limits. Because prolonged exposure above MCLs can lead to serious health impacts, we harmonize units, handle non-detects, and geocode records by county so that results can be compared consistently against regulatory limits.

2.2 Problem Identification and Motivation

A primary gap in current water quality management is the reactive nature of compliance

where and when Maximum Contaminant Levels (MCLs) are exceeded in California’s vast network of public water systems. To address this, we conduct a spatial-temporal analysis of the EPA’s SDWIS radiological monitoring records, mapping exceedance “hot spots” and uncovering temporal patterns. This approach quantifies county-level exceedance rates, highlights seasonal and long-term trends, and sets the stage for predictive models of system-level risk.

Ensuring safe drinking water is a fundamental public health imperative. In California, home to diverse geographies and water sources, regulatory agencies routinely test for chemical contaminants to protect consumers from exposure to harmful levels of metals and other analytes. However, the sheer volume and complexity of compliance data can obscure emerging trends, episodic spikes, and long-term changes in contaminant concentrations. This project addresses that challenge by developing a reproducible, data-driven framework for exploring and forecasting key water quality analytes such as arsenic, lead, iron, manganese, and zinc at the county level across the state.

Our central aim is twofold: first, to perform a rigorous exploratory analysis that uncovers patterns, outliers, and geographic differences in county-level contaminant measurements; and second, to architect a time-series modeling pipeline capable of providing early warning of contaminant spikes. By harmonizing raw compliance records, handling non-detect values transparently, and leveraging both statistical and machine-learning techniques, we seek to equip water managers and regulators with actionable forecasts. Such forecasts can improve sampling strategies, accelerate remediation actions, and ultimately help safeguard public health.
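
The harmonization steps and the county-level exceedance rate described above can be illustrated with a minimal Python sketch. It is not the project's pipeline: the record layout, the field names, the choice of substituting half the detection limit for non-detects, and all sample values are assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Sample:
    """Hypothetical record layout; the field names are not the SDWIS schema."""
    county: str
    analyte: str
    result: Optional[float]   # reported value; None when reported as a non-detect
    detection_limit: float    # laboratory detection limit, in the same unit as result
    unit: str                 # "ug/L" or "mg/L"
    mcl_ug_per_l: float       # regulatory limit, already expressed in ug/L


# Conversion factors to a single working unit (ug/L).
TO_UG_PER_L = {"ug/L": 1.0, "mg/L": 1000.0}


def harmonized_value(s: Sample, nd_fraction: float = 0.5) -> float:
    """Return the concentration in ug/L.

    Non-detects (missing results or results below the detection limit) are
    replaced by nd_fraction * detection_limit. The 0.5 default is an assumed
    convention, not one stated in the source.
    """
    factor = TO_UG_PER_L[s.unit]
    if s.result is None or s.result < s.detection_limit:
        return nd_fraction * s.detection_limit * factor
    return s.result * factor


def county_exceedance_rates(samples: Iterable[Sample]) -> Dict[str, float]:
    """Share of samples per county whose harmonized value exceeds the MCL."""
    totals: Dict[str, int] = defaultdict(int)
    exceedances: Dict[str, int] = defaultdict(int)
    for s in samples:
        totals[s.county] += 1
        if harmonized_value(s) > s.mcl_ug_per_l:
            exceedances[s.county] += 1
    return {county: exceedances[county] / totals[county] for county in totals}


# Made-up records for two placeholder counties, "A" and "B".
samples = [
    Sample("A", "arsenic", 12.0, 1.0, "ug/L", 10.0),    # above the limit
    Sample("A", "arsenic", None, 1.0, "ug/L", 10.0),    # non-detect
    Sample("A", "arsenic", 3.0, 1.0, "ug/L", 10.0),
    Sample("A", "arsenic", 0.5, 1.0, "ug/L", 10.0),     # below detection limit
    Sample("B", "arsenic", 0.004, 0.001, "mg/L", 10.0),
    Sample("B", "arsenic", 0.015, 0.001, "mg/L", 10.0), # above the limit after conversion
]

rates = county_exceedance_rates(samples)
for county, rate in sorted(rates.items()):
    print(f"County {county}: {rate:.0%} of samples exceed the MCL")
```

Keeping the substitution fraction as an explicit parameter makes the non-detect handling visible and easy to vary, which matters when most measurements are censored; other treatments of censored data could be swapped in at the same point.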

