M.S. Applied Data Science - Capstone Chronicles 2025
Spatial-Temporal and Predictive Modeling of Chemical Contaminant Exceedances in California Public Water Systems

Tarane Javaherpour
Applied Data Science Master’s Program
Shiley Marcos School of Engineering / University of San Diego
tjavaherpour@sandiego.edu

Davood Aein
Applied Data Science Master’s Program
Shiley Marcos School of Engineering / University of San Diego
daein@sandiego.edu

ABSTRACT
Chemical contaminants in drinking water, primarily heavy metals (e.g., arsenic, lead, iron, manganese, zinc) and fluoride, pose significant public-health risks, yet routine compliance data often remain underutilized for targeted intervention. California’s public water systems generate millions of laboratory measurements for chemical analytes, many of them below detection limits and sampled irregularly, making it difficult to anticipate contamination events. This study presents a comprehensive spatial-temporal and predictive modeling analysis of monitoring records from California’s Safe Drinking Water Information System (SDWIS), together with an end-to-end framework for county-level forecasting of heavy-metal concentrations and exceedance risks. We harmonized over 600,000 test results by standardizing units to µg/L, imputing nondetects at half the reporting limit, geocoding by county, and aggregating daily medians by county and analyte, and we applied spatial autocorrelation analysis to identify geographic hotspots of maximum contaminant level (MCL) exceedances. Exploratory data analysis uncovered spatial hotspots (e.g., elevated arsenic in the Central Valley) and episodic temporal spikes without consistent seasonality. Seasonal-trend decomposition (STL) and ARIMA modeling revealed consistent summer peaks in contaminant levels, while logistic regression and random forest classifiers (ROC-AUC > 0.85) demonstrated that population served, service connections, and analytical methods are strong predictors of exceedance events. We engineered features spanning temporal (lags, rolling medians, cyclical encodings), categorical (county, analyte), and distributional characteristics, and compared classical time-series models (ARIMA, Prophet) with machine-learning approaches, including an MLP and two-stage classifiers. On a 12-month hold-out, random-forest regressors achieved mean absolute errors as low as 12 µg/L, while exceedance classifiers attained >90% recall for critical analytes. A hybrid ARIMA-forest model further reduced forecasting errors by 8% and boosted recall by 3%. Deployed in a Streamlit dashboard, our framework delivers interactive geospatial predictions and exceedance probabilities, enabling regulators to shift from reactive compliance to proactive risk management.

KEYWORDS
chemical contaminants, water quality forecasting, heavy metals, non-detect imputation, time-series modeling, machine learning, random forest, ARIMA, exceedance classification, geospatial decision support

1 Introduction
Chemical contaminants in drinking water, primarily heavy metals (arsenic, lead, iron, manganese, zinc) and fluoride, pose a serious public-health risk, yet regulators often lack a unified view of