M.S. Applied Data Science - Capstone Chronicles 2025


standards for these analytes, others exhibit episodic spikes tied to local geological conditions or treatment operations. Iron and lead display even greater heterogeneity: coastal urban centers often show minor seasonal fluctuations linked to distribution system hydraulics, whereas inland and rural counties register sharp, sporadic peaks that likely stem from changes in source water chemistry or well maintenance events. These patterns underscore the importance of tailoring forecasting models to the sampling density and contaminant behavior of each county rather than relying on a one-size-fits-all approach.

Our spatial–temporal findings and forecast results should be interpreted in the context of California’s regulatory and public-health landscape. We observed significant positive spatial autocorrelation (Moran’s I > 0.3, p < .01) for arsenic and manganese, indicating persistent “hot spots” in the Central Valley and parts of Northern California. Temporally, additive decomposition revealed modest upward trends in iron and manganese across most counties, with pronounced seasonal peaks in late summer, likely reflecting lower flows and higher geogenic mobilization during dry months. Our SARIMAX forecasts for Alameda, when extended statewide, suggest that while fluoride levels remain near detection limits, arsenic and iron will continue their slight upward drift, and manganese will exhibit recurring seasonal spikes. Logistic regression and tree-based classifiers achieved ROC-AUC values between 0.78 and 0.85 for exceedance risk, outperforming random chance but leaving room for improvement in data-sparse counties.

By applying SARIMAX models to each county’s monthly aggregated data, we achieved median forecast errors of 8–12 µg/L for arsenic and fluoride across the state, and somewhat larger errors (20–35 µg/L) for iron and lead in counties with infrequent sampling.
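For readers unfamiliar with the spatial autocorrelation statistic reported above, the following is a minimal sketch of global Moran's I with a one-sided permutation p-value. It is not the project's actual code: the four-county chain, the adjacency weights, and the concentration values are hypothetical, and a real analysis would typically build the weights from county geometries using a spatial statistics library.

```python
import random


def morans_i(values, weights):
    """Global Moran's I for observations x_i and spatial weights w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / w_sum) * (cross / sum(d * d for d in dev))


def permutation_p_value(values, weights, n_perm=999, seed=0):
    """One-sided pseudo p-value: share of shuffles with I >= observed I."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical example: four counties in a row, each adjacent to its neighbors.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
arsenic_ugl = [1.0, 2.0, 8.0, 9.0]  # made-up county means, µg/L

i_stat = morans_i(arsenic_ugl, adjacency)
p_value = permutation_p_value(arsenic_ugl, adjacency)
print(f"Moran's I = {i_stat:.2f}, pseudo p-value = {p_value:.3f}")
```

Positive values of I indicate that neighboring counties tend to have similar concentrations. With only four units the permutation p-value is not meaningful; it is included only to show how significance would be assessed on the full set of counties.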
Counties with robust monitoring programs (e.g., Los Angeles, San Diego, Sacramento) consistently yielded tighter prediction intervals, demonstrating that model performance scales with data richness. In sparsely sampled areas, however, forecasts carried wider uncertainty bands, highlighting a need for supplemental data sources such as rainfall, source water turbidity, or system operational logs to improve predictive confidence.

From a resource-allocation perspective, these statewide forecasts can guide the State Water Resources Control Board and individual water systems in prioritizing monitoring and infrastructure investment. For instance, counties where models predict forthcoming exceedances of the 10 µg/L arsenic MCL could be flagged for accelerated treatment upgrades, while regions forecasting elevated iron could schedule corrosion control interventions before aesthetic or operational complaints arise.

Looking ahead, integrating exogenous drivers like precipitation, land use changes, or treatment modifications into multivariate forecasting frameworks will be critical for addressing the remaining uncertainty in less-monitored jurisdictions and ensuring safe, compliant drinking water for all Californians.

Censoring and Non-Detects: Substituting half the reporting limit for non-detects may bias low-concentration estimates and distort variance estimates.

Extreme Outliers: SARIMAX’s Gaussian error assumption copes poorly with infrequent but enormous spikes (e.g., iron > 20,000 µg/L), leading to under-forecasted peaks.

Fixed Hyperparameters: Using the same (p,d,q)(P,D,Q,m) orders across all counties simplifies comparison but may not be optimal; some series would benefit from a higher seasonal order or the inclusion of exogenous covariates (e.g., rainfall).

Data Gaps: Irregular sampling (especially for lead and copper under compliance programs) creates missing-at-random patterns that standard interpolation may not fully address.
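The censoring limitation noted above can be illustrated with a small sensitivity check. The sketch below is only an illustration under stated assumptions: the reporting limit, the sample values, and the helper name `substitute_non_detects` are all hypothetical, and non-detects are represented as `None`. It compares summary statistics when non-detects are replaced by zero, half the reporting limit, or the full reporting limit.

```python
from statistics import mean, stdev

REPORTING_LIMIT = 2.0  # µg/L, hypothetical value for illustration


def substitute_non_detects(samples, reporting_limit, fraction=0.5):
    """Replace non-detects (None) with a fixed fraction of the reporting limit."""
    return [reporting_limit * fraction if v is None else v for v in samples]


# Hypothetical series: two non-detects and two quantified results (µg/L).
samples = [None, None, 3.0, 5.0]

summary = {}
for fraction in (0.0, 0.5, 1.0):
    filled = substitute_non_detects(samples, REPORTING_LIMIT, fraction)
    summary[fraction] = (mean(filled), stdev(filled))

for fraction, (avg, sd) in summary.items():
    print(f"substitute {fraction:.1f} x RL: mean = {avg:.2f}, sd = {sd:.2f}")
```

If the mean and standard deviation shift noticeably across the three substitution choices, as they do in this toy series, the downstream trend and forecast estimates are sensitive to how censored values are handled, and a method designed for censored data may be worth considering.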
