M.S. Applied Data Science - Capstone Chronicles 2025


6.1 Conclusion

Seasonal ARIMA models can capture the bulk of the temporal structure across diverse contaminants and geographies, offering reasonable short- to medium-term forecasts (1–2 years ahead) with interpretable confidence intervals. However, very large outliers and counties with sparse data remain challenging, and contaminant-specific behaviors (e.g., monotonic fluoride trends vs. episodic manganese spikes) require tailored modeling strategies.

Drawing clear, actionable conclusions ensures our work informs policy and future research. By synthesizing our multi-method approach (spatial clustering, time-series decomposition, exceedance classification, and forecasting) into a cohesive narrative, we find that California’s drinking-water quality exhibits both geographic clustering and seasonality: arsenic and manganese are highest in agricultural and volcanic regions, and all four key analytes show late-summer peaks. Forecast models project continued persistence of these patterns over the next decade, with iron and manganese spikes likely to recur annually. While our exceedance classifiers provide a useful early-warning tool (AUC ≈ 0.8), they perform less reliably in under-sampled counties, highlighting the importance of data equity in environmental monitoring.

6.2 Recommend Next Steps/Future Studies

To strengthen monitoring and prediction statewide, we recommend increasing sampling frequency in identified hot spots, especially during peak dry-season months, and deploying real-time sensors where feasible. Models should be enriched with geologic, hydrologic, and infrastructure metadata (e.g., well depth, aquifer type, pipe age), and should handle non-detects using censored-data methods (Tobit or survival analysis) or hierarchical Bayesian frameworks that borrow strength across counties. Decision-making can be accelerated with interactive geospatial dashboards that overlay forecasts, exceedance probabilities, and regulatory thresholds to guide resource allocation.

On the modeling side, we recommend pursuing hybrids that couple SARIMAX with threshold-based trigger components (e.g., Poisson processes for spikes) or that apply machine-learning regressors (e.g., Random Forest) to the residuals to better capture extremes, and extending SARIMA to SARIMAX by incorporating exogenous drivers such as precipitation, temperature, or operational logs (e.g., maintenance schedules). Finally, county-specific hyperparameter tuning should be automated, ideally via Bayesian optimization, to tailor models to local dynamics and improve out-of-sample accuracy.

ACKNOWLEDGMENTS

We would like to express our deepest gratitude to Dr. Ebrahim Tarshizi for his guidance and valuable feedback throughout the course of this research. We also thank the California State Water Resources Control Board for granting access to the water-quality dataset.

References

American Public Health Association. (2017). Standard methods for the examination of water and wastewater (23rd ed.). APHA/AWWA/WEF.

Box, G. E. P., & Pierce, D. A. (1970). Distribution of residual autocorrelations in ARIMA models. Journal of the American Statistical Association, 65(332), 1509–1526. https://doi.org/10.1080/01621459.1970.10481180

