M.S. Applied Data Science - Capstone Chronicles 2025


3 Literature Review

This review is organized thematically to position our work within existing spatial, temporal, predictive, decision-support, and regulatory studies. Prior studies in environmental time-series forecasting have successfully applied ARIMA, state-space models, and machine-learning regressors to parameters such as turbidity and nutrient loading (Smith et al., 2018; Gao & Lin, 2021). However, discrete compliance data on heavy metals, characterized by zero-inflation and right-skewed distributions, remain underexplored at the county scale. Machine-learning frameworks have shown promise for surface-water quality with continuous sensor data but often rely on dense, high-frequency inputs (Patel et al., 2022). Two-stage approaches (classifying detect vs. non-detect, then regressing on the positive values) offer a strategy to handle censoring (Lee et al., 2020), yet their application to geospatial compliance datasets is nascent.

3.1 Spatial Patterns of Contaminant Exceedances

Regional geology and treatment practices drive spatial heterogeneity in contaminant profiles. Studies show arsenic hotspots in Central Valley aquifers and elevated manganese in Sierra foothill wells (Johnson et al., 2019). Mapping exceedance rates reveals counties where more than 5% of samples surpass maximum contaminant levels (MCLs), enabling targeted interventions. Our spatial analysis quantifies exceedance frequencies per county and clusters adjacent high-risk areas. This geospatial perspective guides sampling intensification and infrastructure investments in vulnerable regions.

3.2 Temporal Trend Analysis

Heavy-metal concentrations may exhibit seasonality driven by hydraulic changes, rainfall, or treatment schedules, as well as long-term trends reflecting infrastructure upgrades. Prior work on nitrate and turbidity streams identifies clear

monitoring: regulators typically learn of MCL exceedances only after violations occur. This delay may prolong public exposure to harmful concentrations, especially for health-critical metals like arsenic and lead. Additionally, the absence of integrated geospatial forecasting tools leaves water agencies reliant on static dashboards rather than predictive alerts.

Motivated by public health imperatives and regulatory efficiency, our research seeks to transform discrete laboratory records into dynamic forecasts. By anticipating spikes in contaminant levels, agencies can allocate resources more effectively, prioritizing inspections, communications, and treatment upgrades in high-risk counties before violations arise.

2.2.1 Definition of Objectives.

The project’s objectives are fourfold:

1. Perform an end-to-end data preparation and EDA that uncovers spatial and temporal patterns in heavy-metal concentrations.
2. Develop feature engineering strategies (addressing censored non-detects, unit standardization, and temporal covariates) that enable accurate modeling.
3. Compare classical time-series and machine-learning methods for forecasting both numerical concentrations and exceedance risks.
4. Build predictive models (logistic regression, random forest) using system metadata. Deliver geospatial decision support outputs, including interactive maps and a web-based dashboard, for real-time monitoring.

Risk reduction, actionable insights, and regulatory compliance underpin these goals. Even partial success, such as reliably forecasting 70% of contaminant spikes, can markedly improve early warning systems and public health interventions.
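The feature-engineering objective (censored non-detects, unit standardization, temporal covariates) can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the `Sample` record, the `engineer_features` helper, the unit table, the half-reporting-limit substitution for non-detects, and the cyclical month encoding are all assumptions chosen for illustration, and the threshold is the US EPA arsenic MCL of 0.010 mg/L, used only as an example.

```python
import math
from dataclasses import dataclass
from datetime import date

# Assumed conversion factors to mg/L; the units in the real dataset may differ.
TO_MG_PER_L = {"mg/L": 1.0, "ug/L": 1e-3, "ng/L": 1e-6}

# US EPA MCL for arsenic, used here only as an example threshold.
ARSENIC_MCL_MG_L = 0.010


@dataclass
class Sample:
    """Hypothetical laboratory record; field names are illustrative."""
    county: str
    sampled_on: date
    value: float      # reported result, or the reporting limit for a non-detect
    unit: str
    non_detect: bool  # True when the lab reported "< reporting limit"


def engineer_features(sample: Sample, mcl_mg_l: float = ARSENIC_MCL_MG_L) -> dict:
    """Turn one raw record into model-ready features."""
    # Unit standardization: express every result in mg/L.
    conc = sample.value * TO_MG_PER_L[sample.unit]

    # Censored non-detects: substitute half the reporting limit
    # (a common convention, assumed here; other treatments are possible).
    if sample.non_detect:
        conc /= 2.0

    # Temporal covariate: encode month on a circle so that
    # December and January end up adjacent.
    angle = 2 * math.pi * (sample.sampled_on.month - 1) / 12

    return {
        "county": sample.county,
        "conc_mg_l": conc,
        "detected": not sample.non_detect,
        "month_sin": math.sin(angle),
        "month_cos": math.cos(angle),
        # A non-detect is never counted as an exceedance in this sketch.
        "exceeds_mcl": (not sample.non_detect) and conc > mcl_mg_l,
    }


# Usage with made-up records (not real monitoring data).
detected = engineer_features(Sample("Kern", date(2023, 1, 15), 12.0, "ug/L", False))
censored = engineer_features(Sample("Kern", date(2023, 7, 1), 2.0, "ug/L", True))
```

The `detected` flag and the `conc_mg_l` value are kept as separate features so that a two-stage model (detect vs. non-detect, then concentration given detection) could use them directly, while `exceeds_mcl` serves as a target for exceedance-risk classifiers such as logistic regression or random forest.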

