M.S. Applied Data Science - Capstone Chronicles 2025


monthly cycles (Wang & Zhang, 2020), but heavy metals often show episodic spikes without consistent seasonality. We analyze temporal autocorrelation, seasonal decomposition, and changepoint detection to discern patterns in each analyte's time series. These insights inform model selection, e.g., ARIMA for smooth trends versus tree-based models for irregular events.

3.3 Predictive Modeling of Exceedance Risk

Binary exceedance forecasting transforms concentration prediction into a classification problem: predicting whether the next sample will exceed regulatory thresholds. Approaches using logistic regression and random forests achieve >85% recall for nitrate exceedances (Khan et al., 2022), yet heavy-metal exceedance models remain underdeveloped. Our work implements both single-stage regressors (predict the concentration, then apply the threshold) and two-stage classifiers (detect vs. non-detect, then risk). Our evaluation metrics emphasize recall, minimizing missed exceedances while maintaining acceptable precision.

3.4 Geospatial Decision-Support Tools

Interactive mapping platforms (e.g., ArcGIS Online) can integrate forecast outputs with infrastructure layers. Prior dashboards for lead action-level predictions demonstrate the value of real-time visual analytics (Smith et al., 2023). We design a Streamlit-based dashboard that overlays forecasts on county maps, enabling regulators to filter by analyte, date range, and exceedance probability. Geospatial decision support thus becomes an integral part of proactive water quality management.

3.5 Regulatory Framework and Data Quality

Data quality in SDWIS hinges on lab accreditation and reporting guidelines but suffers from inconsistent reporting limits and missing metadata. The U.S. EPA's guidance for non-detect substitution offers one standard (half the reporting limit), yet alternative methods (maximum likelihood estimation, Kaplan–Meier) provide more rigorous inference (EPA, 2018). Understanding the regulatory context (sampling frequencies, MCLs, and reporting protocols) is critical for interpreting model outputs. We audit the dataset for missing fields, inconsistent date formats, and outlier lab results that may reflect data-entry errors rather than genuine contamination events.

4 Methodology

Our methodology spans data ingestion, cleaning, exploratory data analysis (EDA), feature engineering, model development, and deployment. We use Python (pandas, NumPy, scikit-learn), R (forecast, changepoint), and Streamlit for interactive delivery. All code used for data analysis, figure generation, and machine learning in this paper is available in the GitHub repository Tarane2028/ADS-599-Capstone-Project. For any inquiries, please contact the authors of the paper.

4.1 Data Acquisition and Aggregation

Raw SDWIS exports are loaded directly from Excel into pandas DataFrames. We aggregate by county–analyte–date, computing daily medians to reduce sampling irregularity; samples taken multiple times per day are consolidated into a single record.

4.1.1 Exploratory Data Analysis

We begin our exploratory data analysis by examining both univariate and multivariate relationships among key numeric features. In addition to distribution plots and temporal decompositions, we generate a correlation heatmap to assess pairwise associations and identify potential multicollinearity among predictors.
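The autocorrelation check that guides model selection (Section 3.2) can be sketched as follows. This is a minimal illustration on a synthetic monthly series, not the actual SDWIS data; the series construction and lag choices are assumptions for demonstration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with a 12-month cycle, standing in for an
# analyte time series aggregated to monthly medians (illustrative only).
rng = np.random.default_rng(1)
t = np.arange(120)
series = pd.Series(np.sin(2 * np.pi * t / 12) + 0.2 * rng.normal(size=120))

# A strong lag-12 autocorrelation is evidence of an annual cycle,
# supporting a seasonal model such as ARIMA; a flat ACF points to
# irregular, event-driven behavior better suited to tree-based models.
acf_12 = series.autocorr(lag=12)
acf_6 = series.autocorr(lag=6)   # half-cycle lag: strongly negative here
```

In practice the same lag scan runs per county–analyte series, and series with weak seasonal autocorrelation are routed to the tree-based models.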
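The two-stage classifier described in Section 3.3 (detect vs. non-detect, then exceedance risk) can be sketched with scikit-learn. The feature matrix and labels below are synthetic stand-ins; the model choices (logistic regression for detection, random forest for risk) mirror the approaches cited in the text but are an assumed pairing, not the paper's final configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for engineered features (lags, season, location effects).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
detected = (X[:, 0] + rng.normal(scale=0.5, size=500)) > 0            # stage-1 label
exceeds = detected & ((X[:, 1] + rng.normal(scale=0.5, size=500)) > 1)  # stage-2 label

# Stage 1: will the analyte be detected at all?
stage1 = LogisticRegression().fit(X, detected)

# Stage 2: among predicted detections, will the result exceed the MCL?
mask = stage1.predict(X).astype(bool)
stage2 = RandomForestClassifier(random_state=0).fit(X[mask], exceeds[mask])

# Combined output: exceedance risk is flagged only where a detection
# is predicted; non-detects are assumed non-exceedances by construction.
risk = np.zeros(len(X), dtype=bool)
risk[mask] = stage2.predict(X[mask]).astype(bool)
```

Evaluating this pipeline with recall-oriented metrics, as the text argues, penalizes missed exceedances at either stage.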
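The half-reporting-limit substitution from the EPA guidance discussed in Section 3.5 is straightforward in pandas. The column names below are illustrative, not the actual SDWIS export schema.

```python
import numpy as np
import pandas as pd

# Illustrative lab results: NaN marks a censored (non-detect) result.
df = pd.DataFrame({
    "result": [np.nan, 3.2, np.nan, 1.1],
    "reporting_limit": [0.5, 0.5, 1.0, 0.5],
})

# EPA's simple substitution: assign half the reporting limit (RL/2)
# to each non-detect before downstream modeling.
df["result_filled"] = df["result"].fillna(df["reporting_limit"] / 2)
```

The more rigorous alternatives the text mentions (maximum likelihood estimation, Kaplan–Meier) treat these values as censored rather than imputing a point value.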
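The county–analyte–date aggregation from Section 4.1 reduces to a groupby over normalized dates. The toy frame below assumes illustrative column names, not the real export layout.

```python
import pandas as pd

# Hypothetical SDWIS-style samples, including a same-day duplicate.
samples = pd.DataFrame({
    "county": ["A", "A", "A", "B"],
    "analyte": ["Nitrate", "Nitrate", "Nitrate", "Lead"],
    "sample_date": pd.to_datetime(
        ["2024-01-05 08:00", "2024-01-05 16:00", "2024-01-06 09:00", "2024-01-05 10:00"]
    ),
    "result": [4.0, 6.0, 5.0, 0.012],
})

# Collapse repeated same-day samples into one daily median per
# county-analyte pair, smoothing sampling irregularity.
daily = (
    samples.assign(date=samples["sample_date"].dt.normalize())
    .groupby(["county", "analyte", "date"], as_index=False)["result"]
    .median()
)
```

The median (rather than the mean) keeps single anomalous same-day readings from dominating the aggregated value.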
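The multicollinearity check behind the correlation heatmap in Section 4.1.1 comes down to a single pairwise-correlation call. The features below are synthetic stand-ins for the cleaned numeric columns, with one pair made deliberately collinear.

```python
import numpy as np
import pandas as pd

# Toy numeric features; the real EDA runs the same call on the
# cleaned SDWIS frame and renders the matrix as a heatmap.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
features = pd.DataFrame({
    "nitrate": x,
    "nitrite": 0.8 * x + 0.2 * rng.normal(size=200),  # correlated on purpose
    "lead": rng.normal(size=200),                      # independent
})

# Pairwise Pearson correlations; strong off-diagonal entries flag
# predictors that may be redundant or destabilize linear models.
corr = features.corr()
```

Pairs with high absolute correlation are candidates for dropping or combining before the regression-based models are fit.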

