M.S. Applied Data Science - Capstone Chronicles 2025


privacy, and duplicate entries arising from repeated submissions were removed. The resulting cleaned dataset contains 57 variables, with a focus on seven core fields for downstream analysis. We resolve these through unit parsing functions, date coercion, and linking analyte codes to lookup tables for reporting limit (RL) values.

4.3 Feature Engineering

We created a Result filled field for numeric substitution of non-detects, standardized units, parsed dates, and sorted records by date. We documented privacy measures and biases introduced by imputation and sampling frequency.

4.3.1 Variable Creation. Key features engineered include:

● Result filled: numeric result substituting non-detects at half the reporting limit.
● Year, Month, Season: temporal variables extracted from sample dates.
● County_FIPS: standardized county identifiers for geospatial analysis.

4.3.1.1 Privacy and Bias Documentation. When constructing temporal features, we carefully document any potential privacy or sampling biases. Date-time stamps are aggregated to the day or month level to avoid exposing exact sampling times that might inadvertently identify individuals or specific facilities. We also record disparities in sampling frequency (rural counties may sample infrequently, while urban systems sample monthly) so that model inputs include flags for "low-frequency" versus "high-frequency" series, preventing temporal artifacts from driving predictions. All transformations are logged with provenance metadata (source column, aggregation method, date of processing) to ensure reproducibility and ethical transparency.

4.4 Modeling

This section outlines the selection of algorithms and the design of training and validation datasets for robust prediction of exceedance events.
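The variable creation described in Section 4.3.1 can be illustrated with a minimal sketch. It is not the project's actual pipeline: the field names (`sample_date`, `result`, `reporting_limit`, `county_fips`), the use of `None` to mark a non-detect, the ISO date format, and the meteorological season mapping are all assumptions made for illustration.

```python
from datetime import datetime

# Assumed meteorological seasons; the text does not define how Season is derived.
_SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def engineer_features(record):
    """Return a copy of one sample record with the derived fields added.

    Hypothetical input keys: sample_date (YYYY-MM-DD string), result
    (number, numeric string, or None for a non-detect), reporting_limit,
    county_fips.
    """
    sampled = datetime.strptime(record["sample_date"], "%Y-%m-%d").date()

    # Non-detects are substituted at half the reporting limit.
    if record["result"] is None:
        result_filled = record["reporting_limit"] / 2.0
    else:
        result_filled = float(record["result"])

    derived = {
        "result_filled": result_filled,
        "year": sampled.year,
        "month": sampled.month,
        "season": _SEASON_BY_MONTH[sampled.month],
        # County FIPS codes are five digits; restore leading zeros
        # that are lost when the code is stored as an integer.
        "county_fips": str(record["county_fips"]).zfill(5),
    }
    return {**record, **derived}


def build_feature_table(records):
    """Engineer features for every record and sort the rows by sample date."""
    rows = [engineer_features(r) for r in records]
    # ISO-formatted date strings sort chronologically.
    return sorted(rows, key=lambda r: r["sample_date"])
```

Keeping the original `result` alongside `result_filled` preserves the distinction between measured values and substituted ones, which matters when documenting the bias introduced by imputation.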

Both logistic regression and random forest models achieved ROC-AUC values above 0.85. The random forest slightly outperformed logistic regression in recall for exceedance events, while logistic regression provided clearer insights into the relative importance of system characteristics.

Predictive Modeling. L1-regularized logistic regression and random forest classifiers were trained on system metadata to predict exceedance events. Models were evaluated via ROC-AUC and precision-recall metrics, with results summarized in a formatted table.

4.4.1 Selection of Modeling Techniques. We compared L1-regularized logistic regression and random forest classifiers due to their complementary strengths: logistic regression offers interpretable feature coefficients and direct insight into system-level risk factors, while random forests capture nonlinear relationships and interactions without assuming a parametric form.

We employ the SARIMAX implementation from the statsmodels library to jointly capture autoregressive (AR), integrated (I), moving-average (MA), and seasonal (SAR, SMA) components. Order selection is performed via a grid search over the non-seasonal orders (p, d, q) and seasonal orders (P, D, Q) with a 12-month cycle, using AIC as the optimization criterion. Model fitting uses maximum-likelihood estimation with non-restrictive stationarity and invertibility settings. Forecasting generates point predictions and 95% confidence intervals for the final 12 months. Evaluation metrics include mean absolute error (MAE) and root mean squared error (RMSE) computed on the test set to assess both bias and variance in the forecasts.
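The order search and evaluation metrics described above can be sketched as follows. This is a hedged illustration, not the authors' code: the grid bounds (each order from 0 to 1) are assumptions, since the text does not state the ranges searched, and the helper names `candidate_orders`, `fit_sarimax_aic`, and `select_by_aic` are hypothetical. Only `fit_sarimax_aic` requires statsmodels, which it imports lazily.

```python
import itertools
import math


def candidate_orders(max_p=1, max_d=1, max_q=1,
                     max_P=1, max_D=1, max_Q=1, period=12):
    """Yield ((p, d, q), (P, D, Q, period)) pairs for the grid search.

    The upper bounds are illustrative assumptions; period=12 matches the
    12-month seasonal cycle described in the text.
    """
    ranges = [range(m + 1) for m in (max_p, max_d, max_q, max_P, max_D, max_Q)]
    for p, d, q, P, D, Q in itertools.product(*ranges):
        yield (p, d, q), (P, D, Q, period)


def fit_sarimax_aic(series, order, seasonal_order):
    """Fit one SARIMAX specification by maximum likelihood and return its AIC."""
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    model = SARIMAX(
        series,
        order=order,
        seasonal_order=seasonal_order,
        enforce_stationarity=False,   # non-restrictive settings, as in the text
        enforce_invertibility=False,
    )
    return model.fit(disp=False).aic


def select_by_aic(series, candidates, aic_fn=fit_sarimax_aic):
    """Return (aic, order, seasonal_order) for the lowest finite AIC, or None."""
    best = None
    for order, seasonal_order in candidates:
        try:
            aic = aic_fn(series, order, seasonal_order)
        except (ValueError, ArithmeticError):
            continue  # skip specifications that cannot be estimated
        if not math.isfinite(aic):
            continue
        if best is None or aic < best[0]:
            best = (aic, order, seasonal_order)
    return best


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )
```

With statsmodels installed, `select_by_aic(train_series, candidate_orders())` would return the selected specification, where `train_series` is a hypothetical name for the training portion of the monthly series. Passing the scoring function as `aic_fn` keeps the search loop independent of the fitting library, so it can be checked with a stand-in scorer.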
