M.S. Applied Data Science - Capstone Chronicles 2025


Figure 6 Model Performance Comparison

SARIMA outperformed its non-seasonal counterpart by approximately 116 AIC points, consistent with strong annual cycles in contaminant levels. Likewise, Random Forest achieved the highest ROC AUC (≈ 0.97), slightly above XGBoost (≈ 0.94); both tree ensembles beat logistic regression (≈ 0.90), which suggests that nonlinear interactions among system features carry predictive power for exceedance events. Figure 6 shows normalized performance: SARIMA vs. ARIMA on the AIC scale and Random Forest vs. XGBoost on the ROC AUC scale.

In raw metric terms, SARIMA’s AIC of ~530 versus ARIMA’s ~645 underscores the value of modeling seasonality explicitly. On the classification side, Random Forest’s ROC AUC of 0.97 makes it our preferred exceedance-risk detector, while logistic regression remains useful for its transparent feature coefficients. Taken together, these results guided our choice of SARIMA for concentration forecasting, capturing trend and seasonality parsimoniously, and Random Forest for exceedance-risk classification, balancing predictive accuracy with robustness across California’s diverse water systems.

Table 2 Forecasting and Exceedance-Risk Model Performance Summary (AIC and ROC AUC)

Model                        Task                             Metric    Value
ARIMA (p,d,q) = (0,0,1)      Concentration Forecasting        AIC       645.3
SARIMA (0,0,1)(1,1,1,12)     Concentration Forecasting        AIC       529.7
Logistic Regression (L1)     Exceedance Risk Classification   ROC AUC   0.90
Random Forest                Exceedance Risk Classification   ROC AUC   0.97
XGBoost                      Exceedance Risk Classification   ROC AUC   0.94
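
The selection logic described above can be sketched in a few lines of Python using the values reported in Table 2. This is an illustrative recomputation, not the project's actual pipeline: the dictionary names are hypothetical. The relative-likelihood figure assumes both AIC values were computed on a comparable likelihood scale, which may not hold exactly when one model applies seasonal differencing and the other does not.

```python
import math

# Values transcribed from Table 2 (dictionary names are illustrative).
aic = {
    "ARIMA (0,0,1)": 645.3,
    "SARIMA (0,0,1)(1,1,1,12)": 529.7,
}
roc_auc_by_model = {
    "Logistic Regression (L1)": 0.90,
    "Random Forest": 0.97,
    "XGBoost": 0.94,
}

# Lower AIC is better; higher ROC AUC is better.
best_forecaster = min(aic, key=aic.get)
best_classifier = max(roc_auc_by_model, key=roc_auc_by_model.get)

# Gap between the two forecasting models in AIC points.
delta_aic = max(aic.values()) - min(aic.values())

# Akaike relative likelihood of the weaker model, exp(-delta/2).
# Assumption: both AICs refer to the same data and likelihood scale.
relative_likelihood = math.exp(-delta_aic / 2.0)

print(f"Preferred forecaster: {best_forecaster}")
print(f"Preferred classifier: {best_classifier}")
print(f"AIC gap: {delta_aic:.1f} points")
print(f"Relative likelihood of weaker model: {relative_likelihood:.2e}")
```

Under that assumption, a gap of this size leaves essentially no support for the non-seasonal model, which matches the reading given in the text.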
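
For readers less familiar with the classification metric, ROC AUC can be read as the probability that a randomly chosen exceedance case receives a higher score than a randomly chosen non-exceedance case. The sketch below computes it directly from that definition. The function name, labels, and scores are hypothetical toy values, not data from this study, and a library routine would normally be used instead of this quadratic-time loop.

```python
def pairwise_roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical): 1 = exceedance, 0 = no exceedance.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]

toy_auc = pairwise_roc_auc(toy_labels, toy_scores)
print(f"Toy ROC AUC: {toy_auc:.3f}")
```

On this reading, an AUC of 0.97 means the classifier ranks an exceedance above a non-exceedance in about 97 of 100 such pairs.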
