M.S. Applied Data Science - Capstone Chronicles 2025
Figure 3 Yearly Heatmap
Figure 4 Number of Measurements by Analyte
We visualize the overall sampling intensity across analytes using a simple bar chart of measurement counts. The chart shows that arsenic dominates our dataset with nearly 80,000 samples, followed by fluoride (~44,000), manganese (~41,000), iron (~30,000), lead (~11,500), zinc (~4,700), and cadmium with only a few hundred observations. The steep drop-off in sample counts for certain analytes (e.g., cadmium and zinc) highlights the uneven data availability inherent in compliance monitoring programs.
This sampling-count distribution directly informs our modeling strategy. Analytes with abundant data, like arsenic and fluoride, support complex time-series and machine-learning models with richer feature sets and cross-validation. Conversely, for sparsely sampled analytes such as cadmium, we may need simpler baseline forecasts or data augmentation approaches (e.g., borrowing strength across similar analytes or pooling by county clusters). Figure 4 thus guides both the selection of analytes for modeling and the design of tailored forecasting pipelines that respect each analyte’s data richness.
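The count-driven strategy described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in this project: the function names, the `(analyte, value)` record layout, and the `RICH_MIN` and `SPARSE_MAX` cutoffs are all hypothetical and would need to be tuned against the real measurement counts.

```python
from collections import Counter

# Hypothetical cutoffs chosen for illustration only; they are not from the paper.
RICH_MIN = 10_000    # at or above this count, treat the analyte as data-rich
SPARSE_MAX = 1_000   # below this count, treat the analyte as sparsely sampled


def count_by_analyte(records):
    """Count measurements per analyte from an iterable of (analyte, value) pairs."""
    return Counter(analyte for analyte, _ in records)


def modeling_tier(n_samples, rich_min=RICH_MIN, sparse_max=SPARSE_MAX):
    """Map a sample count to a coarse modeling tier."""
    if n_samples >= rich_min:
        return "complex"       # time-series / ML models with cross-validation
    if n_samples < sparse_max:
        return "baseline"      # simple baselines, pooling, or augmentation
    return "intermediate"      # enough data for modest models


def plan_models(records, rich_min=RICH_MIN, sparse_max=SPARSE_MAX):
    """Return (analyte, count, tier) tuples sorted by descending count."""
    counts = count_by_analyte(records)
    # most_common() orders by descending count, matching a sorted bar chart.
    return [
        (analyte, n, modeling_tier(n, rich_min, sparse_max))
        for analyte, n in counts.most_common()
    ]
```

Calling `plan_models` on the full measurement table would give one row per analyte, which could feed both the bar chart and the choice of forecasting pipeline. Keeping the thresholds as parameters makes it easy to test how sensitive the tier assignments are to the cutoffs.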