M.S. Applied Data Science - Capstone Chronicles 2025
Next, we visualize the full time series of raw analyte measurements to understand their distribution over time. Figure 2 plots each sample's concentration (y-axis) against its sampling date (x-axis). A dense cluster of points sits near the bottom of the chart: most values lie at or just above the laboratory detection limit, which highlights the heavy censoring in the data, with over 70% of samples recorded as non-detects or substituted at half the reporting limit. Scattered above this baseline are intermittent, high-magnitude spikes (some exceeding 400,000 µg/L), indicating episodic contamination events or potential data-entry anomalies. These extreme events occur irregularly rather than seasonally, underscoring the need for models that handle unpredictable outliers and capture temporal dependencies. To address this, later feature engineering will include threshold flags for spikes, rolling-window statistics, and lagged covariates, while modeling will employ robust loss functions and architectures tailored to episodic behavior.

Figure 2. Analytics Result Over Time
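The three feature families named above (spike flags, rolling-window statistics, and lagged covariates) can be sketched in pandas as follows. This is a minimal illustration and not the project's actual pipeline: the column names `sample_date` and `result`, the quantile-based spike threshold, the window length, and the lag set are all assumptions chosen for the example. The rolling window here counts samples rather than calendar days, which is a simplification for irregularly spaced sampling.

```python
import pandas as pd


def add_spike_features(df, value_col="result", date_col="sample_date",
                       spike_quantile=0.99, window=3, lags=(1, 2)):
    """Add a spike flag, rolling statistics, and lagged values.

    Column names and default parameters are hypothetical placeholders.
    """
    out = df.sort_values(date_col).reset_index(drop=True).copy()

    # Threshold flag: values above a high empirical quantile count as spikes.
    threshold = out[value_col].quantile(spike_quantile)
    out["is_spike"] = (out[value_col] > threshold).astype(int)

    # Rolling statistics are computed on past samples only (shift by one),
    # so a row's own value never leaks into its features.
    past = out[value_col].shift(1)
    out[f"roll_median_{window}"] = past.rolling(window, min_periods=1).median()
    out[f"roll_max_{window}"] = past.rolling(window, min_periods=1).max()

    # Lagged covariates: the k-th previous sample's concentration.
    for k in lags:
        out[f"lag_{k}"] = out[value_col].shift(k)

    return out
```

The rolling median is used alongside the rolling maximum because, with most samples sitting at the censoring floor, a median stays near the baseline while the maximum records whether a recent spike occurred.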
Finally, we aggregate by month to compute average analyte concentrations and display them in a calendar-style heatmap (Figure 3). In this layout, each row corresponds to a calendar year and each column to a month, with cell color and overlaid numeric labels showing the mean concentration for that period. This visualization reveals subtle seasonal trends, with winter and early spring months (e.g., January–April) often exhibiting elevated mean levels compared to midsummer, and it highlights interannual variability. For instance, 2024 shows pronounced peaks in March and May, whereas 2021 and 2022 are comparatively stable. An outlier appears in June 2025 (9.5 µg/L), likely reflecting heavy censoring at the detection limit rather than a true drop in contaminant levels. These insights directly inform our feature engineering: they motivate the inclusion of seasonal dummy variables, month-by-year interactions, and rolling-window features to capture cyclic behavior and anomalous events in the modeling stage. Figure 3 below illustrates these monthly averages in a concise, comparative format.

Figure 3. Monthly average analyte concentration heatmap by year and month.
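As a rough sketch of the aggregation behind a year-by-month heatmap, the snippet below builds the grid of monthly means and a set of month dummy variables with pandas. The column names `sample_date` and `result` and both function names are hypothetical; the source does not specify how the authors implemented this step.

```python
import pandas as pd


def monthly_mean_grid(df, value_col="result", date_col="sample_date"):
    """Return a table of mean concentrations: one row per year, one column per month."""
    dates = pd.to_datetime(df[date_col])
    grid = (
        df.assign(year=dates.dt.year, month=dates.dt.month)
          .pivot_table(index="year", columns="month",
                       values=value_col, aggfunc="mean")
          # Keep all 12 month columns even if some months have no samples.
          .reindex(columns=list(range(1, 13)))
    )
    return grid


def month_dummies(df, date_col="sample_date"):
    """Return 12 indicator columns (month_1 ... month_12) aligned to df's index."""
    months = pd.to_datetime(df[date_col]).dt.month
    categorical = pd.Series(
        pd.Categorical(months, categories=list(range(1, 13))),
        index=df.index,
    )
    return pd.get_dummies(categorical, prefix="month", dtype=int)
```

The resulting grid can be handed to a plotting routine, for example seaborn's `heatmap` with `annot=True` to overlay the numeric labels. Cells for year and month combinations without samples remain `NaN` rather than being filled with zero, so missing periods are not mistaken for low concentrations.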