M.S. Applied Data Science - Capstone Chronicles 2025


Figure 5. Median Concentration by Analyte (µg/L)

4.2 Data Quality and Cleaning

Our cleaning pipeline handles censored values, inconsistent units, and duplicate records transparently to maintain data integrity. We audit units (converting mg/L to µg/L), standardize inconsistent date formats, and drop records with missing or nonsensical values. Non-detect measurements are imputed at half the reporting limit, following EPA guidance, to retain censoring information while enabling numeric modeling. All cleaning operations are logged with provenance metadata for reproducibility. We systematically identify missing values, report-level inconsistencies, and improbable concentrations (e.g., iron > 50 000 µg/L). A data-quality report flags records for manual review or exclusion.

4.2.1 Data Quality Issues. Non-detects were imputed at half the reporting limit to retain censored observations. Units were converted to µg/L to standardize across analytes. Personal identifiers were excluded to protect

We visualize the central tendency of contaminant levels across analytes using a log-scale bar chart of median concentrations. Median values range from approximately 0.5 µg/L for both fluoride and cadmium up to nearly 200 µg/L for iron, with intermediate levels observed for lead (~3 µg/L), arsenic (~6 µg/L), zinc (~30 µg/L), and manganese (~70 µg/L). The log-scale representation underscores the orders-of-magnitude differences among analytes, reflecting both regulatory monitoring priorities and geochemical prevalence. Lower median concentrations for fluoride and cadmium indicate frequent non-detects or efficient treatment processes, whereas higher medians for iron and manganese align with naturally occurring groundwater levels in many California aquifers. Understanding these baseline concentration patterns guides model selection, suggesting, for example, that analytes with low variability near detection limits may benefit from classification-focused approaches, while those with higher median levels can leverage standard time-series regression techniques.
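The cleaning rules of Section 4.2 and the per-analyte medians discussed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the `Sample` record layout, the function names, and the example values are all hypothetical, and only the mg/L to µg/L conversion, the half-reporting-limit rule, and the iron threshold of 50 000 µg/L come from the text.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


# Hypothetical record layout; the source does not specify field names.
@dataclass
class Sample:
    analyte: str
    value: Optional[float]   # reported result; None when missing
    unit: str                # "mg/L" or "ug/L" (ASCII stand-in for µg/L)
    reporting_limit: float   # expressed in the same unit as `value`
    non_detect: bool         # True when the lab reported a censored result


# Conversion factors to µg/L (1 mg/L = 1000 µg/L).
TO_UG_PER_L = {"ug/L": 1.0, "mg/L": 1000.0}

# Review threshold in µg/L; only the iron example appears in the text.
IMPROBABLE_ABOVE = {"iron": 50_000.0}


def clean(sample: Sample) -> Optional[float]:
    """Return the concentration in µg/L, or None if the record is dropped."""
    factor = TO_UG_PER_L.get(sample.unit)
    if factor is None:
        return None  # unrecognized unit
    if sample.non_detect:
        # Non-detects are imputed at half the reporting limit.
        return 0.5 * sample.reporting_limit * factor
    if sample.value is None or sample.value < 0:
        return None  # missing or nonsensical value
    return sample.value * factor


def needs_review(sample: Sample) -> bool:
    """Flag improbably high concentrations for manual review."""
    conc = clean(sample)
    limit = IMPROBABLE_ABOVE.get(sample.analyte)
    return conc is not None and limit is not None and conc > limit


def medians_by_analyte(samples: List[Sample]) -> Dict[str, float]:
    """Group cleaned concentrations by analyte and take the median of each."""
    grouped: Dict[str, List[float]] = {}
    for s in samples:
        conc = clean(s)
        if conc is not None:
            grouped.setdefault(s.analyte, []).append(conc)
    return {analyte: median(values) for analyte, values in grouped.items()}


# Made-up values for illustration only; they are not data from the study.
samples = [
    Sample("iron", 0.2, "mg/L", 0.05, False),
    Sample("iron", 150.0, "ug/L", 50.0, False),
    Sample("iron", None, "ug/L", 100.0, True),
    Sample("lead", None, "ug/L", 5.0, True),
    Sample("lead", 4.0, "ug/L", 1.0, False),
    Sample("lead", -1.0, "ug/L", 1.0, False),  # dropped as nonsensical
]

medians = medians_by_analyte(samples)
```

Plotting `medians` as a bar chart with a logarithmic y-axis would give a figure of the same kind as Figure 5. Keeping flagging (`needs_review`) separate from cleaning mirrors the text's distinction between records that are dropped outright and records that are routed to manual review.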
