M.S. Applied Data Science - Capstone Chronicles 2025


Figure 1 Correlation Heatmap of Numeric Features in df

Figure 1 shows the correlation heatmap for all numeric columns (e.g., service connections, population measures, analyte codes, reporting levels, result values, MCL, and trigger flags). Notably, the three population variables (Population TINWSYS, Population R, Population NT) exhibit very high inter-correlations (r > 0.99), indicating redundancy that we later address via feature aggregation. The analyte code correlates strongly with the reporting level and trigger fields (r ≈ 0.73), suggesting that these variables capture overlapping aspects of laboratory detection thresholds. Most other features display low to moderate correlations, which indicates that the remaining inputs carry largely non-redundant information for modeling. These insights guided our feature selection: we removed or combined highly collinear variables to reduce dimensionality while preserving those with unique information.
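The collinearity check described above can be sketched in a few lines of pandas. This is an illustrative sketch only, not the study's actual script: the function name `high_correlation_pairs`, the column names, and the synthetic data are all assumptions, and the 0.99 threshold simply mirrors the value reported for the population variables.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.99) -> pd.DataFrame:
    """List numeric column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().rename("r").reset_index()
    pairs.columns = ["feature_a", "feature_b", "r"]
    # Comparisons against NaN are False, so masked-out entries are dropped here too.
    return pairs[pairs["r"].abs() > threshold].reset_index(drop=True)


# Synthetic stand-in data (hypothetical; NOT the dataset used in the study).
rng = np.random.default_rng(0)
base = rng.lognormal(mean=8, sigma=1.5, size=500)
demo = pd.DataFrame(
    {
        "population_total": base,
        "population_residential": base * 0.9 + rng.normal(0, 1, 500),
        "result_value": rng.exponential(50, 500),
    }
)

print(high_correlation_pairs(demo))
```

Pairs returned by such a check are candidates for removal or aggregation; the same correlation matrix could also be passed to `seaborn.heatmap` to draw a figure like Figure 1.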

The scripted EDA (using pandas and seaborn) is fully reproducible and forms the basis for subsequent feature engineering steps.

Table 1 Summary Statistics for Selected Numeric Features

| Feature | Mean | Std | Min | Max |
|---|---|---|---|---|
| Service connections | 22,301 | 74,739 | 0 | 709,623 |
| Population TINWSYS | 96,607 | 385,320 | 0 | 3,856,043 |
| Result | 45.8 | 210.3 | 0 | 27,000 |
| Reporting Level | 100.0 | 80.5 | 1 | 500 |
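Summary statistics of this kind can be produced directly with pandas. The following is a minimal sketch under stated assumptions: the helper `summarize`, the column names, and the toy values are hypothetical and do not come from the study's data or code.

```python
import pandas as pd


def summarize(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return mean, sample standard deviation, min, and max with one row per feature."""
    # agg() yields one row per statistic; transpose so that features become rows.
    summary = df[columns].agg(["mean", "std", "min", "max"]).T
    summary.columns = ["Mean", "Std", "Min", "Max"]
    return summary


# Toy values for illustration only (hypothetical; NOT the study's data).
toy = pd.DataFrame(
    {
        "service_connections": [0, 10, 20],
        "result_value": [0.0, 1.0, 5.0],
    }
)

table = summarize(toy, ["service_connections", "result_value"])
print(table.round(1))
```

Note that pandas reports the sample standard deviation (`ddof=1`) by default; whether Table 1 uses the sample or population form is not stated in the text.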

