M.S. Applied Data Science - Capstone Chronicles 2025
Figure 1 Correlation Heatmap of Numeric Features in df
Figure 1 shows the correlation heatmap for all numeric columns (e.g., service connections, population measures, analyte codes, reporting levels, result values, MCL, and trigger flags). Notably, the three population variables (Population TINWSYS, Population R, Population NT) are very highly inter-correlated (r > 0.99), indicating redundancy that we later address via feature aggregation. The analyte code correlates strongly with the reporting level and trigger fields (r ≈ 0.73), suggesting that these variables capture overlapping aspects of laboratory detection thresholds. Most other features display low to moderate correlations, indicating that they contribute largely distinct information to the model. These insights guided our feature selection: we removed or combined highly collinear variables to reduce dimensionality while preserving those that carry unique information.
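The collinearity screen described above can be sketched in a few lines of pandas. This is a minimal illustration, not the project's actual script: the helper name `high_correlation_pairs`, the 0.99 cutoff used as a default, and the synthetic `toy` columns are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.99) -> pd.DataFrame:
    """List numeric column pairs whose absolute Pearson r exceeds `threshold`."""
    corr = df.select_dtypes(include="number").corr()
    # Keep the strict upper triangle so each pair appears once and the
    # diagonal (r = 1 with itself) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["feature_a", "feature_b", "r"]
    # NaN entries (masked cells, if retained by stack) fail this comparison
    # and are dropped here.
    return pairs[pairs["r"].abs() > threshold].reset_index(drop=True)


# Synthetic stand-in data (hypothetical column names, not the study's df).
rng = np.random.default_rng(0)
base = rng.lognormal(mean=8, sigma=1.5, size=200)
toy = pd.DataFrame(
    {
        "population_total": base,
        "population_residential": 0.97 * base + rng.normal(0, 1, size=200),
        "result": rng.exponential(45, size=200),
    }
)

redundant = high_correlation_pairs(toy, threshold=0.99)
print(redundant)
```

Pairs returned by such a screen are candidates for removal or aggregation. Which member of a pair to keep, or how to combine them, remains a modeling decision that the correlation value alone does not settle.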
The scripted EDA (using pandas and seaborn) is fully reproducible and forms the basis for the subsequent feature engineering steps.
Table 1 Summary Statistics for Selected Numeric Features
| Feature | Mean | Std | Min | Max |
| --- | --- | --- | --- | --- |
| Service connections | 22,301 | 74,739 | 0 | 709,623 |
| Population TINWSYS | 96,607 | 385,320 | 0 | 3,856,043 |
| Result | 45.8 | 210.3 | 0 | 27,000 |
| Reporting Level | 100.0 | 80.5 | 1 | 500 |
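A summary of the kind reported in Table 1 can be produced directly with pandas. The sketch below is illustrative only: the helper `summary_table` and the three-row `toy` frame are hypothetical, and the source does not state whether the authors used `agg`, `describe`, or another route.

```python
import pandas as pd


def summary_table(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return mean, sample standard deviation, min, and max for each column."""
    # agg() yields one row per statistic; transpose so each feature is a row.
    stats = df[columns].agg(["mean", "std", "min", "max"]).T
    stats.columns = ["Mean", "Std", "Min", "Max"]
    stats.index.name = "Feature"
    return stats


# Tiny made-up frame (not the study's data) to show the output layout.
toy = pd.DataFrame(
    {
        "Result": [0.0, 2.0, 4.0],
        "Reporting Level": [1.0, 1.0, 4.0],
    }
)

table = summary_table(toy, ["Result", "Reporting Level"])
print(table.round(1))
```

Note that `std` in pandas is the sample standard deviation (`ddof=1`) by default, which matters when comparing against values computed with other tools.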