M.S. Applied Data Science - Capstone Chronicles 2025


evaluated using tract-level ACS summary tables, which report only aggregated probabilities.

Tract-level ACS summary tables were still retrieved using the tidycensus API, but they were used only for descriptive mapping and contextual framing, not for predictive modeling. The summary tables provide geographic patterns of poverty, disability prevalence, education, and employment across approximately 1,700 Illinois census tracts. However, because these tract-level estimates lack individual-level detail, they were unsuitable for constructing the interaction terms or person-specific poverty indicators required by the modeling framework.

Poverty classification followed ACS definitions in both datasets. For PUMS, poverty status was taken directly from the person-level poverty flag, which identifies individuals below the federal poverty threshold after adjusting for household size, alongside a quantitative variable expressing each individual's standing as a percentage of the poverty line. These person-level classifications served as the dependent variable for all modeling procedures, while tract-level ACS poverty rates were used only for comparison and geographic visualization.

4.2 Exploratory Data Analysis

Exploratory data analysis was conducted using dplyr, ggplot2, and corrplot to develop a detailed understanding of the person-level PUMS dataset prior to modeling. Summary statistics and distribution checks confirmed valid ranges for income, individualized poverty indicators, disability measures, and educational attainment. Univariate and multivariate visualizations were generated to assess variable behavior, identify potential outliers, and examine disparities across demographic groups, particularly between disabled and non-disabled adults.

Correlation matrices were used to detect redundant predictors and guide feature reduction. Because regression models are sensitive to multicollinearity, variance inflation factors (VIFs) were computed; most predictors fell below common thresholds (VIF < 5), though education and income-related measures showed moderate correlation. To reduce dependency between predictors, several features were converted to percentage- or ratio-based indicators rather than raw numeric counts.

Spatial patterns from tract-level ACS estimates were referenced only for contextual comparison, revealing clustering of disability and poverty in rural southern Illinois and post-industrial regions. These geographic patterns aligned with person-level PUMS disparities and reinforced the importance of modeling both demographic and structural factors.

4.2.1 Data Quality

Because this project relies primarily on ACS Public Use Microdata Sample (PUMS) person-level data for modeling, data quality verification focused on completeness, weighting reliability, and consistency across the individual-level records. PUMS provides raw, de-identified individual responses, so unlike summary ACS tables, it does not require tract-level GEOID alignment or cross-table reconciliation. Instead, quality checks ensured valid responses for key variables: more than 98 percent of PUMS records contained valid values for educational attainment, disability indicators, labor-force status, and income fields. The primary missingness occurred in POVPIP, where approximately 1.6 percent of records contained the ACS placeholder value -1; these cases were removed to avoid distortion when constructing poverty classifications.

Because PUMS supplies survey weights (the household weight WGTP and the person weight PWGTP) to generate population-representative estimates, additional checks confirmed the stability of weighted distributions across demographic and socioeconomic groups. Weighted poverty and disability rates from PUMS closely aligned with expected Illinois benchmarks, indicating no irregularities in the weighting structure. ACS tract-level tables were used only for geographic context and mapping, never for modeling, and their limitations did not affect PUMS-based analyses. These checks confirmed that the PUMS dataset was sufficiently complete, statistically reliable, and
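
As a minimal sketch of the data quality checks described in Section 4.2.1, the following Python snippet removes records carrying the POVPIP placeholder value and then computes person-weighted poverty rates overall and by disability status. It is illustrative only: the project itself worked with R packages, the six records are invented, the helper names are hypothetical, and treating POVPIP below 100 as "in poverty" is an assumption made here rather than the authors' stated rule.

```python
# Invented person-level records, for illustration only.
# POVPIP: income-to-poverty ratio in percent; -1 is the placeholder noted in the text.
# PWGTP:  person weight.
# DIS:    1 = with a disability, 2 = without a disability (ACS PUMS coding).
records = [
    {"POVPIP": 85,  "PWGTP": 20, "DIS": 1},
    {"POVPIP": 250, "PWGTP": 30, "DIS": 2},
    {"POVPIP": -1,  "PWGTP": 10, "DIS": 2},
    {"POVPIP": 40,  "PWGTP": 10, "DIS": 1},
    {"POVPIP": 501, "PWGTP": 40, "DIS": 2},
    {"POVPIP": 99,  "PWGTP": 10, "DIS": 2},
]

PLACEHOLDER = -1     # value marking an unusable POVPIP response
POVERTY_LINE = 100   # assumed cutoff: below 100 percent of the threshold


def share_missing(rows):
    """Unweighted share of records whose POVPIP is the placeholder."""
    return sum(r["POVPIP"] == PLACEHOLDER for r in rows) / len(rows)


def drop_placeholder(rows):
    """Keep only records with a usable POVPIP value."""
    return [r for r in rows if r["POVPIP"] != PLACEHOLDER]


def weighted_poverty_rate(rows):
    """Person-weighted share of records below the assumed poverty cutoff."""
    total = sum(r["PWGTP"] for r in rows)
    poor = sum(r["PWGTP"] for r in rows if r["POVPIP"] < POVERTY_LINE)
    return poor / total


missing_share = share_missing(records)
clean = drop_placeholder(records)
overall_rate = weighted_poverty_rate(clean)
rate_by_disability = {
    label: weighted_poverty_rate([r for r in clean if r["DIS"] == code])
    for label, code in (("with_disability", 1), ("without_disability", 2))
}

print(f"placeholder share: {missing_share:.3f}")
print(f"weighted poverty rate: {overall_rate:.3f}")
print(rate_by_disability)
```

Comparing weighted rates like these against published state benchmarks is one way to check that the weighting structure behaves as expected after the placeholder records are dropped.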
