M.S. Applied Data Science - Capstone Chronicles 2025


The correlation matrix shows that total personal income (PINCP) and wages (WAGP) are strongly positively correlated, while both variables exhibit moderate negative correlations with poverty status (POVPIP). Other relationships are weaker, suggesting that income and wages account for most of the variance in poverty-related outcomes.

4.2.2 Bias, Ethics, and Privacy Considerations

The ACS and PUMS data used in this study adhere to strict federal privacy protections, including data suppression, geographic masking, and the top-coding of income values to prevent re-identification. Because disability status is a protected attribute, both ACS summary tables and PUMS microdata report disability information in de-identified and confidentiality-compliant formats. Importantly, only PUMS person-level data were used for modeling, while ACS tract-level estimates were referenced solely for contextual comparison. To avoid ecological fallacies, all interpretations explicitly distinguish person-level PUMS findings from community-level patterns in ACS estimates. Sampling reliability was also considered, particularly in rural Illinois, where ACS tract estimates may exhibit higher variance due to smaller samples. Model diagnostics were used to monitor these effects and prevent over-generalization. These safeguards keep the analysis methodologically responsible, privacy-preserving, and aligned with federal statistical data-use standards.

4.2.3 System Architecture, Schema Design, and Data Flow Diagram

Figure 7
ACS – PUMS system architecture and data flow diagram.

This flowchart illustrates the streamlined data pipeline used in the project. ACS tract-level summary tables and PUMS person-level microdata feed into a unified processing stage, where data are cleaned, reshaped, and validated. The prepared data then move into modeling, which includes feature engineering and the training and testing of predictive models. This visual highlights how raw census inputs are transformed into analytical datasets for empirical evaluation.

4.3 Feature Engineering

Feature engineering focused on transforming raw PUMS person-level variables into interpretable measures that capture how education, disability, and employment jointly influence poverty outcomes. Individual poverty status was defined using the ACS poverty ratio variable (POVPIP), with individuals classified as below poverty when POVPIP was less than 100. Educational attainment was collapsed into ordered numeric categories ranging from 1 to 7. Employment variables were standardized to produce measures of labor-force participation, hours worked, and wage income. Finally, tract and PUMA (Public Use Microdata Area) level data were combined to create a continuous variable indicating the number of tracts within each PUMA. Because each PUMA must contain at least 100,000 people, PUMAs in rural areas require more census tracts to meet the population requirement. To evaluate whether disability alters the protective effect of education, interaction terms were created, including disability multiplied by bachelor's-degree attainment and disability multiplied by employment status. Continuous income-related variables were normalized using min-max scaling to place them on a common scale and improve comparability, and all estimates were
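The transformations described in this section can be illustrated with a minimal sketch. It is not the project's actual code: it assumes each person record has already been recoded into 0/1 indicators, and the field names `has_disability`, `has_bachelors`, and `employed`, the helper functions, and the example records are all hypothetical. Only POVPIP, WAGP, and the below-100 poverty threshold come from the text.

```python
def min_max_scale(values):
    """Linearly rescale a list of numbers onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def engineer_features(records):
    """Return copies of the input records with derived features added.

    Assumes hypothetical pre-recoded 0/1 fields: has_disability,
    has_bachelors, employed. POVPIP and WAGP follow the PUMS names
    used in the text.
    """
    wages_scaled = min_max_scale([r["WAGP"] for r in records])
    out = []
    for r, w in zip(records, wages_scaled):
        row = dict(r)
        # Below poverty when the income-to-poverty ratio is under 100.
        row["below_poverty"] = 1 if r["POVPIP"] < 100 else 0
        # Interaction terms: disability x bachelor's, disability x employment.
        row["dis_x_bachelors"] = r["has_disability"] * r["has_bachelors"]
        row["dis_x_employed"] = r["has_disability"] * r["employed"]
        row["WAGP_scaled"] = w
        out.append(row)
    return out


# Made-up records for illustration only; not drawn from PUMS.
people = [
    {"POVPIP": 85, "WAGP": 0, "has_disability": 1, "has_bachelors": 0, "employed": 0},
    {"POVPIP": 310, "WAGP": 40000, "has_disability": 1, "has_bachelors": 1, "employed": 1},
    {"POVPIP": 501, "WAGP": 80000, "has_disability": 0, "has_bachelors": 1, "employed": 1},
]
features = engineer_features(people)
```

Each interaction term equals 1 only when both of its component indicators equal 1, so in a regression its coefficient would capture how the association of education or employment with poverty differs for people with a disability.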
