M.S. Applied Data Science - Capstone Chronicles 2025


weighted using the PUMS weight fields (WGTP for housing units, PWGTP for persons) to ensure population-representative results. ACS tract-level estimates were used only to contextualize geographic patterns through their association with PUMA. These engineered PUMS features served as the sole foundation for prediction, ensuring alignment between feature construction and the person-level modeling strategy.

4.4 Modeling

The modeling strategy consisted of regression and classification methods, designed to capture varying relationships across Illinois. Classification models focused on predicting whether an individual fell above or below the poverty line, whereas regression models predicted where an individual's income falls as a percentage of the poverty line. Predictors included age, educational attainment level, type of employment, usual hours worked per week, source of health insurance, sex, veteran status, number of tracts within the PUMA, and difficulties caused by disability. All models were weighted using the person-weight field to account for population distributions.

4.4.1 Selection of Modeling Techniques

Classification methods were selected to identify individuals at greater risk of falling below the poverty line. For initial modeling, hyperparameter tuning was completed using 3-fold cross-validation on the following models: logistic regression, linear discriminant analysis, lasso regression, penalized logistic regression, nearest shrunken centroid, neural network, bagging, and random forest. From these models, the three with the highest ROC scores were chosen for further hyperparameter tuning using repeated 5-fold cross-validation. The tuned models were then applied to the test dataset.

Regression-based methods were selected because they provide interpretable coefficients suitable for understanding policy-relevant inequality patterns across Illinois.
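The classifier selection step described above (score each candidate by cross-validation, then keep the three with the highest ROC scores) can be sketched as follows. This is a minimal illustration in Python rather than the R Markdown workflow used for the analysis. The function names weighted_auc and top_k_models are hypothetical, and applying the person weights inside the ROC calculation is an assumption; the text does not state how the ROC score was computed.

```python
from statistics import mean


def weighted_auc(y_true, scores, weights):
    """Weighted area under the ROC curve via pairwise comparison.

    Equals the weighted share of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as half.
    Quadratic in the number of records, so suitable only for illustration.
    """
    pos = [(s, w) for y, s, w in zip(y_true, scores, weights) if y == 1]
    neg = [(s, w) for y, s, w in zip(y_true, scores, weights) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    concordant = 0.0
    for s_pos, w_pos in pos:
        for s_neg, w_neg in neg:
            if s_pos > s_neg:
                concordant += w_pos * w_neg
            elif s_pos == s_neg:
                concordant += 0.5 * w_pos * w_neg
    total = sum(w for _, w in pos) * sum(w for _, w in neg)
    return concordant / total


def top_k_models(fold_scores, k=3):
    """Return the k model names with the highest mean score across folds."""
    mean_scores = {name: mean(vals) for name, vals in fold_scores.items()}
    return sorted(mean_scores, key=mean_scores.get, reverse=True)[:k]
```

Here fold_scores would map each candidate model name to its list of per-fold ROC scores from the initial 3-fold cross-validation; the three names returned would then go on to the repeated 5-fold tuning stage.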
Four different linear regression models quantified continuous relationships between various combinations of educational attainment, disability, employment, geography, and insurance indicators, allowing the analysis to measure how each variable affects poverty outcomes. All modeling was conducted using PUMS microdata, which keeps the focus on individual people and their specific circumstances. Together, this combination of regression and classification modeling provides a comprehensive framework that captures statewide structural patterns while identifying localized deviations critical for regional policy interpretation.

4.4.4.1 Test Design (Training and Validation Datasets)

For the classification models, the analytical dataset was partitioned into training and validation subsets using an 80/20 split with a fixed seed to ensure full reproducibility. Because each PUMS record represents multiple people in the population, the person-weight column was used in all models. The classification models used educational attainment as a numeric input, whereas the regression models used it as a factor variable. This change was made to improve the interpretability of the regression models. Stratified random splits of the data were created based on the “poverty” versus “not poverty” outcome indicators. This ensured that individuals above and below the poverty line were proportionally represented in each subset. Model performance for linear regression was evaluated using RMSE, while the Receiver Operating Characteristic (ROC) score was used to compare competing classification models. All data preparation, sampling procedures, diagnostic checks, and model evaluations were implemented in R Markdown to ensure the workflow is fully transparent and replicable for future ACS data releases.

4.4.4.2 Summary of Modeling Findings

Across all modeling approaches, the results consistently showed that higher educational attainment substantially decreases the probability of an individual falling below the poverty line, while disability and related indicators increase it, confirming that disability weakens education’s protective effect. Rural areas were also more heavily affected by poverty.
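The test design in Section 4.4.4.1 (a seeded, stratified 80/20 split and RMSE evaluation) can be sketched as follows. This is a minimal illustration in Python rather than the R Markdown code used for the analysis. The names stratified_split and weighted_rmse and the seed value 42 are hypothetical, and weighting the RMSE by the person weight is an assumption; the text states only that the models themselves were weighted.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split record indices so each outcome class keeps its share.

    The seed value is a placeholder; the source says only that a
    fixed seed was used.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):  # fixed class order keeps runs repeatable
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def weighted_rmse(y_true, y_pred, weights):
    """Root mean squared error with each record weighted (e.g. by PWGTP)."""
    total_weight = sum(weights)
    squared_error = sum(
        w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights)
    )
    return math.sqrt(squared_error / total_weight)
```

Splitting within each outcome class, rather than over the whole dataset at once, is what keeps the proportion of “poverty” records the same in both subsets.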
Common variables of importance when predicting poverty status and percent of the poverty line included hours worked per week, having

