M.S. Applied Data Science - Capstone Chronicles 2025


Data Quality

During data preparation, only POIs located in San Diego County zip codes were retained to keep the analysis practical and geographically consistent. In July 2025, the dataset measured consumer spending at commercial POIs across the United States based on a panel of approximately 290 million customers’ debit and credit card transactions. In California, however, the panel included only approximately 12 million across about 2.8 million cards. Because geographic details were available only at the state level, we assessed representativeness by applying the formula provided in SafeGraph’s documentation for quantifying geographic bias using U.S. Census data (SafeGraph 2022), shown in equation 2:

\[ \text{bias}_{\text{state}} = \%_{\text{panel}} - \%_{\text{census}} \tag{2} \]

Here the first term is the state's share of the panel and the second is its share of the U.S. Census population. We found that the SafeGraph Spend dataset was under-indexed (i.e., negatively biased) for California by approximately 1.37% in July 2025 (see Figure 4). In other words, given the overall sample size, the panel represented California 1.37% less than would be expected from a perfectly random large-N sample. Based on this finding, we inferred that similar or greater bias likely existed in the San Diego region. This has implications for downstream graph analysis, as underrepresented zip codes could weaken edge weights between POIs in the area. Additionally, geographic sampling drift can create false communities in community analysis; we account for this in the final analysis. See Figure 8 for sample bias by state.
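The bias calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that bias is a state's share of the panel minus its share of the Census population, expressed in percentage points. The function name `geographic_bias`, the state labels, and all counts are hypothetical placeholders rather than SafeGraph or Census figures.

```python
def geographic_bias(panel_counts, census_pop):
    """Return {state: panel share - census share}, in percentage points.

    A negative value means the state is under-indexed in the panel.
    """
    panel_total = sum(panel_counts.values())
    census_total = sum(census_pop.values())
    return {
        state: 100.0 * panel_counts[state] / panel_total
        - 100.0 * census_pop[state] / census_total
        for state in panel_counts
    }


# Hypothetical toy inputs (NOT real SafeGraph or Census numbers).
panel = {"State A": 30, "State B": 70}   # panel observations per state
census = {"State A": 40, "State B": 60}  # census population per state

bias = geographic_bias(panel, census)
for state, value in bias.items():
    label = "under-indexed" if value < 0 else "over-indexed"
    print(f"{state}: {value:+.2f} percentage points ({label})")
```

Because both sets of shares sum to 100, the biases sum to zero across states, so under-indexing in one state is always offset by over-indexing elsewhere.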

