A new benchmark study shows that AI models trained on satellite imagery can predict neighbourhood-level crime, income, and health outcomes across U.S. cities — but their accuracy depends heavily on which city is being analysed and which indicator is being measured.
Published on arXiv in April 2025, the study evaluated three families of geospatial foundation models — AlphaEarth, Prithvi, and Clay — against 14 neighbourhood-level urban indicators spanning six U.S. metropolitan areas from 2020 to 2023. The researchers used a unified supervised-learning framework, testing performance across four settings: global, city-wise, year-wise, and city-year combinations. The goal was to determine whether compact satellite image representations, known as Earth embeddings, could serve as a low-cost substitute for traditional data collection methods such as censuses, surveys, and administrative records.
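The evaluation settings can be illustrated with a toy sketch. This is not the authors' code — the data, the ridge regressor, and all names here are stand-ins — but it shows how "global", "city-wise", and "year-wise" splits differ when embeddings are used as features for a supervised model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: 600 "tracts", 64-dim embeddings, plus city/year metadata.
X = rng.normal(size=(600, 64))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=600)   # one synthetic indicator
city = rng.choice(["A", "B", "C"], size=600)
year = rng.choice([2020, 2021, 2022, 2023], size=600)

def ridge_r2(train, test, lam=1.0):
    """Fit ridge regression on the train mask, return R^2 on the test mask."""
    Xtr, ytr = X[train], y[train]
    w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]), Xtr.T @ ytr)
    pred = X[test] @ w
    resid = ((y[test] - pred) ** 2).sum()
    total = ((y[test] - y[test].mean()) ** 2).sum()
    return 1.0 - resid / total

# "Global": random split pooling all cities and years.
split = rng.random(600) < 0.8
print("global R^2:", round(ridge_r2(split, ~split), 2))

# "City-wise": hold out one whole city to probe cross-city transfer.
print("city-wise R^2:", round(ridge_r2(city != "C", city == "C"), 2))

# "Year-wise": train on 2020-2022, test on 2023.
print("year-wise R^2:", round(ridge_r2(year != 2023, year == 2023), 2))
```

On real data the interesting question is precisely where these numbers diverge: a model can score well on the global split yet drop sharply when a whole city is held out.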
Why Traditional Urban Data Falls Short
Conventional urban monitoring is expensive, slow, and spatially uneven. Census data updates infrequently, survey coverage varies by region, and administrative records often lack the granularity needed for neighbourhood-scale analysis. These gaps matter for policymakers and urban planners trying to track progress on goals such as the United Nations Sustainable Development Goals, which require timely, localised data on health, inequality, and mobility.
Geospatial foundation models offer a potential alternative. By processing satellite imagery at scale, they can generate dense numerical representations — embeddings — that compress visual information about the built environment into a form machine learning models can use for downstream prediction tasks. The question this study addresses is whether those representations carry enough meaningful signal to be practically useful for urban monitoring.
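Concretely, an embedding is just a fixed-length vector per location. A minimal illustration with random toy vectors and hypothetical tract IDs (nothing here comes from the study's data) shows the basic object and one simple use, similarity search:

```python
import numpy as np

rng = np.random.default_rng(42)

# One 64-dim vector per location; IDs and values are made up for illustration.
embeddings = {f"tract_{i:03d}": rng.normal(size=64) for i in range(100)}

def most_similar(tract_id, k=3):
    """Return the k tracts whose embeddings have highest cosine similarity."""
    q = embeddings[tract_id]
    def cos(v):
        return float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
    scores = {t: cos(v) for t, v in embeddings.items() if t != tract_id}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(most_similar("tract_000"))
```

In the study's setup these vectors come from satellite imagery rather than a random generator, and the downstream task is regression against urban indicators rather than similarity lookup, but the data structure is the same.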
Earth embeddings capture substantial urban variation, with the highest predictive skill for outcomes more directly tied to built-environment structure.
What the Satellite Models Got Right
The results show that Earth embeddings perform best when the target indicator has a strong physical footprint. Chronic health burdens — such as rates of obesity or diabetes — and dominant commuting modes like car dependency proved highly predictable from imagery alone. This is intuitive: dense urban cores look different from car-oriented suburbs, and those visual differences correlate with measurable health and transport outcomes.
Income prediction also showed reasonable accuracy, consistent with prior research linking neighbourhood wealth to visible features such as building density, green space, and road infrastructure. Crime prediction yielded more mixed results, reflecting the complex social and contextual factors that satellite data cannot capture.
Where the Models Struggle
Indicators driven by fine-scale behaviour and local policy proved far harder to infer. Cycling rates, for instance, were difficult to predict across most cities — a finding the researchers attribute to cycling's dependence on local infrastructure decisions, cultural norms, and policy incentives that do not leave a clear visual imprint on the landscape.
The study also found marked variation in predictive performance across cities. A model that performs well in one metropolitan area may underperform significantly in another, even when predicting the same indicator. The researchers suggest this reflects differences in urban form — the way cities are physically laid out — interacting with each prediction task in distinct ways. A sprawling Sun Belt city and a dense Northeastern metro simply look different from space, and those structural differences affect how much information the embeddings carry.
By contrast, performance remained comparatively stable across the 2020–2023 time window, suggesting that whatever signal the models capture does not degrade quickly. This temporal robustness is an encouraging sign for monitoring applications that require consistent tracking over multiple years.
AlphaEarth Leads in Efficiency
Among the three model families, AlphaEarth stood out in controlled experiments focused on representation efficiency. When all three models were compressed to 64-dimensional embeddings — a practical constraint in many real-world deployment scenarios — AlphaEarth retained more predictive information than equivalent reductions of Prithvi or Clay. This matters because storage, compute, and integration costs scale with embedding size; a more compact representation that loses less signal has practical advantages.
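The paper does not specify the reduction method in the passage summarised here, but PCA is the standard way to compress embeddings to a fixed dimension, and a sketch makes the trade-off concrete. Note that retained variance is only a proxy: the study measures retained *predictive* information via downstream performance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend full-size embeddings: 500 locations x 256 dimensions (toy data).
X = rng.normal(size=(500, 256))

# PCA via SVD: centre the data, then project onto the top 64 components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X64 = Xc @ Vt[:64].T                      # compressed 64-dim embeddings

# Fraction of total variance the 64 components retain.
var_retained = float((S[:64] ** 2).sum() / (S ** 2).sum())
print(X64.shape, round(var_retained, 2))
```

Two models compressed to the same 64 dimensions can retain very different amounts of task-relevant signal, which is the comparison the study runs across AlphaEarth, Prithvi, and Clay.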
The researchers note that these benchmarks were conducted under a unified framework of their own design, which allows fair comparison across models but also means results are specific to this evaluation setup. All benchmark results in the study are self-reported by the research team and have not yet undergone peer review, as the paper is a preprint.
Scalable Monitoring, Real-World Caveats
The broader ambition of the study is to establish a reproducible benchmark for the field — a common baseline against which future geospatial models can be measured. The authors frame their findings explicitly in the context of SDG-aligned urban monitoring, positioning Earth embeddings as scalable, low-cost features that could supplement or, in some cases, replace slower and more expensive data collection methods.
That framing comes with important caveats. The study covers only U.S. cities, and it is not clear how well these models would generalise to urban environments with different architectural styles, infrastructure patterns, or data availability. The reliance on a supervised learning framework also means that ground-truth labels — the very census and survey data the approach aims to replace — are still required for training, at least initially.
Experts in urban informatics have previously noted that satellite-based predictions can embed existing biases present in training labels, potentially reproducing historical inequities in resource allocation if used uncritically in policy decisions. The study does not directly address this concern, though its emphasis on cross-city variability implicitly acknowledges that models trained in one context may not transfer cleanly to another.
What This Means
For urban planners, researchers, and policymakers, this study offers a credible starting point for using satellite AI to monitor neighbourhoods at scale — but it also sets a boundary: the physical environment is visible from space, while human behaviour and local policy remain largely invisible to it.