The most accurate AI model for estimating a person's apparent age still produces systematically worse results for Asian and African American individuals, according to a new paper published on arXiv in April 2025.
Apparent age estimation — predicting how old someone looks, rather than their biological age — is used in commercial settings such as targeted advertising, age-gating, and retail analytics. The field has advanced considerably in recent years, but the new review argues that accuracy gains have consistently outpaced fairness improvements, leaving demographic disparities largely unaddressed.
How the Study Was Conducted
The researchers re-examined the influential DEX (Deep EXpectation) method, a convolutional neural network approach that treats age estimation as a classification problem over age ranges. They applied two distribution-learning refinements on top of DEX: Mean-Variance Loss (MVL) and Adaptive Mean-Residue Loss (AMRL), evaluating both on the IMDB-WIKI, APPA-REAL, and FairFace datasets — three benchmarks that together provide a broad sweep of face imagery and demographic labelling.
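The distribution-learning idea behind these refinements can be sketched in a few lines. The snippet below is a simplified illustration, not the paper's code: it computes a DEX-style age estimate as the expectation over a softmax distribution across age classes, and a Mean-Variance-style penalty that rewards distributions centred on the true age with low spread. The 0–100 age grid and the `lam` weight are assumptions, and the published Mean-Variance Loss also includes a cross-entropy term omitted here.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over age-class logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def dex_expected_age(logits, ages=np.arange(101)):
    """DEX-style estimate: the expected value of the softmax
    distribution over discrete age classes."""
    p = softmax(logits)
    return float((p * ages).sum())

def mean_variance_loss(logits, true_age, ages=np.arange(101), lam=0.05):
    """Mean-Variance-style penalty (simplified): squared error of the
    distribution's mean plus a weighted penalty on its variance, so
    sharp distributions centred on the true age score lowest."""
    p = softmax(logits)
    mean = (p * ages).sum()
    var = (p * (ages - mean) ** 2).sum()
    return float(0.5 * (mean - true_age) ** 2 + lam * var)
```

A distribution sharply peaked at the correct age yields both an accurate expectation and a low penalty; a peak at the wrong age inflates the mean-error term.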
The team assessed models on both predictive accuracy and demographic fairness, using UMAP embeddings to visualise how the models cluster age representations, and saliency maps to reveal which facial regions each model focuses on when making predictions.
AMRL achieves state-of-the-art accuracy, yet trade-offs between precision and demographic equity persist — a finding the authors say the field can no longer ignore.
Accuracy Versus Fairness: A Persistent Trade-Off
AMRL delivered the stronger overall accuracy of the two techniques tested, outperforming MVL on standard benchmarks. However, the accuracy headline obscures a more troubling pattern. When the researchers broke down performance by demographic group, they found significant performance degradation for Asian and African American populations — meaning the model that scores best on aggregate metrics is also the one producing the least equitable outcomes across groups.
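How an aggregate metric can mask a per-group gap is easy to demonstrate. The sketch below (with hypothetical numbers, not figures from the paper) disaggregates mean absolute error by demographic group alongside the overall score:

```python
from collections import defaultdict

def mae_by_group(preds, labels, groups):
    """Mean absolute error overall and disaggregated by group —
    the aggregate figure alone can hide a badly served group."""
    errs = defaultdict(list)
    for p, y, g in zip(preds, labels, groups):
        errs[g].append(abs(p - y))
    per_group = {g: sum(e) / len(e) for g, e in errs.items()}
    overall = sum(abs(p - y) for p, y in zip(preds, labels)) / len(preds)
    return overall, per_group

overall, per_group = mae_by_group(
    preds=[30, 31, 40, 50],
    labels=[30, 30, 30, 40],
    groups=["A", "A", "B", "B"],
)
```

In this toy example the overall MAE is 5.25 years, yet group A sits at 0.5 years while group B sits at 10 — exactly the kind of divergence a single aggregate number conceals.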
The UMAP visualisations showed that age representations do cluster clearly, suggesting the models are learning meaningful age-related features at a structural level. But saliency maps — which highlight which pixels drive a model's decision — told a different story: the features the model focuses on shift inconsistently across demographic groups. In practical terms, the model is not using the same facial cues to estimate age for every group, which helps explain why error rates diverge so sharply.
Why Current Datasets Are Part of the Problem
The paper points to dataset composition as a root cause rather than a secondary concern. IMDB-WIKI, one of the most widely used benchmarks in the field, is heavily skewed toward Western, white faces drawn from celebrity and Wikipedia sources. A model trained primarily on such data learns age cues that generalise poorly to underrepresented populations — a problem that accuracy metrics averaged across the full dataset can mask entirely.
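The skew described here is straightforward to audit before training. A minimal sketch (the group labels below are hypothetical, not drawn from IMDB-WIKI's actual metadata):

```python
from collections import Counter

def group_shares(labels):
    """Fraction of the dataset each demographic group contributes —
    a quick composition audit before training or benchmarking."""
    counts = Counter(labels)
    total = len(labels)
    return {g: n / total for g, n in counts.items()}

shares = group_shares(["white"] * 8 + ["asian"] * 1 + ["black"] * 1)
```

A dataset where one group contributes 80% of examples will dominate what the model learns as "age cues", regardless of how the loss function is designed.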
The researchers argue that FairFace, which was specifically designed to balance demographic representation, exposes biases that IMDB-WIKI and APPA-REAL obscure. Using all three datasets together gives a more honest picture of where models actually fail.
This structural critique extends beyond dataset selection. The paper calls for stricter fairness validation protocols to be built into the standard evaluation pipeline for age estimation research — protocols that would require disaggregated performance reporting by demographic group before a model can claim state-of-the-art status.
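One way such a protocol could be operationalised is a simple gate over disaggregated scores. This is a hypothetical sketch, not a protocol from the paper, and the two-year threshold is an arbitrary placeholder:

```python
def fairness_gate(per_group_mae, max_gap_years=2.0):
    """Pass only if the worst-group MAE is within `max_gap_years` of
    the best-group MAE; returns the verdict and the observed gap."""
    worst = max(per_group_mae.values())
    best = min(per_group_mae.values())
    gap = worst - best
    return gap <= max_gap_years, gap

ok, gap = fairness_gate({"group_a": 3.1, "group_b": 3.8})
bad, big_gap = fairness_gate({"group_a": 3.1, "group_b": 7.0})
```

Under a rule like this, a model could not claim state-of-the-art status on aggregate accuracy alone: the disaggregated report would have to clear the gate first.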
Technical Fixes Are Not Enough
Perhaps the paper's most pointed conclusion is that engineering improvements alone cannot close the fairness gap. More sophisticated loss functions, better architectures, and larger training sets all help with aggregate accuracy — but they do not automatically produce equitable results across demographic groups.
The authors advocate for two parallel remedies. First, the collection and use of localised and diverse datasets that capture age presentation across different ethnicities, skin tones, and geographic contexts. Second, the adoption of explicit fairness validation as a non-negotiable step in model development and publication, rather than an optional addendum.
This position aligns with a broader current in AI fairness research, which has moved from arguing that bias exists to arguing about what systematic obligations follow from that fact. For apparent age estimation specifically, the commercial stakes are real: a model deployed in retail age-gating that performs significantly worse for certain demographic groups creates both legal exposure and direct harm to users.
What This Means
For developers and organisations deploying age estimation systems, this research is a direct signal that benchmark accuracy scores are insufficient due diligence — disaggregated fairness testing across demographic groups must become standard practice before any such system goes into production.