A new study has found that large language models systematically misattribute or withhold credit for quotes depending on the race and gender of the original author, with researchers introducing AttriBench — the first benchmark dataset explicitly designed to measure demographic bias in quote attribution.
As AI-powered search and information retrieval tools become more embedded in how people access knowledge, the question of who gets credited for ideas carries real consequences. The paper, posted to arXiv (cs.AI), evaluated 11 widely used LLMs across multiple prompt configurations, finding that accurate quote attribution remains a challenge even for the most capable frontier models — and that the failures are not randomly distributed.
AttriBench: Building a Fairer Test
Prior attribution benchmarks have not controlled for the fame or demographic profile of quoted authors, making it difficult to distinguish a model's general knowledge gaps from its biases. AttriBench addresses this directly by explicitly balancing authors across race, gender, and intersectional groups, as well as by fame level — ensuring that any disparities in performance reflect bias rather than simply the relative obscurity of certain figures.
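The balancing idea can be sketched in a few lines: draw an equal number of authors from every combination of demographic group and fame level, so no cell is over- or under-represented. This is a minimal illustration, not AttriBench's actual construction; the record fields (`race`, `gender`, `fame`) are hypothetical placeholders for whatever schema the dataset uses.

```python
import random
from collections import defaultdict

def balanced_sample(authors, per_cell, seed=0):
    """Sample an equal number of authors from every (race, gender, fame) cell.

    `authors` is a list of dicts with hypothetical keys 'race', 'gender',
    and 'fame' — placeholders, not the paper's actual schema.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for author in authors:
        cells[(author["race"], author["gender"], author["fame"])].append(author)
    sample = []
    for cell, members in sorted(cells.items()):
        if len(members) < per_cell:
            raise ValueError(f"cell {cell} has only {len(members)} authors")
        sample.extend(rng.sample(members, per_cell))
    return sample
```

With a design like this, a gap in accuracy between cells cannot be explained away by one group simply having fewer, or more obscure, representatives in the test set.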
The dataset focuses on quote attribution: given a quote, can a model correctly identify who said it? This task sits at the intersection of factual recall and representational fairness, making it a practical proxy for how equitably LLMs represent different groups of people.
A Failure Mode Hidden from Standard Metrics
Beyond measuring accuracy, the researchers identified and named a distinct failure mode they call suppression — instances where a model declines to provide any attribution at all, even when it demonstrably has access to information about the author. This is categorically different from producing a wrong answer and would not be flagged as an error under conventional accuracy scoring.
The study found suppression to be widespread across models, and crucially, unevenly distributed across demographic groups. Certain racial and gender groups were significantly more likely to have their authorship omitted entirely. This means that models assessed as performing adequately by standard benchmarks may still be systematically erasing the contributions of specific communities.
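One way to keep suppression from hiding inside ordinary error rates is to score each response into three outcomes — correct, wrong, or suppressed — and report accuracy and suppression separately per group. The sketch below is an assumption-laden illustration: the refusal markers and record format are invented for the example, and the paper's actual scoring rules may differ.

```python
from collections import defaultdict

# Hypothetical refusal phrases; a real evaluation would need a more
# robust refusal detector than substring matching.
REFUSAL_MARKERS = ("i don't know", "cannot attribute", "not sure", "unable to")

def score_response(response, gold_author):
    """Classify one model response as 'correct', 'wrong', or 'suppressed'."""
    text = response.strip().lower()
    if not text or any(marker in text for marker in REFUSAL_MARKERS):
        return "suppressed"
    return "correct" if gold_author.lower() in text else "wrong"

def group_rates(records):
    """Aggregate per-group rates from (group, response, gold_author) tuples.

    Suppression is reported as its own rate rather than being folded into
    the error rate — the distinction the paper argues standard accuracy
    scoring misses.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for group, response, gold in records:
        counts[group][score_response(response, gold)] += 1
        counts[group]["total"] += 1
    return {
        group: {
            "accuracy": c["correct"] / c["total"],
            "suppression": c["suppressed"] / c["total"],
        }
        for group, c in counts.items()
    }
```

Two groups with identical accuracy can still have very different suppression rates, which is exactly the disparity a single accuracy number conceals.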
The researchers do not claim a definitive cause for suppression, but the pattern suggests that training data representation, reinforcement learning feedback, and content moderation heuristics could all be contributing factors — an area the paper identifies for further investigation.
What 11 Models Got Wrong
The evaluation covered a broad range of LLMs, though the paper does not single out specific models by name in the abstract. Across all tested systems, attribution accuracy varied significantly by race, gender, and intersectional identity: quotes from, for example, women of colour were attributed correctly less often than either gender or race alone would predict.
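To make "intersectional" concrete: accuracy can be computed for each marginal group and for each intersection of groups, and an intersectional cell can score below both of its marginal rates. The sketch below uses made-up record fields and synthetic outcomes, not the paper's data.

```python
from collections import defaultdict

def accuracy_by(records, key):
    """Compute attribution accuracy per group.

    `records` is a list of dicts with hypothetical fields 'race', 'gender',
    and a boolean 'correct'; `key` maps a record to its group label.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        group = key(record)
        hits[group] += record["correct"]
        totals[group] += 1
    return {group: hits[group] / totals[group] for group in totals}
```

Calling this with `key=lambda r: r["gender"]`, `key=lambda r: r["race"]`, and `key=lambda r: (r["race"], r["gender"])` yields marginal and intersectional accuracy tables from the same records, making any compounding effect directly visible.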
Frontier models — the most capable and widely deployed systems — were not exempt from these disparities. While they generally outperformed smaller models on raw accuracy, they still exhibited the same structural biases, including suppression. This is a meaningful finding because frontier models are the systems most likely to be integrated into consumer-facing search products, AI assistants, and research tools.
All benchmark results described are based on the researchers' own evaluations, as reported in the paper.
Why Quote Attribution Matters Beyond Citations
The practical stakes are higher than they might first appear. When an AI assistant summarises a topic or answers a question by drawing on existing writing, the decision of whether and how to credit a source is not merely academic. It shapes which voices are perceived as authoritative, which thinkers are introduced to new audiences, and ultimately whose intellectual contributions are preserved in the AI-mediated information ecosystem.
For journalists, researchers, students, and anyone using AI tools to gather information, a model that systematically fails to attribute quotes from women or people of colour is not just making factual errors — it is reproducing and potentially amplifying existing representational inequalities at scale.
The researchers explicitly frame AttriBench as a tool for ongoing evaluation, positioning quote attribution as a benchmark for representational fairness rather than a one-time measurement. By making the dataset available, they aim to enable other researchers and model developers to test their systems against the same controlled conditions.
What This Means
AI developers building search and retrieval products now have a concrete, controlled tool to audit whether their models credit authors equitably, along with evidence that skipping that audit means deploying systems that quietly erase the contributions of already underrepresented groups.