Safety-aligned AI models that appear unbiased in standard tests still reproduce harmful stereotypes when evaluated with different task formats, according to new research published on arXiv, suggesting that current alignment practices conceal representational harm rather than correct it.
The study, titled "Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignment", audited 7 commercial and open-weight large language models using roughly 45,000 prompts across a structured set of evaluation tasks. Its central finding challenges a widespread assumption in AI safety: that a model performing well on bias benchmarks is, in fact, less biased.
The Gap Between What Models Say and What They Imply
The researchers designed a hierarchical taxonomy covering 9 bias types, including frequently overlooked categories such as caste bias, linguistic bias, and geographic bias. They then tested models across 7 different task formats, ranging from explicit decision-making — such as choosing between candidates from different social groups — to implicit association tasks, where models complete sentences or fill in blanks.
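To make the distinction between task formats concrete, here is a minimal Python sketch of how paired explicit and implicit prompts could be generated for the same identity groups. The group names, prompt templates, and task labels are illustrative placeholders rather than the paper's actual materials.

    # Illustrative sketch of the paired-task idea: the same identity groups are
    # probed with an explicit decision prompt and an implicit fill-in-the-blank
    # prompt. Group names and templates are placeholders, not the paper's prompts.
    from itertools import combinations

    GROUPS = ["group_a", "group_b"]  # stand-ins for identity groups on one bias axis

    EXPLICIT_TEMPLATE = (
        "Two equally qualified candidates apply for a leadership role. "
        "One is from {g1}, the other from {g2}. Who should be hired?"
    )
    IMPLICIT_TEMPLATE = "People from {g} are known for being ___."

    def build_prompts():
        prompts = []
        for g1, g2 in combinations(GROUPS, 2):
            prompts.append({"task": "explicit_choice",
                            "prompt": EXPLICIT_TEMPLATE.format(g1=g1, g2=g2)})
        for g in GROUPS:
            prompts.append({"task": "implicit_completion",
                            "prompt": IMPLICIT_TEMPLATE.format(g=g)})
        return prompts

    if __name__ == "__main__":
        for p in build_prompts():
            print(f"[{p['task']}] {p['prompt']}")

In practice, each prompt would be sent to the model under audit and the responses scored; the point of the pairing is that the same model answers both formats, so any gap between them can be attributed to the task rather than the group.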
The results showed a clear and consistent pattern: models refused to make discriminatory choices when asked directly but defaulted to stereotypical associations in implicit formats. In one illustrative example from the paper, a model that declined to choose between castes for a leadership role would, in a fill-in-the-blank task, reliably associate upper castes with purity and lower castes with poor hygiene.
The researchers found Stereotype Score divergences of up to 0.43 between task types for the same model and identity group, a gap that standard single-benchmark audits would never detect. According to the researchers, this divergence appeared consistently across models and demographic groups.
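The article does not reproduce the paper's exact Stereotype Score formula, so the sketch below illustrates the task-dependence check under a simple assumption: the score is the fraction of a model's responses that match the stereotyped completion for a given group, and the divergence is the spread of that score across task formats. The toy data is constructed purely to demonstrate the calculation, not drawn from the study.

    # Sketch of a per-task Stereotype Score and its divergence, assuming the
    # score is the fraction of stereotype-consistent responses. The toy data
    # below is illustrative only.
    from collections import defaultdict

    def stereotype_score(flags):
        """Fraction of responses labelled stereotype-consistent (True)."""
        return sum(flags) / len(flags) if flags else 0.0

    def task_divergence(results):
        """results: list of (task_type, is_stereotyped) pairs for one model and group."""
        by_task = defaultdict(list)
        for task, flag in results:
            by_task[task].append(flag)
        scores = {task: stereotype_score(flags) for task, flags in by_task.items()}
        return scores, max(scores.values()) - min(scores.values())

    if __name__ == "__main__":
        # Toy data: refusals in the explicit task, stereotyped completions implicitly.
        toy = ([("explicit_choice", False)] * 50
               + [("implicit_completion", True)] * 43
               + [("implicit_completion", False)] * 57)
        scores, divergence = task_divergence(toy)
        print(scores)                # {'explicit_choice': 0.0, 'implicit_completion': 0.43}
        print(round(divergence, 2))  # 0.43

A benchmark that only measures the explicit task would report a score of 0.0 for this toy model and miss the implicit gap entirely, which is the failure mode the paper describes.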
Alignment That Protects Some Groups More Than Others
The study identifies a second structural problem: safety alignment is asymmetric in its protections. Models consistently refused to assign negative traits to marginalised groups — a behaviour that looks like fairness on the surface. But those same models freely associated positive traits with privileged groups, reinforcing hierarchies through affirmation rather than denigration.
This asymmetry means that standard refusal-based safety checks capture only half the picture. A model can pass a harmful-output test while still encoding the same underlying social hierarchies — just expressed differently.
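A minimal sketch of how that asymmetry could be quantified, assuming each logged response has been tagged with the target group, the trait's valence, and whether the model affirmed or refused the association. The field names and toy numbers are illustrative, not figures from the paper.

    # Sketch of an asymmetry check: compare how often the model affirms
    # negative traits for marginalised groups versus positive traits for
    # privileged groups. Field names and toy counts are illustrative.
    def asymmetry_report(records):
        """records: iterable of dicts with keys 'group', 'valence', 'affirmed'."""
        def rate(pred):
            hits = [r["affirmed"] for r in records if pred(r)]
            return sum(hits) / len(hits) if hits else float("nan")

        return {
            "negative traits -> marginalised groups, affirmed":
                rate(lambda r: r["group"] == "marginalised" and r["valence"] == "negative"),
            "positive traits -> privileged groups, affirmed":
                rate(lambda r: r["group"] == "privileged" and r["valence"] == "positive"),
        }

    if __name__ == "__main__":
        toy = (
            [{"group": "marginalised", "valence": "negative", "affirmed": False}] * 95
            + [{"group": "marginalised", "valence": "negative", "affirmed": True}] * 5
            + [{"group": "privileged", "valence": "positive", "affirmed": True}] * 80
            + [{"group": "privileged", "valence": "positive", "affirmed": False}] * 20
        )
        print(asymmetry_report(toy))
        # A low first rate alongside a high second rate is the asymmetric pattern
        # described above: refusal shields one side while affirmation elevates the other.

A refusal-based safety check would only count the first rate as a success; the second rate is where the hierarchy-through-affirmation pattern shows up.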
The third finding may be the most structurally significant: understudied bias axes showed the strongest stereotyping across all models tested. Caste, linguistic origin, and geographic identity produced more pronounced bias than the demographic categories — such as gender and race — that dominate existing benchmarks. The researchers argue this pattern is not coincidental. Alignment effort tracks benchmark coverage rather than actual harm severity. Models are trained to perform well on the tests they are evaluated against, leaving understudied dimensions of bias largely untouched.
Why Single-Benchmark Audits Are Structurally Inadequate
The paper's methodological critique has direct implications for how AI companies and regulators assess model safety. Current industry practice frequently relies on fixed benchmark suites to certify that a model meets fairness standards before deployment. The research argues that this approach is not merely incomplete — it is systematically misleading.
Because bias expression is task-dependent, a model's score on any single benchmark reflects only one slice of its behaviour. A model optimised to perform well on explicit-choice bias tests may do so precisely by shifting stereotypical associations into implicit outputs that the benchmark never measures. The result is a model that appears safer without becoming safer.
This is not a hypothetical concern. The 0.43 divergence the researchers observed between task types represents a substantial difference — not a marginal statistical artefact. Across the 7 models tested, the pattern held consistently, suggesting it reflects something structural about how alignment training operates rather than an idiosyncrasy of any single model.
A Vocabulary Problem With Real Consequences
The inclusion of caste, linguistic, and geographic bias in the taxonomy is itself a contribution worth noting. Most bias research — and most alignment training — concentrates on gender, race, and occasionally religion. This focus reflects the demographics of AI research institutions and the datasets used to construct existing benchmarks, not a principled assessment of where harm is greatest.
For the 1.4 billion people affected by caste-based discrimination globally, or for speakers of lower-prestige language varieties who interact with AI systems, this gap has practical consequences. If alignment training ignores these dimensions, the systems deployed into those communities will carry unchecked biases regardless of how well they perform on standard audits.
The researchers do not claim to have solved the problem. But by operationalising a broader taxonomy and demonstrating consistent task-dependence across multiple models and bias axes, they provide both an empirical case and a methodological template for more rigorous evaluation.
What This Means
For anyone deploying or regulating AI systems, this research makes one thing clear: a model that passes a bias benchmark has not been shown to be unbiased — it has been shown to be unbiased on that benchmark, under those conditions, using those task formats. Meaningful alignment requires evaluation methods as varied as the contexts in which these systems are actually used.