Five large language models were tested on their sense of humor using nearly 10,000 rounds of Cards Against Humanity, and the results suggest AI systems may share a manufactured comic sensibility that diverges significantly from what humans actually find funny.

The study, published on arXiv in April 2025, is one of the first systematic attempts to benchmark humor as a dimension of LLM alignment — the field concerned with making AI systems behave in ways that reflect genuine human values and preferences. Researchers had each model select the funniest card from a slate of ten candidates across 9,894 rounds, mirroring the gameplay conditions human players experienced.
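
In outline, the evaluation reduces to a loop over recorded rounds: show the model the prompt card and its ten candidates, record the pick, and compare it against the card human players voted funniest. Here is a minimal sketch in Python, assuming a hypothetical `pick_funniest` wrapper around a model API; the paper's exact prompting setup is not detailed here.

```python
def evaluate_model(rounds, pick_funniest):
    """Score a model's card picks against human votes.

    `rounds`: list of dicts, each holding the prompt (black card), a
    ten-card candidate slate, and the card humans voted funniest.
    `pick_funniest(prompt, candidates)`: hypothetical model wrapper
    returning the index of the model's chosen card.
    """
    hits = 0
    for rnd in rounds:
        idx = pick_funniest(rnd["prompt"], rnd["candidates"])
        if rnd["candidates"][idx] == rnd["human_winner"]:
            hits += 1
    # With ten candidates per slate, uniform random guessing has an
    # expected accuracy of 1/10 = 0.10; the study found every model
    # above that baseline, but only modestly.
    return hits / len(rounds)
```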

Models Perform Above Random Baseline

Every model tested performed above the random baseline, meaning they weren't simply guessing. But the margin of improvement was modest, and none of the models demonstrated strong alignment with what human players found funniest. The gap between AI and human humor preferences was consistent enough across models to suggest a systemic issue rather than a weakness in any individual model.

More revealing was the pattern of agreement among the models themselves. When researchers compared model choices against each other, the AI systems converged on the same answers at a substantially higher rate than any single model converged with human players.

Models agree with each other substantially more often than they agree with humans — raising the question of whether LLM humor judgment reflects genuine preference or structural artifacts of inference and alignment.

This inter-model consensus is significant because it implies the models may be picking up on the same underlying signals — signals that don't map cleanly onto human comedic taste.
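
One way to quantify that convergence is a pairwise agreement rate: the fraction of rounds on which two judges pick the same card. The sketch below computes model-to-model and model-to-human agreement; the data structures are illustrative, not taken from the paper.

```python
from itertools import combinations

def agreement_rate(picks_a, picks_b):
    """Fraction of rounds on which two judges chose the same card."""
    assert len(picks_a) == len(picks_b)
    same = sum(a == b for a, b in zip(picks_a, picks_b))
    return same / len(picks_a)

def agreement_matrix(model_picks, human_picks):
    """Compare every model against every other model and against humans.

    `model_picks` maps a model name to its list of chosen card indices,
    one per round; `human_picks` lists the human winners in the same order.
    """
    report = {}
    for m1, m2 in combinations(model_picks, 2):
        report[(m1, m2)] = agreement_rate(model_picks[m1], model_picks[m2])
    for m in model_picks:
        report[(m, "human")] = agreement_rate(model_picks[m], human_picks)
    return report
```

If the model-to-model entries sit well above the model-to-human entries, as the study reports, the models are converging on shared signals rather than on human taste.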

Position Bias and Content Preferences May Be Driving Choices

The researchers identified two likely culprits for the divergence. The first is position bias — the tendency of language models to favor options presented earlier or later in a list, regardless of content. When choosing from ten candidate cards, models may systematically prefer certain positions in the slate, an artifact of how attention mechanisms process sequential input.
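
A straightforward way to check for position bias is to tally which slot each model picks across all rounds and test the counts against a uniform distribution. The sketch below uses a chi-square goodness-of-fit test; the paper may use a different diagnostic.

```python
from collections import Counter
from scipy.stats import chisquare

def position_bias_test(chosen_positions, slate_size=10):
    """Test whether a model's chosen slot indices deviate from uniform.

    `chosen_positions`: the slot index (0-9) the model picked in each
    round. Absent position bias, every slot should be chosen roughly
    equally often; a tiny p-value flags a systematic positional skew.
    """
    counts = Counter(chosen_positions)
    observed = [counts.get(i, 0) for i in range(slate_size)]
    stat, p_value = chisquare(observed)  # expected counts default to uniform
    return observed, stat, p_value
```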

The second is content preference — models appear to gravitate toward particular themes or phrasings that may reflect the distribution of their training data or the filtering applied during alignment fine-tuning. Cards Against Humanity is a game built around dark, transgressive, and absurdist humor. If models were trained or fine-tuned to avoid or downweight certain content categories, their card selections could reflect those constraints rather than a genuine read of what's funniest.

The two explanations are not mutually exclusive, and the study suggests both may be operating simultaneously.
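
One way to disentangle the two effects, though not necessarily the study's method, is to re-present the same slate several times in shuffled order and see what stays stable: the chosen card text, the chosen slot, or both.

```python
import random
from collections import Counter

def probe_round(prompt, candidates, pick_funniest, trials=10, seed=0):
    """Shuffle one slate repeatedly to separate position from content.

    `pick_funniest` is the same hypothetical model wrapper as above.
    Returns how often the most common card text and the most common
    slot index recur across shuffled presentations of the same slate.
    """
    rng = random.Random(seed)
    texts, slots = [], []
    for _ in range(trials):
        slate = candidates[:]
        rng.shuffle(slate)
        idx = pick_funniest(prompt, slate)
        texts.append(slate[idx])  # which card was chosen
        slots.append(idx)         # which position was chosen
    text_consistency = Counter(texts).most_common(1)[0][1] / trials
    slot_consistency = Counter(slots).most_common(1)[0][1] / trials
    return text_consistency, slot_consistency
```

High text consistency with low slot consistency points to content preference; the reverse points to position bias. Both can be elevated at once, consistent with the study's suggestion that the two effects operate simultaneously.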

Why Cards Against Humanity Is a Useful Benchmark

Cards Against Humanity works as a research tool here for a specific reason: the game has a clear, human-validated signal. In actual gameplay, human players vote on which card is funniest, providing a ground-truth preference that researchers can compare against model selections. The game also spans a wide range of humor styles — from wordplay to shock value to cultural reference — giving the benchmark broader coverage than a narrow joke dataset might.

The study's framing as an alignment problem is deliberate. Humor is described by the researchers as "one of the most culturally embedded and socially significant dimensions of human communication." If AI systems are systematically miscalibrated on humor — defaulting to choices that reflect structural biases rather than human sensibility — it raises questions about alignment evaluation more broadly. Most alignment benchmarks focus on factual accuracy, helpfulness, or safety. Humor has received comparatively little attention despite being central to how humans communicate, build rapport, and navigate social situations.

All benchmarks and performance figures cited in this article are self-reported by the researchers in the arXiv preprint, which has not yet undergone formal peer review.

The Deeper Problem: Distinguishing Preference From Artifact

The study's most pointed question is whether model outputs in humor tasks reflect anything like genuine preference at all. When a model selects a card, it is performing a next-token prediction task shaped by training data, reinforcement learning from human feedback, and safety filtering. Any of these stages could introduce systematic distortions that look like preferences but are actually artifacts.

This matters beyond humor specifically. If researchers cannot reliably distinguish genuine model preference from structural artifacts in a relatively controlled setting like a card game, the same uncertainty applies to higher-stakes alignment evaluations. A model that appears to prefer safe, helpful responses may be doing so for structural reasons that don't generalize reliably to novel situations.

The five models tested are described as "frontier" models — meaning they represent current state-of-the-art systems — though the paper does not name them explicitly in the abstract. The use of multiple models strengthens the study's conclusions: the convergence pattern holds across different architectures and training pipelines, making it harder to attribute to any single model's quirks.

What This Means

AI systems that align more closely with each other than with humans on humor suggest that current alignment methods may be producing a kind of artificial consensus: one that passes surface-level tests but diverges from the messy, culturally specific reality of what people actually find funny. If that pattern holds, the same concern extends to other subjective human preferences as well.