Language models trained to know that "Marie Curie won the Nobel Prize" often cannot answer "who won the Nobel Prize?" if the phrasing reverses the direction of the original training data — and new research suggests that the fixes proposed for this problem may not work the way researchers hoped.

The phenomenon, known as the reversal curse, has been a recognised weakness of autoregressive language models — the family of models that includes most modern large language models — since it was first formally described. These models are trained to predict the next token in a sequence, which means they learn relationships in one direction. Ask them to reverse that relationship, and they frequently fail. A paper posted to arXiv in April 2025 by researchers in computational linguistics now extends the analysis of proposed solutions and finds that the apparent fixes may be papering over a deeper architectural problem.

Why the Reversal Curse Matters More Than It Sounds

The reversal curse is not merely a quirk. It strikes at the heart of what it means for a model to "know" something. If a system has genuinely learned a fact — that A is related to B — it should be able to retrieve that relationship regardless of which end of the chain the question starts from.

The paper's central finding is blunt: objective-level fixes can improve reversal behaviour without necessarily inducing the kind of latent generalisation one might expect from a unified concept.

The proposed remedies have focused on changing how models are trained. Bidirectional attention — used in models like BERT — allows every token to attend to every other token, rather than only to preceding ones. Masking-based reconstruction, applied even to decoder-only architectures, forces the model to predict hidden tokens within a sequence rather than always predicting what comes next. Both approaches have shown measurable improvements on reversal benchmarks. The new research accepts those gains but asks a harder question: why do they work, and what does that tell us about what the model has actually learned?
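The difference between the two objectives comes down to which tokens ever become prediction targets. The toy sketch below (our illustration in plain Python, not code from the paper) enumerates the training examples each objective would extract from a single sentence:

```python
tokens = ["Marie", "Curie", "won", "the", "Nobel", "Prize"]

def causal_targets(tokens):
    # Next-token objective: each token (after the first) is predicted
    # from its left context only.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_targets(tokens, mask_positions):
    # Masking-based reconstruction: a hidden position is predicted from
    # its full bidirectional context, so any token can become a target.
    examples = []
    for pos in mask_positions:
        context = tokens[:pos] + ["[MASK]"] + tokens[pos + 1:]
        examples.append((context, tokens[pos]))
    return examples

# Under the causal objective, "Marie" is never a prediction target in
# this sentence; under masking, it can be.
print([t for _, t in causal_targets(tokens)])
print([t for _, t in masked_targets(tokens, [0])])
```

The first print shows that the sentence's opening token never appears as a target under next-token training, which is exactly the gap the masking-based remedies close.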

Separating Genuine Understanding from Stored Directions

The researchers tested a vanilla masked language modelling (MLM) objective — the training method used in BERT-style models — alongside decoder-only masking approaches, across four reversal benchmarks. The accuracy results were encouraging for bidirectional and masking-based methods. But the mechanistic analysis told a different story.

Using representation distance measurements and linear probes — tools that let researchers examine the internal geometry of a model's learned representations — the team found little evidence that successful models had formed a single, direction-agnostic mental representation of a fact. Instead, the internal structure was consistent with the model storing the forward version of a fact and the reverse version as two distinct entries, indexed differently depending on the training objective used.
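A linear probe is simply a lightweight classifier fitted to a model's frozen hidden states to test whether some property is linearly decodable from them. The minimal sketch below uses synthetic vectors standing in for real activations (the cluster means, dimensionality, and sample counts are invented for illustration); the logic mirrors the general technique, not the paper's exact setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for hidden-state vectors: one cluster for
# forward-phrased facts, one for reverse-phrased facts. If a model had
# one unified, direction-agnostic representation, a probe should NOT be
# able to separate the two directions this cleanly.
d = 32
forward = rng.normal(loc=+1.0, scale=1.0, size=(200, d))
reverse = rng.normal(loc=-1.0, scale=1.0, size=(200, d))
X = np.vstack([forward, reverse])
y = np.array([0] * 200 + [1] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"probe accuracy: {probe.score(X, y):.2f}")
```

High probe accuracy here means the direction of a fact is readable from the geometry of the representation — the signature of two separately indexed entries rather than one shared concept.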

In plain terms: the model isn't learning that Marie Curie and the Nobel Prize are connected. It is learning "Marie Curie → Nobel Prize" and, separately, "Nobel Prize → Marie Curie" as two different memory items. The training objective determines how those items are filed, not whether they are unified.
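That filing behaviour can be caricatured as a key-value store in which each fact is indexed only by the entity used as the prompt. This is our toy analogy, not the paper's model of memory:

```python
# Direction-specific fact storage: each entry is keyed by the prompt-side
# entity, so the reverse direction is a lookup miss until it is stored
# as its own, separate entry.
memory = {}

def learn(prompt_entity, answer_entity):
    memory[prompt_entity] = answer_entity

def recall(prompt_entity):
    return memory.get(prompt_entity)  # None if never trained in this direction

learn("Marie Curie", "Nobel Prize")
print(recall("Marie Curie"))   # forward query succeeds
print(recall("Nobel Prize"))   # reverse query misses: the reversal curse

# A masking-style fix adds a second, separate entry; it does not merge
# the two into one undirected relation.
learn("Nobel Prize", "Marie Curie")
print(recall("Nobel Prize"))
```

The analogy is deliberately crude, but it captures the study's claim: the training objective decides how entries are keyed, not whether the two directions are unified.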

What the Mechanistic Study Reveals

The paper's mechanistic findings are specific: reversal accuracy requires that the source entity of the original fact, which becomes the answer once the query is reversed, have been an explicit prediction target during training. If a model was never trained to predict "Marie Curie" as an output, it struggles to retrieve facts when Marie Curie is the answer rather than the prompt.

This explains why bidirectional and masking-based methods help. They force the model to predict entities from both ends of a relationship during training. But it also explains why this is not the same as genuine conceptual understanding. The model is not developing a richer internal concept of Marie Curie that naturally connects to all her attributes. It is filing additional lookup entries.

MLM and decoder-only masking-based training also produced measurably different indexing geometry in the learned representations, meaning the two approaches solve the reversal problem through different internal mechanisms even when their benchmark scores look similar from the outside. This has practical implications for interpretability research, which often assumes that similar behaviours reflect similar internal processes.

Benchmarks and Their Limits

It is worth noting that all benchmark results in this study are reported by the researchers themselves and have not yet undergone peer review, as the paper is a preprint. The four reversal benchmarks used are standard in the field, which lends the methodology credibility, but independent replication will be important before the findings can be considered settled.

The study also deliberately scopes itself as a "minimal mechanistic study" — the researchers are not claiming a complete theory of how language models store knowledge. Rather, they are surfacing evidence that contradicts a convenient assumption: that training objectives which boost reversal scores are producing the kind of generalisation researchers intend.

What This Means

For anyone building or evaluating AI systems that need to reliably retrieve and reason about factual relationships, this research is a warning that benchmark improvements do not guarantee genuine comprehension — and that understanding how a model succeeds may matter as much as whether it succeeds.