Researchers have built a fully automated system that reads and explains the internal 'circuit diagrams' of large language models, a task that previously required painstaking manual inspection by human experts.

The field of mechanistic interpretability has long sought to answer a deceptively simple question: when a language model produces an output, which internal components caused it, and how? The subfield of circuit tracing attempts to map these causal chains — identifying which model features activated, how they influenced one another, and what computational role each played. Until now, the final step of that process, actually interpreting what a feature does, has relied entirely on researchers manually reviewing activation data, a slow and subjective bottleneck.

From Manual Inspection to Automated Explanation

The new system, called ADAG (Automatically Describing Attribution Graphs), replaces that human bottleneck with an end-to-end automated pipeline, according to the researchers who published the work on arXiv in April 2025. The core innovation begins with what the team calls attribution profiles — a method that quantifies the functional role of any given model feature by measuring its gradient effects on both inputs and outputs. Rather than asking a researcher to examine which text examples a feature responds to, attribution profiles produce a structured, mathematical description of what that feature does and when.
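
To make the idea concrete, here is a minimal sketch of what an attribution profile might look like for a toy linear model. The shapes, names, and the exact concatenation scheme are assumptions for illustration; the paper's precise formulation may differ. The point is that each feature gets a fixed-length vector summarising how inputs drive it (input-side gradients) and how it drives outputs (output-side gradients):

```python
import numpy as np

# Toy model: features h = W_in @ x, logits = W_out @ h.
# (Hypothetical shapes and weight names, chosen for illustration only.)
rng = np.random.default_rng(0)
d_in, d_feat, d_out = 16, 8, 4
W_in = rng.normal(size=(d_feat, d_in))    # input -> features
W_out = rng.normal(size=(d_out, d_feat))  # features -> logits

def attribution_profile(j):
    """Describe feature j by its gradient sensitivities on both sides.

    For a linear model these gradients are just weight rows/columns;
    in a real network they would come from backpropagation.
    """
    input_grad = W_in[j]       # d f_j / d x: how inputs drive the feature
    output_grad = W_out[:, j]  # d logits / d f_j: how the feature drives outputs
    v = np.concatenate([input_grad, output_grad])
    return v / np.linalg.norm(v)  # unit-normalise so profiles are comparable

profiles = np.stack([attribution_profile(j) for j in range(d_feat)])
print(profiles.shape)  # (8, 20): one 20-dim functional fingerprint per feature
```

Two features with similar profiles respond to similar inputs and push the model's outputs in similar directions — which is exactly the property the next stage exploits.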

Those profiles then feed into a novel clustering algorithm that groups features with similar functional roles together. This step is significant: individual features in a large model are often too granular and interdependent to interpret one at a time, so grouping them into coherent clusters makes explanation tractable.
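
A simple way to picture the grouping step is threshold-based clustering on profile similarity. This is an illustrative stand-in, not the paper's algorithm: it greedily assigns each profile to the first cluster whose centroid it matches above a cosine-similarity threshold, starting a new cluster otherwise:

```python
import numpy as np

def cluster_profiles(profiles, threshold=0.9):
    """Greedy cosine-similarity clustering of attribution profiles.

    Assigns each profile to the first cluster whose (normalised) centroid
    exceeds the similarity threshold; otherwise it seeds a new cluster.
    """
    clusters = []   # list of lists of feature indices
    centroids = []  # unit-norm centroid per cluster
    for i, p in enumerate(profiles):
        p = p / np.linalg.norm(p)
        for k, c in enumerate(centroids):
            if p @ c > threshold:
                clusters[k].append(i)
                c_new = profiles[clusters[k]].mean(axis=0)  # update centroid
                centroids[k] = c_new / np.linalg.norm(c_new)
                break
        else:
            clusters.append([i])
            centroids.append(p)
    return clusters

# Two obvious functional groups: features 0-1 vs features 2-3
profiles = np.array([[1.0, 0.0, 0.0],
                     [1.0, 0.01, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.01, 1.0, 0.0]])
print(cluster_profiles(profiles))  # [[0, 1], [2, 3]]
```

The threshold trades granularity for coherence: lower values merge more features into fewer, broader clusters.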


The LLM Explains Itself

Once clusters are formed, ADAG deploys what the researchers describe as an LLM explainer-simulator setup. A language model generates natural-language explanations of what each feature cluster does functionally, and a separate simulation step scores those explanations for accuracy — checking whether the proposed description correctly predicts how the cluster behaves on held-out examples. This self-verifying loop is designed to filter out plausible-sounding but incorrect explanations, a known failure mode when using LLMs to interpret other models.
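
The scoring half of that loop can be sketched as a correlation check. In this hypothetical version, a simulator reads the explanation and predicts the cluster's activation on held-out examples; the score is the correlation between predicted and actual activations, so an explanation that sounds plausible but fails to track the cluster's real behaviour scores poorly:

```python
import numpy as np

def score_explanation(predicted, actual):
    """Score an explanation by how well its simulated activations
    correlate with the cluster's actual activations (Pearson r)."""
    p = np.asarray(predicted, dtype=float)
    a = np.asarray(actual, dtype=float)
    p = (p - p.mean()) / p.std()  # z-score both series
    a = (a - a.mean()) / a.std()
    return float(np.mean(p * a))  # mean of products = correlation

# Actual activations of a hypothetical cluster on held-out examples
actual = [0.1, 0.9, 0.8, 0.0, 0.7]
good = [0.0, 1.0, 1.0, 0.0, 1.0]  # explanation that tracks the behaviour
bad = [1.0, 0.0, 0.0, 1.0, 0.0]   # plausible-sounding but wrong
print(score_explanation(good, actual) > score_explanation(bad, actual))  # True
```

Scoring against held-out examples is what makes the loop falsifiable: the explainer cannot simply memorise the examples it was shown.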

The team validated ADAG by running it on circuit-tracing tasks that human researchers had already analysed manually. According to the paper, the automated system recovered interpretable circuits consistent with the prior human-generated analyses, suggesting the pipeline produces descriptions grounded in model behaviour rather than mere fluency.

A Jailbreak Traced to Specific Internal Clusters

The most striking application in the paper involves Meta's Llama 3.1 8B Instruct model. The researchers applied ADAG to investigate a harmful advice jailbreak — a prompt strategy that causes the model to bypass its safety training and provide dangerous guidance. ADAG identified specific feature clusters that the researchers describe as "steerable" — meaning they appear causally responsible for the model's susceptibility to that jailbreak.

This finding carries direct implications for AI safety work. Pinpointing which internal components enable a particular failure mode is a prerequisite for targeted intervention: if engineers know which clusters are responsible for a jailbreak, they have a more precise target for fine-tuning or other corrective measures than if they can only observe the behaviour at the output level.

It is worth noting that the benchmark results and jailbreak findings are self-reported by the authors: the paper was posted to arXiv as a preprint and has not yet undergone formal peer review.

Scaling Interpretability Research

The broader challenge ADAG addresses is one of scale. Modern language models contain billions of parameters and vast numbers of internal features. Even dedicated interpretability teams cannot manually examine more than a small fraction of the circuits involved in any given behaviour. Automated pipelines that can produce reliable, natural-language explanations of those circuits at scale could expand the scope of what interpretability research is able to cover.

The approach also sidesteps a persistent criticism of interpretability work: that human analysts may unconsciously project meaning onto features, finding patterns that confirm expectations rather than reflect genuine causal structure. By automating the description step and building in a simulation-based scoring mechanism, ADAG attempts to make the process more systematic and falsifiable — though whether the LLM explainer itself introduces its own biases remains an open question for follow-on research.

What This Means

If ADAG's automated approach holds up to independent scrutiny, it could accelerate mechanistic interpretability research and give AI safety teams a practical tool for tracing — and potentially closing — specific vulnerability pathways in deployed language models.