A new automated framework can compare AI safety policy documents from different countries and organisations using large language models, but the paper's own evaluation finds that the choice of LLM substantially changes the results and that machine scores still diverge from those of human experts.
The research, posted to arXiv in April 2025, addresses a practical problem in AI governance: as governments, intergovernmental bodies, and research institutions multiply their AI safety commitments, manually tracking and comparing those commitments has become time-consuming and inconsistent. The authors propose a structured, automated alternative they call a "crosswalk framework."
How the Crosswalk Framework Works
The system uses the activity categories defined in the Activity Map on AI Safety — an existing taxonomy of AI safety-relevant work — as fixed reference points. For any pair of policy documents, the framework instructs an LLM to extract relevant activities, produce a short summary for each document under each category, write a brief comparison, and generate a similarity score between the two documents on that dimension. The result is a structured, side-by-side analysis that can be visualised as a heatmap of similarity scores across document pairs.
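For illustration, here is a minimal sketch of how such a per-category comparison loop could be wired up. The category names, the `ask_llm` callable, and the prompt wording are all assumptions made for the sketch, not details taken from the paper; the actual framework uses the Activity Map on AI Safety categories and whatever prompting the authors designed.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative category names only; the framework itself uses the activity
# categories defined in the Activity Map on AI Safety.
CATEGORIES = ["Risk assessment", "Incident reporting", "Model evaluation"]

@dataclass
class CrosswalkCell:
    category: str
    comparison: str    # LLM-written summary and comparison for this category
    similarity: float  # e.g. 0.0 (unrelated) to 1.0 (closely aligned)

def build_crosswalk(doc_a: str, doc_b: str,
                    ask_llm: Callable[[str], str]) -> List[CrosswalkCell]:
    """Ask the model for one structured comparison per taxonomy category."""
    cells = []
    for category in CATEGORIES:
        prompt = (
            f"Category: {category}\n\n"
            f"Document A:\n{doc_a}\n\nDocument B:\n{doc_b}\n\n"
            "Summarise what each document says under this category, compare "
            "them briefly, then give a similarity score between 0 and 1 on "
            "the final line in the form 'SCORE: <number>'."
        )
        response = ask_llm(prompt)
        # Naive parsing for the sketch; a production pipeline would request
        # and validate structured (e.g. JSON) output instead.
        score = float(response.strip().splitlines()[-1].split(":")[-1])
        cells.append(CrosswalkCell(category, response, score))
    return cells

# Stand-in "model" so the sketch runs without any API access.
demo = build_crosswalk("Policy text A", "Policy text B",
                       lambda _prompt: "Both documents cover this area.\nSCORE: 0.7")
```

Scores collected this way for every document pair are what the framework arranges into its heatmap visualisation.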
The approach is designed to make comparative policy inspection faster and more systematic, particularly for analysts monitoring how different national or institutional strategies align or diverge on specific safety activities.
What the Testing Revealed
The researchers tested the framework using five large language models across ten publicly available AI safety policy documents, generating crosswalks for each possible document pair. The heatmap visualisations revealed meaningful variation: not only did individual similarity scores shift depending on which LLM was used, but some document pairs produced high disagreement across models — meaning different LLMs reached substantially different conclusions about how similar two documents were.
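As a rough illustration of how such disagreement can be surfaced, the sketch below computes the spread of similarity scores across models for each document pair and flags wide spreads for human review. The scores, model names, and threshold are placeholders for the sketch, not figures from the paper.

```python
from statistics import pstdev

# Placeholder similarity scores for one taxonomy category:
# {document pair: {model: similarity}}. Illustrative values only.
scores = {
    ("Doc1", "Doc2"): {"model_a": 0.80, "model_b": 0.75, "model_c": 0.70},
    ("Doc1", "Doc3"): {"model_a": 0.90, "model_b": 0.40, "model_c": 0.60},
}

DISAGREEMENT_THRESHOLD = 0.15  # arbitrary cut-off chosen for this sketch

for pair, by_model in scores.items():
    values = list(by_model.values())
    spread = pstdev(values)  # standard deviation of scores across models
    flag = "needs human review" if spread > DISAGREEMENT_THRESHOLD else "stable"
    print(pair, f"mean={sum(values) / len(values):.2f}", f"spread={spread:.2f}", flag)
```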
This instability is a significant finding. If the framework is intended to support consistent, replicable policy analysis, variability driven by model choice introduces a reliability problem that users would need to account for.
Human Experts Agree With Each Other, Less So With the Models
To validate the automated scores, three human experts evaluated two document pairs and rated similarity across the same taxonomy categories. The human annotators showed high inter-annotator agreement — they largely concurred with each other. However, their scores differed from those produced by the LLMs, reinforcing that the models are not yet reliable proxies for expert human judgment in this domain.
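The comparison between annotators and models can be made concrete with simple agreement statistics. The sketch below uses invented ratings purely to show the calculation (it does not reproduce the study's numbers): mean pairwise Pearson correlation among the experts, and the mean absolute gap between the expert consensus and one model's scores.

```python
from itertools import combinations
from statistics import correlation, mean  # correlation requires Python 3.10+

# Illustrative ratings across four taxonomy categories for one document pair.
# Placeholder numbers, not the study's data.
human = {
    "expert_1": [0.80, 0.60, 0.90, 0.70],
    "expert_2": [0.75, 0.65, 0.85, 0.70],
    "expert_3": [0.80, 0.55, 0.90, 0.75],
}
llm = [0.50, 0.90, 0.60, 0.40]  # one model's scores on the same categories

# Inter-annotator agreement: mean pairwise Pearson correlation among experts.
agreement = mean(correlation(a, b) for a, b in combinations(human.values(), 2))

# Human-vs-model gap: mean absolute difference from the expert consensus.
consensus = [mean(vals) for vals in zip(*human.values())]
gap = mean(abs(h, ) if False else abs(h - m) for h, m in zip(consensus, llm))

print(f"inter-annotator agreement: {agreement:.2f}")
print(f"mean |human consensus - LLM| gap: {gap:.2f}")
```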
The authors are transparent about this gap. Rather than claiming the system replaces human analysis, they frame it as a tool that supports comparative inspection — helping analysts identify where documents align or diverge before applying deeper human scrutiny.
Why Policy Comparisons Are Hard to Automate
AI safety policy documents vary considerably in structure, specificity, and terminology. A national strategy from one government may describe the same activity in very different terms than a technical report from an international body does. Mapping these onto a shared taxonomy requires interpretive judgments that LLMs handle inconsistently, particularly when documents are ambiguous or when a taxonomy category is broad.
The reliance on a fixed external taxonomy — the Activity Map on AI Safety — is both a strength and a constraint. It provides a stable comparison structure, but the framework's usefulness depends on that taxonomy adequately covering the activities described in the documents being compared. Documents that address activities outside the taxonomy's scope may be systematically undercounted.
Practical Applications and Limitations
For policy analysts and researchers monitoring AI governance developments, a tool like this could meaningfully reduce the time required to produce initial comparative assessments. Governments and multilateral organisations that need to track alignment between their own commitments and those of peers could use it as a first-pass screening tool.
However, the findings suggest several precautions. Users should not rely on outputs from a single LLM, given the demonstrated score variability. Document pairs that generate high cross-model disagreement likely require more intensive human review rather than less. And because all evaluations in the paper were conducted and reported by the research team itself, independent replication on different document sets would strengthen confidence in the framework's generalisability.
The study also does not yet address how the framework performs across documents in different languages — a meaningful limitation for genuinely global policy comparison.
What This Means
This research offers a credible starting point for automating AI safety policy comparison, but the demonstrated gap between LLM outputs and human expert judgment means analysts should treat the tool as a structured aid rather than an authoritative assessor — at least until model reliability on this task improves.