A new automated framework can compare AI safety policy documents from different countries and organisations using large language models, but the paper's own evaluation finds that the choice of LLM substantially changes the results and that machine scores still diverge from those of human experts.
The research, posted to arXiv in April 2025, addresses a practical problem in AI governance: as governments, intergovernmental bodies, and research institutions multiply their AI safety commitments, manually tracking and comparing those commitments has become time-consuming and inconsistent. The authors propose a structured, automated alternative they call a "crosswalk framework."
How the Crosswalk Framework Works
The system uses the activity categories defined in the Activity Map on AI Safety — an existing taxonomy of AI safety-relevant work — as fixed reference points. For any pair of policy documents, the framework instructs an LLM to extract relevant activities, produce a short summary for each document under each category, write a brief comparison, and generate a similarity score between the two documents on that dimension. The result is a structured, side-by-side analysis that can be visualised as a heatmap of similarity scores across document pairs.
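For illustration, here is a minimal sketch of how such a per-category comparison loop could be wired up. The category names, the `ask_llm` callable, and the prompt wording are all assumptions made for the sketch, not details taken from the paper; the actual framework uses the Activity Map on AI Safety categories and whatever prompting the authors designed.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative category names only; the framework itself uses the activity
# categories defined in the Activity Map on AI Safety.
CATEGORIES = ["Risk assessment", "Incident reporting", "Model evaluation"]

@dataclass
class CrosswalkCell:
    category: str
    comparison: str    # LLM-written summary and comparison for this category
    similarity: float  # e.g. 0.0 (unrelated) to 1.0 (closely aligned)

def build_crosswalk(doc_a: str, doc_b: str,
                    ask_llm: Callable[[str], str]) -> List[CrosswalkCell]:
    """Ask the model for one structured comparison per taxonomy category."""
    cells = []
    for category in CATEGORIES:
        prompt = (
            f"Category: {category}\n\n"
            f"Document A:\n{doc_a}\n\nDocument B:\n{doc_b}\n\n"
            "Summarise what each document says under this category, compare "
            "them briefly, then give a similarity score between 0 and 1 on "
            "the final line in the form 'SCORE: <number>'."
        )
        response = ask_llm(prompt)
        # Naive parsing for the sketch; a production pipeline would request
        # and validate structured (e.g. JSON) output instead.
        score = float(response.strip().splitlines()[-1].split(":")[-1])
        cells.append(CrosswalkCell(category, response, score))
    return cells

# Stand-in "model" so the sketch runs without any API access.
demo = build_crosswalk("Policy text A", "Policy text B",
                       lambda _prompt: "Both documents cover this area.\nSCORE: 0.7")
```

Scores collected this way for every document pair are what the framework arranges into its heatmap visualisation.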
The approach is designed to make comparative policy inspection faster and more systematic, particularly for analysts monitoring how different national or institutional strategies align or diverge on specific safety activities.
What the Testing Revealed
The researchers tested the framework using five large language models across ten publicly available AI safety policy documents, generating crosswalks for each possible document pair. The heatmap visualisations revealed meaningful variation: not only did individual similarity scores shift depending on which LLM was used, but some document pairs produced high disagreement across models — meaning different LLMs reached substantially different conclusions about how similar two documents were.
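As a rough illustration of how such disagreement can be surfaced, the sketch below computes the spread of similarity scores across models for each document pair and flags wide spreads for human review. The scores, model names, and threshold are placeholders for the sketch, not figures from the paper.

```python
from statistics import pstdev

# Placeholder similarity scores for one taxonomy category:
# {document pair: {model: similarity}}. Illustrative values only.
scores = {
    ("Doc1", "Doc2"): {"model_a": 0.80, "model_b": 0.75, "model_c": 0.70},
    ("Doc1", "Doc3"): {"model_a": 0.90, "model_b": 0.40, "model_c": 0.60},
}

DISAGREEMENT_THRESHOLD = 0.15  # arbitrary cut-off chosen for this sketch

for pair, by_model in scores.items():
    values = list(by_model.values())
    spread = pstdev(values)  # standard deviation of scores across models
    flag = "needs human review" if spread > DISAGREEMENT_THRESHOLD else "stable"
    print(pair, f"mean={sum(values) / len(values):.2f}", f"spread={spread:.2f}", flag)
```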
This instability is a significant finding. If the framework is intended to support consistent, replicable policy analysis, variability driven by model choice introduces a reliability problem that users would need to account for.
Human Experts Agree With Each Other, Less So With the Models
To validate the automated scores, three human experts evaluated two document pairs and rated similarity across the same taxonomy categories. The human annotators showed high inter-annotator agreement — they largely concurred with each other. However, their scores differed from those produced by the LLMs, reinforcing that the models are not yet reliable proxies for expert human judgment in this domain.
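The comparison between annotators and models can be made concrete with simple agreement statistics. The sketch below uses invented ratings purely to show the calculation (it does not reproduce the study's numbers): mean pairwise Pearson correlation among the experts, and the mean absolute gap between the expert consensus and one model's scores.

```python
from itertools import combinations
from statistics import correlation, mean  # correlation requires Python 3.10+

# Illustrative ratings across four taxonomy categories for one document pair.
# Placeholder numbers, not the study's data.
human = {
    "expert_1": [0.80, 0.60, 0.90, 0.70],
    "expert_2": [0.75, 0.65, 0.85, 0.70],
    "expert_3": [0.80, 0.55, 0.90, 0.75],
}
llm = [0.50, 0.90, 0.60, 0.40]  # one model's scores on the same categories

# Inter-annotator agreement: mean pairwise Pearson correlation among experts.
agreement = mean(correlation(a, b) for a, b in combinations(human.values(), 2))

# Human-vs-model gap: mean absolute difference from the expert consensus.
consensus = [mean(vals) for vals in zip(*human.values())]
gap = mean(abs(h, ) if False else abs(h - m) for h, m in zip(consensus, llm))

print(f"inter-annotator agreement: {agreement:.2f}")
print(f"mean |human consensus - LLM| gap: {gap:.2f}")
```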
The authors are transparent about this gap. Rather than claiming the system replaces human analysis, they frame it as a tool that supports comparative inspection — helping analysts identify where documents align or diverge before applying deeper human scrutiny.
Why Policy Comparisons Are Hard to Automate
AI safety policy documents vary considerably in structure, specificity, and terminology. A national strategy from one government may describe the same activity in very different terms than a technical report from an international body does. Mapping these onto a shared taxonomy requires interpretive judgments that LLMs handle inconsistently, particularly when documents are ambiguous or when a taxonomy category is broad.
The reliance on a fixed external taxonomy — the Activity Map on AI Safety — is both a strength and a constraint. It provides a stable comparison structure, but the framework's usefulness depends on that taxonomy adequately covering the activities described in the documents being compared. Documents that address activities outside the taxonomy's scope may be systematically undercounted.
Practical Applications and Limitations
For policy analysts and researchers monitoring AI governance developments, a tool like this could meaningfully reduce the time required to produce initial comparative assessments. Governments and multilateral organisations that need to track alignment between their own commitments and those of peers could use it as a first-pass screening tool.
However, the findings suggest several precautions. Users should not rely on outputs from a single LLM, given the demonstrated score variability. Document pairs that generate high cross-model disagreement likely require more intensive human review rather than less. And because all evaluations in the paper were conducted and reported by the research team itself, independent replication on different document sets would strengthen confidence in the framework's generalisability.
The study also does not yet address how the framework performs across documents in different languages — a meaningful limitation for genuinely global policy comparison.
What This Means
This research offers a credible starting point for automating AI safety policy comparison, but the demonstrated gap between LLM outputs and human expert judgment means analysts should treat the tool as a structured aid rather than an authoritative assessor — at least until model reliability on this task improves.