Researchers using a new benchmarking framework have found that popular AI explanation tools produce significantly less stable outputs than existing evaluations suggest, raising concerns about their reliability in safety-critical applications such as medical imaging and autonomous systems.
Post-hoc feature attribution methods — tools that attempt to explain which parts of an image influenced an AI model's decision — are increasingly used to justify automated decisions in high-stakes contexts. But a paper posted to arXiv by researchers introducing the Feature Attribution Stability Suite (FASS) argues that the field has been measuring stability incorrectly, producing results that are both inflated and misleading.
The Hidden Flaw in How AI Explanations Are Tested
The core problem, according to the researchers, lies in how existing benchmarks evaluate explanation consistency. Most current approaches test whether an attribution method produces a similar explanation when small amounts of noise are added to an input image. But these tests rarely check whether the underlying AI model actually made the same prediction on both the original and perturbed image.
The researchers report that, without conditioning on prediction preservation, up to 99% of evaluated image pairs involve changed predictions.
This matters because comparing explanations for two different predictions is not a fair test of explanation stability — it conflates the model changing its mind with the explanation method being unreliable. FASS enforces what the researchers call "prediction-invariance filtering," ensuring that only image pairs where the model reaches the same conclusion are included in stability assessments.
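The filtering step described above can be sketched in a few lines. This is an illustrative toy, not the FASS codebase: the function name and the idea of a model returning a top-1 label are assumptions made here for clarity.

```python
# Illustrative sketch of prediction-invariance filtering (names and
# interfaces are assumptions, not the FASS API). Only pairs where the
# model's top-1 prediction survives the perturbation enter the
# stability assessment.

def filter_prediction_invariant(model, originals, perturbed):
    """Return indices of pairs whose top-1 prediction is unchanged."""
    kept = []
    for i, (x, x_p) in enumerate(zip(originals, perturbed)):
        if model(x) == model(x_p):  # model returns a top-1 class label
            kept.append(i)
    return kept

# Toy "model": classify a number by its sign.
sign = lambda v: 1 if v >= 0 else 0
orig = [0.9, -0.4, 0.1]
pert = [0.7, -0.6, -0.1]  # the third perturbation flips the prediction
print(filter_prediction_invariant(sign, orig, pert))  # → [0, 1]
```

Only the first two pairs survive; comparing explanations for the third pair would conflate a changed prediction with an unstable explanation, which is exactly the confound FASS removes.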
Three Metrics, Three Types of Perturbation
Beyond fixing the filtering problem, FASS also restructures how stability is measured. Existing tools typically collapse explanation consistency into a single number. The new benchmark instead decomposes stability into three complementary metrics: structural similarity (how visually alike two attribution maps are), rank correlation (whether the same features are ranked similarly in importance), and top-k Jaccard overlap (whether the same features appear in the most important group).
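The latter two metrics are standard and easy to sketch on a pair of flattened attribution maps. In this hedged example, plain Pearson correlation of pixel values stands in for structural similarity (real SSIM needs an image library such as scikit-image); the Spearman and top-k Jaccard computations follow their textbook definitions and are not taken from the FASS code.

```python
# Toy versions of two of the three stability metrics, computed on two
# flattened attribution maps (lists of importance scores per pixel).

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def spearman(a, b):
    # Rank correlation: Pearson on ranks (assumes no tied values).
    rank = lambda v: [sorted(v).index(x) for x in v]
    return pearson(rank(a), rank(b))

def topk_jaccard(a, b, k):
    # Overlap of the k most important features in each map.
    top = lambda v: set(sorted(range(len(v)), key=lambda i: -v[i])[:k])
    sa, sb = top(a), top(b)
    return len(sa & sb) / len(sa | sb)

m1 = [0.9, 0.1, 0.5, 0.3]
m2 = [0.8, 0.2, 0.6, 0.3]
print(round(spearman(m1, m2), 2))  # identical rank order → 1.0
print(topk_jaccard(m1, m2, k=2))   # top-2 sets {0, 2} in both → 1.0
```

Decomposing stability this way matters because the metrics can disagree: two maps can rank features identically (Spearman of 1.0) while differing visually, or share a top-k set while disagreeing on the long tail.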
The benchmark also tests across three distinct categories of image perturbation: geometric changes such as rotation and cropping, photometric changes such as brightness and contrast adjustments, and compression artifacts. This distinction turns out to be crucial. The researchers found that geometric perturbations expose substantially greater attribution instability than photometric changes — a finding that would be invisible to benchmarks relying solely on additive noise.
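The three perturbation families can be illustrated on a tiny grayscale "image" (a 2D list of 0-255 values). This is a toy sketch under stated assumptions: crude quantization stands in for JPEG compression artifacts, and real pipelines would use an image library.

```python
# One toy perturbation per FASS category, applied to a 2x2 grayscale
# image represented as a 2D list of 0-255 intensity values.

def rotate90(img):
    # Geometric: 90-degree clockwise rotation.
    return [list(row) for row in zip(*img[::-1])]

def brighten(img, delta=20):
    # Photometric: additive brightness shift, clipped to 255.
    return [[min(255, p + delta) for p in row] for row in img]

def quantize(img, step=32):
    # Compression-artifact stand-in: coarse intensity quantization.
    return [[(p // step) * step for p in row] for row in img]

img = [[10, 200],
       [90, 160]]
print(rotate90(img))   # → [[90, 10], [160, 200]]
print(brighten(img))   # → [[30, 220], [110, 180]]
print(quantize(img))   # → [[0, 192], [64, 160]]
```

The categories behave very differently for attribution maps: a photometric shift leaves every pixel in place, while a geometric transform moves content to new coordinates, so an attribution method must track the relocated features rather than the raw pixel grid — one plausible reason geometric perturbations expose more instability.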
Four attribution methods were evaluated: Integrated Gradients, GradientSHAP, Grad-CAM, and LIME. These were tested across four model architectures and three datasets — ImageNet-1K, MS COCO, and CIFAR-10 — with all results self-reported by the research team.
Grad-CAM Achieves Highest Stability Scores, But Context Matters
Among the four methods tested, Grad-CAM consistently achieved the highest stability scores across datasets under the FASS framework. However, the researchers emphasize that the key finding is not which method performs best, but that stability estimates depend critically on which perturbation type is applied and whether prediction-invariance filtering is enforced.
The implication is that previous comparisons between attribution methods may have produced unreliable rankings. A method that appears stable under additive noise tests may perform poorly when images are geometrically transformed — exactly the kind of variation that occurs in real-world deployment, where camera angles, cropping, and compression are routine.
This has direct consequences for any domain where explainability is treated as a safety feature. In radiology, for instance, a clinician relying on an attribution map to understand why a model flagged an anomaly needs to trust that a slightly different image of the same patient would produce a consistent explanation. The FASS findings suggest that confidence in such consistency may be misplaced.
A Field Without Agreed Standards
The broader context here is that AI explainability — sometimes called XAI — lacks standardized evaluation protocols. Individual research teams and companies select their own metrics and perturbation strategies, making it difficult to compare methods or audit their real-world reliability. Regulatory frameworks in the European Union, including provisions within the AI Act, increasingly require that high-risk AI systems be explainable, but do not yet specify how explanation quality should be measured.
FASS does not propose a single definitive solution, but it does establish a more rigorous framework than what currently exists. By open-sourcing the benchmark — details and code availability are described in the paper — the researchers are inviting the community to adopt a shared standard, or at minimum to scrutinize their own evaluation practices.
The four methods tested represent a reasonable cross-section of current practice, but the benchmark does not cover every attribution approach in use. Methods such as SHAP variants beyond GradientSHAP, or attention-based explanations in transformer architectures, are not included in this initial release, leaving room for future expansion.
What This Means
Organizations deploying AI explanation tools in safety-critical settings should treat existing stability claims with caution — particularly if those claims were established without prediction-invariance filtering or geometric perturbation testing, as FASS suggests most were.