Automated AI evaluators can reliably assess whether large language models respond safely to users experiencing psychosis, according to new research that validated the approach against human clinical consensus and achieved agreement scores as high as Cohen's κ = 0.75.
The study, posted to arXiv in April 2025, addresses a growing concern in AI safety: general-purpose chatbots are increasingly used for mental health support, yet no scalable, clinically grounded method exists to test whether their responses are safe for some of the most vulnerable users. People experiencing psychosis — a condition involving breaks from reality, including delusions and hallucinations — represent a particularly high-risk group, since a poorly calibrated AI response could reinforce harmful beliefs rather than redirect users toward care.
The research found that the LLM-as-a-Judge approach aligned closely with human clinical consensus, with Gemini achieving a Cohen's κ of 0.75 — a score that indicates strong agreement by conventional standards.
How the Evaluation Framework Was Built
The research team constructed its evaluation system in three stages. First, they developed seven safety criteria informed by clinicians — specific, measurable standards for what constitutes a safe or unsafe model response when a user demonstrates signs of psychosis. Second, they built a human-consensus dataset, in which human raters assessed AI responses against those criteria, establishing a gold standard. Third, they tested whether LLMs could replicate that human judgment automatically, using two approaches: a single LLM acting as a sole evaluator (LLM-as-a-Judge), and a majority-vote system across multiple LLM evaluators (LLM-as-a-Jury).
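The stages above can be sketched as a simple evaluation loop. The criterion names and the judge function below are illustrative placeholders, not the study's actual seven criteria or prompts; a real implementation would replace `judge` with a call to a judge model.

```python
# Minimal sketch of the single-judge evaluation loop, under assumed
# criterion names (the study's seven clinician-informed criteria are
# not reproduced here).

CRITERIA = [
    "does_not_validate_delusional_content",
    "redirects_toward_professional_care",
    "avoids_harmful_or_stigmatizing_language",
]  # stand-ins for the study's seven criteria

def judge(response: str, criterion: str) -> str:
    """Stand-in for an LLM-as-a-Judge call returning 'safe' or 'unsafe'.

    In practice this would prompt a judge model (e.g. Gemini) with the
    criterion's rubric and the chatbot response to be rated.
    """
    # Placeholder heuristic so the sketch runs end to end.
    return "safe" if "professional" in response.lower() else "unsafe"

def evaluate(response: str) -> dict:
    """Score one chatbot response against every safety criterion."""
    return {c: judge(response, c) for c in CRITERIA}
```

A response like "I'd encourage you to talk with a mental health professional" would be scored against each criterion in turn, producing a per-criterion verdict that can then be compared with the human-consensus labels.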
The models tested as judges included Gemini, Qwen, and Kimi. All three showed meaningful alignment with human raters, though Gemini performed best. Notably, the single best judge marginally outperformed the jury system (κ = 0.75 versus κ = 0.74 for the jury), suggesting that ensemble voting does not automatically improve accuracy in this context. These benchmarks are self-reported by the study authors and have not been independently replicated.
Why Psychosis Is the Test Case That Matters
The choice of psychosis as the focal condition is deliberate and significant. People experiencing active psychosis may present with fixed false beliefs (delusions) or perceive things that are not there (hallucinations). An AI chatbot that engages uncritically with delusional content — for example, by affirming a user's belief that they are being persecuted — could deepen distress or delay clinical intervention.
Existing safety evaluations for mental health AI tend to focus on more commonly discussed conditions such as depression or suicidal ideation, where clearer clinical guidelines already exist. Psychosis has been comparatively under-studied in this context, and the researchers argue it represents one of the highest-stakes scenarios for AI deployment. Their seven safety criteria are designed to capture nuanced failures — not just overtly harmful responses, but subtler problems like epistemic validation of delusional content.
The Scalability Problem This Research Tries to Solve
Clinical validation of AI responses at scale is expensive and slow. Recruiting qualified mental health professionals to manually review thousands of model outputs is not feasible for continuous monitoring or large-scale benchmarking. This is why the LLM-as-a-Judge paradigm is attractive: if an automated evaluator can match human clinical judgment reliably, it unlocks the ability to run safety evaluations continuously, cheaply, and at the scale modern AI deployment demands.
The Cohen's κ score is a standard measure of inter-rater agreement that corrects for agreement expected by chance. On the widely used Landis and Koch scale, values from 0.61 to 0.80 indicate substantial agreement, while values from 0.41 to 0.60 indicate only moderate agreement. The fact that two of the three tested models exceeded 0.60, and one reached 0.75, suggests the approach is practically viable, though the researchers stop short of claiming it can fully replace clinical human review. Their framing positions automated evaluation as a complement to, not a replacement for, expert oversight.
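The statistic itself is straightforward to compute: observed agreement minus chance agreement, scaled by the maximum possible agreement above chance. A minimal implementation for two raters over the same items:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same n items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled the same.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of marginal proportions, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[lab] * counts_b[lab]
              for lab in set(rater_a) | set(rater_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

For example, if a judge model matches human raters on 5 of 6 binary safe/unsafe labels with balanced marginals, κ lands well below the raw 83% agreement rate, which is exactly the chance correction the study relies on.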
The jury approach — taking a majority vote across Gemini, Qwen, and Kimi — achieved κ = 0.74, slightly below the top individual judge. This result challenges a common assumption in evaluation research that ensemble methods are inherently more reliable. It may reflect that the weaker judges introduce noise that offsets the benefits of aggregation, or that the task requires a level of nuanced clinical reasoning where one well-calibrated model outperforms averaged opinion.
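The noise mechanism is easy to demonstrate with synthetic labels (these are illustrative, not the study's data): when one judge is well calibrated and the others are noisy, the majority vote can flip items the best judge got right.

```python
# Toy illustration with made-up labels: a majority vote across one
# strong judge and two weaker ones can agree with the humans less
# often than the strong judge alone.
human = ["safe", "unsafe", "safe", "safe", "unsafe", "safe", "unsafe", "safe"]
best  = ["safe", "unsafe", "safe", "safe", "unsafe", "safe", "unsafe", "unsafe"]
mid   = ["safe", "unsafe", "unsafe", "safe", "safe", "safe", "unsafe", "unsafe"]
weak  = ["unsafe", "unsafe", "unsafe", "safe", "safe", "unsafe", "unsafe", "unsafe"]

def majority_vote(*judges):
    """Per-item majority label across several judges' verdict lists."""
    return [max(set(votes), key=votes.count) for votes in zip(*judges)]

def agreement(a, b):
    """Number of items two label lists agree on."""
    return sum(x == y for x, y in zip(a, b))
```

Here the best judge matches the humans on 7 of 8 items, but the jury's majority vote matches on only 5 of 8: the two weaker judges outvote the strong one on items it had labeled correctly, mirroring the κ = 0.74 versus κ = 0.75 gap reported in the study.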
What Comes Next for Clinical AI Safety Testing
The research does not test specific commercial AI products for safety failures, nor does it make claims about any particular chatbot's risk profile. Its contribution is methodological: a validated framework and dataset that other researchers and developers can use to run their own evaluations. The dataset and criteria could, in principle, be applied to any general-purpose LLM to assess how it handles psychosis-related conversations.
The broader research context includes growing regulatory and public interest in AI mental health tools. Several consumer-facing AI companions and chatbot services have attracted scrutiny over their handling of crisis conversations. Rigorous, repeatable evaluation frameworks are increasingly seen as a prerequisite for responsible deployment in sensitive domains.
What This Means
For developers building or deploying AI in mental health contexts, this research offers a practical, clinically grounded toolkit for safety testing at scale — and a warning that current evaluation standards may be insufficient for users experiencing psychosis.