Microsoft, Amazon, and OpenAI have all launched medical AI chatbots within the past few months, accelerating a race to embed artificial intelligence into healthcare — but evidence that these tools deliver reliable, safe outcomes for patients has not kept pace with their rollout.
The wave of launches reflects surging commercial interest in AI-powered health tools, a market that research firm Grand View Research valued at $20.9 billion in 2024 and projects to grow significantly through the decade. Demand from hospital systems, insurers, and individual consumers is real. Yet independent evaluation of whether these chatbots improve patient outcomes — rather than simply replicating or occasionally undermining the work of clinicians — remains limited and inconsistent.
A Market Racing Ahead of the Evidence
The core concern is not that AI health tools are useless. Several studies suggest they can triage common queries, surface relevant medical literature, and reduce administrative burden on overstretched clinicians. A 2023 study published in JAMA Internal Medicine, in which physicians evaluated responses in a blinded comparison, found that ChatGPT's answers to patient questions were rated higher in quality and empathy than physicians' own responses, a finding that generated both enthusiasm and significant pushback about study design.
But that research, like much in this space, tested a narrow, controlled scenario. Real-world deployments introduce variables that benchmarks rarely capture: ambiguous symptoms, patients with multiple conditions, and the risk that a confident-sounding but incorrect AI answer delays someone seeking urgent care.
The question is not whether AI can answer medical questions — it can — but whether it answers the right question for the right patient at the right moment.
Microsoft's DAX Copilot, Amazon's HealthScribe, and OpenAI's recently announced health-focused GPT features each target different parts of the healthcare system. DAX Copilot focuses on clinical documentation, transcribing patient-doctor conversations to reduce physician paperwork. HealthScribe occupies similar territory, generating clinical notes from recorded patient-clinician conversations. OpenAI's tools aim more directly at consumer-facing interactions. The distinctions matter: a documentation assistant that makes an error creates different risks than a chatbot advising a patient about chest pain.
The Anthropic-Pentagon Story: A Different Kind of AI Controversy
Running parallel to the health AI debate is a separate and politically charged story involving Anthropic, the AI safety company behind the Claude model series. According to MIT Technology Review's reporting, the Pentagon's use of Anthropic's technology has sparked an internal culture clash — a tension between Anthropic's publicly stated safety-first mission and the realities of working with one of the world's largest military institutions.
The specifics of the reported conflict centre on how Claude is being deployed within defence contexts and whether that deployment aligns with the values Anthropic has articulated to the public and to its own employees. Anthropic has positioned itself as one of the more cautious voices in the AI industry, publishing research on model alignment and advocating for regulatory frameworks. A contract with the Pentagon — and any friction it generates internally — complicates that image.
This is not a novel tension. Google faced significant internal rebellion in 2018 when employees protested the company's involvement in Project Maven, a Pentagon initiative using AI to analyse drone footage. Google ultimately did not renew that contract. Whether Anthropic faces comparable internal pressure, and how its leadership responds, will be closely watched across the industry.
What the Two Stories Share
On the surface, AI health tools and a Pentagon culture clash appear unrelated. But they share a structural problem: the speed of AI deployment is outpacing the frameworks — ethical, regulatory, and clinical — designed to govern it.
In healthcare, regulators including the U.S. Food and Drug Administration have begun developing pathways for AI-enabled medical devices, but chatbots that position themselves as informational rather than diagnostic have largely operated in a grey zone. The EU AI Act, which classifies certain health AI applications as high-risk, represents one of the more concrete attempts to impose accountability — but its full implementation remains a work in progress.
In the defence context, questions about what values an AI company is willing to compromise, and under what conditions, go to the heart of how the industry governs itself. Voluntary commitments to safety are easier to maintain when they cost nothing.
For the researchers, clinicians, and patient advocates paying close attention, the pattern is consistent: announcements arrive faster than audits, and marketing language about AI's potential fills the space where outcome data should be.
What This Means
For anyone using or considering AI health tools, the absence of robust independent evaluation means the burden of scepticism still falls on the user. For Anthropic, the Pentagon episode is an early, public test of whether its safety commitments are principles or positioning.
