A new model called CIPHER can extract speech-sound information from scalp brainwave recordings, but its creators are cautioning that high error rates and data confounds mean it falls well short of a functional brain-to-text system.

Posted to arXiv in April 2025, the paper introduces CIPHER — short for Conformer-based Inference of Phonemes from High-density EEG Representations — a model developed to advance non-invasive neural speech decoding. Unlike implanted electrode systems, scalp EEG records brain activity from outside the skull, making it safer and more practical, but the signal it captures is far noisier and more spatially blurred, which makes decoding meaningful information from it a persistent challenge in neuroscience and AI.

How CIPHER Processes Brainwave Signals

CIPHER uses a dual-pathway architecture, meaning it processes two distinct types of brain signal in parallel. The first pathway analyses ERP features — event-related potentials, which are voltage changes in the brain time-locked to a specific event, such as hearing a sound. The second pathway analyses broadband DDA coefficients, a way of capturing the broader frequency structure of the EEG signal over time. Both streams feed into a Conformer, a neural network architecture that combines convolutional layers (good at capturing local patterns) with attention mechanisms (good at capturing longer-range dependencies), originally developed for speech recognition in audio.
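The dual-pathway idea can be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: the class names, dimensions, and layer choices are all assumptions, and the Conformer block is heavily simplified from the original speech-recognition design. It shows only the overall shape of the approach, with two feature streams encoded separately, fused, and passed through a block that combines attention and convolution.

```python
# Hypothetical sketch of a dual-pathway Conformer-style decoder (all names
# and dimensions assumed, not taken from the CIPHER paper).
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Simplified Conformer-style block: self-attention for long-range
    context plus a depthwise convolution for local patterns."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, time, dim)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm2(x + c)
        return self.norm3(x + self.ff(x))

class DualPathwayDecoder(nn.Module):
    """Two parallel frontends (one for ERP features, one for DDA
    coefficients) fused into a shared encoder, then a classifier head."""
    def __init__(self, erp_ch: int, dda_ch: int, dim: int = 64,
                 n_classes: int = 11):
        super().__init__()
        self.erp_front = nn.Conv1d(erp_ch, dim, kernel_size=5, padding=2)
        self.dda_front = nn.Conv1d(dda_ch, dim, kernel_size=5, padding=2)
        self.encoder = ConformerBlock(2 * dim)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, erp, dda):               # each: (batch, channels, time)
        e = self.erp_front(erp).transpose(1, 2)
        d = self.dda_front(dda).transpose(1, 2)
        z = self.encoder(torch.cat([e, d], dim=-1))
        return self.head(z.mean(dim=1))        # pool over time → logits
```

The key design point the sketch preserves is that the two feature types are kept as separate streams until after their frontends, so each can be evaluated or ablated independently, which matters for a feature-comparison study.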

The dataset used is OpenNeuro ds006104, a publicly available collection recorded from 24 participants across two studies that included concurrent transcranial magnetic stimulation (TMS) — a technique that delivers magnetic pulses through the scalp to temporarily disrupt activity in specific brain regions.


Strong Simple Results, Weak Complex Results

The headline numbers tell a split story. On binary articulatory tasks — where the model simply distinguishes between two broad categories of speech sound — CIPHER reaches what the authors describe as near-ceiling performance. That sounds impressive, but the paper immediately flags a critical problem: those high scores are highly vulnerable to confounds. Specifically, the model may be picking up on acoustic onset separability (differences in when sounds begin, rather than their phonetic content) and patterns created by TMS targeting, rather than genuine phoneme-level neural representations.
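The onset-separability confound is easy to demonstrate with synthetic numbers (this toy example is illustrative only; the values are invented and do not come from the paper). If two sound categories simply tend to begin at different times, a trivial threshold on onset latency can reach near-ceiling accuracy with no phonetic information involved at all:

```python
# Synthetic illustration of the onset-separability confound: the onset
# distributions and their 80 ms gap are invented for this example.
import numpy as np

rng = np.random.default_rng(1)
# Suppose category A sounds begin ~100 ms post-trigger, category B ~180 ms.
onsets_a = rng.normal(100, 15, size=200)
onsets_b = rng.normal(180, 15, size=200)
onsets = np.concatenate([onsets_a, onsets_b])
labels = np.concatenate([np.zeros(200), np.ones(200)])

# A one-feature "classifier": threshold on timing alone.
threshold = onsets.mean()
pred = (onsets > threshold).astype(float)
acc = np.mean(pred == labels)
print(f"Accuracy from onset timing alone: {acc:.2f}")
```

A model scoring near ceiling on such data could be reading the clock, not the brain — which is why the authors treat their binary results with suspicion.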

On the harder, more meaningful test — an 11-class CVC phoneme classification task using consonant-vowel-consonant syllables — the results are considerably weaker. Under a rigorous leave-one-subject-out (LOSO) protocol across 16 held-out subjects from Study 2 (the stricter of the two), real-word error rates came in at 0.671 ± 0.080 for the ERP pathway and 0.688 ± 0.096 for the DDA pathway. In plain terms: both approaches got roughly two-thirds of words wrong. The two pathways also performed comparably to each other, suggesting neither feature type holds a clear advantage at this level of granularity.
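The LOSO protocol itself is worth making concrete, since it is what makes these numbers credible. The sketch below uses random placeholder data and a trivial nearest-centroid classifier as stand-ins (neither the paper's features nor its model are reproduced here); what it shows is the evaluation loop, where each subject's data is held out entirely from training before being scored:

```python
# Minimal leave-one-subject-out (LOSO) evaluation loop. The data and the
# nearest-centroid classifier are placeholders, not the paper's pipeline.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, trials, dims, n_classes = 16, 40, 8, 11
# Placeholder per-subject data: feature matrix X and phoneme labels y.
data = {s: (rng.normal(size=(trials, dims)),
            rng.integers(0, n_classes, size=trials))
        for s in range(n_subjects)}

def nearest_centroid(X_train, y_train, X_test):
    """Assign each test trial to the class with the closest mean vector."""
    centroids = np.stack([X_train[y_train == c].mean(axis=0)
                          for c in range(n_classes)])
    dists = np.linalg.norm(X_test[:, None] - centroids[None], axis=-1)
    return dists.argmin(axis=1)

errors = []
for held_out in range(n_subjects):      # every subject is unseen once
    X_tr = np.vstack([data[s][0] for s in range(n_subjects) if s != held_out])
    y_tr = np.hstack([data[s][1] for s in range(n_subjects) if s != held_out])
    X_te, y_te = data[held_out]
    pred = nearest_centroid(X_tr, y_tr, X_te)
    errors.append(np.mean(pred != y_te))

print(f"LOSO error: {np.mean(errors):.3f} ± {np.std(errors):.3f}")
```

Because the test subject contributes nothing to training, LOSO estimates generalisation to entirely new people — a much harder bar than within-subject splits, and the reason the 0.671 and 0.688 figures, weak as they are, are meaningful.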

Why the Authors' Own Caution Matters

It is notable — and methodologically responsible — that the research team explicitly frames their contribution as a benchmark and feature-comparison study rather than a demonstration of working neural speech decoding. This kind of self-constraint is relatively uncommon in a field where results are often presented optimistically. The authors specifically limit their claims about neural representations to evidence that survives confound controls, which is a meaningful caveat given how easily EEG studies can produce misleading high scores when acoustic or experimental artifacts leak into the data.

The confound issue is particularly important here because the dataset involves TMS. TMS pulses create large electrical artifacts in EEG recordings, and the targeting of specific brain regions during different conditions can inadvertently create separable signal patterns that have nothing to do with phoneme perception. The authors acknowledge this risk directly.

The Broader Challenge of Non-Invasive Speech Decoding

CIPHER sits within a growing field attempting to decode language from brain signals without surgery. Invasive approaches — using electrode arrays placed directly on or inside the brain — have shown substantially more success: academic groups have demonstrated systems that reconstruct words and sentences from intracranial recordings with meaningful accuracy, and Meta's research teams have reported progress on non-invasive decoding using magnetoencephalography (MEG). Scalp EEG, by contrast, averages activity across millions of neurons and loses spatial resolution in the process, making fine-grained phoneme discrimination genuinely hard.

The practical stakes are high: a reliable non-invasive system could eventually give a voice to people with paralysis or conditions like ALS who cannot speak or move. But the gap between current EEG-based performance and clinical usefulness remains large, and CIPHER's results illustrate exactly how large that gap still is for the harder classification problems that matter most.

By releasing both the model and a clear-eyed analysis of its limitations against a public dataset, the authors provide a replicable reference point for other researchers — something the field benefits from even when, perhaps especially when, results are modest.

What This Means

CIPHER's value lies not in what it achieves but in what it honestly measures: for researchers building on EEG-based speech decoding, it establishes a rigorous, confound-aware benchmark that future systems will need to genuinely surpass.