A new AI model called PHONSSM has achieved 72.1% accuracy on a 2,000-sign American Sign Language benchmark by treating signs as structured combinations of smaller linguistic units — the same way linguists have described sign languages for decades.

Sign language recognition systems have long struggled with a fundamental scaling problem: models that perform well on small vocabularies of 20 or 50 signs tend to collapse when vocabulary size grows to the hundreds or thousands. A paper published on arXiv by researchers working with the largest ASL dataset ever assembled — covering 5,565 signs — argues this is not a data problem or a compute problem, but a representation problem.

Why Existing Models Break at Scale

Most current sign recognition architectures treat each sign as a single, indivisible visual pattern. The model learns what a sign looks like, but not why it looks that way. This approach, the researchers argue, forces the system to memorise an ever-growing library of unrelated visual templates — a task that becomes increasingly impractical as vocabulary size grows.

Sign languages, however, are not arbitrary collections of gestures. Like spoken languages, they have internal phonological structure. Every sign in American Sign Language can be described using a small set of discrete parameters: the shape of the hand, where in space the sign is made, how the hand moves, and its orientation. Different signs reuse these components in different combinations — meaning a system that understands the components can generalise far more efficiently than one that memorises whole signs.
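To make the compositional idea concrete, here is a toy sketch — not from the paper, with invented parameter names and values — of signs represented as combinations of discrete phonological parameters. Two signs that share handshape, movement and orientation differ only in where they are made:

```python
from dataclasses import dataclass

# Toy illustration (invented values): each sign is a combination of discrete
# phonological parameters rather than an opaque visual template.
@dataclass(frozen=True)
class Sign:
    handshape: str    # e.g. "flat-B", "fist-S"
    location: str     # e.g. "chin", "chest", "neutral-space"
    movement: str     # e.g. "tap", "circle", "arc"
    orientation: str  # e.g. "palm-in", "palm-down"

# Two invented example signs that reuse the same components differently.
sign_a = Sign("flat-B", "chin", "tap", "palm-in")
sign_b = Sign("flat-B", "chest", "tap", "palm-in")

def differing_parameters(x: Sign, y: Sign) -> list[str]:
    # A compositional model only has to learn each parameter value once;
    # telling two signs apart reduces to comparing their parameters.
    return [f for f in ("handshape", "location", "movement", "orientation")
            if getattr(x, f) != getattr(y, f)]

print(differing_parameters(sign_a, sign_b))  # only "location" differs
```

A template-memorising model would treat `sign_a` and `sign_b` as two unrelated patterns; a compositional one sees a single point of difference.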

The researchers' central claim: the vocabulary scaling bottleneck is fundamentally a representation learning problem, solvable through compositional inductive biases that mirror linguistic structure.

How PHONSSM Encodes Linguistic Structure

PHONSSM — the name combines "phonological" with "SSM", short for State Space Model — enforces this decomposition through three mechanisms working together. First, it uses anatomically grounded graph attention, meaning the model's internal structure reflects the physical structure of the human hand and body rather than treating all skeleton joints as equivalent. Second, it factorises its internal representations into orthogonal subspaces, each dedicated to one phonological parameter. This prevents information about handshape from bleeding into information about location, and vice versa. Third, it uses prototypical classification, a technique from few-shot learning that represents each category as a point in a shared feature space rather than learning a separate classifier for every sign.
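Of the three mechanisms, prototypical classification is the simplest to illustrate. The sketch below is a generic nearest-prototype classifier in NumPy — made-up embeddings stand in for the model's learned sign representations, and this is not the paper's implementation:

```python
import numpy as np

def prototypes(embeddings: np.ndarray, labels: np.ndarray) -> dict:
    # One prototype per class: the mean of that class's support embeddings.
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(query: np.ndarray, protos: dict) -> int:
    # Assign the query to the class whose prototype is nearest in feature space.
    return min(protos, key=lambda c: np.linalg.norm(query - protos[c]))

# Invented 4-D embeddings for two classes, a few support examples each.
support = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.2, -0.1, 0.1, 0.0],
    [5.0, 5.1, 4.9, 5.0],
    [4.8, 5.0, 5.2, 5.0],
])
labels = np.array([0, 0, 1, 1])

protos = prototypes(support, labels)
query = np.array([0.1, 0.0, 0.0, 0.1])
print(classify(query, protos))  # nearest to class 0's prototype
```

Because adding a new sign only requires computing a new prototype — not training a new classifier head — this scheme suits large and growing vocabularies, which matches the paper's few-shot motivation.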

Critically, the model operates on skeleton data alone — structured representations of body pose extracted from video — rather than raw RGB video frames. This makes it computationally leaner and more privacy-preserving than video-based systems, yet it surpasses most RGB-based methods on the WLASL2000 benchmark, according to the paper. These benchmark results are self-reported by the researchers.

Performance Gains, Especially at the Margins

The headline result — +18.4 percentage points over the previous skeleton-based state of the art on WLASL2000 — is substantial, but arguably not the most striking finding. The gains are most dramatic in the few-shot regime, where the model must recognise signs for which it has seen only a handful of training examples. The researchers report a +225% relative improvement in this setting, which reflects the core claim: a system that understands phonological composition needs far fewer examples of any given sign because it can decompose the sign into components it has already learned.

The model also transfers zero-shot to ASL Citizen, a separate ASL dataset it was not trained on, exceeding supervised RGB baselines in that setting. Zero-shot transfer — where a model performs a new task without any task-specific training examples — is a meaningful test of whether representations are genuinely generalisable rather than overfitted to a specific benchmark.

State Space Models as the Architectural Choice

The choice of State Space Models as the underlying architecture is itself notable. SSMs, including the Mamba family of models that gained attention in 2023 and 2024, are designed to model sequential data efficiently. They have been proposed as alternatives to transformers for tasks involving long sequences, offering lower computational cost without sacrificing the ability to capture long-range dependencies. For sign language, where the temporal evolution of a gesture is as important as any single frame, this is a natural fit.
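At its core, an SSM processes a sequence through a simple linear recurrence over a hidden state. The minimal illustration below uses toy matrices, not the paper's learned parameters — real SSM architectures such as Mamba add input-dependent gating and efficient parallel scans on top of this recurrence:

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    # Discrete linear state space recurrence:
    #   h_t = A h_{t-1} + B x_t
    #   y_t = C h_t
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        h = A @ h + B @ x
        ys.append(C @ h)
    return np.array(ys)

# Toy parameters: 2-D hidden state, 1-D input and output.
A = np.array([[0.9, 0.0],
              [0.1, 0.8]])   # state transition (decaying memory)
B = np.array([[1.0],
              [0.0]])        # input projection
C = np.array([[0.5, 0.5]])   # output projection

# An impulse at t=0 followed by silence: the state carries the memory forward.
xs = [np.array([1.0])] + [np.array([0.0])] * 4
ys = ssm_scan(A, B, C, xs)
print(ys.ravel())  # response decays smoothly over the sequence
```

The hidden state summarises everything seen so far at constant cost per step, which is why SSMs handle long sequences — like the frame-by-frame evolution of a sign — more cheaply than attention over all pairs of timesteps.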

The integration of SSMs with phonological inductive biases is what distinguishes this work from prior attempts to apply deep learning to sign recognition. The architectural choice is not incidental — SSMs allow the model to track how phonological parameters evolve across the duration of a sign, which is essential for distinguishing signs that share handshape or location but differ in movement.

What This Means

For the field of sign language technology, this research suggests that closing the gap between laboratory demonstrations and real-world, vocabulary-scale recognition systems may require embedding linguistic knowledge directly into model architecture — and that doing so can deliver substantial gains without requiring video input or large amounts of labelled data per sign.