Researchers have built an automated pipeline that annotates American Sign Language video at scale, releasing more than 300 hours of machine-generated labels and a new human-annotated benchmark to support AI sign language research.

The study, posted to arXiv in April 2025, targets a well-known bottleneck in the field: large, high-quality ASL video datasets exist, but professional annotation is expensive enough that most of the footage goes unused. Datasets such as ASL STEM Wiki and FLEURS-ASL contain hundreds of hours recorded by professional interpreters, yet remain only partially labelled, limiting their usefulness for training and evaluating models.

Why Annotating Sign Language Video Is So Hard

Unlike speech, where automatic transcription tools are mature and widely deployed, sign language lacks an equivalent shortcut. Annotations must capture glosses (written labels for the individual signs in a signed utterance), fingerspelled words (where signers spell out English words letter by letter), and sign classifiers (handshapes that represent categories of objects or movement). Each requires specialist knowledge to label correctly, and the labour costs at dataset scale are prohibitive.

The new pipeline sidesteps the need for exhaustive human effort. It takes signed video and English text as input and combines sparse predictions from two underlying models — a fingerspelling recognizer and an isolated sign recognizer (ISR) — with a K-Shot LLM approach, in which a large language model prompted with a small number of examples estimates plausible annotation sequences. The system outputs a ranked list of candidate annotations — glosses, fingerspelled words, and sign classifiers — with associated time intervals, which can then be reviewed or used directly as pseudo-labels for model training.
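The paper does not reproduce its implementation here, but the shape of the pipeline's output can be illustrated with a minimal sketch. The `Annotation` type, the `fuse_candidates` helper, and every label and score below are hypothetical, assuming each recognizer emits timestamped candidates that are then merged and ranked by confidence:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    label: str    # gloss, fingerspelled word, or classifier
    kind: str     # "gloss" | "fingerspelling" | "classifier"
    start: float  # interval start, in seconds
    end: float    # interval end, in seconds
    score: float  # model confidence in [0, 1]

def fuse_candidates(fs_preds, isr_preds, top_k=5):
    """Merge sparse predictions from both recognizers and
    return the top-k candidates ranked by confidence."""
    merged = sorted(fs_preds + isr_preds, key=lambda a: a.score, reverse=True)
    return merged[:top_k]

# Hypothetical sparse predictions from the two recognizers.
fs = [Annotation("D-N-A", "fingerspelling", 1.2, 2.0, 0.91)]
isr = [Annotation("SCIENCE", "gloss", 0.1, 0.6, 0.84),
       Annotation("CELL", "gloss", 2.3, 2.9, 0.67)]

ranked = fuse_candidates(fs, isr)
print([a.label for a in ranked])  # highest-confidence candidates first
```

In the real system the ranking step is where the K-Shot LLM contributes, re-scoring candidate sequences against the English text rather than relying on recognizer confidence alone.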

Baseline Models Achieve High Performance on Two Benchmarks

In building the pipeline, the researchers also developed baseline models for each recognition sub-task. According to the paper, their fingerspelling model achieves a 6.7% character error rate (CER) on the FSBoard benchmark, and their isolated sign recognizer reaches 74% top-1 accuracy on the ASL Citizen dataset. These results are self-reported by the research team and have not been independently verified.
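Character error rate, the fingerspelling metric reported above, is the standard edit-distance measure: the minimum number of character substitutions, insertions, and deletions needed to turn the prediction into the reference, divided by the reference length. A minimal implementation:

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = (substitutions + insertions + deletions) / len(reference),
    computed with the standard Levenshtein dynamic program."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / m

# One wrong letter in a 10-character fingerspelled word -> CER of 0.1.
print(character_error_rate("CHEMISTRYX", "CHEMISTRYZ"))
```

A 6.7% CER therefore means roughly one character error per fifteen fingerspelled characters.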

The FSBoard and ASL Citizen results matter beyond the pipeline itself. Fingerspelling recognition and isolated sign recognition are foundational capabilities; better baseline models in these areas directly benefit any downstream application, from real-time interpretation tools to educational technology.

A Human-Annotated Gold Standard for Validation

To guard against the risk that pseudo-annotations simply propagate errors at scale, the team commissioned a professional interpreter to manually annotate nearly 500 videos from ASL STEM Wiki. These sequence-level labels — covering glosses, classifiers, and fingerspelling — serve as a gold-standard benchmark against which the automated pipeline can be evaluated and calibrated.

Releasing this human-annotated set alongside the machine-generated labels is a meaningful methodological choice. It gives other researchers a way to measure how far pseudo-annotation quality falls short of expert human labelling, and allows the community to track improvement over time as pipeline methods advance.
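The paper's exact evaluation protocol is not detailed here, but a common way to score pseudo-annotations against a gold standard is to count a prediction as correct when its label matches a gold label and their time intervals overlap sufficiently (temporal intersection-over-union). A sketch under that assumption, with illustrative data:

```python
def interval_iou(a, b):
    """Intersection-over-union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def precision_at_iou(predicted, gold, threshold=0.5):
    """Fraction of predictions that match a gold label with IoU >= threshold.
    Items are (label, start, end) tuples; each gold item matches at most once."""
    unmatched = list(gold)
    hits = 0
    for label, start, end in predicted:
        for g in unmatched:
            if g[0] == label and interval_iou((start, end), (g[1], g[2])) >= threshold:
                unmatched.remove(g)
                hits += 1
                break
    return hits / len(predicted) if predicted else 0.0

gold = [("SCIENCE", 0.0, 0.5), ("D-N-A", 1.0, 2.0)]
pred = [("SCIENCE", 0.1, 0.6), ("CELL", 2.3, 2.9)]
print(precision_at_iou(pred, gold))  # 0.5: one of two predictions matches
```

The same matching yields recall by dividing hits by the number of gold labels, which is how a released gold set lets the community track pipeline quality over time.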

The combined release — human annotations plus over 300 hours of pseudo-labels — is being made available as supplemental material to the paper, according to the authors.

Implications for a Data-Starved Field

Sign language AI has lagged behind speech and text processing partly because the data problem is harder. Video is bandwidth-heavy, annotation requires rare expertise, and sign languages are not monolithic — American Sign Language differs structurally from British, Auslan, and dozens of other national sign languages, each of which needs its own data infrastructure.

Pseudo-annotation pipelines represent one practical route forward. If machine-generated labels are accurate enough — or if their errors are well-characterised — they can multiply the effective size of training sets without proportional increases in cost. The approach mirrors techniques that have been productive in low-resource spoken language processing, where pseudo-labelled audio has enabled meaningful progress even when transcribed data is scarce.
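In practice, "accurate enough" usually means keeping only pseudo-labels whose confidence clears a threshold, a filtering step common in low-resource speech pipelines. The threshold value and the `score` field below are illustrative assumptions, not details from the paper:

```python
def filter_pseudo_labels(candidates, min_score=0.8):
    """Keep only pseudo-labels whose confidence clears a threshold,
    trading effective dataset size for label quality."""
    return [c for c in candidates if c["score"] >= min_score]

candidates = [
    {"label": "SCIENCE", "score": 0.92},
    {"label": "CELL", "score": 0.41},  # likely noise; dropped
    {"label": "D-N-A", "score": 0.87},
]
print([c["label"] for c in filter_pseudo_labels(candidates)])
```

Sweeping `min_score` against a gold-standard set is one way to characterise the error profile before committing to large-scale training.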

The K-Shot LLM component is particularly worth noting. Large language models carry knowledge about English vocabulary and syntax that can inform plausible gloss sequences when video evidence alone is ambiguous. Whether this cross-modal transfer holds up robustly across the diversity of signing styles, speeds, and domains in the wild remains an open empirical question — one the released benchmark is now better positioned to help answer.
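The paper's prompt format is not reproduced here, but K-shot prompting generally works by prepending a handful of worked examples to the query the model should complete. The instruction text and example (English, gloss) pairs below are hypothetical:

```python
def build_k_shot_prompt(examples, query_text):
    """Assemble a few-shot prompt: k worked (English -> gloss) examples
    followed by the query sentence the LLM should annotate."""
    parts = ["Convert each English sentence to a plausible ASL gloss sequence."]
    for english, glosses in examples:
        parts.append(f"English: {english}\nGlosses: {' '.join(glosses)}")
    parts.append(f"English: {query_text}\nGlosses:")
    return "\n\n".join(parts)

# Hypothetical example pairs a pipeline might supply as the k shots.
examples = [
    ("The cell divides.", ["CELL", "DIVIDE"]),
    ("DNA stores information.", ["fs-DNA", "STORE", "INFORMATION"]),
]
prompt = build_k_shot_prompt(examples, "Proteins fold into shapes.")
print(prompt)
```

The LLM's completion then proposes a gloss sequence for the query, which the pipeline can reconcile against the recognizers' timestamped evidence.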

What This Means

For researchers and developers working on sign language AI, this release substantially lowers the barrier to training and benchmarking new models — the combination of a scalable annotation pipeline, strong baseline recognizers, and a professional gold-standard dataset addresses three bottlenecks at once.