A new synthetic training pipeline enables vision-language models as small as 24–32 billion parameters to match or exceed models seven times their size on long-document understanding benchmarks, by teaching them to reason through visual documents rather than simply retrieve from them.
The research, published on arXiv in late April 2025, targets a capability gap in multimodal AI: while reasoning has driven major performance gains in mathematics and code, it has not been systematically applied to visual document understanding — the task of answering questions about lengthy PDFs, scanned reports, legal filings, and scientific papers that may span dozens or hundreds of pages.
Why Long-Document Reasoning Has Lagged
Most vision-language models handle long documents through retrieval or direct attention over page images, without an explicit reasoning step. The challenge is that generating reasoning traces for visual documents is difficult — there is no natural equivalent of the step-by-step working that makes math reasoning data easy to produce and verify.
The researchers address this with a synthetic data pipeline that constructs reasoning traces automatically. For each document-question pair, the pipeline scores every page for relevance to the question, extracts textual evidence from the most relevant pages, and orders that evidence from most to least pertinent before generating a final answer. These structured traces are then used for supervised fine-tuning (SFT), wrapped inside special <think> tags and gated by a control token.
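The trace-construction step can be sketched as a small routine. This is an illustrative reconstruction, not the paper's code: `score_fn` and `extract_fn` stand in for whatever relevance-scoring and evidence-extraction models the pipeline actually uses, and the token strings are placeholders.

```python
def build_trace(pages, question, score_fn, extract_fn, top_k=3):
    """Assemble a synthetic reasoning trace for one document-question pair:
    score every page, keep the most relevant ones, order their evidence
    from most to least pertinent, and wrap the result in think tags
    behind a control token."""
    # Score every page for relevance to the question.
    scored = [(score_fn(page, question), i, page) for i, page in enumerate(pages)]
    # Keep the top-k pages, most relevant first.
    scored.sort(key=lambda t: t[0], reverse=True)
    evidence = [extract_fn(page, question) for _, _, page in scored[:top_k]]
    # Gate the trace with a control token so reasoning mode can be toggled.
    body = "\n".join(f"[page evidence] {e}" for e in evidence)
    return f"<reason_on>\n<think>\n{body}\n</think>"
```

In a real pipeline the scorer and extractor would themselves be model calls; here they are plain functions so the structure of the trace is visible.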
Internalized reasoning cuts mean output tokens by a factor of 12.4 relative to explicit reasoning, slashing inference cost without sacrificing accuracy.
Internalization: The Key Architectural Choice
The most technically distinctive element of the approach is how the reasoning capability is embedded into the model. Rather than requiring the model to output its full chain of thought at inference time — which is slow and token-expensive — the researchers use low-strength model merging to internalize the reasoning. The result is a model that reasons internally but does not narrate that process step by step in its output.
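The article does not spell out the exact merge formula, but low-strength model merging is commonly a weight-space interpolation between the base and fine-tuned checkpoints with a small mixing coefficient. A minimal sketch of that general idea, with an illustrative `alpha`:

```python
def merge_low_strength(base_weights, tuned_weights, alpha=0.1):
    """Linearly interpolate each parameter toward the reasoning-tuned
    checkpoint. A small alpha folds the new behaviour in gently,
    leaving the base model's outputs largely intact."""
    return {
        name: (1.0 - alpha) * base_weights[name] + alpha * tuned_weights[name]
        for name in base_weights
    }
```

Real checkpoints hold tensors rather than scalars, but the operation is elementwise, so the same expression applies per parameter.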
This distinction matters for practical deployment. Explicit reasoning models can generate thousands of tokens of visible working before producing an answer, which increases latency and API costs. According to the paper, internalized reasoning produces 12.4× fewer mean output tokens than explicit reasoning on the same tasks, while maintaining accuracy gains.
Benchmark Results: Matching a Model Seven Times Larger
The team applied the pipeline to two base models: Qwen3 VL 32B and Mistral Small 3.1 24B. Both are mid-sized vision-language models within the range that organisations can run on-premise or via standard cloud inference.
With Qwen3 VL 32B, the pipeline achieves a score of 58.3 on MMLongBench-Doc, a benchmark specifically designed for long visual document understanding. This edges out Qwen3 VL 235B A22B, a mixture-of-experts model with 235 billion total parameters, which scores 57.0. The result suggests that training methodology can substitute for raw scale on this class of tasks.
The Mistral results add a separate finding: the synthetic reasoning pipeline outperforms distillation from traces produced by the model's Thinking variant by 3.8 points on MMLBD-C. This is notable because distillation from a capable teacher model is a common and often effective strategy. The fact that synthetic traces, generated without a large teacher, outperform distillation suggests the pipeline is capturing something structurally useful about document reasoning, not merely mimicking a stronger model's outputs.
What the Pipeline Actually Produces
The pipeline's design reflects a specific theory about what makes document understanding difficult: the model needs to know where to look before it can reason about what it finds. By explicitly scoring pages for relevance and ordering evidence, the synthetic traces teach the model a search-then-reason pattern rather than a read-everything-and-summarise pattern.
The researchers fine-tuned on these traces with standard SFT, wrapping the reasoning in <think> tags controlled by a gating token — meaning the reasoning mode can be toggled. The low-strength merging step then folds the learned reasoning behaviour into the base model weights without requiring separate inference infrastructure.
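As a rough sketch of what that toggle could look like at inference time — the token strings here are placeholders, not the paper's actual vocabulary:

```python
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
REASON_TOKEN = "<reason_on>"

def build_prompt(question, reasoning=True):
    """Prepend the gating token only when explicit reasoning is wanted;
    omitting it requests a direct answer."""
    prefix = REASON_TOKEN + "\n" if reasoning else ""
    return f"{prefix}{question}"

def strip_think(output):
    """Remove any visible think span so only the final answer is returned."""
    start = output.find(THINK_OPEN)
    end = output.find(THINK_CLOSE)
    if start != -1 and end != -1:
        return (output[:start] + output[end + len(THINK_CLOSE):]).strip()
    return output.strip()
```

With merging applied, the model would rarely emit the think span at all; the stripping step is a fallback for the explicit-reasoning mode.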
The pipeline and training code are publicly released for reproducibility, enabling the research community to apply the approach to other base models and document domains.
Applications Across Enterprise and Research
The paper targets enterprise, legal, and scientific use cases — domains where documents are long, structured, and high-stakes. Contract review, regulatory compliance, clinical record analysis, and patent search all involve the kind of multi-page visual document understanding that the benchmark measures.
Current approaches in these domains typically rely on chunking documents and running retrieval-augmented generation, which can miss reasoning that requires synthesising evidence across non-adjacent pages. A model that internalises a page-scoring and evidence-ordering process could handle such cross-document reasoning more robustly.
The efficiency gains are also commercially significant. A 12.4× reduction in output tokens translates directly to lower inference costs and faster response times — both critical for production document processing pipelines that handle thousands of documents per day.
What This Means
Organisations building document AI systems can now access a publicly released pipeline that enables mid-sized vision-language models to match or exceed models several times larger, at a fraction of the inference cost — making high-quality long-document reasoning viable without frontier-scale compute.