A new study from researchers posting on arXiv shows that 10 yes/no questions, exactly 10 bits of information, can recover much of the performance difference between a weak and a powerful AI model on standard benchmarks, at compression ratios more than 100 times smaller than the previous state of the art.
The paper, posted to arXiv (cs.LG) in April 2025, explores how efficiently large language model (LLM) outputs can be compressed, covering both lossless compression, where the original text is recovered exactly, and lossy compression, where some information is sacrificed to achieve much smaller data sizes. The headline result is a novel interactive protocol the authors call Question-Asking (QA) compression, inspired by the classic parlour game Twenty Questions.
How Question-Asking Compression Works
The core idea is straightforward. A smaller, less capable model iteratively asks yes/no questions to a larger, stronger model. Each answer transfers exactly one bit of information. After just 10 questions, the small model refines its response using only those 10 bits — no full text, no lengthy explanation, no model weights transferred.
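The information-theoretic intuition is the same as in Twenty Questions: each yes/no answer can halve the space of possibilities, so 10 answers can single out one of 2^10 = 1024 options. The toy sketch below illustrates that mechanism with a binary-search oracle; the paper's actual protocol involves natural-language questions between two LLMs, which this stand-in does not capture.

```python
# Toy illustration of the Question-Asking idea: 10 yes/no answers
# (10 bits) suffice to single out one of 2**10 = 1024 candidates.
# The oracle plays the large model, which knows the target answer;
# qa_transfer plays the small model, halving its candidate set per bit.

def qa_transfer(target, candidates, n_questions=10):
    """Identify `target` among sorted `candidates` via yes/no answers."""
    lo, hi = 0, len(candidates)
    for _ in range(n_questions):
        if hi - lo <= 1:
            break  # only one candidate left; no more questions needed
        mid = (lo + hi) // 2
        # One-bit question to the oracle: "is the answer >= candidates[mid]?"
        answer_is_yes = target >= candidates[mid]
        if answer_is_yes:
            lo = mid
        else:
            hi = mid
    return candidates[lo]

candidates = list(range(1024))
assert qa_transfer(777, candidates) == 777  # 10 bits pin down 1 of 1024
```

In the real protocol the candidate space is implicit in the small model's output distribution rather than an explicit list, but the bit accounting is the same: each answer contributes at most one bit.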
Across 8 standard benchmarks spanning mathematics, science, and code, 10 binary questions recovered 23% to 72% of the capability gap between the small and the large model. On harder benchmarks, the recovery was 7% to 38%. The compression ratios achieved, between 0.0006 and 0.004, represent the fraction of the original data size needed to transmit this knowledge.
Ten bits of structured, question-driven information can transfer a significant fraction of what separates a weak AI model from a strong one.
To put that in concrete terms: if a large model's response runs to several hundred bytes, QA compression transmits the equivalent capability boost in just over one byte (10 bits) of information. The authors report this is more than 100 times smaller than the compression ratios achieved by the leading prior method from Deletang et al. (2024).
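The quoted ratios are easy to sanity-check. A minimal back-of-the-envelope calculation, with response sizes chosen as illustrative assumptions rather than figures from the paper:

```python
# Back-of-the-envelope check of the reported ratios (0.0006 to 0.004):
# 10 bits relative to an assumed full-response size. The byte counts
# below are illustrative guesses, not figures from the paper.

BITS_TRANSFERRED = 10

def compression_ratio(response_bytes):
    """Fraction of the original size that the 10-bit message represents."""
    return BITS_TRANSFERRED / (response_bytes * 8)

ratio_long = compression_ratio(2000)   # ~2 KB response: 10/16000 ≈ 0.0006
ratio_short = compression_ratio(300)   # short answer:   10/2400  ≈ 0.004
assert ratio_long < ratio_short
```

Responses in roughly the 300-byte to 2 KB range reproduce the paper's reported 0.0006 to 0.004 band.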
What the Lossless Results Show
Beyond the headline QA protocol, the paper also advances lossless LLM compression — situations where the goal is to store or transmit AI-generated text and recover it perfectly. The researchers found that domain-adapted LoRA adapters (lightweight model fine-tuning modules) improve LLM-based arithmetic coding by 2x over using the base model alone.
Arithmetic coding is a well-established compression technique that works by assigning shorter codes to more probable outputs. When an LLM is used to predict what comes next in a text, it can guide arithmetic coding — the better the LLM's predictions, the tighter the compression. Domain adaptation sharpens those predictions significantly.
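The interval-narrowing mechanism can be sketched compactly. The toy encoder below uses a fixed symbol distribution in place of real LLM next-token predictions, and skips the bit-stream output of a production coder, but it shows the key property: the higher the probability the model assigns to the actual text, the wider the final interval and the fewer bits needed to name it.

```python
# Toy sketch of the interval narrowing behind arithmetic coding. A real
# LLM-based coder would refresh `probs` from the model at every step and
# emit an actual bit stream; this version only tracks the interval.
from math import ceil, log2

def encode_interval(symbols, probs):
    """Shrink [0, 1) once per symbol; sharper predictions shrink it less."""
    low, width = 0.0, 1.0
    for s in symbols:
        keys = sorted(probs)  # fixed symbol order for cumulative sums
        cum = sum(probs[k] for k in keys[:keys.index(s)])
        low += cum * width
        width *= probs[s]
    # Any number in [low, low + width) identifies the sequence; naming
    # one takes roughly -log2(width) bits, so a wider interval is cheaper.
    return low, ceil(-log2(width)) + 1

# A sharper model gives the true text higher probability, so the same
# string costs fewer bits than under a flat 50/50 model:
_, bits_sharp = encode_interval("aaaaab", {"a": 0.8, "b": 0.2})
_, bits_flat = encode_interval("aaaaab", {"a": 0.5, "b": 0.5})
assert bits_sharp < bits_flat
```

This is exactly why the LoRA result matters: domain adaptation sharpens the next-token distribution, which widens the interval for in-domain text and tightens the code.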
For lossy compression, the researchers also tested prompting a model to produce a shorter rewrite of its own output before applying arithmetic coding. This approach achieves compression ratios of approximately 0.03 — itself a 2x improvement over compressing the original, unedited response.
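As a runnable illustration of the rewrite-then-compress idea, the sketch below substitutes a hand-written summary for the LLM rewrite and zlib for LLM-guided arithmetic coding; only the pipeline shape, not the numbers, reflects the paper.

```python
# Runnable stand-in for the rewrite-then-compress pipeline: a hand-written
# summary replaces the LLM rewrite, and zlib replaces LLM-guided arithmetic
# coding. Ratios are measured against the ORIGINAL response length, the
# same convention as the paper's ~0.03 figure.
import zlib

original = ("The answer is 42 because, after seven and a half million years "
            "of computation on the question of life, the universe, and "
            "everything, Deep Thought arrived at that value.")
rewrite = "Answer: 42 (per Deep Thought)."  # stand-in for an LLM rewrite

ratio_original = len(zlib.compress(original.encode())) / len(original.encode())
ratio_rewrite = len(zlib.compress(rewrite.encode())) / len(original.encode())
assert ratio_rewrite < ratio_original  # the lossy rewrite pays off
```

The rewrite is lossy by construction: detail is discarded before encoding, trading fidelity for a smaller payload.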
The Compression-Compute Trade-Off
A key conceptual contribution of the paper is formalizing what the authors call a compression-compute frontier: the more compute you are willing to spend, the more you can compress. This mirrors a well-known trade-off in traditional data compression, but applied here to AI-generated content and AI-assisted transmission.
The QA protocol sits at the extreme high-compression, high-compute end of this frontier. Asking 10 questions requires running inference on both a small and a large model, which is computationally expensive relative to simply sending a short text response. Whether this trade-off makes practical sense depends on the deployment context — for example, extremely low-bandwidth communication channels, or scenarios where running a large model locally is impossible.
It is worth noting that, as is standard for arXiv preprints, the benchmark results in this paper are self-reported by the authors and have not yet undergone formal peer review.
Why This Research Matters for AI Deployment
The implications extend beyond academic compression benchmarks. The finding that a small model can rapidly close much of its gap with a large model through structured binary interaction challenges existing assumptions about how AI knowledge must be transferred. Currently, the dominant approaches are either distillation — training a smaller model on outputs from a larger one — or simply running the large model directly, which requires significant compute and memory.
QA compression suggests a third path: interactive, bit-efficient knowledge transfer at inference time, without any training. The small model doesn't need to be retrained; it simply queries the large model with precise yes/no questions and updates its response accordingly.
This could have practical relevance for edge AI deployment, where a lightweight model runs on a device but can query a remote, more powerful model under strict data limits. It also raises interesting questions about the theoretical limits of how much capability can be transferred per bit, and whether questions beyond the first 10 would hit diminishing returns.
The gap between the standard benchmark results (23%–72% recovery) and harder benchmark results (7%–38% recovery) also highlights an important caveat: the technique works better when tasks have cleaner, more binary-compatible answer structures. Open-ended or highly nuanced tasks appear harder to close via yes/no questioning alone.
What This Means
This research reframes AI capability transfer as a communication problem with a measurable efficiency frontier — and suggests that structured, interactive protocols can move substantial intelligence per bit of transmitted information.