Cloudflare has published details of Unweight, a lossless compression system the company says reduces large language model weights by 15–22% on Llama-3.1-8B while preserving bit-exact outputs and requiring no specialized hardware. According to Cloudflare, the technique works by compressing the exponent bytes of BF16 weights using Huffman coding and decompressing them inside GPU on-chip memory before feeding the results directly to tensor cores.
The company says it is releasing a technical paper alongside open-source GPU kernels on GitHub under its cloudflareresearch/unweight-kernels repository.
The memory bandwidth problem Cloudflare says it is targeting
In the blog post, Cloudflare frames Unweight as a response to memory bandwidth constraints on the NVIDIA H100 GPUs it uses across its inference fleet. The company states that H100 tensor cores can process data nearly 600 times faster than GPU memory can deliver it, making memory bandwidth — not compute — the bottleneck during token generation.
Cloudflare says Unweight follows two earlier efficiency projects on its stack: Infire, a Rust-based inference engine, and Omni, a model scheduling platform the company previously described as eliminating cold starts.
According to the post, the core challenge is not compression itself: exponent bytes in BF16 weights are highly redundant, so entropy coding works well on them. The challenge is decompressing fast enough that it doesn't slow down inference.
According to Cloudflare, each GPU compute unit can run a decompression kernel or a matrix multiplication kernel, but not both at once, because the two compete for shared memory. The company says any decode latency that is not overlapped with the matrix multiplication adds directly to token latency.
How Cloudflare describes the compression approach
Cloudflare's post explains that BF16 weights consist of a 1-bit sign, an 8-bit exponent, and a 7-bit mantissa. The company says the sign and mantissa resemble random data and cannot be meaningfully compressed, but the exponent distribution is heavily skewed.
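The BF16 layout described above can be illustrated with a short sketch. This is not Cloudflare's code; it simply truncates a float32 to its top 16 bits (which is what BF16 is) and splits out the three fields:

```python
import struct

def bf16_fields(value: float) -> tuple[int, int, int]:
    """Split a float's BF16 representation into (sign, exponent, mantissa).

    BF16 is the top 16 bits of an IEEE-754 float32:
    1 sign bit, 8 exponent bits, 7 mantissa bits.
    """
    # Reinterpret the float32 bits, then keep the upper 16 (BF16 truncation).
    bits32 = struct.unpack(">I", struct.pack(">f", value))[0]
    bf16 = bits32 >> 16
    sign = bf16 >> 15
    exponent = (bf16 >> 7) & 0xFF
    mantissa = bf16 & 0x7F
    return sign, exponent, mantissa

print(bf16_fields(1.0))  # → (0, 127, 0)
```

Running this over a trained layer's weights is how one would observe the skew the post describes: the mantissas look uniform, while the exponents cluster in a narrow band.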
Citing prior research, Cloudflare states that out of 256 possible exponent values, the top 16 most common exponents cover more than 99% of all weights in a typical LLM layer, and that information theory indicates roughly 2.6 bits are sufficient to represent the distribution. Unweight applies Huffman coding to the exponent stream only, which Cloudflare says yields roughly 30% compression on that stream.
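The entropy argument can be demonstrated on a synthetic histogram. The sketch below (our illustration, with made-up counts, not Cloudflare's data or code) builds Huffman code lengths for a skewed exponent distribution and compares the average code length against the Shannon entropy; Huffman coding is guaranteed to land within one bit of entropy per symbol:

```python
import heapq
import math
from itertools import count

def huffman_code_lengths(freqs: dict[int, int]) -> dict[int, int]:
    """Return the Huffman code length (in bits) for each symbol."""
    tiebreak = count()  # unique tag so the heap never compares dicts
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical exponent histogram: a handful of values dominate.
counts = {124: 4000, 123: 2500, 125: 1500, 122: 900, 126: 500,
          121: 300, 127: 150, 120: 100, 119: 30, 128: 20}
total = sum(counts.values())
entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
lengths = huffman_code_lengths(counts)
avg_bits = sum(counts[s] * lengths[s] for s in counts) / total
print(f"entropy = {entropy:.2f} bits, Huffman average = {avg_bits:.2f} bits")
```

With a distribution this skewed, both figures come out far below the 8 bits an uncompressed exponent byte occupies, which is the effect Unweight exploits.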
The company says it applies this selectively to MLP weight matrices — the gate, up, and down projections — which it states make up roughly two-thirds of a model's parameters and account for most memory traffic during token generation. Attention weights, embeddings, and layer norms are left uncompressed, according to the post.
Row-level handling of rare exponents
For weights whose exponents fall outside the top-16 palette, Cloudflare says it stores the entire row of 64 weights verbatim rather than handling exceptions per element. The company says this design choice eliminates per-element branching in the hot path, trading a small amount of compression ratio for simpler kernel logic.
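The row-level escape can be sketched as follows. This is a hypothetical illustration of the scheme the post describes, not Cloudflare's format: the decision is made once per 64-element row, so the decoder never branches per element.

```python
ROW = 64  # row granularity described in the post

def encode_rows(exponents: list[int], palette: set[int]):
    """Tag each 64-exponent row as 'coded' or 'verbatim'.

    A row is entropy-coded only if every exponent is in the 16-entry
    palette; a single rare exponent forces the whole row to raw storage.
    """
    rows = []
    for i in range(0, len(exponents), ROW):
        row = exponents[i:i + ROW]
        if all(e in palette for e in row):
            rows.append(("coded", row))     # would be Huffman-coded in practice
        else:
            rows.append(("verbatim", row))  # raw bytes, rare-exponent escape
    return rows

palette = set(range(118, 134))          # hypothetical top-16 exponent values
exps = [120] * 64 + [121] * 63 + [200]  # second row contains one rare exponent
print([tag for tag, _ in encode_rows(exps, palette)])  # → ['coded', 'verbatim']
```

Because rare exponents are rare by construction, giving up compression on an occasional full row costs little, while the decoder's hot path stays branch-free.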
Cloudflare states that a runtime component selects from multiple execution strategies depending on the workload, with an autotuner choosing a strategy per weight matrix and batch size. Some strategies prioritize simplicity, while others minimize memory traffic, according to the post.
Reported results on Llama-3.1-8B
Cloudflare reports that initial results on Llama-3.1-8B show roughly 30% compression of MLP weights alone, which translates to a 15–22% reduction in overall model size and approximately 3 GB of VRAM savings. The company says this allows it to fit more models on a single GPU.
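The reported figures are roughly self-consistent, which a back-of-the-envelope check shows. The arithmetic below is ours, using the numbers from the post, not a calculation Cloudflare publishes:

```python
# Figures from the post; the arithmetic is an illustrative cross-check.
mlp_fraction = 2 / 3        # MLP share of parameters, per the post
mlp_compression = 0.30      # reported compression of MLP weights
overall = mlp_fraction * mlp_compression

model_gb = 8e9 * 2 / 1e9    # Llama-3.1-8B in BF16: 8B params × 2 bytes
print(f"overall reduction ≈ {overall:.0%}")           # ≈ 20%, inside 15–22%
print(f"VRAM savings ≈ {model_gb * overall:.1f} GB")  # ≈ 3 GB
```

Compressing only two-thirds of the parameters by 30% lands at about a 20% overall reduction, squarely inside the reported 15–22% range, and roughly 3 GB of the model's 16 GB footprint.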
Cloudflare contrasts Unweight with quantization, which it describes as a lossy technique that converts 32- or 16-bit floating point values to 8- or 4-bit integers and can affect output quality in ways the company says are unpredictable. Cloudflare says it wanted a lossless approach for production inference serving diverse use cases.
How Cloudflare positions Unweight against prior work
The post references three prior systems and explains why Cloudflare says they did not meet its requirements. According to Cloudflare, ZipNN compresses weights for storage and distribution with CPU-side decompression. The company says Huff-LLM proposes custom FPGA hardware for decoding. Cloudflare states that ZipServ fuses decompression with GPU inference but targets consumer-grade GPUs rather than the H100 hardware Cloudflare operates.
Cloudflare says none of these systems provided lossless inference-time decompression on Hopper-class GPUs integrated with its Rust-based inference engine, which is what prompted the Unweight project.
Availability
Cloudflare has published the GPU kernels as open source on GitHub and linked a technical report hosted at research.cloudflare.com/nikulin2026. The blog post does not specify pricing changes for Cloudflare's inference customers; the company frames the work as an internal efficiency improvement that it says makes inference cheaper and faster to run across its network.