Nvidia CEO Jensen Huang used this week's GTC conference in San Jose to unveil the Groq 3 language processing unit, a chip designed specifically for AI inference — and built on intellectual property Nvidia licensed from startup Groq for $20 billion on Christmas Eve last year.
The announcement arrived just two and a half months after that deal closed, a timeline that underscores how urgently Nvidia is pursuing the inference market. More than 30,000 attendees gathered at GTC, where Huang also revealed the broader Vera Rubin chip family, but it was the Groq 3 LPU that drew particular attention from the AI infrastructure community.
Why Inference Is Now Nvidia's Urgent Priority
For most of AI's recent history, the dominant computational challenge was training — feeding enormous datasets through neural networks over days or weeks. Inference, the process of running a trained model in response to a user query, was treated as the cheaper, simpler cousin. That calculus has changed. As AI applications scale to hundreds of millions of users, and as reasoning models run multiple inference passes before producing a single output, demand for low-latency inference has become a defining constraint.
"Now that AI is able to do productive work, the inflection point of inference has arrived," Huang told the crowd.
Unlike training, inference cannot be batched across days. It must respond to a user's query in real time, making latency — not raw throughput — the metric that matters most. This requirement shapes every architectural decision in a chip designed for inference work.
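The latency-versus-throughput distinction can be made concrete with a toy serving model. The function names and numbers below are illustrative assumptions, not vendor figures: batching raises aggregate throughput, but each user's next token still arrives only once per decode step, so interactive latency is governed by step time, not batch size.

```python
def aggregate_tokens_per_sec(batch_size: int, step_ms: float) -> float:
    """Total tokens produced per second across all batched requests."""
    return batch_size * 1000.0 / step_ms

def per_user_token_latency_ms(step_ms: float) -> float:
    """Each user waits one full decode step per token, regardless of batch size."""
    return step_ms

step_ms = 25.0  # hypothetical time for one decode step

# Batching 32 requests multiplies aggregate throughput 32x ...
assert aggregate_tokens_per_sec(32, step_ms) == 32 * aggregate_tokens_per_sec(1, step_ms)
# ... but does nothing for the latency any single user experiences.
assert per_user_token_latency_ms(step_ms) == 25.0
```

In real systems the tension is sharper still, because step time itself grows as the batch gets larger — which is exactly why inference hardware is judged on latency first.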
How the Groq 3 LPU Works
Groq's core architectural insight is straightforward: move memory onto the chip itself, and eliminate the back-and-forth that slows conventional GPU designs. Where Nvidia's Rubin GPU relies on 288 gigabytes of high-bandwidth memory (HBM) sitting adjacent to the processor, the Groq 3 LPU uses just 500 megabytes of on-chip SRAM. That sounds like a dramatic downgrade — until you look at the bandwidth numbers.
The Rubin GPU achieves 22 terabytes per second of memory bandwidth. The Groq 3 LPU delivers 150 terabytes per second — approximately seven times faster — because data flows directly through the SRAM in a linear, deterministic sequence rather than shuttling off-chip and back. "The data actually flows directly through the SRAM," said Mark Heaps, formerly chief technology evangelist at Groq and now director of developer marketing at Nvidia. "We don't have that [off-chip round-trip]. It all passes through in a linear order."
The trade-off is raw compute power. The Rubin GPU is capable of 50 petaFLOPS of 4-bit computation; the Groq 3 LPU delivers 1.2 petaFLOPS of 8-bit computation. For decode-heavy inference workloads, which are bound by memory bandwidth rather than arithmetic, that is a trade worth making.
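The figures quoted above make the trade easy to check with back-of-the-envelope roofline arithmetic. The sketch below uses only the article's numbers, plus the standard observation — an assumption here, not a claim from the article — that a memory-bound decode step performs roughly one multiply-accumulate, about two FLOPs, per weight byte at 8-bit precision:

```python
# Figures quoted in the article
rubin_bw  = 22e12    # bytes/s of HBM bandwidth
rubin_fp4 = 50e15    # FLOPS at 4-bit precision
lpu_bw    = 150e12   # bytes/s of on-chip SRAM bandwidth
lpu_int8  = 1.2e15   # FLOPS at 8-bit precision

# Bandwidth advantage: the article's "approximately seven times"
print(round(lpu_bw / rubin_bw, 1))   # 6.8

# Arithmetic intensity each chip can sustain (FLOPs per byte moved)
print(round(rubin_fp4 / rubin_bw))   # 2273
print(lpu_int8 / lpu_bw)             # 8.0

# Decode is essentially a matrix-vector product: ~2 FLOPs per weight
# byte (assumption). At that intensity, the Rubin GPU leaves most of
# its arithmetic idle waiting on memory, while the LPU is close to
# balanced — which is the whole point of the design.
```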
The Rise of Inference Disaggregation
Nvidia is not simply swapping GPUs for LPUs. Its new Groq 3 LPX compute tray pairs 8 Groq 3 LPUs with a single Vera Rubin unit — combining Rubin GPUs with a Vera CPU — and takes advantage of a technique called inference disaggregation. This approach splits inference into two distinct phases: prefill, which processes the incoming prompt and is computationally intensive but memory-bandwidth-light; and decode, which generates the output token by token and demands high memory bandwidth.
In the LPX tray, prefill and the compute-heavy portions of decode run on Vera Rubin, while the final decode stage runs on the Groq 3 LPU — each chip handling the phase its architecture is built for. "We're in volume production now," Huang confirmed.
Amazon Web Services announced a parallel approach the same week, pairing its Trainium AI accelerator with Cerebras Systems' CS-3 chip — built around the largest single chip ever manufactured — to exploit the same prefill/decode split. Cerebras addresses the memory bandwidth bottleneck by integrating 44 gigabytes of SRAM connected by a 21 petabytes-per-second internal network.
A Crowded Field Watches Nvidia Validate the Market
The Groq 3 announcement comes amid a period of significant growth in inference-focused chip startups. Companies including d-Matrix (digital in-memory compute), Etched (transformer-specific ASICs), RainAI (neuromorphic chips), EnCharge (analog in-memory compute), Tensordyne (logarithmic arithmetic), and FuriosaAI (tensor-optimised hardware) have each pursued distinct architectural approaches to the inference market.
Nvidia's move validates the space but also raises competitive pressure. "Nvidia's announcement validates the importance of SRAM-based architectures for large-scale inference, and no one has pushed SRAM density further than d-Matrix," said d-Matrix CEO Sid Sheth. Sheth argues that data centre customers will ultimately deploy a mix of silicon rather than a single winner-takes-all chip. "The winning systems will combine different types of silicon and fit easily into existing data centers alongside GPUs."
That view is consistent with Nvidia's own LPX tray design, which treats the LPU not as a replacement for the GPU but as a complement to it.
What This Means
Nvidia's $20 billion commitment to Groq architecture — and its rapid productization into a shipping system — confirms that the centre of gravity in AI computing is shifting from model training to real-world deployment at scale, and that low-latency inference hardware is now a strategic priority for the industry's dominant player.
