Nvidia Just Admitted Its GPUs Aren't Good Enough for AI Anymore
Nvidia is building a secret inference chip using Groq's LPU technology. It debuts at GTC on March 16. Here's why the GPU king is pivoting.
Nvidia is building a dedicated AI inference chip that integrates technology from Groq, the startup it acquired for $20 billion in December. The chip debuts at GTC in San Jose on March 16. It's the clearest signal yet that the company that built the AI boom on GPUs now believes GPUs alone can't sustain it.
The GPU Problem Nobody Talks About
Here's the uncomfortable math. Training an AI model costs billions of dollars — once. Running that model afterward (inference) costs 15-20x more over its lifetime. And inference now accounts for two-thirds of all AI computing, according to Deloitte's 2026 TMT Predictions report. Enterprise companies are spending 85% of their AI budgets on inference, not training.
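A back-of-the-envelope version of that math, where the $2 billion training figure is an assumed placeholder and only the 15-20x multiple comes from the numbers above:

```python
# Toy model of training cost vs. lifetime inference cost.
# Only the 15-20x multiple comes from the article; the training
# figure is an illustrative assumption.

training_cost = 2e9          # one-time training run, in dollars (assumed)
inference_multiple = 17.5    # midpoint of the 15-20x lifetime range

lifetime_inference = training_cost * inference_multiple
total = training_cost + lifetime_inference

print(f"Training (once):      ${training_cost / 1e9:.1f}B")
print(f"Inference (lifetime): ${lifetime_inference / 1e9:.1f}B")
print(f"Inference share of lifetime spend: {lifetime_inference / total:.0%}")
```

With those assumptions, inference ends up around 95% of lifetime spend, which is why enterprise budgets skew so heavily toward it.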
Nvidia's GPUs dominate training. They're also used for inference, but they're energy hogs. Running autonomous AI agents — systems that carry out tasks on your behalf, chaining together dozens of reasoning steps — burns through GPU compute at a rate that makes CFOs nervous. The chips were designed to do everything. The market now needs chips that do one thing fast.
That's why OpenAI signed a $10 billion contract with Cerebras in January. That's why Google and Amazon built custom inference silicon. Nvidia's biggest customers were quietly shopping for alternatives.
Enter Groq
Groq's Language Processing Units work fundamentally differently from GPUs. Where a GPU shuttles model weights between its cores and off-chip memory for every token it generates, an LPU keeps the model in on-chip SRAM and streams data through in a single, deterministic pass. The result: 1,200+ tokens per second on Llama 4, and 3-5x lower latency than the best GPU solutions for single-stream tasks.
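To see why that architectural difference shows up directly in tokens per second, note that single-stream decode is typically bound by how fast the model's weights can reach the compute units. A back-of-the-envelope sketch, where the model size and bandwidth figures are illustrative assumptions rather than Nvidia or Groq specs:

```python
# Rough ceiling on single-stream decode speed: each generated token
# requires streaming the model's weights through the processor once,
# so speed is bounded by memory bandwidth. All figures are assumptions.

def decode_ceiling(params: float, bytes_per_param: float, bandwidth: float) -> float:
    """Max tokens/sec if one full weight read is needed per token."""
    return bandwidth / (params * bytes_per_param)

params = 70e9      # assumed 70B-parameter model, 8-bit weights below
hbm_bw = 3.3e12    # ~3.3 TB/s off-chip HBM (GPU-class, assumed)
sram_bw = 80e12    # ~80 TB/s on-chip SRAM (LPU-class, assumed)

print(f"GPU (HBM-bound):  {decode_ceiling(params, 1, hbm_bw):>7,.0f} tok/s")
print(f"LPU (SRAM-bound): {decode_ceiling(params, 1, sram_bw):>7,.0f} tok/s")
```

With these assumed numbers, the SRAM-bound ceiling lands in the same ballpark as the 1,200 tokens-per-second figure above, while the HBM-bound ceiling shows why a single GPU stream feels slow by comparison.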
Groq demonstrated that 10,000 "thought tokens" — the internal reasoning steps that make AI agents work — can be produced in roughly two seconds. On a GPU cluster, that same operation takes 6-10 seconds. In a world where AI agents need to chain together dozens of reasoning steps before responding, those seconds compound into minutes.
Jensen Huang compared the Groq deal to Nvidia's 2019 acquisition of Mellanox, which solved networking for data centers and enabled the company's InfiniBand dominance. "We'll extend our architecture with Groq as an accelerator," Huang said during Nvidia's Q4 2026 earnings call, "in very much the way that we extended Nvidia's architecture with Mellanox."
That's a telling analogy. Mellanox didn't replace Nvidia's GPUs. It made them work better together. Groq won't replace GPUs either. It'll handle decode, the latency-sensitive stage that generates the response token by token, while GPUs handle prefill, the compute-heavy stage that ingests the prompt. Think of it as a relay race: the GPU sprints the first leg, then hands the baton to the LPU for the final stretch.
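A minimal, self-contained sketch of that relay, assuming the disaggregated prefill/decode pattern the analogy describes. The class names and return values here are hypothetical placeholders, not a real Nvidia or Groq API:

```python
# Sketch of a disaggregated prefill/decode pipeline (hypothetical API).

class GPUPrefillStage:
    """Ingests the whole prompt in one parallel pass (throughput-bound)."""
    def prefill(self, prompt: str) -> dict:
        # Returns the attention (KV) state that the decode stage reuses.
        return {"prompt_tokens": prompt.split(), "kv_state": "..."}

class LPUDecodeStage:
    """Emits output tokens one at a time from on-chip state (latency-bound)."""
    def __init__(self, kv_cache: dict):
        self.kv_cache = kv_cache  # the baton: state handed over from the GPU tier

    def decode_step(self, step: int) -> str:
        return f"<tok{step}>"

def generate(prompt: str, max_tokens: int = 4) -> str:
    kv_cache = GPUPrefillStage().prefill(prompt)   # leg 1: GPU sprint
    decoder = LPUDecodeStage(kv_cache)             # baton pass
    return "".join(decoder.decode_step(i) for i in range(max_tokens))

print(generate("Book a flight to San Jose for March 16"))
```

The handoff is the design choice that matters: the attention state travels to the decode tier once per request, which is cheap compared with re-reading the model weights on every generated token.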
Why Now
The shift from training to inference isn't gradual. It's a cliff.
Two years ago, every AI company was racing to train bigger models. The metric that mattered was FLOPS — raw computational horsepower. Nvidia owned that race. Blackwell Ultra chips could brute-force any training workload.
But the models got good enough. GPT-5, Claude 4, Gemini Ultra — they exist. The bottleneck moved from "can we build a smart model?" to "can we run it cheaply enough that people actually use it?" And more specifically: "can we run it fast enough for AI agents?"
Agentic AI is the use case that breaks GPUs. A chatbot answers one question. An agent might execute 50 sequential reasoning steps to book your flights, check your calendar, compare prices, and email you a summary. Each step needs to be fast. A two-second delay per step means you're waiting nearly two minutes for something that should feel instant.
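The arithmetic is worth spelling out, because sequential steps multiply latency rather than averaging it out. In the sketch below, the 50-step agent comes from the example above; the 0.4-second figure is an assumed comparison point for dedicated inference hardware:

```python
# Sequential agent steps multiply per-step latency.
# 50 steps is the article's example; 0.4 s is an assumed
# fast-inference comparison point.

steps = 50
for per_step in (2.0, 0.4):  # seconds per reasoning step
    total = steps * per_step
    print(f"{per_step:.1f} s/step x {steps} steps = {total:.0f} s (~{total / 60:.1f} min)")
```

At two seconds per step, the user waits 100 seconds. Cut the per-step latency to 0.4 seconds and the same agent finishes in 20.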
Qualcomm CEO Cristiano Amon said at MWC Barcelona this week that 2026 is "the year of the AI agent." Samsung's Galaxy S26 launched with agentic features built in. McKinsey and Gartner report that autonomous agents now handle 45-50% of routine knowledge work at companies that have deployed them. The demand for fast inference isn't theoretical. It's here.
The New Chip War
Nvidia's pivot creates a fascinating three-way competition.
Cerebras has the most radical approach: a single chip the size of a dinner plate with 4 trillion transistors. OpenAI is already running models on it. The company raised another $1 billion in February.
Groq, now inside Nvidia, pioneered the LPU architecture. Its speed advantage is real but comes with trade-offs: GPUs still win on batch throughput by a wide margin, as the sketch at the end of this section shows. Nvidia plans to announce racks of 256 LPUs at GTC, a fourfold increase over Groq's first generation.
Google and Amazon sit somewhere in between with their custom TPU and Trainium chips: not as fast as Groq, not as flexible as Nvidia GPUs, but cheaper to operate at massive scale.
A Futurum Group survey projects that non-GPU accelerators (XPUs) will lead data center compute spending growth in 2026 at 22%, outpacing GPUs at 19%. The GPU's share of the pie is shrinking for the first time.
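Here is that Groq trade-off made concrete with toy numbers. Every figure is an assumption chosen to show the shape of the comparison, not a measured spec:

```python
# Why "GPUs win on batch throughput" and "LPUs win on single-stream
# speed" can both be true. All figures are illustrative assumptions.

chips = {
    # name: (tokens/sec per stream, concurrent streams)
    "GPU, batch of 256": (50, 256),    # modest per user, huge aggregate
    "LPU, single stream": (1200, 2),   # fast per user, little batching headroom
}

for name, (per_stream, streams) in chips.items():
    print(f"{name:18} {per_stream:>5} tok/s per user | "
          f"{per_stream * streams:>7,} tok/s aggregate")
```

Under these assumptions the GPU serves over five times more total tokens per second, while the LPU serves each individual user more than twenty times faster. Chatbots at scale favor the former; agents favor the latter.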
What This Means
Nvidia isn't losing. It's adapting. The company saw OpenAI writing $10 billion checks to competitors and decided to buy the competition's best technology rather than pretend GPUs could do everything forever. The $20 billion Groq deal and the $30 billion investment in OpenAI happened within weeks of each other. That's not coincidence — it's strategy.
But the broader picture matters more than any single company. The AI industry is splitting into two distinct hardware eras. Era one was about building intelligence: massive training runs, enormous GPU clusters, brute computational force. Era two is about deploying intelligence: fast, efficient, always-on inference that makes AI agents feel as responsive as a human assistant.
The companies that figured out training won the last five years. The companies that figure out inference will win the next ten.
Nvidia is betting $20 billion that it can win both eras. On March 16, we'll find out if that bet looks smart.