Nvidia Blackwell Chips Break Records in Training Massive AI Models, MLCommons Data Shows
Nvidia’s newest Blackwell AI chips have delivered record-setting performance in training the world’s largest language models, according to new benchmark results published by MLCommons on Wednesday. The nonprofit group, which publishes AI performance benchmarks across leading chipmakers, released data showing that Nvidia’s latest hardware sharply reduces the number of chips required to train large-scale models such as Meta’s Llama 3.1 405B.
Nvidia Blackwell vs Hopper: Over 2x Faster Per Chip
The MLCommons AI training benchmark results show that 2,496 Nvidia Blackwell GPUs completed the Llama 3.1 405B training test in just 27 minutes. In contrast, it took more than three times as many of Nvidia’s previous-generation Hopper GPUs to achieve a slightly faster time. On a per-chip basis, Blackwell is more than twice as fast as Hopper, signalling a major leap forward in AI training efficiency.
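As a rough sanity check on that per-chip claim, the useful unit is chip-minutes: chip count multiplied by training time. Here is a minimal sketch in Python, where the Hopper chip count and time are hypothetical stand-ins consistent with the reported “more than three times as many chips” and “a slightly faster time”:

```python
# Back-of-the-envelope per-chip comparison. The Blackwell figures are from the
# benchmark; the Hopper numbers below are illustrative assumptions only.

blackwell_chips = 2496
blackwell_minutes = 27

hopper_chips = 3 * blackwell_chips + 1   # "more than three times as many" (assumed)
hopper_minutes = 25                      # "a slightly faster time" (assumed)

# Chip-minutes is a rough proxy for the total compute each cluster spent.
blackwell_chip_minutes = blackwell_chips * blackwell_minutes
hopper_chip_minutes = hopper_chips * hopper_minutes

speedup_per_chip = hopper_chip_minutes / blackwell_chip_minutes
print(f"Approximate per-chip speedup: {speedup_per_chip:.1f}x")  # ~2.8x under these assumptions
```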
Key Highlights:
- AI Model Trained: Llama 3.1 405B (405 billion parameters)
- Blackwell Chip Count: 2,496 units
- Training Time: 27 minutes
- Performance vs. Hopper: More than 2x faster per chip
- Training Scale: A step toward multi-trillion-parameter models
Why AI Training Still Matters
While market focus has recently shifted toward AI inference—where trained models respond to user inputs—AI training remains a critical competitive frontier. Training large language models (LLMs) requires immense computing power and hardware efficiency. The number of chips and time required to train models directly impacts cost, development speed, and AI scalability.
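To make the cost point concrete, here is a minimal back-of-the-envelope sketch; the per-GPU-hour rate is an assumed placeholder rather than any vendor’s actual price:

```python
# Rough training-cost model: chips x hours x hourly rate per chip.

def training_cost(num_chips: int, hours: float, rate_per_gpu_hour: float) -> float:
    return num_chips * hours * rate_per_gpu_hour

# 2,496 GPUs for 27 minutes at an assumed $4.00 per GPU-hour (illustrative only).
cost = training_cost(2496, 27 / 60, 4.00)
print(f"~${cost:,.0f} for one benchmark-length run")  # roughly $4,500 under these assumptions
```

Halving either the chip count or the wall-clock time halves the bill, which is why per-chip efficiency gains of the kind Blackwell shows translate directly into development cost.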
According to Chetan Kapoor, Chief Product Officer at CoreWeave, who collaborated with Nvidia on the benchmarks, there’s a trend toward modular, distributed AI training systems. Instead of massive clusters of 100,000+ chips, developers now favour smaller subsystems for more flexible and efficient training of multi-trillion parameter models.
“Using a methodology like that, they’re able to continue to accelerate or reduce the time to train some of these crazy, multi-trillion parameter model sizes,” Kapoor said during a press briefing.
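To illustrate the general idea (this is a generic data-parallel sketch, not CoreWeave’s or Nvidia’s actual stack), the snippet below uses PyTorch’s DistributedDataParallel: every subsystem runs the same script, and a process group stitches them into one training job. The tiny Linear model is a stand-in for a real LLM, and the script assumes launch via torchrun with one process per GPU.

```python
# Minimal data-parallel training sketch (illustrative only).
# Launch with, e.g.: torchrun --nnodes=<subsystems> --nproc_per_node=<gpus> train.py

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # NCCL handles GPU-to-GPU traffic
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()       # stand-in for a real LLM
    model = DDP(model, device_ids=[local_rank])      # wraps the model for gradient sync

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 1024, device="cuda")          # dummy batch
    loss = model(x).square().mean()                  # dummy loss for illustration
    loss.backward()                                  # gradient all-reduce spans every subsystem
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The property Kapoor describes maps onto the fact that the number of participating subsystems is just a launch-time parameter here: the same job can scale from one rack to many without restructuring the training code.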
China’s DeepSeek Makes Strides with Fewer Chips
The benchmark release also lands amid a competitive push from China’s DeepSeek, which claims to have developed a high-performing chatbot using far fewer chips than U.S. rivals. Although DeepSeek was not directly compared in this round of MLCommons data, its approach highlights growing global interest in efficient AI model training.
Why This Matters for the AI Industry
These new results from MLCommons confirm that Nvidia’s Blackwell GPUs are setting a new bar for training efficiency. For AI developers, enterprises, and cloud providers, this means:
- Lower operational costs
- Faster model development cycles
- Greater scalability for multi-trillion parameter models
- Improved resource allocation with fewer chips
This shift could significantly reshape AI infrastructure investments, especially for companies building foundational AI models, digital assistants, or custom LLMs.