Deep Dive · February 25, 2026
Vera Rubin: Nvidia’s $4 Million Bet on the Next Era of AI Infrastructure
Inside Nvidia’s most complex AI system to date: 1.3 million components, 10x efficiency gains over Blackwell, and the hardware calculus driving the next wave of foundation model development.
This week, CNBC got an exclusive look at Nvidia’s next AI system, and the numbers are staggering. Vera Rubin, scheduled to ship in the second half of 2026, promises 10 times more performance per watt than its predecessor, Grace Blackwell. It comprises 1.3 million components. Its flagship rack weighs nearly two tons. And Nvidia has already shipped the first samples to customers.
This is not a modest incremental upgrade. It is a rethinking of what AI compute infrastructure means, arriving at a moment when the race to dominate AI workloads has never been more consequential.
A System, Not a Chip
The first thing to understand about Vera Rubin is that Nvidia is no longer selling GPUs. It is selling systems: integrated, rack-scale supercomputers in which every component, from the processing units to the networking fabric to the cooling loops, is co-designed as a single machine.
The Vera Rubin NVL72, the flagship configuration, packs 72 Rubin GPUs and 36 Vera CPUs into a single rack connected through NVLink 6. The combined system delivers 3.6 exaflops of NVFP4 inference performance, 20.7 terabytes of HBM4 memory capacity, and 260 terabytes per second of scale-up bandwidth.
Breaking that down to the chip level: each Rubin GPU contains 336 billion transistors across two reticle-sized dies, delivering 50 petaflops of inference performance and 35 petaflops of training performance, a 5x and 3.5x improvement over Blackwell, respectively. Each GPU carries 288 gigabytes of HBM4 memory at 22 terabytes per second of bandwidth. The Vera CPU, built on custom Arm “Olympus” cores with 227 billion transistors, runs 88 cores and 176 threads using Nvidia’s Spatial Multi-Threading architecture, with up to 1.5 terabytes of LPDDR5x memory.
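As a sanity check, the rack-level figures follow directly from the per-GPU numbers. A minimal sketch in Python, using only the specs quoted above:

```python
# Sanity check: Vera Rubin NVL72 rack-level aggregates derived from
# the per-GPU specs quoted above. Pure arithmetic, not a benchmark.

GPUS_PER_RACK = 72

INFERENCE_PFLOPS_PER_GPU = 50   # NVFP4 inference petaflops per Rubin GPU
HBM4_GB_PER_GPU = 288           # HBM4 capacity per GPU, gigabytes

rack_exaflops = GPUS_PER_RACK * INFERENCE_PFLOPS_PER_GPU / 1000
rack_hbm4_tb = GPUS_PER_RACK * HBM4_GB_PER_GPU / 1000

print(f"Rack NVFP4 inference: {rack_exaflops:.1f} exaflops")  # -> 3.6
print(f"Rack HBM4 capacity:   {rack_hbm4_tb:.1f} TB")         # -> 20.7
```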
At the heart of the system sits the Vera Rubin superchip, which pairs one Vera CPU with two Rubin GPUs. It contains 17,000 components and slides out of one of the rack’s 18 compute trays in seconds. In the Blackwell system, equivalent components were soldered to the board, making repairs a multi-hour ordeal. That single design change has large implications for the hyperscale data centers running thousands of these racks.
The Efficiency Argument
The most important number in Nvidia’s announcement is not raw performance. It is cost per token.
The Vera Rubin NVL72 performs inference at roughly 10% of the cost per million tokens of the current Blackwell GB200 NVL72. That is not a 10% improvement. It is a 10x reduction. For AI labs running frontier models at scale, this single figure determines how many users they can serve, at what latency, and at what margin.
Vera Rubin will consume approximately twice as much power as its predecessor, but because of the 10x gain in performance per watt, the effective energy cost per useful output plummets. Analysts at Mizuho Securities put it plainly: what matters most is “how many tokens per power consumed can you get.” Vera Rubin moves that ratio dramatically in the customer’s favor.
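One way to reconcile the headline ratios: twice the power at ten times the performance per watt implies roughly twenty times the throughput, which is how the energy cost per token falls by a factor of ten. A sketch of that arithmetic, using only the reported ratios:

```python
# Reconciling the reported ratios: ~2x rack power, 10x performance per
# watt. Pure ratio arithmetic; no absolute power or throughput figures
# are assumed.

PERF_PER_WATT_GAIN = 10   # reported: 10x performance per watt vs. Blackwell
POWER_RATIO = 2           # reported: ~2x rack power draw vs. Blackwell

throughput_gain = PERF_PER_WATT_GAIN * POWER_RATIO   # -> 20x tokens/sec
energy_per_token = POWER_RATIO / throughput_gain     # -> 0.1x Blackwell's

print(f"Implied throughput gain: {throughput_gain}x")
print(f"Energy per token:        {energy_per_token:.1f}x of Blackwell's")
```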
The system is also Nvidia’s first to use 100% liquid cooling, which carries operational advantages beyond raw efficiency. Traditional evaporative cooling systems in data centers consume substantial water for heat rejection. Liquid cooling recirculates coolant through closed loops, dramatically reducing water consumption. This matters as hyperscale operators face scrutiny over their environmental footprints and as cities increasingly restrict data center water use.
The cable-free modular tray design has a parallel operational impact. Nvidia says installation time drops from two hours per unit to approximately five minutes, a change that compounds across thousands of nodes in a large deployment and reduces the skilled labor required to maintain fleet health.
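A back-of-envelope illustration of that compounding, assuming a hypothetical fleet size (the per-unit times are Nvidia’s figures; the deployment size is not):

```python
# Back-of-envelope: per-tray installation savings across a fleet.
# The fleet size is a hypothetical placeholder; the per-unit times
# (2 hours vs. ~5 minutes) are the figures Nvidia reports.

TRAYS = 5_000                   # hypothetical deployment size
old_hours = TRAYS * 2.0         # cabled, Blackwell-style installation
new_hours = TRAYS * 5 / 60      # cable-free modular tray

print(f"Cabled install:  {old_hours:>8,.0f} labor-hours")
print(f"Modular install: {new_hours:>8,.0f} labor-hours")
print(f"Saved:           {old_hours - new_hours:>8,.0f} labor-hours")
```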
Who Is Buying It
The customer list for Vera Rubin reads like a roll call of every major AI organization: Meta, OpenAI, Anthropic, Amazon Web Services, Google, and Microsoft are all named by Nvidia as expected customers. Meta specifically announced plans to deploy Vera Rubin in its data centers by 2027.
These are not speculative commitments. These organizations are making infrastructure bets measured in billions of dollars, and locking into Vera Rubin signals where they expect AI model development to go over the next two to three years. The architecture favors large, dense inference workloads of the kind generated by frontier models like GPT-5-class systems and Anthropic’s next-generation Claude, rather than the distributed training jobs that characterized the 2023-2024 buildout cycle.
The pricing reflects the ambition. Futurum Group estimates the Vera Rubin rack will cost approximately $3.5 million to $4 million, roughly 25% more than Grace Blackwell. At those price points, buyers are not experimenting. They are making infrastructure commitments with 3-to-5-year depreciation horizons.
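For a rough sense of the economics, a straight-line amortization sketch; every number below is an illustrative assumption rather than an Nvidia or Futurum figure:

```python
# Straight-line amortization sketch. Every value here is an
# illustrative assumption, not an Nvidia or Futurum figure.

RACK_COST_USD = 4_000_000   # top of the estimated price range
DEP_YEARS = 4               # midpoint of a 3-to-5-year horizon
HOURS_PER_YEAR = 8_760
UTILIZATION = 0.70          # hypothetical fleet utilization

cost_per_hour = RACK_COST_USD / (DEP_YEARS * HOURS_PER_YEAR * UTILIZATION)
print(f"Capital cost per utilized rack-hour: ${cost_per_hour:,.0f}")
```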
The Supply Chain Behind 1.3 Million Components
One of the less-discussed dimensions of Vera Rubin is its supply chain complexity. The system draws components from more than 80 suppliers across at least 20 countries, including China, Vietnam, Thailand, Mexico, Israel, and the United States. Its core chips are manufactured primarily by TSMC.
Dion Harris, Nvidia’s AI infrastructure head, told CNBC that the company is providing suppliers with “very detailed forecasts” to ensure alignment. The challenge is formidable: memory costs are rising sharply due to a global HBM shortage driven by AI-related demand, and any supply constraint in a single component category can stall production of an integrated system with 1.3 million interdependent parts.
This supply chain exposure is one reason Nvidia’s announcement about delivering samples to customers carries weight. Sample delivery signals that the component pipeline is stable enough to support at least limited production, a milestone that backs up Jensen Huang’s announcement at CES in January 2026 that the platform had reached full production.
The racks themselves are manufactured in the United States, Taiwan, and at a new Foxconn plant in Mexico. Nvidia has committed to building up to $500 billion of AI infrastructure in the U.S. through 2029, including Blackwell GPU production at TSMC’s new Arizona fabs. That domestic manufacturing push reflects both political pressure and strategic hedging against geopolitical risk in the semiconductor supply chain.
The Competition Nvidia Does Not Dismiss
Nvidia occupies approximately 70-80% of the AI accelerator market, but the competitive environment is shifting in ways the company takes seriously enough to acknowledge publicly.
AMD will ship its first rack-scale AI system, called Helios, later in 2026. Meta, already a confirmed Vera Rubin customer, simultaneously announced a commitment to up to six gigawatts of AMD GPU capacity. That is not a contradiction; it is standard hyperscale procurement strategy. Major buyers maintain multiple suppliers to avoid single-vendor dependency and to use competition to negotiate pricing.
Custom silicon is the longer-term competitive pressure. Amazon’s Trainium 2 chips are already filling racks at AWS data centers. Google’s TPUs have powered Gemini inference at scale for years. Broadcom is building custom AI accelerators for multiple hyperscale clients. These in-house chips are purpose-built for specific workloads and bypass the Nvidia premium, but they require massive engineering investment and lack the software ecosystem depth that CUDA provides.
Harris’s response when asked about competition was diplomatically dismissive: “Hats off to anyone who’s going to try. But this is certainly not a simple endeavor.” The complexity argument is genuine: an integrated 1.3-million-component system optimized across hardware, networking, and software is difficult to replicate. But it is not a permanent moat. AMD’s Helios represents five years of dedicated rack-scale engineering, and Broadcom’s custom silicon pipeline is maturing rapidly.
What the Architecture Reveals About AI’s Direction
The design choices embedded in Vera Rubin are revealing about where Nvidia, and by extension the customers who shape its roadmap, believe AI is heading.
The emphasis on inference efficiency over raw training throughput signals that the center of gravity in AI compute is shifting. The training runs that defined 2022-2025 are giving way to inference at massive scale. Vera Rubin is optimized to serve trillions of tokens at minimal cost, not to train the next frontier model in record time.
The 100% liquid cooling and modular serviceability reflect an assumption that these systems will run continuously for years in high-density environments, not cycle through aggressive upgrade schedules. The move away from soldered components is particularly telling: it assumes that operators will need to maintain and repair these systems at scale, not simply replace them.
The 1.5-terabyte CPU memory and the tight CPU-GPU integration in the superchip design address a specific bottleneck in large model inference: the cost of moving data between the host CPU and GPU memory. By co-packaging Vera CPUs with Rubin GPUs and connecting them with high-bandwidth interconnects, Nvidia reduces the data movement overhead that currently limits inference throughput for large mixture-of-experts models.
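A toy model makes the bottleneck concrete. The bandwidths below are illustrative stand-ins (a PCIe 5.0 x16-class path versus a Grace Hopper-class coherent link), not published Vera Rubin interconnect specs:

```python
# Toy model of host-to-GPU data movement. Bandwidths are illustrative:
# a PCIe 5.0 x16-class path (~64 GB/s) versus a Grace Hopper-class
# coherent CPU-GPU link (~900 GB/s). Neither is a published Rubin spec.

def transfer_ms(gigabytes: float, bandwidth_gb_s: float) -> float:
    """Time to move `gigabytes` of tensor data at `bandwidth_gb_s`, in ms."""
    return gigabytes / bandwidth_gb_s * 1000

EXPERT_SHARD_GB = 20   # hypothetical MoE expert weights paged from CPU memory

print(f"PCIe-class path: {transfer_ms(EXPERT_SHARD_GB, 64):6.1f} ms")
print(f"Coherent link:   {transfer_ms(EXPERT_SHARD_GB, 900):6.1f} ms")
```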
These are not features that matter for running small models on a handful of servers. They are engineering choices made for organizations running trillion-parameter models at planetary scale, a capability tier that currently exists only in the most advanced AI labs but which Nvidia is betting will become standard infrastructure within three years.
The Stakes
Vera Rubin arrives in an AI infrastructure market that is spending at a pace with no historical precedent. The five largest hyperscalers combined are on track to invest over $300 billion in AI-related capital expenditure in 2026. Nvidia’s share of that spending is the foundation of a valuation that has made it one of the most valuable companies in history.
The Rubin generation is Nvidia’s argument that this investment cycle is not complete. The hardware gains still available justify continued spending at current or higher rates. A 10x reduction in inference token cost means the economic case for deploying AI at scale gets substantially stronger, which in turn means more deployment, more inference demand, and more hardware procurement.
For the organizations consuming this infrastructure, the calculus is similar. Cheaper, more efficient inference unlocks use cases that were economically marginal at Blackwell prices: real-time processing of video and audio streams, continuous document analysis at enterprise scale, always-on agentic workflows that maintain context across extended periods. Each drop in inference cost opens a new tier of applications.
The gap between what AI can do and what organizations are actually deploying at scale is still wide. Vera Rubin, if it delivers on its specifications, is the hardware that begins closing it.
Sources: CNBC exclusive first look (Feb 25, 2026), Tom’s Hardware technical specifications, Futurum Group analyst estimates, Mizuho Securities commentary.