Nvidia Nemotron 3 Super: 5x Throughput for Agentic AI
Nvidia launches Nemotron 3 Super, a 120B open model with a 1M-token context window designed to cut cost and latency for multi-agent AI workloads.
Nvidia today released Nemotron 3 Super, a 120-billion-parameter open-weight model built from the ground up for multi-agent AI workflows. The company says the model delivers five times the throughput and twice the accuracy of its predecessor, Nemotron Super, while activating only 12 billion parameters on any given forward pass.
The release drops on the same day Perplexity added Nemotron 3 Super as one of 20 orchestrated models inside its Computer agent platform, giving the launch immediate production traction beyond Nvidia’s own ecosystem.
Why Multi-Agent Systems Struggle Today
The core problem Nvidia is targeting is well-documented: multi-agent pipelines are expensive and fragile in ways that standard chatbot benchmarks do not expose.
Nvidia quantifies two specific failure modes. The first is context explosion. Multi-agent workflows re-send full conversation histories, tool outputs, and intermediate reasoning steps at every turn, generating up to 15 times more tokens than a standard chat session. Over long tasks, the accumulating context bloat leads to goal drift, where agents gradually lose alignment with the original objective.
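The arithmetic behind context explosion is easy to see in a toy model. The sketch below (a simplified assumption, not a measurement of any real agent framework) counts total tokens when every turn re-sends the full accumulated history:

```python
# Sketch: token cost of re-sending full history each turn.
# Simplified model for illustration; real agent frameworks vary.

def cumulative_tokens(turns, tokens_per_step):
    """Total tokens sent when every turn re-transmits all prior steps."""
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_step   # new reasoning/tool output this turn
        total += history             # the whole history is sent again
    return total

# 20 turns at 500 tokens per step:
single_pass = 20 * 500                 # 10,000 tokens if each step were sent once
agentic = cumulative_tokens(20, 500)   # 105,000 tokens with full re-sending
print(agentic / single_pass)           # ~10.5x amplification
```

Token cost grows quadratically with turn count under this pattern, which is why long-running agents amplify costs so much faster than chat sessions of the same length.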
The second is the thinking tax. Complex agents must reason at every step, but running a frontier-scale reasoning model for every subtask makes multi-agent applications too expensive and too slow for production deployment. Most enterprise agentic pilots stall here.
Nemotron 3 Super targets both constraints simultaneously rather than trading off one for the other.
The Architecture Behind the Claims
The model uses a hybrid Mamba-Transformer mixture-of-experts backbone, a combination Nvidia describes as delivering four times higher memory and compute efficiency compared to standard transformer architectures.
Three specific innovations drive the efficiency gains:
Latent MoE: Token representations are compressed before reaching the expert layers, allowing the router to activate four times as many specialists for the same inference cost. The model carries 120 billion parameters of specialized knowledge with only 12 billion active per forward pass.
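The routing idea can be illustrated with a minimal sketch. Everything below is hypothetical (the projection, expert keys, and dimensions are invented for illustration and do not reflect Nvidia's implementation); the point is only that scoring experts in a compressed latent space makes it cheap to consider many more of them:

```python
# Minimal latent-MoE routing sketch (illustrative only, not Nvidia's code).
# Tokens are compressed to a small latent space before the router scores
# experts, so many experts can be scored for the same routing cost.
import random

random.seed(0)
HIDDEN, LATENT, EXPERTS, TOP_K = 8, 2, 16, 4

# Hypothetical projection that compresses hidden states into the latent space.
down_proj = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(LATENT)]
# One scoring vector per expert, defined in the cheap latent space.
expert_keys = [[random.uniform(-1, 1) for _ in range(LATENT)] for _ in range(EXPERTS)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def route(hidden_state):
    """Return the TOP_K expert ids chosen for one token."""
    latent = [dot(row, hidden_state) for row in down_proj]  # compress first
    scores = [dot(key, latent) for key in expert_keys]      # score all experts cheaply
    return sorted(range(EXPERTS), key=lambda e: -scores[e])[:TOP_K]

token = [random.uniform(-1, 1) for _ in range(HIDDEN)]
print(route(token))  # 4 of 16 experts active for this token
```

Because scoring happens in the low-dimensional latent space, quadrupling the expert count adds little routing overhead, which matches the claimed "four times as many specialists for the same inference cost."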
Multi-token prediction: The model predicts multiple future tokens in a single forward pass, reducing generation time for long output sequences and enabling built-in speculative decoding without a separate draft model.
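The accept-or-verify loop behind this kind of self-speculative decoding can be sketched as follows. The "model" here is a toy deterministic function standing in for both the verified forward pass and the multi-token draft heads (which are deliberately imperfect beyond the first position); none of this is Nvidia's actual decoding code:

```python
# Toy self-speculative decoding loop (illustrative; the real mechanism
# uses the model's own multi-token heads, simulated here with stand-ins).

def model_next(context):
    """Stand-in for one verified forward pass: next token given context."""
    return (sum(context) * 31 + len(context)) % 100

def draft_k(context, k):
    """Stand-in for the multi-token draft heads: guess k tokens ahead.
    Deliberately imperfect beyond the first position."""
    out, ctx = [], list(context)
    for i in range(k):
        guess = model_next(ctx) if i == 0 else (model_next(ctx) + 1) % 100
        out.append(guess)
        ctx.append(guess)
    return out

def generate(context, n_tokens, k=4):
    """Accept drafted tokens only while they match the verifier."""
    ctx = list(context)
    while len(ctx) - len(context) < n_tokens:
        for tok in draft_k(ctx, k):
            if tok == model_next(ctx):       # verification passes: accept draft
                ctx.append(tok)
            else:
                ctx.append(model_next(ctx))  # fall back to the verified token
                break
    return ctx[len(context):][:n_tokens]

print(generate([1, 2, 3], 8))
```

The key property is that output is identical to ordinary one-token-at-a-time decoding; the drafts only change how many verified tokens each forward pass yields, not what gets generated. Doing this without a separate draft model is the advantage Nvidia is claiming.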
Native NVFP4 pretraining: The model was pretrained in NVFP4 format optimized for Nvidia Blackwell GPUs. On B200 hardware, Nvidia claims four times the inference speed compared to running FP8 models on H100, with accuracy preserved.
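The general idea of a 4-bit block-scaled float format can be sketched in a few lines. This is a simplification for intuition only: the grid below is the standard E2M1 4-bit float magnitude set, but the real NVFP4 format uses FP8 block scales and hardware-specific details not modeled here:

```python
# Sketch of 4-bit quantization with a per-block scale, the general idea
# behind block-scaled FP4 formats (simplified; NVFP4 uses FP8 block scales).

# Representable magnitudes of an E2M1 4-bit float.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(values):
    """Scale a block so its max magnitude maps to 6.0, then snap to the grid."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 6.0
    q = []
    for v in values:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        q.append(mag if v >= 0 else -mag)
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

block = [0.11, -0.52, 0.98, 0.03, -1.7, 0.3, 0.77, -0.05]
q, scale = quantize_block(block)
approx = dequantize_block(q, scale)
```

Each weight costs 4 bits plus a shared per-block scale, roughly halving memory traffic versus FP8; pretraining natively in the format, rather than quantizing afterward, is how Nvidia says accuracy is preserved.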
The 1-million-token context window addresses context explosion directly. Agents can retain full workflow state in a single context rather than truncating or summarizing prior steps, which is where goal drift typically originates.
On benchmarks, Nemotron 3 Super currently holds the top position on both DeepResearch Bench and DeepResearch Bench II, which measure an AI system’s ability to conduct multi-step research across large document sets while maintaining reasoning coherence over extended sessions.
Who Is Already Using It
Nvidia announced a set of early integrations spanning AI-native companies and enterprise software platforms.
On the AI-native side, CodeRabbit, Factory, and Greptile are integrating Nemotron 3 Super into their code review and software development agents, mixing it with proprietary models to hit higher accuracy at lower per-task cost. Life sciences organizations Edison Scientific and Lila Sciences are deploying it for deep literature search and molecular understanding workflows.
On the enterprise side, Palantir, Cadence, Siemens, Dassault Systemes, and Amdocs are customizing the model for domain-specific agentic automation in cybersecurity, semiconductor design, and telecom. Palantir's deployment is oriented toward cybersecurity triage, one of the specific use cases Nvidia cited in its technical documentation.
The model also powers Nvidia's own AI-Q research agent, which currently sits at the top of the DeepResearch Bench leaderboard.
Open Weights, Fully Customizable
Unlike some recent open model releases that carry usage restrictions for commercial applications, Nemotron 3 Super ships with open weights, open datasets, and open training recipes. Nvidia is positioning this as a build-your-own foundation, not a hosted API.
The model is available now on Hugging Face and through Nvidia’s NIM microservice infrastructure. Nvidia says it was post-trained with reinforcement learning across 21 environment configurations using NeMo Gym, an approach the company argues produces stronger performance on real-world agentic tasks than reward modeling on static benchmarks alone.
For teams currently running proprietary frontier models for agentic tasks, Nemotron 3 Super presents a direct cost-versus-accuracy trade-off calculation. At 12 billion active parameters, inference costs run significantly below GPT-5-class models, and the 1M-token context window eliminates the compression hacks most teams currently use to keep context costs manageable.
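That trade-off calculation is simple enough to write down. The prices below are hypothetical placeholders chosen only to show the shape of the comparison; substitute your provider's actual rates and your pipeline's measured call counts:

```python
# Back-of-envelope cost model for a multi-call agentic task.
# All prices are HYPOTHETICAL placeholders, not real rates.

def pipeline_cost(calls, avg_context_tokens, avg_output_tokens,
                  price_in_per_m, price_out_per_m):
    """Dollar cost of one agentic task, given per-million-token prices."""
    tokens_in = calls * avg_context_tokens
    tokens_out = calls * avg_output_tokens
    return (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1e6

# Assume 40 model calls per task, 30k-token contexts, 1k-token outputs.
frontier = pipeline_cost(40, 30_000, 1_000, price_in_per_m=5.0, price_out_per_m=15.0)
compact = pipeline_cost(40, 30_000, 1_000, price_in_per_m=0.5, price_out_per_m=1.5)
print(round(frontier, 2), round(compact, 2))  # 6.6 0.66 per task
```

Because input tokens dominate agentic workloads (every call re-sends a large context), the per-task cost scales almost linearly with the input-token price, which is where a 12-billion-active-parameter model has the most room to undercut frontier-scale pricing.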
Whether the benchmark claims hold up in production outside Nvidia’s reference integrations is the open question. The early adopter list is credible, but real-world agentic workloads have a history of exposing gaps that controlled benchmarks miss.
What This Means for the Model Market
Nemotron 3 Super enters a market where the cost of frontier-scale inference is the primary barrier to multi-agent deployment at scale. GPT-5-class models run well on tasks where a single call is all you need. They get expensive fast when an agentic pipeline calls them dozens of times per task, each time with an expanding context window.
Nvidia’s pitch is that the efficiency gains from the Mamba-Transformer hybrid and NVFP4 pretraining close enough of that gap to make production-scale agentic systems financially viable for organizations that could not justify the cost at current frontier model prices.
The open-weight approach reinforces that positioning. Teams can fine-tune Nemotron 3 Super on domain-specific data, reducing the accuracy gap versus larger proprietary models in specialized workflows. The full training recipe is public, which means any organization with Blackwell hardware can reproduce and extend Nvidia’s results without relying on a vendor API.
The longer-term question is whether Nvidia holds this position. Anthropic, Google, and Meta all have the resources and incentive to target the same efficiency gap. But Nvidia controls the hardware stack that current agentic infrastructure runs on, and Nemotron 3 Super was designed from the ground up to extract maximum performance from that hardware. Competitors building on the same chips start from the same ceiling.