Nemotron 3 Ultra
or, The American Open-Weights King
Originally presented as a live talk on June 17, 2026
Background
I want to start our review today with some history. Specifically, the history of NVIDIA in the language modeling space.
Now NVIDIA chips, and libraries for using those chips, have been part of machine learning for a long time. Most of the breakthrough moments in what people used to call “deep learning”, like AlexNet or GPT-1, have used the NVIDIA stack.
But it wasn’t until September 2019 - over a year after the release of GPT-1 - that NVIDIA started putting in some language modeling efforts of their own. Specifically, I’m referring to this paper on the left: Megatron-LM, which is all about training large language models across many GPUs.
That sounds pedestrian today, with data centers containing thousands of GPUs all running in parallel to train the next SOTA model, but in 2019 it was a real pain point. For context, GPT-1 was 117M parameters and trained across just 8 GPUs. GPT-2, which OpenAI had announced earlier that year, went up to 1.5B parameters and trained across 32 TPUs. Those are Google’s special ML chips, so not actually GPUs.
By contrast, the Megatron-LM paper trained a model with 8.3B parameters across 512 GPUs; that model size is one many researchers still use today. The model itself didn’t make much of a splash, but the engineering did; the ideas and tools from Megatron-LM live on today - including in the original GitHub repo, shown on the right, which is still in active development.
Things scale up rapidly from there. GPT-3 comes out in 2020, rocking 175B parameters, having trained on 10k NVIDIA V100 GPUs. Then ChatGPT comes out in November 2022 and kicks off the generative AI explosion.
While OpenAI by this point has become closed, enough knowledge has gotten out that other folks are able to train large language models. The big name in the open-weights space, LLaMA, comes out in February 2023, with sizes from 7B to 65B parameters. Llama 2 quickly follows in July 2023.
Among the many other companies releasing open-weights models is NVIDIA, with Nemotron 3 8B in November 2023.
Now I know what you’re thinking: is this Nemotron 3 connected to Nemotron 3 Ultra, the model we’re discussing today?
Strangely, the answer is no. Other than both being language models from NVIDIA, the Nemotron of 2023 is not related to the Nemotron of today. The original line extended to Nemotron 4 in 2024, but then they rebooted the franchise so to speak, and released a new Nemotron 2 in August 2025, which we covered in a previous Friendly Paper Review.
I include all this confusing nomenclature because I find it emblematic of NVIDIA’s model efforts on the whole, and of my experience reading the Nemotron 3 Ultra paper in particular. Quite the opposite of Microsoft AI’s approach, which we covered last week and which I found refreshingly clear and principled.
So we know NVIDIA has been working on models for some time now. The next question is, why?
The answer boils down to one business principle: commoditize your complement.
Let me explain with reference to Windows and the PC market, which is the classic tale of commoditizing your complement. Once upon a time, PCs were differentiated - different hardware, different drivers, different standards. That differentiation is how PC makers stood out with their customers, but it was hell for operating system makers like Microsoft. If you’re Microsoft, of course you want all PCs to be interchangeable so that you can easily run on all of them. But more importantly, you want all PCs to be interchangeable so that manufacturers have to compete primarily on price rather than on features - that’s the hallmark of commodities markets. That price competition on the hardware side then drives down the price of PCs overall but leaves the price on the software side untouched. The lower the cost for a PC, the more people can buy them, and the bigger the market for Windows is, at the same price.
Now let’s apply that wisdom today, but with the roles reversed: NVIDIA is the hardware maker, and the big model makers like OpenAI and Anthropic are the software folks. They are battling over who will extract the value in their chain.
Obviously they all want the demand for inference to go up, and that tide is really lifting all boats right now. But no budget is unlimited, and for a fixed amount of money there has to be a split between hardware and software. Right now OpenAI and Anthropic in particular collect fat margins on their tokens, competing on quality or perhaps some differentiated product experiences, which means fewer tokens for a given budget. If NVIDIA can somehow commoditize the token, making providers compete on price, then demand for cheap tokens is gonna go up - there’s gonna be a lot more inference. And that new inference is mostly gonna run on new NVIDIA chips.
Microsoft did it by playing PC makers off each other. NVIDIA is doing it by introducing its own competitor in Nemotron - in addition to funding lots of other model makers.
Now transitioning from business to technology, one thing to point out that blends both is how AI workloads are evolving.
As this fun graphic from the recent Anthropic report on recursive self-improvement shows, the paradigm has shifted from chatbots to agents to swarms of agents. And because of how agents work - looping through thoughts, actions, and observations indefinitely until the task is complete - they are token-intensive. Back in the days of the original ChatGPT, inference was too expensive to do this kind of thing, not to mention the severe constraints on context window size and the generally insufficient intelligence.
So the dropping cost of tokens encourages more token use overall - a more general economic phenomenon known as the Jevons paradox - and unblocks the switch from chatbots to agents. It’s also very Bitter Lesson, throwing scaled-up resources at a problem instead of eking out wins with human effort; a chatbot assisting a human in a task is more efficient in tokens but ultimately unscalable compared to a swarm of agents working autonomously in parallel to complete that same task.
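To put toy numbers on why agents are so token-hungry, here is a minimal sketch of a think-act-observe loop that just counts tokens. Everything in it is an assumption for illustration: the function name, the step counts, and the token sizes are made up, and it assumes the model re-reads its whole history on every step with no prompt caching.

```python
def agent_token_usage(n_steps, prompt_tokens, generated_per_step, observed_per_step):
    """Count tokens read and written by a simple think-act-observe loop.

    Assumes no prompt caching: the whole history is re-read on every step.
    """
    context = prompt_tokens
    tokens_read = 0
    tokens_written = 0
    for _ in range(n_steps):
        tokens_read += context                # model reads the full history so far
        tokens_written += generated_per_step  # thought + action (e.g. a tool call)
        # the model's own output and the tool's observation both join the history
        context += generated_per_step + observed_per_step
    return tokens_read, tokens_written


# One chatbot turn vs. a 50-step agent run (hypothetical sizes).
chat = agent_token_usage(1, 1000, 200, 0)
agent = agent_token_usage(50, 1000, 200, 300)
print(chat)   # (1000, 200)
print(agent)  # (662500, 10000)
```

The written tokens grow linearly with the number of steps, but the tokens read grow roughly with the square of it, because each step drags the whole history along. Real serving stacks soften this with caching, so treat the ratio as an upper bound on the intuition rather than a measurement.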
All that is to say: NVIDIA is naturally going to focus on agents, especially the always-on ones like OpenClaw, which are going to consume even more tokens because of all the work they do in the background - you don’t even need a human there to push up token demand!
Also, NVIDIA is going to satisfice on quality while optimizing for throughput and cost. They are not racing to achieve ASI; they want their good-enough models deployed as fast, cheap agents at every enterprise in the world.
Completing our transition into technical topics, I want to discuss a major shift in language model training to parallelize yet another aspect of the process.
The technique in question is “Multi-teacher On-Policy Distillation”, or MOPD. Like any technology, it has a long lineage, but the citations for it typically go to the tech report for MiMo-V2-Flash, an open-weights model by Xiaomi.
This graphic from that paper breaks it down: after establishing a shared starting point, you train many specialist models, then use them as teachers to distill into one shared student - typically the same model you used as a starting point. So instead of taking your one model and training it on search, and then code, and then math etc etc, you do all that training in parallel.
Training in parallel gets you two benefits. One, just in general, parallelizing work makes it take less wall clock time. Like it takes the same amount of computer cycles, but for you as the researcher it only takes as long as the longest individual teacher. So that could be significant.
And two, you decouple the training of each teacher, so you can train in the way that is best for each domain. So if one is very SFT-heavy but another uses mostly RL, or even just different hyperparameters - basically the settings during training - you have that freedom. There’s no compromise, no sequential dependence, all starting from the same clean slate in the form of the SFT model.
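The wall-clock argument is easy to see with toy numbers. The teacher names and hour counts below are invented for illustration, not taken from either paper, and the sketch ignores the distillation step that comes afterwards.

```python
# Hypothetical training time per specialist teacher, in hours (illustrative only).
teacher_hours = {"search": 30.0, "code": 48.0, "math": 36.0}


def sequential_wall_clock(hours):
    """Train one domain after another on the same model: the times add up."""
    return sum(hours.values())


def parallel_wall_clock(hours):
    """Train every teacher at once: you only wait for the slowest one."""
    return max(hours.values())


total_compute_hours = sum(teacher_hours.values())  # identical in both setups

print(sequential_wall_clock(teacher_hours))  # 114.0
print(parallel_wall_clock(teacher_hours))    # 48.0
print(total_compute_hours)                   # 114.0
```

Same compute bill either way; the only thing that changes is how long the researcher waits, which here is set by the code teacher.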
Now I don’t want to worry about the exact details of the method here, because the Nemotron folks made some adjustments of their own, but the broad idea and the specific MOPD term come from here.
And in a final instance of parallelism, the MAI-Thinking-1 paper from last week also used multiple teachers, although it’s not on-policy distillation; they choose to make SFT data from the teachers and then train the student on that, rather than directly comparing next token predictions given the same inputs, which is how on-policy distillation works.
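Here is a minimal sketch of what "directly comparing next token predictions given the same inputs" can look like. It is not the loss from either paper: I am assuming a per-token reverse KL, KL(student || teacher), which is one common choice for on-policy distillation, and the function names and toy logits are hypothetical. The key point is that the positions being scored come from a trajectory the student sampled itself.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the next-token distribution at one position."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def on_policy_distill_loss(student_steps, teacher_steps):
    """Average per-token divergence along a student-sampled trajectory.

    Both lists hold one logit vector per generated position; teacher and
    student are assumed to have scored exactly the same prefix at each step.
    """
    losses = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(losses) / len(losses)


# Toy 3-token vocabulary, 2 positions (made-up logits).
student = [[2.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
teacher = [[2.0, 1.0, 0.0], [3.0, 0.0, 0.0]]
loss = on_policy_distill_loss(student, teacher)
print(loss > 0.0)  # True: the second position disagrees with the teacher
```

In the multi-teacher version you would pick which teacher supplies the logits based on the prompt's domain. The SFT-style alternative skips the comparison entirely and just trains the student on text the teachers wrote.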
The Paper
A quick note: given the volume of content and the interests of my audience, I glossed over most of the infra and training details. There is a lot of good detail in the paper, and I recommend reviewing it if those are your interests.
As usual for new model releases, we’re going to start off with architecture.
Nemotron 3 Ultra is a big model. It’s not as big as the biggest open-weights options, like DeepSeek V4 Pro at 1.6T parameters or Kimi K2.6 at 1T parameters, but it’s up there.
Notably, it has more active parameters than either of those models: 55B here, compared to 49B on DeepSeek and 32B on Kimi. That’s due primarily to the number of activated experts in the MoE layer, which is two or three times higher for Nemotron than for a typical model of this size. The more experts you activate, the more parameters you activate.
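A quick sketch of why activating more experts pushes up the active parameter count. The configuration below is entirely hypothetical (it is not Nemotron's, DeepSeek's, or Kimi's), and it lumps everything that is not a routed expert into one `shared_params` number.

```python
def moe_param_counts(n_moe_layers, n_experts, top_k, params_per_expert, shared_params):
    """Return (total, active) parameter counts for a simple MoE model.

    total  : every expert in every MoE layer, plus the shared weights
    active : only the top_k experts a token is routed to, plus the shared weights
    """
    total = shared_params + n_moe_layers * n_experts * params_per_expert
    active = shared_params + n_moe_layers * top_k * params_per_expert
    return total, active


# Hypothetical config: 40 MoE layers, 128 experts of 50M parameters each,
# and 10B parameters in everything else (attention, embeddings, and so on).
cfg = dict(n_moe_layers=40, n_experts=128,
           params_per_expert=50_000_000, shared_params=10_000_000_000)

total, active_k8 = moe_param_counts(top_k=8, **cfg)
_, active_k16 = moe_param_counts(top_k=16, **cfg)

print(total)       # 266000000000
print(active_k8)   # 26000000000
print(active_k16)  # 42000000000
```

Doubling `top_k` leaves the total untouched but doubles the expert share of the active count, which is the effect described above.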
The other factor is the sheer number of layers, which is higher than any of the other big open-weights models. And here I want to split out the impact of layer count on training vs on inference.
On training, the number of layers is mostly a function of your available compute. Adding layers provides diminishing returns, but it still adds returns, so if you can afford the compute you might as well go for it. And NVIDIA can certainly afford the compute.
On inference though, each layer is going to increase latency - there’s just more stuff to do for each token - and each attention layer is going to add to the KV cache, the “state of mind” of the model. NVIDIA combats this by mostly using Mamba instead of attention. See, whereas attention’s compute grows as the square of the input length and its KV cache grows with every token it reads, Mamba is a different architecture that uses a flat amount of memory, regardless of the input length. It’s kind of like having a fixed-length summary instead of just adding to your notes.
They do add the occasional attention layer in, which is similar to how many models nowadays interleave local and global attention, but mostly it’s Mamba.
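For a back-of-envelope feel for the memory side, here is a sketch comparing a KV cache, which stores keys and values for every past token, against a fixed-size recurrent state. The layer counts, head sizes, and state size are placeholders I picked for illustration, not Nemotron 3 Ultra's actual configuration.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Memory for cached keys and values: grows with every token in context."""
    # factor of 2 = one key vector and one value vector per head, layer, and token
    return 2 * seq_len * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem


def fixed_state_bytes(n_state_layers, state_elems_per_layer, bytes_per_elem=2):
    """Memory for a recurrent (Mamba-style) state: independent of sequence length."""
    return n_state_layers * state_elems_per_layer * bytes_per_elem


# Hypothetical models: 80 attention layers vs. a hybrid with only 8 of them.
for seq_len in (1_000, 100_000):
    all_attention = kv_cache_bytes(seq_len, n_attn_layers=80, n_kv_heads=8, head_dim=128)
    hybrid = (kv_cache_bytes(seq_len, n_attn_layers=8, n_kv_heads=8, head_dim=128)
              + fixed_state_bytes(n_state_layers=72, state_elems_per_layer=1_000_000))
    print(seq_len, all_attention, hybrid)
```

At short contexts the fixed state can actually cost more than the cache it replaces; the payoff shows up at the long contexts agents run at, where the all-attention cache keeps growing and the state does not.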
Now let’s talk about how they pretrain that model they’ve laid out.
I want to prime you with a reminder of what the Microsoft AI team did with their model, which was to use no synthetic pretraining data. We’re going to see a very different approach here, which is not a surprise when you think about it; NVIDIA has the GPUs, so they can make all the synthetic data they want. Of course at Scale we’re going to bring a skeptical eye to that synthetic data, and we know the Microsoft AI folks would agree.
They have plenty of natural data too, from the web and GitHub and the like, but they spend much more time in the paper describing how they make synthetic data than how they find or clean natural data. Again, a stark contrast to Microsoft’s approach.
Because there is so much detail in fact, I can only provide a summary of their synthetic pretraining data efforts. They include:
Generating question-answer pairs based on the training sets of many public benchmarks
Using the domain and difficulty distributions from those benchmarks to inspire even more QA pairs
Extracting facts from a Wikipedia dataset and turning them into QA pairs
Making chains of thought about moral scenarios
Pulling legal codes and case law to then summarize
Synthesizing random character profiles from a specialized model, Nemotron Persona, and inserting them into legal cases
Most of that is available for download by the way - another service NVIDIA has done for the research community.
Anyway, they end up pretraining on 20T tokens, shifting from diversity to quality between the two phases pictured here. So for example, “finepdfs-unfiltered” is in phase 1, but only the “medium” and “high” filtered splits are in phase 2. And just in general for all LLM training, you always want to increase in quality as you progress in training. In their case, they do that first quality ratchet after about 15T tokens of pretraining.
Here’s where the base model nets out, before post-training.
Competition in the base model space is a bit thin actually; it’s become less and less common to release one, partly for competitive reasons I think but also because base models have little or no safety training. So at minimum, the NVIDIA folks have done the research community a service.
However, I do take issue with this slide, because they left off their stiffest competition: DeepSeek V4 Pro, which we’ll see later when they compare post-trained models. Not sure why they did it, but it makes the results here much less meaningful.
Now that we have our base model, we can focus on post-training.
Our old friends SFT and RLVR are in here of course, but the fancy new thing is that multi-teacher on-policy distillation we touched on at the top.
The other unusual thing I’ll note here now but not dwell on later is MTP Boosting. MTP stands for “multi-token prediction”, and it does what the name implies - predicts multiple tokens at once instead of just one. That’s baked in from the start in the architecture and is part of pretraining, but it needs a little post-training of its own close to the end apparently. Don’t worry if the jargon in the yellow box escapes you.
Calling back explicitly to the Microsoft AI paper, the setup here is more complicated and less principled. Like they’re clearly building upon previous experiences rather than starting with a clean slate like the Microsoft folks did. Not better or worse necessarily, just a very different flavor in reading this paper. And of course the proof is in the pudding, but sadly the Microsoft AI model is not publicly available yet.
Anyway, on to supervised finetuning. I’m putting up all these other AI company logos because the researchers really went ham on synthetic SFT data, in terms of the domains and the content and so on but also in terms of the models generating that data.
Here is the entire list:
gpt-oss-120b
DeepSeek V3
DeepSeek V3.2
DeepSeek V3.2-Speciale
DeepSeek V4 Pro
Qwen3-30B
Qwen3-235B Instruct
Qwen3-235B Thinking
Qwen3-Coder-480B
MiniMax M2.1
MiniMax M2.5
GLM 5
GLM 5.1
That is a ton of models! And notably all open-weights models, surely self-hosted. The implication is that diverse training data produces more robust models, which is accurate, but they never give an explicit rationale. I did think it was a little odd that they purposely used some much older models though; in practice there’s no reason to use MiniMax M2.1 if you have MiniMax M2.5 on hand. But the principle is right.
This is in addition to models for filtering or judging data by the way. They use some models here for those purposes and also some internal models built for purpose.
As for the data itself, there’s a lot of it, and the researchers provide a lot of detail - albeit with many gaps in information.
The vast majority of the data is either pre-existing, taken from the training split of benchmarks they don’t plan to eval on, or synthesized. It’s a pretty incredible variety of data too. Like they have data for CUDA, their software stack for controlling their GPUs, as well as RTL, which is a hardware design language. Unique to NVIDIA from what I’ve seen but it makes complete sense.
By complete contrast, there are only two short paragraphs about their RLVR stage. They list the domains they target, like terminal use and instruction following and white collar workflows, but no specifics about the data. They don’t even specify which harnesses they use, just that they use many. It’s a bit baffling given the copious detail of the SFT section, although we’ll see more detail on the teacher-specific RL down the road.
Anyway, with RLVR done, we now have a student ready to transform into all the teachers. Although actually we have two students apparently, that Agentic SFT/RL box towards the bottom of the Prep column, and again that’s not really explained either. At minimum it’s from the same base model though.
So with our chatbot student and our agentic student, we’re ready to train some teachers. Each one gets its own particular training recipe, which is one of the virtues of MOPD, but it also makes for a nightmarish amount of content. So in the interest of time let me just note a couple commonalities.
One is the general use of RL over SFT. In particular, many of the training recipes use a technique from another recent NVIDIA paper called PivotRL, which takes SFT data and finds the most consequential steps - the pivots - and does RL from those points forward. It’s a nice way to squeeze more juice out of SFT data, basically turning one example into multiple examples.
Two is, again, the prevalence of synthetic data. They even have some tool called NeMo Data Designer they used for at least the Usability teacher. However, they do mention for the Office Work teacher that they bought data from a vendor called AfterQuery to help them on GDPval.
There are basically no numbers on their teacher training data, except for two places: 3.5k samples for the Coding teacher, and 40B tokens for the STEM teacher.
Now the distillation from teacher to student isn’t perfect, partly because the student’s weights have to optimize all these different skills at once, which inevitably leads to compromise - just like how no human can be an expert in everything.
So to quantify that compromise, they measure the pre-MOPD student, the student after both rounds of MOPD, and the teacher. The recovery is the improvement of the student divided by the improvement of the teacher. So in the first row for example, MOPD2 minus RLVR is 9.5, and Teacher minus RLVR is 5.5, and 9.5/5.5 is about 173% - the student exceeded the teacher.
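The recovery arithmetic as a small function, in case you want to redo it for other rows. The function name is mine, and since I am only quoting the gains from the table (9.5 and 5.5), the baseline score of 50.0 below is a placeholder chosen so the differences come out the same.

```python
def recovery(pre_mopd_score, post_mopd_score, teacher_score):
    """Student improvement divided by teacher improvement, both measured
    against the same pre-MOPD (RLVR) checkpoint."""
    student_gain = post_mopd_score - pre_mopd_score
    teacher_gain = teacher_score - pre_mopd_score
    if teacher_gain == 0:
        raise ValueError("teacher did not improve on the baseline; recovery is undefined")
    return student_gain / teacher_gain


# First-row gains from the talk: student +9.5, teacher +5.5 (baseline is a placeholder).
r = recovery(pre_mopd_score=50.0, post_mopd_score=59.5, teacher_score=55.5)
print(round(r * 100))  # 173
```

Anything above 100% means the student ended up ahead of its own teacher on that benchmark; anything near 0% means the distillation barely moved it.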
Most cases weren’t like that of course, but generally the student recovered most of the teacher’s performance, although in a couple cases recovery was quite poor.
They hypothesize that distillation works best when the student reasonably could have made the correct choice, and works poorly when the right choice was not at all a possibility. Like if the student just uses suboptimal terms for a web search tool call for example, it probably has the right search terms somewhere in its distribution of likely next tokens. But if the student is doing a hard math problem, there’s really no guarantee at all that the right next token will have any likelihood - like how I would have no shot of guessing the next step of an algebraic topology problem.
To be clear, that is speculation and maybe they just have a skill issue. But it seems plausible to me.
So after all that post-training, including a little polish we again had to cut for time, this is where we land across their chosen benchmarks.
I have a few comments to make. First is the key: I highlighted the winner of each benchmark, with a 1% band below the top score to allow for ties due to noise. Like there is no practical difference between an 82% and an 81.7% score on IFBench or whatever.
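The highlighting rule is simple enough to write down. I am assuming the band means one percentage point below the top score; the model names and scores here are invented, not read off the slide.

```python
def highlight_winners(scores, band=1.0):
    """Return every model within `band` points of the best score (ties by noise)."""
    top = max(scores.values())
    return sorted(name for name, score in scores.items() if score >= top - band)


# Made-up scores for a single benchmark.
example = {"model_a": 82.0, "model_b": 81.7, "model_c": 79.9}
print(highlight_winners(example))  # ['model_a', 'model_b']
```

With this rule a model "wins outright" only when it is the sole name returned, and "ties" when it shares the list with others.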
Second is the selection of models: it’s all the best open-weights models, no comparison at all to even the closed non-SOTA models, like Qwen3.7 or the latest Grok. Even here though, the range of total and active parameters is wide, with MiniMax M2.7 and DeepSeek V4 Flash on the small side on both. On the big side, for total parameters it’s Kimi K2.6 and DeepSeek V4 Pro, but on active parameters it’s actually Nemotron 3 Ultra at the top! As I mentioned on the architecture slide, it’s due to all those activated experts and all those layers.
Third is, unfortunately, the relative underperformance of Nemotron in my view. Like despite being the newest and also having the most active parameters, it only wins outright on one benchmark, and it’s one I’ve never heard of, with suspiciously bad performance for Qwen and both DeepSeeks. And one of the three it ties on, the Scale benchmark Multi-Challenge, is a tie between basically every model except MiniMax. Worse yet, Qwen is smaller on both dimensions, and is four months older, yet it places first in twice as many benchmarks.
Remember what I said about NVIDIA though: they’re satisficing on quality and spiking on throughput and cost.
That’s the story we see here. At the top in the key you’ll see two versions of Nemotron: the full-precision one, which stores its numbers in the format BF16; and the quantized one, which stores its numbers in the format NVFP4. The “NV” is short for NVIDIA; it’s a special format designed to work well with their GPUs, preserving most or all of the quality yet taking up only a fraction of the space.
NVIDIA actually trains the NVFP4 version natively, basically meaning that quantizing down from BF16 to NVFP4 remains high-fidelity. And they do so because it speeds up the model massively, particularly on decode, i.e. when you’re generating tokens. It helps with prefill too - that’s the part where you’re reading all the input - but for agents with these long trajectories, decode speed is crucial.
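To see why the smaller format matters so much for decode, here is a rough sketch of how many bytes of weights have to be read per generated token. The 55B active parameters come from the architecture slide; the rest is my assumption: that NVFP4 stores 4-bit values with one 8-bit scale shared per block of 16 weights, that the extra per-tensor scale is negligible, and that every active weight is quantized, which real deployments may not do.

```python
def bits_per_weight(element_bits, scale_bits, block_size):
    """Effective storage per weight when each block shares one scale factor."""
    return element_bits + scale_bits / block_size


def weight_bytes(n_params, bits):
    """Bytes needed to hold n_params weights at the given bits per weight."""
    return n_params * bits / 8


ACTIVE_PARAMS = 55_000_000_000  # active parameters per token, per the talk

bf16_bits = 16.0
# Assumed NVFP4 layout: 4-bit elements, 8-bit scale, 16 weights per block.
nvfp4_bits = bits_per_weight(element_bits=4, scale_bits=8, block_size=16)

bf16_gb = weight_bytes(ACTIVE_PARAMS, bf16_bits) / 1e9
nvfp4_gb = weight_bytes(ACTIVE_PARAMS, nvfp4_bits) / 1e9

print(nvfp4_bits)                    # 4.5
print(bf16_gb, nvfp4_gb)             # 110.0 30.9375
print(round(bf16_gb / nvfp4_gb, 2))  # 3.56
```

Decode is usually limited by how fast weights stream out of memory, so under these assumptions moving roughly 3.5x fewer bytes per token is where a big chunk of the speedup would come from. The numbers ignore the KV cache, activations, and batching.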
The relative comparisons on the right are a bit misleading, since all the models on the chart are in NVFP4, but it is true that NVIDIA engineered the hell outta their serving infrastructure to get that speedup.
My Takeaways
NVIDIA is the American open-weights champ
  Nemotron 3 Ultra is their best model yet
  I don’t know why it took them so long to embrace this position and put serious resources into near-SOTA models
NVIDIA is also a gift to the research community
  Base models, open pre-training and post-training data
  Not to mention the hardware, software, and work in other domains (e.g. robotics)
They rely too much on synthetic data
  MAI-Thinking-1 is the perfect counterpoint here
  Some Scale post-training data could help here 😉
MOPD is the new normal
  Still evolving somewhat but the high-level idea is firm