Nemotron 3 Ultra — who trains whom

A map of the post-training pipeline of NVIDIA's largest open model, built around MOPD — Multi-teacher On-Policy Distillation. For every student checkpoint and every teacher: what it's initialized from, the data and algorithm that built it, and how it feeds the next student. Click any node to light up its connections and read what its algorithm means.

How to read the arrows

initialized from: the only dashed arrows. Weights forked; each arrow is painted in the colour of its source checkpoint (thick = the student backbone, thin = a teacher forked off a checkpoint), so you can trace any fork by its hue.

MOPD distillation (solid rose): a teacher grading the student's rollouts token-by-token, pulled into the student

GenRM reward (solid purple): the Ultra GenRM scores a teacher's RLHF

generates data (solid grey): out-of-family model

student line, each checkpoint its own colour: Base · SFT · RLVR · MOPD1 · Final · Released · Agentic SFT/RL path

and the helpers: general teacher · agentic teacher · Ultra GenRM (judge) · external models

The pipeline


Click any node — its details pop up at the bottom of the viewer. Use ▾ to reduce the card to its title bar, ✕ (or click the node again) to close it.

What the red MOPD arrow actually computes

Every red arrow is the same operation, on-policy reverse-KL distillation, which is the core of MOPD. The direction of the KL is the whole trick, so it is explained twice: first as a picture in words, then as math.

One generation step = a distribution over what to write next. The teacher has a preferred distribution (here, two good continuations); we train the student to "match" it. What "match" means depends entirely on which way you point the KL: the forward direction pushes the student to spread mass over both continuations, while the reverse direction, the one MOPD uses, pushes it to commit to one.

The other word: “on-policy”

The teacher doesn't grade a fixed pile of teacher-written text — it grades the student's own rollouts. The model generates, and at every token the teacher says "here's how I'd have weighted the options there." So the student is corrected exactly on the situations it actually lands in, killing the usual gap where a model trained on teacher transcripts then wanders into states it never practiced. Practice your own mistakes, expert over your shoulder, every token.

Bonus: you only need the teacher to score the student's tokens, never to hand over its full distribution on a fixed corpus — which is what makes running 10+ specialized teachers affordable.
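A minimal sketch of that loop, under loud assumptions: `toy_student`, `toy_teacher`, `VOCAB` and `collect_on_policy_signal` are hypothetical names invented for illustration, and the two "models" are fixed three-token distributions rather than real networks. It only shows the shape of the data flow: the student samples each token, and the teacher is asked for one number per step, its log-probability of the token the student actually chose.

```python
import math
import random

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary


def toy_student(prefix):
    # Stand-in for pi_theta(. | s_t); a real model would condition on the prefix.
    return [0.5, 0.3, 0.2]


def toy_teacher(prefix):
    # Stand-in for pi^T(. | s_t).
    return [0.7, 0.25, 0.05]


def collect_on_policy_signal(prompt, horizon, student, teacher, rng):
    """Roll out the student and record the teacher-minus-student log-prob gap per token."""
    prefix = list(prompt)
    records = []
    for _ in range(horizon):
        p_student = student(prefix)
        idx = rng.choices(range(len(VOCAB)), weights=p_student, k=1)[0]
        # Only the sampled token is scored; the teacher never writes text of its own.
        gap = math.log(teacher(prefix)[idx]) - math.log(p_student[idx])
        records.append((VOCAB[idx], gap))
        prefix.append(VOCAB[idx])  # the next state is one the student itself reached
    return records


if __name__ == "__main__":
    for token, gap in collect_on_policy_signal(["q"], 5, toy_student, toy_teacher, random.Random(0)):
        print(token, round(gap, 3))
```

A positive gap marks a token the teacher likes more than the student currently does; a negative gap marks one it likes less.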

Student $\pi_\theta$, domain-$i$ teacher $\pi^{T_i}$. The student samples a completion $y=(y_1,\dots,y_H)$ for prompt $q$; let $s_t=(q,y_{\lt t})$ be the prefix. MOPD maximizes

$$\mathcal{J}_{\text{MOPD}}(\theta)=\sum_{i=1}^{N}\lambda_i\; \mathbb{E}_{q\sim\mathcal{D}_i,\;y\sim\pi_\theta(\cdot\mid q)} \!\left[\sum_{t=1}^{H}\Big(\log\pi^{T_i}(y_t\mid s_t)-\log\pi_\theta(y_t\mid s_t)\Big)\right]$$

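The report's key line: because each $y_t$ is sampled from the student itself, the expected value of the bracketed log-ratio at prefix $s_t$ is the negative reverse KL, so maximizing this is exactly, at every prefix $s_t$, minimizing the reverse KL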

$$\min_\theta\;D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid s_t)\,\big\|\,\pi^{T_i}(\cdot\mid s_t)\big) =\sum_{v}\pi_\theta(v\mid s_t)\,\log\frac{\pi_\theta(v\mid s_t)}{\pi^{T_i}(v\mid s_t)}.$$
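The step from the objective to the reverse KL can be spelled out for a single prefix. This is a short derivation sketch using only the symbols already defined; it fixes $s_t$ and takes the expectation over the student's own next-token sample.

```latex
$$\begin{aligned}
\mathbb{E}_{y_t\sim\pi_\theta(\cdot\mid s_t)}\Big[\log\pi^{T_i}(y_t\mid s_t)-\log\pi_\theta(y_t\mid s_t)\Big]
&=\sum_{v}\pi_\theta(v\mid s_t)\,\log\frac{\pi^{T_i}(v\mid s_t)}{\pi_\theta(v\mid s_t)}\\
&=-\,D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid s_t)\,\big\|\,\pi^{T_i}(\cdot\mid s_t)\big).
\end{aligned}$$
```

The sum is weighted by $\pi_\theta$ only because the token was drawn from the student; drawing it from the teacher instead would give the forward KL.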

Why "reverse" is the mode-seeker the visual shows — compare the two directions at one prefix:

forward (mass-covering): $\;D_{\mathrm{KL}}(\pi^{T}\,\|\,\pi_\theta)=\sum_v \pi^{T}(v)\,\log\frac{\pi^{T}(v)}{\pi_\theta(v)}$ — weighted by the teacher, so it punishes the student for missing any teacher mode → covers everything, hedges.

reverse (mode-seeking): $\;D_{\mathrm{KL}}(\pi_\theta\,\|\,\pi^{T})=\sum_v \pi_\theta(v)\,\log\frac{\pi_\theta(v)}{\pi^{T}(v)}$ — weighted by the student, so wherever the student puts mass the teacher must agree; $\pi^T\!\approx\!0\Rightarrow\pi_\theta\!\to\!0$ (zero-forcing) → commits to a mode.
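The two directions can be compared numerically on a toy case. The sketch below is illustrative only: the four-token teacher with two good continuations and the two candidate students ("committed" and "spread") are made-up numbers, not values from the report.

```python
import math


def kl(p, q):
    """D_KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical teacher with two good continuations (tokens 0 and 3).
teacher = [0.49, 0.01, 0.01, 0.49]
# Two hypothetical students: one commits to a single mode, one hedges over everything.
committed = [0.97, 0.01, 0.01, 0.01]
spread = [0.25, 0.25, 0.25, 0.25]

for name, student in [("committed", committed), ("spread", spread)]:
    forward = kl(teacher, student)  # weighted by the teacher
    reverse = kl(student, teacher)  # weighted by the student
    print(f"{name:9s} forward={forward:.3f} reverse={reverse:.3f}")
```

With these numbers the forward KL is lower for the spread student, while the reverse KL is lower for the committed one, which is the mass-covering versus mode-seeking split described above.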

In practice: asynchronous + clipped

At scale the rollouts come from a slightly stale snapshot $\pi_{\text{behav}}$ while the learner is already at $\pi_{\text{prox}}$. MOPD wraps the per-token reverse-KL signal in a PPO-style clipped surrogate:

$$\mathcal{J}_{\text{async-MOPD}}(\theta)= \mathbb{E}_{q\sim\mathcal{D}_i,\;y\sim\pi_{\text{behav}}} \!\left[\sum_{t=1}^{H} m_t\,c_t\, \min\!\big(r_t(\theta)\hat A_t,\;\operatorname{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat A_t\big)\right]$$

with per-token distillation advantage = the teacher-minus-student log-prob gap,

$$\hat A_t=\operatorname{sg}\!\big[\log\pi^{T_i}(y_t\mid s_t)-\log\pi_{\text{prox}}(y_t\mid s_t)\big],$$

$r_t(\theta)=\pi_\theta/\pi_{\text{prox}}$ the policy ratio, $c_t=\operatorname{sg}[\pi_{\text{prox}}/\pi_{\text{behav}}]$ the stale-sampler correction, $\operatorname{sg}[\cdot]$ stop-gradient, $m_t$ IcePop token-masking. Strip the bookkeeping and $\hat A_t$ is just "raise tokens the teacher likes more than you currently do" — the reverse-KL gradient. (Report §3.3.1, Eqs 1–3.)
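The per-token term inside the sum can be sketched in a few lines of plain Python. Assumptions, stated loudly: `async_mopd_token_term` is a hypothetical helper, `eps=0.2` is an arbitrary illustrative value, and the IcePop mask $m_t$ is simply passed in as a number because its rule is not given here. There is no autograd in this sketch, so the stop-gradient terms are just ordinary constants.

```python
import math


def async_mopd_token_term(logp_theta, logp_prox, logp_behav, logp_teacher,
                          eps=0.2, mask=1.0):
    """One token's contribution to the clipped surrogate (values only, no gradients)."""
    # Distillation advantage: teacher-minus-student log-prob gap (constant under sg[.]).
    advantage = logp_teacher - logp_prox
    # Policy ratio r_t = pi_theta / pi_prox.
    ratio = math.exp(logp_theta - logp_prox)
    # Stale-sampler correction c_t = pi_prox / pi_behav (constant under sg[.]).
    correction = math.exp(logp_prox - logp_behav)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return mask * correction * min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Fully on-policy case: theta == prox == behav, so the term equals the advantage.
    print(async_mopd_token_term(math.log(0.3), math.log(0.3), math.log(0.3), math.log(0.6)))
```

When the learner has already moved more than `eps` in the direction the advantage rewards, the clipped branch wins the `min` and the term stops growing, which is the usual PPO-style guard against over-updating on stale rollouts.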

The same thing as a table

If the graph is busy, this is the precise version. Feeds → = which student this model is used to train.

Model	Init from	Data	Algorithm	Feeds →
Ultra Base	— (from scratch)	20T tokens, 2-phase (diversity→quality), synthetic-heavy web/code/math/legal	NVFP4 WSD pretrain + 1M-ctx CPT	SFT student
SFT student	Ultra Base	multi-domain SFT, distilled from the external committee	2-stage supervised fine-tune (+ shared MTP)	RLVR student · GenRM · several teachers
RLVR student = self-teacher	SFT student	verifiable envs — math, code, JSON-schema, instr-following, agentic DB-state, search, long-ctx	unified RLVR (async GRPO, 16 rollouts)	Ultra MOPD1 · init for the general teachers (STEM, Chat, Instr-follow)
Ultra MOPD1	RLVR student	the student's own rollouts, graded token-by-token by the iter-1 teachers	MOPD = on-policy reverse-KL distillation (+ warmup)	Ultra Final · init for iter-2 new teachers
Ultra Final	Ultra MOPD1	student rollouts graded by iter-2 teachers	MOPD — iteration 2	Released Ultra
Released Ultra	Ultra Final	—	MTP boosting (head-only KL) + NVFP4 PTQ	(shipped)
Agentic SFT/RL path	Ultra Base (parallel to general SFT)	SFT on a blend of agentic data (intro, p3)	dedicated agentic SFT, then per-teacher RL	the agentic teachers (terminal, conv-tool, SWE, search, usability, safety)
External committee	out-of-family	DeepSeek-V3.2/V4-Pro · GPT-OSS-120B (also judge) · GLM-5/5.1 · Qwen3 · Kimi-K2 · Minimax-M2.5	—	generates SFT + some teacher data
Ultra GenRM judge	Ultra SFT (same family)	HelpSteer3 + LMArena prefs + synthetic safety	RLVR, learns to match human score + ranking; principle-following	reward signal for Chat & Instr-follow teachers
STEM teacher ↺	RLVR student	DeepSeek-V4-Pro traces; gpt-oss-120b judge (science/math/code/proofs)	extra SFT + RL	MOPD1 → (reused) Final
Chat teacher	RLVR student (Fig 10)	LMArena/WildChat seeds; GLM-5 responses; best-of selected by Ultra GenRM	iterative RLHF (uses Ultra GenRM)	MOPD1
Instr-follow + Factuality ↺	RLVR student	instr-follow + abstention + RLHF envs	domain RLVR (+RLHF “to avoid behavioral collapse”)	MOPD1 → (reused) Final
Terminal-use teacher ↺	agentic SFT/RL path (init not stated)	long-timeout terminal trajectories (up to ~1h)	PivotRL + re-profiling	MOPD1 → (reused) Final
Conv tool-use teacher	agentic SFT/RL path (init not stated)	Super multi-turn tool-use recipe	PivotRL	MOPD1
SWE teacher	Ultra Base (agentic SFT path)	agentic SFT + live repo, hidden-test verifier (anti-cheat: gold-patch git history deleted)	SFT → PivotRL → end-to-end SWE-RL (binary reward)	MOPD1
Search teacher ↺	an Ultra ckpt (agentic SFT path)	trajectories with context management (discard-all resets, summary compression)	SFT	MOPD1 → (reused) Final
Office / Workplace teacher	Ultra ckpt (post general-SFT)	AfterQuery GDPval-style deliverables	light SFT + PivotRL	MOPD1
Model-usability teacher	agentic SFT/RL path (init not stated)	Nemo Data Designer (gpt-oss-120b) structured outputs: JSON/YAML/XML/TOML/CSV, extraction	RL (Nemo-Gym)	MOPD1
Agentic-safety teacher ↺	agentic SFT/RL path (init not stated)	indirect-prompt-injection tasks; red-team attacker = Nemotron-3 Super, defender = Nano	RL + deterministic verifier	MOPD1 → (reused) Final
★ Coding teacher	reasoning teacher	competitive coding (14K + 4K problems)	RL (+2.4 on LiveCodeBench v6)	Ultra Final
★ Chat teacher 2	Ultra MOPD1	refreshed chat + Ultra GenRM	iterative RLHF	Ultra Final
★ Conv tool-use teacher 2	Ultra MOPD1	sequential, dependent multi-step tool use	PivotRL	Ultra Final
★ SWE teacher 2	Ultra MOPD1	SWE repos	SFT → PivotRL → SWE-RL	Ultra Final
★ Office teacher 2	Ultra MOPD1	AfterQuery	SFT + PivotRL	Ultra Final

★ = teacher created fresh in iteration 2, initialized from Ultra MOPD1 (the one exception in the table is the Coding teacher, which is listed as forked from the reasoning teacher). ↺ = iteration-1 teacher reused in iteration 2. Source: NVIDIA Nemotron 3 Ultra technical report (2026-06-04), §3.3 + Figures 9–10, Tables 4–5. The animated curves in the visual are illustrative; equations are quoted from §3.3.1.
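For readers who prefer to trace forks programmatically, the "Init from" column can be read as a parent map. The sketch below copies only a few rows from the table (the student line plus two iteration-2 teachers); `INIT_FROM` and `lineage` are hypothetical names for illustration, and the map is deliberately incomplete.

```python
# Init-from links copied from the table: child -> checkpoint it was forked from.
INIT_FROM = {
    "SFT student": "Ultra Base",
    "RLVR student": "SFT student",
    "Ultra MOPD1": "RLVR student",
    "Ultra Final": "Ultra MOPD1",
    "Released Ultra": "Ultra Final",
    "Chat teacher 2": "Ultra MOPD1",
    "SWE teacher 2": "Ultra MOPD1",
}


def lineage(model):
    """Follow init-from links until reaching a model with no recorded parent."""
    chain = [model]
    while chain[-1] in INIT_FROM:
        chain.append(INIT_FROM[chain[-1]])
    return chain


if __name__ == "__main__":
    print(" <- ".join(lineage("Released Ultra")))
    print(" <- ".join(lineage("Chat teacher 2")))
```

Note that this captures only weight initialization (the dashed arrows); the MOPD and GenRM edges are a separate relation and would need their own map.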

Accuracy note. Most edges are stated outright in the report; the MOPD structure, the iteration-2 teachers forking off Ultra MOPD1, the self-teacher, and the reverse-KL objective are all explicit. A few “initialized-from” links are inferred where the report is vague: it gives no init checkpoint for the terminal-use, conversational-tool-use, model-usability and agentic-safety teachers, and only “an Ultra checkpoint” for search — Figure 10 places these on a dedicated agentic SFT/RL path, drawn here forking from Ultra Base (its only documented anchor being the SWE teacher, SFT’d from the base model). The chat teacher’s init isn’t named in prose either; it’s placed on the RLVR student per Figure 10. Educational explainer — not affiliated with NVIDIA.