A map of the post-training pipeline of NVIDIA's largest open model, built around MOPD — Multi-teacher On-Policy Distillation. For every student checkpoint and every teacher: what it's initialized from, the data and algorithm that built it, and how it feeds the next student. Click any node to light up its connections and read what its algorithm means.
Every red arrow is the same operation — on-policy reverse-KL distillation — the core of MOPD. The direction of the KL is the whole trick, and it's really a picture. Toggle between the visual and the math.
One generation step = a distribution over what to write next. The teacher has a preferred distribution (here, two good continuations); we train the student to "match" it. What "match" means depends entirely on which way you point the KL — flip it and watch:
The teacher doesn't grade a fixed pile of teacher-written text — it grades the student's own rollouts. The model generates, and at every token the teacher says "here's how I'd have weighted the options there." So the student is corrected exactly on the situations it actually lands in, killing the usual gap where a model trained on teacher transcripts then wanders into states it never practiced. Practice your own mistakes, expert over your shoulder, every token.
Bonus: you only need the teacher to score the student's tokens, never to hand over its full distribution on a fixed corpus — which is what makes running 10+ specialized teachers affordable.
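The on-policy loop described above can be sketched on a toy problem. This is only an illustration under loud assumptions: the "models" are random bigram tables rather than language models, and the names `sample_rollout` and `per_token_signal` are hypothetical, not from the report. What it shows is the division of labor: the student generates, and the teacher is queried only for the log-probability of the tokens the student actually sampled.

```python
import numpy as np

def sample_rollout(student, start, length, rng):
    """Sample tokens from the student's own bigram table (on-policy)."""
    tokens, prev = [], start
    for _ in range(length):
        nxt = int(rng.choice(student.shape[1], p=student[prev]))
        tokens.append(nxt)
        prev = nxt
    return tokens

def per_token_signal(student, teacher, start, tokens):
    """Teacher-minus-student log-prob, evaluated only at the sampled tokens."""
    signal, prev = [], start
    for tok in tokens:
        gap = np.log(teacher[prev, tok]) - np.log(student[prev, tok])
        signal.append(float(gap))
        prev = tok
    return signal

rng = np.random.default_rng(0)
V = 4  # toy vocabulary size (assumption)
# Row = previous token, column = next-token probability.
student = rng.dirichlet(np.ones(V), size=V)
teacher = rng.dirichlet(np.ones(V), size=V)

rollout = sample_rollout(student, start=0, length=6, rng=rng)
signal = per_token_signal(student, teacher, 0, rollout)
```

A positive entry in `signal` marks a token the teacher rates higher than the student currently does; a negative entry marks one the student is over-weighting.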
Student \(\pi_\theta\), domain-\(i\) teacher \(\pi^{T_i}\) for \(i=1,\dots,N\), with prompt set \(\mathcal{D}_i\) and weight \(\lambda_i\) for each domain. The student samples a completion \(y=(y_1,\dots,y_H)\) for prompt \(q\); let \(s_t=(q,y_{\lt t})\) be the prefix. MOPD maximizes
$$\mathcal{J}_{\text{MOPD}}(\theta)=\sum_{i=1}^{N}\lambda_i\; \mathbb{E}_{q\sim\mathcal{D}_i,\;y\sim\pi_\theta(\cdot\mid q)} \!\left[\sum_{t=1}^{H}\Big(\log\pi^{T_i}(y_t\mid s_t)-\log\pi_\theta(y_t\mid s_t)\Big)\right]$$
The report's key line: maximizing this is exactly, at every prefix \(s_t\), minimizing the reverse KL
$$\min_\theta\;D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid s_t)\,\big\|\,\pi^{T_i}(\cdot\mid s_t)\big) =\sum_{v}\pi_\theta(v\mid s_t)\,\log\frac{\pi_\theta(v\mid s_t)}{\pi^{T_i}(v\mid s_t)}.$$
Why "reverse" is the mode-seeker the visual shows. Compare the two directions at one prefix:
forward (mass-covering): \(\;D_{\mathrm{KL}}(\pi^{T}\,\|\,\pi_\theta)=\sum_v \pi^{T}(v)\,\log\frac{\pi^{T}(v)}{\pi_\theta(v)}\) — weighted by the teacher, so it punishes the student for missing any teacher mode → covers everything, hedges.
reverse (mode-seeking): \(\;D_{\mathrm{KL}}(\pi_\theta\,\|\,\pi^{T})=\sum_v \pi_\theta(v)\,\log\frac{\pi_\theta(v)}{\pi^{T}(v)}\) — weighted by the student, so wherever the student puts mass the teacher must agree; \(\pi^T\!\approx\!0\Rightarrow\pi_\theta\!\to\!0\) (zero-forcing) → commits to a mode.
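A small numeric check makes the two directions concrete. The distributions below are invented for illustration (a four-token vocabulary with two good continuations), and the helper names are hypothetical. The sketch compares a student that commits to one mode against one that hedges uniformly, and also checks that the expected teacher-minus-student log-prob under the student equals the negative reverse KL.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for two discrete distributions with full support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def expected_log_gap(student, teacher):
    """E_{v ~ student}[log teacher(v) - log student(v)] at one prefix."""
    s, t = np.asarray(student, float), np.asarray(teacher, float)
    return float(np.sum(s * (np.log(t) - np.log(s))))

teacher = [0.49, 0.01, 0.01, 0.49]  # two good continuations: tokens 0 and 3
commit = [0.97, 0.01, 0.01, 0.01]   # student that picks a single mode
hedge = [0.25, 0.25, 0.25, 0.25]    # student that spreads mass everywhere

# Forward KL is weighted by the teacher; reverse KL is weighted by the student.
forward = {"commit": kl(teacher, commit), "hedge": kl(teacher, hedge)}
reverse = {"commit": kl(commit, teacher), "hedge": kl(hedge, teacher)}
```

With these numbers the forward KL scores `hedge` as the better student, while the reverse KL scores `commit` as the better one, which is the mass-covering versus mode-seeking contrast in the two bullets above.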
At scale the rollouts come from a slightly stale snapshot \(\pi_{\text{behav}}\) while the learner is already at \(\pi_{\text{prox}}\). MOPD wraps the per-token reverse-KL signal in a PPO-style clipped surrogate:
$$\mathcal{J}_{\text{async-MOPD}}(\theta)= \mathbb{E}_{q\sim\mathcal{D}_i,\;y\sim\pi_{\text{behav}}} \!\left[\sum_{t=1}^{H} m_t\,c_t\, \min\!\big(r_t(\theta)\hat A_t,\;\operatorname{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat A_t\big)\right]$$
with per-token distillation advantage = the teacher-minus-student log-prob gap,
$$\hat A_t=\operatorname{sg}\!\big[\log\pi^{T_i}(y_t\mid s_t)-\log\pi_{\text{prox}}(y_t\mid s_t)\big],$$
\(r_t(\theta)=\pi_\theta/\pi_{\text{prox}}\) the policy ratio, \(c_t=\operatorname{sg}[\pi_{\text{prox}}/\pi_{\text{behav}}]\) the stale-sampler correction, \(\operatorname{sg}[\cdot]\) stop-gradient, \(m_t\) IcePop token-masking. Strip the bookkeeping and \(\hat A_t\) is just "raise tokens the teacher likes more than you currently do": the reverse-KL gradient. (Report §3.3.1, Eqs 1–3.)
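The surrogate above can be evaluated for one rollout with a few lines of NumPy. This is a sketch under stated assumptions, not the report's implementation: the function name is hypothetical, `eps=0.2` is a placeholder value, the IcePop mask is taken as a precomputed 0/1 input because its rule is not described here, and NumPy has no autograd, so the stop-gradient on \(\hat A_t\) and \(c_t\) is only noted in comments.

```python
import numpy as np

def async_mopd_objective(logp_theta, logp_prox, logp_behav, logp_teacher,
                         mask, eps=0.2):
    """Value of the clipped surrogate for one rollout (per-token log-probs in).

    All array arguments hold the log-probability each policy assigns to the
    sampled token y_t at prefix s_t. `mask` is a 0/1 array standing in for
    the token mask m_t (assumed to be computed elsewhere).
    """
    logp_theta = np.asarray(logp_theta, float)
    logp_prox = np.asarray(logp_prox, float)
    logp_behav = np.asarray(logp_behav, float)
    logp_teacher = np.asarray(logp_teacher, float)
    mask = np.asarray(mask, float)

    adv = logp_teacher - logp_prox          # A_t: constant w.r.t. theta (sg)
    ratio = np.exp(logp_theta - logp_prox)  # r_t(theta)
    corr = np.exp(logp_prox - logp_behav)   # c_t: constant w.r.t. theta (sg)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    return float(np.sum(mask * corr * surrogate))
```

When all three policies coincide, the ratio and the correction are both 1 and the value reduces to the masked sum of \(\hat A_t\), i.e. the plain teacher-minus-student log-prob gap from the synchronous objective.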
If the graph is busy, this is the precise version. Feeds → = which model this one is used to train, initialize, or supply a signal to.
| Model | Init from | Data | Algorithm | Feeds → |
|---|---|---|---|---|
| Ultra Base | — (from scratch) | 20T tokens, 2-phase (diversity→quality), synthetic-heavy web/code/math/legal | NVFP4 WSD pretrain + 1M-ctx CPT | SFT student |
| SFT student | Ultra Base | multi-domain SFT, distilled from the external committee | 2-stage supervised fine-tune (+ shared MTP) | RLVR student · GenRM · several teachers |
| RLVR student = self-teacher | SFT student | verifiable envs — math, code, JSON-schema, instr-following, agentic DB-state, search, long-ctx | unified RLVR (async GRPO, 16 rollouts) | Ultra MOPD1 · init for the general teachers (STEM, Chat, Instr-follow) |
| Ultra MOPD1 | RLVR student | the student's own rollouts, graded token-by-token by the iter-1 teachers | MOPD = on-policy reverse-KL distillation (+ warmup) | Ultra Final · init for iter-2 new teachers |
| Ultra Final | Ultra MOPD1 | student rollouts graded by iter-2 teachers | MOPD — iteration 2 | Released Ultra |
| Released Ultra | Ultra Final | — | MTP boosting (head-only KL) + NVFP4 PTQ | (shipped) |
| Agentic SFT/RL path | Ultra Base (parallel to general SFT) | SFT on a blend of agentic data (report intro, p. 3) | dedicated agentic SFT, then per-teacher RL | the agentic teachers (terminal, conv-tool, SWE, search, usability, safety) |
| External committee | out-of-family | DeepSeek-V3.2/V4-Pro · GPT-OSS-120B (also judge) · GLM-5/5.1 · Qwen3 · Kimi-K2 · Minimax-M2.5 | — | generates SFT + some teacher data |
| Ultra GenRM judge | Ultra SFT (same family) | HelpSteer3 + LMArena prefs + synthetic safety | RLVR, learns to match human score + ranking; principle-following | reward signal for Chat & Instr-follow teachers |
| STEM teacher ↺ | RLVR student | DeepSeek-V4-Pro traces; gpt-oss-120b judge (science/math/code/proofs) | extra SFT + RL | MOPD1 → (reused) Final |
| Chat teacher | RLVR student (per Fig. 10) | LMArena/WildChat seeds; GLM-5 responses; best-of selected by Ultra GenRM | iterative RLHF (uses Ultra GenRM) | MOPD1 |
| Instr-follow + Factuality ↺ | RLVR student | instr-follow + abstention + RLHF envs | domain RLVR (+RLHF “to avoid behavioral collapse”) | MOPD1 → (reused) Final |
| Terminal-use teacher ↺ | agentic SFT/RL path (init not stated) | long-timeout terminal trajectories (up to ~1h) | PivotRL + re-profiling | MOPD1 → (reused) Final |
| Conv tool-use teacher | agentic SFT/RL path (init not stated) | Super multi-turn tool-use recipe | PivotRL | MOPD1 |
| SWE teacher | Ultra Base (agentic SFT path) | agentic SFT + live repo, hidden-test verifier (anti-cheat: gold-patch git history deleted) | SFT → PivotRL → end-to-end SWE-RL (binary reward) | MOPD1 |
| Search teacher ↺ | an Ultra ckpt (agentic SFT path) | trajectories with context management (discard-all resets, summary compression) | SFT | MOPD1 → (reused) Final |
| Office / Workplace teacher | Ultra ckpt (post general-SFT) | AfterQuery GDPval-style deliverables | light SFT + PivotRL | MOPD1 |
| Model-usability teacher | agentic SFT/RL path (init not stated) | Nemo Data Designer (gpt-oss-120b) structured outputs: JSON/YAML/XML/TOML/CSV, extraction | RL (Nemo-Gym) | MOPD1 |
| Agentic-safety teacher ↺ | agentic SFT/RL path (init not stated) | indirect-prompt-injection tasks; red-team attacker = Nemotron-3 Super, defender = Nano | RL + deterministic verifier | MOPD1 → (reused) Final |
| ★ Coding teacher | reasoning teacher | competitive coding (14K + 4K problems) | RL (+2.4 on LiveCodeBench v6) | Ultra Final |
| ★ Chat teacher 2 | Ultra MOPD1 | refreshed chat + Ultra GenRM | iterative RLHF | Ultra Final |
| ★ Conv tool-use teacher 2 | Ultra MOPD1 | sequential, dependent multi-step tool use | PivotRL | Ultra Final |
| ★ SWE teacher 2 | Ultra MOPD1 | SWE repos | SFT → PivotRL → SWE-RL | Ultra Final |
| ★ Office teacher 2 | Ultra MOPD1 | AfterQuery | SFT + PivotRL | Ultra Final |
★ = teacher created fresh in iteration 2, initialized from Ultra MOPD1 unless the table lists another init (the Coding teacher is listed as starting from the reasoning teacher). ↺ = iteration-1 teacher reused in iteration 2. Source: NVIDIA Nemotron 3 Ultra technical report (2026-06-04), §3.3 + Figures 9–10, Tables 4–5. The animated curves in the visual are illustrative; equations are quoted from §3.3.1.
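For readers who prefer the table as data, the "Init from" column for the main student chain can be written as a small mapping and walked back to the base model. This is only a convenience sketch: the dictionary copies a handful of rows from the table above, and the `lineage` helper is a hypothetical name, not something from the report.

```python
# Child -> parent, copied from the "Init from" column of the table.
INIT_FROM = {
    "SFT student": "Ultra Base",
    "RLVR student": "SFT student",
    "Ultra MOPD1": "RLVR student",
    "Ultra Final": "Ultra MOPD1",
    "Released Ultra": "Ultra Final",
    "Chat teacher 2": "Ultra MOPD1",
    "SWE teacher 2": "Ultra MOPD1",
}

def lineage(model, init_from=INIT_FROM):
    """Return the chain of checkpoints from the root down to `model`."""
    chain = [model]
    while chain[-1] in init_from:
        chain.append(init_from[chain[-1]])
    return chain[::-1]

released_chain = lineage("Released Ultra")
swe2_chain = lineage("SWE teacher 2")
```

Here `released_chain` runs Ultra Base, SFT student, RLVR student, Ultra MOPD1, Ultra Final, Released Ultra, and `swe2_chain` shows an iteration-2 teacher branching off at Ultra MOPD1.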
Accuracy note. Most edges are stated outright in the report; the MOPD structure, the iteration-2 teachers forking off Ultra MOPD1, the self-teacher, and the reverse-KL objective are all explicit. A few “initialized-from” links are inferred where the report is vague: it gives no init checkpoint for the terminal-use, conversational-tool-use, model-usability and agentic-safety teachers, and only “an Ultra checkpoint” for search — Figure 10 places these on a dedicated agentic SFT/RL path, drawn here forking from Ultra Base (its only documented anchor being the SWE teacher, SFT’d from the base model). The chat teacher’s init isn’t named in prose either; it’s placed on the RLVR student per Figure 10. Educational explainer — not affiliated with NVIDIA.