Nemotron 3 Ultra — who trains whom

A map of the post-training pipeline of NVIDIA's largest open model, built around MOPD (Multi-teacher On-Policy Distillation). For every student checkpoint and every teacher: what it's initialized from, the data and algorithm that built it, and how it feeds the next student.

How to read the arrows

initialized from (the only dashed arrows): weights forked; each arrow is painted in the colour of its source checkpoint (thick = the student backbone, thin = a teacher forked off a checkpoint), so you can trace any fork by its hue.
MOPD distillation (solid rose): a teacher grading the student's rollouts token-by-token, pulled into the student
GenRM reward (solid purple): the Ultra GenRM scores a teacher's RLHF
generates data (solid grey): out-of-family model
student line: each checkpoint its own colour (Base, SFT, RLVR, MOPD1, Final, Released, Agentic SFT/RL path)
and the helpers: general teacher, agentic teacher, Ultra GenRM (judge), external models

The pipeline

[Interactive pipeline graph: student checkpoints and teachers as nodes, connected by the arrow types listed above.]

What the red MOPD arrow actually computes

Every red arrow is the same operation, on-policy reverse-KL distillation, which is the core of MOPD. The direction of the KL is the whole trick, and it is easiest to see as a picture.

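One generation step = a distribution over what to write next. The teacher has a preferred distribution (here, two good continuations); we train the student to "match" it. What "match" means depends entirely on which way you point the KL. Forward KL, KL(teacher ‖ student), penalizes the student wherever the teacher has mass and the student does not, so a student that cannot fit both continuations spreads itself across them and the gap in between. Reverse KL, KL(student ‖ teacher), penalizes the student wherever it puts mass that the teacher does not, so the student commits to one good continuation.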
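The effect of the KL direction can be reproduced in a few lines. This is a minimal sketch under toy assumptions that are not from the report: the "tokens" are 21 positions on a line, the teacher is a mixture of two bumps, and the student is restricted to a single bump that can only choose where to sit. The names `kl`, `bump`, `grid`, `teacher`, `forward` and `reverse` are illustrative.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for two strictly positive distributions on the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def bump(grid, center, width=1.0):
    """A single-peaked distribution over the grid (a discretised Gaussian)."""
    w = np.exp(-0.5 * ((grid - center) / width) ** 2)
    return w / w.sum()

# 21 hypothetical next-token options laid out on a line (toy assumption).
grid = np.arange(21)

# Teacher: two good continuations, at positions 5 and 15 (5 slightly preferred).
teacher = 0.6 * bump(grid, 5) + 0.4 * bump(grid, 15)

# Student family: one bump that can only choose its position.
forward = [kl(teacher, bump(grid, c)) for c in grid]  # KL(teacher || student)
reverse = [kl(bump(grid, c), teacher) for c in grid]  # KL(student || teacher)

best_forward = int(np.argmin(forward))
best_reverse = int(np.argmin(reverse))

print("forward-KL student sits at", best_forward)
print("reverse-KL student sits at", best_reverse)
print("teacher mass at the forward-KL choice:", teacher[best_forward])
```

In this toy setting the forward-KL student settles between the two continuations, at a position the teacher almost never picks, while the reverse-KL student settles on the teacher's stronger continuation. A real language model is far more expressive than a single bump, so the sketch shows the direction of the pressure, not the size of the effect.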

The other word: “on-policy”

The teacher doesn't grade a fixed pile of teacher-written text — it grades the student's own rollouts. The model generates, and at every token the teacher says "here's how I'd have weighted the options there." So the student is corrected exactly on the situations it actually lands in, killing the usual gap where a model trained on teacher transcripts then wanders into states it never practiced. Practice your own mistakes, expert over your shoulder, every token.
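The generate, grade, update loop described above can be sketched on a toy problem. Everything here is an assumption made for illustration and not the report's implementation: the student is a single softmax over 8 tokens with no context, the teacher is a fixed distribution, and the update is a plain score-function (REINFORCE-style) gradient step on the sampled reverse-KL gap.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reverse_kl(s, t):
    """KL(student || teacher) for strictly positive distributions."""
    return float(np.sum(s * (np.log(s) - np.log(t))))

rng = np.random.default_rng(0)
vocab = 8
# Hypothetical teacher: strongly prefers tokens 0 and 1.
teacher = softmax(np.array([3.0, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
logits = np.zeros(vocab)  # toy student, starts uniform
kl_start = reverse_kl(softmax(logits), teacher)

for step in range(500):
    s = softmax(logits)
    # 1. The student generates: tokens are sampled from its own distribution.
    tokens = rng.choice(vocab, size=64, p=s)
    # 2. The teacher grades exactly those tokens.
    gap = np.log(s[tokens]) - np.log(teacher[tokens])
    # 3. Score-function gradient of E_student[gap]; subtracting the batch
    #    mean is a baseline that reduces variance.
    grad_logp = np.eye(vocab)[tokens] - s  # d log s(token) / d logits
    grad = ((gap - gap.mean())[:, None] * grad_logp).mean(axis=0)
    logits -= 0.5 * grad

kl_end = reverse_kl(softmax(logits), teacher)
print(f"reverse KL: {kl_start:.3f} -> {kl_end:.3f}")
```

The student is only ever corrected on tokens it sampled itself, which is the sense of "on-policy" here. In a full model the same gap would be computed at every position of a multi-token rollout.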

Bonus: you only need the teacher to score the student's tokens, never to hand over its full distribution on a fixed corpus — which is what makes running 10+ specialized teachers affordable.
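That claim can be checked numerically: averaging the log-probability gap over tokens sampled from the student estimates the reverse KL, and the teacher only has to return one number per sampled token. This is a sketch with made-up distributions over a hypothetical 50-token vocabulary. Whether MOPD uses exactly this estimator is an assumption, since the report's objective is not reproduced here.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = 50
student = softmax(rng.normal(size=vocab))  # made-up next-token distributions
teacher = softmax(rng.normal(size=vocab))

# Exact reverse KL: needs the teacher's full distribution (vocab numbers).
exact = float(np.sum(student * (np.log(student) - np.log(teacher))))

# Sampled version: the student draws tokens and the teacher returns a single
# log-probability for each one.
tokens = rng.choice(vocab, size=200_000, p=student)
teacher_scores = np.log(teacher[tokens])  # all the teacher has to send back
estimate = float(np.mean(np.log(student[tokens]) - teacher_scores))

print(f"exact reverse KL:  {exact:.4f}")
print(f"sampled estimate:  {estimate:.4f}")
```

The two numbers agree up to sampling noise, so the per-token teacher score carries enough signal to train against without shipping full distributions around.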

The same thing as a table

If the graph is busy, this is the precise version. Feeds → = which student this model is used to train.

Model | Init from | Data | Algorithm | Feeds →
Ultra Base | (from scratch) | 20T tokens, 2-phase (diversity→quality), synthetic-heavy web/code/math/legal | NVFP4 WSD pretrain + 1M-ctx CPT | SFT student
SFT student | Ultra Base | multi-domain SFT, distilled from the external committee | 2-stage supervised fine-tune (+ shared MTP) | RLVR student · GenRM · several teachers
RLVR student = self-teacher | SFT student | verifiable envs: math, code, JSON-schema, instr-following, agentic DB-state, search, long-ctx | unified RLVR (async GRPO, 16 rollouts) | Ultra MOPD1 · init for the general teachers (STEM, Chat, Instr-follow)
Ultra MOPD1 | RLVR student | the student's own rollouts, graded token-by-token by the iter-1 teachers | MOPD = on-policy reverse-KL distillation (+ warmup) | Ultra Final · init for iter-2 new teachers
Ultra Final | Ultra MOPD1 | student rollouts graded by iter-2 teachers | MOPD, iteration 2 | Released Ultra
Released Ultra | Ultra Final | | MTP boosting (head-only KL) + NVFP4 PTQ | (shipped)
Agentic SFT/RL path | Ultra Base (parallel to General SFT) | SFT on a blend of agentic data (intro, p3) | dedicated agentic SFT, then per-teacher RL | the agentic teachers (terminal, conv-tool, SWE, search, usability, safety)
External committee | out-of-family | DeepSeek-V3.2/V4-Pro · GPT-OSS-120B (also judge) · GLM-5/5.1 · Qwen3 · Kimi-K2 · Minimax-M2.5 | | generates SFT + some teacher data
Ultra GenRM (judge) | Ultra SFT (same family) | HelpSteer3 + LMArena prefs + synthetic safety | RLVR, learns to match human score + ranking; principle-following | reward signal for Chat & Instr-follow teachers
STEM teacher ↺ | RLVR student | DeepSeek-V4-Pro traces; gpt-oss-120b judge (science/math/code/proofs) | extra SFT + RL | MOPD1 → (reused) Final
Chat teacher | RLVR student (Fig 10) | LMArena/WildChat seeds; GLM-5 responses; best-of selected by Ultra GenRM | iterative RLHF (uses Ultra GenRM) | MOPD1
Instr-follow + Factuality ↺ | RLVR student | instr-follow + abstention + RLHF envs | domain RLVR (+RLHF “to avoid behavioral collapse”) | MOPD1 → (reused) Final
Terminal-use teacher ↺ | agentic SFT/RL path (init not stated) | long-timeout terminal trajectories (up to ~1h) | PivotRL + re-profiling | MOPD1 → (reused) Final
Conv tool-use teacher | agentic SFT/RL path (init not stated) | Super multi-turn tool-use recipe | PivotRL | MOPD1
SWE teacher | Ultra Base (agentic SFT path) | agentic SFT + live repo, hidden-test verifier (anti-cheat: gold-patch git history deleted) | SFT → PivotRL → end-to-end SWE-RL (binary reward) | MOPD1
Search teacher ↺ | an Ultra ckpt (agentic SFT path) | trajectories with context management (discard-all resets, summary compression) | SFT | MOPD1 → (reused) Final
Office / Workplace teacher | Ultra ckpt (post general-SFT) | AfterQuery GDPval-style deliverables | light SFT + PivotRL | MOPD1
Model-usability teacher | agentic SFT/RL path (init not stated) | Nemo Data Designer (gpt-oss-120b) structured outputs: JSON/YAML/XML/TOML/CSV, extraction | RL (Nemo-Gym) | MOPD1
Agentic-safety teacher ↺ | agentic SFT/RL path (init not stated) | indirect-prompt-injection tasks; red-team attacker = Nemotron-3 Super, defender = Nano | RL + deterministic verifier | MOPD1 → (reused) Final
★ Coding teacher | reasoning teacher | competitive coding (14K + 4K problems) | RL (+2.4 on LiveCodeBench v6) | Ultra Final
★ Chat teacher 2 | Ultra MOPD1 | refreshed chat + Ultra GenRM | iterative RLHF | Ultra Final
★ Conv tool-use teacher 2 | Ultra MOPD1 | sequential, dependent multi-step tool use | PivotRL | Ultra Final
★ SWE teacher 2 | Ultra MOPD1 | SWE repos | SFT → PivotRL → SWE-RL | Ultra Final
★ Office teacher 2 | Ultra MOPD1 | AfterQuery | SFT + PivotRL | Ultra Final

★ = teacher created fresh in iteration 2, initialized from Ultra MOPD1. ↺ = iteration-1 teacher reused in iteration 2. Source: NVIDIA Nemotron 3 Ultra technical report (2026-06-04), §3.3 + Figures 9–10, Tables 4–5. The animated curves in the visual are illustrative; equations are quoted from §3.3.1.

Accuracy note. Most edges are stated outright in the report; the MOPD structure, the iteration-2 teachers forking off Ultra MOPD1, the self-teacher, and the reverse-KL objective are all explicit. A few “initialized-from” links are inferred where the report is vague: it gives no init checkpoint for the terminal-use, conversational-tool-use, model-usability and agentic-safety teachers, and only “an Ultra checkpoint” for search — Figure 10 places these on a dedicated agentic SFT/RL path, drawn here forking from Ultra Base (its only documented anchor being the SWE teacher, SFT’d from the base model). The chat teacher’s init isn’t named in prose either; it’s placed on the RLVR student per Figure 10. Educational explainer — not affiliated with NVIDIA.