Week 6: Frontier Embodiment · Day 38
DreamerV3 — RL in a learned world model
This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.
LECTURE & READING
Glossary primer (12 min)
- DreamerV3: DeepMind 2023 (Hafner et al.). A general-purpose reinforcement learning (RL) algorithm that uses the same hyperparameters across 150+ tasks. It trains a world model from pixels, then does actor-critic RL inside it.
- Recurrent State-Space Model (RSSM): DreamerV3's world-model core. Has a deterministic state h_t and a stochastic state z_t. Predicts (h, z, reward, done) given an action.
- Imagination: Roll out the world model from a real state for K steps (typically 15) and train the actor and critic on the imagined data. Drastically improves sample efficiency.
- Symlog: sign(x) · log(1 + |x|). Used on rewards/returns for stability across magnitudes.
- Twohot encoding: Discretize critic targets into a categorical distribution. Reduces gradient noise.
- R²-Dreamer: Apr 2025 successor (Berkeley). "Real-Robot Dreamer." Explicit handling of action delays and observation lag for hardware deployment.
- Sample efficiency: Central to DreamerV3's appeal; it solves Atari with ~50× less data than PPO.
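The symlog and twohot entries above can be made concrete with a small sketch. This is an illustrative pure-Python version, not the code from the DreamerV3 repository: the function names (`symlog`, `symexp`, `twohot`, `decode`) and the five-bin layout are assumptions chosen for readability.

```python
import math

def symlog(x: float) -> float:
    """Compress magnitude symmetrically: sign(x) * log(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)

def symexp(y: float) -> float:
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)

def twohot(value: float, bins: list[float]) -> list[float]:
    """Encode a scalar as weights on its two neighbouring bins.

    `bins` must be sorted ascending. The weights sum to 1 and their
    bin-weighted average reproduces `value` (after clamping to the bin range).
    """
    value = min(max(value, bins[0]), bins[-1])  # clamp into the supported range
    probs = [0.0] * len(bins)
    hi = next(i for i, b in enumerate(bins) if b >= value)
    if bins[hi] == value:  # lands exactly on a bin: one-hot
        probs[hi] = 1.0
        return probs
    lo = hi - 1
    w_hi = (value - bins[lo]) / (bins[hi] - bins[lo])
    probs[lo] = 1.0 - w_hi
    probs[hi] = w_hi
    return probs

def decode(probs: list[float], bins: list[float]) -> float:
    """Expected bin value under the categorical distribution."""
    return sum(p * b for p, b in zip(probs, bins))

# Hypothetical bin layout for illustration only.
BINS = [-2.0, -1.0, 0.0, 1.0, 2.0]

# A critic target would be squashed with symlog, then twohot-encoded.
target = twohot(symlog(3.0), BINS)
recovered = symexp(decode(target, BINS))
```

Because the encoding is exact inside the bin range, decoding followed by `symexp` recovers the original value, while the network only ever sees a bounded categorical target.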
Real-world analogy
PPO (Day 22) is "try, see what happened, update." DreamerV3 is "try, see what happened, build a mental model, imagine a thousand more attempts, update from those." When real interaction is expensive (real robots), imagination is cheap.
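The "build a mental model, then imagine" loop can be sketched on a toy problem. Everything here is a stand-in chosen for illustration: the 1-D dynamics in `real_env_step`, the least-squares linear model (DreamerV3 uses an RSSM instead), and the hand-written policy are all assumptions. The point is only the data flow: a few real transitions fit a model, and the model then generates many imagined steps.

```python
import random

def real_env_step(state: float, action: float) -> tuple[float, float]:
    """Toy 'real' dynamics (hypothetical): reward is higher near state 0."""
    next_state = 0.9 * state + 0.5 * action
    return next_state, -abs(next_state)

def fit_linear_model(transitions):
    """Least-squares fit of next_state ~ a*state + b*action (normal equations)."""
    sxx = sum(s * s for s, a, n in transitions)
    sxa = sum(s * a for s, a, n in transitions)
    saa = sum(a * a for s, a, n in transitions)
    sxn = sum(s * n for s, a, n in transitions)
    san = sum(a * n for s, a, n in transitions)
    det = sxx * saa - sxa * sxa
    a_coef = (sxn * saa - sxa * san) / det
    b_coef = (sxx * san - sxa * sxn) / det
    return a_coef, b_coef

def imagine(model, start_state, policy, horizon=15):
    """Roll the learned model forward without touching the real environment."""
    a_coef, b_coef = model
    state, trajectory = start_state, []
    for _ in range(horizon):
        action = policy(state)
        state = a_coef * state + b_coef * action
        # Reward is computed analytically here; a learned reward head
        # would predict it in a real world-model agent.
        trajectory.append((state, -abs(state)))
    return trajectory

# 1) Collect a small batch of real experience with random actions.
rng = random.Random(0)
real_transitions, state = [], 1.0
for _ in range(50):
    action = rng.uniform(-1.0, 1.0)
    next_state, _ = real_env_step(state, action)
    real_transitions.append((state, action, next_state))
    state = next_state

# 2) Fit the model, then 3) imagine from every real state.
model = fit_linear_model(real_transitions)
policy = lambda s: max(-1.0, min(1.0, -s))  # hypothetical policy: push toward 0
imagined = [imagine(model, s0, policy) for s0, _, _ in real_transitions]
imagined_steps = sum(len(t) for t in imagined)
```

Each real state seeds a 15-step imagined rollout, so the actor and critic see 15 imagined steps per real step in this sketch.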
Hour 1 — Reading
- DreamerV3 paper, sections 1–3 (~30 min): https://arxiv.org/abs/2301.04104
- DreamerV3 blog (~10 min): https://danijar.com/project/dreamerv3/
- R²-Dreamer paper, sections 1–3 (~25 min): https://arxiv.org/abs/2504.xxxxx (or search "R2-Dreamer arxiv")
Hour 2 — Read the JAX implementation
cd ~/robo47-wm
git clone https://github.com/danijar/dreamerv3
cd dreamerv3
- Read in this order (~30 min):
  - dreamerv3/agent.py: top-level agent, RSSM + actor + critic
  - dreamerv3/jaxnets.py: RSSM forward
  - dreamerv3/jaxutils.py: symlog, twohot helpers
LAB
Hour 3 — Lab: train DreamerV3 on a control task (90 min wall-clock)
What you're building. Train DreamerV3 on dmc_walker_walk (DeepMind Control Suite, Walker-2D walking). Compare wall-clock time and sample efficiency against Day 22's PPO.
What success looks like.
1. DreamerV3 trains for 1M env steps and reaches mean reward ≥ 700 (out of ~1000 max).
2. The PPO baseline on the same task reaches a similar reward but takes ~5× more env steps.
3. Plot figures/day38_dreamer_vs_ppo.png showing both learning curves.
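One way to check the "~5× more env steps" criterion is to measure when each learning curve first crosses the reward threshold. The helper below is a minimal sketch under assumed inputs: it expects parallel lists of env steps and mean rewards, which you would load from your own logs. The function names and the numbers in the usage lines are placeholders, not results.

```python
def moving_average(values, window):
    """Trailing moving average; early points use however many values exist."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def steps_to_threshold(steps, rewards, threshold=700.0, window=3):
    """First env step at which the smoothed reward reaches `threshold`, else None."""
    for step, reward in zip(steps, moving_average(rewards, window)):
        if reward >= threshold:
            return step
    return None

def step_ratio(baseline_steps, candidate_steps):
    """How many times more env steps the baseline needed; None if either never crossed."""
    if baseline_steps is None or candidate_steps is None:
        return None
    return baseline_steps / candidate_steps

# Placeholder curves for illustration only; replace with values from your runs.
example_steps = [1, 2, 3, 4]
example_rewards = [200.0, 600.0, 800.0, 900.0]
crossing = steps_to_threshold(example_steps, example_rewards)
```

Smoothing before thresholding avoids declaring success on a single lucky evaluation; the window size is a judgment call and should be reported alongside the ratio.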
Step 1 — Install + smoke test (10 min)
cd ~/robo47-wm/dreamerv3
uv pip install dm_control
uv pip install -r requirements.txt
Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.