DreamerV3 — RL in a learned world model

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (12 min)

DreamerV3 — DeepMind 2023 (Hafner et al.). Universal algorithm: same hyperparameters across 150+ tasks. Trains a world model from pixels, then runs PPO-style actor-critic RL inside it.
Recurrent State-Space Model (RSSM) — DreamerV3's world-model core. Combines a deterministic recurrent state h_t with a stochastic latent z_t. Predicts (h, z, reward, done) given the previous state and action.
Imagination — Roll out the world model from a real state for K steps (typically 15) and train the actor and critic on the imagined data. Drastically improves sample efficiency.
Symlog — sign(x) · log(1 + |x|). Used on rewards/returns for stability across magnitudes.
Twohot encoding — Discretize critic targets into a categorical distribution over fixed bins, with weight on the two bins nearest the target. Reduces the sensitivity of gradients to target scale.
R²-Dreamer — Apr 2025 successor (Berkeley). "Real-Robot Dreamer." Explicit handling of delays and lag for real hardware.
Sample efficiency — How much real environment interaction an agent needs to reach a given return. Crucial to DreamerV3's appeal: solves Atari with ~50× less data than PPO.
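
The Symlog and Twohot entries above are easier to follow with numbers in hand. The following is a minimal standard-library sketch, not the repository's code: the function names, the `BINS` grid, and the clipping behavior are assumptions chosen for illustration.

```python
import math

def symlog(x: float) -> float:
    """sign(x) * log(1 + |x|): roughly linear near 0, logarithmic for large |x|."""
    return math.copysign(math.log1p(abs(x)), x)

def symexp(y: float) -> float:
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)

def twohot(value: float, bins: list) -> list:
    """Encode a scalar as weights over sorted ascending bins.

    All weight goes to the two bins that bracket the value, split so that
    the weighted sum of the bins reproduces the value. Values outside the
    bin range are clipped to the nearest edge (an assumption of this sketch).
    """
    value = min(max(value, bins[0]), bins[-1])
    weights = [0.0] * len(bins)
    lo = max(i for i, b in enumerate(bins) if b <= value)  # last bin <= value
    if lo == len(bins) - 1:
        weights[lo] = 1.0
        return weights
    hi = lo + 1
    frac = (value - bins[lo]) / (bins[hi] - bins[lo])
    weights[lo] = 1.0 - frac
    weights[hi] = frac
    return weights

def decode(weights: list, bins: list) -> float:
    """Expected bin value under the categorical weights."""
    return sum(w * b for w, b in zip(weights, bins))

# Illustrative bin grid in symlog space (hypothetical, kept small for readability).
BINS = [float(b) for b in range(-20, 21)]

if __name__ == "__main__":
    target = 5.0                                 # a raw return
    weights = twohot(symlog(target), BINS)       # categorical critic target
    recovered = symexp(decode(weights, BINS))    # back to the raw scale
    print(recovered)
```

The critic would be trained with a cross-entropy loss against `weights` rather than a squared error against `target`, which is the point of the encoding.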

Real-world analogy

PPO (Day 22) is "try, see what happened, update." DreamerV3 is "try, see what happened, build a mental model, imagine a thousand more attempts, update from those." When real interaction is expensive (real robots), imagination is cheap.
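
The "imagine a thousand more attempts" step can be sketched as a short rollout loop. This is a toy under stated assumptions: `toy_model` and `toy_policy` are hypothetical stand-ins for the learned world model and the actor, states are plain floats rather than RSSM latents, and a plain discounted return stands in for whatever return estimate the real agent uses.

```python
from typing import Callable, List, Tuple

# (state, action, predicted reward) for one imagined step
Transition = Tuple[float, float, float]

def imagine(model: Callable, policy: Callable, start: float,
            horizon: int = 15) -> List[Transition]:
    """Roll the model forward from a real start state without touching the env."""
    trajectory = []
    state = start
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = model(state, action)  # model prediction, not a real step
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

def discounted_return(trajectory: List[Transition], gamma: float = 0.99) -> float:
    """Sum of gamma^t * reward_t, accumulated backwards over the trajectory."""
    total = 0.0
    for _, _, reward in reversed(trajectory):
        total = reward + gamma * total
    return total

def toy_model(state: float, action: float) -> Tuple[float, float]:
    """Hypothetical dynamics: the state decays and is pushed by the action."""
    next_state = 0.9 * state + action
    return next_state, -abs(next_state)  # reward is higher closer to 0

def toy_policy(state: float) -> float:
    """Hypothetical actor: push the state back toward 0."""
    return -0.5 * state

if __name__ == "__main__":
    # Each real state can seed many cheap imagined rollouts.
    real_states = [1.0, -2.0, 0.5]
    for s in real_states:
        rollout = imagine(toy_model, toy_policy, s)
        print(s, discounted_return(rollout))
```

Only the `real_states` list costs environment interaction; everything inside `imagine` is computed from the model, which is where the data savings come from.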

Hour 1 — Reading

DreamerV3 paper, sections 1–3 (~30 min): https://arxiv.org/abs/2301.04104
DreamerV3 blog (~10 min): https://danijar.com/project/dreamerv3/
R²-Dreamer paper, sections 1–3 (~25 min): https://arxiv.org/abs/2504.xxxxx (or search "R2-Dreamer arxiv")

Hour 2 — Read the JAX implementation

cd ~/robo47-wm
git clone https://github.com/danijar/dreamerv3
cd dreamerv3

Read in this order (~30 min):
dreamerv3/agent.py — top-level agent, RSSM + actor + critic
dreamerv3/jaxnets.py — RSSM forward
dreamerv3/jaxutils.py — symlog, twohot helpers

LAB

Hour 3 — Lab: train DreamerV3 on a control task (90 min wall-clock)

What you're building. Train DreamerV3 on dmc_walker_walk (DeepMind Control Suite, Walker-2D walking). Compare wall-clock time and sample efficiency against Day 22's PPO.

What success looks like.
1. DreamerV3 trains for 1M env steps and reaches a mean episode return ≥ 700 (out of ~1000 max).
2. PPO on the same task reaches a similar return but takes ~5× more env steps.
3. figures/day38_dreamer_vs_ppo.png plots both learning curves.
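
One way to check the env-step comparison in the success criteria is to compare the first step count at which each learning curve crosses the return threshold. The sketch below assumes each curve is a list of (env_steps, mean_return) pairs sorted by env_steps; the function names are hypothetical and the log format of either training run is not specified here.

```python
from typing import List, Optional, Tuple

Curve = List[Tuple[int, float]]  # (env_steps, mean_return), sorted by env_steps

def steps_to_threshold(curve: Curve, threshold: float) -> Optional[int]:
    """First env-step count at which the mean return reaches the threshold."""
    for steps, mean_return in curve:
        if mean_return >= threshold:
            return steps
    return None  # the run never reached the threshold

def sample_efficiency_ratio(dreamer: Curve, ppo: Curve,
                            threshold: float = 700.0) -> Optional[float]:
    """How many times more env steps PPO needed than DreamerV3, if both got there."""
    d = steps_to_threshold(dreamer, threshold)
    p = steps_to_threshold(ppo, threshold)
    if d is None or p is None or d == 0:
        return None
    return p / d

if __name__ == "__main__":
    # Placeholder numbers for demonstration only, NOT measured results.
    dreamer_curve = [(0, 20.0), (250_000, 400.0), (500_000, 720.0)]
    ppo_curve = [(0, 20.0), (1_000_000, 300.0), (2_500_000, 710.0)]
    print(sample_efficiency_ratio(dreamer_curve, ppo_curve))
```

The same two lists can be handed to any plotting library to produce the learning-curve figure.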

Step 1 — Install + smoke test (10 min)

cd ~/robo47-wm/dreamerv3
uv pip install dm_control
uv pip install -r requirements.txt
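
Before launching a long run, it may help to confirm that the installs above are visible to the interpreter. The following is a small sketch, assuming `jax` and `dm_control` are the top-level module names to look for; the `missing_modules` helper and the `REQUIRED` list are hypothetical and may need adjusting to match requirements.txt.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumed module names; edit to match what requirements.txt actually installs.
REQUIRED = ["jax", "dm_control"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("missing modules:", ", ".join(missing))
    else:
        print("all required modules found")
```

This only checks that the modules can be located, not that a simulator or accelerator backend works, so a short training run is still the real smoke test.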

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.

Completion controls unlock when this day graduates from placeholder to full lab.