PPO foundations + Abbeel primer + cart-pole from scratch

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (15 min)

Reinforcement Learning (RL) — Learning a policy π(a|s) that maximizes expected return E[Σγᵗ rₜ] from interaction with an environment.
Markov Decision Process (MDP) — (S, A, P, r, γ): states, actions, transition kernel, reward, discount.
Policy gradient — Update θ ← θ + α ∇_θ J(θ) where J(θ) = E[Σγᵗ rₜ]. The fundamental RL gradient.
Advantage A(s, a) — How much better a is than average at s. A = Q − V.
GAE (Generalized Advantage Estimation) — Smoothed advantage estimator: Â_t = Σ (γλ)^l δ_{t+l} where δ_t = r_t + γV(s_{t+1}) − V(s_t). Trades bias and variance via λ.
PPO (Proximal Policy Optimization) — Schulman et al., 2017. Clips the probability ratio r_t = π(a|s)/π_old(a|s) to prevent destructive updates. Default RL algorithm in 2026.
Clip ratio ε — Typically 0.2. The per-step objective is min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the probability ratio above (not the reward).
Value function V(s) — Critic. Predicts expected return from s.
Entropy bonus — Encourages exploration by rewarding uncertain policies. Coefficient 0.01 typical.
Rollout — Collect N steps of experience by running the policy in env(s).
Vectorized envs — Run K envs in parallel to amortize the policy's forward pass.
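
The GAE entry above can be sketched as a single backward pass over one rollout. This is a minimal pure-Python illustration, not the lab's reference implementation: the function name compute_gae, the dones flag convention, and the defaults gamma=0.99 and lam=0.95 are assumptions chosen for the example. It also treats every done as a true termination, so time-limit truncation is not handled separately.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout of length T.

    rewards[t] is r_t, values[t] is V(s_t), and dones[t] is True when the
    episode ended after step t. last_value is V(s_T), used to bootstrap
    the final step. All names and defaults here are illustrative.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    # Walk backwards so each step can reuse the accumulated sum from t+1.
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with no bootstrap past a terminal step
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}, reset at episode boundaries
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    # Targets for the critic: advantage plus the value baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Three steps of reward 1 with a flat critic at 0.5 and no discounting:
adv, ret = compute_gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], 0.5,
                       [False, False, False], gamma=1.0, lam=1.0)
# adv is [3.0, 2.0, 1.0]; with lam=0.0 it collapses to the one-step deltas [1.0, 1.0, 1.0]
```

Setting lam=0 gives the one-step TD error (low variance, more bias from the critic), while lam=1 sums all future deltas (less bias, more variance), which is the trade-off the glossary entry describes.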

Real-world analogy

PPO is "try a slightly different policy, but only step in the direction of improvement if the new policy isn't too different from the old one." The clipping is a safeguard: without it, one lucky-but-low-probability sequence of actions can yank the policy into a region from which it never recovers.
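
To make the clipping concrete, here is a small sketch of the per-sample clipped objective for a single (state, action) pair, written in plain Python rather than PyTorch so it runs anywhere. The function name clipped_surrogate and its arguments are hypothetical; a real implementation would operate on batched tensors and negate the mean to get a loss.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    logp_new and logp_old are log pi(a|s) under the current and the
    rollout-time policy; r is their probability ratio.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the min makes the objective pessimistic: it never credits the
    # policy for moving the ratio beyond the clip range in the helpful direction.
    return min(ratio * advantage, clipped_ratio * advantage)


# A lucky action (A = +1) whose probability the new policy raised by 50%:
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)    # about 1.2, not 1.5
# A bad action (A = -1) whose probability the new policy raised by 50%:
penalty = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)  # about -1.5, unclipped
```

The asymmetry is the point: gains from pushing the ratio outside [1−ε, 1+ε] are capped, so there is no incentive to keep moving, while changes that make things worse are still penalized in full.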

Hour 1 — Abbeel Deep RL primer

Watch Foundations of Deep RL, Lecture 4 — TRPO and PPO

Pieter Abbeel's pacing on this is excellent. Watch at 1.25× if comfortable. The single most useful 30 minutes for understanding PPO.

Hour 2 — Spinning Up + 37 Implementation Details

Spinning Up in Deep RL — PPO page (~25 min): https://spinningup.openai.com/en/latest/algorithms/ppo.html
37 Implementation Details of PPO (Huang et al., ICLR 2022 blog) — read first 15 details (~30 min): https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/

The "37 details" blog is what separates "PPO works" from "PPO doesn't work" in your code. Read it before writing PPO from scratch.
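
As a taste of how small these details are, one that the post covers is normalizing advantages within each minibatch before computing the policy loss. The sketch below is a plain-Python illustration with a hypothetical helper name; it uses the population standard deviation, whereas tensor libraries may default to the sample (n−1) version, so numbers can differ slightly from a PyTorch implementation.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale one minibatch of advantages to zero mean, unit std.

    eps guards against division by zero when all advantages are equal.
    The helper name and the eps default are assumptions for this example.
    """
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n  # population variance
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in advantages]


normalized = normalize_advantages([1.0, 2.0, 3.0])
# The middle element maps to 0.0 and the outer two are symmetric around it.
```

Normalizing keeps the scale of the policy gradient roughly constant across minibatches, which makes a fixed learning rate and clip ratio behave more predictably.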

LAB

Hour 3 — Lab: PPO from scratch on CartPole-v1 (90 min)

What you're building. A ~250-line PPO implementation in pure PyTorch (no Stable-Baselines3, no CleanRL) that solves Gymnasium's CartPole-v1 to return = 500 within 100k env steps. You'll log learning curves and compare them against the Stable-Baselines3 baseline.

What success looks like at the end. You have:
1. w4-rl/src/day22_ppo_cartpole.py (~250 lines).
2. CartPole-v1 reaches return 500 (the env's max) within 100k steps for ≥ 2 of 3 seeds.
3. figures/day22_ppo_curves.png showing return vs env steps.
4. Comparison: SB3 PPO trained on the same env reaches 500 in similar wall-clock; you're not faster, but you understand it.
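
The seed criterion in item 2 can be checked mechanically once you have logged curves. The following is a sketch under assumptions: the helper names, the curve format (a list of (env_step, mean_return) pairs per seed), and the numbers in the example are all made up for illustration and are not results from the lab.

```python
def first_step_reaching(curve, target=500.0):
    """Return the first env step at which the logged return hits target, else None.

    curve is a list of (env_step, mean_episodic_return) pairs sorted by env_step.
    """
    for step, ret in curve:
        if ret >= target:
            return step
    return None


def lab_passes(curves_by_seed, target=500.0, budget=100_000, min_seeds=2):
    """True if at least min_seeds seeds reach target within budget env steps."""
    solved = 0
    for curve in curves_by_seed.values():
        step = first_step_reaching(curve, target)
        if step is not None and step <= budget:
            solved += 1
    return solved >= min_seeds


# Hypothetical logs for three seeds (illustrative values only).
example_curves = {
    0: [(50_000, 300.0), (80_000, 500.0)],
    1: [(90_000, 500.0)],
    2: [(100_000, 450.0), (120_000, 500.0)],  # reaches 500 too late
}
passed = lab_passes(example_curves)  # seeds 0 and 1 qualify, so this is True
```

Deciding up front how "reaches 500" is measured (for example, a mean over recent episodes rather than a single lucky episode) keeps the comparison with the SB3 run fair.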

Step 1 — Install (5 min)

mkdir -p ~/robo47/w4-rl && cd ~/robo47/w4-rl
uv venv --python 3.12 .venv && source .venv/bin/activate
uv pip install torch "gymnasium[classic-control]" stable-baselines3 wandb matplotlib numpy

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.

Completion controls unlock when this day graduates from placeholder to full lab.