Day 22
PPO foundations + Abbeel primer + cart-pole from scratch
This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.
LECTURE & READING
Glossary primer (15 min)
- Reinforcement Learning (RL): Learning a policy π(a|s) that maximizes the expected return E[Σ γᵗ rₜ] from interaction with an environment.
- Markov Decision Process (MDP): The tuple (S, A, P, r, γ): states, actions, transition kernel, reward, discount.
- Policy gradient: Update θ ← θ + α ∇_θ J(θ), where J(θ) = E[Σ γᵗ rₜ]. The fundamental RL gradient.
- Advantage A(s, a): How much better action a is than average at state s. A = Q − V.
- GAE (Generalized Advantage Estimation): Smoothed advantage estimator Â_t = Σ_{l≥0} (γλ)^l δ_{t+l}, where δ_t = r_t + γV(s_{t+1}) − V(s_t). Trades bias against variance via λ.
- PPO (Proximal Policy Optimization): Schulman et al., 2017. Clips the policy ratio r_t = π(a|s)/π_old(a|s) (a probability ratio, not the reward r_t) to prevent destructive updates. Default RL algorithm in 2026.
- Clip ratio ε: Typically 0.2. The clipped objective is min(r·A, clip(r, 1−ε, 1+ε)·A).
- Value function V(s): Critic. Predicts expected return from state s.
- Entropy bonus: Encourages exploration by rewarding uncertain policies. Coefficient 0.01 typical.
- Rollout: Collect N steps of trajectory by running the policy in env(s).
- Vectorized envs: Run K parallel envs simultaneously to amortize the policy forward pass.
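The GAE entry in the glossary can be made concrete with a short sketch. The version below computes Â_t with the usual backward recursion, which is equivalent to the sum Σ (γλ)^l δ_{t+l} truncated at the end of the rollout. It uses plain Python lists rather than PyTorch tensors to keep the arithmetic visible; the function name `compute_gae`, its argument names, and the `dones` convention are assumptions made for illustration, not part of the curriculum source.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return GAE advantages for one rollout of length T.

    rewards[t] : reward received after acting at step t
    values[t]  : critic estimate V(s_t)
    last_value : V(s_T), the bootstrap value for the state after the last step
    dones[t]   : True if the episode terminated at step t (assumed convention)
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        # Do not bootstrap or accumulate across an episode boundary.
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages


if __name__ == "__main__":
    # With gamma = lam = 1 and a zero critic, GAE reduces to the reward-to-go.
    print(compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0,
                      [False, False, True], gamma=1.0, lam=1.0))
    # prints [3.0, 2.0, 1.0]
```

Setting λ = 0 recovers the one-step TD error (low variance, more bias), and λ = 1 recovers the Monte Carlo return minus the value baseline (less bias, more variance), which is the trade-off the glossary entry describes.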
Real-world analogy
PPO is "try a slightly different policy, but only step in the direction of improvement if the new policy isn't too different from the old one." The clipping is a guardrail: without it, one lucky but low-probability sequence of actions can yank the policy into a region from which it never recovers.
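To see the guardrail numerically, here is a minimal sketch of the per-sample clipped objective min(r·A, clip(r, 1−ε, 1+ε)·A) in plain Python. The function name `clipped_surrogate` is hypothetical; a real implementation would apply the same expression to batched PyTorch tensors and take the mean.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped objective (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Positive advantage: pushing the ratio past 1 + eps earns nothing extra,
    # so the objective is flat there and the gradient vanishes.
    for ratio in (1.0, 1.2, 1.5, 3.0):
        print("A=+1", ratio, clipped_surrogate(ratio, +1.0))
    # Negative advantage: the objective is flat once the ratio drops below
    # 1 - eps, but a ratio above 1 is still penalized in full.
    for ratio in (0.5, 0.8, 1.0, 1.5):
        print("A=-1", ratio, clipped_surrogate(ratio, -1.0))
```

The outer `min` makes the bound pessimistic: the clip only removes the incentive to move further in the direction that looks good, and never hides a change that makes the objective worse.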
Hour 1 — Abbeel Deep RL primer
Watch Foundations of Deep RL, Lecture 4: TRPO and PPO
Pieter Abbeel's pacing on this is excellent. Watch at 1.25× if comfortable. The single most useful 30 minutes for understanding PPO.
Hour 2 — Spinning Up + 37 Implementation Details
- Spinning Up in Deep RL, PPO page (~25 min): https://spinningup.openai.com/en/latest/algorithms/ppo.html
- 37 Implementation Details of PPO (Huang et al., ICLR 2022 blog) — read first 15 details (~30 min): https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
The "37 details" blog is what separates "PPO works" from "PPO doesn't work" in your code. Read it before writing PPO from scratch.
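As a flavor of what those details look like in code, here is a hedged sketch of two small ones commonly used in PPO implementations: normalizing advantages within each minibatch, and linearly annealing the learning rate over training. The helper names, the 2.5e-4 starting rate, and the use of the population standard deviation are assumptions for illustration; check the blog for the exact conventions before copying them.

```python
import statistics


def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale a minibatch of advantages to zero mean, unit std.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead (an assumption worth checking).
    """
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]


def annealed_lr(update, num_updates, initial_lr=2.5e-4):
    """Linearly decay the learning rate; `update` is 1-indexed."""
    frac = 1.0 - (update - 1.0) / num_updates
    return frac * initial_lr


if __name__ == "__main__":
    print(normalize_advantages([1.0, 2.0, 3.0]))
    for update in (1, 5, 10):
        print(update, annealed_lr(update, num_updates=10))
```

Neither change touches the PPO objective itself, which is why such details are easy to overlook when writing an implementation from scratch.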
LAB
Hour 3 — Lab: PPO from scratch on CartPole-v1 (90 min)
What you're building. A 250-line PPO implementation in pure PyTorch (no Stable-Baselines3, no CleanRL) that solves Gymnasium's CartPole-v1 to reward = 500 within 100k env steps. You'll log learning curves and deliberately compare them against the Stable-Baselines3 baseline.
What success looks like at the end. You have:
1. w4-rl/src/day22_ppo_cartpole.py (~250 lines).
2. CartPole-v1 reward reaches 500 (the env's max) within 100k steps for ≥ 2 of 3 seeds.
3. figures/day22_ppo_curves.png showing reward vs env steps.
4. Comparison: SB3 PPO trained on the same env reaches 500 in similar wall-clock time; you're not faster, but you understand it.
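Criterion 2 above is easy to check mechanically once you have logged curves. The sketch below assumes each seed's curve is a list of (env_step, mean_episode_reward) pairs sorted by step; that data layout, the function names, and the numbers in the example are all made up for illustration and are not results from the lab.

```python
def first_step_reaching(curve, target=500.0):
    """Return the first env step at which the logged reward hits `target`."""
    for step, reward in curve:
        if reward >= target:
            return step
    return None


def meets_success_criterion(curves_by_seed, target=500.0,
                            budget=100_000, min_seeds=2):
    """True if at least `min_seeds` seeds reach `target` within `budget` steps."""
    solved = 0
    for curve in curves_by_seed.values():
        step = first_step_reaching(curve, target)
        if step is not None and step <= budget:
            solved += 1
    return solved >= min_seeds


if __name__ == "__main__":
    # Synthetic curves, purely to exercise the check.
    curves = {
        0: [(20_000, 150.0), (60_000, 480.0), (80_000, 500.0)],
        1: [(20_000, 90.0), (60_000, 300.0), (100_000, 500.0)],
        2: [(20_000, 60.0), (60_000, 200.0), (100_000, 410.0)],
    }
    print(meets_success_criterion(curves))
```

Whether "reward" here means a single episode or a moving average over recent episodes is a choice you should state next to the plot, since it changes when a run counts as solved.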
Step 1 — Install (5 min)
mkdir -p ~/robo47/w4-rl && cd ~/robo47/w4-rl
uv venv --python 3.12 .venv && source .venv/bin/activate
uv pip install torch 'gymnasium[classic-control]' stable-baselines3 wandb matplotlib numpy
Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.