Week 4 integration + fresh-clone

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (5 min)

No new terms. Today is a reflection and integration day.

Hour 1 — Capstone Track C pre-design (40 min)

Write docs/day28_track_c_design.md:

# Track C: Sim-to-real locomotion on Go1

## Hypothesis
"Adding a teacher-student stage on top of plain DR yields ≥ 30% improvement
in robust return on a perturbed eval, vs DR alone."

## Variables
- IV: training pipeline (DR-only vs DR + RMA)
- DV: episode return on perturbed eval (50 episodes, 3 seeds)
- Controls: same env, same DR config, same perturbation values

## Experiments
1. Plain DR Go1 (Day 25)
2. RMA teacher + student with DR (Day 27)
3. Same student deployed on a real Go1 (if available)

## Metric
Mean perturbed episode return ± std across 3 seeds. Bar chart.

## Compute budget
1× H100, ~3 hours per pipeline, 2 pipelines: ~6 hours total.

## Risk / kill criteria
- If teacher-student doesn't beat DR by ≥ 15%, treat the implementation as broken and debug it before drawing conclusions.
- If real-Go1 fails entirely, log the gap and analyze (action latency, IMU noise).
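
The metric and kill criteria in this design can be checked mechanically once per-seed returns exist. A minimal sketch, assuming each pipeline yields one mean perturbed-eval return per seed and that the baseline mean is positive. The function names, the "inconclusive" band between 15% and 30%, and the per-seed numbers are hypothetical placeholders, not measured results:

```python
from statistics import mean, stdev

def summarize(per_seed_returns):
    """Return (mean, sample std) of the per-seed mean returns."""
    return mean(per_seed_returns), stdev(per_seed_returns)

def relative_improvement(baseline_mean, candidate_mean):
    """Fractional gain of the candidate over a positive baseline."""
    return (candidate_mean - baseline_mean) / baseline_mean

def verdict(gain, hypothesis=0.30, kill=0.15):
    """Map a fractional gain onto the design doc's two thresholds."""
    if gain >= hypothesis:
        return "hypothesis supported"
    if gain >= kill:
        return "inconclusive"  # assumed label; the design doc does not name this band
    return "kill: suspect implementation"

# Placeholder values for illustration only (3 seeds per pipeline).
dr_only = [20.0, 22.0, 21.0]
dr_rma = [28.0, 30.0, 29.0]

dr_mean, dr_std = summarize(dr_only)
rma_mean, rma_std = summarize(dr_rma)
gain = relative_improvement(dr_mean, rma_mean)

print(f"DR-only:  {dr_mean:.1f} ± {dr_std:.1f}")
print(f"DR + RMA: {rma_mean:.1f} ± {rma_std:.1f}")
print(f"Gain: {gain:.1%} -> {verdict(gain)}")
```

With only 3 seeds the sample std is a rough estimate, so a gain close to either threshold is worth rerunning with more seeds before acting on it.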

Hour 2 — Fresh-clone test (45 min)

Clone the w4-rl repo to /tmp/w4-test, install it via the Makefile, and re-run the Day 22 CartPole script. Verify that the final return is 500. CartPole is deterministic for a fixed seed, so the result should match the original run exactly.
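
The three steps can be scripted so the test is repeatable. A minimal sketch, assuming the Makefile exposes an `install` target and the Day 22 script lives at `scripts/day22_cartpole.py`. Both names and the repo URL are hypothetical, so substitute the real ones:

```python
import subprocess
from pathlib import Path

def fresh_clone_steps(repo_url, dest, script):
    """Build (argv, cwd) pairs for clone, install, and rerun."""
    dest = Path(dest)
    return [
        (["git", "clone", repo_url, str(dest)], None),
        (["make", "install"], dest),   # assumed Makefile target
        (["python", script], dest),    # assumed script path
    ]

def run_steps(steps):
    """Run each step in order; check=True stops at the first failure."""
    for argv, cwd in steps:
        subprocess.run(argv, cwd=cwd, check=True)

# Hypothetical remote URL and script path, for illustration only.
steps = fresh_clone_steps(
    "git@example.com:you/w4-rl.git",
    "/tmp/w4-test",
    "scripts/day22_cartpole.py",
)
for argv, cwd in steps:
    print(cwd or ".", "$", " ".join(argv))
```

The sketch only prints the plan. Call `run_steps(steps)` to execute it, and remove /tmp/w4-test first if an earlier attempt left it behind, since `git clone` refuses to write into a non-empty directory.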

LAB

Hour 3 — Week 4 retro + commit (45 min)

Write RETRO_w4.md:

# Week 4 retro

## Numbers
| Day | Method | Env | Return / Reward |
|---|---|---|---|
| 22 | PPO from scratch | CartPole-v1 | 500 (3 seeds) |
| 23 | PPO + Playground | Spot Joystick Flat | 28.1 |
| 24 | PPO + Playground | Go1 Joystick Flat | 38.4 |
| 25 | DR PPO | Go1 (perturbed eval) | 21.7 (vs no-DR 2.1) |
| 26 | rsl_rl | ANYmal-C Flat (Isaac Lab) | ~30 |
| 27 | RMA teacher + student | Go1 (perturbed eval) | 29 (vs DR 21.7) |

## Reproducibility
- Day 22 CartPole reproduces exactly from a fresh clone: return of 500 with the same seed.

## What I learned
1. PPO's "37 details" really matter. Without gradient clipping and advantage normalization, my from-scratch version fails.
2. 4096 parallel envs is a different paradigm. RL throughput went 100x in 5 years.
3. Reward shaping is the real algorithm in locomotion. The PPO trainer is a fixed component.
4. Domain randomization buys roughly 10x perturbed-eval return vs no-DR (21.7 vs 2.1). Cheap insurance.
5. Teacher-student adds roughly another 30% on top of DR (29 vs 21.7). The dominant 2025 sim-to-real recipe.
6. Isaac Lab is heavier than MuJoCo Playground but has photorealistic rendering, which vision-conditioned policies need.

## What still confuses me
- Why does Playground PPO converge in 60M steps while my from-scratch PPO needs many more for similar tasks? Is it architecture differences, value-loss clipping, or observation normalization?
- How do real-world deployments deal with the discrete-time action delay? Linear interpolation between commands vs. zero-order hold?
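
The ratios quoted under "What I learned" can be sanity-checked against the Numbers table. A small sketch using only the perturbed-eval returns from that table, taking the Day 25 figure of 21.7 as the DR baseline:

```python
# Perturbed-eval returns copied from the Numbers table.
no_dr = 2.1   # Day 25, no domain randomization
dr = 21.7     # Day 25, DR PPO
rma = 29.0    # Day 27, RMA teacher + student

dr_vs_no_dr = dr / no_dr      # multiplicative gain from DR
rma_gain = (rma - dr) / dr    # fractional gain of RMA over DR

print(f"DR vs no-DR: {dr_vs_no_dr:.1f}x")  # 10.3x
print(f"RMA over DR: {rma_gain:.1%}")      # 33.6%
```

Both figures match the rounded "10x" and "30%" claims, and the RMA gain clears the ≥ 30% hypothesis threshold in the Track C design.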

Commit and push:

cd ~/robo47/w4-rl
git add docs/ RETRO_w4.md
git commit -m "Day 28: Week 4 retro + Track C design + fresh-clone test"
git push

Deliverable checklist

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.
