Day 16

ACT (Action Chunking Transformer) on ALOHA insertion

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (10 min)

  • ACT (Action Chunking Transformer): Stanford, 2023. A CVAE-based transformer that predicts the next K actions per call and is deployed as a sliding window over the episode.
  • CVAE: Conditional Variational AutoEncoder. Training fits an encoder that maps (obs, action_chunk) → latent z and a decoder that maps (obs, z) → action_chunk. At inference, the encoder is dropped and z is fixed to 0 (the prior mean) so the output is deterministic.
  • Action chunk size K: Hyperparameter, typically 100 (one full insertion!). Trade-off: longer chunks give smoother motion but react more slowly to new observations.
  • Temporal ensembling: Average overlapping chunks. The action executed at time t blends every chunk that covers t (those predicted at t, t-1, t-2, ...), weighted exponentially.
  • Backbone: Vision encoder. ACT uses ResNet-18 by default.
  • 51M parameters: ACT's typical size. Fits on a 24GB GPU for training.
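
Temporal ensembling is easier to follow with a toy version. The sketch below is illustrative only and is not the LeRobot code: it assumes scalar actions, stores chunks in a plain dict keyed by the step at which each chunk was predicted, and uses weights exp(-m * i) with i = 0 for the oldest prediction. The function name ensemble_action and the default m = 0.01 are assumptions made for this example.

```python
import math


def ensemble_action(chunks, t, m=0.01):
    """Blend every stored prediction for timestep t (toy, scalar actions).

    chunks: dict mapping the step s at which a chunk was predicted to that
            chunk, a list of actions for steps s, s+1, ..., s+K-1.
    m:      decay rate (assumed value); weight exp(-m * i), where i = 0 is
            the oldest chunk that still covers t.
    """
    # Collect the prediction each chunk makes for step t, oldest chunk first.
    preds = [
        chunk[t - s]
        for s, chunk in sorted(chunks.items())
        if 0 <= t - s < len(chunk)
    ]
    if not preds:
        raise ValueError(f"no chunk covers timestep {t}")

    # Exponential weights, normalised so they sum to one.
    weights = [math.exp(-m * i) for i in range(len(preds))]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, preds)) / total
```

With m = 0 this reduces to a plain average of the overlapping predictions. As m grows, the oldest prediction dominates, so new observations are incorporated more slowly.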

Real-world analogy

ACT is "watch the master, then plan ahead — predict the whole next move (e.g. the entire reach-grasp-insert sequence) in one go, instead of just the next 1/30th of a second."

Hour 1 — Reading

Hour 2 — Read the LeRobot ACT implementation

  • Open ~/robo47-il/.venv/lib/python3.12/site-packages/lerobot/policies/act/modeling_act.py. Read for ~30 min. Find:
      • The CVAE encoder and decoder transformers.
      • Where chunk_size=100 is consumed.
      • The temporal ensembling logic in select_action.
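
While reading, it may help to keep a mental model of how a predicted chunk gets consumed when temporal ensembling is off. The sketch below is a simplified stand-in, not the LeRobot implementation: ChunkedController, predict_chunk, and num_calls are hypothetical names, and n_action_steps here simply means "how many actions of each chunk to execute before querying the policy again".

```python
from collections import deque


class ChunkedController:
    """Toy action queue: query the policy only when the queue runs dry."""

    def __init__(self, predict_chunk, n_action_steps):
        if n_action_steps < 1:
            raise ValueError("n_action_steps must be at least 1")
        self.predict_chunk = predict_chunk    # hypothetical: obs -> list of K actions
        self.n_action_steps = n_action_steps  # actions executed per policy call
        self.queue = deque()
        self.num_calls = 0                    # counts policy forward passes

    def select_action(self, obs):
        # Refill only when empty; the remainder of the chunk is discarded.
        if not self.queue:
            chunk = self.predict_chunk(obs)
            self.num_calls += 1
            self.queue.extend(chunk[: self.n_action_steps])
        return self.queue.popleft()
```

Executing fewer actions per chunk means more policy calls and faster reaction to new observations. Executing the whole chunk means fewer calls and open-loop behaviour for K steps.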

LAB

Hour 3 — Lab: train ACT, eval, beat the BC baseline (90 min)

What you're building. Train ACT on the same ALOHA insertion dataset for 20k steps (≈45 min on 1× H100). Compare directly against Day 15's Behavior Cloning (BC) baseline.

What success looks like at the end. You have:

  1. ACT checkpoint at runs/act_aloha/checkpoints/last/pretrained_model/.
  2. Eval success rate ≈ 0.70–0.95 (vs. the BC baseline's 0.05–0.30), which clears the >0.5 win condition.
  3. Eval video showing smooth, deliberate insertions.
  4. Side-by-side comparison plot figures/day16_act_vs_bc.png.
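
The win condition is simple arithmetic over per-episode outcomes. A minimal sketch, assuming you have collected one boolean per eval episode; the helper names success_rate and beats_win_condition are made up for this example and are not part of LeRobot.

```python
def success_rate(outcomes):
    """Fraction of eval episodes that succeeded (outcomes: iterable of bools)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one eval episode")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def beats_win_condition(outcomes, threshold=0.5):
    """True when the success rate is strictly above the >0.5 win condition."""
    return success_rate(outcomes) > threshold
```

Note the strict inequality: a run that succeeds in exactly half of its episodes does not clear the bar.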

Step 1 — Train ACT (45 min)

cd ~/robo47-il
source .venv/bin/activate

lerobot-train \
  --policy.type=act \
  --dataset.repo_id=lerobot/aloha_sim_insertion_human \
  --env.type=aloha \
  --env.task=AlohaInsertion-v0 \
  --batch_size=8 \
  --steps=20000 \
  --eval_freq=5000 \
  --save_freq=5000 \
  --output_dir=runs/act_aloha \
  --wandb.enable=true \
  --wandb.project=robo47 \
  --seed=1
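
As a sanity check on the flags above, the schedule they imply can be worked out by hand. The sketch below assumes evaluation and checkpointing fire on every multiple of their frequency, which is an assumption about the trainer rather than something this page confirms. The 45-minute figure is the estimate quoted in the step title.

```python
# Values copied from the command above.
STEPS = 20_000
BATCH_SIZE = 8
EVAL_FREQ = 5_000
SAVE_FREQ = 5_000
ESTIMATED_MINUTES = 45  # wall-clock estimate from the step title


def multiples_up_to(total_steps, freq):
    """Steps at which a periodic event fires, assuming it triggers on multiples of freq."""
    return list(range(freq, total_steps + 1, freq))


eval_steps = multiples_up_to(STEPS, EVAL_FREQ)   # evaluation rollouts
save_steps = multiples_up_to(STEPS, SAVE_FREQ)   # checkpoints written
samples_seen = STEPS * BATCH_SIZE                # training samples drawn in total
seconds_per_step = ESTIMATED_MINUTES * 60 / STEPS

print("eval at steps:", eval_steps)
print("checkpoints at steps:", save_steps)
print("samples seen:", samples_seen)
print("budget per step (s):", seconds_per_step)
```

If your measured time per step is well above this budget, the 45-minute estimate will not hold on your hardware.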

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.


Papers you will re-read after this