SmolVLA fine-tuning on LIBERO-Spatial

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (12 min)

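SmolVLA: Hugging Face's compact vision-language-action (VLA) model, released in 2025 (about 450M parameters according to the SmolVLA blog post). Designed to train and run on consumer GPUs.
VLA (vision-language-action model): Model that takes images and text instructions and outputs robot actions. RT-2 (2023) was the first major one; π0, GR00T, and SmolVLA are descendants.
LIBERO: A simulation benchmark of 100+ short robot manipulation tasks with language instructions. LIBERO-Spatial: 10 spatial-relation tasks ("put the bowl on the right of the plate").
LoRA (Low-Rank Adaptation): Fine-tune by adding small trainable low-rank matrices to attention layers while freezing the base weights. Reduces fine-tuning memory, by roughly 5× in favorable cases, because gradients and optimizer state are stored only for the adapters.
PaliGemma backbone: Google's vision-language model (3B params, 224×224 images). π0 builds on it; SmolVLA uses the smaller SmolVLM-2 backbone instead.
Action expert: Lightweight head that converts the backbone's hidden states into action targets. In SmolVLA it is a compact transformer trained with flow matching rather than a plain MLP.
Pretrained: A model already trained on a large mixture of data; we fine-tune it for our specific tasks rather than training from scratch.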
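
To make the LoRA entry above concrete, here is a small arithmetic sketch comparing the number of parameters a rank-r adapter trains with the size of the frozen weight it sits beside. The layer width (2048) and rank (8) are illustrative assumptions, not SmolVLA's actual dimensions.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (frozen_params, trainable_lora_params) for one linear layer.

    LoRA keeps the d_out x d_in weight W frozen and learns the update as a
    product B @ A, with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    frozen = d_in * d_out
    trainable = rank * d_in + d_out * rank
    return frozen, trainable


# Hypothetical attention projection: 2048 -> 2048 with a rank-8 adapter.
frozen, trainable = lora_param_counts(d_in=2048, d_out=2048, rank=8)
print(f"frozen={frozen:,} trainable={trainable:,} ratio={frozen / trainable:.0f}x")
# prints: frozen=4,194,304 trainable=32,768 ratio=128x
```

The trainable count shrinks far more than total memory does, because the frozen weights and the activations still have to fit on the GPU. Only the gradient and optimizer-state buffers scale with the adapter size.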

Real-world analogy

A pretrained VLA is a chef who has read every cookbook (vision-language pretraining) and is now learning your specific kitchen (LIBERO tasks). You don't reteach them how to chop onions; you just demonstrate "how I want it done in this kitchen". LoRA is teaching them via post-it notes (small trainable layers) instead of rewriting the cookbook.

Hour 1 — Reading

SmolVLA blog post (~15 min): https://huggingface.co/blog/smolvla
LIBERO paper, abstract + Sec 3 (~15 min): https://arxiv.org/abs/2306.03310
LoRA paper, Sec 1–3 (~15 min): https://arxiv.org/abs/2106.09685

Hour 2 — Setup + verify pretrained inference

cd ~/robo47-il
source .venv/bin/activate

# Download SmolVLA pretrained
python -c "
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained('lerobot/smolvla_base')
print(f'Loaded SmolVLA, {sum(p.numel() for p in policy.parameters())/1e6:.1f}M params')
"

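Expected: a line of the form Loaded SmolVLA, <N>M params. The SmolVLA blog post describes the model as roughly 450M parameters, so expect a count in that neighborhood; a very different number suggests the wrong checkpoint was loaded.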

LAB

Hour 3 — Lab: LoRA-fine-tune SmolVLA on LIBERO-Spatial (90 min)

What you're building. Fine-tune SmolVLA on LIBERO-Spatial via LoRA. Evaluate zero-shot (no fine-tuning) first, then again after fine-tuning. Quantify the lift in success rate.

What success looks like at the end. You have:
1. Zero-shot SmolVLA success rate on LIBERO-Spatial: 0.30–0.45 (the pretrained baseline).
2. Fine-tuned SmolVLA success rate: 0.70–0.85.
3. Peak GPU memory with LoRA: ~25 GB (vs ~50 GB for a full fine-tune).
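
Since the goal is to quantify the lift between two success rates, the sketch below shows one way to compute each rate and a 95% Wilson score interval around it. The episode outcomes are made-up numbers chosen to fall inside the target ranges above; they are not results from a real run.

```python
import math


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(outcomes) / len(outcomes)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical outcomes: 20/50 successes zero-shot, 39/50 after fine-tuning.
zero_shot = [True] * 20 + [False] * 30
fine_tuned = [True] * 39 + [False] * 11

for name, runs in [("zero-shot", zero_shot), ("fine-tuned", fine_tuned)]:
    lo, hi = wilson_interval(sum(runs), len(runs))
    print(f"{name}: {success_rate(runs):.2f} (95% CI {lo:.2f} to {hi:.2f})")

print(f"absolute lift: {success_rate(fine_tuned) - success_rate(zero_shot):+.2f}")
```

With only 50 episodes, a measured rate of 0.40 has an interval of roughly 0.28 to 0.54, so small differences between runs may be noise rather than real lift.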

Step 1 — Zero-shot evaluation (15 min)

lerobot-eval \
  --policy.path=lerobot/smolvla_base \
  --env.type=libero --env.task_suite=libero_spatial \
  --eval.n_episodes=50 \
  --output_dir=runs/smolvla_zeroshot \
  --seed=1
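
A single seed gives one noisy estimate, so it can help to repeat the evaluation over several seeds. The sketch below only builds the command lines. The flags are copied verbatim from the command above, while the seed values and the per-seed output directory naming are hypothetical choices made for illustration.

```python
import shlex


def build_eval_cmd(seed: int, out_root: str = "runs/smolvla_zeroshot") -> list[str]:
    """Build the lerobot-eval argv for one seed, reusing the flags shown above."""
    return [
        "lerobot-eval",
        "--policy.path=lerobot/smolvla_base",
        "--env.type=libero",
        "--env.task_suite=libero_spatial",
        "--eval.n_episodes=50",
        f"--output_dir={out_root}_seed{seed}",  # hypothetical naming scheme
        f"--seed={seed}",
    ]


# Print the commands for three seeds (the seed values are an arbitrary choice).
for seed in (1, 2, 3):
    print(shlex.join(build_eval_cmd(seed)))
    # To actually launch a run: subprocess.run(build_eval_cmd(seed), check=True)
```

Printing first and launching separately makes it easy to check the exact command before committing GPU time to it.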

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.
