Week 3: Imitation Learning, Day 18
SmolVLA fine-tuning on LIBERO-Spatial
This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.
LECTURE & READING
Glossary primer (12 min)
- SmolVLA: Hugging Face's compact (roughly 450M parameter) Vision-Language-Action model (VLA), released in 2025. Designed for fine-tuning on consumer GPUs.
- Vision-Language-Action model (VLA): A model that takes images and text instructions as input and outputs robot actions. RT-2 (2023) was the first major one; π0, GR00T, and SmolVLA are descendants.
- LIBERO: A benchmark of 100+ short manipulation tasks with language instructions. LIBERO-Spatial is its suite of 10 spatial-relation tasks ("put the bowl on the right of the plate").
- LoRA (Low-Rank Adaptation): Fine-tune by adding small trainable low-rank matrices to the attention layers while freezing the base weights. Reduces fine-tuning memory by ~5×.
- PaliGemma backbone: Google's Vision-Language Model (VLM) (3B params, 224×224 images). π0 builds on it; SmolVLA uses the smaller SmolVLM2 backbone instead.
- Action expert: The lightweight head that converts the VLM's hidden states into joint targets. In SmolVLA it is a compact transformer trained with flow matching that outputs chunks of actions.
- Pretrained policy: A model trained on a large mixture of data; we fine-tune it for our specific tasks rather than training from scratch.
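The LoRA entry above can be made concrete with a small sketch. It is a pure-Python illustration of the idea from the LoRA paper, not code from SmolVLA or LeRobot: the frozen weight W is left alone, and a low-rank product B·A, scaled by alpha / rank, is added on top. The function names, layer sizes, and numbers are all hypothetical.

```python
# Minimal LoRA sketch in pure Python (illustrative only; all names and sizes are assumptions).

def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha: float, rank: int) -> list[float]:
    """Compute W x + (alpha / rank) * B (A x). W stays frozen; only A and B would be trained."""
    base = matvec(W, x)                # frozen pretrained path
    update = matvec(B, matvec(A, x))   # low-rank trainable path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (frozen, trainable) parameter counts for one adapted linear layer."""
    frozen = d_out * d_in               # W has shape (d_out, d_in)
    trainable = rank * (d_in + d_out)   # A is (rank, d_in), B is (d_out, rank)
    return frozen, trainable


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # frozen 2x3 weight
A = [[0.5, 0.5, 0.5]]                     # rank-1 down-projection
B_zero = [[0.0], [0.0]]                   # B starts at zero, so the adapter is a no-op
x = [1.0, 2.0, 3.0]

base_out = matvec(W, x)
init_out = lora_forward(W, A, B_zero, x, alpha=2.0, rank=1)
tuned_out = lora_forward(W, A, [[1.0], [0.0]], x, alpha=2.0, rank=1)
frozen, trainable = lora_param_counts(d_out=1024, d_in=1024, rank=8)

print(base_out, init_out, tuned_out)
print(frozen, trainable)
```

Because B is initialized to zero, the adapted layer starts out identical to the pretrained one, and the trainable parameter count grows with rank × (d_in + d_out) rather than d_in × d_out.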
Real-world analogy
A VLA is a chef who has read every cookbook (vision-language pretraining) and is now learning your specific kitchen (LIBERO tasks). You don't reteach them how to chop onions; you just demonstrate "how I want it done in this kitchen". LoRA is teaching them via post-it notes (small trainable layers) instead of rewriting the cookbook.
Hour 1 — Reading
- SmolVLA blog post (~15 min): https://huggingface.co/blog/smolvla
- LIBERO paper, abstract + Sec 3 (~15 min): https://arxiv.org/abs/2306.03310
- LoRA paper, Sec 1–3 (~15 min): https://arxiv.org/abs/2106.09685
Hour 2 — Setup + verify pretrained inference
cd ~/robo47-il
source .venv/bin/activate
# Download SmolVLA pretrained
python -c "
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained('lerobot/smolvla_base')
print(f'Loaded SmolVLA, {sum(p.numel() for p in policy.parameters())/1e6:.1f}M params')
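"
Expected: a line of the form "Loaded SmolVLA, <N>M params". The SmolVLA blog post describes the model as roughly 450M parameters, so N should land in that neighborhood.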
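Once LoRA is involved, it helps to report trainable parameters separately from the total. The sketch below is a hypothetical helper, not part of LeRobot. It assumes only that each parameter object exposes `numel()` and `requires_grad`, as PyTorch parameters do, and uses a stand-in `FakeParam` class so it runs without any model download.

```python
# Hypothetical helper: summarize total vs trainable parameters.
# Works on anything with .numel() and .requires_grad (PyTorch parameters have both).
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FakeParam:
    """Stand-in for a framework parameter, used only for this demo."""
    size: int
    requires_grad: bool

    def numel(self) -> int:
        return self.size


def summarize_params(params: Iterable) -> dict:
    """Return total and trainable counts plus the trainable fraction."""
    total = 0
    trainable = 0
    for p in params:
        n = p.numel()
        total += n
        if p.requires_grad:
            trainable += n
    fraction = trainable / total if total else 0.0
    return {"total": total, "trainable": trainable, "fraction": fraction}


# Demo with made-up sizes: one frozen base tensor and two small adapter tensors.
demo = [FakeParam(1_000_000, False), FakeParam(8_000, True), FakeParam(8_000, True)]
summary = summarize_params(demo)
print(f"{summary['total'] / 1e6:.1f}M total, {summary['trainable'] / 1e6:.3f}M trainable")
```

With a loaded policy, the same helper could be called as `summarize_params(policy.parameters())`.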
LAB
Hour 3 — Lab: LoRA-fine-tune SmolVLA on LIBERO-Spatial (90 min)
What you're building. Fine-tune SmolVLA on LIBERO-Spatial via LoRA. Evaluate zero-shot (no fine-tuning) first, then again after fine-tuning. Quantify the lift.
What success looks like at the end. You have:
1. Zero-shot SmolVLA success rate on LIBERO-Spatial: 0.30–0.45 (the pretrained baseline).
2. Fine-tuned SmolVLA success rate: 0.70–0.85.
3. LoRA fine-tuning peaks at ~25 GB of GPU memory (vs ~50 GB for a full fine-tune).
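With only 50 episodes per evaluation, a success rate carries real sampling noise, so putting an interval around each number helps when quantifying the lift. The sketch below computes a Wilson score interval in plain Python. The episode counts are hypothetical placeholders chosen for illustration, not results from this lab, and the function name is an assumption.

```python
# Wilson score interval for a success rate (illustrative; the counts below are made up).
import math


def wilson_interval(successes: int, episodes: int, z: float = 1.96) -> tuple[float, float]:
    """Return an approximate 95% interval (for z=1.96) around successes / episodes."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    p = successes / episodes
    denom = 1 + z * z / episodes
    center = (p + z * z / (2 * episodes)) / denom
    half = z * math.sqrt(p * (1 - p) / episodes + z * z / (4 * episodes ** 2)) / denom
    return center - half, center + half


# Hypothetical outcomes: 20/50 successes zero-shot, 40/50 after fine-tuning.
zs_lo, zs_hi = wilson_interval(20, 50)
ft_lo, ft_hi = wilson_interval(40, 50)

print(f"zero-shot:  [{zs_lo:.2f}, {zs_hi:.2f}]")
print(f"fine-tuned: [{ft_lo:.2f}, {ft_hi:.2f}]")
print("intervals overlap:", zs_hi >= ft_lo)
```

If the two intervals overlap, the measured lift may be noise. Running more episodes or more seeds tightens both intervals.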
Step 1 — Zero-shot evaluation (15 min)
lerobot-eval \
--policy.path=lerobot/smolvla_base \
--env.type=libero --env.task_suite=libero_spatial \
--eval.n_episodes=50 \
--output_dir=runs/smolvla_zeroshot \
--seed=1
Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.