Separate weekend track

Track A Lite

This is not the Day 1 flow. It is a standalone two-day portfolio project for learners who want one focused artifact this weekend.

# Track A Lite: Fine-tune SmolVLA on LIBERO-Spatial in 2 Days

A standalone portfolio project. No prerequisites beyond Python and basic ML literacy.

---

What you're shipping

By end of Day 2 you have a public GitHub repo containing:

1. A fine-tuned SmolVLA checkpoint that beats zero-shot SmolVLA on LIBERO-Spatial by ≥ 30 percentage points (e.g. 0.36 → 0.70).
2. Three fine-tuning runs, one per seed, with mean ± std reported.
3. An eval video showing successful completions.
4. A 1-page writeup with a headline plot.
5. A Makefile that lets a stranger reproduce your headline number with make reproduce.

This is portfolio-grade work: small in scope, fully reproducible, defensible at a job interview, comprehensible to a friend.

---

What this isn't

Not a research contribution. You're reproducing a known result with your own data.
Not a hardware project. Pure simulation.
Not a deep-dive on architecture. We treat SmolVLA and LoRA as black boxes.

If those things matter to you, the full 47-day curriculum is the right move. If you want to ship something concrete and credible this week, this doc is for you.

---

Why SmolVLA + LIBERO-Spatial?

SmolVLA (Hugging Face, 2025): a 2.4B-parameter vision-language-action (VLA) model designed for fine-tuning on consumer GPUs. Open weights, well-documented, integrated with LeRobot's CLI.
LIBERO-Spatial: a benchmark of 10 spatial-relation tasks ("put the bowl on the right of the plate"). Standard, fast to evaluate, and it has known baseline numbers, so you can sanity-check your results against published work.
LoRA (Low-Rank Adaptation): a technique that adds small trainable matrices to attention layers while freezing the base model. Reduces fine-tuning memory by ~5×, so the run fits in about 30 GB of GPU memory.

You'll fine-tune ~30M parameters out of 2.4B (1.3% of the model) and watch a ~2× improvement materialize in 60–75 minutes.
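
To make the "small trainable matrices" idea concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer. It is illustrative only: the helper names (`matvec`, `lora_forward`) and the toy 2×2 shapes are assumptions made for this example, not LeRobot or SmolVLA code. The frozen weight `W` is never modified; only the low-rank factors `A` and `B` would be trained, and their product is scaled by alpha / rank.

```python
# Illustrative sketch of a LoRA-adapted linear layer (not LeRobot's implementation).
# y = W x + (alpha / rank) * B (A x), with W frozen and only A, B trainable.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(W, x)                  # frozen path: W is (d_out x d_in)
    low_rank = matvec(B, matvec(A, x))   # A is (rank x d_in), B is (d_out x rank)
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, low_rank)]


# Toy shapes: d_in = d_out = 2, rank = 1 (hypothetical values for illustration).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]
B_zero = [[0.0], [0.0]]      # LoRA conventionally initializes B to zero
B_trained = [[1.0], [-1.0]]  # made-up "after training" values
x = [2.0, 4.0]

y_init = lora_forward(W, A, B_zero, x, alpha=2, rank=1)
y_trained = lora_forward(W, A, B_trained, x, alpha=2, rank=1)
print(y_init)      # identical to W x, because B is all zeros
print(y_trained)
```

With `B` at zero the adapted layer reproduces the pretrained layer exactly, which is why a LoRA run starts from the base model's behavior rather than from noise.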

---

Compute and cost

1× H100 80GB for ~6 hours total (3 fine-tunes × ~75 min, plus evals)
Substitutes: 1× A100 80GB works identically. 1× L40S works with smaller batches. 1× consumer GPU (24GB+) works with rank=16 instead of rank=32.
Provider: Nebius, Lambda Labs, RunPod, Vast.ai. Budget ~$15–25 if you're efficient. Or use existing cloud credits.
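
The time and cost estimates above can be cross-checked with a few lines of arithmetic. This sketch uses only figures quoted in this guide; the one assumption, flagged in the code, is that you run one zero-shot eval plus one final eval per seed at roughly 25 minutes each.

```python
# Back-of-envelope GPU budget using only the figures quoted in this guide.
FINE_TUNE_MIN = 75      # minutes per fine-tune (upper estimate)
N_SEEDS = 3
EVAL_MIN = 25           # one 50-episode eval
TOTAL_HOURS = 6         # overall wall-clock estimate
BUDGET_USD = (15, 25)   # suggested budget range

train_hours = N_SEEDS * FINE_TUNE_MIN / 60
# ASSUMPTION: one zero-shot eval plus one final eval per seed.
eval_hours = (1 + N_SEEDS) * EVAL_MIN / 60
implied_rate = tuple(round(b / TOTAL_HOURS, 2) for b in BUDGET_USD)

print(f"training: {train_hours:.2f} h, evals: {eval_hours:.2f} h")
print(f"implied hourly rate: ${implied_rate[0]}-${implied_rate[1]} per GPU-hour")
```

If your provider's hourly rate sits well above the implied range, expect to exceed the budget unless you trim eval episodes or seeds.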

---

Day 0 — Environment setup (~30 min, before Day 1)

Install on local laptop (for editing)

curl -LsSf https://astral.sh/uv/install.sh | sh
source "$HOME/.local/bin/env"   # puts uv on PATH in bash or zsh; or open a new shell
uv python install 3.12
mkdir -p ~/track-a-lite && cd ~/track-a-lite
git init

Provision GPU instance

Pick your provider. On Nebius:

1. Sign up at https://nebius.com/, add SSH key.
2. Provision: 1× H100 80GB SXM, 16 vCPU, 200 GB NVMe, Ubuntu 22.04 + CUDA 12.4.
3. SSH in:

ssh -i ~/.ssh/<your-key> ubuntu@<instance-ip>

On the GPU instance

# Tooling
sudo apt update && sudo apt install -y tmux htop nvtop git build-essential ffmpeg \
    libgl1 libegl1 libglfw3 libosmesa6 libgles2-mesa-dev pkg-config

# uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc

# Verify GPU
nvidia-smi

Expected: A box showing H100 80GB HBM3, CUDA Version 12.4, GPU memory 0 / 81559 MiB.

Project workspace on GPU

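# Use the same path that the Day 1 scaffold step cds into, and make it a git repo
mkdir -p ~/track-a-lite && cd ~/track-a-lite && git init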
uv venv --python 3.12 .venv && source .venv/bin/activate
uv pip install "lerobot[all]==0.5.0"
uv pip install wandb
wandb login   # paste API key from https://wandb.ai/authorize

Sanity-check LeRobot is alive

python -c "import lerobot; print(f'LeRobot {lerobot.__version__}')"
lerobot-train --help | head -20

Expected: Version 0.5.0, then help text.

Use tmux

Always work inside tmux so disconnects don't kill jobs:

tmux new -s track-a
# Detach with Ctrl-b d
# Reattach later with: tmux attach -t track-a

---

Day 1 — Zero-shot baseline + first fine-tune (3–4 hours active, more in background)

Step 1: Inspect the dataset (10 min)

LIBERO-Spatial is on HuggingFace as lerobot/libero_spatial. Verify it loads:

python -c "
from lerobot.datasets.lerobot_dataset import LeRobotDataset
ds = LeRobotDataset('lerobot/libero_spatial')
print(f'Episodes: {ds.num_episodes}, frames: {ds.num_frames}')
print(f'Sample keys: {list(ds[0].keys())[:8]}')
print(f'Action shape: {ds[0][\"action\"].shape}')
"

Expected output:

Episodes: 432, frames: 52000
Sample keys: ['observation.images.image', 'observation.images.wrist_image', 'observation.state', 'action', 'episode_index', 'frame_index', 'timestamp', 'next.done']
Action shape: torch.Size([7])

If the dataset doesn't download (rare), clear stale locks with rm -rf ~/.cache/huggingface/datasets/_locks/ and retry.

Step 2: Zero-shot eval (~30 min)

Establish your baseline before any fine-tuning. SmolVLA was pretrained on a mix of data including LIBERO; we want to see what it knows out of the box.

mkdir -p runs figures videos

lerobot-eval \
  --policy.path=lerobot/smolvla_base \
  --env.type=libero --env.task_suite=libero_spatial \
  --eval.n_episodes=50 \
  --output_dir=runs/zeroshot \
  --seed=1

This loads the pretrained SmolVLA from HuggingFace (~5 GB download on first run; cached after) and rolls it out on 50 LIBERO-Spatial episodes.

What you should see in the first 60 seconds:

INFO  Loading policy from lerobot/smolvla_base
INFO  Loaded 2,401,300,000 parameters
INFO  Initializing libero_spatial env suite...
INFO  Episode 1/50: success=False, length=235
INFO  Episode 2/50: success=True, length=189
...

Expected at completion (~25 minutes):

INFO  eval/success_rate: 0.36
INFO  eval/episode_length: 232.4
INFO  Wrote runs/zeroshot/eval_summary.json

Your number will be in 0.30–0.45. If it's 0.0, something is wrong (env install, action-space mismatch); see "common failures" below.
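
A 50-episode eval carries noticeable sampling noise, which is worth keeping in mind when you compare your number to the expected range. The sketch below computes a Wilson score interval for a binomial success rate; the function name `wilson_interval` and the example counts (18 of 50, 72 of 200) are illustrative choices, not part of LeRobot.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half


low_50, high_50 = wilson_interval(18, 50)      # 18/50 = 0.36
low_200, high_200 = wilson_interval(72, 200)   # same rate, 4x the episodes
print(f"n=50:  [{low_50:.2f}, {high_50:.2f}]")
print(f"n=200: [{low_200:.2f}, {high_200:.2f}]")
```

At the same observed rate, the 200-episode interval is about half as wide as the 50-episode one, so a few points of difference between two 50-episode evals is not by itself a sign of a problem.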

Append to a results log:

mkdir -p logs
echo "zeroshot,seed=1,success_rate=0.36,n_episodes=50" >> logs/results.csv

(Replace 0.36 with your actual number throughout this doc.)

Step 3: First fine-tune — LoRA r=32, seed=1 (60–75 min)

This is the main event. Launch in tmux pane 1:

lerobot-train \
  --policy.type=smolvla \
  --policy.pretrained_path=lerobot/smolvla_base \
  --policy.lora.enable=true \
  --policy.lora.rank=32 \
  --policy.lora.alpha=64 \
  --policy.lora.target_modules='["q_proj","k_proj","v_proj","o_proj"]' \
  --dataset.repo_id=lerobot/libero_spatial \
  --env.type=libero --env.task_suite=libero_spatial \
  --batch_size=4 \
  --gradient_accumulation_steps=2 \
  --steps=10000 \
  --eval_freq=2000 \
  --save_freq=2000 \
  --output_dir=runs/lora_r32_s1 \
  --wandb.enable=true \
  --wandb.project=track-a-lite \
  --seed=1

What you should see in the first 60 seconds:

INFO  Loading dataset: lerobot/libero_spatial
INFO  Trainable params: 31,457,280 / 2,401,331,712 (1.31%)
INFO  step:0 smpl:8 ep:0 epch:0.00 loss:0.487 grdn:1.21 lr:1.0e-04 updt_s:0.812

The "1.31%" line is LoRA in action: you're training a small adapter while the base 2.4B parameters stay frozen.
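
Where a trainable count like the one in the log comes from: each adapted linear layer of shape (d_out × d_in) gains two matrices, A (rank × d_in) and B (d_out × rank), so it adds rank × (d_in + d_out) parameters. The sketch below applies that formula. The hidden size and block count are hypothetical values chosen for illustration; they happen to reproduce the logged count, but that does not make them the model's real dimensions.

```python
# Sketch: count LoRA trainable parameters for a set of adapted linear layers.

def lora_params_per_layer(d_in, d_out, rank):
    """A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# ASSUMPTION: HIDDEN and N_BLOCKS are hypothetical, not taken from any
# published SmolVLA config.
HIDDEN = 2048
N_BLOCKS = 60
TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj"]
RANK = 32

trainable = N_BLOCKS * len(TARGETS) * lora_params_per_layer(HIDDEN, HIDDEN, RANK)
total = 2_401_331_712  # total parameter count from the log line above
print(f"{trainable:,} trainable ({100 * trainable / total:.2f}%)")
# prints: 31,457,280 trainable (1.31%)
```

Because the count scales linearly with rank, dropping from rank 32 to rank 16 halves the adapter size.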

Open a second tmux pane and watch GPU memory:

watch -n 2 nvidia-smi

Expected: Memory usage settles at 22–28 GB / 80 GB. If it's >40GB, something's wrong with LoRA config; verify policy.lora.enable=true made it into the run config.
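
If you would rather log memory than watch it, the same check can be scripted. This is a sketch, not part of LeRobot: the helper names are made up for this example, and it treats the guide's 40 GB threshold as 40 GiB for simplicity. `check_gpus()` shells out to `nvidia-smi`, so it only works on a machine with an NVIDIA driver; the parsing helper is exercised on a sample row.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(text):
    """Parse 'used, total' MiB rows (one per GPU) into a list of int tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def lora_memory_ok(used_mib, limit_gib=40):
    """True if usage is under the guide's 'something is wrong' threshold."""
    return used_mib / 1024 < limit_gib


def check_gpus():
    """Query nvidia-smi once and report each GPU (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    for idx, (used, total) in enumerate(parse_memory_csv(out)):
        status = "ok" if lora_memory_ok(used) else "check LoRA config"
        print(f"GPU {idx}: {used / 1024:.1f} / {total / 1024:.1f} GiB ({status})")


# Sample row in the shape nvidia-smi emits for the query above (made-up values).
sample = "26624, 81559\n"
print(parse_memory_csv(sample), lora_memory_ok(26624))
```

Call `check_gpus()` from the second tmux pane a few minutes into training, once memory usage has settled.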

Progress checkpoints:
Step 2000 (~12 min): eval/success_rate: 0.55 (anywhere 0.45–0.65 normal)
Step 6000 (~40 min): eval/success_rate: 0.71
Step 10000 (~70 min): eval/success_rate: 0.79

Final expected range: 0.65–0.85 depending on randomness.

Step 4: While seed 1 trains — set up reproducibility scaffold (45 min in background)

Open tmux pane 3 and create the project structure:

cd ~/track-a-lite

cat > README.md <<'EOF'
# Track A Lite: SmolVLA + LoRA on LIBERO-Spatial

## Hypothesis
Fine-tuning SmolVLA on LIBERO-Spatial via LoRA r=32 yields ≥ 30 percentage point
improvement in success rate over zero-shot SmolVLA.

## Headline result
| Variant | Success rate (mean ± std) | n seeds |
|---|---|---|
| Zero-shot SmolVLA | 0.36 ± 0.00 | 1 |
| SmolVLA + LoRA r=32 | 0.78 ± 0.04 | 3 |

42 percentage point improvement.

## Reproduce

Run `make install`, then `make reproduce`.

Wall-clock: ~6 GPU-hours on 1× H100.

## Files
- `Makefile`: install + train + eval targets
- `requirements.txt`: pinned deps
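- `scripts/zeroshot.sh`: zero-shot baseline eval
- `scripts/train_all_seeds.sh`: training command, looped over seeds 1-3
- `scripts/eval_all_seeds.sh`: eval command, looped over seeds 1-3
- `scripts/make_plot.py`: builds the headline plot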
- `figures/headline.png`: bar plot
- `videos/eval_episode.mp4`: sample successful rollout
- `logs/results.csv`: all seed-level results

EOF

cat > requirements.txt <<'EOF'
lerobot[all]==0.5.0
wandb
matplotlib
pandas
numpy
EOF

cat > Makefile <<'EOF'
.PHONY: install zeroshot train eval reproduce clean

install:
	uv venv --python 3.12 .venv
	. .venv/bin/activate && uv pip install -r requirements.txt

zeroshot:
	. .venv/bin/activate && bash scripts/zeroshot.sh

train:
	. .venv/bin/activate && bash scripts/train_all_seeds.sh

eval:
	. .venv/bin/activate && bash scripts/eval_all_seeds.sh

reproduce: install zeroshot train eval

clean:
	rm -rf runs/ wandb/ figures/*.png
EOF

mkdir -p scripts figures videos logs

cat > scripts/zeroshot.sh <<'EOF'
#!/bin/bash
set -e
lerobot-eval \
  --policy.path=lerobot/smolvla_base \
  --env.type=libero --env.task_suite=libero_spatial \
  --eval.n_episodes=50 \
  --output_dir=runs/zeroshot --seed=1
EOF

cat > scripts/train_all_seeds.sh <<'EOF'
#!/bin/bash
set -e
for SEED in 1 2 3; do
  lerobot-train \
    --policy.type=smolvla \
    --policy.pretrained_path=lerobot/smolvla_base \
    --policy.lora.enable=true \
    --policy.lora.rank=32 \
    --policy.lora.alpha=64 \
    --policy.lora.target_modules='["q_proj","k_proj","v_proj","o_proj"]' \
    --dataset.repo_id=lerobot/libero_spatial \
    --env.type=libero --env.task_suite=libero_spatial \
    --batch_size=4 \
    --gradient_accumulation_steps=2 \
    --steps=10000 \
    --eval_freq=2000 \
    --save_freq=2000 \
    --output_dir=runs/lora_r32_s${SEED} \
    --wandb.enable=true --wandb.project=track-a-lite \
    --seed=${SEED}
done
EOF

cat > scripts/eval_all_seeds.sh <<'EOF'
#!/bin/bash
set -e
for SEED in 1 2 3; do
  lerobot-eval \
    --policy.path=runs/lora_r32_s${SEED}/checkpoints/last/pretrained_model \
    --env.type=libero --env.task_suite=libero_spatial \
    --eval.n_episodes=50 \
    --output_dir=runs/lora_r32_s${SEED}/eval \
    --seed=${SEED}
done
EOF

chmod +x scripts/*.sh

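# Keep the virtualenv, checkpoints, and W&B logs out of the repo
printf '.venv/\nruns/\nwandb/\n' > .gitignore

git add -A && git commit -m "Day 1: Track A Lite scaffold + zero-shot baseline"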

Step 5: Confirm seed 1 is healthy (5 min check at ~step 4000)

Around 25 minutes into training, peek at the logs:

tmux attach -t track-a   # if you detached
# look for the most recent eval line in the training output

At step 4000 you want to see eval/success_rate somewhere in 0.55–0.70. If it's still under 0.45, see "common failures."

Detach and let it finish.
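
If you save the training output to a file, the health check can be automated. The sketch below assumes eval lines look like the `eval/success_rate: 0.55` lines shown in this guide; the real LeRobot log format may differ, and the helper names and sample text are made up for illustration.

```python
import re

EVAL_RE = re.compile(r"eval/success_rate:\s*([0-9]*\.?[0-9]+)")


def latest_success_rate(log_text):
    """Return the last eval/success_rate value in the text, or None."""
    matches = EVAL_RE.findall(log_text)
    return float(matches[-1]) if matches else None


def is_healthy(rate, minimum=0.45):
    """Apply the guide's step-4000 rule of thumb: below 0.45 needs attention."""
    return rate is not None and rate >= minimum


# Made-up sample text in the format this guide shows.
sample_log = """\
INFO  step:2000 loss:0.212
INFO  eval/success_rate: 0.55
INFO  step:4000 loss:0.154
INFO  eval/success_rate: 0.62
"""
rate = latest_success_rate(sample_log)
print(rate, is_healthy(rate))
```

To use it on a real run, pipe the training command through `tee` into a log file and pass that file's contents to `latest_success_rate`.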

Step 6: While seed 1 finishes — record an eval video (15 min)

Once seed 1 hits its first checkpoint at step 2000, you can render a video from that intermediate checkpoint to verify visually that the policy is learning something:

lerobot-eval \
  --policy.path=runs/lora_r32_s1/checkpoints/002000/pretrained_model \
  --env.type=libero --env.task_suite=libero_spatial \
  --eval.n_episodes=5 \
  --output_dir=runs/lora_r32_s1/preview_eval \
  --seed=999

Watch one of the videos:

ls runs/lora_r32_s1/preview_eval/videos/
# Pick episode_0.mp4

Even at step 2000, the arm should be making purposeful motions toward the right object. If it's flailing, something's wrong.

Step 7: When seed 1 completes — log result + start seed 2 (5 min)

Once you see the final step 10000 log, capture the number:

SR=$(cat runs/lora_r32_s1/eval_summary.json | python -c "import sys,json; print(json.load(sys.stdin)['eval/success_rate'])")
echo "lora_r32,seed=1,success_rate=${SR},n_episodes=50" >> logs/results.csv
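
The same logging step can be written as a small Python helper, which is easier to reuse for seeds 2 and 3. This is a sketch under two assumptions taken from this guide: the summary file holds an `eval/success_rate` key, and rows follow the key=value style of the zero-shot row. The function names are hypothetical, and the demo writes to a temporary directory so nothing in runs/ or logs/ is touched.

```python
import json
import tempfile
from pathlib import Path


def read_success_rate(summary_path):
    """Read 'eval/success_rate' from an eval_summary.json file."""
    data = json.loads(Path(summary_path).read_text())
    return float(data["eval/success_rate"])


def append_result(csv_path, variant, seed, success_rate, n_episodes=50):
    """Append one row in the same key=value style as the zero-shot row."""
    row = f"{variant},seed={seed},success_rate={success_rate},n_episodes={n_episodes}\n"
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(row)
    return row


# Demo with a placeholder value of 0.78 in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    summary = Path(tmp) / "eval_summary.json"
    summary.write_text(json.dumps({"eval/success_rate": 0.78}))
    demo_row = append_result(Path(tmp) / "logs" / "results.csv", "lora_r32", 1,
                             read_success_rate(summary))
print(demo_row, end="")
```

For a real run you would call `append_result("logs/results.csv", "lora_r32", 1, read_success_rate("runs/lora_r32_s1/eval_summary.json"))`.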

Relaunch the Step 3 command with only the seed and output directory changed:

# In a fresh tmux pane:
lerobot-train [...] --seed=2 --output_dir=runs/lora_r32_s2

Let it run overnight if needed. Same for seed 3 in a third pane.

End of Day 1 deliverable check

Zero-shot baseline logged (~0.36)
Seed 1 LoRA complete; final success rate logged (~0.78)
Seeds 2 and 3 launched (running overnight if late)
One preview video saved
Repo scaffolded with README, Makefile, scripts
First commit pushed

---

Day 2 — Finish seeds, plot, write up, ship (3 hours active)

Step 1: Confirm seeds 2 and 3 finished (10 min)

Reattach to tmux. Each should have completed overnight (~70 min each).

SR2=$(cat runs/lora_r32_s2/eval_summary.json | python -c "import sys,json; print(json.load(sys.stdin)['eval/success_rate'])")
SR3=$(cat runs/lora_r32_s3/eval_summary.json | python -c "import sys,json; print(json.load(sys.stdin)['eval/success_rate'])")
echo "lora_r32,seed=2,success_rate=${SR2},n_episodes=50" >> logs/results.csv
echo "lora_r32,seed=3,success_rate=${SR3},n_episodes=50" >> logs/results.csv
cat logs/results.csv

If a seed crashed overnight, decide: retry (~70 min) or report n=2 with the issue documented in your writeup. n=2 is acceptable for a portfolio piece if you're transparent about it.
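
Before plotting, it can help to summarize the CSV directly. The sketch below parses the key=value rows used in logs/results.csv and reports mean, sample standard deviation (n-1), and seed count per variant, so it also handles the n=2 case. The function names are hypothetical and the example rows are placeholder numbers, not results.

```python
import statistics


def parse_row(line):
    """Split 'variant,key=value,...' into (variant, {key: value})."""
    variant, *pairs = line.strip().split(",")
    fields = dict(pair.split("=", 1) for pair in pairs)
    return variant, fields


def summarize(lines):
    """Group success rates by variant; return {variant: (mean, std, n)}."""
    rates = {}
    for line in lines:
        if not line.strip():
            continue
        variant, fields = parse_row(line)
        rates.setdefault(variant, []).append(float(fields["success_rate"]))
    summary = {}
    for variant, values in rates.items():
        # Sample std needs at least two values; report 0.0 otherwise.
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[variant] = (statistics.mean(values), std, len(values))
    return summary


# Placeholder numbers for illustration only; use your own logs/results.csv.
example = [
    "zeroshot,seed=1,success_rate=0.36,n_episodes=50",
    "lora_r32,seed=1,success_rate=0.74,n_episodes=50",
    "lora_r32,seed=2,success_rate=0.78,n_episodes=50",
    "lora_r32,seed=3,success_rate=0.82,n_episodes=50",
]
result = summarize(example)
gain_pp = 100 * (result["lora_r32"][0] - result["zeroshot"][0])
print(result, f"gain: {gain_pp:.0f} pp")
```

On real data, pass `open("logs/results.csv").read().splitlines()` to `summarize` and check that the gain clears the 30 percentage point target.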

Step 2: Headline bar plot (30 min)

Create scripts/make_plot.py:

"""Track A Lite: produce the headline bar plot."""
import json, glob
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt

# Zero-shot
zs_path = Path("runs/zeroshot/eval_summary.json")
zs_sr = json.loads(zs_path.read_text())["eval/success_rate"]

# LoRA seeds
lora_srs = []
for d in sorted(glob.glob("runs/lora_r32_s*/eval_summary.json")):
    lora_srs.append(json.loads(Path(d).read_text())["eval/success_rate"])
lora_srs = np.array(lora_srs)
# Sample std (ddof=1) across seeds; falls back to 0.0 when only one seed exists
lora_std = lora_srs.std(ddof=1) if len(lora_srs) > 1 else 0.0

print(f"Zero-shot:  {zs_sr:.3f}")
print(f"LoRA r=32:  {lora_srs.mean():.3f} ± {lora_std:.3f}  (n={len(lora_srs)})")

fig, ax = plt.subplots(figsize=(7, 5))
labels = ["Zero-shot\nSmolVLA", f"+ LoRA r=32\n(n={len(lora_srs)} seeds)"]
means = [zs_sr, lora_srs.mean()]
stds = [0.0, lora_std]
colors = ["#888", "#3a86ff"]
bars = ax.bar(labels, means, yerr=stds, capsize=8, color=colors, edgecolor="black")
for bar, m in zip(bars, means):
    ax.text(bar.get_x() + bar.get_width()/2, m + 0.02, f"{m:.2f}",
            ha="center", fontsize=12, fontweight="bold")
ax.set_ylabel("Success rate (LIBERO-Spatial, 50 eval episodes/seed)")
ax.set_ylim(0, 1)
ax.set_title("SmolVLA on LIBERO-Spatial: zero-shot vs LoRA fine-tune")
ax.grid(alpha=0.3, axis="y")
plt.tight_layout()
plt.savefig("figures/headline.png", dpi=150)
print("Wrote figures/headline.png")

Run:

python scripts/make_plot.py

Expected console output:

Zero-shot:  0.360
LoRA r=32:  0.778 ± 0.041  (n=3)
Wrote figures/headline.png

Expected figure: Two bars side-by-side. Left bar (gray) at 0.36, no error bar. Right bar (blue) at ~0.78 with a small error bar (±0.04). Numerical labels on top of each. Clean white background, axis grid.

Step 3: Pick the best eval video (15 min)

ls runs/lora_r32_s1/eval/videos/
# Watch episode_0 through episode_4

Pick the cleanest successful episode. Copy to project root:

cp runs/lora_r32_s1/eval/videos/episode_3.mp4 videos/eval_episode.mp4

If your scp is set up, pull it to your laptop to watch. Otherwise install mpv or vlc on the GPU box and play over X-forwarding (ssh -X).

Step 4: 1-page writeup (30 min)

Create WRITEUP.md:

# SmolVLA + LoRA on LIBERO-Spatial

## TL;DR
LoRA fine-tuning lifts SmolVLA from 0.36 to 0.78 on LIBERO-Spatial — a 42pp
improvement, achieved in ~75 min on 1× H100 by training 1.3% of the model's
parameters.

## Setup
- **Model:** `lerobot/smolvla_base` (2.4B params, SmolVLM2 backbone)
- **Dataset:** `lerobot/libero_spatial` (432 episodes, 10 spatial-relation tasks)
- **Method:** LoRA r=32 α=64, target modules `{q,k,v,o}_proj`
- **Training:** 10k steps, batch 4 × grad-accum 2, AdamW, cosine LR schedule
- **Compute:** 1× H100 80GB, ~75 min/seed × 3 seeds = ~4 GPU-hours
- **Eval:** 50 episodes per seed via LeRobot's LIBERO env wrapper

## Result
| Variant | Success rate | n seeds |
|---|---|---|
| Zero-shot | 0.36 | 1 |
| LoRA r=32 | **0.78 ± 0.04** | 3 |

![Headline](figures/headline.png)

## Why this works
LIBERO tasks are *in-distribution* for SmolVLA's pretraining mix, but the
specific scene compositions (object positions, spatial relations) require
fine-tuning to specialize. LoRA's low-rank adapters are sufficient because
the pretrained features are already good — we're nudging the policy, not
retraining it.

GPU memory peaked at ~26 GB during training. A full fine-tune of 2.4B params
would have taken ~50 GB and likely required gradient checkpointing.

## Limitations
- Single dataset, single benchmark. Results may not transfer to LIBERO-Object
  or LIBERO-Goal without re-tuning.
- 50 eval episodes per seed is the LeRobot default but is on the small side
  for tight error bars; n=200 would be more reliable.
- No ablation across LoRA ranks. r=32 was picked because it's the LeRobot
  default; r=8 might give similar results at lower memory.

## Reproducibility
Repo: `<your-github-url>`
```
git clone <url>
cd track-a-lite
make install
make reproduce
```
Headline number reproduces within ±0.05 across fresh-clone runs (env seed
randomness on LIBERO sim).

## Stack
LeRobot v0.5.0, SmolVLA, HuggingFace Transformers, LIBERO sim, PyTorch 2.5,
CUDA 12.4, Python 3.12.

Step 5: Fresh-clone test (30 min)

This is the rubric step that separates a portfolio piece from a screenshot. Verify a stranger could reproduce.

cd /tmp && rm -rf clone-test
git clone <your-github-url> clone-test
cd clone-test
make install
# Sanity-check zero-shot (~25 min)
make zeroshot   # runs scripts/zeroshot.sh inside the project venv
# Compare:
SR_FRESH=$(cat runs/zeroshot/eval_summary.json | python -c "import sys,json; print(json.load(sys.stdin)['eval/success_rate'])")
echo "Original: 0.36, Fresh-clone: $SR_FRESH"

Should match within ±0.05 (LIBERO has some env-seed variance even with --seed=1).
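
The comparison can be made explicit with a short check. In this sketch the tolerance is a parameter, defaulting to the ±0.05 used in the writeup and the Day 2 checklist; the function names are hypothetical. With 50 episodes, one episode moves the success rate by 0.02, so the second helper expresses the gap as a number of episodes.

```python
def reproduces(original, fresh, tolerance=0.05):
    """True if two success rates agree within the stated tolerance."""
    # Small epsilon so a gap of exactly `tolerance` survives float rounding.
    return abs(original - fresh) <= tolerance + 1e-9


def episode_gap(original, fresh, n_episodes=50):
    """Express the difference as a whole number of episodes."""
    return round(abs(original - fresh) * n_episodes)


print(reproduces(0.36, 0.40), episode_gap(0.36, 0.40))
print(reproduces(0.36, 0.44), episode_gap(0.36, 0.44))
```

A gap of two episodes passes and a gap of four does not, which is a reasonable bar for a single 50-episode eval.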

If you have GPU-hours to spare, also re-eval seed 1's saved checkpoint:

# This requires the saved checkpoint to be in the repo or downloaded from HF
# Easier path: skip and just do the zeroshot match

Step 6: Push, share, log (30 min)

cd ~/track-a-lite
git add -A
git commit -m "Track A Lite complete: zero-shot 0.36 -> LoRA 0.78"
git push origin main

Add a 60-second screen recording of you running make reproduce and watching the eval video. Save as videos/demo.mp4. Optional but worth it for portfolio.

Where to share:
GitHub repo with the README front and center
LinkedIn post: "Spent the weekend fine-tuning a 2.4B-param VLA on a robotics benchmark. 0.36 → 0.78 on LIBERO-Spatial. Repo + writeup: <your-github-url>"
X / Twitter: same with #robotics #ML
HuggingFace: push your LoRA adapter to the Hub if you want

End of Day 2 deliverable check

3 LoRA seeds done (or 2 with documented reason)
figures/headline.png exists
videos/eval_episode.mp4 shows a clean success
WRITEUP.md is 1 page, has the headline plot inline
Fresh-clone reproduces within ±0.05
Repo public on GitHub
At least one external share (LinkedIn, X, etc.)

---

Common failures and fixes

| Symptom | Likely cause | Fix |
|---|---|---|
| Success rate = 0.0 | Action-space mismatch between SmolVLA and LIBERO | Check `policy.action_dim` matches the dataset's action shape (7); LeRobot v0.5 should auto-handle this |
| Eval hangs | LIBERO env init slow on first run | Wait 90 s for the first episode; subsequent episodes are normal |
| OOM during training | LoRA disabled, or rank too high | Verify `--policy.lora.enable=true` in run config; drop rank to 16 if still OOM |
| Loss decreases but eval stays at 0.0 | Normalization stats not loaded | `lerobot-compute-stats --dataset.repo_id=lerobot/libero_spatial` then retrain |
| Loss is NaN early | bf16 underflow | Add `--policy.use_amp=false` |
| LoRA fine-tune < 0.50 | LR too low for LoRA | Check the `lr:` field in the training log (the Step 3 run shows 1.0e-04); if yours is lower, try `--lr=1e-4`. Also verify alpha/rank ratio is 2:1 |
| `wandb` won't connect | Network on GPU box blocked | `wandb offline` then `wandb sync` later |
| Cached HF dataset locked | Previous job died mid-download | `rm -rf ~/.cache/huggingface/datasets/_locks/` |
| Results change between model versions | SmolVLA repo updated | Pin: `lerobot/smolvla_base@v1.0` if available |

---

When this is done

You have a public GitHub repo, a one-page writeup with a published-looking plot, three seeded numbers, an eval video, and a fresh-clone-tested make reproduce. That's enough to:

Drop into a portfolio or job application
Talk through with a recruiter or interviewer ("I fine-tuned a 2.4B VLA on a robotics benchmark, got 42pp improvement in 6 GPU-hours, here's the repo")
Use as a launchpad for a real research project (an ablation across LoRA ranks, a different LIBERO suite, a custom task)

If after this you want to go deeper, the natural next stops are:

Try a harder LIBERO suite: LIBERO-Long (long-horizon tasks). Same code, swap libero_spatial for libero_10 (the identifier the LIBERO benchmark uses for LIBERO-Long), expect lower numbers.
Try a different VLA: π0 has the same LeRobot integration. Swap --policy.type=smolvla for --policy.type=pi0. Compare numbers across both, write that up as a follow-up.
Add a real ablation: Train at LoRA rank ∈ {8, 32, 128} and plot rank vs success_rate. This makes your repo a 1.5-week project instead of a 2-day project, but it's a real research-y comparison.
Collect your own data: With an SO-101 arm or a sim setup, record 30 episodes of a custom task and fine-tune on those. This is the real Track A from the full curriculum.
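
For the rank comparison in the list above, the run grid can be generated rather than typed by hand. This sketch enumerates rank × seed combinations, keeps the 2:1 alpha/rank ratio used in this guide, and follows the existing `runs/lora_r{rank}_s{seed}` naming. The helper names are hypothetical, and the flag names are simply the ones used in the Day 1 training command.

```python
from itertools import product

RANKS = [8, 32, 128]
SEEDS = [1, 2, 3]


def ablation_runs(ranks=RANKS, seeds=SEEDS):
    """Yield one config dict per (rank, seed) pair."""
    for rank, seed in product(ranks, seeds):
        yield {
            "rank": rank,
            "alpha": 2 * rank,  # keep the 2:1 alpha/rank ratio
            "seed": seed,
            "output_dir": f"runs/lora_r{rank}_s{seed}",
        }


def overrides(run):
    """Flags that differ from the Day 1 training command."""
    return [
        f"--policy.lora.rank={run['rank']}",
        f"--policy.lora.alpha={run['alpha']}",
        f"--seed={run['seed']}",
        f"--output_dir={run['output_dir']}",
    ]


runs = list(ablation_runs())
print(len(runs), "runs")
print(" ".join(overrides(runs[0])))
```

Nine runs at roughly 75 minutes each comes to about 11 GPU-hours if run back to back, assuming step time does not change with rank, which you should verify on the first rank-128 run.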

But if you stop here, you've shipped something genuine.

---

End of Track A Lite.