Day 36

V-JEPA 2, V-JEPA 2-AC, LeJEPA

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (15 min)

  • JEPA (Joint Embedding Predictive Architecture) — Yann LeCun 2022. An architecture that predicts representations of future states (in embedding space), not pixels. Sidesteps the "predict every pixel" inefficiency of generative world models.
  • V-JEPA — Video JEPA. Meta 2024; the video counterpart to the image-based I-JEPA (2023). Predicts masked video tubelets in DINO-style embedding space.
  • V-JEPA 2 — Meta, June 2025. Scaled-up V-JEPA: 1B+ params, pretrained on 1M+ hours of video.
  • V-JEPA 2.1 — Meta Mar 2026. Latest checkpoint with better fine-tuning recipes and added action-conditioning hooks.
  • V-JEPA 2-AC — Action-Conditioned variant. Fine-tuned for "given current obs and next action, predict next obs embedding." Usable as a world model for planning.
  • LeJEPA — LeRobot's JEPA implementation (HuggingFace, 2025). Smaller, robotics-data-trained.
  • Predictor — Small transformer that takes context embeddings + mask → predicts target embeddings.
  • Stop-gradient — On the target encoder. Prevents collapse where the predictor learns a degenerate constant function.
  • EMA target — Target encoder is exponential moving average of online encoder. Standard self-supervised trick (BYOL, DINO, JEPA).
  • Tubelet — A small spatiotemporal patch (e.g. 2×16×16) in video. The unit of masking.
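
To make the tubelet and EMA-target entries concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not code from any V-JEPA or LeJEPA repository: the function names `tubeletify` and `ema_update`, the 16-frame 224×224 RGB clip, and the momentum of 0.99 are all hypothetical choices.

```python
import numpy as np


def tubeletify(clip: np.ndarray, t: int = 2, p: int = 16) -> np.ndarray:
    """Split a (T, H, W, C) clip into flattened t x p x p tubelets.

    Hypothetical helper: returns an array of shape
    (num_tubelets, t * p * p * C), one row per tubelet.
    """
    T, H, W, C = clip.shape
    assert T % t == 0 and H % p == 0 and W % p == 0, "clip must tile evenly"
    # Split each axis into (grid index, offset inside the tubelet).
    x = clip.reshape(T // t, t, H // p, p, W // p, p, C)
    # Bring the three grid axes to the front, tubelet contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, t * p * p * C)


def ema_update(target: dict, online: dict, momentum: float = 0.99) -> dict:
    """Return new target weights: momentum * target + (1 - momentum) * online."""
    return {k: momentum * target[k] + (1.0 - momentum) * online[k] for k in target}


# Assumed clip size: 16 frames of 224 x 224 RGB.
clip = np.zeros((16, 224, 224, 3), dtype=np.float32)
tokens = tubeletify(clip)
print(tokens.shape)  # (1568, 1536): 8 * 14 * 14 tubelets, each 2 * 16 * 16 * 3 values

# Toy "weights": the target moves 1% of the way toward the online encoder.
online = {"w": np.ones(4)}
target = {"w": np.zeros(4)}
target = ema_update(target, online, momentum=0.99)
print(target["w"])
```

The target encoder never receives gradients in this scheme; it only changes through `ema_update`, which is what keeps its outputs a slowly moving prediction target.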

Real-world analogy

Traditional video models predict pixels ("what will the next frame look like?"), which is wasteful, since most pixels (background, lighting) don't matter for action. JEPA predicts representations ("what will the next frame mean?"): it discards low-level texture noise and focuses on semantic content. Like the difference between transcribing every word in a meeting versus writing meeting minutes.

Hour 1 — Reading

Hour 2 — LeJEPA codebase

ssh -i ~/.ssh/nebius_key ubuntu@<your-h100-ip>
cd ~ && mkdir -p robo47-wm && cd robo47-wm
uv venv --python 3.12 .venv && source .venv/bin/activate

git clone https://github.com/huggingface/lejepa
cd lejepa
uv pip install -e .

Read these files for ~30 min:
  • lejepa/models/predictor.py — the small predictor transformer
  • lejepa/models/encoder.py — DINOv3-style ViT encoder
  • lejepa/training/loss.py — the L1/L2 prediction loss + stop-gradient logic
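
Before opening loss.py, it can help to see what an L1/L2 prediction loss with a stop-gradient amounts to. The sketch below is not the repository's code: `prediction_loss` is a hypothetical helper written in plain NumPy, with hand-derived gradients so the stop-gradient is explicit. A gradient is returned for the prediction only; the target is treated as a constant.

```python
import numpy as np


def prediction_loss(pred: np.ndarray, target: np.ndarray, kind: str = "l1"):
    """Mean L1 or L2 distance between predicted and target embeddings.

    Returns (loss, grad_pred). No gradient is returned for `target`:
    that omission is the stop-gradient.
    """
    diff = pred - target
    if kind == "l1":
        loss = np.abs(diff).mean()
        # Subgradient of |x|; np.sign gives 0 where diff is exactly 0.
        grad_pred = np.sign(diff) / diff.size
    elif kind == "l2":
        loss = (diff ** 2).mean()
        grad_pred = 2.0 * diff / diff.size
    else:
        raise ValueError(f"unknown loss kind: {kind!r}")
    return loss, grad_pred


# Toy embeddings: 2 tokens, 2 dimensions each.
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
target = np.array([[1.0, 0.0], [0.0, 4.0]])

l1, g1 = prediction_loss(pred, target, kind="l1")
l2, g2 = prediction_loss(pred, target, kind="l2")
print(l1)  # 1.25 = (0 + 2 + 3 + 0) / 4
print(l2)  # 3.25 = (0 + 4 + 9 + 0) / 4
```

In an autograd framework the same effect usually comes from detaching the target before computing the loss, for example `target.detach()` in PyTorch or `jax.lax.stop_gradient(target)` in JAX.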

LAB

Hour 3 — Lab: V-JEPA 2-AC zero-shot inference (75 min)

What you're building. Run V-JEPA 2-AC on a short video clip from one of your imitation-learning rollouts (Day 16's ACT eval video). Use it to predict the embedding of the next frame given current frame + next action, then verify the prediction matches the real next frame's embedding within a small tolerance.

What success looks like at the end. You have:

  1. w6-frontier/src/day36_vjepa2_ac.py runnable.
  2. Console output: predicted-vs-actual embedding cosine similarity ≥ 0.85 across 10 random frame pairs.
  3. Plot figures/day36_jepa_pred_quality.png showing the cosine-similarity distribution; should be tightly clustered above 0.8.
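
The cosine-similarity check in the success criteria might look like the sketch below. The embeddings are synthetic stand-ins, not model outputs: in a real run, `actual` would hold encoder embeddings of your frames and `predicted` the V-JEPA 2-AC predictions for the same frames. The names, the embedding width of 1024, the noise level, and the reading of "≥ 0.85" as applying to every pair are all assumptions.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Row-wise cosine similarity between two (N, D) arrays."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / np.maximum(den, eps)


rng = np.random.default_rng(0)
num_frames, dim = 50, 1024  # hypothetical clip length and embedding width

# Stand-in for the real per-frame embeddings from the encoder.
actual = rng.normal(size=(num_frames, dim))
# Stand-in for predictions: the true embedding plus some error.
predicted = actual + 0.3 * rng.normal(size=actual.shape)

# Sample 10 distinct (t, t+1) pairs; the prediction made at frame t
# is compared against the real embedding of frame t+1.
t_idx = rng.choice(num_frames - 1, size=10, replace=False)
sims = cosine_similarity(predicted[t_idx + 1], actual[t_idx + 1])

print(f"min={sims.min():.3f} mean={sims.mean():.3f}")
passed = bool((sims >= 0.85).all())
print("PASS" if passed else "FAIL")
```

For the plot, one option is a histogram of `sims` with matplotlib (`plt.hist(sims)` followed by `plt.savefig("figures/day36_jepa_pred_quality.png")`), assuming matplotlib is installed in your environment.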

Step 1 — Download V-JEPA 2-AC checkpoint (15 min)

huggingface-cli download facebook/vjepa2-ac-vitl16 --local-dir checkpoints/vjepa2_ac
ls -la checkpoints/vjepa2_ac/

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.


Papers you will re-read after this