V-JEPA 2, V-JEPA 2-AC, LeJEPA

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (15 min)

JEPA (Joint Embedding Predictive Architecture) — Yann LeCun, 2022. An architecture that predicts representations of future states in embedding space rather than pixels. Sidesteps the "predict every pixel" inefficiency of generative world models.
V-JEPA — Video JEPA. First major instantiation, Meta 2024. Predicts masked video tubelets in DINO-style embedding space.
V-JEPA 2 — Meta, June 2025. Scaled-up V-JEPA: encoders up to roughly 1B parameters, pretrained on more than 1M hours of video.
V-JEPA 2.1 — Meta Mar 2026. Latest checkpoint with better recipes and added action-conditioning hooks.
V-JEPA 2-AC — Action-Conditioned variant. Fine-tuned for "given the current observation and the next action, predict the next observation's embedding." Usable as a world model for planning.
LeJEPA — LeRobot's JEPA implementation (HuggingFace, 2025). Smaller, robotics-data-trained.
Predictor — Small transformer that takes context embeddings + mask → predicts target embeddings.
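Stop-gradient — Applied to the target encoder's output, so no gradient flows back into the target encoder. Helps prevent representation collapse, where every input maps to the same constant embedding and the prediction task becomes trivial.
EMA target — The target encoder's weights are an exponential moving average of the online encoder's weights. A standard self-supervised trick (BYOL, DINO, JEPA).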
Tubelet — A small spatiotemporal patch (e.g. 2×16×16) in video. The unit of masking.
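
Two of the entries above, tubelets and the EMA target, reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names, the 16-frame 224×224 clip, and the 0.99 momentum are assumptions chosen for the example, not values taken from any V-JEPA release.

```python
def num_tubelets(frames, height, width, tubelet=(2, 16, 16)):
    """Count non-overlapping tubelets in a clip (sizes must divide evenly)."""
    t, h, w = tubelet
    if frames % t or height % h or width % w:
        raise ValueError("clip dimensions must be multiples of the tubelet size")
    return (frames // t) * (height // h) * (width // w)


def ema_update(target, online, momentum=0.99):
    """Move each target weight a small step toward the matching online weight."""
    return [momentum * tw + (1.0 - momentum) * ow for tw, ow in zip(target, online)]


# A 16-frame 224x224 clip splits into 8 * 14 * 14 tubelets.
print(num_tubelets(16, 224, 224))  # 1568

# Hypothetical two-parameter "encoders" to show how slowly the target follows.
target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(100):
    target = ema_update(target, online)
print(target)  # roughly 63% of the way to the online weights after 100 steps
```

The slow-moving target is the point: the online encoder chases a prediction target that changes much more gradually than it does itself.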

Real-world analogy

Traditional video models predict pixels ("what will the next frame look like?"), which is wasteful because most pixels (background, lighting) carry little task-relevant information. JEPA predicts representations ("what will the next frame mean?"): it discards low-level texture and focuses on semantic content. It is like the difference between transcribing every word in a meeting and writing meeting minutes.
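
A toy numerical version of the analogy, assuming a hypothetical "encoder" that simply averages blocks of pixels. Real JEPA encoders are learned transformers; the block average here only stands in for "keeps coarse content, drops fine texture."

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def toy_encoder(frame, block=4):
    """Hypothetical encoder: average each block of pixels, discarding fine texture."""
    return [sum(frame[i:i + block]) / block for i in range(0, len(frame), block)]


# Two 1-D "frames" with the same coarse content (block means 0.2 and 0.8)
# but opposite fine texture (a +/-0.1 ripple).
frame_a = [0.3, 0.1, 0.3, 0.1, 0.9, 0.7, 0.9, 0.7]
frame_b = [0.1, 0.3, 0.1, 0.3, 0.7, 0.9, 0.7, 0.9]

pixel_error = mse(frame_a, frame_b)
embedding_error = mse(toy_encoder(frame_a), toy_encoder(frame_b))
print(pixel_error, embedding_error)
```

The pixel error comes out near 0.04 while the embedding error is essentially zero: a pixel-space loss would penalize the texture flip, and the embedding-space loss does not.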

Hour 1 — Reading

LeCun's A Path Towards Autonomous Machine Intelligence (2022) — sections on JEPA, ~25 min: https://openreview.net/pdf?id=BZ5a1r-kVsf
V-JEPA paper, abstract + Section 3 (~20 min): https://arxiv.org/abs/2404.08471
V-JEPA 2 / 2-AC blog (~15 min): https://ai.meta.com/blog/v-jepa-2-world-model-physical-reasoning/

Hour 2 — LeJEPA codebase

ssh -i ~/.ssh/nebius_key ubuntu@<your-h100-ip>
cd ~ && mkdir -p robo47-wm && cd robo47-wm
uv venv --python 3.12 .venv && source .venv/bin/activate

git clone https://github.com/huggingface/lejepa
cd lejepa
uv pip install -e .

Read these files for ~30 min:
lejepa/models/predictor.py — the small predictor transformer
lejepa/models/encoder.py — DINOv3-style ViT encoder
lejepa/training/loss.py — the L1/L2 prediction loss + stop-gradient logic
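
Before opening the loss file, it may help to have the two candidate losses in mind. This is a plain-Python sketch, not code from the repository; the vectors are made up, and the stop-gradient appears only as a comment because there is no autograd here.

```python
def l1_loss(pred, target):
    """Mean absolute error between predicted and target embeddings."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared error between predicted and target embeddings."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


predicted = [0.5, -1.0, 2.0]
# In an autograd framework the target would be detached (stop-gradient),
# so only the predictor and the online encoder receive gradients.
target = [0.0, -1.0, 4.0]

print(l1_loss(predicted, target))  # (0.5 + 0.0 + 2.0) / 3
print(l2_loss(predicted, target))  # (0.25 + 0.0 + 4.0) / 3
```

The last component is off by 2.0, which contributes 2.0 to the L1 sum but 4.0 to the L2 sum. L2 punishes large per-dimension errors more heavily, which is one thing to watch for when reading how the loss is chosen.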

LAB

Hour 3 — Lab: V-JEPA 2-AC zero-shot inference (75 min)

What you're building. Run V-JEPA 2-AC on a short video clip from one of your imitation-learning rollouts (Day 16's ACT eval video). Use it to predict the embedding of the next frame given the current frame and the next action, then verify that the prediction matches the real next frame's embedding within a small tolerance.
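
The "matches within a small tolerance" check can be sketched as a cosine-similarity comparison. The helper names are hypothetical, the example vectors are made up, and the 0.85 default is borrowed from the success criteria for this lab.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def within_tolerance(predicted, actual, min_similarity=0.85):
    """True when the predicted embedding points in nearly the same direction."""
    return cosine_similarity(predicted, actual) >= min_similarity


# Made-up 3-D embeddings; real ones have hundreds of dimensions or more.
predicted = [1.0, 2.0, 3.0]
actual = [1.1, 1.9, 3.2]
print(within_tolerance(predicted, actual))
```

Cosine similarity ignores vector length, so it measures whether the prediction points the right way rather than whether its magnitude matches.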

What success looks like at the end. You have:
1. w6-frontier/src/day36_vjepa2_ac.py runnable.
2. Console output: predicted-vs-actual embedding cosine similarity ≥ 0.85 across 10 random frame pairs.
3. Plot figures/day36_jepa_pred_quality.png showing the cosine-similarity distribution; it should be tightly clustered above 0.8.
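
One way the evaluation loop could be organized, as a sketch under stated assumptions: the helper names and the 120-frame clip length are hypothetical, the scores are placeholders rather than measured results, and "≥ 0.85 across 10 pairs" is read here as "every pair clears the threshold."

```python
import random
import statistics


def sample_frame_pairs(num_frames, k=10, seed=0):
    """Pick k distinct (t, t + 1) index pairs from a clip with num_frames frames."""
    if num_frames - 1 < k:
        raise ValueError("clip is too short to draw k distinct consecutive pairs")
    rng = random.Random(seed)  # seeded so the same pairs are drawn on every run
    starts = rng.sample(range(num_frames - 1), k)
    return [(t, t + 1) for t in sorted(starts)]


def summarize(similarities, threshold=0.85):
    """Report the worst case, the average, and whether every pair passed."""
    return {
        "min": min(similarities),
        "mean": statistics.fmean(similarities),
        "passed": all(s >= threshold for s in similarities),
    }


pairs = sample_frame_pairs(num_frames=120)
# Placeholder scores only; in the lab each one comes from comparing the
# predicted and actual embeddings for one sampled pair.
placeholder_scores = [0.91, 0.88, 0.93, 0.86, 0.90, 0.89, 0.92, 0.87, 0.94, 0.90]
print(pairs)
print(summarize(placeholder_scores))
```

Fixing the seed keeps the console output and the plot reproducible between runs, which makes regressions easier to spot.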

Step 1 — Download V-JEPA 2-AC checkpoint (15 min)

huggingface-cli download facebook/vjepa2-ac-vitl16 --local-dir checkpoints/vjepa2_ac
ls -la checkpoints/vjepa2_ac/
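
If you prefer to check the download from Python rather than with ls, a small stdlib helper along these lines would do. The function name is hypothetical, and it makes no assumption about which files the checkpoint contains; it only lists what is on disk.

```python
from pathlib import Path


def list_checkpoint_files(directory):
    """Return (relative_path, size_in_bytes) for every file under directory."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return [
        (str(path.relative_to(root)), path.stat().st_size)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    ]


files = list_checkpoint_files("checkpoints/vjepa2_ac")
if not files:
    print("No files found; check that the download finished.")
for name, size in files:
    print(f"{size / 1e6:10.1f} MB  {name}")
```

An empty listing or suspiciously small weight files usually mean the download was interrupted and should be rerun.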

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.

Papers you will re-read after this

TD-MPC2 — scalable world models