Week 5: VLA Architectures, Day 33
Gemini Robotics + Robot Academy IBVS primer
This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.
LECTURE & READING
Glossary primer (10 min)
- Gemini Robotics — Google DeepMind, 2025. Robotics adaptation of Gemini 2.0/2.5. Native multimodal (image + video + audio + text → action).
- Gemini Robotics-ER — "Embodied Reasoning" variant. Spatial reasoning, scene understanding, plan generation.
- Gemini Robotics-ER 1.6 — Apr 14, 2026 release. Latest ER. Stronger spatial grounding.
- Native multimodal — Trained from scratch on mixed image/text/audio/video tokens, rather than built as a "text LLM + vision adapter."
- VLM-as-policy — Use a vision-language model (VLM) directly as the policy by emitting end-effector deltas in language form (e.g. "move +0.05 m in x"). Gemini Robotics-ER does this.
- IBVS (Image-Based Visual Servoing) — Classical control: drive image features (e.g. the pixel position of an object) toward target values using the visual-feature Jacobian. Predates VLAs by roughly 30 years; conceptually similar to "policy outputs end-effector (EE) deltas given vision".
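To make the VLM-as-policy entry concrete, here is a minimal sketch of turning a language-form delta into a numeric end-effector command. The command grammar is an assumption modeled only on the "move +0.05 m in x" example above; `parse_ee_delta` and its regular expression are hypothetical helpers, not part of any Gemini API, and real model output may need a different parser.

```python
import re

# Hypothetical command format, modeled on the glossary example "move +0.05 m in x".
_DELTA_RE = re.compile(
    r"move\s+([+-]?\d+(?:\.\d+)?)\s*(mm|cm|m)\s+in\s+([xyz])", re.IGNORECASE
)
_UNIT_TO_M = {"mm": 1e-3, "cm": 1e-2, "m": 1.0}
_AXIS_INDEX = {"x": 0, "y": 1, "z": 2}


def parse_ee_delta(text):
    """Sum every 'move <value> <unit> in <axis>' clause into [dx, dy, dz] in metres."""
    delta = [0.0, 0.0, 0.0]
    for value, unit, axis in _DELTA_RE.findall(text):
        delta[_AXIS_INDEX[axis.lower()]] += float(value) * _UNIT_TO_M[unit.lower()]
    return delta


if __name__ == "__main__":
    print(parse_ee_delta("move +0.05 m in x, then move -2 cm in z"))
```

Text that matches no clause yields a zero delta, so a downstream controller would simply hold position instead of acting on an unparsed command.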
Real-world analogy
Gemini Robotics is "Tesla Autopilot": vertically integrated, proprietary, fed by enormous private data. ER 1.6 is the latest "FSD beta" with sharper spatial reasoning.
Hour 1 — Robot Academy IBVS primer (visual intuition first)
Watch the Visual Servoing masterclass, focusing on the Image-Based VS lessons (~35 min): https://robotacademy.net.au/masterclass/vision-and-motion/
Why now? IBVS predates VLAs by decades, but the conceptual loop, "vision → EE delta", is the same. Modern policies can be read as IBVS with the hand-derived visual-feature Jacobian replaced by a learned mapping. Watching Corke's animated IBVS demos makes "what does Gemini Robotics-ER actually do?" click.
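The "vision → EE delta" loop can be sketched with the textbook IBVS law for point features, v = -λ L⁺ (s - s*), where L is the visual-feature Jacobian (interaction matrix). This is a minimal NumPy sketch, not code from the masterclass: it assumes normalized image coordinates, known feature depths, and a camera-frame velocity command, and the function names and the gain value are illustrative.

```python
import numpy as np


def interaction_matrix(x, y, Z):
    """2x6 interaction matrix for one normalized image point (x, y) at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])


def ibvs_velocity(points, targets, depths, gain=0.5):
    """Camera twist [vx, vy, vz, wx, wy, wz] computed as -gain * pinv(L) @ error."""
    points = np.asarray(points, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Stack one 2x6 block per tracked point.
    L = np.vstack([interaction_matrix(x, y, Z) for (x, y), Z in zip(points, depths)])
    error = (points - targets).reshape(-1)
    return -gain * np.linalg.pinv(L) @ error


if __name__ == "__main__":
    current = [(0.10, 0.00), (-0.10, 0.05)]
    desired = [(0.00, 0.00), (-0.20, 0.05)]
    print(ibvs_velocity(current, desired, depths=[1.0, 1.0]))
```

In this framing, a learned policy replaces the hand-derived `pinv(L)` with a network that maps images directly to the delta.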
Hour 2 — Reading
- Gemini Robotics announcement (Mar 2025) (~20 min): https://deepmind.google/discover/blog/introducing-gemini-robotics/
- Gemini Robotics-ER 1.6 announcement (Apr 14, 2026) (~25 min): https://deepmind.google/discover/blog/gemini-robotics-er-16/
LAB
Hour 3 — Lab: Gemini Robotics-ER inference via API (75 min)
What you're building. Use Google's Gemini API (which exposes Gemini Robotics-ER 1.6 publicly as of Apr 2026) to run spatial-reasoning queries on images, then use the responses to drive a simulated Panda toward a designated object.
Step 1 — Set up API key (10 min)
uv pip install google-generativeai
export GOOGLE_API_KEY=<your-key> # from https://ai.google.dev
Step 2 — Spatial reasoning query (30 min)
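As a hedged illustration of the post-processing this step might need, the sketch below converts a pointing-style response into pixel coordinates. It makes no API call. The response schema is an assumption: a JSON list of objects with a "label" and a "point" given as [y, x] normalized to 0-1000, possibly wrapped in a markdown code fence. Check the current Gemini API documentation before relying on it; `parse_points` is a hypothetical helper, not part of the committed lab code.

```python
import json

FENCE = "`" * 3  # markdown code-fence marker that may wrap the model's JSON


def parse_points(response_text, image_width, image_height):
    """Map each label to (u, v) pixel coordinates.

    Assumes points arrive as [y, x] normalized to the range 0-1000.
    """
    lines = response_text.strip().splitlines()
    # Drop an optional opening fence line (e.g. with a "json" tag) and closing fence.
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    items = json.loads("\n".join(lines))

    pixels = {}
    for item in items:
        y_norm, x_norm = item["point"]
        u = x_norm / 1000.0 * image_width   # column, in pixels
        v = y_norm / 1000.0 * image_height  # row, in pixels
        pixels[item["label"]] = (u, v)
    return pixels


if __name__ == "__main__":
    sample = '[{"point": [500, 250], "label": "red block"}]'
    print(parse_points(sample, image_width=640, image_height=480))
```

The resulting pixel target could then serve as the desired image feature for a servoing loop in simulation, given camera intrinsics and a depth estimate.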
Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.