RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, Brianna Zitkovich
THE PROBLEM
Before RT-2, robot learning faced a fundamental generalization crisis. Systems like RT-1 (its predecessor) were trained on specific robot trajectories in controlled environments: they learned mappings from "image → action" but couldn't reason about what they were seeing or why. When presented with a novel object or a command that didn't appear in training data, they failed catastrophically. The core limitation: robots had to memorize every scenario. Vision-language models (like CLIP, PaLM-E, PaLI-X) solved this for perception and language understanding through massive pretraining, but nobody had figured out how to meaningfully combine that semantic understanding with robot control. Previous attempts either treated vision-language models as separate perception modules (losing the reasoning benefits) or tried clunky hybrid approaches that didn't leverage the full power of the pretrained models. The gap was concrete and costly: a robot trained to pick up red cups wouldn't pick up a red mug, even though a human instantly understands the semantic similarity.
HOW IT WORKS
Represent Actions as Language Tokens
Co-Fine-Tune on Robot and Web Data Together
Evaluate Emergent Semantic Reasoning
Measure Generalization Against Baselines
Ablate Critical Design Choices
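The first step above, representing actions as language tokens, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the uniform 256-bin discretization, the 7-dimensional action layout, the normalized value range of [-1, 1], and all function names are assumptions chosen for the example.

```python
from typing import List, Sequence

NUM_BINS = 256          # assumed number of discrete bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized range for every action dimension


def action_to_tokens(action: Sequence[float]) -> List[int]:
    """Map each continuous action value to an integer bin index in [0, NUM_BINS - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)          # keep the value inside the range
        fraction = (clipped - LOW) / (HIGH - LOW)     # rescale to [0, 1]
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def tokens_to_action(tokens: Sequence[int]) -> List[float]:
    """Map bin indices back to the center value of each bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (t + 0.5) * width for t in tokens]


def tokens_to_text(tokens: Sequence[int]) -> str:
    """Render bin indices as a space-separated string a language model could emit."""
    return " ".join(str(t) for t in tokens)


# Hypothetical action: three translation deltas, three rotation deltas, gripper command.
action = [0.5, 0.3, 0.2, 0.0, 0.0, 0.0, 1.0]
tokens = action_to_tokens(action)
print(tokens_to_text(tokens))
print(tokens_to_action(tokens))
```

Because the action becomes an ordinary string of integers, the same output vocabulary and decoding loop can serve both text answers and robot commands, which is the point of the shared token representation.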
KEY RESULTS
vs. RT-1 (previous SOTA) and VC-1 (vision pretraining baseline)
On tasks the robot never trained on (symbol understanding, reasoning, chain-of-thought), RT-2 succeeded where baselines failed. This 3x improvement is the clearest evidence that knowledge from internet-scale training transfers to robot reasoning.
vs. RT-1 and other baselines across all generalization axes
When shown objects it hadn't seen during training, RT-2 performed roughly twice as well as the baselines. This is the practical metric that matters for real-world deployment: robots in factories and homes encounter novel objects constantly.
vs. typical robotics papers with hundreds of trials
The sheer scale of evaluation (6k trials) gives statistical confidence that these results aren't noise. Robotics is notoriously noisy; this level of rigor is rare and reassuring.
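To make the effect of trial count concrete, the sketch below computes a 95% Wilson score interval for a measured success rate at two sample sizes. The 60% success rate and the trial counts used here are hypothetical values chosen for illustration, not figures from the paper, and a 6k total would in practice be spread over many tasks and conditions, so per-condition intervals would be wider.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Return the Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return center - half, center + half


# Compare a small study with a large one at the same (hypothetical) success rate.
for trials in (100, 6000):
    successes = round(0.6 * trials)
    low, high = wilson_interval(successes, trials)
    print(f"n={trials}: [{low:.3f}, {high:.3f}] width={high - low:.3f}")
```

The interval width shrinks roughly with the square root of the number of trials, which is why a few hundred trials can leave differences between methods hard to distinguish from run-to-run variation.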
vs. Ablation comparing PaLI-X 55B vs. 5B variant
Larger pretrained models show significantly better transfer. This suggests that web-scale knowledge scales with model capacity: more internet knowledge means more robot generalization.
WHY DEVELOPERS SHOULD CARE
If you're building robotics software, RT-2 fundamentally changes the game. Before, you'd train a robot model on thousands of hand-annotated demonstrations, and it would work only in the narrow slice of world it was trained on. Now, you can leverage pretrained vision-language models to give your robot common-sense reasoning "for free." The practical implication: instead of collecting 10,000 robot trajectories to teach a robot to pick up cups, you might collect 1,000 and get better generalization because the model learned about cup-ness from the internet. More importantly, RT-2 demonstrates that robots can reason semantically: they understand not just "move arm to position (0.5, 0.3, 0.2)" but "pick up the thing that would make a good hammer." This opens doors to robots understanding user intent in natural language, adapting to novel scenarios, and operating in unstructured real-world environments. For a developer, the lesson is: stop thinking of vision, language, and action as separate problems. Unify them through a shared token representation, leverage existing pretrained models, and let the network discover the semantic connections. This is how you get emergent capabilities you didn't explicitly program.
LIMITATIONS
RT-2 still has significant gaps. The robot trajectories used were from relatively constrained manipulation tasks (grasping, object placement in table-top environments); it's unclear how well this generalizes to locomotion, navigation, or dynamic tasks. The action tokenization is lossy; by converting continuous control into discrete tokens, the model loses fine-grained precision, which could be problematic for tasks requiring delicate manipulation. The chain-of-thought reasoning, while impressive, is still "rudimentary": the model can pick a rock as a hammer, but it's not clear how well it would handle truly complex multi-step planning or recovery from failure. There's also a deployment question: these models are 12-55B parameters, which is computationally expensive for edge robots. Finally, the evaluation, while extensive, is still in simulation or controlled lab settings; real-world robustness at scale remains unproven.
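The precision lost to action tokenization can be bounded with simple arithmetic. Assuming uniform bins over a fixed range and decoding to bin centers (both assumptions for this sketch, as are the bin counts and the hypothetical 10 cm per-step motion range), the worst-case round-trip error is half a bin width:

```python
def max_quantization_error(low: float, high: float, num_bins: int) -> float:
    """Worst-case error when a value is snapped to the center of its uniform bin."""
    return (high - low) / num_bins / 2.0


def quantize(value: float, low: float, high: float, num_bins: int) -> float:
    """Discretize a value into a uniform bin and return that bin's center."""
    width = (high - low) / num_bins
    index = min(int((value - low) / width), num_bins - 1)
    return low + (index + 0.5) * width


RANGE_M = 0.10  # hypothetical per-step motion range in meters

for num_bins in (64, 256, 1024):
    bound = max_quantization_error(0.0, RANGE_M, num_bins)
    # Sweep the range and record the largest observed round-trip error.
    samples = [RANGE_M * i / 9999 for i in range(10000)]
    observed = max(abs(v - quantize(v, 0.0, RANGE_M, num_bins)) for v in samples)
    print(f"{num_bins} bins: bound={bound * 1000:.3f} mm, observed={observed * 1000:.3f} mm")
```

Under these assumptions, adding bins shrinks the error linearly but enlarges the set of action tokens the model must learn to predict, so the bin count is a trade-off rather than a free parameter.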
WHAT COMES NEXT
The next frontier is scaling RT-2 to locomotion and more complex manipulation, combining it with actual chain-of-thought planning (where the robot generates intermediate reasoning steps), and deploying it to real robots in unstructured environments. We'll likely see RT-3 add: (1) real-time online learning where robots update their understanding as they encounter novel objects, (2) multi-modal reasoning (interpreting gestures, tone of voice, not just language commands), (3) failure recovery and self-correction (when a grasp fails, the robot reasons about why and adjusts), and (4) better integration with world models so the robot can plan multiple steps ahead. The long-term vision is a universal robot brain that understands language, vision, and physical consequence as deeply as humans do, and RT-2 is the critical stepping stone showing that leveraging internet-scale pretraining is the right path forward.