Open X-Embodiment: Robotic Learning Datasets and RT-X Models
ARCHITECTURE
THE PROBLEM
Before this work, robotic learning suffered from radical fragmentation. Every lab trained separate models for its specific robot, task, and environment. If you wanted to teach a robotic arm to pick up objects, you'd collect data on that robot, train a model, and hope it worked. If you then wanted that same arm to do something new, you'd start almost from scratch. This meant massive redundancy: researchers were independently collecting data of robots doing similar tasks, but the knowledge gained by one robot learning to grasp never benefited another robot. Previous multi-robot work existed (like RT-1 from Google), but it was typically limited to a handful of robots within one lab. The scaling challenge was enormous: how do you even standardize data formats across 21 different institutions? How do you handle vastly different robot morphologies, camera angles, action spaces, and task definitions?
The field lacked both the collaborative infrastructure and empirical evidence that a single model could genuinely improve performance across diverse hardware.
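To make the action-space question concrete, here is a minimal sketch of one way to bring actions from different robots into a shared layout: rescale each dimension to [-1, 1] using per-robot bounds, then zero-pad to a common length. The names `ActionSpec`, `normalize_action`, and `pad_action`, the bounds, and the 7-dimensional default are all hypothetical and chosen for illustration; this is not the project's actual preprocessing.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ActionSpec:
    """Per-robot action bounds (hypothetical; real bounds come from each dataset)."""
    low: List[float]
    high: List[float]


def normalize_action(action: Sequence[float], spec: ActionSpec) -> List[float]:
    """Clip each dimension to [low, high], then rescale it linearly to [-1, 1]."""
    if not (len(action) == len(spec.low) == len(spec.high)):
        raise ValueError("action and bounds must have the same length")
    out = []
    for a, lo, hi in zip(action, spec.low, spec.high):
        if hi <= lo:
            raise ValueError("each upper bound must exceed its lower bound")
        clipped = min(max(a, lo), hi)
        out.append(2.0 * (clipped - lo) / (hi - lo) - 1.0)
    return out


def pad_action(action: Sequence[float], dim: int = 7) -> List[float]:
    """Zero-pad to a shared length so every robot yields same-shaped vectors."""
    if len(action) > dim:
        raise ValueError("action has more dimensions than the shared layout")
    return list(action) + [0.0] * (dim - len(action))
```

Zero-padding is only one illustrative choice. A real pipeline would also have to agree on what each dimension means (for example, end-effector deltas versus joint targets), which rescaling alone does not solve.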
HOW IT WORKS
Create a Standardized Data Format (Bridge Protocol)
Scale to a Multi-Modal Transformer (RT-X Architecture)
Evaluate Transfer Learning Across 22 Robot Platforms
Enable Few-Shot Adaptation via In-Context Learning
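The first step above, a standardized data format, could be sketched roughly as follows: each lab supplies a mapping from its own field names to a shared step record, and a converter emits episodes in that shared shape. The `Step` fields, the `convert_episode` helper, and keys such as `rgb_cam0` are hypothetical illustrations and are not taken from the project's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Step:
    """One timestep in a shared, robot-agnostic layout (hypothetical fields)."""
    image: Any            # camera observation, kept opaque in this sketch
    instruction: str      # natural-language task description
    action: List[float]   # action vector as recorded by the source lab
    is_terminal: bool     # True only for the final step of the episode


def convert_episode(
    raw_frames: List[Dict[str, Any]],
    field_map: Dict[str, str],
    instruction: str,
) -> List[Step]:
    """Convert lab-specific frames into Step records.

    field_map maps the shared names "image" and "action" to the keys
    that a particular lab uses in its own logs.
    """
    steps = []
    last = len(raw_frames) - 1
    for i, frame in enumerate(raw_frames):
        steps.append(
            Step(
                image=frame[field_map["image"]],
                instruction=instruction,
                action=[float(x) for x in frame[field_map["action"]]],
                is_terminal=(i == last),
            )
        )
    return steps
```

A lab logging frames as `{"rgb_cam0": ..., "ee_delta": [...]}` would call `convert_episode(frames, {"image": "rgb_cam0", "action": "ee_delta"}, "pick up the cup")`. Keeping the mapping as data rather than code means a new contributor only needs to supply a small dictionary.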
KEY RESULTS
vs. Prior multi-robot work typically involved 2-5 robots from a single lab
This is roughly 10-50× larger than previous multi-robot datasets. The scale is what enables the transformer to learn robust, general representations. Larger datasets tend to yield better generalization, and robotics had been severely limited by data scarcity. This dataset alone is a contribution to the field.
vs. Models trained only on individual robot data or smaller multi-robot subsets
This is the core claim, and it holds up. The transfer isn't magical (you don't get a 100% improvement), but it is statistically consistent. Robots improved on skills they had never performed themselves when the model had also been trained on those skills from other platforms. This supports the foundation-model hypothesis as viable in robotics, not just theoretical.
vs. Typical single-robot systems train on 1-20 skills per study
Diversity forces the model to learn abstractions. Instead of overfitting to a narrow task, RT-X had to find principles that apply across ice cream scooping, cable routing, object repositioning, and cloth manipulation. This breadth is why transfer works: the model learned robust primitives, not task-specific hacks.
vs. Robotics data collection typically happens in 1-2 labs
This coordination is unprecedented in robotics and required significant organizational effort. It demonstrates that cross-institutional, open-science approaches are feasible and valuable. It also means the results are less likely to be a quirk of one lab's setup or bias—they're validated across diverse experimental conditions.
PERFORMANCE COMPARISON
WHY DEVELOPERS SHOULD CARE
If you're building robotics software, this paper is a watershed moment. First, you can now start with a pretrained policy that has seen more robot experience than any individual team could collect. Instead of collecting 10,000 demonstrations to teach a robot a new task, you might need 100, because the model already understands manipulation concepts from other robots. That would be a 100× reduction in data collection, which translates directly into faster deployment and lower cost.
Second, this paper shows that robotics is moving toward the foundation model era. Just as you'd use BERT or GPT as a starting point in NLP, the next generation of robotics software will use pretrained policies like RT-X. You should be thinking now about how to design your robot interfaces and data pipelines to be compatible with these models. The Standardized Bridge Protocol is the earliest version of this standard; expect it to evolve, but the idea is durable.
Third, the collaborative infrastructure matters as much as the model. The Open X-Embodiment project is open-sourcing datasets and models, and it is accepting contributions. If you're working with a robot type not yet represented (humanoid, quadruped, aquatic), contributing your data makes the foundation model better for everyone, including you. This is the opposite of the typical ML dynamic where data is a competitive advantage: in robotics, pooling data makes everyone stronger.
LIMITATIONS
RT-X doesn't solve the morphology problem entirely. A gripper with 2 fingers transfers better to another 2-finger gripper than to a 3-finger hand, suggesting the model is still learning hardware-specific features rather than fully abstract manipulation principles. The dataset is also heavily weighted toward tabletop manipulation and grasping; mobile manipulation, door opening, and contact-rich tasks are underrepresented. Fine-tuning is still required for best performance: you can't just point RT-X at a new robot and expect zero-shot mastery. Additionally, the paper doesn't deeply explore failure modes: which robot types benefit most? Which skills fail to transfer? And there's the unsolved problem of real-world sim-to-real transfer. Almost all of the data is from real robots, and adding simulation data might further improve generalization, though it brings its own challenges.
WHAT COMES NEXT
The natural next steps are scaling further (more robots, more institutions, more skill diversity), improving data efficiency (can we get these results with 1/10th the data?), and extending beyond tabletop to mobile manipulation and whole-body control. We'll likely see specialized variants: RT-X-Humanoid, RT-X-Mobile, RT-X-Surgical. There's also the question of true online learning: can RT-X continuously adapt as robots encounter new scenarios in deployment? And the really ambitious question: can a single RT-X scale to include locomotion, navigation, and long-horizon planning, not just manipulation? The paper hints at RT-2-X (a next iteration), suggesting the team is already pursuing these extensions. Expect the robotics field to gradually consolidate around open-source foundation models, similar to how Hugging Face transformed NLP.