VLA | CURRENT | 2026-04-16

π0.7: a Steerable Model with Emergent Capabilities

Physical Intelligence

ARCHITECTURE
VLA (vision-language-action) with multimodal prompts
ROBOT
multiple: mobile manipulation, bimanual UR5e, various embodiments
DATASET
multi-robot diverse dataset
KEY METRIC
zero-shot cross-embodiment
TASK
manipulation, dexterous tasks, cross-embodiment transfer

π0.7 is a vision-language-action (VLA) model that does something roboticists have been chasing for years: a single generalist model that matches the performance of fine-tuned specialist models without any fine-tuning. More importantly, it exhibits compositional generalization: the ability to recombine skills learned on different tasks to solve problems it has never seen. Think of it like a large language model: if you train an LLM on English-to-French translation and on JSON formatting separately, it can produce French translations in JSON format without being trained on that combination. π0.7 does this with robot skills. It can fold laundry on a new robot with zero laundry-folding training data, use unfamiliar kitchen appliances, and handle long-horizon household tasks, all from a single 7B-parameter model used at inference time without adaptation. This is the first robotics foundation model to demonstrate this kind of broad compositional capability across multiple embodiments and task distributions.

ARCHITECTURE

THE PROBLEM

Previous VLA models required fine-tuning on each new skill or domain to perform well. While models like π*0.6 achieved high performance through reinforcement learning (RL) fine-tuning, this approach doesn't scale: you need new training runs for each task, each robot platform, and each variation. Foundation models in NLP solved this through compositional generalization (combining learned concepts in novel ways), but robotics VLAs hadn't demonstrated this capability at scale. They could understand diverse semantic concepts but couldn't reliably recombine skills the way LLMs do. Additionally, training data integration was naive: combining datasets from different robots, human demonstrations, and autonomous data sources without careful structuring led to performance degradation. The field lacked a framework that could unify diverse data sources while preserving the ability to extract generalizable skills.

HOW IT WORKS

1

Multimodal Prompt Conditioning Framework

The breakthrough insight is that you can't just throw diverse data at a model and expect it to generalize. Instead, π0.7 uses rich, structured prompts with multiple modality channels during training. Beyond simple text instructions ("fold the shirt"), the model learns from prompts that include: visual subgoal images showing the desired end-state of each sub-step, metadata about execution speed and quality, control modality labels (whether to use joint or end-effector control), and descriptions of individual sub-steps. This diversity of conditioning signals acts as a data annotation system that disambiguates how the same task can be performed in different ways. A suboptimal autonomous trajectory can be labeled as low-quality, so the model learns which behaviors to prefer without the data being filtered out entirely. At test time, the model accepts standard language, but it can also accept synthetically generated visual subgoals from a lightweight world model, enabling zero-shot visual generalization to new scenes.
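The conditioning channels listed above can be pictured as one structured record per training example. The following is a minimal sketch only: the class name `MultimodalPrompt`, its field names, and the flattened string format are illustrative assumptions, not the authors' actual prompt schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MultimodalPrompt:
    """Hypothetical container for the conditioning channels described above."""
    instruction: str                       # plain text task, e.g. "fold the shirt"
    substep: Optional[str] = None          # description of the current sub-step
    subgoal_image: Optional[bytes] = None  # encoded image of the desired end-state
    speed: Optional[str] = None            # execution-speed metadata, e.g. "slow"
    quality: Optional[str] = None          # quality metadata, e.g. "high" or "low"
    control_mode: Optional[str] = None     # "joint" or "end_effector"

    def active_channels(self) -> List[str]:
        """Names of the channels that are populated, in field order."""
        return [name for name, value in vars(self).items() if value is not None]

    def text_fields(self) -> str:
        """Flatten the non-image channels into one tagged string."""
        parts = [f"task: {self.instruction}"]
        for name in ("substep", "speed", "quality", "control_mode"):
            value = getattr(self, name)
            if value is not None:
                parts.append(f"{name}: {value}")
        return " | ".join(parts)


# A richly annotated training prompt versus a text-only test-time prompt.
train_prompt = MultimodalPrompt(
    "fold the shirt",
    substep="fold the left sleeve inward",
    speed="slow",
    quality="low",
    control_mode="joint",
)
test_prompt = MultimodalPrompt("fold the shirt")
```

Making every channel except the instruction optional mirrors the description above: training examples can carry as much context as is available, while a test-time prompt may contain nothing but language.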

Zero-shot air fryer attempt
With step-by-step language coaching
With detailed coaching
Source robot: laundry folding
2

Heterogeneous Data Integration

π0.7 unifies multiple data sources under a single prompting framework: multi-robot data (mobile manipulators, bimanual UR5e arms, various embodiments), human teleoperation videos, and autonomous data collected from running different policies. The key challenge is that these sources have different quality levels, control conventions, and success rates. The multimodal conditioning approach addresses this by allowing the model to learn from suboptimal data without degrading performance. For example, autonomous exploration data that achieved 40% success can be included in training with quality annotations, and the model learns to extract useful patterns without copying failure modes. This creates a virtuous cycle: more diverse data sources improve compositional generalization without requiring careful curation or data filtering.
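The "annotate instead of filter" idea can be sketched in a few lines. This is a toy illustration: the `Episode` fields, the source labels, and the rule that maps success to a quality label are assumptions made for the example, not the paper's labeling procedure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    source: str    # illustrative labels: "teleop", "autonomous", "human_video"
    task: str
    success: bool


def quality_label(episode: Episode) -> str:
    """Toy rule (an assumption): successful episodes are 'high', failed ones 'low'."""
    return "high" if episode.success else "low"


def annotate(episodes: List[Episode]) -> List[Dict[str, str]]:
    """Keep every episode and attach metadata instead of dropping failures."""
    return [
        {"task": ep.task, "source": ep.source, "quality": quality_label(ep)}
        for ep in episodes
    ]


# Five autonomous rollouts with a 40% success rate, as in the example above.
episodes = [
    Episode("autonomous", "fold shirt", ok)
    for ok in (True, True, False, False, False)
]
annotated = annotate(episodes)
```

All five rollouts stay in the training set. The three failures simply carry a `low` quality tag that the model can condition on, so at test time a prompt can ask for the high-quality behavior.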

Taking out the trash
Assembling a pinwheel
Peeling a rainbow carrot
Cutting a zucchini
3

Lightweight World Model for Visual Subgoal Synthesis

At inference time, π0.7 can accept visual subgoals generated on the fly by a lightweight world model rather than requiring pre-annotated demonstration data. This is powerful because it means the model can work in novel scenes and with new objects without needing example videos. The world model predicts what the scene will look like after each intermediate step, creating a visual roadmap for the policy. This breaks the dependency on having demonstration data for every new scenario, enabling true zero-shot compositional generalization. For instance, the model can fold laundry on a new robot using the same kind of world model predictions it would use for other manipulation tasks, even though there's zero laundry-folding data in training.
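The interaction between the world model and the policy described above can be sketched as a simple loop. Everything here is a stand-in: `predict_subgoal` and `act` are hypothetical callables, images are reduced to short lists of floats, and the toy implementations exist only so the loop runs. The paper summary does not specify the real interfaces.

```python
from typing import Callable, List

Image = List[float]  # stand-in for an image; a real system would use pixel arrays


def run_with_subgoals(
    observation: Image,
    substeps: List[str],
    predict_subgoal: Callable[[Image, str], Image],
    act: Callable[[Image, Image, str], Image],
) -> List[Image]:
    """For each sub-step, imagine the end-state, then let the policy pursue it."""
    subgoals = []
    for step in substeps:
        goal = predict_subgoal(observation, step)   # world model: predicted end-state
        observation = act(observation, goal, step)  # policy acts; returns new observation
        subgoals.append(goal)
    return subgoals


def toy_world_model(observation: Image, step: str) -> Image:
    """Toy predictor: pretend each sub-step shifts every value by 1.0."""
    return [value + 1.0 for value in observation]


def toy_policy(observation: Image, goal: Image, step: str) -> Image:
    """Toy policy: pretend execution is perfect and the scene reaches the goal."""
    return list(goal)


roadmap = run_with_subgoals(
    [0.0], ["grasp sleeve", "fold sleeve"], toy_world_model, toy_policy
)
```

The point of the structure is that no demonstration of the new scene is needed: each subgoal is generated from the current observation, and the next one is generated from whatever the scene looks like after the policy has acted.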

Autonomous execution with world model
4

Steerable Output Generation

π0.7 isn't just a single monolithic predictor: it's designed to accept steering signals at test time that control how it performs. You can specify desired execution speed, task strategy, or visual subgoals, and the model adapts its behavior accordingly. This steerability is critical for compositional generalization because it means the model doesn't just memorize task-specific behaviors; it learns underlying manipulation principles that can be recombined with different execution parameters. This is why the same model can use a new kitchen appliance by applying learned interaction skills with different spatial targets, or fold different clothing items by adjusting its approach based on visual feedback.
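Steering in this sense means changing the prompt, not the weights. A minimal sketch follows, assuming a dictionary-style prompt and a hypothetical `steer` helper; the allowed field names are taken from the steering signals named above, but the interface itself is invented for illustration.

```python
from typing import Any, Dict

# Steering channels named in the text; the key names are illustrative.
STEERING_FIELDS = {"speed", "strategy", "subgoal_image"}


def steer(base_prompt: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Return a copy of the prompt with steering fields replaced.

    The base prompt is left unmodified, and no model parameters are involved.
    """
    unknown = set(overrides) - STEERING_FIELDS
    if unknown:
        raise ValueError(f"unsupported steering fields: {sorted(unknown)}")
    return {**base_prompt, **overrides}


base = {"instruction": "fold the shirt", "speed": "normal"}
fast = steer(base, speed="fast")
careful = steer(base, speed="slow", strategy="sleeves first")
```

The same frozen model would receive `base`, `fast`, or `careful` and be expected to behave differently for each, which is what distinguishes steering from fine-tuning.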

CROSS-EMBODIMENT TRANSFER

Skills learned on one robot transferred to a completely different robot

UR5e transfer: zero-shot laundry folding

MORE DEMONSTRATIONS

Installing a screw
Folding diverse clothing
Making coffee
Shirt folding
Peeling a cucumber
Making a peanut butter sandwich
Cleaning a glass door
Peeling a zucchini
Folding jeans
Turning clothes right-side out
Opening a door and driving through
Interactive language-directed cleanup

FIGURES

KEY RESULTS

Zero-shot specialist matching: Matches fine-tuned specialist model performance

vs. previous generalist models requiring fine-tuning to match specialist accuracy

This is the headline result. A single model performs as well as models specifically trained on individual tasks, without any task-specific adaptation. This eliminates the need for the fine-tuning pipeline that roboticists currently rely on, reducing deployment time from weeks to seconds.

Cross-embodiment transfer: Achieves human teleoperator-level success rates on new robot platforms

vs. embodiment-specific models that fail when deployed on different hardware

π0.7 can transfer skills across different robot platforms, from mobile manipulators to bimanual arms, without retraining. The fact that it matches human teleoperator performance (a practical reference point for what the hardware can achieve) suggests the model has learned robust manipulation principles rather than platform-specific quirks.

Compositional generalization (laundry folding): Successfully folds laundry on a new embodiment with zero laundry-folding training data

vs. no prior VLA demonstrating this type of skill recombination

This is the most impressive qualitative result. The model never saw laundry folding in training, but by composing cloth manipulation skills from other tasks with new object interaction patterns, it accomplishes the task. This is exactly the kind of generalization that made LLMs revolutionary: using learned components in novel combinations. No robotics VLA had demonstrated this at scale before.

Data diversity integration: Successfully trains on heterogeneous sources: multi-robot, human video, autonomous data

vs. prior models that required careful filtering and separate training for each source

By using multimodal conditioning to disambiguate diverse behaviors, π0.7 can leverage suboptimal autonomous data and human videos simultaneously. This multiplies the effective dataset size without requiring expensive curation, which is critical for scaling foundation models in robotics.

PERFORMANCE COMPARISON

π0.7 vs. task-specific RL-trained specialist models

WHY DEVELOPERS SHOULD CARE

For software developers building robotics systems, π0.7 changes the deployment model fundamentally. You're no longer choosing between a slow, expensive fine-tuning pipeline and a limited generalist model. Instead, you get a model that handles new tasks, new robots, and new objects out of the box with steering prompts: text, visual goals, or execution metadata. This means your software can be more adaptive: users can specify tasks in natural language with optional visual subgoals, and the system handles the execution without requiring model retraining or even careful prompt engineering. The compositional generalization is the real insight to understand: the model learned to decompose tasks into spatial-temporal subgoals and manipulation primitives, then recombine them. When you're designing your robotics application, think about how to provide rich context (visual subgoals, execution constraints, task breakdowns) rather than just text commands. The broader lesson of this work is that scaling robotics systems requires solving the data integration problem, not just collecting more data. π0.7's success came from a clever conditioning framework that let messy, diverse data coexist in training.
If you're building a multimodal robot system, the key takeaway is: structure your prompts to disambiguate behavior, and label your data with metadata about execution style, not just task identity.

LIMITATIONS

Despite its strengths, π0.7 still has meaningful constraints. The compositional generalization, while impressive, remains emergent rather than systematic: the model successfully generalizes on some novel task combinations, but the paper doesn't characterize when or why it fails. Real-world deployment demands near-perfect reliability: occasionally dropping a piece of laundry is tolerable, but a robot that completes its task only 90% of the time is not deployable in most settings. The lightweight world model for visual subgoal generation is mentioned but not detailed; if this model itself requires task-specific training or has failure modes, it limits the zero-shot claim. Cross-embodiment transfer to "new" robots likely means robots similar to those in training; scaling to radically different morphologies (quadrupeds, manipulators with different degrees-of-freedom (DoF) counts) is unproven. The paper also doesn't address the sample complexity of learning new skills entirely from scratch: if you need a robot to perform a manipulation task that shares nothing with the training distribution, how much human data is required?
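One way to see why the reliability point matters for long-horizon tasks is to compound per-step success rates. This is a back-of-the-envelope illustration, not an analysis from the paper, and it assumes that sub-steps succeed or fail independently, which real tasks may not satisfy.

```python
def task_success_rate(step_success: float, num_steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be a probability in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_success ** num_steps


# Ten sub-steps at 90% each versus ten sub-steps at 99% each.
ten_steps_at_90 = task_success_rate(0.90, 10)
ten_steps_at_99 = task_success_rate(0.99, 10)
```

Under this assumption, a ten-step task built from 90%-reliable steps finishes only about 35% of the time, while 99%-reliable steps give roughly 90%. Small per-step gaps become large end-to-end gaps as the horizon grows.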

WHAT COMES NEXT

The natural next step is moving compositional generalization from emergent to systematic: developing better interpretability to understand which combinations of skills transfer and which fail, and potentially adding explicit composition modules that teach the model to reason about task decomposition. We'll likely see π1.0 or beyond focus on pushing embodiment diversity further (flying robots, legged locomotion combined with manipulation), extending to longer-horizon planning (multi-hour household tasks rather than single activities), and tightening the integration with world models so visual reasoning becomes a first-class component rather than an auxiliary feature. The biggest unlock would be on-robot continual learning: today the model is frozen at test time, but roboticists will want to fine-tune on new tasks using robot experience, turning the foundation model into a true learning agent that improves with deployment.

RELATED PAPERS