VLA · CURRENT · 2026-04-16

π0.7: a Steerable Model with Emergent Capabilities

Physical Intelligence

ARCHITECTURE: VLA (vision-language-action) with multimodal prompts

ROBOT: multiple (mobile manipulation, bimanual UR5e, various embodiments)

DATASET: multi-robot diverse dataset

KEY METRIC: zero-shot cross-embodiment

TASK: manipulation, dexterous tasks, cross-embodiment transfer

π0.7 is a vision-language-action (VLA) model that does something roboticists have been chasing for years: it is a single generalist model that matches the performance of fine-tuned specialist models without any task-specific fine-tuning. More importantly, it exhibits compositional generalization: the ability to recombine skills it learned on different tasks to solve new problems it has never seen before. Think of it like a large language model: if you train an LLM on English-to-French translation and on JSON formatting separately, it can produce French translations in JSON format without being trained on that combination. π0.7 does this with physical skills. It can fold laundry on a new embodiment with zero laundry-folding data for that embodiment, use unfamiliar kitchen appliances, and handle long-horizon household tasks, all from a single 7B parameter model running at test time without adaptation. This is the first robotics model to demonstrate this kind of broad compositional capability across multiple embodiments and task distributions.

THE PROBLEM

Previous VLA models required fine-tuning on each new task or domain to perform well. While models like π*0.6 achieved high performance through task-specific fine-tuning, this approach doesn't scale: you need new training runs for each task, each platform, and each variation. Foundation models in NLP solved this through compositional generalization (combining learned concepts in novel ways), but robotics VLAs hadn't demonstrated this capability at scale. They could understand diverse semantic concepts but couldn't reliably recombine skills the way LLMs do. Additionally, data integration was naive: combining datasets from different robots, human demonstrations, and autonomous data sources without careful structuring led to performance degradation. The field lacked a framework that could unify diverse data sources while preserving the ability to extract generalizable skills.

HOW IT WORKS

Multimodal Prompt Conditioning Framework

The central insight is that you can't just throw diverse data at a model and expect it to generalize. Instead, π0.7 uses rich, structured prompts with multiple modality channels during training. Beyond simple text instructions ("fold the shirt"), the model learns from prompts that include visual subgoal images showing the desired end state of each sub-step, metadata about speed and quality, modality labels, and descriptions of individual sub-steps. This diversity of conditioning signals acts as a data annotation system that disambiguates how the same task can be performed in different ways. A suboptimal autonomous episode can be labeled as low-quality, so the model learns which behaviors to prefer without that data being filtered out entirely. At test time, the model accepts standard language instructions, but it can also accept synthetically generated visual subgoals from a lightweight world model, enabling visual generalization to new scenes.
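To make the idea of a structured, multi-channel prompt concrete, here is a minimal sketch of how such a prompt might be represented in application code. The class name `MultimodalPrompt`, its field names, the label vocabularies, and the text serialization format are all hypothetical; the source does not describe the model's actual prompt schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MultimodalPrompt:
    """Hypothetical container for the conditioning channels described above."""
    instruction: str                                              # plain text command, always present
    substep_descriptions: List[str] = field(default_factory=list)
    subgoal_image_paths: List[str] = field(default_factory=list)  # one image per sub-step
    speed: Optional[str] = None                                   # assumed vocabulary, e.g. "slow"
    quality: Optional[str] = None                                 # assumed vocabulary, e.g. "low"

    def channels(self) -> List[str]:
        """Return the names of the channels that are actually populated."""
        present = ["instruction"]
        if self.substep_descriptions:
            present.append("substeps")
        if self.subgoal_image_paths:
            present.append("subgoal_images")
        if self.speed is not None:
            present.append("speed")
        if self.quality is not None:
            present.append("quality")
        return present


def to_text_prefix(prompt: MultimodalPrompt) -> str:
    """Serialize the non-image channels into one text prefix (illustrative format)."""
    parts = [f"task: {prompt.instruction}"]
    for i, step in enumerate(prompt.substep_descriptions, start=1):
        parts.append(f"step {i}: {step}")
    if prompt.speed is not None:
        parts.append(f"speed: {prompt.speed}")
    if prompt.quality is not None:
        parts.append(f"quality: {prompt.quality}")
    return " | ".join(parts)


# Training-style prompt: rich annotations attached to a mediocre episode.
train_prompt = MultimodalPrompt(
    instruction="fold the shirt",
    substep_descriptions=["flatten the shirt", "fold the sleeves inward"],
    subgoal_image_paths=["flat.png", "sleeves.png"],
    speed="slow",
    quality="low",
)

# Test-style prompt: plain language only; every other channel is optional.
test_prompt = MultimodalPrompt(instruction="fold the shirt")

print(train_prompt.channels())
print(to_text_prefix(test_prompt))
```

The point of the design is that every channel except the instruction is optional, so the same interface covers both richly annotated training examples and bare language commands at test time.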

[Videos: zero-shot air fryer attempt; with step-by-step language coaching; with detailed coaching; source robot: laundry folding]

Heterogeneous Data Integration

π0.7 unifies multiple data sources under a single prompting framework: multi-robot data (mobile manipulators, bimanual UR5e arms, various other embodiments), human videos, and autonomous data collected from running different policies. The key challenge is that these sources have different quality levels, conventions, and success rates. The conditioning approach addresses this by letting the model learn from suboptimal data without degrading performance. For example, autonomous data from a policy that achieved 40% success can be included in training with quality annotations, and the model learns to extract useful patterns without copying failure modes. This creates a virtuous cycle: more diverse data sources improve compositional generalization without requiring careful curation or data filtering.
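The "annotate instead of filter" idea can be sketched in a few lines. This is an illustration only: the thresholds, the source names, the per-source success rates, and the episode dictionary layout are assumptions made for the example, not details from the paper.

```python
from collections import defaultdict
from typing import Dict, List


def quality_label(success_rate: float) -> str:
    """Map a source's success rate to a coarse quality tag.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if success_rate >= 0.8:
        return "high"
    if success_rate >= 0.5:
        return "medium"
    return "low"


def annotate(episodes: List[Dict], source_success: Dict[str, float]) -> List[Dict]:
    """Attach a quality tag to every episode; no episode is dropped."""
    annotated = []
    for ep in episodes:
        rate = source_success[ep["source"]]
        annotated.append({**ep, "quality": quality_label(rate)})
    return annotated


episodes = [
    {"id": 0, "source": "teleop", "task": "fold shirt"},
    {"id": 1, "source": "autonomous_policy_a", "task": "fold shirt"},
    {"id": 2, "source": "human_video", "task": "fold shirt"},
]
# Hypothetical per-source success rates; 0.40 mirrors the 40% example in the text.
source_success = {"teleop": 0.95, "autonomous_policy_a": 0.40, "human_video": 0.90}

dataset = annotate(episodes, source_success)

# Count episodes per quality tag to confirm the low-quality data was kept.
counts = defaultdict(int)
for ep in dataset:
    counts[ep["quality"]] += 1

print(dict(counts))
```

A filtering pipeline would have discarded episode 1; here it stays in the dataset, and the tag gives the model a way to tell its behavior apart from higher-quality demonstrations.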

[Videos: taking out the trash; assembling a pinwheel; peeling a rainbow carrot; cutting a zucchini]

Lightweight World Model for Visual Subgoal Synthesis

At test time, π0.7 can accept visual subgoals generated on the fly by a lightweight world model rather than requiring pre-annotated data. This is powerful because the model can work in novel scenes and with new objects without needing example videos. The world model predicts what the scene will look like after each intermediate step, creating a visual roadmap for the policy. This breaks the dependency on having data for every new scenario, enabling compositional generalization. For instance, the model can fold laundry on a new embodiment using the same kind of subgoal predictions it would use for other tasks, even though there is zero laundry-folding data for that embodiment in training.
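The control flow described above (predict a subgoal for each sub-step, then let the policy act toward it) can be illustrated with a toy loop. Everything here is a stand-in: the "scene" is a 2-D point rather than an image, and `toy_world_model`, `toy_policy`, and `run_with_subgoals` are hypothetical names, not the paper's components.

```python
from typing import Callable, List, Tuple

State = Tuple[float, float]  # toy 2-D scene state standing in for a camera image


def toy_world_model(state: State, step_text: str) -> State:
    """Stand-in for the subgoal generator: predicts the scene after one sub-step."""
    offsets = {"move right": (1.0, 0.0), "move up": (0.0, 1.0)}
    dx, dy = offsets[step_text]
    return (state[0] + dx, state[1] + dy)


def _clip(value: float, limit: float) -> float:
    """Bound a value to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def toy_policy(state: State, subgoal: State) -> State:
    """Stand-in for the policy: takes a bounded step toward the predicted subgoal."""
    return (_clip(subgoal[0] - state[0], 0.25), _clip(subgoal[1] - state[1], 0.25))


def run_with_subgoals(
    state: State,
    steps: List[str],
    world_model: Callable[[State, str], State],
    policy: Callable[[State, State], State],
    max_iters: int = 50,
    tol: float = 1e-9,
) -> List[State]:
    """For each sub-step, generate a subgoal and act until it is reached."""
    reached = []
    for text in steps:
        subgoal = world_model(state, text)  # generated on the fly, not pre-annotated
        for _ in range(max_iters):
            if abs(subgoal[0] - state[0]) < tol and abs(subgoal[1] - state[1]) < tol:
                break
            action = policy(state, subgoal)
            state = (state[0] + action[0], state[1] + action[1])
        reached.append(state)
    return reached


trajectory = run_with_subgoals(
    (0.0, 0.0), ["move right", "move up"], toy_world_model, toy_policy
)
print(trajectory)
```

The structural point is that the policy never sees the sub-step text directly in this sketch; it only sees the current state and the predicted subgoal, which is what lets a subgoal generator substitute for task-specific annotation.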

[Video: autonomous execution with world model]

Steerable Output Generation

π0.7 isn't just a single monolithic predictor: it is designed to accept steering signals at test time that shape how it performs a task. You can specify desired speed, strategy, or visual subgoals, and the model adapts its behavior accordingly. This steerability is critical for compositional generalization because it means the model doesn't just memorize task-specific behaviors; it learns underlying principles that can be recombined with different parameters. This is why the same model can use a new kitchen appliance by applying learned interaction skills with different spatial targets, or fold different clothing items by adjusting its approach based on visual subgoals.
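From an application developer's point of view, steering amounts to changing the conditioning while leaving the model untouched. The sketch below shows one way a wrapper might expose that. The `steer` helper, the key names in `STEERING_KEYS`, and the prompt dictionary layout are hypothetical; they are taken from the steering signals named in the prose, not from a published API.

```python
from typing import Any, Dict

# Assumed steering channels, named after the signals listed in the text above.
STEERING_KEYS = {"speed", "strategy", "subgoal_images"}


def steer(base_prompt: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Return a copy of the prompt with steering fields overridden.

    Only the conditioning changes; no model weights are involved.
    """
    unknown = set(overrides) - STEERING_KEYS
    if unknown:
        raise KeyError(f"unsupported steering keys: {sorted(unknown)}")
    return {**base_prompt, **overrides}


base = {"instruction": "fold the towel", "speed": "normal"}

# Same task, two different steering configurations.
careful = steer(base, speed="slow", strategy="fold in thirds")
guided = steer(base, subgoal_images=["towel_half_folded.png"])

print(careful)
print(guided)
```

Returning a new dictionary rather than mutating `base` keeps each steering configuration independent, which makes it easy to compare behaviors for the same instruction side by side.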

CROSS-EMBODIMENT TRANSFER

Skills learned on one robot transferred to a completely different robot

UR5e transfer: zero-shot laundry folding

MORE DEMONSTRATIONS

Installing a screw

Folding diverse clothing

Making coffee

Shirt folding

Peeling a cucumber

Making a peanut butter sandwich

Cleaning a glass door

Peeling a zucchini

Folding jeans

Turning clothes right-side out

Opening a door and driving through

Interactive language-directed cleanup

KEY RESULTS

Zero-shot specialist matching: matches fine-tuned specialist model performance

vs. previous generalist models requiring fine-tuning to match specialist accuracy

This is the headline result. A single model performs as well as models specifically trained on individual tasks, without any task-specific adaptation. This removes the need for the fine-tuning pipeline that roboticists currently rely on, reducing the time to adapt to a new task from weeks to seconds.

Cross-embodiment transfer: achieves human teleoperator-level success rates on new robot platforms

vs. embodiment-specific models that fail when deployed on different hardware

π0.7 can transfer skills across different robot platforms, from mobile manipulators to bimanual arms, without retraining. The fact that it matches human teleoperator performance (a practical reference for what is achievable on that hardware) suggests the model has learned robust principles rather than platform-specific quirks.

Compositional generalization (laundry folding): successfully folds laundry on a new embodiment with zero laundry-folding training data

vs. no prior VLA demonstrating this type of skill recombination

This is the most impressive qualitative result. The model never saw laundry folding on this embodiment in training, but by composing cloth manipulation skills from other tasks with new object interaction patterns, it accomplishes the task. This is exactly the kind of generalization that made LLMs revolutionary: using learned components in novel combinations. No robotics model had demonstrated this at scale before.

Data diversity integration: successfully trains on heterogeneous sources (multi-robot, human video, autonomous data)

vs. prior models that required careful filtering and separate training for each source

By using prompt conditioning to disambiguate diverse behaviors, π0.7 can leverage suboptimal autonomous data and human videos simultaneously. This multiplies the effective dataset size without requiring expensive curation, which is critical for scaling foundation models in robotics.

PERFORMANCE COMPARISON

π0.7 vs. task-specific RL-trained specialist models

WHY DEVELOPERS SHOULD CARE

For software developers building robotics systems, π0.7 changes the deployment model fundamentally. You are no longer choosing between a slow, expensive fine-tuning pipeline and a limited generalist model. Instead, you get a model that handles new tasks, new robots, and new objects out of the box with steering prompts: text, visual goals, or metadata. This means your software can be more adaptive: users can specify tasks in natural language with optional visual subgoals, and the system handles execution without model retraining or elaborate prompt engineering. The compositional generalization is the real insight to understand: the model learned to decompose tasks into spatial-temporal subgoals and primitives, then recombine them. When you design a robotics application, think about how to provide rich context (visual subgoals, constraints, task breakdowns) rather than just text commands. The broader lesson is that scaling robotics systems requires solving the data integration problem, not just collecting more data. π0.7's success came from a conditioning framework that lets messy, diverse data coexist in training. If you are building a robot learning system, the key takeaway is: structure your prompts to disambiguate behavior, and annotate your data with metadata about style, not just task identity.

LIMITATIONS

Despite its strengths, π0.7 still has meaningful constraints. The compositional generalization, while impressive, remains emergent rather than systematic: the model generalizes to some novel combinations, but the paper doesn't characterize when or why it fails. Real-world deployment demands reliability close to 100%; occasionally losing a piece of laundry is different from a system that works only 90% of the time. The lightweight world model for visual subgoal generation is mentioned but not detailed; if this model itself requires task-specific training or has its own failure modes, that limits the generalization claim. Transfer to "new" robots likely means robots similar to those in training; scaling to radically different morphologies (quadrupeds, manipulators with different joint counts) is unproven. The paper also doesn't address the sample complexity of learning new skills entirely from scratch: if you need a robot to perform a task that shares nothing with the training data, how much human data is required?

WHAT COMES NEXT

The natural next step is moving compositional generalization from emergent to systematic: developing better interpretability to understand which combinations of skills transfer and which fail, and potentially adding explicit composition modules that teach the model to reason about how skills combine. We will likely see π1.0 or later versions push embodiment diversity further (flying robots, legged platforms), extend to longer-horizon tasks (multi-hour household tasks rather than single activities), and tighten the integration with world models so that visual reasoning becomes a first-class component rather than an auxiliary feature. The biggest unlock would be on-robot continual learning: today the model is frozen at test time, but roboticists will want to fine-tune on new tasks using on-robot experience, turning the model into a true learning agent that improves with use.
