COMPUTER-VISION | FOUNDATIONAL | 2023-03-07

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, Shuran Song

ARCHITECTURE: diffusion policy, time-series diffusion transformer
ROBOT: multiple (12 tasks across 4 benchmarks)
KEY METRIC: 46.9%
TASK: manipulation

Diffusion policy represents a fundamental shift in how we teach robots to manipulate objects. Instead of treating robot control like traditional machine-learning classification, the authors borrowed diffusion models, the same technology that generates images in DALL-E and Midjourney, and applied them to robot actions. The result is striking: a 46.9% improvement in success rate across 12 different manipulation tasks, from picking up cans to flipping mugs to spreading sauce on pizza. What makes this significant is that diffusion models naturally handle the multimodal nature of robot actions: when there are multiple valid ways to accomplish a task, the policy can represent all of them and commit to one at runtime instead of averaging them. For a developer building robot learning systems, this is a watershed moment: diffusion has proven it is not just a generative modeling trick for images, but a fundamentally better way to think about robot behavior.

THE PROBLEM

Before diffusion policy, robot learning methods such as behavior cloning with LSTM-GMM (a long short-term memory network with a Gaussian mixture model output) and IBC (Implicit Behavioral Cloning) had a critical flaw: they struggled when tasks had multiple valid solutions. Imagine a pizza-making robot spreading sauce: many different spreading patterns work, but traditional policies tend to average them together, producing mediocre trajectories. Meanwhile, transformer-based methods like BET (Behavior Transformer) could predict action sequences but failed to commit to a single solution, hedging across possibilities and causing failed executions. On top of this, as action spaces grew larger (for example, controlling 6 degrees of freedom on a robot arm), existing methods became increasingly unstable during training. The field had no principled way to leverage the generative capabilities that were transforming computer vision; robot learning was stuck using tools designed for classification, not generation.
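
The mode-averaging problem described above can be seen in a toy setting. The sketch below assumes a hypothetical one-dimensional action with two equally valid demonstrated choices (near -1.0 and near +1.0); the data, names, and numbers are illustrative and do not come from the paper.

```python
import random
import statistics

random.seed(0)

# Hypothetical demonstrations for a single observation: half the experts
# go left (action near -1.0), half go right (action near +1.0).
demos = ([random.gauss(-1.0, 0.05) for _ in range(500)]
         + [random.gauss(+1.0, 0.05) for _ in range(500)])

# A policy trained with mean-squared error on one observation converges
# to the mean of the demonstrated actions.
mse_action = statistics.fmean(demos)

# A generative policy draws a sample from the demonstrated distribution;
# resampling one demonstration stands in for that here.
sampled_action = random.choice(demos)


def distance_to_nearest_mode(action: float) -> float:
    """Distance from an action to the closer of the two valid choices."""
    return min(abs(action + 1.0), abs(action - 1.0))


print(f"MSE action:     {mse_action:+.3f}")
print(f"sampled action: {sampled_action:+.3f}")
```

The averaged action lands between the two valid choices, where no expert ever acted, while the sampled action stays close to one of them.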

HOW IT WORKS

1

Reformulate Robot Policy as a Denoising Process

Instead of asking "what action should the robot take given this image?", diffusion policy flips the question: "starting from random noise, what sequence of actions best explains this observed scene?" During training, the model learns to gradually denoise corrupted action sequences conditioned on visual input, which amounts to learning the gradient (score function) of the action distribution. This is borrowed directly from how diffusion models generate images, but here it is applied to action sequences. The key benefit is that this naturally captures multimodality: the denoising process can represent multiple peaks in the action distribution and settle into one of them during inference. For tasks with multiple solutions (like sauce spreading), this means the policy can learn all demonstrated strategies and commit to one of them.
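
The training idea in this step can be sketched as a noise-prediction loss. This is a minimal illustration, not the authors' implementation: the linear noise schedule, the number of diffusion steps, the one-dimensional actions, and the `untrained_predictor` stand-in for the network are all assumptions made for the example.

```python
import math
import random

random.seed(0)

NUM_DIFFUSION_STEPS = 100   # assumed value, not taken from the paper
HORIZON = 16                # action-sequence length mentioned in the text

# Linear noise schedule (an assumption) and its cumulative products.
betas = [1e-4 + (0.02 - 1e-4) * k / (NUM_DIFFUSION_STEPS - 1)
         for k in range(NUM_DIFFUSION_STEPS)]
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)


def add_noise(actions, noise, k):
    """Forward process: blend a clean action sequence with Gaussian noise."""
    keep = math.sqrt(alpha_bars[k])
    corrupt = math.sqrt(1.0 - alpha_bars[k])
    return [keep * a + corrupt * n for a, n in zip(actions, noise)]


def denoising_loss(predict_noise, observation, actions):
    """Mean-squared error between the true noise and the predicted noise."""
    k = random.randrange(NUM_DIFFUSION_STEPS)      # random noise level
    noise = [random.gauss(0.0, 1.0) for _ in actions]
    noisy_actions = add_noise(actions, noise, k)
    predicted = predict_noise(noisy_actions, observation, k)
    return sum((p - n) ** 2 for p, n in zip(predicted, noise)) / len(noise)


def untrained_predictor(noisy_actions, observation, k):
    """Hypothetical stand-in for the network: always predicts zero noise."""
    return [0.0 for _ in noisy_actions]


demo_actions = [math.sin(0.3 * t) for t in range(HORIZON)]  # made-up demonstration
observation = [0.0] * 8                                      # made-up image features
loss = denoising_loss(untrained_predictor, observation, demo_actions)
print(f"loss for an untrained predictor: {loss:.3f}")
```

A real policy would replace `untrained_predictor` with a network that takes the observation features into account and would minimize this loss over many demonstrations and noise levels.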

2

Time-Series Diffusion Transformer Architecture

The authors designed a specialized transformer that operates on action sequences rather than single actions. It processes the entire planned action trajectory (typically 16 action steps into the future) as a time series, allowing the model to maintain smooth, physically plausible motion. Each transformer block incorporates visual conditioning: the image observation is embedded and attended to through cross-attention. This is critical because robots need to react continuously to what they see. The time-series formulation matters because it encourages smooth action predictions; unlike methods that predict one action at a time, this predicts a coherent plan and can use Langevin dynamics (a sampling technique with roots in physics) to iteratively refine it at inference time.
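
The Langevin refinement mentioned above can be illustrated with a toy example in which the score function is known in closed form. Everything here is an assumption for illustration: a one-dimensional action, two made-up modes at -1.0 and +1.0, and a hand-written `score` in place of the learned transformer.

```python
import math
import random

random.seed(0)

MODE = 1.0    # two valid actions at -1.0 and +1.0 (made-up values)
SIGMA = 0.2   # spread of demonstrations around each mode (made-up value)


def score(action):
    """Gradient of the log-density of an equal mixture of
    N(-MODE, SIGMA^2) and N(+MODE, SIGMA^2)."""
    pull = MODE * math.tanh(MODE * action / SIGMA**2)
    return (pull - action) / SIGMA**2


def langevin_sample(steps=500, step_size=0.01):
    """Start from noise and follow the score plus injected Gaussian noise."""
    action = random.gauss(0.0, 1.0)
    for _ in range(steps):
        noise = random.gauss(0.0, 1.0)
        action += step_size * score(action) + math.sqrt(2.0 * step_size) * noise
    return action


samples = [langevin_sample() for _ in range(200)]
near_left = sum(abs(s + MODE) < 3 * SIGMA for s in samples)
near_right = sum(abs(s - MODE) < 3 * SIGMA for s in samples)
print(f"{near_left} samples near -1, {near_right} near +1, "
      f"{len(samples) - near_left - near_right} elsewhere")
```

Each run settles near one mode rather than between them, and across runs both modes are covered, which is the behavior the text attributes to the denoising process.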

3

Receding Horizon Control for Real-World Execution

During real robot operation, the system does not commit to the full 16-step plan. Instead, it executes only the first few actions, then re-plans based on the new visual observation. This is receding horizon control, a proven control strategy that makes policies robust to mistakes and disturbances. If a human bumps the robot or an object shifts slightly, the next re-plan corrects course. The paper shows this is essential for real-world success: the Push-T task videos show the robot remaining robust against hand occlusions and external perturbations because it re-plans continuously. This design choice bridges the gap between offline policy learning and online reactive control.
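
The execute-a-few-then-re-plan loop can be sketched on a toy one-dimensional system. The planner below is a hypothetical stand-in for the learned policy, and the choice of executing 8 of the 16 planned actions is an assumed value used only for illustration.

```python
PREDICTION_HORIZON = 16   # plan length mentioned in the text
ACTION_HORIZON = 8        # actions executed before re-planning (assumed value)


def plan_actions(position, target):
    """Stand-in for the policy: 16 equal position deltas toward the target."""
    step = (target - position) / PREDICTION_HORIZON
    return [step] * PREDICTION_HORIZON


def run_episode(start, target, disturbances, actions_per_plan, total_steps):
    """Execute `actions_per_plan` actions from each plan, then re-plan."""
    position, t = start, 0
    while t < total_steps:
        plan = plan_actions(position, target)   # uses the latest observation
        for action in plan[:actions_per_plan]:
            if t >= total_steps:
                break
            # A disturbance (for example, a bump) may shift the position.
            position += action + disturbances.get(t, 0.0)
            t += 1
    return position


bump = {10: 0.5}  # hypothetical push of +0.5 at time step 10

# Open loop: one plan, executed blindly to the end.
open_loop = run_episode(0.0, 1.0, bump, PREDICTION_HORIZON, total_steps=16)
# Receding horizon: re-plan after every 8 executed actions.
receding = run_episode(0.0, 1.0, bump, ACTION_HORIZON, total_steps=80)

print(f"open-loop final error:        {abs(open_loop - 1.0):.3f}")
print(f"receding-horizon final error: {abs(receding - 1.0):.3f}")
```

The open-loop run keeps the full error introduced by the bump, while the re-planning run absorbs it in later plans. Shorter action horizons react faster but call the policy more often.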

4

Visual Conditioning and Perception Integration

The policy does not operate in some abstract feature space: it conditions directly on RGB images from the robot's camera. During training, visual encoders (pre-trained vision models like R3M) extract features that the diffusion model uses to condition action denoising. This end-to-end visual grounding is crucial for sim-to-real transfer and task generalization. The project page highlights real-world successes on Push-T (pushing blocks precisely), Mug Flipping (complex 6-DoF manipulation with orientation constraints), and Sauce Pouring (fluid manipulation with periodic motions), all learned from visual observation alone, with no depth sensors or state estimation required.

MORE DEMONSTRATIONS

[Video gallery: simulated tasks (lift, can, square, tool hang, transport, Push-T, block push, kitchen); real-world Push-T rollouts comparing diffusion policy with R3M, BC-RNN, and IBC variants, plus a Push-T robustness clip; real-world mug flipping, sauce pouring, and sauce spreading rollouts comparing diffusion policy with BC-RNN.]

KEY RESULTS

Average Success Rate Improvement Across 12 Tasks: 46.9%

vs. prior state-of-the-art methods (LSTM-GMM, IBC, BET, Transformer-BC)

This is not a marginal improvement: a 46.9% relative improvement can move a task that fails a large share of the time much closer to reliable completion. The benchmarks span 4 different environments (Robomimic, Implicit Behavioral Cloning tasks, Behavior Transformer tasks, and Franka Kitchen), so this is not a lucky win on one dataset. This suggests that diffusion's handling of multimodal action distributions addresses fundamental problems that prior methods could not solve.
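
How much a 46.9% improvement changes a given task depends on the baseline and on whether the gain is read as relative or absolute. The short calculation below uses hypothetical baseline success rates (not figures from the paper) to show the difference between the two readings.

```python
GAIN = 0.469  # the headline improvement quoted above


def apply_relative_gain(baseline, gain=GAIN):
    """Success rate if the gain multiplies the baseline (capped at 100%)."""
    return min(1.0, baseline * (1.0 + gain))


def apply_absolute_gain(baseline, gain=GAIN):
    """Success rate if the gain is added as percentage points (capped at 100%)."""
    return min(1.0, baseline + gain)


# Hypothetical baseline success rates, chosen only for illustration.
for baseline in (0.30, 0.50, 0.65):
    relative = apply_relative_gain(baseline)
    absolute = apply_absolute_gain(baseline)
    print(f"baseline {baseline:.0%}: relative -> {relative:.1%}, "
          f"absolute -> {absolute:.1%}")
```

Under the relative reading, a 65% baseline rises to roughly 95%, while a 30% baseline only reaches about 44%, so per-task results are more informative than the average alone.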

Real-World Task Success, Push-T (Precise Block Pushing): 100% (end-to-end)

vs. LSTM-GMM failure mode (stuck near block) and IBC failure mode (premature end-zone entry)

Real-world success is the ultimate metric in robotics. Push-T is deceptively hard: it requires precise pushing in confined spaces, exactly where small errors compound. The paper shows diffusion policy succeeding end-to-end while competitors get stuck or make action errors. The robustness videos are compelling: the robot copes with hand occlusion, external perturbations during pushing, and perturbations during the finishing phase. This indicates that receding horizon control and multimodal action learning work on physical hardware.

Multimodal Task Performance, Mug Flipping and Sauce Handling: Successful complex 6-DoF manipulation with periodic actions

vs. LSTM-GMM (shown failing in project page videos)

Mug flipping requires the robot to pick up a mug at a random location, flip it upside down, and rotate it so the handle points left. The task admits multiple valid approaches and demands accuracy near the arm's kinematic limits. Sauce pouring and spreading require dipping, approach, periodic spreading motions, and precise liquid control. These tasks have high action-space dimensionality (6 degrees of freedom) and require learning multimodal strategies (different ways to flip a mug depending on its starting orientation). Diffusion policy handles these gracefully; the comparison videos show LSTM-GMM failing, suggesting the multimodal action advantage is real and crucial for complex manipulation.

Training Stability: Demonstrated across diverse tasks without reported divergence

vs. prior methods requiring careful hyperparameter tuning for high-dimensional action spaces

Diffusion formulations naturally regularize the learned distribution through the denoising objective. The paper notes this yields "impressive training stability," which matters because unstable training means wasted compute and failed experiments. For a developer, this means fewer hyperparameter searches and more reliable model convergence, a practical win that compounds across projects.

WHY DEVELOPERS SHOULD CARE

This paper fundamentally changes what is possible in robot learning. Before diffusion policy, if you wanted a robot to learn from video demonstrations, you faced a hard choice: use behavior cloning (biased, mode-averaging) or use reinforcement learning (slow, sample-inefficient, hard to get right). Diffusion policy offers a third path that combines the data efficiency of imitation learning with the multimodality handling of generative models. For developers building production robot systems, this means you can learn richer, more flexible policies from the same amount of demonstration data. The receding horizon control layer is especially important: it makes learned policies robust to real-world disturbances without requiring explicit uncertainty quantification or risk-aware planning. You can train in simulation, deploy on real hardware with continuous re-planning, and the robot will adapt. The architectural innovations (time-series diffusion transformer, visual conditioning) are also teachable patterns you can adapt to new tasks. The project page shows this working across very different manipulation tasks (pushing, grasping, flipping, pouring, spreading), suggesting diffusion is not a one-trick pony. If you are building a robot learning platform, this paper's code and data release means you can implement and iterate on these ideas immediately. The 46.9% improvement is not just a number: it is the difference between a system that works most of the time and one that solves the task reliably.

LIMITATIONS

The paper does not deeply explore failure modes or fundamental limitations. One implicit constraint: all tasks shown are manipulation-focused; generalization to locomotion or other domains is untested. Diffusion inference requires iterative denoising (typically multiple passes through the model), which is slower than single-shot policy methods; this matters for real-time control, where latency is critical. The real-world experiments, while compelling, are still limited: Push-T, Mug Flipping, and the Sauce tasks represent 3 real-world domains, whereas the simulated benchmarks span 12 tasks. Real-world generalization (sim-to-real transfer) is briefly mentioned but not thoroughly evaluated. Questions remain about how much domain randomization is needed, how the method scales to new objects or unseen lighting conditions, and whether visual pre-training (R3M) is strictly necessary or whether training from scratch works. The paper also does not discuss computational cost during training or inference; diffusion models are generally more expensive than their discriminative counterparts. Finally, the reliance on pre-trained vision models (R3M) for real-world success hints at a dependency on good feature learning that may not always be available for novel robots or domains.
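
The latency concern above comes down to simple arithmetic: each plan costs one forward pass per denoising step, and that cost is spread over the actions executed from the plan. The numbers below are hypothetical, not measurements from the paper, and the helper names are made up for this sketch.

```python
def plan_latency_ms(denoise_steps, forward_pass_ms):
    """Time to produce one plan: one forward pass per denoising step."""
    return denoise_steps * forward_pass_ms


def sustained_action_rate_hz(denoise_steps, forward_pass_ms, actions_per_plan):
    """Upper bound on actions per second when planning is the bottleneck."""
    latency_ms = plan_latency_ms(denoise_steps, forward_pass_ms)
    return 1000.0 * actions_per_plan / latency_ms


# Hypothetical settings: 10 ms per forward pass, 8 actions executed per plan.
slow = sustained_action_rate_hz(denoise_steps=100, forward_pass_ms=10.0,
                                actions_per_plan=8)
fast = sustained_action_rate_hz(denoise_steps=10, forward_pass_ms=10.0,
                                actions_per_plan=8)
print(f"100 denoising steps: at most {slow:.0f} actions/s")
print(f" 10 denoising steps: at most {fast:.0f} actions/s")
```

Under these assumed numbers, cutting the number of denoising steps or executing more actions per plan raises the achievable control rate, at the cost of sample quality or reactivity respectively.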

WHAT COMES NEXT

The immediate direction is clear from the paper's own hints: scaling to longer-horizon tasks (beyond 16-step planning windows), experimenting with different diffusion schedules and noise levels to trade off planning flexibility against commitment, and testing on robots beyond Franka arms (the real-world experiments use specific hardware). Longer term, the field will likely explore hybrid approaches that combine diffusion policies with reinforcement learning fine-tuning, gaining both the data efficiency of imitation learning and reward optimization. Model-based extensions (learning world models plus planning in action space with diffusion) are natural follow-ups. The foundation is set for diffusion to become the standard approach to visuomotor policy learning, similar to how it displaced GANs in image generation. The open-sourcing of code, data, and notebooks means the community will iterate rapidly: expect ablations on architectural choices, scaling laws for action sequence length and image resolution, and applications to underexplored domains like in-hand manipulation or soft robotics. The key technical insight, that diffusion's score-matching objective naturally handles multimodal distributions, will likely influence how future generative models are applied to sequential decision-making problems beyond robotics.
