Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, Shuran Song
ARCHITECTURE
THE PROBLEM
Before Diffusion Policy, robot learning methods built on behavior cloning, such as LSTM-GMM (Long Short-Term Memory with a Gaussian Mixture Model output) and IBC (Implicit Behavior Cloning), had a critical flaw: they struggled when tasks had multiple valid solutions. Imagine a pizza-making robot spreading sauce: infinitely many patterns work, but traditional policies would average them together, producing mediocre trajectories. Meanwhile, transformer-based methods like BET (Behavior Transformer) could predict action sequences but failed to "commit" to a single solution, hedging across all possibilities and causing failed executions. On top of this, as action spaces grew larger (such as controlling 6 degrees of freedom for a robot arm), existing methods became increasingly unstable during training.
The field had no principled way to leverage the rich generative capabilities that were revolutionizing computer vision: robot learning was stuck using tools designed for classification, not generation.
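The mode-averaging failure described above can be illustrated with a toy example. This is a minimal sketch, not taken from the paper: the demonstrations, the two steering modes, and the noise level are all hypothetical, and drawing a stored demonstration stands in for sampling from a learned generative policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstrations: to pass an obstacle, experts steer
# left (-1.0) or right (+1.0) with equal probability, plus small noise.
modes = rng.choice([-1.0, 1.0], size=1000)
actions = modes + 0.05 * rng.standard_normal(1000)

# A regressor trained with mean squared error converges to the sample mean,
# which lies between the two modes and matches neither expert strategy.
mse_action = actions.mean()

# A generative policy samples from the demonstrated distribution instead;
# here that is approximated by drawing one demonstrated action.
sampled_action = rng.choice(actions)

print(f"MSE-optimal action: {mse_action:+.3f}")      # near 0: straight at the obstacle
print(f"Sampled action:     {sampled_action:+.3f}")  # near -1 or +1
```

The averaged action is valid for neither strategy, while any single sample commits to one of them.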
HOW IT WORKS
Reformulate Robot Policy as a Denoising Process
Time-Series Diffusion Transformer Architecture
Receding Horizon Control for Real-World Execution
Visual Conditioning and Perception Integration
MORE DEMONSTRATIONS
KEY RESULTS
vs. prior state-of-the-art methods (LSTM-GMM, IBC, BET, Transformer-BC)
This is not a marginal improvement: a 46.9% relative improvement is large enough to move a task from frequent failure to success in most trials. The benchmarks span 4 different environments (Robomimic, Implicit Behavior Cloning tasks, Behavior Transformer tasks, and Franka Kitchen), which suggests this isn't a lucky win on one dataset. It also suggests that diffusion's handling of multimodal action distributions addresses a fundamental problem that prior methods could not.
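To make the arithmetic behind a relative improvement concrete, here is a small sketch. The baseline success rate of 0.60 is a hypothetical number chosen only for illustration; the paper's per-task figures are not reproduced here.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Relative change of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline

# Hypothetical baseline success rate, used only to illustrate the formula.
baseline_success = 0.60
new_success = baseline_success * (1 + 0.469)

print(round(new_success, 3))
print(round(relative_improvement(baseline_success, new_success), 3))
```

Because the improvement is relative, the absolute gain depends on the baseline: the same 46.9% applied to a lower starting success rate yields a smaller absolute jump.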
vs. LSTM-GMM failure mode (stuck near block) and IBC failure mode (premature end-zone entry)
Real-world success is the ultimate metric in robotics. Push-T is deceptively hard: it requires precise pushing in confined spaces, exactly where small errors compound. The paper shows Diffusion Policy succeeding end-to-end while competitors get stuck or commit action errors. The robustness videos are compelling: the robot survives hand occlusion, external perturbations during pushing, and perturbations during the finishing phase. This demonstrates that receding horizon control and multimodal action learning work on physical hardware.
vs. LSTM-GMM (shown failing in project page videos)
Mug flipping requires the robot to pick up a mug at a random location, flip it upside-down, and rotate it so the handle points left. The task has multiple valid approaches and demands accuracy near the robot's kinematic limits. Sauce pouring and spreading requires dipping, approach, periodic spreading motions, and precise liquid control. These tasks have high action-space dimensionality (6 degrees of freedom) and require learning multimodal strategies (different ways to flip a mug depending on its starting orientation). Diffusion Policy handles these gracefully; the comparison videos show LSTM-GMM failing, suggesting the multimodal action advantage is real and crucial for complex manipulation.
vs. prior methods requiring careful hyperparameter tuning for high-dimensional action spaces
Diffusion formulations naturally regularize the learned distribution through the denoising objective. The paper notes this yields "impressive training stability," which matters because unstable training means wasted compute and failed experiments. For a developer, this means fewer hyperparameter searches and more reliable model convergence, a practical win that compounds across projects.
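The denoising objective mentioned above can be sketched as a standard DDPM-style noise-prediction loss. This is an illustrative sketch under stated assumptions, not the authors' training code: the step count `K`, the linear schedule, the tensor shapes, and the placeholder `zero_predictor` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 100                                  # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, K)       # linear noise schedule (assumed)
alpha_bars = np.cumprod(1.0 - betas)     # cumulative signal retention per step

def add_noise(actions, k, eps):
    """Forward process: blend clean actions with Gaussian noise at step k."""
    signal = np.sqrt(alpha_bars[k])[:, None, None]
    noise = np.sqrt(1.0 - alpha_bars[k])[:, None, None]
    return signal * actions + noise * eps

def denoising_loss(predict_noise, actions, obs):
    """Noise-prediction MSE for one batch of demonstrated action sequences."""
    batch = actions.shape[0]
    k = rng.integers(0, K, size=batch)           # random diffusion step per sample
    eps = rng.standard_normal(actions.shape)     # the noise the model must recover
    noisy = add_noise(actions, k, eps)
    eps_pred = predict_noise(noisy, obs, k)
    return float(np.mean((eps_pred - eps) ** 2))

def zero_predictor(noisy_actions, obs, k):
    """Placeholder for a trained network: always predicts zero noise."""
    return np.zeros_like(noisy_actions)

actions = rng.standard_normal((32, 16, 6))   # batch, horizon, action dimension
obs = rng.standard_normal((32, 10))          # hypothetical observation features
loss = denoising_loss(zero_predictor, actions, obs)
print(f"loss with untrained placeholder: {loss:.3f}")  # roughly 1, the noise variance
```

The loss is a plain regression onto a known target (the sampled noise), which is one intuition for why training tends to be stable: there is no adversarial game and no negative sampling.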
PERFORMANCE COMPARISON
WHY DEVELOPERS SHOULD CARE
This paper changes what's possible in robot learning. Before Diffusion Policy, if you wanted a robot to learn from video demonstrations, you faced a hard choice: use behavior cloning (biased, mode-averaging) or use reinforcement learning (slow, sample-inefficient, hard to get right). Diffusion Policy offers a third path that combines the data efficiency of imitation learning with the multimodality handling of generative models. For developers building production robot systems, this means you can learn richer, more flexible policies from the same amount of demonstration data. The receding horizon control layer is especially important: it makes learned policies robust to real-world disturbances without requiring explicit uncertainty quantification or risk-aware planning.
In principle, you can train in simulation, deploy on real hardware with continuous re-planning, and let the robot adapt as conditions change. The architectural innovations (time-series diffusion transformer, visual conditioning) are also teachable patterns you can adapt to new tasks. The project page shows the approach working across very different manipulation tasks (pushing, grasping, flipping, pouring, spreading), suggesting diffusion isn't a one-trick pony. If you're building a robot learning platform, this paper's code and data release means you can implement and iterate on these ideas immediately. The 46.9% improvement isn't just a number: it can be the difference between a system that works some of the time and one that solves the task reliably.
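The receding horizon idea described above (predict a sequence, execute only part of it, then re-plan) can be sketched as a simple loop. This is a toy illustration, not the paper's controller: `toy_policy`, the 1-D state, and the execution horizon of 8 are assumptions; the 16-step prediction window is the figure this summary mentions.

```python
import numpy as np

PRED_HORIZON = 16   # actions predicted per plan (the 16-step window in this summary)
EXEC_HORIZON = 8    # actions executed before re-planning (assumed)

def toy_policy(state, goal):
    """Hypothetical stand-in for a learned policy: plan equal steps to the goal."""
    step = (goal - state) / PRED_HORIZON
    return np.full(PRED_HORIZON, step)

def run_episode(state, goal, n_replans, disturbance_at=None, disturbance=0.0):
    """Alternate between planning and executing the first part of each plan."""
    for plan_idx in range(n_replans):
        plan = toy_policy(state, goal)          # re-plan from the latest observation
        for action in plan[:EXEC_HORIZON]:      # execute only the first actions
            state = state + action
        if plan_idx == disturbance_at:          # external push between plans
            state = state + disturbance
    return state

final = run_episode(state=0.0, goal=1.0, n_replans=12,
                    disturbance_at=2, disturbance=-0.5)
print(f"final state: {final:.4f}")
```

Because every plan starts from the current state, the push applied after the third plan is absorbed by later plans without any explicit disturbance model. Executing several actions per plan, rather than one, is the trade-off between responsiveness and temporally consistent motion.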
LIMITATIONS
The paper doesn't deeply explore failure modes or fundamental limitations. One implicit constraint: all tasks shown are manipulation-focused; generalization to locomotion or other domains is untested. Diffusion inference requires iterative denoising steps (typically multiple passes through the model), which is slower than single-shot policy methods. This matters for real-time control, where latency is critical. The real-world experiments, while compelling, are still limited: Push-T, Mug Flipping, and Sauce tasks represent 3 real-world domains, whereas the simulated benchmarks span 12 tasks. Real-world generalization (sim-to-real transfer) is briefly mentioned but not thoroughly evaluated. Questions remain about how much domain randomization is needed, how the method scales to new objects or unseen lighting conditions, and whether visual pre-training (for example, R3M) is strictly necessary or whether training from scratch works.
The paper also doesn't discuss computational cost during training or inference; diffusion models are generally more expensive than their discriminative counterparts. Finally, any reliance on pre-trained vision models (such as R3M) for real-world success would imply a dependency on good feature learning that may not always be available for novel robots or domains.
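The latency concern above comes from the structure of the sampling loop: each action sequence requires one model call per denoising step. The following is a minimal DDPM-style sketch, not the paper's inference code. The step count, the schedule, and `toy_noise_predictor` are assumptions; the predictor is an analytic oracle for the degenerate case where every demonstration equals `target`, used here only so the loop can run without a trained network.

```python
import numpy as np

K = 50                                   # denoising steps (assumed)
betas = np.linspace(1e-4, 0.05, K)       # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_noise_predictor(noisy, target, k):
    """Hypothetical oracle: exact noise when all demonstrations equal `target`."""
    return (noisy - np.sqrt(alpha_bars[k]) * target) / np.sqrt(1.0 - alpha_bars[k])

def sample_actions(target, rng):
    """DDPM-style reverse process: K sequential calls to the noise predictor."""
    x = rng.standard_normal(target.shape)        # start from pure Gaussian noise
    calls = 0
    for k in reversed(range(K)):
        eps = toy_noise_predictor(x, target, k)
        calls += 1
        # Posterior mean: remove this step's share of the predicted noise.
        x = (x - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps) / np.sqrt(alphas[k])
        if k > 0:                                # no noise is added on the final step
            x = x + np.sqrt(betas[k]) * rng.standard_normal(x.shape)
    return x, calls

target = np.linspace(0.0, 1.0, 16)               # a 16-step, 1-D action sequence
actions, calls = sample_actions(target, np.random.default_rng(0))
print(f"model calls per action sequence: {calls}")
```

With a real network, each of those calls is a full forward pass, so inference cost scales roughly linearly with the number of denoising steps. Samplers that use fewer steps reduce this cost, typically with some trade-off in sample quality.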
WHAT COMES NEXT
The immediate direction is clear from the paper's own hints: scaling to longer-horizon tasks (beyond 16-step planning windows), experimenting with different diffusion schedules and noise levels to trade off planning flexibility against commitment, and testing on robots beyond the specific arms used in the real-world experiments. Longer term, the field will likely explore hybrid approaches that combine diffusion policies with reinforcement learning fine-tuning, enabling both the data efficiency of imitation learning and reward optimization. Model-based extensions (learning world models and planning in action space with diffusion) are natural follow-ups. The foundation is set for diffusion to become a standard approach to visuomotor policy learning, similar to how it displaced GANs in image generation.
The open-sourcing of code, data, and notebooks means the community can iterate rapidly: expect ablations on architectural choices, scaling laws for action sequence length and image resolution, and applications to underexplored domains like in-hand manipulation or soft robotics. The key technical insight, that diffusion's score-matching objective naturally handles distribution multimodality, will likely influence how future generative models are applied to sequential decision-making problems beyond robotics.