COMPUTER-VISION · FOUNDATIONAL · 2023-03-07

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, Shuran Song

ARCHITECTURE

diffusion policy, time-series diffusion transformer

ROBOT

multiple (12 tasks across 4 benchmarks)

KEY METRIC

46.9%

TASK

manipulation

Diffusion Policy represents a fundamental shift in how we teach robots to manipulate objects. Instead of treating policy learning like traditional classification, the authors borrowed diffusion models, the same technology that generates images in DALL-E and Midjourney, and applied them to robot actions. The result is striking: a 46.9% average improvement in success rate across 12 different tasks, from picking up cans to flipping mugs to spreading sauce on pizza. What makes this significant is that diffusion models naturally handle the "multimodal" nature of actions: when there are multiple valid ways to accomplish a task, the policy can learn all of them and commit to one at runtime. For a developer building robotics systems, this is a watershed moment: diffusion has proven it's not just a generative modeling trick for images, but a fundamentally better way to think about robot behavior.

THE PROBLEM

Before Diffusion Policy, behavior cloning methods such as LSTM-GMM (a Long Short-Term Memory network with a Gaussian Mixture Model output) and IBC (Implicit Behavioral Cloning) had a critical flaw: they struggled when tasks had multiple valid solutions. Imagine a pizza-making robot spreading sauce. There are infinitely many valid patterns that work, but traditional policies would average them together, producing mediocre trajectories. Meanwhile, transformer-based methods like BET (Behavior Transformer) could predict sequences but failed to "commit" to a single solution, hedging bets across all possibilities and causing failed executions. On top of this, as action spaces grew larger (like controlling 6 degrees of freedom for a robot arm), existing methods became increasingly unstable during training. The field had no principled way to leverage the rich generative capabilities that were revolutionizing computer vision: robot learning was stuck using tools designed for classification, not generation.
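
The mode-averaging failure described above can be seen in a toy setting. The sketch below is an illustration only, not anything from the paper: it assumes hypothetical 1-D demonstrations in which half the experts steer left (about -1.0) and half steer right (about +1.0), and shows that the action minimizing mean-squared error sits between the two modes, far from anything an expert actually did.

```python
import random
import statistics

random.seed(0)

# Hypothetical 1-D demonstrations: experts steer around an obstacle at 0.0,
# half going left (about -1.0) and half going right (about +1.0).
demos = [random.gauss(-1.0, 0.05) for _ in range(500)] + \
        [random.gauss(+1.0, 0.05) for _ in range(500)]

# A deterministic policy trained with mean-squared error converges to the
# mean of the demonstrated actions for a given observation.
mse_optimal_action = statistics.fmean(demos)

# Distance from that prediction to the closest demonstrated action.
gap_to_nearest_demo = min(abs(a - mse_optimal_action) for a in demos)

print(f"MSE-optimal action: {mse_optimal_action:+.3f}")
print(f"Gap to nearest demonstration: {gap_to_nearest_demo:.3f}")
```

The averaged action points straight at the obstacle, which is the kind of "mediocre trajectory" a generative policy is meant to avoid.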

HOW IT WORKS

Reformulate Robot Policy as a Denoising Process

Instead of asking "what action should the robot take given this image?", Diffusion Policy flips the question: "starting from random noise, what sequence of actions best fits this observed scene?" During training, the model learns to gradually denoise noisy action sequences conditioned on visual input, which amounts to learning the gradient (score function) of the action distribution. This is borrowed directly from how diffusion models generate images, but here it is applied to action sequences. The key benefit is that this naturally captures multimodality: the denoising process can represent multiple peaks in the action distribution and settle into one of them during inference. For tasks with multiple solutions (like sauce spreading), this means the policy learns all valid strategies and commits to one of them on each rollout.
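
In symbols, the training objective and a single denoising step are commonly written as below. This is a sketch in DDPM-style notation; the symbols are introduced here for illustration and are not defined elsewhere on this page: O_t is the observation, A_t^k is the action sequence at denoising iteration k (A_t^0 is a clean demonstration), ε_θ is the noise-prediction network, ε^k is sampled Gaussian noise, and α, γ, σ are constants set by the noise schedule.

```latex
\mathcal{L} = \mathrm{MSE}\!\left(\varepsilon^{k},\; \varepsilon_\theta\!\left(O_t,\; A_t^{0} + \varepsilon^{k},\; k\right)\right)

A_t^{k-1} = \alpha\left(A_t^{k} - \gamma\,\varepsilon_\theta\!\left(O_t,\, A_t^{k},\, k\right) + \mathcal{N}\!\left(0,\, \sigma^{2} I\right)\right)
```

Training corrupts a demonstrated action sequence with noise and asks the network to predict that noise; inference starts from pure Gaussian noise and applies the second line repeatedly until k reaches 0.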

Time-Series Diffusion Transformer Architecture

The authors designed a specialized transformer that operates on action sequences rather than single actions. It processes the entire planned action sequence (typically 16 steps into the future) as a time series, allowing the model to maintain smooth, physically plausible motion. Each transformer block incorporates visual conditioning: the image is embedded and cross-attended at every step. This is critical because robots need to react continuously to what they see. The time-series formulation matters because it encourages temporally consistent predictions; unlike methods that predict one action at a time, this predicts a coherent plan and can use Langevin dynamics (a technique from physics) to iteratively refine it during inference.
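
To make the Langevin-style refinement concrete, here is a minimal sketch on a toy problem. It is not the paper's sampler: it assumes a 1-D action with two valid modes (-1 and +1) and uses the closed-form score of that toy distribution in place of a learned network. The names `score`, `sample_action`, `DATA_STD`, and the noise levels are all hypothetical choices for illustration.

```python
import math
import random

random.seed(0)

DATA_STD = 0.05  # assumed spread of demonstrated actions around each mode

def score(x: float, noise_std: float) -> float:
    """Exact d/dx log p(x) for an equal mixture of N(-1, s^2) and N(+1, s^2),
    where s^2 = DATA_STD^2 + noise_std^2 (the data blurred by Gaussian noise)."""
    s2 = DATA_STD ** 2 + noise_std ** 2
    # tanh(x / s2) is the posterior mean of the mode centre (+1 or -1) given x.
    return (math.tanh(x / s2) - x) / s2

def sample_action(noise_levels=(1.0, 0.5, 0.25, 0.1, 0.05, 0.02), steps_per_level=50):
    """Annealed Langevin dynamics: start from noise and follow the score
    while the noise level is lowered step by step."""
    x = random.gauss(0.0, 1.0)
    for noise_std in noise_levels:
        s2 = DATA_STD ** 2 + noise_std ** 2
        step = 0.2 * s2  # step size scaled to the current noise level
        for _ in range(steps_per_level):
            x += 0.5 * step * score(x, noise_std) + math.sqrt(step) * random.gauss(0.0, 1.0)
    return x

samples = [sample_action() for _ in range(200)]
left = sum(1 for a in samples if a < 0)
right = len(samples) - left
worst = max(abs(abs(a) - 1.0) for a in samples)
print(f"left-mode samples: {left}, right-mode samples: {right}")
print(f"largest distance from a mode: {worst:.3f}")
```

Each sample ends up close to one of the two modes rather than in between, and both modes appear across samples, which is the behavior the text attributes to the denoising process.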

Receding Horizon Control for Real-World Execution

During real operation, the system doesn't commit to the full 16-step plan. Instead, it executes only the first few actions, then re-plans based on the new visual observation. This is the receding horizon technique, a proven control strategy that makes policies robust to mistakes and disturbances. If a human bumps the robot or an object shifts slightly, the next re-plan corrects course. The paper shows this is essential for real-world success: the Push-T videos demonstrate the policy remaining robust against hand occlusions and external perturbations because of continuous re-planning. This design choice bridges the gap between offline training and online execution.
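
The control loop described above can be sketched as follows. Everything here is a hypothetical stand-in: `plan_actions` replaces the learned policy with a simple move-toward-goal rule on a 1-D position, and the choice of executing 8 of the 16 planned actions is an assumed value, since the text only says "the first few".

```python
from typing import List

HORIZON = 16        # actions predicted per plan (from the text above)
EXECUTE_STEPS = 8   # actions executed before re-planning (assumed value)
MAX_STEP = 0.1      # hypothetical per-step motion limit

def plan_actions(position: float, goal: float, horizon: int = HORIZON) -> List[float]:
    """Hypothetical stand-in for the learned policy: returns `horizon` displacement
    commands that move from `position` toward `goal`, each clipped to MAX_STEP."""
    actions = []
    for _ in range(horizon):
        delta = max(-MAX_STEP, min(MAX_STEP, goal - position))
        actions.append(delta)
        position += delta
    return actions

def run_episode(goal: float = 1.0, disturbance_at: int = 12,
                disturbance: float = -0.5, max_steps: int = 64) -> float:
    """Plan HORIZON actions, execute the first EXECUTE_STEPS, observe, re-plan."""
    position, t = 0.0, 0
    while t < max_steps:
        plan = plan_actions(position, goal)      # re-plan from the latest observation
        for action in plan[:EXECUTE_STEPS]:      # commit only to a prefix of the plan
            position += action
            if t == disturbance_at:              # something bumps the robot mid-plan
                position += disturbance
            t += 1
    return position

final_position = run_episode()
print(f"final position: {final_position:.3f} (goal 1.0)")
```

The disturbance lands in the middle of an executed chunk, so the robot drifts off target until the next re-plan, at which point the new plan starts from the observed position and recovers.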

Visual Conditioning and Perception Integration

The policy doesn't operate in some abstract feature space: it conditions directly on RGB images from the robot's camera. At both training and inference time, a visual encoder (for example, a pre-trained vision model such as R3M) extracts features that the diffusion model uses to condition denoising. This end-to-end visual grounding is crucial for sim-to-real transfer and generalization. The project page highlights real-world successes on Push-T (pushing blocks precisely), Mug Flipping (complex 6-DOF manipulation with orientation constraints), and Sauce Pouring (fluid handling with periodic motions), all learned from visual observations alone, with no depth sensors required.

MORE DEMONSTRATIONS

[Videos: simulation benchmarks (lift, can, square, tool hang, transport, Push-T, block push, kitchen); real-world Push-T rollouts (Diffusion Policy, R3M, BC-RNN, IBC) and robustness tests; mug flipping and sauce pouring/spreading, Diffusion Policy vs. BC-RNN]

KEY RESULTS

Average Success Rate Improvement Across 12 Tasks: 46.9%

vs. prior state-of-the-art methods (LSTM-GMM, IBC, BET, Transformer-BC)

This is not a marginal improvement. As an illustration of what a 46.9% relative improvement means, a baseline that succeeds 60% of the time would rise to roughly 88%. The benchmarks span 4 different environments (Robomimic, Implicit Behavioral Cloning tasks, Behavior Transformer tasks, and Franka Kitchen), so this isn't a lucky win on one benchmark. This suggests diffusion's multimodality handling is addressing fundamental problems that prior methods couldn't.

Real-World Task Success: Push-T (Precise Block Pushing): 100% (end-to-end)

vs. LSTM-GMM failure mode (stuck near block) and IBC failure mode (premature end-zone entry)

Real-world success is the ultimate test in robotics. Push-T is deceptively hard: it requires precise pushing in confined spaces, exactly where small errors compound. The paper shows Diffusion Policy succeeds end-to-end while competitors get stuck or commit errors. The videos are compelling: the policy survives hand occlusion, external perturbations during pushing, and perturbations during the finishing phase. This is strong evidence that receding horizon control and visuomotor policy learning work on physical hardware.

Multimodal Task Performance: Mug Flipping and Sauce Handling: Successful complex 6-DOF manipulation with periodic actions

vs. LSTM-GMM (shown failing in project page videos)

Mug flipping requires the robot to pick up a mug at a random location, flip it upside-down, and rotate it so the handle points left. This has multiple valid approaches and requires operating accurately near the arm's kinematic limits. Sauce pouring and spreading requires dipping, approach, periodic spreading motions, and precise liquid handling. These tasks have high action-space dimensionality (6 degrees of freedom) and require learning multimodal strategies (different ways to flip a mug depending on its starting orientation). Diffusion Policy handles these gracefully; the comparison videos show LSTM-GMM failing, suggesting the advantage is real and crucial for complex manipulation.

Training Stability: Demonstrated across diverse tasks without reported divergence

vs. prior methods requiring careful hyperparameter tuning for high-dimensional action spaces

Diffusion formulations naturally regularize the learned distribution through the denoising objective. The paper notes this yields "impressive stability," which matters because unstable training means wasted compute and failed experiments. For a developer, this means fewer hyperparameter searches and more reliable model convergence, a practical win that compounds across projects.

WHY DEVELOPERS SHOULD CARE

This paper changes what's possible in robot learning. Before Diffusion Policy, if you wanted a robot to learn from demonstrations, you faced a hard choice: use behavior cloning (biased, mode-averaging) or use reinforcement learning (slow, sample-inefficient, hard to get right). Diffusion Policy offers a third path that combines the sample efficiency of imitation learning with the multimodality handling of generative models. For developers building production systems, this means you can learn richer, more flexible policies from the same amount of data. The receding horizon layer is especially important: it makes learned policies robust to real-world disturbances without requiring explicit uncertainty quantification or risk-aware planning. You can train in simulation, deploy on real hardware with continuous re-planning, and the policy will adapt. The architectural innovations (time-series diffusion transformer, visual conditioning) are also teachable patterns you can adapt to new tasks. The project page shows this works across very different tasks (pushing, flipping, pouring, spreading), suggesting diffusion isn't a one-trick pony. If you're building a robotics platform, this paper's code and data release means you can implement and iterate on these ideas immediately. The 46.9% improvement isn't just a number: it's the difference between a system that works most of the time and one that solves the task reliably.

LIMITATIONS

The paper doesn't deeply explore failure modes or fundamental limitations. One implicit limitation: all tasks shown are manipulation-focused, and generalization to other domains is untested. Diffusion requires iterative denoising steps (multiple passes through the model per plan), which is slower than single-shot methods. This matters for real-time control, where latency is critical. The real-world experiments, while compelling, are still limited: Push-T, Mug Flipping, and Sauce tasks represent 3 real-world domains, whereas the simulated benchmarks span 12 tasks. Real-world generalization (sim-to-real transfer) is briefly mentioned but not thoroughly evaluated. Questions remain about how much demonstration data is needed, how the method scales to new objects or unseen lighting conditions, and whether visual pre-training (R3M) is strictly necessary or whether training from scratch works. The paper also doesn't discuss computational cost during training or inference; diffusion models are generally more expensive than their discriminative counterparts. Finally, the reliance on pre-trained vision models (R3M) for real-world success hints at a dependency on good feature learning that may not always be available for novel robots or domains.
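
The latency concern can be checked with simple arithmetic: a new plan must be ready before the previously executed actions run out. The sketch below uses hypothetical numbers throughout (per-pass time, control rate, and step counts are not taken from the paper), and `planning_fits_budget` is an illustrative helper, not part of any released code.

```python
def planning_fits_budget(denoising_steps: int, seconds_per_pass: float,
                         executed_actions: int, control_hz: float) -> bool:
    """True if one full denoising run finishes before the executed actions run out."""
    planning_time = denoising_steps * seconds_per_pass
    execution_time = executed_actions / control_hz
    return planning_time <= execution_time

# All numbers below are hypothetical, chosen only to illustrate the arithmetic.
many_steps_ok = planning_fits_budget(100, 0.010, 8, 10.0)  # 1.0 s to plan vs 0.8 s of actions
few_steps_ok = planning_fits_budget(10, 0.010, 8, 10.0)    # 0.1 s to plan vs 0.8 s of actions
print(many_steps_ok, few_steps_ok)
```

Under these assumed numbers, cutting the number of denoising passes is what brings planning inside the execution window, which is why the step count is a key deployment knob.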

WHAT COMES NEXT

The immediate direction is clear from the paper's own hints: scaling to longer-horizon tasks (beyond 16-step windows), experimenting with different diffusion schedules and noise levels to trade off flexibility vs. commitment, and testing on robots beyond Franka arms (the real-world experiments use specific hardware). Longer term, the field will likely explore hybrid approaches combining diffusion policies with reinforcement learning, enabling both imitation and optimization. Model-based extensions (learning world models and planning with diffusion) are natural follow-ups. The foundation is set for diffusion to become the standard approach to visuomotor policy learning, similar to how it displaced GANs in image generation. The open-sourcing of code, data, and notebooks means the community can iterate rapidly: expect ablations on architectural choices, scaling laws for sequence length and image resolution, and applications to underexplored domains such as soft robotics. The key technical insight, that diffusion's score-matching objective naturally handles distribution multimodality, will likely influence how future generative models are applied to sequential decision-making problems beyond robotics.
