LEARNING | FOUNDATIONAL | 2019-09-25

Good Robot!: Efficient Reinforcement Learning for Multi-Step Visual Tasks with Sim to Real Transfer

Andrew Hundt, Benjamin Killeen, Nicholas Greene, Hongtao Wu, Heeyeon Kwon, Chris Paxton, Gregory D. Hager

ARCHITECTURE: RL policy

ROBOT: not specified in abstract

KEY METRIC: 100% success rate (4-cube stacking)

TASK: manipulation, stacking, assembly

Imagine a robot learning to stack blocks, a task that requires many precise steps in sequence. Standard reinforcement learning (RL) approaches struggle here: they waste time exploring useless actions and easily undo progress they have already made. The SPOT framework addresses this by teaching robots to explore only within safe zones, learn about unsafe actions without executing them, and prioritize experiences in which earlier progress was reversed. The reported results are strong: success improved from 13% to 100% when stacking 4 cubes, training took just 1-20k actions (roughly 10 minutes to an hour of simulation time), and, most impressively, the policy transferred directly from simulation to real robots with zero fine-tuning, achieving 100% success on physical stacking tasks. For developers, this is presented as the first time RL has bridged the sim-to-real gap for complex, long-horizon tasks.

THE PROBLEM

Before SPOT, RL agents struggled with multi-step tasks. A robot learning to stack blocks had to explore an enormous space of possible arm movements, and most led nowhere: the robot would push a cube the wrong way, undo previous progress, and start over. Baseline approaches achieved only a 13% success rate on 4-cube stacking and wasted 30% or more of their actions on inefficient movements. The core issue: standard RL algorithms treat all experiences equally, so an agent spends as much time learning what not to do as learning what to do. Worse, the gap between simulation and real-world physics meant that even successful simulated policies would fail on real robots because of minor physics differences, requiring expensive real-world fine-tuning with thousands of trials.

HOW IT WORKS

Action Safety Zones

Instead of letting the robot try any movement, SPOT constrains the action space to 'safe zones': regions where the arm can move without knocking things over or hitting the table. The key point is that this restricts what the agent explores, not what it can learn. The safety zones are defined by the task designer (e.g., 'don't move down into the pile of cubes you're stacking'). By eliminating obviously bad actions upfront, the agent spends roughly 10x more of its experience on promising movements. This reduces the search space from millions of possibilities to hundreds.
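To make the idea of restricting exploration concrete, here is a minimal sketch of epsilon-greedy action selection over a masked action set. It assumes a small discrete action space and a boolean `safe_mask` supplied by the task designer; the function name `select_safe_action`, the toy Q-values, and the mask are all hypothetical illustrations, not the paper's code.

```python
import numpy as np

def select_safe_action(q_values, safe_mask, epsilon, rng):
    """Epsilon-greedy selection restricted to actions flagged as safe."""
    safe_indices = np.flatnonzero(safe_mask)
    if safe_indices.size == 0:
        raise ValueError("no safe action available")
    if rng.random() < epsilon:
        # Explore, but only inside the safe zone.
        return int(rng.choice(safe_indices))
    # Exploit: hide unsafe actions from argmax by setting their value to -inf.
    masked_q = np.where(safe_mask, q_values, -np.inf)
    return int(np.argmax(masked_q))

rng = np.random.default_rng(0)
q = np.array([0.9, 0.2, 0.7, 0.4])            # toy value estimates (assumed)
mask = np.array([False, True, True, False])   # assumed: actions 0 and 3 are unsafe

# Action 0 has the highest value, but it is unsafe, so the greedy pick is action 2.
greedy = select_safe_action(q, mask, epsilon=0.0, rng=rng)
```

Note that the mask only changes which actions get tried; the value estimates for every action remain available to the learner.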

Unsafe Region Learning

The agent doesn't ignore dangerous actions; it learns from them without executing them. When SPOT encounters an action that would violate safety constraints, it still updates its neural network based on what *would have happened* if the robot had tried it. This is like learning that touching a hot stove is bad by reading about it, not by burning yourself. The algorithm uses relaxation and auxiliary loss functions to penalize unsafe actions during learning, dramatically reducing wasted real-world actions.
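One way to picture learning about unsafe actions without executing them is an auxiliary loss term that pushes the predicted value of every masked action toward zero. The sketch below is an assumption-laden illustration, not the paper's exact formulation: `masked_value_loss`, the zero target for unsafe actions, and `unsafe_weight` are all hypothetical.

```python
import numpy as np

def masked_value_loss(q_pred, executed_action, td_target, safe_mask, unsafe_weight=1.0):
    """Squared-error loss on the executed action plus a penalty on unsafe actions.

    Assumption: unsafe actions are assigned a target value of zero, so the
    network learns they are not worth taking even though they were never run.
    """
    q_pred = np.asarray(q_pred, dtype=float)
    safe_mask = np.asarray(safe_mask, dtype=bool)
    # Ordinary learning signal from the action the robot actually executed.
    td_loss = (q_pred[executed_action] - td_target) ** 2
    # Auxiliary signal: drive the value of every unsafe action toward zero.
    unsafe_loss = np.sum(q_pred[~safe_mask] ** 2)
    return float(td_loss + unsafe_weight * unsafe_loss)

loss = masked_value_loss(
    q_pred=[0.5, 0.8, 0.1],
    executed_action=0,
    td_target=1.0,
    safe_mask=[True, False, True],  # assumed: action 1 is unsafe
)
```

In this toy case the loss has two parts: 0.25 from the executed action and about 0.64 from the over-valued unsafe action, which is the part that requires no robot time.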

Progress Reversal Prioritization

Most RL agents sample experiences uniformly at random from a replay memory. SPOT instead prioritizes experiences where the robot *undoes previous progress*, like knocking over a partially built tower. This is counterintuitive but effective: these 'failure' moments are the hardest to learn from and the most critical to get right. By seeing reversals 5-10x more often in training, the agent learns to avoid setbacks. The algorithm tracks progress (e.g., 'how many blocks are stacked?') and weights experiences that decrease progress much higher than routine successes.
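The weighting described above can be sketched as non-uniform sampling from a replay buffer. This is a simplified illustration under stated assumptions: the `Transition` record, the `sampling_probabilities` helper, and the weight of 5 are hypothetical, and the paper's actual prioritization scheme may differ.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Transition:
    progress_before: int  # e.g. number of blocks stacked before the action
    progress_after: int   # number of blocks stacked after the action

def sampling_probabilities(buffer, reversal_weight=5.0):
    """Give transitions that reduce progress a larger share of replay samples."""
    weights = np.array([
        reversal_weight if t.progress_after < t.progress_before else 1.0
        for t in buffer
    ])
    return weights / weights.sum()

buffer = [
    Transition(0, 1),
    Transition(1, 2),
    Transition(2, 0),  # the tower was knocked over: a progress reversal
    Transition(0, 1),
]
probs = sampling_probabilities(buffer)

# Draw a training batch of buffer indices using the reversal-aware weights.
rng = np.random.default_rng(0)
batch = rng.choice(len(buffer), size=2, p=probs)
```

With these toy values the reversal is drawn with probability 5/8, while each routine transition is drawn with probability 1/8.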

Sim-to-Real Transfer via Domain Randomization

Training in simulation is 100x faster than on a real robot, but simulated physics never perfectly match reality. SPOT uses heavy domain randomization: during training, it randomly varies block colors, friction coefficients, camera positions, and object shapes. When the policy trains across thousands of these variations, it learns features that work in the real world because it has already learned to be robust to slight differences. The key result: SPOT achieves 100% real-world success *without any real-world fine-tuning*, loading the simulation-trained model directly onto hardware.
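As a rough illustration of what randomizing a simulated scene per episode can look like, the sketch below samples a fresh configuration each time. Everything here is assumed for illustration: `SimConfig`, its fields, and the numeric ranges are hypothetical and do not correspond to any particular simulator's API or to the paper's settings.

```python
import random
from dataclasses import dataclass

@dataclass
class SimConfig:
    block_rgb: tuple        # block color, each channel in [0, 1)
    friction: float         # friction coefficient (assumed range)
    camera_offset_m: tuple  # small camera displacement in meters (assumed range)

def sample_sim_config(rng):
    """Draw one randomized scene configuration for a training episode."""
    return SimConfig(
        block_rgb=tuple(rng.random() for _ in range(3)),
        friction=rng.uniform(0.4, 1.0),
        camera_offset_m=tuple(rng.uniform(-0.02, 0.02) for _ in range(3)),
    )

rng = random.Random(0)
# One configuration per episode, so the policy never sees the same scene twice.
configs = [sample_sim_config(rng) for _ in range(1000)]
```

The design idea is that the real world should look like just one more variation inside the range the policy has already seen.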

KEY RESULTS

Stacking 4 cubes - success rate: 100%

vs. 13% baseline

This is the headline result. Going from roughly 1-in-8 success to perfect success is about a 7.7x improvement (100 / 13 ≈ 7.7). For a manufacturing robot, this is the difference between a system you can deploy and a system that's useless.

Training efficiency: 1-20k actions to convergence

vs. millions for baseline RL

Training in 20,000 actions means a real robot could learn this in 2-3 hours of wall-clock time. Most RL papers require weeks. This is why it matters: you can now iterate on tasks in a day instead of a month.

Real-world stacking - success rate with direct transfer: 100%

vs. ~0% for standard sim-to-real approaches without fine-tuning

This is a significant result. Before SPOT, you'd expect the real robot to fail most of the time when given a policy trained only in simulation. Achieving 100% without any real-world fine-tuning is the kind of result that lets companies deploy RL systems without building expensive real-world datasets.

Action efficiency in real-world stacking: 61%

vs. typical inefficiency of 30%+ wasted actions

This means 61% of the robot's actions directly contribute to progress; the rest are corrections, stabilizations, and overhead. That is tight for a learned policy. For real manufacturing, this translates to faster cycle times and less wear on hardware.
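The efficiency figure can be read as a simple ratio. The sketch below follows the interpretation given in the text above (actions that contribute to progress divided by all actions taken); the function name `action_efficiency` and the counts of 61 and 100 are illustrative assumptions, and the paper may define the metric differently.

```python
def action_efficiency(progress_actions, total_actions):
    """Fraction of executed actions that directly contributed to task progress."""
    if total_actions <= 0:
        raise ValueError("total_actions must be positive")
    if not 0 <= progress_actions <= total_actions:
        raise ValueError("progress_actions must be between 0 and total_actions")
    return progress_actions / total_actions

# Assumed example: 61 useful actions out of 100 executed.
efficiency = action_efficiency(61, 100)
wasted_fraction = 1.0 - efficiency
```

Under this reading, 61% efficiency leaves about 39% of actions spent on corrections and overhead.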

WHY DEVELOPERS SHOULD CARE

For a developer building robotics software, SPOT lowers two major barriers: training time and sim-to-real transfer. Traditionally, teaching a robot a complex task meant running it on real hardware for weeks while it slowly learned. SPOT lets you train in simulation in hours, then deploy directly. This means startups can compete with large labs: you don't need a warehouse of robots anymore, just a good simulator and this algorithm. The priority sampling insight (learning from reversals) is broadly applicable: any task where progress can be undone (for example, tasks involving obstacles) can benefit from this approach. The safety zone concept is equally powerful: it's a bridge between unconstrained RL and constrained optimization, giving you a way to encode domain knowledge without building a rigid hand-coded policy. Most importantly, SPOT shows that long-horizon tasks with sim-to-real transfer are achievable with the right algorithm.

LIMITATIONS

SPOT requires manual definition of safety zones, which means domain expertise. A developer can't just apply this to arbitrary tasks; you need to think about which arm movements are geometrically safe. The approach is also tested primarily on tabletop manipulation with rigid objects (cube stacking, toy clearing). Tasks with deformable objects (cloth, rope), dynamic environments (moving obstacles), or safety constraints that are genuinely hard to specify (manipulation in clutter) are untested. The paper doesn't deeply explore what happens when the real-world domain shift is larger (different table height, robot design, lighting conditions). While 100% success is reported, the 61% efficiency means 39% of actions are still wasted, which is substantial compared to expert human performance. Additionally, the approach requires a good progress metric (how many cubes are stacked?), which may not exist for all tasks.

WHAT COMES NEXT

The next frontier is generalizing SPOT beyond tabletop tasks. Can it handle manipulation in clutter, where safety zones overlap and interact? Can it work with vision-based policies that learn features rather than relying on hand-engineered representations? A natural extension is combining SPOT with meta-learning or transfer learning: train in simulation on 100 tasks, then adapt to new tasks with minimal real-world data. There's also room to automate the safety zone definition using computer vision or constraints learned from demonstrations. Finally, scaling to humanoid robots or multi-arm systems would test whether progress reversal prioritization generalizes beyond simple stacking, and whether the approach remains sample-efficient when the action space grows to thousands of dimensions.
