COMPUTER-VISION | FOUNDATIONAL | 2020-04-08

CURL: Contrastive Unsupervised Representations for Reinforcement Learning

Aravind Srinivas, Michael Laskin, Pieter Abbeel

ARCHITECTURE: contrastive learning with off-policy RL
KEY METRIC: 1.9x
TASK: reinforcement learning, control

CURL tackles one of the most frustrating problems in robot learning: teaching robots complex control tasks from camera images is extremely sample-inefficient. This paper shows that by borrowing ideas from self-supervised learning (specifically contrastive learning), you can extract meaningful visual features without any labels and train a standard reinforcement learning (RL) algorithm on top. The result: on the DeepMind Control Suite benchmark, CURL achieves 1.9x the performance of previous pixel-based methods at the same 100,000 environment steps, almost twice as good with the same amount of data. Even more impressively, CURL narrows the gap with methods that use perfect state information (like exact joint angles), something image-based methods have historically struggled with. For robotics developers, this matters because it means you can build vision-based robot controllers that learn efficiently enough to be practical.

ARCHITECTURE

THE PROBLEM

Before CURL, learning robot control from raw camera pixels was a messy compromise. Model-free methods like SAC (Soft Actor-Critic) could learn good policies but needed enormous amounts of data, often millions of environment steps, to reach reasonable performance. Model-based approaches tried to build a world model from pixels first, but that is even harder: you are asking the neural network to predict future pixel values, which is notoriously difficult and sample-inefficient. The core issue is that standard neural networks don't know how to extract useful visual features for control unless they are explicitly told what matters (i.e., given supervised labels). Meanwhile, methods that use privileged information like perfect state observations (no vision) learn 1.5-2x faster. This creates a practical wall: if you want a vision-based robot, you're stuck doing expensive manual engineering or accepting massive data requirements.

HOW IT WORKS

1

Extract features using contrastive learning

CURL borrows from the self-supervised learning world: instead of predicting pixels or having someone label 'this is a mug,' the algorithm learns features by solving a contrastive task. The idea is elegant: take an observation, create two slightly different views of it (CURL uses random crops), and train the network so that these two views produce similar feature vectors while differing from the features of the other observations in the batch. This forces the network to learn high-level concepts (like object shapes and positions) rather than low-level noise. Why this works: the network must capture what is invariant across the two views, which tends to be the visual information a robot needs for control. This is completely unsupervised; no human ever labels anything.
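The contrastive task described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the image sizes (100x100 cropped to 84x84), the 50-dimensional features, the bilinear similarity matrix `W`, and the random linear projection standing in for a real convolutional encoder are all assumptions made for the example.

```python
import numpy as np

def random_crop(images, out_size, rng):
    """Crop each image in a (B, C, H, W) batch at its own random offset."""
    b, c, h, w = images.shape
    tops = rng.integers(0, h - out_size + 1, size=b)
    lefts = rng.integers(0, w - out_size + 1, size=b)
    return np.stack([
        img[:, t:t + out_size, l:l + out_size]
        for img, t, l in zip(images, tops, lefts)
    ])

def info_nce_loss(queries, keys, W):
    """Contrastive loss: row i of `queries` should match row i of `keys`.

    queries, keys: (B, D) features from two views of the same B observations.
    W: (D, D) matrix for the bilinear similarity q^T W k (an assumption here).
    """
    logits = queries @ W @ keys.T                         # (B, B), diagonal = positives
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # cross-entropy on the diagonal

# Usage with a stand-in encoder (a fixed random projection, purely illustrative).
rng = np.random.default_rng(0)
images = rng.random((8, 3, 100, 100))
view_a = random_crop(images, 84, rng)
view_b = random_crop(images, 84, rng)
proj = rng.standard_normal((3 * 84 * 84, 50)) / np.sqrt(3 * 84 * 84)

def encode(x):
    return x.reshape(len(x), -1) @ proj

loss = info_nce_loss(encode(view_a), encode(view_b), np.eye(50))
```

Each row of `logits` is treated as a classification problem over the batch: the matching view is the correct class and every other observation in the batch serves as a negative.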

2

Stack frames for temporal context

A single frame isn't enough for temporal reasoning. CURL stacks the last few frames (for example, 4) at the pixel level and feeds the whole stack to the encoder, creating a short temporal window. This gives the downstream RL algorithm information about motion and velocity without explicitly computing it. This is a small but important detail: the contrastive task treats each frame stack as a single instance, and the same random crop is applied to every frame in the stack so that the temporal structure is preserved. Stacking therefore happens before feature extraction, and the learned features already carry the temporal context the control algorithm needs.
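The sliding temporal window described above can be sketched as a small buffer. This is an illustrative helper, not code from the paper: the class name `FrameStack`, the window size, and the array shapes are assumptions, and the same buffer works whether the items are raw frames or per-frame arrays of any other kind.

```python
from collections import deque

import numpy as np

class FrameStack:
    """Keeps the last `k` frames and returns them concatenated along axis 0."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Fill the buffer with the first frame so the stack is full at t = 0.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, frame):
        # The deque drops the oldest frame automatically once it is full.
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        return np.concatenate(list(self.frames), axis=0)  # shape (k * C, H, W)

# Usage: a 3-frame window over (3, 84, 84) frames.
stack = FrameStack(k=3)
obs0 = stack.reset(np.zeros((3, 84, 84)))
obs1 = stack.step(np.ones((3, 84, 84)))
```

After one step, the newest frame occupies the last three channels of `obs1` and the two older copies of the first frame occupy the rest.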

3

Learn control with off-policy RL on top of the learned features

CURL applies SAC (Soft Actor-Critic), a standard off-policy RL algorithm, on top of the encoder's feature representation. The actor (policy) and critic (value estimator) operate in the lower-dimensional feature space rather than the high-dimensional pixel space. This is where the sample efficiency comes from: the policy only has to learn in a compact, meaningful representation. Importantly, contrastive learning is not a separate pre-training stage with a frozen encoder; it runs as an auxiliary objective alongside RL from the start, so good features emerge quickly instead of waiting for the RL loss to slowly sculpt them.

4

Train the encoder with both signals

The encoder is trained end-to-end, so the contrastive objective and the RL loss both shape the representations. This ensures features don't just capture generic visual concepts but are optimized specifically for the task. This joint optimization is what pushes CURL ahead of baselines that rely on the RL loss alone or on fixed, separately trained features.
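To make the size gap between pixel space and feature space concrete, here is a short arithmetic sketch. The specific numbers (three stacked 84x84 RGB frames, a 50-dimensional feature vector) are assumptions chosen for illustration and are not stated in the text above.

```python
def pixel_vs_feature_dims(frames_stacked, channels, height, width, feature_dim):
    """Return (pixel-space size, feature-space size, compression ratio)."""
    pixel_dim = frames_stacked * channels * height * width
    return pixel_dim, feature_dim, pixel_dim / feature_dim

# Illustrative sizes (assumptions, see the paragraph above).
pixel_dim, feature_dim, ratio = pixel_vs_feature_dims(
    frames_stacked=3, channels=3, height=84, width=84, feature_dim=50
)
print(pixel_dim)      # 63504 input values per observation
print(round(ratio))   # 1270, i.e. about three orders of magnitude smaller
```

Under these assumptions the actor and critic see an input roughly a thousand times smaller than the raw observation, which is the practical meaning of "learning in a compact representation."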

KEY RESULTS

DeepMind Control Suite performance at 100K steps: 1.9x

vs. prior pixel-based RL methods (e.g., PlaNet, Dreamer, SAC+AE, SLAC)

This is the headline result. On control tasks like 'Cheetah Run' and 'Walker Walk,' CURL achieves nearly double the performance of previous vision-based methods with the same amount of environment data. 100K steps is still a relatively small sample budget in robotics, so this efficiency matters for real systems where data is expensive.

Atari performance at 100K steps: 1.2x

vs. prior pixel-based RL methods

The gains are smaller but still meaningful on Atari, a different domain with different visual properties. This suggests the method generalizes beyond continuous control. Note that 1.2x means a roughly 20% higher score at the same 100K-step budget; it does not by itself mean that 20% less data is needed to reach the same score.

Gap to state-based methods: ~90%

vs. state-based SAC, which uses ground-truth state observations (CURL nearly matches it)

This is the clincher. Historically, vision-based RL has been 40-60% worse than state-based RL even with unlimited data. CURL closes that gap dramatically; on some tasks, the performance difference is negligible. For developers, this means you can use vision without paying a massive penalty.

Learning stability: lower variance across seeds

vs. DrQ and other baselines

Beyond raw performance, CURL is more stable: the standard deviation across different random seeds is smaller. This matters for deployment because it means the method is more reliable and less dependent on lucky initialization.

PERFORMANCE COMPARISON

WHY DEVELOPERS SHOULD CARE

For a robotics developer, CURL matters because it gives representation learning its own training signal instead of leaving visual features entirely to the RL loss. Before, the reward signal had to teach the network both what to see and how to act, and the two problems interfered with each other. CURL says: 'Let the network learn what matters in the visual world using contrastive learning (which is data-efficient), and learn control on top of those features.' This modular approach is powerful because contrastive learning is now well understood and pre-trained models are available. You could potentially use CURL with a frozen pre-trained feature extractor from another domain and fine-tune only the policy, saving even more data. The concrete takeaway: if you're building a vision-based robot controller, stop relying on the RL loss alone to learn your visual features, and add a self-supervised objective. CURL shows this works and scales. The method also works with standard RL algorithms (SAC for continuous control, and a Rainbow DQN variant for Atari), so you don't need exotic new learning algorithms; solid engineering on the feature extraction side does the trick.

LIMITATIONS

CURL still requires 100K environment steps for complex continuous control, which is feasible in simulation but expensive on real robots (though researchers have adapted it for real hardware since publication). The method assumes you can generate multiple augmented views of the same observation (crops, color jitter, etc.), which works well for image observations but would need rethinking for other sensory modalities. The paper doesn't address how CURL scales to much higher-dimensional observations (like 480p video or multi-camera systems). Additionally, the contrastive learning component adds hyperparameter tuning overhead: you need to tune the temperature parameter, augmentation strategies, and batch sizes carefully, which isn't trivial. Finally, CURL was evaluated primarily on relatively clean, well-lit simulation environments (MuJoCo-based); real-world robustness under shadows, reflections, and clutter is not thoroughly tested.

WHAT COMES NEXT

The natural next step is learning from truly minimal data by combining CURL with model-based RL: use the contrastive features to train a forward model that predicts future features (not pixels), then use that for planning. Recent work has also explored using pre-trained vision models (like ResNets trained on ImageNet) as the feature extractor, which would make CURL even more sample-efficient by transferring knowledge from massive internet-scale datasets. On the robotics side, the logical evolution is deploying CURL on real hardware with real cameras and demonstrating that the efficiency gains hold outside of perfect simulators. Another promising direction is combining CURL with data augmentation techniques specifically designed for control, and exploring whether contrastive learning can handle multiple camera viewpoints simultaneously, which is critical for real robotic systems that rarely have just one camera.
