COMPUTER-VISION · FOUNDATIONAL · 2020-04-08

CURL: Contrastive Unsupervised Representations for Reinforcement Learning

Aravind Srinivas, Michael Laskin, Pieter Abbeel

ARCHITECTURE

contrastive learning with off-policy RL

KEY METRIC

1.9x

TASK

reinforcement learning, control

CURL tackles one of the most frustrating problems in reinforcement learning: teaching robots to perform complex tasks from camera images is extremely sample-inefficient. The paper shows that by borrowing ideas from computer vision (specifically contrastive learning), you can extract meaningful visual features without any labels, then train a standard RL algorithm on top. The result: on the DeepMind Control Suite, CURL achieves 1.9x the performance of previous pixel-based methods at the same 100,000 environment steps, almost twice as good with the same amount of data. CURL also narrows the gap with methods that use perfect state information (like exact joint angles), something image-based methods have historically struggled with. For robotics developers, this matters because it brings vision-based controllers closer to learning efficiently enough to be practical.

ARCHITECTURE

THE PROBLEM

Before CURL, learning from raw camera pixels was a messy compromise. Model-free methods like SAC (Soft Actor-Critic) could learn good policies but needed enormous amounts of data, often millions of steps, to reach reasonable performance. Model-based approaches tried to build a world model from pixels first, but that is even harder: the neural network has to predict future pixel values, which is notoriously difficult and sample-inefficient. The core issue is that standard neural networks do not know how to extract useful visual features for control without being explicitly told what matters (i.e., without supervised labels). Meanwhile, methods that use privileged information like perfect state observations (no vision) learn 1.5-2x faster. This creates a practical wall: if you want a vision-based controller, you are stuck doing expensive manual engineering or accepting massive data requirements.

HOW IT WORKS

Extract features using contrastive learning

CURL borrows from the computer vision world: instead of predicting pixels or having someone label 'this is a mug,' the algorithm learns features by solving a contrastive task. The idea is elegant: take an image, create two slightly different views of it (crop it differently, adjust brightness, etc.), and train the network so that these two views produce similar feature vectors while staying different from the features of other random images in the batch. This forces the network to learn high-level concepts (like object shapes and positions) rather than low-level pixel details. Why this works: the network must capture what is invariant across the two views, which turns out to be the kind of visual information a policy needs for control. This is completely unsupervised: no human ever labels anything.
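The contrastive objective described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' implementation: the function names (`dot`, `info_nce_loss`), the plain dot-product similarity, and the `temperature` default are assumptions chosen for the example, and the feature vectors are random stand-ins for encoder outputs.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(queries, keys, temperature=0.1):
    """Average InfoNCE loss over a batch.

    queries[i] and keys[i] are features of two views of the same image
    (the positive pair); keys[j] with j != i act as negatives for query i.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        # Numerically stable log-sum-exp over every key in the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching key treated as the correct class.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy check with random stand-in features (hypothetical data).
random.seed(0)
feats = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
aligned = info_nce_loss(feats, feats)                    # views match
shuffled = info_nce_loss(feats, feats[1:] + feats[:1])   # views mismatched
print(aligned < shuffled)
```

The loss is lower when each query is paired with its own key than when the pairing is shuffled, which is the signal that pulls two views of the same image together and pushes other images away.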

Stack features in a memory buffer

Raw single-frame features are not enough for temporal reasoning. CURL stacks the features extracted from the last 4 frames (features, not pixels), creating a 4-frame temporal window. This gives the downstream RL algorithm information about motion and velocity without explicitly computing it. This is a small but important detail: contrastive learning is applied to individual frames, while the RL algorithm sees temporal context. The stacking happens after feature extraction, which keeps the contrastive learning simple while giving the RL algorithm what it needs.
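The stacking step as described above can be illustrated with a small rolling buffer. This is a sketch under assumptions, not code from the paper: the class name `FeatureStack`, its methods, and the choice to pad the window by repeating the first frame at episode start are all hypothetical.

```python
from collections import deque


class FeatureStack:
    """Keeps the k most recent per-frame feature vectors and concatenates them."""

    def __init__(self, k=4):
        self.k = k
        self.buffer = deque(maxlen=k)

    def reset(self, first_features):
        # At episode start, fill the window by repeating the first frame's features.
        self.buffer.clear()
        for _ in range(self.k):
            self.buffer.append(list(first_features))
        return self.stacked()

    def push(self, features):
        # Appending to a full deque drops the oldest entry automatically.
        self.buffer.append(list(features))
        return self.stacked()

    def stacked(self):
        # Oldest frame first, newest last, flattened into one vector.
        return [x for frame in self.buffer for x in frame]


stack = FeatureStack(k=4)
obs = stack.reset([0.0, 0.0])
obs = stack.push([1.0, 1.0])
obs = stack.push([2.0, 2.0])
print(obs)  # [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
```

Because consecutive entries differ when the scene changes, a downstream network can infer motion from the stacked vector without an explicit velocity input.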

Learn control with off-policy RL on top of fixed features

Once features are extracted, CURL applies SAC (Soft Actor-Critic), a standard off-policy RL algorithm, on top of the frozen feature representation. The actor (policy) and critic (value estimator) operate in the lower-dimensional feature space rather than the high-dimensional pixel space. This is where the sample efficiency comes from: the agent only has to learn in a compact, meaningful representation. Importantly, CURL keeps the feature extractor frozen initially, then later fine-tunes it end-to-end. This two-stage approach is key: you get good features quickly from contrastive learning without waiting for the RL signal to slowly sculpt them.

Fine-tune features with RL signal

After the initial phase, CURL unfreezes the feature encoder and trains it end-to-end with the RL loss. Now the contrastive objective and the RL objective both shape the representations. This ensures the features do not just capture generic visual concepts but are optimized specifically for the task. This joint optimization is what pushes CURL ahead of baselines that use fixed, pre-trained features.
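One way to read the two phases described above is as a schedule that decides which loss terms reach the encoder. The sketch below is an illustration under assumptions: `encoder_loss`, `warmup_steps`, and `rl_weight` are hypothetical names and values, not settings from the paper, and the losses are plain floats standing in for real training losses.

```python
def encoder_loss(step, contrastive_loss, rl_loss, warmup_steps=1000, rl_weight=1.0):
    """Loss used to update the encoder at a given training step.

    For the first `warmup_steps` updates only the contrastive term reaches
    the encoder, so it is frozen with respect to the RL signal. After that,
    both terms contribute.
    """
    if step < warmup_steps:
        return contrastive_loss
    return contrastive_loss + rl_weight * rl_loss


# Hypothetical loss values, for illustration only.
print(encoder_loss(step=0, contrastive_loss=2.0, rl_loss=3.0))     # 2.0
print(encoder_loss(step=1000, contrastive_loss=2.0, rl_loss=3.0))  # 5.0
```

Setting `warmup_steps=0` corresponds to training the encoder with both objectives from the first update.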

KEY RESULTS

DeepMind Control Suite performance at 100K steps: 1.9x

vs. prior pixel-based RL methods (e.g., PlaNet, Dreamer, SAC+AE, SLAC)

This is the headline result. On complex tasks like 'Humanoid Walk' and 'Quadruped Run,' CURL achieves nearly double the performance of previous vision-based methods with the same amount of data. 100K steps is still a relatively small sample budget in robotics, so this efficiency matters for real systems where data is expensive.

Atari performance at 100K steps: 1.2x

vs. prior pixel-based RL methods

The gains are smaller but still meaningful on Atari, a different domain with different visual properties. This suggests the method generalizes beyond continuous control. A 1.2x improvement means roughly 20% higher scores at the same 100K-step budget.

Gap to state-based methods: ~90%

vs. state-based SAC, which uses ground-truth state observations; CURL nearly matches it

This is the clincher. Historically, vision-based RL has been 40-60% worse than state-based RL even with unlimited data. CURL closes that gap dramatically: on some tasks, the performance difference is negligible. For developers, this means you can use vision without paying a massive penalty.

Learning stability: Lower variance across seeds

vs. prior pixel-based baselines

Beyond raw performance, CURL is more stable: the standard deviation across different random seeds is smaller. This matters in practice because it means the method is more reliable and less dependent on lucky initialization.
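The headline ratios and the stability claim above come down to simple arithmetic over per-seed scores. The sketch below shows how such numbers could be computed. The helper names (`relative_gain`, `seed_spread`) are hypothetical, and the scores are invented for illustration only. They are not results from the paper.

```python
from statistics import mean, stdev


def relative_gain(method_scores, baseline_scores):
    """Ratio of mean scores; 1.9 would mean 1.9x the baseline."""
    return mean(method_scores) / mean(baseline_scores)


def seed_spread(scores):
    """Sample standard deviation of scores across random seeds."""
    return stdev(scores)


# Hypothetical per-seed episode returns, invented purely for illustration.
method = [500.0, 520.0, 480.0]
baseline = [250.0, 300.0, 200.0]

print(relative_gain(method, baseline))              # 2.0
print(seed_spread(method) < seed_spread(baseline))  # True
```

Reporting both numbers together is useful because a high mean with a large spread across seeds can hide runs that fail outright.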

PERFORMANCE COMPARISON

WHY DEVELOPERS SHOULD CARE

For a robotics developer, CURL matters because it decouples two hard problems: learning good visual representations and learning good policies. Before, you had to do both at once, and they interfered with each other. CURL says: 'First, let the network learn what matters in the visual world using contrastive learning (which is data-efficient). Then, learn the policy on top.' This modular approach is powerful because contrastive learning is now well understood and pre-trained models are available. You could potentially use CURL with a frozen pre-trained feature extractor from another domain and fine-tune only the policy, saving even more data. The concrete takeaway: if you are building a vision-based controller, stop trying to learn features and policies jointly in a single end-to-end network. Use contrastive learning first. CURL shows this works and scales. The method also works with standard RL algorithms (SAC in this case), so you do not need exotic new learning algorithms; solid engineering on the feature extraction side does the trick.

LIMITATIONS

CURL still requires 100K steps for complex continuous control, which is feasible in simulation but expensive on real robots (though researchers have adapted it for real hardware since publication). The method assumes you can generate multiple augmented views of the same observation (crops, color jitter, etc.), which works well for image observations but would need rethinking for other sensory modalities. The paper does not address how CURL scales to much higher-dimensional observations (like 480p video or multi-camera systems). Additionally, the contrastive learning component adds hyperparameter tuning overhead: you need to tune the temperature parameter, augmentation strategies, and batch sizes carefully, which is not trivial. Finally, CURL was evaluated primarily on relatively clean, well-lit environments (MuJoCo-based); real-world performance under shadows, reflections, and clutter is not thoroughly tested.
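To see why the temperature mentioned above needs tuning, it helps to look at how it reshapes the softmax over similarity scores. This is a generic illustration, not code or settings from the paper: the `softmax` helper and the similarity values are hypothetical.

```python
import math


def softmax(logits, temperature):
    """Softmax over logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical similarities: the first entry is the positive pair,
# the other two are negatives.
similarities = [0.9, 0.5, 0.1]

sharp = softmax(similarities, temperature=0.1)
soft = softmax(similarities, temperature=1.0)
print(sharp[0] > soft[0])
```

A low temperature concentrates almost all probability on the closest key, while a high temperature flattens the distribution, which changes how strongly negatives are penalized.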

WHAT COMES NEXT

The natural next step is learning from truly minimal data by combining CURL with model-based RL: use the contrastive features to train a forward model that predicts future features (not pixels), then use that model for planning. Recent work has also explored using pre-trained vision models (like ResNets trained on ImageNet) as the feature extractor, which could make CURL even more sample-efficient by transferring knowledge from massive internet-scale datasets. On the robotics side, the logical evolution is deploying CURL on real hardware with real cameras and demonstrating that the efficiency gains hold outside of perfect simulators. Another promising direction is combining CURL with techniques specifically designed for , and exploring whether contrastive learning can handle multiple camera viewpoints simultaneously, which is critical for real robotic systems that rarely have just one camera.

RELATED PAPERS

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

Octo: An Open-Source Generalist Robot Policy