DEXTEROUS-MANIPULATION | CURRENT | 2026-02-18

EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data

Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, Trevor Darrell, Furong Huang, Yuke Zhu, Danfei Xu, Linxi Fan

ARCHITECTURE

VLA (Vision Language Action)

ROBOT

22 DoF dexterous robotic hand

DATASET

20,854 hours of action-labeled egocentric human video

KEY METRIC

54% improvement in average success rate over a no-pretraining baseline

TASK

dexterous manipulation

EgoScale demonstrates that you can teach a dexterous robot hand with 22 degrees of freedom (DoF) to manipulate objects by learning from videos of humans doing the same tasks. The key breakthrough: the researchers trained a Vision-Language-Action (VLA) model on 20,854 hours of egocentric human video, about 20 times more than previous attempts, and discovered a reliable scaling law: more human data consistently means better performance. The final policy achieves a 54% improvement in average success rate on complex tasks like shirt folding and bottle unscrewing compared to a policy trained from scratch with no human pretraining. Why this matters: dexterous manipulation (fine-grained hand control) has always been the hard problem in robotics. Most robots can move arms around, but getting intricate finger movements right is extraordinarily difficult. This work shows you can sidestep years of hand-engineering by scaling up human video data.

THE PROBLEM

Previous attempts at learning from human data worked only in toy domains: simple tasks with constrained objects. Papers on human-to-robot transfer showed promise in controlled lab settings, but nobody had shown that human data could handle the complexity of real dexterous tasks with high-DoF hands. The fundamental doubt was whether humans and robots move differently enough that human video becomes noise rather than signal. Prior work used tiny datasets (often <1,000 hours) and couldn't demonstrate clear scaling laws. Most systems relied on carefully hand-engineered reward functions, tricks, or massive amounts of direct robot interaction, all of which are expensive and brittle.

HOW IT WORKS

Massive Egocentric Video Collection and Labeling

The team collected and action-labeled 20,854 hours of first-person video showing humans performing tasks like folding clothes, handling tools, and manipulating objects. This is not passive video: each frame is labeled with the human's hand pose and intent. The egocentric (first-person) perspective is critical because it matches what a robot's wrist camera sees, creating natural alignment between human and robot viewpoints. Collecting this much data required a dedicated team of operators and sustained effort, but the payoff is a dataset 20× larger than previous work. This scale is what enables the scaling laws they discover.

Vision-Language-Action (VLA) Model with Flow-Based Policy

Rather than a simple action predictor, they built a VLA model: a system that takes in both visual observations and language descriptions of tasks, then predicts actions. The architecture uses a vision-language backbone for perception and instruction understanding and a diffusion-based action expert (DiT) for smooth, realistic motion generation. Actions are represented at the wrist level (in the camera frame) and then retargeted to the robot's specific hand. This architectural choice matters because it creates an embodiment-agnostic motor prior: the learned skills carry over across different hand designs. The flow-based approach generates continuous, physically plausible trajectories rather than discrete, jerky movements.
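To make "flow-based" concrete, here is a minimal sketch of how a flow-style policy can turn noise into an action chunk by integrating a velocity field with Euler steps. It is not the authors' implementation: `sample_action_chunk`, `oracle_velocity`, the horizon of 8, and the action dimension of 4 are all hypothetical, and the closed-form oracle stands in for the learned DiT action expert, which would be conditioned on images and language.

```python
import numpy as np

def sample_action_chunk(velocity_fn, horizon, action_dim, num_steps=10, seed=0):
    """Integrate a velocity field from noise (t=0) to actions (t=1) with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)  # one Euler step along the flow
    return x

# Hypothetical target trajectory; a real policy would never see this directly.
target = np.linspace(0.0, 1.0, 8 * 4).reshape(8, 4)

def oracle_velocity(x, t):
    """Toy stand-in for the learned network: points straight at the target.

    For a straight-line path from noise to target, the remaining displacement
    divided by the remaining time is the exact velocity.
    """
    return (target - x) / (1.0 - t)

chunk = sample_action_chunk(oracle_velocity, horizon=8, action_dim=4)
print(np.abs(chunk - target).max())  # close to zero for the oracle field
```

Because every step moves the whole chunk a little along a continuous field, the output is a smooth trajectory rather than a sequence of independently chosen discrete actions.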

Discovering the Scaling Law and Validation Correlation

They systematically trained models on 1k, 2k, 4k, 10k, and 20k hours of human video and measured validation loss (how well the model predicts actions on held-out video). They found a near-perfect log-linear scaling law (R²=0.9983): validation loss decreases reliably as you add more data. Crucially, they showed that this validation loss on human video directly predicts real robot performance on downstream tasks. This is the holy grail of transfer learning: a metric on the source domain (human videos) that correlates with target performance (actual robot success). Once you know this relationship, you can invest in data collection knowing it will improve performance in a predictable way.
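A minimal sketch of how such a fit can be checked, assuming "log-linear" means validation loss is linear in the logarithm of training hours. The hour values match the scales listed above, but the loss values are synthetic, generated from made-up coefficients (`TRUE_A`, `TRUE_B`); they are not the paper's measurements, and `fit_log_linear` is a hypothetical helper.

```python
import numpy as np

def fit_log_linear(hours, losses):
    """Least-squares fit of loss = a + b * ln(hours); returns (a, b, r_squared)."""
    x = np.log(np.asarray(hours, dtype=float))
    y = np.asarray(losses, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # polyfit returns highest degree first
    residuals = y - (intercept + slope * x)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return intercept, slope, 1.0 - ss_res / ss_tot

hours = np.array([1_000, 2_000, 4_000, 10_000, 20_000], dtype=float)

# Synthetic placeholder losses from made-up coefficients (NOT the paper's data).
TRUE_A, TRUE_B = 0.80, -0.04
synthetic_loss = TRUE_A + TRUE_B * np.log(hours)

a, b, r2 = fit_log_linear(hours, synthetic_loss)
loss_drop_per_doubling = -b * np.log(2.0)  # constant under a log-linear law
print(f"a={a:.3f} b={b:.3f} R^2={r2:.4f} drop per 2x data={loss_drop_per_doubling:.4f}")
```

With real measurements, an R² near 1 on a fit like this is what supports extrapolating to larger datasets; the fit by itself says nothing about robot success, which is why the correlation with downstream performance has to be established separately.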

Two-Stage Transfer: Pretraining + Aligned Mid-Training

Rather than fine-tuning the human-pretrained model directly on robot tasks, they introduced a lightweight mid-training stage. This stage trains on aligned human-robot play data: pairs of videos showing a human and a robot performing the same task (like folding the same towel). This alignment teaches the model to bridge the embodiment gap, that is, how a human hand movement maps to robot motor commands, without requiring massive amounts of robot data. The mid-training is brief and cheap (a small amount of paired human-robot video), yet it proves essential for real robot success. After mid-training, the policy is post-trained on downstream tasks with minimal supervision (sometimes just one robot demonstration per task for one-shot learning).
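The ordering of the stages can be written down as a small data structure. This is only an illustrative sketch: the `Stage` class, the stage names, and `check_recipe` are hypothetical, and no dataset sizes are given for the later stages because the text above does not state them.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    init_from: Optional[str]  # stage whose weights this stage starts from

RECIPE = [
    Stage("pretrain", "action-labeled egocentric human video", None),
    Stage("mid_train", "aligned human-robot play data", "pretrain"),
    Stage("post_train", "downstream robot task demonstrations", "mid_train"),
]

def check_recipe(stages: List[Stage]) -> List[str]:
    """Return stage names in order, verifying each stage starts from the previous one."""
    order: List[str] = []
    for stage in stages:
        expected = order[-1] if order else None
        if stage.init_from != expected:
            raise ValueError(
                f"{stage.name} should start from {expected}, not {stage.init_from}"
            )
        order.append(stage.name)
    return order

print(" -> ".join(check_recipe(RECIPE)))
```

Keeping each stage's data source and starting checkpoint explicit makes it easy to swap one stage (for example, a different mid-training dataset) without touching the others.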

One-Shot Task Adaptation and Lower-DoF Generalization

The final policy shows an emergent capability: it can learn brand new tasks from a single robot demonstration combined with ~100 human demonstrations of similar tasks. For example, after mid-training on 'fold towel,' it learns 'fold shirt' from just one robot example. Additionally, the learned motor prior transfers to robots with fewer degrees of freedom (lower-DoF hands). This is notable because it means what the model learns from human data isn't tied to the 22-DoF hand; it learns abstract skills that work across embodiments. This is the signature of a true prior, much as language models learn abstract concepts rather than surface patterns.

KEY RESULTS

Improvement over No-Pretraining Baseline: 54%

vs. training a 22-DoF hand from scratch with no human pretraining

This is the headline result. A 54% boost in average success rate means the difference between a robot that fumbles objects and one that completes complex multi-step tasks reliably. For real deployments, this gap often separates failure from viability.

Scaling Law Fit: R² = 0.9983

vs. previous work with no demonstrated scaling laws

This near-perfect fit means the log-linear relationship between data scale and validation loss is tight rather than noisy, which justifies investment in larger datasets. The researchers can predict how much validation loss will drop when the data is doubled, and that drop translates to predictable performance gains.
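As a worked example, assume the fitted law has the form below, where $D$ is hours of human video and $a$, $b$ are fitted constants with $b < 0$ (these symbols are illustrative; the summary does not give the fitted values):

```latex
\begin{align*}
L(D)  &= a + b \ln D, \qquad b < 0 \\
L(2D) &= a + b \ln(2D) = a + b \ln D + b \ln 2 \\
L(2D) - L(D) &= b \ln 2 \approx 0.693\, b
\end{align*}
```

Under this form, every doubling of data lowers validation loss by the same absolute amount, $|b| \ln 2$, regardless of the starting dataset size, so the relative improvement depends on the current loss.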

Human Dataset Size: 20,854 hours

vs. prior work using <1,000 hours

The 20× scale increase is not incremental; it changes what is possible. At small scales, models overfit and don't generalize to robots. At this scale, patterns emerge that transfer reliably. This is the key innovation: recognizing that dexterous manipulation, like language, has enough complexity to demand large-scale data.

One-Shot Transfer Success: learns new tasks from 1 robot demo + 100 human demos

vs. prior work requiring thousands of robot demonstrations per task

This is the practical win. In production, collecting robot data is expensive and slow. Being able to demonstrate a new task once for the robot, then letting it learn by watching humans, reduces the data collection burden by orders of magnitude.

WHY DEVELOPERS SHOULD CARE

For developers building robotics software, EgoScale changes how you should think about the data problem. Instead of hand-engineering reward functions or spending months collecting robot demonstrations, you now have a playbook: collect human video at scale, use scaling laws to check that your dataset is big enough, then transfer to your robot with lightweight mid-training. The specific insight, that human video plus a scaling law is more predictable than hand-engineered rewards, should change your architecture decisions. If you're building a manipulation system, your first question should be 'where do I get human video data?' rather than 'how do I code the reward function?' The embodiment-agnostic motor prior is the other major takeaway: you can train once on human data, then deploy across different hands. This decouples learning from hardware, a major shift in robotics philosophy. For software engineers, the VLA architecture is worth studying. It shows how to unify visual perception, language understanding, and action generation in a single coherent system, a pattern that is emerging across robotics. The recipe (pretrain on humans, mid-train on aligned data, post-train on tasks) is also modular and reusable: you can swap components and scale each stage independently.

LIMITATIONS

The approach requires 20,854 hours of carefully action-labeled egocentric video, a massive upfront investment that is not trivial to replicate. The scaling law holds for human validation loss, but the authors don't deeply explore what happens if your robot's physical characteristics differ radically from a human's (e.g., very different actuators or payload capacities). The two-stage transfer recipe requires aligned human-robot mid-training data, which still demands some infrastructure and careful choreography. Evaluation focuses on five dexterous tasks; generalization to entirely novel domains (e.g., manipulation in extreme temperatures, underwater, or with exotic materials) remains unclear. The paper also doesn't address what happens when human and robot capabilities genuinely diverge: humans are more capable and dexterous in some respects and coarser in others. Finally, the method trains a single policy across many tasks; specialization or multi-policy approaches might outperform the generalist approach.

WHAT COMES NEXT

The immediate next step is exploring even larger human video datasets (50k+ hours) to see whether the scaling law continues or plateaus, and understanding what the asymptotic performance ceiling looks like. More ambitiously, future work will likely investigate learning from unlabeled human video (removing the action-labeling bottleneck), multi-modal policies that can handle video, language, and demonstrations simultaneously, and scaling to full-body systems beyond hands. There's also rich territory in understanding what makes human data transferable, that is, which tasks or motions in human video contribute most to robot performance, so you can be strategic about which videos to collect. Finally, combining EgoScale's scaling insights with foundation models (like large vision transformers) that are already trained on billions of internet images could create even more powerful priors, reducing the need for explicit human labels by inferring them from large-scale unlabeled video.
