20,854 hours of action-labeled egocentric human video
KEY METRIC
54%
TASK
dexterous manipulation
EgoScale demonstrates that you can teach a dexterous robot hand with 22 degrees of freedom (DoF) to manipulate objects with remarkable skill by learning from videos of humans doing the same tasks. The key breakthrough is that the researchers trained a Vision-Language-Action (VLA) model on 20,854 hours of egocentric human video, 20 times more than any previous attempt, and discovered a reliable scaling law: more human data consistently means better robot performance. The final robot policy achieves 54% higher success rates on complex tasks like shirt folding and bottle unscrewing compared to robots trained from scratch. Why this matters: dexterous manipulation (fine-grained hand control) has always been the hard problem in robotics. Most robots can move arms around, but getting intricate finger movements right is extraordinarily difficult. This work shows you can sidestep years of hand-engineering by simply scaling up human video data.
ARCHITECTURE
THE PROBLEM
Previous attempts at learning dexterous manipulation from human data worked only in toy domains: simple tasks with constrained objects. Papers on human-to-robot transfer showed promise in controlled lab settings, but nobody had shown that human data could handle the complexity of real dexterous tasks with many degrees of freedom (DoF). The fundamental doubt was whether humans and robots move differently enough that human video becomes noise rather than signal. Prior work used tiny datasets (often <1,000 hours) and couldn't demonstrate clear scaling laws. Most dexterous manipulation systems relied on carefully hand-engineered reward functions, domain randomization tricks, or massive amounts of direct robot interaction, all expensive and brittle.
HOW IT WORKS
1
Massive Egocentric Video Collection and Labeling
The team collected and action-labeled 20,854 hours of first-person video showing humans performing tasks like folding clothes, handling tools, and manipulating objects. This is not passive video: each frame is labeled with the human's hand pose and action intent. The egocentric (first-person) perspective is critical because it matches what a robot's wrist camera sees, creating natural alignment between human and robot viewpoints. Collecting this much data required sustained effort from a dedicated team of operators, but the payoff is a dataset 20× larger than previous work. This scale is what enables the scaling laws they discover.
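To make "each frame is labeled with hand pose and action intent" concrete, here is a minimal sketch of what one labeled frame could look like. The class `LabeledFrame`, its field names, and the helpers `clip_hours` and `wrist_deltas` are hypothetical and chosen only for illustration; the actual dataset format is not described in this summary.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical schema: field names and shapes are illustrative assumptions,
# not the dataset's real format.
@dataclass
class LabeledFrame:
    timestamp_s: float                        # seconds since the clip started
    wrist_position: Tuple[float, float, float]  # wrist xyz in the camera frame, meters
    finger_joints: Tuple[float, ...]          # human hand joint angles, radians
    instruction: str                          # language description of the intent


def clip_hours(frames: List[LabeledFrame]) -> float:
    """Clip duration in hours, measured from the first to the last timestamp."""
    if len(frames) < 2:
        return 0.0
    return (frames[-1].timestamp_s - frames[0].timestamp_s) / 3600.0


def wrist_deltas(frames: List[LabeledFrame]) -> List[Tuple[float, ...]]:
    """Frame-to-frame change in wrist position, a simple stand-in for an action label."""
    return [
        tuple(b - a for a, b in zip(f0.wrist_position, f1.wrist_position))
        for f0, f1 in zip(frames, frames[1:])
    ]
```

Summing `clip_hours` over all clips is one way a total such as 20,854 hours could be tallied, and per-step deltas like those from `wrist_deltas` are one plausible form for wrist-level action targets.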
2
Vision-Language-Action (VLA) Model with Flow-Based Policy
Rather than training a simple action predictor, they built a Vision-Language-Action (VLA) model: a system that understands both visual observations and language descriptions of tasks, then predicts actions. The architecture uses a Vision-Language Model (VLM) backbone for perception and a flow-based action expert built on a diffusion transformer (DiT) for smooth, realistic motion generation. Actions are represented at the wrist level (camera frame) and then retargeted to the robot's specific hand kinematics. This architectural choice matters because it creates an embodiment-agnostic motor prior: the learned skills work across different hand designs. The flow-based approach generates physically plausible trajectories rather than discrete, jerky movements.
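The idea behind flow-based action generation can be sketched in a few lines: start from random noise and integrate a velocity field until the sample becomes an action. The sketch below is a toy under stated assumptions, not the paper's model. The names `toy_velocity` and `sample_action` are hypothetical, and the velocity field is hard-coded to move toward a known target, whereas a trained action expert would predict it from images and language.

```python
import random
from typing import List, Sequence


def toy_velocity(x: Sequence[float], t: float, target: Sequence[float]) -> List[float]:
    """Velocity of a straight-line flow that arrives at `target` when t reaches 1.

    Assumption for illustration: a learned network would output this velocity
    from observations; here the target action is simply given.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target: Sequence[float], steps: int = 10, seed: int = 0) -> List[float]:
    """Euler-integrate the flow from Gaussian noise (t = 0) to an action (t = 1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # initial noise sample
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                              # t stays strictly below 1
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because every step moves the sample a small, continuous distance, the intermediate states form a smooth path, which is the intuition behind the claim that flow-based policies avoid jerky motion.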
3
Discovering the Scaling Law and Validation Correlation
They systematically trained models on 1k, 2k, 4k, 10k, and 20k hours of human video and measured validation loss (how well the model predicts actions on held-out video). They discovered a near-perfect log-linear scaling law (R²=0.9983): validation loss decreases reliably as you add more data. Crucially, they showed that this validation loss on human video directly predicts real robot performance on downstream tasks. This is the holy grail of transfer learning: a metric on the source domain (human videos) that correlates with target performance (actual robot success). Once you know this relationship, you can confidently invest in data collection knowing it will improve robot performance in a predictable way.
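A log-linear fit of this kind can be reproduced with ordinary least squares on the logarithm of the data scale. The following is a minimal sketch: the data scales come from the text, but the loss values are made-up placeholders generated from an exact log-linear curve, and the function name `fit_log_linear` is hypothetical. Real measurements would replace `losses`.

```python
import math
from typing import List, Sequence, Tuple


def fit_log_linear(hours: Sequence[float], losses: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of loss = a + b * ln(hours); returns (a, b, r_squared)."""
    xs = [math.log(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, losses))
    ss_tot = sum((y - mean_y) ** 2 for y in losses)
    return a, b, 1.0 - ss_res / ss_tot


# Data scales from the text; the losses are placeholder values, NOT results
# from the paper.
hours: List[float] = [1_000, 2_000, 4_000, 10_000, 20_000]
losses: List[float] = [1.0 - 0.05 * math.log(h / 1_000) for h in hours]

a, b, r2 = fit_log_linear(hours, losses)
```

With real validation losses in place of the placeholders, an R² close to 1 indicates that a straight line in log-hours explains almost all of the variation, which is what the reported 0.9983 expresses.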
Rather than fine-tuning the human-pretrained model directly on robot tasks, they introduced a lightweight mid-training stage. This stage trains on aligned human-robot play data: pairs of videos showing a human and a robot performing the same action simultaneously (like folding the same towel). This alignment teaches the model to bridge the embodiment gap (how a human hand movement maps to robot motor commands) without requiring massive amounts of robot data. The mid-training is brief and cheap (a small amount of paired human-robot video), yet it proves essential for real robot success. After mid-training, the policy is post-trained on downstream tasks with minimal supervision (sometimes just one robot demonstration per task for one-shot learning).
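The order of the stages described above can be written down as a small pipeline. This is a sketch of the sequencing only: `Stage`, `RECIPE`, and `run_recipe` are hypothetical names, the scale notes paraphrase this summary, and the training function is left as a caller-supplied stub.

```python
from dataclasses import dataclass
from typing import Callable, List, TypeVar

M = TypeVar("M")


@dataclass(frozen=True)
class Stage:
    name: str        # label for the stage
    data: str        # what the stage trains on, per the text
    scale_note: str  # rough size, paraphrased from the summary


# Illustrative description of the recipe; no hyperparameters are implied.
RECIPE: List[Stage] = [
    Stage("pretrain", "action-labeled egocentric human video", "20,854 hours"),
    Stage("mid-train", "aligned human-robot play data", "small"),
    Stage("post-train", "robot demonstrations of the target task", "as few as one demo"),
]


def run_recipe(recipe: List[Stage], train_fn: Callable[[M, Stage], M], model: M) -> M:
    """Run the stages in order, passing each stage's output model to the next."""
    for stage in recipe:
        model = train_fn(model, stage)
    return model
```

Keeping the stages as data makes it easy to swap the dataset for one stage, or skip mid-training as an ablation, without touching the loop.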
5
One-Shot Task Adaptation and Lower-DoF Generalization
The final policy shows emergent generalization: it can learn brand new tasks from a single robot demonstration combined with ~100 human demonstrations of similar tasks. For example, after mid-training on 'fold towel,' it learns 'fold shirt' from just one robot example. Additionally, the learned motor prior transfers to robots with fewer degrees of freedom (lower-DoF hands). This is remarkable because it means the human data isn't overfitting to the 22-DoF hand; it learns abstract manipulation skills that work across embodiments. This is the signature of a true prior, like how language models learn abstract concepts rather than surface patterns.
KEY RESULTS
Improvement over No-Pretraining Baseline: 54%
vs. training a 22-DoF hand from scratch with no human pretraining
This is the headline result. A 54% boost in average task success rate means the difference between a robot that fumbles objects and one that completes complex multi-step manipulation reliably. For real deployment, this gap often separates failure from viability.
Scaling Law Coefficient: R²=0.9983
vs. previous work with no demonstrated scaling laws
This near-perfect fit means the log-linear relationship between data scale and validation loss is tight rather than noisy. It justifies investment in larger datasets with high confidence. Because the relationship is log-linear, each doubling of the data lowers validation loss by a roughly constant amount, which in turn translates to predictable robot performance gains.
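The effect of doubling the data follows directly from the log-linear form. As a short worked derivation, assume the fit is written as below, where the symbols $L$ (validation loss), $D$ (hours of data), and the constants $a$ and $b > 0$ are notation introduced here for illustration rather than taken from the paper:

```latex
L(D) = a - b \ln D
\quad\Longrightarrow\quad
L(2D) - L(D) = -b \ln(2D) + b \ln D = -b \ln 2 .
```

The drop $b \ln 2$ does not depend on $D$, so going from 1k to 2k hours buys the same loss reduction as going from 10k to 20k hours, even though the second step costs ten times as much data.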
Human Dataset Size: 20,854 hours
vs. prior work using <1,000 hours
The 20× scale increase is not incremental; it fundamentally changes what's possible. At small scales, models overfit and don't generalize to robots. At this scale, patterns emerge that transfer reliably. This is the key innovation: recognizing that dexterous manipulation, like language, has enough complexity to demand large-scale data.
One-Shot Transfer Success: Learns new tasks from 1 robot demo + 100 human demos
vs. prior work requiring thousands of robot demonstrations per task
This is the practical win. In production, collecting robot data is expensive and slow. Being able to demonstrate a new task once for the robot, then letting it learn by watching humans, reduces data collection burden by orders of magnitude.
WHY DEVELOPERS SHOULD CARE
For developers building robotics software, EgoScale rewires how you should think about the data problem. Instead of hand-engineering reward functions or spending months on domain randomization, you now have a playbook: collect human video at scale, use scaling laws to validate that your dataset is big enough, then transfer to your robot with lightweight mid-training. The specific insight, that human video plus a scaling law is more predictable than hand-engineered rewards, should change your architecture decisions. If you're building a dexterous manipulation system, your first instinct should be 'where do I get human video data?' rather than 'how do I code the physics simulator?' The embodiment-agnostic motor prior is profound: you can train once on human data, then deploy across different robot hands. This decouples learning from hardware, a major shift in robotics philosophy. For software engineers, the Vision-Language-Action (VLA) architecture is worth studying. It's how you unify perception, language understanding, and action generation in a single coherent system, a pattern that's emerging across embodied AI. The two-stage transfer recipe (pretrain on humans, mid-train on aligned data), followed by post-training on tasks, is also modular and reusable: you can swap components and scale each stage independently.
LIMITATIONS
The approach requires 20,854 hours of carefully action-labeled egocentric video, a massive upfront investment that's not trivial to replicate. The scaling law holds for human validation loss, but the authors don't deeply explore what happens if your robot's dynamics differ radically from a human's (e.g., very different actuators or payload capacities). The two-stage transfer recipe requires aligned human-robot mid-training data, which still demands some robot infrastructure and careful task choreography. Evaluation focuses on five dexterous tasks; generalization to entirely novel manipulation domains (e.g., manipulation in extreme temperatures, underwater, or with exotic materials) remains unclear. The paper also doesn't address what happens when human and robot capabilities genuinely diverge: humans have better proprioception and dexterity in some respects, coarser control in others. Finally, the method trains a single policy across many tasks; specialization or multi-policy approaches might outperform the generalist approach.
WHAT COMES NEXT
The immediate next step is exploring even larger human video datasets (50k+ hours) to see if the scaling law continues or plateaus, and understanding what the asymptotic performance ceiling looks like. More ambitiously, future work will likely investigate self-supervised learning from unlabeled human video (removing the action-labeling bottleneck), multi-modal policies that can handle video, language, and demonstrations simultaneously, and scaling to full-body robot systems beyond hands. There's also rich territory in understanding what makes human data transferable (which tasks or motions in human video contribute most to robot performance) so you can be strategic about which videos to collect. Finally, combining EgoScale's scaling insights with foundation models (like large vision transformers) that are already trained on billions of internet images could create even more powerful robot priors, reducing the need for explicit human action labels by inferring them from large-scale unlabeled video.