GRASPING · CURRENT · 2026-06-15

Human Universal Grasping

Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto

ARCHITECTURE
Flow-matching transformer with RGB-PC fusion and point-conditioned query for multi-fingered grasp generation
ROBOT
YOR mobile manipulator with WUJI hands; also Ability Hand (smaller morphology)
DATASET
1M image-grasp pairs from 6,707 recordings, ~1.5K unique objects, 41 buildings, 27.8 hours of egocentric video
KEY METRIC
66.7% success rate (tabletop)
TASK
Dexterous object grasping from RGB-D

HUG demonstrates that learning dexterous grasping from human motion data, rather than from robot demonstrations or simulation, produces generalizable manipulation primitives that transfer zero-shot to multiple robot hands. The system trains a flow-matching model on 1M egocentric human grasps collected with Aria Gen 2 smart glasses, learns to predict full MANO hand poses (wrist position, wrist rotation, and 15 finger joints) from RGB-D images, and retargets these predictions to robot morphologies without per-hand retraining. On a curated benchmark of 30 unseen objects spanning five geometric categories and three size classes, HUG reaches 66.7% tabletop success and 62.0% in-the-wild success, beating prior multi-fingered grasping methods by 23–34 percentage points. The key insight is that natural human grasping data is abundant and easier to collect at scale than robot teleoperation data or synthetic grasps. By pairing lightweight egocentric sensor streams (RGB, stereo depth, hand landmarks) with anatomically valid MANO fitting and flow-matching generative modeling, HUG captures the distribution of how humans grasp real objects in everyday environments. The learned policy generalizes across unseen object geometries, camera intrinsics, robot hand morphologies, and household environments without any robot-specific fine-tuning. The paper also introduces HUG-Bench, a metric-scale benchmark of 90 objects reconstructed from real egocentric video, enabling paired simulation and real-world evaluation. All code, data, trained models, and benchmark assets are released publicly.

THE PROBLEM

Multi-fingered robot grasping remains far from human-level generality. Prior approaches have fundamental limitations: synthetic methods (optimization-based or learned generators in simulation) struggle with the sim-to-real gap and require retraining per robot hand; teleoperation produces real, embodiment-specific grasps but is tedious and cannot cover object diversity; and learning from robot data is expensive because dexterous teleoperation is slow and difficult to scale. Most prior work trains on simulation data (e.g., DexGraspNet, Dex1B, UniDexGrasp++) or lab-collected data (DexYCB, AnyDexGrasp), which lack the scale and diversity of real-world grasping. Critically, existing multi-fingered methods often require complete object point clouds, which are unavailable in single-view real deployment, limiting practical applicability. The core problem is one of data sourcing: robots need the diverse, naturally executed grasping experience humans accumulate daily, but collecting this via robot-specific teleoperation is prohibitively expensive. Recent advances in lightweight egocentric sensors (Aria Gen 2) and anthropomorphic robot hands with learned retargeting have made a previously infeasible approach practical: collect human grasps at scale, learn the natural distribution of human grasping, and retarget to robots. According to the authors, no prior work has demonstrated this pipeline for multi-fingered dexterous grasping.

HOW IT WORKS

1

1M-HUGs Dataset Collection & Curation

The authors collected 1M egocentric image-grasp pairs using Aria Gen 2 smart glasses across 41 buildings. For each object, the wearer stands in front of it, moves their head for 15–30 seconds to capture diverse viewpoints without their hand visible, then grasps the object with their right hand. A key insight is that a single grasp can be propagated backward to the preceding no-hand frames via camera pose, yielding hundreds of (object image, grasp) pairs from different viewpoints at no additional cost. Raw recordings are filtered on five criteria: object mask presence, ≥60% confident stereo depth, hand landmarks intersecting the object mask, sufficient in-frame landmarks, and absence of hands in the object-view frames. Each recording is processed with a vision-language model (VLM) for object identification, SAM3 for mask propagation across frames, and stability heuristics for grasp-frame selection. All entries are human-reviewed via a web interface. The 1M surviving frames span ~1.5K unique objects in diverse environments (kitchens, bedrooms, offices, etc.). Crucially, the authors fit a full MANO hand (10-dim shape + 15×3-dim pose) to the sparse 21-point Aria landmarks using anatomical constraints, standardizing all grasps to a canonical hand size and producing an articulated mesh suitable for simulation. This aria2mano pipeline is released as a standalone tool.
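The viewpoint propagation described above amounts to a change of coordinate frame. The following is a minimal sketch, assuming the object stays still between the viewing frames and the grasp frame, and assuming every camera pose is available as a 4x4 camera-to-world transform in a shared world frame. The function and variable names are hypothetical and are not taken from the released code.

```python
import numpy as np


def make_pose(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def propagate_grasp(
    T_world_cam_grasp: np.ndarray,
    T_world_cam_view: np.ndarray,
    T_camgrasp_wrist: np.ndarray,
) -> np.ndarray:
    """Re-express a wrist pose in the camera frame of an earlier no-hand view.

    T_world_cam_grasp: camera-to-world pose at the grasp frame (assumed given).
    T_world_cam_view:  camera-to-world pose at an earlier viewing frame.
    T_camgrasp_wrist:  wrist pose expressed in the grasp-frame camera.
    """
    # Lift the wrist pose into the shared world frame.
    T_world_wrist = T_world_cam_grasp @ T_camgrasp_wrist
    # Pull it back into the earlier camera's frame.
    return np.linalg.inv(T_world_cam_view) @ T_world_wrist


# Example: the viewing camera sits 0.5 m to the right of the grasp camera.
T_grasp_cam = make_pose(np.eye(3), np.zeros(3))
T_view_cam = make_pose(np.eye(3), np.array([0.5, 0.0, 0.0]))
T_wrist = make_pose(np.eye(3), np.array([0.0, 0.0, 1.0]))
T_wrist_in_view = propagate_grasp(T_grasp_cam, T_view_cam, T_wrist)
```

Applying the same function to every retained no-hand frame of a recording would give one (image, grasp) pair per frame, which is how a single grasp can label hundreds of viewpoints.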

2

Flow-Matching Architecture with RGB-PC Fusion

HUG predicts a 99-dimensional grasp state consisting of wrist translation (3D), wrist rotation (6D continuous representation), and 15 finger joint rotations (90D total, i.e. 6D per joint). The input is an RGB-D image from a stereo camera plus a user-specified 2D pixel click on the target object. The RGB image is encoded with a frozen DINOv2-Base ViT (256 patch tokens); the depth is back-projected to a metric point cloud, cropped to a 0.3 m radius ball around the 3D query point, and processed by a trainable PointNeXt U-Net (4096 points → 256 region tokens). The two streams are fused via point painting: each point cloud centroid is projected into the RGB image using camera intrinsics K, its DINOv2 feature is bilinearly sampled, concatenated with the PC token, and projected by an MLP into a 1024-dim fused token. Both the query point and point centroids are encoded with random Fourier features to retain metric scale information. The fused tokens cross-attend to the query token in a 4-layer pre-norm transformer to produce 256 scene-conditioning tokens. These are then fed to the flow transformer: a 6-layer Diffusion Transformer (DiT) that processes translation, rotation, and finger pose tokens separately (keeping geometric components from over-mixing), with timestep conditioning via AdaLN-Zero. The flow model is trained to predict velocity in normalized space, with an L1 loss on 3D MANO landmarks (weighted λ₃D=20) combined with an MSE velocity loss (λᵥ=1). Camera intrinsics appear only in back-projection and projection, never as learned parameters, enabling transfer across different stereo cameras.
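The two places where camera intrinsics enter the pipeline, back-projection and projection, can be illustrated with a short NumPy sketch. It assumes an undistorted pinhole camera, depth in meters, and zero depth marking invalid pixels. The function names are hypothetical, and the actual preprocessing in the released code may differ.

```python
import numpy as np


def back_project(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Turn an (H, W) metric depth map into an (N, 3) camera-frame point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)
    return points[depth > 0]  # drop pixels without valid depth


def pixel_to_point(u: int, v: int, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift the user's 2D click (u, v) to a 3D query point."""
    z = depth[v, u]
    return np.array([(u - K[0, 2]) * z / K[0, 0], (v - K[1, 2]) * z / K[1, 1], z])


def crop_ball(points: np.ndarray, center: np.ndarray, radius: float = 0.3) -> np.ndarray:
    """Keep only points within `radius` meters of the query point."""
    return points[np.linalg.norm(points - center, axis=1) <= radius]


def project(points: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project (N, 3) camera-frame points to (N, 2) pixel coordinates."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]


# Toy example: a flat wall 1 m away seen by a 5x5 camera.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.ones((5, 5))
cloud = back_project(depth, K)
query = pixel_to_point(2, 2, depth, K)
local = crop_ball(cloud, query, radius=0.015)
pixels = project(local, K)  # where each cropped point would sample image features
```

In the architecture described above, the projected pixel locations are where DINOv2 features would be bilinearly sampled for point painting. Because K is only used in these geometric steps, swapping cameras changes the inputs to these functions rather than any learned weights.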

3

Hand Retargeting to Robot Embodiments

Predicted MANO grasps are retargeted to multiple robot hands (WUJI, Ability Hand) without per-hand training. The paper leverages recent learned retargeting methods (cited as qin2023anyteleop, mandi2025dexmachina, li2025maniptrans, wuji2026retargeting) to transform the canonical MANO hand pose to each robot's morphology. This is possible because anthropomorphic hands have narrowed the human-robot morphology gap. The fixed MANO shape simplifies retargeting: rather than handling variable human hand sizes, a single canonical hand size is used for all training data, and the network predicts only articulation and placement. At deployment, the same predicted grasp can be retargeted to hands with different sizes and joint counts, enabling zero-shot generalization across embodiments without retraining or per-robot optimization.

4

HUG-Bench: Metric-Scale Benchmark Construction

To standardize evaluation, the authors built HUG-Bench, comprising 90 unseen objects from five geometric categories (cylindrical, spheroidal, prismatic, appendaged, amorphous) and three size bins (small, medium, large), with six objects per combination. The 30 test objects are deliberately hard to grasp: many are articulated, very short (~1 cm), or large and unwieldy. Crucially, all objects are reconstructed at metric scale from real egocentric video using an extended Multi-view SAM3D pipeline. For each object, five spread-out Aria Gen 2 views are collected, injected into MV-SAM3D with Aria intrinsics/extrinsics and stereo depth, and manually inspected in Viser for scale and pose alignment with the SLAM semi-dense point cloud. Meshes are made watertight with Alpha Wrap and decomposed into convex parts for simulation fidelity. Each object also has 10 human grasps recorded for oracle evaluation. This construction ensures simulation-to-real consistency: the same metric-scale meshes are used in both MuJoCo simulation and real-world experiments.

5

Evaluation in Simulation and Real World

In simulation (MuJoCo), HUG is evaluated on the 30 test objects using a simulated MANO hand. The paper tracks three metrics: success rate (object lifted 10 cm), fingertip contact count (mean and std across successful grasps), and penetration depth (violations of object geometry). Baselines include DexGraspNet, Dex1B, UniDexGrasp++, and a learned grasping policy. Real-world evaluation uses a YOR mobile manipulator with WUJI hands. The robot executes grasps on all 30 test objects in both tabletop (controlled) and in-the-wild (unconstrained) settings across multiple stereo cameras and unseen homes. Success is measured by lifting objects 10 cm vertically. Failure modes are traced at three stages: pre-grasp approach, grasp execution, and lift, with failures categorized by cause (contact loss, slip, etc.). Results show HUG achieves 66.7% tabletop and 62.0% in-the-wild success, and the failure analysis highlights penetration and single-modality (RGB-only or PC-only) failures.
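The lift-based success criterion can be written down as a small check. This is a sketch assuming that the object's height is logged at the start and end of each rollout. The function names and the example heights are illustrative and are not taken from the paper.

```python
from typing import Iterable, Tuple

LIFT_THRESHOLD_M = 0.10  # the 10 cm lift criterion described above


def lift_success(z_start: float, z_end: float, threshold_m: float = LIFT_THRESHOLD_M) -> bool:
    """A rollout counts as a success if the object rose by at least the threshold."""
    return (z_end - z_start) >= threshold_m


def success_rate(rollouts: Iterable[Tuple[float, float]]) -> float:
    """Percentage of rollouts, given as (z_start, z_end) pairs in meters, that succeed."""
    outcomes = [lift_success(z0, z1) for z0, z1 in rollouts]
    if not outcomes:
        raise ValueError("at least one rollout is required")
    return 100.0 * sum(outcomes) / len(outcomes)


# Illustrative heights only: three clean lifts and one slip.
example_rollouts = [(0.02, 0.20), (0.02, 0.15), (0.05, 0.30), (0.02, 0.04)]
rate = success_rate(example_rollouts)
```

The same check would apply in MuJoCo and on the real robot, provided the object height is measured in a consistent frame in both.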

In-the-wild match box: autonomous HUG rollouts in an unseen home.
HUG on storage bin: 10 tabletop rollouts from the project-page comparison set.
Dex1B baseline on the same storage-bin object.
CAP gripper baseline on the same storage-bin object.

KEY RESULTS

Real-world tabletop grasping success rate on 30 unseen HUG-Bench test objects: 66.7%

vs. baselines: beats DexGraspNet by 23 percentage points and Dex1B by 34 percentage points

HUG achieves significantly higher success than prior multi-fingered methods, demonstrating that learning from natural human grasps generalizes better than synthetic or simulation-trained approaches. The gap widens on this challenging set of articulated, tiny, and oversized objects.

In-the-wild grasping success rate (unconstrained household environments, unseen homes): 62.0%

Consistent zero-shot transfer across stereo cameras and robot embodiments

The 62% in-the-wild rate shows HUG generalizes robustly to deployment conditions far from the collection distribution. This is the most realistic setting (unconstrained object placement, variable lighting, different cameras), yet performance remains strong, validating that human grasping distributions capture generalizable strategies.

Scaling behavior (success rate vs. training dataset size): monotonic improvement from 100K to 1M image-grasp pairs

Figure 7 shows the scaling curve; performance continues to improve without saturation

The paper demonstrates a positive scaling trend: larger datasets yield better generalization. This is critical evidence that the approach benefits from more human grasp data, suggesting further scaling could push performance higher. The curve does not plateau within the tested range, implying diminishing but ongoing returns.

Multi-modality ablation (RGB-D fusion vs. single modality): RGB+PC fusion achieves higher success than RGB-only or PC-only

Figure 9 shows qualitative cases (pineapple, hairbrush, spoon) where single modality fails but fusion succeeds

Point painting and joint RGB-PC conditioning are necessary: RGB alone struggles on transparent/reflective objects (anchovies in water, glass), while point clouds alone lose texture-based information. The fusion approach balances both signals, which is critical for diverse real-world objects.

WHY DEVELOPERS SHOULD CARE

For software developers building robot systems, this paper demonstrates a paradigm shift: multi-fingered dexterous grasping can be learned from human data rather than robot data or simulation, and the resulting policy transfers zero-shot to new robot embodiments without retraining. This is significant because it decouples data collection from deployment. Developers no longer need to commission expensive robot teleoperation campaigns or endure sim-to-real gaps; instead, they can leverage egocentric video, which is increasingly easy to collect at scale with consumer smart glasses, to bootstrap manipulation capabilities. The 1M-HUGs dataset and aria2mano curation pipeline provide a concrete template for scaling this approach: capture diverse human grasps with calibrated depth and hand tracking, fit anatomical hand models, and train a simple flow-matching model. The architecture itself is surprisingly standard (DINOv2 + PointNeXt + DiT with cross-attention), suggesting that the bottleneck is data quality and diversity, not model design. For roboticists, the key takeaway is that natural human grasping distributions matter. Rather than optimizing for force-closure or sampling all physically valid grasps, learning what humans actually do, which is often simpler and more conservative, produces policies that execute reliably on real robot hardware. The paper also introduces a new evaluation standard: HUG-Bench, with metric-scale reconstructions and paired simulation-to-real evaluation, is a more honest benchmark than simulation-only tests. The open release of code, data, trained models, and benchmark assets lowers the barrier for future work, making this a platform for advancing manipulation research.

LIMITATIONS

The paper lists several practical constraints: HUG is trained only on right-handed grasps with a fixed canonical MANO hand, so it does not model left-handed, bimanual, or hand-specific morphology. Retargeting can fail when a robot hand cannot realize the predicted human pose, and real-world executions are open-loop, so shifted or articulated objects can break the plan. Labels can also be noisy under hand occlusion. Accuracy drops for very small objects because of the 224 × 224 input resolution, and for large or far objects, which are rare in egocentric data. The evaluation remains indoor-only.

WHAT COMES NEXT

The natural next step is to turn HUG from a single open-loop grasp predictor into a closed-loop grasping system: generate multiple candidate grasps, rank them, and replan during contact and lift with visual feedback. The paper also points toward broader grasp data collection: left-handed and bimanual grasps, variable hand morphology, outdoor or less controlled scenes, and more data for large or far objects would make the human-to-robot transfer story more complete.

RELATED PAPERS