GRASPING | CURRENT | 2026-06-15

Human Universal Grasping

Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto

ARCHITECTURE

Flow-matching transformer with RGB-PC fusion and point-conditioned query for multi-fingered grasp generation

ROBOT

YOR mobile manipulator with WUJI hands; also Ability Hand (smaller morphology)

DATASET

1M image-grasp pairs from 6,707 recordings, ~1.5K unique objects, 41 buildings, 27.8 hours of egocentric video

KEY METRIC

66.7% success rate (tabletop)

TASK

Dexterous object grasping from RGB-D

HUG demonstrates that learning dexterous grasping from human motion data, rather than from robot demonstrations or simulation, produces generalizable grasp primitives that transfer to multiple robot hands. The system trains a flow-matching model on 1M egocentric human grasps collected with Aria Gen 2 smart glasses, learns to predict full MANO hand poses (wrist position, wrist rotation, and 15 finger joints) from images, and retargets these predictions to different robot hand morphologies without per-hand retraining. On a curated test set of 30 unseen objects spanning five geometric categories and three size classes, HUG reaches 66.7% tabletop success and 62.0% in-the-wild success, beating prior multi-fingered methods by 23–34 percentage points.

The key insight is that natural human grasp data is abundant and easier to collect at scale than teleoperation data or synthetic grasps. By pairing lightweight egocentric streams (RGB, stereo depth, hand landmarks) with anatomically valid MANO fitting and a flow-matching generative model, HUG captures the distribution of how humans grasp real objects in everyday environments. The learned model generalizes across unseen object geometries, camera intrinsics, hand morphologies, and household environments without any robot-specific training. The paper also introduces HUG-Bench, a metric-scale benchmark of 90 objects reconstructed from real egocentric video, enabling paired simulation and real-world evaluation. All code, data, trained models, and assets are released publicly.

THE PROBLEM

Multi-fingered grasping remains far from human-level generality. Prior approaches suffer from fundamental limitations: synthetic methods (optimization-based or learned generators in simulation) struggle with the sim-to-real gap and require retraining per hand; teleoperation produces real, embodiment-specific grasps but is tedious and cannot cover object diversity; and learning from robot data is expensive because dexterous data collection is slow and difficult to scale. Most prior work trains on simulated data (e.g., DexGraspNet, Dex1B, UniDexGrasp++) or lab-collected data (DexYCB, AnyDexGrasp), which lack the scale and diversity of real-world grasping. Critically, existing multi-fingered methods often require complete object point clouds, which are unavailable from a single view in real scenes, limiting practical applicability.

The core problem is one of data sourcing: robots need the diverse, naturally executed experience that humans accumulate daily, but collecting it through robot-specific teleoperation is prohibitively expensive. Recent advances in lightweight egocentric sensors (Aria Gen 2) and anthropomorphic hands with learned retargeting have made a previously infeasible approach practical: collect human grasps at scale, learn the natural distribution of human grasps, and retarget to robots. No prior work has demonstrated this pipeline for multi-fingered dexterous grasping.

HOW IT WORKS

1M-HUGs Dataset Collection & Curation

The authors collected 1M egocentric image-grasp pairs using Aria Gen 2 smart glasses across 41 buildings. For each object, the wearer stands in front of it, moves their head for 15–30 seconds to capture diverse viewpoints with no hand visible, and then grasps the object with their right hand. A key insight is that a single grasp can be propagated back to the preceding no-hand frames using the camera poses, yielding hundreds of (object image, grasp) pairs from different viewpoints at no additional cost.

Raw recordings are filtered on five criteria: object mask presence, at least 60% confident stereo depth, hand landmarks intersecting the object mask, sufficient in-frame landmarks, and absence of hands in the frame. Each recording is verified with an automated model for object identification, SAM3 for mask propagation across frames, and stability heuristics for grasp-frame selection. All entries are human-reviewed via a web interface. The 1M surviving frames span ~1.5K unique objects in diverse environments (kitchens, bedrooms, offices, etc.).

Crucially, the authors fit a full MANO hand (10-dim shape + 15×3-dim pose) to the sparse 21-point Aria landmarks using anatomical constraints, standardizing all grasps to a canonical hand size and producing an articulated mesh suitable for downstream use. This aria2mano pipeline is released as a standalone tool.
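The viewpoint-propagation step can be sketched with rigid transforms. This is a minimal illustration, not the authors' code: it assumes camera-to-world poses are available as 4x4 matrices in a shared world frame and that the object stays still between the no-hand views and the grasp. The function names (`invert_se3`, `propagate_grasp`, `translation`) are hypothetical.

```python
import numpy as np

def invert_se3(T):
    """Invert a 4x4 rigid transform using R^T instead of a general inverse."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

def propagate_grasp(T_world_cam_grasp, T_cam_hand, T_world_cam_views):
    """Re-express one wrist pose in the camera frame of each earlier view.

    T_world_cam_grasp: camera-to-world pose of the frame where the grasp happens.
    T_cam_hand:        wrist pose expressed in that camera's frame.
    T_world_cam_views: camera-to-world poses of the earlier no-hand frames.
    Assumes the object does not move between the views and the grasp.
    """
    T_world_hand = T_world_cam_grasp @ T_cam_hand
    return [invert_se3(T_wc) @ T_world_hand for T_wc in T_world_cam_views]

def translation(x, y, z):
    """Build a pure-translation 4x4 transform (helper for the toy example)."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

# Toy example: the wrist sits 0.5 m in front of the grasp-time camera, and an
# earlier view was taken from a camera shifted 0.1 m along +x.
wrist_in_views = propagate_grasp(
    translation(0.0, 0.0, 0.0),
    translation(0.0, 0.0, 0.5),
    [translation(0.1, 0.0, 0.0)],
)
print(wrist_in_views[0][:3, 3])  # wrist position in the earlier camera's frame
```

Each returned pose can then be paired with the image from its view, which is how one recorded grasp could label many frames.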

Flow-Matching Architecture with RGB-PC Fusion

HUG predicts a 99-dimensional grasp consisting of wrist translation (3D), wrist rotation (6D continuous representation), and 15 finger-joint rotations (6D each, 90D total). The input is an RGB-D image from a stereo camera plus a user-specified 2D pixel click on the target object.

The RGB image is encoded with a frozen DINOv2-Base ViT (256 patch tokens). The depth map is back-projected to a point cloud, cropped to a ball of radius 0.3 m around the 3D query point, and processed by a trainable PointNeXt U-Net (4096 points → 256 region tokens). The two streams are fused via point painting: each point cloud centroid is projected into the image using the camera intrinsics K, the DINOv2 feature at that location is bilinearly sampled and concatenated with the point cloud token, and an MLP projects the result into a 1024-dim fused token. Both the query point and the point centroids are encoded with random Fourier features to retain scale information. The fused tokens cross-attend to the query token in a 4-layer pre-norm transformer to produce 256 scene-conditioning tokens.

These tokens condition the flow transformer: a 6-layer Diffusion Transformer (DiT) that processes translation, rotation, and finger-pose tokens separately (to keep the geometric components from over-mixing), with timestep conditioning via AdaLN-Zero. The flow model is trained to predict velocity in normalized space, with an MSE loss on the velocity (λᵥ=1) combined with an L1 loss on 3D MANO landmarks (λ₃D=20). Camera intrinsics appear only in back-projection and projection, never as learned parameters, which enables transfer across different stereo cameras.
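Two parts of this description lend themselves to a short NumPy sketch: the geometry through which the intrinsics K enter (back-projection, ball crop, projection, and point painting), and the velocity-regression target with the weighted loss. Everything below is illustrative rather than the authors' implementation. The function names are hypothetical, the straight noise-to-data path in `flow_matching_pair` is a common flow-matching choice assumed here, and the landmark tensors are taken as given instead of being computed from a MANO model.

```python
import numpy as np

def backproject(depth, K):
    """Lift an (H, W) metric depth map to (H*W, 3) camera-frame points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def crop_ball(points, center, radius=0.3):
    """Keep valid-depth points within `radius` metres of the 3D query point."""
    near = np.linalg.norm(points - center, axis=1) <= radius
    return points[(points[:, 2] > 0) & near]

def project(points, K):
    """Project camera-frame points to continuous pixel coordinates (u, v)."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def bilinear_sample(feat, uv):
    """Bilinearly sample an (Hf, Wf, C) map at (N, 2) coords in its own grid."""
    Hf, Wf, _ = feat.shape
    u = np.clip(uv[:, 0], 0, Wf - 1)
    v = np.clip(uv[:, 1], 0, Hf - 1)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    u1, v1 = np.minimum(u0 + 1, Wf - 1), np.minimum(v0 + 1, Hf - 1)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u1] * du
    bottom = feat[v1, u0] * (1 - du) + feat[v1, u1] * du
    return top * (1 - dv) + bottom * dv

def paint(centroids, pc_tokens, feat, K, image_hw):
    """Concatenate each point token with the image feature at its projection."""
    H, W = image_hw
    Hf, Wf, _ = feat.shape
    uv = project(centroids, K)
    # Map pixel coordinates onto the coarser feature grid (pixel-centre convention).
    uv_feat = np.stack([(uv[:, 0] + 0.5) * Wf / W - 0.5,
                        (uv[:, 1] + 0.5) * Hf / H - 0.5], axis=1)
    return np.concatenate([pc_tokens, bilinear_sample(feat, uv_feat)], axis=1)

def flow_matching_pair(x1, rng):
    """Sample (x_t, t, v_target) on a straight noise-to-data path (assumed)."""
    x0 = rng.standard_normal(x1.shape)       # noise sample, one per grasp
    t = rng.uniform(size=(x1.shape[0], 1))   # one timestep per grasp
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, t, x1 - x0

def combined_loss(v_pred, v_target, lm_pred, lm_gt, lam_v=1.0, lam_3d=20.0):
    """MSE on velocity plus L1 on 3D landmarks, with the weights from the text."""
    mse = np.mean((v_pred - v_target) ** 2)
    l1 = np.mean(np.abs(lm_pred - lm_gt))
    return lam_v * mse + lam_3d * l1
```

In this sketch K is used only inside `backproject` and `project`, so switching to a different calibrated camera changes the inputs to those two functions and nothing else. The MLP that maps the painted tokens to 1024 dimensions is omitted.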

Hand Retargeting to Robot Embodiments

Predicted MANO grasps are retargeted to multiple robot hands (WUJI, Ability Hand) without per-hand training. The paper leverages recent learned retargeting methods (cited as qin2023anyteleop, mandi2025dexmachina, li2025maniptrans, wuji2026retargeting) to transform the canonical MANO hand pose to each robot hand's morphology. This is possible because anthropomorphic robot hands have narrowed the human-robot morphology gap. The fixed MANO shape simplifies retargeting: rather than handling variable human hand sizes, a single canonical hand size is used for all data, and the network predicts only articulation and placement. At inference time, the same predicted grasp can be retargeted to hands of different sizes and degree-of-freedom counts, enabling transfer across embodiments without retraining or per-robot optimization.

HUG-Bench: Metric-Scale Benchmark Construction

To standardize evaluation, the authors built HUG-Bench, comprising 90 unseen objects from five geometric categories (cylindrical, spheroidal, prismatic, appendaged, amorphous) and three size bins (small, medium, large), with six objects per combination (5 × 3 × 6 = 90). The 30 test objects are deliberately hard to grasp: many are articulated, very short (~1 cm), or large and unwieldy. Crucially, all objects are reconstructed at metric scale from real egocentric video using an extended Multi-view SAM3D (MV-SAM3D) pipeline. For each object, five spread-out Aria Gen 2 views are collected, fed into MV-SAM3D together with the Aria intrinsics, extrinsics, and stereo depth, and manually inspected in Viser for scale and pose alignment with the semi-dense point cloud. Meshes are made watertight with Alpha Wrap and decomposed into convex parts for simulation fidelity. Each object also has 10 recorded human grasps for oracle evaluation. This construction ensures simulation-to-real consistency: the same metric-scale meshes are used in both MuJoCo and real-world experiments.

Evaluation in Simulation and Real World

In simulation (MuJoCo), HUG is evaluated on the 30 test objects using a simulated MANO hand. The paper tracks three metrics: success rate (object lifted 10 cm), fingertip contact count (mean and standard deviation across successful grasps), and penetration depth (violations of object geometry). Baselines include DexGraspNet, Dex1B, UniDexGrasp++, and one further learned baseline.

Real-world evaluation uses a YOR mobile manipulator with WUJI hands. The robot executes grasps on all 30 test objects in both tabletop (controlled) and in-the-wild (unconstrained) settings, across multiple stereo cameras and unseen homes. A trial succeeds if the object is lifted 10 cm vertically. Failures are traced to one of three stages (pre-grasp approach, grasp execution, and lift) and categorized by type, such as slip. HUG achieves 66.7% tabletop and 62.0% in-the-wild success, and the failure analysis highlights penetration depth and single-modality (RGB-only or PC-only) failures.
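The three simulation metrics can be aggregated as in the sketch below. The `Trial` record and its field names are hypothetical, and averaging penetration over all trials (rather than only successful ones) is an assumption; the 10 cm lift threshold and the mean and standard deviation over successful grasps follow the description above.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Trial:
    z_before: float          # object height before the lift, metres
    z_after: float           # object height after the lift, metres
    fingertip_contacts: int  # fingertips touching the object at grasp time
    penetration: float       # hand-object penetration depth, metres

def lifted(trial, min_lift=0.10):
    """Success criterion from the text: the object rises at least 10 cm."""
    return trial.z_after - trial.z_before >= min_lift

def summarize(trials):
    """Aggregate success rate, contact statistics, and mean penetration."""
    ok = [t for t in trials if lifted(t)]
    contacts = np.array([t.fingertip_contacts for t in ok], dtype=float)
    return {
        "success_rate": len(ok) / len(trials),
        # Contact statistics are computed over successful grasps only.
        "contacts_mean": float(contacts.mean()) if ok else float("nan"),
        "contacts_std": float(contacts.std()) if ok else float("nan"),
        # Assumption: penetration is averaged over every trial.
        "penetration_mean": float(np.mean([t.penetration for t in trials])),
    }

trials = [
    Trial(0.02, 0.15, 3, 0.001),  # lifted 13 cm: success
    Trial(0.02, 0.04, 2, 0.004),  # lifted 2 cm: failure
    Trial(0.02, 0.14, 5, 0.002),  # lifted 12 cm: success
]
print(summarize(trials))
```

The same `lifted` check applies to the real-world trials, where success is likewise defined by a 10 cm vertical lift.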

In-the-wild match box: autonomous HUG rollouts in an unseen home.

HUG on storage bin: 10 tabletop rollouts from the project-page comparison set.

Dex1B baseline on the same storage-bin object.

CAP gripper baseline on the same storage-bin object.

KEY RESULTS

Real-world tabletop grasping success rate on 30 unseen HUG-Bench test objects: 66.7%

vs. baselines: +23 percentage points over DexGraspNet, +34 over Dex1B

HUG achieves significantly higher success than prior multi-fingered methods, demonstrating that learning from natural human grasps generalizes better than synthetic or simulation-trained approaches. The gap widens on this challenging set of articulated, tiny, and oversized objects.

In-the-wild grasping success rate (unconstrained household environments, unseen homes): 62.0%

vs. Consistent zero-shot transfer across stereo cameras and robot embodiments

The 62.0% in-the-wild rate shows that HUG generalizes to conditions far from the data-collection distribution. This is the most realistic setting (unconstrained object placement, variable lighting, different cameras), yet performance remains strong, which supports the claim that human grasp distributions capture generalizable strategies.

Scaling behavior (success rate vs. training dataset size): monotonic improvement from 100K to 1M image-grasp pairs

vs. Figure 7 shows scaling curve; performance continues to improve without saturation

The paper demonstrates a positive scaling trend: larger datasets yield better performance. This is critical evidence that the approach benefits from more human grasp data, and it suggests that further scaling could push performance higher. The curve does not plateau at 1M pairs, so the returns from additional data have not yet been exhausted.

Multi-modality ablation (RGB-D fusion vs. single modality): RGB+PC fusion achieves higher success than RGB-only or PC-only

vs. Figure 9 shows qualitative cases (pineapple, hairbrush, spoon) where single modality fails but fusion succeeds

Point painting and RGB-PC conditioning are necessary: RGB alone struggles on transparent/reflective objects (anchovies in water, glass), while point clouds alone lose texture-based information. The fusion approach balances both signals, critical for diverse real-world objects.

WHY DEVELOPERS SHOULD CARE

For software developers building robotic systems, this paper demonstrates a paradigm shift: multi-fingered dexterous grasping can be learned from human data rather than robot data or simulation, and the resulting model transfers to new embodiments without retraining. This is significant because it decouples data collection from robot hardware. Developers no longer need to commission expensive teleoperation campaigns or endure sim-to-real gaps; instead, they can leverage egocentric video, which is increasingly easy to collect at scale with consumer smart glasses, to bootstrap grasping capabilities.

The 1M-HUGs dataset and the aria2mano curation pipeline provide a concrete template for scaling this approach: capture diverse human grasps with calibrated depth and hand tracking, fit anatomical hand models, and train a simple flow-matching model. The architecture itself is surprisingly standard (DINOv2 + PointNeXt + DiT with cross-attention), suggesting that the bottleneck is data quality and diversity, not model design.

For roboticists, the key takeaway is that natural human grasp distributions matter. Rather than optimizing for force closure or sampling all physically valid grasps, learning what humans actually do, which is often simpler and more conservative, produces grasps that execute reliably on real hardware. The paper also introduces a new evaluation standard: HUG-Bench, with metric-scale reconstructions and paired simulation and real-world evaluation, is a more honest benchmark than simulation-only tests. The open release of code, data, trained models, and assets lowers the barrier for future work, making this a platform for advancing dexterous grasping research.

LIMITATIONS

The paper lists several practical constraints. HUG is trained only on right-handed grasps with a fixed canonical MANO hand, so it does not model left-handed or bimanual grasps or hand-specific morphology. Retargeting can fail when a robot hand cannot realize the predicted human pose, and real-world executions are open-loop, so shifted or articulated objects can break the plan. Labels can be noisy under hand occlusion, accuracy drops for very small objects because of the 224 × 224 inputs and for large or far objects that are rare in the training data, and the dataset remains indoor-only.

WHAT COMES NEXT

The natural next step is to turn HUG from a single open-loop grasp predictor into a closed-loop system: generate multiple candidate grasps, rank them, and replan during grasp execution and lift using visual feedback. The paper also points toward broader grasp data collection: left-handed and bimanual grasps, variable hand morphology, outdoor or less controlled scenes, and more data for large or far objects would make the human-to-robot transfer story more complete.

