LEARNING CURRENT 2025-09-30

OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction

Lujie Yang, Xiaoyu Huang, Zhen Wu, Angjoo Kanazawa, Pieter Abbeel, Carmelo Sferrazza, C. Karen Liu, Rocky Duan, Guanya Shi

ARCHITECTURE: RL policy with interaction-preserving retargeting

ROBOT: Unitree G1 humanoid

DATASET: 8+ hours of generated trajectories

KEY METRIC: 30-second long-horizon task execution

TASK: loco-manipulation, locomotion, parkour, scene interaction

OmniRetarget tackles one of the harder problems in humanoid robotics: teaching robots complex, contact-rich skills from human motion data. The headline result is a Unitree G1 humanoid that executes sequences of up to 30 seconds, carrying chairs, climbing platforms, and performing parkour rolls, with a policy trained using just 5 reward terms and no curriculum learning. The policy learns from retargeted human motion capture and uses no computer vision during deployment (proprioceptive-only policy). The key innovation is preserving interactions: instead of treating human-to-robot motion retargeting as a purely kinematic problem, OmniRetarget explicitly models and maintains the spatial and contact relationships between the robot, the objects it is manipulating, and the terrain. This turns a single human demonstration into many training trajectories: the system can automatically generate data with different object sizes, positions, terrains, and even different robot embodiments, all while keeping the interaction semantics intact.

THE PROBLEM

Before OmniRetarget, motion retargeting (converting human movements into robot commands) was plagued by the embodiment gap problem. Humans and humanoid robots have fundamentally different body proportions, joint ranges, and physical capabilities. When you naively retarget human motion to robots, you get physically implausible motion: feet sliding across the floor (foot-skating), hands penetrating objects, and similar artifacts. Existing methods like General Motion Retargeting (GMR) and Perpetual Humanoid Control (PHC) tried to fix kinematic infeasibility, but they largely ignored the semantic content: the actual interactions between the human, objects, and terrain. A human demonstration of 'carry a box up stairs' contains rich relational information about hand-object and foot-ground contact that previous methods simply discarded. This made the data wasteful: one human demonstration could only train one robot on one terrain with one object configuration. Developers had to manually create massive motion datasets or craft reward functions by hand, making it expensive and brittle to scale humanoid learning.

HOW IT WORKS

Interaction Mesh Construction

OmniRetarget represents the scene as an interaction mesh: a unified geometric representation that explicitly tracks spatial relationships between the agent's body, manipulated objects, and terrain. Instead of treating retargeting as independent joint angle conversion, the system models contact relationships: which parts of the hand touch the object, which foot contacts the ground, and what the relative geometry should be. The interaction mesh becomes a constraint that must be satisfied during retargeting, ensuring that if a human's hand grasped a box at a specific location and angle, the robot's retargeted motion preserves that same grasp relationship. This is fundamentally different from prior work that either ignored interactions or handled them as soft objectives that could be violated.
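To make the idea of a shared body/object/terrain structure concrete, here is a minimal sketch. It is not the authors' construction: the k-nearest-neighbour connectivity, the function names, and the toy keypoints are all assumptions chosen for illustration. It stacks keypoints from the three sources into one point set and then picks out the edges that cross between entities, which are the ones that encode interactions.

```python
import numpy as np

def build_interaction_graph(body_pts, object_pts, terrain_pts, k=3):
    """Stack all keypoints and link each one to its k nearest neighbours.

    Hypothetical helper: k-NN connectivity is an illustrative stand-in
    for whatever meshing scheme a real pipeline would use.
    """
    points = np.vstack([body_pts, object_pts, terrain_pts])
    labels = (["body"] * len(body_pts)
              + ["object"] * len(object_pts)
              + ["terrain"] * len(terrain_pts))
    # Pairwise distances; a point is never its own neighbour.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    nearest = np.argsort(dists, axis=1)[:, :k]
    edges = set()
    for i, row in enumerate(nearest):
        for j in row:
            edges.add((min(i, int(j)), max(i, int(j))))  # undirected edge
    return points, labels, sorted(edges)

def cross_entity_edges(labels, edges):
    """Keep only edges whose endpoints belong to different entities."""
    return [(i, j) for i, j in edges if labels[i] != labels[j]]

# Toy scene: head, hand, foot; two box corners; two ground samples.
body = np.array([[0.0, 0.0, 1.0], [0.3, 0.0, 1.0], [0.0, 0.0, 0.05]])
box = np.array([[0.35, 0.0, 1.0], [0.45, 0.0, 1.0]])
ground = np.array([[0.0, 0.0, 0.0], [0.4, 0.0, 0.0]])

points, labels, edges = build_interaction_graph(body, box, ground, k=2)
interaction_edges = cross_entity_edges(labels, edges)
print(interaction_edges)
```

In this toy scene the hand (index 1) ends up linked to the nearest box corner (index 3) and the foot (index 2) to the ground sample beneath it (index 5). Those cross-entity links are the relationships a retargeting step would try to keep intact.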

Laplacian Deformation with Kinematic Constraints

Given the interaction mesh, OmniRetarget solves a constrained optimization problem: deform the human skeleton into a robot skeleton while minimizing Laplacian deformation (preserving local geometric structure) and enforcing hard kinematic constraints. Laplacian deformation keeps the motion smooth and natural: local neighborhoods of the skeleton maintain their shape even though the global skeleton changes. Simultaneously, the system enforces that all joint limits are satisfied, that feet don't penetrate terrain, and that hands maintain their contact relationships with objects. This produces kinematically feasible trajectories that a real robot can actually execute, eliminating the foot-skating and penetration artifacts that plague naive retargeting. Under the hood this means solving a non-convex optimization per frame, but the authors made it efficient enough to process over 8 hours of motion data.
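The objective described above can be illustrated with a small sketch. It only evaluates a Laplacian deformation energy; it does not solve the constrained per-frame optimization, and the uniform neighbour weights, function names, and toy chain are assumptions made for illustration rather than details taken from the paper.

```python
import numpy as np

def laplacian_coords(points, neighbours):
    """Each vertex minus the mean of its neighbours (uniform weights assumed)."""
    return np.array([points[i] - points[nbrs].mean(axis=0)
                     for i, nbrs in enumerate(neighbours)])

def deformation_energy(source, target, neighbours):
    """Sum of squared differences between source and target Laplacian coordinates."""
    diff = laplacian_coords(target, neighbours) - laplacian_coords(source, neighbours)
    return float(np.sum(diff ** 2))

# A four-vertex chain standing in for a limb, with chain connectivity.
chain = np.array([[0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.5],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.5]])
neighbours = [[1], [0, 2], [1, 3], [2]]

# A rigid translation leaves local structure untouched.
shifted = chain + np.array([1.0, 0.0, 0.0])
e_shift = deformation_energy(chain, shifted, neighbours)

# Bending the last vertex sideways changes the local structure.
bent = chain.copy()
bent[3] = [0.5, 0.0, 1.0]
e_bent = deformation_energy(chain, bent, neighbours)

print(e_shift, e_bent)
```

The translated chain has essentially zero energy while the bent chain does not, which is why this kind of objective tolerates global changes (a differently proportioned robot) but penalizes distortions of local relationships such as a hand drifting away from the object it holds.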

Systematic Data Augmentation from Interaction Semantics

Because OmniRetarget preserves the underlying interaction structure (not just joint angles), it can automatically generate diverse training data from a single demonstration. You show the system one demonstration of a human carrying a box, and OmniRetarget extracts the semantic interaction (hand-object contact, foot-ground contact patterns). It then automatically generates new data by varying the object's initial position (rotated 45°, translated left/right), the object's size (small/large), the terrain height (0.8× to 1.2× scale), and even the robot embodiment (for example, Booster T1 or Unitree H1). Each augmented trajectory preserves the core interaction semantics while adapting to the new configuration. The practical payoff is that one human demonstration becomes dozens of robot trajectories automatically.
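As a rough picture of how one demonstration fans out into many scenes, the sketch below enumerates a grid of scene variants. The `SceneVariant` class, the helper name, and the specific grid values are hypothetical; only the 45° rotation and the 0.8× to 1.2× terrain range echo the text above, and each variant would still need to be passed through the retargeting optimization to yield a trajectory.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class SceneVariant:
    """One augmented scene configuration (hypothetical container)."""
    object_yaw_deg: float   # rotation of the object's initial pose
    object_scale: float     # uniform size scale of the object
    terrain_scale: float    # height scale of the terrain

def enumerate_variants(yaws, object_scales, terrain_scales):
    """Cartesian product of all augmentation axes."""
    return [SceneVariant(y, s, t)
            for y, s, t in product(yaws, object_scales, terrain_scales)]

variants = enumerate_variants(
    yaws=(-45.0, 0.0, 45.0),
    object_scales=(0.8, 1.0, 1.2),   # assumed values; the text only says small/large
    terrain_scales=(0.8, 1.0, 1.2),
)
print(len(variants))  # 27
```

Even this small three-axis grid turns one demonstration into 27 candidate scenes, which is the sense in which a single clip becomes "dozens" of trajectories.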

Proprioceptive RL Training with Minimal Rewards

The high-quality retargeted motion data serves as kinematic references for reinforcement learning. Instead of learning from scratch with hand-crafted reward engineering, the policy (trained using standard RL methods) simply tries to track these reference trajectories while respecting physics and using only proprioceptive observations (joint angles, velocities, IMU). The authors used only 5 reward terms and 4 domain randomization parameters, shared across all tasks (parkour, carrying, climbing). This is remarkably minimal compared to typical humanoid papers that require task-specific tuning and curriculum learning. The agent learns that following the retargeted references leads to successful task execution, and the quality of the retargeting data determines whether this is actually feasible.
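A reference-tracking reward can be sketched in a few lines. This is a generic form that is common in motion-tracking RL, not the paper's reward: the two terms shown, their weights, and the `sigma` scale are all assumptions, and the actual 5 terms are not listed in this summary.

```python
import math

def tracking_reward(actual, reference, sigma):
    """Exponential kernel on squared tracking error; equals 1.0 for a perfect match."""
    err = sum((a - r) ** 2 for a, r in zip(actual, reference))
    return math.exp(-err / sigma ** 2)

def total_reward(joint_pos, ref_joint_pos, action, prev_action,
                 w_track=1.0, w_rate=0.01, sigma=0.5):
    """Weighted tracking term minus an action-rate penalty (illustrative weights)."""
    track = tracking_reward(joint_pos, ref_joint_pos, sigma)
    rate = sum((a - p) ** 2 for a, p in zip(action, prev_action))
    return w_track * track - w_rate * rate

# Perfect tracking with a steady action versus a pose that is off by 0.3 rad.
r_good = total_reward([0.1, 0.2], [0.1, 0.2], [0.0], [0.0])
r_bad = total_reward([0.4, 0.2], [0.1, 0.2], [0.0], [0.0])
print(r_good, r_bad)
```

The point of the sketch is that when the reference is physically feasible, a reward this simple already tells the policy what to do, which is consistent with the claim that good retargeting reduces the need for task-specific reward design.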

MORE DEMONSTRATIONS

[Video clips: climbing (1, 2, 3, 5), step, crawling (1-4), box carrying (1-8), box augmentation (original and two variants), box size augmentation (original and two variants), terrain augmentation (original and two variants)]

KEY RESULTS

Long-horizon task execution: 30 seconds

vs. typical humanoid skills at 5-10 seconds

The robot successfully executes multi-phase tasks (carry chair → climb platform → parkour roll) lasting up to 30 seconds continuously. This demonstrates coherent long-horizon behavior in which the policy maintains balance, object contact, and dynamic movement across multiple phases without falling or losing track of the reference motion.

Training simplicity: 5 reward terms, 4 domain randomization parameters, no curriculum

vs. typical humanoid papers requiring 15-20+ rewards and multi-stage curricula

The entire pipeline uses minimal hyperparameter tuning: one shared reward structure and simple domain randomization work across all tasks (parkour, carrying, climbing, crawling). This suggests the retargeted motion data is of high enough quality that RL doesn't need extensive task-specific engineering. This is practically important: it means you can scale to new tasks without rebuilding reward functions from scratch.

Data generation scale: 8+ hours of motion trajectories

Source: multiple human mocap datasets (OMOMO, LAFAN1, in-house)

OmniRetarget processed and retargeted over 8 hours of human motion capture data across three different datasets, producing physically feasible trajectories. This demonstrates generalization across different human movement styles and datasets, not just curated in-house motion capture.

Contact preservation vs. baselines: better kinematic constraint satisfaction and no visible foot-skating

vs. GMR and PHC baselines showing visible foot-skating and penetration

Visual comparisons on the project page show that GMR produces obvious foot-sliding artifacts and object penetration, while OmniRetarget trajectories obey non-penetration constraints and maintain contact integrity. This is the core technical contribution: making retargeting interaction-aware removes the physical artifacts that break RL training.
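The two artifacts named above can be quantified with simple per-trajectory metrics. The sketch below shows one plausible way to do it; the metric definitions, function names, flat-ground assumption, and toy trajectory are illustrative and are not taken from the paper's evaluation.

```python
import numpy as np

def foot_skate_distance(foot_xyz, in_contact):
    """Horizontal distance a foot travels between consecutive in-contact frames."""
    foot_xyz = np.asarray(foot_xyz, dtype=float)
    in_contact = np.asarray(in_contact, dtype=bool)
    step = np.linalg.norm(np.diff(foot_xyz[:, :2], axis=0), axis=1)
    both = in_contact[1:] & in_contact[:-1]  # contact at both ends of the step
    return float(step[both].sum())

def max_ground_penetration(foot_xyz, ground_height=0.0):
    """Deepest point below a flat ground plane (flat ground is an assumption)."""
    z = np.asarray(foot_xyz, dtype=float)[:, 2]
    return float(np.maximum(ground_height - z, 0.0).max())

# Toy foot trajectory: slides 2 cm while planted, dips 1 cm below ground, then lifts off.
foot = [[0.00, 0.0, 0.00],
        [0.02, 0.0, 0.00],
        [0.02, 0.0, -0.01],
        [0.30, 0.0, 0.10]]
contact = [True, True, True, False]

skate = foot_skate_distance(foot, contact)
penetration = max_ground_penetration(foot)
print(skate, penetration)
```

The large swing-phase motion in the last frame is ignored because the foot is no longer in contact, so only the 2 cm slide counts as skating. A retargeting method that respects contacts should drive both numbers toward zero.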

WHY DEVELOPERS SHOULD CARE

If you're building humanoid robotics applications, OmniRetarget matters in two ways. First, it addresses the data bottleneck. Creating robot training data has been expensive: you either hire motion capture studios, craft trajectories by hand, or run massive simulations. OmniRetarget lets you harvest human motion from public datasets (such as LAFAN1 and OMOMO) and automatically convert it into robot training data for multiple embodiments and configurations. One human demonstration becomes dozens of training scenarios. Second, it shows that motion retargeting, done correctly, is a foundational building block for humanoid learning. Prior work treated retargeting as a preprocessing step that was 'good enough', but this paper demonstrates that preserving interaction semantics during retargeting is critical. The RL policy can then focus on the control problem (tracking references under physics) rather than learning from scratch. The broader lesson is that human demonstrations contain rich structure about what skillful movement should look like, and respecting that structure (especially interaction structure) makes learning much more efficient. For developers, this means: (1) leverage human mocap data systematically instead of collecting robot-only data, (2) treat interaction constraints as first-class citizens in motion processing, (3) don't over-engineer reward functions if your reference data is high-quality.

LIMITATIONS

OmniRetarget relies on accurate motion capture input and requires that interactions can be modeled geometrically (hand-object contact, foot-ground contact). It doesn't handle situations where the scene geometry is unknown or where complex interaction logic is needed (e.g., 'grasp the handle, not the blade'). The method also assumes that the human motion is retargetable to the target robot at all; some human movements (like those requiring extreme flexibility) simply aren't feasible for robots, and the paper doesn't discuss how gracefully it handles such cases. Additionally, all hardware experiments use the Unitree G1 in relatively controlled environments; generalization to other morphologies or unstructured real-world scenes is untested. The proprioceptive-only policy also means the robot has no visual feedback, limiting robustness to unexpected scene variations or dynamic obstacles that the training data didn't cover.

WHAT COMES NEXT

The next frontier is likely bridging the sim-to-real gap more reliably and adding perception. Currently, OmniRetarget generates data in simulation, and there's always slippage when deploying to real robots. Combining interaction-preserving retargeting with vision-based policies (so the robot can adapt when objects or terrain don't match the reference exactly) would move this approach closer to production use. Another direction is learning from in-the-wild human video (YouTube, TikTok) without mocap: estimating 3D human pose from video, preserving interactions, and retargeting for robots. Finally, extending interaction meshes to more complex scenarios (multi-object manipulation, human-robot collaboration, contact-rich tasks like piano playing) could unlock even richer learning from human demonstrations.
