VideoMimic
Visual imitation enables contextual humanoid control
Anonymous Authors, Anonymous Affiliation

VideoMimic is a real-to-sim-to-real pipeline that converts monocular videos into transferable humanoid skills, letting robots learn context-aware behaviors (terrain traversal, climbing, sitting) with a single policy.

Additional Video.

[Video panels: (a) input video; (b) reconstructed environment and human; (c) tracking the motion in sim]

We conducted an additional experiment tracking an internet video of a person crawling down stairs, demonstrating that our pipeline can learn diverse motions from scalable web data. (Source of the input video: YouTube)

Abstract.
How can we teach humanoids to climb staircases and sit on chairs using the surrounding environment context? Arguably, the simplest way is to just show them—casually capture a human motion video and feed it to humanoids. We introduce VideoMimic, a real-to-sim-to-real pipeline that mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies for humanoid robots that perform the corresponding skills. We demonstrate the results of our pipeline on real humanoid robots, showing robust, repeatable contextual control such as staircase ascents and descents, sitting and standing from chairs and benches, as well as other dynamic whole-body skills—all from a single policy, conditioned on the environment and global root commands. VideoMimic offers a scalable path towards teaching humanoids to operate in diverse real-world environments.
Real-world Demo.
Approach.

[Pipeline figure: Input Video; Human + Scene Reconstruction; G1 Retargeted Results; Egoview (RGB/Depth); Training in Simulation]

From a monocular video, we jointly reconstruct metric-scale 4D human trajectories and dense scene geometry. The human motion is retargeted to a humanoid, and, with the scene converted to a mesh in the simulator, the motion serves as a reference for training a context-aware whole-body control policy. While our policy does not yet condition on RGB, we demonstrate the potential of our reconstruction for ego-view rendering.
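To make the retargeting step concrete, here is a minimal sketch of one common heuristic: keeping the reconstructed root trajectory metric in the horizontal plane (so it stays aligned with the scene) while scaling pelvis height to the robot's proportions. The function name, the default heights, and the heuristic itself are illustrative assumptions, not the paper's actual retargeting method.

```python
import numpy as np

def retarget_root_height(human_root_xyz: np.ndarray,
                         human_height: float = 1.75,
                         robot_height: float = 1.32) -> np.ndarray:
    """Scale a metric-scale human root trajectory to a shorter humanoid.

    Horizontal (x, y) translation stays metric so the motion remains aligned
    with the reconstructed scene; only the vertical component is scaled by
    the height ratio so the robot tracks a feasible pelvis height. This is
    an illustrative heuristic, not VideoMimic's published retargeting.
    """
    out = human_root_xyz.copy()               # (T, 3) root positions over time
    out[:, 2] *= robot_height / human_height  # scale pelvis height only
    return out

# Example: a human pelvis held at ~0.95 m maps to ~0.72 m on the robot.
traj = np.zeros((100, 3))
traj[:, 2] = 0.95
robot_traj = retarget_root_height(traj)
```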

1. Real to Sim.

Figure 1: The Real-to-Sim pipeline reconstructs human motion and scene geometry from video, outputting simulator-ready data.
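The caption's "simulator-ready data" implies converting the dense reconstruction into a collision mesh. Below is a minimal sketch of one way to do this, using Open3D's Poisson surface reconstruction followed by decimation; Open3D and the specific parameters are our assumptions, since the page does not name the meshing method used.

```python
import numpy as np
import open3d as o3d  # assumed dependency; not named by the page

def pointcloud_to_sim_mesh(points: np.ndarray, path: str = "scene.obj"):
    """Mesh an (N, 3) metric-scale point cloud and save it for a simulator."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    # Poisson reconstruction needs per-point normals.
    pcd.estimate_normals(
        search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.1, max_nn=30))
    mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        pcd, depth=8)
    # Decimate so physics engines can load the terrain efficiently.
    mesh = mesh.simplify_quadric_decimation(target_number_of_triangles=50_000)
    o3d.io.write_triangle_mesh(path, mesh)
    return mesh
```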


Figure 2: Versatile capabilities include handling internet videos, multi-human reconstruction, and ego-view rendering.

2. Training in Sim.

Figure 3: Policy training pipeline in simulation, progressing from MoCap pre-training to environment-aware tracking and distillation.
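For the tracking stage named in the caption, a standard formulation is a DeepMimic-style reward with exponential kernels over pose and root errors; the term weights and kernel scales below are illustrative assumptions, not VideoMimic's published values.

```python
import numpy as np

def tracking_reward(robot_q: np.ndarray, ref_q: np.ndarray,
                    robot_root: np.ndarray, ref_root: np.ndarray,
                    w_pose: float = 0.6, w_root: float = 0.4,
                    k_pose: float = 2.0, k_root: float = 10.0) -> float:
    """Reward approaching 1.0 as the robot matches the reference frame.

    robot_q / ref_q: joint angles (D,); robot_root / ref_root: root xyz (3,).
    Exponential kernels keep the reward bounded and smooth, so large errors
    are penalized without destabilizing policy-gradient training. Weights
    and kernel scales here are illustrative, not the paper's values.
    """
    pose_err = float(np.sum((robot_q - ref_q) ** 2))
    root_err = float(np.sum((robot_root - ref_root) ** 2))
    return w_pose * np.exp(-k_pose * pose_err) + w_root * np.exp(-k_root * root_err)
```

In the distillation stage, a student policy limited to onboard observations would typically be trained to match such a tracking teacher's actions (e.g., DAgger-style supervision); this, too, is a standard pattern rather than a detail given on this page.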