Structured World Models from Human Videos

Russell Mendonca*1    Shikhar Bahl*1,2    Deepak Pathak1
   Carnegie Mellon University   RSS 2023

We present SWIM, an approach for learning manipulation tasks in the real world from only a handful of trajectories and only 30 minutes of real-world sampling.


In this paper, we tackle the problem of learning complex, general behaviors directly in the real world. We propose an approach for robots to efficiently learn manipulation skills using only a handful of real-world interaction trajectories from many different settings. Inspired by the success of learning from large-scale datasets in computer vision and natural language, we believe that to learn efficiently, a robot must be able to leverage internet-scale human video data. Humans interact with the world in many interesting ways, which can allow a robot to build an understanding not only of useful actions and affordances but also of how these actions affect the world for manipulation. Our approach builds a structured, human-centric action space grounded in visual affordances learned from human videos. Further, we train a world model on human videos and fine-tune it on a small amount of robot interaction data without any task supervision. We show that this approach of affordance-space world models enables different robots to learn various manipulation skills in complex settings, in under 30 minutes of interaction.

Our approach involves three steps:
#1: Pre-training a world model on human videos,
#2: Finetuning the world model on unsupervised robot data, and
#3: Using the finetuned model to plan to achieve goals.

Step #1 - Pre-training World Model on Human Videos

We use a shared human-robot high-level action space by leveraging affordances. These specify interaction points and post-contact trajectories, following our prior work. Our action space is flexible enough to also support actions outside this shared space.
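To make the idea of an affordance-based action concrete, here is a minimal sketch of how one such action could be represented. The class name `AffordanceAction`, its fields, and the choice of a 3D displacement for the post-contact motion are illustrative assumptions, not the representation used in the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AffordanceAction:
    """Hypothetical container for one high-level action in the shared space."""

    # Assumed: (u, v) image pixel at which the interaction starts.
    contact_px: Tuple[int, int]
    # Assumed: net 3D displacement of the hand or gripper after contact.
    post_contact: Tuple[float, float, float]

    def waypoints(self, steps: int = 3) -> List[Tuple[float, float, float]]:
        """Linearly interpolate the post-contact motion into `steps` waypoints."""
        dx, dy, dz = self.post_contact
        return [
            (dx * k / steps, dy * k / steps, dz * k / steps)
            for k in range(1, steps + 1)
        ]
```

Because the same fields can be filled in from a human hand in a video or from a robot gripper, a structure of this kind is one way a single world model could consume both data sources.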

Step #2 - Finetuning on unsupervised robot data

The robot samples from the affordance space to collect data for finetuning the world model. This data collection is unsupervised, since there is no task reward.
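The data collection described above can be sketched as a simple loop. Everything here is a hypothetical stand-in: `candidates` plays the role of an affordance model's output as weighted contact points, and `reset` and `env_step` stand in for the robot interface. The point it illustrates is that stored transitions carry no reward.

```python
import random


def sample_action(candidates, rng):
    """Draw one (contact point, post-contact direction) pair.

    `candidates` is a hypothetical list of (contact_px, weight) pairs.
    """
    points = [point for point, _ in candidates]
    weights = [weight for _, weight in candidates]
    contact = rng.choices(points, weights=weights, k=1)[0]
    # Assumed: the post-contact direction is sampled uniformly per axis.
    direction = tuple(rng.uniform(-1.0, 1.0) for _ in range(3))
    return contact, direction


def collect(env_step, reset, candidates, episodes, seed=0):
    """Gather reward-free transitions by sampling in the affordance space."""
    rng = random.Random(seed)
    buffer = []
    for _ in range(episodes):
        obs = reset()
        action = sample_action(candidates, rng)
        next_obs = env_step(obs, action)
        # Only observations and the action are stored; there is no task reward.
        buffer.append({"obs": obs, "action": action, "next_obs": next_obs})
    return buffer
```

The resulting buffer is the kind of data a world model could be finetuned on, since predicting `next_obs` from `obs` and `action` needs no task labels.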

Step #3 - Multi-Task Deployment

With the finetuned world model, we can solve tasks using planning. In our experiments we specify tasks using goal images.
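One simple way to plan toward a goal image with a learned model is random shooting: sample candidate actions, predict each outcome, and keep the action whose prediction lies closest to the goal. The sketch below assumes this planner for illustration and is not necessarily the one used in the paper. `predict` is a hypothetical one-step world model, and observations are flat lists of numbers standing in for image features.

```python
import random


def plan(predict, obs, goal, sample_action, num_candidates=64, seed=0):
    """Return the sampled action whose predicted outcome is closest to `goal`.

    predict(obs, action) -> predicted next observation (hypothetical model).
    sample_action(rng)   -> one candidate action (hypothetical sampler).
    """
    rng = random.Random(seed)
    best_action, best_cost = None, float("inf")
    for _ in range(num_candidates):
        action = sample_action(rng)
        predicted = predict(obs, action)
        # Squared distance between the predicted and goal features.
        cost = sum((p - g) ** 2 for p, g in zip(predicted, goal))
        if cost < best_cost:
            best_action, best_cost = action, cost
    return best_action, best_cost
```

With a different goal image, the same model and planner yield a different action, which is how a single finetuned model can serve several tasks.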

Effect of Pre-training


Pre-training on human videos significantly improves performance, especially when using a world model jointly across multiple tasks, where average task success increases from 20 to 80 percent.


              title={Structured World Models from Human Videos},
              author={Mendonca, Russell  and Bahl, Shikhar and Pathak, Deepak},


We thank Shagun Uppal and Murtaza Dalal for feedback on early drafts of this manuscript. This work is supported by the Sony Faculty Research Award and ONR N00014-22-1-2096.