DexMV: Imitation Learning for Dexterous Manipulation from Human Videos


ECCV 2022


Yuzhe Qin*,  Yueh-Hua Wu*,  Shaowei Liu,  Hanwen Jiang, 
Ruihan Yang,  Yang Fu,  Xiaolong Wang

UC San Diego



While significant progress has been made on understanding hand-object interactions in computer vision, it is still very challenging for robots to perform complex dexterous manipulation. In this paper, we propose a new platform and pipeline, DexMV (Dexterous Manipulation from Videos), for imitation learning. We design a platform with: (i) a simulation system for complex dexterous manipulation tasks with a multi-finger robot hand and (ii) a computer vision system to record large-scale demonstrations of a human hand conducting the same tasks. In our pipeline, we extract 3D hand and object poses from videos, and propose a novel demonstration translation method to convert human motion into robot demonstrations. We then apply and compare multiple imitation learning algorithms with the demonstrations. We show that the demonstrations improve robot learning by a large margin and solve complex tasks that reinforcement learning alone cannot solve.

DexMV Platform and Pipeline




Demonstration Translation

Raw Video
Pose Estimation
Robot Motion (rendered)



Main Results

We use our DexMV pipeline with DAPG as the imitation learning algorithm. The RL baseline (TRPO) is trained without demonstrations.
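For readers unfamiliar with DAPG, the way it mixes demonstrations into policy-gradient training can be sketched as below. This is a minimal illustration based on the published DAPG formulation, in which each demonstration sample is weighted by a decaying factor times the largest on-policy advantage; it is not the DexMV training code. The function name `dapg_demo_weight` and the default values of `lam0` and `lam1` are assumptions chosen for illustration.

```python
from typing import Sequence


def dapg_demo_weight(
    iteration: int,
    on_policy_advantages: Sequence[float],
    lam0: float = 0.1,   # illustrative initial demonstration weight (assumption)
    lam1: float = 0.95,  # illustrative per-iteration decay factor (assumption)
) -> float:
    """Weight applied to demonstration samples in the augmented gradient.

    Sketch of w = lam0 * lam1**k * max(A), where k is the training
    iteration and A are the advantages of the on-policy samples.
    The weight decays, so demonstrations matter most early in training.
    """
    if not on_policy_advantages:
        raise ValueError("need at least one on-policy advantage estimate")
    return lam0 * (lam1 ** iteration) * max(on_policy_advantages)
```

With `lam1` below 1, the demonstration term fades over iterations and training gradually approaches plain on-policy policy gradient.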


Pour

Objective: reach the mug and pour the particles inside it into a container. The robot needs to control the orientation of the mug to pour the particles. This task is evaluated by the percentage of particles poured into the container.
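The Pour metric can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes the container interior is approximated by an axis-aligned box, and the names `pour_success_ratio`, `box_min`, and `box_max` are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def pour_success_ratio(
    particles: Sequence[Point], box_min: Point, box_max: Point
) -> float:
    """Fraction of particles whose position lies inside the container.

    Assumption: the container interior is an axis-aligned box described
    by its minimum and maximum corners (a hypothetical simplification).
    """
    if not particles:
        return 0.0
    inside = sum(
        all(lo <= c <= hi for c, lo, hi in zip(p, box_min, box_max))
        for p in particles
    )
    return inside / len(particles)
```

Multiplying the returned ratio by 100 gives the percentage described above.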

Ours
RL


Place Inside

Objective: pick up the banana and then place it inside the mug. The robot needs to rotate the banana to a suitable orientation before placing it inside the mug. This task is evaluated by the percentage of the banana's volume that is inside the mug.
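One way to estimate such a volume percentage is point sampling, sketched below. This is an assumption about how the metric could be computed, not the authors' implementation: the mug interior is approximated by an upright cylinder, the points are assumed to be sampled uniformly from the banana's volume, and all names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def fraction_inside_cylinder(
    points: Sequence[Point],
    center_xy: Tuple[float, float],
    radius: float,
    z_bottom: float,
    z_top: float,
) -> float:
    """Estimate the fraction of an object's volume inside an upright cylinder.

    If `points` are uniform samples from the object's volume, the share of
    points inside the cylinder approximates the volume fraction.
    """
    if not points:
        return 0.0
    cx, cy = center_xy
    inside = 0
    for x, y, z in points:
        within_radius = math.hypot(x - cx, y - cy) <= radius
        within_height = z_bottom <= z <= z_top
        if within_radius and within_height:
            inside += 1
    return inside / len(points)
```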

Ours
RL


Relocate

Objective: move the object to the target position regardless of orientation. The transparent green shape represents the goal location, which is randomized for each episode. This task is evaluated by the distance between the object and the target position.
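The Relocate setup can be sketched as a per-episode goal sampler plus a distance metric. This is a minimal sketch under stated assumptions: the goal is drawn uniformly from a box whose bounds (`low`, `high`) are hypothetical, and the function names are illustrative rather than taken from the DexMV code.

```python
import math
import random
from typing import Sequence, Tuple


def sample_goal(
    rng: random.Random, low: Sequence[float], high: Sequence[float]
) -> Tuple[float, ...]:
    """Draw a goal position uniformly inside a box (bounds are assumptions)."""
    return tuple(rng.uniform(lo, hi) for lo, hi in zip(low, high))


def relocate_error(
    object_pos: Sequence[float], target_pos: Sequence[float]
) -> float:
    """Euclidean distance between object and target; orientation is ignored."""
    return math.dist(object_pos, target_pos)
```

A lower `relocate_error` at the end of an episode means the object ended closer to the goal.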

Ours
RL




Demonstration Transfer



(i) Different Size

Left: we use demonstrations of relocating a normal-sized tomato soup can and train on a larger tomato soup can.
Right: we use demonstrations of relocating a normal-sized tomato soup can and train on a smaller tomato soup can.
Larger
Smaller



(ii) Different Object

Left: we use demonstrations of relocating a tomato soup can and train on a potted meat can.
Right: we use demonstrations of relocating a sugar box and train on a foam brick.


Video


Paper



BibTeX


@misc{qin2021dexmv,
  title={DexMV: Imitation Learning for Dexterous Manipulation from Human Videos},
  author={Qin, Yuzhe and Wu, Yueh-Hua and Liu, Shaowei and Jiang, Hanwen and Yang, Ruihan and Fu, Yang and Wang, Xiaolong},
  year={2021},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}