We collected IVOS for the purpose of benchmarking (I)nteractive (V)ideo (O)bject (S)egmentation in the human-robot interaction (HRI) setting. The dataset was collected in two different settings:

Human teaching setting

[Figure: sample teaching frames showing translation, scale, and rotation of the taught objects]

The final teaching videos contain ∼50,000 frames for 12 object categories, with a total of 36 instances across these categories. Detection crops are provided for all frames.
Segmentation annotations are currently provided for 20 instances, totaling ∼18,000 segmentation masks.
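Since segmentation masks cover only 20 of the 36 instances, code consuming the dataset has to handle frames without annotations. The sketch below shows one way to pair frames with masks by matching instance and frame identifiers; the directory layout and file names are hypothetical, not the dataset's actual structure:

```python
from pathlib import PurePosixPath

def pair_frames_with_masks(frames, masks):
    """Pair each frame with its segmentation mask, or None when the
    instance has no annotation. Matching is done on the (instance,
    frame-stem) key, which assumes frames and masks share file stems."""
    mask_index = {
        (PurePosixPath(m).parts[0], PurePosixPath(m).stem): m for m in masks
    }
    pairs = []
    for f in frames:
        p = PurePosixPath(f)
        pairs.append((f, mask_index.get((p.parts[0], p.stem))))
    return pairs

# Hypothetical file lists: bottle_01 is annotated, mug_03 is not.
frames = [
    "bottle_01/frames/000001.jpg",
    "bottle_01/frames/000002.jpg",
    "mug_03/frames/000001.jpg",
]
masks = ["bottle_01/masks/000001.png"]
pairs = pair_frames_with_masks(frames, masks)
```

Here `pairs` holds one tuple per frame, with `None` in place of a mask for the unannotated instance, so downstream training code can filter or weight accordingly.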

Manipulation tasks setting

[Figure: robot during a teaching/manipulation session]

The dataset covers four main manipulation tasks, cutting, pouring, stirring, and drinking, performed in both robot and human manipulation sessions.
It contains ∼8,984 frames from robotic manipulation sessions, covering a total of 56 tasks with different objects and configurations.
Segmentation annotations are provided for the main objects of interest. The robot trajectories are also recorded, to enable further research on learning these trajectories from visual cues.
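To learn trajectories from visual cues, the recorded robot trajectories must first be aligned with the video frames. Since the source does not specify the recording rates or format, the sketch below assumes hypothetical timestamps (a ∼30 Hz camera and 100 Hz joint logging) and aligns each frame to its nearest-in-time trajectory sample:

```python
import bisect

def nearest_trajectory_index(traj_times, t):
    """Index of the trajectory sample closest in time to frame time t.
    traj_times must be sorted ascending."""
    i = bisect.bisect_left(traj_times, t)
    if i == 0:
        return 0
    if i == len(traj_times):
        return len(traj_times) - 1
    # Pick whichever neighbor is closer to t.
    return i if traj_times[i] - t < t - traj_times[i - 1] else i - 1

# Hypothetical timestamps in seconds (not from the dataset).
frame_times = [0.000, 0.033, 0.066]          # ~30 Hz camera frames
traj_times = [i * 0.01 for i in range(10)]   # 100 Hz joint-state log
aligned = [nearest_trajectory_index(traj_times, t) for t in frame_times]
```

Each entry of `aligned` indexes the trajectory sample to supervise the corresponding frame with; binary search keeps the alignment O(log n) per frame.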
The human manipulation setting covers 11 tasks similar to the robot tasks.