As a bridge between vision and language, visual relations between objects, such as “person-touch-dog” and “cat-above-sofa”, provide a more comprehensive understanding of visual content than objects alone. Video Visual Relation Detection (VidVRD) aims to detect instances of visual relations of interest in a video, where a visual relation instance is represented by a relation triplet <subject, predicate, object> together with the trajectories of the subject and object (as shown in Figure 1). Compared to still images, videos provide a more natural set of features for detecting visual relations, such as dynamic relations like “A-follow-B” and “A-towards-B”, and temporally changing relations like “A-chase-B” followed by “A-hold-B”. Yet VidVRD is technically more challenging than image visual relation detection (ImgVRD), because accurate object tracking is difficult and relation appearances are more diverse in the video domain.

Figure 1: examples of video visual relation instances

We release the first dataset, named ImageNet-VidVRD, to facilitate innovative research on the problem. The dataset contains 1,000 videos selected from the ILSVRC2016-VID dataset based on whether the video contains clear visual relations. It is split into a training set of 800 videos and a test set of 200 videos, and covers 35 subject/object categories and 132 predicate categories. Ten people contributed to labeling the dataset, which includes object trajectory labeling and relation labeling. Since the ILSVRC2016-VID dataset already has object trajectory annotations for 30 categories, we supplemented them by labeling the remaining 5 categories. To save labor in relation labeling, we labeled typical segments of the videos in the training set and the whole of each video in the test set. Several statistics of the dataset are shown below.


We provide the visual relation annotations for the 1,000 videos. Each video has a single annotation file in JSON format, named after the video's ID in the train/val set of the ImageNet Object Detection from Video Challenge. The detailed JSON file format is as follows:

    {
        "video_id": "ILSVRC2015_train_00010001",        # Video ID from the original ImageNet ILSVRC2016 video dataset
        "frame_count": 219,
        "fps": 30,
        "width": 1920,
        "height": 1080,
        "subject/objects": [                            # List of subject/objects
            {
                "tid": 0,                               # Trajectory ID of a subject/object
                "category": "bicycle"
            },
            ...
        ],
        "trajectories": [                               # List of frames
            [                                           # List of bounding boxes in each frame
                {
                    "tid": 0,
                    "bbox": {
                        "xmin": 672,                    # left
                        "ymin": 560,                    # top
                        "xmax": 781,                    # right
                        "ymax": 693                     # bottom
                    },
                    "generated": 0                      # 0 - the bounding box is manually labeled
                                                        # 1 - the bounding box is automatically generated
                },
                ...
            ],
            ...
        ],
        "relation_instances": [                         # List of annotated visual relation instances
            {
                "subject_tid": 0,                       # Corresponding trajectory ID of the subject
                "object_tid": 1,                        # Corresponding trajectory ID of the object
                "predicate": "move_right",
                "begin_fid": 0,                         # Frame index where this relation begins (inclusive)
                "end_fid": 210                          # Frame index where this relation ends (exclusive)
            },
            ...
        ]
    }
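
To make the format concrete, here is a minimal Python sketch of how one annotation file might be read. It is not part of the dataset's official tooling: the helper names (`load_annotation`, `relation_triplets`, `trajectory_boxes`) are hypothetical, and the sketch assumes that the position of each entry in `"trajectories"` is its frame index, which is how `begin_fid` and `end_fid` are interpreted here.

```python
import json


def load_annotation(path):
    """Read one annotation file (one JSON file per video) into a dict."""
    with open(path, "r") as f:
        return json.load(f)


def relation_triplets(anno):
    """Return (subject, predicate, object, begin_fid, end_fid) for each instance."""
    # Map each trajectory ID to its category name.
    tid_to_category = {so["tid"]: so["category"] for so in anno["subject/objects"]}
    return [
        (
            tid_to_category[r["subject_tid"]],
            r["predicate"],
            tid_to_category[r["object_tid"]],
            r["begin_fid"],
            r["end_fid"],
        )
        for r in anno["relation_instances"]
    ]


def trajectory_boxes(anno, tid, begin_fid, end_fid):
    """Collect the boxes of trajectory `tid` over frames [begin_fid, end_fid).

    Assumption: anno["trajectories"][fid] holds the boxes of frame `fid`.
    Frames in which the trajectory has no box are skipped.
    """
    frames = anno["trajectories"]
    boxes = {}
    for fid in range(begin_fid, min(end_fid, len(frames))):
        for item in frames[fid]:
            if item["tid"] == tid:
                b = item["bbox"]
                boxes[fid] = (b["xmin"], b["ymin"], b["xmax"], b["ymax"])
    return boxes
```

For a relation instance `r`, calling `trajectory_boxes(anno, r["subject_tid"], r["begin_fid"], r["end_fid"])` would then give the subject's boxes over the span of the relation, with `end_fid` treated as exclusive, as documented above.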

Useful download links can be found as follows. Note that you need to manually merge the two parts of the videos into a single folder after unarchiving them.
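
The manual merge can also be scripted. The following is a rough sketch only: the function name `merge_video_parts` and any directory names passed to it are hypothetical, since the actual folder names depend on how the archives are extracted. It moves every file from the given part folders into one target folder and refuses to overwrite a file that already exists there.

```python
import shutil
from pathlib import Path


def merge_video_parts(part_dirs, merged_dir):
    """Move all files from each folder in `part_dirs` into `merged_dir`.

    Returns the number of files moved. Raises FileExistsError instead of
    silently overwriting a file with the same name.
    """
    merged = Path(merged_dir)
    merged.mkdir(parents=True, exist_ok=True)
    moved = 0
    for part in part_dirs:
        for src in sorted(Path(part).iterdir()):
            if not src.is_file():
                continue  # skip nested folders; only top-level files are merged
            dst = merged / src.name
            if dst.exists():
                raise FileExistsError(f"{dst} already exists")
            shutil.move(str(src), str(dst))
            moved += 1
    return moved
```

A call such as `merge_video_parts(["part1", "part2"], "videos")` would apply it, where the three paths are placeholders for the locations on your machine.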

If this dataset helps your research, please kindly cite this paper:

    author={Shang, Xindi and Ren, Tongwei and Guo, Jingfan and Zhang, Hanwang and Chua, Tat-Seng},
    title={Video Visual Relation Detection},
    booktitle={ACM International Conference on Multimedia},
    address={Mountain View, CA USA},


Statistics of our VidVRD dataset are listed below. The number of visual relation instances is not available for the training set because it is only sparsely labeled, as mentioned above.

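|                          | training set | test set |
|--------------------------|--------------|----------|
| video                    | 800          | 200      |
| subject/object category  | 35           | 35       |
| predicate category       | 132          | 132      |
| relation triplet         | 2,961        | 1,011    |
| visual relation instance | N/A          | 4,835    |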
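
Counts of this kind can be recomputed from the annotation files. The sketch below is illustrative rather than the authors' own script: `dataset_statistics` is a hypothetical helper, it assumes all annotation files of one split sit in a single folder as `*.json`, and it counts a relation triplet as a distinct (subject category, predicate, object category) combination. If the published figures were counted differently, the results may not match exactly.

```python
import json
from pathlib import Path


def dataset_statistics(anno_dir):
    """Summarize all annotation files (*.json) found directly in `anno_dir`."""
    categories, predicates, triplets = set(), set(), set()
    n_videos = 0
    n_instances = 0
    for path in sorted(Path(anno_dir).glob("*.json")):
        with open(path, "r") as f:
            anno = json.load(f)
        n_videos += 1
        # Map trajectory IDs to category names for this video.
        tid_to_category = {so["tid"]: so["category"] for so in anno["subject/objects"]}
        categories.update(tid_to_category.values())
        for r in anno["relation_instances"]:
            subject = tid_to_category[r["subject_tid"]]
            obj = tid_to_category[r["object_tid"]]
            predicates.add(r["predicate"])
            triplets.add((subject, r["predicate"], obj))
            n_instances += 1
    return {
        "video": n_videos,
        "subject/object category": len(categories),
        "predicate category": len(predicates),
        "relation triplet": len(triplets),
        "visual relation instance": n_instances,
    }
```

Because the training set is only sparsely labeled, the instance count this returns for the training annotations reflects the labeled segments only.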