VidOR Dataset

Introduction

The VidOR (Video Object Relation) dataset contains 10,000 videos (98.6 hours) from the YFCC100M collection, together with a large amount of fine-grained annotations for relation understanding. In particular, 80 categories of objects are annotated with bounding-box trajectories that indicate their spatio-temporal locations in the videos, and 50 categories of relation predicates are annotated among all pairs of annotated objects, each with a starting and ending frame index. This results in around 50,000 object instances and 380,000 relation instances. For model development, the dataset is split into 7,000 videos for training, 835 videos for validation, and 2,165 videos for testing. VidOR can provide a foundation for many kinds of research and has been used in:

ACM Multimedia 2019 Grand Challenge
ACM Multimedia 2020 Grand Challenge
ACM Multimedia 2021 Grand Challenge
VidSTG: a dataset for video grounding
NExT-QA: a video QA dataset for explaining temporal actions
VidOR-MPVC: a video captioning dataset for multi-perspective visual captioning

Downloads

Please download the training and validation videos using the following links, and extract video frames using FFmpeg-3.3.4. The total sizes of the training and validation videos are around 24.5 GB and 2.9 GB, respectively. It is recommended to extract all downloaded parts into the same directory.
Alternatively, because the videos are drawn exclusively from the YFCC100M collection without any processing, you can also obtain them from the AWS S3 data storage hosted by Multimedia Commons. In that case, you need to organize the files in a directory structure consistent with the "video_path" field in the annotations.

training videos (part 1) [Lark] [Baidu] [HuggingFace] [MD5: eec7a718c05a16e388d6ee23102370d3]
training videos (part 2) [Lark] [Baidu] [HuggingFace] [MD5: 40ab82dccd9084c855452f4a2886137a]
training videos (part 3) [Lark] [Baidu] [HuggingFace] [MD5: bdea6aa584913600fe5da95902fd938a]
training videos (part 4) [Lark] [Baidu] [HuggingFace] [MD5: a8dd511d36da4057a9b52f461065126f]
training videos (part 5) [Lark] [Baidu] [HuggingFace] [MD5: b2daedc2aa5d118e71946ea2614901df]
training videos (part 6) [Lark] [Baidu] [HuggingFace] [MD5: bfa77115ce69256063f22e59f87b90d2]
training videos (part 7) [Lark] [Baidu] [HuggingFace] [MD5: b4ff9a77d2629d372115e4fe8a9398b4]
training videos (part 8) [Lark] [Baidu] [HuggingFace] [MD5: e222802b0ce37aec1c3f14e81ea7e2c0]
validation videos [Lark] [Baidu] [HuggingFace] [MD5: b5386d83faeed98b8342ed31be737900]
testing videos are available after participating in the grand challenges
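
As a quick integrity check before extraction, the MD5 values listed above can be compared against the downloaded archives. The following is a minimal Python sketch and not part of the official tooling. The archive file name in the usage comment and the frame naming pattern `%06d.jpg` are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_md5):
    """Return True if the file's MD5 matches the value listed next to its link."""
    return md5_of_file(path) == expected_md5.strip().lower()


def ffmpeg_extract_command(video_path, frame_dir):
    """Build an FFmpeg command that writes every frame as a numbered JPEG.

    The output naming pattern is an assumption; the dataset page does not
    prescribe one.
    """
    return ["ffmpeg", "-i", str(video_path), str(Path(frame_dir) / "%06d.jpg")]


# Example usage (the archive name is hypothetical):
#   verify_archive("vidor_train_part1.tar.gz", "eec7a718c05a16e388d6ee23102370d3")
#   subprocess.run(ffmpeg_extract_command("video/1025/5159741010.mp4",
#                                         "frames/1025/5159741010"), check=True)
```

Note that FFmpeg numbers image sequences starting from 1 by default, whereas the frame indices in the annotations start from 0, so an offset may be needed when matching frames to annotations.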

Please download the training and validation annotations using the following links. Each JSON file contains the annotation for one video. The format of a JSON file is shown below, and you can load the annotations together using this helper script.

training annotations [Lark] [Baidu] [HuggingFace] [MD5: 85f39fdd81a780bbb9b975cca8f219a2]
validation annotations [Lark] [Baidu] [HuggingFace] [MD5: 96c6870910e8d4fb3836878215432c1f]
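
For readers who prefer not to depend on the helper script, loading the per-video JSON files can be sketched in a few lines of Python. This is an illustrative sketch only, not the helper script itself. It assumes the annotation archive has been extracted into a single directory (with any level of nesting), and the function names `load_annotations` and `resolve_video_path` are hypothetical.

```python
import json
from pathlib import Path


def load_annotations(anno_dir):
    """Load every *.json file under anno_dir, keyed by its "video_id" field."""
    annotations = {}
    for json_path in sorted(Path(anno_dir).rglob("*.json")):
        with open(json_path, "r", encoding="utf-8") as f:
            anno = json.load(f)
        annotations[anno["video_id"]] = anno
    return annotations


def resolve_video_path(video_root, anno):
    """Join the video root directory with the relative "video_path" field."""
    return Path(video_root) / anno["video_path"]
```

Searching recursively keeps the sketch independent of how the archive nests its files, which is not specified here.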

Please view the JSON format on a larger screen.

{
    "version": "VERSION 1.0",
    "video_id": "5159741010",                       # Video ID in YFCC100M collection
    "video_hash": "6c7a58bb458b271f2d9b45de63f3a2", # Video hash officially used for indexing in YFCC100M collection
    "video_path": "1025/5159741010.mp4",            # Relative path name in this dataset
    "frame_count": 219,
    "fps": 29.97002997002997, 
    "width": 1920, 
    "height": 1080, 
    "subject/objects": [                            # List of subject/objects
        {
            "tid": 0,                               # Trajectory ID of a subject/object
            "category": "bicycle"
        }, 
        ...
    ], 
    "trajectories": [                               # List of frames
        [                                           # List of bounding boxes in each frame
            {                                       # The bounding box at the 1st frame
                "tid": 0,                           # The trajectory ID to which the bounding box belongs
                "bbox": {
                    "xmin": 672,                    # Left
                    "ymin": 560,                    # Top
                    "xmax": 781,                    # Right
                    "ymax": 693                     # Bottom
                }, 
                "generated": 0,                     # 0 - the bounding box is manually labeled
                                                    # 1 - the bounding box is automatically generated by a tracker
                "tracker": "none"                   # If generated=1, it is one of "linear", "kcf" and "mosse"
            }, 
            ...
        ],
        ...
    ],
    "relation_instances": [                         # List of annotated visual relation instances
        {
            "subject_tid": 0,                       # Corresponding trajectory ID of the subject
            "object_tid": 1,                        # Corresponding trajectory ID of the object
            "predicate": "in_front_of", 
            "begin_fid": 0,                         # Frame index where this relation begins (inclusive)
            "end_fid": 210                          # Frame index where this relation ends (exclusive)
        }, 
        ...
    ]
}
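
To make the structure above concrete, the following Python sketch regroups the per-frame bounding boxes into per-trajectory tracks and converts each relation instance into a readable triplet with its duration. It is a sketch under stated assumptions: the position of an entry in "trajectories" is taken to be its 0-based frame index, and the function names are hypothetical.

```python
from collections import defaultdict


def boxes_by_trajectory(anno):
    """Map each tid to {frame index: (xmin, ymin, xmax, ymax)}.

    Assumes the i-th entry of "trajectories" holds the boxes of frame i.
    """
    tracks = defaultdict(dict)
    for fid, frame_boxes in enumerate(anno["trajectories"]):
        for box in frame_boxes:
            b = box["bbox"]
            tracks[box["tid"]][fid] = (b["xmin"], b["ymin"], b["xmax"], b["ymax"])
    return dict(tracks)


def describe_relations(anno):
    """Return (subject category, predicate, object category, seconds) tuples."""
    category = {so["tid"]: so["category"] for so in anno["subject/objects"]}
    described = []
    for rel in anno["relation_instances"]:
        # end_fid is exclusive, so the difference is the number of frames.
        n_frames = rel["end_fid"] - rel["begin_fid"]
        described.append((
            category[rel["subject_tid"]],
            rel["predicate"],
            category[rel["object_tid"]],
            n_frames / anno["fps"],
        ))
    return described
```

Because "end_fid" is exclusive, a relation with "begin_fid" 0 and "end_fid" 210 covers frames 0 through 209.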

Statistics

This section provides an overview of the dataset statistics. A detailed description of the dataset can be found in this paper.

Figure 1: Statistics of video lengths in the train/val set. Most videos are between 3 and 93 seconds long. The shortest video is 1.00 second and the longest is 180.01 seconds. Overall, the average video length in the train/val set is 35.73 seconds.

Figure 2: Object statistics per category in the train/val set. The categories are grouped into three upper-level categories: Human (3), Animal (28), and Other (49). Objects in Human account for 56.34% of all objects, while objects in Animal and Other account for 35.78% and 7.98%, respectively.

Figure 3: Predicate statistics per category in the train/val set. Each bar indicates the number of relation instances whose predicate belongs to that category. The two types of predicates, spatial (8) and action (42), are highlighted in different colors; their proportions are 76.77% and 23.23%, respectively.

Figure 4: Statistics of the types of relation triplets in the train/val set. The dark area shows the number and proportion of triplet types unique to the training set; the grey area shows those of triplet types appearing in both the training and validation sets; and the light area shows those of triplet types unique to the validation set. In the validation set, 641 out of 30,142 relation instances belong to the 295 triplet types in the light area.
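
A partition of triplet types like the one in Figure 4 can be approximated with plain set operations. The Python sketch below assumes that a triplet type is the tuple (subject category, predicate, object category) and that each split is available as an iterable of annotation dictionaries in the JSON format shown earlier. The function names are hypothetical, and this is not necessarily how the figure was produced.

```python
def triplet_types(annotations):
    """Collect the set of (subject category, predicate, object category) tuples."""
    types = set()
    for anno in annotations:
        category = {so["tid"]: so["category"] for so in anno["subject/objects"]}
        for rel in anno["relation_instances"]:
            types.add((
                category[rel["subject_tid"]],
                rel["predicate"],
                category[rel["object_tid"]],
            ))
    return types


def split_triplet_types(train_annos, val_annos):
    """Partition triplet types into train-only, shared, and val-only sets."""
    train = triplet_types(train_annos)
    val = triplet_types(val_annos)
    return {
        "train_only": train - val,
        "shared": train & val,
        "val_only": val - train,
    }
```

The sizes of the three returned sets correspond to the dark, grey, and light areas, respectively, under the assumptions stated above.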

Citations

Please kindly cite these works if the dataset helps your research.

@inproceedings{shang2019annotating,
    title={Annotating Objects and Relations in User-Generated Videos},
    author={Shang, Xindi and Di, Donglin and Xiao, Junbin and Cao, Yu and Yang, Xun and Chua, Tat-Seng},
    booktitle={Proceedings of the 2019 on International Conference on Multimedia Retrieval},
    pages={279--287},
    year={2019},
    organization={ACM}
}

@article{thomee2016yfcc100m,
    title={YFCC100M: The New Data in Multimedia Research},
    author={Thomee, Bart and Shamma, David A and Friedland, Gerald and Elizalde, Benjamin and Ni, Karl and Poland, Douglas and Borth, Damian and Li, Li-Jia},
    journal={Communications of the ACM},
    volume={59},
    number={2},
    pages={64--73},
    year={2016},
    publisher={ACM New York, NY, USA}
}