Introduction

The VidOR (Video Object Relation) dataset contains 10,000 videos (98.6 hours) drawn from the YFCC100M collection, together with a large amount of fine-grained annotations for relation understanding. In particular, 80 categories of objects are annotated with bounding-box trajectories to indicate their spatio-temporal locations in the videos, and 50 categories of relation predicates are annotated among all pairs of annotated objects with their starting and ending frame indices. This results in around 50,000 object instances and 380,000 relation instances. For model development, the dataset is split into 7,000 videos for training, 835 videos for validation, and 2,165 videos for testing. VidOR provides a foundation for many kinds of research in video relation understanding.


Downloads

Please download the videos in the training/validation sets using the following links, and extract video frames using FFmpeg-3.3.4. The total sizes of the training and validation videos are around 24.5 GB and 2.9 GB, respectively. It is recommended to unarchive the downloaded parts into the same directory.
Alternatively, since the videos are drawn from the YFCC100M collection without any processing, you can also obtain them from the AWS S3 storage hosted by Multimedia Commons; in that case you need to organize the files in a directory structure consistent with the "video_path" field in the annotations. A minimal sketch of the frame-extraction step is given below.
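
The following sketch shows one way to run the frame extraction from Python. The directory names (vidor/video, vidor/frame) and the group/video file layout mirroring the "video_path" field are assumptions for illustration, not part of an official toolkit; adjust them to your setup.

# Minimal sketch: extract frames from every downloaded video with FFmpeg.
# Assumes the videos were unarchived under vidor/video/<group_id>/<video_id>.mp4,
# matching the "video_path" field in the annotations; adjust the paths as needed.
import subprocess
from pathlib import Path

VIDEO_ROOT = Path("vidor/video")   # assumed location of the downloaded videos
FRAME_ROOT = Path("vidor/frame")   # assumed output directory for frames

for video in sorted(VIDEO_ROOT.glob("*/*.mp4")):
    out_dir = FRAME_ROOT / video.parent.name / video.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    # Decode all frames to JPEG; frames are numbered 000001.jpg, 000002.jpg, ...
    subprocess.run(
        ["ffmpeg", "-i", str(video), "-qscale:v", "2", str(out_dir / "%06d.jpg")],
        check=True,
    )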

Please download the annotations for the training/validation sets using the following links; one JSON file contains the annotation for one video. The format of a JSON file is shown below, and you can load the annotations together using this helper script.
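
As a reference for working with the files, here is a minimal sketch that loads the per-video JSON annotations and iterates over the relation instances. The directory layout and the field names (subject/objects, relation_instances, begin_fid, end_fid, etc.) are assumptions based on the VidOR annotation format and should be verified against the files you download.

# Minimal sketch: load the per-video JSON annotations and print relation triplets.
# Assumes the annotations were unarchived under
# vidor/annotation/training/<group_id>/<video_id>.json; the field names follow
# the VidOR annotation format but should be checked against the actual files.
import json
from pathlib import Path

ANNO_ROOT = Path("vidor/annotation/training")  # assumed annotation directory

def load_annotations(root):
    """Yield one parsed annotation dict per video."""
    for path in sorted(root.glob("*/*.json")):
        with open(path) as f:
            yield json.load(f)

for anno in load_annotations(ANNO_ROOT):
    # Map trajectory id (tid) -> object category for this video.
    tid2category = {obj["tid"]: obj["category"] for obj in anno["subject/objects"]}
    for rel in anno["relation_instances"]:
        subject = tid2category[rel["subject_tid"]]
        obj = tid2category[rel["object_tid"]]
        # Each relation instance is annotated with its begin/end frame indices.
        print(subject, rel["predicate"], obj, rel["begin_fid"], rel["end_fid"])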

Statistics

This section provides an overview of the dataset statistics. A detailed description of the dataset can be found in this paper.

Figure 1: Statistics of video lengths in the train/val set. Most video lengths fall between 3 and 93 seconds; the shortest video is 1.00 second and the longest is 180.01 seconds. The average video length in the train/val set is 35.73 seconds.


Figure 2: Object statistics per category in the train/val set. The categories are grouped into three top-level groups: Human (3), Animal (28), and Other (49). Objects in the Human group account for 56.34% of the instances, while Animal and Other account for 35.78% and 7.98%, respectively.


Figure 3: Predicate statistics per category in the train/val set. Each bar indicates the number of relation instances whose predicate belongs to that category. The two types of predicates, spatial (8) and action (42), are highlighted in different colors; their proportions are 76.77% and 23.23%, respectively.


Figure 4: Statistics of relation triplet types in the train/val set. The dark area shows the number and proportion of triplet types unique to the training set; the grey area shows those appearing in both the training and validation sets; and the light area shows those unique to the validation set. In the validation set, 641 out of 30,142 relation instances belong to the 295 triplet types in the light area.

Citations

Please kindly cite these works if the dataset helps your research.

@inproceedings{shang2019annotating,
    title={Annotating Objects and Relations in User-Generated Videos},
    author={Shang, Xindi and Di, Donglin and Xiao, Junbin and Cao, Yu and Yang, Xun and Chua, Tat-Seng},
    booktitle={Proceedings of the 2019 on International Conference on Multimedia Retrieval},
    pages={279--287},
    year={2019},
    organization={ACM}
}

@article{thomee2016yfcc100m,
    title={YFCC100M: The New Data in Multimedia Research},
    author={Thomee, Bart and Shamma, David A and Friedland, Gerald and Elizalde, Benjamin and Ni, Karl and Poland, Douglas and Borth, Damian and Li, Li-Jia},
    journal={Communications of the ACM},
    volume={59},
    number={2},
    pages={64--73},
    year={2016},
    publisher={ACM New York, NY, USA}
}