The VidOR (Video Object Relation) dataset contains 10,000 videos (98.6 hours) from the YFCC100M collection, together with a large amount of fine-grained annotations for relation understanding. In particular, 80 categories of objects are annotated with bounding-box trajectories to indicate their spatio-temporal locations in the videos, and 50 categories of relation predicates are annotated among all pairs of annotated objects with their start and end frame indices. This results in around 50,000 object instances and 380,000 relation instances. For model development, the dataset is split into 7,000 videos for training, 835 videos for validation, and 2,165 videos for testing. VidOR provides a foundation for many kinds of research and has been used in:
Please download the videos in the training/validation sets using the following links, and extract the video frames using FFmpeg-3.3.4.
The total sizes of the training and validation videos are around 24.5 GB and 2.9 GB, respectively.
It is recommended to unarchive the downloaded parts into the same directory.
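For reference, below is a minimal sketch of frame extraction that calls FFmpeg from Python. The input and output paths are hypothetical placeholders, and passing "-start_number 0" is only a suggestion to keep the image numbering aligned with the zero-based frame indices used in the annotations.

import os
import subprocess

def extract_frames(video_path, frame_dir):
    """Extract all frames of one video as JPEG images by calling FFmpeg."""
    os.makedirs(frame_dir, exist_ok=True)
    # "-start_number 0" makes the output numbering zero-based so that
    # image 000000.jpg corresponds to frame index 0 in the annotations.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-start_number", "0",
         os.path.join(frame_dir, "%06d.jpg")],
        check=True,
    )

# Hypothetical paths; adjust to wherever you placed the videos and frames.
extract_frames("video/1025/5159741010.mp4", "frame/1025/5159741010")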
Alternatively, since the videos are drawn directly from the YFCC100M collection without any processing, you can also obtain them from the AWS S3 data storage hosted by Multimedia Commons, but you then need to organize the files in a directory structure consistent with the "video_path" field in the annotations.
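If you fetch the videos from S3 yourself, a quick sanity check such as the sketch below (with hypothetical directory names) can confirm that every annotated "video_path" resolves to a file on disk.

import json
import os
from glob import glob

# Hypothetical locations; point these at your own annotation and video folders.
ANNOTATION_DIR = "annotation/training"
VIDEO_DIR = "video"

missing = []
for anno_file in glob(os.path.join(ANNOTATION_DIR, "**", "*.json"), recursive=True):
    with open(anno_file) as f:
        anno = json.load(f)
    # "video_path" is relative, e.g. "1025/5159741010.mp4"
    if not os.path.isfile(os.path.join(VIDEO_DIR, anno["video_path"])):
        missing.append(anno["video_path"])

print(len(missing), "annotated videos are missing or misplaced")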
Please download the annotations for the training/validation sets using the following links; each JSON file contains the annotation for one video. The format of a JSON file is shown below, and you can load the annotations together using this helper script.
{ "version": "VERSION 1.0", "video_id": "5159741010", # Video ID in YFCC100M collection "video_hash": "6c7a58bb458b271f2d9b45de63f3a2", # Video hash offically used for indexing in YFCC100M collection "video_path": "1025/5159741010.mp4", # Relative path name in this dataset "frame_count": 219, "fps": 29.97002997002997, "width": 1920, "height": 1080, "subject/objects": [ # List of subject/objects { "tid": 0, # Trajectory ID of a subject/object "category": "bicycle" }, ... ], "trajectories": [ # List of frames [ # List of bounding boxes in each frame { # The bounding box at the 1st frame "tid": 0, # The trajectory ID to which the bounding box belongs "bbox": { "xmin": 672, # Left "ymin": 560, # Top "xmax": 781, # Right "ymax": 693 # Bottom }, "generated": 0, # 0 - the bounding box is manually labeled # 1 - the bounding box is automatically generated by a tracker "tracker": "none" # If generated=1, it is one of "linear", "kcf" and "mosse" }, ... ], ... ], "relation_instances": [ # List of annotated visual relation instances { "subject_tid": 0, # Corresponding trajectory ID of the subject "object_tid": 1, # Corresponding trajectory ID of the object "predicate": "in_front_of", "begin_fid": 0, # Frame index where this relation begins (inclusive) "end_fid": 210 # Frame index where this relation ends (exclusive) }, ... ] }
This section provides an overview of the dataset statistics. A detailed description of the dataset can be found in this paper.
Please kindly cite these works if the dataset helps your research.
@inproceedings{shang2019annotating,
  title={Annotating Objects and Relations in User-Generated Videos},
  author={Shang, Xindi and Di, Donglin and Xiao, Junbin and Cao, Yu and Yang, Xun and Chua, Tat-Seng},
  booktitle={Proceedings of the 2019 on International Conference on Multimedia Retrieval},
  pages={279--287},
  year={2019},
  organization={ACM}
}

@article{thomee2016yfcc100m,
  title={YFCC100M: The New Data in Multimedia Research},
  author={Thomee, Bart and Shamma, David A and Friedland, Gerald and Elizalde, Benjamin and Ni, Karl and Poland, Douglas and Borth, Damian and Li, Li-Jia},
  journal={Communications of the ACM},
  volume={59},
  number={2},
  pages={64--73},
  year={2016},
  publisher={ACM New York, NY, USA}
}