The AVA-Kinetics dataset consists of the original 430 videos from AVA v2.2, together with 238k videos from the Kinetics-700 dataset. For Kinetics we provide one annotated frame per video clip. The annotations are provided as CSV files, as described in the included README.txt file.
All of the annotations are provided in the .tar.gz file. Although there are separate CSV files for AVA and for Kinetics, it is expected that users will want to train and test on the union of the two.
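As an illustration, here is a minimal Python sketch of combining the two training CSVs into a single annotation list, assuming they have already been extracted from the .tar.gz; the filenames below are placeholders, not necessarily the exact names in the archive.

```python
# Combine the AVA and Kinetics training annotations into one list.
# The filenames are illustrative placeholders; use the names from the archive.
import csv

def load_rows(path):
    with open(path, newline="") as f:
        return list(csv.reader(f))

ava_rows = load_rows("ava_train.csv")            # assumed filename
kinetics_rows = load_rows("kinetics_train.csv")  # assumed filename

# Train on the union of the two sets of annotations.
combined = ava_rows + kinetics_rows
print(len(ava_rows), len(kinetics_rows), len(combined))
```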
The AVA v2.2 dataset contains 430 videos split into 235 for training, 64 for validation, and 131 for test. Each video has 15 minutes annotated at 1-second intervals. The annotations are provided as CSV files:
For Task B - Spatio-temporal Action Localization (AVA) at the ActivityNet 2019 Challenge, we're releasing the video IDs for a set of 131 labeled test videos. The challenge will only evaluate performance on a subset of 60 classes. For details on how to submit your predictions on these videos, please see the ActivityNet 2019 Challenge page.
Generally, raters provided annotations at timestamps from 902 to 1798 seconds, inclusive, at 1-second intervals. Performance is measured on all of these "included" timestamps, including those for which raters determined no action was present. For certain videos, some timestamps were excluded from annotation because raters marked the corresponding video clips as inappropriate. Performance is not measured on the "excluded" timestamps. The lists of included and excluded timestamps are:
Each row contains an annotation for one person performing an action in an interval, where that annotation is associated with the middle frame. Different persons and multiple action labels are described in separate rows.
The format of a row is the following: video_id, middle_frame_timestamp, person_box, action_id, person_id
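As a concrete illustration, the sketch below parses one such row. In the released CSVs the person_box is stored as four normalized coordinates (x1, y1, x2, y2), so a full row has eight comma-separated fields; the example row and field handling are illustrative, and the authoritative description is in the README.

```python
# Parse one annotation row of the form:
# video_id, middle_frame_timestamp, x1, y1, x2, y2, action_id, person_id
from collections import namedtuple

Annotation = namedtuple(
    "Annotation",
    ["video_id", "middle_frame_timestamp", "box", "action_id", "person_id"])

def parse_row(fields):
    video_id = fields[0]
    timestamp = float(fields[1])                # seconds; parsed as float to be safe
    box = tuple(float(v) for v in fields[2:6])  # (x1, y1, x2, y2), normalized to [0, 1]
    action_id = int(fields[6])
    person_id = int(fields[7])
    return Annotation(video_id, timestamp, box, action_id, person_id)

# Hypothetical example row (the video id and values are made up).
row = "EXAMPLE_VIDEO_ID,0912,0.077,0.151,0.283,0.811,80,1".split(",")
print(parse_row(row))
```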
AVA v2.2 differs from v2.1 in two ways. First, another round of human rating was conducted to add missing labels, increasing the number of annotations by 2.5%. Second, box locations were corrected for a small number of videos with aspect ratios much larger than 16:9.
AVA v2.1 differs from v2.0 only by the removal of a small number of movies that were determined to be duplicates. The class list and label map remain unchanged from v1.0.
Code for running the Frame-mAP evaluation can be found in the ActivityNet GitHub.
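For intuition, the sketch below computes a simplified single-class frame-level AP at IoU 0.5 from ground-truth boxes and scored detections. It is only an illustration of the idea; the official metric is computed by the evaluation code in the ActivityNet GitHub, which should be used for any reported numbers.

```python
# Simplified frame-level AP at IoU 0.5 for a single action class.
import numpy as np
from collections import defaultdict

def iou(a, b):
    # Boxes are (x1, y1, x2, y2) in normalized coordinates.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-10)

def frame_ap(gt, dets, iou_thresh=0.5):
    # gt: dict mapping (video_id, timestamp) -> list of ground-truth boxes
    # dets: list of ((video_id, timestamp), box, score)
    n_gt = sum(len(v) for v in gt.values())
    matched = defaultdict(set)
    tp, fp = [], []
    for key, box, score in sorted(dets, key=lambda d: -d[2]):
        ious = [(-1.0 if j in matched[key] else iou(box, g))
                for j, g in enumerate(gt.get(key, []))]
        best_j = int(np.argmax(ious)) if ious else -1
        if best_j >= 0 and ious[best_j] >= iou_thresh:
            matched[key].add(best_j)
            tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-10)
    # Step-wise approximation of the area under the precision-recall curve.
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```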
A pre-trained baseline model is also available. It was created using the TensorFlow Object Detection API.
The baseline model is an image-based Faster RCNN detector with a ResNet-101 feature extractor. Compared with other commonly used object detectors, the action classification loss has been changed to a per-class sigmoid loss so that a single box can carry multiple action labels. The model was trained on the training split of AVA v2.1 for 1.5M iterations, and achieves a mean AP of 11.25% over 60 classes on the validation split of AVA v2.1.
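The sketch below illustrates the per-class sigmoid classification loss, assuming per-box logits of shape [num_boxes, num_classes] and multi-hot labels of the same shape; it shows the idea rather than the exact loss code used in the released model.

```python
# Per-class sigmoid loss: one independent binary decision per (box, class),
# so a single box can carry several action labels at once.
import tensorflow as tf

def per_class_sigmoid_loss(logits, multi_hot_labels):
    per_entry = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=multi_hot_labels, logits=logits)
    return tf.reduce_mean(tf.reduce_sum(per_entry, axis=-1))

# Toy example: one box with two positive labels out of three classes.
logits = tf.constant([[2.0, -1.0, 0.5]])
labels = tf.constant([[1.0, 0.0, 1.0]])
print(per_class_sigmoid_loss(logits, labels))
```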
The model checkpoint can be obtained here. The predictions of this model on the AVA v2.1 validation split, in the CSV format described above, can be downloaded here: ava_baseline_detections_val_v2.1.zip.
Each zipped file contains a set of CSV files, one for each video in the [train/val] partition. The partitions are the same as for AVA Actions.
For Task B, Challenge #2 - Active Speaker Detection at the ActivityNet 2019 Challenge, we're releasing data for 131 videos with the labels anonymized. For details on how to submit your predictions on these videos, please see the information for Challenge #2 on the ActivityNet 2019 Challenge: Task B page.
Each row in the CSV files contains an annotation for speaking activity associated with a single face for that frame. Different persons are described in separate rows. The format of a row is the following: video_id, frame_timestamp, entity_box, label, entity_id.
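A hedged Python sketch of reading these rows and grouping them into per-face tracks is shown below; it assumes the entity_box is stored as four normalized coordinates (x1, y1, x2, y2) and that the CSV has no header row.

```python
# Group active-speaker rows into per-face tracks keyed by (video_id, entity_id).
import csv
from collections import defaultdict

def load_face_tracks(path):
    tracks = defaultdict(list)  # (video_id, entity_id) -> [(timestamp, box, label), ...]
    with open(path, newline="") as f:
        for row in csv.reader(f):
            video_id, timestamp = row[0], float(row[1])
            box = tuple(float(v) for v in row[2:6])  # (x1, y1, x2, y2), normalized
            label, entity_id = row[6], row[7]
            tracks[(video_id, entity_id)].append((timestamp, box, label))
    for track in tracks.values():
        track.sort()  # order each face track by timestamp
    return tracks
```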
The AVA Active Speaker labels v1.0 release contains dense labels for 160 videos (from the original list of 188 videos in AVA v1.0) that are still available on YouTube.
Each row contains an annotation for an interval of a video clip. For each video in the dataset, 15 minutes (from 15 minutes 0 seconds to 30 minutes 0 seconds) are densely labeled for speech activity using one of four possible labels: {NO_SPEECH, CLEAN_SPEECH, SPEECH_WITH_MUSIC, SPEECH_WITH_NOISE}. Each new label appears in a separate row.
The format of a row is the following: video_id, label_start_timestamp_seconds, label_end_timestamp_seconds, label
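As a small usage example, the sketch below sums the labeled time per video and per label from such a file, assuming the CSV has no header row and follows the four-column format above.

```python
# Sum labeled speech-activity time per (video_id, label) in seconds.
import csv
from collections import defaultdict

def speech_durations(path):
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for video_id, start, end, label in csv.reader(f):
            totals[(video_id, label)] += float(end) - float(start)
    return totals
```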
The AVA speech labels v1.0 release contains dense labels for 160 videos (from the original list of 188 videos in AVA v1.0) that are still available on YouTube.
All datasets listed here are made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.