
04. Training and Testing


Training Data

Our pipeline relies on two neural network models:

  • VGG16
  • Siamese Neural Network (trained with triplet loss)

Model 1 is a "pre-trained" model: following standard industry practice, it was trained once on a large dataset and then published for widespread reuse. In this case, Model 1 was pre-trained on the ImageNet dataset, one of the most popular computer vision datasets.

Model 2 is trained on the CC_WEB_VIDEO dataset, which contains sets of near-duplicate videos. The dataset provides our model with sets of similar videos and labeled variations, which allow the model to create fine-grained fingerprints.
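As an illustration only (not the project's actual code), a minimal sketch of this two-model setup in a TensorFlow/Keras environment might look like the following; the layer sizes, the triplet margin, and the helper names are placeholder assumptions.

```python
# Minimal sketch (not the project's exact code): VGG16 as a frozen,
# ImageNet-pretrained feature extractor, plus a small embedding head
# trained with a triplet loss so near-duplicate frames end up close together.
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# Model 1: pre-trained VGG16 (frozen), global-average-pooled to a 512-d vector per frame.
backbone = VGG16(weights="imagenet", include_top=False, pooling="avg")
backbone.trainable = False

def frame_features(frames_rgb):
    """frames_rgb: array of shape (n, 224, 224, 3) with sampled video frames."""
    x = preprocess_input(tf.cast(frames_rgb, tf.float32))
    return backbone(x, training=False)

# Model 2: embedding head trained with a triplet loss (sizes are illustrative).
embedding_head = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(512,)),
    tf.keras.layers.Dense(128),
    tf.keras.layers.Lambda(lambda v: tf.math.l2_normalize(v, axis=1)),
])

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Pull anchor/positive pairs together, push anchor/negative apart by `margin`."""
    pos_d = tf.reduce_sum(tf.square(anchor - positive), axis=1)
    neg_d = tf.reduce_sum(tf.square(anchor - negative), axis=1)
    return tf.reduce_mean(tf.maximum(pos_d - neg_d + margin, 0.0))
```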

Although the pipeline is intended for media from conflict or repressive environments, such as Syria, neither model relies solely on data from conflict zones or other sensitive material.

For more information, please refer to the reference implementation and paper used by our project, "Near-Duplicate Video Retrieval with Deep Metric Learning".

Performance and Reliability

To evaluate this problem we use mAP (mean average precision), a common metric in information retrieval. In simple terms, we use unique videos as queries to find similar videos / duplicated content. In this context, it is important for the model to rank all relevant videos as high as possible, rather than simply return every element in the search space.

As a hypothetical scenario, suppose we had one unique video (used as a query), a search space of 100 videos, and 10 relevant videos (these could be very similar videos, low/high quality variants, and so on). If our model ranked the search space so that the top 10 results were exactly those relevant videos, the query would have an average precision of 100%. Any other ranking would result in a lower average precision.

If we run this hypothetical scenario for multiple unique videos, we get an estimate of how well the model handles different videos. Each unique video yields one average precision score, and averaging those scores over all queries gives the metric called Mean Average Precision, or simply mAP. This explanation oversimplifies the concept; for an in-depth mathematical treatment, please refer to this paper and this wiki.
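For illustration, the short sketch below (not the project's evaluation code) computes average precision for one ranked result list and mAP over several queries; the list-of-flags input format is an assumption made for the example.

```python
# Hedged sketch of the metric itself, not the project's evaluation code.
import numpy as np

def average_precision(ranked_relevance):
    """ranked_relevance: 0/1 flags for a query's ranked results,
    e.g. [1, 1, 0, 1] means relevant items at ranks 1, 2 and 4."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision at each relevant hit
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(per_query_relevance):
    """mAP = mean of the per-query average precision scores."""
    return float(np.mean([average_precision(r) for r in per_query_relevance]))

# The ideal case from the text: all 10 relevant videos ranked at the top of
# a 100-video search space -> average precision of 1.0 (100%).
print(average_precision([1] * 10 + [0] * 90))   # 1.0
```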

  1. The base models were able to reproduce the published performance of the reference paper, ~96% mAP (mean average precision) on the CC_WEB_VIDEO dataset (please refer to the paper for more details on this metric).

  2. We validated the model on our own augmented dataset, as noted below.

Internal Benchmarks

Video Deduplication

Augmented Dataset

Our augmented dataset is a combination of 1,656 seed videos and 1,407 synthetic videos, created by applying random transformations and adding different artifacts to the seed videos. As a result, we have an augmented dataset containing 827 videos for which there is at least one duplicate match.

In order to evaluate our performance in a consistent and robust manner, we perform random sampling for each potentially duplicated video:

Given a duplicated video A, we build a sample dataset containing positive matches at a parametrized proportion ("proportion of positive samples") and negative samples randomly drawn from the rest of the dataset. Our evaluation pipeline then records the mAP score and repeats the process 5 times, so that multiple combinations of queries and search spaces (positive + negative instances) are evaluated.
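A rough sketch of this loop is shown below; the function and parameter names (build_search_space, rank_fn, positives) are hypothetical, and it reuses the mean_average_precision helper from the earlier sketch.

```python
# Illustrative sketch of the repeated random-sampling evaluation (names are hypothetical).
import random

def build_search_space(query_id, positives, all_ids, positive_proportion):
    """Combine all known duplicates of `query_id` with random negatives so that
    positives make up roughly `positive_proportion` of the search space."""
    pos = list(positives[query_id])
    n_neg = int(len(pos) * (1 - positive_proportion) / positive_proportion)
    negatives = [v for v in all_ids if v != query_id and v not in pos]
    return pos + random.sample(negatives, min(n_neg, len(negatives)))

def evaluate(query_ids, positives, all_ids, positive_proportion, rank_fn, n_repeats=5):
    """rank_fn(query_id, candidates) -> candidates sorted by model similarity."""
    scores = []
    for _ in range(n_repeats):                        # 5 repetitions, as described above
        relevance = []
        for q in query_ids:
            candidates = build_search_space(q, positives, all_ids, positive_proportion)
            ranked = rank_fn(q, candidates)
            relevance.append([1 if c in positives[q] else 0 for c in ranked])
        scores.append(mean_average_precision(relevance))   # helper from the mAP sketch
    return sum(scores) / len(scores)
```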

We also evaluate the impact of reducing the number of frames sampled for each video (when frame sampling is set to 1, we sample one frame per second; at 2, we sample one frame every 2 seconds; and so forth).
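For illustration, a frame-sampling helper along these lines (using OpenCV; the real pipeline's implementation may differ) could look like:

```python
# Illustrative only: keep one frame every `frame_sampling` seconds.
import cv2

def sample_frames(video_path, frame_sampling=1):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0           # fall back if FPS is unavailable
    step = max(1, int(round(fps * frame_sampling)))   # frames between kept samples
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames
```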

Proportion of Positive Samples vs Frame Sampling (mAP)

Positive samples     Frame sampling: 1   2         3         4         5
10%                  0.880123            0.854020  0.847633  0.847159  0.852357
15%                  0.908248            0.876877  0.870665  0.873855  0.875837
20%                  0.916558            0.894647  0.887559  0.894328  0.896487
25%                  0.930120            0.903080  0.904946  0.902789  0.906379

Main takeaways from the results:

  1. At a proportion of positive samples of 25% (consistent with the results of the paper mentioned above), our performance ranges from 90% to 93%. The tradeoff between using one frame every 5 seconds and one frame every second seems quite favorable: using one-fifth of the frames reduces performance by only ~3%.
  2. At more conservative proportions of positive samples (10% and 15%), performance ranges from ~85% to ~90%, and the same considerations about frame sampling still apply.

Template Matching

Landmarks

In order to evaluate our template matching solution, we used a smaller subset of the Google Landmarks dataset.

Our script uses samples of landmarks to create query templates and runs those templates against random subsets of landmarks, which are also stress-tested using the same methods described in the previous section.

We also evaluate the impact of using one or more images as a query template, and of different ways of building a single query template using different aggregation functions (max, mean, median, min).
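A minimal sketch of this aggregation step is shown below; it assumes pre-computed image embeddings, and the helper names are ours, not the project's.

```python
# Illustrative sketch: collapse several query-image embeddings into one template.
import numpy as np

AGG_FUNCS = {"max": np.max, "mean": np.mean, "median": np.median, "min": np.min}

def build_template(query_embeddings, agg_func="median"):
    """query_embeddings: (n_query_samples, dim) array of image feature vectors."""
    return AGG_FUNCS[agg_func](np.asarray(query_embeddings), axis=0)

def rank_by_template(template, candidate_embeddings):
    """Rank candidates by cosine similarity to the aggregated template."""
    t = template / np.linalg.norm(template)
    c = candidate_embeddings / np.linalg.norm(candidate_embeddings, axis=1, keepdims=True)
    return np.argsort(-(c @ t))                       # indices, most similar first
```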

mAP by aggregation function (agg_func), proportion of positive samples (ratio) and number of query samples (n_query_samples)

ratio  n_query_samples  max       mean      median    min
0.10   1                0.440084  0.452707  0.459968  0.441521
0.10   5                0.499991  0.567167  0.577313  0.536600
0.10   10               0.460727  0.601804  0.607118  0.534356
0.15   1                0.513849  0.538346  0.518568  0.514333
0.15   5                0.576192  0.639614  0.642629  0.609321
0.15   10               0.546408  0.663328  0.670662  0.615866
0.20   1                0.570644  0.582559  0.562193  0.574257
0.20   5                0.626623  0.688094  0.687686  0.672377
0.20   10               0.612945  0.718985  0.718681  0.681545
0.25   1                0.622055  0.620114  0.626854  0.628346
0.25   5                0.682959  0.736068  0.732309  0.712143
0.25   10               0.656734  0.755929  0.762348  0.720121

Main takeaways:

  1. Using more than one sample to build a template seems to significantly increase performance.
  2. Using the median as the aggregator of multiple templates seems to yield the best results.
  3. When using at least 5 query samples to build a query template, our performance will range between ~50% and 73% mAP (depending on how conservative we are in terms of the proportion of positive samples).
  4. In order to minimize false positives when using this feature, the user should increase the minimum distance parameter.

Scene / Shot Detection

In order to evaluate our scene detection solution, we used the BBC Planet Earth dataset.

Our solution takes into account the fact that we cannot expect a steady stream of contiguous frames, as our pipeline samples only one frame per second (or fewer). This makes the shot detection problem much harder, because changes from one sampled frame to the next are much more pronounced.

At a high level, our solution uses the time series of frame-level feature activations and their rolling differences to extract shot-change indices from a given video, which are then turned into start and end timestamps for each scene.

The main parameter we use to evaluate our ability to segment videos into groups of scenes is the sensitivity to the rolling differences between frame-level feature activations ("min_dif"). We also use another parameter that lets us consider only globally significant changes and ignore small local fluctuations in the frame-level feature activations ("upper_thresh").
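The sketch below illustrates the general idea; the exact way min_dif and upper_thresh are applied here is a guess for illustration, not the pipeline's actual thresholding logic.

```python
# Rough illustration only: flag a shot change wherever the rolling difference
# of the frame-level feature activations spikes, then convert indices to timestamps.
import numpy as np

def detect_scenes(frame_features, min_dif=0.04, upper_thresh=0.77, frame_sampling=1):
    """frame_features: (n_frames, dim) activations for frames sampled every
    `frame_sampling` seconds. Returns (start_s, end_s) pairs, one per scene."""
    diffs = np.linalg.norm(np.diff(frame_features, axis=0), axis=1)
    rel = diffs / (diffs.max() + 1e-8)                # scale by the video's largest change
    # Keep a boundary only if the change is globally significant (upper_thresh,
    # relative) and above the absolute sensitivity floor (min_dif).
    change_idx = np.where((rel > upper_thresh) & (diffs > min_dif))[0] + 1
    bounds = [0, *change_idx.tolist(), len(frame_features)]
    return [(s * frame_sampling, e * frame_sampling) for s, e in zip(bounds[:-1], bounds[1:])]
```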

Our chosen metrics for evaluating our performance are:

  1. Homogeneity: Ranges from 0 to 1, where 1 means that each detected scene contains frames from only one ground-truth scene (no two scenes are merged together).
  2. Completeness: Ranges from 0 to 1, where 1 means that all frames of a given ground-truth scene are assigned to the same detected scene and not fragmented into multiple smaller scenes.
  3. v-measure score: Ranges from 0 to 1, where 1 means that scenes were perfectly segmented. It is the harmonic mean of the two previous metrics.

For more details please refer to this resource
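In practice, these three scores can be computed with scikit-learn by giving each sampled frame a scene id; the sketch below assumes hypothetical ground_truth_scenes / predicted_scenes inputs expressed as frame-index ranges.

```python
# Sketch: score a predicted segmentation against ground truth, treating each
# sampled frame's scene id as a cluster label.
from sklearn.metrics import homogeneity_completeness_v_measure

def frames_to_labels(scene_bounds, n_frames):
    """scene_bounds: list of (start_frame, end_frame) pairs -> per-frame scene ids."""
    labels = [0] * n_frames
    for scene_id, (start, end) in enumerate(scene_bounds):
        for f in range(start, min(end, n_frames)):
            labels[f] = scene_id
    return labels

true_labels = frames_to_labels(ground_truth_scenes, n_frames)   # hypothetical inputs
pred_labels = frames_to_labels(predicted_scenes, n_frames)
homogeneity, completeness, v_measure = homogeneity_completeness_v_measure(true_labels, pred_labels)
```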

Experiment        upper_thresh  min_dif  homogeneity  completeness  v_measure  avg_number_scenes  avg_scene_duration_seconds  std_scene_duration_seconds
Best results      0.77          0.040    0.89         0.90          0.90       381.00             7.78                        0.73
Best results #2   0.77          0.020    0.89         0.90          0.90       381.00             7.78                        0.73
Worst results #2  0.10          0.0      0.60         0.96          0.74       80.45              37.92                       8.59
Worst results #1  0.10          0.0      0.60         0.96          0.74       74.45              40.83                       8.95

Experiments were sorted by their v_measure_score.

Partner Testing and Feedback