UNIVERSIDAD DE CHILE
FACULTAD DE CIENCIAS FÍSICAS Y MATEMÁTICAS
DEPARTAMENTO DE CIENCIAS DE LA COMPUTACIÓN
CONTENT-BASED VIDEO COPY DETECTION
THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF SCIENCE, SPECIALIZATION IN COMPUTER SCIENCE
JUAN MANUEL BARRIOS NÚÑEZ
THESIS ADVISOR:
BENJAMÍN BUSTOS CÁRDENAS
COMMITTEE MEMBERS:
GONZALO NAVARRO BADINO
JORGE PÉREZ ROJAS
SHIN’ICHI SATOH
This work has been partially funded by CONICYT.
SANTIAGO – CHILE
NOVEMBER 2013
Resumen

The amount and use of videos on the Internet have grown exponentially over the last few years. Academic research on video topics has been carried out for decades; however, the current ubiquity of videos pushes the development of new and better algorithms. There are currently many needs to satisfy and many open problems that require scientific research. In particular, Video Copy Detection (VCD) addresses the need to find the videos that are copies of an original document. The detection process compares the content of the videos in a way that is robust to different audiovisual transformations. This thesis presents a VCD system called P-VCD, which uses novel algorithms and techniques to achieve high effectiveness and efficiency.

This thesis is divided into two parts. The first part focuses on the state of the art: it reviews common techniques for image processing and similarity search, analyzes the definition and scope of VCD, and presents current techniques for solving this problem. The second part details the work carried out and its contributions to the state of the art, analyzing each of the tasks that compose the proposed solution, namely: video preprocessing, video segmentation, feature extraction, similarity search, and copy localization.

Regarding effectiveness, the ideas of video quality normalization, multiple content description, combination of distances, and the use of metric versus non-metric distances are developed. As a result, the following techniques are proposed: the automatic creation of spatio-temporal descriptors from frame descriptors, audio descriptors that can be combined with visual descriptors, automatic weight selection, and a spatio-temporal distance for combining descriptors.

Regarding efficiency, the metric space approach and pivot tables are developed to accelerate searches. As a result, an approximate search that uses pivot objects to estimate and discard distances, multimodal searches in large collections, and an index that exploits the similarity between consecutive query objects are proposed.

This thesis has been evaluated using the MUSCLE-VCD-2007 collection and by participating in the TRECVID 2010 and 2011 evaluations. The performance achieved in these evaluations is satisfactory. In the case of MUSCLE-VCD-2007, the best published result for that collection is surpassed, achieving the maximum possible effectiveness, while in the case of TRECVID the system achieves performance competitive with other state-of-the-art systems.
Abstract
The number of digital videos on the Internet has grown exponentially over the last few years.
Academic research on video topics has developed over many decades; however, the current ubiquity of
videos pushes the development of new and better algorithms. There are currently many unsatisfied
needs and open problems that demand scientific research. In particular, Content-Based Video
Copy Detection (CBVCD) addresses the need to retrieve videos in a collection that are copies of
some original document. The copy detection process relies exclusively on the audiovisual content,
and must be robust to common audiovisual transformations. This thesis details a novel CBVCD
system, called P-VCD. The system is founded on the metric space approach and develops many
novel algorithms and techniques in order to achieve high effectiveness and efficiency.
The thesis is divided into two parts. The first part gives an overview of the area, presents
different techniques from image processing and similarity search, analyzes the definition and scope
of the CBVCD problem, and summarizes the state-of-the-art on the CBVCD topic. The second
part details our approach for CBVCD, following the tasks of the detection process, namely: video
preprocessing, video segmentation, feature extraction, similarity search, and copy localization.
Regarding effectiveness, we explore the ideas of video quality normalization, multiple content
description, combination of distances, and metric versus non-metric distances. In particular, we
propose techniques for the automatic creation of spatio-temporal descriptors using frame-based
global descriptors, an acoustic descriptor that can be combined with global descriptors, automatic
weight selection, and a spatio-temporal distance to combine descriptors.
Regarding efficiency, we explore the ideas of metric access methods and pivot tables in order
to reduce the amount of distance computations. In particular, we propose a novel approximate
search that uses pivot objects in order to estimate and discard distance evaluations, a multimodal
search in large datasets, and a novel index structure that exploits the similarity between consecutive
query objects.
This thesis has been evaluated by using the MUSCLE-VCD-2007 dataset and by participating in TRECVID 2010 and 2011. We are very pleased with the performance achieved in both
evaluations. In the case of MUSCLE-VCD-2007, the system outperforms the best published result
for that dataset, achieving the maximum detection effectiveness, whereas in the case of TRECVID
it shows competitive performance with other state-of-the-art systems.
Dedicated to Solange, Daniel, and Elizabeth.
Contents
1 Introduction
  1.1 Thesis Overview
  1.2 Thesis Publications

I Background and Related Work

2 Image Processing and Analysis
  2.1 Concepts
  2.2 Image Processing
  2.3 Content Description
    2.3.1 Global Descriptors
    2.3.2 Local Descriptors
    2.3.3 Bag of Visual Words
  2.4 Summary

3 Similarity Search
  3.1 Concepts
  3.2 Vector Spaces
    3.2.1 R-tree
    3.2.2 Kd-tree
    3.2.3 K-means tree
    3.2.4 LSH
  3.3 Metric Spaces
    3.3.1 Efficiency in Metric Spaces
    3.3.2 Effectiveness in Metric Spaces
  3.4 Vector spaces versus Metric spaces
  3.5 Summary

4 Related Work
  4.1 Definition of Content-Based Video Copy Detection
    4.1.1 CBVCD based on content transformations
    4.1.2 CBVCD based on semantic similarity
  4.2 Applications of CBVCD
  4.3 Content Description
    4.3.1 Description for an entire video document
    4.3.2 Video Segmentation and Keyframe Selection
    4.3.3 Visual Global Description
    4.3.4 Visual Local Description
    4.3.5 Global description based on local descriptors
    4.3.6 Acoustic Description
    4.3.7 High-level Description
  4.4 Similarity Search
    4.4.1 Linear Scan
    4.4.2 Lookup Tables
    4.4.3 Space Filling Curves
  4.5 Temporal Consistency
  4.6 Multimodal Fusion
    4.6.1 Early Fusion
    4.6.2 Late Fusion
  4.7 Alternative Approach: Watermarking
  4.8 Summary

II CBVCD and the Metric Space Approach

5 Overview
  5.1 Preprocessing Task
  5.2 Video Segmentation Task
  5.3 Feature Extraction Task
  5.4 Similarity Search Task
  5.5 Copy Localization Task
  5.6 Evaluation of CBVCD
    5.6.1 Dataset
    5.6.2 Evaluation measures
  5.7 Summary

6 Preprocessing
  6.1 Quality normalization
    6.1.1 Frame-by-frame normalization
    6.1.2 Global normalization
  6.2 Detection and reversion of transformations
    6.2.1 Picture-in-picture
    6.2.2 Camcording
    6.2.3 Vertical flip
  6.3 Summary

7 Video Segmentation and Feature Extraction
  7.1 Video Segmentation
    7.1.1 Fixed-length segmentation
    7.1.2 Variable-length segmentation
  7.2 Spatial global description
  7.3 Spatio-temporal global description
  7.4 Acoustic description
  7.5 Spatial Local description
  7.6 Evaluation on MUSCLE-VCD-2007
    7.6.1 Evaluation of global description
    7.6.2 Effectiveness of acoustic description
    7.6.3 Evaluation of local descriptors
  7.7 Summary

8 Improving Effectiveness in the Similarity Search
  8.1 Spatial-Combined Distance
    8.1.1 α-Normalization of distances
    8.1.2 Weighting by Max-ρ
    8.1.3 Weighting by Max-τ
    8.1.4 Discussion and implementation details
    8.1.5 Evaluation
  8.2 Spatio-Temporal Combined Distance
    8.2.1 Evaluation
  8.3 Other Metric and Non-Metric Distances
  8.4 Summary

9 Improving Efficiency in the Similarity Search
  9.1 Approximate Search with Pivots
    9.1.1 Index Structure
    9.1.2 Exact Search with pivots
    9.1.3 Approximate Search with pivots
    9.1.4 Approximate Search with Pivots for Local Descriptors
    9.1.5 Two-step Search
    9.1.6 Evaluation
  9.2 Efficiency for Streams of Exact Searches
    9.2.1 Snake Table
    9.2.2 Snake Distribution
    9.2.3 Evaluation
  9.3 Comparison with Multidimensional Indexes
    9.3.1 Exact search
    9.3.2 Approximate search
    9.3.3 Fusion of descriptors
  9.4 Summary

10 Copy Localization
  10.1 Voting algorithm
  10.2 Evaluation of voting algorithm
  10.3 Combination of candidates
  10.4 Summary

11 Evaluation at TRECVID
  11.1 Introduction
  11.2 TRECVID datasets
  11.3 Evaluation process
  11.4 Participation at TRECVID 2010
    11.4.1 Submissions
    11.4.2 Results
  11.5 Participation at TRECVID 2011
    11.5.1 Submissions
    11.5.2 Results
  11.6 Precision/Recall analysis
    11.6.1 TRECVID 2010
    11.6.2 TRECVID 2011
    11.6.3 Evaluation of preprocessing task
  11.7 Summary

12 Conclusions
  12.1 Summary of the main contributions
  12.2 Benefits and drawbacks of the proposed solution
  12.3 Trends for future research

Bibliography

A MUSCLE-VCD-2007 ground-truth

B Results at TRECVID 2010

C Results at TRECVID 2011
List of Tables
5.1 Summary of MUSCLE-VCD-2007 collection.
5.2 Detections without false alarms at MUSCLE-VCD-2007 dataset and their associated Recall at Precision 1.
7.1 Number of query and reference segments for each fixed-length segmentation.
7.2 Average number of vectors per segment, size of frame from which vectors are extracted, and number of pairs between query and reference vectors.
8.1 Base effectiveness of the three configurations to be combined.
8.2 Effectiveness of γ when combining distances from EH and IH.
8.3 Effectiveness of γ when combining distances from EH and KF.
8.4 Effectiveness of γ when combining distances from IH and KF.
8.5 Effectiveness of γ when combining distances from EH, IH and KF.
8.6 Effectiveness of five more configurations to be combined.
8.7 Base effectiveness of the two configurations to be combined.
9.1 Configurations used in following experiments.
9.2 Metric spaces defined by different local descriptors and segmentations.
9.3 Configurations used in the comparison between multidimensional and metric indexing.
10.1 Detections without false alarms when increasing k and both rank relevance and dist relevance are disabled.
10.2 Detections without false alarms when increasing k, rank relevance is active and dist relevance is disabled.
10.3 Detections without false alarms when increasing k, both rank relevance and dist relevance are active.
11.1 The evaluated transformations at TRECVID 2010 and 2011.
11.2 Summary of TRECVID 2010 and 2011 collections.
11.3 Descriptors used at TRECVID 2010.
11.4 Configurations used at TRECVID 2010.
11.5 Evaluation for submitted Runs to TRECVID 2010.
11.6 Descriptors used at TRECVID 2011.
11.7 Configurations used at TRECVID 2011.
11.8 Evaluation for submitted Runs to TRECVID 2011.
A.1 Ground-truth of ST1 collection in MUSCLE-VCD-2007 dataset.
A.2 Ground-truth of ST2 collection in MUSCLE-VCD-2007 dataset.
B.1 The 22 participant teams in CCD evaluation at TRECVID 2010.
B.2 Results for NoFA profile at TRECVID 2010.
B.3 Optimal NDCR and Optimal F1 for submissions nofa.ehdNghT10 and nofa.ehdNgryhst at TRECVID 2010.
B.4 Results for Balanced profile at TRECVID 2010.
B.5 Optimal NDCR and Optimal F1 for submissions balanced.ehdNclrhst and balanced.ehdNgryhst at TRECVID 2010.
B.6 RP1 (maximum Recall with Precision 1) for all TRECVID 2010 submissions. Part 1 of 2.
B.7 RP1 (maximum Recall with Precision 1) for all TRECVID 2010 submissions. Part 2 of 2.
B.8 RP.5 (maximum Recall with Precision greater or equal than 0.5) for all TRECVID 2010 submissions. Part 1 of 2.
B.9 RP.5 (maximum Recall with Precision greater or equal than 0.5) for all TRECVID 2010 submissions. Part 2 of 2.
C.1 The 22 participant teams in CCD evaluation at TRECVID 2011.
C.2 Results for NoFA profile at TRECVID 2011.
C.3 Optimal NDCR and Optimal F1 for submissions nofa.EhdGry and nofa.EhdRgbAud at TRECVID 2011.
C.4 Results for Balanced profile at TRECVID 2011.
C.5 Optimal NDCR and Optimal F1 for submissions balanced.EhdGry and balanced.EhdRgbAud at TRECVID 2011.
C.6 RP1 (maximum Recall with Precision 1) for all TRECVID 2011 submissions. Part 1 of 2.
C.7 RP1 (maximum Recall with Precision 1) for all TRECVID 2011 submissions. Part 2 of 2.
C.8 RP.5 (maximum Recall with Precision greater or equal than 0.5) for all TRECVID 2011 submissions. Part 1 of 2.
C.9 RP.5 (maximum Recall with Precision greater or equal than 0.5) for all TRECVID 2011 submissions. Part 2 of 2.
List of Figures
1.1 Internet users data traffic per month for 2005, 2008 and 2011.
2.1 The RGB unitary cube.
2.2 Gray and Color Histogram.
2.3 Blurring filters.
2.4 Sobel filters.
2.5 Laplacian filter.
2.6 Edge selection.
2.7 The normal representation of line L.
2.8 Edge Histogram descriptor.
2.9 Ordinal Measurement descriptor.
2.10 Extracting and matching local descriptors between two images.
3.1 The object-pivot distance constraint for distance d(a, b) using the pivot object p.
3.2 Histogram of distances with median µ, variance σ², and maximum distance M for some metric space (R, d).
4.1 Some examples of content transformations used in the TRECVID datasets.
4.2 Two video clips showing the same physics experiment in similar conditions.
4.3 The mirrored versions of the Edge Histogram filters.
5.1 The five main tasks of the P-VCD system.
5.2 Video Segmentation and Feature Extraction tasks.
6.1 PIP detection.
6.2 Camcording detection.
7.1 Effectiveness of the fourteen spatial descriptors for different segmentation length without video preprocessing.
7.2 Effectiveness of the fourteen spatial descriptors for preprocessed collections.
7.3 Effectiveness of the fourteen spatio-temporal descriptors for preprocessed collections.
7.4 Effectiveness of different parameters for the acoustic descriptor.
7.5 Effectiveness of Matches and Spatial for local descriptions SF4, SF5, and SF6.
8.1 Normalization by maximum distances on two functions d1 and d2.
8.2 The α-normalization of two functions d1 and d2.
8.3 Relationship between intrinsic dimensionality (ρ), the value that α-normalizes γ (τα,γ), and MAP, when combining distances from EH, IH and KF.
8.4 Effectiveness of combining incrementally from one up to eight descriptors.
8.5 Effectiveness of increasing temporal window W.
8.6 Effectiveness of s-t distance with temporal window W.
8.7 Effectiveness achieved by different distance functions.
9.1 Amount of queries whose actual nearest neighbor is between the smallest values of LBP for |P| ∈ {1, 3, 5, 10, 20, 40, 80}.
9.2 Computational time of approximate searches with pivots compared to linear scan.
9.3 Effectiveness of approximate search with pivots for local description SF4 and segmentation S3 varying |P|, m and T.
9.4 Performance of approximate search for local descriptors and exact search.
9.5 Snake Tables created for stream of queries Q1 and Q2.
9.6 Stream of queries Q={q1, ..., q12} with a snake distribution.
9.7 Snake distribution of order p for the four configurations.
9.8 Search time and distance evaluations for KF and EH (Group 1).
9.9 Search time and distance evaluations for EIK and EK3 (Group 2).
9.10 Search time spent by the exact search implemented by different multidimensional and metric indexes.
9.11 Effectiveness-versus-efficiency tradeoff for multidimensional and metric indexes.
9.12 Effectiveness of incremental combination from one up to eight descriptors.
10.1 Example showing the result of the voting algorithm.
11.1 Example of visual transformations in TRECVID datasets.
11.2 Average Optimal NDCR, Average Optimal F1 and Average MPT for Runs at TRECVID 2010.
11.3 Optimal NDCR and Optimal F1 by transformation at TRECVID 2010.
11.4 Average Optimal NDCR, Average Optimal F1 and Average MPT for Runs at TRECVID 2011.
11.5 Optimal NDCR and Optimal F1 by transformation at TRECVID 2011.
11.6 Precision/Recall curves for selected submissions type V to TRECVID 2010.
11.7 Copy detections at RP1 and RP.5 at TRECVID 2010.
11.8 Precision/Recall curves for selected submissions to TRECVID 2011.
11.9 Copy detections at RP1 and RP.5 at TRECVID 2011.
11.10 Copy detections at RP1 and RP.5 at TRECVID 2011.
11.11 Results achieved by the fusion of our system into Telefonica's system.
11.12 Precision/Recall curves for the original Run, and for a Run discarding every detection from queries with either camcording or PIP transformations.
A.1 Copies in ST1 collection.
A.2 Copy excerpts in ST2Query1.
A.3 Copy excerpts in ST2Query2.
A.4 Copy excerpts in ST2Query3.
List of Algorithms
9.1 The Sparse Spatial Selection (SSS) algorithm.
9.2 Pivot Selection Algorithm.
9.3 Classic NN+range search using pivots.
9.4 Approximate NN+range search using pivots.
9.5 k-NN search by distance aggregation of partial nearest neighbors.
10.1 Voting algorithm for copy localization.
10.2 CalculateVote function for copy localization.
Chapter 1
Introduction
The creation, distribution, storage, and broadcasting of digital videos have proliferated enormously in the last decade. In fact, a recurring study of Internet traffic shows that more than 50%
of the current traffic corresponds to streams and downloads of videos for viewing on a PC screen
[Cisco Systems Inc., 2012]. This traffic includes user-generated videos (i.e., YouTube-like videos),
online movies (i.e., Netflix-like videos), television broadcasts, webcam viewing, and mobile phone
videos. Moreover, the study projects that video traffic will represent more than 80% of the
global consumer traffic on the Internet by 2016. This is even more impressive if we acknowledge that
video traffic made up less than 5% in 2005 [Cisco Systems Inc., 2008] (see Figure 1.1). The exponential growth of videos on the Internet can mostly be explained by the fall in the prices of digital
cameras and video capture devices, the increase in network bandwidth, the emergence of devices
showing online content, and the spread of mobile phones with embedded cameras.
Academic research on video topics has developed over many decades; however, the current
ubiquity of videos pushes the development of new and better algorithms and techniques. Regarding
video topics, there are currently many open problems to address, novel applications to develop,
and unsatisfied needs to fulfill. Most users would be interested in enhancing their possibilities to
manage personal video collections, analyze large video datasets, extract information in real time
from streams of videos, automate processes using digital cameras, etc.
Figure 1.1 – Internet users data traffic per month for 2005, 2008 and 2011 [Cisco Systems Inc., 2008, 2012].

Multimedia Information Retrieval (MIR) is the research area that aims at searching and retrieving semantic information from multimedia documents [Lew et al., 2006]. A multimedia document can be understood as any repository of information, structured or not. It includes image,
audio and videos, but also comprises other sources of information like time series, webpages, graphs,
XML documents, DNA sequences, etc. In general, MIR systems follow two steps: a process for summarizing the content of the documents into descriptors, and a process for analyzing the descriptors
in order to create knowledge.
Two classes of descriptors can be distinguished in MIR systems, namely high-level and low-level descriptors. High-level descriptors represent semantic features of documents, usually in the
form of metadata. These descriptors usually have a human basis, i.e., they are produced by a person
describing the content of a document, either directly (by assigning concepts, tags, text annotations,
or other metadata) or indirectly (by derivations of direct metadata or by capturing the user behavior
on the document). On the other hand, low-level descriptors represent features of the content itself,
usually in the form of statistics and patterns. These descriptors are usually produced automatically
by computers following different algorithms.
Content-Based MIR (CBMIR) aims to fulfill the objectives of MIR while relying exclusively on
the content of the documents, i.e., by analyzing and producing semantic information from low-level
descriptors and limiting the use of user-generated metadata. The content-based techniques enable
the use of MIR in cases when large datasets lack high-level descriptors, or when it is unfeasible to
ask users to create descriptors. There is a broad field of research that addresses the problem of
automatically creating high-level descriptors from low-level descriptors, i.e., to predict the concepts
that a person would assign to a document given its content. The lack of coincidence between
the high-level descriptors that a user would create for a document and those that a
computer automatically produces is known as the Semantic Gap [Smeulders et al., 2000]. Even in cases
where enough semantic descriptors exist, content-based techniques can improve the effectiveness of
MIR systems by allowing them to analyze an additional source of information [Lew et al., 2006].
Content-Based Video Retrieval (CBVR) aims at searching and retrieving videos in a collection based on their audiovisual content. Common use cases for CBVR systems are the organization
and browsing of video collections, the retrieval of videos that “look similar” to a query video, and
the localization of videos with specific features. As many image processing and image analysis
techniques are needed, CBVR might be seen as a natural extension of content-based image retrieval. However, important differences arise from the inclusion of the temporal dimension (like
video segmentation and tracking of objects), the existence of acoustic and visual tracks (like fusing
information from different sources), and the usually huge amount of data stored in video collections.
Content-Based Video Copy Detection (CBVCD) is a CBVR problem that aims at
detecting and retrieving video documents that are copies of some original document. The copy
detection process must rely exclusively on the audiovisual content, ignoring any metadata associated
with videos. Depending on the definition of what a copy is, CBVCD systems must detect either
semantically identical videos or audio/visual derivatives. A derivative is a video that uses any part
of the original, including editing processes such as extracting scenes, video-montages, and chroma
key composition (green screens); therefore, a copy does not necessarily look similar to the original.
In general, the videos retrieved by CBVR systems may differ from videos retrieved by CBVCD systems. For example, assume we have a clip with the best point from a tennis game. Given
that clip, a typical CBVR system may retrieve videos containing other points of the
game, other games of the tournament, or any other tennis game. In turn, a CBVCD system will retrieve videos containing that specific point, like a news program reviewing the game, fan videos with
the best tennis points in history, or maybe no videos at all (if it did not locate any duplicate).
The detection performed by a CBVCD system should achieve both high effectiveness and
high efficiency. Effectiveness measures the quality of the answer of the system. Efficiency measures
the resources used by the system to produce the answer. A CBVCD system with high effectiveness
will retrieve all the videos that indeed are copies, without including false alarms (i.e., those videos
that are not copies). A CBVCD system with high efficiency will require less memory and disk
storage to run, and will take less time to produce the answer.
When dealing with large datasets, the resources required to achieve the highest effectiveness
usually produce very poor efficiency, i.e., comparing every possible video following every possible
criterion commonly leads to unreasonable search time and storage needs. Hence, a common issue
that CBVCD must address is the effectiveness-versus-efficiency trade-off. Moreover, even in the
hypothetical case of unlimited resources, there is no guarantee of achieving the maximum effectiveness, because what constitutes a copy is ultimately defined by human perception. Therefore, another issue
that CBVCD must address is to bridge the semantic gap between low-level content descriptors and
what a user would consider a copy.
Finally, TRECVID is an evaluation and a workshop organized yearly by the National Institute of Standards and Technology (NIST), which promotes research in video information
retrieval [Smeaton et al., 2006]. Many teams from different universities and private companies
around the world participate regularly in the proposed tasks. A specific evaluation for CBVCD
systems was held at TRECVID between 2008 and 2011. This evaluation provided common datasets
and evaluation measures to benchmark different CBVCD systems.
1.1 Thesis Overview
This thesis details a novel CBVCD system called P-VCD, which achieves competitive performance with other state-of-the-art systems. The system is founded on the metric space approach and
develops many novel algorithms and techniques in order to achieve high effectiveness and efficiency.
P-VCD is divided into five main tasks, namely Preprocessing, Video Segmentation, Feature
Extraction, Similarity Search, and Copy Localization. In particular, the Similarity Search task uses
metric spaces to achieve both high effectiveness by defining smart similarity measures to compare
objects, and high efficiency by using mathematical properties to reduce the number of comparisons
required to perform a search. To the best of our knowledge, this is the first system that applies
techniques from metric spaces to address the CBVCD problem.
The system is evaluated in detail using the MUSCLE-VCD-2007 dataset [Law-To et al., 2007b],
a medium-size dataset which provides a ground-truth for CBVCD. It contains a collection of original
videos, a collection of query videos, and the expected answer for each query video. In addition to
our own evaluations, we participated in the CBVCD evaluation at TRECVID between 2010 and
2011 as the “PRISMA” team, hence, the performance of our system has also been compared with
other CBVCD systems.
This thesis comprises twelve chapters and three appendixes. It is divided into two parts:
Part I: Background and Related Work. This part introduces the areas of image processing,
image analysis, and metric spaces, and reviews the state of the art on the CBVCD topic.
• In Chapter 2 we briefly review the techniques of image processing and image analysis that we
use in this thesis.
• In Chapter 3 we give the background for the similarity search process, and present some
general techniques to improve effectiveness and efficiency in a search.
• In Chapter 4 we analyze the definition and scope of the CBVCD problem and common applications, and we summarize the related work on the CBVCD topic.
Part II: CBVCD and the Metric Space Approach. This part details our approach for
CBVCD following each of its tasks.
• In Chapter 5 we give a general overview of the P-VCD system and its five main tasks. We also
present the MUSCLE-VCD-2007 dataset used for the evaluation of the following chapters.
• In Chapter 6 we detail the Preprocessing task. We define a process for quality normalization
of videos, and a process for detection and reversion of content transformations.
• In Chapter 7 we detail the Video Segmentation and the Feature Extraction tasks. We show
the low-level descriptors used by the system, and we perform a baseline evaluation of their effectiveness. In particular, we present six different global descriptors, an approach for converting
them into spatio-temporal descriptors, and a novel acoustic descriptor. The most important
conclusions from the experiments in this chapter are:
– The quality normalization at the preprocessing task improves the effectiveness of all the
evaluated global descriptors.
– The spatio-temporal description improves the effectiveness of all the evaluated global
descriptors.
– The descriptor based on edge orientations achieves higher effectiveness than the other
evaluated global descriptors, when using both quality normalization and spatio-temporal
description. Moreover, the achieved effectiveness outperforms the state-of-the-art systems evaluated on the MUSCLE-VCD-2007.
– The proposed acoustic descriptor achieves high detection effectiveness. It also outperforms the state-of-the-art systems evaluated on the MUSCLE-VCD-2007.
• In Chapter 8 we detail the developed techniques for improving the effectiveness of the Similarity Search task. We analyze the weighted combination of distances, the spatio-temporal
combination of distances, and the use of non-metric distances. In this chapter we propose three
novel techniques to automate the selection of weights in a combination: the α-normalization,
which scales different distances in order to use them in a weighted combination of distances; the
weighting by max-ρ algorithm, which automatically allots weights that maximize the intrinsic dimensionality of the combination; and the weighting by max-τ algorithm,
which automatically allots weights that maximize the value that α-normalizes the combination. The most important conclusions from the experiments in this chapter are:
– Using the α-normalization, weighting by max-ρ, and weighting by max-τ algorithms, it
is possible to automatically locate a set of weights that achieve high effectiveness without
requiring the use of training data.
– The spatio-temporal combined distance can improve the effectiveness. In fact, the temporal weighted combination of global and acoustic descriptors can achieve the maximum
effectiveness for MUSCLE-VCD-2007, i.e., to detect all the copies without giving any false
alarm. This result outperforms the state-of-the-art systems evaluated on this dataset.
– The use of some common non-metric distances does not produce a noticeable improvement in effectiveness.
• In Chapter 9 we detail the developed techniques for improving the efficiency of the Similarity
Search task. In particular, we focus on the use of static and dynamic pivots. In this chapter we
propose the Approximate Search with Pivots, which uses a fast estimator to discard most of the
irrelevant objects of the search; the Approximate Search with Pivots for Local Descriptors,
which is an adaptation of the previous search to local descriptors; the Two-step search, which
divides a combined search into an approximate search and an exact search; and the Snake
Table, which is a pivot table using dynamic pivots. The most important conclusions from the
experiments in this chapter are:
– The Approximate Search with Pivots shows a satisfactory effectiveness-versus-efficiency
trade-off. The search time can be drastically reduced at a small cost in the effectiveness.
– The Approximate Search with Pivots for Local Descriptors can increase both effectiveness
and efficiency of local descriptors when it is compared with the exact search.
– The Snake Table can improve the efficiency without decreasing the effectiveness by using
previous query objects as dynamic pivots.
• In Chapter 10 we detail the techniques used at the Copy Localization task. We determine
the boundaries of a copy by using a voting algorithm between the objects retrieved by the
similarity search. The most important conclusions from the experiments in this chapter are:
– The voting algorithm can improve its effectiveness when it considers more than one
similar object by weighting the votes according to rank positions.
– The inclusion of similarity values in the voting algorithm has almost no impact on the
effectiveness.
• In Chapter 11 we review our participation in the CBVCD evaluation at TRECVID 2010 and
2011. Here, we tested most of the proposed algorithms. The most important conclusions from
the experiments in this chapter are:
– P-VCD achieves competitive performance with other state-of-the-art CBVCD systems,
especially in detecting copies without false alarms.
– The Preprocessing task has a high impact on increasing the effectiveness of detection.
– The Two-step search can improve the effectiveness, especially for multimodal searches.
– The metric space approach can successfully be applied to CBVCD systems: it can improve the effectiveness by enabling a novel method to combine multimodal information
at the search time, and it can improve the efficiency by enabling the use of mathematical
properties to accelerate searches.
Finally, in Chapter 12 we conclude this thesis by summarizing our main contributions, and
offering some final thoughts regarding this work. Additionally, we outline different research trends
we plan to develop in future work.
The thesis also includes three appendixes: Appendix A presents the ground-truth for the
MUSCLE-VCD-2007 dataset, and Appendix B and Appendix C publish some data provided by
NIST for the CBVCD evaluation at TRECVID 2010 and TRECVID 2011, respectively. We provide
these data for academic purposes, in order to give insight into this thesis and its contributions.
P-VCD has been released as an Open Source Project under the GNU General Public License
version 3.0 (GPLv3). Most of the source code used in this thesis can be freely downloaded from its
website: https://sourceforge.net/projects/p-vcd/.
1.2
Thesis Publications
The main contributions in this thesis have been published in the following research papers:
[Barrios and Bustos, 2009] This work, developed during the preparation of this thesis, presents
a content-based image retrieval system that combines low-level features with high-level features
following the metric space approach.
[Barrios, 2009] This is the thesis proposal. This work reviews the related work on the CBVCD
topic and proposes to use the metric space approach to address the problem.
[Barrios and Bustos, 2010] This work summarizes the first participation of the PRISMA team
in the CBVCD evaluation at TRECVID 2010, described in Chapter 11.
[Barrios and Bustos, 2011a] This work presents the Approximate Search with Pivots algorithm
and the Copy Localization algorithm, described in Chapter 9 and Chapter 10, respectively.
[Barrios and Bustos, 2011b] This work presents the weighting by max-ρ algorithm, described
in Chapter 8.
[Barrios, Bustos, and Anguera, 2011] This work summarizes the second participation of the
PRISMA team in the CBVCD evaluation at TRECVID 2011, described in Chapter 11.
[Barrios and Bustos, 2013] Journal paper. This work details the results achieved at TRECVID
2010 and presents the weighting by max-τ algorithm, described in Chapter 8.
[Barrios, Bustos, and Skopal, 2012] This work presents the Snake Table, described in Chapter 9.
[Barrios, Bustos, and Skopal, 2013] Journal paper. This work presents the Snake Table and
gives new insights about the analysis and indexing of query sets.
This thesis also includes joint work with Telefonica Research. The results of that collaboration have been published in:
[Anguera, Adamek, Xu, and Barrios, 2011a] This work summarizes the participation of
the Telefonica Research team in the CBVCD evaluation at TRECVID 2011. That participation included
a joint submission between the Telefonica team and the PRISMA team, described in Chapter 11.
[Anguera, Barrios, Adamek, and Oliver, 2011b] This work presents a late fusion algorithm
for multimodal detection in CBVCD systems, described in Chapter 4.
This thesis also includes some unpublished results.
Part I
Background and Related Work
Chapter 2
Image Processing and Analysis
In this chapter we briefly review the image processing and image analysis techniques that
are used in this thesis. The content of this chapter is mainly drawn from the books by Gonzalez
and Woods [2007], Bradski and Kaehler [2008], and Szeliski [2010].
2.1 Concepts
Image In this thesis, we define image as a two-dimensional function I(x, y), where x and y are
called the spatial coordinates. The coordinates are bounded integer values x ∈ [0, width − 1] and
y ∈ [0, height − 1], hence the image consists of width×height picture elements or pixels.
Gray images In the case of gray images, the value of I(x, y) is a single value, representing the
gray level or intensity for each pixel. The depth of the image is the number of bits used to indicate
the value for each pixel. Commonly used depths are 8-bits, where each value is an integer between 0
(black) and 255 (white), and 32-bits, where each value is a floating point number between 0 (black)
and 1 (white). Additionally, the 16-bits-signed depth, where each value is an integer between -32,768
and 32,767, is commonly used for storing intermediate values during the processing of 8-bits depth
images.
Color images In the case of color images, a color space must be first selected. The color space
(also called color model or color system) specifies a multidimensional space, a coordinate system,
and a region in that space in which a color corresponds to a single multidimensional point. Each
coordinate of the color space is called a color channel. Most color spaces define three coordinates,
thus the value of I(x, y) is a triplet (a, b, c) containing the value for each of the three channels.
Alternatively, a color image can be seen as a combination of three gray images Ia , Ib , Ic each one
containing the value for each channel, therefore I(x, y) = (Ia (x, y), Ib (x, y), Ic (x, y)).
RGB color space The most common color space is RGB, which represents colors as a uniform
combination of primary colors red, green and blue. In this case, the value of each pixel is a triplet
containing the value for each channel, i.e., I(x, y)=(r, g, b). The region containing RGB colors is
usually represented by a unitary cube (see Figure 2.1). In this cube, the gray intensities lie in the
main diagonal with r=g=b. The RGB space is usually used to display and store images due to its
closeness to displaying hardware. Frequently, RGB images are stored using 24-bits color depth,
which corresponds to using 8-bits depth for each of the three channels. This color depth, also known
as full-color or true-color, can represent up to (2^8)^3 = 16,777,216 colors.

Figure 2.1 – The RGB unitary cube.
HSV color space The HSV color space represents colors as a combination of three characteristics:
Hue, Saturation, and Value. Hue is a value that represents a pure color (e.g., yellow, red, etc.).
Saturation measures the colorfulness of the color, i.e., the degree in which a color is pure (full
saturation) or is diluted with white light (no saturation). Value measures the brightness of the
color between black and its pure color. The region containing HSV colors is usually represented
by a cylinder, where Hue is a circular coordinate, Saturation is a radial coordinate, and Value is a
vertical coordinate. The HSV space is created by a geometrical transformation of the RGB space.
The colors in HSV space with medium Value are a circular mapping of the colors in RGB space that
lie in the main triangle perpendicular to the gray diagonal in the RGB cube. The HSV space intends
to represent colors in a space closer to the way humans describe and represent colors; therefore, it
is usually used for color manipulation and image representation. An image stored with RGB space
needs to be converted into the HSV space following a mathematical conversion between both color
spaces.
Histograms Given n observations, the histogram counts the number of observations that fall into
each of k disjoint categories (known as bins). Given a gray image I of size W×H with n gray
levels (e.g., n=256 for 8-bits depth), the normalized gray histogram is a function h defined as:

h(k) = \frac{n_k}{W \cdot H}

where n_k is the number of pixels of I containing the k-th gray level. The gray histogram (or intensity
histogram) represents the distribution of gray levels in the image, discarding the spatial locations.
It provides useful information about the global content of the image that can be used either for
image processing or for image analysis.
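As a minimal sketch (not part of the thesis code), the normalized gray histogram defined above can be computed with NumPy as follows; the function name gray_histogram and the random test image are illustrative assumptions.

```python
import numpy as np

def gray_histogram(img, n_levels=256):
    """Normalized gray histogram h(k) = n_k / (W*H) for an 8-bit gray image."""
    # Count how many pixels fall into each of the n_levels bins.
    counts = np.bincount(img.ravel().astype(np.int64), minlength=n_levels)
    # Normalize by the total number of pixels so the histogram sums to 1.
    return counts / img.size

# Example: a random 8-bit gray image of size 320x240.
img = np.random.randint(0, 256, size=(240, 320), dtype=np.uint8)
h = gray_histogram(img)
assert abs(h.sum() - 1.0) < 1e-9
```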
Analogously, the color histogram represents the distribution of colors in the image, discarding
the spatial locations. In the case of full-color images, a common approach to reduce the number
of bins consists in partitioning the RGB color space following a regular grid producing nr ×ng ×nb
bins, where n_r, n_g and n_b are the number of ranges into which each dimension of the RGB cube is
divided (see Figure 2.2).

Figure 2.2 – Gray and Color Histogram. (a) Sample gray image. (b) 32-bins gray histogram. (c) Sample color image. (d) 27-bins RGB histogram (regular grid n_r = n_g = n_b = 3).
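A similar sketch for the regular-grid color histogram, assuming 8-bit RGB images and the partition n_r = n_g = n_b = 3 of the example above; all names are illustrative.

```python
import numpy as np

def rgb_grid_histogram(img_rgb, nr=3, ng=3, nb=3):
    """Normalized color histogram over a regular nr x ng x nb partition of the RGB cube."""
    # Map each 8-bit channel value to its range index in [0, n-1].
    r = (img_rgb[..., 0].astype(np.int64) * nr) // 256
    g = (img_rgb[..., 1].astype(np.int64) * ng) // 256
    b = (img_rgb[..., 2].astype(np.int64) * nb) // 256
    # Combine the three range indices into a single bin index.
    bins = (r * ng + g) * nb + b
    counts = np.bincount(bins.ravel(), minlength=nr * ng * nb)
    return counts / bins.size

img_rgb = np.random.randint(0, 256, size=(240, 320, 3), dtype=np.uint8)
h = rgb_grid_histogram(img_rgb)   # 27 bins, sums to 1
```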
2.2 Image Processing
Point operations The point operators modify each pixel of the image independently of the
others. Given an input image I and a point operator f, the output image G is defined as G(x, y) =
f(I(x, y)). Some examples of these operators are: the conversion between color spaces, the gamma
correction (defined by the function f(x) = x^γ), the adjustment of brightness and contrast (defined
by f(x) = ax + b with constants a > 0 and b), the binarization or thresholding (f(x) = 0 iff x < t,
f(x) = 1 otherwise, for some predefined threshold t), and the histogram equalization, which takes
the histogram of the image and defines a function f(x) that produces a flatter histogram.
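The point operators listed above are one-liners in practice. A hedged sketch for a 32-bit (float, values in [0, 1]) gray image follows; the parameter values are arbitrary examples, not values used in the thesis.

```python
import numpy as np

def gamma_correction(img, gamma):
    """f(x) = x^gamma, for a float image with values in [0, 1]."""
    return np.power(img, gamma)

def brightness_contrast(img, a, b):
    """f(x) = a*x + b, clipped back to the valid range [0, 1]."""
    return np.clip(a * img + b, 0.0, 1.0)

def threshold(img, t):
    """Binarization: 0 if x < t, 1 otherwise."""
    return (img >= t).astype(np.float32)

img = np.random.rand(240, 320).astype(np.float32)  # 32-bit gray image in [0, 1]
out = threshold(brightness_contrast(gamma_correction(img, 0.5), 1.2, -0.1), 0.5)
```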
Convolution In image processing, the convolution is a linear operator (denoted by ∗) that filters
an image using a second, smaller image (known as filter, mask, or kernel). It is also known as linear
filtering or spatial filtering. Given a gray image I of size W×H and a filter w of size a×b, the
convolution produces an image G of size W ×H defined as:
G(x, y) = (I \ast w)(x, y) = \sum_{i=0}^{a-1} \sum_{j=0}^{b-1} I\left(x - \lfloor a/2 \rfloor + i,\ y - \lfloor b/2 \rfloor + j\right) \cdot w(i, j)
There are several ways to handle the convolution in the boundaries of I. In this thesis,
we extend the first and last column and row of I to contain the whole filter, i.e. I(x, y) =
I(max{0, min{x, W − 1}}, max{0, min{y, H − 1}}). In order to have a precise definition of the
center of the filter, the size of the filter is usually restricted to odd numbers.
11
Figure 2.3 – Blurring filters. (a) 3×3 Average: (1/9) · [[1, 1, 1], [1, 1, 1], [1, 1, 1]]. (b) 3×3 Gaussian: (1/16) · [[1, 2, 1], [2, 4, 2], [1, 2, 1]].
The convolution for color images is usually performed by separating the image into its color
channels, convolving each channel with the filter, and combining the filtered channels into the final
image, i.e.:
G_{rgb}(x, y) = (I_{rgb} \ast w)(x, y) = \left( (I_r \ast w)(x, y),\ (I_g \ast w)(x, y),\ (I_b \ast w)(x, y) \right)
Blur The blurring of images is frequently used to remove small details and filter noise. It is usually
performed by the convolution of the image with an average filter or Gaussian filter. The average
filter is a square matrix with all values equal, summing to 1. The Gaussian filter is created by
adjusting a 2D Gaussian function into the center of the filter and adjusting the standard deviation
according to the size of the filter. The values in the Gaussian filter are normalized to sum 1. Given
a standard deviation σ, the 2D Gaussian function is defined as:
Gσ (x, y) =
1 − x2 +y2 2
e 2σ
2πσ 2
One alternative to control the amount of blur in the image, is to perform a convolution with
a larger filter, e.g., a 5×5 average filter will produce more blur than a 3×3 average filter. Another
alternative is to leave the filter unmodified and perform many successive convolutions with the
image.
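A possible construction of the normalized Gaussian filter (a sketch assuming NumPy; it reuses the convolve2d_clamped helper sketched above):

    import numpy as np

    def gaussian_kernel(size=5, sigma=1.0):
        # samples the 2D Gaussian on a size x size grid and normalizes it to sum to 1
        half = size // 2
        ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
        g = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))
        return g / g.sum()

    # blurred = convolve2d_clamped(img, gaussian_kernel(5, 1.0))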
Gradient The gradient of a two-dimensional function I is defined as:

∇I = (Ix, Iy) = (∂I/∂x, ∂I/∂y)

The gradient ∇I(x, y) is a vector which represents the increase rate and the orientation of the maximum increase of I at the point (x, y). The magnitude and the orientation of the gradient are given by:

mag(∇I) = √(Ix² + Iy²) ≈ |Ix| + |Iy| ,    θ(∇I) = arctan(Iy / Ix)    (2.1)
Because images are two-dimensional discrete functions, the partial first-order derivatives are defined as:

Ix(x, y) = ∂I/∂x (x, y) = I(x + 1, y) − I(x, y)
Iy(x, y) = ∂I/∂y (x, y) = I(x, y + 1) − I(x, y)
Figure 2.4 – Sobel filters. (a) Sobel filter for Ix: [−1 0 1; −2 0 2; −1 0 1]. (b) Sobel filter for Iy: [−1 −2 −1; 0 0 0; 1 2 1].
Figure 2.5 – Laplacian filter: [0 1 0; 1 −4 1; 0 1 0].
The Sobel operators are filters that are frequently used to approximate Ix and Iy. Figure 2.4 shows two commonly used Sobel operators. The gradient for each pixel can be approximated by performing an independent convolution between I and each Sobel operator, and then using Equation 2.1 to combine them.
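A small sketch of this procedure, reusing the convolve2d_clamped helper sketched earlier (NumPy assumed; the function name is ours):

    import numpy as np

    SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    SOBEL_Y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

    def gradient_magnitude_orientation(img):
        # approximate Ix and Iy with the Sobel filters, then combine them (Equation 2.1)
        ix = convolve2d_clamped(img, SOBEL_X)
        iy = convolve2d_clamped(img, SOBEL_Y)
        magnitude = np.sqrt(ix ** 2 + iy ** 2)
        orientation = np.arctan2(iy, ix)
        return magnitude, orientation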
Laplacian The Laplacian is an operator defined as the sum of the second-order derivatives, hence the Laplacian of a two-dimensional discrete function I is:

∇²I = Ixx + Iyy

where the second-order derivatives correspond to:

Ixx(x, y) = ∂²I/∂x² (x, y) = I(x + 1, y) + I(x − 1, y) − 2I(x, y)
Iyy(x, y) = ∂²I/∂y² (x, y) = I(x, y + 1) + I(x, y − 1) − 2I(x, y)
The Laplacian can be calculated with a convolution of I with the filter shown in Figure 2.5.
It is usually used for sharpening images, i.e. to enhance details that have been blurred, and for
edge detection, as described next.
Edge Detection A common approach for edge detection is the gradient approach. It is based on
two steps: first analyzing the first-order derivative of the image and then analyzing the second-order
derivative. It should be noted that an edge is a “local” concept, i.e., the decision of marking a pixel
as an edge does not affect the decision for other pixels. The first step calculates the magnitude of
the gradient for each pixel, either using the Sobel operator or another technique. A pixel is marked
as an edge when the magnitude of the gradient for that point is higher than a predefined threshold
t. The second step uses the sum of the second-order derivatives in order to produce thin edges. A
pixel is marked as an edge if it was marked by the first step and the Laplacian of the pixel is zero.
This approach was formalized by Canny [1986], who also developed two improvements for the edge selection: to follow the perpendicular direction of the gradient, i.e., to select pixels by “walking” along the edge, and to use an uncertainty threshold t′ < t in order to mark edge pixels with gradient magnitude between t′ and t when they are adjacent to an already marked pixel. These techniques enable the selection of thin and long edges (see Figure 2.6).

Figure 2.6 – Edge selection. (a) Sample gray image. (b) Edges according to the Canny algorithm. (c) Edges according to the DoG algorithm.
A second approach for edge detection is through the use of a Difference of Gaussians (DoG). It consists in applying a Gaussian filter to an image and comparing the blurred image against the original image. The subtraction between the original and blurred images highlights zones with great variation in pixel values, therefore it detects edges as well as noise. In order to reduce the noise, the difference image is computed between two blurred images I1 and I2, where I1 is the result of applying a Gaussian filter with standard deviation σ, and I2 is the result of applying a Gaussian filter with standard deviation kσ, for a given step size k. Finally, the edges are detected by thresholding the absolute value of the difference (see Figure 2.6).
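A minimal sketch of DoG edge detection under these definitions (reusing the gaussian_kernel and convolve2d_clamped helpers sketched earlier; the threshold value is arbitrary):

    import numpy as np

    def dog_edges(img, sigma=1.0, k=1.6, t=5.0):
        # blur with sigma and k*sigma, subtract, and threshold the absolute difference
        i1 = convolve2d_clamped(img, gaussian_kernel(7, sigma))
        i2 = convolve2d_clamped(img, gaussian_kernel(7, k * sigma))
        return (np.abs(i1 - i2) > t).astype(np.uint8)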
After the edge pixels have been determined, a common task is to analyze their distribution
in order to locate shapes in the image. In particular, given a set C containing detected edge pixels,
the line detection process consists of detecting sets of collinear edge pixels. The set C may contain
pixels not belonging to any valid line (i.e., outliers), thus the detection process should be robust
to them. The following section reviews two algorithms that are frequently used to locate collinear
pixels: RANSAC and Hough Transform.
Random Sample Consensus (RANSAC) It is a randomized algorithm that iteratively tests and corrects many different lines.

Any line is defined by a pair of points, thus RANSAC randomly picks two pixels from C, creates a line L, and calculates the number of inliers to L in C, i.e., locates the pixels in C that are traversed by L. A correction cycle is usually performed to improve the convergence of RANSAC: using the located inliers to L, the least-squares method is used to calculate a new corrected line L′, and in turn the inliers for L′ are located. The correction cycle ends when no new inliers are located. In summary, the RANSAC algorithm works as follows:
• Iterate n times:
1. Select random seed pixels S ⊆ C with |S| = 2.
2. M ← S.
3. Use least-squares method to fit a line LM for all pixels in M .
4. Calculate the set of inliers IM ⊆ C, which contains the pixels at a distance less than or equal to ε from LM.
5. If |IM| > |M|, then M ← IM and go to step 3.
6. Otherwise, end the current iteration.
• Among all the evaluated lines, choose the line with the most inliers, i.e., the LM with maximum |IM|.
The performance of RANSAC is controlled by the parameters n and ε. RANSAC can succeed at detecting a line when there is a reasonable chance of selecting actual inliers as seeds at step 1. That chance increases linearly with the number of iterations, but it decreases quadratically with the number of outliers in C. Therefore, RANSAC is recommendable only when the ratio of inliers to outliers is relatively high.
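The following sketch illustrates the iteration and correction cycle described above for an array of edge pixels of shape (N, 2); it is not the implementation used in this thesis, and the line is represented by a unit normal n and an offset c such that n·p = c:

    import numpy as np

    def fit_line(points):
        # total least-squares fit: unit normal n and offset c with n . p = c
        centroid = points.mean(axis=0)
        _, _, vt = np.linalg.svd(points - centroid)
        normal = vt[-1]
        return normal, float(normal @ centroid)

    def ransac_line(points, n_iter=200, eps=2.0, seed=0):
        rng = np.random.default_rng(seed)
        best_inliers = np.zeros(0, dtype=int)
        for _ in range(n_iter):
            sample = rng.choice(len(points), size=2, replace=False)        # step 1
            M = points[sample]                                             # step 2
            while True:
                normal, c = fit_line(M)                                    # step 3
                inliers = np.where(np.abs(points @ normal - c) <= eps)[0]  # step 4
                if len(inliers) <= len(M):                                 # step 6
                    break
                M = points[inliers]                                        # step 5
            if len(inliers) > len(best_inliers):
                best_inliers = inliers
        return fit_line(points[best_inliers]), best_inliers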
Hough Transform It is an exhaustive and deterministic algorithm that efficiently tests every
possible line passing through every pixel in C.
All the lines passing through a pixel with coordinates (x0, y0) satisfy the equation:

y0 = a·x0 + b ,    (2.2)

for varying values of a and b. The parameter space corresponds to the two-dimensional space where pairs (a, b) reside. Hence, a line L in the image corresponds to a point in the parameter space. The parameters for every line passing through (x0, y0) can be characterized by rewriting Equation 2.2:

b = −a·x0 + y0    (2.3)

Using Equation 2.3, the line L passing through many pixels in C can be detected by drawing a line in the parameter space for each pixel in C, and locating the pair (a′, b′) with the most intersections. The efficiency of the Hough Transform comes from discretizing the parameter space into a voting table T with fixed-size cells, where each cell T[i, j] represents a small interval of values [ai, ai + ε] × [bj, bj + ∆].
However, an implementation issue arises from using Equation 2.3, since a and b may vary
between −∞ and +∞. In order to use a bounded parameter space, Equation 2.2 can be replaced by
the normal representation of lines. In that representation, a line L is defined by two parameters (see
Figure 2.7): ρ, which is the minimum distance of L to the origin; and θ, which is the angle between
the normal to L passing through the origin and the x-axis. Using the normal representation, all the
lines passing through a pixel (x0 , y0 ) satisfy the equation:
ρ(θ) = x0 cos θ + y0 sin θ
(2.4)
The parameter space defined by θ and ρ is bounded. In fact, because only the lines passing through a pixel in the image need to be represented, θ is bounded to the range [−π/2, π] and ρ is bounded to the range [0, D], with D = √(W² + H²).
Voting for all the lines passing through (x0 , y0 ) corresponds to summing one vote to every
cell that satisfies Equation 2.4. A cell receiving k votes means there are k pixels sharing the same
line (i.e., k collinear pixels), hence the line detection process consists in running the voting algorithm and selecting the most voted cell.

Figure 2.7 – The normal representation of line L.

In summary, the Hough Transform algorithm works as follows:
• Allocate a table T with n × m cells, and initialize each cell to zero.
• Calculate Θ={θ0 , ..., θn−1 } by uniform sampling the range [−π/2, π].
• Calculate P ={ρ0 , ..., ρm−1 } by uniform sampling the range [0, D].
• For each pixel (x0, y0) ∈ C:
1. For each θi ∈ Θ:
(a) Calculate ρ(θi) according to Equation 2.4.
(b) If ρ(θi) is out of the range [0, D], end the current iteration, continuing with θi+1.
(c) Locate the value ρj ∈ P which is the closest to ρ(θi).
(d) T[i, j] ← T[i, j] + 1.
• Locate the cell T[r, s] with the most votes, and choose the line L determined by the parameters θr and ρs.
The performance of the Hough Transform is controlled by the parameters n and m. Their values present a tradeoff between accuracy and computational cost. Once n and m have been fixed, the processing time of the Hough Transform grows linearly with the number of points (independently of the ratio of inliers to outliers), therefore it is a convenient approach when there is a high number of outliers.
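A compact sketch of the voting procedure (assuming NumPy and an array of edge pixels; hough_lines is a name we introduce here, not a library call):

    import numpy as np

    def hough_lines(points, W, H, n_theta=180, n_rho=200):
        # vote in the discretized (theta, rho) parameter space and return the best line
        D = np.sqrt(W ** 2 + H ** 2)
        thetas = np.linspace(-np.pi / 2, np.pi, n_theta)
        table = np.zeros((n_theta, n_rho), dtype=int)
        for x0, y0 in points:
            rho = x0 * np.cos(thetas) + y0 * np.sin(thetas)        # Equation 2.4 for every theta
            valid = (rho >= 0) & (rho <= D)
            j = np.rint(rho[valid] * (n_rho - 1) / D).astype(int)  # closest sampled rho
            np.add.at(table, (np.where(valid)[0], j), 1)
        i, j = np.unravel_index(table.argmax(), table.shape)
        return thetas[i], D * j / (n_rho - 1)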
RANSAC and the Hough Transform are two methods for detecting lines in a set of pixels. Besides line detection, these two algorithms can be generalized to detect subsets satisfying a given condition. For instance, the edge pixels lying on a circumference can be located either using RANSAC (randomly picking three pixels, generating a candidate model, and locating inliers to the model) or the Hough Transform (using a three-dimensional parameter space comprising the center and radius). Moreover, these two algorithms can also be applied to object localization, as described in the next section.
2.3 Content Description
The content description consists in computing one or many descriptors (also called feature
vectors or fingerprints) to represent some visual characteristic or feature of the image (like the
distribution of colors, orientation of edges, etc.). The extracted descriptors usually correspond to
multidimensional vectors or binary sequences.
The descriptors can be either global or local. A global descriptor represents the content of
the whole image. A local descriptor represents the content of only a small zone of the image, hence
the whole image is represented by many local descriptors. The following subsections review some
basic feature extraction methods for images. In Chapter 4 we review feature extraction methods
specific to CBVCD.
2.3.1 Global Descriptors
Intensities A simple global description is to represent an image by the intensities of its pixels.
Given an image I of size W ×H, convert I to a gray image IY , and create a vector with W · H
dimensions, where each dimension corresponds to the intensity of a pixel in IY . If the dataset
contains images of different sizes, then each image should be first scaled to a fixed size. A common
distance to compare two vectors is the Euclidean distance. This approach is used, for example, in
processes such as clustering images or PCA on images. Chapter 7 presents an evaluation of this
descriptor using the name KF.
Histograms A widely-used global descriptor is to represent the image by its gray or color histogram. As described in Section 2.1, the gray or color histogram is a vector summarizing the gray
intensities or the colors in the image, discarding the spatial locations. Chapter 7 presents an evaluation of this descriptor using the names GH (gray histogram), IH (histogram by channels), and CH
(color histogram).
MPEG-7 descriptors MPEG-7 is a standard for describing multimedia documents that enables the attachment of high-level data (concepts and full-text descriptions) and low-level data (content-based descriptors) to multimedia files. The low-level descriptions defined in the standard are frequently used in CBMIR systems. Manjunath et al. [2001] provide an overview of the color and texture descriptors defined by this standard. The MPEG-7 eXperimentation Model1 is the reference implementation of these descriptions. In this thesis we use the Edge Histogram, which is described next.
Edge Histogram The Edge Histogram descriptor represents the spatial distribution of edges in the image [Manjunath et al., 2001]. The extraction method converts an image to gray scale, partitions it into 4×4 zones, and, for each zone, calculates a local edge histogram. To calculate the local edge histogram, each zone is scaled and divided into many 2×2 blocks. The main orientation of the edges for each block is determined by measuring the energy of five different orientation kernels (see Figure 2.8). If the kernel with maximum strength exceeds a certain threshold, the block is marked as an edge block. The local edge histogram for a zone is created by accumulating the orientations of its edge blocks. The descriptor is a vector with 80 dimensions (4 · 4 · 5), where each dimension is quantized into three bits. Two descriptors can be compared with the L1 distance. Chapter 7 presents an evaluation of this descriptor using the name EH.
Ordinal Measurement The Ordinal Measurement descriptor captures the relative ordering of intensities in the image [Bhat and Nayar, 1998; Kim, 2003]. The extraction method converts an image into gray scale and divides it into n×m zones. For each zone, the average intensity is calculated. The relative ordering between zones is expressed by their ranks. The descriptor is a permutation of (1, . . . , n · m) where each dimension is the rank of each zone after sorting them in ascending order by intensity (see Figure 2.9). Two descriptors can be compared with the L1 or the Hamming distance. Chapter 7 presents an evaluation of this descriptor using the name OM.

1 MPEG-7 XM. ISO/IEC 15938-6:2003. Information technology – Multimedia content description interface – Part 6: Reference software.

Figure 2.8 – Edge Histogram descriptor. (a-e) Five orientation filters (four directional and one isotropic).

Figure 2.9 – Ordinal Measurement 4×4 descriptor. (a) Sample gray image. (b-c) Average intensities for the image divided into 4×4 zones. (d) Rank assigned to each zone.
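A minimal sketch of the OM extraction and its L1 comparison, assuming NumPy and a gray image stored as a 2D array (the function names are ours):

    import numpy as np

    def ordinal_measurement(img_gray, n=4, m=4):
        # average intensity per zone, replaced by its rank (1 = darkest zone)
        H, W = img_gray.shape
        means = np.zeros(n * m)
        for i in range(n):
            for j in range(m):
                zone = img_gray[i * H // n:(i + 1) * H // n, j * W // m:(j + 1) * W // m]
                means[i * m + j] = zone.mean()
        ranks = np.empty(n * m, dtype=int)
        ranks[np.argsort(means)] = np.arange(1, n * m + 1)
        return ranks

    def om_distance(r1, r2):
        # L1 distance between two OM descriptors
        return int(np.abs(r1 - r2).sum())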
2.3.2 Local Descriptors
The process of calculating local descriptors can be divided into two phases. The detection phase locates points that are geometrically stable under different transformations and that are likely to recur between different images of the same object; these points are called keypoints or interest points. The description phase analyzes the neighborhood of a keypoint and represents the content of the image around it with a single descriptor.
Given two images, the comparison method commonly contains three steps: 1) extract local
descriptors from each image, 2) match the most similar descriptors between both images, and 3)
select a subset of matches with spatial consistency.
Once a subset of matches has been located, the quantification of the similarity between
images may consist of counting the number of matched descriptors, or calculating the percentage
of local descriptors in the first image that have a matching counterpart in the second image.
Keypoint detection The keypoint detection methods can be classified by their core technique: the autocorrelation and the Difference of Gaussians. The autocorrelation consists of comparing a zone of the image with a small displacement of itself. The autocorrelation function can be defined for each pixel and displacement as:

E(x0, y0, ∆x, ∆y) = Σ_x Σ_y (I(x + ∆x, y + ∆y) − I(x, y))² · w(x − x0, y − y0)
where w is a weighting window that defines the boundaries of the zones. When the autocorrelation
for a pixel (x0 , y0 ) is high for any displacement, it implies that the pixel can be distinguished from
its environment, and will likely be distinguishable even under geometric transformations. Hence,
the keypoint detection is performed by analyzing this function. The first-order approximation of I
is:
I(x + ∆x , y + ∆y ) = I(x, y) + Ix (x, y) · ∆x + Iy (x, y) · ∆y
Replacing that approximation into the autocorrelation function:

E(x0, y0, ∆x, ∆y) = Σ_x Σ_y (Ix(x, y) · ∆x + Iy(x, y) · ∆y)² · w(x − x0, y − y0)

This function can be analyzed using the autocorrelation matrix (also known as the tensor image):

A = [ Ix²     Ix·Iy
      Ix·Iy   Iy²  ]
The pixels that are distinguishable correspond to the points where the eigenvalues of A, λ0 and
λ1 , are high. Instead of directly calculating these eigenvalues, Harris and Stephens [1988] use the
indicator:

g = det(A) − α·trace(A)² = λ0·λ1 − α·(λ0 + λ1)²
with α=0.06. The pixels where g exceeds a predefined threshold are marked as keypoints. Shi
and Tomasi [1994] propose to calculate the eigenvalues and mark as keypoints the pixels where the
lowest eigenvalue is a local maximum.
The Difference of Gaussians (DoG) approach detects keypoints based on the changes that
produce a Gaussian filter on the image. In order to achieve invariance to scale, the detection process
uses a continuous function of scale, known as scale space:
L(x, y, σ) = Gσ (x, y) ∗ I(x, y)
The keypoint detection analyzes a sequence of differences of two consecutive scales separated by a factor k:

Dj(x, y) = L(x, y, k^(j+1)·σ) − L(x, y, k^j·σ)

The scale factor k is fixed to a value 2^(1/s), thus there is an integer number s of differences before doubling σ (which is called “an octave”). Given three consecutive difference images Dj−1, Dj, and Dj+1, the pixel (x, y) is a keypoint when Dj(x, y) is a local maximum or minimum among 27 neighbor pixels: 9 pixels surrounding Dj−1(x, y), 9 pixels surrounding Dj(x, y), and 9 pixels surrounding Dj+1(x, y). This approach selects white circular zones surrounded by black, or vice versa, also known as “blobs”. These keypoints are complementary to the ones detected by the autocorrelation approach.
Finally, a keypoint P is determined by up to four values: the spatial location (Px, Py), the size Pσ of the neighborhood that it represents (given by the level of the scale space where P was detected), and the orientation Pθ (computed from the neighborhood of P in order to provide invariance to rotations).
Keypoint description The descriptor represents the content of the image around a keypoint
with a single vector. The descriptors most commonly used are SIFT [Lowe, 2004] and SURF [Bay
et al., 2008].
SIFT is a 128-dimensional vector that represents the orientation of the gradient around the
keypoint. It uses σ and θ from the keypoint to define the size and the orientation of the neighborhood
of the image to represent. Once the neighborhood has been defined, it is divided into 4×4 zones,
and for each zone an 8-bins histogram is calculated with the orientations of the gradient. Finally, the
SIFT descriptor is the concatenation of the 16 histograms, producing the 128-d vector. A common
variation for SIFT is PCA-SIFT [Ke and Sukthankar, 2004], which reduces the dimensionality of
the descriptor while improving its effectiveness.
SURF is a 64-dimensional vector representing the orientations of gradients. It computes
the integral image (table of summed areas) to efficiently find the sum of any rectangular area and
estimate the gradients. Bay et al. [2008] show that SURF requires less computational resources
than SIFT while achieving almost the same effectiveness.
Finally, the full image is described by the set of local descriptors, each one containing the
keypoint data (i.e., spatial location, size, and orientation) along with the descriptor itself (i.e., the
vector describing the neighborhood of the keypoint).
Van de Sande et al. [2008] present a comparison of the performance of different variations
of SIFT for the object recognition problem.
Matching local descriptors In order to compare images I1 and I2 , each descriptor in I1 is
compared with every descriptor in I2 using a distance function δ. Two descriptors P ∈ I1 and
Q ∈ I2 create a match (denoted by P → Q) when:
1. Q is the closest object in I2 to P:

∀X ∈ I2 − {Q},  δ(P, Q) ≤ δ(P, X)

2. The ratio between the distance to Q and the distance to the second closest object in I2 is lower than s:

∀X ∈ I2 − {Q},  δ(P, Q) / δ(P, X) ≤ s

In the case of SIFT descriptors, δ is the Euclidean distance in the 128-dimensional space, and s is commonly fixed to 0.8 [Lowe, 2004]. Figure 2.10b shows matching descriptors between two images following this criterion.

Figure 2.10 – Extracting and matching local descriptors between two images. (a) SIFT descriptors for two images: the image on the left has 327 descriptors, the image on the right has 1180 descriptors. (b) Pairs whose ratio between the distance to the 1st-NN and to the 2nd-NN is smaller than s=0.8 (70 pairs). (c) Subset of pairs satisfying a homographic transformation (45 pairs).
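A sketch of this matching criterion for two sets of descriptors stored as NumPy arrays of shape (N, 128) and (M, 128), with M ≥ 2 (the function name is ours):

    import numpy as np

    def match_descriptors(desc1, desc2, s=0.8):
        # for each descriptor of I1, keep its nearest neighbor in I2 only if the
        # ratio between the 1st-NN and 2nd-NN distances is at most s
        matches = []
        for p, d1 in enumerate(desc1):
            dists = np.linalg.norm(desc2 - d1, axis=1)
            nn1, nn2 = np.argsort(dists)[:2]
            if dists[nn1] <= s * dists[nn2]:
                matches.append((p, int(nn1)))
        return matches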
Spatial consistency of matches Given the set of matches C={{P 1 → Q1 }, {P 2 → Q2 }, ...},
the spatial consistency determines the best spatial transformation that is satisfied by most of them,
i.e., it is an algorithm that computes S and T , where S ⊆ C and ∀{P j → Qj } ∈ S, T (P j ) = Qj .
The algorithm must first decide the model of transformation that T will satisfy, for example:
• Translation. T can only move a set of points in I1 to match a set of points in I2, i.e., T corresponds to values (tx, ty) for which ∀{P^j → Q^j} ∈ S:

T(P^j) = Q^j ⇔ (P^j_x, P^j_y)ᵀ + (tx, ty)ᵀ = (Q^j_x, Q^j_y)ᵀ
• Scale+Translation (ST). T can scale and move a set of points, i.e., T corresponds to values (tx, ty, σx, σy) for which ∀{P^j → Q^j} ∈ S:

T(P^j) = Q^j ⇔ [σx 0; 0 σy]·(P^j_x, P^j_y)ᵀ + (tx, ty)ᵀ = (Q^j_x, Q^j_y)ᵀ

Sometimes, the scale is restricted to σx = σy.
• Rotation+Scale+Translation (RST). T can rotate in the plane, scale, and move a set of points, i.e., T corresponds to values (tx, ty, σx, σy, θ) for which ∀{P^j → Q^j} ∈ S:

T(P^j) = Q^j ⇔ [cos θ −sin θ; sin θ cos θ]·[σx 0; 0 σy]·(P^j_x, P^j_y)ᵀ + (tx, ty)ᵀ = (Q^j_x, Q^j_y)ᵀ
• Affine transformation. T can apply a linear transformation maintaining the parallelism between lines, including translation, scale, rotation, and shear. T corresponds to values (tx, ty, a, b, c, d) for which ∀{P^j → Q^j} ∈ S:

T(P^j) = Q^j ⇔ [a b; c d]·(P^j_x, P^j_y)ᵀ + (tx, ty)ᵀ = (Q^j_x, Q^j_y)ᵀ
• Homographic or Projective transformation. T simulates a change of the point of view of the observer, including the Affine transformation and rotation outside the plane. T corresponds to values (a, b, c, d, e, f, g, h, i) for which ∀{P^j → Q^j} ∈ S:

T(P^j) = Q^j ⇔ [a b c; d e f; g h i]·(P^j_x, P^j_y, 1)ᵀ = (w·Q^j_x, w·Q^j_y, w)ᵀ
Depending on the model, a minimum number of matches is needed to determine a transformation.
For instance, the Translation transformations are determined by one match, the ST transformations
by two, the RST and Affine transformations by three, and the Projective transformations are
determined by four matches. Once the model of transformation is decided, the algorithms commonly
used to determine the best T and S are RANSAC and Hough Transform (both already introduced
in Section 2.2).
In the case of RANSAC, the algorithm follows these steps: a set of seed matches are randomly
sampled from C (as many as required by the model of transformation) and a first T candidate is
computed; the set S is created by locating inliers in C for T ; and a corrected model T is computed
using the whole set of inliers S. The algorithm is performed many times using different seeds, and the T that collected the most inliers in S is selected.
In the case of Hough Transform, a single match is used to fix some variables in T , and the
remaining free variables are used to fill the parameter space. In order to decrease the degrees of
freedom in T , the scale and orientation in the keypoint can also be used: A single match is able
to determine a full RST transformation by using rotation θ = Q^i_θ − P^i_θ and scale σ = Q^i_σ / P^i_σ, and
therefore T becomes a single point in the parameter space. The Hough Transform algorithm takes
each match in C, computes T , and increments the corresponding accumulator cell in the parameter
space. The most voted cell defines T and the set of voter matches define S.
2.3.3 Bag of Visual Words
The Bag-of-Visual-Words (BOVW) approach, initially described by Sivic and Zisserman [2003],
is a global description which summarizes the local descriptors computed from an image. The computation of BOVW descriptors follows three main steps:
1. The local descriptors for the whole collection of images are computed.
2. The “codebook” or “visual vocabulary” is determined. Usually, the codebook is obtained by
running the k-means algorithm on a representative sample of local descriptors. The k-means
algorithm is chosen because it is fast and can deal with large sets of vectors. Each centroid
corresponds to a “visual word” and the set of k centroids corresponds to the codebook.
3. The codebook is used to compute a global descriptor for each image. The simplest approach consists in quantizing each local descriptor to its nearest visual word. Thereafter, the BOW descriptor is computed by following the tf-idf weighting [Baeza-Yates and Ribeiro-Neto, 1999]:
• tf : the frequency of occurrence of each visual word in the image.
• idf : each frequency is weighted by the inverse document frequency, i.e., the logarithm of the total number of images in the collection divided by the number of images that contain that visual word.
Following this approach, every image in the collection is described by k values corresponding to the relevance of each visual word in the image. Two descriptors are compared using the cosine similarity:

sim(~x, ~y) = ( Σ_{i=1}^{k} xi·yi ) / ( √(Σ_{i=1}^{k} xi²) · √(Σ_{i=1}^{k} yi²) )
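For illustration, a minimal sketch of the hard-assignment BOW computation with tf-idf weighting and the cosine similarity (assuming NumPy; the codebook is an array of k centroids and the function names are ours):

    import numpy as np

    def bovw_tfidf(images_local_descriptors, codebook):
        # hard-assign each local descriptor to its nearest visual word, then weight by tf-idf
        k = len(codebook)
        tf = np.zeros((len(images_local_descriptors), k))
        for i, descriptors in enumerate(images_local_descriptors):
            for d in descriptors:
                word = int(np.argmin(np.linalg.norm(codebook - d, axis=1)))
                tf[i, word] += 1
        df = (tf > 0).sum(axis=0)                       # images containing each visual word
        idf = np.log(len(tf) / np.maximum(df, 1))
        return tf * idf

    def cosine_similarity(x, y):
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))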
The combination of a codebook and the cosine similarity enables high efficiency by using an inverted index to resolve similarity searches (more details are given in Chapter 3). On the other hand, two main issues arise when following this approach: the high computational cost required by the codebook creation, and the loss of information due to vector quantization. In recent years, many techniques have been developed in order to improve the performance of the codebook computation and/or increase the information extracted from the local descriptors in an image.
Van Gemert et al. [2008] address two issues produced by the hard assignment to the nearest
visual word: uncertainty, defined as the problem produced when many visual words are equally
close to the descriptor, and implausibility, defined as the problem produced when no visual word is
actually close to the descriptor. These problems are overcome by estimating density functions for
each visual word, and replacing the hard assignment by a soft assignment based on the distance of
the descriptor to each visual word.
The Spatial Pyramid approach [Lazebnik et al., 2006] focuses on increasing the spatial information stored by BOW descriptors. It partitions the image into increasingly finer regions and computes histograms of local features found inside each region. Another alternative to reduce the loss of information due to quantization consists in increasing the vocabulary size, e.g., Le et al. [2011] use hierarchical k-means to compute a large vocabulary tree with one million leaves and obtain a BOW representation that achieves high effectiveness.
In other work, the Hamming Embedding [Jégou et al., 2008] extends the information in the
BOW with a binary signature. The signature stores the relative position of each descriptor with
respect to the associated visual word: for each n-dimensional descriptor the Hamming Embedding
is an n-bits sequence, where the ith bit is 1 if the ith dimension of the descriptor was higher than
the ith dimension of its visual word.
Similarly, BOSSA [Avila et al., 2011] extends the information in the BOW by including a
histogram of distances of the descriptors assigned to each visual word.
The BOW approach has been successfully used in image retrieval [Chum et al., 2007; Jégou
et al., 2009], image classification [van Gemert et al., 2008], object localization [Le et al., 2011], and
other related problems.
2.4 Summary
In this chapter we have reviewed some techniques of image processing and image analysis
that are relevant to understand this thesis.
In the case of image processing, we reviewed different techniques for image filtering and edge detection. In particular, the RANSAC and Hough Transform algorithms enable fitting a model to data samples, which can be used to address the line detection problem and to determine the spatial consistency of local descriptors.

In the case of image analysis, we focused on techniques to describe the content of an image. These techniques can broadly be divided into global description and local description. Once the content has been described, the comparison of any two images relies on the comparison of their descriptors.
The efficient comparison of descriptors is a key problem in CBMIR systems. The next chapter
reviews different approaches to address this issue, focusing on a specific approach to compare objects
known as Metric Spaces.
Chapter 3
Similarity Search
In this chapter we briefly review different techniques to perform a similarity search. In
the context of CBMIR, the similarity search is the algorithm that locates within the database the
objects that closely match a given query object. Because databases may be very large, it is essential
to use appropriate indexing algorithms to efficiently perform the searches.
Some indexing and search algorithms are designed for specific kinds of descriptors, while other more generic algorithms are designed for any descriptor modeled as a vector (vector spaces). There are even more generic algorithms designed for any kind of descriptor, as long as the comparison between descriptors satisfies some minimum properties (metric spaces).
3.1 Concepts
Let D be the descriptor space (the domain), let R ⊆ D be a collection of descriptors (the search space), let q ∈ D be the query object (the query), and let d : D × D → ℝ be a function that compares descriptors (the distance). Typically, d is a dissimilarity function between descriptors, i.e., d(a, b) is a small value (near zero) when two descriptors a and b are similar, and d(a, b) is a high value when they are dissimilar.

The range search returns all the objects in R that are closer than a distance threshold ε to q. The k nearest neighbor search (k-NN) returns the k closest objects to q in R, and the NN+range search returns the intersection between a k-NN search and a range search.
The linear scan or sequential scan is an algorithm that resolves the similarity search by
sequentially comparing q to every descriptor in R and retrieving those that are the closest. The
linear scan essentially is a brute-force approach that sets a baseline for high effectiveness and low
efficiency.
We should stress that once the query objects, distance function, and search space have
been fixed, it is highly improbable to outperform the effectiveness achieved by the linear scan.
Hence, in order to increase the effectiveness, the search input must be changed in some semantically
meaningful way, like using a different feature extraction and/or using a different distance function.
Indexing and exact search
In order to improve the efficiency, an index structure can be built to reduce the computational
effort needed to resolve the search. An exact search is any search algorithm that retrieves the same result as the linear scan, presumably with higher efficiency.
A system is commonly divided in two phases: the offline phase corresponds to the bulk
processing of the whole collection of multimedia documents, including the feature extraction and
the creation of the index structure, and the online phase corresponds to processing queries and
resolving similarity searches.
Most of the indexes are designed to be created during the offline phase, that is, a time-expensive process creates the index structure prior to resolving any search. It is expected that the
index will resolve many similarity searches, amortizing its creation time, but no information is a
priori known about the queries that will be resolved afterwards. In the online phase, the index
efficiently receives and resolves any similarity search that may proceed from different sources. All
the searches share the same index, and the index should achieve good performance at every search.
The efficiency gain that an index structure can provide is related to: 1) the cost of creating
the index during the offline phase; 2) the amount of objects that are discarded during the resolution
of similarity searches; 3) the internal cost for deciding whether each object can be discarded or
not. A similarity search will be faster than a linear scan when the saved time, due to discarded
computation, is greater than the overhead spent on index creation and internal cost.
Indexes can be classified as static or dynamic depending on how they manage the insertion
or deletion of objects in R during the online phase. A dynamic index can efficiently update its
structure to add or remove any object while it is in the online phase, hence it can resolve searches
even for growing collections. A static index cannot manage large updates in its structures, therefore
after many modifications of R the whole indexing structure must be rebuilt.
Approximate search
An approximate search is an algorithm that admits differences in the answer compared to
an exact search. The possibility of returning incorrect nearest neighbors permits greater gains
in efficiency at the cost of decreasing effectiveness. However, it should be noted that the loss in
effectiveness due to search approximation can even be negligible when compared to the inherent
subjectiveness of content similarity and imprecision of content description.
Zezula et al. [2005] classify approximate search approaches into two broad categories:
1. Reduction of the data to be examined, analyzing less data than is technically needed.
2. Transformation of the search space, replacing either the objects or the distance to reduce the
search cost.
A more general classification, presented by Patella and Ciaccia [2009], identifies four characteristics of approximate searches and classifies many known algorithms according to them. These
characteristics are:
1. Type of space the approach applies to: for instance, vector spaces or metric spaces.
2. Approximation type: for instance, transformation of the search space (to modify either the
distance or the objects) or reduction of the distance evaluations (by smart pruning of less
promising regions or by just an early termination of the search).
3. Quality guarantees: for instance, deterministic guarantees, probabilistic guarantees, or no
guarantees at all.
4. User interaction: for instance, static or interactive.
In Chapter 9 we present an approximate search algorithm that applies to metric spaces. The
approximation consists in replacing a time-expensive distance by a fast estimator. Then, it performs
an exact search using the original distance only for the objects with most promising estimations.
3.2 Vector Spaces
A vector is an n-tuple of real numbers ~x = (x1, ..., xn), each individual number xi being referred to as a coordinate or component. A vector space is a mathematical structure formed by n-dimensional vectors and two operations: vector addition and scalar multiplication. These operations satisfy properties including associativity, commutativity, distributivity, the identity element for addition, the inverse element, and the identity element for multiplication.
The length of a vector can be measured by defining a norm (notation ||~v||). The angle between any two vectors can be measured by defining a dot product (notation ⟨~x, ~y⟩). A common definition for these functions is given by the Euclidean space (also known as the L2-space):

||~x|| = √( Σ_{i=1}^{n} xi² )

⟨~x, ~y⟩ = cos(∠(~x, ~y)) · ||~x|| · ||~y|| = Σ_{i=1}^{n} xi·yi
A common extension to these definitions is given by the family of vector spaces Lp, which are based on the p-norm:

||~x||p = ( Σ_{i=1}^{n} |xi|^p )^(1/p) ,    with p ≥ 1
The descriptors computed by feature extraction methods commonly correspond to high-dimensional vectors, therefore the use of indexes for vector spaces appears as a natural alternative for improving the efficiency of similarity searches. In the following we review some indexes based on data organization, space division, and multidimensional hashing with random projections.
3.2.1 R-tree
The R-tree [Guttman, 1984] is a balanced tree that resembles the well-known B-tree. The internal nodes store minimum bounding rectangles (MBRs) and the tree leaves store groups of data vectors. It is a hierarchical organization of data vectors that dynamically supports insertions and deletions. When a new vector is added to a previously-full node, the node is split and the overflow is iteratively added to the parents following a backtracking algorithm similar to B-trees. Different criteria can be used to decide the node splitting; the author recommends a heuristic that minimizes the total volume of the resulting division. The search algorithm uses a depth-first traversal where a branch is pruned if its MBR does not overlap the query ball. The R-tree was designed for spatial data (i.e., 2-d and 3-d vectors), but can also be used with high dimensional vectors. However, in high dimensional spaces it is highly probable that most of the MBRs overlap; in that case, the R-tree will not provide any efficiency gain compared to a linear scan.
Hjaltason and Samet [1995] propose a search algorithm that does not use a recursive search
in spatial indexes. Instead, a priority queue sorts the tree nodes according to their distance from
the query object. The search adds the tree root to the queue and then iteratively removes and
examines the node in the head of the queue. If the node is a leaf, its vectors are examined and the
relevant vectors are reported. If the node is not a leaf, its child nodes are added to the queue (and
sorted according to their distance to the query). Böhm et al. [2001] demonstrate the optimality of
this algorithm to retrieve the nearest neighbor in hierarchical spatial indexes.
3.2.2 Kd-tree
The kd-tree [Friedman et al., 1977] is a binary tree which is created by recursively splitting the dataset according to a dimension i and a coordinate value m. The splitting algorithm analyzes the coordinate values of the subset, selects as the splitting dimension i the one whose coordinates show the highest variance, and selects as the splitting value m the median of those coordinate values. A node is created containing i, m, and one child node for each half. Each node is recursively split, producing a balanced binary tree with height ⌈log n⌉.
The original search algorithm for the kd-tree uses a depth-first recursive search that first
locates the node that would contain the query object and compares all the elements in that node.
Then, it compares the neighbor nodes by backtracking the recursion and discarding nodes that do
not intersect with the query ball.
Beis and Lowe [1997] propose an approximate search for the kd-tree called BBF (Best Bin First), which uses a priority queue similar to the search algorithm of Hjaltason and Samet [1995].
The approximation is incorporated by stopping the search when the number of examined nodes has
exceeded some user-defined threshold.
Muja and Lowe [2009] propose to compute several independent kd-trees over the same dataset, each tree selecting a random dimension among the top D dimensions with the greatest variance. The search algorithm traverses all the randomized kd-trees at the same time by examining nodes using a single priority queue. The search starts by adding all the roots to the priority queue, iteratively examines the node at the head of the queue and adds its child nodes to the queue, and stops the search according to the number of nodes that have been examined.
3.2.3 K-means tree
The k-means is a widely used clustering algorithm for vector spaces. It is based on selecting k seed vectors (either by random selection or using some heuristic), and iteratively correcting them to minimize the sum of the squared error [Tan et al., 2009]. The k centroids induce a Voronoi diagram that partitions the space into k cells, hence the search algorithm locates the cells that overlap the query ball, and examines the objects assigned to them.
The hierarchical k-means is a k-ary tree which is created by recursively computing k centroids
and subdividing the dataset. The recursion ends when a cell contains no more than k data vectors.
This produces a tree with height ⌈logk n⌉. Muja and Lowe [2009] adapted the approximate search
for kd-tree to hierarchical k-means tree by using a priority queue that sorts nodes according to the
distance from their centroid to the query.
The inverted index is frequently used to improve the efficiency when a codebook has been
computed (see Section 2.3.3). The inverted index is a table with one entry for each visual word,
each entry contains a list with all the images that contain some descriptor assigned to that visual
word (i.e., the vectors containing a greater-than-zero weight for that dimension). The similarity
search uses the inverted index to locate all the images that share one or more visual words with the
query image in “immediate run-time” [Sivic and Zisserman, 2003].
3.2.4 LSH
Locality-Sensitive Hashing (LSH) is a randomized algorithm which uses several hash functions to perform approximate searches [Andoni and Indyk, 2008]. It converts a multidimensional vector into a sequence of n bits, and indexes them in a hash table. The Manhattan distance between objects can be approximated by the Hamming distance between the corresponding binary strings. LSH is designed to perform approximate searches; the effectiveness-versus-efficiency tradeoff is controlled by the number n of bits used to quantize vectors. A query vector is processed by the hash functions and all the vectors that are assigned to the same bin are selected as candidates. The final result is created by performing a linear search through the candidate vectors.
3.3 Metric Spaces
A metric space is defined by the pair (D, d), where d satisfies the following properties:

reflexivity:          ∀x ∈ D,        d(x, x) = 0
non-negativity:       ∀x, y ∈ D,     d(x, y) ≥ 0
symmetry:             ∀x, y ∈ D,     d(x, y) = d(y, x)
triangle inequality:  ∀x, y, z ∈ D,  d(x, z) ≤ d(x, y) + d(y, z)
For comparing descriptors, the Minkowski distances (Lp) are an example of widely used metrics. For an n-dimensional space, the Lp metric is defined as:

Lp(~x, ~y) = ( Σ_{i=1}^{n} |xi − yi|^p )^(1/p) ,    with p ≥ 1
In general, the metric properties represent a tradeoff between efficiency and effectiveness for similarity searches. On one hand, the metric properties enable the use of well-studied index structures, accelerating searches by discarding groups of objects (as we review below). On the other hand, the metric properties restrict the similarity model that can be used for comparing two objects [Skopal, 2007; Skopal and Bustos, 2011].

Figure 3.1 – The object-pivot distance constraint for distance d(a, b) using the pivot object p.
3.3.1 Efficiency in Metric Spaces
In order to improve efficiency in metric spaces, the Metric Access Methods (MAMs) [Chávez
et al., 2001] are index structures designed to efficiently perform similarity search queries. MAMs
avoid a linear scan over the whole database by using the metric properties to save distance computations, usually at the cost of storing some previously computed distances. Given the metric space
(D, d), the object-pivot distance constraint [Zezula et al., 2005] guarantees that:
∀a, b, p ∈ D,   |d(a, p) − d(b, p)| ≤ d(a, b) ≤ d(a, p) + d(b, p)    (3.1)
This constraint implies that for any two objects a and b, a lower bound and an upper bound
for d(a, b) can be calculated using a third object p, which is called a pivot (see Figure 3.1). If d(a, p)
and d(b, p) are precalculated, then these bounds can be efficiently computed.
MAMs resolve similarity searches by grouping objects in the database according to some
criteria. Then Equation 3.1 is used to discard groups of objects, thus saving distance computations
and search time. MAMs differ in their methods for grouping objects and for selecting pivots.
The GNAT [Brin, 1995] is an n-ary tree that recursively subdivides a dataset according
to n split objects. It is analogous to the hierarchical k-means tree from vector spaces, with the
difference that the split objects belong to the dataset. In fact, the split objects can be selected
at random, but the author recommends to select objects that are far away from each other. The
search algorithm uses the split point as a pivot to discard tree branches.
The M-tree [Ciaccia et al., 1997] is a balanced tree where each node stores a representative
object and a covering radius, like the GNAT. The M-tree is a dynamic structure that manages
object inserts and deletes in a similar manner as the R-tree: every new object is added to a leaf,
the full nodes are divided in two groups, and the overflow is recursively added to the parent nodes
up to the root.
Pivot Tables
Given a collection of objects R and a set of pivot objects P (which may or may not be a
subset of R), a pivot table is a |R|×|P| matrix that stores the distance from each pivot to every
object in the collection. The similarity search for a query object q (not necessarily in R) evaluates
the distance d(q, p) for each pivot p ∈ P, and then sequentially scans each object r ∈ R, calculating
a lower bound for d(q, r):

max_{p∈P} |d(q, p) − d(r, p)| ≤ d(q, r)

This lower bound can be evaluated efficiently because d(q, p) is already calculated and d(r, p) is stored in the pivot table. In the case of range searches, if the lower bound is greater than ε then r can be safely discarded because r cannot be part of the search result. In the case of k-NN searches, if the lower bound is greater than the current k-th nearest neighbor candidate, then r can be safely discarded. If r could not be discarded, the actual distance d(q, r) must be evaluated to decide whether or not r is part of the search result.
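The following sketch illustrates a pivot-table range search of this kind (not the implementation used in this thesis); d is any metric given as a Python function, and the helper names are ours:

    import numpy as np

    def build_pivot_table(collection, pivots, d):
        # |R| x |P| matrix with the distance from every object to every pivot
        return np.array([[d(r, p) for p in pivots] for r in collection])

    def range_search(q, collection, pivots, table, d, eps):
        dq = np.array([d(q, p) for p in pivots])
        result = []
        for idx, r in enumerate(collection):
            lower_bound = np.abs(dq - table[idx]).max()
            if lower_bound > eps:
                continue                     # safely discarded without evaluating d(q, r)
            if d(q, r) <= eps:
                result.append(idx)
        return result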
The Approximating and Eliminating Search Algorithm (AESA) [Vidal, 1994] computes the
distance between every pair of objects in R and stores them in a |R|×|R| matrix, i.e., P is the
whole set R. Actually, due to the metric properties, only (|R| − 1)·(|R| − 2) evaluations of d are
needed. AESA can resolve nearest neighbor searches with an average constant number of distance
computations, at the cost of requiring quadratic space for storing the pivot table. The Linear AESA
(LAESA) [Micó et al., 1994] overcomes this issue by selecting a set of pivots P ⊆ R. LAESA can
reduce the space for storing the pivot table compared to AESA, however it requires an algorithm
for selecting a good set of pivots.
One approach for selecting pivots is to randomly select objects in R. However, as can be
inferred from Figure 3.1, a good pivot should be either close to the query or to the object in R.
Additionally, a key property for selecting good sets of pivots is that each pivot in the set should be
also far away from each other [Zezula et al., 2005].
The Sparse Spatial Selection (SSS) [Bustos et al., 2008] is an algorithm that selects pivots far
away from each other: given a distance threshold t, it traverses R and it chooses an object x ∈ R to
be added to P when its distance to each previously selected pivot is higher than t. If many candidate
sets of pivots have been selected, either by random selection or by the SSS algorithm, they must
be evaluated and compared. The evaluation consists of calculating µP , which is the average value
of lower bounds [Bustos et al., 2003]. The set of pivots P with the highest µP is selected and the other sets are discarded. The evaluation prefers sets with high lower bounds, because they will probably discard more distances during the search.
LAESA can manage the insertion or deletion of objects and pivots by adding or removing
rows or columns from the pivot table [Micó and Oncina, 2010]. However, LAESA is mainly a static
index because the actual implementation of the pivot table may not support dynamic updates. In
that case, a new table is created and the old table is discarded. Also, after many modifications in
R the set of pivots can begin to perform poorly and a new set of pivots should be selected.
Intrinsic Dimensionality
In order to analyze the efficiency that any MAM can achieve in a metric space (R, d) with
R ⊆ D, Chávez et al. [2001] propose to analyze the histogram of distances of d. A histogram
of distances is constructed by evaluating d(a, b) for a random sample of objects a, b ∈ R. The
histogram of distances reveals information about the distribution of objects in the collection, see
Figure 3.2.
Figure 3.2 – Histogram of distances with median µ, variance σ², and maximum distance M for some metric space (R, d).

Given a histogram of distances of a metric space, the intrinsic dimensionality is defined as [Chávez et al., 2001]:

ρ = µ² / (2σ²) ,    (3.2)

where µ and σ² are the mean and the variance of the histogram of distances.
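A straightforward way to estimate ρ from a random sample of distances (a sketch assuming NumPy; the sample size is arbitrary and the function name is ours):

    import numpy as np

    def intrinsic_dimensionality(collection, d, n_samples=10000, seed=0):
        # rho = mu^2 / (2 sigma^2), estimated over random pairs of objects
        rng = np.random.default_rng(seed)
        pairs = rng.integers(0, len(collection), size=(n_samples, 2))
        dists = np.array([d(collection[i], collection[j]) for i, j in pairs if i != j])
        return dists.mean() ** 2 / (2.0 * dists.var())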
The intrinsic dimensionality estimates the efficiency that any MAM can achieve in (R, d),
therefore it tries to quantify the difficulty in indexing a metric space. A histogram of distances with
small variance (i.e., a high value of ρ) means that the distance between any two objects d(a, b) with
high probability will be near µ, thus the difference between any two distances with high probability
will be a small value. In that case, for most of the pivots the lower bound from Equation 3.1 will
probably become useless at discarding objects.
In the case of LAESA, the internal cost for the similarity search comprises: the evaluation
of the distance between the query object and each pivot at the beginning of the search, and the
evaluation of the lower bound between the query object and each object in the collection. Increasing
the number of pivots will improve the value of the lower bounds, thus more objects may be discarded.
However, the cost for evaluating the lower bounds increases linearly with the number of pivots.
Hence, the number of pivots presents a tradeoff between the amount of distances discarded and the
cost of evaluating each lower bound. The optimal solution for this tradeoff mainly depends on the
intrinsic dimensionality of the search space and quality of the set of pivots.
Indexing the query set
In some domains the query objects may have some special properties that can be exploited
to improve the performance of the index. In particular, content-based video retrieval systems
usually extract many keyframes from a query video, and similarity searches are performed for
those keyframes. Because consecutive keyframes proceed from nearby frames of the same video, it may be expected that two
consecutive query objects will frequently be similar. In interactive content-based systems, the user
starts a search with a text or an example, the system performs a k-NN search and the results
are presented to the user. The user iteratively selects new query objects between those presented
objects, and a new search is performed to refine the results. Because the new queries are selected
from the answers of a previous search, it may be expected that two consecutive query objects will
be similar.
The D-file [Skopal et al., 2012] is a dynamic MAM that is created online while processing a
stream of queries. The D-file is the database file which is accompanied by a main-memory structure,
called the D-cache. The D-cache stores the evaluated distances d(qi , oj ) while processing the queries
in the stream. When the n-th query in the stream is processed, the D-cache calculates a lower-bound distance for d(qn, oj) by evaluating the distance from qn to each previous qi, thus treating previous
queries as pivots. D-cache content is modeled as a sparse dynamic pivot table which stores all the
evaluated distances. If some distances have been discarded, then some rows may be incomplete.
Using the stored distances, the D-cache tries to discard objects using the same approach as pivot
tables. Because the D-cache is built during query processing, the D-file does not need an offline
indexing step. As the D-cache uses the previously processed queries as dynamic pivots, the authors
recommend that previous queries should be as close to the current query as possible.
The D-cache is implemented with: 1) a fixed-size hash table that stores triplets (qi , oj , d(qi , oj ));
2) a hash function h(qi , oj ) for accessing the bucket where a triplet is stored; 3) a collision interval,
for searching a near available bucket when some triplet is mapped into an already used bucket; and
4) a replacement policy, that decides whether or not a new triplet should replace an old triplet when
a collision occurs and there is not an available bucket in the collision interval.
In Chapter 9 we analyze the performance of the D-cache, and we show it suffers from high
internal complexity. This thesis proposes a new index structure called the Snake Table, which
preserves these ideas of dynamic pivots and achieves high performance.
3.3.2 Effectiveness in Metric Spaces
In this section we review two approaches to modify the distance function in order to improve
the effectiveness of the search: defining a dissimilarity function as a linear combination of metrics,
and raising the restrictions of a metric.
Multi-Metric Approach
Let {g1, ..., gm} be a set of feature extraction methods where gi : D → Fi extracts a global descriptor, let {d1, ..., dm} be a set of distance functions where di : Fi × Fi → ℝ is a dissimilarity function that defines the metric space (Fi, di), and let {w1, ..., wm} be a set of weights wi ∈ ℝ; then a multi-metric space (D, γ) is defined where γ : D × D → ℝ calculates the dissimilarity between two objects as:

∀x, y ∈ D,   γ(x, y) = Σ_{i=1}^{m} wi · di(gi(x), gi(y))

We will call the set {d1, ..., dm} the underlying metrics of γ. In this thesis, we will focus on convex combinations, i.e., wi ∈ [0, 1] and Σ_{i=1}^{m} wi = 1.
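A minimal sketch of γ with static weights (the argument names are ours; each gi is given as a feature extraction function and each di as a distance over the corresponding feature space):

    def multi_metric(x, y, extractors, distances, weights):
        # gamma(x, y) = sum_i w_i * d_i(g_i(x), g_i(y)), with a convex combination of weights
        assert abs(sum(weights) - 1.0) < 1e-9
        return sum(w * d(g(x), g(y)) for g, d, w in zip(extractors, distances, weights))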
Bustos and Skopal [2006] proposed a dynamic weighting of metrics, called entropy impurity,
where each weight changes depending on the query object. This method computes the set of weights
prior to each similarity search by analyzing the result of a search on a database already classified.
A similar technique is also proposed by Deselaers et al. [2007] under the name of maximum entropy.
The major problem with dynamic weighting is that it breaks the metric properties, thus general
MAMs cannot be used.
If weights are static (i.e., a fixed value for all searches), then γ also satisfies the metric
properties [Batko et al., 2009; Bustos, 2006], thus any MAM can be used for indexing the objects.
The value assigned to each weight depends on the actual implementation of descriptors (a different
weight is required depending on whether a descriptor represents colors or textures), and application
specifics (a system that retrieves sport images may use different weights than a system that retrieves
hand-made sketches). The set of weights should be fixed subjectively as fine tuning parameters
[Batko et al., 2009], or should be fixed in accordance with evaluations of effectiveness. However, the
evaluation of the effectiveness of a system is usually difficult because it requires the definition of
proper methodology and indicators, and in some cases it requires hiring people to use the system
and fill out evaluation forms.
Non-Metric Approach
Another way for improving the effectiveness is to raise the restrictions of a metric. When
a distance function does not satisfy some of the metric properties it is known as a non-metric
[Skopal and Bustos, 2011]. In particular, a semi-metric is a non-metric that satisfies reflexivity,
non-negativity, and symmetry, but not the triangle inequality.
Even using the most advanced feature extraction method, if the distance function does not
correctly model the human notion of similarity, the results will not be satisfactory. Moreover,
the effectiveness of the similarity search can only be improved by researching better algorithms,
as opposed to efficiency that may be improved by better hardware. Non-metric distances can be
useful for creating complex similarity measures that represent more accurately the human notion
of similarity without the constraints of the metric properties. However, in this case general MAMs
cannot be used.
The Dynamic Partial Functions (DPF) [Li et al., 2002] are a family of semi-metric distances that have been used with remarkable results. They are based on the Lp distances, but take into consideration only the subset of the m dimensions with the smallest differences. Let ~x and ~y be two n-dimensional vectors, and m ∈ {1, ..., n} be a parameter, then the DPF distance is defined as:
DPF(~x, ~y) = ( Σ_{ci ∈ ∆m} ci^p )^{1/p}, with p ≥ 1,
where ci = |xi − yi | is the difference of ~x and ~y in the i-th coordinate, and ∆m is the subset of
the m smallest values of {c1, ..., cn}. Meng et al. [2003] present criteria for selecting the value of m and compare DPF with Minkowski distances for image copy detection. They showed that, for a good selection of m, DPF outperforms Lp in effectiveness when retrieving similar images represented by global descriptors.
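A minimal Python sketch of the DPF computation, assuming plain lists of coordinates; the example vectors and the values of m and p are arbitrary.

    # Dynamic Partial Function: an Lp-style distance that keeps only the
    # m smallest coordinate differences (the set Delta_m).
    def dpf(x, y, m, p=2):
        diffs = sorted(abs(xi - yi) for xi, yi in zip(x, y))  # c_i = |x_i - y_i|
        delta_m = diffs[:m]                                   # m smallest differences
        return sum(c ** p for c in delta_m) ** (1.0 / p)

    x = [1.0, 2.0, 3.0, 10.0]
    y = [1.1, 2.2, 2.9, 0.0]
    print(dpf(x, y, m=3))   # ignores the largest difference (fourth coordinate)
    print(dpf(x, y, m=4))   # with m = n this is the plain Euclidean distance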
In another study, Aggarwal et al. [2001] show that the Fractional distances, i.e., the Lp distances with 0 < p < 1, give better results than L2 for clustering high-dimensional spaces. The experiments were performed using the k-means algorithm and showed that for a small p (in particular they tested L0.1) the ratio between the minimum and maximum distances diverges, thus increasing the discriminability between points. Afterwards, Howarth and Rüger [2005] compared Fractional
distances against L1 and L2 in an image retrieval system, where each image was represented by a
global descriptor. They empirically showed that Fractional distances outperform L1 and L2 .
Rubner et al. [2001] made an empirical comparison of different metrics and non-metrics for image retrieval. Their experiments showed that with larger sample sizes the non-metric functions outperformed the metrics for global color and texture descriptors. In particular, we highlight the Chi-squared test statistic, which achieves high effectiveness and is fast to evaluate. Let ~x and ~y be two n-dimensional vectors; the χ2 distance is defined as:
χ2(~x, ~y) = Σ_{i=1}^{n} (xi − m̄i)² / m̄i,   where m̄i = (xi + yi) / 2.
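A short Python sketch of this dissimilarity, assuming the descriptors are plain lists (e.g., normalized histograms); the small epsilon guarding empty bins is an implementation detail added here.

    # Chi-squared dissimilarity between two n-dimensional descriptors.
    def chi2(x, y, eps=1e-12):
        total = 0.0
        for xi, yi in zip(x, y):
            m = (xi + yi) / 2.0                  # bin-wise mean
            total += (xi - m) ** 2 / (m + eps)   # eps avoids division by zero
        return total

    print(chi2([0.2, 0.5, 0.3], [0.25, 0.45, 0.30]))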
There are few works where non-metric measures have been specifically proposed for content-based video copy detection. The global descriptor FRAS used by Shen et al. [2007] is compared with the Probability-based Edit Distance, which is a non-metric. This algorithm is used for short video clips only, due to the quadratic time complexity of computing the edit distance.
Cheung and Zakhor [2003] used an L1 distance for comparing histograms in a video retrieval system and defined a modified L1 distance by removing a dominant color from each histogram. The modified distance does not comply with the triangle inequality. They defined the modified distance to avoid selecting videos whose backgrounds were identical even when the videos were different (the case of videos with slides). The modified distance was used in a postprocessing filtering step to discard non-relevant objects selected by the metric.
Regarding the efficiency for non-metrics, a straightforward approach to resolve searches is to use a general MAM and perform exact searches. Because the MAM indexes a non-metric, the exact search becomes an approximate search. The main drawback of this approach is that it cannot control or even know the amount of approximation in the result. Therefore, some techniques specifically designed for non-metric indexing are needed.
The research specific to non-metric indexing can be broadly classified into two approaches: 1)
designing index structures for specific non-metrics; and 2) converting semi-metrics into metrics by
correcting the triangle inequality.
In the first case, Aggarwal and Yu [2000] propose an index structure for high-dimensional spaces called IGrid. It is inspired by the inverted file, and it is intended to index a weighted Minkowski distance computed on multidimensional vectors. The data vectors are organized into ranges for each dimension, and the algorithm selects and retrieves the objects in the ranges relevant to the search.
In another work, Goh et al. [2002] propose DynDex, which groups descriptors into clusters in order to discard groups during the search. The clustering algorithm is CLARANS [Ng and Han, 2002] because it compares pairs of objects in order to compute medoids instead of centroids, thus not requiring the metric properties. The similarity search retrieves the clusters nearest to the query object and performs a linear scan over the relevant clusters.
Given a semi-metric δ, a lower-bound metric function d (i.e., ∀x, y, d(x, y) ≤ δ(x, y)) can be used to efficiently filter out irrelevant objects. The objects that cannot be discarded by d are then compared with δ. Ciaccia and Patella [2002] propose the QIC-M-tree, which is an extension of the M-tree that follows this approach. To obtain an efficiency gain, the metric d should be as tight as possible to δ while satisfying the triangle inequality. Unfortunately, the definition of d depends on the topological properties of δ, thus it requires a specific study for every semi-metric.
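The following Python sketch illustrates the filter-and-refine idea behind this approach as a plain linear scan (not the QIC-M-tree itself); lower_bound and delta are assumed to be user-provided functions with d ≤ δ.

    # Filter-and-refine range search with a lower-bound metric d <= delta:
    # objects whose lower bound already exceeds the radius r are safely
    # discarded; the (more expensive) semi-metric delta is evaluated only
    # on the survivors.
    def range_search(query, database, r, lower_bound, delta):
        result = []
        for obj in database:
            if lower_bound(query, obj) > r:   # delta(query, obj) >= d > r
                continue
            if delta(query, obj) <= r:        # refine with the semi-metric
                result.append(obj)
        return result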
In the more general approach, a semi-metric δ can be turned into a metric by defining d(x, y) = δ(x, y) + c, with a constant c large enough to satisfy the triangle inequality [Roth et al., 2002]. Even though the triangle inequality is corrected, the intrinsic dimensionality of the new metric space will be so high that it will ruin any further indexing.
Following this idea, TriGen [Skopal, 2006, 2007] replaces the constant c by concave and convex functions. Using these kinds of functions, the fraction of triplets satisfying the triangle inequality can be controlled. Skopal and Lokoč [2008] propose the NM-tree, which is an extension of
the M-Tree that performs exact and approximate searches on non-metrics. The index controls the
effectiveness and efficiency of the search by adjusting the level of correction of the non-metric and
the level of approximation of the search.
An alternative approach is Local Constant Embedding [Chen and Lian, 2008], which partitions the database into multiple groups and computes different constants ci that satisfy the triangle inequality locally inside each group. The search uses representative objects from each group in order to select the most promising groups to scan. The main drawback of this technique is that the algorithm for generating the groups and constants ci has cubic time complexity.
3.4 Vector spaces versus Metric spaces
Vector spaces and metric spaces are two approaches to resolve similarity searches. A discussion of the benefits and drawbacks of vector spaces and metric spaces is given in the surveys by Böhm et al. [2001] and Chávez et al. [2001].
For instance, let I be the image domain, let g : I → F be a feature extraction method
that calculates a global descriptor g(o) for the image o, and let dg : F × F → R be a distance
function between global descriptors. Commonly, a vector space is defined for F and dg is an Lp
distance, hence the distance between two images corresponds to the distance between their global
descriptors. A metric space makes no assumptions about the domain as long as the distance satisfies the metric properties; e.g., a metric space for images can be defined using a distance d : I × I → R, which may compare a single global descriptor (d(x, y) = dg(g(x), g(y))), combine global descriptors in a multi-metric, or even compare local descriptors.
On one hand, metric spaces have neither a coordinate system nor dimensions, thus they must rely on some rudimentary geometry (based on the closeness between objects) to save distance computations. This implies that indexes for metric spaces achieve poorer pruning performance (i.e., are less efficient) than indexes for vector spaces [Böhm et al., 2001]. However, we should note this is not always true, since some metric indexes can show similar or higher performance than some vector indexes (e.g., the M-tree versus the R*-tree [Ciaccia et al., 1997]).
On the other hand, metric spaces can be applied to any kind of object, provided that the distance function knows how to compare them. It is even possible to define specific functions that compare
two images directly without extracting feature vectors at all. In fact, a complex distance function,
like the solution of an optimization problem, can achieve higher effectiveness than a simpler distance
function [Rubner et al., 2000]. However, we should note that in some scenarios the extra computation
time is unaffordable, or the increase in effectiveness is not worth the efficiency cost.
Finally, there are some techniques that map a metric space into a vector space [Faloutsos and
Lin, 1995]. The idea of these techniques is to convert a search in a metric space into an approximate
search in a vector space. The performance of these techniques highly depends on the characteristics
of the involved data and mapped metric space, thus it is not possible to evaluate them in general
[Chávez et al., 2001].
In summary, the selection between vector spaces and metric spaces is loosely related to
the effectiveness-versus-efficiency tradeoff (described in Chapter 1). The main advantage of vector spaces is the higher efficiency they can achieve. The main advantage of metric spaces is the
(potentially) higher effectiveness they may achieve.
3.5 Summary
In this chapter we briefly reviewed the similarity search topic. In particular, we discussed
two approaches to resolve similarity searches: vector spaces and metric spaces. The first approach
arises naturally when a feature extraction method is used to extract descriptors from multimedia
documents. The second approach is a generalization whose main advantage is the ability to use complex distances to compare objects, at the cost of lower search efficiency compared to vector spaces.
In this thesis, we chose the metric space approach mainly because there is a large unexplored
field in the application of metric spaces to the video domain. As we review in the next chapter, the current top-performing CBVCD systems use vector spaces to model similarity, either in the form of tree-based indexes, hash-based tables, or codebooks produced by k-means. To the best of our knowledge, the metric space approach has not been successfully applied to the video domain, mainly because the high volume of data forces researchers to choose an approach with higher efficiency. Hence, this research intends to show a valid approach for applying metric spaces to the CBVCD topic that can achieve performance competitive with state-of-the-art systems.
Because video datasets are usually large, a key aspect of this thesis will be the balance
between the complexity of the model and the computational effort required to use it. Indeed, in this thesis we test different similarity functions and analyze the performance of linear combinations of metrics. The efficiency in this thesis is based on pivot tables due to their low internal complexity, and we develop some techniques that profit from properties of the video domain to reduce the search cost.
The next chapter reviews the related work specific to the content-based video copy detection
topic.
Chapter 4
Related Work
In this chapter, we briefly review the related work for the content-based video copy detection
topic. First, we discuss different definitions and formalizations for the video copy detection problem. Afterwards, we summarize some applications of video copy detection techniques. Then, we
review the related work following four main aspects, namely Content Description, Similarity Search,
Temporal Consistency and Multimodal Fusion. Finally, we discuss a complementary approach to
content-based video copy detection, called watermarking.
4.1 Definition of Content-Based Video Copy Detection
In this section we analyze and formalize the term Content-Based Video Copy Detection (CBVCD). Other terms like content-based copy retrieval [Joly et al., 2007], near-duplicate video detection
[Wu et al., 2007; Shen et al., 2007], partial near-duplicate video detection [Tan et al., 2009], user-centric near-duplicate detection [Cherubini et al., 2009], detection of duplicated sequences [Naturel
and Gros, 2005], video matching [Basharat et al., 2008], and video sequence identification [Iwamoto
et al., 2006] are also used for a relatively similar or identical purpose. This section sets out to unify
these terms.
A broad definition of Content-Based Copy Detection (CBCD) is the following:
Definition 1 (CBCD) Let V be a set of multimedia documents (image, audio, video), and C be a set of query documents; CBCD consists in retrieving, for every q ∈ C, all the multimedia documents in V of which q “is a copy”.
It should be noted that the objective of CBCD is to retrieve multimedia documents instead
of just detecting whether or not a query document is a copy. The set V is known as the “reference
collection”, and C is the “query collection”.
The specific behavior of a CBVCD system depends on the definition of copy. Depending
on the specific objectives of a system, several definitions have been given for what a copy is. In
general, we differentiate two approaches for defining a copy: content transformations and semantic
similarity.
4.1.1 CBVCD based on content transformations
The approach based on content transformations is mainly related to the copyright infringement problem. Given a collection of original multimedia documents, a copy is any derivation of an original document, independently of its semantic meaning. A “content transformation” is a function that produces a derived document by adding or removing visual or acoustic information from a source document. In particular, the content transformations that are relevant for CBCD are
described by the subjective notion of tolerated transformation [Joly et al., 2007]:
Definition 2 (Tolerated transformation) Given an original multimedia document, a tolerated
transformation is a content transformation that creates a new multimedia document where the original document “remains recognizable”.
The content transformations frequently used in CBVCD can be broadly classified into three
categories:
• Quality transformations. These transformations usually affect the whole document and
sometimes they are unintended consequences of the capture and/or processing of the video.
Some examples are: blur, noise, change of brightness or contrast, artifacts due to re-encoding,
etc.
• Postproduction transformations. These transformations may affect the whole video or
some part of it. They are usually produced in the editing process with the purpose of embedding or removing information. Some examples are: insertion of text or logos, cropping,
picture in picture, chroma key composition, acoustic effects, and special effects in general.
• Mash-up transformations. These transformations affect the length of a video. They are
usually produced in the editing process by adding or removing video excerpts. Some examples
are: inclusion of opening/endings, embedding into a longer sequence, removal of excerpts, etc.
Additionally, each transformation has an associated intensity. A low-intensity transformation slightly modifies the content, hence the transformed video is essentially identical to the original.
A strong-intensity transformation modifies the document in a way that it is hard to identify the
original. Figure 4.1 shows some examples of content transformations. The transformations may
affect the visual content and/or the audio content of a video.
Finally, the definition of copy is based on the use of tolerated transformations:
Definition 3 (Copy) Let T be a set of tolerated transformations, and u and v be two multimedia
documents, v is a copy of u if ∃t ∈ T , t(u) = v.
Low-intensity transformations can normally be composed several times, i.e., if t1 and t2 ∈ T ,
it is common that t1 ◦ t2 ∈ T . Thus, given an original document, many copies could be produced by
successively applying different transformations, creating a sort of “tree of copies” [Joly et al., 2007].
However, when mash-up transformations or some strong-intensity transformations are used, the copy relation may become non-transitive.
Figure 4.1 – Some examples of content transformations used in the TRECVID datasets.
The tolerated transformations that produce the copies may be a priori known or may be
unknown. In the former case, the CBVCD system is able to introduce some special treatment to
query videos like detection and reversion of transformations. In the latter, the CBVCD system
should be general enough to be robust to any kind of tolerated transformation.
As stated at the start of this section, there are many terms to denote the copies that
a CBVCD system should detect. The following definitions try to unify them with the definition of
copy.
Definition 4 (Duplicate) v is a duplicate of u if v is a copy of u when T contains only low-intensity quality transformations.
Definition 5 (Near-Duplicate) v is a near-duplicate of u if v is a copy of u when T contains
low-intensity quality and postproduction transformations.
Definition 6 (Partial Near-Duplicate) v is a partial near-duplicate of u if v is a copy of u when
T contains low-intensity quality, postproduction and mash-up transformations.
4.1.2 CBVCD based on semantic similarity
The approach based on semantic similarity is mainly related to the problem of reducing the
redundancy of search results in video-sharing web sites.
One kind of semantic similarity occurs when two clips present the same scene, but from a
different angle and capturing parameters. This issue frequently occurs when different users upload
their recording of some public event.
Definition 7 (Scene Duplicate) v is a scene duplicate of u if v presents the same real-world scene as u but with a different capturing method (i.e., type of camera, capturing viewpoint, camera motions, etc.).
Satoh et al. [2007] propose an approach for retrieving video footage capturing the same scene or event from different viewpoints. Similarly, Basharat et al. [2008] present an approach for
detecting two videos showing the same event but captured from different devices and viewpoints.
These are examples of copy detection according to semantic similarity.
A more general semantic similarity can be defined as the videos that “a user would clearly
identify as essentially the same” [Wu et al., 2007]. This human perception of duplicates is later
analyzed by Cherubini et al. [2009]. Using an online questionnaire, they asked users to rate the
similarity of pairs of videos. That study enabled them to present the concept of user-centric duplicates.
Definition 8 (User-Centric Duplicate) v is a user-centric duplicate of u if v is semantically
identical to u, i.e., v provides the same information to the user as u.
In general, user-centric similarity admits most of the quality transformations and some postproduction transformations; however, it excludes the transformations that add new information (like
insertions of graphs) and mash-up transformations. According to Cherubini et al. [2009], the acoustic content supports stronger transformations without affecting the semantic similarity of the video.
For instance, Figure 4.2 shows two clips introducing the same physics experiment, hence from a
user point of view those clips are duplicates, regardless of their visual dissimilarity.
Figure 4.2 – Two video clips showing the same physics experiment in similar conditions. These
videos correspond to user-centric duplicates despite their visual difference. [Cherubini et al., 2009]
Despite the wide range of applications that CBVCD based on semantic similarity can provide, this thesis focuses on content transformations, mainly because there is currently no large-scale collection to evaluate CBVCD based on semantic similarity. In fact, the TRECVID evaluation provides large datasets for CBVCD where the copies are produced by content transformations.
4.2 Applications of CBVCD
The techniques used in CBVCD can be applied to different problems. Each problem addresses a type of similarity (content transformation or semantic), and may limit the transformations to be supported. Some examples of applications of CBVCD methods are:
Checking of copyright infringements The owner of the copyright on some material would like
to use a CBVCD system to find every derived work from its property. In this case the definition
based on content transformations is relevant, where the transformations are created by an adversary
that tries to avoid detection. The main constraint is that transformations should be realistic, i.e., the final video should have enough quality to be watched by an end user. This application may
be preferred by video-sharing web sites and media production companies. An alternative approach
focused on this application, called Watermarking, is discussed at the end of this chapter.
Reduction of redundancy in search results In video search engines, a post-processing phase
may use a CBVCD system to determine multiple copies of the same document and group them.
In this case, the detection should focus on low-intensity content transformations and semantic
similarity.
Identification of known sequences in video streams A CBVCD system can be used to
monitor public broadcast with the objective of searching or tracking the emission of previously
known commercials. The detection should focus on low-intensity transformations. Naturel and
Gros [2005] present an approach to detect duplicated sequences using a lightweight representation
and search.
The mining of video databases The analysis of common sequences between videos in a collection may help to discover internal structures in the collection. For example, if two videos share the
title sequence, they can be marked as chapters of the same TV series. Poullot et al. [2008] present
an approach to mine large archives of TV broadcast creating a graph showing the relationships
between videos.
Assignment of semantic descriptions A CBVCD system may be used to add semantic tags to
new documents based on the tags assigned to similar content. In this case, a CBVCD system based
on semantic similarity is appropriate. Besides, tags can also be assigned based on the frequency of
emissions, e.g., a video broadcast from many TV stations in a short period of time may be tagged
as a news event, or a video broadcast for a longer period of time may be tagged as an historical
event.
Video footage management system The detection of common shots between unedited material and the final edited version of a film may help to reverse engineer the editing process of the film. In
this case, CBVCD based on content transformations (in particular on mash-up transformations) is
required.
4.3 Content Description
The Content Description consists in representing the content of the multimedia document with one or more descriptors. The descriptors should fulfill two goals: be invariant to transformations (i.e., be identical or very similar to the descriptors of documents that are copies), and be discriminative between unrelated documents (i.e., be very different from the descriptors of documents that are not copies).
Descriptors can be low-level or high-level depending on the type of information they represent. Low-level descriptors represent some characteristic in the content itself (like colors, edges,
shapes, pitch, etc.). They are the result of computational processes on the information stored in
the document. High-level descriptors represent semantic features in the video (like the presence of
persons, musical genre, abstract concepts, etc.). Usually, high-level descriptors are generated by
users, i.e., a person assigns one or more concepts to a video after having watched it. A broad field
of research focuses on the problem of automatically producing high-level features. Machine learning
techniques, like grouping low-level features into high-level categories by means of adequately-trained
classifiers, are commonly used to this aim.
Depending on the source of information to represent, descriptors can be visual or acoustic.
Additionally, high-level descriptors may represent both sources at the same time, as a sort of audiovisual concept [Jiang et al., 2009].
In the case of multimedia documents with a temporal dimension (audio and video), the description process may represent the entire document by a single description called a signature, or may divide the document into small units and create independent fine-grained descriptors for each one.
Video signatures can be used in a first step for detecting identical videos and for discarding
definitely irrelevant documents [Wu et al., 2007]. The signature can be calculated directly from
the video content, or be a global summarization of fine-grained descriptions. However, in the case
of mash-up transformations or strong postproduction transformations a fine-grained description is
required.
The process of selecting independent representative frames from a video is called keyframe
selection. The process of dividing a video into segments is called video segmentation. A spatial
description represents static visual information from isolated keyframes. A spatio-temporal description represents the temporal flow of the visual content in a video segment. The search process for
fine-grained descriptors locates similar segments or keyframes between query and reference videos,
and then a temporal consistency process determines the matched units that join an excerpt of the original with its copy.
A description can be global or local. A global description represents the content of the whole
keyframe or segment with a single descriptor. A local description consists of many descriptors, each
one representing only a small zone of the keyframe or segment.
In general, local description can achieve higher detection performance than global description, especially for strong-intensity postproduction transformations [Law-To et al., 2007a]. However,
it should be noted that local description presents some difficulties compared to global description:
a) it requires more computational time to compute descriptors; b) it needs more disk and memory
space to store descriptors; and c) it increases the complexity for the similarity search process. In order to reduce the storage and search time while keeping the high detection performance, a common
approach is to create a global description summarizing a local description.
The following subsections detail these definitions and approaches, and review their common
usage on CBVCD systems.
4.3.1 Description for an entire video document
The description of an entire video can be calculated by using a content-based hash function
or by summarizing fine-grained descriptions.
In the case of content-based hashes, Coskun et al. [2006] propose two hash functions for
calculating video signatures that depend on the spatio-temporal position of pixels, and are robust
to low-intensity quality transformations. The detection is performed by locating collisions between
the hash values of videos. The first hash function achieves higher detection performance, but it is not robust against attacks, i.e., an adversary is able to produce an unrelated video with the same hash or to produce similar videos with different hashes. The other hash function prevents adversary attacks by using a secret randomly-generated key, but its detection performance decreases.
In the case of summarizations, Wu et al. [2007] create a video signature by averaging fine-grained global descriptors. The fine-grained descriptors are 24-bin color histograms calculated on representative keyframes of shots, which are compared with the Euclidean distance. Depending on the distance between two signatures, two videos may be reported as a copy (when the signatures are close to one another), as no copy (when the signatures are far from each other), or otherwise a detailed comparison based on the fine-grained description is performed.
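A Python sketch in the spirit of this scheme; the histograms and the two decision thresholds are hypothetical placeholders, not the values used by Wu et al. [2007].

    # Video signature as the average of per-keyframe histograms, compared
    # with the Euclidean distance; a coarse three-way decision follows.
    def signature(keyframe_histograms):
        n = len(keyframe_histograms)
        dims = len(keyframe_histograms[0])
        return [sum(h[i] for h in keyframe_histograms) / n for i in range(dims)]

    def euclidean(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    def coarse_decision(sig_q, sig_r, t_copy=0.05, t_nocopy=0.5):
        d = euclidean(sig_q, sig_r)
        if d <= t_copy:
            return "copy"
        if d >= t_nocopy:
            return "no copy"
        return "needs fine-grained comparison"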
The Bounded Coordinate System (BCS) [Shen et al., 2007; Shao et al., 2008] calculates a
statistical summarization of fine-grained descriptors. They calculate PCA over fine-grained global
descriptors and transform each descriptor into the reduced space. The video signature is the average
descriptor plus the minimum bounding rectangle that encloses the reduced descriptors. Two signatures are compared by summing the Euclidean distance between the averages and the Euclidean
distances between the corners of the bounding rectangles.
Video signatures can be understood as a content-aware file hashing algorithm, like a CRC-32
or MD5, that computes its value from video pixels instead of file bytes. Therefore, they can be used
to efficiently retrieve videos that visually are almost identical. In fact, they are intended to be used
in scenarios where only quality transformations can be present. However, video signatures are not
able to detect copies between video excerpts or strong postproduction transformations that may
affect many consecutive frames.
4.3.2 Video Segmentation and Keyframe Selection
A video usually has between 20 and 30 frames per second, a movie has more than 100,000 frames, and a large database may contain more than 10^8 frames. However, most consecutive frames
in a video are similar to one another. In order to reduce the amount of data, either representative
keyframes or shot boundaries are calculated. A video shot is a series of interrelated consecutive
frames taken contiguously by a single camera and representing a continuous action in time and
space [Hanjalic, 2002]. A keyframe is the frame which can represent the salient content of a shot
[Zhuang et al., 1999].
Shot boundaries are commonly located by searching for significant changes between consecutive frames. For instance, shot boundaries can be detected by differences of average intensity [Hampapur et al., 1994], differences of gray histograms by zones [Boreczky and Rowe, 1996], differences of “intensity of motion” [Eickeler and Müller, 1999], or motion-compensating features with adaptive thresholds for hard and gradual transitions [Hanjalic, 2002]. TRECVID performed an evaluation of shot boundary detection between 2003 and 2007. Some systems that achieved good results in that evaluation are Naito et al. [2006], Liu et al. [2007], and Yuan et al. [2007].
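As a simple illustration of the general scheme (not of the cited detectors, which use more elaborate features and adaptive thresholds), the following Python sketch marks a hard cut whenever the histogram difference between consecutive frames exceeds a fixed threshold.

    # Shot-boundary detection by thresholding the difference between
    # gray-level histograms of consecutive frames.
    def shot_boundaries(frame_histograms, threshold):
        boundaries = []
        for i in range(1, len(frame_histograms)):
            prev, curr = frame_histograms[i - 1], frame_histograms[i]
            diff = sum(abs(a - b) for a, b in zip(prev, curr))
            if diff > threshold:          # abrupt change: cut between i-1 and i
                boundaries.append(i)
        return boundaries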
Keyframes can be selected either from the whole video or from shots. Common approaches for selecting keyframes are motion estimation [Fauvet et al., 2004], unsupervised clustering of histograms [Zhuang et al., 1998], differences between consecutive frames [Gengembre and Berrani, 2008], or simply regular sampling.
The size of the computed segments and/or the number of keyframes extracted per video affects both the effectiveness and the efficiency of a system. A sparse segmentation or keyframe selection increases efficiency but may ruin the detection effectiveness. On the other hand, a dense segmentation or keyframe selection increases effectiveness but hurts efficiency due to the high number of objects to process. Moreover, an overly dense segmentation may also affect the effectiveness due to the high number of redundant and noisy objects.
4.3.3 Visual Global Description
The global description consists in representing the visual content of the video keyframe/segment
with a single descriptor. The descriptor may represent only one keyframe in the segment (a spatial
descriptor) or the whole segment (a spatio-temporal descriptor).
Computing global descriptors from video frames is a straightforward approach to address
the CBVCD problem. Global descriptors are commonly used for detecting video duplicates, i.e.,
quality transformations, and in some systems, they form the first phase that resolves the easy-to-detect copies, leaving the complex copies to local descriptors.
One contribution of this thesis is to show that global descriptors can achieve satisfactory
results in more complex datasets, even outperforming many state-of-the-art systems based on local
descriptors. Global descriptors are able to detect many copies without producing false alarms, but
they are not able to detect some copies with strong postproduction transformations, i.e., global
descriptors can achieve higher precision than local descriptors but lower recall.
Spatial global descriptors
Hampapur and Bolle [2001] compare the effectiveness of different global descriptors extracted
from regular-sampled keyframes. The evaluated descriptors are: image difference (the sum of the
differences between pixel colors), color histogram, histogram of orientations of gradient, Hausdorff
distance between Canny edges, Invariant Moments (spread and slenderness of Canny edges), and
Local Edge (centroid of the edge points by zone). In their tests, the Local Edge performed better
than the others at CBVCD, followed by the partial Hausdorff distance. This is one of the first works
on the CBVCD topic. It focuses on low-intensity quality transformations (reencoding), and shows
that the edge pixels contain discriminative information for the CBVCD problem.
In another work, Iwamoto et al. [2006] present a descriptor robust to captions, which is very
similar to the Edge Histogram descriptor (see Section 2.3.1). Each video frame is converted to gray
scale and divided into N blocks. For each block the energy for ten different orientation filters is
measured: the five orientations of Edge Histogram plus their mirrored versions (see Figure 4.3).
The descriptor is a vector with N values containing the identifier of the dominant orientation for
each block or a “no edge” identifier if no orientation exceeded a minimum threshold. The comparison uses a Hamming distance with predefined weights depending on the probability of caption
superimposition for each block. This work shows that dividing the frame in zones and computing
independent values for each zone produces a descriptor that is robust to localized modifications. In
fact, some kinds of postproduction transformations, like insertion of captions or images, modify a
few zones and the rest of the frame remains unchanged. Therefore, a robust global description must
consider independent values for many zones of frames.
Figure 4.3 – The mirrored versions of the Edge Histogram filters (see Figure 2.8 on page 18): (a) 0° edge, (b) −90° edge, (c) −45° edge, (d) −135° edge, (e) isotropic. [Filter coefficient matrices are not reproduced here.]
Lee and Yoo [2008] compute a global descriptor on regular-sampled keyframes. A video frame is scaled, converted to intensity values, and divided into N zones; the global descriptor is the centroid of the orientations of the gradient in each zone. They show this descriptor outperforms the histogram of orientations of gradient from Hampapur and Bolle [2001]. This work reaffirms that edge pixels are indeed relevant for CBVCD.
The FRAS descriptor [Shen et al., 2007] maps frames to symbols from a dictionary. The
symbol dictionary is defined by clustering all the frames in the database and selecting the centroids.
A frame is represented by the centroid of the clusters that contain it, or a special symbol if it
is not contained by any cluster. The descriptor for a whole video is the string resulting from
concatenating all the symbols. The similarity between two descriptors is measured by a probability-based edit distance, which is similar to the Levenshtein distance, but using a probability of replacing
symbols given by the number of common frames between clusters. This work shows a non-metric
distance that computes similarity between videos. Unfortunately, the cost of distance computation
is quadratic in terms of video lengths.
Spatio-temporal global descriptors
The Ordinal Measurement Descriptor (OMD) captures the relative ordering of intensities in
an image (see Section 2.3.1). In the case of CBVCD, Kim and Vasudev [2005] extract OMD for each
frame and compare sequences of descriptors. First, the spatial distance is defined as the average of
the differences between two descriptors. Then, a temporal descriptor is calculated by comparing the
changes between consecutive OMDs: the temporal OMD contains for each zone a value 1, -1, or 0,
depending on whether the rank for the zone increases, decreases, or does not change, respectively.
The temporal distance between two videos is defined as the average of the differences between their
temporal OMDs. Finally, the spatio-temporal distance between video sequences is defined as a static
weighted combination of spatial and temporal distances. This spatio-temporal distance follows the
approach of weighted combinations of distances. Unfortunately, the temporal alignment of videos
is determined by linear probing and the OMD descriptor is only robust to quality transformations.
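The following Python sketch summarizes this spatio-temporal distance; the per-frame OMDs (one rank per zone) are assumed to be given, and the weight alpha is a placeholder for the static combination weight.

    # Spatio-temporal OMD distance in the style of Kim and Vasudev [2005].
    def spatial_distance(omd_a, omd_b):
        return sum(abs(a - b) for a, b in zip(omd_a, omd_b)) / len(omd_a)

    def temporal_omd(prev_omd, curr_omd):
        # +1, -1 or 0 per zone: rank increased, decreased or stayed equal.
        return [(c > p) - (c < p) for p, c in zip(prev_omd, curr_omd)]

    def st_distance(omds_q, omds_r, alpha=0.5):
        # omds_q and omds_r are aligned sequences of per-frame OMDs (>= 2 frames).
        n = len(omds_q)
        spatial = sum(spatial_distance(q, r) for q, r in zip(omds_q, omds_r)) / n
        t_q = [temporal_omd(omds_q[i - 1], omds_q[i]) for i in range(1, n)]
        t_r = [temporal_omd(omds_r[i - 1], omds_r[i]) for i in range(1, n)]
        temporal = sum(spatial_distance(a, b) for a, b in zip(t_q, t_r)) / (n - 1)
        return alpha * spatial + (1 - alpha) * temporal   # static weighted combination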
Chen and Lian [2008] create the temporal OMD by sorting the average intensities of the same zone across the frame series. For a subsequence of m frames divided into n spatial zones, the temporal OMD consists of the n permutations (one for each zone) of m intensities (one for each frame). The distance between two temporal OMDs is the average of the differences of the permutations for each zone. This work improves the temporal localization of OMD, but it is also focused on quality transformations.
4.3.4 Visual Local Description
The local description consists in representing the visual content of the video keyframe/segment
with many local descriptors. Each local descriptor represents a fixed zone in a keyframe (spatial
descriptors) or a fixed or moving zone in the segment (spatio-temporal descriptors). The use of local
descriptors enables the detection of complex postproduction transformations, which would otherwise
be undetectable under global descriptions [Law-To et al., 2007a]. Therefore, most state-of-the-art
CBVCD systems rely on local descriptors to represent visual content. However, local descriptors
demand much higher computational resources than global descriptors to store descriptors and perform searches. Additionally, the use of local descriptors usually requires the computation of spatial
consistency processes.
The most common local descriptor used by CBVCD systems is SIFT or some variation of it,
e.g., Basharat et al. [2008], Liu et al. [2010a], and Bai et al. [2011]. Other local descriptors used in
CBVCD are SURF [Roth et al., 2009; Sun et al., 2010], DART [Younessian et al., 2010], CS-LBP
[Jégou et al., 2010], and partial derivatives of graylevels [Joly et al., 2003; Poullot et al., 2007].
The temporal dimension can be used to improve the quality of local descriptors. Additionally, in order to improve efficiency, global summarizations of local descriptors are commonly used.
Spatio-temporal local descriptors
A common technique to improve the quality of local descriptors in videos is to track them between video frames. This technique also enables discarding unstable regions that usually correspond
to noise.
Law-To et al. [2006] propose an algorithm for spatio-temporal local description which classifies persistent keypoints by tracking their trajectories in consecutive keyframes. If the points are
static, they represent the background, and if they are moving, they represent objects in motion.
This distinction enables the detection of strong postproduction transformations, like background
replacement and insertion of new characters in a scene. This work shows the potential of local descriptors compared to global descriptors.
Satoh et al. [2007] detect and track interest points in a video using the Kanade-Lucas-Tomasi
tracking algorithm [Tomasi and Kanade, 1991]. Following the work by Shechtman and Irani [2005],
a trajectory is represented by motion patterns and the comparison between two videos consists in
measuring their degree of inconsistency between their patterns. A binary signature is computed for
each trajectory by locating the local maximum of inconsistency. This work focuses on detecting
copies according to semantic similarity instead of copies based on content transformations. The
work reports high precision and low recall, i.e., it produces few false alarms but misses many copies.
This technique demands high computational resources to compare videos, therefore it may not be
able to scale to common CBVCD datasets.
Similarly, Basharat et al. [2008] detect keypoints in time and compute trajectories. The
spatial regions containing trajectories with uniform direction and velocity are used to segment
physical volumes in the scene. The volumes are represented by SIFT descriptors, HSV histogram,
histogram of gradients, and histogram of motion directions. The similarity is measured by the
Earth Mover’s Distance using centroids of each descriptor space. Due to the required resources
to compute similarity between videos, this technique presumably is not able to scale to common
CBVCD datasets.
Another approach to improve the quality of local descriptors is to directly compute spatio-temporal descriptors. SURF is extended to the temporal dimension by means of a spatio-temporal
keypoint detection [Willems et al., 2008a] and a spatio-temporal descriptor [Willems et al., 2008b].
These extensions are tested on a CBVCD system using synthetic quality transformations. Unfortunately, to the best of our knowledge, a comparison of the performance of this spatio-temporal local
description with other CBVCD systems does not exist.
4.3.5 Global description based on local descriptors
A common technique to reduce search times is to compute summarizations of local descriptors
during the offline phase. This summarization alleviates the need to compare full sets of descriptors
during the online phase.
Roth et al. [2009] summarize the distribution of local descriptors in a frame by creating a 16-byte global descriptor. The global description is the count of local descriptors that fall in each of the 4×4 quadrants, where the maximum count per quadrant is 255. Two descriptors are compared by simply summing up the differences between quadrants. This system did not achieve high detection effectiveness at TRECVID 2009, probably because this kind of aggregation produces a high rate of false alarms.
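A Python sketch of this kind of aggregation; the keypoint positions and frame size are placeholders.

    # 16-byte summarization: count the local descriptors (keypoints) falling
    # in each cell of a 4x4 grid, saturating each count at 255.
    def quadrant_counts(keypoints, width, height):
        counts = [0] * 16
        for (x, y) in keypoints:                     # keypoint positions in pixels
            col = min(int(4 * x / width), 3)
            row = min(int(4 * y / height), 3)
            idx = 4 * row + col
            counts[idx] = min(counts[idx] + 1, 255)  # one byte per cell
        return counts

    def descriptor_difference(a, b):
        # Two descriptors are compared by summing the per-cell differences.
        return sum(abs(ai - bi) for ai, bi in zip(a, b))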
The Bag-of-Visual-Words approach is widely used for addressing different problems in the computer vision area (see Section 2.3.3). In the CBVCD problem, this approach has shown large efficiency improvements with a low impact on effectiveness. The efficiency gains are achieved by the use of lookup tables during the search. It should be noted that the computation of codebooks demands high computational resources; however, these costs are usually not reported because they are part of the offline phase, i.e., the cost does not depend on the queries to be processed.
Most of the top-performing CBVCD systems use some form of BOVW or other codebook-based descriptors in their processes. For instance, Douze et al. [2008] compute a codebook of 200,000 visual words from SIFT descriptors, and additionally they enhance the BOVW by computing Hamming Embedding signatures. In another work, Jiang et al. [2011] compute a codebook of 800 visual words on densely sampled color SIFT descriptors. They increase the performance by
computing spatial pyramids. Similarly, Bai et al. [2011] compute a codebook of 50,000 visual words
from SIFT descriptors.
An open issue of codebooks and CBVCD is to determine the extent to which the quantization of descriptors increases or decreases the effectiveness compared to using the local descriptors
themselves. In the case of image classification, there is some evidence that using codebooks to
produce “mid-level” descriptors can improve effectiveness [Boureau et al., 2010]. However, in the
case of CBVCD, particularly for copies based on content transformations, this generalization might
just increase the amount of false alarms instead of improving effectiveness.
An alternative approach is to compute the “glocal” descriptor [Poullot et al., 2008; Le et al., 2009], which is a binary sequence that summarizes the local descriptors in a static frame. The space of the local descriptors is hierarchically partitioned into 2^h hyper-rectangular cells. The distribution of descriptors in the cells is analyzed in order to confirm that a relatively homogeneous partition is obtained. The glocal descriptor is a 2^h-length binary sequence, where the ith bit is set to 1 if at least one local descriptor falls in the ith cell, or 0 otherwise. The system uses h=8, thus the glocal descriptor is a 32-byte sequence. This brief description enables the processing of very large datasets (on the order of thousands of hours) at the cost of decreased effectiveness due to false alarms.
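A Python sketch of the construction of a glocal descriptor; cell_of is a placeholder for the hierarchical partition of the local-descriptor space, which is assumed to be precomputed.

    # Glocal descriptor: a 2^h-bit sequence where bit i is set to 1 if at
    # least one local descriptor of the frame falls into cell i.
    def glocal(local_descriptors, cell_of, h=8):
        bits = [0] * (2 ** h)
        for desc in local_descriptors:
            bits[cell_of(desc)] = 1       # cell_of returns an index in [0, 2^h)
        return bits                       # with h = 8: 256 bits = 32 bytes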
4.3.6 Acoustic Description
The acoustic description is usually computed by short-time analysis of the acoustic signal
at regular intervals. The analysis is based on converting the signal into the frequency domain (by
means of the Fourier transform) and analyzing the energies of the audible frequencies.
Haitsma and Kalker [2002] present an acoustic description based on energy differences in consecutive sub-bands. It uses a sliding window of 11 ms, and for each window it selects 33 non-overlapping frequency bands in the logarithmic frequency domain. These bands lie in the range from 300 Hz to 2 kHz (the most relevant audible spectral range). A 32-bit binary descriptor is generated by comparing the energy difference of consecutive bands for two consecutive windows: a bit is 1 when the difference increases, i.e., E(n, m) − E(n, m+1) > E(n−1, m) − E(n−1, m+1) for the nth window and mth sub-band, and 0 otherwise. A match between audio segments is performed by calculating a Hamming distance between the binary descriptors.
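A Python sketch of the computation of one such 32-bit descriptor; E is assumed to be a precomputed table of sub-band energies (33 bands per window), which is an input convention added here rather than part of the original formulation.

    # One 32-bit acoustic fingerprint from energy differences of consecutive
    # sub-bands in two consecutive windows (33 bands yield 32 bits).
    def fingerprint_bits(E, n):
        bits = 0
        for m in range(32):
            curr = E[n][m] - E[n][m + 1]          # band difference in window n
            prev = E[n - 1][m] - E[n - 1][m + 1]  # same difference in window n-1
            if curr > prev:
                bits |= 1 << m
        return bits

    def hamming(a, b):
        return bin(a ^ b).count("1")              # matching uses the Hamming distance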
Saracoğlu et al. [2009] also use the energy-differences descriptor with the following differences: a window size of 25 ms, a frequency range from 300 Hz to 3 kHz, 16 sub-bands, and the mth bit of the descriptor is 1 when E(n, m) > E(n, m+1) for the nth window and mth sub-band. This configuration is also used by the CBVCD systems of Gupta et al. [2010] and Younessian et al. [2010].
The Mel-frequency cepstral coefficients (MFCC) are widely used descriptors for speech
recognition. Some systems using MFCC descriptors for CBVCD are Liang et al. [2009], Anguera
et al. [2009a], Natsev et al. [2010], and Gupta et al. [2011].
4.3.7 High-level Description
Min et al. [2010, 2011] face the problem of semantic similarity by first assigning high-level
semantic features to each shot. The semantic vector for each shot is created by extracting low-level
features and using trained classifiers for 32 concepts (like park, people, indoor, flowers, and others).
The semantic description for each shot is a 32-dimensional binary vector denoting the existence or
inexistence of each concept in the shot. A video signature is created by calculating the entropy of
the occurrence of each concept in the semantic vectors of the shots. Additionally, some low-level
features are extracted from each shot. The distance between two videos is a weighted combination
of the distance between semantic descriptors and the distance between global descriptors.
This approach assumes that high-level concepts are able to discriminate between copies and
non-copies. In fact, this may be the case in collections containing videos from a broad range of
subjects and styles. However, in general this approach will miss some copies with content transformations and also produce a high rate of false alarms. For example, it will miss copies with some
picture-in-picture transformations, and will report false alarms between different videos originating from
the same scene. In fact, the evaluation in that work uses datasets based on content transformations
(MUSCLE-VCD-2007 and TRECVID 2008), and the improvement in effectiveness due to semantic
descriptors is rather marginal.
4.4 Similarity Search
The similarity search corresponds to the algorithm that locates the descriptors in the database that closely match the descriptors computed from a query video. In the case of video signatures, a single search is enough to return similar videos. In the case of fine-grained global or
local descriptors, one search should be performed for each descriptor in the query video.
In Chapter 3 we reviewed general aspects of the similarity search and common indexing
techniques. In this section we focus on the techniques that are preferred by CBVCD systems. In
particular, we distinguish three approaches in current CBVCD systems: linear scan, lookup tables,
and space filling curves.
4.4.1 Linear Scan
The linear scan corresponds to sequentially comparing every query descriptor with every reference descriptor and retrieving those that are the closest.
Kim and Vasudev [2005] perform a linear scan using their spatio-temporal distance function.
A copy is detected when the distance between query video and a reference excerpt is lower than a
threshold. The main drawback is that it compares the query video with reference videos probing
every possible alignment.
Gupta et al. [2010] sequentially compare acoustic descriptors following a linear scan. The
linear scan is efficiently implemented on a GPU, therefore it can be used on a large collection. A system following this approach achieved high performance in TRECVID [Gupta et al., 2011], as reviewed in Section 11.6.2.
The linear scan is usually not considered to address large video collections due to the high
cost involved in computing the distance between every pair of query and reference descriptors.
However, it should be noted that if a system is able to afford the resources required by the linear
scan then it can indeed achieve high effectiveness in the similarity search. This fact motivates the
research on metric indexes, which present an alternative for improving the efficiency and achieving
high effectiveness. More details on this issue are given in Section 9.3.
4.4.2 Lookup Tables
The lookup table approach is a technique based on defining a fixed set of n possible values and assigning every descriptor to one or more of those values. The lookup table (or inverted index) is an array of n entries, where each entry contains a list with all the descriptors that are assigned to that value. The similarity search is resolved by using the lookup table to retrieve the descriptors that are assigned to the same values as the query objects. The method that assigns a descriptor to one or more of the n values may be a hash function (as in Locality Sensitive Hashing), or may correspond to the relevant codewords from a codebook (as in the Bag-of-Visual-Words approach).
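A Python sketch of this kind of index; assign is a placeholder for the hash function or codebook quantizer that maps a descriptor to one or more of the n values.

    # Lookup table (inverted index): descriptors are grouped by their assigned
    # values, and the search only scans the lists shared with the query.
    from collections import defaultdict

    def build_table(descriptors, assign):
        table = defaultdict(list)
        for ident, desc in descriptors:          # (video/frame id, descriptor) pairs
            for value in assign(desc):
                table[value].append((ident, desc))
        return table

    def lookup(table, query_desc, assign):
        candidates = []
        for value in assign(query_desc):
            candidates.extend(table[value])      # only same-value descriptors are compared
        return candidates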
The lookup tables can provide a great improvement in search efficiency. However, the quantization error produced by assigning multidimensional vectors to a fixed set of n values may harm
the search effectiveness. In order to enhance the discrimination of table entries, the information
stored by each entry may be augmented to consider some extra information for each descriptor (as
described in Section 2.3.3). Additionally, in order to reduce the false positives produced by collisions
between dissimilar descriptors, a spatio-temporal consistency step is usually required.
Le et al. [2009] organize n-bit glocal descriptors in a lookup table, where each symbol
represents three bit positions (between 0 and n − 1). A weak geometric consistency step locates
triangles that are likely to be stable between copies, which defines a triplet of bit positions. For
each triplet, the glocal descriptor is added to the corresponding list in the lookup table. This
system participated in TRECVID 2009, achieving high performance in the balanced profile but
lower performance in the no false alarms profile.
Liu et al. [2010b] use LSH with a family of m hash functions. Each hash function maps
a local descriptor to one of n symbols. In order to resolve a search, for each local descriptor the
search uses the m hash functions to retrieve the approximate nearest neighbors. A strong spatial
consistency (RANSAC) is performed between query descriptors and retrieved candidates in order to
reduce the false alarms rate. According to its participation in TRECVID 2010, this system achieves
high performance when a certain amount of false alarms are permitted, however it achieves low
performance when false alarms are not allowed (see results in Appendix B).
Bai et al. [2011] present a system that follows the BOVW approach: it computes a codebook
with 50,000 visual words and organizes the quantized descriptors in an inverted index. It considers a
strong process of spatial consistency (Hough transform). According to its participation in TRECVID
2011, this system achieves high performance when false alarms are permitted, however it decreases
its performance when false alarms are not allowed (see results in Appendix C).
4.4.3 Space Filling Curves
This approach partitions the descriptor space into predefined zones, and produces a spatial
ordering of zones according to some predefined space-filling curve. For each zone a probability
density function is determined, and the search retrieves the zones with high probability of containing
a relevant object, and evaluates the distance to all the objects in them.
Joly et al. [2007] propose to partition the space into hyper-rectangular blocks using a Hilbert
space-filling curve. The distribution of descriptors for each block is estimated by a normal distribution, with the same standard deviation in all the dimensions, and determined by training data.
The search consists of selecting the blocks whose combined probability of containing a query vector
is greater than a given threshold. This approach was tested in TRECVID by Joly et al. [2008], achieving high detection performance; however, it was outperformed by a system using the BOVW approach [Douze et al., 2008].
Poullot et al. [2007] present an improvement of the probabilistic discard, where the space is partitioned into hyper-rectangular blocks following a Z space-filling curve instead of the Hilbert curve. With this change, all the blocks have the same size and orientation, simplifying the calculation of block boundaries. This system was tested on a database with 60,000 hours of video, and in an extended work it was tested with 280,000 hours of video [Poullot et al., 2010]. This system shows good performance on MUSCLE-VCD-2007; unfortunately, it has not been evaluated on any TRECVID dataset.
4.5 Temporal Consistency
The temporal consistency consists in analyzing the similar frames or segments returned by the similarity search in order to determine whether there exist one or more copies in a query video, along with the boundaries of each one.
Joly et al. [2007] use an extension of the spatial consistency between local descriptors (see
Section 2.3.2) including the temporal dimension as an additional variable. The RANSAC algorithm
is used to determine the best alignment between query and reference video.
Tan et al. [2009] model the temporal consistency problem between a query and a reference video as a directed graph. The frames of the videos correspond to vertices, and the similarity between
frames corresponds to the weights of the edges in the graph. The copied excerpt is determined by
the optimal solution of a network flow problem.
Chapter 10 presents a novel algorithm for temporal consistency that can be used to locate
copied excerpts from the results of k-NN searches. Unlike previous works, this algorithm can profit
from several nearest neighbors by weighting their ranks and distances.
4.6 Multimodal Fusion
In order to achieve a highly effective detection, CBVCD systems should use many sources of
information to perform the detection process. In fact, the top-performing CBVCD systems usually
fuse many modalities, i.e., combine different kinds of descriptors to perform a detection process,
like global descriptors, local descriptors, and/or acoustic descriptors.
Most of the current CBVCD systems perform a multimodal detection by dividing the system
into independent subsystems, where each subsystem performs a detection using a single different
modality. For instance, let A and B be two subsystems for a CBVCD system, and assume A uses
visual descriptors and B uses audio descriptors. A and B perform an independent copy detection
process, each one producing a set of copy candidates C_A and C_B, respectively. The commonly used approach for multimodal fusion creates the final detection list C by combining detections and scores from C_A and C_B.
Snoek et al. [2005] describe two approaches to perform multimodal fusion for the video
semantic indexing problem: the “early fusion” and the “late fusion”. Adapting those definitions
to CBVCD, the multimodal fusion approach described in the previous example corresponds to late
fusion, because the multimodal detections are the result of combining unimodal detections. On the
other hand, the early fusion combines modalities before determining copy candidates.
4.6.1
Early Fusion
The early fusion combines information between modalities at an earlier stage, rather than at
the final decision step. Extending the definition of Snoek et al. [2005], we differentiate three stages
for early fusion: content description, similarity search, and temporal consistency.
In the case of early fusion at the content description, a multimodal descriptor is created by concatenating descriptors from different modalities. Snoek et al. [2005] perform a comparison
between early and late fusion for the semantic video indexing using this approach. The results
suggest that late fusion generally achieves slightly better performance, and only in specific cases does early fusion outperform late fusion.
In the case of early fusion at the similarity search, a multimodal search is performed by comparing descriptors from different modalities. Barrios and Bustos [2009] present an image retrieval
system that combines text-based descriptors with visual-based descriptors at the distance function.
Batko et al. [2009] present an image retrieval system that combines many visual-based descriptors
using distances with predefined weights.
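As a rough illustration of early fusion at the similarity search level, the following sketch combines per-modality distances into a single distance through a weighted sum; the modality names, weights, and the use of L1 are assumptions for illustration, not the exact formulation of the cited systems (Chapter 8 details the combinations actually studied in this thesis).

    import numpy as np

    def l1(a, b):
        """Manhattan distance between two descriptor vectors."""
        return float(np.abs(a - b).sum())

    def early_fusion_distance(q, r, weights):
        """Hypothetical weighted combination of per-modality distances.
        q and r map a modality name (e.g. 'global', 'audio') to its
        descriptor vector; weights holds one weight per modality."""
        return sum(w * l1(q[m], r[m]) for m, w in weights.items())

    # Usage: two modalities fused into a single distance value.
    q = {"global": np.random.rand(160), "audio": np.random.rand(160)}
    r = {"global": np.random.rand(160), "audio": np.random.rand(160)}
    print(early_fusion_distance(q, r, {"global": 0.6, "audio": 0.4}))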
In the case of early fusion at the temporal consistency, a multimodal detection is created by analyzing the results of similarity searches produced by independent modalities. Anguera
et al. [2009b] present a CBVCD system that performs a comparison between query and reference
videos using global descriptors and acoustic descriptors, and then produces the copy detections by
measuring the correlation between their similarities. Ayari et al. [2011] present a CBVCD system
that compares the performance of early and late fusion of local descriptors and acoustic descriptors.
The early fusion is performed by analyzing the results of the similarity search of both modalities.
The results suggest the early fusion achieves better performance than the late fusion.
In Chapter 8 we test the performance of CBVCD using early fusion at the similarity search by
combining global descriptors and acoustic descriptors into a single distance function. Section 9.3.3
compares the performance of early fusion with the late fusion of partial nearest neighbors.
4.6.2
Late Fusion
Late fusion combines the copy detections produced by single-modality subsystems. In general, late fusion processes follow one of two approaches: a) choose candidates from one subsystem following some heuristic rule based on prior knowledge of the performance of each modality; or b) merge candidates from all subsystems using a weighted sum of the confidence scores given by each subsystem.
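A minimal sketch of approach b), assuming each subsystem reports candidate excerpt pairs with confidence scores already normalized to a common range; the key format, weights, and normalization are illustrative assumptions rather than the scheme of any particular cited system.

    def late_fusion(candidates_a, candidates_b, w_a=0.5, w_b=0.5):
        """Merge two candidate lists by a weighted sum of confidence scores.
        Each list maps a (query_excerpt, reference_excerpt) key to a score
        assumed to be normalized to [0, 1]."""
        merged = {}
        for key in set(candidates_a) | set(candidates_b):
            merged[key] = (w_a * candidates_a.get(key, 0.0)
                           + w_b * candidates_b.get(key, 0.0))
        # Return detections sorted by fused confidence, highest first.
        return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

    # Usage: fuse visual-based and audio-based detections for one query video.
    visual = {("q1:10-25", "v7:100-115"): 0.9}
    audio = {("q1:10-25", "v7:100-115"): 0.7, ("q1:40-50", "v2:5-15"): 0.4}
    print(late_fusion(visual, audio, w_a=0.6, w_b=0.4))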
Wu et al. [2007] perform a hierarchical combination of results between video signatures and
local descriptors. Depending on the distance between video signatures, the system may decide to:
declare is-a-copy (when the distance is near zero), declare is-not-a-copy (when the distance is above
a threshold), or otherwise delegate the decision to a subsystem based on local descriptors.
Saracoğlu et al. [2009] combine acoustic descriptors and global descriptors. The fusion
consists of choosing the detection with highest confidence that exceeds a certain threshold. Liang
et al. [2009] combine acoustic descriptors and local descriptors. The fusion is either the union or
intersection of the candidates, depending on whether the final result can or cannot contain false
alarms, respectively. Mukai et al. [2010b] combine acoustic descriptors and global descriptors. The
fusion joins candidates from both subsystems, but prefers the acoustic-based candidates in case of
conflict. Le et al. [2009] combine acoustic descriptors and glocal descriptors. The fusion joins or
intersects both detections, and the final confidence score is a weighted sum favoring audio scores.
Anguera et al. [2011b] present a late fusion algorithm specific for CBVCD. It is based on the
weighted sum of scores, where the scores are dynamically normalized according to their distribution
for each modality. This algorithm can fuse an arbitrary number of modalities and it does not require
any specific distribution for the scores, hence it can combine almost any CBVCD result.
In Section 11.6 we compare the performance of our system using early fusion to other systems
using late fusion. We also analyze the participation of the Telefonica team at TRECVID 2011
[Anguera et al., 2011a], which uses late fusion to combine acoustic descriptors, local descriptors,
and our system. We then use these results to show benefits and drawbacks of these two fusion
approaches.
4.7
Alternative Approach: Watermarking
Watermarking consists of embedding invisible information, called a watermark, into an original video. This process should be performed by the producer or distributor of the content before it is broadcast. Then, given a query video from an unknown source, the copy detection system
detects and retrieves the watermark from the video, and compares it with its database of watermarks. The result of the search may indicate the owner of the video, its copyrights, the allowed
usages, and other information to determine whether the video is a legal copy or not [Swanson
et al., 1998; Wolfgang et al., 1999].
The watermarking approach is focused on checking copyright infringements, i.e., a media
owner enforces its rights over the distribution of a multimedia content (either an image, audio or
video document) by adding invisible information in the content. Another use of watermarking is
to check the authenticity of the content, i.e., to verify that the image or video has not been edited
after the watermark embedding.
In general, the problem of embedding a secret message into a public communication channel
is known as steganography. Watermarking uses steganography to hide a message about the communication channel itself. Langelaar et al. [2000] present an overview of watermarking methods, such as adding a pseudo-random noise pattern, DFT amplitude modulation, and modification of the least significant bits. These methods present different strengths and weaknesses under common content transformations.
The main problems with watermarking are that an early manipulation of the documents is
required, and watermarks may not be robust enough to withstand some content transformations
(like strong quality or postproduction transformations). Additionally, although watermarks are usually imperceptible, some domains may require the highest possible image quality (like medical images).
CBVCD systems can be seen as watermark-based copy detection systems where the watermarks are calculated from the content itself. Therefore, CBVCD avoids embedding any extra information into the media, enabling copy detection on content that has already been broadcast.
4.8
Summary
In this chapter we have depicted the main topics that are involved in a CBVCD system. We
have also discussed the research work that has been done on these topics.
Despite the improvements shown in recent years, CBVCD is still a challenging problem. The main issues that researchers currently face are the development of new techniques for managing huge sets of local descriptors, the development of different fusion techniques, and the use of CBVCD in realistic environments.
Part II
CBVCD and the Metric Space Approach
Chapter 5
Overview
This chapter gives a general overview of our CBVCD system, called P-VCD, and its components. The details of each component are given in the following chapters.
Let V be a set of original videos (the reference collection), C be a set of query videos (the
query collection), and T be the set of content transformations that may have been applied to a
reference video to create a query video. In practice, the objective of a CBVCD system is to produce
a list of copy detections (c̄, v̄, s), where c̄ is an excerpt from query video c ∈ C, v̄ is an excerpt from
reference video v ∈ V, and s ∈ R+ is a confidence score for ∃ t ∈ T , t(v̄) = c̄.
P-VCD is divided into five main tasks: Preprocessing, Video Segmentation, Feature Extraction, Similarity Search, and Copy Localization (see Figure 5.1). These tasks work as a pipeline:
each task reads the required input from files in secondary storage, performs the desired work, and
then writes back the output to storage (possibly creating new files with some predefined format).
A general overview of the system is the following: 1) the preprocessing task processes every video in C and V in order to normalize videos and diminish the effects of content transformations; 2)
the video segmentation task partitions every video into short segments, producing a set of query
segments and a set of reference segments; 3) the feature extraction task calculates many descriptions
for every segment; 4) the similarity search task performs many NN+range searches to retrieve the
most similar reference segments for each query segment; finally 5) the copy localization task uses
the similar reference segments to locate chains that belong to the same reference video, producing
the final set of detections (c̄, v̄, s).
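The data flow of this pipeline can be summarized with a toy sketch; every function body below is a placeholder standing in for the corresponding task, not the actual P-VCD implementation, and the in-memory hand-off replaces the file-based storage described above.

    def preprocess(videos):
        # Placeholder: would normalize frames and revert detected transformations.
        return list(videos)

    def segment(videos):
        # Placeholder: would split each video into short fixed-length segments.
        return {v: [f"{v}#seg{i}" for i in range(3)] for v in videos}

    def extract(segments):
        # Placeholder: would compute global/local/acoustic descriptors per segment.
        return {s: [hash(s) % 256] for segs in segments.values() for s in segs}

    def search(q_desc, r_desc, k=2):
        # Placeholder: exact k-NN by absolute difference over the toy descriptors.
        return {q: sorted(r_desc, key=lambda r: abs(q_desc[q][0] - r_desc[r][0]))[:k]
                for q in q_desc}

    def localize(neighbors):
        # Placeholder: would chain neighbors of consecutive query segments into copies.
        return [(q, ns[0], 1.0) for q, ns in neighbors.items()]

    # 1) preprocessing, 2) segmentation, 3) feature extraction,
    # 4) similarity search, 5) copy localization.
    queries, references = preprocess(["c1"]), preprocess(["v1", "v2"])
    q_segs, r_segs = segment(queries), segment(references)
    detections = localize(search(extract(q_segs), extract(r_segs)))
    print(detections)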
The main motivation for this design is to clearly isolate the similarity search from the other
tasks (specifically, from feature extraction and copy localization). This isolation enables focusing
the research on developing techniques for improving effectiveness and efficiency using widely-used
descriptors. This is a key aspect of this work: the relevance of the similarity search (including the distance definition and multimodal fusion) to the overall system performance. This marks a difference with some research in CBVCD that focuses on developing novel descriptors robust to different transformations, but keeps the similarity search as a plain linear scan.
The following sections summarize each of these tasks and their relationships. Additionally, we
present the dataset and the evaluation we used for the experiments in Chapters 7, 8, 9, and 10.
Figure 5.1 – The five main tasks of the P-VCD system.
5.1
Preprocessing Task
Video preprocessing is a common step in CBVCD systems. Its main benefit is an improvement in detection effectiveness, achieved by giving descriptors invariance or improved robustness
to some content transformations. On the other hand, the video preprocessing usually affects the
efficiency of the system due to the computational cost needed to analyze and enhance videos. An
analysis of the impact of preprocessing on system effectiveness is shown in Section 11.6.3.
Specifically, this task has two objectives: 1) to normalize or enhance the quality of videos;
and 2) to diminish the effect of content transformations on query videos.
In order to normalize videos, every video is analyzed and the frames that carry no information or are too noisy are removed. For each reference video v ∈ V its normalized version v′ is added to the output set V′; analogously, for each query video c ∈ C its normalized version c′ is added to the output set C′. In order to diminish the effects of content transformations, new query videos are created by detecting and reverting some specific transformations. The reversion process may create one or more new query videos, which are normalized and added to C′.
The output for this task is a new set of normalized reference videos V ′ and a new set of
normalized and reverted query videos C ′ . It is expected that |V| ≃ |V ′ | and |C| ≤ |C ′ |. The following
tasks use these new sets to detect copies. The Copy Localization task converts detections between
C ′ and V ′ into detections between C and V. Chapter 6 details the preprocessing techniques applied
by the system to reference and query videos.
5.2
Video Segmentation Task
The objective of this task is to partition every video into short segments. A video segment
is just a group of similar consecutive frames {fi , fi+1 , ...}. The segments do not need to be all the
same length.
The segmentation task allows the system to detect copies that may be shorter than the
query video. Essentially, the segmentation task divides every video into independent segments,
the similarity search task locates similar segments in the collection, and the copy localization task
restores the temporal unity, obtaining the copy boundaries. Therefore, in order to detect a copy, both the original and the duplicated excerpt must have been divided into several segments.
A shot partition usually produces a segmentation that is too coarse to achieve high detection effectiveness. In fact, some corpora contain copies shorter than a single shot, like TRECVID's datasets, where copied excerpts can be just three seconds long. On the other hand, a segmentation that is too fine directly affects the efficiency due to the high number of segments to process, and may also affect the effectiveness due to noisy segments that may produce false alarms with a high score. The impact of segment length on system effectiveness is evaluated in Section 7.6.
Depending on the type of descriptor to compute, the system may need to select a keyframe
for each segment. In the case of spatial descriptors, each segment s must define a representative
keyframe fs from which the descriptor is computed. In the case of spatio-temporal descriptors, the
segments do not need to define keyframes because the descriptor depends on all the frames in the
segment.
The output of this task is a set of segments {s1 , ..., sr } for every query and reference video.
Chapter 7 details the segmentations used by the system and compares their impact on effectiveness.
5.3
Feature Extraction Task
The objective of this task is to create one or more descriptions (global, local, acoustic) for
each video segment.
A feature extraction method is defined by the pair (g, d) where: g : S → F is the extraction
function; d : F × F → R is the distance function for comparing two descriptions; S is the set of
video segments; F is the feature space; and g(s) is the description for segment s. In general, the
description g(s) can be a global descriptor for the representative frame, a spatio-temporal descriptor
for the whole segment, local descriptors for the representative frame, or an acoustic descriptor for
the audio track in the segment. Chapter 7 details the descriptions that are used in this thesis and
compares their effectiveness using different segmentations.
The task establishes two requirements for feature extraction methods: 1) g(s) should represent the whole segment s; and 2) the feature extraction method should not be severely affected by
the content transformations that might have been applied to s. If the descriptor is highly affected by
transformations, the inclusion of some normalization or reversion in the preprocessing task should
be considered. A third desirable property is that d should satisfy the metric properties. The feature
extraction task is restricted to use a unique segmentation for all the feature extraction methods, i.e.,
to extract different descriptors from the same segments. This restriction enables the early fusion
because descriptors represent the same excerpt. On the other hand, the late fusion does not have this restriction because the combination is performed after the similarity search. An analysis and a comparison between early and late fusion are given in Section 11.6.2.
Figure 5.2 – Video Segmentation and Feature Extraction tasks: the video is partitioned into segments by the Video Segmentation task, and for each segment the Feature Extraction task calculates one or more descriptions.
The feature extraction task places no restriction on the descriptor format, allowing methods to create descriptors as multidimensional vectors, binary signatures, variable-length vectors, sets of vectors, or any other format. This independence from the format is possible as long as the distance function is able to compare the descriptors and compute a distance value. The main strength of this approach
is that it provides two ways for improving effectiveness: 1) to represent and store descriptors in any
format that may better represent the data instead of restricting to fixed-length multidimensional
vectors; and 2) to compare descriptors using functions that may better implement the concept of
similarity instead of restricting to Minkowski distances. On the other hand, this generalization directly affects the efficiency, but (as a general rule) it is the responsibility of the next task (similarity
search) to balance the effectiveness-versus-efficiency tradeoff.
The output of this task is a set of m descriptors {g1 (s), ..., gm (s)} for each segment s, and a
set of m distance functions {d1 , ..., dm } to compare each type of descriptor. Figure 5.2 depicts the
Video Segmentation and Feature Extraction tasks.
5.4
Similarity Search Task
The objective of this task is to perform NN+range searches to efficiently retrieve, for each query segment q, a list with its k most similar reference segments closer than a distance threshold ε. More formally, for every q ∈ Q this task calculates the list Nq = {(r1, d(q, r1)), ..., (rk, d(q, rk))}, where rj ∈ R is the j-th nearest neighbor to q, d(q, rj) ≤ d(q, rj+1), and d(q, rj) ∈ [0, ε].
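As a minimal sketch of the search semantics only (the indexing techniques are the subject of Chapter 9), a linear-scan NN+range search over toy one-dimensional descriptors could look as follows; the function name and toy distance are assumptions for illustration.

    import heapq

    def nn_range_search(q, references, dist, k, eps):
        """Linear-scan NN+range search: return the k reference objects closest
        to q whose distance is at most eps, sorted by increasing distance."""
        candidates = [(dist(q, r), r) for r in references]
        in_range = [(d, r) for d, r in candidates if d <= eps]
        return [(r, d) for d, r in heapq.nsmallest(k, in_range)]

    # Usage with 1-d toy descriptors and absolute difference as the distance.
    refs = [0.1, 0.4, 0.9, 1.7]
    print(nn_range_search(0.5, refs, lambda a, b: abs(a - b), k=2, eps=0.5))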
Controlling the effectiveness-versus-efficiency tradeoff is a key issue for this task. The similarity search should achieve as high effectiveness as possible (i.e., to retrieve all the segments that
are indeed copies) and be as efficient as possible (i.e., to require few resources, including memory
and search time). On one hand, effectiveness is usually determined by the quality of descriptors
extracted by the previous task. In this thesis we show different techniques that can improve their basal effectiveness at the cost of decreasing efficiency. On the other hand, a basal efficiency is achieved by the linear scan, which can be improved with no cost in effectiveness by using some index structure and exact searches. The efficiency can be greatly improved by using approximate searches at the cost of decreasing effectiveness.
This task addresses two main issues:
1. To define the distance function d between two segments. In the case of a single descriptor
per segment, the distance may be the same di defined for the descriptor. In the case of
many descriptions per segment, a distance based on {d1 , ..., dm } can be defined. In Chapter 8
we study techniques for defining distances that improve the effectiveness by performing an
early fusion of descriptors. We show that very simple descriptors can indeed achieve high
effectiveness if they are combined wisely. In particular, we study the effectiveness of linear
combination of distances, spatio-temporal distance, and non-metric distances.
2. To perform NN+range searches for every query object using d. In the more general case, the
search cannot make assumptions about underlying descriptors, impeding the use of multidimensional indexes. In that case, if d satisfies the metric properties, a metric index can be
used, otherwise some non-metric index may be considered (see Chapter 3). In Chapter 9 we
focus on techniques for efficiently resolving searches for metric distances d. In particular we
study the approximate searches using static pivot tables, and exact searches using dynamic
pivot tables.
As discussed in Section 3.4, it is expected that metric indexes achieve lower efficiency than
multidimensional indexes. Moreover, if d defines some complex similarity models, it can also be
expected that it will demand high computational resources, further affecting the efficiency. A comparison of the search efficiency of multidimensional indexes and metric indexes is presented in Section 9.3.
The output of this task is a set of similar reference segments (i.e., the nearest neighbors
according to d) for every segment of each query video.
5.5
Copy Localization Task
The objective of this task is to decide on the existence of a copy by analyzing the result of
the similarity search. The input for this task is a query video c′ ∈ C ′ partitioned into the segments
{s1 , ..., sr } and the output of the similarity search {Ns1 , ..., Nsr } where Nsi is the list of the nearest
neighbors for query segment si .
The task traverses the lists of nearest neighbors looking for chains of similar segments belonging to the same reference video. It produces a list of copy candidates (c̄′, v̄′, s′) between the preprocessed sets C′ and V′. Finally, it combines the detections for the query videos created by the Preprocessing task, producing the final set of triplets (c̄, v̄, s).
Chapter 10 details the localization task and its voting algorithm.
5.6
Evaluation of CBVCD
The evaluation considers two aspects: the effectiveness of the detection and the efficiency of the detection. In order to evaluate the effectiveness, the system performs a detection on a
dataset with known results (i.e., a ground-truth) and the differences between detected copies and
the actual copies are measured. In order to evaluate the efficiency, we compare the computational
effort required by linear scans (i.e., searches without indexing at all) with the computational effort
required by the proposed algorithms.
Additionally, the system is also evaluated by comparing its effectiveness and efficiency with
other state-of-the-art systems. The system participated in the Content-Based Copy Detection task
at TRECVID 2010 and 2011. The results of these participations, datasets, and comparison with
other systems, are detailed in Chapter 11.
The MUSCLE-VCD-2007 dataset [Law-To et al., 2007b] is a publicly available and widely-used collection for the evaluation of CBVCD systems. It was created as the corpus for a CBVCD evaluation performed at CIVR 2007. Among the public databases for video copy detection, we have chosen this dataset because it has a well-defined ground-truth (it is possible to establish a frame-to-frame match between query videos and reference videos), and it has an appropriate size for the evaluations: it is big enough to be a challenging database, yet exact searches can be resolved in reasonable time, thus enabling us to analyze the real effectiveness of different techniques.
5.6.1
Dataset
The MUSCLE-VCD-2007 dataset consists of one reference collection, called DB-MPEG1, and
two query collections, namely ST1 and ST2. Table 5.1 summarizes the sizes of these collections.
Collection    Videos    Hours      Frames     GB
Query             18      3.9     348,321    2.3
  ST1             15      3.1     281,052    1.8
  ST2              3      0.7      67,269    0.4
Reference        101     58.7   5,285,921   34.3

Table 5.1 – Summary of MUSCLE-VCD-2007 collection.
The reference videos have varying lengths, ranging from 17 seconds for the shortest one, up
to 115 minutes for the longest one. The reference videos come mainly from TV rips, thus they have reasonably good quality (352×288 at 25 fps).
The query collection was created by applying some content transformations to reference videos. The visual transformations include: color adjustment, blur, resize, crop, vertical mirroring, noise, insertion of subtitles, insertion of a logo, and manual camcording. The audio transformation for most of the videos is the reencoding of the audio track. For videos with manual camcording, the audio was recorded with the camcorder (thus the audio contains ambient noise and low volume), and there is one copy with a muted audio track, possibly due to an unintended
corruption in the query video. There is no acceleration transformation, thus the copy and reference
excerpts are the same length.
The ST1 collection tests full-length copies. It consists of fifteen videos: ten come from the reference collection and the other five come from videos not in the reference collection. The query videos are almost identical in length to the corresponding reference videos. The ST2 collection tests partial-length copies. Its three videos come from videos not in the reference collection, and they contain 21 embedded excerpts extracted from the reference collection.
The details of the ground-truth are shown in Appendix A along with example frames from
query videos.
5.6.2
Evaluation measures
In the following chapters, we evaluate different aspects of the CBVCD system. Each experiment first defines some configuration (i.e., a type of preprocessing, a video segmentation, the
description for each segment, and the distance between segments), and then it performs the similarity searches. The results of the searches are evaluated according to two aspects: the frame-to-frame matching effectiveness and the copy detection effectiveness.
Frame-to-frame match
These measures evaluate the performance of matching a copied frame with its original frame.
Let Q be the set of segments from query videos and R be the set of segments from reference videos.
Let Qc ⊆ Q be the set of query segments that have a matching original segment in R according to
the ground-truth (i.e., the segments that are copies).
For this evaluation, we state that every segment must have a representative frame. The correct answer for a query segment s with representative frame fs is the reference segment r that contains the frame Rfs. The frame Rfs is the original frame for fs, i.e., ∃ t ∈ T, t(Rfs) = fs. The frame Rfs is determined by the ground-truth. Some examples of frames fs and their corresponding Rfs are given in Appendix A.
Given a configuration and a query object q ∈ Qc, the similarity search sorts all the segments in R from the most similar to the least similar to q. Then, using the ground-truth, we determine the
rank of each correct answer. The best value for rank(Rq ) is 1 and the worst value is |R|. Finally,
we evaluate the frame-to-frame effectiveness with two measures:
• Mean average precision (MAP): The average precision for a single query is just the inverse of the rank of Rq (each query has one correct answer), then MAP is defined as:

\mathrm{MAP} = \frac{1}{|Q_c|} \sum_{q \in Q_c} \frac{1}{\mathrm{rank}(R_q)}

MAP values range between 0 and 1, where 1 means best effectiveness.
• Correct k-NN searches (Nk): The percentage of queries for which the correct answer is retrieved by an exact k-NN search:

N_k = \frac{|C_k|}{|Q_c|} \cdot 100\,, \quad \text{where } C_k = \{\, q \in Q_c,\ \mathrm{rank}(R_q) \le k \,\}

Nk values range between 0% and 100%, where 100% means best effectiveness.
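Both measures can be computed directly from the rank of the correct answer of each query; the following sketch assumes those ranks are already available as a list of integers.

    def map_and_nk(ranks, k):
        """Compute MAP and N_k from the rank of the correct answer of each
        query in Qc (rank 1 = best). `ranks` is a list of positive integers."""
        mean_ap = sum(1.0 / r for r in ranks) / len(ranks)
        n_k = 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)
        return mean_ap, n_k

    # Usage: three queries whose correct answers were ranked 1st, 4th, and 20th.
    print(map_and_nk([1, 4, 20], k=5))   # -> (0.4333..., 66.66...)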
Correct copy detections
These measures evaluate the performance of detecting the copies in the query collection. Let
S={(c̄, v̄, s), ...} be the output of the CBVCD system, and R be the ground-truth, i.e., R defines
the actual copies between reference and query videos: R={(ĉ, v̂), ...}. In order to evaluate S, we
calculate Precision (fraction of the retrieved documents that are relevant) and Recall (fraction of the relevant documents that have been retrieved) [Baeza-Yates and Ribeiro-Neto, 1999] using the following
steps:
1. Sort triplets in S according to their confidence scores, and determine the minimum and maximum confidence scores smin and smax .
2. For every decision threshold t between smin and smax :
(a) Define S(t) = {(c̄, v̄, s) ∈ S, s ≥ t}.
(b) For every triplet (c̄, v̄, s) ∈ S(t): if ∃ (ĉ, v̂) ∈ R where c̄ ∩ ĉ ≠ ∅ and v̄ ∩ v̂ ≠ ∅, then add (c̄, v̄, s) to Sa, and add (ĉ, v̂) to Ra.
(c) Calculate Precision and Recall at decision threshold t:

\mathrm{Precision}(t) = \frac{|S_a|}{|S(t)|} \qquad \mathrm{Recall}(t) = \frac{|R_a|}{|R|}
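The following sketch illustrates this threshold sweep; the representation of excerpts as (video id, start, end) intervals and the overlap test are assumptions made for illustration.

    def overlaps(a, b):
        """True when two excerpts (video_id, start, end) share the same video
        and their time intervals intersect."""
        return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

    def precision_recall(detections, ground_truth):
        """Precision/Recall at every decision threshold taken from the scores.
        detections: list of (query_excerpt, ref_excerpt, score);
        ground_truth: list of (query_excerpt, ref_excerpt)."""
        curve = []
        for t in sorted({s for _, _, s in detections}):
            s_t = [d for d in detections if d[2] >= t]
            s_a = [d for d in s_t
                   if any(overlaps(d[0], gc) and overlaps(d[1], gv)
                          for gc, gv in ground_truth)]
            r_a = {g for g in ground_truth
                   if any(overlaps(d[0], g[0]) and overlaps(d[1], g[1]) for d in s_t)}
            curve.append((t, len(s_a) / len(s_t), len(r_a) / len(ground_truth)))
        return curve

    # Usage: one correct detection and one false alarm.
    gt = [(("q1", 10, 20), ("v7", 100, 110))]
    dets = [(("q1", 12, 18), ("v7", 103, 109), 0.9),
            (("q1", 40, 50), ("v2", 5, 15), 0.3)]
    print(precision_recall(dets, gt))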
Once precision and recall have been calculated at each threshold, we evaluate S according
to:
• Precision/Recall curve: The values of Precision(t) and Recall(t) can be plotted in an x-y
plane. A system will outperform another when it achieves higher precision at every recall
value. In particular, two relevant values to compare are:
– Maximum Recall with Precision 1 (RP1 ): It compares systems by the amount of
correct detections without false alarms.
– Maximum Recall with Precision greater than or equal to 0.5 (RP.5 ): It compares
systems by the amount of correct detections, admitting an equal amount of correct
answers and false alarms.
• Detections without false alarms. This is the absolute number of correct detections (|Ra |)
at Precision 1. We prefer this value instead of RP1 because MUSCLE-VCD-2007 has a small
number of copies (31), thus the absolute number may be more meaningful than a fraction.
This value varies between 0 and 31. Table 5.2 shows the relationship between Detections
without false alarms and Recall at Precision 1.
Detections    0     1     2     3     4     5     6     7     8     9    10
Recall        0    0.03  0.06  0.10  0.13  0.16  0.19  0.23  0.26  0.29  0.32

Detections   11    12    13    14    15    16    17    18    19    20    21
Recall       0.35  0.39  0.42  0.45  0.48  0.52  0.55  0.58  0.61  0.65  0.68

Detections   22    23    24    25    26    27    28    29    30    31
Recall       0.71  0.74  0.77  0.81  0.84  0.87  0.90  0.94  0.97  1

Table 5.2 – Detections without false alarms at MUSCLE-VCD-2007 dataset and their associated Recall at Precision 1.
To the best of our knowledge, the work that shows the best detection performance on the
MUSCLE-VCD-2007 dataset is Poullot et al. [2010], which achieves precision 1 and recall 1 for
ST1, and precision 1 and recall 0.95 for ST2, i.e., a total of 30 out of 31 correct detections without
false alarms. Other works achieving high detection performance on this dataset are: Anguera
et al. [2009b] achieving precision 1 and recall 1 for ST1, and precision 1 and recall 0.88 for ST2 (but
considering 18 instead of 21 copies); Poullot et al. [2008] achieving precision 1 and recall 0.93 for
ST1, and precision 1 and recall 0.86 for ST2.
5.7
Summary
In this chapter we have presented the big picture of this thesis and its five main components.
This overview has shown the relationships between the components and the evaluation procedure used for each task.
The following chapters detail each task: Chapter 6 details the preprocessing task, Chapter 7
reviews the segmentation task and feature extraction task, Chapter 8 and Chapter 9 focus on the
similarity search task, and Chapter 10 details the localization task.
Finally, Chapter 11 details the participation of this system at TRECVID 2010 and 2011 and
its comparison with other state-of-the-art systems.
Chapter 6
Preprocessing
The Preprocessing task normalizes the quality of every video in the dataset and diminishes
the effect of content transformations on query videos. This task analyzes the videos and creates a
new dataset with the processed versions of the videos.
The new videos produced by this task are not physically created. Instead, the creation of a new video means that one or more filters are applied on the fly to the source video file. The parameters
for the filters are calculated by this task and stored in a configuration file which is read by the
following tasks. This approach avoids reencoding issues and wasting disk space, at the cost of an
extra computational effort before reading each frame.
6.1
Quality normalization
The normalization process is performed for every video in the dataset, first on a frame-by-frame basis and then on a global (whole-video) basis.
6.1.1
Frame-by-frame normalization
For this normalization, every frame in the video is analyzed. A frame fi is marked to be
skipped if it is plain or it is an outlier. A frame fi is skipped by duplicating the previous frame fi−1
when i > 0, or the next non-skipped frame when i=0.
In order to detect plain frames, a frame is converted into gray scale and down-scaled to
20×15 pixels. We state that a frame is plain under either of two conditions: the variance of the intensities is below a minimum value; or the difference between the minimum and maximum intensity pixels is smaller than a threshold. These criteria aim to remove frames that lack enough information to be discriminative.
Once the plain frames have been removed, a detection of outlier frames is performed. We state that a frame fi is an outlier when two conditions are met: the previous frame fi−1 and the next frame fi+1 are similar; and frame fi is very different from both fi−1 and fi+1. Let ∆(x, y) be the sum of the per-pixel differences between frames x and y. The frames are first down-scaled to 20×15, and fi is an outlier when both ∆(fi−1, fi) and ∆(fi, fi+1) are greater than a threshold and ∆(fi−1, fi+1) is smaller than a threshold. This criterion aims to remove frames like flash-lights or other editing effects that may affect the video segmentation or the spatio-temporal feature extraction.
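A rough sketch of these two checks over frames already converted to gray scale and down-scaled to 20×15 (the threshold values are illustrative assumptions, not the ones used by the system):

    import numpy as np

    def is_plain(frame, var_min=25.0, range_min=20):
        """Plain-frame test: low intensity variance, or a too-small gap
        between the darkest and brightest pixels."""
        return frame.var() < var_min or int(frame.max()) - int(frame.min()) < range_min

    def is_outlier(prev_f, frame, next_f, diff_max=30_000, sim_max=10_000):
        """Outlier test: the frame differs strongly from both neighbors,
        while the neighbors are similar to each other."""
        delta = lambda x, y: np.abs(x.astype(int) - y.astype(int)).sum()
        return (delta(prev_f, frame) > diff_max and delta(frame, next_f) > diff_max
                and delta(prev_f, next_f) < sim_max)

    # Usage: a flash-light frame between two similar frames.
    a = np.full((15, 20), 100, dtype=np.uint8)
    flash = np.full((15, 20), 255, dtype=np.uint8)
    print(is_plain(flash), is_outlier(a, flash, a))   # True True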
6.1.2
Global normalization
Once every frame has been normalized, the video is processed to remove letter-, pillar-, or
window-boxing by cropping the black borders. To do this, the temporal median and variance of the intensity are calculated for each pixel. Rows and columns are removed iteratively from the borders toward the center while the median and variance of the majority of their pixels are smaller than a threshold.
6.2
Detection and reversion of transformations
The detection process performs a test on a query video to decide whether or not the video has a specific transformation to revert. The reversion process creates one or more query
videos with the reverted version of the detected transformation. The created query videos are
added to the query collection together with the original query video. In this thesis, we focus on
the reversion of three transformations: picture-in-picture (PIP), camcording, and vertical flip. The
reversion processes have been performed only for TRECVID’s query datasets.
6.2.1
Picture-in-picture
The PIP creation process used in TRECVID’s query dataset combines two video files. One
video file is used as a background file and the other video file is embedded inside a static rectangle
at any of the four corners or the center of the video. The size of the rectangle is fixed for the
whole video, but it may vary between query videos. The reversion process looks for that persistent
rectangle (see Figure 6.1).
First, all the frames in the video are processed to calculate the average intensity and the
variance for every pixel. Because the two combined videos present different sequences, the edges of
the PIP rectangle should be visible in either the average frame or the variance frame. The edges in
these two frames are detected using the Laplacian kernel. We chose to apply the Laplacian kernel
because it produces thinner edges (see Chapter 2). The “mean-edge” frame is the average of the
two edge frames. The approach of averaging all the frames in a video is feasible because query
videos contain only a few shots (in fact, the average length for query videos in TRECVID is about
one minute).
Second, it detects corner candidates in the mean-edge frame through convolution between
the frame and different ad-hoc masks (top-left, top-right, bottom-left, and bottom-right corner
masks).
Third, it searches for rectangles by joining the detected corners in valid spatial configurations according to the PIP creation process. The score of a detected rectangle is the sum of the values of its corners. A PIP is detected when the score of the best rectangle is greater than a threshold.
Figure 6.1 – PIP detection. (a) Query video from TRECVID 2010 dataset. (b) Average for each pixel. (c) Variance for each pixel. (d) The “mean-edge” frame. (e) New query with video at background. (f) New query with video at foreground.
Finally, the PIP reversion creates two new query videos: the foreground video (each frame
cropped to the detected rectangle) and the background video (each frame with the detected rectangle
filled with black pixels). The two new videos are then normalized.
6.2.2
Camcording
In TRECVID datasets, a simulated camcording transformation may be applied to query
videos. This transformation intends to simulate a person recording a projection of the video on
a big screen using a handheld camcorder. The transformation is implemented by: modifying a
video by altering the gamma correction, applying a projective transform to each frame, and giving
some random movements and distortions. Our reversion process searches for a moving wrapping
quadrilateral on the query video (see Figure 6.2).
First, each frame in the video is binarized using a near-zero threshold. The center-of-mass of the frame is calculated, which will be used as a content-relative reference point. Then, an “outer-edge” frame is created by selecting, for each column and row in the binary image, the pixels nearest to any border.
Second, with the center-of-mass and edge points, four Hough transforms are performed. Each
transform focuses on locating an edge of the quadrilateral (top, bottom, left, and right margins)
by restricting the position and slope in the parameter space. The four detected lines are then
intersected to find the four vertices of the quadrilateral. The positions of these vertices are stored
relative to the center-of-mass of the image.
Third, the four vertices are located for every frame in the video. If they are detected with
a small variance in their relative positions, then the average positions of the four vertices define the wrapping quadrilateral.
Finally, the camcording reversion creates a new query video by calculating the center-of-mass
for each frame and mapping the four vertices to the corners of the frame. The new reverted video
is then normalized.
6.2.3
Vertical flip
There is no reliable method to detect whether a query video has a flip transformation or not.
Thus, the detection process is skipped and the reversion always creates a new query video applying
a vertical mirror to each frame. The flip reversion is applied to all the query videos, including the
videos created by PIP and camcording reversions.
6.3
Summary
In this chapter we have depicted the processes performed by the Preprocessing task. This
task intends to improve the quality of each video in order to increase the effectiveness of the copy
detection. For the MUSCLE-VCD-2007 dataset we use only video normalization, while for TRECVID datasets we use both video normalization and reversion of transformations.
Figure 6.2 – Camcording detection. (a) Query video from TRECVID 2010 dataset. (b) Binarized frame. (c) The “outer-edge” frame. (d) The parameter space for one Hough transform (left edge). (e) Detected quadrilateral and the center-of-mass. (f) New query video.
The following chapter presents the Video Segmentation and Feature Extraction tasks, including an evaluation of the impact of video normalization on the detection effectiveness. Afterwards,
in Chapter 11, we analyze the impact of reversion of PIP and camcording on detection effectiveness.
Chapter 7
Video Segmentation and Feature Extraction
The Video Segmentation task creates one or more segments from every video. The Feature
Extraction task calculates one or more descriptions (global, local, acoustic) from each video segment.
This chapter reviews these two tasks and evaluates the effectiveness of different configurations of
segmentations and descriptions.
7.1
Video Segmentation
The Video Segmentation task partitions a video into groups of similar consecutive frames.
The segments are not required to be the same length. Each segment may have a representative
frame, which is used for extracting spatial descriptors.
Formally, given a video v with n frames v={f1 , ..., fn }, the segmentation task creates the set
of segments S={s1 , .., sr }, where each segment si is a set of consecutive frames si ={fj , fj+1 , fj+2 , ...}
for some starting frame fj . We restrict segmentations to contain only non-overlapping segments,
i.e., ∀f ∈ v ∃! s ∈ S, f ∈ s. Additionally, every segment s ∈ S contains a unique representative
frame fs .
In this thesis we test fixed-length segmentation and variable-length segmentation.
7.1.1
Fixed-length segmentation
The fixed-length segmentation receives a target length t (in seconds), and it produces segments of either ⌊t · fps⌋ or ⌈t · fps⌉ consecutive frames, depending on which rounding produces the smaller accumulated error. For example, the segmentation with t=1/3 seconds on a video with 25 fps produces segments with lengths {8, 9, 8, 8, 9, 8, ...} with the sequence of accumulated errors {−0.3̄, +0.3̄, 0, −0.3̄, +0.3̄, 0, ...}. The representative frame for each segment is its middle frame.
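A small sketch of this rounding rule; the function name and the greedy criterion used to choose between floor and ceil are assumptions that reproduce the example above.

    import math

    def fixed_length_segments(n_frames, fps, target_seconds):
        """Partition n_frames into segments of floor or ceil of
        target_seconds*fps frames, choosing at each step the rounding that
        keeps the accumulated error closest to zero."""
        ideal = target_seconds * fps
        lengths, error, emitted = [], 0.0, 0
        while emitted < n_frames:
            lo, hi = math.floor(ideal), math.ceil(ideal)
            length = lo if abs(error + lo - ideal) <= abs(error + hi - ideal) else hi
            length = min(length, n_frames - emitted)   # last segment may be shorter
            lengths.append(length)
            error += length - ideal
            emitted += length
        return lengths

    # Usage: a 1/3-second segmentation of a 50-frame video at 25 fps.
    print(fixed_length_segments(50, 25, 1/3))   # -> [8, 9, 8, 8, 9, 8]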
In this chapter we evaluate segmentations S1/5, S1/4, S1/3, S1/2, S1, S2, S3, S4, and S5, which are fixed-length segmentations with target length t equal to 1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, and 5 seconds, respectively.
7.1.2
Variable-length segmentation
The variable-length segmentation is based on a fixed-length segmentation. It first performs a
fixed-length segmentation and then it joins almost identical consecutive segments. The comparison
between two segments is performed by global descriptors of their representative frames, thus it uses
a feature extraction method (g, d) to determine whether two segments should be joined or not. A
segment si is joined with previous segment si−1 when the distance dsi = d(g(fsi ), g(fsi−1 )) is smaller
than a threshold. The representative frame for the new segment is the representative frame of the
segment si that achieved the minimum dsi .
Intuitively, this segmentation avoids producing many identical consecutive segments for static
scenes, while keeping a dense segmentation for dynamic scenes. Almost identical segments will likely
harm the localization of a copy due to their indistinguishable descriptors, hence this segmentation
intends to benefit from dense fixed-length segmentations on dynamic content while preventing unnecessary segmentation of static content.
We used this type of partition for our participation at TRECVID 2010.
7.2
Spatial global description
In this thesis, we use six different feature extraction methods with different parameters,
producing fourteen different spatial global descriptions. Chapter 2 gives details about the implementation of the extraction methods.
Given a video with segmentation {s1 , .., sr }, the spatial-global description is a feature extraction method (g, d) which represents the visual content of a segment si by a single vector g(fsi )
calculated from each representative frame fsi .
In order to reduce disk storage needs, if g produces vectors of floating point numbers, we perform a linear quantization to 8 bits for each value, therefore each dimension is an integer value between 0 and 255. Throughout this chapter, d is fixed to L1. Chapter 8 includes an evaluation of other
distance functions.
Edge Histogram per zone (EH) This descriptor captures the orientation of edges. The extraction method converts a frame into gray scale and partitions it into W ×H zones. Each zone is
subdivided into Mw ×Mh blocks and the edge orientation of each block is tested. The orientation is
determined by calculating the energy of n different filters, and selecting the filter whose maximum
exceeds a threshold t.
In this implementation we used Mw =Mh =8, t=5, and linear quantization to 8 bits/bin. We
test the usage of five and ten orientation filters (see Figure 4.3 on page 46). In particular, we
evaluate descriptors:
• EH4x4-10: W =H=4, ten orientation filters. It creates a vector of 160 dimensions. We used this
descriptor at TRECVID 2010 and 2011.
• EH4x4-5: W =H=4, five orientation filters, the absolute value of the energy is compared against
t. It creates a vector of 80 dimensions.
Gray Histograms per zone (GH) This descriptor captures the spatial distribution of intensities.
The extraction method converts a frame into gray scale and divides it into W ×H zones. For each
zone a histogram of m bins is calculated. We test four descriptors:
• GH1x1-180: W =H=1, m=180. It creates a vector of 180 dimensions, invariant to vertical
mirroring.
• GH1x4-32: W =1, H=4, m=32. It creates a vector of 128 dimensions, invariant to vertical
mirroring.
• GH3x3-20: W =H=3, m=20. It creates a vector of 180 dimensions. We used this descriptor at
TRECVID 2010.
• GH4x4-12: W =H=4, m=12. It creates a vector of 192 dimensions. We used this descriptor at
TRECVID 2011.
Independent-Color Histograms per zone (IH) This descriptor captures the spatial distribution of intensities for each channel. The extraction method divides a frame into W ×H zones,
and for each zone an independent histogram is calculated for each channel. The descriptor is the concatenation of the three 1-d histograms of every zone, producing m1+m2+m3 values per zone. We test
three descriptors:
• IH1x4-rgb-3x16: RGB color space, W =1, H=4, mR =mG =mB =16. It creates a vector of 192
dimensions, invariant to vertical mirroring.
• IH2x2-rgb-3x16: RGB color space, W =H=2, mR =mG =mB =16. It creates a vector of 192 dimensions. We used this descriptor at TRECVID 2010.
• IH4x4-rgb-3x4: RGB color space, W =H=4,mR =mG =mB =4. It creates a vector of 192 dimensions. We used this descriptor at TRECVID 2011.
Color Histograms per zone (CH) This descriptor captures the spatial distribution of colors.
The extraction method divides a frame into W ×H zones, and for each zone a color histogram
is calculated using the RGB or HSV color space. The 3-d color space is divided uniformly into
m1 ×m2 ×m3 blocks, and the 3-d histogram is calculated. We test three descriptors:
• CH1x1-hsv-16x4x4: W =H=1, HSV color space, mH =16, mS =mV =4. It creates a vector of 256
dimensions, invariant to vertical mirroring.
• CH2x2-hsv-16x2x2: W =H=2, HSV color space, mH =16, mS =mV =2. It creates a vector of 256
dimensions.
• CH2x2-rgb-4x4x4: W =H=2, RGB color space, mR =mG =mB =4. It creates a vector of 256 dimensions.
Reduced Keyframe (KF) This descriptor captures the spatial distribution of intensities. The
extraction method converts a frame into gray scale and scales it down to W ×H pixels. The descriptor is the vector of W · H dimensions whose values contain the pixel intensities. We test the
descriptor:
• KF11x9: W =11, H=9. It creates a vector of 99 dimensions.
Ordinal Measurement (OM) This descriptor captures the relative ordering of intensities (see
Section 2.3.1). The extraction method converts a frame into gray scale and divides it into W ×H
zones. The descriptor is a permutation of (1, ..., W · H) with the rank of each zone after sorting
them in ascending order by intensity (see Figure 2.9 on page 18). We test the descriptor:
• OM9x9: W =H=9. It creates a vector of 81 dimensions.
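A minimal sketch of this ordinal descriptor on a gray-scale frame; the zone splitting and the 1-based ranking convention are illustrative assumptions.

    import numpy as np

    def ordinal_measurement(gray_frame, w=9, h=9):
        """OM-style descriptor sketch: average intensity per zone, then the
        rank of each zone after sorting zones by intensity (1 = darkest)."""
        rows = np.array_split(gray_frame, h, axis=0)
        zones = [np.array_split(r, w, axis=1) for r in rows]
        means = np.array([block.mean() for row in zones for block in row])
        ranks = np.empty(w * h, dtype=int)
        ranks[means.argsort()] = np.arange(1, w * h + 1)   # rank per zone
        return ranks

    # Usage: a toy 90x90 gradient image; ranks grow left-to-right, top-to-bottom.
    frame = np.arange(90 * 90, dtype=float).reshape(90, 90)
    print(ordinal_measurement(frame)[:9])   # -> [1 2 3 4 5 6 7 8 9]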
7.3
Spatio-temporal global description
The spatio-temporal (s-t) description represents the whole sequence of frames in a segment
instead of only its representative frame. The s-t extraction function uses a spatial extraction function
to calculate the average of the spatial descriptors extracted for every frame in the segment. Formally,
let (h, d) be a spatial feature extraction method which returns a fixed-length vector (the basal
description), the s-t feature extraction method is the pair (g, d) where g is defined as:
g(s) = \frac{1}{|s|} \sum_{f \in s} h(f)
where s is a segment and h(f ) is either the description of frame f or the description of a segment
which contains a unique frame f .
As we will show in the experimental section, this s-t descriptor can improve the effectiveness
of the similarity search compared with its basal description. The cost of this improvement is a higher computational effort in the feature extraction task, but it does not affect the dimensionality, the
distance function, nor the disk space, hence it does not affect the similarity search task.
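A compact sketch of this averaging, using a toy gray-level histogram as the basal spatial descriptor; both helper names and the toy descriptor are illustrative assumptions.

    import numpy as np

    def spatio_temporal_descriptor(frames, spatial_extractor):
        """Average the spatial descriptor of every frame in the segment."""
        return np.mean([spatial_extractor(f) for f in frames], axis=0)

    # Usage with a toy 4-bin gray histogram as the basal spatial descriptor.
    def gray_hist4(frame):
        hist, _ = np.histogram(frame, bins=4, range=(0, 256))
        return hist / hist.sum()

    segment = [np.full((15, 20), v, dtype=np.uint8) for v in (10, 70, 200)]
    print(spatio_temporal_descriptor(segment, gray_hist4))   # -> [0.33 0.33 0. 0.33]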
In the evaluation, we test the spatio-temporal descriptor for the fourteen spatial global descriptors defined in the previous section, denoting it with a t mark. For instance, OM9x9 is calculated from the representative frame in the segment, while OMt9x9 is the average of all the OM9x9 descriptors
calculated from every frame in the segment.
In preliminary experiments we tested the effectiveness of extracting a descriptor from the
average frame of a segment. This approach achieved very poor performance, and it was outperformed
even by the spatial descriptor of the representative frame. This low performance is mainly because average frames usually contain grayed colors and blurred edges. Therefore, we preferred the average of the descriptors extracted from each frame rather than the spatial descriptor extracted from the average
frame.
7.4
Acoustic description
The acoustic description is based on Telefonica Research’s implementation of the descriptor
presented by Haitsma and Kalker [2002]. Originally, the descriptor is calculated with an FFT of the acoustic data every 10 ms over a sliding window of 32 ms. The frequency bins are then converted
into a Mel scale of 16 bands, and a 15-bit fingerprint is calculated by comparing the energies of
consecutive bands (1=increase, 0=decrease) (see Section 4.3.6).
The extraction method first resamples the audio track to a sampling rate of r Hz, mono channel, and the descriptor is calculated for each audio segment. The extraction method uses an FFT over a sliding window of size w ms. The FFT coefficients are converted into a Mel scale of m bands,
and the sum of energies of all the bands is normalized to 1. The window slides every s ms, and
the energies for each band are averaged for every window inside the audio segment. We test the
descriptor:
• AU160: audio track sampled to 8 kHz, FFT with window size w=500 ms, shift s=20 ms, m=160 bands. It creates a vector of 160 dimensions, 32 bits/dim (float). We used this descriptor at
TRECVID 2011.
Unlike the original descriptor, where the search consists of retrieving collisions, the adapted
audio descriptor can measure the degree of similarity between any two short audio signals. Note the similarity of this extraction method with the s-t global description. This property enables the distances between audio descriptors to behave roughly like the distances between global descriptors, i.e., the histograms of distances have a similar shape, which eases their combination.
In the experimental section, we test the different parameters of the audio descriptors. In
particular, we compare the effectiveness of audio descriptors with window sizes 32 ms, 200 ms and
500 ms; window shifts every 10 ms and 20 ms; and number of bands 16, 80 and 160. This descriptor
is compared using L1 distance. Chapter 8 evaluates other distance functions for this descriptor.
7.5
Spatial Local description
The local description used in this thesis is SIFT. Instead of the SIFT reference implementation [Lowe, 2005], we preferred the implementation provided by the VLFeat library [Vedaldi and
Fulkerson, 2008] due to its seamless integration with our system through its C API. We used VLFeat
with default parameters which, according to its documentation, produces keypoints and descriptors
very similar to the ones produced by the reference implementation.
The local description for a segment is the set of SIFT vectors extracted from the representative frame. To control the number of vectors, we follow the reference implementation's approach, which scales down the input image to decrease the number of detected keypoints. This approach makes the
tuning of library-dependent parameters unnecessary. We test descriptions:
• SFn: SIFT vectors extracted from the representative frame of the segment scaled down by a
factor of n, where n varies between 1 and 6.
The local description for each segment is stored as a variable-size array containing vectors
of 128 dimensions (1 byte/dim). The distance between any two segments compares their local
descriptions and returns a dissimilarity value. Given segments q and r, we use two functions to
compare them:
• Matches: It is a dissimilarity based on the number of SIFT vectors in q that match a vector in r. Following the definition of Lowe [2004], a vector in q matches a vector in r if the ratio between the distance to its closest neighbor in r and the distance to its second closest neighbor is less than s. Formally:
\mathrm{Matches}_s(q, r) = 1 - \frac{|\mathrm{Pairs}_s(q, r)|}{|SF(q)|}

\mathrm{Pairs}_s(q, r) = \left\{ (x, y) \in SF(q) \times SF(r) \;:\; \forall z \in SF(r) \setminus \{y\},\ \frac{L_2(x, y)}{L_2(x, z)} \le s \right\}
• Spatial: It is a dissimilarity based on the size of the maximum subset of vectors in q that match a vector in r and are spatially coherent:

\mathrm{Spatial}_s(q, r) = \min_{\sigma,\tau} \left\{ 1 - \frac{|S_s^{\sigma,\tau}(q, r)|}{|SF(q)|} \right\}

S_s^{\sigma,\tau}(q, r) = \{\, (x, y) \in \mathrm{Pairs}_s(q, r),\ L_2(x \cdot \sigma + \tau,\ y) \le \varepsilon \,\}
We use RANSAC to determine the maximum subset of matching vectors that satisfy the same
scale and translation (see Chapter 2).
Lowe [2004] recommends the ratio s=0.8 which selects most of the correct matches and
discards most of the incorrect ones.
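A sketch of the ratio test underlying the Matches function, over arrays of SIFT-like vectors; the random data and the helper name are assumptions, while the normalization into a dissimilarity follows the formula above.

    import numpy as np

    def ratio_test_matches(desc_q, desc_r, s=0.8):
        """Lowe-style ratio test: a query descriptor matches its nearest
        reference descriptor when the nearest/second-nearest distance ratio
        is at most s. Returns the dissimilarity 1 - |matches|/|q|."""
        matches = 0
        for x in desc_q:
            d = np.linalg.norm(desc_r - x, axis=1)     # L2 to every reference vector
            nearest, second = np.partition(d, 1)[:2]
            if second > 0 and nearest / second <= s:
                matches += 1
        return 1.0 - matches / len(desc_q)

    # Usage with random 128-d SIFT-like descriptors (purely illustrative data).
    rng = np.random.default_rng(0)
    q = rng.random((5, 128))
    r = np.vstack([q[:3] + 0.01, rng.random((10, 128))])  # 3 near-duplicates + noise
    print(ratio_test_matches(q, r))   # typically around 1 - 3/5 = 0.4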
7.6
Evaluation on MUSCLE-VCD-2007
The evaluation uses the MUSCLE-VCD-2007 dataset, with reference videos from DB-MPEG1
collection and query videos from ST1 and ST2 collections. As stated in Section 5.6, R is the set of
reference segments, Q is the set of query segments, and Qc is the set of query segments that have
a correct answer in R. In the following evaluations the similarity search is a linear scan.
7.6.1
Evaluation of global description
The evaluation consists of measuring the frame-to-frame effectiveness and the copy detection
effectiveness. The frame-to-frame effectiveness is measured by the MAP and the copy detection
effectiveness is measured by the amount of correct detections without false alarms.
The first experiment measures the basal effectiveness of each global descriptor calculated
from the representative frame of the segment. These results correspond to the baseline for the
following experiments. The second experiment measures the impact of preprocessing videos before
calculating descriptors. Finally, the third experiment evaluates the impact of the s-t description on
preprocessed videos.
Each query video and reference video is partitioned into segments of fixed length {1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5} seconds, and the effectiveness is measured for each segmentation length. Table 7.1 shows the number
of segments produced by each segmentation length. Because all the videos in the MUSCLE-VCD-2007 dataset have 25 fps, each segment contains between 5 frames and 125 frames.

                    S1/5    S1/4    S1/3    S1/2     S1     S2     S3     S4     S5
Frames/segment†        5    6.25    8.33    12.5     25     50     75    100    125
Collection size (thousands of segments)
  Q                 69.7    55.7    41.8    27.9   13.9    7.0    4.7    3.5    2.8
  Qc                49.9    39.9    29.9    20.0   10.0    5.0    3.3    2.5    2.0
  R              1,057.2   845.8   634.4   422.9  211.5  105.8   70.5   52.9   42.3

Table 7.1 – Number of query and reference segments for each fixed-length segmentation.
† Fractional frames are resolved as described in Section 7.1.
To simplify the comparison we use the same L1 distance (Manhattan distance) to compare
any two descriptors:
L_1((x_1, ..., x_n), (y_1, ..., y_n)) = \sum_{i=1}^{n} |x_i - y_i|
Although we could use a different metric for comparing each descriptor, we decided to use
only L1 in order to simplify the analysis. Moreover, L1 is fast to evaluate, satisfies the metric
properties, and achieves satisfactory effectiveness for all tested descriptors. In Chapter 8, we test
the behavior of other distance functions.
Among the fourteen evaluated descriptors, only GH1x1-180, GH1x4-32, IH1x4-rgb-3x16, and CH1x1-hsv-16x4x4
are invariant to vertical mirroring.
Effectiveness of spatial description
This experiment measures the effectiveness of each global descriptor, when it is calculated
from the middle frame of the segment. Figure 7.1 summarizes the effectiveness achieved by the
fourteen spatial global descriptors for each segment size. In the case of frame-to-frame effectiveness,
the MAP for every descriptor and segmentation varies between 0.2 and 0.5. Top performers are
descriptors GH1x4-32 at S1/5 ; GH3x3-20 at S1/4 , S1/3 , S1/2 , and S1; and IH4x4-rgb-3x4 at S2, S3 (overall
maximum with MAP=0.49), S4, and S5. On average, descriptors GH3x3-20 and IH1x4-rgb-3x16 are the
top performers with an average MAP of 0.45 and 0.44 respectively, for the nine tested segmentations. On the other hand, OM9x9 is the descriptor that achieves the worst performance for every
segmentation.
Analyzing the behavior with respect to segmentation, the frame-to-frame effectiveness increases slightly with coarser segmentations. On one hand, a coarse segmentation reduces the number of objects in Q and R, which implies faster similarity searches. For instance, doubling the length of segments halves both the number of queries and the size of the reference collection, resulting in searches that can be up to four times faster. On the other hand, a coarse segmentation may affect the performance of the Copy Localization task due to the decrease in the number of query segments that match their original segment. For example, a copied excerpt of 4 seconds with S1/5 may match up to 20 consecutive query segments with their respective reference segments, but with S2 it may match
at most two query and reference segments.
The copy detection effectiveness shows the performance of the Copy Localization task for the
different segmentations and descriptors. The graph shows the number of correct detections before
the first false alarm is reported. The copies are determined by locating chains of reference segments among the first nearest neighbors of the query segments. The localization task is described in
Chapter 10.
According to the ground-truth, MUSCLE-VCD-2007 has 31 copied excerpts. The top overall
performance is achieved by EH4x4-10 with up to 27 correct detections without false alarms at S1/4 ,
while the worst performers are GH1x1-180 (S1/5 , S1/4 , S1/3 , S1/2 , and S1) and OM9x9 (S2, S3, S4, and
S5). The overall results show a clear inverse correlation between segmentation length and detection
performance: the average number of detections for the evaluated descriptors at S1/5 is 20.4, and it
gradually decreases to 13.8 for S5. Therefore, even though a coarse segmentation may achieve a
high frame-to-frame matching, the detection performance decreases due to the reduction of voter
segments. Dense segmentations can achieve higher detection performance, however it should be
noted that more voters may also produce false alarms with a higher score.
Effectiveness of video normalization
In this experiment we evaluate the influence of the preprocessing task on the effectiveness.
Each video on the query and reference datasets is first preprocessed by the quality normalization
(described in Section 6.1), i.e., outliers and plain frames are skipped and black borders are cropped.
This experiment does not include any reversion, thus the sizes of Q, Qc , and R do not vary. Figure 7.2 shows the effectiveness achieved by each descriptor when the videos have been preprocessed.
In the case of frame-to-frame effectiveness, for every descriptor and segmentation, the video normalization increases the effectiveness by an average of 16%. The effect of the quality normalization is larger at dense segmentations (an average increase of 28% at S1/5) than at coarse
segmentations (an average increase of 8% at S4). Descriptor KF11x9 achieves the largest increase
(its MAP increased 84% on average for the nine segmentations), while descriptor CH2x2-hsv-16x2x2
was almost unaffected (2% average increase). The top performances are now achieved by EH4x4-10
at S1/5 , S1/4 , S1/3 , S1, S3 (overall maximum with MAP=0.54), and S5; IH1x4-rgb-3x16 at S1/2 , and
S4; and IH4x4-rgb-3x4 at S2.
In the case of detection effectiveness, the largest increase is achieved by KF11x9 with an
average of 20%. The top overall performance is achieved by EH4x4-10 with 28 correct detections at
S1/5 , S1/4 and S1/3 , while the worst performers are again GH1x1-180 and OM9x9. In fact, OM9x9 is the
only descriptor that mostly decreased its detection performance.
In summary, the video preprocessing increases the frame-to-frame effectiveness by an average
of 16% for all descriptors and segmentations (ranging from 123% for KF11x9 at S1/5 to -11% for OM9x9
at S4) and increases the detection effectiveness by an average of 9% (ranging from 45% for KF11x9
at S5 to -38% for OM9x9 at S1).
Effectiveness of spatio-temporal description
This experiment tests the performance of the s-t descriptor. Figure 7.3 shows the effectiveness
achieved by each descriptor when the videos have been normalized and the descriptors are extracted
and averaged for every frame in the segment.
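As a small illustration, the following sketch shows this averaging step; frame_descriptor is a placeholder for any of the fourteen global descriptors, not a function defined in the thesis.

    # Sketch of the spatio-temporal (s-t) description used here: the spatial descriptor
    # is computed for every frame of the segment and the per-frame vectors are averaged.
    import numpy as np

    def st_descriptor(frames, frame_descriptor):
        """Average the spatial descriptor over all frames of one segment."""
        vectors = np.stack([frame_descriptor(f) for f in frames])
        return vectors.mean(axis=0)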
[Figure: frame-to-frame effectiveness (MAP) and copy detection effectiveness (detections without false alarms) versus segmentation S1/5–S5 for the fourteen spatial global descriptors.]
Figure 7.1 – Effectiveness of the fourteen spatial descriptors for different segmentation lengths without video preprocessing.
[Figure: frame-to-frame effectiveness (MAP) and copy detection effectiveness (detections without false alarms) versus segmentation S1/5–S5 for the fourteen spatial global descriptors, with video preprocessing.]
Figure 7.2 – Effectiveness of the fourteen spatial descriptors for preprocessed collections.
[Figure: frame-to-frame effectiveness (MAP) and copy detection effectiveness (detections without false alarms) versus segmentation S1/5–S5 for the fourteen spatio-temporal global descriptors, with video preprocessing.]
Figure 7.3 – Effectiveness of the fourteen spatio-temporal descriptors for preprocessed collections.
In the case of frame-to-frame effectiveness, the global increase of 22% in MAP compared
with the previous experiment shows that the s-t description can improve the effectiveness of spatial
descriptors. In particular, the OMt 9x9 achieves the highest increase in MAP (an average increase
of 66% for the nine segmentations). On the other hand, CHt 2x2-hsv-16x2x2 was almost unaffected (an
average increase of 2%). The top performer for every segmentation is the EHt 4x4-10 with an overall
maximum of MAP=0.78 at S3. The worst performances are achieved by OMt 9x9, CHt 2x2-hsv-16x2x2
and GHt 1x1-180.
In the case of detection effectiveness, the performance increases by 10% on average for every
descriptor and segmentation compared with the previous experiment. The largest average increase
is for OMt 9x9 with 38%; however, it is still the worst performer. The top performers are EHt 4x4-10
with 29 detections between S1/5 and S1/2 , and also EHt 4x4-5 with 29 detections, but between S1/5
and S1/3 .
In summary, the spatio-temporal description increases the frame-to-frame effectiveness by
an average of 22% for all descriptors and segmentations (ranging from 106% for OMt 9x9 at S5 to
-2% for IHt 2x2-rgb-3x16 at S1/5 ) and increases the detection effectiveness by an average of 10% (ranging
from 89% for OMt 9x9 at S2 to -13% for GHt 1x1-180 at S3), when they are compared with the previous
experiment.
Discussion
Analyzing the copy detection results for the s-t description, the overall top performers are the two EH descriptors, with the 10-orientation histogram outperforming the 5-orientation one. This reveals that the orientation of edges is the feature that best represents the visual information for CBVCD, and also that it is more robust to content transformations.
An interesting result is that EHt 4x4-10 at S1/2 correctly detects 29 of 31 copies without producing any false alarm, i.e., it achieves recall 0.94 at precision 1. In particular, it achieves recall 1 for ST1 and recall 0.90 for ST2. Despite EHt 4x4-10 not being invariant to mirroring, the system can detect the flipped copy from ST1 by just matching the almost symmetrical frames. This result outperforms most of the state-of-the-art systems evaluated with the MUSCLE-VCD-2007 dataset, like a complex system based on local descriptors [Poullot et al., 2008] and a system combining visual with audio descriptors [Anguera et al., 2009b]. Moreover, EHt 4x4-5 is the descriptor with the lowest dimensionality (80-d) and it also achieves this detection performance at S1/3.
The next best performer is KFt 11x9. In the baseline, this descriptor does not achieve
very good results, but the two processes (quality normalization and s-t description) highly improved
its detection performance. In fact, the KF is a really naive descriptor (it is just a reduction of the
frame), thus its good detection performance is somewhat surprising given its low dimensionality
(99-d) compared with color histograms.
The remaining best performer is IHt 4x4-rgb-3x4. This descriptor (with 192-d) outperforms CHt 2x2-rgb-4x4x4 (with 256-d) mainly because IH uses more zones, which in turn is due to IH needing fewer dimensions to represent a zone (just 12-d per zone). This issue is more evident if we compare the results for two descriptors using the same number of zones: CHt 2x2-rgb-4x4x4 slightly outperforms IHt 2x2-rgb-3x16. The IH descriptor calculates independent histograms for each channel, therefore it does not capture the actual colors in a zone. For instance, an IH descriptor with high values for the R channel may be created by zones with colors red, yellow, magenta, or white. This problem can be reduced by using a finer zoning, which turns out to be more relevant for locating duplicates. Note also that this simplified color model is able to outperform GHt 4x4-12, which uses the same number of dimensions per zone.
Comparing color spaces, the performances of CHt 2x2-rgb-4x4x4 and CHt 2x2-hsv-16x2x2 are mostly
similar (both use 64-d per zone) with a slight gain for the RGB color space. Therefore, for locating
copies, a color descriptor based on RGB color space can show similar and even better performance
than HSV color space for the same number of dimensions.
Among the descriptors invariant to vertical mirroring, IHt 1x4-rgb-3x16 achieves the best performance. Moreover, it also outperforms IHt 2x2-rgb-3x16, showing that the division into horizontal slices can be a better choice for datasets with vertical mirroring. However, it was outperformed by IHt 4x4-rgb-3x4, which has the same dimensionality; thus a good strategy for designing a descriptor is to use more zones, but given a fixed number of zones, the horizontal division is better.
The OMt 9x9 is the descriptor that achieves the worst performance for every segmentation.
This result might seem contradictory with the evaluation presented by Kim and Vasudev [2005],
however their work focused on detecting low-intensity quality transformations (i.e., video duplicates). Ordinal Measurement is highly affected by partial changes that can convert a dark zone into
a bright one (like the insertion of logos), changing the ordering of zones and affecting (and ruining)
the whole descriptor. This unsatisfactory performance of OM in CBVCD is also verified by Law-To
et al. [2007a]. The other evaluated descriptors also divide the frames, but unlike OM, a change in
one zone does not affect the values for the other zones.
With respect to the two tested improvements, the video normalization increases the frame-to-frame and detection performance by an average of 16% and 9%, respectively, for every descriptor and
segmentation. The s-t description also increases them by an average of 22% and 10% compared with
the video normalization alone. Compared to the baseline, these two combined processes increase the
frame-to-frame effectiveness by 39% on average, and the detection effectiveness by 20%. A synergy
exists between these two processes given that the quality normalization removes noisy and outlier
frames which may harm the s-t descriptor.
7.6.2 Effectiveness of acoustic description
This experiment measures the effectiveness of the acoustic description using different values
for parameters m (number of bands), w (window size), and s (window shift). In particular, we
compare the effectiveness of audio descriptors using three numbers of bands (16, 80 and 160), three
window sizes (32 ms, 200 ms and 500 ms), and two window shifts (10 ms and 20 ms). As in
the previous experiments, we perform the comparison for the frame-to-frame matching and copy
detection performance using the nine segmentations (S1/5 to S5). Figure 7.4 shows the results.
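As an illustration only, the following sketch computes one plausible band-energy descriptor driven by the parameters m, w, and s discussed here; the exact extraction used by the thesis's acoustic descriptor is the one defined in its corresponding section and may differ in details such as the filter bank, scaling, or normalization.

    # A hedged sketch of a band-energy extraction parameterized by m bands,
    # window size w (ms) and window shift s (ms).
    import numpy as np

    def band_energy_frames(signal: np.ndarray, rate: int, m: int, w_ms: int, s_ms: int):
        """Split `signal` into windows of w_ms shifted by s_ms and return, for each
        window, the energy of m contiguous frequency bands (one m-d vector)."""
        win = int(rate * w_ms / 1000)
        hop = int(rate * s_ms / 1000)
        frames = []
        for start in range(0, len(signal) - win + 1, hop):
            spectrum = np.abs(np.fft.rfft(signal[start:start + win])) ** 2
            bands = np.array_split(spectrum, m)          # m contiguous bands
            frames.append(np.array([b.sum() for b in bands]))
        return np.stack(frames) if frames else np.empty((0, m))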
In the case of frame-to-frame effectiveness, the performance is highly dependent on the number of bands m. On average, the MAP increases by 24% between 16 and 80 bands, and by 28%
between 16 and 160 bands. However, the increase is reduced for the densest segmentation (S1/5 ): 4%
between 16 and 80 bands, and 6% between 16 and 160 bands. A larger window size also increases
the effectiveness. On average, the MAP increases by 7% from 32 ms to 200 ms, and by 11% from
32 ms to 500 ms. Finally, the shift s does not have a clear impact as the MAP for 10 ms and 20
ms are almost similar. The overall maximum MAP 0.84 was achieved by m=160 and w=500 ms at
S1/5 .
In the case of detection effectiveness, a larger m implies more detections.
[Figure: frame-to-frame effectiveness (MAP) and copy detection effectiveness (detections without false alarms) versus segmentation S1/5–S5 for the acoustic descriptor with m ∈ {16, 80, 160} bands, w ∈ {32, 200, 500} ms, and s ∈ {10, 20} ms.]
Figure 7.4 – Effectiveness of different parameters for the acoustic descriptor.
                               SF1      SF2      SF3     SF4     SF5     SF6
Input frame size           352×288  176×144   117×96   88×72   70×58   59×48
Average vectors/segment:
  Query                      498.8    151.1     75.5    46.4    31.4    23.1
  Reference                  579.3    155.4     78.5    48.9    33.3    24.6
Average number of pairs    288,945   23,475    5,927   2,269   1,047     569

Table 7.2 – Average number of vectors per segment, size of frame from which vectors are extracted, and number of pairs between query and reference vectors.
On average, the number of detections increases by 15.6% between 16 and 80 bands, and by 16.2% between 16 and
160 bands. However, there is almost no improvement between using 80 and 160 bands. A larger
window size increases the effectiveness by 18% for 16 bands between 32 ms and 200 ms, but it
has almost no influence for 80 and 160 bands. Finally, the shift s does not have a clear impact
on descriptor effectiveness. The overall maximum is 29 correct detections without false alarms
(out of 31) achieved between S1/5 and S1 for m≥80 and w≥200 ms. As discussed in a previous
experiment, this detection performance (recall 0.94 at precision 1) outperforms most of the CBVCD
systems evaluated with this dataset, in particular a system based on audio descriptors [Anguera
et al., 2009b].
The most relevant parameter for the acoustic descriptor is the number of bands: 16 bands
are too few, but 80 and 160 are enough to represent the audio. The two copy excerpts that could
not be detected are a camcording transformation in which the audio track is almost inaudible, and
a copy that completely lacks the audio track. The almost inaudible track may become detectable
by increasing the base audio sampling rate (instead of 8-kHz), or with some audio preprocessing
(like a volume normalization).
In summary, this experiment shows that the metric space approach can also be applied to acoustic descriptors. In fact, the similarity search using the proposed acoustic descriptor shows behavior and performance similar to the similarity search using visual descriptors. Therefore, it seems feasible to design a distance function that successfully combines acoustic and visual descriptors. We explore this idea in the next chapter.
7.6.3 Evaluation of local descriptors
The next experiment measures the base effectiveness of local description. We perform the
evaluation for frame-to-frame matching and copy detection performance using different frame resolutions and segmentations in preprocessed videos.
Table 7.2 summarizes the average number of SIFT vectors extracted from the representative
frame of each segment. On average, query segments have fewer vectors than reference segments due to the effect of visual transformations. The number of pairs between query and reference vectors indicates the expected number of evaluations of L2 needed to measure the dissimilarity between any two segments. This number is directly proportional to the extra time spent by a search based on local description compared with a search based on global description (which requires just one distance evaluation to compare two segments). For instance, a search using SF1 will be about five orders of magnitude slower than a search using a global description (even without considering any spatial coherence process). This is an unaffordable amount of time, hence in the following experiment we test the effectiveness of SF4, SF5, and SF6 for segmentations S2, S3, S4, and S5.
[Figure: frame-to-frame effectiveness (MAP) and copy detection effectiveness (detections without false alarms) versus segmentation S2–S5 for local descriptions SF4, SF5, and SF6 with distances Matches (s=0.7) and Spatial (s=0.8).]
Figure 7.5 – Effectiveness of Matches and Spatial for local descriptions SF4, SF5, and SF6.
The number of segments produced by each segmentation is already shown in Table 7.1. The search is a linear scan comparing each query segment to every reference segment using the distances Matches and Spatial, described in Section 7.5.
Figure 7.5 shows the effectiveness achieved by each local description and distance when the
videos have been normalized and local description calculated from the middle frame in the segment.
We compare distances Spatial with s=0.8, and Matches with a stricter ratio s=0.7, due to its lack
of spatial coherence.
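As an illustration of the role of the ratio s, the following sketch counts, for one pair of frames, the query SIFT vectors whose nearest reference vector is clearly closer than the second nearest; the exact definitions of Matches and Spatial (including how the count becomes a dissimilarity and the spatial coherence step) are those of Section 7.5.

    # Sketch of a Lowe-style ratio test between the SIFT vectors of two frames.
    import numpy as np

    def count_ratio_matches(q_vectors: np.ndarray, r_vectors: np.ndarray, s: float = 0.7) -> int:
        """q_vectors: (nq, 128) SIFT vectors of a query frame.
        r_vectors: (nr, 128) SIFT vectors of a reference frame."""
        matches = 0
        for v in q_vectors:
            d = np.linalg.norm(r_vectors - v, axis=1)   # L2 to every reference vector
            if len(d) < 2:
                continue
            first, second = np.partition(d, 1)[:2]       # two smallest distances
            if first <= s * second:                      # accept only clear matches
                matches += 1
        return matches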
The MAP is directly affected by the size of the frames: SF4 outperforms SF5, which in turn
outperforms SF6. In the case of distances, Matches consistently outperforms Spatial. This behavior
is due to the reduction of the frames: Spatial misses frames when the number of matching vectors is
not enough to perform a spatial coherence process. The gap between Matches and Spatial increases
when there are fewer vectors per frame, i.e., the gap using SF6 is larger than using SF4.
In detection effectiveness, there is also a gap between Matches and Spatial, but it is reduced
with denser segmentations and larger frames. In fact, Spatial outperforms Matches from S3 using
SF4, and from S2 using SF5.
Comparing these results with the performance of a spatial global description (Figure 7.2),
SF4 achieves a relatively similar MAP as global descriptions at the same segmentation, but achieves
higher detection effectiveness. In fact, SF4 detects 28 of 31 copies at S2, but EH4x4-10 requires S1/3
or denser to achieve the same result. On the other hand, the search using SF4 on S2 is expected to
be about fifty times slower than the search using EH4x4-10 on S1/3 .
SIFT vectors capture the orientations of the gradient, which is similar to the information captured by the EH descriptor. Therefore, the high detection performance achieved
by SIFT vectors is consistent with our previous conclusion that the orientation of edges is the feature
that best represents the visual information for CBVCD.
To reduce the search time, the approach of the visual codebook is widely preferred. The
codebook approach moves the cost associated with the large number of local vectors from the online
phase to the offline phase. The description of each frame only uses a set of representative vectors,
which are calculated by clustering local vectors. The search becomes a simple detection of collisions
(which is very fast compared to Matches because it does not require any distance evaluation), but
it requires a spatial coherence process (like the one in Spatial) to reduce the false positives. In
Chapter 9 we present an alternative approach for reducing the search time for local description
based on approximate searches between local vectors.
7.7 Summary
In this chapter we have presented and evaluated the effectiveness of different descriptions
under different segmentations.
In the case of global description, we evaluate fourteen different descriptors under nine segmentations. We also test the impact of the video normalization process and the spatio-temporal
description. Both processes highly improved the performance of every descriptor. These increases
convert rather simple global descriptors into surprisingly well-performing descriptors that can compete with state-of-the-art systems based on local descriptors.
In the case of acoustic description, we present an acoustic descriptor designed for using the
metric space approach. We tested its parameters and we showed that it can achieve high detection
performance.
In the case of local description, we evaluate SIFT vectors extracted from a reduced frame. The
experiments show that local description can achieve a high performance, but its major drawback
is the increase in search time. To reduce the search time, the approach of the visual codebook is
usually preferred, however in Chapter 9 we present an alternative approach for reducing the search
time.
In particular, the configurations that achieved high detection effectiveness in the experimental section are:
• Global descriptions EHt 4x4-10, KFt 11x9, and IHt 4x4-rgb-3x4, with segmentation S1/3 or S1/2 , and
distance L1 .
• Acoustic description AU with parameters m=80 bands, w=500 ms, and s=20 ms, with segmentation S1, and distance L1 .
• Local description SF4, with segmentation S3, and distance Matchess=0.7 .
In the following chapter we focus on improving the effectiveness of these descriptors using
their combination and using different distance functions. Afterwards, in Chapter 9 we focus on
improving the efficiency of the search using an index structure and performing approximate searches
instead of linear scans.
Chapter 8
Improving Effectiveness in the Similarity Search
The objective of the Similarity Search is to perform NN+range searches to retrieve the k
most similar reference segments at a distance threshold ε for each segment q ∈ Q according to a
distance function d.
This chapter focuses on improving the effectiveness of the search by using different definitions
of d. In particular, we test: the linear combination of distances, the spatio-temporal combination
of distances, and the use of non-metric distances. For these approaches we assume the descriptors
are vectors with known dimensionality that represent a whole segment.
For this chapter, the similarity search is implemented as a linear scan. The next chapter focuses on improving the efficiency of the search.
8.1 Spatial-Combined Distance
Let m be the number of descriptors for each segment, let gi(s) be the i-th descriptor of segment s, and let di be the distance for comparing the i-th descriptor, i ∈ {1, ..., m}.

Definition 1 (Spatial-Combined Distance) The spatial-combined distance γ between two segments q and r is defined as a weighted combination of the distances between their descriptors:

γ(q, r) = Σ_{i=1}^{m} wi · di(gi(q), gi(r))        (8.1)

where wi ∈ [0, 1] and Σ_{i=1}^{m} wi = 1.
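A minimal sketch of Equation (8.1), assuming the per-segment descriptors and their distances are already available:

    # Sketch of the spatial-combined distance: a weighted sum of per-descriptor distances.
    import numpy as np

    def gamma(q_descs, r_descs, distances, weights):
        """q_descs, r_descs: lists of m descriptor vectors (one per description).
        distances: list of m distance functions d_i.
        weights:   list of m weights w_i in [0, 1] summing to 1."""
        return sum(w * d(x, y) for w, d, x, y in zip(weights, distances, q_descs, r_descs))

    # Example with two descriptors, both compared by L1:
    l1 = lambda x, y: float(np.sum(np.abs(x - y)))
    q = [np.array([1.0, 2.0]), np.array([0.5, 0.5, 0.0])]
    r = [np.array([1.5, 2.5]), np.array([0.0, 0.5, 0.5])]
    print(gamma(q, r, [l1, l1], [0.7, 0.3]))  # 0.7*1.0 + 0.3*1.0 = 1.0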
If the underlying distances in γ satisfy the metric properties and the weights are static, then
the spatial-combined distance will also satisfy the metric properties (see Section 3.3.2).
[Figure: (a) histograms of distances of two functions d1 and d2 with their maximum distances M1 and M2; (b) the same histograms after scaling each distance by its maximum.]
Figure 8.1 – Normalization by maximum distances on two functions d1 and d2.
8.1.1 α-Normalization of distances
The first issue that each weight wi should solve is that of scaling the values returned by each
underlying distance di to a range in which they are comparable. These weights are usually fixed to
normalize each distance by its maximum value.
The normalization by maximum distance sets weights wi = 1/Mi, where Mi is the maximum
distance that di returns. This normalization scales all distances to a bounded value in the range [0,1]
in order to enable their combination. However, this approach does not reflect the distribution of
values for each function, i.e., for some functions the distance threshold 0.5 could be a very permissive
value (a range search selects many objects) while for others it could be very restrictive (a range
search selects just a few objects). An example is shown in Figure 8.1.
To address this issue, we set each weight wi using the histogram of distances of di . Because
the area of the histogram of distances is normalized to 1, the histogram can be seen as a probability
distribution of di . Then, we define the cumulative distribution in a similar way as probabilities.
Definition 2 (Cumulative Distribution) Let d be a distance function, and let Hd be its histogram of distances. The Cumulative Distribution of Distances Fd : R+ → [0, 1] is defined as:
Fd(x) = ∫_0^x Hd(t) dt

Additionally, Fd⁻¹ : [0, 1] → R+ is the inverse function of Fd.
We state that two distance values are comparable when they have the same selectiveness
according to their respective distributions, i.e., for two distance functions d1 and d2 the values d1 (x)
and d2 (y) are comparable when Fd1 (x) ≈ Fd2 (y).
With the objective of scaling distance functions for making their values comparable for the
nearest neighbors, we define the α-normalization.
Definition 3 (α-Normalization) Let d be a distance function, and let α be a number in (0,1],
the α-normalization of d defines the normalized distance dα as:
dα(x, y) = (1/τα,d) · d(x, y) ,  where τα,d = Fd⁻¹(α).

[Figure: (a) histograms of distances of two functions d1 and d2 with the distance threshold τα marked on each; (b) the same histograms after scaling each distance by 1/τα.]
Figure 8.2 – The α-normalization of two functions d1 and d2.
Because γ is used to perform NN+range searches, all of its underlying distances should be α-normalized with a small α. Then, the smaller values of the underlying distances will be comparable between them. As a general rule, a value between 0.1 and 1/|R| is usually good enough. Note that α=1 implies the normalization by maximum distance. Note also that α-normalization with the same α for different distance functions should imply a different weight for each one. Figure 8.2 shows an example of the α-normalization of two distances.
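A sketch of this normalization, where τα,d is estimated as the empirical α-quantile of distances sampled between random query and reference objects:

    # Sketch of the alpha-normalization: estimate tau as an empirical quantile of the
    # distance distribution, then scale the distance by 1/tau.
    import numpy as np

    def estimate_tau(d, queries, references, alpha, n_samples=10000, seed=0):
        """Empirical F_d^{-1}(alpha) from sampled distance values."""
        rng = np.random.default_rng(seed)
        qi = rng.integers(0, len(queries), n_samples)
        ri = rng.integers(0, len(references), n_samples)
        samples = np.array([d(queries[i], references[j]) for i, j in zip(qi, ri)])
        return float(np.quantile(samples, alpha))

    def alpha_normalized(d, tau):
        """Return the normalized distance d_alpha(x, y) = d(x, y) / tau."""
        return lambda x, y: d(x, y) / tau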
The α-normalization of the underlying distances of γ tries to select weights that scale the
distributions in order to make the smaller distances comparable between them. Because histograms
of distances may have different shapes and slopes, the distances become comparable only in a
neighborhood of τα . To make underlying distances comparable in their whole range, a variable
weight is needed. However, in that case, the triangle inequality will not be satisfied.
Even when underlying distances are comparable, depending on the actual descriptors and
the dataset, it may be better to increase or decrease some weight on the combination. Hence, the
second issue that weights should support for improving effectiveness is to give more importance to
better descriptors. The following sections present two weighting strategies.
8.1.2 Weighting by Max-ρ
The motivation for this algorithm comes from our experimentation regarding efficiency and
pivots. In general, we have noticed that very simple descriptors which produce search spaces with
low intrinsic dimensionality (i.e., the search space can easily be indexed with a few pivots) usually
yield low effectiveness. However, as we use more complex descriptors and/or distances, the increase
in the effectiveness is usually bound to an increase in the intrinsic dimensionality (pivots become
less efficient in the search). This tradeoff between effectiveness and efficiency motivated us to define
a weighting algorithm that tries to improve the effectiveness by decreasing the efficiency.
Definition 4 (Weighting by Max-ρ) Let γ be a convex combination of the α-normalized distances {dα1, ..., dαm}; the Weighting by Max-ρ selects the weights {w1, ..., wm} that maximize the intrinsic dimensionality of γ.
In other words, given the histogram of distances Hγ , the weighting by max-ρ selects the
weights that solve:
max_{w1,...,wm}  µ(Hγ)² / σ²(Hγ)
A search space with high intrinsic dimensionality clearly does not imply a high effectiveness.
This can be proved with a distance that returns a constant value: its intrinsic dimensionality will
be infinity but its effectiveness will be zero. On the other hand, a perfectly discriminant distance
which gives a small distance to correct matches and a large distance to every irrelevant object would
achieve the maximum effectiveness and a near-infinite intrinsic dimensionality.
Despite these counterexamples, the weighting by max-ρ assumes that if we start from a point
(a set of weights) with medium effectiveness, then a close point that decreases the efficiency should
imply that it is a point where effectiveness increases. There may be sets of weights that result in
better effectiveness and in lower intrinsic dimensionality, but to recognize them an evaluation of
effectiveness would be required.
8.1.3 Weighting by Max-τ
The similarity search retrieves only the nearest objects to the query, discarding the highest
distances. The weighting by max-ρ uses an indicator which is influenced by the whole distribution
of distances, but the effectiveness is only influenced by the smaller distances. This issue motivates
the definition of the weighting by max-τ , which tries to improve effectiveness by using an indicator
that depends only on the distribution of the smaller distances.
The α-normalization scales each underlying distance di with a weight such that, for randomly selected objects x, y, P[dαi(x, y) ≤ 1] = α. Because γ is a convex combination of α-normalized distances, it might be expected that for randomly selected objects x, y, P[γ(x, y) ≤ 1] = α. If all the underlying distances were independent, this statement would be true. However, in general, this statement is not true, because the same pair (x, y) chosen to evaluate γ is used to evaluate each of its underlying distances. If descriptors i and j are highly correlated, then the conditional probability increases: P[dαj(x, y) ≤ 1 | dαi(x, y) ≤ 1] > α. For example, assume dα1 compares segments by their RGB histogram and dα2 compares them by their HSV histogram. When two segments x and y are similar according to their RGB histogram (dα1(x, y) ≤ 1), the probability of their being similar according to their HSV histogram increases (P[dα2(x, y) ≤ 1] > α). In that case, γ(x, y) will be less than 1 with a probability higher than α, hence the value of τα,γ (i.e., the value that α-normalizes γ) will be smaller than 1. Due to this issue, we define the following weighting method.
Definition 5 (Weighting by Max-τ) Let γ be a convex combination of the α-normalized distances {dα1, ..., dαm}; the Weighting by Max-τ selects the weights {w1, ..., wm} that maximize τα,γ, where τα,γ = Fγ⁻¹(α).
In other words, the weighting by max-τ maximizes the value that α-normalizes γ with the
objective of favoring uncorrelated descriptors.
8.1.4 Discussion and implementation details
The weighting by max-ρ favors distances that produce values globally closer to µ (in particular the small values). The maximization of τ favors distances whose small values are larger.
Graphically, this can be seen as choosing a function whose histogram of distances is sharper in the
near zero zone. The intuition behind these two algorithms is to choose functions that increase the
isolation of the nearest neighbors by spreading the small distances over a wider range.
For the maximization of ρ or τα,γ, the Newton-Raphson method can be used. However, we use a simpler approach: given an indicator (ρ or τα,γ), each weight wi is initialized to 1/m (for m underlying distances), and then wi is iteratively replaced by wi ± ε if that change increases the desired indicator; the process ends when every weight has been tested and none updated.
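A sketch of this iterative search, assuming an indicator(weights) helper that rebuilds the histogram of γ for the given weights and returns ρ or τα,γ (the helper is not defined in the thesis); the renormalization step keeps the combination convex:

    # Sketch of the coordinate-wise weight search: perturb one weight at a time and
    # keep the change only when the chosen indicator increases.
    def optimize_weights(m, indicator, eps=0.05):
        weights = [1.0 / m] * m
        best = indicator(weights)
        improved = True
        while improved:
            improved = False
            for i in range(m):
                for delta in (+eps, -eps):
                    trial = list(weights)
                    trial[i] = min(1.0, max(0.0, trial[i] + delta))
                    total = sum(trial)
                    trial = [w / total for w in trial]   # keep the weights summing to 1
                    value = indicator(trial)
                    if value > best:
                        weights, best, improved = trial, value, True
        return weights, best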
We should stress that these algorithms depend on the histogram of distances which is created
by sampling pairs of objects (not evaluating every possible pair). Depending on the distance function, the evaluation of a statistically-relevant number of distances should take from a few seconds
up to a few minutes. Thus, the computational time required for the automatic normalization and
weighting processes is practically negligible when it is compared to the computational time required
for the similarity searches.
The experimental section shows that the values of ρ and τα,γ for varying weights produce
smooth surfaces, thus the iterative search can have two phases: a first phase, with a large ε and
histograms with fewer samples, and a second phase with finer ε and more precise histograms.
Because the histogram of distances can be created from any distance function, these algorithms can
(a priori) work with any kind of descriptor and distance function. However, there is an assumption
of continuity for ρ and τα,γ , thus the underlying distances should produce smooth histograms like
the ones produced by the Minkowski distances.
In practice, each underlying distance has two associated weights: an internal weight from
its α-normalization, and an external weight from the weighting process of γ. Once the internal and
external weights have been selected, they can be multiplied to fix the final weight of each underlying
distance.
The two proposed weighting algorithms are just heuristics to select a good set of weights
based on statistical indicators. However, these algorithms do not guarantee the quality of the set
of weights because they do not perform any effectiveness evaluation. Sometimes the evaluation of
the effectiveness is difficult because it requires the definition of proper methodology and indicators,
and in some cases, it requires that people be hired to use the system and fill out evaluation forms.
In those cases, these algorithms can offer a reasonably good set of weights –at least better than the
normalization by maximum distance– to perform a combined search.
8.1.5 Evaluation
The evaluation uses the MUSCLE-VCD-2007 dataset, with reference videos from DB-MPEG1
collection and query videos from ST1 and ST2 collections. As stated in Section 5.6, R is the set
of reference segments, Q is the set of query segments, and Qc is the set of query segments that
have a correct answer in R. Each reference and query video was preprocessed and partitioned
into segments of one second length (S1), thus |R|=211,479, |Q|=13,942 segments and |Qc |=9,983
segments.
Name   Segm.   d    Descriptor          ρ      MAP
EH     S1      L1   EHt 4x4-10          9.59   0.664
IH     S1      L1   IHt 1x4-rgb-3x16    8.58   0.501
KF     S1      L1   KFt 11x9            4.03   0.510

Table 8.1 – Base effectiveness of the three configurations to be combined.
α        norm.    wEH     wIH     ρ      τα,γ   MAP
1        max      0.5     0.5     12.4   0.97   0.640
0.5               0.5     0.5     12.8   1.00   0.675
0.1               0.5     0.5     12.8   1.06   0.671
0.1      max-ρ    0.552   0.448   12.8   1.06   0.687
0.1      max-τ    0.524   0.476   12.8   1.07   0.679
0.01              0.5     0.5     12.7   1.15   0.665
0.01     max-ρ    0.580   0.420   12.8   1.15   0.690
0.01     max-τ    0.546   0.454   12.8   1.15   0.679
0.001             0.5     0.5     12.6   1.26   0.656
0.001    max-ρ    0.612   0.388   12.8   1.26   0.692
0.001    max-τ    0.550   0.450   12.8   1.26   0.672
0.0001            0.5     0.5     12.2   1.49   0.631
0.0001   max-ρ    0.668   0.332   12.8   1.45   0.688
0.0001   max-τ    0.550   0.450   12.5   1.49   0.648

Table 8.2 – Effectiveness of γ when combining distances from EH and IH.
Combination of two and three descriptors
The following experiments are based on the effectiveness achieved by combination of the
following three configurations: EH, IH and KF. Table 8.1 summarizes the indicators and base
effectiveness (MAP) achieved by each configuration (the MAP values are the same as the frame-to-frame effectiveness shown in Figure 7.3 on page 82). The intrinsic dimensionality was calculated by
evaluating the distance between randomly sampled pairs of segments (x, y) with x ∈ Q and y ∈ R.
The searches are implemented as linear scans. Note the relationship between ρ and MAP: EH
achieves the best MAP and it also has the highest ρ. On the other hand, KF has the lowest ρ but
outperforms IH. These three configurations will be combined using the normalization by maximum
distance, α-normalization, weighting by max-ρ, and weighting by max-τ .
For EH+IH (Table 8.2) the best effectiveness is achieved by weighting by max-ρ with
α=0.001. For EH+KF (Table 8.3) the best effectiveness is achieved by weighting by max-ρ with
α=0.0001. For IH+KF (Table 8.4) the best effectiveness is achieved by weighting by max-τ with
α=0.01. For EH+IH+KF (Table 8.5) the best effectiveness is achieved by weighting by max-τ
with α=0.1.
Because EH by itself achieves the best MAP and has the highest ρ, the optimal weights
for EH+IH and EH+KF are selected by the weighting by max-ρ algorithm. However, in the
combination of IH+KF, the weights chosen by weighting by max-ρ are biased to IH. In this case,
the weighting by max-τ achieves the best performance.
α        norm.    wEH     wKF     ρ      τα,γ   MAP
1        max      0.5     0.5     10.1   0.85   0.698
0.5               0.5     0.5      9.2   1.00   0.691
0.1               0.5     0.5      8.7   1.11   0.685
0.1      max-ρ    0.794   0.206   11.3   1.08   0.706
0.1      max-τ    0.566   0.434    9.5   1.11   0.694
0.01              0.5     0.5      8.2   1.23   0.678
0.01     max-ρ    0.820   0.180   11.3   1.16   0.706
0.01     max-τ    0.532   0.468    8.6   1.24   0.684
0.001             0.5     0.5      7.4   1.42   0.665
0.001    max-ρ    0.856   0.144   11.3   1.22   0.706
0.001    max-τ    0.516   0.484    7.6   1.42   0.668
0.0001            0.5     0.5      6.5   1.69   0.645
0.0001   max-ρ    0.896   0.104   11.3   1.28   0.706
0.0001   max-τ    0.510   0.490    6.6   1.69   0.647

Table 8.3 – Effectiveness of γ when combining distances from EH and KF.
α        norm.    wIH     wKF     ρ      τα,γ   MAP
1        max      0.5     0.5     7.9    1.00   0.564
0.5               0.5     0.5     6.9    0.99   0.584
0.1               0.5     0.5     6.7    1.07   0.586
0.1      max-ρ    0.936   0.064   8.6    1.02   0.520
0.1      max-τ    0.538   0.462   7.0    1.07   0.583
0.01              0.5     0.5     6.6    1.16   0.587
0.01     max-ρ    0.938   0.062   8.6    1.04   0.520
0.01     max-τ    0.470   0.530   6.4    1.16   0.588
0.001             0.5     0.5     6.4    1.23   0.588
0.001    max-ρ    0.946   0.054   8.6    1.05   0.520
0.001    max-τ    0.482   0.518   6.3    1.23   0.587
0.0001            0.5     0.5     6.3    1.21   0.587
0.0001   max-ρ    0.952   0.048   8.6    1.04   0.520
0.0001   max-τ    0.516   0.484   6.4    1.21   0.588

Table 8.4 – Effectiveness of γ when combining distances from IH and KF.
α        norm.    wEH     wIH     wKF     ρ      τα,γ   MAP
1        max      0.333   0.333   0.333   11.2   0.89   0.670
0.5               0.333   0.333   0.333   10.5   1.00   0.694
0.1               0.333   0.333   0.333   10.1   1.11   0.690
0.1      max-ρ    0.561   0.394   0.045   12.9   1.08   0.704
0.1      max-τ    0.453   0.179   0.368   10.2   1.12   0.711
0.01              0.333   0.333   0.333    9.7   1.24   0.684
0.01     max-ρ    0.575   0.362   0.063   12.9   1.18   0.709
0.01     max-τ    0.427   0.19    0.383    9.4   1.25   0.702
0.001             0.333   0.333   0.333    9.0   1.44   0.676
0.001    max-ρ    0.602   0.364   0.034   12.9   1.29   0.703
0.001    max-τ    0.447   0.173   0.38     8.8   1.46   0.693
0.0001            0.333   0.333   0.333    8.2   1.70   0.658
0.0001   max-ρ    0.675   0.296   0.029   12.9   1.50   0.705
0.0001   max-τ    0.436   0.189   0.375    7.9   1.78   0.675

Table 8.5 – Effectiveness of γ when combining distances from EH, IH and KF.
In the combination of the three descriptors, the weighting by max-τ achieves the best result. These results also show that there is no clear rule for locating the best α, since the best effectiveness is achieved with some α ≤ 0.1.
Previous tables show that both weighting algorithms can improve the trivial selection of
equal weights, outperforming the α-normalization and the normalization by maximum distance.
However, these tables do not show the best set of weights that may have been selected, i.e., the
set of weights that maximizes MAP. To analyze this issue, we test the performance for every set of
weights in the combination of EH, IH and KF. Because the sum of weights is 1 and no weight can be negative, all the 3-tuples (wEH, wIH, wKF) reside in a plane, specifically on an equilateral triangle of side √2. We have uniformly sampled weights in the triangle following a regular grid with step
0.1 (66 samples). For each sampled set of weights we measure the value of ρ, τα,γ and the MAP
achieved by the similarity search.
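A small sketch of this sampling, which enumerates the 66 weight triples of the regular grid:

    # Enumerate all triples (w_EH, w_IH, w_KF) of non-negative multiples of `step`
    # that sum to 1.
    def simplex_grid(step=0.1):
        n = round(1 / step)
        return [(i * step, j * step, (n - i - j) * step)
                for i in range(n + 1) for j in range(n + 1 - i)]

    samples = simplex_grid(0.1)
    print(len(samples))  # 66 sampled weight sets, as used in Figure 8.3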
Figure 8.3 summarizes values of ρ, τα,γ and MAP for different combinations of weights when
the underlying distances of γ have been α-normalized with α=0.1 and α=0.0001. The location of
the weights that result in maximum ρ and τα,γ are marked in the graphs (the actual values are
already shown in Table 8.5). The set of sampled weights that achieves the maximum MAP is also
marked on each graph. The 3D graph shows the three surfaces when the vertical axis has been
displaced and scaled to match the minimum value at zero and the maximum value at one, thus
the relationship between increase and decrease of ρ, τα,γ , and MAP is clearer. The figure shows
that the weights that produce maximum ρ and τα,γ do not coincide with the weights that achieve
maximum MAP; however, they are closer to it than the weights (1/3, 1/3, 1/3). Therefore, the maximization
of either ρ or τα,γ selects a set of weights that outperforms the trivial selection of equal weights,
but that achieves a sub-optimal MAP.
The fact that the weighting by max-ρ selects a sub-optimal configuration implies that there
is a gap in which to improve both efficiency and effectiveness, and that the highest effectiveness does
not necessarily imply the lowest efficiency. The weighting by max-ρ exploits the (sometimes loose)
inverse correlation between efficiency and effectiveness. We should remark that this approach must be used with prudence.
[Figure: values of ρ, τα,γ, and MAP over the simplex of weights (wEH, wIH, wKF), for α=0.1 and α=0.0001, with the weights of maximum ρ, maximum τα,γ, and maximum MAP marked; the maximum sampled MAP values are 0.735 at weights (0.7, 0.2, 0.1) and 0.731 at (0.8, 0.1, 0.1). A 3D view overlays the three rescaled surfaces.]
Figure 8.3 – Relationship between intrinsic dimensionality (ρ), the value that α-normalizes γ (τα,γ), and MAP, when combining distances from EH, IH and KF.
Name   Segm.   d    Descriptor           ρ       MAP
GH     S1      L1   GHt 4x4-12           13.21   0.493
EH5    S1      L1   EHt 4x4-5             7.26   0.561
OM     S1      L1   OMt 9x9               6.18   0.339
CH     S1      L1   CHt 2x2-rgb-4x4x4     6.92   0.439
G14    S1      L1   GHt 1x4-32            8.71   0.488

Table 8.6 – Effectiveness of five more configurations to be combined.
For example, assume there are initial weights that achieve a high MAP, located either by experimental evaluation or using a partial ground-truth. If we apply the weighting
by max-ρ in order to improve the effectiveness, the algorithm will move the initial weights towards
the weights that produce the maximum ρ, and it may happen that it is actually decreasing both
effectiveness and efficiency.
The fact that the weighting by max-τ optimizes to a set with relatively similar weights
shows there is not a strong correlation between EH, IH and KF. In order to test the performance
of the weighting algorithm under correlated descriptors, we will perform a combination using more
configurations.
Incremental combination
Table 8.6 shows the effectiveness for five more configurations: GH, EH5, OM, CH and
G14. The following experiment combines all these configurations together with the configurations
already shown in Table 8.1. The experiment compares the performance of the weighting algorithms
for the incremental combination in the following (arbitrary) order: EH, IH, KF, GH, EH5, OM,
CH, and finally G14.
Figure 8.4 shows the achieved MAP when γ combines incrementally, from one, up to eight
configurations, using the normalization by maximum distance with equal weights, the α-normalization
with equal weights (α=0.1), the weighting by max-ρ, and the weighting by max-τ . The results show
that combining descriptors improves the effectiveness up to a saturation point (in this case, three
descriptors), and adding more descriptors may even reduce the effectiveness. The α-normalization
outperforms the normalization by maximum distance at every step. In turn, the weighting by max-τ outperforms the α-normalization at every step. Thus, the α-normalization performs a combination
with a better scaling, and the weighting by max-τ can correct the trivial selection of equal weights.
In the experiment, the configuration GH acts as a noisy descriptor, decreasing the effectiveness of all weighting algorithms; in particular, the weighting by max-ρ is highly impacted due
to the high ρ of GH. However, as more descriptors are combined, the weighting by max-ρ corrects
its behavior, outperforming the effectiveness achieved by the normalization by maximum distance
when eight descriptors are combined.
8.2 Spatio-Temporal Combined Distance
Let γ be a spatial distance function between two segments, which may be a simple distance
between descriptors or a weighted combination of descriptors.
[Figure: MAP of the incremental combination from one up to eight configurations (EH, +IH, +KF, +GH, +EH5, +OM, +CH, +G14) for the normalization by maximum distance, the α-normalization, the weighting by max-ρ, and the weighting by max-τ.]
Figure 8.4 – Effectiveness of combining incrementally from one up to eight descriptors. Weights in γ use normalization by maximum distance (equal weights), α-normalization (equal weights, α=0.1), weighting by max-ρ, and weighting by max-τ.
The spatio-temporal (s-t) distance between two segments is the average distance of γ inside a temporal window.
Definition 6 (Spatio-Temporal Combined Distance) The spatio-temporal combined distance
δ between two segments is defined as:
δ(qj, rk) = (1/W) Σ_{w=−⌊W/2⌋}^{⌊W/2⌋} γ(qj+w, rk+w)

where W is an odd number, qj is the j-th segment of a video v partitioned into segments {q1, ..., qs}, ∀i < 1 qi = q1, and ∀j > s qj = qs.
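A minimal sketch of δ, clamping segment indices at the beginning and end of each video (the definition states the clamping for the query video; the same convention is applied here to the reference):

    # Sketch of the spatio-temporal combined distance: the average of gamma over a
    # temporal window of W consecutive segments.
    def delta(gamma, q_segments, r_segments, j, k, W=3):
        """q_segments, r_segments: lists of segments of the query and reference videos.
        j, k: 0-based indices of the segments being compared. W must be odd."""
        half = W // 2
        total = 0.0
        for w in range(-half, half + 1):
            qj = q_segments[min(max(j + w, 0), len(q_segments) - 1)]  # clamp to [first, last]
            rk = r_segments[min(max(k + w, 0), len(r_segments) - 1)]
            total += gamma(qj, rk)
        return total / W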
The scaling factor 1/W does not affect the nearest neighbors, thus it could be discarded.
However, we included it because γ could be α-normalized, in which case a distance threshold for δ
can be defined using 1 as a reference value.
If γ satisfies the metric properties, then δ also satisfies them. Note the behavior for comparing
segments lying at the beginnings and endings of videos (∀i < 1 qi = q1, ∀j > s qj = qs). This
alternative is preferred over an early-termination of the sum, because the latter may break the
triangle inequality.
Some works that include a s-t distance have been reviewed in Chapter 4. In particular,
the temporal distance for the ordinal measurement descriptor [Kim and Vasudev, 2005] is closely
related. However, δ is more general because it can combine many descriptors and it is not associated
with any extraction method.
8.2.1 Evaluation
The following experiment tests the performance of the s-t distance δ varying the size of
the temporal window W . Figure 8.5 summarizes the effectiveness of varying the temporal window
W ={1,3,5,7,9} for the fourteen global descriptors evaluated in Chapter 7 plus the acoustic descriptor
AU160. The experiment uses videos with preprocessing, s-t description and segmentation fixed to
one second length (S1).
[Figure: frame-to-frame effectiveness (MAP) and copy detection effectiveness (detections without false alarms) versus temporal window W ∈ {1, 3, 5, 7, 9} for the fourteen spatio-temporal global descriptors and the acoustic descriptor AU160, at segmentation S1.]
Figure 8.5 – Effectiveness of increasing temporal window W (videos with preprocessing, s-t description, segments of one second length).
The results for W=1 are already shown in Figure 7.3 on page 82 for the global descriptors at S1, and in Figure 7.4 on page 85 for the acoustic descriptor with m=160 bands, w=500 ms, and s=20 ms at S1.
In the case of frame-to-frame effectiveness, each descriptor increases its effectiveness in roughly the same proportion. In particular, EHt 4x4-10 increases its MAP to more than 0.9 for W=9. On average over all the descriptors, the increases compared with the MAP achieved by W=1 are {29%, 41%, 48%, 52%} for W in {3, 5, 7, 9}, respectively. The improvement in the MAP for every descriptor shows that a larger W can improve the frame-to-frame effectiveness. The experiment uses segments of one second length, but it should be noted that the behavior is similar for different segment lengths.
In the case of detection effectiveness, the improvements are not as large as for the MAP. The number of detections remains mostly stable, with some small increases (and even decreases for some descriptors). The only descriptor that greatly benefits from a large W is OMt 9x9, which no longer achieves the worst performance but one comparable with the other descriptors. This result shows that a large W mostly does not improve the detection performance; thus the improvement in MAP is due to better matching between frames of copies that are already detectable.
In summary, a large temporal window is useful for improving the copy localization (and also
the detection score) for copies that are already detectable by a small window, but it is not useful
for discovering more copies. Moreover, a large temporal window also implies more computational
effort, e.g., an increase of W from 1 to 11 implies 10 times more evaluations of γ function.
In an exact search, where every query segment is compared to every reference segment, most of the evaluations of γ are repeated, thus an optimization like memoization can be used. For example, with W=3 the evaluations of δ(qi, rj) and δ(qi+1, rj+1) share the evaluations of γ(qi, rj) and γ(qi+1, rj+1). In practice, for exact searches, an increase in W will not affect the number of evaluations of γ, but it will require more memory to store all of them.
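A sketch of this memoization for exact searches, caching γ by the pair of segment indices so that overlapping windows reuse previous evaluations:

    # Sketch of a memoized delta for exact searches: gamma values are cached by the
    # (query index, reference index) pair, so consecutive windows reuse them.
    from functools import lru_cache

    def make_delta(gamma, q_segments, r_segments, W=3):
        @lru_cache(maxsize=None)
        def cached_gamma(i: int, j: int) -> float:
            return gamma(q_segments[i], r_segments[j])

        half = W // 2
        def delta(j: int, k: int) -> float:
            total = 0.0
            for w in range(-half, half + 1):
                qi = min(max(j + w, 0), len(q_segments) - 1)
                ri = min(max(k + w, 0), len(r_segments) - 1)
                total += cached_gamma(qi, ri)
            return total / W
        return delta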
However, this optimization will not be effective for approximate searches. In approximate
searches most of the evaluations of δ are avoided and only a few evaluations are actually performed.
Then the evaluations of δ may not share the evaluations of γ. In this case, the cost for each evaluation
of δ will increase proportionally with W , but the detection performance does not increase at the
same ratio.
As we show in Chapter 11, for TRECVID 2010 we performed approximate searches using
W =3 due to its increase in the effectiveness. However, for TRECVID 2011 we prefer to fix W
to 1 and to improve the approximation parameters (which show better performance revenues, see
Chapter 9).
Combination of acoustic and visual descriptors
Chapter 7 shows the performance of an acoustic descriptor that behaves in a similar way as
global descriptors. In this experiment we perform a combination of visual and acoustic information
in the similarity search using different weights and temporal window W . Table 8.7 shows the
configurations EH and AU and their effectiveness.
Figure 8.6 shows the effectiveness of the s-t distance varying the size of the temporal window
W . The distance γ is a weighted combination of EH and AU, using α-normalization (α=0.1) and
five different sets of weights. The best frame-to-frame effectiveness is achieved by equal weights
Name   Segm.   d    Descriptor    ρ      MAP
EH     S1      L1   EHt 4x4-10    9.59   0.664
AU     S1      L1   AU160         7.53   0.734

Table 8.7 – Base effectiveness of the two configurations to be combined.
[Figure: frame-to-frame effectiveness (MAP) and copy detection effectiveness (detections without false alarms) versus temporal window W for weights (wEH, wAU) ∈ {(1.0, 0.0), (0.8, 0.2), (0.5, 0.5), (0.2, 0.8), (0.0, 1.0)}.]
Figure 8.6 – Effectiveness of s-t distance with temporal window W, when γ is a combination of EH and AU, α-normalization (α=0.1) and five different sets of weights.
wEH =wAU =0.5 and W =9 with MAP 0.986. However, that configuration detects 30 copies out of
31: it misses the copy that has no audio track. In turn, the weights wEH=0.8, wAU=0.2 achieve a slightly lower MAP but detect all 31 copies with W ≥ 3.
Both weighting algorithms chose relatively similar weights: weighting by max-ρ selects
weights wEH =0.572, wAU =0.428, and weighting by max-τ selects weights wEH =0.514, wAU =0.486,
both sets detecting 30 of 31 copies. This behavior shows a weakness of the automatic weighted
combination: the difficulty of tuning the combination when some underlying distance uses a misleading descriptor. This issue may be overcome using some descriptor-specific behavior (e.g., decreasing wAU if a video lacks an audio track), or using previous knowledge of the dataset (manually
adjusting weights).
Besides, to the best of our knowledge, the result of 100% correct detections without false
alarms (achieved by wEH =0.8, wAU =0.2, W =3, α=0.1) outperforms a CBVCD system using local
descriptors that achieves the best result on the MUSCLE-VCD-2007 dataset [Poullot et al., 2010].
8.3 Other Metric and Non-Metric Distances
In previous chapters, we have exclusively used the L1 distance for comparing descriptors.
The L1 distance satisfies the metric properties and also it is fast to evaluate, but a comparison with
other distances is required.
The metric properties present a tradeoff between effectiveness and efficiency. Thus, we may improve the effectiveness of the distance function by relaxing the restrictions of a metric. When a distance function does not satisfy some of the metric properties, it is known as a non-metric.
Section 3.3.2 reviews some related work for the non-metric approach and some non-metric distances.
The following experiment compares the performance achieved by the fourteen descriptors
and the acoustic descriptor using the following distances:
• Minkowski distance: L1 and L2 . These distances satisfy the metric properties.
• Fractional distance: L0.5 and L0.1 . These distances do not satisfy the metric properties.
• Chi-squared test statistic: χ2 . This distance does not satisfy the metric properties.
• DPF distance: D1^3%, D1^6%, and D1^20%. These distances correspond to evaluating L1 after discarding the 3%, 6%, and 20% of the dimensions with the highest differences, respectively. These distances do not satisfy the metric properties.
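The following sketches illustrate the compared distances; the chi-squared and DPF formulations follow their usual definitions and may differ from the thesis in small details (e.g., the exact chi-squared normalization):

    # Sketches of the compared distance functions between descriptor vectors.
    import numpy as np

    def minkowski(x, y, p):                 # L2 (p=2), L1 (p=1), fractional (p<1)
        return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

    def chi_squared(x, y, eps=1e-12):       # chi-squared test statistic between histograms
        return float(np.sum((x - y) ** 2 / (x + y + eps)))

    def dpf(x, y, discard_fraction):        # L1 after dropping the dimensions with the
        diffs = np.abs(x - y)               # highest differences (e.g., 0.03, 0.06, 0.20)
        k = int(round(discard_fraction * len(diffs)))
        kept = np.sort(diffs)[:len(diffs) - k] if k > 0 else diffs
        return float(np.sum(kept))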
The similarity search using Fractional and Chi-squared distances is relatively slower than
Minkowski distances (about two or three times slower), while the search using DPF is more than
one order of magnitude slower (mainly affected by the storage and sorting of differences by dimension). We did not evaluate other more complex distances (like the Earth Mover’s Distance [Rubner
et al., 2000]) because their evaluation time (more than two orders of magnitude slower) makes them
unfeasible to test on the MUSCLE-VCD-2007 dataset.
Figure 8.7 compares the effectiveness achieved by the eight distances. The experiment uses
videos with preprocessing, s-t description and segmentation fixed to one second length (S1). The
results for L1 are also shown in Figure 7.3 on page 82 for S1.
In the case of frame-to-frame effectiveness, L0.5 produces a slight increase in the MAP for EHt, GHt, and IHt descriptors, χ2 produces a slight increase for CHt and IHt descriptors, and the DPF distances produce a slight increase for EHt, GHt, and IHt descriptors. Among the three DPF variants, D1^6% produces the best results, while D1^20% only improves the MAP for two GHt descriptors and one IHt. The global maximum for global descriptors is achieved by D1^6% (0.692) and L0.5 (0.688), both at EHt 4x4-10. On the other hand, L2 and L0.1 are outperformed by L1 for every descriptor (except at GHt 3x3-20).
In the case of detection effectiveness, L0.5 outperforms L1 for almost every descriptor (except at CHt 1x1-hsv-16x4x4), and χ2 and D1^3% on average produce the same effectiveness as L1. On the other hand, L2, L0.1, D1^6% and D1^20% on average worsen the detection effectiveness. The global maximum for global descriptors is achieved by L0.5 at EHt 4x4-10 with 29 detections without false alarms.
In summary, the replacement of L1 with a non-metric distance may improve the effectiveness.
In particular, the effectiveness achieved by L1 can be outperformed by a fractional distance with p
in the range [0.5, 1). The intuition behind this good performance of fractional distances is that
they penalize descriptors with small differences in many dimensions, and favor descriptors that
concentrate high differences on few dimensions. For example, given a query vector A=(9, 9, 9, 9),
and two objects B=(5, 5, 5, 5) and C=(1, 1, 9, 9), L2 selects B as the nearest neighbor of A, L1
evaluates B and C with the same distance to A, and L0.5 selects C as the nearest neighbor of
A. In fact, when the descriptors represent zones of a frame, the latter alternative is preferable
because a postproduction transformation usually modifies a few zones while leaving the other zones
unchanged.
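Using the minkowski helper sketched above (an illustrative, hypothetical implementation), the example can be checked numerically:

import numpy as np

A = np.array([9.0, 9.0, 9.0, 9.0])
B = np.array([5.0, 5.0, 5.0, 5.0])
C = np.array([1.0, 1.0, 9.0, 9.0])

print(minkowski(A, B, 2.0), minkowski(A, C, 2.0))    # 8.0, ~11.3  -> L2 prefers B
print(minkowski(A, B, 1.0), minkowski(A, C, 1.0))    # 16.0, 16.0  -> L1 ties
print(minkowski(A, B, 0.5), minkowski(A, C, 0.5))    # 64.0, 32.0  -> L0.5 prefers C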
The DPF distances can also outperform L1 . They discard dimensions with high differences
before evaluating the distance, thus they can avoid the noise that a transformation may have produced
in a descriptor. However, they tend to improve the effectiveness of copies already detectable
by L1 rather than detecting new copies.

Figure 8.7 – Effectiveness achieved by different distance functions (videos with preprocessing,
s-t description, segments of one second length). The panels show the frame-to-frame effectiveness
(MAP) and the copy detection effectiveness (detections without false alarms) for segmentation S1,
per distance function and per spatio-temporal global descriptor.
Finally, we choose L1 because it achieves a good balance between effectiveness and efficiency.
Its effectiveness may be outperformed by a fractional distance, but at the cost of higher evaluation
time and breaking the triangle inequality.
8.4 Summary
In this chapter we have presented three approaches to improve the effectiveness of the distance function.
The first approach combines the distance from many descriptors into a unique static weighted
distance. This combined distance can improve the effectiveness of the search by using different
sources of information. The α-normalization enables the comparison of distances between nearest
neighbors from different metric spaces. The weighting by max-ρ and weighting by max-τ are two
algorithms that intend to automatically select a good set of weights for the combination. The
main strength of these algorithms is that they need neither training data nor effectiveness evaluations.
These algorithms use straightforward statistical indicators that can be calculated rapidly by
sampling distances from the metric spaces.
The second approach uses a spatio-temporal combined distance. This distance exploits the
temporal dimension of videos to improve the effectiveness. The evaluation shows that the frame-to-frame
match between copies and original segments can be highly improved using a large temporal
window; however, the detection effectiveness does not improve to the same extent. In particular, we have shown
a combination of global and acoustic descriptors that fully resolves the MUSCLE-VCD-2007
dataset.
The third approach relaxes the metric properties of the distance in order to use some well-known
non-metric distances. In particular, we have shown that fractional distances can indeed improve the
effectiveness; however, the tradeoff between effectiveness and efficiency still favors the L1 distance.
The next chapter reviews the efficiency of the search assuming a good distance function has
been defined.
Chapter 9

Improving Efficiency in the Similarity Search
The objective of the Similarity Search is to perform NN+range searches to retrieve the k
most similar reference segments within a distance threshold ε for each segment q ∈ Q, according to a
distance function d.
In this chapter we will focus on performing an efficient search using an already defined
distance d. We present two approaches to perform an efficient search: approximate searches using
a static pivot table, and exact searches using a dynamic pivot table.
9.1 Approximate Search with Pivots
LAESA (see Chapter 3) selects a set of pivots from the reference objects to perform an
efficient search. During the search, before evaluating the distance between the query and each
object, it calculates its lower bound. If the lower bound is larger than the distance to the current k th nearest
neighbor candidate, the object can be safely discarded without affecting the final result.
This section presents an approximate search algorithm which uses the lower bounds as fast
distance estimators. This search discards most of the objects and evaluates the actual distance only
for the most promising objects (objects with the lowest lower bounds). This section also presents
an application of the approximate search for local descriptors, and a Two-step search which first
performs approximate searches and then performs an exact search.
9.1.1 Index Structure
Let D be the set of segments, d : D × D → R be a distance between any two segments,
R ⊆ D be the set of reference segments in the CBVCD system, and P ⊆ R be a set of pivots. The
LBP function is defined as:
LBP (q, r) = max {|d(q, p) − d(r, p)|}
p∈P
106
(9.1)
Algorithm 9.1: The Sparse Spatial Selection (SSS) algorithm.
Input: R set of reference segments, t sparse threshold.
Output: P set of pivots
  {r1 , ..., r|R| } ← randomize objects in R
  P ← {r1 }
  foreach ri ∈ {r2 , ..., r|R| } do
    if ∀ p ∈ P, d(ri , p) ≥ t then
      P ← P ∪ {ri }
  return P
If d satisfies the metric properties, the object-pivot distance constraint implies that:
∀ q, r ∈ D, ∀ P ⊆ R : LBP (q, r) ≤ d(q, r)        (9.2)
If d(x, p) is precalculated ∀x ∈ D, ∀p ∈ P, the evaluation of LBP costs just |P| operations.
The index structure is LAESA, i.e., a |R| × |P| pivot table containing distances from each pivot to
every reference segment. Additionally, if d is a spatio-temporal distance, then each segment must
have a pointer to the previous segment and the next segment in the video segmentation.
The Sparse Spatial Selection (SSS) algorithm [Bustos et al., 2008] incrementally selects pivots according to a minimum distance that they must keep from each other. This algorithm returns
a variable number of pivots depending on the distance threshold and the traversal order of R. Algorithm 9.1 shows the implemented SSS algorithm: it first randomizes R and then performs the
incremental selection.
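A minimal Python sketch of this selection and of the LBP lower bound follows (illustrative only; dist, refs and the precomputed pivot distances are assumed to be provided by the caller):

import random

def sss_pivots(refs, dist, t):
    # Sparse Spatial Selection (Algorithm 9.1): keep an object as a pivot only
    # if it is at distance >= t from every pivot selected so far.
    order = list(refs)
    random.shuffle(order)                  # randomize the traversal order of R
    pivots = [order[0]]
    for r in order[1:]:
        if all(dist(r, p) >= t for p in pivots):
            pivots.append(r)
    return pivots

def lower_bound(q_dists, r_dists):
    # LBP(q, r) = max_p |d(q, p) - d(r, p)| (Equation 9.1), given the
    # precomputed distances of q and of r to each pivot.
    return max(abs(a - b) for a, b in zip(q_dists, r_dists))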
The approximate search uses the lower bound as an estimator of the distance, hence the
search requires a set of pivots that produces tight values between the distance and its lower
bound. Therefore, given n candidate sets {P1 , ..., Pn } we select the set Pi that minimizes |d(x, y) −
LBPi (x, y)| for randomly sampled x and y. This criterion is equivalent to selecting the set that maximizes the average lower bound µP , proposed by Bustos et al. [2003]. In fact, the latter is preferable
in order to avoid the evaluation of d(x, y) and save some computational cost. The calculation of
µP is performed by evaluating the lower bound between randomly sampled pairs of objects in the
collection.
Given two sets P1 and P2 with |P1 |<|P2 |, it may be expected that µP1 <µP2 . However,
the µP1 >µP2 scenario may appear due to either redundant pivots in P2 or over-fitting of P1 to
the evaluation pairs. When selecting sets with a few pivots from a large collection of objects, it
is unlikely that SSS selects redundant pivots, and in the long term it is more likely that P2 will
produce tighter values between d and LBP , even if P1 achieves a higher µP on the random sample.
Thus, unlike Bustos et al. [2003], we first fix a target number of pivots and then select the set
with the highest µP . The assumption that smaller sets of pivots are outperformed by bigger sets is
confirmed by the results shown in the experimental section.
Algorithm 9.2 shows the selection algorithm for a target number of pivots m. The parameters
n and s (the n sets of pivots to evaluate and the s pairs to use in the evaluation) can be adjusted
to take reasonable computational time (in our implementation the selection takes a few minutes).
The algorithm calculates a histogram of distances of d and invokes the SSS algorithm to select pivots.
Algorithm 9.2: Pivot Selection Algorithm.
Input: R set of reference segments, m number of pivots, n number of sets to evaluate, s number of evaluation pairs.
Output: P set of pivots
  {x1 , y1 , ..., xs , ys } ← random sample of 2·s objects in R
  Fd ← cumulative distribution of distances of d
  α ← 1
  foreach i ∈ {1, ..., n} do
    while |Pi | ≠ m do
      Pi ← SSS (R, Fd−1 (α))                                    // Algorithm 9.1
      if |Pi | < m then
        α ← α · (1 − ε)
      else if |Pi | > m then
        discard the last (|Pi | − m) pivots from Pi
    µPi ← average value of LBPi (xj , yj ), j ∈ {1, ..., s}      // Equation 9.1
  return the set Pi with maximum µPi
The sparse threshold is calculated using the cumulative distribution of distances (see
Section 8.1.1). The algorithm starts with the maximum distance (α=1) and decreases α whenever SSS cannot
select the required number of pivots. In practice, it is preferable to retry SSS a few times before
decreasing α. Finally, the set P with maximum µP is selected to create the index. This algorithm
can be accelerated by selecting each candidate set in an independent thread.
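The following Python sketch summarizes the selection criterion of Algorithm 9.2, reusing the sss_pivots helper sketched above; threshold_for is a hypothetical callback standing for Fd−1 (α):

import random

def select_pivot_set(refs, dist, m, n_sets, n_pairs, threshold_for):
    # Build several candidate pivot sets of size m with SSS and keep the one
    # with the highest average lower bound over randomly sampled pairs.
    pairs = [(random.choice(refs), random.choice(refs)) for _ in range(n_pairs)]
    best_set, best_mu = None, -1.0
    for _ in range(n_sets):
        alpha, pivots = 1.0, []
        while len(pivots) < m:
            pivots = sss_pivots(refs, dist, threshold_for(alpha))
            alpha *= 0.95              # relax the sparse threshold if SSS gave too few pivots
        pivots = pivots[:m]            # discard surplus pivots, if any
        mu = sum(max(abs(dist(x, p) - dist(y, p)) for p in pivots)
                 for x, y in pairs) / n_pairs
        if mu > best_mu:
            best_set, best_mu = pivots, mu
    return best_set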
9.1.2 Exact Search with Pivots
For every query segment q ∈ Q, the first step is to calculate d(q, p) ∀p ∈ P. Then, the
NN+range search retrieves the k closest objects within a distance threshold ε of q. Algorithm 9.3
depicts the classic exact NN+range search using pivots. It uses Equation 9.2 to
evaluate d only when the lower bound is less than both the range ε and the k th candidate distance.
The exact search with pivots can achieve a big improvement in the search time compared
to a linear scan, and it retrieves the same objects. However, when the search is performed in large
collections, this algorithm may not be as fast as needed. For those cases, we propose an approximate
search algorithm.
9.1.3 Approximate Search with Pivots
The exact search with pivots may improve the search time compared to a linear scan; however, in some large collections (such as the TRECVID datasets) it is not fast enough. In fact, for our
participation at TRECVID 2010, the exact search with pivots would have taken several months to
complete.
Algorithm 9.4 shows an algorithm which returns an approximation of the result of the exact
search and can perform much faster. It is based on the fact that the lower bounds of the nearest
neighbors are usually among the smallest lower bounds. It is divided into two rounds.
Algorithm 9.3: Classic NN+range search using pivots.
Input: q query segment, R set of reference segments, k number for NN search, ε threshold for range search, P set of pivots.
Output: List of k nearest neighbors to q
  NNs ← new priority queue
  foreach r ∈ R do
    lb ← LBP (q, r)                                                    // Equation 9.1
    if lb < ε and ( size of NNs < k or lb < max distance in NNs ) then // Equation 9.2
      dist ← d(q, r)
      if dist < ε then
        add r to NNs with distance dist
        if size of NNs > k then
          remove max distance object from NNs
  return NNs
In the first round, it uses LBP as an estimator for d: it evaluates LBP for every object in R, discards objects
whose LBP is greater than the threshold ε, and selects the T ·|R| objects with the smallest values of
LBP , given an approximation parameter T ∈ [0, 1]. In the second round, it evaluates d only for the
selected objects and locates among them the k nearest neighbors that are closer than ε to q.
This is an approximate search because there is no guarantee that the LBP for the actual nearest
neighbors will be among the T ·|R| smallest values of LBP .
The key difference between the exact search with pivots and the approximate search is that,
while the exact search uses the LBP values as lower bounds for discarding objects that are far away
from the query, Algorithm 9.4 compares the LBP values against each other, assuming that a low/high value
of LBP implies a low/high value of d. Note that Equation 9.2 is not used in the approximate
search because LBP is used only as a fast estimator and not as a lower bound of d.
As previously mentioned, selecting more pivots for P implies tighter estimations, but it also
implies a higher computational cost for evaluating LBP . However, with tighter estimations a smaller
T is required to select the actual nearest neighbors. The tradeoff between |P| and T is analyzed in
the experimental section. Note that as T moves closer to 1, Algorithm 9.4 produces the same
results as Algorithm 9.3, independently of P. In particular, when T =1 the approximate search will
evaluate d for the whole reference collection, thus it will select the same nearest neighbors as the
exact search (but at a higher computational cost).
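A compact Python sketch of the two rounds of Algorithm 9.4 is shown below (illustrative only): q_dists and ref_dists hold the precomputed pivot distances of the query and of each reference segment, and dist_to_q evaluates the actual distance d(q, r).

import heapq

def approx_nn_range(q_dists, ref_dists, dist_to_q, k, eps, T):
    n_candidates = max(1, int(T * len(ref_dists)))

    # Round 1: use LBP as a cheap estimator of d and keep the most promising objects.
    estimations = []
    for r, r_dists in enumerate(ref_dists):
        lb = max(abs(a - b) for a, b in zip(q_dists, r_dists))
        if lb < eps:
            estimations.append((lb, r))
    candidates = heapq.nsmallest(n_candidates, estimations)

    # Round 2: evaluate the actual distance only for the selected objects.
    nns = []
    for _, r in candidates:
        d = dist_to_q(r)
        if d < eps:
            nns.append((d, r))
    return sorted(nns)[:k]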
In preliminary experiments we tested other estimators based on pivots. In particular, we
tested the minimum upper bound, the average bound (the average between lower and upper bounds),
and the median of the average bound for all the pivots. However, these estimators were not worth
their higher computational cost compared with LBP .
9.1.4 Approximate Search with Pivots for Local Descriptors
The approximate search with pivots requires a single distance to compare segments. In
order to use local descriptions in the search, it is possible to directly use Matches or Spatial (see
Algorithm 9.4: Approximate NN+range search using pivots.
Input: q query segment, R set of reference segments, k number for NN search, ε threshold for range search, P set of pivots, T approximation parameter.
Output: List of approximate k nearest neighbors to q
  MinLbs ← new priority queue
  foreach r ∈ R do
    lb ← LBP (q, r)                              // Equation 9.1
    if lb < ε then
      add r to MinLbs with distance lb
      if size of MinLbs > T · |R| then
        remove max distance object from MinLbs
  NNs ← new priority queue
  foreach r ∈ MinLbs do
    dist ← d(q, r)
    if dist < ε then
      add r to NNs with distance dist
      if size of NNs > k then
        remove max distance object from NNs
  return NNs
Chapter 7) as d in Algorithm 9.4. However, those distances are non-metrics, hence LBP will not
correctly estimate them, and the approximate search with pivots will achieve poor performance.
Nonetheless, those distances rely on a metric distance dL to compare local vectors (e.g., L2 for SIFT
vectors), and therefore we can adapt the approximate search to estimate dL instead of the non-metric
distance.
As in the previous section, let Q and R be the sets of query and reference segments, where
each segment is now represented by one or more local vectors. Let QL be the set of all the local
vectors of the query segments, and RL the set of all the local vectors of the reference segments.
Essentially, for every vector in QL the search retrieves its m closest vectors in RL according to dL .
First, a set of vectors P L ⊆ RL are selected as pivots using Algorithm 9.2. Then, given two
segments q and r we define the ApMatches distance as:
ApMatchesP,m,T (q, r) = 1 − |ApPairsP,m,T (q, r)| / |SF(q)| ,
ApPairsP,m,T (q, r) = { (x, y) ∈ SF(q) × SF(r) : y ∈ ApNNP,m,T (x) } ,
where ApNNP,m,T (x) is the set of the m nearest neighbors of x, which are retrieved from RL using
approximation parameter T and LBP L as the estimator of dL .
Finally, the approximate search is just a linear scan which, for each query segment in Q,
retrieves its k closest reference segments in R according to the distance ApMatches. The scan may
evaluate ApMatches(q, r) ∀q ∈ Q, ∀r ∈ R in any order, but in practice the scan first calculates the
sets ApNN for every local vector in q, and afterwards resolves all the ApMatches(q, ·) consecutively.
Amato et al. [2011] present a distance between images which is based on k-NN searches in the
local-vectors space. They use that approach for image classification, reporting comparable results
to both image-to-image matches and visual codebooks. The main difference with our approach
is that ApMatches is based on approximate searches instead of exact searches, thus it can scale
to larger collections. Unlike Amato et al. [2011], for efficiency considerations, we allow ApNN to
contain vectors from the same segment and avoid performing spatial consistency tests.
The performance of ApMatches depends on parameters m, T , |P| which are used to retrieve
the ApNN sets. In the experimental section we evaluate the effectiveness of ApMatches with
different parameters compared to Matches and Spatial.
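As an illustration (hypothetical names; local vectors are represented by identifiers), ApMatches can be computed from precomputed approximate nearest-neighbor sets as follows:

def ap_matches(q_vectors, r_vectors, approx_nn):
    # q_vectors / r_vectors: identifiers of the local vectors of segments q and r.
    # approx_nn(x): identifiers of the m approximate nearest reference vectors
    # of x, retrieved using LBP over the pivot set PL as the estimator of dL.
    r_set = set(r_vectors)
    paired = sum(1 for x in q_vectors if any(y in r_set for y in approx_nn(x)))
    return 1.0 - paired / max(1, len(q_vectors))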
9.1.5 Two-step Search
As tested in Chapter 8, a distance which combines many descriptors can be used in order
to increase the effectiveness of the search. However, a combined distance usually implies a higher
intrinsic dimensionality (as shown in the experimental section). The increase of the intrinsic dimensionality, in turn, affects the quality of the estimations, which decreases the effectiveness of the
approximate search. This sort of paradox motivated us to define the following Two-step search.

As noted in the previous section, the effectiveness of the approximate search with pivots can
be improved either by increasing |P| or by increasing T , both at the cost of increasing the search
time. In order to avoid those alternatives for a combined distance, we propose to locate candidates
by performing approximate searches using the underlying distances, and then to perform an exact
search among the candidates using the combined distance.
More formally, let {d1 , ..., dm } be the underlying distances of γ, let V be the set of reference
videos, and let C be the set of query videos. Given a query video c ∈ C, for every query segment
q ∈ c the first step performs m approximate k-NN searches –one for each di – retrieving the sets of
reference segments k-NN(q, di ). All the retrieved reference segments are then gathered:
N (c) = ∪ { k-NN(q, di ) : q ∈ c, 1 ≤ i ≤ m }
Given the set N (c) ⊆ R, the set V(c) ⊆ V is defined as the (at most) D videos with the largest
number of distinct segments in N (c):

V(c) = { X ⊆ V : |X| ≤ D, ∀v ∈ X, ∀u ∈ V − X : |v ∩ N (c)| ≥ |u ∩ N (c)| }
The second step performs an exact k-NN search for every q ∈ c using γ. The nearest
neighbors are retrieved from the search space R(c) ⊆ R, which is defined as the union of all the
segments from videos in V(c), i.e., r ∈ R(c) iff ∃v ∈ V(c) such that r ∈ v. With this reduction of the
search space (considering segments from the most promising videos), it is expected that |R(c)| ≪ |R|
and thus the exact search with γ can be resolved in a reduced time.
Moreover, this definition can be extended to define a general Two-step search for a query
video c, where the first step performs one or more similarity searches in the search space R
using the distances di , and the second step performs an exact search in the reduced search space R(c)
using a distance γ. The γ function may combine different descriptors or perform some complex
computations, and if it satisfies the metric properties the search efficiency can be improved by using
some MAM.
The Two-step search was used in our participation at TRECVID 2011 to efficiently search
with a distance which combined audio and visual descriptors. The first step retrieved k1 nearest
neighbors according to the distance da (which compares acoustic descriptors) and k2 nearest neighbors
according to the distance dv (which compares visual descriptors), and the second step performed a k-NN
search according to the distance dav (which combines acoustic and visual descriptors) in a reduced set
of candidate videos.
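A schematic Python version of the Two-step search is given below; the callbacks and names are illustrative, not the actual P-VCD implementation:

from collections import Counter

def two_step_search(query_segments, underlying_searches, video_of, segments_of,
                    combined_dist, D, k):
    # Step 1: gather candidate reference segments N(c) from the approximate
    # searches performed with each underlying distance d_i.
    candidates = set()
    for q in query_segments:
        for search in underlying_searches:
            candidates.update(search(q))

    # Keep the D videos with the most distinct candidate segments: V(c).
    votes = Counter(video_of(r) for r in candidates)
    top_videos = [v for v, _ in votes.most_common(D)]
    search_space = [r for v in top_videos for r in segments_of(v)]

    # Step 2: exact k-NN search with the combined distance over R(c).
    results = []
    for q in query_segments:
        ranked = sorted(search_space, key=lambda r: combined_dist(q, r))
        results.append((q, ranked[:k]))
    return results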
9.1.6 Evaluation
The evaluation uses the MUSCLE-VCD-2007 dataset, with reference videos from DB-MPEG1
collection and query videos from ST1 and ST2 collections. As stated in Section 5.6, R is the set
of reference segments, Q is the set of query segments. Each reference and query video was preprocessed and partitioned into segments of one second length (S1), thus |R|=211,479 and |Q|=13,942
segments.
Name | Segm. | d  | Descriptor                                          | Hd: max | µ    | σ    | ρ    | Time | MAP
KF   | S1    | L1 | KFt 11x9                                            | 25199   | 7692 | 2708 | 4.0  | 139  | 0.510
EH   | S1    | L1 | EHt 4x4-10                                          | 8101    | 3279 | 749  | 9.6  | 270  | 0.664
EIK  | S1    | γ  | L1 -EHt 4x4-10, L1 -IHt 1x4-rgb-3x16, L1 -KFt 11x9  | 3.371   | 1.50 | 0.33 | 10.1 | 1319 | 0.690
EK3  | S1    | δ  | L1 -EHt 4x4-10, L1 -KFt 11x9                        | 5.945   | 2.16 | 0.55 | 7.9  | 2254 | 0.795

Table 9.1 – Configurations used in the following experiments. The columns max, µ, σ and ρ describe the histogram of distances Hd. The combination in γ uses α-normalization (α=0.1) with equal weights, and the temporal window in δ is W =3.
The following experiments use four different configurations: KF, EH, EIK, and
EK3. Table 9.1 depicts the histogram of distances (maximum distance, mean, standard deviation
and intrinsic dimensionality) for d(x, y) with randomly sampled x ∈ Q and y ∈ R, time spent (in
seconds) by a non-parallel process which performs |Q| linear scans to retrieve the 1-NN, and the
achieved MAP according to the ground-truth. The distance γ for EIK is the combination of three
L1 distances with α-normalization, α=0.1, and equal weights. The spatio-temporal distance δ for
EK3 uses a window W =3 combining two L1 distances with α-normalization, α=0.1, and equal
weights.
Performance of the approximate search with pivots
This experiment analyzes the performance of the approximate search with pivots compared
to the linear scan. Given a query object q whose nearest neighbor retrieved by the linear scan is rq ,
the approximate search with pivots will return the same nearest neighbor depending on the value
of LBP (q, rq ). In fact, if LBP (q, rq ) is between the T ·|R| smallest values of LBP (q, r) for all r ∈ R,
then rq will be promoted to the second round, where the actual distance d(q, rq ) will be evaluated
and rq will be returned as the first nearest neighbor.
Figure 9.1 – Amount of queries whose actual nearest neighbor is among the smallest values
of LBP for |P| ∈ {1, 3, 5, 10, 20, 40, 80}, under the four configurations (KF, EH, EIK, EK3).
Figure 9.1 compares the approximation performance of LBP under the four configurations
using sets of pivots with sizes |P| ∈ {1, 3, 5, 10, 20, 40, 80}. The figure depicts the proportion
of queries for which LBP ranks the correct NN among the T ·|R| smallest values. Because the pivot
selection algorithm has a random component, the plotted value is obtained by averaging the results
of many sets of pivots of the same size. Note that the ideal result is located in the upper left
corner, where LBP would rank the correct NN in first place for 100% of the queries.
The figure supports the assumptions that the lower bounds of the nearest neighbors are
usually among the smallest lower bounds, and that the quality of the approximation depends on
the number of pivots, i.e., larger sets of pivots obtain (on average) better approximations than
smaller sets. However, the sets of pivots show diminishing returns: the improvement from 5 to
10 pivots is larger than the improvement from 10 to 20, which is larger than the improvement
from 20 to 40, and so on. The quality of the approximation also depends on the configuration: LBP
produces better approximations for KF, which is consistent with its lower ρ. On the other hand,
LBP produces worse approximations for EH and EIK, also consistent with their higher ρ. In
general, an approximation parameter T ≥ 1% achieves reasonably high accuracy.
Figure 9.2 depicts the time spent by the approximate search in proportion to the time spent
by the linear scan. The time spent by the approximate search depends on three factors: the cost of
computing LBP (which depends linearly on |P|), the cost of computing d (Cd , which depends on
the size of the descriptors), and the number of distance evaluations to perform (controlled by T ). The
search is divided into two rounds: the first round computes LBP for every object in R and then selects
the T ·|R| objects with the smallest values, while the second round computes d only for those T ·|R|
objects. Therefore, the expected cost of the approximate search is O(|R|·|P| + |R|·log(T ·|R|) + T ·|R|·Cd ),
compared to the expected cost of the linear scan, which is O(|R|·Cd ).

Figure 9.2 – Computational time of approximate searches with pivots compared to linear scan,
for the four configurations (KF, EH, EIK, EK3) and |P| ∈ {1, 3, 5, 10, 20, 40, 80}.
The figure shows that the search times grow almost linearly with T . The rate of growth is given
by the cost of selecting the T ·|R| candidate objects plus computing their distance evaluations. In the
case of time-inexpensive distances, the cost of selecting the T ·|R| minimum estimations dominates the
search time. In fact, fixing T = 0.1 implies creating and updating a heap of size 21,147 (T · |R| objects),
which for KF turns out to be more expensive than directly computing L1 distances between 99-d
vectors. In the case of time-expensive distances, the cost of maintaining the heap is worth the saved
distance computations, hence producing a much better improvement in efficiency compared to the
linear scan. In fact, in EK3 an
approximate search using 10 pivots and T = 5% is able to retrieve the correct nearest neighbor in
90% of queries while reducing the search time to 10% of the time spent by a linear scan.
In summary, the approximate search with pivots enables performing k-NN searches using
combined distances in large datasets while achieving high effectiveness. This search was a key algorithm for
our participation at TRECVID: in TRECVID 2010 we used a configuration of T =0.001 and |P|=9,
and for 2011 we adjusted the parameters to T =0.01 and |P|=5. This tuning highly improved the
effectiveness of the search with only a minor change in the search time.
Segm. | d  | Descriptor | Hd: max | µ    | σ   | ρ
S1    | L1 | SF6        | 3169    | 2016 | 281 | 25.8
S5    | L1 | SF6        | 3204    | 2015 | 281 | 25.7
S2    | L1 | SF4        | 3070    | 2015 | 283 | 25.4
S1    | L1 | SF2        | 3127    | 2003 | 293 | 23.4
S5    | L1 | SF1        | 3122    | 1999 | 298 | 22.5

Table 9.2 – Metric spaces defined by different local descriptors and segmentations.
Figure 9.3 – Effectiveness of approximate search with pivots (frame-to-frame MAP and copy
detections without false alarms) for local description SF4 and segmentation S3, varying |P|, m and T .
Approximate search with pivots for local descriptors
The following experiments analyze the performance of the ApMatches function, which is our
approach for the approximate search with pivots for local descriptors. The performance of ApMatches
depends on the search parameter m (the number of nearest neighbors to retrieve for each local vector)
and on the approximation parameter T .
Table 9.2 shows five samples of metric spaces defined by different segmentations and local
descriptors. The table shows that these metric spaces are fairly similar, with a small increase in the
intrinsic dimensionality when the frames are reduced. The values of ρ are higher than those in
Table 9.1. We chose to compare SIFT vectors with L1 because it consistently achieves higher
effectiveness than L2 (see Section 8.3).
The following experiment measures the impact of parameters |P|, m, and T on detection
performance. Figure 9.3 shows the performance of ApMatches using parameters |P|={3, 10}, m={1,
10, 100}, and T ={0.0001, 0.001, 0.01, 0.1}. The segmentation is fixed to S3 and local description
to SF4. The figure shows that m=1 achieves the best overall effectiveness. Although m=10 achieves
a slightly better MAP for |P|=10, m=1 detects more copies without false alarms. A P with more
pivots improves the effectiveness, however the difference in effectiveness between |P|=3 and |P|=10
is reduced as T increases.
The next experiment measures the performance of approximate search for local descriptions
SF2, SF4 and SF6 fixing parameters m=1 and |P|=3. The performance of ApMatches is compared
with Matches. Figure 9.4 shows the frame-to-frame effectiveness and detection effectiveness achieved
by each configuration together with the total time spent in the |Q| searches. The effectiveness of
Matches is also shown in Figure 7.5 on page 87. The influence of T on the effectiveness is clear:
T =0.01 achieves a convenient balance between effectiveness and efficiency. In fact, ApMatches
with T =0.01 outperforms Matches, detecting 30 out of 31 copies instead of 28, while being
one order of magnitude faster.
The experiment shows that ApMatches can outperform Matches. The fact that an approximate search outperforms the exact search may be counterintuitive. The reason for this behavior is
that the matched pairs in ApMatches are stricter than the matched pairs in Matches. For instance,
given two segments q and r, Matches may match local vector x from q with local vector y from
r when y is the nearest neighbor among the vectors in r and d(x, y) satisfies the distance ratio
s. It considers every segment independently of the others, thus the value of Matches(q, r) does
not affect the value of Matches(q, s), for another segment s. On the other hand, ApMatches may
match vectors x and y when y is the nearest neighbor considering the whole collection of vectors
RL (m=1). Therefore, the value of ApMatches(q, r) can indeed affect the value of ApMatches(q, s)
because the vectors in q paired with a vector in r cannot be also paired with another vector in s.
The experiment also shows a limit for the approximate search on SIFT vectors. Given a
configuration that achieves good effectiveness but requires too much time, the search time can be
reduced by decreasing the approximation parameter. However, if T decreases below 0.001 the search
time is slightly reduced but the effectiveness may be ruined. An alternative is to represent each
frame by fewer vectors (e.g., to use SF4 instead of SF2), which also decreases the effectiveness but
gives a larger gain in efficiency. However, a reduction beyond SF6 also severely impacts the effectiveness.
In fact, for very small frames a global descriptor is more meaningful than local descriptors.
In summary, the approximate search with pivots is able to both reduce the search times and
achieve high detection effectiveness. However, its performance has a limit mainly determined by:
1) the amount of local vectors to search in (which in turn depends on the dataset size), and 2) the
intrinsic dimensionality of the metric space created by the descriptor (which for SIFT vectors and L1
is consistently high). In a medium-sized dataset, like MUSCLE-VCD-2007, it can successfully reduce
search times by about one order of magnitude while still achieving high effectiveness, but it is an open
issue to successfully use it on larger collections (like the TRECVID datasets). Moreover, it should
be noted that in the case of approximate searches using SIFT descriptors, the multidimensional
indexes can achieve much better effectiveness-versus-efficiency tradeoff than metric indexes. In fact,
the main benefit of using metric indexes is to improve the efficiency of complex distances, hence
a time-inexpensive distance between 128-d vectors does not provide the best scenario to evaluate
them. More details on this issue are given in Section 9.3 and Section 12.2.
Figure 9.4 – Performance of approximate search for local descriptors and exact search: frame-to-frame
MAP, copy detections without false alarms, and search time (hours) per segmentation, for local
descriptions SF2, SF4 and SF6 (ApMatches with m=1, |P|=3, T ∈ {10−2 , 10−3 , 10−4 }, compared to
Matches with s=0.7).

9.2 Efficiency for Streams of Exact Searches
In our CBVCD system, the similarity searches are resolved for every segment in a query video,
thus it can be expected that two consecutive segments from the same video are similar (especially
for fine-grained segmentations). In that case the distance between ith and (i+1)th segments should
be small, and also the nearest neighbor of ith segment may be close to the nearest neighbor of
(i+1)th segment.
Based on this idea, the efficiency of the search can be improved by using the ith query
segment as a dynamic pivot to resolve the (i+1)th search. This requires a dynamic pivot table
that stores the distances between the previous query segment and every reference segment, hence if
query segments are nearby, the stored distances may produce tight lower bounds that discard most
of the distances in the (i+1)th search. However, in that case the (i+1)th query segment will be a
poor pivot for the (i+2)th segment, because the dynamic pivot table would have stored only a few
distances. Then, to minimize that “min-max” effect we will use a sliding window with the last p
query segments, instead of just the last one.
The D-file is a dynamic structure that can be used for this approach (see Section 3.3.1).
However, as we show in the experimental section, the D-file suffers from high internal complexity.
The main problem arises when the distance function is not time-expensive. In that case, the internal
complexity associated with the hash function and collision resolution dominates the search times
rendering it unviable to use in many scenarios. In order to solve this problem, we introduce the
Snake Table that preserves the idea and advantages of D-file and D-cache, but exhibits lower internal
complexity.
9.2.1 Snake Table
The life cycle of the Snake Table is as follows: First, when a new session starts, an empty
Snake Table is created and associated with it. When a query object q1 is received, a k-NN search
is performed. The distances between q1 and the objects in the collection are added to the Snake
Table, and the result is returned. When a new query object qi is received, a k-NN search is performed
using the previous query objects qi−p , ..., qi−1 as pivots to accelerate the search. Finally, when the
session ends, the Snake Table is discarded. Therefore, like the D-file and unlike most MAMs, the
Snake Table is a session-oriented and short-lived MAM (see Figure 9.5).
Figure 9.5 – Snake Tables created for streams of queries Q1 and Q2 , one table per session over the collection R.
The Snake Table is implemented with a fixed-size |R|×p matrix used as a dynamic pivot
table. As in LAESA, the j th row in the dynamic pivot table represents the object oj in R and
contains the distances between oj and up to p previously processed query objects. Each cell in
the j th row of the table contains a pair (q, d(q, oj )) for some query object q (not necessarily in
order). When processing a new query object qi , the lower bound LBP (qi , oj ) is computed, with P
dynamically determined by the query objects and distances in the j th row. As in the exact search
with pivots, the object oj is discarded when LBP (qi , oj ) is greater than the distance between qi and
the current k th nearest neighbor candidate (obtained among o1 , ..., oj−1 ). If oj is not discarded,
the actual distance d(qi , oj ) is computed, added to some cell in the j th row, and the NN candidates
are updated if necessary.
We present three different replacement strategies to assign a distance d(qi , oj ) to one of the
p cells in the j th row:
1. Sparse/FIFO: Each query qi picks a column in round-robin mode, i.e., the distance d(qi , oj )
is stored in the (i mod p) column of the j th row, eventually replacing the stored distance
d(qi−p , oj ). If the distance was not evaluated because it was discarded by LBP (qi , oj ) then
the corresponding cell is not used. This behavior can be implemented following two options:
1) the cell is updated with an ∞ distance; or 2) the cell is left unmodified, but before any
read operation the query stored in the cell is matched with the last query for that column
(the experimental section uses the latter). This strategy produces sparse rows containing at
most p distances between d(qi−p , oj ) and d(qi , oj ). As a consequence, if p=1 and most of the
distances for qi−1 were discarded, then qi will achieve poor efficiency. In order to diminish
this “min-max effect” a larger p or a different strategy should be chosen.
2. Compact/Unsorted: The distance d(qi , oj ) is compared to every distance in the j th row and
the cell with the minimum distance is replaced, independently of its position in the row. With
this strategy, every row stores the highest p distances between d(q1 , oj ) and d(qi , oj ) that have
not been discarded.
3. Compact/FIFO: The distance d(qi , oj ) is stored in a cell chosen in an independent round-robin
for every row. With this strategy, the j th row stores the last p computed distances for oj ,
discarding the old ones. LBP starts its evaluation from the last stored distance and goes
backwards, therefore favoring the most recently stored distances.
For strategy Sparse/FIFO, distances d(qi , qj ) with j ∈ {i−p, ..., i−1} are calculated and stored
in memory at the beginning of every search. For strategies Compact/Unsorted and Compact/FIFO,
distances d(qi , qj ) with j ∈ {1, ..., i − 1} are calculated on-demand by LBP . Note that Sparse/FIFO
uses a global index to mark the current column where computed distances are stored, Compact/FIFO
uses an independent index for each row, and Compact/Unsorted does not need any index because it
determines the cell to replace when it calculates each lower bound.
D-file uses a combination of Sparse/FIFO and Compact/Unsorted. It always replaces an old
distance (older than qi−p ), but if there is no old distance in the collision interval, it replaces the
worst distance (defined as the distance closer to the median or some predefined percentile).
The performance achieved by these three replacement strategies is compared in the experimental section. However, it should be stressed that, regardless of the replacement strategy, the overall
performance of the Snake Table mainly depends on the distribution of the query objects.
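The following Python sketch illustrates a Snake Table with the Compact/FIFO strategy (an illustrative sketch, not the actual implementation): each row keeps the last p computed pairs (previous query, distance), and those previous queries act as dynamic pivots through the lower bound |d(q, q′) − d(q′, o)| ≤ d(q, o).

class SnakeTable:
    def __init__(self, collection, dist, p):
        self.collection = collection
        self.dist = dist
        self.p = p
        # Row j keeps up to p pairs (previous query, d(previous query, o_j)).
        self.rows = [[] for _ in collection]

    def knn(self, q, k):
        cache = {}                                   # d(q, q') for each dynamic pivot q'
        def d_to_pivot(qp):
            if id(qp) not in cache:
                cache[id(qp)] = self.dist(q, qp)
            return cache[id(qp)]

        nns = []                                     # sorted list of (distance, index)
        for j, obj in enumerate(self.collection):
            row = self.rows[j]
            # Lower bound from the stored dynamic pivots, most recent first.
            lb = max((abs(d_to_pivot(qp) - dqo) for qp, dqo in reversed(row)),
                     default=0.0)
            if len(nns) == k and lb >= nns[-1][0]:
                continue                             # safely discarded, row not updated
            d = self.dist(q, obj)
            row.append((q, d))                       # Compact/FIFO: keep the last p distances
            if len(row) > self.p:
                row.pop(0)
            nns.append((d, j))
            nns.sort()
            del nns[k:]
        return nns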
9.2.2 Snake Distribution
The Snake Table is intended to be used when the query objects in a stream fit a “snake
distribution”. Intuitively, we define that a set of objects fits a snake distribution when the distance
between two consecutive objects in Q is small compared to the median distance between any two
objects (see Figure 9.6). In order to measure and compare the degree of fitness, we define an
indicator using the histogram of distances of d for Q and R.
Definition 7 (Difference ∆) Let F1 and F2 be two cumulative distributions; the difference ∆
between F1 and F2 is defined as:

∆(F1 , F2 ) = ∫_0^∞ ( F1 (t) − F2 (t) ) dt
Figure 9.6 – Stream of queries Q = {q1 , ..., q12 } with a snake distribution. Most distances
d(qi , qi+1 ) are smaller than d(x, y) for randomly selected pairs x, y in R.
The Difference ∆ is meaningful only when both F1 and F2 originate from the same metric
space. Note that ∆(F1 , F2 ) is greater than zero when the distances accumulated by F1 are smaller
than distances accumulated by F2 .
Definition 8 (Snake Distribution) Let M = (D, d) be a metric space, let R ⊂ D be the collection of objects, and let Q ⊂ D be a set of m query objects Q = {q1 , ..., qm }. Let F be the cumulative
distribution of d(x, y) with random pairs x, y ∈ Q ∪ R, p be a number between 1 and m-1, and FQp
be the cumulative distribution of d(qi , qi−p ) ∀ i ∈ {p + 1, ..., m}. Q fits a snake distribution of order
p if ∆(FQp , F ) > s, for some threshold value s ∈ R+ .
Note that when both Q and R are random samples of D without any special ordering (i.e.,
the ith sample does not depend on the previous samples), then ∆(FQp , F ) ≈ 0. When a distribution
fits a snake distribution of order 1 to p, then a Snake Table can be created with a sliding window
containing up to p query objects.
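A small Python sketch of this fitness test is shown below. It relies on the fact that, for non-negative distances, ∫ (F1 (t) − F2 (t)) dt equals the difference of the means of the two distributions, so ∆ can be estimated from sampled distances (distances are assumed to be normalized, e.g. to [0, 1]; names and the sample size are illustrative):

import random

def snake_delta(queries, collection, dist, p, samples=1000):
    # Mean distance between lag-p consecutive queries (distribution F_Q^p).
    lagged = [dist(queries[i], queries[i - p]) for i in range(p, len(queries))]
    # Mean distance between random pairs of objects (distribution F).
    pool = queries + collection
    randoms = [dist(random.choice(pool), random.choice(pool)) for _ in range(samples)]
    # Delta(F_Q^p, F) > s (for some threshold s > 0) indicates that Q fits a snake
    # distribution of order p, i.e. consecutive queries are closer than random pairs.
    return sum(randoms) / len(randoms) - sum(lagged) / len(lagged)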
9.2.3 Evaluation
The following experiments compare the improvements in the efficiency achieved for six index
structures using either static pivots or dynamic pivots. The number of pivots for each index varies
between 1 and 20. The evaluated indexes are:
• D-file: It uses a D-cache with a fixed size hash table of n = 10 · |R| · p cells, collision interval
1, and hash function h(qi , oj )=(rndi ∗ rndj ) mod n, where rndi and rndj are unique random
IDs assigned to each object.
• LAESA-based: Following its definition, LAESA does not require any information of the query
objects, but for a fair comparison, we allow LAESA to use Q in the selection process. LaesaQ
chooses p static pivots from Q, and LaesaR chooses p static pivots from R. Both selections
are performed using the selection algorithm depicted in Algorithm 9.2.
• Snake Table: We test the three strategies depicted in Section 9.2.1. SnakeSF uses the Sparse/FIFO
strategy (sparse row with the last p queries), SnakeCU uses the Compact/Unsorted strategy (compact row containing p prior queries in no particular order), and SnakeCF uses the Compact/FIFO
strategy (compact row with the last p evaluated distances for each object).
We evaluate these indexes under four configurations: KF, EH, EIK, and EK3 (see Table 9.1). These configurations are split into two groups: Group 1 (KF and EH) contains the
configurations where the linear scan takes the least time, and Group 2 (EIK and EK3)
contains the configurations in which the linear scan is about one order of magnitude slower.

Figure 9.7 – Snake distribution of order p for the four configurations.
Snake Distribution
Figure 9.7 depicts the snake distribution of order p for the four configurations. The value
of difference ∆(FQp , F ) for p ∈ {1, ..., 20} is shown. The four configurations present a difference ∆
higher than zero, hence the streams of queries fit a snake distribution, i.e., distances between qi
and qi−p are smaller than distances between random pairs. The four configurations present higher
fitness for p close to 1, and as p increases the snake distributions tend to disappear. As shown in
the following experiments, the different configurations present satisfactory results with p roughly
between 1 and 5 pivots.
Group 1
Figure 9.8 shows the efficiency achieved by the six indexes under the two fastest configurations. It depicts the amount of distance computations and search time achieved by each index, as a
proportion of the linear scan, when the number of pivots varies between 1 and 20. These values only
consider the online phase, i.e., they do not include the distance computations required by LAESA
to select pivots and build the pivot table.
The disparity between saved distances and saved time reveals the internal complexity of each
index. Because these configurations use fast distances, an index must have low internal complexity
in order to outperform the linear scan, otherwise it will be faster to directly compute each distance
instead of trying to discard them.
In the case of static pivots, LaesaQ shows an almost identical performance as LaesaR. This
result implies that knowing a priori the set of queries does not produce a noticeable improvement
in LAESA performance. The instability in LAESA for consecutive p is due to the random nature
of pivot selection. In order to reduce this effect for each number of pivots we averaged the result
achieved by four different sets of pivots.
Figure 9.8 – Search time and distance evaluations for KF and EH (Group 1).

Configuration KF has low intrinsic dimensionality, therefore it is expected that pivots will
discard most of the distance computations. In this case, LAESA achieves the best search times even
though the Snake Table discards slightly more distances. In fact, LaesaQ and LaesaR, while discarding
80% of the distance computations, reduce the search time by about 50%, whereas SnakeCU and
SnakeCF, also discarding 80% of the distance computations, reduce it by less than 40%.
This difference reveals that a static pivot table has lower internal complexity than a dynamic pivot
table, which is an expected result due to the overhead produced by updating the pivot table.
Configuration EH has higher intrinsic dimensionality, therefore it is expected that pivots
will discard fewer distance computations than in KF. In fact, both LaesaR and LaesaQ discard just
about 25% of the distance computations and the search time is reduced by less than 10%. On the other hand,
the Snake Table can profit from the snake distribution in order to discard more distance computations:
both SnakeCU and SnakeCF discard more than 60% of the distance computations, reducing the search
time by up to 20%. This result shows the snake distribution can reduce search times even in
scenarios with high intrinsic dimensionality and fast distances.
Comparing the Snake Table replacement strategies, the compact rows of SnakeCU and
SnakeCF show superior performance to the sparse rows of SnakeSF. The “min-max” effect described in Section 9.2.1 becomes apparent: SnakeSF cannot achieve high performance when using
one or two pivots, because every discarded distance implies an empty cell in the pivot table, which
affects the performance of subsequent searches. This undesired effect of the Sparse/FIFO strategy decreases as the number of pivots increases. On the other hand, the Compact/Unsorted and Compact/FIFO
strategies do not suffer the “min-max” effect because their compact rows prevent empty cells from
harming the performance.
Figure 9.9 – Search time and distance evaluations for EIK and EK3 (Group 2).
The D-file presents very high internal complexity: it discards as many distance computations
as SnakeSF, but its search times are always higher than the linear scan. In fact, the search using
the D-file takes more than twice the time of the linear scan for EH, and more than three times for
KF. This high internal complexity is due to the cost of computing a hash function instead
of directly accessing a table cell (as SnakeSF does). Moreover, the D-file discards slightly fewer
distance computations than SnakeSF due to collisions in the hash table, which
may overwrite already computed distances.
In summary, LAESA can achieve high performance in scenarios with fast distance and low
intrinsic dimensionality. However, in scenarios with higher intrinsic dimensionality, the Snake Table
achieves the best performance due to its better pivot selection (profiting from the snake distribution).
Group 2
Figure 9.9 shows the performance achieved by the six indexes under configurations with
time-expensive distances. It depicts the amount of distance computations and time spent by the
search, as a proportion of the linear scan. As in the previous experiment, the values for LAESA do not
consider the cost of building the index structures.
These scenarios produce a higher correlation between discarded distances and search time
savings. In fact, the LAESA search time is less affected by the number of pivots, while the Snake Table's
search time starts to increase beyond about five pivots. Moreover, the D-file using a few pivots outperforms
the linear scan; however, as the number of pivots increases, the search times quickly exceed the
linear scan due to the high cost of computing the hash function for each pivot.
The existence and the exploitation of a snake distribution in the query objects becomes
a remarkable approach to improve the efficiency of exact searches. In the case of EK3, LAESA
reduces the search time by 50%, while the Snake Table can reduce the search times by more than
60%. In the case of EIK the difference is even greater, LAESA reduces the search time by 40%,
while the Snake Table can reduce the search times by up to 70%.
In summary, in scenarios with a time-expensive distance and a query set that fits a snake
distribution, the Snake Table achieves much higher performance than LAESA. The best performance
for the Snake Table is achieved with between three and five pivots, using the Compact/Unsorted or
Compact/FIFO replacement strategies.
Discussion
The high performance of the Snake Table compared with LAESA and D-cache is due to
its properties of dynamic selection of pivots and low internal complexity. The Snake Table is able
to exploit snake distributions in order to reduce the search time for both fast and time-expensive
distances even in spaces with high intrinsic dimensionality. In particular, the Snake Table is a better
alternative than D-file in the tested scenarios.
The Snake Table presents an approach to index spaces when consecutive queries are similar
to each other. This behavior usually arises in content-based video retrieval (when the queries are
consecutive keyframes), interactive multimedia retrieval systems (when the user selects a new query
object from the answers of a previous query), and similarity searches using local descriptors. In
a more general domain, given an unsorted set of queries, the test of snake distribution presented
in this work may be useful to determine an optimal ordering of queries which will achieve a high
performance in the Snake Table.
One usage of the Snake Table is to create an index for each stream of queries. When a
user starts a session, an empty Snake Table is associated with it. As the user performs queries
with snake distribution, the index improves its performance because it will select pivots close to
following queries. However, the Snake Table is not memory-efficient, as it requires space proportional
to the size of the dataset and to the number of connected sessions. This approach is more suitable
for medium-sized databases with long k-NN streams. Moreover, because it does not need to use a
central shared index structure, it is also suitable for highly dynamic datasets.
On one hand, pivots in a sliding window with a snake distribution satisfy one desirable property: they are close to either the query or the collection objects. On the other hand, those
pivots do not satisfy another desirable property: they should be far away from each other. Hence,
using a Snake Table with many pivots will only increase the internal complexity without increasing
the efficiency, because the pivots will be mostly redundant. In order to overcome this issue, the Snake
Table and LAESA can seamlessly be combined into a single pivot table containing both static
and dynamic pivots. Moreover, the SSS algorithm can also be combined with the Snake Table by
fixing a pivot when it is far away from all the previous ones. This combined approach profits
from both dynamic pivots close to the queries and non-redundant static pivots.
LAESA can benefit from multi-core architectures by sharing the pivot table and resolving
each query in a different thread. In the case of the Snake Table, in order to efficiently resolve parallel
queries we recommend partitioning the queries into independent subsets and resolving each subset
with a Snake Table in an independent thread.

Name | Segm. | d  | Descriptor         | dim. | ρ    | MAP
AU   | S1    | L1 | AU160              | 160  | 7.5  | 0.734
EH   | S1    | L1 | EHt 4x4-10         | 160  | 9.6  | 0.664
IH   | S1    | L1 | IHt 1x4-rgb-3x16   | 192  | 8.6  | 0.501
KF   | S1    | L1 | KFt 11x9           | 99   | 4.0  | 0.510
OM   | S1    | L1 | OMt 9x9            | 81   | 6.2  | 0.339
SF   | S5    | L1 | SF6                | 128  | 24.6 | —

Table 9.3 – Configurations used in the comparison between multidimensional and metric indexing.
9.3 Comparison with Multidimensional Indexes
In this section we evaluate the performance of the presented techniques compared to state-of-the-art search techniques. The main strength of the Approximate Search with Pivots and the Snake
Table is that they can work with general distances and any kind of descriptor, as long as the distance
satisfies the metric properties. On the other hand, most state-of-the-art techniques are designed
to work with multidimensional vectors and the Euclidean or Manhattan distance. Hence, because that
scenario is a particular case of metric spaces, we use configurations based on multidimensional
vectors in order to make the comparison. However, according to Section 3.4, the metric approach
achieves lower performance in those scenarios than multidimensional indexes.
Muja and Lowe [2009] present the FLANN library (Fast Library for Approximate Nearest
Neighbors), which efficiently implements different indexing techniques for multidimensional vectors. In particular, we evaluate two families of algorithms: the kd-tree and the k-means tree (described in
Chapter 3).
In the following experiments we evaluate multidimensional and metric indexes under different
descriptors. Table 9.3 summarizes the descriptors: four global descriptors (EHt 4x4-10, OMt 9x9, KFt 11x9,
and IHt 1x4-rgb-3x16) and one acoustic descriptor (AU160), all of them computed from preprocessed
videos with segmentation S1, considering query videos from the ST1 and ST2 collections; and the local descriptor
SF6, computed from preprocessed videos with segmentation S5 and query videos from the ST2 collection.
The table shows the number of vectors for the query set Q, the number of vectors for the reference
dataset R, the vector dimensionality, the intrinsic dimensionality, and the MAP achieved by the
linear scan according to the ground-truth (already shown in Figure 7.3 on page 82).
In this experiment the linear scans are resolved by FLANN, which provides an efficient
implementation of distance computations. This produces a much faster scan than the one used in
previous experiments, which is the reason why the results of the following experiments are not
directly comparable with the results of the previous experiments in this chapter.
Figure 9.10 – Search time spent by the exact search implemented by different multidimensional
and metric indexes.
9.3.1 Exact search
This experiment evaluates the performance achieved by different indexes for exact searches,
i.e., when the search is required to retrieve the same nearest neighbors as the linear scan. Therefore,
in order to compute effectiveness the ground-truth is set to the nearest neighbor retrieved by the
linear scan.
Figure 9.10 summarizes the performance of the multidimensional indexes, kd-tree and k-means
tree (with branching k=10 per node), and of the metric indexes, LAESA (using 1, 3, and
10 pivots selected by SSS) and the Snake Table (using 2 dynamic pivots with replacement strategy 2,
Compact/Unsorted). The figure compares the time spent by the exact search relative to the time spent
by the linear scan under the six tested configurations.
The experiment shows that the exact search is not favorable for multidimensional indexing. In
fact, in every tested scenario the multidimensional indexes are slower than the linear scan, with the
exception of the k-means tree for the IH configuration (4% faster than the linear scan). This result evinces
the impact of the curse of dimensionality on multidimensional indexing: in order to resolve an exact search,
every tree node that may potentially contain an object closer than the NN must be visited. In
high dimensional spaces most space regions have some intersection with a query region, forcing
the search to visit almost the whole tree. We must note that an early termination of the search
indeed retrieves the correct NN most of the time (see the next experiment), but forcing the search to
guarantee the highest effectiveness inevitably ruins the efficiency.
On the other hand, metric indexes are able to reduce the search time by about 20% to 60% without affecting the effectiveness. As noted in previous experiments, the number of pivots that LAESA must select in order to achieve the best efficiency depends on two factors: the intrinsic dimensionality, and the cost of the distance (which in turn depends on the vector dimensionality). The former influences the quality of the lower bounds: a high intrinsic dimensionality forces the selection of many pivots in order to save some distance computations. The latter sets an upper bound for the internal cost: a cheap distance forces the use of only a few pivots to achieve some gain compared to the linear scan. The tested scenarios use a cheap distance, therefore the best balance is achieved with few pivots. In fact, the figure shows that one and three pivots always outperform ten pivots. In scenarios with higher intrinsic dimensionality (EH and IH) the Snake Table is able to outperform LAESA because it profits from the snake distribution, reducing the impact of
intrinsic dimensionality in the quality of lower bounds. The Snake Table achieves low performance
at SF (6% slower than the linear scan) because SIFT descriptors do not produce a profitable snake
distribution.
In summary, the exact search shows that metric indexing can outperform the linear scan without affecting effectiveness. On the other hand, multidimensional indexing necessarily has to reduce the effectiveness in order to achieve high efficiency. The next experiment analyzes this issue. An interesting issue that needs more investigation is that multidimensional indexing and metric indexing do not show the same behavior across configurations. For example, metric indexes achieve a relatively similar performance for IH and KF, but multidimensional indexes achieve much better performance at IH than at KF. A similar issue occurs between AU and EH: almost the same performance for metric indexes, but AU is more difficult for multidimensional indexes than EH.
9.3.2 Approximate search
In this experiment we compare the performance of multidimensional and metric indexes for
approximate searches. Approximate search algorithms rely on one or more parameters to control
the effectiveness-vs-efficiency tradeoff. The evaluation method fixes approximation parameters,
performs the search, and measures the search time and the amount of identical nearest neighbors
compared to the linear scan. Following this method for different approximation parameters we
produce a curve that reveals the tradeoff between search time and effectiveness cost.
Figure 9.11 shows the effectiveness-versus-efficiency tradeoff for different indexes. The figure compares the time spent by the approximate search relative to the time spent by the linear scan, and the amount of correct nearest neighbors retrieved, under the six tested configurations. Three types of indexes are evaluated: the kd-tree, using a single tree and ten trees; the k-means tree, using branching factors of 10 and 20 per node; and LAESA, using 3, 5, 10 and 20 pivots selected by SSS. Additionally, the search time achieved by the best exact search is included for each configuration (see Figure 9.10). The exact search sets an upper bound for the search time that an approximate search can spend.
The experiment shows that multidimensional indexes achieve a much better effectiveness-versus-efficiency tradeoff than metric indexes. In fact, multidimensional indexes achieve a 90/10 tradeoff (i.e., they retrieve more than 90% of the correct NNs requiring less than 10% of the time spent by the linear scan) at every tested configuration. Moreover, in the case of SF, the approximate search using ten kd-trees achieves a surprising 97/3 tradeoff (i.e., it retrieves 97% of the correct NNs while reducing the search time to less than 3% of the linear scan).
On the other hand, the approximate search with pivots using LAESA achieves a less satisfactory effectiveness-versus-efficiency tradeoff. In fact, the metric approximate search achieves a 70/30 tradeoff (i.e., a precision near 70% requiring about 30% of the time spent by the linear scan) for the OM and EH configurations, improving up to an 80/20 tradeoff for the AU, IH, KF, and SF configurations. Those tradeoffs are commonly achieved by setting the approximation parameter T between 1% and 5%, i.e., by estimating and discarding more than 95% of the objects and evaluating less than 5% of the distances. Increasing the effectiveness of the search may produce a large increase in search time, even approaching the time of the exact search. Trying to improve the efficiency by reducing T beyond that interval may produce a large decrease in effectiveness, even ruining the search results.
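The following is a minimal sketch of this kind of pivot-based approximate search, under simplifying assumptions (L1 distance and random pivot selection, whereas the thesis selects pivots with SSS): the pivot table stores d(p, r) for every reference object, the lower bound max_p |d(q, p) − d(p, r)| is used as the estimator, and only the fraction T of objects with the smallest estimation is evaluated with the actual distance.

import numpy as np

def l1(a, b):
    # L1 (Manhattan) distance, broadcast over rows
    return np.abs(a - b).sum(axis=-1)

def approx_nn(q, R, pivots, pivot_table, T=0.01):
    """1-NN of q in R, evaluating the real distance on a fraction T of R only."""
    d_qp = l1(pivots, q)                                   # |P| real distances (internal cost)
    lower_bounds = np.abs(pivot_table - d_qp).max(axis=1)  # LB(q, r) for every r in R
    n_eval = max(1, int(T * len(R)))
    cand = np.argpartition(lower_bounds, n_eval - 1)[:n_eval]
    dists = l1(R[cand], q)                                 # real distances, candidates only
    best = np.argmin(dists)
    return cand[best], dists[best]

rng = np.random.default_rng(0)
R = rng.random((50000, 160))
P = R[rng.choice(len(R), size=5, replace=False)]           # 5 pivots (random here)
table = np.stack([l1(R, p) for p in P], axis=1)            # precomputed d(p, r)

q = rng.random(160)
idx, dist = approx_nn(q, R, P, table, T=0.01)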
The experiment proves the high performance that multidimensional indexes can achieve in the particular case of configurations based on single multidimensional vectors.
Figure 9.11 – Effectiveness-versus-efficiency tradeoff for multidimensional and metric indexes (kd-tree, multi kd-tree with n=10, k-means tree with k=10 and k=20, LaesaR with |P|=3, 5, 10, and 20). Each panel (AU, EH, IH, KF, OM, SF) plots the search time relative to the linear scan against the 1-NN precision. Exact search is the best time achieved by a metric index in Figure 9.10.
Algorithm 9.5: k-NN search by distance aggregation of partial nearest neighbors.
Input: {NNd1, ..., NNdn} partial nearest neighbors according to distances d1 to dn, {w1, ..., wn} weights for the underlying distances, k amount of NN to retrieve.
Output: the k nearest neighbors according to DistAgg.
NN ← new priority queue
foreach r ∈ R do
    foreach di ∈ {d1, ..., dn} do
        if ∃(r, ·) ∈ NNdi then
            retrieve (r, valdi) from NNdi
        else
            valdi ← defaultVal            // missing object
    dist ← Σi wi · valdi                  // compute DistAgg
    add (r, dist) to NN
    if size of NN > k then
        remove max distance object from NN
return NN
Therefore, the metric approach is intended to be used in scenarios where the highest effectiveness is desired and/or it is impossible to use a multidimensional index. Two examples of such scenarios are: 1) the design of a complex distance that combines many descriptors in order to achieve higher effectiveness, which makes the use of multidimensional indexes unviable; and 2) a search space small enough to resolve exact searches in reasonable time, as is the case of the global descriptors and the MUSCLE-VCD-2007 dataset.
9.3.3 Fusion of descriptors
In Chapter 8 we defined the distance γ(q, r) = Σ_{i=1..n} wi · di(q, r), and we experimentally proved that γ outperforms the effectiveness of any single di. An efficient search based on γ requires a metric index, however the underlying distances di are based on vectors. In the previous section we showed that multidimensional indexes present a much better tradeoff for approximate searches, thus it is desirable to exploit them when the underlying descriptors are vectors.
In this section we compare some techniques to efficiently resolve similarity searches based on
a linear combination of distances. They take the result of independent k-NN searches for each of the
n underlying distances (referred to as “partial” nearest neighbors), combine them in some way, and
produce the final list of fused nearest neighbors. This design enables replacing the similarity search based on γ with the fusion of n searches, which may be efficiently resolved by multidimensional indexes. This approach requires that each di be a distance supported by multidimensional indexes, such as the Minkowski distances.
The first fusion approach is the distance aggregation, which normalizes and sums the distances for the partial nearest neighbors. Algorithm 9.5 shows an implementation of the distance
aggregation. It computes the nearest neighbors using the function:
DistAgg(q, r) = Σ_{i=1..n} wi · di(q, r)

where di(q, r) is taken from the corresponding partial list if r appears in the k-NN list according to di, and is replaced by the constant defaultVal otherwise.
The weights wi must normalize the distances from di . An algorithm to automatically compute those weights is the α-normalization, described in Chapter 8.
If the n partial lists contain all the objects in R (i.e., they are the result of k-NN searches
with k=|R|) then DistAgg retrieves the same nearest neighbors as γ. However, computing and
storing searches with k=|R| is usually unviable, thus partial NN lists use a small k and the fusion
algorithm will miss some di (q, r). In those cases, the algorithm replaces the missing values by a
constant defaultVal, which should be large enough to represent an object that is farther than the
k th nearest neighbor.
The second fusion approach consists in directly merging the n partial NN lists and sorting
them by distance. The distances must first be normalized by wi in order to make them comparable.
In case of duplicated nearest neighbors, only the occurrence with minimum distance remains and the
rest is discarded. In fact, this merging algorithm computes nearest neighbors according to distance:
DistMin(q, r) = min_{i=1..n} wi · di(q, r)
The third fusion approach corresponds to the well-known rank aggregation. In this case, a voting algorithm is performed over the partial NN lists, where each nearest neighbor adds one vote weighted by its position in the list. In this experiment, we fix the weights as the inverse of the rank, i.e., the first NN adds 1, the second NN adds 0.5, and so on:
RankAgg(q, r) = Σ_{i=1..n} Σ_{j=1..k} vote_{i,j}(r)

where vote_{i,j}(r) = 1/j if r is the j-th NN according to di, and 0 otherwise.
The rank aggregation does not consider the distance values from the partial NN lists, hence
there is no need for any normalization. The most similar objects are the ones that achieve more
votes, therefore RankAgg is a similarity function.
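As an illustration, a minimal Python sketch of the three fusion criteria follows (not the thesis implementation); each partial result is assumed to be a list of (object, distance) pairs sorted by increasing distance, and the weights wi are assumed to be already computed (e.g., by the α-normalization).

from collections import defaultdict

def dist_agg(partials, weights, default_val, k):
    """DistAgg: weighted sum of distances, missing objects get default_val."""
    objects = {r for nn in partials for r, _ in nn}
    scores = {}
    for r in objects:
        total = 0.0
        for nn, w in zip(partials, weights):
            d = dict(nn).get(r, default_val)   # missing object -> defaultVal
            total += w * d
        scores[r] = total
    return sorted(scores.items(), key=lambda x: x[1])[:k]

def dist_min(partials, weights, k):
    """DistMin: keep the minimum normalized distance per object."""
    best = {}
    for nn, w in zip(partials, weights):
        for r, d in nn:
            best[r] = min(best.get(r, float("inf")), w * d)
    return sorted(best.items(), key=lambda x: x[1])[:k]

def rank_agg(partials, k):
    """RankAgg: each list votes 1/j for its j-th nearest neighbor."""
    votes = defaultdict(float)
    for nn in partials:
        for j, (r, _) in enumerate(nn, start=1):
            votes[r] += 1.0 / j
    return sorted(votes.items(), key=lambda x: -x[1])[:k]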
Figure 9.12 compares the effectiveness achieved by DistAgg, DistMin, and RankAgg when
combining consecutively from one up to eight descriptors. As a baseline, it shows the effectiveness of
the exact search according to γ distance (these results were also shown in Figure 8.4 on page 99). The
weights wi from γ were computed by the α-normalization with α=0.1. Two scenarios are evaluated:
when the partial NN lists are computed by exact searches, and when the partial NN lists are
computed by approximate searches. In the case of exact searches, the distance aggregation performs
best when combining two or three descriptors, and then it is highly affected by the fourth descriptor,
which behaves as a noisy (spammer) distance. Remarkably, the merge by minimum distance is not
harmed by GH, however its performance decreases steadily from the fifth descriptor. The rank
aggregation achieves a low effectiveness when combining few descriptors, but its performance highly
improves from the fifth descriptor, achieving the highest MAP overall (0.768) at the sixth descriptor.
The comparison shows that DistAgg outperforms the γ distance. This result shows a strength of the late fusion: cutting the partial NN lists to the top k achieves better performance than using the whole |R| set.
Figure 9.12 – Effectiveness (MAP) of the incremental combination from one up to eight descriptors (EH, +IH, +KF, +GH, +EH5, +OM, +CH, +G14), comparing DistAgg, DistMin, RankAgg, and the distance γ, for the fusion of exact partial-NNs and the fusion of approximate partial-NNs.
A high value of k may admit too many noisy objects into the partial lists, while a too small k may drop many correct objects. In fact, the results shown in the figure were achieved with k=10, which presented the best balance.
The key difference between DistAgg and DistMin is that the former privileges objects with many repetitions in the partial NN lists, while the latter privileges the closest NN according to any di. Selecting objects by the minimum distance is a criterion that may at first give some robustness against noisy objects, however in the long term it fails at selecting the best objects.
The rank aggregation achieves a low performance with few descriptors. Therefore, the distance values contain relevant information for the fusion that should not be discarded when combining only a few lists. Indeed, DistAgg and DistMin outperform RankAgg when combining between two and four descriptors, where DistAgg achieves a MAP even below that of a single descriptor. However, in the long term, rank aggregation is able to achieve higher performance than the distance fusions, because when the distance values come from many search spaces they tend to be noisy and lose their relevance.
Finally, the fusion of approximate searches was evaluated by resolving the k-NN searches with a k-means tree with branching factor 20 (which achieved high performance in Figure 9.11). The same algorithms DistAgg, DistMin, and RankAgg are evaluated, but replacing the exact searches by the results of approximate searches. The figure shows that the MAP decreases by less than ten percent compared to the fusion of exact searches. On the other hand, the search times are reduced to about one fourth of the time spent by an exact search resolved with a metric index. The behavior of the algorithms is fairly similar to the exact searches, with RankAgg achieving the highest MAP (0.727) at six descriptors.
In summary, this experiment shows that the combination of partial results can improve both the effectiveness of the search and its efficiency. Computing partial results for each modality enables building high performance indexes. The use of partial lists discards many irrelevant objects, increasing the effectiveness of the search. The fusion algorithm may profit from the distance values in order to increase the effectiveness, in which case the α-normalization achieves appropriate results. In the particular case of a combination of multidimensional vectors, the fusion may benefit from the effectiveness-versus-efficiency tradeoff of approximate searches. In the more general case, metric indexes can be used to increase the performance of the partial exact searches.
9.4 Summary
In this chapter we have presented two main approaches to improve the efficiency of the
search.
The first approach is an approximate search using a static pivot table. The approximation
relies on a fast distance estimator to discard most of the irrelevant objects without evaluating their
actual distance. The approximate search shows a convenient tradeoff between effectiveness and efficiency, which enables drastically reducing the search time while maintaining high effectiveness. We also presented a novel approach for copy detection based on local descriptors using this approximate search. Additionally, we presented the Two-step search, an efficient approach for searching large collections using approximate searches and a complex distance.
The second approach is an exact search using a dynamic pivot table. We defined the snake
distribution and we presented the Snake Table, which dynamically selects the previous query objects
as pivots. Its efficiency was analyzed under different configurations and compared with LAESA and
D-cache. The Snake Table reduces the search times by exploiting streams of queries with snake
distributions, improving the search time under both fast and complex distances.
The two presented approaches can be combined using the Two-step search: the first step
performs approximate searches with static pivots, and the second step can use the Snake Table to
efficiently resolve an exact search in a reduced search space.
An interesting open issue is the design of an algorithm to perform approximate searches with the Snake Table. Exploiting the snake distributions to perform distance estimations
could improve the effectiveness of the approximations, even in high intrinsic dimensionality spaces.
However, when most of the distances are estimated and discarded, the Snake Table will not store
enough distances to create good estimations.
Additionally, we included a comparison of the proposed techniques with state-of-the-art implementations of widely used multidimensional indexes. The comparison shows that metric indexes can outperform multidimensional indexes in the exact search, but the latter show a much better effectiveness-versus-efficiency tradeoff in approximate searches. In the particular case of a linear combination of distances, the multiple approximate searches can be combined by a late fusion algorithm in order to achieve both high effectiveness and high efficiency.
The next chapter reviews the copy localization task, which uses the output of the similarity
search to locate a copy excerpt.
Chapter 10
Copy Localization
The copy localization process analyzes the lists of nearest neighbors retrieved by the similarity search, in order to look for chains of segments belonging to the same reference video. By using
these chains, this process reports detections (c̄, v̄, s), where c̄ is an excerpt from the query video
c ∈ C, v̄ is an excerpt from the reference video v ∈ V, and s ∈ R+ is a confidence score (a high value
implies more certainty that the reported detection is actually a copy).
In this chapter, following the procedure used in TRECVID, we model a copy detection using six values: the name of the query video, the start time and end time in the query video, the name of the reference video, the offset, and the confidence score. The offset is the constant value that must be added to the boundaries in the query video to obtain the boundaries in the reference video. The use of an offset instead of start/end times assumes that the copy and the original excerpt have the same length. This assumption is based on the fact that neither TRECVID nor MUSCLE-VCD-2007 includes a transformation that accelerates or slows down a copy.
The main component for the copy localization is the voting algorithm, which produces many
copy candidates between preprocessed videos. Afterwards, a gathering process combines the candidates between preprocessed videos to create the detections between query and reference videos.
10.1 Voting algorithm
Given a set of preprocessed query videos C ′ and preprocessed reference videos V ′ , the input for
the voting algorithm is a query video c′ ∈ C ′ (with segments {s1 , ..., sr }), and the list of nearest neighbors for each segment. In the algorithm, a nearest neighbor is a reference segment and a distance
value. Therefore, the list of nearest neighbors for the segment si is Nsi ={(r1 , dist1 ), ..., (rk , distk )},
where ∃v ′ ∈ V ′ ∃r ∈ v ′ , distj =d(si , r) for some distance function d, distj ∈ [0, ǫ], and distj ≤ distj+1 .
The segmentation used for the query video and for the reference video is not necessarily
identical. Therefore, between a query segment q and a reference segment r, we define the offsets
Query Video      1st NN           2nd NN           3rd NN
Query1-Segm1     Vid07-Segm54     Vid08-Segm73     Vid01-Segm68
Query1-Segm2     Vid09-Segm13     Vid02-Segm34     Vid02-Segm33
Query1-Segm3     Vid07-Segm34     Vid03-Segm54     Vid09-Segm14
Query1-Segm4     Vid09-Segm15     Vid02-Segm13     Vid03-Segm65
Query1-Segm5     Vid01-Segm88     Vid01-Segm12     Vid07-Segm58
Query1-Segm6     Vid09-Segm54     Vid09-Segm17     Vid07-Segm59
Query1-Segm7     Vid01-Segm45     Vid03-Segm43     Vid03-Segm20
Query1-Segm8     Vid09-Segm19     Vid01-Segm12     Vid07-Segm61

Figure 10.1 – Example showing the result of the voting algorithm, for a search using three nearest neighbors (each stored with its distance), video candidate "Vid09", and offset candidate "-11". The voter segments are the nearest neighbors belonging to Vid09 whose offset matches the candidate. The copy is reported between excerpts [Segm2, Segm8] from Query1, and [Segm13, Segm19] from Vid09.
offsetlow and offsethigh as:

offsetlow(q, r) = min{offset1(q, r), offset2(q, r)}
offsethigh(q, r) = max{offset1(q, r), offset2(q, r)}
offset1(q, r) = start time(q) − start time(r)
offset2(q, r) = end time(q) − end time(r)
Given a reference video v′, we define the candidate offsets between c′ and v′ as the range between offsetmin and offsetmax, taken over all q ∈ c′ and r ∈ v′ such that (r, d(q, r)) ∈ Nq:

offsetmin(c′, v′) = min{offsetlow(q, r)}
offsetmax(c′, v′) = max{offsethigh(q, r)}
For each pair of videos c′ and v ′ , the range of valid offsets [offsetmin , offsetmax ] is divided into
small intervals of fixed length (for instance, l=0.25 seconds) and the voting algorithm is invoked
(see Algorithm 10.1). Given a reference video and the offset, the algorithm returns the query video
bounds (start/end) and the copy detection score (see Figure 10.1).
The detection score s′ is the sum of the votes that the copy candidate received from the segments in the lists of nearest neighbors. The value of each vote and the voter segment are calculated by the function CalculateVote, described in Algorithm 10.2. The constant MatchVote is the base value of a supporting vote (for instance, MatchVote=1), which is weighted according to the relevance of the rank and the distance of the voter segment. The constant MissCost is the default value when there is no reference segment supporting the detection. This value should be zero or negative in order to favor detections without discontinuities (for instance, MissCost=−0.1).
The relevance of the rank is a weight that decreases as the rank increases. In particular, we use a geometric progression with constant ratio a, with 0 < a < 1 (for instance, a=0.75):

rank relevance(j) = a^j    (10.1)

Analogously, the relevance of the distance is a weight that decreases as the distance increases. In particular, our implementation is based on the cumulative distribution Fd:

dist relevance(z) = 1 − Fd(z)    (10.2)
Algorithm 10.1: Voting algorithm for copy localization.
Input: {(s1, N1), ..., (sr, Nr)} set of query segments with their nearest neighbors, v′ reference video candidate, ioffset offset interval candidate.
Output: start, end and score for a copy detection on video v′ with offset ioffset.
(start, end, score) ← (null, null, 0)
(cstart, cend, cscore) ← (null, null, 0)
foreach (si, Ni) ∈ {(s1, N1), ..., (sr, Nr)} do
    (voter, vote) ← CalculateVote(si, Ni, v′, ioffset)
    cscore ← cscore + vote
    if cscore < 0 then
        (cstart, cend, cscore) ← (null, null, 0)
    else if vote > 0 then
        if cstart is null then
            cstart ← voter
        cend ← voter
    if cscore > score then
        (start, end, score) ← (cstart, cend, cscore)
return (start, end, score)
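A minimal Python sketch of the two algorithms combined is shown next (not the thesis implementation); segments are assumed to carry a video identifier and a start time, the offset check is simplified to the start-time offset, and the cumulative distribution Fd of Equation 10.2 is left as a placeholder that disables dist relevance.

from collections import namedtuple

Segment = namedtuple("Segment", "video start")   # a video identifier and a start time

MATCH_VOTE = 1.0   # base value of a supporting vote
MISS_COST = -0.1   # default value when no reference segment supports the detection
A = 0.75           # ratio of the geometric progression in Equation 10.1

def rank_relevance(j):
    return A ** j

def dist_relevance(dist, cdf=lambda z: 0.0):
    # cdf stands for the cumulative distribution Fd of Equation 10.2;
    # the default disables the distance weighting (relevance = 1)
    return 1.0 - cdf(dist)

def calculate_vote(segment, neighbors, ref_video, ioffset):
    # Algorithm 10.2: best supporting (voter, vote) for one query segment
    voter, vote = None, MISS_COST
    lo, hi = ioffset
    for j, (ref_seg, dist) in enumerate(neighbors, start=1):
        offset = segment.start - ref_seg.start   # simplified: start-time offset only
        if ref_seg.video == ref_video and lo <= offset <= hi:
            v = MATCH_VOTE * rank_relevance(j) * dist_relevance(dist)
            if v > vote:
                voter, vote = ref_seg, v
    return voter, vote

def vote_chain(query_segments, ref_video, ioffset):
    # Algorithm 10.1: bounds and score of the best chain for one candidate
    start = end = cstart = cend = None
    score = cscore = 0.0
    for segment, neighbors in query_segments:
        voter, v = calculate_vote(segment, neighbors, ref_video, ioffset)
        cscore += v
        if cscore < 0:
            cstart, cend, cscore = None, None, 0.0
        elif v > 0:
            if cstart is None:
                cstart = voter
            cend = voter
        if cscore > score:
            start, end, score = cstart, cend, cscore
    return start, end, score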
10.2 Evaluation of voting algorithm
All the previous evaluations in this thesis (specifically in Chapters 7, 8, and 9) follow this procedure: a configuration is fixed (i.e., input videos, segmentation, description, and distance function), the similarity search retrieves the first nearest neighbor (i.e., k-NN searches with k=1 and range search ǫ=∞), and the voting algorithm generates the detections and their confidence scores (using parameters l=0.25, MatchVote=1, MissCost=−0.1). The evaluation method uses the ground-truth to determine the number of actual copies detected by correct detections with a score higher than any incorrect detection. Because the searches used k=1, the weighting functions rank relevance (Equation 10.1) and dist relevance (Equation 10.2) have no influence on the final detection result.
The following experiments evaluate the impact on detection effectiveness of parameter k,
rank relevance, and dist relevance. The experiments use preprocessed videos, S1 segmentation, s-t
global description and acoustic description, and L1 distance, thus the detection effectiveness for k=1
is already shown in Figure 7.3 on page 82 in the S1 column.
Table 10.1 shows the variations in the number of detected copies without false alarms for k between 1 and 10 when both rank relevance and dist relevance are disabled (i.e., they both return the constant 1). The detection effectiveness may increase or decrease depending on k and the descriptor. The major improvement occurs with k=3, when the effectiveness improves for eight descriptors and decreases for one of the fifteen tested descriptors. On the other hand, with k=10 the effectiveness improves for four descriptors and decreases for six.
Table 10.2 shows the variations in the detection effectiveness for k between 1 and 10 when rank relevance is used with parameter a=0.75. The table highlights the impact of rank relevance: differences in boldface mark an improvement with respect to the previous experiment, and differences in italics mark a decrease. Globally, eight descriptors benefit from the use of rank relevance, while two show some decrease in effectiveness.
Algorithm 10.2: CalculateVote function for copy localization.
Input: s query segment, {(r1, dist1), ..., (rk, distk)} list of the k nearest neighbors to s, v′ reference video candidate, ioffset offset interval candidate.
Output: voter best matching segment, vote score of the match.
(voter, vote) ← (null, MissCost)
foreach (rj, distj) ∈ {(r1, dist1), ..., (rk, distk)} do
    if rj ∈ v′ and offset(s, rj) ∈ ioffset then
        v ← MatchVote × rank relevance(j) × dist relevance(distj)
        if v > vote then
            (voter, vote) ← (rj, v)
return (voter, vote)
The major improvement occurs again with k=3, when the effectiveness improves for eight descriptors without showing any decrease.
Table 10.3 shows the variations in the detection effectiveness for k between 1 and 10 when both rank relevance and dist relevance are included. The table highlights the impact of dist relevance: differences in boldface mark an improvement with respect to the previous experiment, and differences in italics mark a decrease. Globally, the table shows little impact of dist relevance. The major improvement affects OMt 9x9, which can detect one more copy from k ≥ 6.
The experiments show that performing a search with k > 1 can increase the effectiveness, but it also increases the number of false alarms. In this case, the inclusion of rank relevance limits the increase in false alarms, producing a satisfactory balance at k=3. On the other hand, the effect of dist relevance is minimal, i.e., it shows neither an improvement nor a decrease for most of the descriptors. We also tested enabling dist relevance while disabling rank relevance, which again showed almost no impact of dist relevance on the detection effectiveness.
10.3 Combination of candidates
The voting algorithm returns a list of candidate detections between preprocessed query videos
C ′ and preprocessed reference videos V ′ . The final step consists in processing the list of candidates
(c̄′ , v̄ ′ , s′ ) in order to produce the final list of copy detections (c̄, v̄, s).
Let P (c) ⊂ C ′ be the videos created by the preprocessing task for the query video c ∈ C. If
the preprocessing considers some reversions (like the reversion of PIP described in Chapter 6), then
|P (c)| ≥ 1, but if the preprocessing step only considers a quality normalization then |P (c)| = 1. All
the candidates are gathered, and the candidates referring to the same query and reference videos are joined, i.e., c̄ = ⋃ c̄′, v̄ = ⋃ v̄′, and s = Σ s′. Optionally, the final scores may be scaled to the interval [0,1]. The list of detections (c̄, v̄, s) corresponds to the final answer of the system.
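A minimal sketch of this gathering step follows, under stated assumptions: each candidate is a (query, reference, start, end, score) tuple already mapped back to the original videos, the union of excerpts is kept as a single enclosing interval, and the optional scaling simply divides by the maximum score.

def combine_candidates(candidates, scale=True):
    """candidates: iterable of (query_id, ref_id, start, end, score) tuples."""
    joined = {}
    for q, r, start, end, score in candidates:
        if (q, r) not in joined:
            joined[(q, r)] = [start, end, 0.0]
        entry = joined[(q, r)]
        entry[0] = min(entry[0], start)   # join excerpts (kept as one interval here)
        entry[1] = max(entry[1], end)
        entry[2] += score                 # s = sum of the partial scores
    detections = [(q, r, s, e, sc) for (q, r), (s, e, sc) in joined.items()]
    if scale and detections:
        top = max(d[4] for d in detections)
        detections = [(q, r, s, e, sc / top) for q, r, s, e, sc in detections]
    return detections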
Descriptor              Dets. no f.a.       Difference to k=1
                        k=1          2    3    4    5    6    7    8    9   10
EHt 4x4-10              28           0   +1    0    0    0    0    0    0    0
EHt 4x4-5               27          +1   +1   +1   +1   +1   +1   +1   +1   +1
KFt 11x9                23          +1    0    0    0    0    0    0    0    0
OMt 9x9                 15          +1   +1    0   +2   +2   +1   +1   +1    0
CHt 1x1-hsv-16x4x4      20          −1   −1   −1   −1   −1   −1   −1   −1   −1
CHt 2x2-hsv-16x2x2      21           0    0   +1    0   +1   +1   +1    0   +1
CHt 2x2-rgb-4x4x4       21          +1   +1   +1   +1    0   +1    0    0   −1
GHt 1x1-180             16          +1   +1   +1    0    0   −1    0    0    0
GHt 1x4-32              19           0    0    0    0   −1   −1   −1   −1   −1
GHt 3x3-20              21          −1    0    0    0   −1    0   −1   −1   −1
GHt 4x4-12              21           0   +1   +1   +1    0    0    0   +1   +1
IHt 1x4-rgb-3x16        20          −1    0   +1    0    0    0    0    0    0
IHt 2x2-rgb-3x16        20           0   +1   +1   +1   +1   −1   −1   −1   −1
IHt 4x4-rgb-3x4         21           0   +1   +1   +1   +1   +1   +1   +1   +1
AU160                   29           0    0   −1   −1   −1    0    0    0   −1

Table 10.1 – Detections without false alarms when increasing k, with both rank relevance and dist relevance disabled (k=1 shows the number of detections; columns 2 to 10 show the difference with respect to k=1).
Descriptor              Dets. no f.a.       Difference to k=1
                        k=1          2    3    4    5    6    7    8    9   10
EHt 4x4-10              28           0    0    0    0    0    0    0    0    0
EHt 4x4-5               27          +1   +1   +1   +1   +1   +1   +1   +1   +1
KFt 11x9                23          +1    0    0    0    0    0    0    0    0
OMt 9x9                 15          +1   +1   +1   +2   +2   +2   +2   +2   +2
CHt 1x1-hsv-16x4x4      20           0    0   −1   −1   −1   −1   −1   −1   −1
CHt 2x2-hsv-16x2x2      21           0    0    0    0    0    0    0    0    0
CHt 2x2-rgb-4x4x4       21          +1   +1   +1   +1   +1   +1   +1   +1   +1
GHt 1x1-180             16          +1   +1   +1   +1   +1   +1   +1   +1   +1
GHt 1x4-32              19           0    0    0    0    0    0    0    0    0
GHt 3x3-20              21          −1    0    0    0   −1    0   −1   −1   −1
GHt 4x4-12              21           0   +1   +1   +1   +1   +1   +1   +1   +1
IHt 1x4-rgb-3x16        20           0   +2   +2   +2   +2   +2   +2   +2   +2
IHt 2x2-rgb-3x16        20           0   +1   +1   +1   +1   +1   +1   +1   +1
IHt 4x4-rgb-3x4         21           0   +1   +1   +1   +1   +1   +1   +1   +1
AU160                   29           0    0    0    0    0    0    0    0    0

Table 10.2 – Detections without false alarms when increasing k, with rank relevance active and dist relevance disabled. Differences in boldface mark an improvement with respect to Table 10.1, and differences in italics mark a decrease.
Descriptor              Dets. no f.a.       Difference to k=1
                        k=1          2    3    4    5    6    7    8    9   10
EHt 4x4-10              28           0    0    0   +1    0    0    0    0    0
EHt 4x4-5               27          +1   +1   +1   +1   +1   +1   +1   +1   +1
KFt 11x9                23          +1    0    0    0    0    0    0    0    0
OMt 9x9                 15          +1   +1   +1   +2   +3   +3   +3   +3   +3
CHt 1x1-hsv-16x4x4      20           0    0   −1   −1   −1   −1   −1   −1   −1
CHt 2x2-hsv-16x2x2      21           0    0    0    0    0    0    0    0    0
CHt 2x2-rgb-4x4x4       21          +1   +1   +1   +1    0   +1   +1   +1   +1
GHt 1x1-180             16          +1   +1   +1   +1   +1   +1   +1   +1   +1
GHt 1x4-32              19           0    0    0    0    0    0    0    0    0
GHt 3x3-20              21          −1    0    0    0   −1    0   −1   −1   −1
GHt 4x4-12              21           0   +1   +1   +1   +1   +1   +1   +1   +1
IHt 1x4-rgb-3x16        20           0   +2   +2   +2   +2   +2   +2   +2   +2
IHt 2x2-rgb-3x16        20           0   +1   +1   +1   +1   +1   +1   +1   +1
IHt 4x4-rgb-3x4         21           0   +1   +1   +1   +1   +1   +1   +1   +1
AU160                   29           0    0    0    0    0    0    0    0    0

Table 10.3 – Detections without false alarms when increasing k, with both rank relevance and dist relevance active. Differences in boldface mark an improvement with respect to Table 10.2, and differences in italics mark a decrease.
10.4 Summary
In this chapter we have presented our approach to locating copies from the lists of nearest neighbors retrieved by the similarity search. The voting algorithm locates chains of nearest neighbors belonging to the same reference video. The combination process gathers all the candidates between preprocessed videos to produce the final detection list. The evaluation showed that the voting algorithm can improve its effectiveness when it considers more nearest neighbors (k > 1), however weighting each voter by its rank position is needed in order to prevent an increase in false alarms. We also evaluated weighting each voter according to its distance to the query object, but this showed almost no impact on the system effectiveness.
The next chapter reviews our participation at TRECVID 2010 and 2011, where we compared the performance of our system, which uses most of the developed techniques, with other state-of-the-art systems.
Chapter 11
Evaluation at TRECVID
During the development of this thesis, we participated in TRECVID’s CBVCD evaluation. In that evaluation, the performance of our system was compared with other state-of-the-art CBVCD systems. In this chapter we review our participation at TRECVID 2010 and TRECVID 2011, and we compare our results with those of the other participants. Additionally, Appendix B and Appendix C present detailed data for the evaluation of all the participating teams.
11.1 Introduction
The Text Retrieval Conference (TREC), organized yearly by the National Institute of Standards and Technology (NIST), focuses on research in information retrieval. As a part of TREC since 2003, the TREC Video Retrieval Evaluation (TRECVID) promotes research in video information retrieval by providing large test collections, uniform scoring procedures, and a forum for organizations interested in comparing their results [Smeaton et al., 2006]. Between 2008 and 2011, TRECVID included an evaluation for CBVCD systems, called Content-Based Copy Detection (CCD). Many teams from universities and private companies around the world participated in the CCD evaluation, which served as a benchmark to test and compare different CBVCD techniques. During 2008 and 2009 the evaluation used datasets from the BBC collection, while in 2010 and 2011 it used a more general dataset with videos from the Internet, called IACC.1.
11.2 TRECVID datasets
TRECVID 2010 and 2011 used the datasets IACC.1.A and IACC.1.tv10.training as the reference collection. It contains more than eleven thousand short Internet video clips with durations between 11 seconds and 4.1 minutes. The visual and audio quality of the reference videos is variable: the videos come from TV rips, handheld and cellphone cameras, movie excerpts, slideshows, etc. The videos have different resolutions (between 320×240 and 1104×240) and different frame rates (from less than 1 fps up to more than 75 fps).
The query collections for TRECVID 2010 and 2011 were created following the same process.
Figure 11.1 – Example of visual transformations in TRECVID datasets. (a) Frame from a reference video. (b-i) The same frame after the eight visual transformations: (b) V1: camcording; (c) V2: PIP; (d) V3: insert pattern; (e) V4: reencoding; (f) V5: change of gamma; (g) V6: gamma + frame drop + reencoding; (h) V8: mirroring + PIP + contrast; (i) V10: PIP + noise + caption.
Code    Description
V1      Simulated camcording.
V2      Picture-in-picture (PIP) original video in foreground.
V3      Insertion of pattern.
V4      Strong reencoding.
V5      Change of gamma.
V6      Combination of three quality transformations: blur, change of gamma, frame dropping, contrast, reencoding, ratio, and white noise.
V8      Combination of three postproduction transformations: crop, shift, contrast, caption, mirroring, insertion of pattern, and PIP original video in background.
V10     Random combination of three previous transformations.

(a) The eight visual transformations evaluated at TRECVID.

Code    Description
A1      No transformation.
A2      Mp3 compression.
A3      Mp3 compression and multiband companding.
A4      Bandwidth limit and single-band companding.
A5      Mix with speech.
A6      Mix with speech and multiband compress.
A7      Bandpass filter mix with speech and compress.

(b) The seven acoustic transformations evaluated at TRECVID.

Table 11.1 – The evaluated transformations at TRECVID 2010 and 2011.
First, 201 base videos were created, each one with a length between 3 seconds and 3 minutes: 67 videos were an excerpt from a reference video, 67 were an excerpt from a video not in the reference collection, and 67 contained an excerpt from a reference video embedded into a longer excerpt from a video not in the reference collection. The audio query collection was created by applying seven acoustic transformations (with codes A1 to A7) to each of the 201 base videos. The visual query collection was created by applying eight visual transformations (with codes V1 to V10) to each of the 201 base videos. Transformations V7 (combination of five quality transformations) and V9 (combination of five postproduction transformations) were only evaluated in 2008 and were discarded in the subsequent years. The final a+v query collection was created by combining the audio queries and the visual queries of the same base video, producing 56 query videos from each base video. Table 11.1 details the eight visual transformations and the seven acoustic transformations. Figure 11.1 shows an example of the eight visual transformations applied to a base video. Table 11.2 summarizes the sizes of the collections.
11.3 Evaluation process
Two configuration profiles were tested: Balanced and No False Alarms (NoFA). Teams could submit up to four Runs containing the detection results produced by their CBVCD system according to one of these profiles. A Run is a text file with a list of detections; each detection reports the boundaries of the copy excerpt in the query video, the offset of the boundaries for some reference video, and a confidence score. Additionally, the Run contains a decision threshold submitted by the team, and the total time in seconds required to process each query video (including decoding and feature extraction, reading/writing of intermediate results, and generation of the final output), i.e., the time required for the online phase.
(a) TRECVID 2010

Collection             Videos     Hours     Frames         GB
Reference              11,524     425.4     39,463,431     100.3
Query: Audio            1,407      27.7     —                0.8
Query: Visual           1,608      32.0     3,402,620       25.3
Query: Audio+Visual    11,256     222.8     23,818,340      —

(b) TRECVID 2011

Collection             Videos     Hours     Frames         GB
Reference              11,485     423.8     39,280,446      99.9
Query: Audio            1,407      28.3     —                0.8
Query: Visual           1,608      32.3     3,359,710       24.3
Query: Audio+Visual    11,256     226.3     23,517,970      —

Table 11.2 – Summary of TRECVID 2010 and 2011 collections.
The evaluation of each Run relies on three measures (two for effectiveness and one for efficiency). NIST calculates these measures separately for each transformation: it splits the detections of the Run into independent files by transformation, and for each of the 56 transformations the following three measures are calculated:
• Normalized Detection Cost (NDCR): Measures the effectiveness of the detection. It considers the probability of missing a detection and the probability of falsely indicating that there is a copy for a query video:

NDCR = PMISS + β · RFA

where PMISS is the conditional probability of a missed copy, RFA is the false alarm rate, and β is a weight parameter. The evaluation sets β=200 for the Balanced profile, and β=200,000 for the NoFA profile. NDCR ranges between 0 and +∞, where closer to zero means better detection effectiveness. A trivial NDCR of 1.0 can be obtained by submitting an empty Run, thus a good result should not be greater than this value.
• F1: Measures the accuracy of the localization once a copy has been correctly detected. It is the harmonic mean of the precision and recall calculated over the lengths of the detected and copied excerpts:

F1 = 2 / (1/precision + 1/recall)

F1 ranges between 0 and 1, where closer to 1.0 means better accuracy at detecting the copy boundaries (a small computational sketch of NDCR and F1 follows this list).
• Mean query processing time (MPT): Measures the efficiency at processing the query
videos. It is calculated by averaging the reported time for queries belonging to each transformation.
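The following is a small computational sketch (not NIST's evaluation code) of the two effectiveness measures, assuming that PMISS, RFA, and the excerpt lengths (detected, copied, and their overlap) are already known; interpreting precision and recall as length ratios is an assumption consistent with the definitions above.

def ndcr(p_miss, r_fa, beta=200.0):
    # NDCR = PMISS + beta * RFA; beta=200 (Balanced) or 200,000 (NoFA)
    return p_miss + beta * r_fa

def f1(detected_length, copied_length, overlap_length):
    # harmonic mean of precision and recall over excerpt lengths;
    # only computed once the copy was correctly detected, so overlap_length > 0
    precision = overlap_length / detected_length
    recall = overlap_length / copied_length
    return 2.0 / (1.0 / precision + 1.0 / recall)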
NDCR and F1 are calculated using two decision thresholds for the confidence score: the threshold submitted by the team (which gives the actual performance of the system), and the optimal threshold, which is the threshold that produces the minimal NDCR for each transformation (which gives the optimal performance of the system). Thus, the evaluation of a Run produces five measures (Actual NDCR, Actual F1, Optimal NDCR, Optimal F1, and MPT) for each of the 56 transformations. The optimal threshold is calculated for each transformation, while the submitted threshold is fixed for every transformation. Unfortunately, the evaluation process did not consider an optimal threshold fixed for all transformations. In order to globally compare system performances, we averaged the 56 values of each measure into one value, producing five measures for each Run: Average Actual NDCR, Average Actual F1, Average Optimal NDCR, Average Optimal F1, and Average MPT.
TRECVID 2010 and 2011 allowed CBVCD systems to rely on both acoustic and visual information, although some systems may use only one source and discard the other. For comparison purposes we classified every Run into audio-only (A), visual-only (V), and audio+video (AV). We state that a Run is of type V when its results for NDCR and F1 are identical for the seven acoustic transformations within the same visual transformation (thus, its results are not influenced by changes in the audio). Analogously, we state that a Run is of type A when NDCR and F1 are identical for the eight visual transformations within the same acoustic transformation. Finally, a Run is of type AV if it is neither type A nor type V. A small sketch of this classification is shown below.
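This is a minimal sketch of the classification just described, assuming a dictionary that maps each (acoustic, visual) transformation pair, e.g. ("A3", "V5"), to its hashable (NDCR, F1) result.

AUDIO = [f"A{i}" for i in range(1, 8)]
VISUAL = ["V1", "V2", "V3", "V4", "V5", "V6", "V8", "V10"]

def run_type(results):
    # type V: identical results across the 7 acoustic transformations of each visual one
    only_visual = all(len({results[(a, v)] for a in AUDIO}) == 1 for v in VISUAL)
    # type A: identical results across the 8 visual transformations of each acoustic one
    only_audio = all(len({results[(a, v)] for v in VISUAL}) == 1 for a in AUDIO)
    if only_visual:
        return "V"
    if only_audio:
        return "A"
    return "AV"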
11.4 Participation at TRECVID 2010
This section reviews our first participation in the CCD evaluation. In this participation we tested a system based exclusively on visual global descriptors, thus we discarded the audio information, generating only Runs of type V.
11.4.1 Submissions
The preprocessing task created new query videos for the PIP, camcording and vertical flip transformations. As a result of this task, the number of query videos used by the next tasks increased from 1,607 to 5,378 (containing 11,526,076 frames). The video segmentation task used a variable-length partitioning: first it used a base segmentation S1/3, then the descriptor KF with parameters 20×15 was extracted for the representative frames, and finally two consecutive segments were joined if the L1 distance between their descriptors was smaller than a threshold of 24. The 5,378 preprocessed query videos produced 990,246 query segments, and the 11,524 reference videos produced 3,967,815 reference segments.
The feature extraction task described each segment by three global descriptors: EHt 4x4-10, GHt 3x3-20, and IHt 2x2-rgb-3x16. Table 11.3 shows the sizes of the extracted descriptors.
Descriptor            Length       Query size    Reference size
EHt 4x4-10            160 bytes    151 MB        605 MB
GHt 3x3-20            180 bytes    170 MB        681 MB
IHt 2x2-rgb-3x16      192 bytes    181 MB        727 MB

Table 11.3 – Descriptors used at TRECVID 2010.
In order to create our submissions we defined two configurations: ehdNgry and ehdNclr,
summarized in Table 11.4. These configurations used a spatio-temporal distance δ with temporal
window W =3, α-normalization (α=0.0001), and weighting by max-ρ.
Name       Segmentation     Distance    Descriptor
ehdNgry    S1/3 + comb.     δ           L1-EHt 4x4-10, L1-GHt 3x3-20
ehdNclr    S1/3 + comb.     δ           L1-EHt 4x4-10, L1-IHt 2x2-rgb-3x16

Table 11.4 – Configurations used at TRECVID 2010.
For each configuration, k-NN+range searches were performed with k=6 and ǫ=6. The exact search with pivots would have taken nearly eleven months to complete. In order to reduce the search time we used the approximate search with pivots with parameters |P|=9 and T=0.001·|R|. We fixed those parameters by first deciding the amount of time that the similarity search should take (we decided the search should not take more than 24 hours in total for all queries), and then testing different values for T and |P| to meet that constraint. In both configurations the evaluation of δ needed more than 1,000 operations, but the approximate search estimated it using LBP (which requires only 9 operations) and evaluated δ for only 0.1% of the objects (3,967 evaluations for each query segment).
We submitted two Runs for the Balanced profile: balanced.ehdNgryhst (configuration ehdNgry) and balanced.ehdNclrhst (configuration ehdNclr); and two Runs for the No False Alarms
profile: nofa.ehdNgryhst (configuration ehdNgry) and nofa.ehdNghT10 (same as previous but with
a stricter decision threshold).
We performed all the processes on a single desktop computer with Intel Q9400 CPU (2.66
GHz × 4 cores) and 4 GB RAM on a GNU/Linux 2.6.18. The whole system was implemented in C
using OpenCV and FFmpeg libraries.
11.4.2 Results
Twenty-two teams participated in the CCD evaluation at TRECVID 2010. Each team could submit up to four Runs, which resulted in 37 submissions for the NoFA profile (14 Runs of type V) and 41 submissions for the Balanced profile (15 Runs of type V). The list of participants, the results for each participant, and the results by transformation are detailed in Appendix B.
Average Results
Table 11.5 shows the average results over all the transformations for the Runs submitted to the NoFA and Balanced profiles. Figure 11.2 depicts a comparison with the other Runs in terms of Average Optimal NDCR, Average Optimal F1, and Average MPT.
In the NoFA profile, both submissions nofa.ehdNghT10 and nofa.ehdNgryhst achieved the same Optimal NDCR and Optimal F1 because they only differed in the decision threshold. Both the Optimal NDCR and F1 are better than the median, and considering just Runs of type V they achieved the best results. In the case of Actual NDCR, the decision thresholds were too permissive (both NDCR values are higher than 1.0 because the threshold accepts too many false alarms, but still both NDCR values are better than the median).
Indicator                        Value      Overall Rank    Rank in Runs V

nofa.ehdNghT10
  Average Optimal NDCR           0.611      10 of 37         1 of 14
  Average Optimal F1             0.828      14 of 37         1 of 14
  Average Actual NDCR            40.75       9 of 37         2 of 14
  Average Actual F1              0.846      15 of 37         1 of 14
  Average MPT                    128 s.     23 of 37        11 of 14

nofa.ehdNgryhst
  Average Optimal NDCR           0.611      10 of 37         1 of 14
  Average Optimal F1             0.828      14 of 37         1 of 14
  Average Actual NDCR            147.7      13 of 37         4 of 14
  Average Actual F1              0.811      16 of 37         2 of 14
  Average MPT                    128 s.     23 of 37        11 of 14

balanced.ehdNgryhst
  Average Optimal NDCR           0.597      14 of 41         1 of 15
  Average Optimal F1             0.820      16 of 41         3 of 15
  Average Actual NDCR            9.057      28 of 41         7 of 15
  Average Actual F1              0.723      19 of 41        11 of 15
  Average MPT                    128 s.     26 of 41        11 of 15

balanced.ehdNclrhst
  Average Optimal NDCR           0.658      16 of 41         3 of 15
  Average Optimal F1             0.820      15 of 41         2 of 15
  Average Actual NDCR            8.902      27 of 41         6 of 15
  Average Actual F1              0.724      20 of 41        12 of 15
  Average MPT                    132 s.     27 of 41        12 of 15

Table 11.5 – Evaluation for the submitted Runs to TRECVID 2010 (average values over the 56 transformations).
Fixing an appropriate threshold was a difficult task for all the teams; in fact, only five submissions achieved an Average Actual NDCR lower than 1.0. In the Balanced profile, balanced.ehdNgryhst and balanced.ehdNclrhst achieved an Average Optimal NDCR better than the median, and balanced.ehdNgryhst achieved the best Optimal NDCR among Runs of type V. As in the NoFA profile, the high Actual NDCR is due to a too permissive decision threshold. Comparing Optimal NDCR, nofa.ehdNgryhst was 10th among 37 submissions, while balanced.ehdNgryhst was 14th among 41 submissions. These results show that the system achieves a better performance for the NoFA profile than for the Balanced profile. The four submissions had a Mean Processing Time higher than the median, thus the system was relatively slow. This is mainly due to the preprocessing task, which was inefficiently implemented and took almost half of the total processing time. In particular, queries with PIP and camcording required about three and four times more processing time, respectively. Furthermore, the 24-hour budget allotted to the similarity search could have been reduced in order to increase the efficiency of the system.
Figure 11.2 – Average Optimal NDCR, Average Optimal F1 and Average MPT for Runs at TRECVID 2010; the ideal indicators are NDCR=0, F1=1, and MPT=0. Each profile is shown with two plots, Opt. NDCR vs. Opt. F1 and Opt. NDCR vs. Time, with Runs marked by type (A, V, AV). (a) Runs for the NoFA profile; the data for this figure is shown in Table B.2 on page 190. (b) Runs for the Balanced profile; the data for this figure is shown in Table B.4 on page 192.
Figure 11.3 – Optimal NDCR and Optimal F1 by transformation at TRECVID 2010, comparing our submissions against the overall best, the overall median, the best Run of type V, and the median of Runs of type V, over the A+V transformations (A1 to A7 within each of V1, V2, V3, V4, V5, V6, V8, V10). (a) NoFA profile (nofa.ehdNgryhst, nofa.ehdNghT10); data shown in Table B.3 on page 191. (b) Balanced profile (balanced.ehdNgryhst, balanced.ehdNclrhst); data shown in Table B.5 on page 193.
Results by Transformation
In order to analyze the performance by transformation, we compare the results of Optimal
NDCR and Optimal F1 for all submissions at NoFA profile and Balanced profile. Figure 11.3 shows
the results for each transformation for our submissions compared with: the best result, the median,
the best result for Runs type V, and the median for Runs type V.
In the NoFA profile, nofa.ehdNghT10 and nofa.ehdNgryhst achieved their best effectiveness at V4 and V3 (NDCR 0.262 and 0.269, respectively). Their effectiveness was always better than the overall median, and among Runs of type V they achieved the best NDCR for V1, V2, V3 and V8. Globally, the most difficult transformation was V2, where only three submissions of type V achieved an NDCR below 1.0, and the second most difficult was V1, where only five submissions of type V achieved an NDCR below 1.0. This shows the relevance of the preprocessing, where the inclusion of detection and reversion enabled a remarkable performance. In copy localization (F1), the best localization was achieved for V2 and the worst for V1.
In the Balanced profile, balanced.ehdNgryhst and balanced.ehdNclrhst achieved better effectiveness than the overall median and the median of Runs of type V for every transformation. balanced.ehdNclrhst achieved the best effectiveness among Runs of type V at V1 and V2, and balanced.ehdNgryhst achieved the best effectiveness among Runs of type V at V3. The localization was accurate for V8 but not for V1.
In summary, the results of our submitted Runs were above the overall median for Optimal NDCR at every transformation. Considering just Runs of type V, they achieved the best detection performance for V1, V2 and V3. The localization was not very accurate, since some transformations were difficult to locate (in particular V1 and V5).
11.5 Participation at TRECVID 2011
This section reviews our second participation in the CCD evaluation. We tested a combination of visual and acoustic information at the similarity search level, i.e., the system compared
segments using both acoustic descriptors and global descriptors in the distance function. This is a
novel approach because most of the CBVCD systems fuse candidates from independent subsystems,
each subsystem using either visual or audio information.
11.5.1 Submissions
The preprocessing task created new query videos for the PIP, camcording and vertical flip transformations. As a result of this task, the number of visual query videos increased from 1,608 to 5,147, thus the number of a+v queries also increased from 11,256 to 36,029. The video segmentation task used the segmentation S1/3. Unlike our previous participation, we chose a fixed-length segmentation in order to simplify the fusion of acoustic and visual descriptors. The 11,485 reference videos produced |Rv|=4,522,262 visual segments and |Ra|=4,441,717 audio segments (some videos have different lengths for the audio and visual tracks). The 5,147 visual queries produced |Qv|=1,120,455 visual segments, and the 1,407 audio queries produced |Qa|=306,304 audio segments.
The feature extraction task described each visual segment in Qv ∪ Rv by three global descriptors (EHt 4x4-10, GHt 4x4-12, and IHt 4x4-rgb-3x4), and each audio segment in Qa ∪ Ra by one acoustic descriptor (AU160). Table 11.6 shows the sizes of the extracted descriptors. Unlike our previous participation, we preferred a 4×4 zoning grid for every descriptor because the zoning highly impacts the effectiveness of the description (see Chapter 7).
Descriptor           Length       Query size    Reference size
EHt 4x4-10           160 bytes    212 MB        691 MB
GHt 4x4-12           192 bytes    206 MB        829 MB
IHt 4x4-rgb-3x4      192 bytes    206 MB        829 MB
AU160                640 bytes    187 MB        2.7 GB

Table 11.6 – Descriptors used at TRECVID 2011.
Let Qav be the set of a+v-segments for the query videos, and Rav be the set of a+v-segments for the reference videos. We created Qav by combining the sets Qa and Qv and their descriptors following the script tv11.make.av.queries.sh, which produced |Qav|=7,840,587 a+v-segments. We created Rav by combining the sets Ra and Rv and their descriptors, producing |Rav|=4,387,633 a+v-segments. The combination process requires that visual segments and audio segments have the same length to create an a+v-segment, and it guarantees that every created a+v-segment has all the visual and acoustic descriptors, i.e., an a+v-segment is discarded if it lacks any acoustic or visual descriptor.
The submissions were based on four configurations: EhdGry, EhdRgb, Aud and EhdRgbAud, summarized in Table 11.7. The γ distance for configurations EhdGry, EhdRgb, and EhdRgbAud used α-normalization (α=0.001) and weighting by max-τ. Unlike our previous participation, we chose a temporal window W=1 because the improvement in effectiveness due to a large W was not worth the increase in the search time (see Chapter 8).
In EhdRgbAud we had to manually decrease wAud because we already knew from TRECVID 2010 that the audio tracks of the query videos were not as reliable as the visual tracks for detecting copies. The guidelines for the CCD evaluation ensure that a copy exists in both the visual and the audio tracks at the same time, however there are some valid copies whose audio track is ruined by the acoustic transformation. Moreover, after analyzing the ground-truth, we realized that an acoustic transformation may replace (maybe unintentionally) the audio track of a valid copy with an audio track from an unrelated video. This creates some a+v queries where the visual track matches the original while the audio track does not (i.e., the copy exists only in the visual track). In particular, this behavior is observable in query videos 8960.mpg, 6008.mpg, 7468.mpg, 7996.mpg, 9261.mpg, 9472.mpg, 9517.mpg, and maybe others. This issue was reported to NIST, but they ruled that those queries were valid.
The submitted Runs nofa.EhdGry and balanced.EhdGry were similar to our previous participation. The similarity search used configuration EhdGry to perform a k-NN approximate search in Rv for every query segment in Qv. The approximate search fixed k=10 and approximation parameters T=1% and |P|=5 pivots. Hence, each estimation cost five operations, and the actual γ was calculated only for the 0.01·|Rv|=45,222 objects with the lowest estimations. Compared with our previous participation, we increased the parameter T and decreased the parameter |P|, because T is more relevant than |P| for the effectiveness of the approximation (see Chapter 9). The copy localization algorithm located chains of nearest neighbors belonging to the same reference video and offset. The chain with the highest score for each query video was reported in nofa.EhdGry, and the two chains with the highest scores were reported in balanced.EhdGry.
Name         Segmentation    Distance    Descriptor
EhdGry       S1/3            γ           L1-EHt 4x4-10, L1-GHt 4x4-12
EhdRgb       S1/3            γ           L1-EHt 4x4-10, L1-IHt 4x4-rgb-3x4
Aud          S1/3            L1          AU160
EhdRgbAud    S1/3            γ           L1-EHt 4x4-10, L1-IHt 4x4-rgb-3x4, L1-Aud

Table 11.7 – Configurations used at TRECVID 2011.
The submitted Runs nofa.EhdRgbAud and balanced.EhdRgbAud performed a similarity search
using configuration EhdRgbAud. However, we did not directly apply the approximate search with
pivots (as in our previous participation) due to two major drawbacks:
• The similarity search must perform |Qav | approximate searches in Rav , which compared
against EhdGry (|Qv | searches in Rv ) is an increase of almost seven times. Therefore,
the approximate parameters T and P should be adjusted to reduce the search time to one
seventh, at the cost of decreasing the effectiveness.
• The distance function in EhdRgbAud has a higher intrinsic dimensionality than EhdGry,
thus it is more difficult to approximate because it combines more independent distances (see
a similar behavior in Figure 9.1 on page 113). Therefore, T and P should be adjusted to
increase the effectiveness at the cost of increasing the search time.
In order to overcome these issues we chose to use the Two-step search (see Section 9.1.5).
Given a query video c, the first step performs approximate searches using both configurations
EhdRgb and Aud in order to collect candidate reference videos V(c). For EhdRgb, the approximate search parameters were k=10, T =1%, and |P|=5. For Aud, the approximate search
parameters were k=30, T =2%, and |P|=5. The set V(c) is created with the D most voted reference
videos that received at least 2 votes. We defined parameter D=40 based exclusively on the resulting
search time. The second step performs exact searches using EhdRgbAud between every query segment q in c and every reference segment r in Rav(c) ⊆ Rav, where Rav(c) is the set of a+v segments of the candidate videos in V(c).
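As an illustration, the following Python sketch outlines the Two-step search just described. The helpers knn_ehdrgb, knn_aud, exact_knn_av and reference_video_of are hypothetical placeholders for the approximate searches of configurations EhdRgb and Aud, the exact search with the combined EhdRgbAud distance, and the mapping from a reference segment to its reference video.

```python
from collections import Counter

def two_step_search(query_segments, knn_ehdrgb, knn_aud, exact_knn_av,
                    reference_video_of, D=40, min_votes=2):
    """Sketch of the Two-step search for a query video c.

    Step 1 runs the approximate searches (EhdRgb and Aud) and lets every
    retrieved nearest neighbor vote for its reference video; the D most
    voted videos with at least min_votes votes form the candidate set V(c).
    Step 2 runs an exact k-NN search with the combined distance, restricted
    to the segments of the candidate videos.
    """
    votes = Counter()
    for q in query_segments:
        for r in knn_ehdrgb(q, k=10, T=0.01, n_pivots=5):  # visual configuration
            votes[reference_video_of(r)] += 1
        for r in knn_aud(q, k=30, T=0.02, n_pivots=5):     # acoustic configuration
            votes[reference_video_of(r)] += 1

    candidates = [video for video, n in votes.most_common(D) if n >= min_votes]

    # Step 2: exact searches with EhdRgbAud restricted to Rav(c).
    return {q: exact_knn_av(q, restrict_to=candidates) for q in query_segments}
```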
Finally, the copy localization step located the copies using the (exact) k-NN lists according to EhdRgbAud. The chain with the highest score for each query video was reported in nofa.EhdRgbAud, and the two chains with the highest scores were reported in balanced.EhdRgbAud.
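A minimal sketch of the chain-based localization follows. It assumes that each query segment already has its list of nearest neighbors as (reference video, reference time, score) tuples; the tolerance used when grouping temporal offsets is a hypothetical parameter.

```python
from collections import defaultdict

def localize_copies(knn_per_segment, tolerance=1.0, n_report=1):
    """Sketch of the chain-based copy localization.

    Nearest neighbors that agree on the same reference video and (roughly)
    the same temporal offset are accumulated into a chain; the chains with
    the highest accumulated score are reported as detected copies.
    """
    scores = defaultdict(float)  # (reference video, rounded offset) -> chain score
    spans = {}                   # (reference video, rounded offset) -> (start, end) in query time

    for q_time, neighbors in knn_per_segment.items():
        for ref_video, r_time, score in neighbors:
            offset = round((r_time - q_time) / tolerance) * tolerance
            key = (ref_video, offset)
            scores[key] += score
            start, end = spans.get(key, (q_time, q_time))
            spans[key] = (min(start, q_time), max(end, q_time))

    best = sorted(scores, key=scores.get, reverse=True)[:n_report]
    return [(ref, offset, spans[(ref, offset)], scores[(ref, offset)])
            for ref, offset in best]
```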
We performed all the processes on a single desktop computer with an Intel Core i7-2600K CPU (3.4 GHz × 4 cores) and 8 GB RAM, running GNU/Linux 2.6.38. As in our previous participation, the whole system was implemented in C using the OpenCV and FFmpeg libraries.
Additionally, we performed a joint submission with the Telefonica Research team for the Balanced profile, under the name Telefonica-research.balanced.joint. This submission tested the decision-level combination of Telefonica's local descriptor, Telefonica's acoustic descriptor, and the EhdRgb configuration.
11.5.2 Results
Twenty-two teams participated in the CCD evaluation at TRECVID 2011. Each team submitted up to four Runs, resulting in 32 submissions for the NoFA profile (14 of them Runs type V) and 41 submissions for the Balanced profile (15 of them Runs type V). The list of participants, the results for each participant, and the results by transformation are detailed in Appendix C.
Run                  Indicator             Value     Overall Rank   Rank in Runs V
nofa.EhdGry          Average Optimal NDCR  0.374     10 of 32        1 of 14
                     Average Optimal F1    0.938      7 of 32        3 of 14
                     Average Actual NDCR   0.419      4 of 32        1 of 14
                     Average Actual F1     0.956      3 of 32        2 of 14
                     Average MPT           49.9 s.    8 of 32        5 of 14
nofa.EhdRgbAud       Average Optimal NDCR  0.286      6 of 32        —
                     Average Optimal F1    0.946      5 of 32        —
                     Average Actual NDCR   0.336      2 of 32        —
                     Average Actual F1     0.962      1 of 32        —
                     Average MPT           64.4 s.   12 of 32        —
balanced.EhdGry      Average Optimal NDCR  0.412     18 of 41        3 of 15
                     Average Optimal F1    0.938     11 of 41        4 of 15
                     Average Actual NDCR   3.716     30 of 41        9 of 15
                     Average Actual F1     0.913     20 of 41        6 of 15
                     Average MPT           49.9 s.    7 of 41        3 of 15
balanced.EhdRgbAud   Average Optimal NDCR  0.300     12 of 41        —
                     Average Optimal F1    0.955      3 of 41        —
                     Average Actual NDCR   8.462     37 of 41        —
                     Average Actual F1     0.935     11 of 41        —
                     Average MPT           64.4 s.   10 of 41        —

Table 11.8 – Evaluation for submitted Runs to TRECVID 2011 (average values for the 56 transformations).
Table 11.8 shows the average results over all the transformations for the Runs submitted to the NoFA and Balanced profiles. Figure 11.4 depicts a comparison with the other Runs for Average Optimal NDCR, Average Optimal F1 and Average MPT.
In the NoFA profile, both submissions nofa.EhdGry and nofa.EhdRgbAud achieved an Average Optimal NDCR and F1 better than the median. In particular, considering just Runs type V, nofa.EhdGry achieved the best detection effectiveness. In the case of Actual values, the decision threshold was appropriate, because the NDCR remained below 1, and nofa.EhdRgbAud achieved the second best Actual NDCR and the best Actual F1 overall.
In the Balanced profile, balanced.EhdGry and balanced.EhdRgbAud achieved an Average Optimal NDCR and F1 better than the median. In the case of Actual values, the decision thresholds were too permissive, because the Actual NDCR was higher than 1 for both submissions.
Figure 11.4 – Average Optimal NDCR, Average Optimal F1 and Average MPT for Runs at TRECVID 2011. The ideal indicators are NDCR=0, F1=1, and MPT=0. Each profile is shown in two plots, Optimal NDCR vs. Optimal F1 and Optimal NDCR vs. Time, with Runs marked by type (A, V, AV). (a) Runs for the NoFA profile; the data in this figure is shown in Table C.2 on page 199. (b) Runs for the Balanced profile; the data in this figure is shown in Table C.4 on page 201.
The improvement in detection effectiveness between nofa.EhdGry and nofa.EhdRgbAud, and between balanced.EhdGry and balanced.EhdRgbAud, proves that acoustic descriptors can be successfully combined with visual descriptors at the level of the distance function.
All the submissions achieved a very high Optimal F1, i.e., the system correctly delimits the boundaries of the copies. This is mainly due to the Two-step search: the inaccuracies produced by the approximate searches in the first step are reduced by the exact searches in the second step.
The submissions achieved better results for the NoFA profile than for the Balanced profile, which is consistent with our participation in 2010. Non-copies are usually easier to discard for global descriptors than for local descriptors, thus global descriptors may detect more correct copies before the first false alarm. However, the NDCR worsens for the Balanced profile because copies with complex transformations may be undetectable for global descriptors, which affects the detection rate.
In the case of Mean Processing Time, the four submissions are among the fastest Runs with good detection performance. This good balance between effectiveness and efficiency is due to a better implementation of the Preprocessing task, properly adjusted approximate search parameters, and a desktop computer with a more powerful CPU. Note also that the Two-step search can efficiently resolve audio+video searches: the MPT for audio+video submissions increased nearly 30% compared with video-only submissions, instead of the nearly sevenfold increase of the naive approach.
Results by Transformation
Figure 11.5 shows the results for each transformation for our submissions compared with:
the best result, the median, the best result for Runs type V, and the median for Runs type V.
In the case of the NoFA profile, nofa.EhdGry achieved the best detection effectiveness among Runs type V for V1, V2, V8, and V10. The most difficult transformations to detect for this Run were V1 and V6. In copy localization (F1), the best localization was achieved for V3 and the worst for V1. nofa.EhdRgbAud improves the detection effectiveness for almost every transformation. The acoustic descriptor achieves high performance for transformations A1, A2, and A5. This Run achieves its best NDCR for V3 and V8. In copy localization, the Run achieves the best overall localization for transformations V3, V4 and V8.
In the case of the Balanced profile, balanced.EhdGry achieves a better NDCR than the median of submissions type V for every transformation, but for V1, V5 and V6 its result is worse than the overall median. It achieves a good localization (higher than the median) for every transformation except V1. balanced.EhdRgbAud achieves a better NDCR than the median for every transformation. The inclusion of the acoustic descriptor improves the effectiveness, except for V3, where it is outperformed by balanced.EhdGry. As in the NoFA profile, it achieves the best overall localization for transformations V3, V4 and V8.
In summary, V1 and V2 are the most difficult visual transformations, while A3 is the most difficult acoustic transformation for the NoFA profile and A6 for the Balanced profile (see the Overall Median line for Optimal NDCR in Figure 11.5).
Figure 11.5 – Optimal NDCR and Optimal F1 by transformation at TRECVID 2011. Each plot compares our Runs with the Overall Best, Overall Median, Best Type V, and Median Type V over the a+v transformations (V1 to V10 combined with A1 to A7). (a) NoFA profile (nofa.EhdGry and nofa.EhdRgbAud); data shown in Table C.3 on page 200. (b) Balanced profile (balanced.EhdGry and balanced.EhdRgbAud); data shown in Table C.5 on page 202.
11.6 Precision/Recall analysis
As previously described, the TRECVID evaluation requests that each team adjust its system for either the NoFA or the Balanced profile. The evaluation divides each submission according to video transformation and evaluates each transformation using either the optimal threshold (Optimal NDCR) or the submitted threshold (Actual NDCR). The Optimal NDCR hinders the comparison of global performances because the threshold is calculated per transformation. The Actual NDCR uses the same threshold for every transformation, but most of the teams failed to submit a threshold with satisfactory performance. In previous sections, we relied on the Average Optimal NDCR to evaluate the global performance; however, an indicator using global thresholds would be more appropriate.
In this section we evaluate the global performance using Precision and Recall measures (see the definition in Section 5.6.2). In particular, we use two values: RP1 (maximum Recall with Precision 1) and RP.5 (maximum Recall with Precision greater than or equal to 0.5).
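Both indicators can be computed directly from the pooled detections sorted by their decision score. The following Python sketch is for illustration only; the representation of a detection as a (score, is_correct) pair is an assumption, and the exact definitions follow Section 5.6.2.

```python
def recall_at_precision(detections, total_copies, min_precision):
    """Maximum recall over the precision/recall curve subject to a minimum precision.

    detections is a list of (score, is_correct) pairs for all reported copies,
    and total_copies is the number of copies in the ground truth.
    RP1 corresponds to min_precision=1.0 and RP.5 to min_precision=0.5.
    """
    best_recall = 0.0
    tp = fp = 0
    for _, is_correct in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_correct:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        if precision >= min_precision:
            best_recall = max(best_recall, tp / total_copies)
    return best_recall

# rp1  = recall_at_precision(detections, total_copies, 1.0)
# rp05 = recall_at_precision(detections, total_copies, 0.5)
```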
11.6.1 TRECVID 2010
Figure 11.6 depicts the precision/recall curves for selected Runs type V. Additionally, Appendix B lists the RP1 and RP.5 values for all submissions at TRECVID 2010. In particular, our analysis considers four Runs type V: IBM's balanced.gistG [Natsev et al., 2010], which uses a global descriptor based on Gabor filter responses at different scales and orientations; TID's balanced.localvideo [Younessian et al., 2010], which is based on DART local descriptors (an alternative to SIFT descriptors) and computes spatial coherence with RANSAC; VIREO's balanced.srpeflip [Ngo et al., 2010], which follows the codebook approach with Hamming Embedding; and our balanced.ehdNgryhst [Barrios and Bustos, 2010], already described in Section 11.4.1.
The data suggests that Runs based on global descriptors achieve the best performance at
detecting copies without reporting false alarms. In fact, balanced.gistG detects 34% of the copies
and nofa.ehdNgryhst detects 25% of copies before reporting the first false alarm, compared to less
than 2% for balanced.srpeflip and balanced.localvideo. On the other hand, Runs based on local
descriptors achieve better performance in a balanced scenario: balanced.localvideo detects 67% of
copies and balanced.srpeflip detects 54.3% of copies when producing a similar amount of correct
and incorrect detections, compared to 53.9% for balanced.ehdNgryhst and 49% for balanced.gistG.
An interpretation for this behavior is that non-copy query videos are usually easier to discard
for global descriptors than for local descriptors. In fact, the TID team reported that local descriptors may trigger a false detection between two unrelated videos if they share captions or logos. This behavior is unlikely for global descriptors because captions and logos change only small zones. Hence, the amount of correct detections up to the first false alarm can be higher for global descriptors than for local descriptors. However, as precision decreases, local descriptors detect transformations that may be undetectable for global descriptors.
Figure 11.7 compares the performance of the four Runs. It summarizes the detection performance for each one of the 1072 visual queries: 67 base queries with a full copy (red pixels) plus 67 base queries with a copy excerpt (green pixels), each base video appearing with eight visual transformations (from V1 to V10). A dark red/green represents a correct detection at RP1, a light red/green represents a correct detection at RP.5, and yellow represents a copy detected neither at RP1 nor at RP.5.
Figure 11.6 – Precision/Recall curves for selected submissions type V to TRECVID 2010. (a) Submissions type V with high RP1; data shown in Table B.6 on page 194 (additional markers have been added to each curve, at an arbitrary position, to make them more visible). (b) Submissions type V with high RP.5; the marker shows the RP.5 on each precision/recall curve; data shown in Table B.8 on page 196.
Figure 11.7 – Copy detections at RP1 and RP.5 at TRECVID 2010: (a) IBM balanced.gistG, (b) PRISMA balanced.ehdNgryhst, (c) TID balanced.localvideo, (d) VIREO balanced.srpeflip. Each pixel represents a visual query video, grouped by visual transformation (V1 to V10). Queries containing a full copy are shown in red, queries containing a copy excerpt are shown in green. A dark red/green represents a correct detection at RP1, a light red/green represents a correct detection at RP.5, and yellow represents a copy detected neither at RP1 nor at RP.5.
The figure shows that submission balanced.gistG achieves high detection performance for transformations V4, V5 and V6, thus the GIST descriptor is highly robust to quality transformations. In fact, most of the queries are detected at RP1. However, the descriptor shows weak performance for postproduction transformations.
Detections for balanced.ehdNgryhst are sparsely distributed between V2 and V10. In fact, among the 134 base videos, it successfully detects at least one copy for 126 of them (94%). In comparison, balanced.gistG was able to detect at least one copy for 122 base videos (91%), balanced.srpeflip for 121 base videos (90%), and balanced.localvideo for 126 base videos (94%). This behavior shows that the combination of global descriptors can achieve high performance, even comparable to local descriptors. balanced.ehdNgryhst may have achieved higher performance with more suitable approximate search parameters.
Regarding the reported processing times of the four Runs, balanced.gistG is the fastest (it uses an approximate search for GIST descriptors using FLANN), balanced.ehdNgryhst is the second fastest, while both Runs based on local descriptors are slower.
In summary, the combination of distances could have achieved high detection performance; however, the approximate search was not properly configured to profit from that potential.
11.6.2 TRECVID 2011
Figure 11.8 depicts the precision/recall curves for Runs with high detection performance.
Additionally, Appendix C lists RP1 and RP.5 values for all submissions at TRECVID 2011. We
compare the results in two scenarios: Runs type V and Runs type AV.
Figure 11.8 – Precision/Recall curves for selected submissions to TRECVID 2011. (a) Submissions with high RP1; data shown in Table C.6 on page 203 (additional markers have been added to each curve, at an arbitrary position, to make them more visible). (b) Submissions with high RP.5; the marker shows the RP.5 on each precision/recall curve; data shown in Table C.8 on page 205.
Figure 11.9 – Copy detections at RP1 and RP.5 at TRECVID 2011: (a) KDDI's balanced.4sys, (b) PRISMA's balanced.EhdGry, (c) INRIA's balanced.deaf, (d) BUPT-MCPRL's balanced.zhVideo. See Figure 11.7 for an explanation of the figure.
Runs Type V
We analyze the detection performance of four Runs type V: our nofa.EhdGry, KDDI's balanced.4sys, INRIA's balanced.deaf, and BUPT-MCPRL's balanced.zhVideo. Figure 11.9 details the performance for every query video, comparing the performance at RP1 (darker red/green) and RP.5 (lighter red/green), divided by visual transformation.
PRISMA's nofa.EhdGry achieves the highest detection performance at RP1, detecting 59% of the copies before reporting the first false alarm. The improvement compared with our previous participation, where nofa.ehdNgryhst achieved an RP1 of 24%, is mainly due to a better choice of approximate search parameters. When a similar amount of correct and incorrect detections is allowed, nofa.EhdGry achieves an RP.5 of 69%.
KDDI's balanced.4sys [Uchida et al., 2011] computes global descriptors (DCT descriptors at different frame sizes) and uses a search based on lookup tables. The Run achieved an RP1 of 45% and an RP.5 of 55%, hence it was outperformed by nofa.EhdGry. In fact, balanced.4sys achieves a very high performance for quality transformations (V3 to V6), but a very low performance for postproduction transformations.
INRIA's balanced.deaf [Ayari et al., 2011] computes CS-LBP descriptors (an alternative to SIFT descriptors) and a codebook with Hamming Embedding. That submission achieved an RP1 of 54%, hence it is outperformed by nofa.EhdGry, but it achieves a strong performance at RP.5, detecting 94% of the copies.
BUPT-MCPRL's balanced.zhVideo [Zhao et al., 2011b] combines SIFT descriptors and different global descriptors (binary patterns, color correlogram, HOG). It failed to detect copies without false alarms, but it achieved a strong RP.5 of 93% in the balanced scenario.
In summary, the results for nofa.EhdGry confirm the good performance at detecting copies without false alarms, and show that a combined distance is able to detect copies for both quality and postproduction transformations; however, in the balanced scenario its detection performance is lower than that of CBVCD systems based on local descriptors.
Figure 11.10 – Copy detections at RP1 and RP.5 at TRECVID 2011: (a) PRISMA's balanced.EhdRgbAud, (b) CRIM's balanced.V48A66T58B. On the left side of each subfigure the queries are grouped by a+v transformation (V1 to V10 combined with A1 to A7); on the right side the queries are grouped by base video.
Figure 11.11 – Results achieved by fusing our system into Telefonica's system: (a) Telefonica's balanced.multimodal, showing the baseline performance of Telefonica's system; (b) Telefonica's balanced.joint, showing the performance of the fusion between Telefonica's system and ours. Queries are grouped as in Figure 11.10.
Runs Type AV
Regarding the NoFA scenario, PKU-IDM's balanced.cascade [Jiang et al., 2011] achieved the highest RP1 with 94.7% correct detections without false alarms. Paradoxically, it was submitted to the Balanced profile. Next, our nofa.EhdRgbAud achieves the second highest RP1 with 69.9% correct detections without false alarms. A slightly lower result is achieved by CRIM's submissions [Gupta et al., 2011], with an RP1 of 69.6%; they perform a late fusion of acoustic and global descriptors and a linear scan implemented on GPU.
Regarding the balanced scenario, INRIA's balanced.dodo [Ayari et al., 2011] achieves the highest RP.5 with 98.4% correct detections. Next, both PKU-IDM's balanced.cascade and BUPT-MCPRL's balanced.wsyVA achieve an RP.5 of 94.7%, while CRIM achieved 94.4%. Our submission balanced.EhdRgbAud achieves a much lower result, with an RP.5 of 76.1%.
Figure 11.10 shows a detailed comparison between balanced.EhdRgbAud and CRIM's balanced.V48A66T58B. The figure summarizes the detection performance for each one of the 7504 a+v queries: 67 base queries with a full copy (red pixels) plus 67 base queries with a copy excerpt (green pixels), each base video appearing with the combination of eight visual transformations (from V1 to V10) and seven acoustic transformations (from A1 to A7). A dark red/green represents a correct detection at RP1, a light red/green represents a correct detection at RP.5, and yellow represents a copy detected neither at RP1 nor at RP.5. The figure organizes the queries according to two different criteria: the left side groups queries by a+v transformation, thus revealing their impact on detection effectiveness; the right side groups queries by base video, thus revealing the influence of the actual audiovisual content on detection effectiveness.
The figure shows that balanced.EhdRgbAud does not perform well for transformations V1, V5 and V6. In those cases the audio information slightly improves the performance for transformations A1 and A2. These results show that the selected fusion heavily favors the visual information, which is due to the reduction of the weight applied to the acoustic distance (described in Section 11.5.1).
On the other hand, CRIM's balanced.V48A66T58B, which uses a late fusion that considers global and acoustic descriptors, shows a similar performance at RP1 but a much higher performance at RP.5. This is the result of more appropriate global descriptors and a better fusion technique. The quality of the descriptors becomes evident when balanced.EhdRgbAud fully detects 10 base videos at RP1 (i.e., the base video is detected under every a+v transformation), while balanced.V48A66T58B fully detects 36 base videos. The late fusion enables detecting copies using just one modality (a copy can be detected in either the visual or the audio track), while the fusion of distances mainly detects copies that are present in both modalities at the same time. Additionally, the use of linear scans also contributes to CRIM's high performance: in fact, NTT-CSL's balanced.3 [Mukai et al., 2011] uses the same global descriptor as CRIM but resolves the searches with lookup tables and codebooks, achieving a lower RP.5 of 89.8%.
Figure 11.11 shows a comparison between Telefonica's balanced.multimodal and balanced.joint. balanced.multimodal is a Run that performs a late fusion between local descriptors and audio; balanced.joint additionally includes a late fusion with the result of PRISMA's copy detection based on global descriptors. The figure reveals that the improvement at RP1 is large (from 0% to 53.3%), while the improvement at RP.5 is rather marginal (from 87.6% to 91.8%). Therefore, the inclusion of global descriptors has a large impact on the number of detections without false alarms, but a small impact on detecting new copies that are not detectable by local descriptors.
In summary, the fusion of distances is an alternative for combining global and acoustic descriptors that improves the detection performance without producing false alarms. However, in a balanced scenario the fusion of distances has only a small impact on detection performance.
11.6.3 Evaluation of preprocessing task
As shown in Chapter 7, the preprocessing task has a big impact on the detection performance of global descriptors. To estimate this impact, we assume that query videos with camcording (V1) or PIP (V2) become undetectable when the preprocessing task is not present. We then calculate the precision/recall curves for our submissions type V when all the detections for query videos with transformations V1 or V2 have been discarded (assuming that the non-preprocessed videos would not generate false alarms).
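One possible way to carry out this simulation is sketched below: every detection whose query video carries a V1 or V2 transformation is dropped, and the remaining detections are fed again to the precision/recall computation of Section 11.6. The detection format and the query_transformation mapping are hypothetical.

```python
def simulate_no_preprocessing(detections, query_transformation, discarded=("V1", "V2")):
    """Drop detections for queries with camcording (V1) or PIP (V2), simulating
    a system without the preprocessing task; the surviving detections can then
    be fed to the precision/recall computation (recall_at_precision above)."""
    return [(score, is_correct) for (score, is_correct, query_id) in detections
            if query_transformation[query_id] not in discarded]
```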
Figure 11.12 shows the precision/recall curves for four submissions type V. In the TRECVID 2010 submissions, the reversion of camcording had little impact on effectiveness (about 0.01 points of recall) and the reversion of PIP had a bigger impact (about 0.08 points of recall). In fact, the TRECVID query collection comprises 12.5% of videos for each visual transformation, thus the preprocessing helped detect about two thirds of all the queries with PIP. In TRECVID 2011, we improved our implementation of the preprocessing task, thus the reversion of camcording and PIP had a bigger impact on effectiveness (about 0.04 and 0.09 points of recall, respectively).
These results show that discarding the preprocessing would worsen the detection performance, because global descriptors are highly affected by camcording and PIP transformations. In fact, a decrease of 0.1 points in RP1 and RP.5 would highly impact the performance of our Runs (see Table B.6 on page 194 and Table C.6 on page 203). On the other hand, without the detection and reversion of transformations the number of query videos to process would be reduced, hence there would be more room for improving the accuracy of the approximate search without increasing the total search time.
11.7 Summary
In this chapter we reviewed our participation in the CCD evaluation at TRECVID 2010 and
2011.
In our participation at TRECVID 2010 we submitted four Runs. To create these Runs
we used video preprocessing, spatio-temporal global descriptors, weighting by max-ρ, approximate
search with pivots, and temporal coherence between nearest neighbors. The Runs showed competitive performance, especially for detecting copies without false alarms, outperforming most of the
systems that only used visual information.
In our participation at TRECVID 2011 we again submitted four Runs. Two of them used
global descriptors, and the other two included a combination of global and acoustic descriptors.
The results showed a big improvement with respect to the previous participation, mainly due to a better implementation of the preprocessing task, better parameters for the descriptors and the distance function, and the Two-step search, which combines approximate and exact searches. The Runs
achieved a good performance, outperforming many systems, especially for detections without false
alarms.
These results show that global descriptors can achieve competitive performance compared with other state-of-the-art systems (which are usually based on local descriptors), in particular for detection without false alarms. The distance fusion between global and acoustic descriptors can achieve a high performance.
Figure 11.12 – Precision/Recall curves for the original Run, and for a Run discarding every detection from queries with either camcording or PIP transformations. The four plots correspond to TRECVID 2010 nofa.ehdNgryhst, TRECVID 2010 balanced.ehdNclrhst, TRECVID 2011 nofa.EhdGry, and TRECVID 2011 balanced.EhdGry; each compares the Original curve with the "No camcording" and "No PIP" curves.
Chapter 12
Conclusions
In this thesis, we have detailed a novel CBVCD system, called P-VCD. It is founded on
the metric space approach and uses many novel algorithms and techniques in order to achieve high
effectiveness and efficiency. P-VCD has been released as an Open Source Project under the GNU
General Public License version 3.0 (GPLv3). Most of the source code used in this thesis can be freely downloaded from the P-VCD website: https://sourceforge.net/projects/p-vcd/.
The system has been evaluated using the MUSCLE-VCD-2007 dataset (which is a medium-size dataset), and by participating in the CBVCD evaluation at TRECVID 2010 and 2011 (which uses large datasets). P-VCD shows high performance in both evaluations. In the case of MUSCLE-VCD-2007, it outperforms the best published result for the dataset, achieving the maximum detection effectiveness. In the case of TRECVID, it shows competitive performance with other state-of-the-art CBVCD systems. Therefore, this thesis exhibits an approach with high potential that we
believe makes a valuable contribution to the CBMIR area.
12.1 Summary of the main contributions
Our proposed approach for CBVCD consists of five main tasks: Preprocessing, Video Segmentation, Feature Extraction, Similarity Search, and Copy Localization. The main contributions
of this thesis, divided by task, can be summarized as follows:
• Regarding the Preprocessing task:
– We have shown that the preprocessing of videos can improve detection effectiveness. In
particular, we presented two processes: one for quality normalization and one for reversion of transformations, both producing a big improvement in the detection effectiveness.
• Regarding the Video Segmentation and Feature Extraction tasks:
– We have reviewed different global descriptors, performed experimental evaluations of
them using different segmentations, analyzed the results, and have defined some guidelines to design a good descriptor.
– We have developed a spatio-temporal extension for frame-based global descriptors showing that the spatio-temporal description improves the quality of the descriptors without
affecting the similarity search, and that rather simple global descriptors can achieve
high detection performance. In particular, the descriptors based on orientation of edges
achieve high detection effectiveness.
– We have proposed an acoustic descriptor that can be used with the metric space approach.
The descriptor achieves high detection performance and can be seamlessly combined with
global descriptors.
• Regarding the effectiveness of the Similarity Search task:
– We have shown that the combination of distances at the similarity search can improve
the effectiveness of the search. We developed three novel techniques to automate the selection of weights in the combination: the α-normalization, the weighting by max-ρ, and the weighting by max-τ. We analyzed these algorithms, showing that they enable automatically setting good weights without requiring the use of training data.
– We have shown that the spatio-temporal combined distance can improve the effectiveness
of the search. In particular, we have shown that the spatio-temporal combination of the
distances between the Edge Histogram descriptor and the novel acoustic descriptor can
achieve the maximum effectiveness for MUSCLE-VCD-2007, i.e., to detect all the copies
without giving any false alarms. This result outperforms the state-of-the-art systems
evaluated on this dataset.
– We have analyzed the use of some common non-metric distances in order to improve
the effectiveness. We have concluded that the evaluated non-metric distances present only
small gains at the cost of decreasing the efficiency, and therefore the metric distance L1
achieves the best tradeoff between effectiveness and efficiency.
• Regarding the efficiency of the Similarity Search task:
– We have developed the Approximate Search with Pivots, which uses static pivots to estimate distances and discard most of the actual distance evaluations. We have shown that this approximate search offers a convenient effectiveness-versus-efficiency trade-off. We have tested this search on the MUSCLE-VCD-2007 and TRECVID datasets with good results.
– We have developed an approach to apply the Approximate Search with Pivots to local
descriptors. We have shown that this approximate search can outperform the traditional
search approach for local descriptors in both effectiveness and efficiency.
– We have developed the Two-step search, which enables the use of complex combined
distances in large datasets. We have tested this technique to perform a combined audiovisual search in the TRECVID dataset with satisfactory results.
– We have developed the Snake Table, which uses previous query objects as dynamic pivots
to resolve exact searches. We have defined the snake distribution, and we have shown
experimentally that query objects in a CBVCD system fit that distribution. We have
shown the snake distribution enables a novel approach to index metric spaces with high
intrinsic dimensionality.
• Regarding the Copy Localization task:
– We have developed a voting algorithm that determines the boundaries of a copy by
analyzing the objects retrieved by the similarity search. Experimentally, we have shown that it can improve its effectiveness when it considers more than one similar object, weighting the votes according to their rank positions.
Regarding our participation in the CCD evaluation at TRECVID, we have validated that our approach for CBVCD and the developed techniques are indeed relevant to the current state-of-the-art.
12.2 Benefits and drawbacks of the proposed solution
The results presented in this thesis prove that the proposed solution can achieve both high effectiveness and efficiency. However, we must note that there exist alternatives that can be more suitable than our approach in some situations. For instance, in Section 3.4 we discuss the advantages and disadvantages of vector spaces versus metric spaces.
In general, the benefits and drawbacks of the proposed solution are related to the effectiveness-versus-efficiency tradeoff presented in the form "complex-vs-simple similarity measure". By simple similarity measures we refer to distances that compare vectors directly by their dimensions, like the traditional Minkowski distances. By complex similarity measures we refer to distances that may combine distances between descriptors, compute temporal correlation, or even compute the solution of an optimization problem. In general, the former distances are the basis for vector spaces, while the latter are the basis for metric spaces. The following list summarizes the strengths and weaknesses of the proposed approach:
• Pro: A complex similarity measure can achieve higher effectiveness than simple distance functions (see Chapter 8). In fact, complex distances ease the combination of descriptors, the
usage of variable-length descriptors, multimodal descriptors, and even non-vectorial descriptors, which may improve the effectiveness of a system.
• Pro: The metric approach enables improving the efficiency of complex similarity measures, as
long as they satisfy the metric properties. Therefore, even complex distances can use metric
indexes to outperform the linear scan. Moreover, videos usually show similarity between
consecutive frames which can easily be used by a metric index to improve efficiency (see
Snake Table in Chapter 9).
• Con: A complex similarity measure is usually computationally expensive and produces a space with high intrinsic dimensionality. Therefore, in scenarios involving large amounts of data (which is a common case in the video domain), lightweight measures (like the Euclidean
distance) are sometimes the only affordable functions that can be used. In those cases, highly
time-efficient techniques (like data projections or space divisions) show much higher scalability
than our approach. In fact, based on the results shown in this thesis, a medium-size dataset
(like MUSCLE-VCD-2007) can successfully be addressed by our approach, a large dataset (like
TRECVID’s CCD) can be addressed with satisfactory results, but a very large dataset (like
the dataset described by Poullot et al. [2010]) is far beyond the possibilities of our approach
(yet).
• Con: The similarity measure must satisfy the metric properties in order to apply the metric
approach. Unfortunately, some convenient techniques (like adaptive weights, partial similarity,
and even the minimum of two distances) may break some metric property. Particularly in the video domain, the triangle inequality hinders the definition of a function for detecting copies
either in the visual track or in the audio track. In that situation, a similarity measure that does
not (always) satisfy the metric properties or a late fusion of partial results may be considered.
• Pro: The global efficiency of a system can generally be improved by investing in better hardware. Additionally, better hardware also enables computing more descriptors and reducing the search approximation, which may improve the effectiveness, but to a limited extent. On the other hand, enhancing similarity models can yield improvements in effectiveness that could never be reached by hardware enhancements. In other words, the lower efficiency that
is shown by metric spaces can be overcome by just improving the hardware, while the low
effectiveness shown by simple similarity models cannot be overcome just by better hardware.
12.3 Trends for future research
In this section we outline the open issues and research trends based on the results presented
in this thesis:
• Regarding the effectiveness of the search:
– We plan to study and develop a "universal normalization", i.e., given a distance d, to define the normalized function d̂ as d̂(x, y) = Fd(d(x, y)), where Fd is the cumulative distribution of d (see Section 8.1.1). This universal normalization is like the α-normalization with dynamic weights. The function d̂ is bounded to [0, 1] and thus it can be used to normalize and combine any distance. However, in general, d̂ does not satisfy the triangle inequality. More work is needed in order to study d̂, its properties, benefits, and drawbacks (a small illustrative sketch of this idea appears after this list).
– During the experimental analysis of the weighting by max-ρ, we concluded that the
weights that maximize MAP do not coincide with the weights that maximize ρ. In fact,
the analysis showed configurations that can achieve both high effectiveness and high
efficiency (see Figure 8.3 on page 97). We plan to study the characteristics of those
configurations and develop criteria to locate them.
– During the discussion of effectiveness versus efficiency at Section 8.1.2, we stated that a
perfectly discriminant function d′ , which returns a small value to the correct objects and
a large constant value to the irrelevant objects, would achieve maximum effectiveness and
also a near-infinite intrinsic dimensionality. However, if d′ returns uniformly distributed
values to the irrelevant objects (instead of a constant value), the intrinsic dimensionality
would decrease while maintaining the maximum effectiveness. Therefore, it is possible
to improve the efficiency of a distance without compromising its effectiveness. We plan
to develop this idea in order to define distances that achieve both high effectiveness and
efficiency.
– Using the approach of weighted combination of distances, we successfully combined distances from global descriptors and acoustic descriptors. This combination assumes that
the copy exists in both modalities, i.e., in the visual and audio tracks at the same time.
However, when the copy exists in only one modality, the combined function may not work
as expected. We plan to address the detection of copies that exist only in one modality.
– Another open problem is to successfully combine global, acoustic, and local descriptors
in a single distance function. In preliminary experiments, we tested the weighted combination between L1 and Matches (see Section 7.5), however the result was unsatisfactory
due to their different properties and behavior, hence a different distance for local descriptors is needed. In particular, we plan to test a distance using a dense sampling of
local descriptors, and a distance based on global summarizations of local descriptors, like
BOW or glocal descriptors.
– The algorithms we have developed to automatically combine distances (i.e., α-normalization, weighting by max-ρ, and weighting by max-τ) are general enough to be applied to other CBMIR problems. We plan to use them in different scenarios with different descriptors and objectives in order to test their behavior. In particular, we plan to use them in a content-based image retrieval system and a 3D object retrieval system.
– In this thesis we have developed an early fusion at the similarity search (see Section 4.6.1).
In the case of classification and semantic indexing, the common approach for the early
fusion is at the content description, i.e., the descriptors from multimodal sources are
combined in a unique descriptor which is the input to the classifier. We plan to develop
the “distance fusion” approach, which is to calculate the distance for each descriptor,
and to classify the vector of distances. Hence, this approach proposes to perform the
classification in a “distances space” instead of the descriptors space.
– In Section 9.3.3 we compare the linear combination of distances and the fusion of nearest
neighbors, and we showed the fusion of nearest neighbors outperforms the linear combination in both effectiveness and efficiency. The benefits come from discarding irrelevant
objects before combining the distances. We plan to continue the research on developing
similarity measures based on distance aggregation of partial nearest neighbors.
• Regarding the efficiency of the search:
– In Section 9.1.5 we described a sort of paradox that occurs when improving the distance in the approximate search: a distance with higher effectiveness usually implies a search space with higher intrinsic dimensionality; in turn, this produces a decrease in the quality of the estimator function, which may lead to a decrease in the effectiveness of the approximate search. This effect can be seen in Figure 9.1 on page 113, where the approximations of KF outperform the approximations of EH, despite the fact that an exact search with EH outperforms KF. More work is needed in order to analyze this effect.
– In the experimental evaluation of the approximate search with pivots for local descriptors
in Section 9.1.6, we noted that there is a limit to performing an approximate search with
high effectiveness. That limit is given by the number of SIFT descriptors, which in
turn depends on the size of the dataset. We plan to address the issue of scaling the
approximate search beyond that limit. In this case, it is necessary to evaluate other
local descriptors in order to determine the one that produces a search space with lower
intrinsic dimensionality than SIFT.
– The approximate search with pivots uses the maximum lower bound as an estimator.
We have tested other simple estimators (like minimum upper bound, and the average of
lower and upper bounds) with unsatisfactory results. Depending on the properties of the
actual distance to evaluate, there may be other properties to exploit in the definition of
the estimation. We plan to study this issue.
– The Snake Table can achieve high efficiency on metric spaces with high intrinsic dimensionality if the queries fit a snake distribution. However, if the query set does not fit a
snake distribution, the Snake Table may not achieve satisfactory results. In some cases,
there may be a reordering of the query objects that improves the snake distribution. We
have produced some advances in this topic [Barrios et al., 2013].
• Regarding other aspects of the thesis:
– The preprocessing of videos improves the detection effectiveness. The quality normalization improves the effectiveness with a minimal impact on search times. However, the detection and reversion of transformations highly affects the efficiency of the system. Moreover, the proposed reversion methods are specific to the TRECVID datasets, hence more work is needed in order to determine realistic transformations to revert and to develop implementations that do not harm the efficiency of the system.
– The benchmark we performed for global descriptors included many descriptors and techniques, however there are many more techniques and configurations that were not tested.
We plan to extend this evaluation in order to include other descriptors from the MPEG-7
standard, other color spaces, and other parameters for frame zoning and quantization.
– We presented an approach to extend frame-based global descriptors into spatio-temporal
global descriptors. An open issue is the generalization of this approach to local descriptors. This generalization would increase the quality of local descriptors for videos,
without needing to define a complex spatio-temporal local descriptor.
– P-VCD has been released as an Open Source project. This software uses a command-line interface, hence it requires experienced users. We plan to work on this project in order to improve the documentation and provide an end-user interface.
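As a small illustration of the first item above, the following Python sketch approximates the proposed universal normalization d̂ = Fd(d) with the empirical cumulative distribution of d over a sample of object pairs; the sampling strategy and helper names are assumptions, not part of the proposal itself.

```python
import numpy as np

def universal_normalizer(d, sample_pairs):
    """Sketch of the universal normalization d_hat(x, y) = F_d(d(x, y)).

    F_d is approximated by the empirical cumulative distribution of d over a
    sample of object pairs.  The returned function is bounded to [0, 1], so it
    can be combined with other normalized distances, although in general it
    does not satisfy the triangle inequality.
    """
    sample = np.sort([d(x, y) for x, y in sample_pairs])

    def d_hat(x, y):
        # fraction of sampled distances that are <= d(x, y)
        return np.searchsorted(sample, d(x, y), side="right") / len(sample)

    return d_hat
```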
Bibliography
C. Aggarwal and P. Yu. The IGrid index: reversing the dimensionality curse for similarity indexing
in high dimensional space. In Proc. of the 6h ACM SIGKDD int. conf. on Knowledge Discovery
and Data mining (KDD), pages 119–129. ACM, 2000.
C. Aggarwal, A. Hinneburg, and D. Keim. On the surprising behavior of distance metrics in high
dimensional space. In Proc. of the 8th int. conf. on Database Theory (ICDT), pages 420–434.
Springer, 2001.
G. Amato, F. Falchi, and C. Gennaro. Geometric consistency checks for knn based image classification relying on local features. In Proc. of the intl. workshop on Similarity Search and Applications
(SISAP), pages 81–88. ACM, 2011.
A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high
dimensions. Communications of the ACM, 51(1):117–122, 2008.
X. Anguera, P. Obrador, T. Adamek, D. Marimon, and N. Oliver. Telefonica research content-based
copy detection trecvid submission. In Proc. of TRECVID. NIST, 2009a.
X. Anguera, P. Obrador, and N. Oliver. Multimodal video copy detection applied to social media.
In Proc. of the 1st SIGMM workshop on Social media (WSM), pages 57–64. ACM, 2009b.
X. Anguera, T. Adamek, D. Xu, and J. M. Barrios. Telefonica research at trecvid 2011 content-based
copy detection. In Proc. of TRECVID. NIST, 2011a.
X. Anguera, J. M. Barrios, T. Adamek, and N. Oliver. Multimodal fusion for video copy detection.
In Proc. of the int. conf. on Multimedia (ACMMM), pages 1221–1224. ACM, 2011b.
S. Avila, N. Thome, M. Cord, E. Valle, and A. Araujo. Bossa: Extended bow formalism for image
classification. In Proc. of the int. conf. on Image Processing (ICIP), pages 2909–2912. IEEE,
2011.
M. Ayari, J. Delhumeau, M. Douze, H. Jégou, D. Potapov, J. Revaud, C. Schmid, and J. Yuan.
Inria@trecvid’2011: Copy detection & multimedia event detection. In Proc. of TRECVID. NIST,
2011.
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman
Publishing Co., Inc., 1999.
H. Bai, Y. Dong, W. Liu, L. Wang, C. Huang, and K. Tao. France telecom orange labs (beijing) at
trecvid 2011: Content-based copy detection. In Proc. of TRECVID. NIST, 2011.
J. M. Barrios. Content-based video copy detection. In Proc. of the int. conf. on Multimedia
(ACMMM), pages 1141–1142. ACM, 2009.
J. M. Barrios and B. Bustos. Text-based and content-based image retrieval on flickr: Demo. In
Proc. of the int. workshop on Similarity Search and Applications (SISAP), pages 156–157. IEEE,
2009.
J. M. Barrios and B. Bustos. Content-based video copy detection: Prisma at trecvid 2010. In Proc.
of TRECVID. NIST, 2010.
J. M. Barrios and B. Bustos. P-VCD: A pivot-based approach for content-based video copy detection. In Proc. of the IEEE int. conf. on Multimedia and Expo (ICME)., pages 1–6. IEEE,
2011a.
J. M. Barrios and B. Bustos. Automatic weight selection for multi-metric distances. In Proc. of the
int. workshop on Similarity Search and Applications (SISAP), pages 61–68. ACM, 2011b.
J. M. Barrios and B. Bustos. Competitive content-based video copy detection using global descriptors. Multimedia Tools and Applications, 62(1):75–110, 2013.
J. M. Barrios, B. Bustos, and X. Anguera. Combining features at search time: Prisma at video
copy detection task. In Proc. of TRECVID. NIST, 2011.
J. M. Barrios, B. Bustos, and T. Skopal. Snake table: A dynamic pivot table for streams of k-nn
searches. In Proc. of the int. workshop on Similarity Search and Applications (SISAP), pages
25–39. Springer, 2012.
J. M. Barrios, B. Bustos, and T. Skopal. Analyzing and dynamically indexing the query set. To
appear in Information Systems, 2013.
A. Basharat, Y. Zhai, and M. Shah. Content based video matching using spatiotemporal volumes.
Journal of Computer Vision and Image Understanding, 110(3):360–377, 2008.
M. Batko, P. Kohoutkova, and D. Novak. Cophir image collection under the microscope. In Proc.
of the intl. workshop on Similarity Search and Applications (SISAP), pages 47–54. IEEE, 2009.
H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool. Speeded-up robust features (SURF). Computer
Vision and Image Understanding, 110(3):346–359, 2008.
J. S. Beis and D. G. Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In IEEE conf. on Computer Vision and Pattern Recognition (CVPR), pages
1000–1006. IEEE, 1997.
D. N. Bhat and S. K. Nayar. Ordinal measures for image correspondence. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 20(4):415–423, 1998.
C. Böhm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces: Index structures for
improving the performance of multimedia databases. ACM Computing Surveys, 33(3):322–373,
2001.
J. S. Boreczky and L. A. Rowe. Comparison of video shot boundary detection techniques. Journal
of Electronic Imaging, 5(2):122–128, 1996.
Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In
Proc. of the intl. conf. on Computer Vision and Pattern Recognition (CVPR), pages 2559–2566.
IEEE, 2010.
G. Bradski and A. Kaehler. Learning OpenCV. O’Reilly Media, Inc., 2008.
S. Brin. Near neighbor search in large metric spaces. In Proc. of the int. conf. on Very Large
Databases (VLDB)., pages 574–584. Morgan Kauffman, 1995.
B. Bustos. Index structures for similarity search in multimedia databases. Ph.D. thesis, Department
of Computer and Information Science, University of Konstanz, 2006.
B. Bustos and T. Skopal. Dynamic similarity search in multi-metric spaces. In Proc. of the int.
workshop on Multimedia Information Retrieval (MIR), pages 137–146. ACM, 2006.
B. Bustos, G. Navarro, and E. Chávez. Pivot selection techniques for proximity searching in metric
spaces. Pattern Recognition Letters, 24(14):2357–2366, 2003.
B. Bustos, O. Pedreira, and N. Brisaboa. A dynamic pivot selection technique for similarity search.
In Proc. of the int. workshop on Similarity Search and Applications (SISAP), pages 105–112.
IEEE, 2008.
J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 8(6):679–698, 1986.
E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquı́n. Searching in metric spaces. ACM
Computing Surveys, 33(3):273–321, 2001.
L. Chen and X. Lian. Efficient similarity search in nonmetric spaces with local constant embedding.
IEEE Transactions on Knowledge and Data Engineering, 20(3):321–336, 2008.
M. Cherubini, R. de Oliveira, and N. Oliver. Understanding near-duplicate videos: A user-centric
approach. In Proc. of the int. conf. on Multimedia (ACMMM), pages 35–44. ACM, 2009.
S. Cheung and A. Zakhor. Efficient video similarity measurement with video signature. IEEE
Transactions on Circuits and Systems for Video Technology, 13(1):59–74, 2003.
O. Chum, J. Philbin, M. Isard, and A. Zisserman. Scalable near identical image and shot detection.
In Proc. of the int. conf. on Image and Video Retrieval (CIVR), pages 549–556. ACM, 2007.
P. Ciaccia and M. Patella. Searching in metric spaces with user-defined and approximate distances.
ACM Transactions on Database Systems, 27(4):398–437, 2002.
P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search
in metric spaces. In Proc. of the int. conf. on Very Large Databases (VLDB)., pages 426–435.
Morgan Kauffman, 1997.
Cisco Systems Inc. Global IP Traffic Forecast and Methodology, 2006–2011, jan 14, 2008. White
paper.
Cisco Systems Inc. Cisco Visual Networking Index: Forecast and Methodology, 2011–2016, may 30,
2012. White paper.
B. Coskun, B. Sankur, and N. Memon. Spatio-temporal transform based video hashing. IEEE
Transactions on Multimedia, 8(6):1190–1208, 2006.
T. Deselaers, T. Weyand, and H. Ney. Image retrieval and annotation using maximum entropy. In
CLEF Workshop 2006, pages 725–734. Springer, 2007.
M. Douze, A. Gaidon, H. Jegou, M. Marszalek, and C. Schmid. Inria lear’s video copy detection
system. In Proc. of TRECVID. NIST, 2008.
S. Eickeler and S. Müller. Content-based video indexing of tv broadcast news using hidden markov models. In Proc. of the IEEE int. conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 2997–3000 Vol. 6. IEEE, 1999.
C. Faloutsos and K.-I. Lin. Fastmap: a fast algorithm for indexing, data-mining and visualization of
traditional and multimedia datasets. In Proc. of the int. conf. on Management of data (SIGMOD),
pages 163–174. ACM, 1995.
B. Fauvet, P. Bouthemy, P. Gros, and F. Spindler. A geometrical key-frame selection method
exploiting dominant motion estimation in video. In Proc. of the 3th int. conf. on Image and
Video Retrieval (CIVR), pages 419–427. Springer, 2004.
J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic
expected time. ACM Transactions on Mathematical Software, 3(3):209–226, 1977.
N. Gengembre and S.-A. Berrani. A probabilistic framework for fusing frame-based searches within
a video copy detection system. In Proc. of the int. conf. on Image and Video Retrieval (CIVR),
pages 211–220. ACM, 2008.
K.-S. Goh, B. Li, and E. Chang. DynDex: A dynamic and non-metric space indexer. In Proc. of
the 10th ACM int. conf. on Multimedia (ACMMM), pages 466–475. ACM, 2002.
R. C. Gonzalez and R. E. Woods. Digital Image Processing (Third Edition). Prentice-Hall, Inc.,
2007.
X. Guo, Y. Chen, W. Liu, Y. Mao, H. Zhang, K. Zhou, L. Wang, Y. Hua, Z. Zhao, Y. Zhao, and
A. Cai. Bupt-mcprl at trecvid 2010. In Proc. of TRECVID. NIST, 2010.
V. Gupta, G. Boulianne, and P. Cardinal. Crim’s content-based audio copy detection system for
trecvid 2009. In Proc. of the int. workshop on Content-Based Multimedia Indexing (CBMI), pages
1–6. IEEE, 2010.
V. Gupta, P. D. Z. Varcheie, L. Gagnon, and G. Boulianne. Crim at trecvid-2011: Content-based
copy detection using nearest-neighbor mapping. In Proc. of TRECVID. NIST, 2011.
A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proc. of the int. conf.
on Management of data (SIGMOD), pages 47–57. ACM, 1984.
J. Haitsma and T. Kalker. A highly robust audio fingerprinting system. In Proc. of the int. symp.
on Music Information Retrieval (ISMIR), pages 107–115. ISMIR, 2002.
A. Hampapur and R. Bolle. Comparison of distance measures for video copy detection. In Proc. of
the IEEE int. conf. on Multimedia and Expo (ICME), pages 737–740. IEEE, 2001.
A. Hampapur, T. Weymouth, and R. Jain. Digital video segmentation. In Proc. of the int. conf.
on Multimedia (ACMMM), pages 357–364. ACM, 1994.
A. Hanjalic. Shot-boundary detection: unraveled and resolved? IEEE Transactions on Circuits
and Systems for Video Technology, 12(2):90–105, 2002.
C. Harris and M. Stephens. A combined corner and edge detector. In Proc. of the Alvey Vision
Conference, pages 147–151. The Plessey Company, 1988.
G. Hjaltason and H. Samet. Ranking in spatial databases. In Proc. of the 4th int. symp. on Spatial
Databases (SSD), pages 83–95. Springer, 1995.
P. Howarth and S. Rüger. Fractional distance measures for content-based image retrieval. In Proc.
of the 27th european conf. on IR Research (ECIR), pages 447–456. Springer, 2005.
M. Hradiš, I. Řezníček, D. Bařina, A. Vlček, and P. Zemčík. Brno university of technology at trecvid
2010 sin, ccd. In Proc. of TRECVID. NIST, 2010.
M. Hradiš, I. Řezníček, K. Behúň, and L. Otrusina. Brno university of technology at trecvid 2011
sin, ccd. In Proc. of TRECVID. NIST, 2011.
K. Iwamoto, E. Kasutani, and A. Yamada. Image signature robust to caption superimposition
for video sequence identification. In Proc. of the int. conf. on Image Processing (ICIP), pages
3185–3188. IEEE, 2006.
H. Jégou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large
scale image search. In Proc. of the european conf. on Computer Vision (ECCV), pages 304–317.
Springer, 2008.
H. Jégou, M. Douze, and C. Schmid. Packing bag-of-features. In Proc. of the IEEE int. conf. on
Computer Vision (ICCV), pages 2357–2364. IEEE, 2009.
H. Jégou, M. Douze, G. Gravier, C. Schmid, and P. Gros. Inria lear-texmex: Video copy detection
task. In Proc. of TRECVID. NIST, 2010.
M. Jiang, S. Fang, Y. Tian, T. Huang, and W. Gao. Pku-idm @ trecvid 2011 cbcd: Content-based
copy detection with cascade of multimodal features and temporal pyramid matching. In Proc. of
TRECVID. NIST, 2011.
W. Jiang, C. Cotton, S.-F. Chang, D. Ellis, and A. C. Loui. Short-term audio-visual atoms for
generic video concept classification. In Proc. of the int. conf. on Multimedia (ACMMM), pages
5–14. ACM, 2009.
A. Joly, C. Frélicot, and O. Buisson. Robust content-based video copy identification in a large
reference database. In Proc. of the int. conf. on Image and Video Retrieval (CIVR), pages 414–
424. Springer, 2003.
A. Joly, O. Buisson, and C. Frélicot. Content-based copy retrieval using distortion-based probabilistic similarity search. IEEE Transactions on Multimedia, 9(2):293–306, 2007.
A. Joly, J. Law-To, and N. Boujemaa. Inria-imedia trecvid 2008: Video copy detection. In Proc. of
TRECVID. NIST, 2008.
Y. Ke and R. Sukthankar. Pca-sift: A more distinctive representation for local image descriptors.
In Proc. of the intl. conf. on Computer Vision and Pattern Recognition (CVPR), pages II–506 –
II–513 Vol.2. IEEE, 2004.
C. Kim. Content-based image copy detection. Signal Processing: Image Communication, 18(3):
169–184, 2003.
C. Kim and B. Vasudev. Spatiotemporal sequence matching for efficient video copy detection. IEEE
Transactions on Circuits and Systems for Video Technology, 15(1):127–132, 2005.
S. Kutluk and B. Gunsel. Itu mspr trecvid 2010 video copy detection system. In Proc. of TRECVID.
NIST, 2010.
G. Langelaar, I. Setyawan, and R. Lagendijk. Watermarking digital image and video data: a state-of-the-art overview. IEEE Signal Processing Magazine, 17(5):20–46, 2000.
J. Law-To, O. Buisson, V. Gouet-Brunet, and N. Boujemaa. Robust voting algorithm based on
labels of behavior for video copy detection. In Proc. of the int. conf. on Multimedia (ACMMM),
pages 835–844. ACM, 2006.
J. Law-To, L. Chen, A. Joly, I. Laptev, O. Buisson, V. Gouet-Brunet, N. Boujemaa, and F. Stentiford. Video copy detection: a comparative study. In Proc. of the int. conf. on Image and Video
Retrieval (CIVR), pages 371–378. ACM, 2007a.
J. Law-To, A. Joly, and N. Boujemaa. MUSCLE-VCD-2007: A live benchmark for video copy
detection, 2007b. https://www-rocq.inria.fr/imedia/civr-bench/.
S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for
recognizing natural scene categories. In Proc. of the intl. conf. on Computer Vision and Pattern
Recognition (CVPR), pages 2169–2178 Vol.2. IEEE, 2006.
D.-D. Le, S. Poullot, M. Crucianu, X. Wu, M. Nett, M. E. Houle, and S. Satoh. National institute
of informatics, japan at trecvid 2009. In Proc. of TRECVID. NIST, 2009.
D.-D. Le, S. Poullot, X. Wu, B. Nouvel, and S. Satoh. National institute of informatics, japan at
trecvid 2010. In Proc. of TRECVID. NIST, 2010.
D.-D. Le, C.-Z. Zhu, S. Poullot, V. Q. Lam, D. A. Duong, and S. Satoh. National institute of
informatics, japan at trecvid 2011. In Proc. of TRECVID. NIST, USA, 2011.
S. Lee and C. D. Yoo. Robust video fingerprinting for content-based video identification. IEEE
Transactions on Circuits and Systems for Video Technology, 18(7):983–988, 2008.
M. Lew, N. Sebe, C. Djeraba, and R. Jain. Content-based multimedia information retrieval: State
of the art and challenges. ACM Transactions on Multimedia Computing, Communications and
Applications, 2(1):1–19, 2006.
B. Li, E. Chang, and C.-T. Wu. DPF - a perceptual distance function for image retrieval. In Proc.
of the 2002 int. conf. on Image Processing (ICIP), pages II–597–II–600. IEEE, 2002.
Y. Li, L. Mou, M. Jiang, C. Su, X. Fang, M. Qian, Y. Tian, Y. Wang, T. Huang, and W. Gao.
Pku-idm @ trecvid 2010: Copy detection with visual-audio feature fusion and sequential pyramid
matching. In Proc. of TRECVID. NIST, 2010.
Y. Liang, B. Cao, J. Li, C. Zhu, Y. Zhang, C. Tan, G. Chen, C. Sun, J. Yuan, M. Xu, and B. Zhang.
Thu-img at trecvid 2009. In Proc. of TRECVID. NIST, 2009.
Z. Liu, E. Zavesky, D. Gibbon, B. Shahraray, and P. Haffner. AT&T research at trecvid 2007. In
Proc. of TRECVID. NIST, 2007.
Z. Liu, T. Liu, D. Gibbon, and B. Shahraray. Effective and scalable video copy detection. In Proc.
of the int. conf. on Multimedia Information Retrieval (MIR), pages 119–128. ACM, 2010a.
Z. Liu, E. Zavesky, N. Sawant, and B. Shahraray. AT&T research at trecvid 2010. In Proc. of
TRECVID. NIST, 2010b.
Z. Liu, E. Zavesky, N. Zhou, and B. Shahraray. AT&T research at trecvid 2011. In Proc. of TRECVID.
NIST, 2011.
D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
D. Lowe. SIFT demo program (version 4), 2005. https://www.cs.ubc.ca/~lowe/keypoints/
siftDemoV4.zip.
B. S. Manjunath, J.-R. Ohm, V. V. Vasudevan, and A. Yamada. Color and texture descriptors.
IEEE Transactions on Circuits and Systems for Video Technology, 11(6):703–715, 2001.
Y. Meng, E. Chang, and B. Li. Enhancing DPF for near-replica image recognition. In IEEE conf.
on Computer Vision and Pattern Recognition (CVPR), pages 416–423. IEEE, 2003.
L. Micó and J. Oncina. A constant average time algorithm to allow insertions in the LAESA fast
nearest neighbour search index. In Proc. of the int. conf. on Pattern Recognition (ICPR), pages
3911–3914. IEEE, 2010.
M. Micó, J. Oncina, and E. Vidal. A new version of the nearest-neighbour approximating and
eliminating search algorithm (AESA) with linear preprocessing time and memory requirements.
Pattern Recognition Letters, 15(1):9–17, 1994.
H.-S. Min, W. D. Neve, and Y. M. Ro. Towards using semantic features for near-duplicate video
detection. In Proc. of the IEEE int. conf. on Multimedia and Expo (ICME), pages 1364–1369.
IEEE, 2010.
H.-S. Min, J. Y. Choi, W. D. Neve, and Y. M. Ro. Bimodal fusion of low-level visual features
and high-level semantic features for near-duplicate video clip detection. Signal Processing: Image
Communication, 26(10):612–627, 2011.
M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In Proc. of the int. conf. on Computer Vision Theory and Application (VISSAPP), pages
331–340. INSTICC Press, 2009.
R. Mukai, T. Kurozumi, K. Hiramatsu, T. Kawanishi, H. Nagano, and K. Kashino. Ntt communication science laboratories at trecvid 2010 content-based copy detection. In Proc. of TRECVID.
NIST, 2010a.
R. Mukai, T. Kurozumi, K. Hiramatsu, T. Kawanishi, H. Nagano, and K. Kashino. Ntt communication science laboratories at trecvid 2010 content-based copy detection. In Proc. of TRECVID.
NIST, 2010b.
R. Mukai, T. Kurozumi, T. Kawanishi, H. Nagano, and K. Kashino. Ntt communication science
laboratories at trecvid 2011 content-based copy detection. In Proc. of TRECVID. NIST, 2011.
M. Naito, K. Matsumoto, M. Shishibori, K. Kita, M. Cuturi, T. Matsui, S. Sato, K. Hoashi, F. Sugaya, and Y. Nakajima. Shot boundary detection and high-level feature extraction experiments
for trecvid 2006. In Proc. of TRECVID. NIST, 2006.
A. Natsev, J. R. Smith, M. Hill, G. Hua, B. Huang, M. Merler, L. Xie, H. Ouyang, and M. Zhou.
Ibm research trecvid-2010 video copy detection and multimedia event detection system. In Proc.
of TRECVID. NIST, 2010.
X. Naturel and P. Gros. A fast shot matching strategy for detecting duplicate sequences in a
television stream. In Proc. of the int. workshop on Computer Vision meets Databases (CVDB),
pages 21–27. ACM, 2005.
R. Ng and J. Han. CLARANS: A method for clustering objects for spatial data mining. IEEE
Transactions on Knowledge and Data Engineering, 14(5):1003–1016, 2002.
C.-W. Ngo, S.-A. Zhu, H.-K. Tan, W.-L. Zhao, and X.-Y. Wei. Vireo at trecvid 2010: Semantic
indexing, known-item search, and content-based copy detection. In Proc. of TRECVID. NIST,
2010.
M. Patella and P. Ciaccia. Approximate similarity search: A multi-faceted problem. Journal of
Discrete Algorithms, 7(1):36–48, 2009.
S. Poullot, O. Buisson, and M. Crucianu. Z-grid-based probabilistic retrieval for scaling up content-based copy detection. In Proc. of the int. conf. on Image and Video Retrieval (CIVR), pages
348–355. ACM, 2007.
S. Poullot, M. Crucianu, and O. Buisson. Scalable mining of large video databases using copy
detection. In Proc. of the int. conf. on Multimedia (ACMMM), pages 61–70. ACM, 2008.
S. Poullot, O. Buisson, and M. Crucianu. Scaling content-based video copy detection to very large
databases. Multimedia Tools and Applications, 47(2):279–306, 2010.
G. Roth, R. Laganière, M. Bouchard, I. Lakhmiri, and T. Janati. Viva lab - university of ottawa
at trecvid 2009 content based copy detection. In Proc. of TRECVID. NIST, 2009.
V. Roth, J. Laub, J. Buhmann, and K.-R. Müller. Going metric: Denoising pairwise data. In Proc.
of the int. conf. on Neural Information Processing Systems (NIPS), pages 817–824. MIT Press,
2002.
A. H. Rouhi and J. A. Thom. Rmit university at trecvid 2011 content-based copy detection. In
Proc. of TRECVID. NIST, 2011.
Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a metric for image retrieval.
International Journal of Computer Vision, 40(2):99–121, 2000.
Y. Rubner, J. Puzicha, C. Tomasi, and J. Buhmann. Empirical evaluation of dissimilarity measures
for color and texture. Computer Vision and Image Understanding, 84(1):25–43, 2001.
T. Sakata, N. Matozaki, K. Kise, and M. Iwamura. Osaka prefecture university at trecvid 2011. In
Proc. of TRECVID. NIST, 2011.
A. Saracoğlu, E. Esen, T. K. Ateş, B. O. Acar, Ü. Zubari, E. C. Ozan, E. Özalp, A. A. Alatan, and
T. Çiloğlu. Content based copy detection with coarse audio-visual fingerprints. In Proc. of the
int. workshop on Content-Based Multimedia Indexing (CBMI), pages 213–218. IEEE, 2009.
A. Saracoğlu, E. Esen, M. Soysal, T. K. Ateş, B. Loğoğlu, M. Tekin, T. Karadeniz, M. Sevinç,
H. Sevimli, B. O. Acar, E. C. Ozan, D. O. Onur, S. Selçuk, A. A. Alatan, and T. Çiloğlu.
Tübitak uzay at trecvid 2010: Content-based copy detection and semantic indexing. In Proc. of
TRECVID. NIST, 2010.
S. Satoh, M. Takimoto, and J. Adachi. Scene duplicate detection from videos based on trajectories
of feature points. In Proc. of the int. conf. on Multimedia Information Retrieval (MIR), pages
237–244. ACM, 2007.
J. Shao, H. T. Shen, and X. Zhou. Challenges and techniques for effective and efficient similarity
search in large video databases. Proc. of the VLDB Endowment, 1(2):1598–1603, 2008.
E. Shechtman and M. Irani. Space-time behavior based correlation. In Proc. of the intl. conf. on
Computer Vision and Pattern Recognition (CVPR), pages 405–412 Vol.1. IEEE, 2005.
H. T. Shen, X. Zhou, Z. Huang, J. Shao, and X. Zhou. Uqlips: A real-time near-duplicate video clip
detection system. In Proc. of the int. conf. on Very Large Data Bases (VLDB), pages 1374–1377.
VLDB Endowment, 2007.
H. T. Shen, J. Shao, Z. Huang, Y. Yang, J. Song, J. Liu, and X. Zhu. Uqmsg experiments for
trecvid 2011. In Proc. of TRECVID. NIST, 2011.
J. Shi and C. Tomasi. Good features to track. In Proc. of the intl. conf. on Computer Vision and
Pattern Recognition (CVPR), pages 593–600. IEEE, 1994.
M. Shishibori, M. Ohnishi, Y. Tanioka, and K. Kita. Instance search and content-based copy
detection experiments for trecvid 2011. In Proc. of TRECVID. NIST, 2011.
J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in videos.
In Proc. of the IEEE int. conf. on Computer Vision (ICCV), pages 1470–1477. IEEE, 2003.
T. Skopal. On fast non-metric similarity search by metric access methods. In Proc. of the 10th int.
conf. on Extending Database Technology (EDBT), pages 718–736. Springer, 2006.
T. Skopal. Unified framework for fast exact and approximate search in dissimilarity spaces. ACM
Transactions on Database Systems, 32(4):29–47, 2007.
T. Skopal and B. Bustos. On nonmetric similarity search problems in complex domains. ACM
Computing Surveys, 43(4):1–34, 2011.
T. Skopal and J. Lokoč. NM-tree: Flexible approximate similarity search in metric and non-metric
spaces. In Proc. of the 19th int. workshop on Database and Expert Systems Applications (DEXA),
pages 312–325. Springer, 2008.
T. Skopal, J. Lokoč, and B. Bustos. D-cache: Universal distance cache for metric access methods.
IEEE Transactions on Knowledge and Data Engineering, 24(5):868–881, 2012.
A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In Proc. of the int.
workshop on Multimedia Information Retrieval (MIR), pages 321–330. ACM, 2006.
A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval
at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence,
22(12):1349–1380, 2000.
C. G. Snoek, M. Worring, and A. W. Smeulders. Early versus late fusion in semantic video analysis.
In Proc. of the int. conf. on Multimedia (ACMMM), pages 399–402. ACM, 2005.
C. Sun, J. Li, B. Zhang, and Q. Zhang. Thu-img at trecvid 2010. In Proc. of TRECVID. NIST,
2010.
M. Swanson, M. Kobayashi, and A. Tewfik. Multimedia data-embedding and watermarking technologies. Proc. of the IEEE, 86(6):1064–1087, 1998.
R. Szeliski. Computer Vision: Algorithms and Applications. Springer, 2010.
H.-K. Tan, C.-W. Ngo, R. Hong, and T.-S. Chua. Scalable detection of partial near-duplicate
videos by visual-temporal consistency. In Proc. of the int. conf. on Multimedia (ACMMM), pages
145–154. ACM, 2009.
C. Tomasi and T. Kanade. Detection and tracking of point features. Carnegie Mellon University
Technical Report CMU-CS-91-132, 1991.
Y. Uchida, S. Sakazawa, M. Agrawal, and M. Akbacak. Kddi labs and sri international at trecvid
2010: Content-based copy detection. In Proc. of TRECVID. NIST, 2010.
Y. Uchida, K. Takagi, and S. Sakazawa. Kddi labs at trecvid 2011: Content-based copy detection.
In Proc. of TRECVID. NIST, 2011.
K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluation of color descriptors for object
and scene recognition. In Proc. of the intl. conf. on Computer Vision and Pattern Recognition
(CVPR), pages 1–8. IEEE, 2008.
J. van Gemert, J.-M. Geusebroek, C. Veenman, and A. Smeulders. Kernel codebooks for scene
categorization. In Proc. of the european conf. on Computer Vision (ECCV), pages 696–709.
Springer, 2008.
A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms
(version 0.9.14), 2008. https://www.vlfeat.org/.
E. Vidal. New formulation and improvements of the nearest-neighbour approximating and eliminating search algorithm (AESA). Pattern Recognition Letters, 15(1):1–7, 1994.
G. Willems, T. Tuytelaars, and L. V. Gool. An efficient dense and scale-invariant spatio-temporal
interest point detector. In Proc. of the european conf. on Computer Vision (ECCV), pages 650–663.
Springer, 2008a.
G. Willems, T. Tuytelaars, and L. V. Gool. Spatio-temporal features for robust content-based
video copy detection. In Proc. of the int. conf. on Multimedia Information Retrieval (MIR),
pages 283–290. ACM, 2008b.
R. B. Wolfgang, C. I. Podilchuk, and E. J. Delp. Perceptual watermarks for digital images and
video. Proc. of the IEEE, 87(7):1108–1126, 1999.
X. Wu, A. G. Hauptmann, and C.-W. Ngo. Practical elimination of near-duplicates from web video
search. In Proc. of the int. conf. on Multimedia (ACMMM), pages 218–227. ACM, 2007.
M.-C. Yeh, C.-Y. Hsu, and C.-S. Lu. Ntnu-academia sinica at trecvid 2010 content based copy
detection. In Proc. of TRECVID. NIST, 2010.
E. Younessian, X. Anguera, T. Adamek, N. Oliver, and D. Marimon. Telefonica research at trecvid
2010 content-based copy detection. In Proc. of TRECVID. NIST, 2010.
J. Yuan, Z. Guo, L. Lv, W. Wan, T. Zhang, D. Wang, X. Liu, C. Liu, S. Zhu, D. Wang, Y. Pang,
N. Ding, Y. Liu, J. Wang, X. Zhang, X. Tie, Z. Wang, H. Wang, T. Xiao, Y. Liang, J. Li, F. Lin,
B. Zhang, J. Li, W. Wu, X. Tong, D. Ding, Y. Chen, T. Wang, and Y. Zhang. THU and ICRC
at trecvid 2007. In Proc. of TRECVID. NIST, 2007.
P. Zezula, G. Amato, V. Dohnal, and M. Batko. Similarity Search: The Metric Space Approach
(Advances in Database Systems). Springer, 2005.
W.-L. Zhao, D. Borth, and T. M. Breuel. Participation at trecvid 2011 semantic indexing & content-based copy detection tasks. In Proc. of TRECVID. NIST, 2011a.
Z. Zhao, Y. Zhao, X. Guo, Y. Chen, Y. Hua, W. Wang, C. Liu, S. Wu, H. Zhang, L. Wang, Y. Mao,
A. Cai, and M. Zhai. Bupt-mcprl at trecvid 2011. In Proc. of TRECVID. NIST, 2011b.
Y. Zhuang, Y. Rui, T. Huang, and S. Mehrotra. Adaptive key frame extraction using unsupervised
clustering. In Proc. of the int. conf. on Image Processing (ICIP), pages 866–870. IEEE, 1998.
Y. Zhuang, Y. Rui, and T. Huang. Video key frame extraction by unsupervised clustering and
feedback adjustment. Journal of Computer Science and Technology, 14(3):283–287, 1999.
Appendix A
MUSCLE-VCD-2007 ground-truth
Query Video | Copy Start | Reference Video | Copy Start | Copy Length | Transformations
ST1Query1 | 00:00.0 | movie27 | 00:00.0 | 06:58.3 |
ST1Query2 | — | — | — | — | —
ST1Query3 | 00:00.0 | movie8 | 00:00.0 | 06:19.1 |
ST1Query4 | — | — | — | — | —
ST1Query5 | 00:00.0 | movie44 | 00:01.4 | 07:43.8 |
ST1Query6 | 00:00.0 | movie76 | 00:09.3 | 06:02.1 | Frontal camcording, subtitles, live audio recording.
ST1Query7 | — | — | — | — | —
ST1Query8 | — | — | — | — | —
ST1Query9 | 00:00.0 | movie9 | 00:00.0 | 09:13.4 |
ST1Query10 | 00:00.0 | movie21 | 00:00.7 | 11:34.0 |
ST1Query11 | 00:00.0 | movie37 | 00:21.2 | 11:46.9 | Frontal camcording, live audio recording.
ST1Query12 | — | — | — | — | —
ST1Query13 | 00:00.0 | movie11 | 00:00.0 | 17:27.3 |
ST1Query14 | 00:00.0 | movie17 | 00:02.4 | 26:19.4 |
ST1Query15 | 00:00.0 | movie68 | 00:05.4 | 42:59.4 |
The other eight copies (ST1Query1, 3, 5, 9, 10, 13, 14 and 15) are transformed, one each, with: color adjustment and blur; reencoding, color adjustment and crop; strong reencoding and resize; color adjustment; non-frontal camcording and live audio recording; flip; zoom and subtitles; resize.
Table A.1 – Ground-truth of ST1 collection in MUSCLE-VCD-2007 dataset. Start times and copy length in format [minutes]:[seconds].[d]
Query Video | Copy Start | Reference Video | Copy Start | Copy Length | Transformations
ST2Query1 | 01:02.6 | movie30 | 05:51.0 | 00:38.4 | Blur, lack of audio track.
ST2Query1 | 02:52.6 | movie55 | 19:12.0 | 01:14.6 | Blur.
ST2Query1 | 07:51.3 | movie33 | 01:31.0 | 00:52.8 | Blur.
ST2Query1 | 09:11.6 | movie38 | 15:06.9 | 01:49.3 | Blur.
ST2Query1 | 11:22.5 | movie43 | 38:39.6 | 01:25.2 | Blur.
ST2Query1 | 13:30.7 | movie50 | 01:23.0 | 00:40.8 | Blur.
ST2Query2 | 00:40.2 | movie98 | 08:04.0 | 01:25.1 | Insertion of moving caption.
ST2Query2 | 03:30.6 | movie20 | 03:35.0 | 00:39.6 | Change of color, sharpness.
ST2Query2 | 04:30.7 | movie27 | 01:43.0 | 01:04.0 | Vertical deformation, letter box, contrast.
ST2Query2 | 06:21.4 | movie26 | 15:07.0 | 02:02.7 | Insertion of logo, vertical shift, contrast.
ST2Query2 | 08:40.6 | movie89 | 08:06.0 | 00:36.5 | (see note)
ST2Query2 | 10:36.4 | movie82 | 06:30.0 | 00:56.1 | (see note)
ST2Query2 | 12:41.0 | movie59 | 13:39.0 | 01:43.2 | Zoom, vertical deformation.
ST2Query2 | 16:28.2 | movie13 | 04:51.0 | 00:40.9 | Crop, change of color.
ST2Query3 | 01:10.6 | movie46 | 31:36.0 | 00:55.2 | (see note)
ST2Query3 | 03:27.6 | movie15 | 05:45.0 | 00:38.9 | (see note)
ST2Query3 | 04:56.4 | movie16 | 40:38.0 | 00:22.4 | Crop, insertion of logo, blur.
ST2Query3 | 06:18.5 | movie18 | 00:55.0 | 00:30.4 | (see note)
ST2Query3 | 07:56.1 | movie99 | 48:27.0 | 00:27.8 | (see note)
ST2Query3 | 10:02.2 | movie65 | 07:29.0 | 00:44.2 | (see note)
ST2Query3 | 11:22.9 | movie23 | 04:25.0 | 00:29.3 | Crop, change of color, vertical deformation.
Note: the movie89 and movie82 excerpts use, one each, blur and insertion of logo with contrast; the movie46 and movie15 excerpts use, one each, blur with flip and blur with vertical deformation; the movie18, movie99 and movie65 excerpts use, one each, change of gamma, noise with brightness, and brightness with change of gamma, vertical shift and subtitles.
Table A.2 – Ground-truth of ST2 collection in MUSCLE-VCD-2007 dataset. Start times and copy length in format [minutes]:[seconds].[d]
Figure A.1 – Copies in ST1 collection. On the left a frame from the query video and on the right the corresponding frame from the reference video, for the pairs ST1Query1/movie27, ST1Query3/movie8, ST1Query5/movie44, ST1Query6/movie76, ST1Query9/movie9, ST1Query10/movie21, ST1Query11/movie37, ST1Query13/movie11, ST1Query14/movie17 and ST1Query15/movie68.
Figure A.2 – Copy excerpts in ST2Query1. On the left a sample from the query video and on the right the corresponding frame from the reference video, for the excerpts copied from movie30, movie55, movie33, movie38, movie43 and movie50.
Figure A.3 – Copy excerpts in ST2Query2. On the left a sample from the query video and on the right the corresponding frame from the reference video, for the excerpts copied from movie98, movie20, movie27, movie26, movie89, movie82, movie59 and movie13.
Figure A.4 – Copy excerpts in ST2Query3. On the left a sample from the query video and on the right the corresponding frame from the reference video, for the excerpts copied from movie46, movie15, movie16, movie18, movie99, movie65 and movie23.
Appendix B
Results at TRECVID 2010
Team 2010 | Organization | Location | Paper
asahikasei | Asahikasei Co. | Asia | —
ATTLabs | AT&T Labs - Research | NorthAm | [Liu et al., 2010b]
brno | Brno University of Technology | Europe | [Hradiš et al., 2010]
BUPT-MCPRL | Beijing University of Posts and Telecommunications-MCPRL | Asia | [Guo et al., 2010]
CCU | National Chung Cheng University | Asia | —
IBM | IBM T. J. Watson Research Center | NorthAm | [Natsev et al., 2010]
IDARE | Shandong University | Asia | —
INRIA-LEAR-TEXMEX | INRIA-TEXMEX | Europe | [Jégou et al., 2010]
ITU MSPR | Istanbul Technical University | Europe | [Kutluk and Gunsel, 2010]
KDDILabs-SRI | KDDI R&D Labs and SRI International | Asia | [Uchida et al., 2010]
NII | National Institute of Informatics | Asia | [Le et al., 2010]
NJU | Nanjing University | Asia | [Mukai et al., 2010a]
NTNU-Academia-Sinica | NTNU and Academia Sinica | Asia | [Yeh et al., 2010]
NTT-CSL | NTT Communication Science Laboratories-CSL | Asia | [Mukai et al., 2010b]
PKU-IDM | Peking University-IDM | Asia | [Li et al., 2010]
PRISMA | University of Chile | SouthAm | [Barrios and Bustos, 2010]
SYSU-GITL | Sun Yat-sen University - GITL | Asia | —
THU-IMG | Tsinghua University-IMG | Asia | [Sun et al., 2010]
TID | Telefonica Research | Europe | [Younessian et al., 2010]
TUBITAK UZAY | TUBITAK - Space Technologies Research Institute | Europe | [Saracoğlu et al., 2010]
UNIBS | University of Brescia | Europe | —
VIREO | City University of Hong Kong | Asia | [Ngo et al., 2010]
Table B.1 – The 22 participant teams in CCD evaluation at TRECVID 2010.
Avg. Optimal
#
NDCR
F1
NDCR
F1
Avg.
Team
Run
1
PKU-IDM
perseus
AV
0.061
0.889
0.061
0.889
11925
2
PKU-IDM
kraken
AV
0.091
0.892
0.091
0.892
9631
3
NTT-CSL
3
AV
0.140
0.933
0.140
0.933
25
4
NTT-CSL
0
AV
0.158
0.934
0.158
0.934
25
5
INRIA-LEAR-TEXMEX
mouflon
AV
0.308
0.959
1017.305
0.957
537
6
KDDILabs-SRI
1
AV
0.343
0.961
13.982
0.847
84
7
KDDILabs-SRI
2
AV
0.373
0.964
13.986
0.848
84
8
INRIA-LEAR-TEXMEX
bouquetin
AV
0.465
0.946
1485.700
0.935
537
9
Type
Avg. Actual
MTP
ATTLabs
1
AV
0.554
0.808
48.199
0.809
41
10
PRISMA
ehdNghT10
V
0.611
0.828
40.753
0.846
128
11
PRISMA
ehdNgryhst
V
0.611
0.828
147.772
0.811
128
12
IDARE
test
—
1.000
0.000
1.000
0.000
1592
13
ATTLabs
3
AV
10.050
0.822
105.473
0.799
41
14
IBM
gistG
V
13.986
0.775
80.890
0.773
14
15
IBM
gistGCsift
V
14.190
0.819
228.262
0.701
85
16
IBM
gistGC
V
14.211
0.818
469.092
0.801
50
17
nii
av
AV
17.915
0.927
306.224
0.898
10
18
asahikasei
VmainAsub
AV
18.166
0.758
274.309
0.761
225
19
brno
l3sl2
AV
27.747
0.452
880.447
0.565
1003
20
brno
l3sl
AV
29.663
0.468
553.523
0.536
640
21
nii
a
A
31.337
0.928
306.224
0.898
9
22
VIREO
srpeflip
V
40.974
0.444
2650.472
0.612
144
23
NJU
norank1
V
41.135
0.432
9503.509
0.469
94
24
THU-IMG
tortoise
V
54.495
0.596
22619
0.633
110
25
THU-IMG
tiger
V
54.503
0.599
15780
0.634
110
26
asahikasei
AmainVsub
AV
54.511
0.512
1429.169
0.656
407
27
TUBITAK UZAY
aindexnf
A
61.969
0.384
658.254
0.917
4
28
NTNU-Academia-Sinica
2
V
67.885
0.505
1111.631
0.709
111
29
NJU
comp2
V
81.300
0.084
0.998
0.064
94
30
SYSU-GITL
sysuc2
V
108.07
0.000
13987
0.000
504
31
CCU
submission
AV
111.89
0.011
2412.019
0.027
81
32
TID
rawfusion
AV
143.78
0.921
210.709
0.920
601
33
ITU MSPR
ITUMSPR2
AV
228.05
0.563
228.046
0.563
880
34
BUPT-MCPRL
SD
AV
535.41
0.868
535.411
0.868
27
35
BUPT-MCPRL
TF
AV
669.25
0.868
669.248
0.868
27
36
UNIBS
MF
V
25176
0.524
36057
0.520
12
37
UNIBS
SF
V
25390
0.774
36953
0.758
11
Table B.2 – Results for NoFA profile at TRECVID 2010. Values averaged for the 56 transformations. Submissions in descending order by Average Optimal NDCR.
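For readers consulting these appendices apart from the main text, the NDCR values reported in Tables B.2–B.5 (and C.2–C.5) follow the normalized detection cost rate defined by the TRECVID CCD evaluation. A sketch of its general form, with the cost constants and the target copy rate fixed per profile by the evaluation plan and not restated here:
\[
\mathrm{NDCR} \;=\; P_{\mathrm{Miss}} + \beta \cdot R_{\mathrm{FA}},
\qquad
\beta \;=\; \frac{C_{\mathrm{FA}}}{C_{\mathrm{Miss}} \cdot R_{\mathrm{target}}},
\]
where $P_{\mathrm{Miss}}$ is the probability of missing a true copy and $R_{\mathrm{FA}}$ is the false-alarm rate per unit of query time. Lower values are better; values above 1 indicate a run that scores worse than returning no detections at all.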
V1
V2
V3
V4
V5
V6
V8
V10
Avg
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
A1
A2
A3
A4
A5
A6
A7
0.977
0.977
0.977
0.977
0.977
0.977
0.977
15
13
12
11
16
16
17
0.631
0.631
0.631
0.631
0.631
0.631
0.631
10
8
7
9
14
11
13
0.269
0.269
0.269
0.269
0.269
0.269
0.269
10
9
8
9
11
9
9
0.262
0.262
0.262
0.262
0.262
0.262
0.262
10
9
8
8
10
9
9
0.769
0.769
0.769
0.769
0.769
0.769
0.769
13
15
13
15
17
16
17
0.708
0.708
0.708
0.708
0.708
0.708
0.708
15
13
12
14
18
18
18
0.562
0.562
0.562
0.562
0.562
0.562
0.562
8
8
7
9
11
9
11
0.708
0.708
0.708
0.708
0.708
0.708
0.708
12
12
11
11
15
15
15
0.611
0.611
0.611
0.611
0.611
0.611
0.611
10
11
8
10
13
12
12
Avg
0.977 12
0.631
9
0.269
8
0.262
8
0.769
15
0.708
15
0.562
8
0.708
12
0.611 10
(a) Optimal NDCR for nofa.ehdNghT10
V1
F1
V2
V3
V4
V5
V6
V8
V10
Avg
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
16
15
15
14
17
15
15
A1
A2
A3
A4
A5
A6
A7
0.484
0.484
0.484
0.484
0.484
0.484
0.484
23
23
22
21
24
23
22
0.946
0.946
0.946
0.946
0.946
0.946
0.946
10
9
9
9
6
5
5
0.878
0.878
0.878
0.878
0.878
0.878
0.878
21
23
21
20
20
21
21
0.877
0.877
0.877
0.877
0.877
0.877
0.877
16
15
15
15
17
15
15
0.753
0.753
0.753
0.753
0.753
0.753
0.753
25
25
24
25
26
26
25
0.838
0.838
0.838
0.838
0.838
0.838
0.838
22
22
21
20
22
21
20
0.931
0.931
0.931
0.931
0.931
0.931
0.931
14
13
13
14
12
11
9
0.916
0.916
0.916
0.916
0.916
0.916
0.916
15
15
15
15
14
14
11
0.828
0.828
0.828
0.828
0.828
0.828
0.828
Avg
0.484 21
0.946
5
0.878
19
0.877
14
0.753
24
0.838
20
0.931
12
0.916
15
0.828 14
(b) Optimal F1 for nofa.ehdNghT10
V1
V2
V3
V4
V5
V6
V8
V10
Avg
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
A1
A2
A3
A4
A5
A6
A7
0.977
0.977
0.977
0.977
0.977
0.977
0.977
15
13
12
11
16
16
17
0.631
0.631
0.631
0.631
0.631
0.631
0.631
10
8
7
9
14
11
13
0.269
0.269
0.269
0.269
0.269
0.269
0.269
10
9
8
9
11
9
9
0.262
0.262
0.262
0.262
0.262
0.262
0.262
10
9
8
8
10
9
9
0.769
0.769
0.769
0.769
0.769
0.769
0.769
13
15
13
15
17
16
17
0.708
0.708
0.708
0.708
0.708
0.708
0.708
15
13
12
14
18
18
18
0.562
0.562
0.562
0.562
0.562
0.562
0.562
8
8
7
9
11
9
11
0.708
0.708
0.708
0.708
0.708
0.708
0.708
12
12
11
11
15
15
15
0.611
0.611
0.611
0.611
0.611
0.611
0.611
10
11
8
10
13
12
12
Avg
0.977 12
0.631
9
0.269
8
0.262
8
0.769
15
0.708
15
0.562
8
0.708
12
0.611 10
(c) Optimal NDCR for nofa.ehdNgryhst
V1
F1
V2
V3
V4
V5
V6
V8
V10
Avg
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
16
15
15
14
17
15
15
A1
A2
A3
A4
A5
A6
A7
0.484
0.484
0.484
0.484
0.484
0.484
0.484
23
23
22
21
24
23
22
0.946
0.946
0.946
0.946
0.946
0.946
0.946
10
9
9
9
6
5
5
0.878
0.878
0.878
0.878
0.878
0.878
0.878
21
23
21
20
20
21
21
0.877
0.877
0.877
0.877
0.877
0.877
0.877
16
15
15
15
17
15
15
0.753
0.753
0.753
0.753
0.753
0.753
0.753
25
25
24
25
26
26
25
0.838
0.838
0.838
0.838
0.838
0.838
0.838
22
22
21
20
22
21
20
0.931
0.931
0.931
0.931
0.931
0.931
0.931
14
13
13
14
12
11
9
0.916
0.916
0.916
0.916
0.916
0.916
0.916
15
15
15
15
14
14
11
0.828
0.828
0.828
0.828
0.828
0.828
0.828
Avg
0.484 21
0.946
5
0.878
19
0.877
14
0.753
24
0.838
20
0.931
12
0.916
15
0.828 14
(d) Optimal F1 for nofa.ehdNgryhst
Table B.3 – Optimal NDCR and Optimal F1 for submissions nofa.ehdNghT10 and nofa.ehdNgryhst at TRECVID 2010. # denotes the rank among the 37 submissions.
Avg. Optimal
#
Type
NDCR
F1
Avg. Actual
NDCR
F1
Avg.
Team
Run
MTP
1
PKU-IDM
perseus
AV
0.032
0.889
0.151
0.889
11925
2
PKU-IDM
kraken
AV
0.118
0.892
0.356
0.890
9631
3
NTT-CSL
1
AV
0.139
0.933
0.139
0.933
25
4
NTT-CSL
2
AV
0.140
0.933
0.140
0.933
25
5
INRIA-LEAR-TEXMEX
lapin
AV
0.198
0.649
15.652
0.646
537
6
KDDILabs-SRI
1
AV
0.297
0.960
0.413
0.960
84
7
KDDILabs-SRI
2
AV
0.317
0.961
2.366
0.954
84
8
ATTLabs
2
AV
0.356
0.803
0.818
0.802
41
9
ATTLabs
4
AV
0.398
0.800
1.346
0.788
41
10
nii
av
AV
0.485
0.897
0.611
0.898
10
11
INRIA-LEAR-TEXMEX
truite
AV
0.488
0.877
4.531
0.853
719
12
nii
a
A
0.519
0.902
0.611
0.898
9
13
TID
rawfusion
AV
0.529
0.921
1.196
0.913
601
14
PRISMA
ehdNgryhst
V
0.597
0.820
9.057
0.723
128
15
IBM
gistG
V
0.612
0.775
0.664
0.773
14
16
PRISMA
ehdNclrhst
V
0.658
0.820
8.902
0.724
132
17
TID
localvideo
V
0.741
0.941
4.829
0.932
441
18
VIREO
srpeflip
V
0.799
0.462
3.840
0.607
144
19
VIREO
srpe
V
0.839
0.475
3.472
0.608
72
20
TUBITAK UZAY
aindexb
A
0.847
0.384
9.827
0.886
4
21
TUBITAK UZAY
avb
AV
0.958
0.392
2.463
0.905
31
22
asahikasei
VmainAsub
AV
0.971
0.493
1.475
0.491
224
23
THU-IMG
dragon
V
0.987
0.723
28.967
0.692
110
24
THU-IMG
linnet
V
0.994
0.722
35.364
0.690
110
25
brno
l3sl2
AV
1.006
0.452
1.823
0.565
1003
26
brno
l3sl2X
AV
1.006
0.452
1.823
0.565
1003
27
asahikasei
AmainVsub
AV
1.028
0.469
5.123
0.622
407
28
NTNU-Academia-Sinica
1
V
1.032
0.505
3.051
0.704
111
29
NJU
comp2
31
NTNU-Academia-Sinica
3
30
SYSU-GITL
sysuc1
32
CCU
submission
33
TUBITAK UZAY
sift1024b
34
ITU MSPR
ITUMSPR1
35
TID
rawlocaud
36
BUPT-MCPRL
SD
37
NJU
rank1
V
38
BUPT-MCPRL
TF
AV
39
IDARE
test
AV
10.428
0.265
253.035
0.300
3
40
UNIBS
MF
V
26.130
0.524
142.569
0.587
12
41
UNIBS
SF
V
26.377
0.774
144.287
0.744
11
V
1.077
0.084
59.086
0.489
94
AV
1.107
0.000
16.522
0.781
11
V
1.107
0.000
21.919
0.000
227
AV
1.111
0.000
3.715
0.000
81
V
1.120
0.000
19.020
0.650
26
AV
1.155
0.561
1.155
0.561
956
A
1.202
0.822
43.946
0.927
2167
AV
1.224
0.867
1.224
0.867
27
1.240
0.120
39.924
0.508
94
1.918
0.867
1.918
0.867
27
Table B.4 – Results for Balanced profile at TRECVID 2010. Values averaged for the 56 transformations. Submissions in descending order by Average Optimal NDCR.
V1
V2
V3
V4
V5
V6
V8
V10
Avg
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
A1
A2
A3
A4
A5
A6
A7
0.962
0.962
0.962
0.962
0.962
0.962
0.962
16
16
15
13
17
17
17
0.454
0.454
0.454
0.454
0.454
0.454
0.454
9
7
7
7
10
9
9
0.346
0.346
0.346
0.346
0.346
0.346
0.346
12
11
11
11
13
13
13
0.608
0.608
0.608
0.608
0.608
0.608
0.608
16
14
13
15
17
17
17
0.638
0.638
0.638
0.638
0.638
0.638
0.638
19
16
17
17
19
19
19
0.808
0.808
0.808
0.808
0.808
0.808
0.808
19
19
18
18
21
20
20
0.762
0.762
0.762
0.762
0.762
0.762
0.762
16
17
16
15
18
17
18
0.685
0.685
0.685
0.685
0.685
0.685
0.685
16
14
14
15
18
17
17
0.658
0.658
0.658
0.658
0.658
0.658
0.658
16
14
15
15
17
17
17
Avg
0.962 17
0.454
7
0.346
11
0.608
16
0.638
18
0.808
19
0.762
17
0.685
16
0.658 16
(a) Optimal NDCR for balanced.ehdNclrhst
V1
F1
V2
V3
V4
V5
V6
V8
V10
Avg
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
16
16
14
15
18
17
17
A1
A2
A3
A4
A5
A6
A7
0.583
0.583
0.583
0.583
0.583
0.583
0.583
23
23
19
20
25
22
22
0.893
0.893
0.893
0.893
0.893
0.893
0.893
11
13
10
10
11
10
9
0.894
0.894
0.894
0.894
0.894
0.894
0.894
16
16
15
16
16
14
12
0.809
0.809
0.809
0.809
0.809
0.809
0.809
23
22
22
25
27
23
23
0.849
0.849
0.849
0.849
0.849
0.849
0.849
20
20
18
19
18
19
18
0.767
0.767
0.767
0.767
0.767
0.767
0.767
26
24
22
23
26
25
24
0.952
0.952
0.952
0.952
0.952
0.952
0.952
7
7
5
7
7
6
6
0.813
0.813
0.813
0.813
0.813
0.813
0.813
21
21
19
21
23
22
24
0.820
0.820
0.820
0.820
0.820
0.820
0.820
Avg
0.583 20
0.893
10
0.894
13
0.809
21
0.849
18
0.767
23
0.952
5
0.813
20
0.820 15
(b) Optimal F1 for balanced.ehdNclrhst
V1
V2
V3
V4
V5
V6
V8
V10
Avg
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
A1
A2
A3
A4
A5
A6
A7
0.977
0.977
0.977
0.977
0.977
0.977
0.977
18
18
17
15
19
20
20
0.515
0.515
0.515
0.515
0.515
0.515
0.515
11
9
9
11
14
12
11
0.269
0.269
0.269
0.269
0.269
0.269
0.269
10
10
8
9
11
9
10
0.262
0.262
0.262
0.262
0.262
0.262
0.262
11
9
8
9
9
9
9
0.777
0.777
0.777
0.777
0.777
0.777
0.777
20
20
19
18
21
21
20
0.708
0.708
0.708
0.708
0.708
0.708
0.708
18
16
17
17
19
19
19
0.562
0.562
0.562
0.562
0.562
0.562
0.562
14
12
10
13
15
15
14
0.708
0.708
0.708
0.708
0.708
0.708
0.708
17
15
15
16
19
18
18
0.597
0.597
0.597
0.597
0.597
0.597
0.597
14
12
11
13
15
15
15
Avg
0.977 19
0.515
9
0.269
9
0.262
9
0.777
20
0.708
18
0.562
14
0.708
17
0.597 14
(c) Optimal NDCR for balanced.ehdNgryhst
V1
F1
V2
#
V3
F1
#
V4
F1
#
V5
F1
#
V6
F1
#
V8
F1
#
V10
F1
#
Avg
F1
#
F1
#
0.820
0.820
0.820
0.820
0.820
0.820
0.820
17
17
15
16
19
18
18
A1
A2
A3
A4
A5
A6
A7
0.484
0.484
0.484
0.484
0.484
0.484
0.484
24
24
21
22
26
24
23
0.888
0.888
0.888
0.888
0.888
0.888
0.888
14
14
13
14
13
12
12
0.878
0.878
0.878
0.878
0.878
0.878
0.878
18
20
19
18
19
17
17
0.877
0.877
0.877
0.877
0.877
0.877
0.877
15
16
13
14
17
14
14
0.747
0.747
0.747
0.747
0.747
0.747
0.747
24
24
22
23
26
25
25
0.838
0.838
0.838
0.838
0.838
0.838
0.838
21
20
18
19
21
20
19
0.931
0.931
0.931
0.931
0.931
0.931
0.931
13
12
11
13
12
9
9
0.916
0.916
0.916
0.916
0.916
0.916
0.916
12
12
12
13
12
11
11
Avg
0.484 22
0.888
13
0.878
16
0.877
13
0.747
23
0.838
18
0.931
10
0.916
9
0.820 16
(d) Optimal F1 for balanced.ehdNgryhst
Table B.5 – Optimal NDCR and Optimal F1 for submissions balanced.ehdNclrhst and
balanced.ehdNgryhst at TRECVID 2010.
# denotes the rank among the 41 submissions.
#
Team
Profile
Run
Type
RP1
1
PKU-IDM
nofa
2
PKU-IDM
balanced
3
PKU-IDM
4
NTT-CSL
5
NTT-CSL
6
7
Precision
perseus
AV
0.911
1
perseus
AV
0.885
1
nofa
kraken
AV
0.882
1
balanced
1
AV
0.836
1
balanced
2
AV
0.835
1
NTT-CSL
nofa
3
AV
0.835
1
PKU-IDM
balanced
kraken
AV
0.817
1
8
NTT-CSL
nofa
0
AV
0.816
1
9
INRIA-LEAR-TEXMEX
balanced
lapin
AV
0.711
1
10
INRIA-LEAR-TEXMEX
nofa
mouflon
AV
0.578
1
11
KDDILabs-SRI
nofa
1
AV
0.346
1
12
IBM
balanced
gistG
V
0.344
1
13
IBM
nofa
gistG
V
0.341
1
14
KDDILabs-SRI
balanced
2
AV
0.314
1
15
KDDILabs-SRI
nofa
2
AV
0.314
1
16
KDDILabs-SRI
balanced
1
AV
0.285
1
17
PRISMA
nofa
ehdNghT10
V
0.248
1
18
PRISMA
nofa
ehdNgryhst
V
0.248
1
19
PRISMA
balanced
ehdNgryhst
V
0.204
1
20
INRIA-LEAR-TEXMEX
nofa
bouquetin
AV
0.203
1
21
PRISMA
balanced
ehdNclrhst
V
0.176
1
22
INRIA-LEAR-TEXMEX
balanced
truite
AV
0.158
1
23
IBM
nofa
gistGCsift
V
0.132
1
24
IBM
nofa
gistGC
V
0.111
1
25
VIREO
balanced
srpeflip
V
0.018
1
26
VIREO
nofa
srpeflip
V
0.018
1
27
asahikasei
balanced
AmainVsub
AV
0.016
1
28
asahikasei
nofa
AmainVsub
AV
0.016
1
29
VIREO
balanced
srpe
V
0.015
1
30
NJU
nofa
norank1
V
0.010
1
31
TID
balanced
localvideo
V
0.009
1
32
nii
balanced
av
AV
0.008
1
33
nii
nofa
av
AV
0.008
1
34
ATTLabs
balanced
2
AV
0.004
1
35
ATTLabs
nofa
1
AV
0.004
1
36
asahikasei
balanced
VmainAsub
AV
0.002
1
37
asahikasei
nofa
VmainAsub
AV
0.002
1
38
NJU
balanced
comp2
V
0.002
1
39
NJU
nofa
comp2
V
0.002
1
continue in Table B.7...
Table B.6 – RP1 (maximum Recall with Precision 1) for all TRECVID 2010 submissions. Part 1
of 2.
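Tables B.6–B.9 (and C.6–C.9 in the next appendix) report RP1 and RP.5, which the captions define verbally. One way to write these measures, assuming each run is swept over a decision threshold $\theta$ on its detection scores:
\[
\mathrm{RP}_{x} \;=\; \max_{\theta}\,\bigl\{\, \mathrm{Recall}(\theta) \;:\; \mathrm{Precision}(\theta) \ge x \,\bigr\},
\]
so RP1 is the largest recall reachable while keeping precision equal to 1, and RP.5 the largest recall reachable while keeping precision at least 0.5. Runs marked with a dash presumably never reach the required precision at any threshold.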
#
Team
Profile
Run
Type
RP1
Precision
1
continuation from Table B.6...
40
NTNU-Academia-Sinica
balanced
1
V
0.001
41
NTNU-Academia-Sinica
nofa
2
V
0.001
1
42
ATTLabs
balanced
4
AV
0.000
1
43
ATTLabs
nofa
3
AV
0.000
1
–
brno
balanced
l3sl2
AV
—
—
–
brno
balanced
l3sl2X
AV
—
—
–
brno
nofa
l3sl
AV
—
—
–
brno
nofa
l3sl2
AV
—
—
–
BUPT-MCPRL
balanced
SD
AV
—
—
–
BUPT-MCPRL
balanced
TF
AV
—
—
–
BUPT-MCPRL
nofa
SD
AV
—
—
–
BUPT-MCPRL
nofa
TF
AV
—
—
–
CCU
balanced
submission
AV
—
—
–
CCU
nofa
submission
AV
—
—
–
IDARE
balanced
test
AV
—
—
–
IDARE
nofa
test
—
—
—
–
ITU MSPR
balanced
ITUMSPR1
AV
—
—
–
ITU MSPR
nofa
ITUMSPR2
AV
—
—
–
nii
balanced
a
A
—
—
–
nii
nofa
a
A
—
—
–
NJU
balanced
rank1
V
—
—
–
NTNU-Academia-Sinica
balanced
3
AV
—
—
–
SYSU-GITL
balanced
sysuc1
V
—
—
–
SYSU-GITL
nofa
sysuc2
V
—
—
–
THU-IMG
balanced
dragon
V
—
—
–
THU-IMG
balanced
linnet
V
—
—
–
THU-IMG
nofa
tiger
V
—
—
–
THU-IMG
nofa
tortoise
V
—
—
–
TID
balanced
rawfusion
AV
—
—
–
TID
balanced
rawlocaud
A
—
—
–
TID
nofa
rawfusion
AV
—
—
–
TUBITAK UZAY
balanced
aindexb
–
TUBITAK UZAY
balanced
avb
–
TUBITAK UZAY
balanced
–
TUBITAK UZAY
nofa
–
UNIBS
balanced
–
UNIBS
–
UNIBS
–
UNIBS
A
—
—
AV
—
—
sift1024b
V
—
—
aindexnf
A
—
—
MF
V
—
—
balanced
SF
V
—
—
nofa
MF
V
—
—
nofa
SF
V
—
—
Table B.7 – RP1 (maximum Recall with Precision 1) for all TRECVID 2010 submissions. Part 2
of 2.
#
Team
Profile
Run
Type
RP.5
Precision
1
INRIA-LEAR-TEXMEX
balanced
lapin
AV
0.964
0.512
2
INRIA-LEAR-TEXMEX
balanced
truite
AV
0.961
0.520
3
PKU-IDM
balanced
perseus
AV
0.954
0.990
4
PKU-IDM
balanced
kraken
AV
0.947
0.976
5
INRIA-LEAR-TEXMEX
nofa
bouquetin
AV
0.943
0.608
6
BUPT-MCPRL
balanced
TF
AV
0.927
0.877
7
BUPT-MCPRL
balanced
SD
AV
0.925
0.919
8
BUPT-MCPRL
nofa
TF
AV
0.920
0.952
9
BUPT-MCPRL
nofa
SD
AV
0.918
0.961
10
PKU-IDM
nofa
perseus
AV
0.911
1.000
11
INRIA-LEAR-TEXMEX
nofa
mouflon
AV
0.907
0.590
12
PKU-IDM
nofa
kraken
AV
0.882
1.000
13
TID
balanced
rawfusion
AV
0.849
0.510
14
TID
nofa
rawfusion
AV
0.849
0.510
15
NTT-CSL
balanced
1
AV
0.836
1.000
16
NTT-CSL
balanced
2
AV
0.835
1.000
17
NTT-CSL
nofa
3
AV
0.835
1.000
18
ATTLabs
balanced
4
AV
0.818
0.574
19
NTT-CSL
nofa
0
AV
0.816
1.000
20
KDDILabs-SRI
balanced
2
AV
0.784
0.652
21
KDDILabs-SRI
nofa
2
AV
0.784
0.652
22
KDDILabs-SRI
balanced
1
AV
0.780
0.784
23
KDDILabs-SRI
nofa
1
AV
0.780
0.784
24
nii
balanced
av
AV
0.754
0.771
25
nii
nofa
av
AV
0.734
0.966
26
ATTLabs
balanced
2
AV
0.719
0.518
27
ATTLabs
nofa
3
AV
0.701
0.988
28
TUBITAK UZAY
balanced
aindexb
A
0.681
0.519
29
TUBITAK UZAY
nofa
aindexnf
A
0.681
0.519
30
nii
balanced
a
A
0.674
0.971
31
nii
nofa
a
A
0.674
0.971
32
TID
balanced
localvideo
V
0.672
0.515
33
ATTLabs
nofa
1
AV
0.637
0.995
34
TUBITAK UZAY
balanced
avb
AV
0.560
0.797
35
VIREO
balanced
srpeflip
V
0.543
0.697
36
PRISMA
balanced
ehdNgryhst
V
0.539
0.508
37
VIREO
nofa
srpeflip
V
0.530
0.742
38
PRISMA
nofa
ehdNghT10
V
0.526
0.501
39
PRISMA
nofa
ehdNgryhst
V
0.526
0.501
continue in Table B.9...
Table B.8 – RP.5 (maximum Recall with Precision greater than or equal to 0.5) for all TRECVID
2010 submissions. Part 1 of 2.
#
Team
Profile
Run
Type
RP.5
Precision
continuation from Table B.8...
40
IBM
balanced
gistG
41
ITU MSPR
balanced
ITUMSPR1
V
0.494
0.510
AV
0.491
0.914
42
IBM
nofa
43
VIREO
balanced
gistG
V
0.486
0.558
srpe
V
0.484
0.705
44
TID
balanced
rawlocaud
45
ITU MSPR
nofa
ITUMSPR2
0.500
46
IBM
nofa
gistGCsift
A
0.470
AV
0.466
0.967
V
0.464
0.516
47
IBM
nofa
gistGC
V
0.449
0.564
48
PRISMA
balanced
ehdNclrhst
V
0.442
0.501
49
NTNU-Academia-Sinica
balanced
1
V
0.312
0.502
50
NTNU-Academia-Sinica
nofa
2
V
0.312
0.502
51
THU-IMG
nofa
tortoise
V
0.303
0.504
52
THU-IMG
nofa
tiger
V
0.298
0.500
53
THU-IMG
balanced
linnet
V
0.283
0.505
54
THU-IMG
balanced
dragon
V
0.271
0.500
55
asahikasei
balanced
VmainAsub
AV
0.112
0.512
56
asahikasei
nofa
VmainAsub
AV
0.112
0.577
57
brno
balanced
l3sl2
AV
0.084
0.518
58
brno
balanced
l3sl2X
AV
0.084
0.518
59
brno
nofa
l3sl2
AV
0.084
0.518
60
asahikasei
balanced
AmainVsub
AV
0.082
0.527
61
asahikasei
nofa
AmainVsub
AV
0.082
0.527
62
brno
nofa
l3sl
AV
0.052
0.517
63
NJU
nofa
norank1
V
0.049
0.515
64
NJU
balanced
comp2
V
0.003
0.750
65
NJU
nofa
comp2
V
0.003
0.750
–
CCU
balanced
submission
AV
—
—
–
CCU
nofa
submission
AV
—
—
–
IDARE
balanced
test
AV
—
—
–
IDARE
nofa
test
—
—
—
–
NJU
balanced
rank1
V
—
—
–
NTNU-Academia-Sinica
balanced
3
AV
—
—
–
SYSU-GITL
balanced
sysuc1
V
—
—
–
SYSU-GITL
nofa
sysuc2
V
—
—
–
TUBITAK UZAY
balanced
sift1024b
V
—
—
–
UNIBS
balanced
MF
V
—
—
–
UNIBS
balanced
SF
V
—
—
–
UNIBS
nofa
MF
V
—
—
–
UNIBS
nofa
SF
V
—
—
Table B.9 – RP.5 (maximum Recall with Precision greater than or equal to 0.5) for all TRECVID
2010 submissions. Part 2 of 2.
Appendix C
Results at TRECVID 2011
Team 2011 | Organization | Location | Paper
ATTLabs | AT&T Labs Research | NorthAm | [Liu et al., 2011]
brno | Brno University of Technology | Europe | [Hradiš et al., 2011]
BUPT-MCPRL | Beijing University of Posts and Telecommunications-MCPRL | Asia | [Zhao et al., 2011b]
CRIM-VISI | Computer Research Institute of Montreal - Vision & Imaging team | NorthAm | [Gupta et al., 2011]
FTRDBJ | France Telecom Orange Labs (Beijing) | Asia | [Bai et al., 2011]
IMP | Osaka Prefecture University | Asia | [Sakata et al., 2011]
INRIA-LEAR | INRIA-LEAR | Europe | [Ayari et al., 2011]
INRIA-TEXMEX | INRIA/IRISA | Europe | [Ayari et al., 2011]
ITU MSPR | Istanbul Technical University | Europe | —
iupr-dfki | University of Kaiserslautern | Europe | [Zhao et al., 2011a]
KDDILabs | KDDILabs | Asia | [Uchida et al., 2011]
NTT-CSL | NTT Communication Science Laboratories-CSL | Asia | [Mukai et al., 2011]
PKU-IDM | Peking University-IDM | Asia | [Jiang et al., 2011]
PRISMA | PRISMA-University of Chile | SouthAm | [Barrios et al., 2011]
RMIT | RMIT University School of CS&IT | Australia | [Rouhi and Thom, 2011]
SYSU-GITL | Sun Yat-sen University - GITL | Asia | —
Telefonica.research | Telefonica Research | Europe | [Anguera et al., 2011a]
tokushima U | Tokushima University | Asia | [Shishibori et al., 2011]
UQMSG | University of Queensland | Australia | [Shen et al., 2011]
USC-UTSA | USC Viterbi School of Engineering | NorthAm | —
XJTU | Xi’an Jiaotong University | Asia | —
ZJU CS IV | Zhejiang University | Asia | —
Table C.1 – The 22 participant teams in CCD evaluation at TRECVID 2011.
Avg. Optimal
#
Team
Run
Type
NDCR
F1
Avg. Actual
NDCR
F1
Avg.
MTP
1
PKU-IDM
cascade
AV
0.078
0.950
0.080
0.950
179
2
INRIA-LEAR
dodo
AV
0.103
0.942
5.911
0.947
2079
3
CRIM-VISI
V48A66T160
AV
0.122
0.711
42.073
0.711
2792
4
CRIM-VISI
V48A66T60
AV
0.122
0.711
85.889
0.715
2792
5
INRIA-TEXMEX
tyche
AV
0.252
0.934
0.352
0.935
32848
6
PRISMA
EhdRgbAud
AV
0.286
0.946
0.336
0.962
64
7
NTT-CSL
0
AV
0.311
0.931
106.897
0.924
96
8
ATTLabs
1
AV
0.330
0.879
0.508
0.890
30
9
ATTLabs
3
AV
0.342
0.875
0.506
0.886
42
10
PRISMA
EhdGry
V
0.374
0.938
0.419
0.956
50
11
FTRDBJ
AudioOnly
12
FTRDBJ
orange1
A
0.410
0.887
76.678
0.880
4589
AV
0.511
0.927
219.410
0.917
4589
801
13
ZJU CS IV
bhgccd
V
0.544
0.952
0.545
0.952
14
BUPT-MCPRL
wsyVA
AV
0.587
0.895
404.343
0.892
62
15
ZJU CS IV
bgccd
V
0.626
0.956
0.627
0.956
445
16
IMP
Uvote
V
0.748
0.654
374.333
0.650
191
17
IMP
Wvote
V
0.773
0.658
107.503
0.680
184
18
KDDILabs
4sys
V
13.807
0.682
27.161
0.645
4
19
SYSU-GITL
videoonly1
AV
14.243
0.785
2723.846
0.693
727
20
brno
brnoccd
AV
23.812
0.709
197.237
0.705
1575
21
KDDILabs
base
V
27.236
0.683
0.720
0.732
1
22
KDDILabs
2sys
V
27.237
0.621
13.927
0.636
2
23
SYSU-GITL
videoonly2
V
27.590
0.773
3057.516
0.739
719
24
Telefonica-research
multimodal
AV
57.768
0.948
153.092
0.949
601
25
USC-UTSA
test
V
107.79
0.000
961696
0.280
4
26
UQMSG
mfh
AV
117.32
0.000
296.575
0.000
1
27
ITU MSPR
ITUMSPR1
A
137.98
0.370
137.975
0.370
953
28
tokushima U
ch4of12
A
275.33
0.930
854.986
0.915
110
29
BUPT-MCPRL
zhVideo
V
400.57
0.914
400.577
0.915
62
30
RMIT
VideoNOFA7
V
401.45
0.054
5687.435
0.391
142
31
RMIT
VideoNOFA8
V
401.45
0.054
3284.710
0.343
142
32
XJTU
1
V
9999
0.000
13830
0.000
50
Table C.2 – Results for NoFA profile at TRECVID 2011. Values averaged for the 56 transformations. Submissions in descending order by Average Optimal NDCR.
V1
V2
NDCR
#
A1
A2
A3
A4
A5
A6
A7
0.761
0.761
0.761
0.761
0.761
0.761
0.761
14
14
13
13
13
13
13
0.321
0.321
0.321
0.321
0.321
0.321
0.321
Avg
0.761 13
0.321
V3
NDCR #
V4
V5
V6
V8
NDCR
#
NDCR
#
NDCR
#
NDCR
#
6
5
6
7
8
7
7
0.119
0.119
0.119
0.119
0.119
0.119
0.119
10
9
9
8
9
9
9
0.239
0.239
0.239
0.239
0.239
0.239
0.239
10
10
9
10
10
10
10
0.403
0.403
0.403
0.403
0.403
0.403
0.403
18
17
14
14
15
15
15
0.515
0.515
0.515
0.515
0.515
0.515
0.515
17
17
15
17
16
16
14
0.209
0.209
0.209
0.209
0.209
0.209
0.209
7
0.119
8
0.239
9
0.403
16
0.515
16
0.209
V10
NDCR #
Avg
NDCR
#
NDCR
#
4
6
6
6
7
7
7
0.425
0.425
0.425
0.425
0.425
0.425
0.425
11
11
10
10
10
10
8
0.374
0.374
0.374
0.374
0.374
0.374
0.374
11
11
9
10
10
10
8
6
0.425
11
0.374 10
(a) Optimal NDCR for nofa.EhdGry
V1
F1
V2
V3
V4
V5
V6
V8
V10
Avg
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
A1
A2
A3
A4
A5
A6
A7
0.887
0.887
0.887
0.887
0.887
0.887
0.887
15
14
16
13
14
13
9
0.945
0.945
0.945
0.945
0.945
0.945
0.945
9
10
9
6
5
6
5
0.966
0.966
0.966
0.966
0.966
0.966
0.966
4
2
2
2
2
1
2
0.949
0.949
0.949
0.949
0.949
0.949
0.949
8
8
8
7
6
7
7
0.952
0.952
0.952
0.952
0.952
0.952
0.952
7
5
5
6
3
5
3
0.924
0.924
0.924
0.924
0.924
0.924
0.924
11
11
9
10
10
8
8
0.934
0.934
0.934
0.934
0.934
0.934
0.934
12
11
10
11
9
10
10
0.944
0.944
0.944
0.944
0.944
0.944
0.944
7
6
6
4
6
6
5
0.938
0.938
0.938
0.938
0.938
0.938
0.938
10
9
8
8
7
7
7
Avg
0.887 14
0.945
8
0.966
2
0.949
6
0.952
4
0.924
10
0.934
10
0.944
5
0.938
7
(b) Optimal F1 for nofa.EhdGry
V1
V2
NDCR
#
A1
A2
A3
A4
A5
A6
A7
0.358
0.358
0.575
0.694
0.493
0.664
0.716
8
8
10
10
11
10
11
0.216
0.209
0.284
0.306
0.239
0.321
0.321
Avg
0.551 10
0.271
V3
NDCR #
V4
NDCR
#
2
4
5
6
6
7
7
0.075
0.075
0.112
0.119
0.104
0.142
0.142
6
6
8
8
7
11
10
0.142
0.142
0.149
0.149
0.142
0.157
0.172
6
0.110
7
0.150
V5
NDCR #
V6
V8
V10
Avg
NDCR #
NDCR #
NDCR
#
NDCR
#
NDCR #
7
8
8
8
9
9
9
0.284
0.269
0.418
0.463
0.358
0.433
0.478
12
12
16
18
14
16
16
0.299
0.284
0.366
0.388
0.336
0.381
0.403
9
8
11
12
11
11
10
0.104
0.104
0.172
0.157
0.127
0.187
0.179
2
4
4
5
4
6
6
0.254
0.261
0.358
0.358
0.291
0.358
0.366
6
5
6
7
6
8
7
0.217
0.213
0.304
0.329
0.261
0.330
0.347
6
5
6
6
7
8
7
8
0.386
15
0.351
11
0.147
4
0.321
5
0.286
6
(c) Optimal NDCR for nofa.EhdRgbAud
V1
F1
V2
V3
V4
V5
V6
V8
V10
Avg
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
A1
A2
A3
A4
A5
A6
A7
0.927
0.931
0.890
0.882
0.904
0.866
0.876
7
6
15
16
10
16
13
0.948
0.949
0.951
0.948
0.942
0.947
0.944
7
7
5
4
7
5
6
0.977
0.975
0.969
0.971
0.972
0.966
0.969
1
1
1
1
1
1
1
0.964
0.969
0.958
0.958
0.960
0.956
0.962
2
1
2
1
1
4
1
0.928
0.930
0.940
0.940
0.931
0.919
0.936
14
12
12
13
13
13
11
0.931
0.925
0.923
0.948
0.933
0.924
0.940
9
10
10
7
7
8
6
0.974
0.971
0.969
0.968
0.973
0.970
0.964
1
1
1
1
1
1
1
0.949
0.954
0.958
0.957
0.962
0.963
0.957
4
3
2
2
1
1
2
0.950
0.951
0.945
0.947
0.947
0.939
0.944
5
5
6
6
5
6
5
Avg
0.897 12
0.947
5
0.971
1
0.961
1
0.932
13
0.932
8
0.970
1
0.957
2
0.946
5
(d) Optimal F1 for nofa.EhdRgbAud
Table C.3 – Optimal NDCR and Optimal F1 for submissions nofa.EhdGry and nofa.EhdRgbAud
at TRECVID 2011.
# denotes the rank among the 32 submissions.
Avg. Optimal
#
NDCR
F1
NDCR
F1
Avg.
Team
Run
1
PKU-IDM
cascade
AV
0.053
0.949
0.055
0.949
172
2
CRIM-VISI
V48A66T58B
AV
0.117
0.712
0.163
0.715
2792
3
CRIM-VISI
V48A66T65B
AV
0.117
0.712
0.159
0.715
2792
4
INRIA-LEAR
dodo
AV
0.144
0.942
0.217
0.944
2079
5
INRIA-TEXMEX
zozo
AV
0.194
0.929
0.348
0.936
32848
6
INRIA-TEXMEX
themis
AV
0.211
0.929
0.351
0.936
32848
7
NTT-CSL
1
AV
0.244
0.940
0.306
0.936
96
8
INRIA-LEAR
deaf
V
0.258
0.950
0.362
0.951
2041
9
Telefonica-research
joint
AV
0.268
0.957
1.209
0.944
601
NTT-CSL
2
AV
0.270
0.930
0.384
0.924
96
10
Type
Avg. Actual
MTP
11
FTRDBJ
orange3
AV
0.287
0.920
0.340
0.917
4589
12
PRISMA
EhdRgbAud
AV
0.300
0.955
8.462
0.935
64
13
NTT-CSL
3
AV
0.309
0.943
0.474
0.935
96
14
ATTLabs
2
AV
0.317
0.879
0.492
0.889
30
15
ATTLabs
4
AV
0.330
0.876
0.509
0.887
42
16
FTRDBJ
VideoOnly
V
0.335
0.918
0.388
0.917
4589
17
INRIA-TEXMEX
audioonly
A
0.406
0.910
0.545
0.917
192
18
PRISMA
EhdGry
V
0.412
0.938
3.716
0.913
50
19
KDDILabs
4sys
V
0.471
0.682
0.490
0.645
4
20
BUPT-MCPRL
zhVideo
V
0.529
0.915
0.636
0.915
62
801
21
ZJU CS IV
bhgccd
V
0.544
0.952
0.545
0.952
22
BUPT-MCPRL
wsyVA
AV
0.578
0.892
1.213
0.891
62
23
Telefonica-research
multimodal
AV
0.610
0.947
3.595
0.907
601
24
ZJU CS IV
bgccd
V
0.626
0.956
0.627
0.956
445
25
Telefonica-research
mask
AV
0.662
0.729
1.085
0.708
2393
26
IMP
Wvote
V
0.681
0.652
0.888
0.637
184
27
IMP
Uvote
V
0.726
0.661
1.401
0.632
188
28
iupr-dfki
fsift
29
SYSU-GITL
videoonly1
30
SYSU-GITL
videoonly2
31
brno
brnoccd
32
iupr-dfki
33
tokushima U
V
0.836
0.639
4.479
0.563
84
AV
0.893
0.786
3.467
0.693
727
V
0.909
0.773
3.757
0.739
719
AV
0.911
0.800
2.317
0.744
1575
fsift2
V
0.962
0.581
6.100
0.569
84
ch4of12
A
1.005
0.930
1.533
0.915
110
953
34
ITU MSPR
ITUMSPR2
AV
1.125
0.417
1.125
0.417
35
UQMSG
mfh
AV
1.193
0.000
6.408
0.001
1
36
tokushima U
chth
A
1.256
0.928
4.122
0.922
248
37
RMIT
VideoBal5
V
1.395
0.054
25.230
0.470
142
38
RMIT
VideoBal6
V
1.395
0.054
13.219
0.421
142
39
tokushima U
chcode
A
1.989
0.929
7.120
0.917
120
40
XJTU
1
41
USC-UTSA
test
V
19.087
0.000
41.766
0.000
49
AV
19.089
0.242
1834
0.304
4
Table C.4 – Results for Balanced profile at TRECVID 2011. Values averaged for the 56 transformations. Submissions in descending order by Average Optimal NDCR.
V1
V2
V3
V4
V5
V6
V8
V10
Avg
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
A1
A2
A3
A4
A5
A6
A7
0.761
0.761
0.761
0.761
0.761
0.761
0.761
23
23
22
22
21
20
21
0.313
0.313
0.313
0.313
0.313
0.313
0.313
11
11
10
10
10
10
10
0.104
0.104
0.104
0.104
0.104
0.104
0.104
8
9
7
8
7
11
7
0.323
0.323
0.323
0.323
0.323
0.323
0.323
15
16
12
15
14
14
13
0.495
0.495
0.495
0.495
0.495
0.495
0.495
23
22
21
22
21
19
20
0.607
0.607
0.607
0.607
0.607
0.607
0.607
23
24
23
24
21
21
21
0.293
0.293
0.293
0.293
0.293
0.293
0.293
15
17
14
14
15
17
14
0.403
0.403
0.403
0.403
0.403
0.403
0.403
17
17
13
17
16
16
14
0.412
0.412
0.412
0.412
0.412
0.412
0.412
18
18
14
18
17
17
15
Avg
0.761 22
0.313
10
0.104
7
0.323
14
0.495
21
0.607
22
0.293
16
0.403
16
0.412 18
(a) Optimal NDCR for balanced.EhdGry
V1
F1
V2
V3
V4
V5
V6
V8
V10
Avg
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
5
3
3
3
3
3
3
0.950
0.950
0.950
0.950
0.950
0.950
0.950
9
10
10
8
6
8
7
0.952
0.952
0.952
0.952
0.952
0.952
0.952
8
9
10
9
5
7
5
0.926
0.926
0.926
0.926
0.926
0.926
0.926
17
17
16
17
15
14
10
0.936
0.936
0.936
0.936
0.936
0.936
0.936
19
19
17
19
13
16
14
0.941
0.941
0.941
0.941
0.941
0.941
0.941
13
12
9
9
8
9
7
0.938
0.938
0.938
0.938
0.938
0.938
0.938
15
13
13
13
9
10
9
3
0.950
7
0.952
6
0.926
16
0.936
15
0.941
7
A1
A2
A3
A4
A5
A6
A7
0.887
0.887
0.887
0.887
0.887
0.887
0.887
24
23
23
22
24
22
16
0.945
0.945
0.945
0.945
0.945
0.945
0.945
11
10
12
9
8
10
6
0.966
0.966
0.966
0.966
0.966
0.966
0.966
Avg
0.887 24
0.945
9
0.966
0.938 11
(b) Optimal F1 for balanced.EhdGry
V1
V2
NDCR
#
A1
A2
A3
A4
A5
A6
A7
0.358
0.343
0.575
0.679
0.470
0.664
0.701
12
10
13
21
14
18
19
0.194
0.179
0.269
0.291
0.216
0.306
0.306
Avg
0.541 16
0.252
V3
NDCR #
V4
V5
V6
V8
V10
Avg
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
NDCR
#
2
4
8
9
6
9
9
0.075
0.075
0.112
0.127
0.104
0.142
0.127
5
6
10
12
7
15
12
0.134
0.149
0.209
0.216
0.187
0.246
0.224
4
7
8
9
8
11
9
0.343
0.328
0.455
0.493
0.425
0.500
0.515
19
19
20
21
19
20
22
0.313
0.321
0.388
0.373
0.373
0.425
0.448
16
17
15
17
17
17
17
0.119
0.127
0.187
0.216
0.164
0.231
0.224
2
4
5
7
5
10
9
0.246
0.254
0.328
0.336
0.284
0.343
0.358
8
11
10
14
10
15
13
0.223
0.222
0.315
0.341
0.278
0.357
0.363
6
6
11
15
11
16
14
9
0.109
9
0.195
8
0.437
20
0.377
17
0.181
5
0.307
11
0.300 12
(c) Optimal NDCR for balanced.EhdRgbAud
V1
F1
V2
V3
V4
V5
V6
V8
V10
Avg
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
F1
#
A1
A2
A3
A4
A5
A6
A7
0.927
0.932
0.890
0.887
0.907
0.866
0.881
13
11
22
22
15
23
18
0.949
0.950
0.952
0.949
0.943
0.947
0.944
7
6
5
6
10
7
8
0.977
0.976
0.971
0.974
0.972
0.970
0.969
1
1
1
1
1
1
1
0.971
0.976
0.978
0.971
0.969
0.978
0.974
1
1
1
1
1
1
1
0.952
0.960
0.960
0.956
0.954
0.958
0.956
8
4
4
5
4
5
4
0.951
0.951
0.961
0.948
0.945
0.958
0.964
9
9
1
9
7
2
2
0.976
0.973
0.976
0.973
0.976
0.978
0.977
1
1
1
1
1
1
1
0.949
0.954
0.959
0.957
0.962
0.963
0.957
5
2
1
2
1
1
2
0.957
0.959
0.956
0.952
0.954
0.952
0.953
5
1
4
6
3
4
3
Avg
0.899 19
0.948
7
0.973
1
0.974
1
0.957
3
0.954
4
0.976
1
0.957
2
0.955
3
(d) Optimal F1 for balanced.EhdRgbAud
Table C.5 – Optimal NDCR and Optimal F1 for submissions balanced.EhdGry and
balanced.EhdRgbAud at TRECVID 2011.
# denotes the rank among the 41 submissions.
#
Team
Profile
Run
1
PKU-IDM
balanced
cascade
2
PKU-IDM
nofa
cascade
3
PRISMA
nofa
EhdRgbAud
4
CRIM-VISI
balanced
5
CRIM-VISI
6
CRIM-VISI
7
8
9
Type
RP1
Precision
AV
0.947
1
AV
0.922
1
AV
0.699
1
V48A66T58B
AV
0.696
1
balanced
V48A66T65B
AV
0.696
1
nofa
V48A66T160
AV
0.696
1
CRIM-VISI
nofa
V48A66T60
AV
0.696
1
INRIA-TEXMEX
balanced
zozo
AV
0.670
1
INRIA-TEXMEX
balanced
themis
AV
0.652
1
10
INRIA-TEXMEX
nofa
tyche
AV
0.652
1
11
PRISMA
balanced
EhdRgbAud
AV
0.650
1
12
ATTLabs
balanced
2
AV
0.598
1
13
PRISMA
nofa
EhdGry
V
0.590
1
14
ATTLabs
nofa
1
AV
0.580
1
15
ATTLabs
balanced
4
AV
0.567
1
16
ATTLabs
nofa
3
AV
0.552
1
17
INRIA-LEAR
balanced
deaf
V
0.542
1
18
INRIA-TEXMEX
balanced
audioonly
19
Telefonica-research
balanced
joint
A
0.539
1
AV
0.533
1
20
PRISMA
balanced
21
FTRDBJ
nofa
EhdGry
V
0.490
1
AudioOnly
A
0.483
22
ZJU CS IV
1
balanced
bhgccd
V
0.456
23
1
ZJU CS IV
nofa
bhgccd
V
0.456
1
24
KDDILabs
balanced
4sys
V
0.447
1
25
KDDILabs
nofa
4sys
V
0.447
1
26
KDDILabs
nofa
2sys
V
0.410
1
27
INRIA-LEAR
balanced
dodo
AV
0.409
1
28
INRIA-LEAR
nofa
dodo
AV
0.409
1
29
KDDILabs
nofa
base
V
0.382
1
30
ZJU CS IV
balanced
bgccd
V
0.374
1
31
ZJU CS IV
nofa
bgccd
V
0.374
1
32
BUPT-MCPRL
balanced
wsyVA
AV
0.361
1
33
BUPT-MCPRL
nofa
wsyVA
AV
0.361
1
34
FTRDBJ
nofa
orange1
AV
0.325
1
35
FTRDBJ
balanced
orange3
AV
0.314
1
36
FTRDBJ
balanced
VideoOnly
V
0.265
1
37
NTT-CSL
nofa
0
AV
0.261
1
continue in Table C.7...
Table C.6 – RP1 (maximum Recall with Precision 1) for all TRECVID 2011 submissions. Part 1
of 2.
#
Team
Profile
Run
Type
RP1
Precision
1
continuation from Table C.6...
38
NTT-CSL
balanced
1
AV
0.253
39
Telefonica-research
balanced
mask
AV
0.134
1
40
IMP
balanced
Uvote
V
0.132
1
41
IMP
balanced
Wvote
V
0.125
1
42
IMP
nofa
Wvote
V
0.125
1
43
IMP
nofa
Uvote
V
0.099
1
44
NTT-CSL
balanced
2
AV
0.033
1
45
NTT-CSL
balanced
3
AV
0.031
1
46
iupr-dfki
balanced
fsift
V
0.012
1
47
iupr-dfki
balanced
fsift2
V
0.012
1
48
brno
balanced
brnoccd
AV
0.002
1
49
0.002
1
—
—
brno
nofa
brnoccd
AV
–
BUPT-MCPRL
balanced
zhVideo
V
–
BUPT-MCPRL
nofa
zhVideo
V
—
—
–
ITU MSPR
balanced
ITUMSPR2
AV
—
—
–
ITU MSPR
nofa
ITUMSPR1
A
—
—
–
RMIT
balanced
VideoBal5
V
—
—
–
RMIT
balanced
VideoBal6
V
—
—
–
RMIT
nofa
VideoNOFA7
V
—
—
–
RMIT
nofa
VideoNOFA8
V
—
—
–
SYSU-GITL
balanced
videoonly1
AV
—
—
–
SYSU-GITL
balanced
videoonly2
V
—
—
–
SYSU-GITL
nofa
videoonly1
AV
—
—
–
SYSU-GITL
nofa
videoonly2
V
—
—
–
Telefonica-research
balanced
multimodal
AV
—
—
–
Telefonica-research
nofa
multimodal
AV
—
—
–
tokushima U
balanced
ch4of12
A
—
—
–
tokushima U
balanced
chcode
A
—
—
–
tokushima U
balanced
chth
A
—
—
–
tokushima U
nofa
ch4of12
A
—
—
–
UQMSG
balanced
mfh
AV
—
—
–
UQMSG
nofa
mfh
AV
—
—
–
USC-UTSA
balanced
test
AV
—
—
–
USC-UTSA
nofa
test
V
—
—
–
XJTU
balanced
1
V
—
—
–
XJTU
nofa
1
V
—
—
Table C.7 – RP1 (maximum Recall with Precision 1) for all TRECVID 2011 submissions. Part 2
of 2.
#
Team
Profile
Run
Type
RP.5
Precision
1
INRIA-LEAR
balanced
2
INRIA-LEAR
nofa
dodo
AV
0.984
0.532
dodo
AV
0.952
3
PKU-IDM
0.982
balanced
cascade
AV
0.947
1.000
4
BUPT-MCPRL
balanced
wsyVA
AV
0.947
0.921
5
INRIA-LEAR
balanced
deaf
V
0.945
0.520
6
CRIM-VISI
balanced
V48A66T58B
AV
0.944
0.644
7
CRIM-VISI
balanced
V48A66T65B
AV
0.944
0.644
8
CRIM-VISI
nofa
V48A66T160
AV
0.944
0.644
9
0.644
CRIM-VISI
nofa
V48A66T60
AV
0.944
10
INRIA-TEXMEX
balanced
zozo
AV
0.944
0.961
11
BUPT-MCPRL
balanced
zhVideo
V
0.927
0.564
12
BUPT-MCPRL
nofa
wsyVA
AV
0.927
0.970
13
INRIA-TEXMEX
balanced
themis
AV
0.927
0.960
14
INRIA-TEXMEX
nofa
tyche
AV
0.927
0.960
15
BUPT-MCPRL
nofa
zhVideo
V
0.922
0.620
16
PKU-IDM
nofa
cascade
AV
0.922
1.000
17
Telefonica-research
balanced
joint
AV
0.918
0.511
18
NTT-CSL
balanced
3
AV
0.898
0.972
19
NTT-CSL
balanced
2
AV
0.898
0.981
20
FTRDBJ
balanced
orange3
AV
0.895
0.982
21
NTT-CSL
balanced
1
AV
0.892
0.985
22
NTT-CSL
nofa
0
AV
0.892
0.992
23
FTRDBJ
nofa
orange1
AV
0.890
0.983
24
Telefonica-research
balanced
multimodal
AV
0.876
0.530
25
Telefonica-research
nofa
multimodal
AV
0.876
0.530
26
FTRDBJ
balanced
VideoOnly
V
0.813
0.983
27
IMP
balanced
Uvote
28
PRISMA
balanced
EhdRgbAud
V
0.768
0.852
AV
0.761
0.505
29
IMP
balanced
Wvote
V
0.756
0.865
30
IMP
nofa
Wvote
V
0.756
0.865
31
IMP
nofa
Uvote
V
0.744
0.904
32
PRISMA
nofa
EhdRgbAud
AV
0.742
0.640
33
ATTLabs
balanced
2
AV
0.718
0.989
34
ATTLabs
nofa
1
AV
0.714
0.981
35
ATTLabs
nofa
3
AV
0.699
0.979
36
ATTLabs
balanced
4
AV
0.695
0.990
37
PRISMA
balanced
EhdGry
V
0.690
0.607
continue in Table C.9...
Table C.8 – RP.5 (maximum Recall with Precision greater than or equal to 0.5) for all TRECVID
2011 submissions. Part 1 of 2.
#
Team
Profile
Run
Type
RP.5
Precision
continuation from Table C.8...
38
INRIA-TEXMEX
balanced
audioonly
A
0.675
0.595
39
PRISMA
nofa
EhdGry
V
0.673
0.784
40
FTRDBJ
nofa
AudioOnly
41
Telefonica-research
balanced
mask
42
iupr-dfki
balanced
43
KDDILabs
balanced
44
KDDILabs
45
46
A
0.601
0.991
AV
0.574
0.599
fsift
V
0.565
0.667
4sys
V
0.551
0.525
nofa
4sys
V
0.551
0.525
iupr-dfki
balanced
fsift2
V
0.547
0.582
KDDILabs
nofa
2sys
V
0.495
0.583
47
KDDILabs
nofa
base
V
0.481
0.528
48
ZJU CS IV
balanced
bhgccd
V
0.456
1.000
49
ZJU CS IV
nofa
bhgccd
V
0.456
1.000
0.518
50
tokushima U
balanced
ch4of12
A
0.376
51
tokushima U
nofa
ch4of12
A
0.376
0.518
52
ZJU CS IV
balanced
bgccd
V
0.374
1.000
53
ZJU CS IV
nofa
bgccd
V
0.374
1.000
54
brno
balanced
brnoccd
AV
0.363
0.756
55
ITU MSPR
balanced
ITUMSPR2
AV
0.329
0.912
56
ITU MSPR
nofa
ITUMSPR1
A
0.324
0.971
57
SYSU-GITL
balanced
videoonly2
V
0.324
0.504
58
SYSU-GITL
nofa
videoonly2
V
0.299
0.586
59
SYSU-GITL
balanced
videoonly1
AV
0.270
0.502
60
SYSU-GITL
nofa
videoonly1
AV
0.262
0.537
61
tokushima U
balanced
chth
A
0.249
0.522
62
tokushima U
balanced
chcode
A
0.184
0.503
AV
63
brno
nofa
brnoccd
0.177
0.928
–
RMIT
balanced
VideoBal5
V
—
—
–
RMIT
balanced
VideoBal6
V
—
—
–
RMIT
nofa
VideoNOFA7
V
—
—
–
RMIT
nofa
VideoNOFA8
V
—
—
–
UQMSG
balanced
mfh
AV
—
—
–
UQMSG
nofa
mfh
AV
—
—
–
USC-UTSA
balanced
test
AV
—
—
–
USC-UTSA
nofa
test
V
—
—
–
XJTU
balanced
1
V
—
—
–
XJTU
nofa
1
V
—
—
Table C.9 – RP.5 (maximum Recall with Precision greater than or equal to 0.5) for all TRECVID
2011 submissions. Part 2 of 2.