=Paper=
{{Paper
|id=Vol-1263/paper53
|storemode=property
|title=JRS at Event Synchronization Task
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_53.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/NowakTSB14
}}
==JRS at Event Synchronization Task==
Paweł Nowak, Marcus Thaler, Harald Stiegler, Werner Bailer
JOANNEUM RESEARCH – DIGITAL, Steyrergasse 17, 8010 Graz, Austria
werner.bailer@joanneum.at

ABSTRACT

The event synchronisation task addresses the problem of temporally aligning photo streams from different users and identifying coherent events in the streams. In our approach, we first determine the visual similarity of image pairs, based on full matching of SIFT descriptors and on VLAD, and compare the use of the two sets of similarity scores. We then build a non-homogeneous linear equation system constraining the time offsets between the galleries based on these matching pairs and determine an approximate solution. Event clusters are initialised from subsequent and visually similar images, and clusters are merged if their temporal proximity and the maximum similarity of their members are high enough.

1. INTRODUCTION

The event synchronisation task addresses the problem of temporally aligning photo streams from different users and identifying coherent events in the streams. This paper describes the work done by the JRS team for the two subtasks of determining the time offsets of galleries and clustering the images into events. Details on the task and the data set can be found in [1].

2. APPROACH

2.1 Determining Gallery Offsets

In our approach, we first determine the visual similarity of image pairs, based on full matching of SIFT [3] descriptors and on VLAD [2], and compare the use of the two sets of similarity scores.

The computation of the image similarities between the images of each gallery is based on SIFT descriptors. All images of each gallery were first downscaled from HD to SD resolution, and up to 500 SIFT key points and descriptors were then extracted from each image. For similarity calculation based on nearest neighbour matching of SIFT descriptors, each raw SIFT descriptor of the source image is assigned to its nearest neighbour (by Euclidean distance) among the descriptors of the target image. These assignments are validated by the homography supported by the maximum number of consistent descriptor assignments.

For the extraction of image similarities based on the compact feature representation VLAD, the same extracted SIFT key descriptors were used. To compute the VLAD signature of each gallery image we used the VLFeat open source library (http://www.vlfeat.org). Using k-means clustering, we reduced a global visual vocabulary of about 300,000 descriptor clusters to 256 visual words; the descriptors for building the vocabulary were extracted from a news data set of the TOSCA-MP project (http://www.tosca-mp.eu). The similarities between the VLAD signatures, and thus the image similarities within the test sets, were calculated based on the sum of squared errors.

For a pair of images (I_i, I_j), VLAD yields distances d^V_ij, which are transformed into similarities

  s^V_{ij} = \begin{cases} \theta_V - d^V_{ij}, & \text{if } d^V_{ij} < \theta_V \\ 0, & \text{otherwise,} \end{cases}   (1)

where θ_V is a threshold for the maximum distance. The SIFT similarity s^S_ij is determined as

  s^S_{ij} = \begin{cases} \max\left(0, \frac{|P_{ij}|}{\min(|P_i|, |P_j|)} - \theta_S\right), & \text{if } |P_{ij}| \geq p \\ 0, & \text{otherwise,} \end{cases}   (2)

where P_i are the key points of each image, P_ij is the set of matching key points, p is a threshold on the number of matching key points and θ_S is a similarity threshold. We use all similarities above zero to formulate constraints on the time offsets of the galleries. Optionally, the GPS information of the images (if available) can be used, setting the similarity to zero if the deviation in longitude or latitude is above a threshold θ_G (in degrees).
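For illustration, the following is a minimal NumPy sketch of the standard VLAD aggregation [2] underlying the distances d^V_ij. The paper itself relies on VLFeat for this step, so the function names and the precomputed 256-word vocabulary argument below are our assumptions, not the authors' code.

```python
import numpy as np

def vlad_signature(descriptors, vocabulary):
    """Aggregate the SIFT descriptors of one image into a VLAD signature.

    descriptors -- (n, 128) array of SIFT descriptors of one image
    vocabulary  -- (256, 128) array of visual words (k-means centroids)
    """
    descriptors = np.asarray(descriptors, dtype=float)
    # assign each descriptor to its nearest visual word
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    # accumulate the residuals of the descriptors to their assigned word
    v = np.zeros_like(vocabulary, dtype=float)
    for desc, k in zip(descriptors, nearest):
        v[k] += desc - vocabulary[k]
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v  # L2 normalisation

def vlad_distance(v1, v2):
    """Sum of squared errors between two VLAD signatures (d^V in Eq. 1)."""
    return float(((v1 - v2) ** 2).sum())
```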
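The similarity transforms (1) and (2) are then direct to apply. A minimal sketch, assuming the VLAD distance and the homography-validated match counts have already been computed; the function and argument names are ours:

```python
def vlad_similarity(d_vlad, theta_v):
    """Eq. (1): turn a VLAD distance into a similarity score."""
    return theta_v - d_vlad if d_vlad < theta_v else 0.0

def sift_similarity(n_matches, n_kp_i, n_kp_j, theta_s, p=10):
    """Eq. (2): similarity from homography-validated SIFT matches.

    n_matches      -- |P_ij|, number of validated matching key points
    n_kp_i, n_kp_j -- |P_i|, |P_j|, key points per image (up to 500)
    """
    if n_matches < p:  # too few matches to be considered reliable
        return 0.0
    return max(0.0, n_matches / min(n_kp_i, n_kp_j) - theta_s)
```

For example, with θ_S = 0.07 and p = 10 (run 2 in Table 1), a pair with 60 validated matches and at least 400 key points per image scores 60/400 − 0.07 = 0.08.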
For N galleries G_1, ..., G_N, we can assume without loss of generality that G_1 is the reference gallery. We aim at obtaining a list of time differences D = (δ_2, ..., δ_N), where δ_i is the time offset between galleries G_i and G_1. As the underlying assumption in this task is that the offset between two galleries is constant over time, each pair of matching images adds one constraint of the form δ_p − δ_q = τ_ij, where p, q are the galleries containing images I_i, I_j respectively, and τ_ij is the time offset determined from the time stamps of the matching images. Note that δ_1 is by definition 0. We can then reorganise our constraints into an overdetermined equation system

  \begin{pmatrix} g_2(i) - g_2(j) & \cdots & g_N(i) - g_N(j) \\ \vdots & & \vdots \\ g_2(k) - g_2(l) & \cdots & g_N(k) - g_N(l) \end{pmatrix} \begin{pmatrix} \delta_2 \\ \vdots \\ \delta_N \end{pmatrix} = \begin{pmatrix} \tau_{ij} \\ \vdots \\ \tau_{kl} \end{pmatrix}   (3)

where g_n(i) is a binary function yielding 1 if I_i ∈ G_n and 0 otherwise. In order to deal with outliers, we solve the equation system iteratively and remove up to 10% of the largest outliers. In each iteration, we use the Jacobi method to solve the equation system (a sketch of this procedure is given below, after Section 2.2).

2.2 Clustering Events

We initialise the event time line by grouping subsequent images which have visual similarity (s^V or s^S) above zero. This oversegments the event time line. In a next step, we regroup these events based on visual similarity and (optionally) temporal proximity. The distance between two events i, j is determined as

  d^E_{ij} = \alpha_t \max\left(1, \frac{|\bar{t}_i - \bar{t}_j|}{t_{\min}}\right) \theta - \max_{k \in E_i, l \in E_j} s_{kl}   (4)

where t̄_i is the mean time of the images in event E_i, α_t is a weight for using time information, θ is the similarity threshold used (θ_S or θ_V) and s_kl is the visual similarity between a pair of images of which one belongs to E_i and the other to E_j. Two events are merged if d^E_ij < θ_merge, where θ_merge has been set to θ_V + 0.15.
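A minimal sketch of this merging step, assuming per-image time stamps and a precomputed pairwise similarity matrix; the greedy merge loop and all names are our illustration, not the authors' implementation (t_min defaults to the 120 s of Table 1):

```python
import numpy as np

def merge_events(events, times, sim, theta, theta_merge,
                 alpha_t=1.0, t_min=120.0):
    """Greedily merge events whose distance d^E_ij (Eq. 4) is below theta_merge.

    events -- list of lists of image indices (initial oversegmentation)
    times  -- per-image time stamps in seconds
    sim    -- pairwise image similarity matrix (s^V or s^S)
    """
    events = [list(e) for e in events]
    merged = True
    while merged:
        merged = False
        for a in range(len(events)):
            for b in range(a + 1, len(events)):
                Ea, Eb = events[a], events[b]
                # temporal term: mean-time difference, scaled by t_min
                dt = abs(np.mean([times[i] for i in Ea]) -
                         np.mean([times[i] for i in Eb]))
                # visual term: maximum similarity across the two events
                s_max = max(sim[i, j] for i in Ea for j in Eb)
                d_e = alpha_t * max(1.0, dt / t_min) * theta - s_max
                if d_e < theta_merge:  # merge criterion of Eq. (4)
                    events[a] = Ea + Eb
                    del events[b]
                    merged = True
                    break
            if merged:
                break
    return events
```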
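The offset estimation of Section 2.1 can be sketched as follows: build system (3) from the matching pairs, solve it, and iteratively discard the constraints with the largest residuals. The paper states that the Jacobi method is used; since plain Jacobi converges only under conditions such as diagonal dominance, the sketch below applies it to the normal equations A^T A δ = A^T τ, and drops 10% of the constraints per iteration. Both choices are our assumptions rather than the authors' exact procedure.

```python
import numpy as np

def estimate_offsets(pairs, n_galleries, n_iter=10, drop_frac=0.10,
                     jacobi_steps=200):
    """Estimate offsets (delta_2, ..., delta_N) from constraints delta_p - delta_q = tau.

    pairs -- list of (p, q, tau): gallery indices (1-based, gallery 1 is the
             reference with delta_1 = 0) and the time offset from a matching pair
    """
    def build(pairs):
        A = np.zeros((len(pairs), n_galleries - 1))
        b = np.array([tau for _, _, tau in pairs], dtype=float)
        for r, (p, q, _) in enumerate(pairs):
            if p > 1:
                A[r, p - 2] = 1.0   # g_p(i) contribution
            if q > 1:
                A[r, q - 2] = -1.0  # g_q(j) contribution
        return A, b

    delta = np.zeros(n_galleries - 1)  # warm start across iterations
    for _ in range(n_iter):
        A, b = build(pairs)
        # Jacobi iteration on the normal equations A^T A delta = A^T b
        M, y = A.T @ A, A.T @ b
        D = np.diag(M).copy()
        D[D == 0] = 1.0  # guard against galleries without constraints
        R = M - np.diagflat(np.diag(M))
        for _ in range(jacobi_steps):
            delta = (y - R @ delta) / D
        # remove the largest residuals as outliers (paper: "up to 10%")
        res = np.abs(A @ delta - b)
        keep = res.argsort()[: max(1, int(len(pairs) * (1 - drop_frac)))]
        pairs = [pairs[k] for k in sorted(keep)]
    return delta
```

The inner solve could equally be replaced by np.linalg.lstsq; the outer loop is what implements the outlier removal.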
3. EXPERIMENTS AND RESULTS

We submitted four runs, with the parameters listed in Table 1. One observation from the experiments on the test set is that full matching of SIFT descriptors is better suited for determining gallery offsets, which requires finding the single most similar image in the other gallery. In contrast, event clustering needs a more global notion of similarity, which is well covered by VLAD; thus we used VLAD similarities for event clustering in all runs. The results for synchronisation are shown in Figure 1, and those for clustering in Figure 2.

  run  set  vis. sim.  θ_S   θ_V   θ_G  α_t
  1    1,2  VLAD       n/a   1.70  2.5  1.0
  2    1    SIFT+VLAD  0.07  1.82  2.5  0.0
  2    2    SIFT+VLAD  0.08  1.80  2.5  1.0
  3    1    SIFT+VLAD  0.07  1.82  2.5  0.0
  3    2    SIFT+VLAD  0.08  1.85  2.5  0.0
  4    1    SIFT+VLAD  0.06  1.85  2.5  0.0
  4    2    SIFT+VLAD  0.08  1.80  2.5  0.0

Table 1: Parameters of the runs; t_min = 120 s and p = 10 in all runs.

[Figure 1: Results for synchronisation (precision and accuracy, Vancouver and London data sets, runs 1–4).]

[Figure 2: Results for clustering (Rand index, Jaccard index and F-measure, Vancouver and London data sets, runs 1–4).]

4. DISCUSSION

As already expected from the experiments on the development set, VLAD is not discriminative enough for determining the image pairs for synchronisation; thus the results of run 1 are much worse than the others. Our method only manages to synchronise a fraction of the galleries correctly; however, if a gallery is synchronised, the accuracy is rather high. The results for the London data set are clearly better than those for the Vancouver set. We think that this is not so much related to the similarity to the development set, but rather to the high visual similarity within the Winter Olympics data (e.g., all ice-based competitions have high similarity). For clustering, the differences are less clear: for runs 2 and 3, the Vancouver results are even better than the London ones according to the Jaccard index and F-measure. In general, the Rand index shows a quite different picture than the other two measures.

Acknowledgments

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 610370, "ICoSOLE – Immersive Coverage of Spatially Outspread Live Events" (http://www.icosole.eu/).

5. REFERENCES

[1] N. Conci, F. De Natale, and V. Mezaris. Synchronization of multi-user event media (SEM) at MediaEval 2014: Task description, datasets, and evaluation. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[2] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1704–1716, 2012.
[3] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.