=Paper= {{Paper |id=Vol-1263/paper53 |storemode=property |title=JRS at Event Synchronization Task |pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_53.pdf |volume=Vol-1263 |dblpUrl=https://dblp.org/rec/conf/mediaeval/NowakTSB14 }} ==JRS at Event Synchronization Task== https://ceur-ws.org/Vol-1263/mediaeval2014_submission_53.pdf
                          JRS at Event Synchronization Task

                        Paweł Nowak, Marcus Thaler, Harald Stiegler, Werner Bailer
                                            JOANNEUM RESEARCH – DIGITAL
                                            Steyrergasse 17, 8010 Graz, Austria
                                              werner.bailer@joanneum.at



ABSTRACT
The event synchronisation task addresses the problem of aligning photo streams from different users temporally and identifying coherent events in the streams. In our approach, we first determine the visual similarity of image pairs. We determine visual similarity based on full matching of SIFT descriptors and based on VLAD, and compare the use of the two sets of similarity scores. We then build a non-homogeneous linear equation system constraining the time offsets between the galleries based on these matching pairs and determine an approximate solution. Event clusters are initialised from subsequent and visually similar images, and clusters are merged if their temporal proximity and the maximum similarity of their members are high enough.

1. INTRODUCTION
The event synchronisation task addresses the problem of aligning photo streams from different users temporally and identifying coherent events in the streams. This paper describes the work done by the JRS team for the two subtasks of determining the time offsets of galleries and clustering the images into events. Details on the task and the data set can be found in [1].

2. APPROACH

2.1 Determining Gallery Offsets
In our approach, we first determine the visual similarity of image pairs. We determine visual similarity based on full matching of SIFT [3] descriptors and based on VLAD [2], and compare the use of the two sets of similarity scores.

The computation of the image similarities between the images of each gallery is based on SIFT descriptors. All images of each gallery were first downscaled from HD to SD. Subsequently, up to 500 SIFT key points and descriptors were extracted from each image.

For similarity calculation based on nearest neighbour matching of SIFT descriptors, each raw SIFT descriptor of the source image is assigned to its nearest neighbour (based on Euclidean distance) descriptor in the target image. These assignments are validated by estimating the homography that is supported by the maximum number of consistent descriptor matches.

For the extraction of the image similarities based on the compact feature representation VLAD, the same extracted SIFT key descriptors were used. In order to compute the VLAD signature of each gallery image we used the VLFeat¹ open source library. We reduced a global visual vocabulary with about 300,000 descriptor clusters to 256 visual words using k-means clustering. The descriptors for building the vocabulary have been extracted from a news data set of the TOSCA-MP project². The similarities between the VLAD signatures, and thus the image similarities within the test sets, were calculated based on the sum of squared errors.

For a pair of images $(I_i, I_j)$, VLAD yields distances $d^V_{ij}$, which are transformed into similarities

\[
s^V_{ij} =
\begin{cases}
\theta_V - d^V_{ij}, & \text{if } d^V_{ij} < \theta_V \\
0, & \text{otherwise,}
\end{cases}
\tag{1}
\]

where $\theta_V$ is a threshold for the maximum distance. The SIFT similarity $s^S_{ij}$ is determined as

\[
s^S_{ij} =
\begin{cases}
\max\left(0, \dfrac{|P_{ij}|}{\min(|P_i|, |P_j|)} - \theta_S\right), & \text{if } |P_{ij}| \geq p \\
0, & \text{otherwise,}
\end{cases}
\tag{2}
\]

where $P_i$ are the key points in each of the images, $P_{ij}$ is the set of matching key points, $p$ is a threshold for the number of matching key points and $\theta_S$ is a similarity threshold. We use all similarities above zero to formulate constraints on the time offsets of the galleries. Optionally, the GPS information of the images (if available) can be used, setting the similarity to zero if the deviation in longitude or latitude is above a threshold $\theta_G$ (in degrees).

For $N$ galleries $G_1, \ldots, G_N$, we can assume without loss of generality that $G_1$ is the reference gallery. We aim at obtaining a list of time differences $D = (\delta_2, \ldots, \delta_N)$, where $\delta_i$ is the time offset between galleries $G_i$ and $G_1$. As the underlying assumption in this task is that the offset between two galleries is constant over time, each pair of matching images adds one constraint of the form $\delta_p - \delta_q = \tau_{ij}$, where $p, q$ are the galleries containing images $I_i, I_j$ respectively, and $\tau_{ij}$ is the time offset determined from time stamps of the matching images. Note that $\delta_1$ is by definition 0. We can then reorganise our constraints into an overdetermined equation system


¹ http://www.vlfeat.org
² http://www.tosca-mp.eu

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain
      run   set   vis. sim.    θS      θV      θG     αt
      1     1,2   VLAD         n/a     1.70    2.5    1.0
      2     1     SIFT+VLAD    0.07    1.82    2.5    0.0
      2     2     SIFT+VLAD    0.08    1.80    2.5    1.0
      3     1     SIFT+VLAD    0.07    1.82    2.5    0.0
      3     2     SIFT+VLAD    0.08    1.85    2.5    0.0
      4     1     SIFT+VLAD    0.06    1.85    2.5    0.0
      4     2     SIFT+VLAD    0.08    1.80    2.5    0.0

Table 1: Parameters of runs, tmin = 120 s and p = 10.

Figure 1: Results for synchronisation. [Plot over runs 1 to 4, value axis from 0 to 1; series: precision Vancouver, precision London, accuracy Vancouver, accuracy London.]

\[
\begin{pmatrix}
g_2(i) - g_2(j) & \cdots & g_N(i) - g_N(j) \\
\vdots & & \vdots \\
g_2(k) - g_2(l) & \cdots & g_N(k) - g_N(l)
\end{pmatrix}
\begin{pmatrix}
\delta_2 \\ \vdots \\ \delta_N
\end{pmatrix}
=
\begin{pmatrix}
\tau_{ij} \\ \vdots \\ \tau_{kl}
\end{pmatrix}
\tag{3}
\]
where $g_n(i)$ is a binary function, yielding 1 if $I_i \in G_n$, 0 otherwise. In order to deal with outliers, we iteratively solve the equation system, and remove up to 10% of the largest outliers. In each iteration, we use the Jacobi method to solve the equation system.
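The steps of Section 2.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the submitted runs: the function names, the tuple layout `(p, q, tau)` for a matching pair, the residual tolerance `tol`, and the one-constraint-at-a-time removal scheme are all hypothetical. In particular, each iteration here solves the system with NumPy's least-squares routine in place of the Jacobi method mentioned above.

```python
import numpy as np


def vlad_similarity(d, theta_v):
    """Equation (1): turn a VLAD distance into a similarity."""
    return theta_v - d if d < theta_v else 0.0


def sift_similarity(n_matches, n_i, n_j, theta_s, p):
    """Equation (2): n_matches = |P_ij|, n_i = |P_i|, n_j = |P_j|."""
    if n_matches < p:
        return 0.0
    return max(0.0, n_matches / min(n_i, n_j) - theta_s)


def build_system(matches, n_galleries):
    """Build A and b of Equation (3).

    matches: list of (p, q, tau) with 1-based gallery indices, each
    encoding the constraint delta_p - delta_q = tau. Gallery 1 is the
    reference (delta_1 = 0), so it has no column in A.
    """
    A = np.zeros((len(matches), n_galleries - 1))
    b = np.zeros(len(matches))
    for row, (p, q, tau) in enumerate(matches):
        if p > 1:
            A[row, p - 2] += 1.0
        if q > 1:
            A[row, q - 2] -= 1.0
        b[row] = tau
    return A, b


def solve_offsets(matches, n_galleries, max_outlier_fraction=0.1, tol=1.0):
    """Iteratively solve the system, dropping the worst constraint each round.

    Assumption: at most max_outlier_fraction of the constraints are
    removed, and removal stops once the largest residual is <= tol.
    Least squares is swapped in for the Jacobi method of the paper.
    """
    A, b = build_system(matches, n_galleries)
    keep = np.ones(len(b), dtype=bool)
    budget = int(max_outlier_fraction * len(b))
    while True:
        delta, *_ = np.linalg.lstsq(A[keep], b[keep], rcond=None)
        residuals = np.abs(A @ delta - b)
        residuals[~keep] = -1.0  # ignore constraints already removed
        worst = int(np.argmax(residuals))
        if budget == 0 or residuals[worst] <= tol:
            return delta  # (delta_2, ..., delta_N)
        keep[worst] = False
        budget -= 1


# Hypothetical example: three galleries, nine consistent pairs, one outlier.
example_matches = (
    [(2, 1, 10.0)] * 3 + [(3, 1, -5.0)] * 3 + [(2, 3, 15.0)] * 3
    + [(2, 1, 500.0)]
)
example_offsets = solve_offsets(example_matches, n_galleries=3)
```

Only pairs with a similarity above zero would be turned into `(p, q, tau)` constraints. Removing a single constraint per iteration keeps the sketch simple; a batch removal per iteration would fit the description in the text equally well.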
Figure 2: Results for clustering. [Plot over runs 1 to 4, value axis from 0 to 1; series: Rand Vancouver, Rand London, Jaccard Vancouver, Jaccard London, F-Measure Vancouver, F-Measure London.]

2.2 Clustering Events
We initialise the event time line by grouping subsequent images that have visual similarity ($s^V$ or $s^S$) above zero. This will oversegment the event time line. In a next step, we start regrouping these events based on visual similarity and (optionally) temporal proximity. The distance between two events $i, j$ is determined as

\[
d^E_{ij} = \alpha_t \max\left(1, \frac{|\bar{t}_i - \bar{t}_j|}{t_{min}}\right) \theta - \max_{k \in E_i,\, l \in E_j} s_{kl},
\tag{4}
\]

where $\bar{t}_i$ is the mean time of images in event $E_i$, $\alpha_t$ is a weight for using time information, $\theta$ is the similarity threshold used (S or V) and $s_{kl}$ is the visual similarity between a pair of images of which one belongs to $E_i$ and the other to $E_j$. Two events are merged if $d^E_{ij} < \theta_{merge}$, where $\theta_{merge}$ has been set to $\theta_V + 0.15$.

3. EXPERIMENTS AND RESULTS
We submitted four runs, with the parameters listed in Table 1. One observation from the experiments on the test set is that full matching of SIFT descriptors is better for determining gallery offsets, which needs to find the single most similar image from the other gallery. In contrast, the event clustering needs a more global notion of similarity, which is well covered by VLAD. Thus we used VLAD similarities for event clustering in all the runs. The results for synchronisation are shown in Figure 1, and those for clustering in Figure 2.

4. DISCUSSION
As already expected from the experiments on the development set, VLAD is not discriminative enough for determining the image pairs for synchronisation, thus the results of run 1 are much worse than the others. Our method only manages to synchronise a fraction of the galleries correctly; however, if a gallery is synchronised, the accuracy is rather high. The results for the London data set are clearly better than those for the Vancouver set. We think that this is not so much related to the similarity to the development set, but rather to the high visual similarity in Winter Olympics (e.g., all ice-based competitions have high similarity). For clustering, the differences are not so clear: for runs 2 and 3 the Vancouver results are even better than the London ones according to Jaccard index and F-measure. In general, the Rand index shows a quite different picture than the other two measures.

Acknowledgments
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 610370, "ICoSOLE – Immersive Coverage of Spatially Outspread Live Events" (http://www.icosole.eu/).

5. REFERENCES
[1] Nicola Conci, Francesco De Natale, and Vasileios Mezaris. Synchronization of Multi-User Event Media (SEM) at MediaEval 2014: Task Description, Datasets, and Evaluation. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17 2014.
[2] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1704–1716, 2012.
[3] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.