=Paper=
{{Paper
|id=Vol-1263/paper40
|storemode=property
|title=CERTH at MediaEval 2014 Synchronization of Multi-User Event Media Task
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_40.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/ApostolidisPM14
}}
==CERTH at MediaEval 2014 Synchronization of Multi-User Event Media Task==
<pdf width="1500px">https://ceur-ws.org/Vol-1263/mediaeval2014_submission_40.pdf</pdf>
<pre>
CERTH at MediaEval 2014 Synchronization of Multi-User Event Media Task

Konstantinos Apostolidis, Christina Papagiannopoulou, Vasileios Mezaris
Information Technologies Institute, CERTH, Thessaloniki, Greece
{kapost, cppapagi, bmezaris}@iti.gr


Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

ABSTRACT
This paper describes the results of the CERTH participation in the
Synchronization of Multi-User Event Media Task of MediaEval 2014. We used a
near duplicate image detector to identify very similar photos, which allowed
us to temporally align photo galleries; we then used time, geolocation and
visual information, including the results of visual concept detection, to
cluster all photos into different events.

1.   INTRODUCTION
People attending large-scale social events collect dozens of photos and video
clips with their smartphones, tablets and cameras. These are later exchanged
and shared in a number of different ways. The alignment and presentation of
the photo galleries of different users in a consistent way, so as to preserve
the temporal evolution of the event, is not straightforward, considering that
the time information attached to some of the captured media may be wrong (due
to different photo capturing devices not being synchronized) and geolocation
information may be missing. The 2014 MediaEval Synchronization of Multi-user
Event Media (SEM) task tackles this exact problem [1].

2.   SYSTEM OVERVIEW
The main goal of our system is the time alignment of photo galleries that are
created by different digital photo capture devices, and the clustering of
these into event-related clusters. In the first stage, similar photos of the
different galleries are identified and are used for constructing a graph,
whose nodes represent galleries and whose edges represent discovered links
between them. Time alignment of the galleries is achieved by traversing the
graph. After that, we apply clustering techniques in order to split our
collection into different events. Figure 1 shows the pipeline of our system.

Figure 1: System overview

3.   TIME SYNCHRONIZATION
Time synchronization makes use of a Near Duplicate Detector (NDD) that
extracts SIFT descriptors from the photos, forms a visual vocabulary and
encodes the descriptor-based representation of each photo using VLAD
encoding. The nearest neighbours that are returned for a query image are
refined by checking the geometrical consistency of SIFT keypoints using
geometric coding (GC) [4].

We further modified this NDD process to also use color information (HSV
histograms), so that near duplicate candidates that are very similar in color
are not discarded even if the GC score is relatively low.

We apply the modified NDD on the union of all galleries. Subsequently, we
filter out identified pairs of near duplicates according to the following
rules:
    • Reject pairs when geolocation information is available and the location
      distance of the two photos is greater than a distance threshold.
    • Reject pairs when the time difference between the photos is above an
      extreme time threshold (which indicates that this time difference is
      unlikely to be due to a time synchronization error alone).
The remaining near duplicate photos belonging to different galleries are
considered as links between those galleries.

It is now straightforward to construct a graph whose nodes represent the
galleries and whose edges represent these links between galleries. Each edge
has a weight which is equal to the number of links between the two galleries.
Having constructed the graph, we compute the time offset of each gallery by
traversing it, as follows. Starting from the node corresponding to the
reference gallery, we select the edge with the highest weight. We compute the
time offset of the node on the other end of this edge as the median of the
time differences of the pairs of near duplicate photos that this edge
represents, and add this node to the set of visited nodes. The selection of
the edge with the highest weight is repeated, considering as possible starting
point any member of the set of visited nodes, and the corresponding time
offset is computed, until all nodes are visited. Alternatively, we can
traverse the graph and compute the nodes' time offsets by simultaneously
considering the weights of multiple edges.

4.   MEDIA CLUSTERING OF EVENTS
Following time synchronization, we cluster all photos into events. Two
different approaches are adopted: the first one considers all photo galleries
as a single photo collection, exploiting the synchronization results, while
the second one first performs a pre-clustering within each gallery separately.

In the first approach, we use the method of [2], resulting in clusters that
are time distinct, comprising different events. Subsequently, each of these
clusters is split based on the geolocation information. The photos that do not
have geolocation information are assigned to the geo-cluster which is most
similar according to the color information (e.g. HSV histogram).


In the second approach, we detect time gaps between events of each gallery.
Specifically, we find the minimum time difference of dissimilar photos which
is greater than the maximum time difference of the near-duplicate photos
(based on the similarity matrix of GC). The clusters that are formed are
merged according to time and geolocation similarity. For the clusters that do
not have geolocation information, the merging is continued by considering the
time and low-level feature similarity or the time and the concept detector
(CD) confidence similarity scores [3].

5.   EXPERIMENTS AND RESULTS
We submitted 5 runs in total, combining 3 methods for time synchronization and
3 methods for event clustering:
    • Run1: aNDD-perGallery-mergeCD: Compute gallery time offsets using our
      modified NDD. CD scores are used to merge clusters using the second
      approach of Section 4.
    • Run2: aNDD-perGallery-mergeHSV: Compute gallery time offsets using our
      modified NDD. HSV histogram similarity is used to merge clusters using
      the second approach of Section 4.
    • Run3: aNDD-concat: Compute gallery time offsets using our modified NDD.
      Clustering is performed using the first approach of Section 4.
    • Run4: aNDD-multiT-perGallery-mergeCD: Compute gallery time offsets using
      our modified NDD and traversal of the graph by simultaneously
      considering the weights of multiple edges. CD scores are used to merge
      clusters using the second approach of Section 4.
    • Run5: NDD-perGallery-mergeCD: Compute gallery time offsets using NDD
      without HSV color information. CD scores are used to merge certain
      events using the second approach of Section 4.
The results of our approach for all 5 runs, for the Vancouver testset and the
London testset, are listed in Tables 1 and 2 respectively.

Table 1: Time Synchronization and Clustering metrics for each run for the
Vancouver testset.
                   run1     run2     run3     run4     run5
 Sync. Precision   0.9118   0.9118   0.9118   0.5294   0.9118
 Sync. Accuracy    0.7375   0.7375   0.7375   0.7014   0.7279
 Rand Index        0.9770   0.9734   0.9526   0.9601   0.9656
 Jaccard Index     0.2581   0.2315   0.2856   0.1782   0.2861
 F-Measure         0.2052   0.1880   0.2222   0.1512   0.2225

Table 2: Time Synchronization and Clustering metrics for each run for the
London testset.
                   run1     run2     run3     run4     run5
 Sync. Precision   0.6111   0.6111   0.6111   0.2222   0.6389
 Sync. Accuracy    0.7127   0.7127   0.7127   0.6996   0.7299
 Rand Index        0.9885   0.9910   0.9838   0.9829   0.9863
 Jaccard Index     0.5051   0.5614   0.3232   0.2739   0.4849
 F-Measure         0.3356   0.3596   0.2443   0.2150   0.3266

6.   CONCLUSIONS
This paper presented our framework and results at the MediaEval 2014
Synchronization of Multi-User Event Media Task. Our modified NDD approach
gives the best results in time alignment for the Vancouver testset, while the
standard NDD yields a slightly better time synchronization for the London
testset. In sub-event clustering, the exploitation of consistent timestamps in
a gallery and the use of CD confidence scores give good performance for the
Vancouver testset, whereas HSV histogram similarity seems to give the best
clustering results for the London testset.

7.   ACKNOWLEDGMENTS
This work was supported by the EC under contracts FP7-287911 LinkedTV and
FP7-600826 ForgetIT.

8.   REFERENCES
[1] N. Conci, F. De Natale, and V. Mezaris. Synchronization of Multi-User
    Event Media (SEM) at MediaEval 2014: Task Description, Datasets, and
    Evaluation. In Proc. MediaEval Workshop, 2014.
[2] M. Cooper, J. Foote, A. Girgensohn, and L. Wilcox. Temporal event
    clustering for digital photo collections. ACM Transactions on Multimedia
    Computing, Communications, and Applications (TOMCCAP), 1(3):269–288, 2005.
[3] C. Papagiannopoulou and V. Mezaris. Concept-based Image Clustering and
    Summarization of Event-related Image Collections. In Proc. Int. Workshop
    on Human Centered Event Understanding from Multimedia (HuEvent14) of ACM
    Multimedia (MM14), 2014.
[4] W. Zhou, H. Li, Y. Lu, and Q. Tian. SIFT match verification by geometric
    coding for large-scale partial-duplicate web image search. ACM Trans.
    Multimedia Comput. Commun. Appl., 9(1):4:1–4:18, Feb. 2013.

</pre>
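The link filtering and greedy graph traversal described in Section 3 of the paper can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the tuple layout of a near-duplicate pair, the threshold values and the use of seconds as the time unit are all assumptions made for illustration. The paper also does not spell out how offsets are chained through intermediate galleries. The sketch assumes the offset of a newly visited gallery is the offset of the already visited gallery plus the median time difference on the selected edge.

```python
from collections import defaultdict
from statistics import median


def filter_pairs(pairs, max_distance_km, max_time_diff_s):
    """Keep near-duplicate pairs that can serve as links between galleries.

    Each pair is a hypothetical tuple
    (gallery_a, time_a, gallery_b, time_b, distance_km), where
    distance_km is None when geolocation is unavailable.
    """
    kept = []
    for ga, ta, gb, tb, dist_km in pairs:
        if ga == gb:
            continue  # only pairs across different galleries act as links
        if dist_km is not None and dist_km > max_distance_km:
            continue  # rule 1: photos taken too far apart
        if abs(tb - ta) > max_time_diff_s:
            continue  # rule 2: gap too extreme to be a clock error alone
        kept.append((ga, ta, gb, tb))
    return kept


def build_graph(links):
    """Map each directed gallery pair (u, v) to its list of time differences.

    The edge weight is the number of links, i.e. the length of the list.
    """
    edges = defaultdict(list)
    for ga, ta, gb, tb in links:
        edges[(ga, gb)].append(tb - ta)
        edges[(gb, ga)].append(ta - tb)
    return edges


def compute_offsets(edges, reference):
    """Greedy traversal: repeatedly follow the heaviest edge that leaves
    the set of visited galleries (assumed chaining of offsets)."""
    offsets = {reference: 0.0}
    while True:
        candidates = [
            (len(diffs), u, v)
            for (u, v), diffs in edges.items()
            if u in offsets and v not in offsets
        ]
        if not candidates:
            # All reachable galleries are visited; galleries with no
            # surviving links to the rest stay without an offset.
            return offsets
        _, u, v = max(candidates, key=lambda c: c[0])
        offsets[v] = offsets[u] + median(edges[(u, v)])


# Toy data: gallery B runs about 100 s ahead of A, C about 51 s behind B.
pairs = [
    ("A", 1000, "B", 1100, None),
    ("A", 2000, "B", 2104, 0.1),
    ("A", 3000, "B", 3098, None),
    ("B", 5000, "C", 4950, None),
    ("B", 6000, "C", 5948, None),
    ("A", 4000, "C", 4060, None),   # weaker A-C edge, not selected
    ("A", 1000, "B", 90000, None),  # rejected: extreme time difference
    ("A", 1500, "C", 1500, 50.0),   # rejected: too far apart
]
links = filter_pairs(pairs, max_distance_km=1.0, max_time_diff_s=3600)
offsets = compute_offsets(build_graph(links), reference="A")
```

In this sketch, subtracting `offsets[g]` from every timestamp of gallery `g` would bring it onto the reference gallery's clock. The median is used, as in the paper, so that a single mismatched near-duplicate pair on an edge does not dominate the estimated offset.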