CERTH at MediaEval 2015 Synchronization of Multi-User Event Media Task

Konstantinos Apostolidis, CERTH-ITI, Thermi 57001, Greece, kapost@iti.gr
Vasileios Mezaris, CERTH-ITI, Thermi 57001, Greece, bmezaris@iti.gr

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany

ABSTRACT
This paper describes the results of our participation in the Synchronization of Multi-User Event Media Task at the MediaEval 2015 challenge. Using multiple similarity measures, we identify pairs of similar media items from different galleries. We use a graph-based approach to temporally synchronize user galleries; subsequently, we use time information, geolocation information and visual concept detection results to cluster all photos into different sub-events. Our method achieves good accuracy on considerably diverse datasets.

1.   INTRODUCTION
People attending large events collect dozens of photos and video clips with their smartphones, tablets and cameras. These are later exchanged and shared in a number of different ways. The alignment and presentation of the media galleries of different users in a consistent way, so as to preserve the temporal evolution of the event, is not straightforward, considering that the time information attached to some of the captured media may be wrong and geolocation information may be missing. The 2015 MediaEval Synchronization of Multi-User Event Media (SEM) task tackles this exact problem [2].
2.   METHOD OVERVIEW
The proposed method temporally aligns user galleries that are created by different digital capture devices, and clusters the time-aligned photos into event-related clusters. In the first stage, we assess media similarity by combining multiple similarity measures and by taking into account the geolocation metadata of photos. Similar media items of the different galleries are identified and used for constructing a graph, whose nodes represent galleries and whose edges represent the discovered similarities between media items of different galleries. Synchronization of the galleries is achieved by traversing the minimum spanning tree (MST) of the graph. Finally, we apply clustering techniques to split the media into different sub-events. Figure 1 illustrates the proposed method.

Figure 1: System overview
3.   MEDIA SIMILARITY ASSESSMENT
To identify similar photos of different galleries, we combine the information of four similarity measures [1]:

1. Geometric Consistency of Local Features Similarity (GC): We check the geometric consistency of SIFT keypoints for each pair of photos, using geometric coding [6]. The GC similarity can discover near-duplicate photos.

2. Scene Similarity (S): We calculate the pairwise cosine distances between the extracted GIST descriptors [4] of the photos. High S similarity indicates photos captured at similar scenery (indoor, urban, nature).

3. Color Allocation Similarity (CA): We divide each image into three equal, non-overlapping horizontal strips and extract the HSV histogram of each. We calculate the pairwise cosine distances between the concatenations of the HSV histograms (a sketch of this measure is given after this list). High CA similarity indicates photos with similar colors.

4. DCNN Concept Scores Similarity (DCS): We use the Caffe DCNN framework [3] and the GoogLeNet pre-trained model [5] to extract concept scores for the photos. We use the Euclidean distance to calculate pairwise distances between the concept score vectors of the photos. High DCS similarity indicates semantically similar photos.
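To make the CA measure more concrete, the following is a minimal sketch of how such a descriptor and its cosine similarity could be computed with OpenCV and NumPy; the bin counts, the per-strip normalization and the function names are our own illustrative choices, not the exact configuration used in our runs.

    import cv2
    import numpy as np

    def color_allocation_descriptor(image_bgr, bins=(8, 4, 4)):
        # concatenated HSV histograms of three equal horizontal strips
        hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
        height = hsv.shape[0]
        parts = []
        for i in range(3):
            strip = hsv[i * height // 3:(i + 1) * height // 3]
            hist = cv2.calcHist([strip], [0, 1, 2], None, list(bins),
                                [0, 180, 0, 256, 0, 256]).flatten()
            parts.append(hist / (hist.sum() + 1e-9))   # normalize per strip
        return np.concatenate(parts)

    def ca_similarity(d1, d2):
        # cosine similarity of two color allocation descriptors
        return float(np.dot(d1, d2) /
                     (np.linalg.norm(d1) * np.linalg.norm(d2) + 1e-9))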
We calculate the aforementioned similarity measures on the photos of all galleries to be synchronized. We combine the information of all similarity measures using the following procedure: initially, the similarity O(i, j) of photos i and j is set equal to GC(i, j). Then, if S(i, j) > ts and S(i, j) > GC(i, j), O(i, j) is updated as O(i, j) = S(i, j). The same update process is subsequently repeated using the CA similarity and the DCS similarity (and the respective tc, td thresholds).
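For illustration, a minimal sketch of this combination rule is given below, assuming the four pairwise similarity matrices have already been computed and scaled to a common [0, 1] range; the threshold values shown are placeholders, not the ones tuned on the training data, and each new measure is compared against the value accumulated in O so far.

    import numpy as np

    def combine_similarities(GC, S, CA, DCS, ts=0.7, tc=0.7, td=0.7):
        # GC, S, CA, DCS: (N, N) pairwise similarity matrices in [0, 1];
        # ts, tc, td are placeholder thresholds, not the tuned values
        O = GC.copy()                          # start from geometric consistency
        for M, t in ((S, ts), (CA, tc), (DCS, td)):
            mask = (M > t) & (M > O)           # measure passes its threshold and
            O[mask] = M[mask]                  # exceeds the value kept so far
        return O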
Subsequently, we weigh each similarity value so that the similarity of photos whose capture locations are closer than a threshold m is emphasized, while the similarity of photos whose capture locations are significantly farther apart than this threshold is zeroed. Similar photos that belong to different user galleries are treated as potential links between these galleries.

To identify similar audio files of different galleries, we perform cross-correlation of the audio data, downsampled to an 11 kHz sampling rate. For video files, we select one frame for each second of video and resize it to a width of one pixel. To identify similar video files of different galleries, we perform cross-correlation of the horizontally concatenated resized frames.
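As an illustration of the audio matching step, the sketch below estimates the temporal offset between two already-resampled mono tracks via cross-correlation; the function name and the normalization are ours, and in practice the height of the correlation peak would also have to be thresholded to decide whether the two recordings actually overlap.

    import numpy as np

    def estimate_audio_offset(x, y, sr=11000):
        # x, y: 1-D arrays of mono audio samples, already resampled to sr Hz;
        # returns the lag (in seconds) at the correlation peak,
        # positive when x is delayed with respect to y
        x = (x - x.mean()) / (x.std() + 1e-9)
        y = (y - y.mean()) / (y.std() + 1e-9)
        corr = np.correlate(x, y, mode="full")     # lags -(len(y)-1) .. len(x)-1
        lag = int(np.argmax(corr)) - (len(y) - 1)
        return lag / sr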
The ts, tc and td thresholds are empirically calculated over the training dataset. The threshold m is calculated by estimating a Gaussian mixture model of two Gaussian distributions on the histogram of all photos' pairwise capture-location distances. The Gaussian distribution with the lowest mean (m) presumably signifies photos captured in the same sub-event.
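A minimal sketch of this estimation is given below, assuming the pairwise capture-location distances (e.g., in meters) have been collected into a one-dimensional array; scikit-learn's GaussianMixture is used here purely for illustration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def estimate_location_threshold(pairwise_distances):
        # pairwise_distances: 1-D array of capture-location distances;
        # the mean of the lower Gaussian component is returned as the threshold m
        d = np.asarray(pairwise_distances, dtype=float).reshape(-1, 1)
        gmm = GaussianMixture(n_components=2, random_state=0).fit(d)
        return float(gmm.means_.min())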
4.   TEMPORAL SYNCHRONIZATION
Having identified potential links for at least some gallery pairs, we construct a weighted graph whose nodes represent the galleries and whose edges represent the links between galleries. The weight assigned to each edge is calculated as the sum of the similarities of the photos linking the two galleries. Using this graph, the temporal offset of each gallery is computed against the reference gallery.
We compute the temporal offset of each gallery by traversing the minimum spanning tree (MST) of the gallery graph. This procedure (MSTt) can be summarized as follows: starting from the node corresponding to the reference gallery, we select the edge with the highest weight. We compute the temporal offset of the node at the other end of this edge as the median of the capture-time differences of the pairs of similar photos that this edge represents. We add this node to the set of visited nodes. The selection of the edge with the highest weight is then repeated, considering any member of the set of visited nodes as a possible starting point, and the corresponding temporal offset is again computed, until all nodes have been visited. This process is explained in more detail in [1].
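The sketch below illustrates this offset propagation with a Prim-style traversal that always expands the highest-weight edge leaving the set of visited galleries; the data structures (a dictionary of edge weights and, per edge, the list of capture-time differences of its linked photo pairs) are hypothetical stand-ins for our actual implementation, which is detailed in [1].

    from statistics import median

    def propagate_offsets(edge_weights, time_diffs, reference):
        # edge_weights: {(u, v): weight}, one entry per undirected gallery link
        # time_diffs:   {(u, v): [capture-time differences t_v - t_u of linked pairs]}
        # reference:    id of the reference gallery (offset 0 by definition)
        offsets = {reference: 0.0}
        nodes = {n for edge in edge_weights for n in edge}
        while len(offsets) < len(nodes):
            frontier = [(w, u, v) for (u, v), w in edge_weights.items()
                        if (u in offsets) != (v in offsets)]
            if not frontier:
                break                     # remaining galleries are not linked to the rest
            _, u, v = max(frontier)       # highest-weight edge leaving the visited set
            diffs = time_diffs[(u, v)]
            if u in offsets:
                offsets[v] = offsets[u] + median(diffs)
            else:
                offsets[u] = offsets[v] - median(diffs)
        return offsets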
The MSTt method calculates the offsets using only the shortest path from a visited node to any given node. We also explore a variation of the MSTt process as an alternative way of computing temporal offsets (MSTx): before traversing the MST of the graph, we detect fully-connected triplets of nodes and average the offset obtained over the shortest path with the offset obtained over the alternative path in each triplet, but only if the difference of the two paths is lower than a maxDiff threshold. By utilizing in the MSTx process some additional information that the MSTt method ignores, we expect to achieve better accuracy in time synchronization.
5.   SUB-EVENT CLUSTERING
After time synchronization, we cluster all photos into sub-events. Two different approaches were adopted. In the first approach (MPC), we apply the following procedure: at the first stage, we split the photos' timeline wherever consecutive photos have a temporal distance above the mean of all temporal distances. At the second stage, geolocation information is used to further split the clusters of photos. During the third stage, clusters are merged using time and geolocation information. For the clusters that do not have geolocation information, the merging continues by considering visual similarity.
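A minimal sketch of the first MPC stage is given below, assuming the photos are represented simply by their synchronized capture timestamps (in seconds); the subsequent geolocation-based splitting and merging stages are omitted.

    def split_timeline(timestamps):
        # split a list of synchronized capture timestamps into clusters wherever
        # the gap between consecutive photos exceeds the mean gap
        ts = sorted(timestamps)
        gaps = [b - a for a, b in zip(ts, ts[1:])]
        if not gaps:
            return [ts]
        mean_gap = sum(gaps) / len(gaps)
        clusters, current = [], [ts[0]]
        for prev, t in zip(ts, ts[1:]):
            if t - prev > mean_gap:
                clusters.append(current)
                current = []
            current.append(t)
        clusters.append(current)
        return clusters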
In the second approach (APC), we augment the DCNN feature vectors with the normalized time information and cluster the media using Affinity Propagation.
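This step could be sketched as follows with scikit-learn's AffinityPropagation; the feature matrix, the timestamp vector and the weight given to the time dimension are placeholders rather than the exact configuration of our submitted runs.

    import numpy as np
    from sklearn.cluster import AffinityPropagation

    def cluster_apc(dcnn_features, timestamps, time_weight=1.0):
        # dcnn_features: (N, D) array of concept scores
        # timestamps:    (N,) array of synchronized capture times
        # returns an array of N sub-event labels
        t = np.asarray(timestamps, dtype=float)
        t = (t - t.min()) / (t.max() - t.min() + 1e-9)    # normalize time to [0, 1]
        X = np.hstack([np.asarray(dcnn_features), time_weight * t[:, None]])
        return AffinityPropagation(random_state=0).fit_predict(X)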
6.   RESULTS
We submitted 4 runs in total, combining the 2 methods for temporal synchronization with the 2 methods for sub-event clustering. The results of our approach for all datasets and all four runs are listed in Table 1. From the reported results, it is clear that our method achieved good accuracy but only managed to synchronize a small number of galleries, particularly in the TDF14 dataset. In sub-event clustering, the MPC method scored a slightly better F-score (column F1) for two of the datasets. The MSTt and MSTx methods performed the same because maxDiff was set too low (maxDiff = 10), which allowed only very small adjustments, thus degenerating the MSTx method to MSTt.

Table 1: Proposed method results.

Dataset   Run         Precision   Accuracy   F1
NAMM15    MSTt+APC    0.833       0.908      0.226
NAMM15    MSTt+MPC    0.833       0.908      0.348
NAMM15    MSTx+APC    0.833       0.908      0.226
NAMM15    MSTx+MPC    0.833       0.908      0.348
TDF14     MSTt+APC    0.125       0.845      0.113
TDF14     MSTt+MPC    0.125       0.845      0.001
TDF14     MSTx+APC    0.125       0.845      0.113
TDF14     MSTx+MPC    0.125       0.845      0.001
STS       MSTt+APC    0.424       1.000      0.123
STS       MSTt+MPC    0.424       1.000      0.164
STS       MSTx+APC    0.424       1.000      0.123
STS       MSTx+MPC    0.424       1.000      0.164

7.   CONCLUSIONS
In this paper, our framework and results for the MediaEval 2015 Synchronization of Multi-User Event Media Task were presented. Better fine-tuning of the algorithm parameters is required to achieve consistently good performance on diverse datasets. As future work, we are considering extending the algorithm towards automatic parameter selection (which could lead to selecting more links between galleries, thus improving precision), experimenting with different values of maxDiff, and applying a more sophisticated method for combining the different similarity measures.

8.   ACKNOWLEDGMENTS
This work was supported by the European Commission under contract FP7-600826 ForgetIT.
9.   REFERENCES
[1] K. Apostolidis and V. Mezaris. Using photo similarity
    and weighted graphs for the temporal synchronization
    of event-centered multi-user photo collections. In Proc.
    2nd Workshop on Human Centered Event
    Understanding from Multimedia (HuEvent’15) at ACM
    Multimedia (MM’15), Brisbane, Australia, Oct. 2015.
[2] N. Conci, F. De Natale, V. Mezaris, and M. Matton.
    Synchronization of Multi-User Event Media (SEM) at
    MediaEval 2015: Task Description, Datasets, and
    Evaluation. In Proc. MediaEval Workshop, Wurzen,
    Germany, Sept. 2015.
[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long,
    R. Girshick, S. Guadarrama, and T. Darrell. Caffe:
    Convolutional architecture for fast feature embedding.
    In Proc. ACM Int. Conf. on Multimedia, Nov. 2014.
[4] A. Oliva and A. Torralba. Modeling the shape of the
    scene: A holistic representation of the spatial envelope.
    International Journal of Computer Vision,
    42(3):145–175, 2001.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
    D. Anguelov, D. Erhan, V. Vanhoucke, and
    A. Rabinovich. Going deeper with convolutions. CoRR,
    abs/1409.4842, 2014.
[6] W. Zhou, H. Li, Y. Lu, and Q. Tian. SIFT match
    verification by geometric coding for large-scale
    partial-duplicate web image search. ACM Trans.
    Multimedia Comput. Commun. Appl., 9(1):4:1–4:18,
    Feb. 2013.