<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CERTH at MediaEval 2015 Synchronization of Multi-User Event Media Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konstantinos Apostolidis</string-name>
          <email>kapost@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasileios Mezaris</string-name>
          <email>bmezaris@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CERTH-ITI</institution>
          ,
          <addr-line>Thermi 57001</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the results of our participation in the Synchronization of Multi-User Event Media Task at the MediaEval 2015 challenge. Using multiple similarity measures, we identify pairs of similar media from different galleries. We use a graph-based approach to temporally synchronize user galleries; subsequently, we use time information, geolocation information and visual concept detection results to cluster all photos into different sub-events. Our method achieves good accuracy on considerably diverse datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        People attending large events collect dozens of photos and
video clips with their smartphones, tablets, and cameras. These
are later exchanged and shared in a number of different ways.
The alignment and presentation of the media galleries of
different users in a consistent way, so as to preserve the
temporal evolution of the event, is not straightforward, considering
that the time information attached to some of the captured
media may be wrong and geolocation information may be
missing. The 2015 MediaEval Synchronization of Multi-User
Event Media (SEM) task tackles exactly this problem [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHOD OVERVIEW</title>
      <p>The proposed method temporally aligns user galleries that
are created by different digital capture devices, and clusters
the time-aligned photos into event-related clusters. In the
first stage, we assess media similarity by combining
multiple similarity measures and by taking into account the
geolocation metadata of photos. Similar media of the
different galleries are identified and are used for constructing a
graph, whose nodes represent galleries and edges represent
the discovered similarities between media items of different
galleries. Synchronization of the galleries is achieved by
traversing the minimum spanning tree (MST) of the graph.
Finally, we apply clustering techniques to split the media
into different sub-events. Figure 1 illustrates the proposed
method.</p>
    </sec>
    <sec id="sec-3">
      <title>3. MEDIA SIMILARITY ASSESSMENT</title>
      <p>
        To identify similar photos of different galleries, we
combine the information of four similarity measures [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]: GC, S, CA, and DCS.
      </p>
      <p>We calculate the aforementioned similarity measures on
the photos of all galleries to be synchronized. We combine
the information of all similarity measures, using the
following procedure: initially, the similarity O(i, j) of photos i
and j is set equal to GC(i, j). Then, if S(i, j) &gt; ts and
S(i, j) &gt; GC(i, j), O(i, j) is updated as O(i, j) = S(i, j).
The same update process is subsequently repeated using
the CA similarity and the DCS similarity (and the respective tc
and td thresholds).</p>
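      <p>This combination rule lends itself to a compact implementation. The following is a minimal sketch, assuming each measure is available as a NumPy pairwise similarity matrix; the matrix and threshold names follow the text, while the function name is illustrative:</p>
      <preformat><![CDATA[
import numpy as np

# Combine the four similarity measures as described in the text: start
# from GC and overwrite with S, CA and DCS in turn, whenever the new
# measure exceeds both its own threshold and the current combined value.
def combine_similarities(GC, S, CA, DCS, ts, tc, td):
    O = GC.copy()
    for M, t in ((S, ts), (CA, tc), (DCS, td)):
        mask = (M > t) & (M > O)
        O[mask] = M[mask]
    return O
]]></preformat>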
      <p>Subsequently, we weight each similarity value so that the
similarity of photos whose capture locations are closer
than a threshold m is emphasized, while the similarity of
photos whose capture location distance lies significantly above
this threshold is zeroed. Similar photos that belong to
different user galleries are treated as potential links between
these galleries.</p>
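      <p>The exact weighting curve is not specified above; a simple function with the stated behavior (full weight below m, decaying to zero well above it) is the clipped linear ramp sketched below, where the cutoff factor is an assumed, illustrative choice:</p>
      <preformat><![CDATA[
import numpy as np

# Hedged sketch of the geolocation weighting: full weight for capture
# location distances up to m, linear decay to zero at cutoff_factor * m.
# The text only states the behavior at the two extremes.
def geo_weight(similarity, distance, m, cutoff_factor=3.0):
    ramp = (cutoff_factor * m - distance) / ((cutoff_factor - 1.0) * m)
    return similarity * np.clip(ramp, 0.0, 1.0)
]]></preformat>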
      <p>To identify similar audio files of different galleries, we
perform cross-correlation of the audio data, downsampled to an
11 kHz sampling rate. For video files, we select one frame for each
second of video and resize it to a width of one pixel. To identify
similar video files of different galleries, we perform cross-correlation
of the horizontally concatenated resized frames.</p>
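      <p>A minimal sketch of the audio case follows, using SciPy's cross-correlation to locate the best-matching lag; the video case is analogous, with the concatenated one-pixel-wide frame strips in place of the audio samples. Loading and downsampling of the signals is assumed to be done already:</p>
      <preformat><![CDATA[
import numpy as np
from scipy.signal import correlate

# Locate the lag that maximizes the cross-correlation of two signals
# sampled at `rate` Hz (11 kHz for the downsampled audio tracks).
def best_lag(signal_a, signal_b, rate=11000):
    xcorr = correlate(signal_a, signal_b, mode="full")
    lag_samples = int(np.argmax(xcorr)) - (len(signal_b) - 1)
    return lag_samples / rate  # temporal offset of b w.r.t. a, in seconds
]]></preformat>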
      <p>The ts, tc and td thresholds are empirically calculated
on the training dataset. The m threshold is calculated by
estimating a Gaussian mixture model of two Gaussian
distributions on the histogram of all photos' pairwise capture
location distances. The Gaussian distribution with the
lower mean (m) presumably signifies photos captured in the
same sub-event.</p>
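      <p>This threshold estimation could look as follows with scikit-learn, fitting the two-component mixture directly to the distance values rather than to an explicit histogram (a minimal sketch; distances are assumed to be in meters):</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a two-component Gaussian mixture to all pairwise capture location
# distances and take the mean of the lower-mean component as m.
def estimate_m(pairwise_distances):
    d = np.asarray(pairwise_distances, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(d)
    return float(gmm.means_.min())
]]></preformat>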
    </sec>
    <sec id="sec-4">
      <title>4. TEMPORAL SYNCHRONIZATION</title>
      <p>Having identified potential links for at least some gallery
pairs, we construct a weighted graph whose nodes represent
the galleries and whose edges represent the links between
galleries. The weight assigned to each edge is calculated as the
sum of similarities of the photos linking the two galleries.
Using this graph, the temporal offset of each gallery will be
computed against the reference gallery.</p>
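      <p>Construction of this gallery graph is straightforward; a minimal sketch with NetworkX follows, assuming the links discovered in the previous section have been grouped per gallery pair:</p>
      <preformat><![CDATA[
import networkx as nx

# Build the gallery graph: one node per gallery, and for every pair of
# galleries with linked photos an edge weighted by the sum of the link
# similarities. `links` maps (gallery_a, gallery_b) to the list of
# similarity values of the photo pairs connecting the two galleries.
def build_gallery_graph(num_galleries, links):
    g = nx.Graph()
    g.add_nodes_from(range(num_galleries))
    for (ga, gb), sims in links.items():
        g.add_edge(ga, gb, weight=sum(sims))
    return g
]]></preformat>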
      <p>
        We compute the temporal offset of each gallery by
traversing the minimum spanning tree (MST) of the galleries graph.
This procedure (MSTt) can be summarized as follows:
starting from the node corresponding to the reference gallery, we
select the edge with the highest weight. We compute the
temporal offset of the node on the other end of this edge as
the median of the capture time differences of the pairs of
similar photos that this edge represents. We add this node
to the set of visited nodes. The selection of the edge with
the highest weight is repeated, considering any member of
the set of visited nodes as a possible starting point, and the
corresponding temporal offset is again computed, until all
nodes are visited. This process is explained in more detail
in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
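      <p>A sketch of this traversal is given below, reusing the NetworkX graph from the previous snippet; the per-edge list of capture time differences ("tdiffs") is an assumed edge attribute, stored signed in the edge's (u, v) orientation:</p>
      <preformat><![CDATA[
import statistics

# MSTt-style traversal: starting from the reference gallery, repeatedly
# follow the highest-weight edge from any visited node to an unvisited
# one, and set the new gallery's offset from the median capture time
# difference of the photo pairs behind that edge.
def mst_traverse(graph, reference=0):
    offsets = {reference: 0.0}
    while len(offsets) < graph.number_of_nodes():
        frontier = [(d["weight"], u, v) for u, v, d in graph.edges(data=True)
                    if (u in offsets) != (v in offsets)]
        if not frontier:
            break  # remaining galleries cannot be synchronized
        _, u, v = max(frontier)
        tdiff = statistics.median(graph[u][v]["tdiffs"])  # signed u -> v
        if u in offsets:
            offsets[v] = offsets[u] + tdiff
        else:
            offsets[u] = offsets[v] - tdiff
    return offsets
]]></preformat>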
      <p>The MSTt method calculates the offsets using only the
shortest path from a visited node to any given node. We also
explore a variation of the MSTt process as an alternative way
of computing temporal offsets (MSTx): before traversing
the MST of the graph, we detect fully-connected triplets of
nodes and we average the offset of the shortest path with
that of the alternative path in each triplet, but only if the difference of
the two paths is lower than a maxDiff threshold. Since the
MSTx process utilizes some additional information that the
MSTt method ignores, we expect it to achieve better accuracy
in time synchronization.</p>
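      <p>A sketch of this triplet adjustment follows; the signed per-edge offsets (the medians used above) are assumed to be kept in a dictionary keyed by ordered gallery pair, with off[(b, a)] == -off[(a, b)]:</p>
      <preformat><![CDATA[
import networkx as nx

# MSTx-style refinement: for every fully-connected triplet {a, b, c},
# average the direct offset of edge (a, b) with the offset implied by the
# two-hop path a -> c -> b, provided the two differ by less than maxDiff.
def mstx_adjust(graph, off, max_diff=10.0):
    for a, b in list(graph.edges()):
        for c in nx.common_neighbors(graph, a, b):
            alternative = off[(a, c)] + off[(c, b)]
            if abs(off[(a, b)] - alternative) < max_diff:
                averaged = (off[(a, b)] + alternative) / 2.0
                off[(a, b)], off[(b, a)] = averaged, -averaged
    return off
]]></preformat>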
    </sec>
    <sec id="sec-5">
      <title>5. SUB-EVENT CLUSTERING</title>
      <p>After time synchronization, we cluster all photos into
sub-events. Two different approaches were adopted. In the first
approach (MPC), we apply the following procedure: in the
first stage, we split the photo timeline where consecutive
photos have a temporal distance above the mean of all
temporal distances. In the second stage, geolocation information
is used to further split clusters of photos. In the third
stage, clusters are merged using time and geolocation
information; for the clusters that do not have geolocation
information, the merging is continued by considering visual
similarity. In the second approach (APC), we augment the
DCNN feature vectors with the normalized time information
and cluster the media using Affinity Propagation.</p>
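      <p>The following sketch illustrates the two variants with scikit-learn: for MPC, only the first (temporal splitting) stage is shown, and for APC the clustering of time-augmented features; all function names are illustrative:</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.cluster import AffinityPropagation

# MPC, first stage only: start a new sub-event after every above-average
# gap in the sorted capture times (the geolocation-based splitting and
# the merging stages are omitted here).
def mpc_time_split(capture_times):
    t = np.sort(np.asarray(capture_times, dtype=float))
    if len(t) < 2:
        return np.zeros(len(t), dtype=int)
    gaps = np.diff(t)
    return np.concatenate(([0], np.cumsum(gaps > gaps.mean())))

# APC: append the normalized capture time to each photo's DCNN feature
# vector and cluster with Affinity Propagation.
def apc_cluster(features, capture_times):
    t = np.asarray(capture_times, dtype=float)
    t = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    augmented = np.hstack([features, t.reshape(-1, 1)])
    return AffinityPropagation(random_state=0).fit_predict(augmented)
]]></preformat>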
    </sec>
    <sec id="sec-6">
      <title>6. RESULTS</title>
      <p>We submitted four runs in total, combining the two methods for
temporal synchronization and the two methods for sub-event
clustering. The results of our approach for all datasets and
all four runs are listed in Table 1. From the reported results,
it is clear that our method achieved good accuracy but only
managed to synchronize a small number of galleries,
particularly in the TDF14 dataset. In sub-event clustering,
the MPC method scored a slightly better F-score (column
F1) for two of the datasets. The MSTt and MSTx
methods performed the same because maxDiff was set too low
(maxDiff = 10), which allowed only very small
adjustments, thus degenerating the MSTx method to MSTt.</p>
      <p>[Table 1: Results of all four runs on each dataset.]</p>
    </sec>
    <sec id="sec-7">
      <title>7. CONCLUSIONS</title>
      <p>In this paper, our framework and results at the MediaEval
2015 Synchronization of Multi-User Event Media Task were
presented. Better fine-tuning of the algorithm parameters is
required to achieve consistently good performance on diverse
datasets. As future work, we are considering extending
the algorithm with automatic parameter selection (which could
lead to selecting more links between galleries, thus improving
precision), experimenting with different values of maxDiff,
and applying a more sophisticated method to combine different
similarity measures.</p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was supported by the European Commission
under contract FP7-600826 ForgetIT.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Apostolidis</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          .
          <article-title>Using photo similarity and weighted graphs for the temporal synchronization of event-centered multi-user photo collections</article-title>
          .
          <source>In Proc. 2nd Workshop on Human Centered Event Understanding from Multimedia (HuEvent'15) at ACM Multimedia (MM'15)</source>
          , Brisbane, Australia, Oct.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Conci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>De Natale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Matton</surname>
          </string-name>
          .
          <article-title>Synchronization of Multi-User Event Media (SEM) at MediaEval 2015: Task Description, Datasets, and Evaluation</article-title>
          .
          <source>In Proc. MediaEval Workshop</source>
          , Wurzen, Germany, Sept.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karayev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guadarrama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          .
          <article-title>Caffe: Convolutional architecture for fast feature embedding</article-title>
          .
          <source>In Proc. ACM Int. Conf. on Multimedia</source>
          , Nov.
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          .
          <article-title>Modeling the shape of the scene: A holistic representation of the spatial envelope</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>42</volume>
          (
          <issue>3</issue>
          ):
          <fpage>145</fpage>
          -
          <lpage>175</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          .
          <article-title>Going deeper with convolutions</article-title>
          .
          <source>CoRR, abs/1409.4842</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          .
          <article-title>SIFT match verification by geometric coding for large-scale partial-duplicate web image search</article-title>
          .
          <source>ACM Trans. Multimedia Comput. Commun. Appl.</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>4:1</fpage>
          -
          <lpage>4:18</lpage>
          , Feb.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>