<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CERTH at MediaEval 2014 Synchronization of Multi-User Event Media Task</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Konstantinos Apostolidis, Christina Papagiannopoulou, Vasileios Mezaris Information Technologies Institute, CERTH</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper describes the results of the CERTH participation in the Synchronization of Multi-User Event Media Task of MediaEval 2014. We used a near-duplicate image detector to identify very similar photos, which allowed us to temporally align photo galleries; we then used time, geolocation and visual information, including the results of visual concept detection, to cluster all photos into different events.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        People attending large-scale social events collect dozens
of photos and video clips with their smartphones, tablets, and
cameras. These are later exchanged and shared in a
number of different ways. The alignment and presentation of
the photo galleries of different users in a consistent way, so
as to preserve the temporal evolution of the event, is not
straightforward, considering that the time information
attached to some of the captured media may be wrong (due
to different photo capturing devices not being synchronized)
and geolocation information may be missing. The 2014
MediaEval Synchronization of Multi-user Event Media (SEM)
task tackles this exact problem [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM OVERVIEW</title>
      <p>The main goal of our system is the time alignment of photo
galleries that are created by different digital photo capture
devices, and the clustering of these photos into event-related
clusters. In the first stage, similar photos of the different
galleries are identified and used for constructing a graph
whose nodes represent galleries and whose edges represent the
links discovered between them. Time alignment of the galleries
is achieved by traversing the graph. After that, we apply
clustering techniques in order to split our collection into
different events. Figure 1 shows the pipeline of our system.</p>
    </sec>
    <sec id="sec-3">
      <title>3. TIME SYNCHRONIZATION</title>
      <p>
        Time synchronization makes use of a Near Duplicate
Detector (NDD) that extracts SIFT descriptors from the
photos, forms a visual vocabulary and encodes the
descriptor-based representation of each photo using VLAD encoding.
The nearest neighbours that are returned for a query image
are refined by checking the geometrical consistency of SIFT
keypoints using geometric coding (GC) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
We further modified this NDD process to also use color
information (HSV histograms), so that near-duplicate
candidates that are very similar in color are not discarded even
if the GC score is relatively low.
      </p>
      <p>We apply the modified NDD on the union of all galleries.
We then filter out identified pairs of near duplicates
according to the following rules:</p>
      <list list-type="bullet">
        <list-item><p>Reject pairs when geolocation information is available
and the location distance of the two photos is greater
than a distance threshold.</p></list-item>
        <list-item><p>Reject pairs when the time difference between the
photos is above an extreme time threshold (which indicates
that this time difference is unlikely to be due to a time
synchronization error alone).</p></list-item>
      </list>
      <p>The remaining near-duplicate photos belonging to different
galleries are considered as links between those galleries.</p>
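      <p>The two rejection rules above can be sketched as a single predicate. This is a minimal illustration, not our actual implementation: the <monospace>Photo</monospace> record, the function names and both threshold values (<monospace>max_km</monospace>, <monospace>max_seconds</monospace>) are hypothetical, and the sketch assumes that "geolocation is available" means available for both photos of the pair.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt
from typing import Optional, Tuple


@dataclass
class Photo:
    gallery: str                                   # gallery identifier
    timestamp: float                               # capture time in seconds
    latlon: Optional[Tuple[float, float]] = None   # (lat, lon) in degrees, if known


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def keep_pair(p, q, max_km=1.0, max_seconds=30 * 24 * 3600):
    """Return True if a near-duplicate pair survives filtering and links two galleries.

    max_km and max_seconds are placeholder thresholds chosen for illustration only.
    """
    if p.gallery == q.gallery:
        return False  # only pairs across different galleries become links
    if p.latlon is not None and q.latlon is not None:
        if haversine_km(p.latlon, q.latlon) > max_km:
            return False  # rule 1: photos were taken too far apart
    if abs(p.timestamp - q.timestamp) > max_seconds:
        return False  # rule 2: gap too large to be a clock error alone
    return True
```
]]></preformat>
      <p>Pairs without geolocation on either side skip the distance check and are judged on the time rule only.</p>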
      <p>It is now straightforward to construct a graph whose nodes
represent the galleries and whose edges represent these links
between galleries. Each edge has a weight equal
to the number of links between the two galleries. Having
constructed the graph, we compute the time offset of each
gallery by traversing it, as follows. Starting from the node
corresponding to the reference gallery, we select the edge
with the highest weight. We compute the time offset of the
node on the other end of this edge as the median of the
time differences of the pairs of near-duplicate photos that
this edge represents, and add this node to the set of visited
nodes. The selection of the edge with the highest weight is
repeated, considering as possible starting point any member
of the set of visited nodes, and the corresponding time offset
is computed, until all nodes are visited. Alternatively, we
can traverse the graph and compute the nodes' time offsets
by simultaneously considering the weights of multiple edges.</p>
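      <p>The single-edge traversal described above can be sketched as follows. The function name <monospace>align_galleries</monospace> and the input layout are hypothetical. The sketch also makes one assumption explicit that the text leaves implicit: the offset of a newly visited gallery is obtained by adding the median pairwise time difference to the offset already computed for the visited gallery it is reached from, so that all offsets are expressed relative to the reference gallery.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from statistics import median


def align_galleries(reference, links):
    """Greedy traversal of the gallery graph.

    links maps a gallery pair (a, b) to the list of time differences
    t_a - t_b of the near-duplicate photo pairs linking a and b.
    Returns, for each reachable gallery, the offset to add to its
    timestamps to express them on the reference gallery's clock.
    """
    # Symmetric adjacency: adj[a][b] holds the differences t_a - t_b.
    adj = {}
    for (a, b), diffs in links.items():
        adj.setdefault(a, {})[b] = list(diffs)
        adj.setdefault(b, {})[a] = [-d for d in diffs]

    offsets = {reference: 0.0}  # visited nodes and their offsets
    while True:
        best = None  # (weight, visited node, unvisited node)
        for u in offsets:
            for v, diffs in adj.get(u, {}).items():
                if v not in offsets and (best is None or len(diffs) > best[0]):
                    best = (len(diffs), u, v)
        if best is None:
            break  # no edge leaves the visited set any more
        _, u, v = best
        # Median of the pairwise differences, chained onto u's offset.
        offsets[v] = offsets[u] + median(adj[u][v])
    return offsets
```
]]></preformat>
      <p>Galleries that share no link with the visited set are simply absent from the result in this sketch; ties between equally weighted edges are broken by iteration order.</p>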
    </sec>
    <sec id="sec-4">
      <title>4. MEDIA CLUSTERING OF EVENTS</title>
      <p>Following time synchronization, we cluster all photos into
events. Two different approaches are adopted: the first one
considers all photo galleries as a single photo collection,
exploiting the synchronization results, while the second one
first performs a pre-clustering within each gallery separately.</p>
      <p>
        In the first approach, we use the method of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], resulting in
clusters that are distinct in time and comprise different events.
Subsequently, each of these clusters is split based on the
geolocation information. The photos that do not have
geolocation information are assigned to the geo-cluster that
is most similar according to the color information (e.g. HSV
histogram).
      </p>
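      <p>The assignment of photos without geolocation to a geo-cluster can be illustrated with the sketch below. The text does not fix the similarity measure or the cluster representation, so two assumptions are made here purely for illustration: each geo-cluster is represented by the mean HSV histogram of its geotagged photos, and similarity is measured by histogram intersection. All function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def mean_histogram(hists):
    """Bin-wise mean of a non-empty list of equally sized histograms."""
    n = len(hists)
    return [sum(bins) / n for bins in zip(*hists)]


def intersection(h1, h2):
    """Histogram intersection: larger means more similar."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def assign_to_geo_cluster(photo_hist, clusters):
    """Pick the geo-cluster whose mean HSV histogram is closest to the photo.

    clusters maps a cluster id to the list of (normalized) HSV histograms
    of the geotagged photos already placed in that cluster.
    """
    return max(
        clusters,
        key=lambda cid: intersection(photo_hist, mean_histogram(clusters[cid])),
    )
```
]]></preformat>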
      <p>
        In the second approach, we detect time gaps between
events of each gallery. Specifically, we find the minimum
time difference of dissimilar photos which is greater than
the maximum time difference of the near-duplicate photos
(based on the similarity matrix of GC). The clusters that
are formed are merged according to time and geolocation
similarity. For the clusters that do not have geolocation
information, the merging is continued by considering the time
and low-level feature similarity or the time and the concept
detector (CD) confidence similarity scores [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
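      <p>One possible reading of the per-gallery time-gap detection is sketched below. It is an interpretation, not a specification: it assumes that time differences are taken between consecutive photos of a time-sorted gallery, that a boolean flag per consecutive pair (derived, for instance, from the GC similarity matrix) says whether the two photos are near duplicates, and that a new cluster starts at every dissimilar pair whose gap reaches the resulting threshold. The function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def gap_threshold(times, is_near_duplicate):
    """Smallest gap between dissimilar neighbours that exceeds every
    gap observed between near-duplicate neighbours.

    times: capture times of one gallery, sorted ascending.
    is_near_duplicate[i]: True if photos i and i + 1 are near duplicates.
    Returns None when no such gap exists.
    """
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    dup_gaps = [g for g, dup in zip(gaps, is_near_duplicate) if dup]
    max_dup_gap = max(dup_gaps, default=0.0)
    candidates = [
        g for g, dup in zip(gaps, is_near_duplicate) if not dup and g > max_dup_gap
    ]
    return min(candidates) if candidates else None


def split_gallery(times, is_near_duplicate):
    """Split one gallery into clusters of photo indices at large time gaps."""
    if not times:
        return []
    threshold = gap_threshold(times, is_near_duplicate)
    clusters = [[0]]
    for i in range(1, len(times)):
        gap = times[i] - times[i - 1]
        starts_new = (
            threshold is not None
            and gap >= threshold
            and not is_near_duplicate[i - 1]
        )
        if starts_new:
            clusters.append([i])
        else:
            clusters[-1].append(i)
    return clusters
```
]]></preformat>
      <p>The resulting per-gallery clusters would then be the input to the merging step, which is not covered by this sketch.</p>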
    </sec>
    <sec id="sec-5">
      <title>5. EXPERIMENTS AND RESULTS</title>
      <p>We submitted 5 runs in total, combining 3 methods for
time synchronization and 3 methods for event clustering:</p>
      <list list-type="bullet">
        <list-item><p>Run1: aNDD-perGallery-mergeCD. Compute gallery
time offsets using our modified NDD. CD scores are used
to merge clusters using the second approach of Section 4.</p></list-item>
        <list-item><p>Run2: aNDD-perGallery-mergeHSV. Compute gallery
time offsets using our modified NDD. HSV histogram
similarity is used to merge clusters using the second
approach of Section 4.</p></list-item>
        <list-item><p>Run3: aNDD-concat. Compute gallery time offsets
using our modified NDD. Clustering is performed using
the first approach of Section 4.</p></list-item>
        <list-item><p>Run4: aNDD-multiT-perGallery-mergeCD. Compute
gallery time offsets using our modified NDD and traversal
of the graph by simultaneously considering the weights
of multiple edges. CD scores are used to merge clusters
using the second approach of Section 4.</p></list-item>
        <list-item><p>Run5: NDD-perGallery-mergeCD. Compute gallery
time offsets using NDD without HSV color information.
CD scores are used to merge certain events using the
second approach of Section 4.</p></list-item>
      </list>
      <p>The results of our approach for all 5 runs on the Vancouver
testset and the London testset are listed in Tables 1 and 2,
respectively.</p>
    </sec>
    <sec id="sec-6">
      <title>6. CONCLUSIONS</title>
      <p>This paper presented our framework and results at the
MediaEval 2014 Synchronization of Multi-User Event Media
Task. Our modified NDD approach gives the best results in
time alignment for the Vancouver testset, while the
standard NDD yields a slightly better time synchronization for
the London testset. In sub-event clustering, the exploitation
of consistent timestamps in a gallery and the use of CD
confidence scores give good performance for the Vancouver
testset, whereas HSV histogram similarity seems to give the
best clustering results for the London testset.</p>
    </sec>
    <sec id="sec-7">
      <title>7. ACKNOWLEDGMENTS</title>
      <p>This work was supported by the EC under contracts
FP7-287911 LinkedTV and FP7-600826 ForgetIT.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Conci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>De Natale</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          .
          <article-title>Synchronization of Multi-User Event Media (SEM) at MediaEval 2014: Task Description, Datasets, and Evaluation</article-title>
          .
          <source>In Proc. MediaEval Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cooper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Foote</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Girgensohn</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          .
          <article-title>Temporal event clustering for digital photo collections</article-title>
          .
          <source>ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)</source>
          ,
          <volume>1</volume>
          (
          <issue>3</issue>
          ):
          <fpage>269</fpage>
          –
          <lpage>288</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Papagiannopoulou</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          .
          <article-title>Concept-based Image Clustering and Summarization of Event-related Image Collections</article-title>
          .
          <source>In Proc. Int. Workshop on Human Centered Event Understanding from Multimedia (HuEvent14) of ACM Multimedia (MM14)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          .
          <article-title>SIFT match verification by geometric coding for large-scale partial-duplicate web image search</article-title>
          .
          <source>ACM Trans. Multimedia Comput. Commun. Appl.</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>4:1</fpage>
          –
          <lpage>4:18</lpage>
          ,
          <month>Feb.</month>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>