<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>JRS at Synchronization of Multi-user Event Media Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hannes Fassold</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Stiegler</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felix Lee</string-name>
        </contrib>
        <aff>Austria; {firstname.lastname}@joanneum.at</aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>The event synchronisation task addresses the problem of temporally aligning media (i.e., photo and video) streams ("galleries") from different users and identifying coherent events in the streams. Our approach uses the visual similarity of image/key frame pairs based on full matching of SIFT descriptors with geometric verification. Based on the visual similarity and the given time information, a probabilistic algorithm is employed, where in each run a hypothesis is calculated for the set of time offsets with respect to the reference gallery. From the gathered hypotheses, the final set of time offsets is calculated as the medoid of all hypotheses.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>2.1</p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH ABSTRACT</title>
    </sec>
    <sec id="sec-3">
      <title>1. INTRODUCTION</title>
      <p>
The event synchronisation task addresses the problem of
temporally aligning media streams (referred to as galleries) from
different users and identifying coherent events in the
streams. This paper describes the work done by the JRS
team for the two subtasks of determining the time offsets of
galleries and clustering the images and videos into events.
Details on the task and the data set can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>Determining Gallery Offsets</title>
      <p>Our approach utilizes the visual information (the captured
images and the key frames extracted from the videos) and the
given time stamps in a probabilistic way. The absolute time
stamps are not considered reliable in this task; however,
their relative distances within the gallery of one user can
be exploited.</p>
      <p>
We denote the galleries as G_0, ..., G_M (with G_0 as the reference
gallery), each gallery G_k containing a set of images or key frames
I_1, ..., I_{N_k}. For every image, several thousand SIFT
descriptors [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] are extracted. A GPU-accelerated implementation
is used to speed up descriptor extraction and matching [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>For a pair of galleries (k, l), for each image I_i ∈ G_k its best
matching image I_j ∈ G_l is identified via exhaustive
matching of their respective SIFT descriptors. For each match
(I_i, I_j), a geometric verification step is applied, yielding a
variable number of homographies along with the number of
points h_t supporting the respective homography. The
visual similarity s_{i,j} for the image pair is calculated as
follows. First, all homographies whose support h_t falls below a
fixed threshold are discarded. From the remaining ones, the k
highest values h_t are selected. The selected values are clipped
to a range [ht_min, ht_max], and the arithmetic mean h_avg and
the sum h_sum of the clipped values are calculated. The visual
similarity s_{i,j} is retrieved as the geometric mean of h_avg
and h_sum.</p>
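      <p>As an illustration, the following minimal Python sketch computes this similarity from the homography support counts of one image pair. The support threshold, the number of retained homographies and the clipping range are assumed example values; the paper does not state the actual parameters.</p>
      <preformat>
import numpy as np

def visual_similarity(support_counts, min_support=10, top_k=5,
                      clip_range=(10, 200)):
    """Visual similarity s_ij of one image pair, derived from the
    numbers of supporting points h_t of its verified homographies.
    All parameter defaults are illustrative assumptions."""
    # Discard homographies whose support falls below the threshold.
    kept = [h for h in support_counts if h >= min_support]
    if not kept:
        return 0.0
    # Keep the top_k highest support values and clip them to a fixed range.
    top = sorted(kept, reverse=True)[:top_k]
    clipped = np.clip(top, clip_range[0], clip_range[1])
    h_avg = clipped.mean()
    h_sum = clipped.sum()
    # The similarity is the geometric mean of average and sum.
    return float(np.sqrt(h_avg * h_sum))
      </preformat>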
      <p>Our general approach is a probabilistic method, where a
significant number of potential solutions (hypotheses) are
calculated, and from these hypotheses the 'most-inner' one (in a
sense which will be explained later) is taken as the final
solution. Such a probabilistic approach is more robust against
outliers in the data. As a preprocessing step, we calculate
a connection magnitude c_{k,l} for each gallery pair (k, l) in
order to steer the random picking of gallery pairs
towards the more 'stable' gallery pairs (e.g., the gallery pairs
with a high number of matches and a low deviation of the
time difference values between the matches). The connection
magnitude is calculated as the geometric mean of three
quantities: the number of identified matches (based on visual
similarity) between the galleries, the average visual similarity
score of the matches, and the reciprocal of the average deviation
of the time differences between the matches.</p>
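      <p>A sketch of the connection magnitude under these definitions is given below. Interpreting the 'average deviation' as the standard deviation of the time differences is our assumption, as is the epsilon guarding against division by zero.</p>
      <preformat>
import numpy as np

def connection_magnitude(similarities, time_diffs, eps=1e-6):
    """Connection magnitude c_kl of one gallery pair (k, l):
    the geometric mean of the match count, the mean visual
    similarity and the reciprocal time-difference deviation."""
    n_matches = len(similarities)
    if n_matches == 0:
        return 0.0
    mean_sim = np.mean(similarities)
    # Deviation of the per-match time differences (assumed: std).
    deviation = np.std(time_diffs) + eps
    return float((n_matches * mean_sim / deviation) ** (1.0 / 3.0))
      </preformat>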
      <p>One potential solution is a vector of time differences D′ =
(Δ_1, ..., Δ_M) between the M galleries and the reference gallery
G_0. For generating one potential solution D′, we proceed
as follows. First, a random gallery pair (k, l) is picked.
The probability of picking a pair (k, l) is proportional to
its connection magnitude c_{k,l}; thereby we steer the
random picking towards more stable gallery pairs. In order to
probabilistically calculate the time difference Δ_{k,l} between
the two galleries, we first apply k-means clustering on the
time difference values of all matches, where k is typically
in the range 3 to 5. Then, we randomly pick one of the
cluster centers and set it as Δ_{k,l}. Having calculated Δ_{k,l}, we
can propagate this value recursively and calculate unknown
values Δ_{k′,l} by making use of the relation
Δ_{k,l} = Δ_{k,k′} + Δ_{k′,l},    (1)
which is very easy to show. By iterating this process of
randomly selecting a gallery pair, followed by calculating
Δ_{k,l}, M − 1 times, we retrieve one potential solution D′.</p>
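      <p>A possible implementation of the hypothesis generation is sketched below. The restriction to pairs linking an already solved gallery with an unsolved one, and the sign convention of the propagation, are our assumptions; the paper only outlines the procedure.</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans

def generate_hypothesis(num_galleries, pair_time_diffs, magnitudes,
                        n_clusters=4):
    """Generate one potential solution D' = (Delta_1, ..., Delta_M).

    pair_time_diffs maps a gallery pair (k, l) to the list of
    per-match time differences; magnitudes maps (k, l) to c_kl."""
    offsets = {0: 0.0}  # The reference gallery G_0 has offset zero.
    pairs = list(pair_time_diffs)
    weights = np.array([magnitudes[p] for p in pairs], dtype=float)
    while len(offsets) != num_galleries:
        # Candidate pairs connect a solved gallery with an unsolved one.
        cand = [i for i, (k, l) in enumerate(pairs)
                if (k in offsets) != (l in offsets)]
        if not cand:
            break  # Some gallery is not connected by any match.
        # Pick a pair with probability proportional to c_kl.
        p = weights[cand] / weights[cand].sum()
        k, l = pairs[np.random.choice(cand, p=p)]
        # Cluster the time differences and pick a random cluster center.
        diffs = np.asarray(pair_time_diffs[(k, l)]).reshape(-1, 1)
        n = min(n_clusters, len(diffs))
        centers = KMeans(n_clusters=n, n_init=10).fit(diffs).cluster_centers_
        delta_kl = float(centers[np.random.randint(n)])
        # Propagate via relation (1): Delta_{k,l} = Delta_{k,k'} + Delta_{k',l}.
        if k in offsets:
            offsets[l] = offsets[k] + delta_kl
        else:
            offsets[k] = offsets[l] - delta_kl
    return [offsets.get(m) for m in range(1, num_galleries)]
      </preformat>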
      <p>In order to calculate the final solution D, we generate a set
of several thousand potential solutions D′ (each being a
vector of time differences) in the way described above. From
the potential solutions, we determine the final solution D by
calculating the medoid of all potential solutions. In a certain
sense, this is the 'most-inner' solution, when interpreting the
potential solutions as vectors.</p>
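      <p>A minimal sketch of the medoid selection follows; using the Euclidean distance between hypothesis vectors is our assumption, as the paper does not name the metric.</p>
      <preformat>
import numpy as np

def medoid(hypotheses):
    """Return the hypothesis with the smallest total distance to all
    other hypotheses, i.e. the 'most-inner' vector of the set."""
    X = np.asarray(hypotheses, dtype=float)
    total = np.empty(len(X))
    for i, x in enumerate(X):
        # Sum of Euclidean distances from hypothesis i to all others.
        total[i] = np.linalg.norm(X - x, axis=1).sum()
    return X[total.argmin()]
      </preformat>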
      <p>[Figure 1: precision, recall and F-score of the subevent clustering; Figure 2: precision and accuracy of the synchronisation; each for the TDF14 and NAMM15 data sets.]</p>
    </sec>
    <sec id="sec-5">
      <title>2.2 Clustering Events</title>
      <p>For the event clustering, we rely solely on the time
information. We correct the time stamps of a specific gallery
with the offset calculated for that gallery with respect to the
reference gallery. Based on the corrected time information, a
one-dimensional k-means clustering algorithm is applied, where
k ranges between 30 and 100. The value is determined
based on the size of the data set (the total number of images
in all galleries) and a user parameter which specifies the
desired granularity of the subevents.</p>
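      <p>The sketch below illustrates this clustering step. The linear mapping from data set size to k via the granularity parameter is an assumed example; the paper only states that k ranges between 30 and 100.</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans

def cluster_events(timestamps, gallery_ids, offsets, granularity=0.01):
    """Cluster images into subevents by their corrected capture time.

    timestamps and gallery_ids are per-image; offsets holds the
    calculated time offset of each gallery w.r.t. G_0."""
    corrected = np.array([t + offsets[g]
                          for t, g in zip(timestamps, gallery_ids)])
    # k grows with the data set size, clamped to the range in the paper.
    k = int(np.clip(granularity * len(corrected), 30, 100))
    k = min(k, len(corrected))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(
        corrected.reshape(-1, 1))
    return labels
      </preformat>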
    </sec>
    <sec id="sec-6">
      <title>3. EXPERIMENTS AND RESULTS</title>
      <p>We submitted two runs, which use the same parameters
for determining the time offsets. The clustering differs:
k for run 2 is double the value of run 1, so run
2 corresponds to a finer granularity of the subevents
compared to run 1. Unfortunately, the official submissions only
contained the results for the still image data sets (Tour de
France, NAMM), but not for the videos.</p>
      <p>Figure 2 shows the results for synchronisation. For both
data sets, accuracy is clearly higher than precision. This
means that our approach tends to optimise for a globally
lower synchronisation error at the cost of higher
individual errors for some galleries. While precision is significantly
lower for NAMM than for TDF, accuracy has actually
increased. One reason for this may be the fact that local
features of bikes and bikers match quite well across many
images (which can be seen from the high visual similarity
values s_{i,j} for these images); thus visual matching provides
a weaker constraint than on visually more diverse data.</p>
      <p>The results for subevent clustering are shown in Figure 1.
One interesting observation is that while the F1 score is on a
comparable level for both data sets, precision and recall are
quite balanced for NAMM, but biased towards higher
precision for TDF. Interestingly, the variation of the parameter
between the two runs does not change this behaviour. For
both parameterisations the method tends to oversegment
the TDF data. It seems that the impact of synchronisation
errors on the clustering result is limited, as no direct relation
is apparent from the results.</p>
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSION</title>
      <p>The proposed method performs quite well in minimising
the overall synchronisation error, but at the expense of more
galleries that exceed the error threshold. For the subevent
clustering, a better automatic adaptation of the number of
clusters to the data set is needed, in order to avoid
oversegmentation such as on the TDF data.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The research leading to these results has received funding
from the European Union's Seventh Framework Programme
(FP7/2007-2013) under grant agreement no. 610370, "ICoSOLE
– Immersive Coverage of Spatially Outspread Live Events"
(http://www.icosole.eu/).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Conci</surname>
          </string-name>
          , Francesco De Natale, Vasileios Mezaris, and
          <string-name>
            <given-names>Mike</given-names>
            <surname>Matton</surname>
          </string-name>
          .
          <article-title>Synchronization of Multi-User Event Media at MediaEval 2015: Task Description, Datasets, and Evaluation</article-title>
          . In MediaEval 2015 Workshop, Wurzen, Germany, September 14-15
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Hannes</given-names>
            <surname>Fassold</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jakub</given-names>
            <surname>Rosner</surname>
          </string-name>
          .
          <article-title>A real-time GPU implementation of the SIFT algorithm for large-scale video analysis tasks</article-title>
          . In
          <source>Real-Time Image and Video Processing</source>
          , San Francisco, CA, USA,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lowe</surname>
          </string-name>
          .
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>60</volume>
          (
          <issue>2</issue>
          ):
          <fpage>91</fpage>
          –
          <lpage>110</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>