<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UPC at MediaEval 2013 Social Event Detection Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Manchon-Vizuete</string-name>
          <email>dmanchon@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xavier Giro-i-Nieto</string-name>
          <email>xavier.giro@upc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pixable</institution>
          ,
          <addr-line>New York</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politecnica de Catalunya</institution>
          ,
          <addr-line>Barcelona, Catalonia</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>These working notes present the contribution of the UPC team to the Social Event Detection (SED) task in MediaEval 2013. The proposal extends the previous PhotoTOC work in the domain of shared collections of photographs stored in cloud services. An initial over-segmentation of the photo collection is later refined by merging pairs of similar clusters.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        These working notes describe the algorithms tested by the
UPC team in the MediaEval 2013 Social Event Detection
(SED) task. The reader is referred to the task description [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
for further details about the case study, dataset and metrics.
Our team participated only in Task 1, where all images were
to be clustered into events.
      </p>
      <p>The proposed approach aims at a computationally light
solution capable of dealing with large amounts of data. This
requirement is especially relevant when dealing not only with
large amounts of data, but also with a large number of users.
The SED task describes a dataset with photos from different
users, so that the events to be detected affect several users.
This setup suggests a computational solution running on
a centralised and shared service in the cloud, in contrast to
other scenarios where each user's data can be processed on the
client side. Any computation in the cloud typically implies
an economic cost on the server which, in many cases, is not
directly charged to the user, but assumed by the
intermediate photo management service. For this reason, it is a high
priority that any solution involves only light computations,
thereby discarding any pixel-related operation that would
require decoding and processing the images.</p>
      <p>
        In addition, the SED task presents an inherent challenge
due to the incompleteness of the photo metadata. The
provided dataset contains real photos with missing or
corrupted information, such as non-geolocated images or
identical time stamps for the moments when the photo was taken
and uploaded. These situations are especially common
when dealing with online services managing photos, which
present heterogeneous upload sources and, in many cases,
remove the EXIF metadata of the photos. These drawbacks
have been partially managed in the proposed solution, which
combines the diversity of metadata sources (time stamps,
geolocation and textual labels) in this challenging context.
In our approach, no external data is used, so all submitted
runs belong to the required type (as specified in the SED
overview paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]).
      </p>
      <p>These working notes are structured as follows. Section 2
describes the existing PhotoTOC system, which has been
adopted to produce an initial over-segmentation of the dataset.
Section 3 then presents how the over-segmented clusters are merged
considering different metadata sources. The performance of
the solution is assessed in Section 4 with the results
obtained on the MediaEval SED 2013 task. Finally, Section 5
provides the insights learned and points at future research
directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        The adopted solution is inspired by earlier work from
Microsoft Research [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] named PhotoTOC (Photo Table of
Contents). In that design, photos are first sorted
by their creation time stamp and then sequentially
clustered by estimating the location of event
boundaries. A new event boundary is created whenever the time
gap (g<sub>i</sub>) between two consecutive photos is much larger than
the average time difference within a temporal window around it.
In particular, a new event is created whenever the criterion
shown in Equation 1 is satisfied:
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math>\log(g_N) \geq K + \frac{1}{2d+1} \sum_{i=-d}^{d} \log(g_{N+i})</tex-math>
        </disp-formula>
where PhotoTOC empirically set the configuration
parameters to d = 10 and K = log(17).</p>
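      <p>The boundary criterion of Equation 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the submission: the function name, the truncation of the window at both ends of the sequence, and the clamping of gaps to one second are assumptions introduced here.</p>
      <preformat>
```python
import math


def find_event_boundaries(timestamps, K=math.log(17), d=10):
    """Return every index N such that a new event starts after photo N.

    timestamps: capture times in seconds, sorted in increasing order.
    Gap g_i is the time elapsed between photo i and photo i + 1.
    """
    # Clamp gaps to one second so that log() is defined for identical
    # time stamps (an assumption of this sketch).
    gaps = [max(b - a, 1.0) for a, b in zip(timestamps, timestamps[1:])]
    log_gaps = [math.log(g) for g in gaps]
    boundaries = []
    for n in range(len(log_gaps)):
        # Window of up to 2d + 1 gaps centred on n, truncated at the ends.
        lo = max(0, n - d)
        hi = min(len(log_gaps), n + d + 1)
        window = log_gaps[lo:hi]
        local_mean = sum(window) / len(window)
        if log_gaps[n] >= K + local_mean:
            boundaries.append(n)
    return boundaries
```
      </preformat>
      <p>For instance, five photos taken one minute apart, followed a day later by five more, would yield a single boundary after the fifth photo.</p>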
      <p>When the creation time is missing in the EXIF metadata,
PhotoTOC uses the file creation time. Whenever a
cluster contains more than 23 photos, the event is considered too large and
is split based on color features. This content-based
clustering algorithm generates a number of sub-clusters equal to 1/12 of
the number of photographs in the large cluster.</p>
      <p>The main drawback of the PhotoTOC approach is the need
for image processing to estimate the content-based
similarity. In our solution, the visual modality is discarded and
replaced by geolocation and textual labels as
additional information to the creation time. In addition, in the
SED task the images come from different users and are therefore taken
with different cameras and from different points of view, all of which
makes visual analysis less reliable. There is also no guarantee
that the empirically set values proposed in PhotoTOC
would be useful for another dataset, nor is it clear from the
paper how they were estimated.</p>
      <p>Two solutions have been tested in our submission, both
starting from the time-based
clustering proposed by PhotoTOC. In both solutions,
the initial time-based clusters are compared based on their
associated geolocation, textual labels and user IDs. The first
solution relies on manually tuned weights for each criterion,
while the second introduces an estimation
of the relevance of each feature type.</p>
    </sec>
    <sec id="sec-3">
      <title>3.1 User and time-based over-segmentation</title>
      <p>
        The first step in the proposed solution considers the
photos of each user separately. The time-based clustering
algorithm proposed by PhotoTOC is applied to each user independently, with the
configuration parameters K and d optimised on the training dataset
provided by MediaEval. The obtained values were K =
log(150) and d = 40, which clearly differ from the ones
proposed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. During this first stage, those images whose
Date taken matches their Date uploaded are not processed,
as their time stamp is considered corrupted.
      </p>
      <p>As a result, an over-segmentation into mini-clusters is
obtained. Each of them is characterised by its averaged time,
averaged geolocation, aggregated set of textual labels and
associated user ID. These are the features used in the
subsequent stages to assess the similarity between the mini-clusters.</p>
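      <p>The mini-cluster description above could be represented as follows. This is a sketch under stated assumptions: the class and field names are hypothetical, and the geolocation is averaged with a plain arithmetic mean over the photos that carry coordinates, which the text does not specify.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Photo:
    user_id: str
    taken: float                          # capture time in seconds
    geo: Optional[Tuple[float, float]]    # (lat, lon) in degrees, or None
    labels: Set[str]


@dataclass
class MiniCluster:
    user_id: str
    mean_time: float
    mean_geo: Optional[Tuple[float, float]]
    labels: Set[str]


def summarise(photos: List[Photo]) -> MiniCluster:
    """Summarise the photos of one user that fall in one time segment."""
    located = [p.geo for p in photos if p.geo is not None]
    mean_geo = None
    if located:
        # Plain arithmetic mean; adequate only for nearby points.
        mean_geo = (sum(g[0] for g in located) / len(located),
                    sum(g[1] for g in located) / len(located))
    labels = set()
    for p in photos:
        labels |= p.labels
    mean_time = sum(p.taken for p in photos) / len(photos)
    return MiniCluster(photos[0].user_id, mean_time, mean_geo, labels)
```
      </preformat>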
    </sec>
    <sec id="sec-4">
      <title>3.2 Cluster merges</title>
      <p>The set of time-sorted clusters is analysed sequentially
in increasing time order. Each cluster is compared with the
following 15 clusters, a window set to avoid excessive
computational time. Two clusters are merged whenever a
similarity measure is above an estimated threshold. The
submitted runs have considered two options for assessing
this similarity: a first one that takes a binary decision
on each criterion and fuses these decisions with manually set weights, and a second
one where each individual similarity measure is normalised
and later fused with a learned weight.</p>
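      <p>The windowed comparison can be illustrated with the sketch below. The text does not state how chains of merges are resolved, so this example assumes that merges are transitive and tracks them with a small union-find structure; the function and parameter names are hypothetical.</p>
      <preformat>
```python
def merge_clusters(clusters, similarity, threshold, window=15):
    """Group time-sorted clusters and return one group label per cluster.

    similarity: callable taking two clusters and returning a score.
    """
    parent = list(range(len(clusters)))

    def find(i):
        # Follow parent links up to the representative of the group of i.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(clusters)):
        # Only the next `window` clusters in time order are candidates.
        for j in range(i + 1, min(i + 1 + window, len(clusters))):
            if similarity(clusters[i], clusters[j]) > threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(clusters))]
```
      </preformat>
      <p>Limiting each cluster to a fixed number of later candidates keeps the number of comparisons linear in the number of clusters, at the cost of never merging clusters that are further apart in the time ordering.</p>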
      <p>Method 1: Binary decisions and manual weights.
This method compares each pair of clusters separately and
takes a binary decision for each criterion. The geolocation
coordinates are compared with the Haversine distance, the
textual label sets with the Jaccard index and the user IDs
with a simple equality test. The three binary decisions are
linearly fused with weights of 0.2 for geolocation,
0.2 for text and 0.4 for user ID. Two clusters are merged if
the fused combination exceeds 0.3.</p>
      <p>The binary decision for each criterion is based on a
specific similarity threshold learned after optimisation on the
training dataset. This process has assumed independence
between the different features, so each of them has been
treated separately.</p>
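      <p>A minimal sketch of Method 1 follows. The weights (0.2, 0.2, 0.4) and the merging threshold (0.3) are the ones stated in the text; the per-criterion thresholds MAX_GEO_KM and MIN_JACCARD are hypothetical placeholders, since the learned values are not reported, and the function names are introduced here for illustration only.</p>
      <preformat>
```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def jaccard(a, b):
    """Jaccard index of two label sets, taken as 0 when both are empty."""
    union = a | b
    return len(a.intersection(b)) / len(union) if union else 0.0


# Hypothetical per-criterion thresholds; the paper learns them on the
# training data and does not report their values.
MAX_GEO_KM = 1.0
MIN_JACCARD = 0.2


def method1_score(geo_a, geo_b, labels_a, labels_b, user_a, user_b):
    geo_ok = (geo_a is not None and geo_b is not None
              and MAX_GEO_KM >= haversine_km(geo_a, geo_b))
    text_ok = jaccard(labels_a, labels_b) >= MIN_JACCARD
    user_ok = user_a == user_b
    # Weights from the text: 0.2 geolocation, 0.2 text, 0.4 user ID.
    return 0.2 * geo_ok + 0.2 * text_ok + 0.4 * user_ok


def method1_merge(*args):
    # Two clusters are merged if the fused score exceeds 0.3.
    return method1_score(*args) > 0.3
```
      </preformat>
      <p>With these weights, a shared user ID alone (0.4) is enough to trigger a merge, whereas clusters from different users need both the geolocation and the text criteria to agree (0.2 + 0.2).</p>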
      <p>Method 2: Weighted fusion of normalised distances.
This second solution emerged from the need for a more refined
algorithm to combine the different metadata features. In
this case the individual binary decisions are replaced by a single
fused similarity value.</p>
      <p>This fusion requires a normalization of the distance values
based on the provided training data. This normalization
was based on the distances computed between
3,000 random pairs of photos selected from the training set
and belonging to the same event. The estimated mean and
standard deviation were used to compute the value of the phi function,
which maps the z-score to a value between 0.0 and
1.0.</p>
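      <p>The normalization step might look like the sketch below. It assumes that the phi function is the cumulative distribution function of the standard normal, which is one common reading of a z-score mapping onto the unit interval but is not confirmed by the text; the use of the sample standard deviation and the function names are also assumptions.</p>
      <preformat>
```python
import math
import statistics


def fit_normaliser(distances):
    """Return a function mapping a raw distance to a value in (0, 1).

    distances: sample of distances between same-event pairs of photos;
    it must contain at least two distinct values.
    """
    mean = statistics.mean(distances)
    std = statistics.stdev(distances)

    def phi(distance):
        z = (distance - mean) / std
        # Standard normal CDF written with the error function.
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    return phi
```
      </preformat>
      <p>A distance equal to the training mean maps to 0.5, and distances one standard deviation above or below it map to roughly 0.84 and 0.16.</p>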
      <p>After normalization, it is still necessary to estimate the
weight of each modality to be applied later in the linear
fusion. These weights were estimated according to the
individual gain of each type of feature studied in Method 1.
Results shown in Table 1 indicate that the most important
reason for the fusion of two clusters is that both of them
belong to the same user ID, while geolocation and textual
labelling have similar relevance. These experimental values
validate the empirical proposal adopted in Method 1.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Weights estimated for each feature type, for clusters with and without geolocation.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Feature</th><th>Geolocated</th><th>Not geolocated</th></tr>
          </thead>
          <tbody>
            <tr><td>Time</td><td>0.06</td><td>0.08</td></tr>
            <tr><td>Geo</td><td>0.28</td><td>-</td></tr>
            <tr><td>Label</td><td>0.22</td><td>0.30</td></tr>
            <tr><td>User</td><td>0.44</td><td>0.60</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Finally, the training dataset was used again to estimate
the merging threshold for this fused score. The experiments
indicated a maximum F1-score for thresholds between 0.3 and
0.6, from which a final threshold of 0.5 was adopted.</p>
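      <p>The linear fusion of Method 2 can be sketched as follows, using the weights listed in Table 1 and the final threshold of 0.5. The text does not detail how each normalised distance is turned into a similarity, so this sketch assumes that the inputs are already similarities in the range 0 to 1; the dictionary keys and function names are hypothetical.</p>
      <preformat>
```python
# Weights as listed in Table 1 of the text.
WEIGHTS_GEO = {"time": 0.06, "geo": 0.28, "label": 0.22, "user": 0.44}
WEIGHTS_NO_GEO = {"time": 0.08, "label": 0.30, "user": 0.60}
MERGE_THRESHOLD = 0.5


def fused_similarity(sims):
    """Linear fusion of normalised similarities in the range 0 to 1.

    sims: dict with keys "time", "label", "user" and, when both clusters
    are geolocated, "geo".
    """
    weights = WEIGHTS_GEO if "geo" in sims else WEIGHTS_NO_GEO
    return sum(w * sims[name] for name, w in weights.items())


def should_merge(sims):
    return fused_similarity(sims) > MERGE_THRESHOLD
```
      </preformat>
      <p>Under these weights, a shared user ID alone exceeds the threshold when geolocation is unavailable (0.60), but not when both clusters are geolocated (0.44), in which case some additional agreement in time, location or labels is needed.</p>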
    </sec>
    <sec id="sec-5">
      <title>4. EXPERIMENTS AND RESULTS</title>
      <p>The UPC team participated in Challenge 1 with the results
shown in Table 2. The more optimised Method 2
corresponds to Run 1, while Runs 2 and 3 correspond to Method 1
optimised with respect to F1 and NMI, respectively.
As expected, the values obtained for Method 2 outperform
the two runs associated with Method 1.</p>
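      <table-wrap id="tab-2">
        <label>Table 2</label>
        <caption>
          <p>Results obtained by the three submitted runs.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Metric</th><th>Method 1 (F1)</th><th>Method 1 (NMI)</th><th>Method 2</th></tr>
          </thead>
          <tbody>
            <tr><td>F1</td><td>0.8798</td><td>0.8753</td><td>0.8833</td></tr>
            <tr><td>NMI</td><td>0.9720</td><td>0.9710</td><td>0.9731</td></tr>
            <tr><td>Divergence F1</td><td>0.8268</td><td>0.8220</td><td>0.8316</td></tr>
          </tbody>
        </table>
      </table-wrap>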
    </sec>
    <sec id="sec-6">
      <title>5. CONCLUSIONS</title>
      <p>The presented technique allows a fast
clustering of photos based only on numerical and
textual metadata. The obtained results seem reasonable to
assist real users in the organisation of shared collections of
photographs. However, the authors consider that the presented
work may still benefit from an optimised set of similarity
thresholds adapted to the type of event.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Platt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Czerwinski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Field</surname>
          </string-name>
          .
          <article-title>PhotoTOC: automatic clustering for browsing personal photographs</article-title>
          .
          <source>In Proc. 4th Pacific Rim Conference on Multimedia</source>
          , vol.
          <volume>1</volume>
          , pp.
          <fpage>6</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Reuter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>de Vries</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Geva</surname>
          </string-name>
          .
          <article-title>Social Event Detection at MediaEval 2013: Challenges, datasets, and evaluation</article-title>
          . In
          <source>MediaEval 2013 Workshop</source>
          , Barcelona, Spain, October 18-19,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>