<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marina Riga</string-name>
          <email>mriga@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgios Petkos</string-name>
          <email>gpetkos@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Papadopoulos</string-name>
          <email>papadop@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emmanouil Schinas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yiannis Kompatsiaris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Technologies Institute / CERTH</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper describes the participation of CERTH in the Social Event Detection Task of MediaEval 2014. For Challenge 1, we use a "same event model" to construct a graph on which we perform community detection to obtain the final clustering. Importantly, we tune the model to have a higher true positive rate than true negative rate, leading to significantly improved performance. The F1 score and NMI for our best run are 0.9161 and 0.9818, respectively. For Challenge 2, we developed probabilistic language models to classify events according to the criteria of the different queries. Our best run on Challenge 2 achieved an average F-score of 0.4604.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The paper presents the approaches developed by CERTH
for the two Challenges of the MediaEval 2014 Social Event
Detection (SED) task. Challenge 1 asks for a full clustering
of a collection of Flickr images, so that each cluster
corresponds to a social event. Challenge 2 examines a retrieval
scenario in which, given a set of social events, the goal is to
determine those events that match particular criteria. More
details about the task can be found in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. PROPOSED APPROACH</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Overview of method in Challenge 1</title>
      <p>
        Our approach for Challenge 1 utilizes what is termed the
Same Event Model (SEM) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The SEM takes as input the
set of per modality similarities between two items and
predicts how likely it is that these two items belong to the same
event or not. Subsequently, a graph is constructed, in which
the nodes represent the images to be clustered and the
existence of an edge between a pair of nodes denotes the positive
prediction of the SEM for the two respective images. Finally,
a community detection algorithm is performed on the graph
to obtain a full clustering. Moreover, in order to limit the
number of evaluations of the SEM and make the approach
scalable, we deploy a candidate neighbour selection step:
for each image we utilize appropriate indices in order to
obtain the most similar images according to each modality and
evaluate the SEM only for them. This is a technique that
is commonly referred to as blocking. This overall approach
is similar to that of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and that which we deployed in last
year's task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Importantly though, we introduce a tweak
which improves performance significantly. The key idea is
that false positive and false negative predictions of the SEM
are not equally important. More specifically, the average
size of an event in the training set is roughly 20 images.
In practice though, the set of candidate neighbours needs
to be considerably larger than this average. For instance, in
our experiments we used at most 500 candidate
neighbours. The primary reasons for this are that a) the
distribution of the sizes of the events is much wider and b) in
large datasets one needs to consider a larger number of
candidate neighbours in order to have higher confidence that
the actual neighbours of some image appear in the set of
candidate neighbours. Therefore, since the number of
candidate neighbours will be much larger than the number of
actual neighbours, and assuming that the classifier has been
trained to achieve similar true positive and true negative
rates, we can expect that the SEM will produce significantly
more false positive predictions than false
negative predictions. Too many false positive predictions are
likely to result in many merged clusters, as they
create too many incorrect edges in the graph. If on the other
hand we opt for a higher true positive rate at the cost of
a lower true negative rate (by increasing the classification
threshold), we will have far fewer incorrectly merged
clusters, but will also have some fragmented clusters. The way
to deal with this problem is to increase the set of candidate
neighbours. In our experiments, we observed that when
increasing the threshold so that the true positive rate is 0.9999,
the true negative rate does not drop below 0.95, which in
practice appears sufficient for our purpose.
      </p>
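      <p>
        As a rough illustration, the following Python sketch outlines
the graph construction and clustering steps described above. It is
a minimal sketch, not our exact implementation: same_event_prob and
candidate_neighbours are assumed stand-ins for the trained SEM and
the blocking index, and connected components stand in for the
community detection step.
      </p>
      <preformat>
# Sketch of the Challenge 1 pipeline: build a graph whose edges are
# positive SEM predictions, then cluster it.
import networkx as nx

THRESHOLD = 0.995  # raised classification threshold (see Run 3)

def build_event_graph(images, candidate_neighbours, same_event_prob):
    """Add an edge only where the SEM predicts 'same event'."""
    g = nx.Graph()
    g.add_nodes_from(img.id for img in images)
    for img in images:
        for cand in candidate_neighbours(img):  # blocking step
            if same_event_prob(img, cand) >= THRESHOLD:
                g.add_edge(img.id, cand.id)
    return g

def cluster_events(g):
    # Stand-in for community detection: each connected component is
    # taken to be one social event cluster.
    return [set(c) for c in nx.connected_components(g)]
      </preformat>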
    </sec>
    <sec id="sec-4">
      <title>2.2 Overview of method in Challenge 2</title>
      <p>
        In Challenge 2, we utilize regularized unigram language
models [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to classify clusters (or images in Run 5, as will
be explained later) according to the given retrieval criteria
(location, type of event, entities involved). For learning the
language models for the event types and entities of
interest we collected sets of images from Flickr using the
relevant keywords that appear in the queries. Moreover, we
retrieved an additional random collection of images, in
order to learn a general language model that does not focus
on any particular event type or entity, against which the
type- or entity-specific language models are compared. For
some cluster (or image) i, the comparison is performed by
computing the ratio of the probability given by the specific
language model, p<sub>specific</sub>(i), over the probability given by the
general language model, p<sub>general</sub>(i); if the ratio is above some
threshold, then we mark the event (or image) as
matching the examined criterion. In a second variation we utilize
a language model that has been trained on both the type- and
entity-specific datasets and the general dataset, and
compute the ratio p<sub>specific,general</sub>(i)/p<sub>general</sub>(i). For inferring
location we adopted the per grid-cell language model based
approach of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It should be noted though that for clusters
that contain geotagged images, we do not use the language
models, but rather use the explicit coordinates to estimate
the location.
      </p>
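      <p>
        The ratio test can be sketched as follows, assuming unigram
models represented as word-count dictionaries with additive
smoothing; the function names and the smoothing constant are
illustrative assumptions rather than our exact setup.
      </p>
      <preformat>
import math

def log_prob(tokens, counts, vocab_size, alpha=1.0):
    # Unigram log-likelihood with additive (Laplace) smoothing;
    # counts maps each word to its frequency in the training set.
    total = sum(counts.values())
    return sum(
        math.log((counts.get(t, 0) + alpha) / (total + alpha * vocab_size))
        for t in tokens
    )

def matches_criterion(tokens, specific, general, vocab_size, threshold=1.0):
    # Ratio test p_specific(i) / p_general(i) above a threshold,
    # computed in log space for numerical stability.
    log_ratio = (log_prob(tokens, specific, vocab_size)
                 - log_prob(tokens, general, vocab_size))
    return log_ratio > math.log(threshold)
      </preformat>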
    </sec>
    <sec id="sec-5">
      <title>3. EXPERIMENTS</title>
    </sec>
    <sec id="sec-6">
      <title>3.1 Runs description in Challenge 1</title>
      <p>
        In all runs of Challenge 1 we utilized an SVM classifier to
learn the SEM. The following features were used to compute
the input to the SEM for a pair of images: user (1 if both
images have been uploaded by the same user, 0 otherwise),
textual (title, tags and description, similarity computed
using BM25 and cosine), taken and upload time, spatial (if
available) and visual information (SURF descriptors
aggregated using a VLAD scheme [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] as well as features extracted
using Overfeat [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a popular convolutional net, similarity for
both is computed using Euclidean distance). In Run 1 we
apply our basic approach, without using any visual features
and we take the predictions of the SEM as they are, i.e. we
do not change the classification threshold. In Run 2 we only
add the visual features. In Run 3 we use the probabilities
that are provided by the SVM classifier and set the
threshold to 0.995, achieving the true positive and true negative
rates that were mentioned earlier. In Run 4 we attempt to
improve the results by increasing the set of candidate
neighbours: after the graph has been constructed by predicting
the SEM output for each image's candidate neighbours, we
add to the candidate neighbours of each image the
neighbours of its actual neighbours and predict the output of the
SEM for them as well. In Run 5 we do not use blocking and
compute the output of the SEM for all pairs of images.
      </p>
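      <p>
        For illustration, the assembly of the pairwise input to the SEM
might look as follows; the attribute names and the caller-supplied
similarity functions are assumptions of the sketch, not our exact
feature extraction code.
      </p>
      <preformat>
import numpy as np

MISSING = -1.0  # placeholder when a modality (e.g. geo) is unavailable

def pair_features(a, b, text_sim, geo_dist):
    # Per-modality similarities for one image pair, fed to the
    # SVM-based SEM. text_sim and geo_dist are caller-supplied
    # (e.g. BM25/cosine over text, haversine over coordinates).
    geo = geo_dist(a, b) if a.coords and b.coords else MISSING
    return np.array([
        1.0 if a.user == b.user else 0.0,         # same uploader
        text_sim(a, b),                           # title, tags, description
        abs(a.taken_time - b.taken_time),         # capture time difference
        abs(a.upload_time - b.upload_time),       # upload time difference
        geo,                                      # spatial, if available
        np.linalg.norm(a.vlad - b.vlad),          # SURF + VLAD, Euclidean
        np.linalg.norm(a.overfeat - b.overfeat),  # OverFeat, Euclidean
    ])
      </preformat>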
    </sec>
    <sec id="sec-7">
      <title>3.2 Runs description in Challenge 2</title>
      <p>In Run 1 of Challenge 2 we perform the classification by
computing the ratio p<sub>specific</sub>(i)/p<sub>general</sub>(i) and setting the
threshold to 1. In Run 2, we perform the classification by
computing the ratio p<sub>specific,general</sub>(i)/p<sub>general</sub>(i) and again
setting the threshold to 1. In Run 3 and Run 4 we use the
models of Run 2 and Run 1 respectively, but with different
threshold values per query. Each threshold is selected
according to the evaluation results of the methodology on the
corresponding development queries. For queries Test-9 and
Test-10, for which there are no analogous development queries,
we used the maximum threshold from the other queries. In
Runs 1 to 4 we perform classification per event, that is, we
aggregate all images of an event and then perform the
classification. In Run 5, on the other hand, we perform
classification per item and then aggregate by majority
vote. Also, in Run 5, the same language models
and threshold values as in Run 3 have been used.</p>
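      <p>
        The per-item aggregation of Run 5 amounts to a simple majority
vote over the per-image decisions, as in the following sketch (the
function name is ours):
      </p>
      <preformat>
from collections import Counter

def event_matches(per_image_decisions):
    # Run 5 aggregation: classify each image of an event separately,
    # then decide for the event by majority vote over the booleans.
    votes = Counter(per_image_decisions)
    return votes[True] > votes[False]
      </preformat>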
    </sec>
    <sec id="sec-8">
      <title>4. RESULTS AND DISCUSSION</title>
    </sec>
    <sec id="sec-9">
      <title>4.1 Challenge 1</title>
      <p>Table 1 shows the scores we achieved in Challenge 1. The
main thing to note is that Runs 3, 4 and 5, which use the
modified classification threshold, achieve significantly better
results; the F1 score and NMI for our best run are 0.9161 and
0.9818, respectively.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGMENTS</title>
      <p>The work was supported by the European Commission
under contract FP7-287975 SocialSensor.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Martin</surname>
          </string-name>
          .
          <source>Speech and Language Processing</source>
          . Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Petkos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <article-title>Social event detection using multimodal clustering and integrating supervisory signals</article-title>
          .
          <source>In Proceedings of ICMR</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Petkos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <article-title>Social event detection at MediaEval 2014: Challenges, datasets, and evaluation</article-title>
          .
          <source>In Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          .
          <article-title>CEA list's participation at MediaEval 2013 Placing Task</article-title>
          .
          <source>In Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Reuter</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          .
          <article-title>Event-based classification of social media streams</article-title>
          .
          <source>In Proceedings of ICMR</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schinas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mantziou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Petkos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <article-title>CERTH @ MediaEval 2013 Social Event Detection Task</article-title>
          .
          <source>In Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eigen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          .
          <article-title>OverFeat: Integrated recognition, localization and detection using convolutional networks</article-title>
          .
          <source>CoRR, abs/1312.6229</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Spyromitros-Xioufis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .
          <article-title>An empirical study on the combination of SURF features with VLAD vectors for image search</article-title>
          .
          <source>In Proceedings of WIAMIS</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>