<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CERTH @ MediaEval 2013 Social Event Detection Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emmanouil Schinas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eleni Mantziou</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Papadopoulos</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgios Petkos</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yiannis Kompatsiaris</string-name>
          <email>ikomg@iti.gr</email>
        </contrib>
        <aff>CERTH-ITI, Thessaloniki, Greece</aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This paper describes the participation of CERTH in the Social Event Detection Task of MediaEval 2013. For Challenge 1, we used the concept of the same event model to construct a graph of items, in which dense sub-graphs correspond to event clusters. The F1 score and NMI for our best run are 0.7041 and 0.9131, respectively. For Challenge 2, we used an efficient manifold learning method to assign images to specific event types. Our best run for Challenge 2 achieves an F1 score of 0.3344 (0.7163 for the event/non-event case).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The paper presents the approaches devised by CERTH
for the Challenges of the MediaEval 2013 Social Event
Detection (SED) task. Challenge 1 calls for the detection of
social events in a set of Flickr images. Challenge 2 calls
for the classification of images into a set of event types (incl.
non-event). Reuter et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] describe the task in detail.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. PROPOSED APPROACH</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Overview of Method in Challenge 1</title>
      <p>
        The approach to tackle Challenge 1 is based on a learned
similarity metric [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which in the following we call the Same
Event Model (SEM). A SEM takes as input the set of per-modality
dissimilarities between two images, and produces
a prediction value in the range [0,1] that indicates the
probability that the two images belong to the same event. The SEM is
used to construct a graph of same-event relationships: pairs
of images for which the output of the SEM is above some
threshold are connected. Eventually, the nodes of the graph are
clustered using an efficient community detection algorithm,
the Structural Clustering Algorithm for Networks (SCAN)
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], to obtain a set of candidate events. To avoid
evaluating the output of the SEM for each pair of images, we
introduce a candidate neighbour selection step, as in Reuter
and Cimiano [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]: for each item in the collection, we find
its nearest neighbours in each modality and only compare it
to them. A schematic of our approach is shown in Fig. 1.
It is very similar to that of Reuter and Cimiano [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], with
the difference that they use a learned similarity metric that
works on image-event pairs (where the event is represented
as the centroid of the images assigned to it), whereas our
SEM operates directly on image-image pairs.
      </p>
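      <p>The pipeline above can be sketched as follows. The scoring function is a toy placeholder for the trained SEM, and connected components stand in for the SCAN clustering used in the paper; names and data structures are illustrative, not the actual implementation.</p>
      <preformat>
```python
# Sketch of the Challenge 1 pipeline: score candidate pairs with a
# stand-in SEM, keep edges above a threshold, then cluster the graph.
# Connected components (union-find) replace SCAN for brevity.

def same_event_score(dissims):
    # Placeholder SEM: in the paper this is a trained classifier over
    # per-modality dissimilarities; here, a toy linear rule.
    return 1.0 - sum(dissims) / len(dissims)

def cluster_items(items, candidate_pairs, threshold=0.5):
    parent = {i: i for i in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, dissims in candidate_pairs:
        if same_event_score(dissims) > threshold:
            parent[find(a)] = find(b)  # merge the two event clusters

    clusters = {}
    for i in items:
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Toy collection: items 0 and 1 look alike; item 2 is dissimilar to both.
pairs = [(0, 1, [0.1, 0.2]), (1, 2, [0.9, 0.8]), (0, 2, [0.95, 0.85])]
print(cluster_items([0, 1, 2], pairs))  # [[0, 1], [2]]
```
      </preformat>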
    </sec>
    <sec id="sec-4">
      <title>2.2 Overview of Method in Challenge 2</title>
      <p>
        For Challenge 2, we made use of SMaL, a Scalable Manifold
Learning framework [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] that is based on Semi-Supervised
Learning (SSL) and constructs Laplacian Eigenmaps (LEs)
approximately. The main bottleneck in SSL is building an n &#215; n
similarity matrix between labelled and unlabelled images,
which is time-consuming for large datasets. SMaL
tackles this problem by estimating a smaller covariance matrix,
under the hypothesis that the data x<sub>i</sub> &#8712; &#8477;<sup>d</sup> are samples
from a distribution p(x). Rotating the data to be as
independent as possible, s = Rx, allows a histogram of B
bins per dimension, using only the marginal distributions, to approximate
the density p(s) of the rotated data. Then, we define
eigenfunctions g from these histograms, which can be seen as
approximations of the LEs as n &#8594; &#8734; [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. An interpolation
step to the target dimension k follows and, in the end, the
n &#215; k approximate LE vectors are derived. In the final step,
a linear classifier is trained using the approximate vectors
of the labelled items as input. In our implementation, we
opted for a linear Support Vector Machine (SVM).
      </p>
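      <p>For intuition, the following minimal sketch computes exact Laplacian Eigenmaps on a tiny dataset, i.e. the object that SMaL approximates at scale; the kernel width and data are illustrative, not the paper's settings.</p>
      <preformat>
```python
import numpy as np

# Exact Laplacian Eigenmaps on a toy dataset, illustrating the n x n
# similarity-matrix computation that SMaL avoids for large n.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                 # n=8 samples in R^3

# Gaussian affinity matrix W and unnormalized graph Laplacian L = D - W.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / 2.0)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(1)) - W

# The k smallest non-trivial eigenvectors give the k-dim embedding;
# the first eigenvector (eigenvalue 0) is constant and is skipped.
vals, vecs = np.linalg.eigh(L)
k = 2
embedding = vecs[:, 1:k + 1]
print(embedding.shape)  # (8, 2)
```
      </preformat>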
    </sec>
    <sec id="sec-5">
      <title>3. EXPERIMENTS</title>
    </sec>
    <sec id="sec-6">
      <title>3.1 Runs Description in Challenge 1</title>
      <p>
        For the first challenge we experimented with a variety
of classifiers to build SEMs. We obtained the best results
using SVMs, but Decision Trees produced comparable
results. The inputs for the SEM are the following 11 features:
user (1 if both images have been uploaded by the same user,
0 otherwise), textual (title, tags and description, similarity
computed using BM25 and cosine), taken and upload time,
spatial (if it exists) and visual information (SURF descriptors
aggregated using a VLAD scheme [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], similarity computed
using Euclidean distance). We used and evaluated several
combinations of the input features. We achieved the best
results using textual, temporal, spatial and user features.
More specifically, the corresponding trained SEM, which was
used for all submitted runs, achieved 99% accuracy on the
classification of positive pairs and 99.7% on negative pairs. By
using all available features, we get slightly worse results
(98.75% for positive and 99.7% for negative examples). Note
that even with only the user feature the trained model
achieves 95% and 99% for positive and negative examples
respectively. The SEM for the final submission was trained
on the full data of the training set. Regarding the retrieval
of the candidate neighbours of each item, the 150 nearest
neighbours with respect to the textual features were
considered, 200 with respect to time, 50 with respect to location
(when it exists), and 100 with respect to visual features. For the construction
of the items graph we experimented with various values of
the threshold W affecting the insertion of edges between items.
By increasing the threshold, we get a sparser graph, but
with a negative impact on the produced results. We have
found that a good trade-off between the computational cost
and the efficiency of the method is a value of 0.5. For this
reason, we used this value in Runs 1, 2, 3 and 4. In
Run 5, we pruned the graph by using W = 0.9. In a
post-processing step we attempt to merge hubs and outliers into
the set of event clusters. Regarding the hubs, we attempt to
merge each of them with the community with which it
shares the most edges, under the condition that this number is
greater than a predefined threshold R. In Runs 1, 2 and 5
we used a threshold of 5 links. In Run 3 we make the
assignment easier by requiring only 2 links, in contrast with Run 4,
where more than 12 links are required. The remaining hubs
form single-item events. Regarding outliers, we can either
follow the same approach as for hubs (Run 2) or consider
them all as single-item events (Runs 1, 3, 4 and 5).
      </p>
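      <p>The hub-merging post-processing step can be sketched as follows, assuming a hypothetical representation in which each hub's edges are recorded by the community they point into; the function and variable names are illustrative.</p>
      <preformat>
```python
from collections import Counter

# Sketch of the hub post-processing step: a hub joins the community it
# shares the most edges with, provided that count exceeds a threshold R;
# otherwise it forms a single-item event.

def merge_hubs(hubs, hub_edges, communities, R=5):
    # hub_edges maps hub -> list of community ids its edges point into
    singletons = []
    for hub in hubs:
        counts = Counter(hub_edges.get(hub, []))
        if counts:
            best, n_links = counts.most_common(1)[0]
            if n_links > R:
                communities[best].append(hub)
                continue
        singletons.append([hub])  # remaining hubs become single-item events
    return communities, singletons

comms = {0: ["a", "b"], 1: ["c"]}
edges = {"h1": [0] * 6, "h2": [1] * 2}   # h1: 6 links to community 0
merged, singles = merge_hubs(["h1", "h2"], edges, comms, R=5)
print(merged)   # {0: ['a', 'b', 'h1'], 1: ['c']}
print(singles)  # [['h2']]
```
      </preformat>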
    </sec>
    <sec id="sec-7">
      <title>3.2 Runs Description in Challenge 2</title>
      <p>Different runs were produced with respect to the employed
features and classification strategy:</p>
      <p>Features: In Run 1, we considered the 1000 most
frequent tags, and then applied pLSA using 200
latent topics. In Run 2, we used dense sampling for
selecting the keypoints; SIFT features were extracted and
aggregated with VLAD, using a codebook of k=64 visual words
learned with the k-means algorithm. The final vectors were
power- (a=0.5) and L2-normalized and then PCA-reduced to
512-dimensional vectors. The same features were also used
in Run 5. In Runs 3 and 4, we used both textual and
visual features and fused the corresponding LE vectors.</p>
      <p>Classification strategy: We used two variants: (a)
max score (Runs 1, 4 and 5), in which we select the
concept for which the detector produced the highest
prediction score, and (b) if-max score (Runs 2 and 3), in
which we assign the image to the highest-scoring
concept only if that score is higher than a threshold TE;
otherwise, the image is assigned to the non-event class. TE was
empirically determined by averaging the maximum
prediction scores for the images of a validation set (about 30%
of the training set).</p>
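      <p>A minimal sketch of the two classification strategies, with illustrative scores and a hypothetical threshold estimator following the averaging rule described above:</p>
      <preformat>
```python
# max score always assigns the top-scoring event type; if-max falls back
# to the non-event class when the top score does not exceed T_E.

def max_score(scores):
    return max(scores, key=scores.get)

def if_max_score(scores, t_e):
    best = max(scores, key=scores.get)
    return best if scores[best] > t_e else "non-event"

def estimate_threshold(validation_max_scores):
    # T_E: average of the maximum prediction scores on a validation set.
    return sum(validation_max_scores) / len(validation_max_scores)

scores = {"concert": 0.42, "sports": 0.18, "theater": 0.25}
t_e = estimate_threshold([0.6, 0.5, 0.55])   # T_E = 0.55
print(max_score(scores))                      # concert
print(if_max_score(scores, t_e))              # non-event
```
      </preformat>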
    </sec>
    <sec id="sec-8">
      <title>4. RESULTS AND DISCUSSION</title>
      <p>From Table 1, we note that Run 5 leads to the worst
results. This indicates that pruning the edges of the graph
has a negative impact on the event detection accuracy. The
other results are very similar. This leads us to
conclude that our method is insensitive to the different
parameters involved. For example, Runs 1, 3 and 4 differ regarding
the threshold R (5, 2 and 12, respectively) that controls the
merging of hubs into adjacent communities. Although this
step improves the final results, the threshold value does not
seem to affect them significantly.</p>
      <p>For Challenge 2, Run 1 has the lowest performance (F1 =
0.1105), which indicates that visual features are more useful
than textual ones. However, their fusion performs best (F1 =
0.3344), revealing a complementary role. We also note that
for concepts that are related to each other (e.g. concert and
theater), visual features do not perform well. Moreover, the
threshold TE used in Runs 2 and 3 is prone to producing many
false negatives.</p>
      <p>In the future, for Challenge 1 we plan to explore different
graph construction strategies and community detection
algorithms that consider the edge weights. We also plan to
use more sophisticated approaches for merging or splitting
candidate events based on temporal and spatial information.
For Challenge 2, we intend to explore different features to
further improve the event type detection accuracy.</p>
    </sec>
    <sec id="sec-9">
      <title>5. ACKNOWLEDGEMENTS</title>
      <p>This work is supported by the SocialSensor FP7 project,
partially funded by the EC under contract number 287975.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          .
          <article-title>Semi-supervised learning in gigantic image collections</article-title>
          .
          <source>In NIPS</source>
          , pages
          <fpage>522</fpage>
          &#8211;
          <lpage>530</lpage>
          .
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Mantziou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <article-title>Large-scale Semi-Supervised Learning by Approximate Laplacian Eigenmaps, VLAD and Pyramids</article-title>
          . In WIAMIS,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Petkos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <article-title>Social event detection using multimodal clustering and integrating supervisory signals</article-title>
          .
          <source>In ICMR'12, page 23. ACM</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Reuter</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          .
          <article-title>Event-based classi cation of social media streams</article-title>
          .
          <source>In ICMR'12, page 22. ACM</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Reuter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          , C. de Vries, and
          <string-name>
            <given-names>S.</given-names>
            <surname>Geva</surname>
          </string-name>
          .
          <article-title>Social Event Detection at MediaEval 2013: Challenges, datasets, and evaluation</article-title>
          . In MediaEval Workshop, Barcelona, Spain, Oct 18-19
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Spyromitros-Xioufis</surname>
          </string-name>
          , S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas.
          <article-title>An empirical study on the combination of SURF features with VLAD vectors for image search</article-title>
          .
          <source>WIAMIS</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yuruk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. A. J.</given-names>
            <surname>Schweiger</surname>
          </string-name>
          .
          <article-title>Scan: a structural clustering algorithm for networks</article-title>
          .
          <source>KDD '07</source>
          , pages
          <fpage>824</fpage>
          &#8211;
          <lpage>833</lpage>
          , New York, NY, USA,
          <year>2007</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>