<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ADMRG @ MediaEval 2013 Social Event Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Taufik Sutanto</string-name>
          <email>taufikedy.sutanto@connect.qut.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Electrical Engineering and Computer Science, Queensland University of Technology Brisbane</institution>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Electrical Engineering and Computer Science, Queensland University of Technology Brisbane</institution>
          ,
          <addr-line>Australia</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This paper describes the approach used by the Applied Data Mining Research Group (ADMRG) for the Social Event Detection (SED) tasks of the 2013 MediaEval Benchmark. We participated in the semi-supervised clustering task as well as the social event classification task. A constrained clustering algorithm is used in the semi-supervised clustering task, while several machine learning classifiers with Latent Dirichlet Allocation as a feature selector are used in the event classification task. Results for the first task show the effectiveness of the proposed method; results for task 2 indicate that attention to the imbalanced category distributions is needed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The Social Event Detection (SED) task at the 2013 MediaEval
Benchmark for Multimedia Evaluation consists of two challenges:
(1) semi-supervised clustering; and (2) classification of social
events [4]. The dataset consists of image metadata from Flickr and
Instagram, including text, time, and spatial information. The SED
task is to group social event images according to the given initial
labels and to classify them into one of the given event categories
(music, conference, exhibition, fashion, protest, sport, theatrical,
other event, or non-event). We participated in both tasks, but our
efforts were concentrated on the semi-supervised clustering task.</p>
      <p>The training data for the first task contains about 14,000
initial clusters. This task poses several challenges: (1) the number
of initial clusters is large; (2) the events in the test data may be
grouped under these cluster labels or form new clusters, as stated
in [4]; and (3) the clusters vary greatly in size: about 2,000
clusters contain just a single member, while some clusters contain
more than 900 members. We adopted the constrained clustering
algorithm of [2], handling large clusters more efficiently through
document ranking and a customized similarity measure over text, time,
and space. Memory usage was kept low by using a semi-incremental
algorithm and by combining in-database and in-memory processing. The
experimental results show the efficacy of our proposed method.</p>
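      <p>As an illustration of such a measure, the sketch below combines a cosine distance over term weights with squashed time and haversine-based spatial distances; the weights and decay constants are illustrative assumptions, not the values used in our runs.</p>

```python
import math

def text_distance(doc_terms, cluster_terms):
    """Cosine distance between two term-weight dictionaries."""
    shared = set(doc_terms).intersection(cluster_terms)
    dot = sum(doc_terms[t] * cluster_terms[t] for t in shared)
    na = math.sqrt(sum(w * w for w in doc_terms.values()))
    nb = math.sqrt(sum(w * w for w in cluster_terms.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def geo_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def multi_domain_distance(doc, cluster, w_text=0.6, w_time=0.2, w_geo=0.2):
    """Weighted combination of text, time, and spatial distances,
    each squashed into the range 0 to 1 (illustrative weights)."""
    d_text = text_distance(doc["terms"], cluster["terms"])
    # A one-day gap maps to roughly 0.6 on the squashed scale.
    d_time = 1.0 - math.exp(-abs(doc["time"] - cluster["time"]) / 86400.0)
    d_geo = 1.0 - math.exp(-geo_distance_km(doc["lat"], doc["lon"],
                                            cluster["lat"], cluster["lon"]) / 100.0)
    return w_text * d_text + w_time * d_time + w_geo * d_geo
```

      <p>Two records with identical metadata get distance 0; records that differ in every domain approach the maximal combined distance.</p>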
      <p>In the second task, we applied feature reduction using Latent
Dirichlet Allocation (LDA) and trained several traditional and more
recent machine learning classifiers, including an ensemble of the
classifiers formed through a consensus function. Results from this
task were severely affected by the imbalanced category distribution
within the training and test datasets.</p>
      <sec id="sec-1-0">
        <title>2. THE PROPOSED APPROACH</title>
        <sec id="sec-1-0-1">
          <title>2.1 Preprocessing</title>
          <p>All of the features in the SED data were used in the
analysis, except the uniform resource locators of the images. The
structure of the data in task 1 and task 2 is similar, except that
the task 2 data does not contain the date_upload and description
attributes.</p>
        </sec>
      </sec>
      <p>The terms of the documents within a cluster were combined as if
they formed a single document: a term's weight in a cluster is the
average weight of that term over the cluster's documents. Document
information from each cluster was then indexed and stored efficiently
in real time using the in-memory delta index of the Sphinx search
engine [1]. When calculating the similarity measure in each
iteration, documents were retrieved incrementally from the database
and the final distances were stored back in the database. Transitions
of documents between clusters were recorded, and centroids were
recalculated only with regard to these changes. This approach is
efficient in memory usage and computation, even when full-text
features are used. An illustration of our approach is given in
Figure 1.</p>
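      <p>A rough sketch of this cluster representation, assuming term weights stored as dictionaries: the centroid is the average weight of each term over the cluster's members, and recorded transitions allow incremental updates instead of full recomputation.</p>

```python
class Cluster:
    """A cluster represented by the average term weights of its members."""

    def __init__(self):
        self.size = 0
        self.sums = {}  # term: weight summed over member documents

    def centroid(self):
        """Average term weight over the cluster's members."""
        if self.size == 0:
            return {}
        return {t: s / self.size for t, s in self.sums.items()}

    def add(self, doc_terms):
        """A document joins the cluster: update only the affected sums."""
        self.size += 1
        for t, w in doc_terms.items():
            self.sums[t] = self.sums.get(t, 0.0) + w

    def remove(self, doc_terms):
        """A document leaves the cluster (a recorded transition)."""
        self.size -= 1
        for t, w in doc_terms.items():
            self.sums[t] = self.sums[t] - w
```

      <p>Moving a document between two such clusters touches only the terms of that document, which is what makes the recalculation cheap.</p>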
      <p>[Figure 1. The proposed approach: initial cluster centers are
set based on the labelled SED 2013 training data; records d are
incrementally retrieved from the test data; the k nearest clusters to
d are chosen using cluster-document ranking; the multi-domain distance
between d and the k nearest clusters is calculated; and d is clustered
based on these distances, forming a new cluster when the distance
exceeds the threshold.]</p>
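      <p>The flow above can be sketched as follows; the ranking step is approximated here by a shared-term score, distance_fn stands in for the multi-domain measure, and k and the new-cluster threshold gamma mirror our run settings. Everything else is an illustrative assumption.</p>

```python
def rank_clusters(doc_terms, clusters, k):
    """Cheap ranking proxy: order clusters by shared-term weight with d."""
    def score(c):
        return sum(c["terms"].get(t, 0.0) for t in doc_terms)
    ranked = sorted(range(len(clusters)), key=lambda i: score(clusters[i]),
                    reverse=True)
    return ranked[:k]

def assign(records, clusters, distance_fn, k=5, gamma=0.3):
    """Assign each record to its nearest candidate cluster, or open a
    new cluster when even the nearest one is farther than gamma."""
    for doc in records:
        candidates = rank_clusters(doc["terms"], clusters, k)
        if candidates:
            best = min(candidates, key=lambda i: distance_fn(doc, clusters[i]))
            if distance_fn(doc, clusters[best]) > gamma:
                # Too far from every candidate: the record starts a new event.
                clusters.append({"terms": dict(doc["terms"])})
            else:
                clusters[best].setdefault("members", []).append(doc)
        else:
            clusters.append({"terms": dict(doc["terms"])})
    return clusters
```

      <p>Ranking first and computing the full distance only against the k shortlisted clusters is what keeps the loop tractable with roughly 14,000 initial clusters.</p>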
      <sec id="sec-1-5">
        <title>2.3 Task 2</title>
        <p>We utilize LDA with Gibbs sampling to automatically form 3,000
topics from a total of 100,000 text features, using the Matlab topic
modelling toolbox [<xref ref-type="bibr" rid="ref1">3</xref>].
Traditional classifiers such as k-nearest neighbour (kNN) and a
decision tree were then used, and a more recent classifier, random
forest, was added for comparison. An ensemble of the classifier
results was then formed using a consensus function. We evaluated our
classifiers with tenfold cross-validation, randomly choosing 15% of
the training data for validation.</p>
      </sec>
      <sec id="sec-1-6">
        <title>3. EXPERIMENTS AND RESULTS</title>
        <p>Four runs were submitted for each task. In task 1, we set the
threshold for forming a new cluster to γ = 0.3 and the number of
nearest clusters to k = 5. The task 1 runs varied the ranking method
and the similarity measure: runs one, two, and three used the
multi-domain similarity measure with the BM25, BM25 with proximity,
and SPH04 ranking formulas, respectively. The last run in this task
tested the effectiveness of our similarity measure by using only text
information with the SPH04 ranking formula.</p>
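      <p>A consensus function of the kind used for the task 2 ensemble can be as simple as a majority vote over the individual classifiers' label predictions; the tie-breaking rule below is an illustrative assumption, not necessarily the one we used.</p>

```python
from collections import Counter

def consensus(predictions):
    """Majority vote over per-classifier label lists; ties go to the
    label predicted by the first classifier (an arbitrary tie-break)."""
    n_items = len(predictions[0])
    fused = []
    for i in range(n_items):
        votes = [p[i] for p in predictions]
        top = Counter(votes).most_common()
        best_count = top[0][1]
        tied = [label for label, count in top if count == best_count]
        fused.append(votes[0] if votes[0] in tied else tied[0])
    return fused
```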
      <p>The results in Table 1 show that the ranking formula positively
affects the clustering results and that the multi-domain similarity
measure effectively improves the clustering quality. We also note
that one of the latest Sphinx ranking formulas (SPH04) outperforms
the other ranking formulas. Furthermore, these results confirm the
efficacy of using query ranking to improve the scalability of
constrained clustering on data with large clusters.</p>
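      <p>For reference, the BM25 score that underlies the first two ranking formulas has the familiar textbook form sketched below; this is the standard Okapi formulation with the usual k1 and b defaults, not Sphinx's exact expression.</p>

```python
import math

def bm25(query_terms, doc_terms, doc_len, avg_doc_len, df, n_docs,
         k1=1.2, b=0.75):
    """Textbook Okapi BM25: for each query term, an IDF weight times a
    saturated, length-normalized term-frequency component."""
    score = 0.0
    for t in query_terms:
        tf = doc_terms.get(t, 0)
        if tf == 0 or t not in df:
            continue
        idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
        norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        score += tf * (k1 + 1) / (tf + norm) * idf
    return score
```

      <p>Documents sharing rarer query terms score higher, which is why the choice of ranking formula matters when it is used to shortlist candidate clusters.</p>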
<p>Experiments on task 2 were carried out by building several
classifiers: random forest, k-nearest neighbour, and decision tree
classifiers were used for runs one to three, respectively. The last
result for task 2 was obtained from the consensus function over the
previous classifiers. Since the focus of our experiments was on
task 1, our minor attempt at handling the imbalanced categories in
task 2 proved to be insufficient.
</p>
        <p>Table 1. F1 and overall accuracy of the submitted runs.</p>
      </sec>
      <sec id="sec-1-7">
        <title>4. CONCLUSIONS AND FUTURE WORK</title>
        <p>In this work we used a constrained clustering algorithm with a
customized similarity measure, a variable number of clusters, and
document ranking. The results show that this method is able to group
social event images under their corresponding initial labels with
high accuracy. More work is needed, however, to handle the severely
imbalanced data in the task 2 classification. Future work will
explore the optimal parameters of the similarity measure in the
proposed clustering algorithm and investigate further use of ranking
to improve scalability.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. REFERENCES</title>
      <p>[1] A. Aksyonoff, "Sphinx Search," 2.1.1-beta ed., Sphinx
Technologies Inc., 2013.</p>
      <p>[2] S. Basu, A. Banerjee, and R. J. Mooney, "Semi-supervised
clustering by seeding," in Proceedings of the Nineteenth International
Conference on Machine Learning, San Francisco, CA, USA, 2002.</p>
      <p>[3] T. Griffiths and M. Steyvers, "Finding scientific topics,"
Proceedings of the National Academy of Sciences, 101 (suppl. 1),
5228-5235, 2004.</p>
      <p>[4] T. Reuter, S. Papadopoulos, V. Mezaris, P. Cimiano,
C. de Vries, and S. Geva, "Social Event Detection at MediaEval 2013:
Challenges, Datasets, and Evaluation," Barcelona, Spain, 2013.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>T. Reuter, S. Papadopoulos, V. Mezaris,
P. Cimiano, C. de Vries, and S. Geva, "Social Event Detection at
MediaEval 2013: Challenges, Datasets, and Evaluation," Barcelona,
Spain, 2013.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>