<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Supervised Clustering of Social Media Streams</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Wistuba</string-name>
          <email>wistuba@ismll.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lars Schmidt-Thieme</string-name>
          <email>schmidt-thieme@ismll.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Hildesheim, Information Systems &amp; Machine Learning Lab</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>In this paper we present our approach for the Social Event Detection Task 1 of the MediaEval 2013. We address the problem of event detection and clustering by learning a distance measure between two images in a supervised way. Then, we apply a variant of the Quality Threshold clustering to detect events and assign the images accordingly. We can show that the performance measures do not decrease for an increasing number of documents and report the results achieved for the challenge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2.1 Features</title>
      <p>
        We represent a pair (di; dj ) of two documents di, dj by a
feature vector x 2 Rm of m features. We have chosen the
same nine features as Reuter et. al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Additionally, a
further feature was used, indicating whether the document
was created by the same user (+1) or not ( 1). If a feature
cannot be computed because the information is missing, it
is assumed to be 0.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2.2 Preprocessing</title>
      <p>
        Textual information like title, tags and description is stemmed
using a Porter Stemmer [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Additionally, the documents are
sorted by the time of creation in ascending order. If the time
of creation is unknown, the time of its upload is used instead.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.3 Similarity Measure</title>
      <p>
        Related work in this eld [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ] prefer using SVMs to learn
the similarity between two documents but for our clustering
approach it has proven to be better to use Factorization
Machines [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] instead. We randomly sampled 4,000 positive and
4,000 negative document pair examples. A document pair
example (di; dj ) is positive if di and dj belong to the same
event, negative otherwise. The positive pairs were labeled
with 1, the negative with 0. Then we trained the model of
Factorization Machines (FM), i.e.
      </p>
      <p>m m
y^ (x) = w0 + X wixi + X
i=1 i=1 j=i+1
m
X viT vj xixj
by using stochastic gradient descent. Here, w0 is the global
bias, wi models the strength of the i-th variable and viT vj
models the interaction between the i-th and j-th variable
where V 2 Rm k. As a hyperparameter search combined
with those of the clustering would have been too time-intensive,
we tuned the learning rate and the regularization rate
such that the root mean square error was acceptable.
Concluding, we have chosen = 0:05, = 0 and k = 1. In
the following section we will see that it is more important to
choose the right hyperparameters for the clustering method.</p>
    </sec>
    <sec id="sec-4">
      <title>2.4 Clustering Method</title>
      <p>
        As the number of clusters is unknown and for application
in practice, an incremental, threshold-based clustering
technique is preferable as argued by Becker et. al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] we decided
to use Quality Threshold clustering (QT) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Because it is
computationally intensive as much as O n3 , an
approximation was needed to speed it up. Previous work [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ] has used
single-pass methods, but we were expecting better results by
sticking to the QT idea. Instead of applying QT onto the
full data, we split it into disjoint batches b1; : : : ; bdn=le of
size l. Choosing l small enough makes it feasible to apply
QT onto the batches. To also allow documents in the
following batches to be placed into a cluster from documents
in the previous batches, a representative of each cluster was
kept. The representative of a cluster C is the document
dR = arg mindi2C Pdj2C (di; dj )2, which is motivated by
the smallest enclosing circle. Assuming that the
represen●
●●
●●●
●●●
●●●
●●●●●●
●●
●
●
tative is actually the center of the smallest enclosing circle,
only documents with a distance of at most 2 can be
clustered to the same cluster for the following batches, where
is the threshold.
      </p>
    </sec>
    <sec id="sec-5">
      <title>3. EXPERIMENTS</title>
      <p>
        For the clustering approach two hyperparameters are needed:
the quality threshold and the batch size l. We estimated
them using a grid search on 130,000 documents which is
approximately the size of the testing set. The results
identi ed that there is probably only one global optimum, but
also that it is possible to trade precision with recall with
only a small loss of the F1-Score. For this challenge this is
not of importance but as already stated by Reuter et. al.
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a higher precision is more important for applications. A
part of our grid search is shown in Figure 1. Finally, for the
testing set we have chosen = 0:81 and l = 2; 000.
Another interesting fact of this approach is that it seems
to be stable for a larger number of documents as shown in
Figure 2. Reuter et. al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] has reported worse results for
the algorithms presented by Becker et. al. and Reuter et. al.
[
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ] if the number of documents grow. Even though they
have used a di erent dataset, a decrease in performance of
the F1-Score from around 87% for 10,000 documents to 74%
for 100,000 documents cannot be neglected.
      </p>
      <p>The nal challenge results on the test set are presented in
0.9
0.8
●
●
●●
0.78
0.95
0.90
0.85
0.80
0</p>
    </sec>
    <sec id="sec-6">
      <title>4. CONCLUSIONS</title>
      <p>The presented algorithm promises to be a good method for
this problem especially for bigger datasets. Therefore, a
comparison to state of the art algorithms using the same
dataset and features would be interesting. Possibly,
blocking can also be applied to our approach to further improve
the performance and especially the speed. As QT can be
parallelized, this could be another possibility to improve the
speed.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naaman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Gravano</surname>
          </string-name>
          .
          <article-title>Learning similarity metrics for event identi cation in social media</article-title>
          .
          <source>In Proceedings of the third ACM international conference on Web search and data mining</source>
          ,
          <source>WSDM '10</source>
          , pages
          <fpage>291</fpage>
          {
          <fpage>300</fpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Heyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kruglyak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Yooseph</surname>
          </string-name>
          .
          <source>Exploring Expression Data: Identi cation and Analysis of Coexpressed Genes. Genome Research</source>
          ,
          <volume>9</volume>
          (
          <issue>11</issue>
          ):
          <volume>1106</volume>
          {
          <fpage>1115</fpage>
          ,
          <string-name>
            <surname>Nov</surname>
          </string-name>
          .
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <article-title>An algorithm for su x stripping</article-title>
          .
          <source>Program: electronic library and information systems</source>
          ,
          <volume>14</volume>
          (
          <issue>3</issue>
          ):
          <volume>130</volume>
          {
          <fpage>137</fpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          .
          <article-title>Factorization machines</article-title>
          . In G. I.
          <string-name>
            <surname>Webb</surname>
            ,
            <given-names>B. L.</given-names>
          </string-name>
          0001,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gunopulos</surname>
          </string-name>
          , and X. Wu, editors, ICDM, pages
          <volume>995</volume>
          {
          <fpage>1000</fpage>
          . IEEE Computer Society,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Reuter</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          .
          <article-title>Event-based Classi cation of Social Media Streams</article-title>
          .
          <source>In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, ICMR '12</source>
          , pages
          <issue>22:1</issue>
          {
          <issue>22</issue>
          :
          <fpage>8</fpage>
          , New York, NY, USA,
          <year>2012</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Reuter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Drumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Buza</surname>
          </string-name>
          , and L.
          <string-name>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          .
          <article-title>Scalable event-based clustering of social media via record linkage techniques</article-title>
          . In L. A.
          <string-name>
            <surname>Adamic</surname>
            ,
            <given-names>R. A.</given-names>
          </string-name>
          <string-name>
            <surname>Baeza-Yates</surname>
          </string-name>
          , and S. Counts, editors,
          <source>ICWSM. The AAAI Press</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Reuter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          , C. de Vries, and
          <string-name>
            <given-names>S.</given-names>
            <surname>Geva</surname>
          </string-name>
          . Social Event Detection at MediaEval 2013:
          <article-title>Challenges, datasets, and evaluation</article-title>
          . In MediaEval 2013 Workshop, Barcelona, Spain, October
          <volume>18</volume>
          -19
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>