=Paper=
{{Paper
|id=None
|storemode=property
|title=Supervised Clustering of Social Media Streams
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_53.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/WistubaS13
}}
==Supervised Clustering of Social Media Streams==
Martin Wistuba, Lars Schmidt-Thieme
Information Systems & Machine Learning Lab, University of Hildesheim
wistuba@ismll.de, schmidt-thieme@ismll.de

===ABSTRACT===
In this paper we present our approach for the Social Event Detection Task 1 of MediaEval 2013. We address the problem of event detection and clustering by learning a distance measure between two images in a supervised way. Then, we apply a variant of Quality Threshold clustering to detect events and assign the images accordingly. We show that the performance measures do not decrease for an increasing number of documents and report the results achieved for the challenge.

===1. INTRODUCTION===
This paper presents our approach to Task 1 of the MediaEval Social Event Detection 2013 Challenge [7]. The task is to cluster images into an unknown number of events such that images belonging to the same event end up in the same cluster. For the required run only meta information such as title and description may be used, whereas for the general runs more information can be considered. Here, we only discuss an approach for the required run.

===2. FRAMEWORK DESCRIPTION===
In this section we describe how our clustering approach works. We first introduce the features used, explain our preprocessing, and then define how we learn the similarity measure between two documents. Finally, we explain our incremental clustering approach based on Quality Threshold clustering.

====2.1 Features====
We represent a pair (d_i, d_j) of two documents d_i, d_j by a feature vector x ∈ R^m of m features. We have chosen the same nine features as Reuter et al. [5]. Additionally, a further feature was used, indicating whether the two documents were created by the same user (+1) or not (−1). If a feature cannot be computed because the information is missing, it is assumed to be 0.

====2.2 Preprocessing====
Textual information like title, tags and description is stemmed using a Porter stemmer [3]. Additionally, the documents are sorted by their time of creation in ascending order. If the time of creation is unknown, the time of upload is used instead.

====2.3 Similarity Measure====
Related work in this field [1, 6] prefers using SVMs to learn the similarity between two documents, but for our clustering approach it has proven better to use Factorization Machines [4] instead. We randomly sampled 4,000 positive and 4,000 negative document pair examples. A document pair example (d_i, d_j) is positive if d_i and d_j belong to the same event, and negative otherwise. The positive pairs were labeled with 1, the negative ones with 0. Then we trained the Factorization Machine (FM) model, i.e.

\hat{y}(x) = w_0 + \sum_{i=1}^{m} w_i x_i + \sum_{i=1}^{m} \sum_{j=i+1}^{m} v_i^T v_j \, x_i x_j ,

using stochastic gradient descent. Here, w_0 is the global bias, w_i models the strength of the i-th variable, and v_i^T v_j models the interaction between the i-th and j-th variable, where V ∈ R^{m×k}. As a hyperparameter search combined with that of the clustering would have been too time-intensive, we tuned the learning rate α and the regularization rate λ such that the root mean square error was acceptable. We finally chose α = 0.05, λ = 0 and k = 1. In the following section we will see that it is more important to choose the right hyperparameters for the clustering method.
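The paper gives no implementation details beyond the model equation and the SGD hyperparameters. The following is a minimal NumPy sketch of such a pairwise FM similarity scorer; the function names, the squared-error objective, and the way the pair-feature vector x is passed in are assumptions for illustration, not part of the original submission.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM score y_hat(x) = w0 + w.x + sum_{i<j} (v_i . v_j) x_i x_j
    for one pair-feature vector x of length m; V has shape (m, k)."""
    xv = x @ V                                              # shape (k,)
    interactions = 0.5 * (xv @ xv - ((x ** 2) @ (V ** 2)).sum())
    return w0 + w @ x + interactions

def fm_sgd_step(x, y, w0, w, V, alpha=0.05, lam=0.0):
    """One stochastic gradient descent step on the squared error
    (y_hat(x) - y)^2; alpha and lam default to the values reported above."""
    err = fm_predict(x, w0, w, V) - y
    w0 -= alpha * err
    w -= alpha * (err * x + lam * w)
    xv = x @ V
    V -= alpha * (err * (np.outer(x, xv) - (x ** 2)[:, None] * V) + lam * V)
    return w0, w, V

# Hypothetical usage with k = 1 factors and pairs labeled 1 (same event) / 0:
# rng = np.random.default_rng(0)
# w0, w, V = 0.0, np.zeros(m), 0.01 * rng.standard_normal((m, 1))
# for x, y in training_pairs:
#     w0, w, V = fm_sgd_step(x, y, w0, w, V)
```

The learned score \hat{y}(x) is then used as the pairwise similarity between two documents in the clustering stage described next.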
====2.4 Clustering Method====
As the number of clusters is unknown and, for practical applications, an incremental, threshold-based clustering technique is preferable, as argued by Becker et al. [1], we decided to use Quality Threshold clustering (QT) [2]. Because QT is computationally intensive, as much as O(n^3), an approximation was needed to speed it up. Previous work [1, 6] has used single-pass methods, but we were expecting better results by sticking to the QT idea. Instead of applying QT to the full data, we split it into disjoint batches b_1, ..., b_⌈n/l⌉ of size l. Choosing l small enough makes it feasible to apply QT to the batches. To also allow documents in later batches to be placed into a cluster formed from documents in previous batches, a representative of each cluster was kept. The representative of a cluster C is the document

d_R = \arg\min_{d_i \in C} \sum_{d_j \in C} \delta(d_i, d_j)^2 ,

which is motivated by the smallest enclosing circle. Assuming that the representative is actually the center of the smallest enclosing circle, only documents with a distance of at most µ/2 to it can be assigned to the same cluster in the following batches, where µ is the threshold.
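To make the batched procedure concrete, here is a small Python sketch of the clustering stage under one reading of the description above: plain QT within each batch, plus a pre-pass that attaches a new document to an earlier cluster when it lies within µ/2 of that cluster's representative. The greedy candidate construction inside qt_cluster, all names, and the treatment of δ (the learned pairwise distance, passed in as a black-box `dist` function, since its derivation from the FM score is not spelled out) are assumptions, not the authors' code.

```python
from itertools import count

def qt_cluster(docs, dist, mu):
    """Greedy Quality Threshold clustering [2]: for every remaining document,
    grow a candidate cluster whose pairwise distances stay within mu, keep the
    largest candidate, remove its members, and repeat."""
    remaining = set(range(len(docs)))
    clusters = []
    while remaining:
        best = None
        for seed in remaining:
            cand = [seed]
            for j in sorted(remaining - {seed}, key=lambda j: dist(docs[seed], docs[j])):
                if all(dist(docs[j], docs[m]) <= mu for m in cand):
                    cand.append(j)
            if best is None or len(cand) > len(best):
                best = cand
        clusters.append(best)
        remaining -= set(best)
    return clusters

def representative(cluster, docs, dist):
    """d_R = argmin_{d_i in C} sum_{d_j in C} delta(d_i, d_j)^2."""
    return min(cluster, key=lambda i: sum(dist(docs[i], docs[j]) ** 2 for j in cluster))

def batched_qt(stream, dist, mu, batch_size):
    """Approximate QT on a document stream: a document within mu/2 of an earlier
    cluster representative joins that cluster; the rest are clustered with QT."""
    reps, assignment, next_id = [], {}, count()
    for start in range(0, len(stream), batch_size):
        unmatched = []
        for idx in range(start, min(start + batch_size, len(stream))):
            hit = next((c for rep, c in reps if dist(stream[idx], rep) <= mu / 2), None)
            if hit is not None:
                assignment[idx] = hit
            else:
                unmatched.append(idx)
        docs = [stream[i] for i in unmatched]
        for cluster in qt_cluster(docs, dist, mu):
            cid = next(next_id)
            for local in cluster:
                assignment[unmatched[local]] = cid
            reps.append((docs[representative(cluster, docs, dist)], cid))
    return assignment
```

With a batch size such as l = 2,000, each qt_cluster call stays tractable even though QT itself is roughly cubic in the number of documents it sees.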
===3. EXPERIMENTS===
For the clustering approach two hyperparameters are needed: the quality threshold µ and the batch size l. We estimated them using a grid search on 130,000 documents, which is approximately the size of the testing set. The results indicated that there is probably only one global optimum, but also that it is possible to trade precision for recall with only a small loss in F1-score. For this challenge this is not important, but as already stated by Reuter et al. [5], a higher precision is more important for applications. A part of our grid search is shown in Figure 1. Finally, for the testing set we chose µ = 0.81 and l = 2,000.

Figure 1: Excerpt of the grid search for batch sizes l ∈ {1000, 2000, 3000} and quality threshold µ ∈ {0.78, ..., 0.84}. Decreasing l and µ improves the precision, increasing them the recall.

Another interesting property of this approach is that it seems to remain stable for a larger number of documents, as shown in Figure 2. Reuter et al. [5] reported worse results for the algorithms presented by Becker et al. and Reuter et al. [1, 5] as the number of documents grows. Even though they used a different dataset, a decrease of the F1-score from around 87% for 10,000 documents to 74% for 100,000 documents cannot be neglected.

Figure 2: For the used validation set the evaluation scores seem to be stable for a growing number of documents. A threshold of µ = 0.81 and a batch size of l = 2,000 was used.

The final challenge results on the test set are presented in Table 1. The results are even better than those on the validation set. The reason for this is that the validation set was more complex, as it contained more and smaller clusters. We recognized the larger clusters on the test set while computing the clusters. This led to very high clustering times for a few batches, such that we decided to stop clustering a batch if it took more than two hours of computing time. The difference between the validation and test set also led to non-optimal hyperparameters, as a threshold of µ = 0.82 looks more promising.

Table 1: Final results on the test set for different hyperparameters.

| Hyperparameter setting | F1-Score | NMI |
|---|---|---|
| µ = 0.81, l = 2,000 | 0.8720 | 0.9606 |
| µ = 0.81, l = 1,000 | 0.8712 | 0.9643 |
| µ = 0.81, l = 1,500 | 0.8755 | 0.9641 |
| µ = 0.82, l = 1,500 | 0.8784 | 0.9655 |

===4. CONCLUSIONS===
The presented algorithm promises to be a good method for this problem, especially for bigger datasets. Therefore, a comparison to state-of-the-art algorithms using the same dataset and features would be interesting. Possibly, blocking can also be applied to our approach to further improve the performance and especially the speed. As QT can be parallelized, this could be another possibility to improve the speed.

===5. REFERENCES===
[1] H. Becker, M. Naaman, and L. Gravano. Learning similarity metrics for event identification in social media. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 291–300, New York, NY, USA, 2010. ACM.
[2] L. J. Heyer, S. Kruglyak, and S. Yooseph. Exploring Expression Data: Identification and Analysis of Coexpressed Genes. Genome Research, 9(11):1106–1115, Nov. 1999.
[3] M. Porter. An algorithm for suffix stripping. Program: electronic library and information systems, 14(3):130–137, 1980.
[4] S. Rendle. Factorization machines. In G. I. Webb, B. Liu, C. Zhang, D. Gunopulos, and X. Wu, editors, ICDM, pages 995–1000. IEEE Computer Society, 2010.
[5] T. Reuter and P. Cimiano. Event-based Classification of Social Media Streams. In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, ICMR '12, pages 22:1–22:8, New York, NY, USA, 2012. ACM.
[6] T. Reuter, P. Cimiano, L. Drumond, K. Buza, and L. Schmidt-Thieme. Scalable event-based clustering of social media via record linkage techniques. In L. A. Adamic, R. A. Baeza-Yates, and S. Counts, editors, ICWSM. The AAAI Press, 2011.
[7] T. Reuter, S. Papadopoulos, V. Mezaris, P. Cimiano, C. de Vries, and S. Geva. Social Event Detection at MediaEval 2013: Challenges, datasets, and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.