
Supervised Clustering of Social Media Streams

Martin Wistuba

wistuba@ismll.de

Lars Schmidt-Thieme

schmidt-thieme@ismll.de

University of Hildesheim, Information Systems & Machine Learning Lab

MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013

In this paper we present our approach to Social Event Detection Task 1 of MediaEval 2013. We address the problem of event detection and clustering by learning a distance measure between two images in a supervised way. We then apply a variant of Quality Threshold clustering to detect events and assign the images accordingly. We show that the performance measures do not decrease as the number of documents increases, and we report the results achieved in the challenge.

2.1 Features

We represent a pair $(d_i, d_j)$ of two documents $d_i$, $d_j$ by a feature vector $x \in \mathbb{R}^m$ of $m$ features. We use the same nine features as Reuter et al. [5]. In addition, one further feature indicates whether both documents were created by the same user (+1) or not (-1). If a feature cannot be computed because the information is missing, it is set to 0.
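The handling of the same-user feature and of missing values can be illustrated with a minimal Python sketch. The helper names `same_user_feature` and `pair_vector` are hypothetical, the three raw values stand in for the nine features of [5] (which are not specified here), and treating an unknown user as a missing value is an assumption.

```python
from typing import List, Optional, Sequence


def same_user_feature(user_a: Optional[str], user_b: Optional[str]) -> float:
    """+1 for the same user, -1 for different users, 0 if a user is unknown.

    Mapping an unknown user to 0 is an assumption based on the rule that
    features which cannot be computed are set to 0.
    """
    if user_a is None or user_b is None:
        return 0.0
    return 1.0 if user_a == user_b else -1.0


def pair_vector(raw_features: Sequence[Optional[float]],
                user_a: Optional[str],
                user_b: Optional[str]) -> List[float]:
    """Replace missing feature values by 0 and append the same-user feature."""
    filled = [0.0 if value is None else float(value) for value in raw_features]
    return filled + [same_user_feature(user_a, user_b)]


# Placeholder values: the second raw feature could not be computed.
x = pair_vector([0.8, None, 0.25], "alice", "bob")
```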

2.2 Preprocessing

Textual information such as title, tags, and description is stemmed using a Porter stemmer [3]. In addition, the documents are sorted by creation time in ascending order. If the creation time is unknown, the upload time is used instead.
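The ordering rule can be sketched in a few lines of Python. The `Document` class, its field names, and the use of numeric timestamps are assumptions made for illustration; stemming is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Document:
    doc_id: str
    created: Optional[float]  # creation time as a timestamp, None if unknown
    uploaded: float           # upload time as a timestamp


def sort_key(doc: Document) -> float:
    """Use the creation time, falling back to the upload time if it is unknown."""
    return doc.uploaded if doc.created is None else doc.created


def sort_stream(docs: List[Document]) -> List[Document]:
    """Return the documents in ascending order of their effective time."""
    return sorted(docs, key=sort_key)


docs = [
    Document("a", 30.0, 35.0),
    Document("b", None, 10.0),  # creation time unknown, upload time is used
    Document("c", 20.0, 50.0),
]
ordered_ids = [doc.doc_id for doc in sort_stream(docs)]
```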

2.3 Similarity Measure

Related work in this field [1, 6] prefers SVMs for learning the similarity between two documents, but for our clustering approach Factorization Machines [4] proved to work better. We randomly sampled 4,000 positive and 4,000 negative document pair examples. A document pair example $(d_i, d_j)$ is positive if $d_i$ and $d_j$ belong to the same event, and negative otherwise. The positive pairs were labeled with 1, the negative pairs with 0. We then trained a Factorization Machine (FM) model, i.e.

$$\hat{y}(x) = w_0 + \sum_{i=1}^{m} w_i x_i + \sum_{i=1}^{m} \sum_{j=i+1}^{m} v_i^T v_j \, x_i x_j$$

using stochastic gradient descent. Here, $w_0$ is the global bias, $w_i$ models the strength of the $i$-th variable, and $v_i^T v_j$ models the interaction between the $i$-th and $j$-th variable, where $v_i$ is the $i$-th row of $V \in \mathbb{R}^{m \times k}$. Because a hyperparameter search combined with that of the clustering would have been too time-intensive, we tuned the learning rate and the regularization rate only until the root mean square error was acceptable. In the end, we chose a learning rate of 0.05, a regularization rate of 0, and $k = 1$. In the following we will see that it is more important to choose the right hyperparameters for the clustering method.
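The FM model and a single stochastic gradient descent step can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used for the challenge: the squared-error loss $\frac{1}{2}(\hat{y} - y)^2$, the function names, and the example parameter values are assumptions. The prediction uses the linear-time reformulation of the pairwise term from [4] and is checked against the direct double sum.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def fm_predict_naive(x: Sequence[float], w0: float, w: Sequence[float], V: Matrix) -> float:
    """FM prediction with the pairwise term written as an explicit double sum."""
    m, k = len(x), len(V[0])
    y = w0 + sum(w[i] * x[i] for i in range(m))
    for i in range(m):
        for j in range(i + 1, m):
            y += sum(V[i][f] * V[j][f] for f in range(k)) * x[i] * x[j]
    return y


def fm_predict(x: Sequence[float], w0: float, w: Sequence[float], V: Matrix) -> float:
    """Same prediction in O(m * k) using 0.5 * ((sum v x)^2 - sum (v x)^2)."""
    m, k = len(x), len(V[0])
    y = w0 + sum(w[i] * x[i] for i in range(m))
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(m))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(m))
        y += 0.5 * (s * s - s_sq)
    return y


def sgd_step(x: Sequence[float], y: float, w0: float, w: Sequence[float], V: Matrix,
             lr: float = 0.05, reg: float = 0.0) -> Tuple[float, List[float], Matrix]:
    """One SGD update for the assumed loss 0.5 * (prediction - y)^2 plus L2 penalty."""
    m, k = len(x), len(V[0])
    err = fm_predict(x, w0, w, V) - y
    sums = [sum(V[i][f] * x[i] for i in range(m)) for f in range(k)]
    new_w0 = w0 - lr * err
    new_w = [w[i] - lr * (err * x[i] + reg * w[i]) for i in range(m)]
    new_V = [[V[i][f] - lr * (err * (x[i] * sums[f] - V[i][f] * x[i] ** 2) + reg * V[i][f])
              for f in range(k)] for i in range(m)]
    return new_w0, new_w, new_V


# Illustrative values only: three features, k = 1, a positive pair (label 1).
x_example = [1.0, -1.0, 0.5]
w0_init, w_init, V_init = 0.1, [0.2, -0.3, 0.4], [[0.1], [0.2], [-0.3]]
before = fm_predict(x_example, w0_init, w_init, V_init)
w0_new, w_new, V_new = sgd_step(x_example, 1.0, w0_init, w_init, V_init)
after = fm_predict(x_example, w0_new, w_new, V_new)
```

With the learning rate of 0.05 and no regularization, one update on this example moves the prediction closer to the label.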

2.4 Clustering Method

Because the number of clusters is unknown, and because an incremental, threshold-based clustering technique is preferable in practice, as argued by Becker et al. [1], we decided to use Quality Threshold clustering (QT) [2]. Since QT is computationally expensive, with a cost of up to $O(n^3)$, an approximation was needed to speed it up. Previous work [1, 6] used single-pass methods, but we expected better results from staying close to the QT idea. Instead of applying QT to the full data, we split the data into disjoint batches $b_1, \ldots, b_{\lceil n/l \rceil}$ of size $l$. Choosing $l$ small enough makes it feasible to apply QT to each batch. To allow documents in later batches to be placed into a cluster formed from documents in earlier batches, a representative of each cluster is kept. The representative of a cluster $C$ is the document

$$d_R = \arg\min_{d_i \in C} \sum_{d_j \in C} \operatorname{dist}(d_i, d_j)^2,$$

where $\operatorname{dist}(d_i, d_j)$ denotes the learned distance between two documents. This choice is motivated by the smallest enclosing circle. Assuming that the representative is actually the center of the smallest enclosing circle, only documents within a distance of at most twice the quality threshold can be assigned to the same cluster in the following batches.
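The two building blocks described above, QT clustering of a single batch and the choice of a cluster representative, can be sketched as follows. This is a simplified illustration under stated assumptions: the threshold is treated as a maximum cluster diameter, the distance is an arbitrary callable (the toy example uses the absolute difference of numbers rather than the learned distance), and the function names are hypothetical. How representatives are merged with later batches is not shown.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def qt_cluster(items: Sequence[T], dist: Callable[[T, T], float],
               threshold: float) -> List[List[T]]:
    """Quality Threshold clustering of one batch (diameter bounded by threshold)."""
    remaining = list(range(len(items)))
    clusters: List[List[T]] = []
    while remaining:
        best: List[int] = []
        for seed in remaining:
            candidate = [seed]
            # reach[p] is the largest distance from p to any candidate member.
            reach: Dict[int, float] = {
                p: dist(items[seed], items[p]) for p in remaining if p != seed
            }
            while reach:
                nxt = min(reach, key=lambda p: reach[p])
                if reach[nxt] > threshold:
                    break  # adding any further point would exceed the diameter
                candidate.append(nxt)
                del reach[nxt]
                for p in reach:
                    reach[p] = max(reach[p], dist(items[nxt], items[p]))
            if len(candidate) > len(best):
                best = candidate
        # Keep the largest candidate cluster and repeat on the rest.
        clusters.append([items[i] for i in best])
        chosen = set(best)
        remaining = [i for i in remaining if i not in chosen]
    return clusters


def representative(cluster: Sequence[T], dist: Callable[[T, T], float]) -> T:
    """Member minimizing the sum of squared distances to all cluster members."""
    return min(cluster, key=lambda d_i: sum(dist(d_i, d_j) ** 2 for d_j in cluster))


def toy_dist(a: float, b: float) -> float:
    """Toy distance for the example; the paper uses a learned measure instead."""
    return abs(a - b)


batch = [0.0, 0.1, 0.2, 5.0, 5.1]
clusters = qt_cluster(batch, toy_dist, threshold=0.5)
first_representative = representative(clusters[0], toy_dist)
```

In the toy example the batch splits into two clusters, and the middle point of the first cluster is chosen as its representative.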

3. EXPERIMENTS

The clustering approach requires two hyperparameters: the quality threshold and the batch size $l$. We estimated them using a grid search on 130,000 documents, which is approximately the size of the test set. The results indicate that there is probably only one global optimum, but also that precision can be traded for recall with only a small loss in F1-score. For this challenge this is not important, but as already stated by Reuter et al. [5], a higher precision is more important for applications. Part of our grid search is shown in Figure 1. For the test set we finally chose a quality threshold of 0.81 and $l = 2{,}000$.

Another interesting property of this approach is that it appears to be stable for larger numbers of documents, as shown in Figure 2. Reuter et al. [5] reported worse results for the algorithms presented by Becker et al. and Reuter et al. [1, 5] as the number of documents grows. Even though they used a different dataset, a drop in F1-score from around 87% for 10,000 documents to 74% for 100,000 documents cannot be neglected.

The final challenge results on the test set are presented in [...].

[Figures 1 and 2: plot content not recoverable; only axis tick values remain.]

4. CONCLUSIONS

The presented algorithm promises to be a good method for this problem, especially for larger datasets. A comparison with state-of-the-art algorithms using the same dataset and features would therefore be interesting. Blocking could possibly also be applied to our approach to further improve its performance and especially its speed. Because QT can be parallelized, parallelization is another possible way to improve the speed.

[1] Becker, Naaman, and Gravano. Learning similarity metrics for event identification in social media. In Proceedings of the third ACM international conference on Web search and data mining, WSDM '10, pages 291-300, New York, NY, USA, 2010. ACM.

[2] L. J. Heyer, Kruglyak, and Yooseph. Exploring Expression Data: Identification and Analysis of Coexpressed Genes. Genome Research, 9(11):1106-1115, Nov. 1999.

[3] Porter. An algorithm for suffix stripping. Program: electronic library and information systems, 14(3):130-137, 1980.

[4] Rendle. Factorization machines. In G. I. Webb, B. Liu, Zhang, Gunopulos, and X. Wu, editors, ICDM, pages 995-1000. IEEE Computer Society, 2010.

[5] Reuter and Cimiano. Event-based Classification of Social Media Streams. In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, ICMR '12, pages 22:1-22:8, New York, NY, USA, 2012. ACM.

[6] Reuter, Cimiano, Drumond, Buza, and L. Schmidt-Thieme. Scalable event-based clustering of social media via record linkage techniques. In L. A. Adamic, R. A. Baeza-Yates, and S. Counts, editors, ICWSM. The AAAI Press, 2011.

[7] Reuter, Papadopoulos, Mezaris, Cimiano, C. de Vries, and Geva. Social Event Detection at MediaEval 2013: Challenges, datasets, and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19 2013.