=Paper=
{{Paper
|id=None
|storemode=property
|title=Supervised Clustering of Social Media Streams
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_53.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/WistubaS13
}}
==Supervised Clustering of Social Media Streams==
Martin Wistuba, Lars Schmidt-Thieme
Information Systems & Machine Learning Lab, University of Hildesheim
wistuba@ismll.de, schmidt-thieme@ismll.de

===ABSTRACT===
In this paper we present our approach for the Social Event Detection Task 1 of MediaEval 2013. We address the problem of event detection and clustering by learning a distance measure between two images in a supervised way. Then, we apply a variant of Quality Threshold clustering to detect events and assign the images accordingly. We show that the performance measures do not decrease for an increasing number of documents and report the results achieved for the challenge.

===1. INTRODUCTION===
This paper presents our approach to Task 1 of the MediaEval Social Event Detection 2013 Challenge [7]. The task is to cluster images into an unknown number of events such that images belonging to the same event end up in the same cluster. For the required run only meta information such as title and description may be used, whereas for the general runs more information can be considered. Here, we only discuss an approach for the required run.

===2. FRAMEWORK DESCRIPTION===
In this section we describe how our clustering approach works. We first introduce the features used, explain our preprocessing, and then define how we learn the similarity measure between two documents. Finally, we explain our incremental clustering approach based on Quality Threshold clustering.

====2.1 Features====
We represent a pair (d_i, d_j) of two documents d_i, d_j by a feature vector x ∈ R^m of m features. We have chosen the same nine features as Reuter et al. [5]. Additionally, a further feature was used, indicating whether the two documents were created by the same user (+1) or not (−1). If a feature cannot be computed because the information is missing, it is assumed to be 0.

====2.2 Preprocessing====
Textual information like title, tags and description is stemmed using a Porter stemmer [3]. Additionally, the documents are sorted by their time of creation in ascending order. If the time of creation is unknown, the time of upload is used instead.

====2.3 Similarity Measure====
Related work in this field [1, 6] prefers using SVMs to learn the similarity between two documents, but for our clustering approach it has proven better to use Factorization Machines [4] instead. We randomly sampled 4,000 positive and 4,000 negative document pair examples. A document pair example (d_i, d_j) is positive if d_i and d_j belong to the same event, and negative otherwise. The positive pairs were labeled with 1, the negative ones with 0. Then we trained the Factorization Machine (FM) model, i.e.

\hat{y}(x) = w_0 + \sum_{i=1}^{m} w_i x_i + \sum_{i=1}^{m} \sum_{j=i+1}^{m} v_i^T v_j \, x_i x_j ,

using stochastic gradient descent. Here, w_0 is the global bias, w_i models the strength of the i-th variable, and v_i^T v_j models the interaction between the i-th and j-th variable, where V ∈ R^{m×k}. As a hyperparameter search combined with that of the clustering would have been too time-intensive, we tuned the learning rate α and the regularization rate λ such that the root mean square error was acceptable. We finally chose α = 0.05, λ = 0 and k = 1. In the following section we will see that it is more important to choose the right hyperparameters for the clustering method.
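The paper gives no implementation details beyond the model equation and the SGD hyperparameters. The following is a minimal NumPy sketch of such a pairwise FM similarity scorer; the function names, the squared-error objective, and the way the pair-feature vector x is passed in are assumptions for illustration, not part of the original submission.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM score y_hat(x) = w0 + w.x + sum_{i<j} (v_i . v_j) x_i x_j
    for one pair-feature vector x of length m; V has shape (m, k)."""
    xv = x @ V                                              # shape (k,)
    interactions = 0.5 * (xv @ xv - ((x ** 2) @ (V ** 2)).sum())
    return w0 + w @ x + interactions

def fm_sgd_step(x, y, w0, w, V, alpha=0.05, lam=0.0):
    """One stochastic gradient descent step on the squared error
    (y_hat(x) - y)^2; alpha and lam default to the values reported above."""
    err = fm_predict(x, w0, w, V) - y
    w0 -= alpha * err
    w -= alpha * (err * x + lam * w)
    xv = x @ V
    V -= alpha * (err * (np.outer(x, xv) - (x ** 2)[:, None] * V) + lam * V)
    return w0, w, V

# Hypothetical usage with k = 1 factors and pairs labeled 1 (same event) / 0:
# rng = np.random.default_rng(0)
# w0, w, V = 0.0, np.zeros(m), 0.01 * rng.standard_normal((m, 1))
# for x, y in training_pairs:
#     w0, w, V = fm_sgd_step(x, y, w0, w, V)
```

The learned score \hat{y}(x) is then used as the pairwise similarity between two documents in the clustering stage described next.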
====2.4 Clustering Method====
As the number of clusters is unknown and, for practical applications, an incremental, threshold-based clustering technique is preferable, as argued by Becker et al. [1], we decided to use Quality Threshold clustering (QT) [2]. Because QT is computationally intensive, as much as O(n^3), an approximation was needed to speed it up. Previous work [1, 6] has used single-pass methods, but we were expecting better results by sticking to the QT idea. Instead of applying QT to the full data, we split it into disjoint batches b_1, ..., b_⌈n/l⌉ of size l. Choosing l small enough makes it feasible to apply QT to the batches. To also allow documents in later batches to be placed into a cluster formed from documents in previous batches, a representative of each cluster was kept. The representative of a cluster C is the document

d_R = \arg\min_{d_i \in C} \sum_{d_j \in C} \delta(d_i, d_j)^2 ,

which is motivated by the smallest enclosing circle. Assuming that the representative is actually the center of the smallest enclosing circle, only documents with a distance of at most µ/2 to it can be assigned to the same cluster in the following batches, where µ is the threshold.
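To make the batched procedure concrete, here is a small Python sketch of the clustering stage under one reading of the description above: plain QT within each batch, plus a pre-pass that attaches a new document to an earlier cluster when it lies within µ/2 of that cluster's representative. The greedy candidate construction inside qt_cluster, all names, and the treatment of δ (the learned pairwise distance, passed in as a black-box `dist` function, since its derivation from the FM score is not spelled out) are assumptions, not the authors' code.

```python
from itertools import count

def qt_cluster(docs, dist, mu):
    """Greedy Quality Threshold clustering [2]: for every remaining document,
    grow a candidate cluster whose pairwise distances stay within mu, keep the
    largest candidate, remove its members, and repeat."""
    remaining = set(range(len(docs)))
    clusters = []
    while remaining:
        best = None
        for seed in remaining:
            cand = [seed]
            for j in sorted(remaining - {seed}, key=lambda j: dist(docs[seed], docs[j])):
                if all(dist(docs[j], docs[m]) <= mu for m in cand):
                    cand.append(j)
            if best is None or len(cand) > len(best):
                best = cand
        clusters.append(best)
        remaining -= set(best)
    return clusters

def representative(cluster, docs, dist):
    """d_R = argmin_{d_i in C} sum_{d_j in C} delta(d_i, d_j)^2."""
    return min(cluster, key=lambda i: sum(dist(docs[i], docs[j]) ** 2 for j in cluster))

def batched_qt(stream, dist, mu, batch_size):
    """Approximate QT on a document stream: a document within mu/2 of an earlier
    cluster representative joins that cluster; the rest are clustered with QT."""
    reps, assignment, next_id = [], {}, count()
    for start in range(0, len(stream), batch_size):
        unmatched = []
        for idx in range(start, min(start + batch_size, len(stream))):
            hit = next((c for rep, c in reps if dist(stream[idx], rep) <= mu / 2), None)
            if hit is not None:
                assignment[idx] = hit
            else:
                unmatched.append(idx)
        docs = [stream[i] for i in unmatched]
        for cluster in qt_cluster(docs, dist, mu):
            cid = next(next_id)
            for local in cluster:
                assignment[unmatched[local]] = cid
            reps.append((docs[representative(cluster, docs, dist)], cid))
    return assignment
```

With a batch size such as l = 2,000, each qt_cluster call stays tractable even though QT itself is roughly cubic in the number of documents it sees.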
===3. EXPERIMENTS===
For the clustering approach two hyperparameters are needed: the quality threshold µ and the batch size l. We estimated them using a grid search on 130,000 documents, which is approximately the size of the testing set. The results indicated that there is probably only one global optimum, but also that it is possible to trade precision for recall with only a small loss in F1-score. For this challenge this is not important, but as already stated by Reuter et al. [5], a higher precision is more important for applications. A part of our grid search is shown in Figure 1. Finally, for the testing set we chose µ = 0.81 and l = 2,000.

Figure 1: Excerpt of the grid search for batch sizes l ∈ {1000, 2000, 3000} and quality threshold µ ∈ {0.78, ..., 0.84}. Decreasing l and µ improves the precision, increasing them the recall.

Another interesting property of this approach is that it seems to remain stable for a larger number of documents, as shown in Figure 2. Reuter et al. [5] reported worse results for the algorithms presented by Becker et al. and Reuter et al. [1, 5] as the number of documents grows. Even though they used a different dataset, a decrease of the F1-score from around 87% for 10,000 documents to 74% for 100,000 documents cannot be neglected.

Figure 2: For the used validation set the evaluation scores seem to be stable for a growing number of documents. A threshold of µ = 0.81 and a batch size of l = 2,000 was used.

The final challenge results on the test set are presented in Table 1. The results are even better than those on the validation set. The reason for this is that the validation set was more complex, as it contained more and smaller clusters. We recognized the larger clusters on the test set while computing the clusters. This led to very high clustering times for a few batches, such that we decided to stop clustering a batch if it took more than two hours of computing time. The difference between the validation and test set also led to non-optimal hyperparameters, as a threshold of µ = 0.82 looks more promising.

Table 1: Final results on the test set for different hyperparameters.

| Hyperparameter setting | F1-Score | NMI |
|---|---|---|
| µ = 0.81, l = 2,000 | 0.8720 | 0.9606 |
| µ = 0.81, l = 1,000 | 0.8712 | 0.9643 |
| µ = 0.81, l = 1,500 | 0.8755 | 0.9641 |
| µ = 0.82, l = 1,500 | 0.8784 | 0.9655 |

===4. CONCLUSIONS===
The presented algorithm promises to be a good method for this problem, especially for bigger datasets. Therefore, a comparison to state-of-the-art algorithms using the same dataset and features would be interesting. Possibly, blocking can also be applied to our approach to further improve the performance and especially the speed. As QT can be parallelized, this could be another possibility to improve the speed.

===5. REFERENCES===
[1] H. Becker, M. Naaman, and L. Gravano. Learning similarity metrics for event identification in social media. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 291–300, New York, NY, USA, 2010. ACM.
[2] L. J. Heyer, S. Kruglyak, and S. Yooseph. Exploring Expression Data: Identification and Analysis of Coexpressed Genes. Genome Research, 9(11):1106–1115, Nov. 1999.
[3] M. Porter. An algorithm for suffix stripping. Program: electronic library and information systems, 14(3):130–137, 1980.
[4] S. Rendle. Factorization machines. In G. I. Webb, B. Liu, C. Zhang, D. Gunopulos, and X. Wu, editors, ICDM, pages 995–1000. IEEE Computer Society, 2010.
[5] T. Reuter and P. Cimiano. Event-based Classification of Social Media Streams. In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, ICMR '12, pages 22:1–22:8, New York, NY, USA, 2012. ACM.
[6] T. Reuter, P. Cimiano, L. Drumond, K. Buza, and L. Schmidt-Thieme. Scalable event-based clustering of social media via record linkage techniques. In L. A. Adamic, R. A. Baeza-Yates, and S. Counts, editors, ICWSM. The AAAI Press, 2011.
[7] T. Reuter, S. Papadopoulos, V. Mezaris, P. Cimiano, C. de Vries, and S. Geva. Social Event Detection at MediaEval 2013: Challenges, datasets, and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.