Ranking Based Clustering for Social Event Detection Taufik Sutanto Richi Nayak Queensland University of Technology Queensland University of Technology taufik.sutanto@qut.edu.au r.nayak@qut.edu.au ABSTRACT tionally expensive, as well as, a large size cluster makes the The problem of clustering a large document collection is not similarity measure between documents ambiguous. only challenged by the number of documents and the num- Semi-supervised clustering methods have shown to pro- ber of dimensions, but it is also affected by the number and duce a better result compared to their traditional unsuper- sizes of the clusters. Traditional clustering methods fail to vised counterpart [2]. In the 2013 SED task, we proposed scale when they need to generate a large number of clusters. and used a scalable ranking based semi-supervised clustering Furthermore, when the clusters size in the solution is het- approach that produces accurate clusters [5]. However, this erogeneous, i.e. some of the clusters are large in size, the method suffers with the communication cost for long doc- similarity measures tend to degrade. A ranking based clus- uments. To deal with the issue, we utilized the document tering method is proposed to deal with these issues in the frequency distribution and exclude the most occurring terms context of the Social Event Detection task. Ranking scores in the query document (i.e. the document to be clustered) are used to select a small number of most relevant clusters if needed. in order to compare and place a document. Additionally, The use of hubs has been explored and has shown its ef- instead of conventional cluster centroids, cluster patches are ficacy in dealing with high dimensional data and clusters proposed to represent clusters, that are hubs-like set of doc- with large sizes [3]. However, the k-NN calculation of hubs uments. Text, temporal, spatial and visual content infor- demands a considerable amount of extra computation that mation collected from the social event images is utilized in is not suitable for large data clustering. In this paper, doc- calculating similarity. Results show that these strategies al- uments are assigned to clusters based on its distances to low us to have a balance between performance and accuracy cluster patches. These patches are calculated based on the of the clustering solution gained by the clustering method. ranking scores from the queries. Document frequencies are used to select a subset of terms from documents to create the queries. These patches become data representatives to mea- 1. INTRODUCTION sure distances for a document, instead of using each cluster The Social Event Detection (SED) task at the 2014 Medi- centroid. The use of patches is expected to enable the clus- aEval Benchmark for Multimedia Evaluation consists of two tering method to capture more specific sub-topics within a subtasks: (1) Image clustering based on a given set of events; cluster. and (2) retrieval of social events based on predefined queries In this paper, we present a method based on cluster patches [4]. The SED task poses challenges to clustering analysis to calculate the distance between a document and the groups due to the real-world nature of the data such as the large of documents inside a cluster (Figure 1). Instead of a single number of dimensions, large data size, multi-domain types centroid, patches are proposed to represent a large high- of features, and the need to group data into a large and un- dimensional cluster in order to control the significance of fixed number of clusters. This paper focuses on proposing similarity measurement. a solution to the first subtask, i.e., semi-supervised cluster- ing of social event images based on the metadata and visual content. Search engine technologies e.g., Sphinx, Lucene or Solr have been successfully implemented to process large sized document collections for information retrieval. Utilizing the concept of ranking scores used in search engines, coupled with using prior knowledge from the learning data, in semi- supervised clustering has shown to be an effective and effi- cient approach of clustering text data [5, 6]. This type of approaches works fine when the collection size or the num- ber of clusters required is small. Calculating ranking scores for a large number of documents is known to be computa- Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain Figure 1: Ranking based document clustering with patches. 2. PREPROCESSING 4. RESULTS AND DISCUSSION All the features of the images were used in the clustering We submitted five runs for the supervised clustering task process except of their URL. English stopwords and some (Table 1). Runs one, three, four, and five used the proposed symbols (e.g. #,&,@) were filtered. Title, tag, username, method on text only, text-time-space, all attributes, and and description attributes were combined into a short doc- text-images set of attributes respectively. While run two is ument. No external resources were used in the analysis. using the method as described in [5] using the text attribute The document length normalized tf-idf was used as the term only. weighting scheme. The time information were transformed into day interval between date taken and date upload. Spa- Table 1: Semi-Supervised clustering results. tial information (i.e. latitude and longitude) were used by utilizing a modified Harversine-formula. The modification is Run1 Run2 Run3 Run4 Run5 done by changing the range of the measure to a unit value as F1-Score 0.7463 0.7533 0.7445 0.7440 0.7456 in cosine distance. Feature-based super-pixel segmentation NMI 0.9024 0.9020 0.9017 0.9015 0.9020 is used to extract compact color and texture representation Div. F1 0.7447 0.7516 0.7428 0.7424 0.7439 for small image patches [1]. This representation has smaller dimension compared to the bag-of-visual words (BOVW) The first two runs indicate that the proposed method has approach. comparable accuracy to general ranking method, but an im- proved cluster quality as shown by NMI. While the remain- 3. THE MODEL ing runs shows that the usage of image, spatial, and time A set of patches P are calculated in each iteration based information is ineffective in this data for the purpose of clus- on the ranking score from the document query. Instead of tering. The main reason behind this is the dependence of comparing a document with a cluster centroid, the document the proposed method on text ranking. feature vector is compared with all the patches. The patches An adaptive weighting where weights of each attribute are are calculated based on a certain size (δ) neighborhood of dynamic among documents is a priority for future investi- documents based on ranking scores within clusters. Optimal gation to solve this issue. Future work will also explore on distance from the document and these patches is then used finding the optimal parameter γ and improve the scalability to decide the document assignment to a cluster. More detail of the method in distributed data and distributed computing of the approach is given in Algorithm 1. environment. input : Set of documents D, initial clusters 5. ACKNOWLEDGMENTS C = {c1 , c2 , . . . , cK }, neighborhood size m, We like to thank Dr Simon Denman from QUT Compu- patches size δ, and cluster threshold γ. tational Intelligence and Signal Processing lab for providing output: K 0 disjoint partitions of D. us the image visual content encoding. Index all documents D; for each di ∈ Dtest do 6. REFERENCES [1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and calculate a set of cluster patches S. Susstrunk. SLIC superpixels compared to P = {0 < |drank | < δi , i ∈ I, d ∈ cj }; state-of-the-art superpixel methods. IEEE Transactions for each p ∈ P do on Pattern Analysis and Machine Intelligence, 34, 2012. calculate p∗ = maxp {sim(di , p), p ∈ P }; if sim(di , p∗) > γ then [2] S. Basu, I. Davidson, and K. Wagstaff. Constrained Assign document di to a cluster where p∗ Clustering: Advances in Algorithms, Theory, and belongs; Applications. Chapman & Hall/CRC, 1 edition, 2008. else [3] J. Hou and R. Nayak. The heterogeneous cluster Form a new cluster c=di ; ensemble method using hubness for clustering text end documents. In WISE 2013, pages 102–110. Springer Update cluster labels via the search engine Berlin Heidelberg. end [4] G. Petkos, S. Papadopoulos, V. Mezaris, and end Y. Kompatsiaris. Social event detection at MediaEval 2014: Challenges, datasets, and evaluation. In Algorithm 1: Incremental ranking based social Proceedings of the MediaEval 2014 Multimedia event images clustering algorithm. Benchmark Workshop Barcelona, Spain, October 16-17, 2014, volume 1044. CEUR-WS.org, 2014. The similarity measure between a document d and a patch [5] T. Sutanto and R. Nayak. ADMRG @ MediaEval 2013 p in a cluster c is given by utilizing textual, temporal, spatial social event detection. In Proceedings of the MediaEval and visual information within images: 2013 Multimedia Benchmark Workshop Barcelona, sim(d, p) = β1 simcosine (d, p) + β2 simtime (d, p)+ Spain, October 18-19, 2013, volume 1043, 2013. (1) [6] T. Sutanto and R. Nayak. The ranking based β3 simspace (d, p) + β4 simimage (d, p). constrained document clustering method and its βi is a weight parameter to combine the effect of various application to social event detection. In Database types of attributes. These parameters can be fine tuned Systems for Advanced Applications, volume 8422 of manually or calculated from the learning data by using vari- Lecture Notes in Computer Science, pages 47–60. able importance measures from a decision tree model. Springer International Publishing, 2014.