Unsupervised Clustering of Social Events

Matthias Zeppelzauer
Vienna University of Technology, Austria
Interactive Media Sys. Group
mzz@ims.tuwien.ac.at

Maia Zaharieva
University of Vienna, Austria
Research Group Multimedia Information Systems
zaharieva@cs.univie.ac.at

Manfred Del Fabro
Klagenfurt University, Austria
Institute of Information Technology
manfred@itec.aau.at

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

ABSTRACT
This paper describes our contribution to the social event detection (SED) task of the MediaEval Benchmark 2013. We present a robust unsupervised approach for the clustering of tagged photos and videos into social events. Results on the SED datasets show that the proposed approach achieves excellent generalization ability and state-of-the-art clustering performance.

1. INTRODUCTION
We participated in challenge 1 of the Social Event Detection (SED) task [4]. The goal of the task is to build photo clusters belonging to unique social events in a large collection of tagged Flickr images. The total number of events is not provided. In an additional subtask we assign unlabeled videos to the previously discovered photo clusters. The development set comprises 300k images from 14,882 unique events. For the test set of 131k images no ground truth is available.

We consider challenge 1 as an unsupervised data mining task. The basic idea is to rely on robust heuristics and to reduce the number of parameters of the approach to a minimum in order to obtain good generalization ability across different datasets. Additionally, the proposed approach does not require any external (online) data sources.

In the course of the SED2013 task, we focus on the following research questions: (i) Which level of clustering performance can be obtained by relying on simple but robust heuristics for unsupervised clustering, and how do the results compare to more complex clustering methods? (ii) How well does the proposed approach generalize to unknown data?

2. RELATED WORK
Many existing approaches for event detection in image collections require separate training [1, 3]. Becker et al. create separate clusters for each feature such as title, description, time, etc. The authors employ single-pass incremental clustering, where the threshold for each cluster is tuned based on a set of training data [1]. Reuter and Cimiano employ machine learning techniques to detect events in social streams. The authors employ SVMs to classify Flickr images annotated by machine tags from last.fm into events [3]. Vavliakis et al. propose a social event detection approach based on topic detection [5]. The authors perform topic detection by Latent Dirichlet Allocation (LDA) for each city in the image collection. Additionally, the authors manually identify topics that are typical for a specific event cluster.

From related approaches we observe that many assumptions are made on the training set and that (partially manual) optimizations are required, which limits general applicability. Our unsupervised approach minimizes the assumptions on the data and avoids manual intervention. The approach exhibits strong generalization ability, and results show that the sensitivity to the involved parameters is reasonably low.

3. APPROACH

3.1 Full Clustering
The input to the approach is the available metadata of the SED dataset (capture date, location, title, tags, description) and a stopword list. No other data sources are required. In a first step, the metadata are preprocessed: since a user cannot be at two locations at the same time, we assign locations of photos taken by the same user at the same time to the user's non-geotagged photos. Additionally, the textual metadata are filtered by the stopword list.

In a next step, we perform three independent clusterings in parallel: temporal clustering, location clustering, and topic clustering. For temporal clustering we employ meanshift and set the bandwidth parameter $\beta_T$ such that the resulting clusters span between 2 and 6 hours, which is a reasonable temporal resolution for social events. For location clustering we observe that the performance gain of meanshift clustering does not justify the computational effort. Hence, we skip meanshift clustering and assign each individual and unique location in the data a separate cluster ID. Topic clustering is based on topic extraction by LDA. We perform topic modeling on the textual descriptions of each photo (title, tags, description) using LDA and extract T topics for the employed dataset. For each photo i, we estimate the likelihoods $l_{i,1}$ and $l_{i,2}$ of the first- and second-best matching topics. If the difference of the likelihoods is larger than a threshold $\tau$ ($l_{i,1} - l_{i,2} > \tau$), the most likely topic is assigned to the photo; otherwise no topic is assigned. Parameter $\tau$ is set to 0.3 for all experiments.

The three independent clusterings are the basis for the generation of initial event clusters. Photos which share the same temporal cluster, location cluster, and topic cluster are assigned the same unique event ID. The remaining photos are assigned to existing and new events in a number of matching steps. First, remaining photos which share the same user and capture time as photos in already existing events are assigned to the respective events. If several events share the same users and capture times, the events are merged. Second, remaining photos without location information are matched to existing events by time and topic. If no match to an existing event can be established, a new (non-geotagged) event cluster is generated. For photos where neither location nor topic is available, we generate new events by their capture time.

The resulting sets of events (refined event clusters and non-geotagged event clusters) may oversegment the true event distribution. Hence, we merge events that share similar time, location, and topic to obtain the final event clusters.

[Figure 1: Overview of the approach. Flow diagram: metadata (description, location, time) and stop words enter preprocessing, followed by topic clustering, location clustering, and temporal clustering. Merging the clusterings yields initial clusters and unassigned photos. Unassigned photos are matched by user + time (update, merge), by time + topic (no location), or grouped into new events (no location, no topic), giving refined event clusters and non-geotagged event clusters. Merging events yields the final event clusters.]

3.2 Full Clustering of Media using Videos
For the video subtask, we apply the topic modeling described above to the stopword-filtered textual descriptions of the videos (title, description, keywords). Temporal clustering and location clustering are omitted, because most videos contain neither location information nor a capture date. As a consequence, parameter $\tau$ is set to 0.0 for all experiments to achieve a complete clustering of all videos.

We investigate three different approaches for generating the video clusters: (i) LDA is applied to train a topic model with 200 topics on the development data, from which the topics of the test data are derived; (ii) each video constitutes a topic on its own; and (iii) an unsupervised LDA-based approach is used to detect 70 topics in the test data. After the video clusters are created, we link them to the previously generated photo clusters. The keywords of video clusters V are compared to the keywords of the photo clusters P using the Jaccard similarity coefficient. Each video cluster is linked to the photo cluster with the highest similarity.

4. EXPERIMENTS AND RESULTS
We use the same parameters for experiments on the development and test set. To estimate the number of topics, we assume that each topic is constituted on average by 100-200 photos. Additionally, we evaluate different values of $\beta_T$ corresponding to an event duration of 2-6 hours. The results of the proposed approach for both sets demonstrate its excellent generalization ability (see Table 1). Results for the test set are even better than for the development set. The clustering performance is comparable to (more complex) supervised state-of-the-art methods. The approach by Petkos et al., for example, yields NMI values of 0.92 (average of best results) and 0.69 (average performance) on a portion of the SED2011 dataset (no F1 reported) [2]. Becker et al. [1] achieve NMI values between 0.92 and 0.94 and F1 values from 0.77 to 0.82 on a test set consisting of 270k photos (10 splits). Reuter and Cimiano report an F1 of 0.74 for a dataset of 700k photos (7 splits, no NMI reported) [3].

Table 1: Results for Full Clustering

             Development Set         Test Set
$\beta_T$    Topics   F1     NMI     Topics   F1     NMI
0.2          2000     0.74   0.94    1000     0.78   0.94
0.2          3000     0.74   0.94    1500     0.78   0.94
0.2          1600     0.74   0.94    800      0.78   0.94
0.1          2000     0.73   0.93    1000     0.76   0.94
0.5          2000     0.72   0.93    1000     0.77   0.94

The three approaches submitted to the video subtask show different results. The supervised approach trained on the development data performs suboptimally (F1=0.42, NMI=0.68). The reason for this may be that the events of the test data are inferred from the events in the development data. If an event is not included in the development data, it cannot be inferred. The second approach shows that comparing the metadata of single videos with the accumulated LDA keywords from clusters is not well suited to link single videos to clusters (F1=0.34, NMI=0.77). The unsupervised LDA-based approach performs best (F1=0.69, NMI=0.85) and provides a promising baseline for future improvements.

5. CONCLUSIONS AND OUTLOOK
In this paper we presented our contribution to the SED challenge of the MediaEval 2013 Benchmark. We proposed a robust unsupervised method for the clustering of photos and videos into social events. The method exhibits strong generalization ability, low sensitivity to parameters, and yields state-of-the-art performance. Future work focuses on more sophisticated event refinements and visual content analysis.

6. ACKNOWLEDGMENTS
This work has been partly funded by the Vienna Science and Technology Fund (WWTF) through project ICT12-010 and the Carinthian Economic Promotion Fund (KWF) under grant KWF-20214 22573 33955.

7. REFERENCES
[1] H. Becker, M. Naaman, and L. Gravano. Learning similarity metrics for event identification in social media. In ACM WSDM, pp. 291–300, 2010.
[2] G. Petkos, S. Papadopoulos, and Y. Kompatsiaris. Social event detection using multimodal clustering and integrating supervisory signals. In ACM ICMR, pp. 23:1–8, 2012.
[3] T. Reuter and P. Cimiano. Event-based classification of social media streams. In ACM ICMR, pp. 22:1–8, 2012.
[4] T. Reuter, S. Papadopoulos, V. Mezaris, P. Cimiano, C. de Vries, and S. Geva. Social Event Detection at MediaEval 2013: Challenges, datasets, and evaluation. In MediaEval 2013 Workshop, 2013.
[5] K. N. Vavliakis, F. A. Tzima, and P. A. Mitkas. Event detection via LDA for the MediaEval2012 SED Task. In MediaEval 2012 Workshop, 2012.
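As an illustration of the preprocessing step described in Section 3.1 (copying locations to a user's non-geotagged photos taken at the same time), the following is a minimal sketch rather than the authors' implementation. The record layout (`user`, `time`, `location` keys), the function name `propagate_locations`, and the interpretation of "same time" as an identical timestamp are all assumptions made for this example.

```python
def propagate_locations(photos):
    """Fill missing locations from geotagged photos of the same user and time.

    photos: list of dicts with the (hypothetical) keys 'user', 'time', and
    'location', where 'location' is None for non-geotagged photos.
    Returns a new list of dicts; the input list is left unchanged.
    """
    # First pass: remember one known location per (user, time) pair.
    # Assumption: "same time" means an identical timestamp value.
    known = {}
    for photo in photos:
        if photo["location"] is not None:
            known.setdefault((photo["user"], photo["time"]), photo["location"])

    # Second pass: copy the known location to photos that lack one.
    result = []
    for photo in photos:
        filled = dict(photo)
        if filled["location"] is None:
            filled["location"] = known.get((filled["user"], filled["time"]))
        result.append(filled)
    return result
```

Photos for which no geotagged counterpart exists keep `location = None` and would be handled by the later matching steps.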
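The topic assignment rule and the generation of initial event clusters from Section 3.1 can be sketched as follows. This is a hedged illustration, not the authors' code: the function names, the dictionary keys, and the handling of a photo with only one topic likelihood (the second-best likelihood is treated as 0.0) are assumptions. Only the threshold value 0.3 and the rule "same temporal, location, and topic cluster gives the same event ID" come from the text.

```python
TAU = 0.3  # threshold on the likelihood gap, as stated in Section 3.1


def assign_topic(topic_likelihoods, tau=TAU):
    """Return the index of the most likely topic, or None if it is ambiguous.

    The topic is assigned only if the best likelihood exceeds the
    second-best likelihood by more than tau.
    """
    if not topic_likelihoods:
        return None
    indices = range(len(topic_likelihoods))
    best = max(indices, key=lambda k: topic_likelihoods[k])
    # Assumption: with a single topic, the second-best likelihood is 0.0.
    second = max(
        (topic_likelihoods[k] for k in indices if k != best), default=0.0
    )
    if topic_likelihoods[best] - second > tau:
        return best
    return None


def initial_event_ids(photos, tau=TAU):
    """Group photos by (temporal cluster, location cluster, topic).

    photos: list of dicts with the (hypothetical) keys 'time_cluster',
    'location_cluster', and 'topic_likelihoods'.
    Returns (event_ids, unassigned): a dict mapping photo index to event ID,
    and the list of photo indices left for the later matching steps.
    """
    event_of_key = {}
    event_ids = {}
    unassigned = []
    for idx, photo in enumerate(photos):
        topic = assign_topic(photo["topic_likelihoods"], tau)
        key = (photo["time_cluster"], photo["location_cluster"], topic)
        if any(part is None for part in key):
            unassigned.append(idx)  # missing time, location, or topic
            continue
        if key not in event_of_key:
            event_of_key[key] = len(event_of_key)  # next unused event ID
        event_ids[idx] = event_of_key[key]
    return event_ids, unassigned
```

Setting `tau=0.0`, as in the video subtask of Section 3.2, assigns a topic whenever the best likelihood is strictly larger than the second-best one.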
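The linking of video clusters to photo clusters in Section 3.2 relies on the Jaccard similarity coefficient, J(V, P) = |V ∩ P| / |V ∪ P|, computed on keyword sets. The sketch below shows one way this could look. The function names, the dictionary-based cluster representation, and the tie-breaking rule (the first photo cluster encountered wins) are assumptions for illustration and are not specified in the paper.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty keyword sets have similarity 0
    return len(a & b) / len(union)


def link_video_clusters(video_keywords, photo_keywords):
    """Link each video cluster to its most similar photo cluster.

    video_keywords, photo_keywords: dicts mapping a cluster ID to an
    iterable of keywords (hypothetical representation).
    Returns a dict: video cluster ID -> (photo cluster ID, similarity).
    """
    links = {}
    for video_id, v_words in video_keywords.items():
        best_id, best_sim = None, -1.0
        for photo_id, p_words in photo_keywords.items():
            sim = jaccard(v_words, p_words)
            if sim > best_sim:  # strict '>' keeps the first cluster on ties
                best_id, best_sim = photo_id, sim
        links[video_id] = (best_id, best_sim)
    return links
```

For example, a video cluster with keywords {rock, concert, live} and a photo cluster with keywords {concert, rock, vienna} share two of four distinct keywords, giving a similarity of 0.5.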