Social Event Detection with Clustering and Filtering

Yanxiang Wang (Australian National University, u4950984@anu.edu.au), Lexing Xie (Australian National University, lexing.xie@anu.edu.au), Hari Sundaram (Arizona State University, hari.sundaram@asu.edu)

ABSTRACT
We present a clustering and filtering approach for the Social Event Detection task in MediaEval 2011. Our algorithm makes use of time and location as well as textual and visual features. We cluster the multimedia documents and then apply retrieval-based filtering using the partially specified event properties.

1. INTRODUCTION
The Social Event Detection (SED) [5] task at MediaEval 2011 presents a challenging problem: retrieving and organizing social media around real-world events, such as sports games or events at a given concert venue. A key difference between the SED problem and earlier work on media event detection is that information about the target events is partially specified (via venue or type of sport), rather than completely unspecified [1, 2, 7] or specified for each event with examples [3].

This problem specification motivates us to adopt a hybrid clustering and filtering approach. We first cluster the dataset with approaches similar to Becker [1] and Papadopoulos [6], tuned on a separate training set. We then filter the resulting clusters using retrieval approaches on time, location, text and visual information.

2. APPROACHES
Since the SED task provides only an evaluation dataset [5], we compile a separate training collection using a subset of the Upcoming dataset [1] plus additional random photos from Flickr. To mimic the challenge posed by SED 2011, the training subset only contains Upcoming events that are sports and music; the random photos are drawn from the same timeframe as the existing events. The performance of the algorithm is evaluated against the ground-truth Upcoming events using F1.

The overall flow of our algorithm is shown in Figure 1. We perform two clustering phases before the filtering step.

Figure 1: Overview of the clustering and filtering steps.

2.1 Clustering the data set
We use a single-pass incremental clustering algorithm [1] to cluster the data. The similarity metrics used for the time-stamp, location, tag and textual features are as follows:

- Time-stamp: We represent a time value as the minutes elapsed since the beginning of the Unix epoch. If two times are more than a week apart, their similarity is 0. Otherwise, the similarity between two time-stamps t1 and t2 is s_t = 1 - |t1 - t2| / t_w, where t_w is the number of minutes in a week.

- Location: We compute the great circle distance (GCD, http://en.wikipedia.org/wiki/Great-circle_distance) between a pair of locations using the GeoPy library (http://code.google.com/p/geopy/). We set the location similarity s_l to 0 if the GCD is greater than 50 miles; otherwise s_l = 1 - GCD/50.

- Tags: We use the Jaccard index (http://en.wikipedia.org/wiki/Jaccard_index) as the similarity s_g between two tag sets.

- Text: We obtain a term-frequency vector from the photo title and description after stemming and eliminating stop words. The cosine similarity between these vectors is used as the text similarity s_w.

In clustering phase C2, we use a weighted combination of the similarity functions, s_0 = w_g s_g + w_w s_w + w_l s_l. We use w_g = 0.65, and w_l = 0.15, w_w = 0.2 if location data is available for both photos, otherwise w_w = 0.35. The centroid of each cluster is kept at the end of the clustering step for filtering; a sketch of these similarity computations is given below.
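The following is a minimal Python sketch of the per-feature similarities and of their weighted combination, together with a simplified single-pass assignment loop. The photo representation (a dict with time, lat, lon, tags and text fields), the haversine formula standing in for the GeoPy call, and the centroid handling are illustrative assumptions, not the exact implementation.

import math

MINUTES_PER_WEEK = 7 * 24 * 60   # t_w
MAX_MILES = 50.0                 # location cut-off

def time_sim(t1, t2):
    # s_t = 1 - |t1 - t2| / t_w, zero if the photos are more than a week apart.
    diff = abs(t1 - t2)
    return 0.0 if diff > MINUTES_PER_WEEK else 1.0 - diff / MINUTES_PER_WEEK

def location_sim(lat1, lon1, lat2, lon2):
    # s_l = 1 - GCD/50, zero beyond 50 miles (haversine great-circle distance).
    R = 3958.8  # earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    gcd = 2 * R * math.asin(math.sqrt(a))
    return 0.0 if gcd > MAX_MILES else 1.0 - gcd / MAX_MILES

def tag_sim(tags1, tags2):
    # Jaccard index between two tag sets.
    if not tags1 and not tags2:
        return 0.0
    return len(tags1 & tags2) / len(tags1 | tags2)

def text_sim(tf1, tf2):
    # Cosine similarity between two term-frequency dicts.
    dot = sum(tf1[t] * tf2.get(t, 0) for t in tf1)
    n1 = math.sqrt(sum(v * v for v in tf1.values()))
    n2 = math.sqrt(sum(v * v for v in tf2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def combined_sim(p, q):
    # s_0 = w_g*s_g + w_w*s_w + w_l*s_l, reweighting text when location is missing.
    s_g = tag_sim(p["tags"], q["tags"])
    s_w = text_sim(p["text"], q["text"])
    has_loc = None not in (p.get("lat"), p.get("lon"), q.get("lat"), q.get("lon"))
    if has_loc:
        s_l = location_sim(p["lat"], p["lon"], q["lat"], q["lon"])
        return 0.65 * s_g + 0.2 * s_w + 0.15 * s_l
    return 0.65 * s_g + 0.35 * s_w

def single_pass_cluster(photos, threshold):
    # Single-pass incremental clustering: assign each photo to the best existing
    # cluster if its similarity to that cluster's centroid exceeds `threshold`,
    # otherwise start a new cluster.
    clusters = []   # each cluster: {"centroid": photo-like dict, "members": [...]}
    for p in photos:
        best, best_sim = None, threshold
        for c in clusters:
            s = combined_sim(p, c["centroid"])
            if s > best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append({"centroid": dict(p), "members": [p]})
        else:
            best["members"].append(p)
            # a full implementation would update the centroid's tag set and
            # term-frequency vector here; the sketch keeps the first member.
    return clusters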
2.2 Retrieving relevant event clusters
In the first phase of the filtering step, we remove the clusters that fall outside the specified time and location constraints.

We subsequently filter the clusters using the text and tags associated with the query term. We generate a text vector and a tag vector for each query term, constructing the two vectors via two Flickr API (http://www.flickr.com/services/api/) methods. To construct the text vector, we call the method flickr.photos.search with the query term and build the vector by normalizing the text content of the 100 most relevant results. Similarly, we call the method flickr.tags.getClusters with the query term to retrieve a set of tags statistically associated with it. We then use the weighted combination similarity function described in Section 2.1 to compute the similarity between each cluster centroid and the query document, and specify a threshold µ below which clusters are discarded.

In the third filtering phase (F3), clusters are filtered based on their visual information. We use a visual classifier [8] to label all photos in each cluster, and manually construct (key, value) pairs representing invalid class labels and their corresponding thresholds. A cluster is discarded if the fraction of its photos carrying an invalid label is greater than the threshold value. A sketch of this filtering cascade follows.
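The sketch below, which reuses the similarity helpers from the Section 2.1 sketch, illustrates the three-stage filtering cascade. The Flickr REST call used to build the query text vector, the bounding-box form of the location constraint, and the invalid_label_thresholds mapping are illustrative assumptions; the query tag set would be obtained analogously from flickr.tags.getClusters.

import requests
from collections import Counter

FLICKR_REST = "https://api.flickr.com/services/rest/"

def build_query_text_vector(api_key, query, n=100):
    # Term-frequency vector from the titles and descriptions of the top-n
    # flickr.photos.search results (stemming and stop-word removal omitted).
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": query,
        "sort": "relevance",
        "extras": "description",
        "per_page": n,
        "format": "json",
        "nojsoncallback": 1,
    }
    photos = requests.get(FLICKR_REST, params=params).json()["photos"]["photo"]
    tf = Counter()
    for p in photos:
        tf.update(p.get("title", "").lower().split())
        tf.update(p.get("description", {}).get("_content", "").lower().split())
    total = sum(tf.values()) or 1
    return {t: c / total for t, c in tf.items()}   # normalized text vector

def inside(box, lat, lon):
    # box = (min_lat, min_lon, max_lat, max_lon)
    return box[0] <= lat <= box[2] and box[1] <= lon <= box[3]

def filter_clusters(clusters, query_doc, mu, t_range, loc_box,
                    visual_labels, invalid_label_thresholds):
    # query_doc is {"tags": <tag set from flickr.tags.getClusters>,
    #               "text": <vector from build_query_text_vector>}.
    kept = []
    for c in clusters:
        cen = c["centroid"]
        # F1: time and location constraints.
        if not (t_range[0] <= cen["time"] <= t_range[1]):
            continue
        if cen.get("lat") is not None and not inside(loc_box, cen["lat"], cen["lon"]):
            continue
        # F2: similarity between the centroid and the query document.
        if combined_sim(cen, query_doc) < mu:
            continue
        # F3: fraction of photos carrying an invalid visual label.
        labels = [visual_labels.get(p["id"]) for p in c["members"]]
        if any(labels.count(lbl) / len(labels) > thr
               for lbl, thr in invalid_label_thresholds.items()):
            continue
        kept.append(c)
    return kept

Clusters that pass all three tests constitute the retrieved events for a given value of µ.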
3. RESULTS
For challenge 1, we feed the search terms 'Barcelona', 'soccer' and 'Rome', 'soccer' to the Flickr API methods and perform three runs with different settings of µ, shown in Table 1.

Table 1: Challenge 1 results.
Run No.     1 (µ = 0.2)   2 (µ = 0.1)   3 (µ = 0.05)
Precision   12.53         62.88         84.86
Recall      58.79         52.93         52.54
F1          20.65         57.48         64.9
NMI         0.1166        0.2207        0.2367

For challenge 2, in addition to the runs from the search terms 'Paradiso' and 'Parc del Forum', we follow the idea of Liu [4]: we construct the tag set and text vector from the artists' names, titles and descriptions of each event found in the last.fm event directory (http://www.last.fm/api) to produce a supplementary run (run 4 in Table 2).

Table 2: Challenge 2 results.
Run No.     1 (µ = 0.2)   2 (µ = 0.1)   3 (µ = 0.05)   4 (µ = 0.1, last.fm)
Precision   38.5          59.26         66.89          56.16
Recall      66.34         43.9          6.04           18.9
F1          48.72         50.44         11.07          28.28
NMI         0.2941        0.448         0.2705         0.4491

While our results show promise, they can be substantially improved. The best-performing result with µ = 0.1 under the F1 evaluation is nonetheless at an acceptable level. The results also show that our recall is, on average, lower than our precision. In future work we will therefore refine the filtering method to improve recall; possible directions include alternative tag and text construction strategies and augmented visual filtering. To address the low NMI values, we will study the clustering results in more detail to gain further insight.

4. REFERENCES
[1] H. Becker, M. Naaman, and L. Gravano. Learning similarity metrics for event identification in social media. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 291–300, 2010.
[2] L. Chen and A. Roy. Event detection from flickr data through wavelet-based spatial analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages 523–532, 2009.
[3] C. S. Firan, M. Georgescu, W. Nejdl, and R. Paiu. Bringing order to your photos: event-driven classification of flickr images based on social knowledge. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 189–198, 2010.
[4] X. Liu, R. Troncy, and B. Huet. Finding media illustrating events. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, ICMR '11, pages 58:1–58:8, 2011.
[5] S. Papadopoulos, R. Troncy, V. Mezaris, B. Huet, and I. Kompatsiaris. Social Event Detection at MediaEval 2011: Challenges, Dataset and Evaluation. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.
[6] S. Papadopoulos, C. Zigkolis, Y. Kompatsiaris, and A. Vakali. Cluster-based landmark and event detection for tagged photo collections. IEEE MultiMedia, 18(1):52–63, 2011.
[7] T. Rattenbury, N. Good, and M. Naaman. Towards automatic extraction of event and place semantics from flickr tags. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '07, pages 103–110, 2007.
[8] J. R. Smith et al. IBM multimedia analysis and retrieval system. http://www.alphaworks.ibm.com/tech/imars.