Social Event Detection via sparse multi-modal feature selection and incremental density based clustering

Sina Samangooei (ss@ecs.soton.ac.uk), Jonathon Hare (jsh2@ecs.soton.ac.uk), David Dupplaw (dpd@ecs.soton.ac.uk), Mahesan Niranjan (mn@ecs.soton.ac.uk), Nicholas Gibbins (nmg@ecs.soton.ac.uk), Paul Lewis (phl@ecs.soton.ac.uk)
Electronics and Computer Science, University of Southampton, United Kingdom

ABSTRACT
Combining items from social media streams, such as Flickr photos and Twitter tweets, into meaningful groups can help users contextualise and effectively consume the torrents of information now made available on the social web. This task is made challenging by the scale of the streams and the inherently multimodal nature of the information to be contextualised. We present a methodology which approaches social event detection as a multi-modal clustering task. We address the various challenges of this task: the selection of the features used to compare items to one another; the construction of a single sparse affinity matrix; the combination and relative weighting of the features; and clustering techniques which produce meaningful item groups whilst scaling to large numbers of items. In our best tested configuration we achieve an F1 score of 0.94, showing that a good compromise between precision and recall of clusters can be achieved using our technique.

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

1. INTRODUCTION
In their June 2013 WWDC keynote, Apple announced a new photo collection feature for their iOS mobile operating system. With the evocative tag-line "Life is full of special moments. So is your photo library", Apple noted the importance of clustering social streams by their belonging to some real-life event. This, along with the plethora of mobile and desktop applications which offer some degree of event detection in user photo streams, demonstrates that detecting events in multimedia streams has real practical utility for users.

In this work we present our approach to clustering social media artefacts, addressing Task 1 of the Social Event Detection (SED) challenge of the MediaEval 2013 multimedia evaluation [3]. Task 1 asks that a collection of Flickr photos be organised into events, where events are defined as "events that are planned by people, attended by people and the media illustrating the events are captured by people". The Flickr items in this task contain metadata beyond the content of the image itself. Namely, the Flickr photos are guaranteed to include accurate Flickr IDs, user IDs and time posted. The photos may also contain, in varying degrees of accuracy, location information, time taken according to the camera, and textual information including title, tags and a description.

2. METHODOLOGY
The overarching strategy of our technique is the construction of a square sparse affinity matrix whose elements represent the similarity of two items of social media. Once this object is created, two clustering algorithms are applied. However, the creation of such a matrix for any number of items beyond a trivial number is a time-consuming O(n^2) operation which scales poorly. Therefore, the first stage of our process was the efficient construction of such an affinity matrix.

Given the SED2013 training set, we know that ∼300,000 objects fall into ∼14,000 clusters, with an average of ∼20 items per cluster. From this information we know that the similarity between most pairs of objects must be 0 if similarity is a reasonable indication of cluster membership. However, inducing this sparsity after feature extraction and comparison of the social media objects is without merit: at that point the expensive operation has already been performed and is only forced to 0 after the fact. To address this issue, we construct a Lucene (http://Lucene.apache.org/core/) index of the items to be clustered. The items are indexed using their metadata; each metadata component is given a field in the Lucene index. Then, for each item in the dataset, we construct a custom Lucene query based on the item's metadata, retrieving an artificially limited number of documents. We then extract features from, and compute distances against, only the documents returned by this query. Once the work is done to construct the Lucene index, this operation has a complexity of O(n), which allows a much faster construction of the affinity matrix.

Once documents are filtered using Lucene, the affinity matrix is constructed. The items being clustered are inherently multi-modal. These modalities include time information (both posted and taken), geographic information, textual information (tags, descriptions and titles), as well as the visual information of the Flickr photos themselves. Any of these modalities might serve as a strong signal of cluster membership: photos taken in the same place, or at the same time, or containing similar text might all serve as strong indications of those photos belonging to the same event. On their own, however, the features might also serve to confuse unrelated events; for example, two events happening on a Friday, but one in Nottingham and one in London. Therefore, the first step in the construction of a unified affinity matrix is a separate affinity matrix for each of these features, while the second step is the correct combination of these affinity matrices.

Inspired by Reuter and Cimiano [2], we use a logarithmic distance function for our two time features, and likewise for our Haversine-based geographic distance. Fundamentally, this forces distances and times beyond a certain threshold to count as infinitely far, i.e. as having 0 similarity. For the textual features we use the TF-IDF score, with the IDF statistics calculated against the entire corpus of Flickr objects. We also experimented with SIFT-based visual features for constructing an image feature affinity matrix. However, we found that this feature only made F1 scores worse on the training set, and the visual features were completely ignored in all submitted runs against the test set. If any given feature is missing or empty for either object represented by a particular cell in the affinity matrix, that cell is treated as "not present" in the sparse affinity matrix rather than as having 0 similarity. This distinction is important for how the affinity matrices are combined.

While [2] constructed vectors of similarity, we choose to fuse the similarity features into a single similarity score, constructing a fused affinity matrix. We experimented with various feature-fusion techniques to combine them, including product, max, min, and average; average was shown to work well. Different feature weightings were also investigated. To calculate the combined affinity w_ij of the ith image with the jth image, the feature affinities w_{f,ij} are used for all features f ∈ F_ij, where F_ij is the set of features which have values for both images i and j, i.e. those features "present" in the affinity matrix:

    w_ij = Σ_{f ∈ F_ij} p_f w_{f,ij},   with Σ_f p_f = 1    (1)

where p_f is the weight of a given feature f. The final affinity matrix produced by this process was used by the clustering techniques below.

2.1 Event Clustering
The exact number of events in the corpus was unknown and hard to estimate, so we explored clustering techniques which work without an explicit k. The first technique we experimented with was a modified version of the classic Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, implementing a fast version of the neighbourhood selection stage which works against sparse affinity matrices. Secondly, we explored more sophisticated spectral clustering techniques, which interpret the affinity matrix as the weights of edges in a graph and use graph-theoretic techniques to automatically estimate the number of clusters present. The data items are projected into a metric space which ensures better separation between the clusters; we used a classic cosine-similarity nearest-neighbour DBSCAN to cluster in this spectral space.

In response to the challenges of scale that both DBSCAN and spectral clustering face given enough data, we developed a general incremental clustering technique which takes advantage of the streaming nature of social media data. Namely, if we assume the data appears in an incremental fashion, we can cluster the data in small windows. If we then allow the windows to grow and perform the clustering again, we might notice that certain clusters and their members are stable across window growth. This stability could be defined strictly, as the cluster members not changing whatsoever, or in a relaxed form, as paired clusters with high overlap. Regardless, once a cluster is defined as stable, its items can be removed and not involved in the next window size increase. In this way, the effective number of items to be clustered in any given round increases as items arrive but also decreases as clusters become stable. This allowed us to successfully apply the spectral clustering algorithm to the large training set of 300,000 items, unlike Petkos et al. [1], who also applied a spectral clustering technique to event detection, but on a comparatively small sample.

3. EXPERIMENTS AND RESULTS
For the submitted runs we first had to explore and set the various parameters of our approach against the training set. Our first parameter was the weightings with which the feature affinity matrices were combined. We performed a search across the weighting simplex and found that, for the training set, the best F1 score was achieved when the features were weighted as: p_timetaken = 3, p_timeposted = 0, p_geolocation = 1, p_description = 1, p_tags = 3, p_title = 0. However, we noticed that the F1 scores achieved by the top feature weightings on the simplex were very similar. Therefore, to avoid over-fitting, we selected the top 1000 F1-ranked points on the weighting simplex and averaged their weightings over the training set. This resulted in the features being weighted as: p_timetaken = 2.1, p_timeposted = 1.8, p_geolocation = 1.4, p_description = 0.7, p_tags = 1.7, p_title = 0.3. We performed a similar line search to find the optimal values for DBSCAN's parameters (eps = 0.45, minpts = 3) and spectral clustering's parameters (eigengap = 0.8, eps = −0.6).

The 4 runs we submitted for the MediaEval 2013 SED Task 1 were therefore: DBSCAN with the best weighting, spectral clustering with the best weighting, DBSCAN with the average weighting, and spectral clustering with the average weighting. All runs were performed using our incremental clustering technique. An overall summary of the results can be seen in Table 1.

Table 1: Results from our four runs.

                    DBSCAN         Spectral       DBSCAN           Spectral
                    (best-weight)  (best-weight)  (average-weight) (average-weight)
    F1              0.9454         0.9114         0.9461           0.9024
    NMI             0.9851         0.9765         0.9852           0.9737
    F1 (Div)        0.9350         0.8819         0.9358           0.8663
    Random Base F1  0.0589         0.0580         0.0597           0.0569
    Div F1          0.8865         0.8534         0.8864           0.8455

4. ACKNOWLEDGMENTS
The described work was funded by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreements 270239 (ARCOMEM) and 287863 (TrendMiner).

5. ADDITIONAL AUTHORS
Additional authors: Jamie Davies (email: jagd1g11@ecs.soton.ac.uk), Neha Jain (email: nj1g12@ecs.soton.ac.uk), and John Preston (email: jlp1g11@ecs.soton.ac.uk).

6. REFERENCES
[1] G. Petkos, S. Papadopoulos, and Y. Kompatsiaris. Social event detection using multimodal clustering and integrating supervisory signals. In Proc. ICMR, 2012.
[2] T. Reuter and P. Cimiano. Event-based classification of social media streams. In Proc. ICMR, 2012.
[3] T. Reuter, S. Papadopoulos, V. Mezaris, P. Cimiano, C. de Vries, and S. Geva. Social Event Detection at MediaEval 2013: Challenges, datasets, and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
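To make the candidate-filtering step of Section 2 concrete, the following is a minimal Python analogue of the Lucene-based construction, not the authors' actual implementation: a dict-based inverted index over metadata terms stands in for the per-field Lucene index, the shared-term count stands in for query scoring, the `limit` parameter plays the role of the artificially limited result count, and Jaccard overlap stands in for the real per-feature similarities. All function names and defaults here are illustrative assumptions.

```python
from collections import defaultdict

def build_inverted_index(items):
    """Map each metadata term to the ids of the items containing it
    (a stand-in for the per-field Lucene index described in the paper)."""
    index = defaultdict(set)
    for item_id, terms in items.items():
        for term in terms:
            index[term].add(item_id)
    return index

def candidates(item_id, items, index, limit=100):
    """Retrieve at most `limit` items sharing a term with `item_id`,
    ranked by the number of shared terms (a crude query score)."""
    scores = defaultdict(int)
    for term in items[item_id]:
        for other in index[term]:
            if other != item_id:
                scores[other] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:limit]

def sparse_affinity(items, limit=100):
    """Compute similarity (here, toy Jaccard overlap) only for retrieved
    candidates, giving an O(n * limit) sparse dict-of-dicts matrix."""
    index = build_inverted_index(items)
    affinity = defaultdict(dict)
    for i in items:
        for j in candidates(i, items, index, limit):
            a, b = set(items[i]), set(items[j])
            affinity[i][j] = len(a & b) / len(a | b)
    return affinity
```

Cells for item pairs that share no query terms are simply never materialised, which is exactly the sparsity argument made in Section 2: most pairwise similarities are never computed rather than computed and then zeroed.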
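The paper does not give the exact functional form of its logarithmic time and geographic distance functions; the sketch below shows one plausible realisation in which similarity decays logarithmically with Haversine distance and is clamped to 0 at a cutoff, matching the "beyond a certain distance counts as infinitely far" behaviour described in Section 2. The `log1p` damping and the cutoff value are assumptions, not the authors' parameters.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def log_similarity(distance, cutoff):
    """Logarithmically damped similarity in [0, 1]: identical points score
    1.0, and anything at or beyond `cutoff` scores 0 (treated as
    infinitely far, per Section 2)."""
    if distance >= cutoff:
        return 0.0
    return 1.0 - math.log1p(distance) / math.log1p(cutoff)
```

The same `log_similarity` shape can be applied to time deltas (in, say, minutes) as well as to Haversine distances, which is how the paper describes treating both modalities.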
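Equation (1) can be read as a weighted average over only the features "present" for a given pair, with the weights renormalised so they sum to 1 over that subset; a minimal sketch, in which `None` models a "not present" cell, might look as follows. The renormalisation is our reading of the constraint Σ p_f = 1 combined with the paper's "not present" semantics, not something the paper states explicitly.

```python
def fuse(feature_affinities, weights):
    """Combine per-feature affinities for one matrix cell (Eq. 1):
    only features present for both items contribute, and the weights
    of those features are renormalised to sum to 1."""
    present = {f: a for f, a in feature_affinities.items() if a is not None}
    if not present:
        return None  # the fused cell stays "not present" in the sparse matrix
    total = sum(weights[f] for f in present)
    return sum(weights[f] * a for f, a in present.items()) / total
```

This makes the distinction drawn in Section 2 operational: a missing feature neither drags the fused score towards 0 nor consumes any weight, whereas a genuine 0 similarity does both.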
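The incremental windowed clustering of Section 2.1, under the strict stability criterion (cluster membership unchanged across window growth), might be sketched as below. The `window` and `growth` defaults and the retirement policy are illustrative assumptions, and `cluster_fn` stands in for DBSCAN or spectral clustering over the fused affinity matrix.

```python
def incremental_cluster(stream, cluster_fn, window=3, growth=2):
    """Cluster a stream in growing windows; clusters whose membership is
    identical across two consecutive rounds are declared stable and their
    items retired from subsequent rounds (strict stability, Section 2.1)."""
    stable, active, previous = [], [], []
    for item in stream:
        active.append(item)
        if len(active) >= window:
            current = cluster_fn(active)
            prev_set = {frozenset(c) for c in previous}
            survivors = []
            for cluster in current:
                if frozenset(cluster) in prev_set:
                    stable.append(cluster)      # unchanged since last round: retire
                else:
                    survivors.append(cluster)   # still evolving: keep clustering
            active = [i for c in survivors for i in c]
            previous = survivors
            window *= growth                    # let the window grow
    return stable + (cluster_fn(active) if active else [])
```

Retiring stable clusters is what keeps the per-round problem size bounded: the active set grows as new items arrive but shrinks as events finish, which is how the authors report scaling spectral clustering to 300,000 items.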