Social Event Detection via sparse multi-modal feature selection and incremental density based clustering

Sina Samangooei (ss@ecs.soton.ac.uk), Jonathon Hare (jsh2@ecs.soton.ac.uk), David Dupplaw (dpd@ecs.soton.ac.uk), Mahesan Niranjan (mn@ecs.soton.ac.uk), Nicholas Gibbins (nmg@ecs.soton.ac.uk), Paul Lewis (phl@ecs.soton.ac.uk)
Electronics and Computer Science, University of Southampton, United Kingdom

ABSTRACT
Combining items from social media streams, such as Flickr photos and Twitter tweets, into meaningful groups can help users contextualise and effectively consume the torrents of information now made available on the social web. This task is made challenging by the scale of the streams and the inherently multimodal nature of the information to be contextualised. We present a methodology which approaches social event detection as a multi-modal clustering task. We address the various challenges of this task: the selection of the features used to compare items to one another; the construction of a single sparse affinity matrix; the combination and relative weighting of the features; and clustering techniques which produce meaningful item groups whilst scaling to large numbers of items. In our best tested configuration we achieve an F1 score of 0.94, showing that a good compromise between precision and recall of clusters can be achieved using our technique.

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

1. INTRODUCTION
In their June 2013 WWDC keynote, Apple announced a new photo collection feature for their iOS mobile operating system. With the evocative tag-line "Life is full of special moments. So is your photo library", Apple noted the importance of clustering social streams by their belonging to some real-life event. This, along with the plethora of mobile and desktop applications which offer some degree of event detection in user photo streams, demonstrates that detecting events in multimedia streams has real practical utility for users.

In this work we present our approach to clustering social media artefacts, addressing Task 1 of the Social Event Detection (SED) challenge of the MediaEval 2013 multimedia evaluation [3]. Task 1 asks that a collection of Flickr photos be organised into events, where events are defined as "events that are planned by people, attended by people and the media illustrating the events are captured by people". The Flickr items in this task contain metadata beyond the content of the image itself. Namely, the Flickr photos are guaranteed to include accurate Flickr IDs, user IDs and time posted. The photos may also contain, in varying degrees of accuracy, location information, time taken according to the camera, and textual information including title, tags and a description.

2. METHODOLOGY
The overarching strategy of our technique is the construction of a square sparse affinity matrix whose elements represent the similarity of two items of social media. Once this object is created, two clustering algorithms are applied. However, the creation of such a matrix for any number of items beyond a trivial number is a time-consuming O(n^2) operation which scales poorly. Therefore, the first stage of our process was the efficient construction of such an affinity matrix.

Given the SED2013 training set, we know that ∼300,000 objects fall into ∼14,000 clusters, with an average of ∼20 items per cluster. From this information we know that the similarity between most pairs of objects must be 0 if similarity is a reasonable indication of cluster membership. However, inducing this sparsity after feature extraction and comparison of the social media objects is without merit: at that point the expensive operation has already been performed and is only forced to 0 after the fact. To address this issue, we construct a Lucene (http://Lucene.apache.org/core/) index of the items to be clustered. The items are indexed using their metadata; each metadata component is given a field in the Lucene index. Then, for each item in the dataset, we construct a custom Lucene query based on the item's metadata, retrieving an artificially limited number of documents. We then extract features from, and compute distances against, only the documents returned by this query. Once the work is done to construct the Lucene index, this operation has a complexity of O(n), which allows a much faster construction of the affinity matrix.

Once documents are filtered using Lucene, the affinity matrix is constructed. The items being clustered are inherently multi-modal. These modalities include time information (both posted and taken), geographic information, textual information (tags, descriptions and titles), as well as the visual information of the Flickr photos themselves. Any of these modalities might serve as a strong signal of cluster membership: photos taken in the same place, or at the same time, or containing similar text might all serve as strong indications of those photos belonging to the same event. On their own, however, the features might also serve to confuse unrelated events; for example, two events happening on a Friday, but one in Nottingham and one in London. Therefore, the first step in the construction of a unified affinity matrix is a separate affinity matrix for each of these features, while the second step is the correct combination of these affinity matrices.

Inspired by Reuter and Cimiano [2], we use a logarithmic distance function for our two time features, and likewise for our Haversine-based geographic distance. Fundamentally, this forces distances and times beyond a certain threshold to count as infinitely far, i.e. as having 0 similarity. For the textual features we use the TF-IDF score, with the IDF statistics calculated against the entire corpus of Flickr objects. We also experimented with SIFT-based visual features for constructing an image feature affinity matrix. However, we found that this feature only made F1 scores worse on the training set, and the visual features were completely ignored in all submitted runs against the test set. If any given feature is missing or empty for either object represented by a particular cell in the affinity matrix, that cell is treated as "not present" in the sparse affinity matrix rather than as having 0 similarity. This distinction is important for how the affinity matrices are combined.

While [2] constructed vectors of similarity, we choose to fuse the similarity features into a single similarity score, constructing a fused affinity matrix. We experimented with various feature-fusion techniques to combine them, including product, max, min, and average; average was shown to work well. Different feature weightings were also investigated. To calculate the combined affinity w_ij of the ith image with the jth image, the feature affinities w_{f,ij} are used for all features f ∈ F_ij, where F_ij is the set of features which have values for both images i and j, i.e. those features "present" in the affinity matrix:

    w_ij = Σ_{f ∈ F_ij} p_f w_{f,ij},   with Σ_f p_f = 1    (1)

where p_f is the weight of a given feature f. The final affinity matrix produced by this process was used by the clustering techniques below.

2.1 Event Clustering
The exact number of events in the corpus was unknown and hard to estimate, so we explored clustering techniques which work without an explicit k. The first technique we experimented with was a modified version of the classic Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, implementing a fast version of the neighbourhood selection stage which works against sparse affinity matrices. Secondly, we explored more sophisticated spectral clustering techniques, which interpret the affinity matrix as the weights of edges in a graph and use graph-theoretic techniques to automatically estimate the number of clusters present. The data items are projected into a metric space which ensures better separation between the clusters; we used a classic cosine-similarity nearest-neighbour DBSCAN to cluster in this spectral space.

In response to the challenges of scale that both DBSCAN and spectral clustering face given enough data, we developed a general incremental clustering technique which takes advantage of the streaming nature of social media data. Namely, if we assume the data appears in an incremental fashion, we can cluster the data in small windows. If we then allow the windows to grow and perform the clustering again, we might notice that certain clusters and their members are stable across window growth. This stability could be defined strictly, as the cluster members not changing whatsoever, or in a relaxed form, as paired clusters with high overlap. Regardless, once a cluster is defined as stable, its items can be removed and not involved in the next window size increase. In this way, the effective number of items to be clustered in any given round increases as items arrive but also decreases as clusters become stable. This allowed us to successfully apply the spectral clustering algorithm to the large training set of 300,000 items, unlike Petkos et al. [1], who also applied a spectral clustering technique to event detection, but on a comparatively small sample.

3. EXPERIMENTS AND RESULTS
For the submitted runs we first had to explore and set the various parameters of our approach against the training set. Our first parameter was the weightings with which the feature affinity matrices were combined. We performed a search across the weighting simplex and found that, for the training set, the best F1 score was achieved when the features were weighted as: p_timetaken = 3, p_timeposted = 0, p_geolocation = 1, p_description = 1, p_tags = 3, p_title = 0. However, we noticed that the F1 scores achieved by the top feature weightings on the simplex were very similar. Therefore, to avoid over-fitting, we selected the top 1000 F1-ranked points on the weighting simplex and averaged their weightings over the training set. This resulted in the features being weighted as: p_timetaken = 2.1, p_timeposted = 1.8, p_geolocation = 1.4, p_description = 0.7, p_tags = 1.7, p_title = 0.3. We performed a similar line search to find the optimal values for DBSCAN's parameters (eps = 0.45, minpts = 3) and spectral clustering's parameters (eigengap = 0.8, eps = −0.6).

The 4 runs we submitted for the MediaEval 2013 SED Task 1 were therefore: DBSCAN with the best weighting, spectral clustering with the best weighting, DBSCAN with the average weighting, and spectral clustering with the average weighting. All runs were performed using our incremental clustering technique. An overall summary of the results can be seen in Table 1.

Table 1: Results from our four runs.

                    DBSCAN         Spectral       DBSCAN           Spectral
                    (best-weight)  (best-weight)  (average-weight) (average-weight)
    F1              0.9454         0.9114         0.9461           0.9024
    NMI             0.9851         0.9765         0.9852           0.9737
    F1 (Div)        0.9350         0.8819         0.9358           0.8663
    Random Base F1  0.0589         0.0580         0.0597           0.0569
    Div F1          0.8865         0.8534         0.8864           0.8455

4. ACKNOWLEDGMENTS
The described work was funded by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreements 270239 (ARCOMEM) and 287863 (TrendMiner).

5. ADDITIONAL AUTHORS
Additional authors: Jamie Davies (email: jagd1g11@ecs.soton.ac.uk), Neha Jain (email: nj1g12@ecs.soton.ac.uk), and John Preston (email: jlp1g11@ecs.soton.ac.uk).

6. REFERENCES
[1] G. Petkos, S. Papadopoulos, and Y. Kompatsiaris. Social event detection using multimodal clustering and integrating supervisory signals. In Proc. ICMR, 2012.
[2] T. Reuter and P. Cimiano. Event-based classification of social media streams. In Proc. ICMR, 2012.
[3] T. Reuter, S. Papadopoulos, V. Mezaris, P. Cimiano, C. de Vries, and S. Geva. Social Event Detection at MediaEval 2013: Challenges, datasets, and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
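To make the candidate-filtering step of Section 2 concrete, the following is a minimal Python analogue of the Lucene-based construction, not the authors' actual implementation: a dict-based inverted index over metadata terms stands in for the per-field Lucene index, the shared-term count stands in for query scoring, the `limit` parameter plays the role of the artificially limited result count, and Jaccard overlap stands in for the real per-feature similarities. All function names and defaults here are illustrative assumptions.

```python
from collections import defaultdict

def build_inverted_index(items):
    """Map each metadata term to the ids of the items containing it
    (a stand-in for the per-field Lucene index described in the paper)."""
    index = defaultdict(set)
    for item_id, terms in items.items():
        for term in terms:
            index[term].add(item_id)
    return index

def candidates(item_id, items, index, limit=100):
    """Retrieve at most `limit` items sharing a term with `item_id`,
    ranked by the number of shared terms (a crude query score)."""
    scores = defaultdict(int)
    for term in items[item_id]:
        for other in index[term]:
            if other != item_id:
                scores[other] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:limit]

def sparse_affinity(items, limit=100):
    """Compute similarity (here, toy Jaccard overlap) only for retrieved
    candidates, giving an O(n * limit) sparse dict-of-dicts matrix."""
    index = build_inverted_index(items)
    affinity = defaultdict(dict)
    for i in items:
        for j in candidates(i, items, index, limit):
            a, b = set(items[i]), set(items[j])
            affinity[i][j] = len(a & b) / len(a | b)
    return affinity
```

Cells for item pairs that share no query terms are simply never materialised, which is exactly the sparsity argument made in Section 2: most pairwise similarities are never computed rather than computed and then zeroed.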
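The paper does not give the exact functional form of its logarithmic time and geographic distance functions; the sketch below shows one plausible realisation in which similarity decays logarithmically with Haversine distance and is clamped to 0 at a cutoff, matching the "beyond a certain distance counts as infinitely far" behaviour described in Section 2. The `log1p` damping and the cutoff value are assumptions, not the authors' parameters.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def log_similarity(distance, cutoff):
    """Logarithmically damped similarity in [0, 1]: identical points score
    1.0, and anything at or beyond `cutoff` scores 0 (treated as
    infinitely far, per Section 2)."""
    if distance >= cutoff:
        return 0.0
    return 1.0 - math.log1p(distance) / math.log1p(cutoff)
```

The same `log_similarity` shape can be applied to time deltas (in, say, minutes) as well as to Haversine distances, which is how the paper describes treating both modalities.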
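Equation (1) can be read as a weighted average over only the features "present" for a given pair, with the weights renormalised so they sum to 1 over that subset; a minimal sketch, in which `None` models a "not present" cell, might look as follows. The renormalisation is our reading of the constraint Σ p_f = 1 combined with the paper's "not present" semantics, not something the paper states explicitly.

```python
def fuse(feature_affinities, weights):
    """Combine per-feature affinities for one matrix cell (Eq. 1):
    only features present for both items contribute, and the weights
    of those features are renormalised to sum to 1."""
    present = {f: a for f, a in feature_affinities.items() if a is not None}
    if not present:
        return None  # the fused cell stays "not present" in the sparse matrix
    total = sum(weights[f] for f in present)
    return sum(weights[f] * a for f, a in present.items()) / total
```

This makes the distinction drawn in Section 2 operational: a missing feature neither drags the fused score towards 0 nor consumes any weight, whereas a genuine 0 similarity does both.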
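The incremental windowed clustering of Section 2.1, under the strict stability criterion (cluster membership unchanged across window growth), might be sketched as below. The `window` and `growth` defaults and the retirement policy are illustrative assumptions, and `cluster_fn` stands in for DBSCAN or spectral clustering over the fused affinity matrix.

```python
def incremental_cluster(stream, cluster_fn, window=3, growth=2):
    """Cluster a stream in growing windows; clusters whose membership is
    identical across two consecutive rounds are declared stable and their
    items retired from subsequent rounds (strict stability, Section 2.1)."""
    stable, active, previous = [], [], []
    for item in stream:
        active.append(item)
        if len(active) >= window:
            current = cluster_fn(active)
            prev_set = {frozenset(c) for c in previous}
            survivors = []
            for cluster in current:
                if frozenset(cluster) in prev_set:
                    stable.append(cluster)      # unchanged since last round: retire
                else:
                    survivors.append(cluster)   # still evolving: keep clustering
            active = [i for c in survivors for i in c]
            previous = survivors
            window *= growth                    # let the window grow
    return stable + (cluster_fn(active) if active else [])
```

Retiring stable clusters is what keeps the per-round problem size bounded: the active set grows as new items arrive but shrinks as events finish, which is how the authors report scaling spectral clustering to 300,000 items.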