Social Event Detection with Clustering and Filtering

Yanxiang Wang (Australian National University, u4950984@anu.edu.au), Lexing Xie (Australian National University, lexing.xie@anu.edu.au), Hari Sundaram (Arizona State University, hari.sundaram@asu.edu)

ABSTRACT
We present a clustering and filtering approach for the Social Event Detection task in MediaEval 2011. Our algorithm makes use of time and location as well as textual and visual features. We cluster the multimedia documents and then apply retrieval-based filtering using the partially specified event properties.

1. INTRODUCTION
The Social Event Detection (SED) [5] task at MediaEval 2011 presents a challenging problem: retrieving and organizing social media around real-world events, such as sports games or events at a given concert venue. A key difference between the SED problem and earlier work on media event detection is that information about the target events is partially specified (via venue or type of sport), rather than completely unspecified [1, 2, 7] or specified for each event with examples [3].

This problem specification motivates us to adopt a hybrid clustering and filtering approach. We first cluster the dataset with approaches similar to Becker [1] and Papadopoulos [6], tuned on a separate training set. We then filter the resulting clusters using retrieval approaches on time, location, text and visual information.

2. APPROACHES
Since the SED task provides only an evaluation dataset [5], we compile a separate training collection using a subset of the Upcoming dataset [1] plus additional random photos from Flickr. To mimic the challenge posed by SED 2011, the training subset only contains Upcoming events that are sports and music; the random photos are drawn from the same timeframe as the existing events. The performance of the algorithm is evaluated against the ground-truth Upcoming events using F1.

The overall flow of our algorithm is shown in Figure 1. We perform two clustering phases before the filtering step.

Figure 1: Overview of the clustering and filtering steps.

2.1 Clustering the data set
We use a single-pass incremental clustering algorithm [1] to cluster the data. The similarity metrics used for the time-stamp, location, tag and textual features are as follows:

- Time-stamp: We represent a time value as the minutes elapsed since the beginning of the Unix epoch. If two times are more than a week apart, their similarity is 0. Otherwise, the similarity between two time-stamps t1 and t2 is s_t = 1 - |t1 - t2| / t_w, where t_w is the number of minutes in a week.

- Location: We compute the great circle distance (GCD, http://en.wikipedia.org/wiki/Great-circle_distance) between a pair of locations using the GeoPy library (http://code.google.com/p/geopy/). We set the location similarity s_l to 0 if the GCD is greater than 50 miles; otherwise s_l = 1 - GCD/50.

- Tags: We use the Jaccard index (http://en.wikipedia.org/wiki/Jaccard_index) as the similarity s_g between two tag sets.

- Text: We obtain a term-frequency vector from the photo title and description after stemming and eliminating stop words. The cosine similarity between these vectors is used as the text similarity s_w.

In clustering phase C2, we use a weighted combination of the similarity functions, s_0 = w_g s_g + w_w s_w + w_l s_l. We use w_g = 0.65, and w_l = 0.15, w_w = 0.2 if location data is available for both photos, otherwise w_w = 0.35. The centroid of each cluster is kept at the end of the clustering step for filtering; a sketch of these similarity computations is given below.
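The following is a minimal Python sketch of the per-feature similarities and of their weighted combination, together with a simplified single-pass assignment loop. The photo representation (a dict with time, lat, lon, tags and text fields), the haversine formula standing in for the GeoPy call, and the centroid handling are illustrative assumptions, not the exact implementation.

import math

MINUTES_PER_WEEK = 7 * 24 * 60   # t_w
MAX_MILES = 50.0                 # location cut-off

def time_sim(t1, t2):
    # s_t = 1 - |t1 - t2| / t_w, zero if the photos are more than a week apart.
    diff = abs(t1 - t2)
    return 0.0 if diff > MINUTES_PER_WEEK else 1.0 - diff / MINUTES_PER_WEEK

def location_sim(lat1, lon1, lat2, lon2):
    # s_l = 1 - GCD/50, zero beyond 50 miles (haversine great-circle distance).
    R = 3958.8  # earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    gcd = 2 * R * math.asin(math.sqrt(a))
    return 0.0 if gcd > MAX_MILES else 1.0 - gcd / MAX_MILES

def tag_sim(tags1, tags2):
    # Jaccard index between two tag sets.
    if not tags1 and not tags2:
        return 0.0
    return len(tags1 & tags2) / len(tags1 | tags2)

def text_sim(tf1, tf2):
    # Cosine similarity between two term-frequency dicts.
    dot = sum(tf1[t] * tf2.get(t, 0) for t in tf1)
    n1 = math.sqrt(sum(v * v for v in tf1.values()))
    n2 = math.sqrt(sum(v * v for v in tf2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def combined_sim(p, q):
    # s_0 = w_g*s_g + w_w*s_w + w_l*s_l, reweighting text when location is missing.
    s_g = tag_sim(p["tags"], q["tags"])
    s_w = text_sim(p["text"], q["text"])
    has_loc = None not in (p.get("lat"), p.get("lon"), q.get("lat"), q.get("lon"))
    if has_loc:
        s_l = location_sim(p["lat"], p["lon"], q["lat"], q["lon"])
        return 0.65 * s_g + 0.2 * s_w + 0.15 * s_l
    return 0.65 * s_g + 0.35 * s_w

def single_pass_cluster(photos, threshold):
    # Single-pass incremental clustering: assign each photo to the best existing
    # cluster if its similarity to that cluster's centroid exceeds `threshold`,
    # otherwise start a new cluster.
    clusters = []   # each cluster: {"centroid": photo-like dict, "members": [...]}
    for p in photos:
        best, best_sim = None, threshold
        for c in clusters:
            s = combined_sim(p, c["centroid"])
            if s > best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append({"centroid": dict(p), "members": [p]})
        else:
            best["members"].append(p)
            # a full implementation would update the centroid's tag set and
            # term-frequency vector here; the sketch keeps the first member.
    return clusters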
2.2 Retrieving relevant event clusters
In the first phase of the filtering step, we remove the clusters that fall outside the specified time and location constraints.

We subsequently filter the clusters using the text and tags associated with the query term. We generate a text vector and a tag vector for each query term, constructing the two vectors via two Flickr API (http://www.flickr.com/services/api/) methods. To construct the text vector, we call the method flickr.photos.search with the query term and build the vector by normalizing the text content of the 100 most relevant results. Similarly, we call the method flickr.tags.getClusters with the query term to retrieve a set of tags statistically associated with it. We then use the weighted combination similarity function described in Section 2.1 to compute the similarity between each cluster centroid and the query document, and specify a threshold µ below which clusters are discarded.

In the third filtering phase (F3), clusters are filtered based on their visual information. We use a visual classifier [8] to label all photos in each cluster, and manually construct (key, value) pairs representing invalid class labels and their corresponding thresholds. A cluster is discarded if the fraction of its photos carrying an invalid label is greater than the threshold value. A sketch of this filtering cascade follows.
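The sketch below, which reuses the similarity helpers from the Section 2.1 sketch, illustrates the three-stage filtering cascade. The Flickr REST call used to build the query text vector, the bounding-box form of the location constraint, and the invalid_label_thresholds mapping are illustrative assumptions; the query tag set would be obtained analogously from flickr.tags.getClusters.

import requests
from collections import Counter

FLICKR_REST = "https://api.flickr.com/services/rest/"

def build_query_text_vector(api_key, query, n=100):
    # Term-frequency vector from the titles and descriptions of the top-n
    # flickr.photos.search results (stemming and stop-word removal omitted).
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": query,
        "sort": "relevance",
        "extras": "description",
        "per_page": n,
        "format": "json",
        "nojsoncallback": 1,
    }
    photos = requests.get(FLICKR_REST, params=params).json()["photos"]["photo"]
    tf = Counter()
    for p in photos:
        tf.update(p.get("title", "").lower().split())
        tf.update(p.get("description", {}).get("_content", "").lower().split())
    total = sum(tf.values()) or 1
    return {t: c / total for t, c in tf.items()}   # normalized text vector

def inside(box, lat, lon):
    # box = (min_lat, min_lon, max_lat, max_lon)
    return box[0] <= lat <= box[2] and box[1] <= lon <= box[3]

def filter_clusters(clusters, query_doc, mu, t_range, loc_box,
                    visual_labels, invalid_label_thresholds):
    # query_doc is {"tags": <tag set from flickr.tags.getClusters>,
    #               "text": <vector from build_query_text_vector>}.
    kept = []
    for c in clusters:
        cen = c["centroid"]
        # F1: time and location constraints.
        if not (t_range[0] <= cen["time"] <= t_range[1]):
            continue
        if cen.get("lat") is not None and not inside(loc_box, cen["lat"], cen["lon"]):
            continue
        # F2: similarity between the centroid and the query document.
        if combined_sim(cen, query_doc) < mu:
            continue
        # F3: fraction of photos carrying an invalid visual label.
        labels = [visual_labels.get(p["id"]) for p in c["members"]]
        if any(labels.count(lbl) / len(labels) > thr
               for lbl, thr in invalid_label_thresholds.items()):
            continue
        kept.append(c)
    return kept

Clusters that pass all three tests constitute the retrieved events for a given value of µ.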
3. RESULTS
For challenge 1, we feed the search terms 'Barcelona', 'soccer' and 'Rome', 'soccer' to the Flickr API methods and perform three runs with different settings of µ, shown in Table 1.

Table 1: Challenge 1 results.
Run No.     1 (µ = 0.2)   2 (µ = 0.1)   3 (µ = 0.05)
Precision   12.53         62.88         84.86
Recall      58.79         52.93         52.54
F1          20.65         57.48         64.9
NMI         0.1166        0.2207        0.2367

For challenge 2, in addition to the runs from the search terms 'Paradiso' and 'Parc del Forum', we follow the idea of Liu [4]: we construct the tag set and text vector from the artists' names, titles and descriptions of each event found in the last.fm event directory (http://www.last.fm/api) to produce a supplementary run (run 4 in Table 2).

Table 2: Challenge 2 results.
Run No.     1 (µ = 0.2)   2 (µ = 0.1)   3 (µ = 0.05)   4 (µ = 0.1, last.fm)
Precision   38.5          59.26         66.89          56.16
Recall      66.34         43.9          6.04           18.9
F1          48.72         50.44         11.07          28.28
NMI         0.2941        0.448         0.2705         0.4491

While our results show promise, they can be substantially improved. The best-performing result with µ = 0.1 under the F1 evaluation is nonetheless at an acceptable level. The results also show that our recall is, on average, lower than our precision. In future work we will therefore refine the filtering method to improve recall; possible directions include alternative tag and text construction strategies and augmented visual filtering. To address the low NMI values, we will study the clustering results in more detail to gain further insight.

4. REFERENCES
[1] H. Becker, M. Naaman, and L. Gravano. Learning similarity metrics for event identification in social media. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 291–300, 2010.
[2] L. Chen and A. Roy. Event detection from flickr data through wavelet-based spatial analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages 523–532, 2009.
[3] C. S. Firan, M. Georgescu, W. Nejdl, and R. Paiu. Bringing order to your photos: event-driven classification of flickr images based on social knowledge. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 189–198, 2010.
[4] X. Liu, R. Troncy, and B. Huet. Finding media illustrating events. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, ICMR '11, pages 58:1–58:8, 2011.
[5] S. Papadopoulos, R. Troncy, V. Mezaris, B. Huet, and I. Kompatsiaris. Social Event Detection at MediaEval 2011: Challenges, Dataset and Evaluation. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.
[6] S. Papadopoulos, C. Zigkolis, Y. Kompatsiaris, and A. Vakali. Cluster-based landmark and event detection for tagged photo collections. IEEE MultiMedia, 18(1):52–63, 2011.
[7] T. Rattenbury, N. Good, and M. Naaman. Towards automatic extraction of event and place semantics from flickr tags. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '07, pages 103–110, 2007.
[8] J. R. Smith et al. IBM multimedia analysis and retrieval system. http://www.alphaworks.ibm.com/tech/imars.