Unsupervised Clustering of Social Events

Matthias Zeppelzauer
Vienna University of Technology, Austria
Interactive Media Sys. Group
mzz@ims.tuwien.ac.at

Maia Zaharieva
University of Vienna, Austria
Research Group Multimedia Information Systems
zaharieva@cs.univie.ac.at

Manfred Del Fabro
Klagenfurt University, Austria
Institute of Information Technology
manfred@itec.aau.at


ABSTRACT                                                            based on topic detection [5]. The authors perform topic
This paper describes our contribution to the social event de-       detection by Latent Dirichlet Allocation (LDA) for each city
tection (SED) task of the MediaEval Benchmark 2013. We              in the image collection. Additionally, the authors manually
present a robust unsupervised approach for the clustering of        identify topics that are typical for a specific event cluster.
tagged photos and videos into social events. Results on the            From related approaches we observe that many assump-
SED datasets show that the proposed approach yields an ex-          tions are made on the training set and (partially manual)
cellent generalization ability and state-of-the-art clustering      optimizations are required which limits general applicabil-
performance.                                                        ity. Our unsupervised approach minimizes the assumptions
                                                                    on the data and avoids manual intervention. The approach
                                                                    exhibits a strong generalization ability and results show that
1. INTRODUCTION                                                     the sensitivity to the involved parameters is reasonably low.
  We participated in challenge 1 of the Social Event De-
tection (SED) task [4]. The goal of the task is to build            3.    APPROACH
photo clusters belonging to unique social events in a large
collection of tagged Flickr images. The total number of
events is not provided. In an additional subtask we
                                                                    3.1    Full Clustering
assign unlabeled videos to the previously discovered photo             The input to the approach are the available metadata of
clusters. The development set comprises 300k images from            the SED dataset (capture data, location, title, tags, descrip-
14882 unique events. For the test set of 131k images no             tion) and a stopword list. No other data sources are re-
ground truth is available.                                          quired. In a first step, the metadata are preprocessed: Since
  We consider challenge 1 as an unsupervised data mining            a user cannot be at two locations at the same time, we as-
task. The basic idea is to rely on robust heuristics and            sign locations of photos taken by the same user at the same
to reduce the number of parameters of the approach to a             time to the user’s non-geotagged photos. Additionally, the
minimum to obtain a good generalization ability between             textual metadata are filtered by the stopword list.
different datasets. Additionally, the proposed approach does           In a next step, we perform three independent cluster-
not require any external (online) data sources.                     ings in parallel: temporal clustering, location clustering, and
  In the course of the SED2013 task, we focus on the fol-           topic clustering. For temporal clustering we employ mean-
lowing research questions: (i) Which level of clustering per-       shift and set the bandwidth parameter βT in a way that
formance can be obtained by relying on simple but robust            the resulting clusters span between 2 and 6 hours, which is
heuristics for unsupervised clustering and how do the results       a reasonable temporal resolution for social events. For lo-
compare to more complex clustering methods? (ii) How well           cation clustering we observe that the performance gain of
does the proposed approach generalize to unknown data?              meanshift clustering does not justify the computational ef-
                                                                    forts. Hence, we skip meanshift clustering and assign each
                                                                    individual and unique location in the data a separate cluster
2. RELATED WORK                                                     ID. Topic clustering is based on topic extraction by LDA.
   Many existing approaches for event detection in image            We perform topic modeling on the textual descriptions of
collections require a separate training [1, 3]. Becker et al.       each photo (title, tags, description) using LDA and extract
create separate clusters for each feature such as title, descrip-   T topics for the employed dataset. For each photo i, we
tion, time, etc. The authors employ single-pass incremental         estimate the likelihoods l_{i,1} and l_{i,2} of the first- and second-
clustering, where the threshold for each cluster is tuned           best matching topics. If the difference of the likelihoods is
based on a set of training data [1]. Reuter and Cimiano em-         larger than a threshold τ (l_{i,1} − l_{i,2} > τ), the most likely
ploy machine learning techniques to detect events in social         topic is assigned to the photo; otherwise, no topic is assigned.
streams. The authors employ SVMs to classify Flickr images          Parameter τ is set to 0.3 for all experiments.
annotated by machine tags from last.fm into events [3].                The three independent clusterings are the basis for the
   Vavliakis et al. propose a social event detection approach       generation of initial event clusters. Photos which share the
                                                                    same temporal cluster, location cluster, and topic cluster
                                                                    are assigned the same unique event ID. The remaining pho-
Copyright is held by the author/owner(s).                           tos are assigned to existing and new events in a number
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain      of matching steps. First, remaining photos which share
results) and 0.69 (average performance) on a portion of the
SED2011 dataset (no F1 reported) [2]. Becker et al. [1] yield
NMI values between 0.92 and 0.94 and F1 values from 0.77
to 0.82 on a test set consisting of 270k photos (10 splits).
Reuter and Cimiano report an F1 of 0.74 for a dataset of
700k photos (7 splits, no NMI reported) [3].

Table 1: Results for Full Clustering
           Development Set           Test Set
βT       Topics   F1     NMI      Topics   F1     NMI
0.2      2000     0.74   0.94     1000     0.78   0.94
0.2      3000     0.74   0.94     1500     0.78   0.94
0.2      1600     0.74   0.94     800      0.78   0.94
0.1      2000     0.73   0.93     1000     0.76   0.94
0.5      2000     0.72   0.93     1000     0.77   0.94

[Figure 1: Overview of the approach. Flowchart with the stages: metadata (description, location, time) and stopwords; preprocessing; topic clustering, location clustering, temporal clustering; merge clusterings; initial clusters and unassigned photos; match by user + time; match by time + topic; refined event clusters and new non-geotagged event clusters; merge events; final event clusters.]


the same user and capture time as photos in already ex-                                                                       The three approaches submitted to the video subtask show
isting events are assigned to the respective events. If sev-                                                                different results. The supervised approach trained on the de-
eral events share the same users and capture times, the                                                                     velopment data performs suboptimally (F1=0.42, NMI=0.68).
events are merged. Second, remaining photos without loca-                                                                   The reason for this may be that the events of the test data
tion information are matched to existing events by time and                                                                 are inferred from the events in the development data. If an
topic. If no match to an existing event can be established, a                                                               event is not included in the development data, it cannot be
new (non-geotagged) event cluster is generated. For photos                                                                  inferred. The second approach shows that comparing the
where no location and no topic is available we generate new                                                                 metadata of single videos with the accumulated LDA key-
events by their capture time.                                                                                               words from clusters is not well-suited to link single videos
   The resulting sets of events (refined event clusters and                                                                 to clusters (F1=0.34, NMI=0.77). The unsupervised LDA-
non-geotagged event clusters) may oversegment the true event                                                                based approach performs best (F1=0.69, NMI=0.85) and
distribution. Hence, we merge events that share similar                                                                     builds a promising baseline for future improvements.
time, location, and topic to obtain the final event clusters.
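
The generation of initial event clusters in Section 3.1 can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: the function name `initial_events`, the label triples, and the use of `None` for a missing cluster label are assumptions made for the example.

```python
from collections import defaultdict


def initial_events(photos):
    """Group photos that share the same temporal, location, and topic cluster.

    `photos` maps a photo ID to a (temporal, location, topic) label triple.
    A label of None means that no cluster could be assigned (a hypothetical
    convention for this sketch).
    Returns (events, unassigned), where events maps an event ID to photo IDs.
    """
    groups = defaultdict(list)
    unassigned = []
    for photo_id, labels in photos.items():
        if any(label is None for label in labels):
            # Left for the subsequent matching steps (user + time, time + topic).
            unassigned.append(photo_id)
        else:
            groups[labels].append(photo_id)
    # Every distinct label triple receives its own unique event ID.
    events = {event_id: members
              for event_id, members in enumerate(groups.values())}
    return events, unassigned


photos = {
    "p1": (0, "48.20,16.37", 12),
    "p2": (0, "48.20,16.37", 12),
    "p3": (1, "48.20,16.37", 12),  # different temporal cluster
    "p4": (0, None, 12),           # no location available
}
events, unassigned = initial_events(photos)
```

In this toy input, `p1` and `p2` form one event, `p3` forms a second event, and `p4` remains unassigned until the matching steps.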
                                                                                                                            5.   CONCLUSIONS AND OUTLOOK
3.2 Full Clustering of Media using Videos                                                                                     In this paper we presented our contribution to the SED
   For the video subtask, we apply the above described topic                                                                challenge of the MediaEval 2013 Benchmark. We proposed a
modeling to the stopword-filtered textual descriptions of the                                                               robust unsupervised method for the clustering of photos and
videos (title, description, keywords). Temporal clustering                                                                  videos into social events. The method exhibits strong gen-
and location clustering are neglected, because most videos                                                                  eralization ability, low sensitivity to parameters, and yields
do not contain location information and a capturing date.                                                                   state-of-the-art performance. Future work focuses on more
As a consequence, parameter τ is set to 0.0 for all experi-                                                                 sophisticated event refinements and visual content analysis.
ments to achieve a complete clustering of all videos.
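
The thresholded topic assignment of Section 3.1, which is reused for videos with τ = 0.0, can be sketched as follows. The sketch is illustrative only: the function name `assign_topic` and the list-based input are assumptions, and the text does not specify how exact ties between the two best topics are resolved (with the strict inequality used here, a tie leaves the item without a topic).

```python
def assign_topic(likelihoods, tau):
    """Return the index of the most likely topic, or None if it is ambiguous.

    `likelihoods` holds one likelihood per topic for a single photo or video
    (at least two topics are expected). The best topic is assigned only if it
    exceeds the second-best topic by more than `tau`.
    """
    ranked = sorted(range(len(likelihoods)),
                    key=lambda t: likelihoods[t], reverse=True)
    best, second = ranked[0], ranked[1]
    if likelihoods[best] - likelihoods[second] > tau:
        return best
    return None


clear_photo = assign_topic([0.1, 0.7, 0.2], tau=0.3)      # margin of about 0.5
ambiguous_photo = assign_topic([0.4, 0.5, 0.1], tau=0.3)  # margin of about 0.1
video = assign_topic([0.4, 0.5, 0.1], tau=0.0)            # any positive margin
```

With τ = 0.3 the second item stays without a topic, whereas τ = 0.0 assigns the best topic whenever it is strictly more likely than the runner-up.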
   We investigate three different approaches for generating
the video clusters: (i) LDA is applied to train a topic model
                                                                                                                            6.   ACKNOWLEDGMENTS
with 200 topics on the development data from which the                                                                        This work has been partly funded by the Vienna Science
topics of the test data are derived; (ii) each video constitutes                                                            and Technology Fund (WWTF) through project ICT12-010
a topic on its own; and (iii) an unsupervised LDA-based                                                                     and the Carinthian Economic Promotion Fund (KWF) un-
approach is used to detect 70 topics in the test data. After                                                                der grant KWF-20214 22573 33955.
the video clusters are created, we link them to the previously
generated photo clusters. The keywords of video clusters                                                                    7.   REFERENCES
V are compared to the keywords of the photo clusters P                                                                      [1] H. Becker, M. Naaman, and L. Gravano. Learning
using the Jaccard similarity coefficient. Each video cluster                                                                    similarity metrics for event identification in social
is linked to the photo cluster with the highest similarity.                                                                     media. In ACM WSDM, pp. 291–300, 2010.
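
The linking of video clusters to photo clusters described in Section 3.2 can be sketched as follows. The function names `jaccard` and `link_video_clusters`, the cluster IDs, and the example keywords are hypothetical; keyword sets are assumed to be available for each cluster, and ties are resolved in favor of the first photo cluster encountered, which the text does not specify.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient |A ∩ B| / |A ∪ B| of two keyword sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_video_clusters(video_keywords, photo_keywords):
    """Map each video cluster ID to the most similar photo cluster ID."""
    links = {}
    for v_id, v_words in video_keywords.items():
        links[v_id] = max(
            photo_keywords,
            key=lambda p_id: jaccard(v_words, photo_keywords[p_id]),
        )
    return links


video_keywords = {
    "v1": {"concert", "vienna", "rock"},
    "v2": {"marathon", "berlin"},
}
photo_keywords = {
    "e1": {"concert", "rock", "live"},
    "e2": {"berlin", "marathon", "run"},
}
links = link_video_clusters(video_keywords, photo_keywords)
```

Here `v1` shares two of four distinct keywords with `e1` (similarity 0.5) and none with `e2`, so it is linked to `e1`; `v2` is linked to `e2`.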
                                                                                                                            [2] G. Petkos, S. Papadopoulos, and Y. Kompatsiaris.
4. EXPERIMENTS AND RESULTS                                                                                                      Social event detection using multimodal clustering and
                                                                                                                                integrating supervisory signals. In ACM ICMR, pp.
   We use the same parameters for experiments on the
                                                                                                                                23:1–8, 2012.
development and test set. To estimate the number of topics,
we assume that each topic comprises, on average, 100 to                                                                     [3] T. Reuter and P. Cimiano. Event-based classification of
200 photos. Additionally, we evaluate different values of βT                                                                    social media streams. In ACM ICMR, pp. 22:1–8, 2012.
corresponding to an event duration of 2-6 hours. The results                                                                [4] T. Reuter, S. Papadopoulos, V. Mezaris, P. Cimiano,
of the proposed approach for both sets demonstrate its ex-                                                                      C. de Vries, and S. Geva. Social Event Detection at
cellent generalization ability (see Table 1). Results for the                                                                   MediaEval 2013: Challenges, datasets, and evaluation.
test set are even better than for the development set. The                                                                      In MediaEval 2013 Workshop, 2013.
clustering performance is comparable to (more complex) su-                                                                  [5] K. N. Vavliakis, F. A. Tzima, and P. A. Mitkas. Event
pervised state-of-the-art methods. The approach by Petkos                                                                       detection via LDA for the MediaEval2012 SED Task.
et al., for example, yields NMI values of 0.92 (average of best                                                                 In MediaEval 2012 Workshop, 2012.