Ranking Based Clustering for Social Event Detection

                          Taufik Sutanto                                           Richi Nayak
              Queensland University of Technology                     Queensland University of Technology
                 taufik.sutanto@qut.edu.au                                    r.nayak@qut.edu.au


ABSTRACT                                                          tionally expensive, as well as, a large size cluster makes the
The problem of clustering a large document collection is not      similarity measure between documents ambiguous.
only challenged by the number of documents and the num-              Semi-supervised clustering methods have shown to pro-
ber of dimensions, but it is also affected by the number and      duce a better result compared to their traditional unsuper-
sizes of the clusters. Traditional clustering methods fail to     vised counterpart [2]. In the 2013 SED task, we proposed
scale when they need to generate a large number of clusters.      and used a scalable ranking based semi-supervised clustering
Furthermore, when the clusters size in the solution is het-       approach that produces accurate clusters [5]. However, this
erogeneous, i.e. some of the clusters are large in size, the      method suffers with the communication cost for long doc-
similarity measures tend to degrade. A ranking based clus-        uments. To deal with the issue, we utilized the document
tering method is proposed to deal with these issues in the        frequency distribution and exclude the most occurring terms
context of the Social Event Detection task. Ranking scores        in the query document (i.e. the document to be clustered)
are used to select a small number of most relevant clusters       if needed.
in order to compare and place a document. Additionally,              The use of hubs has been explored and has shown its ef-
instead of conventional cluster centroids, cluster patches are    ficacy in dealing with high dimensional data and clusters
proposed to represent clusters, that are hubs-like set of doc-    with large sizes [3]. However, the k-NN calculation of hubs
uments. Text, temporal, spatial and visual content infor-         demands a considerable amount of extra computation that
mation collected from the social event images is utilized in      is not suitable for large data clustering. In this paper, doc-
calculating similarity. Results show that these strategies al-    uments are assigned to clusters based on its distances to
low us to have a balance between performance and accuracy         cluster patches. These patches are calculated based on the
of the clustering solution gained by the clustering method.       ranking scores from the queries. Document frequencies are
                                                                  used to select a subset of terms from documents to create the
                                                                  queries. These patches become data representatives to mea-
1.   INTRODUCTION                                                 sure distances for a document, instead of using each cluster
   The Social Event Detection (SED) task at the 2014 Medi-        centroid. The use of patches is expected to enable the clus-
aEval Benchmark for Multimedia Evaluation consists of two         tering method to capture more specific sub-topics within a
subtasks: (1) Image clustering based on a given set of events;    cluster.
and (2) retrieval of social events based on predefined queries       In this paper, we present a method based on cluster patches
[4]. The SED task poses challenges to clustering analysis         to calculate the distance between a document and the groups
due to the real-world nature of the data such as the large        of documents inside a cluster (Figure 1). Instead of a single
number of dimensions, large data size, multi-domain types         centroid, patches are proposed to represent a large high-
of features, and the need to group data into a large and un-      dimensional cluster in order to control the significance of
fixed number of clusters. This paper focuses on proposing         similarity measurement.
a solution to the first subtask, i.e., semi-supervised cluster-
ing of social event images based on the metadata and visual
content.
   Search engine technologies e.g., Sphinx, Lucene or Solr
have been successfully implemented to process large sized
document collections for information retrieval. Utilizing the
concept of ranking scores used in search engines, coupled
with using prior knowledge from the learning data, in semi-
supervised clustering has shown to be an effective and effi-
cient approach of clustering text data [5, 6]. This type of
approaches works fine when the collection size or the num-
ber of clusters required is small. Calculating ranking scores
for a large number of documents is known to be computa-


Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain      Figure 1: Ranking based document clustering with patches.
2.     PREPROCESSING                                                    4.   RESULTS AND DISCUSSION
   All the features of the images were used in the clustering             We submitted five runs for the supervised clustering task
process except of their URL. English stopwords and some                 (Table 1). Runs one, three, four, and five used the proposed
symbols (e.g. #,&,@) were filtered. Title, tag, username,               method on text only, text-time-space, all attributes, and
and description attributes were combined into a short doc-              text-images set of attributes respectively. While run two is
ument. No external resources were used in the analysis.                 using the method as described in [5] using the text attribute
The document length normalized tf-idf was used as the term              only.
weighting scheme. The time information were transformed
into day interval between date taken and date upload. Spa-                      Table 1: Semi-Supervised clustering results.
tial information (i.e. latitude and longitude) were used by
utilizing a modified Harversine-formula. The modification is                          Run1       Run2      Run3     Run4      Run5
done by changing the range of the measure to a unit value as             F1-Score    0.7463     0.7533     0.7445   0.7440    0.7456
in cosine distance. Feature-based super-pixel segmentation               NMI         0.9024     0.9020     0.9017   0.9015    0.9020
is used to extract compact color and texture representation              Div. F1     0.7447     0.7516     0.7428   0.7424    0.7439
for small image patches [1]. This representation has smaller
dimension compared to the bag-of-visual words (BOVW)                       The first two runs indicate that the proposed method has
approach.                                                               comparable accuracy to general ranking method, but an im-
                                                                        proved cluster quality as shown by NMI. While the remain-
3.     THE MODEL                                                        ing runs shows that the usage of image, spatial, and time
   A set of patches P are calculated in each iteration based            information is ineffective in this data for the purpose of clus-
on the ranking score from the document query. Instead of                tering. The main reason behind this is the dependence of
comparing a document with a cluster centroid, the document              the proposed method on text ranking.
feature vector is compared with all the patches. The patches               An adaptive weighting where weights of each attribute are
are calculated based on a certain size (δ) neighborhood of              dynamic among documents is a priority for future investi-
documents based on ranking scores within clusters. Optimal              gation to solve this issue. Future work will also explore on
distance from the document and these patches is then used               finding the optimal parameter γ and improve the scalability
to decide the document assignment to a cluster. More detail             of the method in distributed data and distributed computing
of the approach is given in Algorithm 1.                                environment.

      input : Set of documents D, initial clusters                      5.   ACKNOWLEDGMENTS
              C = {c1 , c2 , . . . , cK }, neighborhood size m,           We like to thank Dr Simon Denman from QUT Compu-
              patches size δ, and cluster threshold γ.                  tational Intelligence and Signal Processing lab for providing
      output: K 0 disjoint partitions of D.                             us the image visual content encoding.
      Index all documents D;
      for each di ∈ Dtest do                                            6.   REFERENCES
                                                                        [1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and
         calculate a set of cluster patches
                                                                            S. Susstrunk. SLIC superpixels compared to
         P = {0 < |drank | < δi , i ∈ I, d ∈ cj };
                                                                            state-of-the-art superpixel methods. IEEE Transactions
         for each p ∈ P do
                                                                            on Pattern Analysis and Machine Intelligence, 34, 2012.
             calculate p∗ = maxp {sim(di , p), p ∈ P };
             if sim(di , p∗) > γ then                                   [2] S. Basu, I. Davidson, and K. Wagstaff. Constrained
                 Assign document di to a cluster where p∗                   Clustering: Advances in Algorithms, Theory, and
                 belongs;                                                   Applications. Chapman & Hall/CRC, 1 edition, 2008.
             else                                                       [3] J. Hou and R. Nayak. The heterogeneous cluster
                 Form a new cluster c=di ;                                  ensemble method using hubness for clustering text
             end                                                            documents. In WISE 2013, pages 102–110. Springer
             Update cluster labels via the search engine                    Berlin Heidelberg.
         end                                                            [4] G. Petkos, S. Papadopoulos, V. Mezaris, and
      end                                                                   Y. Kompatsiaris. Social event detection at MediaEval
                                                                            2014: Challenges, datasets, and evaluation. In
     Algorithm 1: Incremental ranking based social
                                                                            Proceedings of the MediaEval 2014 Multimedia
     event images clustering algorithm.
                                                                            Benchmark Workshop Barcelona, Spain, October 16-17,
                                                                            2014, volume 1044. CEUR-WS.org, 2014.
   The similarity measure between a document d and a patch              [5] T. Sutanto and R. Nayak. ADMRG @ MediaEval 2013
p in a cluster c is given by utilizing textual, temporal, spatial           social event detection. In Proceedings of the MediaEval
and visual information within images:                                       2013 Multimedia Benchmark Workshop Barcelona,
      sim(d, p) = β1 simcosine (d, p) + β2 simtime (d, p)+                  Spain, October 18-19, 2013, volume 1043, 2013.
                                                                  (1)   [6] T. Sutanto and R. Nayak. The ranking based
                   β3 simspace (d, p) + β4 simimage (d, p).                 constrained document clustering method and its
βi is a weight parameter to combine the effect of various                   application to social event detection. In Database
types of attributes. These parameters can be fine tuned                     Systems for Advanced Applications, volume 8422 of
manually or calculated from the learning data by using vari-                Lecture Notes in Computer Science, pages 47–60.
able importance measures from a decision tree model.                        Springer International Publishing, 2014.