A similarity-based Chinese Restaurant Process for Social Event Detection

Athanasios Papaoikonomou, Konstantinos Tserpes, Magdalini Kardara, Theodora Varvarigou
National Technical University of Athens
tpap@mail.ntua.gr, tserpes@mail.ntua.gr, nkardara@mail.ntua.gr, dora@mail.ntua.gr

ABSTRACT
In this paper, we present our approach for the Social Event Detection task of MediaEval 2013 [2]. The goal of the task was to group similar multimedia items into event clusters, based on their metadata (e.g., title, description, tags). Since the number of event clusters in the test set was not known in advance, we formulated a non-parametric algorithm which resembles Dirichlet process clustering. More specifically, we developed a similarity-based version of the Chinese Restaurant Process (CRP) which exploits the similarities among the media items. Our approach achieved an F1 score of 0.2364.

Keywords
event detection, dirichlet process clustering, latent topic discovery, chinese restaurant process

1. INTRODUCTION
The goal of the Social Event Detection task of MediaEval 2013 was to discover event-related multimedia items and organize them in event-specific clusters. For this purpose, a large training set of about 312,000 photos was given along with their textual metadata, such as title, description, location and tags.

One of the biggest challenges we faced was the estimation of the number of event clusters in the test set, since this was not known in advance. To tackle this problem, we used the Chinese Restaurant Process (CRP), which is a formulation of the Dirichlet process. It takes its name from its analogy to a Chinese restaurant where n customers are seated at an infinite number of tables according to the following rule: the first customer sits at the first table; each subsequent customer either sits at one of the previously occupied K tables, choosing table k with probability nk / (n - 1 + α), where nk is the number of customers already seated at table k, k = 1, 2, ..., K, or sits at a new table with probability α / (n - 1 + α), where α is a pre-defined parameter. The connection between the CRP and our task is straightforward: the photos in the dataset stand for the customers being seated, and the tables are the event clusters that group similar multimedia items. The traditional CRP takes into account only the popularity of a table when deciding whether to assign an item to it. In Section 3 we present our modified version of the CRP, which we call similarity-based CRP, as it exploits the similarity among the items in the dataset.

2. RELATED WORK
One of the most prominent algorithms in latent topic discovery is Latent Dirichlet Allocation (LDA), presented by Blei et al. in [1]. HDP-LDA is a non-parametric version of LDA, based on the Hierarchical Dirichlet Process clustering algorithm given in [3]. Borrowing ideas from the LDA algorithm (especially HDP-LDA), we tried to build an event detection algorithm focused on metadata mining. LDA (and its variants) exploits word co-occurrences to identify latent topics, but in the case of metadata there are attributes that cannot be modeled directly as words. For example, in the case of the date taken attribute, we do not expect two multimedia items to share the exact same timestamp even if they refer to the same event. A different approach should be followed, as we do in Section 3.

3. APPROACH
In this section we present our algorithm for the Social Event Detection task. Our approach leans on the assumption that similar customers (photos) will tend to gather together and "sit at the same table" in the context of our modified Chinese Restaurant Process. Our analysis focused on the comparison of the available metadata of the multimedia items, which in our case were typical properties like title, description, tags and username, spatial properties like longitude and latitude, and finally temporal attributes like the date the media item was taken. The following subsections present the construction of our algorithm in a stepwise manner.

3.1 Similarity Computation
We evaluated the importance of each attribute to the allocation of the media items in event clusters. More concretely, we operated on the training set and measured the probability that two datapoints sharing the same value for a specific attribute also belong to the same event cluster. The results are depicted in Table 1. In order to measure the similarity of two photos in the test set, we used the computed probabilities as scores.

Attribute     prob_a
Location      56.35%
Username      10.80%
Title          9.34%
Date Taken    25.74%
Tag           48.20%

Table 1: Attribute statistics. The second column reports the percentage of the datapoints that share a common value and belong to the same event cluster.

More specifically, the similarity between two datapoints i and j was computed using the following formula:

    v_ij = Σ_{a ∈ attrs} prob_a · M_a(i, j),

where attrs is the set of metadata attributes, prob_a the associated score from Table 1, and M_a(i, j) the matching function. For the majority of the attributes (Location, Username, Title, Tags) the matching function is simply the indicator function

    I_a(i, j) = 1 if i and j share the same value for attribute a, 0 otherwise.

In the case of the "Date Taken" attribute we used an exponential decay weight function to model the temporal proximity of two photo items. More specifically, the similarity score was computed as M_ij = exp(-|d_i - d_j| / h), where d_i, d_j are the timestamps of the two datapoints. The denominator in the exponent, h, is called the bandwidth and controls the rate of decay. We set this value to 1 hour, so that timestamps with a time difference of less than one hour receive high values (close to 1), while larger deviations are penalized more heavily.

3.2 Table Profiles
The second issue that we faced was the scaling of the algorithm to large datasets. Even for medium-size datasets, like the one given for the task, it becomes impractical to measure the similarities among all pairs of datapoints. To tackle this problem, we introduced the concept of table profiles. A table profile is simply the union of all the photo items that "sit" at a table, and it is equivalent to a super-photo item that encompasses all the characteristics of the photos belonging to the table. Using this trick, we significantly reduced the number of comparisons needed to allocate a new "customer" (media item) from n (the total number of customers at that point) to K (the number of occupied tables).

3.3 Similarity-based CRP Algorithm
This subsection finally presents our modified CRP algorithm. When a new customer (photo) comes in, we measure its similarity with each of the K already occupied tables and then make a stochastic decision: the newcomer will either sit at one of the K tables with probability proportional to the similarity value, or pick a new table. In short, the algorithm works as follows:

1. The first customer (photo) sits at the first table (event cluster) and initializes the first table profile.

2. For each of the subsequent customers, we compute their similarity value with each of the K table profiles. We denote these values as v_k, k = 1, 2, ..., K.

3. The customer sits at table k with probability v_k / (Σ_i v_i + α) or sits at a new table with probability α / (Σ_i v_i + α). In the former case, the attributes of the media item are "merged" into the k-th table profile, while in the latter case a new table profile is initialized by the media item.

The parameter α controls the distribution of the customers over the tables. Higher values of α signify higher dispersion and thus a larger number of occupied tables, while lower values of α give more compact allocations. Finally, to better measure the quality of our results, we computed the purity^1 of the generated clusters as purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|, where Ω = {ω_1, ω_2, ..., ω_K} is the set of clusters generated by the algorithm and C = {c_1, c_2, ..., c_J} is the set of the actual clusters.

4. RESULTS
The results were very sensitive to the selection of α. We performed a line search in the region [1, 100]. We observed that for α > 10 the algorithm diverged, giving a huge number of clusters. For example, for α = 20 we got about 50k clusters in the training set, while the actual number was about 15k. Of course, the high purity value that we measured in this setting is meaningless, since the results are not well interpretable. For α < 10, the algorithm generated few, low-purity clusters. For example, for α = 5 the algorithm gave about 3k clusters in the training set. We decided to set α equal to 10, since it was a good compromise between the number and the purity of the generated event clusters in the training set. On the test set we achieved an F1 (main score) of 0.2364 and an NMI of 0.6644.

5. REFERENCES
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, Mar. 2003.
[2] T. Reuter, S. Papadopoulos, V. Mezaris, P. Cimiano, C. de Vries, and S. Geva. Social Event Detection at MediaEval 2013: Challenges, datasets, and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[3] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101, 2004.

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.

^1 http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
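To make the classic CRP seating rule from the Introduction concrete, the following is a minimal simulation sketch. It is our illustration, not the authors' code; the function names `crp_assign` and `simulate_crp` are ours.

```python
import random

def crp_assign(table_counts, alpha, rng=random):
    """Sample a table for the next customer under the classic CRP.

    table_counts[k] is the number of customers at table k; a return
    value of len(table_counts) means "open a new table".
    """
    n = sum(table_counts) + 1  # this customer is the n-th arrival
    weights = [nk / (n - 1 + alpha) for nk in table_counts]
    weights.append(alpha / (n - 1 + alpha))  # probability of a new table
    r = rng.random()
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if r < acc:
            return k
    return len(table_counts)  # numerical safety net

def simulate_crp(n_customers, alpha, seed=0):
    """Seat n_customers one by one and return the table occupancies."""
    rng = random.Random(seed)
    tables = []
    for _ in range(n_customers):
        k = crp_assign(tables, alpha, rng)
        if k == len(tables):
            tables.append(1)   # first customer at a new table
        else:
            tables[k] += 1
    return tables
```

Note that the first customer always opens the first table (the only weight is α / α = 1), matching the process described in the text; larger α tends to open more tables.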
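The pairwise similarity of Section 3.1 can be sketched directly from the formula v_ij = Σ_a prob_a · M_a(i, j), using the scores of Table 1, the indicator matching function, and the exponential decay for "Date Taken". This is an illustrative sketch under our own assumptions: photos are plain dicts, timestamps are in seconds, and multi-valued tags are simplified to a single comparable value.

```python
import math

# Scores prob_a from Table 1 (attribute -> weight)
PROB = {"location": 0.5635, "username": 0.1080, "title": 0.0934,
        "date_taken": 0.2574, "tag": 0.4820}

H = 3600.0  # bandwidth h: 1 hour, expressed in seconds

def match(attr, x, y):
    """Matching function M_a: indicator for most attributes,
    exponential time decay exp(-|d_i - d_j| / h) for timestamps."""
    if x is None or y is None:
        return 0.0
    if attr == "date_taken":
        return math.exp(-abs(x - y) / H)
    return 1.0 if x == y else 0.0

def similarity(photo_i, photo_j):
    """v_ij = sum over attributes of prob_a * M_a(i, j)."""
    return sum(p * match(a, photo_i.get(a), photo_j.get(a))
               for a, p in PROB.items())
```

With h = 1 hour, two photos taken exactly one hour apart contribute exp(-1) ≈ 0.37 of the Date Taken weight, while identical timestamps contribute the full 0.2574.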
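The three-step similarity-based CRP of Section 3.3, combined with the table profiles of Section 3.2, can be sketched as below. The paper defines a profile only as the union of its photos, so how a photo is matched against a profile (set membership for categorical attributes, nearest timestamp for Date Taken) is our assumption, as are all function names.

```python
import math
import random

# Attribute weights prob_a from Table 1
PROB = {"location": 0.5635, "username": 0.1080, "title": 0.0934,
        "date_taken": 0.2574, "tag": 0.4820}
H = 3600.0  # 1-hour bandwidth in seconds

def profile_similarity(photo, profile):
    """Similarity v_k of a photo to a table profile (union of photos)."""
    v = 0.0
    for a, p in PROB.items():
        x = photo.get(a)
        if x is None or not profile.get(a):
            continue
        if a == "date_taken":
            # decay against the closest stored timestamp (our choice)
            v += p * max(math.exp(-abs(x - d) / H) for d in profile[a])
        elif x in profile[a]:
            v += p
    return v

def merge(profile, photo):
    """Step 3, former case: merge the photo's attributes into the profile."""
    for a in PROB:
        if photo.get(a) is not None:
            profile.setdefault(a, set()).add(photo[a])

def sim_crp(photos, alpha, seed=0):
    """Similarity-based CRP: table k drawn with prob v_k / (sum_i v_i + alpha),
    a new table with prob alpha / (sum_i v_i + alpha)."""
    rng = random.Random(seed)
    profiles, assignment = [], []
    for photo in photos:
        v = [profile_similarity(photo, prof) for prof in profiles]
        r = rng.random() * (sum(v) + alpha)
        acc, k = 0.0, len(profiles)  # default: open a new table
        for i, vi in enumerate(v):
            acc += vi
            if r < acc:
                k = i
                break
        if k == len(profiles):
            profiles.append({})      # latter case: new table profile
        merge(profiles[k], photo)
        assignment.append(k)
    return assignment
```

Because each arriving photo is compared only against the K profiles rather than all n previous photos, this sketch also illustrates the complexity reduction claimed in Section 3.2.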
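The purity measure used to tune α, purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|, can be computed as follows; the representation of clusterings as item-to-label mappings is our assumption.

```python
from collections import defaultdict

def purity(generated, actual):
    """purity(Omega, C) = (1/N) * sum_k max_j |omega_k intersect c_j|.

    `generated` and `actual` map each item id to a cluster label.
    """
    omega, c = defaultdict(set), defaultdict(set)
    for item, k in generated.items():
        omega[k].add(item)
    for item, j in actual.items():
        c[j].add(item)
    n = len(generated)
    return sum(max(len(w & cj) for cj in c.values())
               for w in omega.values()) / n
```

As the Results section notes, purity alone is misleading for large α: splitting every item into its own cluster yields purity 1.0, which is why the number of clusters was considered alongside it.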