A similarity-based Chinese Restaurant Process for Social Event Detection

Athanasios Papaoikonomou, Konstantinos Tserpes, Magdalini Kardara, Theodora Varvarigou
National Technical University of Athens
tpap@mail.ntua.gr, tserpes@mail.ntua.gr, nkardara@mail.ntua.gr, dora@mail.ntua.gr

ABSTRACT
In this paper, we present our approach for the Social Event Detection task of MediaEval 2013 [2]. The goal of the task was to group similar multimedia items into event clusters, based on their metadata (e.g., title, description, tags). Since the number of event clusters in the test set was not known in advance, we formulated a non-parametric algorithm which resembles Dirichlet process clustering. More specifically, we developed a similarity-based version of the Chinese Restaurant Process (CRP) which exploits the similarities among the media items. Our approach achieved an F1 score of 0.2364.

Keywords
event detection, dirichlet process clustering, latent topic discovery, chinese restaurant process

1. INTRODUCTION
The goal of the Social Event Detection task of MediaEval 2013 was to discover event-related multimedia items and organize them in event-specific clusters. For this purpose, a large training set of about 312,000 photos was given along with their textual metadata, such as title, description, location and tags.

One of the biggest challenges we faced was the estimation of the number of event clusters in the test set, since this was not known in advance. To tackle this problem, we used the Chinese Restaurant Process (CRP), which is a formulation of the Dirichlet process. It takes its name from its analogy to a Chinese restaurant where n customers are seated at an infinite number of tables according to the following rule: the first customer sits at the first table; each subsequent customer either sits at one of the previously occupied K tables, choosing table k with probability nk / (n - 1 + α), where nk is the number of customers already seated at table k, k = 1, 2, ..., K, or sits at a new table with probability α / (n - 1 + α), where α is a pre-defined parameter. The connection between the CRP and our task is straightforward: the photos in the dataset stand for the customers being seated, and the tables are the event clusters that group similar multimedia items. The traditional CRP takes into account only the popularity of a table when deciding whether to assign an item to it. In Section 3 we present our modified version of the CRP, which we call similarity-based CRP, as it exploits the similarity among the items in the dataset.

2. RELATED WORK
One of the most prominent algorithms in latent topic discovery is Latent Dirichlet Allocation (LDA), presented by Blei et al. in [1]. HDP-LDA is a non-parametric version of LDA, based on the Hierarchical Dirichlet Process clustering algorithm given in [3]. Borrowing ideas from the LDA algorithm (especially HDP-LDA), we tried to build an event detection algorithm focused on metadata mining. LDA (and its variants) exploits word co-occurrences to identify latent topics, but in the case of metadata there are attributes that cannot be modeled directly as words. For example, in the case of the date taken attribute, we do not expect two multimedia items to share the exact same timestamp even if they refer to the same event. A different approach should be followed, as we do in Section 3.

3. APPROACH
In this section we present our algorithm for the Social Event Detection task. Our approach leans on the assumption that similar customers (photos) will tend to gather together and "sit at the same table" in the context of our modified Chinese Restaurant Process. Our analysis focused on the comparison of the available metadata of the multimedia items, which in our case were typical properties like title, description, tags and username, spatial properties like longitude and latitude, and finally temporal attributes like the date the media item was taken. The following subsections present the construction of our algorithm in a stepwise manner.

3.1 Similarity Computation
We evaluated the importance of each attribute to the allocation of the media items in event clusters. More concretely, we operated on the training set and measured the probability that two datapoints sharing the same value for a specific attribute also belong to the same event cluster. The results are depicted in Table 1. In order to measure the similarity of two photos in the test set, we used the computed probabilities as scores.

Attribute     prob_a
Location      56.35%
Username      10.80%
Title          9.34%
Date Taken    25.74%
Tag           48.20%

Table 1: Attribute statistics. The second column reports the percentage of the datapoints that share a common value and belong to the same event cluster.

More specifically, the similarity between two datapoints i and j was computed using the following formula:

    v_ij = Σ_{a ∈ attrs} prob_a · M_a(i, j),

where attrs is the set of metadata attributes, prob_a the associated score from Table 1, and M_a(i, j) the matching function. For the majority of the attributes (Location, Username, Title, Tags) the matching function is simply the indicator function

    I_a(i, j) = 1 if i and j share the same value for attribute a, 0 otherwise.

In the case of the "Date Taken" attribute we used an exponential decay weight function to model the temporal proximity of two photo items. More specifically, the similarity score was computed as M_ij = exp(-|d_i - d_j| / h), where d_i, d_j are the timestamps of the two datapoints. The denominator in the exponent, h, is called the bandwidth and controls the rate of decay. We set this value to 1 hour, so that timestamps with a time difference of less than one hour receive high values (close to 1), while larger deviations are penalized more heavily.

3.2 Table Profiles
The second issue that we faced was the scaling of the algorithm to large datasets. Even for medium-size datasets, like the one given for the task, it becomes impractical to measure the similarities among all pairs of datapoints. To tackle this problem, we introduced the concept of table profiles. A table profile is simply the union of all the photo items that "sit" at a table, and it is equivalent to a super-photo item that encompasses all the characteristics of the photos belonging to the table. Using this trick, we significantly reduced the number of comparisons needed to allocate a new "customer" (media item) from n (the total number of customers at that point) to K (the number of occupied tables).

3.3 Similarity-based CRP Algorithm
This subsection finally presents our modified CRP algorithm. When a new customer (photo) comes in, we measure its similarity with each of the K already occupied tables and then make a stochastic decision: the newcomer will either sit at one of the K tables with probability proportional to the similarity value, or pick a new table. In short, the algorithm works as follows:

1. The first customer (photo) sits at the first table (event cluster) and initializes the first table profile.

2. For each of the subsequent customers, we compute their similarity value with each of the K table profiles. We denote these values as v_k, k = 1, 2, ..., K.

3. The customer sits at table k with probability v_k / (Σ_i v_i + α) or sits at a new table with probability α / (Σ_i v_i + α). In the former case, the attributes of the media item are "merged" into the k-th table profile, while in the latter case a new table profile is initialized by the media item.

The parameter α controls the distribution of the customers over the tables. Higher values of α signify higher dispersion and thus a larger number of occupied tables, while lower values of α give more compact allocations. Finally, to better measure the quality of our results, we computed the purity^1 of the generated clusters as purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|, where Ω = {ω_1, ω_2, ..., ω_K} is the set of clusters generated by the algorithm and C = {c_1, c_2, ..., c_J} is the set of the actual clusters.

4. RESULTS
The results were very sensitive to the selection of α. We performed a line search in the region [1, 100]. We observed that for α > 10 the algorithm diverged, giving a huge number of clusters. For example, for α = 20 we got about 50k clusters in the training set, while the actual number was about 15k. Of course, the high purity value that we measured in this setting is meaningless, since the results are not well interpretable. For α < 10, the algorithm generated few, low-purity clusters. For example, for α = 5 the algorithm gave about 3k clusters in the training set. We decided to set α equal to 10, since it was a good compromise between the number and the purity of the generated event clusters in the training set. On the test set we achieved an F1 (main score) of 0.2364 and an NMI of 0.6644.

5. REFERENCES
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, Mar. 2003.
[2] T. Reuter, S. Papadopoulos, V. Mezaris, P. Cimiano, C. de Vries, and S. Geva. Social Event Detection at MediaEval 2013: Challenges, datasets, and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[3] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101, 2004.

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.

^1 http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
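To make the classic CRP seating rule from the Introduction concrete, the following is a minimal simulation sketch. It is our illustration, not the authors' code; the function names `crp_assign` and `simulate_crp` are ours.

```python
import random

def crp_assign(table_counts, alpha, rng=random):
    """Sample a table for the next customer under the classic CRP.

    table_counts[k] is the number of customers at table k; a return
    value of len(table_counts) means "open a new table".
    """
    n = sum(table_counts) + 1  # this customer is the n-th arrival
    weights = [nk / (n - 1 + alpha) for nk in table_counts]
    weights.append(alpha / (n - 1 + alpha))  # probability of a new table
    r = rng.random()
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if r < acc:
            return k
    return len(table_counts)  # numerical safety net

def simulate_crp(n_customers, alpha, seed=0):
    """Seat n_customers one by one and return the table occupancies."""
    rng = random.Random(seed)
    tables = []
    for _ in range(n_customers):
        k = crp_assign(tables, alpha, rng)
        if k == len(tables):
            tables.append(1)   # first customer at a new table
        else:
            tables[k] += 1
    return tables
```

Note that the first customer always opens the first table (the only weight is α / α = 1), matching the process described in the text; larger α tends to open more tables.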
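The pairwise similarity of Section 3.1 can be sketched directly from the formula v_ij = Σ_a prob_a · M_a(i, j), using the scores of Table 1, the indicator matching function, and the exponential decay for "Date Taken". This is an illustrative sketch under our own assumptions: photos are plain dicts, timestamps are in seconds, and multi-valued tags are simplified to a single comparable value.

```python
import math

# Scores prob_a from Table 1 (attribute -> weight)
PROB = {"location": 0.5635, "username": 0.1080, "title": 0.0934,
        "date_taken": 0.2574, "tag": 0.4820}

H = 3600.0  # bandwidth h: 1 hour, expressed in seconds

def match(attr, x, y):
    """Matching function M_a: indicator for most attributes,
    exponential time decay exp(-|d_i - d_j| / h) for timestamps."""
    if x is None or y is None:
        return 0.0
    if attr == "date_taken":
        return math.exp(-abs(x - y) / H)
    return 1.0 if x == y else 0.0

def similarity(photo_i, photo_j):
    """v_ij = sum over attributes of prob_a * M_a(i, j)."""
    return sum(p * match(a, photo_i.get(a), photo_j.get(a))
               for a, p in PROB.items())
```

With h = 1 hour, two photos taken exactly one hour apart contribute exp(-1) ≈ 0.37 of the Date Taken weight, while identical timestamps contribute the full 0.2574.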
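The three-step similarity-based CRP of Section 3.3, combined with the table profiles of Section 3.2, can be sketched as below. The paper defines a profile only as the union of its photos, so how a photo is matched against a profile (set membership for categorical attributes, nearest timestamp for Date Taken) is our assumption, as are all function names.

```python
import math
import random

# Attribute weights prob_a from Table 1
PROB = {"location": 0.5635, "username": 0.1080, "title": 0.0934,
        "date_taken": 0.2574, "tag": 0.4820}
H = 3600.0  # 1-hour bandwidth in seconds

def profile_similarity(photo, profile):
    """Similarity v_k of a photo to a table profile (union of photos)."""
    v = 0.0
    for a, p in PROB.items():
        x = photo.get(a)
        if x is None or not profile.get(a):
            continue
        if a == "date_taken":
            # decay against the closest stored timestamp (our choice)
            v += p * max(math.exp(-abs(x - d) / H) for d in profile[a])
        elif x in profile[a]:
            v += p
    return v

def merge(profile, photo):
    """Step 3, former case: merge the photo's attributes into the profile."""
    for a in PROB:
        if photo.get(a) is not None:
            profile.setdefault(a, set()).add(photo[a])

def sim_crp(photos, alpha, seed=0):
    """Similarity-based CRP: table k drawn with prob v_k / (sum_i v_i + alpha),
    a new table with prob alpha / (sum_i v_i + alpha)."""
    rng = random.Random(seed)
    profiles, assignment = [], []
    for photo in photos:
        v = [profile_similarity(photo, prof) for prof in profiles]
        r = rng.random() * (sum(v) + alpha)
        acc, k = 0.0, len(profiles)  # default: open a new table
        for i, vi in enumerate(v):
            acc += vi
            if r < acc:
                k = i
                break
        if k == len(profiles):
            profiles.append({})      # latter case: new table profile
        merge(profiles[k], photo)
        assignment.append(k)
    return assignment
```

Because each arriving photo is compared only against the K profiles rather than all n previous photos, this sketch also illustrates the complexity reduction claimed in Section 3.2.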
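The purity measure used to tune α, purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|, can be computed as follows; the representation of clusterings as item-to-label mappings is our assumption.

```python
from collections import defaultdict

def purity(generated, actual):
    """purity(Omega, C) = (1/N) * sum_k max_j |omega_k intersect c_j|.

    `generated` and `actual` map each item id to a cluster label.
    """
    omega, c = defaultdict(set), defaultdict(set)
    for item, k in generated.items():
        omega[k].add(item)
    for item, j in actual.items():
        c[j].add(item)
    n = len(generated)
    return sum(max(len(w & cj) for cj in c.values())
               for w in omega.values()) / n
```

As the Results section notes, purity alone is misleading for large α: splitting every item into its own cluster yields purity 1.0, which is why the number of clusters was considered alongside it.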