<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Data-Driven Approach for Social Event Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dimitrios Rafailidis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Theodoros Semertzidis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michalis Lazaridis</string-name>
          <email>lazar@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael G. Strintzis</string-name>
          <email>strintzi@eng.auth.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petros Daras</string-name>
          <email>daras@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CERTH-Information Technologies Institute</institution>
          ,
          <addr-line>6th Km. Charilaou-Thermi, Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>T. Semertzidis and M.G. Strintzis are also with the Information Processing Laboratory, Electrical and Computer Engineering Dept., Aristotle University of Thessaloniki</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>In this paper, we present a data-driven approach for challenge 1 of the MediaEval 2013 Social Event Detection Task. Our proposed approach consists of the following steps: (a) initialization based on the images' spatio-temporal information; (b) computation of cluster intercorrelations; and (c) final cluster generation. In the initialization step, the images that have both geolocation and time information are clustered accordingly, generating a few “anchored” clusters, while the rest of the images, with no geolocation or time information, are considered as singleton (one-image) clusters. In the second step, all pairwise intercorrelations between the “anchored” and the singleton clusters are calculated with the help of an aggregated similarity measure based on the user, title, description, tag, and visual information of the images. In the final step, the “anchored” and singleton clusters derived from the initialization step are merged based on the calculated intercorrelations of the second step to generate the final clusters. Our best run achieves a score of 0.5701, 0.8739 and 0.5592 for F1-Measure, NMI and Divergence F1, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        We hereby present the data-driven approach followed by the
Visual Computing Lab (http://vcl.iti.gr) of CERTH at the
MediaEval 2013 Social Event Detection Task for challenge
1. Details of the task are provided in the paper from Reuter
et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Previous work in the field uses techniques such as the
LDA approach of Vavliakis et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or spectral clustering of Petkos et
al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that perform well on small sets but have high
preprocessing requirements. Our motivation was to design
an approach that exploits as much of the available
information as possible, while avoiding complex training algorithms and
classification schemes that do not scale well. Towards this end,
our goal was to process the dataset in the shortest possible time.
      </p>
    </sec>
    <sec id="sec-1a">
      <title>2. PROPOSED APPROACH</title>
      <p>Initialization Step: Given the N images in the database, we
retrieve the A ≤ N images that have both geolocation and time
information, while the remaining Csing = N − A images are
considered as singleton clusters. The geolocation information is
provided as longitude and latitude coordinates, whereas in
our approach we consider the date on which a photo was
taken as its time information. The A images are then clustered
in two steps. In the first step, two images I1, I2 ∈ A that
do not differ by more than a predefined threshold l are
clustered together, on the condition that both |I1,long − I2,long| ≤ l
and |I1,lat − I2,lat| ≤ l hold, where Ii,long and Ii,lat are
the longitude and latitude coordinates of the i-th image. In
doing so, clusters C1, ..., Cr are generated. In the second
step the r generated clusters are split based on the time
condition that the images of a cluster should lie within a
predefined time window w. All clusters generated by the two
steps that contain images of A are called “anchored” clusters
Canc, in the sense that these clusters must not be merged
together, since the time condition is never satisfied between them. The
final outcome of the initialization step is (a) the Csing
singleton clusters and (b) the Canc “anchored” clusters of
the images in A, where each cluster Ci is associated with a
minimum date Tmin(Ci) and a maximum date Tmax(Ci) based on the
time window w, i.e. Tmax(Ci) − Tmin(Ci) ≤ w, where Tmax(Ci)/Tmin(Ci) is the
maximum/minimum date of an image within cluster Ci.</p>
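      <p>For illustration, the following minimal Python sketch (our own simplification, not the exact implementation; the image records with 'lat', 'lon' and 'date' fields and the greedy comparison against the first image of each cluster are assumptions) shows how the initialization step can be realized with the coordinate threshold l and the time window w:</p>
      <preformat>
# Minimal sketch of the initialization step (assumed data layout:
# each image is a dict with optional 'lat', 'lon' and 'date' fields).
from datetime import timedelta

L_THRES = 0.05                  # coordinate threshold l (degrees)
W_WINDOW = timedelta(hours=24)  # time window w

def initialize(images):
    anchored, singletons = [], []
    for img in images:
        if img.get('lat') is None or img.get('lon') is None or img.get('date') is None:
            singletons.append([img])        # singleton (one-image) cluster
            continue
        placed = False
        for cluster in anchored:
            ref = cluster[0]                # greedy: compare to first member
            if abs(img['lat'] - ref['lat']) > L_THRES:
                continue
            if abs(img['lon'] - ref['lon']) > L_THRES:
                continue
            cluster.append(img)
            placed = True
            break
        if not placed:
            anchored.append([img])
    # Second step: split every spatial cluster by the time window w.
    split = []
    for cluster in anchored:
        cluster.sort(key=lambda i: i['date'])
        current = [cluster[0]]
        for img in cluster[1:]:
            if img['date'] - current[0]['date'] > W_WINDOW:
                split.append(current)
                current = [img]
            else:
                current.append(img)
        split.append(current)
    return split, singletons                # "anchored" and singleton clusters
      </preformat>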
      <p>Computation of Cluster Intercorrelations: In the
second step of our approach, we calculate the
intercorrelation of two examined clusters Ci and Cj only if they are (a)
not both “anchored” clusters and (b) satisfy the time
condition that at least one difference between the associated
dates Tmin(Ci), Tmax(Ci), Tmin(Cj), Tmax(Cj) is lower than or equal to the
time window w. In case the time information of a
singleton cluster Csing is missing, this time
constraint is ignored. Then, if the time condition is
satisfied, the intercorrelations between two clusters Ci and Cj
are computed as follows. First, each cluster Ci is associated
with the three textual vocabularies of tags, titles,
descriptions as well as with a list of users, which are the owners
of the images of cluster Ci. For each cluster Ci, the
distinct tags, titles, descriptions, and users form the respective
textual vocabularies and the list of users. For the textual
information of tags, titles, descriptions we used a Jaccard
similarity measure to calculate the textual similarity
measures Stags(Ci, Cj), Sdesc(Ci, Cj) and Stitle(Ci, Cj) for each
pair Ci–Cj. In parallel, the cluster similarity Susers(Ci, Cj)
based on the users is computed.</p>
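      <p>As a small illustration of how these per-modality similarities can be computed, the sketch below applies a plain Jaccard coefficient to two cluster vocabularies (the example vocabularies are hypothetical; extracting them from the tags, titles, descriptions and owners of each cluster is assumed to have been done already):</p>
      <preformat>
def jaccard(vocab_a, vocab_b):
    """Jaccard coefficient between two vocabularies (sets of strings)."""
    if not vocab_a and not vocab_b:
        return 0.0
    inter = len(vocab_a.intersection(vocab_b))
    union = len(vocab_a.union(vocab_b))
    return inter / union

# Hypothetical tag vocabularies of two clusters Ci and Cj.
tags_ci = {'concert', 'rock', 'london'}
tags_cj = {'concert', 'london', 'stage'}
s_tags = jaccard(tags_ci, tags_cj)   # 2 shared / 4 distinct = 0.5
# The same function is applied to the title, description and user lists.
      </preformat>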
      <p>Using the visual information, each cluster Ci is associated
with the set Ivis(Ci) of the distinct visual neighbors of all images that belong to
cluster Ci, obtained by aggregating the k visual neighbors of each image.
The visual similarity Svisual(Ci, Cj) between two clusters Ci and Cj
is then calculated over these sets.
Finally, the intercorrelation between two clusters Ci and
Cj is computed using the following aggregated similarity
measure:</p>
      <p>Sagg(Ci, Cj) = a1·Susers(Ci, Cj) + a2·Stags(Ci, Cj) + a3·Sdesc(Ci, Cj) + a4·Stitle(Ci, Cj) + a5·Svisual(Ci, Cj),
where each coefficient a1, a2, a3, a4, a5 expresses the
respective weight of each similarity measure.</p>
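      <p>The corresponding aggregation is a single weighted sum; below is a minimal sketch with the weights reported in Section 3 (the five partial similarities are assumed to have been computed as described above):</p>
      <preformat>
# Weights a1..a5 as selected by the grid search in Section 3.
A_USERS, A_TAGS, A_DESC, A_TITLE, A_VISUAL = 0.5, 0.125, 0.125, 0.125, 0.125

def aggregated_similarity(s_users, s_tags, s_desc, s_title, s_visual):
    """Weighted combination Sagg(Ci, Cj) of the five partial similarities."""
    return (A_USERS * s_users + A_TAGS * s_tags + A_DESC * s_desc
            + A_TITLE * s_title + A_VISUAL * s_visual)
      </preformat>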
      <p>Final Cluster Generation: The final clusters are
generated based on the calculated intercorrelations. For each
singleton cluster Csing the maximum intercorrelation with
an “anchored” cluster Canc is computed. If the condition
Sagg(Csing, Canc) ≥ Mthres, where Mthres is a merging
threshold, is satisfied, then the two clusters are merged. After all
pair-wise comparisons between the singleton clusters and the
“anchored” ones, the non-merged singleton clusters are
compared with each other in the same way and merged
analogously, which generates the final clusters.</p>
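      <p>A minimal sketch of this merging pass (greedy and threshold-based; the cluster lists and the pairwise aggregated similarity function are assumed to be available from the previous steps):</p>
      <preformat>
def merge_singletons(singletons, anchored, sim, m_thres):
    """Attach each singleton cluster to its best-matching 'anchored' cluster
    when the maximum aggregated similarity reaches the threshold M_thres."""
    leftovers = []
    for s in singletons:
        best, best_sim = None, 0.0
        for a in anchored:
            value = sim(s, a)
            if value > best_sim:
                best, best_sim = a, value
        if best is not None and best_sim >= m_thres:
            best.extend(s)       # merge the singleton into the anchored cluster
        else:
            leftovers.append(s)  # compared with each other in a second pass
    return leftovers
      </preformat>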
    </sec>
    <sec id="sec-2">
      <title>3. EXPERIMENTS</title>
      <p>
        A grid selection strategy was used to compute the optimal
values l = 0.05, w = 24 hours, a1 = 0.5, and a2 = a3 = a4 = a5 = 0.125.
In order to retrieve the k visual neighbors, we used
Opponent SIFT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] with a codebook of 1004 dimensions. The
number of visual neighbors was set to k = 20. By varying
the merging threshold Mthres, our best run on the training
set, with Mthres = 0.4, achieves an F1-Measure of 0.8889, an NMI
of 0.9771, and a DIV-F1 of 0.8076.
      </p>
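      <p>The grid selection itself can be sketched as a simple exhaustive loop over candidate parameter values (the candidate grids and the evaluate() function computing the F1-Measure on the training ground truth are illustrative placeholders, not the ranges actually searched):</p>
      <preformat>
from itertools import product

# Illustrative candidate grids; the real search ranges are not reported here.
L_GRID = [0.01, 0.05, 0.1]        # coordinate threshold l
W_GRID = [12, 24, 48]             # time window w, in hours
M_GRID = [0.2, 0.3, 0.4, 0.5]     # merging threshold Mthres

def grid_search(evaluate):
    """Return the parameter triple that maximizes the training F1-Measure."""
    best_score, best_params = 0.0, None
    for l, w, m in product(L_GRID, W_GRID, M_GRID):
        score = evaluate(l=l, w=w, m_thres=m)
        if score > best_score:
            best_score, best_params = score, (l, w, m)
    return best_params, best_score
      </preformat>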
      <p>The experimental results for the test set are presented in
Table 1. For the required run the merging threshold Mthres
is set to 0.003. For the general runs (1-4) it is set to 0.01,
0.005, 0.01 and 0.005, respectively. The remark “visual”
denotes that visual information was additionally used. The
reason for using low values of Mthres in our runs on the test
set is that the majority of the calculated intercorrelations
were lower than 0.01, whereas the respective
intercorrelations in the training set were lower than 0.4. Higher
values of Mthres on the test set generated many singleton
clusters. The unknown number of clusters and the extremely
low cluster intercorrelations can explain the large difference
between the results on the training and the test set. All
experiments were conducted on our distributed environment in the
context of the EC funded project CUBRIK (see Section 5).</p>
    </sec>
    <sec id="sec-3">
      <title>4. DISCUSSION</title>
      <p>
        Based on the experimental results of Table 1, we observe
that the visual information slightly improves the
performance of the algorithm. However, it does not necessarily solve the
sparsity problem, which is evident from the many zero and
low values of the cluster intercorrelations. This happens
because the k visual neighbors of each image are not necessarily
conceptually similar and thus add noise to the cluster
intercorrelations. This is a very important challenge for many
content-based tag propagation methods that try to solve the
sparsity problem between sparsely annotated images. This issue
has been termed “learning tag relevance”: the
semantic connections between an assigned tag (or any other
textual information) and the content it represents must
be revealed in order to perform accurate tag propagation. In the
future, we plan to evaluate the proposed data-driven
approach using our personalized content-based tag
propagation method [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], in order to address the extreme sparsity that
may occur in the cluster intercorrelations.
      </p>
    </sec>
    <sec id="sec-4">
      <title>5. ACKNOWLEDGEMENTS</title>
      <p>This work was partially supported by the EC FP7 funded
project CUBRIK, ICT- 287704 (www.cubrikproject.eu).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Petkos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <article-title>Social event detection using multimodal clustering and integrating supervisory signals</article-title>
          .
          <source>In ICMR. ACM</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rafailidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Axenopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Etzold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manolopoulou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Daras</surname>
          </string-name>
          .
          <article-title>Content-based tag propagation and tensor factorization for personalized item recommendation based on social tagging</article-title>
          .
          <source>ACM Trans. Interact. Intell. Syst.</source>
          , to appear.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Reuter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>de Vries</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Geva</surname>
          </string-name>
          .
          <article-title>Social Event Detection at MediaEval 2013: Challenges, datasets, and evaluation</article-title>
          . In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K. E. A.</given-names>
            <surname>van de Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gevers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. G. M.</given-names>
            <surname>Snoek</surname>
          </string-name>
          .
          <article-title>Evaluating color descriptors for object and scene recognition</article-title>
          .
          <source>IEEE Trans. on PAMI</source>
          ,
          <volume>32</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1582</fpage>
          -
          <lpage>1596</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K. N.</given-names>
            <surname>Vavliakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Tzima</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Mitkas</surname>
          </string-name>
          .
          <article-title>Event detection via LDA for the MediaEval 2012 SED task</article-title>
          .
          <source>In MediaEval Workshop</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>