1. INTRODUCTION

Camille Guinaudeau

guinaudeau@limsi.fr 0 1

Antoine Laurent

laurent@limsi.fr 0

Hervé Bredin

bredin@limsi.fr 0 0 LIMSI-CNRS, Rue John Von Neumann , 91400 Orsay 1 University Paris-Sud , Rue du Château, 91400 Orsay

2014

16 17

This paper provides an overview of the Social Event Detection (SED) system developed at LIMSI for the 2014 campaign. Our approach is based on a hierarchical agglomerative clustering that uses textual metadata, user-based knowledge and geographical information. These di erent sources of knowledge, either used separately or in cascade, reach good results for the full clustering subtask with a normalized mutual information equal to 0.95 and F1 scores greater than 0.82 for our best run.

1. INTRODUCTION

The Social Event Detection (SED) task aims at mining social events (such as concerts, protest and so on) in large collections of online multimedia [ 1 ]. This challenge is divided into two subtasks: detection by full clustering and retrieval of events. In this work, we focus only on the rst subtask which consists in clustering all images in the given dataset, so that each cluster represents a social event. As the number of the target clusters is not provided by the SED organizers, the main di culty of this substak is to infer this number. To overcome this di culty, our full clustering system relies on a hierarchical agglomerative clustering approach that works by continuously merging the most similar clusters (as long as the similarity between them is higher than some threshold).

In this work, our system is only based on the metadata associated with images. The hierarchical clustering approach, presented in Section 2.2, is based on distance matrices accounting for textual metadata (Section 2.3.1) or geographical information (Section 2.3.2). In order to make the distance computation as robust as possible and avoid data sparsity, a rst preliminary clustering is performed on the dataset, so that each preliminary cluster is associated with a set of metadata coming from all the images in the cluster. This preliminary clustering is described in Section 2.1.

FULL CLUSTERING

The development dataset released by the SED task organizers for the full clustering subtask is composed by 362,578 images collected from Flickr, associated with their metadata. To lower the computation time and facilitate our experimentation process, this development dataset was divided into three smaller datasets Dev A, Dev B and Dev C so that each dataset has approximately the same number of clusters and the same distribution in terms of number of images per cluster. Moreover, the number of images in each cluster is also quite similar to the number of images contained in the test set (110,541) allowing us to experiment on our approach with comparable datasets. As explained in the introduction, a preliminary clustering was rst applied on these datasets { to create user-based clusters of images { before the hierarchical clustering step that makes use of textual metadata and/or geographical information. 2.1

User-based clustering

The preliminary clustering is obtained by creating one cluster per user, using the user name associated with each image in the dataset. As these user-based clusters usually contain several social events (one user is rarely associated with only one social event) they are then divided into smaller ones depending on date and time information also mentioned in the pictures' metadata.

For each cluster, we initialize a time reference with the date of a randomly chosen picture, hereafter called picture F , in the cluster. We then compare this time reference with the date of all the other pictures in the cluster and we pick the closest one, hereafter called picture C. If picture C is taken less than 1 hours before or after the time reference, then it belongs in the same cluster as picture F and the time reference is recomputed to equal the mean between the previous time reference and the date of picture C. However, if picture C was taken too far in time from picture F , then another cluster is created with a new time reference corresponding to the taken time of picture C.

The objective of this step is to lower the number of clusters to be processed in the hierarchical clustering step while keeping the clusters as pure as possible. To evaluate the purity of our clusters we use the homogeneity metric [ 3 ] which equals one when each cluster contains only members of a 1All parameters were tuned on Dev A.

F1 (Main Score)

NMI Divergence F1

Dev A 0.7895 0.9479 0.6880

Dev B 0.7869 0.9472 0.7258 single class. Table 1 summarizes the homogeneity scores for the development datasets Dev A, Dev B and Dev C with the parameter ranging from 1 hour to 100 hours. It can be seen from this table that, rst, the values are similar for all the datasets and, second, that the homogenity values are high even if the parameter equals 100 meaning that users tend not to participate to social events very frequently. 2.2

Hierarchical clustering approach

The hierarchical agglomerative clustering approach begins with the set of clusters that have been de ned in the preliminary clustering presented in the previous section and is based on a single-linkage clustering method. When two clusters u and v from this set are combined into a single cluster w, u and v are removed from the set, and w is added to the set. When only one cluster remains, the algorithm stops. To decide whether two clusters have to be combined, a distance matrix is maintained at each iteration where the d[u; v] entry corresponds to the distance between cluster u and v. At each iteration, the algorithm must update the distance matrix to re ect the distance of the newly formed cluster w with the remaining clusters in the set. The distance between the newly formed cluster w and each v0 is computed with the following equation d(w; v0) = min(dist(w[i]; v0[j])) (1) for all images i in cluster w and j in cluster v0.

The nal clustering is then obtained by forming at clusters from the hierarchical clustering previously de ned. A threshold is used so that observations in each at cluster have no intergroup distance greater than . 2.3

Distance matrices

In this last part, we describe how the distance matrices used in the hierarchical agglomerative clustering approach are computed. Both use the metadata associated with the pictures, namely textual metadata and geographical information. 2.3.1

Textual metadata distance matrix

To compute the textual distance, each cluster is represented by a vector composed by lemmas weighted with a BM25 score. A cosine distance is then computed between two vectors to estimate the distance between the two corresponding clusters. To create the vectors, words are extracted from the textual metadata associated with each picture in the cluster. These words are then lemmatized and only nouns, adjectives and non modal verbs are kept to characterize the cluster. Each lemma in the vector is nally associated with a score computed using the BM25 weighting function [ 2 ] that gives a weight close to 1 to lemmas that are the most representative of the content of the cluster. In our system, lemmas are extracted from title, description and, when available, tags metadata. 2.3.2

Geographic distance matrix

We also compute the geographic distance between every cluster that contains at least one picture with GPS information. The distance between two clusters u and v corresponds to the minimum geographic distance between any picture from cluster u and any picture from cluster v. Moreover, in order to avoid the merging of events that take place in the same location but at di erent time (such as festivals that take place at the same location every year), we also prevent the merging of two clusters if their associated date is greater than a certain threshold (48h) by arti cially increasing their geographical distance. 3.

EXPERIMENTS AND RESULTS

For the full clustering subtask, each participant was allowed to submit up to 5 runs. In this section we describe our runs and discuss the results obtained. All the submitted runs are based on the preliminary clustering, that uses an parameter equal to 20 hours, 24 hours or 30 hours. The hierarchical clustering is then obtained thanks to the textual metadata only (Text), the geographical information only (Geo) or both sources of knowledge (Geo-Text). In this latter case, the combination is done in cascade, meaning that a hierarchical clustering is rst performed thanks to the geographical information and then a hierarchical clustering based on text is applied on the result of the geographical clustering. From Table 2 it can be seen that the combination gives the best results with F1 scores greater than 0.8 and normalized mutual information greater than 0.95. We can also see that combining both sources of information (24h-Geo-Text) improves the metric values compared with information used alone (24h-Text and 24h-Geo). Finally the left part of the table presents the results obtained on our three development datasets. We can notice from these numbers that the proposed approach is robust and gives similar results on both the development and test sets.

[1]

Georgios

Petkos , Symeon Papadopoulos, Vasileios Mezaris, and

Yiannis

Kompatsiaris . Social Event Detection at MediaEval 2014 : Challenges, Datasets, and Evaluation . In Working Notes Proceedings of the MediaEval 2014 Workshop , Barcelona, Spain, October 2014 .

[2] Stephen

E Robertson

, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu , Mike Gatford , et al. Okapi at TREC-3 . NIST Special Publication, pages 109 { 109 , 1995 .

[3]

Andrew

Rosenberg and

Julia

Hirschberg . V-measure: A conditional entropy-based external cluster evaluation measure . In EMNLP-CoNLL , volume 7 , pages 410 { 420 . Citeseer , 2007 .