=Paper=
{{Paper
|id=Vol-1263/paper45
|storemode=property
|title=LIMSI @ MediaEval SED 2014
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_45.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/GuinaudeauLB14
}}
==LIMSI @ MediaEval SED 2014==
LIMSI @ MediaEval SED 2014 Camille Guinaudeau1,2 , Antoine Laurent2 , Hervé Bredin2 1 University Paris-Sud, Rue du Château, 91400 Orsay 2 LIMSI-CNRS, Rue John Von Neumann, 91400 Orsay {guinaudeau,laurent,bredin}@limsi.fr ABSTRACT Table 1: Homogeneity for user-based clustering This paper provides an overview of the Social Event Detec- Dev A Dev B Dev C tion (SED) system developed at LIMSI for the 2014 cam- 1h 0.9874 0.9872 0.9874 paign. Our approach is based on a hierarchical agglomera- 10h 0.9813 0.9796 0.9798 tive clustering that uses textual metadata, user-based knowl- 20h 0.9785 0.9766 0.9770 edge and geographical information. These different sources 24h 0.9777 0.9755 0.9757 of knowledge, either used separately or in cascade, reach 30h 0.9763 0.9743 0.9749 good results for the full clustering subtask with a normal- 100h 0.9678 0.9673 0.9665 ized mutual information equal to 0.95 and F1 scores greater than 0.82 for our best run. each dataset has approximately the same number of clusters 1. INTRODUCTION and the same distribution in terms of number of images per The Social Event Detection (SED) task aims at mining cluster. Moreover, the number of images in each cluster is social events (such as concerts, protest and so on) in large also quite similar to the number of images contained in the collections of online multimedia [1]. This challenge is divided test set (110,541) allowing us to experiment on our approach into two subtasks: detection by full clustering and retrieval with comparable datasets. As explained in the introduction, of events. In this work, we focus only on the first subtask a preliminary clustering was first applied on these datasets which consists in clustering all images in the given dataset, – to create user-based clusters of images – before the hier- so that each cluster represents a social event. As the number archical clustering step that makes use of textual metadata of the target clusters is not provided by the SED organizers, and/or geographical information. the main difficulty of this substak is to infer this number. To overcome this difficulty, our full clustering system relies on 2.1 User-based clustering a hierarchical agglomerative clustering approach that works The preliminary clustering is obtained by creating one by continuously merging the most similar clusters (as long as cluster per user, using the user name associated with each the similarity between them is higher than some threshold). image in the dataset. As these user-based clusters usually In this work, our system is only based on the metadata as- contain several social events (one user is rarely associated sociated with images. The hierarchical clustering approach, with only one social event) they are then divided into smaller presented in Section 2.2, is based on distance matrices ac- ones depending on date and time information also mentioned counting for textual metadata (Section 2.3.1) or geographi- in the pictures’ metadata. cal information (Section 2.3.2). In order to make the dis- For each cluster, we initialize a time reference with the tance computation as robust as possible and avoid data date of a randomly chosen picture, hereafter called picture sparsity, a first preliminary clustering is performed on the F , in the cluster. We then compare this time reference with dataset, so that each preliminary cluster is associated with the date of all the other pictures in the cluster and we pick a set of metadata coming from all the images in the cluster. the closest one, hereafter called picture C. If picture C is This preliminary clustering is described in Section 2.1. taken less than α1 hours before or after the time reference, then it belongs in the same cluster as picture F and the time reference is recomputed to equal the mean between the 2. FULL CLUSTERING previous time reference and the date of picture C. How- The development dataset released by the SED task orga- ever, if picture C was taken too far in time from picture F , nizers for the full clustering subtask is composed by 362,578 then another cluster is created with a new time reference images collected from Flickr, associated with their meta- corresponding to the taken time of picture C. data. To lower the computation time and facilitate our ex- The objective of this step is to lower the number of clus- perimentation process, this development dataset was divided ters to be processed in the hierarchical clustering step while into three smaller datasets Dev A, Dev B and Dev C so that keeping the clusters as pure as possible. To evaluate the pu- rity of our clusters we use the homogeneity metric [3] which equals one when each cluster contains only members of a Copyright is held by the author/owner(s). 1 MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain All parameters were tuned on Dev A. Table 2: Results on test set Dev A Dev B Dev C 20h-Geo-Text 24h-Geo-Text 30h-Geo-Text 24h-Text 24h-Geo F1 (Main Score) 0.7895 0.7869 0.7912 0.8214 0.8140 0.8115 0.7563 0.7387 NMI 0.9479 0.9472 0.9483 0.9554 0.9532 0.9526 0.9423 0.9359 Divergence F1 0.6880 0.7258 0.7224 0.8207 0.8132 0.8107 0.7557 0.7380 single class. Table 1 summarizes the homogeneity scores for 2.3.2 Geographic distance matrix the development datasets Dev A, Dev B and Dev C with We also compute the geographic distance between every the α parameter ranging from 1 hour to 100 hours. It can cluster that contains at least one picture with GPS informa- be seen from this table that, first, the values are similar for tion. The distance between two clusters u and v corresponds all the datasets and, second, that the homogenity values are to the minimum geographic distance between any picture high even if the α parameter equals 100 meaning that users from cluster u and any picture from cluster v. Moreover, in tend not to participate to social events very frequently. order to avoid the merging of events that take place in the same location but at different time (such as festivals that 2.2 Hierarchical clustering approach take place at the same location every year), we also prevent The hierarchical agglomerative clustering approach begins the merging of two clusters if their associated date is greater with the set of clusters that have been defined in the pre- than a certain threshold (48h) by artificially increasing their liminary clustering presented in the previous section and is geographical distance. based on a single-linkage clustering method. When two clus- ters u and v from this set are combined into a single cluster w, u and v are removed from the set, and w is added to the 3. EXPERIMENTS AND RESULTS set. When only one cluster remains, the algorithm stops. To For the full clustering subtask, each participant was al- decide whether two clusters have to be combined, a distance lowed to submit up to 5 runs. In this section we describe matrix is maintained at each iteration where the d[u, v] en- our runs and discuss the results obtained. All the submit- try corresponds to the distance between cluster u and v. ted runs are based on the preliminary clustering, that uses At each iteration, the algorithm must update the distance an α parameter equal to 20 hours, 24 hours or 30 hours. matrix to reflect the distance of the newly formed cluster w The hierarchical clustering is then obtained thanks to the with the remaining clusters in the set. The distance between textual metadata only (Text), the geographical information the newly formed cluster w and each v 0 is computed with only (Geo) or both sources of knowledge (Geo-Text). In the following equation this latter case, the combination is done in cascade, meaning that a hierarchical clustering is first performed thanks to the d(w, v 0 ) = min(dist(w[i], v 0 [j])) (1) geographical information and then a hierarchical clustering 0 based on text is applied on the result of the geographical for all images i in cluster w and j in cluster v . clustering. From Table 2 it can be seen that the combi- The final clustering is then obtained by forming flat clus- nation gives the best results with F1 scores greater than ters from the hierarchical clustering previously defined. A 0.8 and normalized mutual information greater than 0.95. threshold θ is used so that observations in each flat cluster We can also see that combining both sources of information have no intergroup distance greater than θ. (24h-Geo-Text) improves the metric values compared with 2.3 Distance matrices information used alone (24h-Text and 24h-Geo). Finally the left part of the table presents the results obtained on our In this last part, we describe how the distance matrices three development datasets. We can notice from these num- used in the hierarchical agglomerative clustering approach bers that the proposed approach is robust and gives similar are computed. Both use the metadata associated with the results on both the development and test sets. pictures, namely textual metadata and geographical infor- mation. 4. REFERENCES 2.3.1 Textual metadata distance matrix [1] Georgios Petkos, Symeon Papadopoulos, Vasileios To compute the textual distance, each cluster is repre- Mezaris, and Yiannis Kompatsiaris. Social Event sented by a vector composed by lemmas weighted with a Detection at MediaEval 2014: Challenges, Datasets, BM25 score. A cosine distance is then computed between and Evaluation. In Working Notes Proceedings of the two vectors to estimate the distance between the two cor- MediaEval 2014 Workshop, Barcelona, Spain, October responding clusters. To create the vectors, words are ex- 2014. tracted from the textual metadata associated with each pic- [2] Stephen E Robertson, Steve Walker, Susan Jones, ture in the cluster. These words are then lemmatized and Micheline M Hancock-Beaulieu, Mike Gatford, et al. only nouns, adjectives and non modal verbs are kept to char- Okapi at TREC-3. NIST Special Publication, pages acterize the cluster. Each lemma in the vector is finally as- 109–109, 1995. sociated with a score computed using the BM25 weighting [3] Andrew Rosenberg and Julia Hirschberg. V-measure: A function [2] that gives a weight close to 1 to lemmas that are conditional entropy-based external cluster evaluation the most representative of the content of the cluster. In our measure. In EMNLP-CoNLL, volume 7, pages 410–420. system, lemmas are extracted from title, description and, Citeseer, 2007. when available, tags metadata.