MediaEval benchmark: Social Event Detection using LDA and external resources

Mohamed Morchid                               Georges Linarès
Laboratoire d'Informatique d'Avignon          Laboratoire d'Informatique d'Avignon
LIA, Avignon, France                          LIA, Avignon, France
mohamed.morchid@etd.univ-avignon.fr           georges.linares@univ-avignon.fr

ABSTRACT
This article presents two methods for the automatic detection of social events, evaluated on the annotated picture set of the 2011 MediaEval benchmark [1]. The first method uses a set of web pages and a semantic space obtained by Latent Dirichlet Allocation (LDA) [2, 3] to classify pictures from Flickr. The second approach uses the query to extract a subset of pictures and then classifies this subset. These approaches are compared in the experimental framework of the MediaEval 2011 Social Event Detection task.

Categories and Subject Descriptors
H.3 [Information Search and Retrieval]: Indexing

General Terms
LDA, picture categorization

Keywords
LDA, Benchmark, MediaEval, Classification, Event Detection

1. INTRODUCTION
Searching and browsing picture collections from sharing platforms requires automatic processing of both the content and the metadata provided by users or owners. Social event detection consists in finding, in a large collection of photos, the ones that are related to a specific event. Our system performs two steps: it first extracts all the pictures related to the event category (e.g. soccer, Barcelona and Rome), and then selects the pictures related to specific events. In the following, the category extraction step is named subset extraction, and the by-event clustering is named subset clustering.
We tested a method based on a semantic representation of pictures by LDA, which is compared to a simple clustering method. These two methods are described in Sections 2 and 3 of this paper, respectively.

2. FIRST RUN: LDA-BASED CONTENT REPRESENTATION
This method relies on an intermediate representation of pictures in a semantic space obtained by LDA.

2.1 Subset extraction
The proposed extraction method relies first on building a corpus from the web, obtained by querying Google with the keywords of the challenge query. From this corpus, we create a query-dependent feature vector used to evaluate picture/event-category similarities. We then select the nearest pictures according to a fixed threshold.

→ Query representation
The event category is represented by a feature vector obtained by analysing the related web pages. Web pages are collected by sending the query¹ to Google and selecting the 100 best documents. The query-dependent feature vector v is composed of the relative frequency p(w|v) of each word w of the corpus, divided by the average position of the first occurrence of the word, position_w, in the returned documents. A stop-list-based filtering process removes meaningless words. The last feature of the vector is the number of seconds between 1 January 1970 and the date the picture was taken.

→ Distance between pictures and query
We want to select the pictures that are related to the query. This is achieved by computing the distance between the picture and the feature vector:

    Dist(pict_k, v) = Σ_{w ∈ pict_k} v(w),   with   v(w) = TF_w / position_w    (1)

We create a subset with the pictures whose distance to the feature vector exceeds a fixed threshold.

2.2 Subset clustering
This subset must then be clustered into parts that are supposed to correspond to social events belonging to the same category. This clustering step is performed in the LDA space.

→ Semantic space with LDA
For each challenge, we only have the words of the query to find the events in the picture set. Here, we locate the query in a topic space estimated by LDA. The 50-topic LDA model is estimated on the dataset obtained for the Subset Extraction step.
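As an illustration, the subset-extraction scoring of Section 2.1 (Eq. 1) can be sketched as follows. This is a minimal sketch with hypothetical toy data, simple whitespace tokenisation, and the date feature omitted; all function names are ours, not the authors':

```python
from collections import Counter

def build_query_vector(documents, stop_list):
    """v(w) = TF_w / position_w (Eq. 1): term frequency over the corpus,
    divided by the average position of the first occurrence of w."""
    tf = Counter()
    first_positions = {}  # word -> first-occurrence position in each document
    for doc in documents:
        words = doc.lower().split()
        tf.update(w for w in words if w not in stop_list)
        seen = set()
        for pos, w in enumerate(words, start=1):
            if w not in stop_list and w not in seen:
                first_positions.setdefault(w, []).append(pos)
                seen.add(w)
    return {w: tf[w] / (sum(ps) / len(ps)) for w, ps in first_positions.items()}

def picture_distance(picture_words, v):
    """Dist(pict_k, v): sum of v(w) over the words of the picture (Eq. 1)."""
    return sum(v.get(w, 0.0) for w in picture_words)

def extract_subset(pictures, v, threshold):
    """Keep the pictures whose distance to the feature vector exceeds the threshold."""
    return [p for p in pictures if picture_distance(p, v) > threshold]

# Toy data, for illustration only (not the benchmark corpus).
docs = ["barcelona soccer match at camp nou", "rome soccer derby report"]
v = build_query_vector(docs, stop_list={"at", "the"})
pictures = [["soccer", "barcelona"], ["holiday", "beach"]]
subset = extract_subset(pictures, v, threshold=0.5)
print(subset)  # → [['soccer', 'barcelona']]
```

The threshold here is arbitrary; in the paper it is a fixed value tuned on the task.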
In order to estimate the LDA model, we collect text materials from the Web by using queries related to the event category, represented by a set of keywords.

→ Vector of distance
We compute, for each picture, a vector of distances to all topics. We add to this vector another feature: the number of seconds between the date of the picture (dateTaken field) and 1 January 1970.

    Dist(pict_k, t_j) = Σ_{w ∈ pict_k} p(w|t_j) · p(t_j|C)    (2)

We use the prior probability p(t_j|C) of a topic t_j in the corpus C to weight the distance of a picture to a more or less relevant topic.

→ Clustering
The selected picture set is clustered using the Expectation-Maximisation, Gaussian-based clustering algorithm [4].

3. SECOND RUN: PROPOSED APPROACH
In this run, we use only the information of the pictures and the query for the two challenges. This information consists of all the available textual metadata. The global processing scheme is similar to the one used for the LDA-based approach: a first step selects a subset of relevant pictures, which are then clustered in a second step. Here, similarities are based only on text-level comparison, the query being represented by its keywords and the pictures by their title, description and tags (if available).
Section 3.1 presents the method used to extract a subset of pictures related to the query, and Section 3.2 describes the clustering method.

3.1 Subset extraction
This first selection step relies on an estimation of the proximity of a picture to the targeted query. We count the number of occurrences of each word of the query in the picture's text materials. A specific weighting is applied according to the field in which a word occurs: if the word appears in the title, the weight is 1.0; in the description, 0.75; and 0.25 for a tag.
For each element of the picture, we compute the F-score [5] between the words of the query and the words of the picture's features. If the F-score of a section exceeds a threshold (0.75 seems to give the best results), we apply another boosting rule by multiplying the score of the element by 100. We add to the subset the pictures whose score is over 40% of the highest score.

3.2 Subset clustering
We determine the similarity between each pair of pictures:

    Sim(pict_j, pict_k) = N_{j,k} / (N_j + N_k)

where j ≠ k, N_j is the total number of words in pict_j, and N_{j,k} is the number of words that belong both to an element of pict_j and to an element of pict_k. The system puts each picture in the cluster that contains the picture of highest similarity.

4. EXPERIMENTS
For this task, we used 73269 pictures from Flickr [1]. Each picture is associated with a title, a description, an owner nickname and tags.
We present in Table 1 the results for each challenge, where E is the number of events detected for the challenge, PA the number of pictures accepted, PR the number of pictures rejected, and % the percentage of accepted pictures.

Table 1: Results for each challenge and each run
               Challenge I                Challenge II
          E    PA     PR      %       E    PA     PR      %
Run 1    10   1223   72046   1.6      9   1373   71896   1.8
Run 2     3     13   73256   0       20     65   73204   0

In Table 2 we present our evaluation (as computed by the MediaEval benchmark organisers). The measures are Normalized Mutual Information (NMI) and F-score.

Table 2: Evaluation for each challenge and each run
            Challenge I            Challenge II
         F-Score    NMI         F-Score    NMI
Run 1     10.13     0.0263       12.44    -0.01
Run 2      Un.      Un.           3.53     0.0253

5. CONCLUSION
We proposed two methods to cluster a set of pictures from Flickr. In the first run, we use LDA and web pages to cluster pictures by topics. We also propose a second method that uses the query of the challenge to estimate picture/cluster similarities. The evaluation probably reflects a technical problem that remains to be clearly understood. Nevertheless, the results show that a high-level approach such as representation in a semantic space does not perform well, probably due to its complexity and to the various possibilities of adding noise at different levels of the processing chain (data collection, topic modeling, document representation in the topic space, etc.).

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy
¹ C1: soccer barcelona rome
  C2: may 2009 parc del forum barcelona paradiso amsterdam

6. REFERENCES
[1] S. Papadopoulos, R. Troncy, V. Mezaris, B. Huet, and I. Kompatsiaris, "Social Event Detection at MediaEval 2011: Challenges, Dataset and Evaluation," in MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.
[2] A. McCallum, "MALLET: A Machine Learning for Language Toolkit," 2002.
[3] L. Rigouste, O. Cappé, and F. Yvon, "Quelques observations sur le modèle LDA," Actes des IXe JADT, pp. 819-830, 2006.
[4] T. Moon, "The expectation-maximization algorithm," IEEE Signal Processing Magazine, vol. 13, no. 6, pp. 47-60, Nov. 1996.
[5] C. Goutte and E. Gaussier, "A probabilistic interpretation of precision, recall and F-score, with implication for evaluation," Advances in Information Retrieval, pp. 345-359, 2005.
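For concreteness, the pairwise similarity and greedy cluster assignment of Section 3.2 can be sketched as follows. This is an illustrative sketch on hypothetical toy metadata; the field weighting and boosting rules of Section 3.1 are omitted, and the function names are ours:

```python
def similarity(words_j, words_k):
    """Sim(pict_j, pict_k) = N_{j,k} / (N_j + N_k): number of shared words
    over the summed word counts of the two pictures."""
    shared = len(set(words_j) & set(words_k))
    return shared / (len(words_j) + len(words_k))

def cluster(pictures):
    """Put each picture in the cluster containing its most similar picture;
    open a new cluster when no clustered picture shares any word with it."""
    clusters = []  # each cluster is a list of picture indices
    for k in range(len(pictures)):
        best_score, best_cluster = 0.0, None
        for c in clusters:
            for j in c:
                s = similarity(pictures[j], pictures[k])
                if s > best_score:
                    best_score, best_cluster = s, c
        if best_cluster is not None:
            best_cluster.append(k)
        else:
            clusters.append([k])
    return clusters

# Toy metadata words (hypothetical, not from the benchmark dataset).
picts = [["soccer", "barcelona", "camp", "nou"],
         ["soccer", "barcelona", "derby"],
         ["concert", "amsterdam", "paradiso"]]
print(cluster(picts))  # → [[0, 1], [2]]
```

A picture that shares no word with any clustered picture starts a new cluster, which matches the intent of assigning each picture to the cluster of its most similar neighbour.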