=Paper=
{{Paper
|id=Vol-1177/CLEF2011wn-ImageCLEF-CsurkaEt2011a
|storemode=property
|title=XRCE's Participation at Wikipedia Retrieval of ImageCLEF 2011
|pdfUrl=https://ceur-ws.org/Vol-1177/CLEF2011wn-ImageCLEF-CsurkaEt2011a.pdf
|volume=Vol-1177
|dblpUrl=https://dblp.org/rec/conf/clef/CsurkaCP11
}}
==XRCE's Participation at Wikipedia Retrieval of ImageCLEF 2011==
XRCE and CEA LIST's Participation at Wikipedia Retrieval of ImageCLEF 2011

Gabriela Csurka (1) and Stéphane Clinchant (1,2) and Adrian Popescu (3)
(1) Xerox Research Centre Europe, 6 chemin de Maupertuis, 38240 Meylan, France, firstname.lastname@xrce.xerox.com
(2) LIG, Univ. Grenoble I, BP 53, 38041 Grenoble cedex 9, France
(3) CEA, LIST, Vision & Content Engineering Laboratory, 92263 Fontenay-aux-Roses, France, adrian.popescu@cea.fr

Abstract. In this document, we first briefly recall our baseline methods for both text and image retrieval and describe our information fusion strategy, before giving specific details concerning our submitted runs. For text retrieval, XRCE used either an Information-Based IR model [4] or a Lexical Entailment IR model based on a statistical translation IR model [5]. Alternatively, we also used an approach from CEA LIST that models the queries using, on one hand, socially related Flickr tags and, on the other hand, Wikipedia concepts, introduced in [13]. The combination of these runs has shown that the approaches were rather complementary. As image representation, we used spatial pyramids of Fisher Vectors built on local orientation histograms and local RGB statistics. The dot product was used to define the similarity between two images, and to combine the color and texture based rankings we used simple score averaging. Finally, to combine visual and textual information, we used the so-called Late Semantic Combination (LSC) method [3], where the text expert is first used to retrieve semantically relevant documents, and then the visual and textual scores are averaged to rank these documents. This strategy allowed us to improve significantly over mono-modal retrieval performances. Fusing the best text experts from XRCE and from CEA and combining them with our Fisher Vector based image run through LSC led to a MAP of 37% (the best score obtained in the Challenge).

Keywords: Multi-modal Information Retrieval, Wikipedia Retrieval, Fisher Vector, Lexical Entailment

1 Introduction

The Wikipedia Retrieval task consisted of multilingual and multimedia retrieval [14]. The collection contains images with their captions extracted from Wikipedia in different languages, namely French, English and German. In addition, participants were provided with the original Wikipedia pages in wikitext format. The aim was to retrieve as many relevant images as possible from the aforementioned collection, given a textual query translated into the three languages and one or several query images.

Each team submitted different types of runs (see Table 1): mono-media runs (text) and multimedia (mixed) runs with different fusion approaches. As the table shows, we also submitted common runs for which we applied semantic late filtering or late combination of XRCE textual runs with the textual run from CEA LIST. As the results show, these textual runs were complementary to our runs that used no external resources, and hence their combination allowed for further boosting the performance of mono- and multi-modal retrieval.

Our image representation based on Fisher Vectors is briefly recalled in section 4. To combine visual and textual information, we used a semantically filtered late combination method described in section 5.
Finally, we give specific details about the submitted runs in section 6 and conclude in section 7.

2 XRCE Text Based IR Models

We start from a traditional bag-of-words representation of pre-processed texts, where pre-processing includes tokenization, lemmatization, and standard stopword removal. However, in some cases lemmatization might lead to a loss of information. Therefore, before building the bag-of-words representation we concatenated a lemmatized version of each document with the original document. We built an index for the image captions and one for the paragraphs surrounding the images (as last year). For all runs, we averaged the scores obtained on the caption and paragraph indexes. We used two textual models: the Smoothed Power Law (SPL) Information-Based Model [4] and the Lexical Entailment (AX) IR Model [5].

2.1 Information Based IR Model (SPL)

Information models draw their inspiration from a long-standing hypothesis in IR, namely that the difference between the behavior of a word at the document level and at the collection level brings information on the significance of the word for the document. This hypothesis has been exploited in the 2-Poisson mixture model, in the notion of eliteness in BM25, and more recently in DFR models. In particular, several researchers, Harter [6] being one of the first, have observed that the distribution of significant, "specialty" words in a document deviates from the distribution of "functional" words. The more the distribution of a word in a document deviates from its average distribution in the collection, the more likely this word is significant for the document considered. This can easily be captured in terms of information:

  \mathrm{Info}(x) = -\log P(X = x \mid \lambda)   (1)

If a word behaves in the document as expected from the collection, then it has a high probability P(X = x | λ) of occurrence in the document according to the collection distribution, and the information it brings to the document, -log P(X = x | λ), is small. On the contrary, if it has a low probability of occurrence in the document according to the collection distribution, then the amount of information it conveys is greater. In a nutshell, information can be understood as a deviation from an average behavior. Overall, the general idea of the information-based family is the following:

1. To account for different document lengths, discrete term frequencies x_dw are renormalized into continuous values t_dw = t(x_dw, l_d).
2. For each term w, we assume that the renormalized values t_dw follow a probability distribution P over the corpus. Formally, T_w ~ P(· | λ_w).
3. Queries and documents are compared through a measure of surprise, i.e. a mean of information, of the form:

  RSV(q, d) = \sum_{w \in q} -q_w \log P(T_w > t_{dw} \mid \lambda_w)

So, information models are specified by two main components: a function that normalizes term frequencies across documents, and a probability distribution modeling the normalized term frequencies. Information is the key ingredient of such models, since it measures the significance of a word in a document. We chose for our runs the Smoothed Power Law model proposed in [4]. This model is specified in two steps, the Divergence from Randomness (DFR) normalization of term frequencies and the Smoothed Power Law (SPL) distribution:

– DFR normalization with parameter c:  t_{dw} = x_{dw} \log\left(1 + c \, \frac{avg_l}{l_d}\right)
– T_w \sim SPL\left(\lambda_w = \frac{N_w}{N}\right)

where avg_l is the mean document length, l_d the document length, c a parameter, N the number of documents in the collection and N_w the number of documents containing word w. The retrieval function is then:

  RSV(q, d) = \sum_{w \in q \cap d} -x_{qw} \log\left( \frac{\lambda_w^{t_{dw}/(t_{dw}+1)} - \lambda_w}{1 - \lambda_w} \right)   (2)
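To make the scoring step concrete, here is a minimal Python sketch of the SPL retrieval function of equation (2). It is our illustration, not XRCE's implementation: the dictionaries holding term and document frequencies are assumed to have been built by the indexing stage described above.

```python
import math

def spl_score(query_tf, doc_tf, doc_len, avg_len, df, num_docs, c=1.0):
    """SPL retrieval score of one document for one query (equation (2)).

    query_tf / doc_tf: {term: raw count} for the query / document,
    df: {term: N_w, number of documents containing the term},
    num_docs: N, the collection size; c: the DFR normalization parameter.
    """
    score = 0.0
    for w, x_qw in query_tf.items():
        x_dw = doc_tf.get(w, 0)
        lam = df.get(w, 0) / num_docs            # lambda_w = N_w / N
        if x_dw == 0 or lam == 0.0 or lam >= 1.0:
            continue                             # the sum runs over w in q ∩ d only
        t_dw = x_dw * math.log(1.0 + c * avg_len / doc_len)   # DFR normalization
        prob = (lam ** (t_dw / (t_dw + 1.0)) - lam) / (1.0 - lam)
        score += -x_qw * math.log(prob)
    return score
```

In our runs this score is computed separately on the caption and paragraph indexes, and the two scores are averaged.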
2.2 Lexical Entailment based IR Models (AX)

Berger and Lafferty [2] addressed the problem of information retrieval as a statistical translation problem with the well-known noisy channel model. This model can be viewed as a probabilistic version of the generalized vector space model. The analogy with the noisy channel is the following: to generate a query word, a word is first generated from a document, and this word then gets "corrupted" into a query word. The key mechanism of this model is the probability P(v|u) that term u is "translated" by term v. These probabilities enable us to address vocabulary mismatch and some kinds of semantic enrichment. The problem then lies in the estimation of such probability models.

We refer here to previous work [5] on lexical entailment models to estimate the probability that one term entails another. It can be understood as a probabilistic term similarity, or as a unigram language model associated with a word (rather than with a document or a query). Let u be a term in the corpus; lexical entailment models compute a probability distribution P(v|u) over the terms v of the corpus. These probabilities can be used in information retrieval models to enrich queries and/or documents, with an effect similar to the use of a semantic thesaurus. However, lexical entailment is purely automatic, as the statistical relationships are extracted from the considered corpus alone. In practice, a sparse representation of P(v|u) is adopted, where we restrict v to be one of the N_max terms that are closest to u according to an Information Gain metric. We computed the probabilistic term similarity on the paragraph collection, because we believed paragraphs capture a better semantic context than captions, and retained the top 10 words for each word in the collection.
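As an illustration of how such a sparse table can be used, the sketch below enriches a query with its entailed terms. The table format and the use of P(v|u) directly as expansion weights are our simplifications of the model in [5]; the table itself (top 10 neighbors per term under the Information Gain metric) is assumed to be precomputed offline.

```python
def expand_query(query_terms, entail, n_max=10):
    """Enrich a bag-of-words query with lexically entailed terms.

    entail: {u: [(v, p), ...]} where p approximates P(v|u), listing for
    each corpus term u its n_max closest terms (precomputed offline).
    Returns a weighted query {term: weight}.
    """
    expanded = {}
    for u in query_terms:
        expanded[u] = expanded.get(u, 0.0) + 1.0      # keep the original term
        for v, p in entail.get(u, [])[:n_max]:
            expanded[v] = expanded.get(v, 0.0) + p    # add the entailed term
    return expanded

# Example with a toy entailment table:
# expand_query(["dinosaur"], {"dinosaur": [("fossil", 0.12), ("trex", 0.08)]})
# -> {"dinosaur": 1.0, "fossil": 0.12, "trex": 0.08}
```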
3 CEA LIST Text and Visual Prototype Based Retrieval

The main idea behind CEA LIST's approach is to model the queries using, on one hand, socially related Flickr tags and, on the other hand, Wikipedia concepts, and to combine them. Here we provide only a quick description of the Flickr and Wikipedia models, which are similar to those introduced in [13]. The main novelty introduced this year is the creation of a visual topic prototype from visual concepts. Each result image is then characterized in the visual prototype space, and the results that match the prototype are ranked higher using a late fusion approach.

3.1 Flickr Query Modeling

Term relations extracted from Flickr are defined within a photographic tagging language. We use this data source to define an adaptation of the TF-IDF model to the social space. Given a query Q, we define its social relatedness to term T using:

  SocRel(T \mid Q) = users(Q, T) \cdot \frac{1}{\log(pre(T))}

where users(Q, T) is the number of distinct users who associate tag T with query Q among the top 20,000 results returned by the Flickr API for the query Q, and pre(T) is the number of distinct users, from a prefetched subset of Flickr users, who have tagged photos with tag T. In this social weighting scheme, the term frequency and document counts of classical IR formulas are replaced with user counts, which prevents the final relatedness score from being biased by heavy contributions from a small number of users. The computation of users(Q, T) from the 20,000 top results is intended to keep computation time low while accounting for different query-relevant contexts. pre(T) is precomputed from all the tags submitted by a random subset of 120,000 Flickr users.

Related terms are computed from preprocessed queries. This preprocessing involves removing photographic terms (which can have a negative impact on queries with few results) as well as prepositions and articles from the queries. Both prepositions and photographic terms are kept in precompiled lists extracted from Wikipedia: the "List of English prepositions" page for the former, and the "Category:Film techniques" and "Category:Photographic processes" pages for the latter. In the following sections, we will use the preprocessed forms of the queries. For instance, skeleton of dinosaur becomes skeleton dinosaur. For this query, the most related Flickr terms are: bones, museum, trex, fossil, natural history museum and tyrannosaurus rex.

Before using Wikipedia, we create an enriched version of the query (Q_E) by selectively stemming original query terms. A Flickr related term is retained as a variant if its edit distance to one stemmed term from the initial query is smaller than three, or if the stemmed form is a prefix of the related term; so skeleton dinosaur becomes (skeleton:skeletons) (dinosaur:dinosaurs). Words in the initial query and the top related Flickr term have weights of 1, while terms from rank x = 2 onwards have weights normalized by the score of the top related term, SocRel(T_1 | Q). This weighting expresses the fact that the importance of a term in the Flickr model decreases with its rank among the socially related terms. We create a Flickr query model for each language in an identical manner.
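A minimal sketch of this social weighting follows. The two user-count structures are assumed to have been collected beforehand (from the top Flickr results per query and from the prefetched user subset, respectively), and the names are ours.

```python
import math

def soc_rel(tag, query_tag_users, tag_user_counts):
    """SocRel(T|Q) = users(Q,T) * 1 / log(pre(T)).

    query_tag_users: {tag: set of distinct users who attached the tag among
    the top Flickr results for the query}; tag_user_counts: {tag: pre(T)}.
    """
    users_qt = len(query_tag_users.get(tag, set()))
    pre_t = tag_user_counts.get(tag, 0)
    if users_qt == 0 or pre_t <= 1:          # log(pre(T)) must be positive
        return 0.0
    return users_qt * (1.0 / math.log(pre_t))
```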
3.2 Wikipedia Query Modeling

When the terms in a query (or a part of a query) are categorical in nature, results that correspond to their subtypes or to other semantically related concepts are ignored in the absence of query expansion. For instance, tyrannosaurus or allosaurus are valid subtypes of dinosaur, which is part of the topic skeleton of dinosaur. Images tagged with these related concepts are potentially relevant for skeleton of dinosaur, but they would not be returned when querying with the initial terms. Query expansion is particularly useful when initial queries return a small number of results, or for languages that are seldom used in the annotations of the images. Since image queries cover a broad range of concepts and are expressed in different languages, a generic, detailed and multilingual data source is needed to enable an efficient expansion, and we consider Wikipedia an appropriate data source for the extraction of semantically related concepts.

We express the semantic relatedness between a query and a Wikipedia article as a combination of two scores. We first measure the overlap between the words in the query and the words in the category section and the first sentence of the encyclopedic articles, and then compute the dot product between the query and a vectorial representation of the entire article content. Priority is given to the first score because categorical and definitional information has a privileged role in defining semantic relatedness. For English, queries are run through WordNet, and synonyms are added for the query terms that are unambiguous in WordNet, to avoid introducing noise from polysemous word senses. The overlap is normalized by the number of words in the query, and its values vary between 0 (no terms in common between the query and the article's categories) and 1 (all terms in common). Since queries usually contain a small number of words, the overlap scores offer a coarse expression of semantic relatedness and a large number of articles will share the same scores. In our system, given the query golf player on court, an article categorized under both golf and player is always ranked better than an article categorized only under player.

3.3 Text Retrieval

The collection is preprocessed in order to represent textual annotations using a TF-IDF formalism. The Wikipedia and Flickr models have different roles in the retrieval system. Related concepts from the encyclopedia are used for semantic query expansion, whereas the set of related Flickr tags is exploited for result ranking. We first retrieve all documents in the collection that are annotated with at least one word from the initial query or with one of the related Wikipedia concepts, and give these documents a coarse ranking score based on the number of terms from the initial query the document is related to. To break the many ties that result from the use of the coarse score, we compute a fine-grained score as the cosine similarity between the document representation and the Flickr query model.

3.4 Visual Prototype Based Retrieval

Each 2011 Wikipedia Retrieval topic is provided with four or five example images. We extract visual concepts from these positive examples in order to build a visual prototype of the query. The extracted visual concepts include one or several of the following: presence of a face; indoor/outdoor scene; black & white vs. color image; photograph/clipart/map content. For instance, the prototype of portrait of Che Guevara includes face and photograph, while the prototype of golf player on green includes outdoor, photograph and color. Each image in the collection is preprocessed in order to extract the visual concepts described above. We then compare each image in the textual ranking to the query prototype and increase its visual score for each matching visual concept. We take a simple approach and give the same score to each extracted visual concept.
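The sketch below illustrates this matching step under our reading of it: concepts are treated as set members, every match adds the same unit score, and the late fusion weight (alpha) and the normalization of the match count are assumed values, since the paper does not report them.

```python
def prototype_score(image_concepts, prototype):
    """Number of query-prototype visual concepts the image matches.
    Both arguments are sets of labels, e.g. {"face", "outdoor", "photograph"}."""
    return len(image_concepts & prototype)

def rerank_with_prototype(text_ranking, concepts_of, prototype, alpha=0.5):
    """Late fusion of normalized text scores with the prototype match score.
    text_ranking: list of (doc_id, text_score in [0, 1]);
    concepts_of: {doc_id: set of extracted visual concepts}."""
    denom = max(len(prototype), 1)            # normalize matches to [0, 1]
    def fused(item):
        doc_id, text_score = item
        visual = prototype_score(concepts_of.get(doc_id, set()), prototype) / denom
        return alpha * text_score + (1 - alpha) * visual
    return sorted(text_ranking, key=fused, reverse=True)
```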
4 XRCE Fisher Vector based Image Representation

For the image representation, we used an improved version [12, 11] of the Fisher Vector [10]. The Fisher Vector can be understood as an extension of the bag-of-visual-words (BOV) representation. Instead of characterizing an image by the number of occurrences of each visual word, it characterizes the image by the gradient vector derived from a generative probabilistic model. The gradient of the log-likelihood describes the contribution of the parameters to the generation process.

Assuming that the local descriptors I = {x_t, x_t ∈ R^D, t = 1...T} of an image I are generated independently by a Gaussian mixture model (GMM) u_\lambda(x) = \sum_{i=1}^{M} w_i \mathcal{N}(x \mid \mu_i, \Sigma_i), I can be described by the following gradient vector (see also [7, 10]):

  G^I_\lambda = \frac{1}{T} \sum_{t=1}^{T} \nabla_\lambda \log u_\lambda(x_t)   (3)

where λ = {w_i, μ_i, Σ_i, i = 1...M} are the parameters of the GMM. A natural kernel on these gradients is the Fisher Kernel [7]:

  K(I, J) = (G^I_\lambda)' \, F_\lambda^{-1} \, G^J_\lambda, \qquad F_\lambda = E_{x \sim u_\lambda}\left[ \nabla_\lambda \log u_\lambda(x) \, \nabla_\lambda \log u_\lambda(x)' \right]   (4)

where F_λ is the Fisher information matrix. As F_λ is symmetric and positive definite, F_λ^{-1} has a Cholesky decomposition F_\lambda^{-1} = L'_\lambda L_\lambda, and K(I, J) can be rewritten as a dot product between the normalized vectors \mathcal{G}^I_\lambda = L_\lambda G^I_\lambda. We will refer to \mathcal{G}^I_\lambda as the Fisher Vector (FV) of the image I.

In the case of diagonal covariance matrices Σ_i (we denote by σ_i^2 the corresponding variance vectors), closed-form formulas can be derived for G^I_{w_i}, G^I_{μ_i^d} and G^I_{σ_i^d}, for i = 1...M and d = 1...D (see details in [11]). As we do not consider G^I_{w_i} (the derivatives with respect to the weights), \mathcal{G}^I_\lambda is the concatenation of the derivatives G^I_{μ_i^d} and G^I_{σ_i^d} and is therefore N = 2MD-dimensional. The Fisher Vector is further normalized with power (α = 0.5) and L2 normalization, as suggested in [11], and the dot product is used as the similarity between Fisher Vectors.

In some cases we also used a spatial pyramid [8] to take into account the rough geometry of a scene. The main idea is to repeatedly subdivide the image and represent each layout as a concatenation of the representations (in our case Fisher Vectors) of the individual sub-images. As we used three spatial layouts (1×1, 2×2 and 1×3), we obtained three image representations of respectively N, 4N and 3N dimensions. As low-level features we used our usual (see for example [1]) SIFT-like Orientation Histograms (ORH) and local color statistics (COL), i.e. local color means and standard deviations in the R, G and B channels, both extracted on regular multi-scale grids and reduced to 50 or 64 dimensions with Principal Component Analysis (PCA).
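For illustration, here is a compact reconstruction of the FV computation for a diagonal-covariance GMM, following the closed-form gradients of [11] (mean and variance derivatives only, then power and L2 normalization). This is our sketch, not the efficient implementation acknowledged below, and it assumes a GMM fitted with scikit-learn's covariance_type="diag".

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Fisher Vector of an image from its local descriptors (T x D array),
    using a fitted diagonal-covariance GMM; weight derivatives are dropped,
    so the result is 2*M*D dimensional (our reconstruction of [11])."""
    T = descriptors.shape[0]
    w, mu = gmm.weights_, gmm.means_                  # (M,), (M, D)
    sigma = np.sqrt(gmm.covariances_)                 # (M, D) diagonal std devs
    gamma = gmm.predict_proba(descriptors)            # (T, M) posteriors
    diff = (descriptors[:, None, :] - mu[None]) / sigma[None]      # (T, M, D)
    g_mu = (gamma[..., None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_sigma = (gamma[..., None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))            # power normalization, alpha = 0.5
    return fv / (np.linalg.norm(fv) + 1e-12)          # L2 normalization

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(5000, 64))               # stand-in for PCA-reduced ORH/COL
    gmm = GaussianMixture(n_components=16, covariance_type="diag").fit(train)
    fv = fisher_vector(rng.normal(size=(300, 64)), gmm)
    print(fv.shape)                                   # (2 * 16 * 64,) = (2048,)
```

The dot product between two such vectors then serves directly as the image similarity.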
5 The Late Semantic Combination Fusion Method

There have been many research works addressing text/image information fusion. The method we mainly used in our runs is the one described in [3]. The intuition behind this technique is that, since different media (here text and image) express semantics at different levels, we should not combine them independently, as most information fusion techniques do; on the contrary, we should exploit the underlying complementarities that exist between these media. In the case of text/image fusion, as the results of most ImageCLEF tasks show [9], text based search is more effective than visual based search, since it is more difficult to extract the semantics of an image than of a text. However, basic late fusion approaches have shown that the combination of visual and textual information can outperform text-only search. This shows that the two media are often complementary, despite the differences between mono-media performances.

In [3], we have shown that late fusion can be improved by simply adding a semantic filtering step before score combination. This filtering step forces the visual system to search only among the set of objects retrieved by the text expert. In this way we impose that images visually similar to the query images share a common semantics (given by the textual query). While this filtering step is the basis of image reranking methods, we have shown in [3] that remaining at this level is insufficient. Indeed, when the visual system has low performance, image reranking significantly outperforms it, but its performance generally remains poorer than using the text alone. Therefore, what we proposed was to combine image reranking with late fusion in order to overcome their respective weaknesses. Note that the strength of image reranking is to restrict the visual system to a subset that is relevant from the semantic viewpoint, while the strength of late fusion relies on a well-performing text expert.

The Late Semantic Combination (LSC) method works as follows. First, we define a semantic filter for the image scores according to the textual expert:

  SF(q, d) = \begin{cases} 1 & \text{if } d \in KNN_t(q) \\ 0 & \text{otherwise} \end{cases}   (5)

where KNN_t(q) denotes the set of the K most similar objects to q according to the textual similarities. This gives us a reduced list of K documents for which we need to compute the image similarities. After normalization (all scores are transformed to have values between 0 and 1), the semantically filtered image scores are combined with the text ones:

  s_{LSC}(q, d) = \alpha_t \, N(s_t(q, d)) + \alpha_v \, N(SF(q, d) \, s_v(q, d))   (6)

where N is an operator normalizing scores between 0 and 1, and α_t = α and α_v = 1 − α are positive weights that sum to 1. Note that the similarities of all documents d that are not in KNN_t(q) are set to 0.
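Putting equations (5) and (6) together, a minimal sketch of LSC could look as follows; min-max normalization, the filter size K and the text weight alpha are illustrative choices here, not the settings tuned for our runs.

```python
def lsc_scores(text_scores, visual_scores, k=1000, alpha=0.7):
    """Late Semantic Combination (equations (5) and (6)).

    text_scores / visual_scores: {doc_id: raw score} from the two experts.
    Returns {doc_id: fused score}; documents outside the text expert's
    top-K keep a zero visual contribution.
    """
    def minmax(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    # Semantic filter (eq. 5): the K documents best ranked by the text expert.
    knn_t = set(sorted(text_scores, key=text_scores.get, reverse=True)[:k])
    s_t = minmax(text_scores)
    s_v = minmax({d: s for d, s in visual_scores.items() if d in knn_t})
    # Weighted combination (eq. 6) with alpha_t = alpha, alpha_v = 1 - alpha.
    return {d: alpha * s_t[d] + (1 - alpha) * s_v.get(d, 0.0) for d in s_t}
```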
6 Descriptions of our runs

Similarly to 2010, we evaluated different mono-media and multimedia runs on the Wikipedia corpus using the new set of queries. The MAP and P@10 performances of these runs are shown in Table 1. Some of the runs were submitted separately by each of the two participants, while others were the result of a combination of text experts from XRCE and CEA LIST. The latter runs were obtained either by using late semantic filtering with a CEA LIST run as text expert, or by first combining their run with our text expert before applying the LSC method. In addition, Table 1 also shows the performance of our visual run (not submitted). The performance of this visual run was again poor, as averaging the Fisher Vectors of the set of image queries was not sufficient to capture the underlying semantics of the query. However, as in 2010, we can observe that in spite of the low performance of the visual expert, the semantically filtered late combination was able to take advantage of the complementarity of the media types to improve over the pure text based approach.

Table 1. Wikipedia retrieval: overview of the performances of our different runs. The top table lists mono-media runs and the bottom table multimedia runs.

| ID | RUN                           | MAP    | P@10  |
| T6 | XRCE CEA TXT RUN SPLAX ENFRDE | 0.3141 | 0.516 |
| T5 | XRCE CEA TXT RUN AX ENFRDE    | 0.3130 | 0.530 |
| T1 | XRCE TXT RUN AX               | 0.2780 | 0.470 |
| T4 | XRCE TXT RUN SPLAX            | 0.2769 | 0.464 |
| T3 | CEA enfrde all                | 0.2591 | 0.466 |
| T2 | XRCE TXT RUN SPL              | 0.2432 | 0.422 |
| I1 | XRCE VIS FV                   | 0.0271 | 0.086 |

| Txt Run | ID | RUN                                    | MAP    | P@10  | Rel. Improv. |
|         | M1 | XRCE CEA MULTI RUN SFL AX viscon       | 0.3880 | 0.632 |              |
| T5      | M2 | XRCE CEA MULTI RUN AX ENFRDE FV SFL    | 0.3869 | 0.624 | +23.6 %      |
| T6      | M3 | XRCE CEA MULTI RUN SPLAX ENFRDE FV SFL | 0.3848 | 0.620 | +22.5 %      |
| T1      | M4 | XRCE MULTI RUN AX FV SFL               | 0.3557 | 0.594 | +27.9 %      |
| T4      | M5 | XRCE MULTI RUN SPLAX FV SFL            | 0.3556 | 0.578 | +28.4 %      |
|         | M6 | XRCE CEA MULTI RUN SPLAX VISCON        | 0.3471 | 0.574 |              |
| T3      | M7 | CEA XRCE RUN ENFRDE FV SFL             | 0.3075 | 0.540 |              |
|         | M8 | CEA viscon 1.07                        | 0.2703 | 0.480 |              |

In what follows, we give details on each run individually.

6.1 Text based runs

– T1: XRCE text based retrieval with the Lexical Entailment (AX) IR model (section 2.2).
– T2: XRCE text based retrieval with the Smoothed Power Law (SPL) Information-Based Model (section 2.1).
– T3: CEA LIST multilingual textual run (section 3.3).
– T4: late fusion between T1 and T2.
– T5: late fusion between T1 and T3.
– T6: late fusion between T4 and T3.

We can see from Table 1 that on this dataset AX works better than SPL. While their combination (T4) does not seem to help, further combining it with T3 (run T6) yields better performance than just combining T1 with T3 (run T5). In all cases, combining the XRCE runs with the CEA run (T3) led to an absolute improvement of about 4% in MAP.

6.2 Image based run

Our image run (I1) was based on the similarity between Fisher Vector based signatures as described in section 4. We built four image signatures for each image, corresponding to the two different low-level features (ORH and COL), each used with either a global FV or a spatial pyramid (1×1, 2×2, 1×3). The four FVs were used independently to rank the Wikipedia images using the dot product as similarity measure, and the four scores were simply averaged.

6.3 Text and image based runs

The combination of visual and textual runs was done using the Late Semantic Combination method described in section 5. The image scores were all the same, corresponding to I1 described above, so only the text expert changed from one run to another. Hence:

– M2: used T5, a late fusion between AX and the text run from CEA.
– M3: used T6, a late fusion between AX, SPL and the text run from CEA.
– M4: used T1, corresponding to the AX model.
– M5: used T4, corresponding to the late fusion between AX and SPL.
– M7: used T3, the CEA multilingual textual run.

The remaining runs were based on M8, a linear combination of textual results and visual prototype based results from CEA (see details in section 3.4). Hence, we had:

– M6: a late fusion between T4 and M8.
– M1: a late semantic combination where we used M6 as "the semantic filter" and averaged the visual scores with the normalized scores of run M7. Hence, while M7 was not purely text based, it played the role of the "text expert" in the LSC approach.

As a conclusion, we can again see that the CEA and XRCE runs were complementary, with a gain in absolute improvement of 3% in MAP. While adding the visual characterization of the topics to pure text retrieval helps (T2 compared to M6), when we added the visual expert the gain was almost negligible (M1 compared with M2). This is interesting, as it reinforces some of the observations we made in [1] when experimenting with the IAPR TC-12 photographic collection.

7 Conclusion

This year XRCE again participated with success in the Wikipedia Retrieval Task, showing again that although pure visual based retrieval led to poor results, when we appropriately combined it with our text ranking we were able to outperform the mono-modal runs. This was shown with various text experts, including those from CEA LIST.

Acknowledgments

We would also like to thank Florent Perronnin and Jorge Sánchez for the efficient implementation of the Fisher Vectors used in our experiments. Adrian Popescu was supported by the French ANR (Agence Nationale de la Recherche) via the Periplus project.

References
1. J. Ah-Pine, S. Clinchant, G. Csurka, F. Perronnin, and J.-M. Renders. Leveraging image, text and cross-media similarities for diversity-focused multimedia retrieval. Chapter 3.4 of [9], The Information Retrieval Series, Springer, 2010. ISBN 978-3-642-15180-4.
2. A. Berger and J. Lafferty. Information retrieval as statistical translation. In Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval, pages 222–229, 1999.
3. S. Clinchant, J. Ah-Pine, and G. Csurka. Semantic combination of textual and visual information in multimedia retrieval. In ACM International Conference on Multimedia Retrieval (ICMR), 2011.
4. S. Clinchant and E. Gaussier. Information-based models for ad hoc IR. In SIGIR '10: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 234–241, New York, NY, USA, 2010. ACM.
5. S. Clinchant, C. Goutte, and E. Gaussier. Lexical entailment for information retrieval. In Advances in Information Retrieval, 28th European Conference on IR Research, ECIR 2006, London, UK, April 10–12, pages 217–228, 2006.
6. S. P. Harter. A probabilistic approach to automatic keyword indexing. Journal of the American Society for Information Science, 26, 1975.
7. T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11, 1999.
8. S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
9. H. Müller, P. Clough, Th. Deselaers, and B. Caputo, editors. ImageCLEF – Experimental Evaluation in Visual Information Retrieval. The Information Retrieval Series. Springer, 2010. ISBN 978-3-642-15180-4.
10. F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
11. F. Perronnin, J. Sánchez, and Y. Liu. Large-scale image categorization with explicit data embedding. In CVPR, 2010.
12. F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. Large-scale image retrieval with compressed Fisher vectors. In CVPR, 2010.
13. A. Popescu and G. Grefenstette. Social media driven image retrieval. In ACM International Conference on Multimedia Retrieval (ICMR), 2011.
14. T. Tsikrika, A. Popescu, and J. Kludas. Overview of the Wikipedia Retrieval task at ImageCLEF 2011. In Working Notes of CLEF 2011, Amsterdam, The Netherlands, 2011.