=Paper=
{{Paper
|id=Vol-1174/CLEF2008wn-ImageCLEF-PopescuEt2008
|storemode=property
|title=Conceptual Image Retrieval over the Wikipedia Corpus
|pdfUrl=https://ceur-ws.org/Vol-1174/CLEF2008wn-ImageCLEF-PopescuEt2008.pdf
|volume=Vol-1174
|dblpUrl=https://dblp.org/rec/conf/clef/PopescuBM08a
}}
==Conceptual Image Retrieval over the Wikipedia Corpus==
Conceptual image retrieval over the Wikipedia corpus

Adrian Popescu, Hervé Le Borgne, Pierre-Alain Moëllic
CEA-LIST LIC2M (Multilingual Multimedia Knowledge Engineering Laboratory), B.P. 6 – F92265 Fontenay-aux-Roses Cedex, France
{adrian.popescu, herve.le-borgne, pierre-alain.moellic}@cea.fr

Abstract. Image retrieval in large-scale databases currently relies on a textual string matching procedure, a technique that produces good results as long as the annotations associated with pictures are accurate and detailed enough. These conditions are not met for a large majority of image corpora, such as the Wikipedia collection, and it is therefore interesting to explore methods that go beyond string matching. In this paper, we present our approach to image retrieval, tested in the ImageCLEF 2008 WikipediaMM task. The approach is based on a query reformulation using concepts that are semantically related to those in the initial query. For each interesting entity in the query, we used Wikipedia and WordNet to extract a list of related concepts, which were further ranked in order to propose the most salient ones first. We also compiled a list of visual concepts, which were used to re-rank the answers to queries that included, implicitly or explicitly, these visual concepts. The CEA submitted two automatic runs, one based on query reformulation only and one combining query reformulation and visual concepts, which were ranked 4th and 2nd using the MAP measure.

1. Introduction

The search for multimedia documents represents a growing trend in Web search. New retrieval paradigms, going beyond text string matching, are therefore needed in order to better respond to user needs. Current image retrieval systems rely heavily on text searching techniques and, although people search for images, the results are returned based on the associated text only. The results are obtained by a simple match of the terms of the query against an index of terms associated with the images in a corpus. This technique is simple, but its efficiency strongly depends on the terms associated with pictures, as well as on their accuracy. For instance, in this paradigm, it would be impossible to know whether a query with bridge concerns the structures with this name or the card game. If the request relates to the first meaning of the term, the search engine would return only images explicitly annotated with bridge, whereas it would be pertinent to propose answers for Pont Neuf or Ponte Vecchio. The exploitation of semantic structures represents a possible solution to cope with such problems, provided these structures are developed enough to cover the query space. In spite of flourishing research on content-based image retrieval [17], the introduction of image processing techniques in image search engine architectures is limited, for the moment, to face detection (proposed by Google, Live Search or Exalead). The adoption of more image processing techniques is conditioned by the achievement of good quality results on large amounts of data, and this imperative is not met in most cases [26].

The introduction of conceptual structures and image processing techniques in image retrieval raises a number of hard questions, and we try to tackle several of them in our work. Namely: When dealing with large conceptual domains, are there enough resources available or is it necessary to enrich them? Generic semantic structures, like WordNet [7], exist and were used in image retrieval [40], [42], but they do not ensure a sufficient coverage of the query space.
Wikipedia is a rich source of semi-structured content and has been used to structure large quantities of knowledge [1], [25]. We exploit both WordNet and Wikipedia for extracting the information we need when processing the WikipediaMM queries. Should image processing techniques be introduced alone or should they be fused with other information? A large body of work [40], [42], [39] advocates the use of both low-level and high-level image description in order to improve image search. We follow a somewhat similar approach and investigate a late fusion of textual information and low-level image description of the items in the database. Which terms in a query should be reformulated? When dealing with mono-conceptual queries, the answer to this question is straightforward: if knowledge about that particular concept is available, we should use it to expand the query. The problem gets more complicated for complex queries because the number of reformulations rapidly becomes unmanageable. We consider that nouns are the most important part of image queries and focus the query expansion on them. Fortunately, the WikipediaMM queries follow the general distribution of Web queries and contain a large proportion of mono-conceptual queries.

The remainder of this paper is structured as follows: in the next section, we describe related work; in Section 3, we present our method for automatically building conceptual structures; in Section 4, we introduce the automatic query reformulation algorithm used in the WikipediaMM task and, before concluding, we discuss the results of our approach.

2. Related Work

We describe related work from several relevant research areas, including information extraction, image retrieval, query reformulation and visual concept detection.

Wikipedia is a rich resource that is used in a variety of information extraction or structuring tasks. In [35], the encyclopaedia is used to automatically construct lists of place and people names. [25] proposes a method for cleaning the categorical tree of Wikipedia in order to obtain a sound taxonomy. The result is compared to Cyc and the precision of the results reaches 86.6%, with a recall of 89.1%. Kazama et al. [13] introduce a syntactic analysis of the first sentence in Wikipedia articles in order to extract IsA relations. These sentences are often definitional and the approach is successful in nearly 90% of the cases. [43] and [44] explore the automatic enrichment of WordNet using Wikipedia content. The authors try to extract hyponymy, hypernymy, holonymy and meronymy relations based on lexical patterns learned from a text corpus. The overall precision of the extraction process exceeds 50%, but a lot of incorrect relations are still extracted. DBpedia [1] is a translation of parts of Wikipedia articles to a database format, enabling structured queries over the content of the encyclopaedia. The authors parse structured parts of the articles (such as infoboxes, tables, or categories), which contain a fairly detailed description of the concepts described in the article and which can later be used in information retrieval tasks.

The introduction of semantic structures in image retrieval architectures is a well-known practice. WordNet is exploited in a number of applications: to build a visual catalogue from the Web [42]; to propose lists of related concepts in [40]; to create multimodal similarity vectors (based on WordNet and on a visual description of the images) in [8]; to limit the conceptual neighbourhood where visually similar images are searched [26].
Wang et al. [39] reuse a taxonomy of animals created by the BBC ("BBC Science and Nature Animal Category"), which contains 620 terms. This taxonomy is enriched by describing the included concepts with visual information concerning the animal's colour, but also with image properties (such as outdoor/indoor image, photo/graph), and the result is called a multimedia ontology. The authors select 20 animal species, collect corresponding pictures from Google Image and provide results for the precision at several ranks (P@20, P@40, P@60 and P@80), these values corresponding to one, two, three and four pages of results in a Web search engine. A comparison is made between Google Image, the use of a textual ontology and the use of a multimedia ontology, and the average precision is best in the last setting, followed by the textual ontology. These results are interesting but they are limited to a specific domain, where the colour of the target concepts (animals) is stable.

Image query reformulation based on semantic resources is tested, among others, in [11] and [12]. In [11], the authors exploit ConceptNet [17] to expand image queries and report a slight improvement of results (3% for a precision around 40%) when the query expansion is used. In [12], the same group compares a WordNet-based query expansion to a ConceptNet-based one and concludes that the two semantic structures are complementary. The use of WordNet provides a better discrimination of the expanded queries, whereas the use of ConceptNet results in more diversified queries. This finding is not surprising when considering the structure of the two resources, with ConceptNet including a larger number of inter-conceptual relations. [45] takes a different approach to query reformulation and discusses the use of query logs for expanding queries and structuring results.

Visual concept detection has been a well-studied problem in computer vision, including the work on object detection and scene recognition. It is often posed as a binary classification task consisting in deciding between the considered visual concept and other possible types of images. Hence, the general scheme of the works in this field consists of three main steps, namely region of interest detection, feature extraction and classification. Different proposals for one or several of these steps have led to numerous approaches. The regions of interest (ROIs) of an image are the spatial localizations from which the features will be extracted in the next step. The simplest approach consists of considering no region of interest and extracting the features globally, as was the case for seminal works in image classification [9, 33]. Such a holistic representation was shown to be particularly relevant in the case of scene recognition [23, 14, 16]. Among the alternatives considered in the literature to detect ROIs, one can distinguish between the following approaches: regular [6] or random [22] sampling of patches, image segmentation [2, 34], or, more often, the use of interest point detectors [29, 19]. Once the ROIs are determined, a recognition system usually extracts some visual features. Numerous approaches have been proposed since the seminal work of Swain and Ballard, who used global colour histograms [32], including various colour and texture descriptors [20], wavelets [30, 37] and, more recently, local descriptors computed around interest points.
This last trend consists of computing simple features on patches around interest points [18], then aggregating them into a given number of clusters in order to define a visual vocabulary that is further used to describe the images in terms of a "bag of features" [38, 10, 41, 5, 22]. An alternative to this scheme is to learn analysis filters from the learning images and to define a signature from their responses [15, 16]. The last important step of visual concept detection consists of learning each concept. At this level, various classifiers have been used, including SVMs [24, 20, 16], boosting [37], Naïve Bayes [30, 5], neural networks [28], generative models [6], graphical models [21] and, regularly, K-nearest neighbours, from [9] to [36].

3. Automatic building of conceptual structures

The acquisition of knowledge related to concepts appearing in the queries is the key element of our image retrieval method. We aim at processing diversified queries and this implies an automatic building of conceptual structures. In this section, we briefly present the employed data sources, the extraction of related concepts and the ranking of the extracted terms.

3.1. Data sources

The main resource we exploited was Wikipedia. Dumps of the encyclopaedia are regularly provided for free use (http://en.wikipedia.org/wiki/Wikipedia:Database_download). We downloaded the March 12, 2008 English dump, which contains over two million articles and is provided as a single file, in XML format. Next, we split the dump into individual articles in order to process the information faster. The information in Wikipedia spans a large number of conceptual domains, with a high number of articles describing known people, places, entertainment, organisations, animals and plants. Each article is placed in at least one category, a property that facilitates the extraction of IsA relations from Wikipedia.

WordNet is a lexical database, including parts for different parts of speech, such as nouns, verbs or adjectives. As our approach focuses on nouns, we only used the noun hierarchy, which contains 117798 nominal forms, corresponding to 146312 senses. These nouns are grouped in 82115 sets of synonyms (or synsets), the sense separation being an interesting property for image retrieval tasks as it makes possible a separation of the different visual representations of the same term. Out of the total number of synsets in WordNet, around 75% are common nouns and 25% are instances. That being said, WordNet ensures an acceptable coverage for concepts but only a poor coverage of instances, which have around 20000 associated synsets. For comparison, there are at least 80000 articles for person names in the English Wikipedia and around 300000 for place names (http://dbpedia.org).
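The paper does not detail how the dump was split into individual articles; the sketch below is a minimal illustration of that step, assuming a standard English Wikipedia XML dump, Python's standard library and a hypothetical local file name (the authors' actual tooling is not specified).

```python
# Illustrative sketch only: stream a Wikipedia XML dump and write one file per article.
# The dump file name and the export namespace are assumptions; the namespace version
# in particular varies between dump releases.
import os
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.3/}"  # may differ for other dump versions

def split_dump(dump_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    for _, elem in ET.iterparse(dump_path):          # default: "end" events
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title") or "untitled"
            text = elem.findtext("./%srevision/%stext" % (NS, NS)) or ""
            with open(os.path.join(out_dir, "%08d.txt" % count), "w", encoding="utf-8") as f:
                f.write(title + "\n" + text)
            count += 1
            elem.clear()                              # free memory on a multi-gigabyte dump
    return count

if __name__ == "__main__":
    split_dump("enwiki-20080312-pages-articles.xml", "articles")
```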
3.2. Conceptual neighbourhood building

The text associated with the pictures in the WikipediaMM corpus is generally short, containing few terms. In this context, query reformulation is a way to improve recall and, if realized in a judicious way, to also improve the precision of the answers. In [27], we showed that the use of subtypes of generic concepts like dog or skyscraper is beneficial in image retrieval. This hypothesis is justified by the fact that, from a visual perspective, a concept like dog is well represented by dog breeds such as German shepherd, Doberman or basset. For specific concepts, which have no subtypes, it is possible to build a list of synonyms in order to reformulate the queries containing them. For image queries, nouns are the most informative parts of the user's request and we focus our work on them.

From the list of queries provided for WikipediaMM, we build an initial list of all nouns, which is subsequently filtered in order to eliminate visual concepts. We decided not to reformulate visual concepts but to employ them in the visual concept detection framework described in [20]. The list of visual concepts contains terms like: night, day, face, portrait, graph, drawing, cartoon, photo, picture or painting.

In our approach, we favour the processing of concepts in queries rather than the processing of words separated by blank spaces. For instance, hunting dog is regarded as a single concept and not as a composition of hunting and dog as separate terms. The same observation stands for Da Vinci paintings, three terms that form a unique concept. When searching for subtypes of hunting dog and Da Vinci paintings, the two expressions are therefore considered as single concepts. The first type of composed concept is a multiword and it is detected with WordNet. "Da Vinci paintings" is retained as a single concept by comparing it to the list of categories in Wikipedia. The same rule is applied for Ice hockey players or Roads in California, which are processed as single concepts.

The list of concepts was first enriched using WordNet hyponyms, for those terms that existed in the hierarchy. If we get back to the example of hunting dog, the list of hyponyms contains intermediary concepts like sporting dog, hound or retriever, as well as breed names such as labrador retriever, Ibizan hound or Irish Terrier. Other terms, such as Ferrari, are not described in WordNet and they constitute an argument for using a broader resource in order to build lists of subtypes. WordNet was also used to determine the right sense for ambiguous textual queries such as the one formed of plant (which is disambiguated using building, a visual concept included in the query). We were able to map this query to the second sense of plant in WordNet and extract the right set of hyponyms. A third role of WordNet was to provide a list of synonyms for terms having no subtypes in the hierarchy. For example, polar bear was enriched with ice bear, Ursus Maritimus and Thalarctos Maritimus.

When using Wikipedia, we first probed each member of the list of nominal concepts (including the terms extracted from WordNet) against the categories associated with articles in the encyclopaedia. Wikipedia is generally more detailed than WordNet and, where they existed, the lists of subtypes extracted from WordNet were enriched with terms from the encyclopaedia. For instance, Mudhol Hound or Azawakh (breeds of hound) were added to the list of hyponyms of hunting dog. For categories which are not represented in WordNet, e.g. Ferrari or Da Vinci paintings, we mined Wikipedia content and extracted lists of instances. Ferrari hyponyms include: Ferrari F40, Ferrari GT4 or Ferrari Testarossa. Among the extracted Da Vinci paintings, we cite: Mona Lisa, The Last Supper or The Battle of Anghiari.
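To make the WordNet side of this expansion step concrete, the sketch below uses NLTK's WordNet interface (one possible toolkit; the paper does not name the library actually used) to collect hyponym lemmas of a chosen noun sense, falling back to synonyms when no hyponyms exist. The Wikipedia category probing described above is represented only by a placeholder function.

```python
# Illustrative sketch of concept expansion with WordNet (via NLTK). The sense is
# assumed to have been chosen already; Wikipedia enrichment is stubbed out.
from nltk.corpus import wordnet as wn

def expand_concept(lemma, sense_index=0):
    """Return subtypes (hyponym lemmas) of a noun concept, or its synonyms
    when the concept has no subtypes in WordNet."""
    synsets = wn.synsets(lemma.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return []  # e.g. "Ferrari": not in WordNet, handled with Wikipedia only
    synset = synsets[sense_index]  # e.g. plant -> sense_index=1 for the "building" reading
    subtypes = set()
    for hyp in synset.closure(lambda s: s.hyponyms()):
        subtypes.update(l.name().replace("_", " ") for l in hyp.lemmas())
    if subtypes:
        return sorted(subtypes)
    # no subtypes: fall back to synonyms, e.g. polar bear -> ice bear, Ursus Maritimus
    return sorted(l.name().replace("_", " ") for l in synset.lemmas())

def wikipedia_subtypes(concept):
    """Placeholder for the Wikipedia category probing described above; not implemented here."""
    return []

if __name__ == "__main__":
    terms = expand_concept("hunting dog")
    print(terms[:10])                                # sporting dog, hound, retriever, ...
    terms += wikipedia_subtypes("hunting dog")       # would add Mudhol Hound, Azawakh, ...
```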
3.3. Subtypes ranking

The lists of subtypes we constituted often contain a lot of terms (hundreds for terms like bridge or castle) and, given that they are to be used in an information retrieval application, it is necessary to rank these terms. We propose a two-step ranking process. First, the query is examined to see whether it includes a useful qualifier of the concept (usually an adjective) and, if so, terms associated with that qualifier are ranked first. Qualifiers appear in queries such as female players beach volleyball, red Ferrari or blue flower. These qualifiers are matched against WordNet glosses for terms coming from WordNet, and against the categories as well as the text of the article for terms extracted from Wikipedia. Second, we use the length of the dedicated Wikipedia article to associate a pertinence value with each member of the subtypes list. The intuition behind this choice is that interesting concepts are usually described in more detail than the others. This simple way of ranking entities generally gives satisfying results. For example, after ranking the list of subtypes of bridge, the first terms were: I-35W Mississippi River Bridge, San Francisco-Oakland Bay Bridge, Golden Gate Bridge, Millau Viaduct, Luding Bridge, Brooklyn Bridge and Sydney Harbour Bridge. All these terms correspond to well-known bridges. Note that, for those queries where qualifiers appear, they take priority. All ranks are normalized to values smaller than one. The term ranking is used to order the pictures that are retrieved for a query: if a picture of the Golden Gate Bridge and one of the Luding Bridge are found for a query with bridges, they will be presented in this order on the results page.
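A minimal sketch of this two-step ranking is given below, under the assumptions that qualifier matching is a simple substring test and that article length is the only pertinence signal (the paper does not give an exact scoring formula); all helper names and the toy data are hypothetical.

```python
# Illustrative sketch of the two-step subtype ranking described above.
# Assumptions: a subtype is "associated" with a qualifier if the qualifier occurs
# in its WordNet gloss or Wikipedia article text; pertinence is the article length
# normalized by the longest article, so that all values stay below one.

def rank_subtypes(subtypes, qualifier=None):
    """subtypes: list of dicts with keys 'term', 'gloss', 'article_text'.
    Returns (term, pertinence) pairs, best first."""
    q = qualifier.lower() if qualifier else None
    max_len = max(len(s["article_text"]) for s in subtypes) + 1.0
    scored = []
    for s in subtypes:
        pertinence = len(s["article_text"]) / max_len              # step 2: article length
        has_qualifier = q is not None and (
            q in s["gloss"].lower() or q in s["article_text"].lower()
        )
        scored.append((s["term"], pertinence, has_qualifier))
    scored.sort(key=lambda t: (t[2], t[1]), reverse=True)           # step 1: qualifier matches first
    return [(term, pertinence) for term, pertinence, _ in scored]

# Toy usage: ranking two subtypes of "bridge" by (fake) article length
bridges = [
    {"term": "Golden Gate Bridge", "gloss": "suspension bridge in San Francisco", "article_text": "x" * 9000},
    {"term": "Luding Bridge", "gloss": "bridge over the Dadu River", "article_text": "x" * 4000},
]
print(rank_subtypes(bridges))   # Golden Gate Bridge ranked before Luding Bridge
```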
4. Query reformulation and matching procedure

We performed two types of query reformulation, one involving only text and a second involving both text and visual concepts (called multimedia reformulation).

4.1. Textual query reformulation and matching

We preprocess the textual queries in the WikipediaMM set. First, we do not consider as concepts the stop words that appear in the queries; they are stripped off before the reformulation. For instance, bridges at night is processed as bridges night. Second, nouns in the queries are stemmed and both forms of the word are searched in the text associated with the images in the dataset. Third, when possible, verbs are transformed into corresponding nouns (dance becomes dancing). Also, all capitalized letters are transformed to lower case. The concepts are of three types:
• Visual concepts – belonging to a pre-established closed list of terms (including graph, map or night). They are not subject to textual reformulation and are only used in the multimedia reformulation.
• Expandable concepts – terms for which we built a list of subtypes from WordNet and Wikipedia.
• Non-expandable concepts – terms for which there is no available reformulation. They include qualifiers (e.g. blue, military) or instances (e.g. Golden Gate Bridge).
All concepts in a query, as well as the subtypes of expandable concepts, are tested against the text associated with the images and, whenever a match is found, we increase the matching score between the query and the image. Visual concepts and expandable terms are given a higher score than subtypes, which are in turn scored better than qualifiers. The ranking of the image results is done by summing all the individual scores associated with query components. This ensures that the images whose descriptive text contains the most concepts from the (expanded) query are ranked best. For instance, if an image description contains both hunting dog and labrador retriever, it will be ranked better than an image described only by hunting dog, which, in its turn, will be ranked better than a third image associated only with labrador retriever. When two texts match only subtypes of the concept in the query, we use the subtypes ranking in order to differentiate between them (an image described by labrador retriever is ranked better than one described by Tenterfield terrier). In the case of queries containing more than one concept, no image results are returned when only a qualifier or a visual concept in the query matches the analyzed text.

4.2. Multimedia query reformulation and matching

This section describes the third layer of the system, which aims at (possibly) rearranging the order of the answers returned to a query by the first two layers, depending on its content in terms of visual concepts. We used two systems to detect visual concepts within the images. The first one is the Viola-Jones face detector, which is based on the boosting of Haar wavelets [37]. The second system (extensively described in [20]) is a set of SVM-based classifiers learned (with an RBF kernel) to determine:
• The type of an image: clipart, map, painting or photo.
• In the last case (if the image is a photo), other sets of SVMs determine whether the image is:
o Indoor or outdoor
o Day or night
o Urban or natural scene
The extracted features include LEP texture [3], colour histograms [32] and connected pixel features [31]. For each classifier, the images of the learning databases were chosen independently of the Wikipedia corpus considered at ImageCLEF. When more than two concepts are considered, the multi-class problem is solved using a one-versus-one approach. The implementation was based on LibSVM [4].

The queries were analysed to detect those which have to be filtered by one of the systems described above. Each visual concept was linked to a pre-defined list of textual concepts that trigger its use. For instance, the presence of a named entity denoting a person (such as "George W. Bush") triggers the use of the face detector. The presence of the word "map" within the query calls for the use of the image type detector and favours the images tagged as map, while the word "cartoon" does the same for the images classified as clipart. When a list of answers coming from the first two layers is reordered, the images detected as relevant according to the visual concept associated with the query are put at the head of the list without changing their relative order.
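To make the matching and re-ranking layers concrete, here is a small sketch under assumed score weights (the paper only states the relative ordering visual/expandable > subtype > qualifier, not the actual values) and with the visual classifiers replaced by a precomputed set of flagged image identifiers.

```python
# Illustrative sketch of the textual matching score (Section 4.1) and the
# visual-concept re-ranking (Section 4.2). The weight values are assumptions;
# the paper only fixes their relative order. Visual concept detection itself
# is abstracted as a set of image ids flagged as relevant.

WEIGHTS = {"visual": 1.0, "expandable": 1.0, "subtype": 0.6, "qualifier": 0.3}

def text_score(image_text, query_concepts):
    """query_concepts: list of (term, kind, subtype_rank) tuples,
    where kind is a key of WEIGHTS and subtype_rank is in [0, 1)."""
    text = image_text.lower()
    score = 0.0
    for term, kind, subtype_rank in query_concepts:
        if term.lower() in text:
            score += WEIGHTS[kind]
            if kind == "subtype":
                score += 0.1 * subtype_rank   # Section 3.3 ranking, used as a small bonus
    return score

def retrieve(images, query_concepts, visually_relevant=frozenset()):
    """images: dict image_id -> descriptive text. Rank by summed text score, then
    move images flagged by the triggered visual classifier to the head of the
    list without changing their relative order (stable sort)."""
    ranked = sorted(images, key=lambda i: text_score(images[i], query_concepts), reverse=True)
    return sorted(ranked, key=lambda i: i not in visually_relevant)   # flagged images first

# Toy usage for the query "hunting dog"
images = {
    "img1": "a labrador retriever playing with a hunting dog",
    "img2": "portrait of a hunting dog",
    "img3": "a labrador retriever",
}
concepts = [("hunting dog", "expandable", 0.0), ("labrador retriever", "subtype", 0.9)]
print(retrieve(images, concepts))   # ['img1', 'img2', 'img3']
```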
5. Results and discussion

Table 1 gives the main results of the two runs we submitted. The run ceaTxt is the output of the textual query reformulation and matching only, while the run ceaConTxt is the output of the full system including the multimedia query reformulation and matching. The results are given in terms of Mean Average Precision (MAP) as well as the precision at ranks five and ten.

Run | MAP | P@5 | P@10
ceaConTxt | 0.2735 | 0.5467 | 0.4653
ceaTxt | 0.2632 | 0.52 | 0.4427
Table 1: Main results of the CEA LIST runs at the ImageCLEF WikipediaMM task

Our system returns good results, which were ranked at the fourth and second place of the WikipediaMM task at ImageCLEF 2008. The difference between our two runs shows the slight benefit of the multimedia reformulation and rearrangement, which led to an improvement of one point in terms of MAP (from 0.263 to 0.273). It is worth noting that about half of the images were judged as relevant among the first ten answers returned by our system, demonstrating a practical interest for a real user.

6. Conclusions and perspectives

We proposed a new scheme to exploit both textual and visual information in the context of image retrieval. The approach is based on a query reformulation using concepts that are semantically related to those in the initial query. For each interesting entity in the query, we used Wikipedia and WordNet to extract a list of related concepts, which were further ranked in order to propose the most salient ones first. These answers were ultimately rearranged as a function of the query reformulation in terms of visual concepts. The results submitted at ImageCLEF 2008 were ranked 4th and 2nd, with mean average precisions of 0.2632 and 0.2735. The small difference between the two submitted runs shows that the greater contribution to the final results was probably due to the use of conceptual structures, although a rigorous comparison would have required submitting a run with the third layer (visual concept detection) only. Nevertheless, the improvement of the results' precision accounts for the interest of introducing visual concept detection in the retrieval scheme.

We described an ongoing work and a number of features of our system are currently under investigation. The detection of associated concepts is currently limited to the use of Wikipedia and WordNet. We plan on extending our approach so as to exploit search engine snippets, in order to improve the coverage of the resources. Also, there exist domain-related semantic structures, like Geonames (http://geonames.org) for geography, which could be used to improve the coverage. A second line of work concerns the concept ranking process described in this paper. While simple and generally effective, the current approach can certainly be improved if, for instance, we favour unambiguous hyponyms over ambiguous ones. Concerning the third layer, we are currently exploring a finer-grained filtering of visual concepts. Indeed, the current implementation uses a classifier that does not require any setting apart from the choice of the kernel. Hence, scaling up our system (in terms of number of visual concepts) is a challenging problem in its own right, which consists of being able to build the learning databases (images and triggering words) automatically for a large number of concepts.

7. Acknowledgement

We thank the Direction Générale des Entreprises for funding us through the regional business clusters Systematic (project POPS, http://www.pops-systematic.org/) and Cap Digital (project Mediatic, http://www.media-tic.org/).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives. "DBpedia: A nucleus for a web of open data". In Proceedings of the 6th International Semantic Web Conference (ISWC), Volume 4825 of Lecture Notes in Computer Science, pages 722-735. Springer, (2008).
[2] K. Barnard, P. Duygulu, D. A. Forsyth, N. de Freitas, D. M. Blei, M. I. Jordan. "Matching Words and Pictures". Journal of Machine Learning Research, 3: 1107-1135, (2003).
[3] Y.-C. Cheng and S.-Y. Chen. "Image classification using color, texture and regions". Image and Vision Computing, 21: 759-776, (2003).
[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[5] T. Deselaers, D. Keysers, H. Ney. "Discriminative Training for Object Recognition using Image Patches". CVPR, vol. 2, San Diego, CA, USA, IEEE, pp. 157-162, (2005).
[6] L. Fei-Fei and P. Perona. "A Bayesian Hierarchical Model for Learning Natural Scene Categories". IEEE Comp. Vis. Patt. Recog., (2005).
[7] C. Fellbaum, editor. WordNet: an electronic lexical database. MIT Press, (1998).
[8] M. Ferecatu, N. Boujemaa, M. Crucianu. "Semantic interactive image retrieval combining visual and conceptual content description". Multimedia Systems, 13(5-6), pp. 309-322, (2008).
[9] M. M. Gorkani and R. W. Picard. "Texture Orientation for Sorting Photos at a Glance". Proc. ICPR, Oct. 1994, TR #292.
[10] K. Grauman and T. Darrell. "Pyramid match kernels: Discriminative classification with sets of image features". In Proc. ICCV, (2005).
[11] M.-H. Hsu, H.-H. Chen. "Information retrieval with commonsense knowledge". In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 651-652, New York, NY, USA, (2006). ACM.
[12] M.-H. Hsu, M.-F. Tsai, H.-H. Chen. "Query expansion with ConceptNet and WordNet: An intrinsic comparison". In Proceedings of the Third Asia Information Retrieval Symposium, Information Retrieval Technology, pages 1-13, (2006).
[13] J. Kazama, K. Torisawa. "Exploiting Wikipedia as external knowledge for named entity recognition". In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 698-707, (2007).
[14] H. Le Borgne and A. Guérin-Dugué. "Sparse-Dispersed Coding and Images Discrimination with Independent Component Analysis". Third International Conference on Independent Component Analysis and Signal Separation, San Diego, California, 9-13, (2001).
[15] H. Le Borgne, A. Guérin-Dugué, A. Antoniadis. "Representation of images for classification with independent features". Pattern Recognition Letters, 25(2): 141-154, (2004).
[16] H. Le Borgne, A. Guérin-Dugué, N. E. O'Connor. "Learning Mid-level Image Features for Natural Scene and Texture Classification". IEEE Transactions on Circuits and Systems for Video Technology, 17(3): 286-297, March 2007.
[17] Y. Liu, D. Zhang, G. Lu, W.-Y. Ma. "A survey of content-based image retrieval with high-level semantics". Pattern Recognition, 40(1), pp. 262-282, (2007).
[18] D. Lowe. "Object Recognition from Local Scale-Invariant Features". In Proceedings of the International Conference on Computer Vision, pages 1150-1157, Corfu, Greece, September 1999.
[19] K. Mikolajczyk and C. Schmid. "Indexing Based on Scale Invariant Interest Points". In Proceedings of the International Conference on Computer Vision, pages 525-531, (2001).
[20] C. Millet. Automatic image annotation: consistent annotation, and creating automatically a learning database. PhD thesis, (2007).
[21] K. P. Murphy, A. Torralba and W. T. Freeman. "Using the forest to see the trees: a graphical model relating features, objects and scenes". Advances in Neural Information Processing Systems 16 (NIPS), Vancouver, BC, MIT Press, (2003).
[22] E. Nowak, F. Jurie, and B. Triggs. "Sampling strategies for bag-of-features image classification". In IEEE European Conference on Computer Vision, (2006).
[23] A. Oliva and A. Torralba. "Modelling the shape of a scene: a holistic representation of the spatial envelope". International Journal of Computer Vision, 42(3): 145-175, (2001).
[24] C. Papageorgiou and T. Poggio. "A trainable system for object detection". International Journal of Computer Vision, 38(1): 15-33, (2000).
[25] S. P. Ponzetto, M. Strube. "Deriving a large scale taxonomy from Wikipedia". In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, (2007).
[26] A. Popescu, C. Millet, P.-A. Moëllic. "Ontology Driven Content Based Image Retrieval". CIVR 2007, posters session, July 9-11, 2007, Amsterdam, The Netherlands.
[27] A. Popescu, P.-A. Moëllic, I. Kanellos. "A Conceptual Approach to Web Image Retrieval". LREC 2008, May 28-30, 2008, Marrakech, Morocco.
[28] H. A. Rowley, S. Baluja, T. Kanade. "Rotation Invariant Neural Network-Based Face Detection". IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'98), p. 963, (1998).
[29] C. Schmid and R. Mohr. "Local Grayvalue Invariants for Image Retrieval". IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5): 530-535, May 1997.
[30] H. Schneiderman and T. Kanade. "A statistical model for 3D object detection applied to faces and cars". Conference on Computer Vision and Pattern Recognition, (2000).
[31] R. O. Stehling, M. A. Nascimento, A. X. Falcão. "A compact and efficient image retrieval approach based on border/interior pixel classification". Proceedings of the eleventh international conference on Information and Knowledge Management, pp. 102-109, McLean, Virginia, USA, (2002).
[32] M. Swain and D. Ballard. "Color Indexing". International Journal of Computer Vision, 7(1): 11-32, (1991).
[33] M. Szummer and R. W. Picard. "Indoor-outdoor image classification". Int. Workshop on Content-based Access of Image and Video Databases, pp. 42-51, Jan. 1998.
[34] S. Tollari, H. Glotin, J. Le Maitre. "Enhancement of Textual Images Classification Using Segmented Visual Contents for Image Search Engine". Multimedia Tools and Applications, 25(3): 405-417, (2005).
[35] A. Toral, R. Muñoz. "A proposal to automatically build and maintain gazetteers for named entity recognition by using Wikipedia". In NEW TEXT - Wikis and blogs and other dynamic text sources, Trento, (2006).
[36] A. Torralba, R. Fergus, W. Freeman. "80 million tiny images: a large dataset for non-parametric object and scene recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence, 20 May 2008.
[37] P. Viola and M. J. Jones. "Rapid object detection using a boosted cascade of simple features". In Computer Vision and Pattern Recognition, (2001).
[38] C. Wallraven, B. Caputo, and A. Graf. "Recognition with local features: the kernel recipe". In Proc. ICCV, volume 1, pages 257-264, (2003).
[39] H. Wang, S. Liu, L.-T. Chia. "Does ontology help in image retrieval? - A comparison between keyword, text ontology and multi-modality ontology approaches". In MULTIMEDIA '06: Proceedings of the 14th annual ACM international conference on Multimedia, pages 109-112, New York, NY, USA, (2006). ACM.
[40] J. Yang, L. Wenyin, H. Zhang, Y. Zhuang. "Thesaurus-aided approach for image browsing and retrieval". Proceedings of ICME 2001, (2001).
[41] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. "Local features and kernels for classification of texture and object categories: An in-depth study". Technical Report RR-5737, INRIA Rhône-Alpes, (2005).
[42] X.-J. Wang, W.-Y. Ma, X. Li. "Data-driven approach for bridging the cognitive gap in image retrieval". In Proceedings of the 2004 IEEE International Conference on Multimedia and Expo, Volume 3, pages 2231-2234, Taipei, Taiwan, (June 2004). IEEE.
[43] M. Ruiz-Casado, E. Alfonseca, P. Castells. "Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets". Advances in Web Intelligence, pages 380-386, (2005).
[44] M. Ruiz-Casado, E. Alfonseca, P. Castells. "Automatising the learning of lexical patterns: An application to the enrichment of WordNet by extracting semantic relationships from Wikipedia". Data Knowl. Eng., 61(3), pp. 484-499, (2007).
[45] S. P. Liao, P. J. Cheng, R. C. Chen, L. F. Chien. "LiveImage: Organizing web images by relevant concepts". In Proc. of the Workshop on the Science of the Artificial 2004, pages 210-220, (2005).