Multimodal information approaches for the Wikipedia collection at ImageCLEF 2011

R. Granados (1), J. Benavent (2), X. Benavent (2), E. de Ves (2), Ana García-Serrano (1)
(1) Universidad Nacional de Educación a Distancia, UNED
(2) Universitat de Valencia
rgranados@lsi.uned.es, xaro.benavent@uv.es, agarcia@lsi.uned.es, esther.deves@uv.es, jobego@alumni.uv.es

Abstract. The main goal of this paper is to present our experiments in the ImageCLEF 2011 campaign (Wikipedia Retrieval task). In this edition we focused on applying different strategies for merging multimodal information, textual and visual, following both early and late fusion approaches. Our best runs are in the top ten of the global list, at positions 8, 9 and 10 with MAP 0.3405, 0.3367 and 0.3233, making us the second best group of the contest. Moreover, 18 of the 20 submitted runs are above the average MAP of their own modality (textual or mixed). In our system, the TBIR module works first and acts as a filter, and the CBIR system works only with the filtered sub-collection. The two ranked lists are then fused into a final ranked list using their respective probabilities. The best run of the TBIR system is at position 14 with a MAP of 0.3044; it uses the IDRA and Lucene subsystems, fusing monolingual experiments carried out with IDRA preprocessing and the Lucene search engine, and taking into account extra information from the Wikipedia articles. The best result of the CBIR system is obtained by using a logistic regression relevance feedback algorithm and the CEDD low-level features.

Keywords: Information Retrieval, Textual-based Retrieval, Content-Based Image Retrieval, Relevance Feedback, Merging Results Lists, Indexing, Multimedia, Multimodal, Fusion

1 Introduction

UNED-UV is a research group with researchers from two different universities in Spain, the Universidad Nacional de Educación a Distancia (UNED) and the University of Valencia (UV). The group has been working together since the ImageCLEF 2008 edition [1][2][7]. Two kinds of experiments were submitted to the 2011 Wikipedia Retrieval edition [3]: pure textual runs (TBIR) and mixed runs (textual plus visual, CBIR). For the textual experiments, different approaches (stemming, use of the article information, and named entity recognition) were tested in order to evaluate the differences among them. For the mixed runs, as in those presented in 2010 [7], the TBIR system works first over the whole database as a filter and the CBIR system then works only over the filtered collection. Finally, the fusion module produces a ranked list, merging the textual and visual lists while taking into account the probabilities obtained by each of the modules individually. The merging of multimodal information is the main object of study for our group in this edition.

The TBIR subsystem is based on IDRA (InDexing and Retrieving Automatically) [4], a tool implemented at UNED which includes several functionalities: text extraction and preprocessing, indexing following a Vector Space Model (VSM) approach using TF-IDF weighted vectors, retrieval based on the cosine function, a connection to a basic Lucene [12] configuration, and some merging utilities. The CBIR subsystem uses either its own low-level features or the CEDD ones [11], depending on the experiment, in order to test the influence of the low-level features on the final results. Two different algorithms have also been used: a logistic regression relevance feedback algorithm and an automatic algorithm based on the Tanimoto distance.
A more detailed presentation of the system, the submitted experiments, and the obtained results is included in the following sections.

2 System Description

To carry out the Wikipedia Retrieval task, a three-module architecture has been used, as shown in Fig. 1. The figure illustrates the global system, which includes the TBIR (text-based image retrieval) module, the CBIR (content-based image retrieval) module, and the fusion module. As the conceptual meaning of a topic is initially better captured by the text module than by the visual one, the textual module works first as a filter for the visual one, which works only with the sub-collection filtered by the textual module. Each module produces a ranked list based on a similarity score or probability (the TBIR module uses the textual information to obtain these scores, while the CBIR module uses the visual information). From now on, we call the probability given by the textual module the textual probability (Pt) and the probability given by the visual module the image probability (Pi). The way of merging these two probabilities is studied in the fusion module.

The TBIR subsystem is based on the IDRA tool, which allows preprocessing the textual information associated with the images in the collection, and indexing and retrieving it using both its own implemented search engine (based on a VSM approach) and a basic configuration of Lucene [12]. The CBIR subsystem uses either its own low-level features or the CEDD features, depending on the experiment (in order to test the influence of the kind of low-level features on the final result), and its own logistic regression relevance feedback algorithm. In addition, an automatic algorithm, which uses the Tanimoto distance as the score for ranking images in the collection, has been implemented in order to compare the performance of this distance with the logistic regression algorithm. Each of the two subsystems, TBIR and CBIR, generates a ranked list with a certain probability, and this multimodal information is merged in the fusion module. Different ways of merging this information are tested. Moreover, merging algorithms are used inside the TBIR subsystem to fuse different textual result lists from monolingual experiments in order to obtain multilingual results, just as other fusion techniques are used inside the CBIR subsystem. All details of the different levels of merging information and the algorithms used are explained in the following sections.

Fig. 1. System overview.
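To make the overall flow concrete, the following minimal sketch (Python) reproduces only the control flow of the architecture: textual filtering, visual re-ranking of the filtered sub-collection, and a simple Pt*Pi fusion. The scoring functions and the toy data are illustrative stand-ins, not the actual IDRA or CBIR code.

def tbir_scores(topic_text, collection):
    # Toy textual scores Pt: fraction of topic words found in each caption.
    words = set(topic_text.lower().split())
    return {
        img_id: len(words & set(caption.lower().split())) / max(len(words), 1)
        for img_id, caption in collection.items()
    }

def cbir_scores(query_vec, features, candidates):
    # Toy visual scores Pi: inverse L1 distance to the query image vector.
    return {
        img_id: 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(query_vec, features[img_id])))
        for img_id in candidates
    }

def retrieve(topic_text, query_vec, collection, features, k_filter=3):
    pt = tbir_scores(topic_text, collection)                    # TBIR over the whole collection
    filtered = sorted(pt, key=pt.get, reverse=True)[:k_filter]  # textual filter
    pi = cbir_scores(query_vec, features, filtered)             # CBIR on the sub-collection only
    fused = {i: pt[i] * pi[i] for i in filtered}                # simple Pt*Pi late fusion
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    collection = {"img1": "red racing car", "img2": "blue sky", "img3": "vintage car museum"}
    features = {"img1": [0.9, 0.1], "img2": [0.1, 0.8], "img3": [0.7, 0.3]}
    print(retrieve("racing car", [0.8, 0.2], collection, features))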
2.1 Text-based Index and Image Retrieval (TBIR)

This module is in charge of the indexing and search of the images in the collection, based on the textual information in the metadata files associated with each of these images. The IDRA tool is used to extract, select and preprocess this information. Additional resources, such as STILUS [9] or Snowball [10], are required in the preprocessing step. Finally, both the IDRA and Lucene search engines are used for the indexing and retrieval of the preprocessed textual information, obtaining a ranked results list with the retrieved relevant images for each query. The components shown in Fig. 1 within the TBIR subsystem are the following:

Textual Info Extraction. Two different textual information sources can be distinguished in the collection: the metadata files and the article files. The metadata XML tags are extracted using the JDOM Java API; some of them are general for all languages, while others are specific to each particular language (English, French and German). One of the metadata tags may include a link to the Wikipedia article(s) in which the image appears. This information is taken into account in some of the experiments, extracting the title and the categories from the linked articles. The final output of this component is the selected textual information describing the images, coming from both the metadata and the articles, and separated by language: text (EN), text (FR) and text (DE).

Language-dependent IDRA Preprocess. This component processes the selected text in three steps: 1) special character deletion: characters with no statistical meaning, like punctuation marks or accents, are eliminated; 2) stopword detection: exclusion of semantically empty words using specific lists for each language; and 3) stemming: reducing inflected or derived words to their stem, base or root form. A different algorithm is needed to perform stemming for each of the languages; stemmers from Snowball [10] are used in the experimentation.

NER (STILUS). Named Entity Recognition is carried out by the 'List Entities' functionality of the STILUS-Core API [9]. Different forms of the detected entities (general forms and variants) are taken into account and considered as textual information in the corresponding experiments.

Index&Search. Once extraction and preprocessing are completed, both the IDRA tool and Lucene are used to index the selected text and to retrieve relevant images for the proposed queries. IDRA indexing is based on the VSM approach using TF-IDF (term frequency - inverse document frequency) weighted vectors. This approach consists of calculating a weight vector for each image's selected text. Each vector is composed of the TF-IDF weight values of the different words in the collection. The TF-IDF weight is a statistical measure used to evaluate how important a word is to a text in a given collection. These weights are normalized using the Euclidean norm. IDRA search launches the textual queries from the topics (English, French or German) against a given index, obtaining in this way the corresponding "TXT Results List". For each query, IDRA calculates its corresponding weight vector in the same way as for the indexed texts. The similarity between the query and an image text then depends on the proximity of their associated vectors, calculated with the cosine measure. This similarity value is calculated between the query and the indexed text associated with every image, and the images are ranked in descending order of relevance in the "TXT Results List". Lucene indexing and search can be executed from the IDRA tool. Selected texts already preprocessed with IDRA are indexed with Lucene following a basic implementation that uses the WhitespaceAnalyzer, which just separates tokens and does not apply any other linguistic preprocessing.
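As an illustration of the VSM indexing and retrieval just described, the following sketch builds L2-normalized TF-IDF vectors and ranks documents by the cosine measure. It is a simplified stand-in written for this paper's description, not IDRA's actual implementation.

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: {doc_id: [token, ...]} -> L2-normalized TF-IDF vectors and the IDF table.
    n = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vec = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[doc_id] = {t: w / norm for t, w in vec.items()}
    return vectors, idf

def cosine(u, v):
    # Both vectors are already L2-normalized, so the cosine is just the dot product.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def search(query_tokens, vectors, idf):
    tf = Counter(query_tokens)
    qvec = {t: tf[t] * idf.get(t, 0.0) for t in tf}
    norm = math.sqrt(sum(w * w for w in qvec.values())) or 1.0
    qvec = {t: w / norm for t, w in qvec.items()}
    ranking = [(doc_id, cosine(qvec, vec)) for doc_id, vec in vectors.items()]
    return sorted(ranking, key=lambda kv: kv[1], reverse=True)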
2.2 Content-Based Information and Visual Retrieval

The VISION-Team at the Computer Science Department of the University of Valencia has its own CBIR system, which has been used in previous ImageCLEF editions since our first participation in 2008. In the last edition, the focus of the work was on testing three different visual algorithms applied to the results retrieved by the text module: the automatic algorithm, the relevance feedback algorithm and query expansion, obtaining the best results with the relevance feedback algorithm. Therefore, in this edition we have used the relevance feedback algorithm, and the work has focused on comparing the behavior of our own low-level features with the low-level features provided by the organization (the CEDD descriptor described in [11]).

Extraction of low-level features. As in most CBIR systems, a feature vector represents each image. The first step of the visual retrieval system is extracting these features for all the images in the database and for each image in the query topic. We use different low-level features describing color and texture to build a vector of 293 components.

- Color information: color information has been extracted by calculating both local and global histograms of the images using 10x3 bins in the HS color space. Local histograms have been calculated by dividing the images into four fragments of the same size. As a result, a feature vector of 222 components represents the color information of the image.
- Texture information: two types of texture features are computed: the granulometric distribution function, using the coefficients that result from fitting the distribution function with a B-spline basis, and the spatial size distribution. We have used two different versions of the latter, using both a horizontal and a vertical segment as structuring elements for the size-measuring morphological operation [1]. This gives a texture feature vector of 71 components.

We assume that the conceptual meaning of a topic is better captured by the text module than by the visual module when they work individually. Therefore, the task of the visual module is to re-rank the textual result list taking into account the information of the query images given in each topic.

Automatic algorithm. This is a classical algorithm in a CBIR system. Each image in the database has an associated low-level feature vector; for this algorithm we have used the low-level features provided by the organization (CEDD). The next step is to calculate the similarity between the feature vector of each image in the database and those of the N query images. The distance metric applied in our experiments is the Tanimoto distance. As we have N query images, we obtain N visual result lists, one for each query image in the topic. These N result lists are merged by using an average OWA operator.
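The sketch below illustrates the automatic algorithm: Tanimoto similarity between the feature vector of each candidate image and each query image, with the N per-query-image scores merged by a simple average (the behavior of the OWA operator with orness 0.5). Ranking by decreasing similarity is equivalent to ranking by increasing Tanimoto distance. This is an illustrative sketch, not the group's actual code.

import numpy as np

def tanimoto_similarity(a, b):
    # Tanimoto coefficient for real-valued vectors: a.b / (|a|^2 + |b|^2 - a.b).
    dot = float(np.dot(a, b))
    return dot / (float(np.dot(a, a)) + float(np.dot(b, b)) - dot + 1e-12)

def automatic_rank(query_vecs, candidate_feats):
    # query_vecs: list of 1-D feature arrays; candidate_feats: {img_id: 1-D feature array}.
    merged = {}
    for img_id, feat in candidate_feats.items():
        sims = [tanimoto_similarity(q, feat) for q in query_vecs]
        merged[img_id] = float(np.mean(sims))   # average over the N query images
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)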
Relevance feedback algorithm based on logistic regression. This algorithm works differently from the previous one, so we will explain the concept of relevance feedback and the adjustments made to obtain a good performance of the algorithm for the proposed task [5]. Relevance feedback is a term used to describe the actions performed by a user to interactively improve the results of a query by reformulating it. An initial query formulated by a user may not fully capture his/her wishes. Users then typically change the query manually and re-execute the search until they are satisfied. By using relevance feedback, the system learns a new query that better captures the user's information need. The user enters his/her preferences at each iteration through the selection of relevant and non-relevant images. We now explain the way the logistic regression relevance feedback algorithm works.

Let us consider the (random) variable Y giving the user evaluation, where Y=1 means that the image is positively evaluated and Y=0 means a negative evaluation. Each image in the database has previously been described by low-level features, in such a way that the j-th image has an associated k-dimensional feature vector xj. Our data consist of (xj, yj), with j=1,...,n, where n is the total number of images, xj is the feature vector and yj the user evaluation (1=positive and 0=negative). The image feature vector x is known for any image, and we intend to predict the associated value of Y. In this work, we have used a logistic regression where P(Y=1|x), i.e. the probability that Y=1 (the user evaluates the image positively) given the feature vector x, is related to the systematic part of the model (a linear combination of the feature vector) by means of the logit function. For a binary response variable Y and p explanatory variables X1,...,Xp, the model for π(x)=P(Y=1|x) at values x=(x1,...,xp) of the predictors is logit[π(x)] = α + β1x1 + ... + βpxp, where logit[π(x)] = ln(π(x)/(1−π(x))). The model parameters are obtained by maximizing the likelihood function given by:

l(β) = ∏_{i=1}^{n} π(xi)^{yi} [1 − π(xi)]^{1−yi}    (1)

The maximum likelihood estimators (MLE) of the parameter vector β are calculated by using an iterative method. A major difficulty arises when adjusting a global regression model that takes the whole set of variables into account, because the number of selected images (the number of positive plus negative images) is typically smaller than the number of characteristics. In that case, the adjusted regression model has as many parameters as data points, and many relevant variables might not be considered. In order to solve this problem, our proposal is to adjust several smaller regression models: each model considers only a subset of variables consisting of semantically related characteristics of the image. Consequently, each sub-model associates a different relevance probability with a given image x, and we face the question of how to combine them in order to rank the database according to the user's preferences. This problem has been solved by means of an ordered weighted averaging (OWA) operator [6]. In our case, we have adapted the manual relevance feedback to an automatic setting. The examples and counter-examples (positive and negative images) are automatically selected for each topic. The examples are the query images of the topic plus N images taken from the first positions of the textual result list. The M counter-examples are obtained by applying a procedure which chooses J random images from the whole database (excluding images in the textual list). These J images are ranked by Euclidean distance and the last M images are taken as negative examples.
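The following sketch shows the core of this step under the assumptions just described: one small logistic model per block of semantically related features, with the per-block probabilities combined by an OWA operator. The feature blocks, block sizes and OWA weights in the usage example are illustrative assumptions, and the code is a simplified stand-in rather than the system's actual implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

def owa(values, weights):
    # Ordered weighted average: sort the inputs in descending order, then apply the weights.
    return float(np.dot(np.sort(values)[::-1], weights))

def relevance_probabilities(pos, neg, candidates, blocks, owa_weights):
    # pos/neg/candidates: 2-D arrays (rows = images); blocks: list of column-index arrays,
    # one per group of semantically related features.
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    block_probs = []
    for cols in blocks:
        model = LogisticRegression(max_iter=1000).fit(X[:, cols], y)
        block_probs.append(model.predict_proba(candidates[:, cols])[:, 1])
    block_probs = np.column_stack(block_probs)          # one column per sub-model
    return np.array([owa(row, owa_weights) for row in block_probs])

# Hypothetical usage with two feature blocks (e.g. color vs. texture) and orness 0.5:
# probs = relevance_probabilities(pos, neg, cand,
#                                 [np.arange(0, 222), np.arange(222, 293)],
#                                 owa_weights=[0.5, 0.5])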
2.3 Multimodal Fusion

Different types of fusion are needed at different steps of the experimentation. Depending on the point where the fusion is carried out, it is called early fusion (at feature level) or late fusion (at decision level) [8]. Moreover, fusion can be applied among resources from different modalities (text and image) or from the same modality (different sources of text). Late fusion algorithms are used when fusing multiple modalities in the semantic space, at decision level, and when fusing textual results from monolingual experiments. All submitted mixed runs fuse textual and visual runs at this level, combining the mono-modal decisions (results lists) from each modality.

2.3.1 Fusion in Textual Runs

Several fusion approaches are used within the textual module, following both early and late fusion techniques. Fusion at feature level (early fusion) is used when combining text in different languages [8], or from different processes (i.e. NER). A late fusion approach is used when fusing the decisions from each of the monolingual processes. The different implemented algorithms, and the purposes they were built for, are explained below:

JOIN. An early fusion approach which simply concatenates several lists of terms coming from different sources into a single one. The fusion takes place at feature level (early fusion), as it merges the components of the representation vectors of each image. This approach is used in two kinds of experiments: 1) to merge the terms from the different languages (EN, FR, DE) describing each image in the collection into a unique multilingual representation (run9); and 2) to fuse at feature level the terms coming from the metadata textual information with those obtained from the NER process (run10).

MAXmerge. Used to fuse different results lists. This algorithm is included in the IDRA tool and, for each query, selects the results from the different lists which have the highest relevance/similarity value, independently of the list in which the results appear. This merging corresponds to a decision-level fusion (late fusion), where the individual decisions obtained for each language are combined into a unique, multilingual results list (run4, run8).

Enrich. A late fusion algorithm, also included in IDRA, used to merge two results lists. The algorithm fuses a main list with a support list. If a retrieved image appears in both lists for the same query, its relevance in the merged list is increased in the following way (normalized from 0 to 1):

newRel = mainRel + supRel / (posRel + 1)    (2)

where newRel is the relevance value in the merged list, mainRel the relevance value in the main list, supRel the relevance value in the support list, and posRel the position of the image in the support list (see the sketch at the end of this section). Every result appearing in the support list but not in the main one (for each query) is added at the end of the results for that query; in this case, relevance values are normalized according to the lowest value at that point. This method is used (run7) to enrich the results of one of the English monolingual experiments with those from a NE-based experiment. The improvement (or not) due to this fusion will only be appreciable in those queries (a total of 9) where named entities were detected.

2.3.2 Fusion in the Visual Module

There are two points where a fusion algorithm is needed inside the visual procedures. The first one is in the relevance feedback algorithm because, as explained before, our proposal is to adjust several smaller regression models; consequently, each sub-model associates a different relevance probability with a given image x. An ordered weighted averaging (OWA) operator with an orness of 0.5 has been used for this purpose, as explained in Section 2.2. On the other hand, the automatic algorithm generates N visual result lists, depending on the number of query images. These N lists are merged into one final result list by using the OWA aggregation operator with an orness of 0.5.
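The sketch below instantiates the Enrich rule of Eq. (2) described in Section 2.3.1. It assumes 0-based positions in the support list and places support-only results below the current minimum relevance; both choices are illustrative assumptions, since the paper does not fix these details, and the code is not IDRA's actual implementation.

def enrich(main_list, support_list):
    # Both lists: [(image_id, relevance), ...] ranked by descending relevance in [0, 1].
    support_pos = {img: pos for pos, (img, _) in enumerate(support_list)}  # assumed 0-based
    support_rel = dict(support_list)
    main_ids = {img for img, _ in main_list}
    merged = []
    for img, main_rel in main_list:
        if img in support_pos:
            # newRel = mainRel + supRel / (posRel + 1), as in Eq. (2)
            main_rel = main_rel + support_rel[img] / (support_pos[img] + 1)
        merged.append((img, main_rel))
    merged.sort(key=lambda kv: kv[1], reverse=True)
    # Append support-only results after the main ones, scaled below the current minimum
    # (one possible reading of the normalization described in the text).
    floor = merged[-1][1] if merged else 1.0
    tail = [(img, floor * support_rel[img]) for img, _ in support_list if img not in main_ids]
    return merged + sorted(tail, key=lambda kv: kv[1], reverse=True)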
2.3.3 Multimodal Fusion: Merging Textual and Visual Lists

The late fusion module is focused on merging the two probabilities obtained for each image from the textual and the visual module independently. Different ways of merging these two probabilities have been tested:

- Pi. Using only the image probability to rank the filtered textual list. In this experiment the textual probability is used only in the first step, to filter a sub-collection with the images most similar to the topic, and the re-ranking is then based only on the visual information.
- Pt*Pi. The ranking is made using the score computed as the product of the two probabilities. In these experiments both textual and visual information are used to obtain the final re-ranked list.
- OWA. The OWA operator has also been used in the late fusion process. This operator transforms a finite number of inputs into a single output (in our case the inputs are the Pi and Pt probabilities). With the OWA operator no weight is associated with any particular input; instead, the relative magnitude of the input decides which weight corresponds to each input. The aggregation weights used in these experiments correspond to an orness of 0.3 (meaning that a weight of 0.3 is given to the higher probability value) and to an orness of 0.5 (equivalent to an average operator).

3 Experiments (Submitted Runs)

A total of 20 runs were finally submitted to the task: 10 text-based and 10 mixed, combining both textual and visual techniques. A schematic description of all 20 runs is given in the following tables.

Table 1. Submitted textual experiments.

ID     Lang  System       md  stem  Art  Fusion                    NER
run1   EN    IDRA         ✓   -     -    -                         -
run2   EN    IDRA         ✓   ✓     -    -                         -
run3   EN    IDRA         ✓   ✓     ✓    -                         -
run4   ALL   IDRA         ✓   ✓     ✓    MAXmerge (3en, 3fr, 3de)  -
run5   EN    IDRA+Lucene  ✓   ✓     ✓    -                         -
run6   FR    IDRA+Lucene  ✓   ✓     ✓    -                         -
run7   EN    IDRA+Lucene  ✓   ✓     ✓    Enrich (5, NEs)           ✓
run8   ALL   IDRA+Lucene  ✓   ✓     ✓    MAXmerge (5en, 5fr, 5de)  -
run9   ALL   IDRA+Lucene  ✓   ✓     ✓    JOIN                      -
run10  EN    IDRA+Lucene  ✓   ✓     ✓    JOIN                      ✓

Experiments 1, 2, 3 and 4 are fully run using the IDRA tool (pre-processing, indexing and retrieval). These runs compare different configuration possibilities: 1) applying stemming or not; 2) adding the textual information from the articles; 3) using the textual information from languages other than English (FR, DE). The stemming process for the different languages was carried out using Snowball. The way we take into account the textual information in the related articles provided with the collection consists of extracting the title and the Wikipedia categories of the corresponding article. Run 4 uses metadata, stemming and articles (as run 3 does in English) independently for each of the 3 languages, and applies a late fusion algorithm (MAXmerge) to merge the obtained monolingual results. Runs 5 and 6 use the same configuration as run 3, but use IDRA only to pre-process the data and Lucene for indexing and search. Run 8 also uses IDRA and Lucene, and follows the same late fusion approach among monolingual experiments carried out with IDRA pre-processing and the Lucene search engine. Run 9 fuses the text from the different languages at feature level, following an early fusion method (JOIN). Runs 7 and 10 add the use of a named entity recognizer based on the STILUS-Core API.
This information is mixed with the textual information already available in two ways: 1) concatenating the identified entities for each image with its own associated text before indexing, that is, merging at feature level (the same concatenation is done with the text of the queries); 2) indexing and searching independently using only the entities (just for the 9 queries with detected named entities), and using these results to enrich (late fusion algorithm) the results list from run 5, which is based on English textual information.

Table 2. Submitted mixed experiments.

ID     TXT run  Low-level features  Relevance feedback  Distance  Fusion
run11  run9     UV                  ✓                   -         Pi
run12  run9     CEDD                ✓                   -         Pi
run13  run9     CEDD                Automatic           Tanimoto  OWA(Med)
run14  run9     CEDD                ✓                   -         Pt*Pi
run15  run9     CEDD                ✓                   -         OWA(Med)
run16  run9     CEDD                ✓                   -         OWA(Orness 0.3)
run17  run9     UV                  ✓                   -         Pt*Pi
run18  run8     CEDD                ✓                   -         Pt*Pi
run19  run8     CEDD                ✓                   -         OWA(Med)
run20  run8     CEDD                ✓                   -         OWA(Orness 0.3)

Two text-based runs, run8 and run9, have been used as the textual filter when testing the different ways of merging the textual and the visual information (see Table 1 for the details of these textual runs). The block of runs 14, 15 and 16 is therefore designed to be compared with runs 18, 19 and 20 in order to see the influence of the text-based run used. In each group, we have tested three ways of merging the textual and the visual information: the product of the textual and visual probabilities, and an OWA aggregation operator with an orness of 0.3 or of 0.5 (see the sketch below). Runs 11 and 12 have been made to test the influence of the low-level features used (UV or CEDD), and also to test the fusion approach that uses only the visual information (the image probability) to re-rank the final list. Finally, run13 is designed to compare the two different visual algorithms used: the relevance feedback algorithm and the automatic algorithm with the Tanimoto distance.
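The following sketch summarizes the multimodal fusion variants compared in the mixed runs (Section 2.3.3 and Table 2): ranking by Pi alone, by the product Pt*Pi, or by a two-input OWA operator whose first weight equals the chosen orness (0.3 or 0.5). It is an illustrative sketch under those assumptions, not the system's actual code.

def fuse_scores(pt, pi, method="product", orness=0.5):
    # pt, pi: {image_id: probability} for the textually filtered sub-collection.
    fused = {}
    for img in pi:
        if method == "pi":
            fused[img] = pi[img]                      # visual probability only
        elif method == "product":
            fused[img] = pt[img] * pi[img]            # Pt*Pi
        elif method == "owa":
            hi, lo = max(pt[img], pi[img]), min(pt[img], pi[img])
            fused[img] = orness * hi + (1.0 - orness) * lo   # 2-input OWA
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)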
4 Results

After the evaluation by the task organizers, our results for each of the submitted experiments are presented in Table 3. The table shows that our best results are obtained by the mixed runs 18, 20 and 19 (at positions 8, 9 and 10 of the global result list, i.e. within the first 10% of the results). For the text modality, the best result is run8 at position 14 (within the first 15% of the results). It is worth pointing out that all our runs except two are above the average of their modality (textual or mixed), and that in the group classification we are the second group in the global result list.

Table 3. Results for the submitted experiments.

Pos  Run     Mode     MAP     P@10    P@20    R-prec.  Bpref
93   run1    Textual  0.1727  0.3040  0.2380  0.2140   0.1786
75   run2    Textual  0.2056  0.3700  0.2900  0.2518   0.2097
63   run3    Textual  0.2243  0.3980  0.3270  0.2702   0.2287
51   run4    Textual  0.2489  0.3800  0.3290  0.2913   0.2450
45   run5    Textual  0.2601  0.4560  0.3670  0.3014   0.2600
98   run6    Textual  0.1561  0.3580  0.2600  0.2111   0.1813
50   run7    Textual  0.2515  0.4460  0.3660  0.2997   0.2541
14   run8    Textual  0.3044  0.5060  0.4040  0.3435   0.3012
36   run9    Textual  0.2758  0.4520  0.3550  0.3154   0.2771
55   run10   Textual  0.2403  0.4520  0.3510  0.2908   0.2458
107  run11   Mixed    0.0553  0.1180  0.1030  0.0816   0.0631
108  run12   Mixed    0.0516  0.0880  0.0950  0.0802   0.0579
21   run13   Mixed    0.2869  0.5040  0.4060  0.3306   0.2909
15   run14   Mixed    0.3006  0.5200  0.4030  0.3379   0.2983
20   run15   Mixed    0.2869  0.5040  0.4060  0.3306   0.2909
17   run16   Mixed    0.2980  0.5000  0.4030  0.3338   0.2954
16   run17   Mixed    0.3006  0.4960  0.3960  0.3376   0.2996
8    run18   Mixed    0.3405  0.5420  0.4500  0.3752   0.3378
10   run19   Mixed    0.3233  0.5400  0.4230  0.3586   0.3217
9    run20   Mixed    0.3367  0.5460  0.4410  0.3673   0.3314
     Average           Textual  0.2169  0.3973  0.3228  0.2668  0.2246
     Best (pos 11)     Textual  0.3141  0.5160  0.4270  0.3504  0.3107
     Average           Mixed    0.2558  0.4542  0.3678  0.3049  0.2648
     Best (pos 1)      Mixed    0.3880  0.6320  0.5100  0.4162  0.3847

Textual runs 1, 2 and 3 show how the application of stemming and the use of the textual information coming from the Wikipedia article categories have a positive influence on the final image retrieval. Analyzing the results from runs 7 and 10, it can be observed that the recognition of named entities is not very useful in terms of MAP, maybe because only 9 of the queries contain any entity (results per topic should be analyzed in detail). Only two runs (from XRCE) have obtained better results than our best one, and both of them use query expansion or feedback techniques. Regarding the textual fusion, our two best textual runs show that late fusion (run8, 0.3044) obtains better results than early fusion (run9, 0.2758). The only difference between these two runs is that run9 fuses the textual information from the three languages at the beginning of the process (at feature level), while run8 works independently with each language and combines the results at the end of the process (at decision level). With respect to the mixed runs, our best result is run18 with a MAP of 0.3405, ranked at position 8. This is also our best global result. This experiment uses the images filtered by run8 (our best textual result), the low-level features given by the organization (CEDD), and the logistic regression algorithm; the merging is performed by ranking the final list with the Pt*Pi probabilities. Runs 11 and 12, with MAP of 0.0553 and 0.0516, are our worst global results (positions 107 and 108 of the global ranked MAP list). These two runs use only the image probability Pi for ranking the final result list.

5 Concluding Remarks and Future Work

Our best results are obtained in the mixed modality, at positions 8, 9 and 10, that is, in the top ten of the contest. Moreover, 18 of the 20 submitted runs have a MAP above the average of their own modality (textual or mixed). These results indicate that our strategy of fusing multimodal (textual and visual) information places us among the top-performing retrieval strategies in the task.
The best mixed run mentioned above uses, in the textual module, IDRA and Lucene with a late fusion approach among monolingual experiments; the visual module uses the CEDD low-level features and the logistic regression relevance feedback algorithm; and the two lists, textual and visual, are fused by multiplying both probabilities. Our group will continue working on tuning the two modules independently, in order to improve the obtained results. In the textual modality we have discovered the positive influence of taking into account the textual information extracted from the Wikipedia categories, and the better performance obtained when fusing the textual information from different languages at decision level (late fusion, run8) rather than at feature level (early fusion, run9). Regarding the merging strategies, the combination of the multimodal information (textual and visual) at decision level always gives better results than using only textual or visual information individually (runs 18, 19 and 20 against run8 or run12). Among the different merging strategies presented, the best results are always obtained by multiplying both probabilities, followed by the OWA aggregation operator at the different orness values (runs 18, 20 and 19, respectively).

Acknowledgments. This work has been partially supported by the projects BUSCAMEDIA (CEN-20091026), MA2VICMR (S2009/TIC-1542) and MCYT TEC2009-12980.

References

1. Ana García-Serrano, Xaro Benavent, Ruben Granados, José Miguel Goñi-Menoyo. Some results using different approaches to merge visual and text-based features in CLEF'08 photo collection. Lecture Notes in Computer Science, Evaluating Systems for Multilingual and Multimodal Information Access, vol. 5706/2009, pp. 568-571. ISSN: 0302-9743.
2. Ana García-Serrano, Xaro Benavent, Ruben Granados, Esther de Ves, Jose Miguel Goñi. Multimedia Retrieval by Means of Merge of Results from Textual and Content Based Retrieval Subsystems. Lecture Notes in Computer Science, Multilingual Information Access Evaluation II: Multimedia Experiments. Revised selected papers from the 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009, Corfu, Greece, September 30 - October 2, 2009. Vol. 6242/2010, pp. 142-149. DOI: 10.1007/978-3-642-15751-6. ISBN: 978-3-642-15750-9.
3. Theodora Tsikrika, Adrian Popescu, Jana Kludas. Overview of the Wikipedia image retrieval task at ImageCLEF 2011. CLEF 2011 working notes, Amsterdam, The Netherlands, 2011.
4. Ruben Granados Muñoz, Ana García Serrano, José M. Goñi Menoyo. La herramienta IDRA (Indexing and Retrieving Automatically). Procesamiento del Lenguaje Natural, no. 43, September 2009. XXV Conferencia de la Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN'09). San Sebastián, 2009.
5. Leon, T., Zuccarello, P., Ayala, G., de Ves, E., Domingo, J.: Applying logistic regression to relevance feedback in image retrieval systems. Pattern Recognition, vol. 40, pp. 2621-2632 (2007).
6. R. Yager. On ordered weighted averaging aggregation operators in multicriteria decision making. IEEE Transactions on Systems, Man and Cybernetics, vol. 18, pp. 183-190 (1988).
7. Joan Benavent, Xaro Benavent, Esther de Ves, Ruben Granados, Ana García-Serrano. Experiences at ImageCLEF 2010 using CBIR and TBIR Mixing Information Approaches. CLEF 2010 LABs and Workshops, Notebook Papers, 22-23 September 2010, Padua, Italy, 2010. ISBN: 978-88-904810-0-0.
8. P. Atrey, M.A. Hossain, A.E. Saddik, M. Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 16 (2010), pp. 345-379.
9. STILUS-Core, Daedalus. http://www.daedalus.es/productos/stilus/stilus-core.html
10. Snowball Stemmer. http://snowball.tartarus.org/
11. S. A. Chatzichristofis, K. Zagoris, Y. S. Boutalis, N. Papamarkos. "Accurate Image Retrieval based on Compact Composite Descriptors and Relevance Feedback Information". International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), vol. 24, no. 2, February 2010, pp. 207-244, World Scientific.
12. Apache Lucene. http://lucene.apache.org