LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task Jacinto Mata, Mariano Crespo, Manuel J. Maña Dpto. de Tecnologías de la Información. Universidad de Huelva Ctra. Huelva - Palos de la Frontera s/n. 21819 La Rábida (Huelva) {jacinto.mata, mariano.crespo, manuel.mana}@dti.uhu.es Abstract. This paper shows the experimentation and the results obtained for LABERINTO research group at the ImageCLEF 2011 medical task. We focus our work on image retrieval based on textual information related to the image. The initial hypothesis is that query expansion could improve the effectiveness of image retrieval systems. In this proposal, three different types of indexes were built and several information elements contained in MeSH ontology were used to expand the queries. The experiments carried out show that the expansion strategies using the MeSH ontology obtain good results for this task. Keywords: Text-based image retrieval, medical domain, query expansion, ontologies, MeSH. 1 Introduction This paper describes the contribution of the LABERINTO research group in its first participation at the Medical Image Retrieval task [1]. This task of ImageCLEF 2011 uses a subset of PubMed Central1. This year, the organization proposed three types of subtasks: Modality Classification, Ad-hoc Image-based Retrieval and Case-based Retrieval. We are particularly interested in the Ad-hoc Image-based Retrieval. This is the classic medical retrieval task, similar to those organized in 2005-2010. Participants will be given a set of 30 textual queries with 2-3 sample images for each query. The queries will be classified into textual, mixed and semantic, based on the methods that are expected to yield the best results. In this work, we have used the MeSH2 [2] ontology for query expansion in order to improve our medical image retrieval system. Query expansion is used in a search engine when new terms are added to the user's query in order to increase the efficiency in retrieval. Recently, systems based on query expansion are significantly improving their results, making use of external resources such as ontologies and lexical hierarchies. MeSH is an initiative from the U.S. National Library of Medicine. It is a controlled vocabulary used for indexing articles from Medline. It consists of sets of terms called 1 http://www.ncbi.nlm.nih.gov/pmc/ 2 http://www.nlm.nih.gov/mesh/meshhome.html 2 Jacinto Mata, Mariano Crespo, Manuel J. Maña descriptors, arranged in a hierarchical structure that enables the search at different levels of specificity. There are currently 26,142 MeSH descriptors or Main Headings. There are also over 177,000 alternative expressions, synonyms and terms related to these descriptors, named entry terms. The rest of the paper is organised as follows. Section 2 describes the expansion strategies used in the experiments. In Section 3 the results obtained are shown and discussed. Finally, conclusions and future works are outlined in Section 4. 2 Query Expansion using MeSH MeSH ontology offers many possibilities for expanding the query terms. There are several works where studies on the effect of the use of the MeSH ontology for query expansion are presented. In [3], the authors investigate a query expansion strategy process using an advanced PubMed search called Automatic Term Mapping (ATM). For this task, we have used several strategies for expansion based on the entry terms similar to those used in [4] and other strategy based on the tree structure whereby MeSH organises its descriptors [5]. Many times a descriptor or entry term is made up of more than one term. For example, if the query Mitral Valve was made for each term independently, neither Mitral or Valve correspond to a descriptor or entry term. However, the union of the two terms corresponds to a descriptor itself as "Mitral Valve", which is a biomedical concept. That is the reason why each query was pre-processed by dividing it into n-grams, with the aim of exploring all the possibilities offered by the query to obtain sequences that are MeSH descriptors or entry terms. Below is an example of processing a query with n-grams. Query: Breast cancer mammogram N - Grams (1): Breast (2): Breast cancer (3): Breast cancer mammogram (4): cancer (5): cancer mammogram (6): mammogram Where the n-gram 2 and 4 are entry terms and 1 is a descriptor. The following sections describe the strategies used to expand the queries. LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task 3 2.1 Techniques based on MeSH Tree-structure This strategy is based on the tree structure whereby MeSH organises its descriptors. In this case, if the descriptor is a parent node, it is expanded with its child descriptors. If the descriptor does not have any children there is no expansion. Figure 1 shows a brief MeSH tree excerpt which indicates that the Brain descriptor has seven children while the Central Nervous System descriptor has three. Fig. 1. Excerpt from MeSH Tree 2.2 Techniques based on Entry Terms The first expansion strategy consists in exploring the MeSH tree by checking if the query n-gram is a descriptor. If the n-gram is a descriptor, the query is expanded using all the entry terms of the descriptor. If the n-gram is not a descriptor, we check if it is an entry term. If so, its descriptor and all the entry terms of that descriptor are added to the expansion. The second strategy has only a small variation from the first. When a n-gram in the query is a descriptor, the query is expanded with the entry terms of the preferred concept, instead of all the entry terms of that descriptor. 4 Jacinto Mata, Mariano Crespo, Manuel J. Maña Fig. 2. Example of a filtering process. When the results of these expansion strategies were calculated, it was found that they introduced too much noise into the queries and the results were not as good as expected. To this end, a filtering of the query was carried out to reduce redundant entry terms. Figure 2 shows an example of a filtering process. 3 Experiments and Results This section details the experiments that were conducted to evaluate various expansion strategies. For this aim, three different indexes were created: • Captions (C): This index contains the text of the captions of each image. • Image Reference (IR): In this index, the sections of the paper that reference each image were indexed. For this indexing, the text of the papers was split into sentences using OpenNLP3 software and we have only indexed the sentences which refer to an image. • Full Text (FT): This index contains the full text of each paper. For this edition, three different runs for each indexing were sent: • Baseline (B): Original queries. • Concept Tree (CT): Queries expanded with techniques based on MeSH Tree- Structure. • Entry Terms Peferred Concept (ETPC): Queries expanded with techniques based on Entry Terms. Moreover, an additional run based on Entry Terms (ET) was sent. 3 http://incubator.apache.org/opennlp/ LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task 5 In order to perform text indexing and run the different queries, Lucene4 search engine was used with the default settings. Table 1 shows the results obtained with each run. Table 1. Results from LABERINTO research group in ImageCLEF 2011. Ranking Run MAP P10 P20 Rprec Bpref Num_Rel_Ret 1 laberinto_CTC 0.2172 0.3467 0.3017 0.2369 0.2402 1471 4 laberinto_BC 0.2133 0.3400 0.3067 0.2363 0.2384 1469 16 laberinto_ETPCC 0.1939 0.2933 0.2617 0.2089 0.2198 1526 44 laberinto_BIR 0.1496 0.3400 0.3000 0.1908 0.1992 1292 48 laberinto_CTIR 0.1466 0.3433 0.2950 0.1868 0.1953 1293 50 laberinto_ETPCIR 0.1411 0.3000 0.2850 0.1766 0.1887 1325 57 laberinto_BFT 0.1146 0.2533 0.2267 0.1621 0.1786 1355 58 laberinto_CTFT 0.1101 0.2500 0.2333 0.1512 0.1691 1348 59 laberinto_ETFT 0.1050 0.2567 0.2250 0.1302 0.1640 1292 60 laberinto_ETPCFT 0.1014 0.2400 0.2200 0.1253 0.1571 1310 Looking at specific runs comparisons, we can further draw the following conclusions: The best results were obtained using the index of the image captions. On the other hand, the most effective expansion strategy was the expansion based on the MeSH Tree Structure for all the indexes. The best result among all our runs was laberinto_CTC (Concept Tree with Captions), which reached a MAP value of 0.2172. This value was the highest among all the runs for textual retrieval type. With respect to the strategy based on Entry Terms, we can observe that it retrieve more relevant images than the other strategies. We think that it is also an effective strategy and we will work to improve the MAP and to keep high values for relevant images retrieved. 4 Conclusions and Future Work In this paper we have presented different query expansion strategies using one of the most widely used ontologies in the medical domain, with the aim of enhancing the efficacy of a textual content-based image retrieval system. Different MeSH ontology elements were chosen for expansion. The results of our experiments showed that the expansion strategies using the hierarchical structure whereby MeSH organises its descriptors, obtain good results for this task. This work verified the difficulty of finding an appropriate strategy for query expansion. We think that there are information elements or element combinations in MeSH that might be used to expand the queries and could substantially improve an image retrieval system. In future work, we will continue researching into other query expansion strategies and the use of other ontologies, such as UMLS5 [6]. Moreover, we plan to build 4 http://lucene.apache.org/ 5 http://www.nlm.nih.gov/research/umls/ 6 Jacinto Mata, Mariano Crespo, Manuel J. Maña indexes using only medical concepts extracted from the image captions. Finally, we want to experiment expanding as the queries as the indexed text. 5 Acknowledgments This work was partially funded by the Spanish Ministry of Science and Innovation, the Spanish Government Plan E and the European Union through ERDF (TIN2009- 14057-C03-03). References 1. Kalpathy-Cramer, J., Müller, H., Bedrick, S., Eggel, I., Garcia Seco de Herrera, A. and Tsikrika, T. 2011. The CLEF 2011 medical image retrieval and classification tasks. CLEF 2011 working notes, Amsterdam, The Netherlands. 2. Nelson, S.J., Schopen, M., Savage, A.G., Schulman, J.L. and Arluk, N. 2004. The MeSH translation maintenance system: structure, interface, design and implementation. M. Fieschi, et al. (Ed.). Proceedings of the 11th World Congress on Medical Informatics, pp.67–69. 3. Lu, Z., Kim W. and Wilbur, W. 2009. Evaluation of query expansion using MeSH in PubMed. Information Retrieval, Vol. 12, No. 1, pp. 69-80. 4. Díaz, M.C., Martín, M.T. and Ureña, L.A. 2009. Query expansion with a medical ontology to improve a multimodal information retrieval. Computers in Biology and Medicine, 4, 396-403. 5. Mata, J., Crespo, M. and Maña, M. 2011. Estudio del uso de ontologías para la expansión de consultas en recuperación de imágenes en el dominio biomédico. Procesamiento del Lenguaje Natural, nº 47. 6. Bodenreider, O. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(2004) 267–270.