LABERINTO at ImageCLEF 2012 Medical Image
                   Retrieval Task

                      Mariano Crespo, Jacinto Mata, Manuel J. Maña

                Dpto. de Tecnologías de la Información. Universidad de Huelva
               Ctra. Huelva - Palos de la Frontera s/n. 21819 La Rábida (Huelva)
                   {mariano.crespo, jacinto.mata, manuel.mana}@dti.uhu.es


       Abstract. This paper shows the experimentation and the results obtained for
       LABERINTO research group at the ImageCLEF 2012 medical task. We focus
       our work on image retrieval based on textual information related to the image.
       Last year we demonstrated that query expansion exploiting the hierarchical
       structure of the MeSH descriptors achieved a significant improvement in image
       retrieval systems. This year our goal is to improve the results obtained last year
       adding a relevance factor to the query terms. In addition, we have developed a
       new strategy combining the expansion strategy based on the hierarchical MeSH
       structure with another expansion strategy very popular among researchers in
       this field, where the query terms are expanded using MMTx program. The
       experiments carried out have shown that a relevance factor for the query terms
       achieves a significant improvement for the results of the different expansion
       strategies.

       Keywords: Text-based image retrieval, medical domain, query expansion,
       ontologies, MeSH, UMLS, Lucene.


1    Introduction

This paper describes the contribution of the LABERINTO research group in its
second participation at the Medical Image Retrieval task [1].
   This task of ImageCLEF 2011 uses a subset of PubMed Central 1. This year, the
organization proposed three types of subtasks: Modality Classification, Ad-hoc
Image-based Retrieval and Case-based Retrieval. We are particularly interested in the
Ad-hoc Image-based Retrieval. This is the classic medical retrieval task, similar to
those organized in 2005-2011. Participants will be given a set of 22 textual queries
with 2-3 sample images for each query. The queries will be classified into textual,
mixed and semantic, based on the methods that are expected to yield the best results.
   Due to the good results obtained in last year's edition, the aim of this year is to
improve the effectiveness of the expansion strategy used. We use the MeSH [2]
ontology for the expansion of queries, and our new proposal for this year is the
inclusion of relevance factors in the query terms and the development of a new

1 http://www.ncbi.nlm.nih.gov/pmc/
2   Mariano Crespo, Jacinto Mata, Manuel J. Maña


strategy, which is a combination between the expansion strategy based on the
hierarchical MeSH structure and a strategy that expands the query terms using
MetaMap Transfer program (MMTx) [3].
   Ontologies represent a particular knowledge domain in the form of a set of
concepts and relations between them. There are many terminological and ontological
resources available in the biomedical domain, along with a wide range of applications
in NLP: information retrieval, question answering, automatic summarization and
classification amongst others. The two resources used in this work were the MeSH
ontology and the MMTx program using the source vocabulary SNOMED-CT [4].
   MeSH is a controlled vocabulary used for indexing Medline papers. It is comprised
of term sets or descriptors, organized into a hierarchical structure to allow searches at
various levels of specificity. At present, MeSH encompasses 26,142 descriptors or
Main Headings. This is the vocabulary used to index Medline citations. Alternative
forms, synonyms and terms related to the descriptors are known as Entry Terms.
There are over 177,000 Entry Terms in MeSH.
   MetaMap is a highly configurable program developed by Dr. Alan Aronson at the
National Library of Medicine (NLM) to map biomedical text to the UMLS
Metathesaurus [5] or, equivalently, to discover Metathesaurus concepts referred to in
text.
   Finally, we used Lucene [6] to assign the relevance level of matching images based
on the terms found.
   The rest of the paper is organised as follows. Section 2 describes the expansion
strategies used in the experiments and the technique that adds a relevance factor to the
query terms. In Section 3 the results obtained are shown and discussed. Finally,
conclusions and future works are outlined in Section 4.


2    Query Expansion using MeSH

MeSH ontology offers many possibilities for expanding the query terms. Various
works report on studies into the effect of using the MeSH ontology for query
expansion. For example, in [7] the authors base the expansion on the hierarchical
structure of MeSH. When this technique locates a MeSH descriptor in the query, it
ascends the tree to higher levels to search for more general descriptors and adds those
it finds to the user query.
    In [8] the authors explore a strategy for query expansion using a process of
advanced queries known as Automatic Term Mapping (ATM) in PubMed. The study
employed a collection of 64 queries and around 160,000 MEDLINE citations which
were used in the 2006 and 2007 TREC Genomics Track. One of the main results was
an increase in the F measure [9] of 21.5% and 23.3% in the 2006 and 2007 collections
respectively through the use of query expansion. The researchers conclude that query
expansion through MeSH in PubMed can improve the effectiveness of retrieval, but
that in real situations the improvement may not prove to be significant for PubMed
users.
    In [10, 11] the authors also employ MeSH ontology to expand both the collection
and queries. The approach uses two strategies for expanding the document collection.
                      LABERINTO at ImageCLEF 2012 Medical Image Retrieval Task        3


The first of these extracts the set of MeSH descriptors from the image captions and
article titles by means of a classifier, which returns a list of descriptors in order of
relevance. The image captions, article titles and the first five descriptors returned by
the classifier are then indexed. The second strategy indexes not only the above
information, but also the MeSH descriptors associated with the corresponding article
in MEDLINE. The query is expanded by extracting the set of MeSH descriptors using
the classifier. Only those descriptors belonging to the A and C branches (diseases and
anatomical concepts) of the MeSH tree are selected, from which only the three most
relevant are added to the query.
   As these studies show, various authors have taken advantage of the MeSH
ontology in order to expand queries and improve information retrieval systems. In
doing so, they have utilized the different cross-reference systems provided by the
ontology (synonyms, entry terms, associative relationships among descriptors, …). In
this study we present a query expansion strategy that uses the MeSH Tree Structure.
Our proposal focuses on the choice of terms to be expanded and demonstrates that the
expansion is most efficient when the UMLS Metathesaurus is used, in controlled
fashion, for determining which terms are expanded.


2.1   Techniques based on MeSH Tree-structure

This strategy is based on the tree structure whereby MeSH organises its descriptors.
Figure 1 shows a short extract from the MeSH tree diagram in which it can be seen
that the descriptor Neoplasms by Site includes six more specific descriptors (children)
while the descriptor Breast Neoplasms includes only four.


                            Fig. 1. Excerpt from MeSH Tree


The expansion strategy developed in this section is governed by the following criteria
for expanding search terms:
     – If the search term is a MeSH descriptor and contains more specific
         descriptors, it is expanded using these.
     – If the search term is a MeSH descriptor but does not contain more specific
         descriptors, no expansion is performed.
4     Mariano Crespo, Jacinto Mata, Manuel J. Maña


      –    If the search term is not a MeSH descriptor, no expansion is performed.

In many cases a descriptor comprises more than one term, and performing the
expansion at the level of the term is not so efficient. For example, if the search for
Mitral Valve treats each term independently, neither the term Mitral nor the term
Valve corresponds to a descriptor. Nevertheless, the two terms in combination
correspond to the descriptor “Mitral Valve”, a biomedical concept.
    In order to discover the medical concepts within the queries, in this study we have
used the National Library of Medicine's MetaMap Transfer program (MMTx) using
the source vocabulary SNOMED-CT of the UMLS metathesaurus version 2011AA.
    The concepts labelled in this phase were mapped to the MeSH hierarchy in order to
perform an expansion of each one. Each labelled concept is sought within the MeSH
tree. If the concept is a descriptor, its children are retrieved and added to the query
according to the general scheme described above. In this approach, in addition to the
terms expanded via the MeSH tree, the UMLS concepts identified are added, as
illustrated in figure 2, which provides a schematic representation of the expansion
process for the query lymphoma MRI images.


                         Fig. 2. Example of query expansion process


2.2       Techniques based on MMTx

This strategy is based on MMTx. As in the previous section, we used the National
Library of Medicine’s MMTx to discover Metathesaurus concepts in figure captions.
MMTx employs a series of language-processing modules to map text to concepts in
                      LABERINTO at ImageCLEF 2012 Medical Image Retrieval Task        5


the UMLS Metathesaurus. Then we configured MMTx with the option MMI (-N) that
displays, in a separate section, a ranked list of all the mappings assigned to the text.
Additional data such as the PMID of the citation, CUIs, abbreviated Semantic Types
are also included. Finally, we selected those candidates which semantic types are:
Diagnostic Procedure (diap), Disease or Symdrome (dsyn), Body Part, Organ, or
Organ Component (bpoc), Neoplastic Process (neop), Injury or Poisoning (inpo),
Body Location or Region (blor), Pathologic Function (patf) or Cell (cell). In order to
elaborate this list, we did a study of the most repeated semantic types in the queries
for the last four years of ImageCLEF. We also asked an expert to make the final
selection. Figure 3 shows a schematic representation of the expansion process for the
query lymphoma MRI images.


                       Fig. 3. Example of query expansion process


2.3 Adding relevance levels to the query terms

The technique for adding a weight to each query term has been developed with the
boost factor given by Lucene. It provides the relevance level of matching documents
based on the terms found. To boost a term, the caret ("^") symbol with a boost factor
(a number) at the end of the term is used. The higher the boost factor, the more
relevant the term will be. By default, the boost factor is 1. For the experiments,
different boost factors values based on the Inverse Document Frequency (idf) were
tested. The idf is used to weigh the term information value in general, based on
frequency of use or appearance. It is averaged as shown in (1).
6    Mariano Crespo, Jacinto Mata, Manuel J. Maña


                                             N 
                      IDF (t , doc)  log 2                                       (1)
                                             DF 
    Where:
        t: Term.
        doc: document.
        N: Total number of documents in the collection.
        DF: Frequency of occurrence of the term (t) in the document (doc).

   To carry out the experiments, the idf for each query term was calculated
independently and then, it was multiplied by a factor α or β (between 0.1 and 1)
depending on original or expanded query terms. Finally, the relevance factor is
calculated as shown in (2).

                                              N 
                                      log 2     
             RF (t , doc)  ( |  )          DF                                  (2)
                                             N 
                                      log 2     
                                             DF 

    Where:
        t: Term.
        doc: document.
        α: Original terms factor.
        β: Expanded terms factor.
        N: Total number of documents in the collection.
        DF: Frequency of occurrence of the term (n) in the document (doc).

   The best results were obtained setting α = 1 and β = 0.1, i.e. when exists a greater
difference of the relevance factor between the original terms and the expanded terms,
but always adding greater weight to the original query terms.


3     Experiments and Results

This section details the experiments that were conducted to evaluate various
expansion strategies. For this aim, seven different runs were sent:

         Laberinto_BL: Original queries.
         Laberinto_BL_MSH: Queries expanded with techniques based on MeSH
            Tree-Structure.
         Laberinto_MSH_PESO_1: Relevance factor of original terms α = 1,
            relevance factor of expanded terms β = 0.1 and queries expanded with
            techniques based on MeSH Tree-Structure.
                       LABERINTO at ImageCLEF 2012 Medical Image Retrieval Task         7


         Laberinto_MSH_PESO_2: Relevance factor of original terms α = 2,
            relevance factor of expanded terms β = 0.1 and queries expanded with
            techniques based on MeSH Tree-Structure.
         Laberinto_MMTx_MSH: Queries expanded with mixed expansion
            strategies based on MeSH Tree-Structure and MMTx.
         Laberinto_MMTx_MSH_PESO_1: Relevance factor of original terms α =
            1, relevance factor of expanded terms β = 0.1 and queries expanded with
            mixed expansion strategies based on MeSH Tree-Structure and MMTx.
         Laberinto_MMTx_MSH_PESO_2: Relevance factor of original terms α =
            2, relevance factor of expanded terms β = 0.1 and queries expanded with
            mixed expansion strategies based on MeSH Tree-Structure and MMTx.

   In order to perform text indexing and run the different queries, Lucene search
engine was used with the default settings. Table 1 shows the results obtained with
each run.


Table 1. Results from LABERINTO research group in ImageCLEF 2012.


    Ranking                 Run           MAP      GM-MAP    Bpref     P10      P30
      10         Laberinto_MSH_PESO_2     0.1859    0.0537   0.1939   0.3318   0.1894
      18         Laberinto_MSH_PESO_1     0.1707    0.0512   0.1712   0.3318   0.1894
      20      Laberinto_MMTx_MSH_PESO_2   0.1680    0.0555   0.1711   0.3227   0.1909
      22      Laberinto_MMTx_MSH_PESO_1   0.1677    0.0554   0.1701   0.3182   0.1879
      24               Laberinto_BL       0.1658    0.0477   0.1667   0.3000   0.1939
      30            Laberinto_BL_MSH      0.1613    0.0462   0.1812   0.2682   0.1864
      41          Laberinto_MMTx_MSH      0.1361    0.0438   0.1570   0.2091   0.1758

   Looking at specific runs comparisons, we can further draw the following
conclusions. Adding a relevance factor to the query terms considerably improved the
results, especially when greater weights to the original terms than the expanded terms
were given. The best result among all our runs was Laberinto_MSH_PESO_2, which
reached a MAP value of 0.1859. With respect to the strategy that combines MMTx
with MeSH hierarchy, we can observe that the results are somewhat inferior, but both
techniques improve the results if adding a relevance factor to the query terms.


4     Conclusions and Future Work

The principal objective was to improve the effectiveness of image retrieval system
through textual content.
In the course of our experimentation we gained an understanding of the difficulties of
finding an appropriate strategy for performing query expansion.
   The results of our experiments showed that the expansion strategy employing
medical concepts alongside the MeSH hierarchy successfully improved the
effectiveness of the system. The results achieved with the mix MMTx and MeSH
hierarchy strategy are somewhat inferior. On the other hand, the results show that add
8    Mariano Crespo, Jacinto Mata, Manuel J. Maña


a relevance factor to the query terms considerably improved the results, especially if
we give a greater weight to the original terms than the expanded terms.
    In future studies we also intend to perform expansion on the medical concepts
occurring in the text used for constructing the index. We will also explore the new
expansion strategies both MeSH as UMLS and new techniques to assign a relevant
factor to the query terms. Finally, we also intend to dedicate future studies to
analyzing queries in detail so as to extract information from abbreviations, type of
image to search for [12] (eg, radiographs, tomographs) and so on. After all, the most
essential thing for an image retrieval system to work well is to know exactly what one
is searching for.


5    Acknowledgments

This work was partially funded by the Spanish Ministry of Science and Innovation,
the Spanish Government Plan E and the European Union through ERDF (TIN2009-
14057-C03-03).


References

1.  Kalpathy-Cramer, J., Müller, H., Bedrick, S., Eggel, I., Garcia Seco de Herrera, A. and
    Tsikrika, T. 2012. The CLEF 2012 medical image retrieval and classification tasks. CLEF
    2012 working notes, Rome, Italy.
2. Nelson, S.J., Schopen, M., Savage, A.G., Schulman, J.L. and Arluk, N. 2004. The MeSH
    translation maintenance system: structure, interface, design and implementation. M.
    Fieschi, et al. (Ed.). Proceedings of the 11th World Congress on Medical Informatics,
    pp.67–69.
3. Aronson AR. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus:
    The MetaMap program. Proc AMIA Symp 2001, pp.17–21.
4. SNOMED Clinical Terms. International Health Terminology Standards Development
    Organisation (IHTSDO). Available at: http://www.ihtsdo.org/snomed-ct/. Accessed: Aug
    16, 2012.
5. Bodenreider O. 2004. The Unified Medical Language System (UMLS): integrating
    biomedical terminology. Nucleic Acids Research, vol. 32, pp. 267–270.
6. Cutting D, Busch M, Cohen D, et al. Apache Lucene. 2008. Available at:
    http://lucene.apache.org/. Accessed: Aug 16, 2012.
7. Gobel G, Andreatta S, Masser J, et al. 2001. A MeSH based intelligent search
    intermediary for Consumer Health Information Systems. Int J Med Inform. vol. 64, pp.
    241–51.
8. Lu Z, Kim W, Wilbur W. 2009. Evaluation of query expansion using MeSH in PubMed.
    Information Retrieval, vol. 12, pp. 69-80.
9. Van Rijsbergen CJ. 1979. Information Retrieval. 2nd ed. Butterworths, London, UK.
10. Gobeill J, Theodoro D, Patsche E, et al. 2009. Taking benefit of query and document
    expansion using mesh descriptors in medical imageclef 2009. Working Notes of CLEF.
                       LABERINTO at ImageCLEF 2012 Medical Image Retrieval Task         9

11. Gobeill J, Ruch P, Zhou X. 2009. Query and Document Expansion with Medical Subject
    Headings Terms at Medical ImageCLEF 2008, CLEF 2008. LNCS, Springer. vol. 5706,
    pp.736–743.
12. Rahman M, Antani S, Fushman D, et al. 2012. Biomedical Image Retrieval Using
    Multimodal Context and Concept Feature Spaces, in: Medical Content-based Retrieval for
    Clinical Decision Support. LNCS. vol. 7075, pp. 24-35.