IRIT at ImageCLEF 2011: medical retrieval task

Duy Dinh, Lynda Tamine
IRIT laboratory - University of Toulouse, 118 route de Narbonne, 31062 Toulouse, France
{Duy.Dinh,Lynda.Tamine}@irit.fr

Abstract. In this paper, we report experiments conducted by the SIG team at the IRIT laboratory, University of Toulouse, within the context of the medical information retrieval (IR) task. As in our previous participation in ImageCLEF, our 2011 research focuses on the case-based retrieval task. We compare the performance of different state-of-the-art term weighting models for retrieving patient cases that might best suit the clinical information need. Furthermore, we combine the term scores obtained by two state-of-the-art weighting models using a particular data fusion technique. Finally, a state-of-the-art query expansion (QE) technique is used to improve biomedical IR performance.

Key words: Biomedical information retrieval, Term weighting models, Query expansion, Model fusion

1 Introduction

This paper describes the contribution of the SIG team (Generalized Information Systems) at the IRIT laboratory (Institute for Research in Informatics of Toulouse, http://www.irit.fr) in its second year of participation in the medical retrieval track. We focus in particular on the case-based retrieval task, where patient demographics, limited symptoms and test results are provided to answer the medical professionals' information need [1].

We first investigate the effectiveness of two state-of-the-art term weighting models that have been shown to work well in the past: LGD, a log-logistic model [2], and In_expB2, an Inverse Expected Document Frequency model with Bernoulli ratio normalisation [3]. These models are then combined using a particular data fusion technique [4]. Next, we experiment with a state-of-the-art pseudo (blind) feedback query expansion algorithm implemented in the DFR framework [3].

The rest of this paper is organized as follows: Section 2 describes our indexing and retrieval framework. Experimental results are presented and discussed in Section 3. We conclude the paper in Section 4 and outline some perspectives for future work.

2 Indexing and retrieval framework

The indexing process aims to organize, structure and store statistical and/or linguistic information about terms and documents in the collection, allowing rapid and efficient search. We use the Terrier IR platform for indexing documents [5]: stop-words are removed from documents and queries before stemming with the Porter algorithm [6].

The retrieval process matches the query and document representations in order to retrieve a list of results that may satisfy the user information need. In our work, a document D is scored by summing the weights of the terms of query Q that occur in D:

    RSV(D, Q) = \sum_{t \in Q} score(t \in D)    (1)

where score(t \in D) is the weight of query term t in document D, computed using a particular term weighting model.

In this section, we first describe the two term weighting models used in our experiments, namely LGD [7] and In_expB2 [3]. These models are fused to obtain a combined score for each query term appearing in documents. We then apply a state-of-the-art pseudo-relevance feedback technique in order to improve the information retrieval (IR) performance.
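As an illustration of equation (1), the following Python sketch computes a document's retrieval status value as the sum of the weights of the query terms it contains. The bag-of-words inputs and the names rsv and term_score are ours and only serve to make the scheme concrete; they do not correspond to Terrier's API.

from collections import Counter

def rsv(query, document, term_score):
    # Retrieval status value of a document for a query, equation (1):
    # sum the weights of the query terms that occur in the document.
    qtf = Counter(query)       # query term frequencies
    dtf = Counter(document)    # within-document term frequencies
    return sum(term_score(t, qtf[t], dtf[t]) for t in qtf if dtf[t] > 0)

In practice, term_score would also close over the collection statistics (document frequencies, collection size, average document length) required by the weighting models described in the next subsections.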
2.1 The LGD model

In the LGD model, query terms are weighted using the log-logistic distribution [7]. Formally:

    score_{LGD}(t \in D) = qtf \times [ \log_2(\frac{N_t}{N} + tfn) - \log_2(\frac{N_t}{N}) ]    (2)

where
– t is a query term occurring in document D,
– N_t is the document frequency of t (i.e., the number of documents containing term t),
– N is the total number of documents in the collection,
– qtf is the query term frequency,
– tfn is the normalised within-document term frequency, given by:

    tfn = tf \times \log_2(1 + c \times \frac{avg\_dl}{dl})    (3)

where avg_dl is the average document length (in tokens), dl is the document length (in tokens) and c is a multiplying factor or tuning parameter.

2.2 The In_expB2 model

In the In_expB2 model, query terms are weighted using the Inverse Expected Document Frequency model with Bernoulli after-effect and term frequency normalisation [3]. Formally:

    score_{In\_expB2}(t \in D) = \frac{qtf \times (tf + 1) \times tfn_2}{N_t \times (tfn_2 + 1) \times \ln 2} \times \log_2 \frac{N + 1}{N \times (1 - e^{-tf/N}) + 0.5}    (4)

where
– t is a query term occurring in document D,
– N_t is the document frequency,
– N is the total number of documents in the collection,
– qtf is the query term frequency,
– tf is the within-document term frequency,
– tfn_2 is the normalised within-document term frequency, given by:

    tfn_2 = \frac{tf}{\ln 2} \times \log_2(1 + c \times \frac{avg\_dl}{dl})    (5)

2.3 Model fusion

In the context of information retrieval, data fusion refers to the task of combining the output of multiple ranking strategies into a single list of objects (documents, concepts, etc.) [4, 8]. Objects in different lists can be merged using a variety of aggregate functions such as MIN, MAX, SUM, AVERAGE, MEDIAN, MNZ, etc. The combination technique based on each of these functions is referred to as CombXXX, where 'XXX' stands for the name of the aggregate function. For example, CombSUM applied to two ranking algorithms A1 and A2 produces a final score that is the sum of the score obtained by A1 and the score obtained by A2. CombSUM and CombMNZ have been widely studied and have demonstrated state-of-the-art performance [9]. Such techniques are important in distributed IR, where results obtained from several corpora or IR ranking strategies must be merged. The authors of [10] distinguish two classes of data fusion techniques: those with access to the query-document scores assigned by each ranking model, and those with access only to the rankings produced by each system.

A large number of IR models have been developed and integrated into the Terrier IR platform [5]. We study here the impact of a data fusion technique applied at the level of term weighting models. More specifically, we sum the term scores obtained by the two state-of-the-art term weighting models described earlier, i.e., LGD and In_expB2. Given a query term t, its combined score is computed as follows:

    score(t \in D) = score_{LGD}(t \in D) + score_{In\_expB2}(t \in D)
                   = qtf \times [ \log_2(\frac{N_t}{N} + tfn) - \log_2(\frac{N_t}{N}) ]
                     + \frac{qtf}{\ln 2} \times \frac{(tf + 1) \times tfn_2}{N_t \times (tfn_2 + 1)} \times \log_2 \frac{N + 1}{N \times (1 - e^{-tf/N}) + 0.5}    (6)
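The following Python sketch mirrors equations (2)-(6) for a single query term, assuming the required collection statistics (N, N_t, avg_dl) are available; the function names and the explicit parameter passing are ours and are meant as a reading aid, not as Terrier's implementation.

import math

LN2 = math.log(2)

def tfn(tf, dl, avg_dl, c=1.0):
    # normalised within-document term frequency, equation (3)
    return tf * math.log2(1.0 + c * avg_dl / dl)

def score_lgd(qtf, tf, dl, Nt, N, avg_dl, c=1.0):
    # log-logistic (LGD) weight of a query term, equation (2)
    lam = Nt / N
    return qtf * (math.log2(lam + tfn(tf, dl, avg_dl, c)) - math.log2(lam))

def score_in_expb2(qtf, tf, dl, Nt, N, avg_dl, c=1.0):
    # In_expB2 weight of a query term, equations (4) and (5)
    tfn2 = (tf / LN2) * math.log2(1.0 + c * avg_dl / dl)
    n_exp = N * (1.0 - math.exp(-tf / N))      # N(1 - e^{-tf/N}) in equation (4)
    gain = qtf * (tf + 1.0) * tfn2 / (Nt * (tfn2 + 1.0) * LN2)
    return gain * math.log2((N + 1.0) / (n_exp + 0.5))

def score_combsum(qtf, tf, dl, Nt, N, avg_dl, c=1.0):
    # term-level CombSUM fusion of the two models, equation (6)
    return (score_lgd(qtf, tf, dl, Nt, N, avg_dl, c)
            + score_in_expb2(qtf, tf, dl, Nt, N, avg_dl, c))

Any of these three functions can play the role of term_score in the earlier sketch; c = 1.0 corresponds to the default Terrier configuration used in our runs (see Section 3.3).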
2.4 Query expansion

The DFR framework employs a query expansion (QE) mechanism that is a generalisation of Rocchio's method [11]: terms occurring in the top-ranked documents retrieved in a first stage are weighted using a particular DFR term weighting model. In general, the weight of a term t of the expanded query q* derived from the original query q is obtained as follows:

    weight(t \in q^*) = qtfn + \beta \times \frac{Info_{DFR}}{MaxInfo}    (7)

where
– qtfn is the normalised within-query term frequency,
– MaxInfo = \max_{t \in q^*} Info_{DFR},
– \beta = 0.4,
– Info_{DFR} is the informative content of the term in the top-ranked documents, induced by a DFR model, that is:

    Info_{DFR} = -\log_2 Prob(Freq(w|K) \mid Freq(w|C))    (8)

where Prob is the probability of observing the term frequency Freq(w|K) in the top-ranked documents retrieved in the first stage, given the term frequency Freq(w|C) in the collection. In the DFR framework, several measures can be used to compute this probability, such as the Bose-Einstein (Bo) statistics and the Kullback-Leibler (KL) measure [3]. The former gives the following term frequency normalisation:

    Info_{Bo} = -\log_2 Prob(Freq(w|K) \mid Freq(w|C))
              = -\log_2(\frac{1}{1 + \lambda}) - Freq(w|K) \times \log_2(\frac{\lambda}{1 + \lambda})    (9)

where
– Freq(w|K) (resp. Freq(w|C)) is the term frequency within the top-ranked documents (resp. within the collection),
– \lambda_{Bo1} = \frac{Freq(w|C)}{N} and \lambda_{Bo2} = \frac{TotalFreq(K) \times Freq(w|C)}{TotalFreq(C)},

while the latter gives the following term frequency normalisation:

    Info_{KL} = \frac{Freq(w|K)}{TotalFreq(K)} \times \log_2 \frac{Freq(w|K) \times TotalFreq(C)}{Freq(w|C) \times TotalFreq(K)}    (10)
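As a reading aid, the sketch below instantiates the expansion step with the KL measure (equations (7) and (10)), assuming the pooled term frequencies of the top-ranked documents (K) and of the collection (C) have already been gathered; the data structures, the candidate selection and the function names are ours, not Terrier's.

import math

def info_kl(freq_w_K, total_freq_K, freq_w_C, total_freq_C):
    # informative content of a candidate expansion term w, equation (10)
    p_K = freq_w_K / total_freq_K
    p_C = freq_w_C / total_freq_C
    return p_K * math.log2(p_K / p_C)

def expand_query(qtfn, top_docs_tf, collection_tf, total_freq_C,
                 n_terms=20, beta=0.4):
    # Expanded query weights, equation (7). qtfn maps original query terms to
    # their normalised frequencies; top_docs_tf / collection_tf map terms to
    # their frequencies in K and C (every term of K is assumed to appear in
    # collection_tf).
    total_freq_K = sum(top_docs_tf.values())
    info = {w: info_kl(tf, total_freq_K, collection_tf[w], total_freq_C)
            for w, tf in top_docs_tf.items()}
    # keep the n_terms most informative candidate terms
    candidates = dict(sorted(info.items(), key=lambda kv: kv[1],
                             reverse=True)[:n_terms])
    max_info = max(candidates.values())
    expanded = dict(qtfn)
    for w, i in candidates.items():
        expanded[w] = expanded.get(w, 0.0) + beta * i / max_info
    return expanded

The expanded query is then used for a second retrieval pass with the weighting models of Sections 2.1-2.3; in our runs, at most twenty terms are extracted from the top twenty returned documents (Section 3.3).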
3 Experimental evaluation

3.1 Collection statistics

Some statistical characteristics of the Case-based 2011 collection are given in Table 1. The histogram in Figure 1 shows the variation of document length across the collection.

Table 1. Test collection statistics
  Number of documents       55,634
  Average document length   3,078
  Total number of tokens    171,251,809
  Size of the vocabulary    815,708
  Number of queries         10
  Average query length      30

Fig. 1. Variation of document length in the collection

3.2 Evaluation metrics

Retrieval performance was evaluated using standard measures: P@10, P@20 and MAP. P@10 and P@20 represent the mean precision at the top 10 and top 20 returned documents, respectively. MAP (Mean Average Precision) is the mean over all queries of the average precision, which is computed by averaging the precision values obtained at the rank x ∈ (1..K) of each relevant retrieved document, where K = 1000 is the number of retrieved documents. Our results are generated with the standard trec_eval tool (http://trec.nist.gov/trec_eval/) used by the TREC community for evaluating ad hoc retrieval runs.
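For completeness, the following sketch shows how the per-query measures described above can be computed from a ranked result list and a set of relevance judgements; it is only a simplified stand-in for trec_eval, which we actually used, and the function names are ours.

def precision_at(ranking, relevant, k):
    # P@k: fraction of the top-k retrieved documents that are relevant
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant, k=1000):
    # Average precision over the top-k retrieved documents (K = 1000 here);
    # following the trec_eval convention, the sum is divided by the total
    # number of relevant documents for the query.
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking[:k], start=1):
        if d in relevant:
            hits += 1
            total += hits / rank   # precision at the rank of this relevant document
    return total / len(relevant) if relevant else 0.0

MAP is then the mean of average_precision over the ten queries, and P@10 and P@20 correspond to precision_at with k = 10 and k = 20.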
3.3 Run description

We submitted ten official runs to the case-based medical retrieval track [1]. The runs are divided into two groups: the first one (6 runs) keeps terms with a low inverse document frequency (IDF) in the document index, while the second one (4 runs) excludes them. In Table 2, runs without query expansion are marked with an asterisk (*). For the first group of runs, we used the two state-of-the-art weighting models, namely In_expB2 [3] (run 1) and LGD [7] (run 2). The CombSUM technique [4] is used in run 3 at the level of term scoring instead of document re-ranking, i.e., the fusion directly modifies the scores produced by the retrieval models rather than the final scores of the output documents. Runs 4, 5 and 6 add blind-feedback query expansion based on the Kullback-Leibler (KL) statistics. Runs in the second group (7, 8, 9, 10) are submitted with the exclusion of low-IDF terms, using each of the two state-of-the-art weighting models, the CombSUM fusion technique [4] and the KL QE technique. For QE, a maximum of twenty terms is extracted from the top twenty returned documents. All runs are submitted with the default configuration in Terrier: c = 1.0, stop-word removal and Porter stemming.

3.4 Results and discussion

According to the results presented in Table 2, ignoring low-IDF terms does not help and even harms the IR performance. Normally, low-IDF terms are not useful for describing the semantics of a document and can be dropped from the document index [12, 13]. However, in the biomedical domain, and especially in medical records retrieval, low-IDF terms may be needed to express or distinguish medical concepts such as 'low', 'high', 'right', 'left' (e.g., low back pain vs. high back pain, right lung vs. left lung), etc.

Table 2. Official results of submitted runs in the case-based retrieval task. Runs with an asterisk are without pseudo-relevance feedback.

  ID  Run                                                    MAP     P@10    P@20
  Includes terms with low IDF
  1   IRIT_In_expB2c1.0_1*                                   0.0743  0.1111  0.1000
  2   IRIT_LGDc1.0* (baseline)                               0.0872  0.1111  0.1222
  3   IRIT_CombSUMc1.0_3*                                    0.0859  0.1444  0.1000
  4   IRIT_CombSUMc1.0_KLbfree_d_20_t_20_2                   0.0874  0.1111  0.1000
  5   IRIT_LGDc1.0_KLbfree_d_20_t_20_1                       0.1030  0.1556  0.1278
  6   IRIT_In_expB2c1.0_KLbfree_d_20_t_20_0                  0.0772  0.1000  0.1000
  Ignores terms with low IDF in the index
  7   IRIT_CombSUMc1.0_KLbfree_d_20_t_20_2                   0.0874  0.1111  0.1000
  8   IRIT_In_expB2c1.0_KLbfree_d_20_t_20_0_ignore_low_idf   0.0793  0.1444  0.0889
  9   IRIT_LGDc1.0_KLbfree_d_20_t_20_1_ignore_low_idf        0.0937  0.1111  0.0889
  10  IRIT_CombSUMc1.0_2_ignore_low_idf*                     0.0721  0.1333  0.0778

We first compare the performance of the two state-of-the-art models, LGD and In_expB2. The LGD model is better than the In_expB2 model in terms of MAP, with an improvement of +17.36%. In terms of P@10, the two models yield the same performance, but in terms of P@20 the former is better than the latter, with an improvement of +22.2%. For this reason, we chose the IRIT_LGDc1.0 run as our strong baseline for comparison with the other runs. The CombSUM method combining the term scores of those models outperforms the In_expB2 model and is similar to the LGD model (baseline) in terms of MAP. In terms of P@10, the CombSUM fusion technique outperforms the baseline with an improvement of +29.97%. However, in terms of P@20, the CombSUM technique gives the same performance as In_expB2, which is lower than the baseline. We conclude that combining term scores at the level of weighting models can be useful for improving the early search precision (P@10) without losing MAP performance.

We now turn to the runs that use query expansion. The In_expB2 model combined with the KL QE method shows a small improvement in terms of MAP (+3.90%), a decrease in terms of P@10 (-10.00%) and no effect in terms of P@20 compared to the In_expB2 model without QE. This is probably because clinical queries are long (about 30 terms on average) while at most 20 expansion terms are extracted; the extracted terms may therefore already appear in the original query or, when they are new, they may change the top-ranked documents only slightly. The LGD model combined with the KL QE method (run IRIT_LGDc1.0_KLbfree_d_20_t_20_1) outperforms the baseline with improvements of +17.85% in terms of MAP, +40.05% in terms of P@10 and +4.58% in terms of P@20. This suggests that query expansion is only effective when it is based on an effective underlying ranking model. Indeed, the In_expB2 model combined with the KL QE method performs worse than the baseline. This also explains why the CombSUM method combined with KL QE gives no improvement over the baseline.

4 Conclusion

In this work, we have compared and evaluated the IR performance of two state-of-the-art term weighting models and a state-of-the-art query expansion approach. In our empirical studies, we proposed to combine the term scores obtained by different term weighting models in order to improve retrieval performance, especially the early search precision.

Within the case-based retrieval task, we noticed that low-IDF terms are also useful for indexing and retrieval because they can be needed to express or distinguish medical concepts. The LGD model proposed in [2] shows the best performance on the case-based retrieval task and consistently outperforms the In_expB2 model, with or without query expansion. The combination of the LGD model, which is based on the log-logistic distribution, and the KL query expansion method gives the best results. We conclude that an effective ranking model and an appropriate query expansion strategy can be combined to improve IR performance.

Since documents in the case-based collection contain many medical concepts, in our future work we aim to extract concepts from documents to better represent the documents' semantics. In addition, we will also focus on adjusting the query, either by expanding it with related terms denoting concepts from ontologies or by removing non-informative terms.

References

1. Kalpathy-Cramer, J., Müller, H., Bedrick, S., Eggel, I., de Herrera, A.G.S., Tsikrika, T.: The CLEF 2011 medical image retrieval and classification tasks. In: CLEF 2011 Working Notes, Amsterdam, The Netherlands, Springer (September 2011)
2. Clinchant, S., Gaussier, É.: Information-based models for ad hoc IR. In: SIGIR 2010. (2010) 234–241
3. Amati, G.: Probabilistic models for information retrieval based on divergence from randomness. PhD thesis, University of Glasgow (2003)
4. Fox, E.A., Shaw, J.A.: Combination of multiple searches. In: TREC 1994. (1994) 243–252
5. Ounis, I., Lioma, C., Macdonald, C., Plachouras, V.: Research directions in Terrier. Novatica/UPGRADE Special Issue on Web Information Access, Ricardo Baeza-Yates et al. (Eds.), invited paper (2007)
6. Porter, M.F.: An algorithm for suffix stripping. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1997) 313–316
7. Clinchant, S., Gaussier, É.: Retrieval constraints and word frequency distributions: a log-logistic model for IR. Information Retrieval 14(1) (2011) 5–25
8. Dinh, D., Tamine, L.: Voting techniques for a multi-terminology based biomedical information retrieval. In: Conference on Artificial Intelligence in Medicine (AIME), Bled, Slovenia, 02/07/2011-06/07/2011. Volume 6747 of LNAI, Springer (2011) 184–193
9. Lee, J.H.: Analyses of multiple evidence combination. In: Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '97, New York, NY, USA, ACM (1997) 267–276
10. Efron, M.: Generative model-based metasearch for data fusion in information retrieval. In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries. JCDL '09, New York, NY, USA, ACM (2009) 153–162
11. Rocchio, J.: Relevance feedback in information retrieval. In: The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall (1971) 313–323
12. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. 1st edn. Addison-Wesley (May 1999)
13. Croft, W.B., Metzler, D., Strohman, T.: Search Engines: Information Retrieval in Practice. Pearson Education (2009)