York University at CLEF eHealth 2014: A Learning-to-Rank Approach for Medical Document Retrieval

Jiajin Wu and Jimmy Huang
Information Retrieval and Knowledge Management Research Lab, School of Information Technology, York University
wujiajin.justin@gmail.com, jhuang@yorku.ca

Abstract. We used learning-to-rank methods to train ranking models. Because of the limited number of training queries, we split them and conducted 5-fold cross-validation, setting the proportion of training to testing queries at 4:6. As features for learning the models, a total of 231 features derived from multiple information retrieval models with different parameter settings were adopted. For the baseline run, we used the Random Forests method to train the models with 5-fold cross-validation; only binary relevance information was taken into account during training. The five models trained in the 5-fold cross-validation were applied to the testing data to predict scores, and the results given by the different models were then linearly combined with equal weights. For run #5, we used eight learning-to-rank methods to train models separately and linearly combined them, again using binary relevance judgments. For runs #6 and #7, graded relevance was taken into consideration. The difference between runs #6 and #7 is that run #6 used multiple learning-to-rank methods, while run #7 used only the Random Forests method. The best result among the four runs was achieved by run #5, which combined multiple models trained on binary relevance judgments.

1 Introduction

These working notes present the experimental methods used by YorkU in CLEF eHealth 2014 Task 3a [7], which consists of retrieving relevant medical documents for user queries. Five training queries and fifty testing queries were provided in the task. The goal of the task is to retrieve relevant documents for the user queries from approximately one million medical documents. For more details about this task and related tasks, please refer to [10]. Our main objective in performing this task is to provide a solution that requires no manual tuning of parameters. Secondly, we want to test the performance of learning-to-rank [11] methods in medical document retrieval.

To achieve the main goal, we used supervised learning-to-rank methods based on the five provided training queries to train the models. Because of the limited size of the training dataset, we used various strategies to combine the trained models and evaluated them on the testing set in order to obtain balanced results.

2 Learning-to-Rank

Learning-to-rank is a relatively new type of method in information retrieval (IR) that has emerged over the past decade. Unlike traditional ranking models in IR, learning-to-rank adopts machine learning approaches to solve the ranking problem. Like other machine learning methods, learning-to-rank methods are based on features and in most cases are supervised, which means labeled training data is required. One advantage of this type of method is that it avoids the manual parameter tuning that is usually time-consuming and tedious for traditional IR models. In previous studies of medical IR, traditional IR models have been used extensively [4], [9], but learning-to-rank has rarely been studied.

In this work, we used an in-house IR platform to perform a first-pass retrieval for the training dataset. Multiple retrieval models with different parameter settings were used to retrieve relevant documents for the training queries.
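The paper does not list the individual retrieval models behind the 231 features, so the following is only an illustrative sketch of the general idea: one weighting scheme (here BM25, chosen by us) evaluated under a grid of parameter settings, where each setting contributes one feature dimension for a query-document pair. The parameter grid, function names, and toy numbers are our assumptions, not the authors' actual feature set.

import math
from itertools import product

# Hypothetical parameter grid; each (k1, b) pair acts as a distinct
# "retrieval model" whose score becomes one feature dimension.
K1_VALUES = [0.8, 1.2, 1.6, 2.0]
B_VALUES = [0.25, 0.5, 0.75]

def bm25_score(tf, df, doc_len, avg_doc_len, num_docs, k1, b):
    """BM25 weight contributed by one query term occurring tf times in the document."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
    return idf * norm

def feature_vector(term_stats, doc_len, avg_doc_len, num_docs):
    """Build one feature vector for a query-document pair.

    term_stats: list of (tf, df) pairs, one per query term found in the document.
    Each (k1, b) setting is treated as a separate retrieval model and yields one feature.
    """
    return [
        sum(bm25_score(tf, df, doc_len, avg_doc_len, num_docs, k1, b)
            for tf, df in term_stats)
        for k1, b in product(K1_VALUES, B_VALUES)
    ]

# Toy usage: two query terms, a 300-word document, a one-million-document collection.
print(feature_vector([(3, 1200), (1, 45000)], doc_len=300, avg_doc_len=450, num_docs=1_000_000))

In the actual system, the scores produced by each retrieval configuration for a query-document pair would be concatenated in this way into the 231-dimensional feature vector used for training.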
Based on the qrels (relevance judgment) information for the five training queries provided in the task dataset, and on the results of the first-pass retrieval, the candidate training documents were selected. Only those documents appearing in the qrels and having more than m non-zero scores among the n retrieval results were selected. Here n stands for the total number of retrieval models, counting the same retrieval model with different parameter settings as different models; in this work n is 231, and m was chosen as 180.

Table 1. Training dataset

  Query   # relevant   # irrelevant
  1       23           18
  2       25           17
  3       37           13
  4       31           10
  5       18           22
  total   134          80

Table 1 lists the numbers of relevant/irrelevant documents provided by the dataset. As shown in this table, the documents available for training are quite limited.

3 Evaluation

3.1 Dataset

We only participated in Task 3a, which is a standard TREC-style IR task using (a) the 2012 crawl of approximately one million medical documents made available in plain text form by the EU-FP7 Khresmoi project (http://www.khresmoi.eu/), which was used in CLEF eHealth 2013's Task 3, and (b) a new 2014 set of English general public queries that individuals may realistically pose based on the content of their discharge summaries. This collection contains documents covering a broad set of medical topics and does not contain any patient information. The documents in the collection come from several online sources, including websites certified by the Health On the Net organization, as well as well-known medical sites and databases (e.g. Genetics Home Reference, ClinicalTrial.gov, Diagnosia). Queries are generated from the discharge summaries used in Task 2.

3.2 Metric

Evaluation will focus on P@5, P@10, NDCG@5 and NDCG@10, but other suitable IR evaluation measures will also be computed for the submitted runs (e.g. MAP). P@N indicates the percentage of relevant documents within the top N results. NDCG [8] stands for normalized discounted cumulative gain, another common metric for evaluating models in information retrieval.

4 Baseline Run

For the baseline run, only the title and description in the query can be used, and no external resource (including discharge summaries, corpora, ontologies, etc.) can be used. To keep it simple, we used a single learning-to-rank model trained on the binary relevance judgments. There are plenty of learning-to-rank methods in the literature; we chose to use RankLib (http://people.cs.umass.edu/~vdang/ranklib.html), an open-source learning-to-rank package which implements eight popular algorithms: MART [6], RankNet [2], RankBoost [5], AdaRank [14], Coordinate Ascent [12], LambdaMART [13], ListNet [3] and Random Forests [1]. The question is then which algorithm should be chosen as the baseline method. To this end, we conducted five-fold cross-validation using all eight algorithms on the training dataset. Table 2 lists the cross-validation setting, where the numbers represent the ids of the training queries.

Table 2. Five-fold cross validation

  Train   Test
  1,2     3,4,5
  2,3     4,5,1
  3,4     5,1,2
  4,5     1,2,3
  5,1     2,3,4

The model used as the baseline method is Random Forests, which achieved the best average result over the five folds in terms of precision at 10.
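The folds in Table 2 are simple rotations over the five training query ids (two for training, three for testing). The short sketch below reproduces that rotation; it is illustrative only, and the variable names and fold-generation code are ours rather than the authors'.

QUERY_IDS = [1, 2, 3, 4, 5]  # ids of the five training queries

def make_folds(query_ids, n_train=2):
    """Enumerate the rotated splits of Table 2: each fold trains on n_train
    consecutive query ids (wrapping around) and tests on the remaining ones."""
    k = len(query_ids)
    folds = []
    for start in range(k):
        rotated = [query_ids[(start + i) % k] for i in range(k)]
        folds.append((rotated[:n_train], rotated[n_train:]))
    return folds

for train, test in make_folds(QUERY_IDS):
    print("Train:", train, "Test:", test)
# Train: [1, 2] Test: [3, 4, 5]
# Train: [2, 3] Test: [4, 5, 1]
# ... and so on, matching Table 2.

In the actual runs, the feature vectors of each fold's training queries would then be written out in RankLib's LETOR-style input format and used to train one model per algorithm; those details are omitted here.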
5 Other Runs

Our run #5, the multiple models binary relevance (MMBR) run, trained models using all eight methods with the same five-fold cross-validation setting as the baseline run, using the binary relevance judgments of the training data. The final result on the testing set is obtained by combining the outputs of all models linearly with equal weights. Run #6, the multiple models graded relevance (MMGR) run, trained models in the same way as MMBR, differing only in that it used the graded relevance judgments of the training data. Run #7, the single model graded relevance run, trained a model using Random Forests with the same five-fold cross-validation.

For MMBR and MMGR, a combination of multiple models is required. Since the ranges of the scores given by different models vary, the combination was performed on normalized scores. We rescaled the ranking scores to the range [0, 1] using Formula 1:

    X' = (X - X_min) / (X_max - X_min)    (1)

where X' is the normalized score, X the original score, and X_min and X_max the minimum and maximum of the scores given by the model for that particular query.
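Formula 1 and the equal-weight fusion can be written down directly. The sketch below is a minimal illustration under our own assumptions about data layout (a dictionary mapping model names to per-document score lists for one query); it is not the authors' code, and the model names and toy scores are hypothetical.

def min_max_normalize(scores):
    """Rescale one model's scores for a single query to [0, 1] (Formula 1).
    If all scores are equal, return 0.0 for every document."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def combine_equal_weights(per_model_scores):
    """per_model_scores: {model_name: [score for each candidate document]}.
    Normalize each model's scores for the query, then average across models."""
    normalized = [min_max_normalize(scores) for scores in per_model_scores.values()]
    n_models = len(normalized)
    return [sum(column) / n_models for column in zip(*normalized)]

# Toy example: three candidate documents scored by two models for one query.
scores = {
    "random_forests": [2.3, 0.7, 1.5],
    "lambdamart":     [0.10, 0.02, 0.08],
}
print(combine_equal_weights(scores))  # fused scores used for the final ranking

Normalizing per query before averaging ensures that a model with a large raw score range does not dominate the equal-weight combination.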
The performance comparison of the four submitted runs is shown in Figure 1.

Fig. 1. Comparison of average precision, precision at 10 and normalized discounted cumulative gain at 10 for the submitted runs.

6 Discussion

As seen in Figure 1, the best result was achieved by run #5. Note that the difference between the baseline run and run #5 is that the baseline run is based on a single model, while run #5 uses a combination of multiple models. This shows that the single model, even though it achieved the best result on the training dataset, is not better on the testing dataset than the equal-weight linear combination of multiple models.

The baseline run and run #5 are noticeably better than the other two runs. The main difference is that the baseline run and run #5 were trained using binary relevance judgments, while runs #6 and #7 were trained using graded relevance judgments. This somewhat surprised us: graded relevance provides the learning-to-rank model with more information about the ranking of documents, so it should naturally result in a better ranking model, yet the result is contrary to this intuition. We attribute this to the additional relevance levels confusing the learning models, given the shortage of training queries, rather than benefiting model learning.

The comparison of our best result (run #5) with the median results of all submitted runs is shown in Figure 2. It shows that, on approximately half of the queries, our best result is comparable to the other systems.

Fig. 2. Per-topic comparison between Run #5 and the other systems.

7 Conclusion

In this paper, we describe our methods for medical document retrieval in Task 3 of CLEF eHealth 2014. Based on supervised learning-to-rank methods, we developed four strategies for our experiments. The combination of multiple models using binary relevance judgments is preferable to the others. In the future, we plan to further research learning-to-rank in medical document retrieval, for example, 1) how domain-specific features could benefit model training, and 2) how unlabeled data could assist in building the ranking model.

Acknowledgments

This research is supported by a research grant from the Natural Sciences & Engineering Research Council (NSERC) of Canada and the Early Researcher Award/Premier's Research Excellence Award.

References

1. L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
2. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 89-96, New York, NY, USA, 2005. ACM.
3. Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 129-136, New York, NY, USA, 2007. ACM.
4. M. Daoud, D. Kasperowicz, J. Miao, and J. Huang. York University at TREC 2011: Medical Records Track. In TREC, 2011.
5. Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res., 4:933-969, Dec. 2003.
6. J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189-1232, 2001.
7. L. Goeuriot, L. Kelly, W. Li, J. Palotti, P. Pecina, G. Zuccon, A. Hanbury, G. Jones, and H. Mueller. ShARe/CLEF eHealth Evaluation Lab 2014, Task 3: User-centred health information retrieval. In Proceedings of CLEF 2014, 2014.
8. K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422-446, Oct. 2002.
9. D. Kasperowicz and J. Huang. Semantic matching models for medical information retrieval: A case study. In Proceedings of the 2012 Advances in Health Informatics Conference (AHIC 2012), 2012.
10. L. Kelly, L. Goeuriot, H. Suominen, T. Schreck, G. Leroy, D. L. Mowery, S. Velupillai, W. W. Chapman, D. Martinez, G. Zuccon, and J. Palotti. Overview of the ShARe/CLEF eHealth Evaluation Lab 2014. In Proceedings of CLEF 2014, Lecture Notes in Computer Science (LNCS). Springer, 2014.
11. T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225-331, 2009.
12. D. Metzler and W. B. Croft. Linear feature-based models for information retrieval. Information Retrieval, 10(3):257-274, 2007.
13. Q. Wu, C. J. Burges, K. M. Svore, and J. Gao. Adapting boosting for information retrieval measures. Inf. Retr., 13(3):254-270, June 2010.
14. J. Xu and H. Li. AdaRank: A boosting algorithm for information retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '07, pages 391-398, New York, NY, USA, 2007. ACM.