Learning to rank for Consumer Health Search
Hua Yang1,2 , Xiaoming Liu1 , Binbin Zheng1 and Guan Yang1
1 School of Computer Science, Zhongyuan University of Technology, Zhengzhou, China
2 Department of Informatics, University of Évora, Portugal


Abstract
The CLEF 2021 eHealth Consumer Health Search task aims to investigate the effectiveness of information
retrieval systems in providing health information to common health consumers. Compared to previous
years, this year's task includes three sub-tasks and adopts a new data corpus and a new set of queries.
This paper presents the work of the Zhongyuan University of Technology team participating in Subtask 1.
It explores the use of learning to rank techniques in consumer health search. A number of retrieval
features are used, and eight different learning to rank algorithms are then applied to train ranking
models. The four best models are used to re-rank the retrieved documents, and four runs are submitted
to the subtask.

Keywords
consumer health, information retrieval, learning to rank




1. Introduction
The CLEF 2021 eHealth Consumer Health Search (CHS) task is a continuation of the previous CLEF
eHealth information retrieval (IR) tasks that started in 2013 [1, 2, 3]. The consumer health search
task follows a standard IR shared challenge paradigm in that it provides a test collection consisting
of a set of documents and a set of topics. Participants must retrieve web pages that fulfill a given
patient's personalized information need; the retrieved information should additionally satisfy
criteria of credibility, quality, and suitability. The 2021 eHealth IR task includes three sub-tasks:
ad-hoc information retrieval, weakly supervised information retrieval, and document credibility
prediction [4].
   This paper describes the Zhongyuan University of Technology (ZUT) approach to CLEF 2021
eHealth IR task Subtask 1. Subtask 1 is centered on realistic use cases and evaluates the ability of
IR systems to provide users with relevant, understandable, and credible documents. In this paper,
we mainly aim to investigate how a model learned on data from the previous CLEF eHealth IR task [5]
performs on this year's new data collection and new set of queries.


2. Methods
In the information retrieval area, machine learning techniques can be applied to build ranking
models for information retrieval systems; this is known as Learning to Rank (LTR) [6].
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" huayangchn@gmail.com (H. Yang); ming616@zut.edu.cn (X. Liu); luckzbb@hotmail.com (B. Zheng);
yangguan@zut.edu.cn (G. Yang)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Table 1
Learning to rank algorithms classification.
                   Approach     Examples
                   Pointwise    MART, Random Forest, PRank, McRank
                   Pairwise     RankNet, RankBoost, RankSVM, LambdaMART
                   Listwise     LambdaRank, AdaRank, ListNet, Coordinate Ascent


   Typically, the training data consists of three elements: the training queries Q, the associated
documents D, and the corresponding relevance judgments (the gold-standard qrels) for the
query-document pairs. A learning algorithm is then used to generate a learning to rank model. The
testing data for evaluation is created in much the same way as the training data and includes the
testing queries and their associated documents. For these testing queries, the learning to rank
model is used jointly with a retrieval model to sort the documents according to their relevance to
the query and to return a ranked list of documents as the response to the query.
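   As an illustration, learning to rank toolkits such as RankLib expect the training data described
above in the LETOR text format, with one query-document pair per line. The following minimal sketch
(the variable names are illustrative, not taken from our actual pipeline) shows how qrel labels and
per-pair feature vectors can be combined into such a file:

    # LETOR line format: <label> qid:<query id> 1:<feature 1> 2:<feature 2> ... # <document id>
    def write_letor_file(path, qrels, features):
        """qrels: {(qid, docno): relevance label}; features: {(qid, docno): [f1, f2, ...]}."""
        with open(path, "w") as out:
            for (qid, docno), feats in features.items():
                label = qrels.get((qid, docno), 0)          # unjudged documents default to label 0
                feat_str = " ".join(f"{i + 1}:{v}" for i, v in enumerate(feats))
                out.write(f"{label} qid:{qid} {feat_str} # {docno}\n")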
   Learning to rank methods have been proposed based on different machine learning algorithms.
Typically, existing learning to rank methods can be categorized into three main groups: pointwise,
pairwise, and listwise approaches. Pointwise approaches, for example MART [7] and Random
Forests [8], regard the relevance degrees as numerical or ordinal scores, so the learning to rank
problem is formulated as a regression or classification problem. Pairwise approaches, for example
RankBoost [9], LambdaMART [10], and RankNet [11], deal with the ranking problem by treating
document pairs as training instances and train models by minimizing a pairwise loss. Listwise
approaches, for example ListNet [12] and AdaRank [13], regard the entire set of documents associated
with a query as a single training instance and train a ranking function by minimizing a listwise
loss function. Table 1 summarizes a number of widely used algorithms for each LTR approach.
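   As an illustration of a pairwise formulation, RankNet [11] models the probability that document
i should be ranked above document j from the difference of their scores and minimizes a cross-entropy
loss; one standard way to write this is

    P_{ij} = \frac{1}{1 + e^{-\sigma (s_i - s_j)}}, \qquad
    C_{ij} = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log (1 - P_{ij}),

where s_i and s_j are the model scores of the two documents, \bar{P}_{ij} is the target probability
derived from the relevance labels, and \sigma is a scaling constant.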
   In this paper, the dataset and the assessment results from the 2018 CLEF eHealth IR task are
used for training the learning to rank models. A number of retrieval features are explored.

2.1. Features Explored for Learning to Rank
In this work, only commonly used information retrieval features are used to train the learning
to rank models. They are extracted using a group of 22 different retrieval models [14, 15], as
presented in Table 2.
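   In our runs these feature scores are produced with the Terrier platform (Section 3.1). As a
minimal sketch, assuming instead the PyTerrier Python bindings and a pre-built Terrier index at
./index (both assumptions, not part of our submitted pipeline), scores from several of the weighting
models in Table 2 can be collected as per-document features:

    import pyterrier as pt

    if not pt.started():
        pt.init()

    index = pt.IndexFactory.of("./index")        # assumed pre-built Terrier index

    # First-stage retrieval with BM25; the extra weighting-model scores become LTR features.
    feature_retriever = pt.FeaturesBatchRetrieve(
        index,
        wmodel="BM25",
        features=["WMODEL:PL2", "WMODEL:TF_IDF", "WMODEL:DirichletLM", "WMODEL:Hiemstra_LM"],
    )

    # One row per (query, document) pair, with the feature values in a 'features' column.
    results = feature_retriever.search("multiple sclerosis fatigue treatment")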

2.2. Training Learning to Rank Models
We build models using eight state-of-the-art learning to rank methods, including two pointwise
algorithms, two pairwise algorithms, and four listwise algorithms. The pointwise algorithms are
MART [7], which uses gradient boosted regression trees, and Random Forests [8], which uses
regression. The pairwise algorithms are RankNet [11], which employs relative entropy as the loss
function and gradient descent to train a neural network model, and RankBoost [9], which is based
on boosting. The listwise algorithms include AdaRank [13], based on boosting, Coordinate
Table 2
Features used for learning to rank models.
                              No.   Retrieval model used for feature extraction
                              1     BB2
                              2     BM25
                              3     DFI0
                              4     DFR_BM25
                              5     DLH
                              6     DLH13
                              7     DPH
                              8     DFRee
                              9     Hiemstra_LM
                              10    DirichletLM
                              11    IFB2
                              12    In_expB2
                              13    In_expC2
                              14    InL2
                              15    LemurTF_IDF
                              16    LGD
                              17    PL2
                              18    TF_IDF
                              19    DFRWeightingModel
                              20    PL2
                              21    Tf
                              22    Dl


Ascent [16], where the ranking scores are calculated as weighted combinations of the feature
values, LambdaMART [10], which combines MART and LambdaRank and directly optimizes NDCG during
training, and ListNet [12], which is based on neural networks.
   The dataset and the topical relevance assessments of the 2018 CLEF eHealth IR task [5] are
used as the training data. In the assessment files, the documents are scored with 0, 1, or 2,
representing not relevant, relevant, and highly relevant, respectively.
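   As an illustrative sketch of this training step, the RankLib 2.x command line can be driven from
Python roughly as follows; the ranker IDs and flags follow the RankLib documentation and should be
confirmed against the local installation, and the file names are placeholders:

    import subprocess

    RANKERS = {
        "MART": 0, "RankNet": 1, "RankBoost": 2, "AdaRank": 3,
        "CoordinateAscent": 4, "LambdaMART": 6, "ListNet": 7, "RandomForests": 8,
    }

    for name, ranker_id in RANKERS.items():
        subprocess.run([
            "java", "-jar", "RankLib-2.15.jar",
            "-train", "train_2018.txt",      # LETOR-format file built from the 2018 data
            "-ranker", str(ranker_id),
            "-metric2t", "NDCG@10",          # metric optimized during training
            "-save", f"model_{name}.txt",
        ], check=True)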


3. Experiments and Results
This section first presents the experimental settings, the dataset and queries for the subtask,
and the evaluation measures used for the assessments. Then we describe the experiments we
performed and analyze the results.

3.1. Experimental Settings
Terrier1 platform version 5.4 is chosen as the IR platform of the system. The Okapi BM25 weighting
model is used as the retrieval model, with all parameters set to their default values (k1 = 1.2,
k3 = 8, b = 0.75). All developed learning to rank models are implemented with RankLib2 version 2.15.

    1
        http://terrier.org/

Figure 1: Example topics in the CLEF 2021 CHS Subtask 1.
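   For completeness, one common formulation of the Okapi BM25 scoring function, showing where the
parameters k1, k3, and b enter (Terrier's implementation may differ in minor details), is

    \mathrm{score}(d, q) = \sum_{t \in q}
        \log \frac{N - df_t + 0.5}{df_t + 0.5} \cdot
        \frac{(k_1 + 1)\, tf_{t,d}}{k_1 \big( (1 - b) + b \frac{|d|}{\mathrm{avgdl}} \big) + tf_{t,d}} \cdot
        \frac{(k_3 + 1)\, qtf_t}{k_3 + qtf_t}

where tf_{t,d} is the frequency of term t in document d, qtf_t its frequency in the query, df_t its
document frequency, N the number of documents in the collection, and avgdl the average document length.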

3.2. Dataset
The dataset of the CLEF 2021 CHS task is largely constructed from the collection introduced in
the CLEF 2018 IR task, extended with additional webpages and social media content. In total, the
collection consists of over 5 million medical webpages from selected domains acquired from
CommonCrawl and other resources [4].

3.3. Topics
In total, 55 topics are used in the CLEF 2021 CHS task, all based on realistic search scenarios.
These topics are divided into two sets. The reddit-topics set includes 25 topics based on use cases
from discussion forums; these queries are extracted and manually selected from Google Trends to
best fit each use case. The patients-topics set includes 30 topics based on discussions with
multiple sclerosis and diabetes patients; these queries are manually generated by experts from
established search scenarios. Figure 1 presents example topics used in the task.

3.4. Pre-processing
All queries are pre-processed with lower-casing, stop word removal, and Porter stemming. The
default stop word list available in the IR platform Terrier 5.4 is used.
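   A minimal sketch of this pre-processing in Python is shown below; it uses NLTK's Porter stemmer
and a small illustrative stop word list, whereas our runs rely on Terrier 5.4's built-in pipeline
and its default stop word list:

    from nltk.stem.porter import PorterStemmer

    # Illustrative stop word list only; the actual runs use Terrier 5.4's default list.
    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "are", "what", "how"}
    stemmer = PorterStemmer()

    def preprocess(query: str) -> list[str]:
        tokens = query.lower().split()                       # lower-casing, simple tokenization
        tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
        return [stemmer.stem(t) for t in tokens]             # Porter stemming

    print(preprocess("What are the treatments for multiple sclerosis"))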

3.5. Evaluation Measures
The task takes into account three dimensions in the relevance evaluation: topical relevance,
understandability, and credibility. The ability of systems to retrieve relevant, readable, and
credible documents for the topics, and the ability of systems to retrieve all kinds of documents (web or

   2
       https://sourceforge.net/p/lemur/wiki/RankLib/
Table 3
The best four learning to rank models.
                            LTR model    LTR algorithm    NDCG@10
                            m_lm         LambdaMART       0.9662
                            m_mr         MART             0.8869
                            m_rf         Random Forests   0.6744
                            m_rb         RankBoost        0.5821


social media) are both considered. The evaluation measures used are NDCG@10, Bpref, and RBP, as
well as metrics adapted to the other relevance dimensions, such as uRBP.
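   For reference, NDCG@k, which is also used as the training metric in Section 3.6, is commonly
defined as

    \mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2 (i + 1)}, \qquad
    \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k},

where rel_i is the graded relevance of the document at rank i and IDCG@k is the DCG@k of the ideal
ranking of the judged documents.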

3.6. Experiments
Using the data from the CLEF 2018 eHealth IR task, we train eight learning to rank models in
total. The metric optimized during the training of the learning to rank models is NDCG@10. We
choose the four best-performing LTR models and use them in this year's task. The evaluation of
these top four LTR models is presented in Table 3.
   The top 1,000 documents for each query are retrieved using the BM25 retrieval model in Terrier.
The selected four models are then used to re-rank the initial results obtained with the BM25
retrieval model, and four runs are generated for the final submission.
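   A minimal sketch of this re-ranking step, assuming a trained RankLib model and a LETOR-format
feature file for the top-1,000 BM25 results (the file names are placeholders, and the layout of
RankLib's score output should be verified against the version in use), is:

    import subprocess
    from collections import defaultdict

    # Score the test feature file with a trained model.
    subprocess.run([
        "java", "-jar", "RankLib-2.15.jar",
        "-load", "model_LambdaMART.txt",
        "-rank", "test_features.txt",        # LETOR file for the top-1,000 BM25 results per query
        "-score", "scores.txt",
    ], check=True)

    # Re-rank per query by the model scores.
    ranked = defaultdict(list)
    with open("test_features.txt") as feats, open("scores.txt") as scores:
        for feat_line, score_line in zip(feats, scores):
            qid = feat_line.split()[1].split(":")[1]          # the "qid:<id>" field
            docno = feat_line.rsplit("#", 1)[1].strip()       # document id from the trailing comment
            score = float(score_line.split()[-1])             # assumed: score in the last column
            ranked[qid].append((docno, score))

    for qid in ranked:
        ranked[qid].sort(key=lambda pair: pair[1], reverse=True)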

3.7. Results
For each topic, 250 documents have been assessed in the three relevance dimensions. We compare
our four run results to the six baselines, as shown in Table 4.
   We first compare the performance among our four implemented models. The best result was
obtained by the model m_rf, which uses the Random Forests learning to rank algorithm, followed by
the model m_rb with the RankBoost algorithm and the model m_lm with the LambdaMART algorithm. On
average, the model m_mr with the MART algorithm achieved the worst result, although it showed
somewhat better results in MAP and the two cRBP measures when compared to the model m_lm.
   We then compare the best model, m_rf, with the baselines. In terms of MAP, this model surpasses
all baselines. In Bpref, the model shows better results than the DirichletLM_qe baseline, but falls
short of the other baselines. In the rRBP measures, the model shows better results than the two
DirichletLM baselines. In the cRBP measures, the model surpasses the BM25 baseline but not the two
DirichletLM baselines, while in the RBP measures it surpasses the two DirichletLM baselines but not
the BM25 baseline.


4. Conclusion and Future Work
This paper reports the ZUT team's participation in the CLEF 2021 eHealth CHS Subtask 1. Using
the data from the CLEF 2018 eHealth IR task, a number of retrieval features are explored and
eight learning to rank algorithms are used to train LTR models. The top-performing LTR models
are used in the CLEF 2021 eHealth IR task Subtask 1. In future work, the methods
Table 4
The results and comparison to the baselines.
 Run                MAP     Bpref   NDCG@10   binary rRBP   graded rRBP   binary cRBP   graded cRBP   binary RBP   graded RBP
 m_rf               4.090   4.686    6.148     7.035    4.943    6.227     4.138     6.028     7.426
 m_rb               3.733   4.472    5.651     6.572    4.499    6.036     4.088     5.599     6.978
 m_lm               3.381   4.409    5.258     6.248    4.076    5.240     3.187     5.198     6.655
 m_mr               3.383   4.278    4.817     5.615    3.486    5.247     3.295     4.805     6.269
 TF_IDF_qe          3.974   5.106    6.535     7.664    5.232    6.849     4.497     6.428     8.010
 TF_IDF             3.663   4.744    6.464     7.443    5.091    6.399     4.179     6.280     7.796
 BM25_qe            3.903   4.994    6.352     7.397    5.072    6.447     4.317     6.277     7.700
 BM25               3.641   4.707    6.364     7.337    5.012    6.201     4.062     6.185     7.661
 DirichletLM        3.694   4.724    5.952     6.839    4.632    6.599     4.578     5.844     7.340
 DirichletLM_qe     2.423   3.691    5.362     6.341    4.082    6.366     4.285     5.345     6.960


proposed in this paper will be further analyzed: different learning to rank features will be
explored, and an ensemble algorithm will be investigated.


References
 [1] H. Suominen, L. Goeuriot, L. Kelly, L. A. Alemany, E. Bassani, N. Brew-Sam, V. Cotik,
     D. Filippo, G. González-Sáez, F. Luque, P. Mulhem, G. Pasi, R. Roller, S. Seneviratne,
     R. Upadhyay, J. Vivaldi, M. Viviani, C. Xu, Overview of the CLEF eHealth evaluation lab
     2021, in: CLEF 2021 - 11th Conference and Labs of the Evaluation Forum, Lecture Notes
     in Computer Science (LNCS), Springer, September 2021.
 [2] H. Suominen, L. Kelly, L. Goeuriot, E. Kanoulas, L. Azzopardi, R. Spijker, D. Li, A. Névéol,
     L. Ramadier, A. Robert, J. Palotti, Jimmy, G. Zuccon, Overview of the CLEF eHealth evaluation
     lab 2018, in: CLEF 2018 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes
     in Computer Science (LNCS), Springer, September 2018.
 [3] Jimmy, G. Zuccon, J. Palotti, L. Goeuriot, L. Kelly, Overview of the CLEF 2018 consumer
     health search task, in: CLEF 2018 Evaluation Labs and Workshop: Online Working Notes,
     CEUR-WS, September 2018.
 [4] L. Goeuriot, G. Pasi, H. Suominen, E. Bassani, N. Brew-Sam, G. Gonzalez-Saez, R. G.
     Upadhyay, L. Kelly, P. Mulhem, S. Seneviratne, M. Viviani, C. Xu, Consumer health search
     at CLEF eHealth 2021, in: CLEF 2021 Evaluation Labs and Workshop: Online Working Notes,
     CEUR-WS, September 2021.
 [5] Jimmy, G. Zuccon, J. Palotti, L. Goeuriot, L. Kelly, Overview of the CLEF 2018 consumer
     health search task (2018).
 [6] T.-Y. Liu, et al., Learning to rank for information retrieval, Foundations and Trends® in
     Information Retrieval 3 (2009) 225–331.
 [7] J. H. Friedman, Greedy function approximation: a gradient boosting machine, Annals of
     statistics (2001) 1189–1232.
 [8] L. Breiman, Random forests, Machine learning 45 (2001) 5–32.
 [9] Y. Freund, R. Iyer, R. E. Schapire, Y. Singer, An efficient boosting algorithm for combining
     preferences, Journal of machine learning research 4 (2003) 933–969.
[10] Q. Wu, C. J. Burges, K. M. Svore, J. Gao, Adapting boosting for information retrieval
     measures, Information Retrieval 13 (2010) 254–270.
[11] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning
     to rank using gradient descent, in: Proceedings of the 22nd international conference on
     Machine learning, 2005, pp. 89–96.
[12] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, H. Li, Learning to rank: from pairwise approach
     to listwise approach, in: Proceedings of the 24th international conference on Machine
     learning, ACM, 2007, pp. 129–136.
[13] J. Xu, H. Li, Adarank: a boosting algorithm for information retrieval, in: Proceedings of
     the 30th annual international ACM SIGIR conference on Research and development in
     information retrieval, ACM, 2007, pp. 391–398.
[14] C. Macdonald, R. L. Santos, I. Ounis, B. He, About learning models with multiple query
     dependent features, ACM Transactions on Information Systems (TOIS) 31 (2013) 11.
[15] C. Macdonald, R. L. Santos, I. Ounis, The whens and hows of learning to rank for web
     search, Information Retrieval 16 (2013) 584–628.
[16] D. Metzler, W. B. Croft, Linear feature-based models for information retrieval, Information
     Retrieval 10 (2007) 257–274.