<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning to rank for Consumer Health Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hua Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoming Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Binbin Zheng</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guan Yang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, University of Évora.</institution>
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science, Zhongyuan University of Technology</institution>
          ,
          <addr-line>Zhengzhou</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The CLEF 2021 eHealth Consumer Health Search task aims to investigate the effectiveness of information retrieval systems in providing health information to ordinary health consumers. Compared to previous years, this year's task includes three subtasks and adopts a new document corpus and set of queries. This paper presents the work of the Zhongyuan University of Technology team participating in Subtask 1. It explores the use of learning to rank techniques in consumer health search. A number of retrieval features are used, and eight different learning to rank algorithms are applied to train models. The four best-performing models are used to re-rank the documents, and four runs are submitted to the subtask.</p>
      </abstract>
      <kwd-group>
        <kwd>consumer health</kwd>
        <kwd>information retrieval</kwd>
        <kwd>learning to rank</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>
        In the information retrieval area, machine learning techniques can be applied to build ranking
models for information retrieval systems; this is known as Learning to Rank (LTR) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Typically, the training data consists of three elements: the training queries Q, the associated
documents D, and the corresponding relevance judgments (the gold-standard qrel file) for
the query-document pairs. A learning algorithm is then used to generate a learning
to rank model. The testing data used for evaluation is created in much the same way as
the training data, from the testing queries and their associated documents. For these
testing queries, the learning to rank model is used jointly with a retrieval model to sort the
documents according to their relevance to the query and return a ranked list of
documents as the response.</p>
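<p>As a minimal illustration of how these three elements fit together, the sketch below pairs each judged query-document combination with its relevance grade from the qrel file. All names and the single term-overlap feature are illustrative stand-ins, not the features actually used in this work.</p>

```python
# Hypothetical sketch: each judged (query, document) pair becomes a feature
# vector labelled with its relevance grade from the qrel file.

def build_training_set(queries, documents, qrels, extract_features):
    """Pair every judged query-document combination with its grade."""
    training_set = []
    for qid in queries:
        for docid, grade in qrels.get(qid, {}).items():
            features = extract_features(queries[qid], documents[docid])
            training_set.append((qid, features, grade))
    return training_set

# Toy single-feature extractor: count of query terms appearing in the document.
def overlap(query, doc):
    return [len(set(query.split()) & set(doc.split()))]

queries = {"q1": "diabetes diet"}
documents = {"d1": "diet advice for diabetes patients", "d2": "flu symptoms"}
qrels = {"q1": {"d1": 2, "d2": 0}}  # 2 = highly relevant, 0 = not relevant

data = build_training_set(queries, documents, qrels, overlap)
# data → [("q1", [2], 2), ("q1", [0], 0)]
```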
      <p>
        Learning to rank methods have been proposed based on different machine learning algorithms.
Existing learning to rank methods are typically categorized into three main groups: pointwise,
pairwise, and listwise approaches. The pointwise approaches, for example, MART [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and
Random Forests [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], regard the relevance degrees as numerical or ordinal scores, and
formulate the learning to rank problem as a regression or a classification problem. The pairwise
approaches, for example, RankBoost [9] and RankNet [11], treat document pairs as training
instances and train models by minimizing a pairwise loss. The listwise approaches, for example,
ListNet [12], AdaRank [13], and LambdaMART [10], regard the entire set of documents associated
with a query as a training instance and train a ranking function by minimizing a listwise loss function.
Table 1 summarizes a number of widely used algorithms for each LTR approach.
      </p>
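<p>The pairwise idea can be made concrete with a small sketch of the RankNet-style pair loss: a pair (i, j) in which document i is more relevant than document j is one training instance, and the model is penalised when it scores j above i. The scores here are plain numbers rather than neural network outputs; this is an illustration of the loss, not the implementation used in the paper.</p>

```python
import math

def ranknet_pair_loss(score_i, score_j):
    """Cross-entropy loss for the pair, with target P(i ranked above j) = 1."""
    # Modelled probability that i is ranked above j: logistic function of
    # the score difference.
    p_ij = 1.0 / (1.0 + math.exp(-(score_i - score_j)))
    return -math.log(p_ij)

# A well-ordered pair incurs a small loss, an inverted pair a large one:
good = ranknet_pair_loss(2.0, 0.5)  # relevant doc scored higher
bad = ranknet_pair_loss(0.5, 2.0)   # relevant doc scored lower
```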
      <p>In this paper, the dataset and the assessment results from the 2018 CLEF eHealth IR task are
used for training the learning to rank models. A number of retrieval features are explored.</p>
      <sec id="sec-2-1">
        <title>2.1. Features Explored for Learning to Rank</title>
        <p>In this work, only regularly used information retrieval features are used to train the learning
to rank models. They are extracted from a group of 22 different retrieval models [14, 15], as
presented in Table 2.</p>
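<p>A hedged sketch of the feature-extraction idea: each query-document pair is described by the scores of several standard weighting models. Only a few toy models are shown here (the actual work uses 22 models from Terrier); the functions below are illustrative, not the paper's feature set.</p>

```python
import math

def tf(term, doc):
    """Raw term frequency of a term in a tokenized document."""
    return doc.count(term)

def idf(term, docs):
    """Smoothed inverse document frequency over the collection."""
    n = sum(1 for d in docs if term in d)
    return math.log((len(docs) + 1) / (n + 1))

def features(query, doc, docs):
    """One feature per weighting model, summed over query terms."""
    return [
        sum(tf(t, doc) for t in query),                 # term frequency
        sum(idf(t, docs) for t in query),               # inverse doc frequency
        sum(tf(t, doc) * idf(t, docs) for t in query),  # TF-IDF
        len(doc),                                       # document length
    ]

docs = [["diabetes", "diet"], ["flu", "symptoms"]]
vec = features(["diabetes"], docs[0], docs)
```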
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Training Learning to Rank Models</title>
        <p>
          We build models using eight state-of-the-art learning to rank methods: two
pointwise algorithms, two pairwise algorithms, and four listwise algorithms. The pointwise
algorithms are MART [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which utilizes gradient boosting regression trees, and Random Forests [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ],
which uses regression. The pairwise algorithms are RankNet [11], which employs relative entropy as the
loss function and gradient descent to train a neural network model, and RankBoost [9], which is based
on boosting. The listwise algorithms are AdaRank [13], based on boosting; Coordinate
Ascent [16], where the ranking scores are calculated as weighted combinations of the feature
values; LambdaMART [10], which combines MART and LambdaRank and directly optimizes NDCG during
training; and ListNet [12], based on neural networks.
        </p>
        <p>
          The dataset and the topical relevance assessments of the 2018 CLEF eHealth IR task [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] are
used as the training data. In the assessment files, the corresponding documents are scored with
0, 1, or 2, which represent not relevant, relevant, or highly relevant, respectively.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments and Results</title>
      <p>This section first presents the experimental settings, the dataset and queries for the subtask,
and the evaluation measures used for the assessments. Then we describe the experiments we
performed and analyze the results.</p>
      <sec id="sec-3-1">
        <title>3.1. Experimental Settings</title>
        <p>The Terrier platform (version 5.4) is used as the IR framework of the system. The Okapi BM25 weighting
model is used as the retrieval model, with all parameters set to their default values (k_1 = 1.2,
k_3 = 8, b = 0.75). All learning to rank models are implemented with RankLib
version 2.15 (https://sourceforge.net/p/lemur/wiki/RankLib/).</p>
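<p>To show what the Okapi BM25 weighting with these parameters computes, a minimal sketch follows. It uses the common Robertson-style IDF; Terrier's own implementation differs in details such as its exact IDF formulation, so this is an illustration rather than the system's code.</p>

```python
import math

def bm25_score(query_terms, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query."""
    avg_len = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)
        n = sum(1 for d in docs if term in d)
        idf = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)  # Robertson IDF
        # Term-frequency saturation (k1) and length normalisation (b):
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
    return score

docs = [["diabetes", "diet", "advice"], ["flu", "symptoms"]]
relevant = bm25_score(["diabetes"], docs[0], docs)    # positive score
irrelevant = bm25_score(["diabetes"], docs[1], docs)  # zero: term absent
```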
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset</title>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Topics</title>
        <p>
          The dataset of the CLEF 2021 CHS task is basically constructed using the collection introduced
in CLEF 2018 IR task, and extended with additional webpages and social media content. Totally,
the collection consists of over 5 million medical webpages from selected domains acquired from
the CommonCrawl and other resources [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>Totally 55 topics are used in the CLEF 2021 CHS task, and they are based on realistic search
scenarios. These topics are divided into two sets. The reddit-topics set includes 25 topics that are
based on use cases from discussion forums. These queries are extracted and manually selected
from Google trends to best fit each use case. The patients-topics set includes 30 topics which are
based on discussions with multiple sclerosis and diabetes patients. These queries are manually
generated by experts from established search scenarios. Figure 1 presents the example topics
used in the task.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Pre-processing</title>
        <p>All queries are pre-processed with lower-casing, stop-word removal, and Porter
stemming. The default stop-word list available in the Terrier 5.4 platform is used.</p>
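<p>The pipeline can be sketched as below. The stop-word list and the suffix rules are illustrative stand-ins: in the actual system Terrier 5.4 supplies the stop-word list and a full Porter stemmer, which is far more elaborate than this toy suffix-stripper.</p>

```python
# Toy stop-word list; the real one comes from Terrier.
STOPWORDS = {"the", "a", "an", "of", "for", "is", "what"}

def crude_stem(token):
    """A toy suffix-stripper, NOT the real Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(query):
    """Lower-case, remove stop words, then stem each remaining token."""
    tokens = query.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("What is the Treatment for Diabetes"))
# → ['treatment', 'diabet']
```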
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Evaluation Measures</title>
        <p>The task takes into account three dimensions in the relevance evaluation: topical relevance,
understandability, and credibility. Both the ability of systems to retrieve relevant, readable, and credible
documents for the topics and the ability of systems to retrieve all kinds of documents (web or
social media) are considered. The evaluation measures used are NDCG@10, BPref, and RBP, as
well as metrics adapted to the other relevance dimensions, such as uRBP.</p>
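<p>As a reference point for the headline measure, a short sketch of NDCG@10 computed from the graded relevance labels (0/1/2) of the top-ranked documents follows; this is the standard exponential-gain formulation, which may differ in minor details from the task's official evaluation scripts.</p>

```python
import math

def dcg(grades, k=10):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg(grades, k=10):
    """DCG normalised by the DCG of the ideal (descending-grade) ordering."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; placing a non-relevant document first
# lowers the score:
perfect = ndcg([2, 1, 0])
swapped = ndcg([0, 1, 2])
```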
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Experiments</title>
        <p>Using the data from the CLEF 2018 eHealth IR task, we train a total of eight learning to rank
models. NDCG@10 is used as the training metric for the learning to rank models. We choose
the four best-performing LTR models and use them in this year's task. The evaluation of these
top four LTR models is presented in Table 3.</p>
        <p>For each query, the top 1,000 documents are retrieved using the BM25 retrieval model
in Terrier. The selected four models are then used to re-rank these initial BM25 results, and
four runs are generated for the final submission.</p>
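<p>The run-generation step can be summarized in a short sketch: BM25 retrieves the candidate pool per query, and a trained LTR model re-scores and re-orders it. Here bm25_search, extract_features, and ltr_model are placeholders for the Terrier and RankLib components, not their real APIs.</p>

```python
def rerank_run(queries, bm25_search, extract_features, ltr_model, depth=1000):
    """Re-rank each query's BM25 candidates by the LTR model's scores."""
    run = {}
    for qid, query in queries.items():
        candidates = bm25_search(query, k=depth)  # initial BM25 ranking
        scored = [(docid, ltr_model(extract_features(query, docid)))
                  for docid in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # LTR order
        run[qid] = [docid for docid, _ in scored]
    return run

# Toy usage with stand-in components:
toy_scores = {"d1": 0.1, "d2": 0.9, "d3": 0.5}
run = rerank_run(
    {"q1": "some query"},
    bm25_search=lambda q, k: ["d1", "d2", "d3"],
    extract_features=lambda q, d: d,
    ltr_model=lambda feats: toy_scores[feats],
)
# run["q1"] → ["d2", "d3", "d1"]
```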
      </sec>
      <sec id="sec-3-7">
        <title>3.7. Results</title>
        <p>For each topic, 250 documents have been assessed along the three relevance dimensions. We
compare our four runs to the six baselines, as shown in Table 4.</p>
        <p>We first compare the performance of our four implemented models. The best result
was obtained by the model m_rf, which uses the Random Forests learning to rank algorithm,
followed by the model r_rb with the RankBoost algorithm and the model m_lm with the LambdaMART
algorithm. On average, the model m_mr with the MART algorithm achieved the worst result,
although it showed somewhat better results on MAP and the two cRBP measures when compared
to the model m_lm.</p>
        <p>We then compare the best model, m_rf, with the baselines. On MAP, this
model surpassed all baselines. On Bpref, the model showed better results than the
DirichletLM_qe baseline but fell behind the other baselines. On the rRBP measures, the model
showed better results than the two DirichletLM baselines. On the cRBP and RBP measures,
the model surpassed the BM25 baseline and the two DirichletLM baselines.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future Work</title>
      <p>This paper reports on the ZUT team's participation in CLEF 2021 eHealth CHS Subtask 1. Using
the data from the CLEF 2018 eHealth IR task, a number of retrieval features are explored and
eight learning to rank algorithms are used to train LTR models. The top-performing LTR
models are then used in the CLEF 2021 eHealth IR task Subtask 1. In future work, the methods
proposed in this paper will be analyzed further: different learning to rank features will be
explored, and an ensemble algorithm will be investigated.</p>
      <p>[9] Y. Freund, R. Iyer, R. E. Schapire, Y. Singer, An efficient boosting algorithm for combining
preferences, Journal of Machine Learning Research 4 (2003) 933–969.
[10] Q. Wu, C. J. Burges, K. M. Svore, J. Gao, Adapting boosting for information retrieval
measures, Information Retrieval 13 (2010) 254–270.
[11] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning
to rank using gradient descent, in: Proceedings of the 22nd International Conference on
Machine Learning, 2005, pp. 89–96.
[12] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, H. Li, Learning to rank: from pairwise approach
to listwise approach, in: Proceedings of the 24th International Conference on Machine
Learning, ACM, 2007, pp. 129–136.
[13] J. Xu, H. Li, AdaRank: a boosting algorithm for information retrieval, in: Proceedings of
the 30th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, ACM, 2007, pp. 391–398.
[14] C. Macdonald, R. L. Santos, I. Ounis, B. He, About learning models with multiple query
dependent features, ACM Transactions on Information Systems (TOIS) 31 (2013) 11.
[15] C. Macdonald, R. L. Santos, I. Ounis, The whens and hows of learning to rank for web
search, Information Retrieval 16 (2013) 584–628.
[16] D. Metzler, W. B. Croft, Linear feature-based models for information retrieval, Information
Retrieval 10 (2007) 257–274.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Suominen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Alemany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bassani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Brew-Sam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cotik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Filippo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>González-Sáez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Luque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          , G. Pasi,
          <string-name>
            <given-names>R.</given-names>
            <surname>Roller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seneviratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Upadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vivaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Viviani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Overview of the clef ehealth evaluation lab 2021</article-title>
          ,
          <source>in: CLEF 2021 - 11th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS)</source>
          , Springer,
          <year>September 2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Suominen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Spijker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Névéol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ramadier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Robert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Palotti</surname>
          </string-name>
          , Jimmy,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zuccon</surname>
          </string-name>
          ,
          <article-title>Overview of the clef ehealth evaluation lab 2018</article-title>
          ,
          <source>in: CLEF 2018 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS)</source>
          , Springer, September
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Jimmy</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Palotti</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Kelly</surname>
          </string-name>
          ,
          <article-title>Overview of the clef 2018 consumer health search task., in: CLEF 2018 Evaluation Labs</article-title>
          and Workshop: Online Working Notes, CEUR-WS,
          <year>September 2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , G. Pasi,
          <string-name>
            <given-names>H.</given-names>
            <surname>Suominen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bassani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Brew-Sam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Upadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seneviratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Viviani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Consumer health search at clef ehealth 2021, in: CLEF 2021 Evaluation Labs</article-title>
          and Workshop: Online Working Notes, CEUR-WS,
          <year>September 2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jimmy</surname>
          </string-name>
          , G. Zuccon,
          <string-name>
            <given-names>J.</given-names>
            <surname>Palotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <article-title>Overview of the clef 2018 consumer health search task (</article-title>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>Learning to rank for information retrieval</article-title>
          ,
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>225</fpage>
          -
          <lpage>331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <article-title>Greedy function approximation: a gradient boosting machine</article-title>
          ,
          <source>Annals of statistics</source>
          (
          <year>2001</year>
          )
          <fpage>1189</fpage>
          -
          <lpage>1232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          , Random forests,
          <source>Machine learning 45</source>
          (
          <year>2001</year>
          )
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>