<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning to Rank from Relevance Judgments Distributions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alberto Purpura</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianmaria Silvello</string-name>
          <email>silvello@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gian Antonio Susto</string-name>
          <email>gianantonio.susto@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Research Europe</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padua</institution>
          ,
          <addr-line>Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>LEarning TO Rank (LETOR) algorithms are usually trained on annotated corpora where a single relevance label is assigned to each available document-topic pair. Within the Cranfield framework, relevance labels result from merging either multiple expertly curated or crowdsourced human assessments. In this paper, we explore how to train LETOR models with relevance judgments distributions (either real or synthetically generated) assigned to document-topic pairs instead of single-valued relevance labels. We propose five new probabilistic loss functions to deal with the higher expressive power provided by relevance judgments distributions and show how they can be applied both to neural and gradient boosting machine (GBM) architectures. Overall, we observe that relying on relevance judgments distributions to train different LETOR models can boost their performance and even outperform strong baselines such as LambdaMART on several test collections.</p>
      </abstract>
      <kwd-group>
        <kwd>Learning to Rank</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Optimization Functions</kwd>
        <kwd>Information Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Ranking is a problem that we encounter in a number of tasks we perform every day: from
searching on the Web to online shopping. Given an unordered set of items, this problem consists
of ordering the items according to a certain notion of relevance. Generally, in Information
Retrieval (IR) we rely on a notion of relevance that depends on the information need of a
user, expressed through a keyword query. When creating a new experimental collection, the
corresponding relevance judgments are obtained by asking different judges to assign a relevance
score to each document-topic pair. Multiple judges – either trained experts or participants of a
crowdsourcing experiment – usually assess the same document-topic pair, and the final relevance
label for the pair is obtained by aggregating these scores [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This process is a cornerstone
for system training and evaluation and has contributed to the continuous development of
IR, especially in the context of international evaluation campaigns. Nonetheless, the opinion
of different judges on the same document-topic pair might be very different or even diverge
to the opposite ends of the spectrum – either because of random human errors or due to a
different interpretation of a topic. Inevitably, the aggregation process conflates the multiple
assessors’ viewpoints on document-topic pairs onto a single one, thus losing some information
– even though it also reduces annotation errors and outliers. Our research hypothesis is that
Machine Learning (ML) models – i.e., LEarning TO Rank (LETOR) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Neural Information
Retrieval (NIR) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] models – could use all the labels collected in the annotation process to
improve the quality of their rankings. Indeed, judges’ disagreement on a certain document-topic
pair can be due to an inherent difficulty of the topic or to the existence of multiple interpretations
of it. We argue that designing ML models able to learn from the whole distributions of relevance
judgments could improve the models’ representation of relevance and their performance through
the usage of this additional information. Following this idea, we propose to interpret the output
of a LETOR model as a probability value or distribution – according to the experimental
hypotheses – and define different Kullback–Leibler (KL) divergence-based loss functions to
train a model using a distribution of relevance judgments associated to the current training item.
Such a training strategy allows us to leverage all the available information from human judges
without additional computational costs compared to traditional LETOR training paradigms.
      </p>
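      <p>As a toy illustration of the information that label aggregation discards (the judgment values and the majority-vote rule below are our own hypothetical example, not data from any collection):</p>

```python
from collections import Counter

# Hypothetical binary judgments by five assessors for one document-topic pair.
judgments = [1, 1, 0, 1, 0]

# Traditional aggregation: collapse the assessments into a single label,
# e.g. by majority vote -- the two dissenting opinions are discarded.
majority_label = Counter(judgments).most_common(1)[0][0]

# Distribution view: keep the empirical probability of relevance, which
# preserves the degree of disagreement among the judges.
p_relevant = sum(judgments) / len(judgments)

print(majority_label)  # 1
print(p_relevant)      # 0.6
```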
      <p>
        The loss functions we propose [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] can be used to train any ranking model that relies on
gradient-based learning, including popular NIR models or LETOR ones. In our experiments [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
we focus on one transformer-based neural LETOR model and on one decision tree-based Gradient
Boosting Machine (GBM) model – the model at the base of the popular LambdaMART [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] ranker
and used as a strong baseline in many recent LETOR research papers such as [
        <xref ref-type="bibr" rid="ref10 ref6 ref7 ref8 ref9">6, 7, 8, 9, 10</xref>
        ].
We assess the quality of the proposed training strategies on four standard LETOR collections
(MQ2007, MQ2008, MSLR-WEB30K [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and OHSUMED [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]). The paper is organized as follows:
in Section 2 we present the details of the training strategies we propose; in Section 3 we present
the most significant evaluation results we achieve and in Section 4 we report our conclusions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Approach</title>
      <p>
        We propose five different loss function formulations that allow ranking models to take
advantage of relevance judgments distributions prior to their aggregation. They are all based on two
intuitions that allow us to model relevance judgments as probability distributions and to use KL
divergence-based measures to compare them. We first propose to interpret the relevance label
(or a set of relevance labels) assigned to the same document as if it was generated by a Binomial
(or Multinomial) random variable modeling the judges’ annotation process. For example, we
assume that n assessors provided one binary relevance label for each document-topic pair, i.e.,
to state whether the pair was a relevant or a not-relevant one. This process can be modeled
as a Binomial random variable Y ∼ Bin(n, p), where the success probability p for each sample
is the average of the binary responses submitted in n trials. We can follow the same process
to represent a set of relevance labels associated to the same document as a sample from a
Multinomial random variable. We then apply the same reasoning for the interpretation of our
model output probability score as another Binomial (or Multinomial) distribution Ŷ ∼ Bin(n, p̂),
with the same parameter n – empirically tuned for the numerical stability of the gradients
during training – and probability p̂ equal to the output of the model. The second option we
propose is to consider the relevance labels associated to each document (or batch of documents)
as samples from Gaussian (or multivariate Gaussian) random variables Y ∼ N(μ, σ), with
the same standard deviation σ but centered on a different point μ depending on the relevance
label associated with the document. Depending on the modeling strategy, the proposed loss
functions take the following formulations, typical of pointwise, pairwise hinge [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or listwise
losses that are frequently employed in NIR [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and LETOR approaches [
        <xref ref-type="bibr" rid="ref15 ref16 ref17 ref6">15, 16, 17, 6</xref>
        ]:
      </p>
      <p>• Pointwise(B) = (KL_B(Y || Ŷ) + KL_B(Ŷ || Y)) * w, where we rescale each term
in a training batch by a factor w, inversely proportional to the number of times an item
of the same class (relevant or not relevant) appeared in it;</p>
      <p>• Pointwise(M) = (KL_M(Y || Ŷ) + KL_M(Ŷ || Y)) * w, where we employ
a Multinomial random variable instead of a Binomial one to represent the set of relevance
labels associated to a certain topic-document pair by an annotator;</p>
      <p>• Pairwise(B) = max(0, γ − sign(p̂+ − p̂−) KL_B(Ŷ+, Ŷ−)), where γ is a
slack parameter to adjust the distance between the two distributions, p̂+ and p̂− are the
outputs of the LETOR model associated to two documents – the former with a higher
relevance label than the latter – and Ŷ+ ∼ Bin(n, p̂+) and Ŷ− ∼ Bin(n, p̂−) are two
Binomial distributions corresponding to a relevant and to a not-relevant document-topic
pair, respectively;</p>
      <p>• Pairwise(G) = max(0, γ − sign(μ+ − μ−) KL_G(Ŷ+, Ŷ−)), where we adapt the
previous hinge-style loss function to represent label distributions with a Gaussian random
variable instead of a Binomial one;</p>
      <p>• Listwise(G) = KL_G(Ŷ || Y) * w, where we consider the relevance labels
associated to multiple documents to rerank for the same query at the same time,
modeling the list of relevance scores and their respective labels as multivariate Gaussian
distributions.</p>
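      <p>As a rough sketch of how two of these losses can be computed (in Python; the function names, the clamping constant EPS, and all numeric values are ours, not from the paper’s implementation), note that the KL divergence between two Binomial distributions with the same n, and between two Gaussians with the same σ, both admit simple closed forms:</p>

```python
import math

EPS = 1e-7  # clamping constant (our choice) for numerical stability of the logs

def kl_binomial(n, p, q):
    """KL(Bin(n, p) || Bin(n, q)) in closed form (same number of trials n)."""
    p = min(max(p, EPS), 1 - EPS)
    q = min(max(q, EPS), 1 - EPS)
    return n * (p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q)))

def pointwise_binomial_loss(n, p, p_hat, w=1.0):
    """Symmetrized KL between the judgment distribution Bin(n, p) and the
    model distribution Bin(n, p_hat), rescaled by the class-balance factor w."""
    return (kl_binomial(n, p, p_hat) + kl_binomial(n, p_hat, p)) * w

def kl_gaussian_same_sigma(mu1, mu2, sigma):
    """KL(N(mu1, sigma) || N(mu2, sigma)) reduces to (mu1 - mu2)^2 / (2 sigma^2)."""
    return (mu1 - mu2) ** 2 / (2.0 * sigma ** 2)

def pairwise_gaussian_loss(mu_pos, mu_neg, sigma, gamma=1.0):
    """Hinge-style loss: penalize pairs whose score distributions are closer
    than the slack gamma, or whose scores are ordered the wrong way round."""
    sign = 1.0 if mu_pos > mu_neg else -1.0
    return max(0.0, gamma - sign * kl_gaussian_same_sigma(mu_pos, mu_neg, sigma))

# Three of five assessors judged the pair relevant (p = 0.6), while the
# model outputs p_hat = 0.9: the loss is positive and shrinks as p_hat nears p.
loss = pointwise_binomial_loss(n=5, p=0.6, p_hat=0.9)
```

      <p>With identical distributions the loss vanishes: pointwise_binomial_loss(5, 0.6, 0.6) is 0.</p>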
      <p>
        We evaluate the proposed loss functions on different LETOR models, i.e., the LightGBM
implementation of LambdaMART [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and a simpler transformer-based neural model [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] with
one self-attention layer followed by a feed-forward one. The experimental collections that we
consider in our experiments are: MQ2007, MQ2008, MSLR-WEB30K [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and OHSUMED [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
which are the experimental collections of reference in the LETOR domain. All collections
are already organized in five different folds with the respective training, test and validation
subsets. We report the performance of our model averaged over these folds with the exception
of the MSLR-WEB30K collection where we only consider Fold 1 as in other popular research
works [
        <xref ref-type="bibr" rid="ref19 ref20 ref21 ref10">19, 20, 21, 10</xref>
        ]. Our code and implementation of the proposed loss functions are available
at: https://github.com/albpurpura/PLTR.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>In Table 1, we report the most significant results of our performance evaluation. For each
experimental collection, we report the performance of the best variants of a LambdaMART and
a Transformer Neural Network (NN) model when trained with the different loss functions we
propose: the GBM-based variants (LambdaMART, Pointwise KL (Binomial), Listwise KL
(Gaussian)) and the neural ones (Pairwise KL (Gaussian), Listwise KL (Gaussian)), evaluated on
the MQ2007, MQ2008, MSLR-WEB30K and OHSUMED collections. The reported measures –
computed at rank cutoffs k ∈ {1, 3, 5}, along with mean Average Precision (AP) and ERR [22] –
are averaged over all topics. In Table 1, ↑ or ↓ indicates a statistically significant (p &lt; 0.05)
difference with the LambdaMART model trained on the original relevance judgments; the best
performance measures per collection are in bold, as is the loss function with the most best
measures per collection.</p>
      <p>In most of the cases, our simple Transformer-based neural model trained with the proposed
loss functions is able to outperform a LambdaMART model – one of the best performing
state-of-the-art models, often used as a baseline in the LETOR literature. We also observe how a
GBM-based model is able to benefit from the proposed loss functions. In fact, when evaluated on
the MQ2007, MQ2008 and OHSUMED collections, the proposed variant of the GBM model
trained with the Pointwise KL (Binomial) loss function outperforms the GBM – LambdaMART
model according to different performance measures.</p>
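      <p>The paper does not spell out which statistical test produces the ↑/↓ markers; as one hedged possibility, a two-sided randomization (sign-flip) test over per-topic score differences – a common choice in IR evaluation – can be sketched as follows (the function name and iteration count are our own):</p>

```python
import random
from statistics import mean

def sign_flip_test(scores_a, scores_b, iters=10000, seed=0):
    """Two-sided randomization test: p-value for the null hypothesis that
    the per-topic differences between two systems are symmetric around zero."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(iters):
        # Randomly flip the sign of each per-topic difference.
        flipped = [d if rng.random() >= 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return hits / iters  # significant at 0.05 when the returned p falls below it
```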
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>
        We presented different strategies to train a LETOR model relying on relevance judgments
distributions. We introduced five different loss functions relying on the
KL divergence between
distributions, opening new possibilities for the training of LETOR models. The proposed loss
functions were evaluated on a transformer-based neural model and on a decision tree-based GBM
model – the same model employed by the popular LambdaMART algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] – over a number
of experimental collections of different sizes. In our experiments, the proposed loss functions
outperformed the aforementioned baselines in several cases and gave a significant performance
boost to LETOR approaches – especially the ones based on neural models – allowing them to also
outperform other strong baselines in the LETOR domain such as the LightGBM implementation
of LambdaMART [
        <xref ref-type="bibr" rid="ref10 ref5">5, 10</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Cox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Milić-Frayling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kazai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vinay</surname>
          </string-name>
          ,
          <article-title>On aggregating labels from multiple crowd workers to infer relevance of documents</article-title>
          ,
          <source>in: Proc. of ECIR</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tax</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bockting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          ,
          <article-title>A cross-benchmark comparison of 87 learning to rank methods</article-title>
          ,
          <source>IP&amp;M</source>
          <volume>51</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Onal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Altingovde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Karagoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Braylan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Angert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Banner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khetan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mcdonnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rijke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lease</surname>
          </string-name>
          ,
          <article-title>Neural information retrieval: At the end of the early years</article-title>
          ,
          <source>Information Retrieval</source>
          <volume>21</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Purpura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Susto</surname>
          </string-name>
          ,
          <article-title>Learning to rank from relevance judgments distributions</article-title>
          ,
          <source>Journal of the Association for Information Science and Technology</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <article-title>From ranknet to lambdarank to lambdamart: An overview</article-title>
          ,
          <source>in: MSR-TR-2010-82</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zoghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>Revisiting approximate metric optimization in the age of deep neural networks</article-title>
          ,
          <source>in: Proc. of SIGIR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bruch</surname>
          </string-name>
          ,
          <article-title>An alternative cross entropy loss for learning-to-rank</article-title>
          , arXiv:1911.09798 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pasumarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>Permutation equivariant document interaction network for neural learning to rank</article-title>
          ,
          <source>in: Proc. of ICTIR</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>A stochastic treatment of learning to rank scoring functions</article-title>
          ,
          <source>in: Proc. of WSDM</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pasumarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>Neural rankers are hitherto outperformed by gradient boosted decision trees</article-title>
          ,
          <source>in: Proc. of ICLR</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Introducing LETOR 4.0 datasets</article-title>
          , arXiv:1306.2597 (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Letor: A benchmark collection for research on learning to rank for information retrieval</article-title>
          ,
          <source>Information Retrieval</source>
          <volume>13</volume>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Ranking measures and loss functions in learning to rank</article-title>
          ,
          <source>Proc. of NIPS</source>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Purpura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <article-title>Focal elements of neural information retrieval models. an outlook through a reproducibility study</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>57</volume>
          (
          <year>2020</year>
          )
          <fpage>102109</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A general approximation framework for direct optimization of information retrieval measures</article-title>
          ,
          <source>IR Journal 4</source>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Purpura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maggipinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Susto</surname>
          </string-name>
          ,
          <article-title>Probabilistic word embeddings in neural ir: A promising model that does not work as expected (for now)</article-title>
          ,
          <source>in: Proc. of ICTIR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <article-title>CEDR: Contextualized embeddings for document ranking</article-title>
          ,
          <source>in: Proc. of SIGIR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Proc. of NIPS</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>Feature transformation for neural ranking models</article-title>
          ,
          <source>in: Proc. of SIGIR</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Grushetsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mitrichev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sterling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ravina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <article-title>Interpretable learning-to-rank with generalized additive models</article-title>
          , arXiv:2005.02553 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ibrahim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carman</surname>
          </string-name>
          ,
          <article-title>Comparing pointwise and listwise objective functions for random-forest-based learning-to-rank</article-title>
          ,
          <source>ACM TOIS</source>
          <volume>34</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>O.</given-names>
            <surname>Chapelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metlzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Grinspan</surname>
          </string-name>
          ,
          <article-title>Expected reciprocal rank for graded relevance</article-title>
          ,
          <source>in: Proc. of CIKM</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>