Improving Learning to Rank by Leveraging User Dynamics and Continuation Methods (Discussion Paper)

Nicola Ferro1, Claudio Lucchese2, Maria Maistro3, and Raffaele Perego4

1 University of Padua, Italy, ferro@dei.unipd.it
2 Ca' Foscari University of Venice, Italy, claudio.lucchese@unive.it
3 University of Copenhagen, Denmark, mm@di.ku.dk
4 ISTI-CNR, Pisa, Italy, raffaele.perego@isti.cnr.it

Abstract. Learning to Rank (LtR) techniques leverage assessed samples of query-document relevance to learn ranking functions able to exploit the noisy signals hidden in the features used to represent queries and documents. In this paper, we explore how to enhance the state-of-the-art LambdaMart algorithm by integrating into the training process an explicit knowledge of the underlying user-interaction model and the possibility of targeting different objective functions that can effectively drive the algorithm towards promising areas of the search space. We enrich the learning algorithm in two ways: (i) by considering complex query-based user dynamics rather than simply discounting the gain by rank position; (ii) by designing a learning path across different loss functions that can capture different signals in the training data. Our extensive experiments, conducted on publicly available datasets, show that the proposed solution improves various ranking quality measures by statistically significant margins.

Keywords: Learning to Rank · User Dynamics · Continuation Methods

1 Introduction

In [8], we explored whether the explicit knowledge of the underlying user interaction model [6] and the possibility of targeting different objective functions, by means of Continuation Methods (CM) [1, 5, 7], make it possible to improve the ranking models learned by Learning to Rank (LtR) algorithms. In particular, our investigation focused on improving the state-of-the-art LtR algorithm LambdaMart [3]. This extended abstract summarizes the main contributions discussed in [8], relying on an extensive assessment conducted on publicly available datasets:

– modelling user dynamics into LambdaMart: instead of proposing new features to account for user behavior and then training the LtR model on this extended set of features, we: (1) explicitly model the user dynamics in scanning a ranked result list with Markov chains trained on query log data; (2) define a novel measure, Normalized Markov Cumulated Gain (nMCG), which embeds this trained Markov chain, and modify the LambdaMart loss function to rely on this measure instead of the usual Normalized Discounted Cumulated Gain (nDCG) [9];
– user interaction-based curriculum learning for LambdaMart: designing a curriculum of objective functions of increasing complexity has been shown to be a promising research direction; we therefore explore whether our nMCG measure provides further gains when building a curriculum based on CM techniques.

2 Methodology

An LtR algorithm exploits a ground-truth set of training examples in order to learn a document scoring function σ [10]. Such a training set is composed of a large collection of queries Q, where each query q ∈ Q is associated with a set of assessed documents D = {d_0, d_1, ...}.
Each document d_i is labeled by a relevance judgment l_i according to its relevance to the query q. These labels induce a partial ordering over the assessed documents, thus defining an ideal ranking which the LtR algorithm aims at approximating. Each query-document pair (q, d_i) is represented by a vector of features x, able to describe the query (e.g., its length), the document (e.g., the in-link count) and their relationship (e.g., the number of occurrences of each query term in the document).

LambdaRank can be summarized as follows. Given a query q and two candidate documents d_i and d_j in the training set with relevance labels l_i and l_j respectively, s_i and s_j are the scores currently predicted for the documents. The lambda gradient of any given Information Retrieval (IR) quality function Q, as defined in [3], is:

$$\lambda_{ij} = \frac{\partial Q_{ij}}{\partial (s_i - s_j)} = \mathrm{sgn}(l_i - l_j)\, \Delta Q_{ij} \cdot \frac{1}{1 + e^{s_i - s_j}}$$

where the sign is determined by the document labels only, the first factor ΔQ_ij is the quality variation when swapping scores s_i and s_j, and the second factor is the derivative of the RankNet cost [2], which minimizes the number of disordered pairs. When l_i ≥ l_j, the quality Q increases with the score of document d_i. The larger the quality variation ΔQ, the higher the document d_i should be scored. Note that the RankNet multiplier fades ΔQ if documents are scored correctly, i.e. s_i ≥ s_j, and boosts ΔQ otherwise. The lambda gradient for a document d_i is computed by marginalizing over all possible pairs in the result list: $\lambda_i = \sum_j \lambda_{ij}$. LambdaRank uses nDCG as Q, and so ΔQ is the variation in nDCG caused by the swap of two documents.

Fig. 1. Stationary distributions for queries retrieving different numbers of relevant documents (a); and stationary distribution with its fitted curve and DCG discount for navigational (b) and informational (c) queries.

2.1 User Dynamics for LtR

We summarize here our methodology [6] for including the user dynamics into the above discussed LambdaRank algorithm. We model the user dynamics with a Markovian process [11], where the user scans the ranked documents in the Search Engine Result Page (SERP) according to possibly complex paths, moving both forward and backward. Let us denote by X_1, X_2, ... the sequence of random variables representing the rank positions in R = {1, 2, ..., R} visited by the user, where X_n = j means that the n-th document visited by the user is at rank j. Moreover, we assume that the probability to move from the document at rank i to the document at rank j depends on the document at rank i only and is independent of all the previously visited documents. Finally, we denote by P the transition matrix whose entries represent the transition probabilities P = (p_ij : i, j ∈ R), where p_ij = P[X_{n+1} = j | X_n = i]. The sequence of random variables (X_n)_{n>0} defines a discrete-time homogeneous Markov chain.
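To make the estimation step concrete, the following minimal sketch, in Python with NumPy, shows one way to build the transition matrix P from click-log sessions and extract its stationary distribution by power iteration. It is our own illustration, not the authors' code: it assumes each session has already been reduced to the sequence of 1-based rank positions visited by the user, and the function and variable names (transition_matrix, stationary_distribution, sessions, R) are ours.

```python
# Minimal sketch, assuming sessions are lists of 1-based visited rank positions.
import numpy as np

def transition_matrix(sessions, R=10):
    """Estimate p_ij = P[X_{n+1} = j | X_n = i] by counting observed moves."""
    counts = np.zeros((R, R))
    for ranks in sessions:
        for i, j in zip(ranks, ranks[1:]):
            counts[i - 1, j - 1] += 1.0
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # ranks that are never left keep an all-zero row
    return counts / totals

def stationary_distribution(P, iters=1000):
    """Power iteration for pi = pi P, renormalizing at each step."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = pi @ P
        pi /= pi.sum()
    return pi

# Toy example: two sessions over a 10-position SERP, moving forward and backward.
sessions = [[1, 2, 1, 3], [1, 2, 3, 4, 2]]
pi = stationary_distribution(transition_matrix(sessions))
```

On a real query log one would check the estimated chain for irreducibility before trusting the power iteration; in this sketch the per-step renormalization simply keeps pi a probability distribution.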
Figure 1(a) plots the stationary distributions obtained from the Yandex query log described in Section 3. We focus on two distinct macroscopic behavior types and, for the sake of simplicity, we call navigational the queries where users concentrate on just the first item, and we consider all the other queries as informational, since users tend to visit more documents. On the basis of the above experimental observations, we claim that the user dynamics can be described as a mixture of the navigational and informational behavior. The navigational component is represented by the inverse of the rank position, 1/i, while the informational component is linear with respect to the rank position i. Therefore, we model the user dynamics as δ(i) = αi^{-1} + βi + γ, where the parameters α, β and γ are calibrated in order to fit the estimated stationary distributions computed on the Yandex dataset.

Figures 1(b) and 1(c) show the stationary distributions together with the fitted curves for the navigational and informational cases, respectively. In Figure 1(b) the stationary distribution is the same reported in the red line of Figure 1(a), i.e. queries with just 1 relevant document, while to compute the stationary distribution reported in Figure 1(c) we aggregate all the user dynamics corresponding to the other queries, i.e. queries without relevant documents or with more than one relevant document.

We enhance the existing LambdaMart algorithm by replacing the above Q with a new quality measure which integrates the proposed user dynamics δ. This new measure is called Normalized Markov Cumulated Gain (nMCG) and it is defined as follows:

$$\mathrm{nMCG@}k = \frac{\sum_{i \le k} \left(2^{l_i} - 1\right) \cdot \delta^c(i)}{\sum_{h \le k,\ \text{sorted by}\ l_h} \left(2^{l_h} - 1\right) \cdot \delta^c(h)}$$

where l_i is the relevance label of the i-th ranked document and δ^c(i) is the user dynamics function at rank i relative to the query class c, either navigational or informational. Basically, nMCG can be seen as an extension of nDCG, where the discount function is defined by the user dynamics and depends on the query class. Moreover, since δ^c depends on the query class, i.e. depends on the query q, we are optimizing two different variants of the same quality measure nMCG across the training dataset. Finally, ΔnMCG@k_ij can be computed efficiently as follows:

$$\Delta\mathrm{nMCG@}k_{ij} = \frac{-\left(2^{l_i} - 2^{l_j}\right)\left(\delta^c(i) - \delta^c(j)\right)}{\sum_{h \le k,\ \text{sorted by}\ l_h} \left(2^{l_h} - 1\right) \cdot \delta^c(h)}.$$

Hereinafter, we use nMCG-MART to refer to the described variant of LambdaMart aimed at maximizing nMCG.
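Continuing the sketch above, the snippet below calibrates δ(i) = αi^{-1} + βi + γ to a stationary distribution pi by least squares and then computes nMCG@k and ΔnMCG@k_ij directly from the two formulas just given. Again this is our own sketch under our own naming, not the paper's implementation; labels is assumed to hold the relevance labels in the current ranked order.

```python
# Minimal sketch: fit the user dynamics and evaluate nMCG@k from the formulas above.
import numpy as np

def fit_dynamics(pi):
    """Least-squares fit of delta(i) = alpha/i + beta*i + gamma to pi."""
    i = np.arange(1, len(pi) + 1, dtype=float)
    A = np.column_stack([1.0 / i, i, np.ones_like(i)])
    (alpha, beta, gamma), *_ = np.linalg.lstsq(A, pi, rcond=None)
    return lambda r: alpha / r + beta * r + gamma  # delta^c, vectorized over ranks

def nmcg_at_k(labels, delta_c, k=10):
    """nMCG@k: gains discounted by the user dynamics, normalized by the ideal ranking."""
    gains = 2.0 ** np.asarray(labels, dtype=float) - 1.0
    disc = delta_c(np.arange(1, len(gains) + 1, dtype=float))
    k = min(k, len(gains))
    imcg = np.sum(np.sort(gains)[::-1][:k] * disc[:k])  # "sorted by l_h" denominator
    return np.sum(gains[:k] * disc[:k]) / imcg if imcg > 0 else 0.0

def delta_nmcg(labels, delta_c, i, j, k=10):
    """Change in nMCG@k when swapping the documents at 1-based ranks i and j."""
    gains = 2.0 ** np.asarray(labels, dtype=float) - 1.0
    disc = delta_c(np.arange(1, len(gains) + 1, dtype=float))
    k = min(k, len(gains))
    imcg = np.sum(np.sort(gains)[::-1][:k] * disc[:k])
    # (2^{l_i} - 2^{l_j}) equals the difference of the (2^l - 1) gains.
    return -(gains[i - 1] - gains[j - 1]) * (disc[i - 1] - disc[j - 1]) / imcg
```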
2.2 Continuation Methods for LtR

Applying a Continuation Method (CM) to a cost function C means defining a sequence of cost functions C_γ with γ ∈ [0, 1], such that the complexity of the function increases as γ increases. Therefore, C_0 represents a highly smoothed version of the original cost function corresponding to C_1. We implement CMs in the context of forests of decision trees as a two-step learning process, where the initial trees are trained by minimizing a cost function C_0 and the remaining trees are trained by optimizing C_1. In this work we did not explore more complex multi-stage scenarios, which would be a straightforward extension. We denote continuation methods as C0T0 C1T1, meaning that the first T_0 trees of the ensemble minimize C_0, while the remaining T_1 trees minimize C_1, additively refining the prediction of the first T_0 trees.

In order to apply a CM to LambdaMart we need to smooth the LambdaMart loss function. Smoother variants of nDCG have been previously proposed, e.g. SoftNDCG [15]; however, their performance did not prove to be significantly better than the original LambdaMart loss function. Therefore, we look at two different smoother replacements of the nDCG measure.

The first option we consider is to use as C_0 the Mean Square Error (MSE) as a smooth variant of the target nDCG measure. Even if MSE is easily differentiable and an MSE equal to 0 corresponds to the maximum nDCG value, the MSE cost cannot be considered a smooth approximation of nDCG. Nevertheless, Gradient Boosted Regression Trees that minimize MSE are known to perform well, even at optimizing nDCG. The rationale of using C_0 equal to MSE is that of using a smooth function that alone exhibits good performance as a good seed for the refinement that the optimization of nDCG or nMCG may provide.

The second option we consider is to use Recall@k as C_0. Recall is not a differentiable function either, as it suffers from the same issues as nDCG. Still, it is an easier function to optimize, because it considers binary relevance instead of graded relevance, and it discriminates between documents above or below the cut-off k without further discounting according to document ranks. The rationale is to train the first portion of the forest to place the most relevant documents above the cut-off, and then to complete the training with the target metric, nDCG or nMCG, to adjust their relative ranking. Since Recall is not differentiable, we devised a LambdaMart-based approach similar to nMCG-MART. We define ΔnRecall@k_ij as the normalized change in Recall@k when swapping the i-th and j-th documents in the current ranking. Given a relevance binarization threshold ρ, ΔnRecall@k_ij is defined as follows:

$$\Delta\mathrm{nRecall@}k_{ij} = \frac{-\left(\mathbb{1}_{l_i \ge \rho} - \mathbb{1}_{l_j \ge \rho}\right)\left(\mathbb{1}_{i \le k} - \mathbb{1}_{j \le k}\right)}{\sum_h \mathbb{1}_{l_h \ge \rho}}$$

where 1_p is the indicator function evaluating to 1 if predicate p is true and 0 otherwise.

Finally, in addition to C_1 being equal to nDCG, we also consider the case where nMCG-MART is used as C_1. Our goal is to evaluate the added value of exploiting the user dynamics in producing the final ranking, even when the first trees of the forest were built by optimizing a different metric. In conclusion, we compare the effectiveness of LambdaMart and nMCG-MART and we assess the benefit they can receive from pre-training on a different metric, which allows the two algorithms to initiate their search process from a different point in the solution space, possibly leading to a better final solution.

3 Experimental Evaluation

We remark that, since there is no publicly available dataset providing, at the same time, user session data, document relevance and query-document pair features, we have to use two different datasets in our analysis. Therefore, we calibrate the user model on the click log dataset provided by Yandex [13], while training, validation and testing of the LtR models are performed on MSLR-WEB30K and MSLR-WEB10K, provided by Microsoft [12]. We used LambdaMart as the reference LtR algorithm. Specifically, we used (and modified) the open source implementation of LambdaMart provided in the LightGBM gradient boosting framework, available at https://github.com/Microsoft/LightGBM, and swept the hyper-parameters with cross-validation so as to maximize the average performance over the validation folds. As significance test, we computed Fisher's randomization test with 100 000 random permutations and α = 0.05, which is the most appropriate statistical test to evaluate whether two approaches differ significantly [14].
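To illustrate the two-step C0T0 C1T1 scheme in practice, here is a minimal sketch with stock LightGBM approximating the mseT200 ndcgT300 configuration discussed below: 200 trees trained on MSE (C_0), then 300 more trees optimizing nDCG@10 with the library's built-in lambdarank objective (C_1), continued via init_model. It is only an approximation of our setup: the nMCG and Recall@k objectives required modifying LightGBM internally and are not reproduced here, and the file path and hyper-parameter values are placeholders.

```python
# Sketch of a two-stage continuation run with stock LightGBM (path and
# hyper-parameters are placeholders, not the paper's tuned values).
import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_svmlight_file

# MSLR folds are distributed in SVMlight format with qid annotations.
X, y, qid = load_svmlight_file("MSLR-WEB10K/Fold1/train.txt", query_id=True)

def group_sizes(qid):
    """Documents per query, in order of first appearance (needed by lambdarank)."""
    _, first, sizes = np.unique(qid, return_index=True, return_counts=True)
    return sizes[np.argsort(first)]

train = lgb.Dataset(X, label=y, group=group_sizes(qid), free_raw_data=False)

# Stage 1: C0 = MSE for the first T0 = 200 trees.
stage0 = lgb.train({"objective": "regression", "learning_rate": 0.05,
                    "num_leaves": 64}, train, num_boost_round=200)

# Stage 2: C1 = nDCG@10 for T1 = 300 more trees, additively refining stage0.
model = lgb.train({"objective": "lambdarank", "eval_at": [10],
                   "learning_rate": 0.05, "num_leaves": 64},
                  train, num_boost_round=300, init_model=stage0)
```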
As a naming convention, all CM approaches are referred to by the name of the first objective function C_0 and the number of trees T_0 used during the first training phase, followed by the name of the second objective function C_1 and the corresponding number of trees T_1. For instance, recallT300 ndcgT200 means that the model is trained with Recall@10 for the first 300 trees and then with nDCG@10 for the following 200 trees. As reference models for the CM approaches we use nMCG-MART and mseT200 ndcgT300, the best performing model proposed in [7].

3.1 Experimental Results

In this section we summarize the main findings without reporting any figure or table; the complete results can be found in the original paper [8]. First, we evaluate the performance of the proposed models against the baselines on MSLR-WEB10K and MSLR-WEB30K:

– With nDCG@10, CM approaches combining Recall and nMCG, namely recallT300 nmcgT200 and recallT400 nmcgT100, achieve the best performance on each fold. They are significantly different from both the baseline and the corresponding CM approaches which combine Recall followed by nDCG. This supports our hypothesis that the combination of CM and the user dynamics integrated by nMCG can boost LambdaMart performance.
– With nDCG@100, CM approaches using nMCG are less effective and mseT200 ndcgT300 is the best performing model, significantly improving over the baseline both on each fold and on the average across folds. Moreover, CM approaches using nMCG are not significantly different from those using nDCG. However, they still improve significantly over the baseline.
– With Recall@10, the results lie somewhere in between those for nDCG@10 and nDCG@100. mseT200 ndcgT300 is the best performing model when the average across folds is considered. However, if we break down the results by fold, recallT400 nmcgT100 is the best performing model on three folds out of five for MSLR-WEB30K. Moreover, recallT400 nmcgT100 is significantly better than both the baseline and recallT400 ndcgT100.
– With Recall@100, CM approaches using nDCG, i.e. mseT200 ndcgT300, recallT300 ndcgT200 and recallT400 ndcgT100, perform better than those using nMCG. When considering the score across all the folds, mseT200 ndcgT300 is still the best performing model. Thus, when using Recall@100 as evaluation measure, we reach conclusions similar to those inferred with nDCG@100.
– With Expected Reciprocal Rank (ERR)@10, recallT300 nmcgT200 is the best performing model and it is the only approach which is significantly better than the baseline. Moreover, CM approaches using nMCG are also significantly better than those using nDCG, with respect to both each separate fold and the average across folds.

To summarize, CM approaches exploiting the user dynamics perform better at the top rank positions, while CM approaches using nDCG, such as mseT200 ndcgT300, are somewhat better at retrieving a higher number of relevant documents in the first 100 rank positions.

Second, we repeated the previous analysis by considering the query type, i.e. we compared the proposed models on navigational and informational queries separately. At a high level, the experimental results show that:

– With navigational queries, there are just a few models which are significantly better than the baselines, independently of the measure or the fold. Furthermore, none of the models performs markedly better than the others; there is instead a lot of variability depending on the measure and the fold under consideration.
We argue that with navigational queries the number of relevant documents is limited and systems have little room for manoeuvre: they need to place the most relevant document at the top of the ranked list and, when doing this properly, they all perform equally well in achieving this task.
– The experimental comparison of the models on informational queries provides more insights on the performance of CM approaches combined with the user dynamics. The exploitation of the user dynamics together with CM boosts the ranking of queries for which several relevant documents are available. Indeed, it helps in identifying those documents that are more relevant than others and in placing them at the beginning of the ranked list.

Finally, we analyse the tree-wise performance of the proposed models: we compare the performance of the different models with a varying number of trees. All CM approaches leveraging the user dynamics turn out to be consistently better than the baselines as the number of trees increases. Moreover, each performance boost occurs exactly at the switching point T_0, showing that switching the objective function from Recall to nMCG is beneficial for the learning algorithm at any switching point.

4 Conclusion and Future Work

This paper investigated whether integrating in LtR the knowledge of the user-interaction model and the possibility of targeting different objective functions is profitable and allows us to train more effective ranking functions. The results of the experiments, measured with different metrics and cut-offs, gave us a clearer understanding of the problem addressed and confirmed our initial intuitions. We showed, by also breaking down the analysis to different classes of queries, that the proposed continuation methods exploiting our user-aware nMCG measure consistently outperform the baselines by statistically significant margins.

As future work we will study different methods for applying click models to LtR datasets. A possibility suggested in [4] could be to exploit editorial relevance labels as a link between different datasets and use a trained click model to generate new click-based features for datasets not providing them.

Acknowledgements: This paper is partially supported by the BIGDATAGRAPES (EU H2020 RIA, grant agreement N. 780751) and the OK-INSAID (MIUR-PON 2018, grant agreement N. ARS01_00917) projects. The work is also partially funded by the "DAta BenchmarK for Keyword-based Access and Retrieval" (DAKKAR) Starting Grants project, sponsored by the University of Padua and Fondazione Cassa di Risparmio di Padova e di Rovigo, and by AMAOS (Advanced Machine Learning for Automatic Omni-Channel Support), funded by Innovationsfonden, Denmark.

References

[1] Allgower, E.L., Georg, K.: Numerical Continuation Methods. An Introduction. Springer-Verlag, Heidelberg, Germany (1980)
[2] Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to Rank using Gradient Descent. In: ICML 2005. pp. 89–96. ACM Press, New York, USA (2005)
[3] Burges, C.J.C.: From RankNet to LambdaRank to LambdaMART: An Overview. Tech. rep., Microsoft Research, MSR-TR-2010-82 (2010)
[4] Chuklin, A., Markov, I., de Rijke, M.: Click Models for Web Search. Morgan & Claypool Publishers, USA (2015)
[5] Coleman, T.F., Wu, Z.: Parallel Continuation-based Global Optimization for Molecular Conformation and Protein Folding. Journal of Global Optimization 8(1), 49–65 (1996)
[6] Ferro, N., Lucchese, C., Maistro, M., Perego, R.: On Including the User Dynamic in Learning to Rank. In: SIGIR 2017. pp. 1041–1044. ACM Press, New York, USA (2017)
[7] Ferro, N., Lucchese, C., Maistro, M., Perego, R.: Continuation Methods and Curriculum Learning for Learning to Rank. In: CIKM 2018. pp. 1523–1526. ACM Press, New York, USA (2018)
[8] Ferro, N., Lucchese, C., Maistro, M., Perego, R.: Boosting Learning to Rank with User Dynamics and Continuation Methods. Information Retrieval Journal (2019)
[9] Järvelin, K., Kekäläinen, J.: Cumulated Gain-Based Evaluation of IR Techniques. TOIS 20(4), 422–446 (2002)
[10] Liu, T.Y.: Learning to Rank for Information Retrieval. FnTIR 3(3), 225–331 (2009)
[11] Norris, J.R.: Markov Chains. Cambridge University Press, UK (1998)
[12] Qin, T., Liu, T.Y.: Introducing LETOR 4.0 Datasets. arXiv.org, Information Retrieval (cs.IR), arXiv:1306.2597 (2013)
[13] Serdyukov, P., Craswell, N., Dupret, G.: WSCD2012: Workshop on Web Search Click Data 2012. In: WSDM 2012. pp. 771–772. ACM Press, New York, USA (2012)
[14] Smucker, M.D., Allan, J., Carterette, B.A.: A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. In: CIKM 2007. pp. 623–632. ACM Press, New York, USA (2007)
[15] Taylor, M., Guiver, J., Robertson, S., Minka, T.: SoftRank: Optimizing Non-Smooth Rank Metrics. In: WSDM 2008. pp. 77–86. ACM Press, New York, USA (2008)