1. INTRODUCTION

ULTRE framework: a framework for Unbiased Learning to Rank Evaluation based on simulation of user behavior

Yurou Zhao

Jiaxin Mao

Qingyao Ai

Renmin University of China, China; University of Utah, USA

Unbiased learning to rank (ULTR) with biased user behavior data has received considerable attention in the IR community. However, how to properly evaluate and compare different ULTR approaches has not been systematically investigated, and there is no shared task or benchmark specifically developed for ULTR. In this paper, we propose the Unbiased Learning to Rank Evaluation (ULTRE) framework. The proposed framework utilizes multiple click models in generating simulated click logs and supports the evaluation of both offline, counterfactual and online, bandit-based ULTR models. Our experiments show that the ULTRE framework is effective in click simulation and in comparing different ULTR models. The ULTRE framework will be used in the Unbiased Learning to Rank Evaluation Task (ULTRE), a pilot task in NTCIR 16.

Keywords: Unbiased Learning to Rank, Evaluation, Click Model, Click Simulation

1. INTRODUCTION

Interest in Learning to Rank (LTR) approaches that learn from user interactions has increased recently, as users' interactions with search systems can reflect their implicit relevance feedback for the search results. Though collecting user clicks is much less costly and more convenient than collecting expert annotations, user clicks contain different types of bias (such as position bias) and noise. Therefore, unbiased learning to rank (ULTR), which aims at learning a ranking model from noisy and biased user clicks, has become a trending topic in IR. There are two main categories of algorithms for ULTR: 1) offline (counterfactual) LTR, which learns an unbiased ranking model in an offline manner with batches of biased, historical click logs ([ 1, 2, 3 ]); 2) online ULTR, which makes online interventions in the ranking and extracts unbiased feedback or derives unbiased gradients for model training ([ 4, 5 ]).

With a variety of models having been proposed for unbiased learning to rank, how to properly evaluate and compare different ULTR models still needs more research. Previous works on ULTR often use a simulation-based evaluation approach due to the lack of real search logs and online search systems. Such an approach relies on predefined user behavior models and publicly available learning-to-rank datasets with item-level relevance judgments to simulate user clicks. Using the simulation-based approach, we can train ULTR models with the simulated clicks and then evaluate the models on test sets with expert annotations.

Though widely adopted, current evaluation approaches have some limitations. First, there are no standard evaluation settings or shared evaluation benchmarks for the ULTR community, as existing studies on ULTR often rely on their own evaluation apparatus and adopt different assumptions in click simulation, making the experimental results reported in different papers incomparable. Second, most studies only use a single user behavior model to simulate clicks, which may not fully capture the diverse patterns of real user behavior. It may also introduce systematic biases into the comparison among ULTR models, as the ULTR model that shares the same user behavior assumption with the click simulation model might be preferred by the evaluation ([ 6 ]).

To overcome the above limitations, we propose an unbiased learning to rank evaluation (ULTRE) framework. In this framework, we focus on extending and improving the click simulation phase in previous ULTR evaluation. Specifically, instead of using a single, over-simplified click model, we use multiple user behavior models that are trained and calibrated on real query logs as several click simulators. Equipped with the click simulators, we further design two evaluation protocols for offline and online ULTR models, respectively.

In our empirical experiments, we implemented four different click simulators. After calibration with real click logs, we incorporate them into our ULTRE framework. Then we compare several ULTR models under the framework to verify the usefulness and effectiveness of our ULTRE framework.

We believe the ULTRE framework can serve as a shared benchmark and evaluation service for ULTR. It may also support an in-depth investigation of the simulation-based evaluation approach. The ULTRE framework will be used in the Unbiased Learning to Rank Evaluation Task (ULTRE), a pilot task in NTCIR 16.

2. ULTRE FRAMEWORK

This section details the ULTRE framework, as shown in Figure 1.

The whole process is made up of three stages: 1) generating simulated click logs; 2) training the ULTR models with the simulated click logs of the training queries; 3) evaluating the ULTR models with the relevance annotations in the validation and test sets.

[Figure 1: flowchart of the ULTRE framework. Labels recoverable from the figure: real click logs; traditional LTR dataset with relevance annotation (training, validation, and test queries); click simulators (UBM, MCM, ...); ranking lists; simulated clicks; synthetic train sets; offline models (SVMRank+IPW, DNN+DLA, ...); online models (DBGD, PDGD, ...); decision nodes "Train an offline or online ULTR model?" and "Have participants received 100% user impressions?".]

Stage 1: Simulation of clicks (steps 1-5). This stage is the key to the evaluation because the quality of the simulated clicks may impact the performance of the trained ULTR model. It contains the following steps:

• Step 1: Train and calibrate four user behavior models (PBM, UBM, DCM, MCM) with real query logs.
• Step 2: Construct different click simulators based on the models obtained in Step 1.
• Step 3: Collect the ranking lists for the training queries and the corresponding relevance annotations for the documents in the ranking lists. Depending on which class of ULTR models we want to train and evaluate, we generate the ranking lists differently. For the offline, counterfactual ULTR models, we train a simple production ranker on a small proportion of the train set with relevance labels to generate ranking lists for all train queries. For the online ULTR models, the ranking lists are generated by the online ULTR model that is being evaluated.
• Step 4: Use the click simulators constructed in Step 2 (based on the models trained and calibrated in Step 1) to generate simulated click logs for the ranking lists obtained in Step 3.
• Step 5: Finally, collect the generated clicks and use them for the training of ULTR models. Because we construct four simulators, we construct four synthetic train sets.

Stage 2: Training of ULTR models (step 6). After generating the synthetic training sets in Stage 1, we can use the different synthetic train sets to train the ULTR models. It is worth mentioning that if the model is an online one, steps 3-5 in Stage 1 and Stage 2 will be repeated multiple times to simulate the online learning procedure.

Stage 3: Evaluation of ULTR models (step 7). Finally, in the last stage, we evaluate the trained ULTR models on the validation and test queries. We can compute relevance-based evaluation metrics, such as nDCG, MAP, and MRR, with the relevance annotations in the validation and test sets, to evaluate the ranking performance of the trained models.
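As an illustration of Stage 3, the following is a minimal Python sketch of two of the relevance-based metrics named above. It assumes the common exponential gain `2^rel - 1` and a `log2` rank discount for nDCG, and a hypothetical `threshold` parameter deciding which labels count as relevant for the reciprocal rank; the text does not specify these choices, so they are assumptions of the sketch rather than the framework's definition.

```python
import math
from typing import Sequence


def dcg_at_k(labels: Sequence[int], k: int) -> float:
    """DCG@k with exponential gain (2^rel - 1) and log2 discount (assumed form)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(labels[:k]))


def ndcg_at_k(ranked_labels: Sequence[int], k: int) -> float:
    """nDCG@k for one query; labels are given in the order the model ranked them."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(ranked_labels: Sequence[int], threshold: int = 1) -> float:
    """1 / rank of the first document whose label reaches the (assumed) threshold."""
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel >= threshold:
            return 1.0 / rank
    return 0.0
```

Averaging `ndcg_at_k` or `reciprocal_rank` over the validation or test queries gives the per-run score that would be compared across ULTR models.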

2.1. Evaluation protocol

Based on the ULTRE framework, we can provide a shared evaluation task and benchmark for the evaluation of different ULTR models. In this section, we develop evaluation protocols that describe how the task organizers of the shared ULTRE task (i.e., the TOs) interact with the participants of the shared task and work together to evaluate the ULTR models developed by the participants. Since there are two categories of ULTR models, we design two evaluation protocols, one for offline ULTR models and the other for online ones.

2.1.1. Evaluation protocol for offline ULTR models

Figure 2 displays the steps in the evaluation protocol for offline ULTR models and shows what each role should do in each step. TOs represent the task organizers, and participants represent those who are willing to use the ULTRE framework to evaluate their ULTR models. The protocol consists of three steps:

• Step 1: TOs construct click simulators based on real click logs, and then simulate clicks for all queries in the train set. The participants can then use the simulated clicks to train their ULTR models. As four click simulators equipped with different user behavior models (PBM/UBM/DCM/MCM) are used, TOs produce four synthetic train sets for participants in this step.
• Step 2: Participants train their ULTR models on each synthetic train set. Participants may have a preferred train set, so they are allowed to train the model on a single set only. However, as each set is produced under its own user behavior assumptions, the participants are strongly encouraged to train their models on all training sets. Such exploration can test the robustness of the ULTR model. After training the ULTR models, the participants can submit the ranking lists (runs) for the validation and test queries. Each run submitted by the participants should only use the synthetic data generated by a single click simulator, so ideally, for each ULTR model, we expect the participant to submit four runs.
• Step 3: After receiving the runs submitted by participants, TOs evaluate the runs based on true relevance labels (i.e., expert annotation). Specifically, TOs will show the results on the validation set on the leaderboard and release the official results in the final report.

[Figure 2: Evaluation protocol for offline ULTR models]

[Figure 3: Evaluation protocol for online ULTR models]

2.1.2. Evaluation protocol for online ULTR models

As shown in Figure 3, the evaluation protocol for online ULTR models involves steps similar to those of the offline one. The main difference between them is that the participants can iteratively submit ranking lists to the TOs to get simulated clicks and use them to update their ULTR models in an online process.

• Step 1: Participants submit the ranking lists for training queries generated by their own ULTR models and specify that they want to receive x% of user impressions.
• Step 2: TOs sample x% of all training queries according to the query frequency in the real log. Based on the ranking lists of those selected training queries submitted by the participant in Step 1, TOs construct synthetic training sets following the same process as in Step 1 of the evaluation protocol for offline ULTR models.
• Step 3: Participants update their models with the training data received in Step 2.
• Repeat Steps 1-3 until participants have received 100% of impressions.
• Final step: As in the final step of the evaluation protocol for offline ULTR models, TOs evaluate the models on the validation and test sets.
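The iterative exchange between participants and TOs can be sketched as a simple loop. This is a minimal Python sketch under stated assumptions, not the task's actual service: `rank_fn`, `simulate_clicks`, and `update_fn` are hypothetical callables standing in for the participant's ranker, the TOs' click simulator, and the participant's model update; impressions are assumed to be sessions drawn with replacement in proportion to query frequency, and the participant is assumed to request a fixed `step_pct` per round.

```python
import random
from typing import Callable, Dict, List, Tuple

Session = Tuple[str, List[str], List[int]]  # (query, ranking list, simulated clicks)


def sample_sessions(query_freq: Dict[str, int], pct: float,
                    rng: random.Random) -> List[str]:
    """TO side: draw pct% of impressions, proportional to query frequency (assumed)."""
    total = sum(query_freq.values())
    n_sessions = max(1, round(total * pct / 100))
    queries = list(query_freq)
    weights = [query_freq[q] for q in queries]
    return rng.choices(queries, weights=weights, k=n_sessions)


def run_online_protocol(query_freq: Dict[str, int],
                        rank_fn: Callable[[str], List[str]],
                        simulate_clicks: Callable[[str, List[str]], List[int]],
                        update_fn: Callable[[List[Session]], None],
                        step_pct: float = 25, seed: int = 0) -> int:
    """Repeat Steps 1-3 until 100% of impressions are received; return the round count."""
    rng = random.Random(seed)
    received, rounds = 0.0, 0
    while received < 100:
        pct = min(step_pct, 100 - received)
        # Step 1: the participant submits ranking lists for the training queries.
        rankings = {q: rank_fn(q) for q in query_freq}
        # Step 2: the TOs sample impressions and simulate clicks on the submitted lists.
        batch = [(q, rankings[q], simulate_clicks(q, rankings[q]))
                 for q in sample_sessions(query_freq, pct, rng)]
        # Step 3: the participant updates the model with the received training data.
        update_fn(batch)
        received += pct
        rounds += 1
    return rounds
```

With `step_pct=40`, for example, the loop runs three rounds (40%, 40%, and the remaining 20%), after which the final evaluation step would take place.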

2.2. Construct click simulators

This section provides more details about the process of constructing the click simulators.

2.2.1. Choice of user behavior models

Compared with previous studies that only use a single click model, we use the following user behavior models:

• Position-Based Model (PBM) [ 7 ]: a click model that assumes the click probability of a search result only depends on its relevance and its ranking position.
• Dependent Click Model (DCM) [ 8 ]: a click model that is based on the cascade assumption that the user will sequentially examine the results list and find attractive results to click until she feels satisfied with the clicked result.
• User Browsing Model (UBM) [ 9 ]: a click model that assumes the examination probability of a search result depends on its ranking position and the distance to the last clicked result.
• Mobile Click Model (MCM) [ 10 ]: a click model that considers the click necessity bias (i.e., some vertical results can satisfy users' information need without a click) in user clicks.
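To make the difference between the examination assumptions concrete, the sketch below contrasts PBM and UBM click probabilities in Python. It is an illustration only: the parameter tables (`exam_by_rank`, `exam_by_rank_dist`, `attr_by_rel`) are hypothetical names, ranks are assumed to be 1-indexed, and attractiveness is assumed to be looked up by relevance label.

```python
from typing import Dict, Optional, Tuple


def pbm_click_prob(rank: int, rel: int,
                   exam_by_rank: Dict[int, float],
                   attr_by_rel: Dict[int, float]) -> float:
    """PBM: P(click) = P(examine | rank) * P(attractive | relevance)."""
    return exam_by_rank[rank] * attr_by_rel[rel]


def ubm_click_prob(rank: int, last_click_rank: Optional[int], rel: int,
                   exam_by_rank_dist: Dict[Tuple[int, int], float],
                   attr_by_rel: Dict[int, float]) -> float:
    """UBM: examination depends on the rank and the distance to the last click.

    With no previous click, the last clicked rank is treated as 0, so the
    distance equals the rank itself.
    """
    prev = last_click_rank if last_click_rank is not None else 0
    return exam_by_rank_dist[(rank, rank - prev)] * attr_by_rel[rel]
```

Under PBM the same document at the same rank always has the same click probability, whereas under UBM it also varies with where the user last clicked.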

2.2.2. Train and calibrate the user behavior models with real query logs

We train and calibrate all the user behavior models on real query logs collected by Sogou.com, a commercial Chinese search engine, so that the synthetic clicks are similar to real user clicks. We split the real logs evenly into training and test sets, then strictly follow the training process of each user behavior model proposed in the original works [ 7, 9, 8, 10 ]. However, to make sure those models can work for all candidate documents, we assume that the attractiveness parameter of each query-document pair only depends on its five-level relevance label (0-4).

2.2.3. Generating clicks with click simulators

Equipped with the trained user behavior models, the working process of click simulators on each query session can be summarized by the code provided in [ 11 ]. We add some necessary modifications to the click generating procedure and show it in Algorithm 1.
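The sequential sampling loop of Algorithm 1 can be sketched in Python as follows. This is a sketch under stated assumptions: `click_prob` is a hypothetical interface standing in for a trained user behavior model, and `cascade_click_prob` is a toy cascade-style model (the user stops after the first click) used only to exercise the loop; it is not one of the four calibrated simulators.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical model interface:
# (position, previous clicks, relevance label, vertical type) -> P(click at position)
ClickProbFn = Callable[[int, Sequence[int], int, str], float]


def simulate_session(click_prob: ClickProbFn,
                     rel_labels: Sequence[int],
                     vertical_types: Sequence[str],
                     rng: random.Random) -> List[int]:
    """Sample clicks top-down, conditioning each position on the earlier clicks."""
    clicks: List[int] = []
    for i, (rel, vert) in enumerate(zip(rel_labels, vertical_types)):
        p = click_prob(i, clicks, rel, vert)
        clicks.append(1 if rng.random() < p else 0)  # Bernoulli(p) draw
    return clicks


def cascade_click_prob(attr_by_rel: dict) -> ClickProbFn:
    """Toy cascade model: no further clicks once a click has occurred."""
    def prob(i: int, prev_clicks: Sequence[int], rel: int, vert: str) -> float:
        return 0.0 if any(prev_clicks) else attr_by_rel[rel]
    return prob
```

Any model that can return the conditional click probability for the next position, given the clicks sampled so far, can be plugged into `simulate_session` in the same way.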

Algorithm 1 Generating synthetic clicks with a click simulator for a query session
Input: user behavior model $M$;
  query session $s$ consisting of query $q$ and ranking list $d_1, \dots, d_n$;
  vector of relevance labels for documents $(r_1, \dots, r_n)$;
  vector of vertical types for documents $(v_1, \dots, v_n)$
Output: vector of simulated clicks $(c_1, \dots, c_n)$
1: for $i = 1 \to n$ do
2:   Compute $p = P(C_i = 1 \mid C_1 = c_1, \dots, C_{i-1} = c_{i-1})$ using previous clicks $c_1, \dots, c_{i-1}$, relevance label $r_i$, vertical type $v_i$ and parameters of $M$
3:   Generate random value $c_i$ from Bernoulli($p$)
4: end for

3. EXPERIMENTS

We conduct a series of experiments to answer the following research questions: RQ1: How do different click simulators perform in predicting clicks and generating synthetic click logs? RQ2: Can we evaluate existing ULTR models with the ULTRE framework?

3.1. Examining click simulators (RQ1)

Experiment Setup

Datasets. The dataset used in training the user behavior models was sampled from a real search log dataset released by the Chinese commercial search engine Sogou.com. We divide the dataset into training and test sets with proportion 1:1. The statistics of the dataset are shown in Table 1.

Evaluation Metrics. For the click prediction task, we report the log-likelihood (LL) and perplexity (PPL) of each user behavior model. Higher values of log-likelihood and lower values of perplexity indicate better click prediction performance.

To measure the quality of generated samples from different click simulators, we compute the Kullback-Leibler (KL) divergence between the distribution of real clicks and the distribution of simulated clicks, as well as Reverse/Forward PPL.

The KL-divergence computation takes into account the number of unique queries and the number of sessions observed for each particular query. This metric can be calculated for two click distributions: the distribution over sessions, which shows the percentage of sessions with a certain number of clicks, and the distribution over ranks, which shows how many times a certain rank was clicked. Lower values of the metric correspond to better click simulation performance.

The second metric was first used in [ 12 ] to test the distributional coverage of click models. Reverse PPL is the PPL of a surrogate model (an intermediary to evaluate the similarity between the generated samples and the real data samples) that is trained on generated samples and evaluated on real data. Forward PPL is the PPL of a surrogate model that is trained on real data and evaluated on generated samples.

Performance on predicting clicks

The results for the click prediction task on the test set are presented in Table 2, from which we can observe that MCM performs the best among all models, similar to the observation in [ 10 ]. However, the other models also perform relatively well, as their metric values are all close to the ideal values (0 for LL and 1 for PPL).

Quality of generated click logs

This section measures the similarity between the real log and the generated click logs on the test set.

Table 3 summarizes the simulation performance of the four user behavior models and a baseline model (which always simulates a click on the first position) in terms of the KL-divergence of the click distribution over sessions (session-based KL) and over ranks (rank-based KL), from which we can make the following observations:

(1) UBM generates the best samples in terms of session-based KL-divergence, while MCM performs the best in terms of rank-based KL-divergence. Considering that the session-based KL-divergence of MCM is only slightly higher than that of UBM, it is fair to say that the click logs generated by MCM are the most similar to the real logs.

(2) The samples of DCM are better than the samples generated by the baseline model in terms of the session-based KL-divergence; however, its performance in terms of the rank-based KL-divergence is rather low. A possible reason is that DCM does not use a rank-based examination parameter as the other three models do. In contrast, the samples of PBM are better regarding the rank-based KL-divergence and worse regarding the session-based KL-divergence. Such observations may be caused by the simple rank-based assumption used in PBM.

[Table note: Significant improvements or degradations with respect to PBM-IPS are indicated with +/- in the paired samples t-test with p ≤ 0.05. The best performance is highlighted in boldface.]
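The session-based and rank-based KL-divergences discussed above can be illustrated with a short Python sketch. It makes several simplifying assumptions that are not taken from the text: sessions are pooled across all queries (no per-query weighting), the divergence is computed in the direction KL(real || simulated), and a small `eps` is added for numerical stability.

```python
import math
from collections import Counter
from typing import List, Sequence


def clicks_per_session_dist(sessions: Sequence[Sequence[int]],
                            max_clicks: int) -> List[float]:
    """Share of sessions with 0, 1, ..., max_clicks clicks (larger counts are capped)."""
    counts = Counter(min(sum(s), max_clicks) for s in sessions)
    return [counts[k] / len(sessions) for k in range(max_clicks + 1)]


def clicks_per_rank_dist(sessions: Sequence[Sequence[int]],
                         n_ranks: int) -> List[float]:
    """Share of all clicks that landed on each rank."""
    counts = [0] * n_ranks
    for s in sessions:
        for rank, click in enumerate(s[:n_ranks]):
            counts[rank] += click
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_ranks


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-10) -> float:
    """KL(p || q) with additive smoothing; p is the real distribution (assumed direction)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)
```

For example, `kl_divergence(clicks_per_rank_dist(real, 10), clicks_per_rank_dist(simulated, 10))` would give a pooled rank-based score, where `real` and `simulated` are lists of per-session click vectors.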

Table 4 shows the Reverse/Forward PPL of surrogate DCM/PBM/UBM/MCM models based on different synthetic datasets generated from target models (DCM/PBM/UBM/MCM).

To conduct an adequate and fair experiment, each model in turn takes the role of the surrogate model. For example, when we choose DCM as the surrogate model, the generated samples of the other three models (PBM/UBM/MCM) can be compared.

From Table 4 we can make the following observations: (1) Samples generated by UBM and MCM achieve better performance than those of DCM and PBM, because when the surrogate model is UBM or MCM, MCM-samples and UBM-samples, respectively, outperform DCM-samples and PBM-samples. (2) The comparison between UBM-samples and MCM-samples is less clear-cut. When the surrogate model is DCM, MCM-samples are better in terms of both Reverse PPL and Forward PPL; however, when the surrogate model is PBM, which samples are better depends on the metric. As a result, we cannot conclude whether the samples generated by UBM or by MCM are the most similar to the real logs.
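The Reverse/Forward PPL procedure can be sketched in Python as follows. The sketch assumes the standard click-model perplexity (computed per rank with base-2 logarithms and averaged over ranks) and uses a deliberately simple, hypothetical surrogate, `fit_rank_ctr`, which predicts the per-rank click-through rate; the surrogates in the experiments are the DCM/PBM/UBM/MCM models themselves.

```python
import math
from typing import Callable, List, Sequence

Sessions = Sequence[Sequence[int]]
Predictor = Callable[[Sessions], List[List[float]]]


def perplexity(pred_probs: Sequence[Sequence[float]], clicks: Sessions,
               eps: float = 1e-10) -> float:
    """Average over ranks of 2^(-mean log2-likelihood of the observed clicks)."""
    n_ranks = len(clicks[0])
    per_rank = []
    for r in range(n_ranks):
        total = 0.0
        for probs, obs in zip(pred_probs, clicks):
            q = min(max(probs[r], eps), 1 - eps)  # clip to avoid log2(0)
            total += math.log2(q) if obs[r] else math.log2(1 - q)
        per_rank.append(2 ** (-total / len(clicks)))
    return sum(per_rank) / n_ranks


def fit_rank_ctr(sessions: Sessions) -> Predictor:
    """Hypothetical surrogate: predicts the training click-through rate of each rank."""
    n = len(sessions)
    ctr = [sum(s[r] for s in sessions) / n for r in range(len(sessions[0]))]
    return lambda data: [ctr for _ in data]


def reverse_forward_ppl(fit: Callable[[Sessions], Predictor],
                        real: Sessions, generated: Sessions):
    """Reverse: train on generated, test on real. Forward: train on real, test on generated."""
    reverse = perplexity(fit(generated)(real), real)
    forward = perplexity(fit(real)(generated), generated)
    return reverse, forward
```

A perplexity of 1 corresponds to perfect prediction, and a surrogate that always predicts 0.5 yields a perplexity of 2.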

3.2. Evaluating ULTR models with the ULTRE framework (RQ2)

To answer RQ2, we evaluate several offline ULTR models with the ULTRE framework.

Dataset and Simulation Setup

The dataset used to evaluate ULTR models is based on Sogou-SRR [ 13 ], a public dataset for relevance estimation and ranking in Web search. We select 1,211 unique queries with at least 10 successfully crawled results: 1,011 for training, 100 for validation, and 100 for testing. In addition, we use a stratified sampling approach to ensure that the frequency of queries in each dataset is consistent with that in the real logs. As mentioned in Section 2, we need a production ranker to produce ranking lists for training queries, so we trained a lambdaMART model with 1% of the data randomly sampled from the original training set (with 5-level relevance annotations). After that, we follow the click-simulation process in the ULTRE framework. Table 5 shows the details of the dataset we constructed.

Model Setup and Evaluation

We chose three offline ULTR models as our candidate models: IPW [ 1 ] (named PBM-IPS in [ 6 ]), CM-IPS [ 6 ], and DLA [ 2 ].

To conduct a fair comparison, we followed the settings in [ 14 ], using a multi-layer perceptron network (MLP) with three hidden layers (with 512, 256, and 128 neurons) as the ranking model for all candidate models, and set the batch size to 256. We trained each candidate for 10k steps and chose the ranking model from the iteration with the best performance on the validation set.

The experiment was repeated 10 times to ensure the reliability of the final results. nDCG@5 was used to evaluate the performance of each candidate model.

Evaluation results

Table 6 shows nDCG@5 for three different offline ULTR models trained on different synthetic train sets. The baseline is the performance of the production ranker, and the skyline is the performance of a lambdaMART model trained on the whole train set with human annotations instead of biased clicks. From the results, we can see that:

(1) PBM-IPS performs best when trained on the PBM-based training set, while CM-IPS performs best when trained on the DCM-simulated training set. This observation coincides with the conclusion in [ 6 ] that when the user behavior model used in click simulation and the bias correction method are consistent, the results are better than when they do not agree.

(2) Compared to PBM-IPS and CM-IPS, DLA performs the best on all synthetic train sets, which indicates that DLA is more robust and more adaptive to changes in the user behavior assumption used in the click simulation. That advantage can be attributed to the unification of learning propensity weights (used to correct bias in click data) and learning ranking models proposed in DLA. Such a learning paradigm can help the DLA model adjust its propensity weights automatically to the differences between synthetic training sets, while the PBM-IPS and CM-IPS models cannot.

The above observations demonstrate the usefulness and effectiveness of the ULTRE framework. By using the ULTRE framework, besides evaluating the performance of one particular model as many previous works have done, we can conduct a fair and thorough comparison between different ULTR models.

In addition, we have the chance to investigate the following questions: 1) to what extent the evaluation results are influenced by the user simulation model and by the mismatch between the assumptions of the simulation model and the ranking model; 2) which ULTR model can adapt to the different environments defined by different simulation models and achieve a robust improvement in ranking performance.

4. CONCLUSION AND FUTURE WORK

In this paper, we introduce the ULTRE framework, which aims to improve the simulation approach used in previous ULTR evaluation. Our experiments show that the ULTRE framework can provide simulation-based training sets of both quality and diversity. More importantly, it enables us to conduct a thorough and relatively objective comparison of different ULTR models. We further design two evaluation protocols for using this framework as a shared evaluation service for both offline and online ULTR models.

Our work includes an initial implementation of the ULTRE framework, and some work remains before the final deployment. For example, we plan to adopt neural user behavior models such as the Context-aware Click Simulator (CCS) [ 15 ] for the click simulation, since the user behavior models we used in this work are all based on probabilistic graphical models (PGMs) and neural models may have better click prediction performance. Moreover, the implementation of the online service and a comparison between online ULTR models under the ULTRE framework are still needed, as we only present comparison results for offline ULTR models in this paper. We plan to use the ULTRE framework in the Unbiased Learning to Rank Evaluation Task (ULTRE), a pilot task in NTCIR 16.

[1] Joachims, Swaminathan, T. Schnabel, Unbiased learning-to-rank with biased feedback, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 781-789.

[2] Ai, Bi, Luo, Guo, W. B. Croft, Unbiased learning to rank with unbiased propensity estimation, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 385-394.

[3] Hu, Wang, Peng, Li, Unbiased lambdamart: an unbiased pairwise learning-to-rank algorithm, in: The World Wide Web Conference, 2019, pp. 2830-2836.

[4] Wang, Langley, Kim, E. McCord-Snook, H. Wang, Efficient exploration of gradient space for online learning to rank, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 145-154.

[5] Oosterhuis, M. de Rijke, Differentiable unbiased online learning to rank, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1293-1302.

[6] Vardasbi, M. de Rijke, I. Markov, Cascade model-based propensity estimation for counterfactual learning to rank, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 2089-2092.

[7] Craswell, Zoeter, Taylor, B. Ramsey, An experimental comparison of click position-bias models, in: Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM '08, Association for Computing Machinery, 2008, pp. 87-94.

[8] Guo, C. Liu, Y. M. Wang, Efficient multiple-click models in web search, in: Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM '09, Association for Computing Machinery, New York, NY, USA, 2009, pp. 124-131.

[9] G. E. Dupret, Piwowarski, A user browsing model to predict search engine click data from past observations, in: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 331-338.

[10] Mao, Luo, Zhang, S. Ma, Constructing click models for mobile search, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 775-784.

[11] Malkevich, Markov, E. Michailova, M. de Rijke, Evaluating and analyzing click simulation in web search, in: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 281-284.

[12] Dai, Lin, Zhang, Li, Liu, Tang, He, Hao, Wang, Yu, An adversarial imitation click model for information retrieval, arXiv preprint arXiv:2104.06077 (2021).

[13] Zhang, Y. Liu, Ma, Tian, Relevance estimation with multiple information sources on search engine result pages, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 627-636.

[14] Ai, Yang, Wang, Mao, Unbiased learning to rank: Online or offline?, ACM Trans. Inf. Syst. 39 (2021).

[15] Zhang, J. Mao, Y. Liu, Zhang, Zhang, S. Ma, J. Xu, Q. Tian, Context-aware ranking by constructing a virtual environment for reinforcement learning, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1603-1612.