ULTRE framework: a framework for Unbiased Learning to Rank Evaluation based on simulation of user behavior Yurou Zhao1 , Jiaxin Mao2 and Qingyao Ai3 1 Renmin University of China, China 2 Renmin University of China, China 3 University of Utah, USA Abstract Unbiased learning to rank (ULTR) with biased user behavior data has received considerable attention in the IR community. However, how to properly evaluate and compare different ULTR approaches has not been systematically investigated and there is no shared task or benchmark that is specifically developed for ULTR. In this paper, we propose the Unbiased Learning to Rank Evaluation(ULTRE) framework. The proposed framework utilizes multiple click models in generating simulated click logs and supports the evaluation of both the offline, counterfactual and the online, bandit-based ULTR models. Our experiments show that the ULTRE framework are effective in click simulation and comparing different ULTR models. The ULTRE framework will be used in the Unbiased Learning to Rank Evaluation Task (ULTRE), a pilot task in NTCIR 16. Keywords Unbiased Learning to Rank, Evaluation, Click Model, Click Simulation 1. INTRODUCTION into the comparison among ULTR models as the ULTR model that shares the same user behavior assumption with the click Interest in Learning to Rank (LTR) approaches that learn from simulation model might be preferred by the evaluation ([6]). user interactions has increased recently as users’ interaction with To overcome the above limitations, we propose an unbiased search systems can reflect their implicit relevance feedback for the learning to rank evaluation (ULTRE) framework. In this frame- search results. Though collecting user clicks is much less costly work, we focus on extending and improving the click simulation and more convenient than collecting expert annotations, user phase in previous ULTR evaluation. Specifically, instead of using clicks contain different types of bias (such as position bias) and a single, over-simplified click model, we will use multiple user noise. Therefore,the unbiased learning to rank (ULTR) that aims behavior models that trained and calibrated on real query log at learning a ranking model from the noisy and biased user clicks as several click simulators. Equipped with the click simulators, has become a trending topic in IR. There are two main categories we further design two evaluation protocols for offline and online of algorithms for ULTR: 1) offline (counterfactual) LTR that learns ULTR models, respectively. an unbiased ranking model in an offline manner with batches of In our empirical experiments, we implemented four different biased, historical click logs ([1, 2, 3]) 2) online ULTR which makes click simulators. After calibrations with real click logs, we in- online interventions of ranking and extracting unbiased feedback corporate them into our ULTRE framework. Then we compare or deriving unbiased gradient for modeling training ([4, 5]). several ULTR models under the framework to the verify the use- With a variety of models has been proposed for unbiased learn- fulness and effectiveness of our ULTRE framework. ing to rank, how to properly evaluate and compare different ULTR We believe the ULTRE framework can serve as a shared bench- models still needs more research. Previous works on ULTR often mark and evaluation service for ULTR. It may also support an in- use a simulation-based evaluation approach due to the lack of depth investigation of the simulation-based evaluation approach. real search logs and online search systems. Such approach rely The ULTRE framework will be used in the Unbiased Learning to on predefined user behavior models and public available learning- Rank Evaluation Task (ULTRE), a pilot task in NTCIR 16.1 to-rank datasets with item-level relevance judgments to simulate user clicks. Using the simulation-based approach, we can train ULTR models with the simulated clicks and then evaluate the 2. ULTRE FRAMEWORK models on test sets with expert annotations. Though widely adopted, current evaluation approaches have This section details the ULTRE framework, as shown in Figure 1. some limitations. First, there are no standard evaluation settings The whole process is made up of three stages: 1) generating sim- or shared evaluation benchmarks for the ULTR community as ulated click logs 2) training the ULTR models with the simulated existing studies on ULTR often rely on their own evaluation appa- click logs of the training queries 3) evaluating the ULTR models ratus and adopt different assumptions in click simulation, making with the relevance annotations in the validation and test set. the experimental results reported in different papers incompara- Stage 1 Simulation of clicks (step 1-5) ble. Second, most studies only use a single user behavior model to This stage is the key for the evaluation because the quality of simulate clicks, which may not fully capture the diverse patterns simulated click may impact the performance of the trained ULTR of real user behavior. It may also introduce systematic biases model. It contains the following steps: • Step 1: Train and calibrate four user behavior models (PBM, UBM, DCM, MCM) with real query logs. Causality in Search and Recommendation (CSR) and Simulation of • Step 2: Construct different click simulators based on Information Retrieval Evaluation (Sim4IR) workshops at SIGIR, July 15, 2021 EMAIL: 13516173041@163.com (A. 1); maojiaxin@gmail.com (A. 2); models obtained in step 1. aiqy@cs.utah.edu (A. 3) © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) 1 CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 http://research.nii.ac.jp/ntcir/ntcir-16/ Step 1 PBM DCM Click Simulators Real click logs Step 4 UBM MCM ... ... Step 3 Step 4 Step 5 Training Synthetic train sets queries Simulated clicks Ranking lists Train an online ULTR model Traditional No Have participants received Train an offline or LTR 100% user impressions? online ULTR model? dataset with relevance annotation Yes Train an offline ULTR model Validation queries Online models Offline models DBGD SVMRank+IPW PDGD DNN+DLA Test Evaluation Results ... ... queries Figure 1: ULTRE framework • Step 3: Collect the ranking lists for the training queries 2.1. Evaluation protocol and the corresponding relevance annotations for the doc- Based on the ULTRE framework, we can provide a shared evalu- uments in the ranking lists. Depending on which class of ation task and benchmark for the evaluation of different ULTR ULTR models we want to train and evaluate, we will gen- models. In this section, we develop evaluation protocols that erate the ranking lists differently. Offline, counterfactual describe how the task organizers of the shared ULTRE task (i.e. ULTR models, we will train a simple production ranker on the TOs) interact with the participants of the shared task and a small proportion of the train set with relevance labels to work together to evaluate the ULTR models developed by the generate ranking lists for all train queries. For the online participants. Since there are two categories of ULTR models, we ULTR models, the ranking lists will be generated by the design two evaluation protocols, one for offline ULTR models, online ULTR model that is being evaluated. the other for online ones, respectively. • Step 4: Use the click simulators defined in Step 1 and calibrated in Step 2 to generate simulated click logs for the ranking lists obtained in step 3. 2.1.1. Evaluation protocol for offline ULTR models • Step 5: Finally, collect the generated clicks and use them Figure 2 displays the steps in the evaluation protocol for offline for the training of ULTR models. Because we construct ULTR models, and show what each role should do in each step. four simulators, we will construct four synthetic train TOs represent the task organizers, and participants represent sets. those who are willing to use the ULTRE framework to evaluate Stage 2 Training of ULTR models (step 6) their ULTR models. The protocol consists three steps: After generating the synthetic training set in Stage 1, we can use • Step 1: TOs construct click simulator based on real click different synthetic train sets to train the ULTR models. It is worth log, and then simulate clicks for all queries in the train set. mentioning that if the model is an online one, step 3-5 in stage 1 The participants then can use the simulated clicks to train and stage 2 will be repeated multiple times to simulate the online their ULTR models. As four click simulator equipped with learning procedure. different user behavior models (PBM/UBM/DCM/MCM) Stage 3 Evaluation of ULTR models (step 7) will be used, TOs will produce four synthetic train sets Finally, in the last stage, we evaluate the trained ULTR models on for participants in this step. the validation and test queries. We can compute some relevance- • Step 2: Participants train their ULTR models on each based evaluation metrics, such as nDCG, MAP, and MRR, with the synthetic train set respectively. Participants may have relevance annotations in the validation and test set, to evaluate their preferred train set, as a result, they are allowed to the ranking performance of the trained models. only train the model on a single set. However, as each set is produced under some unique user behavior assump- tions, the participants are strongly encouraged to train their models on all training sets. Such exploration can test the robustness of the ULTR model. After training the Evaluation protocol for offline ULTR models Evaluation protocol for online ULTR models TOs Data Participants TOs Data Participants Construct Click Construct Click Real click log Simulator Real click log Initialize online Simulator ULTR models Simulate clicks x% train queries for training All train queries Simulate clicks Specify x% of user for training impressions is queries queries wanted online ULTR ... Train offline model ULTR models Synthetic train sets (PBM/DCM/UBM/MCM) ... Train online ULTR models ... Submit ranking Synthetic train sets lists for validate (PBM/DCM/UBM/MCM) Evaluate the and test queries Ranking lists for 100% impressions submitted runs has been received validate/test queries Relevance labels for validate/test Submit ranking lists Ranking lists for for validate and test ... query-document validate/test queries queries Evaluate the pairs submitted runs Relevance labels for validate/test query- document pairs Evaluation results Evaluation results Figure 2: Evaluation protocol for offline ULTR models Figure 3: Evaluation protocol for online ULTR models ULTR models, the participants can submit the ranking models on validation and test set. lists (runs) for the validation and test queries. Each run submitted by the participants should only use the syn- thetic data generated by a single click simulator, so ideally, 2.2. Construct click simulators for each ULTR model, we expect the participant to submit This section provides more details about the process of construct- four runs. ing click simulator. • Step 3: After receiving the runs submitted by partic- ipants, TOs evaluate the runs based on true relevance 2.2.1. Choice of user behavior models labels (i.e. expert annotation). Specifically, TOs will show the results on validation set on the leaderboard and release Compared with previous studies that only use a single click model, the official results in the final report. we use the following user behavior models: • Position-Based Model (PBM)[7]: a click model that as- 2.1.2. Evaluation protocol for online ULTR models sumes the click probability of a search result only depends on its relevance and its ranking position. As shown in Figure 3, the evaluation protocol for online ULTR • Dependent Click Model (DCM)[8]: a click model that models involves similar steps in the offline one. However, the is based on the cascade assumption that the user will main difference between them is that the participants can itera- sequentially examine the results list and find attractive tively submit the ranking lists to the TOs to get simulated clicks results to click until she feels satisfied with the clicked and use them to update their ULTR models in an online process. result. • Step 1: Participants submit the ranking lists for training • User Browsing Model (UBM)[9]: a click model that as- queries generated by their own ULTR models and specify sumes the examination probability on a search result de- that they want to receive x% of user impressions. pends on its ranking position and the distance to the last • Step 2: TOs sample x% of all training queries according clicked result. to the query frequency in the real log. Based on the rank- • Mobile Click Model (MCM)[10]: a click model that consid- ing lists of those selected training queries submitted by ers the click necessity bias (i.e.some vertical results can the participant in step 1, TOs construct synthetic train- satisfy users’ information need without a click) in user ing sets following the same process in the step 1 of the clicks. evaluation protocol for offline ULTR models. • Step 3: Participants update their models with the train- 2.2.2. Train and calibrate the user behavior models with ing data received in step 2. real query logs • Repeat Step 1-Step 3 until participants receive 100% of impressions. We train and calibrate all the user behavior models based on real • Step N: Same as the final step in the evaluation protocol query logs collected by Sogou.com, a commercial Chinese search for offline ULTR models, TOs perform evaluation for the engine, so the synthetic clicks are similar to the real user clicks. Table 1 Table 2 Statistics of dataset used in training user behavior model User behavior model performance on LL and PPL metrics. Training Test LL PPL sessions 843,933 836,979 DCM -0.1848 1.2363 unique queries 569 642 PBM -0.1721 1.2059 UBM -0.1513 1.2029 MCM -0.1503 1.1787 We split the real logs evenly into training and test set, then strictly Table 3 follow the training process of each user behavior model proposed KL-divergence between click logs generated by different user behavior in the original works[7, 9, 8, 10]. However, to make sure those models and real log models can work for all candidate documents, we assume that the attractiveness parameter 𝛼 of each query-document pair only KL-divergence depends on its five-level relevance label (0-4). Model Session-based Rank-based Baseline 0.1950 0.3884 DCM 0.1245 0.5325 2.2.3. Generating clicks with click simulators PBM 0.2856 0.2212 Equipped with the trained user behavior models, the working UBM 0.0771 0.2173 process of click simulators on each query session can be sum- MCM 0.0786 0.1951 marized by the code provided in [11]. We add some necessary The first metric is proposed by Malkevich et al.[11], it mea- modifications to the click generating procedure and show it in sures a local KL-divergence for every query and then calculate a Algorithm 1. weighted average of local divergences as follows: ∑︀ Algorithm 1 Generating synthetic clicks with a click simulator 𝑞∈ 𝑄 𝐾𝐿 − 𝑑𝑖𝑣(𝑞).𝑠𝑞 𝐾𝐿 − 𝑑𝑖𝑣 = for a query session ∑︀ 𝑞∈ 𝑄 𝑠𝑞 Input: user behavior model 𝑀 where 𝑄 is the number of unique queries and 𝑠𝑞 is the number query session 𝑠 consisting of query 𝑞 and ranking list of sessions observed for a particular query 𝑞. This metric can 𝑑1 ,... 𝑑𝑛 be calculated for two click distributions: the distribution over vector of relevance labels for documents (𝑟𝑑1 , ...𝑟𝑑𝑛 ) sessions which shows the percentage of sessions with a certain vector of vertical types for documents (𝑣𝑑1 , ...𝑣𝑑𝑛 ) number of clicks and the distribution over ranks which shows Output: vector of simulated clicks (𝑐1 , ...𝑐𝑛 ) how many times a certain rank was clicked. Lower values of the 1: for 𝑖 = 0 → 𝑛 do metrics correspond to better click simulation performance. 2: Compute 𝑝 = 𝑃 (𝐶𝑖 = 1|𝐶1 = 𝑐1 , ...𝐶𝑖−1 = 𝑐𝑖−1 ) The second metric was first used in [12] to test the distribu- using previous clicks 𝑐1 , ...𝑐𝑖−1 , tional converge of click models. Reverse PPL is the PPL of a relevance label 𝑟𝑑𝑖 ,vertical type 𝑣𝑑𝑖 and parameters of surrogate model (an intermediary to evaluate the similarity be- M tween the generated samples and the real data samples) that is 3: Generate random value 𝐶𝑖 from Bernoulli(p) trained on generated samples and evaluate on real data. Forward 4: end for PPL is the PPL of a surrogate model that is trained on real data and evaluated on generated samples. 3. EXPERIMENTS Performance on predicting clicks The results for the click prediction task on test set are presented in We conduct a series of experiments to answer the following re- Table 2, from which we can observe that MCM performs the best search questions: RQ1 : How do different click simulators per- among all models, similar to the observation in [10]. However, forms in predicting clicks and generating synthetic click logs? the others also have a relatively good performance, as their values RQ2: Can we evaluate existing ULTR models with the ULTRE of metrics are all close to the ideal values (0 for LL and 1 for PPL). framework? Quality of generated click logs This section measures the similarity between the real log and 3.1. Examining click simulators (RQ1) generated click logs on test set. Table 3 summarizes the simulation performance of the four Experiment Set up user behavior models and baseline model (always simulate a Datasets The dataset used in training user behavior models were click on the first position) in terms of the KL-divergence of the sampled from real search log dataset released by Chinese com- click distribution over sessions (session-based KL) and over ranks mercial search engine Sogou.com. We divide the dataset into (rank-based KL), from which we can obtain following observa- training and test sets with proportion 1:1. The statistics of the tions: dataset are shown in Table 1. (1) UBM generates the best samples in terms of session-based Evaluation Metrics For click prediction task, we report the log- KL-divergence, while MCM performs the best in terms of rank- likelihood (LL) and perplexity (PPL) of each user behavior model. based KL-divergence. Considering the value of session-based Higher values of log-likelihood and lower values of perplexity KL-divergence of MCM only slightly higher than the one of UBM, indicates better click prediction performance. it’s fair to say that the click logs generated by MCM are the most To measure the quality of generated samples from different similar to the real logs. click simulators, we compute Kullback-Leibler(KL) divergence (2) The samples of DCM are better than the samples gener- between the distribution of real clicks and the distribution of ated by the baseline model in terms of the session-based KL- simulated clicks and Reverse/Forward PPL. divergence, however the performance in terms of the rank-based Table 4 Reverse/Forward PPL of Surrogate DCM/PBM/UBM/MCM models based on different synthetic datasets generated from target user behavior models(DCM/PBM/UBM/MCM) Surrogate DCM Surrogate PBM Surrogate UBM Surrogate MCM Data Reverse PPL Forward PPL Reverse PPL Forward PPL Reverse PPL Forward PPL Reverse PPL Forward PPL Real data 1.2363 1.2363 1.2059 1.2059 1.2029 1.2029 1.1787 1.1787 DCM samples - - 1.2374 1.3688 1.2350 1.3625 1.2191 1.3137 PBM samples 1.2824 1.2272 - - 1.2061 1.1880 1.2053 1.2152 UBM samples 1.2409 1.2317 1.2055 1.1953 - - 1.1802 1.1764 MCM samples 1.2388 1.2248 1.2061 1.1841 1.2031 1.1831 - - KL-divergence is rather low. A possible reason for that is DCM Table 5 does not use rank-based examination parameter as the other three Statistics of ULTRE dataset models. On the contrast, the samples of PBM are better regarding Training Validation Test the rank-based KL-divergence and worse regarding session-based Unique queries 1,011 100 100 KL-divergence. Such observations may caused by the simple Session 144,675 100 100 5-level relevance 5-level relevance rank-based assumption used in PBM. Label clicked(1) or not(0) annotations(0-4) annotations(0-4) Table 4 shows the results of Reverse/Forward PPL of surro- Table 6 gate DCM/PBM/UBM/MCM models based on different synthetic Comparison of offline ULTR models on ULTRE data datasets generated from target models (DCM/PBM/UBM/MCM). To conduct an adequate and fair experiment, all models take PBM DCM UBM MCM the role of the surrogate model. For example, when we choose production ranker 0.7815 DCM as the surrogate model, the generated samples of the other (baseline) Full-info (skyline) 0.8182 three models (PBM/UBM/MCM)can be compared. PBM-IPS(IPW) 0.8017 0.7826 0.8064 0.7647 From Table 4 we can obtain the following observations: CM-IPS 0.7894 0.7932 0.8050 0.7778 (1) Samples generated by UBM and MCM achieves better per- DLA 0.8119 0.8173+ 0.8107 0.7932+ formance than the ones of DCM and PBM, for the reason that when the surrogate model is UBM and MCM, MCM-samples Significant improvements or degradations with respect to PBM- and UBM-samples outperforms DCM-samples and PBM-samples IPS are indicated with +/- in the paired samples t-test with 𝑝 ≤ 0.05. The best performance is highlighted in boldface. respectively. (2) The comparison between UBM-samples and MCM-samples is To conduct a fair comparison, we followed the settings in [14], a little bit complex. Since when surrogate model is DCM, MCM- using a multiple-layer perceptron network (MLP) with three hid- samples are better in terms of both Reverse PPL and Forward den layers (with 512,256,128 neurons) as the ranking model for PPL, however when surrogate model is PBM, the better samples all candidate models and set the batch size to 256. We trained are different regarding different metric. As a result, we cannot each candidate for 10k steps, and chose the ranking model in conclude whether the samples generated by UBM or MCM are the iteration that has the best performance on the validation set. the most similar to the real logs. Such experiment was repeated 10 times to ensure the reliability of final results. nDCG@5 was used to evaluate the performance 3.2. Evaluating ULTR models with the ULTRE of each candidate model. framework (RQ2) Evaluation results Table 6 shows nDCG@5 for three different offline ULTR mod- To answer RQ2, we evaluate several offline ULTR models with els trained on different synthetic train sets. The baseline we used the ULTRE framework. is the performance of production ranker and the skyline is the Dataset and Simulation Setup performance of a lambdaMART model trained on the whole train The dataset used to evaluate ULTR models is based on Sogou- set with human annotations instead of biased clicks. From the SRR2 [13], a public dataset for relevance estimation and ranking results, we can see that: in Web search. We select 1,211 unique queries with at least 10 (1) PBM-IPS trained on PBM-based training set performs the best successfully crawled results, 1,011 for training, 100 for validation compared to the model trained on other sets while CM-IPS trained and 100 for testing. In addition, we use a stratified sampling on DCM-simulated training set performs the best compared to approach to ensure that the frequency of queries in each dataset the model trained on other sets. That observation coincides with is consistent with the one in the real logs. As mentioned in the conclusion in [6] that when the used behavior models used in Section 3, we need a production ranker to produce ranking lists click simulation and the correction method of bias are consistent, for training queries, so we trained a lambdaMART model with 1% the results are better than the case in which they don’t agree. data randomly sampled from the original training set (with 5-level (2) Compared to PBM-IPS and CM-IPS, DLA performs the best on relevance annotations). After that, we follow the click-simulation all synthetic train sets, which indicates that DLA is more robust process in the ULTRE framework. Table 5 shows the details of and more adaptive to the change of user behavior assumption the dataset we constructed. used in the click simulation. That advantage can be attributed Model Setup and Evaluation to the the unification of learning propensity weights (used to We chose three offline ULTR models as our candidate mod- correct bias in click data) and leaning ranking models proposed els, which are IPW[1] (named PBM-IPS in [6]), CM-IPS[6] and in DLA. Such learning paradigm can help DLA model adjust its DLA[2]. propensity weights automatically to the difference between dif- ferent synthetic training sets, while PBM-IPS and CM-IPS model 2 http://www.thuir.cn/data-srr/ cannot. learning to rank, in: Proceedings of the 27th ACM Inter- The above observations demonstrate the usefulness and effec- national Conference on Information and Knowledge Man- tiveness of the ULTRE framework. By using the ULTRE frame- agement, CIKM ’18, Association for Computing Machinery, work, besides evaluating the performance of one particular model New York, NY, USA, 2018, p. 1293–1302. like many previous works have already done, we can conduct a [6] A. Vardasbi, M. de Rijke, I. Markov, Cascade model-based fair and thorough comparison between different ULTR models. propensity estimation for counterfactual learning to rank, In addition, we have the chance to investigate the following ques- in: Proceedings of the 43rd International ACM SIGIR Con- tions: 1) to what extent the evaluation results will be influenced ference on Research and Development in Information Re- by the user simulation model and the mismatch between the as- trieval, SIGIR ’20, Association for Computing Machinery, sumptions of the simulation model and ranking model 2) which New York, NY, USA, 2020, p. 2089–2092. ULTR model can adapt to different environments defined by dif- [7] N. Craswell, O. Zoeter, M. Taylor, B. Ramsey, An experimen- ferent simulation models and achieves a robust improvement in tal comparison of click position-bias models, in: Proceed- ranking performance. ings of the 2008 International Conference on Web Search and Data Mining, WSDM ’08, Association for Computing Machinery, 2008, p. 87–94. 4. Conclusion and Future work [8] F. Guo, C. Liu, Y. M. Wang, Efficient multiple-click models in web search, in: Proceedings of the Second ACM Interna- In this paper, we introduce the ULTRE framework that aims to tional Conference on Web Search and Data Mining, WSDM improve the simulation approach used in previous ULTR evalua- ’09, Association for Computing Machinery, New York, NY, tion. Our experiments show that ULTRE framework can provide USA, 2009, p. 124–131. simulated-based training sets of both quality and diversity. More [9] G. E. Dupret, B. Piwowarski, A user browsing model to importantly, it enables us to conduct a thorough and relatively predict search engine click data from past observations., in: objective comparison of different ULTR models. We further de- Proceedings of the 31st Annual International ACM SIGIR sign two evaluation protocols of using this framework as a shared Conference on Research and Development in Information evaluation service for both the offline and online ULTR models. Retrieval, SIGIR ’08, Association for Computing Machinery, Our work includes an initial implementation for ULTRE frame- New York, NY, USA, 2008, p. 331–338. work and there are still some ongoing works for the final deploy- [10] J. Mao, C. Luo, M. Zhang, S. Ma, Constructing click models ment. For example, we plan to adopt neural user behavior models for mobile search, in: The 41st International ACM SIGIR such as Context-aware Click Simulator (CCS)[15] for the click Conference on Research amp; Development in Information simulation, since the user behavior models we used in this work Retrieval, SIGIR ’18, Association for Computing Machinery, are all based on the probabilistic graphic models (PGMs) and New York, NY, USA, 2018, p. 775–784. neural models may have a better click prediction performance. [11] S. Malkevich, I. Markov, E. Michailova, M. de Rijke, Eval- Moreover, the implementation of online service and comparison uating and analyzing click simulation in web search, in: between online ULTR models under the ULTRE framework will Proceedings of the ACM SIGIR International Conference be needed as we only present the comparison results of offline on Theory of Information Retrieval, ICTIR ’17, Association ULTR models in this paper. We plan to use the ULTRE framework for Computing Machinery, New York, NY, USA, 2017, p. in the Unbiased Learning to Rank Evaluation Task (ULTRE), a 281–284. pilot task in NTCIR 16. [12] X. Dai, J. Lin, W. Zhang, S. Li, W. Liu, R. Tang, X. He, J. Hao, J. Wang, Y. Yu, An adversarial imitation click model for in- References formation retrieval, arXiv preprint arXiv2104.06077 (2021). [13] J. Zhang, Y. Liu, S. Ma, Q. Tian, Relevance estimation with [1] T. Joachims, A. Swaminathan, T. Schnabel, Unbiased multiple information sources on search engine result pages, learning-to-rank with biased feedback, in: Proceedings in: Proceedings of the 27th ACM International Conference of the Tenth ACM International Conference on Web Search on Information and Knowledge Management, 2018, pp. 627– and Data Mining, WSDM ’17, Association for Computing 636. Machinery, New York, NY, USA, 2017, p. 781–789. [14] Q. Ai, T. Yang, H. Wang, J. Mao, Unbiased learning to rank: [2] Q. Ai, K. Bi, C. Luo, J. Guo, W. B. Croft, Unbiased learning Online or offline?, ACM Trans. Inf. Syst. 39 (2021). to rank with unbiased propensity estimation, in: The 41st [15] J. Zhang, J. Mao, Y. Liu, R. Zhang, M. Zhang, S. Ma, J. Xu, International ACM SIGIR Conference on Research amp; De- Q. Tian, Context-aware ranking by constructing a virtual velopment in Information Retrieval, SIGIR ’18, Association environment for reinforcement learning, in: Proceedings of for Computing Machinery, New York, NY, USA, 2018, p. the 28th ACM International Conference on Information and 385–394. Knowledge Management, CIKM ’19, Association for Com- [3] Z. Hu, Y. Wang, Q. Peng, H. Li, Unbiased lambdamart: puting Machinery, New York, NY, USA, 2019, p. 1603–1612. an unbiased pairwise learning-to-rank algorithm, in: The World Wide Web Conference, 2019, pp. 2830–2836. [4] H. Wang, R. Langley, S. Kim, E. McCord-Snook, H. Wang, Efficient exploration of gradient space for online learning to rank, in: The 41st International ACM SIGIR Conference on Research amp; Development in Information Retrieval, SIGIR ’18, Association for Computing Machinery, New York, NY, USA, 2018, p. 145–154. [5] H. Oosterhuis, M. de Rijke, Differentiable unbiased online