ULTRE framework: a framework for Unbiased Learning to Rank
Evaluation based on simulation of user behavior
Yurou Zhao1 , Jiaxin Mao2 and Qingyao Ai3
1
  Renmin University of China, China
2
  Renmin University of China, China
3
  University of Utah, USA


                                                Abstract
                                                Unbiased learning to rank (ULTR) with biased user behavior data has received considerable attention in the IR community. However, how to
                                                properly evaluate and compare different ULTR approaches has not been systematically investigated and there is no shared task or benchmark
                                                that is specifically developed for ULTR. In this paper, we propose the Unbiased Learning to Rank Evaluation(ULTRE) framework. The proposed
                                                framework utilizes multiple click models in generating simulated click logs and supports the evaluation of both the offline, counterfactual and
                                                the online, bandit-based ULTR models. Our experiments show that the ULTRE framework are effective in click simulation and comparing
                                                different ULTR models. The ULTRE framework will be used in the Unbiased Learning to Rank Evaluation Task (ULTRE), a pilot task in NTCIR
                                                16.

                                                Keywords
                                                Unbiased Learning to Rank, Evaluation, Click Model, Click Simulation


1. INTRODUCTION                                                             into the comparison among ULTR models as the ULTR model
                                                                            that shares the same user behavior assumption with the click
Interest in Learning to Rank (LTR) approaches that learn from simulation model might be preferred by the evaluation ([6]).
user interactions has increased recently as users’ interaction with            To overcome the above limitations, we propose an unbiased
search systems can reflect their implicit relevance feedback for the learning to rank evaluation (ULTRE) framework. In this frame-
search results. Though collecting user clicks is much less costly work, we focus on extending and improving the click simulation
and more convenient than collecting expert annotations, user phase in previous ULTR evaluation. Specifically, instead of using
clicks contain different types of bias (such as position bias) and a single, over-simplified click model, we will use multiple user
noise. Therefore,the unbiased learning to rank (ULTR) that aims behavior models that trained and calibrated on real query log
at learning a ranking model from the noisy and biased user clicks as several click simulators. Equipped with the click simulators,
has become a trending topic in IR. There are two main categories we further design two evaluation protocols for offline and online
of algorithms for ULTR: 1) offline (counterfactual) LTR that learns ULTR models, respectively.
an unbiased ranking model in an offline manner with batches of                 In our empirical experiments, we implemented four different
biased, historical click logs ([1, 2, 3]) 2) online ULTR which makes click simulators. After calibrations with real click logs, we in-
online interventions of ranking and extracting unbiased feedback corporate them into our ULTRE framework. Then we compare
or deriving unbiased gradient for modeling training ([4, 5]).               several ULTR models under the framework to the verify the use-
   With a variety of models has been proposed for unbiased learn- fulness and effectiveness of our ULTRE framework.
ing to rank, how to properly evaluate and compare different ULTR               We believe the ULTRE framework can serve as a shared bench-
models still needs more research. Previous works on ULTR often mark and evaluation service for ULTR. It may also support an in-
use a simulation-based evaluation approach due to the lack of depth investigation of the simulation-based evaluation approach.
real search logs and online search systems. Such approach rely The ULTRE framework will be used in the Unbiased Learning to
on predefined user behavior models and public available learning- Rank Evaluation Task (ULTRE), a pilot task in NTCIR 16.1
to-rank datasets with item-level relevance judgments to simulate
user clicks. Using the simulation-based approach, we can train
ULTR models with the simulated clicks and then evaluate the 2. ULTRE FRAMEWORK
models on test sets with expert annotations.
   Though widely adopted, current evaluation approaches have This section details the ULTRE framework, as shown in Figure 1.
some limitations. First, there are no standard evaluation settings The whole process is made up of three stages: 1) generating sim-
or shared evaluation benchmarks for the ULTR community as ulated click logs 2) training the ULTR models with the simulated
existing studies on ULTR often rely on their own evaluation appa- click logs of the training queries 3) evaluating the ULTR models
ratus and adopt different assumptions in click simulation, making with the relevance annotations in the validation and test set.
the experimental results reported in different papers incompara- Stage 1 Simulation of clicks (step 1-5)
ble. Second, most studies only use a single user behavior model to This stage is the key for the evaluation because the quality of
simulate clicks, which may not fully capture the diverse patterns simulated click may impact the performance of the trained ULTR
of real user behavior. It may also introduce systematic biases model. It contains the following steps:
                                                                                  • Step 1: Train and calibrate four user behavior models
                                                                                    (PBM, UBM, DCM, MCM) with real query logs.
Causality in Search and Recommendation (CSR) and Simulation of                    • Step 2: Construct different click simulators based on
Information Retrieval Evaluation (Sim4IR) workshops at SIGIR, July 15, 2021
EMAIL: 13516173041@163.com (A. 1); maojiaxin@gmail.com (A. 2);                      models obtained in step 1.
aiqy@cs.utah.edu (A. 3)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License
                                       Attribution 4.0 International (CC BY 4.0).
                                       CEUR Workshop Proceedings (CEUR-WS.org)                                                        1
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                                                                                                                          http://research.nii.ac.jp/ntcir/ntcir-16/
                                    Step 1
                                                 PBM            DCM                                      Click Simulators
                  Real click logs


                                                                                                                 Step 4
                                                UBM                MCM


                                                          ...                                                    ...
                                     Step 3                                            Step 4                                    Step 5
                       Training                                                                                                             Synthetic train sets
                       queries                                                                             Simulated clicks
                                                      Ranking lists


                                                                                                              Train an online ULTR model
         Traditional                                      No
                                                                            Have participants received                                      Train an offline or
            LTR                                                              100% user impressions?                                        online ULTR model?
        dataset with
         relevance
         annotation
                                                                      Yes
                                                                                                                                  Train an offline
                                                                                                                                   ULTR model
                       Validation
                        queries                                                         Online models  Offline models
                                                                                           DBGD       SVMRank+IPW
                                                                                           PDGD         DNN+DLA
                          Test                Evaluation Results                              ...          ...
                         queries


                   Figure 1: ULTRE framework

      • Step 3: Collect the ranking lists for the training queries                   2.1. Evaluation protocol
         and the corresponding relevance annotations for the doc-
                                                                                     Based on the ULTRE framework, we can provide a shared evalu-
         uments in the ranking lists. Depending on which class of
                                                                                     ation task and benchmark for the evaluation of different ULTR
         ULTR models we want to train and evaluate, we will gen-
                                                                                     models. In this section, we develop evaluation protocols that
         erate the ranking lists differently. Offline, counterfactual
                                                                                     describe how the task organizers of the shared ULTRE task (i.e.
         ULTR models, we will train a simple production ranker on
                                                                                     the TOs) interact with the participants of the shared task and
         a small proportion of the train set with relevance labels to
                                                                                     work together to evaluate the ULTR models developed by the
         generate ranking lists for all train queries. For the online
                                                                                     participants. Since there are two categories of ULTR models, we
         ULTR models, the ranking lists will be generated by the
                                                                                     design two evaluation protocols, one for offline ULTR models,
         online ULTR model that is being evaluated.
                                                                                     the other for online ones, respectively.
      • Step 4: Use the click simulators defined in Step 1 and
         calibrated in Step 2 to generate simulated click logs for
         the ranking lists obtained in step 3.                                       2.1.1. Evaluation protocol for offline ULTR models
      • Step 5: Finally, collect the generated clicks and use them                   Figure 2 displays the steps in the evaluation protocol for offline
         for the training of ULTR models. Because we construct                       ULTR models, and show what each role should do in each step.
         four simulators, we will construct four synthetic train                     TOs represent the task organizers, and participants represent
         sets.                                                                       those who are willing to use the ULTRE framework to evaluate
Stage 2 Training of ULTR models (step 6)                                             their ULTR models. The protocol consists three steps:
After generating the synthetic training set in Stage 1, we can use                         • Step 1: TOs construct click simulator based on real click
different synthetic train sets to train the ULTR models. It is worth                         log, and then simulate clicks for all queries in the train set.
mentioning that if the model is an online one, step 3-5 in stage 1                           The participants then can use the simulated clicks to train
and stage 2 will be repeated multiple times to simulate the online                           their ULTR models. As four click simulator equipped with
learning procedure.                                                                          different user behavior models (PBM/UBM/DCM/MCM)
Stage 3 Evaluation of ULTR models (step 7)                                                   will be used, TOs will produce four synthetic train sets
Finally, in the last stage, we evaluate the trained ULTR models on                           for participants in this step.
the validation and test queries. We can compute some relevance-                            • Step 2: Participants train their ULTR models on each
based evaluation metrics, such as nDCG, MAP, and MRR, with the                               synthetic train set respectively. Participants may have
relevance annotations in the validation and test set, to evaluate                            their preferred train set, as a result, they are allowed to
the ranking performance of the trained models.                                               only train the model on a single set. However, as each
                                                                                             set is produced under some unique user behavior assump-
                                                                                             tions, the participants are strongly encouraged to train
                                                                                             their models on all training sets. Such exploration can
                                                                                             test the robustness of the ULTR model. After training the
         Evaluation protocol for offline ULTR models                                         Evaluation protocol for online ULTR models
             TOs                         Data                 Participants                 TOs                         Data                                       Participants


       Construct Click                                                              Construct Click
                                     Real click log                                   Simulator                       Real click log                              Initialize online
         Simulator                                                                                                                                                 ULTR models


       Simulate clicks                                                                                              x% train queries
         for training              All train queries                                 Simulate clicks                                                            Specify x% of user
                                                                                       for training                                                               impressions is
           queries                                                                       queries                                                                     wanted

                                                                                                                                                  online ULTR
                                   ...                  Train offline                                                                                model
                                                        ULTR models
                            Synthetic train sets
                         (PBM/DCM/UBM/MCM)
                                                                                                                    ...                                         Train online ULTR
                                                                                                                                                                      models

                                     ...                Submit ranking                                       Synthetic train sets
                                                        lists for validate                                (PBM/DCM/UBM/MCM)
        Evaluate the                                    and test queries
                             Ranking lists for                                                                                                                  100% impressions
       submitted runs                                                                                                                                            has been received
                            validate/test queries
                                    Relevance labels
                                    for validate/test                                                                                                           Submit ranking lists
                                                                                                                           Ranking lists for                    for validate and test
                                                                                                            ...
                                    query-document                                                                        validate/test queries                        queries
                                                                                       Evaluate the
                                           pairs                                      submitted runs
                                                                                                                  Relevance labels for
                                                                                                                  validate/test query-
                                                                                                                    document pairs

                                 Evaluation
                                   results

                                                                                                             Evaluation results


Figure 2: Evaluation protocol for offline ULTR models
                                                                             Figure 3: Evaluation protocol for online ULTR models


       ULTR models, the participants can submit the ranking                          models on validation and test set.
       lists (runs) for the validation and test queries. Each run
       submitted by the participants should only use the syn-
       thetic data generated by a single click simulator, so ideally,        2.2. Construct click simulators
       for each ULTR model, we expect the participant to submit              This section provides more details about the process of construct-
       four runs.                                                            ing click simulator.
     • Step 3: After receiving the runs submitted by partic-
       ipants, TOs evaluate the runs based on true relevance
                                                                             2.2.1. Choice of user behavior models
       labels (i.e. expert annotation). Specifically, TOs will show
       the results on validation set on the leaderboard and release          Compared with previous studies that only use a single click model,
       the official results in the final report.                             we use the following user behavior models:
                                                                                 • Position-Based Model (PBM)[7]: a click model that as-
2.1.2. Evaluation protocol for online ULTR models                                   sumes the click probability of a search result only depends
                                                                                    on its relevance and its ranking position.
As shown in Figure 3, the evaluation protocol for online ULTR                    • Dependent Click Model (DCM)[8]: a click model that
models involves similar steps in the offline one. However, the                      is based on the cascade assumption that the user will
main difference between them is that the participants can itera-                    sequentially examine the results list and find attractive
tively submit the ranking lists to the TOs to get simulated clicks                  results to click until she feels satisfied with the clicked
and use them to update their ULTR models in an online process.                      result.
      • Step 1: Participants submit the ranking lists for training               • User Browsing Model (UBM)[9]: a click model that as-
        queries generated by their own ULTR models and specify                      sumes the examination probability on a search result de-
        that they want to receive x% of user impressions.                           pends on its ranking position and the distance to the last
      • Step 2: TOs sample x% of all training queries according                     clicked result.
        to the query frequency in the real log. Based on the rank-               • Mobile Click Model (MCM)[10]: a click model that consid-
        ing lists of those selected training queries submitted by                   ers the click necessity bias (i.e.some vertical results can
        the participant in step 1, TOs construct synthetic train-                   satisfy users’ information need without a click) in user
        ing sets following the same process in the step 1 of the                    clicks.
        evaluation protocol for offline ULTR models.
      • Step 3: Participants update their models with the train-
                                                                             2.2.2. Train and calibrate the user behavior models with
        ing data received in step 2.
                                                                                    real query logs
      • Repeat Step 1-Step 3 until participants receive 100% of
        impressions.                                                         We train and calibrate all the user behavior models based on real
      • Step N: Same as the final step in the evaluation protocol            query logs collected by Sogou.com, a commercial Chinese search
        for offline ULTR models, TOs perform evaluation for the              engine, so the synthetic clicks are similar to the real user clicks.
Table 1                                                                   Table 2
Statistics of dataset used in training user behavior model                User behavior model performance on LL and PPL metrics.

                                  Training      Test                                                      LL        PPL
                   sessions        843,933    836,979                                         DCM      -0.1848     1.2363
                unique queries       569        642                                           PBM      -0.1721     1.2059
                                                                                              UBM      -0.1513     1.2029
                                                                                              MCM      -0.1503     1.1787
We split the real logs evenly into training and test set, then strictly
                                                                          Table 3
follow the training process of each user behavior model proposed
                                                                          KL-divergence between click logs generated by different user behavior
in the original works[7, 9, 8, 10]. However, to make sure those           models and real log
models can work for all candidate documents, we assume that
the attractiveness parameter 𝛼 of each query-document pair only                                            KL-divergence
depends on its five-level relevance label (0-4).                                        Model      Session-based Rank-based
                                                                                       Baseline        0.1950          0.3884
                                                                                        DCM            0.1245          0.5325
2.2.3. Generating clicks with click simulators
                                                                                        PBM            0.2856          0.2212
Equipped with the trained user behavior models, the working                             UBM           0.0771           0.2173
process of click simulators on each query session can be sum-                           MCM            0.0786         0.1951
marized by the code provided in [11]. We add some necessary
                                                                        The first metric is proposed by Malkevich et al.[11], it mea-
modifications to the click generating procedure and show it in
                                                                     sures a local KL-divergence for every query and then calculate a
Algorithm 1.
                                                                     weighted average of local divergences as follows:
                                                                                                    ∑︀
Algorithm 1 Generating synthetic clicks with a click simulator                                         𝑞∈ 𝑄 𝐾𝐿 − 𝑑𝑖𝑣(𝑞).𝑠𝑞
                                                                                    𝐾𝐿 − 𝑑𝑖𝑣 =
for a query session
                                                                                                            ∑︀
                                                                                                               𝑞∈ 𝑄 𝑠𝑞
Input: user behavior model 𝑀
                                                                     where 𝑄 is the number of unique queries and 𝑠𝑞 is the number
            query session 𝑠 consisting of query 𝑞 and ranking list
                                                                     of sessions observed for a particular query 𝑞. This metric can
     𝑑1 ,... 𝑑𝑛
                                                                     be calculated for two click distributions: the distribution over
            vector of relevance labels for documents (𝑟𝑑1 , ...𝑟𝑑𝑛 )
                                                                     sessions which shows the percentage of sessions with a certain
            vector of vertical types for documents (𝑣𝑑1 , ...𝑣𝑑𝑛 )
                                                                     number of clicks and the distribution over ranks which shows
Output: vector of simulated clicks (𝑐1 , ...𝑐𝑛 )
                                                                     how many times a certain rank was clicked. Lower values of the
  1: for 𝑖 = 0 → 𝑛 do
                                                                     metrics correspond to better click simulation performance.
  2:      Compute 𝑝 = 𝑃 (𝐶𝑖 = 1|𝐶1 = 𝑐1 , ...𝐶𝑖−1 = 𝑐𝑖−1 )
                                                                        The second metric was first used in [12] to test the distribu-
     using previous clicks 𝑐1 , ...𝑐𝑖−1 ,
                                                                     tional converge of click models. Reverse PPL is the PPL of a
            relevance label 𝑟𝑑𝑖 ,vertical type 𝑣𝑑𝑖 and parameters of
                                                                     surrogate model (an intermediary to evaluate the similarity be-
     M
                                                                     tween the generated samples and the real data samples) that is
  3:      Generate random value 𝐶𝑖 from Bernoulli(p)
                                                                     trained on generated samples and evaluate on real data. Forward
  4: end for
                                                                     PPL is the PPL of a surrogate model that is trained on real data
                                                                     and evaluated on generated samples.
3. EXPERIMENTS                                                       Performance on predicting clicks
                                                                     The results for the click prediction task on test set are presented in
We conduct a series of experiments to answer the following re- Table 2, from which we can observe that MCM performs the best
search questions: RQ1 : How do different click simulators per- among all models, similar to the observation in [10]. However,
forms in predicting clicks and generating synthetic click logs? the others also have a relatively good performance, as their values
RQ2: Can we evaluate existing ULTR models with the ULTRE of metrics are all close to the ideal values (0 for LL and 1 for PPL).
framework?                                                           Quality of generated click logs
                                                                     This section measures the similarity between the real log and
3.1. Examining click simulators (RQ1)                                generated click logs on test set.
                                                                        Table 3 summarizes the simulation performance of the four
Experiment Set up                                                    user behavior models and baseline model (always simulate a
Datasets The dataset used in training user behavior models were click on the first position) in terms of the KL-divergence of the
sampled from real search log dataset released by Chinese com- click distribution over sessions (session-based KL) and over ranks
mercial search engine Sogou.com. We divide the dataset into (rank-based KL), from which we can obtain following observa-
training and test sets with proportion 1:1. The statistics of the tions:
dataset are shown in Table 1.                                         (1) UBM generates the best samples in terms of session-based
Evaluation Metrics For click prediction task, we report the log- KL-divergence, while MCM performs the best in terms of rank-
likelihood (LL) and perplexity (PPL) of each user behavior model. based KL-divergence. Considering the value of session-based
Higher values of log-likelihood and lower values of perplexity KL-divergence of MCM only slightly higher than the one of UBM,
indicates better click prediction performance.                       it’s fair to say that the click logs generated by MCM are the most
   To measure the quality of generated samples from different similar to the real logs.
click simulators, we compute Kullback-Leibler(KL) divergence (2) The samples of DCM are better than the samples gener-
between the distribution of real clicks and the distribution of ated by the baseline model in terms of the session-based KL-
simulated clicks and Reverse/Forward PPL.                            divergence, however the performance in terms of the rank-based
           Table 4
           Reverse/Forward PPL of Surrogate DCM/PBM/UBM/MCM models based on different synthetic datasets generated from
           target user behavior models(DCM/PBM/UBM/MCM)

                                  Surrogate DCM              Surrogate PBM                  Surrogate UBM                   Surrogate MCM
               Data         Reverse PPL Forward PPL   Reverse PPL Forward PPL        Reverse PPL Forward PPL          Reverse PPL Forward PPL
             Real data        1.2363         1.2363     1.2059           1.2059        1.2029          1.2029           1.1787         1.1787
           DCM samples           -              -       1.2374           1.3688        1.2350          1.3625           1.2191         1.3137
           PBM samples        1.2824         1.2272        -                -          1.2061          1.1880           1.2053         1.2152
           UBM samples        1.2409         1.2317     1.2055           1.1953           -               -             1.1802         1.1764
           MCM samples        1.2388         1.2248     1.2061           1.1841        1.2031          1.1831              -              -

KL-divergence is rather low. A possible reason for that is DCM            Table 5
does not use rank-based examination parameter as the other three          Statistics of ULTRE dataset
models. On the contrast, the samples of PBM are better regarding                                    Training              Validation             Test
the rank-based KL-divergence and worse regarding session-based               Unique queries          1,011                   100                 100
KL-divergence. Such observations may caused by the simple                       Session             144,675                  100                 100
                                                                                                                      5-level relevance   5-level relevance
rank-based assumption used in PBM.                                                  Label      clicked(1) or not(0)
                                                                                                                      annotations(0-4)    annotations(0-4)
   Table 4 shows the results of Reverse/Forward PPL of surro-             Table 6
gate DCM/PBM/UBM/MCM models based on different synthetic                  Comparison of offline ULTR models on ULTRE data
datasets generated from target models (DCM/PBM/UBM/MCM).
   To conduct an adequate and fair experiment, all models take                                         PBM            DCM           UBM          MCM
the role of the surrogate model. For example, when we choose                 production ranker
                                                                                                                              0.7815
DCM as the surrogate model, the generated samples of the other               (baseline)
                                                                             Full-info (skyline)                             0.8182
three models (PBM/UBM/MCM)can be compared.
                                                                               PBM-IPS(IPW)            0.8017         0.7826      0.8064         0.7647
   From Table 4 we can obtain the following observations:
                                                                                  CM-IPS               0.7894         0.7932      0.8050         0.7778
(1) Samples generated by UBM and MCM achieves better per-                           DLA                0.8119         0.8173+     0.8107         0.7932+
formance than the ones of DCM and PBM, for the reason that
when the surrogate model is UBM and MCM, MCM-samples                              Significant improvements or degradations with respect to PBM-
and UBM-samples outperforms DCM-samples and PBM-samples                           IPS are indicated with +/- in the paired samples t-test with 𝑝 ≤
                                                                                  0.05. The best performance is highlighted in boldface.
respectively.
(2) The comparison between UBM-samples and MCM-samples is               To conduct a fair comparison, we followed the settings in [14],
a little bit complex. Since when surrogate model is DCM, MCM-        using a multiple-layer perceptron network (MLP) with three hid-
samples are better in terms of both Reverse PPL and Forward          den layers (with 512,256,128 neurons) as the ranking model for
PPL, however when surrogate model is PBM, the better samples         all candidate models and set the batch size to 256. We trained
are different regarding different metric. As a result, we cannot     each candidate for 10k steps, and chose the ranking model in
conclude whether the samples generated by UBM or MCM are             the iteration that has the best performance on the validation set.
the most similar to the real logs.                                   Such experiment was repeated 10 times to ensure the reliability
                                                                     of final results. nDCG@5 was used to evaluate the performance
3.2. Evaluating ULTR models with the ULTRE                           of each candidate model.
       framework (RQ2)                                               Evaluation results
                                                                        Table 6 shows nDCG@5 for three different offline ULTR mod-
To answer RQ2, we evaluate several offline ULTR models with els trained on different synthetic train sets. The baseline we used
the ULTRE framework.                                                 is the performance of production ranker and the skyline is the
Dataset and Simulation Setup                                         performance of a lambdaMART model trained on the whole train
The dataset used to evaluate ULTR models is based on Sogou- set with human annotations instead of biased clicks. From the
SRR2 [13], a public dataset for relevance estimation and ranking results, we can see that:
in Web search. We select 1,211 unique queries with at least 10 (1) PBM-IPS trained on PBM-based training set performs the best
successfully crawled results, 1,011 for training, 100 for validation compared to the model trained on other sets while CM-IPS trained
and 100 for testing. In addition, we use a stratified sampling on DCM-simulated training set performs the best compared to
approach to ensure that the frequency of queries in each dataset the model trained on other sets. That observation coincides with
is consistent with the one in the real logs. As mentioned in the conclusion in [6] that when the used behavior models used in
Section 3, we need a production ranker to produce ranking lists click simulation and the correction method of bias are consistent,
for training queries, so we trained a lambdaMART model with 1% the results are better than the case in which they don’t agree.
data randomly sampled from the original training set (with 5-level (2) Compared to PBM-IPS and CM-IPS, DLA performs the best on
relevance annotations). After that, we follow the click-simulation all synthetic train sets, which indicates that DLA is more robust
process in the ULTRE framework. Table 5 shows the details of and more adaptive to the change of user behavior assumption
the dataset we constructed.                                          used in the click simulation. That advantage can be attributed
Model Setup and Evaluation                                           to the the unification of learning propensity weights (used to
We chose three offline ULTR models as our candidate mod- correct bias in click data) and leaning ranking models proposed
els, which are IPW[1] (named PBM-IPS in [6]), CM-IPS[6] and in DLA. Such learning paradigm can help DLA model adjust its
DLA[2].                                                              propensity weights automatically to the difference between dif-
                                                                     ferent synthetic training sets, while PBM-IPS and CM-IPS model
   2
       http://www.thuir.cn/data-srr/
cannot.                                                                   learning to rank, in: Proceedings of the 27th ACM Inter-
   The above observations demonstrate the usefulness and effec-           national Conference on Information and Knowledge Man-
tiveness of the ULTRE framework. By using the ULTRE frame-                agement, CIKM ’18, Association for Computing Machinery,
work, besides evaluating the performance of one particular model          New York, NY, USA, 2018, p. 1293–1302.
like many previous works have already done, we can conduct a          [6] A. Vardasbi, M. de Rijke, I. Markov, Cascade model-based
fair and thorough comparison between different ULTR models.               propensity estimation for counterfactual learning to rank,
In addition, we have the chance to investigate the following ques-        in: Proceedings of the 43rd International ACM SIGIR Con-
tions: 1) to what extent the evaluation results will be influenced        ference on Research and Development in Information Re-
by the user simulation model and the mismatch between the as-             trieval, SIGIR ’20, Association for Computing Machinery,
sumptions of the simulation model and ranking model 2) which              New York, NY, USA, 2020, p. 2089–2092.
ULTR model can adapt to different environments defined by dif-        [7] N. Craswell, O. Zoeter, M. Taylor, B. Ramsey, An experimen-
ferent simulation models and achieves a robust improvement in             tal comparison of click position-bias models, in: Proceed-
ranking performance.                                                      ings of the 2008 International Conference on Web Search
                                                                          and Data Mining, WSDM ’08, Association for Computing
                                                                          Machinery, 2008, p. 87–94.
4. Conclusion and Future work                                         [8] F. Guo, C. Liu, Y. M. Wang, Efficient multiple-click models
                                                                          in web search, in: Proceedings of the Second ACM Interna-
In this paper, we introduce the ULTRE framework that aims to
                                                                          tional Conference on Web Search and Data Mining, WSDM
improve the simulation approach used in previous ULTR evalua-
                                                                          ’09, Association for Computing Machinery, New York, NY,
tion. Our experiments show that ULTRE framework can provide
                                                                          USA, 2009, p. 124–131.
simulated-based training sets of both quality and diversity. More
                                                                      [9] G. E. Dupret, B. Piwowarski, A user browsing model to
importantly, it enables us to conduct a thorough and relatively
                                                                          predict search engine click data from past observations., in:
objective comparison of different ULTR models. We further de-
                                                                          Proceedings of the 31st Annual International ACM SIGIR
sign two evaluation protocols of using this framework as a shared
                                                                          Conference on Research and Development in Information
evaluation service for both the offline and online ULTR models.
                                                                          Retrieval, SIGIR ’08, Association for Computing Machinery,
   Our work includes an initial implementation for ULTRE frame-
                                                                          New York, NY, USA, 2008, p. 331–338.
work and there are still some ongoing works for the final deploy-
                                                                     [10] J. Mao, C. Luo, M. Zhang, S. Ma, Constructing click models
ment. For example, we plan to adopt neural user behavior models
                                                                          for mobile search, in: The 41st International ACM SIGIR
such as Context-aware Click Simulator (CCS)[15] for the click
                                                                          Conference on Research amp; Development in Information
simulation, since the user behavior models we used in this work
                                                                          Retrieval, SIGIR ’18, Association for Computing Machinery,
are all based on the probabilistic graphic models (PGMs) and
                                                                          New York, NY, USA, 2018, p. 775–784.
neural models may have a better click prediction performance.
                                                                     [11] S. Malkevich, I. Markov, E. Michailova, M. de Rijke, Eval-
Moreover, the implementation of online service and comparison
                                                                          uating and analyzing click simulation in web search, in:
between online ULTR models under the ULTRE framework will
                                                                          Proceedings of the ACM SIGIR International Conference
be needed as we only present the comparison results of offline
                                                                          on Theory of Information Retrieval, ICTIR ’17, Association
ULTR models in this paper. We plan to use the ULTRE framework
                                                                          for Computing Machinery, New York, NY, USA, 2017, p.
in the Unbiased Learning to Rank Evaluation Task (ULTRE), a
                                                                          281–284.
pilot task in NTCIR 16.
                                                                     [12] X. Dai, J. Lin, W. Zhang, S. Li, W. Liu, R. Tang, X. He, J. Hao,
                                                                          J. Wang, Y. Yu, An adversarial imitation click model for in-
References                                                                formation retrieval, arXiv preprint arXiv2104.06077 (2021).
                                                                     [13] J. Zhang, Y. Liu, S. Ma, Q. Tian, Relevance estimation with
 [1] T. Joachims, A. Swaminathan, T. Schnabel, Unbiased                   multiple information sources on search engine result pages,
     learning-to-rank with biased feedback, in: Proceedings               in: Proceedings of the 27th ACM International Conference
     of the Tenth ACM International Conference on Web Search              on Information and Knowledge Management, 2018, pp. 627–
     and Data Mining, WSDM ’17, Association for Computing                 636.
     Machinery, New York, NY, USA, 2017, p. 781–789.                 [14] Q. Ai, T. Yang, H. Wang, J. Mao, Unbiased learning to rank:
 [2] Q. Ai, K. Bi, C. Luo, J. Guo, W. B. Croft, Unbiased learning         Online or offline?, ACM Trans. Inf. Syst. 39 (2021).
     to rank with unbiased propensity estimation, in: The 41st       [15] J. Zhang, J. Mao, Y. Liu, R. Zhang, M. Zhang, S. Ma, J. Xu,
     International ACM SIGIR Conference on Research amp; De-              Q. Tian, Context-aware ranking by constructing a virtual
     velopment in Information Retrieval, SIGIR ’18, Association           environment for reinforcement learning, in: Proceedings of
     for Computing Machinery, New York, NY, USA, 2018, p.                 the 28th ACM International Conference on Information and
     385–394.                                                             Knowledge Management, CIKM ’19, Association for Com-
 [3] Z. Hu, Y. Wang, Q. Peng, H. Li, Unbiased lambdamart:                 puting Machinery, New York, NY, USA, 2019, p. 1603–1612.
     an unbiased pairwise learning-to-rank algorithm, in: The
     World Wide Web Conference, 2019, pp. 2830–2836.
 [4] H. Wang, R. Langley, S. Kim, E. McCord-Snook, H. Wang,
     Efficient exploration of gradient space for online learning
     to rank, in: The 41st International ACM SIGIR Conference
     on Research amp; Development in Information Retrieval,
     SIGIR ’18, Association for Computing Machinery, New York,
     NY, USA, 2018, p. 145–154.
 [5] H. Oosterhuis, M. de Rijke, Differentiable unbiased online