<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ULTRE framework: a framework for Unbiased Learning to Rank Evaluation based on simulation of user behavior</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yurou Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiaxin Mao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qingyao Ai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Renmin University of China</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Utah</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Unbiased learning to rank (ULTR) with biased user behavior data has received considerable attention in the IR community. However, how to properly evaluate and compare different ULTR approaches has not been systematically investigated, and there is no shared task or benchmark specifically developed for ULTR. In this paper, we propose the Unbiased Learning to Rank Evaluation (ULTRE) framework. The proposed framework utilizes multiple click models to generate simulated click logs and supports the evaluation of both offline, counterfactual and online, bandit-based ULTR models. Our experiments show that the ULTRE framework is effective in simulating clicks and comparing different ULTR models. The ULTRE framework will be used in the Unbiased Learning to Rank Evaluation Task (ULTRE), a pilot task in NTCIR 16.</p>
      </abstract>
      <kwd-group>
        <kwd>Unbiased Learning to Rank</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Click Model</kwd>
        <kwd>Click Simulation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Interest in Learning to Rank (LTR) approaches that learn from user interactions has increased recently, as users’ interactions with search systems can reflect their implicit relevance feedback on the search results. Though collecting user clicks is much less costly and more convenient than collecting expert annotations, user clicks contain different types of bias (such as position bias) and noise. Therefore, unbiased learning to rank (ULTR), which aims at learning a ranking model from noisy and biased user clicks, has become a trending topic in IR. There are two main categories of algorithms for ULTR: 1) offline (counterfactual) LTR, which learns an unbiased ranking model in an offline manner from batches of biased, historical click logs ([
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]); 2) online ULTR, which makes online interventions in the ranking and extracts unbiased feedback or derives unbiased gradients for model training ([
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]).
      </p>
      <p>With a variety of models proposed for unbiased learning to rank, how to properly evaluate and compare different ULTR models still needs more research. Previous works on ULTR often use a simulation-based evaluation approach due to the lack of real search logs and online search systems. Such an approach relies on predefined user behavior models and publicly available learning-to-rank datasets with item-level relevance judgments to simulate user clicks. Using the simulation-based approach, we can train ULTR models with the simulated clicks and then evaluate the models on test sets with expert annotations.</p>
      <p>
        Though widely adopted, current evaluation approaches have some limitations. First, there are no standard evaluation settings or shared evaluation benchmarks for the ULTR community, as existing studies on ULTR often rely on their own evaluation apparatus and adopt different assumptions in click simulation, making the experimental results reported in different papers incomparable. Second, most studies only use a single user behavior model to simulate clicks, which may not fully capture the diverse patterns of real user behavior. It may also introduce systematic biases into the comparison among ULTR models, as the ULTR model that shares the same user behavior assumption with the click simulation model might be preferred by the evaluation ([
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]).
      </p>
      <p>To overcome the above limitations, we propose an unbiased learning to rank evaluation (ULTRE) framework. In this framework, we focus on extending and improving the click simulation phase in previous ULTR evaluation. Specifically, instead of using a single, over-simplified click model, we use multiple user behavior models that are trained and calibrated on real query logs as several click simulators. Equipped with the click simulators, we further design two evaluation protocols for offline and online ULTR models, respectively.</p>
      <p>In our empirical experiments, we implemented four different click simulators. After calibrating them with real click logs, we incorporate them into our ULTRE framework. Then we compare several ULTR models under the framework to verify the usefulness and effectiveness of our ULTRE framework.</p>
      <p>We believe the ULTRE framework can serve as a shared benchmark and evaluation service for ULTR. It may also support an in-depth investigation of the simulation-based evaluation approach. The ULTRE framework will be used in the Unbiased Learning to Rank Evaluation Task (ULTRE), a pilot task in NTCIR 16.</p>
      <p>2. ULTRE FRAMEWORK</p>
      <p>This section details the ULTRE framework, as shown in Figure 1.</p>
      <p>The whole process is made up of three stages: 1) generating
simulated click logs; 2) training the ULTR models with the simulated
click logs of the training queries; 3) evaluating the ULTR models
with the relevance annotations in the validation and test sets.</p>
      <p>Stage 1 Simulation of clicks (step 1-5)
This stage is the key to the evaluation because the quality of the simulated clicks may affect the performance of the trained ULTR models. It contains the following steps:
• Step 1: Train and calibrate four user behavior models (PBM, UBM, DCM, MCM) with real query logs.
• Step 2: Construct different click simulators based on the models obtained in step 1.
• Step 3: Collect the ranking lists for the training queries and the corresponding relevance annotations for the documents in the ranking lists. Depending on which class of ULTR models we want to train and evaluate, we generate the ranking lists differently. For offline, counterfactual ULTR models, we train a simple production ranker on a small proportion of the train set with relevance labels to generate ranking lists for all train queries. For online ULTR models, the ranking lists are generated by the online ULTR model that is being evaluated.
• Step 4: Use the click simulators constructed in step 2 to generate simulated click logs for the ranking lists obtained in step 3.
• Step 5: Finally, collect the generated clicks and use them for the training of ULTR models. Because we construct four simulators, we obtain four synthetic train sets.</p>
      <p>[Figure 1: flowchart of the ULTRE framework. Recoverable labels: real click logs; traditional LTR dataset with relevance annotation (training, validation, and test queries); user behavior models (UBM, MCM, ...); ranking lists; simulated clicks; synthetic train sets; decision "Train an offline or online ULTR model?"; offline models (SVMRank+IPW, DNN+DLA, ...); online models (DBGD, PDGD, ...); decision "Have participants received 100% user impressions?"; Steps 1, 3, 4, 5.]</p>
      <p>Stage 2 Training of ULTR models (step 6)
After generating the synthetic training sets in Stage 1, we can use
the different synthetic train sets to train the ULTR models. It is worth
mentioning that if the model is an online one, steps 3-5 in Stage 1
and Stage 2 are repeated multiple times to simulate the online
learning procedure.</p>
      <p>Stage 3 Evaluation of ULTR models (step 7)
Finally, in the last stage, we evaluate the trained ULTR models on
the validation and test queries. We can compute relevance-based
evaluation metrics, such as nDCG, MAP, and MRR, with the
relevance annotations in the validation and test sets, to evaluate
the ranking performance of the trained models.</p>
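      <p>As an illustration of the relevance-based metrics used in Stage 3, the following minimal sketch computes nDCG@k from graded relevance labels. It assumes the common exponential gain (2^rel - 1) with a log2 position discount; the paper does not state which nDCG variant is used, and the function names and example labels are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def dcg_at_k(labels, k):
    """Discounted cumulative gain of graded labels given in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(labels[:k]))


def ndcg_at_k(labels, k):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels (0-4) of one ranked list, top result first.
ranked_labels = [3, 0, 4, 1, 0]
print(round(ndcg_at_k(ranked_labels, 5), 4))
```
]]></preformat>
      <p>Averaging this value over all validation or test queries would give the per-run score of a submitted ranking.</p>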
      <sec id="sec-1-1">
        <title>2.1. Evaluation protocol</title>
        <p>Based on the ULTRE framework, we can provide a shared
evaluation task and benchmark for the evaluation of different ULTR
models. In this section, we develop evaluation protocols that
describe how the task organizers of the shared ULTRE task (i.e.,
the TOs) interact with the participants of the shared task and
work together to evaluate the ULTR models developed by the
participants. Since there are two categories of ULTR models, we
design two evaluation protocols, one for offline ULTR models
and the other for online ones.</p>
        <sec id="sec-1-1-1">
          <title>2.1.1. Evaluation protocol for offline ULTR models</title>
          <p>Figure 2 displays the steps in the evaluation protocol for offline ULTR models and shows what each role should do in each step. TOs represent the task organizers, and participants represent those who are willing to use the ULTRE framework to evaluate their ULTR models. The protocol consists of three steps:
• Step 1: TOs construct click simulators based on real click logs, and then simulate clicks for all queries in the train set. The participants can then use the simulated clicks to train their ULTR models. As four click simulators equipped with different user behavior models (PBM/UBM/DCM/MCM) are used, TOs produce four synthetic train sets for participants in this step.
• Step 2: Participants train their ULTR models on each synthetic train set. Participants may have a preferred train set, so they are allowed to train the model on a single set only. However, as each set is produced under its own user behavior assumptions, the participants are strongly encouraged to train their models on all training sets. Such exploration can test the robustness of the ULTR model. After training the ULTR models, the participants can submit the ranking lists (runs) for the validation and test queries. Each run submitted by the participants should only use the synthetic data generated by a single click simulator, so ideally, for each ULTR model, we expect the participant to submit four runs.
• Step 3: After receiving the runs submitted by participants, TOs evaluate the runs based on true relevance labels (i.e., expert annotations). Specifically, TOs will show the results on the validation set on the leaderboard and release the official results in the final report.</p>
          <p>[Figure 2: Evaluation protocol for offline ULTR models.]</p>
          <p>[Figure 3: Evaluation protocol for online ULTR models.]</p>
        </sec>
        <sec id="sec-1-1-2">
          <title>2.1.2. Evaluation protocol for online ULTR models</title>
          <p>As shown in Figure 3, the evaluation protocol for online ULTR models involves steps similar to those of the offline one. The main difference is that the participants can iteratively submit ranking lists to the TOs to obtain simulated clicks and use them to update their ULTR models in an online process.
• Step 1: Participants submit the ranking lists for training queries generated by their own ULTR models and specify that they want to receive x% of user impressions.
• Step 2: TOs sample x% of all training queries according to the query frequency in the real log. Based on the ranking lists of the selected training queries submitted by the participant in step 1, TOs construct synthetic training sets following the same process as in step 1 of the evaluation protocol for offline ULTR models.
• Step 3: Participants update their models with the training data received in step 2.
• Repeat steps 1-3 until participants have received 100% of the impressions.
• Step N: As in the final step of the evaluation protocol for offline ULTR models, TOs evaluate the models on the validation and test sets.</p>
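          <p>The sampling in Step 2 can be sketched as follows. This is only one possible reading of the protocol: it assumes that "x% of user impressions" means x% of the total query frequency in the real log and that impressions are drawn with replacement in proportion to that frequency. The function name, the seed, and the example frequencies are hypothetical.</p>
          <preformat><![CDATA[
```python
import random


def sample_impressions(query_freq, fraction, seed=0):
    """Draw query impressions in proportion to their frequency in the log.

    query_freq: dict mapping query id -> frequency in the real log
    fraction:   share of the total impression budget to release (x / 100)
    """
    rng = random.Random(seed)
    queries = list(query_freq)
    weights = [query_freq[q] for q in queries]
    budget = round(fraction * sum(weights))
    # Sampling with replacement: frequent queries receive more impressions.
    return rng.choices(queries, weights=weights, k=budget)


# Hypothetical query frequencies from a real log.
freq = {"q1": 50, "q2": 30, "q3": 20}
batch = sample_impressions(freq, fraction=0.1)
print(len(batch), batch)
```
]]></preformat>
          <p>Each sampled impression would then be paired with the participant's submitted ranking list for that query and passed to a click simulator.</p>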
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>2.2. Construct click simulators</title>
        <p>This section provides more details about the process of
constructing the click simulators.</p>
        <sec id="sec-1-2-1">
          <title>2.2.1. Choice of user behavior models</title>
          <p>
            Unlike previous studies that only use a single click model,
we use the following four user behavior models:
• Position-Based Model (PBM)[
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]: a click model that
assumes the click probability of a search result depends only
on its relevance and its ranking position.
• Dependent Click Model (DCM)[
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]: a click model
based on the cascade assumption that the user
sequentially examines the result list and clicks attractive
results until she feels satisfied with a clicked
result.
• User Browsing Model (UBM)[
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]: a click model that
assumes the examination probability of a search result
depends on its ranking position and its distance to the last
clicked result.
• Mobile Click Model (MCM)[
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]: a click model that
considers the click necessity bias in user clicks (i.e., some vertical
results can satisfy users’ information need without a
click).
          </p>
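          <p>To make the first two behavioral assumptions concrete, the following minimal sketch simulates one session under PBM and under DCM. It is not the authors' implementation: all parameter values are hypothetical, attractiveness is tied to the five-level relevance label only, and UBM and MCM are omitted for brevity.</p>
          <preformat><![CDATA[
```python
import random


def simulate_pbm(attract, exam, rng):
    """PBM: a click occurs iff the rank is examined and the result is attractive."""
    return [int(rng.random() < e * a) for a, e in zip(attract, exam)]


def simulate_dcm(attract, cont, rng):
    """DCM: scan top-down; after a click at rank i, continue with prob. cont[i]."""
    clicks = [0] * len(attract)
    for i, a in enumerate(attract):
        if rng.random() < a:
            clicks[i] = 1
            if rng.random() >= cont[i]:
                break  # the user is satisfied and stops examining
    return clicks


# Hypothetical parameters: attractiveness per relevance label (0-4),
# examination probability per rank (PBM), continuation probability per rank (DCM).
ATTRACT_BY_LABEL = {0: 0.05, 1: 0.2, 2: 0.4, 3: 0.6, 4: 0.8}
labels = [4, 0, 2, 1, 3]
attract = [ATTRACT_BY_LABEL[label] for label in labels]
exam = [1.0, 0.7, 0.5, 0.35, 0.25]
cont = [0.6, 0.5, 0.4, 0.3, 0.2]

rng = random.Random(42)
print(simulate_pbm(attract, exam, rng))
print(simulate_dcm(attract, cont, rng))
```
]]></preformat>
          <p>In PBM the clicks at different ranks are independent, whereas in DCM a click can end the session, which is why the two simulators produce differently shaped click logs for the same ranking list.</p>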
        </sec>
        <sec id="sec-1-2-2">
          <title>2.2.2. Train and calibrate the user behavior models with real query logs</title>
          <p>
            We train and calibrate all the user behavior models on real
query logs collected by Sogou.com, a commercial Chinese search
engine, so that the synthetic clicks are similar to real user clicks.
We split the real logs evenly into training and test sets, then strictly
follow the training process of each user behavior model proposed
in the original works[
            <xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7, 8, 9, 10</xref>
            ]. However, to make sure those
models can work for all candidate documents, we assume that
the attractiveness parameter of each query-document pair depends
only on its five-level relevance label (0-4).
          </p>
        </sec>
        <sec id="sec-1-2-3">
          <title>2.2.3. Generating clicks with click simulators</title>
          <p>
            Equipped with the trained user behavior models, the working
process of click simulators on each query session can be
summarized by the code provided in [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. We add some necessary
modifications to the click generating procedure and show it in
Algorithm 1.
          </p>
          <preformat>Algorithm 1 Generating synthetic clicks with a click simulator for a query session
Input: user behavior model M
       query session s consisting of query q and ranking list d_1, ..., d_n
       vector of relevance labels for documents (r_1, ..., r_n)
       vector of vertical types for documents (v_1, ..., v_n)
Output: vector of simulated clicks (c_1, ..., c_n)
1: for i = 1 → n do
2:   Compute p = P(C_i = 1 | C_1 = c_1, ..., C_{i-1} = c_{i-1})
     using previous clicks c_1, ..., c_{i-1},
     relevance label r_i, vertical type v_i and parameters of M
3:   Generate random value c_i from Bernoulli(p)
4: end for</preformat>
          <p>3. EXPERIMENTS</p>
          <p>We conduct a series of experiments to answer the following research questions: RQ1: How do different click simulators perform in predicting clicks and generating synthetic click logs? RQ2: Can we evaluate existing ULTR models with the ULTRE framework?</p>
          <p>3.1. Examining click simulators (RQ1)</p>
          <p>Experiment Set up
Datasets The dataset used to train the user behavior models was sampled from a real search log dataset released by the Chinese commercial search engine Sogou.com. We divide the dataset into training and test sets with a 1:1 proportion. The statistics of the dataset are shown in Table 1.
Evaluation Metrics For the click prediction task, we report the log-likelihood (LL) and perplexity (PPL) of each user behavior model. Higher values of log-likelihood and lower values of perplexity indicate better click prediction performance.</p>
          <p>To measure the quality of the samples generated by different click simulators, we compute the Kullback-Leibler (KL) divergence between the distribution of real clicks and the distribution of simulated clicks, as well as the Reverse/Forward PPL.</p>
          <p>The KL-divergence is defined in terms of the number of unique queries and the number of sessions observed for a particular query. This metric can be calculated for two click distributions: the distribution over sessions, which shows the percentage of sessions with a certain number of clicks, and the distribution over ranks, which shows how many times a certain rank was clicked. Lower values of the metric correspond to better click simulation performance.</p>
          <p>
            The second metric was first used in [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] to test the distributional coverage of click models. Reverse PPL is the PPL of a surrogate model (an intermediary to evaluate the similarity between the generated samples and the real data samples) that is trained on generated samples and evaluated on real data. Forward PPL is the PPL of a surrogate model that is trained on real data and evaluated on generated samples.
          </p>
          <p>
            Performance on predicting clicks
The results for the click prediction task on the test set are presented in Table 2, from which we can observe that MCM performs the best among all models, similar to the observation in [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. However, the other models also perform relatively well, as their metric values are all close to the ideal values (0 for LL and 1 for PPL).
          </p>
          <p>Quality of generated click logs
This section measures the similarity between the real log and the generated click logs on the test set.</p>
          <p>Table 3 summarizes the simulation performance of the four user behavior models and a baseline model (which always simulates a click on the first position) in terms of the KL-divergence of the click distribution over sessions (session-based KL) and over ranks (rank-based KL), from which we can make the following observations:
(1) UBM generates the best samples in terms of session-based KL-divergence, while MCM performs the best in terms of rank-based KL-divergence. Considering that the session-based KL-divergence of MCM is only slightly higher than that of UBM, it is fair to say that the click logs generated by MCM are the most similar to the real logs.
(2) The samples of DCM are better than the samples generated by the baseline model in terms of session-based KL-divergence; however, their performance in terms of rank-based KL-divergence is rather poor. A possible reason is that DCM does not use a rank-based examination parameter as the other three models do. In contrast, the samples of PBM are better regarding rank-based KL-divergence and worse regarding session-based KL-divergence. Such observations may be caused by the simple rank-based assumption used in PBM.</p>
          <p>[Table note: Significant improvements or degradations with respect to PBM-IPS are indicated with +/- in the paired samples t-test with p ≤ 0.05. The best performance is highlighted in boldface.]</p>
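          <p>The session-based and rank-based KL-divergence used above can be sketched as follows. This is a simplified illustration: it pools all sessions instead of weighting by query as the original metric does, and the smoothing constant, function names, and toy sessions are assumptions.</p>
          <preformat><![CDATA[
```python
import math
from collections import Counter


def clicks_per_session_dist(sessions, max_clicks):
    """Share of sessions having 0, 1, ..., max_clicks clicks."""
    counts = Counter(min(sum(s), max_clicks) for s in sessions)
    return [counts[c] / len(sessions) for c in range(max_clicks + 1)]


def clicks_per_rank_dist(sessions):
    """Share of all clicks that fall on each rank."""
    per_rank = [sum(col) for col in zip(*sessions)]
    total = sum(per_rank) or 1  # guard against logs without any click
    return [c / total for c in per_rank]


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) with additive smoothing to avoid log(0)."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    zp, zq = sum(p), sum(q)
    return sum((a / zp) * math.log((a / zp) / (b / zq)) for a, b in zip(p, q))


# Hypothetical click vectors, one list per session, one entry per rank.
real = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
simulated = [[1, 0, 0], [1, 0, 0], [0, 0, 1]]

session_kl = kl_divergence(clicks_per_session_dist(real, 3),
                           clicks_per_session_dist(simulated, 3))
rank_kl = kl_divergence(clicks_per_rank_dist(real),
                        clicks_per_rank_dist(simulated))
print(session_kl, rank_kl)
```
]]></preformat>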
          <p>Table 4 shows the Reverse/Forward PPL results of
surrogate DCM/PBM/UBM/MCM models based on different synthetic
datasets generated from the target models (DCM/PBM/UBM/MCM).</p>
          <p>To conduct an adequate and fair experiment, every model takes
the role of the surrogate model in turn. For example, when we choose
DCM as the surrogate model, the generated samples of the other
three models (PBM/UBM/MCM) can be compared.</p>
          <p>From Table 4 we can make the following observations:
(1) Samples generated by UBM and MCM achieve better
performance than those of DCM and PBM: when the surrogate
model is UBM or MCM, the MCM-samples and UBM-samples,
respectively, outperform the DCM-samples and PBM-samples.
(2) The comparison between UBM-samples and MCM-samples is
less clear-cut. When the surrogate model is DCM, the MCM-samples
are better in terms of both Reverse PPL and Forward
PPL; however, when the surrogate model is PBM, the better samples
differ depending on the metric. As a result, we cannot
conclude whether the samples generated by UBM or MCM are
the most similar to the real logs.</p>
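          <p>The Reverse/Forward PPL procedure can be illustrated with a deliberately simple surrogate. In the sketch below the surrogate is a smoothed per-rank click-through-rate model, which is an assumption made for brevity; the experiments above use DCM/PBM/UBM/MCM as surrogates. Perplexity is computed per rank and then averaged, and all names and toy sessions are hypothetical.</p>
          <preformat><![CDATA[
```python
import math


def fit_rank_ctr(sessions, smoothing=1.0):
    """Toy surrogate: smoothed click-through rate for each rank."""
    n = len(sessions)
    return [(sum(col) + smoothing) / (n + 2 * smoothing)
            for col in zip(*sessions)]


def perplexity(model, sessions):
    """Average over ranks of 2^(-mean log2-likelihood of the observed clicks)."""
    per_rank = []
    for r, q in enumerate(model):
        ll = sum(math.log2(q if s[r] else 1.0 - q) for s in sessions)
        per_rank.append(2 ** (-ll / len(sessions)))
    return sum(per_rank) / len(per_rank)


def reverse_forward_ppl(real, generated):
    """Reverse: train on generated, test on real. Forward: the opposite."""
    reverse = perplexity(fit_rank_ctr(generated), real)
    forward = perplexity(fit_rank_ctr(real), generated)
    return reverse, forward


# Hypothetical click vectors, one list per session, one entry per rank.
real = [[1, 0, 0], [1, 1, 0], [0, 0, 0], [1, 0, 0]]
generated = [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
print(reverse_forward_ppl(real, generated))
```
]]></preformat>
          <p>Values closer to 1 indicate that the surrogate trained on one log explains the other log well.</p>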
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>3.2. Evaluating ULTR models with the ULTRE framework (RQ2)</title>
        <p>To answer RQ2, we evaluate several offline ULTR models with the ULTRE framework.</p>
        <p>
          Dataset and Simulation Setup
The dataset used to evaluate ULTR models is based on Sogou-SRR2[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], a public dataset for relevance estimation and ranking in Web search. We select 1,211 unique queries with at least 10 successfully crawled results: 1,011 for training, 100 for validation, and 100 for testing. In addition, we use a stratified sampling approach to ensure that the frequency of queries in each dataset is consistent with that in the real logs. As mentioned in Section 2, we need a production ranker to produce ranking lists for training queries, so we trained a lambdaMART model with 1% of the data randomly sampled from the original training set (with 5-level relevance annotations). After that, we follow the click-simulation process in the ULTRE framework. Table 5 shows the details of the dataset we constructed.
        </p>
        <p>
          Model Setup and Evaluation
We chose three offline ULTR models as our candidate models, which are IPW[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] (named PBM-IPS in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]), CM-IPS[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and DLA[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. To conduct a fair comparison, we followed the settings in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], using a multi-layer perceptron (MLP) with three hidden layers (with 512, 256, and 128 neurons) as the ranking model for all candidate models and setting the batch size to 256. We trained each candidate for 10k steps and chose the ranking model from the iteration with the best performance on the validation set. The experiment was repeated 10 times to ensure the reliability of the final results. nDCG@5 was used to evaluate the performance of each candidate model.
        </p>
        <p>
          Evaluation results
Table 6 shows nDCG@5 for three different offline ULTR models trained on different synthetic train sets. The baseline is the performance of the production ranker, and the skyline is the performance of a lambdaMART model trained on the whole train set with human annotations instead of biased clicks. From the results, we can see that:
(1) PBM-IPS performs best when trained on the PBM-based training set, and CM-IPS performs best when trained on the DCM-simulated training set, compared to the same model trained on the other sets. That observation coincides with the conclusion in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] that when the user behavior model used in click simulation and the bias correction method are consistent, the results are better than when they do not agree.
(2) Compared to PBM-IPS and CM-IPS, DLA performs the best on all synthetic train sets, which indicates that DLA is more robust and more adaptive to changes in the user behavior assumption used in the click simulation. That advantage can be attributed to the unification of learning propensity weights (used to correct bias in click data) and learning ranking models proposed in DLA. Such a learning paradigm can help the DLA model adjust its propensity weights automatically to the differences between synthetic training sets, while the PBM-IPS and CM-IPS models cannot.
        </p>
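        <p>For readers unfamiliar with propensity weighting, the following minimal sketch shows the idea behind an inverse-propensity-weighted objective in the spirit of IPW: each clicked document contributes its rank under the new scoring function, divided by the examination propensity of the position where the click was logged. It is an illustration only, not the implementation evaluated above; the function name and all numbers are hypothetical.</p>
        <preformat><![CDATA[
```python
def ipw_rank_loss(scores, clicks, propensities):
    """Sum over clicked documents of (rank under `scores`) divided by the
    examination propensity of the position where the click was logged."""
    loss = 0.0
    for i, clicked in enumerate(clicks):
        if clicked:
            # Rank of document i under the new scores (1 = best).
            rank = 1 + sum(s > scores[i] for s in scores)
            loss += rank / propensities[i]
    return loss


# Hypothetical logged list of three documents, in logged display order.
scores = [0.2, 0.9, 0.5]        # scores assigned by the ranker being trained
clicks = [0, 1, 0]              # simulated clicks on the logged list
propensities = [1.0, 0.5, 0.25]  # examination propensity per logged position
print(ipw_rank_loss(scores, clicks, propensities))
```
]]></preformat>
        <p>A click at a rarely examined position receives a larger weight, which compensates for position bias when the propensities match the behavior model that generated the clicks.</p>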
        <p>The above observations demonstrate the usefulness and
effectiveness of the ULTRE framework. By using the ULTRE
framework, besides evaluating the performance of one particular model
as many previous works have done, we can conduct a
fair and thorough comparison between different ULTR models.</p>
        <p>In addition, we have the chance to investigate the following
questions: 1) to what extent the evaluation results are influenced
by the user simulation model and by the mismatch between the
assumptions of the simulation model and the ranking model; 2) which
ULTR model can adapt to different environments defined by
different simulation models and achieve a robust improvement in
ranking performance.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Conclusion and Future work</title>
      <p>In this paper, we introduce the ULTRE framework, which aims to
improve the simulation approach used in previous ULTR
evaluation. Our experiments show that the ULTRE framework can provide
simulation-based training sets of both quality and diversity. More
importantly, it enables us to conduct a thorough and relatively
objective comparison of different ULTR models. We further
design two evaluation protocols for using this framework as a shared
evaluation service for both offline and online ULTR models.</p>
      <p>
        Our work includes an initial implementation of the ULTRE
framework, and some work is still ongoing toward the final
deployment. For example, we plan to adopt neural user behavior models
such as the Context-aware Click Simulator (CCS)[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] for the click
simulation, since the user behavior models we used in this work
are all based on probabilistic graphical models (PGMs) and
neural models may have better click prediction performance.
Moreover, the implementation of the online service and a comparison
of online ULTR models under the ULTRE framework are still
needed, as we only present comparison results for offline
ULTR models in this paper. We plan to use the ULTRE framework
in the Unbiased Learning to Rank Evaluation Task (ULTRE), a
pilot task in NTCIR 16.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Swaminathan</surname>
          </string-name>
          , T. Schnabel,
          <article-title>Unbiased learning-to-rank with biased feedback</article-title>
          ,
          <source>in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining</source>
          , WSDM '17,
          Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , p.
          <fpage>781</fpage>
          -
          <lpage>789</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Unbiased learning to rank with unbiased propensity estimation</article-title>
          ,
          <source>in: The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval</source>
          , SIGIR '18,
          Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , p.
          <fpage>385</fpage>
          -
          <lpage>394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Unbiased lambdamart: an unbiased pairwise learning-to-rank algorithm</article-title>
          ,
          <source>in: The World Wide Web Conference</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2830</fpage>
          -
          <lpage>2836</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Langley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>McCord-Snook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Efficient exploration of gradient space for online learning to rank</article-title>
          ,
          <source>in: The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval</source>
          , SIGIR '18,
          Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , p.
          <fpage>145</fpage>
          -
          <lpage>154</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Oosterhuis</surname>
          </string-name>
          , M. de Rijke,
          <article-title>Diferentiable unbiased online learning to rank</article-title>
          ,
          <source>in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management</source>
          , CIKM '18,
          Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , pp.
          <fpage>1293</fpage>
          -
          <lpage>1302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vardasbi</surname>
          </string-name>
          , M. de Rijke, I. Markov,
          <article-title>Cascade model-based propensity estimation for counterfactual learning to rank</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '20,
          Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , pp.
          <fpage>2089</fpage>
          -
          <lpage>2092</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Zoeter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ramsey</surname>
          </string-name>
          ,
          <article-title>An experimental comparison of click position-bias models</article-title>
          ,
          <source>in: Proceedings of the 2008 International Conference on Web Search and Data Mining</source>
          , WSDM '08,
          Association for Computing Machinery,
          <year>2008</year>
          , pp.
          <fpage>87</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Guo</surname>
          </string-name>
          , C. Liu,
          <string-name>
            <given-names>Y. M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Efficient multiple-click models in web search</article-title>
          ,
          <source>in: Proceedings of the Second ACM International Conference on Web Search and Data Mining</source>
          , WSDM '09,
          Association for Computing Machinery, New York, NY, USA,
          <year>2009</year>
          , pp.
          <fpage>124</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Dupret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          ,
          <article-title>A user browsing model to predict search engine click data from past observations</article-title>
          ,
          <source>in: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '08,
          Association for Computing Machinery, New York, NY, USA,
          <year>2008</year>
          , pp.
          <fpage>331</fpage>
          -
          <lpage>338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ma,
          <article-title>Constructing click models for mobile search</article-title>
          ,
          <source>in: The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval</source>
          , SIGIR '18,
          Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , pp.
          <fpage>775</fpage>
          -
          <lpage>784</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Malkevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Markov</surname>
          </string-name>
          , E. Michailova, M. de Rijke,
          <article-title>Evaluating and analyzing click simulation in web search</article-title>
          ,
          <source>in: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval</source>
          , ICTIR '17,
          Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , pp.
          <fpage>281</fpage>
          -
          <lpage>284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>An adversarial imitation click model for information retrieval</article-title>
          ,
          <source>arXiv preprint arXiv:2104.06077</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Y. Liu,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <article-title>Relevance estimation with multiple information sources on search engine result pages</article-title>
          ,
          <source>in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>627</fpage>
          -
          <lpage>636</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <article-title>Unbiased learning to rank: Online or offline?</article-title>
          ,
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>39</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Mao, Y. Liu,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ma,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <article-title>Context-aware ranking by constructing a virtual environment for reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management</source>
          , CIKM '19,
          Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , pp.
          <fpage>1603</fpage>
          -
          <lpage>1612</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>