<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning to Rank from Relevance Judgments Distributions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alberto Purpura</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianmaria Silvello</string-name>
          <email>silvello@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gian Antonio Susto</string-name>
          <email>gianantonio.susto@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Research Europe</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padua</institution>
          ,
          <addr-line>Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>LEarning TO Rank (LETOR) algorithms are usually trained on annotated corpora where a single relevance label is assigned to each available document-topic pair. Within the Cranfield framework, relevance labels result from merging either multiple expertly curated or crowdsourced human assessments. In this paper, we explore how to train LETOR models with relevance judgments distributions (either real or synthetically generated) assigned to document-topic pairs instead of single-valued relevance labels. We propose five new probabilistic loss functions to deal with the higher expressive power provided by relevance judgments distributions and show how they can be applied both to neural and gradient boosting machine (GBM) architectures. Overall, we observe that relying on relevance judgments distributions to train different LETOR models can boost their performance and even outperform strong baselines such as LambdaMART on several test collections.</p>
      </abstract>
      <kwd-group>
        <kwd>Learning to Rank</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Optimization Functions</kwd>
        <kwd>Information Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Ranking is a problem that we encounter in a number of tasks we perform every day: from
searching on the Web to online shopping. Given an unordered set of items, this problem consists
of ordering the items according to a certain notion of relevance. Generally, in Information
Retrieval (IR) we rely on a notion of relevance that depends on the information need of a
user, expressed through a keyword query. When creating a new experimental collection, the
corresponding relevance judgments are obtained by asking different judges to assign a relevance
score to each document-topic pair. Multiple judges – either trained experts or participants of a
crowdsourcing experiment – usually assess the same document-topic pair, and the final relevance
label for the pair is obtained by aggregating these scores [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This process is a cornerstone
for system training and evaluation and has contributed to the continuous development of
IR, especially in the context of international evaluation campaigns. Nonetheless, the opinion
of different judges on the same document-topic pair might be very different or even diverge
to the opposite ends of the spectrum – either because of random human errors or due to a
different interpretation of a topic. Inevitably, the aggregation process conflates the multiple
assessors’ viewpoints on document-topic pairs onto a single one, thus losing some information
– even though it also reduces annotation errors and outliers. Our research hypothesis is that
Machine Learning (ML) models – i.e., LEarning TO Rank (LETOR) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Neural Information
Retrieval (NIR) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] models – could use all the labels collected in the annotation process to
improve the quality of their rankings. Indeed, judges’ disagreement on a certain document-topic
pair can be due to an inherent difficulty of the topic or to the existence of multiple interpretations
of it. We argue that designing ML models able to learn from the whole distributions of relevance
judgments could improve the models’ representation of relevance and their performance through
the usage of this additional information. Following this idea, we propose to interpret the output
of a LETOR model as a probability value or distribution – according to the experimental
hypotheses – and define different Kullback–Leibler (KL) divergence-based loss functions to
train a model using a distribution of relevance judgments associated to the current training item.
Such a training strategy allows us to leverage all the available information from human judges
without additional computational costs compared to traditional LETOR training paradigms.
      </p>
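      <p>As a toy illustration of the information that label aggregation discards (the judgment values and the majority-vote rule below are our own hypothetical example, not data from any collection):</p>

```python
from collections import Counter

# Hypothetical binary judgments by five assessors for one document-topic pair.
judgments = [1, 1, 0, 1, 0]

# Traditional aggregation: collapse the assessments into a single label,
# e.g. by majority vote -- the two dissenting opinions are discarded.
majority_label = Counter(judgments).most_common(1)[0][0]

# Distribution view: keep the empirical probability of relevance, which
# preserves the degree of disagreement among the judges.
p_relevant = sum(judgments) / len(judgments)

print(majority_label)  # 1
print(p_relevant)      # 0.6
```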
      <p>
        The loss functions we propose [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] can be used to train any ranking model that relies on
gradient-based learning, including popular NIR models or LETOR ones. In our experiments [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
we focus on one transformer-based neural LETOR model and on one decision tree-based Gradient
Boosting Machine (GBM) model – the model at the base of the popular LambdaMART [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] ranker
and used as a strong baseline in many recent LETOR research papers such as [
        <xref ref-type="bibr" rid="ref10 ref6 ref7 ref8 ref9">6, 7, 8, 9, 10</xref>
        ].
We assess the quality of the proposed training strategies on four standard LETOR collections
(MQ2007, MQ2008, MSLR-WEB30K [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and OHSUMED [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]). The paper is organized as follows:
in Section 2 we present the details of the training strategies we propose; in Section 3 we present
the most significant evaluation results we achieve and in Section 4 we report our conclusions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Approach</title>
      <p>
        We propose five different loss function formulations that allow ranking models to take
advantage of relevance judgments distributions prior to their aggregation. They are all based on two
intuitions that allow us to model relevance judgments as probability distributions and to use KL
divergence-based measures to compare them. We first propose to interpret the relevance label
(or a set of relevance labels) assigned to the same document as if it was generated by a Binomial
(or Multinomial) random variable modeling the judges’ annotation process. For example, we
assume that n assessors provided one binary relevance label for each document-topic pair, i.e.,
to state whether the pair was a relevant or a not-relevant one. This process can be modeled
as a Binomial random variable Y ∼ Bin(n, p), where the success probability p for each sample
is the average of the binary responses submitted in n trials. We can follow the same process
to represent a set of relevance labels associated to the same document as a sample from a
Multinomial random variable. We then apply the same reasoning for the interpretation of our
model output probability score as another Binomial (or Multinomial) distribution Ŷ ∼ Bin(n, p̂),
with the same parameter n – empirically tuned for the numerical stability of the gradients
during training – and probability p̂ equal to the output of the model. The second option we
propose is to consider the relevance labels associated to each document (or batch of documents)
as samples from Gaussian (or multivariate Gaussian) random variables Y ∼ N(μ, σ), with
the same standard deviation σ but centered on a different point μ depending on the relevance
label associated with the document. Depending on the modeling strategy, the proposed loss
functions take the following formulations, typical of pointwise, pairwise hinge [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or listwise
losses that are frequently employed in NIR [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and LETOR approaches [
        <xref ref-type="bibr" rid="ref15 ref16 ref17 ref6">15, 16, 17, 6</xref>
        ]:
      </p>
      <p>• Pointwise(B) = (KL_B(Y || Ŷ) + KL_B(Ŷ || Y)) * w, where we rescale each term
in a training batch by a factor w, inversely proportional to the number of times an item
of the same class (relevant or not relevant) appeared in it;</p>
      <p>• Pointwise(M) = (KL_M(Y || Ŷ) + KL_M(Ŷ || Y)) * w, where we employ
a Multinomial random variable instead of a Binomial one to represent the set of relevance
labels associated to a certain topic-document pair by an annotator;</p>
      <p>• Pairwise(B) = max(0, γ − sign(p̂+ − p̂−) KL_B(Ŷ+, Ŷ−)), where γ is a
slack parameter to adjust the distance between the two distributions, p̂+ and p̂− are the
outputs of the LETOR model associated to two documents – the former with a higher
relevance label than the latter – and Ŷ+ ∼ Bin(n, p̂+) and Ŷ− ∼ Bin(n, p̂−) are two
Binomial distributions corresponding to a relevant and to a not-relevant document-topic
pair, respectively;</p>
      <p>• Pairwise(G) = max(0, γ − sign(μ+ − μ−) KL_G(Ŷ+, Ŷ−)), where we adapt the
previous hinge-style loss function to represent label distributions with a Gaussian random
variable instead of a Binomial one;</p>
      <p>• Listwise(G) = KL_G(Ŷ || Y) * w, where we consider the relevance labels
associated to multiple documents to rerank for the same query at the same time,
modeling the list of relevance scores and their respective labels as multivariate Gaussian
distributions.</p>
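      <p>As a rough sketch of how two of these losses can be computed (in Python; the function names, the clamping constant EPS, and all numeric values are ours, not from the paper’s implementation), note that the KL divergence between two Binomial distributions with the same n, and between two Gaussians with the same σ, both admit simple closed forms:</p>

```python
import math

EPS = 1e-7  # clamping constant (our choice) for numerical stability of the logs

def kl_binomial(n, p, q):
    """KL(Bin(n, p) || Bin(n, q)) in closed form (same number of trials n)."""
    p = min(max(p, EPS), 1 - EPS)
    q = min(max(q, EPS), 1 - EPS)
    return n * (p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q)))

def pointwise_binomial_loss(n, p, p_hat, w=1.0):
    """Symmetrized KL between the judgment distribution Bin(n, p) and the
    model distribution Bin(n, p_hat), rescaled by the class-balance factor w."""
    return (kl_binomial(n, p, p_hat) + kl_binomial(n, p_hat, p)) * w

def kl_gaussian_same_sigma(mu1, mu2, sigma):
    """KL(N(mu1, sigma) || N(mu2, sigma)) reduces to (mu1 - mu2)^2 / (2 sigma^2)."""
    return (mu1 - mu2) ** 2 / (2.0 * sigma ** 2)

def pairwise_gaussian_loss(mu_pos, mu_neg, sigma, gamma=1.0):
    """Hinge-style loss: penalize pairs whose score distributions are closer
    than the slack gamma, or whose scores are ordered the wrong way round."""
    sign = 1.0 if mu_pos > mu_neg else -1.0
    return max(0.0, gamma - sign * kl_gaussian_same_sigma(mu_pos, mu_neg, sigma))

# Three of five assessors judged the pair relevant (p = 0.6), while the
# model outputs p_hat = 0.9: the loss is positive and shrinks as p_hat nears p.
loss = pointwise_binomial_loss(n=5, p=0.6, p_hat=0.9)
```

      <p>With identical distributions the loss vanishes: pointwise_binomial_loss(5, 0.6, 0.6) is 0.</p>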
      <p>
        We evaluate the proposed loss functions on different LETOR models, i.e., the LightGBM
implementation of LambdaMART [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and a simpler transformer-based neural model [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] with
one self-attention layer followed by a feed-forward one. The experimental collections that we
consider in our experiments are: MQ2007, MQ2008, MSLR-WEB30K [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and OHSUMED [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
which are the experimental collections of reference in the LETOR domain. All collections
are already organized in five different folds with the respective training, test and validation
subsets. We report the performance of our model averaged over these folds with the exception
of the MSLR-WEB30K collection where we only consider Fold 1 as in other popular research
works [
        <xref ref-type="bibr" rid="ref19 ref20 ref21 ref10">19, 20, 21, 10</xref>
        ]. Our code and implementation of the proposed loss functions are available
at: https://github.com/albpurpura/PLTR.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>In Table 1, we report the most significant results of our performance evaluation. For each
experimental collection, we report the performance of the best variants of a LambdaMART and
a Transformer Neural Network (NN) model when trained with the different loss functions we
propose: the GBM-based variants (LambdaMART, Pointwise KL (Binomial), Listwise KL
(Gaussian)) and the neural ones (Pairwise KL (Gaussian), Listwise KL (Gaussian)), evaluated on
the MQ2007, MQ2008, MSLR-WEB30K and OHSUMED collections. The reported measures –
computed at rank cutoffs k ∈ {1, 3, 5}, along with mean Average Precision (AP) and ERR [22] –
are averaged over all topics. In Table 1, ↑ or ↓ indicates a statistically significant (p &lt; 0.05)
difference with the LambdaMART model trained on the original relevance judgments; the best
performance measures per collection are in bold, as is the loss function with the most best
measures per collection.</p>
      <p>In most of the cases, our simple Transformer-based neural model trained with the proposed
loss functions is able to outperform a LambdaMART model – one of the best performing
state-of-the-art models, often used as a baseline in the LETOR literature. We also observe how a
GBM-based model is able to benefit from the proposed loss functions. In fact, when evaluated on
the MQ2007, MQ2008 and OHSUMED collections, the proposed variant of the GBM model
trained with the Pointwise KL (Binomial) loss function outperforms the GBM – LambdaMART
model according to different performance measures.</p>
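      <p>The paper does not spell out which statistical test produces the ↑/↓ markers; as one hedged possibility, a two-sided randomization (sign-flip) test over per-topic score differences – a common choice in IR evaluation – can be sketched as follows (the function name and iteration count are our own):</p>

```python
import random
from statistics import mean

def sign_flip_test(scores_a, scores_b, iters=10000, seed=0):
    """Two-sided randomization test: p-value for the null hypothesis that
    the per-topic differences between two systems are symmetric around zero."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(iters):
        # Randomly flip the sign of each per-topic difference.
        flipped = [d if rng.random() >= 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return hits / iters  # significant at 0.05 when the returned p falls below it
```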
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>
        We presented different strategies to train a LETOR model relying on relevance judgments
distributions. We introduced five different loss functions relying on the
KL divergence between
distributions, opening new possibilities for the training of LETOR models. The proposed loss
functions were evaluated on a transformer-based neural model and on a decision tree-based GBM
model – the same model employed by the popular LambdaMART algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] – over a number
of experimental collections of different sizes. In our experiments, the proposed loss functions
outperformed the aforementioned baselines in several cases and gave a significant performance
boost to LETOR approaches – especially the ones based on neural models – allowing them to also
outperform other strong baselines in the LETOR domain such as the LightGBM implementation
of LambdaMART [
        <xref ref-type="bibr" rid="ref10 ref5">5, 10</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Cox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Milić-Frayling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kazai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vinay</surname>
          </string-name>
          ,
          <article-title>On aggregating labels from multiple crowd workers to infer relevance of documents</article-title>
          ,
          <source>in: Proc. of ECIR</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tax</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bockting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          ,
          <article-title>A cross-benchmark comparison of 87 learning to rank methods</article-title>
          ,
          <source>IP&amp;M</source>
          <volume>51</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Onal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Altingovde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Karagoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Braylan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Angert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Banner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khetan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mcdonnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rijke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lease</surname>
          </string-name>
          ,
          <article-title>Neural information retrieval: At the end of the early years</article-title>
          ,
          <source>Information Retrieval</source>
          <volume>21</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Purpura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Susto</surname>
          </string-name>
          ,
          <article-title>Learning to rank from relevance judgments distributions</article-title>
          ,
          <source>Journal of the Association for Information Science and Technology</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <article-title>From ranknet to lambdarank to lambdamart: An overview</article-title>
          ,
          <source>in: MSR-TR-2010-82</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zoghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>Revisiting approximate metric optimization in the age of deep neural networks</article-title>
          ,
          <source>in: Proc. of SIGIR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bruch</surname>
          </string-name>
          ,
          <article-title>An alternative cross entropy loss for learning-to-rank</article-title>
          , arXiv:1911.09798 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pasumarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>Permutation equivariant document interaction network for neural learning to rank</article-title>
          ,
          <source>in: Proc. of ICTIR</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>A stochastic treatment of learning to rank scoring functions</article-title>
          ,
          <source>in: Proc. of WSDM</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pasumarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>Neural rankers are hitherto outperformed by gradient boosted decision trees</article-title>
          ,
          <source>in: Proc. of ICLR</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Introducing LETOR 4.0 datasets</article-title>
          , arXiv:1306.2597 (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Letor: A benchmark collection for research on learning to rank for information retrieval</article-title>
          ,
          <source>Information Retrieval</source>
          <volume>13</volume>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Ranking measures and loss functions in learning to rank</article-title>
          ,
          <source>Proc. of NIPS</source>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Purpura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <article-title>Focal elements of neural information retrieval models. an outlook through a reproducibility study</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>57</volume>
          (
          <year>2020</year>
          )
          <fpage>102109</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A general approximation framework for direct optimization of information retrieval measures</article-title>
          ,
          <source>IR Journal 4</source>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Purpura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maggipinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Susto</surname>
          </string-name>
          ,
          <article-title>Probabilistic word embeddings in neural ir: A promising model that does not work as expected (for now)</article-title>
          ,
          <source>in: Proc. of ICTIR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <article-title>CEDR: Contextualized embeddings for document ranking</article-title>
          ,
          <source>in: Proc. of SIGIR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Proc. of NIPS</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>Feature transformation for neural ranking models</article-title>
          ,
          <source>in: Proc. of SIGIR</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Grushetsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mitrichev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sterling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ravina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <article-title>Interpretable learning-to-rank with generalized additive models</article-title>
          , arXiv:2005.02553 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ibrahim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carman</surname>
          </string-name>
          ,
          <article-title>Comparing pointwise and listwise objective functions for random-forest-based learning-to-rank</article-title>
          ,
          <source>ACM TOIS</source>
          <volume>34</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>O.</given-names>
            <surname>Chapelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metlzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Grinspan</surname>
          </string-name>
          ,
          <article-title>Expected reciprocal rank for graded relevance</article-title>
          ,
          <source>in: Proc. of CIKM</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>