<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>End-to-End Neural Ranking for eCommerce Product Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" corresp="yes">
          <string-name>Eliot P. Brenner</string-name>
          <email>eliot.brenner@jet.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aliasgar Kutiyanawala</string-name>
          <email>aliasgar@jet.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jun (Raymond) Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zheng (John) Yan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Jet.com/Walmart Labs</institution>
          ,
          <addr-line>Hoboken, NJ</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>CCS CONCEPTS: • Information systems → Query representation; Probabilistic retrieval models; Relevance assessment; Task models; Enterprise search; • Computing methodologies → Neural networks; Bayesian network models.</p>
      </abstract>
      <kwd-group>
        <kwd>Ranking</kwd>
        <kwd>Neural IR</kwd>
        <kwd>Kernel Pooling</kwd>
        <kwd>Relevance Model</kwd>
        <kwd>Embedding</kwd>
        <kwd>eCommerce</kwd>
        <kwd>Product Search</kwd>
        <kwd>Click Models</kwd>
        <kwd>Task Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>ABSTRACT</title>
      <p>We consider the problem of retrieving and ranking items in an
eCommerce catalog, often called SKUs, in order of relevance to a
user-issued query. The input data for the ranking are the texts of
the queries and textual fields of the SKUs indexed in the catalog.
We review the ways in which this problem both resembles and
differs from the problems of information retrieval (IR) in the
context of web search, which is the context typically assumed in the
IR literature. The differences between the product-search problem
and the IR problem of web search necessitate a different approach
in terms of both models and datasets. We first review the recent
state-of-the-art models for web search IR, focusing on the CLSM of
[20] as a representative of one type, which we call the distributed
type, and the kernel pooling model of [26], as a representative of
another type, which we call the local-interaction type. The different
types of relevance models developed for IR have complementary
advantages and disadvantages when applied to eCommerce
product search. Further, we explain why the conventional methods
for dataset construction employed in the IR literature fail to
produce data which suffices for training or evaluation of models for
eCommerce product search. We explain how our own approach,
applying task modeling techniques to the click-through logs of an
eCommerce site, enables the construction of a large-scale dataset
for training and robust benchmarking of relevance models. Our
experiments consist of applying several of the models from the
IR literature to our own dataset. Empirically, we have established
that, when applied to our dataset, certain models of local-interaction
type reduce ranking errors by one-third compared to the baseline
system (tf-idf). Applied to our dataset, the distributed models fail
to outperform the baseline. As a basis for a deployed system, the
distributed models have several advantages, computationally, over
the local-interaction models. This motivates an ongoing program of
work, which we outline at the conclusion of the paper.
∗Corresponding Author.</p>
      <p>Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).</p>
      <p>SIGIR 2018 eCom, July 2018, Ann Arbor, Michigan, USA
© 2018 Copyright held by the owner/author(s).</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>Currently deployed systems for eCommerce product search tend
to use inverted-index based retrieval, as implemented in
Elasticsearch [4] or Solr [21]. For ranking, these systems typically use
legacy relevance functions such as tf-idf [25] or Okapi BM25 [18],
as implemented in these search systems. Such relevance functions
are based on exact ("hard") matches of tokens, rather than semantic
("soft") matches, are insensitive to word order, and have hard-coded,
rather than learned, weights. On the one hand, their simplicity
makes legacy relevance functions scalable and easy to implement.
On the other hand, they are found to be inadequate in practice for
fine-grained ranking of search results. Typically, in order to achieve
rankings of search results that are acceptable for presentation to
the user, eCommerce sites overlay on top of the legacy relevance
function score a variety of handcrafted filters (using structured
data fields) as well as hard-coded rules for specific queries. In some
cases, eCommerce sites are able to develop intricate and specialized
proprietary NLP systems, referred to as Query-SKU
Understanding (QSU) Systems, for analyzing and matching relevant SKUs to
queries. QSU systems, while potentially very effective at
addressing the shortcomings of legacy relevance scores, require some
degree of domain-specific knowledge to engineer [8]. Because of
concept drift, the maintenance of QSU systems demands a
long-term commitment of analyst and programmer labor. As a result of
these scalability issues, QSU systems are within reach only for the
very largest eCommerce companies with abundant resources.</p>
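      <p>To make the hard-match limitation concrete, the following toy sketch (ours, for illustration; the two-SKU "catalog" and the smoothed idf formula are not from the paper) scores queries against SKU titles with a bare-bones tf-idf, showing that a synonym sharing no token with a title scores zero:

```python
import math
from collections import Counter

# Toy two-SKU "catalog" (illustrative, not the paper's data).
catalog = {
    "sku1": "sofa couch loveseat",
    "sku2": "tv television remote control",
}
docs = {sku: text.split() for sku, text in catalog.items()}
N = len(docs)

def idf(term):
    # Smoothed inverse document frequency.
    df = sum(term in toks for toks in docs.values())
    return math.log((N + 1) / (df + 1)) + 1

def tfidf_score(query, sku):
    # Sum of tf * idf over exactly-matching ("hard") tokens only.
    tf = Counter(docs[sku])
    return sum(tf[t] * idf(t) for t in query.split() if t in tf)

print(tfidf_score("couch", "sku1"))   # positive: exact token match
print(tfidf_score("settee", "sku1"))  # 0: synonym, but no hard match
```

A handcrafted synonym list mapping "settee" to "couch", as described above, is exactly the kind of patch such systems require.</p>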
      <p>Recently, the field of neural IR (NIR) has shown great promise to
overturn this state of affairs. The approach of NIR to ranking differs
from the aforementioned QSU systems in that it learns vector-space
representations of both queries and SKUs which facilitate
learning-to-rank (LTR) models to address the task of relevance ranking in an
end-to-end manner. NIR, if successfully applied to eCommerce, can
allow any company with access to commodity GPUs and abundant
user click-through logs to build an accurate and robust model for
ranking search results at lower cost over the long term compared
to a QSU system. For a current and comprehensive review of the
field of NIR, see [11].</p>
      <p>Our aim in this paper is to provide further theoretical justification
and empirical evidence that fresh ideas and techniques are needed
to make Neural IR a practical alternative to legacy relevance and
rule-based systems for eCommerce search. Based on the results of
model training which we present, we delineate a handful of ideas
which appear promising so far and deserve further development.</p>
    </sec>
    <sec id="sec-3">
      <title>RELATION TO PREVIOUS WORK IN NEURAL IR</title>
      <p>The field of NIR shares its origin with the more general field of
neural NLP [3] in the word-embeddings work of word2vec [9] and
its variants. From there, though, the field diverges from the more
general stream of neural NLP in a variety of ways. The task of
NIR, as compared with other widely known branches such as topic
modeling, document clustering, machine translation and automatic
summarization, consists of matching texts from one collection (the
queries) with texts from another collection (webpages or SKUs, or
other corpus entries, as the case may be). The specific challenges
and innovations of NIR in terms of models are driven by the fact
that entries from the two collections (queries versus documents) are
of a very different nature from one another, both in length and in
internal structure and semantics.</p>
      <p>In terms of datasets, the notion of relevance has to be defined
in relation to a specific IR task and the population which will use
the system. In contrast to the way many general NLP models can
be trained and evaluated on publicly available labeled benchmark
corpora, a NIR model has to be trained on a dataset tailored to
reflect the information needs of the population the system is
meant to serve. In order to produce such task-specific datasets in
a reliable and scalable manner, practitioners need to go beyond
traditional methods such as expert labeling and simple statistical
aggregations. The solution has been to take "crowdsourcing" to its
logical conclusion and to use the "big data" of user logs to extract
relevance judgments. This has led NIR to depend for scalability on
the field of Click Models, which are discussed at great length in the
context of web search in the book [2].</p>
      <p>While NIR has great promise for application to eCommerce, the
existing NIR literature has thus far been biased towards the web
search problem, relegating eCommerce product search to "niche"
field status. We hope to remedy this situation by demonstrating
the need for radical innovation in the field of NIR to address the
challenges of eCommerce product search.</p>
      <p>One of the most important differences is that for product search,
as compared to web search, both the intent and vocabulary of
queries tend to be more restricted and "predictable". For example,
when typing into a commercial web search engine, people can be
expected to search for any type of information. Generally speaking,
customers on a particular eCommerce site are looking to satisfy
only one type of information need, namely to retrieve a list of SKUs
in the catalog meeting certain criteria. Users, both customers and
merchants, are highly motivated by economic concerns (time and
money spent) to craft queries and SKU text fields to facilitate the
surfacing of relevant content, for example, by training themselves
to pack popular and descriptive keywords into queries and SKUs.
As a consequence, the non-neural baselines, such as tf-idf, tend
to achieve higher scores on IR metrics in product search datasets
than in web search. An NIR model, when applied to product search
as opposed to web search, has a higher bar to clear to justify the
added system complexity.</p>
      <p>Further, the lift NIR provides over tf-idf is largely in detecting
semantic, non-exact matches. For example, consider a user searching
the web with the query "king's castle". This user will likely have her
information need met with a document about the "monarch's castle"
or even possibly "queen's castle", even if it does not explicitly
mention "king". In contrast, consider the user issuing the query "king
bed" on an eCommerce site. She would likely consider a "Monarch
bed"¹ irrelevant unless that bed is also "king", and a "queen bed"
even more irrelevant. An approach based on word2vec or GloVe
[16] vectors would likely consider the words "king", "queen" and
"monarch" all similar to one another based on the distributional
hypothesis. The baseline systems deployed on eCommerce
websites often deal with semantic similarity by incorporating analyzers
with handcrafted synonym lists built in, an approach which is
completely infeasible for open-domain web search. This further blunts
the positive impact of learned semantic similarity relationships.</p>
      <p>Another difference between "general" IR for web search and
IR specialized for eCommerce product search is that in the
latter the relevance landscape is simultaneously "flatter" and more
"jagged". With regard to "flatness", consider a very popular web
search for 2017 (according to [22]): "how to make solar eclipse
glasses". Although the query expresses a very specific search intent,
it is likely that the user making it has other, related information
needs. Consequently, a good search engine results page (SERP)
could be composed entirely of links to instructions on making solar
eclipse glasses, which, while completely relevant, could be
redundant. A better SERP would be composed of a mixture of the former
and of links to retailers selling materials for making such glasses, to
instructions on how to use the glasses, and to the best locations for
viewing the upcoming eclipse, all of which are likely relevant to some
degree to the user's information need and improve SERP diversity.
In contrast, consider the situation for a more typical eCommerce
query: "tv remote". A good product list view (PLV) page for the
search "tv remote" would display exclusively SKUs which are tv
remotes, and no SKUs which are tvs. Further, all the "tv remote"
SKUs are equally relevant to the query, though some might be more
popular and engaging than others. With regard to "jaggedness",
consider the query "desk chair": any "desk with chair" SKUs would
be considered completely irrelevant and do not belong anywhere
on the PLV page, in spite of the very small lexical difference between
the queries "desk chair" and "desk with chair". In web search, by
contrast, documents which are relevant for a given query tend to
remain partially relevant for queries which are lexically similar to
the original query. If the NIR system is to compete with currently
deployed rule-based systems, it is of great importance for a dataset
for training and benchmarking NIR models for product search to
incorporate such "adversarial" examples prevalently.
¹For the sake of this example, we are assuming that "Monarch" is a brand of bed, only
some of which are "king" (size) beds.</p>
      <p>Complications arise when we estimate relevance from the
click-through logs of an eCommerce site, as compared to web search
logs. Price and image quality are factors comparable in importance to
relevance in driving customer click behavior. As an illustration,
consider the situation in which the top-ranked result on a SERP or
PLV page for a query has a lower click-through rate (CTR) than
the documents or SKUs at lower ranks. In the web SERP case, the
low CTR sends a strong signal that the document is irrelevant to
the query, but in the eCommerce PLV case it may have other
explanations, for example, the SKU's being mispriced or having an
inferior brand or image. We will address this point in much more
detail in Section 4 below.</p>
      <p>Another factor which assumes more importance in the field of
eCommerce search versus web search is the difference between
using the NIR model for ranking only and using it for retrieval and
ranking. In the former scenario, the model is applied at runtime as
the last step of a pipeline or "cascade", where earlier steps of the
pipeline use cruder but computationally faster techniques (such
as inverted indexes and tf-idf) to identify a small subcorpus (of,
say, a few hundred SKUs) as potential entries on the PLV page, and
the NIR model only re-ranks this subcorpus (see e.g. [23]). In the
latter scenario, the model, or more precisely, approximate nearest
neighbor (ANN) methods acting on vector representations derived
from the model, both select and rank the search results for the
PLV page from the entire corpus. The latter scenario, retrieval and
ranking as one step, is more desirable but (see §§3 and 6 below) poses
additional challenges for both algorithm developers and engineers.
The possibility of using the NIR model for retrieval, as opposed
to mere re-ranking, receives relatively little attention in the web
search literature, presumably because of its infeasibility: the corpus
of web documents is so vast, in the trillions [24], compared to only
a few million items in each category of eCommerce SKUs.</p>
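      <p>The cascade deployment described above can be sketched as follows (function names and the toy token-overlap scorer are illustrative, not the paper's implementation):

```python
# Stage 1 shortlists with a cheap scorer (standing in for an
# inverted-index / tf-idf retrieval); stage 2 re-ranks only the
# small shortlist with the expensive neural scorer.
def cascade_search(query, corpus, cheap_score, neural_score, k=200):
    shortlist = sorted(corpus, key=lambda d: cheap_score(query, d),
                       reverse=True)[:k]
    return sorted(shortlist, key=lambda d: neural_score(query, d),
                  reverse=True)

# Toy usage with a token-overlap stand-in for both scorers:
corpus = ["tv remote", "tv stand", "universal remote", "desk chair"]
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
plv = cascade_search("tv remote", corpus, overlap, overlap, k=3)
print(plv[0])  # tv remote
```

In the retrieval-and-ranking scenario, by contrast, the shortlisting stage disappears and ANN search over model-derived vectors selects candidates from the entire corpus.</p>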
    </sec>
    <sec id="sec-5">
      <title>NEURAL INFORMATION RETRIEVAL MODELS</title>
      <p>In terms of the high-level classification of Machine Learning tasks,
which includes such categories as "classification" and "regression",
Neural Information Retrieval (NIR) falls under the Learning to
Rank (LTR) category (see [10] for a comprehensive survey). An
LTR algorithm uses an objective function defined over possible
"queries" q ∈ Q and lists of "documents" d ∈ D to learn a model
that, when applied to a new, previously unseen query, uses the
features of both the query and documents to "optimally" order
the documents for that query. For a more formal definition, see §1.3
of [10]. As explained in §2.3.2 of [10], it is common practice in IR
to approximate the general LTR task with a binary classification
task. For the sake of simplicity, we follow this so-called "pairwise
approach" to LTR. Namely, we first consider LTR algorithms whose
orderings are induced from a scoring function f : Q × D → R,
in the sense that the ordering for q consists of sorting D in the
order d1, . . . , dN satisfying f (q, d1) ≥ f (q, d2) ≥ · · · ≥ f (q, dN ). Next,
we train and evaluate the LTR model by presenting it with triples
(q, drel, dirrel) ∈ Q × D × D, where drel is deemed to be more relevant
to q than dirrel: the binary classification task amounts to assigning
scores f so that f (q, drel) &gt; f (q, dirrel).</p>
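      <p>The pairwise approach can be sketched in a few lines; the linear scoring function and hinge-style update below are illustrative stand-ins for a neural model, not the paper's:

```python
# Train on triples (q, d_rel, d_irrel) so that f(q, d_rel) > f(q, d_irrel).
def f(w, q_feats, d_feats):
    # Toy bilinear-diagonal scorer: w . (q * d), elementwise.
    return sum(wi * qi * di for wi, qi, di in zip(w, q_feats, d_feats))

def train(triples, dim, lr=0.1, epochs=50):
    w = [0.0] * dim
    for _ in range(epochs):
        for q, d_rel, d_irr in triples:
            margin = f(w, q, d_rel) - f(w, q, d_irr)
            if margin < 1.0:  # hinge: update only on violated pairs
                for i in range(dim):
                    w[i] += lr * q[i] * (d_rel[i] - d_irr[i])
    return w

# One toy triple: feature 0 signals relevance, feature 1 is noise.
triples = [([1.0, 1.0], [1.0, 0.0], [0.0, 1.0])]
q, d_rel, d_irr = triples[0]
w = train(triples, dim=2)
print(f(w, q, d_rel) > f(w, q, d_irr))  # True
```
</p>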
      <p>There are many ways of classifying the popular types of NIR
models, but the way that we find most fundamental and useful for our
purposes is the classification into distributed and local-interaction
models. The distinction between the two types of models lies in the
following architectural difference. The first type of model first
transforms d and q into "distributed" vector representations ϕD (d) ∈ VD
and ϕQ (q) ∈ VQ of fixed dimension, and only after that feeds the
representations into an LTR model. The second type of model never
forms a fixed-dimensional "distributed" vector representation of
document or query, and instead forms an "interaction" matrix
representation ⟨i(q), i(d)⟩ of the pair (q, d). The matrix ⟨i(q), i(d)⟩ is
sensitive only to local, not global, interactions between q and d,
and the model subsequently uses ⟨i(q), i(d)⟩ as the input to the LTR
model. More formally, in the notation of Table 1, the score function
f takes the following form in the distributed case:</p>
      <p>f (q, d) = r (ϕQ (q), ϕD (d)) = r (ϕ̃Q (i(q)), ϕ̃D (i(d))),</p>
      <p>and the following form in the local-interaction case:</p>
      <p>f (q, d) = r (⟨i(q), i(d)⟩).</p>
      <p>Thus, the interaction between q and d occurs at a global level in
the distributed case and at the local level in the local-interaction
case. The NIR models are distinguished individually from others
of the same type by the choices of building blocks in Table 1.² In
Tables 2 and 3, we show how to obtain some common NIR models
in the literature by making specific choices of the building blocks.
Note that since w determines i, and ϕ̃, i together determine ϕ, to
specify a local-interaction model (Table 3) we only need to specify
the mappings w, r, and to specify a distributed NIR model (Table 2),
we need only additionally specify ϕ̃. We remark that the distributed
word embeddings in the "Siamese" and Kernel Pooling models are
implicitly assumed to be normalized to have 2-norm one, which
allows the application of the dot product ⟨·, ·⟩ to compute the cosine
similarity.</p>
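      <p>The data flow of the two architectures can be sketched as follows (random word vectors and mean pooling are illustrative placeholders for the trained building blocks of Table 1):

```python
import random
random.seed(0)
DIM = 4
_emb = {}

def w(tok):
    # Stand-in word embedding w: token -> R^DIM (random, not trained).
    if tok not in _emb:
        _emb[tok] = [random.gauss(0, 1) for _ in range(DIM)]
    return _emb[tok]

def i(text):
    # Embedding layer i: token sequence -> list of word vectors.
    return [w(t) for t in text.split()]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Distributed: pool i(q) into one fixed-size vector (mean pooling as a
# toy representation mapping phi), then score with r = dot product.
def phi(text):
    vecs = i(text)
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def f_distributed(q, d):
    return dot(phi(q), phi(d))

# Local-interaction: the (len(q) x len(d)) matrix of pairwise dot
# products feeds the LTR layer; no fixed-size vector is ever formed.
def interaction(q, d):
    return [[dot(vq, vd) for vd in i(d)] for vq in i(q)]

M = interaction("tv remote", "universal tv remote control")
print(len(M), len(M[0]))  # 2 4
```
</p>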
      <p>
        The classification of NIR models in Tables 2 and 3 is not meant
to be an exhaustive list of all models found in the literature: for
a more comprehensive exposition, see for example the baselines
section §4.3 of [26]. Laying out the models in this fashion facilitates
a determination of which parts of the "space" of possible models
remain unexplored. For example, we see the possibility of forming
a new "hybrid" by using the distributed word embeddings layer w
from the Kernel Pooling model [26], followed by the representation
layer from CLSM [20], and we have called this distributed model,
apparently nowhere considered in the existing literature, "Siamese".
Substituting into (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), we obtain the following formula for the score
f of the Siamese model:
f (q, d) = ⟨mlp1 ∘ cnn1(i(q)), mlp1 ∘ cnn1(i(d))⟩.
²Although the dimension of word2vec is a hyperparameter, we make the conventional
choice that word-vectors have dimension 300 in the examples to keep the number of
symbols to a minimum.
      </p>
      <p>
        (
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
As an example missing from the space of possible local-interaction
models, we have not seen any consideration of the "hybrid"
architecture obtained by using the local-interaction layer from the
Kernel-Pooling model and a position-aware LTR layer such as a
convolutional neural net. Substituting into (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ), the score function
would be
      </p>
      <p>
        f (q, d) = mlp ∘ cnn (⟨i(q), i(d)⟩), (
        <xref ref-type="bibr" rid="ref4">4</xref>
        )
where in both (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) and (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ), i is the embedding layer induced
by a trainable word embedding w. These are just a couple of the
new architectures which could be formed in this way; we
highlight these two because they seem especially promising.
      </p>
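      <p>A minimal sketch of the hybrid score (4), with a sum-filter "convolution" and a tanh collapse standing in for trained cnn and mlp layers (the weights and toy matrix are illustrative, not a trained model):

```python
import math

def cnn(M, width=2):
    # 1-D convolution (here an unweighted sum filter) along the
    # document axis of each query-token row of the interaction matrix.
    return [[sum(row[j:j + width]) for j in range(len(row) - width + 1)]
            for row in M]

def mlp(M):
    # Collapse the convolved map to a scalar score through a tanh layer.
    return math.tanh(sum(sum(row) for row in M))

# Toy interaction matrix for 2 query tokens x 3 document tokens.
M = [[0.9, 0.1, 0.0],
     [0.0, 0.8, 0.2]]
print(mlp(cnn(M)))
```
</p>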
      <p>The main benefit of distinguishing between distributed and
local-interaction models is that a distributed architecture enables
retrieval-and-ranking, whereas a local-interaction architecture
restricts the model to re-ranking the results of a separate retrieval
mechanism. See the last paragraph of Section 2 for an explanation
of this distinction. From a computational-complexity perspective,
the reason for this is that the task of precomputing and storing
all the representations involved in computing the score is, for a
distributed model, O(|Q| + |D|) in space and time, but for a
local-interaction model, O(|Q||D|). (Note that we need to compute the
representations which are the inputs to r, namely, in the distributed
case, the vectors ϕQ (q), ϕD (d), and in the local-interaction case, the
matrices ⟨i(q), i(d)⟩: computing the variable-length representations
i(d) and i(q) alone will not suffice in either case.) Further, in the
distributed-model case, assuming the LTR function r (·) is chosen
to be the dot product ⟨·, ·⟩, the retrieval step can be implemented
by ANN. This is the reason for the choice of r as ⟨·, ·⟩ in defining
the "Siamese" model. For practical implementations of ANN using
Elasticsearch at industrial scale, see [19] for a general text-data
use case, and [14] for an image-data use case in an eCommerce
context. We will return to this point in §7.</p>
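      <p>The computational point can be sketched directly: with r = ⟨·, ·⟩, the catalog-side vectors ϕD (d) are precomputed once, and retrieval reduces to maximum-inner-product search (brute force below; an ANN index would replace the linear scan at scale; the names and vectors are illustrative):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Precomputed phi_D(d) for the whole catalog: O(|D|) space, built offline.
doc_vecs = {
    "sku_remote": [0.9, 0.1],
    "sku_tv":     [0.1, 0.9],
}

def retrieve_and_rank(q_vec, k=10):
    # At query time only phi_Q(q) is computed; documents are ranked by
    # the dot product, which an ANN index can approximate sublinearly.
    return sorted(doc_vecs, key=lambda s: dot(q_vec, doc_vecs[s]),
                  reverse=True)[:k]

print(retrieve_and_rank([1.0, 0.0])[0])  # sku_remote
```

A local-interaction model offers no such factorization: the matrix ⟨i(q), i(d)⟩ must be built per (q, d) pair, hence the O(|Q||D|) cost.</p>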
    </sec>
    <sec id="sec-7">
      <title>FROM CLICK MODELS TO TASK MODELS</title>
    </sec>
    <sec id="sec-8">
      <title>Click Models</title>
      <p>The purpose of click models is to extract, from observed variables,
an estimate of latent variables. The observed variables generally
include the sequence of queries, PLVs/SERPs and clicks, and may
also include hovers, add-to-carts (ATCs), and other browser
interactions recorded in the site's web logs. The main latent variables
are the relevance and attractiveness of an item to a user. A click
model historically takes the form of a probabilistic graphical model
called a Bayesian Network (BN), whose structure is represented
by a Directed Acyclic Graph (DAG), though more recent works
have introduced other types of probabilistic model, including
recurrent [1] and adversarial [13] neural networks. The click model
embodies certain assumptions about how users behave on the site.
We are going to focus on the commonalities of the various click
models rather than their individual differences, because our aim in
discussing them is to motivate another type of model called task
models, which will be used to construct the experimental dataset in
Section 5.</p>
      <p>
        To model the stochasticity inherent in the user browsing
process, click models adopt the machinery of probability theory and
conceptualize the click event, as well as related events such as
examinations, ATCs, etc., as the outcome of a (Bernoulli) random variable.
The three fundamental events for describing user interaction with
a SKU u on a PLV are denoted as follows:
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) the user examining the SKU, denoted by Eu;
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) the user being sufficiently attracted by the SKU "tile" to click
it, denoted by Au;
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) the user clicking on the SKU, denoted by Cu.
      </p>
      <p>The most basic assumption relating these three events is the
following Examination Hypothesis (Equation (3.4), p. 10 of [2]):</p>
      <p>EH: Cu = 1 ⟺ Eu = 1 and Au = 1.</p>
      <p>The EH is universally shared by click models, in view of the
observation that any click which the user makes without examining the
SKU is just "noise" because it cannot convey any information about
the SKU's relevance. In addition, most of the standard click models,
including the Position-Based Model (PBM) and Cascade Model (CM), also
incorporate the following Independence Hypothesis:</p>
      <p>IH: Eu ⊥ Au.</p>
      <p>
        The IH appears reasonable because the attractiveness of a SKU
represents an inherent property of the query-SKU pair, whereas
whether the SKU is examined is an event contingent on a particular
presentation on the PLV and the individual user's behavior. This
means that Eu and Au should not be able to influence one another,
as the IH claims. Adding the following two ingredients to the EH
(
        <xref ref-type="bibr" rid="ref5">5</xref>
        ) and the IH (
        <xref ref-type="bibr" rid="ref6">6</xref>
        ), we obtain a functional, if minimal, click model:
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) a probabilistic description of how the user interacts with the
PLV: i.e., at a minimum, a probabilistic model of the variables
Eu;
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) a parameterization of the distributions underlying the
variables Au.
      </p>
      <p>
        We can specify a simple click model by carrying out (
        <xref ref-type="bibr" rid="ref1">1</xref>
        )–(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) as
follows:
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) P (Eu = 1) depends only on the rank r of u, and is given by
a single parameter for each rank, so that P (Eu = 1) =: γr ∈
[0, 1];
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) P (Au = 1|Eu = 1) =: αu,q ∈ [0, 1], where there is an
independent Bernoulli parameter αu,q for each pair of query and
SKU.
      </p>
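      <p>The minimal PBM above can be simulated directly; the parameter values below are illustrative, and the empirical CTR approaches the product γr · αu,q implied by the EH and IH:

```python
import random
random.seed(42)

gamma = [0.9, 0.5, 0.2]  # P(Eu = 1) per rank: gamma_r
alpha = {("tv remote", "sku_a"): 0.8,   # alpha_uq per (query, SKU) pair
         ("tv remote", "sku_c"): 0.8}

def simulate_click(query, sku, rank):
    e = random.random() < gamma[rank]          # examination Eu
    a = random.random() < alpha[(query, sku)]  # attractiveness Au
    return int(e and a)                        # EH: Cu = 1 iff Eu = Au = 1

n = 100_000
ctr = sum(simulate_click("tv remote", "sku_c", 2) for _ in range(n)) / n
print(round(ctr, 2))  # close to 0.2 * 0.8 = 0.16
```
</p>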
      <p>
        The click model obtained by specifying (
        <xref ref-type="bibr" rid="ref1">1</xref>
        )–(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) as above is called the
PBM. For a template-based representation of the PBM, see Figure 1.
For further detail on the PBM and other BN click models, see [2],
and for interpretation of template-based representations of BNs see
Chapter 6 of [7].
      </p>
      <p>
        The parameter estimation of click-model BNs relies on the
following consequence of the EH (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ):
{Eu = Eu′ = 1 &amp; Cu = 0 &amp; Cu′ = 1} ⤳ αu′,q &gt; αu,q,
(
        <xref ref-type="bibr" rid="ref7">7</xref>
        )
where in (
        <xref ref-type="bibr" rid="ref7">7</xref>
        ), the symbol ⤳ means "increases the likelihood that".
This still leaves open the question of how to connect attractiveness
to relevance, which click models need to do in order to fulfill their
purpose, namely extracting relevance judgments from the click
logs. The simplest way of making the connection is to assume
      </p>
      <sec id="sec-8-2">
        <title>Click Models (continued)</title>
        <p>[Table 1 appeared here: examples and formulas for the building blocks, including the word embedding w (e.g. word2vec, token t ↦ w(t) ∈ R³⁰⁰), the embedding layer i, the representation mappings ϕ and ϕ̃, the LTR layers r (⟨·, ·⟩ [20]; mlp [12]; mlp composed with kernel pooling K [26]), word hashing [5], the Hadamard product, summation along the 1-axis, and the Duet model dimensions ND = 1000, NQ = 10.]</p>
        <p>
          that an item (web page or SKU) is attractive if and only if it is
relevant, but this assumption is obviously implausible even in the
web search case, and all but the simplest click models reject it. A
more sophisticated approach, which is the option taken by click
models such as DCM, CCM, DBN, is to define another Bernoulli
random variable
(
          <xref ref-type="bibr" rid="ref4">4</xref>
          ) Su, the user having her information need satisfied by SKU u,
satisfying the following Click Hypothesis:
        </p>
        <p>
          CH: Cu = 0 ⇒ Su = 0.
(
          <xref ref-type="bibr" rid="ref8">8</xref>
          )
The CH is formally analogous to the EH (
          <xref ref-type="bibr" rid="ref5">5</xref>
          ), since just as the EH
says that only u for which Eu = 1 are eligible to have Cu = 1,
regardless of their inherent attractiveness αu,q, the CH says that
only u for which Cu = 1 are eligible to have Su = 1, regardless of
their inherent relevance σu,q. The justification of the CH is that
the user cannot know if the document is truly relevant, and thus
cannot have her information need satisfied by the document, unless
she clicks on the document's link displayed on the SERP. According
to the CH, in order to completely describe Su, we need only specify
its conditional distribution when Cu = 1. We will follow the above
click models by specifying (in continuation of (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )–(
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) above):
(
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) P (Su = 1|Cu = 1) =: σu,q ∈ [0, 1], where there is an
independent Bernoulli parameter σu,q for each pair of query and
SKU.
        </p>
        <p>Since S_u is latent, to allow estimation of the parameters σ_{u,q}, we need to connect S_u to the observed variables C_u, and the connection is generally made through the examination variables {E_{u′}}, by an assumption such as the following:</p>
        <p>(7) P(E_{u′} = 1 | S_u = 1) ≠ P(E_{u′} = 1 | S_u = 0), where u′ ≠ u is a SKU encountered after u in the process of browsing the SERP/PLV.</p>
        <p>Our reason for highlighting the CH is that we have found that the
CH limits the applicability of these models in practice. Consider the
following Implication Assumption:</p>
        <p>IA
u,q is high )
u,q is high.</p>
        <p>
          (
          <xref ref-type="bibr" rid="ref9">9</xref>
          )
Clearly the IA is not always true, even in the SERP case, because, for
example, the search engine may have done a poor job of producing
the snippet for a relevant document. But in order for parameter
tuning to converge in a data-e cient manner, the AI must be
predominantly true. The reason for this is that, according to the CH
(
          <xref ref-type="bibr" rid="ref8">8</xref>
          ), the variable Cu acts as a "censor" deleting random samples for
the variable Su . For a xed number N of observations of a SKU with
xed u,q , the e ective sample size for the empirical estimate ˆu,q
is approximately N · u,q , so that as u,q ! 0+, the variance of
ˆu,q is scaled up by a factor of 1/ u,q ! 1 . See Chapter 19 of [7]
for a more comprehensive discussion of values missing at random
and associated phenomena. We have already discussed in §2 that
relevance of a SKU is a necessary but not su cient condition for a
high CTR, and consequently the IA, (
          <xref ref-type="bibr" rid="ref9">9</xref>
          ) is frequently violated in
practice for SKUs. Empirically, we have observed poor performance
of most click models (other than the PBM and UBM) on eCommerce
site click-through logs, and we attribute this to the failure of the IA.
Thus a suitable modi cation of the usual click model framework is
needed to extract relevance judgments from click data. This is the
subject of §4.3 below.
At this point, we take a step back from our development of the task model to address a question that the reader may already be asking: given the complications of extracting relevance from click logs in the eCommerce setting, why not completely forgo modeling relevance and instead model the attractiveness of SKUs directly? After all, it would seem that the goal of search ranking in eCommerce is to present the most engaging PLV, which will result in the most clicks, ATCs, and purchases. If a model can be trained to predict SKU attractiveness directly from query and SKU features in an end-to-end manner, that would seem to be sufficient and decrease the motivation to model relevance separately.
        </p>
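        <p>The censoring effect described above can be made concrete with a small simulation (a sketch with illustrative values of α and σ, not data from our logs): the effective sample size for estimating σ shrinks roughly like N · α, so the estimate degrades sharply as α → 0.</p>

```python
import random

def estimate_sigma(alpha, sigma, n, rng):
    """Estimate sigma = P(S=1 | C=1) when clicks censor satisfaction.

    alpha : P(C=1), attractiveness (illustrative value)
    sigma : P(S=1 | C=1), satisfaction (illustrative value)
    Returns (estimate, effective sample size) from n impressions.
    """
    # satisfaction is observable only on the clicked (uncensored) impressions
    sats = [rng.random() < sigma for _ in range(n) if rng.random() < alpha]
    est = sum(sats) / len(sats) if sats else float("nan")
    return est, len(sats)

rng = random.Random(0)
for alpha in (0.5, 0.05, 0.005):
    est, n_eff = estimate_sigma(alpha, 0.7, 10_000, rng)
    print(f"alpha={alpha}: effective samples ~{n_eff}")
```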
        <p>There are at least two practical reasons for wanting to model relevance separately from attractiveness. The first is that relevance is the most stable among the factors that affect attractiveness; the others, namely price, customer taste, etc., all vary significantly over time. Armed with both a model of relevance and a separate model of how relevance interacts with the more transient factors to impact attractiveness, the site manager can estimate the effects on attractiveness that can be achieved by pulling the various levers available to her, i.e., by modifying the transient factors (changing the price, or attempting to alter customer taste through different marketing). The second is that the textual (word, query, and SKU-document) representations produced by modeling relevance, for example, by applying any of the distributed models from §3, have potential applications in related areas such as recommendation, synonym identification, and automated ontology construction. Allowing transient factors correlated with attractiveness, such as the price and changing customer taste, to influence these representations would skew them in unpredictable and undesirable ways, limiting their utility. We will return to the point of non-search applications of the representations in §7.</p>
        <p>Instead of using a click model that considers each query-PLV pair independently, we will use a form of behavioral analysis that groups search requests into tasks. The point of view we are taking is similar to the one adopted in §4 of [28] and subsequent works on the Task-centric Click Model (TCM). Similar to the TCM of [28], we assume that when a user searches on several semantically related queries in the same session, the user goes through a process of successively querying, examining results, possibly with clicks, and refining the query until it matches her intent. We sum up the process in the flow chart, Figure 2, which corresponds to both Figure 2, the "Macro Model", and Figure 3, the "Micro Model", in [28].</p>
        <p>
          There are several important ways in which we have simplified the analysis and model as compared with the TCM of [28]. First, we do not consider the so-called "duplicate bias" or the associated freshness variable of a SKU in our analysis, but we do indicate explicitly that the click variable depends on both relevance and other factors of SKU attractiveness. Second, we do not consider the last query in the query chain, or the clicks on the last PLV, as being in any way special. Third, [28] perform inference and parameter tuning on the model, whereas in this work, at least, we use the model for a different purpose (see below). As a result, [28] need to adopt a particular click model inside the TCM (called the "micro model", or PLV/SERP interaction model) to fully specify the TCM. For our purposes the PLV interaction model could remain unspecified, but for the sake of concreteness, we specify a particular model, similar to the PBM, to govern the user’s interaction with the PLV inside the TCM in Figure 3, which is comparable to Figure 4 of [28]. Note that in comparison to their TCM model we have added the A (attractiveness) factor, eliminated the "freshness" and "previous examination" factors, and otherwise just made some changes of notation, namely using S instead of R to denote relevance/satisfaction in agreement with the notation of §4.1, and the more standard r instead of j for "rank". The important new variables present in the TCM are the following two relating to the session flow, rather than the internal search request flow (continuing the numbering (1)–(3) from §4.1):
(4) The user’s intent being matched by query i, denoted by M_i;
(5) The user submitting another search request after the i-th query session, denoted by N_i.
        </p>
        <p>
          A complete specification of our TCM consists of the template-based DAG representation of Figure 3 together with the following parameterizations (compare (16)–(24) of [28]):
        </p>
        <p>P(M_i = 1) = γ_1 ∈ [0, 1]
P(N_i = 1 | M_i = 1) = γ_2 ∈ [0, 1]
P(E_{i,r} = 1) = ε_r
P(A_{i,r} = 1) = α_{u,q}
P(S_{i,r} = 1) = σ_{u,q}
M_i = 0 ⇒ N_i = 1
C_{i,r} = 1 ⇔ M_i = 1, E_{i,r} = 1, S_{i,r} = 1, A_{i,r} = 1.</p>
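        <p>A generative sketch of one query session under these parameterizations follows (hypothetical parameter values; variable names follow the equations above):</p>

```python
import random

def simulate_query_session(gamma1, gamma2, eps, alpha, sigma, rng):
    """Sample one query session of the simplified TCM (sketch).

    gamma1   : P(M_i = 1), intent matched by the query
    gamma2   : P(N_i = 1 | M_i = 1), user submits another query anyway
    eps[r]   : P(E_{i,r} = 1), examination probability at rank r
    alpha[r] : attractiveness of the SKU at rank r
    sigma[r] : relevance/satisfaction of the SKU at rank r
    Returns (clicks per rank, whether another query follows).
    """
    if rng.random() >= gamma1:
        # M_i = 0: no clicks at any rank, and N_i = 1
        return [False] * len(eps), True
    # C_{i,r} = 1 iff M_i = E_{i,r} = S_{i,r} = A_{i,r} = 1
    clicks = [rng.random() < eps[r] and rng.random() < alpha[r]
              and rng.random() < sigma[r] for r in range(len(eps))]
    return clicks, rng.random() < gamma2

rng = random.Random(0)
clicks, next_q = simulate_query_session(
    gamma1=0.6, gamma2=0.3, eps=[0.9, 0.7, 0.5],
    alpha=[0.8, 0.4, 0.2], sigma=[0.9, 0.5, 0.3], rng=rng)
```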
        <p>
          Resuming our discussion from §3, we are seeking a way to extract programmatically from the click logs a collection of triples (q, d_rel, d_irrel) ∈ Q × D × D, where d_rel is more relevant to q than d_irrel, which is sufficiently "adversarial". The notion of "adversarial" which we adopt is that, first, q belongs to a task (multi-query session) in the sense of the TCM; q = q_i′ was preceded by a "similar" query q_i; and d_irrel, while irrelevant to q_i′, is relevant to q_i. Note that this method of identifying the triples, at least heuristically, has a much higher chance of producing adversarial examples, because the similarity between q_i and q_i′ implies with high probability that the relevance gap between d_rel and d_irrel with respect to q_i′ is much smaller than would be expected if we chose d_irrel from the SKU collection at random. Leaving aside the question, for the moment, of how to define search sessions (a point we will return to in Section 5), we can begin our approach to the construction of the triples by defining criteria which make it likely that the user’s true search intent is expressed by q_i′, but not by q_i (recall again that i &lt; i′, meaning q_i is the earlier of the two queries):
(1) The PLV for q_i had no clicks: C_{q_i,u_r} = 0, r = 1, …, n.
(2) The PLV for q_i′ had (at least) one click, on rank r′: C_{q_i′,u_r′} = 1.
It turns out that relying on these criteria alone is too naïve. The problem is not the relevance of u_{q_i′,r′} to q_i′, but the supposed irrelevance of the u_{q_i,r} to q_i. In the terms of the TCM, this is because criterion (1), absence of any click on the PLV for q_i, does not in general imply that M_i = 0. In [28], the authors address this issue by performing parameter estimation using the click logs holistically. We may adopt a version of this approach in future work. In the present work, we take a different approach, which is to examine the content of q_i and q_i′ and add a filter (condition) that makes it much more likely that M_i = 0, namely
        </p>
        <p>
(3) The tokens of q_i′ properly contain the tokens of q_i.
An example of a (q_i, q_i′) which satisfies (3) is (bookshelf, bookshelf with doors), whereas an example which does not satisfy (3) is (wooden bookshelf, bookshelf with doors): see the last line of Table 4. The idea behind (3) is that in this situation, the user is refining her query to better match her true search intent, so we apply the term "refinement" either to q_i′ or to the pair (q_i, q_i′) as a whole. This addresses the issue that we need M_i = 0 to conclude that the u_{q_i,r} are likely irrelevant to the query q_i′.
        </p>
        <p>
          However, there are still another couple of issues with using the triple (q_i′, u_{q_i′,r′}, u_{q_i,r}) as a training example. The first stems from the observation that the lack of a click on u_{q_i,r} provides evidence for its irrelevance (to the user’s true intent q_i′) only if it was examined, E_{u_{q_i,r}} = 1. This is the same observation as (7), but in the context of TCMs. We address this by adding a filter:
        </p>
        <p>
(4) The rank r of u_{q_i,r} satisfies r ≤ ν, where ν is a small integer parameter, implying that r is relatively close to 1.
The second is an "exceptional" type of situation where M_i = 0, but certain u_{i,r} on the PLV for q_i are relevant to q_i′. Consider as an example: the user issues the query q_i "rubber band", then the query q_i′ "Acme rubber band", then clicks on a SKU u_{q_i′,r′} = u_{q_i,r} she has previously examined on the PLV for q_i. This may indicate that the PLV for q_i actually contained some results relevant to q_i′, and examining such results reminded the user of her true search intent. In order to filter out from the dataset the noise which would result from such cases, we add this condition:
        </p>
        <p>
(5) u_{q_i′,r′} does not appear anywhere in the PLV for q_i.
        </p>
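        <p>Conditions (1)–(5) can be applied mechanically to session logs; a minimal sketch follows (the session record layout and field names are hypothetical, and ν is the rank cutoff of condition (4)):</p>

```python
def extract_triples(session, nu=3):
    """Extract (query, relevant SKU, irrelevant SKU) triples from one
    session by applying conditions (1)-(5) of Section 4.3 (sketch).

    session : list of dicts with hypothetical keys "query" (text),
              "plv" (ranked SKU ids), and "clicks" (clicked SKU ids).
    nu      : rank cutoff of condition (4).
    """
    triples = []
    for i, earlier in enumerate(session):
        for later in session[i + 1:]:
            ti = set(earlier["query"].split())
            tj = set(later["query"].split())
            if not (ti < tj):                  # (3) q_i' properly contains q_i
                continue
            if earlier["clicks"]:              # (1) no clicks on PLV for q_i
                continue
            for clicked in later["clicks"]:    # (2) a click on PLV for q_i'
                if clicked in earlier["plv"]:  # (5) clicked SKU absent from q_i's PLV
                    continue
                for irrel in earlier["plv"][:nu]:  # (4) rank r <= nu
                    triples.append((later["query"], clicked, irrel))
    return triples

session = [
    {"query": "bookshelf", "plv": ["sku_a", "sku_b"], "clicks": []},
    {"query": "bookshelf with doors", "plv": ["sku_c"], "clicks": ["sku_c"]},
]
print(extract_triples(session))
```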
      </sec>
    </sec>
    <sec id="sec-9">
      <title>DATASET CONSTRUCTION</title>
      <p>
        We group consecutive search requests by the same user into one session. Within each session, we extract semantically related query pairs (q_i, q_i′), with q_i submitted before q_i′. The query q_i′ is considered semantically related to the previously submitted q_i if they satisfy condition (3) above in §4.3. We keep only the (q_i, q_i′) satisfying (1)–(3) in §4.3. As explained in §4.3, if in addition the clicked SKU u_{q_i′,r′} and the unclicked SKUs u_{q_i,r} satisfy (4)–(5), we have high confidence that M_i = 0 and u_{q_i,r} does not satisfy the user’s information
      </p>
      <p>Table 4. Example triples extracted by our method (query | relevant SKU | irrelevant SKU):</p>
      <p>epson ink cartridges | Epson 252XL High-capacity Black Ink Cartridge | Canon CL-241XL Color Ink Cartridge</p>
      <p>batteries aa | Sony S-am3b24a Stamina Plus Alkaline Batteries (aa; 24 Pk) | Duracell Quantum Alkaline Batteries (AAA, 12 Count)</p>
      <p>microsd 128gb | Sandisk Sdsqxvf-128g-an6ma Extreme Microsd Uhs-i Card With Adapter (128gb) | PNY 32GB MicroSDHC Card</p>
      <p>accent chair | Acme Furniture Ollano Accent Chair, Fish Pattern, Dark Blue | ProLounger Wall Hugger Microfiber Recliner</p>
      <p>bar stool red 2 | Belleze© Leather Hydraulic Lift Adjustable Counter Bar Stool Dining Chair Red, Pack of 2 | Flash Furniture Modern Vinyl 23.25 In. to 32 In. Adjustable Swivel Barstool</p>
      <p>bookshelf with doors | Better Homes and Gardens Crossmill Bookcase with Doors, Multiple Finishes | Way Basics Eco-Friendly 4 Cubby Bookcase</p>
      <sec id="sec-9-6">
        <p>need, whereas M_i′ = 1 and u_{q_i′,r′} does. Therefore, based on our heuristic, we can construct as many as ν training examples for each eligible click, of the form</p>
        <p>(q, d_rel, d_irrel) := (q_i′, u_{q_i′,r′}, u_{q_i,r}), r = 1, …, ν.</p>
        <p>For our experiments, we processed logged search data on Jet.com from April to November 2017. We filtered to only search requests related to the electronics and furniture categories, so as to enable fast experimentation. We implemented our construction method in Apache Spark [27]. Our final dataset consists of around 3.6 million examples, with 130k unique q, 131k unique d_rel, and 275k unique d_irrel, with 68k SKUs appearing as both d_rel and d_irrel in different examples. Table 4 shows some examples extracted by our method.</p>
        <p>In order to compensate for the relatively small number of unique q in the dataset and the existence of concept drift in eCommerce, we formed the train-validation-test split in the following manner, rather than following the usual ML practice of using random splits: we reserved the first six months of data for training, the seventh month for validation, and the eighth (final) month for testing. Further, we filtered out from the validation set all examples with a q seen in training, and from the test set all examples with a q seen in validation or training. This turned out to result in a (training : validation : test) ratio of (75 : 2.5 : 1). Although the split is lopsided towards training examples, it still results in 46k test examples. We believe this drastic deletion of validation/test examples is well worth it to detect overfitting and distinguish true model learning from mere memorization.</p>
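        <p>The split logic amounts to a few lines (a sketch; the month index and tuple layout are hypothetical stand-ins for our actual records):</p>

```python
def temporal_split(examples):
    """Time-based train/validation/test split with query de-duplication.

    examples : list of (month, query, d_rel, d_irrel), month in 1..8.
    Months 1-6 train; month 7 validates on queries unseen in training;
    month 8 tests on queries unseen in training and validation.
    """
    train = [e for e in examples if e[0] <= 6]
    train_q = {e[1] for e in train}
    val = [e for e in examples if e[0] == 7 and e[1] not in train_q]
    seen_q = train_q | {e[1] for e in val}
    test = [e for e in examples if e[0] == 8 and e[1] not in seen_q]
    return train, val, test

examples = [(1, "q1", "a", "b"), (7, "q1", "a", "c"),
            (7, "q2", "d", "e"), (8, "q2", "d", "f"), (8, "q3", "g", "h")]
train, val, test = temporal_split(examples)
print(len(train), len(val), len(test))  # 1 1 1
```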
        <p>We now give an overview of the principles used to form the SKU (document) text as actually encoded by the models. The query text is just whatever the user types into the search field. The SKU text is formed by concatenating the SKU title with the text extracted from some other fields associated with the SKU in a database, some of which are free-text descriptions, and others of which are more structured in nature. Finally, both the query and SKU texts are lowercased and normalized to clean up certain extraneous elements, such as HTML tags and non-ASCII characters, and to expand some abbreviations commonly found in the SKU text. For example, an apostrophe immediately following numbers was turned into the token "feet". No models and no stemming or NLP analyzers, only regular expressions, are used in the text normalization.</p>
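        <p>A minimal regex-only normalizer in this spirit (a sketch; the rule set used in our actual pipeline is larger):</p>

```python
import re

def normalize(text):
    """Lowercase, strip HTML tags and non-ASCII, expand number + apostrophe
    into "feet", and collapse whitespace (regexes only, no NLP analyzers)."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)            # strip html tags
    text = re.sub(r"(\d)\s*'", r"\1 feet ", text)   # 6' -> 6 feet
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII chars
    return re.sub(r"\s+", " ", text).strip()

print(normalize("6' Bookshelf <b>with</b> doors\u00ae"))
# -> 6 feet bookshelf with doors
```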
        <p>We used the Adam optimizer [6], with an initial learning rate of 1×10^-4, using PyTorch's built-in learning rate scheduler to decrease the learning rate in response to a plateau in validation loss. For the kernel-pooling model, it takes about 8 epochs, with a run time of about 2.5 hours assuming a batch size of 512, for the learning rate to reach 1×10^-6, after which further decrease in the learning rate does not result in significant validation accuracy improvements. Also, for the kernel-pooling model, we explored the effect of truncating the SKU text at various lengths. In particular, we tried keeping the first 32, 64, and 128 tokens of the text, and we report the results below. For the CLSM and Siamese models, we tried changing the dimension of both V, the distributed query/document representation space, and the number of channels of the cnn layer (the dimension of the input to mlp_1), using values evenly spaced in logarithmic space between 32 and 512. We also tried 3 different values of the dropout rate in these models.</p>
      </sec>
      <sec id="sec-9-9">
        <title>Table 5: validation and test error rates for the kernel-pooling model (trainable embeddings at truncation lengths 32, 64, and 128; frozen embeddings at truncation length 64) and the tf-idf baseline</title>
        <p>For the kernel-pooling model, we used word vectors from an unsupervised pre-training step, using the training/validation texts as the corpus. As our word2vec algorithm, we used the CBOW and Skipgram algorithms as implemented in Gensim's [17] FastText wrapper. We observed no improvement in the performance of the relevance model from changing the choice of word2vec algorithm or altering the word2vec hyperparameters from their default values. For the tf-idf baseline, we also used the Gensim library, without changing any of the default settings, to compile the idf (inverse document frequency) statistics.</p>
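        <p>For reference, the flavor of the tf-idf pairwise comparison can be sketched as follows (a toy scoring function with hand-picked idf values, not the Gensim formulation we actually used):</p>

```python
import math
from collections import Counter

def tfidf_score(query_tokens, doc_tokens, idf):
    """Cosine-style tf-idf match score between a query and a document
    (sketch; `idf` maps token -> inverse document frequency weight)."""
    tf = Counter(doc_tokens)
    norm = math.sqrt(sum((c * idf.get(t, 0.0)) ** 2 for t, c in tf.items()))
    if norm == 0:
        return 0.0
    return sum(tf[t] * idf.get(t, 0.0) ** 2 for t in set(query_tokens)) / norm

idf = {"epson": 2.3, "ink": 1.1, "cartridge": 1.2, "canon": 2.2}
q = ["epson", "ink"]
rel = ["epson", "252xl", "ink", "cartridge"]
irrel = ["canon", "ink", "cartridge"]
# a pairwise ranking error is counted when the irrelevant SKU outscores
# the relevant one for the same query
print(tfidf_score(q, rel, idf) > tfidf_score(q, irrel, idf))
```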
      </sec>
    </sec>
    <sec id="sec-10">
      <title>6.2 Error Rate Comparison</title>
      <p>We have reported our main results, the error rates of the trained relevance models, in Table 5. The most notable finding is that the kernel-pooling model, our main representative of the local-interaction class, showed a sizable improvement over the baseline, while in our experiments thus far, none of the distributed representation models even matched the baseline. We found that the distributed models had adequate capacity to overfit the training data, but they are not generalizing well to the validation/test data. We will discuss ongoing efforts to correct this in §7 below. Another notable finding is that there is an ideal truncation length of the SKU texts for the kernel-pooling model, in our case around 64 tokens, which gives the model enough information without introducing too much extraneous noise. Finally, following [26], we also evaluated a variant of the kernel-pooling model where the word embeddings were "frozen", i.e. fixed at their initial word2vec values. Interestingly, unlike what was observed in [26], we observed only a modest degradation in performance, as measured by overall error rate, from the model using the frozen word2vec embeddings as compared with the full model. Based on the qualitative analysis of the learned embeddings, §6.3, we believe it is still worthwhile to train the full model.</p>
    </sec>
    <sec id="sec-11">
      <title>6.3 Pre-trained versus fine-tuned embeddings</title>
      <p>Similarly to [26], we found that the main effect of the supervised retraining of the word embeddings was to decouple certain word pairs. Corresponding to Table 8 of their paper, we have listed some examples of the moved word pairs in Table 7. The training decouples roughly twice as many word pairs as it moves closer together. In spite of the relatively modest gains to overall accuracy from the fine-tuning of embeddings, we believe this demonstrates the potential value of the fine-tuned embeddings for other search-related tasks.</p>
    </sec>
    <sec id="sec-12">
      <title>7 CONCLUSIONS</title>
      <p>We showed how to construct a rich, adversarial dataset for eCommerce relevance. We demonstrated that one of the current state-of-the-art NIR models, namely the Kernel Pooling model, is able to reduce pairwise ranking errors on this dataset, as compared to the tf-idf baseline, by over a third. We observed that the distributional NIR models, such as DSSM and CLSM, overfit and do not learn to generalize well on this dataset. Because of the inherent advantages of distributional over local-interaction models, our first priority for ongoing work is to diagnose and overcome this overfitting so that the distributional models at least outperform the baseline. The work is proceeding along two parallel tracks. One is to explore further architectures in the space of all possible NIR models to find ones which are easier to regularize. The other is to perform various forms of data augmentation, both to increase the sheer quantity of data available for the models to train on and to overcome any biases that the current data generation process may introduce.</p>
    </sec>
    <sec id="sec-13">
      <title>ACKNOWLEDGMENTS</title>
      <p>The authors would like to thank Ke Shen for his assistance setting
up the data collection pipelines.</p>
      <sec id="sec-13-1">
        <title>Table 7 (fragment): kernel-mean shifts of moved word pairs</title>
        <p>From µ = 0.8 to µ = 0.1; from µ = 0.5 to µ = 0.1; from µ = 0.1 to µ = 0.3; from µ = 0.5 to µ = 0.1; from µ = 0.1 to µ = 0.3.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Alexey</given-names>
            <surname>Borisov</surname>
          </string-name>
          , Ilya Markov, Maarten de Rijke, and
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A neural click model for web search</article-title>
          .
          <source>In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee</source>
          ,
          <fpage>531</fpage>
          -
          <lpage>541</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Aleksandr</given-names>
            <surname>Chuklin</surname>
          </string-name>
          , Ilya Markov, and Maarten de Rijke.
          <year>2015</year>
          .
          <article-title>Click models for web search</article-title>
          .
          <source>Synthesis Lectures on Information Concepts</source>
          ,
          <source>Retrieval, and Services 7</source>
          ,
          <issue>3</issue>
          (
          <year>2015</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Yoav</given-names>
            <surname>Goldberg</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Neural Network Methods for Natural Language Processing</article-title>
          .
          <source>Synthesis Lectures on Human Language Technologies</source>
          <volume>37</volume>
          (
          <year>2017</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Clinton Gormley and Zachary Tong. 2015. Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine. O'Reilly Media, Inc.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information &amp; Knowledge Management. ACM, 2333-2338.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Daphne</given-names>
            <surname>Koller</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nir</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Probabilistic graphical models: principles and techniques</article-title>
          . MIT press.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Aliasgar</given-names>
            <surname>Kutiyanawala</surname>
          </string-name>
          , Prateek Verma, and
          <string-name>
            <given-names>Zheng</given-names>
            <surname>Yan</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Towards a simplified ontology for better e-commerce search</article-title>
          .
          <source>In Proceedings of the SIGIR 2018 Workshop on eCommerce (ECOM 18).</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Quoc</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Distributed representations of sentences and documents</article-title>
          .
          <source>In Proceedings of the 31st International Conference on Machine Learning (ICML-14)</source>
          .
          <fpage>1188</fpage>
          -
          <lpage>1196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Learning to rank for information retrieval and natural language processing</article-title>
          .
          <source>Synthesis Lectures on Human Language Technologies</source>
          <volume>7</volume>
          ,
          <issue>3</issue>
          (
          <year>2014</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Bhaskar</given-names>
            <surname>Mitra</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nick</given-names>
            <surname>Craswell</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Neural Models for Information Retrieval</article-title>
          .
          <source>arXiv preprint arXiv:1705.01509</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Bhaskar</surname>
            <given-names>Mitra</given-names>
          </string-name>
          , Fernando Diaz, and
          <string-name>
            <given-names>Nick</given-names>
            <surname>Craswell</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Learning to match using local and distributed representations of text for web search</article-title>
          .
          <source>In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee</source>
          ,
          <fpage>1291</fpage>
          -
          <lpage>1299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>John</given-names>
            <surname>Moore</surname>
          </string-name>
          , Joel Pfeiffer, Kai Wei, Rishabh Iyer, Denis Charles,
          <string-name>
            <given-names>Ran</given-names>
            <surname>Gilad-Bachrach</surname>
          </string-name>
          , Levi Boyles, and
          <string-name>
            <given-names>Eren</given-names>
            <surname>Manavoglu</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Modeling and Simultaneously Removing Bias via Adversarial Neural Networks</article-title>
          .
          <source>arXiv preprint arXiv:1804.06909</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Cun</given-names>
            <surname>Mu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jun</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Guang</given-names>
            <surname>Yang</surname>
          </string-name>
          , Jing Zhang, and Zheng Yan.
          <year>2018</year>
          .
          <article-title>Towards Practical Visual Search Engine Within Elasticsearch</article-title>
          .
          <source>In Proceedings of the SIGIR 2018 Workshop on eCommerce (ECOM 18).</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Adam</given-names>
            <surname>Paszke</surname>
          </string-name>
          , Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,
          <string-name>
            <given-names>Zachary</given-names>
            <surname>DeVito</surname>
          </string-name>
          , Zeming Lin, Alban Desmaison, Luca Antiga, and
          <string-name>
            <given-names>Adam</given-names>
            <surname>Lerer</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Automatic differentiation in PyTorch</article-title>
          .
          <source>In NIPS-W</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Richard</given-names>
            <surname>Socher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          .
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Radim</given-names>
            <surname>Rehurek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Petr</given-names>
            <surname>Sojka</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          .
          <source>In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA</source>
          , Valletta, Malta,
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hugo</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          , et al.
          <year>2009</year>
          .
          <article-title>The probabilistic relevance framework: BM25 and beyond</article-title>
          .
          <source>Foundations and Trends® in Information Retrieval 3</source>
          ,
          <issue>4</issue>
          (
          <year>2009</year>
          ),
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Jan</given-names>
            <surname>Rygl</surname>
          </string-name>
          , Jan Pomikalek, Radim Rehurek, Michal Ruzicka, Vit Novotny, and
          <string-name>
            <given-names>Petr</given-names>
            <surname>Sojka</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines</article-title>
          .
          <source>In Proceedings of the 2nd Workshop on Representation Learning for NLP</source>
          .
          <fpage>81</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Yelong</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaodong</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Li</given-names>
            <surname>Deng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Grégoire</given-names>
            <surname>Mesnil</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A latent semantic model with convolutional-pooling structure for information retrieval</article-title>
          .
          <source>In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM</source>
          ,
          <fpage>101</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>David</given-names>
            <surname>Smiley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Eric</given-names>
            <surname>Pugh</surname>
          </string-name>
          , Kranti Parisa, and Matt Mitchell.
          <year>2015</year>
          .
          <article-title>Apache Solr enterprise search server</article-title>
          . Packt Publishing Ltd.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <collab>Google Trends</collab>
          .
          <year>2018</year>
          .
          <article-title>Year in Search 2017</article-title>
          . Retrieved April 29,
          <year>2018</year>
          from https://trends.google.com/trends/yis/2017/GLOBAL/
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Zhucheng</given-names>
            <surname>Tu</surname>
          </string-name>
          , Matt Crane, Royal Sequiera, Junchen Zhang, and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>An Exploration of Approaches to Integrating Neural Reranking Models in Multi-Stage Ranking Architectures</article-title>
          .
          <source>arXiv preprint arXiv:1707.08275</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <collab>Venturebeat.com</collab>
          .
          <year>2013</year>
          .
          <article-title>How Google Searches 30 Trillion Web Pages 100 Billion Times a month</article-title>
          . Retrieved April 29,
          <year>2018</year>
          from https://venturebeat.com/2013/03/01/how-google-searches-30-trillion-web-pages-100-billion-times-a-month/
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <collab>Wikipedia</collab>
          .
          <year>2018</year>
          .
          <article-title>Tf-idf</article-title>
          . Retrieved April 30,
          <year>2018</year>
          from https://en.wikipedia.org/wiki/Tf%E2%80%93idf
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Chenyan</given-names>
            <surname>Xiong</surname>
          </string-name>
          , Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and
          <string-name>
            <given-names>Russell</given-names>
            <surname>Power</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>End-to-end neural ad-hoc ranking with kernel pooling</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM</source>
          ,
          <fpage>55</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Matei</given-names>
            <surname>Zaharia</surname>
          </string-name>
          , Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, et al.
          <year>2016</year>
          .
          <article-title>Apache Spark: a unified engine for big data processing</article-title>
          .
          <source>Commun. ACM</source>
          <volume>59</volume>
          ,
          <issue>11</issue>
          (
          <year>2016</year>
          ),
          <fpage>56</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Yuchen</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Weizhu Chen,
          <string-name>
            <given-names>Dong</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Qiang</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>User-click modeling for understanding and predicting search-behavior</article-title>
          .
          <source>In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM</source>
          ,
          <fpage>1388</fpage>
          -
          <lpage>1396</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>