<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM SIGIR Workshop on eCommerce, July 2024</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>merce⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nguyen Vo</string-name>
          <email>nguyen.vo@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongwei Shang</string-name>
          <email>hongwei.shang@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhen Yang</string-name>
          <email>zhen.yang@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juexin Lin</string-name>
          <email>juexin.lin@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seyed Danial Mohseni Taheri</string-name>
          <email>seyeddanial.mohseni@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <addr-line>Sunnyvale, California 94086</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>18</volume>
      <issue>2024</issue>
      <abstract>
        <p>Ensuring the relevance of text between user queries and products is vital for e-commerce search engines to enhance user experience and help users find desired products. Thanks to their capabilities in semantic understanding, deep learning models have become the primary choice for relevance matching tasks. In real-time e-commerce scenarios, representation-based models are commonly used due to their efficiency. On the other hand, interaction-based models, while offering better effectiveness, are often time-consuming and challenging to deploy online. The emergence of large language models (LLMs) has marked a significant advancement in relevance search, presenting both value and complexity when applied to the e-commerce domain. To address these challenges, we propose a novel framework to distill a highly effective interaction-based LLM into a low-latency representation-based architecture (i.e. a student model). To further increase the effectiveness of the LLM, we propose to use soft human labels and items' attributes. Our student model is trained to mimic the margin between a relevant product and a less relevant product output by the LLM. Experimental results showed that our model improves both relevancy and engagement metrics. Our model increased NDCG@5 by 1.30% and the number of sessions with clicks by 0.214% compared with a production system.</p>
      </abstract>
      <kwd-group>
        <kwd>cross encoders</kwd>
        <kwd>dual encoders</kwd>
        <kwd>knowledge distillation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Major online shopping platforms such as Walmart, eBay and Amazon cater to millions of users
daily with a vast array of products. Search engines play a crucial role in helping users find what
they are looking for, but in the realm of commercial e-commerce, search engines typically rely
heavily on user engagement signals to understand query intent and provide the best possible
search results [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Search queries from users are usually segmented into head, torso and
tail queries. Head and torso queries generally provide enough user engagement data to train
machine learning models for retrieving and reranking relevant items. However, it is difficult
to effectively retrieve and rerank the most relevant products for tail queries due to the lack of
engagement data. Ensuring that search results align closely with different types of queries from
users is vital for maintaining customer satisfaction and trust over time.
      </p>
      <p>⋆ Both authors contributed equally to this research.</p>
      <p>Figure 1: Overview of our framework. The teacher model is an interaction-based LLM with an MLP head, trained on soft human labels over the textual concatenation of a query and an item. The student model is a Siamese DistilBERT with shared parameters and MLP heads, trained to mimic the teacher's margin between a more relevant item d+ (e.g. "iphone 15 pro max") and a less relevant item d− (e.g. "samsung galaxy s24") for a query (e.g. "apple iphone 15 pro max"): ℒ(q, d+, d−) = MSE(t(q, d+) − t(q, d−), s(q, d+) − s(q, d−)).</p>
      <p>
        Traditional methods of matching queries to products have limitations, particularly in bridging
the vocabulary gap. To address this challenge, advanced neural network models have emerged as
a powerful solution. These models, categorized into representation-based and interaction-based
models, offer different approaches to text matching. Representation-based models encode queries
and product titles into fixed-dimensional vectors separately, and compute cosine similarity as a
semantic matching feature for reranking [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7 ref8 ref9">4, 5, 6, 7, 8, 9</xref>
        ], enabling efficient online computation
but potentially sacrificing detailed matching information.
      </p>
      <p>
        On the other hand, interaction-based models excel at capturing fine-grained matching details
by analyzing different parts of queries and products at a low level before making a final decision
based on aggregated evidence [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ]. While these models outperform representation-based
ones in many text matching scenarios, they face challenges for online deployment due
to their inability to pre-compute embeddings offline and consider context effectively.
      </p>
      <p>
        Recent advancements like LLMs (e.g. BERT [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Llama [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], Mistral [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and Gemma
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) have revolutionized text matching tasks by combining the strengths of interaction-based
and representation-based models. Their multilayer architecture based on the Transformer [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
allows for comprehensive interaction between queries and products at various semantic levels,
addressing the shortcomings of previous models. Despite their effectiveness, LLMs' computational
intensity poses hurdles for practical online applications such as e-Commerce search engines.
      </p>
      <p>
        In this work, our goal is to improve the effectiveness of representation-based models used in
production while still meeting the strict latency requirements of e-Commerce search systems for the
tail query segment. Toward this goal, we propose a novel knowledge distillation (KD) framework
to distill an encoder-only LLM (i.e. BERT base [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]) into a representation-based student model
(i.e. DistilBERT [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]), offering improved effectiveness of the student model while maintaining
the efficiency of representation-based models. We first train a highly effective teacher model
(we use LLM and teacher model interchangeably), followed by training the student network to
mimic the LLM's behavior. To train the teacher
model, we propose to use soft human labels converted from editorial feedback to make the
model aware of differences between a perfect match item, an item with a mismatched attribute
(e.g. brand, color, style etc.) and completely irrelevant products, instead of simply using the binarized
labels [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] commonly adopted in the literature. We show that using soft human labels improves the
effectiveness of our teacher model. We further incorporate items' attributes into our teacher model
to enhance its performance. For our student model, we aim to mimic the margin between a
relevant product d+ and an irrelevant document d− output by the teacher model (we use items,
products and documents interchangeably). Intuitively,
soft targets output by the LLM reduce noise and offer more informative knowledge about
relevant differences between the two items. The teacher model/LLM will be served offline
while we can deploy the newly trained student model into production. A high-level overview
of our framework is shown in Figure 1. Our contributions are as follows:
• We propose a novel framework, consisting of a representation-based student model distilled
from an LLM, to generate a semantic matching feature for a reranking system in a major
e-Commerce search engine.
• We propose to improve the effectiveness of the teacher model by using soft human labels to
distinguish a perfectly matched item from items with mismatched attributes and completely
irrelevant items.
• We conducted extensive offline experiments on an in-house dataset and tested our framework
with real-time production traffic. Online testing results showed significant gains of our model
over an existing commercial production system.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Related work</title>
      <p>
        In this section, we summarize related work on relevance search in e-Commerce, neural
ranking models for text search and knowledge distillation methods.
      </p>
      <sec id="sec-3-0">
        <title>2.1. Relevance e-Commerce Search and Ranking</title>
        <p>
          The challenge of e-Commerce search surpasses that of traditional web search [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] owing to the
shortness of user queries and the large number of potentially relevant items [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. Researchers
have suggested an iterative method involving multiple steps, starting with retrieving a set of
candidate items, then iteratively reranking and reducing this set by selecting the top items
[
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. In e-commerce, various signals are used to assess search result quality, with some studies
[
          <xref ref-type="bibr" rid="ref1 ref2 ref23 ref24 ref3">1, 2, 3, 23, 24</xref>
          ] optimizing results based on user engagement metrics like click-through rate and
conversion rate, best-selling products [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] and product result diversity [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. However, the sparseness
of user engagement data may limit model performance on queries without engagement (e.g.
tail queries). Recently, deep textual matching features based on deep neural models have
been employed for retrieval and ranking, with enhancements such as incorporating different
text representations and loss functions [
          <xref ref-type="bibr" rid="ref27 ref28 ref8">27, 28, 29, 30, 31, 8, 32</xref>
          ]. Additionally, some models
have integrated interaction features between user queries and a product graph to capture
relationships among similar products in the ranking process [33] and reinforcement learning
for product search [34]. Our work develops a semantic matching feature based on our novel
knowledge distillation framework, which is used among other engagement signals for reranking
at a major e-Commerce search engine.
        </p>
      </sec>
      <sec id="sec-3-1">
        <title>2.2. Neural Ranking Models for Text Search</title>
        <p>
          Neural ranking models for text search can be categorized into two groups:
representationbased models and interaction-based methods. The former one seeks to learn representations
of a query and a document, and measure their similarity [
          <xref ref-type="bibr" rid="ref27 ref4 ref5 ref6 ref7">4, 5, 6, 7, 35, 36, 27</xref>
          ], while the later
one [37, 38, 39, 40, 41, 42, 43] aims to capture relevant matching signals between a query and
a document based on word/tokens interactions. There are methods aiming to unified two
categories within a single model such as Mitra et al. [44], Rao et al. [
          <xref ref-type="bibr" rid="ref29">45</xref>
          ]. Recent research
has been centered around leveraging pretrained large language models, with BERT being a
prominent example [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. In the context of BERT-based relevance models, there are two common
approaches in literature. The first one is about independently learning representations of
queries and items/products using dual BERT encoders (e.g. siamese or two-tower structure)
[
          <xref ref-type="bibr" rid="ref27 ref28 ref30 ref8 ref9">8, 27, 28, 46, 9</xref>
          ]. The second approach is to concatenate textual contents of a query-item pair
and input the text into a BERT model [
          <xref ref-type="bibr" rid="ref10 ref11 ref12 ref31 ref32 ref33">47, 48, 10, 11, 12, 49</xref>
          ] which demonstrate state-of-the-art
performance on various benchmarks. The former approach is known as representation-based
learning method while the later one is an interaction-based approach. The e-commerce relevance
task, akin to text matching, poses challenges for commercial search engines due to high trafic
and low latency requirements. This makes deploying interaction-based LLMs online a significant
hurdle. To address this issue, our work proposes distilling the interaction-based LLM (i.e. BERT
base) into a representation-based architecture (i.e. DistilBERT), aiming to enhance ranking
efectiveness while maintaining eficiency of online search systems.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>2.3. Knowledge Distillation Methods</title>
        <p>
          Online recommendation/search systems require strict latency in real time which hinders the
deployment of LLMs (e.g BERT [
          <xref ref-type="bibr" rid="ref34">50</xref>
          ], LLamma [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], GPT [
          <xref ref-type="bibr" rid="ref35">51</xref>
          ]). Recently, researchers and
practitioners utilize compression techniques to compress these models into smaller ones. One
of the most widely used method is Knowledge Distillation [
          <xref ref-type="bibr" rid="ref36">52</xref>
          ]. It enables online systems to
leverage sophisticated models like BERT efectively. The core concept of KD involves training
a high-performance teacher model initially, followed by training a simpler student network
to replicate the teacher’s behavior. Knowledge distillation methods mainly fall into three
groups: (1) response-based learning [
          <xref ref-type="bibr" rid="ref32 ref36 ref37 ref38 ref39 ref40 ref41">53, 52, 48, 54, 55, 56, 57, 30</xref>
          ], (2) representation-based
methods [
          <xref ref-type="bibr" rid="ref18 ref42 ref43 ref44">18, 58, 59, 60, 61</xref>
          ] and (3) relation-based knowledge [62]. Our method can be viewed
as a response-based one since our student model is optimized to learn from the soft targets
generated by a large language model (LLM), which are more informative and less noisy. Our
work is closest to [
          <xref ref-type="bibr" rid="ref32">48, 30</xref>
          ]. However, our teacher model is trained with products’ attributes and
soft ratings converted from editorial feedback to increase efectiveness.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Our Framework</title>
      <p>Problem Formulation: Given a query q and an item d, where every item d has a title and
textual attributes such as product type, brand, color and gender, we aim to train a teacher model
t(q, d) ∈ ℝ and a student model s(q, d) ∈ ℝ. These two functions will determine the relevancy of q
and d. After training the LLM, we will train the student model by learning from the soft targets
output by the LLM (i.e. the knowledge distillation process). Our framework (Figure 1) consists
of two main components: (1) the interaction-based LLM (i.e. BERT base) used as the teacher
model, (2) the representation-based model (i.e. DistilBERT) which is the student model. Details
of these components are described in the following subsections.</p>
      <sec id="sec-4-1">
        <title>3.1. The teacher model</title>
        <p>For each query-item pair (q, d), we utilize an LLM (i.e. BERT base) as the encoder, and concatenate
the query and the title of the item as input to the BERT model. As the item title may not contain
sufficient information to determine the relevancy of the query and the item, we also concatenate
the item's attributes (e.g. product type (PT), brand and so on) if they are available. The title and
each of the attributes have unique separator tokens as shown in Eq. 1:</p>
        <p>E(q,d) = BERT([CLS] q [SEP] title [SEP_1] attr_1 [SEP_2] attr_2 ... [SEP_k] attr_k)   (1)</p>
        <p>The hidden state E(q,d)([CLS]) of the [CLS] token is taken as the query-item pair representation.
To the best of our knowledge, our work is the first to use items' attributes such as product types, brands, colors
and genders to enhance the effectiveness of an interaction-based LLM.</p>
        <p>To compute the relevance score t(q, d) of the teacher model, we input E(q,d) into MLP layers
as follows:</p>
        <p>t(q, d) = W2 · ϕ(W1 · E(q,d)([CLS]))   (2)</p>
        <p>
          where ϕ is a non-linear activation, W1 ∈ ℝ^(768×h) and W2 ∈ ℝ^(h×1). We remove biases to avoid clutter. For each query-item
pair (q, d), its rating can be Excellent (i.e. perfect match), Good (i.e. item with a mismatched
attribute (e.g. brand, color, style etc.)), Okay, Bad (i.e. irrelevant items) and so on. We could simply
label excellent/good items as 1s and the rest as 0s similar to [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. However, this is suboptimal since
excellent items and good items are viewed as equal. To help our LLM distinguish these items,
we propose to convert the editorial feedback into soft human labels by labelling an excellent
item as 1, a good item as 0.5 and a completely irrelevant item as 0. The converted human labels
are used in a cross entropy loss to train our LLM as follows:
        </p>
        <p>ℒ(t(q, d), y) = −y · log(t(q, d)) − (1 − y) · log(1 − t(q, d))
where y ∈ {0, 0.5, 1} is converted from the original editorial feedback.</p>
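        <p>To make the teacher concrete, the following is a minimal sketch of its forward pass and soft-label loss in Python, assuming the HuggingFace transformers API. The attribute separator token names, the ReLU activation and the final sigmoid are our illustrative assumptions; the text above only specifies unique separators, an MLP head and a cross entropy loss over soft labels.</p>
        <preformat>
# Hedged sketch of the teacher model (Eqs. 1-2 and the soft-label loss).
# Separator token names, ReLU and the output sigmoid are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

SEP_TOKENS = ["[SEP_PT]", "[SEP_BRAND]", "[SEP_COLOR]"]  # hypothetical separators

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": SEP_TOKENS})

class Teacher(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.encoder.resize_token_embeddings(len(tokenizer))
        self.w1 = nn.Linear(768, hidden, bias=False)  # W1 in Eq. 2
        self.w2 = nn.Linear(hidden, 1, bias=False)    # W2 in Eq. 2

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]             # E_(q,d)([CLS])
        return torch.sigmoid(self.w2(torch.relu(self.w1(cls)))).squeeze(-1)

def teacher_input(query, title, attrs):
    # Eq. 1: concatenate query, title and available attributes with unique separators.
    parts = [query, "[SEP]", title]
    for tok, val in zip(SEP_TOKENS, attrs):
        parts += [tok, val]
    return tokenizer(" ".join(parts), truncation=True, return_tensors="pt")

loss_fn = nn.BCELoss()  # cross entropy accepting soft labels y in {0, 0.5, 1}
        </preformat>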
      </sec>
      <sec id="sec-4-2">
        <title>3.2. The student model</title>
        <p>As shown in Figure 1, our student model uses DistilBERT as the encoder and has identical towers
(a Siamese network). For each query-item pair (q, d), we input the query to the DistilBERT
as follows:</p>
        <p>E_q = DistilBERT([CLS] q [SEP])   (3)</p>
        <p>and use the hidden state E_q([CLS]) of the [CLS] token as the query's representation. For the item,
we concatenate its title and its available
attributes, and input the concatenated text into DistilBERT as shown in Eq. 4:</p>
        <p>E_d = DistilBERT([CLS] title [SEP_1] attr_1 [SEP_2] attr_2 ... [SEP_k] attr_k)   (4)</p>
        <p>The hidden state E_d([CLS]) of the [CLS] token is used as the item's representation. The scoring
function is s(q, d) = cos(E_q([CLS]), E_d([CLS])).</p>
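        <p>A minimal sketch of the student scorer follows, again assuming the HuggingFace transformers API; the cosine scoring function matches the representation-based setup described in the introduction, and the item text construction mirrors Eq. 4 with illustrative separator names.</p>
        <preformat>
# Sketch of the Siamese student (Eqs. 3-4): one DistilBERT encoder shared by both towers.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
enc = AutoModel.from_pretrained("distilbert-base-uncased")  # shared parameters

def embed(text: str) -> torch.Tensor:
    batch = tok(text, truncation=True, return_tensors="pt")
    return enc(**batch).last_hidden_state[:, 0]             # [CLS] hidden state

def s_score(query: str, item_text: str) -> torch.Tensor:
    # item_text is the title concatenated with available attributes (Eq. 4).
    return F.cosine_similarity(embed(query), embed(item_text))
        </preformat>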
        <p>Figure 2: Overview of our online serving system. An offline indexing pipeline runs the item embedding model over the item database and writes item embeddings to an indexing store. At serving time, the query embedding model encodes the incoming query online; candidates retrieved by the retrieval system are scored against the indexed item embeddings, and the rerank system produces the final search results.</p>
        <p>
          To train our student model, we use a loss function similar to the margin MSE loss [
          <xref ref-type="bibr" rid="ref32">48</xref>
          ] to help
the student model mimic the LLM's predicted margin. In [
          <xref ref-type="bibr" rid="ref32">48</xref>
          ], where simple binary labeling
is used, triplets (q, d+, d−) are sampled where d+ is a relevant document and d− is an irrelevant
document for the query q. The teacher's scores t(q, d+) and t(q, d−) are viewed as soft targets, and
the student scores s(q, d+) and s(q, d−) are computed. In [
          <xref ref-type="bibr" rid="ref32">48</xref>
          ], the margin MSE loss for a query
q between a relevant document d+ and an irrelevant document d− is shown in Eq. 5:
        </p>
        <p>ℒ(q, d+, d−) = MSE(t(q, d+) − t(q, d−), s(q, d+) − s(q, d−))   (5)</p>
        <p>
          Extending Hofstätter et al. [
          <xref ref-type="bibr" rid="ref32">48</xref>
          ]'s work from binary classes to accommodate three distinct
document classes, we use d1, d2, and d3 to denote an excellent document (label 1), a good document
(label 0.5) and an irrelevant document (label 0) respectively for the query q. We sample triplets
(q, d_i, d_j) where d_i is more relevant than document d_j for the query q, so there are three possible
combinations (q, d1, d2), (q, d1, d3), and (q, d2, d3). We apply Eq. 5 on the generated triplets to
compute the loss for the query q, as sketched below.
        </p>
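        <p>The sketch below shows the triplet generation and margin MSE objective under the three-class extension; t_score and s_score stand in for the teacher (Section 3.1) and student (Section 3.2) scorers, and the data layout is illustrative.</p>
        <preformat>
# Margin MSE distillation (Eq. 5) extended to three relevance classes.
from itertools import combinations
import torch
import torch.nn.functional as F

def make_triplets(query, docs_by_label):
    # docs_by_label: soft label in {1.0, 0.5, 0.0} -> items carrying that label.
    triplets = []
    for hi, lo in combinations(sorted(docs_by_label, reverse=True), 2):
        for d_i in docs_by_label[hi]:        # d_i is more relevant than d_j
            for d_j in docs_by_label[lo]:
                triplets.append((query, d_i, d_j))
    return triplets                          # covers (d1,d2), (d1,d3), (d2,d3)

def margin_mse(q, d_pos, d_neg, t_score, s_score):
    with torch.no_grad():                    # teacher margins are fixed soft targets
        t_margin = t_score(q, d_pos) - t_score(q, d_neg)
    s_margin = s_score(q, d_pos) - s_score(q, d_neg)
    return F.mse_loss(s_margin, t_margin)
        </preformat>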
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Online serving</title>
        <p>After training our student model, we deploy it into production. An overview of our online
serving system is shown in Figure 2. We index all products' embeddings with an offline pipeline.
For every query q, we generate q's embedding online. From the top-k retrieved candidates of the
retrieval system, we compute a semantic matching feature based on the query's embedding and
the retrieved items' embeddings. The feature is used among other ranking features by a
tree-based model to rank documents and return search results. The features used in the rerank
system can be organized into three groups: (1) query features (e.g. the query's attributes, length
etc.), (2) item features (e.g. item attributes, user reviews, ratings etc.) and (3) query-item features
(e.g. query-item engagement). Our semantic matching feature is a query-item feature.</p>
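        <p>A simplified sketch of how the semantic matching feature could be assembled at serving time is given below, assuming item embeddings were precomputed by the offline pipeline; the function and store names are illustrative, not our production interfaces.</p>
        <preformat>
# Illustrative serving-time computation of the semantic matching feature.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def semantic_matching_features(query, candidate_ids, encode_query, item_index):
    # encode_query: online student query encoder; item_index: item_id -> offline embedding.
    q_emb = encode_query(query)
    return {i: cosine(q_emb, item_index[i]) for i in candidate_ids}

# The resulting per-item scores join the query, item and query-item feature
# groups consumed by the tree-based reranker.
        </preformat>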
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>In this section, we discuss our strategies to collect data, the performance of the teacher
model and the student model, and our online tests.</p>
      <sec id="sec-5-1">
        <title>4.1. Data Collection</title>
        <p>
          To train text matching models, we can either use engagement information (e.g. click search
logs) [43, 41] or human editorial feedback [
          <xref ref-type="bibr" rid="ref19 ref8">8, 19</xref>
          ]. While using engagement information to
collect data may help us generate large-scale data, we find that for tail queries, engagement
information is usually limited and noisy, leading to poor effectiveness of our models. Therefore,
we leverage human editorial labels, which may be smaller in size but are more reliable for capturing
the textual relevancy between a query and an item, to train our models.
        </p>
        <p>
          Over the years, our human editorial evaluation data has been generated by manually assessing the
top-ranked items returned by a control ranking model and a variant model for a set of sampled queries.
The queries are sampled based on search traffic. In total, we collected an in-house dataset where
each query has a list of ∼10-20 items with human editorial ratings similar to [
          <xref ref-type="bibr" rid="ref19 ref8">43, 41, 8, 19</xref>
          ].
Again, we did not use click-search logs to train our models in this paper. We convert the original
ratings into soft human labels as discussed in Section 3.1. For each query-item pair (q, d), its
rating can be Excellent (i.e. perfect match), Good (i.e. item with a mismatched attribute (e.g.
brand, color, style etc.)), Okay, Bad (i.e. irrelevant items) and so on. It should be noted that not all
attributes hold equal importance. In this paper we omit the specific details of our annotation
guidelines. To further increase the number of query-item pairs, we have also included some
hard negative items for each of the queries. While the addition of these hard negatives did not
lead to significant relevance gains, we observed that including hard negatives resulted in the
model yielding more consistent results than using random negative items.
        </p>
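        <p>For concreteness, a small sketch of the rating-to-soft-label conversion described above; the handling of ratings beyond Excellent/Good/Bad is simplified here, since our full annotation guidelines are omitted.</p>
        <preformat>
# Convert editorial ratings to soft human labels (Section 3.1); simplified mapping.
SOFT_LABELS = {"Excellent": 1.0, "Good": 0.5, "Bad": 0.0}

def to_soft_label(rating: str) -> float:
    # Ratings outside these three classes would need a policy of their own.
    return SOFT_LABELS[rating]
        </preformat>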
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Performance of the teacher model</title>
        <p>We explored multiple methods to train our teacher models, with an emphasis on the labeling
strategy and the loss function. Our current production model employs aggressive labeling,
where excellent items are labeled as positive 1, while all others are labeled as negative 0.
Our analysis shows that subject mismatch accounts for 20% of irrelevant search results, so it
is important to distinguish between good and irrelevant items to improve the relevance
of top items. In Table 1, we compare the performance of our model trained with aggressive
labeling to that trained with soft-labeling, where the label is 1 for an excellent match, 0.5 for a good
match and 0 for an irrelevant match. We observe a relative gain of +0.47% in NDCG@5 with the
soft-labeling approach. Additionally, we explored other methods for distinguishing between
good items and irrelevant items, including multi-class classification (MCCE) and Multivariate
Ordinal Regression (Ordinal) [63]; these approaches did not result in NDCG improvement.
Soft-labeling is also easier for knowledge distillation than MCCE and Ordinal:
it generates a single logit output, simplifying the distillation process compared
to the two-output approach of MCCE and Ordinal.
Based on the above, we adopt the soft-labeling method for our teacher model.</p>
        <p>Table 1: Teacher model variants compared: BERT w/ aggressive labeling, BERT w/ soft-labeling w/o item attributes, and BERT w/ soft-labeling.</p>
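        <p>For reference, the NDCG@k metric reported throughout can be computed as below over graded relevance labels; this is the standard formulation, not our exact in-house evaluation code.</p>
        <preformat>
# Standard NDCG@k over the graded relevance labels of a ranked list.
import numpy as np

def ndcg_at_k(relevances, k):
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[:ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0
        </preformat>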
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Performance of the student model</title>
        <p>We compare our student model (KD-DistilBERT), trained with the margin MSE loss, with a
state-of-the-art response-based KD method. We also include the performance of our best teacher model. As
shown in Table 2, all KD-based methods significantly outperform DistilBERT trained without knowledge
distillation (p-value &lt; 0.001, t-test), indicating the effectiveness of
using the soft targets output by our teacher model. Our model (KD-DistilBERT) performs best
among the KD-based methods. We can see the teacher model outperforms all student models by
large gaps. Note that all student models have the same model architecture (DistilBERT) for fair
comparison. As the gap between our student model and our teacher model is considerable, we
may consider using a bigger model as the student model to further improve effectiveness while
latency increases modestly. We leave this for future work.</p>
        <p>In terms of latency, we observe that the teacher model is much slower than our student model.
At runtime, given a query-item pair (q, d), the teacher model needs to run inference on the concatenation
of the query and the item, while for the student model, we can compute the item's embedding
offline and, as the content of the query is short, online inference for the query's representation
is fast. Therefore, the student model is much preferable for online applications. As our
student model has the same architecture as the existing production model, it
does not incur any additional latency.</p>
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Online experiments</title>
        <p>
          Our KD-DistilBERT's performance was assessed by human evaluators who compared the
top-10 results from our model with Walmart's production system, which already has a semantic
matching feature computed by a Siamese DistilBERT model [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. As we use DistilBERT as the encoder,
our framework does not incur any additional latency. Queries were randomly sampled from
search traffic at Walmart. As we can see in Table 3, our model outperforms the production
system significantly on relevancy metrics (NDCG@5 and NDCG@10). Reported results are
statistically significant (t-test). We also conducted an A/B test to compare the engagement metrics
of our proposed framework and the production system. As reported in Table 4, our model
increases first-time buyers by 2.55%, reduces abandoned search sessions by 0.25% and increases
the number of sessions with clicks by 0.214%.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Discussion and Future Work</title>
      <p>
        In this paper, we employ an encoder-only LLM (i.e. BERT) as the teacher model. We found that
powerful decoder-only LLMs with more parameters (e.g. Llama [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], Mistral [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ])
are more effective and can further improve the effectiveness of the student model. We leave this for
future work.
      </p>
      <p>Currently, our student model is only trained on soft targets output by the teacher model
for query-item pairs in the human editorial feedback dataset. This is suboptimal, since we can apply
the teacher model to unlabeled data to obtain a much larger dataset. Our preliminary results
show that it is beneficial to generate soft labels for unlabeled query-item pairs. We will further
explore this direction in future work. In addition, as a next step, we will explore the
possibility of incorporating a multi-objective loss function that combines both relevance and
engagement information.</p>
      <p>As our model is served for tail queries only, we will expand it to the head/torso segments and
further include users' information to make search results more personalized. We leave this for
future work as well.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>
        In this paper, we proposed a novel knowledge distillation framework consisting of an LLM
as the teacher model and a DistilBERT as the student model. We proposed to improve the
effectiveness of the LLM by using soft human labels and items' attributes. Our KD-DistilBERT
outperformed baselines in offline and online experiments while maintaining the efficiency of the
existing production system. Our work opens the door for new industrial applications of other
LLMs [
        <xref ref-type="bibr" rid="ref14 ref15 ref16">15, 14, 16</xref>
        ] in e-Commerce search.
      </p>
      <p>
SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2020, pp. 2553–2561.
[29] W. Kong, S. Khadanga, C. Li, S. K. Gupta, M. Zhang, W. Xu, M. Bendersky, Multi-aspect dense retrieval, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 3178–3186.
[30] L. Kumar, S. Sarkar, ListBERT: Learning to rank e-commerce products with listwise BERT, arXiv preprint arXiv:2206.15198 (2022).
[31] P. Pobrotyn, T. Bartczak, M. Synowiec, R. Białobrzeski, J. Bojar, Context-aware learning to rank with self-attention, arXiv preprint arXiv:2005.10084 (2020).
[32] E. P. Brenner, J. Zhao, A. Kutiyanawala, Z. Yan, End-to-end neural ranking for ecommerce product search, Proceedings of SIGIR eCom 18 (2018) 7.
[33] Y. Zhang, D. Wang, Y. Zhang, Neural IR meets graph embedding: A ranking model for product search, arXiv preprint arXiv:1901.08286 (2019).
[34] Y. Hu, Q. Da, A. Zeng, Y. Yu, Y. Xu, Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2018, pp. 368–377.
[35] M. Zhu, A. Ahuja, W. Wei, C. K. Reddy, A hierarchical attention retrieval model for healthcare question answering, in: The World Wide Web Conference, 2019, pp. 2472–2482.
[36] L. Gao, J. Callan, Condenser: a pre-training architecture for dense retrieval, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 981–993.
[37] H. Chen, F. X. Han, D. Niu, D. Liu, K. Lai, C. Wu, Y. Xu, MIX: Multi-channel information crossing for text matching, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2018, pp. 110–119.
[38] K. Hui, A. Yates, K. Berberich, G. de Melo, PACRR: A position-aware neural IR model for relevance matching, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1049–1058.
[39] L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, X. Cheng, Text matching as image recognition, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[40] J. Guo, Y. Fan, Q. Ai, W. B. Croft, A deep relevance matching model for ad-hoc retrieval, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2016, pp. 55–64.
[41] C. Xiong, Z. Dai, J. Callan, Z. Liu, R. Power, End-to-end neural ad-hoc ranking with kernel pooling, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 55–64.
[42] K. Hui, A. Yates, K. Berberich, G. de Melo, Co-PACRR: A context-aware neural IR model for ad-hoc retrieval, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 279–287.
[43] Z. Dai, C. Xiong, J. Callan, Z. Liu, Convolutional neural networks for soft-matching n-grams in ad-hoc search, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 126–134.
[44] B. Mitra, F. Diaz, N. Craswell, Learning to match using local and distributed representations of text for web search, in: Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 1291–1299.
[61] K. Howell, J. Wang, A. Hazare, J. Bradley, C. Brew, X. Chen, M. Dunn, B. A. Hockey, A. Maurer, D. Widdows, Domain-specific knowledge distillation yields smaller and better models for conversational commerce, in: Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5), 2022, pp. 151–160.
[62] Y. Liu, J. Cao, B. Li, C. Yuan, W. Hu, Y. Li, Y. Duan, Knowledge distillation via instance relationship graph, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7096–7104.
[63] L. Yan, Z. Qin, X. Wang, G. Shamir, M. Bendersky, Learning to rank when grades matter, arXiv preprint arXiv:2306.08650 (2023).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Teo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dattatreya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mohan</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Leverage implicit feedback for context-aware product search</article-title>
          , arXiv preprint arXiv:1909.02065 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Magnani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaidaroon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Puthenputhussery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <article-title>A multi-task learning framework for product ranking with Bert</article-title>
          ,
          <source>in: Proceedings of the ACM Web Conference</source>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>493</fpage>
          -
          <lpage>501</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Karmaker Santu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sondhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <article-title>On application of learning to rank for e-commerce search</article-title>
          ,
          <source>in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>475</fpage>
          -
          <lpage>484</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.-S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Acero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Heck</surname>
          </string-name>
          ,
          <article-title>Learning deep structured semantic models for web search using clickthrough data</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM International Conference on Information &amp; Knowledge Management</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>2333</fpage>
          -
          <lpage>2338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mesnil</surname>
          </string-name>
          ,
          <article-title>A latent semantic model with convolutionalpooling structure for information retrieval</article-title>
          ,
          <source>in: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>101</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Severyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moschitti</surname>
          </string-name>
          ,
          <article-title>Learning to rank short text pairs with convolutional deep neural networks</article-title>
          ,
          <source>in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>373</fpage>
          -
          <lpage>382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <article-title>Combining fact extraction and verification with neural semantic matching networks</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>33</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>6859</fpage>
          -
          <lpage>6866</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Magnani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaidaroon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reddy Suram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Puthenputhussery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          , et al.,
          <article-title>Semantic retrieval at walmart</article-title>
          ,
          <source>in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3495</fpage>
          -
          <lpage>3503</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Shan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          , G. Zhang,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Beyond two-tower: Attribute guided representation learning for candidate retrieval</article-title>
          ,
          <source>in: Proceedings of the ACM Web Conference</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>3173</fpage>
          -
          <lpage>3181</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Deeper text understanding for ir with contextual neural language modeling</article-title>
          ,
          <source>in: Proceedings of the 42nd international ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>985</fpage>
          -
          <lpage>988</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Monath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaheer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mccallum</surname>
          </string-name>
          ,
          <article-title>Eficient k-nn search with cross-encoders using adaptive multi-round cur decomposition</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>8088</fpage>
          -
          <lpage>8103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>Colbert: Eficient and efective passage search via contextualized late interaction over Bert</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name><given-names>M.-A.</given-names><surname>Lachaux</surname></string-name>
          ,
          <string-name><given-names>T.</given-names><surname>Lacroix</surname></string-name>
          ,
          <string-name><given-names>B.</given-names><surname>Rozière</surname></string-name>
          ,
          <string-name><given-names>N.</given-names><surname>Goyal</surname></string-name>
          ,
          <string-name><given-names>E.</given-names><surname>Hambro</surname></string-name>
          ,
          <string-name><given-names>F.</given-names><surname>Azhar</surname></string-name>
          , et al.,
          <article-title>Llama: Open and eficient foundation language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13971</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          , D. d. l. Casas,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <article-title>Mistral 7B</article-title>
          , arXiv preprint arXiv:2310.06825
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Banks</surname>
          </string-name>
          , T. Warkentin, Gemma:
          <article-title>Introducing new state-of-the-art open models</article-title>
          ,
          <year>2024</year>
          . URL: https://blog.google/technology/developers/gemma-open-models/.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name><given-names>Ł.</given-names><surname>Kaiser</surname></string-name>
          ,
          <string-name><given-names>I.</given-names><surname>Polosukhin</surname></string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          , T. Wolf,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          , arXiv preprint arXiv:1910.01108 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          , Ms marco:
          <article-title>Benchmarking ranking models in the large-data regime</article-title>
          ,
          <source>in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1566</fpage>
          -
          <lpage>1576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name><given-names>T.-Y.</given-names><surname>Liu</surname></string-name>
          ,
          <string-name><given-names>M.-F.</given-names><surname>Tsai</surname></string-name>
          ,
          <string-name><given-names>H.</given-names><surname>Li</surname></string-name>
          ,
          <article-title>Learning to rank: from pairwise approach to listwise approach</article-title>
          ,
          <source>in: Proceedings of the 24th International Conference on Machine Learning</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sarvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Voskarides</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mooiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schelter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          ,
          <article-title>A comparison of supervised learning to match methods for product search</article-title>
          ,
          <source>arXiv preprint arXiv:2007.10296</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Trotman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Degenhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kallumadi</surname>
          </string-name>
          ,
          <article-title>The architecture of eBay search</article-title>
          ,
          <source>in: eCom@SIGIR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Da</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-X.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Improving multi-scenario learning to rank in e-commerce by exploiting task relationships in the label space</article-title>
          ,
          <source>in: Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2605</fpage>
          -
          <lpage>2612</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <article-title>Learning a product relevance model from click-through data in e-commerce</article-title>
          ,
          <source>in: Proceedings of the Web Conference 2021</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2890</fpage>
          -
          <lpage>2899</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>B.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Enhancing product search by best-selling prediction in e-commerce</article-title>
          ,
          <source>in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>2479</fpage>
          -
          <lpage>2482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>N.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sundaresan</surname>
          </string-name>
          ,
          <article-title>Beyond relevance in marketplace search</article-title>
          ,
          <source>in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>2109</fpage>
          -
          <lpage>2112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Frieder</surname>
          </string-name>
          ,
          <article-title>Efficient document re-ranking for transformers by precomputing term representations</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.-T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pronin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Padmanabhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ottaviano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Embedding-based retrieval in Facebook search</article-title>
          ,
          <source>in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2553</fpage>
          -
          <lpage>2561</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Bridging the gap between relevance matching and semantic matching for short text similarity modeling</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>5373</fpage>
          -
          <lpage>5384</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Abrego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          , et al.,
          <article-title>Large dual encoders are generalizable retrievers</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>9844</fpage>
          -
          <lpage>9855</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Multi-stage document ranking with BERT</article-title>
          ,
          <source>arXiv preprint arXiv:1910.14424</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hofstätter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Althammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schröder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sertkan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <article-title>Improving efficient neural ranking models with cross-architecture knowledge distillation</article-title>
          ,
          <source>arXiv preprint arXiv:2010.02666</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>K.</given-names>
            <surname>Santhanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Saad-Falcon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>ColBERTv2: Effective and efficient retrieval via lightweight late interaction</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3715</fpage>
          -
          <lpage>3734</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of NAACL-HLT</source>
          , volume
          <volume>1</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distilling the knowledge in a neural network</article-title>
          ,
          <source>arXiv preprint arXiv:1503.02531</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>Is ChatGPT good at search? Investigating large language models as re-ranking agents</article-title>
          ,
          <source>arXiv preprint arXiv:2304.09542</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>A.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jayasumana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Rawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>In defense of dual encoders for neural ranking</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>15376</fpage>
          -
          <lpage>15400</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>BERT2DNN: BERT distillation with massive unlabeled data for online e-commerce search</article-title>
          ,
          <source>in: 2020 IEEE International Conference on Data Mining (ICDM)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>212</fpage>
          -
          <lpage>221</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>A.</given-names>
            <surname>Muhamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Keivanloo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Perera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mracek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rajagopalan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chilimbi</surname>
          </string-name>
          ,
          <article-title>CTR-BERT: Cost-effective knowledge distillation for billion-parameter teacher models</article-title>
          ,
          <source>in: NeurIPS Efficient Natural Language and Speech Processing Workshop</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Knowledge distillation based contextual relevance matching for e-commerce product search</article-title>
          ,
          <source>arXiv preprint arXiv:2210.01701</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>TinyBERT: Distilling BERT for natural language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1909.10351</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>ReprBERT: Distilling BERT to an efficient representation-based relevance model for e-commerce</article-title>
          ,
          <source>in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4363</fpage>
          -
          <lpage>4371</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [60]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <article-title>Multimodal pre-training with self-distillation for product understanding in e-commerce</article-title>
          ,
          <source>in: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1039</fpage>
          -
          <lpage>1047</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>