<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Behavior-driven Query Similarity Prediction based on Pre-trained Language Models for E-Commerce Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yupin Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiri Gesi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinyu Hong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Han Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kai Zhong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vivek Mittal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qingjun Cui</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vamsi Salaka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Amazon</institution>
          ,
          <addr-line>Palo Alto</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California</institution>
          ,
<addr-line>Irvine</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Pre-trained language models (PLMs) excel at capturing semantic similarity in language, while in e-commerce, customer shopping behavior data (e.g., clicks, add-to-cart actions, purchases) helps establish connections between similar queries based on behavior on products. This work addresses the challenges of using sparse behavior data to build a robust query-to-query similarity prediction model and applies it to a product search ranking system. Our contributions include a straightforward method for data generation and tests of different model structures on both public PLMs and in-house PLMs fine-tuned with Amazon internal data. The fine-tuned in-house PLM shows a 27.4% NDCG improvement compared with BERT. We also designed an end-to-end pipeline that incorporates model outputs into prior features; the prior scores can be used to improve ranking, matching, and recommendation systems. We tested the prior in an online experiment, which led to a significant 0.08% improvement in the search click rate and a 0.03% reduction in the search reformulation rate. Overall, our approach has significant implications for improving search ranking and matching applications.</p>
      </abstract>
      <kwd-group>
<kwd>Neural Networks</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>Query-to-Query</kwd>
        <kwd>Distillation</kwd>
        <kwd>E-Commerce</kwd>
        <kwd>Search Ranking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
Behavior features constructed from user feedback on the corresponding queries are among
the most crucial features in ranking models [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. They have been a key revenue driver in
applications like search ranking and matching, click-through rate prediction, and sourcing.
Though powerful, the user feedback signals are sparse. Taking Amazon’s US website as an
example, it receives several billion unique queries every year, but the majority of them do
not have suficient customer behavior signals (e.g., clicks, add-to-cart, and purchases) to build
high-quality behavior features used in ranking. The same pattern happened on YouTube as
well [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Thus, for long-tail unique queries, customer signals are too sparse to generate features.
Although many queries are semantically similar [
        <xref ref-type="bibr" rid="ref3">3</xref>
] (for example, "acoustic noise-canceling
panels", "soundproofing acoustic studio foam", and "sound-absorbing acoustic panels"), their
corresponding signals differ significantly, resulting in unbalanced feature quality.
      </p>
      <p>
        Pre-trained language models (PLMs) have a significant impact on query-related tasks. PLMs
convert raw text into continuous, high-dimensional vectors that encode the semantic meaning
of the text [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The distance (e.g., cosine similarity) between two vectors can measure whether
the two queries are semantically related. We tested several PLMs, including BERT-base [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
Sentence-BERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and InfoXLM [
        <xref ref-type="bibr" rid="ref7">7</xref>
] in a zero-shot setting. For the aforementioned query
"acoustic noise-canceling panels", these models introduce defects like "noise canceling headphones"
and "noise cancelling earbuds for sleep", which are undesirable. Some methods in e-commerce
typically require several additional auxiliary tasks [
        <xref ref-type="bibr" rid="ref8">8</xref>
] to preserve query intent or the creation of a
query-product knowledge graph [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These approaches are not practical to implement due to the complexity of
generating training data with special requirements.
      </p>
      <p>
        Our work leveraged the large amount of anonymized and aggregated customer behavior data
to create training data and labels, then tested our model structures in several PLMs, including
in-house PLMs trained with e-commerce data. We use this model to establish better query
representations and ultimately improve feature coverage by mapping tail queries (usually longer,
more specific queries with low search volumes) to head queries (usually short, common queries with
high search volumes) that have better behavior signals. For production applications, we designed a
two-stage pipeline (retrieval and re-ranking) to incorporate the model outputs into behavior
feature priors [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
<p>In this paper, we present three contributions:
• Introduced a straightforward yet effective method for collecting similar queries and
creating labels from search logs only.
• Tested different model structures on several PLMs, including two in-house PLMs
fine-tuned with Amazon internal data. Our experiments highlight the superior performance
and necessity of the in-house PLMs, with a 27.4% improvement in offline NDCG.
• Designed a method to extract similar queries’ behavioral scores into priors used for product
search ranking models. The online A/B test conducted on 100M search sessions achieved
significant improvement in both revenue-aware metrics and user-engagement metrics,
with the search click rate increasing by 0.08% and the search reformulation rate decreasing
by 0.03%. By adopting similar queries’ behavioral signals, we also observed a significant
reduction in search defects in production.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Query Normalization</title>
        <p>
          Query normalization is important in helping match user queries with relevant products when the
query contains diferent forms or alternative expressions of the same concept. It consists of some
common techniques [
          <xref ref-type="bibr" rid="ref11">11</xref>
] including tokenization, filtering, and stemming. In e-commerce,
user queries are short in length, so we found via experiments that sorting tokens alphabetically
within a query could further reduce the number of unique queries while maintaining a high level
of precision at Amazon. Figure 1 (left) shows 37 user queries like "women trendy sunglasses",
"womans sunglasses trendy", and "trendy woman sunglass". After query normalization steps
like filtering (e.g., women’s -&gt; women), stemming (e.g., women -&gt; woman), and sorting tokens,
they can all be represented by one query, "sunglass trendy woman". This significantly facilitates
customer feedback signal sharing for building behavior features. Though powerful, query
normalization is limited to the lexical level and thus cannot establish associations with queries that contain different
tokens. Figure 1 (right) shows some semantically related queries, grouped by type, that
our work aims to generate on top of query normalization.
        </p>
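<p>The normalization steps above can be sketched in a few lines of Python. This is a minimal illustration only: the filtering and stemming rules below are toy stand-ins, not Amazon's production normalizer.</p>

```python
# Toy sketch of query normalization: tokenize, filter, stem, then sort
# tokens alphabetically so surface variants collapse to one query.
import re

def toy_stem(token: str) -> str:
    """Simplified filter/stem rules for illustration only."""
    token = token.replace("'s", "")   # filtering, e.g. women's -> women
    if token == "women":              # stemming, e.g. women -> woman
        return "woman"
    if token.endswith("sses"):        # e.g. sunglasses -> sunglass
        return token[:-2]
    if token.endswith("ss"):          # sunglass stays sunglass
        return token
    if token.endswith("s"):           # e.g. womans -> woman
        return token[:-1]
    return token

def normalize_query(query: str) -> str:
    tokens = re.findall(r"[a-z0-9']+", query.lower())
    return " ".join(sorted(toy_stem(t) for t in tokens))

# All surface variants collapse to a single representative query:
for q in ["women trendy sunglasses", "womans sunglasses trendy", "trendy woman sunglass"]:
    print(normalize_query(q))   # sunglass trendy woman (three times)
```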
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Query Rewriting</title>
        <p>
          Query rewriting is a crucial aspect of information retrieval and ranking, and it has been an
active research field [
          <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
          ]. Three primary rewriting techniques exist: replacement-based,
generation-based, and retrieval-based methods. Replacement-based methods employ synonym
replacement [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ] or query term-dropping [
          <xref ref-type="bibr" rid="ref17">17</xref>
]. Figure 1 (right) shows some results in
the [synonyms] and [token drop] sections that can be achieved using replacement-based
methods. But these methods can produce poorly rewritten queries like "trendy woman"
or "stylish sunglass women" that may not reflect the actual terms customers use in real-world
scenarios. Generation-based methods typically utilize Seq2Seq models [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] or transformer
models (BERT [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and GPT [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]), which lead to significant improvements compared to traditional
rule-based and statistical methods. Zhang et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] developed a multi-task learning model that
predicts the target query while also fulfilling query-matching, category classification, and
product name prediction tasks to preserve the query’s shopping intent. However, this increases
the original task’s complexity and makes label collection more challenging. Another challenge
with generation-based models is the long inference time. Hofstatter’s analysis [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] on the
Fusion-in-Decoder model shows that decoding latency is 10X that of encoding, which makes it hard to
meet the online latency requirements of our applications. To simplify the training task and
better leverage the power of pre-trained PLMs, we chose a retrieval-based method and built
our work on top of PLMs. This ensures that similar queries come from a pre-defined candidate
set that has good customer signals.
        </p>
        <p>
Collecting labeled data for query rewriting can be challenging and expensive. E-commerce
queries are usually short and lack customer shopping context; thus, it is difficult to
set judgment standards and train people to generate consistent labels. This has made the weakly
supervised technique of mining query pairs from search logs [
          <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
          ] a widely accepted approach.
The implementation varies in practice. Ozertem et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] collect customer-rewritten queries
within the same session; Fujita et al. [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] use co-click data to gather similar queries with higher
click data rankings; Baeza-Yates et al. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] create query clusters by extracting tokens from
queries and clicked URLs, then identify similar queries within the input query’s corresponding
cluster. In this work, we designed a divide-and-conquer method to collect similar query pairs
that can be effectively applied to large amounts of data.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Formulation</title>
<p>We formally define the problem of predicting query similarity based on customer behavior data
in e-commerce applications. We denote the set of all input queries of all users by Q, which consists of a
collection of queries q. Given an input query q, there can exist a set of candidate queries
C_q = {c_1, c_2, ...} from Q that are close to q. We train a model to predict the similarity between
q and each candidate query c_j, denoted y_j, ranging from 0 to 1. So the training data can
be denoted as T = {(q, c_j, y_j)}, j ∈ {1, 2, ..., |C_q|}. In fact, it is not possible to generate a similarity
score that precisely represents the relationship between two queries, as the actual score does
not have a meaning in a real-world scenario. We approximate this value by considering
the co-purchase actions between them.</p>
<p>Training Label Design To compute the similarity score y_j between two queries (q, c_j), we
define two types of similarity between queries: overlap similarity (S_o) and Jaccard similarity
(S_j). We use P(q) to denote the unique purchased products from query q; then S_o is calculated
by:</p>
      <p>S_o(q, c_j) = |P(q) ∩ P(c_j)| / min(|P(q)|, |P(c_j)|)   (1)</p>
      <p>To prevent popular queries from dominating every query’s candidate query list, we introduce S_j.
Popular queries are those that have purchases of many different products; for example, ‘nike
shoes’ is a popular query with many different purchased products. Since the denominator in
S_j uses the union of the purchased products from the two queries, it avoids every shoe- or
Nike-related query having ‘nike shoes’ as the top similar candidate.</p>
      <p>S_j(q, c_j) = |P(q) ∩ P(c_j)| / |P(q) ∪ P(c_j)|   (2)</p>
      <p>We use the product of the two similarities as the label representing the similarity between
two queries:</p>
      <p>y(q, c_j) = S_o(q, c_j) · S_j(q, c_j)   (3)</p>
<p>We consider label y_j to be close to 1 when queries q and c_j have the largest co-purchased-product
overlap among the candidate queries, and close to 0 otherwise. We then fit a
model on (q, c_j, y_j) to score the similarity between the two queries.</p>
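<p>The label design above translates directly into code. The following sketch computes the overlap similarity, the Jaccard similarity, and their product for two purchased-product sets; the product identifiers are illustrative.</p>

```python
# Label computation: the product of overlap similarity and Jaccard
# similarity over the unique purchased-product sets of the two queries.
def query_similarity_label(P_q: set, P_c: set) -> float:
    """P_q, P_c: unique purchased products for query q and candidate c_j."""
    inter = len(P_q & P_c)
    if inter == 0:
        return 0.0
    s_overlap = inter / min(len(P_q), len(P_c))  # overlap similarity
    s_jaccard = inter / len(P_q | P_c)           # Jaccard similarity
    return s_overlap * s_jaccard                 # label y

# Identical purchase histories give label 1.0; a candidate covered by a much
# larger purchase set is down-weighted by the Jaccard term.
print(query_similarity_label({"p1", "p2"}, {"p1", "p2"}))        # 1.0
print(query_similarity_label({"p1", "p2", "p3", "p4"}, {"p1"}))  # 0.25
```

The Jaccard factor is what keeps popular queries with huge purchase sets from scoring 1.0 against every small query they happen to cover.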
<p>Once we generate a candidate query set C_q for a given query q, we can combine the signals
from each query in C_q to generate a prior score (details in Section 4.5). We then combine the prior
with query q’s behavioral signals to create a new ranking feature that is used in tree-based
ranking models to improve ranking quality.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>Our proposed framework (in Figure 2) consists of three main components: the input layer,
the encoder layer, and the similarity calculation layer. We tested both bi-encoder (BE) and
cross-encoder (CE) structures in the encoder layer.</p>
      <sec id="sec-4-1">
        <title>4.1. Input Layer</title>
<p>Input Data We use co-purchased products to bridge similar queries. After we retrieved a large
pool of query pairs from the search log, we designed the following steps to filter and label the
samples.</p>
        <p>1. The two queries must have at least three different co-purchased products within a year. This
prevents weak query pairs from being generated.
2. For each query, we rank its similar queries based on the defined QuerySimilarity score.
3. For each query’s ranked queries, we select the top 30¹ or top 60% of queries. We chose the
top 60% threshold for very popular queries that can have thousands of similar
queries.</p>
<p>Model Input For BE models, the search query and candidate query are fed separately into the
pre-trained model, which outputs two separate embeddings. For CE models, we concatenate
the two queries using the special tokens &lt;CLS&gt; and &lt;SEP&gt;: the input takes the form &lt;CLS&gt; Search
Query &lt;SEP&gt; Candidate Query &lt;SEP&gt; and is then fed into the pre-trained model.</p>
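<p>The two input layouts can be sketched as plain strings. In practice a subword tokenizer assembles this sequence; here the special tokens are written in BERT's [CLS]/[SEP] spelling, corresponding to the &lt;CLS&gt;/&lt;SEP&gt; notation above.</p>

```python
# Sketch of the BE and CE input formats. A real tokenizer would also add
# subword splitting, padding, and attention masks.
CLS, SEP = "[CLS]", "[SEP]"

def bi_encoder_inputs(search_query: str, candidate_query: str):
    # BE: each query is encoded on its own, yielding two embeddings.
    return (f"{CLS} {search_query} {SEP}", f"{CLS} {candidate_query} {SEP}")

def cross_encoder_input(search_query: str, candidate_query: str) -> str:
    # CE: both queries in one sequence so attention can span the pair.
    return f"{CLS} {search_query} {SEP} {candidate_query} {SEP}"

print(cross_encoder_input("trendy woman sunglass", "woman sunglass"))
# [CLS] trendy woman sunglass [SEP] woman sunglass [SEP]
```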
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Encoder Layer</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Bi-encoder model</title>
<p>Under the bi-encoder structure (Figure 5 in Appendix A), each encoder uses a representation-based
model that takes one query as input and outputs one feature embedding vector. Then
the generated embeddings of the two queries can be fed into a simple dot product or perceptron
to calculate the similarity. The advantage of a BE model is the low inference cost, which enables
online deployment under low-latency requirements.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Cross-encoder Model</title>
<p>The CE model (Figure 6 in Appendix B) takes multiple inputs and allows informational
interaction at an early stage by leveraging attention heads to exploit inter-query
interactions. This interaction can be as simple as feeding two concatenated feature embeddings into a
multi-layer perceptron (MLP), or more complex, such as leveraging an attention mechanism
between the two input queries.</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Pre-trained Language Model</title>
          <p>
PLMs are trained on massive multi-lingual corpora from the internet and have proven to
be foundational game-changers for various natural language processing (NLP) and natural
language understanding tasks. Amazon, which operates in over 20 countries worldwide, has a
vast amount of product and search log data. Thus, within Amazon, we also have PLMs fine-tuned
for e-commerce. Table 1 summarizes the four different PLMs we tested in this project.
• BERT base [
            <xref ref-type="bibr" rid="ref5">5</xref>
] is a transformer model pre-trained on a large corpus of English data in a
self-supervised fashion using a masked language modeling (MLM) objective.
• Sentence-BERT [
            <xref ref-type="bibr" rid="ref6">6</xref>
] is a project that aims to train sentence embedding models on very
large sentence-level datasets using a self-supervised contrastive learning objective. It
is pretrained on a dataset of 1 billion sentence pairs.
¹ We found that when we included more than 30 queries, the differences between the later ones were small and less
relevant. We tested using the top 150 in model training and found the offline NDCG dropped by 1,000 bps.
          </p>
<p>Table 1: Summary of the four PLMs tested.
• BERT-base: hidden size 768, 12 self-attention heads, 110M parameters; trained with MLM on a Book Corpus of 11,038 books plus English Wikipedia.
• Sentence-BERT: hidden size 384, 12 self-attention heads, 33M parameters; trained with contrastive learning on a 1B sentence-pairs dataset.
• A-PLMv1: hidden size 768, 12 self-attention heads, 158M parameters; trained with MLM + TLM on 1B distinct queries plus 266M parallel translations.
• A-PLMv2: hidden size 1024, 16 self-attention heads, 300M parameters; trained with MLM on Amazon internal data.</p>
<p>To generate the final embedding representation from the different encoders, we explored using
&lt;CLS&gt; special-token embeddings and average pooling in our experiments.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Similarity Calculation Layer</title>
        <p>We calculated the similarity between the two generated embeddings using three methods:
cosine similarity, MLP, and a combination of the two. Details of these methods are provided in
Appendix C Figure 7.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Supervised Learning</title>
        <sec id="sec-4-4-1">
          <title>4.4.1. Pointwise Training</title>
<p>• y_j → (0, 1], when q and c_j have co-purchased products.</p>
          <p>• y_j → 0, when q and c_j have no co-purchased products.</p>
          <p>
During training, we artificially created negative pairs (q, c_j, 0) where there was no
co-purchase for q and c_j, mixed them with positive pairs (q, c_j, y_j), and trained a model on
the labels. During inference, we use the model to rank a list of candidates with the input form
[(q, c_j)]. In this study, we use two different loss functions:
• Binary Cross Entropy Loss (BCE) [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ]:
          </p>
<p>BCE = −(1/N) · Σ_{j=1}^{N} [ y_j · log(ŷ_j) + (1 − y_j) · log(1 − ŷ_j) ]</p>
          <p>• Mean Square Error Loss (MSE):</p>
          <p>MSE = (1/N) · Σ_{j=1}^{N} (y_j − ŷ_j)²</p>
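<p>Both pointwise losses are standard and can be written out directly; here y are the behavioral labels (0 for the artificial negatives) and y_hat the model's predicted similarities.</p>

```python
# The two pointwise losses in NumPy form. The clip in BCE guards against
# log(0) for predictions at the boundary.
import numpy as np

def bce_loss(y, y_hat, eps=1e-12):
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def mse_loss(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

y     = [1.0, 0.0, 0.6]
y_hat = [0.9, 0.1, 0.5]
print(round(mse_loss(y, y_hat), 4))   # 0.01
```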
        </sec>
        <sec id="sec-4-4-2">
          <title>4.4.2. Negative Contrastive Learning</title>
          <p>
            To mine hard negative training pairs, we adopted Approximate nearest neighbor Negative
Contrastive Learning (ANCE) [
            <xref ref-type="bibr" rid="ref28">28</xref>
] on our representation model to generate a set of negative
candidates for a given query q. We denote the candidate set to include one positive candidate c⁺ and K − 1
negative candidates {c⁻_1, c⁻_2, ...}. Then we use InfoNCE [
            <xref ref-type="bibr" rid="ref29">29</xref>
] loss to optimize the negative log
probability of identifying the positive sample c⁺ amongst noise samples.
          </p>
<p>ℒ_NCE = −(1/N) · Σ_{i=1}^{N} log [ exp(s(q, c⁺)) / Σ_{j=1}^{K} exp(s(q, c_j)) ]</p>
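<p>For one query, the InfoNCE term is a softmax cross-entropy in which the positive candidate's score must beat the mined negatives'. A small sketch, using scalar scores (in practice, inner products of query/candidate embeddings) and the log-sum-exp trick for stability:</p>

```python
# InfoNCE for a single query with one positive and K-1 negatives:
# -log( exp(s+) / sum_j exp(s_j) ), the denominator running over all K.
import numpy as np

def info_nce(pos_score: float, neg_scores) -> float:
    scores = np.concatenate(([pos_score], np.asarray(neg_scores, dtype=float)))
    m = scores.max()  # log-sum-exp shift for numerical stability
    return float(-(pos_score - m) + np.log(np.exp(scores - m).sum()))

# The loss shrinks as the positive's score separates from the negatives':
print(info_nce(5.0, [1.0, 0.5]) < info_nce(2.0, [1.0, 0.5]))  # True
```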
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Application: Product Search Ranking</title>
<p>Two-Stage Q2Q Pipeline CE models generally outperform BE models, but the computational
cost increases significantly. For example, given 100 million search queries and 100 million
historical candidate queries, the model would be called 100M × 100M times, taking 100 million
GPU-days; even with an optimal 10x inference speedup and 1,000 GPUs, it would take 10,000 days to
finish. So it is not practical to use a CE model on a very large set of query candidates like the Amazon
search scenario, where the number of queries to be ranked is on the million scale. To design an
end-to-end framework, we employ a two-stage procedure (Figure 3): retrieval and reranking.</p>
        <p>
For the first stage, we select the best representation-based model and run every non-tail query
through it to generate embeddings. We built an ANN graph for fast retrieval. Here we chose
to use PECOS-HNSW [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], a graph-based ANN library for large-scale vector-similarity search
that achieves state-of-the-art performance on ANN benchmark evaluations, to index all these
embeddings. For any given input query embedding, the similarity computation is conducted
using the inner product.
        </p>
<p>In the second stage, after each query retrieves its top k (k = 300 in our experiments) most
similar queries from the previous stage, we deploy the best-performing interaction-based model
to rerank the retrieved candidates and output similarity scores.</p>
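<p>The two stages can be sketched as follows. Brute-force inner-product search stands in for the PECOS-HNSW index, and ce_score is a hypothetical stand-in for the cross-encoder; both names are illustrative, not the production components.</p>

```python
# Stage 1: inner-product retrieval over precomputed BE embeddings.
# Stage 2: rerank the retrieved top-k with a more expensive pairwise scorer.
import numpy as np

def retrieve(query_emb, candidate_embs, k):
    scores = candidate_embs @ query_emb          # inner-product similarity
    return np.argsort(-scores)[:k]               # indices of top-k candidates

def rerank(query, candidates, ce_score):
    return sorted(candidates, key=lambda c: ce_score(query, c), reverse=True)

rng = np.random.default_rng(0)
cand_embs = rng.normal(size=(1000, 8))
q_emb = cand_embs[42] + 0.01 * rng.normal(size=8)  # query near candidate 42
topk = retrieve(q_emb, cand_embs, k=300)
print(42 in topk)  # True: the near-duplicate survives retrieval for reranking
```

Retrieval keeps the per-query cost linear in the candidate pool (sub-linear with an ANN index), so the quadratic CE cost is paid only on k candidates per query.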
<p>Prior Calculation To improve search ranking using the Q2Q model output, we designed a
method to augment tail queries, where behavior signals are sparse, with prior scores. A prior score
measures the initial likelihood of an event before observing any data, a key concept
in Bayesian methods that is commonly combined with observed data to make
predictions. For a given query q, the Q2Q model finds q’s similar queries Q_s = {q_1, ..., q_n}
sharing similar customer actions, then builds the Prior from Q_s’s behavior signals. For query q
and a related product p, we use b(q, p) to represent their history signal score². Then we have
Prior(q, p) defined as:</p>
        <p>Prior(q, p) = (1 / |Q_s|) · Σ_{i ∈ [1, n]} b(q_i, p)   (4)</p>
        <p>To calculate the final feature F(·), we combine the Prior(·) with b(·):</p>
        <p>F(q, p) = w · b(q, p) + (1 − w) · Prior(q, p) · c   (5)</p>
        <p>where w is defined as:</p>
        <p>w = min(I_{q,p}, C) / C   (6)</p>
        <p>I_{q,p} denotes the number of impressions for product p under query q. C is a constant that caps
very large numbers; we chose 10,000 here, and the choice depends on the application’s traffic.
So w is computed from query-product impressions: the more query-product impressions, the
higher w will be. It is used to balance the weight between observed historical signals and prior
scores. We introduce c as a confidence rate used to further adjust the weight of the prior.</p>
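<p>A sketch of the prior and final-feature computation described above. The weight is implemented here as min(I_{q,p}, C)/C, an assumption consistent with the description (it grows with impressions and is capped by C); the signal values, the confidence rate c, and the lambdas are illustrative.</p>

```python
# Prior = mean of the history signal b(q_i, p) over q's similar queries;
# final feature = impression-weighted blend of observed signal and prior.
def prior(b, similar_queries, p):
    return sum(b(qi, p) for qi in similar_queries) / len(similar_queries)

def final_feature(b, I, q, p, similar_queries, c=0.9, C=10_000):
    w = min(I(q, p), C) / C                 # weight from impressions, capped
    pr = prior(b, similar_queries, p)       # prior from similar queries
    return w * b(q, p) + (1 - w) * pr * c   # blend of observed and prior

# Tail query with few impressions: the prior from head queries dominates.
b = lambda q, p: {("head1", "p"): 0.8, ("head2", "p"): 0.6}.get((q, p), 0.0)
I = lambda q, p: 5 if q == "tail" else 50_000
print(final_feature(b, I, "tail", "p", ["head1", "head2"]))
```

For a head query the weight saturates at 1 and the feature reduces to the observed signal, so the prior only fills in where behavior data is sparse.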
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments and Evaluation</title>
      <sec id="sec-5-1">
        <title>5.1. Data Preparation</title>
<p>We utilized anonymized Amazon search logs for our experiments. Compared to click signals,
purchase signals are sparser but of higher quality. To reduce training data noise, we
considered query similarity based on the customer’s purchase signals only, and we aggregated
at the (query, product) tuple level across all search sessions recorded over a one-year period. For
example, if a user searches "harry potter", all the returned products from the
search are listed as separate rows in the search log, and each row contains the elementary metrics
associated with a query-product impression and its subsequent actions (click, add-to-cart,
purchase). The intuition is that if two different queries lead to purchases of the same product,
these queries are likely to represent similar customer intentions.</p>
<p>Figure 4 illustrates an overview of query-pair scoring. We have "full size bed sheet"
and "cookie sheet pan" as input queries with six candidate queries. For "full size bed sheet", three
of them have the same product purchase history. The thicker the line, the more co-purchases
there are between them. The connected queries are used as positive samples and are assigned a
similarity score. Similarly, for "cookie sheet pan", there are three queries with co-purchased
products. Using this method, we build a group of positive samples for each query. For negative
samples, we randomly pair two queries that do not have any co-purchased product history.
² b(q, p) is a weighted combination of clicks, adds, and purchases of the (q, p) pair, normalized by the sum of its
impressions and a query-level constant.</p>
        <p>To improve the recall at the retrieval stage, we adopted noise contrastive learning to mine
hard negative pairs from ANN retrieval results using our initial representation model.</p>
        <p>Data Example</p>
<p>Table 2 presents the top queries that have similar purchase histories to "goya lady
fingers". "Lady finger" is a type of sponge-cake biscuit, and "goya" is the brand name. In the
"candidate query" column, we observe that the top-ranked queries include "lady fingers for
tiramisu prime" and "ladyfinger cookies". It is not straightforward to consider "goya lady finger"
and "tiramisu cookies" similar queries at the lexical level, but rich behavior signals show that the
two queries share similar purchased products, teaching the model to learn this semantic-level
similarity.</p>
        <p>Scalability Challenge</p>
<p>When the search logs contain billions of unique query-product pairs, using the Cartesian product
directly on all the queries would yield over a trillion query pairs, resulting in out-of-memory
issues. To reduce the enormous communication between nodes and fully leverage the parallel
computation power of Spark, we aggregate the query-product pairs at the product level and
then enumerate query pairs from all the corresponding products. The greater the number of
related queries for a product, the longer the enumeration time. Thus, we implemented a
divide-and-conquer strategy for this task: we bucketed the products based on the number of
corresponding queries and then triggered the enumeration process per bucket. For the
very popular products with a high number of related queries, we first sampled the queries and
then triggered the process, to save computation time.</p>
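<p>The aggregation-then-enumeration idea can be sketched without Spark. This single-machine illustration groups (query, product) rows by product, samples queries for overly popular products, and enumerates pairs per product; MAX_QUERIES is an illustrative cap, not the production value.</p>

```python
# Product-level aggregation: enumerate candidate query pairs per product
# instead of taking a Cartesian product over all queries.
import itertools
import random
from collections import defaultdict

MAX_QUERIES = 100  # sample size for very popular products (illustrative)

def enumerate_query_pairs(purchases, seed=0):
    """purchases: iterable of (query, product) rows from the search log."""
    rng = random.Random(seed)
    by_product = defaultdict(set)
    for q, p in purchases:
        by_product[p].add(q)
    pairs = set()
    for p, queries in by_product.items():
        qs = sorted(queries)
        if len(qs) > MAX_QUERIES:          # popular product: sample first
            qs = rng.sample(qs, MAX_QUERIES)
        pairs.update(itertools.combinations(sorted(qs), 2))
    return pairs

rows = [("q1", "p1"), ("q2", "p1"), ("q3", "p1"), ("q1", "p2"), ("q4", "p2")]
print(sorted(enumerate_query_pairs(rows)))
# [('q1', 'q2'), ('q1', 'q3'), ('q1', 'q4'), ('q2', 'q3')]
```

In Spark the same shape becomes a groupByKey on product followed by a per-group flatMap, which keeps the pair enumeration local to each partition.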
        <p>Final Dataset</p>
<p>Table 3 shows the details of the training, validation, and test data. The training data contains
1.67 billion query pairs with 27.9 million unique input queries; the validation data contains
687,300 query pairs and 22,910 unique input queries. We have two test sets: a small
test set used to quickly test model performance and select the best checkpoint, and
a large test set used to evaluate performance over a large pool and generate stable scores
for model comparisons.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Ofline Evaluation Metric</title>
        <p>
          In this study, we use Recall@100, 1000 and NDCG@3 (Normalized Discounted Cumulative
Gain [
          <xref ref-type="bibr" rid="ref31">31</xref>
]) to evaluate the model’s performance. Recall measures the retrieval-stage
performance of the representation-based model. NDCG measures the reranking-stage performance,
and we chose 3 to align with our downstream applications for computing the prior score.
In our task, the rank of the relevant queries is more important than the actual prediction score,
so we chose NDCG, as it accumulates gain from the top of the query list to the bottom, with
the gain of each result discounted at lower ranks. The metric ranges from 0 to 1.
        </p>
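<p>NDCG@k as described above can be computed in a few lines: gains accumulate from the top of the ranked list with a logarithmic discount, normalized by the ideal ordering so the result lies in [0, 1]. The gain values below are illustrative.</p>

```python
# NDCG@k: discounted cumulative gain of the model's ranking, normalized by
# the DCG of the ideal (descending-gain) ranking.
import math

def dcg(gains, k):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains_in_ranked_order, k=3):
    ideal = dcg(sorted(gains_in_ranked_order, reverse=True), k)
    return dcg(gains_in_ranked_order, k) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; swapping the top two results lowers it.
print(ndcg_at_k([3, 2, 1, 0], k=3))        # 1.0
print(ndcg_at_k([2, 3, 1, 0], k=3) < 1.0)  # True
```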
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Ofline Evaluation</title>
<p>We trained models using different combinations of model components, including BE and CE
structures in the encoder layer, and MLP and cosine similarity in the similarity calculation layer.</p>
<p>BE Model Comparisons Table 4 shows the NDCG@3 metric for BE models with cosine
similarity as the interaction. We found that 1) using the Q2Q model trained from scratch as a baseline,
the models using PLMs as the backbone lifted NDCG by 937 bps to 1,562 bps; 2)
using our behavior-driven query pairs to fine-tune the PLMs improved performance by
2.4% to 9.4% (compare bBv1 vs. bBv2 and bSTv1 vs. bSTv2). Notably, the best BE model,
bSTv2, outperformed the vanilla Q2Q model by 27.4%.</p>
<p>CE Model Comparisons Under the CE model structure (Table 5), we first froze the encoder
and tested different interaction layers. We found that combining MLP and cosine similarity, followed
by a 2-to-1 layer, is better than MLP or cosine similarity individually. The best CE model,
cAv2 (using Amazon’s in-house A-PLMv2), outperformed cBv1 (using BERT-base) with an
improvement of 26%. In addition, our evaluation shows that model training using A-PLMv2
is not sensitive to the loss (MSE, BCE) or pooling choice (avg, cls).</p>
        <p>Distillation We further evaluated whether we could improve BE model performance by
distilling the best CE model, cAv2. Instead of using hard labels from the original training data,
we use cAv2 to generate soft labels for student models. Table 6 shows that the BE student
outperforms the benchmark fine-tuned with hard labels by 300 bps (rows 3, 4), but still trails
the CE teacher by 700 bps (rows 1, 4). With one more linear layer to allow more embedding
interaction, the model narrows the gap and trails the CE teacher by only 180 bps (rows 1, 5). As
expected, fine-tuning the public Sentence-BERT on our labels improved performance by 7.5%.
Replacing cosine similarity with MLP brings a further 7.5% gain (rows 4, 5) on the distilled
student model.</p>
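        <p>The distillation objective can be sketched as regressing the BE student's scores onto the CE teacher's soft scores. This is a minimal illustrative sketch assuming an MSE distillation loss (the paper reports using MSE among its loss choices); in actual training this loss would be backpropagated through the student encoder:</p>

```python
import numpy as np

def distillation_mse(student_scores, teacher_scores):
    """Knowledge-distillation loss: the BE student matches the CE teacher's
    soft similarity scores rather than the original hard labels."""
    s = np.asarray(student_scores, dtype=float)
    t = np.asarray(teacher_scores, dtype=float)
    return float(np.mean((s - t) ** 2))

# Hypothetical teacher (CE) scores for a batch of query pairs,
# and the current student (BE) predictions for the same pairs.
teacher = [0.92, 0.10, 0.55]
student = [0.80, 0.25, 0.50]
loss = distillation_mse(student, teacher)
```

        <p>Soft labels carry the teacher's full score distribution over pairs, which is typically more informative than binarized hard labels.</p>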
        <p>Model Recall In Table 7, recall@100 for the initially trained representation model is 0.59.
To improve recall, we first add the label as a weight in training, and recall increases
to 0.64. After we introduce the ANCE technique, with the negative queries coming from the
model’s own retrieval phase, recall@100 reaches 0.78, a 32% improvement over the
baseline.</p>
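        <p>The recall@k metric and the ANCE-style negative mining described above can be sketched as follows. This is an illustrative simplification under stated assumptions (list-based retrieval results; hard negatives taken as top-ranked retrieved queries not labeled similar), not the exact production procedure:</p>

```python
def recall_at_k(retrieved, relevant, k=100):
    """Recall@k: fraction of relevant items found in the top-k retrieved list."""
    topk = set(retrieved[:k])
    return len(topk & set(relevant)) / len(relevant)

def mine_hard_negatives(retrieved, relevant, n):
    """ANCE-style mining: top-ranked retrieved queries that are NOT labeled
    similar become hard negatives for the next training round."""
    relevant_set = set(relevant)
    return [q for q in retrieved if q not in relevant_set][:n]

# Hypothetical retrieval output for one anchor query.
retrieved = ["red dress", "crimson gown", "red shoes", "dress hanger"]
relevant = ["red dress", "crimson gown"]
r = recall_at_k(retrieved, relevant, k=2)           # both relevant in top 2
negatives = mine_hard_negatives(retrieved, relevant, n=2)
```

        <p>Because the negatives come from the model's own top results, they are the confusable cases the model currently gets wrong, which is what makes them effective training signal.</p>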
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Ablation Studies</title>
        <p>To better understand the impact of the backbone models, pretraining, fine-tuning, and distillation,
we conducted a series of evaluations on our models trained with the BE architecture, measured
by NDCG@3. We measured whether Amazon’s in-house PLM provided additional benefits
over its parent, InfoXLM. Specifically, in Table 8 we observed that pretraining with Amazon
query/product datasets brings a 340 bps lift (rows 2, 3) to a 1000 bps lift (rows 11, 12). For Q2Q
tasks, when Sentence-BERT and A-PLMv2 are fine-tuned with hard labels, A-PLMv2 has a 303 bps
lift over Sentence-BERT. On the other hand, fine-tuning on the behavior signal brings an additional
470 bps lift (rows 3, 4). Distillation brings an additional 60 bps lift (rows 4, 5). Surprisingly, we
found that a simple MLP layer on top of the BE layer could bring the model to performance similar
to the best CE model, with a negligible difference of 11 bps (rows 8, 12).</p>
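        <p>Since all ablations are measured by NDCG@3, a short sketch of the metric may help. This is the standard NDCG definition (log2 discount), shown here for reference rather than as the paper's exact evaluation code:</p>

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with the standard log2 position discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(relevances, k=3):
    """NDCG@k: DCG of the top-k ranking, normalized by the ideal ordering."""
    ideal = sorted(relevances, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(relevances[:k]) / idcg if idcg > 0 else 0.0

# A perfectly ordered ranking scores 1.0; misordering lowers the score.
perfect = ndcg_at_k([3, 2, 1], k=3)
swapped = ndcg_at_k([0, 3, 0], k=3)
```

        <p>A "bps lift" in the tables is then a difference of two such NDCG@3 values expressed in basis points (1 bps = 0.0001).</p>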
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Production Experiment Results</title>
        <p>Using the two-stage Q2Q pipeline, we conducted a one-week online A/B test on the Amazon US website
covering 100 million search sessions, at a 5% level of significance. We used the best BE
model, bSTv2, in the retrieval stage and the best CE model, cAv2, in the reranking stage. The
model yielded significant revenue wins and significantly reduced search defects, along with
improvements in other search-related metrics: the number of searches increased by
0.03%, search page clicks increased by 0.08%, the search reformulation rate decreased by 0.03%,
and the average click depth decreased by 0.05%.</p>
        <p>In particular, this experiment shows a stronger improvement across all metrics in clothing
and fashion-related shopping categories than in other categories. The statistics show that these
categories have a lower conversion rate than electronics and kitchen. We conjecture that these
significant wins arise because the prior scores powered by the Q2Q model provide
customers with more related choices to browse and select. With the Q2Q augmentation, we are
able to improve this capacity and thus gain more customer purchases.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this study, we present a query similarity prediction framework that leverages behavior data.
We first mine query pairs from yearly aggregated logs and design training labels that
approximate their similarities. These query pairs can go beyond the semantic level when
fused with domain-specific knowledge. For example, the behavior data can link "cheap" with
"amazon basic" and yield better domain-specific token representations for terms like "amazon", "prime",
"gift card", and brand names. We then explored various model components and compared their
performance on both public and Amazon in-house PLMs. The model fine-tuned on Amazon’s
in-house PLM improved by 27.4% over the BERT baseline. To improve ranking quality in
e-commerce, we designed an end-to-end pipeline that uses the model output to build prior behavior
features. The online experiments conducted in the US showed significant improvements in
search click rates and defect reduction.</p>
      <p>Our work provides a practical solution for leveraging similar queries to improve search ranking
in e-commerce settings. This study emphasizes the value of combining customer behavior
signals, which contain precise and up-to-date knowledge, with the general knowledge provided
by PLMs, and we selectively combined them for different applications with different latency
requirements. While CE models generally exhibited performance superior to BE models, we
found that with distillation techniques and an MLP on top of the best-performing BE
model, we could achieve performance similar to a CE model. This combination not only leads
to better precision in downstream applications but also facilitates online deployment.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgements</title>
      <p>This work builds on top of the work that has been done at Amazon Search. Special thanks to
Wei-Cheng Chang, Jyun-Yu Jiang, Choon-Hui Teo, Saeedeh Salimianrizi, Christopher Fayoux,
Mackie Hembrador, Anurag Shiv, Kang Wang, Ram Kandasamy, Ruirui Li, Haoming Jiang, Yifan
Gao, Qingyu Yin, Cuize Han, Zhengyang Wang, Chen Luo, Xiaojie Wang, Ziqi Zhang and
Hsiang-Fu Yu.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Bi-encoder Q2Q model</title>
    </sec>
    <sec id="sec-9">
      <title>B. Cross-encoder Q2Q model</title>
    </sec>
    <sec id="sec-10">
      <title>C. Similarity layer</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Agichtein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <article-title>Improving web search ranking by incorporating user behavior information</article-title>
          ,
          <source>in: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, ACM</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Covington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Adams</surname>
          </string-name>
          , E. Sargin,
          <article-title>Deep neural networks for youtube recommendations</article-title>
          ,
          <source>in: Proceedings of the 10th ACM conference on recommender systems</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Maji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <article-title>Addressing vocabulary gap in e-commerce search</article-title>
          ,
          <source>in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1073</fpage>
          -
          <lpage>1076</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NIPS)</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2019</year>
          . URL: http://arxiv.org/abs/1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-L.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>InfoXLM: An information-theoretic framework for cross-lingual language model pre-training</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</source>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>3576</fpage>
          -
          <lpage>3588</lpage>
          . URL: https://aclanthology.org/2021.naacl-main.280. doi:10.18653/v1/2021.naacl-main.280.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Y. Wu,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rustamov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Advancing query rewriting in e-commerce via shopping intent learning</article-title>
          ,
          <source>in: Proceedings of the 2022 ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <article-title>Neural search: Learning query and product representations in fashion e-commerce</article-title>
          ,
          <year>2021</year>
          . arXiv:2107.08291.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bayes</surname>
          </string-name>
          ,
          <article-title>An essay towards solving a problem in the doctrine of chances</article-title>
          ,
          <source>Philosophical Transactions of the Royal Society of London</source>
          <volume>53</volume>
          (
          <year>1763</year>
          )
          <fpage>370</fpage>
          -
          <lpage>418</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <source>Introduction to Information Retrieval</source>
          , Cambridge University Press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Papakonstantinou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vassalos</surname>
          </string-name>
          ,
          <article-title>Query rewriting for semistructured data</article-title>
          ,
          <source>ACM SIGMOD Record</source>
          <volume>28</volume>
          (
          <year>1999</year>
          )
          <fpage>455</fpage>
          -
          <lpage>466</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Few-shot generative conversational query rewriting</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1933</fpage>
          -
          <lpage>1936</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Multi-stage conversational passage retrieval: An approach to fusing term importance estimation and neural query rewriting</article-title>
          ,
          <source>ACM Transactions on Information Systems (TOIS) 39</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mandal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. K.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Query rewriting using automatic synonym extraction for e-commerce search</article-title>
          ,
          <source>in: eCOM@ SIGIR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Aleksandrovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Unsupervised synonym extraction for document enhancement in e-commerce search</article-title>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Query rewrite for null and low search results in ecommerce</article-title>
          ,
          <source>in: eCOM@SIGIR</source>
          ,
          <year>2017</year>
          . URL: https://api.semanticscholar.org/CorpusID:59528277.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Sequence to sequence learning with neural networks</article-title>
          ,
          <source>arXiv preprint arXiv:1409.3215</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Language models are unsupervised multitask learners</article-title>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hofstätter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Raman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <article-title>FiD-Light: Efficient and effective retrieval-augmented text generation</article-title>
          ,
          <year>2022</year>
          . arXiv:2209.14290.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Severyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Neural ranking models with weak supervision</article-title>
          ,
          <year>2017</year>
          . arXiv:1704.08803.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          , W.-Y. Ma,
          <article-title>Probabilistic query expansion using query logs</article-title>
          , in: WWW, ACM,
          <year>2002</year>
          , pp.
          <fpage>325</fpage>
          -
          <lpage>332</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>U.</given-names>
            <surname>Ozertem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Chapelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Donmez</surname>
          </string-name>
          ,
          <article-title>Learning to suggest: A machine learning framework for ranking query suggestions</article-title>
          ,
          <source>in: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fujita</surname>
          </string-name>
          , G. Dupret,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <article-title>Semantics of query rewriting patterns in search logs</article-title>
          ,
          <source>in: Proceedings of the fifth workshop on Exploiting semantic annotations in information retrieval (ESAIR '12)</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2012</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>8</lpage>
          . doi:10.1145/2390148.2390153.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hurtado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mendoza</surname>
          </string-name>
          ,
          <article-title>Query recommendation using query logs in search engines</article-title>
          , in: EDBT, Springer,
          <year>2004</year>
          , pp.
          <fpage>588</fpage>
          -
          <lpage>596</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Goutam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Short text pre-training with extended token classification for e-commerce query understanding</article-title>
          ,
          <year>2022</year>
          . arXiv:2210.03915.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Shannon</surname>
          </string-name>
          ,
          <article-title>A mathematical theory of communication</article-title>
          ,
          <source>The Bell System Technical Journal</source>
          <volume>27</volume>
          (
          <year>1948</year>
          )
          <fpage>379</fpage>
          -
          <lpage>423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-F.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Overwijk</surname>
          </string-name>
          ,
          <article-title>Approximate nearest neighbor negative contrastive learning for dense text retrieval</article-title>
          ,
          <year>2020</year>
          . arXiv:2007.00808.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>van den Oord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <article-title>Representation learning with contrastive predictive coding</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1807</year>
          .03748.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>H.-F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dhillon</surname>
          </string-name>
          ,
          <article-title>PECOS: Prediction for enormous and correlated output spaces</article-title>
          , in:
          <source>Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kekäläinen</surname>
          </string-name>
          ,
          <article-title>Cumulated gain-based evaluation of IR techniques</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>20</volume>
          (
          <year>2002</year>
          )
          <fpage>422</fpage>
          -
          <lpage>446</lpage>
          . doi:10.1145/582415.582418.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>