<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Doc2Token: Bridging Vocabulary Gap by Predicting Missing Tokens for E-commerce Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kaihao Li</string-name>
          <email>kaihao.li@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juexin Lin</string-name>
          <email>juexin.lin@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tony Lee</string-name>
          <email>tony.lee@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Walmart Global Technology</institution>,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Addressing the “vocabulary mismatch” issue in information retrieval is a central challenge for e-commerce search engines, because product pages often miss important keywords that customers search for. Doc2Query [1] is a popular document-expansion technique that predicts search queries for a document and includes the predicted queries with the document for retrieval. However, this approach can be inefficient for e-commerce search, because the predicted query tokens are often already present in the document. In this paper, we propose Doc2Token, a technique that predicts relevant tokens (instead of queries) that are missing from the document and includes these tokens in the document for retrieval. For the task of predicting missing tokens, we introduce a new metric, the “novel ROUGE score”. Doc2Token is demonstrated to be superior to Doc2Query in terms of novel ROUGE score and diversity of predictions. Doc2Token also exhibits efficiency gains by reducing both training and inference times. We deployed the feature to production, observed a significant revenue gain in an online A/B test, and launched the feature to full traffic on Walmart.com.</p>
      </abstract>
      <kwd-group>
        <kwd>Document Expansion</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>E-commerce Search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>The vocabulary gap problem in e-commerce search is a central challenge, as it arises from
discrepancies between the vocabulary used by customers and sellers when describing products.
Customer queries are often short and ambiguous, while product descriptions tend to be more
detailed and explicit. For instance, a customer might search for “small building set”, intending to
find a set that offers simpler building experiences for young children. However, in the product
catalog, those products are often characterized by piece count and target age group, which do
not align directly with this search query.</p>
      <p>
        Different approaches have been proposed to address the vocabulary mismatch issue. In the
context of lexical retrieval, query expansion [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6">2, 3, 4, 5, 6</xref>
        ] and document expansion [
        <xref ref-type="bibr" rid="ref1 ref7 ref8">1, 7, 8</xref>
        ]
are two effective techniques. Query expansion enriches user queries with additional terms
or synonyms to better capture the user’s intent, while document expansion enriches product
information with additional keywords or phrases. Doc2Query [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a document expansion
technique that predicts and indexes search queries for documents. Recently, embedding-based
dense retrieval models [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] have demonstrated their ability to align queries and documents
by projecting them into a representation space to learn their semantic similarity. Although
dense retrieval with approximate nearest neighbor search has shown impressive results, lexical
retrieval remains an important component of e-commerce search due to its desirable properties,
such as interpretability, scalability, and its handling of rare words and numerical tokens.
      </p>
      <p>eCom’24: ACM SIGIR Workshop on eCommerce, July 18, 2024, Washington, D.C., USA. © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>In this paper, we propose a new document expansion technique, called Doc2Token, which
generalizes Doc2Query for application in e-commerce search, as depicted in Figure 1. Our task
is to, given a product, generate relevant keywords that are absent from the product’s indexed
metadata to ensure that the product is retrieved when customers search using these keywords.
We call these “novel tokens”. We observed that Doc2Query’s predicted queries often contain
tokens already in the product metadata rather than novel tokens, which makes it inefficient for
our task. In contrast, Doc2Token is designed to predict novel tokens (instead of queries). The
approach is to prepare a dataset of product and novel-token pairs, then train a seq2seq model
on that dataset. By design, Doc2Token efficiently generates tokens with a high probability of
being novel, rather than producing long and redundant sequences as in Doc2Query. Using
the product shown in Figure 1 as an example, Doc2Query predicts “6 year old boy toy”, “3
in 1 creator”, “sea animal toy”, and “building toy for boy”. On the other hand, Doc2Token
provides a more diversified set of tokens, including “small”, “kit”, and “tank”. To incorporate
the Doc2Token predictions into the search system, we added them to the product’s metadata
and indexed them for retrieval matching and ranking.</p>
      <p>
        To assess performance on the novel-token-prediction task, we introduce a new evaluation
metric called “novel ROUGE score”, denoted by “nROUGE”, to measure the ROUGE score [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
specifically for novel tokens. Our results indicate that Doc2Token surpasses Doc2Query in terms
of nROUGE score. Regarding efficiency, Doc2Token is capable of predicting more diverse results
while significantly reducing training and inference times. The effectiveness of this approach
is further demonstrated through online relevance evaluation and A/B testing, confirming that
the generated tokens are not only novel but also relevant to the products, ultimately driving
customer engagement.
      </p>
      <p>Our contributions are summarized as follows:
• We propose a novel technique, Doc2Token, for document expansion in e-commerce search, encompassing both the training setup and the modification of the loss function.
• We introduce a new metric, the “novel ROUGE score”, to evaluate the performance of predicting novel tokens.
• We demonstrate that Doc2Token achieves improvements in both effectiveness and efficiency compared to Doc2Query.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Methodology</title>
      <p>We define the task of novel token prediction in this section. For a product p, let d and V represent
the product text and the set of tokens extracted from d, respectively. The goal is to predict
novel tokens, i.e., tokens that are absent from V. To achieve this, we first collect a list of relevant
queries (Section 3.1) for the product based on historical search logs. Next, for each product, we
assemble a set of unique tokens, disregarding their sequence, through the following process.
We concatenate all queries, divide the concatenated sequence into individual tokens, count
their frequencies, and exclude tokens already present in product d. As a result, for product p,
we have a target token set ⋃<sub>j</sub> {(t<sub>j</sub>, f<sub>j</sub>)}, where t<sub>j</sub> ∉ V represents the jth unique token from the target
queries for product p, and f<sub>j</sub> denotes the frequency of token t<sub>j</sub>. Instead of using all tokens as
one training target, we divide them into separate training instances.</p>
      <p>
        We train a seq2seq generative language model, T5 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], with an encoder-decoder structure. It takes the product text d as
input and outputs a sequence of tokens t̂ in an autoregressive manner. More formally,
      </p>
      <p>t̂ = Decoder(Encoder(d)). (1)</p>
      <p>To account for the token frequency, we modify the T5 loss as follows. The weighted loss for a
product-token pair is
L<sub>weighted</sub>(t<sub>j</sub>, t̂) = (f<sub>j</sub>)<sup>α</sup> · L<sub>T5</sub>(t<sub>j</sub>, t̂), (2)
where α is the smoothing factor, set to 0.5 in our implementation. This value is chosen to balance
the contribution of token frequency to the overall loss. In practice, we use the cross-entropy
loss for L<sub>T5</sub>.</p>
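      <p>As an illustration of the frequency-weighted loss described above, the following sketch (plain Python; the function names and token probabilities are ours, with a simple negative log-likelihood standing in for the T5 cross-entropy) shows how the exponent 0.5 damps the influence of high-frequency tokens:</p>
      <preformat>
```python
import math

def t5_cross_entropy(subtoken_probs):
    # Illustrative stand-in for the seq2seq cross-entropy loss:
    # negative log-likelihood of the target token's subtokens.
    return -sum(math.log(p) for p in subtoken_probs)

def weighted_loss(freq, subtoken_probs, alpha=0.5):
    # L_weighted(t_j, t_hat) = (f_j)^alpha * L_T5(t_j, t_hat)
    return (freq ** alpha) * t5_cross_entropy(subtoken_probs)

probs = [0.8, 0.9]  # hypothetical model probabilities for a token's subtokens
base = t5_cross_entropy(probs)
# A token appearing 9 times in the target queries contributes 3x (9**0.5),
# not 9x, the loss of a token appearing once, damping head-token dominance.
assert abs(weighted_loss(9, probs) - 3 * base) < 1e-9
```
      </preformat>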
      <p>
        For model inference, we employ beam search [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to generate the top N predictions. Since
T5 tokenizes words into subtokens [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the target token t<sub>j</sub> and predicted token t̂ may consist
of multiple subtokens, although they are always single words. We utilize the beam score [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to
determine the confidence of a prediction, and we only retain predictions with scores greater
than a predetermined cutoff value (more details are discussed in Section 3.4).
      </p>
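      <p>The cutoff step amounts to thresholding the per-prediction beam scores; a minimal sketch (the tokens and score values below are made up for illustration, and in practice the scores come from the beam search):</p>
      <preformat>
```python
def filter_by_beam_score(predictions, cutoff):
    # Keep only predictions whose beam score exceeds the tuned cutoff.
    return [token for token, score in predictions if score > cutoff]

# Hypothetical top-N beam-search output as (token, beam score) pairs.
top_n = [("float", 0.62), ("kid", 0.41), ("floaty", 0.35), ("life", 0.21)]
kept = filter_by_beam_score(top_n, cutoff=0.33)  # -> ["float", "kid", "floaty"]
```
      </preformat>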
    </sec>
    <sec id="sec-4">
      <title>3. Experiments</title>
      <sec id="sec-4-1">
        <title>3.1. Datasets</title>
        <p>We sampled product-query pairs from user engagement data on Walmart.com with at least
a certain number of add-to-carts (ATCs) over a two-year period. We then applied the following
preprocessing steps. The product information used throughout the experiments includes product
title, product type, brand, color, gender, and description.</p>
        <sec id="sec-4-1-1">
          <title>Preprocessing</title>
          <p>The preprocessing stages are: none; RF; RF + FMF + PTF (Doc2Query); and RF + FMF + PTF + tokenization + OTF (Doc2Token). The filters are described below.</p>
          <p>
            Relevance filter (RF). The engaged products are not always relevant to the search query, because
users’ decisions are influenced by factors other than relevance, such as price, visual appeal,
ranking, etc. Additionally, the minimal match criterion [
            <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
            ] for our lexical retrieval is not
always 100%. As a result, customers may be shown products that do not fully match their search
terms. For instance, a customer may search for “vanilla ice cream” but end up buying chocolate
ice cream. To mitigate such noise, we removed product-query pairs predicted to be irrelevant
by a relevance model. (The relevance model is a BERT [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] cross-encoder with
a classifier head that takes the query and product information as input and outputs a relevance
score. It was trained on manually annotated relevance data.)
Full match filter (FMF). For the purpose of predicting novel tokens, we focused on product-query
pairs with a vocabulary mismatch, meaning that at least one query token was not found in
the product information. Thus, we removed pairs where all query tokens were in the product
information.
          </p>
          <p>Price token filter (PTF). Customer queries sometimes include price and deal intent (e.g., “under
$500” or “on sale”). Such phrases are not very useful for our task, since prices and deals can
fluctuate rapidly, so it does not make sense to include them as training labels. We utilized
regular expressions to identify and eliminate these phrases from the queries.</p>
          <p>Overlapping token filter (OTF). This step excludes all query tokens that are present in the product
information. (This is a stronger extension of the full match filter.) After this step, only novel
tokens remain.</p>
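          <p>The FMF, PTF, and OTF steps can be sketched as follows (a simplified illustration: whitespace tokenization and a toy price/deal regex stand in for our production tokenizer and patterns, and the relevance filter, which requires a trained model, is omitted):</p>
          <preformat>
```python
import re
from collections import Counter

# Toy stand-in for the production price/deal patterns.
PRICE_DEAL = re.compile(r"\bunder\s*\$?\d+\b|\$\d+|\bon sale\b|\bclearance\b")

def novel_token_targets(product_text, queries):
    doc_tokens = set(product_text.lower().split())
    counts = Counter()
    for query in queries:
        query = PRICE_DEAL.sub(" ", query.lower())   # PTF: drop price/deal intent
        tokens = query.split()
        if all(t in doc_tokens for t in tokens):     # FMF: skip fully matched pairs
            continue
        for t in tokens:
            if t not in doc_tokens:                  # OTF: keep only novel tokens
                counts[t] += 1
    return counts                                    # token -> frequency f_j

product = "toddler swim vest for boys and girls"
queries = ["toddler floaties", "swim vest for kids", "kids floaties under $20"]
targets = novel_token_targets(product, queries)
# "floaties" and "kids" each survive with frequency 2; "swim", "vest", and
# the price phrase "under $20" are filtered out.
```
          </preformat>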
          <p>Detailed data statistics can be found in Table 1. In later sections, we show results for
both Doc2Query and Doc2Token. For the Doc2Query dataset, we applied the first three filters,
resulting in 14.9M product-query pairs. For the Doc2Token dataset, we built upon the Doc2Query
dataset by further dividing the queries into tokens and applying the overlapping token filter,
resulting in 10.3M product-token pairs. For each dataset, we partitioned it by product into
training, validation, and test sets in an 8:1:1 ratio to ensure no product overlap between the sets.</p>
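          <p>Partitioning by product (rather than by pair) is what prevents leakage between the splits; one way to realize it, sketched here with a hash-based assignment of our own choosing (the paper does not specify the mechanism), is:</p>
          <preformat>
```python
import hashlib

def split_by_product(pairs, ratios=(8, 1, 1)):
    # Partition (product_id, token) pairs into train/val/test by hashing the
    # product id, so every pair for a given product lands in the same split.
    total = sum(ratios)
    splits = ([], [], [])
    for product_id, token in pairs:
        h = int(hashlib.md5(product_id.encode()).hexdigest(), 16) % total
        bucket = 0 if h < ratios[0] else (1 if h < ratios[0] + ratios[1] else 2)
        splits[bucket].append((product_id, token))
    return splits

pairs = [("p1", "floaties"), ("p1", "kids"), ("p2", "tank")]
train, val, test = split_by_product(pairs)
# All of p1's pairs end up in the same split, ensuring no product overlap.
```
          </preformat>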
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Metrics</title>
        <p>
          To evaluate model performance, we utilize the standard ROUGE score, a widely used evaluation
metric for summarization tasks [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The ROUGE score assesses the quality of generated text by
comparing it to a reference text, considered as ground truth, using n-gram overlaps. Unlike the
summarization task, which may require complex evaluations based on n-grams, lexical retrieval
primarily relies on unigrams. Therefore, we measure the ROUGE scores in terms of unigrams.
In our context, we formulate the ROUGE score as follows:
        </p>
        <p>ROUGE-precision = (1/N) Σ<sub>i</sub> [Σ<sub>t∈y<sub>i</sub></sub> match(t, ŷ<sub>i</sub>)] / len(ŷ<sub>i</sub>), (3)</p>
        <p>ROUGE-recall = (1/N) Σ<sub>i</sub> [Σ<sub>t∈y<sub>i</sub></sub> match(t, ŷ<sub>i</sub>)] / len(y<sub>i</sub>), (4)</p>
        <p>where y<sub>i</sub> and ŷ<sub>i</sub> represent the reference text and predicted text for product i, respectively; len(·)
denotes the number of tokens; match(t, ŷ<sub>i</sub>) denotes the number of co-occurrences of a reference
token t in the predicted text; and N is the number of products.</p>
        <p>However, a higher ROUGE score does not necessarily indicate better performance at
predicting novel tokens. We observed that text predictions with high ROUGE scores often exhibit
substantial token overlap with the product information. To assess the performance for novel
tokens, we introduce a new metric, “novel ROUGE score” (nROUGE), where the reference text
consists solely of novel tokens. Formally, the nROUGE score is defined as follows.</p>
        <p>nROUGE-precision = (1/N) Σ<sub>i</sub> [Σ<sub>t∈y*<sub>i</sub></sub> match(t, ŷ<sub>i</sub>)] / len(ŷ<sub>i</sub>), (5)</p>
        <p>nROUGE-recall = (1/N) Σ<sub>i</sub> [Σ<sub>t∈y*<sub>i</sub></sub> match(t, ŷ<sub>i</sub>)] / len(y*<sub>i</sub>), (6)</p>
        <p>where y*<sub>i</sub> represents the novel reference text for product i, i.e., the tokens in the reference text
but not in the product information.</p>
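        <p>The unigram ROUGE and nROUGE scores can be computed as in this sketch (set-based unigram matching for simplicity; the function names and example tokens are ours):</p>
        <preformat>
```python
def unigram_prf(reference, predicted):
    # Unigram precision, recall, and F1 between two token lists.
    ref, pred = set(reference), set(predicted)
    hits = len(ref & pred)
    p = hits / len(pred) if pred else 0.0
    r = hits / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def nrouge(reference, predicted, doc_tokens):
    # nROUGE restricts the reference to novel tokens, i.e. reference
    # tokens absent from the product information.
    novel_ref = [t for t in reference if t not in set(doc_tokens)]
    return unigram_prf(novel_ref, predicted)

doc = ["toddler", "swim", "vest"]    # indexed product tokens
ref = ["swim", "floaties", "kids"]   # engaged-query tokens
pred = ["floaties", "kids", "baby"]  # model predictions
# Standard ROUGE rewards predicting "swim", which is already indexed;
# nROUGE only credits the novel tokens "floaties" and "kids".
```
        </preformat>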
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Experiment Setup</title>
        <p>For model training, we fine-tuned the public T5-base model using eight Nvidia A100 GPUs, a
learning rate of 1e-4, a batch size of 64, a maximum input sequence length of 256, and a maximum
output sequence length of 32. For model inference, we employed a top-k beam search strategy with a beam
size of 10. The input to the model is a text of product information consisting of the product title
and attributes such as brand, color, etc.</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Results</title>
        <p>In this section, we compare our proposed Doc2Token model to the baseline Doc2Query model
from both effectiveness and efficiency perspectives. To assess effectiveness, we measure
performance based on the ROUGE and nROUGE scores. In terms of efficiency, we report the resources
used for model training and inference. We evaluated four models: both the Doc2Query
and Doc2Token models, each with and without the full-match filter in the data preprocessing step.</p>
        <p>,

)
,

)</p>
        <p>,

)
(3)
(4)
(5)
(6)</p>
        <sec id="sec-4-4-1">
          <title>3.4.1. Offline evaluation results</title>
          <p>
In Table 2, we present the results for each model based on the top 10 predictions with various
beam score cutoffs. The predictions were chosen if their beam scores exceeded the respective
cutoff value. These cutoff values were tuned, for each model, to achieve the optimal nROUGE
F1 score. Additionally, to assess the models’ efficiency in predicting novel tokens, we concatenated
the predictions and calculated the total number of predicted tokens and the number of predicted
novel tokens. For the Doc2Query models, we present the result without any beam score cutoff,
as well as the result that achieves the highest nROUGE F1 score. For the Doc2Token models,
we report three results: one without any cutoff, one with the optimal nROUGE F1 score, and
one generating a similar number of novel tokens as the optimal Doc2Query model.</p>
          <p>To evaluate the effectiveness of the full-match filter in data preprocessing, we trained the
models without incorporating that step. We observed no apparent impact on the ROUGE
F1 score. However, there was a substantial improvement in the nROUGE F1 score for both
Doc2Query (from 0.438 to 0.481) and Doc2Token (from 0.469 to 0.500) at the optimal cutoff
values. This is expected, as with the full-match filter, our training data primarily relies on labels
containing novel tokens.</p>
          <p>In our comparison between the Doc2Query and Doc2Token models, we observed that the
Doc2Query models tend to achieve higher ROUGE F1 scores than the Doc2Token models.
However, Doc2Token excels in achieving superior nROUGE F1 scores. Comparing the models
with a similar number of predicted novel tokens, for example, the Doc2Query model with a
cutoff of 0.51 and the Doc2Token model with a cutoff of 0.29, the Doc2Token model outperforms the
Doc2Query model in both nROUGE precision and nROUGE recall, yielding a higher nROUGE F1
score. With optimal cutoffs, the Doc2Token model shows superior performance compared
to the Doc2Query model, achieving a 3.95% higher nROUGE F1 score (from 0.481 to 0.500).
This improvement is statistically significant, with a 95% confidence interval for the Doc2Token
F1 score of (0.498, 0.501) obtained through bootstrap resampling. Moreover, the Doc2Token
model is more efficient in generating novel tokens, achieving nearly 100% of predicted tokens
being novel. In contrast, Doc2Query produces only 20% novel tokens, indicating a higher degree
of redundancy. This is expected, as the Doc2Token model is designed to predict more diverse
novel tokens.</p>
        </sec>
        <sec id="sec-4-4-6">
          <title>3.4.2. Model efficiency</title>
          <p>
Table 3 presents the results for training and inference times. The training time is primarily
affected by the size of the training data. Without the full-match filter, splitting queries into
tokens explodes the data size, resulting in a longer training time for Doc2Token compared
to Doc2Query. However, with the full-match filter, the situation changes: the Doc2Token
strategy significantly reduces the dataset size, leading to shorter training times than Doc2Query.
For inference time, we sampled 100,000 products from the test dataset and conducted model
inference on the top 10 results with a batch size of 16 using a single K80 GPU machine. The
inference time for Doc2Token is faster than that for Doc2Query, as the output of Doc2Token is
generally shorter. The results are in agreement with the efficiency discussions from Table 2.</p>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>3.5. Examples</title>
        <sec id="sec-4-5-1">
          <title>Table 4: Example products and predictions</title>
          <p>Product input title: Toddler Floaties, Swim Vest for Boys and Girls Age 2-7 Years Old, 20-50 Pounds Children Water Wings Arm Floaties in Puddle/Sea/Pool/Beach (Dinosaur); brand: Dark Lightning; color: Blue; gender: Unisex</p>
          <p>Doc2Query: “swimming vest for kid”, “toddler boy swim vest”, “swim vest for kid”, “boy floaty”, “kid floaty”</p>
          <p>Doc2Token: “float”, “kid”, “floaty”, “floater”, “salvavida”, “swimmy”, “baby”, “floatation”, “children”, “life”</p>
          <p>Product input title: Hanno Muller-Brachmann - North German Poets - Classical - CD; brand: Artists; color: white</p>
          <p>Doc2Query: “classical cds”, “germany cd”, “country music cd”, “north germany cd”, “west germany cd”</p>
          <p>Doc2Token: “music”, “country”, “5”, “b”, “classical”, “christmas”, “soundtrack”</p>
          <p>
            Table 4 showcases two example products along with their corresponding Doc2Query and
Doc2Token predictions. The Doc2Query model produces the top 5 queries, while the Doc2Token
model generates the top 10 tokens. The novel tokens, after stemming [
            <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
            ],
are shown in bold. In general, the Doc2Query model produces queries containing tokens that are
already present in the product information, whereas all
tokens produced by the Doc2Token model are relevant and absent from the product. In the first
positive example, the Doc2Token model is capable of predicting a Spanish word “salvavida”
(“lifeguard” in English), indicating its ability to handle Spanish queries. Queries in Spanish are
commonly observed in US e-commerce search. The second example, a product from northern
Germany, illustrates some bad predictions from the models. The predicted tokens “country”,
“west”, “christmas” are irrelevant. This is mainly due to a lack of media-related data in our
training set. Replacing T5 with a more knowledgeable LLM could potentially address this issue.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4b">
      <title>4. Implementation and online tests</title>
      <p>
        We implemented the Doc2Token model in production because of its superior performance
compared to the Doc2Query model, as shown in Section 3.4. We ran the model inference on
all products in our catalog using cost-effective K80 GPUs and a batch size of 16. We predicted
the top 10 tokens and retained the predictions with scores above 0.33. The inference process
is conducted offline on a daily basis. For online usage, the Doc2Token predictions serve as
an additional text matching field in Solr [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], an enterprise search platform built on
Apache Lucene [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] that is used for search retrieval at Walmart.com.
      </p>
      <p>We evaluated the performance of the Doc2Token feature from both relevance and engagement
perspectives. For the relevance evaluation, we enlisted human annotators to assess the top
10 ranked products from impacted queries. These assessments were based on a three-point
scale (exact match, substitute, irrelevant), considering factors such as product title, image, and
product page at Walmart.com. We then computed NDCG@10 based on this three-point scale,
showing a 0.49% lift (p-value = 0.066). For engagement assessment, we conducted a two-week
A/B test for the feature on live traffic. The test revealed a statistically significant 0.28% lift
in revenue (p-value = 0.013). While the NDCG@10 improvement is statistically marginal,
the statistically significant revenue increase demonstrates the effectiveness of the Doc2Token
feature. By introducing relevant products in the retrieval process, the Doc2Token feature
enhances the end-to-end search results, helping customers find what they are searching for.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this study, we present Doc2Token, a novel document expansion technique for e-commerce
search engines. We introduce the novel ROUGE score, a new metric crafted to evaluate the
efficacy of document expansion efforts. Our analysis demonstrates that Doc2Token
surpasses Doc2Query in terms of efficiency and effectiveness in addressing the vocabulary
mismatch challenge. The Doc2Token feature has been deployed and evaluated online, resulting
in a significant improvement in both relevance and revenue.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Document expansion by query prediction</article-title>
          ,
          <source>arXiv preprint arXiv:1904.08375</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Rocchio</surname>
          </string-name>
          ,
          <article-title>Relevance feedback in information retrieval</article-title>
          , in: G. Salton (Ed.),
          <article-title>The Smart retrieval system - experiments in automatic document processing</article-title>
          , Englewood Cliffs, NJ: Prentice-Hall,
          <year>1971</year>
          , pp.
          <fpage>313</fpage>
          -
          <lpage>323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>WordNet: a lexical database for English</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>38</volume>
          (
          <year>1995</year>
          )
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <article-title>A comparative study of methods for estimating query language models with pseudo feedback</article-title>
          ,
          <source>in: Proceedings of the 18th ACM conference on Information and knowledge management</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>1895</fpage>
          -
          <lpage>1898</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stenetorp</surname>
          </string-name>
          ,
          <article-title>Query expansion using contextual clue sampling with language models</article-title>
          ,
          <source>arXiv preprint arXiv:2210.07093</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Query2doc: Query expansion with large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2303.07678</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>From doc2query to docTTTTTquery</article-title>
          ,
          <source>Online preprint 6</source>
          (
          <year>2019</year>
          ) 2.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Formal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
          , Splade:
          <article-title>Sparse lexical and expansion model for first stage ranking</article-title>
          ,
          <source>in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2288</fpage>
          -
          <lpage>2292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-F.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Overwijk</surname>
          </string-name>
          ,
          <article-title>Approximate nearest neighbor negative contrastive learning for dense text retrieval</article-title>
          ,
          <source>arXiv preprint arXiv:2007.00808</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          ,
          <source>in: Text Summarization Branches Out</source>
          ,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>5485</fpage>
          -
          <lpage>5551</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>Breaking the beam search curse: A study of (re-) scoring methods and stopping criteria for neural machine translation</article-title>
          ,
          <source>arXiv preprint arXiv:1808.09582</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kudo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Richardson</surname>
          </string-name>
          ,
          <article-title>SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing</article-title>
          ,
          <source>arXiv preprint arXiv:1808.06226</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <source>Apache Solr</source>
          , Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <source>Apache Lucene</source>
          , http://lucene.apache.org,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>CoRR abs/1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>See</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Get to the point: Summarization with pointer-generator networks</article-title>
          ,
          <source>arXiv preprint arXiv:1704.04368</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>