=Paper=
{{Paper
|id=Vol-2311/paper_8
|storemode=property
|title=Query Rewrite for Null and Low Search Results in eCommerce
|pdfUrl=https://ceur-ws.org/Vol-2311/paper_8.pdf
|volume=Vol-2311
|authors=Zehong Tan,Canran Xu,Mengjie Jiang,Hua Yang,Xiaoyuan Wu
|dblpUrl=https://dblp.org/rec/conf/sigir/TanXJYW17
}}
==Query Rewrite for Null and Low Search Results in eCommerce==
Zehong Tan, Canran Xu, Mengjie Jiang, Hua Yang, Xiaoyuan Wu
eBay Search Science
zetan@ebay.com, canxu@ebay.com, menjiang@ebay.com, hyang2@ebay.com, xiaowu@ebay.com
ABSTRACT

In most cases when users encounter null and low search queries, there is plenty of matching inventory on eBay, but it cannot be retrieved by regular query expansion. Therefore, retrieving more relevant items by performing more aggressive query rewrites will largely improve the user experience. This paper proposes a query rewrite system that reformulates the original query into multiple alternative queries that enlarge recall without altering the main search intent. We present a term dropping algorithm that identifies and keeps the important terms in the query, and a term replacement algorithm that learns from user behavior and a language model. Experiments on randomly sampled eBay queries show that the n-best rewrites produced by these two algorithms effectively recover more relevant items. The experiments further indicate that our algorithm predicts the exact user-refined query among its top-5 rewrite candidates for over 50% of the null and low queries.

Keywords

eCommerce, Search Experience, Query Rewrite, Language Model

Copyright © 2017 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.): Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published at http://ceur-ws.org

1. INTRODUCTION

When the number of items retrieved for a query is lower than some threshold (e.g. 10), the search results page makes for a bad customer experience. Recovering more items for these null and low queries will therefore dramatically increase conversion and click-through rates. Causes of this issue range from a misspelled word to an overly specific query, so with a few exceptions this task can be tackled by simple rule-based solutions. In this scenario, dropping or replacing some terms in these queries may recover more items. The current null and low recovery algorithm randomly drops tokens in the user query up to some fixed percentage and produces a collection of subqueries; a relevance model is then used to boost the relevant items in the ranking phase. However, randomly dropping or replacing terms may alter the original search intent, which can result in retrieving irrelevant items. For example, when the null query “led brake light nano made in taiwan” is sent for term dropping under the current algorithm, the top subqueries may preserve unimportant segments like “made in taiwan”. As a result, even though the retrieved items contain enough of the tokens in the user query, their relevance is low and the user experience is bad. To address this issue, we implement separate algorithms that drop or replace terms using techniques originating from natural language processing. These algorithms attempt to recognize the important segments of the query so that dropping or replacing terms does not distort its intent.

The task of query term dropping is assisted by tagging part-of-speech (POS), entities, units of measure (UOM) and phrases. In the eCommerce domain, entities are item properties such as brand name, material, color, shape, gender, etc. On the other hand, for null and low queries that are not verbose, we reformulate the query by replacing some of the tokens. This approach avoids accidentally turning a specific seed query into a broad query. To achieve this, we build a language model from large amounts of user query logs. On a set of randomly selected null and low queries, we focus on evaluating the n-best rewrites produced by the two algorithms.

To summarize, our main contributions are:

• We propose an effective algorithm to drop unimportant terms without distorting the intent of the query.
• We apply a language model to estimate the probability of candidate queries for term replacement.

2. RELATED WORKS

Information retrieval with verbose queries has attracted a lot of interest in recent years. The redundancies that reside in these queries suggest that generating subqueries by dropping unimportant terms from the original query using natural language processing (NLP) techniques [1, 2, 3] would be helpful. To extract patterns for reducing long queries, researchers have tried features such as tf-idf, POS, entities and similarity scores to the original queries. When the retrieved documents are available, more sophisticated work employs a random walk from each token in the original query over a graph of retrieved documents, and the score of each suggestion is ranked by the Hadamard product of each term [4].
Query rewrite systems can be traced back to 2001, when Google started to provide a dynamic spelling checking system. Almost simultaneously, Yahoo started to track common misspellings of frequent queries such as movie stars. Technical details of these systems are not available, but they appear to be based on spelling algorithms and statistical frequencies. Traditional query rewriting systems use rule-based techniques such as finding synonyms of words extracted from WordNet [5]. In recent years, techniques that utilize NLP and statistical models have been widely used for query rewrite tasks. Mays et al. [6] showed that a noisy channel model can be used for spelling correction of both real-word and non-word errors. In these works, either a set of real-word confusion pairs or a lexicon indexed by edit distance is prepared [7]. [8, 9] proposed using a large set of ordered query pairs scored by their semantic similarities. These algorithms are not specifically designed for null and low search queries. On the other hand, [10] added taxa constraints to subqueries and retrieved items from a prebuilt database of historical purchases for null and low queries; distributions of taxa constraints inferred from the historical data are then added to each subquery for secondary retrieval. Furthermore, using query-snippet pairs from user query logs enlarges the scope of query expansion in an improved contextual way [11, 12]. Some works [13] build a query rewrite system without requiring a predefined dictionary. Several other lines of research are relevant to query rewrite, such as using a large human-labeled data set to build a conditional random fields discriminative model [14].

3. QUERY TERM DROPPING

In this section, we describe the proposed approach to query term dropping. The aim of term dropping is to reduce a verbose query to multiple short query candidates that increase recall while keeping the query intent.

[Figure 1: Workflow of query term dropping. (a) Query understanding phase: each token of the query is tagged by part-of-speech or entities. (b) Rewrites generation phase: drop unimportant tokens to generate subqueries. (c) Rewrites ordering phase: each candidate is assigned a score according to its tags and pass (see text), and the candidates are ranked accordingly.]

To understand users' intent, we tag the possible products and aspects in the user query, including brands (such as Nike), product names (such as Macbook and Play Station 2), colors, materials, product models (such as hd-650), UOM, etc. Entities and attributes are tagged by matching against dictionaries mined from eBay's structured catalog. As null and low queries are mostly tail queries, most of their tokens cannot be covered by structured data; the syntactic roles of these tokens are labeled by POS tagging. Moreover, phrases are detected in this phase by mining the user query logs, to make sure the tokens of a phrase are dropped or kept as a whole. When there are multiple possible tagging sequences, we select the sequence that tags the most tokens. This query tagging procedure provides the context for extracting the core concept and its attributives in term dropping.

The selection of terms to drop is set by predetermined rules. Brands, product names and the first noun phrase are considered the Tier 1 group: they are the likely core product of the query. Aspects (colors, materials, etc.), UOM, adjectives and the like are considered the Tier 2 group, as they are attributes of the core product. Stop words, conjunctive words, etc. are considered the Tier 3 group. A term dropping suggestion is then generated from the core product terms in combination with attribute terms. Under a requirement on the minimum number of tokens in the alternative query, the n-best suggestions are generated by a scoring algorithm that considers both the alternative query length and the quality of the tokens.
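To make the tier-based generation concrete, the following is a minimal sketch under our reading of the rules above; it is not the production system, and the dictionary-based `tag` helper and the token threshold are illustrative assumptions (the real tagger also uses POS and phrase detection).

```python
from itertools import combinations

def tag(token, brand_dict, product_dict, aspect_dict):
    """Assign a tier by dictionary lookup (POS tagging and phrases omitted here).
    Tier 1 = core product terms, Tier 2 = attributes, Tier 3 = everything else."""
    if token in brand_dict or token in product_dict:
        return 1
    if token in aspect_dict:
        return 2
    return 3

def generate_subqueries(tokens, tiers, min_tokens=2):
    """Keep all Tier 1 terms, add subsets of Tier 2 terms, drop Tier 3 terms.
    Candidates keeping more attribute terms come first (earlier passes)."""
    core = {t for t, tier in zip(tokens, tiers) if tier == 1}
    attrs = [t for t, tier in zip(tokens, tiers) if tier == 2]
    suggestions = []
    for k in range(len(attrs), -1, -1):
        for kept in combinations(attrs, k):
            candidate = [t for t in tokens if t in core or t in kept]
            if len(candidate) >= min_tokens and candidate != list(tokens):
                suggestions.append(" ".join(candidate))
    return suggestions
```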
3.1 Brand Disambiguation

Because some brand names are derived from ordinary words, recognizing brands such as “white” or “1928” requires an additional disambiguation algorithm. We therefore build a discriminative model P(B|q) that predicts whether the term B is a brand in the context of query q. This is achieved by computing

    P(B|q) = \sum_i P(C_i|q) P(B|C_i),    (1)

where C_i is the i-th shopping category. In Eq. (1), P(C_i|q) is the model that projects the context of the query onto the shopping categories, and P(B|C_i) is extracted from eBay's inventory data. The model that learns P(C_i|q) is evaluated against inferred category demand, and the optimal threshold for classification is determined from user click data. The brand tagging procedure is performed once Eq. (1) has been run over the entire token list of the query.
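The following is a minimal sketch of Eq. (1), assuming the category classifier and the brand-by-category statistics are supplied; `category_model`, `brand_given_category` and the threshold are illustrative names, not the actual eBay components.

```python
def brand_probability(term, query, category_model, brand_given_category):
    """Eq. (1): P(B|q) = sum_i P(C_i|q) * P(B|C_i).

    category_model(query) is assumed to return a dict {category: P(C_i|q)};
    brand_given_category[cat].get(term, 0.0) plays the role of P(B|C_i),
    e.g. estimated from inventory counts per category.
    """
    return sum(p_cat * brand_given_category.get(cat, {}).get(term, 0.0)
               for cat, p_cat in category_model(query).items())

# Usage sketch: tag "watt" as a brand only when the query context suggests
# pottery & crafts, e.g. brand_probability("watt", query, clf, stats) > threshold.
```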
3.2 Scoring Algorithm

Alternative query generation is implemented as a multi-pass process. Earlier passes keep more of the original tokens in the alternative query, resulting in more precise queries with smaller recall; later passes keep fewer tokens to enlarge the recall size. The search engine issues the term dropping queries pass by pass, and stops whenever it has obtained enough effective alternative queries and items to show to users. An offline experiment suggests that for 70% of the null and low queries, only one pass is needed to complete the whole process. Suggestions from an earlier pass always rank higher than those from later passes, as queries retaining more of the original tokens are believed to be closer to the original user intent.

To score the suggestions within one pass, we basically prefer suggestions with more brands or nouns over suggestions with more adjectives. The baseline tagging score is given by a step function over the tags. After tagging, we found that 54% of the remaining tokens are nouns. Since many null and low queries contain misspelled words of low document frequency, we are inspired to rank these nouns by the cross-entropy of their click probabilities conditioned on their document frequencies:

    P_{click}(r) = -P(click|r) \log(1 - P(click|r)),    (2)

where r is the document frequency of the token and P(click|r) is estimated from a large sample of null queries. The click score is added to the tagging score with an initially undetermined coefficient, which is later optimized by maximizing the discounted cumulative gain (DCG).
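A sketch of this combined candidate score is given below. The step-function tag weights and the mixing coefficient `alpha` are illustrative assumptions; the paper only states that the coefficient is tuned by maximizing DCG.

```python
import math

# Hypothetical step-function weights: brands and nouns preferred over adjectives.
TAG_WEIGHTS = {"brand": 3.0, "noun": 2.0, "adjective": 1.0, "other": 0.5}

def click_score(p_click):
    """Eq. (2): P_click(r) = -P(click|r) * log(1 - P(click|r)), with 0 <= p_click < 1."""
    return -p_click * math.log(1.0 - p_click)

def candidate_score(tags, p_clicks, alpha=0.5):
    """Tagging score plus alpha-weighted click score; alpha would be tuned on DCG."""
    tagging = sum(TAG_WEIGHTS.get(t, 0.5) for t in tags)
    clicking = sum(click_score(p) for p in p_clicks)
    return tagging + alpha * clicking
```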
4. QUERY TERM REPLACEMENT

In this section we describe the proposed approach to term replacement. The input is a query q_0 consisting of a sequence of m words (w_1, ..., w_m). Following the formalism in [6], the algorithm generates a set of candidates C(q_0), and the best suggestion is the one with the highest language model probability. To generate the candidates, we start by generating a set of candidate words for each input word w_i, mined from the user behavior logs. The probability of each candidate query q_1 ∈ C(q_0) is then proportional to P(q_0|q_1) P(q_1), where P(q_1) is the probability of the candidate query under the n-gram language model. Term replacement is therefore formulated as

    q_1^* = \arg\max_{q_1 \in C(q_0)} P(q_0|q_1) P(q_1),    (3)

where P(q_0|q_1) and P(q_1) are the noisy channel model and the language model, respectively. Furthermore, in practice, considering that the number of candidates in the noisy channel model of a particular token scales differently from the cardinality of the bigrams, we raise the noisy channel term to a power γ:

    q_1^* = \arg\max_{q_1 \in C(q_0)} P(q_0|q_1)^\gamma P(q_1).    (4)
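A minimal sketch of the candidate search in Eq. (4), assuming single-token replacements as in Section 4.1; `candidates_for`, `channel_prob`, `lm_prob` and the value of `gamma` are illustrative stand-ins for the mined candidate sets and the two models.

```python
def best_replacement(q0_tokens, candidates_for, channel_prob, lm_prob, gamma=0.8):
    """Eq. (4): pick q1 maximizing P(q0|q1)^gamma * P(q1) over single-token edits.

    candidates_for(token) yields replacement words mined from behavior logs;
    channel_prob(w, r) is the noisy channel probability P(w_i | w_i' = r);
    lm_prob(tokens) is the language model probability of the candidate query.
    """
    best, best_score = list(q0_tokens), 0.0
    for i, w in enumerate(q0_tokens):
        for r in candidates_for(w):
            q1 = list(q0_tokens[:i]) + [r] + list(q0_tokens[i + 1:])
            score = channel_prob(w, r) ** gamma * lm_prob(q1)
            if score > best_score:
                best, best_score = q1, score
    return best
```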
4.1 Noisy Channel Model

The noisy channel model P(q_0|q_1) describes the emission probability of the original query q_0 given its replacement q_1. We collect query pairs arising from term replacement within the same session from the user query logs. For each original query, we assume that only one token w_i is replaced by w_i', so that P(q_0|q_1) = P(w_i|w_i').

Note that for each observed token w_i, the probability P(w_i|w_i' = w_i) of the token remaining unchanged should be counted as well. Unlike Mays et al. [6], where the probability of w_i not being replaced is assigned a tuning parameter, here it is estimated by the effective non-replacement count:

    P(w_i \mid w_i' = w_i) = \mathrm{Count}(w_i \to w_i) / \mathrm{Count}(w_i \to w_i'),    (5)

where \mathrm{Count}(w_i \to w_i) = (\mathrm{Count}(w_i) - \mathrm{Count}(w_i \to w_i' \neq w_i)) / 2.
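The sketch below shows one way to turn session query-pair counts into these channel probabilities; the data structures and the normalization are our reading of Eq. (5), not code from the paper.

```python
from collections import Counter, defaultdict

def estimate_channel(replacement_pairs, token_counts):
    """Estimate P(w_i | w_i') from observed single-token replacements.

    replacement_pairs: iterable of (original_token, replacement_token) pairs
    mined from same-session query rewrites; token_counts: Counter of token
    occurrences in the query log. The non-replacement mass uses the effective
    count of Eq. (5): Count(w -> w) = (Count(w) - Count(w -> w' != w)) / 2.
    """
    pair_counts = Counter(replacement_pairs)
    replaced_total = defaultdict(float)
    channel = defaultdict(dict)
    for (w, r), c in pair_counts.items():
        replaced_total[w] += c
        channel[w][r] = float(c)
    for w in channel:
        channel[w][w] = max((token_counts[w] - replaced_total[w]) / 2.0, 0.0)
        norm = sum(channel[w].values())
        for r in channel[w]:
            channel[w][r] /= norm
    return channel
```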
4.2 Language Model

Our approach utilizes an n-gram language model that assigns the following probability to a string of words:

    P(w_1 \ldots w_m) = \prod_{i=1}^{m} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}^{i-1}).    (6)

As shown in [15], language models built solely from queries, instead of a mixture of queries and other text, are the most predictive. Unlike grammatically strict natural language in long text, user queries have even sparser features, so we only incorporate bigram features in the language model. The sparsity of the bigram features is remedied by the Kneser-Ney smoothing technique [16].

In eCommerce web search, the terms in a query do not necessarily follow a canonical order: if a user types “case iphone”, the intent is not far from the query “iphone case”. For this reason, we also incorporate backward bigrams in the model without increasing the number of parameters. Since the backward-bigram probability is not commensurate with the normal bigrams, these probabilities are weighted by a free parameter λ:

    P(w_1 \ldots w_n) \approx \prod_{i=1}^{n-1} P(w_{i+1} \mid w_i) \prod_{i=2}^{n} P(w_{i-1} \mid w_i)^\lambda.    (7)

Finally, instead of simply choosing the candidate term replacement with the highest probability, we only accept candidates whose probability estimated by the language model is higher than that of the original query. This remedies the over-replacement issue, i.e., it leaves alone null and low queries that are not suitable for term replacement.
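A sketch of the forward-plus-backward bigram score of Eq. (7); the bigram lookups (which would be Kneser-Ney smoothed in the system described above) are passed in as functions, and the computation runs in log space for numerical stability.

```python
import math

def query_log_prob(tokens, fwd_bigram, bwd_bigram, lam=0.5):
    """Eq. (7) in log space: sum of log P(w_{i+1}|w_i) plus lam * sum of log P(w_{i-1}|w_i).

    fwd_bigram(prev, word) returns the smoothed forward probability P(word|prev);
    bwd_bigram(nxt, word) returns the backward probability P(word|nxt);
    lam is the free weight on the backward bigrams.
    """
    forward = sum(math.log(fwd_bigram(tokens[i], tokens[i + 1]))
                  for i in range(len(tokens) - 1))
    backward = sum(math.log(bwd_bigram(tokens[i], tokens[i - 1]))
                   for i in range(1, len(tokens)))
    return forward + lam * backward
```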
5. EXPERIMENT

We test how well our alternative query service performs on the eBay search engine. The evaluation scenario measures the ability of the algorithms to recover null and low search result pages and retrieve relevant items.

5.1 Data Preparation

We sampled two query sets of null and low queries from the search log, for term dropping and term replacement separately. The training set for term dropping consists of one week of users' search and click data. The training data for query term replacement consists of query pairs taken from 11 weeks of query logs. The sampled test set contains 3,000 null and low queries from the week following the training data.

5.1.1 Query Term Dropping

In the term dropping experiment, we leverage eBay's structured data dictionary for product and attribute recognition in the query tagging phase. For POS tagging of search queries we employ the Stanford bidirectional model. However, since in web search queries the prior probability of verbs is much lower than in ordinary news articles, we perform an offline correction on top of the Stanford POS tagging. With unigrams and bigrams mined from 9 days of search queries, the tagging for phrases is obtained from the normalized mutual information:

    MI(w_i, w_{i+1}) = P(w_i, w_{i+1})^2 / (P(w_i) P(w_{i+1})).    (8)
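A sketch of phrase detection with Eq. (8), where the probabilities are estimated from the mined unigram and bigram counts; the normalization and the decision threshold are illustrative assumptions.

```python
def normalized_mi(bigram_count, left_count, right_count, total):
    """Eq. (8): MI(w_i, w_{i+1}) = P(w_i, w_{i+1})^2 / (P(w_i) * P(w_{i+1}))."""
    p_bigram = bigram_count / total
    return p_bigram ** 2 / ((left_count / total) * (right_count / total))

def is_phrase(w1, w2, unigrams, bigrams, total, threshold=1e-4):
    """Treat (w1, w2), e.g. ('sport', 'bra'), as unbreakable above the threshold."""
    if (w1, w2) not in bigrams:
        return False
    return normalized_mi(bigrams[(w1, w2)], unigrams[w1], unigrams[w2], total) >= threshold
```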
5.1.2 Query Term Replacement

We remove all non-alphanumeric characters from the query and spelling-correction pairs. After filtering, there are 380 million query pairs. Statistics of the training data are collected in Table 1.

Table 1: Statistics of unique n-grams in the language model.

              1-gram       2-grams      Noisy Channel
Parameters    2.8 million  29 million   18 million
Count         300 million  300 million  126 million

5.2 Results on Search Metrics

Our algorithm is fine-tuned to balance the capability of retrieving more items against their relevance. In this section, we present the results of our experiments on search metrics.

5.2.1 Recovery of user behaviors

Users' intent is reflected by their own query refinements when they are not satisfied with the search results returned for the original query. We first evaluate the probability that one of the n-best query rewrite candidates exactly matches the user's refinement. The results are summarized in Table 2.

Table 2: Exact match between users' refinements and n-best query rewrite candidates.

              Top-1   Top-5
Dropping      23.3%   54.6%
Replacement   31.9%   66.4%

5.2.2 Recall Enhancement

In Table 3, we show the number of retrieved items, R, for the original query and the best rewritten queries. For comparison, we also show the numbers for users' own rewritten queries. In general, we are recovering the null and low search result pages.

Table 3: Median number of retrieved items, R, for the original query (Or), the user's rewritten query (Re) and the 3-best query rewrite candidates.

                       Or   Re   1st   2nd   3rd
Dropping (R = 0)       0    8    8     5     12
Dropping (R ≤ 25)      2    42   75    45    75
Replacement (R = 0)    0    41   3     365   0
Replacement (R ≤ 25)   4    70   44    362   7
5.2.3 Category Overlap

The distribution of queries over shopping categories is a key indicator of high-level relevance for query rewrite systems in eCommerce [10, 17]. A change in the inferred level-2 categories is a sufficiently strong signal that the alternative query distorts the user's search intent. The category overlap is computed as the Jaccard similarity between the categories C_A^{50} of up to the top 50 items retrieved by the alternative query and the dominant category C_O of the original query:

    J(A, O) = |C_A^{50} \cap C_O| / |C_A^{50} \cup C_O|.    (9)
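The overlap computation amounts to a set Jaccard similarity; a minimal sketch follows, assuming a retrieval helper has already produced the level-2 category labels of the top items (function and argument names are illustrative).

```python
def category_overlap(alt_item_categories, original_dominant_categories):
    """Eq. (9): Jaccard similarity between the categories of the alternative
    query's top-50 retrieved items and the dominant categories of the original query.
    """
    cat_a = set(alt_item_categories[:50])
    cat_o = set(original_dominant_categories)
    if not (cat_a | cat_o):
        return 0.0
    return len(cat_a & cat_o) / len(cat_a | cat_o)
```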
In Table 4, we show the category overlap of the top 3 suggestions by term dropping and term replacement. Overall, 71.5% (68.6%) of the items retrieved by the top-1 term dropping (replacement) suggestion share the category of the original query. Short queries with only 2-3 tokens are more troublesome for term dropping, with an overlap rate of 60.4%, as dropping even one term may change the user's intent considerably. For example, in the query “boden shoes 30”, dropping the token “30” (size 30) alters the category from “kid's clothing and shoes” to “women's shoes”. Such issues can be relieved by term replacement: our result achieves an overall 74.3% category overlap, compared to the 74% reported in [10].¹

Table 4: Category overlap between the original query and the n-best query rewrite candidates.

                   1st     2nd     3rd     overall
Term dropping      71.5%   65.5%   63.3%   66.8%
Term replacement   68.6%   93.1%   83.9%   81.9%

¹ To our knowledge, the most relevant work to ours is Ref. [10], in which the authors reported an algorithm that generates search subqueries with additional taxa constraints in the null search page scenario. However, this approach is not feasible to reproduce, because the expense of inferring exact taxa constraints for each query is maintaining a dynamic database as large as the active inventory. For this reason, the baseline is compared against the number reported in the original literature, instead of reproducing its result on the same query set.

5.3 Qualitative Results

One important feature of our algorithms that leads to better search experiences is the explicit mechanism for rewriting search queries based on context. In Table 5, we show several examples of top query rewrites generated by the algorithms.

Table 5: Examples of top query rewrites from the algorithms.

              Or                                            1st
Dropping      tommy hilfiger prep sport bra nwt free shipp  tommy hilfiger sport bra nwt
              10w high quality speaker 6ohm                 10w high quality speaker
              watt spagehetti platter                       watt platter
Replacement   iphone watch                                  apple watch
              amber ball insect                             amber sphere insect

As for term dropping, in the first query “tommy hilfiger” is detected as a brand and “sport bra” as an unbreakable noun phrase; our term dropping rewrites the query by dropping “prep”, “free” and the misspelled word “shipp”. In the query “10w high quality speaker 6ohm”, our algorithm recognizes that it contains two UOM constraints, “10w” and “6ohm”, and relaxes the query by dropping one of them. As mentioned in Section 3.1, brand names are disambiguated by considering the shopping category of the query: though the word “watt” refers to a unit most of the time, it is a high-confidence brand when it occurs in queries in the pottery & crafts category. In the query “watt spagehetti platter”, our algorithm therefore recognizes “watt” as a china brand and keeps it in the query rewrite. For term replacement, the query “iphone watch” is rewritten as “apple watch”, instead of “iphone clock”, because the bigram probability P_b(watch|iphone) is much greater than the emission probability P_e(clock|watch). On the other hand, “amber ball insect” is replaced by “amber sphere insect” because the Kneser-Ney smoothing correctly compensates the probability of “amber sphere”, instead of recognizing “amber” as a color.

6. CONCLUSION AND OUTLOOK

In this work, we have implemented algorithms that aim to cure null and low search results. In particular, we showed that tagging techniques developed in NLP can be used to generate more context-relevant subqueries, and that statistical language models additionally provide a promising avenue for progress on the path to building full natural language understanding systems. Our numerical experiments indicate that users' shopping experience can be improved, with a higher level of engagement.

The improvements achieved by our algorithms are not unique to vanilla null and low search pages, and are readily applicable to broader search engine tasks, such as extracting structured data from item titles, general query reformulation and anomalous query synopsis. Similar models could lead to significant improvements in shopping experiences, especially on text data with less grammatical strictness.

7. REFERENCES

[1] Kumaran, G., and Carvalho, V. R. Reducing long queries using query quality predictors. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (2009), ACM, pp. 564–571.
[2] Chen, Y., and Zhang, Y.-Q. A query substitution-search result refinement approach for long query web searches. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01 (Washington, DC, USA, 2009), WI-IAT '09, IEEE Computer Society, pp. 245–251.
[3] Duan, H., and Hsu, B.-J. P. Online spelling correction for query completion. In Proceedings of the 20th International Conference on World Wide Web (New York, NY, USA, 2011), WWW '11, ACM, pp. 117–126.
[4] Maxwell, K. T., and Croft, W. B. Compact query term selection using topically related text. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (2013).
[5] Zhang, J., Deng, B., and Li, X. Concept based query expansion using WordNet. In AST '09: Proceedings of the 2009 International e-Conference on Advanced Science and Technology (2009), pp. 52–55.
[6] Mays, E., Damerau, F. J., and Mercer, R. L. Context based spelling correction. Information Processing and Management: An International Journal (1991), vol. 27, pp. 517–522.
[7] Ahmad, F., and Kondrak, G. Learning a spelling error model from search query logs. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (Stroudsburg, PA, USA, 2005), HLT '05, Association for Computational Linguistics, pp. 955–962.
[8] Jones, R., Rey, B., Madani, O., and Greiner, W. Generating query substitutions. In Proceedings of the 15th International Conference on World Wide Web (New York, NY, USA, 2006), WWW '06, ACM, pp. 387–396.
[9] Hasan, M. A., Parikh, N., Singh, G., and Sundaresan, N. Query suggestion for e-commerce sites. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (New York, NY, USA, 2011), WSDM '11, ACM, pp. 765–774.
[10] Singh, G., Parikh, N., and Sundaresan, N. Rewriting null e-commerce queries to recommend products. In Proceedings of the 21st International Conference on World Wide Web (New York, NY, USA, 2012), WWW '12 Companion, ACM, pp. 73–82.
[11] Riezler, S., and Liu, Y. Query rewriting using monolingual statistical machine translation. Computational Linguistics 36, 3 (Sep. 2010), 569–582.
[12] Chen, Q., Li, M., and Zhou, M. Improving query spelling correction using web search results. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (2007), pp. 181–189.
[13] Gao, J., Li, X., Micol, D., Quirk, C., and Sun, X. A large scale ranker-based system for search query spelling correction. In Proceedings of the 23rd International Conference on Computational Linguistics (2010), Association for Computational Linguistics, pp. 358–366.
[14] Guo, J., Xu, G., Li, H., and Cheng, X. A unified and discriminative model for query refinement. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2008), pp. 379–386.
[15] Huang, J., Gao, J., Miao, J., Li, X., Wang, K., Behr, F., and Giles, C. L. Exploring web scale language models for search query processing. In Proceedings of the 19th International Conference on World Wide Web (2010), pp. 451–460.
[16] Kneser, R., and Ney, H. Improved backing-off for m-gram language modeling. In 1995 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95) (1995), vol. 1, IEEE, pp. 181–184.
[17] Hasan, S., Heger, C., and Mansour, S. Spelling correction of user search queries through statistical machine translation. In EMNLP (2015), L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and Y. Marton, Eds., The Association for Computational Linguistics, pp. 451–460.