Query Rewriting using Automatic Synonym Extraction for E-commerce Search

Aritra Mandal*, eBay Inc, San Jose, CA, United States, arimandal@ebay.com
Ishita Khan*, eBay Inc, San Jose, CA, United States, ishikhan@ebay.com
Prathyusha Senthil Kumar, eBay Inc, San Jose, CA, United States, prathykumar@ebay.com

ABSTRACT

Query rewriting is a critical component in modern search engines. It is the process of altering and enhancing raw user queries using synonymous keywords or structured metadata to improve search recall and relevance, using data mining techniques applied on textual data and user behavioral signals. For example, the query bicycle is rewritten to match (bicycle OR bike) - i.e. all items that contain either the word bicycle or bike in their title are returned for this query. Choosing the right set of synonymous terms for a given query term is essential to ensure the quality of search results, especially in the context of e-commerce where buyer needs can be very specific. As an example, shoe is a good synonym for shoes, whereas sandals, while related, is not a good synonym for shoes. In this work, we describe one version of the approaches to query rewriting taken at eBay search. At a high level, we use a two-step process to generate and apply synonyms for query expansions: 1. offline token-level synonym generation and 2. runtime search query rewriting. In the offline phase, we first generate a large pool of candidate synonyms for query tokens using various techniques leveraging user behavioral data, inventory and taxonomy information, and open source knowledge bases; then, we leverage a machine learned binary classifier trained on human judged binary relevance labels to filter the candidate synonyms that are truly useful as query expansions without compromising result set precision. This classifier allows us to leverage a wide variety of sources and techniques to generate synonym candidates, by providing a scientific and scalable method to evaluate their effectiveness for query rewriting. The filtered set of token-level synonyms is stored in a dictionary for runtime query rewriting. In the online phase, we rewrite user search queries by combining the token-level synonyms in the dictionary into a boolean recall expression. We empirically demonstrate the value of this approach in enhancing e-commerce search recall and relevance.

KEYWORDS

query rewriting, automatic query expansion, synonym generation, machine learning, big data, classification, web search, e-commerce

ACM Reference format:
Aritra Mandal*, Ishita Khan*, and Prathyusha Senthil Kumar. 2019. Query Rewriting using Automatic Synonym Extraction for E-commerce Search. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 7 pages.

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

1 INTRODUCTION

One of the aims of query rewriting in information retrieval is to bridge the vocabulary gap between queries and documents. Other direct applications of rewriting a user query exist as well, such as addressing misspellings or normalizing different ways of describing the same entity regardless of level of formality. In the context of e-commerce search, user queries are much shorter and often more colloquial than product listing titles, which are more verbose and contain formal terms. As an example, buyers may search for noise absorbing blankets when they are looking for soundproof blankets. In order to find all relevant inventory that matches the buyer's search intent, query terms are expanded or rewritten to synonymous terms, which are also used to retrieve documents. In this example, we may rewrite the user's original query to acoustic blankets, soundproof blankets or soundproof blanket, etc.

Several techniques have been proposed in the literature to automatically identify query expansions, including those based on a. pseudo-relevance feedback [22], where expansion terms are selected from the top results retrieved in a first round of retrieval; b. leveraging user search logs [3], [11], [10], [1], [17], which capitalize on the large volumes of readily available user behavioral logs to extract query substitutions using a variety of innovative techniques; and c. ontologies or thesauri [13], [16], [8] and [12], which employ either externally available knowledge bases or one built from the specific corpora to expand search queries. Another interesting approach, proposed by Gao et al. [6][9][7], uses a ranking mechanism to score rewrite lexicons specific to a query and boosts rewrites that are more relevant to the query.

The paper by Trotman et al. [20] gives an overall view of the search architecture at eBay. In our work we focus only on the query rewrite part of search at eBay. Our approach combines the effectiveness of several of the above mentioned techniques to extract synonymous words and phrases for query terms. In contrast to these techniques, we use a boolean recall basis to determine the items returned for a query, which are then ranked based on the user's specific ranking criteria (relevance, price, time on site, etc.). The automatically generated query term synonyms are used to augment the query in the boolean recall expression. As an example, the query mens bikes is rewritten to ((men OR mens) AND (bike OR bikes OR bicycle OR bicycles)), where the query term mens is expanded to men and the term bikes is expanded to bike, bicycle and bicycles. These synonyms are mined using a number of techniques described in Section 3.1, relying on the presence of these synonyms in behavioral logs and the item index. These techniques result in a large pool of synonym candidates with varying quality. A novel second step involving a machine learned classifier is used to differentiate the truly valuable synonyms from semantically related but not synonymous terms. As illustrated earlier, shoe and shoes are truly synonymous, whereas shoe and sandals are related but not synonymous. This classifier is trained using a variety of behavioral and lexical features on human judged relevance labels.

The rest of the paper is organized as follows: in Section 2 we provide an overview of our end-to-end process for e-commerce query rewriting; in Section 3 we describe the offline synonym dictionary generation process in detail, followed by evaluation and results for the synonym classifier and its effectiveness for query rewriting in Section 4; we conclude with future work in Section 5.
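To make the boolean recall expression concrete, the following is a minimal sketch that builds the mens bikes rewrite from a token-level synonym dictionary. The dictionary entries and the rewrite function are illustrative stand-ins, not our production rewrite service.

```python
# Hedged sketch: combine token-level synonyms into a boolean recall
# expression. The synonym entries below are illustrative examples only.
synonyms = {
    "mens": ["men"],
    "bikes": ["bike", "bicycle", "bicycles"],
}

def rewrite(query: str) -> str:
    """OR together each token's synonyms, AND across tokens."""
    clauses = []
    for token in query.split():
        variants = [token] + synonyms.get(token, [])
        clauses.append("(" + " OR ".join(variants) + ")")
    return "(" + " AND ".join(clauses) + ")"

print(rewrite("mens bikes"))
# ((mens OR men) AND (bikes OR bike OR bicycle OR bicycles))
```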
      Figure 1: Schematic diagram showing the generation and evaluation framework of synonyms used in query rewrites


2 END-TO-END ARCHITECTURE FOR QUERY REWRITING

In this section we provide an overview of the synonym candidate generation and evaluation pipeline and the runtime use of keyword synonyms for query rewriting. Subsections 2.1 and 2.2 outline the offline and online components in our search query rewriting system, respectively.

2.1 Offline Synonym Candidate Generation and Selection Pipeline

The overall pipeline of the synonym candidate generation and evaluation framework for query rewriting is shown in Figure 1. We mine user behavioral data to collect search queries and item titles from search result pages (SRPs). From there, we use a semantic learning function that models three types of candidates: a. query-query transitions generated from pairs of queries within a single user session; b. query-item pairs of queries and clicked item titles in the same SRP; and c. item-item pairs of items shown in the same SRP. For each of these types of candidates, the semantic learning function essentially considers pairs with a certain minimum number of occurrences across search sessions, and then generates all possible token n-grams (n up to a fixed limit) from these pairs as potential synonym candidates. Next, we pass this exponentially large set of candidate pairs to a set of filtering components, one for each synonym generation technique (e.g. stemming, acronyms), each of which identifies whether the candidate pair of n-grams qualifies as a synonym pair according to that technique. The following are some of the filtering components / synonym generation techniques that we will use in the scope of this paper: stemming equivalents, compounding equivalents, phrasing, open-source dictionary based synonyms, and acronym-full form equivalents. (We are unable to disclose the full set of filtering components.)

As mentioned earlier, we developed a machine learned classification model to evaluate the above generated synonym candidates and filter out the synonym pairs that are truly valuable for query expansions. This classifier is optimized for a human judged binary relevance target (label=1 for candidates that are true synonyms, label=0 for candidates that are irrelevant or related but not synonymous). A number of behavioral and lexical features are extracted for the list of candidate synonyms being considered. In particular, for the behavioral features, we use clicks and other associated engagement signals from historic user sessions where the query token (trigger) and the corresponding synonym candidate occurred. Additionally, we derive ratio-based features from these count-based features. We construct lexical features based on semantics, such as the presence of numeric characters and the lengths of the trigger and synonym. After the classifier has predicted a score for each candidate, we perform a score thresholding step to filter out the final set of synonyms. The thresholding is based on analyzing the score distribution on the training dataset in the positive and negative classes and selecting a cutoff that balances classifier performance against false negative rate. The final filtered query n-gram - synonym pairs are stored in a dictionary and used in runtime query rewriting to enhance the recall set for queries.
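The candidate generation step can be summarized with the short sketch below: it enumerates n-gram pairs from the mined transitions and keeps pairs above a minimum occurrence count. The function names, the max_n of 2, and the threshold of 5 are illustrative assumptions; the paper does not disclose the actual limits.

```python
from collections import Counter
from itertools import product

def ngrams(text: str, max_n: int = 2) -> list[str]:
    """All token n-grams of the text, for n = 1..max_n."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def candidate_pairs(transitions, max_n=2, min_count=5):
    """transitions: (text_a, text_b) pairs from query-query, query-item,
    or item-item co-occurrences mined from SRP logs."""
    counts = Counter()
    for a, b in transitions:
        for pair in product(ngrams(a, max_n), ngrams(b, max_n)):
            counts[pair] += 1
    return {pair for pair, c in counts.items() if c >= min_count}
```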
2.2 Runtime Query Rewriting using N-gram Based Synonyms

As mentioned above, once the final filtered synonym rewrites are generated for the n-gram query tokens, we store them in an offline rewrite dictionary for runtime query rewriting. At run-time, we have a two-phase approach to constructing boolean query rewrite expressions using the n-gram rewrite dictionaries: query segmentation and query rewriting. In the query segmentation phase, the raw user query is split into non-overlapping n-gram segments to maximize coverage in the rewrite dictionary. The segmentation heuristic used in this particular application is to take the longest matching segment from the left among the query tokens, and then split the rest of the query tokens recursively in the same manner. In the query rewriting phase, we look up the synonyms for the n-gram segments in the synonym dictionary, and then combine them into a boolean expression. The different synonyms for a given segment are combined with an OR clause, while the different segments for a given query are combined with an AND clause. For example, for the query ps 4 games, the first phase will segment the query as ps 4 and games, and the second phase might create a rewrite expression like ((ps 4 OR playstation 4) AND (games OR game)), based on the following n-gram synonym rewrites: ps 4 -> playstation 4 and games -> game.
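A minimal sketch of the segmentation heuristic is shown below, assuming a dictionary keyed by n-gram segments; the recursion and the maximum segment length of 3 are illustrative choices, as the actual limit is not disclosed. The resulting segments are then expanded with OR/AND clauses as in the earlier sketch.

```python
def segment(tokens: list[str], dictionary: dict, max_n: int = 3) -> list[str]:
    """Take the longest dictionary-matching segment from the left,
    then segment the remaining tokens recursively."""
    if not tokens:
        return []
    for n in range(min(max_n, len(tokens)), 0, -1):
        candidate = " ".join(tokens[:n])
        if n == 1 or candidate in dictionary:
            return [candidate] + segment(tokens[n:], dictionary, max_n)

rewrites = {"ps 4": ["playstation 4"], "games": ["game"]}
print(segment("ps 4 games".split(), rewrites))  # ['ps 4', 'games']
```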
3 SYNONYM DICTIONARY GENERATION FOR QUERY REWRITING

In this section, we deep dive into the steps involved in the offline synonym rewrite dictionary generation as outlined in Figure 1.

3.1 Mining Synonym Candidates

For generating individual synonym candidates, we process user behavioral data to collect query and item titles from SRPs. We use this data in a semantic learning function that models three types of transitions to generate potential synonym candidates. For each of these transitions, we create several pairs of n-grams (length up to a fixed n) as potential candidates, as described below.

• Query - Query Transitions: generated from pairs of queries within the same user session. For example, consider a search session with the following queries: womens dresses and ladies gown. Using the pairs of queries within this set, we generate n-gram pairs as possible synonym candidates, where each member of the pair comes from a different query. With n=2 (bigrams), we generate the following candidates: womens - ladies, womens - gown, dresses - ladies, dresses - gown, womens dresses - ladies, womens dresses - gown, womens dresses - ladies gown, ladies gown - womens and ladies gown - dresses.
• Query - Item Transitions: pairs of queries and clicked item titles in the same SRP. Consider the query womens dresses - let's say we observe the following items being clicked in an SRP for this query: womens dress/gown size 4 and summer dress for women. Similar to the query-query transition case, all pairs of n-grams between each pair of query and clicked item are considered as potential candidates.
• Item - Item Transitions: pairs of items shown in the same SRP (i.e., for the same query). Using our example from query-item transitions, for pairs of items in the same SRP - womens dress/gown size 4 and summer dress for women - we generate n-gram candidate pairs, one from each item.

For each of these types of candidates, we only consider pairs with a certain minimum number of occurrences across search sessions. Once generated, we pass this exponentially large set of candidate pairs to a set of filtering components, one for each synonym generation technique, each of which identifies whether the candidate pair of n-grams qualifies as a synonym pair according to that technique. The following are some of the filtering components we currently have in our pipeline (a sketch of three of them follows this list):

• Stemming equivalents: Words that resolve to the same stem root, using the Snowball stemmer [15]. E.g. boat and boats both resolve to the same root boat.
• Compounding equivalents: These are n-grams that, when written as a compound word, mean the same as the n-gram. These are more popular in some languages like German. E.g.: sail boat and sailboat, granitpflastersteine and granit pflastersteine or granit pflaster steine.
• External dictionary based synonyms: We leverage open source knowledge bases and dictionaries to extract synonyms. One such example is WordNet [4, 5, 21], a large lexical database of English where nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms, each expressing a distinct concept. The synonyms in a group are interlinked by means of conceptual-semantic and lexical relations. From our n-gram candidates, we filter out synonyms as identified by WordNet. E.g.: sunblock and sunscreen, tomahawk and hatchet.
• Acronym-full form equivalents: We implement the Schwartz-Hearst algorithm [18] for extracting acronym based synonyms, which are identified using pattern matching on item titles and query-item pairs. At a high level, the algorithm consists of two tasks: extraction of (short form, long form) pair candidates from text, and identification of the correct long form among the candidates in the sentence that surrounds the short form. E.g.: hp and Hewlett Packard, NY&Co and New York & Company.
• Phrasing: By default, all tokens in a query are matched as a bag of words, meaning any item that contains those tokens in any order is returned for the query. For example, for the query womens dress we may return items like womens summer dress. However, in some cases we might want to enforce strict word ordering while matching terms in items for user queries. This applies to named entities like brands, models, book names, etc., where the query intent is preserved only when matched exactly. Phrasing is a mechanism that enables us to do this enforcement. These are mined by understanding user preferences from query-item n-gram pairs (Kumar et al. [19]). E.g.: if a user searches for new balance shoes, since new balance is a single entity, we enforce that only items containing the words new balance contiguously in that order be returned for this query.
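The sketch below illustrates three of these filtering components using NLTK's Snowball stemmer and WordNet as stand-ins for our internal implementations; the helper names are hypothetical, and the Schwartz-Hearst and phrasing components are omitted.

```python
# Hedged sketch of three filtering components, using NLTK as a stand-in.
# Requires: pip install nltk, plus nltk.download("wordnet") for WordNet.
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import wordnet as wn

stemmer = SnowballStemmer("english")

def is_stemming_equivalent(a: str, b: str) -> bool:
    # boat / boats resolve to the same Snowball stem.
    return stemmer.stem(a) == stemmer.stem(b)

def is_compounding_equivalent(a: str, b: str) -> bool:
    # sail boat / sailboat are equal once spaces are removed.
    return a != b and "".join(a.split()) == "".join(b.split())

def is_wordnet_synonym(a: str, b: str) -> bool:
    # b appears as a lemma in one of a's WordNet synsets.
    lemmas = {lemma.name().replace("_", " ")
              for synset in wn.synsets(a.replace(" ", "_"))
              for lemma in synset.lemmas()}
    return b in lemmas

assert is_stemming_equivalent("boats", "boat")
assert is_compounding_equivalent("sail boat", "sailboat")
```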
3.2 Synonym Selection with ML classifier

The different synonym mining techniques described in Section 3.1 yield a large pool of synonyms with varying degrees of quality and coverage. However, not all of these will be useful, and some could even be harmful when used for e-commerce search query expansion. For example, the words tub and bathtub may be used interchangeably in some contexts, but using one as an expansion for the other may bring in unexpected results from other product categories. We developed a Random Forest classifier [2] in order to separate the truly useful synonyms from candidates that are related but not always synonymous.

3.2.1 Training Dataset and Classifier Information. Our random forest classifier for filtering useful synonym candidates is trained on human judged binary labels indicating the applicability of the synonyms to query expansions (label=1 for candidates that are true synonyms, label=0 for candidates that are irrelevant or related but not synonymous). For example, the pair shoe - shoes will be labeled as 1, while the pair shoes - sandals will be labeled as 0. The training dataset itself is a weighted sample of synonym candidates from each synonym mining technique, where the weights are determined based on the quality of synonyms from each source - i.e. we oversample synonym types that have a larger fraction of negatives. For example, through our internal judgment of data quality, we empirically identified that synonyms based on external knowledge sources have a higher false positive rate than those from stemming, and hence we give a higher weight to synonyms from external sources compared to the ones from stemming. A training example pair of trigger n-gram and synonym expansion is labeled as relevant if and only if adding the synonym as an expansion to the original terms in the query n-gram brings in additional relevant inventory. The classifier has 100 trees with a tree depth of 20. We use the sklearn package in Python [14] for the classifier operations.
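A minimal sketch of this setup follows. The hyperparameters are the ones stated above; the feature files and cross-validation scoring are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X: one row of behavioral + lexical features per (trigger, expansion)
# candidate pair; y: human judged labels (1 = true synonym, 0 = not).
X, y = np.load("features.npy"), np.load("labels.npy")  # hypothetical files

clf = RandomForestClassifier(n_estimators=100, max_depth=20)
print(cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean())
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # synonym score per candidate pair
```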
3.2.2 Feature Engineering. A number of behavioral and lexical features (133 in total) are extracted for the list of candidate synonyms to be evaluated by the classifier. For the behavioral features, we use clicks, sales and other associated engagement signals from historic user sessions (inclusive of SRP level data as well, for query-item and item-item transitions) where the query terms (trigger) and the corresponding synonym candidate occurred. For this we sample queries and engaged items in a 2 week window. We are unable to reveal the full set of behavioral features used beyond the top 10. Below are examples of two of the behavioral features that appear in the top 10:

$\mathrm{ExpansionInQuerySale}_{\mathrm{TriggerInTitle}} = \sum_{i=1}^{N} \big[\mathrm{ExpansionInQuerySale} \mid \mathrm{TriggerInTitle}\big]_i$

For a given trigger and its expansion, the above equation counts the number of times a sale event happens when the trigger appears in an item title and the expansion appears in the search query.

$\mathrm{TriggerInQueryClick}_{\mathrm{BothTriggerExpansionInTitle}} = \sum_{i=1}^{N} \big[\mathrm{TriggerInQueryClick} \mid \mathrm{BothTriggerExpansionInTitle}\big]_i$

Similarly, for a given trigger and its expansion, the above equation counts the number of times a click event happens when the trigger appears in the query and both the trigger and the expansion appear in an item title. Here N is the number of query-title transitions considered. Additionally, we also derive ratio-based features from these count-based features.
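For illustration, the two count features above can be computed from logged transitions roughly as below; the transitions list and its query/title/events fields are hypothetical names for whatever the session log records actually contain.

```python
def expansion_in_query_sale_trigger_in_title(transitions, trigger, expansion):
    """Count sale events where the expansion appears in the search query
    and the trigger appears in the item title."""
    return sum(1 for t in transitions
               if "sale" in t["events"]
               and expansion in t["query"] and trigger in t["title"])

def trigger_in_query_click_both_in_title(transitions, trigger, expansion):
    """Count click events where the trigger is in the query and both the
    trigger and the expansion appear in the item title."""
    return sum(1 for t in transitions
               if "click" in t["events"] and trigger in t["query"]
               and trigger in t["title"] and expansion in t["title"])

# Ratio-based features then divide such counts, e.g. clicks with the
# expansion present over all clicks observed for the trigger.
```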
We also construct lexical features based on word semantics, such as the presence of numeric characters, string lengths, and the lengths of matched and non-matched parts in trigger terms and expansions. To separate out the most predictive features, we also added three random number features. For the final model we used all features with a feature importance greater than that of the first random feature, and ended up with 50 features in total.
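A sketch of this random-probe feature selection is shown below, under the same hypothetical feature files as before; features are kept only if their importance exceeds that of the strongest random probe.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X, y = np.load("features.npy"), np.load("labels.npy")  # hypothetical files
rng = np.random.default_rng(0)
X_probe = np.hstack([X, rng.random((X.shape[0], 3))])  # 3 random probes

clf = RandomForestClassifier(n_estimators=100, max_depth=20).fit(X_probe, y)
importances = clf.feature_importances_
probe_best = importances[-3:].max()            # strongest random probe
keep = np.where(importances[:-3] > probe_best)[0]
print(f"kept {keep.size} of {X.shape[1]} features")
```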

Figure 2: Classifier score distribution for A. negative class, and B. positive class

3.2.3 Threshold tuning for classifier scores. After the ML classifier has predicted a score for each synonym candidate mined through the processes detailed in Section 3.1, we perform a threshold tuning step to filter out the final set of good synonyms. The threshold selection is based on analyzing the score distribution on the training dataset in the positive and negative classes and selecting a cutoff that balances classifier performance against false negative rate. Figure 2 shows the analysis on the individual mining processes discussed in Section 3.1. With 1276 human labeled training samples from the English sites, we look at the distribution of random forest classifier scores for the negative (Figure 2A) and positive (Figure 2B) labels, sub-sampled into space synonyms (27 positive labels, 240 negative labels), stem synonyms (70 positive labels, 153 negative labels), wordnet synonyms (200 positive labels, 131 negative labels), stop word synonyms (3 positive labels, 55 negative labels), and phrases (129 positive labels, 50 negative labels), and select a threshold of 0.3, where 85% of the negative labels and only 5% of the positive training labels are filtered out. The final filtered synonyms are stored in a dictionary that is used for runtime query rewriting.
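The cutoff analysis can be sketched as a sweep over candidate thresholds, reporting what fraction of each class falls below; the 0.3 cutoff above corresponds to filtering roughly 85% of negatives at a 5% false negative rate. The threshold grid is an illustrative assumption.

```python
import numpy as np

def threshold_tradeoff(scores, labels, thresholds=np.arange(0.05, 1.0, 0.05)):
    """For each cutoff, fraction of negatives removed vs positives lost."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    for t in thresholds:
        below = scores < t
        neg_filtered = below[labels == 0].mean()   # negatives removed
        pos_filtered = below[labels == 1].mean()   # false negative rate
        print(f"cutoff {t:.2f}: {neg_filtered:.0%} negatives, "
              f"{pos_filtered:.0%} positives filtered")
```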
4 EVALUATION RESULTS

In this section we discuss the experimental results for our approach. In Section 4.1, we share details about the synonym datasets, before and after filtering with the synonym classifier. In Section 4.2 we report metrics for the synonym classifier described in Section 3.2. We cover the impact of synonym based query expansions on the number of search results returned across three of our major markets in Section 4.3, results from our A/B tests in Section 4.4, and finally some anecdotal examples in Section 4.5.

4.1 Synonym Datasets

We filter the synonym candidates generated with the various techniques described in Section 3.1 using a classifier trained on human judged labels, where a pair of a query n-gram and its expansion is judged based on the utility of the expansion to bring in additional relevant items matching the query n-gram. For example, (box, boxing) → irrelevant (label = 0), because adding boxing as a synonym for box could change the intent from box as a container to boxing the sport, although the two do qualify as stem synonyms; whereas (rug, carpet) → relevant (label = 1), because adding carpet as an alternate expansion to rug brings in additional relevant items for the query. We trained the synonym classifier using 4K training records from our three biggest markets (US, UK, and Germany) with 10 fold cross validation. Within this dataset, 1.8K are positive labels and 2.2K are negative labels. We also use 300 records as a holdout set for final model testing. For generating the final synonym candidates we sampled 2M synonym pairs, from which we filtered 940K synonym pairs using the ML classifier based synonym selection process discussed in Section 3.2.
Figure 3: AUC and AUPRC of the model on the hold-out data

4.2 ML Classifier Results

In this section we discuss in detail the model's performance on the holdout set. Figure 3 shows the AUC and AUPRC performance of the random forest synonym classification model on the holdout data. We achieved an AUC of 0.87 and an AUPRC of 0.84 on the holdout data.
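These holdout metrics can be reproduced with scikit-learn, as sketched below under the same hypothetical feature files as the earlier sketches.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

X, y = np.load("features.npy"), np.load("labels.npy")  # hypothetical files

# Hold out 300 records for final testing, mirroring the setup above.
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=300, stratify=y)
clf = RandomForestClassifier(n_estimators=100, max_depth=20).fit(X_tr, y_tr)
probs = clf.predict_proba(X_ho)[:, 1]
print("AUC:  ", roc_auc_score(y_ho, probs))
print("AUPRC:", average_precision_score(y_ho, probs))
```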
Figure 4: Top 10 features for the Random Forest synonym classifier; ef stands for Engagement Feature

Figure 4 shows the top 10 most important features picked up by the model. We are unable to disclose the full list of features considered or the specifics of the different engagement signals. We indicate the different engagement signals (denoted ef in the figure) with numbers when they are used as standalone features, and as Combined when all the different signals are combined to capture a certain feature. We observe that most of the top features picked up by the model are a combination of behavioral and semantic features. For example, the most important feature is an engagement signal capturing the fraction of that engagement signal for queries where the trigger n-gram is present in both the query and the title, with the user behavioral signals combined together. Another important signal is the number of items engaged with where the trigger n-gram was in the query and both the expansion and the trigger n-grams were present in the title of the engaged items.

Figure 5 shows the variation of prediction scores across the different types of synonyms and the different transition types discussed in Section 3.1. For instance, we observe that for stemming based synonyms (stemsyn) the median classifier score is lower than the median score for compounding based synonyms (spacesyn). In Figure 6 we show the absolute error of the predictions made by the classifier across the different rewrite types; consistent with the previous observation, the absolute error for stemming based synonyms overall is higher than that of the space synonyms for most subtypes.

Figure 5: Prediction across different rewrite types

Another interesting observation we made is that the change in test error for the given model with increasing training data size saturates after 900 records. This suggests that with the given model complexity and feature set, a small training dataset already has reasonably high discriminating capability. Figure 7 shows the change in test error with increasing training data size.
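This saturation check can be sketched with scikit-learn's learning_curve utility, again under the hypothetical feature files used in the earlier sketches.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = np.load("features.npy"), np.load("labels.npy")  # hypothetical files
sizes, _, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, max_depth=20),
    X, y, train_sizes=np.linspace(0.1, 1.0, 8), cv=10, scoring="accuracy")
for n, err in zip(sizes, 1 - test_scores.mean(axis=1)):
    print(f"train size {n:4d}: test error {err:.3f}")
```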
Figure 6: Absolute error across different rewrite types

Figure 7: Change in test error with increasing training data size

4.3 Recall Impact of Query Rewriting

We sampled queries across the three major sites and ran an experiment where we recorded the mean number of search results across the query set (recall size), with and without these synonym based query expansions. The result is shown in Figure 8. This analysis shows the direct recall change we make in eBay search using synonym query expansions (on average 3x or more items are returned with synonym query expansions) for the impacted query set.

Figure 8: Recall impact with and without query rewriting in major sites

4.4 A/B Test Results

We ran online A/B tests to evaluate the value of the synonym mining and filtering techniques described in this work. For the test we used 33 percent of live traffic each for test and control, run over a period of 3 weeks. We observed improvements in both the relevance of the search result page (SRP) and user engagement across all markets, but are unable to share more specifics on the business metrics.

4.5 Anecdotal Examples

In this section we show an example of how good rewrites for a given term help retrieve more relevant results. Figure 9 shows the rewrite airpods for the query term air pods. As we can see, 6 out of the 8 top listings for the query do not contain the original query term and are present in the SRP due to the synonym expansion. All of these results are relevant to the original query.

Figure 9: An example showing how adding airpods as a rewrite for air pods improves recall

5 CONCLUSION

In this work, we presented our system for e-commerce query rewriting through token expansions. We described our end-to-end process for generating token synonym candidates, evaluating them, and incorporating them to improve retrieval. We also shared results on the quality of the query rewrites filtered with a machine learned classifier, evaluated against a human labeled dataset, as well as their value in improving the quality of search results. For future work, we plan to incorporate more techniques for mining synonym candidates, incorporate more context in the synonym selection process, and leverage other sources like knowledge graphs for query rewriting.

REFERENCES

[1] Ricardo Baeza-Yates and Alessandro Tiberi. 2007. Extracting semantic relations from query logs. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 76–85.
[2] Leo Breiman. 2001. Random Forests. Mach. Learn. 45, 1 (Oct. 2001), 5–32. https://doi.org/10.1023/A:1010933404324
[3] Hang Cui, Ji-Rong Wen, Jian-Yun Nie, and Wei-Ying Ma. 2002. Probabilistic query expansion using query logs. In Proceedings of the 11th international conference on World Wide Web. ACM, 325–332.
[4] Ingo Feinerer and Kurt Hornik. 2017. wordnet: WordNet Interface. https://CRAN.R-project.org/package=wordnet R package version 0.1-14.
[5] Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.
[6] Jianfeng Gao, Xiaodong He, Shasha Xie, and Alnur Ali. 2012. Learning lexicon models from search logs for query expansion. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 666–676.
[7] Jianfeng Gao, Gu Xu, and Jinxi Xu. 2013. Query expansion using path-constrained random walks. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, 563–572.
[8] Julio Gonzalo, Felisa Verdejo, Irina Chugur, and Juan Cigarran. 1998. Indexing with WordNet synsets can improve text retrieval. arXiv preprint cmp-lg/9808002 (1998).
[9] Yunlong He, Jiliang Tang, Hua Ouyang, Changsung Kang, Dawei Yin, and Yi Chang. 2016. Learning to rewrite queries. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 1443–1452.
[10] Chien-Kang Huang, Lee-Feng Chien, and Yen-Jen Oyang. 2003. Relevant term suggestion in interactive web search based on contextual information in query session logs. Journal of the American Society for Information Science and Technology 54, 7 (2003), 638–649.
[11] Rosie Jones, Benjamin Rey, Omid Madani, and Wiley Greiner. 2006. Generating query substitutions. In Proceedings of the 15th international conference on World Wide Web. ACM, 387–396.
[12] Rila Mandala, Takenobu Tokunaga, and Hozumi Tanaka. 1999. Combining multiple evidence from different types of thesaurus for query expansion. In SIGIR, Vol. 99. 15–19.
[13] Roberto Navigli and Paola Velardi. 2003. An analysis of ontology-based query expansion strategies. In Proceedings of the 14th European Conference on Machine Learning, Workshop on Adaptive Text Extraction and Mining, Cavtat-Dubrovnik, Croatia. Citeseer, 42–49.
[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[15] Martin F. Porter. 2001. Snowball: A language for stemming algorithms. Published online. (October 2001). http://snowball.tartarus.org/texts/introduction.html Accessed 11.03.2008, 15.00h.
[16] Yonggang Qiu and Hans-Peter Frei. 1993. Concept based query expansion. In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 160–169.
[17] Stefan Riezler and Yi Liu. 2010. Query rewriting using monolingual statistical machine translation. Computational Linguistics 36, 3 (2010), 569–582.
[18] Ariel S. Schwartz and Marti A. Hearst. 2003. A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. In Pacific Symposium on Biocomputing. 451–462.
[19] Prathyusha Senthil Kumar, Vamsi Salaka, Tracy Holloway King, and Brian Johnson. 2014. Mickey Mouse is not a Phrase: Improving Relevance in E-Commerce with Multiword Expressions. In Proceedings of the 10th Workshop on Multiword Expressions (MWE). Association for Computational Linguistics, Gothenburg, Sweden, 62–66. https://doi.org/10.3115/v1/W14-0810
[20] Andrew Trotman, Jon Degenhardt, and Surya Kallumadi. 2017. The architecture of eBay search. In ACM SIGIR Forum. ACM.
[21] Mike Wallace. 2007. Jawbone Java WordNet API. http://mfwallace.googlepages.com/jawbone
[22] Jinxi Xu and W. Bruce Croft. 1996. Query Expansion Using Local and Global Document Analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '96). ACM, New York, NY, USA, 4–11. https://doi.org/10.1145/243199.243202