E-commerce Query Classification Using Product Taxonomy
              Mapping: A Transfer Learning Approach
Michael Skinner
research@mcskinner.com

Surya Kallumadi
The Home Depot
surya@ksu.edu

ABSTRACT
In web search, query classification (QC) is used to map a query to a user’s search intent.
In the e-commerce domain, users’ product search queries can be broadly categorised into
product-specific queries and category-specific queries [9]. In these instances, accurate
classification of queries helps identify the right product categories from which relevant
products can be retrieved. Thus, mapping a query to a pre-defined product taxonomy is an
important step in the e-commerce query understanding pipeline. A typical e-commerce website
has thousands of categories, and curating a labeled data set for query classification is
expensive, time consuming, and labor intensive. In addition, product search queries are
short, and the vocabulary changes over time as the catalogue evolves. Reducing the effort of
generating query-category labels would save time and resources. In this work we show how an
existing product-taxonomy mapping can improve query classification, and reduce the need for
labeled data, using transfer learning. Our results demonstrate that such an approach can
match, and often exceed, the performance of direct training with a smaller computational
budget. We further explore how performance varies as the amount of available training data
varies, and show that transfer learning is most useful when the target data set size is
small. In addition, we make available a large query data set of 535,506 unique labeled
e-commerce queries, mapped over 58 categories. The results and transfer learning approaches
presented in this work can act as strong baselines for this collection and task.

CCS CONCEPTS
• Information systems → Clustering and classification;

KEYWORDS
Test Collection, e-Commerce, Query Classification

Copyright © 2019 by the paper’s authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019
eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

1    INTRODUCTION
In the e-commerce domain, query understanding can have a significant impact on user
satisfaction. An incorrectly interpreted query can lead to search abandonment by the user,
resulting in lower conversion rates. E-commerce queries are usually short and lack linguistic
structure, and they can be ambiguous as a result. For example, the query ‘battery lawn
tractor’ can be interpreted as ‘battery for lawn tractor’ or ‘battery operated lawn tractor’.
   In product search, the objective of query classification is to map a user query to a
pre-defined product category. QC can improve the relevance of search results while preserving
the recall. A typical e-commerce site such as Amazon.com can have millions of products, and
thousands of product categories of various granularities. Curating a query-category labeled
data set with good coverage over all the categories is expensive, labor intensive, and can
take a long time. Approaches that can reduce the effort needed to categorize the search
queries can significantly improve the performance of QC. In this work, we propose a transfer
learning approach for QC by using product titles. As the products in the domain are mapped to
a well defined product taxonomy, the product mapping can be exploited to improve QC, and
reduce the need for labeled data.
   Transfer learning has proven to be an effective technique to improve the performance of
various tasks in computer vision and natural language processing (NLP) [1]. The goal of
transfer learning is to utilize knowledge present within a source domain to improve a task
within a target domain. Neural network and deep learning based transfer learning approaches
have been shown to be quite useful to improve the performance of a wide range of target tasks
in NLP [7]. To demonstrate transfer learning for QC in the e-commerce domain, we use
Amazon.com titles as the source data set [5], and queries obtained by crawling the Amazon.com
auto-complete service as the target data set.
   Academic research on the e-commerce query classification task has been limited because of
a lack of labeled data. Through this work, and the query-category data set made available, we
hope to facilitate progress in this research area. In addition to the introduction of a new
data set, our contributions are as follows: 1) We present a methodology for this
domain-specific transfer learning, in which the source model is tuned as a classifier on a
similar problem. 2) We demonstrate that such an approach can be leveraged to speed training
and improve results when compared to direct training. 3) We explore the impact of target data
size on both direct and transferred models, showing that the gain of transfer learning over
direct training grows as the target training data shrinks.

2   RELATED WORK
In the query classification challenge organized as the ACM KDD Cup 2005 competition, the task
was to categorize 800,000 web queries into 67 predefined categories [3]. The data set for
this challenge contained 111 queries with category mappings, and the queries in the test data
set can be tagged by up to 5 categories. The submissions were evaluated on an 800 query
subset of the complete data set. This competition highlighted the challenge of assigning
labels to queries.
SIGIR 2019 eCom, July 2019, Paris, France                                                                                               Michael Skinner and Surya Kallumadi
Product Titles                                                    Category
Compaq 256MB 168-Pin 100Mhz DIMM SDRAM for Compaq Proliant        Electronics
EK Ekcessories 10708C-BLUE-AM Blue Jeep Visor Clip                Automotive
NHL Chicago Blackhawks Franchise Fitted Hat, Black, Extra Large   Sports & Outdoors
Sesame Street Robe with Embroidered Washcloth                     Health & Personal Care
Emerica Men’s The Westgate Skate Shoe                             Clothing, Shoes & Jewelry

Queries                                                           Category
13mm wrench                                                       tools
hip action zukes peanut butter                                    pets
nerf guns under 30 dollars                                        toys-and-games
bernaise sauce mix                                                grocery
door lever lock child proof                                       baby-products

Table 1: Examples from the source (titles) and target (queries) data sets.

[Figure 1 plot not reproduced: frequency (y-axis, 0 to 140000) against query length in number
of tokens (x-axis, 1 to 14), for the Training, Validation, and Test splits.]
Figure 1: Query token-length distributions across target data splits.

   Lin et al. propose using implicit feedback from user clicks as a signal to collect
training data for QC in the e-commerce domain [4]. We consider this work to be complementary
to the transfer learning approach we propose in this paper. Leveraging user click stream data
and the product hierarchy together can be used to improve the overall system performance.
Click stream data is useful when a sufficient amount of user behavior has been observed for a
category, but this fails for new categories and items. The transfer learning approach
exploiting product titles does not suffer from item and category cold start.

                      Source         Target Data - Queries
                      (Titles)       Train        Val.        Test
Documents             6,835,398      435,506      50,000      50,000
Num. bytes            1–2000         1–69         3–69        2–69
Avg num. bytes        59.98          21.32        21.34       21.37
Token length          1–434          1–14         1–14        1–14
Avg token length      9.45           3.52         3.52        3.52

Table 2: Statistics of each part of the source and target data.
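As an illustration of how statistics of the kind reported in Table 2 could be derived, the following is a minimal sketch. It assumes UTF-8 byte lengths and whitespace tokenization; the paper does not state its tokenizer, so the token counts here are an assumption, and `length_stats` is a hypothetical helper name.

```python
def length_stats(texts):
    """Summarize a list of strings by byte length and token length.

    Assumptions (not taken from the paper): bytes are counted after UTF-8
    encoding, and tokens are obtained by splitting on whitespace.
    """
    byte_lens = [len(t.encode("utf-8")) for t in texts]
    token_lens = [len(t.split()) for t in texts]
    return {
        "documents": len(texts),
        "bytes_range": (min(byte_lens), max(byte_lens)),
        "avg_bytes": sum(byte_lens) / len(texts),
        "tokens_range": (min(token_lens), max(token_lens)),
        "avg_tokens": sum(token_lens) / len(texts),
    }


# Three example queries taken from Table 1.
queries = ["13mm wrench", "bernaise sauce mix", "door lever lock child proof"]
stats = length_stats(queries)
print(stats)
```

Running the same helper over the title and query collections would give one column of the table per collection.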
   Sondhi et al. identify a taxonomy of e-commerce query intents, based on search logs and
user behavior data [9]. This work identifies five categories of e-commerce queries based on
user search behavior: 1) Shallow Exploration Queries, 2) Targeted Purchase Queries, 3)
Major-Item Shopping Queries, 4) Minor-Item Shopping Queries, and 5) Hard-Choice Shopping
Queries. This paper highlights the complexity of user intent in the e-commerce domain, and
proposes techniques for leveraging these insights.

3    DATA COLLECTION AND DATA SET
Domain adaptation and transfer learning usually require two data sets, a source data set and
a target data set. For supervised tasks such as QC, transfer learning helps in scenarios
where we have very little training data in the target data set, and lots of data in the
source data set. Also, the source and target data sets should have similar characteristics.
In this work, as product titles and queries share a similar vocabulary, we chose product
titles as the source data set. McAuley et al. [5] provide a crawl of Amazon.com’s product
pages including 142.8 million reviews, 9.43 million products, and 6.83 million titles
(footnote 1). We utilize the titles data available in this data set as the source data for
transfer learning.
   As no product-query data sets are publicly available for QC, we leveraged Amazon.com’s
auto-completion to generate e-commerce queries (footnote 2). In addition to providing
suggestions for partial queries, auto-complete also provides high level candidate categories
for suggested queries. These query-category results serve as our target data set for the QC
task. The seeds for the auto-complete crawl were common terms and phrases found in the data
set by McAuley et al. In addition, we used random alpha-numeric character combinations as
seeds for the query crawl. A total of 535,506 query-category labels were obtained by this
exercise. To ascertain the accuracy of this data, we manually evaluated 1000 randomly sampled
queries from this data set. The query-category labels suggested by auto-complete had an
accuracy of 98.6%. The auto-complete crawl was performed over a duration of 1 week, in
December 2018. The queries in the resulting data set were mapped to 58 high level categories.

1 http://jmcauley.ucsd.edu/data/amazon/
2 http://completion.amazon.com/api/2017/suggestions

3.1    Data Splits
Both the source and target data sets are split into training, validation, and test sets,
stratified by category. This resulted in 5,811,656 training examples for the source data,
500,000 validation examples and 500,000 test examples. The target data had 435,506 training
examples, with 50,000 examples reserved for validation and test sets each. The target
training data was also progressively sub-sampled to create smaller training sets of 50%, 20%,
and 10% of the original data, each a subset of the previous sample. In Figure 1 we can see
that the length of queries is similarly distributed across the 3 splits. The query-length
distributions of both the validation set and the training set show a Pearson’s correlation
of > 0.99 with that of the test set. Due to the use of stratified sampling, the category
distributions over the three sets are similar.

3.2    Data Characteristics
Table 1 shows examples of product titles and queries from the source and target data sets,
respectively. Table 2 shows the high level statistics of the source and target data sets.
While the average length of a title is 9.45 tokens, queries are much shorter (3.52 tokens).
This significant difference in query and title lengths poses an interesting transfer learning
challenge.

4    SYSTEM ARCHITECTURE DESCRIPTION
Recent work in NLP has shown the wide utility of Long Short-Term Memory (LSTM) architectures
for transfer learning tasks [6]. Howard and Ruder used a pre-trained LSTM architecture to
achieve state-of-the-art results on several text classification tasks [2]. The Balanced
Pooling View (BPV) architecture, which builds on these approaches, has been shown to be
effective for product taxonomy classification tasks [8].

Figure 2: An illustration of the BPV architecture.

   The model architecture, which can be seen in Figure 2, is centered around a
character-level LSTM, which is fed via an embedding. The time series output from the
Recurrent Neural Network (RNN) is then summarized in 4 ways: by taking the last value as in a
typical RNN architecture, and then with mean-pooling, max-pooling, and min-pooling. Those 4
summaries are concatenated and fed through a linear layer with output size equal to the
number of categories. When transferring, only the output layer needs to be replaced, in order
to accommodate the new category space. The embedding size, RNN width and depth, and dropout
settings are all set as in [8].
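The pooling head described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: plain Python lists stand in for tensors, the RNN outputs are made-up toy values, and the helper names (`balanced_pooling_view`, `make_output_layer`, `linear`) and the source category count of 5 are hypothetical. Only the target category count of 58 comes from the paper.

```python
import random


def balanced_pooling_view(rnn_outputs):
    """Concatenate the last, mean, max, and min summaries of a sequence.

    rnn_outputs is a list of time steps, each a list of hidden activations.
    """
    steps = len(rnn_outputs)
    last = list(rnn_outputs[-1])
    columns = list(zip(*rnn_outputs))  # one tuple per hidden dimension
    mean = [sum(col) / steps for col in columns]
    maximum = [max(col) for col in columns]
    minimum = [min(col) for col in columns]
    return last + mean + maximum + minimum


def make_output_layer(num_features, num_categories, seed=0):
    """Create a freshly initialized linear layer as (weights, biases)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(num_features)]
               for _ in range(num_categories)]
    biases = [0.0] * num_categories
    return weights, biases


def linear(features, layer):
    """Apply a linear layer to a feature vector, giving one logit per category."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


# Toy RNN output: 3 time steps with hidden width 2 (hypothetical values).
rnn_outputs = [[0.1, -0.2], [0.4, 0.0], [-0.3, 0.5]]
pooled = balanced_pooling_view(rnn_outputs)  # length is 4 * hidden width

# Source model head (category count is a placeholder, not from the paper).
source_head = make_output_layer(len(pooled), num_categories=5, seed=1)
source_logits = linear(pooled, source_head)

# Transfer: the layers below the head are reused unchanged, and only the
# output layer is replaced to match the 58 target query categories.
target_head = make_output_layer(len(pooled), num_categories=58, seed=2)
target_logits = linear(pooled, target_head)
```

Because the pooled feature vector has the same width for both tasks, swapping the head is the only structural change needed when moving from the title taxonomy to the query categories.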
   On the target problem, we explore two different training styles: 1) target-only direct
training and 2) transfer learning from a source model. Direct training only uses the target
data, without reference to either the source model or the source data. Transfer learning uses
the source model to initialize network weights, replacing the output layer to accommodate the
new category set, and then otherwise proceeding as before. Adam optimization was found to be
consistently better than stochastic gradient descent (SGD) and is used for all target models.
Cross-entropy loss is used throughout.
   Final hyper-parameters were tuned using a grid search around those initial values, varying
the learning rate schedule and peak learning rate, as well as the number of training epochs
for direct training. Transfer learning was fixed at 5 epochs throughout, since any increase
in the number of epochs led to overfitting and an increasing validation loss. This process
was performed separately for direct training and transfer learning, as well as for each of
the 4 data scales.
   Hyper-parameters with consistently strong validation results were then chosen for each of
the two training styles. A learning rate of 0.003 was best for all variants. A linearly
decreasing "burndown" schedule was better than 1cycle or a flat learning rate for transfer.
Direct training was most effective with 10 epochs when trained on subsets of the target data,
but better still with 20 epochs on a full 100% of the target data. Once settled, these
parameters were used in 4 independent training runs for each training style and data scale.
Each model was used to make predictions over the test set, and the results are based on these
predictions.

[Figure 3 plot not reproduced: test loss (y-axis) against target data size in % (x-axis: 10,
20, 50, 100), for Direct and Transfer training.]
Figure 3: Cross-entropy loss on the test set, with varying target data size.

5    EVALUATION
We report cross-entropy loss, accuracy, precision, recall, and F1 scores for our models. As
the queries are not uniformly distributed across the categories, we use weighted precision,
recall, and F1 to measure the performance of the approaches on the test data. If P_i, R_i,
and F1_i are the precision, recall, and F1 scores for each category c_i, then the
corresponding weighted metrics can be calculated as:

$$P_w = \sum_{i=1}^{K} \frac{n_i}{N} P_i \qquad (1)$$

$$R_w = \sum_{i=1}^{K} \frac{n_i}{N} R_i \qquad (2)$$

$$F1_w = \sum_{i=1}^{K} \frac{n_i}{N} F1_i \qquad (3)$$

6    RESULTS
Figure 3 shows the results for test loss as the amount of target data varies, for each of the
two training approaches. The advantages of transfer learning are most apparent at low data
scales, where it produces significantly better results. The two approaches eventually
converge in performance as target data becomes fully available. Figure 4 shows the equivalent
results for accuracy. In this case the performance difference is not as large, and direct
training closes the gap at 50% of the target data. This corresponds to a regime in which the
training loss continues to drop rapidly while validation loss levels off, which might
indicate overfitting.
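The weighted metrics of Equations (1) to (3) in Section 5 can be sketched as below. The sketch assumes that n_i is the number of test examples whose true category is c_i, that N is the total number of test examples, and that K is the number of categories present in the test labels; these readings are not spelled out in the text, although they are consistent with the observation that weighted recall equals accuracy. The function name `weighted_prf` is hypothetical.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 over all true categories.

    Assumption: each category is weighted by n_i / N, where n_i counts the
    test examples with that true label and N is the number of test examples.
    """
    n_total = len(y_true)
    support = Counter(y_true)      # n_i per category
    predicted = Counter(y_pred)    # how often each category was predicted
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    p_w = r_w = f1_w = 0.0
    for category, n_i in support.items():
        hits = correct[category]
        p_i = hits / predicted[category] if predicted[category] else 0.0
        r_i = hits / n_i
        f1_i = 2 * p_i * r_i / (p_i + r_i) if (p_i + r_i) else 0.0
        weight = n_i / n_total
        p_w += weight * p_i
        r_w += weight * r_i
        f1_w += weight * f1_i
    return p_w, r_w, f1_w


# Toy labels (hypothetical): three "tools" queries and one "pets" query.
y_true = ["tools", "tools", "tools", "pets"]
y_pred = ["tools", "tools", "pets", "pets"]
p_w, r_w, f1_w = weighted_prf(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the weighted recall is 0.75, the same as the accuracy of 3 correct predictions out of 4, because the n_i terms cancel in the recall sum.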


[Figure 4 plot not reproduced: test accuracy in % (y-axis, 72 to 84) against target data size
in % (x-axis: 10, 20, 50, 100), for Direct and Transfer training.]
Figure 4: Accuracy on the test set for various data scales.

Target Size    Source + Target              Target only
               Pw       Rw       F1w        Pw       Rw       F1w
10%            0.757    0.757    0.754      0.733    0.734    0.732
20%            0.791    0.790    0.788      0.782    0.783    0.781
50%            0.828    0.828    0.826      0.828    0.829    0.827
100%           0.852    0.852    0.851      0.862    0.861    0.860

Table 3: Comparing the performance of transfer learning and direct target-only training.

   Table 3 shows the overall weighted precision, recall, and F1 scores for each training
variant across the different target data scales. Recall is equal to the accuracy metrics
reported in Figure 4. Table 4 shows the per-category results in the case when the target
training data set is small (10%), for categories with at least 100 test examples. Transfer
learning is able to improve F1 for nearly all categories, sometimes significantly, and for
categories that were both difficult as well as easy for the directly trained model. Transfer
learning was particularly helpful for rare categories. The top 6 F1 improvements (bolded in
the original table) were achieved on the 6 categories with the fewest examples in the 10%
subset of target training data. This highlights the benefit of a transfer learning approach
for cold start categories and items.

Category            Source + 10% Target          10% Target only
                    P        R        F1         P        R        F1
fash-wom-shoes      0.887    0.859    0.873      0.835    0.808    0.821
pets                0.903    0.820    0.860      0.882    0.813    0.846
mobile              0.860    0.827    0.844      0.862    0.823    0.842
fash-wom-cloth      0.847    0.837    0.842      0.792    0.753    0.772
beauty              0.835    0.830    0.833      0.802    0.787    0.795
garden              0.821    0.836    0.828      0.787    0.828    0.807
fash-wom-jlry       0.755    0.895    0.819      0.733    0.777    0.754
grocery             0.784    0.831    0.807      0.750    0.779    0.764
baby-products       0.824    0.727    0.772      0.773    0.655    0.709
electronics         0.712    0.844    0.772      0.741    0.809    0.774
automotive          0.705    0.776    0.739      0.686    0.761    0.721
toys-and-games      0.738    0.732    0.735      0.698    0.693    0.695
videogames          0.782    0.691    0.734      0.800    0.740    0.769
hpc                 0.726    0.719    0.722      0.701    0.697    0.699
office-products     0.749    0.698    0.722      0.711    0.691    0.701
sports-&-fitness    0.719    0.661    0.689      0.685    0.616    0.649
arts-crafts         0.740    0.636    0.684      0.692    0.611    0.649
fash-mens-cloth     0.681    0.676    0.679      0.641    0.560    0.598
lawngarden          0.763    0.606    0.676      0.716    0.582    0.642
tools               0.678    0.670    0.674      0.651    0.668    0.660
fan-shop            0.745    0.597    0.663      0.631    0.430    0.511
mi                  0.761    0.577    0.656      0.694    0.569    0.625
outdoor-rec         0.715    0.526    0.606      0.680    0.538    0.601
industrial          0.579    0.355    0.440      0.506    0.371    0.428
appliances          0.580    0.348    0.435      0.564    0.297    0.389

Table 4: Per-category results on 10% of the target data, for categories with at least 100
test examples.

7    CONCLUSION
Our results show that product-title data is an effective pre-training source for
query-taxonomy classification. When there is not much training data, transfer learning
improves the quality of the final target models. Although the results converge for larger
target data sets, we observe that pre-trained transfer learning models converge in fewer
epochs than models trained only on the target data set.
   This convergence is noteworthy and worth exploring in more detail. The implication is
that, at a certain data scale, the source model does not contain any information that is more
useful than that in the target data. One possible reason for this is that the model
architecture can only encode so much information, and it may be the case that the full target
data can saturate it. If so, then increasing the size of the pre-trained source model might
lead to further improvements.
   In addition, we make available a large query-category labeled data set which can
facilitate additional progress in this research area. This data provides scope for research
tasks such as query intent mining, query segmentation and query scoping.

REFERENCES
[1] Hal Daumé, III, Abhishek Kumar, and Avishek Saha. 2010. Frustratingly Easy Semi-supervised
    Domain Adaptation. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural
    Language Processing (DANLP 2010). Association for Computational Linguistics, Stroudsburg,
    PA, USA, 53–59. http://dl.acm.org/citation.cfm?id=1870526.1870534
[2] Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text
    Classification. (2018). arXiv:1801.06146
[3] Ying Li, Zijian Zheng, and Honghua (Kathy) Dai. 2005. KDD CUP-2005 Report: Facing a Great
    Challenge. SIGKDD Explor. Newsl. 7, 2 (Dec. 2005), 91–99.
    https://doi.org/10.1145/1117454.1117466
[4] Y. Lin, A. Datta, and G. D. Fabbrizio. 2018. E-commerce Product Query Classification Using
    Implicit User’s Feedback from Clicks. In 2018 IEEE International Conference on Big Data
    (Big Data). 1955–1959. https://doi.org/10.1109/BigData.2018.8622008
[5] J McAuley, R Pandey, and J Leskovec. 2015. Inferring Networks of Substitutable and
    Complementary Products. In KDD 2015. 785–794.
[6] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and
    Optimizing LSTM Language Models. (2017). arXiv:1708.02182
[7] Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How Transferable
    are Neural Networks in NLP Applications?. In Proceedings of the 2016 Conference on
    Empirical Methods in Natural Language Processing. Association for Computational
    Linguistics, 479–489. https://doi.org/10.18653/v1/D16-1046
[8] M Skinner. 2018. Product Categorization with LSTMs and Balanced Pooling Views. In
    Proceedings of the 2018 SIGIR Workshop On eCommerce.
[9] Parikshit Sondhi, Mohit Sharma, Pranam Kolari, and Chengxiang Zhai. 2018. A taxonomy of
    queries for e-commerce search. In 41st International ACM SIGIR Conference on Research and
    Development in Information Retrieval, SIGIR 2018. Association for Computing Machinery,
    Inc, 1245–1248. https://doi.org/10.1145/3209978.3210152