E-commerce Query Classification Using Product Taxonomy Mapping: A Transfer Learning Approach

Michael Skinner
research@mcskinner.com

Surya Kallumadi
The Home Depot
surya@ksu.edu

ABSTRACT
In web search, query classification (QC) is used to map a query to a user’s search intent. In the e-commerce domain, users’ product search queries can be broadly categorised into product-specific queries and category-specific queries [9]. In these instances, accurate classification of queries helps identify the right product categories from which relevant products can be retrieved. Thus, mapping a query to a pre-defined product taxonomy is an important step in the e-commerce query understanding pipeline. A typical e-commerce website has thousands of categories, and curating a labeled data set for query classification is expensive, time consuming, and labor intensive. In addition, product search queries are short, and the vocabulary changes over time as the catalogue evolves. Reducing the effort of generating query-category labels would save time and resources. In this work we show how an existing product-taxonomy mapping can improve query classification, and reduce the need for labeled data, using transfer learning. Our results demonstrate that such an approach can match, and often exceed, the performance of direct training with a smaller computational budget. We further explore how performance varies with the amount of available training data, and show that transfer learning is most useful when the target data set is small. In addition, we make available a large query data set of 535,506 unique labeled e-commerce queries, mapped over 58 categories. The results and transfer learning approaches presented in this work can act as strong baselines for this collection and task.

CCS CONCEPTS
• Information systems → Clustering and classification;

KEYWORDS
Test Collection, e-Commerce, Query Classification

Copyright © 2019 by the paper’s authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

1 INTRODUCTION
In the e-commerce domain, query understanding can have a significant impact on user satisfaction. An incorrectly interpreted query can lead to search abandonment by the user, resulting in lower conversion rates. E-commerce queries are usually short and lack linguistic structure, and they can be ambiguous as a result. For example, the query ‘battery lawn tractor’ can be interpreted as ‘battery for lawn tractor’ or ‘battery operated lawn tractor’.

In product search, the objective of query classification is to map a user query to a pre-defined product category. QC can improve the relevance of search results while preserving recall. A typical e-commerce site such as Amazon.com can have millions of products, and thousands of product categories of various granularities. Curating a query-category labeled data set with good coverage over all the categories is expensive, labor intensive, and can take a long time. Approaches that reduce the effort needed to categorize search queries can significantly improve the performance of QC. In this work, we propose a transfer learning approach for QC by using product titles. As the products in the domain are mapped to a well defined product taxonomy, the product mapping can be exploited to improve QC, and reduce the need for labeled data.

Transfer learning has proven to be an effective technique to improve the performance of various tasks in computer vision and natural language processing (NLP) [1]. The goal of transfer learning is to utilize knowledge present within a source domain to improve a task within a target domain. Neural network and deep learning based transfer learning approaches have been shown to be quite useful for improving the performance of a wide range of target tasks in NLP [7]. To demonstrate transfer learning for QC in the e-commerce domain, we use Amazon.com titles as the source data set [5], and queries obtained by crawling the Amazon.com auto-complete service as the target data set.

Academic research on the e-commerce query classification task has been limited by a lack of available labeled data. Through this work, and the query-category data set made available, we hope to facilitate progress in this research area. In addition to the introduction of a new data set, our contributions are as follows:
1) We present a methodology for this domain-specific transfer learning, in which the source model is tuned as a classifier on a similar problem.
2) We demonstrate that such an approach can be leveraged to speed training and improve results when compared to direct training.
3) We explore the impact of target data size on both direct and transferred models, showing that the improvement of transfer learning over direct training grows as the target training data shrinks.

2 RELATED WORK
In the query classification challenge organized as part of the ACM KDD Cup 2005 competition, the task was to categorize 800,000 web queries into 67 predefined categories [3]. The data set for this challenge contained 111 queries with category mappings, and the queries in the test data set could be tagged with up to 5 categories. The submissions were evaluated on an 800-query subset of the complete data set. This competition highlighted the challenge of assigning labels to queries.
SIGIR 2019 eCom, July 2019, Paris, France. Michael Skinner and Surya Kallumadi

Lin et al. propose using implicit feedback from user clicks as a signal to collect training data for QC in the e-commerce domain [4]. We consider this work to be complementary to the transfer learning approach we propose in this paper. User click stream data and the product hierarchy can be leveraged together to improve the overall system performance. Click stream data is useful when a sufficient amount of user behavior has been observed for a category, but this fails for new categories and items. The transfer learning approach exploiting product titles does not suffer from item and category cold start.

Sondhi et al. identify a taxonomy of e-commerce query intents, based on search logs and user behavior data [9]. This work identifies five categories of e-commerce queries based on user search behavior: 1) Shallow Exploration Queries, 2) Targeted Purchase Queries, 3) Major-Item Shopping Queries, 4) Minor-Item Shopping Queries, and 5) Hard-Choice Shopping Queries. Their paper highlights the complexity of user intent in the e-commerce domain, and proposes techniques for leveraging these insights.

3 DATA COLLECTION AND DATA SET
Domain adaptation and transfer learning usually require two data sets, a source data set and a target data set. For supervised tasks such as QC, transfer learning helps in scenarios where we have very little training data in the target data set, and a lot of data in the source data set. Also, the source and target data sets should have similar characteristics. In this work, as product titles and queries share a similar vocabulary, we chose product titles as the source data set. McAuley et al. [5] provide a crawl of Amazon.com’s product pages including 142.8 million reviews, 9.43 million products, and 6.83 million titles (footnote 1). We utilize the titles data available in this data set as the source data for transfer learning.

As no product-query data sets are publicly available for QC, we leveraged Amazon.com’s auto-completion to generate e-commerce queries (footnote 2). In addition to providing suggestions for partial queries, auto-complete also provides high level candidate categories for suggested queries. These query-category results serve as our target data set for the QC task. The seeds for the auto-complete crawl were common terms and phrases found in the data set by McAuley et al. In addition, we used random alpha-numeric character combinations as seeds for the query crawl. A total of 535,506 query-category labels were obtained by this exercise. To ascertain the accuracy of this data, we manually evaluated 1000 randomly sampled queries from this data set. The query-category labels suggested by auto-complete had an accuracy of 98.6%. The auto-complete crawl was performed over a duration of 1 week, in December 2018. The queries in the resulting data set were mapped to 58 high level categories.

Footnote 1: http://jmcauley.ucsd.edu/data/amazon/

3.1 Data Splits
Both the source and target data sets are split into training, validation, and test sets, stratified by category. This resulted in 5,811,656 training examples for the source data, 500,000 validation examples and 500,000 test examples. The target data had 435,506 training examples, with 50,000 examples reserved for each of the validation and test sets. The target training data was also progressively sub-sampled to create smaller training sets of 50%, 20%, and 10% of the original data, each a subset of the previous sample. In Figure 1 we can see that the length of queries is similarly distributed across the 3 splits. The query-length distributions of both the validation set and the training set show a Pearson’s correlation of > 0.99 with that of the test set. Due to the use of stratified sampling, the category distributions over the three sets are similar.

[Figure 1: Query token-length distributions across target data splits. x-axis: Query length (Number of tokens), 1 to 14; y-axis: Frequency; series: Training, Validation, Test.]

3.2 Data Characteristics
Table 1 shows examples of product titles and queries from the source and target data sets, respectively. Table 2 shows the high level statistics of the source and target data sets. While the average length of a title is 9.45 tokens, queries are much shorter (3.52 tokens). This significant difference in query and title lengths poses an interesting transfer learning challenge.

Product Titles                                                    Category
Compaq 256MB 168-Pin 100Mhz DIMM SDRAM for Compaq Proliant        Electronics
EK Ekcessories 10708C-BLUE-AM Blue Jeep Visor Clip                Automotive
NHL Chicago Blackhawks Franchise Fitted Hat, Black, Extra Large   Sports & Outdoors
Sesame Street Robe with Embroidered Washcloth                     Health & Personal Care
Emerica Men’s The Westgate Skate Shoe                             Clothing, Shoes & Jewelry

Queries                                                           Category
13mm wrench                                                       tools
hip action zukes peanut butter                                    pets
nerf guns under 30 dollars                                        toys-and-games
bernaise sauce mix                                                grocery
door lever lock child proof                                       baby-products

Table 1: Examples from the source (titles) and target (queries) data sets.

                    Source       Target Data - Queries
                    (Titles)     Train      Val.      Test
Documents           6,835,398    435,506    50,000    50,000
Num. bytes          1-2000       1-69       3-69      2-69
Avg num. bytes      59.98        21.32      21.34     21.37
Token length        1-434        1-14       1-14      1-14
Avg token length    9.45         3.52       3.52      3.52

Table 2: Statistics of each part of the source and target data.

4 SYSTEM ARCHITECTURE DESCRIPTION
Recent work in NLP has shown the wide utility of Long Short-Term Memory (LSTM) architectures for transfer learning tasks [6]. Howard and Ruder used a pre-trained LSTM architecture to achieve state-of-the-art results on several text classification tasks [2].
The Balanced Pooling View (BPV) architecture, which builds on these approaches, has been shown to be effective for product taxonomy classification tasks [8].

Footnote 2 (auto-complete endpoint, referenced in Section 3): http://completion.amazon.com/api/2017/suggestions

[Figure 2: An illustration of the BPV architecture.]

The model architecture, which can be seen in Figure 2, is centered around a character-level LSTM, which is fed via an embedding. The time series output from the Recurrent Neural Network (RNN) is then summarized in 4 ways: by taking the last value as in a typical RNN architecture, and then with mean-pooling, max-pooling, and min-pooling. Those 4 summaries are concatenated and fed through a linear layer with output size equal to the number of categories. When transferring, only the output layer needs to be replaced, in order to accommodate the new category space. The embedding size, RNN width and depth, and dropout settings are all set as in [8].

On the target problem, we explore two different training styles: 1) target-only direct training and 2) transfer learning from a source model. Direct training only uses the target data, without reference to either the source model or the source data. Transfer learning uses the source model to initialize network weights, replacing the output layer to accommodate the new category set, and then otherwise proceeding as before. Adam optimization was found to be consistently better than stochastic gradient descent (SGD) and is used for all target models. Cross-entropy loss is used throughout.

Final hyper-parameters were tuned using a grid search around those initial values, varying the learning rate schedule and peak learning rate, as well as the number of training epochs for direct training. Transfer learning was fixed at 5 epochs throughout, since any increase in the number of epochs led to overfitting and an increasing validation loss. This process was performed separately for direct training and transfer learning, as well as for each of the 4 data scales.

Hyper-parameters with consistently strong validation results were then chosen for each of the two training styles. A learning rate of 0.003 was best for all variants. A linearly decreasing "burndown" schedule was better than 1cycle or a flat learning rate for transfer. Direct training was most effective with 10 epochs when trained on subsets of the target data, but better still with 20 epochs on the full 100% of the target data. Once settled, these parameters were used in 4 independent training runs for each training style and data scale. Each model was used to make predictions over the test set, and the results are based on these predictions.

5 EVALUATION
We report cross-entropy loss, accuracy, precision, recall, and F1 scores for our models. As the queries are not uniformly distributed across the categories, we use weighted precision, recall, and F1 to measure the performance of the approaches on the test data. If $P_i$, $R_i$, and $F1_i$ are the precision, recall, and F1 scores for each category $c_i$, then the corresponding weighted metrics can be calculated as:

\[ P_w = \sum_{i=1}^{K} \frac{n_i}{N} P_i \tag{1} \]
\[ R_w = \sum_{i=1}^{K} \frac{n_i}{N} R_i \tag{2} \]
\[ F1_w = \sum_{i=1}^{K} \frac{n_i}{N} F1_i \tag{3} \]

Here $K$ is the number of categories, $n_i$ is the number of test queries in category $c_i$, and $N$ is the total number of test queries.

6 RESULTS
Figure 3 shows the test loss as the amount of target data varies, for each of the two training approaches. The advantages of transfer learning are most apparent at low data scales, where it produces significantly better results. The two approaches eventually converge in performance as target data becomes fully available. Figure 4 shows the equivalent results for accuracy. In this case the performance difference is not as large, and direct training closes the gap at 50% of the target data. This corresponds to a regime in which the training loss continues to drop rapidly while validation loss levels off, which might indicate overfitting.

[Figure 3: Cross-entropy loss on the test set, with varying target data size. x-axis: Size (%) at 10, 20, 50, 100; y-axis: Test Loss; series: Direct, Transfer.]

[Figure 4: Accuracy on the test set for various data scales. x-axis: Size (%) at 10, 20, 50, 100; y-axis: Test Accuracy (%); series: Direct, Transfer.]

Table 3 shows the overall weighted precision, recall, and F1 scores for each training variant across the different target data scales. Weighted recall is equal to the accuracy reported in Figure 4.

Target Size    Source + Target          Target only
               Pw     Rw     F1w        Pw     Rw     F1w
10%            0.757  0.757  0.754      0.733  0.734  0.732
20%            0.791  0.790  0.788      0.782  0.783  0.781
50%            0.828  0.828  0.826      0.828  0.829  0.827
100%           0.852  0.852  0.851      0.862  0.861  0.860

Table 3: Comparing the performance of transfer learning and direct target-only training.

Table 4 shows the per-category results in the case when the target training data set is small (10%), for categories with at least 100 test examples. Transfer learning is able to improve F1 for nearly all categories, sometimes significantly, and for categories that were both difficult as well as easy for the directly trained model. Transfer learning was particularly helpful for rare categories. The top 6 F1 improvements (shown in bold in the original table) were achieved on the 6 categories with the fewest examples in the 10% subset of target training data. This highlights the benefit of a transfer learning approach for cold start categories and items.

Category            Source + 10% Target      10% Target only
                    P      R      F1         P      R      F1
fash-wom-shoes      0.887  0.859  0.873      0.835  0.808  0.821
pets                0.903  0.820  0.860      0.882  0.813  0.846
mobile              0.860  0.827  0.844      0.862  0.823  0.842
fash-wom-cloth      0.847  0.837  0.842      0.792  0.753  0.772
beauty              0.835  0.830  0.833      0.802  0.787  0.795
garden              0.821  0.836  0.828      0.787  0.828  0.807
fash-wom-jlry       0.755  0.895  0.819      0.733  0.777  0.754
grocery             0.784  0.831  0.807      0.750  0.779  0.764
baby-products       0.824  0.727  0.772      0.773  0.655  0.709
electronics         0.712  0.844  0.772      0.741  0.809  0.774
automotive          0.705  0.776  0.739      0.686  0.761  0.721
toys-and-games      0.738  0.732  0.735      0.698  0.693  0.695
videogames          0.782  0.691  0.734      0.800  0.740  0.769
hpc                 0.726  0.719  0.722      0.701  0.697  0.699
office-products     0.749  0.698  0.722      0.711  0.691  0.701
sports-&-fitness    0.719  0.661  0.689      0.685  0.616  0.649
arts-crafts         0.740  0.636  0.684      0.692  0.611  0.649
fash-mens-cloth     0.681  0.676  0.679      0.641  0.560  0.598
lawngarden          0.763  0.606  0.676      0.716  0.582  0.642
tools               0.678  0.670  0.674      0.651  0.668  0.660
fan-shop            0.745  0.597  0.663      0.631  0.430  0.511
mi                  0.761  0.577  0.656      0.694  0.569  0.625
outdoor-rec         0.715  0.526  0.606      0.680  0.538  0.601
industrial          0.579  0.355  0.440      0.506  0.371  0.428
appliances          0.580  0.348  0.435      0.564  0.297  0.389

Table 4: Per-category results on 10% of the target data, for categories with at least 100 test examples.

7 CONCLUSION
Our results show that product-title data is an effective pre-training source for query-taxonomy classification. When there is not much training data, transfer learning improves the quality of the final target models. Although the results converge for larger target data sets, we observe that pre-trained transfer learning models converge in fewer epochs than models trained only on the target data set.

This convergence is noteworthy and worth exploring in more detail. The implication is that, at a certain data scale, the source model does not contain any information that is more useful than that in the target data. One possible reason for this is that the model architecture can only encode so much information, and it may be the case that the full target data can saturate it. If so, then increasing the size of the pre-trained source model might lead to further improvements.

In addition, we make available a large query-category labeled data set which can facilitate additional progress in this research area. This data provides scope for research tasks such as query intent mining, query segmentation and query scoping.

REFERENCES
[1] Hal Daumé, III, Abhishek Kumar, and Avishek Saha. 2010. Frustratingly Easy Semi-supervised Domain Adaptation. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing (DANLP 2010). Association for Computational Linguistics, Stroudsburg, PA, USA, 53–59. http://dl.acm.org/citation.cfm?id=1870526.1870534
[2] Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. (2018). arXiv:1801.06146
[3] Ying Li, Zijian Zheng, and Honghua (Kathy) Dai. 2005. KDD CUP-2005 Report: Facing a Great Challenge. SIGKDD Explor. Newsl. 7, 2 (Dec. 2005), 91–99. https://doi.org/10.1145/1117454.1117466
[4] Y. Lin, A. Datta, and G. D. Fabbrizio. 2018. E-commerce Product Query Classification Using Implicit User’s Feedback from Clicks. In 2018 IEEE International Conference on Big Data (Big Data). 1955–1959. https://doi.org/10.1109/BigData.2018.8622008
[5] J McAuley, R Pandey, and J Leskovec. 2015. Inferring Networks of Substitutable and Complementary Products. In KDD 2015. 785–794.
[6] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and Optimizing LSTM Language Models. (2017). arXiv:1708.02182
[7] Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How Transferable are Neural Networks in NLP Applications?. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 479–489. https://doi.org/10.18653/v1/D16-1046
[8] M Skinner. 2018. Product Categorization with LSTMs and Balanced Pooling Views. In Proceedings of the 2018 SIGIR Workshop On eCommerce.
[9] Parikshit Sondhi, Mohit Sharma, Pranam Kolari, and Chengxiang Zhai. 2018. A taxonomy of queries for e-commerce search. In 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018. Association for Computing Machinery, Inc, 1245–1248. https://doi.org/10.1145/3209978.3210152
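As a hedged illustration of the four-way summary described in Section 4, the sketch below takes per-step RNN outputs and concatenates the last value with mean-, max-, and min-pooled values, then feeds the result through a freshly created output layer. It is a minimal sketch in plain Python, not the authors' implementation: the function names (`balanced_pooling_view`, `new_output_layer`, `linear`), the toy inputs, the zero initialization, and the omission of a bias term are all assumptions made for illustration.

```python
from typing import List, Sequence


def balanced_pooling_view(outputs: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate last, mean, max, and min summaries of per-step RNN outputs."""
    if not outputs:
        raise ValueError("need at least one time step")
    steps = len(outputs)
    per_unit = list(zip(*outputs))  # regroup the values by hidden unit
    last = list(outputs[-1])
    mean = [sum(values) / steps for values in per_unit]
    highest = [max(values) for values in per_unit]
    lowest = [min(values) for values in per_unit]
    return last + mean + highest + lowest


def new_output_layer(num_categories: int, num_features: int) -> List[List[float]]:
    """Hypothetical helper: fresh weights (zeros, for illustration) for a new category space."""
    return [[0.0] * num_features for _ in range(num_categories)]


def linear(features: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """One score per category: dot product of each weight row with the features (bias omitted)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


# Toy stand-in for LSTM outputs: 3 time steps (characters), hidden width 2.
rnn_outputs = [[0.0, 1.0], [2.0, -1.0], [4.0, 3.0]]
summary = balanced_pooling_view(rnn_outputs)

# Transfer keeps everything up to `summary` and swaps only the output layer,
# here resized to the 58 target categories mentioned in the paper.
target_weights = new_output_layer(num_categories=58, num_features=len(summary))
scores = linear(summary, target_weights)

print(summary)
print(len(scores))
```

The summary has four times the hidden width regardless of input length, which is why only the final linear layer depends on the number of categories and can be replaced on its own when moving from the source taxonomy to the target one.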
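The weighted metrics of Section 5 (Equations 1 to 3) can be sketched as follows. This is an illustrative plain-Python version under stated assumptions, not the evaluation code used in the paper: the function name `weighted_metrics`, the toy labels, and the convention that precision and F1 are 0 when their denominator is 0 are all choices made here. Each category's weight is its share of the true test labels, which is also why weighted recall coincides with accuracy.

```python
from collections import Counter
from typing import Dict, Sequence


def weighted_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Support-weighted precision, recall, and F1 over the categories in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    total = len(y_true)                       # N
    support = Counter(y_true)                 # n_i per category
    predicted = Counter(y_pred)               # how often each category was predicted
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    p_w = r_w = f1_w = 0.0
    for category, n_i in support.items():
        hits = correct[category]
        # Assumed convention: precision is 0 if the category was never predicted.
        precision = hits / predicted[category] if predicted[category] else 0.0
        recall = hits / n_i
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = n_i / total
        p_w += weight * precision
        r_w += weight * recall
        f1_w += weight * f1
    return {"precision": p_w, "recall": r_w, "f1": f1_w}


# Toy labels using category names that appear in the paper's tables.
y_true = ["tools", "tools", "tools", "pets", "pets", "grocery"]
y_pred = ["tools", "tools", "pets", "pets", "grocery", "grocery"]

metrics = weighted_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(metrics)
print(accuracy)
```

Because each recall term is the number of correct predictions for a category divided by n_i, multiplying by n_i / N and summing leaves the total number of correct predictions divided by N, which is the accuracy.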