<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>E-commerce Query Classification Using Product Taxonomy Mapping: A Transfer Learning Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Surya Kallumadi</string-name>
          <email>surya@ksu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The Home Depot</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>1245</fpage>
      <lpage>1248</lpage>
      <abstract>
        <p>In web search, query classification (QC) is used to map a query to a user's search intent. In the e-commerce domain, users' product search queries can be broadly categorized into product-specific queries and category-specific queries [9]. In both cases, accurate classification of queries helps identify the right product categories from which relevant products can be retrieved. Thus, mapping a query to a pre-defined product taxonomy is an important step in the e-commerce query understanding pipeline. A typical e-commerce website has thousands of categories, and curating a labeled data set for query classification is expensive, time-consuming, and labor intensive. In addition, product search queries are short, and the vocabulary changes over time as the catalogue evolves. Reducing the effort of generating query-category labels would save time and resources. In this work we show how an existing product-taxonomy mapping can improve query classification, and reduce the need for labeled data, using transfer learning. Our results demonstrate that such an approach can match, and often exceed, the performance of direct training with a smaller computational budget. We further explore how performance varies with the amount of available training data, and show that transfer learning is most useful when the target data set size is small. In addition, we make available a large data set of 535,506 unique labeled e-commerce queries, mapped over 58 categories. The results and transfer learning approaches presented in this work can act as strong baselines for this collection and task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Clustering and classification;</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>In the e-commerce domain query understanding can have a
significant impact on user satisfaction. An incorrectly interpreted query
can lead to search abandonment by the user, resulting in lower
conversion rates. E-commerce queries are usually short and lack
linguistic structure, and they can be ambiguous as a result. For
example, the query ‘battery lawn tractor’ can be interpreted
as ‘battery for lawn tractor’ or ‘battery operated lawn
tractor’.</p>
      <p>In product search, the objective of query classification is to map
a user query to a pre-defined product category. QC can improve the
relevance of search results while preserving the recall. A typical
e-commerce site such as Amazon.com can have millions of products,
and thousands of product categories of various granularities.
Curating a query-category labeled data set with good coverage over
all the categories is expensive, labor intensive, and can take a long
time. Approaches that reduce the effort needed to categorize
search queries can therefore make QC significantly more practical.
In this work, we propose a transfer learning approach for QC
that uses product titles. As the products in the domain are mapped
to a well-defined product taxonomy, this mapping can be
exploited to improve QC and reduce the need for labeled data.</p>
      <p>
        Transfer learning has proven to be an effective technique to
improve the performance of various tasks in computer vision and
natural language processing (NLP) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The goal of transfer learning
is to utilize knowledge present within a source domain to improve
a task within a target domain. Neural network and deep learning
based transfer learning approaches have been shown to be quite
useful to improve the performance of a wide range of target tasks in
NLP [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. To demonstrate transfer learning for QC in the e-commerce
domain, we use Amazon.com titles as the source data set [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and
queries obtained by crawling Amazon.com’s auto-complete service
as the target data set.
      </p>
      <p>Academic research on the e-commerce query classification task
has been limited by the lack of availability of labeled data.
Through this work, and the query-category data set made available,
we hope to facilitate progress in this research area. In addition
to the introduction of a new data set, our contributions are as
follows: 1) We present a methodology for domain-specific transfer
learning, in which the source model is tuned as a classifier on a
similar problem. 2) We demonstrate that such an approach can be
leveraged to speed up training and improve results when compared
to direct training. 3) We explore the impact of target data size on
both direct and transferred models, showing that the advantage of
transfer learning over direct training grows as the target training
data shrinks.</p>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORK</title>
      <p>
        In the query classification challenge organized as the ACM KDD
Cup 2005 competition, the task was to categorize 800,000 web queries
into 67 predefined categories [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The data set for this challenge
contained 111 queries with category mappings, and the queries in
the test data set could be tagged with up to 5 categories. The submissions
were evaluated on an 800-query subset of the complete data set.
This competition highlighted the challenge of assigning labels to
queries.
      </p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption><p>Examples from the source (titles) and target (queries) data sets.</p></caption>
        <table>
          <thead><tr><th>Product Titles</th><th>Category</th></tr></thead>
          <tbody>
            <tr><td>Compaq 256MB 168-Pin 100Mhz DIMM SDRAM for Compaq Proliant</td><td>Electronics</td></tr>
            <tr><td>EK Ekcessories 10708C-BLUE-AM Blue Jeep Visor Clip</td><td>Automotive</td></tr>
            <tr><td>NHL Chicago Blackhawks Franchise Fitted Hat, Black, Extra Large</td><td>Sports &amp; Outdoors</td></tr>
            <tr><td>Sesame Street Robe with Embroidered Washcloth</td><td>Health &amp; Personal Care</td></tr>
            <tr><td>Emerica Men’s The Westgate Skate Shoe</td><td>Clothing, Shoes &amp; Jewelry</td></tr>
          </tbody>
        </table>
        <table>
          <thead><tr><th>Queries</th><th>Category</th></tr></thead>
          <tbody>
            <tr><td>13mm wrench</td><td>tools</td></tr>
            <tr><td>hip action zukes peanut butter</td><td>pets</td></tr>
            <tr><td>nerf guns under 30 dollars</td><td>toys-and-games</td></tr>
            <tr><td>bernaise sauce mix</td><td>grocery</td></tr>
            <tr><td>door lever lock child proof</td><td>baby-products</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Lin et al. propose using implicit feedback from user clicks as a
signal to collect training data for QC in the e-commerce domain [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We
consider this work to be complementary to the transfer learning
approach we propose in this paper. Leveraging user click-stream
data and the product hierarchy together can improve
the overall system performance. Click-stream data is useful when a
sufficient amount of user behavior has been observed for a category,
but it fails for new categories and items. The transfer learning
approach exploiting product titles does not suffer from item and
category cold start.
      </p>
      <p>
        Sondhi et al. identify a taxonomy of e-commerce query intents,
based on search logs and user behavior data [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This work identifies
five categories of e-commerce queries based on user search behavior:
1) Shallow Exploration Queries, 2) Targeted Purchase Queries, 3)
Major-Item Shopping Queries, 4) Minor-Item Shopping Queries,
and 5) Hard-Choice Shopping Queries. This paper highlights the
complexity of user intent in the e-commerce domain, and proposes
techniques for leveraging these insights.
      </p>
    </sec>
    <sec id="sec-3b">
      <title>DATA COLLECTION AND DATA SET</title>
      <p>
        Domain adaptation and transfer learning usually require two data
sets: a source data set and a target data set. For supervised tasks
such as QC, transfer learning helps in scenarios where we have
very little training data in the target data set and plenty of data in
the source data set. The source and target data sets should also have
similar characteristics. In this work, because product titles and queries
share a similar vocabulary, we chose product titles as the source
data set. McAuley et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] provide a crawl of Amazon.com’s product
pages including 142.8 million reviews, 9.43 million products, and
6.83 million titles (http://jmcauley.ucsd.edu/data/amazon/). We use
the titles available in this data set as the source data for transfer learning.
      </p>
      <p>As no product-query data sets are publicly available for QC, we
leveraged Amazon.com’s auto-completion service
(http://completion.amazon.com/api/2017/suggestions) to generate
e-commerce queries. In addition to providing suggestions for partial
queries, auto-complete also provides high-level candidate categories
for the suggested queries. These query-category results serve as our
target data set for the QC task. The seeds for the auto-complete crawl
were common terms and phrases found in the data set of McAuley et al.
In addition, we used random alpha-numeric character combinations
as seeds for the query crawl. A total of 535,506 query-category
labels were obtained by this exercise. To ascertain the accuracy of
this data, we manually evaluated 1,000 randomly sampled queries
from this data set. The query-category labels suggested by
auto-complete had an accuracy of 98.6%. The auto-complete crawl was
performed over a duration of one week in December 2018. The queries
in the resulting data set were mapped to 58 high-level categories.
[Figure 1: distribution of query lengths across the training,
validation, and test splits.]</p>
    </sec>
    <sec id="sec-4">
      <title>Data Splits</title>
      <p>Both the source and target data sets are split into training, validation,
and test sets, stratified by category. This resulted in 5,811,656
training examples for the source data, with 500,000 validation examples
and 500,000 test examples. The target data had 435,506 training
examples, with 50,000 examples reserved for each of the validation
and test sets. The target training data was also progressively sub-sampled
to create smaller training sets of 50%, 20%, and 10% of the original
data, each a subset of the previous sample. In Figure 1 we can
see that the length of queries is similarly distributed across the 3
splits. Both the validation set and the training set show a Pearson’s
correlation of &gt; 0.99 with the test set. Due to the use of stratified
sampling, the category distributions over the three sets are similar.</p>
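      <p>The stratified split and progressive sub-sampling described above can be sketched in a few lines. The following is an illustrative stand-alone sketch (the function names and toy data are ours, not the authors' code); the key property is that each smaller training set is a prefix, and therefore a subset, of the larger one.</p>
      <preformat>
```python
import random
from collections import defaultdict

def stratified_split(examples, frac_train, frac_valid, seed=0):
    """Split (text, category) pairs so each category keeps the same
    proportions across the train/validation/test sets."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for ex in examples:
        by_cat[ex[1]].append(ex)
    train, valid, test = [], [], []
    for cat, items in by_cat.items():
        rng.shuffle(items)
        n_train = int(len(items) * frac_train)
        n_valid = int(len(items) * frac_valid)
        train.extend(items[:n_train])
        valid.extend(items[n_train:n_train + n_valid])
        test.extend(items[n_train + n_valid:])
    return train, valid, test

def nested_subsamples(train, fractions=(0.5, 0.2, 0.1), seed=0):
    """Progressively sub-sample the training set so that each smaller
    set is a subset of the previous one (prefixes of one shuffle)."""
    rng = random.Random(seed)
    order = train[:]
    rng.shuffle(order)
    return {f: order[:int(len(order) * f)] for f in fractions}
```
      </preformat>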
    </sec>
    <sec id="sec-5">
      <title>SYSTEM ARCHITECTURE DESCRIPTION</title>
      <p>
        Recent work in NLP has shown the wide utility of Long
Short-Term Memory (LSTM) architectures for transfer learning tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Howard and Ruder used a pre-trained LSTM architecture to achieve
state-of-the-art results on several text classification tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The
Balanced Pooling View (BPV) architecture, which builds on these
approaches, has been shown to be effective for product taxonomy
classification tasks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        The model architecture, which can be seen in Figure 2, is centered
around a character-level LSTM, which is fed via an embedding layer.
The time series output from the Recurrent Neural Network (RNN) is
then summarized in 4 ways: by taking the last value as in a typical
RNN architecture, and then with mean-pooling, max-pooling, and
min-pooling. Those 4 summaries are concatenated and fed through
a linear layer with output size equal to the number of categories.
When transferring, only the output layer needs to be replaced, in
order to accommodate the new category space. The embedding size,
RNN width and depth, and dropout settings are all set as in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
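      <p>The four-way summarization at the heart of this architecture can be illustrated directly. Below is a minimal numpy sketch of the pooling and output steps (not the authors' implementation); the sequence length and hidden width are arbitrary toy values, while 58 matches the number of target categories.</p>
      <preformat>
```python
import numpy as np

def balanced_pooling(hidden_states):
    """Summarize per-character LSTM outputs four ways and concatenate.

    hidden_states: (seq_len, hidden_dim) array of RNN outputs.
    Returns a vector of length 4 * hidden_dim: the last state (as in
    a typical RNN), then mean-, max-, and min-pooling over time.
    """
    last = hidden_states[-1]
    mean = hidden_states.mean(axis=0)
    mx = hidden_states.max(axis=0)
    mn = hidden_states.min(axis=0)
    return np.concatenate([last, mean, mx, mn])

# A single linear layer then maps the summary to category logits.
rng = np.random.default_rng(0)
h = rng.normal(size=(12, 8))          # 12 characters, hidden width 8
summary = balanced_pooling(h)         # length 4 * 8 = 32
weights = rng.normal(size=(58, 32))   # one row per target category
logits = weights @ summary            # scores over the taxonomy
```
      </preformat>
      <p>Because only the final linear layer depends on the category count, swapping the output layer is all that is needed when transferring to a new category space.</p>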
      <p>On the target problem, we explore two different training styles:
1) target-only direct training and 2) transfer learning from a source
model. Direct training only uses the target data, without reference
to either the source model or the source data. Transfer learning
uses the source model to initialize network weights, replacing the
output layer to accommodate the new category set, and then
otherwise proceeding as before. Adam optimization was found to be
consistently better than stochastic gradient descent (SGD) and is
used for all target models. Cross-entropy loss is used throughout.</p>
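      <p>Viewed as parameter surgery, transfer initialization copies every layer except the output head, which is re-created to fit the new category set. A hypothetical sketch with illustrative parameter names (not the authors' implementation):</p>
      <preformat>
```python
import numpy as np

def init_from_source(source_params, n_target_categories, seed=0):
    """Initialize a target model from a trained source model.

    source_params: dict of parameter name to numpy array. Every layer
    is copied except the output head, which is re-created so its size
    matches the new category space. The "output." prefix is an
    illustrative naming convention, not the paper's actual code.
    """
    rng = np.random.default_rng(seed)
    target = {name: arr.copy() for name, arr in source_params.items()
              if not name.startswith("output.")}
    in_dim = source_params["output.weight"].shape[1]
    target["output.weight"] = rng.normal(scale=0.01,
                                         size=(n_target_categories, in_dim))
    target["output.bias"] = np.zeros(n_target_categories)
    return target
```
      </preformat>
      <p>Training then proceeds exactly as in the direct case, but starting from the transferred weights rather than a random initialization.</p>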
      <p>Final hyper-parameters were tuned using a grid search around
those initial values, varying the learning rate schedule and peak
learning rate, as well as the number of training epochs for direct
training. Transfer learning was fixed at 5 epochs throughout, since
any increase in the number of epochs led to overfitting and an
increasing validation loss. This process was performed separately
for direct training and transfer learning, as well as for each of the 4
data scales.</p>
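      <p>Among the schedules searched, one candidate is a linearly decreasing "burndown" schedule. One plausible reading of such a schedule is sketched below; the exact shape and arguments are our assumptions, though the 0.003 peak matches the best learning rate reported.</p>
      <preformat>
```python
def burndown_lr(step, total_steps, peak_lr=0.003):
    """Linearly decreasing "burndown" learning-rate schedule: start at
    the peak rate and decay linearly to zero over training.

    This is an illustrative reading, not the authors' exact schedule.
    """
    remaining = 1.0 - step / float(total_steps)
    return peak_lr * max(remaining, 0.0)
```
      </preformat>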
      <p>Hyper-parameters with consistently strong validation results
were then chosen for each of the two training styles. A learning rate
of 0.003 was best for all variants. A linearly decreasing "burndown"
schedule was better than 1cycle or a flat learning rate for transfer.
Direct training was most effective with 10 epochs when trained on
subsets of the target data, but better still with 20 epochs on the
full target data. Once settled, these parameters were used in
4 independent training runs for each training style and data scale.
Each model was used to make predictions over the test set, and the
results are based on these predictions.</p>
      <p>We report cross-entropy loss, accuracy, precision, recall, and F1
scores for our models. As the queries are not uniformly distributed
across the categories, we use weighted precision, recall, and F1 to
measure the performance of the approaches on the test data. If P_i,
R_i, and F1_i are the precision, recall, and F1 scores for each category
c_i, n_i is the number of test examples in category c_i, and N is the
total number of test examples, then the weighted metrics over the
K categories can be calculated as:</p>
      <p>P_w = Σ_{i=1}^{K} (n_i / N) P_i   (1)</p>
    </sec>
    <sec id="sec-6">
      <title>RESULTS</title>
      <p>R_w = Σ_{i=1}^{K} (n_i / N) R_i   (2)</p>
      <p>F1_w = Σ_{i=1}^{K} (n_i / N) F1_i   (3)</p>
      <p>
Figure 3 shows the results for test loss as the amount of target data
varies, for each of the two training approaches. The advantages
of transfer learning are most apparent at low data scales, where it
produces significantly better results. The two approaches eventually
converge in performance as target data becomes fully available.
Figure 4 shows the equivalent results for accuracy. In this case the
performance difference is not as large, and direct training closes
the gap at 50% of the target data. This corresponds to a regime in
which the training loss continues to drop rapidly while validation
loss levels off, which might indicate overfitting.</p>
      <p>Table 3 shows the overall weighted precision, recall, and F1
scores for each training variant across the different target data
scales. Recall is equal to the accuracy metric reported in Figure 4.
Table 4 shows the per-category results in the case when the target
training data set is small (10%), for categories with at least 100 test
examples. Transfer learning is able to improve F1 for nearly all
categories, sometimes significantly, both for categories that were
difficult and for those that were easy for the directly trained
model. Transfer learning was particularly helpful for rare categories.
The top 6 F1 improvements (bolded) were achieved on the 6 categories
with the fewest examples in the 10% subset of target training data.
This highlights the benefit of a transfer learning approach for
cold-start categories and items.</p>
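      <p>The weighted metrics of Equations (1)-(3) can be computed directly from per-category counts; a minimal self-contained sketch (toy labels, not the evaluation code used in the paper):</p>
      <preformat>
```python
def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1, matching
    Eqs. (1)-(3): each per-category score is weighted by n_i / N."""
    cats = set(y_true)
    n_total = len(y_true)
    p_w = r_w = f_w = 0.0
    for c in cats:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        pred_c = sum(1 for p in y_pred if p == c)
        true_c = sum(1 for t in y_true if t == c)
        prec = tp / pred_c if pred_c else 0.0
        rec = tp / true_c if true_c else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = true_c / n_total   # n_i / N
        p_w += weight * prec
        r_w += weight * rec
        f_w += weight * f1
    return p_w, r_w, f_w
```
      </preformat>
      <p>Note that the weighted recall algebraically reduces to overall accuracy, consistent with the equivalence noted above for Figure 4.</p>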
    </sec>
    <sec id="sec-7">
      <title>CONCLUSION</title>
      <p>Our results show that product-title data is an effective pre-training
source for query-taxonomy classification. When little target training
data is available, transfer learning improves the quality of the final
target models. Although the results converge for larger target data
sets, we observe that pre-trained transfer learning models converge
in fewer epochs than models trained only on the target data set.</p>
      <p>This convergence is noteworthy and worth exploring in more
detail. The implication is that, at a certain data scale, the source
model does not contain any information that is more useful than
that in the target data. One possible reason for this is that the
model architecture can only encode so much information, and it
may be the case that the full target data can saturate it. If so, then
increasing the size of the pre-trained source model might lead to
further improvements.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Hal</given-names>
            <surname>Daumé</surname>
          </string-name>
          , III,
          <string-name>
            <given-names>Abhishek</given-names>
            <surname>Kumar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Avishek</given-names>
            <surname>Saha</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Frustratingly Easy Semi-supervised Domain Adaptation</article-title>
          .
          <source>Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing (DANLP 2010)</source>
          . Association for Computational Linguistics,
          Stroudsburg, PA, USA,
          <fpage>53</fpage>
          -
          <lpage>59</lpage>
          . http://dl.acm.org/citation.cfm?id=1870526.1870534
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jeremy</given-names>
            <surname>Howard</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Universal Language Model Fine-tuning for Text Classification</article-title>
          . (
          <year>2018</year>
          ). arXiv:1801.06146
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ying</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zijian</given-names>
            <surname>Zheng</surname>
          </string-name>
          , and
          <string-name>
            <surname>Honghua (Kathy) Dai</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>KDD CUP-2005 Report: Facing a Great Challenge</article-title>
          .
          <source>SIGKDD Explor. Newsl. 7</source>
          ,
          <issue>2</issue>
          (Dec.
          <year>2005</year>
          ),
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          . https://doi.org/10.1145/1117454.1117466
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Datta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Fabbrizio</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>E-commerce Product Query Classification Using Implicit User's Feedback from Clicks</article-title>
          .
          <source>In 2018 IEEE International Conference on Big Data (Big Data)</source>
          .
          <fpage>1955</fpage>
          -
          <lpage>1959</lpage>
          . https://doi.org/10.1109/BigData.2018.8622008
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R</given-names>
            <surname>Pandey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Inferring Networks of Substitutable and Complementary Products</article-title>
          .
          <source>In KDD 2015</source>
          .
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Merity</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nitish Shirish</given-names>
            <surname>Keskar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Richard</given-names>
            <surname>Socher</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Regularizing and Optimizing LSTM Language Models</article-title>
          . (
          <year>2017</year>
          ).
          <source>arXiv:1708.02182</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Lili</given-names>
            <surname>Mou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhao</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rui</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ge</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yan</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lu</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhi</given-names>
            <surname>Jin</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>How Transferable are Neural Networks in NLP Applications?</article-title>
          .
          <source>In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics</source>
          ,
          <fpage>479</fpage>
          -
          <lpage>489</lpage>
          . https://doi.org/10.18653/v1/D16-1046
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M</given-names>
            <surname>Skinner</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Product Categorization with LSTMs and Balanced Pooling Views</article-title>
          .
          <source>In Proceedings of the 2018 SIGIR Workshop On eCommerce.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Parikshit</given-names>
            <surname>Sondhi</surname>
          </string-name>
          , Mohit Sharma, Pranam Kolari, and
          <string-name>
            <given-names>Chengxiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A taxonomy of queries for e-commerce search</article-title>
          .
          <source>In The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval (SIGIR '18)</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>