System Design of Extreme Multi-label Query Classification using a Hybrid Model Xianjing Liu∗ Hejia Zhang† ∗ eBay Search Princeton University xianjliu@ebay.com hejiaz@princeton.edu Mingkuan Liu Alan Lu eBay Search eBay Search mingkuan@gmail.com alalu@ebay.com ABSTRACT product listing titles, or product listing descriptions to merchandise Extreme multi-label classification is a rapidly growing research categories. A typical merchandise category taxonomy is a multi- area with many applications. In this paper we propose a system leveled tree. Among these classification tasks, query categorization design of extreme multi-label text classification (XMTC) on query is much more challenging than traditional document classification classification in the e-commerce domain. Search query classifica- tasks. First, queries are usually very short. The average length of tion is more challenging than conventional document classification queries by eBay users is about four tokens. Second, many queries because queries are usually very short and often ambiguous. We de- are ambiguous. For example, the query ’Nike’ can be classified to sign a hybrid model that uses a deep neural network for long queries many categories such as men’s shoes, men’s clothing, women’s shoes, and uses a Naive Bayes model for short queries. We formulate and kids’ clothing, Golf, and Yoga. ’Father’s day gift’ can be categorized to apply new data augmentation techniques and create new evalua- home and living, electronic, art, clothing. The relevant search results tion metrics that are more suitable for the extreme multi-label task for such queries can span many leaf categories from multiple meta in e-commerce. We also design end-to-end system level evaluation categories (categories at the root level). Third, the label set is very methods to address the challenge in human judgment due to the large when classifying the query to the leaf level of categories. extremely large label space. We compare our deep neural network There are, for example, over 20k leaf categories at the eBay U.S. site model with the state-of-the-art method on our real e-commerce alone. data and observe about a 15% improvement in the F1 score. The It is natural to model query categorization as a multi-class classi- end-to-end system evaluation results show that our new system fication problem. The traditional binary or multi-class classification improves query classification performance for a variety of query problems, where one and only one label belongs to each document, sets. In particular, for the torso and tail queries in e-commerce, we have been studied heavily in the literature [4, 5, 6]. However, we ob- see 0.3% and 1.1% improvements in the NDCG score. serve that, as mentioned above, many queries have more than one ground-truth label, so we consider query categorization as a multi- KEYWORDS label problem. Multi-label classification is fundamentally different from binary and multi-class classifications. A multi-label classifier multi-label, extreme text classification, query classification, neural assigns the most relevant subset of labels to each sample while the networks, CNN-RNN, deep learning, e-commerce label set in a multi-class classifier is treated as independent vari- ACM Reference Format: ables, and the dependencies among labels are not leveraged. That Xianjing Liu, Hejia Zhang, Mingkuan Liu, and Alan Lu. 2019. System De- is, a multi-class classifier assumes that the class labels are mutually sign of Extreme Multi-label Query Classification using a Hybrid Model. In exclusive [7]. Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 The query categorization in our task is also an extreme classifi- eCom), 8 pages. cation problem since the label set contains a tremendous number of labels. Extreme classification is a rapidly growing research area 1 INTRODUCTION dealing with multi-class and multi-label problems with a very large In e-commerce, there are many cases where we need to classify text set of labels [8, 9, 10, 11, 12]. Combining the properties of our to a large label space [1, 2, 3], such as classifying search queries, query categorization, we treat it as an extreme multi-label (XML) ∗Both authors contributed equally to the paper, in alphabetical order of last names. classification problem. †This work is done when Hejia Zhang is intern at eBay. In this paper, we propose a system design to tackle query cate- gorization as an XML problem. We design a hybrid model where Copyright © 2019 by the paper’s authors. Copying permitted for private and academic we combine deep neural network (DNN) model with a Naive Bayes purposes. model. In particular, the model handles short queries using a Naive In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at Bayes model and handles long queries with DNN. The motivation http://ceur-ws.org for using a hybrid model is based on our experiment results, de- tailed in Section 4.3 Table4, where we found Naive Bayes model has better performance in the short queries, while the DNN model SIGIR 2019 eCom, July 2019, Paris, France Liu, Zhang, et al. outperforms the Naive Bayes model in long queries. We also apply XML-CNN and shows that XML-CNN is the SOTA, so we will only new data augmentation techniques and create new evaluation met- compare our model with XML-CNN in our paper. rics for XML problems that are more suitable in the e-commerce setting. We compare our system with the state-of-the-art method 3 PROPOSED SYSTEM on real data in e-commerce and show that our system enhances the Here we propose a system design for the extreme multi-label query performance of query categorization in a few success measures. classification with a hybrid model. The hybrid model is a combi- nation of a DNN model and a Naive Bayes model where the Naive 2 RELATED WORK Bayes classifier is called when the query length is 1 (token), and the DNN model is used otherwise. We experimented with various cut- 2.1 Naive Bayes text classification model off query length thresholds, and a cutoff length of 1 gives the best Naive Bayes classifiers are used widely in text classification tasks. overall performance. Our DNN model for queries, namely XML-Q, These classifiers belong to the probabilistic classification family. is a CNN-RNN model with differentiable F1 loss. It is an extension They are based on the Bayes theorem and assume that the features of the model XML_CNN [7] that is more suitable for the query are mutually independent. Despite its oversimplified assumption classification. on feature independence, Naive Bayes achieves competitive per- We also propose three techniques for training data augmentation formance in many complex applications [13, 14, 15]. The paper in to address the skewness of the data to improve model performance. [16] extended the Naive Bayes classifier to utilize the structural The standard definition of evaluation metric precision at k is flawed characteristics of e-catalogs for e-commerce and achieved improved for the multi-label task when the total number of true label for accuracy. The study of [17] proposed a semantic Naive Bayes clas- the query is less than k, more details in Section 4.2.1. We therefore sifier that incorporates the semantic feature of the document to design a new evaluation metric PR@k for the multi-label classifier improve the conventional Naive Bayes classifier. Because of the in Section 4.2.1. simplicity, effectiveness and excellent performance, Naive Bayes The challenge of using human judgment to evaluate the XML classifier has been widely used in the industry, including the e- classifiers directly is that the label space is too large for the judge commerce domain. Therefore, we use Naive Bayes classifier as one to categorize the query to the entire label space according to the of the baselines, although it is not specifically designed for the judgment guidelines. Thus, we propose an end-to-end evaluation XML classification. Naive Bayes performs very well with a small in the system/application level. We implement the query classifier amount of training data [18] that most other classifiers would find in the search system and design an offline evaluation pipeline us- insufficient, especially deep neural network. However, the deep ing golden data and the side-by-side human judgment of search neural network usually outperforms the Naive Bayes where a large performance in 4.2.2 and Section 4.2.3. The end-to-end system per- amount of training data has been provided like in e-commerce. formance such as search relevance and ranking quality is evaluated instead. 2.2 Models for XML 2.2.1 Deep neural network models for XML. There are many ex- 3.1 Hybrid Model isting methods for the XML text classification problem. Among We propose a hybrid model where the Naive Bayes model is used those methods, the neural network models, such as FastText [19], when the query length is 1, and the XML-Q model is called when CNN-Kim [20], Bow-CNN [21], and XML-CNN [7], constitute a big query length is larger than 1. The design is based on the comparison family. These methods design different neural network structures of these two classifiers in the evaluation results shown in Table 4. and directly map input text into the label space. XML-CNN is the The Naive Bayes model has better performance for the head queries state-of-the-art (SOTA) in the XML text classification task as shown where the query length is mostly 1-2, while the XML-Q classifier in the paper [7]. It passes the document through convolutional fil- is better for the torso and tail queries where query lengths are ters and dynamic max-pooling layers to extract features and maps usually longer than the head queries. One possible explanation is those features to the label space with two fully-connected layers. that Naive Bayes is based on the joint probability of n-gram tokens. Despite the large amount of DNN methods in the XML text clas- The probability would vanish as the length of query increases. On sification field, there are few DNN models designed for the XML the other hand, the XML-Q model has higher capacity to learn the query categorization problem. Since queries are much shorter than fine-grained patterns in long queries. Also, the large number of the samples of text datasets used in most of those DNN methods, it noisy training samples makes the XML-Q model robust by reducing is not desirable to apply those methods directly on query datasets. the variance of the XML-Q model. An additional explanation is discussed below. 2.2.2 Other models for XML. There are also some other methods for The current setting of cutoff query length for the hybrid model the XML text classification problem. They can be roughly divided is 1. Other choices of cutoff length such as 2,3 and 4 have also been into two groups. The first group is target-embedding methods, such evaluated in our study, but the cutoff length of 1 gives the best as SLEEC [22]. These methods project the label vectors to a low- performance. Note that cutoff length of 1 means the function of dimensional space to deal with the sparsity issue of the label space. Naive Bayes classifier is similar to a look-up table, which gives The second group is tree-based methods, such as FastXML [23]. a good performance when plenty of user click data is accumu- These methods have a hierarchical structure similar to decision lated, and a high-quality look-up table is generated. Meanwhile, trees. The XML-CNN paper [7] also compares these methods to the performance of XML-Q is not ideal because of the limited input System Design of Extreme Multi-label Query Classification using a Hybrid Model SIGIR 2019 eCom, July 2019, Paris, France information when the query length is 1. This analysis could also be 2, which implies the bi-gram information is more important in the a reason why Naive Bayes outperforms XML-Q for head queries query classification. The output from the convolutional layer then with a length of 1. passes through a batch-normalization layer [24] and a ReLU layer and is fed into a layer with gated recurrent units (GRU) [25]. The 3.2 Naive Bayes model GRU layer also consists of three groups of units with 512, 1024, 512 Our Naive Bayes model is a production model that is generated units, taking in the outputs from the three groups of filters from the using user clicked data. It is one of the baseline models in our study. last layer respectively. The output from the GRU layer then passes For each query, the clicked product listings are logged, and the through a layer-normalization layer [26] and is concatenated. The related categories for the listings are aggregated and grouped. Then concatenated length-2048 representation is then fed into two fully- a table that maps queries to categories and click counts is generated. connected layers where the number of units in the hidden layer is The query has been further broken down to uni-gram and bi-gram 80% of the number of units in the concatenated layer. The output tokens to generate a table that contains tokens, categories, and from the last fully-connected layer is passed through a sigmoid click counts. The noise in the table is reduced by removing rows function, and the output length is equals to the number of total where the click counts are below a threshold where the threshold categories. is a hyperparemeter we tuned. The bag of words and multinomial 3.3.2 Loss function. Another difference between the query dataset Naive Bayes methods are applied in the model. The probability of and document dataset in [7] is that the number of categories, or the query given the category can be calculated as the production labels, per query is 1.5, which is much smaller than the number of from the likelihoods of bi-grams or uni-grams in the query. labels per document. The total number of categories of queries is about 20, 000, which is similar to, or greater than the number of 3.3 Deep Neural Network Model categories in document datasets. Therefore, each query, or sample, Our DNN model for long queries is called XML-Q (XML for queries). has a much sparser output space. This sparsity leads to an imbal- It is based on the XML-CNN [7], which is designed for XML clas- anced classes issue. That is, the output tends to be all close to zero sification of the documents. In XML-Q, we made modifications to if no action is taken to alleviate this issue because the vast majority the XML-CNN model architecture and loss function due to the dif- of the ground-truth labels are zero. ferences between query datasets and the document dataset used in To solve this problem, instead of using the cross-entropy loss [7]. as in [7], we design a new soft F1 loss. According to the definition in [27], the F1 score in multi-label classification is calculated for 3.3.1 Architecture. Our queries, with an average length of four each sample and is defined as the harmonic mean of precision and tokens, as shown in Table 1, are much shorter compared with the recall. The F1 score is a common metric for classification problem document dataset which has an average length of hundreds of to- with imbalanced classes. It is desirable to optimize the F1 score kens. our queries contain far less information than document and directly in our case. However, the F1 score is not differentiable, need a richer feature extractor. While the XML-CNN uses only a thus it is hard to use it as a loss function in the DNN. Since the convolutional layer to extract features from the documents, XML- harmonic averaging operation itself is differentiable, we examine Q has an extra recurrent units layer after the convolutional layer the precision and recall formula to identify the non-differentiable to extract more sequential information. This recurrent layer also part. replaces the functionality of the pooling layer in XML-CNN. In our CNN+RNN structure, the convolutional layer extracts n-gram TP TP Precision = = information and the recurrent layer further extracts sequential in- TP + FP Number of positive labels in prediction formation, thus it is a stronger feature extractor compared with TP TP Recall = = CNN or RNN alone. This observation is consistent with our experi- TP + FN Number of positive labels in ground-truth ment results that the CNN+RNN structure has a better prediction , where TP is true positive, FP is false positive, and FN is false performance. negative. We observe that thresholds are applied in deriving the The architecture of XML-Q is summarized in Figure 1. All of predicted labels and TP, and the non-differentiability comes from the hyper-parameters in the model architecture are tuned with a this thresholding step. Therefore, we remove the thresholding and validation set. The input query is truncated or padded to 10 tokens estimate TP and number of positive labels in prediction with a soft, where the last token is fixed to be . Based on Table 1, 90% differentiable formula defined as: of our queries have length less than 6, so we are not losing much L information in truncation. If the query is shorter than 10 tokens, Õ soft TP = tl · σ (yl ) tokens are padded to the beginning of the query to comply l =1 with the preference of recurrent units. The tokens are then mapped soft num. of positive predictions = ∥σ (y)∥22 to learn-able length-256 word embeddings. The embeddings are passed through a convolutional layer which is arranged similar to , where L is thenumber of categories, t is the ground-truth vector, XML-CNN [7]. The convolutional layer consists of three groups of y is the logits vector, σ is the sigmoid function, and subscript l convolutional filters. The three groups have 512, 1024, 512 filters means the l’th category. By replacing the TP and number of positive with filter size 1, 2, 3, respectively. A filter with filter size k actually predictions with the soft version, the F1 score is differentiable, and has size 256 × k, meaning that the filter is convolved with word we call it "soft F1 score". The soft F1 score is used as the cost function embedding of k tokens. The model has more filters with filter size in XML-Q. SIGIR 2019 eCom, July 2019, Paris, France Liu, Zhang, et al. Kernel size: 1 # Filter: 512 I Kernel size: 2 # Unit: 512 am # Filter: 1024 a query with Kernel size: 3 # Unit: 1024 multiple labels # Filter: 512 # Unit: 512 Tokenized Word Convolutional Gated recurrent Fully connected Fully connected layer Query Embeddings layer + units + layer + with sigmoid output BatchNorm + LayerNorm + BatchNorm + Length: 10 Dimension: 256 ReLU ReLU ReLU # Unit: 1638 # Label: ~20k Figure 1: Architecture of deep neural network model XML-Q for long queries 3.3.3 dropout and learning rate decay. During the training phase, the latest half-year data as the base dataset and collect another half- we applied two techniques to prevent over-fitting and accelerate year data before the base dataset as the augmentation candidate set. the convergence. First, we use dropout [28] with a keep-rate of 0.8 The final training set consists of the full base set and part of the in the fully-connected layers to prevent over-fitting. The second candidate set we selected. The final training set leads to a better technique is the application of cosine decay with restarts to the performance in our experiment compared with the base set alone learning rate as proposed in [29]. The learning rate starts from or base set plus full candidate set. We applied three augmentation 0.001 and follows a cosine decay function with a period of 4 epochs. techniques. When the learning rate touches zero in 2 epochs, it restarts to 0.0008, The first technique is used to alleviate the skewness of labels, or and the cosine decay function has a period of 8 epochs. Every time categories, in query classification. Some popular categories have it restarts, the learning rate is 0.8 of the previous restart, and the lots of data while some others have far less. For example, in our cosine function period is twice as long as the previous one. The dataset, the maximum number of queries for a category is 364771 learning rate scheduler is shown in Figure 2. while the minimum number is 1, and the average number of queries for each category is 1377, as shown in Table 1. We, therefore, added data from the 1500 categories with the lowest frequencies in the 1e-4 1e-4 8e-5 candidate set to the final set. Learning rate 5e-5 mean 90% tile 2 epochs 6 epochs Query length 3.9 6.0 0 #labels for each query 1.5 2.0 0 100k 200k 300k Number of iterations #search-requests for each query 1377.1 2420.0 Table 1: query set statistics Figure 2: Learning rate scheduler of the DNN model The second technique is to augment selected queries that con- 3.4 Data augmentation tain brand names, such as Dior, Haier, or Nike. When such queries The number of available training samples in e-commerce is usually are short and ambiguous, the predicted categories are sometimes very large, but we found that it is still helpful to augment the dataset dominated by the information in the brand name. For example, the to balance the data distribution in query classification. We collect ground-truth category of the query "Dior Gaudron" is "pottery & System Design of Extreme Multi-label Query Classification using a Hybrid Model SIGIR 2019 eCom, July 2019, Paris, France glass" where "Gaudron" is a series of products of Dior for dinner- can evaluate ranking, which evaluates if the labels are ranked in ware. However, since the majority of queries with brand "Dior" are order of human judged relevance; third, one can use the label hier- clothes or cosmetics and "Gaudron" contains only very ambigu- archy, which evaluates the effectiveness of the system to take into ous information to the model, the model tends to classify "Dior account the hierarchical structure of the labels [30, 31]. Gaudron" into clothes and cosmetics. Therefore, for queries with In our study, all the categories are at the leaf level, and the brand names, if the category has low frequencies within the brand, hierarchical structure of the labels can be ignored, so we evaluated repeat the queries 50 times and add to the final set. We repeat the both the partitions and the rankings. The partitions measure the queries 50 times as a result of tuning with the validation set. More partial correctness. We used the definition of precision, recall, and specifically, for each brand, we collect all queries with this brand F1 proposed by [27]. The precision is the ratio of the correctly name and count the number of queries for each category, and then predicted true labels to the total number of true labels, averaged we sort the categories according to the query frequencies, and take over all samples. The recall is the ratio of correctly predicted true half of the categories with the lowest query frequency. The queries labels to the total number of predicted true labels, averaged over in those categories with that specific brand are then selected from all samples. F1 is the harmonic mean of precision and recall. the candidate set and added to the final set. n 1 Õ |Yi Z i | Ñ The third technique is used to catch the residuals of our model. Precision = We first train XML-Q with the base set and record the misclassifi- n i=1 |Z i | cation rate of each category on the base training set. We then add n 1 Õ |Yi Z i | Ñ the samples of 1000 categories with worst performance and which Recall = have not yet added to the final set. n i=1 |Yi | n 1 Õ 2 × |Yi Z i | Ñ 4 EXPERIMENT F1 = n i=1 |Yi | + |Z i | 4.1 Dataset collection and preprocessing In the e-commerce domain, there are millions of buyer engagement data (click/add2cart/purchase) generated in search log systems ev- , where n is the number of instances in evaluation set. Y_i are ery day. Our data set consists of 275 million the true labels for the instance x_i, and Z_i are the predicted labels pairs and the click-through rate (CTR) for each pair. The query is for the instance x_i. Both Y_i and Z_i ∈ {0,1} normalized by removing certain special characters. For each query, In the e-commerce domain, the order of the predicted categories the clicked product listings and the click/impression count are ag- for the query matters. The more relevant categories with higher gregated for one year and grouped by the category. We eliminated CTR should rank higher than the less relevant categories with lower those pairs where aggregated click counts are CTR and the categories that are not relevant. Rank-based evaluation less than 3 to reduce the noise in the data while retaining as many metrics, such as precision at top k (P@k), have been widely used categories as possible. In our dataset, the number of total categories in the multi-label tasks [7, 23, 10, 22]. P@k is calculated for each is about 20,000, covering the entire label space. query and then averaged over the whole evaluation set. For each query, the P@k is defined as: 4.2 Evaluation 1Õ p@k = yl The evaluation in our study includes a model level evaluation and k a system level evaluation. We propose a system level evaluation because the number of total categories is too large for human judges , where k could be 1,3,10 and y_l is the ground-truth labels to evaluate the quality of the query classifier directly. In our study, among the top k predicted labels for each query. the query categorization is a component in the e-commerce search The standard definition of P@k is flawed when the total number system. We implement the baseline or the target query models of ground-truth labels is less than k. For example, assume the query in the search system and compare whether there is any improve- has two true labels only, and the query classifier predicts both of ment for the search system with the target model and how much them right. Then the expected precision at top k should be 1. How- improvement it brings. ever, according to the definition, the P@3 will be 2/3. Furthermore, We design two approaches for the system level evaluations. The based on the definition, P@k is not a comparable metric between first approach is the offline golden data evaluation, and the second datasets when the average number of labels of queries various a approach is the human judgment of end-to-end search relevance. lot. 4.2.1 model level evaluation. The evaluation of a multi-label classi- We propose a modified metric called PR@k as: fier is more challenging than the evaluation of a multi-class classi- 1 Õ fier. In the multi-label setting, both the ground-truth and predicted PR@k = yl min(k, tl) labels for a testing sample could be a subset of the label set. Hence, the prediction can be entirely correct, partially correct or entirely incorrect. There are three ways to evaluate a multi-label classifier. , where tl is the number of true labels in the top k. In the definition First, one could evaluate partitions, which measures how far the of PR@k, the sum of true positives is divided by the minimum of k classifier predictions are from the ground-truth labels; second, one and the number of true labels tl, instead of k, to address the problem SIGIR 2019 eCom, July 2019, Paris, France Liu, Zhang, et al. when k is larger than tl. In this new definition, the precision at k and the recall at k will be the same, so we name it as PR@k. DCG NDCG* = IDCG N Õ 2r eli DCG = i=1 loд2 (i + 1) 4.2.2 offline golden data evaluation pipeline. Human judgment is ÕN 2pr ef ecti time- and money-consuming. Thus, we built the offline golden data IDCG = i=1 loд2 (i + 1) evaluation pipeline and evaluated the system performance and the search relevance with the pipeline before the human judgment. The where N=min(5,#totalReturns), the NDCG* is calculated based pipeline utilizes human judgment results accumulated from the on the top up to 5 returned product listings for each query. rel_i is past as the golden dataset. the relevance score of the product listing at position i by a human The offline golden data evaluation pipeline can be treated as judge. We modified the definition of IDCG such that all the top 5 a mini search system. The inventory is a mixture of all product product listings should be Excellent for a perfect search system, so listings from the golden dataset and 10% of product listings from prefect_i in the IDCG equation is Excellent score. The NDCG score the production inventory. The golden data contains 500k human is calculated for each query and averaged over all the evaluation judgment data in the format of query, the top 3 returned product set. listings, and the human labeled relevance score for each pair. The relevance score is either 0 or 1, where 0 means not 4.3 Results relevant, and 1 means relevant. Based on the user search frequency, The XML-CNN model [7] has achieved the-state-of-the-art perfor- the queries can be divided into head, torso and tail queries. We mance and beats the other 7 most representative multi-label models define that head queries are those count for the top 30% of the total such as FastXML [23], SLEEC [22], Bow-CNN [21], and FastText impressions, and torso queries are those with the top 30-60% im- [19] on 6 benchmark datasets. Thus, among all the multi-label mod- pressions, and tail queries are those with the rest 40% impressions. els, we will compare our model to XML-CNN only. Another baseline The performance of a search engine is usually quite different in is the Naive Bayes model which was the production model in our head, torso and tail queries. Most of the search engines perform system. In the experiments, we evaluated: 1) the performance of our very well in the head query. The performance in the tail query deep learning model XML-Q compared to the XML-CNN model; set distinguishes the good search engines from the others. To bet- 2) the comparison of the Naive Bayes model, XML-Q model, and ter evaluate the search system, we generated head, torso, and tail hybrid model in the offline golden data evaluation pipeline; 3) the queries based on the demand frequency from the latest search log comparison of the hybrid model and Naive Bayes model in search joined with the queries in the golden data. We sampled 10k queries relevance human judgments. from each intersection as the final evaluation set for head, torso Our XML-Q model is an extension of XML-CNN by adding a few and tail queries. For each query in the evaluation set, the top N components that are more suitable for e-commerce query classifi- product listings can be predicted with the golden data evaluation cation. These components include differentiable F1 loss, additional pipeline, and the precision, recall, accuracy, and AUC [32, 33] can RNN layer, and cosine learning rate decay. As shown in Table 2, be calculated by comparing the system prediction to the golden by replacing the original binary cross-entropy loss with the dif- data label. ferentiable F1 loss, we see 22.95% improvement in the F1 score, 41.43% improvement in the precision, 17.43% drop in the recall, and 6.74%, 2.80%, and -0.22% changes in the PR@1, 3, and 10, respec- tively. The contributions of adding additional RNN layer and using cosine learning rate decay to the F1 score are 0.69% and 2.58%, and 4.2.3 search relevance human judgment. If the result of the target contribution to the PR@10 is 1.38% and 1.54% respectively. model is better than the baseline from the offline golden data eval- uation pipeline, the next step is to conduct human judgment on end-to-end search relevance. The search system is set up with base- δP δR δ F1 δ PR1 δ PR3 δ PR10 line or control models and with the full inventory. The evaluation F1 Loss 41.43 -17.43 22.95 6.74 2.80 -0.22 query set includes 4500 queries, 1/3 for each of the head, torso, Add RNN 1.51 0.03 0.69 1.24 1.22 1.38 and tail query segments. The judges are asked to judge the top 5 Cos Decay 2.39 2.72 2.58 2.15 1.86 1.54 product listings for each query to four levels of relevance based on Table 2: The contributions of new components in XML-Q the judgment guideline: Excellent, Good, Acceptable and off-topic. to the model performance in precision, recall, F1, PR@k Three judgments were collected for each query to reduce bias. (k=1,3,10) (in %). Meanwhile, we designed a side by side group comparison. Two groups of results were shown on the one page for each judgment, one group from the baseline, and another group from the target. NDCG is a score defined to measure the search ranking. We Since differentiable F1 loss has a big impact on the performance modified the definition of NDCG to represent both the ranking of query categorization, for easy comparison, we show the perfor- quality and relevance. mance of XML-CNN with differentiable F1 loss (XML-CNN*) and System Design of Extreme Multi-label Query Classification using a Hybrid Model SIGIR 2019 eCom, July 2019, Paris, France XML-Q model in Table 3. From the XML-CNN* model to XML-Q QuerySet NDCG improvement model, the total improvement is about 11.2% in precision, 18.6% head 0.0 in recall, 15.0% in F1 score, and 11.8%, 10.5%, 8.9% in PR@1, 3, 10, respectively. Note that data augmentation also contributes to total torso 0.3 improvement. tail 1.1 mix 0.4 Model P R F1 PR1 PR3 PR10 Table 5: Comparison of hybrid model and Naive Bayes model in NDCG* (in %) for head, torso and tail query sets XML-CNN* 61.01 55.09 57.90 64.79 71.28 78.89 from search relevance human judgment results. XML-Q 72.16 73.68 72.91 76.57 81.75 87.75 Table 3: Comparison of XML-CNN* and XML-Q in precision, recall, F1, PR@k (k=1,3,10) (in %). 5 CONCLUSION This paper presents the system design of extreme multi-label query Table 4 shows the comparison of XML-Q and the Naive Bayes classification as an application in the e-commerce search system. model in the top and the comparison of the hybrid model and the We propose a hybrid model that combines the Naive Bayes classifier Naive Bayes model in the bottom using the offline golden data and a deep neural network model, XML-Q, for the query classifi- evaluation pipeline. The top of the table illustrates that XML-Q cation based on the length of queries. The Naive Bayes model is model has a better performance in both torso and tail query sets used for the queries with one word, and the deep neural network and the Naive Bayes model outperforms the XML-Q model in the model is used for the rest of the queries. The deep learning model head query set. This mixed result is one of the reasons why we is an extension of the state-of-the-art XML-CNN model. A few propose the hybrid model that combines the Naive Bayes model components and adjustments have been made to make the model and XML-Q. It is not easy to classify whether a query is the head more suitable for query classification, such as the differentiable query or not. We could build a dictionary that contains all the head F1 loss, additional RNN layers, and the cosine learning rate decay. queries based on user behavior date and maintain it. Alternatively, Three new data augmentation techniques have been applied to the to our observation, most of the head queries are shorter than torso training data which significantly improve the model performance. and tail queries, so it is also straightforward to build a hybrid model New evaluation metric PR@k is designed to address the problem based on the query length. We tested different cutoff query lengths in the standard precision@k metrics when k is larger than the total from 1 to 4, and the hybrid model with a cutoff length 1 gives the true labels of the query. Since the label set is huge for the query best performance. As shown in the bottom of the table, the hybrid classification in this paper, it is almost impossible to do direct hu- model outperforms the Naive Bayes model among all head, torso, man judgment on such large label space. Therefore, we propose and tail query sets. end-to-end system level evaluations. The evaluation results show that the hybrid model enhances the performance of query classifi- QSet δ Acc δP δR δ F1 δ AUC cation for different query sets, especially the torso and tail query head -0.3 -0.2 -0.8 -0.8 -0.1 sets. XML_Q-NBayes torso 0.2 0.2 0.5 0.5 0.1 tail 0.4 0.0 1.7 1.5 0.1 REFERENCES head 0.3 0.3 1.1 1.0 0.2 [1] Jung-Woo Ha, Hyuna Pyo, and Jeonghee Kim. 2016. Large- Hybrid-NBayes torso 0.5 0.0 1.9 1.6 0.1 scale item categorization in e-commerce using multiple re- tail 0.8 0.3 2.6 2.2 0.3 current neural networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery Table 4: Comparison of XML_Q and Naive Bayes model (top) and Data Mining. ACM, 107–115. and Hybrid model and Naive Bayes model (bottom) in accu- [2] Vivek Gupta, Harish Karnick, Ashendra Bansal, and Pradhu- racy, precision, recall, F1 and AUC (in %) for head, torso and man Jhala. 2016. Product classification in e-commerce using tail query set using Offline golden data evaluation pipeline. distributional semantics. arXiv preprint arXiv:1606.06083. [3] Mangi Kang, Jaelim Ahn, and Kichun Lee. 2018. Opinion mining using ensemble text hidden markov models for text The search relevance human judgment result in Table 5 shows classification. Expert Systems with Applications, 94, 218–227. that from Naive Bayes model to the hybrid model, the NDCG* score [4] Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, and Xiangyang improves by 0.3% for torso queries and 1.1% for tail queries. The Xue. 2016. Multi-stream multi-class fusion of deep networks improvement of the head queries is neutral. One reason is that a for video classification. In Proceedings of the 24th ACM inter- large portion of the head queries is of length 1, so the hybrid and national conference on Multimedia. ACM, 791–800. Naive Bayes are the same model in this case. Another reason is that [5] Husheng Guo and Wenjian Wang. 2015. An active learning- the head queries are relatively easy to classify based on the human based svm multi-class classification model. Pattern recogni- judgment results, so the predictions of both the Naive Bayes model tion, 48, 5, 1577–1597. and XML-Q are correct. The average performance improvement of NDCG* for the full query sets is 0.4%. SIGIR 2019 eCom, July 2019, Paris, France Liu, Zhang, et al. [6] Venkataraman Santhanam, Vlad I Morariu, David Harwood, and naive bayes. In Advances in neural information processing and Larry S Davis. 2016. A non-parametric approach to ex- systems, 841–848. tending generic binary classifiers for multi-classification. [19] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Pattern Recognition, 58, 149–158. Mikolov. 2016. Bag of tricks for efficient text classification. [7] Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming arXiv preprint arXiv:1607.01759. Yang. 2017. Deep learning for extreme multi-label text clas- [20] Yoon Kim. 2014. Convolutional neural networks for sentence sification. In Proceedings of the 40th International ACM SIGIR classification. arXiv preprint arXiv:1408.5882. Conference on Research and Development in Information Re- [21] Rie Johnson and Tong Zhang. 2014. Effective use of word trieval. ACM, 115–124. order for text categorization with convolutional neural net- [8] Himanshu Jain, Yashoteja Prabhu, and Manik Varma. 2016. works. arXiv preprint arXiv:1412.1058. Extreme multi-label loss functions for recommendation, tag- [22] Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, ging, ranking & other missing label applications. In Proceed- and Prateek Jain. 2015. Sparse local embeddings for extreme ings of the 22nd ACM SIGKDD International Conference on multi-label classification. In Advances in neural information Knowledge Discovery and Data Mining. ACM, 935–944. processing systems, 730–738. [9] Ian En-Hsu Yen, Xiangru Huang, Pradeep Ravikumar, Kai [23] Yashoteja Prabhu and Manik Varma. 2014. Fastxml: a fast, Zhong, and Inderjit Dhillon. 2016. Pd-sparse: a primal and accurate and stable tree-classifier for extreme multi-label dual sparse approach to extreme multiclass and multilabel learning. In Proceedings of the 20th ACM SIGKDD interna- classification. In International Conference on Machine Learn- tional conference on Knowledge discovery and data mining. ing, 3069–3077. ACM, 263–272. [10] Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik [24] Sergey Ioffe and Christian Szegedy. 2015. Batch normaliza- Varma. 2013. Multi-label learning with millions of labels: tion: accelerating deep network training by reducing internal recommending advertiser bid phrases for web pages. In Pro- covariate shift. arXiv preprint arXiv:1502.03167. ceedings of the 22nd international conference on World Wide [25] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, Web. ACM, 13–24. and Yoshua Bengio. 2014. On the properties of neural ma- [11] Wei Bi and James Kwok. 2013. Efficient multi-label classifica- chine translation: encoder-decoder approaches. arXiv preprint tion with many labels. In International Conference on Machine arXiv:1409.1259. Learning, 405–413. [26] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. [12] Himanshu Jain, Venkatesh Balasubramanian, Bhanu Chun- 2016. Layer normalization. arXiv preprint arXiv:1607.06450. duri, and Manik Varma. 2019. Slice: scalable linear extreme [27] Shantanu Godbole and Sunita Sarawagi. 2004. Discrimina- classifiers trained on 100 million labels for related searches. tive methods for multi-labeled classification. In Pacific-Asia In Proceedings of the Twelfth ACM International Conference conference on knowledge discovery and data mining. Springer, on Web Search and Data Mining. ACM, 528–536. 22–30. [13] Adel Hamdan Mohammad, Tariq AlwadaâĂŹn, and Omar [28] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Al-Momani. 2018. Arabic text categorization using support Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a sim- vector machine, naïve bayes and neural network. GSTF Jour- ple way to prevent neural networks from overfitting. The nal on Computing (JoC), 5, 1. Journal of Machine Learning Research, 15, 1, 1929–1958. [14] Wei Chen, Xusheng Yan, Zhou Zhao, Haoyuan Hong, Dieu [29] Ilya Loshchilov and Frank Hutter. 2016. Sgdr: stochastic gra- Tien Bui, and Biswajeet Pradhan. 2019. Spatial prediction dient descent with warm restarts. arXiv preprint arXiv:1608.03983. of landslide susceptibility using data mining-based kernel [30] Mohammad S Sorower. 2010. A literature survey on algo- logistic regression, naive bayes and rbfnetwork models for rithms for multi-label learning. Oregon State University, Cor- the long county area (china). Bulletin of Engineering Geology vallis, 18, 1–25. and the Environment, 78, 1, 247–266. [31] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. [15] S Vijayarani and S Dhayanand. 2015. Liver disease prediction 2009. Mining multi-label data. In Data mining and knowledge using svm and naïve bayes algorithms. International Journal discovery handbook. Springer, 667–685. of Science, Engineering and Technology Research (IJSETR), 4, [32] Jin Huang and Charles X Ling. 2005. Using auc and accu- 4, 816–820. racy in evaluating learning algorithms. IEEE Transactions on [16] Young-gon Kim, Taehee Lee, Jonghoon Chun, and Sang- knowledge and Data Engineering, 17, 3, 299–310. goo Lee. 2006. Modified naïve bayes classifier for e-catalog [33] Charles X Ling, Jin Huang, Harry Zhang, et al. 2003. Auc: classification. In International Workshop on Data Engineering a statistically consistent and more discriminating measure Issues in E-Commerce and Services. Springer, 246–257. than accuracy. In Ijcai. Volume 3, 519–524. [17] How Jing, Yu Tsao, Kuan-Yu Chen, and Hsin-Min Wang. 2013. Semantic naïve bayes classifier for document classification. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, 1117–1123. [18] Andrew Y Ng and Michael I Jordan. 2002. On discriminative vs. generative classifiers: a comparison of logistic regression