System Design of Extreme Multi-label Query Classification using a Hybrid Model

Xianjing Liu* (eBay Search, xianjliu@ebay.com)
Hejia Zhang† (Princeton University, hejiaz@princeton.edu)
Mingkuan Liu (eBay Search, mingkuan@gmail.com)
Alan Lu (eBay Search, alalu@ebay.com)

*Both authors contributed equally to the paper, in alphabetical order of last names.
†This work was done while Hejia Zhang was an intern at eBay.

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

ABSTRACT
Extreme multi-label classification is a rapidly growing research area with many applications. In this paper we propose a system design for extreme multi-label text classification (XMTC) applied to query classification in the e-commerce domain. Search query classification is more challenging than conventional document classification because queries are usually very short and often ambiguous. We design a hybrid model that uses a deep neural network for long queries and a Naive Bayes model for short queries. We formulate and apply new data augmentation techniques and create new evaluation metrics that are more suitable for the extreme multi-label task in e-commerce. We also design end-to-end system-level evaluation methods to address the challenge that the extremely large label space poses for human judgment. We compare our deep neural network model with the state-of-the-art method on real e-commerce data and observe about a 15% improvement in the F1 score. The end-to-end system evaluation results show that our new system improves query classification performance for a variety of query sets. In particular, for the torso and tail queries in e-commerce, we see 0.3% and 1.1% improvements in the NDCG score.

KEYWORDS
multi-label, extreme text classification, query classification, neural networks, CNN-RNN, deep learning, e-commerce

ACM Reference Format:
Xianjing Liu, Hejia Zhang, Mingkuan Liu, and Alan Lu. 2019. System Design of Extreme Multi-label Query Classification using a Hybrid Model. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 8 pages.
1 INTRODUCTION
In e-commerce, there are many cases where we need to classify text into a large label space [1, 2, 3], such as classifying search queries, product listing titles, or product listing descriptions into merchandise categories. A typical merchandise category taxonomy is a multi-leveled tree. Among these classification tasks, query categorization is much more challenging than traditional document classification. First, queries are usually very short: the average length of queries issued by eBay users is about four tokens. Second, many queries are ambiguous. For example, the query 'Nike' can be classified into many categories, such as men's shoes, men's clothing, women's shoes, kids' clothing, golf, and yoga, and 'Father's day gift' can be categorized into home and living, electronics, art, or clothing. The relevant search results for such queries can span many leaf categories from multiple meta categories (categories at the root level). Third, the label set is very large when classifying queries to the leaf level of the taxonomy: there are, for example, over 20k leaf categories at the eBay U.S. site alone.

It is natural to model query categorization as a multi-class classification problem. Traditional binary and multi-class classification problems, where one and only one label belongs to each document, have been studied heavily in the literature [4, 5, 6]. However, as mentioned above, many queries have more than one ground-truth label, so we consider query categorization a multi-label problem. Multi-label classification is fundamentally different from binary and multi-class classification: a multi-label classifier assigns the most relevant subset of labels to each sample, whereas a multi-class classifier treats the labels as independent variables and does not leverage the dependencies among them. That is, a multi-class classifier assumes that the class labels are mutually exclusive [7].

The query categorization in our task is also an extreme classification problem, since the label set contains a tremendous number of labels. Extreme classification is a rapidly growing research area dealing with multi-class and multi-label problems that have very large label sets [8, 9, 10, 11, 12]. Combining these properties of our task, we treat query categorization as an extreme multi-label (XML) classification problem.

In this paper, we propose a system design that tackles query categorization as an XML problem. We design a hybrid model that combines a deep neural network (DNN) model with a Naive Bayes model: the Naive Bayes model handles short queries and the DNN handles long queries. The motivation for the hybrid model comes from our experimental results, detailed in Section 4.3 and Table 4, where we found that the Naive Bayes model performs better on short queries while the DNN model outperforms it on long queries. We also apply new data augmentation techniques and create new evaluation metrics for XML problems that are more suitable in the e-commerce setting. We compare our system with the state-of-the-art method on real e-commerce data and show that it enhances the performance of query categorization on several success measures.

2 RELATED WORK

2.1 Naive Bayes text classification model
Naive Bayes classifiers are widely used in text classification tasks. These classifiers belong to the probabilistic classification family: they are based on Bayes' theorem and assume that the features are mutually independent. Despite this oversimplified independence assumption, Naive Bayes achieves competitive performance in many complex applications [13, 14, 15]. The work in
[16] extended the Naive Bayes classifier to utilize the structural characteristics of e-catalogs for e-commerce and achieved improved accuracy. The study in [17] proposed a semantic Naive Bayes classifier that incorporates semantic features of the document to improve the conventional Naive Bayes classifier. Because of its simplicity, effectiveness, and strong performance, the Naive Bayes classifier has been widely used in industry, including the e-commerce domain. Therefore, we use a Naive Bayes classifier as one of our baselines, although it is not specifically designed for XML classification. Naive Bayes performs very well with an amount of training data [18] that most other classifiers, especially deep neural networks, would find insufficient. However, deep neural networks usually outperform Naive Bayes when a large amount of training data is available, as in e-commerce.

2.2 Models for XML
2.2.1 Deep neural network models for XML. There are many existing methods for the XML text classification problem. Among them, the neural network models, such as FastText [19], CNN-Kim [20], Bow-CNN [21], and XML-CNN [7], constitute a big family. These methods design different neural network structures that directly map input text into the label space. XML-CNN is the state of the art (SOTA) in XML text classification, as shown in [7]: it passes the document through convolutional filters and dynamic max-pooling layers to extract features, and maps those features to the label space with two fully connected layers. Despite the large number of DNN methods in the XML text classification field, few DNN models are designed for the XML query categorization problem. Since queries are much shorter than the samples in the text datasets used by most of those methods, it is not desirable to apply them directly to query datasets.

2.2.2 Other models for XML. There are also other methods for the XML text classification problem, which can be roughly divided into two groups. The first group is target-embedding methods, such as SLEEC [22], which project the label vectors into a low-dimensional space to deal with the sparsity of the label space. The second group is tree-based methods, such as FastXML [23], which have a hierarchical structure similar to decision trees. The XML-CNN paper [7] also compares these methods to XML-CNN and shows that XML-CNN is the SOTA, so we only compare our model with XML-CNN in this paper.

3 PROPOSED SYSTEM
Here we propose a system design for extreme multi-label query classification with a hybrid model. The hybrid model is a combination of a DNN model and a Naive Bayes model: the Naive Bayes classifier is called when the query length is 1 (token), and the DNN model is used otherwise. We experimented with various cutoff query lengths, and a cutoff length of 1 gives the best overall performance. Our DNN model for queries, named XML-Q, is a CNN-RNN model with a differentiable F1 loss. It is an extension of the XML-CNN model [7] that is more suitable for query classification.

We also propose three training data augmentation techniques to address the skewness of the data and improve model performance. The standard definition of the evaluation metric precision at k is flawed for the multi-label task when the total number of true labels for the query is less than k; we therefore design a new evaluation metric, PR@k, for the multi-label classifier in Section 4.2.1.

The challenge of using human judgment to evaluate XML classifiers directly is that the label space is far too large for a judge to categorize each query against the entire label space according to the judgment guidelines. Thus, we propose an end-to-end evaluation at the system/application level: we implement the query classifier in the search system and design an offline evaluation pipeline using golden data (Section 4.2.2) and a side-by-side human judgment of search performance (Section 4.2.3). The end-to-end system performance, such as search relevance and ranking quality, is evaluated instead.

3.1 Hybrid Model
We propose a hybrid model where the Naive Bayes model is used when the query length is 1 and the XML-Q model is called when the query length is larger than 1. The design is based on the comparison of the two classifiers in the evaluation results shown in Table 4. The Naive Bayes model performs better on head queries, whose length is mostly 1-2 tokens, while the XML-Q classifier is better on torso and tail queries, which are usually longer than head queries. One possible explanation is that Naive Bayes is based on the joint probability of n-gram tokens, and this probability vanishes as the query grows longer. The XML-Q model, on the other hand, has a higher capacity to learn the fine-grained patterns in long queries. Also, the large number of noisy training samples makes the XML-Q model robust by reducing its variance. An additional explanation is discussed below.

The current cutoff query length for the hybrid model is 1. Other cutoff lengths, such as 2, 3, and 4, have also been evaluated in our study, but a cutoff length of 1 gives the best performance. Note that a cutoff length of 1 means the Naive Bayes classifier functions much like a look-up table, which performs well once plenty of user click data has been accumulated and a high-quality look-up table has been generated. Meanwhile, the performance of XML-Q is not ideal when the query length is 1 because of the limited input information. This analysis could also be a reason why Naive Bayes outperforms XML-Q on head queries of length 1.
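To make the dispatch concrete, the following Python sketch shows the routing logic. The predictor callables naive_bayes_predict and xml_q_predict are hypothetical stand-ins for the production Naive Bayes look-up and the XML-Q network, and the whitespace tokenization is an assumption.

```python
CUTOFF_LENGTH = 1  # tuned cutoff: 1-token queries go to Naive Bayes (Section 3.1)

def classify_query(query, naive_bayes_predict, xml_q_predict):
    """Route a query to Naive Bayes or XML-Q based on its token length."""
    tokens = query.lower().split()  # assumed whitespace tokenization
    if len(tokens) <= CUTOFF_LENGTH:
        # short queries: the Naive Bayes model behaves like a look-up table
        return naive_bayes_predict(tokens)
    # longer queries: the CNN-RNN model XML-Q
    return xml_q_predict(tokens)
```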
3.2 Naive Bayes model
Our Naive Bayes model is a production model generated from user click data, and it is one of the baseline models in our study. For each query, the clicked product listings are logged, and the related categories for those listings are aggregated and grouped; this yields a table that maps queries to categories and click counts. Each query is further broken down into uni-gram and bi-gram tokens to generate a table that contains tokens, categories, and click counts. The noise in the table is reduced by removing rows whose click counts fall below a threshold, where the threshold is a hyperparameter we tuned. The bag-of-words and multinomial Naive Bayes methods are applied in the model. The probability of the query given a category is calculated as the product of the likelihoods of the uni-grams or bi-grams in the query.
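A toy sketch of the scoring step is shown below, assuming a (token, category) → click-count dictionary aggregated from the click logs. The table layout, the smoothing constant ALPHA, and the helper names are illustrative assumptions rather than the production implementation.

```python
import math

ALPHA = 1.0  # assumed Laplace smoothing constant

def ngrams(tokens):
    """Uni-grams plus bi-grams, mirroring the token breakdown above."""
    return tokens + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def score_categories(tokens, click_table, category_clicks, vocab_size):
    """log P(category) + sum over query n-grams of log P(token | category)."""
    total = sum(category_clicks.values())
    scores = {}
    for cat, cat_clicks in category_clicks.items():
        log_p = math.log(cat_clicks / total)  # prior from category click mass
        for tok in ngrams(tokens):
            tok_clicks = click_table.get((tok, cat), 0)
            # multinomial likelihood with smoothing
            log_p += math.log((tok_clicks + ALPHA) / (cat_clicks + ALPHA * vocab_size))
        scores[cat] = log_p
    return scores
```

Ranking the returned log-scores and keeping the top categories yields the look-up-table behavior discussed in Section 3.1.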
The probability of and document dataset in [7] is that the number of categories, or the query given the category can be calculated as the production labels, per query is 1.5, which is much smaller than the number of from the likelihoods of bi-grams or uni-grams in the query. labels per document. The total number of categories of queries is about 20, 000, which is similar to, or greater than the number of 3.3 Deep Neural Network Model categories in document datasets. Therefore, each query, or sample, Our DNN model for long queries is called XML-Q (XML for queries). has a much sparser output space. This sparsity leads to an imbal- It is based on the XML-CNN [7], which is designed for XML clas- anced classes issue. That is, the output tends to be all close to zero sification of the documents. In XML-Q, we made modifications to if no action is taken to alleviate this issue because the vast majority the XML-CNN model architecture and loss function due to the dif- of the ground-truth labels are zero. ferences between query datasets and the document dataset used in To solve this problem, instead of using the cross-entropy loss [7]. as in [7], we design a new soft F1 loss. According to the definition in [27], the F1 score in multi-label classification is calculated for 3.3.1 Architecture. Our queries, with an average length of four each sample and is defined as the harmonic mean of precision and tokens, as shown in Table 1, are much shorter compared with the recall. The F1 score is a common metric for classification problem document dataset which has an average length of hundreds of to- with imbalanced classes. It is desirable to optimize the F1 score kens. our queries contain far less information than document and directly in our case. However, the F1 score is not differentiable, need a richer feature extractor. While the XML-CNN uses only a thus it is hard to use it as a loss function in the DNN. Since the convolutional layer to extract features from the documents, XML- harmonic averaging operation itself is differentiable, we examine Q has an extra recurrent units layer after the convolutional layer the precision and recall formula to identify the non-differentiable to extract more sequential information. This recurrent layer also part. replaces the functionality of the pooling layer in XML-CNN. In our CNN+RNN structure, the convolutional layer extracts n-gram TP TP Precision = = information and the recurrent layer further extracts sequential in- TP + FP Number of positive labels in prediction formation, thus it is a stronger feature extractor compared with TP TP Recall = = CNN or RNN alone. This observation is consistent with our experi- TP + FN Number of positive labels in ground-truth ment results that the CNN+RNN structure has a better prediction , where TP is true positive, FP is false positive, and FN is false performance. negative. We observe that thresholds are applied in deriving the The architecture of XML-Q is summarized in Figure 1. All of predicted labels and TP, and the non-differentiability comes from the hyper-parameters in the model architecture are tuned with a this thresholding step. Therefore, we remove the thresholding and validation set. The input query is truncated or padded to 10 tokens estimate TP and number of positive labels in prediction with a soft, where the last token is fixed to be . Based on Table 1, 90% differentiable formula defined as: of our queries have length less than 6, so we are not losing much L information in truncation. 
3.3.2 Loss function. Another difference between the query dataset and the document datasets in [7] is that the average number of categories, or labels, per query is 1.5, which is much smaller than the number of labels per document. Meanwhile, the total number of categories for queries is about 20,000, which is similar to or greater than the number of categories in the document datasets. Each query (sample) therefore has a much sparser output space. This sparsity leads to a class-imbalance issue: because the vast majority of the ground-truth labels are zero, the outputs all tend toward zero if no action is taken to alleviate the issue.

To solve this problem, instead of using the cross-entropy loss as in [7], we design a new soft F1 loss. According to the definition in [27], the F1 score in multi-label classification is calculated per sample and is defined as the harmonic mean of precision and recall. The F1 score is a common metric for classification problems with imbalanced classes, so it is desirable to optimize it directly in our case. However, the F1 score is not differentiable, which makes it hard to use as a loss function in a DNN. Since the harmonic averaging operation itself is differentiable, we examine the precision and recall formulas to identify the non-differentiable part:

\text{Precision} = \frac{TP}{TP + FP} = \frac{TP}{\text{number of positive labels in prediction}}

\text{Recall} = \frac{TP}{TP + FN} = \frac{TP}{\text{number of positive labels in ground truth}}

where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives. Thresholds are applied in deriving the predicted labels and TP, and the non-differentiability comes from this thresholding step. Therefore, we remove the thresholding and estimate TP and the number of positive labels in prediction with soft, differentiable formulas:

\text{soft TP} = \sum_{l=1}^{L} t_l \cdot \sigma(y_l)

\text{soft number of positive predictions} = \lVert \sigma(y) \rVert_2^2

where L is the number of categories, t is the ground-truth vector, y is the logits vector, \sigma is the sigmoid function, and the subscript l denotes the l-th category. By replacing TP and the number of positive predictions with their soft versions, the F1 score becomes differentiable; we call it the "soft F1 score" and use it as the cost function in XML-Q.
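A minimal PyTorch sketch of this loss is given below. Turning the score into a loss as 1 minus the batch-mean soft F1 is our assumption, since the paper states that the soft F1 score is the cost function but not the exact reduction.

```python
import torch

def soft_f1_loss(logits, targets, eps=1e-8):
    """Differentiable (soft) F1 loss from the formulas above.

    logits: (batch, L) raw scores; targets: (batch, L) 0/1 ground truth.
    """
    probs = torch.sigmoid(logits)
    soft_tp = (targets * probs).sum(dim=1)       # sum_l t_l * sigma(y_l)
    soft_pred_pos = (probs ** 2).sum(dim=1)      # ||sigma(y)||_2^2
    true_pos_count = targets.sum(dim=1)          # #positive ground-truth labels
    precision = soft_tp / (soft_pred_pos + eps)
    recall = soft_tp / (true_pos_count + eps)
    soft_f1 = 2 * precision * recall / (precision + recall + eps)
    return (1.0 - soft_f1).mean()                # minimize 1 - soft F1
```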
3.3.3 Dropout and learning rate decay. During the training phase, we apply two techniques to prevent over-fitting and accelerate convergence. First, we use dropout [28] with a keep rate of 0.8 in the fully connected layers. Second, we apply cosine decay with restarts to the learning rate, as proposed in [29]. The learning rate starts at 0.001 and follows a cosine decay function with a period of 4 epochs, so it touches zero after 2 epochs; it then restarts at 0.0008, with a cosine decay period of 8 epochs. At every restart, the learning rate is 0.8 of the previous restart value, and the cosine period is twice as long as the previous one. The learning rate schedule is shown in Figure 2.

[Figure 2: Learning rate scheduler of the DNN model.]
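Under this reading (each cycle is a half-cosine from its peak down to zero, with the first cycle lasting 2 epochs), the schedule can be sketched as follows; the function name and the support for fractional epochs are our own conventions.

```python
import math

def cosine_restart_lr(epoch, base_lr=1e-3, first_cycle=2.0, decay=0.8):
    """Learning rate at a (possibly fractional) epoch for Section 3.3.3:
    a cosine that reaches zero after `first_cycle` epochs, then restarts
    at `decay` times the previous peak with a cycle twice as long
    (0.001 -> 0.0008 -> 0.00064 -> ...)."""
    peak, cycle, t = base_lr, first_cycle, float(epoch)
    while t >= cycle:            # advance to the cycle containing `epoch`
        t -= cycle
        peak *= decay            # each restart peaks at 0.8x the last one
        cycle *= 2.0             # and the cycle is twice as long
    return peak * 0.5 * (1.0 + math.cos(math.pi * t / cycle))
```

The multiplicative factor cosine_restart_lr(e) / base_lr can then be handed to a generic scheduler wrapper such as torch.optim.lr_scheduler.LambdaLR.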
                                  mean    90th percentile
  Query length                     3.9         6.0
  #labels per query                1.5         2.0
  #search requests per query    1377.1      2420.0

Table 1: Query set statistics.

3.4 Data augmentation
The number of available training samples in e-commerce is usually very large, but we found that it is still helpful to augment the dataset to balance the data distribution for query classification. We collect the latest half-year of data as the base dataset and the preceding half-year of data as the augmentation candidate set. The final training set consists of the full base set plus the part of the candidate set we select. In our experiments, this final training set leads to better performance than the base set alone or the base set plus the full candidate set. We applied three augmentation techniques.

The first technique alleviates the skewness of the labels, or categories, in query classification. Some popular categories have lots of data while others have far less: in our dataset, the maximum number of queries for a category is 364,771, the minimum is 1, and the average per category is 1,377, as shown in Table 1. We therefore added the data of the 1500 lowest-frequency categories from the candidate set to the final set.

The second technique augments selected queries that contain brand names, such as Dior, Haier, or Nike. When such queries are short and ambiguous, the predicted categories are sometimes dominated by the information in the brand name. For example, the ground-truth category of the query "Dior Gaudron" is "pottery & glass", "Gaudron" being a Dior dinnerware line. However, since the majority of queries with the brand "Dior" concern clothes or cosmetics and "Gaudron" carries only very ambiguous information for the model, the model tends to classify "Dior Gaudron" into clothes and cosmetics. Therefore, for queries with brand names, if the category has low frequency within the brand, we repeat the queries 50 times and add them to the final set; the factor of 50 was tuned on the validation set. More specifically, for each brand we collect all queries containing the brand name, count the queries per category, sort the categories by query frequency, and take the half of the categories with the lowest frequency. The queries in those categories with that specific brand are then selected from the candidate set and added to the final set.

The third technique is used to catch the residuals of our model. We first train XML-Q with the base set and record the misclassification rate of each category on the base training set. We then add, from the candidate set, the samples of the 1000 worst-performing categories that have not yet been added to the final set.
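As an illustration of the second technique, here is a toy sketch. The (query, brand, category) input layout and the function name are hypothetical, and the production pipeline operates on logged click data rather than in-memory lists.

```python
from collections import Counter, defaultdict

def augment_brand_queries(candidate_rows, repeat=50):
    """Second augmentation technique (sketch): for each brand, repeat the
    queries of the lower half of its categories by query frequency.

    candidate_rows: iterable of (query, brand, category) triples, an
    assumed layout for the candidate set.
    """
    freq = defaultdict(Counter)      # brand -> category -> #queries
    queries = defaultdict(list)      # (brand, category) -> queries
    for query, brand, category in candidate_rows:
        freq[brand][category] += 1
        queries[(brand, category)].append(query)

    augmented = []
    for brand, counts in freq.items():
        ranked = sorted(counts, key=counts.get)      # ascending frequency
        for category in ranked[: len(ranked) // 2]:  # rarest half
            for query in queries[(brand, category)]:
                augmented.extend([(query, category)] * repeat)
    return augmented
```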
4 EXPERIMENT

4.1 Dataset collection and preprocessing
In the e-commerce domain, millions of buyer engagement events (click/add-to-cart/purchase) are generated in search log systems every day. Our dataset consists of 275 million ⟨query, category⟩ pairs and the click-through rate (CTR) for each pair. Each query is normalized by removing certain special characters. For each query, the clicked product listings and the click/impression counts are aggregated over one year and grouped by category. We eliminated the pairs whose aggregated click counts are less than 3 to reduce the noise in the data while retaining as many categories as possible. In our dataset, the number of total categories is about 20,000, covering the entire label space.

4.2 Evaluation
The evaluation in our study includes a model-level evaluation and a system-level evaluation. We propose the system-level evaluation because the number of total categories is too large for human judges to evaluate the quality of the query classifier directly. In our study, query categorization is a component of the e-commerce search system; we implement the baseline or target query models in the search system and measure whether, and by how much, the target model improves the search system.

We design two approaches for the system-level evaluation. The first is the offline golden data evaluation, and the second is human judgment of end-to-end search relevance.

4.2.1 Model level evaluation. Evaluating a multi-label classifier is more challenging than evaluating a multi-class classifier. In the multi-label setting, both the ground-truth and the predicted labels of a test sample can be subsets of the label set, so a prediction can be entirely correct, partially correct, or entirely incorrect. There are three ways to evaluate a multi-label classifier: first, one can evaluate partitions, which measures how far the classifier predictions are from the ground-truth labels; second, one can evaluate ranking, which measures whether the labels are ranked in order of human-judged relevance; third, one can use the label hierarchy, which evaluates how well the system takes the hierarchical structure of the labels into account [30, 31].

In our study, all the categories are at the leaf level and the hierarchical structure of the labels can be ignored, so we evaluate both partitions and rankings. The partition metrics measure partial correctness. We use the definitions of precision, recall, and F1 proposed by [27]: precision is the ratio of the correctly predicted true labels to the total number of predicted labels, averaged over all samples; recall is the ratio of the correctly predicted true labels to the total number of true labels, averaged over all samples; and F1 is the harmonic mean of precision and recall.

\text{Precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Z_i|}

\text{Recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i|}

F1 = \frac{1}{n} \sum_{i=1}^{n} \frac{2\,|Y_i \cap Z_i|}{|Y_i| + |Z_i|}

where n is the number of instances in the evaluation set, Y_i is the set of true labels for instance x_i, and Z_i is the set of labels predicted for instance x_i.

In the e-commerce domain, the order of the predicted categories for a query matters: more relevant categories with higher CTR should rank above less relevant categories with lower CTR, and above categories that are not relevant at all. Rank-based evaluation metrics, such as precision at top k (P@k), have been widely used in multi-label tasks [7, 23, 10, 22]. P@k is calculated per query and then averaged over the whole evaluation set. For each query,

P@k = \frac{1}{k} \sum_{l \in \text{top-}k} y_l

where k is typically 1, 3, or 10 and y_l is the ground-truth label (0 or 1) of the l-th of the top k predicted labels.

The standard definition of P@k is flawed when the total number of ground-truth labels is less than k. For example, assume a query has only two true labels and the classifier predicts both of them correctly; the expected precision at top k should then be 1, yet by the definition P@3 is 2/3. Furthermore, P@k is not a comparable metric between datasets whose average numbers of labels per query vary a lot.

We propose a modified metric called PR@k:

PR@k = \frac{1}{\min(k, t)} \sum_{l \in \text{top-}k} y_l

where t is the total number of true labels of the query. In the definition of PR@k, the sum of true positives is divided by the minimum of k and t, instead of by k, which addresses the problem when k is larger than t. Under this new definition, the precision at k and the recall at k are the same, so we name the metric PR@k.
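A compact NumPy sketch of PR@k for a single query follows (averaging over the evaluation set is a straightforward extension):

```python
import numpy as np

def pr_at_k(scores, truth, k):
    """PR@k for one query, per the definition above.

    scores: (L,) predicted scores; truth: (L,) 0/1 ground-truth labels.
    True positives in the top k are divided by min(k, #true labels), so a
    query whose two labels are both recovered in the top 3 scores 1.0
    (plain P@3 would give 2/3).
    """
    topk = np.argsort(scores)[::-1][:k]  # indices of the top-k predictions
    hits = int(truth[topk].sum())        # true positives within the top k
    denom = min(k, int(truth.sum()))
    return hits / denom if denom > 0 else 0.0
```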
4.2.2 Offline golden data evaluation pipeline. Human judgment is time- and money-consuming, so we built the offline golden data evaluation pipeline and evaluate the system performance and search relevance with it before any human judgment. The pipeline utilizes human judgment results accumulated in the past as the golden dataset.

The offline golden data evaluation pipeline can be treated as a mini search system. Its inventory is a mixture of all product listings from the golden dataset and 10% of the product listings from the production inventory. The golden data contains 500k human judgments in the format of a query, the top 3 returned product listings, and the human-labeled relevance score for each ⟨query, listing⟩ pair; the relevance score is either 0 (not relevant) or 1 (relevant). Based on user search frequency, queries can be divided into head, torso, and tail queries: we define head queries as those accounting for the top 30% of total impressions, torso queries as those with the next 30-60% of impressions, and tail queries as those with the remaining 40% of impressions. The performance of a search engine is usually quite different across head, torso, and tail queries. Most search engines perform very well on head queries; performance on the tail query set is what distinguishes good search engines from the rest. To better evaluate the search system, we generated head, torso, and tail query sets based on demand frequency from the latest search log joined with the queries in the golden data, and sampled 10k queries from each intersection as the final evaluation sets. For each query in the evaluation set, the top N product listings can be predicted with the golden data evaluation pipeline, and precision, recall, accuracy, and AUC [32, 33] can be calculated by comparing the system predictions to the golden data labels.

4.2.3 Search relevance human judgment. If the target model beats the baseline in the offline golden data evaluation pipeline, the next step is to conduct human judgment of end-to-end search relevance. The search system is set up with the baseline or control models and the full inventory. The evaluation query set includes 4500 queries, one third from each of the head, torso, and tail segments. The judges are asked to rate the top 5 product listings for each query on four levels of relevance according to the judgment guideline: Excellent, Good, Acceptable, and Off-topic. Three judgments are collected for each query to reduce bias. In addition, we designed a side-by-side group comparison: two groups of results are shown on one page for each judgment, one from the baseline and one from the target.

NDCG is a standard measure of search ranking quality. We modified its definition to capture both ranking quality and relevance:

NDCG^* = \frac{DCG}{IDCG}

DCG = \sum_{i=1}^{N} \frac{2^{rel_i}}{\log_2(i + 1)}

IDCG = \sum_{i=1}^{N} \frac{2^{perfect_i}}{\log_2(i + 1)}

where N = min(5, #totalReturns), so NDCG* is calculated over the top (up to) 5 returned product listings for each query, and rel_i is the human-judged relevance score of the product listing at position i. We modified the definition of IDCG such that all top-5 product listings of a perfect search system would be Excellent, so perfect_i in the IDCG equation is the Excellent score. The NDCG* score is calculated per query and averaged over the whole evaluation set.
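For reference, here is a sketch of the per-query NDCG* computation. The numeric gains assigned to the four judgment grades are an assumption, since the paper does not state the mapping.

```python
import math

# Assumed grade-to-gain mapping; the paper only names the four grades.
GRADES = {"Excellent": 3, "Good": 2, "Acceptable": 1, "Off-topic": 0}

def ndcg_star(judged_grades, depth=5):
    """NDCG* from Section 4.2.3: DCG over the top (up to 5) results,
    normalized by an ideal list that is Excellent at every position."""
    n = min(depth, len(judged_grades))
    if n == 0:
        return 0.0
    dcg = sum((2 ** GRADES[g]) / math.log2(i + 2)  # i is 0-based here
              for i, g in enumerate(judged_grades[:n]))
    ideal = sum((2 ** GRADES["Excellent"]) / math.log2(i + 2)
                for i in range(n))
    return dcg / ideal
```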
4.3 Results
The XML-CNN model [7] achieved state-of-the-art performance and beat the 7 most representative multi-label models, such as FastXML [23], SLEEC [22], Bow-CNN [21], and FastText [19], on 6 benchmark datasets. Thus, among all multi-label models, we compare our model with XML-CNN only. Our other baseline is the Naive Bayes model, which was the production model in our system. In the experiments, we evaluated: 1) the performance of our deep learning model XML-Q compared to the XML-CNN model; 2) the Naive Bayes model, the XML-Q model, and the hybrid model in the offline golden data evaluation pipeline; and 3) the hybrid model against the Naive Bayes model in search relevance human judgments.

Our XML-Q model extends XML-CNN with a few components that are more suitable for e-commerce query classification: the differentiable F1 loss, the additional RNN layer, and the cosine learning rate decay. As shown in Table 2, replacing the original binary cross-entropy loss with the differentiable F1 loss yields a 22.95% improvement in the F1 score, a 41.43% improvement in precision, a 17.43% drop in recall, and changes of 6.74%, 2.80%, and -0.22% in PR@1, PR@3, and PR@10, respectively. The contributions of the additional RNN layer and of the cosine learning rate decay to the F1 score are 0.69% and 2.58%, and their contributions to PR@10 are 1.38% and 1.54%, respectively.

              δP       δR      δF1    δPR1   δPR3   δPR10
  F1 Loss    41.43   -17.43   22.95   6.74   2.80   -0.22
  Add RNN     1.51     0.03    0.69   1.24   1.22    1.38
  Cos Decay   2.39     2.72    2.58   2.15   1.86    1.54

Table 2: The contributions of the new components in XML-Q to the model performance in precision, recall, F1, and PR@k (k=1,3,10) (in %).

Since the differentiable F1 loss has a big impact on the performance of query categorization, for an easier comparison Table 3 shows the performance of XML-CNN with the differentiable F1 loss (XML-CNN*) against the XML-Q model. From XML-CNN* to XML-Q, the total improvement is about 11.2% in precision, 18.6% in recall, and 15.0% in F1, and 11.8%, 10.5%, and 8.9% in PR@1, 3, and 10, respectively. Note that data augmentation also contributes to this total improvement.

  Model        P      R      F1     PR1    PR3    PR10
  XML-CNN*   61.01  55.09  57.90  64.79  71.28  78.89
  XML-Q      72.16  73.68  72.91  76.57  81.75  87.75

Table 3: Comparison of XML-CNN* and XML-Q in precision, recall, F1, and PR@k (k=1,3,10) (in %).

Table 4 compares XML-Q with the Naive Bayes model (top) and the hybrid model with the Naive Bayes model (bottom) using the offline golden data evaluation pipeline. The top of the table shows that XML-Q performs better on both the torso and tail query sets, while the Naive Bayes model outperforms XML-Q on the head query set. This mixed result is one of the reasons why we propose the hybrid model combining the Naive Bayes model and XML-Q. It is not easy to determine whether a query is a head query: we could build and maintain a dictionary of all head queries based on user behavior data but, by our observation, most head queries are shorter than torso and tail queries, so it is also straightforward to build the hybrid model on query length. We tested cutoff query lengths from 1 to 4, and the hybrid model with a cutoff length of 1 gives the best performance. As shown in the bottom of the table, the hybrid model outperforms the Naive Bayes model on all of the head, torso, and tail query sets.

                   QSet    δAcc    δP     δR     δF1    δAUC
  XML_Q-NBayes     head    -0.3   -0.2   -0.8   -0.8   -0.1
                   torso    0.2    0.2    0.5    0.5    0.1
                   tail     0.4    0.0    1.7    1.5    0.1
  Hybrid-NBayes    head     0.3    0.3    1.1    1.0    0.2
                   torso    0.5    0.0    1.9    1.6    0.1
                   tail     0.8    0.3    2.6    2.2    0.3

Table 4: Comparison of XML-Q and the Naive Bayes model (top) and of the hybrid model and the Naive Bayes model (bottom) in accuracy, precision, recall, F1, and AUC (in %) for the head, torso, and tail query sets using the offline golden data evaluation pipeline.

The search relevance human judgment results in Table 5 show that moving from the Naive Bayes model to the hybrid model improves the NDCG* score by 0.3% for torso queries and 1.1% for tail queries; the change for head queries is neutral. One reason is that a large portion of head queries have length 1, for which the hybrid and Naive Bayes models are identical. Another reason is that head queries are relatively easy to classify, so, per the human judgment results, the predictions of both the Naive Bayes model and XML-Q are correct. The average NDCG* improvement over the full query set is 0.4%.

  QuerySet   NDCG* improvement
  head            0.0
  torso           0.3
  tail            1.1
  mix             0.4

Table 5: Comparison of the hybrid model and the Naive Bayes model in NDCG* (in %) for the head, torso, and tail query sets from the search relevance human judgment results.

5 CONCLUSION
This paper presents a system design for extreme multi-label query classification as an application in the e-commerce search system. We propose a hybrid model that combines a Naive Bayes classifier with a deep neural network model, XML-Q, dispatching on query length: the Naive Bayes model handles one-word queries, and the deep neural network model handles the rest. The deep learning model extends the state-of-the-art XML-CNN model with several components and adjustments that make it more suitable for query classification, such as the differentiable F1 loss, the additional RNN layer, and the cosine learning rate decay. Three new data augmentation techniques applied to the training data significantly improve the model performance. The new evaluation metric PR@k addresses the problem with the standard precision@k metric when k is larger than the total number of true labels of a query. Since the label set for query classification is huge, direct human judgment over the whole label space is almost impossible; we therefore propose end-to-end system-level evaluations. The evaluation results show that the hybrid model enhances the performance of query classification across different query sets, especially the torso and tail query sets.

REFERENCES
[1] Jung-Woo Ha, Hyuna Pyo, and Jeonghee Kim. 2016. Large-scale item categorization in e-commerce using multiple recurrent neural networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 107–115.
[2] Vivek Gupta, Harish Karnick, Ashendra Bansal, and Pradhuman Jhala. 2016. Product classification in e-commerce using distributional semantics. arXiv preprint arXiv:1606.06083.
[3] Mangi Kang, Jaelim Ahn, and Kichun Lee. 2018. Opinion mining using ensemble text hidden Markov models for text classification. Expert Systems with Applications, 94, 218–227.
[4] Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, and Xiangyang Xue. 2016. Multi-stream multi-class fusion of deep networks for video classification. In Proceedings of the 24th ACM International Conference on Multimedia. ACM, 791–800.
[5] Husheng Guo and Wenjian Wang. 2015. An active learning-based SVM multi-class classification model.
Pattern Recognition, 48, 5, 1577–1597.
[6] Venkataraman Santhanam, Vlad I. Morariu, David Harwood, and Larry S. Davis. 2016. A non-parametric approach to extending generic binary classifiers for multi-classification. Pattern Recognition, 58, 149–158.
[7] Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. 2017. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 115–124.
[8] Himanshu Jain, Yashoteja Prabhu, and Manik Varma. 2016. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 935–944.
[9] Ian En-Hsu Yen, Xiangru Huang, Pradeep Ravikumar, Kai Zhong, and Inderjit Dhillon. 2016. PD-Sparse: a primal and dual sparse approach to extreme multiclass and multilabel classification. In International Conference on Machine Learning, 3069–3077.
[10] Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. 2013. Multi-label learning with millions of labels: recommending advertiser bid phrases for web pages. In Proceedings of the 22nd International Conference on World Wide Web. ACM, 13–24.
[11] Wei Bi and James Kwok. 2013. Efficient multi-label classification with many labels. In International Conference on Machine Learning, 405–413.
[12] Himanshu Jain, Venkatesh Balasubramanian, Bhanu Chunduri, and Manik Varma. 2019. Slice: scalable linear extreme classifiers trained on 100 million labels for related searches. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 528–536.
[13] Adel Hamdan Mohammad, Tariq Alwada'n, and Omar Al-Momani. 2018. Arabic text categorization using support vector machine, naïve Bayes and neural network. GSTF Journal on Computing (JoC), 5, 1.
[14] Wei Chen, Xusheng Yan, Zhou Zhao, Haoyuan Hong, Dieu Tien Bui, and Biswajeet Pradhan. 2019. Spatial prediction of landslide susceptibility using data mining-based kernel logistic regression, naive Bayes and RBFNetwork models for the Long County area (China). Bulletin of Engineering Geology and the Environment, 78, 1, 247–266.
[15] S. Vijayarani and S. Dhayanand. 2015. Liver disease prediction using SVM and naïve Bayes algorithms. International Journal of Science, Engineering and Technology Research (IJSETR), 4, 4, 816–820.
[16] Young-gon Kim, Taehee Lee, Jonghoon Chun, and Sang-goo Lee. 2006. Modified naïve Bayes classifier for e-catalog classification. In International Workshop on Data Engineering Issues in E-Commerce and Services. Springer, 246–257.
[17] How Jing, Yu Tsao, Kuan-Yu Chen, and Hsin-Min Wang. 2013. Semantic naïve Bayes classifier for document classification. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, 1117–1123.
[18] Andrew Y. Ng and Michael I. Jordan. 2002. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems, 841–848.
[19] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
[20] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
[21] Rie Johnson and Tong Zhang. 2014. Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058.
[22] Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. 2015. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems, 730–738.
[23] Yashoteja Prabhu and Manik Varma. 2014. FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 263–272.
[24] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[25] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
[26] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
[27] Shantanu Godbole and Sunita Sarawagi. 2004. Discriminative methods for multi-labeled classification.
In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 22–30.
[28] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15, 1, 1929–1958.
[29] Ilya Loshchilov and Frank Hutter. 2016. SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
[30] Mohammad S. Sorower. 2010. A literature survey on algorithms for multi-label learning. Oregon State University, Corvallis, 18, 1–25.
[31] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. 2009. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook. Springer, 667–685.
[32] Jin Huang and Charles X. Ling. 2005. Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17, 3, 299–310.
[33] Charles X. Ling, Jin Huang, Harry Zhang, et al. 2003. AUC: a statistically consistent and more discriminating measure than accuracy. In IJCAI. Volume 3, 519–524.