    An Empirical Study of Using An Ensemble Model in
     E-commerce Taxonomy Classification Challenge

Yugang Jia, Winchester, MA, yugang.jia@gmail.com
Xin Wang, Newton, MA, wangxin8588@gmail.com
Hanqing Cao, Mahwah, NJ, vauus@yahoo.com
Boshu Ru, Dept. of Software and Information Systems, UNC Charlotte, Charlotte, NC, boshu.ru@gmail.com
Tianzhong Yang, The University of Texas Health Science Center, Houston, TX, tianzhong.yang@gmail.com

Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, G. Di Fabbrizio, S. Kallumadi, M. Kumar, Y.-C. Lin, A. Trotman, H. Zhao (eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July, 2018, Ann Arbor, Michigan, USA, published at http://ceur-ws.org

ABSTRACT
In the Rakuten Data Challenge on Taxonomy Classification for eCommerce-scale Product Catalogs, we propose an approach based on deep convolutional neural networks that predicts product taxonomies from product descriptions. The classification performance of the proposed system is further improved with oversampling, threshold moving and error-correcting output coding. The best classification accuracy is obtained by ensembling multiple networks that are trained differently on multiple inputs built from various extracted features.

CCS CONCEPTS
• Computing methodologies → Machine learning;

KEYWORDS
Multi-class classification, Imbalanced classes, Word embedding, Convolutional neural networks, Error-correcting output coding

1 INTRODUCTION
E-commerce sites provide millions of products that are continuously updated by merchants. The correct categorization of each product plays a crucial role in helping customers find the products that meet their needs, and it supports many applications such as product search, targeted advertising, personalized recommendation and product clustering. However, because of the large scale of the catalog, manually categorizing the products is often infeasible and error prone. Therefore, large-scale automatic categorization is in great need.

2 RELATED WORK
The challenges in large-scale automatic categorization include the following. First, products are sparsely distributed over a large number of categories and the data distribution is far from uniform; such imbalanced data can largely deteriorate categorization performance [5, 21]. Second, a commercial product taxonomy is usually organized as a tree with thousands of leaf nodes, which adds another layer of difficulty in exploring the correlations among a large number of taxonomy nodes in a hierarchical structure. Various classification methods, such as flat classification, cascade classification and probabilistic cascading, have been deployed on large-scale taxonomies [1, 8, 13]. However, the task remains challenging due to the large data scale, data heterogeneity, and category skewness [24].

In light of these challenges, different methods have been proposed to achieve optimal classification performance. For example, Naive Bayes is effective and efficient for classifying text documents [15, 19], but it performs poorly when some categories are sparse [23]. Support vector machines (SVMs) have served as a well-established benchmark model for classifying e-commerce products [9]. Chen et al. proposed a multi-class SVM with a margin re-scaling extension to optimize average revenue loss [4]. However, SVMs have been shown to have longer computing times and to work well only when the number of categories is less than five [5]. Among deep learning algorithms, a recurrent neural network (RNN) was proposed by Pyo and Ha to deal with the multi-class classification problem on unbalanced data [8], in which the learnt word embedding depends on a recursive representation of the same initial feature space. In addition, convolutional neural networks (CNNs) achieve remarkable performance in sentence-level classification [10, 11, 27].
Recently, CNNs have been regarded as a replacement for the well-established SVM and logistic regression models, using pre-trained word vectors as inputs for training [28].

Product categorization is a hierarchical multi-class classification problem, so a natural way to classify products is hierarchical classification. However, hierarchical classification suffers from error propagation [25]. Kosmopoulos et al. proposed a probabilistic hierarchical classification approach that predicts the leaf categories by estimating the probability of each root-to-leaf path [13]. Cevahir et al. [3] mitigated the error propagation issue in hierarchical classification and achieved better results than flat models by incorporating a large number of features in leaf category prediction.

To combat the imbalanced classification problem, cost-sensitive training appears to be an effective solution. Zhou and Liu [29] show empirically that only over-sampling and threshold moving are effective for training cost-sensitive neural networks. However, it becomes difficult to define misclassification costs when there are a large number of classes. A more recent paper [18] supports similar claims.

Error-correcting output coding (ECOC) is another method that has been used in multi-class text classification to further improve a classifier's accuracy [6, 7]. The idea behind this approach is to encode each class label as a unique binary code with a number of digits, so that redundancy is introduced into the transformed class labels that are then used to train a supervised machine learning model. Even if some errors occur in the prediction of a transformed label, we may still be able to recover the correct original label by choosing the one that is closest to the prediction. This approach reduces the size of the model output space when there are a large number of class labels, and at the same time it alleviates the class imbalance problem.

3 DATA CHARACTERISTICS
The SIGIR eCom Data Challenge addresses large-scale taxonomy classification. The competition requires us to accurately classify each product description in the unlabeled testing data into one of the classes, using models developed from the labeled training data. The data are provided as a training set (800 thousand products) and a testing set (200 thousand products). The training data has one column containing the product description and one column containing the product category. There are 14 principal product categories represented by numbers, all but one of which have secondary categories at levels ranging from 1 to 7 (i.e., 1 to 8 levels in total), which results in 3008 classes. Each such combination of principal and secondary categories forms a string of numbers that represents a class, e.g., "3625>4399>1598>3903" or "2296>3597>689".

There are many characteristics of, and potential challenges in, this dataset. At the class level, there is large variation in the number of samples per class, resulting in a largely unbalanced dataset (see Table 1). Out of the 3008 classes, 19 classes have only one product, and 1484 classes (49.3 percent) have fewer than 25 products, which adds up to 13,618 products, or 1.70 percent of all products. On the other hand, 12 classes have more than 10,000 products, which adds up to a total of 256,689 products, or 32.1 percent of all products. The top 3 largest classes contain 69915, 30146, and 25481 products, respectively (see Table 2). The classes with many samples may embrace both the richness and the diversity of the data, which can make the inter-class distance smaller than the intra-class distance. The classes with only a handful of samples are expected to be hard to classify accurately in the testing data.
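
The size buckets summarized in Table 1 can be reproduced with a short count over the training labels. The sketch below is illustrative only: the file name train.tsv, the tab separator and the column names are assumptions about the data layout, and the bucket boundaries are treated as simple powers of ten.

```python
import math
from collections import Counter

import pandas as pd

# Hypothetical layout: a tab-separated file with a title and a category path
# per product, e.g. "Humminbird Pc11 Power Cord\t3625>4399>1598>3903".
train = pd.read_csv("train.tsv", sep="\t", names=["title", "category"])

# Number of products in each of the 3008 category paths.
class_sizes = train["category"].value_counts()

# Bucket the categories by order of magnitude, as in Table 1.
buckets = Counter(int(math.log10(n)) for n in class_sizes)
for magnitude in sorted(buckets):
    print(f"log10(size) = {magnitude}: {buckets[magnitude]} categories")
```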
                                                                       tion, ranging from 1 to dozens. While a short description
3   DATA CHARACTERISTICS                                               might not provide sufficient information, such as a single
The SIGIR eCom Data Challenge is on large-scale taxonomy               word product "Bonjour", a long description might include
classification. The competition requires us to classify each           too much details that the relevant information might be
product description to one of the classes accurately in the            hidden, such as "fosmon 2100mah dual port usb rapid car
unlabeled testing data using models developed from the la-             charger for apple iphone 5c/5s/5/4s/4, samsung galaxy note
beled training data. Data is provided as a training data (800          3/2, s5/s4/mini/active/s3, lg g3/lg g2/ g2 mini, google nexus
thousands of products) and a testing data (200 thousands of            5/4, blackberry z10/z30/q10, htc one (m8), motorola moto
products). The training data has one column having produc-             e/moto x/moto g, nokia lumia 1020".
tion description and one column having product categories.             Due to the above data characteristics, data preprocessing
There are 14 principal product categories represented by               that filters out noise and keeps relevant information is sys-
numbers, all but one of which have secondary categories of             tematically designed and implemented and various modeling
levels ranging from 1 to 7, or 1 to 8 levels in total, which           strategies are tested as described in the next sections.
results in a total of 3008 classes. Each of such combination
of principal and secondary categories formulates a string of
numbers that represents a class, eg "3625>4399>1598>3903",
"2296>3597>689".

Table 1: Data Characteristics - Category Size

log10(Category Size)    Count    Category Size Range
0                         893    < 10
1                        1372    (10 - 100]
2                         623    (100 - 1,000]
3                         108    (1,000 - 10,000]
4                          12    > 10,000

Table 2: Top 3 Categories

Category                 Number of Products
2199>4592>12                          69915
3292>3581>3145>2201                   30146
4015>2337>1458>40                     25481

4 PREPROCESSING AND FEATURIZATION
In this section we describe how the product descriptions are preprocessed and which features are extracted to train our model.

Preprocessing
As described above, the product descriptions contain noise. We mitigate the noise in a trial-and-error way in order to find a good signal-to-noise balance, which results in the following procedure. We first convert all letters to lower case, and replace special characters such as parentheses (except single hyphens), as well as repeated characters such as multiple hyphens or dots, with spaces. We then unify physical units such as "in", "ft", "hp", "ml" and "oz" that follow a number into "nnnhp", "nnnml", "nnnoz", and so on; for example, "3.4oz", "3.4 oz" or "3.4-oz" all become "nnnoz". Finally, we remove dashes (-), standalone numbers, and extra white space. As a result of this preprocessing, the number of distinct words is reduced from 670K to 160K.
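
A minimal sketch of this normalization is given below. The regular expressions and the unit list are an illustrative reconstruction of the rules just described, not the exact production pipeline.

```python
import re

# Illustrative subset of the unit list; the full pipeline covers more units.
UNITS = ["in", "ft", "hp", "ml", "oz"]

def normalize(text: str) -> str:
    text = text.lower()
    # Replace special characters (keeping single hyphens and dots for now)
    # and collapse runs of repeated hyphens or dots into spaces.
    text = re.sub(r"[^\w\s.-]", " ", text)
    text = re.sub(r"[-.]{2,}", " ", text)
    # Unify "<number><unit>" variants such as "3.4oz", "3.4 oz", "3.4-oz" -> "nnnoz".
    unit_pattern = r"\b\d+(?:\.\d+)?[\s-]?(" + "|".join(UNITS) + r")\b"
    text = re.sub(unit_pattern, r"nnn\1", text)
    # Drop remaining dashes, standalone numbers and extra white space.
    text = text.replace("-", " ")
    text = re.sub(r"\b\d+(?:\.\d+)?\b", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Creed Green Irish Tweed - Small Size 1oz (30ml)"))
# -> "creed green irish tweed small size nnnoz nnnml"
```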

Word-level and character-level embeddings for system input
We use three different sets of word embeddings to create vector representations of words, each of which is described in more detail below. The resulting vectors for each word in a product description are concatenated (i.e., column-bound) as word embedding features. As shown in Fig. 1, the concatenated word embeddings form the input for training a CNN model. The three sets of word embeddings, generated by different methods, are used as three separate inputs.

Word embeddings pre-trained on Google News. The pre-trained word embeddings are trained on part of the Google News dataset (about 100 billion words) using the word2vec algorithm [20]. The representation of a word is learnt to be useful for predicting the other words in the sentence. The model contains 300-dimensional vectors for 3 million words and phrases.

Word embeddings pre-trained on product descriptions. We also learn word embeddings from all product descriptions using the word2vec implementation in Gensim [22], which is an unsupervised learning model. The dimension of the word vectors is set to 50 and we train the word2vec model with the CBOW algorithm [20] for 5 epochs. In each epoch, the learning rate starts at 0.025 and decays linearly to 0.0001.
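
This training step might look roughly as follows; the parameter names follow the Gensim 4.x API (the paper predates that release, so treat them as an assumption), and the corpus line reuses the normalize helper and the train frame from the earlier sketches.

```python
from gensim.models import Word2Vec

# Tokenized, normalized product descriptions (see the earlier sketches).
corpus = [normalize(title).split() for title in train["title"]]

w2v = Word2Vec(
    sentences=corpus,
    vector_size=50,    # 50-dimensional word vectors
    sg=0,              # CBOW
    alpha=0.025,       # initial learning rate,
    min_alpha=0.0001,  # decayed linearly to 0.0001
    epochs=5,
    workers=4,
)
print(len(w2v.wv), "words in vocabulary")
```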

Word embeddings learnt in training. First, we use one-hot coding to represent each individual word in a product description. We then learn a weight matrix in an embedding layer of the proposed model that transforms each word into a 50-dimensional word vector. The word vectors transformed from the one-hot coding are used as the input to train the first convolutional layer.

Character-level embeddings learnt in training. Besides word embeddings, we also use character-level embeddings learnt in training, where the learning approach is similar to that of the word embeddings learnt in training, except that we use one-hot coding to represent each unique character in the raw text rather than each word.

Named entity and part-of-speech tag features
The appearance of named entities might be associated with certain product categories. For example, locations, landmarks and famous people are often prevalent in categories such as branded perfumes, books and movies, while organization names such as Apple and HP might be seen more commonly in electronic products. In addition, individual words associated with named entities might be rare and therefore filtered out by our word frequency requirement, yet still be informative. For example, the word "Beethoven" occurs in only one product description, but it is the name of a famous musician and it carries strong information about what type of product it could be. We therefore use the Stanford CoreNLP package [17] to extract 23 types of named entities from the product description, namely cause of death, city, country, criminal charge, date, duration, email, ideology, location, misc, money, nationality, number, ordinal, organization, percent, person, religion, set, state or province, time, title, and URL. For each product description, we count the number of words of each entity type and normalize the counts by the length of the sentence, resulting in a 23-dimensional feature vector representing the distribution of named entity types. Additionally, we generate a 36-dimensional feature vector for the distribution of part-of-speech tags, which are also identified by Stanford CoreNLP.
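
Once the tags are available, building the normalized distributions is a simple counting step. The sketch below assumes the Stanford CoreNLP annotation has already been run and arrives as (token, NER tag, POS tag) tuples; the type lists are truncated for brevity (23 NER types and 36 POS tags in the full system).

```python
import numpy as np

NER_TYPES = ["CITY", "COUNTRY", "DATE", "ORGANIZATION", "PERSON", "NUMBER"]  # 6 of the 23 types
POS_TAGS = ["NN", "NNS", "NNP", "JJ", "CD", "IN"]                            # 6 of the 36 tags

def tag_histograms(tagged_tokens):
    """Normalized NER-type and POS-tag distributions for one product description."""
    n = max(len(tagged_tokens), 1)
    ner_vec = np.zeros(len(NER_TYPES))
    pos_vec = np.zeros(len(POS_TAGS))
    for _token, ner, pos in tagged_tokens:
        if ner in NER_TYPES:
            ner_vec[NER_TYPES.index(ner)] += 1
        if pos in POS_TAGS:
            pos_vec[POS_TAGS.index(pos)] += 1
    # Counts are normalized by the description length.
    return ner_vec / n, pos_vec / n

example = [("apple", "ORGANIZATION", "NNP"), ("iphone", "O", "NNP"), ("charger", "O", "NN")]
print(tag_histograms(example))
```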

Figure 1: Illustration of the analysis system for product title categorization. (Diagram: raw text is pre-processed; supervised word, Google word2vec, Google News and supervised character-level embeddings, together with NER/POS and doc2vec features, feed Kim-CNN, Kim-CNN with oversampling, Kim-CNN with ECOC and Zhang-CNN branches; dense softmax/sigmoid layers are followed by maximum likelihood decoding and threshold moving, and a rule-based final ensemble produces the prediction.)




Figure 2: Classifier performance (F1-score, recall and precision) as a function of category size (panels (a)-(c)).





Creating sentence-level representation
We use the doc2vec algorithm proposed in "Distributed Representations of Sentences and Documents" by Le and Mikolov [14] to create another set of features for product descriptions. The algorithm modifies word2vec into an unsupervised method for learning continuous representations of larger blocks of text, such as sentences, paragraphs or entire documents. We represent each product description by a 50-dimensional feature vector and train the doc2vec model for 5 epochs.
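
A corresponding Gensim sketch (again with 4.x parameter names, which are an assumption) could look like this, reusing the tokenized corpus from the word2vec sketch:

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One TaggedDocument per product description.
documents = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(corpus)]
d2v = Doc2Vec(documents, vector_size=50, epochs=5, workers=4)

# One 50-dimensional doc2vec feature vector per product description.
doc_features = np.vstack([d2v.dv[i] for i in range(len(documents))])
```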

5 PROPOSED MODEL FOR PRODUCT TAXONOMY CLASSIFICATION
The overall architecture of our system is shown in Fig. 1. We introduce its key components in the following sections. We train five models separately, each using multiple data inputs and a different setup, and we ensemble their predictions to make the final predictions. All models are trained with the Adam algorithm [12] with a learning rate of 0.001.

Kim-CNN architecture
Several recent studies have examined CNN models for text classification tasks and reported that CNN-based models achieve outstanding performance [2, 11, 16]. We adopt the design of Kim's CNN model to extract informative patterns from the word embedding representation of the input data [11]. Kim's model contains parallel convolution filters of three different kernel sizes that highlight informative patterns in sub-areas of the input. In the case of text, the kernel size k of a filter determines the extent of the sub-area, that is, the number of consecutive words in a sentence. By mixing convolution filters of three consecutive sizes (k-1, k, and k+1), the network can learn patterns in the sentence at three different scales. This design is similar to combining n-gram features of different orders (e.g., unigram, bigram, and trigram). The max pooling filter scans through the outputs of the convolution filters and preserves only the maximum value in each area. This operation washes out information that is less relevant to the classification task and reduces the dimensionality of the features extracted by the convolution filters. The output is further processed by one layer of fully connected neurons to condense the output matrix.

Based on the Kim-CNN architecture proposed in [11], which used kernels of sizes 3, 4 and 5, we insert one more kernel of size 1 to capture unigram features, and we apply a batch normalization layer followed by a dropout layer after the fully connected layer of the standard Kim-CNN; the output is then processed by a second fully connected layer to yield the final outputs.
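
Below is a sketch of this modified branch in tf.keras. The paper does not name a framework, so this is only an illustration; the sequence length, vocabulary size, the learnt 50-dimensional embedding input and the 1024-unit first dense layer are assumptions, while the window sizes, filter count, dropout rate and regularization strength follow the values reported in Section 6.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# Assumed sizes: ~160K vocabulary after preprocessing, 40-token descriptions,
# 50-dimensional embeddings learnt in training, 3008 target classes.
VOCAB_SIZE, SEQ_LEN, EMB_DIM, NUM_CLASSES = 160_000, 40, 50, 3008
L2 = regularizers.l2(3e-6)

tokens = layers.Input(shape=(SEQ_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(tokens)

# Parallel convolutions over 1-, 3-, 4- and 5-word windows, 1024 filters each,
# followed by max pooling over the sequence.
pooled = []
for k in (1, 3, 4, 5):
    c = layers.Conv1D(1024, k, activation="relu", kernel_regularizer=L2)(x)
    pooled.append(layers.GlobalMaxPooling1D()(c))
h = layers.Concatenate()(pooled)

# Fully connected layer, batch normalization and dropout, then a second
# fully connected layer with softmax over the 3008 classes.
h = layers.Dense(1024, activation="relu", kernel_regularizer=L2)(h)
h = layers.BatchNormalization()(h)
h = layers.Dropout(0.5)(h)
out = layers.Dense(NUM_CLASSES, activation="softmax")(h)

model = tf.keras.Model(tokens, out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```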

Oversampling and threshold moving
We adopt an oversampling strategy to improve the model's performance on the imbalanced data. When training one of the five networks shown in Fig. 1, we initially draw 256 examples from the training data to form one batch. We then add 768 more examples to the batch by duplicating examples whose classes have fewer than 5000 examples in the entire training set. Within the 768 duplicated examples, we further distinguish three bands of classes according to class size in the training set: classes with 1000 to 5000 examples, 100 to 1000 examples, and fewer than 100 examples. The proportions of the three bands follow 1:2:4 in forming the 768 oversampled examples. In the end we have 1024 examples in each batch.
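
A sketch of this batch construction (NumPy arrays assumed): the band boundaries reflect our reading of the description above, and integer rounding makes the duplicated portion add up to slightly fewer than 768 examples.

```python
import numpy as np

def make_batch(X, y, class_sizes, rng, base=256, extra=768):
    """Draw 256 random examples, then add ~768 duplicates from small classes
    in a 1:2:4 ratio across the 1000-5000, 100-1000 and <100 size bands."""
    idx = rng.choice(len(X), size=base, replace=False)
    bands = [(1000, 5000, 1), (100, 1000, 2), (0, 100, 4)]  # (low, high, weight)
    total_weight = sum(w for _, _, w in bands)
    extra_parts = []
    for lo, hi, w in bands:
        pool = np.where((class_sizes[y] >= lo) & (class_sizes[y] < hi))[0]
        if len(pool):
            extra_parts.append(rng.choice(pool, size=extra * w // total_weight, replace=True))
    idx = np.concatenate([idx] + extra_parts)
    return X[idx], y[idx]

# Usage sketch: rng = np.random.default_rng(0); class_sizes = np.bincount(y)
```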

We also apply a threshold moving strategy that adjusts the model predictions to alleviate the data imbalance problem. The original outputs of our models, which are probabilities for each class, are divided by the class sizes before we yield the final class labels. This strategy reduces the probabilities assigned to large classes and increases those assigned to smaller classes. Our experimental results show that it further improves model performance.
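
In code, this adjustment is essentially a one-line rescaling before the argmax:

```python
import numpy as np

def threshold_moving(probs: np.ndarray, class_sizes: np.ndarray) -> np.ndarray:
    """Divide each class probability by its class size, then take the argmax.

    probs: [n_examples, n_classes] predicted probabilities.
    class_sizes: [n_classes] number of training examples per class.
    """
    return (probs / class_sizes).argmax(axis=1)
```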

Error-correcting output coding
To leverage the hierarchy of class labels and exploit label correlations, we also use error-correcting output coding [6, 7]. For our case with a large number of classes, we create a unique binary code for each taxonomy node. The code length differs by taxonomy level; for example, the first-level taxonomy uses 3 digits while the second-level taxonomy uses 4 digits. We then concatenate the codes of all taxonomy levels along a path to form the new output coding. The original taxonomy prediction problem is thus transformed into a multi-label learning problem over the encoded labels. The longest taxonomy path contains 8 taxonomy codes; encoded labels of shorter paths are padded with zeros.
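
The encode/decode mechanics could look like the sketch below. The exact per-level code construction (including the 3- and 4-digit widths mentioned above) is not fully specified, so the fixed-width binary codes here are purely illustrative.

```python
import numpy as np

def build_codebooks(paths, width_per_level):
    """Assign every distinct value at each taxonomy level a fixed-width binary code."""
    codebooks = []
    for level, width in enumerate(width_per_level):
        values = sorted({p[level] for p in paths if len(p) > level})
        codebooks.append({v: [int(b) for b in format(i, f"0{width}b")]
                          for i, v in enumerate(values)})
    return codebooks

def encode(path, codebooks, width_per_level):
    """Concatenate per-level codes; levels missing from the path are zero-padded."""
    bits = []
    for level, width in enumerate(width_per_level):
        if len(path) > level:
            bits += codebooks[level].get(path[level], [0] * width)
        else:
            bits += [0] * width
    return np.array(bits)

def decode(bit_probs, classes, codebooks, width_per_level):
    """Pick the original class whose code is most likely under the predicted bit probabilities."""
    p = np.clip(bit_probs, 1e-6, 1 - 1e-6)
    def loglik(path):
        code = encode(path, codebooks, width_per_level)
        return np.sum(code * np.log(p) + (1 - code) * np.log(1 - p))
    return max(classes, key=loglik)
```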

Model ensembling
The models trained for ensembling are shown in Fig. 1 and are described as follows:

• Model 1: The last fully connected layer takes as input (1) the output of the Kim-CNN networks, (2) the NER features, and (3) the doc2vec features.
• Model 2: The last fully connected layer takes as input (1) the output of one Kim-CNN model trained with oversampling, (2) the output of another Kim-CNN trained with ECOC, and (3) the NER features.
• Model 3: The last fully connected layer takes as input (1) the output of the Kim-CNN networks trained with ECOC, (2) the NER features, and (3) the doc2vec features. The output of Model 3 is decoded from the predicted multi-label codes by first computing the likelihood of each individual class label and then assigning the original class label with the highest likelihood to the example.
• Model 4: We train a Kim-CNN model on character-level embeddings of the raw text. The last fully connected layer takes as input (1) the output of this Kim-CNN, (2) the NER features, and (3) the doc2vec features.
• Model 5: In contrast to Kim-CNN, which concatenates the feature maps generated from different window sizes in parallel, we train a CNN with six sequential layers, referred to as Zhang-CNN [26]. The last fully connected layer takes as input (1) the output of the Zhang-CNN, (2) the NER features, and (3) the doc2vec features. We use hyperparameters similar to those proposed in the original paper, with fine tuning.

The output of each model is also adjusted by threshold moving, so that each model provides two versions of predictions, in terms of probabilities, with and without threshold moving. We derive an ensembling procedure to combine the multiple model predictions. First, we select the predictions from Models 1, 4 and 5 to form an initial set of candidate predictions, containing 6 versions of predictions. We repeat training Model 2 three times with random data shuffling and oversampling to provide 6 more versions of predictions and add them to the set. We adopt 3 different ways of creating output codings and repeatedly train Model 3 to obtain 6 more versions of predictions. We observe that using more than 6 versions from Model 2 or Model 3 does not further improve the overall performance. In the end we apply majority voting over all versions of predicted labels in the candidate set to predict the final labels. When there is a tie, we choose the label with the highest averaged probability.
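
A sketch of this voting rule is given below; we assume every prediction version supplies both hard labels and a probability matrix, and that the tie-break uses the probabilities averaged over all versions.

```python
import numpy as np
from collections import Counter

def ensemble_vote(label_votes, prob_matrices):
    """Majority vote across prediction versions, ties broken by highest mean probability.

    label_votes: list of [n_examples] arrays of predicted class ids (one per version).
    prob_matrices: list of [n_examples, n_classes] arrays of class probabilities.
    """
    votes = np.stack(label_votes, axis=1)        # [n_examples, n_versions]
    mean_probs = np.mean(prob_matrices, axis=0)  # [n_examples, n_classes]
    final = np.empty(votes.shape[0], dtype=int)
    for i, row in enumerate(votes):
        counts = Counter(row)
        best = max(counts.values())
        tied = [label for label, c in counts.items() if c == best]
        # Among tied labels, keep the one with the highest averaged probability.
        final[i] = max(tied, key=lambda label: mean_probs[i, label])
    return final
```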

6 PARAMETER TUNING AND ERROR ANALYSIS
For training the proposed model, hyper-parameters such as the number of filters per convolutional layer (among {256, 512, 1024}), the dropout rate (among {0.1, 0.2, 0.3, 0.4, 0.5}) and the regularization parameter of the convolutional and fully connected layers (among {1×10^-6, 3×10^-6, 1×10^-3, 3×10^-3, 1×10^-1}) are tuned on a hold-out set taken from the training data. We adopt the same combination of window sizes 3, 4 and 5 as used in Kim-CNN [11], and additionally add a window spanning only one word, which helps to predict the taxonomy correctly from a few selected single words. We select the architecture with the best performance. The adopted hyper-parameters for the Kim-CNN model are the following:

• Number of filters: 1024
• Window sizes of Kim-CNN: {1, 3, 4, 5}
• Dropout rate: 0.5
• Regularization parameter: 3×10^-6

In the tuning process, we first fix the number of filters to 256 and tune the other parameters. We then increase the number of filters to 512 and 1024, and the latter shows better performance. We also notice that the higher dropout rate of 0.5 helps to reach the best performance. Besides, our model is quite sensitive to the regularization parameter in terms of convergence speed: once we increase it from 3×10^-6 to 1×10^-3, the model converges much more slowly with no increase in performance.

Our error analysis shows that the classifier works well for categories with large sample sizes, for example more than 1000 cases, and not so well for smaller categories. Figure 2 shows the performance on a random validation dataset (20 percent of the training data) for an ensemble of models trained with over-sampling, threshold moving and error-correcting output coding. This result inspires us to adopt various sampling strategies aimed at increasing the sample size of the small categories.

7 RESULTS AND DISCUSSION
The models are first trained on 80% of the given 800K training samples and validated on the remaining 20% to tune the hyperparameters. The models are then retrained on all 800K samples with the fixed hyperparameters, following an early stopping strategy, and tested on the 200K samples with unknown labels (Table 3). We achieve a good performance, with an F1 score of 0.8295.

Table 3: Testing Results

Metric       Testing-Stage1    Testing-Stage2
Precision    0.8545            0.8528
Recall       0.8172            0.8172
F1           0.8278            0.8295

We observe that feature engineering is particularly important for further improving the performance of the CNN: as we added more relevant but somewhat different features, for example the NER features, the performance improved accordingly. We also observe considerable performance differences among the models. The models based on character-level embeddings do not perform as well as the others; however, they help to improve the overall performance of the ensemble.

The threshold moving method is very helpful for increasing the precision of the individual models, which is critical for the final ensembling.
The oversampling and ECOC algorithms can add additional randomness and improve the performance of the ensemble model to a certain extent.

ACKNOWLEDGMENTS
The authors would like to thank the organizer of the SIGIR 2018 eCom Data Challenge, the Rakuten Institute of Technology Boston (RIT-Boston), for their support.

REFERENCES
[1] Rohit Babbar, Ioannis Partalas, Eric Gaussier, and Massih R. Amini. 2013. On flat versus hierarchical classification in large-scale taxonomies. In Advances in Neural Information Processing Systems. 1824–1832.
[2] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. 2016. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 (2016).
[3] Ali Cevahir and Koji Murakami. 2016. Large-scale Multi-class and Hierarchical Product Categorization for an E-commerce Giant. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. 525–535.
[4] Jianfu Chen and David Warren. 2013. Cost-sensitive learning for large-scale hierarchical classification. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 1351–1360.
[5] Pradipto Das, Yandi Xia, Aaron Levine, Giuseppe Di Fabbrizio, and Ankur Datta. 2016. Large-scale taxonomy categorization for noisy product listings. In 2016 IEEE International Conference on Big Data. IEEE, 3885–3894.
[6] Thomas G. Dietterich and Ghulum Bakiri. 1995. Solving Multiclass Learning Problems via Error-correcting Output Codes. J. Artif. Int. Res. 2, 1 (Jan. 1995), 263–286.
[7] Rayid Ghani. 2000. Using Error-Correcting Codes for Text Classification. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML '00). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 303–310.
[8] Jung-Woo Ha, Hyuna Pyo, and Jeonghee Kim. 2016. Large-scale item categorization in e-commerce using multiple recurrent neural networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 107–115.
[9] Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning. Springer, 137–142.
[10] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188 (2014).
[11] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
[12] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014). arXiv:1412.6980
[13] Aris Kosmopoulos, Georgios Paliouras, and Ion Androutsopoulos. 2015. Probabilistic cascading for large scale hierarchical classification. arXiv preprint arXiv:1505.02251 (2015).
[14] Quoc Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning (ICML '14). JMLR.org, II-1188–II-1196.
[15] David D. Lewis and Marc Ringuette. 1994. A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, Vol. 33. 81–93.
[16] Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. 2017. Deep Learning for Extreme Multi-label Text Classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 115–124.
[17] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 55–60.
[18] Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. 2017. A systematic study of the class imbalance problem in convolutional neural networks. arXiv preprint arXiv:1710.05381 (2017).
[19] Andrew McCallum, Kamal Nigam, et al. 1998. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, Vol. 752. Citeseer, 41–48.
[20] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS '13). Curran Associates Inc., USA, 3111–3119.
[21] Michael Pazzani, Christopher Merz, Patrick Murphy, Kamal Ali, Timothy Hume, and Clifford Brunk. 1994. Reducing misclassification costs. In Machine Learning Proceedings 1994. Elsevier, 217–225.
[22] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50.
[23] Dan Shen, Jean-David Ruvini, Rajyashree Mukherjee, and Neel Sundaresan. 2012. A study of smoothing algorithms for item categorization on e-commerce sites. Neurocomputing 92 (2012), 54–60.
[24] Dan Shen, Jean-David Ruvini, Manas Somaiya, and Neel Sundaresan. 2011. Item categorization in the e-commerce domain. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 1921–1924.
[25] Carlos N. Silla and Alex A. Freitas. 2011. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22, 1-2 (2011), 31–72.
[26] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1 (NIPS '15). MIT Press, Cambridge, MA, USA, 649–657.
[27] Ye Zhang, Stephen Roller, and Byron Wallace. 2016. MGNC-CNN: A simple approach to exploiting multiple word embeddings for sentence classification. arXiv preprint arXiv:1603.00968 (2016).
[28] Ye Zhang and Byron Wallace. 2015. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820 (2015).
[29] Zhi-Hua Zhou and Xu-Ying Liu. 2006. Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem. IEEE Trans. on Knowl. and Data Eng. 18, 1 (Jan. 2006), 63–77.