Team Waterloo at the SIGIR E-Commerce Data Challenge
                Angshuman Ghosh                                                         Vineet John                                     Rahul Iyer
                University of Waterloo                                         University of Waterloo                            University of Waterloo
               a25ghosh@uwaterloo.ca                                          email@institution.domain                          rahul.iyer@uwaterloo.ca

ABSTRACT                                                                                          existing algorithms can be re-purposed for taxonomy classification,
This paper describes the methods we use to solve a taxonomy                                       as will be subsequently discussed in Section 3.
classification problem specific to categorization of products in an                                  The seminal work on taxonomy classification was published by
e-commerce catalog. We describe the use of learning algorithms                                    KOLLER. The authors propose learning a small set of features to
like multinomial naive-bayes, logistic regressors, batched gradient-                              solve each classification task in the tree structure independently.
descent classifiers and neural softmax classifiers retro-fitted for                               This method involves training a classifier for each non-leaf node
the hierarchical classification objective. We explore methods that                                in the tree. This model comprises a top-down approach under the
exploit the hierarchical nature of the data, and compare it against                               assumption that the instance is already classified with respect to all
flat classification approaches. We use simple models with minimal                                 the labels that precede the current label in the hierarchy. This also
pre-processing to build and compare classifiers by reporting their                                drastically reduces the number of target classes’ probabilities to
respective weighted precision, recall and F 1 score metrics on the                                predict. However, a drawback of this approach is the large number
test data-set.                                                                                    of models that need to be trained to facilitate it. It also suffers
                                                                                                  from the cascading effect of classification failure i.e. all subsequent
CCS CONCEPTS                                                                                      classifications after an incorrect classification will also be incorrect.
                                                                                                     Since the most common schemes for multi-class classification in
• Computing methodologies → Natural language process-
                                                                                                  general are the One-vs-One and One-vs-All schemes, support vector
ing; Information extraction;
                                                                                                  machines, being max-margin classifiers, have been used frequently
                                                                                                  to address this task in previous works [5] [4] [2]. These works have
KEYWORDS                                                                                          used the SVM in the standard flat multi-classification setting and
taxonomy,classification                                                                           also proposed changes to adapt the classifier to the hierarchical
                                                                                                  setting, namely the Binary Hierarchical Classifier (BHC) [4] and
1     INTRODUCTION                                                                                the Hierarchical SVM (H-SVM) [2].
                                                                                                     Another work, published by McCallum et al. uses shrinkage to
Taxonomy classification is a multi-label classification problem where
                                                                                                  improve classification of a hierarchy of text articles. They use a
the classes form a hierarchy, as in a tree or a directed acyclic graph
                                                                                                  maximum log-likelihood estimate (MLE) based objective to shrink
(DAG). Although a significant amount of research in natural lan-
                                                                                                  the MLE of each node towards the MLEs of all its ancestors.
guage processing (NLP) pertains to flat-classification problems,
                                                                                                     Our approaches, in contrast, use simpler estimators, and rely
class hierarchies are fairly common in the real-world, and it follows
                                                                                                  on on minimal post-processing for a few among them to collate
that are there are tasks that require classification algorithms that
                                                                                                  the predictions in the event of multiple classifiers being used for a
address such hierarchical classification problems.
                                                                                                  single learning algorithm.
   Hierarchies emerge as a natural result of the organization of
extensive data-sets into fine-grained categories [6]. In this task
                                                                                                  3     APPROACH
we work with a product catalog taxonomy, with each individual
product being a part of an increasingly finer-grained category. The                               In Section 3.1, we describe the data cleaning and simple intuitions
objective of the task is to be able to accurately predict the entire                              we use to guide our learning strategies. We subsequently describe
class hierarchy under which the product is categorized.                                           each of the approaches that we utilized to obtain the requisite
   The contributions of this paper are the evaluation of different                                predictions.
strategies including flat and hierarchical classification, and the
implementation of a general framework to address the task of tax-                                 3.1    Preprocessing
onomy classification.                                                                             Each data instance to be classified in this task is a product name,
                                                                                                  which is usually comprised of the product’s canonical name, a brand
2     RELATED WORK                                                                                name and any high-level specifications (like dimensions, color etc.)
Taxonomy classification is a topic well-explored within the NLP                                   if applicable.
community, with several novel approaches over the past two decades.                                   There are 3008 unique category paths known from the training
It is also known in the literature as ‘hierarchical’ or ‘structured’ clas-                        data, with a maximum category depth of 8.
sification. The nature of the task implies that binary and multi-class                                As opposed to full sentences, there is intuitively a lesser advan-
classifiers cannot be used as is to solve this problem. However, these                            tage in implementing a forget mechanism for past contexts in a
                                                                                                  product name, disincentivizing the use of recurrent networks as
                                                                                                  feature extractors.
SIGIR’18,
 CopyrightJuly   2018,
            © 2018      Annpaper’s
                     by the  Arborauthors.
                                   Michigan,   USApermitted for private and academic purposes.
                                           Copying
 In: J.ACM
2018.    Degenhardt,  G. Di Fabbrizio, S. Kallumadi,. .M.
             ISBN 978-x-xxxx-xxxx-x/YY/MM.                Kumar, Y.-C. Lin, A. Trotman, H. Zhao
                                                       $15.00                                         Also, most of the numbers in the data-set, for instance, in the
 (eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July, 2018, Ann Arbor, Michigan, USA,
https://doi.org/10.1145/nnnnnnn.nnnnnnn
 published at http://ceur-ws.org                                                                  examples presented in Table 1 are used to quantify specifications
SIGIR’18, July 2018, Ann Arbor Michigan, USA


  Replacement Viewsonic VG710 LCD Monitor 48Watt AC                                                 Level   Unique Categories
  Adapter 12V 4A                                                                                     0                   14
  4-Pack Replacement Engine Air Filter for 2009 Sterling Truck                                       1                  108
  Bullet 55 L6 6.7 Car/Automotive                                                                    2                  865
  Luscious Pink Perfume 3.4 oz Eau De Parfum Spray For Women                                         3                 1573
  By MARIAH CAREY                                                                                    4                  752
         Table 1: Examples of products with numbers                                                  5                  232
                                                                                                     6                  148
                                                                                                     7                    3
                                                                                             Table 2: Level-wise category counts
of the product, and are not useful unless they’re part of the brand
or product name itself.
   Hence, we trim the numbers from the product names, in addition
to trimming the default english stopwords from curated dictionaries
maintained by NLTK1 , spaCy2 and scikit-learn3 . This eliminates
most of the conjunctions and prepositions used within a product
name that are not relevant to its class. Additionally, we also remove
any word less that 3 characters in length, to eliminate most of the
uninformative measure of quantities (e.g. V, mm, oz etc.).

3.2     One vs. All Classification
The class taxonomy provided is a Tree and each leaf node uniquely
identifies a class. Thus, for our initial trials, we decided to train
models using only leaf nodes. We used two classifiers: Multinomial
Naive Bayes (MNB) and XGBoost [1]. While there were no hyper-
parameters for MNB, XGBoost has a considerable number of tunable
parameters. XGBoost is an ensemble method where a large number                            Figure 1: Classifier per level architecture
of decision trees are combined using gradient boosting and it has
performed really well in multiple Kaggle competitions. The two
most important hyper parameters are the depth of the decision                     However, this method suffers from the shortcoming of disjoint
trees that are grown, and the number of trees that are trained. We             predictions from level-to-level. For example, from the classifier 1
performed a random search for parameters and obtained the best                 may predict a class that is not a descendant of the prediction of
results with a max-depth of 7 and 200 estimators.                              class 0. In this event, some additional processing is required such
                                                                               that the final category path is valid.
3.3     Multi-level Classification                                                We use a greedy search within the empirically computed log-
This classifier attempts to exploit the hierarchical nature of the             probabilities of classifiers for level ≥ 1. Instead of predicting the
categories. This requires using multiple classifiers (one for each             child class with the maximum log-likelihood predicted by the classi-
level of the taxonomy tree), and combining their eventual prediction           fier, we simply chose the next level’s prediction to be the valid child
results to form the entire category path prediction. Based on the              of the previous level’s prediction with the maximum log-likelihood.
training set, the count of unique classes to predict at each level is          An alternative to this approach would be to evaluate multiple pos-
shown in Table 2.                                                              sibilities at each level in a beam-search with the objective of maxi-
    This is analogous to the Local Classifier Per Level approach               mizing sum of their log-likelihood. An obvious drawback of using
discussed in a comprehensive survey of hierarchical classification             this method is that an incorrect prediction at a level will result in
methods [8]. This is not a hierarchical classification algorithm in            incorrect predictions at all of the levels below it.
itself, but the method serves to correct the hierarchical inconsisten-            If a the class predicted at any level happens to be a leaf node,
cies that result from predicting the class at each level in isolation.         the search is terminated. This is a safe operation to do, as the tree
To contextualize this approach with the current problem, we train              nodes in the product data-set are exclusively either leaf or non-leaf
8 separate classifiers, each the input for each classifier being the           nodes.
features extracted from the product name and the output being
the probability distribution across all the possible classes at the            3.4    Neural Classification
specific level for which the classifier was trained, as depicted in            We build multiple models using artificial neural networks as the
Figure 1. The feature extractor for this model is a bag-of-words               function approximators with non-linear activation functions (ReLU).
tf-idf vectorizer.
                                                                                  3.4.1 Simple Multi-Layer Perceptron. The feature extractor for
1 http://www.nltk.org/nltk_data
2 https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py
                                                                               the first neural classifier uses the tf-idf weighting scheme for the
3 https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_    words in the corpus, and a single hidden layer of neurons to predict
extraction/stop_words.py                                                       the entire category path of each product. In this formulation, the
Team Waterloo at the SIGIR E-Commerce Data Challenge                                               SIGIR’18, July 2018, Ann Arbor Michigan, USA


entire category path, like ‘2296>3597>2989’, for instance, is treated     5    DISCUSSION
as a distinct class.                                                      Based on our experimentation and reported results, there seems to
   Another neural network model we tried was using the ULMFiT             be little to no advantage in using methods that predict each level
[7] pre-trained weights trained on Wikipedia-103 data-set and fine-       of the category path independently. This could be attributed to the
tuning the language model to predict the taxonomy. This method            fact that the evaluation metric is equally punitive for both partially
didn’t produce any better results as the data-set contained a lot of      correct and completely incorrect category paths, thereby, providing
rare words and the number of examples to fine-tune the embeddings         no extra benefit for paths that are partially correct, typically seen
layer was simply not sufficient. We report a precision of 0.64 with       in piece-wise constructed methods like the per-level local classifier
this model.                                                               approach.
                                                                             We evaluated a per-level local classifier, using a batched gradient-
   3.4.2 Per-Level Multi-Layer Perceptron. We also built a neural
                                                                          descent model and a neural model. These were post-processed
network with the number of layers equivalent to the number of
                                                                          to include only valid category-paths. However, these turned out
the maximum depth of the taxonomy. We employed layer-wise
                                                                          to be our worst performing models. The baseline against the flat
training and used the softmax predictions of a layer along with
                                                                          classification model using the same classification algorithm and
the word embeddings together as the input for the next layer. The
                                                                          hyper-parameters indicates that leaf-node only flat classification
model showed promising results for the initial two levels but the
                                                                          performs better and our subsequent evaluations were restricted to
performance fell as the number of classes increased as we go down
                                                                          flat classifiers.
the tree.
                                                                          6    CONCLUSION
3.5    Ensembles
                                                                          In this paper, we describe our approach to address the taxonomy
We also build two ensemble models constructed using all the previ-        classification task for product categorization. We build a variety
ously described classifiers. We utilize a simple voting-based scheme      of models include flat and per-level classifiers, implemented using
for combining the classifier predictions, with the majority label         multinomial naive-bayes, logistic regressor, batched SGD classifier
being selected as the final label. We built two different ensembles:      and multi-layer perceptron models. Our best performing model
(i) Combining predictions made by all the models implemented,             uses a minimal amount of data-preprocessing, and uses a single
(ii) Combining predictions for which the weighted F 1 score was at        hidden layer in an MLP model with non-linear activations with the
least 0.70.                                                               features extracted using a tf-idf weighting scheme vectorizer.
                                                                             In future work, we would like to explore the possibility of train-
4     EVALUATION                                                          ing a distinct model per non-leaf node to address offset the problem
The training data is comprised of 800,000 products and their re-          of having to perform a greedy search for postprocessing in the
spective category path labels, a subset of which we hold out for          per-level classifier approach, as well as to implement additional
validation. The test set is comprised of 200,000 products, on which       boosting techniques to ensemble our best performing models.
the experiment results are reported.
   The metrics used to evaluate model performance, are weighted           REFERENCES
precision, recall and a F 1 scores.                                       [1] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting sys-
                                                                              tem. In Proceedings of the 22nd acm sigkdd international conference on knowledge
   Table 3 shows the performance of each of the models against the            discovery and data mining. ACM, 785–794.
test set.                                                                 [2] Yangchi Chen, Melba M Crawford, and Joydeep Ghosh. 2004. Integrating support
                                                                              vector machines in a hierarchical output space decomposition framework. In
                                                                              Geoscience and Remote Sensing Symposium, 2004. IGARSS’04. Proceedings. 2004 IEEE
                                                                              International, Vol. 2. IEEE, 949–952.
 Approach                                            P      R      F1     [3] D KOLLER. 1997. Hierarchically classifying documents using very few words. In
                                                                              Proc. 14th International Conference on Machine Learning. 170–178.
 Local-Per-Level Stochastic Gradient Descent        0.50   0.49   0.46    [4] Shailesh Kumar, Joydeep Ghosh, and Melba M Crawford. 2002. Hierarchical
 Flat One-vs-All Multinomial Naive Bayes            0.55   0.54   0.47        fusion of multiple classifiers for hyperspectral data analysis. Pattern Analysis &
 Flat Stochastic Gradient Descent                   0.65   0.66   0.63        Applications 5, 2 (2002), 210–220.
                                                                          [5] Ana Carolina Lorena and André Carlos PLF de Carvalho. 2004. Comparing tech-
 Flat One-vs-All Logistic Regression                0.76   0.74   0.72        niques for multiclass classification using binary SVM predictors. In Mexican Inter-
 Flat One-vs-All XGBoost                            0.77   0.77   0.75        national Conference on Artificial Intelligence. Springer, 272–281.
 Flat Multi-Layer Perceptron                        0.78   0.79   0.78    [6] Andrew McCallum, Ronald Rosenfeld, Tom M Mitchell, and Andrew Y Ng. 1998.
                                                                              Improving Text Classification by Shrinkage in a Hierarchy of Classes.. In ICML,
 Ensemble - All                                     0.77   0.76   0.75        Vol. 98. 359–367.
 Ensemble - XGBoost and Neural Networks             0.78   0.77   0.75    [7] Sebastian Ruder and Barbara Plank. 2018. Strong Baselines for Neural Semi-
                                                                              supervised Learning under Domain Shift. arXiv preprint arXiv:1804.09530 (2018).
                  Table 3: Experiment Results                             [8] Carlos N Silla and Alex A Freitas. 2011. A survey of hierarchical classification
                                                                              across different application domains. Data Mining and Knowledge Discovery 22,
                                                                              1-2 (2011), 31–72.


  We observe that even simple classification approaches that ig-
nore the hierarchical nature of the data, like the Flat MLP classifier,
obtains the best scores among the different models used, with the
ensembling methods achieving comparable performance.