<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Team Waterloo at the SIGIR E-Commerce Data Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Angshuman Ghosh</string-name>
          <email>a25ghosh@uwaterloo.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vineet John</string-name>
          <email>email@institution.domain</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rahul Iyer</string-name>
          <email>rahul.iyer@uwaterloo.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Waterloo</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>This paper describes the methods we use to solve a taxonomy classification problem specific to the categorization of products in an e-commerce catalog. We describe the use of learning algorithms such as multinomial naive Bayes, logistic regression, batched gradient-descent classifiers and neural softmax classifiers, retro-fitted for the hierarchical classification objective. We explore methods that exploit the hierarchical nature of the data, and compare them against flat classification approaches. We use simple models with minimal pre-processing to build and compare classifiers, reporting their respective weighted precision, recall and F1 score metrics on the test data-set.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Computing methodologies → Natural language
processing; Information extraction;</p>
    </sec>
    <sec id="sec-2">
      <title>1 INTRODUCTION</title>
      <p>Taxonomy classification is a multi-label classification problem where
the classes form a hierarchy, as in a tree or a directed acyclic graph
(DAG). Although a significant amount of research in natural
language processing (NLP) pertains to flat classification problems,
class hierarchies are fairly common in the real world, and it follows
that there are tasks that require classification algorithms that
address such hierarchical classification problems.</p>
      <p>
        Hierarchies emerge as a natural result of the organization of
extensive data-sets into fine-grained categories [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In this task
we work with a product catalog taxonomy, with each individual
product being a part of an increasingly finer-grained category. The
objective of the task is to be able to accurately predict the entire
class hierarchy under which the product is categorized.
      </p>
      <p>The contributions of this paper are the evaluation of different
strategies, including flat and hierarchical classification, and the
implementation of a general framework to address the task of
taxonomy classification.</p>
    </sec>
    <sec id="sec-3">
      <title>2 RELATED WORK</title>
      <p>Taxonomy classification is a topic well-explored within the NLP
community, with several novel approaches over the past two decades.
It is also known in the literature as ‘hierarchical’ or ‘structured’
classification. The nature of the task implies that binary and multi-class
classifiers cannot be used as is to solve this problem. However, these
existing algorithms can be re-purposed for taxonomy classification,
as will be subsequently discussed in Section 3.</p>
      <p>
        The seminal work on taxonomy classification was published by
Koller and Sahami [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The authors propose learning a small set of features to
solve each classification task in the tree structure independently.
This method involves training a classifier for each non-leaf node
in the tree. This model comprises a top-down approach under the
assumption that the instance is already classified with respect to all
the labels that precede the current label in the hierarchy. This also
drastically reduces the number of target classes’ probabilities to
predict. However, a drawback of this approach is the large number
of models that need to be trained to facilitate it. It also suffers
from the cascading effect of classification failure, i.e., all subsequent
classifications after an incorrect classification will also be incorrect.
      </p>
      <p>
        Since the most common schemes for multi-class classification in
general are the One-vs-One and One-vs-All schemes, support vector
machines, being max-margin classifiers, have been used frequently
to address this task in previous works [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These works have
used the SVM in the standard flat multi-class classification setting and
have also proposed changes to adapt the classifier to the hierarchical
setting, namely the Binary Hierarchical Classifier (BHC) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and
the Hierarchical SVM (H-SVM) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Another work, published by McCallum et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], uses shrinkage to
improve classification of a hierarchy of text articles. They use a
maximum likelihood estimate (MLE) based objective to shrink
the MLE of each node towards the MLEs of all its ancestors.
      </p>
      <p>Our approaches, in contrast, use simpler estimators, and rely
on minimal post-processing for a few among them to collate
the predictions in the event of multiple classifiers being used for a
single learning algorithm.</p>
    </sec>
    <sec id="sec-4">
      <title>3 APPROACH</title>
      <p>In Section 3.1, we describe the data cleaning and simple intuitions
we use to guide our learning strategies. We subsequently describe
each of the approaches that we utilized to obtain the requisite
predictions.</p>
    </sec>
    <sec id="sec-5">
      <title>3.1 Preprocessing</title>
      <p>Each data instance to be classified in this task is a product name,
which usually comprises the product’s canonical name, a brand
name and any high-level specifications (like dimensions, color, etc.)
if applicable.</p>
      <p>There are 3008 unique category paths known from the training
data, with a maximum category depth of 8.</p>
      <p>As opposed to full sentences, there is intuitively less
advantage in implementing a forget mechanism for past contexts in a
product name, which disincentivizes the use of recurrent networks as
feature extractors.</p>
      <p>Also, most of the numbers in the data-set, for instance, in the
examples presented in Table 1 are used to quantify specifications
of the product, and are not useful unless they’re part of the brand
or product name itself.</p>
      <p>Hence, we trim the numbers from the product names, in addition
to trimming the default English stopwords from the curated dictionaries
maintained by NLTK (http://www.nltk.org/nltk_data), spaCy
(https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py) and
scikit-learn (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py).
This eliminates most of the conjunctions and prepositions used within
a product name that are not relevant to its class. Additionally, we
remove any word less than 3 characters in length, to eliminate most
of the uninformative measures of quantity (e.g. V, mm, oz).</p>
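      <p>As a rough illustration, the cleaning steps described above can be sketched in a few lines of Python; the stopword list and product name below are invented stand-ins, not the actual NLTK, spaCy or scikit-learn dictionaries:</p>
      <preformat>
```python
import re

# A small illustrative stopword list; the paper uses the union of the
# NLTK, spaCy and scikit-learn dictionaries (the set below is hypothetical).
STOPWORDS = {"the", "a", "an", "and", "or", "for", "with", "of", "in"}

def preprocess(product_name):
    """Clean a product name: lowercase, strip numbers and punctuation,
    drop stopwords and tokens shorter than 3 characters (e.g. V, mm, oz)."""
    tokens = re.findall(r"[a-z]+", product_name.lower())  # digits are discarded here
    return [t for t in tokens if t not in STOPWORDS and len(t) >= 3]

print(preprocess("Acme 500 mL Stainless Steel Bottle with Lid, 2 oz"))
```
      </preformat>
      <p>On the invented name above, this leaves only the informative tokens (brand, material and product type), discarding the quantities and their units.</p>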
    </sec>
    <sec id="sec-6">
      <title>3.2 One vs. All Classification</title>
      <p>
        The class taxonomy provided is a tree, and each leaf node uniquely
identifies a class. Thus, for our initial trials, we decided to train
models using only leaf nodes. We used two classifiers: Multinomial
Naive Bayes (MNB) and XGBoost [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. While there were no
hyperparameters to tune for MNB, XGBoost has a considerable number of tunable
parameters. XGBoost is an ensemble method in which a large number
of decision trees are combined using gradient boosting, and it has
performed well in multiple Kaggle competitions. The two
most important hyperparameters are the depth of the decision
trees that are grown, and the number of trees that are trained. We
performed a random search over the parameters and obtained the best
results with a max-depth of 7 and 200 estimators.
      </p>
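      <p>The flat leaf-node setup with multinomial naive Bayes can be sketched with a minimal from-scratch classifier; the tokens and category labels below are invented, and this stands in for, rather than reproduces, the scikit-learn estimator we actually used:</p>
      <preformat>
```python
import math
from collections import Counter

class TinyMultinomialNB:
    """Minimal multinomial naive Bayes over bag-of-words token counts,
    with add-one smoothing. A sketch of the flat leaf-node setup."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
            self.vocab.update(tokens)
        self.total = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, tokens):
        V = len(self.vocab)  # vocabulary size for Laplace smoothing
        def score(c):
            s = self.prior[c]
            for t in tokens:
                s += math.log((self.counts[c][t] + 1) / (self.total[c] + V))
            return s
        return max(self.classes, key=score)

# Invented product tokens and leaf categories for illustration.
docs = [["steel", "bottle"], ["plastic", "bottle"], ["phone", "case"], ["phone", "charger"]]
labels = ["kitchen", "kitchen", "electronics", "electronics"]
clf = TinyMultinomialNB().fit(docs, labels)
print(clf.predict(["steel", "phone"]))
```
      </preformat>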
    </sec>
    <sec id="sec-7">
      <title>3.3 Multi-level Classification</title>
      <p>This classifier attempts to exploit the hierarchical nature of the
categories. This requires using multiple classifiers (one for each
level of the taxonomy tree), and combining their eventual prediction
results to form the entire category path prediction. Based on the
training set, the count of unique classes to predict at each level is
shown in Table 2.</p>
      <p>
        This is analogous to the Local Classifier Per Level approach
discussed in a comprehensive survey of hierarchical classification
methods [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This is not a hierarchical classification algorithm in
itself, but the method serves to correct the hierarchical
inconsistencies that result from predicting the class at each level in isolation.
To contextualize this approach with the current problem, we train
8 separate classifiers, with the input for each classifier being the
features extracted from the product name, and the output being
the probability distribution across all the possible classes at the
specific level for which the classifier was trained, as depicted in
Figure 1. The feature extractor for this model is a bag-of-words
tf-idf vectorizer.
      </p>
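      <p>The per-level training setup can be illustrated by how the per-level labels are derived from the category paths; the paths and class ids below are hypothetical:</p>
      <preformat>
```python
# Sketch of the per-level label construction: classifier k is trained
# on the class at depth k of each product's category path.
paths = ["2296>3597>2989", "2296>3597>3011", "4015>1122"]
MAX_DEPTH = 8  # maximum category depth observed in the training data

per_level_labels = [[] for _ in range(MAX_DEPTH)]
for path in paths:
    nodes = path.split(">")
    for level in range(MAX_DEPTH):
        # pad shorter paths with None so every level sees every example
        label = nodes[level] if len(nodes) > level else None
        per_level_labels[level].append(label)

print(per_level_labels[0])
```
      </preformat>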
      <sec id="sec-7-1">
        <title>Table 2</title>
        <p>Table 2 lists the count of unique categories at each level of the taxonomy.</p>
        <p>However, this method suffers from the shortcoming of disjoint
predictions from level to level. For example, the classifier for level 1
may predict a class that is not a descendant of the prediction of
the classifier for level 0. In this event, some additional processing is
required such that the final category path is valid.</p>
        <p>We use a greedy search within the empirically computed
log-probabilities of the classifiers for levels ≥ 1. Instead of predicting the
child class with the maximum log-likelihood predicted by the
classifier, we simply choose the next level’s prediction to be the valid child
of the previous level’s prediction with the maximum log-likelihood.
An alternative to this approach would be to evaluate multiple
possibilities at each level in a beam search, with the objective of
maximizing the sum of their log-likelihoods. An obvious drawback of
this method is that an incorrect prediction at a level will result in
incorrect predictions at all of the levels below it.</p>
        <p>If a the class predicted at any level happens to be a leaf node,
the search is terminated. This is a safe operation to do, as the tree
nodes in the product data-set are exclusively either leaf or non-leaf
nodes.
3.4</p>
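        <p>A minimal sketch of this greedy valid-child decoding, over a hypothetical taxonomy fragment and invented per-level log-probabilities:</p>
        <preformat>
```python
import math

# Hypothetical taxonomy fragment and per-level log-probabilities,
# standing in for the classifiers' empirical outputs.
children = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
log_probs = [
    {"A": math.log(0.6), "B": math.log(0.4)},                          # level-0 classifier
    {"B1": math.log(0.5), "A1": math.log(0.3), "A2": math.log(0.2)},   # level-1 classifier
]

def greedy_decode(children, log_probs):
    """At each level, pick the child of the previous prediction with the
    highest log-likelihood, so the decoded path is always valid."""
    path = []
    parent = "root"
    for level_scores in log_probs:
        valid = [c for c in children.get(parent, []) if c in level_scores]
        if not valid:  # the parent is a leaf node: terminate the search
            break
        parent = max(valid, key=level_scores.get)
        path.append(parent)
    return path

print(greedy_decode(children, log_probs))
```
        </preformat>
        <p>Note that an unconstrained argmax at level 1 would pick B1, which is not a child of the level-0 prediction A; constraining the search to valid children keeps the decoded path consistent.</p>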
      </sec>
    </sec>
    <sec id="sec-8">
      <title>3.4 Neural Classification</title>
      <p>We build multiple models using artificial neural networks as the
function approximators with non-linear activation functions (ReLU).</p>
      <p>3.4.1 Simple Multi-Layer Perceptron. The feature extractor for
the first neural classifier uses the tf-idf weighting scheme for the
words in the corpus, and a single hidden layer of neurons to predict
the entire category path of each product. In this formulation, the
entire category path, like ‘2296&gt;3597&gt;2989’, for instance, is treated
as a distinct class.</p>
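      <p>A toy illustration of the tf-idf weighting that feeds this classifier, with invented tokens and full-path labels treated as opaque class strings:</p>
      <preformat>
```python
import math
from collections import Counter

# Invented tokenized product names; each full category path string is
# treated as a single class label for the flat MLP.
docs = [["steel", "bottle"], ["steel", "knife"], ["phone", "case"]]
labels = ["100>200", "100>201", "300>400"]  # made-up full-path classes

df = Counter(t for doc in docs for t in set(doc))  # document frequency
N = len(docs)

def tfidf(doc):
    """Raw-count tf times log inverse document frequency; a simplified
    form of the weighting scheme, not the exact vectorizer we used."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(["steel", "bottle"]))
```
      </preformat>
      <p>Tokens that appear in many products (like "steel" here) receive lower weights than tokens specific to one product, which is what makes the sparse features discriminative for the MLP.</p>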
      <p>
        Another neural network model we tried uses the ULMFiT
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] pre-trained weights, trained on the WikiText-103 data-set, and
fine-tunes the language model to predict the taxonomy. This method
did not produce better results, as the data-set contained a lot of
rare words and the number of examples to fine-tune the embedding
layer was simply not sufficient. We report a precision of 0.64 with
this model.
      </p>
      <p>3.4.2 Per-Level Multi-Layer Perceptron. We also built a neural
network with a number of layers equal to the
maximum depth of the taxonomy. We employed layer-wise
training and used the softmax predictions of a layer, along with
the word embeddings, together as the input for the next layer. The
model showed promising results for the initial two levels, but the
performance fell as the number of classes increased going down
the tree.</p>
    </sec>
    <sec id="sec-9">
      <title>3.5 Ensembles</title>
      <p>We also built two ensemble models constructed using all the
previously described classifiers. We utilize a simple voting-based scheme
for combining the classifier predictions, with the majority label
being selected as the final label. We built two different ensembles:
(i) combining the predictions made by all the models implemented, and
(ii) combining the predictions of only those models for which the
weighted F1 score was at least 0.70.</p>
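      <p>The voting scheme can be sketched as follows; the per-model predictions are invented:</p>
      <preformat>
```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model predicted category paths by simple majority."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from the individual models for one product.
model_preds = ["2296>3597", "2296>3597", "2296>3598", "4015>1122"]
print(majority_vote(model_preds))
```
      </preformat>
      <p>With Counter, ties resolve in favor of the prediction encountered first; a weighted scheme (e.g. by each model’s validation F1) would be a natural refinement.</p>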
    </sec>
    <sec id="sec-10">
      <title>4 EVALUATION</title>
      <p>The training data is comprised of 800,000 products and their
respective category path labels, a subset of which we hold out for
validation. The test set is comprised of 200,000 products, on which
the experiment results are reported.</p>
      <p>The metrics used to evaluate model performance are weighted
precision, recall and F1 scores.</p>
      <p>Table 3 shows the performance of each of the models against the
test set.</p>
      <table-wrap id="tbl3">
        <label>Table 3</label>
        <caption>
          <p>Weighted precision (P), recall (R) and F1 scores of each model on the test set.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Approach</th><th>P</th><th>R</th><th>F1</th></tr>
          </thead>
          <tbody>
            <tr><td>Local-Per-Level Stochastic Gradient Descent</td><td>0.50</td><td>0.49</td><td>0.46</td></tr>
            <tr><td>Flat One-vs-All Multinomial Naive Bayes</td><td>0.55</td><td>0.54</td><td>0.47</td></tr>
            <tr><td>Flat Stochastic Gradient Descent</td><td>0.65</td><td>0.66</td><td>0.63</td></tr>
            <tr><td>Flat One-vs-All Logistic Regression</td><td>0.76</td><td>0.74</td><td>0.72</td></tr>
            <tr><td>Flat One-vs-All XGBoost</td><td>0.77</td><td>0.77</td><td>0.75</td></tr>
            <tr><td>Flat Multi-Layer Perceptron</td><td>0.78</td><td>0.79</td><td>0.78</td></tr>
            <tr><td>Ensemble - All</td><td>0.77</td><td>0.76</td><td>0.75</td></tr>
            <tr><td>Ensemble - XGBoost and Neural Networks</td><td>0.78</td><td>0.77</td><td>0.75</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We observe that even a simple classification approach that
ignores the hierarchical nature of the data, the Flat MLP classifier,
obtains the best scores among the different models used, with the
ensembling methods achieving comparable performance.</p>
    </sec>
    <sec id="sec-11">
      <title>5 DISCUSSION</title>
      <p>Based on our experimentation and reported results, there seems to
be little to no advantage in using methods that predict each level
of the category path independently. This could be attributed to the
fact that the evaluation metric is equally punitive for both partially
correct and completely incorrect category paths, thereby providing
no extra benefit for paths that are partially correct, typically seen
in piece-wise constructed methods like the per-level local classifier
approach.</p>
      <p>We evaluated a per-level local classifier, using a batched
gradient-descent model and a neural model. These were post-processed
to include only valid category paths. However, these turned out
to be our worst performing models. Comparison against the baseline flat
classification model using the same classification algorithm and
hyperparameters indicates that leaf-node-only flat classification
performs better, and our subsequent evaluations were restricted to
flat classifiers.</p>
    </sec>
    <sec id="sec-12">
      <title>6 CONCLUSION</title>
      <p>In this paper, we describe our approach to address the taxonomy
classification task for product categorization. We build a variety
of models, including flat and per-level classifiers, implemented using
multinomial naive Bayes, logistic regression, batched SGD classifier
and multi-layer perceptron models. Our best performing model
uses a minimal amount of data pre-processing and a single
hidden layer in an MLP model with non-linear activations, with the
features extracted using a tf-idf weighting scheme vectorizer.</p>
      <p>In future work, we would like to explore the possibility of
training a distinct model per non-leaf node to offset the problem
of having to perform a greedy search for post-processing in the
per-level classifier approach, as well as to implement additional
boosting techniques to ensemble our best performing models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Tianqi</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Guestrin</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Xgboost: A scalable tree boosting system</article-title>
          .
          <source>In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          . ACM,
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Yangchi</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Melba M</given-names>
            <surname>Crawford</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Joydeep</given-names>
            <surname>Ghosh</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Integrating support vector machines in a hierarchical output space decomposition framework</article-title>
          .
          <source>In Geoscience and Remote Sensing Symposium, 2004. IGARSS '04. Proceedings. 2004 IEEE International</source>
          , Vol.
          <volume>2</volume>
          . IEEE,
          <fpage>949</fpage>
          -
          <lpage>952</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Daphne</given-names>
            <surname>Koller</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mehran</given-names>
            <surname>Sahami</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Hierarchically classifying documents using very few words</article-title>
          .
          <source>In Proc. 14th International Conference on Machine Learning</source>
          .
          <fpage>170</fpage>
          -
          <lpage>178</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Shailesh</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Joydeep</given-names>
            <surname>Ghosh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Melba M</given-names>
            <surname>Crawford</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Hierarchical fusion of multiple classifiers for hyperspectral data analysis</article-title>
          .
          <source>Pattern Analysis &amp; Applications 5</source>
          ,
          <issue>2</issue>
          (
          <year>2002</year>
          ),
          <fpage>210</fpage>
          -
          <lpage>220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Ana Carolina</given-names>
            <surname>Lorena</surname>
          </string-name>
          and
          <string-name>
            <given-names>André Carlos PLF</given-names>
            <surname>de Carvalho</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Comparing techniques for multiclass classification using binary SVM predictors</article-title>
          .
          <source>In Mexican International Conference on Artificial Intelligence</source>
          . Springer,
          <fpage>272</fpage>
          -
          <lpage>281</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ronald</given-names>
            <surname>Rosenfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tom M</given-names>
            <surname>Mitchell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Andrew Y</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Improving Text Classification by Shrinkage in a Hierarchy of Classes</article-title>
          .
          <source>In ICML</source>
          , Vol.
          <volume>98</volume>
          .
          <fpage>359</fpage>
          -
          <lpage>367</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          and
          <string-name>
            <given-names>Barbara</given-names>
            <surname>Plank</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Strong Baselines for Neural Semi-supervised Learning under Domain Shift</article-title>
          . arXiv preprint arXiv:1804.09530 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Carlos N</given-names>
            <surname>Silla</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alex A</given-names>
            <surname>Freitas</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>A survey of hierarchical classification across different application domains</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          <volume>22</volume>
          ,
          <issue>1-2</issue>
          (
          <year>2011</year>
          ),
          <fpage>31</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>