=Paper=
{{Paper
|id=Vol-2319/ecom18DC_paper_12
|storemode=property
|title=Team Waterloo at the SIGIR e-Commerce Data Challenge
|pdfUrl=https://ceur-ws.org/Vol-2319/ecom18DC_paper_12.pdf
|volume=Vol-2319
|authors=Angshuman Ghosh,Vineet John,Rahul Iyer
|dblpUrl=https://dblp.org/rec/conf/sigir/GhoshJI18
}}
==Team Waterloo at the SIGIR e-Commerce Data Challenge==
Team Waterloo at the SIGIR E-Commerce Data Challenge

Angshuman Ghosh, Vineet John, Rahul Iyer
University of Waterloo
a25ghosh@uwaterloo.ca, email@institution.domain, rahul.iyer@uwaterloo.ca
ABSTRACT

This paper describes the methods we use to solve a taxonomy classification problem specific to the categorization of products in an e-commerce catalog. We describe the use of learning algorithms like multinomial naive Bayes, logistic regressors, batched gradient-descent classifiers and neural softmax classifiers retro-fitted for the hierarchical classification objective. We explore methods that exploit the hierarchical nature of the data, and compare them against flat classification approaches. We use simple models with minimal pre-processing to build and compare classifiers by reporting their respective weighted precision, recall and F1 score metrics on the test data-set.

CCS CONCEPTS

• Computing methodologies → Natural language processing; Information extraction;

KEYWORDS

taxonomy, classification

1 INTRODUCTION

Taxonomy classification is a multi-label classification problem where the classes form a hierarchy, as in a tree or a directed acyclic graph (DAG). Although a significant amount of research in natural language processing (NLP) pertains to flat-classification problems, class hierarchies are fairly common in the real world, and it follows that there are tasks that require classification algorithms that address such hierarchical classification problems.

Hierarchies emerge as a natural result of the organization of extensive data-sets into fine-grained categories [6]. In this task we work with a product catalog taxonomy, with each individual product being a part of an increasingly finer-grained category. The objective of the task is to accurately predict the entire class hierarchy under which the product is categorized.

The contributions of this paper are the evaluation of different strategies, including flat and hierarchical classification, and the implementation of a general framework to address the task of taxonomy classification.

2 RELATED WORK

Taxonomy classification is a topic well-explored within the NLP community, with several novel approaches over the past two decades. It is also known in the literature as ‘hierarchical’ or ‘structured’ classification. The nature of the task implies that binary and multi-class classifiers cannot be used as-is to solve this problem. However, these existing algorithms can be re-purposed for taxonomy classification, as will be subsequently discussed in Section 3.

The seminal work on taxonomy classification was published by Koller and Sahami [3]. The authors propose learning a small set of features to solve each classification task in the tree structure independently. This method involves training a classifier for each non-leaf node in the tree. It is a top-down approach under the assumption that the instance is already classified with respect to all the labels that precede the current label in the hierarchy. This also drastically reduces the number of target class probabilities to predict. However, a drawback of this approach is the large number of models that need to be trained to facilitate it. It also suffers from the cascading effect of classification failure, i.e. all subsequent classifications after an incorrect classification will also be incorrect.

Since the most common schemes for multi-class classification in general are the One-vs-One and One-vs-All schemes, support vector machines, being max-margin classifiers, have been used frequently to address this task in previous works [5] [4] [2]. These works have used the SVM in the standard flat multi-class setting and have also proposed changes to adapt the classifier to the hierarchical setting, namely the Binary Hierarchical Classifier (BHC) [4] and the Hierarchical SVM (H-SVM) [2].

Another work, published by McCallum et al. [6], uses shrinkage to improve classification of a hierarchy of text articles. They use a maximum likelihood estimate (MLE) based objective to shrink the MLE of each node towards the MLEs of all its ancestors.

Our approaches, in contrast, use simpler estimators, and rely on minimal post-processing for a few among them to collate the predictions in the event of multiple classifiers being used for a single learning algorithm.

3 APPROACH

In Section 3.1, we describe the data cleaning and the simple intuitions we use to guide our learning strategies. We subsequently describe each of the approaches that we utilized to obtain the requisite predictions.

3.1 Preprocessing

Each data instance to be classified in this task is a product name, which is usually comprised of the product’s canonical name, a brand name and any high-level specifications (like dimensions, color etc.) if applicable.

There are 3008 unique category paths known from the training data, with a maximum category depth of 8.

As opposed to full sentences, there is intuitively less advantage in implementing a forget mechanism for past contexts in a product name, disincentivizing the use of recurrent networks as feature extractors.
Also, most of the numbers in the data-set, for instance in the examples presented in Table 1, are used to quantify specifications of the product, and are not useful unless they’re part of the brand or product name itself.

Table 1: Examples of products with numbers

  Replacement Viewsonic VG710 LCD Monitor 48Watt AC Adapter 12V 4A
  4-Pack Replacement Engine Air Filter for 2009 Sterling Truck Bullet 55 L6 6.7 Car/Automotive
  Luscious Pink Perfume 3.4 oz Eau De Parfum Spray For Women By MARIAH CAREY

Table 2: Level-wise category counts

  Level   Unique Categories
  0       14
  1       108
  2       865
  3       1573
  4       752
  5       232
  6       148
  7       3
Hence, we trim the numbers from the product names, in addition to trimming the default English stopwords from the curated dictionaries maintained by NLTK¹, spaCy² and scikit-learn³. This eliminates most of the conjunctions and prepositions used within a product name that are not relevant to its class. Additionally, we also remove any word less than 3 characters in length, to eliminate most of the uninformative units of measure (e.g. V, mm, oz etc.).

¹ http://www.nltk.org/nltk_data
² https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py
³ https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py
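A minimal sketch of this preprocessing step is given below. The alphabetic tokenizer and the use of only scikit-learn's built-in stopword list (rather than the union of the three curated dictionaries above) are simplifying assumptions:

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess(product_name):
    # Keep only alphabetic tokens, which drops numbers and punctuation.
    tokens = re.findall(r"[A-Za-z]+", product_name.lower())
    # Drop stopwords and words shorter than 3 characters (e.g. 'V', 'mm', 'oz').
    kept = [t for t in tokens if len(t) >= 3 and t not in ENGLISH_STOP_WORDS]
    return " ".join(kept)

print(preprocess("Luscious Pink Perfume 3.4 oz Eau De Parfum Spray For Women"))
# -> luscious pink perfume eau parfum spray women
```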
3.2 One vs. All Classification

The class taxonomy provided is a tree, and each leaf node uniquely identifies a class. Thus, for our initial trials, we decided to train models using only leaf nodes. We used two classifiers: Multinomial Naive Bayes (MNB) and XGBoost [1]. While there were no hyperparameters to tune for MNB, XGBoost has a considerable number of tunable parameters. XGBoost is an ensemble method in which a large number of decision trees are combined using gradient boosting, and it has performed well in multiple Kaggle competitions. The two most important hyperparameters are the depth of the decision trees that are grown, and the number of trees that are trained. We performed a random search over these parameters and obtained the best results with a max-depth of 7 and 200 estimators.
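A sketch of this flat leaf-node setup is shown below, with toy stand-ins for the training data (the hypothetical paths other than the example from the paper are illustrative only); the label-encoding step is an implementation detail we assume, since XGBoost expects integer class labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Toy stand-ins for the preprocessed product names and their category paths,
# with each full path treated as a single atomic class label.
names = ["replacement viewsonic lcd monitor adapter",
         "replacement engine air filter sterling truck",
         "luscious pink perfume eau parfum spray women"]
paths = ["2296>3597>2989", "1111>2222", "3333>4444>5555"]

X = TfidfVectorizer().fit_transform(names)
y = LabelEncoder().fit_transform(paths)  # XGBoost expects integer labels

mnb = MultinomialNB().fit(X, y)
# max_depth and n_estimators set to the values found by our random search.
xgb = XGBClassifier(max_depth=7, n_estimators=200).fit(X, y)
```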
3.3 Multi-level Classification

This classifier attempts to exploit the hierarchical nature of the categories. It requires using multiple classifiers (one for each level of the taxonomy tree), and combining their eventual prediction results to form the entire category path prediction. Based on the training set, the count of unique classes to predict at each level is shown in Table 2.

This is analogous to the Local Classifier Per Level approach discussed in a comprehensive survey of hierarchical classification methods [8]. This is not a hierarchical classification algorithm in itself, but the method serves to correct the hierarchical inconsistencies that result from predicting the class at each level in isolation. To contextualize this approach with the current problem, we train 8 separate classifiers, the input for each classifier being the features extracted from the product name and the output being the probability distribution across all the possible classes at the specific level for which the classifier was trained, as depicted in Figure 1. The feature extractor for this model is a bag-of-words tf-idf vectorizer.

[Figure 1: Classifier per level architecture]

However, this method suffers from the shortcoming of disjoint predictions from level to level. For example, the level-1 classifier may predict a class that is not a descendant of the level-0 prediction. In this event, some additional processing is required such that the final category path is valid.

We use a greedy search within the empirically computed log-probabilities of the classifiers for levels ≥ 1. Instead of predicting the child class with the maximum log-likelihood predicted by the classifier, we simply choose the next level’s prediction to be the valid child of the previous level’s prediction with the maximum log-likelihood. An alternative to this approach would be to evaluate multiple possibilities at each level in a beam search, with the objective of maximizing the sum of their log-likelihoods. An obvious drawback of using this method is that an incorrect prediction at a level will result in incorrect predictions at all of the levels below it.

If the class predicted at any level happens to be a leaf node, the search is terminated. This is a safe operation, as the tree nodes in the product data-set are exclusively either leaf or non-leaf nodes.
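A sketch of this greedy post-processing is given below; the `children` mapping, the `leaves` set, and the per-level log-probability arrays are assumed inputs built from the taxonomy and the trained per-level classifiers:

```python
def greedy_valid_path(level0_pred, log_probs, children, leaves, max_depth=8):
    """Greedily extend a category path using only valid children.

    log_probs[level][c] is the log-probability assigned by that level's
    classifier to class c for a single product.
    """
    path = [level0_pred]
    for level in range(1, max_depth):
        if path[-1] in leaves:  # leaf reached: terminate the search
            break
        # Pick the valid child of the previous prediction with the maximum
        # log-likelihood, rather than the unconstrained argmax at this level.
        path.append(max(children[path[-1]], key=lambda c: log_probs[level][c]))
    return path
```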
3.4 Neural Classification

We build multiple models using artificial neural networks as the function approximators, with non-linear activation functions (ReLU).

3.4.1 Simple Multi-Layer Perceptron. The feature extractor for the first neural classifier uses the tf-idf weighting scheme for the words in the corpus, and a single hidden layer of neurons to predict the entire category path of each product.
In this formulation, the entire category path, like ‘2296>3597>2989’, for instance, is treated as a distinct class.
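A sketch of this formulation, using scikit-learn for brevity; the hidden layer width of 512 is an illustrative assumption rather than a value reported here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# `names` and `paths` as in the earlier sketch: preprocessed product names
# and full category paths treated as atomic labels.
X = TfidfVectorizer().fit_transform(names)
mlp = MLPClassifier(hidden_layer_sizes=(512,), activation="relu")
mlp.fit(X, paths)  # predicts the entire category path in one shot
predicted = mlp.predict(X)
```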
Another neural network model we tried used the ULMFiT [7] weights pre-trained on the WikiText-103 data-set, fine-tuning the language model to predict the taxonomy. This method did not produce any better results, as the data-set contained a lot of rare words and the number of examples available to fine-tune the embedding layer was simply not sufficient. We report a precision of 0.64 with this model.
3.4.2 Per-Level Multi-Layer Perceptron. We also built a neural network with the number of layers equal to the maximum depth of the taxonomy. We employed layer-wise training, and used the softmax predictions of a layer together with the word embeddings as the input for the next layer. The model showed promising results for the initial two levels, but the performance fell as the number of classes increased going down the tree.
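One way to realize this layer-wise scheme is sketched below, with a scikit-learn classifier per level and tf-idf features standing in for the word embeddings described above; `X` and `labels_per_level` are assumed inputs:

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.neural_network import MLPClassifier

# X is the tf-idf feature matrix; labels_per_level[i] holds each product's
# class at taxonomy level i. The hidden layer width is illustrative.
models, prev_proba = [], None
for level_labels in labels_per_level:
    # Append the previous level's softmax output to the text features.
    X_in = X if prev_proba is None else hstack([X, csr_matrix(prev_proba)])
    clf = MLPClassifier(hidden_layer_sizes=(256,)).fit(X_in, level_labels)
    models.append(clf)
    prev_proba = clf.predict_proba(X_in)  # fed forward to the next level
```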
3.5 Ensembles

We also build two ensemble models constructed using all the previously described classifiers. We utilize a simple voting-based scheme for combining the classifier predictions, with the majority label being selected as the final label. We built two different ensembles: (i) combining the predictions made by all the models implemented, and (ii) combining the predictions of only the models for which the weighted F1 score was at least 0.70.
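A minimal sketch of the majority-vote combination (breaking ties in favor of the earliest-seen prediction is an assumption on our part):

```python
from collections import Counter

def majority_vote(predictions_per_model):
    # predictions_per_model: one list of predicted category paths per model.
    # zip(*...) groups the per-model predictions for each product.
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]

print(majority_vote([["a>b", "c>d"], ["a>b", "e>f"], ["x>y", "e>f"]]))
# -> ['a>b', 'e>f']
```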
4 EVALUATION

The training data is comprised of 800,000 products and their respective category path labels, a subset of which we hold out for validation. The test set is comprised of 200,000 products, on which the experiment results are reported.

The metrics used to evaluate model performance are weighted precision, recall and F1 scores.
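With scikit-learn, these weighted metrics over full category paths can be computed as sketched below, where the toy gold and predicted paths stand in for the test set:

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy stand-ins for the gold and predicted category paths of the test set.
y_true = ["2296>3597>2989", "1111>2222", "3333>4444>5555"]
y_pred = ["2296>3597>2989", "1111>2222", "2296>3597>2989"]

# average="weighted" weights each class by its support in y_true.
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```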
Table 3 shows the performance of each of the models against the test set.

Table 3: Experiment Results

  Approach                                      P      R      F1
  Local-Per-Level Stochastic Gradient Descent   0.50   0.49   0.46
  Flat One-vs-All Multinomial Naive Bayes       0.55   0.54   0.47
  Flat Stochastic Gradient Descent              0.65   0.66   0.63
  Flat One-vs-All Logistic Regression           0.76   0.74   0.72
  Flat One-vs-All XGBoost                       0.77   0.77   0.75
  Flat Multi-Layer Perceptron                   0.78   0.79   0.78
  Ensemble - All                                0.77   0.76   0.75
  Ensemble - XGBoost and Neural Networks        0.78   0.77   0.75
We observe that even simple classification approaches that ignore the hierarchical nature of the data, like the Flat MLP classifier, obtain the best scores among the different models used, with the ensembling methods achieving comparable performance.

5 DISCUSSION

Based on our experimentation and reported results, there seems to be little to no advantage in using methods that predict each level of the category path independently. This could be attributed to the fact that the evaluation metric is equally punitive for both partially correct and completely incorrect category paths, thereby providing no extra benefit for paths that are partially correct, as typically produced by piece-wise constructed methods like the per-level local classifier approach.

We evaluated a per-level local classifier, using a batched gradient-descent model and a neural model. These were post-processed to include only valid category paths. However, these turned out to be our worst performing models. Comparison against the flat classification model using the same classification algorithm and hyper-parameters indicates that leaf-node-only flat classification performs better, and our subsequent evaluations were restricted to flat classifiers.

6 CONCLUSION

In this paper, we describe our approach to address the taxonomy classification task for product categorization. We build a variety of models, including flat and per-level classifiers, implemented using multinomial naive Bayes, logistic regressor, batched SGD classifier and multi-layer perceptron models. Our best performing model uses a minimal amount of data pre-processing, and uses a single hidden layer in an MLP model with non-linear activations, with the features extracted using a tf-idf weighting scheme vectorizer.

In future work, we would like to explore the possibility of training a distinct model per non-leaf node, to offset the problem of having to perform a greedy search for post-processing in the per-level classifier approach, as well as to implement additional boosting techniques to ensemble our best performing models.

REFERENCES

[1] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.
[2] Yangchi Chen, Melba M Crawford, and Joydeep Ghosh. 2004. Integrating support vector machines in a hierarchical output space decomposition framework. In Geoscience and Remote Sensing Symposium, 2004. IGARSS ’04. Proceedings. 2004 IEEE International, Vol. 2. IEEE, 949–952.
[3] Daphne Koller and Mehran Sahami. 1997. Hierarchically classifying documents using very few words. In Proceedings of the 14th International Conference on Machine Learning. 170–178.
[4] Shailesh Kumar, Joydeep Ghosh, and Melba M Crawford. 2002. Hierarchical fusion of multiple classifiers for hyperspectral data analysis. Pattern Analysis & Applications 5, 2 (2002), 210–220.
[5] Ana Carolina Lorena and André Carlos PLF de Carvalho. 2004. Comparing techniques for multiclass classification using binary SVM predictors. In Mexican International Conference on Artificial Intelligence. Springer, 272–281.
[6] Andrew McCallum, Ronald Rosenfeld, Tom M Mitchell, and Andrew Y Ng. 1998. Improving Text Classification by Shrinkage in a Hierarchy of Classes. In ICML, Vol. 98. 359–367.
[7] Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. arXiv preprint arXiv:1801.06146 (2018).
[8] Carlos N Silla and Alex A Freitas. 2011. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22, 1-2 (2011), 31–72.