Team Waterloo at the SIGIR E-Commerce Data Challenge Angshuman Ghosh Vineet John Rahul Iyer University of Waterloo University of Waterloo University of Waterloo a25ghosh@uwaterloo.ca email@institution.domain rahul.iyer@uwaterloo.ca ABSTRACT existing algorithms can be re-purposed for taxonomy classification, This paper describes the methods we use to solve a taxonomy as will be subsequently discussed in Section 3. classification problem specific to categorization of products in an The seminal work on taxonomy classification was published by e-commerce catalog. We describe the use of learning algorithms KOLLER. The authors propose learning a small set of features to like multinomial naive-bayes, logistic regressors, batched gradient- solve each classification task in the tree structure independently. descent classifiers and neural softmax classifiers retro-fitted for This method involves training a classifier for each non-leaf node the hierarchical classification objective. We explore methods that in the tree. This model comprises a top-down approach under the exploit the hierarchical nature of the data, and compare it against assumption that the instance is already classified with respect to all flat classification approaches. We use simple models with minimal the labels that precede the current label in the hierarchy. This also pre-processing to build and compare classifiers by reporting their drastically reduces the number of target classes’ probabilities to respective weighted precision, recall and F 1 score metrics on the predict. However, a drawback of this approach is the large number test data-set. of models that need to be trained to facilitate it. It also suffers from the cascading effect of classification failure i.e. all subsequent CCS CONCEPTS classifications after an incorrect classification will also be incorrect. Since the most common schemes for multi-class classification in • Computing methodologies → Natural language process- general are the One-vs-One and One-vs-All schemes, support vector ing; Information extraction; machines, being max-margin classifiers, have been used frequently to address this task in previous works [5] [4] [2]. These works have KEYWORDS used the SVM in the standard flat multi-classification setting and taxonomy,classification also proposed changes to adapt the classifier to the hierarchical setting, namely the Binary Hierarchical Classifier (BHC) [4] and 1 INTRODUCTION the Hierarchical SVM (H-SVM) [2]. Another work, published by McCallum et al. uses shrinkage to Taxonomy classification is a multi-label classification problem where improve classification of a hierarchy of text articles. They use a the classes form a hierarchy, as in a tree or a directed acyclic graph maximum log-likelihood estimate (MLE) based objective to shrink (DAG). Although a significant amount of research in natural lan- the MLE of each node towards the MLEs of all its ancestors. guage processing (NLP) pertains to flat-classification problems, Our approaches, in contrast, use simpler estimators, and rely class hierarchies are fairly common in the real-world, and it follows on on minimal post-processing for a few among them to collate that are there are tasks that require classification algorithms that the predictions in the event of multiple classifiers being used for a address such hierarchical classification problems. single learning algorithm. Hierarchies emerge as a natural result of the organization of extensive data-sets into fine-grained categories [6]. In this task 3 APPROACH we work with a product catalog taxonomy, with each individual product being a part of an increasingly finer-grained category. The In Section 3.1, we describe the data cleaning and simple intuitions objective of the task is to be able to accurately predict the entire we use to guide our learning strategies. We subsequently describe class hierarchy under which the product is categorized. each of the approaches that we utilized to obtain the requisite The contributions of this paper are the evaluation of different predictions. strategies including flat and hierarchical classification, and the implementation of a general framework to address the task of tax- 3.1 Preprocessing onomy classification. Each data instance to be classified in this task is a product name, which is usually comprised of the product’s canonical name, a brand 2 RELATED WORK name and any high-level specifications (like dimensions, color etc.) Taxonomy classification is a topic well-explored within the NLP if applicable. community, with several novel approaches over the past two decades. There are 3008 unique category paths known from the training It is also known in the literature as ‘hierarchical’ or ‘structured’ clas- data, with a maximum category depth of 8. sification. The nature of the task implies that binary and multi-class As opposed to full sentences, there is intuitively a lesser advan- classifiers cannot be used as is to solve this problem. However, these tage in implementing a forget mechanism for past contexts in a product name, disincentivizing the use of recurrent networks as feature extractors. SIGIR’18, CopyrightJuly 2018, © 2018 Annpaper’s by the Arborauthors. Michigan, USApermitted for private and academic purposes. Copying In: J.ACM 2018. Degenhardt, G. Di Fabbrizio, S. Kallumadi,. .M. ISBN 978-x-xxxx-xxxx-x/YY/MM. Kumar, Y.-C. Lin, A. Trotman, H. Zhao $15.00 Also, most of the numbers in the data-set, for instance, in the (eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July, 2018, Ann Arbor, Michigan, USA, https://doi.org/10.1145/nnnnnnn.nnnnnnn published at http://ceur-ws.org examples presented in Table 1 are used to quantify specifications SIGIR’18, July 2018, Ann Arbor Michigan, USA Replacement Viewsonic VG710 LCD Monitor 48Watt AC Level Unique Categories Adapter 12V 4A 0 14 4-Pack Replacement Engine Air Filter for 2009 Sterling Truck 1 108 Bullet 55 L6 6.7 Car/Automotive 2 865 Luscious Pink Perfume 3.4 oz Eau De Parfum Spray For Women 3 1573 By MARIAH CAREY 4 752 Table 1: Examples of products with numbers 5 232 6 148 7 3 Table 2: Level-wise category counts of the product, and are not useful unless they’re part of the brand or product name itself. Hence, we trim the numbers from the product names, in addition to trimming the default english stopwords from curated dictionaries maintained by NLTK1 , spaCy2 and scikit-learn3 . This eliminates most of the conjunctions and prepositions used within a product name that are not relevant to its class. Additionally, we also remove any word less that 3 characters in length, to eliminate most of the uninformative measure of quantities (e.g. V, mm, oz etc.). 3.2 One vs. All Classification The class taxonomy provided is a Tree and each leaf node uniquely identifies a class. Thus, for our initial trials, we decided to train models using only leaf nodes. We used two classifiers: Multinomial Naive Bayes (MNB) and XGBoost [1]. While there were no hyper- parameters for MNB, XGBoost has a considerable number of tunable parameters. XGBoost is an ensemble method where a large number Figure 1: Classifier per level architecture of decision trees are combined using gradient boosting and it has performed really well in multiple Kaggle competitions. The two most important hyper parameters are the depth of the decision However, this method suffers from the shortcoming of disjoint trees that are grown, and the number of trees that are trained. We predictions from level-to-level. For example, from the classifier 1 performed a random search for parameters and obtained the best may predict a class that is not a descendant of the prediction of results with a max-depth of 7 and 200 estimators. class 0. In this event, some additional processing is required such that the final category path is valid. 3.3 Multi-level Classification We use a greedy search within the empirically computed log- This classifier attempts to exploit the hierarchical nature of the probabilities of classifiers for level ≥ 1. Instead of predicting the categories. This requires using multiple classifiers (one for each child class with the maximum log-likelihood predicted by the classi- level of the taxonomy tree), and combining their eventual prediction fier, we simply chose the next level’s prediction to be the valid child results to form the entire category path prediction. Based on the of the previous level’s prediction with the maximum log-likelihood. training set, the count of unique classes to predict at each level is An alternative to this approach would be to evaluate multiple pos- shown in Table 2. sibilities at each level in a beam-search with the objective of maxi- This is analogous to the Local Classifier Per Level approach mizing sum of their log-likelihood. An obvious drawback of using discussed in a comprehensive survey of hierarchical classification this method is that an incorrect prediction at a level will result in methods [8]. This is not a hierarchical classification algorithm in incorrect predictions at all of the levels below it. itself, but the method serves to correct the hierarchical inconsisten- If a the class predicted at any level happens to be a leaf node, cies that result from predicting the class at each level in isolation. the search is terminated. This is a safe operation to do, as the tree To contextualize this approach with the current problem, we train nodes in the product data-set are exclusively either leaf or non-leaf 8 separate classifiers, each the input for each classifier being the nodes. features extracted from the product name and the output being the probability distribution across all the possible classes at the 3.4 Neural Classification specific level for which the classifier was trained, as depicted in We build multiple models using artificial neural networks as the Figure 1. The feature extractor for this model is a bag-of-words function approximators with non-linear activation functions (ReLU). tf-idf vectorizer. 3.4.1 Simple Multi-Layer Perceptron. The feature extractor for 1 http://www.nltk.org/nltk_data 2 https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py the first neural classifier uses the tf-idf weighting scheme for the 3 https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_ words in the corpus, and a single hidden layer of neurons to predict extraction/stop_words.py the entire category path of each product. In this formulation, the Team Waterloo at the SIGIR E-Commerce Data Challenge SIGIR’18, July 2018, Ann Arbor Michigan, USA entire category path, like ‘2296>3597>2989’, for instance, is treated 5 DISCUSSION as a distinct class. Based on our experimentation and reported results, there seems to Another neural network model we tried was using the ULMFiT be little to no advantage in using methods that predict each level [7] pre-trained weights trained on Wikipedia-103 data-set and fine- of the category path independently. This could be attributed to the tuning the language model to predict the taxonomy. This method fact that the evaluation metric is equally punitive for both partially didn’t produce any better results as the data-set contained a lot of correct and completely incorrect category paths, thereby, providing rare words and the number of examples to fine-tune the embeddings no extra benefit for paths that are partially correct, typically seen layer was simply not sufficient. We report a precision of 0.64 with in piece-wise constructed methods like the per-level local classifier this model. approach. We evaluated a per-level local classifier, using a batched gradient- 3.4.2 Per-Level Multi-Layer Perceptron. We also built a neural descent model and a neural model. These were post-processed network with the number of layers equivalent to the number of to include only valid category-paths. However, these turned out the maximum depth of the taxonomy. We employed layer-wise to be our worst performing models. The baseline against the flat training and used the softmax predictions of a layer along with classification model using the same classification algorithm and the word embeddings together as the input for the next layer. The hyper-parameters indicates that leaf-node only flat classification model showed promising results for the initial two levels but the performs better and our subsequent evaluations were restricted to performance fell as the number of classes increased as we go down flat classifiers. the tree. 6 CONCLUSION 3.5 Ensembles In this paper, we describe our approach to address the taxonomy We also build two ensemble models constructed using all the previ- classification task for product categorization. We build a variety ously described classifiers. We utilize a simple voting-based scheme of models include flat and per-level classifiers, implemented using for combining the classifier predictions, with the majority label multinomial naive-bayes, logistic regressor, batched SGD classifier being selected as the final label. We built two different ensembles: and multi-layer perceptron models. Our best performing model (i) Combining predictions made by all the models implemented, uses a minimal amount of data-preprocessing, and uses a single (ii) Combining predictions for which the weighted F 1 score was at hidden layer in an MLP model with non-linear activations with the least 0.70. features extracted using a tf-idf weighting scheme vectorizer. In future work, we would like to explore the possibility of train- 4 EVALUATION ing a distinct model per non-leaf node to address offset the problem The training data is comprised of 800,000 products and their re- of having to perform a greedy search for postprocessing in the spective category path labels, a subset of which we hold out for per-level classifier approach, as well as to implement additional validation. The test set is comprised of 200,000 products, on which boosting techniques to ensemble our best performing models. the experiment results are reported. The metrics used to evaluate model performance, are weighted REFERENCES precision, recall and a F 1 scores. [1] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting sys- tem. In Proceedings of the 22nd acm sigkdd international conference on knowledge Table 3 shows the performance of each of the models against the discovery and data mining. ACM, 785–794. test set. [2] Yangchi Chen, Melba M Crawford, and Joydeep Ghosh. 2004. Integrating support vector machines in a hierarchical output space decomposition framework. In Geoscience and Remote Sensing Symposium, 2004. IGARSS’04. Proceedings. 2004 IEEE International, Vol. 2. IEEE, 949–952. Approach P R F1 [3] D KOLLER. 1997. Hierarchically classifying documents using very few words. In Proc. 14th International Conference on Machine Learning. 170–178. Local-Per-Level Stochastic Gradient Descent 0.50 0.49 0.46 [4] Shailesh Kumar, Joydeep Ghosh, and Melba M Crawford. 2002. Hierarchical Flat One-vs-All Multinomial Naive Bayes 0.55 0.54 0.47 fusion of multiple classifiers for hyperspectral data analysis. Pattern Analysis & Flat Stochastic Gradient Descent 0.65 0.66 0.63 Applications 5, 2 (2002), 210–220. [5] Ana Carolina Lorena and André Carlos PLF de Carvalho. 2004. Comparing tech- Flat One-vs-All Logistic Regression 0.76 0.74 0.72 niques for multiclass classification using binary SVM predictors. In Mexican Inter- Flat One-vs-All XGBoost 0.77 0.77 0.75 national Conference on Artificial Intelligence. Springer, 272–281. Flat Multi-Layer Perceptron 0.78 0.79 0.78 [6] Andrew McCallum, Ronald Rosenfeld, Tom M Mitchell, and Andrew Y Ng. 1998. Improving Text Classification by Shrinkage in a Hierarchy of Classes.. In ICML, Ensemble - All 0.77 0.76 0.75 Vol. 98. 359–367. Ensemble - XGBoost and Neural Networks 0.78 0.77 0.75 [7] Sebastian Ruder and Barbara Plank. 2018. Strong Baselines for Neural Semi- supervised Learning under Domain Shift. arXiv preprint arXiv:1804.09530 (2018). Table 3: Experiment Results [8] Carlos N Silla and Alex A Freitas. 2011. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22, 1-2 (2011), 31–72. We observe that even simple classification approaches that ig- nore the hierarchical nature of the data, like the Flat MLP classifier, obtains the best scores among the different models used, with the ensembling methods achieving comparable performance.