<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Encoder-Decoder neural networks for taxonomy classification</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Makoto</forename><surname>Hiramatsu</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Kei</forename><surname>Wakabayashi</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">Graduate School of Library, Information and Media Studies</orgName>
								<orgName type="institution">University of Tsukuba Tsukuba</orgName>
								<address>
									<settlement>Ibaraki</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Faculty of Library, Information and Media Science</orgName>
								<orgName type="institution">University of Tsukuba Tsukuba</orgName>
								<address>
									<settlement>Ibaraki</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Encoder-Decoder neural networks for taxonomy classification</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">5D303200D2D2E9BDA34831E9EAEEB1AB</idno>
					<idno type="DOI">10.1145/nnnnnnn.nnnnnnn</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T06:32+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Encoder-Decoder Neural Networks</term>
					<term>Recurrent Neural Networks</term>
					<term>Taxonomy classification</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes our taxonomy classifier for SIGIR eCom Rakuten Data Challenge. We propose a taxonomy classifier based on sequenceto-sequence neural networks, which are widely used in machine translation and automatic document summarization, by treating taxonomy classification as the translation problem from a description of a product to a category path. Experiments show that our method can predict category paths more accurately than baseline classifier.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Taxonomy is the major classification schemes in organizing concepts. With the rapid growth of the e-commerce market accompanying the development on the Internet, the number of products on e-commerce becomes enormous. In this situation, it is required to develop methods that predict taxonomic categories automatically because it is costly to classify all the products manually.</p><p>Rakuten Data Challenge, which is a competition we participated, provides a task to predict correct categories for each given product. As a feature of this task, categories have a hierarchical structure. This hierarchical structure corresponds to a taxonomy, which indicates that items in a category are further classified into a subcategory that contains further lower detail information. Each product has a path in the taxonomy like "Clothing, Shoes &amp; Accessories → Shoes → Men → Boots".</p><p>As an approach to solving this task, the most straightforward approach is to train a multi-class classifier (e.g., Random Forest) that predicts a category path as a class of a given product. However, as mentioned earlier, the number of category paths is 3,695, which is fairly large to be considered as a set of classes for ordinal machine learning classifier. Moreover, this approach independently treats these category paths although a category path shares a part of another category path of a similar product. It is expected that this fact causes more data sparseness issue and degrades the performance because the classifier has no way to find common patterns that are shared in two different category paths.</p><p>In this paper, we propose a taxonomy classifier based on Encoder-Decoder neural networks. The key idea is to regard the category path as a series of category names in each hierarchical level. From this perspective, the taxonomy classification task can be converted into a sequence-to-sequence problem, which has a text (i.e., a sequence of words) of the product name as the input and a sequence of category names as the output. In recent years, remarkable performance has been demonstrated in the field of machine translation and automatic summarization by using the model called neural network Encoder-Decoder architecture. We apply the Encoder-Decoder model to the taxonomy classification task and evaluate the performance. Experiments show that our approach can successfully predict category paths more precisely than the baseline approach that treats the task as a multi-class classification problem and applies Random Forest. We have 800,000 records for training data and 200,000 records for test data. Each record has a description of a product and a category path. The number of labels in the training data is 3,695, and each label is assigned to 868 items on average. The category (id=4015) is most frequently assigned to products, which is assigned to 268,295 items. Figure <ref type="figure" target="#fig_0">1</ref>  Table <ref type="table" target="#tab_0">1</ref> shows the histogram of the depth of category path in the training set. The depth of category path in training set is 4.01 on average. In other words, each product has four categories on average. The maximum depth of the depth of category path was 8, and the minimum depth was 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">DATASET</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">PROPOSED METHOD 3.1 Preprocessing</head><p>We used 20 % of the training dataset as the validation set to evaluate models. As preprocessing, we lowercase a product name in training/validation/test sets with SpaCy<ref type="foot" target="#foot_0">1</ref> . We use both the original corpus and the lowercase corpus and compare classifier performances.</p><p>For the weights of dense word representation layer, we use GloVe <ref type="bibr" target="#b5">[6]</ref> pre-trained embeddings trained on Gigaword and Wikipedia. GloVe contains the lowercase words in its vocabulary. The preprocessing of lowercase makes the vocabulary matchinд rate improve. We show the matchinд rate of two corpora in Table <ref type="table" target="#tab_1">2</ref> where source means descriptions of products, which are inputs. matchinд rate is defined by</p><formula xml:id="formula_0">matchinд rate = |V Dat aset ∩ V GloV e | |V Dat aset | ,<label>(1)</label></formula><p>where V Dat aset is the vocabulary of the dataset and V GloV e is the vocabulary in the GloVe embeddings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Encoder-Decoder neural networks for taxonomy classifier</head><p>Encoder-Decoder Neural Network is a type of neural network that is actively studied in recent years <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b6">7]</ref>, which shows very good performance in various tasks such as machine translation and automatic summarization. We will describe the Encoder-Decoder Neural Network used in this research.</p><p>Figure <ref type="figure">2</ref> shows our Encoder-Decoder neural network with attention mechanism <ref type="bibr" target="#b0">[1]</ref>. Our model has two main functions called encoder and decoder. An encoder function f enc takes an input sequence of words x = (x 1 , x 2 , . . . , x n ) and a decoder function f dec predicts the probability of a category path sequence y = (y 1 , y 2 , . . . , y m ). f enc outputs a sequnce of hidden states h = (h 1 , h 2 , . . . , h n ). To predict y t , f dec uses information from h and c t . A context vector c t captures input sequence information to help predict an each label y t . A context vector c t is defined as following:</p><formula xml:id="formula_1">c t = i a t i h i , (<label>2</label></formula><formula xml:id="formula_2">)</formula><p>and attention is defined as following:</p><formula xml:id="formula_3">a t i = âti j ât j , (<label>3</label></formula><formula xml:id="formula_4">) âti = att(h i , ht ),<label>(4)</label></formula><p>where att(h t , hi ) is an attention function. The attention function of our works is based on Luong et al. <ref type="bibr" target="#b3">[4]</ref> defined as following:</p><formula xml:id="formula_5">att(h i , ht ) = h i T W a ht , (<label>5</label></formula><formula xml:id="formula_6">)</formula><p>where h is the encoder state, h is the decoder state and W a is the weight matrix that controls the contribution of each h i and ht .</p><p>Encoder-Decoder neural networks for taxonomy classification SIGIR 2018 eCom Data Challenge, July 2018, Ann Arbor, Michigan, USA </p><p>After the encoder takes input, the decoder predicts outputs using encoder state. As a feature of the Encoder-Decoder neural network, the input sequence length and the output sequence length do not have to match. It can predict various length category path with various length of a description of a product.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">EXPERIMENTS</head><p>This section presents evaluations of our taxonomy classifier and the baseline classifier in the validation set. At the time of training, we use up to 50,000 words as the features both in baseline and the proposed model. In the experiment, we examine parameters of our taxonomy classifier (in Table <ref type="table">3</ref>) and show best parameters in each pair of encoder and decoder in Table <ref type="table">4</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Baseline</head><p>We use Random Forest <ref type="bibr" target="#b1">[2]</ref> as the baseline. Random Forest is commonly used in various kind of tasks including classification. If we try to solve the multi-label problem where there are 3,695 labels, the computational cost is very expensive. To avoid this difficulty, we use the category path as the label to predict. Therefore our baseline tries to solve the multi-class (3,695 classes) classification problem.</p><p>We use the TF-IDF vectors for features of the product description representations. To implement the baseline, we use scikit-learn <ref type="bibr" target="#b4">[5]</ref>. We use the scikit-learn's default parameters to train the Random Forest.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Results</head><p>We evaluate the performance of our proposed models and the baseline on the validation set with the official script (eval.py). We show the best parameters for each model in Table <ref type="table">4</ref>, and the results in Table <ref type="table">5</ref>. Bidirectional LSTM with GloVe achieves the best F1 score. Our model achieved the best performance when it uses Bidirectional LSTM as an encoder/decoder, lowercase dataset and use GloVe embeddings to initialize the weights of the embedding layer for the input sequence. Interestingly, it shows bad scores when we use GRU for encoder and decoder. We will further investigate the reason for this.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">CONCLUSION</head><p>In this paper, we propose an encoder-decoder neural network for taxonomy classification where there are various sizes of category paths. It is computationally expensive to solve this problem as a multi-label classification because there are over 3,695 categories in the dataset, To avoid this difficulty, we regarded taxonomy classification as the translation from the description of products to the</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Histogram of the number of words in product descriptions in the dataset</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Histogram of the depth of category paths</figDesc><table><row><cell>Category depth</cell><cell>Frequency of items</cell></row><row><cell>1</cell><cell>8,172</cell></row><row><cell>2</cell><cell>2,792</cell></row><row><cell>3</cell><cell>228,888</cell></row><row><cell>4</cell><cell>344,472</cell></row><row><cell>5</cell><cell>166,165</cell></row><row><cell>6</cell><cell>45,253</cell></row><row><cell>7</cell><cell>4,197</cell></row><row><cell>8</cell><cell>61</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Vocabulary matching rate</figDesc><table><row><cell>Preprocessing</cell><cell>Size of source vocabulary</cell><cell>Matching rate</cell></row><row><cell>None</cell><cell>670,092</cell><cell>10.69%</cell></row><row><cell>lowercase</cell><cell>626,567</cell><cell>57.82%</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://spacy.io</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGEMENTS</head><p>This work was supported by JSPS KAKENHI Grant Number 16H02904. Also, we would like to show our gratitude to Kento Nozawa and Taro Tezuka for comments that greatly improved the manuscript.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Neural Machine Translation by Jointly Learning to Align and Translate</title>
		<author>
			<persName><forename type="first">Dzmitry</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kyunghyun</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/1409.0473" />
	</analytic>
	<monogr>
		<title level="m">Proc. International Conference on Learning Representations</title>
				<meeting>International Conference on Learning Representations</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Random Forests</title>
		<author>
			<persName><forename type="first">Leo</forename><surname>Breiman</surname></persName>
		</author>
		<idno type="DOI">10.1023/A:1010933404324</idno>
		<ptr target="https://doi.org/10.1023/A:1010933404324" />
	</analytic>
	<monogr>
		<title level="j">Random Forests. Mach. Learn</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="5" to="32" />
			<date type="published" when="2001-10">2001. Oct. 2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation</title>
		<author>
			<persName><forename type="first">Kyunghyun</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bart</forename><surname>Van Merrienboer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Caglar</forename><surname>Gulcehre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dzmitry</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fethi</forename><surname>Bougares</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Holger</forename><surname>Schwenk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
		<ptr target="http://emnlp2014.org/papers/pdf/EMNLP2014179.pdfhttp://arxiv.org/abs/1406.1078" />
	</analytic>
	<monogr>
		<title level="m">Proc. Empirical Methods in Natural Language Processing</title>
				<meeting>Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Effective Approaches to Attention-based Neural Machine Translation</title>
		<author>
			<persName><forename type="first">Minh-Thang</forename><surname>Luong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Empirical Methods in Natural Language Processing</title>
				<meeting>Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1412" to="1421" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Scikit-learn: Machine Learning in Python</title>
		<author>
			<persName><forename type="first">F</forename><surname>Pedregosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Varoquaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gramfort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Michel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thirion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Grisel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blondel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Prettenhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Dubourg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vanderplas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Passos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cournapeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brucher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Perrot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Duchesnay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="2825" to="2830" />
			<date type="published" when="2011">2011. 2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">GloVe: Global Vectors for Word Representation</title>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Richard</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<ptr target="http://www.aclweb.org/anthology/D14-1162" />
	</analytic>
	<monogr>
		<title level="m">Proc. Empirical Methods in Natural Language Processing</title>
				<meeting>Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Sequence to sequence learning with neural networks</title>
		<author>
			<persName><forename type="first">Ilya</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oriol</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/1409.3215" />
	</analytic>
	<monogr>
		<title level="m">Proc. Advances in Neural Information Processing Systems</title>
				<meeting>Advances in Neural Information Processing Systems</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="3104" to="3112" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
