Com Data Challenge). ACM, New York, NY, USA, Article

Large Scale Taxonomy Classification using BiLSTM with Self-Attention

Hang Gao

hanggao1@umbc.edu

Tim Oates

oates@cs.umbc.edu

University of Maryland Baltimore County, Baltimore, Maryland, USA

2018

In this paper we present a deep learning model for the task of large scale taxonomy classification, where the model is expected to predict the corresponding category ID path given a product title. The proposed approach relies on a Bidirectional Long Short Term Memory Network (BiLSTM) to capture the context information for each word, followed by a multi-head attention model to aggregate useful information from these words as the final representation of the product title. Our model adopts an end-to-end architecture that does not rely on any hand-crafted features, and is regularized by various techniques.

1 INTRODUCTION

The cataloging of product listings through taxonomy categorization is a popular research area for many e-commerce marketplaces. The task is challenging for various reasons, for example, the lack of real data from actual commercial product catalogs, the noisy nature of product labels and the typically imbalanced data distribution.

In this paper, we present an end-to-end neural network based system for taxonomy classification. The proposed approach employs a BiLSTM network augmented with a multi-head self-attention mechanism, producing a feature representation used for classification. We also apply various regularization techniques to the system in order to obtain better generalization.

2 OVERVIEW

Our approach consists of two main steps: (1) the sampling step, where we over-sample instances of rare categories with augmentation; (2) the network step, which includes three parts: a recurrent neural network to generate word representations with context information; a self-attention network to generate a distribution over the enriched word representations; and a classifier that performs classification.

Task Definition. The task is to predict the category ID path given a product title, as shown in Table 1, which includes examples taken from the data challenge description page. Systems are evaluated by weighted precision, recall and F1 score, where a prediction counts as correct only if it matches the complete category ID path.
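As an illustration of the evaluation measure, the following minimal sketch computes weighted precision, recall and F1 under exact path matching. It assumes that "weighted" means each path's score is weighted by its number of gold instances (the convention used by common toolkits); the function name `weighted_prf` and the example paths are hypothetical and not taken from the challenge data.

```python
from collections import Counter

def weighted_prf(gold, pred):
    """Support-weighted precision, recall and F1 over complete category ID paths."""
    support = Counter(gold)      # gold instances per path
    predicted = Counter(pred)    # predicted instances per path
    # A prediction is correct only if the whole path string matches.
    correct = Counter(g for g, p in zip(gold, pred) if g == p)

    total = len(gold)
    precision = recall = f1 = 0.0
    for path, n in support.items():
        p = correct[path] / predicted[path] if predicted[path] else 0.0
        r = correct[path] / n
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total       # assumption: weight by gold support
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1

# Hypothetical category ID paths, for illustration only.
gold = ["1>2>3", "1>2>3", "1>2>4", "5>6"]
pred = ["1>2>3", "1>2>4", "1>2>4", "5>6"]
print(weighted_prf(gold, pred))
```

Because matching is done on the complete path, a prediction that gets only the top-level category right receives no credit in this sketch.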

The task adopts the data released by Rakuten, which includes 1M product listings in TSV format, split into train and test sets with an 80%/20% ratio. The train set includes 3008 category ID paths and is highly imbalanced. In Figure 1, we show a comparison of the number of product titles among these category ID paths.

2.1 Sampling

As mentioned above, the data is highly imbalanced, so it is important to over-sample instances of rare classes to force the model to pay more attention to them, instead of being overwhelmed by the frequent ones. A common over-sampling strategy is replication, but in our system we adopt a different version in which we augment the replicated samples, to prevent the system from simply memorizing them and thereby obtain better generalization.

Given a sample, we first replicate it $d$ times, where $d$ is randomly picked from the set $\{1, 2, 3, \ldots, D\}$; we then concatenate these replicas and split them into a bag of words, followed by random shuffling; next we pick a subset of the bag of words and generate a sequence from it; finally we append the sequence to the end of the original sample to get the augmented one. This sampling strategy aims at making the model robust to the noise generated by reordering or repeating pieces of a sample.
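The augmentation procedure can be sketched as follows. This is a minimal illustration rather than the exact implementation: the text does not state how the subset of the bag of words is chosen, so the sketch assumes a fixed fraction `keep_ratio` of the shuffled bag; the names `augment`, `max_replicas` (standing in for $D$) and `keep_ratio` are hypothetical.

```python
import random

def augment(title, max_replicas=3, keep_ratio=0.5, rng=random):
    """Append a shuffled subset of the replicated words to the original title."""
    words = title.split()
    d = rng.randint(1, max_replicas)          # d drawn uniformly from {1, ..., D}
    bag = words * d                           # d concatenated replicas as a bag of words
    rng.shuffle(bag)                          # random shuffling, in place
    k = max(1, int(len(bag) * keep_ratio))    # subset size: an assumption of this sketch
    noise = bag[:k]                           # subset of the shuffled bag
    return " ".join(words + noise)            # original sample followed by the noise

# Hypothetical product title, for illustration only.
print(augment("red cotton shirt", rng=random.Random(0)))
```

The original title is always kept intact at the front, so the appended words only add reordering and repetition noise.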

2.2 Recurrent Neural Network

We model product titles using recurrent neural networks (RNNs). RNNs process their input sequentially, with each time step sharing the same operation (commonly achieved by sharing weights). In addition, the output of an RNN at each time step is fed back to itself as part of the input at the next time step. In this way, an RNN can handle inputs of variable length.

However, RNNs are known to be hard to train [9], due to the exploding/vanishing gradient problems [2, 4]. A key idea to overcome these problems is to construct a constant error flow (CEF) for each RNN neuron. Inspired by this idea, more sophisticated variants of vanilla RNNs, such as the Long Short Term Memory (LSTM) network [5] and the Gated Recurrent Unit (GRU) network [3], have been proposed, which allow better gradient flow for learning long-term dependencies.

2.3 Self-Attention Mechanism

Instead of directly using the final hidden state $h_t$ of an RNN on a product title as its final representation $r$, we use a self-attention mechanism [1], in order to amplify the contribution of important words. When using an attention mechanism, we compute $r$ as a convex combination of all hidden states $h_i$, $i \in [1, t]$, with weights $a_i$ indicating the importance of their corresponding $h_i$. Formally, $r = \sum_{i=1}^{t} a_i h_i$, where $\sum_{i=1}^{t} a_i = 1$ and $a_i \geq 0$. Figure 2 illustrates the difference between regular RNNs and RNNs with an attention mechanism.
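The convex combination above can be illustrated with a small sketch. The way the weights $a_i$ are scored is not specified at this point in the text, so the sketch assumes a dot product between each hidden state and a query vector, normalized with a softmax; the names `attention_pool` and `query` are hypothetical.

```python
import math

def attention_pool(hidden_states, query):
    """Return r = sum_i a_i * h_i and the weights a_i (non-negative, summing to 1)."""
    # Assumption: score each hidden state by a dot product with a query vector.
    scores = [sum(h_j * q_j for h_j, q_j in zip(h, query)) for h in hidden_states]
    # Softmax with max-subtraction for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Convex combination of the hidden states.
    dim = len(hidden_states[0])
    r = [sum(a * h[j] for a, h in zip(weights, hidden_states)) for j in range(dim)]
    return r, weights

# Toy hidden states; a zero query gives uniform weights, so r is their mean.
r, a = attention_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0])
print(r, a)
```

The softmax guarantees both constraints on the weights, which is why $r$ always lies in the convex hull of the hidden states.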

3 MODEL DESCRIPTION

We use a multi-layer word-level BiLSTM to capture context information for each word of a product title and a multi-head attention model to aggregate useful information from the learned word representations generated by the BiLSTM. We present the architecture of the proposed model in Figure 3.

Embedding Layer. The input to the network is a product title, treated as a sequence of words. We use an embedding layer to project each word of the sequence $w_1, w_2, w_3, \ldots, w_t$ to a low-dimensional dense vector $v_i \in \mathbb{R}^r$, where $r$ is the dimension of the embedding space and $t$ is the number of words in a product title. It is common to pre-train word embeddings with algorithms like Word2Vec [8] and GloVe [10], but we simply initialize them randomly, along with the other parameters of our model.

BiLSTM Layer. An LSTM takes a sequence of vectors as input and produces an annotation for each time step, $h_1, h_2, \ldots, h_t$. A BiLSTM performs similar operations, but in both forward and backward directions. Although there are various ways to combine the forward and backward annotations $h_{f,i}$ and $h_{b,i}$ of a BiLSTM, we simply concatenate them, i.e., $h_i = h_{f,i} \,\|\, h_{b,i}$, where $\|$ denotes the concatenation operation. Note that $h_i \in \mathbb{R}^{2L}$, where $L$ is the size of the BiLSTM hidden layer.

Multi-Head Attention. Similar to the attention mechanism mentioned above, a multi-head attention model also aims at aggregating useful information from word features, but allows multiple convex combinations that attend to different words. In our model, we adopt an attention model with the following transitions:

In this paper, we also adopt different regularization techniques to improve the model’s generalization capability. Specifically, we use L2 regularization, embedding dropout, DropConnect [7] and dropout [12].

Embedding dropout. We employ embedding dropout by randomly dropping out dimensions of word embeddings, with the remaining dimensions scaled by $1/(1-\rho)$, where $\rho$ denotes the dropout probability. This is equivalent to applying random Bernoulli noise to the word embeddings.
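A minimal sketch of this operation on a single embedding vector is given below. It assumes each dimension is dropped independently with probability $\rho$; the function name `embedding_dropout` is hypothetical, and a real system would apply the same idea to a tensor of embeddings.

```python
import random

def embedding_dropout(embedding, rho, rng=random):
    """Zero each dimension with probability rho; scale the rest by 1 / (1 - rho)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    scale = 1.0 / (1.0 - rho)
    # Independent Bernoulli draw per dimension (assumption of this sketch).
    return [0.0 if rng.random() < rho else x * scale for x in embedding]

# With rho = 0.4, surviving dimensions are scaled by 1 / 0.6.
print(embedding_dropout([1.0, 1.0, 1.0, 1.0, 1.0], 0.4, random.Random(0)))
```

The scaling keeps the expected value of every dimension unchanged, so no rescaling is needed at evaluation time.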

DropConnect. Preventing overfitting in recurrent neural networks is a popular research area. Many of the proposed methods focus on the hidden state vector $h_i$, introducing a dropout operation between time steps or on the update to the memory state $c_i$. Merity et al. [7] instead use an approach called "DropConnect" that randomly drops connections of hidden neurons to themselves, i.e., entries of the hidden-to-hidden weight matrices. In our approach, we adopt this technique on both the forward and backward LSTMs.
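The following sketch illustrates the idea on a plain tanh recurrence instead of an LSTM, purely to keep the example short. It assumes that the mask on the hidden-to-hidden weights is sampled once per sequence and reused at every time step, and that kept weights are scaled by $1/(1-\rho)$; the names `drop_connect` and `run_rnn` are hypothetical.

```python
import math
import random

def drop_connect(weight_hh, rho, rng=random):
    """Randomly zero individual hidden-to-hidden weights."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    scale = 1.0 / (1.0 - rho)  # assumption: rescale the kept weights
    return [[0.0 if rng.random() < rho else w * scale for w in row]
            for row in weight_hh]

def run_rnn(inputs, weight_hh, rho, rng=random):
    """Simplified recurrence h_t = tanh(W_hh h_{t-1} + x_t) with DropConnect."""
    masked = drop_connect(weight_hh, rho, rng)  # sampled once, shared by all steps
    h = [0.0] * len(weight_hh)
    for x in inputs:
        h = [math.tanh(sum(w * h_j for w, h_j in zip(row, h)) + x_i)
             for row, x_i in zip(masked, x)]
    return h

# Two time steps, hidden size 2, toy weights.
print(run_rnn([[1.0, 0.0], [0.0, 1.0]], [[0.5, -0.5], [0.25, 0.75]], 0.5,
              random.Random(0)))
```

Unlike dropout on $h_i$, the noise here acts on the weights, so the hidden state passed between time steps is never masked directly.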

Dropout. Dropout is widely used as a regularization technique for deep neural networks. We employ it between BiLSTM layers and before the self-attention model. When applying dropout between BiLSTM layers, we scale the non-dropped dimensions by $1/(1-\rho)$, similar to embedding dropout; before the self-attention model, we adopt the regular version, i.e., dimensions are scaled by $(1-\rho)$ at evaluation time.

SGD remains one of the most popular optimization techniques for training deep learning models in various areas, such as computer vision, natural language processing and deep reinforcement learning. As a variant of SGD, Non-monotonically Triggered ASGD (NT-ASGD) [7] may further improve the training process, as it provides certain advantages such as asymptotic second-order convergence [6, 11]. We adopt NT-ASGD and SGD as the optimization algorithms.

4 EXPERIMENTS

4.1 Experiment Setup

Training. We use the combination of SGD and NT-ASGD [7] as the optimization algorithm. We start training the model with SGD, with the logging interval set to one epoch. After 5 non-monotone intervals, NT-ASGD is triggered and employed for the remaining epochs. We set the mini-batch size to 32, the word embedding size to 300, the hidden size of the BiLSTM to 400, the number of BiLSTM layers to 2, the number of heads to 3, the embedding dropout rate to 0.4, the DropConnect rate to 0.5, the dropout rate between BiLSTM layers to 0.25 and the dropout rate before the self-attention model to 0.3. The initial learning rate is set to 0.5 and the weight decay rate to 1.2e-6.
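The switching rule can be sketched as below. It assumes that the logged quantity is a validation loss (lower is better) and follows the non-monotone criterion of [7]: averaging starts once the current value is worse than the best value recorded more than 5 logging intervals ago. The function name `should_switch_to_asgd` and the loss values are hypothetical.

```python
def should_switch_to_asgd(history, current, nonmono=5):
    """history: validation losses from earlier epochs; current: this epoch's loss."""
    # Trigger when the current loss fails to beat the best loss that is
    # at least `nonmono` logging intervals old.
    return len(history) > nonmono and current > min(history[:-nonmono])

# Hypothetical validation losses, one per epoch.
losses = [3.00, 3.50, 3.40, 3.30, 3.20, 3.10, 3.05]
history = []
for epoch, loss in enumerate(losses, start=1):
    if should_switch_to_asgd(history, loss):
        print("switch to averaged SGD at epoch", epoch)
        break
    history.append(loss)
```

Before the trigger, training is plain SGD; afterwards the parameters are averaged over the remaining iterations.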

Result. We list the current evaluation result on the test data in Table 2, along with systems of relatively close performance. Our system currently ranks 15th, with weighted precision, recall and F1 of 0.78, 0.77 and 0.77, respectively.

4.2 Accuracy Analysis

In order to analyze the strengths and weaknesses of our system, we analyze accuracy with respect to $\log_2(n) + 1$ for each category ID path, where $n$ is the number of product titles with that path in the train set. We show the result in Figure 4.
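A sketch of how such an analysis might be computed is shown below. It assumes that $\log_2(n) + 1$ is rounded down to an integer bucket, which for $n \geq 1$ equals the bit length of $n$; the function name `accuracy_by_frequency_bucket` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def accuracy_by_frequency_bucket(train_labels, gold, pred):
    """Group per-path accuracy by floor(log2(n)) + 1, n = train count of the path."""
    train_counts = Counter(train_labels)
    totals, hits = Counter(), Counter()
    for g, p in zip(gold, pred):
        totals[g] += 1
        hits[g] += (g == p)
    buckets = defaultdict(list)
    for path, total in totals.items():
        n = train_counts[path]
        if n == 0:
            continue  # path unseen in the train set: the bucket is undefined
        bucket = n.bit_length()  # equals floor(log2(n)) + 1 for n >= 1
        buckets[bucket].append(hits[path] / total)
    return dict(buckets)

# Toy labels: path "a" has 8 train titles, "b" has 3, "c" has 1.
train = ["a"] * 8 + ["b"] * 3 + ["c"]
print(accuracy_by_frequency_bucket(train, ["a", "a", "b", "c"], ["a", "a", "c", "c"]))
```

Each bucket holds one accuracy value per category ID path, so both the typical accuracy and its spread within a frequency range can be inspected.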

Generally speaking, our model performs better on frequent category ID paths than on rare ones. For most frequent category ID paths, the model achieves almost 100% accuracy, but its accuracy varies on rare ones. This is expected, since deep learning models are well known to be data hungry: the more data they are fed, the better they tend to perform.

Another observation is that the overall accuracy on the train set is above 0.80, since only some of the less frequent category ID paths have accuracy below that threshold. Compared to the performance of our model on the test data, this suggests that our model is overfitting the train set, indicating the need for better regularization techniques.

5 EXTENSION

After stage 2, we further improve our model by adopting a set of pre-processing steps and GloVe vectors [10] for word embedding initialization. These pre-processing steps are: (1) lower-case the product titles; (2) remove all non-ASCII characters; (3) remove all punctuation; (4) remove all digits; (5) lemmatize all words with the NLTK WordNet Lemmatizer; (6) remove all rare words with document frequency less than 3.
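The pre-processing steps listed above can be sketched as follows. To keep the example dependency-free, step (5) is represented by a pluggable `lemmatize` function that defaults to the identity; in practice the `lemmatize` method of NLTK's `WordNetLemmatizer` could be passed instead. The sketch also assumes that punctuation and digits are deleted rather than replaced by spaces. The function names and example titles are hypothetical.

```python
import string
from collections import Counter

# Translation table that deletes punctuation and digits (steps 3 and 4).
_DROP = str.maketrans("", "", string.punctuation + string.digits)

def normalize(title, lemmatize=lambda w: w):
    text = title.lower()                                   # (1) lower-case
    text = text.encode("ascii", "ignore").decode("ascii")  # (2) drop non-ASCII
    text = text.translate(_DROP)                           # (3) + (4)
    return [lemmatize(w) for w in text.split()]            # (5) placeholder hook

def preprocess(titles, min_df=3, lemmatize=lambda w: w):
    docs = [normalize(t, lemmatize) for t in titles]
    # Document frequency: number of titles in which a word occurs.
    df = Counter(w for doc in docs for w in set(doc))
    return [[w for w in doc if df[w] >= min_df] for doc in docs]  # (6)

titles = ["Red T-Shirt 2018!", "red shirt café", "RED shirts", "blue shirt"]
print(preprocess(titles))
```

Step (6) needs statistics over the whole collection, which is why it is applied in a second pass after all titles have been normalized.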

We randomly split the train data into train/valid/test sets with ratio 0.8/0.1/0.1. To assess the impact of the new pre-processing steps and GloVe vectors, we perform two different runs: one with the same setting adopted in section 4 and the other with all the newly added extensions. Except that the number of BiLSTM layers is set to 3 and the hidden size is set to 350, we use exactly the same training setting as in section 4. We show the results in Table 3; they indicate that these extension steps may further improve the performance of our model.

6 FUTURE WORK

We aim at further improving the performance of our model in the following directions.

Word Dropout. We find that a large portion of words occur only once or twice in the train set. When the model has enough capacity, it may come to rely on these words unexpectedly, since they appear highly discriminative for classification. A possible way to reduce this effect is to randomly drop rare words before feeding a product title to the network.

Dynamic Class Re-Weighting. One technique to overcome the problem that the network simply ignores rare category ID paths is to re-weight classes in order to force the model to pay more attention to rare ones. Changing class weights dynamically during the training procedure therefore seems promising.

Ensemble of Models. Ensembling through bagging or boosting has proven to benefit many systems, thus we seek to improve model robustness by adopting this technique in the future.

REFERENCES

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).

[2] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 2 (1994), 157-166.

[3] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).

[4] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.

[5] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735-1780.

[6] Stephan Mandt, Matthew D Hoffman, and David M Blei. 2017. Stochastic gradient descent as approximate Bayesian inference. arXiv preprint arXiv:1704.04289 (2017).

[7] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182 (2017).

[8] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).

[9] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning. 1310-1318.

[10] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532-1543.

[11] Boris Polyak and Anatoli B Juditsky. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30, 4 (1992), 838-855.

[12] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929-1958.