<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Com Data Challenge). ACM, New York, NY, USA,
Article</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Large Scale Taxonomy Classification using BiLSTM with Self-Atention</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hang Gao</string-name>
          <email>hanggao1@umbc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Oates</string-name>
          <email>oates@cs.umbc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Maryland Baltimore County</institution>
          ,
          <addr-line>Baltimore, Maryland</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>4</volume>
      <issue>5</issue>
      <abstract>
        <p>In this paper we present a deep learning model for the task of large scale taxonomy classification, where the model is expected to predict the corresponding category ID path given a product title. The proposed approach relies on a Bidirectional Long Short Term Memory Network (BiLSTM) to capture the context information for each word, followed by a multi-head attention model to aggregate useful information from these words as the final representation of the product title. Our model adopts an end-to-end architecture that does not rely on any hand-craft features, and is regulated by various techniques.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The cataloging of product listing through taxonomy categorization
is a popular research area for many e-commerce marketplace. The
task is challenging due to various reasons, for example, the lack of
read data from actual commercial product catalogs, the noisy nature
of product labels and the typical unbalanced data distribution.</p>
      <p>In this paper, we present an end-to-end neural network based
system for taxonomy classification. The proposed approach employs a
BiLSTM network augmented with a multi-head self attention
mechanism, producing a feature representation used for classification.
We also regulate the system with various regulation techniques in
order to obtain better generalization.</p>
    </sec>
    <sec id="sec-2">
      <title>OVERVIEW</title>
      <p>Our approach consists of two main steps: (1) the sampling step,
where we over-sample instances of rare categories with
augmentation; (2) the network step, which includes three parts: a recurrent
neural network to generate word representation with context
information; a self-attention network to generate a distribution over
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).</p>
      <p>SIGIR 2018 eCom Data Challenge, July 2018, Ann Arbor, Michigan, USA
© 2018 Copyright held by the owner/author(s).</p>
      <p>ACM ISBN 123-4567-24-567/08/06.
https://doi.org/10.475/123_4
the enriched word representations and a classifier that performs
classification.</p>
      <p>Task Definition. The task is to predict the category ID path given
a product title, as shown in Table 1, which includes examples taken
from the data challenge description page. Evaluation of a system in
this task is measured by weighted-precision, recall, F1 score with
complete matching.</p>
      <p>The task adopts the data released by Rakuten, which includes
1M product listings in tsv format, split into train/test set with ratio
80%/20%. The train set includes 3008 category ID paths and is highly
imbalanced. In Figure 1, we show a comparison of the number of
product titles among these category ID paths.
2.1</p>
    </sec>
    <sec id="sec-3">
      <title>Sampling</title>
      <p>As mentioned above, since the data is highly imbalanced, it is often
important to over-sample instances of rare classes to force a model
to pay more attention to them, instead of overwhelmed by those
frequent ones. A common strategy of over-sampling is replication,
but in our system, we adopts a diferent version where we
augment the replicated samples to prevent our system from simply
remembering them in order to get better generalization.</p>
      <p>Given an sample, we first randomly replicate it by d times, where
d is randomly picked from the set {1, 2, 3, ..., D}; we then
concatenate these replicas together and split them into a bag of words,
followed by random shufling; next we pick a subset of the bag of
words and generate a sequence based on them; finally we append
the sequence to the end of the original sample to get the augmented
one. This sampling strategy aims at enforcing the model to be
robust to the noise generated by reordering or repeating pieces of a
sample itself.
2.2</p>
    </sec>
    <sec id="sec-4">
      <title>Recurrent Neural Network</title>
      <p>We model product titles using recurrent neural networks (RNN).
RNNs processes their input in a sequential way, with each time
step sharing the same operation (commonly achieved by sharing
weights). In addition, for a rnn, the output of each time step is fed
back to itself as a part of the input at next time step. In this way, a
rnn is powerful at handling inputs of variable length.</p>
      <p>
        However, RNNs are known to be hard to train [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], due to the
gradient exploding/vanishing problems [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
        ]. A key idea to
overcome these problems is to construct constant error flow (CEF) for
each RNN neuron. Inspired by it, more sophisticated variants of
(a) Regular RNNs
(b) RNNs with attention mechanism
vanilla RNNs like Long Short Term Memory (LSTM) network [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and Gated Recurrent Unit (GRU) network [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] are proposed, which
allow better gradient flow to learn the long-term dependencies.
2.3
      </p>
    </sec>
    <sec id="sec-5">
      <title>Self-Attention Mechanism</title>
      <p>
        Instead of directly using the final hidden state ht of a rnn on a
product title as its final representation r , we use a self-attention
mechanism [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], in order to amplify the contribution of important
words. When using a attention mechanism, we compute r as a
convex combination of all hidden states hi , i ∈ [1, t ], with weights
ai , indicating the importance of their corresponding hi . Formally,
r = Ít
      </p>
      <p>i=0 ai hi , where Íi ai = 1 and ai &gt;= 0. Figure 2 illustrates
the diference between regular RNNs and RNNs with attention
mechanism.
3</p>
    </sec>
    <sec id="sec-6">
      <title>MODEL DESCRIPTION</title>
      <p>We use a multi-layer word-level BiLSTM to capture context
information for each word of a product title and a multi-head attention
model to aggregate useful information from the learned word
representations generated by the BiLSTM. We present the architecture
of the proposed model in Figure 3.</p>
      <p>
        Embedding Layer. The input to the network is a product title,
treated as a sequence of words. We use an embedding layer to
project the words w1, w2, w3, ..., wt to a low dimensional dense
vector vir , where r is the dimension of embedding space and t is the
number of words in a product title. It is often popular to pre-train
word embeddings with algorithms like Word2Vec [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and Glove
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], but we simply randomly initialize them with other parameters
in our model.
      </p>
      <p>BiLSTM Layer. A LSTM takes a sequence of vectors as input and
produces an annotation for each time step h1, h2, ..., ht . A BiLSTM
performs similar operations, but in both forward and backward
directions. Although there are various ways to combine the forward
and backward annotation hf ,i and hb,i for a BiLSTM, we simply
concatenate them together, i.e., hi = hf ,i ||hb,i , where || denotes
the concatenation operation. Note that hi ∈ R2L , where L is the
size of the BiLSTM hidden layer.</p>
      <p>
        Multi-Head Attention. Similar to the attention mechanism
mentioned above, a multi-head attention model also aims at
aggregating useful information from word features, but allows multiple
convex combinations for attention on diferent words. In our model,
we adopt an attention model with the following transitions:
In this paper, we also adopt diferent regulation techniques to
improve the model’s generalization capability. In specific, we use L2
regularization, embedding dropout, DropConnect [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and dropout
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>Embedding dropout. We employ embedding dropout by
randomly dropping out dimensions of word embeddings with the rest
dimensions scaled by 1/1 − ρ, where ρ denotes the dropout
probability. This is equivalent to adding random bernoulli noise to the
word embeddings.</p>
      <p>
        DropConnect. Preventing overfitting within recurrent neural
network has been a popular research area that draws a lot of
attention. Many of the proposed methods focus on the hidden state
vector hi , aiming at introducing a dropout operation between time
steps or on the update to the memory state ci . [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] instead proposes
an approach called "DropConnect" that randomly throws away
connections of hidden neurons to themselves, i.e., the hidden to
hidden weight matrices. In our approach, we adopt this technique
on both forward and backward LSTMs.
      </p>
      <p>
        Dropout. Dropout is widely used as a regulation technique for
deep neural networks. We employ the technique between BiLSTM
layers and before self-attention model. When applying dropout
between BiLSTM layers, we scale the non-dropped dimensions by
1/1 − ρ, similar to embedding dropout, while when used before
self-attention model, a regular version is adopted, i.e., dimensions
are scaled by (1 − ρ) at evaluation time.
SGD remains one of the most popular optimization techniques for
training deep learning models in various areas, such as computer
vision, natural language processing and deep reinforcement learning.
As a variant of SGD, Non-monotonically Triggered ASGD [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
(NTASGD), may further improve the training process as it provides
certain advantages such as its asymptotic second-order
convergence [
        <xref ref-type="bibr" rid="ref11 ref6">6, 11</xref>
        ]. We adopt NT-ASGD and SGD as the optimization
algorithms.
4
4.1
      </p>
    </sec>
    <sec id="sec-7">
      <title>EXPERIMENTS</title>
    </sec>
    <sec id="sec-8">
      <title>Experiment Setup</title>
      <p>
        Training. We use the combination of SGD and NT-ASGD [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as
the optmization algorithm. Initially we start training the model
by SGD algorithm with logging interval set as one epoch. After
5 non-monotone interval, NT-ASGD is triggered and employed
for the rest epochs. We set the mini-batch size to be 32, the word
embedding size to be 300, the hidden size of BiLSTM to be 400,
the number of layers of BiLSTMs to be 2, the number of heads to
be 3, the embedding dropout rate to be 0.4, the DropConnect rate
to be 0.5, the dropout rate between BiLSTM layers to be 0.25 and
the dropout rate before self-attention model to be 0.3. The initial
learning rate is set to be 0.5 and the weight decay rate to be 1.2e-6.
      </p>
      <p>Result. We list the current evaluation result on test data in Table
2, along with systems with relative close performance. Our system
currently rank at 15 with weighted precision, recall and F1 to be
0.78, 0.77 and 0.77.
4.2</p>
    </sec>
    <sec id="sec-9">
      <title>Accuracy Analysis</title>
      <p>In order to analyze the strength and weakness of our system, we
perform an analysis on accuracy with respect to loд2(n) + 1 for
each category ID path, where n is their corresponding number of
product titles in the train set. We show the result in Figure 4.</p>
      <p>Generally speaking, our model performs better on frequent
category ID paths than rare ones. For most frequent category ID paths,
the model can achieve almost 100% accuracy, but varies when it
comes to rare ones. It is as expected since deep learning models are
well known to often be data hungry. The more data they are fed,
the better performance they can achieve.</p>
      <p>Another observation is that the overall accuracy for train set is at
least above 0.80 as the accuracy of only some less frequent category
ID paths is below that threshold. Compared to the performance of
our model on test data, this suggests that our model is overfitting the
train set, indicating the necessity of better regulation techniques.
5</p>
    </sec>
    <sec id="sec-10">
      <title>EXTENSION</title>
      <p>
        After stage2, we further improve our model by adopting a set of
pre-processing steps and Glove vectors [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for word embedding
initialization. These pre-processing steps include: (1) lower case
the product titles; (2) remove all non-ascii characters; (3) remove
all punctuation; (4) remove all digits; (5) stem all words with NLTK
WordNet Lemmatizer; (6) remove all rare words with document
frequency less than 3.
      </p>
      <p>We randomly split train data into train/valid/test sets according
to the ratio 0.8/0.1/0.1. In order to compare the impact of the new set
of pre-processing steps and Glove vectors, we perform two diferent
runs: one with the same setting adopted in section 4 and the other
with all the newly added extensions. Except that the number of
layers of BiLSTM is set to be 3 and the hidden size is set to be 350,
we use exactly the same training setting as in section 4. We show
the results in Table 3 and these results indicate that these extension
steps may further improve the performance of our model.
6</p>
    </sec>
    <sec id="sec-11">
      <title>FUTURE WORK</title>
      <p>We aim at further improving the performance of our model in the
following directions.</p>
      <p>Word Dropout. We find a large portion of words occur only
once or twice in the train set, which may cause the model to
unexpectedly rely on them as they show strong discrimination power
when it comes to classification when the model has enough capacity.
A possible way to reduce the impact is to randomly dropping out
rare words before feeding a product title to the network.
Dynamic Class Re-Weighting. One technique to overcome
the problem that the network simply ignores rare category ID
paths is to re-weight classes in order to force the model to pay more
attention to rare ones. Thus changing class weights dynamically
during the training procedure seems promising.</p>
      <p>Ensemble of Models. Ensembling through bagging or boosting
has proven to benefit many systems, thus we seek to improve model
robustness by adopting this technique in the future.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>arXiv preprint arXiv:1409.0473</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          , Patrice Simard, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Frasconi</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Learning long-term dependencies with gradient descent is dificult</article-title>
          .
          <source>IEEE transactions on neural networks 5</source>
          ,
          <issue>2</issue>
          (
          <year>1994</year>
          ),
          <fpage>157</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart Van Merriënboer,
          <string-name>
            <surname>Caglar Gulcehre</surname>
            , Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
            <given-names>Yoshua</given-names>
          </string-name>
          <string-name>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1406.1078</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          , Yoshua Bengio, Paolo Frasconi,
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          , et al.
          <year>2001</year>
          .
          <article-title>Gradient flow in recurrent nets: the dificulty of learning long-term dependencies</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9</source>
          ,
          <issue>8</issue>
          (
          <year>1997</year>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Stephan</given-names>
            <surname>Mandt</surname>
          </string-name>
          ,
          <string-name>
            <surname>Matthew D Hofman</surname>
          </string-name>
          , and David M Blei.
          <year>2017</year>
          .
          <article-title>Stochastic gradient descent as approximate bayesian inference</article-title>
          .
          <source>arXiv preprint arXiv:1704.04289</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Merity</surname>
          </string-name>
          , Nitish Shirish Keskar, and Richard Socher.
          <year>2017</year>
          .
          <article-title>Regularizing and optimizing LSTM language models</article-title>
          .
          <source>arXiv preprint arXiv:1708.02182</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jefrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Eficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Razvan</given-names>
            <surname>Pascanu</surname>
          </string-name>
          , Tomas Mikolov, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>On the dificulty of training recurrent neural networks</article-title>
          .
          <source>In International Conference on Machine Learning</source>
          .
          <fpage>1310</fpage>
          -
          <lpage>1318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Jefrey</surname>
            <given-names>Pennington</given-names>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          .
          <volume>1532</volume>
          -
          <fpage>1543</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Boris</surname>
            <given-names>T</given-names>
          </string-name>
          <string-name>
            <surname>Polyak</surname>
          </string-name>
          and
          <string-name>
            <surname>Anatoli B Juditsky</surname>
          </string-name>
          .
          <year>1992</year>
          .
          <article-title>Acceleration of stochastic approximation by averaging</article-title>
          .
          <source>SIAM Journal on Control and Optimization</source>
          <volume>30</volume>
          ,
          <issue>4</issue>
          (
          <year>1992</year>
          ),
          <fpage>838</fpage>
          -
          <lpage>855</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Nitish</surname>
            <given-names>Srivastava</given-names>
          </string-name>
          , Geofrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>15</volume>
          ,
          <issue>1</issue>
          (
          <year>2014</year>
          ),
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>