<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Empirical Study of Using An Ensemble Model in E-commerce Taxonomy Classification Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yugang Jia</string-name>
          <email>yugang.jia@gmail.com</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xin Wang</string-name>
          <email>wangxin8588@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanqing Cao</string-name>
          <email>vauus@yahoo.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Boshu Ru</string-name>
          <email>boshu.ru@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tianzhong Yang</string-name>
          <email>tianzhong.yang@gmail.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Software and Information</institution>
          ,
          <addr-line>Systems, UNC Charlotte, Charlotte, NC</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mahwah</institution>
          ,
          <addr-line>NJ</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Newton</institution>
          ,
          <addr-line>MA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>The University of Texas Health Science Center</institution>
          ,
          <addr-line>Houston, TX</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Winchester</institution>
          ,
          <addr-line>MA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>In the Rakuten Data Challenge on Taxonomy Classification for eCommerce-scale Product Catalogs, we propose an approach based on deep convolutional neural networks to predict product taxonomies from their descriptions. The classification performance of the proposed system is further improved with oversampling, threshold moving and error-correcting output coding. The best classification accuracy is obtained by ensembling multiple networks trained differently with multiple inputs comprising various extracted features.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Computing methodologies → Machine learning;</p>
    </sec>
    <sec id="sec-2">
      <title>1 INTRODUCTION</title>
      <p>
        E-commerce sites provide millions of products that are
continuously updated by merchants. The correct categorization
of each product plays a crucial role in helping customers find
the products that meet their needs in many respects, such as
product search, targeted advertising, personalized
recommendation and product clustering. However, due to the large
scale of the catalog, it is often infeasible and error-prone
to categorize the products manually. Therefore, large-scale
automatic categorization is in great demand.
The challenges in large-scale automatic categorization
include the following. Firstly, the products are sparsely distributed
over a large number of categories and the data distribution is far
from uniform. Such imbalanced data can largely
deteriorate the categorization performance [
        <xref ref-type="bibr" rid="ref21 ref5">5, 21</xref>
        ]. Secondly,
the commercial product taxonomy is usually organized in
a tree structure with thousands of leaf nodes, which adds
another layer of difficulty to exploring the correlations among
a large number of taxonomies in a hierarchical structure.
Various classification methods, such as flat classification, cascade
classification and probabilistic cascading, have been deployed
on large-scale taxonomies [
        <xref ref-type="bibr" rid="ref1 ref13 ref8">1, 8, 13</xref>
        ]. However, it remains
a challenging problem due to the large data scale, data
heterogeneity, and category skewness [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
      <p>
        In light of these challenges, different methods have been
proposed to achieve optimal classification performance. For
example, Naive Bayes has demonstrated effectiveness and
efficiency for classifying text documents [
        <xref ref-type="bibr" rid="ref15 ref19">15, 19</xref>
        ], but it has
poor performance when some categories are sparse [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
Support vector machines (SVMs) have served as a
well-established benchmark model for classifying e-commerce
products [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Chen et al. proposed a multi-class SVM with
an extension using margin re-scaling to optimize average
revenue loss [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, SVMs have been shown to require
longer computing time and to work well only when the
number of categories is less than five [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Among
deep learning algorithms, a recurrent neural network (RNN) approach
was proposed by Ha et al. to deal with the multi-class
classification problem on unbalanced data [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], in which the learnt
word embedding depends on a recursive representation of
the same initial feature space. In addition, convolutional
neural networks (CNNs) achieve remarkable performance in
sentence-level classification [
        <xref ref-type="bibr" rid="ref10 ref11 ref27">10, 11, 27</xref>
        ]. Recently, CNNs have
been regarded as a replacement for well-established SVM
and logistic regression models [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], which use pre-trained
word vectors as inputs for training the CNN models [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ].
      </p>
      <p>
        Product categorization is a hierarchical multi-class
classification problem. Hence, a natural way of classifying products
is to use hierarchical classification. However, hierarchical
classification suffers from an error-propagation issue [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
Kosmopoulos et al. proposed a probabilistic hierarchical classification
approach that predicts the leaf categories by estimating the
probability of each root-to-leaf path [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Cevahir et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
mitigated the error propagation issue in hierarchical
classification and achieved better results than flat models by
incorporating a large number of features in leaf category
prediction.
      </p>
      <p>
        To combat the imbalanced classification problem, cost-sensitive
training appears to be an effective solution. Zhou
et al. [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] show, through empirical studies, that only over-sampling and threshold
moving are effective for training cost-sensitive neural
networks. However, it becomes difficult to
define misclassification costs when there is a large number
of classes. A more recent paper [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] also supports similar
claims.
      </p>
      <p>
        Error-Correcting Output coding (ECOC) is another method
that has been used in multi-class text classification to further
improve a classifier’s accuracy [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. The idea behind this
approach is to encode each class label as a unique binary
code with a number of digits, such that redundancy is
introduced into the transformed class labels, which are then used
for training a supervised machine learning model. Even if
some errors occur in the prediction of a transformed label,
we may still be able to recover the correct original label by
choosing the one that is closest to the prediction. This
approach helps to reduce the output space of the model when
there are a large number of class labels. At the same time, it
also alleviates the class imbalance problem.
      </p>
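      <p>As a minimal illustration of this idea, the sketch below builds a random codebook for a few hypothetical class labels and recovers a label from a noisy predicted code by choosing the nearest codeword in Hamming distance; the codebook, class names and code length are illustrative assumptions, not the coding scheme used later in this paper.</p>
      <preformat>
# Hedged sketch of ECOC decoding: random codewords and class names are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
classes = ["classA", "classB", "classC"]                 # hypothetical class labels
code_len = 8                                             # number of binary digits per class
codebook = {c: rng.integers(0, 2, size=code_len) for c in classes}

def decode(predicted_bits):
    # Choose the class whose codeword has the smallest Hamming distance to the prediction.
    hamming = {c: int(np.sum(code != np.round(predicted_bits))) for c, code in codebook.items()}
    return min(hamming, key=hamming.get)

noisy = codebook["classB"].astype(float)
noisy[0] = 1.0 - noisy[0]                                # simulate one wrongly predicted bit
print(decode(noisy))                                     # usually still recovers "classB"
      </preformat>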
    </sec>
    <sec id="sec-3">
      <title>2 DATA CHARACTERISTICS</title>
      <p>The SIGIR eCom Data Challenge is on large-scale taxonomy
classification. The competition requires us to classify each
product description in the unlabeled testing data accurately
into one of the classes, using models developed from the
labeled training data. Data are provided as a training set (800
thousand products) and a testing set (200 thousand
products). The training data has one column containing the
product description and one column containing the product category.
There are 14 principal product categories represented by
numbers, all but one of which have secondary categories of
levels ranging from 1 to 7, or 1 to 8 levels in total, which
results in a total of 3008 classes. Each such combination
of principal and secondary categories forms a string of
numbers that represents a class, e.g., "3625&gt;4399&gt;1598&gt;3903" or
"2296&gt;3597&gt;689".</p>
      <p>There are many characteristics and potential challenges in
this dataset. At the class level, there exists large variation in
the number of samples per class, resulting in a largely
unbalanced dataset (see Table 1). Out of the 3008 classes,
19 classes have only one product, and 1484 classes (49.3 percent)
have fewer than 25 products, which sums up to
13,618 products, or 1.70 percent of all products. On the other
hand, 12 classes have more than 10,000 products, which sums
up to a total of 256,689 products, or 32.1 percent of all
products. The top 3 largest classes include 69,915, 30,146, and
25,481 products respectively (see Table 2). The classes with
many samples might embrace both the richness and the
diversity of the data, which could result in the inter-class distance
being smaller than the intra-class distance. The classes with
only a handful of samples are expected to be hard to classify
accurately in the testing data.</p>
      <p>The product description is a mixture of letters, words,
numbers and other ASCII characters. At the product description
level, there appear to be at least the following challenges.
The first challenge is that the difference between data
samples in the same class might be larger than that between samples
belonging to different classes, to the extent that some appear
mislabeled. For example, both "Mont Blanc Mb Starwalker Men
Eau De Toilette Edt 2.5Oz / 70Ml" (a type of perfume) and
"Humminbird Pc11 Power Cord" (an electronic device) are in
the category "3625&gt;4399&gt;1598&gt;3903", while "Creed Green
Irish Tweed Eau De Perfum For Men - Small Size 1oz (30ml)",
also a type of perfume, is in category "3625&gt;3005". The
second challenge is that the categories under the same parent
category (or principal category) are correlated, which
reinforces the challenge above. A third challenge is that one
abbreviation can have different meanings; for example, "hp"
could be "horsepower" or "Hewlett-Packard", and "mb" could be
"mega byte" or "marble". The fourth challenge is the large
variation in the number of words in each product
description, ranging from 1 to dozens. While a short description
might not provide sufficient information, such as the single-word
product "Bonjour", a long description might include
so many details that the relevant information is
hidden, such as "fosmon 2100mah dual port usb rapid car
charger for apple iphone 5c/5s/5/4s/4, samsung galaxy note
3/2, s5/s4/mini/active/s3, lg g3/lg g2/ g2 mini, google nexus
5/4, blackberry z10/z30/q10, htc one (m8), motorola moto
e/moto x/moto g, nokia lumia 1020".</p>
      <p>Due to the above data characteristics, data preprocessing
that filters out noise and keeps relevant information is
systematically designed and implemented, and various modeling
strategies are tested, as described in the next sections.</p>
      <p>[Table 1: distribution of category sizes (count of categories per category size range, with category size on a log10 scale).]</p>
    </sec>
    <sec id="sec-4">
      <title>3 PREPROCESSING AND FEATURIZATION</title>
      <p>In this section we introduce how the product descriptions are
preprocessed and how features are extracted to train our model.</p>
    </sec>
    <sec id="sec-5">
      <title>Preprocessing</title>
      <p>As described above, the product descriptions contain
noise. We mitigate the noise in a trial-and-error way in order
to find a good signal-to-noise balance. As a result, we apply
the following procedure. We first convert all letters to lower
case, and replace special characters, such as parentheses
(except single hyphens), and repeated characters, such as multiple
hyphens or dots, with spaces. We then unify physical units
such as "in", "ft", "hp", "ml" and "oz" that follow a number into
"nnnhp", "nnnml", "nnnoz", and so on. For example,
"3.4oz", "3.4 oz" or "3.4-oz" would become "nnnoz". Finally,
we remove dashes (-), standalone numbers, and extra white
spaces. As a result of the preprocessing, the number of distinct words is
reduced from 670K to 160K.</p>
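      <p>A rough sketch of this preprocessing in Python is given below; the regular expressions and the list of units are simplified assumptions rather than the exact rules we applied.</p>
      <preformat>
# Hedged sketch of the description clean-up; patterns and unit list are simplified assumptions.
import re

UNITS = ("in", "ft", "hp", "ml", "oz")

def preprocess(text):
    text = text.lower()
    # Replace special characters (keeping single hyphens and dots for now) with spaces.
    text = re.sub(r"[^\w\s.-]", " ", text)
    # Collapse repeated hyphens or dots into a space.
    text = re.sub(r"[-.]{2,}", " ", text)
    # Unify a number followed by a physical unit, e.g. "3.4oz", "3.4 oz" or "3.4-oz" all become "nnnoz".
    for unit in UNITS:
        text = re.sub(r"\b\d+(\.\d+)?[\s-]*" + unit + r"\b", "nnn" + unit, text)
    # Remove dashes, standalone numbers and extra white spaces.
    text = text.replace("-", " ")
    text = re.sub(r"\b\d+(\.\d+)?\b", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Mont Blanc Mb Starwalker Men Eau De Toilette Edt 2.5Oz / 70Ml"))
      </preformat>
      <p>For the perfume example above, this sketch yields "mont blanc mb starwalker men eau de toilette edt nnnoz nnnml".</p>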
    </sec>
    <sec id="sec-6">
      <title>Word-level and character-level embeddings for system input</title>
      <p>We use three different sets of word embeddings to create
vector representations of words, each of which is described
in more detail below. The resulting vectors of each word in
each product description are then concatenated (i.e.,
column-bound) as word embedding features. As shown in Fig. 1, word
embeddings are concatenated as the input to train a CNN
model. Three sets of word embeddings generated by varying
methods are used as three separate inputs.</p>
      <p>
        Word embeddings pre-trained on Google News. The
pre-trained word embeddings are trained on part of the Google
News dataset (about 100 billion words) using the word2vec
algorithm [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The representation of a word is learnt to be
useful for prediction of other words in the sentence. The
model contains 300-dimensional vectors for 3 million words
and phrases.
      </p>
      <p>
        Word embeddings pre-trained on product descriptions.
We also learn word embeddings from all product descriptions
using the word2vec algorithm implemented in Gensim
[
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], which is an unsupervised learning model. The
dimension of the word vectors is set to 50 and we train the word2vec
model using the CBOW algorithm [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] for 5 epochs. In each
epoch, we initially set the learning rate to 0.025 and let it
decay linearly to 0.0001.
      </p>
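      <p>For reference, the sketch below shows how such a model can be trained with Gensim; it assumes the Gensim 4.x API and toy example descriptions, and approximates the per-epoch learning-rate decay described above with Gensim's overall alpha-to-min_alpha schedule.</p>
      <preformat>
# Hedged sketch of training word2vec (CBOW) on product descriptions with Gensim 4.x.
from gensim.models import Word2Vec

descriptions = [
    "humminbird pc11 power cord",
    "creed green irish tweed eau de perfum for men nnnoz",
]  # toy examples standing in for the pre-processed product descriptions

sentences = [d.split() for d in descriptions]

w2v = Word2Vec(
    sentences,
    vector_size=50,      # 50-dimensional word vectors
    sg=0,                # CBOW
    alpha=0.025,         # initial learning rate
    min_alpha=0.0001,    # learning rate decayed linearly to this value
    epochs=5,
    min_count=1,
)
vector = w2v.wv["power"]   # 50-dimensional vector for a word seen in training
      </preformat>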
      <p>Word embeddings learnt in training. Firstly, we use
one-hot coding to represent each individual word in a product
description. Then we learn a weight matrix in an embedding
layer of the proposed model to transform each word into a word
vector with 50 dimensions. The word vectors transformed
from the one-hot coding are used as the input to train the first
convolutional layer.</p>
      <p>Character-level embeddings learnt in training. Besides
word embeddings, we also use character-level embeddings
learnt in training, where the learning approach is similar to
the approach of learning word embeddings in training. We use
one-hot coding to represent each unique character in the
raw texts rather than each word.</p>
    </sec>
    <sec id="sec-7">
      <title>Named entity and part-of-speech tag features</title>
      <p>
        The appearance of named entities might be associated with
certain product categories. For example, locations, landmarks,
and famous people are often prevalent in categories such
as branded perfume, books and movies, while organization
names such as Apple and HP might be seen more commonly in
electronic products. In addition, individual words
associated with named entities might be rare and therefore
filtered out by our word frequency requirement, but they could
be informative. For example, the word "Beethoven" occurs in
only one product description, but it is the name of a famous
musician and it carries strong information about what type
of product it could be. Therefore, we use the Stanford CoreNLP
package [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] to extract 23 types of named entities from the
product descriptions, namely cause of death, city, country,
criminal charge, date, duration, email, ideology, location,
misc, money, nationality, number, ordinal, organization,
percent, person, religion, set, state or province, time, title, and
URL. For each product description, we count the number
of words of each entity type and normalize the values by the
length of the sentence, resulting in a 23-dimensional feature vector
representing the distribution of named entity types.
Additionally, we generate a 36-dimensional feature vector for the
distribution of part-of-speech tags, which are also identified
by Stanford CoreNLP.
      </p>
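      <p>Given token-level NER tags produced by CoreNLP for one description, the normalized entity-type counts can be assembled as in the sketch below; the tagging step itself is omitted and the token/tag pairs are hypothetical, with tag spellings following CoreNLP's conventions. The part-of-speech distribution over the 36 tags is computed analogously.</p>
      <preformat>
# Hedged sketch of turning CoreNLP NER tags into a normalized 23-dimensional feature vector.
ENTITY_TYPES = [
    "CAUSE_OF_DEATH", "CITY", "COUNTRY", "CRIMINAL_CHARGE", "DATE", "DURATION", "EMAIL",
    "IDEOLOGY", "LOCATION", "MISC", "MONEY", "NATIONALITY", "NUMBER", "ORDINAL",
    "ORGANIZATION", "PERCENT", "PERSON", "RELIGION", "SET", "STATE_OR_PROVINCE",
    "TIME", "TITLE", "URL",
]

def ner_features(tagged_tokens):
    """tagged_tokens: list of (token, ner_tag) pairs as returned by a CoreNLP annotator."""
    counts = {t: 0 for t in ENTITY_TYPES}
    for _, tag in tagged_tokens:
        if tag in counts:
            counts[tag] += 1
    length = max(len(tagged_tokens), 1)               # normalize by the sentence length
    return [counts[t] / length for t in ENTITY_TYPES]

# Hypothetical tagging of "beethoven symphony no 9 cd"
print(ner_features([("beethoven", "PERSON"), ("symphony", "O"),
                    ("no", "O"), ("9", "NUMBER"), ("cd", "O")]))
      </preformat>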
      <p>[Figure 1: Overall system architecture. Raw text goes through pre-processing and is represented by a supervised word embedding, a Google word2vec embedding, a Google News embedding, and a supervised character-level embedding; these inputs feed Kim-CNN and Zhang-CNN networks, each followed by a dense layer with softmax and threshold moving.]</p>
    </sec>
    <sec id="sec-8">
      <title>Creating sentence level representation</title>
      <p>
        We use the doc2vec algorithm proposed in "Distributed
Representations of Sentences and Documents" by Le and Mikolov
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to create another set of features for the product descriptions.
The algorithm extends word2vec to the unsupervised learning
of continuous representations for larger blocks of text, such
as sentences, paragraphs or entire documents. We represent
each product description by a 50-dimensional feature vector
and we train the doc2vec model for 5 epochs.
      </p>
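      <p>A sketch of this step with Gensim's Doc2Vec (again assuming the 4.x API and toy inputs) is shown below.</p>
      <preformat>
# Hedged sketch of learning 50-dimensional doc2vec representations of product descriptions.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

descriptions = [
    "humminbird pc11 power cord",
    "fosmon dual port usb rapid car charger",
]  # toy examples standing in for the pre-processed product descriptions

corpus = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(descriptions)]

d2v = Doc2Vec(corpus, vector_size=50, epochs=5, min_count=1)
doc_vector = d2v.dv[0]                                              # vector for the first description
new_vector = d2v.infer_vector("apple iphone charger cable".split()) # vector for an unseen description
      </preformat>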
    </sec>
    <sec id="sec-9">
      <title>PROPOSED MODEL FOR PRODUCT</title>
    </sec>
    <sec id="sec-10">
      <title>TAXONOMY CLASSIFICATION</title>
      <p>
        The overall architecture of our system is shown in Fig. 1. We
introduce the key components of the proposed model in
the following sections. We train five models separately, each
using multiple data inputs and varying setups. To
make the final predictions, we ensemble the prediction
results from these models. All models are trained using
the Adam algorithm [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] with a learning rate of 0.001.
      </p>
    </sec>
    <sec id="sec-11">
      <title>Kim-CNN architecture</title>
      <p>
        Several recent studies have examined CNN models for text
classification tasks and reported that CNN-based models achieve
outstanding performance [
        <xref ref-type="bibr" rid="ref11 ref16 ref2">2, 11, 16</xref>
        ]. We adopt the design
of Kim’s CNN model to extract informative patterns from
the word embedding representation of input data [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Kim’s
model contains paralleled convolution filters of three
diferent kernel sizes to protrude informative patterns in a sub
area of the input data. The kernel size (k) of filter determines
the magnitude of the sub area in the case of text, that is
the number of continuous words in a sentence. By mixing
convolution filters of three continuous sizes ( k − 1, k, and
k + 1), the network can learn patterns in the sentence at
three diferent scales. This design is similar to combining
n-gram features of diferent scales (e.g., unigram, bigram,
and trigram). The max pooling filter scans through outputs
of convolution filters and preserves only the max value in
each area. This operation washes out information that is less
relevant to the classification task and reduces the
dimensionality of features extracted by convolution filters. The output
is further processed by one layer of fully-connected neurons
to condense output matrix.
      </p>
      <p>
        Based on the Kim-CNN architecture proposed in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which
used kernels of sizes 3, 4 and 5, we insert one more
kernel of size 1 to capture unigram features,
and we apply a batch normalization layer followed by a
dropout layer after the fully-connected layer of a standard
Kim-CNN; the output is then further processed by a second
fully-connected layer to yield the final outputs.
      </p>
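      <p>A minimal sketch of this modified Kim-CNN, written with the Keras functional API, is shown below; the framework choice, vocabulary size and sequence length are illustrative assumptions, while the window sizes, number of filters, dropout rate, regularization strength and optimizer follow the settings reported later in this paper.</p>
      <preformat>
# Hedged sketch of the modified Kim-CNN text classifier; layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, EMB_DIM, NUM_CLASSES = 160_000, 60, 50, 3008

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)            # word embeddings learnt in training

# Parallel convolution branches with window sizes 1, 3, 4 and 5,
# each followed by max pooling over time.
branches = []
for k in (1, 3, 4, 5):
    c = layers.Conv1D(filters=1024, kernel_size=k, activation="relu",
                      kernel_regularizer=tf.keras.regularizers.l2(3e-6))(x)
    branches.append(layers.GlobalMaxPooling1D()(c))

h = layers.Concatenate()(branches)
h = layers.Dense(1024, activation="relu")(h)                 # first fully-connected layer
h = layers.BatchNormalization()(h)                           # batch normalization added after the FC layer
h = layers.Dropout(0.5)(h)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(h) # second FC layer yields class probabilities

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
      </preformat>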
    </sec>
    <sec id="sec-12">
      <title>Oversampling and threshold moving</title>
      <p>We adopt an oversampling strategy to improve the model’s
performance on imbalanced data. When training one of the five
networks shown in Fig. 1, we initially draw 256 examples
from the training data to form one batch. We then add 768 more
examples to the batch by duplicating examples
whose classes have fewer than 5000 examples in the entire
training set. Within the 768 duplicated examples, we further
define three categories of classes according to the class sizes
in the training set: classes with 1000 - 5000 examples, 100 - 1000
examples, and fewer than 100 examples. The proportions of the
three categories of classes follow 1:2:4 to form the
total of 768 oversampled examples. In the end we have 1024
examples in each batch.</p>
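      <p>The batch construction can be sketched as follows; the bucketing thresholds and the 1:2:4 proportions follow the description above, while the sampling helpers and the exact rounding of the bucket counts are simplified assumptions.</p>
      <preformat>
# Hedged sketch of building one 1024-example batch with oversampling of small classes.
import random

def build_batch(examples, class_size):
    """examples: list of (text, label); class_size: dict mapping label to its training-set count."""
    batch = random.sample(examples, 256)                        # base batch drawn from training data

    rare = [ex for ex in examples if 5000 > class_size[ex[1]]]
    # Buckets by class size, oversampled in proportion 1 : 2 : 4 (largest to smallest classes).
    buckets = [
        [ex for ex in rare if class_size[ex[1]] >= 1000],           # 1000 - 5000 examples
        [ex for ex in rare if 1000 > class_size[ex[1]] >= 100],     # 100 - 1000 examples
        [ex for ex in rare if 100 > class_size[ex[1]]],             # fewer than 100 examples
    ]
    for bucket, count in zip(buckets, (110, 219, 439)):             # roughly 1:2:4 of 768 (rounding assumed)
        if bucket:
            batch += random.choices(bucket, k=count)                # duplicate rare examples
    return batch                                                    # 256 + 768 = 1024 examples in total
      </preformat>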
      <p>We also apply a threshold moving strategy to adjust model
predictions to alleviate the data imbalance problem. The original
outputs of our models, which are the probabilities for each
class, are divided by the class sizes before we yield the final class
labels. This strategy reduces the scores of large classes
relative to those of smaller classes. Our
experimental results show that this strategy is able to further
improve model performance.</p>
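      <p>The adjustment itself is a simple division of each predicted class probability by the corresponding class size, as sketched below.</p>
      <preformat>
# Hedged sketch of threshold moving: divide predicted class probabilities by class sizes.
import numpy as np

def predict_with_threshold_moving(probs, class_sizes):
    """probs: (n_examples, n_classes) softmax outputs; class_sizes: counts per class in the training data."""
    adjusted = probs / np.asarray(class_sizes, dtype=float)    # down-weight large classes
    return adjusted.argmax(axis=1)                             # final class indices

probs = np.array([[0.6, 0.3, 0.1]])
print(predict_with_threshold_moving(probs, [10000, 500, 50]))  # picks class 2 despite its lower raw probability
      </preformat>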
    </sec>
    <sec id="sec-13">
      <title>Error correcting output coding</title>
      <p>
        To leverage the hierarchy of class labels and explore label
correlations, we also use error-correcting output coding
[
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. For our case with a large number of classes, we create
a unique binary coding for each taxonomy. The coding for
each taxonomy level is different; for example, the first-level
taxonomy has 3 digits while the second-level taxonomy has
4 digits. We then concatenate all codes corresponding to the
multiple taxonomy levels to form the new output coding. The
original taxonomy prediction problem is thus transformed into a
multi-label learning problem using the encoded labels. The
longest sequence of taxonomies contains 8 taxonomy codes;
we pad the encoded labels of taxonomy sequences shorter
than 8 with zeros.
      </p>
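      <p>The label transformation can be sketched as below; the per-level code lengths beyond the first two and the binary code assignment (a node's index written in binary) are illustrative assumptions consistent with the description above.</p>
      <preformat>
# Hedged sketch of encoding a taxonomy path into a fixed-length multi-label target.
CODE_LENGTHS = [3, 4, 5, 5, 5, 5, 5, 5]    # assumed digits per level; only the first two are stated in the text
MAX_LEVELS = 8

def encode_level(node_index, n_bits):
    # Assign each node at a level a binary code; here simply its index written in binary (illustrative only).
    return [int(b) for b in format(node_index, "0{}b".format(n_bits))]

def encode_taxonomy(path, vocab_per_level):
    """path: e.g. [3625, 4399, 1598, 3903]; vocab_per_level: per-level dict mapping node id to index."""
    bits = []
    for level, node in enumerate(path):
        bits += encode_level(vocab_per_level[level][node], CODE_LENGTHS[level])
    # Pad sequences shorter than 8 levels with zeros so every label has the same length.
    total_len = sum(CODE_LENGTHS[:MAX_LEVELS])
    return bits + [0] * (total_len - len(bits))

vocab = {0: {3625: 1, 2296: 2}, 1: {4399: 3, 3597: 4}}    # toy per-level vocabularies
print(encode_taxonomy([3625, 4399], vocab))               # 37-bit multi-label target, zero-padded
      </preformat>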
    </sec>
    <sec id="sec-14">
      <title>Model ensembling</title>
      <p>
        The models that are trained for ensembling are shown in Fig.
1 and are described as follows:
• Model 1: The last fully-connected layer uses as input:
(1) the output of the Kim-CNN network; (2)
NER features; (3) doc2vec features.
• Model 2: The last fully-connected layer uses as
input: (1) the output of one Kim-CNN model
trained with upsampling; (2) the output of another
Kim-CNN trained with ECOC; (3) NER features.
• Model 3: The last fully-connected layer uses as
input: (1) the output of a Kim-CNN network
trained with ECOC; (2) NER features; and (3) doc2vec
features. The output of Model 3 is decoded from the
yielded multi-label predictions by first calculating
the likelihood of each individual class label
and then assigning to the example the original class label with
the highest computed likelihood.
• Model 4: We train a Kim-CNN model using character-level
embeddings of the raw texts. The last fully-connected
layer uses as input: (1) the output of the
Kim-CNN; (2) NER features; and (3) doc2vec features.
• Model 5: Different from Kim-CNN, which concatenates
the feature maps generated from different window
sizes in parallel, we train a new CNN model with six
layers stacked sequentially, which is referred to as Zhang-CNN
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. The last fully-connected layer uses as input:
(1) the output of the Zhang-CNN; (2) NER features;
and (3) doc2vec features. We use hyper-parameters
similar to those proposed in their paper, with
fine tuning.
      </p>
      <p>The output of each model is also adjusted by
threshold moving, such that each model provides 2 versions of
predictions in terms of probabilities, with and without
threshold moving. We derive an ensembling procedure to
combine the multiple model predictions. Firstly, we select the
predictions from Models 1, 4 and 5 to form an initial set of
candidate predictions, containing 6 versions of predictions.
We repeat the training of Model 2 three times with random data
shuffling and oversampling to provide 6 more versions of
predictions and add those to the set. We adopt 3 different
ways of creating output codings and repeatedly train Model
3 to obtain 6 more versions of predictions. We observe that
using even more predictions beyond 6 versions from Model
2 or 3 does not further improve the overall performance. In
the end we perform majority voting over all versions
of predicted labels within the candidate set to predict the
final labels. When there is a tie, we choose the label with the
highest averaged probability.</p>
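      <p>The final voting step can be sketched as follows; each prediction version is assumed to be an array of per-class probabilities, and ties are broken by the highest averaged probability as described above.</p>
      <preformat>
# Hedged sketch of majority voting over several versions of predicted probabilities,
# breaking ties by the highest averaged probability.
from collections import Counter
import numpy as np

def ensemble_predict(prob_versions):
    """prob_versions: list of (n_examples, n_classes) arrays, one per prediction version."""
    stacked = np.stack(prob_versions)                 # (n_versions, n_examples, n_classes)
    votes = stacked.argmax(axis=2)                    # predicted label per version and example
    mean_probs = stacked.mean(axis=0)                 # averaged probabilities used for tie-breaking
    final = []
    for i in range(votes.shape[1]):
        counts = Counter(votes[:, i])
        best = max(counts.values())
        tied = [label for label, c in counts.items() if c == best]
        # Among tied labels, pick the one with the highest averaged probability.
        final.append(max(tied, key=lambda label: mean_probs[i, label]))
    return np.array(final)
      </preformat>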
    </sec>
    <sec id="sec-15">
      <title>5 PARAMETER TUNING AND ERROR ANALYSIS</title>
      <p>
        For training the proposed model, hyper-parameters such
as the number of filters for each convolutional layer (among {256, 512, 1024}),
the dropout rate (among {0.1, 0.2, 0.3, 0.4, 0.5})
and the regularization parameter for the convolutional and fully-connected
layers (among {1×10⁻⁶, 3×10⁻⁶, 1×10⁻³, 3×10⁻³,
1×10⁻¹}) are tuned using a hold-out set from the training data.
We adopt the same combination of window sizes 3, 4, 5 as
was used in Kim-CNN [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and in addition we add a new
window spanning only one word, which was helpful for
predicting the taxonomy correctly from a few selected single words.
We select the architecture with the best performance. The
adopted hyper-parameters for the Kim-CNN model are as
follows:
• Number of filters: 1024
• Window sizes of Kim-CNN: {1, 3, 4, 5}
• Dropout rate: 0.5
• Regularization parameter: 3×10⁻⁶
In the tuning process, we first fix the number of filters to
256 and tune the other parameters. Then we increase the
number of filters to 512 and 1024, and the latter shows
better performance. We also notice that a higher dropout
rate of 0.5 helps to obtain the best performance. Besides, our
model is quite sensitive to the regularization parameter in
terms of convergence speed. Once we increase it from
3×10⁻⁶ to 1×10⁻³, the model converges much more slowly
with no increase in performance.
      </p>
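      <p>The tuning loop can be sketched as below; build_and_train() and evaluate() are hypothetical stand-ins for training a Kim-CNN with one configuration and scoring it on the hold-out set, not functions from our implementation.</p>
      <preformat>
# Hedged sketch of grid search over the hyper-parameters listed above using a hold-out set.
import itertools

N_FILTERS = [256, 512, 1024]
DROPOUT = [0.1, 0.2, 0.3, 0.4, 0.5]
L2_REG = [1e-6, 3e-6, 1e-3, 3e-3, 1e-1]

def build_and_train(**config):          # hypothetical stand-in for training a Kim-CNN with one configuration
    return config

def evaluate(model, holdout):           # hypothetical stand-in for accuracy on the hold-out set
    return 0.0

holdout_set = None                      # placeholder for the held-out portion of the training data

best_score, best_config = 0.0, None
for filters, dropout, reg in itertools.product(N_FILTERS, DROPOUT, L2_REG):
    model = build_and_train(filters=filters, dropout=dropout, l2=reg, window_sizes=(1, 3, 4, 5))
    score = evaluate(model, holdout_set)
    if score > best_score:
        best_score, best_config = score, (filters, dropout, reg)
print(best_config)   # the adopted setting was 1024 filters, dropout 0.5, regularization 3e-6
      </preformat>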
      <p>Our error analysis shows that the classifier works well
for categories whose sample sizes are large, for example more
than 1000 cases, and not so well for smaller categories. Figure
2 shows the performance on a random validation dataset (20
percent of the training data) using an ensemble of models trained
with over-sampling, threshold moving and error-correcting
output coding techniques. The result inspires us to adopt
various sampling strategies aiming to increase the sample
size for those small categories.</p>
    </sec>
    <sec id="sec-16">
      <title>6 RESULTS AND DISCUSSION</title>
      <p>The models are first trained on 80% of the given 800K
training samples and validated on the remaining 20% of the data to tune
the hyper-parameters. Then the models are retrained with
fixed hyper-parameters, following an early-stopping strategy
during training, using all 800K given samples, and tested on the 200K
samples with unknown labels (Table 3). We achieve good
performance, with an F1 score of 0.8295.</p>
      <p>We observe that feature engineering is particularly
important for further improving performance with CNNs. As we
added more relevant but somewhat different features, for
example the NER features, the performance improved
accordingly. We also observe that there are considerable
performance differences among the different models. The
performance of the model based on character-level embeddings
is not as good as that of the others. However, it helps
to improve the overall performance of the ensemble model.</p>
      <p>The threshold moving method is very helpful for increasing
the precision of individual models, which is critical in the final
ensembling. The oversampling and ECOC algorithms
add additional randomness and improve the performance of
the ensemble model to a certain extent.</p>
    </sec>
    <sec id="sec-17">
      <title>ACKNOWLEDGMENTS</title>
      <p>The authors would like to thank the organizers of the SIGIR 2018
eCom Data Challenge (Rakuten Institute of Technology Boston
(RIT-Boston)) for their support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Rohit</given-names>
            <surname>Babbar</surname>
          </string-name>
          , Ioannis Partalas, Eric Gaussier, and
          <string-name>
            <surname>Massih R Amini</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>On flat versus hierarchical classification in large-scale taxonomies</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          . 1824-
          <fpage>1832</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Mariusz</given-names>
            <surname>Bojarski</surname>
          </string-name>
          ,
          <source>Davide Del Testa</source>
          , Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal,
          <string-name>
            <surname>Lawrence D Jackel</surname>
            , Mathew Monfort, Urs Muller,
            <given-names>Jiakai</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>End to end learning for self-driving cars</article-title>
          .
          <source>arXiv preprint arXiv:1604.07316</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ali</given-names>
            <surname>Cevahir</surname>
          </string-name>
          and
          <string-name>
            <given-names>Koji</given-names>
            <surname>Murakami</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Large-scale Multi-class and Hierarchical Product Categorization for an E-commerce Giant</article-title>
          .
          <source>In Proceedings of COLING</source>
          <year>2016</year>
          ,
          <source>the 26th International Conference on Computational Linguistics: Technical Papers</source>
          .
          <fpage>525</fpage>
          -
          <lpage>535</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jianfu</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>David</given-names>
            <surname>Warren</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Cost-sensitive learning for large-scale hierarchical classification</article-title>
          .
          <source>In Proceedings of the 22nd ACM international conference on Conference on information &amp; knowledge management. ACM</source>
          ,
          <volume>1351</volume>
          -
          <fpage>1360</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Pradipto</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <surname>Yandi Xia</surname>
            , Aaron Levine, Giuseppe Di Fabbrizio, and
            <given-names>Ankur</given-names>
          </string-name>
          <string-name>
            <surname>Datta</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Large-scale taxonomy categorization for noisy product listings</article-title>
          .
          <source>In Big Data (Big Data)</source>
          ,
          <source>2016 IEEE International Conference on. IEEE</source>
          ,
          <fpage>3885</fpage>
          -
          <lpage>3894</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Thomas</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Dietterich</surname>
            and
            <given-names>Ghulum</given-names>
          </string-name>
          <string-name>
            <surname>Bakiri</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Solving Multiclass Learning Problems via Error-correcting Output Codes</article-title>
          .
          <source>J. Artif. Int. Res. 2</source>
          ,
          <issue>1</issue>
          (Jan.
          <year>1995</year>
          ),
          <fpage>263</fpage>
          -
          <lpage>286</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Rayid</given-names>
            <surname>Ghani</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Using Error-Correcting Codes for Text Classification</article-title>
          .
          <source>In Proceedings of the Seventeenth International Conference on Machine Learning (ICML '00)</source>
          . Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <fpage>303</fpage>
          -
          <lpage>310</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Jung-Woo</surname>
            <given-names>Ha</given-names>
          </string-name>
          , Hyuna Pyo, and
          <string-name>
            <given-names>Jeonghee</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Large-scale item categorization in e-commerce using multiple recurrent neural networks</article-title>
          .
          <source>In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM</source>
          ,
          <volume>107</volume>
          -
          <fpage>115</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Text categorization with support vector machines: Learning with many relevant features</article-title>
          .
          <source>In European conference on machine learning</source>
          . Springer,
          <fpage>137</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Nal</surname>
            <given-names>Kalchbrenner</given-names>
          </string-name>
          , Edward Grefenstette, and
          <string-name>
            <given-names>Phil</given-names>
            <surname>Blunsom</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A convolutional neural network for modelling sentences</article-title>
          .
          <source>arXiv preprint arXiv:1404.2188</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Yoon</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Diederik</surname>
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Kingma</surname>
            and
            <given-names>Jimmy</given-names>
          </string-name>
          <string-name>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          .
          <source>CoRR abs/1412</source>
          .6980 (
          <year>2014</year>
          ). arXiv:
          <volume>1412</volume>
          .
          <fpage>6980</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Aris</surname>
            <given-names>Kosmopoulos</given-names>
          </string-name>
          , Georgios Paliouras, and
          <string-name>
            <given-names>Ion</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Probabilistic cascading for large scale hierarchical classification</article-title>
          .
          <source>arXiv preprint arXiv:1505.02251</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Quoc</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Distributed Representations of Sentences and Documents</article-title>
          .
          <source>In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32 (ICML'14)</source>
          . JMLR.org, II-1188 - II-1196.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>David</surname>
            <given-names>D</given-names>
          </string-name>
          <string-name>
            <surname>Lewis</surname>
            and
            <given-names>Marc</given-names>
          </string-name>
          <string-name>
            <surname>Ringuette</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>A comparison of two learning algorithms for text categorization</article-title>
          .
          <source>In Third annual symposium on document analysis and information retrieval</source>
          , Vol.
          <volume>33</volume>
          .
          <fpage>81</fpage>
          -
          <lpage>93</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Jingzhou</surname>
            <given-names>Liu</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei-Cheng</surname>
            <given-names>Chang</given-names>
          </string-name>
          , Yuexin Wu, and
          <string-name>
            <given-names>Yiming</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Deep Learning for Extreme Multi-label Text Classification</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM</source>
          ,
          <volume>115</volume>
          -
          <fpage>124</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Christopher</surname>
            <given-names>Manning</given-names>
          </string-name>
          , Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and
          <string-name>
            <surname>David McClosky</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
          .
          <source>In Proceedings of 52nd annual</source>
          <article-title>meeting of the association for computational linguistics: system demonstrations</article-title>
          .
          <volume>55</volume>
          -
          <fpage>60</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Mateusz</given-names>
            <surname>Buda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Atsuto</given-names>
            <surname>Maki</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Maciej A.</given-names>
            <surname>Mazurowski</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A systematic study of the class imbalance problem in convolutional neural networks</article-title>
          .
          <source>arXiv:1710.05381</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Andrew</surname>
            <given-names>McCallum</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kamal</given-names>
            <surname>Nigam</surname>
          </string-name>
          , et al.
          <year>1998</year>
          .
          <article-title>A comparison of event models for naive bayes text classification</article-title>
          .
          <source>In AAAI-98 workshop on learning for text categorization</source>
          , Vol.
          <volume>752</volume>
          .
          <string-name>
            <surname>Citeseer</surname>
          </string-name>
          ,
          <volume>41</volume>
          -
          <fpage>48</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Tomas</surname>
            <given-names>Mikolov</given-names>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Distributed Representations of Words and Phrases and Their Compositionality</article-title>
          .
          <source>In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'13)</source>
          . Curran Associates Inc., USA,
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Pazzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Merz</surname>
          </string-name>
          , Patrick Murphy, Kamal Ali, Timothy Hume, and
          <string-name>
            <given-names>Clifford</given-names>
            <surname>Brunk</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Reducing misclassification costs</article-title>
          .
          <source>In Machine Learning Proceedings 1994. Elsevier</source>
          ,
          <volume>217</volume>
          -
          <fpage>225</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Radim</given-names>
            <surname>Řehůřek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Petr</given-names>
            <surname>Sojka</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          .
          <source>In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA</source>
          , Valletta, Malta,
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Dan</surname>
            <given-names>Shen</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jean-David Ruvini</surname>
            ,
            <given-names>Rajyashree</given-names>
          </string-name>
          <string-name>
            <surname>Mukherjee</surname>
            , and
            <given-names>Neel</given-names>
          </string-name>
          <string-name>
            <surname>Sundaresan</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A study of smoothing algorithms for item categorization on e-commerce sites</article-title>
          .
          <source>Neurocomputing</source>
          <volume>92</volume>
          (
          <year>2012</year>
          ),
          <fpage>54</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Dan</surname>
            <given-names>Shen</given-names>
          </string-name>
          , Jean David Ruvini,
          <string-name>
            <given-names>Manas</given-names>
            <surname>Somaiya</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Neel</given-names>
            <surname>Sundaresan</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Item categorization in the e-commerce domain</article-title>
          .
          <source>In Proceedings of the 20th ACM international conference on Information and knowledge management. ACM</source>
          ,
          <year>1921</year>
          -
          <fpage>1924</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Carlos</surname>
            <given-names>N</given-names>
          </string-name>
          <string-name>
            <surname>Silla and Alex A Freitas</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>A survey of hierarchical classification across different application domains</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          <volume>22</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          (
          <year>2011</year>
          ),
          <fpage>31</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Xiang</surname>
            <given-names>Zhang</given-names>
          </string-name>
          ,
          <source>Junbo Zhao, and Yann LeCun</source>
          .
          <year>2015</year>
          .
          <article-title>Character-level Convolutional Networks for Text Classification</article-title>
          .
          <source>In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1 (NIPS'15)</source>
          . MIT Press, Cambridge, MA, USA,
          <fpage>649</fpage>
          -
          <lpage>657</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Ye</surname>
            <given-names>Zhang</given-names>
          </string-name>
          , Stephen Roller, and
          <string-name>
            <given-names>Byron</given-names>
            <surname>Wallace</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>MGNC-CNN: A simple approach to exploiting multiple word embeddings for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1603.00968</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Ye</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Byron</given-names>
            <surname>Wallace</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1510.03820</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Zhi-Hua Zhou</surname>
          </string-name>
          and
          <string-name>
            <surname>Xu-Ying Liu</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem</article-title>
          .
          <source>IEEE Trans. on Knowl. and Data Eng</source>
          .
          <volume>18</volume>
          ,
          <issue>1</issue>
          (Jan.
          <year>2006</year>
          ),
          <fpage>63</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>