<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Large-Scale Taxonomy Problem: a Mixed Machine Learning Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Quentin Labernia</string-name>
          <email>quentin@dais.is.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toshiyuki Oike</string-name>
          <email>oike@dais.is.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yashio Kabashima</string-name>
          <email>kabashima@dais.is.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kohei Asano</string-name>
          <email>asano@dais.is.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Takeshi Tokuyama</string-name>
          <email>tokuyama@dais.is.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michimasa Irie</string-name>
          <email>irie@dais.is.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jinhee Chun</string-name>
          <email>jinhee@dais.is.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tohoku University</institution>
          ,
          <addr-line>GSIS</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>The Rakuten Data Challenge suggests tackling the Large-Scale Taxonomy Challenge. Given a large amount of product titles and the category paths leading to these products, we would like to predict the category path of a given product based only on its title. The provided paths are structured as a forest of 14 trees. The learning process is split into two steps: we first retrieve the tree the input belongs to and then handle the category path. We describe the data embedding, which represents an important task in this challenge, and then introduce the so-called two step architecture. The original idea is based on a deep neural network model. We also introduce the actual method as a modification of the second step, since the original second step is not efficient enough. This last technique makes use of multiple sets of random forest classifiers to navigate inside each tree.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Deep Neural Network</kwd>
        <kwd>Random Forest</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>As part of the SIGIR 2018 workshop, we tackle in this paper the
Rakuten Data Challenge. E-commerce websites handle a large
variety of data and their products are categorized in some way: when a
new product is given to the system, one has to compute its category.
For instance, an inkjet printer is an office product, an electronic
office good, and of course a kind of printer. The Rakuten Data Challenge
addresses this problem in the following terms: given only the
title of a product, try to predict its path through some categories
– as usually shown to the customer on most e-commerce websites.
Each path starts out from one of the root categories and reaches the
actual product’s category, going through some intermediate
levels. For example, the product ’Replacement Viewsonic VG710 LCD
Monitor 48Watt AC Adapter 12V 4A’ is associated with the path
3292&gt;114&gt;1231. We refer to this problem as large-scale taxonomy
classification.</p>
      <p>The Rakuten Data Challenge focuses on machine learning techniques,
as the amount of data is large – about one million instances – and
labels correspond to paths in a tree structure. Considering these
characteristics, it seems more natural to tackle this problem
using machine learning techniques rather than typical data mining
methodologies like FCA – Formal Concept Analysis – or logical
rule-based approaches. Such approaches indeed lead to high
computational complexity and could simply fail because of the inadequate
input and output data representation.</p>
      <p>Up to now, datasets which are representative of the practical
usage of e-commerce websites have not been made public, and the
Rakuten Data Challenge provides such a dataset. However, it brings
with it serious challenges. Among those, the large size of the dataset
represents one of the main issues. This justifies the large-scale
classification appellation. One indeed needs to handle around 0.8 million
instances, each of them consisting of a textual feature – the title of the
product – and a categories’ path description. The output label consists
of an ordered list of categories. We are dealing with a classification
problem and each distinct list is a different label. Considering the
test dataset, we come up with more than 3000 distinct list
configurations. Labels are also unbalanced with regard to the input instances,
and we put emphasis on that point in section 5. Some categories like
“4015” host a lot of products whereas some others do not. Handling
very small classes along with over-represented ones represents
another big issue. The last considered point is related to the noisy
semantics of each instance’s title. We are referring here to the fact
that the title of a product can include the name of the product or a
description of it which is more or less precise. Although two products
may share some words, it might be the case that they are actually very
different and so are not categorized under the same label. For instance,
a laptop named after a certain fruit and a book which
talks about this fruit both share common words. Getting rid of such
mistakes is a fundamental and necessary condition for our model
to predict labels in the most efficient way.</p>
      <p>In this paper, we introduce two methods to tackle the challenge.
The original method refers to our first idea, whose results are not
convincing. Hence, we also introduce the actual method, whose
differences with the original one are discussed in section 4.
This last method has been used to get the Rakuten Challenge results
shown in section 5.</p>
      <p>As for data preprocessing, we give a detailed explanation of how
we address most of the previously mentioned problems in a
logical way. Once the input and output data are formatted in the
right way, we define the architecture of the models we built to
perform the prediction. Both methods are introduced. Based on
this architecture, we outline the experiment strategy. Finally, we
compare the differences between the two strategies and provide critical
explanations of our work.</p>
    </sec>
    <sec id="sec-2">
      <title>DATA OVERVIEW</title>
      <p>The provided data is composed of one million instances, split
between the training dataset (80%) and test dataset (20%). A pair
(xi , yi ) denotes the instance of index i. A sample of data picked up
from the training dataset is shown in table 1.</p>
      <p>The input data – title xi – is composed of one feature formatted
as a string of UTF-8 characters. It corresponds to the title of the
product. A title xi has a specific length and potentially contains
special characters such as the trademark character.</p>
      <p>The taxonomy is a tree, as stated in the Rakuten Data Challenge
rules. After gathering all the nodes – each actually corresponds to
one category – we end up with a forest of 14 trees. Each tree corresponds
to a very general category of products. It is obviously possible to link
the root nodes of all the trees together so as to form a unique tree.
However, we chose not to perform such an operation since it makes
more sense to build specialized models, each managing one of these
general categories at a time. Indeed, the classification logic between
electronic products and make-up goods – for instance – could
potentially be totally different. As a result, we consider the existence of
T1, . . . , T14, the trees we build from the given learning dataset only.
A node of a tree Ti is a category c ∈ N. This category corresponds to
a node in the forest, which means that such a category c is unique
across all the trees. Notice that tree T14 is degenerate in the sense
that it contains only one node.</p>
      <p>The output data, or labels, are tuples of categories. One tuple
yi = (c1, . . . , cDk ) describes a path of categories – associated with
the instance xi –, that is: c1 is the root node of one tree Tk , c2 is one of
c1’s child nodes, and so on until we reach cDk , the end of the path.
After running an analysis on the provided datasets, we found that cDk
is actually always a leaf, thus we consider the following hypothesis
to always be true: for any instance (xi , yi ), the last component of
yi , cDk , is a leaf of a tree. A predicted tuple y˜i matches its ground
truth counterpart yi if and only if both tuples are equal – a perfect
matching between the two. Put another way, our goal is to find
a tuple with the right number of components, each of which
has to match the provided ground truth.</p>
      <p>We do not proceed to any data augmentation, either for data
processing or model usage. This paper focuses on getting the best
results while relying only on the title as input feature. We find it
more interesting to look at how we could obtain consistent results
with semantically noisy title features, as recalled in the introduction.
Let us fix an important notation used hereafter: any variable
written with a tilde is a value predicted by our models. The next
section introduces the data preprocessing based on the previously
stated hypothesis.</p>
    </sec>
    <sec id="sec-2-1">
      <title>DATA PREPROCESSING</title>
      <p>Let us first take a look at the input feature transformation process.
We describe two steps corresponding to the bag of words and word2vec
processes. Both representations are further used in section 4, so
we introduce them one at a time as follows.</p>
      <p>3.1.1 Bag of words. A title xi is a string of characters whose
encoding is UTF-8. We split the title into words using space
characters. The integer qi denotes the number of words in the title
xi . We define a binary vector representation of each word and thus
create a dictionary from the previously split words. The size ω of the
vocabulary corresponds to the number of distinct words kept from
the training dataset. We drop the least and “most” frequent words:
a word is dropped if it appears fewer than ten times, or if its
frequency is high and about the same across all the trees – such
words are not discriminative across different categories. When dealing with an
instance xi , any word which does not belong to the dictionary is
simply dropped. Any word wi,j – word j of the title xi – can then be
encoded into a binary vector of size ω. Notice that we do not
proceed to any lemmatization of the words. We refer to the
bag-of-words vector of the title xi as xi.</p>
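      <p>As an illustration, the dictionary construction and binary word encoding described above can be sketched as follows – a minimal sketch in Python with hypothetical function names; the cross-tree frequency filter is omitted for brevity:</p>
      <preformat>
```python
from collections import Counter

def build_vocab(titles, min_count=10):
    # Count word occurrences over the whole training set
    # (titles are split on space characters, as in the paper).
    counts = Counter(w for t in titles for w in t.split(" "))
    # Drop the least frequent words: keep a word only if it
    # appears at least `min_count` times (ten in the paper).
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(vocab)}

def encode_word(word, vocab):
    # One-hot binary vector of size ω = len(vocab);
    # out-of-vocabulary words are simply dropped (None).
    if word not in vocab:
        return None
    v = [0] * len(vocab)
    v[vocab[word]] = 1
    return v
```
      </preformat>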
      <p>
        3.1.2 word2vec. The next step is dimensionality reduction of
each word in the title xi . To do so, we use the well-known word2vec
technique[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We set each resulting word vector to be an element of R2000.
Since titles are not of fixed size, we need either to build a model
which can handle such variable-length inputs, or to transform the
data into a predefined fixed format. Since our words are embedded
in a sort of semantic space using word2vec, we state that it makes
sense to sum all the word vectors of one title to obtain a
significant meaning. Considering titles of Internet marketplaces,
word order usually does not matter: for instance, “red shoes”
and “shoes / red” should be semantically quite close.
Moreover, it gives us an elegant way to take into account all the
words of the title while producing a fixed final size. To recap, given
a title xi , we first embed its binary word vectors wi,1, . . . , wi,qi in
a semantic space and obtain ui,1, . . . , ui,qi . We then aggregate all
the words using the sum ti = Σ1⩽j⩽qi ui,j . We
feed the machine learning model with these ti .
      </p>
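      <p>The summation-based title embedding can be sketched as follows – a minimal sketch, assuming a precomputed word2vec table mapping each word to a fixed-size vector (2000 dimensions in the paper; 3 here for illustration):</p>
      <preformat>
```python
def embed_title(title, word_vectors):
    # Sum the word2vec vectors of all in-vocabulary words of the
    # title; out-of-vocabulary words are dropped. The result is a
    # fixed-size, word-order-invariant representation ti.
    dim = len(next(iter(word_vectors.values())))
    t = [0.0] * dim
    for w in title.split(" "):
        if w in word_vectors:
            for j, x in enumerate(word_vectors[w]):
                t[j] += x
    return t
```
      </preformat>
      <p>With such a representation, “red shoes” and “shoes / red” map to the same vector, reflecting the order-invariance argued above.</p>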
    </sec>
    <sec id="sec-3">
      <title>Output labels</title>
      <p>
        This section specifically concerns the original method. Labels
are originally formatted like “c1&gt;c2&gt;. . .&gt;cDk ” as mentioned in
section 2. However, such a format cannot be efficiently used as an output
label because it would have to be parsed. Each leaf is reached by
only one path, so it is theoretically possible to treat the
prediction task as multiclass classification – an instance is associated with
one and only one leaf, which also corresponds to only one path. By
doing so, however, we would not take any advantage of the tree structure
hypothesis. Also, some paths contain a very small number
of instances. Therefore, we consider the tree structure and suggest
another way of performing prediction. We choose to transform
the raw labels yi into binary vectors zi . Our goal is to describe
the path from the root node to one leaf by giving the direction
we take at each level. An easy way to proceed is to first sort all the
nodes by level, then encode which node we go through using a
binary vector. For instance, the path 1&gt;3&gt;8 shown in figure 1 could
be written as (0, 1, 0, 0, 0, 0, 1), where the first two components
encode level 1 and the remaining five encode level 2.
The first level group tells us to keep going through node 3
– within a level, we use the category integer representation to order
nodes. The first subgroup of the second level is filled with zeros
because it corresponds to the nodes reachable from node 2.
Since it is mandatory to go through the root node, we ignore it and
start out from the second level of the tree. If the tree is big, this implies
very large labels. In order to reduce the size of such labels, we use a
simple trick, a reworked idea of the anchor representation used for
instance in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This idea is used in the case of image recognition,
where the number of contained labels can differ between images.
      </p>
      <p>A label yi belongs to exactly one of the 14 trees, thus we only
consider the kth tree. The height of this tree is Dk as defined in section
2. We also recall that the depth of a node is its distance to the root
node. We define Lk,d to be the set of nodes of depth d and the set
Mk,d = {|C| | C ⊆ Lk,d ∧ every node in C has the same parent}
with d &gt; 0. The set Mk,d gathers the numbers of children of the
nodes in the level d − 1. The compressed vector zi is of length
ri = Σ1⩽d⩽Dk max Mk,d and is composed of Dk groups, the group
of level d having size Gk,d = max Mk,d .
Looking at figure 1 as an illustration, each group of components
has the size of the biggest box at the considered level. After
choosing a node g in the path, we can only reach its children child(g).
Therefore, the way we compress is by considering child(g) – one box
in figure 1 – instead of Lk,d – the whole layer. However, depending
on g, the number of elements in child(g) might differ, which is why
we encode a level d using Gk,d components – the biggest box in
one level. This ensures we can always encode child(g) while keeping
a fixed size. Let us fix an ordering for each child(g). Since we only
pick exactly one node per level, we set the nth component of the group to
1 if and only if we select the nth node of child(g); otherwise, we
put 0. That is, the previous 1&gt;3&gt;8 path of figure 1 can be rewritten
as (0, 1, 0, 1, 0), where the first group (level 1, Gk,1 = 2) is (0, 1)
and the second group (level 2, Gk,2 = 3) is (0, 1, 0).
By doing so, we drastically reduce the size of the labels and obtain
a fixed representation given the tree k: indeed, recall from section
2 that given any tree k, all its labels share the same length. Although
this representation depends on the tree we choose, it is not a
problem here since we build a specific model for each tree.</p>
      <p>We obviously need to first compute for each tree its level
representation, that is Gk,1, . . . , Gk,Dk . Then it is possible to transform
any yi into its corresponding zi . Conversely, it allows us to get the
raw format label y˜i back from the predicted z˜i .</p>
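      <p>The compressed encoding described above can be sketched as follows – a minimal sketch with hypothetical names, where `children` maps each node to the ordered list of its children (ordered by category integer, as in the text):</p>
      <preformat>
```python
def encode_path(path, children):
    # Encode a path (c1, ..., cD) from the root of one tree as the
    # compressed binary vector z: level d gets a group of size
    # G_d = max number of children among the nodes of depth d-1,
    # and the position of the chosen child within child(parent)
    # is set to 1.
    z = []
    level_nodes = [path[0]]  # nodes of the current depth
    for parent, node in zip(path, path[1:]):
        g = max(len(children.get(n, [])) for n in level_nodes)
        group = [0] * g
        group[children[parent].index(node)] = 1
        z.extend(group)
        # candidates for the next level are all children of this one
        level_nodes = [c for n in level_nodes for c in children.get(n, [])]
    return z
```
      </preformat>
      <p>On a tree where node 1 has children (2, 3), node 2 has three children, and node 3 has children (7, 8), the path 1&gt;3&gt;8 is encoded as (0, 1, 0, 1, 0), matching the worked example above.</p>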
    </sec>
    <sec id="sec-4">
      <title>MODELS DESCRIPTION</title>
      <p>This section describes the two different methods recalled in section 1:
our original idea along with the practical method. Both are based
on a two step model architecture, since we deal with the 14 trees
separately and do not gather them into a unique one. The process
is to first predict the tree Tk the current input belongs to, and then
find the entire path in that selected tree as the second step. The first
step model is denoted by A, while the second step ones are
written B1, . . . , B13. Step one is common to both methodologies and
only the latter part changes between the two. Sections 4.1 and 4.2
give the respective explanations for steps one and two.</p>
    </sec>
    <sec id="sec-5">
      <title>First step: classification over the trees</title>
      <p>Let us consider an instance xi . We feed the model A with ti – recall
that ti is the preprocessed title coming from xi . Our goal is to first
determine to which tree this instance belongs. For that purpose,
we define the model A to be a neural network whose output is a
class between 1 and 14. If the instance xi is classified as k ∈ ⟦1; 14⟧,
it means the instance is part of the tree Tk , and we thus further consider
the model Bk as the second stage process. We use a five-layer fully
connected neural network. Since we classify over 14 trees, the model’s
output size is also 14. The loss function is softmax cross entropy, since
any input title is associated with exactly one tree
– the softmax operation transforms the raw output into a
probability vector.</p>
    </sec>
    <sec id="sec-6">
      <title>Second step: prediction of the path</title>
      <p>4.2.1 Original Method. Each of the models B1, . . . , B13 is either
a shallow neural network or a deep neural network. All the models
are independent of each other. Depending on the characteristics of a
tree, we rely on different machine learning methods.</p>
      <p>First, when trees have simple structures, we use shallow
feed-forward fully connected neural networks, each composed
of three layers. The model outputs zi . A label zi contains multiple 1s,
so it does not represent a single probability distribution; rather, each
group gi,1, . . . , gi,Dk of zi is a probability distribution itself. We therefore
decompose the loss function in such a way that we apply the
softmax cross entropy over each group, and we finally aggregate
the resulting values using summation. If the groups of zi
are of respective lengths Gk,1, . . . , Gk,Dk ,
then the loss function Lk for the model Bk of tree Tk is given by:
Lk (z˜i , zi ) = Σ1⩽d⩽Dk cross entropy(softmax(g˜i,d ), gi,d )</p>
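      <p>The decomposed loss can be sketched numerically as follows – a minimal sketch in plain Python with hypothetical helper names; `raw_output` and `target` are the concatenated per-level groups:</p>
      <preformat>
```python
import math

def softmax(v):
    # numerically stable softmax over one group
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def level_loss(raw_output, target, group_sizes):
    # L_k = sum over levels d of cross_entropy(softmax(g~_d), g_d),
    # where the groups have sizes G_{k,1}, ..., G_{k,D_k}.
    loss, start = 0.0, 0
    for g in group_sizes:
        probs = softmax(raw_output[start:start + g])
        hot = target[start:start + g]
        loss += -sum(t * math.log(p) for t, p in zip(hot, probs))
        start += g
    return loss
```
      </preformat>
      <p>For a uniform raw output and the label (0, 1, 0, 1, 0) with group sizes (2, 3), the loss is log 2 + log 3.</p>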
      <p>
        Next, we consider hard tree structures, that is, trees whose labels
zi have a high dimension. For this purpose, we rely on deep neural
networks, inspired by image processing techniques. Since we deal
with fixed-size inputs, we do not consider recurrent networks and
instead focus on deep convolutional neural networks. Let us recall
that inputs ti are embedded in a 2000 dimensional semantic space
created using word2vec[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and then aggregating all the words in the
title using summation. We want to take advantage of convolution in
order to retrieve expressive features before applying classical fully
connected layers at the end of the network. To that purpose we
suggest the use of ResNet[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in its 50-layer flavor. Such an architecture
is based on residual blocks composed of three convolutional layers
and a shortcut link between the input and output of the block. A
more detailed view of the architecture is shown in figure 3. It allows us
to enjoy the expressiveness of a deep architecture and is known to
be easier to train than other deep convolutional models like VGG[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
or AlexNet[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The loss function is the same as the former shallow fully
connected neural networks’ one.
      </p>
      <p>4.2.2 Actual Method. As the second step, we construct hierarchical
models. Let us recall that step one provides us with the tree k the current
input belongs to. Put another way, we know the first category
c1 of the path yi = (c1, . . . , cDk ). In order to predict the full path,
we construct 13 hierarchical classifiers B1, . . . , B13 whose task is
to predict the rest of the path, c2 to cDk , one step at a time. Notice that
as for tree 14, the answer is obvious since T14 has only one
node.</p>
      <table-wrap id="tbl2">
        <label>Table 2</label>
        <caption>
          <p>(Original method) Gathered specifications for the neural network models. We use the notation Bs = B1, . . . , B13.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Hyperparameter</th><th>Value</th><th>Models</th></tr>
          </thead>
          <tbody>
            <tr><td>Model topology</td><td>ResNet50[<xref ref-type="bibr" rid="ref2">2</xref>] or Shallow NN</td><td>All</td></tr>
            <tr><td>Regularization</td><td>ℓ2-regularization</td><td>All</td></tr>
            <tr><td rowspan="2">Loss function</td><td>Softmax cross entropy</td><td>A</td></tr>
            <tr><td>∀k ∈ ⟦1, 13⟧, Lk</td><td>Bs</td></tr>
            <tr><td>Optimizer</td><td>Adam[<xref ref-type="bibr" rid="ref3">3</xref>]</td><td>All</td></tr>
            <tr><td>Batch scheme</td><td>Mini-batches (⩽ 32)</td><td>All</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <fig id="fig3">
        <caption>
          <p>Figure 3: ResNet50 architecture. The input goes through a convolutional layer with relu, then 16 residual bottleneck blocks – each composed of three convolutional layers with relu activations and a shortcut link – and finally a fully connected layer producing the output.</p>
        </caption>
      </fig>
      <sec id="sec-6-7">
        <title>Building the hierarchical model</title>
        <p>We build the second step model Bk of the tree k as follows. It is
based on a set of models Bk,n where n is a parent node of
the tree k. The goal of each classifier Bk,n is to output the child
node corresponding to the input title. Starting from the root node
of the tree k, we can navigate in the tree by predicting the nodes
of the path yi one at a time. Theoretically, this strategy seems
computationally expensive since the number of parent nodes
increases exponentially with the layers. However, the maximum
number of nodes is less than 1000, which means that we can afford
such a strategy.</p>
        <p>
          Each classifier Bk,n is fed with the bag-of-words title xi and outputs
the child node of n in the tree k which corresponds to our input.
We use random forest[
          <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
          ] algorithm with 150 trees – this
hyperparameter is fixed for all Bk,n . As for the implementation, we
rely on the open-source library scikit-learn (http://scikit-learn.org/).
Algorithm 1 (HierarchicalRandomForest) shows how to build the model Bk . When classifying
a title xi in tree k, we call PredictionProcedure(k, xi , ⊤k ) as
described in algorithm 2. The symbol ⊤k refers to the root node of the
tree k. This simple algorithm consists in starting from the root node
and going deeper one level at a time by choosing the corresponding
classifier inside the set Bk .
        </p>
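        <p>The prediction procedure described above can be sketched as follows – a minimal sketch where `classifiers` maps each parent node of the tree to any callable predictor returning one of its children (a trained random forest in the actual method):</p>
        <preformat>
```python
def predict_path(root, classifiers, title):
    # Start from the root node and go one level deeper at a time,
    # querying the classifier attached to the current node,
    # until a leaf (a node with no classifier) is reached.
    path = [root]
    node = root
    while node in classifiers:
        node = classifiers[node](title)
        path.append(node)
    return tuple(path)
```
        </preformat>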
      </sec>
      <sec id="sec-6-8">
        <title>Results</title>
        <p>In this section, we report results of the SIGIR Rakuten workshop
official evaluation. As stated in section 2, 0.8 million train data and
0.2 million test data are provided. As for the test data, the ground truth is
not given. A leaderboard system shows the results at each stage
of the evaluation process. The computed evaluation metrics are
weighted precision, recall, and f1 over exact CategoryId path
matches on the test set, and the ranking is based on the f1-score.
Table 3 shows the best result for each of the original and actual methods.
The Rakuten Data Challenge is split into two evaluation stages. The
first stage corresponds to the evaluation over a subset of the test
dataset and multiple submissions are possible. On the other hand,
the second stage consists of only one evaluation of the method over
the whole test dataset. Both original and actual methods have been
tested during the first stage while only the actual one is used for
the second stage. Table 3 summarizes these evaluations along with
their corresponding evaluation measures. The final ranking of the
method is 20th out of 28 teams.</p>
        <p>It is clear that our original method does not provide satisfactory
results. We suppose the main reasons lie in the complexity of the category
tree structure and the amount of data available for training
across all distinct paths. It has to be said that this original
method’s issue is overfitting. The first consideration is that, given
a tree, only a small amount of training data may be
available. Even for trees with a large amount of data,
the original method fails to provide a good generalization property.
Indeed, going deeper in the tree means having less data available for
training. Since deep neural networks are used in this context, deep
levels cannot be learned properly. This leads to good results
on the training set and a very poor generalization ability. On the
other hand, the actual method provides good performance because
it relies on decision trees – more precisely, the random forest algorithm
– combined with a simple bag-of-words embedding. Predicting
the children of each node of the tree independently with such a technique
gives us assurance that the learning process will lead to good and
comprehensible results even near the leaves.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>CONCLUSION</title>
      <p>In this work, we suggest a two step architecture which tackles the
large-scale taxonomy challenge as part of the Rakuten Data Challenge
at the SIGIR 2018 workshop. While putting the accent on machine
learning methodology, data preprocessing represents a major point
of attention. We create bags of words, then embed each word of
the vocabulary in a semantic space and aggregate the words of a title
using summation. The bag-of-words representation is used for the actual method’s
second step, while the word2vec embedding is used for the rest. We
build a two step architecture to predict the final path associated
with a newly seen instance. The first step model A chooses which
tree we have to consider – for instance the kth –; we then feed
the right second step model – Bk – to output the encoded path.
This second step is split into two different methods: original and
actual. Our original idea relies on a one-shot fixed-scheme encoding
of the whole category path. We rely on several machine learning
techniques: shallow feed-forward fully connected neural networks
and the ResNet architecture. As for the actual method, we make use
of a hierarchical classification model composed of random forest
models. This method provides good performance compared to the
former one, as shown in section 5.</p>
      <p>Because of the imbalanced classes and complex data structures,
our original strategy does not offer good performance. However,
we think that such a method becomes very efficient when having
enough samples and a uniform distribution over the classes. Since
this situation appears in the shallow levels of most of the trees, we
think it could be worthwhile to combine the original method with the
actual one in order to improve overall performance.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGEMENT</title>
      <p>This work was funded by the ImPACT Program of the Council for Science,
Technology and Innovation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Leo</given-names>
            <surname>Breiman</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Random Forests</article-title>
          .
          <source>Machine Learning</source>
          <volume>45</volume>
          ,
          <volume>1</volume>
          (
          <issue>01</issue>
          <year>Oct 2001</year>
          ),
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          . DOI:http://dx.doi.org/10.1023/A:1010933404324
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          .
          <source>CoRR abs/1512.03385</source>
          (
          <year>2015</year>
          ). http://arxiv.org/abs/1512.03385
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          .
          <source>CoRR abs/1412.6980</source>
          (
          <year>2014</year>
          ). http://arxiv.org/abs/1412.6980
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Geoffrey E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>ImageNet Classification with Deep Convolutional Neural Networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          25,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J. C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          (Eds.). Curran Associates, Inc.,
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          . http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Andy</given-names>
            <surname>Liaw</surname>
          </string-name>
          , Matthew Wiener, and others.
          <year>2002</year>
          .
          <article-title>Classification and regression by randomForest</article-title>
          .
          <source>R News 2</source>
          ,
          <issue>3</issue>
          (
          <year>2002</year>
          ),
          <fpage>18</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Wei</given-names>
            <surname>Liu</surname>
          </string-name>
          , Dragomir Anguelov, Dumitru Erhan,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Scott E.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cheng-Yang</given-names>
            <surname>Fu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alexander C.</given-names>
            <surname>Berg</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>SSD: Single Shot MultiBox Detector</article-title>
          .
          <source>CoRR abs/1512.02325</source>
          (
          <year>2015</year>
          ). http://arxiv.org/abs/1512.02325
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Distributed Representations of Words and Phrases and their Compositionality</article-title>
          .
          <source>CoRR abs/1310.4546</source>
          (
          <year>2013</year>
          ). http://arxiv.org/abs/1310.4546
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          .
          <source>CoRR abs/1409.1556</source>
          (
          <year>2014</year>
          ). http://arxiv.org/abs/1409.1556
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>