<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Ensemble of Neural Networks for Multi-label Document Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ladislav Lenc</string-name>
          <email>llenc@kiv.zcu.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel Král</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia</institution>
          ,
          <addr-line>Univerzitní 8, 306 14 Plzeň</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NTIS-New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia</institution>
          ,
          <addr-line>Technická 8, 306 14 Plzeň</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
          <uri>nlp.kiv.zcu.cz</uri>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>1885</volume>
      <fpage>186</fpage>
      <lpage>192</lpage>
      <abstract>
        <p>This paper deals with multi-label document classification using an ensemble of neural networks. The assumption is that different network types can keep complementary information and that the combination of several neural classifiers will bring higher accuracy. We verify this hypothesis by an error analysis of the individual networks. One contribution of this work is thus the evaluation of several network combinations that improve performance over a single network. Another contribution is a detailed analysis of the achieved results and a proposal of possible directions for further improvement. We evaluate the approaches on a Czech ČTK corpus and also compare the results with state-of-the-art approaches on the English Reuters-21578 dataset. We show that the ensemble of neural classifiers achieves competitive results using only very simple features.</p>
      </abstract>
      <kwd-group>
        <kwd>Czech</kwd>
        <kwd>deep neural networks</kwd>
        <kwd>document classification</kwd>
        <kwd>multi-label</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        This paper deals with multi-label document classification by neural networks. Formally, the task is to find a model M that assigns to a document d ∈ D a set of appropriate labels (categories) c ⊆ C, i.e. M : d → c, where D is the set of all documents and C is the set of all possible document labels. Multi-label classification with neural networks is often done by thresholding the output layer [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. It has been shown that both standard feed-forward networks (FNNs) and convolutional neural networks (CNNs) achieve state-of-the-art results on the standard corpora [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
      <p>However, we believe that there is still some room for further improvement, and a combination of classifiers is a natural step forward. Therefore, we combine a CNN and an FNN in this work to gain further improvement in terms of precision and recall. We support the claim that a combination may bring better results by studying the errors of the individual networks. The main contribution of this paper thus consists in the analysis of the prediction errors of the individual networks. We then present the results of several combination methods and show that the ensemble of neural networks brings a significant improvement over the individual networks.</p>
      <p>The methods are evaluated on documents in the Czech language, a representative of highly inflectional Slavic languages with free word order. These properties decrease the performance of the usual methods. We further compare the results of our methods with other state-of-the-art approaches on the English Reuters-21578 dataset (http://www.daviddlewis.com/resources/testcollections/reuters21578/) in order to show their robustness across languages. Additionally, we analyze the final F-measure on document sets divided according to the number of assigned labels in order to improve the accuracy of the presented approach.</p>
      <p>The rest of the paper is organized as follows. Section 2 is a short review of document classification methods with a particular focus on neural networks. Section 3 describes our neural network models and the combination methods. Section 4 deals with experiments realized on the ČTK and Reuters corpora and then analyzes and discusses the obtained results. In the last section, we summarize the experimental results and propose some future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>Document classification is usually based on supervised machine learning: a classifier is trained on an annotated corpus and then assigns class labels to unlabelled documents. Most works use the vector space model (VSM), which generally represents each document as a vector of word occurrences, usually weighted by tf-idf.</p>
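      <p>As an illustration, the tf-idf weighting mentioned above can be sketched as follows (a minimal raw-count variant for clarity; the cited works may use other tf-idf weighting variants):</p>
      <preformat>
```python
import math

def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    tf is the raw term count in a document; idf is log(N / df),
    where df is the number of documents containing the term.
    """
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights
```
      </preformat>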
      <p>
        Several classification methods have been successfully used [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], for instance Bayesian classifiers, maximum entropy and support vector machines. However, the main issue of this task is that the feature space is highly dimensional, which degrades the classification results. Feature selection/reduction [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or a better document representation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] can be used to address this problem.
      </p>
      <p>
        Nowadays, “deep” neural nets outperform the majority of state-of-the-art natural language processing (NLP) methods on several tasks with only very simple features. These include for instance POS tagging, chunking, named entity recognition and semantic role labelling [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Several different topologies and learning algorithms have been proposed. For instance, Zhang et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] propose two convolutional neural nets (CNNs) for ontology classification, sentiment analysis and single-label document classification. They show that the proposed method significantly outperforms the baseline approach (bag of words) on English and Chinese corpora. Another interesting work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] uses pre-trained vectors from word2vec [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] in the first layer. The authors show that the proposed models outperform the state of the art on 4 out of 7 tasks, including sentiment analysis and question classification. Recurrent convolutional neural nets are used for text classification in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The authors demonstrate that their approach outperforms standard convolutional networks on four corpora in the single-label document classification task.
      </p>
      <p>
        On the other hand, traditional feed-forward neural net architectures are used for multi-label document classification rather rarely. These models were more popular before, as shown for instance in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The authors build a simple multi-layer perceptron with three layers (20 inputs, 6 neurons in the hidden layer and 10 neurons in the output layer, i.e. the number of classes), which achieves an F-measure of about 78% on the standard Reuters dataset. Feed-forward neural networks were also used for multi-label document classification in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The authors modified the standard backpropagation algorithm for multi-label learning (BP-MLL), which employs a novel error function. This approach is evaluated on functional genomics and text categorization.
      </p>
      <p>
        A recent study on multi-label text classification was presented by Nam et al. in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The authors build on the assumption that neural networks can model label dependencies in the output layer. They investigate limitations of multi-label learning and propose a simple neural network approach. The authors use the cross-entropy loss instead of a ranking loss for training, and they further employ recent advances in the deep learning field, e.g. rectified linear unit activations and AdaGrad learning with dropout [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. A tf-idf representation of documents is used as the network input. The multi-label classification is handled by thresholding the output layer: each possible label has its own output node, and the final decision is made based on the node's value. The approach is evaluated on several multi-label datasets and reaches results comparable to the state of the art.
      </p>
      <p>
        Another method [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] based on neural networks
leverages the co-occurrence of labels in the multi-label
classification. Some neurons in the output layer capture the
patterns of label co-occurrences, which improves the
classification accuracy. The architecture is basically a
convolutional network and utilizes word embeddings for
initialization of the embedding layer. The method is evaluated
on the natural language query classification in a document
retrieval system.
      </p>
      <p>
        An alternative approach to handling the multi-label
classification is proposed by Yang and Gopal in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The
conventional representations of texts and categories are
transformed into meta-level features. These features are then
utilized in a learning-to-rank algorithm. Experiments on
six benchmark datasets show the abilities of this approach
in comparison with other methods.
      </p>
      <p>
        Another recent work proposes novel features based on unsupervised machine learning [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        A significant amount of work on classifier combination has been done previously. Our approaches are motivated by the review of Tulyakov et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Neural Networks and Combination</title>
      <sec id="sec-3-1">
        <title>3.1 Individual Nets</title>
        <p>We use two individual neural nets with different activation
functions (sigmoid and softmax) in the output layer. Their
topologies are briefly presented in the following two
sections.</p>
        <p>Feed-forward Deep Neural Network (FDNN) We use a Multi-Layer Perceptron (MLP) with two hidden layers (an MLP with a single hidden layer was also evaluated and gave lower accuracy). As the input of our network we use a simple bag of words (BoW): a binary vector in which the value 1 means that the word with the given index is present in the document. The size of this vector, and hence of the input layer, equals the size of the dictionary, which is limited to the N most frequent words. The first hidden layer has 1024 nodes, the second one 512; this configuration was set based on experimental results. The output layer has a size equal to the number of categories |C|. To handle the multi-label classification, we threshold the values of the nodes in the output layer: only the labels with values larger than a given threshold are assigned to the document.</p>
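        <p>The BoW input and the output-layer thresholding described above can be sketched as follows (an illustrative sketch; the threshold value is a placeholder for the one tuned on the development set):</p>
        <preformat>
```python
def bow_vector(document_tokens, dictionary):
    """Binary bag-of-words: 1 if the dictionary word occurs in the document."""
    present = set(document_tokens)
    return [1 if word in present else 0 for word in dictionary]

def assign_labels(output_activations, labels, threshold=0.5):
    """Multi-label decision: keep every label whose output node exceeds the threshold."""
    return [lab for lab, act in zip(labels, output_activations) if act > threshold]
```
        </preformat>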
        <p>
          Convolutional Neural Network (CNN) The input is the sequence of words in the document. We use the same dictionary as in the previous approach; the words are represented by their indexes into the dictionary. The architecture of our network (see Figure 1) is motivated by Kim in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. However, based on our preliminary experiments, we use only one-dimensional (1D) convolutional kernels instead of a combination of several sizes of 2D kernels. The input of our network is a vector of word indexes of length L, where L is the number of words used for the document representation. The issue of variable document size is solved by fixing this value: longer documents are shortened and shorter ones padded. The second layer is an embedding layer which represents each input word as a vector of a given length; the document is thus represented as a matrix with L rows and EMB columns, where EMB is the length of the embedding vectors. The third layer is the convolutional one. We use NC convolution kernels of size K × 1, i.e. we perform 1D convolution over one position in the embedding vector across K input words. The following layer performs max-pooling over the length L − K + 1, resulting in NC 1 × EMB vectors.
        </p>
        <p>The output of this layer is then flattened and connected with the output layer containing |C| nodes. The final result is, as in the previous case, obtained by thresholding the network outputs.</p>
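        <p>The layer shapes described above can be traced with a small sketch (shape bookkeeping only; the values of L, EMB, NC and K below are illustrative, not the tuned ones):</p>
        <preformat>
```python
import numpy as np

def cnn_shapes(L=400, EMB=100, NC=40, K=16, n_classes=37):
    """Trace tensor shapes through the described CNN."""
    doc = np.random.rand(L, EMB)        # embedding layer output: L x EMB
    kernels = np.random.rand(NC, K)     # NC one-dimensional kernels of size K x 1
    # 1D convolution over K consecutive words, per embedding dimension
    conv = np.stack([
        np.array([(doc[i:i + K] * kernels[c][:, None]).sum(axis=0)
                  for i in range(L - K + 1)])
        for c in range(NC)
    ])                                  # NC x (L-K+1) x EMB
    pooled = conv.max(axis=1)           # max-pooling over the length: NC x EMB
    flat = pooled.reshape(-1)           # flattened: NC * EMB values
    out = flat @ np.random.rand(flat.size, n_classes)  # output layer: |C| nodes
    return conv.shape, pooled.shape, out.shape
```
        </preformat>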
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Combination</title>
        <p>We assume that the different nets keep some complementary information which can compensate for recognition errors. We also assume that similar network topologies with different activation functions can bring different information, and thus that every net should have its particular impact on the final classification. Therefore, we consider all the nets as different classifiers to be further combined.</p>
        <p>Two types of combination will be evaluated and
compared. The first group does not need any training phase,
while the second one learns a classifier.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Unsupervised Combination</title>
        <p>The first combination method compensates for the errors of the individual classifiers by computing the average value of their outputs. This value is subsequently thresholded to obtain the final classification result. This method is hereafter called Averaged thresholding.</p>
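        <p>Averaged thresholding thus reduces to a mean over the classifiers' per-label scores followed by a single threshold (a minimal sketch; the threshold is a placeholder for the value tuned on the development set):</p>
        <preformat>
```python
def averaged_thresholding(score_lists, threshold=0.5):
    """Average the per-label scores of all classifiers, then threshold."""
    n = len(score_lists)
    avg = [sum(scores) / n for scores in zip(*score_lists)]
    return [1 if a > threshold else 0 for a in avg]
```
        </preformat>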
        <p>The second combination approach first thresholds the scores of all individual classifiers. Then, the final classification output is given by the agreement of a majority of the classifiers. We call this method Majority voting with thresholding.</p>
        <p>Supervised Combination We use another neural network of the multi-layer perceptron type to combine the results. This network has three layers: n × |C| inputs, a hidden layer with 512 nodes and an output layer composed of |C| neurons (the number of categories to classify), where n is the number of nets to combine. This configuration was set experimentally. As in the case of the individual classifiers, we also evaluate and compare two different activation functions: sigmoid and softmax. These combination approaches are hereafter called FNN with sigmoid and FNN with softmax. Based on the previous experiments with neural nets on multi-label classification, we expect better results for this net with the sigmoid activation (see the first part of Table 1).</p>
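        <p>Majority voting with thresholding can be sketched analogously: each classifier's scores are binarized first, and a label is accepted when at least one half of the classifiers vote for it (an illustrative sketch; the threshold is again a placeholder):</p>
        <preformat>
```python
def majority_voting(score_lists, threshold=0.5):
    """Threshold each classifier's scores, then accept a label when at
    least one half of the classifiers vote for it."""
    votes = [[1 if s > threshold else 0 for s in scores] for scores in score_lists]
    half = len(votes) / 2
    return [1 if sum(col) >= half else 0 for col in zip(*votes)]
```
        </preformat>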
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Experiments</title>
      <p>In this section we first describe the corpora used for the evaluation of our methods. Then, we describe the performed experiments and the final results.</p>
      <sec id="sec-4-1">
        <title>4.1 Tools and Corpora</title>
        <p>
          For the implementation of all neural nets we used the Keras toolkit [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], which is based on the Theano deep learning library [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. It was chosen mainly because of its good performance and our previous experience with this tool. All experiments were computed on a GPU to achieve reasonable computation times.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4.2 Czech ČTK Corpus</title>
      <p>For the following experiments we first used the Czech ČTK corpus. This corpus contains 2,974,040 words belonging to 11,955 documents. The documents are annotated with labels from a set of 60 categories (for instance agriculture, weather, politics or sport), out of which we used the 37 most frequent ones. This category reduction was done to allow comparison with previously reported results on this corpus, where the same set of 37 categories was used. We further created a development set composed of 500 randomly chosen samples removed from the corpus. Figure 2 illustrates the distribution of the documents depending on the number of labels. Figure 3 shows the distribution of the document lengths (in word tokens). This corpus is freely available for research purposes at http://home.zcu.cz/~pkral/sw/.</p>
      <p>
        We use a five-fold cross-validation procedure for all experiments on this corpus. The optimal value of the threshold is determined on the development set. For the evaluation of the multi-label document classification results, we use the standard recall, precision and F-measure (F1) metrics [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. The values are micro-averaged.
      </p>
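      <p>For reference, the micro-averaged metrics pool the per-label decisions over all documents before computing the ratios (a standard computation, sketched here for binary label-indicator lists):</p>
      <preformat>
```python
def micro_f1(gold, predicted):
    """Micro-averaged precision, recall and F1 over multi-label decisions.

    gold, predicted: lists of equal-length binary label-indicator lists.
    """
    tp = fp = fn = 0
    for g_row, p_row in zip(gold, predicted):
        for g, p in zip(g_row, p_row):
            if p == 1 and g == 1:
                tp += 1
            elif p == 1:
                fp += 1
            elif g == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```
      </preformat>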
    </sec>
    <sec id="sec-6">
      <title>Reuters-21578 English Corpus</title>
      <p>The Reuters-21578 corpus is a collection of 21,578 documents. This corpus is used to compare our approaches with the state of the art. As suggested by many authors, the training part is composed of 7,769 documents, while 3,019 documents are reserved for testing. The number of possible categories is 90 and the average number of labels per document is 1.23.</p>
      <sec id="sec-6-1">
        <title>4.3 Results of the Individual Nets</title>
        <p>
          The first experiment (see Table 1) shows the results of the
individual neural nets with sigmoid and softmax
activation functions against the baseline approach proposed by
Brychcín et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. These nets will be further referenced
by the method number.
        </p>
        <p>This table demonstrates the very good classification performance of both individual nets and shows that their results are very close to each other. It also shows that the softmax activation function is slightly better for the FDNN, while the sigmoid activation function gives significantly better results for the CNN.</p>
        <p>Another interesting fact regarding these results is that approaches no. 1 - 3 have comparable precision and recall, while the best performing method no. 4 has significantly better precision than recall (Δ ∼ 4%).</p>
        <p>This table further shows that three individual neural
networks outperform the baseline approach.</p>
        <p>Error Analysis To confirm the potential benefits of the combination, we analyze the errors of the individual nets. As already stated, we assume that different classifiers retain different information and thus produce different types of errors, which could be compensated for by a combination. The following analysis shows the numbers of incorrectly identified documents for two categories. We present the numbers of errors for all individual classifiers and compare them with the combination of all classifiers.</p>
        <p>The upper part of Figure 4 focuses on the most frequent class - politics. The graph shows that the numbers of errors produced by the individual nets are comparable. However, the networks make errors on different documents, and only a few of them (384 out of 2221) are common to all the nets.</p>
        <p>The lower part of Figure 4 concentrates on a less frequent class - chemical industry. This analysis demonstrates that the performances of the different nets differ significantly: the sigmoid activation function is substantially better than the softmax, and the different nets also produce different types of errors. The number of common errors is 49 (out of 232 in total).</p>
        <p>To conclude, both analyses clearly confirm our assumption that the combination should be beneficial for improving the results of the individual nets.</p>
      </sec>
      <sec id="sec-6-2">
        <title>4.4 Results of Unsupervised Combinations</title>
        <p>The second experiment (see Table 2) shows the results of the Averaged thresholding method. These results confirm our assumption that the different nets keep complementary information and that it is useful to combine them. This experiment further shows that combining nets with lower scores (particularly net no. 2) can degrade the final classification score (e.g. combination 1 &amp; 2 vs. individual net no. 1).</p>
        <p>Another interesting, somewhat surprising observation is that the CNN with the lowest classification accuracy can have a positive impact on the final classification (e.g. combination 1 &amp; 3), whereas the FDNN no. 2 (with significantly better results) brings only a very small positive impact to any combination.</p>
        <p>The next experiment, depicted in Table 3, deals with the results of the second unsupervised combination method, Majority voting with thresholding. Note that we require an agreement of at least one half of the classifiers to obtain unambiguous results; therefore, we evaluated the combinations of at least three networks.</p>
        <p>This table shows that this combination approach also has a positive impact on document classification and that the results of both methods are comparable. However, from the point of view of the contribution of the individual nets, net no. 2 contributes more to the final results than in the previous case.</p>
      </sec>
      <sec id="sec-6-2b">
        <title>4.5 Results of Supervised Combination</title>
        <p>The following experiments show the results of the supervised combination method with an FNN (see Section 3.2). We have evaluated and compared the nets with both sigmoid (see Table 4) and softmax (see Table 5) activation functions.</p>
        <p>These tables show that these combinations also have a positive impact on the classification and that the sigmoid activation function brings better results than softmax. This behaviour is similar to that of the individual nets. Moreover, as expected, this supervised combination slightly outperforms both previously described unsupervised methods.</p>
      </sec>
      <sec id="sec-6-3">
        <title>4.6 Final Results Analysis</title>
        <p>Finally, we analyze the results for different document types. The main criterion was the number of labels per document. We assume that this number plays an important role in classification: intuitively, documents with fewer labels will be easier to classify. We thus divided the documents into five distinct classes according to the number of labels (i.e. documents with one, two, three and four labels, and the remaining documents). Then, we determined an optimal threshold for every class and report the F-measure. This value is compared to the results obtained with the global threshold identified previously (one threshold for all documents).</p>
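        <p>The per-class threshold search described above can be sketched as follows (illustrative only; the candidate grid is an assumption, not the exact procedure used in the experiments):</p>
        <preformat>
```python
def best_threshold(scores, gold, candidates=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)):
    """Pick the threshold maximizing micro F1 on one group of documents."""
    def f1_at(t):
        tp = fp = fn = 0
        for s_row, g_row in zip(scores, gold):
            for s, g in zip(s_row, g_row):
                p = 1 if s > t else 0
                if p == 1 and g == 1:
                    tp += 1
                elif p == 1:
                    fp += 1
                elif g == 1:
                    fn += 1
        return 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return max(candidates, key=f1_at)
```
        </preformat>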
        <p>The results of this analysis are shown in Figure 5. We have chosen two representative cases to analyze: the individual FDNN with softmax (left side) and the combination by the Averaged thresholding method (right side). The adaptive threshold means that the threshold is optimized for each group of documents separately, while the fixed threshold is the one optimized on the development set. This figure confirms our assumption: the best classification results are obtained for documents with one label, and the results then decrease. Moreover, this analysis shows that the number of labels plays a crucial role in document classification in all cases. Hypothetically, if we could determine the number of labels for a particular document before the thresholding, we could improve the final F-measure by 1.5%.</p>
      </sec>
      <sec id="sec-6-4">
        <title>4.7 Results on English Corpus</title>
        <p>This experiment shows the results of our methods on the frequently used Reuters-21578 corpus. We present the results on the English dataset mainly for comparison with other state-of-the-art methods, since we cannot provide such a comparison on the Czech data. Table 6 shows the performance of the proposed models on the benchmark Reuters-21578 dataset; the bottom part of the table provides a comparison with other state-of-the-art methods.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5 Conclusions and Future Work</title>
      <p>In this paper, we have used several combination methods to improve the results of individual neural nets for multi-label document classification of Czech text documents. We have also presented the results of our methods on a standard English corpus, and we have compared several popular (unsupervised as well as supervised) combination methods.</p>
      <p>
        Note: the approach proposed by Zhang et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] was used with ReLU activation, AdaGrad and dropout.
      </p>
      <p>The experimental results have confirmed our assumption that the different nets keep different information; it is therefore useful to combine them to improve on the classification scores of the individual nets. We have also shown that thresholding is a good method for assigning document labels in multi-label classification, and that the results of all the approaches are comparable. However, the best combination method is the supervised one, which uses an FNN with the sigmoid activation function. The F-measure on Czech is 85.3%, while the best result for English is 87.6%; the results on both languages are thus at least comparable with the state of the art.</p>
      <p>One perspective for further work is to improve the combination methods, since the error analysis has shown that there is still some room for improvement. We have also shown that knowing the number of labels could improve the results. Another perspective is thus to build a classifier with thresholds dependent on the number of labels.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This work has been supported by project LO1506 of the Czech Ministry of Education, Youth and Sports. We would also like to thank the Czech News Agency (ČTK) for support and for providing the data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Nam</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mencía</surname>
            ,
            <given-names>E.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fürnkranz</surname>
          </string-name>
          , J.:
          <article-title>Large-scale multi-label text classification-revisiting neural networks</article-title>
          .
          <source>In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          , Springer (
          <year>2014</year>
          )
          <fpage>437</fpage>
          -
          <lpage>452</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Lenc</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Král</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Deep neural networks for czech multilabel document classification</article-title>
          .
          <source>CoRR abs/1701.03849</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Della</given-names>
            <surname>Pietra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Della Pietra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Lafferty</surname>
          </string-name>
          , J.:
          <article-title>Inducing features of random fields</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>19</volume>
          (
          <issue>4</issue>
          ) (
          <year>1997</year>
          )
          <fpage>380</fpage>
          -
          <lpage>393</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>J.O.:</given-names>
          </string-name>
          <article-title>A comparative study on feature selection in text categorization</article-title>
          .
          <source>In: Proceedings of the Fourteenth International Conference on Machine Learning. ICML '97</source>
          , San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. (
          <year>1997</year>
          )
          <fpage>412</fpage>
          -
          <lpage>420</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Ramage</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nallapati</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.:
          <article-title>Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora</article-title>
          .
          <source>In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1. EMNLP '09</source>
          , Stroudsburg, PA, USA, Association for Computational Linguistics (
          <year>2009</year>
          )
          <fpage>248</fpage>
          -
          <lpage>256</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Collobert</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karlen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuksa</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Natural language processing (almost) from scratch</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2493</fpage>
          -
          <lpage>2537</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Text understanding from scratch</article-title>
          .
          <source>arXiv preprint arXiv:1502.01710</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1408.5882</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>In: Proceedings of Workshop at ICLR</source>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Recurrent convolutional neural networks for text classification</article-title>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Manevitz</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yousef</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>One-class document classification via neural networks</article-title>
          .
          <source>Neurocomputing</source>
          <volume>70</volume>
          (
          <issue>7-9</issue>
          ) (
          <year>2007</year>
          )
          <fpage>1466</fpage>
          -
          <lpage>1481</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Z.H.</given-names>
          </string-name>
          :
          <article-title>Multilabel neural networks with applications to functional genomics and text categorization</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>18</volume>
          (
          <issue>10</issue>
          ) (
          <year>2006</year>
          )
          <fpage>1338</fpage>
          -
          <lpage>1351</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Nair</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          :
          <article-title>Rectified linear units improve restricted Boltzmann machines</article-title>
          .
          <source>In: Proceedings of the 27th international conference on machine learning (ICML-10)</source>
          . (
          <year>2010</year>
          )
          <fpage>807</fpage>
          -
          <lpage>814</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ) (
          <year>2014</year>
          )
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Kurata</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence</article-title>
          .
          <source>In: Proceedings of NAACL-HLT</source>
          . (
          <year>2016</year>
          )
          <fpage>521</fpage>
          -
          <lpage>526</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gopal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Multilabel classification with metalevel features in a learning-to-rank framework</article-title>
          .
          <source>Machine Learning</source>
          <volume>88</volume>
          (
          <issue>1-2</issue>
          ) (
          <year>2012</year>
          )
          <fpage>47</fpage>
          -
          <lpage>68</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Brychcín</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Král</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Novel unsupervised features for Czech multi-label document classification</article-title>
          .
          <source>In: 13th Mexican International Conference on Artificial Intelligence (MICAI 2014)</source>
          , Tuxtla Gutierrez, Chiapas, Mexico, Springer (16-22 November
          <year>2014</year>
          )
          <fpage>70</fpage>
          -
          <lpage>79</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Tulyakov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaeger</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Govindaraju</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doermann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Review of classifier combination methods</article-title>
          .
          <source>In: Machine Learning in Document Analysis and Recognition</source>
          . Springer (
          <year>2008</year>
          )
          <fpage>361</fpage>
          -
          <lpage>386</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          : Keras. https://github.com/fchollet/keras (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Bergstra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breuleux</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bastien</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lamblin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pascanu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Desjardins</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turian</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warde-Farley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Theano: a CPU and GPU math expression compiler</article-title>
          .
          <source>In: Proceedings of the Python for scientific computing conference (SciPy)</source>
          . Volume
          <volume>4</volume>
          ., Austin, TX (
          <year>2010</year>
          )
          <fpage>3</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Powers</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Evaluation: from precision, recall and F-measure to ROC, informedness, markedness &amp; correlation</article-title>
          .
          <source>Journal of Machine Learning Technologies</source>
          <volume>2</volume>
          (
          <issue>1</issue>
          ) (
          <year>2011</year>
          )
          <fpage>37</fpage>
          -
          <lpage>63</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Rubin</surname>
            ,
            <given-names>T.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chambers</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smyth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steyvers</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Statistical topic models for multi-label document classification</article-title>
          .
          <source>Machine Learning</source>
          <volume>88</volume>
          (
          <issue>1-2</issue>
          ) (
          <year>2012</year>
          )
          <fpage>157</fpage>
          -
          <lpage>208</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>