J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 186–192
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 L. Lenc, P. Král



             Ensemble of Neural Networks for Multi-label Document Classification

                                                   Ladislav Lenc1,2 and Pavel Král1,2
                              1Department of Computer Science and Engineering, Faculty of Applied Sciences,
                                 University of West Bohemia, Univerzitní 8, 306 14 Plzeň, Czech Republic
                             2 NTIS—New Technologies for the Information Society, Faculty of Applied Sciences,
                                 University of West Bohemia, Technická 8, 306 14 Plzeň, Czech Republic
                                                           nlp.kiv.zcu.cz
                                                     {llenc,pkral}@kiv.zcu.cz

Abstract: This paper deals with multi-label document classification using an ensemble of neural networks. The assumption is that different network types capture complementary information and that combining several neural classifiers will bring higher accuracy. We verify this hypothesis by an error analysis of the individual networks. One contribution of this work is thus the evaluation of several network combinations that improve performance over a single network. Another contribution is a detailed analysis of the achieved results and a proposal of possible directions for further improvement. We evaluate the approaches on a Czech ČTK corpus and also compare the results with state-of-the-art approaches on the English Reuters-21578 dataset. We show that the ensemble of neural classifiers achieves competitive results using only very simple features.

Keywords: Czech, deep neural networks, document classification, multi-label

1 Introduction

This paper deals with multi-label document classification by neural networks. Formally, this task can be seen as the problem of finding a model M which assigns to a document d ∈ D a set of appropriate labels (categories) c ⊆ C, M : d → c, where D is the set of all documents and C is the set of all possible document labels. Multi-label classification using neural networks is often done by thresholding the output layer [1, 2]. It has been shown that both standard feed-forward networks (FNNs) and convolutional neural networks (CNNs) achieve state-of-the-art results on the standard corpora [1, 2].

However, we believe that there is still some room for further improvement. A combination of classifiers is a natural step forward. Therefore, we combine a CNN and an FNN in this work to gain further improvement in terms of precision and recall. We support the claim that a combination may bring better results by studying the errors of the individual networks. The main contribution of this paper thus consists in the analysis of errors in the prediction results of the individual networks. We then present the results of several combination methods and illustrate that the ensemble of neural networks brings a significant improvement over the individual networks.

The methods are evaluated on documents in the Czech language, a representative of the highly inflectional Slavic languages with free word order. These properties decrease the performance of the usual methods. We further compare the results of our methods with other state-of-the-art approaches on the English Reuters-21578¹ dataset in order to show their robustness across languages. Additionally, we analyze the final F-measure on document sets divided according to the number of assigned labels in order to improve the accuracy of the presented approach.

The rest of the paper is organized as follows. Section 2 is a short review of document classification methods with a particular focus on neural networks. Section 3 describes our neural network models and the combination methods. Section 4 deals with experiments realized on the ČTK and Reuters corpora and then analyzes and discusses the obtained results. In the last section, we conclude the experimental results and propose some future research directions.

2 Related Work

Document classification is usually based on supervised machine learning. A classifier is trained on an annotated corpus and then assigns class labels to unlabelled documents. Most works use the vector space model (VSM), which generally represents each document as a vector of all word occurrences, usually weighted by their tf-idf.

Several classification methods have been used successfully [3], for instance Bayesian classifiers, maximum entropy and support vector machines. However, the main issue of this task is that the feature space is highly dimensional, which decreases the classification results. Feature selection/reduction [4] or a better document representation [5] can be used to solve this problem.

Nowadays, "deep" neural nets outperform the majority of state-of-the-art natural language processing (NLP) methods on several tasks with only very simple features. These include, for instance, POS tagging, chunking, named entity recognition and semantic role labelling [6]. Several different topologies and learning algorithms have been proposed. For instance, Zhang et al. [7] propose two convolutional neural nets (CNNs) for ontology classification, sentiment analysis and single-label document classification. They show that the proposed method significantly outperforms the baseline approach (bag of words) on English and Chinese corpora. Another interesting work [8] uses pre-trained word2vec [9] vectors in the first layer. The authors show that the proposed models outperform the state of the art on 4 out of 7 tasks, including sentiment analysis and question classification. Recurrent convolutional neural nets are used for text classification in [10]. The authors demonstrated that their approach outperforms standard convolutional networks on four corpora in a single-label document classification task.

On the other hand, traditional feed-forward neural net architectures are used rather rarely for multi-label document classification. These models were more popular earlier, as shown for instance in [11]. The authors build a simple multi-layer perceptron with three layers (20 inputs, 6 neurons in the hidden layer and 10 neurons in the output layer, i.e. the number of classes) which gives an F-measure of about 78% on the standard Reuters dataset. Feed-forward neural networks were also used for multi-label document classification in [12]. The authors modified the standard backpropagation algorithm for multi-label learning (BP-MLL), which employs a novel error function. This approach is evaluated on functional genomics and text categorization.

A recent study on multi-label text classification was proposed by Nam et al. in [1]. The authors build on the assumption that neural networks can model label dependencies in the output layer. They investigate the limitations of multi-label learning and propose a simple neural network approach. The authors use cross-entropy instead of ranking loss for training, and they further employ recent advances in the deep learning field, e.g. rectified linear unit activations and AdaGrad learning with dropout [13, 14]. A TF-IDF representation of documents is used as the network input. Multi-label classification is handled by thresholding the output layer: each possible label has its own output node, and the final decision is based on the node's value. The approach is evaluated on several multi-label datasets and reaches results comparable to the state of the art.

Another method [15] based on neural networks leverages the co-occurrence of labels in multi-label classification. Some neurons in the output layer capture the patterns of label co-occurrences, which improves the classification accuracy. The architecture is basically a convolutional network and utilizes word embeddings for initialization of the embedding layer. The method is evaluated on natural language query classification in a document retrieval system.

An alternative approach to handling multi-label classification is proposed by Yang and Gopal in [16]. The conventional representations of texts and categories are transformed into meta-level features. These features are then utilized in a learning-to-rank algorithm. Experiments on six benchmark datasets show the abilities of this approach in comparison with other methods.

Another recent work proposes novel features based on unsupervised machine learning [17].

A significant amount of work on the combination of classifiers has been done previously. Our approaches are motivated by the review of Tulyakov et al. [18].

3 Neural Networks and Combination

3.1 Individual Nets

We use two individual neural nets with different activation functions (sigmoid and softmax) in the output layer. Their topologies are briefly presented in the following two sections.

Feed-forward Deep Neural Network (FDNN) We use a Multi-Layer Perceptron (MLP) with two hidden layers². As the input of our network we use a simple bag of words (BoW), a binary vector where the value 1 means that the word with the given index is present in the document. The size of this vector depends on the size of the dictionary, which is limited to the N most frequent words; this defines the size of the input layer. The first hidden layer has 1024 nodes, while the second one has 512. This configuration was set based on experimental results. The output layer has a size equal to the number of categories |C|. To handle multi-label classification, we threshold the values of the nodes in the output layer: only the labels with values larger than a given threshold are assigned to the document.

Convolutional Neural Network (CNN) The input is the sequence of words in the document. We use the same dictionary as in the previous approach, and the words are represented by their indexes into the dictionary. The architecture of our network (see Figure 1) is motivated by Kim [8]. However, based on our preliminary experiments, we use only one-dimensional (1D) convolutional kernels instead of a combination of several sizes of 2D kernels. The input of our network is a vector of word indexes of length L, where L is the number of words used for document representation. The issue of variable document size is solved by setting a fixed value (longer documents are shortened and shorter ones are padded). The second layer is an embedding layer which represents each input word as a vector of a given length. The document is thus represented as a matrix with L rows and EMB columns, where EMB is the length of the embedding vectors. The third layer is the convolutional one. We use NC convolution kernels of size K × 1, which means we do 1D convolution over one position in the embedding vector over K input words. The following layer performs max-pooling over the length L − K + 1, resulting in NC 1 × EMB vectors.

1 http://www.daviddlewis.com/resources/testcollections/reuters21578/
2 We have also experimented with an MLP with one hidden layer, which gave lower accuracy.
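As an illustration, the two input representations described in Section 3.1 (a binary BoW vector for the FDNN, a fixed-length index sequence for the CNN) can be sketched as follows. This is a minimal sketch with toy values: the dictionary, the sizes N and L, and the padding index are our own illustrative assumptions, not taken from the paper.

```python
# Toy sketch of the two input representations (illustrative values only;
# the real dictionary is built from the N most frequent corpus words).
N = 8          # dictionary size (and FDNN input layer size)
L = 6          # fixed document length used by the CNN
PAD = 0        # assumed padding index for short documents

dictionary = {w: i for i, w in enumerate(
    ["the", "government", "sport", "match", "weather", "rain", "czech", "news"])}

def bow_vector(tokens):
    """Binary bag of words: 1 iff the dictionary word occurs (FDNN input)."""
    vec = [0] * N
    for t in tokens:
        if t in dictionary:
            vec[dictionary[t]] = 1
    return vec

def index_sequence(tokens):
    """Word-index sequence truncated or padded to length L (CNN input)."""
    idx = [dictionary[t] for t in tokens if t in dictionary]
    return (idx + [PAD] * L)[:L]

doc = ["the", "weather", "rain", "czech", "weather"]
print(bow_vector(doc))      # 1s at the positions of 'the', 'weather', 'rain', 'czech'
print(index_sequence(doc))  # word indexes, padded to length L
```

Note that the BoW vector discards word order and repetition, while the index sequence preserves order for the convolutional layers; out-of-dictionary words are simply dropped in both representations.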

Figure 1: CNN architecture

The output of the max-pooling layer is then flattened and connected with the output layer containing |C| nodes. The final result is, as in the previous case, obtained by thresholding the network outputs.

3.2 Combination

We assume that the different nets keep some complementary information which can compensate for recognition errors. We also assume that a similar network topology with different activation functions can bring different information, and thus that every net should have its particular impact on the final classification. Therefore, we consider all the nets as different classifiers which will be further combined.

Two types of combination are evaluated and compared. The first group does not need any training phase, while the second one learns a classifier.

Unsupervised Combination The first combination method compensates for the errors of the individual classifiers by computing the average value of their outputs. This value is subsequently thresholded to obtain the final classification result. This method is hereafter called Averaged thresholding.

The second combination approach first thresholds the scores of all individual classifiers. Then, the final classification output is given by the agreement of the majority of the classifiers. We call this method Majority voting with thresholding.

Supervised Combination We use another neural network of the multi-layer perceptron type to combine the results. This network has three layers: n × |C| inputs, a hidden layer with 512 nodes and an output layer composed of |C| neurons (the number of categories to classify), where n is the number of nets to combine. This configuration was set experimentally. We also evaluate and compare, as in the case of the individual classifiers, two different activation functions: sigmoid and softmax. These combination approaches are hereafter called FNN with sigmoid and FNN with softmax. Based on the previous experiments with neural nets on multi-label classification, we expect better results from this net with the sigmoid activation (see the first part of Table 1).

4 Experiments

In this section we first describe the corpora used for evaluation of our methods. Then, we describe the performed experiments and the final results.

4.1 Tools and Corpora

For the implementation of all neural nets we used the Keras toolkit [19], which is based on the Theano deep learning library [20]. It was chosen mainly because of its good performance and our previous experience with this tool. All experiments were computed on a GPU to achieve reasonable computation times.

4.2 Czech ČTK Corpus

For the following experiments we first used the Czech ČTK corpus. This corpus contains 2,974,040 words belonging to 11,955 documents. The documents are annotated with labels from a set of 60 categories (for instance agriculture, weather, politics or sport), out of which we used the 37 most frequent ones. The category reduction was done to allow comparison with previously reported results on this corpus, where the same set of 37 categories was used. We further created a development set composed of 500 randomly chosen samples removed from the entire corpus. Figure 2 illustrates the distribution of the documents depending on the number of labels. Figure 3 shows the distribution of the document lengths (in word tokens). This corpus is freely available for research purposes at http://home.zcu.cz/~pkral/sw/.

Figure 2: Distribution of documents depending on the number of labels assigned to the documents (1 label: 2,693 documents; 2: 3,821; 3: 2,723; 4: 1,837; 5: 656; 6: 183; 7: 41; 8: 1)

Figure 3: Distribution of the document lengths (in words)

Table 1: Results of the individual nets with sigmoid and softmax activation functions against the baseline approach

  No.  Network  Activation     Precision   Recall   F1 [%]
  1.   FDNN     softmax          84.4       82.1     83.3
  2.   FDNN     sigmoid          83.0       81.2     82.1
  3.   CNN      softmax          80.6       80.8     80.7
  4.   CNN      sigmoid          86.3       81.9     84.1
       Baseline [17]             89.0       75.6     81.7

We use the five-fold cross-validation procedure for all experiments on this corpus. The optimal value of the threshold is determined on the development set. For evaluation of the multi-label document classification results, we use the standard recall, precision and F-measure (F1) metrics [21]. The values are micro-averaged.

Reuters-21578 English Corpus The Reuters-21578³ corpus is a collection of 21,578 documents. This corpus is used to compare our approaches with the state of the art. As suggested by many authors, the training part is composed of 7,769 documents, while 3,019 documents are reserved for testing. The number of possible categories is 90 and the average number of labels per document is 1.23.

4.3 Results of the Individual Nets

The first experiment (see Table 1) shows the results of the individual neural nets with sigmoid and softmax activation functions against the baseline approach proposed by Brychcín et al. [17]. These nets will hereafter be referenced by their method numbers.

This table demonstrates very good classification performance of both individual nets; the classification results are close to each other and comparable. The table also shows that the softmax activation function is slightly better for the FDNN, while the sigmoid activation function gives significantly better results for the CNN.

Another interesting fact regarding these results is that approaches no. 1–3 have comparable precision and recall, while the best performing method no. 4 has significantly better precision than recall (∆ ∼ 4%).

The table further shows that three of the individual neural networks outperform the baseline approach.

Error Analysis To confirm the potential benefits of the combination, we analyzed the errors of the individual nets. As already stated, we assume that different classifiers retain different information and should thus make different types of errors, which could be compensated by a combination. The following analysis shows the numbers of incorrectly identified documents for two categories. We present the numbers of errors for all individual classifiers and compare them with the combination of all classifiers.

The upper part of Figure 4 focuses on the most frequent class, politics. The graph shows that the numbers of errors produced by the individual nets are comparable. However, the networks make errors on different documents and only a few (384 of 2,221) are common to all the nets.

The lower part of Figure 4 concentrates on the less frequent class, chemical industry. This analysis demonstrates that the performances of the different nets differ significantly: the sigmoid activation function is substantially better than the softmax, and the different nets also produce different types of errors. The number of common errors is 49 (of 232 in total).

To conclude, both analyses clearly confirm our assumption that the combination should be beneficial for improving the results of the individual nets.

4.4 Results of Unsupervised Combinations

The second experiment (see Table 2) shows the results of the Averaged thresholding method. These results confirm our assumption that the different nets keep complementary information and that it is useful to combine them. This experiment further shows that the combination of nets with lower scores (particularly with net no. 2) can degrade the final classification score (e.g. combination 1 & 2 vs. individual net no. 1).

Another interesting, somewhat surprising, observation is that the CNN with the lowest classification accuracy can have some positive impact to the final classification

3 http://www.daviddlewis.com/resources/testcollections/reuters21578/


Figure 4: Error analysis of the individual nets for the most frequent (top, politics) and the less frequent (bottom, chemical industry) classes; numbers of incorrectly identified documents in brackets

Table 2: Combinations of nets by Averaged thresholding

  Net combi.   Precision   Recall   F1 [%]
  1&2            83.0       82.4     82.7
  1&3            83.2       84.6     83.9
  1&4            85.7       84.3     85.0
  2&3            86.2       79.6     82.8
  2&4            84.9       83.5     84.2
  3&4            87.3       81.7     84.4
  1&2&3          84.8       81.9     83.3
  1&2&4          90.1       79.6     84.5
  1&3&4          86.7       83.5     85.1

Table 3: Combinations of the nets by Majority voting with thresholding

  Net combi.   Precision   Recall   F1 [%]
  1&2&3          86.1       82.9     84.6
  1&2&4          87.5       82.6     85.0
  1&3&4          86.5       82.9     84.6
  2&3&4          86.9       82.7     84.8
  1&2&3&4        84.1       85.7     84.9

4.5 Results of Supervised Combinations

The following experiments show the results of the supervised combination method with an FNN (see Section 3.2). We evaluated and compared the nets with both sigmoid (see Table 4) and softmax (see Table 5) activation functions.

These tables show that these combinations also have a positive impact on the classification and that the sigmoid activation function brings better results than softmax. This

Table 4: Combinations of the nets by FNN with sigmoid

  Net combi.   Precision   Recall   F1 [%]
  1&2            86.1       82.1     84.1
  1&3            87.1       81.5     84.2
  1&4            88.4       81.9     85.0
  2&3            86.6       81.4     83.9
  2&4            87.7       82.0     84.7
  3&4            89.3       80.0     84.4
  1&2&3          86.9       82.4     84.6
  1&2&4          87.9       82.8     85.3
  1&3&4          88.2       82.5     85.2
  2&3&4          87.9       82.2     85.0
         2&3&4                89.3          80.5        84.6            1&2&3&4              88.0         82.8       85.3
         1&2&3&4              89.7          80.5        84.9
                                                                       Table 5: Combinations of the nets by FNN with softmax
      (e.g. combination 1 & 3). However, the FDNN no. 2 (with
                                                                        Net combi.         Precision     Recall     F1 [%]
      significantly better results) brings only very small positive
      impact to any combination.                                        1&2                  85.3         81.6       83.4
         The next experiment which is depicted in Table 3 deals         1&3                  85.4         81.8       83.6
      with the results of the second unsupervised combination           1&4                  86.3         82.6       84.4
      method, Majority voting with thresholding. Note, that we          2&3                  85.4         80.9       83.1
      consider an agreement of at least one half of the classifiers     2&4                  86.1         82.0       84.0
      to obtain unambiguous results. Therefore, we evaluated
      the combinations of at least three networks.                      3&4                  86.7         81.3       83.9
         This table shows that this combination approach brings         1&2&3                85.0         82.7       83.9
      also positive impact to document classification and the re-       1&2&4                85.7         83.2       84.4
      sults of both methods are comparable. However, from the           1&3&4                85.8         83.3       84.5
      point of view of the contribution of the individual nets, the     2&3&4                85.6         82.9       84.3
      net no. 2 contributes better for the final results as in the
      previous case.                                                    1&2&3&4              85.7         83.6       84.6
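The two unsupervised combination rules compared above can be stated compactly. The sketch below is illustrative only: it assumes each net outputs per-label scores in [0, 1] (sigmoid outputs) and that the threshold was tuned on the development set; the function names are ours, not the paper's implementation.

```python
import numpy as np

def averaged_thresholding(score_vectors, threshold=0.5):
    """Average the per-label scores of all nets, then apply one threshold."""
    mean_scores = np.mean(score_vectors, axis=0)   # shape: (n_labels,)
    return mean_scores >= threshold                # boolean label mask

def majority_voting_with_thresholding(score_vectors, threshold=0.5):
    """Threshold each net separately; assign a label when at least one half
    of the classifiers vote for it."""
    votes = np.sum(np.asarray(score_vectors) >= threshold, axis=0)
    return 2 * votes >= len(score_vectors)

# Three nets scoring the same document over three labels:
scores = [np.array([0.9, 0.2, 0.6]),
          np.array([0.8, 0.1, 0.4]),
          np.array([0.7, 0.3, 0.6])]
print(averaged_thresholding(scores))              # [ True False  True]
print(majority_voting_with_thresholding(scores))  # [ True False  True]
```

With an even number of nets the "at least one half" rule can be satisfied by a bare tie, which is why the paper evaluates this method only on combinations of at least three networks.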
This behaviour is similar to that of the individual nets. Moreover, as expected, this supervised combination slightly outperforms both previously described unsupervised methods.

4.6 Final Results Analysis

Finally, we analyze the results for the different document types. The main criterion was the number of document labels. We assume that this number plays an important role in classification; intuitively, documents with fewer labels should be easier to classify. We thus divided the documents into five distinct classes according to the number of labels (i.e. the documents with one, two, three and four labels, and the remaining documents). Then we tried to determine an optimal threshold for every class and report the F-measure. This value is compared to the results obtained with the global threshold identified previously (one threshold for all documents).

The results of this analysis are shown in Figure 5. We have chosen two representative cases to analyze: the individual FDNN with softmax (upper graph) and the combination by the Averaged thresholding method (lower graph). The adaptive threshold means that the threshold is optimized for each group of documents separately; the fixed threshold is the one that was optimized on the development set. The figure confirms our assumption: the best classification results are obtained for the documents with one label, and the results then decrease with the number of labels. Moreover, this analysis shows that this number plays a crucial role for document classification in all cases. Hypothetically, if we could determine the number of labels for a particular document before the thresholding, we could improve the final F-measure by 1.5%.

Figure 5: F-measure according to the number of labels for adaptive and fixed thresholds; the upper graph shows the results for the MLP with softmax while the lower one is for the combination of all nets

4.7 Results on English Corpus

This experiment shows the results of our methods on the frequently used Reuters-21578 corpus. We present the results on the English dataset mainly for comparison with other state-of-the-art methods, since such a comparison is not possible on the Czech data. Table 6 shows the performance of the proposed models on the benchmark Reuters-21578 dataset. The bottom part of the table provides a comparison with other state-of-the-art methods.

Table 6: Results on the Reuters-21578 dataset
   Method               Precision   Recall   F1 [%]
   MLP/softmax            89.08      80.6     85.0
   MLP/sigmoid            89.6       82.7     86.0
   CNN/softmax            87.8       84.1     85.9
   CNN/sigmoid            89.4       81.3     85.2
   Supervised combi       91.4       84.1     87.6
   NNAD [1]               90.4       83.4     86.8
   BP-MLL TAD 1           84.2       84.2     84.2
   BRR [22]               89.8       86.0     87.9
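The precision, recall and F-measure values reported in the tables can be computed from per-document label sets. The sketch below assumes micro-averaging over all (document, label) decisions; the averaging mode is our assumption, as the paper does not state it explicitly.

```python
def micro_prf(true_sets, pred_sets):
    """Micro-averaged precision, recall and F1 over per-document label sets."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))  # correct labels
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))  # spurious labels
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))  # missed labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two documents: gold label sets vs. predicted label sets.
p, r, f = micro_prf([{"politics", "economy"}, {"sport"}],
                    [{"politics"}, {"sport", "economy"}])
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667
```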
5 Conclusions and Future Work

In this paper, we have used several combination methods to improve the results of individual neural nets for multi-label document classification of Czech text documents. We have also presented the results of our methods on a standard English corpus. We have compared several popular (unsupervised as well as supervised) combination methods.

The experimental results have confirmed our assumption that the different nets keep different information. Therefore, it is useful to combine them to improve the classification score over the individual nets. We have also shown that thresholding is a good method to assign document labels in multi-label classification. We have further shown that the results of all the approaches are comparable. However, the best combination method is the supervised one, which uses an FNN with a sigmoid activation function. The F-measure on Czech is 85.3%, while the best result for English is 87.6%. The results for both languages are thus at least comparable with the state of the art.

One perspective for further work is to improve the combination methods, since the error analysis has shown that there is still some room for improvement. We have also shown that knowing the number of labels could improve the result. Another perspective is thus to build a classifier with thresholds dependent on the number of labels.

    1 Approach proposed by Zhang et al. [12] and used with ReLU activation, AdaGrad and dropout.

Acknowledgements

This work has been supported by the project LO1506 of the Czech Ministry of Education, Youth and Sports. We would also like to thank the Czech News Agency (ČTK) for support and for providing the data.

References

 [1] Nam, J., Kim, J., Mencía, E.L., Gurevych, I., Fürnkranz, J.: Large-scale multi-label text classification—revisiting neural networks. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer (2014) 437–452
 [2] Lenc, L., Král, P.: Deep neural networks for Czech multi-label document classification. CoRR abs/1701.03849 (2017)
 [3] Della Pietra, S., Della Pietra, V., Lafferty, J.: Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4) (1997) 380–393
 [4] Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proceedings of the Fourteenth International Conference on Machine Learning. ICML '97, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. (1997) 412–420
 [5] Ramage, D., Hall, D., Nallapati, R., Manning, C.D.: Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. EMNLP '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 248–256
 [6] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12 (2011) 2493–2537
 [7] Zhang, X., LeCun, Y.: Text understanding from scratch. arXiv preprint arXiv:1502.01710 (2015)
 [8] Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
 [9] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of Workshop at ICLR. (2013)
[10] Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification. (2015)
[11] Manevitz, L., Yousef, M.: One-class document classification via neural networks. Neurocomputing 70(7-9) (2007) 1466–1481
[12] Zhang, M.L., Zhou, Z.H.: Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18(10) (2006) 1338–1351
[13] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). (2010) 807–814
[14] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1) (2014) 1929–1958
[15] Kurata, G., Xiang, B., Zhou, B.: Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In: Proceedings of NAACL-HLT. (2016) 521–526
[16] Yang, Y., Gopal, S.: Multilabel classification with meta-level features in a learning-to-rank framework. Machine Learning 88(1-2) (2012) 47–68
[17] Brychcín, T., Král, P.: Novel unsupervised features for Czech multi-label document classification. In: 13th Mexican International Conference on Artificial Intelligence (MICAI 2014), Tuxtla Gutiérrez, Chiapas, Mexico, Springer (16–22 November 2014) 70–79
[18] Tulyakov, S., Jaeger, S., Govindaraju, V., Doermann, D.: Review of classifier combination methods. In: Machine Learning in Document Analysis and Recognition. Springer (2008) 361–386
[19] Chollet, F.: Keras. https://github.com/fchollet/keras (2015)
[20] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy). Volume 4., Austin, TX (2010) 3
[21] Powers, D.: Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies 2(1) (2011) 37–63
[22] Rubin, T.N., Chambers, A., Smyth, P., Steyvers, M.: Statistical topic models for multi-label document classification. Machine Learning 88(1-2) (2012) 157–208