=Paper=
{{Paper
|id=Vol-1885/186
|storemode=property
|title=Ensemble of Neural Networks for Multi-label Document Classification
|pdfUrl=https://ceur-ws.org/Vol-1885/186.pdf
|volume=Vol-1885
|authors=Ladislav Lenc,Pavel Král
|dblpUrl=https://dblp.org/rec/conf/itat/LencK17
}}
==Ensemble of Neural Networks for Multi-label Document Classification==
J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 186–192
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 L. Lenc, P. Král
Ensemble of Neural Networks for Multi-label Document Classification
Ladislav Lenc¹,² and Pavel Král¹,²
¹ Department of Computer Science and Engineering, Faculty of Applied Sciences,
University of West Bohemia, Univerzitní 8, 306 14 Plzeň, Czech Republic
² NTIS—New Technologies for the Information Society, Faculty of Applied Sciences,
University of West Bohemia, Technická 8, 306 14 Plzeň, Czech Republic
nlp.kiv.zcu.cz
{llenc,pkral}@kiv.zcu.cz
Abstract: This paper deals with multi-label document classification using an ensemble of neural networks. The assumption is that different network types can keep complementary information and that the combination of more neural classifiers will bring higher accuracy. We verify this hypothesis by an error analysis of the individual networks. One contribution of this work is thus the evaluation of several network combinations that improve performance over a single network. Another contribution is a detailed analysis of the achieved results and a proposition of possible directions of further improvement. We evaluate the approaches on a Czech ČTK corpus and also compare the results with state-of-the-art approaches on the English Reuters-21578 dataset. We show that the ensemble of neural classifiers achieves competitive results using only very simple features.

Keywords: Czech, deep neural networks, document classification, multi-label

1 Introduction

This paper deals with multi-label document classification by neural networks. Formally, this task can be seen as the problem of finding a model M which assigns a document d ∈ D a set of appropriate labels (categories) c ∈ C as follows: M : d → c, where D is the set of all documents and C is the set of all possible document labels. Multi-label classification using neural networks is often done by thresholding of the output layer [1, 2]. It has been shown that both standard feed-forward networks (FNNs) and convolutional neural networks (CNNs) achieve state-of-the-art results on the standard corpora [1, 2].
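Written out, the thresholding rule just described takes the following form (our restatement; the output-score notation s_c(d) and the threshold t are ours, not the paper's):

M(d) = { c ∈ C : s_c(d) > t },

where s_c(d) is the output-layer value for label c on document d and t is a threshold tuned on held-out data.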
However, we believe that there is still some room for further improvement. A combination of classifiers is a natural step forward. Therefore, we combine a CNN and an FNN in this work to gain further improvement in terms of precision and recall. We support the claim that the combination may bring better results by studying the errors of the individual networks. The main contribution of this paper thus consists in the analysis of errors in the prediction results of the individual networks. We then present the results of several combination methods and illustrate that the ensemble of neural networks brings a significant improvement over the individual networks.

The methods are evaluated on documents in the Czech language, a representative of the highly inflectional Slavic languages with a free word order. These properties decrease the performance of the usual methods. We further compare the results of our methods with other state-of-the-art approaches on the English Reuters-21578¹ dataset in order to show their robustness across languages. Additionally, we analyze the final F-measure on document sets divided according to the number of assigned labels in order to improve the accuracy of the presented approach.

The rest of the paper is organized as follows. Section 2 is a short review of document classification methods with a particular focus on neural networks. Section 3 describes our neural network models and the combination methods. Section 4 deals with experiments realized on the ČTK and Reuters corpora and then analyzes and discusses the obtained results. In the last section, we conclude the experimental results and propose some future research directions.

¹ http://www.daviddlewis.com/resources/testcollections/reuters21578/

2 Related Work

Document classification is usually based on supervised machine learning. A classifier is trained on an annotated corpus and then assigns class labels to unlabelled documents. Most works use the vector space model (VSM), which generally represents each document as a vector of all word occurrences, usually weighted by their tf-idf.

Several classification methods have been used successfully [3], for instance Bayesian classifiers, maximum entropy and support vector machines. However, the main issue of this task is that the feature space is highly dimensional, which decreases the classification results. Feature selection/reduction [4] or a better document representation [5] can be used to solve this problem.

Nowadays, "deep" neural nets outperform the majority of state-of-the-art natural language processing (NLP) methods on several tasks with only very simple features. These include for instance POS tagging, chunking, named entity recognition and semantic role labelling [6]. Several different topologies and learning algorithms have been proposed. For instance, Zhang et al. [7] propose two convolutional neural nets (CNN) for ontology classification,
sentiment analysis and single-label document classification. They show that the proposed method significantly outperforms the baseline approach (bag of words) on English and Chinese corpora. Another interesting work [8] uses pre-trained word2vec [9] vectors in the first layer. The authors show that the proposed models outperform the state of the art on 4 out of 7 tasks, including sentiment analysis and question classification. Recurrent convolutional neural nets are used for text classification in [10]. The authors demonstrated that their approach outperforms standard convolutional networks on four corpora in the single-label document classification task.

On the other hand, traditional feed-forward neural net architectures are used rather rarely for multi-label document classification. These models were more popular before, as shown for instance in [11]. The authors build a simple multi-layer perceptron with three layers (20 inputs, 6 neurons in the hidden layer and 10 neurons in the output layer, i.e. the number of classes) which gives an F-measure of about 78% on the standard Reuters dataset. Feed-forward neural networks were also used for multi-label document classification in [12]. The authors modified the standard backpropagation algorithm for multi-label learning (BP-MLL) so that it employs a novel error function. This approach is evaluated on functional genomics and text categorization.

A recent study on multi-label text classification was proposed by Nam et al. in [1]. The authors build on the assumption that neural networks can model label dependencies in the output layer. They investigate limitations of multi-label learning and propose a simple neural network approach. The authors use the cross-entropy algorithm instead of ranking loss for training and further employ recent advances in the deep learning field, e.g. rectified linear unit activation and AdaGrad learning with dropout [13, 14]. The TF-IDF representation of documents is used as the network input. The multi-label classification is handled by performing thresholding on the output layer. Each possible label has its own output node and, based on the final value of the node, a final decision is made. The approach is evaluated on several multi-label datasets and reaches results comparable to the state of the art.

Another method [15] based on neural networks leverages the co-occurrence of labels in multi-label classification. Some neurons in the output layer capture the patterns of label co-occurrences, which improves the classification accuracy. The architecture is basically a convolutional network and utilizes word embeddings for the initialization of the embedding layer. The method is evaluated on natural language query classification in a document retrieval system.

An alternative approach to handling multi-label classification is proposed by Yang and Gopal in [16]. The conventional representations of texts and categories are transformed into meta-level features. These features are then utilized in a learning-to-rank algorithm. Experiments on six benchmark datasets show the abilities of this approach in comparison with other methods.

Another recent work proposes novel features based on unsupervised machine learning [17].

A significant amount of work on the combination of classifiers has been done previously. Our approaches are motivated by the review of Tulyakov et al. [18].

3 Neural Networks and Combination

3.1 Individual Nets

We use two individual neural nets with different activation functions (sigmoid and softmax) in the output layer. Their topologies are briefly presented in the following two sections.

Feed-forward Deep Neural Network (FDNN) We use a Multi-Layer Perceptron (MLP) with two hidden layers². As the input of our network we use a simple bag of words (BoW), a binary vector where the value 1 means that the word with a given index is present in the document. The size of this vector depends on the size of the dictionary, which is limited to the N most frequent words and defines the size of the input layer. The first hidden layer has 1024 nodes while the second one has 512. This configuration was set based on the experimental results. The output layer has a size equal to the number of categories |C|. To handle the multi-label classification, we threshold the values of the nodes in the output layer. Only the labels with values larger than a given threshold are assigned to the document.

² We have also experimented with an MLP with one hidden layer, with lower accuracy.

Convolutional Neural Network (CNN) The input is a sequence of words in the document. We use the same dictionary as in the previous approach. The words are represented by their indexes into the dictionary. The architecture of our network (see Figure 1) is motivated by Kim [8]. However, based on our preliminary experiments, we used only one-dimensional (1D) convolutional kernels instead of the combination of several sizes of 2D kernels. The input of our network is a vector of word indexes of length L, where L is the number of words used for the document representation. The issue of the variable document size is solved by setting a fixed value (longer documents are shortened and shorter ones are padded). The second layer is an embedding layer which represents each input word as a vector of a given length. The document is thus represented as a matrix with L rows and EMB columns, where EMB is the length of the embedding vectors. The third layer is the convolutional one. We use NC convolution kernels of size K × 1, which means we do a 1D convolution over one position in the embedding vector over K input words. The following layer performs max-pooling over the length L − K + 1, resulting in NC vectors of size 1 × EMB.
The output of this layer is then flattened and connected with the output layer containing |C| nodes. The final result is, as in the previous case, obtained by thresholding of the network outputs.

Figure 1: CNN architecture
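To make the two topologies concrete, the following is a minimal sketch of both nets in Keras, the toolkit named in Section 4.1 (Keras 2 layer names). The values of N, L, EMB, NC and K, the ReLU hidden activations, the optimizer and the loss are our illustrative assumptions; the paper fixes only the 1024/512 hidden-layer sizes and the |C|-node output.

# A sketch of the two individual nets; hyperparameter values and the
# ReLU/optimizer/loss choices are assumptions, not the paper's settings.
from keras.models import Sequential
from keras.layers import (Dense, Embedding, Reshape, Conv2D,
                          MaxPooling2D, Flatten)

N = 20000         # dictionary size (assumption)
L = 400           # fixed document length in words (assumption)
EMB = 100         # embedding vector length (assumption)
NC = 40           # number of convolution kernels (assumption)
K = 3             # kernel size in words (assumption)
NUM_LABELS = 37   # |C|, e.g. the reduced ČTK label set

# FDNN: binary BoW input, hidden layers of 1024 and 512 nodes,
# one output node per label; labels above a threshold are assigned.
fdnn = Sequential([
    Dense(1024, activation='relu', input_shape=(N,)),
    Dense(512, activation='relu'),
    Dense(NUM_LABELS, activation='sigmoid'),  # or 'softmax'
])

# CNN: word indexes -> L x EMB embedding matrix -> NC kernels of size
# K x 1 (1D convolution over K words at a single embedding position)
# -> max-pooling over the length L - K + 1 -> flatten -> output layer.
cnn = Sequential([
    Embedding(input_dim=N, output_dim=EMB, input_length=L),
    Reshape((L, EMB, 1)),
    Conv2D(NC, kernel_size=(K, 1), activation='relu'),
    MaxPooling2D(pool_size=(L - K + 1, 1)),
    Flatten(),
    Dense(NUM_LABELS, activation='sigmoid'),  # or 'softmax'
])

for model in (fdnn, cnn):
    model.compile(optimizer='adam', loss='binary_crossentropy')

The Conv2D kernel of size (K, 1) followed by (L − K + 1, 1) max-pooling reproduces the described per-position 1D convolution, yielding NC vectors of size 1 × EMB before flattening.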
3.2 Combination

We assume that the different nets keep some complementary information which can compensate for recognition errors. We also assume that a similar network topology with different activation functions can bring some different information and thus that each net should have its particular impact on the final classification. Therefore, we consider all the nets as different classifiers which will be further combined.

Two types of combination are evaluated and compared. The first group does not need any training phase, while the second one learns a classifier.

Unsupervised Combination The first combination method compensates for the errors of the individual classifiers by computing the average value of their outputs. This value is subsequently thresholded to obtain the final classification result. This method is hereafter called Averaged thresholding.

The second combination approach first thresholds the scores of all individual classifiers. Then, the final classification output is given by an agreement of the majority of the classifiers. We call this method Majority voting with thresholding.

Supervised Combination We use another neural network of the multi-layer perceptron type to combine the results. This network has three layers: n × |C| inputs, a hidden layer with 512 nodes and an output layer composed of |C| neurons (the number of categories to classify), where n is the number of nets to combine. This configuration was set experimentally. We also evaluate and compare, as in the case of the individual classifiers, two different activation functions: sigmoid and softmax. These combination approaches are hereafter called FNN with sigmoid and FNN with softmax. Based on the previous experiments with neural nets on multi-label classification, we expect better results from this net with sigmoid activation (see the first part of Table 1).
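A minimal sketch of the three combination schemes described above (our illustration; scores is a list holding one (num_docs × |C|) output-score matrix per individual net, t is the threshold tuned on the development set, and the combiner's ReLU hidden activation, optimizer and loss are assumptions):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def averaged_thresholding(scores, t):
    # scores: list of n arrays of shape (num_docs, |C|), one per net.
    return np.mean(scores, axis=0) > t        # boolean label matrix

def majority_voting(scores, t):
    # Threshold each net first, then assign a label when at least
    # one half of the classifiers agree on it.
    votes = np.sum([s > t for s in scores], axis=0)
    return votes >= len(scores) / 2.0

def build_combiner(n, num_labels):
    # Supervised combiner: n * |C| concatenated scores -> 512 nodes
    # -> |C| outputs (sigmoid or softmax), thresholded afterwards.
    model = Sequential([
        Dense(512, activation='relu', input_shape=(n * num_labels,)),
        Dense(num_labels, activation='sigmoid'),  # or 'softmax'
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

The majority-voting rule requires agreement of at least one half of the classifiers, matching the agreement criterion stated in Section 4.4.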
4 Experiments

In this section we first describe the corpora used for the evaluation of our methods. Then, we describe the performed experiments and the final results.

4.1 Tools and Corpora

For the implementation of all neural nets we used the Keras toolkit [19], which is based on the Theano deep learning library [20]. It was chosen mainly because of its good performance and our previous experience with this tool. All experiments were computed on a GPU to achieve reasonable computation times.

4.2 Czech ČTK Corpus

For the following experiments we first used the Czech ČTK corpus. This corpus contains 2,974,040 words belonging to 11,955 documents. The documents are annotated with a set of 60 categories, for instance agriculture, weather, politics or sport, out of which we used the 37 most frequent ones. The category reduction was done to allow comparison with previously reported results on this corpus, where the same set of 37 categories was used. We have further created a development set composed of 500 randomly chosen samples removed from the entire corpus. Figure 2 illustrates the distribution of the documents depending on the number of labels. Figure 3 shows the distribution of the document lengths (in word tokens). This corpus is freely available for research purposes at http://home.zcu.cz/~pkral/sw/.
Figure 2: Distribution of documents depending on the number of labels assigned to the documents (3821, 2693, 2723, 1837, 656, 183, 41 and 1 documents with 1–8 labels, respectively)

Figure 3: Distribution of the document lengths (in words)

Table 1: Results of the individual nets with sigmoid and softmax activation functions against the baseline approach

No.  Network/activation  Prec.  Recall  F1 [%]
1.   FDNN softmax        84.4   82.1    83.3
2.   FDNN sigmoid        83.0   81.2    82.1
3.   CNN softmax         80.6   80.8    80.7
4.   CNN sigmoid         86.3   81.9    84.1
     Baseline [17]       89.0   75.6    81.7

We use the five-fold cross validation procedure for all experiments on this corpus. The optimal value of the threshold is determined on the development set. For the evaluation of the multi-label document classification results, we use the standard recall, precision and F-measure (F1) metrics [21]. The values are micro-averaged.
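A sketch of the micro-averaged metrics and of the development-set threshold search (our illustration; the search grid is an assumption, as the paper does not state one):

import numpy as np

def micro_prf(y_true, y_pred):
    # y_true, y_pred: binary label matrices of shape (num_docs, |C|).
    y_true = y_true.astype(bool)
    y_pred = y_pred.astype(bool)
    tp = np.sum(y_true & y_pred)
    precision = tp / max(np.sum(y_pred), 1)
    recall = tp / max(np.sum(y_true), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1

def tune_threshold(dev_scores, dev_labels,
                   grid=np.linspace(0.05, 0.95, 19)):
    # Return the threshold with the highest micro-F1 on the dev set.
    return max(grid, key=lambda t: micro_prf(dev_labels,
                                             dev_scores > t)[2])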
Reuters-21578 English Corpus The Reuters-21578³ corpus is a collection of 21,578 documents. This corpus is used to compare our approaches with the state of the art. As suggested by many authors, the training part is composed of 7769 documents, while 3019 documents are reserved for testing. The number of possible categories is 90 and the average number of labels per document is 1.23.

³ http://www.daviddlewis.com/resources/testcollections/reuters21578/

4.3 Results of the Individual Nets

The first experiment (see Table 1) shows the results of the individual neural nets with sigmoid and softmax activation functions against the baseline approach proposed by Brychcín et al. [17]. These nets will be further referenced by their method numbers.

This table demonstrates the very good classification performance of both individual nets and shows that their classification results are close to each other and comparable. It also shows that the softmax activation function is slightly better for the FDNN, while the sigmoid activation function gives significantly better results for the CNN.

Another interesting fact regarding these results is that the approaches no. 1–3 have comparable precision and recall, while the best performing method no. 4 has significantly better precision than recall (∆ ∼ 4%).

This table further shows that three of the individual neural networks outperform the baseline approach.

Error Analysis To confirm the potential benefits of the combination, we analyze the errors of the individual nets. As already stated, we assume that different classifiers retain different information and thus should make different types of errors, which could be compensated by a combination. The following analysis shows the numbers of incorrectly identified documents for two categories. We present the numbers of errors for all individual classifiers and compare them with the combination of all classifiers.

The upper part of Figure 4 is focused on the most frequent class, politics. The graph shows that the numbers of errors produced by the individual nets are comparable. However, the networks make errors on different documents and only a few (384 from 2221) are common to all the nets.

The lower part of Figure 4 concentrates on the less frequent class, chemical industry. This analysis demonstrates that the performances of the different nets significantly differ, that the sigmoid activation function is substantially better than the softmax, and that the different nets produce different types of errors. The number of common errors is 49 (from 232 in total).

To conclude, both analyses clearly confirm our assumption that the combination should be beneficial for improving the results of the individual nets.

4.4 Results of Unsupervised Combinations

The second experiment (see Table 2) shows the results of the Averaged thresholding method. These results confirm our assumption that the different nets keep complementary information and that it is useful to combine them. This experiment further shows that the combination of nets with lower scores (particularly with net no. 2) can degrade the final classification score (e.g. combination 1 & 2 vs. individual net no. 1).

Another interesting, somewhat surprising, observation is that the CNN with the lowest classification accuracy can have some positive impact on the final classification
(e.g. combination 1 & 3). However, the FDNN no. 2 (with significantly better results) brings only a very small positive impact to any combination.

Table 2: Combinations of nets by Averaged thresholding

Net combi.  Precision  Recall  F1 [%]
1&2         83.0       82.4    82.7
1&3         83.2       84.6    83.9
1&4         85.7       84.3    85.0
2&3         86.2       79.6    82.8
2&4         84.9       83.5    84.2
3&4         87.3       81.7    84.4
1&2&3       84.8       81.9    83.3
1&2&4       90.1       79.6    84.5
1&3&4       86.7       83.5    85.1
2&3&4       89.3       80.5    84.6
1&2&3&4     89.7       80.5    84.9

The next experiment, depicted in Table 3, deals with the results of the second unsupervised combination method, Majority voting with thresholding. Note that we consider an agreement of at least one half of the classifiers to obtain unambiguous results. Therefore, we evaluated only the combinations of at least three networks.

This table shows that this combination approach also has a positive impact on document classification and that the results of both methods are comparable. However, from the point of view of the contribution of the individual nets, net no. 2 contributes more to the final results than in the previous case.

Table 3: Combinations of the nets by Majority voting with thresholding

Net combi.  Precision  Recall  F1 [%]
1&2&3       86.1       82.9    84.6
1&2&4       87.5       82.6    85.0
1&3&4       86.5       82.9    84.6
2&3&4       86.9       82.7    84.8
1&2&3&4     84.1       85.7    84.9

Figure 4: Error analysis of the individual nets for the most frequent (top, politics) and for the less frequent (bottom, chemical industry) classes, numbers of incorrectly identified documents in brackets

4.5 Results of Supervised Combinations

The following experiments show the results of the supervised combination method with an FNN (see Sec. 3.2). We have evaluated and compared the nets with both sigmoid (see Table 4) and softmax (see Table 5) activation functions.

Table 4: Combinations of the nets by FNN with sigmoid

Net combi.  Precision  Recall  F1 [%]
1&2         86.1       82.1    84.1
1&3         87.1       81.5    84.2
1&4         88.4       81.9    85.0
2&3         86.6       81.4    83.9
2&4         87.7       82.0    84.7
3&4         89.3       80.0    84.4
1&2&3       86.9       82.4    84.6
1&2&4       87.9       82.8    85.3
1&3&4       88.2       82.5    85.2
2&3&4       87.9       82.2    85.0
1&2&3&4     88.0       82.8    85.3

Table 5: Combinations of the nets by FNN with softmax

Net combi.  Precision  Recall  F1 [%]
1&2         85.3       81.6    83.4
1&3         85.4       81.8    83.6
1&4         86.3       82.6    84.4
2&3         85.4       80.9    83.1
2&4         86.1       82.0    84.0
3&4         86.7       81.3    83.9
1&2&3       85.0       82.7    83.9
1&2&4       85.7       83.2    84.4
1&3&4       85.8       83.3    84.5
2&3&4       85.6       82.9    84.3
1&2&3&4     85.7       83.6    84.6

These tables show that these combinations also have a positive impact on the classification and that the sigmoid activation function brings better results than softmax.
This is a similar behaviour to the case of the individual nets. Moreover, as supposed, this supervised combination slightly outperforms both previously described unsupervised methods.

4.6 Final Results Analysis

Finally, we analyze the results for the different document types. The main criterion is the number of document labels. We assume that this number plays an important role for classification: intuitively, the documents with fewer labels will be easier to classify. We thus divided the documents into five distinct classes according to the number of labels (i.e. the documents with one, two, three and four labels, and the remaining documents). Then, we tried to determine an optimal threshold for every class and report the F-measure. This value is compared to the results obtained with the global threshold identified previously (one threshold for all documents).

The results of this analysis are shown in Figure 5. We have chosen two representative cases to analyze, the individual FDNN with softmax (upper graph) and the combination by the Averaged thresholding method (lower graph). The adaptive threshold means that the threshold is optimized for each group of documents separately. The fixed threshold is the one that was optimized on the development set. This figure confirms our assumption. The best classification results are for the documents with one label and then they decrease. Moreover, this analysis shows that this number plays a crucial role for document classification in all cases. Hypothetically, if we could determine the number of labels for a particular document before the thresholding, we could improve the final F-measure by 1.5%.

Figure 5: F-measure according to the number of labels for adaptive and fixed thresholds; the upper graph shows the results for the MLP with softmax while the lower one is for the combination of all nets
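Under the hypothetical setting of the last sentence, i.e. assuming the number of labels of a document is known before thresholding, the adaptive variant could be sketched as follows (our illustration, reusing the tune_threshold helper sketched in Section 4.2; groups with no development documents would need a fallback):

import numpy as np

def tune_adaptive_thresholds(dev_scores, dev_labels, max_group=5):
    # One threshold per group: documents with 1, 2, 3, 4 and >= 5
    # labels, tuned independently on the development set.
    counts = np.minimum(dev_labels.sum(axis=1), max_group)
    return {g: tune_threshold(dev_scores[counts == g],
                              dev_labels[counts == g])
            for g in range(1, max_group + 1)}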
4.7 Results on English Corpus

This experiment shows the results of our methods on the frequently used Reuters-21578 corpus. We present the results on the English dataset mainly for comparison with other state-of-the-art methods, since we cannot provide such a comparison on the Czech data. Table 6 shows the performance of the proposed models on the benchmark Reuters-21578 dataset. The bottom part of the table provides a comparison with other state-of-the-art methods.

Table 6: Results on the Reuters-21578 dataset

Method            Precision  Recall  F1 [%]
MLP/softmax       89.08      80.6    85.0
MLP/sigmoid       89.6       82.7    86.0
CNN/softmax       87.8       84.1    85.9
CNN/sigmoid       89.4       81.3    85.2
Supervised combi  91.4       84.1    87.6
NNAD [1]          90.4       83.4    86.8
BP-MLL_TAD*       84.2       84.2    84.2
BRR [22]          89.8       86.0    87.9

* Approach proposed by Zhang et al. [12] and used with ReLU activation, AdaGrad and dropout.

5 Conclusions and Future Work

In this paper, we have used several combination methods to improve the results of individual neural nets for multi-label document classification of Czech text documents. We have also presented the results of our methods on a standard English corpus. We have compared several popular (unsupervised and also supervised) combination methods.

The experimental results have confirmed our assumption that the different nets keep different information and that it is therefore useful to combine them to improve the classification score of the individual nets. We have also shown that thresholding is a good method to assign document labels in multi-label classification. We have further shown that the results of all the approaches are comparable. However, the best combination method is the supervised one which uses an FNN with sigmoid activation function. The F-measure on Czech is 85.3% while the best result for English is 87.6%. The results on both languages are thus at least comparable with the state of the art.
One perspective for further work is to improve the combination methods, since the error analysis has shown that there is still some room for improvement. We have also shown that knowing the number of labels could improve the result. Another perspective is thus to build a classifier with thresholds dependent on the number of labels.

Acknowledgements

This work has been supported by the project LO1506 of the Czech Ministry of Education, Youth and Sports. We would also like to thank the Czech News Agency (ČTK) for support and for providing the data.

References

[1] Nam, J., Kim, J., Mencía, E.L., Gurevych, I., Fürnkranz, J.: Large-scale multi-label text classification—revisiting neural networks. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer (2014) 437–452

[2] Lenc, L., Král, P.: Deep neural networks for Czech multi-label document classification. CoRR abs/1701.03849 (2017)

[3] Della Pietra, S., Della Pietra, V., Lafferty, J.: Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4) (1997) 380–393

[4] Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proceedings of the Fourteenth International Conference on Machine Learning. ICML '97, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. (1997) 412–420

[5] Ramage, D., Hall, D., Nallapati, R., Manning, C.D.: Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. EMNLP '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 248–256

[6] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12 (2011) 2493–2537

[7] Zhang, X., LeCun, Y.: Text understanding from scratch. arXiv preprint arXiv:1502.01710 (2015)

[8] Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)

[9] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of Workshop at ICLR. (2013)

[10] Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. (2015)

[11] Manevitz, L., Yousef, M.: One-class document classification via neural networks. Neurocomputing 70(7-9) (2007) 1466–1481

[12] Zhang, M.L., Zhou, Z.H.: Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18(10) (2006) 1338–1351

[13] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). (2010) 807–814

[14] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1) (2014) 1929–1958

[15] Kurata, G., Xiang, B., Zhou, B.: Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In: Proceedings of NAACL-HLT. (2016) 521–526

[16] Yang, Y., Gopal, S.: Multilabel classification with meta-level features in a learning-to-rank framework. Machine Learning 88(1-2) (2012) 47–68

[17] Brychcín, T., Král, P.: Novel unsupervised features for Czech multi-label document classification. In: 13th Mexican International Conference on Artificial Intelligence (MICAI 2014), Tuxtla Gutiérrez, Chiapas, Mexico, Springer (16-22 November 2014) 70–79

[18] Tulyakov, S., Jaeger, S., Govindaraju, V., Doermann, D.: Review of classifier combination methods. In: Machine Learning in Document Analysis and Recognition. Springer (2008) 361–386

[19] Chollet, F.: Keras. https://github.com/fchollet/keras (2015)

[20] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy). Volume 4., Austin, TX (2010) 3

[21] Powers, D.: Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies 2(1) (2011) 37–63

[22] Rubin, T.N., Chambers, A., Smyth, P., Steyvers, M.: Statistical topic models for multi-label document classification. Machine Learning 88(1-2) (2012) 157–208