=Paper=
{{Paper
|id=Vol-1180/CLEF2014wn-QA-PapanikolaouEt2014
|storemode=property
|title=Ensemble Approaches for Large-Scale Multi-Label Classification and Question Answering in Biomedicine
|pdfUrl=https://ceur-ws.org/Vol-1180/CLEF2014wn-QA-PapanikolaouEt2014.pdf
|volume=Vol-1180
|dblpUrl=https://dblp.org/rec/conf/clef/PapanikolaouDTLMV14
}}
==Ensemble Approaches for Large-Scale Multi-Label Classification and Question Answering in Biomedicine==
Yannis Papanikolaou (1), Dimitrios Dimitriadis (1), Grigorios Tsoumakas (1), Manos Laliotis (2), Nikos Markantonatos (3), and Ioannis Vlahavas (1)

(1) Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
yannis.papanik@gmail.com, {dndimitr,greg,vlahavas}@csd.auth.gr
(2) Atypon, 5201 Great America Parkway Suite 510, Santa Clara, CA 95054, USA
elalio@atypon.com
(3) Atypon Hellas, Dimitrakopoulou 7, Agia Paraskevi 15341, Athens, Greece
nikos@atypon.com

Abstract. This paper documents the systems that we developed for our participation in the BioASQ 2014 large-scale biomedical semantic indexing and question answering challenge. For the large-scale semantic indexing task, we employed a novel multi-label ensemble method consisting of support vector machines, labeled Latent Dirichlet Allocation models and meta-models predicting the number of relevant labels. This method proved successful in our experiments as well as during the competition. For the question answering task we combined different techniques for scoring candidate answers based on recent literature.

Keywords: ensemble methods · multi-label learning · support vector machines · latent Dirichlet allocation · BioASQ

1 Introduction

At the time this paper is being written, a simple query on the PubMed website for the number of articles included in the database since 1900 returns a total of 23,921,183 abstracts; 15,027,919 of them were published since 1990 and 8,412,522 in the last decade alone. These numbers cover only a portion of the total publications of the various scientific societies, as this open digital repository contains only articles from biomedicine and the life sciences. In short, the number of scientific publications is very large and appears to grow at a non-trivial rate each year.

A key issue for exploiting this fast-growing literature is the existence of semantic meta-data describing the topics of each publication. Searching the literature for a particular topic, discovering topic trends and many other tasks all rely on such meta-data. As manual annotation costs time and money, it is of great importance to automate this process. Furthermore, even in cases where manual annotation is affordable (e.g. PubMed), there is usually a crucial delay from the moment a new article is published until it gets annotated. However, automatic annotation of new articles is not an easy task, in spite of the numerous algorithms and tools available for text classification. We need to deal with millions of documents, millions of features and tens of thousands of concepts, the latter being also highly imbalanced. In addition, each instance can belong to many classes, making the problem multi-label in nature.

At the same time, such large bodies of knowledge are perfect sources for developing question answering systems capable of interacting with the scientific community in natural language. In support of researchers working on these problems, the BioASQ European project (http://www.bioasq.org) has developed a competition framework targeted at large-scale online semantic indexing (Task A) and question answering (Task B) in the domain of biomedicine.

This paper presents our approaches to both of these tasks for the 2014 challenge of BioASQ. We primarily worked on the semantic indexing task, developing a novel multi-label classifier ensemble, which proved successful both in our experiments and during the competition.
For the question answering task we successfully replicated a recent approach [1].

The rest of the paper is organized as follows. Section 2 offers background knowledge on the models and algorithms we employed. Section 3 presents our classifier selection approach for multi-label data. Section 4 describes the actual systems we used for the challenge and the experiments we performed. Section 5 presents our results. Section 6 presents our work on the question answering task. Finally, Section 7 concludes the paper and points to future work directions.

2 Background

This section provides a brief description of the models and algorithms used in our participation in Task 2A of the BioASQ challenge, along with the necessary theory.

2.1 Support Vector Machines

Support Vector Machines (SVMs) [2] have been used extensively in the literature for classification and regression tasks. Although in essence a non-probabilistic binary classification algorithm, SVMs have achieved state-of-the-art performance in numerous tasks and have been applied in multiple domains for solving learning problems. In our experiments we used the Liblinear package [3], with some minor modifications, which perfectly fitted our need for a very fast and scalable implementation.

2.2 MetaLabeler

The MetaLabeler [4] is essentially a meta-model employed in multi-label tasks to automatically determine the cardinality of the label set for a given instance. The idea is to train a linear regression model (e.g. with an SVM) whose input comes from some feature space (a simple option being the word tokens of each instance) and whose output is the number of labels associated with that instance.

The need for such a meta-model arises in multi-label problems where, given an instance, the model's output for each label is a score or a probability. In this case, every instance is associated with a ranking of labels and we need to set a threshold properly in order to obtain a hard assignment of labels. It should be noted that, apart from the MetaLabeler, a great deal of work in the literature addresses this particular problem [5] [6], but alternative solutions usually require a cross-validation procedure, which proves too time-consuming for large-scale data sets.

We also experimented with an approach similar to the MetaLabeler [7]. In this case, the target of the regression problem is not the actual number of labels but the one that maximizes some evaluation measure (the F-measure in our case). Thus, given a trained model, we employ it on a validation set to determine the number of labels that would maximize the F-measure for every instance. Although one would intuitively expect this approach to do better, as it also captures the misclassification errors of the classifiers, in practice its results were inferior to those of the MetaLabeler.
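To make the MetaLabeler idea concrete, the following is a minimal sketch (not the code used for the challenge) of a cardinality regressor built with scikit-learn; X, Y, scores and the helper names are illustrative placeholders:

```python
# Minimal MetaLabeler-style sketch (illustrative, not the Liblinear-based
# implementation used in the challenge). X is a sparse feature matrix,
# Y a binary label-indicator matrix of shape (n_docs, n_labels).
import numpy as np
from sklearn.svm import LinearSVR

def train_metalabeler(X, Y):
    """Regress the number of labels of each document on its features."""
    label_counts = np.asarray(Y.sum(axis=1)).ravel()
    meta = LinearSVR(C=1.0)
    meta.fit(X, label_counts)
    return meta

def predict_top_c(meta, X_new, scores, min_labels=1):
    """Cut a per-document label ranking at the predicted cardinality.

    `scores` holds one row of real-valued classifier scores per document.
    """
    cards = np.maximum(min_labels, np.rint(meta.predict(X_new)).astype(int))
    predictions = []
    for doc_scores, c in zip(scores, cards):
        ranking = np.argsort(-doc_scores)   # labels sorted by decreasing score
        predictions.append(ranking[:c])     # keep the top-c labels
    return predictions
```

The same top-c cut is what turns the score rankings described in section 4.2 into hard label assignments.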
2.3 Topic Models

Latent Dirichlet Allocation (LDA) is a powerful probabilistic model first introduced in [8] [9] in an unsupervised learning context. The key idea is that a corpus of documents hides a number of topics; given the corpus, the model attempts to learn the distribution of topics over documents (the Θ distribution) and the distribution of topics over word tokens (the Φ distribution, respectively). After learning these distributions, the trained model can be used either in a generative task (e.g. given some topics, produce new documents) or in an inference task (given some new documents, determine the topics they belong to). This model is naturally suited to multi-label problems, except that, being totally unsupervised, its resulting topics may be hard to interpret.

In the works of [10] and [11], the LDA theory is incorporated into a supervised learning context where each topic corresponds to a label of the corpus in a one-to-one correspondence. We implemented the LLDA and the prior LLDA variants of [11]. The only difference between the two is that the prior LLDA model takes into account the relative frequencies of the labels in the corpus, a crucial point for a problem with power-law statistics like the one we address (by a data set with power-law statistics we mean that the vast majority of labels have a very low frequency and only very few have a high frequency; for a more elaborate explanation refer to [12]). In our experiments the prior LLDA model performed significantly better than the simple LLDA, so we used the former in our systems. Even though this model's performance did not match that of the SVMs, we opted to use it on the grounds that it could do better for some labels, and therefore included it in two ensembles (see section 4.2).

3 A Classifier Selection Multi-Label Ensemble

The main idea behind ensembles is to exploit the fact that different classifiers may do well in different aspects of the learning task, so that combining them can improve overall performance. Ensembles have been used extensively in the literature [13], with stacking [14], bagging [15] and boosting [16] being the main methods employed. In the context of multi-label problems, [17] proposes a fusion method where the probabilistic outputs of heterogeneous classifiers are averaged and the labels above a threshold are chosen. In the same direction, a classifier selection scheme based on the F-measure is proposed in [18]: for each label and for each of the classifiers the F-measure is computed and the best performing classifier is chosen to predict that particular label. We tried the latter approach and, even for large validation data sets, we found a systematic decline in the micro-F measure.

In this work, we propose a different method oriented towards a classifier selection (rather than fusion) scheme. Essentially, we treat the problem as |L| different classification tasks and need to be able to tell which of the available models is most suitable for each of them. In the description below, we suppose that there is a baseline model (i.e. a model that has a better overall performance than the others), but our idea can be applied with minor modifications without this assumption. The main issue addressed by our work is how to select the binary component classifiers for each label so as to optimize the global micro-averaged F-measure over all labels.

Formally, suppose we have a baseline model A and q different models Bi, and we want to combine them in a multi-label task with input feature vectors x and output y, y ⊆ L, L being the set of labels. Instead of choosing a voting system for all labels, we can check for which labels each Bi performs better than A on some validation set, according to some evaluation metric eval. Let us denote

LBi = {l ∈ L : eval(Bi) > eval(A) and eval(Bi) > eval(Bj) for all j ≠ i}

and |LA| = |L| − Σi |LBi| respectively. Then, when predicting on unseen data, we predict the labels that belong to LA with model A and the labels belonging to each LBi with the respective model Bi.
There are two remaining issues to address: a) choosing a valid evaluation metric eval, and b) ensuring that the choices indicated by eval on a validation set generalize to new, unseen data. As the contest's main evaluation metric was the micro-F measure, we opted for it; as mentioned, we also tried the per-label F-measure, but it did not improve overall performance, even on the validation data set. Concerning the second issue, we initially tried to address it by simply relying on a large validation data set. However, after obtaining unfavorable results in the competition, we turned to a significance test, namely a McNemar test with a confidence level of 95%.

To sum up, we first predict with A (our baseline model) on a validation data set and then, for each label and for each model Bi, we check whether choosing Bi to predict that label improves the overall micro-F measure. If it does, the label becomes a candidate for LBi. Then, for every label in the candidate sets, we run a McNemar test (or multiple McNemar tests, accordingly) to check whether the difference in performance is statistically significant; if some Bi is significantly better than A on that label, we add the label to LBi. Below we show the pseudocode for this technique; an illustrative code sketch follows the listing. This approach proved successful in the competition, even when using relatively small data sets for validation (around 30k documents).

1. For all documents in the validation data set, assign the relevant labels from L predicting with model A.
2. For each model Bi:
   – For all documents in the validation data set, assign the relevant labels from L predicting with Bi.
3. For each label l ∈ L, calculate the true positives tpAl, false positives fpAl and false negatives fnAl of A.
4. For each model Bi:
   – For each label l ∈ L, calculate tpBil, fpBil and fnBil.
5. Set tpA = Σl tpAl, and fpA, fnA respectively.
6. Set the micro-F measure as mfA = 2tpA / (2tpA + fpA + fnA).
7. For each label l ∈ L:
   – For each model Bi:
     – subtract tpAl, fpAl and fnAl from tpA, fpA and fnA respectively;
     – add tpBil, fpBil and fnBil to tpA, fpA and fnA respectively;
     – if the new mfA is better than the previous one, add l to candidateListi.
8. For each label l:
   (a) If l belongs to just one candidateListi:
       – perform a McNemar test between models A and Bi with significance level 0.95;
       – if Bi is significantly better than A, add l to LBi.
   (b) If l belongs to more than one candidateListi:
       – perform a McNemar test between A and each Bi with significance level 0.95, applying an FWER correction with the Bonferroni-Holm step method;
       – if just one Bi is significantly better than A, add l to LBi;
       – else, if several Bi's are significantly better than A, choose the model Bi that has the highest score in the McNemar test against A (there is no need at this point to apply McNemar tests among the Bi models, because we are not interested in whether their differences in performance are significant; we just need to choose one of them, as we know they all do better than A).
9. Compute |LA| as |LA| = |L| − Σi |LBi|.
10. For all documents in the test data set, assign the relevant labels from LA predicting with model A.
11. For each model Bi:
    – For all documents in the test data set, assign the relevant labels from LBi predicting with model Bi.

A final note is that, when performing multiple statistical comparisons (that is, for more than two models), we need to control the family-wise error rate (FWER) for the statistical comparisons to be valid. In our case, as the McNemar tests we performed are non-parametric, we used the Bonferroni-Holm step method, as proposed in [19].
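The selection step can be sketched as follows for the simple case of a baseline A and a single challenger B (a hypothetical, self-contained illustration rather than the code used in the challenge; the generalisation to several challengers additionally applies the Holm correction):

```python
# Per-label selection between a baseline A and a challenger B, following the
# pseudocode above (illustrative sketch). pred_A, pred_B and truth are binary
# numpy arrays of shape (n_docs, n_labels) computed on a validation set.
import numpy as np

CHI2_CRIT = 3.841  # chi-square critical value, 1 d.o.f., alpha = 0.05

def counts(pred, truth):
    tp = np.logical_and(pred == 1, truth == 1).sum(axis=0)
    fp = np.logical_and(pred == 1, truth == 0).sum(axis=0)
    fn = np.logical_and(pred == 0, truth == 1).sum(axis=0)
    return tp, fp, fn

def micro_f(tp, fp, fn):
    return 2.0 * tp / (2.0 * tp + fp + fn)

def mcnemar_favors_b(pred_a, pred_b, truth):
    """Continuity-corrected McNemar test on a single label's predictions."""
    correct_a, correct_b = pred_a == truth, pred_b == truth
    b = int(np.sum(correct_a & ~correct_b))   # A right, B wrong
    c = int(np.sum(~correct_a & correct_b))   # A wrong, B right
    if b + c == 0:
        return False
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat > CHI2_CRIT and c > b         # significant and in favour of B

def select_labels(pred_A, pred_B, truth):
    tpA, fpA, fnA = counts(pred_A, truth)
    tpB, fpB, fnB = counts(pred_B, truth)
    base = micro_f(tpA.sum(), fpA.sum(), fnA.sum())
    chosen = []
    for l in range(truth.shape[1]):
        # micro-F if label l were predicted by B instead of A (steps 6-7)
        swapped = micro_f(tpA.sum() - tpA[l] + tpB[l],
                          fpA.sum() - fpA[l] + fpB[l],
                          fnA.sum() - fnA[l] + fnB[l])
        if swapped > base and mcnemar_favors_b(pred_A[:, l], pred_B[:, l],
                                               truth[:, l]):
            chosen.append(l)
    return chosen  # labels to predict with B; all others stay with A
```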
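A rough scikit-learn approximation of this preprocessing is sketched below (the actual pipeline was custom-built and also applied the zoning step described above; the exact token filter and the way the occurrence thresholds are counted are assumptions):

```python
# Approximate sketch of the dictionary extraction and tf-idf weighting of
# section 4.1 (illustrative; the competition pipeline was custom-built and
# additionally applied zoning to title and label features).
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),    # words and bi-grams
    min_df=7,              # roughly "more than 6 occurrences" (here counted as documents)
    max_df=0.5,            # drop terms appearing in more than half of the corpus
    stop_words="english",  # remove stop-words such as "and", "the"
    token_pattern=r"(?u)\b[a-zA-Z][a-zA-Z0-9-]+\b",  # alphabetic tokens only (assumption)
)

# documents: an iterable of title + abstract strings
# X = vectorizer.fit_transform(documents)  # sparse tf-idf matrix fed to the SVMs
```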
4.2 Systems used in the competition

We used five systems in the competition, which we named Asclepios, Hippocrates, Sisyphus, Galen and Panacea.

The first two systems are identical but were trained on data sets of different size. We trained |L| binary SVMs in a one-vs-all approach (one for each label) and a second-level model, the MetaLabeler, for predicting an instance's label cardinality. During prediction we slightly changed the Liblinear code to output a score instead of a binary decision for the SVMs. This way, for each instance we obtain a descending ranking of labels, from the ones with the highest scores to the ones with the lowest. Then, using the MetaLabeler, we predict a label cardinality c for that instance and choose the top c labels from the ranking. Asclepios was trained on the last 950 thousand documents, while Hippocrates was trained on the last 1.5 million documents.

The rest of the systems are ensembles implemented exactly as described in section 3. They all have Hippocrates as a component, which was the best performing system, so from now on we will refer to it as the baseline model.

The third system, Sisyphus, is an ensemble of two models, the baseline and a model of simple binary SVMs. We initially used vanilla (untuned) SVMs for the second model but then also tried to tune them. Feature scaling with BNS [20] was our first attempt, but the resulting models performed worse and training required very long times. The reason for the latter observation is that, when performing scaling or feature selection in a multi-label problem, the features' scaling factors are different for each label. This means that we need to vectorize the training corpus |L| times, a non-trivial task in our case where |L| is of the order of 10^4. If common scaling factors are used for all labels instead (e.g. tf-idf, as we did), vectorizing needs to be done only once for all labels.

Another effort to tune the SVMs was to experiment with different values of the C parameter (other than the default of 1), which did not yield significant improvements. We then used the idea of [21] and changed the weight parameter for positive instances (-w1): when training a classifier with very few positive instances we can choose to penalize a false negative (a positive instance being misclassified) more heavily than a false positive (a negative instance being misclassified). We followed this approach, unfortunately only just before the end of the third batch.

Table 1. Component models for the systems employed in the competition

| Systems | Binary SVM | MetaLabeler with SVMs (1.5m docs) | MetaLabeler with SVMs (4.2m docs) | LLDA |
|---|---|---|---|---|
| Asclepios | | x | | |
| Hippocrates | | x | | |
| Sisyphus | x | x | | |
| Galen | | x | | x |
| Panacea | x | x | x | x |

The fourth system, Galen, is an ensemble of the baseline model and a prior LLDA model. The fifth, Panacea, combines in an ensemble the baseline model (SVMs with score ranking and MetaLabeler), the tuned binary SVMs, the prior LLDA model (all trained on the last 1.5 × 10^6 documents) and a baseline model trained on the whole corpus (about 4.3m documents, except the last 50k). Even if at first glance it seems redundant to combine two identical models, the reason we did so is the following: the corpus contains articles from 1974 to 2014, and during this period a lot of things have changed concerning the semantics of some entities, the semantics of some labels and, most importantly, the distribution of labels over words. As a result, the first model, trained on 1.5 million documents (papers from 2007-2012), has a better performance in terms of the micro-F measure than the second one, trained on the whole corpus (papers between 1974-2012). Nonetheless, the second model learns more labels and is expected to do better on some very rare labels, for which it has more training instances. Driven by this observation we added this model to the ensemble, combining four models in total. Table 1 depicts the component models for the five systems.

5 Results

5.1 Parameter Setup

All SVM-based models were trained with default parameters (C=1, e=0.01). For the LLDA model, we used 10 Markov chains and averaged them, taking a total of 600 samples (one sample every 5 iterations) after a burn-in period of 300 iterations. The alpha and beta parameters were equal for all labels during training, with α = 50/|L| and β = 0.001. As noted in [11], the prior LLDA model reduces during prediction to an LDA model with the alpha parameter proportional to the frequency of each label. We set

α(l) = 50 × frequency(l) / totalLabelTokens + 30 / |L|

and took 200 samples (one every 5 iterations) after a burn-in of 300 iterations, from a single Markov chain. We note that there was a lot of room for improving the LLDA variant (e.g. averaging over more Markov chains or taking more samples), but unfortunately we did not have the time to do so.

Experiments were conducted on a machine with 40 processors and 1 TB of RAM. For the SVM models (apart from those with BNS scaling), the whole training procedure (dictionary extraction, vectorizing and training) for 1.5 × 10^6 documents, a vocabulary of 1.5 × 10^6 features and 26281 labels takes around 32 hours. The SVMs trained with BNS scaling require much longer, about 106 hours, while the LLDA model needs around 72 hours. Predicting for the 3.5 × 10^4 documents of Table 2 needs around 20 minutes for the SVMs and around 3 hours for the BNS SVMs. The prior LLDA model needs a very long time for prediction, around 33 hours; the reason is that the time needed by the Gibbs sampling algorithm is roughly proportional to the number of documents and the number of labels, which in our case are both of the order of tens of thousands. For test sets of the size of the BioASQ batches (∼5000 documents), prediction with the LLDA needed around 4 hours.
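To make the α(l) setting above concrete, the following small sketch computes the per-label Dirichlet prior from label counts (illustrative only; label_counts and its exact semantics are assumptions, with frequency(l) read as the number of training label tokens of label l):

```python
# Sketch of the asymmetric alpha prior used for prior-LLDA prediction,
# following the formula of section 5.1 (illustrative; `label_counts` maps
# each label to its number of label tokens in the training corpus).
def alpha_prior(label_counts):
    total_label_tokens = float(sum(label_counts.values()))
    num_labels = len(label_counts)
    return {
        label: 50.0 * count / total_label_tokens + 30.0 / num_labels
        for label, count in label_counts.items()
    }

# Example: alpha_prior({"humans": 9000, "female": 5000, "neoplasms": 1000})
# distributes 50 pseudo-counts proportionally to label frequency and adds a
# small uniform 30/|L| smoothing term, so frequent labels get a larger prior.
```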
5.2 Results

In this section we present the results of our experiments. Tables 2 and 3 show the performance of our component models in terms of the micro-F and macro-F measures. We can see that the MetaLabeler trained on 1.5m documents performs best overall, with the tuned SVMs following. We can also observe that the MetaLabeler trained on 4.2 million documents is worse than the one trained on 1.5m documents, although it learns 228 more labels. The prior LLDA model does not perform nearly as well as the SVM variants.

Table 2. Results for the models with which we experimented, trained on the last 1.5 million documents of the corpus and tested on 35k already annotated documents from the competition batches

| Classifier | no. of labels | Micro-F | Macro-F |
|---|---|---|---|
| Vanilla SVMs | 26281 | 0.56192 | 0.33190 |
| MetaLabeler (1.5m documents) | 26281 | 0.59461 | 0.43622 |
| SVMs with BNS scaling | 26281 | 0.51024 | 0.27980 |
| Tuned SVMs (-w1 parameter) | 26281 | 0.58330 | 0.37729 |
| MetaLabeler (4.2m documents) | 26509 | 0.58508 | 0.42929 |
| Prior labeled LDA | 26281 | 0.38321 | 0.29563 |

Table 3. Results for the component models of our systems, trained on the last 1.5 million documents of the corpus and tested on 12.3k already annotated documents from the competition batches

| Classifier | no. of labels | Micro-F | Macro-F |
|---|---|---|---|
| MetaLabeler (1.5m documents) | 26281 | 0.60921 | 0.44745 |
| Tuned SVMs (-w1 parameter) | 26281 | 0.60296 | 0.40705 |
| MetaLabeler (4.2m documents) | 26509 | 0.55350 | 0.39926 |
| Prior labeled LDA | 26281 | 0.37662 | 0.40125 |

Table 4 shows the performance of the four systems described in section 4.2. Asclepios is omitted as it is identical to Hippocrates. Results are shown for 12.3k documents, having used 35k documents for validation. We can see that the ensemble systems perform better than the baseline (Hippocrates), with Panacea and Sisyphus reaching the best performance, even though the validation data set is relatively small.

Table 4. Results for the systems that participated in the BioASQ challenge

| Systems | Micro-F | Macro-F |
|---|---|---|
| Hippocrates | 0.60921 | 0.44745 |
| Sisyphus | 0.61323 | 0.44816 |
| Galen | 0.60949 | 0.44880 |
| Panacea | 0.61368 | 0.44893 |

6 Question Answering

Being newcomers to the area of question answering, our modest goal was to replicate work already existing in the literature. We decided to focus on [1], an approach presented at the 2013 BioASQ Workshop for extracting answers to factoid questions. Furthermore, we only addressed phase B of the question answering task, taking the gold (correct) relevant concepts, articles, snippets and RDF triples from the benchmark data sets as input.

For each factoid question, our system first extracts the lexical answer type (LAT). This is achieved by splitting the question into words, extracting the part-of-speech of each word and finally extracting the first consecutive nouns or adjectives in the word list of the question. Then, each of the relevant snippets is split into sentences and each of these sentences is processed with the 2013 release of MetaMap [22] in order to extract candidate answers. For each candidate answer c, we calculated five scores, similarly to [1].
Let I denote an indicator function, returning 1 if its input is true and 0 otherwise. The first score is prominence, which considers the frequency of each candidate answer c within the set of sentences S of the relevant snippets:

Prominence(c) = ( Σ_{s∈S} I(c ∈ s) ) / |S|    (1)

The second score is a version of prominence that further takes into account the cosine similarity of the question q with each sentence:

WeightedProminence(c) = ( Σ_{s∈S} similarity(q, s) · I(c ∈ s) ) / ( Σ_{s∈S} similarity(q, s) )    (2)

The third score, specificity, considers the (in)frequency of each candidate answer in the corpus A of PubMed abstracts released by BioASQ:

Specificity(c) = log( |A| / Σ_{a∈A} I(c ∈ a) ) / log(|A|)    (3)

The fourth and fifth scores consider the semantic type(s) of the candidate answers as detected by MetaMap. In particular, they examine whether these types intersect with the semantic type(s) of the question's LAT (fourth score) and of the whole question (fifth score):

TypeCoercionLAT(c) = 1 if SemType(c) ∩ SemType(LAT) ≠ ∅, and 0 otherwise    (4)

TypeCoercionQuestion(c) = 0.5 if SemType(c) ∩ SemType(q) ≠ ∅, and 0 otherwise    (5)

Table 5 presents the results of the above scores, as well as of their combinations, on the 42 factoid questions out of the 100 questions provided by BioASQ as a training set. Results are presented in terms of the three metrics of the BioASQ competition: strict accuracy (SAcc), which compares the correct answer with the top candidate; lenient accuracy (LAcc), which compares the correct answer with the top 5 candidates; and mean reciprocal rank (MRR), which takes into account the position of the correct answer within the ranking of candidates.

Table 5. Results of the different scores and their combinations

| Scoring | SAcc | LAcc | MRR |
|---|---|---|---|
| Prominence (P) | 9% | 31% | 16% |
| WeightedProminence (WP) | 23% | 31% | 25% |
| Specificity (S) | 4% | 23% | 11% |
| P + WP + S | 31% | 43% | 35% |
| P + WP + S + TypeCoercionLAT (TCLAT) | 26% | 40% | 31% |
| P + WP + S + TCLAT × 0.5 | 29% | 45% | 35% |
| P + WP + S + TypeCoercionQuestion (TCQ) | 24% | 45% | 33% |
| P + WP + S + TCQ × 0.5 | 29% | 48% | 36% |
| P + WP + S + TCQ × 0.5 + TCLAT | 24% | 43% | 32% |
| P + WP + S + TCQ + TCLAT × 0.5 | 24% | 48% | 35% |

Interestingly, we notice that in terms of SAcc the best results are obtained by combining the first three, non-semantic, scorings. In terms of LAcc, the best results are obtained when combining the first three scorings with TCQ weighted by 0.5, or with TCQ weighted by 1 and TCLAT weighted by 0.5. The best results in terms of MRR are obtained when combining the first three scorings with TCQ weighted by 0.5.
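The five scores are straightforward to compute once candidates, semantic types and a sentence similarity function are available; the sketch below is a simplified, self-contained illustration (the containment test, the `similarity` function and the additive combination with weights are assumptions, not the exact implementation):

```python
# Illustrative sketch of the candidate-answer scores of equations (1)-(5).
# Candidate containment is approximated by a case-insensitive substring test;
# semantic types are passed in as Python sets (e.g. as returned by MetaMap).
import math

def contains(candidate, text):
    return 1 if candidate.lower() in text.lower() else 0

def prominence(c, sentences):
    return sum(contains(c, s) for s in sentences) / len(sentences)

def weighted_prominence(c, question, sentences, similarity):
    den = sum(similarity(question, s) for s in sentences)
    num = sum(similarity(question, s) * contains(c, s) for s in sentences)
    return num / den if den else 0.0

def specificity(c, abstracts):
    df = sum(contains(c, a) for a in abstracts)
    if df == 0:
        return 1.0  # unseen candidates get the maximum score (assumption)
    return math.log(len(abstracts) / df) / math.log(len(abstracts))

def type_coercion_lat(cand_types, lat_types):
    return 1.0 if cand_types & lat_types else 0.0

def type_coercion_question(cand_types, question_types):
    return 0.5 if cand_types & question_types else 0.0

def combined_score(c, question, sentences, abstracts, similarity,
                   cand_types, lat_types, question_types,
                   w_tclat=0.0, w_tcq=0.0):
    # The rows of Table 5 are read here as additive combinations of the
    # individual scores, with optional down-weighting of the type-coercion terms.
    return (prominence(c, sentences)
            + weighted_prominence(c, question, sentences, similarity)
            + specificity(c, abstracts)
            + w_tclat * type_coercion_lat(cand_types, lat_types)
            + w_tcq * type_coercion_question(cand_types, question_types))
```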
7 Conclusions and Future Work

In this paper we presented our participation in both tasks of the BioASQ challenge, introducing a novel multi-label classifier ensemble method. This approach was successful both in our experiments and during the competition, with the ensemble systems outperforming the baseline models.

While experimenting with different data sets, we noticed a significant change in the performance of the models over time. It would be really interesting to study this concept drift over time in a systematic way, as it could yield interesting observations about trends in the literature, changes in the meaning of terms and, from a machine learning point of view, changes in the hidden distribution. Concerning the LLDA model, we think that there is a lot of room for improvement. For instance, parallelization or a faster Gibbs sampling scheme during the prediction phase could improve performance by allowing more samples to be drawn. Either way, a hybrid approach exploiting both the SVM and the LDA theory could bring significant improvements on the multi-label classification problem.

References

1. Weissenborn, D., Tsatsaronis, G., Schroeder, M.: Answering factoid questions in the biomedical domain. In: Ngomo, A.C.N., Paliouras, G. (eds.): BioASQ@CLEF. Volume 1094 of CEUR Workshop Proceedings, CEUR-WS.org (2013)
2. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3) (1995) 273–297
3. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: Liblinear: A library for large linear classification. J. Mach. Learn. Res. 9 (June 2008) 1871–1874
4. Tang, L., Rajan, S., Narayanan, V.K.: Large scale multi-label classification via MetaLabeler. In: WWW '09: Proceedings of the 18th International Conference on World Wide Web, New York, NY, USA, ACM (2009) 211–220
5. Yang, Y.: A study of thresholding strategies for text categorization. In: SIGIR '01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, ACM (2001) 137–145
6. Fan, R.E., Lin, C.J.: A study on threshold selection for multi-label classification. Technical report, National Taiwan University (2007)
7. Nam, J., Kim, J., Gurevych, I., Fürnkranz, J.: Large-scale multi-label text classification - revisiting neural networks. CoRR abs/1312.5419 (2013)
8. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (March 2003) 993–1022
9. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences 101(Suppl. 1) (April 2004) 5228–5235
10. Ramage, D., Hall, D., Nallapati, R., Manning, C.D.: Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP '09), Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 248–256
11. Rubin, T.N., Chambers, A., Smyth, P., Steyvers, M.: Statistical topic models for multi-label document classification. Mach. Learn. 88(1-2) (July 2012) 157–208
12. Yang, Y., Zhang, J., Kisiel, B.: A scalability analysis of classifiers in text categorization. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), New York, NY, USA, ACM (2003) 96–103
13. Dietterich, T.G.: Ensemble methods in machine learning. In: Proceedings of the 1st International Workshop on Multiple Classifier Systems (2000) 1–15
14. Wolpert, D.H.: Stacked generalization. Neural Netw. 5(2) (February 1992) 241–259
15. Breiman, L.: Bagging predictors. Mach. Learn. 24(2) (August 1996) 123–140
16. Schapire, R.E.: The strength of weak learnability. Mach. Learn. 5(2) (July 1990) 197–227
17. Tahir, M.A., Kittler, J., Bouridane, A.: Multilabel classification using heterogeneous ensemble of multi-label classifiers. Pattern Recogn. Lett. 33(5) (2012) 513–523
18. Jimeno-Yepes, A., Mork, J.G., Demner-Fushman, D., Aronson, A.R.: A one-size-fits-all indexing method does not exist: Automatic selection based on meta-learning. JCSE 6(2) (2012) 151–160
19. Demsar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7 (2006) 1–30
20. Forman, G.: BNS feature scaling: an improved representation over tf-idf for SVM text classification. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08), New York, NY, USA, ACM (2008) 263–270
21. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res. 5 (2004) 361–397
22. Aronson, A.R., Lang, F.M.: An overview of MetaMap: historical perspective and recent advances. JAMIA 17(3) (2010) 229–236