<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Results of the BioASQ Track of the Question Answering Lab at CLEF 2014</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>George Balikas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Partalas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axel-Cyrille Ngonga Ngomo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Krithara</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric Gaussier</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George Paliouras</string-name>
        </contrib>
      </contrib-group>
      <fpage>1181</fpage>
      <lpage>1193</lpage>
      <abstract>
        <p>The goal of this task is to push the research frontier towards hybrid information systems. We aim to promote systems and approaches that are able to deal with the whole diversity of the Web, especially for, but not restricted to, the context of bio-medicine. This goal is pursued by the organization of challenges. The second challenge consisted of two tasks: semantic indexing and question answering. In the semantic indexing task, 61 systems from 18 different teams participated, of which between 25 and 45 took part in each batch. Phase A of the question answering task was tackled by 22 systems, which were developed by 8 different organizations. Between 15 and 19 of these systems addressed each batch. Phase B of the question answering task was tackled by 18 different systems, developed by 7 different organizations. Between 9 and 15 of these systems submitted results in each batch. Overall, the best systems were able to outperform the strong baselines provided by the organizers.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The aim of this paper is twofold. First, we aim to give an overview of the data
issued during the BioASQ track of the Question Answering Lab at CLEF 2014.
In addition, we aim to present the systems that participated in the challenge
and for which we received system descriptions. In particular, we aim to evaluate
their performance with respect to dedicated baseline systems. To achieve these goals,
we begin by giving a brief overview of the tasks included in the track, including
the timing of the different tasks and the challenge data. Thereafter, we give an
overview of the systems which participated in the challenge and provided us
with an overview of the technologies they relied upon. Detailed descriptions of
some of the systems are given in the lab proceedings. The evaluation of the systems,
which was carried out by using state-of-the-art measures or manual assessment,
is the last focal point of this paper. The conclusion sums up the results of the
track.</p>
    </sec>
    <sec id="sec-2">
      <title>Overview of the Tasks</title>
      <p>
        The challenge comprised two tasks: (1) a large-scale semantic indexing task (Task
2a) and (2) a question answering task (Task 2b).
Large-scale semantic indexing. In Task 2a the goal is to classify documents from
the PubMed<sup>1</sup> digital library onto concepts of the MeSH<sup>2</sup> hierarchy. Here, new
PubMed articles that are not yet annotated are collected on a weekly basis.
These articles are used as test sets for the evaluation of the participating
systems. As soon as the annotations are available from the PubMed curators, the
performance of each system is calculated by using standard information retrieval
measures as well as hierarchical ones. The winners of each batch were decided
based on their performance in the Micro F-measure (MiF) from the family of flat
measures [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], and the Lowest Common Ancestor F-measure (LCA-F) from the
family of hierarchical measures [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. For completeness, several other flat and
hierarchical measures were reported [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In order to provide an on-line and large-scale
scenario, the task was divided into three independent batches. In each batch 5
test sets of biomedical articles were released consecutively. Each of these test
sets was released on a weekly basis and the participants had 21 hours to provide
their answers. Figure 1 gives an overview of the time plan of Task 2a.
[Figure 1: Time plan of Task 2a. Dates on the timeline: February 4, March 11, April 15, May 20.]
      </p>
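      <p>As an illustration of the flat measure used to rank systems in Task 2a, the following is a minimal sketch of the micro-averaged F-measure for multi-label documents. It follows the textbook definition (true positives, false positives and false negatives are pooled over all documents before precision and recall are computed); the function name, the variable names and the toy labels are assumptions made for this example and are not taken from the official evaluation code.</p>
      <preformat>
```python
def micro_f1(gold, predicted):
    """Micro-averaged F-measure over a collection of multi-label documents.

    gold and predicted are parallel lists; each item is the set of labels
    assigned to one document.
    """
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        g, p = set(gold_labels), set(pred_labels)
        tp += len(g.intersection(p))  # labels that are both predicted and correct
        fp += len(p - g)              # predicted labels that are not in the gold set
        fn += len(g - p)              # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with hypothetical MeSH-like labels for two documents.
gold = [{"Humans", "Neoplasms"}, {"Mice"}]
predicted = [{"Humans", "Animals"}, {"Mice"}]
score = micro_f1(gold, predicted)
print(round(score, 4))
```
      </preformat>
      <p>Because the counts are pooled before averaging, documents with many labels weigh more than documents with few labels, which is the main difference from a macro-averaged variant.</p>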
      <p>
        Biomedical semantic QA. The goal of Task 2b was to provide a large-scale
question answering challenge where the systems should be able to cope with all
the stages of a question answering task, including the retrieval of relevant
concepts and articles, as well as the provision of natural-language answers. Task
2b comprised two phases. In phase A, BioASQ released questions in English
from benchmark datasets created by a group of biomedical experts. There were
four types of questions: "yes/no" questions, "factoid" questions, "list" questions
and "summary" questions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Participants had to respond with relevant concepts
(from specific terminologies and ontologies), relevant articles (PubMed and
PubMedCentral<sup>3</sup> articles), relevant snippets extracted from the relevant articles and
relevant RDF triples (from specific ontologies). In phase B, the released questions
contained the correct answers for the required elements (concepts, articles,
snippets and RDF triples) of the first phase. The participants had to answer with
exact answers as well as with paragraph-sized summaries in natural language
(dubbed ideal answers).
1 http://www.ncbi.nlm.nih.gov/pubmed/
2 http://www.ncbi.nlm.nih.gov/mesh/
3 http://www.ncbi.nlm.nih.gov/pmc/
[Figure: Time plan of Task 2b (Phase A and Phase B). Dates on the timeline: March 3, March 4, March 19, March 20, April 2, April 3.]
      </p>
      <p>
        The task was split into five independent batches. The two phases for each
batch were run with a time gap of 24 hours. For each phase, the participants
had 24 hours to submit their answers. We used well-known measures such as
mean precision, mean recall, mean F-measure, mean average precision (MAP)
and geometric MAP (GMAP) to evaluate the performance of the participants in
Phase A. The winners were selected based on MAP. The evaluation in phase B
was carried out manually by biomedical experts on the ideal answers provided
by the systems. For the sake of completeness, ROUGE [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is also reported.
      </p>
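      <p>The ranking measures named above can be sketched as follows. This is a minimal illustration that assumes the textbook definitions: average precision normalised by the number of relevant items, MAP as the arithmetic mean over questions, and GMAP as the geometric mean with a small smoothing constant. The exact normalisation and the smoothing constant used by the official evaluation are not stated here, so the value of epsilon and all identifiers below are assumptions.</p>
      <preformat>
```python
import math


def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Arithmetic mean of the average precision over all questions."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def geometric_map(runs, epsilon=1e-5):
    """Geometric mean of the (smoothed) average precision over all questions."""
    logs = [math.log(average_precision(r, rel) + epsilon) for r, rel in runs]
    return math.exp(sum(logs) / len(logs))


# Two hypothetical questions: (ranked list returned, set of relevant items).
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d4", "d5"], {"d5"}),
]
print(round(mean_average_precision(runs), 4))
print(round(geometric_map(runs), 4))
```
      </preformat>
      <p>GMAP penalises systems that fail badly on individual questions more strongly than MAP does, since a single near-zero average precision pulls the geometric mean down.</p>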
    </sec>
    <sec id="sec-3">
      <title>Overview of Participants</title>
      <p>The participating systems in the semantic indexing task of the BioASQ
challenge adopted a variety of approaches including hierarchical and flat algorithms
as well as search-based approaches that relied on information retrieval
techniques. In the rest of this section we describe the proposed systems and stress their
key characteristics.</p>
      <p>
        The new NCBI system [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] for Task 2a is an extension of the work
presented in 2013 and relies on the generic learning-to-rank approach presented in
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This novel approach, dubbed LAMBDA-MART, differs from the previous
approach in the following aspects. First, the set of features has been extended
to include binary classifier results. In addition, the set of documents used as
neighbor documents was reduced to documents indexed after 2009. Moreover,
the score function for the selection of the number of features was changed from
a linear to a logarithmic approach. Overall, the novel approach achieves an
F-measure between 0 (RDF triples) and 0.38 (concepts).
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] flat classification processes were employed for the semantic indexing
task. In particular, the authors trained binary SVM classifiers for each label that
was present in the data. In order to reduce the complexity they trained the SVMs
on fractions of the data. They trained two systems on different corpora: Asclepios
on 950 thousand documents and Hippocrates on 1.5 million. Those systems
output ranked lists of labels, and a meta-model, namely MetaLabeler [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], is
used to decide the number of labels that will be submitted for each document.
The remaining three systems of the team employ ensemble learning methods. The
approach that worked best was a combination of Hippocrates with a model of
simple binary SVMs, which were trained by changing the weights parameter for
positive instances [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. During the training of a classifier with very few positive
instances, one can choose to penalize a false negative (a positive instance being
misclassified) more than a false positive (a negative instance being misclassified).
The proposed approaches, although relatively simple, require a lot of
processing power and memory. For that reason the authors used a machine with 40
processors and 1TB RAM.
      </p>
      <p>
        Ribadas et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] employ hierarchical models based on a top-down
hierarchical classification scheme [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and a Bayesian network which models the
hierarchical relations among the labels as well as the training data. The team
participated in the first edition of the BioASQ challenge using the same
technologies [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. In the current competition they focused on the pre-processing of
the textual data while keeping the same classification models. More specifically,
the authors employ techniques for identifying abbreviations in the text and
expanding them afterwards in order to enrich the document. Also, a part-of-speech
tagger is used in order to tokenize the text and identify nouns, verbs, adjectives
and unknown (unidentified) elements. Finally, a lemmatization step extracts
the canonical forms of those words. Additionally, the authors extract word
bigrams and keep only those that are identified as multiword terms. The rationale
is that multiword terms in a domain with complex terminology, like biomedicine,
provide higher discriminant power.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] the authors use a standard flat classification scheme, where an SVM is
trained for each class label in MeSH. Different training set methodologies are
used, resulting in different trained classifiers. Due to computational issues only
50,000 documents were used for training. The selection of the best classification
scheme is optimized on the precision at top k labels on a validation set.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] the authors used the learning to rank (LTR) method for predicting
MeSH headings. However, in addition to the information from similar citations,
they also used the prediction scores from individual MeSH classifiers to improve
the prediction accuracy. In particular, they trained a binary classifier
(logistic regression) for each label (MeSH heading). For a target citation, using the
trained classifiers, they calculated the annotation probability (score) of every
MeSH heading. Then, using NCBI efetch<sup>4</sup>, they retrieved similar citations for
the neighbor scores. Finally, these two scores, together with the default results
of the official NLM solution MTI, were considered as features in the LTR framework.
LambdaMART [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] was used as the ranking method in the learning to rank
framework.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the authors proposed a system which uses Latent Semantic Analysis to
identify semantically similar documents in MEDLINE and then constructs a list of
MeSH headers from candidates selected from the documents most similar to a
new abstract.
4 http://www.ncbi.nlm.nih.gov/books/NBK25499/
      </p>
      <p>
        Table 1 summarizes the principal technologies that were employed by the
participating systems and whether a hierarchical or a flat approach has been followed.
Baselines. During the first challenge two systems served as baseline
systems. The first one, dubbed BioASQ Baseline, follows an unsupervised
approach to tackle the problem and so it is expected that the systems developed
by the participants will outperform it. The second baseline is a
state-of-the-art method called Medical Text Indexer [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] which is developed by the National
Library of Medicine<sup>5</sup> and serves as a classification system for articles of
MEDLINE. MTI is used by curators in order to assist them in the annotation process.
The new annotator is an extension of the system presented in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] with the
approaches of last year's winner [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Consequently, we expected the baseline
to be difficult to beat.
      </p>
      <sec id="sec-3-1">
        <title>Task 2b</title>
        <p>As mentioned above, the second task of the challenge is split into two phases. In
the first phase, where the goal is to annotate questions with relevant concepts,
documents, snippets and RDF triples, 8 teams with 22 systems participated. In
the second phase, where teams are requested to submit exact and paragraph-sized
answers for the questions, 7 teams with 18 different systems participated.</p>
        <p>
          The system presented in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] relies on the Hana Database for text processing.
It uses the Stanford CoreNLP package for tokenizing the questions. Each
token is then sent to the BioPortal and to the Hana database for concept
retrieval. The concepts retrieved from the two stores are finally merged into a
single list that is used to retrieve relevant text passages from the documents
at hand. To this end, four different types of queries are sent to the BioASQ
services. Overall, the approach achieves between 0.18 and 0.23 F-measure.
        </p>
        <p>
          The approach proposed by NCBI [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] for Task 2b can be used in combination
with the approach by the same group for Task 2a. In phase A, NCBI's
framework used the cosine similarity between question and sentence to compute their
similarity. The best scoring sentence from an abstract was chosen as the relevant
snippet for an answer. Concept recognition was achieved by a customized
dictionary lookup algorithm in combination with MetaMap. For phase B, tailored
approaches were used depending on the question types. For example, a manual
set of rules was crafted to determine the answers to factoid and list questions
based on the benchmark data for 2013. The system achieved an F-measure of
between 0.2% (RDF triples) and 38.48% (concepts). It performed very well
on yes/no questions (up to 100% accuracy). Factoid and list questions led to
an MRR of up to 20.57%.
5 http://ii.nlm.nih.gov/MTI/index.shtml
        </p>
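        <p>The snippet selection step described above can be illustrated with a small sketch. It assumes plain term-frequency bag-of-words vectors and whitespace tokenisation; the term weighting and tokenisation actually used by the NCBI system are not specified here, so these choices, as well as the function names and example sentences, are assumptions made for illustration only.</p>
        <preformat>
```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using raw term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_snippet(question, sentences):
    """Return the sentence of an abstract that is most similar to the question."""
    return max(sentences, key=lambda s: cosine_similarity(question, s))


# Hypothetical question and abstract sentences.
question = "which gene is mutated in cystic fibrosis"
sentences = [
    "cystic fibrosis is caused by mutations in the cftr gene",
    "the study enrolled 40 patients",
]
print(best_snippet(question, sentences))
```
        </preformat>
        <p>In practice an inverse document frequency weighting would usually be added so that frequent function words contribute less to the score.</p>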
        <p>
          In [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] the authors participated only in the document retrieval of phase A and
in the generation of ideal answers in phase B. The Indri search engine is used
to index the PubMed articles and different models are used to retrieve
documents, such as pseudo-relevance feedback, the sequential dependence model and a
semantic concept-enriched dependence model, where the recognised UMLS concepts in
the query are used as additional dependence features for ranking documents. For
the generation of ideal answers the authors retrieve sentences from documents
and identify the common keywords. Then the sentences are ranked according to
the number of times these keywords appear in each of them and finally the top-ranked
m sentences are used to form the ideal answer.
        </p>
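        <p>The keyword-based generation of ideal answers can be sketched as follows. The sketch assumes that a "common keyword" is a non-stopword token occurring in at least two of the retrieved sentences; the stopword list, the threshold min_sentences and all identifiers are hypothetical and are not taken from the described system.</p>
        <preformat>
```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"the", "of", "in", "a", "is", "and", "to", "by"}


def tokenize(sentence):
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    return [t.strip(".,;:()").lower() for t in sentence.split()]


def ideal_answer(sentences, m=2, min_sentences=2):
    """Concatenate the m sentences containing the most common keywords."""
    sentence_freq = Counter()
    for s in sentences:
        sentence_freq.update(set(tokenize(s)) - STOPWORDS)
    # Assumption: a keyword is "common" if it occurs in at least min_sentences sentences.
    keywords = {t for t, n in sentence_freq.items() if t and n >= min_sentences}

    def score(s):
        return sum(1 for t in tokenize(s) if t in keywords)

    ranked = sorted(sentences, key=score, reverse=True)  # stable for ties
    return " ".join(ranked[:m])


sentences = [
    "CFTR mutations cause cystic fibrosis.",
    "The trial enrolled forty adults.",
    "Cystic fibrosis is linked to CFTR.",
]
print(ideal_answer(sentences, m=2))
```
        </preformat>
        <p>Ties keep the original retrieval order because the sort is stable, which is one simple way to make the output deterministic.</p>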
        <p>
          The authors of [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] propose a method for the retrieval of relevant documents
and snippets of Task 2b. They develop a figure-inspired text retrieval method as
a way of retrieving documents and text passages from biomedical publications.
The method is based on the insight that for biomedical publications, the figures
play an important role to the point that the captions can be used to provide
abstract-like summaries. The proposed approach takes an Information Retrieval
perspective on the problem. In principle, the steps followed are: (i) the question
is enriched by query expansion with information from UMLS, Wikipedia, and
figures, (ii) a ranking of full documents and snippets is retrieved from a corpus
of PubMed Central articles, which is the set of articles available in full text, (iii)
features are extracted for each document and snippet that provide proof of its
relevance for the question and (iv) the documents/snippets are re-ranked with
a learning-to-rank approach.
        </p>
        <p>
          In the context of phase B of task 2b in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], the authors attempted to replicate
the work that already exists in literature and was presented in the BioASQ 2013
workshop [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. They provided exact answers only for the factoid questions. Their
system tries to extract the lexical answer type by manipulating the words of the
question. Then, the relevant snippets of the question, which are provided as
inputs for this task, are processed with the 2013 release of MetaMap [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] in order
to extract candidate answers.
        </p>
        <p>Baselines. Two baselines were used in phase A. The systems return the list of
the top-50 and the top-100 entities respectively that may be retrieved using the
keywords of the input question as a query to the BioASQ services. As a result,
two lists for each of the main entities (concepts, documents, snippets, triples)
are produced, of a maximum length of 50 and 100 items respectively.</p>
        <p>
          For the baseline in Task 2b Phase B, three approaches
were created that address, respectively, the answering of factoid and list
questions, summary questions, and yes/no questions [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. The three approaches were
combined into one system, and they constitute the BioASQ baseline for this
phase of Task 2b. The baseline approach for the list/factoid questions uses
an ensemble of scoring schemes that attempt to prioritize the concepts
that answer the question by assuming that the type of the answer aligns with the
lexical answer type (type coercion). The baseline approach for the summary
questions introduces a multi-document summarization method using Integer Linear
Programming and Support Vector Regression.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>During the evaluation phase of Task 2a, the participants submitted their
results on a weekly basis to the online evaluation platform of the challenge<sup>6</sup>. The
evaluation period was divided into three batches containing 5 test sets each.
18 teams participated in the task with a total of 61 systems. 12,628,968
articles with 26,831 labels (20.31GB) were provided as training data to the
participants. Table 2 shows the number of articles in each test set of each batch of
the challenge.</p>
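      <p>[Table 2, partially recovered; only the "Labels per article" column survives: 13.20, 13.13, 13.32, 13.02, 13.07, 13.15, 13.05, 12.28, 12.90, 13.23, 13.58, 13.01, 12.71, 13.37, 13.32, 13.90, 12.70, 13.20, 13.12]</p>
      <table-wrap>
        <label>Table 3</label>
        <caption>
          <p>Correspondence between system descriptions and submitted systems in Task 2a.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Reference</th><th>Systems</th></tr>
          </thead>
          <tbody>
            <tr><td>[<xref ref-type="bibr" rid="ref18">18</xref>]</td><td>Asclepius, Hippocrates, Sisyphus</td></tr>
            <tr><td>[<xref ref-type="bibr" rid="ref20">20</xref>]</td><td>cole_hce1, cole_hce2, cole_hce_ne, utai_rebayct, utai_rebayct_2</td></tr>
            <tr><td>[<xref ref-type="bibr" rid="ref5">5</xref>]</td><td>SNUMedInfo*</td></tr>
            <tr><td>[<xref ref-type="bibr" rid="ref13">13</xref>]</td><td>Antinomyra-*</td></tr>
            <tr><td>[<xref ref-type="bibr" rid="ref26">26</xref>]</td><td>L2R*</td></tr>
            <tr><td>Baselines</td><td>MTIFL, MTI-Default, bioasq_baseline</td></tr>
          </tbody>
        </table>
      </table-wrap>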
      <p>
        Table 3 presents the correspondence between the systems for which a description
was available and the submitted systems in Task 2a. The systems MTIFL,
MTI-Default and BioASQ Baseline were the baseline systems used throughout the
challenge. MTIFL and MTI-Default refer to the NLM Medical Text Indexer
system [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Systems that participated in fewer than 4 test sets in each batch are
not reported in the results<sup>7</sup>.
      </p>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] the appropriate way to compare multiple classification
systems over multiple datasets is based on their average rank across all the datasets.
On each dataset the system with the best performance gets rank 1.0, the second
best rank 2.0 and so on. If two or more systems tie, they all receive the
average rank. Table 4 presents the average rank (according to MiF and LCA-F)
of each system over all the test sets for the corresponding batches. Note that the
average ranks are calculated for the 4 best results of each system in the batch
according to the rules of the challenge<sup>8</sup>. The best ranked system is highlighted
with bold typeface.
      </p>
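      <p>The rank-based comparison can be made concrete with a short sketch. It assigns rank 1 to the best score on each test set, gives tied systems the average of the ranks they span, and then averages the ranks over the test sets. The restriction to the 4 best results per system is deliberately left out, and the system names and scores below are hypothetical.</p>
      <preformat>
```python
def ranks_with_ties(scores):
    """Rank systems on one test set; ties receive the average of their ranks."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i != len(ordered):
        j = i
        # Extend j to the last system with the same score as position i.
        while j + 1 != len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = (i + 1 + j + 1) / 2  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = shared
        i = j + 1
    return ranks


def average_ranks(results):
    """Average the per-test-set ranks of each system over all test sets."""
    collected = {}
    for scores in results:
        for system, rank in ranks_with_ties(scores).items():
            collected.setdefault(system, []).append(rank)
    return {s: sum(r) / len(r) for s, r in collected.items()}


# Hypothetical MiF scores of three systems on two test sets.
results = [
    {"A": 0.60, "B": 0.58, "C": 0.58},
    {"A": 0.55, "B": 0.61, "C": 0.50},
]
print(average_ranks(results))
```
      </preformat>
      <p>Averaging ranks rather than raw scores makes the comparison insensitive to differences in difficulty between test sets.</p>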
      <p>
        First, we can observe that several systems outperform the strong MTI
baseline in terms of the MiF and LCA-F measures, exhibiting state-of-the-art performance.
During the first batch the flat classification approach (Asclepius system) used in
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] ranked best. In the other two batches the learning-to-rank systems proposed by NCBI
(L2R systems) and Fudan University (Antinomyra systems) ranked as the
best performing ones, occupying the first two places in both measures.
      </p>
      <p>
        According to the available descriptions, the only systems that made use of
the MeSH hierarchy were the ones introduced by [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The top-down hierarchical
systems, cole_hce1, cole_hce2 and cole_hce_ne, achieved mediocre results, while
the utai_rebayct systems had poor performance. For the systems based on a
Bayesian network this behavior was expected as they cannot scale well to large
problems.
      </p>
      <sec id="sec-4-1">
        <title>Task 2b</title>
        <p>7 According to the rules of BioASQ, each system had to participate in at least 4 test
sets of a batch in order to be eligible for the prizes.
8 http://bioasq.lip6.fr/general_information/Task1a/</p>
        <p>
          Phase A. Table 5 presents the statistics of the training and test data provided
to the participants. The evaluation included five test batches. For phase A of
Task 2b the systems were allowed to submit responses to any of the corresponding
types of annotations, that is, documents, concepts, snippets and RDF triples.
For each of the categories we rank the systems according to the Mean Average
Precision (MAP) measure [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The final ranking for each batch is calculated as
the average of the individual rankings in the different categories. The detailed
results for Task 2b phase A can be found at http://bioasq.lip6.fr/results/
        </p>
        <p>
          Focusing on the specific categories (e.g., concepts or documents), for the
Wishart system we observe that it achieves a balanced behavior with respect
to the baselines (Table 7 and Table 6). This is evident from the value of the
F-measure, which is much higher than the values of the two baselines. This can
be explained by the fact that the Wishart-S1 system responded with short lists
while the baselines always return long lists (50 and 100 items respectively).
Similar observations hold also for the other four batches, the results of which
are available online.
ideal answers. The systems were ranked according to the manual evaluation of
ideal answers by the BioASQ experts [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. For reasons of completeness we also report
the results of the systems for the exact answers.
        </p>
        <p>
The participation in the second BioASQ challenge signals an uptake of the
significance of biomedical question answering in the research community. We
observed increased participation in both Tasks 2a and 2b. The baseline that
we used this year in Task 2a incorporated techniques from last year's winning
system. Although we had more data and thus more possible sources of errors
(but also more training data), the best system in the first challenge clearly
outperformed the baseline. This suggests an improvement of large-scale classification
systems over the last year. The results achieved in Task 2b also suggest that the
state of the art was pushed a step further. Consequently, we regard the outcome
of the challenge as a success towards pushing the research on bio-medical
information systems a step further. In future editions of the challenge, we aim to
provide even more benchmark data derived from a community-driven acquisition
process.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Joel</given-names>
            <surname>Robert</surname>
          </string-name>
          Adams and
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bedrick</surname>
          </string-name>
          .
          <article-title>Automatic classification of pubmed abstracts with latent semantic indexing: Working notes</article-title>
          .
          <source>In Proceedings of Question Answering Lab at CLEF</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alan</surname>
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Aronson</surname>
          </string-name>
          and
          <string-name>
            <surname>François-Michel Lang</surname>
          </string-name>
          .
          <article-title>An overview of MetaMap: historical perspective and recent advances</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          ,
          <volume>17</volume>
          :
          <fpage>229</fpage>
          –
          <fpage>236</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Georgios</given-names>
            <surname>Balikas</surname>
          </string-name>
          , Ioannis Partalas, Aris Kosmopoulos, Sergios Petridis, Prodromos Malakasiotis, Ioannis Pavlopoulos, Ion Androutsopoulos, Nicolas Baskiotis, Eric Gaussier, Thierry Artieres, and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Gallinari</surname>
          </string-name>
          .
          <article-title>Evaluation Framework Specifications</article-title>
          .
          <source>Project deliverable D4.1</source>
          ,
          <issue>05</issue>
          /
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Christopher</surname>
            <given-names>J.C.</given-names>
          </string-name>
          <string-name>
            <surname>Burges</surname>
          </string-name>
          .
          <article-title>From ranknet to lambdarank to lambdamart: An overview</article-title>
          .
          <source>Technical Report MSR-TR-2010-82</source>
          ,
          <year>June 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Sungbin</given-names>
            <surname>Choi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jinwook</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>Classification and retrieval of biomedical literatures: Snumedinfo at clef qa track bioasq 2014</article-title>
          .
          <source>In Proceedings of Question Answering Lab at CLEF</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Janez</given-names>
            <surname>Demšar</surname>
          </string-name>
          .
          <article-title>Statistical Comparisons of Classifiers over Multiple Data Sets</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>7</volume>
          :1–
          <fpage>30</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Minlie</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aurélie</given-names>
            <surname>Névéol</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhiyong</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>Recommending mesh terms for annotating biomedical articles</article-title>
          .
          <source>JAMIA</source>
          ,
          <volume>18</volume>
          (
          <issue>5</issue>
          ):
          <volume>660</volume>
          –
          <fpage>667</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Susan</surname>
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Schmidt Alan R. Aronson James G. Mork</surname>
          </string-name>
          , Dina Demner-Fushman.
          <article-title>Recent enhancements to the nlm medical text indexer</article-title>
          .
          <source>In Proceedings of Question Answering Lab at CLEF</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Aris</given-names>
            <surname>Kosmopoulos</surname>
          </string-name>
          , Ioannis Partalas, Eric Gaussier, Georgios Paliouras, and
          <string-name>
            <given-names>Ion</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          .
          <article-title>Evaluation Measures for Hierarchical Classification: a unified view and novel approaches</article-title>
          .
          <source>CoRR, abs/1306.6802</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>David D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          et al.
          <article-title>RCV1: A new benchmark collection for text categorization research</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>5</volume>
          :
          <fpage>361</fpage>
          –
          <lpage>397</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Chin-Yew</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          .
          <source>In Proceedings of the ACL workshop "Text Summarization Branches Out"</source>
          , pages
          <fpage>74</fpage>
          –
          <lpage>81</lpage>
          , Barcelona, Spain,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Jessa</given-names>
            <surname>Lingeman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Laura</given-names>
            <surname>Dietz</surname>
          </string-name>
          .
          <article-title>UMass at BioASQ 2014: Figure-inspired text retrieval</article-title>
          . In
          <source>2nd BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
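          <string-name>
            <given-names>Ke</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Junqiu</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shengwen</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chengxiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          , and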
          <string-name>
            <given-names>Shanfeng</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>The Fudan-UIUC participation in the BioASQ challenge task 2a: The Antinomyra system</article-title>
          .
          <source>In Proceedings of Question Answering Lab at CLEF</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Yifeng</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>BioASQ System Descriptions (Wishart team)</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Yuqing</given-names>
            <surname>Mao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zhiyong</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>NCBI at the 2013 BioASQ challenge task: Learning to rank for automatic MeSH Indexing</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>James</given-names>
            <surname>Mork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Antonio</given-names>
            <surname>Jimeno-Yepes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alan</given-names>
            <surname>Aronson</surname>
          </string-name>
          .
          <source>The NLM Medical Text Indexer System for Indexing Biomedical Literature</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>Mariana</given-names>
            <surname>Neves</surname>
          </string-name>
          .
          <article-title>HPI in-memory-based database system in Task 2b of BioASQ</article-title>
          .
          <source>In Proceedings of Question Answering Lab at CLEF</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
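          <string-name>
            <given-names>Yannis</given-names>
            <surname>Papanikolaou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dimitrios</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Grigorios</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Manos</given-names>
            <surname>Laliotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nikos</given-names>
            <surname>Markantonatos</surname>
          </string-name>
          , and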
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .
          <article-title>Ensemble Approaches for Large-Scale Multi-Label Classification and Question Answering in Biomedicine</article-title>
          . In
          <source>2nd BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
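          19.
          <string-name>
            <given-names>Francisco</given-names>
            <surname>Ribadas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luis</given-names>
            <surname>de Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Victor</given-names>
            <surname>Darriba</surname>
          </string-name>
          , and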
          <string-name>
            <given-names>Alfonso</given-names>
            <surname>Romero</surname>
          </string-name>
          .
          <article-title>Two hierarchical text categorization approaches for BioASQ semantic indexing challenge</article-title>
          . In
          <source>1st BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>Francisco J.</given-names>
            <surname>Ribadas-Pena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luis M.</given-names>
            <surname>de Campos Ibanez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Victor Manuel</given-names>
            <surname>Darriba Bilbao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alfonso E.</given-names>
            <surname>Romero</surname>
          </string-name>
          .
          <article-title>CoLe and UTAI participation at the 2014 BioASQ semantic indexing challenge</article-title>
          .
          <source>In Proceedings of Question Answering Lab at CLEF</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>Carlos N.</given-names>
            <surname>Silla</surname>
            <suffix>Jr.</suffix>
          </string-name>
          and
          <string-name>
            <given-names>Alex A.</given-names>
            <surname>Freitas</surname>
          </string-name>
          .
          <article-title>A survey of hierarchical classification across different application domains</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>22</volume>
          :
          <fpage>31</fpage>
          –
          <lpage>72</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>Lei</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Suju</given-names>
            <surname>Rajan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Vijay K.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          .
          <article-title>Large scale multi-label classification via MetaLabeler</article-title>
          .
          <source>In Proceedings of the 18th International Conference on World Wide Web, WWW '09</source>
          , pages
          <fpage>211</fpage>
          –
          <lpage>220</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>Grigorios</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Katakis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .
          <article-title>Mining Multi-label Data</article-title>
          .
          In
          <string-name>
            <given-names>Oded</given-names>
            <surname>Maimon</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lior</given-names>
            <surname>Rokach</surname>
          </string-name>
          , editors,
          <source>Data Mining and Knowledge Discovery Handbook</source>
          , pages
          <fpage>667</fpage>
          –
          <lpage>685</lpage>
          .
          <publisher-name>Springer US</publisher-name>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
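          <string-name>
            <given-names>Grigorios</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Manos</given-names>
            <surname>Laliotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nikos</given-names>
            <surname>Markantonatos</surname>
          </string-name>
          , and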
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .
          <article-title>Large-Scale Semantic Indexing of Biomedical Publications</article-title>
          . In
          <source>1st BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <given-names>Dirk</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>George</given-names>
            <surname>Tsatsaronis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Schroeder</surname>
          </string-name>
          .
          <article-title>Answering Factoid Questions in the Biomedical Domain</article-title>
          . In
          <source>1st BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
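          26.
          <string-name>
            <given-names>Yuqing</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chih-Hsuan</given-names>
            <surname>Wei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhiyong</given-names>
            <surname>Lu</surname>
          </string-name>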
          .
          <article-title>NCBI at the 2014 BioASQ challenge task: large-scale biomedical semantic indexing and question answering</article-title>
          .
          <source>In Proceedings of Question Answering Lab at CLEF</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <given-names>Dongqing</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dingcheng</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ben</given-names>
            <surname>Carterette</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hongfang</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>An Incremental Approach for MEDLINE MeSH Indexing</article-title>
          . In
          <source>1st BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>