<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Results of the BioASQ tasks of the Question Answering Lab at CLEF 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Georgios Balikas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aris Kosmopoulos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Krithara</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgios Paliouras</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Kakadiaris</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratoire d'Informatique de Grenoble</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NCSR "Demokritos"</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Houston</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The goal of the BioASQ challenge is to push research towards highly precise biomedical information access systems. We aim to promote systems and approaches that are able to deal with the whole diversity of the Web, especially for, but not restricted to, the context of biomedicine. The third challenge consisted of two tasks: semantic indexing and question answering. 59 systems by 18 different teams participated in the semantic indexing task (Task 3a). The question answering task was further subdivided into two phases. 24 systems from 9 different teams participated in the annotation phase (Task 3b, phase A), while 26 systems of 10 different teams participated in the answer generation phase (Task 3b, phase B). Overall, the best systems were able to outperform the strong baselines provided by the organizers. In this paper, we present the data used during the challenge as well as the technologies which were used by the participants.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The aim of this paper is to present an overview of the BioASQ challenge in
CLEF 2015. The overview provides information about:
1. the two BioASQ tasks of the Question Answering Lab at CLEF 2015,
2. the data provided during the BioASQ tasks,
3. the systems that participated in the challenge, according to the system
descriptions that we received; detailed descriptions of some of the systems
are given in the lab proceedings which we cite, and
4. the evaluation results of the participating systems, compared to those of
dedicated baseline systems.</p>
    </sec>
    <sec id="sec-2">
      <title>Overview of the Tasks</title>
      <p>
        The challenge comprised two tasks: (1) a large-scale semantic indexing task
(Task 3a) and (2) a question answering task (Task 3b). Information about the
challenge and the nature of the data it provides is available at [
        <xref ref-type="bibr" rid="ref2 ref21">21, 2</xref>
        ].
Large-scale semantic indexing. In Task 3a the goal is to classify documents from
the MEDLINE (http://www.ncbi.nlm.nih.gov/pubmed/) digital library onto concepts of
the MeSH (http://www.ncbi.nlm.nih.gov/mesh/) hierarchy. New MEDLINE articles that
are not yet annotated are collected on a weekly basis and used as test sets for the
evaluation of the participating systems. As soon as the annotations become available
from the MEDLINE curators, the performance of each system is assessed using
standard information retrieval measures as well as hierarchical ones. The winners of
each batch are decided based on their performance in the Micro F-measure (MiF) from
the family of flat measures [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], and the Lowest Common Ancestor F-measure (LCA-F) from
the family of hierarchical measures [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. For completeness, several other flat and
hierarchical measures are reported [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
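      <p>To make the flat MiF measure concrete, the following sketch (ours, for illustration; not the official BioASQ evaluation code) pools true positives, false positives and false negatives over all articles before computing precision, recall and F1. The MeSH heading identifiers in the example are hypothetical.</p>
      <preformat>
# Illustrative micro F-measure (MiF) for multi-label annotations.
def micro_f_measure(gold, predicted):
    """gold, predicted: lists of sets of labels, one set per article."""
    tp = sum(len(g.intersection(p)) for g, p in zip(gold, predicted))
    fp = sum(len(p.difference(g)) for g, p in zip(gold, predicted))
    fn = sum(len(g.difference(p)) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with hypothetical MeSH heading identifiers.
gold = [{"D001921", "D009369"}, {"D003920"}]
pred = [{"D001921"}, {"D003920", "D006973"}]
print(micro_f_measure(gold, pred))  # 0.666...
      </preformat>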
      <p>In order to provide an on-line and large-scale scenario, the task was divided
into three independent batches. In each batch 5 test sets of biomedical articles
were released following a pre-announced schedule. The test sets were released
on a weekly basis (on Monday 17.00 CET) and the participants were asked to
provide their system's answers within 21 hours. Figure 1 gives an overview of
the time plan of Task 3a.</p>
      <sec id="sec-2-1">
        <title>February09</title>
      </sec>
      <sec id="sec-2-2">
        <title>February16</title>
      </sec>
      <sec id="sec-2-3">
        <title>February23</title>
        <p>arch02
M
arch09
M
arch16
M
arch23
M
arch30
M
pril 06
A
pril 13
A
pril 20
A
pril 27
A
ay04
M
ay11
M</p>
        <p>
          Biomedical semantic QA. The goal of Task 3b was to assess the performance
of participating systems in different stages of the question answering process,
ranging from the retrieval of relevant concepts and articles to the generation of
natural-language answers. Task 3b comprised two phases: In phase A, BioASQ
released questions in English from benchmark datasets created by a group of
biomedical experts. There were four types of questions: "yes/no" questions,
"factoid" questions, "list" questions and "summary" questions [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Participants were
asked to respond with relevant concepts (from specific terminologies and
ontologies), relevant articles (from PubMed and PubMed Central,
http://www.ncbi.nlm.nih.gov/pmc/), relevant snippets
extracted from the relevant articles and relevant RDF triples (from specific
ontologies). In phase B, the released questions were accompanied by the correct
answers for a subset of the required elements of phase A, namely documents and
snippets. (In the first two editions of the BioASQ challenge, the datasets released
for Phase B contained relevant articles, snippets, concepts and RDF triples for
each question.) The participants had to answer with exact answers as well as with
paragraph-sized summaries in natural language (dubbed ideal answers).
        </p>
          <p>
            The task was split into five independent batches (see Fig. 2). For each phase,
the participants had 24 hours to submit their answers. We used well-known
measures such as mean precision, mean recall, mean F-measure, mean average
precision (MAP) and geometric MAP (GMAP) to evaluate the performance of
the participants in Phase A. The winners were selected based on MAP. The
evaluation in phase B was carried out manually by biomedical experts on the
ideal answers provided by the systems. For the sake of completeness, ROUGE [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]
was also reported.
          </p>
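          <p>As a rough illustration of the ROUGE family, the sketch below (ours) computes ROUGE-N recall against a single reference summary; the actual toolkit [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] additionally supports stemming, stopword removal and multiple references.</p>
          <preformat>
# Minimal ROUGE-N recall: the proportion of reference n-grams (with clipped
# counts) that also appear in the candidate summary. Shown for N = 2.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=2):
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / max(1, sum(ref.values()))
          </preformat>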
          <p>
            Task 3a. The systems that participated in the semantic indexing task of the
BioASQ challenge adopted a variety of approaches, based mostly on flat
classification. In the rest of this section we describe the participating systems
and stress their key characteristics.
          </p>
          <p>
            The NCBI system [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], called MeSH Now, was contributed as a baseline
system for the semantic indexing task of 2015. This allowed other participants
to use its predictions, in order to improve their own results. The system is very
similar to that developed by NCBI for the BioASQ2 challenge, based on the
generic learning-to-rank approach presented in [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. The main improvements were
the addition of new training data from the third iteration of the challenge and
the submission of two separate runs each week, one favoring high F1 and one
favoring high recall. Improvements were also made to the scalability of the
system, which now runs in parallel on a computer cluster.
          </p>
          <p>
            The AUTH-Atypon system [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] also adopted a flat classification approach.
The approach is based on binary linear SVM models, one for each class. A
MetaLabeler [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] is used to predict the number of classes that an instance should be
assigned to. An ensemble of such classifiers, trained on training sets of varying
size and from different time periods, is then used to deal with the problem
of concept drift.
          </p>
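          <p>A minimal sketch of the MetaLabeler idea [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] is given below, assuming scikit-learn; it is not the AUTH-Atypon implementation, which additionally uses an ensemble of such models trained on different data sizes and time periods.</p>
          <preformat>
# MetaLabeler sketch: one binary linear SVM per label, plus a regressor that
# predicts how many labels each document should receive; the top-scoring
# labels are then kept.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import Ridge
from sklearn.multiclass import OneVsRestClassifier

def train_metalabeler(X, Y):
    """X: (n_docs, n_features); Y: binary indicator matrix (n_docs, n_labels)."""
    ovr = OneVsRestClassifier(LinearSVC()).fit(X, Y)
    counter = Ridge().fit(X, Y.sum(axis=1))  # regress the label count
    return ovr, counter

def predict_metalabeler(ovr, counter, X):
    scores = ovr.decision_function(X)        # (n_docs, n_labels)
    counts = np.maximum(1, np.rint(counter.predict(X)).astype(int))
    preds = np.zeros_like(scores, dtype=bool)
    for i, k in enumerate(counts):
        preds[i, np.argsort(scores[i])[-k:]] = True  # keep top-k labels
    return preds
          </preformat>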
          <p>
            A domain-independent k-nearest-neighbor approach is adopted by the IIIT
team [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. Initially the system uses k-NN in order to find the most relevant MeSH
headings. Then a series of procedures based on POS-tagging, IDF
computation and SVM-rank are used to assign some extra classes to each test
instance and improve the recall of the initial k-NN results. In the final step,
tree-based one-versus-all classifiers (FastXML) are used, which actually take
into account the hierarchical relations between the MeSH terms.
          </p>
          <p>
            Another k-nearest-neighbor approach is that of USI [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], which does not take
into account the hierarchy. The authors claim that the method is generic, since it
does not take the domain into account or use any NLP, although they believe that
an NLP module would boost its performance. Given an instance, the system
finds the k nearest instances in the training corpus and then uses the labels of
these instances to annotate it, by computing semantic similarities. During the
challenge they experimented with various parameters of their system, such as
the value of k, and they also took into account the predictions of the baselines
in order to improve their results.
          </p>
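          <p>Several of the systems above share the same underlying k-nearest-neighbor scheme; a generic version is sketched below. The cosine similarity and the voting threshold are our own illustrative simplifications, not any team's exact settings.</p>
          <preformat>
# Generic k-NN multi-label annotation: retrieve the k most similar training
# documents and let their labels vote, weighted by similarity.
import numpy as np

def knn_labels(x, train_X, train_labels, k=10, min_votes=0.3):
    """x: query vector; train_X: (n, d); train_labels: list of label sets."""
    norms = np.linalg.norm(train_X, axis=1) * np.linalg.norm(x) + 1e-12
    sims = train_X @ x / norms            # cosine similarities
    nearest = np.argsort(sims)[-k:]
    votes = {}
    for i in nearest:
        for label in train_labels[i]:
            votes[label] = votes.get(label, 0.0) + sims[i]
    total = sims[nearest].sum() + 1e-12
    return {label for label, v in votes.items() if v / total >= min_votes}
          </preformat>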
          <p>
            The CoLe and UTAI [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] teams introduce a new approach, compared to
their participation in the previous challenges. This year they use only
conventional information retrieval tools, such as Lucene, combined with k-NN methods.
The authors also experimented with several approaches to index term extraction,
ranging from simple ones to more complex ones requiring the use of NLP.
          </p>
          <p>The ESIS* systems used the Lucene index in order to find useful features for
each of the MeSH classes separately. In this direction, they selected words that
co-occur often with a particular class, as well as the most common terms,
excluding stop words. The decision function follows a k-nearest-neighbor approach:
for each test instance, given the feature extraction process, they find in the
Lucene index the closest training examples, which decide the class of the test
instance. Intuitively, the probability of a class increases if a term that is
strongly associated with it is present and decreases if a frequent term is absent.</p>
          <p>
            The Fudan system [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] uses a learning-to-rank (LTR) method for predicting
MeSH headings. The MeSHLabeler algorithm consists of two components. The
first component, called MeSHRanker, returns an ordered list of MeSH headings
for each test instance. The ranking is determined by a combination of (a) binary
classifiers, one for each MeSH heading, (b) the most similar citations to the
test instance, (c) pattern matching between the MeSH headings and the title of
the abstract and (d) the prediction of the MTI system. The second component,
called MeSHNumber, predicts the actual number of MeSH headings that must be
assigned to each test instance.
          </p>
          <p>
            Table 1 describes the principal technologies that were employed by the
participating systems and whether a hierarchical or a flat approach was adopted.
Baselines. Five systems served as baseline systems for BioASQ Task 3a.
The first one, dubbed BioASQ Baseline, follows a simplistic unsupervised
approach to the problem and is thus easy to beat. The rest of the systems are
implementations of state-of-the-art methods: the Medical Text Indexer (MTI)
and the MTI First Line Index [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] were developed and are maintained by the
National Library of Medicine (NLM; http://ii.nlm.nih.gov/MTI/index.shtml).
They serve as classification systems for
articles of MEDLINE and are actively used by the MEDLINE curators in
order to assist them in the annotation process. Furthermore, MeSH Now BF and
MeSH Now HR were developed by NCBI and were among the best-performing
systems in the second edition of the BioASQ challenge [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. Consequently, we
expected these baselines to be hard to beat.
          </p>
          <p>Task 3b. As mentioned above, the second task of the challenge is further divided into
two phases. In the first phase, where the goal is to annotate questions with
relevant concepts, documents, snippets and RDF triples, 9 teams with 24 systems
participated. In the second phase, where teams are requested to submit exact and
paragraph-sized answers for the questions, 10 teams with 26 different systems
participated.</p>
          <p>
            The OAQA system described in [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] focuses on learning to answer factoid
and list questions. The participants trained three supervised models, using
factoid and list questions from the previous editions of the task. The first is an
answer type prediction model, the second assigns a score to each predicted answer,
while the third is a collective re-ranking model. Although the system also
participated in phase A of Task 3b, its performance was much better in the factoid
and list questions of phase B.
          </p>
          <p>In contrast, the USTB system [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ] participated only in phase A of the
challenge. This approach initially uses a sequential dependence model for document
retrieval. It then uses word embeddings (specifically the Word2Vec tool) to rank
the results and improve the document retrieval of the previous step. In the final
step, biomedical concepts and corresponding RDF triples are extracted using
concept recognition tools, such as MetaMap and Banner.</p>
          <p>Another system that focused on phase A is by the IIIT team and is described
in [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ]. The authors relied on the PubMed search engine to retrieve relevant
documents. They then applied their own snippet extraction method, which is
based on the similarity of the top 10 sentences of the retrieved documents to
the query.</p>
          <p>
            The HPI system [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] participated in both phases of Task 3b. The system
relies on in-memory database technology in order to map the given questions
to concepts. The Stanford CoreNLP package is used for question tokenization
and the BioASQ services are used for relevant document retrieval. The selection
of snippets from the retrieved documents is performed using string similarity
between terms of the question and words of the documents. Exact and ideal
answers are both extracted using the gold-standard snippets that were provided to
the participants.
          </p>
          <p>
            The Fudan system [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] also participated in the second task of the challenge. For
phase A a language model is used in order to retrieve relevant documents. For
snippet extraction, the retrieved documents are searched for query keywords,
giving extra credit to terms that appear close to the query keywords. Regarding
exact and ideal answers, the system is split into three main components: question
analysis, candidate answer generation and candidate answer ranking.
          </p>
          <p>
            In the system of ILSP and AUEB [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] a different approach to question
answering is presented, based on multi-document summarization of relevant
documents. The system first uses an SVR to assign a score to each
sentence of the relevant documents. The most relevant sentences are then combined
to form an answer. In order to avoid redundancy, two main approaches are
examined: the use of an ILP model and the use of a greedier strategy. Several
versions of the system were examined, which differ in the features and training
data that were used.
          </p>
          <p>
            The YodaQA system, described in [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], is a pipeline question answering system
that was altered in order to make it compatible with the BioASQ task. The system
first extracts natural language features from the questions and then searches
its knowledge base for existing answers. It then either directly provides these
passages as answers or performs passage analysis in order to produce answers
from the extracted texts. Each answer is evaluated using a logistic regression
classifier and those with the highest scores are provided as the final answer. The
initial system was designed to answer only factoid questions, so modifications
were necessary in order for it to be able to answer list questions.
          </p>
          <p>
            The final system is SNUMedinfo, described in [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. Regarding Phase A,
the system participated only in the document retrieval task. The approach was
based on the Indri search engine [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] and the semantic concept-enriched model
presented in [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. In phase B, the system participated only in the ideal answer
generation subtask, where it ranked each passage from the provided list based
on the unique keywords it contained. A set of m (a parameter of the system)
passages was then selected, in rank order, keeping only passages that contain a
minimum proportion of new tokens compared to the already selected ones.
          </p>
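          <p>A sketch of this novelty-based selection follows; the whitespace tokenization and the value of the novelty threshold are our own illustrative assumptions rather than the exact choices of [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ].</p>
          <preformat>
# Novelty-based passage selection: keep up to m passages in rank order,
# requiring each to contribute a minimum proportion of unseen tokens.
def select_passages(ranked_passages, m, min_new=0.3):
    selected, seen = [], set()
    for passage in ranked_passages:
        tokens = passage.lower().split()
        new = [t for t in tokens if t not in seen]
        if tokens and len(new) / len(tokens) >= min_new:
            selected.append(passage)
            seen.update(tokens)
        if len(selected) == m:
            break
    return selected
          </preformat>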
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        Table 2 describes the principal technologies that were employed by the
participating systems and the phase (A and/or B) in which they participated.
Baselines. The BioASQ baseline of Task 3b phase B is a system similar to
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. It applies a multi-document summarization method using Integer Linear
Programming and Support Vector Regression.
      </p>
      <p>During the evaluation phase of Task 3a, the participants submitted their
results on a weekly basis to the online evaluation platform of the challenge
(http://participants-area.bioasq.org/). The
evaluation period was divided into three batches containing 5 test sets each. 18
teams participated in the task with a total of 59 systems. Two training datasets
were provided: the first contains 11,804,715 articles that cover 27,097 MeSH
labels; the second is a subset containing 4,607,922 articles that covers 26,866
MeSH labels. The latter dataset focuses on the journals that also appear in the
test sets. The uncompressed size of these training sets in text format is 19 GB
and 7.4 GB respectively. Table 3 shows the number of articles in each test set of
each batch of the challenge.</p>
      <p>Table 4 presents the correspondence between the system names in the BioASQ
Participants Area Leaderboard for Task 3a and the system descriptions submitted
in the track's working notes. Systems that participated in fewer than 4 test sets
of a batch are not reported in the results; according to the rules of BioASQ, a
system had to participate in at least 4 test sets of a batch in order to be able
to win that batch.</p>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] the appropriate way to compare multiple classification
systems over multiple datasets is based on their average rank across all the datasets.
      </p>
      <sec id="sec-3-1">
        <title>9 http://participants-area.bioasq.org/</title>
        <p>10 According to the rules of BioASQ, each system had to participate in at least 4 test
sets of a batch in order to be able to win the batch.</p>
      <p>[Tables 3 and 4 are not reproduced here. Table 3 lists the number of articles per test set and batch; Table 4 maps the leaderboard names (MeSH Now HR, MeSH Now BF, auth*, qaiiit system*, Abstract framework, USI 20 neighbours, USI baseline, USI 10 neighbours, iria-*, MeSHLabeler-*) to the corresponding system descriptions.]</p>
      <p>
        On each dataset the system with the best performance gets rank 1.0, the
second best rank 2.0, and so on. In case two or more systems tie, they all receive
the average rank. Table 5 presents the average rank (according to MiF and
LCA-F) of each system over all the test sets of the corresponding batches. Note
that the average ranks are calculated over the 4 best results of each system in
the batch, according to the rules of the challenge
(http://participants-area.bioasq.org/general_information/Task3a/). The
best-ranked system is highlighted in bold typeface; a dash (-) is used whenever
a system participated in fewer than 4 test sets of a batch, and systems that did
not participate in the challenge regularly, i.e., did not submit results for at
least four test sets in at least one of the three batches, are excluded from the
table. As can be noticed, on all three batches and for both flat and hierarchical
measures, the Fudan system [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] clearly outperforms the other
approaches. The AUTH-Atypon system [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] managed to score second in two out
of three batches, while the MeSH-UK0 system scored second in one of the batches.
      </p>
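      <p>The following sketch (ours) implements this ranking scheme of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], including the tie-handling rule; restricting each system to its 4 best results per batch, as done in the challenge, is omitted for brevity.</p>
      <preformat>
# Average-rank comparison across test sets: rank systems per test set
# (best gets 1.0, tied systems share the average of their positions),
# then average each system's ranks over all test sets.
from collections import defaultdict
from itertools import groupby

def average_ranks(scores_per_testset):
    """scores_per_testset: list of {system: score} dicts, higher is better."""
    totals, counts = defaultdict(float), defaultdict(int)
    for scores in scores_per_testset:
        ordered = sorted(scores.items(), key=lambda kv: -kv[1])
        position = 0
        for _, group in groupby(ordered, key=lambda kv: kv[1]):
            tied = list(group)                 # systems sharing a score
            first, last = position + 1, position + len(tied)
            tied_rank = (first + last) / 2.0   # average of tied positions
            for system, _ in tied:
                totals[system] += tied_rank
                counts[system] += 1
            position = last
    return {s: totals[s] / counts[s] for s in totals}
      </preformat>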
      <p>
        Phase A. In phase A of Task 3b the systems were allowed to submit up to
10 responses per question for any of the corresponding types of annotation, that
is, documents, concepts, snippets and RDF triples. For each of these categories
we rank the systems according to the Mean Average Precision (MAP) measure [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The final ranking for each batch is
calculated as the average of the individual rankings in the different categories.
Tables 7 and 8 present the scores of the participating systems for document and
snippet retrieval in the first batch of Phase A. (In contrast to the first two
editions of the challenge, the biomedical experts of BioASQ were not asked to
produce golden concepts and triples prior to the challenge; the ground truth for
concepts and snippets will be constructed by the experts on the basis of the
material provided by the systems.) Note that systems are allowed to participate
in any or all four parts of the task, e.g., SNUMedinfo* retrieved only documents.
      </p>
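      <p>For reference, a standard formulation of MAP over ranked response lists is sketched below (capped at the 10 allowed responses); the official BioASQ implementation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] may differ in details such as the treatment of questions with no relevant items.</p>
      <preformat>
# Average precision of one ranked list and its mean over all questions.
def average_precision(ranked, relevant):
    hits, score = 0, 0.0
    for i, item in enumerate(ranked[:10], start=1):
        if item in relevant:
            hits += 1
            score += hits / i        # precision at each relevant hit
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per question."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
      </preformat>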
      <p>
        It is worth noting that document retrieval for the given questions
was the most popular aspect of the task; far fewer systems returned document
snippets, concepts and RDF triples. The detailed results for Task 3b phase A can
be found at http://participants-area.bioasq.org/results/3b/phaseA/.
Phase B. In phase B of Task 3b, the systems were asked to generate exact and
ideal answers. The systems will be ranked according to the manual evaluation
of ideal answers by the BioASQ experts [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. For reasons of completeness we
report also the results of the systems for the exact answers. In contrast to the
previous editions of the BioASQ challenge, the test files of Phase B included only
relevant documents and snippets for each question instead of relevant documents,
snippets, concepts and RDF triples. As a result, the participating systems had
less information available in order to construct the exact and the ideal answers.
        </p>
        <p>
          Table 9 shows the results for the exact answers in the first batch of Task 3b.
For systems that did not provide exact answers for a particular kind of question
we use the dash symbol "-". The results of the other batches are available at
http://participants-area.bioasq.org/results/3b/phaseB/; they are not
reproduced here in the interest of space. From those results we can see that
some of the systems achieve very high performance (&gt; 80% accuracy)
in the yes/no questions. The performance in factoid and list questions is not as
good, indicating that there is room for improvement. On the other hand, the
performance on ideal answers has improved compared to the previous years [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ],
which, in combination with the increase in participation, leads us to believe that
a significant amount of effort was invested by the participants and that the task
is gaining attention. Note that these conclusions are based only on
the automated evaluation measures; the manual assessment was still in progress
at the time of writing this document.
        </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>The third edition of the BioASQ challenge has led to a number of interesting
results by the participating systems. Although the baselines that we provided
were quite advanced systems, they were beaten by the best participating systems.
Both tasks have attracted an increasing number of participants and the
number of submissions to the workshop has also increased. Therefore, we believe that
the third edition of the challenge has been another contribution towards better
biomedical information systems. This encourages us to continue the effort and
establish BioASQ as a reference point for research in the area. In future editions
of the challenge, we aim to provide even more benchmark data derived from a
community-driven acquisition process.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The third edition of BioASQ is supported by a conference grant from the
NIH/NLM (number 1R13LM012214-01) and sponsored by the companies Viseo
and Atypon.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Avinash</given-names>
            <surname>Kamineni</surname>
          </string-name>
          , Fatma Nausheen,
          <string-name>
            <given-names>Arpita</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Manish</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Manoj</given-names>
            <surname>Chinnakotla</surname>
          </string-name>
          .
          <article-title>Extreme Classification of PubMed Articles using MeSH Labels</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>George</given-names>
            <surname>Balikas</surname>
          </string-name>
          , Ioannis Partalas,
          <string-name>
            <surname>Axel-Cyrille Ngonga</surname>
            <given-names>Ngomo</given-names>
          </string-name>
          , Anastasia Krithara, Eric Gaussier, and
          <string-name>
            <given-names>George</given-names>
            <surname>Paliouras</surname>
          </string-name>
          .
          <article-title>Results of the BioASQ Track of the Question Answering Lab at CLEF 2014</article-title>
          ,
          <volume>1181</volume>
          :
          <fpage>93</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Georgios</given-names>
            <surname>Balikas</surname>
          </string-name>
          , Ioannis Partalas, Aris Kosmopoulos, Sergios Petridis, Prodromos Malakasiotis, Ioannis Pavlopoulos, Ion Androutsopoulos, Nicolas Baskiotis, Eric Gaussier, Thierry Artieres, and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Gallinari</surname>
          </string-name>
          .
          <article-title>Evaluation Framework Specifications</article-title>
          .
          <source>Project deliverable D4.1</source>
          ,
          <issue>05</issue>
          /
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Petr</given-names>
            <surname>Baudis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Sedivy</surname>
          </string-name>
          .
          <article-title>Biomedical Question Answering using the YodaQA System: Prototype Notes</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Sungbin</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>SNUMedinfo at CLEF QA track BioASQ 2015</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Sungbin</given-names>
            <surname>Choi</surname>
          </string-name>
          , Jinwook Choi, Sooyoung Yoo, Heechun Kim, and
          <string-name>
            <given-names>Youngho</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Semantic concept-enriched dependence model for medical information retrieval</article-title>
          .
          <source>Journal of biomedical informatics</source>
          ,
          <volume>47</volume>
          :
          <fpage>18</fpage>
          -
          <lpage>27</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Janez</given-names>
            <surname>Demsar</surname>
          </string-name>
          .
          <article-title>Statistical Comparisons of Classifiers over Multiple Data Sets</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>7</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Nicolas</given-names>
            <surname>Fiorini</surname>
          </string-name>
          , Sylvie Ranwez,
          <string-name>
            <given-names>Sebastien</given-names>
            <surname>Harispe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jacky</given-names>
            <surname>Montmain</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Vincent</given-names>
            <surname>Ranwez</surname>
          </string-name>
          .
          <article-title>USI at BioASQ 2015: a semantic similarity-based approach for semantic indexing</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Minlie</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aurélie</given-names>
            <surname>Névéol</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhiyong</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>Recommending MeSH terms for annotating biomedical articles</article-title>
          .
          <source>JAMIA</source>
          ,
          <volume>18</volume>
          (
          <issue>5</issue>
          ):
          <fpage>660</fpage>
          -
          <lpage>667</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. James G. Mork, Dina Demner-Fushman, Susan C. Schmidt, and Alan R. Aronson.
          <article-title>Recent enhancements to the NLM Medical Text Indexer</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , volume
          <volume>1180</volume>
          , Sheffield, UK,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Aris</given-names>
            <surname>Kosmopoulos</surname>
          </string-name>
          , Ioannis Partalas, Eric Gaussier, Georgios Paliouras, and
          <string-name>
            <given-names>Ion</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          .
          <article-title>Evaluation Measures for Hierarchical Classification: a unified view and novel approaches</article-title>
          .
          <source>CoRR, abs/1306.6802</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Chin-Yew Lin</surname>
          </string-name>
          .
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          .
          <source>In Proceedings of the ACL workshop `Text Summarization Branches Out'</source>
          , pages
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          , Barcelona, Spain,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Prodromos</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          , Emmanouil Archontakis, Ion Androutsopoulos, Dimitrios Galanis, and
          <string-name>
            <given-names>Harris</given-names>
            <surname>Papageorgiou</surname>
          </string-name>
          .
          <article-title>Biomedical question-focused multi-document summarization: ILSP and AUEB at BioASQ3</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Yuqing</given-names>
            <surname>Mao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zhiyong</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>NCBI at the 2015 BioASQ challenge task: Baseline results from MeSH Now</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Mariana</given-names>
            <surname>Neves</surname>
          </string-name>
          .
          <article-title>HPI question answering system in the BioASQ 2015 challenge</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Yannis</given-names>
            <surname>Papanikolaou</surname>
          </string-name>
          , Grigorios Tsoumakas, Manos Laliotis, Nikos Markantonatos, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .
          <article-title>AUTH-Atypon at BioASQ 3: Large-Scale Semantic Indexing in Biomedicine</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>Shengwen</given-names>
            <surname>Peng</surname>
          </string-name>
          , Ronghui You, Zhikai Xie, Yanchun Zhang, and
          <string-name>
            <given-names>Shanfeng</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>The Fudan participation in the 2015 BioASQ Challenge: Large-scale Biomedical Semantic Indexing and Question Answering</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. Francisco J. Ribadas, Luis M. de Campos, Víctor M. Darriba, and Alfonso E. Romero.
          <article-title>CoLe and UTAI at BioASQ 2015: experiments with similarity based descriptor assignment</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Strohman</surname>
          </string-name>
          , Donald Metzler, Howard Turtle, and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Indri: A language model-based search engine for complex queries</article-title>
          .
          <source>In Proceedings of the International Conference on Intelligent Analysis</source>
          , volume
          <volume>2</volume>
          , pages
          <fpage>2</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>Lei</given-names>
            <surname>Tang</surname>
          </string-name>
          , Suju Rajan, and
          <string-name>
            <given-names>Vijay K.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          .
          <article-title>Large scale multi-label classification via MetaLabeler</article-title>
          .
          <source>In Proceedings of the 18th international conference on World wide web, WWW '09</source>
          , pages
          <fpage>211</fpage>
          -
          <lpage>220</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>George</given-names>
            <surname>Tsatsaronis</surname>
          </string-name>
          , Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis,
          <string-name>
            <given-names>Dimitris</given-names>
            <surname>Polychronopoulos</surname>
          </string-name>
          , et al.
          <article-title>An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition</article-title>
          .
          <source>BMC bioinformatics</source>
          ,
          <volume>16</volume>
          (
          <issue>1</issue>
          ):
          <fpage>138</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>Grigorios</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          , Ioannis Katakis, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .
          <article-title>Mining Multi-label Data</article-title>
          .
          <source>In Oded Maimon and Lior Rokach</source>
          , editors,
          <source>Data Mining and Knowledge Discovery Handbook</source>
          , pages
          <fpage>667</fpage>
          -
          <lpage>685</lpage>
          . Springer US
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>Zi</given-names>
            <surname>Yang</surname>
          </string-name>
          , Niloy Gupta, Xiangyu Sun, Di Xu, Chi Zhang, and
          <string-name>
            <given-names>Eric</given-names>
            <surname>Nyberg</surname>
          </string-name>
          .
          <article-title>Learning to Answer Biomedical Factoid and List Questions: OAQA at BioASQ 3B</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24. Harish Yenala, Avinash Kamineni, Manish Shrivastava, and Manoj Chinnakotla.
          <article-title>BioASQ 3b Challenge 2015: Bio-Medical Question Answering System</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25. Zhi-Juan Zhang, Tian-Tian Liu, Bo-Wen Zhang, Yan Li, Chun-Hua Zhao, Shao-Hui Feng, Xu-Cheng Yin, and Fang Zhou.
          <article-title>A generic retrieval system for biomedical literatures: USTB at BioASQ2015 Question Answering Task</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>