<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Results of the First BioASQ Workshop</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ioannis Partalas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric Gaussier</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axel-Cyrille Ngonga Ngomo</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>The goal of the BioASQ project is to push the research frontier towards hybrid information systems. We aim to promote systems and approaches that are able to deal with the whole diversity of the Web, especially for, but not restricted to, the context of bio-medicine. This goal is pursued by the organization of challenges. The first challenge consisted of two tasks: semantic indexing and question answering. 157 systems were registered by 12 different participants for the semantic indexing task, of which between 19 and 29 participated in each batch. The question answering task was tackled by 15 systems, which were developed by three different organizations. Between 2 and 5 of these systems addressed each batch. Overall, the best systems were able to outperform the strong baselines provided in the experiments in two out of three settings. This suggests that advances over the state of the art were achieved through the BioASQ challenge but also that the benchmark in itself is very challenging. In this paper, we present the data used during the challenge as well as the technologies which were at the core of the participants' frameworks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The aim of this paper is twofold. First, we aim to give an overview of the data
issued during the first BioASQ challenge. In addition, we aim to present the
systems that participated in the challenge and evaluate their performance with respect
to dedicated baseline systems. To this end, we begin by giving a brief overview of
the tasks included in the challenge. In particular, we present the setup for the
challenge, including the timing of the different tasks and the challenge data.
Thereafter, we give an overview of the systems which participated in the challenge.
We only provide descriptions for systems that provided us with an overview of
the technologies they relied upon. Detailed descriptions of some of the systems
are given in the workshop proceedings. The evaluation of the systems, which was
carried out by using state-of-the-art measures or manual assessment, is the last
focal point of this paper. The conclusion sums up the results of the workshop as
well as striking findings.
The challenge comprised two tasks: (1) a large-scale semantic indexing task (Task
1a) and (2) a question answering task (Task 1b).</p>
      <p>
        Large-scale semantic indexing. In Task 1a the goal is to classify documents from
the PubMed<sup>1</sup> digital library onto concepts of the MeSH<sup>2</sup> hierarchy. Here, new
PubMed articles that are not yet annotated are collected on a daily basis. These
articles are used as test sets for the evaluation of the participating systems.
As soon as the annotations are available from the PubMed curators, the
performance of each system is calculated by using standard information retrieval
measures as well as hierarchical ones. The winners of each batch were decided
based on their performance in the Micro F-measure (MiF) from the family of
flat measures [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and the Lowest Common Ancestor F-measure (LCA-F) from
the family of hierarchical measures [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For completeness, several other flat and
hierarchical measures were reported [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In order to provide an on-line and
large-scale scenario, the task was divided into three independent batches, where in
each batch 6 test sets of biomedical articles were released consecutively. Each of
these test sets was released on a weekly basis and the participants had 23 hours
to provide their answers. Figure 1 gives an overview of the time plan of Task 1a.
      </p>
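      <p>As a rough illustration of the flat measure named above, the following minimal sketch computes a micro-averaged F-measure over per-document label sets. The function name, the label identifiers and the toy data are hypothetical, and the exact definition used by the challenge is the one given in [<xref ref-type="bibr" rid="ref2">2</xref>], which may differ in details.</p>
      <preformat>
```python
def micro_f1(gold, predicted):
    """Micro-averaged F-measure; gold and predicted are parallel lists of label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g.intersection(p))  # labels predicted and correct
        fp += len(p - g)              # labels predicted but not in the gold set
        fn += len(g - p)              # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: two test articles with made-up label identifiers.
gold = [{"D001", "D002"}, {"D003"}]
pred = [{"D001"}, {"D003", "D004"}]
print(micro_f1(gold, pred))
```
      </preformat>
      <p>Because the counts are pooled over all articles before precision and recall are computed, frequent labels weigh more heavily than rare ones in this measure.</p>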
      <p>Biomedical semantic QA. The goal here was to provide a large-scale question
answering challenge where the systems should be able to cope with all the stages
of a question answering task, including the retrieval of relevant concepts and
articles as well as the provision of natural-language answers. Task 1b comprised two
phases: In phase A, BioASQ released questions in English from the benchmark
datasets and the participants had to respond with concepts (from specific
terminologies and ontologies), snippets extracted from PubMed articles and RDF
triples (from specific ontologies). In phase B, the released questions contained the
correct answers for the elements (concepts, articles, snippets and RDF triples)
of the first phase. The participants had to answer with exact answers as well as
with paragraph-sized summaries in natural language (dubbed ideal answers).</p>
      <p>The task was split into three independent batches. The two phases for each
batch were run with a time gap of 24 hours. For each phase, the participants
had 24 hours to submit their answers. We used well-known measures such as
mean precision, mean recall, mean F-measure, mean average precision (MAP)</p>
      <sec id="sec-1-1">
        <title>1 http://www.ncbi.nlm.nih.gov/pubmed 2 http://www.ncbi.nlm.nih.gov/mesh</title>
        <sec id="sec-1-1-1">
          <title>June26</title>
        </sec>
        <sec id="sec-1-1-2">
          <title>June27</title>
          <p>[Figure: time plan of the two phases of Task 1b. Labels shown: Phase A, Phase B; June 26, June 27, July 17, July 18, August 7, August 8.]</p>
          <p>
and geometric MAP (GMAP) to evaluate the performance of the participants in
Phase A. The winners were selected based on MAP. The evaluation in phase B
was carried out manually by biomedical experts on the ideal answers provided
by the systems. For the sake of completeness, ROUGE [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] is also reported.
The participating systems in the semantic indexing task of the BioASQ
challenge adopted a variety of approaches including hierarchical and flat algorithms
as well as search-based approaches that relied on information retrieval
techniques. In the rest of this section we describe the proposed systems and stress their
key characteristics.
          </p>
          <p>
            In [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] the authors proposed two hierarchical approaches. The first approach,
dubbed Hierarchical Annotation and Categorization Engine (HACE), follows
a top-down hierarchical classification scheme [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] where for each node of the
hierarchy a binary classifier is trained. For constructing the positive training
examples for each node the authors employ a random method that selects a
fixed number of examples from the descendants of the current node and a method
that is based on k-means, which chooses the k closest examples to the centroid of
the node. In both approaches the number of selected examples is fixed in order to create
manageable datasets, especially in the upper levels of the hierarchy. The second
system (Rebayct) that participated in the challenge was based on a Bayesian
network which models the hierarchical relations among the labels as well as the
training data (that is, the terms in the abstracts and titles). A major drawback
of this system is that it cannot scale well to large classification problems with
thousands of classes and millions of documents. For this reason, the authors
reduced the training data to 10% and further split it into 5 disjoint parts in
order to train five different models. During the testing phase, the models were
aggregated with simple majority voting.
          </p>
          <p>
            In [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] (AUTH) a flat classification approach was employed. This approach
trains a binary SVM for each label that is present in the training data [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. In
order to reduce the complexity of the problem the authors kept only the data
that belong to the journals (1806 in total) from which the test sets were sampled
during the testing phase of the challenge. The systems that were introduced in
the challenge use a meta-model (called MetaLabeler [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]) for predicting the
number of labels (N) of a test instance. During prediction all the SVM classifiers
are queried and the labels are sorted according to the corresponding confidence
value. Finally, the system predicts the N top labels. While the proposed
approach is relatively simple, it requires considerable processing power for both the training and
the testing procedures, and it also has large storage requirements (the authors
reported that the size of the models for one of the systems was 406GB).
          </p>
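          <p>The final prediction step described above (sort the labels by classifier confidence, keep the N best) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the label names and the scores are hypothetical, and the value of N would in practice come from the meta-model rather than being passed in by hand.</p>
          <preformat>
```python
def predict_top_n(scores, n_labels):
    """Return the n_labels labels with the highest confidence.

    scores maps each label to the confidence of its binary classifier;
    n_labels stands in for the number predicted by the meta-model.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_labels]


# Hypothetical confidences for one test article.
scores = {"Humans": 0.91, "Mice": 0.15, "Neoplasms": 0.72, "Liver": 0.40}
print(predict_top_n(scores, 2))
```
          </preformat>
          <p>Note that every per-label classifier has to be evaluated before the sort, which is consistent with the processing and storage costs reported for this approach.</p>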
          <p>
            In [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] the authors follow two different approaches: a) one that relies on the
results provided by the MetaMap tool [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] and b) one that is based on the search
engine Indri<sup>3</sup>. In the MetaMap-based approach, the title and abstract of the
article of each test instance are used to query the MetaMap system. The returned
results contain concepts and their corresponding confidence scores. The system
calculates a final score by assigning weights to the concepts that are obtained for
the title and the abstract and exceed a predefined threshold for the confidence
score. Finally, the system proposes the m top-ranked concepts, where m is a free
parameter. In the search-based approach the authors index the training data
using the Indri engine. For each test article a query q is generated and a score is
calculated for each document d in the index. The concepts of the m top-ranked
documents are assigned to the test article.
          </p>
          <p>
            In the Wishart system [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] a typical flat classification approach and k-NN are
used. In the flat approach, a binary SVM is trained for each label present in
the training data. In the k-NN-based approach, the classifier is invoked for each
test article to retrieve documents from a local index. Additionally, the NCBI
Entrez system is queried in order to retrieve extra documents along with their
labels. All the abstracts are ordered according to their distance (keeping the first N,
empirically set to 100) and the top M (empirically set to 10) labels are retained. For the
final prediction the two systems are combined by keeping the common predicted
labels, and the remaining labels are ordered according to their confidence scores. The
system predicts 10-15 labels for each test article.
          </p>
          <p>
            A learning-to-rank method was used by the NCBI team [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. The systems
follow a three-stage approach: (1) first the k-nearest neighbors of the test article
are retrieved from the Medline database, (2) next the labels are ordered using a
learning-to-rank algorithm and (3) finally a cut-off method prunes the ordered
list. It is interesting to note that in the definition of the features for the
learning-to-rank problem the authors use the results of the MTIFL baseline system. More
specifically, a binary feature indicates whether a specific label was observed in the
results of MTIFL.
          </p>
          <p>
            Table 1 summarizes the principal technologies that were employed by the
participating systems and whether a hierarchical or a flat approach was followed.
          </p>
          <p>3 http://www.lemurproject.org/indri.php</p>
          <table-wrap>
            <label>Table 1</label>
            <caption>
              <p>Technologies used by participants in Task 1a.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Reference</th><th>Approach</th><th>Technologies</th></tr>
              </thead>
              <tbody>
                <tr><td>[<xref ref-type="bibr" rid="ref13">13</xref>]</td><td>flat</td><td>SVMs, MetaLabeler [<xref ref-type="bibr" rid="ref11">11</xref>]</td></tr>
                <tr><td>[<xref ref-type="bibr" rid="ref9">9</xref>]</td><td>hierarchical</td><td>SVMs, Bayes networks</td></tr>
                <tr><td>[<xref ref-type="bibr" rid="ref15">15</xref>]</td><td>flat</td><td>MetaMap [<xref ref-type="bibr" rid="ref1">1</xref>], information retrieval, search engines</td></tr>
                <tr><td>[<xref ref-type="bibr" rid="ref6">6</xref>]</td><td>flat</td><td>k-NN, SVMs</td></tr>
                <tr><td>[<xref ref-type="bibr" rid="ref7">7</xref>]</td><td>flat</td><td>k-NN, learning-to-rank</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>
            Baselines. During the first challenge two systems served as baseline
systems. The first one, dubbed BioASQ Baseline, follows an unsupervised
approach to tackle the problem, and so it is expected that the systems developed
by the participants will outperform it. The second baseline is a state-of-the-art
method called Medical Text Indexer (MTI) [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] which is developed by the National
Library of Medicine<sup>4</sup> and serves as a classification system for articles of MEDLINE.
MTI is used by curators in order to assist them in the annotation process. It
is also worth noting that MTI is used in a few journals to fully automate the
process of annotation. It is therefore expected to be a hard baseline.
          </p>
          <p>3.2 Task 1b</p>
          <p>
In the second task of the BioASQ challenge a total of three teams participated
in both phases with 11 systems. Only two system descriptions were available
when this paper was written [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ].
          </p>
          <p>
            For phase A of Task 1b the Wishart system [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] invokes query processing
and document ranking techniques. More specifically, each test question in
natural language form is converted by extracting its noun phrases and looking
them up in a thesaurus of biomedical entities. Then the question is expanded by
adding synonyms and relevant biomedical entities using the PolySearch tool<sup>5</sup>.
The entities found by PolySearch are used to rank the retrieved set of concepts,
articles, triples and snippets. In phase B of the task an approach similar to that of phase
A is used in order to augment the set of given concepts. Sentences extracted
from the retrieved documents are ranked according to their cosine similarity with
respect to the augmented concepts. The top-ranked sentences are concatenated
in order to provide an ideal answer.
          </p>
          <p>
            The MCTeam system participated only in phase A [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. In order to form
an appropriate query, the system first uses the test question to query MetaMap,
which responds with concept-related words. These words were used to form a
query. In cases where no concepts were returned by MetaMap, the final query was
formed by removing the stopwords from the test question. This query was used
to retrieve the appropriate information from the BioASQ web services and also
from a local index of PubMed full-text articles<sup>6</sup>. The two lists of retrieved
results were then merged to form the final results.
          </p>
          <p>4 http://ii.nlm.nih.gov/MTI/index.shtml</p>
          <p>5 http://wishart.biology.ualberta.ca/polysearch/</p>
          <p>6 The Indri search engine has been used for indexing the documents.</p>
          <p>
            Baselines. Two baselines were used in phase A. The systems return the list of
the top-50 and the top-100 entities, respectively, that may be retrieved using the
keywords of the input question as a query to the BioASQ services. As a result,
two lists for each of the main entities (concepts, documents, snippets, triples)
are produced, of a maximum length of 50 and 100 items respectively.
          </p>
          <p>
            For Phase B of Task 1b, three baseline approaches
were created that address, respectively, factoid and list
questions, summary questions, and yes/no questions [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. The three approaches were
combined into one system, which constitutes the BioASQ baseline for this
phase of Task 1b. The baseline approach for the list/factoid questions combines
in an ensemble a set of scoring schemes that attempt to prioritize the concepts
that answer the question by assuming that the type of the answer aligns with the
lexical answer type (type coercion). The baseline approach for the summary
questions introduces a multi-document summarization method using Integer Linear
Programming and Support Vector Regression.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>During the evaluation phase of Task 1a, the participants submitted their
results on a weekly basis to the online evaluation platform of the challenge<sup>7</sup>. The
evaluation period was divided into three batches containing 6 test sets each.
Eleven teams participated in the task with a total of 40 systems. 10,876,004
articles with 26,563 labels (22GB) were provided as training data to the
participants. Table 2 shows the number of articles in each test set of each batch of the
challenge.</p>
      <p>
        Table 3 presents the correspondence of the systems for which a description
was available and the submitted systems in Task 1a. The systems MTIFL, MTI
and BioASQ Baseline were the baseline systems used throughout the challenge.
MTIFL and MTI refer to the NLM Medical Text Indexer system [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Systems
that participated in fewer than 4 test sets in each batch are not reported in the
results<sup>8</sup>.
      </p>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] the appropriate way to compare multiple classification
systems over multiple datasets is based on their average rank across all the datasets.
On each dataset the system with the best performance gets rank 1.0, the
second best rank 2.0 and so on. If two or more systems tie, they all
receive the average rank. Table 4 presents the average rank (according to MiF
and LCA-F) of each system over all the test sets for the corresponding batches.
Note that the average ranks are calculated for the 4 best results of each system
in the batch according to the rules of the challenge<sup>9</sup>. The best ranked system
      </p>
      <sec id="sec-2-1">
        <title>7 http://bioasq.lip6.fr</title>
        <p>8 According to the rules of BioASQ, each system had to participate in at least 4 test
sets of a batch in order to be eligible for the prizes.
9 http://bioasq.lip6.fr/general information/Task1a/</p>
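        <table-wrap>
          <label>Table 2</label>
          <caption>
            <p>Statistics on the test datasets of Task 1a.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Batch</th><th>Articles</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>1,942</td></tr>
              <tr><td></td><td>845</td></tr>
              <tr><td></td><td>793</td></tr>
              <tr><td></td><td>2,408</td></tr>
              <tr><td></td><td>6,742</td></tr>
              <tr><td></td><td>4,556</td></tr>
              <tr><td>Subtotal</td><td>17,286</td></tr>
              <tr><td>2</td><td></td></tr>
              <tr><td>Subtotal</td><td></td></tr>
              <tr><td>3</td><td></td></tr>
              <tr><td>Subtotal</td><td></td></tr>
              <tr><td>Total</td><td>88,628 31,869</td></tr>
            </tbody>
          </table>
        </table-wrap>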
        <p>
is highlighted with bold typeface. We can observe that during the first batch
the MTIFL baseline achieved the best performance in terms of the MiF measure,
exhibiting state-of-the-art performance, which is also evident in the other two
batches. During the first batch RMAIP and system3 have the best performances
in both measures. Interestingly, the ranking of RMAIP according to the
LCA-F measure is better than that based on MiF, which shows that RMAIP
is able to give answers in the neighborhood (as designated by the hierarchical
relations among the classes) of the correct ones. In the other two batches the
systems proposed in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] ranked as the best-performing ones, occupying the first
two places (system3 and system2 for the second batch and system1 and system2
for the third batch). Recall that these systems follow a simple machine-learning
approach which uses SVMs and treats the problem as flat.
        </p>
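        <p>The average-rank comparison used for Table 4 can be sketched as follows: on each test set, systems are ranked by score, tied systems share the mean of the positions they occupy, and the ranks are then averaged per system. This is a minimal sketch with hypothetical system names and scores; it does not model the challenge rule that only the 4 best results of each system in a batch are counted.</p>
        <preformat>
```python
def ranks_with_ties(scores):
    """Rank systems on one test set (higher score is better); ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i != len(ordered):
        j = i
        # Extend j over every system tied with the one at position i.
        while j + 1 != len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2  # mean of the 1-based positions i+1 .. j+1
        for name in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


def average_ranks(per_test_set_scores):
    """Average each system's rank over all test sets it appears in."""
    collected = {}
    for scores in per_test_set_scores:
        for name, rank in ranks_with_ties(scores).items():
            collected.setdefault(name, []).append(rank)
    return {name: sum(rs) / len(rs) for name, rs in collected.items()}


# Hypothetical scores of three systems on two test sets.
test_sets = [
    {"A": 0.6, "B": 0.5, "C": 0.5},
    {"A": 0.4, "B": 0.7, "C": 0.3},
]
print(average_ranks(test_sets))
```
        </preformat>
        <p>In the toy data, B and C tie for positions 2 and 3 on the first test set, so both receive rank 2.5 there.</p>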
        <p>
          We note here the good performance of the learning-to-rank systems (RMAI,
RMAIP, RMAIR, RMAIN, RMAIA), which are commonly used in information
retrieval tasks. According to the available descriptions, the only systems that
made use of the MeSH hierarchy were the ones introduced by [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The
top-down hierarchical systems, cole hce1 and cole hce2, achieved mediocre results,
while the utai rebayct systems had poor performances. For the systems based
on a Bayesian network this behavior was expected, as they cannot scale well to
large problems. On the other hand, the question that arises is whether the use of
the MeSH hierarchy can be helpful for classification systems, as the labels that
are assigned by the curators to the PubMed articles do not follow the rule of
the most specialized label. That is, an article may have been assigned a specific
label in a deeper level of the hierarchy and at the same time a label higher up in the
hierarchy that is an ancestor of the most specific one.
        </p>
        <p>4.2 Task 1b</p>
        <p>
Phase A. Table 5 presents the statistics of the training and test data provided
to the participants. As in Task 1a, the evaluation included three test batches.
For phase A of Task 1b the systems were allowed to submit responses to
any of the corresponding categories, that is, documents, concepts, snippets and
RDF triples. For each of the categories we rank the systems according to the
Mean Average Precision (MAP) measure [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The final ranking for each batch is
calculated as the average of the individual rankings in the different categories.
The detailed results for Task 1b phase A can be found in
http://bioasq.lip6.fr/results/1b/phaseA/.
        </p>
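        <p>For orientation, a common textbook formulation of Mean Average Precision over ranked result lists is sketched below. The function names and the toy document identifiers are hypothetical, and the exact variant used in the challenge is the one specified in [<xref ref-type="bibr" rid="ref2">2</xref>], which may normalize differently.</p>
        <preformat>
```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant item appears."""
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position  # precision at this cut-off
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranked list, set of relevant items) pairs, one per question."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical answers to two questions.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d4", "d5"], {"d5"}),
]
print(mean_average_precision(runs))
```
        </preformat>
        <p>Since precision is accumulated only at positions holding relevant items, a long list whose relevant items sit near the top is penalized less than one whose relevant items appear late.</p>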
        <p>Table 6 presents the average ranking of each system in each batch of Task
1b phase A. It is evident from the results that the participating systems did not
succeed in outperforming the two baselines that were used in phase A. Whether
this ineffectiveness can be attributed to the inferior behavior of the participating
systems is not clear, as they seem to follow intuitive ways to construct the queries.
We also note that the systems did not respond to all the categories. For example,
the MCTeam systems did not submit snippets throughout the task.</p>
        <table-wrap>
          <label>Table 6</label>
          <caption>
            <p>Average ranks for each system for each batch of phase A of Task 1b. The
MAP measure was used in order to rank the systems. A hyphen (-) is
used whenever the system did not participate in the corresponding batch.</p>
          </caption>
          <table>
            <thead>
              <tr><th>System</th><th>Batch 1</th><th>Batch 2</th><th>Batch 3</th></tr>
            </thead>
            <tbody>
              <tr><td>Top 100 Baseline</td><td>1.0</td><td>1.875</td><td>1.25</td></tr>
              <tr><td>Top 50 Baseline</td><td>2.5</td><td>2.375</td><td>1.75</td></tr>
              <tr><td>MCTeamMM</td><td>3.625</td><td>4.5</td><td>3.5</td></tr>
              <tr><td>MCTeamMM10</td><td>3.625</td><td>4.5</td><td>3.5</td></tr>
              <tr><td>Wishart-S1</td><td>4.25</td><td>3.875</td><td></td></tr>
              <tr><td>Wishart-S2</td><td>-</td><td>4.125</td><td></td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Focusing on the specific categories (e.g., concepts) for the Wishart system,
we observe that it achieves a balanced behavior with respect to the baselines
(Table 7). This is evident from the value of the F-measure, which is much higher
than the values of the two baselines. This can be explained by the fact that the
Wishart-S1 system responded with short lists while the baselines always return
long lists (50 and 100 items respectively). Similar observations also hold for the
other two batches.</p>
        <p>
Phase B. In phase B of Task 1b the systems were asked to report exact and
ideal answers. The systems were ranked according to the manual evaluation of
ideal answers by the BioASQ experts [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. For reasons of completeness we also report
the results of the systems for the exact answers. To do so, we average the
individual rankings of the systems for the different types of questions, that is,
yes/no, factoid and list.
        </p>
        <p>Table 8 presents the average ranks for each system for the exact answers. In
this phase we note that the Wishart system was able to outperform the BioASQ
baselines.</p>
        <p>
          Table 9 presents the average scores<sup>10</sup> of the biomedical experts for each
system across the batches. Note that the scores are between 1 and 5, and the
higher the score, the better the performance. According to the results, the systems
were able to provide comprehensible answers and in some cases, as in the
second batch, highly readable ones. Of course, this depends on the difficulty of
the question. This seems to be the case in the last batch, where the average
scores are lower than in the other batches. Also, the measures calculated
using ROUGE (the detailed results for Task 1b phase B can be found in
http://bioasq.lip6.fr/results/1b/phaseB/) seem to be consistent with the
manual scores in the first two batches, while the situation is inverted in the third
batch.
        </p>
        <p>10 Please consult the description of the evaluation measures used in the challenge for
more information.</p>
        <p>
          A large number of systems participated in Task 1a, the majority of which were
able to cope successfully with both the large scale of the problem and the on-line
evaluation procedure. From the results we can draw three major
conclusions: First, the majority of the systems were able to achieve good
performance, as they were able to outperform the weak baseline throughout the
batches. Second, the best systems were able to outperform even the strong
baseline (MTIFL), which is the current state of the art for biomedical indexing. This
is a very important achievement towards the goal of the challenge and the
development of accurate classification systems for large-scale problems. Finally, the
wide variety of technologies used by the participants allowed us to assess them on
a very large-scale scenario. Simple machine-learning approaches (see, e.g., [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ])
were shown to achieve state-of-the-art results. Additionally, learning-to-rank
approaches (see [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]) were shown to be effective for large-scale classification
tasks. Interestingly, the hierarchical approach employed in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] achieved moderate
results, suggesting that the MeSH hierarchy may not be appropriate for
classification tasks.
        </p>
        <p>The smaller number of participants in Task 1b and the poor results achieved
by these systems suggest that this task is particularly challenging. As the systems
seem to follow well-principled ways to construct the queries, we cannot conclude
whether their low performance can be attributed to the use of low-performance
methods. Other factors might have played a role, including the retrieval engines
underlying the systems not being able to retrieve appropriate responses from the
designated resources. Interestingly, one participant (Wishart) was still able to outperform the
baselines in phase B. The automatic measures that were used to assess
the ideal answers seem to be in accordance with the manual scores assigned by
the BioASQ experts in the first two batches of the task, while in the third one
the measures behave differently. This discrepancy will be investigated in
future work.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Alan R.</given-names>
            <surname>Aronson</surname>
          </string-name>
          and
          <string-name>
            <surname>François-Michel Lang</surname>
          </string-name>
          .
          <article-title>An overview of metamap: historical perspective and recent advances</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          ,
          <volume>17</volume>
          :
          <fpage>229</fpage>
          –
          <fpage>236</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Georgios</given-names>
            <surname>Balikas</surname>
          </string-name>
          , Ioannis Partalas, Aris Kosmopoulos, Sergios Petridis, Prodromos Malakasiotis, Ioannis Pavlopoulos, Ion Androutsopoulos, Nicolas Baskiotis, Eric Gaussier, Thierry Artieres, and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Gallinari</surname>
          </string-name>
          .
          <article-title>Evaluation framework specifications</article-title>
          .
          <source>Project deliverable D4.1</source>
          ,
          <issue>05</issue>
          /
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Janez</given-names>
            <surname>Demsar</surname>
          </string-name>
          .
          <article-title>Statistical comparisons of classifiers over multiple data sets</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>7</volume>
          :1–
          <fpage>30</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Aris</given-names>
            <surname>Kosmopoulos</surname>
          </string-name>
          , Ioannis Partalas, Eric Gaussier, Georgios Paliouras, and
          <string-name>
            <given-names>Ion</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          .
          <article-title>Evaluation measures for hierarchical classification: a unified view and novel approaches</article-title>
          .
          <source>CoRR, abs/1306.6802</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chin-Yew Lin</surname>
          </string-name>
          .
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          .
          <source>In Proceedings of the ACL workshop `Text Summarization Branches Out'</source>
          , pages
          <fpage>74</fpage>
          –
          <fpage>81</fpage>
          ,
          <string-name>
            <surname>Barcelona</surname>
          </string-name>
          , Spain,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Yifeng</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Bioasq system descriptions (wishart team)</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Yuqing</given-names>
            <surname>Mao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zhiyong</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>NCBI at the 2013 BioASQ challenge task: Learning to rank for automatic MeSH indexing</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>James</given-names>
            <surname>Mork</surname>
          </string-name>
          , Antonio Jimeno-Yepes, and
          <string-name>
            <given-names>Alan</given-names>
            <surname>Aronson</surname>
          </string-name>
          .
          <article-title>The NLM Medical Text Indexer system for indexing biomedical literature</article-title>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Francisco Ribadas, Luis de Campos, Victor Darriba, and
          <string-name>
            <given-names>Alfonso</given-names>
            <surname>Romero</surname>
          </string-name>
          .
          <article-title>Two hierarchical text categorization approaches for BioASQ semantic indexing challenge</article-title>
          . In
          <source>1st BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Carlos N.</given-names>
            <surname>Silla</surname>
          </string-name>
          , Jr. and
          <string-name>
            <given-names>Alex A.</given-names>
            <surname>Freitas</surname>
          </string-name>
          .
          <article-title>A survey of hierarchical classification across different application domains</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>22</volume>
          :
          <fpage>31</fpage>
          –
          <lpage>72</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Lei</given-names>
            <surname>Tang</surname>
          </string-name>
          , Suju Rajan, and
          <string-name>
            <given-names>Vijay K.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          .
          <article-title>Large scale multi-label classification via metalabeler</article-title>
          .
          <source>In Proceedings of the 18th International Conference on World Wide Web, WWW '09</source>
          , pages
          <fpage>211</fpage>
          –
          <lpage>220</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Grigorios</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          , Ioannis Katakis, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .
          <article-title>Mining Multi-label Data</article-title>
          .
          In Oded Maimon and Lior Rokach
          , editors,
          <source>Data Mining and Knowledge Discovery Handbook</source>
          , pages
          <fpage>667</fpage>
          –
          <lpage>685</lpage>
          .
          Springer US
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Grigorios</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          , Manos Laliotis, Nikos Markantonatos, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .
          <article-title>Large-scale semantic indexing of biomedical publications</article-title>
          . In
          <source>1st BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Dirk</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          , George Tsatsaronis, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Schroeder</surname>
          </string-name>
          .
          <article-title>Answering factoid questions in the biomedical domain</article-title>
          . In
          <source>1st BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Dongqing</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dingcheng</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ben</given-names>
            <surname>Carterette</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hongfang</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>An incremental approach for MEDLINE MeSH indexing</article-title>
          . In
          <source>1st BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>