Overview of BioASQ 8a and 8b: Results of the eighth edition of the BioASQ tasks a and b

Anastasios Nentidis1,2, Anastasia Krithara1, Konstantinos Bougiatiotis1,3, and Georgios Paliouras1
1 National Center for Scientific Research “Demokritos”, Athens, Greece
{tasosnent, akrithara, bogas.ko, paliourg}@iit.demokritos.gr
2 Aristotle University of Thessaloniki, Thessaloniki, Greece
3 National and Kapodistrian University of Athens, Athens, Greece

Abstract. In this paper, we present an overview of the eighth edition of tasks a and b of the BioASQ challenge, which ran as a lab in the Conference and Labs of the Evaluation Forum (CLEF) 2020. BioASQ has aimed at promoting methodologies and systems for large-scale biomedical semantic indexing and question answering through the organization of yearly challenges since 2012. These shared tasks offer teams around the world the opportunity to develop and compare their methods on the same benchmark datasets, which represent the demanding information needs of biomedical experts. This year, apart from the introduction of a new task on medical semantic indexing in Spanish (MESINESP8), the eighth versions of the two established BioASQ tasks on semantic indexing (8a) and question answering (8b) in English were also offered. In total, 34 teams with more than 100 systems participated in the three tasks of the challenge, with seven of them focusing on task 8a and 23 on task 8b. As in previous versions of the tasks, the evaluation of system responses reveals that some participating systems managed to outperform the strong baselines, indicating that continuous advancements in state-of-the-art systems keep pushing the frontier of research, leading to performance improvements.

Keywords: Biomedical knowledge · Semantic Indexing · Question Answering

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

This paper presents the shared tasks 8a and 8b of the eighth edition of the BioASQ challenge in 2020, the corresponding datasets, and the approaches and results of the participating systems. A detailed description of the new task on medical indexing in Spanish is offered in the MESINESP task overview. A condensed overview of the BioASQ 2020 lab [2] is also available, describing the eighth edition of the BioASQ challenge as a whole, in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2020. In section 2 we provide an overview of the shared tasks 8a and 8b, which took place from February to May 2020, as well as the corresponding datasets developed for training and testing the participating systems. In section 3, we briefly present the participating systems and the approaches proposed by the corresponding teams for these two tasks. Detailed descriptions for some of the systems are also available in the proceedings of the BioASQ lab. In section 4, we present the results of the evaluation of the participating systems, based on manual assessment or state-of-the-art evaluation measures, depending on the nature of the required system response. Finally, in section 5 we conclude and discuss the eighth version of the BioASQ tasks a and b.
2 Overview of the Tasks

In the eighth version of the BioASQ challenge, three tasks were offered: (1) a large-scale biomedical semantic indexing task (task 8a), (2) a biomedical question answering task (task 8b), both considering documents in English, and (3) a new task on medical semantic indexing in Spanish (task MESINESP). In this section we provide a brief description of the two established tasks (8a and 8b), focusing on differences from previous versions of the challenge [32]. A detailed overview of the initial versions of the tasks and the general structure of BioASQ is already available [46].

2.1 Large-scale semantic indexing - Task 8a

In Task 8a the aim is to classify articles from the PubMed/MedLine digital library (https://pubmed.ncbi.nlm.nih.gov/) into concepts of the MeSH hierarchy. In particular, new PubMed articles that have not yet been annotated by the indexers in NLM are gathered to form the test sets for the evaluation of the participating systems. Some basic details about each test set and batch are provided in Table 1. As in previous versions of the task, the task is divided into three independent batches of 5 weekly test sets each, providing an on-line and large-scale scenario, and the test sets consist of new articles without any restriction on the publishing journal. The performance of the participating systems is calculated using standard flat information retrieval measures, as well as hierarchical ones, once the annotations from the NLM indexers become available. As usual, participants have 21 hours to provide their answers for each test set. However, as new MeSH annotations have been observed to be released in PubMed earlier than in previous years, we shifted the submission period accordingly, to avoid having some annotations available from NLM while the task is still running. For training, a dataset of 14,913,939 articles with 12.68 labels per article, on average, was provided to the participants.

Batch  Articles  Annotated Articles  Labels per Article
1      6510      6487                12.49
       7126      7074                12.27
       10891     10789               12.55
       6225      6182                12.28
       6953      6887                12.75
Total  37705     37419               0.99
2      6815      6787                12.49
       6485      6414                12.52
       7014      6975                11.92
       6726      6647                12.90
       6379      6246                12.45
Total  33419     33069               0.99
3      6842      6601                12.70
       7212      6456                12.37
       5430      4764                12.59
       6022      4858                12.33
       5936      3999                12.21
Total  31442     26678               0.85
Table 1. Statistics on test datasets for Task 8a.

2.2 Biomedical semantic QA - Task 8b

Task 8b aims at providing a realistic large-scale question answering challenge, offering to the participating teams the opportunity to develop systems for all the stages of question answering in the biomedical domain. Four types of questions are considered in the task: “yes/no”, “factoid”, “list” and “summary” questions [4]. A training dataset of 3,243 questions annotated with golden relevant elements and answers is provided for the participants to develop their systems. Table 2 presents some statistics about the training dataset as well as the five test sets.

Batch   Size   Yes/No  List  Factoid  Summary  Documents  Snippets
Train   3,243  881     644   941      777      10.15      12.92
Test 1  100    25      20    32       23       3.45       4.51
Test 2  100    36      14    25       25       3.86       5.05
Test 3  100    31      12    28       29       3.35       4.71
Test 4  100    26      17    34       23       3.23       4.38
Test 5  100    34      12    32       22       2.57       3.20
Total   3,743  1033    719   1092     899      9.23       11.78
Table 2. Statistics on the training and test datasets of Task 8b. The numbers for the documents and snippets refer to averages per question.
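The Task 8b data are distributed as JSON files of questions with their golden elements. The following minimal sketch (not official BioASQ tooling) shows how statistics such as those in Table 2 can be computed from such a file; the file name and the field names ("questions", "type", "documents", "snippets") reflect our understanding of the format and may need to be adjusted.

```python
# Sketch: derive Table-2-style statistics from a BioASQ 8b training file.
import json
from collections import Counter

with open("training8b.json") as f:          # hypothetical local path to the training set
    questions = json.load(f)["questions"]

type_counts = Counter(q["type"] for q in questions)
avg_docs = sum(len(q.get("documents", [])) for q in questions) / len(questions)
avg_snips = sum(len(q.get("snippets", [])) for q in questions) / len(questions)

print("questions:", len(questions))
print("per type:", dict(type_counts))       # e.g. yesno / factoid / list / summary
print("avg documents per question: %.2f" % avg_docs)
print("avg snippets per question: %.2f" % avg_snips)
```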
As in previous versions of the challenge, the task is structured into two phases that focus on the retrieval of the required information (phase A) and on answering the question (phase B). In addition, the task is split into five independent bi-weekly batches, and the two phases for each batch run during two consecutive days. In each phase, the participants receive the corresponding test set and have 24 hours to submit the answers of their systems. In particular, in phase A, a test set of 100 questions written in English is released and the participants are expected to identify and submit relevant elements from designated resources, including PubMed/MedLine articles, snippets extracted from these articles, concepts and RDF triples. In phase B, the manually selected relevant articles and snippets for these 100 questions are also released and the participating systems are asked to respond with exact answers, that is entity names or short phrases, and ideal answers, that is natural language summaries of the requested information.

3 Overview of participation

Fig. 1. The world-wide distribution of teams participating in the tasks 8a and 8b, based on institution affiliations.

This year, 34 teams from institutes around the world participated in the three tasks of the challenge with more than 100 distinct systems. Seven of these teams focused on task 8a and 23 on task 8b. As presented in Fig. 1, the institutions hosting the teams that participated in tasks 8a and 8b are distributed around the world, highlighting the international interest in the tasks. Compared to previous versions of the challenge, we observe a shift towards the most complex question answering task b, where the number of participating teams and systems has been increasing over the last years, as shown in Fig. 2.

Fig. 2. The evolution of participation in the BioASQ tasks a and b until their current eighth version.

3.1 Task 8a

This year, 7 teams participated in the eighth edition of task a, submitting predictions from 16 different systems in total. Here, we provide a brief overview of those systems for which a description was available, stressing their key characteristics. A summary of the participating systems and corresponding approaches is presented in Table 3.

System         Approach
X-BERT BioASQ  X-BERT, Transformers, ELMo, MER
NLM CNN        SentencePiece, CNN, embeddings, ensembles
dmiip fdu      d2v, tf-idf, SVM, KNN, LTR, DeepMeSH, AttentionXML, BERT, PLT, BERTMeSH
Iria           Lucene Index, k-NN, stem bigrams, ensembles, UIMA ConceptMapper
Table 3. Systems and approaches for Task 8a. Systems for which no description was available at the time of writing are omitted.

This year, the LASIGE team from the University of Lisboa, in its “X-BERT BioASQ” system, proposes a novel approach for biomedical semantic indexing that combines a solution based on Extreme Multi-Label Classification (XMLC) with a Named-Entity-Recognition (NER) tool. In particular, their system is based on X-BERT [8], an approach to scale BERT [14] to XMLC, combined with the use of the MER [12] tool to recognize MeSH terms in the abstracts of the articles. The system is structured into three steps. The first step is the semantic indexing of the labels into clusters using ELMo [39]; a second step then matches the indices using a Transformer architecture; and finally, the third step focuses on ranking the labels retrieved from the previous indices.
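For illustration, the following sketch shows the general shape of such a cluster-then-rank pipeline for extreme multi-label MeSH indexing. It uses TF-IDF features, k-means label clustering and one-vs-rest logistic regression as lightweight stand-ins for the ELMo, Transformer and BERT components of X-BERT, and toy abstracts and labels rather than BioASQ data; it is a sketch of the general idea, not the team's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data (hypothetical abstracts and MeSH-like labels).
abstracts = [
    "insulin therapy improves glycemic control in type 2 diabetes mellitus",
    "neoadjuvant drug therapy response in breast neoplasms",
    "deep brain stimulation outcomes in parkinson disease",
    "metformin as first line treatment for type 2 diabetes mellitus",
]
gold = [
    ["Diabetes Mellitus, Type 2", "Insulin"],
    ["Breast Neoplasms", "Drug Therapy"],
    ["Parkinson Disease", "Deep Brain Stimulation"],
    ["Diabetes Mellitus, Type 2", "Metformin"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(gold)                              # documents x labels indicator matrix
doc_vec = TfidfVectorizer()
X = doc_vec.fit_transform(abstracts)                     # documents x term features

# Step 1: index the labels into clusters (TF-IDF of label names instead of ELMo).
label_repr = TfidfVectorizer().fit_transform(mlb.classes_)
n_clusters = 3
label_cluster = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(label_repr)

# Step 2: a matcher that shortlists clusters for a document, here implemented as cosine
# similarity to the centroid of the training documents carrying labels from each cluster.
centroids = []
for c in range(n_clusters):
    members = np.where(label_cluster == c)[0]
    doc_mask = Y[:, members].sum(axis=1) > 0
    centroids.append(np.asarray(X[doc_mask].mean(axis=0)).ravel())
centroids = np.vstack(centroids)

# Step 3: per-label scorers used to rank labels inside the shortlisted clusters.
ranker = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

def predict_labels(text, top_clusters=2, top_k=3):
    x = doc_vec.transform([text])
    cluster_scores = cosine_similarity(x, centroids).ravel()
    shortlisted = np.argsort(-cluster_scores)[:top_clusters]
    candidate_idx = [j for j in range(len(mlb.classes_)) if label_cluster[j] in shortlisted]
    label_scores = ranker.predict_proba(x)[0]
    ranked = sorted(candidate_idx, key=lambda j: -label_scores[j])[:top_k]
    return [(mlb.classes_[j], float(label_scores[j])) for j in ranked]

print(predict_labels("glycemic control with insulin analogues in diabetic patients"))
```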
Other teams improved upon existing systems that had already participated in previous versions of the task. Namely, the National Library of Medicine (NLM) team, in its “NLM CNN” system, enhances the previous version of their “ceb” systems [40], based on an end-to-end Deep Learning (DL) architecture with Convolutional Neural Networks (CNN), with SentencePiece tokenization [24]. The Fudan University team also builds upon their previous “AttentionXML” [55] and “DeepMeSH” [38] systems, as well as their new “BERTMeSH” [54] system, which are based on document-to-vector (d2v) and tf-idf feature embeddings, learning to rank (LTR), DL-based extreme multi-label text classification, attention mechanisms and Probabilistic Label Trees (PLT) [18]. Finally, this year's versions of the “Iria” systems [43] are based on the same techniques used by the systems in previous versions of the challenge, which are summarized in Table 3.

Similarly to the previous versions of the challenge, two systems developed by NLM to facilitate the annotation of articles by indexers in MedLine/PubMed were available as baselines for the semantic indexing task: MTI [31] as enhanced in [56], and an extension based on features suggested by the winners of the first version of the task [47].

3.2 Task 8b

This version of Task b was tackled by 94 different systems in total, developed by 23 teams. In particular, 8 teams participated in the first phase, on the retrieval of relevant material required for answering the questions, submitting results from 30 systems. In the second phase, on providing the exact and ideal answers for the questions, 18 teams participated with 72 distinct systems. Three of the teams participated in both phases. An overview of the approaches, technologies and datasets used by the teams is provided in Table 4, and a graphical representation of them as a word cloud, weighted by their frequency in logarithmic scale, is also provided in Fig. 3. Only systems for which a description was available are included in this section. Detailed descriptions for some of the systems are available in the proceedings of the workshop.

System            Phase                                      Approach
pa                A (documents, snippets), B (exact, ideal)  BM25, BERT, Word2Vec, SQuAD, PubMedQA, BioMed-RoBERTa
bio-answerfinder  A (documents, snippets), B (exact, ideal)  Bio-AnswerFinder, LSTM, ElasticSearch, BERT, Electra, BioBERT, SQuAD, wRWMD
AUEB              A (documents, snippets), B (exact)         BM25, Word2Vec, Graph-Node Embeddings, SciBERT, DL (JPDRMM)
bioinfo           A (documents, snippets)                    BM25, ElasticSearch, distant learning, DeepRank
Google            A (documents)                              BM25, BioBERT, Synthetic Query Generation, BERT, reranking
KU-DMIS           B (exact, ideal)                           BioBERT, NLI, MultiNLI, SQuAD, BART, beam search, BERN, language check
NCU-IISR          B (exact, ideal)                           BioBERT, logistic regression, LTR
UoT               B (exact)                                  BioBERT, multi-task learning, BC2GM
BioNLPer          B (exact)                                  BioBERT, multi-task learning, NLTK, ScispaCy
LabZhu            B (exact)                                  BERT, BioBERT, XLNet, SpanBERT, transfer learning, SQuAD, ensembling
umass czi         B (exact)                                  BioBERT, SciBERT, BioSentVec, PubTator, SQuAD, PubMedQA, transfer learning
MQ                B (ideal)                                  Word2Vec, BERT, LSTM, Reinforcement Learning (PPO)
DAIICT            B (ideal)                                  textrank, lexrank, UMLS
sbert             B (ideal)                                  Sentence-BERT, BioBERT, SNLI, MultiNLI, multi-task learning, MQU
Table 4. Systems and approaches for Task 8b. Systems for which no information was available at the time of writing are omitted.

Fig. 3. A word cloud of the approaches, techniques and datasets used by task 8b participating teams, weighted by their frequency in logarithmic scale.

The “ITMO” team participated in both phases of the task, experimenting in its “pa” systems with different solutions across the batches. In general, for document retrieval the systems follow a two-stage approach. First, they identify initial candidate articles based on BM25, and then they re-rank them using variations of BERT [14], fine-tuned for the binary classification task with the BioASQ dataset and pseudo-negative documents. They extract snippets from the top documents and re-rank them using biomedical Word2Vec, based on cosine similarity with the question.
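A minimal sketch of this two-stage retrieval pattern (BM25 pre-fetching followed by neural re-ranking), which several phase A systems share, is given below. It relies on the rank_bm25 and sentence-transformers packages and uses a general-domain MS MARCO cross-encoder as a stand-in for a BERT re-ranker fine-tuned on BioASQ data; the corpus is a hypothetical toy example.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "Metformin reduces hepatic glucose production in type 2 diabetes.",
    "Deep brain stimulation alleviates motor symptoms of Parkinson disease.",
    "Insulin resistance is a hallmark of metabolic syndrome.",
]
question = "Which drug reduces hepatic glucose production in diabetes?"

# Stage 1: lexical candidate retrieval with BM25.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(question.lower().split())
candidates = sorted(range(len(corpus)), key=lambda i: -scores[i])[:10]

# Stage 2: re-rank the candidates with a cross-encoder scoring (question, document) pairs.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pair_scores = reranker.predict([(question, corpus[i]) for i in candidates])
reranked = [corpus[i] for _, i in sorted(zip(pair_scores, candidates), reverse=True)]

print(reranked[0])
```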
To extract exact answers, they use BERT fine-tuned on the SQuAD [41] and BioASQ datasets, employing a post-processing step to split the answer for list questions and additional fine-tuning on PubMedQA [19] for yes/no questions. Finally, for ideal answers they generate some candidates from the snippets and their sentences and re-rank them using the model used for phase A. In the last batch, they also experiment with generative summarization, developing a model based on BioMed-RoBERTa [17] to improve the readability and consistency of the produced ideal answers.

Another team participating in both phases of the task is the “UCSD” team with its “bio-answerfinder” system. In particular, for phase A they rely on the previously developed Bio-AnswerFinder system [34], which is also used as a first step in phase B, for re-ranking the sentences of the snippets provided in the test set. For identifying the exact answers for factoid and list questions they experimented with fine-tuning Electra [10] and BioBERT [25] on the SQuAD and BioASQ datasets combined. The answer candidates are then scored considering the classification probability, the top ranking of the corresponding snippets, and the number of occurrences. Finally, a normalization and filtering step is performed and, for list questions, an enrichment step based on coordinated phrase detection. For yes/no questions they fine-tune BioBERT on the BioASQ dataset and use majority voting. For summary questions, they employ hierarchical clustering, based on weighted relaxed word mover's distance (wRWMD) similarity [34], to group the top sentences, and select the sentence ranked highest by Bio-AnswerFinder in each group to be concatenated to form the summary.
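A minimal sketch of this cluster-and-select strategy for forming a summary answer is shown below, using TF-IDF cosine distance and SciPy agglomerative clustering as simple stand-ins for the wRWMD similarity of the actual system; the ranked sentences are toy examples and the distance threshold is an illustrative choice.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

# Snippet sentences already ranked by relevance to the question (first = best).
ranked_sentences = [
    "Metformin lowers hepatic glucose production.",
    "Metformin reduces glucose output by the liver.",
    "It also improves peripheral insulin sensitivity.",
    "Gastrointestinal side effects are common with metformin.",
]

vectors = TfidfVectorizer().fit_transform(ranked_sentences).toarray()
condensed = pdist(vectors, metric="cosine")            # pairwise cosine distances
tree = linkage(condensed, method="average")            # agglomerative clustering
cluster_ids = fcluster(tree, t=0.8, criterion="distance")

# Keep the highest-ranked sentence per cluster, preserving rank order in the summary.
seen, summary = set(), []
for sentence, cid in zip(ranked_sentences, cluster_ids):
    if cid not in seen:
        seen.add(cid)
        summary.append(sentence)

print(" ".join(summary))
```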
The “AUEB” team also participated in both phases, focusing on phase A and briefly experimenting with phase B. Working on extending their previous top-performing model [36], they experimented with graph-node embeddings generated from a biomedical entity co-occurrence graph built from publications [23]. Moreover, they experimented with new ways to encode and retrieve relevant snippets, but concluded that conventional BM25 pre-fetching was more efficient. For phase B, they worked on exact answer extraction. To this end, they experimented with a SciBERT-based model [5] designed for cloze-style biomedical machine reading comprehension (MRC) [37]. However, their initial results indicated that the MRC task differs greatly from the exact answer extraction task, and they did not pursue this research direction further.

In phase A, the team from the University of Aveiro participated with its “bioinfo” systems, which consist of a fine-tuned BM25 retrieval model based on ElasticSearch [16], followed by a neural re-ranking step. For the latter, they use an interaction-based model inspired by the DeepRank [35] architecture, building upon previous versions of their system [1]. The focus of the improvements was on the sentence splitting strategy, on extracting multiple relevance signals, and on the independent contribution of each sentence to the final score.

The “Google” team also participated in phase A, with four distinct systems for document retrieval based on different approaches. In particular, they used a BM25 retrieval model, a neural retrieval model initialized with BioBERT and trained on a large set of questions developed through Synthetic Query Generation (QGen), and a hybrid retrieval model based on a linear blend of BM25 and the neural model [28] (see https://ai.googleblog.com/2020/05/an-nlu-powered-tool-to-explore-covid-19.html). In addition, they also used a reranking model, rescoring the results of the hybrid model with a cross-attention BERT rescorer [36].
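A linear blend of this kind can be sketched as follows; the min-max normalisation, the blend weight and the toy scores below are our own illustrative choices, not details reported by the team, and the weight would normally be tuned on held-out data.

```python
# Sketch: blend lexical (BM25) and neural retrieval scores per query.
def minmax(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def hybrid_rank(doc_ids, bm25_scores, neural_scores, alpha=0.5):
    b, n = minmax(bm25_scores), minmax(neural_scores)
    blended = [alpha * bi + (1 - alpha) * ni for bi, ni in zip(b, n)]
    return sorted(zip(doc_ids, blended), key=lambda t: -t[1])

docs = ["d1", "d2", "d3"]
print(hybrid_rank(docs, bm25_scores=[12.3, 7.1, 9.8], neural_scores=[0.42, 0.87, 0.55]))
```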
In phase B, this year the “KU-DMIS” team participated in both exact and ideal answers. For exact answers, they build upon their previous BioBERT-based systems [53] and try to adapt the sequential transfer learning of Natural Language Inference (NLI) to biomedical question answering. In particular, they investigate whether learning knowledge of entailment between two sentence pairs can improve exact answer generation, enhancing their BioBERT-based models with alternative fine-tuning configurations based on the MultiNLI dataset [50]. For ideal answer generation, they develop a deep neural abstractive summarization model based on BART [26] and beam search, with particular focus on the pre-processing and post-processing steps. In particular, alternative systems were developed that either consider the answers predicted by the exact answer prediction system in their input or not. In the post-processing step, the generated candidate ideal answers for each question were scored using the predicted exact answers and grammar scores provided by the language check tool (https://pypi.org/project/language-check/). For factoid and list questions in particular, the BERN [21] tool was also employed to recognize named entities in the candidate ideal answers for the scoring step.

The “NCU-IISR” team also participated in both parts of phase B, constructing two BioBERT-based models for extracting the exact answer and ranking the ideal answers respectively. The first model is fine-tuned on the BioASQ dataset, formulated as a SQuAD-type QA task that extracts the answer span. For the second model, they regard the sentences of the provided snippets as candidate ideal answers and build a ranking model with two parts. First, a BioBERT-based model takes as input the question and one of the snippet sentences and provides their representation. Then, a logistic regressor, trained on predicting the similarity between a question and each snippet sentence, takes this representation and outputs a score, which is used for selecting the final ideal answer.

The “UoT” team participated with three different DL approaches for generating exact answers. In their first approach, they fine-tune separately two distinct BioBERT-based models extended with an additional neural layer depending on the question type, one for yes/no and one for factoid and list questions together. In their second system, they use a joint-learning setting, where the same BioBERT layer is connected with both additional layers and jointly trained for all types of questions. Finally, in their third system they propose a multi-task model to learn recognizing biomedical entities and answers to questions simultaneously, aiming at transferring knowledge from the biomedical entity recognition task to question answering. In particular, they extend their joint BioBERT-based model with simultaneous training on the BC2GM dataset [45] for recognizing gene and protein entities.

The “BioNLPer” team also participated in the exact answers part of phase B, focusing on factoids. They proposed 5 BioBERT-based systems, using external feature enhancement and auxiliary task methodologies. In particular, in their “factoid qa model” and “Parameters retrained” systems they consider the prediction of answer boundaries (start and end positions) as the main task and the whole answer content prediction as an auxiliary task. In their “Features Fusion” system they leveraged external features, including NER and part-of-speech (POS) information extracted with the NLTK [27] and ScispaCy [33] tools, as additional textual information and fused them with the pre-trained language model representations to improve answer boundary prediction. Then, in their “BioFusion” system they combine the two methodologies together. Finally, their “BioLabel” system employed general versus biomedical domain corpus classification as the auxiliary task to help answer boundary prediction.

The “LabZhu” systems participated in phase B as well, with a focus on exact answers for the factoid and list questions. They treat answer generation as an extractive machine comprehension task and explore several different pretrained language models, including BERT, BioBERT, XLNet [51] and SpanBERT [20]. They also follow a transfer learning approach, training the models on the SQuAD dataset and then fine-tuning them on the BioASQ datasets. Finally, they also rely on voting to integrate the results of multiple models.

The “umass czi” team also focused on the exact answer part of phase B, experimenting with unsupervised representation learning approaches in the context of biomedical QA. In particular, they considered pretrained representations based on BioBERT, SciBERT and BioSentVec [9], and experimented with transferring knowledge from the SQuAD and PubMedQA datasets into the BioASQ 8b QA task. Finally, they also develop a new pre-training method based on a self-supervised de-noising approach. In this method, they first generate a QA dataset by randomly replacing entities automatically recognized by PubTator [48] in PubMed abstracts. Then, they train their model on extracting the span of the new entities given the original ones as queries.
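Many of the exact-answer systems described above share the same underlying pattern: a (Bio)BERT model fine-tuned for SQuAD-style span extraction, applied to the question and a relevant snippet. A minimal sketch with the Hugging Face transformers pipeline is shown below; the checkpoint is a general-domain SQuAD model standing in for the biomedical models actually used by the teams.

```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

question = "Which drug reduces hepatic glucose production in type 2 diabetes?"
snippet = ("Metformin is the first-line agent in type 2 diabetes; it acts mainly by "
           "reducing hepatic glucose production and improving insulin sensitivity.")

result = qa(question=question, context=snippet)
print(result["answer"], result["score"])   # predicted answer span and its confidence
```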
The “MQ” team, as in past years, focused on ideal answers, approaching the task as query-based summarisation. In some of their systems they retrain their previous classification and regression approaches [30] on the new training dataset. In addition, they also employ reinforcement learning with Proximal Policy Optimization (PPO) [44] and two variants to represent the input features, namely Word2Vec-based and BERT-based embeddings.

The “DAIICT” team also participated in ideal answer generation, using the standard extractive summarization techniques textrank [29] and lexrank [15], as well as sentence selection techniques based on similarity with the query. They also modified these techniques, investigating the effect of query expansion based on UMLS [6] for sentence selection and summarization.

Finally, the “sbert” team also focused on ideal answers. They experimented with different embedding models and multi-task learning in their systems, using parts of previous “MQU” systems for the pre-processing of data and for the prediction step based on classification and regression [30]. In particular, they used a Universal Sentence Embedding Model [11] (BioBERT-NLI, https://huggingface.co/gsarti/biobert-nli), based on a version of BioBERT fine-tuned on the SNLI [7] and MultiNLI datasets as in Sentence-BERT [42]. The features were fed to either a single logistic regression or classification model to derive the ideal answers. Additionally, in a multi-task setting, they trained the model on both the classification and regression tasks, selecting one of them for the final prediction.

In this challenge too, the open source OAQA system proposed in [52] served as baseline for phase B exact answers. The system, which achieved among the highest performances in previous versions of the challenge, remains a strong baseline for the exact answer generation task. The system is developed on the UIMA framework. ClearNLP is employed for question and snippet parsing. MetaMap, TmTool [49], C-Value and LingPipe [3] are used for concept identification, and UMLS Terminology Services (UTS) for concept retrieval. The final steps include identification of concept, document and snippet relevance, based on classifier components, followed by scoring and ranking techniques.

4 Results

4.1 Task 8a

System                Batch 1        Batch 2        Batch 3
                      MiF    LCA-F   MiF    LCA-F   MiF    LCA-F
deepmesh dmiip fdu    1.25   2.25    1.875  3.25    2.25   2.25
deepmesh dmiip fdu    2.375  3.625   1.25   1.25    1.75   2
attention dmiip fdu   3      2.25    3.5    3.125   3      3.25
Default MTI           4.75   3.75    6      5.25    6      5.5
MTI First Line Index  5.5    4.5     6.75   5.875   5.75   5.25
dmiip fdu             -      -       2.375  1.625   1.5    1.25
NLM CNN               -      -       5      6.75    5.5    7
iria-mix              -      -       -      -       8.25   8.25
iria-1                -      -       -      -       9.25   9.25
X-BERT BioASQ         -      -       -      -       10.75  10.75
Table 5. Average system ranks across the batches of Task 8a. A hyphen (-) is used whenever the system participated in fewer than 4 test sets in the batch. Systems participating in fewer than 4 test sets in all three batches are omitted.

In Task 8a, each of the three batches was evaluated independently, as presented in Table 5. Standard flat and hierarchical evaluation measures [4] were used for measuring the classification performance of the systems. In particular, the micro F-measure (MiF) and the Lowest Common Ancestor F-measure (LCA-F) were used to identify the winners for each batch [22]. As suggested by Demšar [13], the appropriate way to compare multiple classification systems over multiple datasets is based on their average rank across all the datasets. In this task, the system with the best performance in a test set gets rank 1.0 for this test set, the second best gets rank 2.0, and so on. In case two or more systems tie, they all receive the average rank. Then, according to the rules of the challenge, the average rank of each system for a batch is calculated based on the four best ranks of the system in the five test sets of the batch.
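A minimal sketch of this average-rank computation, including the tie handling and the best-4-of-5 rule, is given below; the scores are toy values and the official implementation may differ in details.

```python
from statistics import mean

def rank_with_ties(scores):
    """Map {system: score} to {system: rank}, rank 1.0 for the best, ties averaged."""
    ordered = sorted(scores, key=lambda s: -scores[s])
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg = mean(range(i + 1, j + 2))          # average of the tied positions (1-based)
        for s in ordered[i:j + 1]:
            ranks[s] = float(avg)
        i = j + 1
    return ranks

def batch_average_rank(weekly_scores, best_n=4):
    """weekly_scores: list of {system: MiF} dicts, one per test set in the batch."""
    per_system = {}
    for scores in weekly_scores:
        for system, r in rank_with_ties(scores).items():
            per_system.setdefault(system, []).append(r)
    return {s: mean(sorted(r)[:best_n]) for s, r in per_system.items()}

toy_batch = [{"sysA": 0.71, "sysB": 0.69, "MTI": 0.64} for _ in range(5)]
print(batch_average_rank(toy_batch))
```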
The average rank of each system, based on both the flat MiF and the hierarchical LCA-F scores, for the three batches of the task is presented in Table 5. The results in Task 8a show that, in all test batches and for both flat and hierarchical measures, the best systems outperform the strong baselines. In particular, the “dmiip fdu” systems from the Fudan University team achieve the best performance in all three batches of the task. More detailed results can be found in the online results page (http://participants-area.bioasq.org/results/8a/). Comparing these results with the corresponding results from previous versions of the task suggests that both the MTI baseline and the top performing systems keep improving through the years of the challenge, as shown in Figure 4.

Fig. 4. The micro f-measure (MiF) achieved by systems across different years of the BioASQ challenge. For each test set the MiF score is presented for the best performing system (Top) and the MTI, as well as the average micro f-measure of all the participating systems (Avg).

4.2 Task 8b

Phase A: In the first phase of Task 8b, the systems are ranked according to the Mean Average Precision (MAP) measure for each of the four types of annotations, namely documents, snippets, concepts and RDF triples. This year, the calculation of Average Precision (AP) in MAP for phase A was revised, as described in the official description of the evaluation measures for Task 8b (http://participants-area.bioasq.org/Tasks/b/eval_meas_2020/). In brief, since BioASQ3 the participating systems have been allowed to return up to 10 relevant items (e.g. documents), and the calculation of AP was modified to reflect this change. However, in recent years the number of golden relevant items has often been observed to be lower than 10, resulting in relatively small AP values even for submissions containing all the golden elements. For this reason, this year we modified the MAP calculation to consider both the limit of 10 elements and the actual number of golden elements.
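One way to implement the adjusted Average Precision is sketched below: the precision contributions of the (up to 10) returned relevant items are divided by the minimum of 10 and the number of golden items. This is our reading of the change; the official definition is the one given on the Task 8b evaluation measures page referenced above.

```python
def average_precision(returned, golden, limit=10):
    returned = returned[:limit]
    hits, precision_sum = 0, 0.0
    for r, item in enumerate(returned, start=1):
        if item in golden:
            hits += 1
            precision_sum += hits / r              # precision at rank r
    return precision_sum / min(limit, len(golden)) if golden else 0.0

def mean_average_precision(runs):
    """runs: list of (returned_list, golden_set) pairs, one per question."""
    return sum(average_precision(ret, gold) for ret, gold in runs) / len(runs)

# Toy example: 3 golden documents, all returned in the top ranks -> AP of 1.0.
print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d2", "d3"}))
```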
In Tables 6 and 7, some indicative preliminary results from batch 2 are presented. The full results are available in the online results page of Task 8b, phase A (http://participants-area.bioasq.org/results/8b/phaseA/). The results presented here are preliminary, as the final results for Task 8b will be available after the manual assessment of the system responses by the BioASQ team of biomedical experts.

System               Mean Precision  Mean Recall  Mean F-measure  MAP     GMAP
pa                   0.1934          0.4501       0.2300          0.3304  0.0185
AUEB-System1         0.1688          0.4967       0.2205          0.3181  0.0165
bioinfo-3            0.1500          0.4880       0.2027          0.3168  0.0223
bioinfo-1            0.1480          0.4755       0.1994          0.3149  0.0186
bioinfo-4            0.1500          0.4787       0.2002          0.3120  0.0161
AUEB-System2         0.1618          0.4864       0.2126          0.3103  0.0149
bioinfo-2            0.1420          0.4648       0.1914          0.3084  0.0152
bioinfo-0            0.1380          0.4341       0.1830          0.2910  0.0117
AUEB-System5         0.1588          0.4549       0.2057          0.2843  0.0116
Ir sys4              0.1190          0.4179       0.1639          0.2807  0.0056
Google-AdHoc-MAGLEV  0.1310          0.4364       0.1770          0.2806  0.0109
Ir sys2              0.1190          0.4179       0.1639          0.2760  0.0055
Google-AdHoc-BM25    0.1324          0.4222       0.1758          0.2718  0.0088
AUEB-System3         0.1688          0.4967       0.2205          0.2702  0.0146
Ir sys3              0.1325          0.3887       0.1730          0.2678  0.0045
Table 6. Results for document retrieval in batch 2 of phase A of Task 8b. Only the top-15 systems are presented.

System                Mean Precision  Mean Recall  Mean F-measure  MAP     GMAP
AUEB-System1          0.1545          0.2531       0.1773          0.6821  0.0015
AUEB-System2          0.1386          0.2260       0.1609          0.6549  0.0011
pa                    0.1348          0.2578       0.1627          0.3374  0.0047
bioinfo-4             0.1308          0.2009       0.1413          0.2767  0.0016
bioinfo-1             0.1373          0.2103       0.1461          0.2721  0.0016
bioinfo-2             0.1299          0.2018       0.1408          0.2637  0.0011
bioinfo-3             0.1321          0.2004       0.1404          0.2607  0.0014
MindLab QA System     0.0811          0.1454       0.0916          0.2449  0.0005
MindLab Red Lions++   0.0830          0.1469       0.0932          0.2394  0.0005
AUEB-System5          0.0943          0.1191       0.0892          0.2217  0.0011
MindLab QA Reloaded   0.0605          0.1103       0.0691          0.2106  0.0002
Deep ML methods for   0.0815          0.0931       0.0811          0.2051  0.0001
bioinfo-0             0.1138          0.1617       0.1175          0.1884  0.0009
MindLab QA System ++  0.0639          0.0990       0.0690          0.1874  0.0001
AUEB-System3          0.0966          0.1285       0.0935          0.1556  0.0011
bio-answerfinder      0.0910          0.1617       0.1004          0.1418  0.0008
AUEB-System4          0.0080          0.0082       0.0077          0.0328  0.0000
Table 7. Results for snippet retrieval in batch 2 of phase A of Task 8b.

Phase B: In the second phase of Task 8b, the participating systems were expected to provide both exact and ideal answers. Regarding the ideal answers, the systems will be ranked according to manual scores assigned to them by the BioASQ experts during the assessment of system responses [4]. For the exact answers, which are required for all questions except the summary ones, the measure considered for ranking the participating systems depends on the question type. For yes/no questions, the systems were ranked according to the macro-averaged F1-measure on the prediction of the no and yes answers. For factoid questions, the ranking was based on mean reciprocal rank (MRR), and for list questions on mean F1-measure. Some indicative results for exact answers for the third batch of Task 8b are presented in Table 8. The full results of phase B of Task 8b are available online (http://participants-area.bioasq.org/results/8b/phaseB/). These results are preliminary, as the final results for Task 8b will be available after the manual assessment of the system responses by the BioASQ team of biomedical experts.

                   Yes/No          Factoid                      List
System             Acc.    F1      Str.Acc.  Len.Acc.  MRR      Prec.   Rec.    F1
Umass czi 5        0.9032  0.8995  0.2500    0.4286    0.3030   0.7361  0.4833  0.5229
Umass czi 1        0.8065  0.8046  0.2500    0.3571    0.2869   0.6806  0.4444  0.4683
Umass czi 2        0.8387  0.8324  0.2500    0.3571    0.2869   0.6806  0.4444  0.4683
pa-base            0.9032  0.8995  0.2500    0.4643    0.3137   0.5278  0.4778  0.4585
pa                 0.9032  0.8995  0.2500    0.4643    0.3137   0.5278  0.4778  0.4585
Umass czi 4        0.9032  0.9016  0.3214    0.4643    0.3810   0.6111  0.4361  0.4522
KU-DMIS-1          0.9032  0.9028  0.3214    0.4286    0.3601   0.6583  0.4444  0.4520
KU-DMIS-4          0.8387  0.8360  0.2857    0.4286    0.3357   0.6167  0.4444  0.4490
KU-DMIS-5          0.9032  0.9028  0.3214    0.4643    0.3565   0.6167  0.4444  0.4490
KU-DMIS-2          0.8710  0.8697  0.3214    0.4286    0.3446   0.6028  0.4444  0.4467
KU-DMIS-3          0.8387  0.8360  0.2500    0.4643    0.3357   0.6111  0.4444  0.4431
UoT allquestions   0.5806  0.3673  0.3214    0.3929    0.3423   0.5972  0.4111  0.4290
UoT baseline       0.5806  0.3673  0.3214    0.3929    0.3512   0.4861  0.4056  0.4214
Best factoid       0.5806  0.4732  0.2857    0.3929    0.3333   0.5208  0.4056  0.4107
bio-answerfinder   0.8710  0.8640  0.3214    0.4286    0.3494   0.3884  0.5083  0.4078
FudanLabZhu2       0.7419  0.6869  0.3214    0.5357    0.3970   0.5694  0.3583  0.3988
FudanLabZhu3       0.7419  0.6869  0.3214    0.4643    0.3655   0.5583  0.3472  0.3777
FudanLabZhu4       0.7419  0.6869  0.2857    0.5714    0.3821   0.5583  0.3472  0.3777
FudanLabZhu5       0.7419  0.6869  0.3214    0.4286    0.3690   0.5583  0.3472  0.3777
UoT multitask l.   0.5161  0.3404  0.3214    0.4286    0.3643   0.5139  0.3556  0.3721
BioASQ Baseline    0.5161  0.5079  0.0714    0.2143    0.1220   0.2052  0.4833  0.2562
Table 8. Results for batch 3 for exact answers in phase B of Task 8b. Only the performance of the top-20 systems and the BioASQ Baseline is presented.

Figure 5 presents the performance of the top systems for each question type in exact answers during the eight years of the BioASQ challenge. The diagram reveals that this year the performance of systems on yes/no questions keeps improving. For instance, in batch 3, presented in Table 8, various systems manage to outperform by far the strong baseline, which is based on a version of the OAQA system that achieved top performance in previous years. Improvements are also observed in the preliminary results for list questions, whereas the top system performance in factoid questions fluctuates in the same range as last year. In general, Figure 5 suggests that for the latter types of questions there is still more room for improvement.
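A minimal sketch of these three exact-answer measures is given below. It ignores details of the official evaluation, such as synonym matching for factoid and list answers, and the toy predictions are illustrative only; the assumption that up to five ranked candidates are considered for factoid MRR is ours.

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1_yesno(preds, golds):
    scores = []
    for cls in ("yes", "no"):
        tp = sum(p == cls and g == cls for p, g in zip(preds, golds))
        fp = sum(p == cls and g != cls for p, g in zip(preds, golds))
        fn = sum(p != cls and g == cls for p, g in zip(preds, golds))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / 2                      # macro-average over the two classes

def mrr_factoid(candidate_lists, golds, top_k=5):
    total = 0.0
    for candidates, gold in zip(candidate_lists, golds):
        for rank, c in enumerate(candidates[:top_k], start=1):
            if c.lower() == gold.lower():       # no synonym handling in this sketch
                total += 1.0 / rank
                break
    return total / len(golds)

def mean_f1_list(pred_lists, gold_lists):
    scores = []
    for pred, gold in zip(pred_lists, gold_lists):
        tp = len(set(pred) & set(gold))
        scores.append(f1(tp, len(set(pred)) - tp, len(set(gold)) - tp))
    return sum(scores) / len(scores)

print(macro_f1_yesno(["yes", "no", "yes"], ["yes", "yes", "yes"]))
print(mrr_factoid([["metformin", "insulin"]], ["Metformin"]))
print(mean_f1_list([["BRCA1", "BRCA2"]], [["BRCA1", "BRCA2", "TP53"]]))
```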
Fig. 5. The official evaluation scores of the best performing systems in Task B, Phase B, exact answer generation, across the eight years of the BioASQ challenge. Since BioASQ6 the official measure for Yes/No questions is the macro-averaged F1 score (macro F1), but accuracy (Acc) is also presented as the former official measure. The results for BioASQ8 are preliminary, as the final results for Task 8b will be available after the manual assessment of the system responses.

5 Conclusions

This paper provides an overview of the eighth version of the BioASQ tasks a and b, on biomedical semantic indexing and question answering in English respectively. These tasks, already established through the previous seven years of the challenge, together with the new MESINESP task on semantic indexing of medical content in Spanish, which ran for the first time, constituted the eighth edition of the BioASQ challenge.

The overall shift of participant systems towards deep neural approaches, already noticed in previous years, is even more apparent this year. State-of-the-art methodologies have been successfully adapted to biomedical question answering and novel ideas have been investigated. In particular, most of the systems adopted neural embedding approaches, notably based on BERT and BioBERT models, for both tasks. In the QA task in particular, different teams attempted transferring knowledge from general domain QA datasets, notably SQuAD, or from other NLP tasks such as NER and NLI, also experimenting with multi-task learning settings. In addition, recent advancements in NLP, such as XLNet [51], BART [26] and SpanBERT [20], have also been tested for the tasks of the challenge. Overall, as in previous versions of the tasks, the top performing systems were able to advance over the state of the art, outperforming the strong baselines on the challenging shared tasks offered by the organizers.
Therefore, we consider that the challenge keeps meeting its goal to push the research frontier in biomedical semantic indexing and question answering. The future plans for the challenge include the extension of the benchmark data through a community-driven acquisition process.

6 Acknowledgments

Google was a proud sponsor of the BioASQ Challenge in 2019. The eighth edition of BioASQ is also sponsored by Atypon Systems inc. BioASQ is grateful to NLM for providing the baselines for task 8a and to the CMU team for providing the baselines for task 8b. The MESINESP task is sponsored by the Spanish Plan for advancement of Language Technologies (Plan TL) and the Secretaría de Estado para el Avance Digital (SEAD). BioASQ is also grateful to LILACS, SCIELO and Biblioteca virtual en salud and Instituto de salud Carlos III for providing data for the BioASQ MESINESP task.

References

1. Almeida, T., Matos, S.: Calling attention to passages for biomedical question answering. In: European Conference on Information Retrieval. pp. 69–77. Springer (2020)
2. Anastasios, N., Anastasia, K., Konstantinos, B., Martin, K., Carlos, R.P., Marta, V., Georgios, P.: Overview of BioASQ 2020: The eighth BioASQ challenge on large-scale biomedical semantic indexing and question answering. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020), Thessaloniki, Greece, September 22–25, 2020, Proceedings. vol. 12260. Springer (2020)
3. Baldwin, B., Carpenter, B.: LingPipe. Available from World Wide Web: http://alias-i.com/lingpipe (2003)
4. Balikas, G., Partalas, I., Kosmopoulos, A., Petridis, S., Malakasiotis, P., Pavlopoulos, I., Androutsopoulos, I., Baskiotis, N., Gaussier, E., Artieres, T., Gallinari, P.: Evaluation framework specifications. Project deliverable D4.1, UPMC (05/2013 2013)
5. Beltagy, I., Lo, K., Cohan, A.: SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 (2019)
6. Bodenreider, O.: The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research 32(suppl 1), D267–D270 (2004)
7. Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326 (2015)
8. Chang, W.C., Yu, H.F., Zhong, K., Yang, Y., Dhillon, I.: X-BERT: extreme multi-label text classification using bidirectional encoder representations from transformers. arXiv preprint arXiv:1905.02331 (2019)
9. Chen, Q., Peng, Y., Lu, Z.: BioSentVec: creating sentence embeddings for biomedical texts. In: 2019 IEEE International Conference on Healthcare Informatics (ICHI). pp. 1–5. IEEE (2019)
10. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020)
11. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., Bordes, A.: Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364 (2017)
12. Couto, F.M., Lamurias, A.: MER: a shell script and annotation server for minimal named entity recognition and linking. Journal of Cheminformatics 10(1), 58 (dec 2018). https://doi.org/10.1186/s13321-018-0312-9
13. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)
14. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference 1(Mlm), 4171–4186 (oct 2018), http://arxiv.org/abs/1810.04805
15. Erkan, G., Radev, D.R.: LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22, 457–479 (2004)
16. Gormley, C., Tong, Z.: Elasticsearch: The definitive guide: A distributed real-time search and analytics engine. “O'Reilly Media, Inc.” (2015)
17. Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., Smith, N.A.: Don't stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964 (2020)
18. Jain, H., Prabhu, Y., Varma, M.: Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking & Other Missing Label Applications. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16. pp. 935–944. ACM Press, New York, New York, USA (2016). https://doi.org/10.1145/2939672.2939756
19. Jin, Q., Dhingra, B., Liu, Z., Cohen, W.W., Lu, X.: PubMedQA: a dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146 (2019)
20. Joshi, M., Chen, D., Liu, Y., Weld, D.S., Zettlemoyer, L., Levy, O.: SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8, 64–77 (2020)
21. Kim, D., Lee, J., So, C.H., Jeon, H., Jeong, M., Choi, Y., Yoon, W., Sung, M., Kang, J.: A neural named entity recognition and multi-type normalization tool for biomedical text mining. IEEE Access 7, 73729–73740 (2019)
22. Kosmopoulos, A., Partalas, I., Gaussier, E., Paliouras, G., Androutsopoulos, I.: Evaluation measures for hierarchical classification: a unified view and novel approaches. Data Mining and Knowledge Discovery 29(3), 820–865 (2015)
23. Kotitsas, S., Pappas, D., Androutsopoulos, I., McDonald, R., Apidianaki, M.: Embedding biomedical ontologies by jointly encoding network structure and textual node descriptors. arXiv preprint arXiv:1906.05939 (2019)
24. Kudo, T., Richardson, J.: SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 66–71. Association for Computational Linguistics, Stroudsburg, PA, USA (2018). https://doi.org/10.18653/v1/D18-2012
25. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: pretrained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746 (2019)
26. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L.: BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019)
27. Loper, E., Bird, S.: NLTK: the natural language toolkit. arXiv preprint cs/0205028 (2002)
28. Ma, J., Korotkov, I., Yang, Y., Hall, K., McDonald, R.: Zero-shot neural retrieval via domain-targeted synthetic query generation. arXiv preprint arXiv:2004.14503 (2020)
29. Mihalcea, R., Tarau, P.: TextRank: Bringing order into text.
In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. pp. 404–411 (2004)
30. Mollá, D., Jones, C.: Classification betters regression in query-based multi-document summarisation techniques for question answering. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 624–635. Springer (2019)
31. Mork, J.G., Demner-Fushman, D., Schmidt, S.C., Aronson, A.R.: Recent enhancements to the NLM medical text indexer. In: Proceedings of Question Answering Lab at CLEF (2014)
32. Nentidis, A., Bougiatiotis, K., Krithara, A., Paliouras, G.: Results of the seventh edition of the BioASQ challenge. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 553–568. Springer (2019). https://doi.org/10.1007/978-3-030-43887-6_51
33. Neumann, M., King, D., Beltagy, I., Ammar, W.: ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019)
34. Ozyurt, I.B., Bandrowski, A., Grethe, J.S.: Bio-AnswerFinder: a system to find answers to questions from biomedical texts. Database 2020 (2020)
35. Pang, L., Lan, Y., Guo, J., Xu, J., Xu, J., Cheng, X.: DeepRank: A new deep architecture for relevance ranking in information retrieval. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. pp. 257–266 (2017)
36. Pappas, D., McDonald, R., Brokos, G.I., Androutsopoulos, I.: AUEB at BioASQ 7: Document and Snippet Retrieval. In: Seventh BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering (2019)
37. Pappas, D., Stavropoulos, P., Androutsopoulos, I., McDonald, R.: BioMRC: A dataset for biomedical machine reading comprehension. In: Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing. pp. 140–149 (2020)
38. Peng, S., You, R., Wang, H., Zhai, C., Mamitsuka, H., Zhu, S.: DeepMeSH: deep semantic representation for improving large-scale MeSH indexing. Bioinformatics 32(12), i70–i79 (2016)
39. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. Proceedings of the Conference on Empirical Methods in Natural Language Processing pp. 31–40 (feb 2018), http://arxiv.org/abs/1802.05365
40. Rae, A., Mork, J., Demner-Fushman, D.: Convolutional Neural Network for Automatic MeSH Indexing. In: Seventh BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering (2019)
41. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016)
42. Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084 (2019)
43. Ribadas, F.J., De Campos, L.M., Darriba, V.M., Romero, A.E.: CoLe and UTAI at BioASQ 2015: Experiments with similarity based descriptor assignment. CEUR Workshop Proceedings 1391 (2015)
44. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
45. Smith, L., Tanabe, L.K., nee Ando, R.J., Kuo, C.J., Chung, I.F., Hsu, C.N., Lin, Y.S., Klinger, R., Friedrich, C.M., Ganchev, K., et al.: Overview of BioCreative II gene mention recognition. Genome Biology 9(S2), S2 (2008)
46. Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M.R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., Almirantis, Y., Pavlopoulos, J., Baskiotis, N., Gallinari, P., Artieres, T., Ngonga, A., Heino, N., Gaussier, E., Barrio-Alvers, L., Schroeder, M., Androutsopoulos, I., Paliouras, G.: An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16, 138 (2015). https://doi.org/10.1186/s12859-015-0564-6
47. Tsoumakas, G., Laliotis, M., Markontanatos, N., Vlahavas, I.: Large-Scale Semantic Indexing of Biomedical Publications. In: 1st BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering (2013)
48. Wei, C.H., Kao, H.Y., Lu, Z.: PubTator: a web-based text mining tool for assisting biocuration. Nucleic acids research 41(W1), W518–W522 (2013)
49. Wei, C.H., Leaman, R., Lu, Z.: Beyond accuracy: creating interoperable and scalable text-mining web services. Bioinformatics (Oxford, England) 32(12), 1907–10 (2016). https://doi.org/10.1093/bioinformatics/btv760
50. Williams, A., Nangia, N., Bowman, S.R.: A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426 (2017)
51. Yang, Z., Dai, Z., Yang, Y., Carbonell, J.G., Salakhutdinov, R., Le, Q.V.: XLNet: Generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237 (2019), http://arxiv.org/abs/1906.08237
52. Yang, Z., Zhou, Y., Eric, N.: Learning to answer biomedical questions: OAQA at BioASQ 4b. ACL 2016 p. 23 (2016)
53. Yoon, W., Lee, J., Kim, D., Jeong, M., Kang, J.: Pre-trained Language Model for Biomedical Question Answering. In: Seventh BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering (2019)
54. You, R., Liu, Y., Mamitsuka, H., Zhu, S.: BERTMeSH: Deep contextual representation learning for large-scale high-performance MeSH indexing with full text. bioRxiv (2020). https://doi.org/10.1101/2020.07.04.187674, https://www.biorxiv.org/content/early/2020/07/06/2020.07.04.187674
55. You, R., Zhang, Z., Wang, Z., Dai, S., Mamitsuka, H., Zhu, S.: AttentionXML: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification. arXiv preprint arXiv:1811.01727 (2018)
56. Zavorin, I., Mork, J.G., Demner-Fushman, D.: Using learning-to-rank to enhance NLM medical text indexer results. ACL 2016 p. 8 (2016)