Overview of BioASQ 2021-MESINESP track. Evaluation of advanced hierarchical classification techniques for scientific literature, patents and clinical trials.

Luis Gasco1, Anastasios Nentidis2,3, Anastasia Krithara2, Darryl Estrada-Zavala1, Renato Toshiyuki Murasaki4, Elena Primo-Peña5, Cristina Bojo Canales5, Georgios Paliouras2 and Martin Krallinger1

1 Barcelona Supercomputing Center, Barcelona, Spain
2 National Center for Scientific Research "Demokritos", Athens, Greece
3 Aristotle University of Thessaloniki, Thessaloniki, Greece
4 Centro Latinoamericano y del Caribe de Información en Ciencias de la Salud, Organización Panamericana de la Salud, São Paulo, Brasil
5 Instituto de Salud Carlos III, Biblioteca Nacional de Ciencias de la Salud, Madrid, Spain

Abstract
There is a pressing need to exploit recent advances in natural language processing technologies, in particular language models and deep learning approaches, to enable improved retrieval, classification and, ultimately, access to the information contained in multiple, heterogeneous types of documents. This is particularly true for the field of biomedicine and clinical research, where medical experts and scientists need to carry out complex search queries against a variety of document collections, including literature, patents, clinical trials or other kinds of content such as EHRs. Indexing documents with structured controlled vocabularies used for semantic search engines and query expansion purposes is a critical task for enabling sophisticated user queries and even cross-language retrieval. Due to the complexity of the medical domain and the use of very large hierarchical indexing terminologies, implementing efficient automatic systems to aid manual indexing is extremely difficult. This paper provides a summary of the MESINESP task results on medical semantic indexing in Spanish (BioASQ/CLEF 2021 Challenge). MESINESP was carried out in direct collaboration with literature content databases and medical indexing experts using the DeCS vocabulary, a resource similar to MeSH terms. Seven participating teams used advanced technologies, including extreme multi-label classification and deep language models, to solve this challenge, which can be viewed as a multi-label classification problem. As MESINESP resources, we have released a Gold Standard collection of 243,000 documents, with a total of 2179 manually annotated records, divided into train, development and test subsets covering literature, patents as well as clinical trial summaries, under a cross-genre training and data labeling scenario. Manual indexing of the evaluation subsets was carried out by three independent experts using a specially developed indexing interface called ASIT. Additionally, we have published a collection of large-scale automatic semantic annotations of these documents, generated with NER systems, covering mentions of drugs/medications (170,000), symptoms (137,000), diseases (840,000) and clinical procedures (415,000). In addition to a summary of the technologies used by the teams, this paper
shows that there was a clear improvement of the best scoring systems compared to previous efforts, and also a clear time saving of up to 67% when using pre-indexing with these systems compared to fully manual indexing of documents.
MESINESP corpus: https://doi.org/10.5281/zenodo.4612274

Keywords
Hierarchical Text Classification, Semantic Indexing, Information Retrieval, Question Answering, Biomedical knowledge, Automatic Indexing, DeCS, Spanish

1. Introduction

Since the beginning of the 21st century, we have been immersed in a digitalization process that, together with the advent of the information age, has facilitated the generation, dissemination, and access to digital content using computer-based tools. The production and accumulation of information have been particularly relevant in the field of health and biomedicine, in which difficulties have arisen in accessing relevant information on specific topics, not only for practitioners but also for researchers, public healthcare decision-makers, and other health professionals. The difficulties in finding the right information among so many documents become particularly evident in more demanding scenarios such as pandemics, when a large amount of new research is published every day. For example, more than 160,000 COVID-19 documents were published between January and October 2020, and estimates indicate that more than 700,000 articles may be published before the end of 2021 [1]. There is also a pressing need to enable and improve cross-lingual and multilingual search technologies, as a considerable number of publications, in particular those that present more clinically oriented results such as clinical case reports, correspond to non-English content. Recent advances in machine translation approaches specifically adapted to the characteristics of medical language [2], as well as the use of multilingual controlled vocabularies exploited by tools like BabelMeSH (https://babelmesh.nlm.nih.gov), show that multilingual medical information retrieval will contribute to improving information access for healthcare professionals. In an attempt to help health professionals access up-to-date and relevant information, initiatives have emerged to improve the retrieval of COVID-19-related documents [3], new datasets have been generated to train better AI systems specialised in scientific literature [4], and new platforms have appeared to highlight the most relevant research in the field [5, 6]. Despite all these innovative projects, the searching process is still centered on bibliographic databases, which incorporate functionalities to build complex semantic queries that allow gathering more specific results. However, the results obtained rely on prior human indexing of the database records. Manual indexing consists of assigning a set of descriptors, which are part of a controlled vocabulary, to a manuscript in order to describe its content. The effectiveness of this process depends on the judgement, thoroughness and speed of the annotators, which makes it a slow, unsystematic and costly process. To overcome these difficulties, previous initiatives such as the TREC genomics track (2003-2007) were launched with the aim of advancing the state of the art of automatic biomedical semantic indexing [7].
More recently, from 2013, the BioASQ challenge focused on biomedical question answering and semantic indexing of scientific literature written in English [8]. However, there is a considerable amount of medically relevant content published in languages other than English. This is particularly significant for non-scientific literature records, such as clinical trials, EHRs, and patents, which, with some exceptions, are written entirely in the native language of each country. Additionally, there is a great lack of interoperability in semantic search queries when looking for information in different data sources, a procedure that is mandatory if a health professional wants to get a complete vision of a specific topic. For example, if practitioners wanted to know about adverse reactions to vaccines, they would easily be able to obtain research papers from scientific databases. To gather more information on this topic, they should also search for information about the clinical trials conducted or about proprietary vaccine compounds in patents; however, the search engines of these databases do not use the same controlled terminologies as scientific literature records, requiring additional effort to build queries with terminologies not known to the physician.
The MESINESP2 track, promoted by the Spanish Plan for the Advancement of Language Technology (Plan TL, https://plantl.mineco.gob.es) and organized by the Barcelona Supercomputing Center (BSC) in collaboration with BioASQ, aims to improve the state of the art of semantic indexing for documents written in Spanish, the second language with the most native speakers in the world (https://www.ethnologue.com/guides/ethnologue200). We strongly believe that interoperability in semantic search queries is essential to improve the information retrieval procedure. For that reason, in this edition we proposed to index not only scientific literature, but also clinical trials and patents, to evaluate whether well-known structured vocabularies can be used for other types of biomedical documents. The generation of automatic systems to index other documents would foster the interoperability of search queries and would be a first step towards unifying databases to include all available biomedical information by exploring cross-corpus training scenarios. Moreover, we briefly introduce possible new metrics based on semantic similarity to assess the quality of the generated systems, as well as the usability of the results to improve the efficiency of manual indexing tasks using predictive models and manual indexing tools.
This paper presents the data and results of the MESINESP2 track, which was part of the CLEF-BioASQ 2021 challenge. In this document we first provide an overview of the task, the corpus and the additional data resources we prepared for the participating teams. We also present and analyse the systems developed by the participants. We evaluate the performance of the systems using state-of-the-art evaluation measures, but we also introduce new feasible metrics that may be used to evaluate performance from a semantic perspective. Finally, we conclude with a discussion about the current quality of results and the current advantages of semantic indexing systems, as well as the future steps in the MESINESP track.
2. Track description

MESINESP2 proposed to participating teams the challenge of training AI models capable of assigning DeCS codes (Descriptores en Ciencias de la Salud, a terminology derived and extended from MeSH terms) to biomedical documents. The predictions were evaluated against a Gold Standard manually annotated by expert human indexers. We structured MESINESP2 into three independent subtracks launched consecutively, as shown in Figure 1, each focused on designing semantic indexing systems for a different type of text.
• MESINESP-L - Scientific Literature (Subtrack 1): This track required automatic indexing with DeCS terms of abstracts from scientific articles from two widely used databases in Spanish: IBECS (which includes bibliographic references from scientific articles in health sciences published in Spanish journals, http://ibecs.isciii.es) and LILACS (the most important and comprehensive index of scientific and technical literature of Latin America and the Caribbean, covering 26 countries, 882 journals and 878,285 records, 464,451 of which are full texts, https://lilacs.bvsalud.org).
• MESINESP-T - Clinical Trials (Subtrack 2): This track asked participants to predict DeCS codes automatically for clinical trials from the REEC database (Registro Español de Estudios Clínicos, a database containing summaries of clinical trials, https://reec.aemps.es/reec/public/web.html).
• MESINESP-P - Patents (Subtrack 3): This track required automatic indexing with DeCS terms of the content of patents in Spanish extracted from Google Patents (a public database that aggregates patents from many IP agencies, including the OEPM, Oficina Española de Patentes y Marcas, https://support.google.com/faqs/answer/7049585).
The rest of the section briefly describes the controlled terminology used in the shared task, the annotation process employed and the corpora generated for the participants.

Figure 1: MESINESP2 workflow diagram showing the resources generated, as well as the areas it will improve.

2.1. DeCS terminology used for semantic indexing

DeCS (Descriptores en Ciencias de la Salud, Health Science Descriptors) is a structured controlled vocabulary created by BIREME to index scientific publications on BvSalud (Biblioteca Virtual en Salud, Virtual Health Library), the largest database of scientific documents in Spanish, which hosts records from the databases LILACS, MEDLINE and IBECS, among others. The aim of this vocabulary is to facilitate the retrieval of records written in Spanish, Portuguese and English contained in BvSalud. This trilingual vocabulary is based on the MeSH (Medical Subject Headings) terminology of the US National Library of Medicine (NLM), but also covers additional vocabularies to describe in more detail areas such as Public Health, Homeopathy, Science and Health, and Health Surveillance. Similarly to MeSH, the DeCS vocabulary is essentially organized as a hierarchical tree structure. This enables improving search strategies by directly exploiting both more general and more specific term relations through the hierarchical vocabulary structure. For the MESINESP2 track we used the 2020 version of DeCS, as the new 2021 edition was not yet published at the time of releasing the datasets. The 2020 version comprises a collection of 34,041 unique descriptors, 60,670 alternative terms in Spanish and a total of 77 qualifiers. In order to improve the interoperability of the generated models, and because it was planned to use the generated systems for pre-annotation of BvSalud documents, which include a large number of COVID-19-related records, we also included an extension of COVID-19-related codes that will be incorporated in the 2021 version.
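To illustrate how this hierarchical organisation can be exploited for search, the sketch below expands a descriptor to all of its descendants, so that a query for a general code also retrieves documents indexed with more specific ones. The parent-child mapping and the example relation are illustrative assumptions, not the official DeCS distribution format.

```python
# Minimal sketch of hierarchy-based query expansion over DeCS descriptors.
# The child mapping would be extracted beforehand from the DeCS distribution
# files; the relation shown here is an illustrative example only.
from collections import defaultdict

children = defaultdict(list)
children["D018352"] = ["D000086382"]   # e.g. Coronavirus Infections -> COVID-19

def expand_descriptor(code):
    """Return a descriptor together with all of its descendants in the hierarchy."""
    expanded, stack = set(), [code]
    while stack:
        current = stack.pop()
        if current not in expanded:
            expanded.add(current)
            stack.extend(children.get(current, []))
    return expanded

# A query formulated at the more general level also covers the more specific codes.
print(expand_descriptor("D018352"))  # {'D018352', 'D000086382'}
```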
2.2. Annotation process

Documents from BvSalud are manually indexed by experts. Complementing the evaluation scenario of BioASQ's English-language semantic indexing task, which relies on the progressive indexing of PubMed articles, evaluation in the MESINESP track is done using a subset of records indexed by professionals specifically for this evaluation initiative, allowing us to control the quality of the annotations in the Gold Standard. The MESINESP Gold Standard evaluation data consist of a carefully selected subset of records (titles and abstracts) which are manually annotated by experts with DeCS codes following the BvSalud indexing guidelines, but focusing only on the title and abstract content (for a detailed description of the document selection criteria and the creation of the corpus, see Section 2.3). We evaluated the quality of the models generated by registered teams by comparing automatic predictions of DeCS codes against this manually indexed Gold Standard.
After a thorough Inter-Annotator Agreement analysis of the seven indexers from last year's edition, this year we selected the three experts with the best agreement (around 0.9) to annotate the documents. These annotators were asked to index 1000 documents, so that each document was manually indexed at least twice. To perform the annotation, we used a new indexing tool called ASIT (Advanced Semantic Indexing Tool), which features performance improvements and allows recording annotation performance metrics such as the time a human indexer takes to annotate a document. In addition, ASIT provides a user-friendly and interactive interface to perform semantic indexing of documents, including interactive descriptor search, and incorporates predictive systems to suggest which codes to use based on the content of the text.

Figure 2: ASIT User Interface is divided into a document selection panel and an annotation panel. The annotation panel displays the text content, as well as an interactive descriptor search box that can incorporate DeCS code suggestions.

2.3. MESINESP2 corpora

The aim of MESINESP2 was to explore semantic indexing technologies applied to a variety of heterogeneous health-related content in Spanish. For scientific literature, we considered the IBECS and LILACS records; for clinical trials, we considered the studies present in the Spanish Registry of Clinical Trials (REEC); and finally, we considered the patents in Spanish present in Google Patents. Content selection criteria included the practical importance of the selected databases, their size and number of records, the practical impact of the resulting semantic indexing systems, as well as access to and redistribution of the data collections. We prepared a corpus for every subtrack of the MESINESP task. For each of them, a process of data collection/harvesting, cleaning, data harmonization, subset selection and annotation of the records by experts was carried out.

2.3.1. MESINESP-L corpus

First, we crawled the BvSalud platform with a specific framework (https://github.com/ankush13r/BvSalud) to obtain records from IBECS and LILACS on 01/29/2021.
This means that the data is a snapshot of that moment and may change over time, since LILACS and IBECS usually add or modify records and indexes after the first inclusion in the database. We obtained the title, abstract, language, journal and date of publication of more than 1.14 million records. The initial corpus contained documents in Spanish, English, Portuguese and French, and many of them were stored in the database without having been manually indexed with DeCS. To generate a very large training dataset, we selected documents that had already been indexed at this time point by the literature databases. We then filtered out records with empty abstracts and those that were not written in Spanish, as well as articles that were not of the journal article publication type. We published a corpus of 237,574 scientific journal articles manually annotated by LILACS and IBECS experts with DeCS codes. This year we removed the DeCS qualifier information from those annotations and mapped string descriptors to their corresponding DeCS identifiers. This prevented inconsistencies in the process of mapping descriptors to their identifiers by teams and ensured that they all had exactly the same set of labels.
For the development set, in order to resemble the characteristics of the test set used for evaluation purposes, we provided a set of records manually indexed by expert annotators. This dataset included 1065 articles (i.e. titles and abstracts) annotated with DeCS by the three indexers who obtained the best IAA (over 0.9) in the last MESINESP. Among all items in the development set, 285 were annotated by more than one annotator (we selected the union of their annotations), and 852 articles were annotated by only one of those selected indexers.
To generate the evaluation dataset, we pre-selected all non-indexed Spanish scientific articles in the database that had been published since 2020. As a novelty, and to better align medical indexing of literature content with indexing of medical records and EHRs, more than 8,000 pre-selected records were semantically compared with a corpus of clinical records provided by major hospitals in Spain (discharge summaries, clinical course notes and radiology reports). The 500 publications most similar to the medical records were selected for annotation by the three experts and were included in the final test set provided to the participants. A background set of 9676 Spanish-language clinical practice guideline documents was also included in the test set distributed to participants, in order to evaluate in the future the performance of automatic indexing systems on this type of biomedical document by having human indexers manually validate team predictions.
Table 1 contains an overview of the MESINESP-L corpus statistics. The manually indexed subsets contained an average of 10 DeCS codes per document and about 190 tokens per document. Overall, the number of codes assigned to each document did not have a significant association with the length of its abstract, as can be seen in Figure 3.

Table 1: MESINESP-L (Scientific Literature) corpus statistics

Subset | Docs | DeCS | Unique DeCS | Tokens | Avg. DeCS/doc | Avg. tokens/doc
Training | 237,574 | ∼1.98M | 22434 | ∼43.1M | 8.37 (3.5) | 181.45 (72.3)
Development | 1065 | 11283 | 3750 | 211,420 | 10.59 (4.1) | 198.52 (64.2)
Test | 491 | 5398 | 2124 | 93645 | 10.99 (3.9) | 190.72 (63.6)

Figure 3: Correlation plots between the number of DeCS codes and abstract length for each of the three corpus subsets.
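The similarity-based selection of evaluation documents described above can be sketched as follows. This is only an illustration: it assumes a multilingual SentenceTransformers model and a simple maximum-similarity ranking; the actual models and pre-processing used to build the test set are not detailed here.

```python
# Sketch of similarity-based selection of documents for manual annotation:
# rank candidate abstracts by their maximum cosine similarity to a reference
# collection (e.g. clinical records) and keep the top-k for expert indexing.
# The model name and k are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def select_most_similar(candidates, references, k=500):
    cand_emb = model.encode(candidates, convert_to_tensor=True, normalize_embeddings=True)
    ref_emb = model.encode(references, convert_to_tensor=True, normalize_embeddings=True)
    # For each candidate, keep its highest similarity to any reference document.
    scores = util.cos_sim(cand_emb, ref_emb).max(dim=1).values
    ranked = sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)
    return ranked[:k]
```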
2.3.2. MESINESP-T corpus

Clinical trials from the REEC database are currently not indexed on a regular basis with any controlled terminology. This makes efficient searches within clinical trials in this database quite challenging. To account for the lack of a larger training set for this subtrack, we used the silver standard (automatic predictions) generated by the MESINESP 2020 participants as a training set surrogate [9]. Given that the quality of the task results was diverse, we only released the predictions made by the winning team of last year's task, which achieved a micro-averaged F-measure of 0.4254. For the development set, we released 147 records annotated manually by expert indexers in the last MESINESP edition. For the test set, we downloaded the whole REEC database using the reecapi Python library [10], a non-official library to access the REEC databases developed for this task. Since the number of indexed documents was substantially smaller than in MESINESP-L, we calculated the semantic similarity between the subtrack 1 training corpus and the 416 clinical trials published since 2020. Then, we selected the 250 most similar records, which included many COVID-19 clinical trials; these records were annotated by our indexers. In addition to the manually annotated data, which were used to evaluate the participating systems, we also included a background set of 8,669 documents from drug product data sheets to be automatically annotated by the participating systems.
In terms of corpus statistics (Table 2), clinical trials were longer documents dealing with more topics than scientific articles, which translates into a higher number of DeCS codes and a longer length in word tokens. However, the diversity of codes was much more limited than in subtrack 1, due to the health-focused nature of the documents.

Table 2: MESINESP-T (Clinical Trials) corpus statistics

Subset | Docs | DeCS | Unique DeCS | Tokens | Avg. DeCS/doc | Avg. tokens/doc
Training | 3560 | 52257 | 3940 | ∼4.13M | 14.68 (1.19) | 1161.0 (553.5)
Development | 147 | 2038 | 771 | 146,791 | 13.86 (5.53) | 998.58 (637.5)
Test | 248 | 3271 | 905 | 267,031 | 13.19 (4.49) | 1076.74 (553.68)

2.3.3. MESINESP-P corpus

Similar to clinical trials, patents are not indexed using DeCS, but with International Patent Classification (IPC) codes. There are some studies that propose mappings between MeSH and IPC descriptors, but unfortunately at the time of the competition there were no public resources available to map these terminologies [11]. Because of this lack of resources, we decided to propose this track as a cross-corpus training challenge, in which participants should transfer previous models to the patent domain with a very low amount of annotated data. Improving patent search is of key practical relevance for competitive intelligence and intellectual property purposes.
We downloaded from Google BigQuery (https://cloud.google.com/blog/topics/public-datasets/google-patents-public-datasets-connecting-public-paid-and-private-patent-data) all the patents written in Spanish with the IPC codes "A61P" (specific therapeutic activity of chemical compounds or medicinal preparations) and "A61K31" (medicinal preparations containing organic active ingredients), some well-known codes for some of the task organisers [12]. After data acquisition, we obtained 65,513 patents, from which we chose the 250 most lexically similar to the MESINESP-L training set.
This subset of patents was annotated by the expert indexers; 115 were used as the development set and the rest as the test set. In this subtrack, we also included a background set with a larger number of patents to be annotated, aiming to increase the size of the dataset by improving the annotation process in future editions of MESINESP. As can be seen in Table 3, the diversity of DeCS codes is the lowest of all subtracks, because we decided to work with a subset of patents associated with only two IPC codes. The number of associated descriptors was similar to that found in the MESINESP-L annotated data, and the length of the documents was very diverse, with a mix of short and long documents.

Table 3: MESINESP-P (Patents) corpus statistics

Subset | Docs | DeCS | Unique DeCS | Tokens | Avg. DeCS/doc | Avg. tokens/doc
Development | 109 | 1092 | 520 | 38564 | 10.02 (3.11) | 353.79 (321.5)
Test | 119 | 1176 | 629 | 9065 | 9.88 (2.76) | 76.17 (27.36)

2.3.4. Named entity annotations of the MESINESP corpus: diseases, procedures, symptoms, drugs

Semantic content indexing tasks are complicated. Teams face the problem of assigning several labels to a document, and in many cases the training sets are not large enough. Since the MESINESP organising committee has extensive experience in the development of NER systems for Spanish-language biomedical documents [13, 14, 15], a set of biomedical entities was extracted from each corpus to be incorporated into the participants' models and potentially improve their performance.

Figure 4: Visualization of entities detected in a random abstract of our dataset.

We ran different NER systems on each of the corpora to obtain mentions of diseases, drugs, medical procedures and symptoms in each of the documents. Across all corpora we obtained more than 840,000 mentions of diseases, 170,000 mentions of drugs, 415,000 mentions of medical procedures and 137,000 mentions of symptoms. For each document we provided a list of each mention with its associated span in the document.

Table 4: Summary statistics on the number of entities of each type extracted for each corpus.

Corpus | Diseases | Medications | Procedures | Symptoms
MESINESP-L | 711751 | 87150 | 362927 | 127810
MESINESP-T | 129362 | 86303 | 52566 | 10140
MESINESP-P | 171 | 180 | 25 | 12

3. Results

3.1. Participation

The task had 35 registered teams on the CLEF 2021 Labs website, which resulted in a final participation of 7 teams from Spain, Chile, China, India, Portugal and Switzerland. Among all participating teams, 25 systems were generated for MESINESP-L, 20 for MESINESP-T and 20 for MESINESP-P. The approaches followed by this year's participants were strongly marked by the use of transformer-based encodings from deep language models, mostly Multilingual BERT, for the representation of text and labels.
The Fudan University team built their BertDeCS system following their previously published AttentionXML architecture [16]. They made some modifications to make it perform better on the Spanish domain. First, the encoding layer of their system uses Multilingual BERT, which performs better on non-English documents. Once the word representations are obtained, the model uses label-level attention to get different representations for different labels. Finally, they used a fully-connected layer to get confidence scores for each label. They pretrained the model with English papers from the MEDLINE dataset, and then fine-tuned the model with Spanish articles from the MESINESP2 corpus.
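The label-attention step described for BertDeCS can be sketched roughly as follows: each label attends over the token representations produced by a multilingual BERT encoder and is then scored by its own linear layer. This is a generic AttentionXML-style head written for illustration, not the participants' actual implementation; the encoder name and all dimensions are assumptions.

```python
# Rough sketch of an AttentionXML-style label-attention head over BERT token
# representations: each label attends over the token sequence to build its own
# document representation, which is then scored by a per-label linear layer.
import torch
import torch.nn as nn
from transformers import AutoModel

class LabelAttentionClassifier(nn.Module):
    def __init__(self, num_labels, encoder_name="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.label_queries = nn.Parameter(torch.randn(num_labels, hidden))
        self.per_label_scorer = nn.Parameter(torch.randn(num_labels, hidden))
        self.bias = nn.Parameter(torch.zeros(num_labels))

    def forward(self, input_ids, attention_mask):
        tokens = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Attention of every label over every token: (batch, num_labels, seq_len)
        scores = torch.einsum("lh,bth->blt", self.label_queries, tokens)
        scores = scores.masked_fill(attention_mask.unsqueeze(1) == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        # Label-specific document representations: (batch, num_labels, hidden)
        label_docs = torch.einsum("blt,bth->blh", weights, tokens)
        # One confidence score per label; apply a sigmoid to obtain probabilities.
        return (label_docs * self.per_label_scorer).sum(-1) + self.bias
```

Training such a head with a binary cross-entropy loss over all labels yields one independent confidence score per DeCS descriptor, from which the top-ranked codes can be selected.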
The Vicomtech team, from the Basque Research and Technology Alliance, implemented two systems based on transformers [17]. Specifically, they used BERT pre-trained models to encode the text data and built a multi-label classifier on top. The first system, CSS, relied on combining BERT-encoded tokens as input to a classification model. The second approach, LabelGLOSSES, was similar to Fudan's; it includes an encoded representation of the DeCS descriptors built using a pre-trained BERT encoder, together with the layers of the first model, namely the text data encoder and the multi-label classifier.
The Roche system uses a hybrid semantic indexing method that integrates transformer-based multi-label text classification (MLTC) and the named entity recognition (NER) annotations provided by the Barcelona Supercomputing Center [18]. The transformer-based solution is implemented with the "transformers" package [19] and PyTorch with SentencePiece, the BETO pretrained model [20] and multiple auxiliary binary classifiers. The system complements rare classes through additional synonym matching of the entities found in the articles against DeCS terms. The results are pooled together for each article to assign the final labels.
The Iria systems, a joint work of researchers from Universidade de Vigo and Universidade da Coruña, followed the approach described in [21]. They applied a k-NN approach using linguistically motivated index terms such as lemmas, syntactic dependencies, NP chunks, named entities and keywords. They also submitted runs using a k-NN approach over indices storing dense representations of the training documents, obtained by means of sentence-level embeddings using the SentenceTransformers library [22].
The Lasige-TEAM, from Universidade de Lisboa, developed a prediction pipeline composed of two modules [23]. The first one was a graph-based entity linking model that used the Personalized PageRank algorithm for candidate disambiguation and a semantic similarity-based filter to select the most relevant entities in each document. The second one was an adaptation of the X-Transformer algorithm for extreme multi-label classification [24], which had three components, one of them being a deep neural matcher based on Multilingual BERT.
This year there have also been teams that decided to use traditional text representation methods to examine how they perform for semantic indexing in Spanish. The team from Universidad de Chile used TF-IDF, word embeddings and cosine similarity measures to obtain the terms associated with each document.
The MESINESP2 baseline consisted of a simple textual lookup system. This approach used the descriptors and synonyms of each of the DeCS codes to search the text and assigned the code to the document if a match was detected. This approach obtained a MiF of 0.2876 for MESINESP-L, 0.1288 for MESINESP-T and 0.2992 for MESINESP-P.
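A minimal sketch of such a lookup baseline, together with the micro-averaged F-score used as the main evaluation metric (introduced in the next section), is shown below. The lexicon format and the toy example are assumptions for illustration; a real system would need proper tokenisation, accent normalisation and multi-word matching.

```python
# Minimal sketch of a lexical lookup baseline: assign a DeCS code whenever one of
# its descriptor strings or synonyms appears in the (lower-cased) document text.
# The decs_lexicon mapping {code: [terms]} is an assumed, simplified input format.
def lookup_baseline(text, decs_lexicon):
    text = text.lower()
    return {code for code, terms in decs_lexicon.items()
            if any(term.lower() in text for term in terms)}

# Micro-averaged F-score over all documents, computed from sets of predicted
# and gold DeCS codes (one set per document, aligned by position).
def micro_f(predictions, gold):
    tp = sum(len(p & g) for p, g in zip(predictions, gold))
    pred_total = sum(len(p) for p in predictions)
    gold_total = sum(len(g) for g in gold)
    prec = tp / pred_total if pred_total else 0.0
    rec = tp / gold_total if gold_total else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

decs_lexicon = {"D018352": ["infecciones por coronavirus", "infección por coronavirus"]}
pred = [lookup_baseline("Estudio sobre infecciones por coronavirus en adultos.", decs_lexicon)]
print(micro_f(pred, [{"D018352"}]))  # 1.0 for this toy example
```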
3.2. System evaluation

The main evaluation metric used for the task was the micro-averaged F-score (MiF). Based on this metric, the best results in the three MESINESP2 subtracks were obtained by the team from Fudan University, whose models were ranked first in each of the subtracks. As happened in the last edition of MESINESP, the results obtained are lower than those of the English task when the same type of technologies is used [25].
In this edition we chose to generate a dataset exclusively with scientific articles published in journals, in order to limit possible inconsistencies in documents of other types such as PhD dissertations. In addition, we opted to provide a list of mapped DeCS codes instead of textual descriptors to avoid inconsistencies among participants when assigning codes to records. The measures taken have resulted in a smaller but higher quality dataset which, together with the technological improvements implemented by the participants, has led to higher MiF values overall, with the winner obtaining a score of 0.4837. The possible reasons for the drop in performance with respect to the English task may still be associated with a lower number of training documents and the fact that the documents obtained from BvSalud come mainly from two bibliographic databases that are indexed in slightly different ways [26]. On the other hand, since the update of deprecated DeCS codes is not performed simultaneously with the update of the controlled vocabulary, there may be issues in the list of terms associated with the documents that result in a loss of performance in the automatic indexing models developed by the participants.
Regarding the MESINESP-T task, there is no corresponding English task dedicated to the indexing of clinical trials against which to compare the results. The TREC 2021 Clinical Trials Track aimed to evaluate clinical trial retrieval systems, but using a different setting (http://www.trec-cds.org/2021.html). The results achieved were significantly lower than those obtained for scientific literature. Although this drop in performance could be linked to the lack of a large set of Gold Standard training data (only a silver standard dataset was available), participants reported that they preferred to reuse the models trained on scientific literature, incorporating the development set, to make their final predictions. The average token length of the clinical trials was much longer than that found in scientific literature (around 1000 vs. 200 tokens). Many of the participants opted to use BERT models, which have an input size limit of 512 tokens; this may have caused some of the content not to be processed by the model, inherently losing indexing performance. Despite these drawbacks, the top scoring team was Fudan University with a MiF of 0.3640. However, in terms of micro-precision this team was outperformed by one of Roche's runs, which achieved 0.4004.
The patent subtrack is one of the major novelties of MESINESP. Despite the existence of automatic patent indexing tasks using the IPC ontology, there was no public system that could be used for indexing patents with MeSH/DeCS terms [27]. MESINESP-P presented a great challenge for the participants due to the lack of a training corpus and a very small volume of development data. Since some of the statistics of the scientific literature and patent corpora were similar, the participants opted to use the same models they had used in MESINESP-L to make the predictions for the test set. Despite being systems trained specifically on scientific literature, the results obtained for the patent track are promising. The performance of some of the systems, especially those of the Fudan, Roche and Iria teams, remains at the same level as for scientific literature. Since the corpus was generated through the selection of two IPC code labels, the content of the documents might be more specific in some way.
If the codes used to index patents were also used extensively in the scientific literature corpus, the systems may have learned to assign this type of descriptor more accurately. Moreover, patents were selected using a lexical similarity criterion with respect to scientific articles, which would explain to some extent the similar results obtained. In MESINESP-P the best performing run was Fudan's architecture, with a very balanced system in terms of precision and recall. The system with the highest precision was Roche's, which achieved a value of 0.525, the highest of all models across all subtasks.

Table 5: Performance of each of the models generated by participating teams in each of the subtracks (MiF/MiP/MiR for MESINESP-L, MESINESP-T and MESINESP-P).

Fudan University (China):
  BERTDeCS-CooMatInfer  L: 0.4505/0.4791/0.4252  T: 0.1095/0.1509/0.0859  P: 0.4489/0.4462/0.4515
  BERTDeCS version 2    L: 0.4798/0.5037/0.4581  T: 0.3640/0.3666/0.3614  P: 0.4514/0.4487/0.4541
  BERTDeCS version 3    L: 0.4808/0.5047/0.4591  T: 0.3630/0.3657/0.3604  P: 0.4480/0.4454/0.4507
  BERTDeCS version 4    L: 0.4837/0.5077/0.4618  T: 0.3563/0.3589/0.3537  P: 0.4514/0.4487/0.4541
  bertmesh-1            L: 0.4808/0.5077/0.4591  T: 0.3600/0.3626/0.3574  P: 0.4489/0.4462/0.4515
Roche (Switzerland) [18]:
  bert_dna              L: 0.3989/0.4662/0.3486  T: 0.2710/0.3448/0.2232  P: 0.2479/0.4143/0.1769
  pi_dna                L: 0.4225/0.4667/0.3859  T: 0.2781/0.3504/0.2305  P: 0.3628/0.5250/0.2772
  pi_dna_2              L: 0.3978/0.4520/0.3551  T: 0.2680/0.4004/0.2015  P: -
  pi_dna_3              L: 0.4027/0.4348/0.3750  T: -                     P: -
  bert_dna_2            L: 0.3962/0.4820/0.3364  T: 0.2383/0.3754/0.1746  P: 0.2479/0.4143/0.1769
Lasige-TEAM, Universidade de Lisboa (Portugal) [23]:
  LASIGE_BioTM_1        L: 0.2007/0.1584/0.2738  T: -                     P: -
  LASIGE_BioTM_2        L: 0.1886/0.1489/0.2573  T: -                     P: -
  clinical_trials_1.0   L: -                     T: 0.0679/0.0575/0.0828  P: -
  clinical_trials_0.25  L: -                     T: 0.0686/0.0581/0.0838  P: -
  patents_1.0           L: -                     T: -                     P: 0.0314/0.0239/0.0459
Vicomtech (Spain) [17]:
  Classifier            L: 0.3825/0.4622/0.3262  T: 0.2485/0.2721/0.2287  P: 0.1968/0.2700/0.1548
  CSSClassifier025      L: 0.3823/0.4509/0.3318  T: 0.2819/0.2933/0.2715  P: 0.2834/0.3188/0.2551
  CSSClassifier035      L: 0.3801/0.4710/0.3186  T: 0.2810/0.2888/0.2736  P: 0.2651/0.2547/0.2764
  LabelGlosses01        L: 0.3704/0.4526/0.3134  T: 0.2807/0.2949/0.2678  P: 0.2908/0.3596/0.2440
  LabelGlosses02        L: 0.3746/0.4560/0.3179  T: -                     P: 0.2921/0.3890/0.2338
Iria, Uni. Vigo and Uni. Coruña (Spain):
  iria-1                L: 0.3406/0.3641/0.3199  T: 0.2454/0.2289/0.2644  P: 0.1871/0.1926/0.1820
  iria-2                L: 0.3389/0.3622/0.3185  T: -                     P: 0.3203/0.3657/0.2849
  iria-3                L: 0.2537/0.2729/0.2369  T: 0.1562/0.1419/0.1736  P: 0.0793/0.0822/0.0765
  iria-4                L: 0.3656/0.3909/0.3435  T: 0.2003/0.1868/0.2158  P: 0.2169/0.2232/0.2109
  iria-mix              L: 0.3725/0.4193/0.3351  T: 0.2003/0.1868/0.2158  P: 0.2542/0.2750/0.2364
Universidad de Chile (Chile):
  tf-idf-model          L: 0.1335/0.1405/0.1271  T: -                     P: -
YMCA University (India):
  AnujTagging           L: 0.0631/0.0666/0.0600  T: -                     P: -
  Anuj_ml               L: -                     T: 0.0019/0.0020/0.0018  P: -
  Anuj_NLP              L: 0.0035/0.0053/0.0026  T: -                     P: -
  Anuj_Ensemble         L: -                     T: -                     P: 0.0389/0.0387/0.0391
Baseline:
  Baseline              L: 0.2876/0.2335/0.3746  T: 0.1288/0.0781/0.3678  P: 0.2992/0.4293/0.2296

3.3. Analysis

Does the performance of the models change depending on the number of document descriptors?
Gold Standard documents vary in the number of assigned codes. Some of the participating teams, such as Fudan University and Universidade de Lisboa, made the design decision of predicting a fixed number of codes for every document, which could affect performance when the actual number of descriptors is far from that design constant.
Figure 5 shows the performance metrics of the best model of each team in MESINESP-L, splitting the test set into groups according to the number of manually assigned codes.

Figure 5: Performance of models for Gold Standard subsets generated according to the number of associated codes. The highest model performance is obtained for records containing between 12 and 24 codes.

In general terms, the most difficult records to classify are those with the lowest number of associated descriptors, although the sample of these documents is smaller than that of the other groups. The accuracy of the models is substantially higher in the 18-24 code group, with precision values of up to 0.7 in the case of the winning team. However, Fudan's model, whose output was limited to 10 labels, drops substantially in performance for documents with fewer than 6 codes, although it maintains its clear lead in the other groups.

How do systems behave with COVID-related descriptors?
Indexing tasks are critical in pandemic scenarios, where it is necessary to assign descriptors to documents in order to retrieve them from databases. One of the analyses we performed on the participants' predictions focuses on how the developed systems work with COVID-19-related records. Although the training data were indexed with the COVID terms available in DeCS 2020, the annotation of the test data was performed with the descriptors that will be incorporated in the 2021 version. This update of terms made it possible to test whether the systems developed by the participants were robust to new labels not present in the training resources, a topic of growing interest in the field [28, 29, 30, 31].
To carry out this analysis, we selected records that addressed COVID topics. The training corpus has its items indexed with the COVID terms recommended in 2020, namely "D018352" and "D000073640". In contrast, the test set (Gold Standard) was annotated with the new, and much more specific, descriptors. Figure 6 shows the occurrences of each COVID descriptor in the best-scoring model of each participant with respect to the Gold Standard for MESINESP-L. The Iria team was the only one able to detect these more specific terms, with 3 occurrences of the term "D000086402", but in general the systems were not able to detect these previously unseen terms, which opens the door to the implementation of automatic indexing systems able to predict previously unseen labels.

Figure 6: Number of occurrences of COVID descriptors in participants' predictions for the test set. Descriptor reference: D018352 (Coronavirus Infections), D000073640 (Betacoronavirus), D000086382 (COVID-19), D000086663 (COVID-19 vaccines), D000086742 (COVID-19 test), D000087123 (nucleic acid test for COVID-19), D000086402 (SARS-CoV-2).

This ineffective assignment of COVID-related DeCS codes that were not present in the training corpus may significantly decrease model performance, given that a high proportion of the test set (119/491) had these types of descriptors. To assess the effect of these descriptors on overall performance, we recalculated model performance considering only documents that did not have the new COVID codes. The results are shown in Figure 7. When models are evaluated only with documents containing labels seen in the training set, the performance decreases minimally in general. This means that, although there is a significant portion of documents in the test set whose COVID labels are not being correctly assigned, they are not being classified worse than the rest of the documents, in which all descriptors were present in the training set.

Figure 7: Difference in model performance when removing COVID-related documents.

Which descriptors do the systems predict best and worst?
MESINESP2 is an extreme multi-label labelling challenge.
We provide a set of thousands of labels to be assigned to documents. There may be codes that are very easy for systems to identify and others that are difficult to assign correctly. Figure 8 presents the 4 descriptors that have been best identified by the top systems of the 4 most competitive teams in MESINESP-L, MESINESP-T and MESINESP-P. Conversely, Figure 9 shows the labels most poorly predicted by the models, also showing the number of times they appeared in the test set.
For subtrack 1, the codes with the best success rate are repeated across all systems: "D006801", Humanos (Humans); "D018352", Infecciones por Coronavirus (Coronavirus Infections); "D008297", Masculino (Male); and "D005260", Femenino (Female). The Humans descriptor is one of the most frequent codes in the BvSalud database, so it was expected that systems would be able to detect this descriptor with ease and accuracy in the test set. The Male and Female descriptors are special nodes within the DeCS ontology, since they are at the top of the hierarchical structure, without having any children. Finally, systems were able to detect with high precision the descriptor "Coronavirus infections", probably because the systems tended to over-assign this term, as seen in Figure 6. Regarding the worst predicted labels, as seen above, none of the systems was able to correctly predict the new COVID descriptors with which the test set was manually indexed. Additionally, the codes "D013812" (Therapeutics) and "D002363" (Case histories) were not assigned with adequate frequency, with the exception of the system of the Fudan team.
The clinical trials models show similar behaviour to those for scientific literature. The descriptors "D006801" and "D018352" were almost always correctly recognised. In addition, the descriptors "D011024" and "D016896", corresponding to the terms Viral Pneumonia and Treatment Outcome, showed very good accuracy. These codes have the peculiarity of being in the upper part of the structure, thus continuing the trend observed in the MESINESP-L results. Concerning terms that were not correctly classified, once again we found that COVID descriptors were not indexed correctly. This is an expected result after the previous analyses and due to the lack of data labelled with these codes. We also find general terms, such as Therapeutics, but also the code "D016449" (Randomized Controlled Trial), which is very specific; the models were not able to understand the context of the abstract well enough to capture this concept.
For patents, the general trend observed so far in relation to well-detected terms changes. On the one hand, the Fudan and Roche models perform very well in detecting the Patent ("D020490") and Humans ("D006801") descriptors, while the Vicomtech and Iria models seemed to have some issues in correctly assigning these codes. Conversely, the Vicomtech and Iria models detect the descriptor Therapeutics ("D013812") very well, while the Fudan and Roche models did slightly worse.
Although in general terms the systems are able to assign specific labels such as antisense oligonucleotides ("D016376") or antineoplastic agents ("D010300"), they perform worse for labels that are deeper in the ontology even though they are conceptually more generic (such as Organic Chemistry and Drug Compounding).

Figure 8: Descriptors best predicted by the best model of each team in each subtrack. Each block contains the DeCS code, the number of times the model predicted that code correctly and the total number of times the code appeared in the corpus.

Figure 9: Descriptors worst predicted by the best model of each team in each subtrack. Each block contains the DeCS code, the number of times the model predicted that code correctly and the total number of times the code appeared in the corpus.

3.4. Silver standard generation

Within the test sets of each of the tasks, a set of records that were not used for evaluating the models was included in order to generate a silver standard. This silver standard has been published in the task data repository and contains two separate sections. On the one hand, the union of the labels predicted by the best model of each participating team has been calculated, as long as that model obtained at least an F-score of 0.2. On the other hand, the predictions of the best models of each participant have been included individually and anonymised. The silver standard contains a set of 8642 scientific articles, 1537 text sections from Clinical Practice Guidelines, 8458 text segments from Medication Data Sheets, 461 clinical trials from REEC and 5170 patents. The summary statistics of the generated silver standard are shown in Table 6.

Table 6: Summary statistics of the silver standard corpora generated in MESINESP.

Corpus | Docs | DeCS | Unique DeCS | Tokens | Avg. DeCS/doc | Avg. tokens/doc
Scientific papers | 8642 | 257123 | 11557 | ∼1.7M | 29.75 (3.44) | 198.31 (63.21)
Clinical Practice Guidelines | 1537 | 46263 | 4018 | 132264 | 30.10 (3.51) | 86.05 (87.66)
Medication data sheets | 8458 | 190701 | 1609 | ∼9.1M | 22.55 (2.78) | 1076.93 (423.71)
Clinical Trials | 461 | 12277 | 4158 | 508793 | 26.63 (3.69) | 1103.67 (558.31)
Patents | 5170 | 101775 | 7319 | 641624 | 19.69 (6.05) | 124.11 (145.77)

4. Discussion

Real application of Spanish semantic indexing models
Despite the improvement of the F-score results by 0.06 with respect to last year, and considering the positive evolution of the results of BioASQ Task 9a, we believe that there is still wide scope for improvement in the semantic indexing of biomedical documentation in Spanish. The application of semantic indexing models in Spanish is feasible, and their use to assist manual indexing initiatives seems to be within reach, but with some limitations. The precision and recall values are not high enough to incorporate the predicted descriptors into databases without a validation process, as a significant number of codes would be missing or poorly predicted. This validation process, which could be seen as an AI-assisted document annotation process, would speed up the inclusion of DeCS terms in documents and facilitate document retrieval. These automatic indexing systems could be incorporated into indexing assistance tools to pre-index documents for validation by expert annotators. Providing a ranked list of predicted codes would allow more flexibility in the case of using totally automated predictions.
Using the ASIT tool, which allows annotation time logging, we measured the average time taken by three expert indexers to annotate 30 documents both in the traditional manner and with the descriptors predicted by the BERTDeCS version 4 system, on a subset of the documents from the test set. Document pre-indexing reduced annotation times by more than half. Figure 10a shows box plots of the indexing times of each of the annotators. When experts are confronted with a pre-indexed document their annotation times decrease substantially, with annotation time falling to as little as one third of the original in the case of the A7 annotator.

Cross-corpus training
The task has resulted in models of disparate quality for each type of document. The indexing models for scientific literature are the best in terms of performance, followed by those for patents and clinical trials; but how can the differences in the quality of the systems be explained if they have been trained with similar corpora? When preparing the MESINESP corpora, and in order to facilitate the cross-corpus training process, semantic similarity models were applied for the selection of documents to be annotated by experts. Despite using a document selection criterion, we believe that knowing the features that make a document well indexed would make it easier to know in advance on which document types semantic indexing would work better.

Figure 10: (a) Improvement of annotation time with an AI-assisted system compared to standard manual annotation. (b) Histogram showing the distribution of Resnik's similarity measure between the documents predicted by the baseline and Fudan with respect to the Gold Standard.

This would make it possible to focus research on this type of document and even, if a sufficiently high indexing quality could be achieved, to train a model with literature data and index documents with similar properties without a new annotation, training and evaluation process. For example, one type of document that would be interesting to index is Electronic Health Records. We ran automatic indexing systems on a subset of EHRs and obtained the statistics shown in Table 7. Since we do not know the common properties between scientific literature documents and EHRs, it would be very risky to use these models on this type of document. But the annotation process needed to train indexing systems might discourage attempts to implement semantic indexing for this type of document.

Table 7: Statistics of DeCS descriptors found in different types of Electronic Health Records.

EHR corpus | Docs | DeCS | Unique DeCS | Tokens | Avg. DeCS/doc | Avg. tokens/doc
Clinical course | 1000 | 85562 | 4252 | ∼3.7M | 85.56 (90.37) | 3780.47 (7023.16)
Discharge summaries | 1000 | 7872 | 1242 | 83937 | 7.87 (11.43) | 83.94 (142.46)
Death reports | 42 | 312 | 182 | 2735 | 7.43 (14.18) | 65.12 (158.73)
Radiology reports | 461 | 3894 | 554 | 50420 | 1.30 (3.08) | 16.81 (36.46)

Quality of predictions given the hierarchical structure
Semantic indexing with DeCS is a complex task. As we have seen, DeCS is a constantly changing terminology, with very specific and some infrequently used terms, which makes model training challenging. The use of flat metrics, such as the F-score, to evaluate hierarchical models is too restrictive. The F-score penalises a failed prediction too heavily, as it does not take into account the distance between the predicted and the true code within the ontology. For example, in the predictions made by the participants we found many documents in which the systems, instead of matching the exact codes, had predicted the direct parent of the descriptors. These predictions, despite not being correct, may have captured the content of the documents in some way, but they are not evaluated in the fairest way. Traditional ranking-based metrics, such as precision@k and nDCG [32], are also unable to capture this hierarchical structure, although they may be able to assess the presence of parent and child terms in the predictions. An alternative would be the use of other types of metrics, such as knowledge-based semantic similarity metrics, which measure the degree of common information between two documents by quantifying the distance among the concepts covered in the texts when mapped onto an ontology [33]. Semantic similarity has been used for validating results from biomedical studies with specific ontologies such as the Gene Ontology [34], and metrics such as Resnik's, Lin's and Jiang's [35] may be useful to evaluate the degree of similarity between the gold standard and the predictions obtained with a semantic indexing model. To calculate Resnik's similarity at the document level, we compute the similarities between the predicted and true labels of the same document. The higher the Resnik similarity value, the closer the predicted codes are to the true labels. This allows giving credit to systems that have predicted nearby codes within the hierarchical tree of the controlled vocabulary.
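A possible document-level computation is sketched below. It assumes a pre-computed information content value for every DeCS descriptor and an ancestor function over the hierarchy, and aggregates label pairs with a best-match average, which is only one of several reasonable aggregation strategies.

```python
# Sketch of a document-level Resnik similarity between gold and predicted DeCS codes.
# Assumes: ic[code] gives the information content of a descriptor (e.g. -log of its
# annotation frequency) and ancestors(code) returns the code plus all its ancestors.
def resnik(code_a, code_b, ic, ancestors):
    """Information content of the most informative common ancestor of two codes."""
    common = ancestors(code_a) & ancestors(code_b)
    return max((ic[c] for c in common), default=0.0)

def document_resnik(gold_codes, predicted_codes, ic, ancestors):
    """Average, over gold codes, of the best Resnik similarity to any predicted code."""
    if not gold_codes or not predicted_codes:
        return 0.0
    best = [max(resnik(g, p, ic, ancestors) for p in predicted_codes) for g in gold_codes]
    return sum(best) / len(best)
```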
Figure 10b shows the similarity distribution between the predictions of the baseline and of the winning team with respect to the MESINESP-L Gold Standard. As expected, Fudan's model obtains a considerably higher average similarity value than the baseline of the task. Despite some overlap between the two distributions, the Fudan model consistently assigned codes that were closer to the Gold Standard annotations.

5. Conclusions

This document has presented an overview of the MESINESP2 task within the 9th BioASQ Challenge (CLEF 2021). This time, the task focused on indexing documents from scientific literature, clinical trials and patents. A brief analysis of the participant models and their performance for each of the tasks has been presented. Once again, the trend of using deep neural approaches is more evident, with a large percentage of the participants using Multilingual BERT for text representation. The team from Fudan University, winner in each of the subtracks, was able to improve the state of the art for indexing scientific literature in Spanish defined last year, and has established it for the indexing of clinical trials and patents.
Despite the positive evolution in the performance of the MESINESP task, there is a drop in the performance of the models with respect to the English task. This reduction may be related to the volume of data available for model training, so we should evaluate whether following a multilingual approach, incorporating languages such as French, Russian or Chinese, could favour the impact, interest and performance of the systems by providing a larger volume of labelled data.
We have experienced a rapid evolution in the quality of existing approaches to multi-label learning in recent years; for example, from the first edition of BioASQ to the current edition we have seen an increase in F-score of 0.12. However, when using such systems in real environments, the fact that the controlled vocabularies used as labels are updated periodically has been overlooked.
In addition, the updating of database descriptors is not done instantaneously, which makes it difficult for the systems to handle new labels not previously considered. Future evaluation settings for this or similar tasks could benefit from considering more interactive scenarios with the human indexer in the loop, similar to the BioCreative Interactive tracks [36], or from the technical evaluation, integration and access of predictive systems through meta-server settings [37].

6. Acknowledgements

The MESINESP task is sponsored by the Spanish Plan for the Advancement of Language Technologies (Plan TL). We thank Eulalia Farré, Antonio Miranda and Salvador Lima from the Text Mining Unit at the Barcelona Supercomputing Center for their advice and valuable input during the organisation of the MESINESP shared task and the data preparation. Thanks to Alejandro Asensio (BSC) for his help in setting up the annotation tool, to the BIREME indexers (Regina Chiquetto, Lucilena Bragion and Sueli Mitiko) for providing feedback about the quality of the manual annotations, and to Felipe Soares and Ankush Rana for helping in the data collection process.

References

[1] D. Torres-Salinas, N. Robinson-Garcia, F. van Schalkwyk, G. F. Nane, P. Castillo-Valdivieso, The growth of covid-19 scientific literature: A forecast analysis of different daily time series in specific settings, arXiv preprint arXiv:2101.12455 (2021).
[2] R. Bawden, K. B. Cohen, C. Grozea, A. J. Yepes, M. Kittner, M. Krallinger, N. Mah, A. Neveol, M. Neves, F. Soares, et al., Findings of the wmt 2019 biomedical translation shared task: Evaluation for medline abstracts and biomedical terminologies, in: Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), 2019, pp. 29–53.
[3] S. MacAvaney, A. Cohan, N. Goharian, Sledge: A simple yet effective baseline for covid-19 scientific knowledge search, 2020. arXiv:2005.02365.
[4] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. Kinney, Z. Liu, W. Merrill, et al., Cord-19: The covid-19 open research dataset, ArXiv (2020).
[5] E. Zhang, N. Gupta, R. Nogueira, K. Cho, J. Lin, Rapidly deploying a neural search engine for the covid-19 open research dataset: Preliminary thoughts and lessons learned, 2020. arXiv:2004.05125.
[6] Covid research: a year of scientific milestones, 2021. URL: https://www.nature.com/articles/d41586-020-00502-w.
[7] W. Hersh, E. Voorhees, Trec genomics special issue overview, 2009.
[8] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos, et al., An overview of the bioasq large-scale biomedical semantic indexing and question answering competition, BMC bioinformatics 16 (2015) 1–28.
[9] M. Krallinger, A. Gonzalez-Agirre, A. Asensio, MESINESP: Medical Semantic Indexing in Spanish - Development dataset, 2020. URL: https://doi.org/10.5281/zenodo.3746596. doi:10.5281/zenodo.3746596. Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
[10] L. Gasco, REECapi: The unofficial library for accessing the REEC API, 2021. URL: https://doi.org/10.5281/zenodo.4882820. doi:10.5281/zenodo.4882820.
[11] D. Eisinger, G. Tsatsaronis, M. Bundschus, U. Wieneke, M. Schroeder, Automated patent categorization and guided patent search using ipc as inspired by mesh and pubmed, in: Journal of biomedical semantics, volume 4, Springer, 2013, pp. 1–23.
[12] M. Krallinger, O. Rabal, A. Lourenço, M. P. Perez, G. P. Rodriguez, M.
[12] M. Krallinger, O. Rabal, A. Lourenço, M. P. Perez, G. P. Rodriguez, M. Vazquez, F. Leitner, J. Oyarzabal, A. Valencia, Overview of the chemdner patents task, in: Proceedings of the fifth BioCreative challenge evaluation workshop, 2015, pp. 63–75.
[13] A. Miranda-Escalada, E. Farré-Maduell, S. Lima-López, L. Gascó, V. Briva-Iglesias, M. Agüero-Torales, M. Krallinger, The profner shared task on automatic recognition of occupation mentions in social media: systems, evaluation, guidelines, embeddings and corpora, in: Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task, 2021, pp. 13–20.
[14] A. Miranda-Escalada, A. Gonzalez-Agirre, J. Armengol-Estapé, M. Krallinger, Overview of automatic clinical coding: annotations, guidelines, and solutions for non-english clinical cases at codiesp track of clef ehealth 2020, in: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings, 2020.
[15] A. Miranda-Escalada, E. Farré, M. Krallinger, Named entity recognition, concept normalization and clinical coding: Overview of the cantemist track for cancer text mining in spanish, corpus, guidelines, methods and results, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020.
[16] R. You, Z. Zhang, Z. Wang, S. Dai, H. Mamitsuka, S. Zhu, Attentionxml: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification, arXiv preprint arXiv:1811.01727 (2018).
[17] A. García-Pablos, N. Perez, M. Cuadros, Vicomtech at MESINESP2: BERT-based Multi-label Classification Models for Biomedical Text Indexing (2021).
[18] Y. Huang, G. Buse, K. Abdullatif, A. Ozgur, E. Ozkirimli, Pidna at bioasq mesinesp: Hybrid semantic indexing for biomedical articles in spanish (2021).
[19] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
[20] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained bert model and evaluation data, in: PML4DC at ICLR 2020, 2020.
[21] F. J. Ribadas, L. M. De Campos, V. M. Darriba, A. E. Romero, Cole and utai at bioasq 2015: experiments with similarity based descriptor assignment, in: CEUR Workshop Proceedings, volume 1391, 2015.
[22] N. Reimers, I. Gurevych, Making monolingual sentence embeddings multilingual using knowledge distillation, arXiv preprint arXiv:2004.09813 (2020).
[23] P. Ruas, V. D. T. Andrade, F. M. Couto, LASIGE-BioTM at MESINESP2: entity linking with semantic similarity and extreme multi-label classification on Spanish biomedical documents (2021).
[24] W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, I. S. Dhillon, Taming pretrained transformers for extreme multi-label text classification, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3163–3171.
[25] A. Nentidis, G. Katsimpras, E. Vandorou, A. Krithara, L. Gasco, M. Krallinger, G. Paliouras, Overview of BioASQ 2021: The ninth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering (2021).
[26] C. Rodriguez-Penagos, A. Nentidis, A. Gonzalez-Agirre, A. Asensio, J. Armengol-Estapé, A. Krithara, M. Villegas, G. Paliouras, M. Krallinger, Overview of MESINESP8, a Spanish Medical Semantic Indexing Task within BioASQ 2020 (2020).
[27] D. Molla, D. Seneviratne, Overview of the 2018 alta shared task: Classifying patent applications, in: Proceedings of the Australasian Language Technology Association Workshop 2018, 2018, pp. 84–88.
[28] A. Pham, R. Raich, X. Fern, J. P. Arriaga, Multi-instance multi-label learning in the presence of novel class instances, in: International Conference on Machine Learning, PMLR, 2015.
[29] Y. Zhu, K. M. Ting, Z.-H. Zhou, Multi-label learning with emerging new labels, IEEE Transactions on Knowledge and Data Engineering 30 (2018) 1901–1914.
[30] Y. Zhu, K. M. Ting, Z.-H. Zhou, Discover multiple novel labels in multi-instance multi-label learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
[31] J. Huang, L. Xu, K. Qian, J. Wang, K. Yamanishi, Multi-label learning with missing and completely unobserved labels, Data Mining and Knowledge Discovery 35 (2021) 1061–1086.
[32] Y. Wang, L. Wang, Y. Li, D. He, W. Chen, T.-Y. Liu, A theoretical analysis of ndcg ranking measures, in: Proceedings of the 26th annual conference on learning theory (COLT 2013), volume 8, Citeseer, 2013, p. 6.
[33] F. Couto, A. Lamurias, Semantic similarity definition, Encyclopedia of bioinformatics and computational biology 1 (2019).
[34] G. O. Consortium, The gene ontology (go) database and informatics resource, Nucleic acids research 32 (2004) D258–D261.
[35] X. Guo, R. Liu, C. D. Shriver, H. Hu, M. N. Liebman, Assessing semantic similarity measures for the characterization of human regulatory pathways, Bioinformatics 22 (2006) 967–973.
[36] C. N. Arighi, P. M. Roberts, S. Agarwal, S. Bhattacharya, G. Cesareni, A. Chatr-Aryamontri, S. Clematide, P. Gaudet, M. G. Giglio, I. Harrow, et al., Biocreative iii interactive task: an overview, BMC bioinformatics 12 (2011) 1–21.
[37] F. Leitner, M. Krallinger, C. Rodriguez-Penagos, J. Hakenberg, C. Plake, C.-J. Kuo, C.-N. Hsu, R. T.-H. Tsai, H.-C. Hung, W. W. Lau, et al., Introducing meta-services for biomedical information extraction, Genome biology 9 (2008) 1–11.