LSI_UNED at CLEF eHealth 2021: Exploring the effects of transfer learning in negation detection and entity recognition in clinical texts

Hermenegildo Fabregat (1), Andres Duque (1,2), Lourdes Araujo (1,2) and Juan Martinez-Romo (1,2)

(1) NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED)
(2) Instituto Mixto de Investigación - Escuela Nacional de Sanidad (IMIENS)

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
Emails: gildo.fabregat@lsi.uned.es (H. Fabregat); aduque@lsi.uned.es (A. Duque); lurdes@lsi.uned.es (L. Araujo); juaner@lsi.uned.es (J. Martinez-Romo)
ORCID: 0000-0001-9820-2150 (H. Fabregat); 0000-0002-0619-8615 (A. Duque); 0000-0002-7657-4794 (L. Araujo); 0000-0002-6905-7051 (J. Martinez-Romo)

Abstract
This paper describes the approach presented by the LSI_UNED team in the Multilingual Information Extraction task (SpRadIE) of CLEF eHealth 2021. The proposed system is a deep learning stack designed for separately detecting negation hedge cues and the other biomedical entities in the task. Transfer learning techniques are applied in order to study whether pre-trained weights from a different negation detection task can be effectively incorporated into the model, improving a baseline system trained only with the provided data. The system achieves promising results, obtaining the second best F1 score and the best precision among all participating systems.

Keywords
Biomedical information extraction, Transfer learning, Negation detection

1. Introduction

Named Entity Recognition (NER) is the task that aims to detect a particular set of entities within a text. It represents one of the key steps in the process of information extraction in any specific domain. In the field of biomedicine, entity detection is of paramount importance for successfully performing subsequent tasks in the information extraction pipeline, such as relation extraction or document classification. Considering the huge amount of information currently available in the biomedical domain, including research papers, clinical notes and medical reports, the development of automatic systems able to perform accurate NER on those types of documents will definitely lead to better health support systems. In this context, the eHealth Evaluation Lab conducted at the Conference and Labs of the Evaluation Forum (CLEF) 2021 [1] is a great opportunity for testing systems designed to solve these kinds of tasks in the biomedical domain. In particular, Task 1 of the eHealth 2021 challenge, named Multilingual Information Extraction (SpRadIE) [2], focuses on the detection of biomedical entities in clinical texts (radiology reports) written in Spanish.

In this paper, we present a deep learning architecture designed to take advantage of transfer learning techniques in the detection of a particular type of entity, in this case negation hedge cues. For this purpose, the proposed architecture is a pipeline with two different branches receiving the same input, each branch being a particular neural network. One of the networks performs the detection of negation hedge cues, while the other is used for recognizing the rest of the entities proposed in the task.
In order to analyze the effect and possible improvements offered by transfer learning techniques in detecting negation hedge cues, the network performing this subtask is initialized either randomly (as the other network always is), in the usual setup, or with weights transferred from a different task oriented to negation detection, in the transfer learning setup.

The rest of the paper is organized as follows: Section 2 briefly presents some systems that faced similar tasks in past competitions. Details of the task addressed in this paper are given in Section 3, and the proposed system is presented in Section 4. Results obtained in the competition are shown and discussed in Section 5, while Section 6 is devoted to analyzing some systematic errors detected during the development of the system. Finally, some conclusions and future lines of work are presented in Section 7.

2. Background

The identification and classification of named entities is an extensively studied field in Natural Language Processing (NLP), and particularly so in the biomedical domain. The use of classical NLP approaches has led to the development of well-established systems in the literature such as MetaMap [3]. These classical approaches include look-up dictionaries [4] and rule-based systems like PROPER [5] or TextDetective [6]. However, machine learning systems [7], and especially deep learning techniques [8, 9], represent the current state of the art in biomedical NER. The development of specific biomedical word embeddings [10] and language models [11] has been key to the huge success of these systems.

Many different tasks related to biomedical NER have been proposed in evaluation campaigns such as CLEF eHealth 2015 [12], TASS eHealth-KD 2018 [13], IberLEF eHealth-KD 2019 [14] and IberLEF eHealth-KD 2020 [15]. As mentioned before, the use of deep learning approaches for addressing these tasks has grown exponentially in the past few years, to the point of representing the vast majority of participating systems. Many of those systems propose deep learning stacks mainly based on Bidirectional Long Short-Term Memory (Bi-LSTM) layers followed by Conditional Random Field (CRF) layers for performing entity detection and classification [16, 17, 18]. Techniques based on the Transformer architecture [19], such as BERT [20], have also gained high popularity in these tasks since their publication [21, 22].

In addition to the aforementioned challenges and evaluation campaigns, other works addressing biomedical NER tasks in the Spanish language have been developed recently. Deep learning methods are applied in [23] for the identification and subsequent anonymization of named entities within radiology reports. Transfer learning techniques based on contextualized word embeddings are employed in [24] for detecting pharmacological entities (substances, compounds and proteins) in Spanish clinical cases, improving previous results obtained with standard and general domain word embeddings.

3. Task: Multilingual Information Extraction

Task 1 of the eHealth Evaluation Lab at CLEF 2021 (SpRadIE) aims at the detection and classification of biomedical entities and hedge cues in radiology reports written in Spanish. Participating systems are asked to recognize ten different classes, divided into seven entities (anatomical entity, finding, location, measure, type of measure, degree and abbreviation) and three hedge cues (negation, uncertainty and conditional temporal).
In order to achieve good performance, systems must adequately deal with several phenomena inherent to NER tasks in the biomedical domain: long entities, discontinuous entities, overlapping entities and polysemy.

The dataset provided by the organizers consists of anonymized ultrasonography reports from the radiology department of a pediatric hospital in Argentina. Further information regarding the original annotation criteria, which were slightly modified for this task, can be found in [25]. The dataset contains 169 documents for training purposes and 92 documents for development purposes, all of them annotated using the BRAT format [26]. System testing is performed on an additional test set of 207 unseen documents. The development dataset is divided into two types of documents: same-sample documents, whose vocabulary is similar to that of the training corpus, and held-out documents, which contain words that do not usually occur in the training corpus.

Finally, evaluation of the participating systems is carried out using Precision, Recall and F1 metrics over the Jaccard index between the predicted and the reference entities. Two different F1 measures are computed: exact F1 only considers exact matches of the predicted entities, while lenient F1 is a more relaxed metric that scores the overlap between the predicted entity and the reference.

4. System Description

The proposed system is a deep learning architecture mainly built on two particular types of layers: Bidirectional Long Short-Term Memory (Bi-LSTM) layers and Conditional Random Field (CRF) layers. Input documents are processed forwards and backwards by the Bi-LSTMs, and each token in the documents is finally classified through a CRF layer.

4.1. Pre-processing

Since the reports in the dataset are initially annotated using the BRAT format, this annotation must be transformed into a format able to represent the final classes to which each token can belong. For this purpose, we use the BILOU annotation scheme, widely used in different NER tasks. This scheme discriminates between the beginning (B), inside (I) and last (L) tokens of a particular entity, as well as entities composed of a unique (U) token, and tokens in the document that are outside (O) of any entity.

The final output of our system is designed to take discontinuous and overlapping entities into account, these being two of the most frequent linguistic challenges within the provided corpus. After inspecting the training and development datasets, the entities presenting the greatest number of discontinuities and overlaps are Location, Finding and Abbreviation. Moreover, we are particularly interested in treating Negation hedge cues separately, in order to analyze whether additional information coming from a different negation detection task is able to provide useful knowledge to our network. For these reasons, we model those four classes (Location, Finding, Abbreviation and Negation) separately, and gather the remaining six entities in the same output structure. Hence, the four separate classes can be modelled using BILOU labels alone, since each has its own output vector. On the other hand, since the output for the six remaining entities is represented in a single vector, the entity type has to be combined with the BILOU labels when considering these entities: for instance, label B-Measure distinguishes the beginning of a Measure from B-Degree or B-Anatomical_Entity.
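As an illustration, the following is a minimal sketch, not part of the original system, of how character-offset entity spans can be mapped onto the BILOU label vector of a single entity type; the helper name and the simplified token representation are ours:

```python
# Minimal sketch (our own illustration): mapping spans of one entity type
# to a BILOU label sequence, given whitespace tokens with character offsets.

def bilou_tags(tokens, spans):
    """tokens: list of (text, start, end); spans: list of (start, end)
    character offsets for entities of a single type.
    Returns one BILOU label per token."""
    tags = ["O"] * len(tokens)
    for s_start, s_end in spans:
        covered = [i for i, (_, t_start, t_end) in enumerate(tokens)
                   if t_start >= s_start and t_end <= s_end]
        if not covered:
            continue
        if len(covered) == 1:
            tags[covered[0]] = "U"          # unique-token entity
        else:
            tags[covered[0]] = "B"          # beginning
            for i in covered[1:-1]:
                tags[i] = "I"               # inside
            tags[covered[-1]] = "L"         # last
    return tags

# "No se detectaron adenomegalias": Negation covers the first three tokens.
tokens = [("No", 0, 2), ("se", 3, 5), ("detectaron", 6, 16),
          ("adenomegalias", 17, 30)]
print(bilou_tags(tokens, [(0, 16)]))   # ['B', 'I', 'L', 'O']
```

For the combined "Other" vector, each tag would additionally carry the entity type (e.g. "B-Measure"), as described above.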
Figure 1: Example of transformation from BRAT format to BILOU annotation scheme. (The figure shows the per-class label vectors for the sentences "Espesor del musculo 0.15 cm" and "No se detectaron adenomegalias".)

Figure 1 shows an example of transforming sentences annotated in the BRAT format into the described BILOU annotation scheme. The top part of the figure shows a sentence in which we find a unique token labelled as Type_of_measure, another one labelled as Location, and a unique Abbreviation embedded within a Measure. These elements are represented as unique tokens (U) in the Location and Abbreviation vectors. Since neither Type_of_Measure nor Measure have independent output vectors, particular labels must be used within the Other vector: a unique Type_of_Measure label (U_TM) represents "Espesor", and begin and last Measure labels (B_M and L_M) represent "0.15 cm". The bottom part of the figure illustrates the use of begin, inside and last labels (B, I and L) within the Negation vector for representing "No se detectaron", and a unique (U) token in the Finding vector for representing "adenomegalias".

4.2. Input Features

The features used for representing the input of the proposed deep learning stack are the following:

• Word embeddings: Two different pre-trained word embedding models from different sources are used for text representation. On the one hand, we use general domain Spanish 100-dimensional FastText word embeddings trained on Common Crawl and Wikipedia [27]. On the other hand, 100-dimensional FastText embeddings generated from Spanish clinical texts [28] are also tested, in order to analyze the differences and potential improvements.

• Character embeddings: The use of character embeddings may help to reduce the information loss caused by the limited dimensionality of word embeddings. We train character embeddings from scratch, using a convolutional layer to generate a 16-dimensional character-based vector for each token in the document.

• Casing, punctuation and formatting information: An additional 8-position one-hot vector is used for modeling different casing scenarios, as well as information about punctuation marks and other formatting issues: uppercased first letter, term ending in a comma, term ending in a dot, term being a number, term being mostly numeric (over 50% of the characters being digits), term containing any digit, term containing any other punctuation marks, and other cases. Through this feature, we encode information that is usually omitted by word embeddings.

4.3. Main Architecture

The main design of the proposed deep learning stack is shown in Figure 2. Vectors representing word embeddings, character embeddings and casing information are concatenated and fed into two different pipelines, both of them consisting of a Bi-LSTM layer followed by a dense layer and a Conditional Random Field layer that performs the final classification. As mentioned in Section 2, this combination of Bi-LSTM and CRF layers has shown high performance in different NER tasks in the past few years. Although modern BERT-based architectures might offer better results, they were avoided in this case due to the small size of the training dataset provided by the organizers.
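The paper does not publish its implementation, so the following is only a minimal PyTorch sketch of the described stack. The class and attribute names are ours, the hidden sizes follow the Bi-LSTM(50)/Bi-LSTM(100) annotations in Figure 2, and we use the third-party pytorch-crf package (pip install pytorch-crf) for the CRF layers; the original framework and training details are not specified in the paper:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class SpRadIEModel(nn.Module):
    """Two-pipeline Bi-LSTM + CRF sketch (our reconstruction, not the
    authors' code). Input x is the concatenation of word, character and
    casing feature vectors for each token."""
    def __init__(self, feat_dim, n_other_tags, n_bilou=5):
        super().__init__()
        # Pipeline 1: Location, Finding, Abbreviation and combined "Other".
        self.lstm_ent = nn.LSTM(feat_dim, 50, bidirectional=True,
                                batch_first=True)
        self.heads = nn.ModuleDict(
            {name: nn.Linear(100, n_bilou)   # BILOU-only output vectors
             for name in ("location", "finding", "abbreviation")})
        self.heads["other"] = nn.Linear(100, n_other_tags)
        self.crfs = nn.ModuleDict(
            {name: CRF(head.out_features, batch_first=True)
             for name, head in self.heads.items()})
        # Pipeline 2: independent negation hedge cue detection.
        self.lstm_neg = nn.LSTM(feat_dim, 100, bidirectional=True,
                                batch_first=True)
        self.neg_dense = nn.Linear(200, n_bilou)
        self.neg_crf = CRF(n_bilou, batch_first=True)

    def loss(self, x, gold, mask):
        # x: (batch, seq, feat_dim); gold: dict of gold tag tensors keyed
        # like self.heads plus "negation"; mask: byte/bool tensor of valid
        # token positions. Sums the negative CRF log-likelihoods.
        h_ent, _ = self.lstm_ent(x)
        nll = torch.zeros(())
        for name, head in self.heads.items():
            nll = nll - self.crfs[name](head(h_ent), gold[name], mask=mask)
        h_neg, _ = self.lstm_neg(x)
        nll = nll - self.neg_crf(self.neg_dense(h_neg), gold["negation"],
                                 mask=mask)
        return nll

    def decode(self, x, mask):
        # Viterbi decoding: one BILOU prediction per output vector.
        h_ent, _ = self.lstm_ent(x)
        out = {name: self.crfs[name].decode(head(h_ent), mask=mask)
               for name, head in self.heads.items()}
        h_neg, _ = self.lstm_neg(x)
        out["negation"] = self.neg_crf.decode(self.neg_dense(h_neg),
                                              mask=mask)
        return out
```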
In the proposed system, the first pipeline is used for detecting all the possible entities in the dataset except for the negation hedge cues, while the second pipeline performs independent detection of negation hedge cues. Four parallel CRF layers are used in the first pipeline for classifying the aforementioned most frequently overlapping entities (Location, Finding and Abbreviation) and the set of remaining entities. A single CRF layer is used in the second pipeline for classifying negation hedge cues.

Figure 2: Proposed deep learning stack. (The figure shows the concatenated word, character and casing vectors feeding two pipelines: a Bi-LSTM(50, tanh) followed by CRF layers for Location, Finding, Abbreviation and Other, and a Bi-LSTM(100, tanh) followed by a CRF layer for Negation.)

A final rule-based post-processing step is applied to the output of the deep learning architecture for solving systematic errors. The proposed rules are as follows:

• Use of the regular expression "([0-9]+)([a-zA-Z]+)" for finding terms such as "128cm". In those cases, the expression "cm" is added to the list of entities as an Abbreviation.

• Use of a more complex regular expression based on the previous one for trying to ensure the annotation of three-dimensional measures such as "2.5 x 2.5 x 128cm".

• Some scenarios have been identified in which no annotation is generated for some abbreviations. This last rule tries to cover the full annotation of the following measure abbreviations: "cc", "cm", "mm", "ml", "l", "kg", "g", "mg".

4.4. Transfer Learning

As depicted in previous sections, the main objective of this system is to allow the analysis of potential improvements that can be obtained by applying transfer learning techniques to the independent identification of negation hedge cues. For this purpose, we compare the performance of the system when the initial weights of both Bi-LSTM networks are randomly initialized with that achieved when incorporating pre-trained weights into the Bi-LSTM network that performs negation detection. These pre-trained weights are extracted from a different negation detection task: in particular, they are generated while training a different deep learning stack for detecting negation scopes and triggers on the SFU ReviewSP-NEG corpus [29]. The deep learning stack used for that separate task is described in detail in [30], and is based on the combination of a Bi-LSTM layer followed by a dense neural network performing the final classification. The transfer learning process consists of using the weights of this Bi-LSTM, trained on the negation task, to initialize the weights of the Bi-LSTM devoted to the detection of negation hedge cues in the eHealth task.
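A minimal sketch of this weight transfer, reusing the hypothetical SpRadIEModel class from the sketch above and assuming the auxiliary Bi-LSTM from [30] has the same input and hidden sizes as the negation pipeline:

```python
import copy

def init_negation_bilstm(model, aux_lstm=None):
    """Classic setup: keep the default random initialization (aux_lstm=None).
    Transfer setup: copy the weights of the Bi-LSTM pre-trained on the
    SFU ReviewSP-NEG negation task into the negation pipeline.
    Assumes matching layer shapes; names are ours, not the authors'."""
    if aux_lstm is not None:
        model.lstm_neg.load_state_dict(copy.deepcopy(aux_lstm.state_dict()))

# The copied weights serve only as an informed initialization: the whole
# stack, including model.lstm_neg, is then trained on the SpRadIE data.
```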
According to the widely recognized and adopted categorization of transfer learning techniques presented in [31], the proposed approach is a case of inductive transfer learning: the two tasks involved in the process are different, although their domains are strongly related, and labelled data are available in both the source and the target task (negation detection and eHealth NER, respectively). This setting is similar to multi-task learning; however, in this case only one task is optimized for achieving high performance, instead of trying to learn both tasks simultaneously. Regarding deep transfer learning categorizations such as the one proposed in [32], this is a case of network-based deep transfer learning.

5. Results

Four different runs were allowed in the test phase of the CLEF eHealth challenge. Consequently, we prepared four different settings of our system for submission in this phase. These settings combine the two word embedding models mentioned in Section 4.2, that is, general domain and clinical embeddings, with the two main settings of the deep learning architecture: classic weight initialization (random, no transfer learning) and transfer learning initialization, both applied to the Bi-LSTM layer related to negation detection. The four settings are denoted as follows in the experiments: Classic+General (classic initialization, general domain embeddings), Transfer+General (transfer learning, general domain embeddings), Classic+Clinical (classic initialization, clinical embeddings) and Transfer+Clinical (transfer learning, clinical embeddings).

As detailed in Section 3, two different development sets (same-sample and held-out) were provided by the organizers, and a single test set was used for the final scores of the participating systems. Tables 1, 2 and 3 show the scores achieved by the different settings of our system on the same-sample development dataset, the held-out development dataset and the test dataset, respectively, using the official metrics of the task.

Table 1
Results achieved by the LSI_UNED team on the CLEF eHealth same-sample development dataset, for the lenient and exact matching metrics. F1, precision (P) and recall (R) values for each metric are expressed as percentages. Bold indicates the best setting for each metric.

Same-sample development      Lenient                Exact
Setting                   F1     P      R       F1     P      R
Classic+General          83.49  87.63  80.35   80.45  84.38  77.47
Transfer+General         84.02  88.38  80.67   80.82  84.98  77.60
Classic+Clinical         84.43  88.85  80.95   82.04  86.34  78.66
Transfer+Clinical        84.18  89.06  80.50   81.37  86.11  77.77

Table 2
Results achieved by the LSI_UNED team on the CLEF eHealth held-out development dataset, for the lenient and exact matching metrics. F1, precision (P) and recall (R) values for each metric are expressed as percentages. Bold indicates the best setting for each metric.

Held-out development         Lenient                Exact
Setting                   F1     P      R       F1     P      R
Classic+General          77.80  89.36  69.56   74.56  85.56  66.72
Transfer+General         78.89  89.27  71.24   75.36  85.17  68.11
Classic+Clinical         75.89  85.73  68.58   72.82  82.23  65.82
Transfer+Clinical        76.70  86.67  69.41   73.72  83.26  66.74

Table 3
Results achieved by the LSI_UNED team on the CLEF eHealth test dataset, for the lenient and exact matching metrics. F1, precision (P) and recall (R) values for each metric are expressed as percentages. Bold indicates the best setting for each metric.
Test dataset                     Lenient                Exact
System setting                F1     P      R       F1     P      R
Classic+General (Run 3)      83.66  90.88  77.51   80.14  87.06  74.25
Transfer+General (Run 1)     83.88  90.28  78.33   80.07  86.17  74.76
Classic+Clinical (Run 4)     83.71  89.75  78.43   79.57  85.30  74.55
Transfer+Clinical (Run 2)    83.77  89.73  78.55   79.82  85.50  74.84

Some insights can be drawn from the results obtained by the proposed settings of our system on the different development datasets and the test dataset. In general, transfer learning applied to negation detection provides some improvement over the classic approach, particularly on the held-out development dataset, while the best results on the same-sample dataset are achieved with the classic approach; on this dataset, however, the differences are quite small. Regarding the use of general domain or clinical embeddings, the most noticeable differences again occur on the held-out dataset.

Considering the test dataset, we can observe that the setting obtaining the best F1 under the lenient matching metric uses general domain embeddings and transfer learning. However, the small differences between the four submitted runs indicate that the good performance shown by our system is more attributable to the proposed deep learning architecture (Bi-LSTM + CRF) than to the use of transfer learning techniques or particular embeddings. In addition, we can observe that the results achieved on the test dataset are quite close to those obtained on the same-sample development dataset, and higher than those obtained on the held-out development dataset. This might indicate that the test dataset developed by the organizers is more similar to the same-sample development dataset, and hence to the training dataset.

Table 4 illustrates the behaviour of the four configurations of the system for each of the entities and hedge cues in the task: Abbreviation (Abb.), Anatomical_Entity (AE), Conditional_Temporal (CT), Degree (Deg.), Finding (Find.), Location (Loc.), Measure (Meas.), Negation (Neg.), Type_of_Measure (TM) and Uncertainty (Unc.). System configurations are the same as in Tables 1, 2 and 3. Only the F1 score for the lenient evaluation is shown, in order to simplify the table.

Table 4
Results achieved by the different runs of the LSI_UNED team on the CLEF eHealth test dataset, for each of the proposed entities. The metric is lenient F1, expressed as a percentage. Bold indicates the best setting for each entity.

Setting   Abb.   AE     CT     Deg.   Find.  Loc.   Meas.  Neg.   TM     Unc.
C+G      90.70  82.08  57.14  44.44  75.34  65.89  88.89  92.09  86.28  70.06
T+G      91.22  82.83  36.36  46.93  73.01  66.87  88.40  92.26  88.25  72.33
C+C      91.04  81.67  50.00  65.71  71.93  66.87  88.32  94.50  89.26  66.62
T+C      92.20  82.50  50.00  63.77  73.03  65.54  87.52  90.05  89.28  66.71

As mentioned before, no major differences are found when comparing either the "Classic" and "Transfer" initialization schemes, or the "General" and "Clinical" word embedding models. The main differences concern the entity Degree, for which the use of clinical embeddings clearly improves the results compared with those obtained using general domain embeddings. On the other hand, general domain embeddings offer considerably better results than clinical embeddings for the hedge cue Uncertainty.
The use of the proposed transfer learning setting brings slight improvements for Abbreviation, Anatomical_Entity, Type_of_Measure and Uncertainty, while negation hedge cues only benefit from this transfer learning technique when using general domain embeddings. All these results reinforce the idea that the good results offered by the system are a consequence of the proposed deep learning architecture, over and above the use of transfer learning techniques or specific embeddings.

Finally, Table 5 compares the results obtained by the best run of each system participating in the CLEF eHealth 2021 task (SpRadIE), ordered by the F1 measure under the lenient matching metric, as provided by the organizers. Our best run is ranked second in the task, out of seven participants. Moreover, we obtain the highest precision scores, both under the lenient and the exact matching metrics. Regarding F1, our team is just 1.63 percentage points behind the best system under the lenient metric, and 0.19 points under the exact metric, while the differences between our results and those obtained by the third best system are much larger (5.41 and 6.94 points, respectively). In addition, thanks to the information provided by the organizers upon completion of the evaluation, we know that our runs obtain the best lenient F1 values for the entities Finding and Measure, and the second best for entities such as Abbreviation, Degree and Negation. Since Finding, Abbreviation and Negation are considered separately for classification (see Section 4.1), this could indicate that using separate classifiers for each entity might bring important improvements.

Table 5
Results achieved by the participating systems on the CLEF eHealth test dataset, for the lenient and exact matching metrics. F1, precision (P) and recall (R) values for each metric are expressed as percentages. Results are ordered by the lenient F1 metric, and bold indicates the results of our best setting.

Test dataset        Lenient                Exact
System           F1     P      R       F1     P      R
EdIE-KnowLab    85.51  87.24  83.85   80.26  81.88  78.70
LSI_UNED        83.88  90.28  78.33   80.07  86.17  74.76
ctb madrid      78.47  78.62  78.32   73.13  73.27  72.99
HULAT_MA        75.64  78.38  73.08   64.92  67.28  62.73
SINAI           73.70  86.07  64.43   67.96  79.37  59.42
SWAP            59.17  70.18  51.14   47.84  56.75  41.35
ims_unipd       16.00   9.29  57.62    9.38   5.45  33.77

6. Error Analysis

In this section we present some systematic errors, detected during the development phase, that affect the performance of the proposed system:

• The system is not able to process some of the discontinuous entities included in the dataset. These entities represent 4.64% of the held-out dataset and 4.12% of the same-sample dataset. The annotation of some of those entities was avoided in order to prevent the system from mislearning particular entities. For instance, the original text "VIA BILIAR intra y extrahepatica" should result in the detection of the entities "VIA BILIAR extrahepatica" and "VIA BILIAR intra hepatica". However, it is particularly difficult to design an annotation scheme that represents both entities in the training step; hence, our system does not take this particular case into account.

• Although the system includes different CRF layers for addressing overlapping entities, the total number of CRF layers is smaller than the total number of entities in the task.
As mentioned in Section 4.1, only the entities most frequently involved in overlapping, identified in a preliminary study of the provided corpus, were selected for classification in a separate CRF layer. The main reason for this decision was to reduce the complexity of the final deep learning stack. However, although the remaining cases of overlapping entities represent a small proportion of the total number of cases, some errors may stem from this design choice.

• Documents within the training corpus were not tokenized and contained misspellings, due to the specific nature of medical texts. Our system does not include special solutions for misspelling errors, and the tokenization step only considers whitespace as a token delimiter. This usually leads to recall issues that, in our case, are mitigated by considering subword information through the use of FastText embeddings.

7. Conclusions and Future Work

In this paper we described the deep learning architecture proposed for the Multilingual Information Extraction task (SpRadIE) of CLEF eHealth 2021. We explored the use of transfer learning techniques taking advantage of information from negation detection tasks, and we also analyzed the differences in results when using general domain embeddings and clinical embeddings. The obtained results are quite promising, especially with regard to the proposed deep learning stack, composed of Bi-LSTM layers and CRF classifiers, which separates the classification of those entities more likely to appear embedded or in a discontinuous form within the dataset. Improvements provided by the use of transfer learning were only found in specific settings. We obtained the second best F1 score among the participants in the task, not far behind the first place, and the best precision score in the task.

One of the first future lines of work should be exploring a further decomposition of the annotation scheme used in the documents, in order to analyze the effects of, for instance, classifying each entity separately. We consider transfer learning techniques a promising line of research within the task; however, a secondary task more closely related to the main NER task should probably be found for the effects of this transfer learning to be noticeable. We consider that, in this task, the influence of negation hedge cues is not strong enough for transfer learning from the considered secondary task to make a real difference. Finally, further exploration of the different word embedding models considered in this work might be an interesting research line. For instance, combining the general and clinical word embeddings, either by averaging or concatenating them, could offer additional insights into the behaviour of the different models.

Acknowledgments

This work has been partially supported by the Spanish Ministry of Science and Innovation within the DOTT-HEALTH Project (MCI/AEI/FEDER, UE) under Grant PID2019-106942RB-C32, as well as by project EXTRAE II (IMIENS 2019) and the research network AEI RED2018-102312-T (IA-Biomed).

References

[1] H. Suominen, L. Goeuriot, L. Kelly, L. A. Alemany, E. Bassani, N. Brew-Sam, V. Cotik, D. Filippo, G. González-Sáez, F. Luque, P. Mulhem, G. Pasi, R. Roller, S. Seneviratne, R. Upadhyay, J. Vivaldi, M. Viviani, C. Xu, Overview of the CLEF eHealth Evaluation Lab 2021, in: CLEF 2021 - 12th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS), Springer, 2021.
[2] V. Cotik, L. A. Alemany, D. Filippo, F.
Luque, R. Roller, J. Vivaldi, A. Ayach, F. Carranza, L. D. Francesca, A. Dellanzo, M. F. Urquiza, Overview of CLEF eHealth Task 1 - SpRadIE: A challenge on information extraction from Spanish Radiology Reports, in: CLEF 2021 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, 2021.
[3] A. R. Aronson, Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program, in: Proceedings of the AMIA Symposium, American Medical Informatics Association, 2001, p. 17.
[4] Z. Yang, H. Lin, Y. Li, Exploiting the performance of dictionary-based bio-entity name recognition in biomedical literature, Computational Biology and Chemistry 32 (2008) 287–291.
[5] K.-i. Fukuda, T. Tsunoda, A. Tamura, T. Takagi, et al., Toward information extraction: identifying protein names from biological papers, in: Pacific Symposium on Biocomputing, 1998, pp. 707–718.
[6] J. Tamames, Text Detective: a rule-based system for gene annotation in biomedical texts, BMC Bioinformatics 6 (2005) 1–8.
[7] P.-T. Lai, M.-S. Huang, T.-H. Yang, W.-L. Hsu, R. T.-H. Tsai, Statistical principle-based approach for gene and protein related object recognition, Journal of Cheminformatics 10 (2018) 1–9.
[8] Q. Wei, T. Chen, R. Xu, Y. He, L. Gui, Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks, Database 2016 (2016).
[9] Y. Wu, M. Jiang, J. Xu, D. Zhi, H. Xu, Clinical named entity recognition using deep learning models, in: AMIA Annual Symposium Proceedings, volume 2017, American Medical Informatics Association, 2017, p. 1812.
[10] Y. Zhang, Q. Chen, Z. Yang, H. Lin, Z. Lu, BioWordVec, improving biomedical word embeddings with subword information and MeSH, Scientific Data 6 (2019) 1–9.
[11] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2020) 1234–1240.
[12] L. Goeuriot, L. Kelly, H. Suominen, L. Hanlen, A. Névéol, C. Grouin, J. Palotti, G. Zuccon, Overview of the CLEF eHealth Evaluation Lab 2015, in: J. Mothe, J. Savoy, J. Kamps, K. Pinel-Sauvagnat, G. Jones, E. San Juan, L. Capellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer International Publishing, Cham, 2015, pp. 429–443.
[13] E. M. Cámara, Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. Á. G. Cumbreras, M. G. Vega, Y. Gutiérrez, A. Montejo-Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, J. Villena-Román, Overview of TASS 2018: Opinions, Health and Emotions, in: Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS@SEPLN 2018), co-located with the 34th SEPLN Conference (SEPLN 2018), Sevilla, Spain, September 18th, 2018, volume 2172 of CEUR Workshop Proceedings, CEUR-WS.org, 2018, pp. 13–27. URL: http://ceur-ws.org/Vol-2172/p0_overview_tass2018.pdf.
[14] A. Piad-Morffis, Y. Gutiérrez, J. P. Consuegra-Ayala, S. Estevez-Velarde, Y. Almeida-Cruz, R. Muñoz, A. Montoyo, Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2019, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF@SEPLN 2019), co-located with the 35th Conference of the Spanish Society for Natural Language Processing, Bilbao, Spain, September 24th, 2019, pp. 1–16. URL: http://ceur-ws.org/Vol-2421/eHealth-KD_overview.pdf.
[15] A. Piad-Morffis, Y. Gutiérrez, H. Cañizares-Diaz, S. Estevez-Velarde, R. Muñoz, A. Montoyo, Y.
Almeida-Cruz, Overview eHealth-KD 2020, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), co-located with the 36th Conference of the Spanish Society for Natural Language Processing (SEPLN 2020), Málaga, Spain, September 23rd, 2020, volume 2664 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 71–84. URL: http://ceur-ws.org/Vol-2664/eHealth-KD_overview.pdf.
[16] A. Bravo, P. Accuosto, H. Saggion, LaSTUS-TALN at IberLEF 2019 eHealth-KD Challenge: Deep Learning Approaches to Information Extraction in Biomedical Texts, in: IberLEF@SEPLN, 2019, pp. 51–59.
[17] H. Fabregat, A. D. Fernandez, J. Martinez-Romo, L. Araujo, NLP_UNED at eHealth-KD Challenge 2019: Deep Learning for Named Entity Recognition and Attentive Relation Extraction, in: IberLEF@SEPLN, 2019, pp. 67–77.
[18] A. R. Pérez, E. Q. Caballero, J. M. Alvarado, R. C. Linares, J. P. Consuegra-Ayala, UH-MAJA-KD at eHealth-KD Challenge 2020, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), co-located with the 36th Conference of the Spanish Society for Natural Language Processing (SEPLN 2020), Málaga, Spain, September 23rd, 2020, volume 2664 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 125–135. URL: http://ceur-ws.org/Vol-2664/eHealth-KD_paper5.pdf.
[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, arXiv preprint arXiv:1706.03762 (2017).
[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[21] S. Medina Herrera, J. Turmo Borras, TALP-UPC at eHealth-KD Challenge 2019: A joint model with contextual embeddings for clinical information extraction, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), co-located with the 35th Conference of the Spanish Society for Natural Language Processing (SEPLN 2019), Bilbao, Spain, September 24th, 2019, CEUR-WS.org, 2019, pp. 78–84.
[22] A. G. Pablos, N. Pérez, M. Cuadros, E. Zotova, Vicomtech at eHealth-KD Challenge 2020, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), co-located with the 36th Conference of the Spanish Society for Natural Language Processing (SEPLN 2020), Málaga, Spain, September 23rd, 2020, volume 2664 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 102–111. URL: http://ceur-ws.org/Vol-2664/eHealth-KD_paper3.pdf.
[23] I. Perez-Diez, R. Perez-Moraga, A. Lopez-Cerdan, J.-M. Salinas-Serrano, M. de la Iglesia-Vaya, De-identifying Spanish medical texts - Named Entity Recognition applied to radiology reports, Journal of Biomedical Semantics 12 (2021) 1–13.
[24] L. Akhtyamova, P. Martínez, K. Verspoor, J. Cardiff, Testing Contextualized Word Embeddings to Improve NER in Spanish Clinical Case Narratives, IEEE Access 8 (2020) 164717–164726.
[25] V. Cotik, D. Filippo, R. Roller, H. Uszkoreit, F. Xu, Annotation of Entities and Relations in Spanish Radiology Reports, in: RANLP, 2017, pp. 177–184.
[26] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, BRAT: a web-based tool for NLP-assisted text annotation, in: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012, pp. 102–107.
[27] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning word vectors for 157 languages, arXiv preprint arXiv:1802.06893 (2018).
[28] A. Gutiérrez-Fandiño, J. Armengol-Estapé, C. P. Carrino, O. De Gibert, A.
Gonzalez-Agirre, M. Villegas, Spanish Biomedical and Clinical Language Embeddings, arXiv preprint arXiv:2102.12843 (2021).
[29] S. M. J. Zafra, M. Taulé, M. T. Martín-Valdivia, L. A. U. López, M. A. Martí, SFU ReviewSP-NEG: a Spanish corpus annotated with negation for sentiment analysis. A typology of negation patterns, Language Resources and Evaluation 52 (2018) 533–569. URL: https://doi.org/10.1007/s10579-017-9391-x. doi:10.1007/s10579-017-9391-x.
[30] H. Fabregat, L. Araujo, J. Martínez-Romo, Deep learning approach for negation trigger and scope recognition, Procesamiento del Lenguaje Natural 62 (2019) 37–44. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/5950.
[31] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22 (2009) 1345–1359.
[32] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, C. Liu, A survey on deep transfer learning, in: International Conference on Artificial Neural Networks, Springer, 2018, pp. 270–279.