A multi-BERT hybrid system for Named Entity
Recognition in Spanish radiology reports
Víctor Suárez-Paniagua1,3 , Hang Dong1,3 and Arlene Casey2
1 Centre for Medical Informatics, Usher Institute, University of Edinburgh, Edinburgh, United Kingdom
2 Advanced Care Research Centre, Usher Institute, University of Edinburgh, Edinburgh, United Kingdom
3 Health Data Research UK, London, United Kingdom


Abstract
The present work describes the methods proposed by the EdIE-KnowLab team for the Information Extraction Task of CLEF eHealth 2021, SpRadIE Task 1. This task focuses on detecting and classifying relevant mentions in ultrasonography reports. The architecture developed is an ensemble of multiple BERT (multi-BERT) systems, one per entity type, combined with a dictionary generated from the training annotations and with available off-the-shelf tools, the Google Healthcare Natural Language API and GATECloud’s Measurement Expression Annotator, applied to the documents translated into English with word alignment from the neural machine translation tool, the Microsoft Translator API. Our best system configuration (multi-BERT with a dictionary) achieves 85.51% and 80.26% F1 for the Lenient and Exact metrics, respectively. Thus, the system ranked first out of the 17 submissions from the 7 teams that participated in this shared task. Our system also achieved the best Recall when merging the previous predictions with the results obtained from English-translated texts and cross-lingual word alignment (83.87% Lenient and 78.71% Exact match). The overall results demonstrate the potential of pre-trained language models and cross-lingual word alignment for NER with limited corpora in low-resource clinical domains.

                                         Keywords
                                         Named Entity Recognition, Radiology Reports, Deep Learning, BERT, Machine Translation




1. Introduction
Medical imaging reports are interpretations of diagnostic images written by radiologists. Whilst
radiology reports contain a relatively restricted vocabulary compared to other electronic health
records, they are still unstructured, which makes it difficult to extract meaningful data. However,
being able to effectively extract information from these narratives has the potential to quickly
and accurately identify abnormalities, supporting clinical decisions in a timely manner. The
application of Natural Language Processing (NLP) to radiology reports is a growing area, as
shown in a recent systematic review [1].
   The SpRadIE Task 1 [2] was the first challenge to deal with Named Entity Recognition (NER)
in the domain of Spanish radiology reports. Concretely, the target is to detect and classify
relevant mentions in the ultrasonography reports produced by physicians during their clinical
practice. These documents cover different domains, such as heart- and liver-related reports.


CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" vsuarez@ed.ac.uk (V. Suárez-Paniagua); hang.dong@ed.ac.uk (H. Dong); arlene.casey@ed.ac.uk (A. Casey)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
   Currently, large pre-trained language models with layers of multi-head self-attention
architectures [3], specifically Bidirectional Encoder Representations from Transformers (BERT)
[4], outperform other machine learning systems for the task of NER [5, 6], particularly in the
biomedical domain [7, 8]. BERT-based models have been successfully applied to NLP tasks in
radiology: Smit et al. [9] label findings in chest radiology reports, Wood et al. [10] explore
document-level labels at a coarse and finer-grained level from free text, and Schrempf
et al. [11] use BERT with a per-label attention mechanism. However, BERT models do
not always outperform more traditional methods [12]. In this paper, our main approach is to
use BETO [13], a BERT model pre-trained for Spanish NLP tasks. There are existing works that
perform Spanish NER for radiology reports [14, 15]; however, the use of a BERT-based model is
still largely unexplored for named entity recognition in Spanish radiology reports.
   This work describes the participation of the EdIE-KnowLab team in the CLEF 2021 eHealth
Task 1 [16], which involves the recognition of named entities in Spanish radiology reports. The
proposed method, which ranked first in the task, is a hybrid system that combines multiple
Spanish BERT classifiers (BETO), fine-tuned independently for each entity type, with a
dictionary extracted from the annotations of the training set. The cloud-based machine
translation service, Microsoft Translator API, was used to translate the documents into
English. Once the documents were translated, the off-the-shelf tools Google Healthcare
Natural Language API (GHNL) and GATECloud’s Measurement Expression Annotator (MEA)
predicted the entities in the documents, which were then traced back into Spanish using
the translation alignment.


2. Dataset
The dataset for the CLEF 2021 eHealth Task 1 contains 513 anonymized radiology reports from
a major pediatric hospital in Buenos Aires. Clinical experts and linguists annotated 17,000
mentions in Brat standoff format [17] following an annotation guideline. The organizers
provided the dataset split into three sets: 174 annotated reports for training the models,
92 annotated documents for validating the systems, and 207 unannotated reports for testing the
predictions submitted by the participants. In addition, the development set was divided into 47
documents with the same vocabulary as the training set (same-sample) and 45 documents
containing words that are not in the training set (held-out). The annotations are divided into
seven entity types: Anatomical Entity, Abbreviation, Finding, Location, Measure, Type of
Measure, and Degree, and three hedge cues: Negation, Uncertainty, and Conditional Temporal.
   Table 1 presents the number of annotations for each entity type in the three different sets.
It can be observed that this task presents a highly unbalanced problem, with the most
represented class (1,292 instances) differing by two orders of magnitude from the least
represented (11). A more detailed description of the dataset and its annotations can be found in [18].

2.1. Data preprocessing
The annotated reports present multiple linguistic challenges, such as orthographic and grammatical
errors, very long entities, discontinuous and embedded entities, subordination and coordination,
and systematic polysemy. Thus, our team carried out a data-cleaning step using simple rules and
regular expressions in order to resolve some of these challenges and prepare the documents for
the classifier.
Table 1
Number of instances for each class in the training and development sets, from the most to the
least represented.
                                        training set   same-sample set    held-out set
               Entity type
                                       (174 reports)       (47 reports)   (45 reports)
               Anatomical Entity               1292                365            476
               Abbreviation                     877                251            269
               Finding                          795                230            450
               Measure                          579                148            211
               Location                         517                121            156
               Negation                         484                121            156
               Type of measure                  346                 80             79
               Uncertainty                       52                 15              6
               Degree                            41                 13             22
               Conditional Temporal              11                  3              1


  Some token spans were not completely annotated, such as the mention "venas
femorale" in the sentence "Ambas arterias y venas femorales permeables" ("Both femoral
arteries and veins patent"). In these cases, we used regular expressions to extend the annotated
offsets until the span covers the whole mention: concretely, the regular expressions "[a-zA-Z]+$"
and "^[a-zA-Z]+" complete the mention backward and forward from the given span, respectively.
For discontinuous mentions, we followed a naïve approach that takes the minimum and the maximum
offsets as the span of the split mention. Embedded entities were de-overlapped by recursively
walking a graph representation of the sentence, where the nodes are the entities and the edges
link overlapping mentions, to obtain all possible non-overlapping paths. Annotations with
multiple types are handled automatically by using one classifier per class and then merging all
their outputs into a single prediction file.
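  As an illustration, the span-completion rule can be written in a few lines of Python. This is a minimal sketch of the step described above, not the exact code used in our pipeline, and the helper name is ours.

```python
import re

def complete_span(text: str, start: int, end: int) -> tuple:
    """Extend a partially annotated span backward and forward with the
    regular expressions "[a-zA-Z]+$" and "^[a-zA-Z]+" until it covers
    whole alphabetic tokens (a simplified sketch; accents are ignored)."""
    backward = re.search(r"[a-zA-Z]+$", text[:start])  # complete backward
    forward = re.match(r"^[a-zA-Z]+", text[end:])      # complete forward
    new_start = backward.start() if backward else start
    new_end = end + forward.end() if forward else end
    return new_start, new_end

sentence = "Ambas arterias y venas femorales permeables"
# Suppose the annotation only covers "venas femorale" (offsets 17-31):
print(complete_span(sentence, 17, 31))  # -> (17, 32), i.e. "venas femorales"
```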
  Documents were transformed into lower case, special characters such as the newline escape
sequence "\n" were replaced by a white space, and the sentences were tokenized using the
Spanish transformer pipeline of spaCy [19]. Finally, the Brat annotations were tagged with the
BIOES scheme, an extension of the BIO encoding [20], where the B tag indicates the beginning
token of an entity, the I tag an inside token, the E tag the ending token, the S tag a
single-token entity, and the O tag tokens that do not belong to any entity.
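  To make the tagging scheme concrete, the sketch below (an illustration with hypothetical helper names, not our tagging code itself) converts token-level entity spans into BIOES labels.

```python
def bioes_tags(num_tokens: int, spans: list) -> list:
    """Assign BIOES tags given non-overlapping entity spans expressed as
    (first_token, last_token) inclusive token indices."""
    tags = ["O"] * num_tokens
    for first, last in spans:
        if first == last:
            tags[first] = "S"                  # single-token entity
        else:
            tags[first] = "B"                  # beginning token
            for i in range(first + 1, last):
                tags[i] = "I"                  # inside tokens
            tags[last] = "E"                   # ending token
    return tags

# A 5-token sentence with one two-token entity spanning tokens 3-4:
print(bioes_tags(5, [(3, 4)]))  # ['O', 'O', 'O', 'B', 'E']
```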


3. Methods
This section presents the different approaches used for the SpRadIE 2021 Shared Task. Figure 1
shows the whole system, with the four proposed methods integrated and their predictions merged.
Figure 1: The proposed NER system that recognizes and classifies the mentions in a given Spanish
radiology report. The whole hybrid system contains four different methods (from A to D): A. multi-BERT
classifier; B. dictionary-based approach; C-D: English translation and cross-lingual word alignment
(all obtained using Microsoft Translator API) with Google Healthcare Natural Language API (C) and
GATECloud’s Measurement Expression Annotator (D). All predictions from the different methods are
integrated into a final prediction. Submissions are a cumulative combination of the four methods A-D.
The system A+B (multi-BERT with dictionary) obtained the best F1 and A+B+C (further using word
alignment with English NER) obtained the best Recall in the final evaluation in SpRadIE.


3.1. Multi-BERT classifier
Once the Spanish radiology reports were preprocessed, the annotations in the datasets were
divided by class. A separate BETO classifier [13] was then fine-tuned independently for each
class using only that class's named entities in the training set, and each model was validated
on the development set to obtain the best performance for each entity type. Figure 1A shows how
the individual predictions are merged to generate the final multi-BERT annotation.
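A minimal sketch of such a per-class setup with the Hugging Face transformers library follows; the checkpoint name is the public uncased BETO model, but the code is an illustration under these assumptions rather than our exact training script.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

BETO = "dccuchile/bert-base-spanish-wwm-uncased"  # public uncased BETO checkpoint
ENTITY_TYPES = [
    "Anatomical_Entity", "Abbreviation", "Finding", "Location", "Measure",
    "Type_of_measure", "Degree", "Negation", "Uncertainty", "Conditional_Temporal",
]
LABELS = ["O", "B", "I", "E", "S"]  # BIOES tag set of one per-class classifier

tokenizer = AutoTokenizer.from_pretrained(BETO)
# One token-classification head per entity type, fine-tuned independently
# on sentences labelled only with that type's mentions:
classifiers = {
    ent: AutoModelForTokenClassification.from_pretrained(BETO, num_labels=len(LABELS))
    for ent in ENTITY_TYPES
}
```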

3.2. Dictionary based
We observed that, at the word level, entity annotations are highly repetitive, so the entity
vocabulary is similar across the reports. For this reason, we created a dictionary mapping the
vocabulary of the named entities in the training set to their corresponding classes. We use this
dictionary to find exact string matches in the reports and classify them with the labels it
records (see Figure 1B).
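A minimal sketch of this exact-match lookup is shown below; the function name and data layout are ours, and overlap handling is deliberately naive since overlapping predictions are resolved when merging.

```python
def dictionary_tag(report: str, lexicon: dict) -> list:
    """Label every exact occurrence of a training-set mention with the
    class recorded in the lexicon; returns (start, end, label) triples."""
    annotations = []
    for mention, label in lexicon.items():
        start = report.find(mention)
        while start != -1:
            annotations.append((start, start + len(mention), label))
            start = report.find(mention, start + 1)
    return annotations

print(dictionary_tag("higado de forma y tamano normal",
                     {"higado": "Anatomical_Entity"}))
# [(0, 6, 'Anatomical_Entity')]
```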

3.3. Cross-lingual word alignment with English NER tools
Since the most advanced NER tools are usually tailored to texts in English, we used machine
translation with cross-lingual word alignment to leverage the results from these tools.
   Table 2
   Matching between the medical knowledge categories in Google Healthcare Natural Language
   (GHNL) API and the entity types in SpRadIE
                        GHNL API category                   SpRadIE entity type
                        ANATOMICAL_STRUCTURE                Anatomical_Entity
                        BODY_MEASUREMENT                    Type_of_measure
                        BM_UNIT                             Abbreviation
                        BM_VALUE                            Measure
                        LAB_VALUE                           Measure
                        LABORATORY_DATA                     Measure
                        MED_STRENGTH                        Measure
                        MED_UNIT                            Abbreviation
                        PROBLEM                             Finding
                        SEVERITY                            Degree


We used the Microsoft Translator API1 , a neural machine translation (NMT) tool, to translate
reports from the source language (Spanish) to the target language (English). The key reason for
choosing the Microsoft Translator API is that it can output word alignments between the original
and the translated texts2 . Each word alignment is represented as a pair of mention spans, where
the span in the source language is aligned to the one in the target language, and each span
includes a start and an end index. However, one major limitation is that NMT methods can produce
erroneous and unreliable word alignments, as we see in the experiments, potentially because the
widely used attention-based alignment [21] is not accurate enough; e.g., in the sentence in
Figure 1, “muscle” in English was actually aligned to “espesor” (“thickness”) in Spanish, rather
than to “musculo” (“muscle”), despite the correct translation at the sentence level. Nevertheless,
the word alignment enables leveraging NER tools for other languages (e.g. English), as presented
below.

Google Healthcare Natural Language (GHNL) API The GHNL API3 , released in November 2020
     [22], extracts mentions from clinical texts, links them to medical terminologies, and
     classifies them into a set of “medical knowledge categories” (see Figure 1C). We matched
     some of these categories to the entity types of this task (e.g. “SEVERITY” to Degree). The
     full mapping from the GHNL API categories to the SpRadIE entity types is presented in
     Table 2.

GATECloud’s Measurement Expression Annotator (MEA) GATE is free, open-source software
     that performs text analysis for a multitude of applications [23]. We used MEA through
     the GATE Cloud service4 (see Figure 1D). This tool annotates numbers and measurement
     expressions in text; we map the indices of the numeric measurements and their units
     returned by MEA to Measure entities.


   1 https://docs.microsoft.com/en-us/azure/cognitive-services/translator/translator-info-overview
   2 https://docs.microsoft.com/en-us/azure/cognitive-services/translator/word-alignment
   3 https://cloud.google.com/healthcare/docs/concepts/nlp
   4 https://cloud.gate.ac.uk/shopfront/displayItem/measurement-expression-annotator
   After obtaining the English entities identified by the two off-the-shelf tools, we converted
them back into entities and offsets in the original Spanish texts based on the word alignment.
A tolerance of up to a few characters (values from 0 to 3) was allowed when matching the mention
spans from either the source or the target language to the indices in an aligned word pair.
We selected the best tolerance value (2 for the GHNL API and 1 for MEA) according to the results
on the development set.
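   The sketch below illustrates this back-projection. The alignment string format ("srcStart:srcEnd-tgtStart:tgtEnd" pairs) follows the Microsoft Translator word-alignment documentation, while the parsing and tolerance logic are our simplified illustration, not the exact pipeline code.

```python
def parse_alignment(proj: str) -> list:
    """Parse an alignment string such as "0:4-0:3 6:11-5:9" into
    (source_span, target_span) pairs of character indices."""
    pairs = []
    for chunk in proj.split():
        src, tgt = chunk.split("-")
        s0, s1 = map(int, src.split(":"))
        t0, t1 = map(int, tgt.split(":"))
        pairs.append(((s0, s1), (t0, t1)))
    return pairs

def back_project(en_span: tuple, pairs: list, tolerance: int = 2):
    """Map an English (target) entity span back to Spanish (source) offsets,
    allowing up to `tolerance` characters of slack at either boundary."""
    starts = [s0 for (s0, _), (t0, _) in pairs if abs(t0 - en_span[0]) <= tolerance]
    ends = [s1 for (_, s1), (_, t1) in pairs if abs(t1 - en_span[1]) <= tolerance]
    return (min(starts), max(ends)) if starts and ends else None
```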


4. Results
Each BETO classifier was trained independently for 8 epochs on the training set of its class.
Early stopping was applied to each model, taking the best performance over the two development
sets combined (same-sample and held-out). We used the uncased model of BETO because it achieved
better results than the cased model during validation. In addition, the maximum sentence length
was fixed to 300 and the remaining hyper-parameters were kept at their default values when
fine-tuning each BETO classifier.
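  A sketch of this configuration with the Hugging Face Trainer API is given below; the epoch count mirrors the description above, while the output directory, patience, and best-model metric are illustrative assumptions.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="beto-ner",            # illustrative path, not from the paper
    num_train_epochs=8,               # as described above
    evaluation_strategy="epoch",      # validate on the merged dev sets
    save_strategy="epoch",            # required by load_best_model_at_end
    load_best_model_at_end=True,      # keep the best-performing epoch
    metric_for_best_model="f1",       # assumed selection metric
)
# Early stopping is then attached to the Trainer, e.g.:
# Trainer(model=model, args=args, ...,
#         callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
```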
  The submissions were evaluated using two metrics (a character-level sketch of both follows
the list):
     • Lenient F1: computed from the Precision and Recall of the Jaccard index, which measures
       the ratio between the intersection and the union of the reference entities and the
       predicted entities.
     • Exact F1: computed from the Jaccard index scores, counting only reference and predicted
       entities that match perfectly.
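   One plausible character-level reading of these definitions is sketched below; this is our interpretation for illustration, not the official evaluation script.

```python
def jaccard(a: tuple, b: tuple) -> float:
    """Jaccard index between two character-offset spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def precision_recall(gold: list, pred: list):
    """Lenient credit = best Jaccard overlap per span; Exact credit
    requires a perfect span match (a Jaccard index of 1)."""
    lenient_p = sum(max((jaccard(g, p) for g in gold), default=0.0)
                    for p in pred) / max(len(pred), 1)
    lenient_r = sum(max((jaccard(g, p) for p in pred), default=0.0)
                    for g in gold) / max(len(gold), 1)
    exact_p = sum(p in gold for p in pred) / max(len(pred), 1)
    exact_r = sum(g in pred for g in gold) / max(len(gold), 1)
    return lenient_p, lenient_r, exact_p, exact_r
```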
   In order to choose the methods for each submission, we evaluated each system independently
on the two development sets (same-sample and held-out). The multi-BERT approach achieved very
good results for the majority of the classes, but its F1 score was 0% for Degree and Conditional
Temporal, likely due to the low number of such entities in the training set (submission 1). Thus,
we created a hybrid approach that combines the predictions of two methods (A+B): A, the
multi-BERT approach, and B, the dictionary-based approach (submission 2). While the micro-F1 of
the GHNL API was low (26.29%) due to the inaccurate cross-lingual alignment, we observed that
the GHNL API performed better than BETO on the class Degree (30.43% vs. 0% F1), so we aggregated
its predictions into the hybrid system (submission 3). Finally, GATECloud's MEA obtained higher
Precision for the class Measure than BETO (84.29% vs. 81.78%), so we merged its predictions for
this entity type with the previous methods, yielding the complete system (submission 4). We also
experimented with adding results from GATECloud's BioYODIE5 [24], a tool that applies a
gazetteer-based approach to named entity recognition and disambiguation, identifying various
biomedical named entities and attempting to link them to concept labels in UMLS. We mapped
entities from the BioYODIE results to Anatomical Entity, Location, and Finding, but as this did
not improve performance we left it out of submission 4. This ablation study allows us to evaluate
the contribution of each method to the whole system.
   5 https://cloud.gate.ac.uk/shopfront/displayItem/bio-yodie

   Table 3 presents the final results over the test set for each submission. The best performance
is obtained by submission 2, the hybrid of the multi-BERT and the dictionary-based approaches.
This system ranked first in the SpRadIE Shared Task with 85.51% and 80.26% F1 for the Lenient
and Exact metrics, respectively. However, the best Precision is obtained using the multi-BERT
approach alone, because the dictionary-based approach introduces False Positives. Moreover,
submission 3 achieved the best Recall in the task, meaning that it helps to recognize some
missing entities of the Degree type, but it marginally drops Precision by introducing spurious
predictions. Submission 4 slightly lowered the results on all metrics, suggesting that
aggregating the MEA predictions for the Measure class did not improve on the BETO classification
for this class.

Table 3
SpRadIE 2021 official results for the EdIE-KnowLab team submissions. Best performance for each column
is marked in bold.
                                         Lenient                                Exact
          Submission
                             Precision     Recall          F1     Precision      Recall        F1
          EdIE-KnowLab 1       87.81%     82.99%       85.33%      82.36%       77.84%     80.04%
          EdIE-KnowLab 2       87.24%     83.85%       85.51%      81.88%       78.70%     80.26%
          EdIE-KnowLab 3       87.15%     83.87%       85.48%      81.79%       78.71%     80.22%
          EdIE-KnowLab 4       85.67%     83.75%       84.70%      80.17%       78.37%     79.26%



Table 4
The multi-BERT hybrid system results for each entity type in the SpRadIE 2021 Shared Task.
                                            Lenient                              Exact
        Entity type
                                Precision     Recall        F1      Precision     Recall        F1
        Anatomical Entity          87.51%    84.35%      85.90%       80.91%      77.99%    79.43%
        Abbreviation               95.93%    94.83%      95.38%       95.35%      94.25%    94.79%
        Finding                    72.65%    75.63%      74.11%       59.94%      62.40%    61.15%
        Measure                    90.15%    85.85%      87.94%       83.06%      79.09%    81.03%
        Location                   75.56%    62.70%      68.53%       71.83%      59.60%    65.14%
        Negation                   93.52%    94.97%      94.24%       92.39%      93.82%    93.10%
        Type of measure            97.77%    82.80%      89.66%       94.50%      80.03%    86.67%
        Uncertainty                68.44%    43.13%      52.91%       52.17%      32.88%    40.34%
        Degree                     53.72%    79.27%      64.04%       53.72%      79.27%    64.04%
        Conditional Temporal        0.00%     0.00%       0.00%        0.00%       0.00%     0.00%

   Table 4 shows the per-entity-type results of submission 2, the model with the best overall
performance. Surprisingly, this system achieves very good results for some classes with a low
number of instances, such as Negation, Type of measure, and Uncertainty. This is because each
BERT classifier is trained for a specific category, so each model is optimized independently.
Submission 1, which uses only the multi-BERT, obtained 0% Lenient and Exact F1 for the classes
Conditional Temporal and Degree; using the dictionary, we increased the results for the Degree
class to 64.04% in both metrics. However, the results could not be increased for Conditional
Temporal, which we believe is due to it being one of the most linguistically complex entity
types. The F1 scores of submissions 3 and 4 were slightly lower than those of the multi-BERT
hybrid system on the classes to which they were applied, Degree and Measure, respectively. Thus,
we conclude that further exploration of the translation, its word alignment, and the class
mapping for these tools is required to enhance the overall predictions.


5. Conclusions
This work describes the multi-BERT hybrid system presented by the EdIE-KnowLab team for the
CLEF 2021 eHealth Task 1, SpRadIE. The model ranked first in the task using multiple BETO
classifiers, one for each entity type, on Spanish radiology reports, together with a dictionary
extracted from the training set. The proposed NER method shows very promising results, achieving
85.51% Lenient F1 and 80.26% Exact F1. The main advantage of this approach is that it does not
require any expert domain knowledge or external resources for classifying the mentions. The
multi-classifier approach deals with the class imbalance among the entities in this task and
can recognize overlapping entities of different classes, but it cannot predict embedded mentions
of the same class.
   In addition, the method combining the multi-BERT hybrid system with GHNL over the translated
radiology reports and cross-lingual word alignment obtained the best Recall in the task. While
the improvement is marginal due to the low quality of the word alignment, the approach provides
a framework for leveraging English NER tools for texts in relatively low-resource languages and
domains with limited corpora (e.g. Spanish radiology reports in this task). This approach is
practical, as it does not require training data, and it has the potential to improve with more
accurate word alignment.
   In future work, we will explore fine-tuning pre-trained language models on larger Spanish
corpora such as PadChest [25], in the same way as [9], and extend the approach to the
multi-language classification of radiology reports from [26]. We also suggest further study of
translation and cross-lingual word alignment to leverage English NER tools for Spanish clinical
texts. The current performance of the neural machine translation (NMT) tool, Microsoft
Translator, with the GHNL API is poor due to the inaccurate alignment. Jointly generating
accurate alignments and translations in NMT is an open question being addressed (e.g. in
[27, 28]) and is to be applied to low-resource NER in future studies.


Acknowledgments
The authors would like to thank the members of the Clinical Natural Language Processing
Research Group and KnowLab at the University of Edinburgh and University College London
for their valuable discussion and comments. This work was supported by the HDR UK National
Text Analytics Implementation Project, the HDR UK National Phenomics Resource Project,
Wellcome Institutional Translation Partnership Awards (PIII032, PIII029, PIII009), the Alan
Turing Institute via Turing Fellowships and Turing project funding (EPSRC grant EP/N510129/1),
and a Legal and General PLC research grant to establish the independent Advanced Care Research
Centre at the University of Edinburgh. Legal and General PLC had no role in the conduct of the
study, the interpretation, or the decision to submit for publication. The views expressed are
those of the authors and not necessarily those of Legal and General PLC.
References
 [1] A. Casey, E. Davidson, M. Poon, H. Dong, D. Duma, A. Grivas, C. Grover, V. Suárez-
     Paniagua, R. Tobin, W. Whiteley, H. Wu, B. Alex, A systematic review of natural language
     processing applied to radiology reports, BMC Medical Informatics and Decision Making
     21 (2021) 179. URL: https://doi.org/10.1186/s12911-021-01533-7. doi:10.1186/s12911-
     021-01533-7.
 [2] V. Cotik, L. A. Alemany, D. Filippo, F. Luque, R. Roller, J. Vivaldi, A. Ayach, F. Carranza,
     L. D. Francesca, A. Dellanzo, M. F. Urquiza, Overview of CLEF eHealth Task 1 - SpRadIE:
     a challenge on information extraction from Spanish radiology reports, in: CLEF 2021
     Evaluation Labs and Workshop: Online Working Notes, CEUR Workshop Proceedings,
     CEUR-WS.org, 2021.
 [3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser,
     I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
     R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing
     Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/
     paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
 [4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: Proceedings of the 2019 Conference of
     the North American Chapter of the Association for Computational Linguistics: Human
     Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
     Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
 [5] X. Li, X. Sun, Y. Meng, J. Liang, F. Wu, J. Li, Dice loss for data-imbalanced NLP tasks, in:
     Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics,
     Association for Computational Linguistics, Online, 2020, pp. 465–476. doi:10.18653/v1/
     2020.acl-main.45.
 [6] M. Eberts, A. Ulges, Span-based joint entity and relation extraction with transformer
     pre-training, in: ECAI, 2020.
 [7] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical
     language representation model for biomedical text mining, Bioinformatics 36 (2019) 1234–
     1240. doi:10.1093/bioinformatics/btz682.
 [8] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon,
     Domain-specific language model pretraining for biomedical natural language processing,
     2021. arXiv:2007.15779.
 [9] A. Smit, S. Jain, P. Rajpurkar, A. Pareek, A. Ng, M. Lungren, Combining automatic labelers
     and expert annotations for accurate radiology report labeling using BERT, in: Proceedings
     of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP),
     Association for Computational Linguistics, Online, 2020, pp. 1500–1519. doi:10.18653/
     v1/2020.emnlp-main.117.
[10] D. A. Wood, J. Lynch, S. Kafiabadi, E. Guilhem, A. Al Busaidi, A. Montvila, T. Varsavsky,
     J. Siddiqui, N. Gadapa, M. Townend, M. Kiik, K. Patel, G. Barker, S. Ourselin, J. H. Cole,
     T. C. Booth, Automated labelling using an attention model for radiology reports of MRI
     scans (ALARM), in: T. Arbel, I. Ben Ayed, M. de Bruijne, M. Descoteaux, H. Lombaert, C. Pal
     (Eds.), Proceedings of the Third Conference on Medical Imaging with Deep Learning,
     volume 121 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 811–826. URL:
     http://proceedings.mlr.press/v121/wood20a.html.
[11] P. Schrempf, H. Watson, E. Park, M. Pajak, H. MacKinnon, K. Muir, D. Harris-Birtill,
     A. O’Neil, Templated text synthesis for expert-guided multi-label extraction from radiology
     reports, Machine Learning and Knowledge Extraction 3 (2021) 299–317. doi:10.3390/
     make3020015.
[12] A. Grivas, B. Alex, C. Grover, R. Tobin, W. Whiteley, Not a cute stroke: Analysis of
     Rule- and Neural Network-Based Information Extraction Systems for Brain Radiology
     Reports, in: Proceedings of the 11th International Workshop on Health Text Mining and
     Information Analysis, 2020.
[13] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT
     model and evaluation data, in: PML4DC at ICLR 2020, 2020.
[14] V. Cotik, H. Rodríguez, J. Vivaldi, Spanish Named Entity Recognition in the Biomedical
     Domain, in: J. A. Lossio-Ventura, D. Muñante, H. Alatrista-Salas (Eds.), Information
     Management and Big Data, volume 898 of Communications in Computer and Information
     Science, Springer International Publishing, Lima, Peru, 2018, pp. 233–248. doi:10.1007/
     978-3-030-11680-4-23.
[15] V. Cotik, D. Filippo, J. Castaño, An Approach for Automatic Classification of Radiology
     Reports in Spanish., Studies in Health Technology and Informatics 216 (2015) 634–638.
     URL: https://europepmc.org/article/med/26262128.
[16] L. Goeuriot, H. Suominen, L. Kelly, L. A. Alemany, N. Brew-Sam, V. Cotik, D. Filippo,
     G. Gonzalez Saez, F. Luque, P. Mulhem, G. Pasi, R. Roller, S. Seneviratne, J. Vivaldi, M. Vi-
     viani, C. Xu, CLEF eHealth evaluation lab 2021, in: D. Hiemstra, M.-F. Moens, J. Mothe,
     R. Perego, M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval, Springer
     International Publishing, Cham, 2021, pp. 593–600.
[17] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, brat: a web-based tool
     for NLP-assisted text annotation, in: Proceedings of the Demonstrations Session at EACL
     2012, Association for Computational Linguistics, Avignon, France, 2012.
[18] V. Cotik, D. Filippo, R. Roller, H. Uszkoreit, F. Xu, Annotation of entities and relations
     in Spanish radiology reports, in: Proceedings of the International Conference Recent
     Advances in Natural Language Processing, RANLP 2017, INCOMA Ltd., Varna, Bulgaria,
     2017, pp. 177–184. URL: https://doi.org/10.26615/978-954-452-049-6_025. doi:10.26615/
     978-954-452-049-6_025.
[19] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength Natural
     Language Processing in Python, 2020. doi:10.5281/zenodo.1212303.
[20] J. Turian, L. Ratinov, Y. Bengio, Word representations: A simple and general method for
     semi-supervised learning, in: Proceedings of the 48th Annual Meeting of the Association
     for Computational Linguistics, ACL ’10, Association for Computational Linguistics, Strouds-
     burg, PA, USA, 2010, pp. 384–394. URL: http://dl.acm.org/citation.cfm?id=1858681.1858721.
[21] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align
     and translate, arXiv preprint arXiv:1409.0473 (2014).
[22] A. Bodnari, Healthcare gets more productive with new industry-specific AI tools,
     2020. https://cloud.google.com/blog/topics/healthcare-life-sciences/now-in-preview-
     healthcare-natural-language-api-and-automl-entity-extraction-for-healthcare, accessed
     15 Mar, 2021.
[23] V. Tablan, I. Roberts, H. Cunningham, K. Bontcheva, Gatecloud.net: a platform for
     large-scale, open-source text processing on the cloud, Philosophical Transactions of the
     Royal Society A: Mathematical, Physical and Engineering Sciences 371 (2013) 20120071.
     doi:10.1098/rsta.2012.0071.
[24] G. Gorrell, X. Song, A. Roberts, Bio-yodie: A named entity linking system for biomedical
     text, arXiv preprint arXiv:1811.04860 (2018).
[25] A. Bustos, A. Pertusa, J.-M. Salinas, M. de la Iglesia-Vayá, Padchest: A large chest x-ray
     image dataset with multi-label annotated reports, Medical Image Analysis 66 (2020) 101797.
     doi:10.1016/j.media.2020.101797.
[26] A. E. W. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-y.
     Deng, R. G. Mark, S. Horng, MIMIC-CXR, a de-identified publicly available database of chest
     radiographs with free-text reports, Scientific Data 6 (2019) 317. doi:10.1038/s41597-
     019-0322-0.
[27] K. Song, X. Zhou, H. Yu, Z. Huang, Y. Zhang, W. Luo, X. Duan, M. Zhang, Towards better
     word alignment in transformer, IEEE/ACM Transactions on Audio, Speech, and Language
     Processing 28 (2020) 1801–1812. doi:10.1109/TASLP.2020.2998278.
[28] J. Zhang, H. Luan, M. Sun, F. Zhai, J. Xu, Y. Liu, Neural machine translation with explicit
     phrase alignment, IEEE/ACM Transactions on Audio, Speech, and Language Processing
     29 (2021) 1001–1010. doi:10.1109/TASLP.2021.3057831.