<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Vicomtech at eHealth-KD Challenge 2020: Deep End-to-End Model for Entity and Relation Extraction in Medical Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aitor García-Pablos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Naiara Perez</string-name>
          <email>nperez@vicomtech.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Montse Cuadros</string-name>
          <email>mcuadros@vicomtech.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Zotova</string-name>
          <email>ezotova@vicomtech.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SNLT group at Vicomtech Foundation, Basque Research and Technology Alliance (BRTA)</institution>
          ,
          <addr-line>Mikeletegi Pasealekua 57, Donostia/San-Sebastián, 20009</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>102</fpage>
      <lpage>111</lpage>
      <abstract>
        <p>This paper describes the participation of the Vicomtech NLP team in the eHealth-KD 2020 shared task about detecting and classifying entities and relations in health-related texts written in Spanish. The proposed system consists of a single end-to-end deep neural network with pre-trained BERT models as the core for the semantic representation of the input texts. We have experimented with two models: BERT-Base Multilingual Cased and BETO, a BERT model pre-trained on Spanish text. Our system models all the output variables (entities and relations) at the same time, modelling the whole problem jointly. Some of the outputs are fed back to later layers of the model, connecting the outcomes of the different subtasks in a pipeline fashion. Our system shows robust results in all the scenarios of the task. It has achieved the first position in the main scenario of the competition and top-3 in the rest of the scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity detection</kwd>
        <kwd>Relation extraction</kwd>
        <kwd>Health documents</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the face of the widespread success of Transformer-based architectures [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in virtually all Natural
Language Processing (NLP) tasks, Vicomtech has implemented a system with BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that learns
to recognise and classify entities and establish relations between them in an end-to-end multi-task
fashion. Our system has achieved the best results in the Main scenario and ranks among the top-3
results in the rest of the scenarios.
      </p>
      <p>The paper is organised as follows: Section 2 describes the proposed model and the approach
followed to represent the data to solve the task; Section 3 presents the results obtained, including a
comparison to other competing systems; finally, Sections 4 and 5 comment on several design choices
and provide some concluding remarks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. System description</title>
      <p>This section provides a comprehensive description of the system with which the reported results
have been obtained. First, we present the architecture of the deep neural network. Next, we explain
how the inputs and outputs have been represented and handled in order to solve the task. Afterwards, we
describe the post-processing rules that help fix potential incongruous outputs of the neural network
model. Finally, we present the training settings.</p>
      <sec id="sec-2-1">
        <title>2.1. Architecture</title>
        <p>The model is a deep neural network that receives the input tokens and jointly emits predictions for
several different output variables. These predictions can be grouped into two tasks: a) classifying
individual tokens, and b) classifying relations—the presence, absence or type of a relation—between
pairs of tokens. The output variables to be predicted by the model are the following:
• Entities: the classification of each individual token into one of the task’s entity types or ‘O’
(from ‘Out’, meaning that the token is not part of any entity at all, such as “puede” in Figure 1).
• multiword relations: whether token pairs belong to the same entity (such as “uno” and “días”
or “dos” and “días” in Figure 1).
• same-as relations: whether token pairs are related by the same-as relation.
• Directed relations: whether token pairs are related by any of the other relation types described
in the task.</p>
        <p>Unlike the rest of the relations considered, multiword and same-as relations are bidirectional.
In view of several preliminary experiments, which indicated that modelling all the relations together
caused noise when predicting directed relations, we decided to model bidirectional relations
separately.
      </p>
        <p>[Figure 2 about here: overview of the network architecture, from the input tokens and their BERT contextual embeddings, through the entity classifier and the all-vs-all token-pair representations, to the multiword, same-as, related and relation-type classifiers; the circled numbers referenced in the text mark the corresponding steps in the figure.]</p>
          <p>An overview of the inner workings of the network is given in Figure 2. The computation of the
model starts with the input tokens. The tokens are fed into a BERT model to obtain their contextual
embeddings (step 1 in Figure 2). These embeddings are passed to a classification layer that emits logits with the
prediction about each token being or not being an entity of a certain type (step 2).</p>
          <p>
            Next, the entity logits are concatenated back to the contextual embeddings, and a tensor operation
is performed to obtain an all-vs-all combination of token vectors (step 3). This generates S × S combined
embeddings that represent all the possible token pairs, S being the length of the input sequence.
Further, these embeddings are passed to a small randomly initialised DistilBERT model [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] with only two
layers of two attention heads each (step 4). The objective of this model is to further capture interactions
between the token pairs via self-attention.
          </p>
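          <p>As an illustration of the all-vs-all combination (step 3), the following PyTorch sketch builds the S × S pair representations from per-token vectors; the tensor names and shapes are assumptions for illustration, not the exact implementation used in the system.</p>
          <preformat>import torch

def combine_token_pairs(hidden_states, entity_logits):
    """Build S x S pair representations from per-token vectors.

    hidden_states: (batch, S, H) BERT contextual embeddings.
    entity_logits: (batch, S, E) logits of the entity classifier.
    """
    # Concatenate the entity logits back to the contextual embeddings: (B, S, H+E).
    token_repr = torch.cat([hidden_states, entity_logits], dim=-1)
    batch_size, seq_len, dim = token_repr.shape
    # Broadcast along two different axes to obtain every (i, j) combination.
    left = token_repr.unsqueeze(2).expand(batch_size, seq_len, seq_len, dim)
    right = token_repr.unsqueeze(1).expand(batch_size, seq_len, seq_len, dim)
    # One vector per token pair, with shape (B, S, S, 2(H+E)).
    return torch.cat([left, right], dim=-1)</preformat>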
          <p>The resulting token-pair representations are then passed to several classification layers to make
predictions about the relation between the tokens in each pair. The pairs are categorised by four
binary classifiers that decide respectively whether the tokens that form the pair are connected by a
multiword relation (step 5), a same-as relation (step 6), or one of the directed relations (step 7).</p>
          <p>The directed relations are modelled as outgoing arcs and as incoming arcs. That is, if the pair
t_i ⊕ t_j is linked by a relation of type r, the pair t_j ⊕ t_i will have the same relation
r. This is represented in the neural network as a fork of two equal branches of layers, one to model
the outgoing relations, and another to model the incoming relations. Notice that Figure 2 has been
simplified to show one of the branches only.</p>
          <p>At this step, the output of the classification layers tells whether there is a relation or not between
a pair of tokens. The multiword and same-as relations do not require further processing.</p>
          <p>[Figure about here: annotated example sentence “El dolor puede comenzar uno o dos días antes…”, with columns token, entity, multiword, same-as, related and relation type.]</p>
          <p>In the
case of directed relations, however, a relation type or label must be assigned. To that end, the logits of
the directed relation classifiers are concatenated back with DistilBERT’s output (step 8), and the resulting
representation is passed to a final classification layer to obtain the type of relation for each token pair
among the types defined in the task (step 9). Again, this is done twice: once for the outgoing arcs and once
for the incoming arcs.</p>
          <p>
            Overall, the network has seven classifiers, which are built using the same stack of layers: a fully
connected linear transformation layer, followed by a dropout layer and a non-linear activation function,
and a final linear transformation that outputs the logits for the given output variable. We arbitrarily
decided to use Mish [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], not having experimented with other activation functions.
          </p>
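          <p>A minimal PyTorch sketch of such a classifier stack could look as follows; the hidden size is an assumption, and nn.Mish requires a recent PyTorch version.</p>
          <preformat>import torch.nn as nn

class ClassificationHead(nn.Module):
    """Linear layer, dropout, Mish activation and a final linear projection."""

    def __init__(self, input_dim, hidden_dim, num_labels, dropout=0.2):
        super().__init__()
        self.dense = nn.Linear(input_dim, hidden_dim)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.Mish()
        self.output = nn.Linear(hidden_dim, num_labels)

    def forward(self, features):
        # Emit the logits for the given output variable.
        return self.output(self.activation(self.dropout(self.dense(features))))</preformat>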
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Input and output handling</title>
        <p>
          The training and development corpora have been provided in Brat [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] standoff format. This format
is character span-based, while our network works at token level. Furthermore, the output of the
neural network needs to be converted back to Brat’s format and the annotation schema proposed in
the task (i.e., multi-word entities must be reconstructed from the multiword relations, and so on).
Consequently, our system relies on a set of pre-processing and post-processing transformation steps,
explained below.
2.2.1. Data representation
Starting from the provided Brat representation, the different pieces of information must be adapted
in a manageable way according to our objectives. Figure 3 shows an example of the information
representation designed with all the network’s output variables.
        </p>
        <p>As mentioned in previous sections, the entities to be detected are not necessarily continuous and
they may even overlap. For example, the text span “uno o dos días” contains two independent entities:
“uno días” and “dos días”. In order to represent this information, we assign to each individual token its
corresponding entity label according to the spans from the Brat annotations. The tokens that belong
to the same entity span are marked as linked by a multiword relation. This approach allows us to
represent tokens as part of one or more entities regardless of their original position in the text. All
the tokens which are part of the same entity are inter-linked via multiword relations among them.</p>
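        <p>For illustration, a token-level encoding of the span “uno o dos días” could be sketched as follows; the Concept labels and the matrix layout are assumptions for illustration, not the corpus file format.</p>
        <preformat># Tokens of the span "uno o dos días", which contains the overlapping
# entities "uno días" and "dos días".
tokens = ["uno", "o", "dos", "días"]
entity_labels = ["Concept", "O", "Concept", "Concept"]
# multiword[i][j] = 1 when tokens i and j belong to the same entity.
multiword = [
    [0, 0, 0, 1],  # "uno"  is linked to "días"
    [0, 0, 0, 0],  # "o"    is not part of any entity
    [0, 0, 0, 1],  # "dos"  is linked to "días"
    [1, 0, 1, 0],  # "días" is linked to "uno" and "dos"
]</preformat>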
        <p>A similar approach is followed for same-as relations and the directed relations. When entities that
span several tokens are connected by one of these relations, only the first token of each entity (in the
order they appear in the text) is marked as being part of the relation. In addition, directed relations
are described by an additional output variable to indicate the type of the relation among the different
relation types defined in the task.</p>
        <p>The described model performs all the tasks end-to-end, using its own entity predictions as input
to detect relations. However, in Task B, gold entity annotations are provided by the task organisers;
systems need to focus on the relations only. In this case, our model accepts gold entity labels along with
the input tokens, and replaces the predicted entities with a one-hot encoding of the gold ones as the
input for detecting relations.
2.2.2. Interpreting and reconstructing the model output
The output of the model has to be interpreted to obtain a correct and meaningful label or relation
arc for each of the tokens in the original input text. The output of the entity classifier is
straightforwardly interpreted as a regular sequence-labelling task selecting the most probable prediction for
each individual token.
</p>
        <p>The reconstruction of relations is more elaborate. The network’s outcome for each modelled
relation variable forms an S × S matrix, S being the length of the token sequence, where each position
(i, j), with i, j ∈ [0, S], contains the prediction for the relation between token i and token j. We implemented
two strategies to select the most probable label:
• With inferencer 1, only the predicted outgoing relations are used, ignoring the incoming arcs’
predictions. In the ideal case, they should be symmetric and, thus, redundant.
• With inferencer 2, the prediction values for any outgoing arc i → j and its counterpart incoming
arc i ← j are summed up before selecting the most probable outcome. This is done for all the
modelled relations. In the case of the relation types, this is only possible because the relation
types vocabulary is shared between the outgoing and incoming relations, so the same index
refers to the same decoded relation type.</p>
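        <p>The two inference strategies can be summarised with the following sketch, where the logit tensors and their names are illustrative assumptions:</p>
        <preformat>import torch

def decode_relation(outgoing_logits, incoming_logits, inferencer=2):
    """Pick the most probable label for every token pair (S x S x L logit tensors)."""
    if inferencer == 1:
        # Inferencer 1: rely on the outgoing predictions only.
        scores = outgoing_logits
    else:
        # Inferencer 2: sum outgoing and incoming logits before the argmax;
        # valid because both branches share the same label vocabulary.
        scores = outgoing_logits + incoming_logits
    return torch.argmax(scores, dim=-1)</preformat>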
        <p>
          Finally, token positions must be corrected to account for deviations and extra offsets introduced by
BERT’s tokenization (BERT uses WordPiece tokenization [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which breaks original tokens into
subtokens; in addition, it requires that extra special tokens be added which distort the token positions
w.r.t. the original input).
2.2.3. Post-processing rules
The predictions obtained from interpreting the model’s output are still token-based and must be
processed further to obtain a representation in Brat standof format that is compliant with eHealth-KD’s
annotation scheme. In doing so, we also apply several rules that correct potential inconsistencies
produced by the neural network’s several classifiers. The post-processing consists of the following
steps:
1. Align entity annotations of individual tokens to the original text with a tool provided by Brat
developers (https://github.com/nlplab/brat/blob/master/tools/annalign.py).
2. Merge the annotations connected by a multiword arc, effectively generating multi-word
entities. We decided to keep disjoint sets of tokens: if tokens a and b are both connected to token c
but not to each other, we generate the multi-word entities (a, c) and (b, c) instead of (a, b, c)
(see the sketch after this list). We also decided to disregard multiword arcs to/from tokens that
are not classified as entities.
3. Re-assign the same-as and directed relations of the tokens in a multi-word entity to the latter.
4. Split multi-word entities that contain conjunctions or certain punctuation marks, such
as commas, semi-colons, parentheses, and so on, into two.
5. Discard multi-word entities that start or end with a stopword.
6. If the tokens in a multi-word entity have been assigned different entity types, assign to the
multi-word entity the most frequent type among the tokens composing the multi-word; in case
of a tie, choose the most frequent label in the corpus (i.e., Concept).
7. Discard entities that are wholly contained within another entity.
8. Discard same-as and directed relations from/to a token that is not an entity or part of an entity.
9. Discard reflexive relations, which might have arisen during the generation of multi-word
annotations.
        </p>
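        <p>The grouping described in step 2 admits the following reading, sketched here with networkx as an assumption rather than the actual implementation: multi-word entities correspond to maximal groups of mutually linked tokens.</p>
        <preformat>import networkx as nx

def group_multiword(entity_tokens, multiword_arcs):
    """Group entity tokens into multi-word entities via maximal cliques."""
    graph = nx.Graph()
    graph.add_nodes_from(entity_tokens)
    # Disregard multiword arcs to/from tokens that are not classified as entities.
    graph.add_edges_from(
        (i, j) for i, j in multiword_arcs
        if i in entity_tokens and j in entity_tokens
    )
    return [sorted(clique) for clique in nx.find_cliques(graph)]

# Tokens 0 ("uno") and 2 ("dos") both link to 3 ("días") but not to each
# other, so the result contains [0, 3] and [2, 3] rather than [0, 2, 3].
print(group_multiword({0, 2, 3}, [(0, 3), (2, 3)]))</preformat>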
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Training setup</title>
        <p>
          The system has been implemented in Python 3.7 with HuggingFace’s transformers library [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] (https://github.com/huggingface/transformers). We have experimented with two different pre-trained
BERT models as the core for the semantic representation of the input tokens: BERT-Base
Multilingual Cased (henceforth, mBERT; https://github.com/google-research/bert/blob/master/multilingual.
md) and BETO [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], a BERT model pre-trained on Spanish text. We did not perform any in-domain
language model fine-tuning for the base models. In this sense, the approach is general and
domainagnostic. The only resource used for fine-tuning the whole system is the data provided for the task,
consisting of 800 training sentences and 200 development sentences.
        </p>
        <p>
          The training of the different variants was carried out on 2 Nvidia GeForce RTX 2080 GPUs with
∼11GB of memory. The model requires a considerable amount of memory for training, so the batch
size was adjusted to 2, while the sequence length was adjusted to ∼100 tokens (the maximum length
encountered in the training set after BERT WordPiece tokenization, which varies from mBERT to
BETO). We applied the AdamW optimiser [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] with a base learning rate of 2e−5, combined with a
linear LR scheduling to warm-up the learning rate during the first 5,000 training steps. The dropout
probability was arbitrarily set to 0.2 across the whole network.
        </p>
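        <p>As an illustration, the optimiser and learning rate schedule described above can be configured with the transformers library as follows; the placeholder module and the total number of training steps are assumptions.</p>
        <preformat>import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 768)  # placeholder for the full network
optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=5_000,      # warm up during the first 5,000 training steps
    num_training_steps=100_000,  # assumed total; depends on epochs and batch size
)</preformat>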
        <p>The training monitored the F1-score of several of the classifiers in the development set and it was
run for a maximum of 500 epochs with an early-stopping patience of 150 epochs. Finally, we chose
the model checkpoints that had the best balance of development metrics, which for BETO was at
epoch 148 and for mBERT at epoch 262.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>Vicomtech participated in all the scenarios, submitting the same three runs:
• Run 1: mBERT + inferencer 1
• Run 2: BETO + inferencer 1
• Run 3: BETO + inferencer 2</p>
      <p>
        The results for each scenario and run are shown in Table 1. We provide the results on the
development data and the officially published results on the test data. In addition, the best results
obtained among all the participants in the challenge are also included per scenario for benchmarking
purposes.
</p>
      <p>[Table 1 about here: precision, recall and F1 of Run 1 (mBERT + inferencer 1), Run 2 (BETO + inferencer 1) and Run 3 (BETO + inferencer 2) for Scenario 1 - Main, Scenario 2 - Task A, Scenario 3 - Task B (relation extraction) and Scenario 4 - Transfer, Run 3 being the best result in Scenario 1. The best competing result per scenario is also shown: SINAI [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] in Task A, IXA-NER-RE [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] in Task B and Talp-UPC [13] in the Transfer scenario, plus an unofficial Transfer run of BETO + inferencer 2 fine-tuned on the transfer domain.]
      </p>
      <p>Furthermore, after submitting our predictions we learned that the task organisers had provided
100 out-of-domain sentences to fine-tune systems for the Transfer scenario. Our official results for
this scenario did not involve any kind of fine-tuning for the new domain, and thus rely exclusively
on zero-shot transfer-learning. For the sake of completeness, we have fine-tuned our best performing
model with the provided extra sentences and report the results at the bottom of the results table.</p>
      <p>As Table 1 shows, our approach has achieved the best scores of the challenge in the Main scenario,
obtaining the best balance between Task A—entity recognition and classification—and Task B—relation
extraction—, despite being surpassed by other participants in the individual tasks. The proposed
approach yields both better precision and recall metrics, improving the second best system by more
than 2 F1-score points. In Task A, our system is in second position, having improved the recall of
the winner system (82.01 vs 80.67) but not its precision (82.16 vs 84.46). As for Task B, our approach
yields remarkably lower recall scores than the best system (51.73 vs 61.92), but manages to win third
place in the scenario with the best precision (67.17). Finally, our system has won third place in the
Transfer task, despite not having been fine-tuned with the data available for the new domain. Our
system would have achieved the best F1-score had the extra 100 sentences been used, as the last row
in Table 1 shows.</p>
      <p>Regarding the differences between the submitted runs, little difference is observed. BETO seems to
be a slightly better choice than mBERT in all the scenarios. However, the variation of these differences
in the development and the test set suggests that the observed differences may not be statistically
significant, especially due to the limited size of the datasets. As for the differences between inferencers
1 and 2, the latter seems to help improve the scores for the relation detection. Specifically, taking into
account incoming and outgoing arcs appears to help produce more precise predictions by dropping
mostly false positive predictions in comparison to inferencer 1.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Due to time constraints, the presented model is the result of many arbitrary design choices and
contains arguable components that may require further research and experimentation. To enumerate
some of the potential flaws:
• The intermediate custom DistilBERT model is an addition based on intuition. We have not
performed enough experiments to prove it useful. Further, it implies a non-negligible amount
of extra computational and memory requirements.
• It is not clear whether modelling relations as outgoing and incoming arcs helps improve the
results. We have not experimented with other representation variations to gather enough
evidence to reach a conclusion in this regard.
• The directed relations are detected in two steps: 1) whether a relation exists or not, and 2) the
type of relation. This could have been done in a single step. We have not run experiments to determine
which approach yields better results.</p>
      <p>All in all, our system appears to be a good entity recogniser with the capability to produce quite
precise relations between the entities (while missing almost half of them), and to be suitable for
transfer learning scenarios. The joint modelling of both entities and relations has allowed the system to
achieve a good balance between Task A and B, but the system does not excel in any of them
individually. The presence of a pre-trained BERT model helps in the domain transfer scenario. Since the
results obtained suggest that specific pre-training on Spanish text (i.e., BETO) achieves better scores,
additional pre-training on more relevant data would probably help improve the results.</p>
      <p>On the whole, the task is far from being solved, in particular for relation extraction, despite the
reasonably good results obtained. We leave the issues and open questions discussed to future work.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In these working notes, we have described our participation in the eHealth-KD 2020 shared task. We
have presented the end-to-end deep-learning-based architecture of our system, which relies on
pretrained BERT models as the base for semantic representation of the texts, and jointly models the
entities and relations proposed in the competition. We have described our data representation, which
allows modelling discontinuous and overlapping entities in an integrated manner. We also explained
how we interpret and post-process the output of the neural network. The proposed system has won
the competition, achieving the first place in the Main scenario and ranking within the top-3 in the other
three scenarios. Still, further experimentation is required to understand the impact of the network’s
components and how to improve them, which we will explore in future work.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been supported by Vicomtech and partially funded by the project DeepReading
(RTI2018-096846-B-C21, MCIU/AEI/FEDER, UE).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Piad-Morfis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gutiérrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cañizares-Diaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Estevez-Velarde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Almeida-Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Muñoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Montoyo</surname>
          </string-name>
          ,
          <article-title>Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2020, in: Proceedings of the Iberian Languages Evaluation Forum co-located with 36th Conference of the Spanish Society for Natural Language Processing</article-title>
          ,
          <source>IberLEF@SEPLN</source>
          <year>2020</year>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , Attention Is All You Need,
          <source>in: Proceedings of the Thirty-first Conference on Advances in Neural Information Processing Systems (NeurIPS</source>
          <year>2017</year>
          ),
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</article-title>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          , T. Wolf,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          ,
          <source>in: Proceedings of the 5th Workshop on Energy Eficient Machine Learning and Cognitive Computing</source>
          (
          <article-title>EMC2) co-located with the Thirty-third</article-title>
          <source>Conference on Neural Information Processing Systems (NeurIPS</source>
          <year>2019</year>
          ),
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <article-title>Mish: A Self Regularized Non-Monotonic Neural Activation Function</article-title>
          ,
          <source>arXiv:1908.08681</source>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Stenetorp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          , G. Topić,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Tsujii,</surname>
          </string-name>
          <article-title>BRAT: A Web-based Tool for NLP-assisted Text Annotation</article-title>
          ,
          <source>in: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL '12)</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>102</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Macherey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krikun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Macherey</surname>
          </string-name>
          , et al.,
          <article-title>Google's neural machine translation system: Bridging the gap between human and machine translation</article-title>
          ,
          <source>arXiv:1609.08144</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Brew,</surname>
          </string-name>
          <article-title>HuggingFace's Transformers: State-of-the-art Natural Language Processing</article-title>
          , arXiv:
          <year>1910</year>
          .
          <volume>03771</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cañete</surname>
          </string-name>
          , G. Chaperon,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <article-title>Spanish Pre-Trained BERT Model and Evaluation Data</article-title>
          ,
          <source>in: Proceedings of the Practical ML for Developing Countries Workshop</source>
          at the Eighth International Conference on Learning
          <source>Representations (ICLR</source>
          <year>2020</year>
          ),
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          , Decoupled Weight Decay Regularization,
          <source>in: Proceedings of the Seventh International Conference on Learning Representations (ICLR</source>
          <year>2019</year>
          ),
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>López-Ubeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Perea-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Díaz-Galiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Martín-Valdivia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ureña-López</surname>
          </string-name>
          ,
          <source>SINAI at eHealth-KD Challenge</source>
          <year>2020</year>
          :
          <article-title>Combining Word Embeddings for Named Entity Recognition in Spanish Medical Records, in: Proceedings of the Iberian Languages Evaluation Forum co-located with 36th Conference of the Spanish Society for Natural Language Processing</article-title>
          ,
          <source>IberLEF@SEPLN</source>
          <year>2020</year>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Andrés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sainz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Atutxa</surname>
          </string-name>
          , O. Lopez de Lacalle,
          <article-title>IXA-NER-RE at eHealth-KD Challenge 2020: Cross-Lingual Transfer Learning for Medical Relation Extraction</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum co-located with 36th Conference of the Spanish Society for Natural Language Processing, IberLEF@SEPLN 2020</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Medina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Turmo</surname>
          </string-name>
          ,
          <article-title>TALP at eHealth-KD Challenge 2020: Multi-Level Recurrent and Convolutional Neural Networks for Joint Classification of Key-Phrases and Relations</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum co-located with 36th Conference of the Spanish Society for Natural Language Processing, IberLEF@SEPLN 2020</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>