<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Aitor García-Pablos, Naiara Perez and Montse Cuadros</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>SNLT group at Vicomtech Foundation</string-name>
          <email>agarciap@vicomtech.org</email>
          <email>nperez@vicomtech.org</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Basque Research</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Technology Alliance (BRTA)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mikeletegi Pasealekua</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Donostia/San-Sebastián</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Spain</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>489</fpage>
      <lpage>498</lpage>
      <abstract>
        <p>This paper describes the participation of the Vicomtech NLP team in the CANTEMIST shared task, which consists of the automatic assignment of ICD-O-3 tumour morphology codes to health-related documents in Spanish. The submitted systems are based on pre-trained BERT models. The contextual embeddings obtained for each token are used in a multitask sequence-labelling approach that takes advantage of the structure of ICD-O-3 codes. We have experimented with different pre-trained BERT models and combinations, as well as several ensemble structures. The three task tracks (tumour morphology mention recognition, normalisation and document coding) have been approached at the same time, based on the outputs of the proposed models and some post-processing steps. The reported results are robust and the systems perform well across different subsets of data. The official results also indicate that the ensemble models outperform individual models.</p>
      </abstract>
      <kwd-group>
        <kwd>Clinical Text Coding</kwd>
        <kwd>ICD-O-3</kwd>
        <kwd>Oncology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>• NER: automatically finding tumour morphology mentions.
• NORM: NER plus assigning to each recognised mention its corresponding ICD-O-3 code.
• CODING: suggesting a ranked list of ICD-O-3 codes per document.</p>
      <p>
        The CANTEMIST gold standard corpus consists of manually annotated clinical cases in BRAT
standoff format [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], sourced from the SPACCC corpus. A total of 501 and 500 clinical cases have been made
available for training and development purposes, respectively. The development data is split into 2
sets of 250 documents each. The test dataset consists of 300 unlabelled clinical cases that come mixed
within a background set of 5,323 documents, to hinder manual revision of the predicted labels for the
competition. Detailed information about CANTEMIST, including a description of the corpus,
the annotation guidelines and evaluation metrics, is provided in the shared task overview article [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
and website.
      </p>
      <p>
        The Vicomtech team has submitted multiple systems to all CANTEMIST tracks. The systems have
been developed with state-of-the-art deep learning architectures, featuring different BERT-flavoured
embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The final submitted systems consist of voting ensemble models.
      </p>
      <p>The paper is organised as follows: Section 2 provides a detailed explanation of the submitted
systems’ architectures and training setups; Section 3 presents the results obtained in the different task
tracks and provides a preliminary error analysis; Section 4 poses several open questions; finally,
Section 5 outlines our main conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. System Description</title>
      <p>The submitted systems are designed primarily to solve the NORM track, i.e., to detect text spans
mentioning tumour morphologies and to assign valid ICD-O-3 codes to them. Since the NER track
is contained within the NORM track, solving NORM implies solving NER. In addition, we use the
ICD-O-3 codes obtained from the NORM track as candidates for the CODING track, after some
post-processing. In summary, we address the three CANTEMIST tracks with the same models, which we
describe in the following sections.</p>
      <sec id="sec-2-1">
        <title>2.1. Data representation</title>
        <p>The CANTEMIST datasets come in BRAT format. This format consists of plain text files paired with
annotation files that indicate the character spans of each tumour morphology mention and their
corresponding gold ICD-O-3 codes. In what follows, we explain how we transform these datasets to solve
the proposed problem.</p>
        <sec id="sec-2-1-1">
          <title>2.1.1. Document segmentation</title>
          <p>The CANTEMIST corpus contains documents longer than 512 tokens, the maximum allowed by BERT.
A common fix to perform sequence-labelling tasks on long documents is to define a more granular
processing unit, such as sentences. A sentence is likely to fit within 512 tokens, so the task can be
performed without cropping any potentially relevant part of an input document. Yet this approach
poses several risks: a) sentence splitters may introduce errors, b) isolated sentences may lack relevant
information for the target task, and c) unbalanced sentence lengths may lead to an inefficient use of
the computational resources.</p>
          <p>In an attempt to overcome these problems, we have opted for a sliding-window approach,
depicted broadly in Figure 1: after each document is tokenised with a pre-trained BERT tokeniser, the
sequence of subwords is split into windows of a fixed length W. Then, surrounding contexts of
size C are prepended and appended, padding as necessary in order to obtain subsequences of size
C + W + C. Finally, BERT’s [CLS] and [SEP] tokens are added to each subsequence. A mask
indicates which sequence positions are part of the window and which ones form the context. Both
context and window positions are attended to when building the BERT contextual embeddings, but the loss
function is only calculated for the positions inside the window. We have chosen W = 300 and C = 100,
resulting in sequences of 502 tokens.</p>
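          <p>The segmentation described above can be sketched as follows. This is a minimal illustration rather than the submitted implementation: the function name sliding_windows, its parameters and the convention of marking window positions with 1 in the mask are assumptions made for the example.</p>
          <preformat>
```python
from typing import List, Tuple

PAD, CLS, SEP = "[PAD]", "[CLS]", "[SEP]"


def sliding_windows(tokens: List[str], window: int = 300, context: int = 100
                    ) -> List[Tuple[List[str], List[int]]]:
    """Split subword tokens into [CLS] + C + W + C + [SEP] subsequences.

    Returns (subsequence, mask) pairs. The mask is 1 for real tokens inside
    the window and 0 for context, padding and special tokens.
    """
    results = []
    n = len(tokens)
    for start in range(0, n, window):
        seq, mask = [CLS], [0]
        # Cover the left context, the window and the right context.
        for pos in range(start - context, start + window + context):
            inside_doc = pos in range(n)  # False before and after the document
            inside_win = pos in range(start, start + window)
            seq.append(tokens[pos] if inside_doc else PAD)
            mask.append(1 if inside_doc and inside_win else 0)
        seq.append(SEP)
        mask.append(0)
        results.append((seq, mask))
    return results
```
          </preformat>
          <p>With the default values, every subsequence has 100 + 300 + 100 + 2 = 502 positions, and each document token is covered by exactly one window.</p>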
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Classification objectives</title>
          <p>
            ICD-O-3 morphology codes have a very specific structure [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] (see Figure 2): they consist of at least
5 digits, where the first four digits indicate the tumour or cell type and the fifth digit indicates the
behaviour of the tumour; an optional digit codes histologic grading or differentiation. In addition,
CANTEMIST annotators introduced a task-specific code extension: /H. It is used when ICD-O-3 does
not offer a code specific enough for the tumour morphology mention being coded.</p>
          <fig id="fig-1">
            <label>Figure 1</label>
            <caption><p>Sliding-window segmentation: the original token sequence (t1 ... tn, padded with [PAD]) is split into subsequences 1 to X, each of the form [CLS] + C, W, C + [SEP].</p></caption>
          </fig>
          <fig id="fig-2">
            <label>Figure 2</label>
            <caption><p>Structure of an ICD-O-3 morphology code. Panel labels: General tumour type, Behaviour, CANTEMIST /H modification (optional).</p></caption>
          </fig>
          <table-wrap id="tab-1">
            <label>Table 1</label>
            <caption><p>Number of different values that each ICD-O-3 code position can take (first column), with example codes and their descriptions.</p></caption>
            <table>
              <tbody>
                <tr><td>189</td><td><bold>868</bold> to <bold>871</bold></td><td><bold>Paragangliomas and glomus tumours</bold></td></tr>
                <tr><td>10</td><td>871<bold>1</bold></td><td><bold>Glomus tumour</bold></td></tr>
                <tr><td>6</td><td>8711/<bold>3</bold></td><td><bold>Malignant</bold> glomus tumour</td></tr>
                <tr><td>9</td><td>8711/3<bold>1</bold></td><td>Malignant glomus tumour, <bold>differentiated</bold></td></tr>
                <tr><td>1</td><td>8711/3<bold>/H</bold></td><td>Malignant glomus tumour, <bold>with uncoded modifier</bold></td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>The current ICD-O-3 version describes 4,205 codes. However, because of its multi-axial nature,
new well-formed codes can be composed if necessary following the aforementioned convention. In
CANTEMIST, a total of 58,062 codes are considered valid. Table 1 shows how many different
values each code position can take and provides some examples.</p>
          <p>Based on these facts, we have approached the task as a multitask sequence-labelling problem. The
ICD-O-3 codes have been split into several pieces, each piece comprising a classification objective.
After preliminary experimentation, the selected classification objectives are:
a) the first 3 digits of the code,
b) the fourth digit, and
c) the Behaviour and Grade digits, and the H indicator, as a single variable.</p>
          <p>If a token is not part of a tumour morphology mention, the label O (from “Out”) should be predicted
by the three classifiers. We henceforth refer to them as 3Ds, 4D and BGH.</p>
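          <p>As an illustration of how a code maps to the three labels 3Ds, 4D and BGH, the following sketch splits and recomposes codes such as 8711/3, 8711/31 or 8711/3/H. The regular expression and the function names split_code and compose_code are assumptions made for this example; in particular, the exact string used as the BGH label is not specified in the text.</p>
          <preformat>
```python
import re
from typing import Tuple

# Four morphology digits, a slash, the behaviour digit, an optional grade
# digit and the optional CANTEMIST-specific /H extension (assumed layout).
CODE_PATTERN = re.compile(r"(\d{3})(\d)/(\d\d?(?:/H)?)")


def split_code(code: str) -> Tuple[str, str, str]:
    """Return the (3Ds, 4D, BGH) labels of a well-formed morphology code."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a well-formed morphology code: {code!r}")
    return match.group(1), match.group(2), match.group(3)


def compose_code(three_digits: str, fourth_digit: str, bgh: str) -> str:
    """Rebuild a full code from the three predicted pieces."""
    return f"{three_digits}{fourth_digit}/{bgh}"
```
          </preformat>
          <p>Because the pieces are predicted separately, compose_code can produce a well-formed code whose exact combination never appeared in the training data.</p>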
          <p>An additional classification objective, since this is in essence a sequence-labelling task, is:
d) the BIO tag.</p>
          <p>
            The BIO tag [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] indicates whether a token is the first element of a tumour morphology mention (B-,
“Begin”), whether it is inside a mention (I-, “In”), or whether it is not part of a mention at all (O, “Out”).
Although it does not convey ICD-O-3-related information, the BIO tag is an additional signal of whether
a token is part of a mention or not, and it helps discern between contiguous mentions.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Architecture</title>
        <p>
          The submitted systems are built on the Transformer [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] architecture, specifically BERT [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. In short, they consist of pre-trained BERT models with several classification layers on top.
        </p>
        <p>We have tested two approaches, one of which is the continuation of the other. We henceforth refer
to them as the baseline approach and the two-experts approach. The latter is an experiment to assess
whether two sources of knowledge can be fused to collaborate and improve the results they would
obtain on their own. There are different ways of combining two models into a bigger model; in this
work, we have chained one after the other.</p>
        <p>A high-level diagram of the baseline and two-experts approach is shown in Figure 3: Both
approaches start by passing the prepared tokens to a BERT model. In the two-experts approach, the
output of the last layer is fed to a second BERT model as pre-computed embeddings. The output
of the second model’s last layer is then processed by a dropout layer. In the baseline approach, the
output of the first model is directly passed to the dropout layer. After dropout, the token
representations are passed to 4 independent linear transformation layers, which output the logits for the 4
output variables described earlier (see Section 2.1.2). That is, all the objectives are trained jointly in a
single model that has several classification heads. All of them rely on the same per-token contextual
embedding obtained from a pre-trained BERT model.</p>
        <p>In training, the back-propagated error is the sum of the cross-entropy losses of the 4 outputs.
BERT’s special tokens and context tokens do not participate in the computation of the loss. That
is, while the BERT models do attend to all positions, they only learn from the gold labels in each
sequence’s window, not from its context (see Section 2.1.1). This helps avoid an “edge bias” near the
arbitrary start/end of the input.</p>
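        <p>The masked, summed loss can be illustrated with a framework-free sketch. The names cross_entropy and multitask_loss are hypothetical, and averaging each head's loss over the window positions before summing the heads is an assumption; the text only states that the four cross-entropy losses are summed and that context and special tokens are excluded.</p>
        <preformat>
```python
import math
from typing import Dict, List


def cross_entropy(logits: List[float], gold: int) -> float:
    """Negative log-probability of the gold class under softmax(logits)."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_norm - logits[gold]


def multitask_loss(logits: Dict[str, List[List[float]]],
                   gold: Dict[str, List[int]],
                   window_mask: List[int]) -> float:
    """Sum over heads of the mean cross-entropy on window positions only."""
    total = 0.0
    for head, head_logits in logits.items():
        losses = [cross_entropy(row, label)
                  for row, label, keep in zip(head_logits, gold[head], window_mask)
                  if keep]  # context and special tokens are skipped
        total += sum(losses) / max(len(losses), 1)
    return total
```
        </preformat>
        <p>Positions with a mask value of 0 contribute nothing to the loss, whatever the model predicts for them.</p>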
        <p>For inference, the label with the maximum probability is chosen for each token and variable after
applying the softmax function to the logits. Then, the outputs of the sliding windows are concatenated,
and BERT’s special tokens and context tokens ruled out, in order to obtain the original sequence of
tokens and their corresponding predictions. In the case of tokens split into subwords by the tokeniser,
the predictions corresponding to the first subword are used as predictions for the whole token.</p>
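        <p>The inference rule for one classification head can be sketched as below. The helper names softmax and predict_words are hypothetical, and the sketch assumes the WordPiece convention in which continuation subwords start with "##".</p>
        <preformat>
```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_words(subwords: List[str], logits: List[List[float]],
                  labels: List[str]) -> List[Tuple[str, str, float]]:
    """Return (word, label, probability), keeping the first subword's prediction."""
    words = []
    for piece, row in zip(subwords, logits):
        if piece.startswith("##") and words:
            word, label, prob = words[-1]
            # Continuation subword: extend the word, keep the earlier prediction.
            words[-1] = (word + piece[2:], label, prob)
            continue
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        words.append((piece, labels[best], probs[best]))
    return words
```
        </preformat>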
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Output interpretation</title>
        <p>The implemented systems output 4 predictions per token, which correspond to the BIO tag, the first
3 digits of an ICD-O-3 code, the fourth digit, and the Behaviour, Grade and /H positions. This output
must be interpreted and transformed to BRAT’s span-based format, where each tumour morphology
mention detected, whether a single token or multiple, is associated with a valid ICD-O-3 code. The
post-processing consists of two main steps:</p>
        <p>First, if any of the classifiers 3Ds, 4D or BGH predicted the tag O, O is assigned to the token; it
is not part of a tumour morphology mention. Otherwise, an ICD-O-3 code is composed from the
predictions, prefixed with the corresponding BIO tag (see examples on the right-hand side of Figure
3). A probability is assigned to the newly created code, defined as the product of the probabilities
emitted by the classifiers 3Ds, 4D and BGH.</p>
        <fig id="fig-3">
          <label>Figure 3</label>
          <caption><p>High-level diagram of the baseline and two-experts approaches. Recoverable labels: input tokens ([CLS] Sar ##coma de Ewing con ... [SEP]), input embeddings (S x 768), Expert A BERT encoder, Expert B BERT encoder, dropout, and softmax outputs for the classification heads (for example, 926 for 3Ds and 0 for 4D on the tokens of the mention, and O elsewhere).</p></caption>
        </fig>
        <p>Then, the token-based annotations are translated to span-based annotations with the help of the
BIO tags. In the case of single-word mentions, the ICD-O-3 code assigned to the mention is simply
the ICD-O-3 code of the corresponding token. In the case of multi-word mentions, the ICD-O-3 code
chosen is the code with maximum average probability among the codes of the tokens that participate
in the mention.</p>
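        <p>The two post-processing steps can be sketched as follows. The TokenPrediction structure and the function names are hypothetical. The sketch also assumes that a token with a composed code but without an I tag opens a new mention, and that "maximum average probability" means averaging the probabilities of the tokens that share a code within the mention.</p>
        <preformat>
```python
from collections import defaultdict
from typing import List, NamedTuple, Optional, Tuple


class TokenPrediction(NamedTuple):
    bio: str      # "B", "I" or "O"
    three: str    # first three digits, or "O"
    fourth: str   # fourth digit, or "O"
    bgh: str      # behaviour, grade and /H part, or "O"
    prob: float   # product of the 3Ds, 4D and BGH probabilities


def token_code(pred: TokenPrediction) -> Optional[str]:
    """Step 1: any O among 3Ds, 4D and BGH makes the whole token O."""
    if "O" in (pred.three, pred.fourth, pred.bgh):
        return None
    return f"{pred.three}{pred.fourth}/{pred.bgh}"


def decode_spans(preds: List[TokenPrediction]) -> List[Tuple[int, int, str]]:
    """Step 2: return (first_token, last_token, code), indices inclusive."""
    spans, current = [], []  # current: (index, code, prob) of the open mention

    def close() -> None:
        if current:
            by_code = defaultdict(list)
            for _, code, prob in current:
                by_code[code].append(prob)
            best = max(by_code, key=lambda c: sum(by_code[c]) / len(by_code[c]))
            spans.append((current[0][0], current[-1][0], best))
            current.clear()

    for index, pred in enumerate(preds):
        code = token_code(pred)
        if code is None:
            close()
            continue
        if pred.bio != "I":  # anything other than I starts a new mention
            close()
        current.append((index, code, pred.prob))
    close()
    return spans
```
        </preformat>
        <p>Token indices would still need to be mapped back to character offsets to write BRAT annotation files; that step is omitted here.</p>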
        <p>As a result, we obtain outputs for the NER and NORM tracks. To produce outputs for the CODING
track, where a ranked list of ICD-O-3 codes is expected per document, we simply take the set of
predicted codes and order them by their assigned probability.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Training setup and submitted systems</title>
        <p>
          In the earlier phases of this work, we experimented with several publicly available pre-trained BERT
models, namely, BERT-base Multilingual Cased<sup>3</sup>, BETO [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], SciBERT [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], Clinical BERT [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and
BioBERT [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The latter three have been pre-trained with English text of the health and
biomedical domains. The best results on the official development sets were achieved by BETO and SciBERT.
Thus, the submitted systems use BETO, SciBERT, or both.
        </p>
        <p>Early experimentation also showed considerable differences in performance between the two
development datasets. In order to leverage all the available data, we have trained several systems with
different data splits and combined their predictions in voting ensembles.</p>
        <sec id="sec-2-3-1">
          <title>2.4.1. Voting ensembles</title>
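          <p>Let <inline-formula><tex-math>D_1</tex-math></inline-formula> and <inline-formula><tex-math>D_2</tex-math></inline-formula> be the two official development sets provided by the task organisers. Let <inline-formula><tex-math>D_3</tex-math></inline-formula> be a third
development set randomly sampled from the official training set <inline-formula><tex-math>T</tex-math></inline-formula>, and <inline-formula><tex-math>T'</tex-math></inline-formula> the remaining data of the
training set, so <inline-formula><tex-math>T = T' \cup D_3</tex-math></inline-formula>. We have trained 3 versions of each model, setting aside one development
set each time, so for each rotation the training data split is <inline-formula><tex-math>\mathrm{train}_i = T' \cup D_j \cup D_k</tex-math></inline-formula> and the development
set is <inline-formula><tex-math>\mathrm{dev}_i = D_i</tex-math></inline-formula>, with <inline-formula><tex-math>i, j, k \in \{1, 2, 3\}</tex-math></inline-formula> and <inline-formula><tex-math>i \neq j \neq k</tex-math></inline-formula> (all three distinct).</p>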
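          <p>The rotation scheme can be sketched as follows. The function name rotations and the use of document identifiers as plain strings are assumptions made for the example.</p>
          <preformat>
```python
from typing import Dict, List, Tuple


def rotations(remaining_train: List[str], dev_sets: Dict[str, List[str]]
              ) -> List[Tuple[str, List[str], List[str]]]:
    """Return (held_out_name, train_docs, dev_docs) for each development set."""
    splits = []
    for held_out, dev_docs in dev_sets.items():
        train_docs = list(remaining_train)
        for name, docs in dev_sets.items():
            if name != held_out:
                # The other two development sets join the training data.
                train_docs.extend(docs)
        splits.append((held_out, train_docs, list(dev_docs)))
    return splits
```
          </preformat>
          <p>With three development sets this yields three splits per model type, which matches the three versions of each model described above.</p>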
          <p>The model ensembles are obtained via token-wise soft voting, prior to transforming the standalone
predictions to BRAT’s character-span-based format: the full ICD-O-3 code for each token and voting
system is built from its predicted components as explained in Section 2.3; then, the vote of each system
is weighted by the probability of the code, the probability being the product of the probabilities
given by the classifiers 3Ds, 4D, and BGH. Finally, the BRAT files and the CODING track outputs are
generated as if the predictions came from a single system.</p>
          <p><sup>3</sup> https://github.com/google-research/bert/blob/master/multilingual.md</p>
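          <p>A minimal sketch of token-wise soft voting is given below. The name soft_vote is hypothetical. Two details are assumptions, since the text does not specify them: O votes are weighted in the same way as code votes, and the ensemble reports the winner's summed weight divided by the number of voters as its probability.</p>
          <preformat>
```python
from collections import defaultdict
from typing import List, Tuple

Vote = Tuple[str, float]  # (full ICD-O-3 code or "O", probability)


def soft_vote(system_outputs: List[List[Vote]]) -> List[Vote]:
    """Combine the per-token votes of several systems over the same tokens."""
    combined = []
    for votes in zip(*system_outputs):  # one tuple of votes per token
        weights = defaultdict(float)
        for label, prob in votes:
            weights[label] += prob  # each vote counts as much as its probability
        winner = max(weights, key=weights.get)
        combined.append((winner, weights[winner] / len(votes)))
    return combined
```
          </preformat>
          <p>Under these assumptions, a flat ensemble corresponds to a single call to soft_vote over all standalone models, whereas a 2-step ensemble calls soft_vote once per model type and then once more over the three results.</p>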
          <p>The final submitted systems are the following:
• S1: an ensemble of 3 BETO-based baseline models
• S2: an ensemble of 3 SciBERT-based baseline models
• S3: an ensemble of 3 Two Experts models, with BETO as the first expert and SciBERT as the second
• S4: an ensemble of the prior 9 models, henceforth the flat ensemble
• S5: an ensemble of S1, S2 and S3, henceforth the 2-step ensemble</p>
          <p>Both ensembles S4 and S5 take advantage of all 9 trained standalone models, but the former
performs the voting with the 9 outcomes at the same time, while the latter calculates the votes in 2
consecutive rounds: it first calculates S1, S2 and S3, then uses their results to vote a second time.</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>2.4.2. Hyperparameters and other implementation details</title>
          <p>
            The models and all the auxiliary modules, helpers and functions are mainly
written in Python 3.7 with HuggingFace’s Transformers library [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
          </p>
          <p>During training, the base learning rate was 2E-5 with a linear warm-up scheduling that reaches
its maximum during the first 5,000 iterations. The training of all models was limited to a maximum
of 200 epochs with an early-stopping patience of 50 epochs (i.e., the training was stopped after 50
consecutive epochs without improvement). In most of the cases, the early-stopping was triggered
before reaching the maximum allowed epochs. The dropout rate was the same as that used in the pre-trained
BERT-base models: 0.1. The batch size for the baseline models was set to 6, while for the two-experts
variants it was set to 4, in both cases because it was the largest possible batch that fit in memory on a
single GPU. The training has been run on a single Nvidia RTX 2080 GPU with 11GB of RAM. Training
times vary depending on when the early-stopping condition is met, but all of them have fallen within
a range of a few hours.</p>
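          <p>The warm-up schedule can be written as a one-line function. This is only a sketch: the name learning_rate is hypothetical, and keeping the rate constant after the warm-up is an assumption, since the text does not say how the rate evolves after the first 5,000 iterations.</p>
          <preformat>
```python
def learning_rate(step: int, base_lr: float = 2e-5, warmup_steps: int = 5000) -> float:
    """Linear warm-up from 0 to base_lr, then constant (constant part assumed)."""
    return base_lr * min(step, warmup_steps) / warmup_steps
```
          </preformat>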
          <p>For inference, we have used a much larger batch size of 128, because the memory requirements
are lower when no gradients are computed. The window and context sizes for the sliding
windows have been kept the same: W = 300 and C = 100. The inference speed on GPU exceeds
8,000 tokens/second<sup>4</sup>, which for this task is equivalent to processing about 10 documents per second.
With these settings, the 5,323 background documents of the competition have been processed in 7-8
minutes.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p><sup>4</sup> Although training is impractical without a GPU, inference can be performed on CPU, achieving a throughput of
about 800 tokens/second.</p>
      <p>[Results table: precision (P), recall (R) and F1 columns for the NER and NORM tracks.]</p>
      <p>The results obtained by our models vary among the development sets, but remain quite consistent. The
results for the test set are higher, probably due to the effect of the ensembles. Precision and recall are
evenly balanced for all the tested systems.</p>
      <p>The best performing system, the 2-step ensemble, obtains 86.97 F1-score in NER, 82.14 F1-score in
NORM and 84.68 Mean Average Precision (MAP) in CODING. Overall, the systems surpass the scores
83.00, 75.00 and 76.00, respectively, by a large margin. The ensembles of the 9 model variants (3 per
model type) work noticeably better than the ensembles of a single model type. The 2-step ensemble
works even better than the flat ensemble for all the tracks, in particular for CODING, where the
difference between the flat and 2-step ensemble is almost 0.5 MAP points.</p>
      <p>With respect to BETO and SciBERT, the former performs only marginally better in NER and NORM,
but it obtains consistently better MAP in CODING. The two-experts approach has not resulted
in a performance improvement.</p>
      <p>A quantitative analysis of the errors committed by the submitted systems is provided in Table 3.
Again, we observe similar trends among the systems. SciBERT seems to yield more annotations,
both spurious and correct, than BETO; Two Experts produces fewer annotations than BETO or SciBERT
alone. Meanwhile, the flat and 2-step ensembles miss fewer annotations, make fewer spurious predictions,
and produce more exact matches.</p>
      <p>In general, when a mention span is matched exactly (which happens about 80% of the time on average),
the code given is correct with a probability above 93%. The chances drop to around 35% with
overlapping spans. In either case, when an incorrect code is proposed, the error is more likely to be
found in the first four digits of the code than in the behaviour (B), grading (G) or /H position.</p>
      <p>While 58,062 codes are considered valid in CANTEMIST, only 746 of them actually occur in the
training and development data provided. Our systems are capable, to an extent, of producing
ICD-O-3 codes that they have not seen in the training data. This is possible on account of the multi-task
approach. Still, our systems fail to generate correct unseen codes much more often than they succeed,
even more so when the mention span has not been matched exactly.</p>
      <p>The bulk of missed mentions are not mentions pertaining to unseen codes, but mentions that do
occur and are even very frequent in the training and development datasets. This phenomenon requires
further analysis to be better explained and addressed.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>The systems presented rely mainly on the semantic representation capabilities derived from the BERT
architecture and the knowledge captured by their own pre-training. The results are seemingly good
(other participants’ results are unknown to us at the time of writing), but there is still room for
improvement. We pose the following open questions for discussion:
1. Our approach does not leverage information associated with ICD-O-3 codes (code descriptions,
definitions, and so on) nor any other hand-crafted knowledge source, which could improve the
results obtained by helping produce representations for ICD-O-3 codes not seen in the training
data.
2. Regarding BioBERT, Clinical BERT and SciBERT: we hypothesise that SciBERT has
outperformed BioBERT and Clinical BERT in our experiments because it has been trained from scratch
with its own vocabulary, better suited to the health domain.
3. In the same line, it may come as a surprise that BETO and SciBERT obtain similar results,
when SciBERT has only been trained with texts in English. We hypothesise that because the
terminology of the health domain is mainly constructed from Greek and Latin roots and affixes
both in English and in Spanish, the WordPiece strategy and the domain-specific vocabulary of
SciBERT play to its advantage in this case.
4. The two previous points indicate that a Spanish Clinical BERT may lead to better results still.
5. The combination of BETO and SciBERT, in the manner explained in this paper, does not seem
to be beneficial in this task, having obtained slightly worse results than the standalone models.
Many other ways exist in which the two models could be combined, so further experimentation
in this line might be of interest.
6. While the flat and 2-step ensemble models show performance gains in comparison to the simpler
models, it is questionable whether such a system would be viable in a real-world scenario.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In these working notes we have described our participation in the CANTEMIST shared task. Our
end-to-end deep-learning-based system relies on pre-trained BERT models as the base for the semantic
representation of the texts. With these semantic representations, the ICD-O-3 codes are calculated for
each token in a sequence-labelling fashion, and this information is used to address the three
competition tracks (namely, NER, NORM and CODING) at the same time. We have described how we have
preprocessed and represented the information, and how we have performed rotating training runs to
leverage all the available data (i.e., the official training set and the two official development sets). We
have submitted results of ensemble models trained on different views of the data.</p>
      <p>Both our experiments and the official evaluation show robust results in different subsets of data.
According to these results, the ensembles do provide a performance advantage, with a two-step
ensemble outperforming a flat ensemble. We have also found that BETO and SciBERT obtain comparable
results in this particular task, but the proposed combination of both has not resulted in better scores.</p>
      <p>As future work, the models may benefit from a mechanism to inject the semantics of ICD-O-3 codes to
enhance their capability to match codes that have not been seen during the training phase. Further
experimentation on the combination of several pre-trained models would also be helpful for scenarios
where each model brings some useful knowledge to the task, and there is not a single pre-trained
model that suits the task better.</p>
      <p>This work has been partially funded by the project DeepReading (RTI2018-096846-B-C21, MCIU / AEI
/ FEDER, UE).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Stenetorp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          , G. Topić,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Tsujii,</surname>
          </string-name>
          <article-title>BRAT: A Web-based Tool for NLP-assisted Text Annotation</article-title>
          ,
          <source>in: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL '12)</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>102</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Miranda-Escalada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Farré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <article-title>Named entity recognition, concept normalization and clinical coding: Overview of the CANTEMIST track for cancer text mining in Spanish, Corpus, Guidelines, Methods and Results</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2020</year>
          ),
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</article-title>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>World</given-names>
            <surname>Health</surname>
          </string-name>
          <string-name>
            <surname>Organization</surname>
          </string-name>
          ,
          <article-title>International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3</article-title>
          ),
          <year>2015</year>
          . URL: https://www.who.int/classifications/icd/adaptations/oncology/en/, accessed:
          <fpage>24</fpage>
          -
          <lpage>07</lpage>
          -
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ramshaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <article-title>Text Chunking Using Transformation-based Learning</article-title>
          , in: S. Armstrong,
          <string-name>
            <given-names>K.</given-names>
            <surname>Church</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Isabelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tzoukermann</surname>
          </string-name>
          , D. Yarowsky (Eds.),
          <source>Natural Language Processing Using Very Large Corpora</source>
          , Springer Netherlands,
          <year>1999</year>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention Is All You Need</article-title>
          ,
          <source>in: Proceedings of the Thirty-first Conference on Advances in Neural Information Processing Systems (NeurIPS 2017)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cañete</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chaperon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <article-title>Spanish Pre-Trained BERT Model and Evaluation Data</article-title>
          ,
          <source>in: Proceedings of the Practical ML for Developing Countries Workshop at the Eighth International Conference on Learning Representations (ICLR 2020)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <article-title>SciBERT: A Pretrained Language Model for Scientific Text</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3615</fpage>
          -
          <lpage>3620</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Alsentzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Boag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-H.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McDermott</surname>
          </string-name>
          ,
          <article-title>Publicly Available Clinical BERT Embeddings</article-title>
          ,
          <source>in: Proceedings of the 2nd Clinical Natural Language Processing Workshop (ClinicalNLP 2019)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>72</fpage>
          -
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <year>2019</year>
          )
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brew</surname>
          </string-name>
          ,
          <article-title>HuggingFace's Transformers: State-of-the-art Natural Language Processing</article-title>
          , arXiv:1910.03771 (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>