Vicomtech at CANTEMIST 2020

Aitor García-Pablos, Naiara Perez and Montse Cuadros
SNLT group at Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Mikeletegi Pasealekua 57, Donostia/San-Sebastián, 20009, Spain
email: agarciap@vicomtech.org (A. García-Pablos); nperez@vicomtech.org (N. Perez); mcuadros@vicomtch.org (M. Cuadros)
orcid: 0000-0001-9882-7521 (A. García-Pablos); 0000-0001-8648-0428 (N. Perez); 0000-0002-3620-1053 (M. Cuadros)
Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)

Abstract

This paper describes the participation of the Vicomtech NLP team in the CANTEMIST shared task, which consists of the automatic assignment of ICD-O-3 tumour morphology codes to health-related documents in Spanish. The submitted systems are based on pre-trained BERT models. The contextual embeddings obtained for each token are used in a multitask sequence-labelling approach that takes advantage of the structure of ICD-O-3 codes. We have experimented with different pre-trained BERT models and combinations, as well as several ensemble structures. The three task tracks—tumour morphology mention recognition, normalisation and document coding—have been approached at the same time, based on the outputs of the proposed models and some post-processing steps. The reported results are robust and perform well across different subsets of data. The official results also indicate that the ensemble models outperform individual models.

Keywords: Clinical Text Coding, ICD-O-3, Oncology

1. Introduction

These working notes describe Vicomtech's participation in CANTEMIST: CANcer TExt MIning Shared Task - tumor named entity recognition. CANTEMIST is the first shared task focused on tumor morphology mining and coding in Spanish text with the International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3). The task consists of three independent tracks:

• NER: automatically finding tumour morphology mentions.
• NORM: NER + assigning to each recognised mention its corresponding ICD-O-3 code.
• CODING: suggesting a ranked list of ICD-O-3 codes per document.

The CANTEMIST gold standard corpus consists of manually annotated clinical cases in BRAT standoff format [1], sourced from the SPACCC corpus (https://github.com/PlanTL-SANIDAD/SPACCC). 501 and 500 clinical cases have been made available for training and development purposes, respectively. The development data is split into 2 sets of 250 documents each. The test dataset consists of 300 unlabelled clinical cases mixed within a background set of 5,323 documents, in order to hinder manual revision of the predicted labels during the competition. Detailed information about CANTEMIST, including a detailed description of the corpus, the annotation guidelines and the evaluation metrics, is provided in the shared task overview article [2] and on the task website (https://temu.bsc.es/cantemist).

The Vicomtech team has submitted multiple systems to all CANTEMIST tracks. The systems have been developed with state-of-the-art deep learning architectures, featuring different BERT-flavoured embeddings [3]. The final submitted systems consist of voting ensemble models.
The paper is organised as follows: Section 2 provides a detailed explanation of the submitted systems' architectures and training setups; Section 3 presents the results obtained in the different task tracks and provides a preliminary error analysis; Section 4 poses several open questions; finally, Section 5 outlines our main conclusions.

2. System Description

The submitted systems are designed primarily to solve the NORM track, i.e., to detect text spans mentioning tumour morphologies and to assign valid ICD-O-3 codes to them. Since the NER track is contained within the NORM track, solving NORM implies solving NER. In addition, we use the ICD-O-3 codes obtained from the NORM track as candidates for the CODING track, after some post-processing. In summary, we address the three CANTEMIST tracks with the same models, which we describe in the following sections.

2.1. Data representation

The CANTEMIST datasets come in BRAT format. This format consists of plain text files paired with annotation files that indicate the character spans of each tumour morphology mention and their corresponding gold ICD-O-3 codes. In what follows, we explain how we transform these datasets to solve the proposed problem.

2.1.1. Document segmentation

The CANTEMIST corpus contains documents longer than 512 tokens, the maximum sequence length allowed by BERT. A common fix for performing sequence-labelling tasks on long documents is to define a more granular processing unit, such as the sentence. A sentence is likely to fit within 512 tokens, so the task can be performed without cropping any potentially relevant part of an input document. Yet this approach poses several risks: a) sentence splitters may introduce errors, b) isolated sentences may lack relevant information for the target task, and c) unbalanced sentence lengths may lead to an inefficient use of the computational resources.

In an attempt to overcome these problems, we have opted for a sliding-windows approach, depicted broadly in Figure 1: after each document is tokenised with a pre-trained BERT tokeniser, the sequences of subwords are split into windows of a fixed length 𝑊. Then, surrounding contexts of size 𝐶 are prepended and appended, padding as necessary in order to obtain subsequences of size 𝐶 + 𝑊 + 𝐶. Finally, BERT's [CLS] and [SEP] tokens are added to each subsequence. A mask indicates which sequence positions are part of the window and which ones form the context. Both context and window positions are attended to when building the BERT contextual embeddings, but the loss function is only calculated for the positions inside the window. We have chosen 𝑊 = 300 and 𝐶 = 100, resulting in sequences of 502 tokens.

Figure 1: Segmentation of documents into subsequences for BERT with the sliding windows technique
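To make the segmentation concrete, the following sketch reproduces the window and context splitting with the sizes given above. It is a minimal illustration assuming a HuggingFace BERT tokeniser; the function name, the return format and the padding details are ours and do not necessarily match the actual implementation.

```python
# Minimal sketch of the sliding-window segmentation (Section 2.1.1). It assumes a
# HuggingFace BERT tokeniser; names and padding strategy are illustrative only.
from transformers import BertTokenizer

W, C = 300, 100  # window and context sizes used in this work

def sliding_windows(token_ids, pad_id, w=W, c=C):
    """Split a tokenised document into subsequences of length c + w + c.

    Returns (ids, loss_mask) pairs; loss_mask is 1 for window positions (the only
    ones that contribute to the loss) and 0 for context and padding positions.
    """
    subsequences = []
    for start in range(0, len(token_ids), w):
        window = token_ids[start:start + w]
        left = token_ids[max(0, start - c):start]
        right = token_ids[start + w:start + w + c]
        # Pad the contexts (and a possibly short final window) to the fixed length
        left = [pad_id] * (c - len(left)) + left
        right = right + [pad_id] * (c - len(right))
        ids = left + window + [pad_id] * (w - len(window)) + right
        mask = [0] * c + [1] * len(window) + [0] * (w - len(window)) + [0] * c
        subsequences.append((ids, mask))
    return subsequences

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
ids = tokenizer.encode("Texto de un caso clínico ...", add_special_tokens=False)
for ids_w, mask in sliding_windows(ids, tokenizer.pad_token_id):
    # [CLS] and [SEP] are added around each subsequence, giving 502 positions in total;
    # like the context positions, they are excluded from the loss computation.
    model_input = [tokenizer.cls_token_id] + ids_w + [tokenizer.sep_token_id]
```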
2.1.2. Classification objectives

ICD-O-3 morphology codes have a very specific structure [4] (see Figure 2): they consist of at least 5 digits, where the first four digits indicate the tumour or cell type and the fifth digit indicates the behaviour of the tumour; an optional digit codes histologic grading or differentiation. In addition, CANTEMIST annotators introduced a task-specific code extension: /H. It is used when ICD-O-3 does not offer a code specific enough for the tumour morphology mention being coded.

Figure 2: ICD-O-3 code structure (general tumour type, specific tumour type, behaviour, optional grade, and the optional CANTEMIST-specific /H modification)

The current ICD-O-3 version describes 4,205 codes. However, because of its multi-axial nature, new well-formed codes can be composed if necessary following the aforementioned convention. In CANTEMIST, a total of 58,062 codes are considered valid. Table 1 shows how many different values each code position can take and provides some examples.

Table 1: Number of different values each code position can take and examples

Code position   Values   Example
3 digits        189      868 to 871: Paragangliomas and glomus tumours
4th digit       10       8711: Glomus tumour
Behaviour       6        8711/3: Malignant glomus tumour
Grade           9        8711/31: Malignant glomus tumour, differentiated
/H              1        8711/3/H: Malignant glomus tumour, with uncoded modifier

Based on these facts, we have approached the task as a multitask sequence-labelling problem. The ICD-O-3 codes have been split into several pieces, each piece comprising a classification objective. After preliminary experimentation, the selected classification objectives are: a) the first 3 digits of the code, b) the fourth digit, and c) the Behaviour and Grade digits and the H indicator, as a single variable. If a token is not part of a tumour morphology mention, the label O (from "Out") should be predicted by the three classifiers. We henceforth refer to them as 3Ds, 4D and BGH. An additional classification objective—since this is, in essence, a sequence-labelling task—is: d) the BIO tag.

The BIO tag [5] indicates whether a token is the first element of a tumour morphology mention (B-, "Begin"), whether it is inside a mention (I-, "In"), or whether it is not part of a mention at all (O, "Out"). Although it does not convey ICD-O-3-related information, the BIO tag is an additional signal of whether a token is part of a mention or not, and it helps discern between contiguous mentions.
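The decomposition of a gold code into the four per-token targets can be illustrated as follows. The snippet is only a sketch: the label strings and the function name are assumptions made for the example, not the labels used by the submitted systems.

```python
# Illustrative decomposition of a gold ICD-O-3 code into the four per-token targets
# (BIO, 3Ds, 4D, BGH) described above. Label strings are assumptions for the example.

def decompose_code(code, first_token):
    """E.g. decompose_code('8711/31/H', True) -> ('B', '871', '1', '/31/H')."""
    if code == "O":                       # token outside any tumour morphology mention
        return ("O", "O", "O", "O")
    digits, rest = code.split("/", 1)     # '8711', '31/H'
    return ("B" if first_token else "I", digits[:3], digits[3], "/" + rest)

# A mention coded 9260/3 spanning three tokens, followed by a token outside any mention:
codes = ["9260/3", "9260/3", "9260/3", "O"]
labels = [decompose_code(c, i == 0) for i, c in enumerate(codes)]
# -> [('B', '926', '0', '/3'), ('I', '926', '0', '/3'), ('I', '926', '0', '/3'),
#     ('O', 'O', 'O', 'O')]
```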
2.2. Architecture

The submitted systems are built on the Transformer [6] architecture, specifically BERT [3]. In a few words, they consist of pre-trained BERT models with several classification layers on top. We have tested two approaches, one of which is the continuation of the other. We henceforth refer to them as the baseline approach and the two-experts approach. The latter is an experiment to assess whether two sources of knowledge can be fused to collaborate and improve the results they would obtain on their own. There are different ways of combining two models into a bigger model; in this work, we have chained one after the other.

A high-level diagram of the baseline and two-experts approaches is shown in Figure 3. Both approaches start by passing the prepared tokens to a BERT model. In the two-experts approach, the output of the last layer is fed to a second BERT model as pre-computed embeddings, and the output of the second model's last layer is then processed by a dropout layer. In the baseline approach, the output of the first model is passed directly to the dropout layer. After dropout, the token representations are passed to 4 independent linear transformation layers, which output the logits for the 4 output variables described earlier (see Section 2.1.2). That is, all the objectives are trained jointly in a single model with several classification heads, all of which rely on the same per-token contextual embedding obtained from a pre-trained BERT model.

Figure 3: High-level architecture diagram. The classification heads for BIO, 3Ds, 4D and BGH have output sizes 3, 190, 11 and 77, respectively, on top of 768-dimensional token embeddings; in the depicted example, the subwords of "Sarcoma de Ewing" receive the labels B-9260/3 and I-9260/3, while surrounding tokens receive O.

In training, the back-propagated error is the sum of the cross-entropy losses of the 4 outputs. BERT's special tokens and context tokens do not participate in the computation of the loss. That is, while the BERT models do attend to all positions, they only learn from the gold labels in each sequence's window, not from its context (see Section 2.1.1). This helps avoid an "edge bias" near the arbitrary start and end of the input. For inference, the label with the maximum probability is chosen for each token and variable after applying the softmax function to the logits. Then, the outputs of the sliding windows are concatenated, and BERT's special tokens and context tokens are ruled out, in order to recover the original sequence of tokens and their corresponding predictions. In the case of tokens split into subwords by the tokeniser, the predictions corresponding to the first subword are used as predictions for the whole token.

2.3. Output interpretation

The implemented systems output 4 predictions per token, which correspond to the BIO tag, the first 3 digits of an ICD-O-3 code, the fourth digit, and the Behaviour, Grade and /H positions. This output must be interpreted and transformed into BRAT's span-based format, where each tumour morphology mention detected, whether a single token or multiple, is associated with a valid ICD-O-3 code. The post-processing consists of two main steps.

First, if any of the classifiers 3Ds, 4D or BGH predicted the tag O, O is assigned to the token; it is not part of a tumour morphology mention. Otherwise, an ICD-O-3 code is composed from the predictions, prefixed with the corresponding BIO tag (see the examples on the right-hand side of Figure 3). A probability is assigned to the newly created code, defined as the product of the probabilities emitted by the classifiers 3Ds, 4D and BGH.

Then, the token-based annotations are translated to span-based annotations with the help of the BIO tags. In the case of single-word mentions, the ICD-O-3 code assigned to the mention is simply the ICD-O-3 code of the corresponding token. In the case of multi-word mentions, the ICD-O-3 code chosen is the code with the maximum average probability among the codes of the tokens that participate in the mention. As a result, we obtain outputs for the NER and NORM tracks. To produce outputs for the CODING track, where a ranked list of ICD-O-3 codes is expected per document, we simply take the set of predicted codes and order them by their assigned probability.
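This post-processing can be sketched as follows, assuming that each classification head returns a (label, probability) pair per token. All names are illustrative, and character offsets, subword merging and BRAT serialisation are omitted.

```python
# Minimal sketch of the output interpretation in Section 2.3. Each head is assumed to
# return a (label, probability) pair per token; names and data structures are ours.

def token_code(bio, threeds, fourd, bgh):
    """Merge one token's head predictions into (BIO, ICD-O-3 code, probability),
    or None if any of the 3Ds/4D/BGH heads predicts 'O'."""
    if "O" in (threeds[0], fourd[0], bgh[0]):
        return None
    code = threeds[0] + fourd[0] + bgh[0]            # e.g. '926' + '0' + '/3' -> '9260/3'
    return (bio[0], code, threeds[1] * fourd[1] * bgh[1])

def group_mentions(token_preds):
    """Group consecutive annotated tokens into mentions using the BIO tags and keep,
    for each mention, the code with the highest average probability among its tokens."""
    mentions, current = [], []
    for pred in list(token_preds) + [None]:          # trailing None flushes the last mention
        if current and (pred is None or pred[0] == "B"):
            probs = {}
            for _, code, p in current:
                probs.setdefault(code, []).append(p)
            avg = {code: sum(ps) / len(ps) for code, ps in probs.items()}
            best = max(avg, key=avg.get)
            mentions.append((best, avg[best]))
            current = []
        if pred is not None:
            current.append(pred)
    return mentions

def rank_codes(mentions):
    """Per-document CODING output: predicted codes ordered by their probability
    (here, the highest probability observed for each code; an assumption)."""
    best = {}
    for code, prob in mentions:
        best[code] = max(prob, best.get(code, 0.0))
    return sorted(best, key=best.get, reverse=True)
```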
2.4. Training setup and submitted systems

In the earlier phases of this work, we experimented with several publicly available pre-trained BERT models, namely BERT-base Multilingual Cased (https://github.com/google-research/bert/blob/master/multilingual.md), BETO [7], SciBERT [8], Clinical BERT [9], and BioBERT [10]. The latter three have been pre-trained with English text of the health and biomedical domains. The best results on the official development sets were achieved by BETO and SciBERT. Thus, the submitted systems use BETO, SciBERT, or both.

Early experimentation also showed considerable differences in performance between the two development datasets. In order to leverage all the available data, we have trained several systems with different data splits and combined their predictions in voting ensembles.

2.4.1. Voting ensembles

Let D_1 and D_2 be the two official development sets provided by the task organisers. Let D_3 be a third development set randomly sampled from the official training set T, and T_d the remaining data of the training set, so T = T_d ∪ D_3. We have trained 3 versions of each model, setting aside one development set each time, so for each rotation the training data split is T_rotation = T_d ∪ D_i ∪ D_j and the development set is D_rotation = D_k, with i, j, k ∈ {1, 2, 3} pairwise distinct.

The model ensembles are obtained via token-wise soft voting, prior to transforming the standalone predictions to BRAT's character-span-based format: the full ICD-O-3 code for each token and voting system is built from its predicted components as explained in Section 2.3; afterwards, the vote of each system is weighted by the probability of its code, the probability being the product of the probabilities given by the classifiers 3Ds, 4D and BGH. Finally, the BRAT files and the CODING track outputs are generated as if the predictions came from a single system.

The final submitted systems are the following:

• S1: an ensemble of 3 BETO-based baseline models
• S2: an ensemble of 3 SciBERT-based baseline models
• S3: an ensemble of 3 Two Experts models, with BETO as the first expert and SciBERT as the second
• S4: an ensemble of the prior 9 models, henceforth the Flat ensemble
• S5: an ensemble of S1, S2 and S3, henceforth the 2-step ensemble

Both ensembles S4 and S5 take advantage of all 9 trained standalone models, but the former performs the voting with the 9 outcomes at the same time, while the latter calculates the votes in 2 consecutive rounds: it first calculates S1, S2 and S3, then uses their results to vote a second time.

2.4.2. Hyperparameters and other implementation details

The implementation of the models and all the auxiliary modules, helpers and functions is written mainly in Python 3.7, using HuggingFace's Transformers library [11]. During training, the base learning rate was 2e-5 with a linear warm-up schedule that reaches its maximum during the first 5,000 iterations. The training of all models was limited to a maximum of 200 epochs with an early-stopping patience of 50 epochs (i.e., the training was stopped after 50 consecutive epochs without improvement). In most cases, early stopping was triggered before reaching the maximum allowed number of epochs. The dropout rate was the same as used in the pre-trained BERT-base models: 0.1. The batch size for the baseline models was set to 6, while for the two-experts variants it was set to 4, in both cases because it was the largest batch that fit in memory on a single GPU.
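A minimal sketch of this optimisation set-up is given below. The behaviour of the learning-rate schedule after the warm-up (kept constant here) and the metric monitored for early stopping are assumptions on our part; the helper names are illustrative.

```python
# Sketch of the optimisation set-up described above: AdamW at a base learning rate of
# 2e-5, linear warm-up over the first 5,000 iterations, and early stopping with a
# patience of 50 epochs. The schedule after warm-up (constant here) and the monitored
# metric are assumptions, not confirmed details of the submitted systems.
import torch
from transformers import get_constant_schedule_with_warmup

def build_optimisation(model, base_lr=2e-5, warmup_steps=5000):
    optimiser = torch.optim.AdamW(model.parameters(), lr=base_lr)
    scheduler = get_constant_schedule_with_warmup(optimiser, num_warmup_steps=warmup_steps)
    return optimiser, scheduler

def should_stop(dev_scores, patience=50):
    """Stop when the monitored development score has not improved for `patience` epochs."""
    best_epoch = max(range(len(dev_scores)), key=dev_scores.__getitem__)
    return len(dev_scores) - 1 - best_epoch >= patience
```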
The training has been run on a single Nvidia RTX 2080 GPU with 11 GB of RAM. Training times vary depending on when the early-stopping condition is met, but all of them have fallen within a range of a few hours.

For inference, we have used a much larger batch size of 128, because the memory requirements are lower as no gradients are computed. The context and window sizes for the sliding windows have been kept the same: 𝑊 = 300 and 𝐶 = 100. The inference speed on GPU exceeds 8,000 tokens/second (although training is impractical without a GPU, inference can be performed on CPU with a throughput of about 800 tokens/second), which for this task is equivalent to processing about 10 documents per second. With these settings, the 5,323 background documents of the competition have been processed in 7-8 minutes.

3. Results

Table 2 shows the results obtained by the submitted systems for all the tracks. It also includes the results we have calculated for the standalone models on the different development sets described in Section 2.4.1. The results for the test set are the official results reported by the task organisers. A comprehensive comparison and ranking of the results from all the shared task participants can be found in [2].

Table 2: Results per system, track and dataset

                                                  NER                   NORM            COD
                                             P      R      F1      P      R      F1     MAP
Development set 1
D1.1 BETO                                  84.32  83.54  83.93   78.10  77.37  77.73   81.23
D1.2 SciBERT                               85.38  82.13  83.72   79.25  76.23  77.71   82.24
D1.3 Two Experts                           83.79  83.42  83.61   75.23  74.89  75.06   78.17
Development set 2
D2.1 BETO                                  85.81  85.45  85.63   77.77  77.44  77.60   78.48
D2.2 SciBERT                               84.24  85.00  84.62   76.49  77.18  76.83   79.47
D2.3 Two Experts                           86.48  83.91  85.17   77.14  74.85  75.98   76.77
Development set 3
D3.1 BETO                                  86.49  85.39  85.94   79.24  78.24  78.74   80.59
D3.2 SciBERT                               84.35  85.30  84.82   77.94  78.81  78.37   81.34
D3.3 Two Experts                           86.58  84.87  85.72   78.57  77.02  77.79   78.66
Test set
S1 BETO ensemble (D1.1 + D2.1 + D3.1)      86.29  86.62  86.46   80.74  80.76  80.75   82.91
S2 SciBERT ensemble (D1.2 + D2.2 + D3.2)   85.45  86.65  86.05   80.13  81.15  80.63   83.84
S3 Two Experts ensemble (D1.3 + D2.3 + D3.3)  86.29  86.13  86.21   79.81  79.55  79.68   81.49
S4 Flat ensemble (all 9 DX.X)              86.92  86.54  86.73   82.16  81.92  82.04   84.21
S5 2-step ensemble (S1 + S2 + S3)          86.83  87.12  86.97   82.19  82.08  82.14   84.68

The results obtained by our models vary among the development sets, but are quite consistent. The results for the test set are higher, probably due to the effect of the ensembles. Precision and recall are evenly balanced for all the tested systems. The best performing system, the 2-step ensemble, obtains an F1-score of 86.97 in NER, an F1-score of 82.14 in NORM and a Mean Average Precision (MAP) of 84.68 in CODING. Overall, all systems surpass scores of 83.00, 75.00 and 76.00 in the respective tracks, and the ensembles do so by a large margin.

The ensembles of the 9 model variants—3 per model type—work noticeably better than the ensembles of a single model type. The 2-step ensemble works even better than the flat ensemble for all the tracks, in particular for CODING, where the difference between the flat and 2-step ensemble is almost 0.5 MAP points. With respect to BETO and SciBERT, the former performs marginally better in NER and NORM; however, the latter obtains consistently better MAP in CODING. The two-experts approach has not resulted in a performance improvement.

A quantitative analysis of the errors committed by the submitted systems is provided in Table 3. Again, we observe similar trends among the systems.
SciBERT seems to yield more annotations—spurious and correct—than BETO; Two Experts produces fewer annotations than BETO or SciBERT alone. Meanwhile, the flat and 2-step ensembles miss fewer annotations, make fewer spurious predictions, and produce more exact matches.

In general, when a mention span is matched exactly, which happens about 80% of the time on average, the code given is correct with a probability above 93%. The chances drop to around 35% for overlapping spans. In either case, when an incorrect code is proposed, the error is more likely to be found in the first four digits of the code than in the behaviour (B), grade (G) or /H positions.

Table 3: NORM error analysis on the test set

                                              S1     S2     S3     S4     S5
Total predictions                           3,634  3,679  3,621  3,622  3,628
Exact span matches                          3,141  3,142  3,131  3,149  3,150
  of which, exact code matches              2,933  2,948  2,889  2,975  2,981
  3Ds errors                                   86     77    116     65     62
  4D errors                                    77     67     93     59     59
  B errors                                     47     48     51     37     33
  G errors                                     23     24     23     16     21
  H errors                                     49     48     48     52     49
  code not in train, guessed correctly          9     15      5     17     15
  code not in train, guessed incorrectly       66     55     67     58     49
Span overlaps                                 335    348    326    318    327
  of which, exact code matches                120    126    116    111    115
  3Ds errors                                   97    116    104     98    110
  4D errors                                    65     77     76     64     72
  B errors                                     73     74     72     66     70
  G errors                                     47     52     43     51     48
  H errors                                     28     31     28     24     28
  code not in train, guessed correctly          1      2      1      2      4
  code not in train, guessed incorrectly       68     83     72     81     82
Spurious predictions                          158    189    164    155    151
Missed mentions                               218    204    233    219    216
  of which, code not in train                  18     14     17     19     18

While 58,062 codes are considered valid in CANTEMIST, only 746 of them actually occur in the training and development data provided. Our systems are capable, to an extent, of producing ICD-O-3 codes that they have not seen in the training data. This is possible on account of the multitask approach. Still, our systems fail to generate correct unseen codes much more often than they succeed, even more so when the mention span has not been matched exactly. The bulk of missed mentions are not mentions pertaining to unseen codes, but mentions whose codes occur, and are even very frequent, in the training and development datasets. This phenomenon requires further analysis to be better explained and addressed.

4. Discussion

The systems presented rely mainly on the semantic representation capabilities derived from the BERT architecture and the knowledge captured by their own pre-training. The results are seemingly good (other participants' results are unknown to us at the time of writing), but there is still room for improvement. We pose the following open questions as discussion:

1. Our approach does not leverage information associated with ICD-O-3 codes (code descriptions, definitions, and so on) nor any other hand-crafted knowledge source, which could improve the results obtained by helping produce representations for ICD-O-3 codes not seen in the training data.

2. Regarding BioBERT, Clinical BERT and SciBERT: we hypothesise that SciBERT has outperformed BioBERT and Clinical BERT in our experiments because it has been trained from scratch with its own vocabulary, better suited to the health domain.

3. Along the same lines, it may come as a surprise that BETO and SciBERT obtain similar results, when SciBERT has only been trained with texts in English.
We hypothesise that, because the terminology of the health domain is mainly constructed from Greek and Latin roots and affixes both in English and in Spanish, the WordPiece strategy and the domain-specific vocabulary of SciBERT play to its advantage in this case.

4. The two previous points indicate that a Spanish Clinical BERT may lead to still better results.

5. The combination of BETO and SciBERT, in the manner explained in this paper, does not seem to be beneficial in this task, having obtained slightly worse results than the standalone models. Many other ways exist in which the two models could be combined, so further experimentation in this line might be of interest.

6. While the flat and 2-step ensemble models show performance gains in comparison to the simpler models, it is questionable whether such a system, which requires training and running 9 standalone models, would be viable in a real-world scenario.

5. Conclusions

In these working notes we have described our participation in the CANTEMIST shared task. Our end-to-end deep-learning-based system relies on pre-trained BERT models as the base for the semantic representation of the texts. With these semantic representations, ICD-O-3 codes are predicted for each token in a sequence-labelling fashion, and this information is used to address the three competition tracks (namely, NER, NORM and CODING) at the same time. We have described how we have preprocessed and represented the information, and how we have performed rotating training runs to leverage all the available data (i.e., the official training set and the two official development sets). We have submitted the results of ensemble models trained on different views of the data.

Both our experiments and the official evaluation show robust results on different subsets of data. According to these results, the ensembles do provide a performance advantage, with the two-step ensemble outperforming the flat ensemble. We have also found that BETO and SciBERT obtain comparable results in this particular task, but the proposed combination of both has not resulted in better scores.

As future work, the models may benefit from a mechanism to inject ICD-O-3 code semantics to enhance their capability to match codes that have not been seen during the training phase. Further experimentation on the combination of several pre-trained models would also be helpful for scenarios where each model brings some useful knowledge to the task and no single pre-trained model suits the task best.

Acknowledgments

This work has been partially funded by the project DeepReading (RTI2018-096846-B-C21, MCIU/AEI/FEDER, UE).

References

[1] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, BRAT: A Web-based Tool for NLP-assisted Text Annotation, in: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL '12), 2012, pp. 102–107.
[2] A. Miranda-Escalada, E. Farré, M. Krallinger, Named entity recognition, concept normalization and clinical coding: Overview of the CANTEMIST track for cancer text mining in Spanish, Corpus, Guidelines, Methods and Results, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020.
[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[4] World Health Organization, International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3), 2015. URL: https://www.who.int/classifications/icd/adaptations/oncology/en/, accessed: 24-07-2020.
[5] L. Ramshaw, M. P. Marcus, Text Chunking Using Transformation-based Learning, in: S. Armstrong, K. Church, P. Isabelle, S. Manzi, E. Tzoukermann, D. Yarowsky (Eds.), Natural Language Processing Using Very Large Corpora, Springer Netherlands, 1999, pp. 157–176.
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention Is All You Need, in: Proceedings of the Thirty-first Conference on Advances in Neural Information Processing Systems (NeurIPS 2017), 2017, pp. 5998–6008.
[7] J. Cañete, G. Chaperon, R. Fuentes, J. Pérez, Spanish Pre-Trained BERT Model and Evaluation Data, in: Proceedings of the Practical ML for Developing Countries Workshop at the Eighth International Conference on Learning Representations (ICLR 2020), 2020, pp. 1–9.
[8] I. Beltagy, K. Lo, A. Cohan, SciBERT: A Pretrained Language Model for Scientific Text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3615–3620.
[9] E. Alsentzer, J. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, M. McDermott, Publicly Available Clinical BERT Embeddings, in: Proceedings of the 2nd Clinical Natural Language Processing Workshop (ClinicalNLP 2019), 2019, pp. 72–78.
[10] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2019) 1234–1240.
[11] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, HuggingFace's Transformers: State-of-the-art Natural Language Processing, arXiv:1910.03771 (2019) 1–11.