Tumor Entity Recognition and Coding for Spanish
Electronic Health Records
Fadi Hassana,b , David Sancheza and Josep Domingo-Ferrera
a
  CYBERCAT-Center for Cybersecurity Research of Catalonia. UNESCO Chair in Data Privacy.), Department of
Computer Science and Mathematics Universitat Rovira i Virgili, Av. Països Catalans 26, E-43007 Tarragona, Catalonia
b
  Department of Computer Science, Hodeidah University, Hodeidah 1821, Yemen


                                         Abstract
                                         This paper describes a two-stage system to solve tumor entity detection and coding in Spanish health
                                         records. This system is submitted to the CANcer TExt Mining Shared Task (CANTEMIST), a challenge
                                         in the IberLEF 2020 Workshop. We include a comparison between two kinds of systems to tackle this
                                         problem. The first kind employ feature-based Conditional Random Fields (CRF), and the second kind
                                         is based on deep learning models. The reported experiments show that our proposals and their com-
                                         bination achieve a micro-F1 of 83.1% and 78.6% on the test data set for the first and second sub-tasks,
                                         respectively, and a MAP of 79.7% on the third sub-task.

                                         Keywords
                                         Electronic Health Records, Deep Learning, Convolution Neural Networks, Conditional Random Fields,
                                         Named entity recognition


1. Introduction
Electronic health records (EHRs) are systematized collections of patient electronically-stored
health information. Usually, EHRs come in different formats, the most popular being free-text
form. Text is a rich source of information for healthcare research and, in particular, to machine
learning tasks. So far, most of the healthcare data available for research is written in English, so
there is a need of tagged data for other languages like Spanish.
   On the basis of this need, the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL)
organizes the CANcer TExt Mining Shared Task (CANTEMIST) [1]. This task aims to provide
the medical and machine learning fields with Spanish labeled data. Since accessing EHRs is
tricky due to privacy issues, the provided notes in this task were clinical case reports (CCRs),
which they are as close as possible to EHRs.
   CANTEMIST is a shared task in the IberLEF 2020 workshop, which focuses on extracting
named entities of critical concepts related to cancer in EHRs. This contest includes three
independent sub-tasks: 1) CANTEMIST-NER track, which requires finding tumor morphology
mentions automatically; 2) CANTEMIST-NORM track, which is a named entity normalization


Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)
email: fadi.hassan@urv.cat (F. Hassan); david.sanchez@urv.cat (D. Sanchez); josep.domingo@urv.cat (J.
Domingo-Ferrer)
orcid:
                                       © 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
                      Dataset    # Tokens   # Sentences    # Named Entities
                        Train     661,853      27,662           9,609
                       Dev 1      219,377      9,097            3,287
                       Dev 2      177,663      8,346            2,624
                         Test     240,666      10,745           3,569
Table 1
Comparison between the datasets.


task requiring all tumor morphology mentions being returned with their corresponding eCIE-O-
3.1 codes; and 3) CANTEMIST-CODING track, which requires returning, for each of document,
a ranked list of its corresponding ICD-O-3 codes [2] (Spanish version: eCIE-O-3.1[3]).
    As participants in the contest, we tackled the problem by designing several systems based on
Conditional Random Fields (CRFs) and deep learning models. We also proposed a combination
of these systems to improve the results.
    For the first task (NER task), we design two systems (1-BiLSTM and 2-BiLSTM) based on Ar-
tificial Neural Networks (ANNs) with Bi-directional Long Short-Term Memory neural networks
(Bi-LSTM), and a third one relying on feature-based Conditional Random Fields (CRF). Then,
we combined the results of the three systems using voting.
    For the second and third tasks (normalization and coding tasks), we have designed a system
based on Convolution Neural Networks (CNNs) and Long Short-Term Memory neural networks
(LSTMS).
    The remainder of the paper is organized as follows. In Section 2, we briefly describe the data.
Section 3 describes the systems we propose. Results and discussions are presented in Section 4.
Section 5 presents the conclusions and depicts some lines of future work.


2. Data Description
The proposed systems are trained and validated on train and development datasets provided
by the organizers of the contest. The corpora released for the tasks consists of 1,300 Spanish
medical records, divided into 500 as training data, 250 for each dev1 and dev2 data, and 300 as
test data. Table 1 shows general statistics about the four datasets.


3. Systems Description
We developed several systems to detect and code tumor named entities in Spanish health records.
The next subsections describe the steps we followed to train and use these systems. Figure 1
describes the general data flow of the system.

3.1. Text Tokenization
In this step we performed a sentence splitting and text tokenization of the health records. For
this, we used the spaCy pre-trained model for the Spanish language [4].


                                               377
      Note 1
                                            #        Tokens              #   Tokens       Entities   Codes

                                            1                            1
                   Sentence splitter/       2                            2
                      Tokenizer
                                                     ...                                  ...
      Note 2
                                            m                            m

        ...

                                                NER Model


      Note n
                                        #   Tokens            Entities

                                        1
                                        2                                             Norm Models
                                                      ...
                                        m


Figure 1: Data flow diagram for the three sub-tasks.


3.2. Word Embedding
After we got all the tokens from the previous step, we used word embedding to map these
tokens to meaningful n-dimensional vectors. The word embedding model has been generated
from Spanish medical corpora [5]. The model was built with fasttext and the dataset used to
generate the model consisted of two data sources: (i) the SciELO database, which contains
full-text articles primarily in English, Spanish and Portuguese, and (ii) a subset of the Wikipedia,
which we call Wikipedia Health, consisting on the articles under the following categories:
Pharmacology, Pharmacy, Medicine and Biology.

3.3. Data Augmentation
In addition to the available sentences in the training dataset, we increased the number of
sentences with named entities by using the list of tumor entities in the file "valid-codes.txt" from
the extra resources provided by the organizers. This file contains 4,203 records of tumor named
entities with their corresponding eCIE-O-3.1 codes. This data augmentation was performed
by choosing a random sentence containing NEs from the training dataset and replacing the
existing entity by a random entity from the list of entities in the "valid-code.txt" file.


                                                              378
                              Forward                       Forward     CRF         Output
            Input
                               Layer                         Layer      Layer


                                                LSTM
        Word Embedding

                ...


                                 LSTM


                                                               LSTM
                ...


                                                ...
                                 LSTM


                                                               LSTM
             ...


                                                LSTM
                                 ...


                                                               ...


                                                                         ...
                ...
                                                LSTM


                ...
                                 LSTM


                                                               LSTM


                                            Backward
                                             Layer


Figure 2: The architecture for the Bi-LSTM with a CRF Layer model.


3.4. Tumor Named Entity Recognition
This section describes the proposed systems for handling the tumor named entity recognition.
The first subsection explains the feature-based NER using CRF, while the second subsection
explains the deep learning model using Bi-LSTM neural networks.

3.4.1. Feature-Based CRF
In principle, we included this system only for comparison purposes. However, its results were
comparable to those of the deep learning model, so we took advantage of that to improve the


                                              379
final results by combining the outcomes of the two systems.
   This system was implemented using Python 3.7 with the sklearn-crfsuite package [6]. The
input for this system was the tokenized sentences from the text tokenizer. The system extracts
some features for every word token like the Part-of-Speach tag (POS), word prefix, word suffix,
word length, etc. This is similar to what we did in our previous work in the MEDDOCAN
competition [7, 8, 9]. Similarly, we also used the BIO tagging scheme to set the labels of the
tokens [10]. As a result, each word token in the medical record is labeled using one of three
possible tags: B, I, or O, which indicate if the word is at the beginning, middle, or outside of a
Tumor entity.

3.4.2. Bi-LSTM with CRF Layer
As shown in Figure 1, the first stage in our system is the NER, which takes as input the sequence
of tokens of a sentence and predicts labels using BIO tags. We choose to use Bi-LSTM as our
encoder because of its ability to take all the information sequentially and pass it to the classifier.
After that, we have a CRF classifier. CRF does sequence labeling for every token considering
the label of neighboring tokens.
   These kind of systems take as input a list of tokens with a fixed size. In our case, we choose the
maximum size of 80 tokens per sentence; extra tokens were truncated as a separated sentence.
If the sentence’s length is less than 80, we added pre-defined padding to reach the maximum
size.
   As introduced before, by means of word embedding, each token is mapped to its corresponding
word embedding vector. After that, all vectors are passed to the LSTM layers and, finally, the
output of the LSTM is passed to the CRF layer to do the classification. The CRF layer classifies
every token to one of the BIO tags.

3.5. Clinical Concept Normalization
The second and the third tasks in the competition focus on tumor named entity normalization,
where every tumor NE should be mapped to their corresponding CIE-O-3.1 codes. In the
following section, we describe our proposed system for these tasks.

3.5.1. CNN with LSTM Layer
The second and third sub-tasks in the competition (Norm and coding tasks) are clinical concept
normalization. The input is a list of entities that are detected by the NER system, which should
be mapped to the corresponding eCIE-O-3.1 codes. eCIE-O-3.1 is the Spanish version of the
International Classification of Diseases for Oncology (ICD-O). This ontology aims to standardize
the tumor named entities in health records and make them understandable internationally.
   All the codes in the eCIE-O-3.1 come in the form of three codes separated by /. The regular
expression for these codes is dddd/dd?(/H)?. The first idea that comes to the mind is to build
a system that takes the NER output as input and predicts the corresponding code. However,
implementing this system gave us bad results. After we analyzed why this straight forward
system did not perform well, our findings were: 1) the three parts of the code vary between
different tumor name entities; and 2) some of these combinations appear few times in the


                                                380
            Input                              Models                                Output


       Word Embedding
                                                                  dddd
                                        Norm model for part 1
               ...
               ...
                                                                  dd?
                                        Norm model for part 2                    dddd/dd?(/H)?
             ...


               ...
                                        Norm model for part 3
                                                                   H?


Figure 3: Overall overview of the clinical concept normalization system.


training dataset, which produces too many combinations. Because of that, we decided to treat
these three parts separately, i.e. build a separate classifier for every part.
   Our system contains three identical sub-systems as we mentioned before (see Figure 3 and
Figure 4). The input for these three models is a list of 20 tokens (because the number of tokens in
the tumor named entities is between 1 and 20). We used CNNs because these kinds of networks
reduce the computation by exploiting local correlation of the input data. In our case, the tumor
named entities contain in average three words, so we used CNNs with 512 filters of the size
three. We do 2-max pooling to get the max two numbers for every filter response, which gives
us two vectors of size 512. These two vectors were passed to the LSTM to combine them. Finally,
the output was passed to the classifier, which is a fully connected layer.


4. Results and Discussion
We have proposed four systems. Table 2 provides details about the training and evaluation of
each of them. NER models were trained by feeding all the sentences without exception, while
Norm models were trained only on those sentences that contain NEs.

4.1. Evaluation Metrics
The standard metrics to evaluate the systems participating in this competition are the F1-score,
for both NER and Norm sub-tasks, and the Mean Average Precision MAP, for the coding sub-
task. Precision and Recall also are included to give more insights about the performance of the
sub-models.


                                                   381
                                                              2-Max Pooling       LSTM       Dense
          Input                              Conv1D                                                      Output
                                                                                  Layer      Layer


     Word Embedding              Filter 0
                                               ...


                                                            ...
                                                                                               Softmax


                                                                                   LSTM
                                                                          ...
                                 Filter 1      ...


                                                            ...
            ...                                                                                          Code


                                               ...


                                                                                              ...
                                 Filter 2


                                                            ...


                                                                                   LSTM
                                             ...


                                                                          ...
                                Filter 511     ...
                                                            ...


Figure 4: The architecture for sub-code models for clinical concept normalization.

                  System name        Tasks                        Training sets           Validation
                                     NER
                   1-BILSTM                                   train and dev2               dev1 set
                                 Norm & coding
                                     NER
                   2-BILSTM                                   train and dev1               dev2 set
                                 Norm & coding
                                     NER                 train, dev1 and dev2                20%
                      CRF
                                 Norm & coding              train and dev1                 dev2 set
                                     NER              Voting (BILSTMs and CRF)                -
                  CRF+BILSTM
                                 Norm & coding              train and dev1                 dev2 set
Table 2
Training and validation datasets used to train our systems.


4.2. Result
Table 3 provides a detailed comparison of the performance of the baseline system and our
systems. The baseline system was provided by the organizers, which is a dictionary lookup
based system. It looks for the NEs that found in train and development sets in the test set.
  In the three sub-tasks, combining the three systems using voting gave better results on
F1-score. However, it gave worse results on the third sub-task for the MAP metric. This happens


                                                      382
                                 NER task           Norm task           Coding task
             System name
                               P    R     F1     P     R      F1           MAP
                Baseline      18.1 73.3 29.1 18.0 73.0 28.8                58.4
                                        Our systems
              1-BILSTM       80,7 83,0 81,8 76,5 78,6 77,6                 78,3
              2-BILSTM       82,4 82,4 82,4 77,9 78,0 77,9                 79,7
                 CRF         80,6 77,6 79,1 77,5 74,6 76,0                 77,9
             CRF+BILSTM      84,4 81,8 83,1 79,8 77,4 78,6                 78.7
Table 3
Performance comparison between the baseline system and our systems on the test set.


because of the code "8000/6", which appears more than the other codes. Combining the three
systems helped to improve the wrongly classified samples for that code but, at the same time,
caused several wrongly classified samples for the other codes.


5. Conclusion and Future Work
In this paper we describe the implementation of several systems to solve the problem of tumor
named entity recognition and normalization. We compared the performance of our systems
with a feature-based system. Results show that a combination of several of our systems provided
the best results in most cases.
   As future work, we plan to use transformer-based models like BERT [11] or XLNet [12],
which they are the state of the art in the NLP field nowadays. We expect these models will give
us better results on this task.


Acknowledgments
Partial support to this work has been received from the European Commission (project H2020-
871042 “SoBigData++”), the Government of Catalonia (ICREA Acadèmia Prize to J.Domingo-
Ferrer and grant 2017 SGR 705), the Spanish Government (projects RTI2018-095094-B-C21
“Consent” and TIN2016-80250-R “Sec-MCloud”) and the Norwegian Research Council (project
no. 308904 “CLEANUP”). The authors are with the UNESCO Chair in Data Privacy, but the
views in this paper are their own and are not necessarily shared by UNESCO.


References
 [1] A. Miranda-Escalada, E. Farré, M. Krallinger, Named entity recognition, concept normal-
     ization and clinical coding: Overview of the cantemist track for cancer text mining in
     spanish, corpus, guidelines, methods and results, in: Proceedings of the Iberian Languages
     Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020.
 [2] Steward/Custodian, International classification of diseases for oncology, 3rd edition


                                              383
     (icd-o-3), https://www.who.int/classifications/icd/adaptations/oncology/en/, (accessed:
     17.08.2020).
 [3] Portal estadístico del ministerio de sanidad servicios sociales e igualdad,
     tabla de códigos de morfología de las neoplasias (cie-o-3.1) válidos, https:
     //datosabiertos.castillalamancha.es/dataset/registro-de-actividad-de-atenci%C3%
     B3n-sanitaria-especializada-de-castilla-la-mancha-rae-clm-11, (accessed: 17.08.2020).
 [4] M. Honnibal, I. Montani, spacy 2: Natural language understanding with bloom embeddings,
     convolutional neural networks and incremental parsing, To appear 7 (2017).
 [5] F. Soares, M. Villegas, A. Gonzalez-Agirre, M. Krallinger, J. Armengol-Estapé, Medical
     word embeddings for spanish: Development and evaluation, in: Proceedings of the 2nd
     Clinical Natural Language Processing Workshop, 2019, pp. 124–133.
 [6] M. Korobov, sklearn-crfsuite, https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html,
     (accessed: 17.08.2020).
 [7] M. Marimon, A. Gonzalez-Agirre, A. Intxaurrondo, H. Rodriguez, J. L. Martin, M. Villegas,
     M. Krallinger, Automatic de-identification of medical texts in spanish: the meddocan track,
     corpus, guidelines, methods and evaluation of results., in: IberLEF@ SEPLN, 2019, pp.
     618–638.
 [8] F. Hassan, M. Jabreel, N. Maaroof, D. Sánchez, J. Domingo-Ferrer, A. Moreno, Recrf:
     Spanish medical document anonymization using automatically-crafted rules and crf., 2019.
 [9] M. Jabreel, F. Hassan, D. Sánchez, J. Domingo-Ferrer, A. Moreno, E2ej: Anonymization of
     spanish medical records using end-to-end joint neural networks., 2019.
[10] E. F. Sang, J. Veenstra, Representing text chunks, arXiv preprint cs/9907006 (1999).
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[12] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, Xlnet: Generalized
     autoregressive pretraining for language understanding, in: Advances in neural information
     processing systems, 2019, pp. 5753–5763.


                                               384