<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Embeddings and Bi-LSTM+CRF Model to Detect Tumor Morphology Entities in Spanish Clinical Cases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergio Santamaria Carrasco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paloma Martínez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Carlos III de Madrid, Computer Science Department</institution>
          ,
          <addr-line>Av. de la Universidad 30, 28911 Leganés, Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>368</fpage>
      <lpage>375</lpage>
      <abstract>
        <p>Extracting key concepts from electronic health records makes it easier to access and represent the information they contain. The growing number of these documents and the complexity of clinical narrative make it difficult to label them manually, so systems capable of automatically extracting key information are of great interest. In this paper, we describe a deep learning architecture for identifying tumor morphology named entities. The architecture consists of convolutional and bidirectional Long Short-Term Memory layers and a final layer based on a Conditional Random Field. The proposed system (HULAT-UC3M) participated in CANTEMIST-NER and obtained a micro-F1 of 83.4%.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Long Short-Term Memory</kwd>
        <kwd>Word Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the exponential growth of clinical documents, natural language processing (NLP)
techniques have become a crucial tool for unlocking critical information to make better clinical
decisions. Understanding health-related problems requires the extraction of certain key entities
such as diseases, treatments or symptoms and their attributes from textual data.</p>
      <p>
        The identification of specific entities of interest inside medical documents can be addressed as
a Named Entity Recognition (NER) problem. Approaches to this problem range from dictionaries
and rule-based systems to machine learning, deep learning, and hybrid systems. In similar shared
tasks, such as eHealth 2019 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the most popular approaches were
based on deep learning models, since they learn relevant patterns automatically and thus offer
a certain degree of independence from language and domain.
      </p>
      <p>
        This paper describes the participation of the HULAT-UC3M team in the CANTEMIST 2020
challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], CANTEMIST-NER subtrack. This challenge targets named entity recognition
of a critical and rarely addressed type of cancer-related concept, namely tumor morphology.
      </p>
      <p>
        The core of the proposed system is an adaptation of the bidirectional Long Short-Term Memory
(bi-LSTM) CRF model of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], previously applied successfully to temporal expression recognition.
This system combines several neural network architectures for extracting features
at a contextual level with a CRF for decoding the labels. The system
proposed by the HULAT-UC3M team reaches an F1 score of 0.834, which shows that
approaches based on deep learning perform well on this task.
      </p>
      <p>The rest of the paper is organized as follows. In Section 2 we briefly describe the datasets
provided for the CANTEMIST task. In Section 3, we describe the architecture of our system.
Section 4 presents the results obtained for our system. In Section 5, we provide the conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>
        The dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The provided corpus was divided into three parts: training,
development and validation. The training set contained 501 clinical cases manually annotated
by clinical experts in BRAT standoff format<sup>1</sup>, following the rules for annotating
neoplasm morphology in Spanish oncology clinical cases. The development set consisted of 500
additional clinical cases for evaluating machine learning systems and tuning their
hyperparameters. Participants could also freely use additional resources from other corpora to improve
their systems.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods and system description</title>
      <sec id="sec-3-1">
        <title>3.1. Pre-processing</title>
        <p>
          We pre-process the text of the clinical cases in several steps. First, the texts
are split into tokens and sentences using spaCy<sup>2</sup>, an open-source library that provides
support for texts in several languages, including Spanish. We chose spaCy over
more specialized tools for processing clinical texts, such as scispaCy<sup>3</sup>, because
they do not support Spanish. Finally, the text and its annotations are transformed
into the CoNLL-2003 format using the BIOES schema [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In this schema, tokens are annotated
using the following tags:
        </p>
        <list list-type="bullet">
          <list-item><p>B: marks the token that begins an entity.</p></list-item>
          <list-item><p>I: marks a token inside an entity, between its first and last tokens.</p></list-item>
          <list-item><p>O: marks a token that does not belong to any entity.</p></list-item>
          <list-item><p>E: marks the token that ends an entity.</p></list-item>
          <list-item><p>S: marks an entity that consists of a single token.</p></list-item>
        </list>
        <p><sup>1</sup> http://brat.nlplab.org/standoff.html</p>
        <p><sup>2</sup> https://spacy.io/</p>
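        <p>As an illustration of the BIOES schema, the following minimal Python sketch assigns tags to a tokenized sentence given entity spans expressed as token indices. The function name <monospace>bioes_tags</monospace>, the span representation and the example sentence are assumptions made for this illustration and are not the code used in our system; in CoNLL-style files each tag would additionally carry the entity type.</p>
        <preformat>
```python
def bioes_tags(num_tokens, entity_spans):
    """Return one BIOES tag per token.

    entity_spans holds (start, end) token indices, end exclusive,
    and is assumed to contain no overlapping spans.
    """
    tags = ["O"] * num_tokens
    for start, end in entity_spans:
        if end - start == 1:
            tags[start] = "S"            # single-token entity
            continue
        tags[start] = "B"                # first token of the entity
        for i in range(start + 1, end - 1):
            tags[i] = "I"                # tokens strictly inside the entity
        tags[end - 1] = "E"              # last token of the entity
    return tags


# Hypothetical sentence with a two-token and a one-token entity.
tokens = ["Se", "observa", "un", "nódulo", "pulmonar", "y", "metástasis", "."]
tags = bioes_tags(len(tokens), [(3, 5), (6, 7)])
for token, tag in zip(tokens, tags):
    print(token, tag)
```
        </preformat>
        <p>A two-token entity receives only B and E, so the I tag appears only in entities of three or more tokens.</p>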
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Features</title>
        <p>In this section we present the different attributes used as input to the deep learning
stack.</p>
        <list list-type="bullet">
          <list-item><p>Words: Pre-trained FastText [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] word embeddings are used. The word vectors presented in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] were selected because they contribute domain-specific knowledge, having been generated from Spanish medical
corpora. These vectors have 300 dimensions.</p></list-item>
          <list-item><p>Part-of-speech: This feature was included because of the significant amount of
information it offers about a word and its neighbors. It can also help in word sense
disambiguation. The PoS tagging model used is the one provided by spaCy. An
embedding representation of this feature is learned during training, resulting in a
40-dimensional vector.</p></list-item>
          <list-item><p>Characters: We also add character-level embeddings of the words, learned during
training and resulting in a 30-dimensional vector. These have proven useful for
domain-specific tasks and morphologically rich languages.</p></list-item>
          <list-item><p>Syllables: Syllable-level embeddings of the words, learned during training and resulting
in a 75-dimensional vector, are also added. Like character-level embeddings, they help to
deal with out-of-vocabulary words and to capture prefixes and suffixes that are common
in the domain, which helps classify words correctly.</p></list-item>
          <list-item><p>MeaningCloud Named Entities: MeaningCloud<sup>4</sup> is a Software as a Service product
that allows users to embed text analysis and semantic processing into any application or
system. Its Topics Extraction API extracts the entities present in a text, such as
drugs or places. In addition, using a customized dictionary built from the terms
in the National Cancer Institute’s dictionary<sup>5</sup>, we also identify cancer treatments,
types of cancer, medical tests, etc. This information is encoded as a 15-dimensional one-hot
vector.</p></list-item>
        </list>
        <p><sup>3</sup> https://allenai.github.io/scispacy/</p>
        <p><sup>4</sup> https://www.meaningcloud.com/es</p>
        <p><sup>5</sup> https://www.cancer.gov/espanol/publicaciones/diccionario</p>
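        <p>To make the size of the per-token input concrete, the sketch below concatenates the features listed above into a single vector. It is an illustration under stated assumptions, not our implementation: the character and syllable representations are assumed to be the pooled outputs of 50 convolutional filters each (see Section 3.3), all features are assumed to be simply concatenated, dummy values replace the real embeddings, and the entity category index is hypothetical.</p>
        <preformat>
```python
WORD_DIM = 300      # pre-trained FastText word vector
POS_DIM = 40        # learned part-of-speech embedding
CHAR_DIM = 50       # assumption: pooled output of 50 character filters
SYLL_DIM = 50       # assumption: pooled output of 50 syllable filters
ENTITY_DIM = 15     # one-hot MeaningCloud entity category


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at position index."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def token_features(word_vec, pos_vec, char_vec, syll_vec, entity_index):
    """Concatenate all per-token features into one flat vector."""
    entity_vec = one_hot(entity_index, ENTITY_DIM)
    return word_vec + pos_vec + char_vec + syll_vec + entity_vec


# Dummy vectors stand in for real embeddings; entity category 3 is hypothetical.
features = token_features(
    [0.1] * WORD_DIM, [0.2] * POS_DIM, [0.3] * CHAR_DIM, [0.4] * SYLL_DIM, 3
)
print(len(features))
```
        </preformat>
        <p>Under these assumptions each token is represented by 300 + 40 + 50 + 50 + 15 = 455 values.</p>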
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Deep Learning Model</title>
        <p>The implemented model, shown in Figure 1, works with two maximum sequence
lengths. First, a bidirectional LSTM layer receives the input features for a
maximum sequence length of 10, with the character and syllable sequences first processed by
a convolutional and global max pooling block. The reshaped output of this layer is concatenated with the
input features, this time for a maximum sequence length of 50, and serves as the input
to a second BiLSTM. This layer connects directly to a fully connected dense layer with a tanh
activation function.</p>
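        <p>The convolutional and global max pooling block can be sketched in plain Python as follows. This is a simplified illustration rather than the Keras layers used in our system: the helper names <monospace>pad</monospace> and <monospace>conv1d_global_max</monospace> are hypothetical, the convolution has no bias or activation, and the tiny embeddings and kernels are made-up values chosen so the result can be checked by hand.</p>
        <preformat>
```python
def pad(sequence, max_len, pad_id=0):
    """Truncate or right-pad a list of ids to exactly max_len items."""
    return (sequence + [pad_id] * max_len)[:max_len]


def conv1d_global_max(embedded, filters, width=3):
    """Apply each filter at every valid position and keep its maximum.

    embedded: list of T vectors of dimension d (T must be at least width).
    filters:  list of F kernels, each a list of `width` vectors of dimension d.
    Returns a list of F values, one per filter.
    """
    positions = len(embedded) - width + 1
    pooled = []
    for kernel in filters:
        responses = []
        for t in range(positions):
            total = 0.0
            for k in range(width):
                for value, weight in zip(embedded[t + k], kernel[k]):
                    total += value * weight
            responses.append(total)
        pooled.append(max(responses))   # global max pooling over positions
    return pooled


# Four characters embedded in 2 dimensions, two filters of size 3.
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0], [-1.0, -1.0]],
]
print(pad([5, 6, 7], 5))
print(conv1d_global_max(embedded, filters))
```
        </preformat>
        <p>Whatever the length of the word, the block returns one value per filter, which is why 50 filters yield a fixed 50-dimensional representation for each token.</p>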
        <p>The last layer (the CRF optimization layer) is a conditional random field. It
receives the sequence of probabilities produced by the previous layer and improves the
predictions by taking into account the dependencies between
the different labels. Its output is the most probable sequence of labels.</p>
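        <p>The effect of modelling label dependencies can be illustrated with Viterbi decoding, the standard way of finding the best label sequence in a linear-chain CRF. The sketch below is not taken from our system: it assumes a reduced label set (O, B, E), hand-written transition scores in which a large negative value stands in for a disallowed transition, and made-up per-token scores.</p>
        <preformat>
```python
def viterbi(emissions, transitions):
    """Return the highest-scoring sequence of label indices.

    emissions[t][j]:   score of label j at position t.
    transitions[i][j]: score of moving from label i to label j.
    """
    num_labels = len(emissions[0])
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_labels):
            best_prev = max(range(num_labels),
                            key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + emission[j])
        scores = new_scores
        backpointers.append(pointers)
    best = max(range(num_labels), key=lambda j: scores[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]            # follow the stored best predecessor
        path.append(best)
    return path[::-1]


LABELS = ["O", "B", "E"]
FORBIDDEN = -1000.0   # illustrative penalty for a disallowed transition
transitions = [
    [0.0, 0.0, FORBIDDEN],        # from O: E cannot follow O
    [FORBIDDEN, FORBIDDEN, 0.0],  # from B: only E may follow
    [0.0, 0.0, FORBIDDEN],        # from E: E cannot follow E
]
emissions = [
    [2.0, 1.0, 0.0],
    [0.5, 0.0, 2.0],
    [2.0, 0.0, 0.0],
]
greedy = [LABELS[max(range(3), key=lambda j: row[j])] for row in emissions]
decoded = [LABELS[i] for i in viterbi(emissions, transitions)]
print(greedy)
print(decoded)
```
        </preformat>
        <p>Choosing the best label for each token independently gives the invalid sequence O, E, O, whereas decoding with the transition scores gives B, E, O.</p>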
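        <p>The hyperparameters of our model are listed below:</p>
        <list list-type="bullet">
          <list-item><p>Maximum character sequence length: 30</p></list-item>
          <list-item><p>Maximum syllable sequence length: 10</p></list-item>
          <list-item><p>Character convolutional filters: 50 filters of size 3</p></list-item>
          <list-item><p>Syllable convolutional filters: 50 filters of size 3</p></list-item>
          <list-item><p>First BiLSTM hidden state dimension: 300 for the forward and backward layers</p></list-item>
          <list-item><p>Second BiLSTM hidden state dimension: 300 for the forward and backward layers</p></list-item>
          <list-item><p>Dense layer units: 200</p></list-item>
          <list-item><p>Dropout: 0.4</p></list-item>
          <list-item><p>Optimizer: Adam [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], learning rate: 0.001</p></list-item>
          <list-item><p>Number of epochs: 5</p></list-item>
        </list>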
        <p>
          The system was developed in Python 3 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] with Keras 2.2.4 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and TensorFlow 1.14.0 [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In our experiments, we use the standard measures of precision, recall, and micro-averaged
F1-score to evaluate the performance of our model. These are also the
evaluation metrics adopted in the CANTEMIST task.</p>
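      <p>For reference, these measures can be computed as in the following sketch. It is not the official CANTEMIST evaluation script: it assumes that a predicted entity counts as correct only when its character span matches a gold span exactly, and the document identifiers and spans are invented for the example.</p>
      <preformat>
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over all documents.

    gold and predicted map a document id to a set of (start, end) spans.
    """
    true_pos = sum(
        len(gold[doc].intersection(predicted.get(doc, set()))) for doc in gold
    )
    num_pred = sum(len(spans) for spans in predicted.values())
    num_gold = sum(len(spans) for spans in gold.values())
    precision = true_pos / num_pred if num_pred else 0.0
    recall = true_pos / num_gold if num_gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Invented example: 3 gold entities, 4 predictions, 2 exact matches.
gold = {"doc1": {(10, 25), (40, 52)}, "doc2": {(5, 12)}}
predicted = {"doc1": {(10, 25), (40, 60)}, "doc2": {(5, 12), (30, 35)}}
precision, recall, f1 = micro_prf(gold, predicted)
print(precision, recall, f1)
```
      </preformat>
      <p>Micro-averaging pools the counts of all documents before computing the ratios, so documents with many entities weigh more than documents with few.</p>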
      <p>We used the training set to train the model and the development set to
tune the hyperparameters. In the prediction stage, we combined the training and development
sets to train the model. The detailed hyperparameter settings are listed in Table 2, where ‘Opt.’
denotes the optimal value.</p>
      <p>Using the optimal hyperparameters, we applied our model to the test set
provided by CANTEMIST for the CANTEMIST-NER subtask. As Table 3 shows, our proposal
reaches an F1-score of 0.834, with a precision of 0.826 and a recall of 0.843. These results indicate
that a BiLSTM-CRF approach is well suited to this task.</p>
      <p>Furthermore, we manually analyzed the errors made by our system on the corpus
after the CANTEMIST-NER task. The main errors fall into three categories:</p>
      <list list-type="bullet">
        <list-item><p>Incorrect boundaries: This type of error is usually caused by including more
information than necessary, especially information about the location of the tumor. For example,
in the test set our system predicts "nódulo pulmonar
contralateral" (contralateral pulmonary nodule) as an entity, while the gold standard annotation
is "nódulo pulmonar" (pulmonary nodule). The error also occurs when important
information, such as the stage of the tumor, is left out.</p></list-item>
        <list-item><p>Missing the tumor morphology mention (Missed): The system does not
recognize a tumor morphology mention in the clinical text. This often happens
with terminology that refers to another medical condition in other documents
but to tumor morphology in the current one. An
example is "lesión pulmonar" (lung lesion), which may
or may not refer to tumor morphology depending on the context.</p></list-item>
        <list-item><p>Incorrectly distinguishing the tumor morphology mention (Incorrectly
distinguished): The system labels as tumor morphology a mention that is not one. This often
happens with terminology that refers to tumor morphology in other documents but
to another medical condition in the current one. An example is "lesión ósea" (bone lesion), which
may or may not refer to tumor morphology depending on the context.</p></list-item>
      </list>
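      <p>The three categories can be approximated automatically by comparing predicted and gold character spans, as in the sketch below. Our analysis was manual, so this is only an illustrative approximation: the overlap-based rules, the function names and the example sentence are assumptions, and only the quoted mentions come from the error analysis.</p>
      <preformat>
```python
def overlaps(a, b):
    """True when two (start, end) spans share at least one character."""
    return min(a[1], b[1]) > max(a[0], b[0])


def categorize_errors(gold, predicted):
    """Group wrong predictions and unmatched gold spans into three categories."""
    errors = {"incorrect_boundaries": [], "missed": [], "incorrectly_distinguished": []}
    for span in sorted(predicted - gold):
        if any(overlaps(span, g) for g in gold):
            errors["incorrect_boundaries"].append(span)      # right place, wrong limits
        else:
            errors["incorrectly_distinguished"].append(span)  # no gold entity here
    for span in sorted(gold - predicted):
        if not any(overlaps(span, p) for p in predicted):
            errors["missed"].append(span)                    # gold entity never touched
    return errors


def span_of(text, mention):
    """Character span of the first occurrence of mention in text."""
    start = text.index(mention)
    return (start, start + len(mention))


# Hypothetical sentence built around the mentions discussed in the text.
text = ("Se observa un nódulo pulmonar contralateral, "
        "una lesión pulmonar y una lesión ósea.")
gold = {span_of(text, "nódulo pulmonar"), span_of(text, "lesión pulmonar")}
predicted = {span_of(text, "nódulo pulmonar contralateral"), span_of(text, "lesión ósea")}
errors = categorize_errors(gold, predicted)
print(errors)
```
      </preformat>
      <p>A gold span that is partially covered by a prediction is counted once, through the prediction, as an incorrect boundary and not additionally as a missed mention.</p>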
      <p>Table 4 details the proportion of errors made by our system in the predictions on the test set
in the CANTEMIST-NER task.</p>
      <p>From these error examples, we infer that document-level information may help
our system.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Due to the exponential increase in electronic health records, the need to develop systems capable
of automatically identifying and classifying entities in order to facilitate the use of electronic
information and resources has become stronger.</p>
      <p>The shared task CANTEMIST is focused on the recognition of tumour morphology entities
with the aim of contributing to oncological and clinical research and collecting potentially
important clinical variables for cancer treatments hidden in medical texts, which are essential
to promote health quality improvements and personalised medicine in a more systematised
way for the advancement of cancer care.</p>
      <p>In this document, we propose a system based on deep learning with bidirectional LSTM and
CRF layers for the NER task. Our system achieves a micro-F1 of 83.4%.</p>
      <p>In future work, our goal is to explore different ways to include document-level information,
as well as to examine other deep learning architectures and other types of embeddings, such as
contextual embeddings.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the Research Program of the Ministry of Economy and
Competitiveness - Government of Spain, (DeepEMR project TIN2017-87548-C2-1-R).</p>
    </sec>
    <sec id="sec-7">
      <title>A. Online Resources</title>
      <p>The sources for the HULAT-UC3M participation are available on
GitHub: https://github.com/ssantamaria94/CANTEMIST-Participation</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Piad-Morffis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gutiérrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Consuegra-Ayala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Estevez-Velarde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Almeida-Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Muñoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Montoyo</surname>
          </string-name>
          ,
          <article-title>Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2019</article-title>
          , in: IberLEF@SEPLN,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Miranda-Escalada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Farré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <article-title>Named entity recognition, concept normalization and clinical coding: Overview of the Cantemist track for cancer text mining in Spanish, corpus, guidelines, methods and results</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)</source>
          ,
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Genthial</surname>
          </string-name>
          ,
          <source>Tensorflow - named entity recognition</source>
          ,
          <year>2018</year>
          . URL: https://github.com/guillaumegenthial/tf_ner.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Miranda-Escalada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Farré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <article-title>Cantemist corpus: gold standard of oncology clinical cases annotated with CIE-O 3 terminology</article-title>
          ,
          <year>2020</year>
          . URL: https://doi.org/10.5281/zenodo.3978041. doi:
          <pub-id pub-id-type="doi">10.5281/zenodo.3978041</pub-id>
          . Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ratinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>Design challenges and misconceptions in named entity recognition</article-title>
          ,
          <source>in: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009)</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <article-title>Enriching word vectors with subword information</article-title>
          ,
          <source>arXiv preprint arXiv:1607.04606</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Soares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Villegas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gonzalez-Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Armengol-Estapé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barzegar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <source>FastText Spanish medical embeddings</source>
          ,
          <year>2020</year>
          . URL: https://doi.org/10.5281/zenodo.3744326. doi:
          <pub-id pub-id-type="doi">10.5281/zenodo.3744326</pub-id>
          . Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic optimization</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Van Rossum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Drake</surname>
          </string-name>
          , Python 3 Reference Manual, CreateSpace, Scotts Valley, CA,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          , et al.,
          <source>Keras</source>
          ,
          <year>2015</year>
          . URL: https://github.com/fchollet/keras.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Devin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Irving</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Isard</surname>
          </string-name>
          , et al.,
          <article-title>Tensorflow: A system for large-scale machine learning</article-title>
          ,
          <source>in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>265</fpage>
          -
          <lpage>283</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>