<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Clinical NER using Spanish BERT Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ramya Vunikili</string-name>
          <email>ramya.vunikili@siemens-healthineers.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Supriya H N</string-name>
          <email>supriya.hn@siemens-healthineers.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasile George Marica</string-name>
          <email>george.marica@siemens.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oladimeji Farri</string-name>
          <email>oladimeji.farri@siemens-healthineers.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Digital Technology &amp; Innovation</institution>
          ,
          <addr-line>Siemens Healthineers, Bangalore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Digital Technology &amp; Innovation</institution>
          ,
          <addr-line>Siemens Healthineers, NJ</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Siemens</institution>
          ,
          <addr-line>Brasov</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents an overview of a transfer learning-based approach to the Named Entity Recognition (NER) sub-task of the Cancer Text Mining Shared Task (CANTEMIST), conducted as part of the Iberian Languages Evaluation Forum (IberLEF) 2020. We explore the use of contextual embeddings from Bidirectional Encoder Representations from Transformers (BERT), trained on general-domain Spanish text, to extract tumor morphology mentions from clinical reports written in Spanish. We achieve an F1 score of 73.4% on NER without any feature-engineered or rule-based approaches, and present our work as inspiration for further research on this task.</p>
      </abstract>
      <kwd-group>
        <kwd>Bidirectional Encoder Representations</kwd>
        <kwd>BERT</kwd>
        <kwd>NER</kwd>
        <kwd>IberLEF 2020</kwd>
        <kwd>Spanish embeddings</kwd>
        <kwd>BETO</kwd>
        <kwd>CANTEMIST</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        BETO, a BERT model pre-trained on a large Spanish corpus [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], faithfully replicates the architecture behind the seminal Transformer-based contextualized embeddings [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and is enhanced through training techniques such as dynamic masking [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and whole-word masking. As an example, Figure 1 shows the embedding of a Spanish sentence from the CANTEMIST corpus.
      </p>
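      <p>As a concrete illustration (not taken from the paper), the following minimal Python sketch obtains BETO contextual embeddings for a Spanish sentence with the HuggingFace transformers library. The model identifier dccuchile/bert-base-spanish-wwm-cased refers to the publicly released cased BETO checkpoint, and the example sentence is invented rather than drawn from the corpus.</p>
      <preformat><![CDATA[
# Minimal sketch: contextual word-piece embeddings for a Spanish sentence
# using BETO via the HuggingFace transformers library. The model identifier
# below is the publicly released cased BETO checkpoint; the sentence is an
# invented example, not one from the CANTEMIST corpus.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "dccuchile/bert-base-spanish-wwm-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

sentence = "Se observa una neoplasia maligna en el lobulo superior derecho."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per word-piece; hidden size is 768 for the base model.
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
print(outputs.last_hidden_state.shape)  # (1, num_word_pieces, 768)
]]></preformat>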
      <p>
        Moreover, since BETO outperformed multilingual BERT (M-BERT) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] on seven of the eight NLP tasks evaluated in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we chose BETO as the base model for the CANTEMIST NER task.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Contextualized language models have provided improved performance for a myriad of NLP tasks
by relying on a common deep network architecture. These models are often trained on a single large
corpus of multilingual, general domain texts with subsequent fine-tuning on specific data sets through
transfer learning.</p>
      <p>
        One important reference in this field is the BERT language representation model, which serves as the basis for many zero-shot cross-lingual transfer approaches. Trained on the Wikipedias of the top 104 languages, multilingual BERT has proven competitive in many NLP tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Despite not benefiting from cross-lingual alignment, M-BERT outperforms models based on cross-lingual embeddings [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        This adaptability of M-BERT to various NLP tasks has been investigated and explained through the overlapping effect of word-pieces across different languages: common nouns, word roots, numbers, and URLs are mapped to a shared embedding space, yielding co-occurring pieces [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Another study on the cross-lingual ability of BERT concludes that performance is relatively invariant with respect to word-piece overlap or multi-head attention complexity [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and suggests that the true versatility comes from greater network depth or a higher structural and semantic similarity between languages.
      </p>
      <p>
        Starting from the hypothesis that different languages share a common structural core to which M-BERT adapts during training, [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] split an M-BERT sentence representation into a language-neutral (language-agnostic) component and a language-specific component. Through a series of tasks oriented towards language identification, language similarity, parallel sentence retrieval and word alignment, this study concludes that the core cross-lingual representations are not neutral/general enough to mirror similar semantic structure. Consequently, multilingual embeddings are not good enough to solve difficult NLP tasks after zero-shot transfer learning.
      </p>
      <p>
        In the same vein, an extensive study [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] of the internal structure of M-BERT used canonical correlation analysis [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to compare representations across multiple languages. By looking at the similarity of deep-layer representations, a divergence pattern was identified: M-BERT was not simply mapping different languages into the same space, but was instead reflecting “linguistic and evolutionary relationships”. Embedding similarity was mostly identified at the word-piece level rather than under word or character tokenization, with Romance and Germanic languages clustered into different branches of the network.
      </p>
      <p>
        A more targeted approach for transfer learning would be the identification of language families, where word-piece overlap and similar grammatical structure preserve the compact nature of a semantic representation. English-to-Spanish transfer learning has been shown to improve POS tagging performance when labeled data is scarce [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and to improve NER when referring to proper nouns or niche concepts [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Where data is available in large quantities for individual languages, it is advisable to combine language-specific word representations with language-family models [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>Considering these findings, we believe that multilingual contextualized embeddings are not optimal for NLP tasks where either word-piece overlap or semantic structure similarity is not sufficiently high between the pre-training corpus and the task corpus. We therefore searched for a pre-trained BERT model that closely matches the CANTEMIST dataset. Ideally, such a model would have been pre-trained on Spanish EHR documents (labelled and/or unlabelled). However, we decided to explore the performance of a model trained on general-domain Spanish text with fine-tuning, as the results can provide additional evidence to support the hypothesis that linguistic and evolutionary relationships can be learned in one domain and transferred to another.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset and Experiments</title>
      <p>Our task is the automatic named entity recognition of tumor morphology mentions in plain-text medical documents.</p>
      <p>The CANTEMIST dataset contains 6,933 de-identified clinical documents annotated for mentions related to tumor morphology, denoted by the entity MORFOLOGIA_NEOPLASIA, using the BRAT tool [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The annotations follow well-established guidelines published by the Spanish Ministry of Health and were made by clinical coding experts according to eCIE-O-3.1 codes (https://eciemaps.mscbs.gob.es/ecieMaps/browser/index_o_3.html), after multiple iterations of quality control and annotation-consistency checks. The selection of reports faithfully reflects the narrative of electronic clinical reports. Table 2 summarises the data splits used as train, development and test sets, along with the average number of tokens per report in each of these sets.</p>
      <p>As a pre-processing step, all reports are lower-cased and tokenized into either sentences or sections so as to keep each sequence at no more than 512 tokens. The sentence tokenizations are further broken down into word-level tokens such that the start and end offsets of these tokens with respect to the original report are preserved. These word-level tokens are then encoded in BILOU format and given as input to fine-tune the BERT model on the CANTEMIST dataset (a minimal sketch of this encoding step is shown below). At prediction time, all tokens are encoded as O, since the ground truth is not provided. The output of the BERT model is then gathered and post-processed into BRAT format. Figure 2 shows an overview of the prediction pipeline.</p>
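      <p>The following illustrative sketch (not the authors' code) shows how character-offset entity annotations from the BRAT annotation files could be mapped onto word-level BILOU tags. The function name, variable names and example offsets are hypothetical.</p>
      <preformat><![CDATA[
# Illustrative sketch (not the authors' code): converting character-offset
# entity annotations into word-level BILOU tags. Token offsets are assumed to
# be (start, end) pairs relative to the original report, as produced during
# sentence/section splitting; entity spans come from the BRAT annotations.
def bilou_encode(token_offsets, entity_spans):
    """token_offsets: list of (start, end); entity_spans: list of (start, end)."""
    tags = ["O"] * len(token_offsets)
    for ent_start, ent_end in entity_spans:
        inside = [i for i, (s, e) in enumerate(token_offsets)
                  if s >= ent_start and e <= ent_end]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = "U-MORFOLOGIA_NEOPLASIA"
        else:
            tags[inside[0]] = "B-MORFOLOGIA_NEOPLASIA"
            for i in inside[1:-1]:
                tags[i] = "I-MORFOLOGIA_NEOPLASIA"
            tags[inside[-1]] = "L-MORFOLOGIA_NEOPLASIA"
    return tags

# Example: "carcinoma ductal infiltrante" annotated as a single entity span.
tokens = [(0, 9), (10, 16), (17, 28)]
print(bilou_encode(tokens, [(0, 28)]))  # ['B-...', 'I-...', 'L-...']
]]></preformat>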
      <p>The BERT model is fine-tuned using the AllenNLP platform [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] on an NVIDIA Tesla V100 (32 GB) GPU for 40 epochs, on the shuffled set composed of the train, dev1 and dev2 data. Prediction is carried out on both the test and background sets. The hyper-parameters of the best model are summarised in Table 3.</p>
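      <p>The fine-tuning itself was done with AllenNLP; purely as a hedged illustration of the same token-classification setup, the sketch below uses the HuggingFace transformers Trainer instead. The batch size and learning rate shown are placeholders rather than the values in Table 3, and the single-sentence toy dataset merely stands in for the BILOU-encoded CANTEMIST splits.</p>
      <preformat><![CDATA[
# Hedged sketch of fine-tuning BETO for token classification. The paper uses
# AllenNLP; this shows an equivalent setup with the transformers Trainer.
# Batch size and learning rate are placeholders, not the values in Table 3.
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "dccuchile/bert-base-spanish-wwm-cased"
LABELS = ["O", "B-MORFOLOGIA_NEOPLASIA", "I-MORFOLOGIA_NEOPLASIA",
          "L-MORFOLOGIA_NEOPLASIA", "U-MORFOLOGIA_NEOPLASIA"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID, num_labels=len(LABELS))

# Toy one-sentence dataset standing in for the BILOU-encoded CANTEMIST data
# (train + dev1 + dev2 shuffled together, as described above).
enc = tokenizer("Se observa una neoplasia maligna.", truncation=True)
enc["labels"] = [0] * len(enc["input_ids"])  # all "O" in this toy example
train_data = Dataset.from_dict({k: [v] for k, v in enc.items()})

args = TrainingArguments(
    output_dir="beto-cantemist",
    num_train_epochs=40,              # 40 epochs, as reported in the paper
    per_device_train_batch_size=8,    # placeholder
    learning_rate=3e-5,               # placeholder
)

Trainer(model=model, args=args, train_dataset=train_data).train()
]]></preformat>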
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Our system achieves an F1 score of 73.4% on the NER task, with predictions evaluated using the CANTEMIST evaluation library (https://github.com/TeMU-BSC/cantemist-evaluation-library). As seen in Figure 4, the majority of the overlapping vocabulary consists of suffixes such as '##s', '##l', '##al', '##a' and '##op', which carry little to no information related to the medical domain. Hence, the model struggled to differentiate between words such as mycoplasma (a bacterium) and neoplasm (an abnormal growth of cells), which resulted in labelling the former as a tumor-related entity. To avoid such issues, frequently occurring cancer-related vocabulary could be added to the unused tokens of the BETO vocabulary, so that the model initialises a distinct embedding for each such term irrespective of its suffix; a sketch of this idea is shown below.</p>
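      <p>The following hedged sketch illustrates that suggestion (it was not carried out in the paper). Rather than manually overwriting the reserved unused slots in the BETO vocabulary file, the transformers API can append new whole-word tokens and resize the embedding matrix, so that each added term receives its own randomly initialised embedding to be learned during fine-tuning. The term list below is purely illustrative.</p>
      <preformat><![CDATA[
# Illustrative sketch (not performed in the paper): extending the BETO
# vocabulary with frequently occurring cancer-related terms so that each term
# gets its own embedding instead of being split into uninformative suffixes.
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_ID = "dccuchile/bert-base-spanish-wwm-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=5)

# Hypothetical domain terms; in practice these would be mined from the corpus.
domain_terms = ["adenocarcinoma", "carcinoma", "neoplasia", "metastasis"]
num_added = tokenizer.add_tokens(domain_terms)

# New embedding rows are randomly initialised and learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))

print(f"Added {num_added} new tokens; vocabulary size is now {len(tokenizer)}")
print(tokenizer.tokenize("adenocarcinoma ductal infiltrante"))
]]></preformat>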
    </sec>
    <sec id="sec-5">
      <title>5. Future Work</title>
      <p>As Spanish and English are syntactically similar, it might be safe to assume that some of the architectures that work well for English also translate well to Spanish. One such model, based on BERT and dynamic span graphs, is DyGIE++ [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. As a next step, we plan to apply this architecture to CANTEMIST using the BETO embeddings.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Miranda-Escalada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Farré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <article-title>Named entity recognition, concept normalization and clinical coding. Overview of the CANTEMIST track for cancer text mining in Spanish, Corpus, Guidelines, Methods</article-title>
          and Results,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cañete</surname>
          </string-name>
          , G. Chaperon,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <article-title>Spanish pre-trained bert model and evaluation data</article-title>
          ,
          <source>in: Practical ML for Developing Countries Workshop@ ICLR</source>
          <year>2020</year>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          , arXiv preprint arXiv:
          <year>1907</year>
          .
          <volume>11692</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dredze</surname>
          </string-name>
          ,
          <article-title>Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT</article-title>
          , arXiv preprint arXiv:
          <year>1904</year>
          .
          <volume>09077</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Turban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hamblin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Y.</given-names>
            <surname>Hammerla</surname>
          </string-name>
          ,
          <article-title>Offline bilingual word vectors, orthogonal transformations and the inverted softmax</article-title>
          ,
          <source>arXiv preprint arXiv:1702.03859</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pires</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schlinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garrette</surname>
          </string-name>
          ,
          <article-title>How multilingual is Multilingual BERT?</article-title>
          , arXiv preprint arXiv:
          <year>1906</year>
          .
          <volume>01502</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] K. Karthikeyan, Z. Wang, S. Mayhew, D. Roth, Cross-lingual ability of multilingual BERT: An empirical study, in: International Conference on Learning Representations, 2019.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. Libovický, R. Rosa, A. Fraser, How language-neutral is Multilingual BERT?, arXiv preprint arXiv:1911.03310 (2019).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Singh, B. McCann, R. Socher, C. Xiong, BERT is not an interlingua and the bias of tokenization, in: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), 2019, pp. 47–55.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] H. Hotelling, Relations between two sets of variates, in: Breakthroughs in Statistics, Springer, 1992, pp. 162–190.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Z. Yang, R. Salakhutdinov, W. W. Cohen, Transfer learning for sequence tagging with hierarchical recurrent networks, arXiv preprint arXiv:1703.06345 (2017).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] J. L. C. Zea, J. E. O. Luna, C. Thorne, G. Glavaš, Spanish NER with word representations and conditional random fields, in: Proceedings of the Sixth Named Entity Workshop, 2016, pp. 34–40.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] J.-K. Kim, Y.-B. Kim, R. Sarikaya, E. Fosler-Lussier, Cross-lingual transfer learning for POS tagging without cross-lingual resources, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2832–2838.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, brat: a web-based tool for NLP-assisted text annotation, in: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Avignon, France, 2012, pp. 102–107. URL: https://www.aclweb.org/anthology/E12-2021.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, L. S. Zettlemoyer, AllenNLP: A deep semantic natural language processing platform, 2017. arXiv:1803.07640.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] D. Wadden, U. Wennberg, Y. Luan, H. Hajishirzi, Entity, relation, and event extraction with contextualized span representations, in: EMNLP/IJCNLP, 2019.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>