<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Salud</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.ejca</article-id>
      <title-group>
        <article-title>Artificial Intelligence for Natural Language Processing of Clinical Text in Spanish for Real-World-Data Analysis (Text2RWD Project)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francisco J. Veredas</string-name>
          <email>franveredas@uma.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Gallego</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillermo López-García</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nuria Ribelles</string-name>
          <email>nuriaribelles@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emilio Alba</string-name>
          <email>ealbac@uma.es</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José M. Jerez</string-name>
          <email>jmjerez@uma.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Grupo de Inteligencia Computacional en Biomedicina (ICB), Dept. Lenguajes y Ciencias de la Computación, Universidad de Málaga</institution>
          ,
          <addr-line>29071, Málaga</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Research Institute of Multilingual Language Technologies. Universidad de Málaga</institution>
          ,
          <addr-line>29071, Málaga</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Unidad de Gestión Clínica Intercentros de Oncología (UGCIO), Hospitales Universitarios Regional y Virgen de la Victoria</institution>
          ,
          <addr-line>Málaga</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>6</volume>
      <issue>2010</issue>
      <fpage>8</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>The Text2RWD project (funded by the Spanish Ministerio de Ciencia e Innovación, PID2020-116898RB-I00) aims to advance the creation and de-identification of a specific clinical corpus, which is expected to become a reference oncological text corpus in Spanish. Using this corpus, new artificial intelligence (AI) algorithms for natural language processing (NLP) in Spanish are being designed and adapted for downstream information-processing tasks on unstructured textual data stored in the oncological electronic health records (EHR) of Galén, a healthcare information management system. The resulting models are to be analyzed and validated by applying them to different clinically significant tasks through the analysis of real-world data in oncology units. The AI-for-NLP models are also expected to be transferred and applied to text corpora from other medical disciplines or healthcare settings, and validated on information extraction and prediction tasks in those specific areas.</p>
      </abstract>
      <kwd-group>
        <kwd>natural language processing</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>electronic health records</kwd>
        <kwd>real-world data</kwd>
        <kwd>oncology</kwd>
        <kwd>biomedical applications</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>“real-world data” or “data from the real world”, a concept that can be defined as all data related to the health status of patients or diagnostic and therapeutic procedures performed during daily clinical practice. Specifically, this concept refers to a huge volume of information of heterogeneous nature, in a wide variety of formats, which includes structured data of diverse clinical-pathological type and molecular expression data, as well as free-text documents (clinical notes on first encounters, subsequent encounters, emergency reports, pathological anatomy reports, etc.) that contain highly relevant information related to diagnoses, treatments and clinical procedures [1, 2, 3]. The extraction and exploitation of this unstructured information is currently one of the main challenges for health systems in general and oncology settings in particular, insofar as it is considered of vital importance for achieving the main objective of advanced medicine: a more personalized, proactive, preventive and predictive healthcare. It is, indeed, an ambitious objective that raises a series of questions for the scientific community that remain partially unsolved, so that in recent years there has been a growing interest in the exploitation of clinical texts through the application of natural language processing (NLP) and artificial intelligence (AI) techniques and algorithms.</p>
      <p>However, most previous studies applying NLP and AI to real-world data (RWD) analysis problems, such as clinical coding or automatic document classification, address texts in English, due to the limited availability of annotated corpora with clinical coding information and additional linguistic resources in languages other than English. Spanish has about 470 million native speakers, and there is growing interest in clinical text processing in this language. Proof of this are the “shared tasks” promoted by organizations such as CLEF eHealth Lab, for tasks such as automatic clinical coding on multilingual corpora or corpora in languages other than English [4, 5]. Specifically, the research team of this project has recently participated in the CANTEMIST task (CANcer TExt Mining Shared Task), promoted by the “Plan for the Promotion of Language Technologies” (Plan-TL) of the Secretaría de Estado de Digitalización e Inteligencia Artificial in Spain, for automatic labeling of oncological terms in clinical texts in Spanish related to the morphology of neoplasms (eCIE-O-3.1). The proposal of our research team [6], based on the use of the BERT attention model together with unstructured information stored in the Galén system, was awarded the first prize among 160 proposals from international research groups and companies.</p>
      <p>In general, the design of algorithms for annotation and automatic classification of clinical texts requires manual retrospective analysis of a large amount of unstructured information, for the annotation and labeling of the categories corresponding to the documents used in the design of classification models. In this sense, the availability of the Galén system, with information on more than 60,000 cancer patients, a total of approximately 600,000 documents corresponding to clinical episodes, and a significant number of structured fields (which are completed both in real time, during clinical care activity, and later by specific personnel in charge of this supervised task), supplies the research team with quality supervised information to address different types of classification and coding problems in the NLP field.</p>
      <p>Previously, and since 2003, our research team has been working actively in collaboration with the staff of the UGCIO on the design of predictive models of survival in cancer, with a large number of works published in journals in the fields of both oncology and AI, most of them using structured clinical-pathological and molecular expression data extracted from Galén. Specifically, in recent years, our team has made significant progress in the design of predictive models for the evolution of patients with metastatic breast cancer, based exclusively on unstructured clinical text contained in the electronic health records (EHRs) of Galén, using text mining techniques in combination with AI algorithms [2]. These results, together with those obtained in the design and adaptation of deep learning (DL) algorithms for small datasets, using techniques such as transfer learning (TL) and data augmentation (DA) [7, 8], have provided this research team with the necessary experience to incorporate unstructured information into the design of classification and prediction models in the field of oncology, through the application of AI techniques and algorithms to NLP. In fact, the incorporation of AI models into NLP tasks has significantly and surprisingly improved the efficiency of the classification and prediction algorithms applied in a wide variety of problems, such as conversational systems, automatic translation of texts, sentiment analysis in social networks and, of course, automatic document classification.</p>
      <p>Thus, the Text2RWD project aims to give our research team the possibility of advancing in i) the creation of a specific clinical de-identified corpus, expected to be an international reference for oncology in Spanish; ii) the design and adaptation of AI and NLP algorithms for their application to the processing of unstructured information contained in oncological EHRs in Spanish; iii) the application of the AI models to the resolution of different NLP downstream tasks on which the research team is currently working, in collaboration with UGCIO staff: detection/prediction of events (disease relapse, disease progression, patient death, prediction of emergency encounters, etc.), TNM classification of neoplasms and standardized ICD-10 coding; and iv) the transfer of the resulting models to the processing of EHR and text corpora in Spanish belonging to other medical disciplines, as well as to other healthcare settings.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Progress of the Text2RWD Project</title>
      <p>In a phase prior to the development of the Text2RWD project, progress was made in a line of research consisting of the application of TL approaches to problem solving in the biomedical field [9, 10, 7]. With this previous experience, the group has continued the same line of work by applying a TL-based strategy to address the problem of standardization of medical entities and automatic clinical coding of EHR texts in Spanish, using Transformer-based models. There is preliminary work by the ICB-UGCIO group itself in which BERT-based models have been applied with moderate results to specific clinical coding tasks in Spanish [11, 6]. However, a need was noted for studies that systematically analyze the performance of transformers in the Spanish medical setting, in particular for distinct clinical named-entity recognition problems. Moreover, there was also a lack of publicly available models based on transformers pre-trained on Spanish clinical corpora that could facilitate the adoption of these models in subsequent medical NLP tasks. All these reasons have justified the need to advance in this line of work, with the objectives and results presented below.</p>
      <sec id="sec-2-1">
        <title>2.1. EHR De-identification</title>
        <p>The primary objective in managing information within EHRs is the automatic extraction and masking of concepts associated with personally identifiable data. This process, known as de-identification, is not only an ethical imperative but also a legal mandate stipulated by data privacy laws. Both the General Data Protection Regulation (GDPR) of the European Union (EU) and the Ley Orgánica Española de Protección de Datos Personales y Garantía de Derechos Digitales (LOPD-GDD) of Spain specifically prohibit the processing of personal data unless identifiable information is appropriately masked. As part of the Text2RWD project, we advanced in the development of AI and NLP algorithms for the de-identification of Spanish EHRs to ensure compliance with the LOPD-GDD.</p>
        <p>For that purpose, in [8] we annotated a private corpus consisting of 599 real-world clinical cases with 8 distinct categories of protected health information. Addressing the predictive challenge as a named-entity recognition task, we developed two distinct methodologies rooted in DL. The first strategy employs recurrent neural networks (RNN), while the second adopts an end-to-end approach based on transformers. To augment the training data, we introduced a DA procedure, expanding the text corpus. Our findings indicate that transformers exhibit superior performance over RNN in the de-identification of Spanish clinical data. Notably, the XLM-RoBERTa-large Transformer demonstrated the best results, achieving a strict-match micro-averaged precision of 0.946, recall of 0.954, and an F1-score of 0.95 when trained on the augmented corpus [8]. The success of transformers in this study underscores their applicability in real-world clinical scenarios, showcasing the efficacy of these state-of-the-art models.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Oncology Text Classification</title>
        <p>A recurring challenge within oncology clinical analysis units is that the timely completion of first-visit reports is hindered by the limited availability of clinical staff. Specifically, the structured fields, such as neoplasm type, location, and histology, often lack comprehensive information due to time constraints. Consequently, accessing and utilizing the results becomes exceedingly challenging. Although the necessary information exists within the EHR, it is in an unstructured text format, dispersed throughout the EHR data rather than consolidated in dedicated electronic fields. The crucial task at hand involves automating the extraction of neoplasm type from the patient’s EHR text, enabling the oncology analysis unit to promptly direct the patient to the appropriate specialist.</p>
        <p>Within our Text2RWD project [12], we aim to advance the application of NLP models for the automated extraction of neoplasm types from EHRs written in Spanish. Our classification algorithms determine the likelihood of each document belonging to one of the three most prevalent neoplasms in the Galén information system (breast, colorectal, and lung) or being categorized as another type of neoplasm. The machine learning (ML) and DL models explored in this study included RNNs in conjunction with convolutional neural networks (CNN) and embedding models, which gave high performance rates in the classification task: 0.981 precision, 0.984 recall and 0.982 F1-score [12]. Notably, this study is the first of its kind, examining the application of NLP models to the task of extracting information about a patient’s neoplasm from real-world medical texts in Spanish.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Clinical coding in Spanish</title>
        <p>One of the most fundamental and challenging tasks in the medical NLP domain is automatic clinical coding. This task involves the conversion of unstructured clinical texts, written in specialized natural language, into structured formats that align with standardized coding terminologies, employing computational methods. In the Text2RWD project, we primarily focus on the problem of automatic clinical coding for documents in Spanish. However, most of the existing literature has centered on texts written in English. This can be attributed to the scarce availability of corpora annotated with standardized clinical coding labels and supplementary linguistic resources in languages other than English. Consequently, in addition to the intrinsic difficulties of coding medical texts mentioned above, clinical coding in Spanish requires dealing with the lack of textual resources crucial for training accurate automated systems. Hence, the lack of extensive training data in Spanish restricts the application of data-hungry DL methods, which have shown promising results in English clinical coding tasks. In the specific case of automatic coding of clinical texts in Spanish, there has been limited research.</p>
        <p>The main objective of our work in [13] for the Text2RWD project was to develop clinical coding models for medical documents in Spanish, adapting several Transformer models to the particularities of the Spanish healthcare environment. To this end, the models were pre-trained with the Galén corpus [2]. The resulting models were fine-tuned on three clinical coding tasks from evaluation campaigns (CodiEsp-D, CodiEsp-P, and Cantemist-Coding [14, 15]) using two public Spanish annotated clinical corpora [15, 6, 16, 14]. In this work, three Transformer-based models that support the Spanish language were explored: multilingual BERT (mBERT), BETO and XLM-RoBERTa. In order to adapt the transformers to the particularities presented by clinical coding tasks with small datasets coming from the real world (clinical practice data stored in EHRs), a multi-label sentence classification approach was developed in this study and served as a DA procedure. Following the proposed strategy, the trained transformers achieved a new state of the art (SOTA) in each of the three clinical coding tasks explored in this work.</p>
        <p>Table 1 shows the predictive results obtained in [13] for the three Transformer models analyzed: on the one hand, models without adaptation to the clinical-oncological domain (mBERT, BETO and XLM-R) and, on the other hand, models with adaptation to the clinical-oncological domain through pre-training with the Galén corpus (mBERT-Galén, BETO-Galén and XLM-R-Galén). The table also shows the results obtained by means of “expert committee” or “ensemble” strategies for the different models. As can be seen in the table, the best clinical coding results in the three tasks evaluated are obtained with the models adapted to the domain by pre-training with the Galén corpus. As expected, the model ensembles improve on the results obtained by the individual models.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Explainable clinical coding</title>
        <p>Few studies have delved into the explainability of ML and DL models for clinical coding. In our work for the Text2RWD project [17], several approaches based on multilingual Transformer models adapted to the clinical-oncological domain were developed and evaluated in order to address explainable clinical coding. These models have the ability to assign standardized disease and procedure codes to clinical documents, while providing detailed information about the specific text segments underlying the choice of each assigned code. To achieve this, the performance of two different multilingual Transformer models, namely XLM-RoBERTa and mBERT, as well as a Transformer-based model designed for the Spanish language, called BETO, was examined. These pre-trained models were adapted to a specific clinical domain by continued training with the Galén corpus, and then fine-tuned and evaluated on two explainable clinical coding tasks: CodiEsp-X [14] and Cantemist-Norm [15]. In addition, a comparison was made between two different training strategies for dealing with explainable clinical coding: a hierarchical approach versus a multi-task approach. In the first approach, a Transformer is initially trained on a medical named-entity recognition (MER) task to identify clinical entities, i.e. text fragments containing information relevant to medical diagnoses or procedures. The results of this MER Transformer are then used to train a second Transformer that deals with the medical entity normalization (MEN) task, assigning ICD-10 labels to the clinical entities previously identified by the first Transformer model. In the multi-task approach, the MER and MEN transformers are trained simultaneously.</p>
        <p>The hierarchical approach (MER→MEN) for explainable clinical coding demonstrated better performance compared to the multi-task approach. Furthermore, domain-adapted transformers were found to outperform their non-adapted counterparts in all scenarios evaluated in this study (see Table 2). The various multilingual Transformer models and training approaches proposed in [17] were evaluated on public datasets obtained from shared tasks related to explainable clinical coding, specifically CodiEsp-X [14] and Cantemist-Norm [15]. The results obtained in both tasks exceeded the SOTA for these tasks at the time of publication.</p>
      </sec>
    </sec>
    <sec id="sec-conclusions">
      <title>3. Conclusions and Future Work</title>
      <p>This article presents the latest advances made by the ICB-UGCIO research group in the Text2RWD project. The progress made in the development of AI and DL algorithms, and more specifically in the design and training of models based on the RNN and the Transformer architecture, has demonstrated their usefulness and effectiveness in classification and named-entity recognition tasks for text classification, de-identification and explainable clinical coding in Spanish. Domain adaptation strategies, in which language models are trained using multilingual general-purpose corpora and retrained using the Galén corpus, specific to the clinical-oncological domain, together with the use of TL approaches, have allowed us to obtain SOTA results in several competitive tasks in the field of clinical natural language processing in Spanish. Our latest unpublished results, which show recent advances in normalization of clinical entities on standardized ontologies like SNOMED CT or UMLS, are promising and augur the continuity of the progress made within the Text2RWD project in a line of research that is still new and which will remain active in the coming years. Finally, in the coming years the Text2RWD project will also advance in new predictive tasks on RWD based on AI and NLP of EHRs, such as TNM staging in cancer, and in their transfer to domains other than oncology, for which the algorithms were initially designed.</p>
    </sec>
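    <sec>
      <p>The de-identification results in Section 2.1 are reported as strict-match micro-averaged precision, recall and F1-score. As a minimal sketch of how such figures are commonly computed, and not the evaluation code used in [8], the following Python function pools true positives, false positives and false negatives over all documents and counts a predicted entity as correct only when its start offset, end offset and label all coincide with a gold annotation. The entity representation, the label names and the toy data are assumptions made purely for illustration.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
from typing import Dict, Set, Tuple

# Hypothetical entity representation: (start_offset, end_offset, label).
Entity = Tuple[int, int, str]


def strict_micro_prf(
    gold: Dict[str, Set[Entity]],
    pred: Dict[str, Set[Entity]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 under strict matching.

    A predicted entity is a true positive only if its start offset, end
    offset and label all equal those of a gold entity in the same document.
    Counts are pooled over all documents before the ratios are computed.
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(pred):
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # exact span-and-label matches
        fp += len(p - g)   # predicted entities with no exact gold match
        fn += len(g - p)   # gold entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Toy annotations (illustrative labels, not the categories used in [8]).
gold = {
    "doc1": {(0, 11, "NAME"), (25, 35, "DATE")},
    "doc2": {(5, 14, "HOSPITAL")},
}
pred = {
    "doc1": {(0, 11, "NAME"), (25, 34, "DATE")},  # DATE span ends one character early
    "doc2": {(5, 14, "HOSPITAL")},
}
precision, recall, f1 = strict_micro_prf(gold, pred)
```
]]></preformat>
      <p>In this toy example the DATE prediction overlaps the gold span but ends one character early, so under strict matching it counts both as a false positive and as a false negative. A more lenient overlap-based criterion would credit it, which is why strict-match scores are the more demanding of the two.</p>
    </sec>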
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <p>The authors acknowledge the support from the Ministerio de Ciencia e Innovación (MICINN) under project PID2020-116898RB-I00, from the Universidad de Málaga and Junta de Andalucía through grant UMA20-FEDERJA045, from the Malaga-Pfizer consortium for AI research in Cancer - MAPIC, and from the Instituto de Investigación Biomédica de Málaga – IBIMA (all including FEDER funds).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>