
Salud

10.1016/j.ejca

Artificial Intelligence for Natural Language Processing of Clinical Text in Spanish for Real-World-Data Analysis (Text2RWD Project)

Francisco J. Veredas

franveredas@uma.es 0 1

Fernando Gallego

0 1

Guillermo López-García

0 1

Nuria Ribelles

nuriaribelles@gmail.com 2

Emilio Alba

ealbac@uma.es 2

José M. Jerez

jmjerez@uma.es 0 1

0 Grupo de Inteligencia Computacional en Biomedicina (ICB), Dept. Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071, Málaga, Spain
1 Research Institute of Multilingual Language Technologies, Universidad de Málaga, 29071, Málaga, Spain
2 Unidad de Gestión Clínica Intercentros de Oncología (UGCIO), Hospitales Universitarios Regional y Virgen de la Victoria, Málaga, Spain

2020


The Text2RWD project, funded by the Spanish Ministerio de Ciencia e Innovación (PID2020-116898RB-I00), aims to advance the creation and de-identification of a specific clinical corpus, which is expected to become a reference oncological text corpus in Spanish. Using this corpus, new artificial intelligence (AI) algorithms for natural language processing (NLP) in Spanish are being designed and adapted for downstream information-processing tasks carried out on the unstructured textual data stored in oncological electronic health records (EHR) contained in Galén, a healthcare information management system. The resulting models are to be analyzed and validated by applying them to different clinically significant tasks through the analysis of real-world data in oncology units. The AI-for-NLP models are also expected to be transferred and applied to text corpora of other medical disciplines or healthcare settings, and validated on information extraction and prediction tasks in those specific areas.

Keywords: natural language processing, artificial intelligence, electronic health records, real-world data, oncology, biomedical applications

1. Introduction

“Real-world data” (RWD), or “data from the real world”, is a concept that can be defined as all data related to the health status of patients or to the diagnostic and therapeutic procedures performed during daily clinical practice. Specifically, this concept refers to a huge volume of heterogeneous information in a wide variety of formats, which includes structured clinical-pathological and molecular expression data as well as free-text documents (clinical notes on first encounters, subsequent encounters, emergency reports, pathological anatomy reports, etc.) that contain highly relevant information related to diagnoses, treatments and clinical procedures [1, 2, 3]. The extraction and exploitation of this unstructured information is currently one of the main challenges for health systems in general and oncology settings in particular, insofar as it is considered of vital importance for achieving the main objective of advanced medicine: a more personalized, proactive, preventive and predictive healthcare. This ambitious objective raises a series of questions for the scientific community that remain partially unsolved, and in recent years there has been growing interest in the exploitation of clinical texts through the application of natural language processing (NLP) and artificial intelligence (AI) techniques and algorithms.

However, most previous studies applying NLP and AI to RWD analysis problems, such as clinical coding or automatic document classification, address texts in English, due to the limited availability of annotated corpora with clinical coding information and of additional linguistic resources in languages other than English. As Spanish has about 470 million native speakers, there is growing interest in clinical text processing in this language. Proof of this are the “shared tasks” promoted by organizations such as CLEF eHealth Lab, for tasks such as automatic clinical coding on multilingual corpora or on corpora in languages other than English [4, 5]. Specifically, the research team of this project has recently participated in the CANTEMIST task (CANcer TExt Mining Shared Task), promoted by the “Plan for the Promotion of Language Technologies” (Plan-TL) of the Secretaría de Estado de Digitalización e Inteligencia Artificial in Spain, for the automatic labeling of oncological terms related to the morphology of neoplasms (eCIE-O-3.1) in clinical texts in Spanish. The proposal of our research team [6], based on the use of the BERT attention model together with unstructured information stored in the Galén system, was awarded first prize among 160 proposals from international research groups and companies.

In general, the design of algorithms for the annotation and automatic classification of clinical texts requires the manual retrospective analysis of a large amount of unstructured information, in order to annotate and label the categories corresponding to the documents used in the design of classification models. In this sense, the availability of the Galén system, with information on more than 60,000 cancer patients, a total of ∼600,000 documents corresponding to clinical episodes and a significant number of structured fields (which are completed both in real time, during clinical care activity, and later by specific personnel in charge of this supervised task), supplies the research team with quality supervised information to address different types of classification and coding problems in the NLP field. Since 2003, our research team has been working actively in collaboration with the staff of the UGCIO on the design of predictive models of survival in cancer, with a large number of works published in journals in the fields of both oncology and AI, most of them using structured clinical-pathological and molecular expression data extracted from Galén. Specifically, in recent years our team has made significant progress in the design of predictive models for the evolution of patients with metastatic breast cancer, based exclusively on unstructured clinical text contained in the EHRs of Galén, using text mining techniques in combination with AI algorithms [2]. These results, together with those obtained in the design and adaptation of deep learning (DL) algorithms for small datasets, using techniques such as transfer learning (TL) and data augmentation (DA) [7, 8], have provided this research team with the necessary experience to incorporate unstructured information into the design of classification and prediction models in the field of oncology, through the application of AI techniques and algorithms to NLP. In fact, the incorporation of AI models into NLP tasks has significantly and surprisingly improved the efficiency of the classification and prediction algorithms applied to a wide variety of problems, such as conversational systems, automatic translation of texts, sentiment analysis in social networks and, of course, automatic document classification.

Thus, the Text2RWD project aims to give our research team the possibility of advancing in: i) the creation of a specific de-identified clinical corpus, expected to become an international reference for oncology in Spanish; ii) the design and adaptation of AI and NLP algorithms for the processing of unstructured information contained in oncological EHRs in Spanish; iii) the application of the AI models to different downstream NLP tasks on which the research team is currently working, in collaboration with UGCIO staff: detection/prediction of events (disease relapse, disease progression, patient death, emergency encounters, etc.), TNM classification of neoplasms and standardized ICD-10 coding; and iv) the transfer of the resulting models to the processing of EHRs and text corpora in Spanish belonging to other medical disciplines, as well as to other healthcare settings.

2. Progress of the Text2RWD Project

2.1. EHR De-identification

The primary objective in managing information within EHRs is the automatic extraction and masking of concepts associated with personally identifiable data. This process, known as de-identification, is not only an ethical imperative but also a legal mandate stipulated by data privacy laws. Both the General Data Protection Regulation (GDPR) of the European Union (EU) and the Ley Orgánica Española de Protección de Datos Personales y Garantía de Derechos Digitales (LOPD-GDD) of Spain specifically prohibit the processing of personal data unless identifiable information is appropriately masked. As part of the Text2RWD project, we advanced in the development of AI and NLP algorithms for the de-identification of Spanish EHRs to ensure compliance with the LOPD-GDD.

For that purpose, in [8] we annotated a private corpus consisting of 599 real-world clinical cases with 8 distinct categories of protected health information. Addressing the predictive challenge as a named-entity recognition task, we developed two distinct methodologies rooted in DL. The first strategy employs recurrent neural networks (RNN), while the second adopts an end-to-end approach based on transformers. To augment the training data, we introduced a DA procedure that expands the text corpus. Our findings indicate that transformers outperform RNNs in the de-identification of Spanish clinical data. Notably, the XLM-RoBERTa-large Transformer demonstrated the best results, achieving a strict-match micro-averaged precision of 0.946, recall of 0.954, and F1-score of 0.95 when trained on the augmented corpus [8]. The success of transformers in this study underscores their applicability in real-world clinical scenarios, showcasing the efficacy of these state-of-the-art models.

2.2. Oncology Text Classification

In a phase prior to the development of the Text2RWD project, progress was made in a line of research consisting of the application of TL approaches to problem solving in the biomedical field [9, 10, 7]. Building on this experience, the group has continued the same line of work by applying a TL-based strategy to the standardization of medical entities and the automatic clinical coding of EHR texts in Spanish, using Transformer-based models. Preliminary work by the ICB-UGCIO group itself applied BERT-based models, with moderate results, to specific clinical coding tasks in Spanish [11, 6]. However, a need was noted for studies that systematically analyze the performance of transformers in the Spanish medical setting, in particular for distinct clinical named-entity recognition problems. Moreover, there was a lack of publicly available Transformer-based models pre-trained on Spanish clinical corpora that could facilitate their adoption in subsequent medical NLP tasks. All these reasons justify the need to advance in this line of work, with the objectives and results presented below.

Oncology clinical analysis units face a recurring challenge: the timely completion of first-visit reports is hindered by the limited availability of clinical staff. Specifically, structured fields such as neoplasm type, location and histology are often left incomplete due to time constraints, which makes accessing and using this information exceedingly difficult. Although the necessary information exists within the EHR, it is in unstructured text format, dispersed throughout the EHR data rather than consolidated in dedicated electronic fields. The crucial task is therefore to automate the extraction of the neoplasm type from the patient’s EHR text, enabling the oncology analysis unit to promptly direct the patient to the appropriate specialist.

Within our Text2RWD project [12], we aim to advance the application of NLP models for the automated extraction of neoplasm types from EHRs written in Spanish. Our classification algorithms determine the likelihood of each document belonging to one of the three most prevalent neoplasms in the Galén information system (breast, colorectal and lung) or being categorized as another type of neoplasm. The machine learning (ML) and DL models explored in this study included RNNs in conjunction with convolutional neural networks (CNN) and embedding models, which gave high performance rates in the classification task: 0.981 precision, 0.984 recall and 0.982 F1-score [12]. Notably, this study is the first of its kind, examining the application of NLP models to the task of extracting information about a patient’s neoplasm from real-world medical texts in Spanish.

2.3. Clinical coding in Spanish

One of the most fundamental and challenging tasks in the medical NLP domain is automatic clinical coding. This task involves the use of computational methods to convert unstructured clinical texts, written in specialized natural language, into structured formats that align with standardized coding terminologies. In the Text2RWD project, we primarily focus on the problem of automatic clinical coding for documents in Spanish. However, most of the existing literature has centered on texts written in English. This can be attributed to the scarce availability of corpora annotated with standardized clinical coding labels and of supplementary linguistic resources in languages other than English. Consequently, in addition to the intrinsic difficulties of coding medical texts mentioned above, clinical coding in Spanish requires dealing with the lack of textual resources crucial for training accurate automated systems. Hence, the lack of extensive training data in Spanish restricts the application of data-hungry DL methods, which have shown promising results in English clinical coding tasks. In the specific case of automatic coding of clinical texts in Spanish, there has been limited research.

The main objective of our work in [13] for the Text2RWD project was to develop clinical coding models for medical documents in Spanish, adapting several Transformer models to the particularities of the Spanish healthcare environment. To this end, the models were pre-trained with the Galén corpus [2]. The resulting models were refined on three clinical coding tasks from evaluation campaigns (CodiEsp-D, CodiEsp-P and Cantemist-Coding [14, 15]) using two public Spanish annotated clinical corpora [15, 6, 16, 14]. In this work, three Transformer-based models that support the Spanish language were explored: multilingual BERT (mBERT), BETO and XLM-RoBERTa. In order to adapt the transformers to the particularities of clinical coding tasks with small real-world datasets (clinical practice data stored in EHRs), a multi-label sentence classification approach was developed in this study and served as a DA procedure. Following the proposed strategy, the trained transformers achieved a new state of the art (SOTA) in each of the three clinical coding tasks explored in this work.

Table 1 shows the predictive results obtained in [13] for the three Transformer models analyzed: on the one hand, models without adaptation to the clinical-oncological domain (mBERT, BETO and XLM-R) and, on the other hand, models adapted to the clinical-oncological domain through pre-training with the Galén corpus (mBERT-Galén, BETO-Galén and XLM-R-Galén). The table also shows the results obtained by means of “expert committee” or “ensemble” strategies for the different models. As can be seen in the table, the best clinical coding results in the three tasks evaluated are obtained with the models adapted to the domain by pre-training with the Galén corpus. As expected, the model ensembles improve on the results obtained by the individual models.

2.4. Explainable clinical coding

Few studies have delved into the explainability of ML and DL models for clinical coding. In our work for the Text2RWD project [17], several approaches based on Transformer models, multilingual and adapted to the clinical-oncology domain, were developed and evaluated in order to address explainable clinical coding. These models can assign standardized disease and procedure codes to clinical documents while providing detailed information about the specific text segments underlying the choice of each assigned code. To achieve this, the performance of two different multilingual Transformer models, namely XLM-RoBERTa and mBERT, as well as a Transformer-based model designed for the Spanish language, called BETO, was examined. These pre-trained models were adapted to a specific clinical domain by continued training with the Galén corpus, and then refined and evaluated on two explainable clinical coding tasks: CodiEsp-X [14] and Cantemist-Norm [15].

In addition, a comparison was made between two different training strategies for dealing with explainable clinical coding: a hierarchical approach versus a multi-task approach. In the first approach, a Transformer is initially trained on a medical entity recognition (MER) task to identify clinical entities, i.e. text fragments containing information relevant to medical diagnoses or procedures. The results of this MER Transformer are then used to train a second Transformer that deals with the medical entity normalization (MEN) task, assigning ICD-10 labels to the clinical entities previously identified by the first Transformer model. In the multi-task approach, the MER and MEN transformers are trained simultaneously.

The hierarchical approach (MER->MEN) for explainable clinical coding demonstrated better performance than the multi-task approach. Furthermore, domain-adapted transformers were found to outperform their non-adapted counterparts in all scenarios evaluated in this study (see Table 2). The various multilingual Transformer models and training approaches proposed in [17] were evaluated on public datasets obtained from shared tasks related to explainable clinical coding, specifically CodiEsp-X [14] and Cantemist-Norm [15]. The results obtained in both tasks exceeded the SOTA for these tasks at the time of publication.

3. Conclusions and Future Work

This article presents the latest advances made by the ICB-UGCIO research group in the Text2RWD project. On the one hand, the progress made in the development of AI and DL algorithms, and more specifically in the design and training of models based on RNNs and the Transformer architecture, has demonstrated their usefulness and effectiveness in classification and named-entity recognition tasks for text classification, de-identification and explainable clinical coding in Spanish. On the other hand, domain adaptation strategies (in which language models are trained using multilingual general-purpose corpora and retrained using the Galén corpus, specific to the clinical-oncological domain) and the use of TL approaches have allowed us to obtain SOTA results in several competitive tasks in the field of clinical natural language processing in Spanish. Our latest unpublished results, which show recent advances in the normalization of clinical entities on standardized ontologies such as SNOMED CT or UMLS, are promising and augur the continuity of the progress made within the Text2RWD project in a line of research that is still new and will remain active in the coming years. Finally, in the coming years the Text2RWD project will also advance in new predictive tasks on RWD based on AI and NLP of EHRs, such as TNM staging in cancer, and in their transfer to domains other than oncology, for which the algorithms were initially designed.
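As a point of reference for the strict-match micro-averaged precision, recall and F1 figures quoted in Section 2.1, the following Python sketch shows how such scores are commonly computed for a named-entity recognition task. It is only an illustration under stated assumptions, not the evaluation code used in [8]: the function name `micro_prf`, the representation of an entity as a `(start, end, label)` tuple and the toy labels are all hypothetical. Under strict matching, a predicted entity counts as correct only if its offsets and its label both coincide exactly with a gold annotation; micro-averaging pools the counts over all documents before computing the ratios.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical entity representation: (start_offset, end_offset, label).
Entity = Tuple[int, int, str]


def micro_prf(gold: List[Set[Entity]], pred: List[Set[Entity]]) -> Dict[str, float]:
    """Strict-match micro-averaged precision, recall and F1 over a corpus.

    gold[i] and pred[i] hold the annotated and predicted entities of document i.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same documents")
    # A true positive requires identical offsets AND identical label.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example with made-up offsets and labels (not taken from any real corpus).
gold_docs = [{(0, 10, "NAME"), (25, 35, "DATE")}, {(5, 12, "LOCATION")}]
pred_docs = [{(0, 10, "NAME"), (25, 34, "DATE")}, {(5, 12, "LOCATION"), (40, 44, "NAME")}]
print(micro_prf(gold_docs, pred_docs))
```

In the toy example, the predicted date span ends one character early and therefore does not count as a match, which is precisely what distinguishes strict matching from more lenient overlap-based criteria.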

Acknowledgments

The authors acknowledge the support from the Ministerio de Ciencia e Innovación (MICINN) under project PID2020-116898RB-I00, from the Universidad de Málaga and Junta de Andalucía through grant UMA20-FEDERJA045, from the Malaga-Pfizer consortium for AI research in Cancer (MAPIC), and from the Instituto de Investigación Biomédica de Málaga (IBIMA), all including FEDER funds.