<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Lupus Alberto: A Transformer-Based Approach for SLE Information Extraction from Italian Clinical Reports</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Livia Lilli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Antenucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Augusta Ortolan</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Laura Bosello</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Antonietta D'Agostino</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Patarnello</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlotta Masciocchi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacopo Lenkowicz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Catholic University of the Sacred Heart</institution>
          ,
          <addr-line>Rome, 00168</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Real World Data Facility, Gemelli Generator, Fondazione Policlinico Universitario Agostino Gemelli IRCCS</institution>
          ,
          <addr-line>Rome, 00168</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>UOC di Reumatologia, Fondazione Policlinico Universitario A Gemelli IRCCS</institution>
          ,
          <addr-line>00168 Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Natural Language Processing (NLP) is widely used across several fields, such as medicine, where information often originates from unstructured data sources. This creates the need for automated systems to classify text and extract information from Electronic Health Records (EHRs). However, a significant challenge lies in the limited availability of pre-trained models for less common languages, such as Italian, and for specific medical domains. Our study aims to develop an NLP approach to extract Systemic Lupus Erythematosus (SLE) information from Italian EHRs at the Gemelli Hospital in Rome. We introduce Lupus Alberto, a fine-tuned version of AlBERTo, trained to classify categories derived from three distinct domains: Diagnosis, Therapy and Symptom. We evaluated Lupus Alberto's performance by comparing it with other baseline approaches, selected from the available BERT-based models for the Italian language and fine-tuned for the same tasks. Evaluation results show that Lupus Alberto achieves overall F-Scores equal to 79%, 87%, and 76% for the Diagnosis, Therapy, and Symptom domains, respectively. Furthermore, our approach outperformed the other baseline models in the Diagnosis and Symptom domains, demonstrating superior performance in identifying and categorizing relevant SLE information, thereby supporting clinical decision-making and patient management.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Systemic Lupus Erythematosus</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Italian Language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Natural Language Processing (NLP) is used in many applications, such as in the medical domain, where the huge amount of unstructured data coming from Electronic Health Records (EHRs) generates the need to develop automated systems for text classification and information extraction. However, employing such methods is challenging due to the scarcity of pre-trained models for less common languages like Italian, and for specific medical domains.</p>
      <p>In this study, we explored Systemic Lupus Erythematosus (SLE), a complex pathology which involves different organ domains and can occur in patients at several levels of severity. For this reason, information about diagnoses, symptoms and therapies is used by physicians to characterize Lupus patients and to make better informed decisions about therapy changes or the time for the next contact visit. However, these Lupic features are not always available in a structured format, so there is the need for NLP approaches in order to interpret clinical reports and extract the desired data. Based on the literature, large language models (LLMs) and transformer-based architectures represent the state of the art for EHR classification tasks [1, 2, 3, 4].</p>
      <p>This work aims to develop a transformer-based approach to identify SLE information from unstructured EHRs at the Italian Gemelli Hospital of Rome. We then propose Lupus Alberto, a fine-tuned version of AlBERTo [5], the available BERT-based model for the Italian language trained on Italian tweets. In order to assess the Lupus Alberto performance, we compare it with other baseline approaches, choosing among the BERT-based models available for the Italian language, always fine-tuned on the same tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Hospitals may not have structured data sources and often
there is a need for advanced and automated approaches
for the extraction of specific features from clinical
reports. For this reason, there are several studies related
to information extraction and text classification in the
medical domain, in the context of different diseases and
languages.</p>
      <p>Specifically for SLE, we found the work of Deng et al. [6], who applied rule-based and logistic regression approaches to identify the SLE patient population from unstructured EHRs in the English language. Turner et al. [7] also investigated NLP techniques for SLE characterization from clinical notes, using Bag-of-Words and cTAKES to transform input EHR texts into features eligible for Machine Learning algorithms. They then used several models, such as Neural Networks, Random Forest, Support Vector Machines, Naïve Bayes and Word2Vec Bayesian inversion, for the final text classification. Furthermore, in the studies of Lilli et al. [8] and Ortolan et al. [9], a rule-based approach combined with BERT-based topic modelling is proposed for the identification of longitudinal features in Italian EHRs of SLE patients.</p>
      <p>Figure 1: Diversity of the fine-tuned categories. The inner circle shows the three classification domains, while the outer circle represents the related categories.</p>
      <p>We then found more recent techniques applied in other pathological contexts in the Italian language, based on transformers and large language models. For example, the work of Paolo et al. [10] presented a NER transformer-based approach in the lung cancer domain, on Italian EHRs. Additionally, Crema et al. [11] delivered an Italian dataset for the neuropsychiatric domain, training a transformer-based model for NER tasks. Regarding text classification, Torri et al. [<xref ref-type="bibr" rid="ref2">12</xref>] exploited text classification models to extract relevant clinical variables, comparing rule-based, recurrent neural network and BERT-based models in the ST-Elevation Myocardial Infarction domain, from an Italian hospital. Finally, Lilli et al. [13] proposed an ensemble of Llama with a BERT-based model for metastasis classification of Italian EHRs in the Breast Cancer domain.</p>
      <p>Based on the previous findings, our study aims to propose a transformer-based approach for the Italian language, specifically for SLE. To this scope, we searched for suitable methods to extract multiple Lupic features from the clinical reports of our Italian hospital. We relied on the models delivered by Polignano et al. [5], who trained Albert [14] on Italian tweets, and by Buonocore et al. [15], who proposed transformer-based models pre-trained on neural-machine translations of English resources and on natively Italian-written medical texts.</p>
    </sec>
    <sec id="sec-2a">
      <title>3. Methods</title>
      <sec id="sec-2a-1">
        <title>3.1. Data Corpus</title>
        <p>In this paper, we used data from the SLE Data Mart of the Gemelli Hospital of Rome, which comprises an extensive collection of structured and unstructured data related to Lupus patients. We selected the outpatient clinical reports, considered by physicians as the most informative for extracting information such as diagnoses, therapies and symptoms. Because of their length, we also chose to treat EHRs at the paragraph level, complying with the token limit of the BERT models. The final classification was then aggregated over the entire report through a logical OR.</p>
      </sec>
      <sec id="sec-2a-2">
        <title>3.2. Data Annotation</title>
        <p>The training set for the fine-tuning consisted of a silver standard made up of annotations from a rule-based algorithm, developed ad hoc for the study [8]. In particular, we formulated rules and expressions for tagging each EHR paragraph with the presence of the categories shown in Figure 1, excluding the possible negations. Rules consist of personalized regexes and checks on the distances among words.</p>
        <p>The gold standard for the evaluation was built by physicians, who annotated a set of EHRs in two steps. Manual annotation was performed by a first team of two physicians with medical knowledge of SLE, who annotated the reports of each patient with respect to the target information. A second team of two specialist rheumatologists reviewed the manual annotations for quality assessment. For labelling data, an interactive dashboard was developed ad hoc for the project, where the user assigned the corresponding tags to each EHR. The dashboard URL is accessible only from the hospital's internal network, so it is not sharable. However, Figure 2 provides a screenshot of the home and annotation pages.</p>
        <p>The Inter Annotator Agreement (IAA) among the annotations of the two groups was also computed as a quality assurance measure of data and annotations [16]. For this purpose, we chose the Cohen's Kappa metric, which measures the agreement of two annotators while considering the agreement that could occur by chance [17]:</p>
        <p>k = (p<sub>o</sub> - p<sub>e</sub>) / (1 - p<sub>e</sub>) (1)</p>
      </sec>
    </sec>
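<p>As a minimal sketch of the agreement computation described above, Cohen's Kappa can be reproduced in a few lines of plain Python (the example annotation arrays below are invented; in the study, the inputs were the binary annotations of the two physician teams):</p>

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two binary annotation lists:
    k = (p_o - p_e) / (1 - p_e), as in Equation 1."""
    n = len(a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's empirical label frequencies.
    pa1 = sum(a) / n
    pb1 = sum(b) / n
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical paragraph-level annotations from two annotator teams.
team_a = [1, 0, 1, 1, 0, 1, 0, 0]
team_b = [1, 0, 1, 0, 0, 1, 0, 1]
print(cohen_kappa(team_a, team_b))  # prints 0.5
```

<p>For binary labels, the cohen_kappa_score function of Scikit-Learn, which the study used, returns the same value.</p>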
    <sec id="sec-3">
      <title>3.3. Fine-Tuning and Classification</title>
      <p>This study aimed to extract information about diagnoses, therapies and symptoms from the EHRs of the Gemelli Hospital of Rome. Our purpose was to identify, for each of the three domains, a set of SLE-related categories provided by our team of rheumatologists. As shown in Figure 1, we then trained our model on 8 different types of diagnoses, 4 therapies, and 7 symptoms.</p>
      <p>For this purpose, we fine-tuned AlBERTo (https://github.com/marcopoli/AlBERTo-it), a BERT-based model for the Italian language proposed by Polignano et al. [5]. The fine-tuning followed the approach of Polignano et al. [5], treating every category as a single binary task with its own training set of labelled texts, randomly sampled from the original data corpus. We thus obtained multiple binary classifiers, one for each category to extract.</p>
      <p>Fine-tuning and inference were implemented at the paragraph level rather than on entire reports, in order to comply with the token limit imposed by BERT models. The final evaluation was then applied at the overall EHR level, comparing the gold standard reports to the paragraph classifications combined at the EHR level through a logical OR: if at least one paragraph is positive for a specific category, the corresponding report is classified with that category.</p>
    </sec>
    <sec id="sec-3b">
      <title>4. Experiments</title>
      <sec id="sec-3b-1">
        <title>4.1. Dataset</title>
        <p>For this study, we started from the SLE data mart of the Gemelli Hospital of Rome, selecting among the 13299 available EHRs of outpatient visits.</p>
        <p>For our training set, we sampled 1000 training texts for each binary category shown in Figure 1, balancing them between positive and negative samples, such that each category had 50% of its training samples labelled as positive. The training set was composed of EHR paragraphs, in order to comply with the 512-token limit imposed by BERT models.</p>
        <p>The gold standard set was composed of 750 EHRs randomly sampled from the data mart, after verifying that their paragraphs were not already in the training set. The gold standard set was annotated by two groups of physicians through the annotation dashboard in Figure 2. The same set of gold standard reports was used for the evaluation of all the classification domains.</p>
        <p>Details about the dataset are shown in Table 1, where some statistics are reported for each domain, distinguished by training set and gold standard. In particular, for each case we report the number of categories to classify, the total number of paragraphs processed during training and inference, the overall number of EHRs, and the mean number of tokens and characters over the paragraphs. Tokens were computed through the BERT tokenizer (google-bert/bert-base-uncased) [19] available on Hugging Face [20].</p>
        <p>For privacy reasons, the dataset used in this study is not publicly available. We therefore provide the descriptive summary metrics in Table 1.</p>
      </sec>
      <sec id="sec-3b-2">
        <title>4.2. Inter Annotator Agreement</title>
        <p>In Equation 1, p<sub>o</sub> is the observed agreement, while p<sub>e</sub> is the expected agreement when both annotators randomly assign labels; the latter is estimated using a per-annotator empirical prior over the class labels [18].</p>
        <p>In order to measure the Inter Annotator Agreement on the gold standards, we used the cohen_kappa_score function provided by the Python Scikit-Learn package [21]. As inputs to the function, we considered the arrays containing the binary annotations performed by the two groups of annotators, respectively. Additionally, we performed the analysis grouping the annotations by the three domains: Diagnosis, Therapy and Symptom. Results are shown in Table 2. Following the grid proposed by Landis and Koch [22] for the interpretation of the coefficient, we have an almost perfect quality of annotation for the Diagnosis and Therapy domains (k &gt; 0.80), and a substantial level for the Symptom case (k = 0.69). Although acceptable according to literature standards [16], the latter k score is lower than the others because of the greater difficulty of identifying symptoms from text. Symptoms at the current contact are in fact more complex concepts to identify, compared to therapies and diagnoses, which are usually mentioned more explicitly in the EHR. So, even when analyzed by clinical experts, the same report can present inconsistent annotations, due to the poor quality of text semantics.</p>
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <caption><p>The Inter Annotator Agreement (IAA) computed between the two groups of physicians, through the Cohen's Kappa metric, distinguished by the three classification domains.</p></caption>
          <table>
            <thead><tr><th>Domain</th><th>Cohen's Kappa (k)</th></tr></thead>
            <tbody>
              <tr><td>Diagnosis</td><td>0.88</td></tr>
              <tr><td>Therapy</td><td>0.93</td></tr>
              <tr><td>Symptom</td><td>0.69</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3b-3">
        <title>4.3. Modeling</title>
        <p>The AlBERTo fine-tuning was performed through the PyTorch Trainer of the Hugging Face Transformers library [20], using 10 epochs (for further implementation details, see Appendix A). Fine-tuning was performed for each of the 19 categories, in order to obtain a classifier for each binary task.</p>
        <p>In order to assess the Lupus Alberto performance, we then compared the model to other baselines, always fine-tuned on the same binary tasks, choosing among several BERT-based models for text classification. In particular, we considered the three models proposed by Buonocore et al. [15], BioBIT (IVN-RIN/bioBIT), MedBIT (IVN-RIN/medBIT) and MedBIT-r3-plus (IVN-RIN/medBIT-r3-plus), which are pre-trained for the Italian language in the medical context. Additionally, we also tried the two base versions of Albert (albert/albert-base-v1, albert/albert-base-v2) [14], the base model used by Polignano et al. [5] to release AlBERTo.</p>
        <p>The inference for all the models was performed at the paragraph level instead of the whole-report level, and the final classification was aggregated at the EHR level through a logical OR. For example, if at least one paragraph is positive for the Articular Diagnosis, the overall EHR is classified as positive for that category.</p>
      </sec>
      <sec id="sec-3b-4">
        <title>4.4. Results and Discussion</title>
        <p>For the evaluation, we compared Lupus Alberto to the other baseline models (fine-tuned on the same tasks) in terms of F-Score at the single category level. Additionally, to quantify the overall performances, we also computed the mean F-Score for the Diagnosis, Therapy and Symptom domains.</p>
        <p>As shown in Table 3, Lupus Alberto presents the highest F-Score for the Therapy domain, with a value of 87%. The Diagnosis and Symptom domains follow, with overall metrics of 79% and 76%, respectively. These performances reflect the IAA results in Table 2, which show that Therapy presents a higher quality of annotations compared to Diagnosis and Symptom.</p>
        <p>Concerning the baselines, Lupus Alberto outperforms the other experiments for Diagnosis and Symptom, while the Therapy domain presents the highest metric value with the fine-tuned MedBIT-r3-plus [15], whose score equals 88%.</p>
        <p>At the single category level, the Hematologic and Renal diagnoses present the highest performance metrics in their domain, with values of 98% and 94%, respectively. Glucocorticoid is the therapy with the best F-Score, equal to 97%. Finally, Papula and Raynaud's Phenomenon are the best-performing symptoms, with scores equal to 89% and 87%, respectively.</p>
        <p>In all three domains, the second version of the Albert model presents the lowest performance values, with F-Scores equal to 69%, 78% and 44%, respectively, compared to our Lupus Alberto and to the fine-tuned models of Buonocore et al. [15]. Thus, as demonstrated by the above results, fine-tuning models specifically trained on the Italian language improved the final classification performance.</p>
      </sec>
    </sec>
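<p>The paragraph-to-report aggregation described above can be sketched as follows (a toy illustration, assuming each per-category binary classifier has already produced one prediction per paragraph; all names and values below are invented):</p>

```python
# Aggregate paragraph-level binary predictions to the EHR (report) level
# with a logical OR: a report is positive for a category if at least one
# of its paragraphs is classified as positive.
def aggregate_report(paragraph_predictions):
    """paragraph_predictions: dict mapping category -> list of 0/1
    paragraph-level predictions for one EHR."""
    return {cat: int(any(preds)) for cat, preds in paragraph_predictions.items()}

# Hypothetical output of three binary classifiers on a four-paragraph EHR.
preds = {
    "Articular Diagnosis": [0, 1, 0, 0],
    "Glucocorticoid": [0, 0, 0, 0],
    "Raynaud's Phenomenon": [1, 0, 0, 1],
}
print(aggregate_report(preds))
```

<p>This mirrors the evaluation setup: the gold standard is defined on whole reports, so paragraph classifications must be reduced to one label per report and category before computing the F-Scores.</p>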
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>This study aims to deliver a transformer-based approach to extract SLE information from real-world data of the Gemelli Hospital of Rome. The scarcity of models available for the Italian language and specialized in Lupus prompted us to develop a solution to automate the extraction of SLE information from Italian EHRs. We especially focused on identifying features in the domains of Diagnosis, Therapy and Symptom, reported as of interest for SLE. Our work shows that Lupus Alberto presents competitive performance compared to other baseline methods, outperforming them especially in the classification of information in the Diagnosis and Symptom domains, achieving F-Scores of 79% and 76%, respectively.</p>
    </sec>
    <sec id="sec-4a">
      <title>6. Limitations</title>
      <p>While our proposed approach presents higher performance compared to the baselines, many aspects could be investigated in future studies in order to enhance the final performance. This includes the usage of a larger set of training data for the model fine-tuning. Additionally, new research could be conducted by extracting Lupus features through LLMs and comparing the results with the traditional transformer-based classifiers. Finally, a first release of Lupus Alberto could be implemented using differential privacy techniques to ensure the protection of data from inference risks [23].</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>For this study, the use of electronic health records was essential for training and testing our new technology. However, these data contain sensitive patient information, and it was fundamental to adhere to strict privacy and confidentiality guidelines. To this purpose, the dataset used in this paper was fully de-identified, and we received approval from our institution to conduct the presented research. The approval protocol number from the relevant Ethics Committee can be provided on request.</p>
    </sec>
    <sec id="sec-5a">
      <title>References</title>
      <p>[1] Y. Li, S. Rao, J. R. A. Solares, A. Hassaine, R. Ramakrishnan, D. Canoy, Y. Zhu, K. Rahimi, G. Salimi-Khorshidi, Behrt: transformer for electronic health records, Scientific Reports 10 (2020) 7155.</p>
      <p>[2] V. Yogarajan, J. Montiel, T. Smith, B. Pfahringer, Transformers for multi-label classification of medical text: an empirical comparison, in: International Conference on Artificial Intelligence in Medicine, Springer, 2021, pp. 114–123.</p>
      <p>[3] M. Rupp, O. Peter, T. Pattipaka, Exbehrt: Extended transformer for electronic health records, in: International Workshop on Trustworthy Machine Learning for Healthcare, Springer, 2023, pp. 73–84.</p>
      <p>[4] Z. Yang, A. Mitra, W. Liu, D. Berlowitz, H. Yu, Transformehr: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records, Nature Communications 14 (2023) 7857.</p>
      <p>[5] M. Polignano, P. Basile, M. De Gemmis, G. Semeraro, V. Basile, et al., Alberto: Italian bert language understanding model for nlp challenging tasks based on tweets, in: CEUR Workshop Proceedings, volume 2481, CEUR, 2019, pp. 1–6.</p>
      <p>[6] Y. Deng, J. A. Pacheco, A. Ghosh, A. Chung, C. Mao, J. C. Smith, J. Zhao, W.-Q. Wei, A. Barnado, C. Dorn, et al., Natural language processing to identify lupus nephritis phenotype in electronic health records, BMC Medical Informatics and Decision Making 22 (2022) 348.</p>
      <p>[7] C. A. Turner, A. D. Jacobs, C. K. Marques, J. C. Oates, D. L. Kamen, P. E. Anderson, J. S. Obeid, Word2vec inversion and traditional text classifiers for phenotyping lupus, BMC Medical Informatics and Decision Making 17 (2017) 1–11.</p>
      <p>[8] L. Lilli, S. L. Bosello, L. Antenucci, S. Patarnello, A. Ortolan, J. Lenkowicz, M. Gorini, G. Castellino, A. Cesario, M. A. D'Agostino, et al., A comprehensive natural language processing pipeline for the chronic lupus disease, in: Digital Health and Informatics Innovations for Sustainable Health Care Systems, IOS Press, 2024, pp. 909–913.</p>
      <p>[9] A. Ortolan, L. Lilli, S. Bosello, L. Antenucci, C. Masciocchi, J. Lenkowicz, P. Cerasuolo, L. Lanzo, S. Pinno, G. Castellino, et al., Pos1142 development and validation of a rule-based framework for automated identification of longitudinal clinical features about systemic lupus erythematosus patients from electronic health records, Annals of the Rheumatic Diseases 83 (2024) 1014.</p>
      <p>[10] D. Paolo, A. Bria, C. Greco, M. Russano, S. Ramella, P. Soda, R. Sicilia, Named entity recognition in italian lung cancer clinical reports using transformers, in: 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 2023, pp. 4101–4107.</p>
      <p>[11] C. Crema, T. M. Buonocore, S. Fostinelli, E. Parimbelli, F. Verde, C. Fundarò, M. Manera, M. C. Ramusino, M. Capelli, A. Costa, et al., Advancing italian biomedical information extraction with transformers-based models: Methodological insights and multicenter practical application, Journal of Biomedical Informatics 148 (2023) 104557.</p>
      <p>[<xref ref-type="bibr" rid="ref2">12</xref>] V. Torri, S. Mazzucato, S. Dalmiani, U. Paradossi, C. Passino, S. Moccia, S. Micera, F. Ieva, Structuring clinical notes of italian st-elevation myocardial infarction patients, in: Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024, 2024, pp. 37–43.</p>
      <p>[13] L. Lilli, S. Patarnello, C. Masciocchi, V. Masiello, F. Marazzi, T. Luca, N. Capocchiano, Llamamts: Optimizing metastasis detection with llama instruction tuning and bert-based ensemble in italian clinical reports, in: Proceedings of the 6th Clinical Natural Language Processing Workshop, 2024, pp. 162–171.</p>
      <p>[14] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, Albert: A lite bert for self-supervised learning of language representations, arXiv preprint arXiv:1909.11942 (2019).</p>
      <p>[15] T. M. Buonocore, C. Crema, A. Redolfi, R. Bellazzi, E. Parimbelli, Localizing in-domain adaptation of transformer-based biomedical language models, Journal of Biomedical Informatics 144 (2023) 104431.</p>
      <p>[16] K. L. Soeken, P. A. Prescott, Issues in the use of kappa to estimate reliability, Medical Care (1986) 733–741.</p>
      <p>[17] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960) 37–46.</p>
      <p>[18] R. Artstein, M. Poesio, Inter-coder agreement for computational linguistics, Computational Linguistics 34 (2008) 555–596.</p>
      <p>[19] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
      <p>[20] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., arXiv preprint arXiv:1910.03771 (2019).</p>
    </sec>
    <sec id="sec-6">
      <title>A. Implementation Details</title>
      <p>The fine-tuning was performed through the PyTorch Trainer of the Hugging Face Transformers library [20], on a desktop Nvidia RTX 5000 Graphics Processing Unit (GPU) with 16GB of RAM, on a machine running Ubuntu 20.04.3 LTS. 20% of the training set was used as eval_dataset, while the remainder was employed as train_dataset. The learning rate was set to 2e-5, the batch size to 16, and the weight decay to 0.01.</p>
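<p>As a configuration sketch only (not the project's actual training script; the output directory is a hypothetical name, and model and dataset loading are omitted), the hyperparameters above would map onto the Transformers Trainer roughly as follows:</p>

```python
from transformers import TrainingArguments, Trainer

# Hyperparameters as reported in this appendix: 10 epochs,
# learning rate 2e-5, batch size 16, weight decay 0.01.
args = TrainingArguments(
    output_dir="lupus-alberto-checkpoints",  # hypothetical path
    num_train_epochs=10,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    weight_decay=0.01,
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset,  # 80% of labelled paragraphs
#                   eval_dataset=eval_dataset)    # remaining 20%
# trainer.train()
```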
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[21] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[22] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics 33 (1977) 159–174.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[23] M. Miranda, E. S. Ruzzetti, A. Santilli, F. M. Zanzotto, et al., Preserving privacy in large language models: A survey on current threats and solutions, arXiv preprint arXiv:2408.05212 (2024).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>