=Paper=
{{Paper
|id=Vol-2936/paper-66
|storemode=property
|title=Pre-trained language models to extract information from radiological reports
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-66.pdf
|volume=Vol-2936
|authors=Pilar López-Úbeda,Manuel Carlos Díaz-Galiano,L. Alfonso Ureña-López,M. Teresa Martín-Valdivia
|dblpUrl=https://dblp.org/rec/conf/clef/Lopez-UbedaDLM21
}}
==Pre-trained language models to extract information from radiological reports==
Pilar López-Úbeda, Manuel Carlos Díaz-Galiano, L. Alfonso Ureña-López and M. Teresa Martín-Valdivia
Universidad de Jaén, Campus Las Lagunillas s/n, E-23071, Jaén, Spain
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania

Abstract

This paper describes the participation of the SINAI team in the SpRadIE challenge: Information Extraction from Spanish radiology reports, which consists of identifying biomedical entities related to the radiological domain. Many shared tasks have focused on extracting relevant information from clinical texts; however, no previous task has been centered on radiology using Spanish as the main language. Detecting relevant information automatically in biomedical texts is a crucial task because current health information systems are not prepared to analyze and extract this knowledge due to the time and cost involved in processing it manually. To accomplish this task, we propose three approaches based on pre-trained models using the BERT architecture. Specifically, we use a multi-class classification model, a binary classification model and a pipeline model for entity identification. The results are encouraging, since we improve on the participant average by obtaining a 73.7% F1-score with the binary system.

Keywords: Biomedical information extraction, Radiological domain, Spanish clinical reports, Pre-trained language models, BERT

1. Introduction

Medical texts such as radiology reports or Electronic Health Records (EHR) are a powerful source of data for researchers. These data sources contain relevant information that can help in clinical decision-making and report structuring, among other benefits. However, current health information systems are not prepared to analyze and extract this knowledge due to the time and cost involved in processing it manually. The field of artificial intelligence known as Natural Language Processing (NLP) is being applied to medical documents to build applications that can understand and analyze this huge amount of textual information automatically [1].

This paper describes the system presented by the SINAI team for the SpRadIE (Information Extraction from Spanish Clinical Reports) challenge [2] at CLEF eHealth 2021 [3] (Task 1). The SpRadIE challenge focuses on information extraction from Spanish biomedical texts, more specifically, on the NER (Named Entity Recognition) task. Spanish has more than 480 million native speakers (https://en.wikipedia.org/wiki/Spanish_language) and there is nowadays a worldwide interest in processing medical texts in this language. For this challenge, our proposal is focused on pre-trained models based on the Transformer architecture using BERT (Bidirectional Encoder Representations from Transformers). More specifically, we employ the BETO model, trained on a large Spanish corpus, to address the task. For this purpose, we submitted three different systems: a BERT model with multi-class classification to perform entity extraction, BERT using binary classification, and a combined model.
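All three systems fine-tune the same pre-trained Spanish checkpoint. As a minimal sketch only (the paper itself contains no code), the snippet below loads BETO with a token-classification head through the Huggingface transformers library; the hub identifier dccuchile/bert-base-spanish-wwm-cased and the 21-label BIO encoding (B-/I- for each of the ten entity types plus O) are our assumptions based on Section 3, not details stated by the authors at this point.

```python
# Hedged sketch (not the authors' code): loading BETO with a token-classification head.
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"   # assumed hub id for BETO
NUM_LABELS = 2 * 10 + 1  # B-/I- for each of the 10 entity types, plus O (our assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

# Sub-word tokenization of a report sentence (the running example of Section 3.3).
encoding = tokenizer("Ecografía Inguinal: Aumento de partes blandas en región inguinal",
                     return_tensors="pt", truncation=True, max_length=150)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0]))
```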
The rest of the paper is structured as follows: in Section 2 we present some previous studies related to information extraction in radiology reports. The dataset, the pre-processing carried out and the descriptions of the implemented systems are presented in Section 3. Section 4 provides the results achieved for the SpRadIE challenge. Finally, conclusions and future work are presented in Section 5.

2. Related Work

In the NLP literature, studies related to the biomedical domain focus on specific sub-domains such as radiology. NLP techniques can be an interesting tool for successfully analyzing radiology reports and extracting clinically important findings [4]. In order to process this information effectively, specialized, domain-focused corpora are necessary. On the one hand, the PadChest [5] dataset contains 27,593 reports in Spanish that were manually annotated by trained physicians. The reports included in PadChest were labeled with 174 different radiographic findings, 19 differential diagnoses, and 104 anatomic locations. On the other hand, there are other datasets available to the scientific community, such as ChestX-ray8 [6], PLCO [7], CheXpert [8], and MIMIC-CXR [9], which use English as the principal language.

The NLP clinical community has organized a series of open challenges to identify and extract relevant information included in clinical reports. These challenges highlight the importance of tackling this type of task, offering participants the opportunity to submit novel systems and compare their results using the same dataset. Some examples of these challenges are PharmaCoNER [10], eHealth-KD [11], CLEF eHealth [12], MEDDOCAN [13], CHEMDNER [14], and I2B2 [15].

Researchers interested in information extraction tasks have explored a variety of Machine Learning (ML) approaches [16]. Previous studies applied the CRF method [17] to perform the identification and subsequent classification of entities; CRF is the most popular solution among conventional ML algorithms. More recently, deep learning has become prevalent in the ML research community for improving biomedical named entity recognition with neural networks such as BiLSTM-CRF [18, 19]. Although Recurrent Neural Networks (RNNs) have achieved strong results and generated a wide range of related literature on the NER task in recent years, the pre-training of Transformer-based language models such as BERT [20] has also led to impressive gains in NER systems. Based on the idea of using the latest methods employed in the area of information extraction, our study uses BERT as a model to identify named entities in radiological reports written in Spanish.

3. System Overview

3.1. Dataset

The SpRadIE dataset consists of 513 ultrasonography reports provided by a pediatric hospital in Argentina. Because the reports are written by specialists, they are unstructured and contain abundant spelling and grammatical errors. For confidentiality, each report has been anonymized by removing patient information, names, and physician registration numbers. The dataset was annotated by clinical experts [21] and then revised by linguists, and it is composed of 175 reports in the training set, 92 reports in the development set and 207 in the test set.
Seven entity types and three radiological concept hedging cues are distinguished. These entities may be very long, sometimes even spanning sentence boundaries, may be embedded within other entities of different types, and may be discontinuous. Moreover, different text strings may be used to refer to the same entity, including abbreviations and typos. Some of the most relevant statistics of the types of entities in each dataset, along with some examples, are shown in Table 1.

Table 1: Statistics on the number of entities in each dataset.

| Entity type          | Training: # entities | Training: # uniques | Development: # entities | Development: # uniques | Example |
|----------------------|----------------------|---------------------|-------------------------|------------------------|---------|
| Anatomical Entity    | 1335 | 179 | 868 | 167 | ambos riñones (both kidneys) |
| Finding              | 825  | 289 | 698 | 295 | líquido libre (free fluid) |
| Location             | 529  | 104 | 286 | 93  | cavidad (cavity) |
| Measure              | 596  | 389 | 369 | 268 | 10 cm |
| Type of Measure      | 357  | 36  | 163 | 30  | diametro longitudinal (longitudinal diameter) |
| Degree               | 41   | 19  | 35  | 22  | leve (slight) |
| Abbreviation         | 903  | 57  | 538 | 59  | cm |
| Negation             | 496  | 29  | 282 | 36  | sin (without) |
| Uncertainty          | 55   | 16  | 22  | 12  | compatible (compatible) |
| Conditional Temporal | 12   | 8   | 7   | 4   | antecedente (antecedent) |

3.2. Pre-processing

The initial step in data science is data preparation or text pre-processing. In our particular case, we work with texts written in Spanish and related to the radiology domain. The pre-processing carried out on all the texts is the following:

• Sentence tokenization. This process consists of splitting the text into individual sentences. For this purpose, we use the FreeLing library [22], which incorporates analysis functionalities for a variety of languages, including Spanish.
• Word tokenization. This step converts text strings into streams of token objects, where each token object is a separate word, punctuation sign, number/amount, date, etc. In this step, we also use the FreeLing library. It gives good results for clinical texts because it keeps some multi-word terms together; for example, the sentence "Compatibles con hipertrofia pilorica" is separated into the tokens "Compatibles", "con", "hipertrofia_pilorica", where the token hipertrofia_pilorica is annotated with the category Finding.
• Lowercasing. The texts have been converted to lowercase.
• BIO tagging scheme. The final step in pre-processing the reports involves converting them to CoNLL format using the BIO tagging scheme [23]. Thus, each token in a sentence is labeled with B (beginning token of an entity), I (inside token of an entity), or O (non-entity). A minimal sketch of this conversion step is shown after this list.
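Since the paper does not reproduce its pre-processing code, the following is only a rough sketch of the BIO conversion under our own assumptions: gold annotations are available as character-offset spans and tokens carry character offsets (as a tokenizer such as FreeLing can provide). Nested and discontinuous entities, which the corpus does contain, are not handled here.

```python
# Hedged sketch of the BIO conversion step (not the authors' implementation).
# Assumption: each annotation is (start_char, end_char, label) and each token
# is (text, start_char, end_char).
from typing import List, Tuple

Token = Tuple[str, int, int]   # (text, start offset, end offset)
Span = Tuple[int, int, str]    # (start offset, end offset, entity label)

def to_bio(tokens: List[Token], spans: List[Span]) -> List[str]:
    """Assign a B-/I-/O tag to every token from character-level spans."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = False
        for i, (_, t_start, t_end) in enumerate(tokens):
            # A token belongs to the span if its offsets overlap the span.
            if t_start < end and t_end > start:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags

# Toy usage on the sentence used as a running example in Section 3.3
# (Finding entity only).
tokens = [("Ecografía", 0, 9), ("Inguinal", 10, 18), (":", 18, 19),
          ("Aumento", 20, 27), ("de", 28, 30), ("partes", 31, 37),
          ("blandas", 38, 45), ("en", 46, 48), ("región", 49, 55),
          ("inguinal", 56, 64)]
spans = [(20, 45, "Find")]     # "Aumento de partes blandas"
print(list(zip([t[0] for t in tokens], to_bio(tokens, spans))))
# [('Ecografía', 'O'), ..., ('Aumento', 'B-Find'), ('de', 'I-Find'), ...]
```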
3.3. Methodology

The methodology employed for this task is based on the BERT architecture. BERT [20] uses a Transformer [24] architecture and is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. Moreover, BERT proposes a Masked Language Modelling (MLM) objective, where some of the tokens of an input sequence are randomly masked, and the goal is to predict these masked positions taking the corrupted sequence as input. Specifically, for our experimental design, we use the pre-trained BERT model named BETO [25] because it is trained on a large Spanish corpus. The following sections describe the models and parameters used for each system submitted to the SpRadIE challenge.

3.3.1. BERT Multi-class Entity

The first approach is to use BERT to detect all possible entities found in a text. For this purpose, we take into account all types of entities in each sentence of the dataset. In this case, BERT is trained to assign an entity type to each token. Figure 1 shows an example of the sentence "Ecografía Inguinal: Aumento de partes blandas en región inguinal" (Inguinal ultrasound: soft tissue enlargement in the inguinal region), where each token can receive one of ten different entity types or the non-entity label (O); that is, this approach can be considered a multi-class classification for each token.

Ecografía  Inguinal  :  Aumento  de      partes  blandas  en  región  inguinal
O          B-Loc     O  B-Find   I-Find  I-Find  I-Find   O   B-Loc   I-Loc

Figure 1: BERT architecture for the multi-class classification of entities approach.

3.3.2. BERT Binary Class Entity

The second approach used in the SpRadIE challenge is to perform a binary classification for each entity, instead of a multi-class classification. To carry out this experimentation, we developed ten BERT models (one for each type of entity). Figures 2 and 3 show examples using a BERT system for each entity independently. On the one hand, Figure 2 shows the same sentence as the previous approach, training the dataset with the Finding entity only. On the other hand, Figure 3 shows an example of the methodology followed using only the Location entity.

Ecografía  Inguinal  :  Aumento  de      partes  blandas  en  región  inguinal
O          O         O  B-Find   I-Find  I-Find  I-Find   O   O       O

Figure 2: BERT architecture used to train Finding entities.

Ecografía  Inguinal  :  Aumento  de  partes  blandas  en  región  inguinal
O          B-Loc     O  O        O   O       O        O   B-Loc   I-Loc

Figure 3: BERT architecture used to train Location entities.

3.3.3. BERT Pipeline

From the two experiments proposed above, we observed on the development corpus that some entities performed better with the multi-class method and others with the binary method. For this reason, on the one hand, we use the output of the multi-class approach for the following entities: conditional temporal, finding, location, type of measure, and uncertainty. On the other hand, for abbreviation, anatomical entity, degree, measure, and negation, we use the output obtained from the binary approach. Briefly, as shown in Figure 4, the BERT pipeline consists of combining the outputs of the previously proposed systems according to the type of entity in order to obtain the final results; a minimal sketch of this selection step is given at the end of this section.

For all our experiments, we fine-tune our models using the following hyperparameters: the BETO model used is "bert-base-spanish-wwm-cased" according to the Huggingface library [26], the maximum sequence length is set to 150, and we train for 5 epochs with a batch size of 32. Finally, all experiments (training and evaluation) were performed on a node equipped with two Intel Xeon Silver 4208 CPUs at 2.10 GHz with 192 GB of RAM as main processors, and six NVIDIA GeForce RTX 2080 Ti GPUs (with 11 GB each).

Figure 4: BERT pipeline architecture (diagram omitted): the multi-class model provides the conditional temporal, finding, location, type of measure and uncertainty entities, while one binary model per entity provides the abbreviation, anatomical entity, degree, measure and negation entities.
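The following sketch illustrates, under our own assumptions about data structures, how such a per-entity selection could be implemented; it is not the authors' code. Predictions are assumed to be per-token BIO tag sequences, one from the multi-class model and one from each binary model, and the short labels other than Find/Loc (seen in Figures 1-3) are hypothetical names we introduce for illustration.

```python
# Hedged sketch of the BERT Pipeline combination step (Section 3.3.3).
from typing import Dict, List

# Entity types whose predictions come from each system (split taken from the paper;
# the abbreviated label strings are our own hypothetical choices).
FROM_MULTICLASS = {"CondTemp", "Find", "Loc", "TypeMeas", "Uncert"}
FROM_BINARY = {"Abbrev", "Anat", "Degree", "Meas", "Neg"}

def combine(multi_tags: List[str],
            binary_tags: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep, per entity type, the tag sequence of the preferred system."""
    per_type: Dict[str, List[str]] = {}
    for ent in FROM_MULTICLASS:
        # Keep only the tags of this entity type from the multi-class output.
        per_type[ent] = [t if t.endswith("-" + ent) else "O" for t in multi_tags]
    for ent in FROM_BINARY:
        per_type[ent] = binary_tags[ent]
    return per_type

# Toy usage on the running example (Finding and Location from the multi-class output).
multi = ["O", "B-Loc", "O", "B-Find", "I-Find", "I-Find", "I-Find", "O", "B-Loc", "I-Loc"]
binary = {ent: ["O"] * 10 for ent in FROM_BINARY}
combined = combine(multi, binary)
print(combined["Find"])
# ['O', 'O', 'O', 'B-Find', 'I-Find', 'I-Find', 'I-Find', 'O', 'O', 'O']
```

Converting these per-type tag sequences back into character-offset annotations for submission is omitted here.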
4. Results

The metrics defined by the SpRadIE challenge to evaluate the submitted experiments are those commonly used for NLP tasks such as NER or text classification, namely precision (P), recall (R), and F1-score (F1), considering exact and lenient matching. Table 2 shows the results obtained by the SINAI team for each run submitted.

Table 2: Test results achieved by the SINAI group in the SpRadIE task.

| System                   | Lenient P(%) | Lenient R(%) | Lenient F1(%) | Exact P(%) | Exact R(%) | Exact F1(%) |
|--------------------------|--------------|--------------|---------------|------------|------------|-------------|
| BERT Multi-class Entity  | 83.58 | 45.13 | 58.61 | 76.53 | 41.33 | 53.67 |
| BERT Binary Class Entity | 86.07 | 64.43 | 73.70 | 79.37 | 59.42 | 67.96 |
| BERT Pipeline            | 83.94 | 61.19 | 70.78 | 77.95 | 56.82 | 65.73 |
| Mean participant         | 75.90 | 66.31 | 68.97 | 68.66 | 59.32 | 62.41 |

On the one hand, with the first approach and considering both lenient and exact matching, we obtain high precision but a low recall. Specifically, we achieved 76.53% precision, 41.33% recall, and a 53.67% F1-score with exact matching. On the other hand, using the BERT Binary Class Entity approach, we improved in all measurements compared to the results of BERT Multi-class Entity, which means that by performing binary classification the system can detect entities more accurately. Following this methodology, the system does not have to choose among ten different entity types, but only between 0 and 1 for each entity annotated in the dataset. With this methodology, the system obtains an improvement of about 14 points of F1, reaching 67.96% (above the average of the participants). It is important to highlight that, although the binary approach improves all measures with respect to the BERT Multi-class Entity method, the increase in recall is especially large. Finally, the results of the last approach (BERT Pipeline) do not improve on the binary method.

Figure 5: SINAI group results for each type of entity using the F1 metric (L: lenient, E: exact); the bar chart over the ten entity types is omitted here.

To perform a specific evaluation for each entity, Figure 5 shows the F1-score obtained with lenient (L) and exact (E) matching for each submitted system. As we can see, there is a significant difference between the BERT Multi-class Entity and BERT Binary Class Entity systems for entities such as degree, measure, negation and type of measure. These entities have been detected more accurately using the binary entity classification system. We should highlight the recognition of negated entities, where we achieved an 85% F1 improvement, and the degree entity, which improved by 59%, with respect to the BERT Multi-class Entity system.
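The numbers above come from the official SpRadIE evaluation script, which is not reproduced here. Purely as an illustration of the difference between the two matching modes, the sketch below scores toy spans under simplified assumptions: exact matching requires identical boundaries and label, while lenient matching only requires overlapping boundaries with the same label; the official criteria may differ in detail.

```python
# Hedged illustration of exact vs. lenient span matching (not the official scorer).
from typing import List, Tuple

Span = Tuple[int, int, str]   # (start offset, end offset, entity label)

def f1(gold: List[Span], pred: List[Span], lenient: bool = False) -> float:
    def hit(p: Span, g: Span) -> bool:
        if p[2] != g[2]:
            return False
        # Lenient: any character overlap; exact: identical boundaries.
        return (p[0] < g[1] and p[1] > g[0]) if lenient else (p[0], p[1]) == (g[0], g[1])

    tp = sum(any(hit(p, g) for g in gold) for p in pred)
    precision = tp / len(pred) if pred else 0.0
    recall = sum(any(hit(p, g) for p in pred) for g in gold) / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [(20, 45, "Find")]             # "Aumento de partes blandas"
pred = [(20, 37, "Find")]             # partial prediction "Aumento de partes"
print(f1(gold, pred, lenient=False))  # 0.0 (boundaries differ)
print(f1(gold, pred, lenient=True))   # 1.0 (overlap with the same label)
```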
5. Conclusion and Future Work

This paper presents the participation of the SINAI research group in the SpRadIE challenge at CLEF 2021. This challenge aims to extract relevant information related to the radiological domain in Spanish. Specifically, the collection is composed of reports manually annotated by specialists with ten different types of entities, such as findings, anatomical entities, and abbreviations, among others. Our proposal follows a pre-trained model-based approach using the Transformer architecture for the NER task on Spanish health documents. The proposed methods use the BERT model trained on a large Spanish corpus, called BETO. First, we performed a pre-processing step on the annotated datasets provided by the organization, which were tokenized and labeled using the BIO scheme. Subsequently, we proposed three evaluation methodologies: using BERT to label entities as a multi-class classification system, using BERT to extract entities in a binary approach, and a method that combines the outputs of the two previous ones.

Using BERT's binary system, the results obtained were better than the average of the challenge participants, achieving a 67.96% F1-score, 79.37% precision, and 59.42% recall with the exact matching evaluation, and a 73.70% F1-score with the lenient evaluation. Moreover, we found that the proposed binary architecture for entity extraction provides the model with more information about each entity individually during the learning phase, achieving better results than the multi-class entity detection model.

For future work, we plan to perform an in-depth error analysis once the gold annotation of the test set is released. Moreover, we will study the performance of using linguistic features such as Part-of-Speech tags as an input to the BERT model, as well as the use of ontologies related to the radiological domain such as RadLex.

Acknowledgments

This work has been partially supported by the LIVING-LANG project [RTI2018-094653-B-C21] of the Spanish Government and the Fondo Europeo de Desarrollo Regional (FEDER).

References

[1] C. Friedman, S. B. Johnson, Natural language and text processing in biomedicine, in: Biomedical Informatics, Springer, 2006, pp. 312–343.
[2] V. Cotik, L. A. Alemany, F. Luque, R. Roller, J. Vivaldi, A. Ayach, F. Carranza, L. D. Francesca, A. Dellanzo, M. F. Urquiza, Overview of CLEF eHealth Task 1 - SpRadIE: A challenge on information extraction from Spanish Radiology Reports, in: CLEF 2021 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, September 2021.
[3] H. Suominen, L. Goeuriot, L. Kelly, L. Alonso Alemany, E. Bassani, N. Brew-Sam, V. Cotik, D. Filippo, G. González-Sáez, F. Luque, P. Mulhem, G. Pasi, R. Roller, S. Seneviratne, R. Upadhyay, J. Vivaldi, M. Viviani, C. Xu, Overview of the CLEF eHealth Evaluation Lab 2021, in: CLEF 2021 - 11th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS), Springer, September 2021.
[4] E. Pons, L. M. Braun, M. M. Hunink, J. A. Kors, Natural language processing in radiology: a systematic review, Radiology 279 (2016) 329–343.
[5] A. Bustos, A. Pertusa, J.-M. Salinas, M. de la Iglesia-Vayá, PadChest: A large chest x-ray image dataset with multi-label annotated reports, Medical Image Analysis 66 (2020) 101797.
[6] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, R. M. Summers, ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2097–2106.
[7] G. L. Andriole, E. D. Crawford, R. L. Grubb III, S. S. Buys, D. Chia, T. R. Church, M. N. Fouad, C. Isaacs, P. A. Kvale, D. J. Reding, et al., Prostate cancer screening in the randomized Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial: mortality results after 13 years of follow-up, Journal of the National Cancer Institute 104 (2012) 125–132.
[8] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al., CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 590–597.
[9] A. E. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, Y. Peng, Z. Lu, R. G. Mark, S. J. Berkowitz, S. Horng, MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs, arXiv preprint arXiv:1901.07042 (2019).
[10] A. G. Agirre, M. Marimon, A. Intxaurrondo, O. Rabal, M. Villegas, M. Krallinger, PharmaCoNER: Pharmacological substances, compounds and proteins named entity recognition track, in: Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, 2019, pp. 1–10.
[11] A. Piad-Morffis, Y. Gutiérrez, S. Estevez-Velarde, Y. Almeida-Cruz, R. Muñoz, A. Montoyo, Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2020, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), 2020.
[12] L. Kelly, H. Suominen, L. Goeuriot, M. Neves, E. Kanoulas, D. Li, L. Azzopardi, R. Spijker, G. Zuccon, H. Scells, et al., Overview of the CLEF eHealth evaluation lab 2019, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2019, pp. 322–339.
[13] M. Marimon, A. Gonzalez-Agirre, A. Intxaurrondo, H. Rodríguez, J. A. Lopez Martin, M. Villegas, M. Krallinger, Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), CEUR Workshop Proceedings (CEUR-WS.org), Bilbao, Spain, 2019.
[14] M. Krallinger, F. Leitner, O. Rabal, M. Vazquez, J. Oyarzabal, A. Valencia, CHEMDNER: The drugs and chemical names extraction challenge, Journal of Cheminformatics 7 (2015) S1.
[15] W. Sun, A. Rumshisky, O. Uzuner, Evaluating temporal relations in clinical text: 2012 i2b2 Challenge, Journal of the American Medical Informatics Association 20 (2013) 806–813. doi:10.1136/amiajnl-2013-001628.
[16] P. L. Úbeda, M. C. D. Galiano, M. T. Martín-Valdivia, L. A. U. Lopez, Using machine learning and deep learning methods to find mentions of adverse drug reactions in social media, in: Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task, 2019, pp. 102–106.
[17] J. Lafferty, A. McCallum, F. C. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, 2001.
[18] P. López-Úbeda, M. Díaz-Galiano, M. Martín-Valdivia, L. A. Ureña-López, Extracting neoplasms morphology mentions in Spanish clinical cases through word embeddings, Proceedings of IberLEF (2020).
[19] P. López-Úbeda, J. M. Perea-Ortega, M. C. Díaz-Galiano, M. T. Martín-Valdivia, L. A. Ureña-López, SINAI at eHealth-KD Challenge 2020: Combining Word Embeddings for Named Entity Recognition in Spanish Medical Records, 2020.
[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019. arXiv:1810.04805.
[21] V. Cotik, D. Filippo, R. Roller, H. Uszkoreit, F. Xu, Annotation of Entities and Relations in Spanish Radiology Reports, in: RANLP, 2017, pp. 177–184.
[22] L. Padró, E. Stanilovsky, FreeLing 3.0: Towards wider multilinguality, in: LREC 2012, 2012.
[23] L. Ratinov, D. Roth, Design challenges and misconceptions in named entity recognition, in: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), 2009, pp. 147–155.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[25] J. Cañete, G. Chaperon, R. Fuentes, J. Pérez, Spanish Pre-Trained BERT Model and Evaluation Data, in: to appear in PML4DC at ICLR 2020, 2020.
[26] T. Wolf, J. Chaumond, L. Debut, V. Sanh, C. Delangue, A. Moi, P. Cistac, M. Funtowicz, J. Davison, S. Shleifer, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45.