=Paper=
{{Paper
|id=Vol-2696/paper_117
|storemode=property
|title=SINAI at CLEF eHealth 2020: Testing Different pre-trained Word Embeddings for Clinical Coding in Spanish
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_117.pdf
|volume=Vol-2696
|authors=José M. Perea-Ortega,Pilar López-Úbeda,Manuel Carlos Díaz-Galiano,María-Teresa Martín-Valdivia,L. Alfonso Ureña-López
|dblpUrl=https://dblp.org/rec/conf/clef/Perea-OrtegaLDV20
}}
==SINAI at CLEF eHealth 2020: Testing Different pre-trained Word Embeddings for Clinical Coding in Spanish==
José M. Perea-Ortega (1) [0000-0002-7929-3963], Pilar López-Úbeda (2) [0000-0003-0478-743X], Manuel C. Díaz-Galiano (2) [0000-0001-9298-1376], M. Teresa Martín-Valdivia (2) [0000-0002-2874-0401], and L. Alfonso Ureña-López (2) [0000-0001-7540-4059]

(1) University of Extremadura, Badajoz, Spain. jmperea@unex.es
(2) University of Jaén, Jaén, Spain. {plubeda,mcdiaz,maite,laurena}@ujaen.es

Abstract. This paper describes the system presented by the SINAI team for the Multilingual Information Extraction task of the CLEF eHealth Lab 2020. This task focuses on the automatic assignment of International Classification of Diseases (ICD) codes to health-related texts in Spanish. Our proposal follows a deep learning-based approach in which we used the bidirectional variant of a Long Short-Term Memory (LSTM) network along with a stacked Conditional Random Fields (CRF) decoding layer (BiLSTM+CRF). The aim of the experiments carried out was to test the performance of different pre-trained word embeddings for recognizing diagnoses and procedures in clinical text. The main finding was that combining word embeddings can be a useful strategy for deep learning-based approaches, even when the combined embeddings do not belong to the medical domain. The best MAP scores achieved were 0.314 and 0.293 for the CodiEsp-D and CodiEsp-P subtasks, respectively.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

Within health organizations, clinical coding can be seen as a task in which information from Electronic Health Records (EHR) is translated into alphanumeric codes using internationally recognized terminologies or classifications. For example, acute appendicitis is represented by the code 'K35.8' in the International Classification of Diseases (ICD). In hospitals, these data are critical for clinical professionals, research, and other purposes, such as statistical analysis and decision-making. However, this task is often performed manually by clinical coders, for whom the effort required for information abstraction is extremely laborious, time-consuming, and prone to human error.

To alleviate this problem, the research community has promoted the organization of challenges and shared tasks to foster automatic clinical coding systems. Over the past years, CLEF eHealth has offered challenges addressing several aspects of health-related information access, providing researchers with datasets to work with and validate their outcomes [18, 17, 7]. In 2020, the lab continues to offer two shared tasks: i) Multilingual Information Extraction (IE), which focuses on ICD coding of clinical textual data in Spanish, and ii) Consumer Health Search, which follows a standard information retrieval shared challenge paradigm.

This paper describes the system presented by the SINAI team for the Multilingual IE subtask of the CLEF eHealth Lab 2020. Automatic assignment of ICD codes to health-related texts can be considered a special case of multilabel text classification, which may be approached either from a Natural Language Processing (NLP) perspective, using syntactic and/or semantic decision rules, or from a machine learning perspective.
For this purpose, machine learning algorithms have been successfully applied, particularly those based on deep learning methods. In this paper, we mainly focus on Recurrent Neural Networks (RNNs), specifically on the bidirectional variant of Long Short-Term Memory along with a stacked Conditional Random Fields decoding layer (BiLSTM+CRF) [8, 15]. For training the network, our approach uses different types of vectors representing word meanings (word embeddings), relying only on the training data provided by the organizers.

In the next section, we briefly present the background. Section 3 describes the architecture of the system we submitted to the Multilingual IE task of the CLEF eHealth lab. Section 4 reports the results obtained for the different experiments carried out and, finally, conclusions and future work are presented in Section 5.

2 Background

Clinical coding can be approached as a Named Entity Recognition (NER) task in which medical concepts must first be detected within the text and then mapped to the specific code for each concept. In recent years, deep learning approaches have been used for NER, leading to state-of-the-art results [9, 5, 14]. Our group has experience in clinical NER using different methodologies, such as traditional machine learning [11], Recurrent Neural Networks (RNNs) [12] and unsupervised machine learning [10, 13].

Clinical NER is commonly approached as a sequence labelling problem, where the text is treated as a sequence of words to be labeled with linguistic tags. Current state-of-the-art approaches for sequence labeling propose the use of RNNs to learn useful representations automatically, since they facilitate modeling long-distance dependencies between the words in a sentence. These networks usually rely on word embeddings, which are commonly pre-trained over very large corpora to capture latent syntactic and semantic similarities between words. A novel type of word embedding called contextual string embeddings was proposed by Akbik et al. [2]; these essentially model words as sequences of characters, contextualizing a word by its surrounding text and allowing the same word to have different embeddings depending on its contextual use.

3 System Overview

3.1 Dataset

The corpus provided for the Multilingual IE task of the CLEF eHealth lab consisted of 1,000 clinical cases comprising 16,504 sentences and 396,988 words, with an average of 396.2 words per clinical case [16]. The corpus had 18,483 annotated codes, of which 3,427 were unique. These were divided into two groups:

– ICD10-CM codes (CIE10 Diagnóstico in Spanish). These codes belong to the International Classification of Diseases, 10th revision, Clinical Modification, and they are tagged as DIAGNOSTICO.
– ICD10-PCS codes (CIE10 Procedimiento in Spanish). These codes belong to the International Classification of Diseases, 10th revision, Procedure codes (related to procedures performed in hospitals), and they are tagged as PROCEDIMIENTO.

The entire corpus was randomly sampled into three subsets: training, development and test. The training set comprised 500 clinical cases, and the development and test sets 250 clinical cases each. Together with the test set, the organizers released an additional collection of more than 2,000 documents (the background set) to make sure that participating teams were not able to make manual corrections.

We applied a preliminary preprocessing phase to the train and dev datasets provided for the task, considering DIAGNOSTICO and PROCEDIMIENTO annotations separately. First, we used Freeling [19] to tokenize the text and obtain the Part-Of-Speech (POS) tag of each word. Then, we generated the training corpus with the following features: original form of the word, POS tag and NER tag. For the NER tagging, the provided annotations were encoded using the BIO tagging scheme, which indicates whether a token is at the beginning of an entity (B-ENT), inside an entity (I-ENT), or outside any entity (O); a sketch of this conversion follows Figure 1 below. Finally, only the sentences containing BIO tags were considered when generating the training corpus. Figure 1 shows an example of the generated training corpus for the assignment of DIAGNOSTICO codes (left) and PROCEDIMIENTO codes (right).

Fig. 1. Example of the generated training corpus for the assignment of DIAGNOSTICO codes (left) and PROCEDIMIENTO codes (right). [The figure shows two columns of word / POS tag / BIO tag triples, e.g. "infecciones NC B-ENT", "del SP I-ENT", "tracto NC I-ENT", "urinario AQ I-ENT".]
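To illustrate this encoding step, the following minimal Python sketch converts character-level entity annotations into the B-ENT/I-ENT/O scheme described above. This is not the exact code used by our system; the function name and the assumption that tokens carry character offsets (as a tokenizer such as Freeling can provide) are ours.

```python
def spans_to_bio(tokens, entity_spans):
    """Assign BIO tags to tokens given annotated entity character spans.

    tokens: list of (word, start_offset, end_offset) triples
    entity_spans: list of (start_offset, end_offset) pairs
    """
    tagged = []
    for word, t_start, t_end in tokens:
        tag = "O"
        for e_start, e_end in entity_spans:
            if e_start <= t_start and t_end <= e_end:
                # First token of the entity gets B-ENT, the rest I-ENT
                tag = "B-ENT" if t_start == e_start else "I-ENT"
                break
        tagged.append((word, tag))
    return tagged

# Example: "infecciones del tracto urinario" annotated as one entity
tokens = [("infecciones", 0, 11), ("del", 12, 15),
          ("tracto", 16, 22), ("urinario", 23, 31)]
print(spans_to_bio(tokens, [(0, 31)]))
# [('infecciones', 'B-ENT'), ('del', 'I-ENT'),
#  ('tracto', 'I-ENT'), ('urinario', 'I-ENT')]
```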
3.2 BiLSTM+CRF architecture

Our proposal follows a deep learning-based approach in which a Recurrent Neural Network (RNN) is used to generate different learning models. Specifically, we used the bidirectional variant of Long Short-Term Memory along with a stacked Conditional Random Fields decoding layer (BiLSTM+CRF) [8, 15]. This specialized architecture was chosen to approach NER because it facilitates the processing of input sequences of arbitrary length and enables the learning of long-distance dependencies, which is particularly advantageous for detecting medical concepts in clinical coding. Moreover, our approach combines different types of pre-trained word embeddings by concatenating the embedding vectors to form the final word vectors. In this way, the probability of recognizing a specific medical concept in a text should increase, since different types of word representation are combined. As for contextual string embeddings, since they are robust in the face of misspelled words, we expect them to be highly suitable for clinical NER.

As shown in Figure 2, the proposed architecture obtains a context for each word in the clinical case using the BiLSTM (encoding layer), and then makes word predictions simultaneously in the CRF layer (decoding layer). It should be noted that diagnoses and procedures were managed independently, i.e., we generated some learning models to predict diagnoses exclusively and different models to predict procedures. We used the Flair library (http://github.com/flairNLP/flair) [1] to apply this architecture; a minimal instantiation sketch follows Figure 2. Flair is an open-source NLP library developed by Zalando Research. It is built on PyTorch (http://pytorch.org) and has fairly good GPU support.

Fig. 2. BiLSTM+CRF architecture that uses different word embeddings as an input layer.
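The sketch below shows how such a BiLSTM+CRF tagger can be instantiated with the Flair API of that period (roughly version 0.4/0.5). The directory layout, file names and hidden size are illustrative assumptions rather than settings taken from the paper; the column format matches the word / POS / BIO training corpus described in Section 3.1.

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger

# Column format of the generated training corpus: word, POS tag, BIO tag
# (the data folder and file names are hypothetical)
corpus = ColumnCorpus("data/codiesp_d", {0: "text", 1: "pos", 2: "ner"},
                      train_file="train.txt", dev_file="dev.txt")

tagger = SequenceTagger(
    hidden_size=256,                  # Flair's default BiLSTM hidden size
    embeddings=WordEmbeddings("es"),  # see Section 3.3 for the variants tested
    tag_dictionary=corpus.make_tag_dictionary(tag_type="ner"),
    tag_type="ner",
    use_crf=True,                     # stack a CRF decoding layer on the BiLSTM
)
```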
3.3 Pre-trained Word Embeddings

RNNs generally use an embedding layer as input, which makes it possible to represent words with dense vector representations. In order to fit the text input into the BiLSTM+CRF architecture, we combined different types of pre-trained word embeddings (a combination sketch follows the list below):

– Classic Word Embeddings. Classic word embeddings are static and word-level, meaning that each distinct word gets exactly one pre-computed embedding. For our experiments we used the WordEmbeddings class provided by the Flair library [1], initialized with fastText (https://fasttext.cc) embeddings pre-trained on the Spanish Wikipedia.
– Contextual Word Embeddings. Contextual word embeddings are considered powerful because they capture latent syntactic-semantic information that goes beyond standard word embeddings [2]. These embeddings are based on character-level language modeling, and their use is particularly advantageous when the NER task is approached as a sequential labeling problem. For our experiments we used the FlairEmbeddings class provided by the Flair library. These contextual string embeddings were pre-trained on the Spanish Wikipedia.
– Word Embeddings based on Transformers. Bidirectional Encoder Representations from Transformers (BERT) [6] is based on a multilayer bidirectional transformer encoder, where the transformer neural network uses parallel attention layers rather than sequential recurrence [21]. This kind of embedding is commonly pre-trained over very large corpora to capture latent syntactic and semantic similarities between words. For our experiments we used the BETO cased embeddings [3], a BERT model trained on a big corpus composed of text portions extracted from different web sources in Spanish.
– In-domain Word Embeddings. Most of the available word embeddings focus on general-domain texts and do not necessarily transfer well to clinical text analysis [4]. In order to test biomedical word embeddings in our experiments, we used the first version of the Spanish Medical Embeddings (http://doi.org/10.5281/zenodo.2542722) [20], which are based on the fastText model and were developed from two data sources: (i) the SciELO database, and (ii) Wikipedia Health, comprising the categories Pharmacology, Pharmacy, Medicine and Biology.
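A minimal sketch of how these embedding types can be concatenated with Flair's StackedEmbeddings class, using the Word+Flair combination that performed best in our runs (see Section 4). The model identifiers ("es", "es-forward", "es-backward") are Flair's standard Spanish identifiers and are our assumption, not taken from the paper.

```python
from flair.embeddings import (FlairEmbeddings, StackedEmbeddings,
                              WordEmbeddings)

# Word+Flair (Run 2): concatenate classic fastText vectors with forward
# and backward contextual string embeddings; each word's final vector is
# the concatenation of all three component vectors.
word_flair = StackedEmbeddings(embeddings=[
    WordEmbeddings("es"),            # classic Spanish fastText embeddings
    FlairEmbeddings("es-forward"),   # contextual, forward character LM
    FlairEmbeddings("es-backward"),  # contextual, backward character LM
])
```

The stacked object can then be passed to the SequenceTagger shown in Section 3.2 in place of the single WordEmbeddings instance.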
4 Experiments and Results

Our team submitted a total of 10 runs for the Multilingual IE task, 5 for each of the two main subtasks: diagnosis coding (CodiEsp-D) and procedure coding (CodiEsp-P). In addition, another 5 runs were submitted for the exploratory subtask called CodiEsp-X, where systems were required to submit, for each predicted code, its textual reference, for both diagnoses and procedures.

The aim of the experiments carried out was to test the performance of different pre-trained word embeddings for recognizing diagnoses and procedures in clinical text. Thus, several learning models were generated using the default hyperparameter settings in Flair: a learning rate of 0.1, a batch size of 32, a dropout probability of 0.5, and a maximum of 150 epochs. All experiments were performed on a single Tesla V100 32 GB GPU with 192 GB of RAM. The configuration used for each submitted run is shown below (a minimal training sketch with these hyperparameters follows the list):

– Run 1: Spanish Medical Embeddings (SME). In-domain word embeddings generated from two data sources: (i) the SciELO database, and (ii) Wikipedia Health.
– Run 2: WordEmbeddings + FlairEmbeddings (Word+Flair). This was performed using the StackedEmbeddings class of Flair, whereby words are embedded in a single vector formed by concatenating the combined embeddings.
– Run 3: WordEmbeddings (WordEmbed).
– Run 4: BETO cased embeddings (BETO).
– Run 5: FlairEmbeddings (Flair).
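A minimal training sketch under those settings, assuming the tagger and corpus objects from the earlier sketches. The output path is hypothetical, and the 0.5 dropout corresponds to the locked-dropout default that Flair applies inside the tagger rather than to an explicit argument here.

```python
from flair.trainers import ModelTrainer

# Train with the default Flair hyperparameters reported above:
# learning rate 0.1, mini-batch size 32, at most 150 epochs.
# (The 0.5 dropout is the tagger's built-in locked-dropout default.)
trainer = ModelTrainer(tagger, corpus)
trainer.train("models/codiesp_d_word_flair",  # hypothetical output path
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150)
```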
The evaluation metrics defined by the organizers were those commonly used for NLP tasks such as NER or information retrieval, namely Mean Average Precision (MAP), Precision (P), Recall (R), and F1-score. Table 1 and Table 2 show the results obtained by the SINAI team for the main and exploratory subtasks, respectively.

Subtask     Model       MAP    P      R      F1-score
CodiEsp-D   SME         0.301  0.412  0.538  0.467
            Word+Flair  0.314  0.443  0.544  0.488
            WordEmbed   0.302  0.418  0.540  0.471
            BETO        0.251  0.450  0.433  0.441
            Flair       0.291  0.402  0.528  0.456
CodiEsp-P   SME         0.280  0.367  0.452  0.405
            Word+Flair  0.293  0.370  0.476  0.416
            WordEmbed   0.271  0.342  0.455  0.391
            BETO        0.250  0.343  0.422  0.378
            Flair       0.254  0.318  0.458  0.376

Table 1. Official results obtained by the SINAI team in the main subtasks CodiEsp-D and CodiEsp-P.

Subtask     Model       P      R      F1-score
CodiEsp-X   SME         0.330  0.425  0.371
            Word+Flair  0.360  0.447  0.399
            WordEmbed   0.323  0.420  0.365
            BETO        0.337  0.346  0.342
            Flair       0.313  0.421  0.359

Table 2. Official results obtained by the SINAI team in the exploratory subtask CodiEsp-X.

As shown in Table 1, the results obtained for both main subtasks are relatively low. This behavior may be due to the limited amount of training data used, since we only used the sentences with BIO tags found in the train and dev datasets provided by the organization. Another reason for the poor performance could be the use of embeddings that were not generated from medical texts. Nevertheless, our best MAP result in both subtasks was achieved when different pre-trained word embeddings were combined (classic and contextual) and used as the input layer to the BiLSTM+CRF architecture. This suggests that combining word embeddings could be an interesting strategy to consider in the future, even when the combined embeddings do not belong to the medical domain.

5 Conclusions and future work

This paper describes the participation of the SINAI research group in the Multilingual Information Extraction task of the CLEF eHealth Lab 2020. This task focuses on the automatic assignment of codes to clinical textual data in Spanish. The classification proposed to perform the coding is the Spanish version of the International Classification of Diseases, 10th revision, ICD10 (CIE10 in Spanish). Two main NLP subtasks were defined: diagnosis coding (CodiEsp-D) and procedure coding (CodiEsp-P).

Our proposal follows a deep learning-based approach for clinical NER. It is based on a BiLSTM+CRF architecture in which different pre-trained word embeddings are used as input to the neural network. Training is then performed using the annotated datasets provided by the organization, which were previously tokenized and NER-tagged using the BIO scheme. Our main goal was to test the performance of different types of pre-trained word embeddings for detecting and recognizing diagnoses and procedures in medical texts in Spanish. We believe that the poor performance obtained is due to the limited amount of training data and to the use of word embeddings that were not generated from medical texts. Nevertheless, the main finding was that combining word embeddings could be a useful strategy for deep learning-based approaches, even when the combined embeddings do not belong to the medical domain.

For future work, we should first analyze in depth why the results were low. Further research should then focus on injecting domain knowledge into the deep learning model. Another future direction would be to explore machine translation of the Spanish texts into English, in order to exploit the greater availability of existing knowledge resources in English.

Acknowledgements

This work has been partially supported by the LIVING-LANG project (RTI2018-094653-B-C21) of the Spanish Government, Junta de Extremadura (GR18135) and Fondo Europeo de Desarrollo Regional (FEDER).

References

1. Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., Vollgraf, R.: FLAIR: An easy-to-use framework for state-of-the-art NLP. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). pp. 54–59 (2019)
2. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics. pp. 1638–1649 (2018)
3. Cañete, J., Chaperon, G., Fuentes, R., Pérez, J.: Spanish pre-trained BERT model and evaluation data. In: Practical ML for Developing Countries Workshop (ICLR 2020) (2020)
4. Chiu, B., Crichton, G.K.O., Korhonen, A., Pyysalo, S.: How to train good word embeddings for biomedical NLP. In: Cohen, K.B., Demner-Fushman, D., Ananiadou, S., Tsujii, J. (eds.) BioNLP@ACL. pp. 166–174. Association for Computational Linguistics (2016)
5. Chokwijitkul, T., Nguyen, A., Hassanzadeh, H., Perez, S.: Identifying risk factors for heart disease in electronic medical records: A deep learning approach. In: Proceedings of the BioNLP 2018 workshop. pp. 18–27. Association for Computational Linguistics, Melbourne, Australia (Jul 2018). https://doi.org/10.18653/v1/W18-2303, https://www.aclweb.org/anthology/W18-2303
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
7. Goeuriot, L., Suominen, H., Kelly, L., Miranda-Escalada, A., Krallinger, M., Liu, Z., Pasi, G., Saez Gonzales, G., Viviani, M., Xu, C.: Overview of the CLEF eHealth evaluation lab 2020. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). LNCS, vol. 12260 (2020)
8. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
9. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 260–270. Association for Computational Linguistics, San Diego, California (Jun 2016). https://doi.org/10.18653/v1/N16-1030, https://www.aclweb.org/anthology/N16-1030
10. López-Úbeda, P., Díaz-Galiano, M.C., Martín-Valdivia, M.T., Ureña-López, L.A.: SINAI en TASS 2018 Task 3: Clasificando acciones y conceptos con UMLS en MedLine. Proceedings of TASS (2018)
11. López-Úbeda, P., Díaz-Galiano, M.C., Martín-Valdivia, M.T., Ureña-López, L.A.: Using machine learning and deep learning methods to find mentions of adverse drug reactions in social media. In: Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task. pp. 102–106 (2019)
12. López-Úbeda, P., Díaz-Galiano, M.C., Ureña-López, L.A., Martín-Valdivia, M.T.: Using Snomed to recognize and index chemical and drug mentions. In: Proceedings of The 5th Workshop on BioNLP Open Shared Tasks. pp. 115–120 (2019)
13. López-Úbeda, P., Díaz-Galiano, M.C., Montejo-Ráez, A., Martín-Valdivia, M.T., Ureña-López, L.A.: An integrated approach to biomedical term identification systems. Applied Sciences 10(5), 17–26 (2020)
14. Luu, T.M., Phan, R., Davey, R., Chetty, G.: Clinical name entity recognition based on recurrent neural networks. In: 2018 18th International Conference on Computational Science and Applications (ICCSA). pp. 1–9 (2018)
15. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Long Papers (2016). https://doi.org/10.18653/v1/p16-1101
16. Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of eHealth CLEF 2020. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). LNCS, vol. 12260 (2020)
17. Neves, M.L., Butzke, D., Dörendahl, A., Leich, N., Hummel, B., Schönfelder, G., Grune, B.: Overview of the CLEF eHealth 2019 multilingual information extraction. In: Cappellato, L., Ferro, N., Losada, D.E., Müller, H. (eds.) CLEF (Working Notes). CEUR Workshop Proceedings, vol. 2380. CEUR-WS.org (2019), http://dblp.uni-trier.de/db/conf/clef/clef2019w.html#NevesBDLHSG19
18. Névéol, A., Robert, A., Grippo, F., Morgand, C., Orsi, C., Pelikan, L., Ramadier, L., Rey, G., Zweigenbaum, P.: CLEF eHealth 2018 multilingual information extraction task overview: ICD10 coding of death certificates in French, Hungarian and Italian. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) CLEF (Working Notes). CEUR Workshop Proceedings, vol. 2125. CEUR-WS.org (2018), http://dblp.uni-trier.de/db/conf/clef/clef2018w.html#NeveolRGMOPRRZ18
19. Padró, L., Stanilovsky, E.: FreeLing 3.0: Towards wider multilinguality. In: Proceedings of the Language Resources and Evaluation Conference (LREC 2012). ELRA, Istanbul, Turkey (May 2012)
20. Soares, F., Villegas, M., Gonzalez-Agirre, A., Krallinger, M., Armengol-Estapé, J.: Medical word embeddings for Spanish: Development and evaluation. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. pp. 124–133. Association for Computational Linguistics, Minneapolis, Minnesota, USA (Jun 2019). https://doi.org/10.18653/v1/W19-1916, https://www.aclweb.org/anthology/W19-1916
21. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008 (2017)