=Paper=
{{Paper
|id=Vol-2696/paper_198
|storemode=property
|title=IAM at CLEF eHealth 2020: Concept Annotation in Spanish Electronic Health Records
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_198.pdf
|volume=Vol-2696
|authors=Sebastien Cossin,Vianney Jouhet
|dblpUrl=https://dblp.org/rec/conf/clef/CossinJ20
}}
==IAM at CLEF eHealth 2020: Concept Annotation in Spanish Electronic Health Records==
Sébastien Cossin (1,2) [0000-0002-3845-8127] and Vianney Jouhet (1,2) [0000-0001-5272-2265]

1 Univ. Bordeaux, Inserm, Bordeaux Population Health Research Center, team ERIAS, UMR 1219, F-33000 Bordeaux, France
2 CHU de Bordeaux, Pôle de santé publique, Service d'information médicale, Informatique et Archivistique Médicales (IAM), F-33000 Bordeaux, France
sebastien.cossin@u-bordeaux.fr

Abstract. In this paper, we describe the approach and the results of our participation in task 1 (multilingual information extraction) of the CLEF eHealth 2020 challenge. We tackled the task of automatically assigning ICD-10 diagnosis and procedure codes to Spanish electronic health records. We used a dictionary-based approach using only materials provided by the task organizers. The training set consisted of 750 clinical cases annotated by a medical expert. Our system achieved an F1-score of 0.69 for the detection of diagnoses and 0.52 for the detection of procedures on a test set of 250 clinical cases.

Keywords: Semantic annotation · Entity recognition · Natural Language Processing · Electronic Health Records · Spanish

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

An electronic health record is a patient-centered record that contains medical information about a patient's medical history, past and current medications, lab results, diagnoses, etc. Most of this medical information is provided by health care professionals in free-text format. Free text has many advantages, such as familiarity, ease of use and freedom to express complex things [3]. However, unstructured data are difficult to reuse and query to retrieve information. Natural Language Processing (NLP) develops methods to manage free-text data and extract the information required by applications such as clinical decision support systems. A frequent step in an NLP pipeline is the detection of medical entities (treatment, diagnosis) with named entity recognition (NER) algorithms. Linking each detected entity to a terminology or ontology is essential to leverage the power of knowledge graphs that bring external knowledge and meaning [4,5].

The objective of shared tasks is to foster the development of NLP tools. For many years, CLEF eHealth has proposed challenges to solve several real-world problems of information extraction from free-text data.

In this paper, we describe our approach and present the results of our participation in task 1 (multilingual information extraction) of the CLEF eHealth 2020 challenge [2,6]. This task focused on diagnosis and procedure coding of Spanish electronic health records. We addressed three subtasks:

1. ICD10-CM code assignment. In this sub-track the systems must detect symptoms and diseases mentioned in clinical notes by predicting ICD10-CM codes (International Classification of Diseases, Tenth Revision, Clinical Modification) for each document.
2. ICD10-PCS code assignment. In this sub-track the systems must detect procedures mentioned in clinical notes by predicting ICD10-PCS codes (International Classification of Diseases, Tenth Revision, Procedure Coding System) for each document.
3. Explainable AI. In this sub-track the systems must provide text annotations for each ICD10-CM and ICD10-PCS code prediction of sub-tracks 1 and 2.
We developed a biomedical semantic annotation tool for our own needs at Bordeaux hospital. The main motivation to participate in the challenge was to compare our system with others and to learn from them on a shared task.

2 Methods

In the following subsections, we describe the corpora and the terminologies used in this challenge, an exploratory analysis of the data and our system.

2.1 Corpora

The dataset provided by the organizers was called the CodiEsp corpus. The corpus comprised 1,000 Spanish clinical case studies selected by a practicing physician. Each clinical case was a plain text file and the filename was the identifier. The train set and the development set contained 500 and 250 clinical cases, respectively. For these sets, annotations were published. 3,001 clinical cases had to be annotated by the participants, of which 250 formed the test set and were only known by the organizers, to avoid manual corrections.

The annotation format was a tab-separated file with two fields for subtasks 1 and 2, corresponding to the clinical case identifier and an ICD10-CM or ICD10-PCS code. Three more fields were expected for the third subtask: the start and end offsets of each detected term and whether the code came from the ICD10-CM or the ICD10-PCS terminology.

2.2 Coding terminologies

Spanish versions of the ICD10-CM and the ICD10-PCS terminologies were provided by the organizers. The terminologies contained 98,288 and 75,789 different codes, respectively. In this task, only 2,921 distinct codes (1.7%) were present in the train and development sets. Therefore, the vast majority of codes were not used while others were frequent.

2.3 Corpora exploration

The Brat annotation tool [7] was used to visualize the annotations made by the medical expert. To do so, a script was developed to transform the task file format into the Brat file format. Figure 1 presents a screenshot of the Brat interface with the first 4 lines of a clinical case from the development set.

Fig. 1. The Brat interface was used to visualize the annotations made by the medical expert. Each annotation was linked to an ICD10-CM or ICD10-PCS code (not shown here).

Four key insights emerged from this visualization:

– All clinical diagnoses and procedures detected in a clinical note had to be coded. This differs from medical coding for reimbursement, where only what was done during a medical encounter should be coded.
– The system did not have to detect negation. The mention of the absence of a diagnosis or a procedure was also coded.
– The different words that make up an annotation can be very far apart in a clinical note, often without syntactic dependencies. Detecting terms composed of nonadjacent words seemed to be very challenging.
– Spelling mistakes were very rare.

The development and train sets were combined into a single set, later referred to as the 'training set'. The training set contained 750 clinical cases with 10,678 and 3,018 annotations of diagnoses and procedures, respectively. The proportion of annotations made up of nonadjacent words was 14.6% for diagnoses and 39.7% for procedures.

2.4 Algorithm

We reused the algorithm we developed for the multilingual information extraction task at CLEF eHealth 2018, which consisted of automatically assigning ICD-10 codes to French death certificates [1]. The algorithm was described in detail on that occasion and the code is available at https://github.com/scossin/IAMsystem. The algorithm uses a dictionary-based approach.

It takes as input a normalized dictionary and stores it in a tree data structure in which each token corresponds to a node. A text to annotate is tokenized after undergoing the same normalization process as the terminology: words are normalized through accent (diacritical mark) and punctuation removal, lowercasing and stopword removal (if a stopword list is given). The algorithm tries to match a token in the text with a token in the tree using three different techniques: exact match, abbreviation match or a string-distance match based on the Levenshtein distance. The abbreviation match technique uses a dictionary of abbreviations that may be provided as input. When the last token of a term, at a leaf of the tree, is matched, the algorithm outputs an annotation, meaning a term was found. This algorithm cannot detect a term if its words are not in the right order or are nonadjacent, which occurred frequently in this task.
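To make the matching procedure more concrete, the sketch below shows a much simplified Python version of such a dictionary-based matcher. It is only an illustration of the exact and Levenshtein matching steps described above, not the actual IAMsystem code: abbreviation matching is omitted, and the tokenization, normalization details and function names are assumptions.

```python
# Minimal sketch of a dictionary-based matcher inspired by the approach
# described above. Simplified illustration only; not the IAMsystem code.
import re
import unicodedata

END = "$term"  # marker key storing the original label at the end of a term


def normalize(text, stopwords=frozenset()):
    """Remove diacritics and punctuation, lowercase and drop stopwords."""
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]


def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two tokens."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]


def build_trie(terms, stopwords=frozenset()):
    """Store every normalized term in a token tree (one token per node)."""
    root = {}
    for label in terms:
        node = root
        for token in normalize(label, stopwords):
            node = node.setdefault(token, {})
        node[END] = label
    return root


def annotate(text, trie, stopwords=frozenset(), max_distance=1):
    """Return (first_token_index, last_token_index, matched_term) triples."""
    tokens = normalize(text, stopwords)
    annotations = []
    for start in range(len(tokens)):
        node, i = trie, start
        while i < len(tokens):
            # 1) exact match, then 2) fuzzy match based on the Levenshtein
            #    distance (the abbreviation matching of the real system is
            #    omitted in this sketch)
            child = node.get(tokens[i])
            if child is None:
                child = next((c for key, c in node.items()
                              if key != END
                              and levenshtein(key, tokens[i]) <= max_distance),
                             None)
            if child is None:
                break
            node, i = child, i + 1
            if END in node:  # a complete term ends here
                annotations.append((start, i - 1, node[END]))
    return annotations


# Example with an invented sentence and term:
#   trie = build_trie(["insuficiencia renal aguda"])
#   annotate("Paciente con insuficiencia renal aguda.", trie)
#   -> [(2, 4, 'insuficiencia renal aguda')]
```

In practice, a fuzzy (Levenshtein) match would typically be restricted to tokens above a minimum length to avoid spurious matches on short words; that refinement is left out of the sketch.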
Dictionaries

Interestingly, only 4% of the terms annotated by the medical expert were found in the terminologies after normalization. By comparing the terms from the annotations and the terms from the terminologies, a list of the most frequent stopwords was created (no especificado, no especificada...). This list was used to normalize the terminology. Two dictionaries were constructed:

– The first dictionary (run1) contained only the terms from the annotations of the medical expert in the training set, which corresponded to 6,316 terms.
– The second dictionary (run2) was the combination of the first dictionary and the normalized labels of the ICD10-CM terminology. It contained a total of 94,386 terms.

These two dictionaries were tested on the training set to detect and remove terms that could hinder the evaluation metrics. For each term, we calculated the number of times it was annotated by the algorithm and by the human annotator. If the ratio between these two numbers was greater than 2, the term was removed from the dictionary. For example, the term "renal" was annotated 67 times by the algorithm but only 2 times by the human expert in the development set. Keeping this term would decrease the precision and the F1 score, although the recall would be slightly increased.
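This pruning heuristic can be sketched as follows; the function and variable names are invented for illustration, and the per-term annotation counts on the training set are assumed to have been computed beforehand.

```python
def prune_dictionary(terms, algo_counts, human_counts, max_ratio=2.0):
    """Drop terms that the matcher annotates far more often than the expert.

    algo_counts[term]  -- number of annotations produced by the matcher
                          on the training set
    human_counts[term] -- number of annotations made by the medical expert
    A term is kept only when algo_count / human_count <= max_ratio.
    """
    kept = []
    for term in terms:
        algo = algo_counts.get(term, 0)
        human = human_counts.get(term, 0)
        if human == 0:
            # never annotated by the expert: any hit is a likely false positive
            ratio = float("inf") if algo > 0 else 0.0
        else:
            ratio = algo / human
        if ratio <= max_ratio:
            kept.append(term)
    return kept
```

With the figures reported above, "renal" (67 matcher annotations vs. 2 expert annotations, a ratio of 33.5) would be removed, trading a small loss in recall for a gain in precision and F1 score.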
3 Results

We submitted two runs (one for each dictionary) for the three subtasks. It took less than 5 seconds to annotate the 3,001 documents for the three subtasks on a laptop with an Intel Core i7-5700HQ @ 2.70GHz (8 CPUs). We obtained our best F1 score with dictionary 1 for detecting the diagnoses and with dictionary 2 for identifying the procedures. Table 1 shows the performance of our system.

Table 1. System performance on the CodiEsp test set. MAP: Mean Average Precision.

| Subtask | Run | MAP | Precision | Recall | F1 score |
|---|---|---|---|---|---|
| 1 | 1 | 0.52 | 0.82 | 0.59 | 0.69 |
| 1 | 2 | 0.51 | 0.79 | 0.59 | 0.68 |
| 2 | 1 | 0.43 | 0.66 | 0.37 | 0.48 |
| 2 | 2 | 0.49 | 0.69 | 0.42 | 0.52 |
| 3 | 1 | - | 0.005 | 0.003 | 0.005 |
| 3 | 2 | - | 0.006 | 0.004 | 0.005 |
| 3 (non-official) | 1 | - | 0.75 | 0.52 | 0.61 |
| 3 (non-official) | 2 | - | 0.73 | 0.52 | 0.61 |

In subtask 3, an error was detected after publication of the official results. A miscalculation of the end offset position of each term was fixed and the performance (non-official) was reassessed by the organizers.

4 Discussion

The performance was better for the diagnosis subtask than for the procedure one. This was not surprising, since the proportion of terms composed of nonadjacent words was higher in the latter task and our algorithm cannot detect such terms. This missing functionality was the main limitation of our system and probably had a strong impact on our results.

In 2018, the same algorithm obtained an F1-score of 0.786 (precision: 0.794, recall: 0.779) on the task of coding French death certificates with the ICD-10 terminology. These better results in 2018 can be explained by a greater number of terms annotated by a medical expert, shorter texts to annotate and the absence of long dependencies between words in death certificates.

Adding additional terms (run 2) did not improve the recall in subtask 1 and even reduced the precision. The opposite was observed for procedures (subtask 2), where the addition of terms (run 2) improved both recall and precision. The labels of the ICD-10 terminologies were of little interest compared to the labels from the annotations.

The main advantage of our algorithm is its simplicity and speed. All it needs is a dictionary, a list of abbreviations (optional) and a list of stopwords (optional). The algorithm provides an explanation by outputting the start and end position of each detected term. The proposed algorithm can be used as a baseline method for any named entity recognition task and could be integrated into another system to create a more complex approach.

Recently, deep learning models based on CNNs and RNNs have been shown to achieve better performance on NER tasks in the clinical domain [8]. However, these models are more data-hungry and their training is very costly in terms of computational power. Their advantages are diminished when the number of annotations is low, because it is very difficult for them to predict unseen codes compared to a dictionary-based approach. In this task, only 1.7% of the codes were present in the training set. Further improvement may be possible by adding a better curated terminology, a longer list of abbreviations and a phonetic matching strategy to our dictionary-based approach.

References

1. Cossin, S., Jouhet, V., Mougin, F., Diallo, G., Thiessard, F.: IAM at CLEF eHealth 2018: Concept Annotation and Coding in French Death Certificates. arXiv:1807.03674 [cs] (Jul 2018), http://arxiv.org/abs/1807.03674
2. Goeuriot, L., Suominen, H., Kelly, L., Miranda-Escalada, A., Krallinger, M., Liu, Z., Pasi, G., Gonzales Saez, G., Viviani, M., Xu, C.: Overview of the CLEF eHealth Evaluation Lab 2020. In: Arampatzis, A. (ed.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020), vol. 12260. LNCS (2020)
3. Johnson, S.B., Bakken, S., Dine, D., Hyun, S., Mendonça, E., Morrison, F., Bright, T., Van Vleck, T., Wrenn, J., Stetson, P.: An Electronic Health Record Based on Structured Narrative. Journal of the American Medical Informatics Association 15(1), 54–64 (2008). https://doi.org/10.1197/jamia.M2131, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2274868/
4. Jovanović, J., Bagheri, E.: Semantic annotation in biomedicine: the current landscape. Journal of Biomedical Semantics 8 (Sep 2017). https://doi.org/10.1186/s13326-017-0153-x, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5610427/
5. Karadeniz, I., Ozgur, A.: Linking entities through an ontology using word embeddings and syntactic re-ranking. BMC Bioinformatics 20 (Mar 2019). https://doi.org/10.1186/s12859-019-2678-8, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6437991/
6. Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of eHealth CLEF 2020. CEUR-WS (2020)
7. Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., Tsujii, J.: BRAT: A Web-based Tool for NLP-assisted Text Annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 102–107. EACL '12, Association for Computational Linguistics, Stroudsburg, PA, USA (2012), http://dl.acm.org/citation.cfm?id=2380921.2380942
8. Wu, Y., Jiang, M., Xu, J., Zhi, D., Xu, H.: Clinical Named Entity Recognition Using Deep Learning Models. AMIA Annual Symposium Proceedings 2017, 1812–1819 (Apr 2018), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5977567/