The E3C Project: European Clinical Case Corpus

El proyecto E3C: European Clinical Case Corpus

Bernardo Magnini¹, Begoña Altuna¹,², Alberto Lavelli¹, Manuela Speranza¹, Roberto Zanoli¹
¹ Fondazione Bruno Kessler
² Universidad del País Vasco/Euskal Herriko Unibertsitatea
{magnini;altuna;lavelli;manspera;zanoli}@fbk.eu

Abstract: The European Clinical Case Corpus (E3C) project aims to collect and annotate a large corpus of clinical documents in five European languages (Spanish, Basque, English, French and Italian), which will be freely distributed. Annotations include temporal information, to allow temporal reasoning on chronologies, and information about clinical entities based on medical taxonomies, to be used for semantic reasoning.
Keywords: Clinical data, corpus, multilingual, temporal information, clinical entities.

Resumen: El proyecto European Clinical Case Corpus (E3C) pretende reunir y anotar un gran corpus de documentos clínicos en cinco lenguas europeas (español, euskera, inglés, francés e italiano) que será distribuido libremente. Las anotaciones incluyen información temporal para permitir el razonamiento temporal en cronologías e información sobre entidades clínicas basada en taxonomías médicas para su uso en razonamiento semántico.
Palabras clave: información clínica, multilingüe, información temporal, entidades clínicas.

1 Introduction

E3C, European Clinical Case Corpus, is a one-year project (started in July 2020) that aims to create a corpus of clinical documents in five European languages: Spanish, Basque, English, French and Italian. The project is partially supported by the European Language Grid project through its open call for pilot projects¹. In turn, the European Language Grid project has received funding from the European Union's Horizon 2020 Research and Innovation programme under Grant Agreement no. 825627 (ELG). The project is led by FBK (Fondazione Bruno Kessler) and part of the activities has been subcontracted to the Université d'Orléans.

The core of the corpus is a manually annotated dataset of clinical cases. A clinical case is a report of clinical practice that presents the reason for a clinical visit, the description of physical exams, and the assessment of the patient's situation (Box 1).

    A 25-year-old man with a history of Klippel-Trenaunay syndrome presented to the hospital with mucopurulent bloody stool and epigastric persistent colic pain for 2 wk. Colonoscopy showed continuous superficial ulcers and bleeding. Subsequent gastroscopy revealed mucosa with diffuse edema, ulcers, errhysis, and granular and friable changes in the stomach and duodenal bulb. A diagnosis of GDUC was considered. The patient hesitated about iv corticosteroids, so he was treated with pentasa 3.2 g/d. After 0.5 mo of treatment, the symptoms achieved complete remission. Follow-up examinations showed no evidence of recurrence for 26 mo.

Box 1: Sample clinical case.

¹ https://www.european-language-grid.eu/open-calls/

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Motivation and Related Work

The main motivation of the project is to create a clinical document corpus that can be freely redistributed and that contains temporal information and clinical entity annotations. The annotation of temporal information is the core effort of the project, so we expect the E3C corpus to be useful in tasks such as event ordering and chronology generation. These decisions are justified by three common issues in Natural Language Processing (NLP) for the clinical domain, listed below.

First, the E3C corpus consists of already published documents, mostly clinical cases from journals, so as to avoid the patient privacy issues that Electronic Health Records (EHR) often raise. We have opted for selecting documents that already allow redistribution to ensure the E3C corpus will be easily usable by the research community.

Secondly, as E3C is a multilingual corpus, it helps reduce the gap between English and other languages in terms of data available for research. A large dataset of clinical cases in Spanish is already available, SPACCC (Intxaurrondo et al., 2018), and we have expanded and enriched it. For a low-resource language such as Basque, in contrast, E3C is the first clinical narrative corpus. In addition, the project will also provide the NLP community with the first Creative Commons clinical case corpora for French and Italian.

Thirdly, E3C is centered on temporal information in clinical narratives, which has not often been targeted by scholars. Most of the attention has been focused on clinical entity extraction and classification (Schulz et al., 2020; Grabar et al., 2019; Dreisbach et al., 2019; Luo et al., 2017) and only a few key initiatives have addressed temporal information processing, e.g. the THYME annotation scheme (Styler et al., 2014b). THYME and related spin-offs have been used to annotate clinical case corpora such as the i2b2 temporal relation corpus (Sun, Rumshisky, and Uzuner, 2013), and have been used in clinical narrative processing challenges, e.g., CLEF eHealth (Kelly et al., 2019). This information could then be merged with the information in structured data collections, e.g. MIMIC-III (Johnson et al., 2016), enriching it.

3 Data Collection and Distribution

The E3C corpus is a collection of both existing corpora (e.g., the SPACCC corpus) and published texts extracted from different sources, such as PubMed² (journal abstracts), SciELO³ and the PanAfrican Medical Journal⁴ (clinical cases).

3.1 Corpus Organisation

For each language, the E3C corpus is organized into three layers, with different purposes:

Layer 1: about 25K tokens per language (around 80 documents) of clinical narratives with full manual annotation of clinical entities, temporal information and factuality, for benchmarking and linguistic analysis.

Layer 1 is the core of the E3C corpus and special attention has been paid to creating a document set that is balanced in terms of size. Short (<200 tokens), medium (200–400 tokens) and long (400–600 tokens) texts have been selected, as we presumed that text length would directly affect the temporal information in the text: the longer the text, the more complex the temporal graph.

Layer 2: 50–100K tokens per language of clinical narratives with automatic annotation of clinical entities and a manual check of the annotation of a small sample (about 10%).

Layer 3: about 1M tokens per language of non-annotated medical documents (not necessarily clinical narratives) to be exploited by semi-supervised approaches.

All the layers are covered for Spanish, English, Italian and French. For Basque, Layers 2 (14K tokens) and 3 (600K tokens) are covered only partially. In Table 1 we summarize the distribution of documents per layer (L1, L2, L3) and language. In the case of L1, the number of texts indicates how many distinct temporal graphs or chronologies we will be able to build from the dataset.

Language    L1    L2      L3
Spanish     81   162   1,772
Basque      88   113   1,232
English     84   171   9,986
French      81   168   9,111
Italian     84   174  10,217

Table 1: Documents in the different layers for each language.

3.2 Corpus Distribution

The final E3C corpus will be available for download from the ELG platform repository⁵.
All documents will be released under Creative Commons licenses. This is possible because a large part of the texts in the corpus were already released under Creative Commons licenses and because permission for free distribution has been requested and obtained from the original owners of some of the documents.

² https://pubmed.ncbi.nlm.nih.gov/
³ https://scielo.org/
⁴ https://www.panafrican-med-journal.com/
⁵ https://live.european-language-grid.eu

4 Corpus Annotation

The E3C corpus will contain two types of annotations: (i) temporal information and factuality, and (ii) annotation of clinical entities, specifically disorders.

Manual annotation has been performed by a team of NLP students and researchers. More precisely, the temporal information and clinical entity annotation guidelines have been defined by two teams of three and four experts respectively. For the manual annotation effort, eight people have been trained and are completing the annotation of temporal information, while clinical entity annotation is being conducted by five people.

4.1 Temporal Information Annotation

Temporal information annotation is performed following the THYME annotation guidelines (Styler et al., 2014a) with some minor adaptations (Magnini et al., 2020). This scheme provides tags for events, time expressions, temporal relations between events and/or time expressions, and aspectual relations between events. For each tag, a set of attribute-value pairs makes the relevant features explicit. In order to mark information that further contributes to the clinical history of a patient, we have added three categories: measurements and test results, actors (the patient, health professionals and other participants), and body parts.

Figure 1 displays a simplified temporal information annotation. Events are in dark blue and their contextual modality (ACTUAL), document time relation (BEFORE) and polarity (POS) are highlighted. The "2 wk" time expression (in gray) is classified as a duration and the information about the actor (the patient) is represented in light blue.

Figure 1: A sentence in a clinical case annotated with both temporal information and clinical entities (i.e., disorders) with their UMLS codes (marked in red).

4.2 Clinical Entity Annotation

Clinical entity annotation focuses on disorders. Following UMLS⁶, a disorder is defined as "a definite pathologic process with a characteristic set of signs and symptoms".

In E3C, we mark disorders mentioned in the text and assign them a UMLS concept unique identifier (CUI). Disorder identification and coding are performed following an adaptation of the ShARe annotation guidelines (Elhadad et al., 2012). In concept selection, we restrict ourselves to the UMLS semantic group Disorder, which includes the Finding semantic type in addition to those proposed by ShARe. Figure 1 shows disorders, marked in red, and their UMLS codes.

⁶ https://uts.nlm.nih.gov/uts/umls/home

5 Conclusions and Future Work

The E3C project aims to reduce the lack of available resources for clinical NLP by gathering a large number of clinical narratives and focusing on languages other than English. After the project is completed, the E3C corpus and the associated resources (baselines, scorers, etc.) will be available for research under a Creative Commons license, which will facilitate their acquisition and reuse. More specifically, since the corpus contains information for temporal reasoning as well as clinical entity mentions, it will be useful for work on the semantic interpretation of clinical texts. The fact that the corpus contains texts in five languages will allow linguistic comparisons as well as experimentation on transfer learning. We also consider that the E3C corpus is a resource that could be employed in a series of evaluation challenges due to its atypical contents and types of annotations.

Acknowledgements

This work was partially supported by the European Language Grid project through its open call for pilot projects (EU grant no. 825627), and by the Basque Government post-doctoral grant POS 2020 2 0026.

References

[Dreisbach et al. 2019] Dreisbach, C., T. A. Koleck, P. E. Bourne, and S. Bakken. 2019. A systematic review of natural language processing and text mining of symptoms from electronic patient-authored text data. International Journal of Medical Informatics, 125:37–46.

[Elhadad et al. 2012] Elhadad, N., G. Savova, W. Chapman, G. Zaramba, D. Harris, and A. Vogel. 2012. ShARe Guidelines for the Annotation of Modifiers for Disorders in Clinical Notes. Technical report, Columbia University.

[Grabar et al. 2019] Grabar, N., C. Grouin, T. Hamon, and V. Claveau. 2019. Recherche et extraction d'information dans des cas cliniques. Présentation de la campagne d'évaluation DEFT 2019. In Actes du Défi Fouille de Textes 2019, pages 7–16, Toulouse, France. Actes DEFT 2019.

[Intxaurrondo et al. 2018] Intxaurrondo, A., M. Marimón, A. González-Agirre, J. A. López-Martín, H. Rodríguez, J. Santamaría, M. Villegas, and M. Krallinger. 2018. Finding Mentions of Abbreviations and Their Definitions in Spanish Clinical Cases: The BARR2 Shared Task Evaluation Results. In Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), pages 280–289, Seville, Spain. Spanish Society for Natural Language Processing.

[Johnson et al. 2016] Johnson, A. E., T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3.

[Kelly et al. 2019] Kelly, L., H. Suominen, L. Goeuriot, M. Neves, E. Kanoulas, D. Li, L. Azzopardi, R. Spijker, G. Zuccon, H. Scells, and J. Palotti. 2019. Overview of the CLEF eHealth Evaluation Lab 2019. In F. Crestani, M. Braschler, J. Savoy, A. Rauber, H. Müller, D. E. Losada, G. Heinatz Bürki, L. Cappellato, and N. Ferro, editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 322–339, Cham. Springer International Publishing.

[Luo et al. 2017] Luo, Y., W. K. Thompson, T. M. Herr, Z. Zeng, M. A. Berendsen, S. R. Jonnalagadda, M. B. Carson, and J. Starren. 2017. Natural Language Processing for EHR-Based Pharmacovigilance: A Structured Review. Drug Safety, 40:1075–1089.

[Magnini et al. 2020] Magnini, B., B. Altuna, A. Lavelli, M. Speranza, and R. Zanoli. 2020. The E3C Project: Collection and Annotation of a Multilingual Corpus of Clinical Cases. In Proceedings of the Seventh Italian Conference on Computational Linguistics, Bologna, Italy, December. Associazione Italiana di Linguistica Computazionale.

[Schulz et al. 2020] Schulz, S., J. Ševa, S. Rodríguez, M. Ostendorff, and G. Rehm. 2020. Named Entities in Medical Case Reports: Corpus and Experiments. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4495–4500, Marseille, France. European Language Resources Association.

[Styler et al. 2014a] Styler, W., G. Savova, M. Palmer, J. Pustejovsky, T. O'Gorman, and P. C. deGroen. 2014a. THYME Annotation Guidelines. Technical report, University of Colorado. http://clear.colorado.edu/compsem/documents/THYME_guidelines.pdf.

[Styler et al. 2014b] Styler, W. F., S. Bethard, S. Finan, M. Palmer, S. Pradhan, P. C. de Groen, B. Erickson, T. Miller, C. Lin, G. Savova, et al. 2014b. Temporal Annotation in the Clinical Domain. Transactions of the Association for Computational Linguistics, 2:143–154.

[Sun, Rumshisky, and Uzuner 2013] Sun, W., A. Rumshisky, and O. Uzuner. 2013. Annotating temporal information in clinical narratives. Journal of Biomedical Informatics, 46(Supplement):S5–S12. 2012 i2b2 NLP Challenge on Temporal Relations in Clinical Data.