=Paper= {{Paper |id=Vol-2968/paper5 |storemode=property |title=The E3C Project: European Clinical Case Corpus |pdfUrl=https://ceur-ws.org/Vol-2968/paper5.pdf |volume=Vol-2968 |authors=Bernardo Magnini,Begoña Altuna,Alberto Lavelli,Manuela Speranza,Roberto Zanoli |dblpUrl=https://dblp.org/rec/conf/sepln/MagniniALSZ21 }} ==The E3C Project: European Clinical Case Corpus== https://ceur-ws.org/Vol-2968/paper5.pdf
    The E3C Project: European Clinical Case Corpus
           El proyecto E3C: European Clinical Case Corpus
                  Bernardo Magnini1, Begoña Altuna1,2, Alberto Lavelli1,
                            Manuela Speranza1, Roberto Zanoli1
                                   1 Fondazione Bruno Kessler
                   2 Universidad del País Vasco/Euskal Herriko Unibertsitatea
                         {magnini;altuna;lavelli;manspera;zanoli}@fbk.eu

      Abstract: The European Clinical Case Corpus (E3C) project aims at collecting and
      annotating a large corpus of clinical documents in five European languages (Spanish,
      Basque, English, French and Italian), which will be freely distributed. Annotations
      include temporal information, to allow temporal reasoning on chronologies, and
      information about clinical entities based on medical taxonomies, to be used for
      semantic reasoning.
      Keywords: Clinical data, corpus, multilingual, temporal information, clinical entities.
      Resumen: El proyecto European Clinical Case Corpus (E3C) pretende reunir y
      anotar un gran corpus de documentos clínicos en cinco lenguas europeas (español,
      euskera, inglés, francés e italiano) que será distribuido libremente. Las anotaciones
      incluyen información temporal para permitir el razonamiento temporal en
      cronologías e información sobre entidades clínicas basada en taxonomías médicas
      para su uso en razonamiento semántico.
      Palabras clave: información clínica, multilingüe, información temporal, entidades
      clínicas.

1   Introduction

E3C, European Clinical Case Corpus, is a one-year project (started in July 2020) aimed at
creating a corpus of clinical documents in five European languages: Spanish, Basque,
English, French and Italian. The project is partially supported by the European Language
Grid project through its open call for pilot projects1. In turn, the European Language
Grid project has received funding from the European Union’s Horizon 2020 Research and
Innovation programme under Grant Agreement no. 825627 (ELG). The project is led by FBK
(Fondazione Bruno Kessler) and part of the activities has been subcontracted to the
Université d’Orléans.
    The core of the corpus is a manually annotated dataset of clinical cases. A clinical
case reports statements of a clinical practice, presenting the reason for a clinical
visit, the description of physical exams, and the assessment of the patient’s situation
(Box 1).

        A 25-year-old man with a history of Klippel-Trenaunay syndrome presented to the
        hospital with mucopurulent bloody stool and epigastric persistent colic pain for
        2 wk. Colonoscopy showed continuous superficial ulcers and bleeding. Subsequent
        gastroscopy revealed mucosa with diffuse edema, ulcers, errhysis, and granular
        and friable changes in the stomach and duodenal bulb. A diagnosis of GDUC was
        considered. The patient hesitated about iv corticosteroids, so he was treated
        with pentasa 3.2 g/d. After 0.5 mo of treatment, the symptoms achieved complete
        remission. Follow-up examinations showed no evidence of recurrence for 26 mo.

                              Box 1: Sample clinical case.

   1 https://www.european-language-grid.eu/open-calls/

         Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2   Motivation and Related Work

The main motivation of the project is to create a clinical document corpus that is freely
redistributable and that contains temporal information and clinical entity annotations.
The annotation of temporal information is the core effort of the project, so we expect
the E3C corpus to be useful in tasks such as event ordering and chronology generation.
These decisions are justified by three common issues in Natural Language Processing (NLP)
for the clinical domain, listed below.
    First, the E3C corpus consists of already published documents, mostly clinical cases
in journals, so as to overcome the patient privacy issues that Electronic Health Records
(EHR) often raise. We have opted for selecting documents that already allow
redistribution to ensure the E3C corpus will be easily usable for the research community.
    Second, as E3C is a multilingual corpus, it helps reduce the gap between English and
other languages in terms of available data for research. A large dataset of clinical
cases in Spanish is already available, SPACCC (Intxaurrondo et al., 2018), and we have
expanded and enriched it. For a low-resource language such as Basque, E3C is the first
clinical narrative corpus. In addition, the project will also provide the NLP community
with the first Creative Commons clinical case corpora for French and Italian.
    Third, E3C is centered on temporal information in clinical narratives, which has not
often been targeted by scholars. Most of the attention has been focused on clinical
entity extraction and classification (Schulz et al., 2020; Grabar et al., 2019; Dreisbach
et al., 2019; Luo et al., 2017) and only a few key initiatives have addressed temporal
information processing, e.g. the THYME annotation scheme (Styler et al., 2014b). THYME
and its spin-offs have been used to annotate clinical case corpora such as the i2b2
temporal relation corpus (Sun, Rumshisky, and Uzuner, 2013), and have been used in
clinical narrative processing challenges, e.g., CLEF eHealth (Kelly et al., 2019). This
information could then be merged with the information in structured data collections,
e.g. MIMIC-III (Johnson et al., 2016), enriching it.

3   Data Collection and Distribution

The E3C corpus is a collection of both existing corpora (e.g., the SPACCC corpus) and
published texts extracted from different sources, such as PubMed2 (journal abstracts),
SciELO3 and the PanAfrican Medical Journal4 (clinical cases).

   2 https://pubmed.ncbi.nlm.nih.gov/
   3 https://scielo.org/
   4 https://www.panafrican-med-journal.com/

3.1   Corpus Organisation

For each language, the E3C corpus is organized into three layers, with different
purposes:
Layer 1: about 25K tokens per language (around 80 documents) of clinical narratives with
full manual annotation of clinical entities, temporal information and factuality, for
benchmarking and linguistic analysis.
    Layer 1 is the core of the E3C corpus and special attention has been paid to creating
a balanced document set in terms of size. Short (<200 tokens), medium (200–400 tokens)
and long (400–600 tokens) texts have been selected, as we presumed that text length would
directly affect the temporal information in text; the longer the text, the more complex
the temporal graph.
Layer 2: 50–100K tokens per language of clinical narratives with automatic annotation of
clinical entities and manual check of the annotation of a small sample (about 10%).
Layer 3: about 1M tokens per language of non-annotated medical documents (not necessarily
clinical narratives) to be exploited by semi-supervised approaches.
    All the layers are covered for Spanish, English, Italian and French. For Basque,
Layers 2 (14K tokens) and 3 (600K tokens) are covered only partially. In Table 1 we
summarize the distribution of documents per layer and language. In the case of L1, the
number of texts indicates how many distinct temporal graphs or chronologies we will be
able to build from the dataset.

                    Language    L1    L2       L3
                    Spanish     81   162    1,772
                    Basque      88   113    1,232
                    English     84   171    9,986
                    French      81   168    9,111
                    Italian     84   174   10,217

          Table 1: Documents in the different layers for each language.

   5 https://live.european-language-grid.eu

3.2   Corpus Distribution

The final E3C corpus will be available for download from the ELG platform repository5.
All documents will be released under
Creative Commons licenses. This is possible as a large part of the texts in the corpus
were already released under Creative Commons licenses and as permission for free
distribution has been requested and obtained from the original owners of some of the
documents.

4   Corpus Annotation

The E3C corpus will contain two types of annotations: (i) temporal information and
factuality, and (ii) annotation of clinical entities, specifically disorders.
    Manual annotation has been performed by a team of NLP students and researchers. More
precisely, temporal information and clinical entity annotation guidelines have been
defined by two teams of three and four experts respectively. For the manual annotation
effort, eight people have been trained and are completing the annotation of temporal
information, while clinical entity annotation is being conducted by five people.

4.1   Temporal Information Annotation

Temporal information annotation is performed following the THYME annotation guidelines
(Styler et al., 2014a) with some minor adaptations (Magnini et al., 2020). This scheme
provides tags for events, time expressions, temporal relations between events and/or time
expressions, and aspectual relations between events. For each tag, a set of
attribute-value pairs makes the relevant features explicit. In order to mark information
that further contributes to the clinical history of a patient, we have added three
categories: measurements and test results, actors (for the patient, health professionals
and other participants), and body parts.
    In Figure 1 a simplified temporal information annotation is displayed. Events are in
dark blue and their contextual modality (ACTUAL), document time relation (BEFORE) and
polarity (POS) are highlighted. The 2 wk time expression (in gray) is classified as a
duration and the information about the actor (the patient) is represented in light blue.

Figure 1: A sentence in a clinical case annotated with both temporal information and clinical
entities (i.e., disorders) with their UMLS codes (marked in red).

4.2   Clinical Entity Annotation

Clinical entity annotation focuses on disorders. Following UMLS6, a disorder is defined
as “a definite pathologic process with a characteristic set of signs and symptoms”.
    In E3C, we mark disorders mentioned in the text and assign them a UMLS concept unique
identifier (CUI). Disorder identification and coding is performed following an adaptation
of the ShARe annotation guidelines (Elhadad et al., 2012). In concept selection, we
restrict ourselves to the UMLS semantic group Disorder, which includes the Finding
semantic type in addition to those proposed by ShARe. Figure 1 shows disorders, marked in
red, and their UMLS codes.

   6 https://uts.nlm.nih.gov/uts/umls/home

5   Conclusions and Future Work

The E3C project aims to alleviate the shortage of available resources for clinical NLP,
gathering a large number of clinical narratives and focusing on languages other than
English. Once the project is completed, the E3C corpus and the associated resources
(baselines, scorers, etc.) will be available for research under a Creative Commons
license, which will facilitate their acquisition and reuse. More specifically, since the
corpus contains information for temporal reasoning as well as clinical entity mentions,
it will be useful for work on semantic interpretation of clinical texts. The fact that
the corpus contains texts in five languages will allow linguistic comparisons as well as
experimentation on transfer learning. We also consider that the E3C corpus is a resource
that could be employed in a series of evaluation challenges due to its atypical contents
and types of annotations.

Acknowledgements

This work was partially supported by the European Language Grid project through its
open call for pilot projects (EU grant no. 825627), and by the Basque Government
post-doctoral grant POS 2020 2 0026.

References

[Dreisbach et al.2019] Dreisbach, C., T. A. Koleck, P. E. Bourne, and S. Bakken. 2019.
    A systematic review of natural language processing and text mining of symptoms from
    electronic patient-authored text data. International Journal of Medical Informatics,
    125:37–46.
[Elhadad et al.2012] Elhadad, N., G. Savova, W. Chapman, G. Zaramba, D. Harris, and
    A. Vogel. 2012. ShARe Guidelines for the Annotation of Modifiers for Disorders in
    Clinical Notes. Technical report, Columbia University.
[Grabar et al.2019] Grabar, N., C. Grouin, T. Hamon, and V. Claveau. 2019. Recherche et
    extraction d’information dans des cas cliniques. Présentation de la campagne
    d’évaluation DEFT 2019. In Actes du Défi Fouille de Textes 2019, pages 7–16,
    Toulouse, France. Actes DEFT 2019.
[Intxaurrondo et al.2018] Intxaurrondo, A., M. Marimón, A. González-Agirre,
    J. A. López-Martín, H. Rodríguez, J. Santamaría, M. Villegas, and M. Krallinger.
    2018. Finding Mentions of Abbreviations and Their Definitions in Spanish Clinical
    Cases: The BARR2 Shared Task Evaluation Results. In Proceedings of the Third Workshop
    on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018),
    pages 280–289, Seville, Spain. Spanish Society for Natural Language Processing.
[Johnson et al.2016] Johnson, A. E., T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng,
    M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark. 2016.
    MIMIC-III, a freely accessible critical care database. Scientific Data, 3.
[Kelly et al.2019] Kelly, L., H. Suominen, L. Goeuriot, M. Neves, E. Kanoulas, D. Li,
    L. Azzopardi, R. Spijker, G. Zuccon, H. Scells, and J. Palotti. 2019. Overview of the
    CLEF eHealth Evaluation Lab 2019. In F. Crestani, M. Braschler, J. Savoy, A. Rauber,
    H. Müller, D. E. Losada, G. Heinatz Bürki, L. Cappellato, and N. Ferro, editors,
    Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 322–339,
    Cham. Springer International Publishing.
[Luo et al.2017] Luo, Y., W. K. Thompson, T. M. Herr, Z. Zeng, M. A. Berendsen,
    S. R. Jonnalagadda, M. B. Carson, and J. Starren. 2017. Natural Language Processing
    for EHR-Based Pharmacovigilance: A Structured Review. Drug Safety, 40:1075–1089.
[Magnini et al.2020] Magnini, B., B. Altuna, A. Lavelli, M. Speranza, and R. Zanoli.
    2020. The E3C Project: Collection and Annotation of a Multilingual Corpus of Clinical
    Cases. In Proceedings of the Seventh Italian Conference on Computational Linguistics,
    Bologna, Italy, December. Associazione Italiana di Linguistica Computazionale.
[Schulz et al.2020] Schulz, S., J. Ševa, S. Rodríguez, M. Ostendorff, and G. Rehm. 2020.
    Named Entities in Medical Case Reports: Corpus and Experiments. In Proceedings of the
    12th Language Resources and Evaluation Conference, pages 4495–4500, Marseille,
    France. European Language Resources Association.
[Styler et al.2014a] Styler, W., G. Savova, M. Palmer, J. Pustejovsky, T. O’Gorman, and
    P. C. de Groen. 2014a. THYME Annotation Guidelines. Technical report, University of
    Colorado. http://clear.colorado.edu/compsem/documents/THYME_guidelines.pdf.
[Styler et al.2014b] Styler, W. F., S. Bethard, S. Finan, M. Palmer, S. Pradhan,
    P. C. de Groen, B. Erickson, T. Miller, C. Lin, G. Savova, et al. 2014b. Temporal
    Annotation in the Clinical Domain. Transactions of the Association for Computational
    Linguistics, 2:143–154.
[Sun, Rumshisky, and Uzuner2013] Sun, W., A. Rumshisky, and O. Uzuner. 2013. Annotating
    temporal information in clinical narratives. Journal of Biomedical Informatics,
    46(Supplement):S5–S12. 2012 i2b2 NLP Challenge on Temporal Relations in Clinical
    Data.
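
As an illustration of the Layer 1 length classes (Section 3.1) and the annotation types (Sections 4.1 and 4.2), the following minimal Python sketch shows one way a document's size class and a few annotations could be represented. It is not the project's data format: the names `length_class`, `Event`, `TimeExpression` and `Disorder`, the example annotations and the placeholder CUI are all assumptions introduced here. Only the token thresholds and the attribute values (ACTUAL, BEFORE, POS, duration) come from the paper, and the treatment of texts with exactly 200 or 400 tokens is likewise an assumption, since the stated ranges share their boundaries.

```python
from dataclasses import dataclass


def length_class(n_tokens: int) -> str:
    """Return a Layer 1 size class for a token count.

    Thresholds follow Section 3.1. Assigning 200 to "medium" and 400 to
    "long" is an assumption; the paper does not say which class owns the
    shared boundaries.
    """
    if n_tokens < 0:
        raise ValueError("token count must be non-negative")
    if n_tokens < 200:
        return "short"
    if n_tokens < 400:
        return "medium"
    # Layer 1 long texts are 400-600 tokens; in this sketch anything
    # longer also falls into this class.
    return "long"


@dataclass(frozen=True)
class Event:
    """An event with the three attributes highlighted in Figure 1."""
    text: str
    contextual_modality: str  # e.g. "ACTUAL"
    doc_time_rel: str         # e.g. "BEFORE"
    polarity: str             # e.g. "POS"


@dataclass(frozen=True)
class TimeExpression:
    """A time expression and its class, e.g. a duration."""
    text: str
    timex_class: str


@dataclass(frozen=True)
class Disorder:
    """A disorder mention linked to a UMLS concept unique identifier."""
    text: str
    cui: str


# Hypothetical annotations for the first sentence of Box 1.
event = Event("presented", "ACTUAL", "BEFORE", "POS")
timex = TimeExpression("2 wk", "DURATION")
# "C0000000" is a placeholder, not the real CUI for this disorder.
disorder = Disorder("Klippel-Trenaunay syndrome", "C0000000")

print(length_class(150), length_class(250), length_class(450))
# prints: short medium long
```

Frozen dataclasses are used so that an annotation cannot be changed by accident once created; a real pipeline would also need character offsets and the temporal and aspectual relations between annotations, which are omitted here.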