=Paper=
{{Paper
|id=Vol-2619/short3
|storemode=property
|title=Towards a Multilingual Corpus for Named Entity Linking Evaluation in the Clinical Domain
|pdfUrl=https://ceur-ws.org/Vol-2619/short3.pdf
|volume=Vol-2619
|authors=Pedro Ruas,André Lamúrias,Francisco Couto
|dblpUrl=https://dblp.org/rec/conf/ecir/RuasLC20
}}
==Towards a Multilingual Corpus for Named Entity Linking Evaluation in the Clinical Domain==
<pdf width="1500px">https://ceur-ws.org/Vol-2619/short3.pdf</pdf>
<pre>
           Towards a multilingual corpus for Named Entity
             Linking evaluation in the clinical domain ?

                Pedro Ruas1[0000−0002−1293−4199] , André Lamúrias1[0000−0001−7965−6536] , and
                                 Francisco M Couto3[0000−0003−0627−1496]

            LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon 1749-016, Portugal
                                           ps ruas@fc.ul.pt


                     Abstract. We propose a new multilingual, parallel corpus for Named
                     Entity Linking benchmarking which comprises English, Portuguese and
                     Spanish clinical case reports1 . The medical diagnostic entities in the re-
                     ports were annotated with the respective code of the International Clas-
                     sification of Diseases 10 - Clinical Modification (ICD10-CM) terminology
                     and its Portuguese and Spanish versions. The result is a preliminary an-
                     notation set, which will be further validated and expanded by humans.
                     Additionally, the ICD10-CM codes in the annotations will be mapped to
                     the respective Medical Subject Headings (MeSH) identifiers when possi-
                     ble.

                     Keywords: Text Mining · Multilingual clinical case reports · Named
                     Entity Linking · Information retrieval · Named Entity Recognition


           1       Introduction

           In Text Mining pipelines, Named Entity Linking (NEL) systems are applied after
           the Named Entity Recognition (NER) step and before the Relation Extraction
           step. The goal of NEL systems is to map the entity mentions in text with the
           respective concept identifier in a Knowledge Base (KB). Currently, most NEL
           approaches are still being developed with English text in mind, but there is a
           growing interest in the development of tools able to process non-English text.
           However, the main challenge to the development of new tools is the scarcity of
           multilingual NEL datasets containing clinical text. In addition, building a gold
           standard from scratch is time-consuming and demands high expertise, and for
           non-English languages, there is a lack of controlled vocabularies.
               In this work, we applied a pipeline of Information Retrieval, NER and NEL
           tools to build a multilingual NEL corpus. The goal of the pipeline is to obtain
           preliminary annotations of medical diagnostic entities in clinical case reports,
           which will facilitate and speed up the further task of human validation.
            ?
              Supported by FCT through the DeST: Deep SemanticTagger project, ref.
              PTDC/CCI-BIO/28685/2017, LASIGE ResearchUnit, UIDB/00408/2020
            1
              https://github.com/lasigeBioTM/MultiNEL-corpus


Copyright ©  2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
           2        P. Ruas et al.

           2     Building the corpus

           2.1    Abstract Retrieval

           SciELO2 is a digital library for scientific articles, its majority written in Spanish,
           Portuguese and English. One of the main advantages is that many articles have
           versions in different languages. We extracted the abstract of clinical case reports
           (search filters: *AND subject area:(”Health Sciences”) AND type:(”case-report”)
           AND la:(”es” OR ”pt” OR ”en”)), and only considered those with the three ver-
           sions simultaneously available (English, Portuguese and Spanish). We obtained
           1917 abstracts in the three languages, corresponding to 639 clinical case reports,
           which we considered enough to test the annotation approach described below.


           2.2    Annotation of medical diagnostic entities using NER and NEL

           We used the python interface of MER [1], which recognises entity mentions in the
           text according to a given lexicon, i.e., a list of terms that represent the concepts
           of a vocabulary or a KB. In this work, we used as target KB the International
           Classification of Diseases 10th Revision - Clinical Modification (ICD10-CM),
           since it is available in several languages. This vocabulary contains codes rel-
           ative to medical diagnostics and an hierarchy defining subsumption relations
           between them. For each language, we used the most recent available edition:
           the 2020 edition for the English ICD10-CM provided by the Center for Disease
           Control and Prevention (CDC)3 ; the 2020 edition for the Spanish Classificación
           Internacional de Enfermedades - 10a Revisión - Modificación Clı́nica (CIE10-
           CM), provided by the Spanish Ministry of Health4 ; the 2017 edition for the
           Portuguese Classificação Internacional de Doenças - 10a Revisão - Modificação
           Clı́nica (CID10-CM), provided by the Portuguese Ministry of Health5 . MER
           recognised the entity mentions related with medical diagnostics in the clini-
           cal case reports, and then linked each mention to the respective code in the
           ICD10-CM (or the respective language version). The resulting annotations were
           converted to the brat Standoff format.


           3     Discussion

           The overall statistics pertaining the annotation process are available in Table 1.
           MER was able to recognise entity mentions in the text expressed in the three
           languages, but its NER performance was slightly higher in English text than in
           other languages as expected. Surprisingly, the NEL performance was higher in
           Portuguese text. As example, a sentence from a retrieved abstract is expressed in
           the three languages: (1 - English) “Among the identified nursing diagnosis was
            2
              https://scielo.org/
            3
              www.cdc.gov/nchs/icd/icd10cm.htm
            4
              www.mscbs.gob.es/estadEstudios/estadisticas/normalizacion/home.htm
            5
              www.ctc.min-saude.pt/category/catalogos/


Copyright ©  2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                         Multilingual corpus for clinical Named Entity Linking evaluation          3

           included: acute confusion, constipation and knowledge deficit.”; (2 - Portuguese)
           “Entre os diagnósticos de enfermagem identificados incluı́ram-se confusão aguda,
           constipação e conhecimento deficiente.”; (3 - Spanish) “Los resultados del estu-
           dio permitieron identificar los seguientes diagnósticos de enfermerı́a: confusión
           aguda, constipación e conocimiento deficiente.”. MER was able to identify the
           italicised entity “constipation” in the English sentence (1) because there is a
           ICD10-CM term with the same designation: “Constipation” (code K59.0). How-
           ever, the Portuguese and Spanish equivalents “constipação” (sentence 2) and
           “constipación” (sentence 3) were not recognised nor linked because the respec-
           tive terms in the ICD10-CM have a different designation: “Obstipação” and
           “Estreñimiento” (code K59.0).

           Table 1. Statistics for the annotation of medical diagnostic entities in the clinical case
           reports

                                                                     English Portuguese Spanish
                               Abstracts retrieved                     639      639       639
                          Abstracts with annotations                   217      197       199
                         Ratio of annotated abstracts                 0.340    0.308     0.314
                                Entity mentions                        533      432       465
                    Entity mentions per annotated abstract            2.456    2.193     2.340
                             Linked entity mentions                    463      432       389
                 Linked entity mentions per annotated abstract        2.134    2.193     1.955
                        Ratio of linked entity mentions               0.867    1.000     0.837


               The resulting corpus is available at https://github.com/lasigeBioTM/
           MultiNEL-corpus. The future work consists in the human validation of the an-
           notation set, as well as its expansion with new annotations. This validation will
           be performed either by expert analysis or by crowd-sourcing, a less expensive
           approach that has shown comparable results to the expert analysis [2]. Addi-
           tionally, the ICD10-CM codes present in the annotations will be further mapped
           to the respective Medical Subject Headings (MeSH) concepts using the MeS-
           DiCon subset for CodiEsp [3], which will improve the cross-linking evaluation
           capability.

           References
           1. Couto, Francisco M. and Lamurias, Andre: MER: a shell script and annotation
              server for minimal named entity recognition and linking. Journal of Cheminformatics
              10(1), 58 (2018)
           2. Campos, Luis F, Lamurias, Andre and Couto, Francisco M: Can the Wisdom of
              the Crowd Be Used to Improve the Creation of Gold-standard for Text Mining
              applications?. In: 9th INForum - Simpósio de Informática (INForum 2017), Aveiro,
              Portugal (2017)
           3. Miranda, Antonio and Krallinger, Martin. (2020). MeSDiCon subset for CodiEsp:
              MESH terms in MeSDiCon mapped to ICD10 CM and ICD10 PCS (Version 1.0)
              [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3657429


Copyright ©  2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

</pre>