A Simple Terminology-Based Approach to Clinical
Entity Recognition
José Castaño1 , Laura Gambarte1 , Carlos Otero1 and Daniel Luna1
1 Departamento de Informática en Salud, Hospital Italiano, Buenos Aires, Argentina


                                         Abstract
                                         We describe how we use terminology resources as a basic approach to entity recognition and normalization
                                         in Spanish. In particular, we use a large proprietary vocabulary and thesaurus that extends SNOMED CT,
                                         as well as SNOMED CT itself and UMLS. The proprietary terminology is built from historical clinical terms
                                         used in the EHR problem list. These clinical terms are noisy descriptions typed in Spanish by healthcare
                                         professionals in the electronic health record (EHR) system, and contain clinical findings and suspected
                                         diseases, among other categories of concepts. Descriptions are very short texts with high lexical
                                         variability, containing synonyms, acronyms, abbreviations, and typographical errors. Each term is
                                         mapped to SNOMED CT concepts. This approach was evaluated on the DisTEMIST corpus for the
                                         entity recognition and entity linking tasks.

                                         Keywords
                                         Terminology resources, Named entity recognition, Entity linking, DisTEMIST




1. Introduction
Text mining and Natural Language Processing (NLP) techniques have been used to extract
and access information in clinical documents and obtain valuable clinical information. Recently,
many approaches have been tested in languages other than English. The use of clinical texts
manually annotated by professional experts is a standard way to promote and evaluate
different techniques on a set of tasks, usually Named Entity Recognition, Entity Linking, or
Entity Normalization. The techniques used are dictionary- or gazetteer-based, rule-based, or
machine learning, and any combination of them. Deep learning and transformer-based language
models have been widely used in recent years, yielding very good results. The DisTEMIST
challenge (Disease Text Mining Shared Task) proposes two tasks, DISTEMIST-entities (named
entity recognition) and DISTEMIST-linking (entity linking), over a broad category of entities
covering diseases, disorders, and anomalies, as described on the DisTEMIST homepage. The
cover name used is Diseases, but it does not correspond to a category in a given ontology. The
DisTEMIST task follows such previous efforts as PharmaCoNER [1], SpRadIE [2], and in particular
the CANTEMIST (CANcer TExt Mining Shared Task) track at IberLEF 2020 [3], which obtained
very good results using deep learning algorithms and language models.

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
jose.castano@hospitalitaliano.org.ar (J. Castaño); laura.gambarte@hospitalitaliano.org.ar (L. Gambarte);
carlos.otero@hospitalitaliano.org.ar (C. Otero); daniel.luna@hospitalitaliano.org.ar (D. Luna)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Table 1
DisTEMIST Training corpus
       Task                  Clinical cases (files)   Entity mentions   Unique entity mentions
       DISTEMIST-entities             750                  8065                  5349
       DISTEMIST-linking              584                  5136                  3453


Table 2
DisTEMIST Training corpus: entity mention word length distribution
                                       Word length    Instances
                                       1                   2387
                                       2                   2333
                                       3                   1714
                                       4                    490
                                       5                    317
                                       6                    236
                                       ≥7                   588


2. The DisTEMIST Dataset
The DisTEMIST corpus data [4] was distributed together with training sets, multilingual resources,
a so-called DisTEMIST dictionary, and additional concept information (crossmappings to other
ontological resources). The DisTEMIST corpus itself is a collection of 1000 files corresponding to
clinical cases. It was randomly divided into a training set of 750 clinical cases and a test set of
250 cases. The test set was released within 3000 clinical cases, the so-called test background, so
that participants would not know which cases were used for the performance evaluation.
There were two interdependent subtasks. The first subtask, named entity recognition, required
identifying the named entities corresponding to the cover category Disease (ENFERMEDAD).
The second subtask, entity linking, required providing, for each recognized named entity, the
corresponding SNOMED CT [5] concept. The training files also indicated whether the named
entity was an EXACT description of the SNOMED CT concept (3803 instances) or a NARROW
description (1121 instances), used when the entity mention is not a direct mapping to the
SNOMED CT code. COMPOSITE was used when two concepts were associated with a given
entity mention (211 instances). However, these labels were neither used nor required in the
submissions. Table 1 shows the counts of entity mentions and unique entity mentions in the
training corpus.
   The DisTEMIST dictionary [6] contains 134697 terms covering 103154 concepts. Most of the
terms are labeled as disorder (132332), 1626 as finding, and 738 as morphologic abnormality;
these labels correspond to the SNOMED CT hierarchy. The FAQ section also indicates that the
SNOMED CT codes to be returned should belong to the subset in the dictionary; other SNOMED
CT codes were not considered.
   The named entity terms have a word length distribution very similar to the one observed in
our terminology resources (see Table 2).
Figure 1: NLP pipeline: Tokenizer → Parser → QuickUMLS → NER Rules → Context.


3. Terminology Resources and the DisTEMIST Corpus
Some electronic health record (EHR) implementations allow free-text descriptions in structured
data entries. Free-text descriptions give physicians more expressiveness, ease of use, and
flexibility. Descriptions are short texts, mostly 3 to 5 words long. To allow information
interoperability, those descriptions must be encoded according to their meaning: they have to be
mapped to concepts in a controlled vocabulary, and usually SNOMED CT is used. SNOMED CT
is a controlled reference terminology and medical coding ontology that supports storage and
retrieval of healthcare information; it is a standard for electronic health records and can be used
in clinical decision support systems. The Hospital Italiano of Buenos Aires (henceforth HIBA)
has a Spanish interface terminology [7, 8] in which each term is mapped, via a direct relation or
through compositional expressions, to SNOMED CT as its reference vocabulary. The HIBA
interface vocabulary was implemented many years ago and has more than 2 million description
terms in its terminology system. It was built from the description terms typed by healthcare
professionals in structured textual data. The size and coverage of the local interface vocabulary
are its major benefit, but also the biggest obstacle to its use and maintenance.
   We implemented a simple named entity recognition and entity linking system in Python. We
used open-source libraries such as spaCy [9], medspaCy [10], and QuickUMLS [11] to exploit our
HIBA terminology. The system uses spaCy components in a standard pipeline of tokenization,
parsing, NER, and context rules for NegEx [12] (see Figure 1). It recognizes those terms that
exist in the controlled vocabulary and also allows selecting terms by UMLS type or by HIBA
terminology code. HIBA terminology codes are mapped to SNOMED CT codes (and also have
crossmappings to other terminologies, such as ICD-10).
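   As an illustration of the terminology matching step, the following is a minimal sketch built
on QuickUMLS's documented matching API; the index path, parameter values, and example
sentence are assumptions, not our production configuration.

    # Minimal terminology-matching sketch with QuickUMLS. The index path and
    # parameter values below are illustrative assumptions.
    from quickumls import QuickUMLS

    QUICKUMLS_PATH = "/path/to/quickumls_index"  # prebuilt index (hypothetical path)

    matcher = QuickUMLS(
        QUICKUMLS_PATH,
        threshold=0.9,              # minimum similarity for an approximate match
        similarity_name="jaccard",  # similarity between candidate n-gram and term
        window=5,                   # maximum candidate span length in tokens
    )

    text = "Paciente con sospecha de neoplasia pulmonar y lesión hepática."
    for candidates in matcher.match(text, best_match=True, ignore_syntax=False):
        for m in candidates:
            # Each match dict carries span offsets, the matched n-gram, the
            # candidate term, its CUI, and its UMLS semantic types.
            print(m["start"], m["end"], m["ngram"], m["cui"], m["semtypes"])

In our setting, the index would be built from the HIBA, SNOMED CT, and UMLS description
terms.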
   Our first approach was to run our system directly on the distributed DisTEMIST entity
mention terms, to evaluate the results on the training corpus. The results using the HIBA
terminology were very poor: only 3490 of the 8065 entity mentions (43%) had at least a partial
correspondence with terms in the HIBA controlled vocabulary. Therefore, we also added terms
from SNOMED CT and UMLS and obtained a better correspondence: 6677 entity mentions (82%)
had at least a partial correspondence with terms in the vocabulary. Table 3 shows the distribution
of UMLS types corresponding to the DisTEMIST entity mention terms. The UMLS types provide
more detailed information than the SNOMED CT hierarchy types supplied by the DisTEMIST
gazetteer. This set of UMLS types was used to select those matched terms in our HIBA
terminology that should be identified by the cover term Disease; a minimal sketch of this filter
is given after Table 3. Type T061, Therapeutic or Preventive Procedure, presented a problem
because it does not fit under the cover term Disease: either there was a specific interpretation in
a particular context or a misinterpretation of ambiguous terms. Types T033, Finding, and T184,
Sign or Symptom, also looked problematic. In addition, a substantial number of entity mentions
in the training set (18%) do not match, even partially, any term in the terminological resources.
Table 3
UMLS labels for the entity mentions at the DisTEMIST Training corpus
     UMLS TYPE      Label                                 HIBA     HIBA+SNOMED CT+UMLS
     T047           Disease or Syndrome                    1780               2968
     T191           Neoplastic Process                     616                 949
     T033           Finding                                322                 997
     T046           Pathologic Function                    318                 706
     T184           Sign or Symptom                        166                 268
     T037           Injury or Poisoning                    157                 519
     T048           Mental or Behavioral Dysfunction       101                 202
     T061           Therapeutic or Preventive Procedure                        41
     T041           Mental Process                                             19
     T049           Cell or Molecular Dysfunction            4                  6
     T042           Organ or Tissue Function                 1                  2
     Total                                                 3490               6677
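   A minimal sketch of the type-based selection discussed above, assuming QuickUMLS-style
match dictionaries (as in the earlier sketch) and the accepted category set listed in Section 4:

    # Keep only matches whose UMLS semantic types intersect the accepted set;
    # the `matches` structure (QuickUMLS-style dicts) is an assumption.
    ACCEPTED_TYPES = {"T047", "T033", "T037", "T048", "T041", "T191", "T046"}

    def filter_by_semtype(matches, accepted=ACCEPTED_TYPES):
        # Drop candidates whose types all fall outside the accepted set,
        # e.g. T061 (Therapeutic or Preventive Procedure).
        return [m for m in matches if set(m["semtypes"]) & accepted]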


4. Experimentation and Results
We performed several experiments to assess the performance of the system. Precision and
recall were computed with the DisTEMIST evaluation tool and are presented in Table 4. We
were surprised by the low precision obtained using the HIBA terminology resources: we did not
expect high recall, but we did expect better precision. A number of variations were tried, in
particular adding SNOMED CT and UMLS description terms. This increased recall slightly but
lowered precision. We then filtered terms in the T033 category and considered only a subset of
them, using HIBA and SNOMED CT concepts present in the training set. Only the UMLS
categories T047, T033, T037, T048, T041, T191, and T046 were considered. This filter increased
precision to 0.633. Adding contextual rules, i.e., the use of NegEx, did not change the results
significantly; in some cases, the simplest NegEx rules lowered precision.
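   The following is a simplified, illustrative NegEx-style rule in the spirit of [12]; the Spanish
trigger list and the window size are assumptions and do not reproduce the medspaCy context
configuration we used.

    # Discard a mention if a negation trigger appears shortly before it.
    import re

    NEG_TRIGGERS = re.compile(r"\b(sin|no|niega|descarta|ausencia de)\b",
                              re.IGNORECASE)

    def is_negated(text, mention_start, window=30):
        # True if a trigger occurs in the `window` characters preceding the
        # mention's start offset (a crude stand-in for NegEx scopes).
        left_context = text[max(0, mention_start - window):mention_start]
        return bool(NEG_TRIGGERS.search(left_context))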
   We also modified the evaluation script to find out how many initial offsets were correct. The
results appear in the last line of Table 4 (Training set, start only). In this case, precision was
significantly higher, 0.936. In other words, most predicted initial spans of named entities were
correct, and the main problem was identifying the right boundary of the named entity mentions.
The two matching criteria are sketched below.
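   A minimal sketch of the two matching criteria, assuming gold and predicted mentions are
represented as (file, start, end) triples; this is an illustration, not the official evaluation script.

    # Strict matching compares full spans; "start only" compares initial offsets.
    def precision_recall(pred, gold, start_only=False):
        if start_only:
            pred = {(f, s) for f, s, _ in pred}
            gold = {(f, s) for f, s, _ in gold}
        else:
            pred, gold = set(pred), set(gold)
        tp = len(pred & gold)  # true positives
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        return precision, recall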
   We submitted only one set of predictions for the test set; the results were slightly lower than
on the training set.
   For entity linking, we used the HIBA concept codes and their mappings to SNOMED CT
codes, as well as the SNOMED CT codes themselves when the recognized entity mention came
from a SNOMED CT term. The results were lower (see Table 5), given that the task depends on
the entity mention recognition. A minimal sketch of this lookup follows.
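   A minimal sketch of the linking step; both mapping tables and their entries are hypothetical
illustrations, not actual HIBA data.

    # Resolve a recognized mention to a SNOMED CT code: keep the code when the
    # match came from a SNOMED CT term, otherwise go through the HIBA mapping.
    HIBA_TO_SNOMED = {"HIBA:12345": "254837009"}        # illustrative entry only
    TERM_TO_HIBA = {"neoplasia de mama": "HIBA:12345"}  # illustrative entry only

    def link_mention(mention, snomed_code=None):
        if snomed_code is not None:  # mention matched a SNOMED CT term directly
            return snomed_code
        hiba_id = TERM_TO_HIBA.get(mention.lower())
        return HIBA_TO_SNOMED.get(hiba_id)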
Table 4
Disease Recognition results in Training and Test sets
     Terminology                                                       Precision   Recall   F-score
     HIBA Training set                                                  0.3586     0.3883   0.3729
     HIBA+SNOMED+UMLS Training set                                      0.271       0.463   0.342
     HIBA+SNOMED+UMLS+Filter Training set                               0.633       0.409   0.497
     HIBA+SNOMED+UMLS+Filter+Rules Training set                          0.6381     0.4079   0.4977
     HIBA+SNOMED+UMLS+Filter+Rules Test set                             0.5622     0.3772   0.4515
     HIBA+SNOMED+UMLS+Filter+Rules Training set start only              0.936      0.5984   0.7301


Table 5
Disease Linking results on Training and Test sets
                    Terminology                         Precision   Recall   F-score
                    HIBA+SNOMED Training 1               0.366      0.2593   0.3035
                    HIBA+SNOMED Training 1+2             0.3519     0.2446   0.2886
                    HIBA+SNOMED Test set                 0.4795     0.2292   0.3102


5. Conclusions and Future Work
It is well known that dictionary lookup and regular expressions are very limited approaches for
NER tasks. They are useful in limited-scope tasks and as a preliminary baseline, and they can
also be combined with machine learning approaches. We had very limited time and human
resources to test machinery that was neither mature nor previously tested. This experience
allowed us to find unexpected outcomes for some terms of the HIBA terminology which were
not handled by the system. The DisTEMIST corpus presented a particular challenge with some
general terms such as lesión, tumor, and herida, which produced many false positives and false
negatives. It is questionable whether such general terms are valuable for some of the tasks NER
serves, such as document indexing. This is a problem of granularity, which also has an opposite
side: terms that are too detailed and might not be relevant. In addition, the DisTEMIST corpus
seems quite different from the user-generated texts of a healthcare institution, which are a
major source of our HIBA terminology. The UMLS types were useful for restricting some of the
target terms, but two categories (T033 and T184) were problematic. In future work, we will try
to address the flaws we found in our approach and experiment with machine learning
techniques. We will also carry out error and corpus analysis.


References
 [1] A. Gonzalez-Agirre, M. Marimon, A. Intxaurrondo, O. Rabal, M. Villegas, M. Krallinger,
     PharmaCoNER: Pharmacological substances, compounds and proteins named entity recog-
     nition track, in: Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, 2019,
     pp. 1–10.
 [2] V. Cotik, L. A. Alemany, D. Filippo, F. Luque, R. Roller, J. Vivaldi, A. Ayach, F. Carranza,
     L. Francesca, A. Dellanzo, et al., Overview of CLEF eHealth Task 1 - SpRadIE: A challenge
     on information extraction from Spanish radiology reports, in: CLEF 2021 Evaluation Labs
     and Workshop: Online Working Notes, CEUR-WS, 2021.
 [3] A. Miranda-Escalada, E. Farré, M. Krallinger, Named entity recognition, concept normal-
     ization and clinical coding: Overview of the CANTEMIST track for cancer text mining in
     Spanish, corpus, guidelines, methods and results, IberLEF@SEPLN (2020) 303–323.
 [4] A. Miranda-Escalada, L. Gascó, S. Lima-López, D. Estrada, A. Nentidis, A. Krithara,
     G. Katsimpras, E. Farré, G. Paliouras, M. Krallinger, Overview of DisTEMIST at BioASQ:
     Automatic detection and normalization of diseases from clinical texts: results, methods,
     evaluation and multilingual resources, in: Working Notes of the Conference and Labs of
     the Evaluation Forum (CLEF), CEUR Workshop Proceedings, 2022.
 [5] D. Lee, R. Cornet, F. Lau, N. de Keizer, A survey of SNOMED CT implementations, Journal
     of Biomedical Informatics 46 (2013) 87–96. URL: http://www.sciencedirect.com/
     science/article/pii/S1532046412001530. doi:10.1016/j.jbi.2012.09.006.
 [6] L. Gascó, M. Krallinger, DisTEMIST gazetteer, 2022. URL: https://doi.org/10.5281/zenodo.
     6505583. doi:10.5281/zenodo.6505583. Funded by the Plan de Impulso de las Tec-
     nologías del Lenguaje (Plan TL).
 [7] H. Navas, A. Lopez Osornio, A. Baum, A. Gomez, D. Luna, F. Gonzalez Bernaldo de Quiros,
     et al., Creation and evaluation of a terminology server for the interactive coding of
     discharge summaries, in: Medinfo 2007: Proceedings of the 12th World Congress on
     Health (Medical) Informatics; Building Sustainable Health Systems, IOS Press, 2007, p. 650.
 [8] D. Luna, G. Lopez, C. Otero, A. Mauro, C. T. Casanelli, F. G. B. de Quirós, Implementation
     of interinstitutional and transnational remote terminology services, in: AMIA Annual
     Symposium Proceedings, volume 2010, American Medical Informatics Association, 2010,
     p. 482.
 [9] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength natural
     language processing in Python, 2020. doi:10.5281/zenodo.1212303.
[10] H. Eyre, A. Chapman, K. Peterson, J. Shi, P. Alba, M. Jones, T. Box, S. DuVall, O. Patterson,
     Launching into clinical space with medspaCy: a new clinical text processing toolkit in
     Python, AMIA Annual Symposium Proceedings 2021 (2022) 438–447.
[11] L. Soldaini, N. Goharian, QuickUMLS: a fast, unsupervised approach for medical concept
     extraction (2016).
[12] W. W. Chapman, W. Bridewell, P. Hanbury, G. F. Cooper, B. G. Buchanan, A Simple
     Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries, Journal
     of Biomedical Informatics 34 (2001) 301–310. URL: http://dx.doi.org/10.1006/jbin.2001.1029.
     doi:10.1006/jbin.2001.1029.