Classifying clinical case studies with ICD-10 at
Codiesp CLEF eHealth 2020 Task 1-Diagnostics

              Paula Queipo-Álvarez1 and Israel González-Carrasco1

        Computer science Department, Universidad Carlos III de Madrid, Spain
                         {pqueipo,igcarras}@inf.uc3m.es




        Abstract. In this paper, the authors describe their approach and results
        for their participation in Task 1 (multilingual information extraction)
        of CLEF eHealth 2020. The work addresses the automatic assignment of
        ICD-10 diagnosis codes to clinical case studies in Spanish and English.
        A dictionary-based approach has been used, relying on the terminological
        resource provided by the organization. The system achieved a mean average
        precision of 0.115 (precision: 0.866, recall: 0.066).

        Keywords: ICD-10 Classification · Clinical case studies · Named-Entity
        Recognition · Dictionary-based approach.



1     Introduction

In this paper, the authors describe the participation of the Human Language and
Accessibility Technologies Group (HULAT) in CodiEsp at CLEF eHealth 2020, in
particular in Task 1: Multilingual Information Extraction. The focus is the
ICD-10 coding of diagnoses, which corresponds to sub-task D. CodiEsp is the
Clinical Cases Coding in Spanish language Track and is devoted to the automatic
coding of clinical cases in Spanish.


1.1    State of the art

CLEF eHealth has organized annual evaluation campaigns since 2013 in information
retrieval, information management and information extraction; the multilingual
information extraction tasks have covered, among others, named entity recognition
(NER), text classification and acronym normalization. In previous editions, the
organizers proposed shared tasks to promote automatic clinical coding systems
over death reports in French (2018) [1] and over German non-technical summaries
(NTS) of animal experiments (2019).

    Copyright © 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25
    September 2020, Thessaloniki, Greece.
1.2   Task description
This task relies on the International Classification of Diseases, 10th revision
(ICD-10), a terminology resource. The sub-task CodiEsp Diagnosis aims to assign
ICD10-CM codes (CIE10 Diagnóstico, in Spanish) to clinical case documents [7].
This terminology is tree-shaped: annotated codes are at least three characters
long, and codes with more characters are more granular. The organization provided
a list of valid codes for this sub-task with their English and Spanish
descriptions, as well as an annotated corpus. The automatic coding is evaluated
against manually generated ICD-10 codifications. The motivation of the task is to
determine the most competitive approaches for coding this type of document and to
support the development of clinical coding tools for other languages and data
collections.
    Task 1: Multilingual Information Extraction was built upon the information
extraction tasks of previous years [9, 10]. Information extraction can be treated
as cascaded named entity recognition with normalization, or as text classification.
This task was proposed to explore multilingual approaches, even though the two
languages (English and Spanish) were addressed individually.


2     Method
In the following subsections, we describe the corpora, the terminology, the
dictionary-based approach, and the SpaCy language processing pipelines.
    The system must predict the codes in both languages, English and Spanish.
The dictionary maps the codes to their terms in English and Spanish, and a
translation of the Spanish clinical case studies into English was provided. We
therefore recognised entities in each language using the dictionary of the
corresponding language.
    The automatic classification with ICD-10 codes is a multi-class and
multi-label problem. It can also be treated as a Named-Entity Recognition (NER)
task.
    The code is stored in a GitHub repository [2].

2.1   Corpora and Terminologies
The CodiEsp corpus is available in several versions; in this work, the third
version is considered [6]. The corpus is composed of 1,000 clinical case studies
annotated by clinical coding professionals meeting strict quality criteria. The
clinical case studies have been randomly sampled into three subsets: the train
set (500 clinical cases), the development set (250 clinical cases) and the test
set (250 clinical cases). In addition, a background set of 2,751 clinical cases
is provided.
    Apart from the text files with all the clinical case studies, the annotations
for the train, development and test sets are provided. Each annotation has two
fields: the articleID, which refers to the clinical case study, and an ICD-10
code found in that clinical case study.
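    As a minimal sketch, assuming the annotations are distributed as a two-column,
tab-separated file (the file name used below is an assumption), the gold codes can
be loaded into a Python mapping from articleID to its set of codes:

    import csv
    from collections import defaultdict

    # Assumed layout: one row per (articleID, ICD-10 code) pair, tab-separated.
    gold = defaultdict(set)
    with open("trainD.tsv", encoding="utf-8") as f:
        for article_id, code in csv.reader(f, delimiter="\t"):
            gold[article_id].add(code.lower())

    print(len(gold), "annotated clinical case studies")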
                   Fig. 1. SpaCy Language Processing Pipeline.


2.2   Dictionary-based approach

The organization has provided terminological resources, such as the list of valid
codes for the task [8]. One of the files contains a list of 71,486
CIE10-Diagnósticos terms (2018 version) with their descriptions in Spanish and
English. Diagnostic codes have the following fields: code, es-description and
en-description. To create the dictionary in Python, it was necessary to map the
codes to their references and to compute the frequency of the tags for each code.
We observed that the frequencies differ and that the classes are not balanced.
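    As an illustrative sketch (the file name of the terminological resource is an
assumption), the dictionary can be built in Python by reading the tab-separated
code list and mapping each description to its code:

    import csv

    # Assumed file name for the CIE10-Diagnósticos list [8]; three tab-separated
    # columns per row: code, Spanish description, English description.
    codes_file = "codiesp-D_codes.tsv"

    icd10_es, icd10_en = {}, {}
    with open(codes_file, encoding="utf-8") as f:
        for code, es_desc, en_desc in csv.reader(f, delimiter="\t"):
            icd10_es[es_desc.lower()] = code.lower()
            icd10_en[en_desc.lower()] = code.lower()

    print(len(icd10_en), "English descriptions mapped to ICD10-CM codes")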


Data preprocessing First of all, the clinical case studies, stored in separate
files for the train, development and test sets, were joined so that they could be
handled more easily. In addition, alphanumeric ICD-10 terms were mapped to
distinct numeric values to avoid errors in the recognizer. Finally, the clinical
case studies were tokenized with SpaCy.
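    A minimal sketch of this preprocessing, assuming one plain-text file per
clinical case study in a train/text_files directory (the directory layout is an
assumption):

    from pathlib import Path
    import spacy

    # Join all clinical case studies into a single mapping: case id -> raw text.
    cases = {}
    for path in Path("train/text_files").glob("*.txt"):
        cases[path.stem] = path.read_text(encoding="utf-8")

    # Tokenize each case with a blank Spanish pipeline (tokenizer only).
    nlp = spacy.blank("es")
    tokens = {cid: [t.text for t in nlp(text)] for cid, text in cases.items()}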


SpaCy Language Processing Pipelines [4] These have been used in several
steps. First, to segment text into tokens and produce Doc objects that are
processed through the pipeline. Secondly, to assign part-of-speech tags using the
tagger. The pipeline also uses a parser to assign dependency labels, and it
includes an entity recognizer to detect and label named entities.
    Figure 1 shows the structure of the SpaCy language processing pipeline.
    In order to use the entity recognizer, it is necessary to add named entity
metadata to the Doc objects in SpaCy. First, we loaded the English (or Spanish)
model. Then, to prevent overlapping entities, we replaced the default NER module
with the dictionary in English (or Spanish).
    After that, we detected named entities over the test and background sets,
processing each text and extracting its entities.
    This procedure has been used with two different models [5]: en_core_web_sm
(2.2.5) and es_core_news_sm (2.2.5) for English and Spanish, respectively. These
models include convolutional layers, residual connections, layer normalization
and maxout non-linearity.
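    As a sketch of how the default statistical NER can be replaced by a
dictionary-based component, the snippet below uses SpaCy's EntityRuler with the
v3 API (the models used in this work are v2.2.5, where the calls differ slightly),
and a toy dictionary stands in for the full terminology:

    import spacy

    # Toy fragment of the English dictionary: description -> ICD10-CM code.
    icd10_en = {
        "unspecified acute lower respiratory infection": "j22",
        "acute kidney failure, unspecified": "n17.9",
    }

    nlp = spacy.load("en_core_web_sm")
    nlp.remove_pipe("ner")  # drop the statistical NER to avoid overlapping entities

    # Dictionary lookup via the EntityRuler, matching case-insensitively.
    ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
    ruler.add_patterns([{"label": code, "pattern": desc}
                        for desc, code in icd10_en.items()])

    doc = nlp("The patient was admitted with acute kidney failure, unspecified.")
    for ent in doc.ents:
        print(ent.text, "->", ent.label_)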
Fuzzy matching With the fuzzywuzzy library, it is possible to adjust the
maximum (-1) and minimum (50) scores allowed for a match between terms. Finally,
the predicted terms are saved so that they can be evaluated with the organization
script [3].
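    A sketch of this fuzzy lookup, using a toy dictionary and the minimum score
of 50 mentioned above (the variable names are illustrative):

    from fuzzywuzzy import fuzz, process

    # Toy fragment of the dictionary: English description -> ICD10-CM code.
    icd10_en = {
        "acute kidney failure, unspecified": "n17.9",
        "unspecified acute lower respiratory infection": "j22",
    }

    mention = "acute kidney failure"  # candidate term found in a clinical case

    # Best fuzzy match among the descriptions, discarded if the score is below 50.
    match = process.extractOne(mention, icd10_en.keys(),
                               scorer=fuzz.ratio, score_cutoff=50)
    if match is not None:
        description, score = match
        print(icd10_en[description], score)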

2.3   Other approaches
There are other possible approaches, such as semantic rules, transfer learning,
and RNN-based or BERT-based models. Multilingual information retrieval (MLIR) is
the retrieval of documents in several languages from a single query, so it would
also be interesting to combine English and Spanish in one model.


3     Results & Discussion
In this section, we present the results of one official run. This allows us to
compare the effectiveness of the classifiers and to study the differences in the
failure analysis.

3.1   Experimental setup
This team submitted predictions for 3,001 documents, which include the test files
and the background files. However, the submission was processed to include only
the predictions for the test files (250 documents), which were considered for
computing the metrics.
    The evaluation results were provided by the organization, which used the
scripts distributed as part of the Clinical Cases Coding in Spanish language
Track (CodiEsp) [3]. All the experiments were performed without cross-validation.

3.2   Test results
The official metric for the sub-task CodiEsp Diagnostic is the mean average
precision (MAP). Other metrics, computed over a maximum of 1, are: precision,
recall and mean average precision for a query of 30 elements (MAP@30).


Table 1. System performance for ICD-10 coding on the test corpus in terms of mean
average precision, precision, recall and F-measure. Official evaluation setup.

                                Parameter  Value
                                MAP        0.115
                                MAP@30     0.115
                                Precision  0.866
                                Recall     0.066
                                F-measure  0.123


   These metrics are computed taking into account only the predictions for codes
present in the train and development sets, because there were codes in the test
set that had not appeared in the train and development sets.
    In the code search over a set of clinical case studies, precision is the
number of correct codes divided by the number of all returned codes, while recall
is the number of correct codes divided by the number of codes that should have
been returned.
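    Denoting by TP the correct codes returned, by FP the returned codes that are
wrong, and by FN the gold codes that were missed, these metrics correspond to the
standard definitions:

    Precision = TP / (TP + FP),   Recall = TP / (TP + FN),
    F-measure = 2 · Precision · Recall / (Precision + Recall).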
    Precision reaches 86.6%, whereas recall is only 6.6%. This means that 86.6%
of the retrieved codes were correct, which is remarkable given the difficulty of
the task. Recall, in contrast, measures the fraction of the codes that should
have been returned that were actually predicted. A low recall means a high number
of false negatives, that is, codes not detected because of lexical variability.
Our dictionary-based approach was not able to detect the majority of the codes
because of a lack of flexibility in the recognition.


Table 2. System performance for ICD-10 coding on the test corpus in terms of mean
average precision, precision, recall and F-measure. The test documents without gold
labels are ignored for evaluation.

                               Parameter    Value
                              MAP codes     0.138
                            Precision codes 0.935
                              Recall codes  0.071
                            F-measure codes 0.132



   Also, three metrics are computed for correct categories. These categories are
the first three characters of a CIE10-Diagnóstico code; for example, the codes
P96.5 and P96.89 belong to the category P96.
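    For illustration (this helper is not part of the official evaluation script),
the category can be derived from a code by keeping the part before the dot:

    def category(code: str) -> str:
        # "p96.5" -> "p96"; "p96.89" -> "p96"; 3-character codes are unchanged.
        return code.split(".")[0]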


Table 3. System performance for ICD-10 coding on the test corpus in terms of preci-
sion, recall and F-measure. Computed for categories (first 3 characters).

                              Parameter        Value
                          Precision categories 0.889
                            Recall categories  0.074
                          F-measure categories 0.137



    These results are slightly better than the official ones, due to the
relaxation of the codes into categories.
    We can observe a good result in precision, showing that most of the entities
detected were correct, whereas the recall results are weak. The F-measure is a
useful metric when the class distribution is uneven, which is our case. The low
recall is the essential problem of this dictionary-matching approach: in order to
increase it, a state-of-the-art approach should be used. Our approach needs to be
improved given the difficulty of the task.
4   Conclusions
This working note presents our contribution to Task 1 of the CLEF eHealth 2020
competition [7]. The task challenges the automatic assignment of ICD-10 diagnosis
codes to clinical case studies.
    At earlier stages, multi-label classification was tried, but the model did
not succeed due to the high number of codes, the class imbalance and the few
examples of each class. Then, NER was attempted over the clinical case studies to
detect diagnoses, but the model could not predict the codes. Due to the lack of
results of these approaches, a more traditional dictionary-based approach was
built. It was able to automatically assign codes to the test set with SpaCy's
help.
    The evaluation results highlight that our approach was not good enough to
detect all the entities present in the clinical case studies due to the
variability of the terms. Lexical variability has mainly impacted recall, since
many terms are not detected.
    Possible improvements are semantic rules, reducing the number of irrelevant
codes, a post-processing filtering phase, or including more features. Other
upgrades are the treatment of abbreviations and the detection of typos. Further
enhancement could also be obtained with new terminological and linguistic
resources.
    Deep learning models could also be included to infer the categories that have
not been seen, possibly using an additional corpus to train the model. Finally,
it would be interesting to implement a system that combines English and Spanish
into a multilingual model.


Acknowledgments
This work was supported by the Research Program of the Ministry of Economy and
Competitiveness - Government of Spain (DeepEMR project TIN2017-87548-C2-1-R).


References
 1. CodiEsp, https://temu.bsc.es/codiesp/
 2. Codiesp-CLEF-2020-eHealth-Task1: GitHub repository with the code for this
    work (2020), https://github.com/pqueipo/Codiesp-CLEF-2020-eHealth-Task1
 3. GitHub - TeMU-BSC/CodiEsp-Evaluation-Script: Evaluation library for CodiEsp
    Task, https://github.com/TeMU-BSC/CodiEsp-Evaluation-Script
 4. Language Processing Pipelines · spaCy Usage Documentation, https://spacy.io/
    usage/processing-pipelines
 5. Models · spaCy Models Documentation, https://spacy.io/models
 6. Miranda, A., Gonzalez-Agirre, A., Krallinger, M.: CodiEsp corpus: Spanish
    clinical cases coded in ICD10 (CIE10) - eHealth CLEF2020 (Apr 2020).
    https://doi.org/10.5281/zenodo.3758054. Funded by the Plan de Impulso de las
    Tecnologías del Lenguaje (Plan TL).
 7. Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., Krallinger, M.:
    Overview of automatic clinical coding: annotations, guidelines, and solutions for
    non-english clinical cases at codiesp track of CLEF eHealth 2020. In: Working
    Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop
    Proceedings (2020)
 8. Miranda-Escalada, A., Krallinger, M.: CodiEsp codes: list of valid CIE10 codes
    for the CodiEsp task (Jan 2020). https://doi.org/10.5281/zenodo.3706838.
    Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
 9. Névéol, A., Anderson, R.N., Cohen, K.B., Grouin, C., Lavergne, T., Rey, G.,
    Robert, A., Rondet, C., Zweigenbaum, P.: CLEF eHealth 2017 Multilingual In-
    formation Extraction task overview: ICD10 coding of death certificates in English
    and French. Tech. rep., https://www.cdc.gov/
10. Névéol, A., Cohen, K.B., Grouin, C., Hamon, T., Lavergne, T., Kelly, L.,
    Goeuriot, L., Rey, G., Robert, A., Tannier, X., Zweigenbaum, P.: Clinical
    Information Extraction at the CLEF eHealth Evaluation lab 2016. Tech. rep.,
    http://quaerofrenchmed.limsi.fr/