MEDIA team at the CLEF-2020 Multilingual Information Extraction Task

Iker de la Iglesia1, Mikel Martínez-Puente2, Alexander Platas1, Iria San Miguel1, Aitziber Atutxa1 [0000-0003-4512-8633], and Koldo Gojenola1 [0000-0002-2116-6611]

1 School of Engineering, Bilbao, University of the Basque Country (EHU/UPV), {idelaiglesia004,platas51218,iria.san.miguel2000}@gmail.com, {aitziber.atucha,koldo.gojenola}@ehu.eus
2 Faculty of Science, University of the Basque Country (EHU/UPV), mikelmpuente@gmail.com

Abstract. The aim of this paper is to present our approach (MEDIA) to the CLEF-2020 eHealth Task 1. The task consists in automatically assigning ICD10 codes (CIE-10 in Spanish) to clinical case documents, evaluating the predictions against manually generated ICD10 codifications. Our system took part in two different subtasks: one corresponding to Diagnosis Coding (CodiEsp-D) and the other to Procedure Coding (CodiEsp-P). We approached the coding task as a two-step system: a first step that carries out named entity recognition (of diagnoses and procedures) and a second step that assigns the right ICD10 code to each recognized entity (diagnosis or procedure). For the first step, namely medical entity recognition, we employed a transfer learning strategy over pretrained Language Models, fine-tuning them for the Named Entity Recognition task. The second step was addressed with edit distance techniques. We achieved our best results by combining static and contextual word embeddings from Wikipedia and Electronic Health Records (∼100M words), with a Mean Average Precision (MAP) of 0.488 and 0.442 for diagnoses and procedures, respectively.

Keywords: Neural Networks · Levenshtein Distance · ICD Coding.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

Automatic clinical coding has received great attention in recent years, and several systems have been developed and shared tasks organized, using both knowledge-based and machine learning techniques. CodiEsp: Clinical Case Coding in Spanish Shared Task (eHealth CLEF 2020 – Multilingual Information Extraction) [1] is devoted to the automatic coding of clinical cases in Spanish, as part of the CLEF eHealth series [2]. Participant systems have to automatically assign ICD10 codes to clinical case documents, and the results are evaluated against manually generated ICD10 codifications. This task will serve to generate new clinical coding tools for other languages and data collections.

Our system took part in two different subtasks: one corresponding to the detection and coding of diagnoses (CodiEsp-D) and the other to procedure coding (CodiEsp-P). We made use of different techniques, ranging from edit distance measures to neural approaches using the most recent architectures, including different types of embeddings learned from medical texts.

2 Related Work

The SemEval 2014 Task 7 [5] can be cited as an antecedent of the present competition, except for the number and types of entities to be identified (diseases and others) and the type of concept indexing (SNOMED-CT, compared to ICD10 in this work). Task 7 in SemEval 2014 comprised two subtasks, medical entity recognition and concept indexation. To tackle the first subtask, different teams used approaches such as MaxEnt, SVM or CRF in combination with the extraction of syntactic and semantic attributes.
The authors in [6] obtained the best results in strict F-score, with 78.5 on the development set and 81.3 on the test set. For the second subtask, namely Concept Indexation, the solutions proposed were very similar among the different teams. As in the NER task, the winner was [6], with an accuracy of 74.1 on the test set. Their solution was based on cosine similarity using a Vector Space Model (VSM). Other teams also proposed a method based on edit distance, more precisely the Levenshtein distance [4].

In SemEval 2015 (Task 14) [7], the methods used were analogous to those of SemEval 2014. The best system used a CRF to detect entities and an SVM classifier to decide whether the detected spans should be joined (and thus capture discontinuous entities). Regarding Concept Indexing, they basically used customized look-ups, such as dictionary look-up (exact match of entity word permutations, LVG), customized dictionary look-up (splitting UMLS entities by function words), and customized dictionary look-up (list of possible UMLS spans and application of the Levenshtein distance).

Recently, the PharmacoNER competition [9] proposed a similar task, but with the aim of identifying chemical, drug, and gene/protein mentions in clinical case studies written in Spanish. The evaluation of the task was divided into two scenarios: one corresponding to the detection of named entities and one corresponding to the indexation of named entities. Besides these competitions, improvements have been made mostly in the entity recognition subtask using neural networks such as Bi-LSTM + CRFs [8].

3 Resources and Methods

3.1 Resources

The CodiEsp corpus contains 3,751 clinical cases in Spanish (including the background set), 1,000 of which were manually coded. The annotated corpus was randomly split into three subsets: the train, development, and test sets. The train set contains 500 clinical cases, and the development and test sets 250 clinical cases each. The train and development cases were provided along with their corresponding annotations. The final collection of 1,000 clinical cases that make up the annotated corpus has a total of 16,504 sentences, with an average of 16.5 sentences per clinical case. It contains a total of 396,988 words, with an average of 396.2 words per clinical case.

In addition to this corpus, we made use of medical-domain synonym dictionaries and of word, character and contextual embeddings computed from a medical corpus other than the one provided by the organizers.

3.2 Methods

The first experiment, applied to both diagnoses and procedures, consisted in finding word sequences in the text whose edit distance to some ICD10 dictionary term stays within a threshold proportional to the length of the sequence (EditDistance, see below). Additionally, for the procedures, we ran a similar experiment (UnorderedMatching) in which, instead of matching the words of a sequence in their exact order, we considered the words individually: if the words of a candidate sequence matched all the words of a dictionary entry, even in a different order, it was interpreted as the same term.

In these two experiments we made use of sliding windows, as sketched in the code below. Initially the window covers the maximum size available, up to a word limit (in this case 7 words). Then it tries to match the covered sequence against a known entity with one of the two methods above. If no match is found, and as long as the window is longer than 1 word, the window size decreases by one and the new, shorter sequence is tried. When the window size is 1 and still no matching entity is found, the window moves one word forward. Finally, if a match is found, the window resets its size and moves forward to the word after the last word of the matched sequence. With this approach, the entity recognition and the ICD code assignment were done in a single step.
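The following Python sketch illustrates the sliding-window search described above. It is a minimal reconstruction rather than the actual system: the toy dictionary, the edit distance threshold (here 20% of the candidate length) and the helper names (match_term, find_entities) are our own assumptions for illustration.

```python
from Levenshtein import distance  # pip install python-Levenshtein (assumed dependency)

MAX_WINDOW = 7  # maximum window size in words, as described in the text

def match_term(candidate, dictionary, max_ratio=0.2, unordered=False):
    """Return the ICD10 code of the closest dictionary term, or None.

    `dictionary` maps ICD10 term strings to codes; the threshold is
    proportional to the candidate length (the exact ratio used by the
    system is not reported, 0.2 is an illustrative value)."""
    for term, code in dictionary.items():
        if unordered:
            # UnorderedMatching: same bag of words, order ignored
            if sorted(candidate.split()) == sorted(term.split()):
                return code
        elif distance(candidate.lower(), term.lower()) <= max_ratio * len(candidate):
            return code
    return None

def find_entities(words, dictionary, unordered=False):
    """Slide a shrinking window over `words`, emitting (start, end, code) matches."""
    matches, start = [], 0
    while start < len(words):
        for size in range(min(MAX_WINDOW, len(words) - start), 0, -1):
            candidate = " ".join(words[start:start + size])
            code = match_term(candidate, dictionary, unordered=unordered)
            if code is not None:
                matches.append((start, start + size, code))
                start += size          # jump past the matched sequence
                break
        else:
            start += 1                 # window shrank to one word with no match
    return matches

# Toy usage with an invented two-entry dictionary
icd10_dict = {"fiebre": "r50.9", "sindrome febril": "r50.9"}
print(find_entities("paciente con sindrome febril agudo".split(), icd10_dict))
```

With this toy input the window shrinks until it covers "sindrome febril", which matches exactly, and the search then resumes after the matched span.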
For the rest of the experiments, we tackled the task in two steps, first detecting the entities using transfer learning, by fine-tuning a pretrained LM, and then applying edit distance.

Named Entity Recognition. Recently, the use of Language Models (LM) pretrained on huge amounts of data (ELMo [13], BERT [14], FLAIR [3]) has been shown to obtain very good results on different tasks [10, 13]. Language Models are able to capture the distribution of a language by learning a probability distribution over word sequences. This has proven to be a good approach for sequence labeling tasks like named entity recognition (NER) [12], especially when using bidirectional Language Models (BiLM) trained in both the left-to-right and right-to-left directions [13]. From the different available Transfer Learning Language Models, we decided to use Flair [3]. Compared to the other mentioned alternatives, this system is computationally less expensive without harming performance.

For this work, we trained two different LMs. We trained the first LM on 331,468 Electronic Health Records (EHR) containing ∼100M words from different Spanish hospitals (SpaEHR). The second one was trained on the Spanish Clinical Case Corpus (SPACCC, https://doi.org/10.5281/zenodo.2560316), a collection of 1,000 clinical cases from SciELO (Scientific Electronic Library Online). Clinical cases have the peculiarity of being the biomedical and medical literature counterpart of EHRs, and therefore the language they employ is mostly standard, as opposed to that of conventional EHRs.

For the diagnosis NER subtask, we ran four different experiments. The first one used edit distance as explained at the beginning of this section, where the entity recognition and the code assignment were done in a single step. For the three other experiments we employed Flair [3], performing the NER task by means of a BiLSTM + CRF. In each of these experiments we used the same static word representations and character-based embeddings, while fine-tuning the LMs mentioned above. As static word representations, we employed FastText and word2vec embeddings learned from Wikipedia, and skip-ngram embeddings learned from the EHR corpus (SpaEHR). Part of this corpus was originally tagged with diagnoses, so we fine-tuned the LM learned on the SpaEHR corpus, adapting it to the diagnosis NER subtask. Therefore, in this experiment (from now on SpaEHR-LM) we did not use the train, development and test sets provided by the organizers to fine-tune the LM. It is worth noting that the SpaEHR corpus is not tagged with discontinuous entities (in this case diagnoses); as a consequence, such entities cannot be recognized in the SpaEHR experiment. For the second run, although we used the same static embeddings (FastText and word2vec from Wikipedia and skip-ngram from SpaEHR), the LM was the one learned on the SPACCC corpus (from now on SPACCC-LM), that is to say, on the corpus provided by the organizers, which contains more standardized clinical cases with tagged diagnoses and procedures. In this case, the training corpus contains discontinuous entities, so we pre-processed the corpus to convert it into a tabular format with IOBES tags. Finally, the third experiment consisted in joining the entities found by the SpaEHR-LM and those found by the SPACCC-LM (Joint-LM).

It is important to mention that we employed the same SPACCC-LM model for diagnoses and procedures, since the SPACCC corpus provided by the organizers contained both diagnoses and procedures. A sketch of this tagger training setup is shown below.
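As an illustration of the setup just described, the sketch below stacks static word embeddings and contextual Flair embeddings on top of a BiLSTM + CRF sequence tagger, using the 2020-era Flair API. The file names, paths and hyperparameters (hidden size, learning rate, epochs) are placeholders, not the values used in our runs.

```python
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Column-formatted corpus (one token and its IOBES tag per line),
# as produced by the pre-processing described above (placeholder paths).
columns = {0: "text", 1: "ner"}
corpus: Corpus = ColumnCorpus("data/codiesp_d", columns,
                              train_file="train.txt",
                              dev_file="dev.txt",
                              test_file="test.txt")

# Static Spanish word embeddings plus contextual character LMs;
# "spaehr-forward.pt"/"spaehr-backward.pt" stand in for the LMs trained on the EHR corpus.
embeddings = StackedEmbeddings([
    WordEmbeddings("es"),                   # Spanish FastText embeddings
    FlairEmbeddings("spaehr-forward.pt"),   # custom forward character LM (placeholder)
    FlairEmbeddings("spaehr-backward.pt"),  # custom backward character LM (placeholder)
])

tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type="ner",
                        use_crf=True)       # BiLSTM + CRF decoder

trainer = ModelTrainer(tagger, corpus)
trainer.train("models/codiesp_d_ner",
              learning_rate=0.1, mini_batch_size=32, max_epochs=100)
```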
Edit distance. The normalization of the detected named entities consists in linking them to concepts in standardized medical terminologies, allowing generalization across contexts. The task consists in assigning to each term its corresponding unique concept code. For example, "fiebre", "hipertermia" and "sindrome febril" are all normalized to the same ICD-10 code (r50.9). In our work, we made use of a text-similarity-based mapping from the given terms to two different sets:

– The terms present in the training set. This set is limited, but it gives an account of the standard and non-standard terms present in spontaneously written health records, similar to those present in the test set. However, it only covers a small fraction of the whole set of ICD-10 codes.
– The ICD10 standard terms. This can be viewed as a dictionary covering all codes. However, its descriptions are far from the terms found in spontaneous clinical cases.

A lookup table was built by traversing the training data, recording every entity and its corresponding ICD code, and was directly applied to the test set. To guarantee a match, we relaxed the search to an approximate one using the Levenshtein distance [4], which quantifies the minimum number of operations required to transform one string into another, with insertions, deletions and substitutions as the basic edit operations. We compute the distance between the input string and every term of the set taken as reference. Note that these methods only try to match the chosen strings against the terms of a dictionary of expressions or the list of entities present in the training set, ignoring the context around the entities.

In order to increase the number of different entities that can be matched, we included a synonym dictionary: if no match was found when first assigning a code to an entity, its words were replaced by synonyms, composing new candidate terms. A sketch of this normalization step is given below.
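The following sketch shows one way this normalization could be implemented, using the same Levenshtein package assumed in the earlier sketch. The reference sets, the synonym dictionary and the acceptance threshold are illustrative placeholders, not the exact resources used in our runs.

```python
from itertools import product
from Levenshtein import distance  # same assumed dependency as in the sliding-window sketch

def normalize(entity, lookup, icd10_terms, synonyms=None, max_ratio=0.25):
    """Map a detected entity to an ICD10 code, or return None.

    lookup      : dict built from the training set (entity text -> code)
    icd10_terms : dict of official ICD10 descriptions -> code
    synonyms    : dict word -> list of synonyms, used only when no match is found
    """
    entity = entity.lower()

    # 1. Exact hit in the training-set lookup table
    if entity in lookup:
        return lookup[entity]

    # 2. Approximate (Levenshtein) match against both reference sets
    best_code, best_dist = None, None
    for term, code in list(lookup.items()) + list(icd10_terms.items()):
        d = distance(entity, term.lower())
        if best_dist is None or d < best_dist:
            best_code, best_dist = code, d
    if best_dist is not None and best_dist <= max_ratio * len(entity):
        return best_code

    # 3. Still no match: compose new candidate terms with synonyms and retry once
    if synonyms:
        options = [[w] + synonyms.get(w, []) for w in entity.split()]
        for variant in product(*options):
            candidate = " ".join(variant)
            if candidate != entity:
                code = normalize(candidate, lookup, icd10_terms, None, max_ratio)
                if code is not None:
                    return code
    return None

# Toy usage with invented resources
lookup = {"sindrome febril": "r50.9"}
icd10_terms = {"fiebre": "r50.9", "hipertermia": "r50.9"}
synonyms = {"hipertermia": ["fiebre"]}
print(normalize("hipertermia", lookup, icd10_terms, synonyms))  # -> r50.9
```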
4 Results and Discussion

In this section we present the results achieved for both subtasks (CodiEsp-D and CodiEsp-P) of Task 1: Multilingual Information Extraction. The results are compiled in Table 1 and Table 2, respectively, where the scores and the approach of every run we submitted can be observed.

For the first subtask, i.e., diagnoses, we submitted four different runs: "SPACCC-LM", "SpaEHR-LM", "Joint-LM", and "EditDistance". Table 1 shows the results according to the official metric, MAP (mean average precision: the mean, over the set of queries, of the average precision score of each query), and other computed metrics such as MAP30, Precision (correctly guessed codes / all predicted codes), Recall (correctly guessed codes / all gold codes) and F-score (the harmonic mean of Precision and Recall).

Table 1. Results of the different runs of CodiEsp-D.

Run           MAP    MAP30  Precision  Recall  F-score
SPACCC-LM     0.457  0.457  0.735      0.543   0.625
SpaEHR-LM     0.405  0.405  0.633      0.518   0.570
Joint-LM      0.488  0.487  0.637      0.620   0.629
EditDistance  0.462  0.461  0.526      0.630   0.574

Overall, the best run is the one corresponding to "Joint-LM", taking into account all the metrics. However, if we analyze each run and each metric one by one, we can find some interesting information. For example, the Precision of "SPACCC-LM" is 0.735, much higher than that of the rest, whereas its Recall (0.543) is lower than that of other runs. We can also state that the worst run is "SpaEHR-LM", since its figures are the lowest ones in almost all the metrics.

For the second subtask, namely procedures, we submitted three different runs: "EditDistance", "SPACCC-LM", and "UnorderedMatching". Table 2 shows the results according to the official metrics.

Table 2. Results of the different runs of CodiEsp-P.

Run                MAP    MAP10  Precision  Recall  F-score
EditDistance       0.386  0.383  0.455      0.520   0.485
SPACCC-LM          0.442  0.442  0.601      0.412   0.489
UnorderedMatching  0.404  0.402  0.501      0.503   0.502

If we analyze each run and each metric, we can observe differences among them depending on the metric we use. For example, the Precision of "SPACCC-LM" is the highest one (0.601), while its Recall is lower than that of the other runs (0.412). In other cases, such as "UnorderedMatching", Precision and Recall are almost identical: 0.501 and 0.503, respectively.

5 Conclusion and Future Work

The purpose of this work was to evaluate the feasibility of different approaches to medical entity detection and concept indexing using the International Classification of Diseases, ICD10. Entity detection was addressed with a sequence tagger that used word embeddings and contextual string embeddings acquired from Electronic Health Records (EHR), Clinical Cases and Wikipedia. Concept normalization was approached with text similarity techniques. The Levenshtein-based system obtained relatively good results compared to the neural network approaches, an aspect that deserves further study of the strengths and weaknesses of each approach.

Acknowledgements

This work has been partially funded by the Spanish Ministry (projects PROSA-MED: TIN2016-77820-C3-1-R, DOTT-HEALTH: PID2019-106942RB-C31). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

References

1. Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of CLEF eHealth 2020. Working Notes of Conference and Labs of the Evaluation (CLEF) Forum, CEUR Workshop Proceedings (2020)
2. Goeuriot, L., Suominen, H., Kelly, L., Miranda-Escalada, A., Krallinger, M., Liu, Z., Pasi, G., Saez Gonzales, G., Viviani, M., Xu, C.: Overview of the CLEF eHealth Evaluation Lab 2020. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). LNCS volume 12260 (2020)
3. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. Proceedings of the 27th International Conference on Computational Linguistics, pages 1638-1649 (2018)
4. Levenshtein, V.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, volume 10, number 8, pages 707-710 (1966)
5. Pradhan, S., Elhadad, N., Chapman, W., Manandhar, S., Savova, G.: SemEval-2014 Task 7: Analysis of Clinical Text. SemEval Workshop (COLING), pages 54-62 (2014)
6. Tang, Y., Zhang, J., Wang, B., Jiang, Y., Xu, Y.: UTH CCB: a report for SemEval 2014 Task 7 analysis of clinical text. SemEval Workshop (COLING), page 802 (2014)
7. Pathak, P., Patel, P., Panchal, V., Soni, S., Dani, D., Patel, A., Choudhary, N.: ezDI: A Supervised NLP System for Clinical Narrative Analysis. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Association for Computational Linguistics, pages 412-416 (2015)
8. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260-270 (2016)
9. Gonzalez, A., Marimon, M., Intxaurrondo, A., Rabal, O., Villegas, M., Krallinger, M.: PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity Recognition track. Proceedings of the BioNLP Open Shared Tasks (BioNLP-OST), Association for Computational Linguistics (2019)
10. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving Language Understanding by Generative Pre-Training (2018)
11. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, Association for Computational Linguistics (2018)
12. Sharma, S., Daniel Jr., R.: BioFLAIR: Pretrained Pooled Contextualized Embeddings for Biomedical Sequence Labeling Tasks (2019), arXiv:1908.05760
13. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (2018)
14. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018), arXiv:1810.04805