<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IXA-AAA at CLEF eHealth 2020 CodiEsp</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alberto Blanco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alicia Perez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arantza Casillas</string-name>
          <email>arantza.casillasg@ehu.eus</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>CodiEsp</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HiTZ Center - Ixa, University of the Basque Country UPV/EHU</institution>
          ,
          <addr-line>Manuel Lardizabal 1, 20080 Donostia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>These working notes present the participation of the IXA-AAA team in the CodiEsp Track, as part of CLEF 2020. The track is about automatic coding of clinical records according to the International Classification of Diseases, 10th revision (ICD-10). There are three sub-tasks: CodiEsp-D, CodiEsp-P and CodiEsp-X. The two main tasks, CodiEsp-D and CodiEsp-P, aim to develop systems able to automatically classify clinical texts according to the ICD-10, for diagnostics and procedures respectively. CodiEsp-X, by contrast, is an exploratory sub-task within the framework of Explainable AI, in which the goal is to detect the text fragment that motivates the presence of the ICD code. For the IXA-AAA team participation, we have developed several systems to cope with the three sub-tasks, including tree-based multi-label classifiers, similarity-match strategies and ensemble models. For the similarity match, we have explored several approaches and algorithms, from string edit distances such as Levenshtein to dense representations with Transformer-based BERT models. Our best results overall are achieved by the combination of models, with a MAP of 69.8% for CodiEsp-D and 48.1% for CodiEsp-P. Regarding the exploratory task, CodiEsp-X, our best coder achieves a micro F1-Score of 30.6%.</p>
      </abstract>
      <kwd-group>
        <kwd>Clinical records</kwd>
        <kwd>Similarity Match</kwd>
        <kwd>CLEF</kwd>
        <kwd>Multi-label classifier</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Here we gather the contribution of the IXA-AAA team in the CodiEsp Track from
the CLEF eHealth 2020 – Multilingual Information Extraction lab [
        <xref ref-type="bibr" rid="ref10">10,15</xref>
        ]. The task
consists of the automatic classification of clinical notes according to the ICD-10
codes, considering both procedures and diagnoses. The track contains three
independent sub-tasks, two of them considered the main tasks and the third
regarded as exploratory. The main tasks require systems able to perform ICD
assignments (diagnoses and procedures) for a given clinical note. In the exploratory
task, the systems must also submit the text that motivated each assigned code.
Therefore, the three sub-tasks are: a) Diagnosis Coding, main (CodiEsp-D):
automatic ICD-10-CM (i.e. diagnosis) code assignment; b) Procedure Coding, main
(CodiEsp-P): automatic ICD-10-PCS (i.e. procedure) code assignment; c)
Explainable AI, exploratory (CodiEsp-X): automatic ICD-10-CM and ICD-10-PCS
code assignment together with the text position for reference designation.
      </p>
      <p>These tasks present several challenges regarding the text, the multi-label setting
and the ICD classification domain. The documents, written in Spanish, come from a set
of clinical case studies showing properties of both the biomedical and medical
literature, as well as clinical records. Moreover, they cover a variety of medical
topics, including oncology, urology, cardiology, pneumology and infectious diseases,
which increases both the quantity and the diversity of the ICD codes present
in the dataset. Each clinical note can have several diagnoses or procedures and,
therefore, we face a multi-label classification task. Text multi-label classification
alone is an open challenge in the machine learning field; combined with the large
label-set yielded by the ICD-10 codes, the low frequency of labels and the label
imbalance, the task involves overcoming multiple and varied barriers. Moreover, we
are confronted with a zero-shot learning paradigm, where the clinical cases from the
different data partitions (train, dev, test) have non-overlapping label-sets.
Regarding the exploratory task, the identification of the text position reference
for a given code is not trivial, since the non-standard medical language in the
text can differ heavily from the standard terms in the ICD. Besides, apart from
continuous references, there are also discontinuous references (i.e. references
with several parts distributed along the clinical note). In practical terms, the
evaluation of discontinuous references is carried out taking the beginning of the
first fragment and the end of the last.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The automatic classification of medical records according to the ICD is an active
field of research, with a presence in shared-task competitions [19] and the Natural
Language Processing literature [21]. Over the years, numerous techniques
and systems have been developed to solve these tasks, such as dictionary lookups
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], statistical models like Topic Modeling [18], machine learning models and,
lately, Deep Learning models [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ].
      </p>
      <p>
        [21] indicates that it is troublesome to evaluate the advances in the field, since
neither the models nor the evaluation results are generally comparable across
related works. Hence, it is a significant milestone to establish standard datasets
along with evaluation systems, as in this and past CLEF eHealth editions
since 2012 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In 2018, at the sixth annual edition of the CLEF eHealth evaluation
lab [22], the organizers proposed a multilingual information extraction lab, with
ICD-10 coding of death certificates as the main task. The dataset contained
free-text descriptions of causes of death in 5 languages, as reported by practitioners
in the standardized causes-of-death forms, and the teams had to extract ICD-10 codes
from the raw lines of death certificate text. The best system was provided by the
IxaMed team [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which cast the problem as a sequence-to-sequence
prediction task. The authors leveraged only the organizer-provided datasets,
namely ICD-10 dictionaries and the different sets of death-report texts with
their corresponding ICD codes, which, fed to an encoder-decoder model,
delivered high-quality, language-independent results. In last year's
edition of the CLEF eHealth evaluation lab [12], the main task consisted
of the classification of non-technical summaries of German animal experiments
according to the ICD-10 codes. Although the dataset consisted of veterinary
texts, it still comprised a biomedical lexicon which, combined with the use of
ICD codes, made for a closely related task. The WBI team [20] approached the
task as a multi-label classification problem and leveraged the multilingual BERT
model, extended by an output layer that produced the individual probabilities
for each possible ICD-10 code. With this setup, the authors obtained the
best results on both the Precision and F-Measure metrics. However, the MLT-DFKI
team [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] managed to improve on their recall. While the authors also employed a
BERT-based model, in this case they applied its biomedical variant, BioBERT
[13], in conjunction with an automatic translation system from German to
English (as BioBERT is trained on the English BERT model
instead of the multilingual one). It is worth noting that the WBI team also made
use of extra training data from the German Clinical Trials Register and tried
ensemble techniques to improve the overall performance. In this year's edition, the
clinical notes are longer while preserving the challenges related to the
clinical language, the non-standard terms and the large ICD-10 label-set.
Besides, the Explainable-AI-related assignment brings a new challenge regarding
the interpretability of models.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Materials</title>
      <p>For the resolution of the three sub-tasks, the organization has provided both
main and additional data, and we have also employed extra in-house resources. The
main data consist of 1,000 clinical case studies which were coded manually according
to the ICD-10 by practising physicians and clinical documentalists. Table 1 shows
a brief quantitative description of the main datasets regarding the texts.</p>
      <p>Partition docs sent/doc words words/doc
Train 500 17.62 172,533 345.1
Dev 250 18.45 86,913 347.7
Test + background 250 + 2,751 19.25 1,110,601 370.1
Table 1. Quantitative description of the texts in the main datasets by partition</p>
      <p>Note that the row with the test data is in fact `test + background' and that
there is a gap between the number of clinical studies which are coded
manually (1,000) and the full number of available documents (3,751). The reason is
that the test set is intentionally inflated with a so-called `background' set of
2,751 documents, added to the real test documents (250) to prevent manual
predictions. The systems are only evaluated on the 250 test-set documents,
but since we cannot discern them, the statistics shown are for the test and
background sets together. Regarding the 1,000 coded documents, the partition
split proportion is 50/25/25. Including the background set, there is a total of
71,190 sentences and 1,370,047 words from 105,038 unique words, leading to
documents with a mean length of 354 words and a standard deviation of 210. It is
relevant to mention that the texts comprise biomedical and medical literature as
well as clinical records, involving a variety of medical topics such as oncology,
urology, cardiology, pneumology and infectious diseases. Hence the variety of
technical lexicon is increased, which increases the challenge. Regarding the OOVs,
the percentage of OOV words is 47.81% in the dev set and 80.47% in the test
+ background set, meaning that the two sets do not follow a similar pattern
concerning the lexical distribution (note that, since the figure also includes the
background set, we cannot claim that this divergence holds for the test set alone).</p>
      <p>Regarding the labels, namely the ICD-10 codes, there are 10,711 annotated
codes, with 2,925 unique ones, from both the ICD-10-CM (diagnostics) and
ICD-10-PCS (procedures). Table 2 presents an overview of the statistics of the train
and dev partitions (which were the annotated partitions of the corpus available
before the submission and, consequently, the data used for training the models).
partition label-set label count unique labels cardinality max imb. ratio
Train CM 5,661 1,767 11.3 0.009
Train PCS 1,550 563 3.6 0.015
Dev CM 2,683 1,158 10.7 0.025
Dev PCS 817 375 3.7 0.02
All CM 7,211 2,196 11.0 0.014
All PCS 3,500 729 3.6 0.02
Table 2. Statistical description of the labels of the main dataset by partition
One can see that all the codes from the train + dev sets represent only a small
percentage of the full ICD-10 label-set (98,287 codes for ICD-10-CM and 87,169 for
ICD-10-PCS), but they still portray a large label-set, especially taking into account
the low representativeness of some codes (i.e. only 200 CM labels appear in 1% or
more of the clinical cases from the train set) and the extreme imbalance. More
important still is the question of the disjoint codes among sets and, especially,
unseen codes in the test set. In fact, there may be unseen codes in the test set
and, in general, there are codes which only appear in one partition, since the
partitions were obtained via a random split into train, dev and test (i.e. there
are 1,036 CM and 352 PCS labels in the dev set not seen in the train set). This
leads to a zero-shot learning environment where a standard classifier will make
predictions solely among the codes seen in the training phase and, therefore, fail
to predict the unseen codes.</p>
      <p>The CodiEsp-X sub-task requires detecting the text reference position, so
the available corpus also provides the annotated start and end positions. Also, keep
in mind that there are continuous and discontinuous codes: the former implies
that all the words related to the code appear sequentially in the text, while for
the latter there are several fragments of text related to the code. Nevertheless,
in both cases, a detection is evaluated as correct by giving the start
position of the first (or unique) fragment and the end position of the last (or
unique) fragment, regardless of the number of fragments. The organization also
provides additional resources and, from those, we have used the Spanish abstracts
from Lilacs and Ibecs with ICD-10 codes to expand the dictionary of ICD and
non-standard descriptions. The in-house resources employed by our team consist of
additional non-standard term descriptions for some ICDs. Moreover, we have
applied a Medical Named Entity Recognition (NER) system to extract medical
terms, such as diagnostic and procedure terms, and to reduce noisy words. This
alternative representation of the texts has helped us with the augmentation of the
train and dev sets.</p>
    </sec>
    <sec id="sec-4">
      <title>Methods</title>
      <p>
        The systems developed to deal with each sub-task are of two different kinds. First,
we have applied a tree-based multi-label classifier based on gradient boosting
machines [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], to cope with CodiEsp-D and CodiEsp-P, presented in section 4.1.
Furthermore, we have developed a coder based on string similarities, which can
cope with the CodiEsp-D and CodiEsp-P sub-tasks, but also with CodiEsp-X,
introduced in section 4.2. Besides, we combined the outputs from the classifiers
and the coders to improve the overall results.
      </p>
      <p>Regarding the text representation, we have applied a Medical Named Entity
Recognition (NER) tool to extract medical entities from the raw texts.
Particularly, it classifies each word as `Disease', `Procedure', `Drug', `Part of the body'
or `Others'. Taking that classification, we have extracted three alternate
representations of the raw clinical notes following two strategies: i) Medical terms
(NER Med): aims for noise removal, preserving only those words not classified
as `Others'; and ii) Diagnostics (NER D) or Procedures (NER P): preserve only
the words marked as `Disease' or `Procedure', accordingly. These alternate
representations can also be concatenated to the raw texts, as a data augmentation
technique.</p>
      <p>4.1 Tree-based multi-label classifier: Gradient Boosting Machines</p>
      <p>The Gradient Boosting Machine (GBM) is an ensemble classifier. Ensemble
classifiers rely on the combination of several base classifiers to make a final
prediction. Specifically, the boosting technique consists in training several
classifiers sequentially, in such a manner that each classifier learns from the
errors made by the previous ones. The objective of each individual classifier is
to reduce the loss function, in this case the binary cross-entropy (CE), given by
expression (1), where log is the natural logarithm, y is the binary label and p
is the prediction or membership probability for the given class.</p>
      <p>CE = −[y log p + (1 − y) log (1 − p)] (1)</p>
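      <p>Expression (1) can be checked with a minimal numeric sketch (for illustration only; it is not our training code):</p>
      <preformat>
```python
import math

def binary_cross_entropy(y: int, p: float) -> float:
    """CE = -[y*log(p) + (1 - y)*log(1 - p)], log being the natural logarithm."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A confident correct prediction yields a low loss,
# a confident wrong prediction a high one.
low = binary_cross_entropy(1, 0.9)   # small loss
high = binary_cross_entropy(1, 0.1)  # large loss
```
      </preformat>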
      <p>
        The optimization of the function uses a gradient descent algorithm to
minimize the loss when adding new classifiers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. To cope with the multi-label
paradigm, we applied the one-versus-rest approach, in which as many binary
classifiers as labels are trained. Thus, for the i-th binary classifier,
label i is treated as the positive class and all the remaining labels as negative.
The training procedure is then followed by a post-processing stage where the
optimal threshold must be found. However, note that the evaluation metric
applied in this task is the Mean Average Precision (MAP), which is well suited for
candidate ranking. For consistency with this metric, the output of our
system is a ranking of all the possible labels ordered by probability. That is, the
system provides all the labels (1,767 for CM and 563 for PCS), even though
from the data analysis (in Table 2) one could expect the system to provide only
around 11.3 labels in CM and 3.6 in PCS. More about this question is discussed
in section 6. We have applied the XGBoost implementation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] with the Scikit-learn [17] wrapper.</p>
      <p>4.2 Similarity Match</p>
      <p>Our Similarity Match algorithms set their foundation on the similarity between
two strings. For this work, we have implemented two variations that, although
they follow the same approach, differ severely in the core of the algorithm,
i.e. the computation of the similarity itself. We have named them, respectively,
the PartialMatch and BERTMatch coders.
      </p>
      <p>First, let us describe the shared logic behind the two coders using our specific
use-case as an example. On the one hand, we get a clinical record with several
sentences. Naturally, in each sentence one or more terms associated with a given
ICD code can appear. As an example, here is a sentence from the corpus:</p>
      <sec id="sec-4-1">
        <title>Sentence:</title>
        <p>`Realizamos frotis sanguíneo para justificar la causa de la
anemia y trombocitopenia' (`we performed a blood smear to justify the cause of
the anaemia and thrombocytopenia')</p>
        <p>On the other hand, there is an ICD code dictionary, which relates each code
to one or more arbitrary-length description strings (either standard diagnostic
terms from the ICD or gold mentions from the corpus). An entry from the ICD
dictionary, as shown below, conveys the ICD code (D69.6) and one or more
standard ways to refer to that code (e.g. plaquetopenia, tombocitopenia,
trombocitopenia, trombopenia).</p>
      </sec>
      <sec id="sec-4-2">
        <title>Dictionary entry:</title>
        <p>D69.6:
plaquetopenia
tombocitopenia
trombocitopenia
trombopenia</p>
        <p>The dictionary can include a variety of terms, standard and non-standard,
single-word descriptions and even phrases frequently associated with
the code, like `enfermedad de graves basedow' for the `E05.00' code, which differs
sharply from the standard ICD description (`tirotoxicosis con bocio difuso sin
crisis tirotoxica ni tormenta tiroidea').</p>
        <p>
          Next, a Similarity Match algorithm cycles through all the associated
strings of each ICD code, and through all the texts, computing the similarity
between pairs of standard or non-standard terms and text fragments. The
text fragments are extracted with a sliding window whose length is set
to the number of words of the current ICD description. Following the example,
the process to find the likelihood of the D69.6 code in the sample text is as
follows: compute the similarity between the `plaquetopenia' term and each word
of the target text, and store the maximum value. Then, repeat for the rest of
the associated terms (tombocitopenia, trombocitopenia, . . . ) and, finally, take
the overall maximum value. As the similarity metric is normalized in the [0, 1] range,
it can be interpreted as a membership probability for each code in each clinical
record. In the case of CodiEsp-X, which requires identifying the range, it is only
necessary to search for the range of the text fragment that leads to the maximum
similarity.
        </p>
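        <p>The shared matching loop just described can be sketched as follows. This is an illustrative simplification, not our exact implementation: Python's difflib.SequenceMatcher ratio stands in for the similarity functions defined below, and the dictionary entry is the D69.6 example above.</p>
        <preformat>
```python
from difflib import SequenceMatcher

def window_similarity(description: str, text: str) -> tuple[float, tuple[int, int]]:
    """Slide a window of len(description.split()) words over the text and
    return the best similarity plus the (start, end) character range."""
    words = text.split()
    n = len(description.split())
    best, best_range = 0.0, (0, 0)
    for i in range(len(words) - n + 1):
        fragment = " ".join(words[i:i + n])
        score = SequenceMatcher(None, description, fragment).ratio()
        if score > best:
            start = text.index(fragment)
            best, best_range = score, (start, start + len(fragment))
    return best, best_range

sentence = ("Realizamos frotis sanguíneo para justificar la causa "
            "de la anemia y trombocitopenia")
entry = {"D69.6": ["plaquetopenia", "tombocitopenia",
                   "trombocitopenia", "trombopenia"]}

# Likelihood of D69.6: the maximum over all its associated terms;
# the range is what CodiEsp-X asks for.
score, span = max(window_similarity(term, sentence)
                  for term in entry["D69.6"])
```
        </preformat>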
        <p>The similarity computation is then what differentiates the two developed
coders. Let us describe a similarity function as one that, for a given pair of
strings as input, generates a similarity coefficient, as described in (2), where
s1 and s2 are a pair of strings, sim is the similarity coefficient normalized in
the range [0, 1] and Σ is the vocabulary. Regarding the interpretation,
sim(s1, s2) = 1 means that s1 and s2 are the same string, while sim(s1, s2) = 0
means that they are completely different.</p>
        <p>sim : Σ × Σ → [0, 1] (2)</p>
        <p>On this basis, the PartialMatch coder applies a regular string-similarity
algorithm, such as Jaro-Winkler [24] or the Levenshtein distance [14] (we also
enable an `Auto' configuration that dynamically chooses one or the other based
on the length of the given term).</p>
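        <p>As an illustration of such a normalized similarity, a Levenshtein-based coefficient can be sketched as follows (a textbook dynamic-programming version; the PartialMatch coder relies on existing implementations of Jaro-Winkler and Levenshtein):</p>
        <preformat>
```python
def levenshtein(s1: str, s2: str) -> int:
    """Classic edit distance computed with a rolling DP row."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (c1 != c2)))  # substitution
        prev = curr
    return prev[-1]

def sim(s1: str, s2: str) -> float:
    """Normalized similarity in [0, 1]: 1.0 means identical strings."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein(s1, s2) / max(len(s1), len(s2))
```
        </preformat>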
        <p>
          On the other hand, the BERTMatch coder leverages the multilingual BERT
model [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] to derive a similarity between strings. The process is as follows:
first, for each string si, a dense representation v(si) ∈ Rn (with n = 768) is
extracted from the representation of the texts generated internally by the BERT
model. Then, the similarity between the vectors v(s1) and v(s2) is computed
via the cosine similarity [11]: simBERTMatch(s1, s2) = cos(v(s1), v(s2)). Note
that the BERTMatch algorithm is far more computationally demanding than
PartialMatch; hence, it was not applied to the test-set predictions (indeed, the
test + background set is, curiously enough, the largest set, as shown in Table 1).
        </p>
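        <p>The final step reduces to the cosine between the two extracted vectors, e.g. (toy low-dimensional vectors stand in here for the 768-dimensional BERT representations):</p>
        <preformat>
```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """cos(u, v) = <u, v> / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors score ~1.0, orthogonal vectors ~0.0.
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```
        </preformat>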
        <p>Following the example, the matching score between the word `trombocitopenia'
from the model sentence and the word `tombocitopenia' from the dictionary
entry gives a similarity value (in the range [0.0, 1.0], with 0.0 meaning completely
different words and 1.0 exactly the same word) of 0.98 with Jaro-Winkler (as it is
almost the same word, with a slight spelling mistake) but only 0.46 with the
BERT embeddings. However, the score between `trombocitopenia' and
`plaquetopenia' is as low as 0.57 with Jaro-Winkler (although they are synonyms) and
0.78 with the BERT embeddings, a much more appropriate score since both
words mean the same thing.</p>
        <p>Finally, it should be noted that we have developed all the classifiers and
coders in such a way that their outputs can be combined. Combining is done using
simple aggregation functions, such as the mean, minimum or maximum over the
similarity scores or probabilities, which is a straightforward but practical way to
improve results through ensembling strategies.</p>
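        <p>For instance, combining per-code score dictionaries by their mean can be sketched as follows (a schematic illustration: the scores are invented, and treating a code missing from one model as scored 0.0 is one possible design choice):</p>
        <preformat>
```python
from statistics import mean

def combine(outputs: list[dict[str, float]], agg=mean) -> dict[str, float]:
    """Aggregate per-code scores from several models; codes absent
    from a model contribute 0.0."""
    codes = set().union(*outputs)
    return {c: agg([o.get(c, 0.0) for o in outputs]) for c in codes}

classifier = {"D69.6": 0.91, "K85.90": 0.40}   # e.g. XGBoost probabilities
coder = {"D69.6": 0.98, "N81.2": 0.75}         # e.g. PartialMatch similarities
ensemble = combine([classifier, coder])
# ensemble["D69.6"] is the mean of 0.91 and 0.98
```
        </preformat>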
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>The results of the submissions on the test set, as reported by the CLEF
organizers for the CodiEsp-D/P and X sub-tasks, are presented in this section.
Tables 3, 4 and 5 show our team's submission results, including the predictions
for the test set (250 docs) and excluding the 2,751 background-set docs. For
the official results, only predictions for test files and labels from the train and
dev sets were considered.</p>
      <p>Note that, during the development phase of the challenge, it was reported that
there were codes present in the test set that were not present in the train and
validation sets, but only after the submission phase was it reported that these codes
would not be taken into account for the evaluation of the results. Therefore, the
systems were developed considering that all metrics would be computed taking
into account also the predictions for the codes present only in the test set, which
could have had significant harmful effects on the results.</p>
      <p>The official metrics for the sub-tasks are MAP for CodiEsp-D/P and the
F-Score for CodiEsp-X, but other metrics were also computed and reported:
specifically, MAP@30, Precision and Recall for CodiEsp-D/P, and Precision and
Recall for CodiEsp-X. Finally, in CodiEsp-D, Precision, Recall and F-Score were
also computed for categories, considering as category the first three digits of an
ICD-10-CM code (i.e. codes P96.5 and P96.89 are both mapped to P96). Therefore,
a system that predicts the code P96.89 for a document whose correct code maps to
P96 would be correct at category level. In CodiEsp-P, Precision, Recall and F-Score
are also computed for categories, in this case considering as category the first
four digits of the code.</p>
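      <p>This category mapping amounts to a simple prefix truncation; an illustrative sketch (not the official evaluation code):</p>
      <preformat>
```python
def cm_category(code: str) -> str:
    """ICD-10-CM category: the first three characters (e.g. P96.5 -> P96)."""
    return code[:3]

def pcs_category(code: str) -> str:
    """ICD-10-PCS category: the first four characters."""
    return code[:4]

# P96.5 and P96.89 fall in the same category, so predicting P96.89
# for a document whose gold code is P96.5 counts at category level.
same_category = cm_category("P96.5") == cm_category("P96.89")
```
      </preformat>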
      <p>Regarding the column names: M stands for MAP, M30 for MAP@30, P for
Precision, R for Recall, and F1 for F1-Score. The T suffix stands for Test and
the C suffix for Category. Columns with the T suffix show the results evaluated
only on the labels from the train and dev sets (excluding the test-only labels),
while those with the C suffix show the results considering the category labels.</p>
      <p>For all sub-tasks, each submitted run is the result of applying different
techniques. Table 3 presents the results from the CodiEsp-D sub-task, where each run
corresponds to the following setup: 1) XGBoost classifier, trained with
documents and diagnostic labels from the train and dev sets, augmenting the clinical
texts with the outputs of the NER Med and the NER D. 2) PartialMatch coder
with the Jaro-Winkler similarity algorithm, predicting only the diagnostic
labels present in the train and dev sets. 3) The combination of the outputs from
1) and 2).</p>
      <p>Regarding the official metric for this task, namely the MAP, or more precisely
the MAP evaluated only on the test set (M-T column), the XGBoost classifier
prevails over the PartialMatch strategy, with 63.8 and 57.1 points respectively.
However, the best result comes from run3, the combination of both
methods, reaching 69.8 MAP points.</p>
      <p>Run M M-T M30 M30-T P R F1 P-T R-T F1-T P-C R-C F1-C
run1 0.543 0.638 0.529 0.622 0.004 0.858 0.009 0.004 1.0 0.009 0.01 0.968 0.021
run2 0.485 0.571 0.469 0.553 0.004 0.858 0.009 0.004 1.0 0.009 0.01 0.968 0.021
run3 0.593 0.698 0.578 0.681 0.004 0.858 0.009 0.004 1.0 0.009 0.01 0.968 0.021
Table 3. Submission results for the CodiEsp-D sub-task as reported by the CLEF
organization.</p>
      <p>Table 4 presents the results from the CodiEsp-P sub-task, where each run
corresponds to the following setup: 1) XGBoost classifier, trained with documents
and procedure labels from the train and dev sets, augmenting the clinical texts with
the outputs of the NER Med and the NER P. 2) PartialMatch coder with
the Jaro-Winkler similarity algorithm, predicting only the procedure labels
present in the train and dev sets. 3) The combination of the outputs from 1)
and 2).</p>
      <p>Run M M-T M30 M30-T P R F1 P-T R-T F1-T P-C R-C F1-C
run1 0.412 0.46 0.395 0.441 0.004 0.825 0.008 0.004 1.0 0.008 0.005 0.857 0.01
run2 0.362 0.414 0.339 0.389 0.004 0.825 0.008 0.004 1.0 0.008 0.005 0.857 0.01
run3 0.425 0.481 0.401 0.455 0.004 0.825 0.008 0.004 1.0 0.008 0.005 0.857 0.01
Table 4. Submission results for the CodiEsp-P sub-task as reported by the CLEF
organization.</p>
      <p>As in the D sub-task, the best M-T result among single models is
achieved by the XGBoost classifier, with 46.0 points, while the PartialMatch
strategy stays about 5 points below, with 41.4 points. Once again, the
combination of both methods manages to improve on the individual performances,
with a solid 48.1 MAP points.</p>
      <p>Table 5 presents the results from the CodiEsp-X sub-task, where each run
corresponds to the following setup: 1) PartialMatch coder with the Jaro-Winkler
similarity algorithm. 2) PartialMatch coder with the `Auto' configuration for the
similarity algorithm. 3) PartialMatch coder with the Levenshtein similarity
algorithm. In each setup, only the diagnostic and procedure labels present in the
train and dev sets are predicted.</p>
      <p>For the CodiEsp-X task, the official metric is the F1-Score, particularly the
F1-Score evaluated only on the test set (F1-T column). We can see that the
Jaro-Winkler algorithm, which dominated on the D/P tasks, is here, curiously, the
worst-performing one, with 7.6 points. The `Auto' configuration, which mixes the
Jaro-Winkler and Levenshtein algorithms, reaches 20.5 points. Finally,
the Levenshtein algorithm improves that mark by approximately 10 points, with
a solid F1 score of 30.6, which is our best result overall for the CodiEsp-X task.
Although we have not been able to apply the Similarity Match algorithm based
on BERT embeddings to the test + background set for computational reasons,
our experiments on the dev set suggest that the BERTMatch algorithm is able
to overcome the Jaro-Winkler.</p>
    </sec>
    <sec id="sec-6">
      <title>Discussion</title>
      <p>For the sub-tasks CodiEsp-D and CodiEsp-P, the code predictions must be ranked,
that is, the systems generate a list of possible codes ordered by confidence. The main
metric for evaluating these outputs is the Mean Average Precision (MAP). It is
computed iteratively: first, precision is computed considering only the first ranked
code, then considering the first two codes, and so on; finally, the precision
values are averaged over the number of gold codes. The organizers claim that
MAP is the most standard ranking metric in the TREC community, and it
has shown good discrimination and stability [16]. However, the way to exploit
the MAP metric is to output all the considered codes, without discrimination,
ranked by confidence. In other words, ranking all the considered ICD codes
and not establishing a threshold for a discrete "Yes/No" decision. Our scripts
yield this output because it is the way to maximize the MAP metric and face the
competition, but we believe that this way of evaluating might not be the most
desirable, since the notion of an "automatic classifier" that "decides" whether or
not a code belongs to a given document is shaded. We feel that, instead of ranking
all the labels available within the ICD, the system should limit its output to
the subset of labels that correspond to the document. Nevertheless, the MAP metric
favours a rank over all the labels above a rank over a subset of labels. In brief,
the ability to state whether a code is present or not in the given medical
record is not regarded by the MAP metric. Accordingly, a weakness of this task
is the need for a threshold for accepting and discarding codes given the ranked
list. By contrast, the CodiEsp-X sub-task does not present this drawback, since
its main evaluation metric is the micro F-Score, and therefore each predicted
code that does not belong to the ground truth carries a penalty.</p>
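      <p>Under one common reading of that description, per-document average precision can be sketched as follows; MAP is then the mean over documents (an illustration, the official evaluation script may differ):</p>
      <preformat>
```python
def average_precision(ranked: list[str], gold: set[str]) -> float:
    """Precision at each rank that holds a gold code, averaged over
    the number of gold codes."""
    hits, total = 0, 0.0
    for k, code in enumerate(ranked, 1):
        if code in gold:
            hits += 1
            total += hits / k
    return total / len(gold) if gold else 0.0

def mean_average_precision(docs: list[tuple[list[str], set[str]]]) -> float:
    """Mean of the per-document average precisions."""
    return sum(average_precision(r, g) for r, g in docs) / len(docs)

# A ranking that puts both gold codes first scores 1.0.
ap = average_precision(["D69.6", "N81.2", "K85.90"], {"D69.6", "N81.2"})
```
      </preformat>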
      <p>In the CodiEsp-X sub-task, there are some codification errors in which the
assigned ICD code and the text fragment that motivated its assignment do not
match. In these errors we have found slight differences: the main ICD block (the
first three digits of the code) remains the same while the modifiers (the
remaining digits) vary. However, the evaluation computes the F-score on the full
code, without considering the relationship between codes in the hierarchy, so
this type of error (confounding two closely related diseases) is penalized like
any other (i.e. confounding unrelated diseases). For example, for the record
with ID `S021169952011000500011-3', the label `K85.10 - Biliary acute
pancreatitis without necrosis or infection' is assigned, motivated by the
following text fragment: `acute non-lithiasic pancreatitis'. The mistake is that
the record states that it is `non-lithiasic pancreatitis', whereas the code
corresponds to `lithiasic' or `biliary' pancreatitis. The label assigned by our
system is `K85.90 - Acute pancreatitis without necrosis or infection,
unspecified', and although we cannot claim that K85.90 is the correct label, it
seems, at least, more accurate than K85.10; yet it is counted as an error.</p>
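<p>A small sketch of this effect (with illustrative (document, code) pairs): a flat micro F-score over full codes treats the near miss exactly like an unrelated code, whereas matching only the three-character main block would credit it.</p>

```python
def micro_f1(pred, gold):
    """Micro F1 over (document, code) pairs, in the spirit of the CodiEsp-X metric."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold      = {("doc1", "K85.10")}
pred_near = {("doc1", "K85.90")}   # same K85 main block, different modifier
pred_far  = {("doc1", "J45.20")}   # unrelated disease

flat_near = micro_f1(pred_near, gold)   # 0.0: near miss fully penalized
flat_far  = micro_f1(pred_far, gold)    # 0.0: identical penalty

def main_block(pairs):
    """Truncate every code to its main ICD block (first three characters)."""
    return {(doc, code[:3]) for doc, code in pairs}

block_near = micro_f1(main_block(pred_near), main_block(gold))  # 1.0
```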
      <p>In document `S2254-28842013000300009-1' we have the following text
fragment: `Mujer de 73 años de edad con antecedentes personales de [. . .],
histerectomía por prolapso uterino y [. . .]' (a 73-year-old woman with a
personal history of [. . .], hysterectomy due to uterine prolapse and [. . .]).
Our system gives a confidence of 98.5% to the `Z90.710 - Acquired absence of
both cervix and uterus' code, which describes a hysterectomy (the surgical
removal of the uterus, which may also include the cervix and other surrounding
structures [23]). The Z90.710 code is considered incorrect, and no other gold
code matches the word `histerectomía' (though the fragment is coded with
`N81.2 - Incomplete uterovaginal prolapse' due to `prolapso uterino', which is
the cause of the hysterectomy and seems correctly coded). There are abundant
examples of this type of missing code in the ground truth that unfairly lead to
False Positives. Accordingly, we believe that the evaluation results of these
tasks should be regarded with prudence.</p>
    </sec>
    <sec id="sec-7">
      <title>Concluding remarks and future work</title>
      <p>The CodiEsp Track proposes two different sub-tasks based on the classification
of medical texts according to the ICD-10 CM and PCS codes. The CodiEsp-D/P
sub-tasks aim at the automatic classification of diagnostic and procedure codes,
while the CodiEsp-X sub-task strives to bring explainability to the challenge.</p>
      <p>We have developed several systems to cope with these tasks: two strategies
with five different algorithms for the D and P sub-tasks, and one strategy with
four algorithms capable of producing explainable results, together with the
ability to ensemble the distinct models, enhanced by techniques that yield
alternative representations of the medical texts with tools such as medical NER,
while also experimenting with different label sets.</p>
      <p>Regarding the D and P sub-tasks, the similarity-match-based algorithms
perform better, on average, than the multi-label classifiers. However, we
conclude that the NER techniques for enriching the medical text inputs do
improve the performance of the classifiers, and the best overall results are
achieved with the combination of both methods.</p>
      <p>The best similarity algorithm for diagnostics and procedures individually
seems to be Jaro-Winkler, while it is Levenshtein for the CodiEsp-X sub-task as
a whole. We have not delved into this issue, but it might be related to
divergences in the average length of diagnostic and procedure terms.</p>
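<p>For reference, textbook formulations of the two measures (these sketches are not our competition implementations): Jaro-Winkler [24] normalizes by string length and boosts shared prefixes, which tends to favour short terms, while raw Levenshtein distance [14] counts absolute edits.</p>

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, c in enumerate(s, 1):
        cur = [i]
        for j, d in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c != d)))
        prev = cur
    return prev[-1]

def jaro(s, t):
    """Jaro similarity: matches within a sliding window, discounted by transpositions."""
    if s == t:
        return 1.0
    window = max(len(s), len(t)) // 2 - 1
    s_match, t_match = [False] * len(s), [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_match[j] and t[j] == c:
                s_match[i] = t_match[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_chars = [c for c, m in zip(s, s_match) if m]
    t_chars = [c for c, m in zip(t, t_match) if m]
    transpositions = sum(a != b for a, b in zip(s_chars, t_chars)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)

# A short diagnostic term vs. a longer variant (hypothetical example):
a, b = "pancreatitis", "pancreatitis aguda"
d = levenshtein(a, b)                    # 6 edits
sim_lev = 1 - d / max(len(a), len(b))    # 0.6667 when length-normalized
sim_jw = jaro_winkler(a, b)              # 0.9333: the prefix boost dominates
```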
      <p>The similarity match algorithm based on BERT dense representations
appears to be weaker than the traditional approaches, but shows promising
results when applied to the extraction of diagnostic and procedure term
boundaries. Considering the full set of ICD-10 codes instead of only those from
the train set degrades the performance. This can be observed in every sub-task,
and we believe it is due to the large number of extra codes considered with
respect to the small number of codes that appear only in the dev set. Improving
NER and looking for combined match approaches might lead to further
improvements.</p>
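<p>The effect of enlarging the label set can be sketched with a toy nearest-neighbour search over dense label vectors (the three-dimensional "embeddings" and the code `X99.9` below are invented for illustration, not real BERT outputs): every extra label is one more candidate that can outrank the correct one.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_labels(mention_vec, label_vecs):
    """Rank candidate ICD codes by similarity of their description embedding."""
    return sorted(label_vecs,
                  key=lambda code: cosine(mention_vec, label_vecs[code]),
                  reverse=True)

train_labels = {"K85.90": [0.90, 0.10, 0.00],   # codes seen in the train set
                "N81.2":  [0.00, 0.80, 0.20]}
extra_label  = {"X99.9":  [0.88, 0.15, 0.02]}   # hypothetical unseen full-ICD code
mention      = [0.88, 0.15, 0.02]

best_train = rank_labels(mention, train_labels)[0]                     # "K85.90"
best_full  = rank_labels(mention, {**train_labels, **extra_label})[0]  # "X99.9"
```

<p>With the restricted label set the intended code wins; with the enlarged set a spurious neighbour takes the top rank, mirroring the degradation we observe with the full ICD-10.</p>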
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the Spanish Ministry of Science and
Technology (PAT-MED PID2019-106942RB-C31) and by the Basque Government
(Elkartek KK-2019/00045, IXA IT-1343-19, Predoctoral Grant PRE-20191-0158).</p>
      <p>11. Han, J., Kamber, M., Pei, J.: 2 - getting to know your data. In: Han, J., Kamber,
M., Pei, J. (eds.) Data Mining (Third Edition), pp. 39-82. The Morgan Kaufmann
Series in Data Management Systems, Morgan Kaufmann, Boston, third edn. (2012).
https://doi.org/10.1016/B978-0-12-381479-1.00002-2
12. Kelly, L., Suominen, H., Goeuriot, L., Neves, M., Kanoulas, E., Li, D., Azzopardi,
L., Spijker, R., Zuccon, G., Scells, H., et al.: Overview of the clef ehealth evaluation
lab 2019. In: International Conference of the Cross-Language Evaluation Forum for
European Languages. pp. 322-339. Springer (2019)
13. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: Biobert: a
pre-trained biomedical language representation model for biomedical text mining.
Bioinformatics 36(4), 1234-1240 (2020)
14. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and
reversals. In: Soviet physics doklady. vol. 10, pp. 707-710 (1966)
15. Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estape, J., Krallinger, M.:
Overview of automatic clinical coding: annotations, guidelines, and solutions for
non-english clinical cases at codiesp track of CLEF eHealth 2020. In: Working
Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop
Proceedings (2020)
16. Mogotsi, I.: Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze:
Introduction to information retrieval (2010)
17. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,
Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine
learning in Python. Journal of Machine Learning Research 12, 2825-2830 (2011)
18. Perez, J., Perez, A., Casillas, A., Gojenola, K.: Cardiology record multi-label
classification using latent dirichlet allocation. Computer methods and programs in
biomedicine 164, 111-119 (2018)
19. Pestian, J., Brew, C., Matykiewicz, P., Hovermale, D.J., Johnson, N., Cohen, K.B.,
Duch, W.: A shared task involving multi-label classification of clinical free text.
In: Biological, translational, and clinical language processing. pp. 97-104 (2007)
20. Sanger, M., Weber, L., Kittner, M., Leser, U.: Classifying german animal
experiment summaries with multi-lingual bert at clef ehealth 2019 task. CLEF (Working
Notes) (2019)
21. Stanfill, M.H., Williams, M., Fenton, S.H., Jenders, R.A., Hersh, W.R.: A
systematic literature review of automated clinical coding and classification systems.
Journal of the American Medical Informatics Association 17(6), 646-651 (2010)
22. Suominen, H., Kelly, L., Goeuriot, L., Neveol, A., Ramadier, L., Robert, A.,
Kanoulas, E., Spijker, R., Azzopardi, L., Li, D., et al.: Overview of the clef ehealth
evaluation lab 2018. In: International Conference of the Cross-Language Evaluation
Forum for European Languages. pp. 286-301. Springer (2018)
23. Thomson, A.P.: Handbook of Consult and Inpatient Gynecology, 1st edn. Springer
(2016)
24. Winkler, W.E.: The state of record linkage and current research problems. In:
Statistical Research Division, US Census Bureau. Citeseer (1999)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Almagro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Unanue</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fresno</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montalvo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Icd-10 coding of spanish electronic discharge summaries: An extreme classi cation problem</article-title>
          .
          <source>IEEE Access 8</source>
          ,
          <fpage>100073</fpage>
          -
          <lpage>100083</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Amin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dunfield</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vechkaeva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chapman</surname>
            ,
            <given-names>K.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wixted</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          :
          <article-title>Mlt-dfki at clef ehealth 2019: Multi-label classi cation of icd-10 codes with bert</article-title>
          .
          <source>CLEF (Working Notes)</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Atutxa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casillas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ezeiza</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fresno</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goenaga</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gojenola</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anchordoqui</surname>
            ,
            <given-names>M.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez-de Viñaspre</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Ixamed at clef ehealth 2018 task 1: Icd10 coding with a sequence-to-sequence approach</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          . p.
          <volume>1</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bounaama</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abderrahim</surname>
            ,
            <given-names>M.E.A.</given-names>
          </string-name>
          : Tlemcen university at clef ehealth
          <year>2018</year>
          <article-title>team techno: Multilingual information extraction-icd10 coding</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cauchy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Methode generale pour la resolution des systemes d'equations simultanees</article-title>
          .
          <source>Comp. Rend. Sci. Paris</source>
          <volume>25</volume>
          ,
          <fpage>536</fpage>
          -
          <lpage>538</lpage>
          (
          <year>1847</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guestrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Xgboost: A scalable tree boosting system</article-title>
          .
          <source>In: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining</source>
          . pp.
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>What happened in clef... for a while</article-title>
          .
          <source>Crestani</source>
          et al.[
          <volume>94</volume>
          ] (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>Greedy function approximation: a gradient boosting machine</article-title>
          . Annals of statistics pp.
          <fpage>1189</fpage>
          -
          <lpage>1232</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suominen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Escalada</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pasi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saez Gonzales</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viviani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Overview of the CLEF eHealth evaluation lab 2020</article-title>
          . In:
          <string-name>
            <surname>Arampatzis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsikrika</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vrochidis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joho</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lioma</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neveol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (eds.) Experimental IR Meets Multilinguality, Multimodality, and
          <source>Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF</source>
          <year>2020</year>
          ). LNCS Volume number:
          <volume>12260</volume>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>