<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Convolutional Attention Models with Hierarchical Post-Processing Heuristics at CLEF eHealth 2020</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elias Moons</string-name>
          <email>elias.moons@cs.kuleuven.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marie-Francine Moens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KU Leuven</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we compare state-of-the-art neural network approaches to the 2020 CLEF eHealth task 1. The presented models use the neural principles of convolution and attention to obtain their results. Furthermore, a hierarchical component is introduced as well as hierarchical post-processing heuristics. These additions successfully leverage the information that is inherently present in the ICD taxonomy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>A major challenge of this task lies in the small dataset size and, consequently, the small number of training samples for each category. For the diagnostic ICD codes, for example, there are in total 1,767 different categories spread out over only 500 training documents. Every document is labeled with on average 11.3 different categories and each category is on average represented by 3.2 training examples. Only seven categories have more than 50 training examples. For the procedural ICD codes, these numbers are slightly lower, with 563 different categories, 3.1 categories per example and only 2.7 training examples per category, leading to a very similar distribution. Figure 1 gives a sorted view of all categories present in the diagnostic training dataset (left) as well as the procedural training dataset (right) and the number of examples tagged with each specific category.</p>
<p>[Figure 1: frequency in the training set per category (sorted), for the diagnostic (left) and procedural (right) subtasks.]</p>
<p>In this paper we hypothesize that exploiting the knowledge of the hierarchical label taxonomy of ICD-10 helps the performance of automated coding when only a limited number of manually coded training examples is available.</p>
<p>The remainder of this paper is organized as follows. In section 2, related work relevant to the conducted research is discussed. The evaluated deep learning methods are described in section 3. These methods are evaluated on the benchmark CodiEsp ICD-10 dataset and all findings are reported in section 4. The most important findings are recapped in section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
<p>The most prominent recent advancements in categorizing medical reports with standard codes are briefly described in this section.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] a hierarchical support vector machine (SVM) is shown to outperform a flat SVM. Results were reported as F-measure scores on the Mimic-II dataset. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] show that datasets of different sizes and different numbers of distinct codes demand different training mechanisms. For small datasets, feature and data selection methods serve better. The authors evaluated ICD coding performance on a dataset consisting of more than 70,000 textual EMRs (Electronic Medical Records) from the University of Kentucky (UKY) Medical Center tagged with ICD-9 codes.
      </p>
      <p>A deep learning model that encompasses an attention mechanism is tested
by [9] on the Mimic-III dataset. LSTMs are used for both character and word
level representations. A soft attention layer here helps in making predictions for
the top 50 most frequent ICD-9 codes in the dataset.</p>
      <p>
        More recently, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] introduced the Hierarchical Attention bidirectional Gated Recurrent Unit model (HA-GRU). By identifying relevant sentences for each label, documents are tagged with corresponding ICD-9 codes. Results are reported on both the Mimic-II and Mimic-III datasets. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] presents the Convolutional Attention for Multi-Label classification (CAML) model, which combines the strengths of convolutional networks and attention mechanisms. They propose adding regularization on the long descriptions of the target ICD codes, especially to improve classification results on less represented categories in the dataset. This approach is further extended with the idea of multiple convolutional channels by [8], with max pooling across all channels. The authors also shift the attention from the last prediction layer, as in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], to the attention layer. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [8] achieve state-of-the-art results for ICD-9 coding on the MIMIC-III dataset. As an addition to these models, in this paper a hierarchical variant of each of them is constructed and evaluated. Furthermore, if the target output space of categories follows a hierarchy of labels, as is also the case in ICD coding, the trained models can efficiently use this hierarchy for category assignment [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][10][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. During categorization the models apply a top-down or a bottom-up approach at the classification stage. In a top-down approach parent categories are assigned first and only children of assigned parents are considered as category candidates. In a bottom-up approach only leaf nodes in the hierarchy are assigned, which entails that their parent nodes are assigned as well. The hierarchical structure of a tree leads to various parent-child relations between its categories. For the models discussed in this paper, a hierarchical variant is also tested which exploits the information of the tree structure and shows that it can enhance the classification performance. Recent research shows the value of these hierarchical dependencies using hierarchical attention mechanisms [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and hierarchical penalties [11], which are also integrated in this paper.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>
        In this section, we explain the models used for ICD code prediction. First, the preprocessing step is briefly discussed. Then, two recent state-of-the-art models in the field of ICD coding are explained in detail. These models are implemented by the authors following the original papers and are called DR-CAML [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and MVC-(R)LDA [8], respectively. We discuss in detail the attention mechanisms and loss functions of these models. Afterwards, as a way of handling the hierarchical dependencies of the ICD codes, we propose various ways of integrating them into all models, based on advancements in hierarchical classification as inspired by [11]. Lastly, heuristics are described for post-processing of the predictions given by the models. This leads in section 4 to a clear comparison between all tested models among themselves as well as with their novel hierarchical variants and the introduced post-processing.
      </p>
      <sec id="sec-3-1">
        <title>Preprocessing</title>
        <p>
          The preprocessing follows the standard procedure described in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], i.e., tokens that contain no alphabetic characters are removed and all tokens are lowercased. Furthermore, tokens that appear in fewer than three training documents are replaced with the 'UNK' token. All documents are then truncated to a maximum length of 2,500 tokens.
        </p>
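<p>The preprocessing steps above can be sketched as follows. This is a minimal illustration; the function name and parameters are ours, not the authors' code:</p>

```python
import re
from collections import Counter

def preprocess(docs, min_df=3, max_len=2500):
    """Sketch of the described preprocessing: drop tokens without
    alphabetic characters, lowercase, replace tokens occurring in
    fewer than min_df documents with 'UNK', truncate to max_len."""
    tokenized = []
    for text in docs:
        tokens = [t.lower() for t in text.split() if re.search(r"[a-zA-Z]", t)]
        tokenized.append(tokens)
    # document frequency of each (lowercased) token
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    out = []
    for tokens in tokenized:
        kept = [t if df[t] >= min_df else "UNK" for t in tokens]
        out.append(kept[:max_len])
    return out
```

<p>For example, a token seen in only one document is mapped to 'UNK' while a token seen in at least three documents is kept as-is.</p>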
        <p>All discussed models take for each document i as input a sequence of word vectors x_i as its representation, and produce as output a set of ICD codes y_i.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Convolutional models</title>
        <p>
          This subsection describes the details of the recent state-of-the-art models presented in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and [8] in the way they are used for the experiments in section 4.
        </p>
        <p>
          DR-CAML DR-CAML is a CNN-based model adapted for ICD coding [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. When an ICD code is defined by the WHO, it is accompanied by a label definition expressed in natural language, which can guide the model towards learning appropriate parameter values. For this purpose the model employs a per-label attention mechanism, enabling it to learn distinct document representations for each label. It has been shown that this approach is advantageous for labels with very few available training instances. The idea is that the description of a target code is itself a very good training example for the corresponding code. Similarity between the representation of a given test sample and the representation of the description of a target code gives extra confidence in assigning this label.
        </p>
        <p>In general, after the convolutional layer, DR-CAML employs a per-label attention mechanism to attend to the relevant parts of the text for each predicted label. An additional advantage is that the per-label attention mechanism provides the model with the ability to explain why it decided to assign each code, by showing the spans of text relevant to the ICD code.</p>
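<p>A minimal numpy sketch of per-label attention as described above. All names are illustrative simplifications of the CAML-style architecture, and the convolutional layer is assumed to have already produced the position-wise features H:</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def per_label_attention(H, U):
    """Per-label attention (sketch).
    H: (n, d) convolution output, one d-dim vector per token position.
    U: (L, d) one learned attention query vector per label.
    Returns V: (L, d), a label-specific document representation."""
    V = np.zeros((U.shape[0], H.shape[1]))
    for l in range(U.shape[0]):
        alpha = softmax(H @ U[l])  # attention weights over positions for label l
        V[l] = alpha @ H           # weighted sum of position vectors
    return V
```

<p>Each label l thus gets its own weighted combination of the token positions, which is what allows the model to point at the span of text supporting each predicted code.</p>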
      </sec>
      <sec id="sec-3-3">
        <title>MVC-(R)LDA</title>
        <p>Both MVC-LDA and MVC-RLDA can be seen as extensions of DR-CAML. Similar to that model, they are based on a CNN architecture with a label attention mechanism that treats ICD coding as a multi-task binary classification problem. The added functionality lies in the use of parallel CNNs with different kernel sizes to capture information of different granularity.</p>
        <p>In general, these multi-view CNNs are constructed with four CNNs that have the same number of filters but different kernel sizes. This convolutional layer is followed by a max-pooling function across all channels to select the most relevant span of text for each filter.</p>
        <p>Loss function The loss functions used to train DR-CAML and the multi-view models MVC-(R)LDA are calculated in the same way. The general loss function is the binary cross-entropy loss lossBCE. This loss is extended by regularization on the long description vectors of the target categories.</p>
        <p>Given N different training examples x_i, the values of ŷ_l and the max-pooled vector z_l can be calculated from the description of code l out of all L target codes. In the following formulas β_l is a vector of prediction weights and v_l the vector representation for code l. Assuming n_y is the number of true labels in the training data, the final loss is computed by adding regularization to the base loss function as:</p>
        <p>ŷ_l = σ(β_lᵀ v_l + b_l)   (1)</p>
        <p>lossBCE(X) = −Σ_{i=1..N} Σ_{l=1..L} [ y_l log(ŷ_l) + (1 − y_l) log(1 − ŷ_l) ]   (2)</p>
        <p>lossModel(X) = lossBCE + (λ/n_y) Σ_{i=1..N} Σ_{l=1..L} ‖z_l − β_l‖₂   (3)</p>
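<p>A simplified, single-document sketch of this objective: binary cross-entropy over all L labels plus a regularizer pulling the prediction weights toward the max-pooled description vectors. The name model_loss and the value of lam are illustrative:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def model_loss(y_true, logits, Z, B, lam=0.01):
    """Sketch of the combined objective for one document.
    y_true: (L,) binary label vector; logits: (L,) raw scores;
    Z: (L, d) max-pooled description vectors; B: (L, d) prediction
    weight vectors. lam is the regularization weight (assumed)."""
    y_hat = sigmoid(logits)
    bce = -np.sum(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))
    n_y = max(y_true.sum(), 1)  # number of true labels
    reg = np.linalg.norm(Z - B, axis=1).sum() / n_y
    return bce + lam * reg
```

<p>When the prediction weights B coincide with the description vectors Z, the regularizer vanishes and only the cross-entropy term remains.</p>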
      </sec>
      <sec id="sec-3-4">
        <title>Modelling hierarchical dependencies</title>
        <p>In this section we investigate the modelling of hierarchical dependencies as extensions of the models described above. A first part integrates the hierarchical dependencies directly into the structure of the model. This leads to hierarchical models, which are layered variants of the approaches already discussed. The second way hierarchical dependencies are explicitly introduced into the model is via the use of a hierarchical loss function that penalizes hierarchical inconsistencies in the model's prediction layer.</p>
        <p>Hierarchical models Hierarchical relationships can be shaped directly into the architecture of any of the models described above. The ICD-10 taxonomy can be modeled as a tree with a general ICD root and 4 levels of depth. On the highest level, codes have 1 character; the next 2 levels represent categories with respectively 3 and 4 characters. The rest of the codes are combined in the last layer. This leads to a hierarchical variant of any of the models. In this variant, not 1 but 4 identical models will be trained, one for each of the different layers in the ICD hierarchy (corresponding to the length of the codes).</p>
        <p>An overview of the approach is given in figure 2. The input for each layer is partially dependent on an intermediary representation from the previous layer as well as the original input, through concatenation of both. Layers are stacked from most to least specific, or from leaf to root node in the taxonomy. Models corresponding to different layers will then rely on different features, or characteristics, to classify the input vectors. This way the deepest, most advanced representations can be used for classifying the most abstract and broad categories. On the other hand, for the most specific categories, word-level features can directly be used to make detailed decisions between classes that are very similar.</p>
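<p>The wiring of this layered variant can be illustrated schematically. The per-layer model below is only a stand-in (a random nonlinear map playing the role of a trained CAML/MVC model); the point is that every layer receives the original input concatenated with the previous layer's representation:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_model(x, out_dim):
    """Stand-in for one trained per-layer classifier (illustrative only)."""
    W = rng.standard_normal((out_dim, x.shape[0]))
    return np.tanh(W @ x)

def hierarchical_forward(x, dims=(32, 16, 8, 4)):
    """Sketch of the 4-layer variant: layers stacked from most to
    least specific; each layer sees [original input, previous
    intermediate representation]. dims are illustrative sizes."""
    h = np.zeros(0)
    outputs = []
    for d in dims:
        inp = np.concatenate([x, h])  # original input + previous representation
        h = layer_model(inp, d)
        outputs.append(h)
    return outputs
```
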
        <p>Hierarchical loss function To capture the hierarchical relationships in a given model, the loss function of the above models can be extended with an additional term. This leads to the definition of a hierarchical loss function (lossH). This loss function penalizes classifications that contradict the inherent ICD hierarchy. More specifically, when a parent category is not predicted to be true, none of its child categories should be predicted to be true. The hierarchical loss between a child and its parent in the tree is then defined as the difference between their computed probability scores, with 0 as a lower bound. More formally, the entire loss function lossH_Model for a category of layer X, combining the regular training loss lossModel described above and the hierarchical loss lossH, is calculated as follows:</p>
        <p>P(X) = Probability(X == True)   (4)</p>
        <p>Par(X) = Probability(Parent(X) == True)   (5)</p>
        <p>L(X) = true label of X (0 or 1)   (6)</p>
        <p>lossH(X) = Clip(P(X) − Par(X), 0, 1)   (7)</p>
        <p>lossH_Model(X) = (1 − λ) lossModel(X) + λ lossH(X)   (8)</p>
        <p>This leaves a parameter λ with which to optimize the loss function.¹</p>
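<p>A minimal sketch of this hierarchical penalty, assuming categories are indexed and each has a parent pointer (-1 for roots); the function name and data layout are ours:</p>

```python
import numpy as np

def hierarchical_loss(p, parent):
    """Penalize a child predicted more probable than its parent.
    p[i]: predicted probability of category i;
    parent[i]: index of i's parent, or -1 for a root.
    Returns the summed per-edge penalty clip(P(child) - P(parent), 0, 1)."""
    loss_h = 0.0
    for i, par in enumerate(parent):
        if par >= 0:
            loss_h += np.clip(p[i] - p[par], 0.0, 1.0)
    return loss_h

# Combined objective (lam optimized on the training set):
# (1 - lam) * loss_model + lam * hierarchical_loss(p, parent)
```

<p>A child predicted less probable than its parent contributes nothing; only hierarchy-inconsistent predictions are penalized.</p>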
      </sec>
      <sec id="sec-3-5">
        <title>Hierarchical post-processing</title>
        <p>As a final step in the classification process, a heuristic post-processing is applied to some of the submitted models. All considered heuristics are explained below. They all rely on the distance between any pair of target categories in the ICD-10 taxonomy and reweigh the prediction values accordingly. The heuristics are numbered from H1 to H7 for efficient referencing in the results section.</p>
        <p>Node distance (H1) Given all L predictions y_i made for document i by any given model, the new prediction values y_i^post1 can be calculated as follows:</p>
        <p>y_i^post1 = Σ_{j=1..L} y_j / (1 + dist(i, j))   (9)</p>
        <p>The newly calculated prediction values are the result of a weighted sum of all previously calculated prediction values, taking into account the relative distances of all target categories in the ICD taxonomy. In general, dist(i, j) gives the distance between categories i and j in the ICD tree, e.g., the distance between a parent and its child is 1, the distance between two siblings is 2, and the distance of an element to itself is 0.</p>
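<p>The distance function dist(i, j) used by all heuristics can be computed from parent pointers, for example as below. This is an illustrative sketch, not the authors' code:</p>

```python
def tree_distance(i, j, parent):
    """Number of edges between categories i and j in the ICD tree
    (parent[c] = parent index, -1 at the root): a parent and its
    child are at distance 1, two siblings at distance 2, a node
    and itself at distance 0."""
    def path_to_root(c):
        path = [c]
        while parent[c] != -1:
            c = parent[c]
            path.append(c)
        return path
    pi, pj = path_to_root(i), path_to_root(j)
    common = next(a for a in pi if a in pj)  # lowest common ancestor
    return pi.index(common) + pj.index(common)
```
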
        <p>Node distance from child to ancestor (H2) This heuristic functions in the same way as the heuristic described above but differs in behavior when the lowest common ancestor (LCA) of categories i and j is not j itself: y_j is only added to the total new score of category i if j is an ancestor of i. This can be formally described as follows:</p>
        <p>y_i^post2 = Σ_{j=1..L} dist_{a,c}(i, j) · y_j   (10)</p>
        <p>dist_{a,c}(i, j) = 1 / (1 + dist(i, j)) if ancestor(i, j) == True; 0 if ancestor(i, j) == False   (11)</p>
      </sec>
      <sec id="sec-3-6">
        <title>Node distance from ancestor to child (H3)</title>
        <p>This heuristic functions analogously to heuristic H2 but in the opposite direction: y_j is only added to the total new score of category i if i is an ancestor of j. This gives:</p>
        <p>y_i^post3 = Σ_{j=1..L} dist_{c,a}(i, j) · y_j   (12)</p>
        <p>dist_{c,a}(i, j) = 1 / (1 + dist(i, j)) if i is an ancestor of j; 0 otherwise   (13)</p>
        <p>¹ The parameter λ is optimized over the training set.</p>
      </sec>
      <sec id="sec-3-7">
        <title>Node distance between ancestors and children (H4)</title>
        <p>Heuristic H4 combines the ideas presented in the previous two heuristics, only adding y_j when either i is an ancestor of j or j is an ancestor of i. Using equations 11 and 13, this evaluates to:</p>
        <p>y_i^post4 = Σ_{j=1..L} (dist_{a,c}(i, j) + dist_{c,a}(i, j)) · y_j   (14)</p>
        <p>Squared node distance (H5) This heuristic functions as heuristic H1 but squares the value of its distance function. As a result, it gives relatively more weight to predictions made for categories that are closer to the observed category, in comparison to H1. This leads to the following relationship:</p>
        <p>y_i^post5 = Σ_{j=1..L} y_j / (1 + dist(i, j)²)   (15)</p>
        <p>Squared node prediction values (H6) Heuristic H6 differs from the first heuristic in that it rescales the starting prediction values y_i. Instead of using the calculated values it uses the squares of these values, making discrepancies in prediction values relatively more prominent. The resulting values can be calculated via:</p>
        <p>y_i^post6 = Σ_{j=1..L} y_j² / (1 + dist(i, j))   (16)</p>
      </sec>
      <sec id="sec-3-8">
        <title>Squared node distances and prediction values (H7)</title>
        <p>This heuristic combines the ideas that comprise heuristics H5 and H6, leading to the following relationship:</p>
        <p>y_i^post7 = Σ_{j=1..L} y_j² / (1 + dist(i, j)²)   (17)</p>
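<p>Heuristics H1, H5, H6 and H7 share one functional form and differ only in whether the prediction value or the distance is squared; a sketch with illustrative names (H2-H4 additionally zero out non-ancestor pairs and are omitted here):</p>

```python
import numpy as np

def reweight(y, dist, num=1, den=1):
    """Generic form of heuristics H1/H5/H6/H7: the new score of
    category i is sum_j y_j^num / (1 + dist(i,j)^den).
    H1: num=1, den=1; H5: den=2; H6: num=2; H7: num=2, den=2.
    dist is an (L, L) matrix of tree distances."""
    y = np.asarray(y, dtype=float)
    L = len(y)
    return np.array([
        np.sum(y**num / (1.0 + dist[i]**den)) for i in range(L)
    ])
```

<p>For instance, with two sibling-free categories at distance 1 and predictions (1.0, 0.5), H1 yields new scores (1.25, 1.0): each category keeps its own score and receives the other's score halved.</p>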
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>For both the subtasks of predicting diagnostic and procedural codes, 5 different models were trained; this was the maximum number allowed in the competition. Since the size of the dataset was a problem during training, the authors chose to only train models for the top-50 most represented categories in the training dataset. During training of the hierarchical models, ancestors of the top-50 categories were added as well, but only the performance on the original 50 categories was taken into account for calculating the result metrics. A selection of models was chosen aiming for much variety, to be able to assess the influence of both proposed models (CAML and MVC-RLDA), the hierarchical objective, and post-processing using a heuristic. The chosen models are summarized below and are the same for both subtasks:
1. CAML
2. CAML + hierarchical objective
3. MVC-RLDA + hierarchical objective
4. CAML + hierarchical post-processing H1
5. MVC-RLDA + hierarchical objective + hierarchical post-processing H1
First, one baseline without use of the hierarchy and heuristics was chosen. Since CAML got slightly better results than MVC-RLDA on the development set, this model was selected. Second, to assess the influence the hierarchy can have on the classification results, both CAML and MVC-RLDA models were trained with a hierarchical objective. The last 2 models were chosen with the post-processing heuristic in mind. Only heuristic H1 was chosen for this (based on higher performance on the development set), once in a setting without hierarchical objective (with CAML) and once with the hierarchical objective (and MVC-RLDA). Since the models used in this paper had a lot of difficulties with the small number of training examples, the prediction probabilities of all categories were rather close together (often in the range of 0.3 to 0.5 instead of from 0.0 to 1.0). For this reason, the prediction files were generated using the top-5 highest predicted categories instead of using a fixed cut-off point. This is not optimal for obtaining a high MAP, for which it is better to submit more categories, even at lower performance values. The results obtained by these prediction files are visible in tables 1 and 2 for the diagnostic and procedural subtasks respectively.</p>
      <p>For the case of diagnostic codes, visible in table 1, the best performance is achieved by the CAML model in combination with heuristic post-processing H1. Adding the heuristic to CAML leads to a clear improvement in classification quality. Comparing CAML with CAML+Hier. leads to the conclusion that the hierarchy can also lead to an improvement, but it is less prominent than using the post-processing heuristic. Furthermore, it is clear that the MVC-RLDA model is outperformed by CAML. This is most likely due to the fact that the former model contains more trainable parameters than CAML, while only a small number of training examples is available.</p>
      <p>For the case of procedural codes, visible in table 2, the best results are now obtained by a combination of CAML with a hierarchical objective. This is closely followed by CAML with a post-processing heuristic. Both techniques improve the classification scores significantly, but the overall scores are lower than for the task of classifying diagnostic codes. Lastly, both MVC-RLDA models predicted invalid codes for all documents in the test set, not being able to learn significant relations present in the data.</p>
      <p>As an extra experiment to assess the performance of the described heuristics, a CAML model was post-processed with the 7 different heuristics. In this case, not only the top-5 categories were retained; instead, all top-50 categories were sorted by confidence. The resulting files were then evaluated with the evaluation file provided by the competition, and results are reported in table 3.</p>
      <p>For both the subtasks of classifying diagnostic and procedural codes, the use of heuristic H1 is the clear winner. It is worth noting that in no case did the results of the baseline get worse because of the use of a post-processing heuristic. Furthermore, in most cases this led to an improvement of the results, strengthening the claim that post-processing heuristics based on the ICD-10 taxonomy can be a valuable tool. Next to H1, the best performing heuristic is H5, which squares the distances between nodes in the classification tree. Since all heuristics that try to give more weight to nodes closer to the observed node underperform with respect to H1, it might be interesting to see whether the opposite can further improve the classification process.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper we trained 5 models for participation in 2 subtasks of the 2020 CLEF eHealth task 1. For both subtasks, experiments were conducted, yielding interesting results. The hierarchical component as well as the use of post-processing heuristics proved their value in this setting. The use of a multi-view neural network led to an abundance of trainable parameters, which ultimately made the model unable to efficiently generalize over the training samples. An extra experiment was conducted to assess the influence of the presented post-processing heuristics. This led to the conclusion that these heuristics can be a powerful tool for the classification of ICD codes.</p>
      <p>8. Sadoughi, N., Finley, G.P., Fone, J., Murali, V., Korenevski, M., Baryshnikov, S., Axtmann, N., Miller, M., Suendermann-Oeft, D.: Medical code prediction with multi-view convolution and description-regularized label-dependent attention. arXiv preprint arXiv:1811.01468 (2018)</p>
      <p>9. Shi, H., Xie, P., Hu, Z., Zhang, M., Xing, E.P.: Towards automated ICD coding using deep learning. arXiv preprint arXiv:1711.04075 (2017)</p>
      <p>10. Silla, C.N., Freitas, A.A.: A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22(1), 31-72 (Jan 2011)</p>
      <p>11. Wehrmann, J., Cerri, R., Barros, R.: Hierarchical multi-label classification networks. In: Dy, J., Krause, A. (eds.) ICML. Proceedings of Machine Learning Research, vol. 80, pp. 5075-5084. PMLR, Stockholmsmässan, Stockholm, Sweden (10-15 Jul 2018)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baumel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nassour-Kassis</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elhadad</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elhadad</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <article-title>: Multi-label classification of patient notes: a case study on ICD code assignment (</article-title>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suominen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Escalada</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pasi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , Saez Gonzales,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Viviani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Overview of the CLEF eHealth evaluation lab 2020</article-title>
          . In: Arampatzis,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Tsikrika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Vrochidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Joho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Eickhoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Neveol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            , Cappellato, L., Ferro, N. (eds.)
            <surname>Experimental IR Meets Multilinguality</surname>
          </string-name>
          , Multimodality, and
          <source>Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF</source>
          <year>2020</year>
          ) . LNCS Volume number:
          <volume>12260</volume>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kavuluru</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rios</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records</article-title>
          .
          <source>Artificial Intelligence in Medicine 65(2)</source>
          ,
          <volume>155</volume>
          -
          <fpage>166</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kowsari</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
          </string-name>
          , D.E.,
          <string-name>
            <surname>Heidarysafa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meimandi</surname>
            ,
            <given-names>K.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerber</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barnes</surname>
            ,
            <given-names>L.E.</given-names>
          </string-name>
          : Hdltex:
          <article-title>Hierarchical deep learning for text classification</article-title>
          .
          <source>In: 2017 16th IEEE International Conference on Machine Learning and Applications (Dec</source>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Miranda-Escalada</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Armengol-Estape</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of automatic clinical coding: annotations, guidelines, and solutions for non-english clinical cases at codiesp track of CLEF eHealth 2020</article-title>
          . In: Working Notes of Conference and
          <article-title>Labs of the Evaluation (CLEF) Forum</article-title>
          . CEUR Workshop Proceedings (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mullenbach</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Wiegreffe, S.,
          <string-name>
            <surname>Duke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eisenstein</surname>
          </string-name>
          , J.:
          <article-title>Explainable prediction of medical codes from clinical text</article-title>
          .
          <source>In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long Papers). pp.
          <volume>1101</volume>
          -
          <fpage>1111</fpage>
          . Association for Computational Linguistics, New Orleans,
          <source>Louisiana (Jun</source>
          <year>2018</year>
          ). https://doi.org/10.18653/v1/
          <fpage>N18</fpage>
          -1100, https://www.aclweb.org/ anthology/N18-1100
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Perotte</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pivovarov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Natarajan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiskopf</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wood</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elhadad</surname>
          </string-name>
          , N.:
          <article-title>Diagnosis code assignment: models and evaluation metrics</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          <volume>21</volume>
          (
          <issue>2</issue>
          ),
          <volume>231</volume>
          -237 (Mar
          <year>2014</year>
          ),
          <volume>24296907</volume>
          [pmid]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>