=Paper=
{{Paper
|id=Vol-2696/paper_183
|storemode=property
|title=Convolutional Attention Models with Post-Processing Heuristics at CLEF eHealth 2020
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_183.pdf
|volume=Vol-2696
|authors=Elias Moons,Marie-Francine Moens
|dblpUrl=https://dblp.org/rec/conf/clef/MoonsM20
}}
==Convolutional Attention Models with Post-Processing Heuristics at CLEF eHealth 2020==
Convolutional Attention Models with Hierarchical Post-Processing Heuristics at CLEF eHealth 2020

Elias Moons and Marie-Francine Moens
KU Leuven, Belgium
elias.moons@cs.kuleuven.be

Abstract. In this paper, we compare state-of-the-art neural network approaches to the 2020 CLEF eHealth task 1. The presented models use the neural principles of convolution and attention to obtain their results. Furthermore, a hierarchical component is introduced as well as hierarchical post-processing heuristics. These additions successfully leverage the information that is inherently present in the ICD taxonomy.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

In this paper, we compare different neural network approaches in the context of the CLEF eHealth 2020 task 1 [2][5]. More specifically, we have submitted predictions for subtasks 1 and 2, which evaluate systems that predict diagnostic and procedural ICD codes, respectively. Diagnostic codes represent the different diagnoses and their variants. Procedural codes identify what was done to or given to a patient (medication, surgeries, etc.). Our strategy combines the principles of convolutional neural networks and attention mechanisms. Furthermore, these models are extended with a hierarchical objective, corresponding to the underlying ICD taxonomy. Lastly, hierarchical heuristics are used for post-processing the results.

The dataset consists of 1,000 clinical cases, tagged with various ICD-10 codes by health specialists. The original text fragments are in Spanish, but an automatically translated English version is also provided by the organisers. This version was used in this research, as the described models are optimized for English texts. Assessing the influence of using this translated version instead of the original Spanish texts would be an interesting addition in future work. The dataset contains a split of 500 training samples, 250 development samples and 250 test samples. In total the 1,000 documents comprise 16,504 sentences and 396,988 words, with an average of 396.2 words per clinical case. For the first subtask, these documents are trained with corresponding diagnostic ICD-code tags. For the second subtask, these same documents were trained with their procedural ICD-codes instead. The biggest hurdle while training with this dataset is its size and consequently the small number of training samples for each category present. For the diagnostic ICD-codes, for example, there are in total 1,767 different categories spread out over only 500 training documents. Every document is labeled with on average 11.3 different categories and each category is on average represented by 3.2 training examples. Only seven categories have more than 50 training examples. For the case of procedural ICD-codes, these numbers are slightly lower, with 563 different categories, 3.1 categories per example and only 2.7 training examples for each category, leading to a very similar distribution. Figure 1 gives a sorted view of all categories present in the diagnostic training dataset (left) as well as the procedural training dataset (right) and the amount of examples tagged with that specific category.

Fig. 1. Category frequencies of the CodiEsp training dataset (diagnostic on the left, procedural on the right); the x-axis lists the categories, the y-axis their frequency in the training set.
In this paper we hypothesize that exploiting the knowledge of the hierarchical label taxonomy of ICD-10 helps the performance of automated coding when only a limited number of manually coded training examples is available.

The remainder of this paper is organized as follows. In section 2, related work relevant for the conducted research is discussed. The evaluated deep learning methods are described in section 3. These methods are evaluated on the benchmark CodiEsp ICD-10 dataset and all findings are reported in section 4. The most important findings are recapped in section 5.

2 Related Work

The most prominent and more recent advancements in categorizing medical reports with standard codes are shortly described in this section. In [7] a hierarchical support vector machine (SVM) is shown to outperform a flat SVM. Results were reported based on F-measure scores on the MIMIC-II dataset. [3] show that datasets of different sizes and different numbers of distinct codes demand different training mechanisms. For small datasets, feature and data selection methods serve better. The authors evaluated ICD coding performance on a dataset consisting of more than 70,000 textual EMRs (Electronic Medical Records) from the University of Kentucky (UKY) Medical Center, tagged with ICD-9 codes. A deep learning model that encompasses an attention mechanism is tested by [9] on the MIMIC-III dataset. LSTMs are used for both character- and word-level representations. A soft attention layer helps in making predictions for the top 50 most frequent ICD-9 codes in the dataset. More recently, [1] introduced the Hierarchical Attention bidirectional Gated Recurrent Unit model (HA-GRU). By identifying relevant sentences for each label, documents are tagged with corresponding ICD-9 codes. Results are reported on both the MIMIC-II and MIMIC-III datasets. [6] present the Convolutional Attention for Multi-Label classification (CAML) model, which combines the strengths of convolutional networks and attention mechanisms. They propose adding regularization based on the long descriptions of the target ICD codes, especially to improve classification results on less represented categories in the dataset. This approach is further extended with the idea of multiple convolutional channels by [8], with max pooling across all channels. The authors also shift the regularization from the last prediction layer, as in [6], to the attention layer. [6] and [8] achieve state-of-the-art results for ICD-9 coding on the MIMIC-III dataset. As an addition to these models, in this paper a hierarchical variant of each of them is constructed and evaluated.

Furthermore, if the target output space of categories follows a hierarchy of labels - as is also the case in ICD coding - the trained models can efficiently use this hierarchy for category assignment [7][10][4]. During categorization the models apply a top-down or a bottom-up approach at the classification stage. In a top-down approach parent categories are assigned first and only children of assigned parents are considered as category candidates. In a bottom-up approach only leaf nodes in the hierarchy are assigned, which entails that their parent nodes are assigned as well. The hierarchical structure of a tree leads to various parent-child relations between its categories.
For the models discussed in this paper, a hierarchical variant will also be tested which exploits the information of the tree structure and shows that it can enhance the classification performance. Recent research shows the value of these hierarchical dependencies using hierarchical attention mechanisms [1] and hierarchical penalties [11], which are also integrated in this paper.

3 Methods

In this section, we explain the models used for ICD code prediction. First, the preprocessing step is shortly discussed. Then, two recent state-of-the-art models in the field of ICD coding are explained in detail. These models are implemented by the authors following the original papers and are called DR-CAML [6] and MVC-(R)LDA [8], respectively. We discuss in detail the attention mechanisms and loss functions of these models. Afterwards, as a way of handling the hierarchical dependencies of the ICD-codes, we propose various ways of integrating them in all models. This is based on advancements in hierarchical classification as inspired by [11]. Lastly, heuristics are described for post-processing of the predictions given by the models. This leads in section 4 to a clear comparison between all tested models among themselves as well as with their novel hierarchical variants and the introduced post-processing.

3.1 Preprocessing

The preprocessing follows the standard procedure described in [6], i.e., tokens that contain no alphabetic characters are removed and all tokens are lowercased. Furthermore, tokens that appear in fewer than three training documents are replaced with the 'UNK' token. All documents are then truncated to a maximum length of 2500 tokens. All discussed models take for each document i a sequence of word vectors x_i as input representation and produce as output a set of ICD-codes y_i. A minimal sketch of this preprocessing pipeline is given below.
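The following Python sketch illustrates the preprocessing described above under stated assumptions: the whitespace tokenizer, the helper names and the exact filtering logic are ours, while the lowercasing, the removal of non-alphabetic tokens, the three-document vocabulary threshold and the 2500-token cut-off follow the paper.

```python
import re
from collections import Counter

MAX_LEN = 2500       # truncation length described in section 3.1
MIN_DOC_FREQ = 3     # tokens in fewer than 3 training documents become 'UNK'

def tokenize(text):
    # lowercase, split on whitespace, drop tokens without any alphabetic character
    tokens = re.findall(r"\S+", text.lower())
    return [t for t in tokens if any(c.isalpha() for c in t)]

def build_vocab(training_docs):
    # document frequency per token over the training set only
    doc_freq = Counter()
    for doc in training_docs:
        doc_freq.update(set(tokenize(doc)))
    return {t for t, df in doc_freq.items() if df >= MIN_DOC_FREQ}

def preprocess(doc, vocab):
    tokens = [t if t in vocab else "UNK" for t in tokenize(doc)]
    return tokens[:MAX_LEN]

# usage: vocab = build_vocab(train_texts); x_i = preprocess(text_i, vocab)
```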
3.2 Convolutional models

This subsection describes the details of the recent state-of-the-art models presented in [6] and [8] in the way they are used for the experiments in section 4.

DR-CAML DR-CAML is a CNN-based model adopted for ICD coding [6]. When an ICD code is defined by the WHO, it is accompanied by a label definition expressed in natural language, which can be used to guide the model towards learning appropriate parameter values. For this purpose the model employs a per-label attention mechanism, enabling it to learn distinct document representations for each label. It has been shown that this approach is advantageous for labels for which very few training instances are available. The idea is that the description of a target code is itself a very good training example for the corresponding code. Similarity between the representation of a given test sample and the representation of the description of a target code gives extra confidence in assigning this label. In general, after the convolutional layer, DR-CAML employs a per-label attention mechanism to attend to the relevant parts of the text for each predicted label. An additional advantage is that the per-label attention mechanism provides the model with the ability to explain why it decided to assign each code, by showing the spans of text relevant for the ICD code.

MVC-(R)LDA Both MVC-LDA and MVC-RLDA can be seen as extensions of DR-CAML. Similar to that model, they are based on a CNN architecture with a label attention mechanism that considers ICD coding as a multi-task binary classification problem. The added functionality lies in the use of parallel CNNs with different kernel sizes to capture information of different granularity. In general, these multi-view CNNs are constructed with four CNNs that have the same number of filters but different kernel sizes. This convolutional layer is followed by a max-pooling function across all channels to select the most relevant span of text for each filter.

Loss function The loss functions used to train DR-CAML and the multi-view models MVC-(R)LDA are calculated in the same way. The general loss function is the binary cross-entropy loss loss_BCE. This loss is extended with a regularization term on the long description vectors of the target categories. Given N different training examples x_i, the prediction ŷ_l and the max-pooled vector z_l are computed from the description of code l out of all L target codes. In the following formulas, β_l is a vector of prediction weights and v_l the vector representation for code l. Assuming n_y is the number of true labels in the training data, the final loss is computed by adding regularization to the base loss function:

\hat{y}_l = \sigma(\beta_l^T v_l + b_l)    (1)

loss_{BCE}(X) = - \sum_{i=1}^{N} \sum_{l=1}^{L} \left[ y_l \log(\hat{y}_l) + (1 - y_l) \log(1 - \hat{y}_l) \right]    (2)

loss_{Model}(X) = loss_{BCE} + \frac{\lambda}{n_y} \sum_{i=1}^{N} \sum_{l=1}^{L} \| z_l - \beta_l \|_2    (3)

A simplified code sketch of the per-label attention and this description-regularized loss follows.
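The PyTorch sketch below is a simplified, single-view approximation of the per-label attention and the description-regularized loss of equations (1)-(3); the module name, the tanh activation, the tensor shapes and the use of a single convolution instead of the four parallel views of MVC-(R)LDA are our own simplifications, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerLabelAttentionCNN(nn.Module):
    """Simplified CAML-style model: convolution, per-label attention,
    and a per-label linear prediction layer (equation 1)."""
    def __init__(self, vocab_size, emb_dim, num_filters, kernel_size, num_labels):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size, padding=kernel_size // 2)
        self.U = nn.Linear(num_filters, num_labels, bias=False)   # per-label attention weights
        self.beta = nn.Linear(num_filters, num_labels)            # prediction weights beta_l, bias b_l

    def forward(self, x):                                          # x: (batch, seq_len) token ids
        h = torch.tanh(self.conv(self.embed(x).transpose(1, 2)))   # (batch, filters, seq_len)
        alpha = F.softmax(self.U.weight @ h, dim=2)                # (batch, labels, seq_len) attention
        v = alpha @ h.transpose(1, 2)                              # (batch, labels, filters): v_l per label
        logits = self.beta.weight.mul(v).sum(dim=2) + self.beta.bias
        return torch.sigmoid(logits), v                            # y_hat (equation 1), attended reps

def description_regularized_loss(y_hat, y_true, beta, z, lam, n_y):
    """Equation (2) plus the description regularization of equation (3).
    z is assumed to hold one max-pooled description vector per label."""
    bce = F.binary_cross_entropy(y_hat, y_true, reduction="sum")
    reg = (z - beta).norm(dim=-1).sum()        # sum_l ||z_l - beta_l||_2
    return bce + lam * reg / n_y
```

In a training loop one would call `y_hat, v = model(x)` and then `description_regularized_loss(y_hat, y, model.beta.weight, z, lam, n_y)`, where z would be obtained by running the code descriptions through the same convolution and max-pooling the result; that step is omitted here for brevity.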
3.3 Modelling hierarchical dependencies

In this section we investigate the modelling of hierarchical dependencies as extensions of the models described above. A first part integrates the hierarchical dependencies directly into the structure of the model. This leads to hierarchical models, which are layered variants of the already discussed approaches. The second way hierarchical dependencies are explicitly introduced into the model is via a hierarchical loss function that penalizes hierarchical inconsistencies across the model's prediction layer.

Hierarchical models Hierarchical relationships can be shaped directly into the architecture of any of the models described above. The ICD-10 taxonomy can be modeled as a tree with a general ICD root and 4 levels of depth. On the highest level, codes have 1 character; the next 2 levels represent categories with respectively 3 and 4 characters. The rest of the codes are combined in the last layer. This leads to a hierarchical variant of any of the models. In this variant, not 1 but 4 identical models will be trained, one for each of the different layers in the ICD hierarchy (corresponding to the length of the codes). An overview of the approach is given in figure 2. The input for each layer is partially dependent on an intermediary representation from the previous layer as well as the original input, through concatenation of both. Layers are stacked from most to least specific, or from leaf to root node in the taxonomy. Models corresponding to different layers will then rely on different features, or characteristics, to classify the input vectors. This way the deepest, most advanced representations can be used for classifying the most abstract and broad categories. On the other hand, for the most specific categories, word-level features can directly be used to make detailed decisions between classes that are very similar.

Fig. 2. Overview of the hierarchical variant of a model, inspired by [11].

Hierarchical loss function To capture the hierarchical relationships in a given model, the loss function of the above models can be extended with an additional term. This leads to the definition of a hierarchical loss function (loss_H). This loss function penalizes classifications that contradict the inherent ICD hierarchy. More specifically, when a parent category is not predicted to be true, none of its child categories should be predicted to be true. The hierarchical loss between a child and its parent in the tree is then defined as the difference between their computed probability scores, with 0 as a lower bound. More formally, the entire loss function loss_HModel for a category X of a given layer, combining the regular training loss loss_Model described above and the hierarchical loss loss_H, is calculated as follows:

P(X) = Probability(X == True)    (4)

Par(X) = Probability(Parent(X) == True)    (5)

L(X) = true label of X (0 or 1)    (6)

loss_H(X) = Clip(P(X) - Par(X), 0, 1)    (7)

loss_{HModel}(X) = (1 - \lambda) loss_{Model}(X) + \lambda loss_H(X)    (8)

which leaves a parameter λ to optimize the loss function.¹ A minimal code sketch of this hierarchical penalty is given below.

¹ Parameter λ is optimized over the training set.
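The hierarchical penalty of equations (4)-(8) can be sketched as follows; this is a minimal illustration assuming the per-label probabilities are given as a tensor and that a parent index is available for every label. The parent_idx lookup, the function names and the mean aggregation over labels are our assumptions, not the authors' exact implementation.

```python
import torch

def hierarchical_penalty(probs, parent_idx):
    """probs: (batch, num_labels) sigmoid outputs P(X);
    parent_idx: LongTensor (num_labels,) mapping each label to its parent
    (a label without a parent points to itself, giving a zero penalty).
    Implements loss_H(X) = Clip(P(X) - P(Parent(X)), 0, 1), equation (7),
    averaged over labels and documents."""
    parent_probs = probs[:, parent_idx]                   # P(Parent(X)) per label
    return torch.clamp(probs - parent_probs, 0.0, 1.0).mean()

def combined_loss(model_loss, probs, parent_idx, lam):
    # equation (8): (1 - lambda) * loss_Model + lambda * loss_H
    return (1 - lam) * model_loss + lam * hierarchical_penalty(probs, parent_idx)
```

The parent_idx mapping would be built from the ICD-10 tree described above, e.g. a 4-character code pointing to the index of its 3-character prefix.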
3.4 Hierarchical post-processing

As a final step in the classification process, a heuristic post-processing is applied to some of the submitted models. All considered heuristics are explained below. They all rely on the distance between any pair of target categories in the ICD-10 taxonomy and reweigh the prediction values accordingly. The heuristics are numbered H1 to H7 for efficient referencing in the result section. A code sketch of this reweighting scheme is given at the end of this subsection.

Node distance (H1) Given the L prediction values y_1, ..., y_L made for a document by any given model, the new prediction value y_i^{post1} for category i can be calculated as follows:

y_i^{post1} = \sum_{j=1}^{L} \frac{y_j}{1 + dist(i, j)}    (9)

The newly calculated prediction values are the result of a weighted sum of all previously calculated prediction values, taking into account the relative distances of all target categories in the ICD taxonomy. In general, dist(i, j) gives the distance between categories i and j in the ICD tree, e.g., the distance between a parent and its child is 1, the distance between two siblings is 2 and the distance of an element to itself is 0.

Node distance from child to ancestor (H2) This heuristic functions in the same way as the heuristic described above, but differs in behavior when the lowest common ancestor (LCA) of categories i and j is not j itself: y_j is only added to the new score of category i if j is an ancestor of i. This can be formally described as follows:

y_i^{post2} = \sum_{j=1}^{L} dist_{a,c}(i, j) \cdot y_j    (10)

dist_{a,c}(i, j) = \frac{1}{1 + dist(i, j)} if ancestor(i, j) == True, and 0 otherwise    (11)

Node distance from ancestor to child (H3) This heuristic functions analogously to heuristic H2 but in the opposite direction: y_j is only added to the new score of category i if i is an ancestor of j. This gives:

y_i^{post3} = \sum_{j=1}^{L} dist_{c,a}(i, j) \cdot y_j    (12)

dist_{c,a}(i, j) = \frac{1}{1 + dist(i, j)} if ancestor(j, i) == True, and 0 otherwise    (13)

Node distance between ancestors and children (H4) Heuristic H4 combines the ideas presented in the previous two heuristics, only adding y_j when either i is an ancestor of j or j is an ancestor of i. Using equations 11 and 13, this evaluates to:

y_i^{post4} = \sum_{j=1}^{L} (dist_{a,c}(i, j) + dist_{c,a}(i, j)) \cdot y_j    (14)

Squared node distance (H5) This heuristic functions as heuristic H1 but squares the value of its distance function. As a result, it gives relatively more weight to predictions made for categories that are closer to the observed category in comparison to H1. This leads to the following relationship:

y_i^{post5} = \sum_{j=1}^{L} \frac{y_j}{1 + dist(i, j)^2}    (15)

Squared node prediction values (H6) Heuristic H6 differs from the first heuristic in that it rescales the starting prediction values y_j. Instead of using the calculated values, it uses the squares of these values, making discrepancies in prediction values relatively more prominent. The resulting values can be calculated via:

y_i^{post6} = \sum_{j=1}^{L} \frac{y_j^2}{1 + dist(i, j)}    (16)

Squared node distances and prediction values (H7) This heuristic combines the ideas that comprise heuristics H5 and H6, leading to the following relationship:

y_i^{post7} = \sum_{j=1}^{L} \frac{y_j^2}{1 + dist(i, j)^2}    (17)
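As a sketch of how heuristics H1-H7 reweigh the raw predictions, the following NumPy snippet implements the weighted sums above, assuming a precomputed pairwise tree-distance matrix and an ancestor matrix over the target categories; the construction of those matrices from the ICD-10 tree, the function name and the keyword arguments are ours, not part of the original system.

```python
import numpy as np

def apply_heuristic(y, dist, ancestor=None, square_dist=False, square_pred=False,
                    direction=None):
    """y: (L,) raw prediction values for one document;
    dist: (L, L) pairwise tree distances between target categories;
    ancestor: optional (L, L) boolean matrix, ancestor[i, j] True if j is an
    ancestor of i (needed for H2-H4);
    direction: None, 'child_to_ancestor' (H2), 'ancestor_to_child' (H3) or 'both' (H4).
    Returns the reweighed prediction values y^post."""
    w = 1.0 / (1.0 + (dist ** 2 if square_dist else dist))   # (L, L) distance weights
    if direction == "child_to_ancestor":                     # H2: keep only ancestors j of i
        w = w * ancestor
    elif direction == "ancestor_to_child":                   # H3: keep only descendants j of i
        w = w * ancestor.T
    elif direction == "both":                                # H4: either direction
        w = w * (ancestor | ancestor.T)
    p = y ** 2 if square_pred else y                         # H6/H7 square the predictions
    return w @ p                                             # weighted sum over all categories j

# examples (assuming y, dist, ancestor are available):
# H1: apply_heuristic(y, dist)
# H5: apply_heuristic(y, dist, square_dist=True)
# H7: apply_heuristic(y, dist, square_dist=True, square_pred=True)
```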
4 Results

For both the subtasks of predicting diagnostic and procedural codes, 5 different models were trained; this was the maximum number allowed in the competition. Since the size of the dataset was a problem during training, the authors chose to only train models for the top-50 most represented categories in the training dataset. During training of the hierarchical models, ancestors of the top-50 categories were added as well, but only the performance on the original 50 categories was taken into account for calculating the result metrics. A selection of models was chosen aiming for as much variety as possible, to be able to assess the influence of both proposed models (CAML and MVC-RLDA), the hierarchical objective and post-processing using a heuristic. The chosen models are summarized below and are the same for both subtasks:

1. CAML
2. CAML + hierarchical objective
3. MVC-RLDA + hierarchical objective
4. CAML + hierarchical post-processing H1
5. MVC-RLDA + hierarchical objective + hierarchical post-processing H1

First, one baseline without use of the hierarchy and heuristics was chosen. Since CAML got slightly better results than MVC-RLDA on the development set, this model was selected. Second, to assess the influence the hierarchy can have on the classification results, both CAML and MVC-RLDA models were trained with a hierarchical objective. The last 2 models were chosen with the post-processing heuristic in mind. Only heuristic H1 was chosen for this (based on higher performance on the development set), once in a setting without hierarchical objective (with CAML) and once with the hierarchical objective (and MVC-RLDA).

Since the models used in this paper had a lot of difficulties with the small number of training examples, the prediction probabilities of all categories were rather close together (often in the range of 0.3 to 0.5 instead of from 0.0 to 1.0). For this reason, the prediction files were generated using the top-5 highest predicted categories instead of a fixed cut-off point. This is not optimal for obtaining a high MAP, for which it is better to submit more categories, at the cost of lower values for the other performance metrics. The results obtained by these prediction files are visible in tables 1 and 2 for the diagnostic and procedural subtasks respectively.

Table 1. Results on the diagnostic codes subtask.

                              MAP    Precision  Recall  F1
CAML                          0.011  0.066      0.029   0.041
CAML + Hier.                  0.015  0.073      0.032   0.044
MVC-RLDA + Hier.              0.006  0.040      0.018   0.024
CAML + H1                     0.044  0.124      0.055   0.076
MVC-RLDA + Hier. + H1         0.002  0.013      0.006   0.008

For the case of diagnostic codes, visible in table 1, the best performance is achieved by the CAML model in combination with heuristic post-processing H1. Adding the heuristic to CAML leads to a clear improvement in classification quality. Comparing CAML with CAML + Hier. leads to the conclusion that the hierarchical objective can also lead to an improvement, but it is less prominent than the one obtained with the post-processing heuristic. Furthermore, it is clear that the MVC-RLDA model is outperformed by CAML. This is most likely due to the fact that the former model contains more trainable parameters than CAML, while only a small number of training examples is available.

Table 2. Results on the procedural codes subtask.

                                     MAP    Precision  Recall  F1
CAML                                 0.007  0.015      0.010   0.012
CAML + Hier.                         0.020  0.046      0.020   0.028
MVC-RLDA + Hier.                     nan    nan        0.0     nan
CAML + Heuristic                     0.017  0.051      0.034   0.041
MVC-RLDA + Hier. + Heuristic         nan    nan        0.0     nan

For the case of procedural codes, visible in table 2, the best results are obtained by a combination of CAML with a hierarchical objective. This is closely followed by CAML with a post-processing heuristic. Both techniques improve the classification scores significantly, but the overall scores are lower than for the task of classifying diagnostic codes. Lastly, both MVC-RLDA models predicted invalid codes for all documents in the test set, not being able to learn significant relations present in the data.

As an extra experiment to assess the performance of the described heuristics, a CAML model was post-processed with the 7 different heuristics. In this case, not only the top-5 categories were retained, but all top-50 categories were sorted by confidence. The resulting prediction files were then evaluated with the evaluation script provided by the competition and the results are reported in table 3.

Table 3. Comparison of all post-processing heuristics.

            Diagnostic (MAP)  Procedural (MAP)
CAML        0.042             0.052
CAML + H1   0.075             0.060
CAML + H2   0.042             0.052
CAML + H3   0.050             0.052
CAML + H4   0.050             0.052
CAML + H5   0.053             0.058
CAML + H6   0.054             0.052
CAML + H7   0.047             0.052

For both the subtasks of classifying diagnostic and procedural codes, the use of heuristic H1 is the clear winner. It is worth noting that in no case did the results of the baseline get worse because of the use of a post-processing heuristic. Furthermore, in most cases the heuristics led to an improvement of the results, strengthening the claim that post-processing heuristics based on the ICD-10 taxonomy can be a valuable tool. Next to H1, the best performing heuristic is H5, which squares the distances between nodes in the classification tree. Since all heuristics that try to give more weight to nodes closer to the observed node underperform with respect to H1, it might be interesting to see whether the opposite can further improve the classification process.

5 Conclusion

In this paper we trained 5 models for participation in 2 subtasks of the 2020 CLEF eHealth task 1. For both subtasks, experiments were conducted, yielding interesting results. The hierarchical component as well as the use of post-processing heuristics proved their value in this setting. The use of a multi-view neural network led to an abundance of trainable parameters, which ultimately made the model unable to efficiently generalize over the training samples. An extra experiment was conducted to assess the influence of the presented post-processing heuristics. This led to the conclusion that these heuristics can be a powerful tool for the classification of ICD codes.

References

1. Baumel, T., Nassour-Kassis, J., Elhadad, M., Elhadad, N.: Multi-label classification of patient notes: a case study on ICD code assignment (2018)
2. Goeuriot, L., Suominen, H., Kelly, L., Miranda-Escalada, A., Krallinger, M., Liu, Z., Pasi, G., Saez Gonzales, G., Viviani, M., Xu, C.: Overview of the CLEF eHealth evaluation lab 2020. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). LNCS, vol. 12260 (2020)
3. Kavuluru, R., Rios, A., Lu, Y.: An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records. Artificial Intelligence in Medicine 65(2), 155–166 (2015)
4. Kowsari, K., Brown, D.E., Heidarysafa, M., Meimandi, K.J., Gerber, M.S., Barnes, L.E.: HDLTex: Hierarchical deep learning for text classification. In: 2017 16th IEEE International Conference on Machine Learning and Applications (Dec 2017)
5. Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of CLEF eHealth 2020. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2020)
6. Mullenbach, J., Wiegreffe, S., Duke, J., Sun, J., Eisenstein, J.: Explainable prediction of medical codes from clinical text. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pp. 1101–1111. Association for Computational Linguistics, New Orleans, Louisiana (Jun 2018). https://doi.org/10.18653/v1/N18-1100, https://www.aclweb.org/anthology/N18-1100
7. Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F., Elhadad, N.: Diagnosis code assignment: models and evaluation metrics. J Am Med Inform Assoc 21(2), 231–237 (Mar 2014), 24296907[pmid]
8. Sadoughi, N., Finley, G.P., Fone, J., Murali, V., Korenevski, M., Baryshnikov, S., Axtmann, N., Miller, M., Suendermann-Oeft, D.: Medical code prediction with multi-view convolution and description-regularized label-dependent attention. arXiv preprint arXiv:1811.01468 (2018)
9. Shi, H., Xie, P., Hu, Z., Zhang, M., Xing, E.P.: Towards automated ICD coding using deep learning. arXiv preprint arXiv:1711.04075 (2017)
10. Silla, C.N., Freitas, A.A.: A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22(1), 31–72 (Jan 2011)
11. Wehrmann, J., Cerri, R., Barros, R.: Hierarchical multi-label classification networks. In: Dy, J., Krause, A. (eds.) ICML. Proceedings of Machine Learning Research, vol. 80, pp. 5075–5084. PMLR, Stockholmsmässan, Stockholm, Sweden (10–15 Jul 2018)