<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Convolutional Attention Models with Hierarchical Post-Processing Heuristics at CLEF eHealth 2020</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elias Moons</string-name>
          <email>elias.moons@cs.kuleuven.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marie-Francine Moens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KU Leuven</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we compare state-of-the-art neural network approaches to the 2020 CLEF eHealth task 1. The presented models use the neural principles of convolution and attention to obtain their results. Furthermore, a hierarchical component is introduced as well as hierarchical post-processing heuristics. These additions successfully leverage the information that is inherently present in the ICD taxonomy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>A major challenge of this task lies in the small dataset size and, consequently, the small number of training samples for each category. For the diagnostic ICD codes, for example, there are in total 1,767 different categories spread out over only 500 training documents. Every document is labeled with on average 11.3 different categories and each category is on average represented by 3.2 training examples. Only seven categories have more than 50 training examples. For the procedural ICD codes, these numbers are slightly lower, with 563 different categories, 3.1 categories per example and only 2.7 training examples per category, leading to a very similar distribution. Figure 1 gives a sorted view of all categories present in the diagnostic training dataset (left) as well as the procedural training dataset (right) and the number of examples tagged with each specific category.</p>
<p>[Figure 1: frequency in the training set per category (sorted), for the diagnostic (left) and procedural (right) subtasks.]</p>
<p>In this paper we hypothesize that exploiting the knowledge of the hierarchical label taxonomy of ICD-10 helps the performance of automated coding when only a limited number of manually coded training examples is available.</p>
<p>The remainder of this paper is organized as follows. In section 2, related work relevant to the conducted research is discussed. The evaluated deep learning methods are described in section 3. These methods are evaluated on the benchmark CodiEsp ICD-10 dataset and all findings are reported in section 4. The most important findings are recapped in section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
<p>The most prominent recent advancements in categorizing medical reports with standard codes are briefly described in this section.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] a hierarchical support vector machine (SVM) is shown to outperform a flat SVM. Results were reported as F-measure scores on the Mimic-II dataset. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] show that datasets of different sizes and different numbers of distinct codes demand different training mechanisms. For small datasets, feature and data selection methods serve better. The authors evaluated ICD coding performance on a dataset consisting of more than 70,000 textual EMRs (Electronic Medical Records) from the University of Kentucky (UKY) Medical Center tagged with ICD-9 codes.
      </p>
      <p>A deep learning model that encompasses an attention mechanism is tested
by [9] on the Mimic-III dataset. LSTMs are used for both character and word
level representations. A soft attention layer here helps in making predictions for
the top 50 most frequent ICD-9 codes in the dataset.</p>
      <p>
        More recently, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] introduced the Hierarchical Attention bidirectional Gated Recurrent Unit model (HA-GRU). By identifying relevant sentences for each label, documents are tagged with corresponding ICD-9 codes. Results are reported on both the Mimic-II and Mimic-III datasets. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] presents the Convolutional Attention for Multi-Label classification (CAML) model, which combines the strengths of convolutional networks and attention mechanisms. They propose adding regularization on the long descriptions of the target ICD codes, especially to improve classification results on less represented categories in the dataset. This approach is further extended with the idea of multiple convolutional channels by [8], with max pooling across all channels. The authors also shift the attention from the last prediction layer, as in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], to the attention layer. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [8] achieve state-of-the-art results for ICD-9 coding on the MIMIC-III dataset. As an addition to these models, in this paper a hierarchical variant of each of them is constructed and evaluated. Furthermore, if the target output space of categories follows a hierarchy of labels, as is also the case in ICD coding, the trained models can efficiently use this hierarchy for category assignment [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][10][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. During categorization the models apply a top-down or a bottom-up approach at the classification stage. In a top-down approach parent categories are assigned first and only children of assigned parents are considered as category candidates. In a bottom-up approach only leaf nodes in the hierarchy are assigned, which entails that their parent nodes are assigned as well. The hierarchical structure of a tree leads to various parent-child relations between its categories. For the models discussed in this paper, a hierarchical variant is also tested which exploits the information of the tree structure and shows that it can enhance the classification performance. Recent research shows the value of these hierarchical dependencies using hierarchical attention mechanisms [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and hierarchical penalties [11], which are also integrated in this paper.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>
        In this section, we explain the models used for ICD code prediction. First, the preprocessing step is briefly discussed. Then, two recent state-of-the-art models in the field of ICD coding are explained in detail. These models are implemented by the authors following the original papers and are called DR-CAML [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and MVC-(R)LDA [8], respectively. We discuss in detail the attention mechanisms and loss functions of these models. Afterwards, as a way of handling the hierarchical dependencies of the ICD codes, we propose various ways of integrating them into all models, based on advancements in hierarchical classification as inspired by [11]. Lastly, heuristics are described for post-processing of the predictions given by the models. This leads in section 4 to a clear comparison between all tested models among themselves as well as with their novel hierarchical variants and the introduced post-processing.
      </p>
      <sec id="sec-3-1">
        <title>Preprocessing</title>
        <p>
          The preprocessing follows the standard procedure described in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], i.e., tokens that contain no alphabetic characters are removed and all tokens are lowercased. Furthermore, tokens that appear in fewer than three training documents are replaced with the 'UNK' token. All documents are then truncated to a maximum length of 2,500 tokens.
        </p>
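<p>The preprocessing steps above can be sketched as follows. This is a minimal illustration; the function name and parameters are ours, not the authors' code:</p>

```python
import re
from collections import Counter

def preprocess(docs, min_df=3, max_len=2500):
    """Sketch of the described preprocessing: drop tokens without
    alphabetic characters, lowercase, replace tokens occurring in
    fewer than min_df documents with 'UNK', truncate to max_len."""
    tokenized = []
    for text in docs:
        tokens = [t.lower() for t in text.split() if re.search(r"[a-zA-Z]", t)]
        tokenized.append(tokens)
    # document frequency of each (lowercased) token
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    out = []
    for tokens in tokenized:
        kept = [t if df[t] >= min_df else "UNK" for t in tokens]
        out.append(kept[:max_len])
    return out
```

<p>For example, a token seen in only one document is mapped to 'UNK' while a token seen in at least three documents is kept as-is.</p>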
        <p>All discussed models take for each document i as input a sequence of word vectors x_i as its representation, and produce as output a set of ICD codes y_i.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Convolutional models</title>
        <p>
          This subsection describes the details of the recent state-of-the-art models presented in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and [8] in the way they are used for the experiments in section 4.
        </p>
        <p>
          DR-CAML DR-CAML is a CNN-based model adapted for ICD coding [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. When an ICD code is defined by the WHO, it is accompanied by a label definition expressed in natural language, which can guide the model towards learning appropriate parameter values. For this purpose the model employs a per-label attention mechanism, enabling it to learn distinct document representations for each label. It has been shown that this approach is advantageous for labels with very few available training instances. The idea is that the description of a target code is itself a very good training example for the corresponding code. Similarity between the representation of a given test sample and the representation of the description of a target code gives extra confidence in assigning this label.
        </p>
        <p>In general, after the convolutional layer, DR-CAML employs a per-label attention mechanism to attend to the relevant parts of the text for each predicted label. An additional advantage is that the per-label attention mechanism provides the model with the ability to explain why it decided to assign each code, by showing the spans of text relevant to the ICD code.</p>
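<p>A minimal numpy sketch of per-label attention as described above. All names are illustrative simplifications of the CAML-style architecture, and the convolutional layer is assumed to have already produced the position-wise features H:</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def per_label_attention(H, U):
    """Per-label attention (sketch).
    H: (n, d) convolution output, one d-dim vector per token position.
    U: (L, d) one learned attention query vector per label.
    Returns V: (L, d), a label-specific document representation."""
    V = np.zeros((U.shape[0], H.shape[1]))
    for l in range(U.shape[0]):
        alpha = softmax(H @ U[l])  # attention weights over positions for label l
        V[l] = alpha @ H           # weighted sum of position vectors
    return V
```

<p>Each label l thus gets its own weighted combination of the token positions, which is what allows the model to point at the span of text supporting each predicted code.</p>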
      </sec>
      <sec id="sec-3-3">
        <title>MVC-(R)LDA</title>
        <p>Both MVC-LDA and MVC-RLDA can be seen as extensions of DR-CAML. Similar to that model, they are based on a CNN architecture with a label attention mechanism that treats ICD coding as a multi-task binary classification problem. The added functionality lies in the use of parallel CNNs with different kernel sizes to capture information of different granularity.</p>
        <p>In general, these multi-view CNNs are constructed with four CNNs that have the same number of filters but different kernel sizes. This convolutional layer is followed by a max-pooling function across all channels to select the most relevant span of text for each filter.</p>
        <p>Loss function The loss functions used to train DR-CAML and the multi-view models MVC-(R)LDA are calculated in the same way. The general loss function is the binary cross-entropy loss lossBCE. This loss is extended by regularization on the long description vectors of the target categories.</p>
        <p>Given N different training examples x_i, the values of ŷ_l and the max-pooled vector z_l can be calculated from the description of code l out of all L target codes. In the following formulas β_l is a vector of prediction weights and v_l the vector representation for code l. Assuming n_y is the number of true labels in the training data, the final loss is computed by adding regularization to the base loss function as:</p>
        <p>ŷ_l = σ(β_lᵀ v_l + b_l)   (1)</p>
        <p>lossBCE(X) = −Σ_{i=1..N} Σ_{l=1..L} [ y_l log(ŷ_l) + (1 − y_l) log(1 − ŷ_l) ]   (2)</p>
        <p>lossModel(X) = lossBCE + (λ/n_y) Σ_{i=1..N} Σ_{l=1..L} ‖z_l − β_l‖₂   (3)</p>
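<p>A simplified, single-document sketch of this objective: binary cross-entropy over all L labels plus a regularizer pulling the prediction weights toward the max-pooled description vectors. The name model_loss and the value of lam are illustrative:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def model_loss(y_true, logits, Z, B, lam=0.01):
    """Sketch of the combined objective for one document.
    y_true: (L,) binary label vector; logits: (L,) raw scores;
    Z: (L, d) max-pooled description vectors; B: (L, d) prediction
    weight vectors. lam is the regularization weight (assumed)."""
    y_hat = sigmoid(logits)
    bce = -np.sum(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))
    n_y = max(y_true.sum(), 1)  # number of true labels
    reg = np.linalg.norm(Z - B, axis=1).sum() / n_y
    return bce + lam * reg
```

<p>When the prediction weights B coincide with the description vectors Z, the regularizer vanishes and only the cross-entropy term remains.</p>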
      </sec>
      <sec id="sec-3-4">
        <title>Modelling hierarchical dependencies</title>
        <p>In this section we investigate the modelling of hierarchical dependencies as extensions of the models described above. A first part integrates the hierarchical dependencies directly into the structure of the model. This leads to hierarchical models, which are layered variants of the approaches already discussed. The second way hierarchical dependencies are explicitly introduced into the model is via the use of a hierarchical loss function that penalizes hierarchical inconsistencies in the model's prediction layer.</p>
        <p>Hierarchical models Hierarchical relationships can be shaped directly into the architecture of any of the models described above. The ICD-10 taxonomy can be modeled as a tree with a general ICD root and 4 levels of depth. On the highest level, codes have 1 character; the next 2 levels represent categories with respectively 3 and 4 characters. The rest of the codes are combined in the last layer. This leads to a hierarchical variant of any of the models. In this variant, not 1 but 4 identical models will be trained, one for each of the different layers in the ICD hierarchy (corresponding to the length of the codes).</p>
        <p>An overview of the approach is given in figure 2. The input for each layer is partially dependent on an intermediary representation from the previous layer as well as the original input, through concatenation of both. Layers are stacked from most to least specific, or from leaf to root node in the taxonomy. Models corresponding to different layers will then rely on different features, or characteristics, to classify the input vectors. This way the deepest, most advanced representations can be used for classifying the most abstract and broad categories. On the other hand, for the most specific categories, word-level features can directly be used to make detailed decisions between classes that are very similar.</p>
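<p>The wiring of this layered variant can be illustrated schematically. The per-layer model below is only a stand-in (a random nonlinear map playing the role of a trained CAML/MVC model); the point is that every layer receives the original input concatenated with the previous layer's representation:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_model(x, out_dim):
    """Stand-in for one trained per-layer classifier (illustrative only)."""
    W = rng.standard_normal((out_dim, x.shape[0]))
    return np.tanh(W @ x)

def hierarchical_forward(x, dims=(32, 16, 8, 4)):
    """Sketch of the 4-layer variant: layers stacked from most to
    least specific; each layer sees [original input, previous
    intermediate representation]. dims are illustrative sizes."""
    h = np.zeros(0)
    outputs = []
    for d in dims:
        inp = np.concatenate([x, h])  # original input + previous representation
        h = layer_model(inp, d)
        outputs.append(h)
    return outputs
```
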
        <p>Hierarchical loss function To capture the hierarchical relationships in a given model, the loss function of the above models can be extended with an additional term. This leads to the definition of a hierarchical loss function (lossH). This loss function penalizes classifications that contradict the inherent ICD hierarchy. More specifically, when a parent category is not predicted to be true, none of its child categories should be predicted to be true. The hierarchical loss between a child and its parent in the tree is then defined as the difference between their computed probability scores, with 0 as a lower bound. More formally, the entire loss function lossH_Model for a category of layer X, combining the regular training loss lossModel described above and the hierarchical loss lossH, is calculated as follows:</p>
        <p>P(X) = Probability(X == True)   (4)</p>
        <p>Par(X) = Probability(Parent(X) == True)   (5)</p>
        <p>L(X) = true label of X (0 or 1)   (6)</p>
        <p>lossH(X) = Clip(P(X) − Par(X), 0, 1)   (7)</p>
        <p>lossH_Model(X) = (1 − λ) lossModel(X) + λ lossH(X)   (8)</p>
        <p>This leaves a parameter λ with which to optimize the loss function.¹</p>
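<p>A minimal sketch of this hierarchical penalty, assuming categories are indexed and each has a parent pointer (-1 for roots); the function name and data layout are ours:</p>

```python
import numpy as np

def hierarchical_loss(p, parent):
    """Penalize a child predicted more probable than its parent.
    p[i]: predicted probability of category i;
    parent[i]: index of i's parent, or -1 for a root.
    Returns the summed per-edge penalty clip(P(child) - P(parent), 0, 1)."""
    loss_h = 0.0
    for i, par in enumerate(parent):
        if par >= 0:
            loss_h += np.clip(p[i] - p[par], 0.0, 1.0)
    return loss_h

# Combined objective (lam optimized on the training set):
# (1 - lam) * loss_model + lam * hierarchical_loss(p, parent)
```

<p>A child predicted less probable than its parent contributes nothing; only hierarchy-inconsistent predictions are penalized.</p>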
      </sec>
      <sec id="sec-3-5">
        <title>Hierarchical post-processing</title>
        <p>As a final step in the classification process, a heuristic post-processing is applied to some of the submitted models. All considered heuristics are explained below. They all rely on the distance between any pair of target categories in the ICD-10 taxonomy and reweigh the prediction values accordingly. The heuristics are numbered from H1 to H7 for efficient referencing in the results section.</p>
        <p>Node distance (H1) Given all L predictions y_i made for document i by any given model, the new prediction values y_i^post1 can be calculated as follows:</p>
        <p>y_i^post1 = Σ_{j=1..L} y_j / (1 + dist(i, j))   (9)</p>
        <p>The newly calculated prediction values are the result of a weighted sum of all previously calculated prediction values, taking into account the relative distances of all target categories in the ICD taxonomy. In general, dist(i, j) gives the distance between categories i and j in the ICD tree, e.g., the distance between a parent and its child is 1, the distance between two siblings is 2, and the distance of an element to itself is 0.</p>
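<p>The distance function dist(i, j) used by all heuristics can be computed from parent pointers, for example as below. This is an illustrative sketch, not the authors' code:</p>

```python
def tree_distance(i, j, parent):
    """Number of edges between categories i and j in the ICD tree
    (parent[c] = parent index, -1 at the root): a parent and its
    child are at distance 1, two siblings at distance 2, a node
    and itself at distance 0."""
    def path_to_root(c):
        path = [c]
        while parent[c] != -1:
            c = parent[c]
            path.append(c)
        return path
    pi, pj = path_to_root(i), path_to_root(j)
    common = next(a for a in pi if a in pj)  # lowest common ancestor
    return pi.index(common) + pj.index(common)
```
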
        <p>Node distance from child to ancestor (H2) This heuristic functions in the same way as the heuristic described above but differs in behavior when the lowest common ancestor (LCA) of categories i and j is not j itself: y_j is only added to the total new score of category i if j is an ancestor of i. This can be formally described as follows:</p>
        <p>y_i^post2 = Σ_{j=1..L} dist_{a,c}(i, j) · y_j   (10)</p>
        <p>dist_{a,c}(i, j) = 1 / (1 + dist(i, j)) if ancestor(i, j) == True; 0 if ancestor(i, j) == False   (11)</p>
      </sec>
      <sec id="sec-3-6">
        <title>Node distance from ancestor to child (H3)</title>
        <p>This heuristic functions analogously to heuristic H2 but in the opposite direction: y_j is only added to the total new score of category i if i is an ancestor of j. This gives:</p>
        <p>y_i^post3 = Σ_{j=1..L} dist_{c,a}(i, j) · y_j   (12)</p>
        <p>dist_{c,a}(i, j) = 1 / (1 + dist(i, j)) if i is an ancestor of j; 0 otherwise   (13)</p>
        <p>¹ The parameter λ is optimized over the training set.</p>
      </sec>
      <sec id="sec-3-7">
        <title>Node distance between ancestors and children (H4)</title>
        <p>Heuristic H4 combines the ideas presented in the previous two heuristics, only adding y_j when either i is an ancestor of j or j is an ancestor of i. Using equations 11 and 13, this evaluates to:</p>
        <p>y_i^post4 = Σ_{j=1..L} (dist_{a,c}(i, j) + dist_{c,a}(i, j)) · y_j   (14)</p>
        <p>Squared node distance (H5) This heuristic functions as heuristic H1 but squares the value of its distance function. As a result, it gives relatively more weight to predictions made for categories that are closer to the observed category, in comparison to H1. This leads to the following relationship:</p>
        <p>y_i^post5 = Σ_{j=1..L} y_j / (1 + dist(i, j)²)   (15)</p>
        <p>Squared node prediction values (H6) Heuristic H6 differs from the first heuristic in that it rescales the starting prediction values y_i. Instead of using the calculated values it uses the squares of these values, making discrepancies in prediction values relatively more prominent. The resulting values can be calculated via:</p>
        <p>y_i^post6 = Σ_{j=1..L} y_j² / (1 + dist(i, j))   (16)</p>
      </sec>
      <sec id="sec-3-8">
        <title>Squared node distances and prediction values (H7)</title>
        <p>This heuristic combines the ideas that comprise heuristics H5 and H6, leading to the following relationship:</p>
        <p>y_i^post7 = Σ_{j=1..L} y_j² / (1 + dist(i, j)²)   (17)</p>
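<p>Heuristics H1, H5, H6 and H7 share one functional form and differ only in whether the prediction value or the distance is squared; a sketch with illustrative names (H2-H4 additionally zero out non-ancestor pairs and are omitted here):</p>

```python
import numpy as np

def reweight(y, dist, num=1, den=1):
    """Generic form of heuristics H1/H5/H6/H7: the new score of
    category i is sum_j y_j^num / (1 + dist(i,j)^den).
    H1: num=1, den=1; H5: den=2; H6: num=2; H7: num=2, den=2.
    dist is an (L, L) matrix of tree distances."""
    y = np.asarray(y, dtype=float)
    L = len(y)
    return np.array([
        np.sum(y**num / (1.0 + dist[i]**den)) for i in range(L)
    ])
```

<p>For instance, with two sibling-free categories at distance 1 and predictions (1.0, 0.5), H1 yields new scores (1.25, 1.0): each category keeps its own score and receives the other's score halved.</p>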
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>For both the subtasks of predicting diagnostic and procedural codes, 5 different models were trained; this was the maximum number allowed in the competition. Since the size of the dataset was a problem during training, the authors chose to only train models for the top-50 most represented categories in the training dataset. During training of the hierarchical models, ancestors of the top-50 categories were added as well, but only the performance on the original 50 categories was taken into account for calculating the result metrics. A selection of models was chosen aiming for much variety, to be able to assess the influence of both proposed models (CAML and MVC-RLDA), the hierarchical objective, and post-processing using a heuristic. The chosen models are summarized below and are the same for both subtasks:
1. CAML
2. CAML + hierarchical objective
3. MVC-RLDA + hierarchical objective
4. CAML + hierarchical post-processing H1
5. MVC-RLDA + hierarchical objective + hierarchical post-processing H1
First, one baseline without use of the hierarchy and heuristics was chosen. Since CAML got slightly better results than MVC-RLDA on the development set, this model was selected. Second, to assess the influence the hierarchy can have on the classification results, both CAML and MVC-RLDA models were trained with a hierarchical objective. The last 2 models were chosen with the post-processing heuristic in mind. Only heuristic H1 was chosen for this (based on higher performance on the development set), once in a setting without hierarchical objective (with CAML) and once with the hierarchical objective (and MVC-RLDA). Since the models used in this paper had a lot of difficulties with the small number of training examples, the prediction probabilities of all categories were rather close together (often in the range of 0.3 to 0.5 instead of from 0.0 to 1.0). For this reason, the prediction files were generated using the top-5 highest predicted categories instead of using a fixed cut-off point. This is not optimal for obtaining a high MAP, for which it is better to submit more categories, even at lower performance values. The results obtained by these prediction files are visible in tables 1 and 2 for the diagnostic and procedural subtasks respectively.</p>
      <p>For the case of diagnostic codes, visible in table 1, the best performance is achieved by the CAML model in combination with heuristic post-processing H1. Adding the heuristic to CAML leads to a clear improvement in classification quality. Comparing CAML with CAML+Hier. leads to the conclusion that the hierarchy can also lead to an improvement, but it is less prominent than using the post-processing heuristic. Furthermore, it is clear that the MVC-RLDA model is outperformed by CAML. This is most likely due to the fact that the former model contains more trainable parameters than CAML, while only a small number of training examples is available.</p>
      <p>For the case of procedural codes, visible in table 2, the best results are now obtained by a combination of CAML with a hierarchical objective. This is closely followed by CAML with a post-processing heuristic. Both techniques improve the classification scores significantly, but the overall scores are lower than for the task of classifying diagnostic codes. Lastly, both MVC-RLDA models predicted invalid codes for all documents in the test set, not being able to learn significant relations present in the data.</p>
      <p>As an extra experiment to assess the performance of the described heuristics, a CAML model was post-processed with the 7 different heuristics. In this case, not only the top-5 categories were retained; instead, all top-50 categories were sorted by confidence. The resulting files were then evaluated with the evaluation file provided by the competition, and results are reported in table 3.</p>
      <p>For both the subtasks of classifying diagnostic and procedural codes, the use of heuristic H1 is the clear winner. It is worth noting that in no case did the results of the baseline get worse because of the use of a post-processing heuristic. Furthermore, in most cases this led to an improvement of the results, strengthening the claim that post-processing heuristics based on the ICD-10 taxonomy can be a valuable tool. Next to H1, the best performing heuristic is H5, which squares the distances between nodes in the classification tree. Since all heuristics that try to give more weight to nodes closer to the observed node underperform with respect to H1, it might be interesting to see whether the opposite can further improve the classification process.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper we trained 5 models for participation in 2 subtasks of the 2020 CLEF eHealth task 1. For both subtasks, experiments were conducted, yielding interesting results. The hierarchical component as well as the use of post-processing heuristics proved their value in this setting. The use of a multi-view neural network led to an abundance of trainable parameters, which ultimately made the model unable to efficiently generalize over the training samples. An extra experiment was conducted to assess the influence of the presented post-processing heuristics. This led to the conclusion that these heuristics can be a powerful tool for the classification of ICD codes.</p>
      <p>8. Sadoughi, N., Finley, G.P., Fone, J., Murali, V., Korenevski, M., Baryshnikov, S., Axtmann, N., Miller, M., Suendermann-Oeft, D.: Medical code prediction with multi-view convolution and description-regularized label-dependent attention. arXiv preprint arXiv:1811.01468 (2018)</p>
      <p>9. Shi, H., Xie, P., Hu, Z., Zhang, M., Xing, E.P.: Towards automated ICD coding using deep learning. arXiv preprint arXiv:1711.04075 (2017)</p>
      <p>10. Silla, C.N., Freitas, A.A.: A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22(1), 31-72 (Jan 2011)</p>
      <p>11. Wehrmann, J., Cerri, R., Barros, R.: Hierarchical multi-label classification networks. In: Dy, J., Krause, A. (eds.) ICML. Proceedings of Machine Learning Research, vol. 80, pp. 5075-5084. PMLR, Stockholmsmässan, Stockholm, Sweden (10-15 Jul 2018)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baumel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nassour-Kassis</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elhadad</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elhadad</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <article-title>: Multi-label classification of patient notes: a case study on ICD code assignment (</article-title>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suominen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Escalada</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pasi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , Saez Gonzales,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Viviani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Overview of the CLEF eHealth evaluation lab 2020</article-title>
          . In: Arampatzis,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Tsikrika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Vrochidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Joho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Eickhoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Neveol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            , Cappellato, L., Ferro, N. (eds.)
            <surname>Experimental IR Meets Multilinguality</surname>
          </string-name>
          , Multimodality, and
          <source>Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF</source>
          <year>2020</year>
          ) . LNCS Volume number:
          <volume>12260</volume>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kavuluru</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rios</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records</article-title>
          .
          <source>Artificial Intelligence in Medicine 65(2)</source>
          ,
          <volume>155</volume>
          -
          <fpage>166</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kowsari</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
          </string-name>
          , D.E.,
          <string-name>
            <surname>Heidarysafa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meimandi</surname>
            ,
            <given-names>K.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerber</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barnes</surname>
            ,
            <given-names>L.E.</given-names>
          </string-name>
          : Hdltex:
          <article-title>Hierarchical deep learning for text classification</article-title>
          .
          <source>In: 2017 16th IEEE International Conference on Machine Learning and Applications (Dec</source>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Miranda-Escalada</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Armengol-Estape</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of automatic clinical coding: annotations, guidelines, and solutions for non-english clinical cases at codiesp track of CLEF eHealth 2020</article-title>
          . In: Working Notes of Conference and
          <article-title>Labs of the Evaluation (CLEF) Forum</article-title>
          . CEUR Workshop Proceedings (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mullenbach</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Wiegreffe, S.,
          <string-name>
            <surname>Duke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eisenstein</surname>
          </string-name>
          , J.:
          <article-title>Explainable prediction of medical codes from clinical text</article-title>
          .
          <source>In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long Papers). pp.
          <volume>1101</volume>
          -
          <fpage>1111</fpage>
          . Association for Computational Linguistics, New Orleans,
          <source>Louisiana (Jun</source>
          <year>2018</year>
          ). https://doi.org/10.18653/v1/
          <fpage>N18</fpage>
          -1100, https://www.aclweb.org/ anthology/N18-1100
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Perotte</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pivovarov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Natarajan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiskopf</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wood</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elhadad</surname>
          </string-name>
          , N.:
          <article-title>Diagnosis code assignment: models and evaluation metrics</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          <volume>21</volume>
          (
          <issue>2</issue>
          ),
          <volume>231</volume>
          -237 (Mar
          <year>2014</year>
          ),
          <volume>24296907</volume>
          [pmid]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>