SehMIC: Semi-hierarchical Multi-label ICD code Classification

Sedigheh Eslami^{1,2}, Peter Adorjan^1, and Christoph Meinel^2

1 Data4Life, Germany. {sedigheh.eslami, peter.adorjan}@data4life.care
2 Hasso Plattner Institute, Germany. {sedigheh.eslami, christoph.meinel}@hpi.de

Abstract. Automatic ICD code assignment to clinical notes is a beneficial but challenging task due to the large number of possible ICD codes and the small amount of available data. It becomes even more challenging in multilingual settings with resource-poor languages, in which the number of available annotated texts is generally very small. In this work, we present SehMIC, a semi-hierarchical multi-label classification approach which leverages knowledge about the structure of ICD codes to assign them to Spanish discharge letters. The approach classifies the different sections of an ICD code separately for a given letter. It obtains the final ICD code by concatenating the predicted code sections and pruning unlikely combinations with an empirical a priori distribution. Moreover, we utilize a transfer learning approach based on pre-trained multilingual BERT to obtain contextual document representations for the Spanish discharge letters. Data augmentation is also performed in order to exploit more data in the learning process. SehMIC achieves MAP scores of 0.1 and 0.004 on the development and test datasets, respectively. This work was done by our nlp4life team for the CLEF eHealth 2020 Task 1 challenge on Multilingual Information Extraction.

Keywords: Automated ICD code assignment · Multi-label classification · Transfer learning · Multilingual BERT

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

Motivation. Electronic health records (EHR) comprise a collection of patients' health-related longitudinal data [10]. They contain patients' demographic data, medical histories, symptoms, diagnoses, etc., in both structured and unstructured text format. International Classification of Diseases (ICD) codes are diagnostic codes used in EHRs to uniquely describe a patient's diagnosis for billing and reimbursement purposes [18]. Healthcare systems train several human coders who specifically learn the medical terminology, the ICD coding system and its rules so that they can manually assign ICD codes to patients' records. This process is not only time consuming and expensive, but also intrinsically introduces human errors into the detection of the correct codes. Therefore, it is beneficial to develop an automated computational solution that detects the associated ICD codes for given clinical notes. In this work, we investigate how to assign ICD codes directly from discharge letters, since discharge letters are assumed to contain the diagnosis ground truth along with information on the patient's symptoms, procedures and examinations [11, 21].

Related work. Accurate automated ICD code assignment to clinical texts is a challenging problem, for several reasons: (1) clinical texts contain many typos and highly specific medical terminology and keywords; (2) the number of possible ICD codes is huge and, correspondingly, there are not enough samples per ICD code to learn from; (3) real-world data suffers from class imbalance. This task has previously been investigated via rule-based [9], machine learning [5, 19, 26] and deep learning based [2, 11, 12] approaches.
Rule-based systems require human experts to find the patterns in text and design the rules. This manual effort makes rule-based approaches difficult to scale. In contrast, learning-based approaches mostly depend on the underlying data distributions to find common patterns and decision procedures. With the recent success of deep learning in language modeling and contextual word embeddings, end-to-end deep neural networks have been studied for automated ICD code assignment and achieved competitive results [2, 16]. Recently, this task has also been carried out in multilingual settings [7, 8, 17]. In [1], the authors utilize a transfer learning approach based on Bidirectional Encoder Representations from Transformers (BERT) [6] for bilingual German-English automated ICD code assignment.

Our contributions. In this paper, we describe our work on the ICD10-CM code assignment subtask of CLEF eHealth 2020 Task 1 [15]. We developed a semi-hierarchical multi-label classifier that leverages knowledge about the structure of the labels in order to assign ICD10-CM codes to Spanish discharge letters. We fine-tuned multilingual BERT at each level of the classification hierarchy. Additionally, we applied a data augmentation mechanism in order to exploit more diverse samples per label in the learning phase.

2 Problem and concepts definition

The following notation is used throughout this paper:

• Vocabulary of words $V = \{w_1, w_2, \ldots, w_v\}$ of size $v$,
• Set of word embeddings $E = \{e_1, e_2, \ldots, e_v\}$ of size $v$, in which $e_i \in \mathbb{R}^d$ is the word embedding vector of the word $w_i$,
• Set of discharge summaries $S = \{s_1, s_2, \ldots, s_n\}$, in which $s_j$ is a sequence of words from the vocabulary $V$,
• Set of features $X = \{X_1, X_2, \ldots, X_n\}$, in which $X_j \in \mathbb{R}^m$ is the contextual feature vector representing discharge letter $s_j$,
• Set of all labels $L = \{l_1, l_2, \ldots, l_\ell\}$ corresponding to ICD codes,
• For a given discharge summary $s_j$, we represent the set of associated labels as a binary vector $L_j \in \{0, 1\}^\ell$.

Given $\{(X_j, y_j)\}_{j=1}^{n}$, where $X_j \in \mathbb{R}^m$ and $y_j \in L_j$, our objective is to train a multi-label classifier $C : \mathbb{R}^m \rightarrow \{0, 1\}^\ell$ such that $C(X_j) = y_j$ for any $j \in \{1, \ldots, n\}$.

3 Approach

In this section, we describe our proposed approach for multi-label ICD code assignment to discharge letters. Our approach includes two main steps: (1) data augmentation and (2) semi-hierarchical multi-label classification (SehMIC).

3.1 Data augmentation

In learning-based approaches, the more (and the more diverse) data we have, the better our model learns the underlying distributions and patterns in the data. Data augmentation is used in several fields, e.g., computer vision and natural language processing, to increase the diversity of the training data without actually collecting new data. In the CLEF 2020 eHealth challenge, we perform data augmentation primarily because very few discharge letters exist for many of the ICD codes in the training data. Inspired by the work in [24], we use a lexical substitution approach based on word embeddings. We create the Synonyms Dictionary (SD) from the similarity of words in the embedding space, using the word embedding set $E$. We define the synonyms of each word to be the set of all words whose similarity in the embedding space is at least a given similarity threshold $\theta$:

$$SD(w_i) : w_i \rightarrow synonyms(w_i), \quad synonyms(w_i) = \{w_j \mid sim(e_i, e_j) \geq \theta,\ j \in \{1, \ldots, v\},\ j \neq i\}.$$
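To make the construction concrete, the following is a minimal sketch of how such a synonyms dictionary could be built with gensim and pre-trained word vectors. The file name, the topn cut-off, and the use of most_similar as a shortcut for scanning the full vocabulary are illustrative assumptions, not details of our implementation.

    from gensim.models import KeyedVectors

    THETA = 0.7  # similarity threshold theta (the value used later in Section 4.2)

    # Hypothetical path to Spanish fastText vectors in word2vec text format.
    vectors = KeyedVectors.load_word2vec_format("spanish-fasttext.vec")

    def build_synonyms_dict(vocab, vectors, theta=THETA, topn=50):
        """Map each word to all words whose cosine similarity is >= theta.

        topn=50 is a practical cut-off; it is safe under the assumption that
        only a few dozen neighbours can ever pass the threshold.
        """
        sd = {}
        for w in vocab:
            if w not in vectors:
                sd[w] = []  # out-of-vocabulary words get no synonyms
                continue
            neighbours = vectors.most_similar(w, topn=topn)  # (word, cosine) pairs
            sd[w] = [cand for cand, sim in neighbours if sim >= theta]
        return sd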
Notice that, depending on the threshold $\theta$, a word can end up with an empty set of synonyms. Afterwards, given the SD and a discharge letter $s_j$, we iterate over the words in the letter, randomly select a synonym from each word's set of synonyms stored in the SD, and substitute the word with the selected synonym. We repeat this process $k_j$ times per letter $s_j$, in which

$$k_j = \frac{\max_{l \in L}(\text{number of samples for } l)}{\min_{l' \in L_j}(\text{number of samples for } l')}.$$

The reason for repeating the text generation $k_j$ times per letter $s_j$ is two-fold: first, it balances the label distribution in terms of the number of available samples per label, so that the data augmentation generates fewer samples for the majority labels and more samples for the minority ones; second, since we have multiple synonyms per word, repeating the text generation exploits as many of each word's synonyms as possible in the augmentation step. Algorithm 1 provides a summary pseudo-code of our data augmentation approach.

Algorithm 1 Data augmentation
Input: data {(s_j, y_j)}_{j=1}^{n}, SD
Output: augmented labeled data {(a_j, y_j)}_{j=1}^{k}
 1: procedure Augment(data, SD)
 2:     aug_data ← init_empty_data()
 3:     max_cnt ← max(number of samples over all labels)
 4:     L ← unique_labels(data)
 5:     min_cnt_dict ← dict()
 6:     for l in L do
 7:         min_cnt_dict(l) ← number of samples for l
 8:     for (text, label) in data do
 9:         K ← max_cnt / min_{l ∈ label}(min_cnt_dict(l))
10:         for k in range(K) do
11:             aug_text ← ""
12:             for w in text do
13:                 syn ← random(SD(w), 1)
14:                 aug_text ← aug_text + syn
15:             aug_data.append(aug_text, label)
16:     return aug_data
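As a complement to the pseudo-code, the following is a small Python rendering of Algorithm 1 under simplifying assumptions: data is a list of (text, labels) pairs with whitespace-tokenized text, and words with no synonyms in the SD are kept unchanged. It is an illustrative sketch, not our exact implementation.

    import random
    from collections import Counter

    def augment(data, sd):
        # data: list of (text, labels) pairs; sd: synonyms dictionary.
        # Count how many letters carry each label.
        label_counts = Counter(l for _, labels in data for l in labels)
        max_cnt = max(label_counts.values())
        aug_data = []
        for text, labels in data:
            # k_j: count of the most frequent label overall, divided by the
            # count of this letter's rarest label, so letters carrying
            # minority labels are copied more often.
            k = max_cnt // min(label_counts[l] for l in labels)
            for _ in range(k):
                words = [random.choice(sd[w]) if sd.get(w) else w
                         for w in text.split()]
                aug_data.append((" ".join(words), labels))
        return aug_data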
3.2 SehMIC

Often, in the task of automated ICD code assignment, the number of available samples per label is not sufficient. In this case, a flat classifier, i.e., a classifier that does not consider an inherent hierarchy between the labels, does not receive enough samples per label to learn from, and minimizing the training error will lead to overfitting. On the other hand, training a fully hierarchical classification system requires training thousands of local classifiers for the intermediate hierarchies [22], which is time consuming. In order to overcome these problems, we propose SehMIC, a heuristic semi-hierarchical multi-label classification solution which leverages the knowledge we have about the hierarchical structure of ICD codes. In this work, we explain our method with regard to ICD10-CM codes, but the same concepts can be applied to other types of ICD codes as well.

ICD10-CM codes are three- to seven-character codes in which the first three characters are separated from the rest by a dot. The first three characters describe the category of the medical condition. Details about the condition within the category are represented by the characters appearing after the dot. The first character of the category code is called the chapter code, which describes the main type of the medical condition, e.g., injury. The next two characters provide more information about the problem in the chapter, e.g., the location or the severity of the problem [18]. Figure 1 depicts this structure with an example ICD10-CM code.

[Fig. 1. Example of the ICD10-CM code structure: the chapter S stands for injuries, poisoning and certain other external causes; the category (chapter plus second level) S86 for injury of muscle, fascia and tendon at lower leg; the details S86.01 for strain of Achilles tendon and S86.011 for strain of right Achilles tendon; and the extension D for a subsequent encounter.]

Considering this structure, we translate the ICD code classification as follows:

1. Solve the multi-label classification of the chapter given the discharge letter.
2. Solve the multi-label classification of the second level given the discharge letter.
3. Obtain the preliminary candidate category codes by concatenating the results from 1 and 2.
4. Prune the preliminary category codes with respect to unlikely code combinations by multiplying with an empirically estimated conditional a priori distribution, and obtain the final category codes: $P(\text{second level} \mid \text{discharge letter}) \times P(\text{second level} \mid \text{chapter})$.
5. Solve the multi-label classification of the details given the discharge letter.
6. Concatenate the results from 4 and 5 to obtain the preliminary ICD10-CM codes.
7. Prune the codes with respect to unlikely details and category combinations by multiplying with an empirically estimated conditional a priori distribution, and obtain the final ICD10-CM codes: $P(\text{details} \mid \text{discharge letter}) \times P(\text{details} \mid \text{category})$.

Figure 2 illustrates this approach with an example. For a given discharge summary, SehMIC predicts S and T for the chapter, 89 and 99 for the second level, and 02 and 9 for the details codes. Concatenating the predicted chapter and second level codes results in S89, S99, T89 and T99, from which T89 and T99 are pruned by the conditional a priori distribution, as they are invalid ICD codes and their corresponding probabilities are zero. Similarly, combining and pruning the category codes and the predicted details codes results in the final S89.02, S89.9 and S99.02 ICD codes to be assigned to the discharge summary.

[Fig. 2. The process of predicting ICD10-CM codes via SehMIC: the chapter, second level and details classifiers run on the discharge summary, and their outputs are concatenated and pruned in two stages.]

Multi-label classification. All of the classifications in steps 1, 2 and 5 are multi-label, i.e., multiple labels are predicted per sample discharge letter. Two main approaches exist for performing multi-label classification: first, problem transformation methods, i.e., methods that transform the multi-label problem into many single-label classification problems; second, algorithm adaptation methods, i.e., methods that directly adapt algorithms to handle multi-label classification [23]. In this work, we adapt and fine-tune multilingual BERT for sequence classification to directly support multi-label classification. BERT provides a sequence-level contextual embedding, represented by the [CLS] special token [6]. Fine-tuning BERT for single-label classification is done by adding a feed-forward fully connected output layer with a softmax activation function on top of the sequence-level BERT embedding [6]. In contrast, for the multi-label setting we use the sigmoid activation in the output layer, because the probabilities computed by the sigmoid are independent and do not need to sum to one. As a result, the network can allow more than one correct label for a given sample. Given a decision probability threshold, we select all labels whose probability is greater than the threshold as the predicted labels.
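A minimal sketch of such a sigmoid output layer on top of BERT is shown below. It assumes a recent version of the Hugging Face Transformers library (v4+, where the model returns an output object with pooler_output) and is a simplified re-implementation, not the exact adapted class we used (see Section 4.2 for the actual setup).

    import torch
    from torch import nn
    from transformers import BertModel

    class BertForMultiLabel(nn.Module):
        def __init__(self, num_labels, model_name="bert-base-multilingual-cased"):
            super().__init__()
            self.bert = BertModel.from_pretrained(model_name)
            self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
            # BCEWithLogitsLoss applies the sigmoid per label internally, so the
            # label probabilities are independent and need not sum to one.
            self.loss_fn = nn.BCEWithLogitsLoss()

        def forward(self, input_ids, attention_mask, labels=None):
            out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            logits = self.classifier(out.pooler_output)  # [CLS]-based embedding
            if labels is not None:  # labels: multi-hot float tensor
                return self.loss_fn(logits, labels.float()), logits
            return torch.sigmoid(logits)

One such model is trained per hierarchy level (chapter, second level, details); at prediction time, every label whose sigmoid probability exceeds the decision threshold is emitted, which is what allows multiple codes per letter.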
4 Experiments

4.1 Dataset

As participants in the CLEF eHealth 2020 challenge [15], we conduct our experiments on the Spanish corpus released in this challenge. The average length of the letters in the training, development and test sets is 350. Table 1 presents basic statistics over these sets. During the challenge, around 3000 letters were released for the testing phase, of which only 250 formed the actual test corpus used in the evaluations; the remaining letters served as background texts. The fraction of labels with only one sample in Table 1 illustrates that if we simply ignored the labels with very few samples, we would lose more than half of the labels. Moreover, about 37% of the unique ICD10-CM codes in the development set are not present in the training set. Similarly, only about 68% of the ICD10-CM codes in the test set overlap with the union of the codes in the training and development sets; the rest are missing. Thus, our data exploration shows that this challenge also involves tackling a missing labels problem.

Table 1. CLEF 2020 eHealth dataset statistics

                  #samples   #labels   Fraction of labels with only one sample
    Train            500       1767    56%
    Dev(elopment)    250       1158    64%
    Train + Dev      750       2194    53%
    Test             250       1143    60%

4.2 Experimental setup

Data augmentation. In the augmentation step, we use pre-trained fastText embeddings [3] from the Spanish Billion Words Corpus and Embeddings project [4] to determine the synonyms and create the synonyms dictionary. We set the similarity threshold to 0.7 (similarities are normalized values in the range [0, 1]) and use cosine similarity to calculate the word embedding similarities. We concatenate the training and development discharge letters and perform the data augmentation on the concatenated set in order to mitigate the missing labels problem in the training set. Stopwords and the ICD10-CM codes mentioned in the letters are skipped in our setting. In the resulting synonyms dictionary, the average number of synonyms per word is 3, the maximum is 20, and 40% of the words end up with no synonyms. In the training phase, the augmented dataset and the original training set are used together for training the classification models. The final set used for training includes 41750 discharge letters and 2196 unique ICD codes.

Classification setup. We fine-tuned the pre-trained bert-base-multilingual-cased model (see huggingface.co/transformers/pretrained_models.html) using the Hugging Face Transformers library [25], which is based on PyTorch [20]. Since our problem is a multi-label classification task, we adapted the BertForSequenceClassification class from Hugging Face to use the sigmoid activation on the output layer along with a binary cross-entropy loss (source code: github.com/sarahESL/CLEFeHealth2020-multilabel-bert). We set the maximum sequence length to 512 and train each of the three classifiers for 3 epochs with a learning rate of 0.00003 and the AdamW optimizer [14]. For the chapter, category and ICD codes we set the decision thresholds to 0.5, 0.001 and 0.001, respectively.

Conditional a priori distributions. The empirical a priori distributions are calculated as follows, using the code S86.011 as an example:

$$p(\text{second level} = \text{“86”} \mid \text{chapter} = \text{“S”}) = \frac{\#\text{ samples with category “S86”}}{\#\text{ samples with chapter code “S”}}$$

$$p(\text{details} = \text{“011”} \mid \text{category} = \text{“S86”}) = \frac{\#\text{ samples with code “S86.011”}}{\#\text{ samples with category code “S86”}}$$
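To illustrate, the following sketch estimates the first of these empirical distributions from the training labels and applies it for the pruning in step 4 of Section 3.2. The helper names are ours, and the input is assumed to be a flat list of the gold ICD10-CM codes over all training samples; P(details | category) can be estimated analogously from counts of full codes within each category.

    from collections import Counter

    def second_level_prior(train_codes):
        """Estimate P(second level | chapter) from a list of gold ICD10-CM codes.

        The first character of a code is the chapter; the first three
        characters (chapter + second level) form the category.
        """
        chapter_cnt = Counter(code[0] for code in train_codes)
        category_cnt = Counter(code[:3] for code in train_codes)
        return {cat: category_cnt[cat] / chapter_cnt[cat[0]]
                for cat in category_cnt}

    def prune_categories(chapters, second_levels, prior):
        """Concatenate predicted chapters and second levels; drop zero-prior codes."""
        candidates = [c + s for c in chapters for s in second_levels]
        return [cat for cat in candidates if prior.get(cat, 0.0) > 0.0]

For the Figure 2 example, prune_categories(['S', 'T'], ['89', '99'], prior) keeps S89 and S99 and drops T89 and T99 whenever the latter never occur in the training data.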
4.3 Results and insights

The experimental results of our proposed method are presented in Table 2. We use the Mean Average Precision (MAP) [13] metric in our evaluations, as it was the evaluation metric of the CLEF 2020 eHealth challenge [15]. On the chapter level, our classifier achieves MAP scores of 0.97 and 0.43 on the development and test sets, respectively. Although the MAP score for the category code prediction in the development setting is 0.69, we see a tremendous degradation in the test result. Furthermore, the final ICD10-CM codes are predicted with MAP scores of 0.1 and 0.004 on the development and test sets. This is due to the fact that the data used for training is highly imbalanced and default BERT does not handle imbalanced classes. Additionally, predicting the second level code directly from the last two characters of the category results in misclassifications, because the same site, severity, etc. are represented by different second level codes. For instance, both the Z94.0 and S37.0 codes describe a condition concerning the kidney; however, the kidney is represented by 94 for transplant (Z) and by 37 for injury (S) conditions. We think that modeling a latent semantic variable for the second level code would improve the category prediction performance. The same reasoning applies to the details code as well.

Table 2. MAP scores on the development and test sets

            chapter   category   ICD10-CM
    Dev       0.97      0.69       0.1
    Test      0.43      0.008      0.004

We suspect that using the development letters in our data augmentation step causes letters that are very similar to the development set to appear in the augmented data. As a result, our classifiers have already seen development-like data in their training phases. Therefore, we interpret the dev results in Table 2 as training results.

5 Conclusion

In this work, we presented our (nlp4life team) submission to the CLEF eHealth 2020 Task 1 challenge. This challenge required overcoming imbalanced data distributions and missing labels problems. Additionally, the number of available samples per unique label was small, which made it particularly challenging to train a flat, fully supervised classification model. We proposed a lexical substitution data augmentation and a semi-hierarchical classification approach for assigning ICD10-CM codes to discharge letters. Our approach still misclassifies a noticeable number of category and ICD codes. In future work, we would like to improve these results by modeling latent semantic variables for the second level and details code predictions. Moreover, we plan to investigate context-aware approaches using ICD code embeddings in order to improve the classification performance and overcome the missing labels problem.

Acknowledgement

We would like to thank Matthias Steinbrecher for the helpful discussions and comments.

References

1. Amin, S., Neumann, G., Dunfield, K., Vechkaeva, A., Chapman, K.A., Wixted, M.K.: MLT-DFKI at CLEF eHealth 2019: Multi-label classification of ICD-10 codes with BERT. In: CLEF (Working Notes) (2019)
2. Baumel, T., Nassour-Kassis, J., Cohen, R., Elhadad, M., Elhadad, N.: Multi-label classification of patient notes: case study on ICD code assignment. In: Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence (2018)
3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017)
4. Cardellino, C.: Spanish Billion Words Corpus and Embeddings (August 2019)
5. Dermouche, M., Velcin, J., Flicoteaux, R., Chevret, S., Taright, N.: Supervised topic models for diagnosis code assignment to discharge summaries. In: International Conference on Intelligent Text Processing and Computational Linguistics. pp. 485–497. Springer (2016)
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
7. Dörendahl, A., Leich, N., Hummel, B., Schönfelder, G., Grune, B.: Overview of the CLEF eHealth 2019 multilingual information extraction (2019)
8. Goeuriot, L., Suominen, H., Kelly, L., Miranda-Escalada, A., Krallinger, M., Liu, Z., Pasi, G., Saez Gonzales, G., Viviani, M., Xu, C.: Overview of the CLEF eHealth evaluation lab 2020. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). LNCS vol. 12260 (2020)
9. Goldstein, I., Arzumtsyan, A., Uzuner, Ö.: Three approaches to automatic assignment of ICD-9-CM codes to radiology reports. In: AMIA Annual Symposium Proceedings. vol. 2007, p. 279. American Medical Informatics Association (2007)
10. Gunter, T.D., Terry, N.P.: The emergence of national electronic health record architectures in the United States and Australia: models, costs, and questions. Journal of Medical Internet Research 7(1), e3 (2005)
11. Huang, J., Osorio, C., Sy, L.W.: An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. Computer Methods and Programs in Biomedicine 177, 141–153 (2019)
12. Li, M., Fei, Z., Zeng, M., Wu, F.X., Li, Y., Pan, Y., Wang, J.: Automated ICD-9 coding via a deep learning approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics 16(4), 1193–1202 (2018)
13. Liu, L., Özsu, M.T.: Encyclopedia of Database Systems, vol. 6. Springer, New York, NY, USA (2009)
14. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
15. Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of CLEF eHealth 2020. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2020)
16. Mullenbach, J., Wiegreffe, S., Duke, J., Sun, J., Eisenstein, J.: Explainable prediction of medical codes from clinical text. arXiv preprint arXiv:1802.05695 (2018)
17. Névéol, A., Robert, A., Grippo, F., Morgand, C., Orsi, C., Pelikan, L., Ramadier, L., Rey, G., Zweigenbaum, P.: CLEF eHealth 2018 multilingual information extraction task overview: ICD10 coding of death certificates in French, Hungarian and Italian. In: CLEF (Working Notes) (2018)
18. World Health Organization: International statistical classification of diseases and related health problems, vol. 1. World Health Organization (2004)
19. Pakhomov, S.V., Buntrock, J.D., Chute, C.G.: Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques. Journal of the American Medical Informatics Association 13(5), 516–525 (2006)
20. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019)
21. Prakash, A., Zhao, S., Hasan, S.A., Datla, V., Lee, K., Qadir, A., Liu, J., Farri, O.: Condensed memory networks for clinical diagnostic inferencing. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
22. Sun, A., Lim, E.P.: Hierarchical text classification and evaluation. In: Proceedings of the 2001 IEEE International Conference on Data Mining. pp. 521–528. IEEE (2001)
23. Tsoumakas, G., Katakis, I.: Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM) 3(3), 1–13 (2007)
24. Wang, W.Y., Yang, D.: That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 2557–2563 (2015)
25. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Brew, J.: HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv abs/1910.03771 (2019)
26. Yan, Y., Fung, G., Dy, J.G., Rosales, R.: Medical coding classification by leveraging inter-code relationships. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 193–202 (2010)