Transformers in Semantic Indexing of Clinical Codes

Rishivardhan K, Kayalvizhi S, Thenmozhi D, Sachin Krishan T, Aravindan Chandrabose
SSN College of Engineering, Chennai.
rishivardhan18126@cse.ssn.edu.in, {kayalvizhis, theni_d}@ssn.edu.in, sachinkrishnan18128@cse.ssn.edu.in, aravindanc@ssn.edu.in

Abstract. International Classification of Diseases (ICD) codes are used to represent diseases and health statuses, which supports better communication and facilitates research. Automatic assignment of these codes to clinical records is increasingly essential; the multilingual extraction task of eHealth@CLEF-2020 concentrates on predicting ICD10-CM and ICD10-PCS codes for clinical cases in Spanish. We approached automatic prediction of these codes using transformers, namely BERT, RoBERTa, Electra and XLNet. Among these models, Electra predicts ICD10-CM codes better and RoBERTa predicts ICD10-PCS codes better than the other transformer models.

Keywords: BERT · Electra · RoBERTa · Transformers · XLNet

1 Introduction

The International Classification of Diseases (ICD) is a healthcare classification system maintained by the World Health Organization. Diseases and health statuses are classified according to certain rules and uniquely identified by character codes. The efficiency of ICD coding has begun to receive more attention because of how crucial it is for making clinical and financial decisions. Hospitals with better coding quality see many benefits, including more accurate classification and retrieval of medical records, and better communication with other hospitals to jointly promote healthcare quality and facilitate research.

Human classification of diagnoses is a labor-intensive process that consumes significant resources. Most medical practitioners rely on specially trained medical coders to categorize diagnoses for billing and research purposes. The coding process requires a comprehensive consideration of each patient's health condition. However, very few medical practitioners are capable of taking over the process, since they lack training in professional coding.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

2 Task description

In the Multilingual Extraction Task [7] of the eHealth lab [3], the goal is to assign ICD10 codes to clinical records. The task comprises three subtasks, of which we participated in two, namely ICD10-CM (Clinical Modification) and ICD10-PCS (Procedure Classification System) code assignment. The ICD10-CM subtask deals with automating code assignment for classifying and reporting diseases in all healthcare settings, while the ICD10-PCS subtask deals with code assignment for hospital reporting of inpatient services. From our viewpoint, the task is pursued as multi-label text classification, although Named Entity Recognition and normalization approaches are also plausible.

3 Dataset Description

The CodiEsp corpus of 1,000 clinical case studies, selected manually by a practicing physician, was used for training the models. The CodiEsp corpus is distributed as plain text in UTF-8 encoding, where each clinical case is stored in a single file whose name is the clinical case identifier. Each clinical case identifier is associated with one or more medical codes, depending on the subtask's gold-standard file. The corpus consists of two variants of the clinical files: the original set in Spanish, and an English machine-translated version of the former. For our system we used only the English version of the clinical case files to train the models. The 1,000 clinical case files were split into 500 training files, 250 development files and 250 test files. Additionally, 2,000 clinical files were released along with the test set to promote systems that can scale to larger data collections.

4 Methodology

The task was approached as multi-label text classification using transformers. The data is preprocessed and then made suitable for training by creating a data frame for all instances of the data. Model training is done by fine-tuning four transformers: BERT, RoBERTa, XLNet and Electra. After training, a file with the probability scores of all possible codes is generated, which is post-processed to filter the suitable codes for each instance.

4.1 Data preparation

Initially, the data is preprocessed and then made suitable for training. Preprocessing steps include removal of punctuation and stop words, followed by tokenization and lemmatization. Then, the text files are prepared by mapping the codes of each text to a binary label vector of ones and zeros. For example, if there are 5 unique codes overall and a text is annotated with code1 and code3, then that text is trained with a label row that has ones only in the columns of those codes and zeros elsewhere:

Label row for that text: 1 0 1 0 0

In total there are 2,194 unique labels, and each text file carries between 1 and 56 of them. Thus, a label vector is created for each text file and the resulting data frame is given to the model for training.

Sample input data:

   text                                              labels
0  describe case 37yearold man previous active ...  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...]
1  present head neck auscultation cardiac ...       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...]
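The label matrix described above can be built with standard tooling. As a minimal sketch, assuming scikit-learn's MultiLabelBinarizer and pandas (neither is named in the paper) and a hypothetical gold-standard mapping from case identifiers to codes, the preparation could look like this:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical gold-standard mapping: clinical case identifier -> list of assigned codes.
gold = {
    "caso_clinico_1": ["code1", "code3"],
    "caso_clinico_2": ["code2"],
}
# Preprocessed text of each clinical case, keyed by the same identifiers.
texts = {
    "caso_clinico_1": "describe case 37yearold man previous active ...",
    "caso_clinico_2": "present head neck auscultation cardiac ...",
}

# Fit the binarizer on every code seen in training (2,194 unique codes for the full corpus).
mlb = MultiLabelBinarizer()
label_matrix = mlb.fit_transform([gold[cid] for cid in texts])

# One row per clinical case: the raw text and its binary label vector.
train_df = pd.DataFrame({
    "text": [texts[cid] for cid in texts],
    "labels": [row.tolist() for row in label_matrix],
})
print(train_df.head())
```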
4.2 Training the models

The transformer models BERT, RoBERTa, Electra and XLNet were used; their implementation is explained below.

BERT

BERT [2] makes use of transformers, an attention mechanism that learns contextual relations between words in a text, with two pre-training strategies, namely masked language modeling and next sentence prediction. For predicting the clinical codes, BERT was fine-tuned over "Bio-Clinical-BERT" [5], a model trained on all notes from MIMIC-III, a database containing electronic health records of ICU patients at the Beth Israel Hospital in Boston. In our implementation, the Bio-Clinical-BERT model was used with ***.

RoBERTa

RoBERTa [6] uses a robustly optimized pre-training method that improves on BERT by modifying key hyperparameters. For our implementation, RoBERTa was fine-tuned over "BioMed-RoBERTa-base" [4], a language model based on the RoBERTa-base architecture, trained for 100 epochs with a train batch size of 20 and a learning rate of 4e-5.

Electra

Electra [1], a modification of BERT, corrupts the input by replacing some tokens and trains a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not, so that contextual representations are learned more effectively. For our work, Electra was fine-tuned over the "Electra-base" model of 12 Transformer blocks, 768 hidden dimensions and 110M parameters, trained for 100 epochs with a train batch size of 20 and a learning rate of 4e-5 to predict the probability of the codes.

XLNet

XLNet [8] is a generalized autoregressive (AR) pretraining method that uses a permutation language modeling objective to combine the advantages of AR and autoencoding (AE) methods, and integrates Transformer-XL with a carefully designed two-stream attention mechanism. For predicting the codes, XLNet was fine-tuned over the "xlnet-base-cased" model of 12 Transformer blocks, 12 self-attention heads and 768 hidden dimensions, trained for 100 epochs with a train batch size of 20 and a learning rate of 4e-5.
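The paper does not name the training library, so the following is only an illustrative sketch of one way to reproduce the fine-tuning step, assuming the simpletransformers wrapper around Hugging Face checkpoints; the model identifier "allenai/biomed_roberta_base", the variable train_df (from the data-preparation sketch) and test_texts are assumptions, while the hyperparameters follow the description above.

```python
from simpletransformers.classification import MultiLabelClassificationModel

# Hyperparameters from the description above: 100 epochs, batch size 20, learning rate 4e-5.
train_args = {
    "num_train_epochs": 100,
    "train_batch_size": 20,
    "learning_rate": 4e-5,
    "overwrite_output_dir": True,
}

# BioMed-RoBERTa-base fine-tuned for multi-label classification; num_labels must match the
# length of each label vector (2,194 for the full corpus). For the other runs, swap in the
# "bert", "electra" or "xlnet" model types and the corresponding checkpoints.
model = MultiLabelClassificationModel(
    "roberta",
    "allenai/biomed_roberta_base",
    num_labels=2194,
    args=train_args,
)

# train_df has one row per clinical case with "text" and "labels" columns (Section 4.1).
model.train_model(train_df)

# Hypothetical list of preprocessed test-case texts.
test_texts = ["sample preprocessed clinical case text ..."]

# predict() returns thresholded predictions plus raw per-code probability scores;
# the raw scores are what we post-process in Section 4.3.
predictions, raw_outputs = model.predict(test_texts)
```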
4.3 Post-Processing

After training, a prediction file with a probability score for each of the 2,194 codes is generated, which has to be post-processed. Post-processing consists of filtering the suitable codes. Here, we filtered with a threshold value of 0.002, which seems to be effective: codes with a probability greater than 0.002 are retained and reported as the predicted codes.
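As a minimal sketch of this filtering step, under the assumption that the per-code probabilities come from the raw_outputs of the previous sketch and that the code names are taken from the MultiLabelBinarizer of Section 4.1 (both assumptions, not the authors' exact pipeline):

```python
THRESHOLD = 0.002  # probability cut-off found to be effective

def filter_codes(scores, code_names, threshold=THRESHOLD):
    # Keep every code whose predicted probability exceeds the threshold.
    return [code for code, score in zip(code_names, scores) if score > threshold]

# code_names corresponds to the label columns built in Section 4.1 (mlb.classes_ in the
# data-preparation sketch); raw_outputs is the probability matrix from model.predict().
code_names = list(mlb.classes_)
for case_index, scores in enumerate(raw_outputs):
    print(case_index, filter_codes(scores, code_names))
```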
5 Results

Table 1. Subtask 1 (ICD10-CM) results

File    Model     MAP    Precision  Recall  F1
Run-1   BERT      0.001  0.009      0.014   0.011
Run-2   Electra   0.007  0.025      0.049   0.033
Run-3   RoBERTa   0.004  0.014      0.019   0.016
Run-4   XLNet     0      0          0.001   0.001

Table 2. Subtask 2 (ICD10-PCS) results

File    Model     MAP    Precision  Recall  F1
Run-1   Electra   0      0.001      0.002   0.001
Run-2   RoBERTa   0.028  0.016      0.075   0.027
Run-3   XLNet     0.002  0.002      0.007   0.003

Table 1 and Table 2 show the results of our submissions in subtask 1 and subtask 2 respectively. From Table 1, the Electra model attained a better MAP score of 0.007, precision of 0.025, recall of 0.049 and F1 score of 0.033 than the other models for subtask 1, whereas RoBERTa attains the better scores in subtask 2. The RoBERTa model in subtask 2 has the best recall, 0.075, among all our submissions.

6 Conclusion

Semantic indexing of clinical codes is highly desirable, since term mapping of clinical records to codes is crucial nowadays. We approached semantic indexing by utilizing transformers to predict the diagnostic (ICD10-CM) and procedural (ICD10-PCS) codes for the data. BERT, RoBERTa, Electra and XLNet are the transformers that were fine-tuned to predict the codes. All four transformer models were used for the ICD10-CM codes, where Electra performs best, while RoBERTa performs best in predicting ICD10-PCS codes among RoBERTa, Electra and XLNet.

Acknowledgement

We would like to thank DST-SERB and the HPC laboratory for providing the resources needed for this work.

References

1. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: Pre-training text encoders as discriminators rather than generators. In: ICLR (2020), https://openreview.net/pdf?id=r1xMH1BtvB
2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding (2018)
3. Goeuriot, L., Suominen, H., Kelly, L., Miranda-Escalada, A., Krallinger, M., Liu, Z., Pasi, G., Saez Gonzales, G., Viviani, M., Xu, C.: Overview of the CLEF eHealth evaluation lab 2020. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). LNCS volume 12260 (2020)
4. Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., Smith, N.A.: Don't stop pretraining: Adapt language models to domains and tasks. In: Proceedings of ACL (2020)
5. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (Sep 2019). https://doi.org/10.1093/bioinformatics/btz682
6. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach (2019)
7. Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of CLEF eHealth 2020. In: Working Notes of Conference and Labs of the Evaluation Forum (CLEF). CEUR Workshop Proceedings (2020)
8. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.: XLNet: Generalized autoregressive pretraining for language understanding (2019)