=Paper=
{{Paper
|id=Vol-2696/paper_166
|storemode=property
|title=Text Augmentation Techniques for Clinical Case Classification
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_166.pdf
|volume=Vol-2696
|authors=Anais Ollagnier,Hywel Williams
|dblpUrl=https://dblp.org/rec/conf/clef/OllagnierW20
}}
==Text Augmentation Techniques for Clinical Case Classification==
Text Augmentation Techniques for Clinical Case Classification

Anaïs Ollagnier[0000-0002-4349-5678] and Hywel Williams[0000-0002-5927-3367]
Computer Science, University of Exeter, Exeter EX4 4QE, UK
{a.ollagnier,h.t.p.williams}@exeter.ac.uk

Abstract. Clinical coding consists of the transformation (or classification) of patient record information into a structured or coded format using internationally recognized class codes. Coding accuracy is an ongoing challenge which has led to the organization of challenges and shared tasks to evaluate AI-enhanced, computer-assisted coding systems. In this paper we present our contribution to CodiEsp: Clinical Case Coding Task (CLEF eHealth 2020) on the automatic assignment of clinical codes (diagnoses and procedures) to clinical cases in Spanish. We approach the task as a multi-label classification problem, leverage the powerful Multilingual BERT (M-BERT) language model to represent the clinical cases, and design various deep learning architectures based on a Convolutional Neural Network and Long Short-Term Memory Network (CNN-LSTM) classifier. To handle the class-imbalance problem, we present further models based on data augmentation techniques (i.e. word-level transformations and text generation methods) for synthesizing labeled data. Models based on data augmentation pipelines obtain the best results, measured by the F1-score, in comparison to the other proposed models for both tasks. The pipeline based on word-level transformations obtains the best F1-score (0.143) for the CodiEsp-D task, while the data augmentation technique using the text generation method achieves the best F1-score (0.216) for the CodiEsp-P task.

Keywords: Medical text classification · Data augmentation · Text generation.

1 Introduction

The International Classification of Diseases (ICD) is a health care classification system which provides standardized codes for reporting diseases and health conditions1.
ICD codes are widely used in Electronic Medical Records (EMR) to describe a patient's diagnosis or treatment.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 https://www.who.int/classifications/icd/factsheet/en/ Date of access: 15th June 2020.

In current practice medical coders review a physician's clinical diagnosis (almost always recorded as free text) and then manually assign ICD codes according to coding guidelines. While the process of standardizing EMR is important for making clinical and financial decisions, manual ICD coding is expensive, time-consuming and prone to error [12,3]. Considering these constraints, automated ICD coding has become an important line of research in the Artificial Intelligence community. Traditional machine learning and deep learning techniques have been applied successfully in this context and show promising results [13,8,16,10]. However, developing an accurate computational system to support automated ICD coding is still a challenging task. The idiosyncrasies of medical language, the scarcity of hospitals using EMR and the class-imbalance problem in training datasets are among the persistent challenges [4]. Issues related to clinical coding have led to the organization of challenges and shared tasks aiming to evaluate automated clinical coding systems, such as the CLEF eHealth Evaluation Lab. CLEF eHealth2 [7], established in 2012 as part of the Conference and Labs of the Evaluation Forum (CLEF), is a workshop offering evaluation labs (datasets, evaluation frameworks, and events) in the medical and biomedical domain on different tracks such as information extraction, information management and information retrieval in mono- and multilingual settings.
During CLEF eHealth 2020 the Clinical Case Coding in Spanish Shared Task3 (CodiEsp) [11] was introduced with the aim of evaluating systems devoted to the automatic assignment of ICD codes to EMR in Spanish. The task includes three sub-tasks: (1) CodiEsp Diagnosis Coding (CodiEsp-D), which consists of automatically assigning ICD10-Clinical Modification codes to clinical cases in Spanish; (2) CodiEsp Procedure Coding (CodiEsp-P), which focuses on assigning ICD10-Procedure codes to clinical cases in Spanish; (3) CodiEsp Explainable Artificial Intelligence (CodiEsp-X), which evaluates the explainability/interpretability of the proposed systems (i.e. systems are requested to return the text spans supporting the ICD10 code assignment).

This paper presents our contribution to the CLEF eHealth CodiEsp 2020 CodiEsp-D and CodiEsp-P sub-tasks. In total five models were submitted during the official evaluation, all based on a Convolutional Neural Network and Long Short-Term Memory Network (CNN-LSTM) classifier. Multilingual BERT (M-BERT) achieved the best performances in the CLEF eHealth 2019 Multilingual Information Extraction task [15], hence we proposed to leverage the M-BERT pre-trained model as part of various deep learning architectures. Then, in order to handle the class-imbalance problem, we designed data augmentation pipelines exploring word-level transformations and a text generation method for synthesizing labeled data. To compare all the proposed systems we carried out empirical comparisons against a standard CNN-LSTM architecture, used here as a baseline.

2 https://clefehealth.imag.fr/ Date of access: 18th June 2020.
3 https://temu.bsc.es/codiesp/ Date of access: 18th June 2020.

2 Data

The CodiEsp corpus4 consists of a set of 1000 clinical cases manually annotated by clinical coding professionals5. Documents were coded with clinical diagnosis and procedure codes from the Spanish official version of ICD10-Clinical Modification and ICD10-Procedure.
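Each coded document therefore carries a set of ICD10 codes, which makes this a multi-label dataset. As a minimal, self-contained sketch (the two-column layout assumed below, document identifier followed by one code per row, and all function names are illustrative, not the exact CodiEsp release format), per-document code assignments can be loaded and binarized like this:

```python
import csv
from collections import defaultdict

def load_code_assignments(tsv_path):
    """Map each clinical-case identifier to its set of assigned ICD10 codes.

    Assumes each tab-separated row holds a document identifier followed
    by one code; the real CodiEsp files may carry extra columns.
    """
    codes_by_doc = defaultdict(set)
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            doc_id, code = row[0], row[1]
            codes_by_doc[doc_id].add(code.lower())
    return dict(codes_by_doc)

def to_multilabel_targets(codes_by_doc):
    """Turn the mapping into binary indicator vectors for multi-label training."""
    vocabulary = sorted({c for codes in codes_by_doc.values() for c in codes})
    index = {code: i for i, code in enumerate(vocabulary)}
    targets = {}
    for doc_id, codes in codes_by_doc.items():
        vector = [0] * len(vocabulary)
        for code in codes:
            vector[index[code]] = 1
        targets[doc_id] = vector
    return vocabulary, targets
```

The indicator-vector form matches the multi-label classification framing used throughout the paper.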
The released corpus has around 16,504 sentences and 396,988 words, with an average of 396.2 words per clinical case. The corpus has been randomly sampled into three subsets: the training set (500 clinical cases), the development set and the test set (250 clinical cases each). Each subset provides clinical cases in plain text format stored as single files (each filename corresponds to a unique clinical case identifier) and a tab-separated file with either ICD10-Diagnostico (equivalent to ICD10-CM) or ICD10-Procedimiento (equivalent to ICD10-PCS) code assignments according to the target task. Table 1 summarises the top-5 most frequent ICD10-Diagnostico and ICD10-Procedimiento codes from the training and development datasets for both tasks.

Table 1. Top-5 most frequent ICD10-Diagnostico and ICD10-Procedimiento codes.

  CodiEsp-D               CodiEsp-P
  R52    118 (15.73%)     bw03zzz  74 (9.87%)
  R69    106 (14.13%)     bw40zzz  61 (8.13%)
  R50.9   99 (13.2%)      bw20     56 (7.47%)
  i10     81 (10.8%)      bw24     36 (4.8%)
  R60.9   70 (9.33%)      4a02x4z  34 (4.53%)

As we can observe in Table 1, the datasets provided are highly imbalanced (i.e. there is a high disparity between classes), with 15.73% and 9.87% respectively as the highest frequency rates for the CodiEsp-D and CodiEsp-P tasks. In total, 10,711 codes were assigned for both tasks, of which 1819 are unique in the CodiEsp-D datasets and 608 in the CodiEsp-P datasets. The proportion of rare classes (i.e. classes with only one observation) is also high: 1022 classes (i.e. 56.18%) in the CodiEsp-D datasets and 393 (i.e. 64.64%) in the CodiEsp-P datasets. These findings led us to investigate data augmentation techniques, which have shown promise in scarce labeled data situations [17,2]. Moreover, to expand the training and development corpora, the organizers have also released several additional data resources6 including medical literature abstracts (i.e.
abstracts from Lilacs and Ibecs with ICD10 codes), linguistic resources, gazetteers and a machine-translated version from English of CodiEsp corpus clinical cases.

4 CodiEsp corpus available online: https://zenodo.org/record/3837305#.XvsEN5bTVhF Date of access: 30th June 2020.
5 Information about annotation guidelines: https://zenodo.org/record/3632523#.Xvw2N5bTU5m Date of access: 1st July 2020.
6 https://temu.bsc.es/codiesp/index.php/2019/09/19/resources/ Date of access: 30th June 2020.

3 System architectures

Empirical studies conducted on the development sets for each task7 found best performance using a CNN-LSTM classifier. Figure 1 details the architecture used and the shared parameters for both tasks. The model takes as input a time-ordered sequence of tokens (words) of arbitrary length (truncated to 396 words, which corresponds here to the average number of words per document, and then padded with zero vectors) and outputs a document-level prediction. After the embedding layer, the layer corresponding to the CNN classifier (one-headed) is introduced using a configuration of 100 parallel feature maps and a kernel size of 3. Immediately afterwards, an LSTM layer is added (set to 100 internal units). Then a dense layer of 64 nodes with ReLU is inserted. Finally an output layer is used with one node containing a softmax function. The models have been trained using the Adam optimizer, with a learning rate of 0.001 and a batch size fixed to 32 for both tasks.

Fig. 1. System architecture details.

4 Proposed Methods

All proposed methods were trained and tested using the Spanish version of the released corpora. Concerning the preprocessing steps, clinical cases were converted to lowercase and stop-words were removed. After the tokenization process, all tokens based only on non-alphanumeric characters and all short tokens (with < 3 characters) were also deleted.
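The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and a toy Spanish stop-word list (the paper does not specify which tokenizer or stop-word resource was used):

```python
import re

# Illustrative Spanish stop-word list; the actual resource used in the
# paper is not specified.
STOPWORDS = {"de", "la", "el", "en", "que", "por", "con", "los", "las", "una"}

def preprocess(clinical_case: str) -> list[str]:
    """Lowercase, tokenize, then drop stop-words, tokens made only of
    non-alphanumeric characters, and tokens shorter than 3 characters."""
    tokens = re.findall(r"\S+", clinical_case.lower())
    kept = []
    for token in tokens:
        if token in STOPWORDS:
            continue
        if not any(ch.isalnum() for ch in token):
            continue  # token consists only of non-alphanumeric characters
        if len(token) < 3:
            continue
        kept.append(token)
    return kept
```

For instance, `preprocess("Paciente de 50 años con antecedente --- de litiasis")` keeps only `paciente`, `años`, `antecedente` and `litiasis`: the stop-words, the two-character token `50` and the punctuation run `---` are all removed.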
7 Evaluations (not reported here) were conducted on LSTM, BiLSTM, BiGRU, CNN and CNN-LSTM using the same architecture as presented here.

In total five models were submitted to the official evaluation; we provide below a detailed description of each of them:

- CNN-LSTM: this default approach (used here as a baseline) is based on the architecture presented in section 3.

- M-BERT: this approach is based on the BERT language model [5]. Briefly, BERT, which is based on a transformer architecture, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. Several pre-trained language models (PTM) have been built from this text encoding model, which has previously been successfully applied to various biomedical NLP tasks [9]. In the CLEF eHealth 2019 Multilingual Information Extraction task, models relying on BERT and its variants (BioBERT and M-BERT) obtained the best results [1,6,15]. Here, we propose to explore a Sequential Transfer Learning-based technique (STL) using M-BERT8. In an STL scenario the source and target tasks are different and training is performed in sequence. Typically, STL consists of two stages: a pre-training phase in which general representations are learned on a source task or domain, and an adaptation phase during which the learned knowledge is transferred to the target task or domain [14]. In the proposed models the pre-trained language representation (i.e. M-BERT) is introduced during the pre-training phase and then the CNN-LSTM architecture (cf. section 3) is applied during the adaptation phase to fine-tune models to the target task.

- WordNet: this approach explores a traditional textual data augmentation technique consisting of a word-level transformation: synonym replacement. Introduced in [17], the application of this kind of local change was shown to improve performance on text classification tasks, especially for small training datasets.
The process of synonym replacement is implemented as a preprocessing step in a data generator pipeline (this pipeline generates batches of tensor data with real-time data augmentation). For each batch, 10% of each document's words (randomly selected, excluding stopwords) are substituted by WordNet synonyms9. Finally, the edited documents are used to feed models relying on a CNN-LSTM architecture. Below is an example of synonym replacement on a clinical case sample.

  - original: Paciente de 50 años con antecedente de litiasis renal de repetición que consultó por hematuria recidivante y sensación de malestar. El estudio citológico seriado de orinas demostró la presencia de células atípicas sospechosas de malignidad.
  - edited: Paciente de 50 años con antecedente de litiasis nefrítico de repetición que consultó por hematuria recidivante y percepción de malestar. El estudio citológico seriado de orinas demostró la apariencia de células atípicas sospechosas de malignidad.

8 BioBERT, trained from the original BERT pre-trained model and medical resources in English, cannot be applied to clinical cases in Spanish.
9 Synonym replacement is performed using the Python library NLPAug.

- WordNet M-BERT: based on the two previous approaches, WordNet and M-BERT, we explore the combination of a word-level data augmentation technique and the M-BERT pre-trained language model representation. Also implemented as a preprocessing step of the data generator pipeline, synonym replacement uses the same setup as in the original approach, i.e. 10% of each document's words are substituted. Then, as introduced in the M-BERT approach, models are trained as part of an STL scenario.

- TEXT GEN: in this approach we propose to explore a novel data augmentation technique based on a text generation method. This strategy was recently introduced for synthesizing labeled data to improve text classification tasks.
Approaches leveraging text generation have shown promise, outperforming state-of-the-art techniques for data augmentation, specifically for handling scarce data situations [2]. The proposed data augmentation pipeline consists of two stages: a pre-training phase in which a language model is learned from the given training sets, and a generative phase during which the pre-trained language model is used to generate artificial data. In detail, the pre-trained language model is built using an n-gram modeling approach which estimates n-gram distribution probabilities learned from a given corpus. The language models are trained for both tasks using the CNN-LSTM architecture presented in section 3. During the generative phase the appropriate pre-trained language model is introduced to generate artificial data as a preprocessing step in the data generator pipeline. 30% of each document is altered for each mini-batch: each document is split into sentences, then 30% of the sentences are replaced by synthesized data. To synthesize new data, the first 30% of a given sentence is used as a seed and then extended according to the average length of sentences in the corpus (set to 20 words). Below is an example of a synthesized sentence using the pre-trained language model learned from the CodiEsp-D training set.

  - original: Analytical analysis showed hydroxyvitamin lion ponesium sodium.
  - edited: Analytical analysis showed sequence made transoperative urine outpatient image microbiological markers flap immunohistochemical intravenous dorsolumbar remained level signs 1788 partially establishing transplantation.

5 Experimental Results on the Test sets

Models were trained on a workstation with a 36-core CPU and an AMD FirePro W2100 GPU. Systems were evaluated according to the following metrics: Mean Average Precision (MAP), MAP@30, precision, recall and F1-score.
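The generative phase of the TEXT GEN pipeline described in Section 4, seeding on the first 30% of a sentence and extending it up to the corpus-average length, can be sketched as follows. This is a toy stand-in: the paper's language model combines n-gram probability estimation with the CNN-LSTM architecture, whereas here a plain bigram successor table is used purely for illustration, and all names are assumptions:

```python
import random
from collections import defaultdict

def train_bigram_model(sentences):
    """Collect observed next words per word from tokenized training
    sentences (a minimal stand-in for the paper's n-gram language model)."""
    successors = defaultdict(list)
    for tokens in sentences:
        for current, nxt in zip(tokens, tokens[1:]):
            successors[current].append(nxt)
    return successors

def synthesize(sentence, successors, target_len=20, seed_ratio=0.3, rng=None):
    """Keep the first 30% of the sentence as a seed, then extend it word
    by word until the corpus-average sentence length is reached."""
    rng = rng or random.Random(0)
    tokens = sentence.split()
    seed_len = max(1, int(len(tokens) * seed_ratio))
    generated = tokens[:seed_len]
    while len(generated) < target_len:
        candidates = successors.get(generated[-1])
        if not candidates:
            break  # dead end: no continuation observed in training data
        generated.append(rng.choice(candidates))
    return " ".join(generated)
```

Sampling from raw successor lists weights continuations by their training frequency, which approximates the n-gram distribution probabilities the paper estimates.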
For experimental purposes two versions of the F1-score metric are computed: the F1-score measure, which considers the full code for both tasks, and the F1-score CAT, which considers only the first three digits of ICD10-Clinical Modification codes (e.g. codes R20.1 and R20.9 are mapped to R20) and the first four digits of ICD10-Procedure codes (e.g. the code bw40zzz is mapped to bw40). Table 2 summarizes the results obtained for both the CodiEsp-D task and the CodiEsp-P task on the test sets. For readability purposes only the MAP, the F1-score and the F1-score CAT are reported. Due to lack of time, not all models for each task were proposed at the official evaluation. However, we performed the missing evaluations using the evaluation library released by the organizers11; those results are presented in italic font.

11 Evaluation library: https://temu.bsc.es/codiesp/index.php/2019/09/19/evaluation-library/. This library only allows computing the MAP. Date of access: 13th July 2020.

Table 2. Official Results of the Clinical Case Coding in Spanish Shared Task (eHealth CLEF 2020).

                       CodiEsp-D                       CodiEsp-P
  Model        MAP    F1-score  F1-score CAT    MAP    F1-score  F1-score CAT
  Baseline     0.076  0.114     0.166           0.123  0.114     0.12
  M-BERT       0.081  0.13      0.151           0.123  0.124     0.129
  WN M-BERT    0.078  0.136     0.153           0.125  0.117     0.121
  WN           0.082  0.143     0.165           0.132  -         -
  TEXT GEN     0.071  -         -               0.121  0.145     0.216

As we can observe, the results obtained depend on both the task and the model used. For the CodiEsp-D task the WordNet model (WN) achieves the best MAP, followed closely by the model based on the pre-trained language model M-BERT. For the F1-score, the WN model also obtains the best performance against the other proposed approaches (+0.029 over the baseline). For the F1-score CAT, the baseline is first-ranked, slightly outperforming the WN model (−0.001 from the baseline).
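The code truncation behind the F1-score CAT metric, described above, amounts to a small mapping function. The sketch below is illustrative (the function name and task labels are assumptions, not part of the official evaluation library):

```python
def to_category(code: str, task: str) -> str:
    """Map a full code to its category for the F1-score CAT metric:
    the first three characters of an ICD10-CM code (e.g. R20.1 -> R20)
    and the first four characters of an ICD10-PCS code
    (e.g. bw40zzz -> bw40)."""
    if task == "CodiEsp-D":
        # Drop any subcategory after the dot, then keep the 3-character stem.
        return code.split(".")[0][:3]
    return code[:4]
```

Scoring against these truncated categories is what makes F1-score CAT more forgiving than the full-code F1-score, since near-miss predictions within the same category still count as correct.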
For the CodiEsp-P task the best MAP is obtained using the combination of M-BERT and the data augmentation technique based on synonym replacement (WN M-BERT), while the TEXT GEN model achieves the best performance for both F1-scores (+0.031 for the F1-score and +0.096 for the F1-score CAT, measured relative to the baseline). Concerning the unofficial results (in italic font), the WN model is first-ranked on the CodiEsp-P task, while the TEXT GEN model is the least efficient for the CodiEsp-D task.

In the overall evaluation, the use of an STL-based architecture combined with a pre-trained model has shown its efficiency, outperforming the baseline on both tasks on the majority of evaluation metrics. Concerning data augmentation techniques, despite missing evaluations, the proposed techniques produced strong results in comparison with the other models, outperforming both the baseline and the models based on M-BERT on both tasks on the majority of evaluation metrics.

6 Conclusion

In this paper we presented our contribution to the CLEF eHealth CodiEsp 2020 CodiEsp-D and CodiEsp-P sub-tasks. In total we proposed five models during the official evaluation, in which we explored both the powerful Multilingual BERT (M-BERT) language model and two data augmentation techniques, word-level transformation and text generation methods, for synthesizing labeled data. Models based on data augmentation pipelines achieved the best performances in comparison to the other proposed models for both tasks on the majority of evaluation metrics.

References

1. Amin, S., Neumann, G., Dunfield, K., Vechkaeva, A., Chapman, K.A., Wixted, M.K.: MLT-DFKI at CLEF eHealth 2019: Multi-label classification of ICD-10 codes with BERT. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2019)
2. Anaby-Tavor, A., Carmeli, B., Goldbraich, E., Kantor, A., Kour, G., Shlomov, S., Tepper, N., Zwerdling, N.: Do not have enough data? Deep learning to the rescue!
In: AAAI Conference on Artificial Intelligence. pp. 7383–7390 (2020)
3. Campbell, S., Giadresco, K.: Computer-assisted clinical coding: A narrative review of the literature on its benefits, limitations, implementation and impact on clinical coding professionals. Health Information Management Journal 49(1), 5–18 (2020)
4. Catling, F., Spithourakis, G.P., Riedel, S.: Towards automated clinical coding. International Journal of Medical Informatics 120, 50–61 (2018)
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR arXiv:1810.04805 (2018)
6. Dörendahl, A., Leich, N., Hummel, B., Schönfelder, G., Grune, B.: Overview of the CLEF eHealth 2019 multilingual information extraction (2019)
7. Goeuriot, L., Suominen, H., Kelly, L., Miranda-Escalada, A., Krallinger, M., Liu, Z., Pasi, G., Gonzales, G.S., Viviani, M., Xu, C.: Overview of the CLEF eHealth evaluation lab 2020. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). LNCS volume 12260 (2020)
8. Kavuluru, R., Rios, A., Lu, Y.: An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records. Artificial Intelligence in Medicine 65(2), 155–166 (2015)
9. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)
10. Li, F., Yu, H.: ICD coding from clinical text using multi-filter residual convolutional neural network. In: AAAI Conference on Artificial Intelligence. pp. 8180–8187 (2020)
11.
Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of eHealth CLEF 2020. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2020)
12. O'Dowd, A.: Coding errors in NHS cause up to £1bn worth of inaccurate payments. BMJ: British Medical Journal (Online) 341 (2010)
13. Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F., Elhadad, N.: Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association 21(2), 231–237 (2014)
14. Ruder, S.: Neural Transfer Learning for Natural Language Processing. Ph.D. thesis, National University of Ireland, Galway (2019)
15. Sänger, M., Weber, L., Kittner, M., Leser, U.: Classifying German animal experiment summaries with multi-lingual BERT at CLEF eHealth 2019 Task 1. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2019)
16. Shi, H., Xie, P., Hu, Z., Zhang, M., Xing, E.P.: Towards automated ICD coding using deep learning. CoRR arXiv:1711.04075 (2017)
17. Wei, J., Zou, K.: EDA: Easy data augmentation techniques for boosting performance on text classification tasks. CoRR arXiv:1901.11196 (2019)