=Paper=
{{Paper
|id=Vol-2696/paper_112
|storemode=property
|title=A study of Machine Learning models for Clinical Coding of Medical Reports at CodiEsp 2020
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_112.pdf
|volume=Vol-2696
|authors=Marco Polignano,Vincenzo Suriano,Pasquale Lops,Marco de Gemmis,Giovanni Semeraro
|dblpUrl=https://dblp.org/rec/conf/clef/PolignanoSLGS20
}}
==A study of Machine Learning models for Clinical Coding of Medical Reports at CodiEsp 2020==
A study of Machine Learning models for Clinical Coding of Medical Reports at CodiEsp 2020

Marco Polignano, Vincenzo Suriano, Pasquale Lops, Marco de Gemmis, and Giovanni Semeraro
University of Bari Aldo Moro, Dept. of Computer Science, Bari, Italy.
marco.polignano@uniba.it, v.suriano10@studenti.uniba.it, pasquale.lops@uniba.it, marco.degemmis@uniba.it, giovanni.semeraro@uniba.it

Abstract. The task of identifying one or more diseases associated with a patient's clinical condition is often very complex, even for doctors and specialists. This process is usually time-consuming and has to take into account different aspects of what has occurred, including elicited symptoms and previous healthcare situations. The medical diagnosis is often provided to patients in written form without any correlation with a national or international standard. Even though the WHO (World Health Organization) released the ICD-10 international glossary of diseases, almost no doctor has enough time to manually associate the patient's clinical history with international codes. The CodiEsp task at CLEF 2020 addressed this issue by proposing the development of an automatic system for this task. Our solution investigated different machine learning strategies in order to identify an approach to face the challenge. The main outcome of the experiments is that a strategy based on BERT for pre-filtering and one based on BiLSTM-CNN-SelfAttention for classification provide valuable results. We carried out several experiments on a subset of the training set to tune the final model submitted to the challenge. In particular, we analyzed the impact of the algorithm, the input encoding strategy, and the thresholds for multi-label classification. A further set of experiments was carried out during a post hoc analysis. The experiments confirmed that the strategy submitted to the CodiEsp task is the best performing one among those evaluated, and it obtained a final Mean Average Precision (MAP) of 0.202 on the test set. To support future developments of the proposed approach and the replicability of the experiments, we made the source code publicly accessible.

Keywords: BERT, CNN, BiLSTM, Self Attention, Machine Learning, ICD-10, Medical Diagnosis, Deep Learning, Clinical Coding

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

Clinical coding [30] is the task of associating unique identification codes with a clinical diagnosis, or sometimes with a portion of it. Doctors and specialists associate the diagnosis given to the patients with the corresponding international classification only in rare cases. One of the most widely adopted standards is ICD-10 [22], the tenth version of the international medical glossary released by the WHO (World Health Organization). Although this annotation task may not seem very useful for medical purposes, it is extremely relevant for statistical purposes, automatic diagnosis analysis of clinical records, and data interoperability across healthcare systems in different countries. Indeed, if each diagnosis were fully digitized with a worldwide standard, every doctor in the world could uniquely interpret our medical records and provide us with the appropriate treatment.
In addition, diagnostic patterns used by clinicians could be identified to improve automatic disease prediction strategies and provide automatic specialist support for decision making. These observations strongly support the need for automated systems that allow clinicians to perform this task quickly and without human intervention. From the technical point of view, this task is very challenging because it requires an artificial intelligence system able not only to assign more than one class label to the medical report, choosing from a very high number of options, but also to identify the fragment of text associated with each choice. The CodiEsp task at CLEF 2020 [30, 13, 21] faces this problem by releasing a corpus of 1,000 clinical case studies manually selected by practicing physicians and clinical documentalists. Using that dataset, we carried out an experimental study by testing several machine learning approaches, and we submitted the best performing one to the competition. In the following, we first analyze the state of the art concerning the reference topic (Sect. 2), then we provide the details of the proposed models (Sect. 3). In Sect. 4, we thoroughly present the performed experiments, and we finally discuss the main outcomes and possible future work.

2 Related Work

The research community has addressed clinical coding tasks for a long time, and numerous scientific contributions have been proposed on the topic. Indeed, the CLEF (Conference and Labs of the Evaluation Forum) conference has been working on eHealth and Information Extraction since 2013, but the oldest corpus on the subject dates back to 1973 [30]. Chapman et al. [9], already in 1999, stated that a computer algorithm could solve the clinical coding task better than a human being. This assertion is intuitive because, even for an expert in the field, it can be very complex to assign a specific code to the result of a medical diagnosis, choosing it from more than 70,000 currently available ICD-10 codes. However, in 2006, Kukafka et al. [18] observed that when the identification of the right code is not obvious, Natural Language Processing (NLP) tools can also easily lack accuracy. Today, NLP and machine learning techniques are widespread and are receiving substantial attention from research communities. Among the best performing systems, the one proposed by Miftahutdinov [19] at CLEF eHealth 2017 uses an LSTM on a TF-IDF representation of the text to identify the most suitable ICD-10 code for the input sequence. This approach obtained an F1 score of 0.85 on a classification problem with 1,256 distinct classes. During the same competition, Cabot et al. [7] used an NLP pipeline to obtain the highest F1 score of the competition on the data provided in French (i.e., 0.764). Several preprocessing steps were performed, including stop word filtering; then a method based on the Double Metaphone phonetic encoding algorithm was used for a first approximate term search. Finally, a Weighted Distance Score algorithm was developed to rank the list of candidate terms. The most likely term, i.e., the one with the highest score, is retained as the matching ICD-10 code for the phrase. In 2018, Atutxa et al. [3] proposed a three-level sequence-to-sequence neural network-based approach. The first neural network tries to assign one set of ICD-10 codes to the whole document; the codes are then refined to assign one set to each line, and finally one specific code.
This strategy allowed the model to obtain an F1 score between 0.7086 and 0.9610, depending on the language of the dataset on which the system was evaluated. Almagro et al. [1] proposed a supervised learning system based on a multilayer perceptron, SVMs, and a One-vs-Rest strategy. The approach trains a binary model for each of the target ICD-10 codes, indicating the presence or absence of the code. The model was able to obtain an F1 score of 0.910. At CLEF eHealth 2019, the best system was proposed by Sänger et al. [28], obtaining an F1 score of 0.80. The proposed model utilized a multilingual BERT [10] text encoding model, fine-tuned on additional training data of German clinical trials also annotated with ICD-10 codes. The model is extended by a single output layer to produce probabilities for specific ICD-10 codes. Amin et al. [2] participated in the same task, obtaining second place. They evaluated various approaches, such as Convolutional Neural Networks (CNN) and Attention models, among others. They obtained the best results when relying on Bidirectional Encoder Representations from Transformers (BERT) and, more specifically, on BioBERT, which was trained on biomedical documents. Considering the successful models presented in the state of the art, we decided to use a machine learning approach that combines CNNs, Bidirectional LSTMs, Attention Layers, and BERT. Details of the proposed architecture are provided in Sect. 3.

3 Resources and Model Architectures

3.1 CodiEsp 2020 corpora

The CodiEsp track at CLEF 2020 [30, 13, 21] contains three sub-tracks (2 main and 1 exploratory) about the analysis of clinical reports:

– CodiEsp Diagnosis Coding main sub-task (CodiEsp-D): it requires automatic ICD10-CM [CIE10 Diagnóstico] code assignment. This sub-track evaluates systems that predict ICD10-CM codes (in the Spanish translation, CIE10-Diagnóstico codes).
– CodiEsp Procedure Coding main sub-task (CodiEsp-P): it requires automatic ICD10-PCS [CIE10 Procedimiento] code assignment. This sub-track evaluates systems that predict ICD10-PCS codes (in the Spanish translation, CIE10-Procedimiento codes).
– CodiEsp Explainable AI exploratory sub-task (CodiEsp-X): systems are required to submit the reference to the predicted codes (both ICD10-CM and ICD10-PCS). The correctness of the provided reference is assessed in this sub-track, in addition to the code prediction.

Table 1. Statistics on the training datasets.

                                          ICD10-CM   ICD10-PCS
  Number of code sections                 21         16
  Total number of possible codes          71,486     72,081
  Unique codes used in the training set   1,788      546
  Total annotations in the training set   8,494      2,587

For each task, a dataset for training, development, and test has been released. Generally speaking, the CodiEsp corpora contain manually annotated clinical reports with the corresponding clinical codes. The clinical reports are written in Spanish, and they are annotated with the CIE10 glossary (the Spanish version of ICD10-CM and ICD10-PCS). The training set contains 500 clinical cases, while the development and the test set provide 250 clinical cases each. The CodiEsp corpus format is plain text with UTF-8 encoding, where each clinical case is stored in a single file whose name is the unique clinical case identifier. The final collection of the 1,000 clinical cases of the corpus contains 16,504 sentences, with 16.5 sentences per clinical case on average. It contains 396,988 words, with 396.2 words per clinical report on average.

For sub-tasks 1 and 2 of the CodiEsp task (CodiEsp-D and CodiEsp-P), the training files contain the following fields: [articleID, label, ICD10-code, text-reference].

– ArticleID: the identifier of the clinical text, which corresponds to the name of the file.
– Label: the annotation type, i.e., diagnostico or procedimiento.
– ICD10-code: the ICD10 code.
– Text-reference: the word or phrase in the clinical text that the code refers to.
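To make this format concrete, the following minimal Python sketch (the file path and the function name are ours, not part of the released material; the field order follows the list above) groups the tab-separated training annotations by clinical case:

```python
import csv
from collections import defaultdict

def load_annotations(tsv_path):
    """Group CodiEsp training annotations by clinical case.

    Each row has the fields described above:
    articleID <TAB> label <TAB> ICD10-code <TAB> text-reference
    """
    codes_by_case = defaultdict(list)
    with open(tsv_path, encoding="utf-8") as handle:
        for article_id, label, icd10_code, text_ref in csv.reader(handle, delimiter="\t"):
            codes_by_case[article_id].append((label, icd10_code, text_ref))
    return codes_by_case

# e.g. annotations = load_annotations("train/trainD.tsv")  # hypothetical path
```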
In Fig. 1 and Fig. 2, it is possible to observe how the annotations provided with the training data of the two tasks are distributed among the sections of the two ICD10 vocabularies. It is immediately clear that the distribution is not uniform, and some classes are more represented than others. For example, in the training set of task CodiEsp-D, class 18 is the most represented one, with 2,214 annotations, while class 16 is the least represented, with only 23 examples. In Table 1, it is possible to observe that, considering all the possible ICD10 codes, the training dataset covers only 1,788 unique codes for CodiEsp-D and 546 codes for CodiEsp-P. Based on this observation, we decided to use models that only predict codes with at least one example in the training set.

Fig. 1. Distribution of annotations in the training dataset among the ICD10-CM sections.

Fig. 2. Distribution of annotations in the training dataset among the ICD10-PCS sections.

The organizers of the CodiEsp task also released an additional resource that extends the previous datasets. It contains the descriptions of the codes in the ICD10 vocabulary, both "Diagnosis" and "Procedures" (Fig. 3). This resource contains two files:

– "codiesp-D codes.tsv": it contains all the 98,288 ICD10-CM codes, along with their descriptions in Spanish and English.
– "codiesp-P codes.tsv": it contains all the 87,170 ICD10-PCS codes, along with their descriptions in Spanish and English. It contains the codes up to the fourth nesting level of the hierarchy.

Fig. 3. Example of the code descriptions provided as an additional task resource.

3.2 Classification model proposed for the CodiEsp task

The clinical coding task has been approached using different machine learning strategies that are commonly used for classification tasks in the field of Natural Language Processing (NLP). In particular, we focused on the use of deep learning techniques such as LSTM [15], CNN [17], Attention Layers [32], and Bidirectional Encoder Representations from Transformers (BERT) [10].

Long short-term memory model. The neural network model based on long short-term memory (LSTM) was proposed in 1997 [15]. Since then, it has been widely used with data that have an inherent sequential structure, such as text. LSTMs are part of the family of sequential models based on recurrent neurons [12]. In particular, an architecture of this type is based on the idea that the state of a specific neuron depends on its state at the previous step t−1. The natural evolution of a model based on recurrent neurons introduces a memory. This structure allows the recurrent neuron to depend not only on its own state at step t−1, but also on the states of the different neurons at step t−n. This idea is at the base of architectures such as RNN, LSTM, and GRU. Among them, the LSTM has the peculiarity of also having a forget gate able to manage the amount of information to be kept in memory. At each step, a portion of the memory is deleted and another one is added.
These features make the model state-of-the-art for many NLP applications, such as machine translation [35], automatic summarization [29], parsing, and sentiment analysis [33, 5]. In our architecture, we used the LSTM in its bidirectional variant.

Convolutional neural network. Convolutional neural networks (CNN) were born from an accurate study of how the portion of the brain cortex responsible for vision works [16]. This has promoted their wide use in computer vision tasks for image recognition since 1980 [11]. Recently, they have also been widely used in text analysis tasks, thanks to the fast increase in computational power available to everyone. A convolutional neuron is able to concentrate only on a portion of the input data [12], e.g., a set of pixels in an image. A layer of neurons using the same filter (or convolution kernel) produces a feature map, which highlights the areas of the input that are most similar to the filter. During training, a CNN finds the most useful filters for its task and learns to combine them into more complex patterns. A CNN is usually followed by a pooling layer, whose goal is to subsample the input in order to reduce the computational load, the memory usage, and the number of parameters. A pooling layer typically works on every input channel independently, so the output depth is the same as the input depth. A neural model using CNNs usually alternates CNN layers and pooling layers, with a final dense layer aimed at prediction (classification or regression).

Self attention. Similarly to the attention strategy proposed in [4], self-attention, also known as intra-attention, gives the model the ability to weigh the vectors of the single words of the sentence differently, according to the similarity of the neighboring tokens. The level of attention can give us an idea of which features the network is focusing on most during learning and the subsequent classification. In particular, we consider an additive context-aware self-attention over the whole set of input words (Eq. 1) [34]:

\[
\begin{aligned}
h_{t,t'} &= \tanh\!\left(x_t^{\top} W_t + x_{t'}^{\top} W_{t'} + b_t\right) \\
e_{t,t'} &= \sigma\!\left(W_e\, h_{t,t'} + b_e\right) \\
a_{t,t'} &= \mathrm{softmax}\!\left(e_{t,t'}\right) \\
l_t &= \sum_{t'=1}^{n} a_{t,t'}\, x_{t'}
\end{aligned}
\tag{1}
\]

where σ is the element-wise sigmoid function; W_t and W_{t'} are the weight matrices corresponding to the hidden states h_t and h_{t'}; W_e is the weight matrix corresponding to their non-linear combination; b_t and b_e are the bias vectors. The attention-focused hidden-state representation l_t of a token at timestep t is given by the weighted summation of the representations x_{t'} of all tokens at timesteps t'. We used an existing self-attention implementation for Keras (https://github.com/CyberZHG/keras-self-attention).
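To make Eq. 1 concrete, the following NumPy sketch (our illustration, not the authors' code; in the actual model the keras-self-attention package linked above implements this layer) computes the attention-weighted token representations for a toy sequence:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def additive_self_attention(X, W_t, W_tp, b_t, w_e, b_e):
    """Additive context-aware self-attention of Eq. 1.

    X: (n, d) token vectors; W_t, W_tp: (d, k) projections;
    b_t: (k,) bias; w_e: (k,) scoring weights; b_e: scalar bias.
    Returns L: (n, d), where row t is l_t = sum over t' of a_{t,t'} x_{t'}.
    """
    H = np.tanh((X @ W_t)[:, None, :] + (X @ W_tp)[None, :, :] + b_t)  # (n, n, k)
    E = sigmoid(H @ w_e + b_e)   # (n, n) scores e_{t,t'}
    A = softmax(E, axis=1)       # normalize over t' for each t
    return A @ X                 # (n, d) attention-weighted representations

rng = np.random.default_rng(42)
X = rng.normal(size=(4, 6))      # 4 tokens, 6-dimensional embeddings
L = additive_self_attention(X,
                            rng.normal(size=(6, 3)), rng.normal(size=(6, 3)),
                            rng.normal(size=3), rng.normal(size=3), 0.0)
print(L.shape)  # (4, 6)
```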
BERT. Bidirectional Encoder Representations from Transformers (BERT) [10] is a deep learning model based on the Transformer concept [32]. In particular, a Transformer architecture can be seen as a stack of N input encoding modules and M decoding modules that produce an output using multi-head attention and feed-forward layers. The encoder and decoder blocks are identical in number, and they are stacked on top of each other. The idea behind a Transformer architecture is to formalize the dependencies between input and output without the use of recurrent neural networks. BERT uses a modified version of this architecture in which only encoding layers are present. In particular, the base version of BERT uses 12 layers, the large version 24.

BERT uses two different training strategies: "masking" and "next sentence prediction". The first strategy trains the model to recognize certain words of the input sentence that have been appropriately hidden; a fixed portion of the words in the sentence (15% in the original BERT) is masked. The second one consists of guessing the sentence that follows the input sentence. During training, 50% of the inputs are pairs in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. This training strategy makes BERT an extremely reliable model that can be very accurate in formalizing the semantics of words considering their context. As a result, the pre-trained BERT model can be refined with only one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without significant changes to the task-specific architecture. There are currently several versions of BERT, also trained on data in languages other than English, such as BETO [8] for Spanish and AlBERTo [26] for Italian.

The machine learning model we proposed for the CodiEsp challenge uses all the previously described architectures. Specifically, we decided to use a BERT-based classifier to perform a pre-filtering operation, in order to select a subset of sentences possibly referring to a clinical state. Later, the candidate sentences are submitted to a classifier based on BiLSTM, CNN, and self-attention that assigns them one or more clinical codes. The architecture of the final model proposed for the clinical coding task is shown in Fig. 4. It is worth noting that, using the proposed strategy, it was not possible to participate in task 3 of the competition (CodiEsp-X), which requires the identification of the fragment of text referring to the code. Indeed, we do not focus on the portion of text that specifically refers to a disease; instead, we use a classification model working on the whole sentence.

Fig. 4. Architecture of the model used as the final submission at the CodiEsp challenge.

Focusing on the proposed model in Fig. 4, we observe that the input text is first provided to a BERT model. Its goal is data pre-filtering: selecting, among all sentences, those talking about a symptom, disease, or treatment. This is a mandatory step because our BERT configuration accepts as input only pieces of text not exceeding 128 tokens. For this reason, we work on the task at the sentence level, splitting the original clinical report into many sentences that may or may not be associated with one or more codes. As the pre-trained BERT model, we decided to use BETO [8], a Spanish pre-trained version of BERT. The authors trained BETO using 12 self-attention layers with 16 attention heads each and 1,024 as hidden size. They used all the data from Wikipedia and all the Spanish-language sources of the OPUS Project [31]. These sources include the United Nations and government journals, TED Talks, subtitles, news stories, and more. The total size of the gathered corpora is comparable with the corpora used for the original BERT. We decided not to use the multilingual version of BERT because it has been shown that a version trained on the native language performs much better in many NLP tasks [27]. The sentences classified as possible references to clinical codes are then passed to the second part of our model.
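The two-stage flow can be summarized with the following minimal sketch (the naive sentence splitter and the two stub predictors are our simplifications: in the actual system, stage 1 is the fine-tuned BETO filter and stage 2 is the BiLSTM + CNN + SelfAttention network described below):

```python
from typing import Callable, List

def prefilter_and_classify(report: str,
                           is_clinical: Callable[[str], bool],
                           assign_codes: Callable[[str], List[str]]) -> List[str]:
    """Stage 1: keep only sentences that plausibly mention a symptom,
    disease, or treatment (BERT/BETO binary pre-filter).
    Stage 2: assign one or more ICD10 codes to each surviving sentence."""
    sentences = [s.strip() for s in report.split(".") if s.strip()]  # naive splitter
    predicted: List[str] = []
    for sentence in sentences:
        if is_clinical(sentence):
            predicted.extend(assign_codes(sentence))
    return sorted(set(predicted))

# Toy usage with stub predictors:
codes = prefilter_and_classify(
    "Paciente con dolor toracico. Se realiza radiografia.",
    is_clinical=lambda s: True,
    assign_codes=lambda s: ["r07.9"],
)
print(codes)  # ['r07.9']
```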
First of all, the sentences are encoded as word embeddings. In this step, we decided to use a FastText embedding strategy [6], which proved to be more effective than GloVe [24] and Word2Vec [20] when many domain-specific words occur in the dataset [25]. For the final configuration of the model, we chose the 300-dimensional embeddings released by José Cañete (https://github.com/dccuchile/spanish-word-embeddings), trained on the Spanish Unannotated Corpora (http://crscardellino.github.io/SBWCE/) containing more than 3 billion words.

We configured the LSTM network to work in its bidirectional version. We set the number of hidden units to 64 and the internal dropout value to 0.3. This choice was motivated by the need to reduce the dimensionality of the network output, in order to keep the operations carried out by the following layers computationally inexpensive. Moreover, the dropout value was used to reduce the effect of overfitting on the training data during learning. We also decided to change the activation function of the network, setting it to the hyperbolic tangent (tanh). This activation function is S-shaped and produces values in the range between -1 and 1, making the layer output more centered around 0. Moreover, it produces a larger gradient than the sigmoid function, helping to speed up convergence. A self-attention level is added after the LSTM. We applied the CNN layer to the result of the attention algorithm. Such a hidden level has a matrix form, due to the vectorial representation supplied by the word embeddings for the input tokens. In detail, it has the form 128x64, which allows us to apply a 1D convolutional network with 64 filters and a kernel of size 5. We used ReLU as the activation function, which, unlike the hyperbolic tangent, is faster to compute. On top of the CNN layer, we added a max pooling function for subsampling the values obtained, reducing the computational load and the number of parameters of the model. In particular, we used a small kernel of size 2. On the output of the last max-pooling layer, we applied a dropout function. The hidden representation obtained up to this step has been merged with the output of the previous BiLSTM. We apply this operation to let the model better conceptualize both local and long-term features. After that, we used a max-pooling layer for 'flattening' the results and reducing the model parameters. An analogous dimensionality reduction is performed by the subsequent dense layer and the following dropout. Finally, another dense layer with a softmax activation function has been applied for estimating the probability distribution over the clinical codes available in the dataset. The source code of the model is publicly available on GitHub (https://github.com/marcopoli/CODIESP-10).
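Under the hyperparameters listed above, the classifier can be approximated by the following tf.keras sketch. It is a minimal reconstruction, not the released code: tf.keras's built-in AdditiveAttention stands in for the keras-self-attention layer the authors used, and the 256-unit dense layer and the pooling-based merge are assumptions where the text is not explicit (the GitHub repository linked above contains the exact model):

```python
from tensorflow.keras import layers, models

def build_classifier(vocab_size: int, num_codes: int,
                     max_len: int = 128, emb_dim: int = 300) -> models.Model:
    inp = layers.Input(shape=(max_len,))
    # The 300-dim FastText SUC vectors would be loaded as this layer's weights.
    emb = layers.Embedding(vocab_size, emb_dim)(inp)
    bilstm = layers.Bidirectional(
        layers.LSTM(64, activation="tanh", dropout=0.3,
                    return_sequences=True))(emb)          # (max_len, 128)
    att = layers.AdditiveAttention()([bilstm, bilstm])    # additive self-attention
    conv = layers.Conv1D(64, 5, activation="relu")(att)   # 64 filters, kernel 5
    pooled = layers.Dropout(0.3)(layers.MaxPooling1D(2)(conv))
    # Merge local (CNN) and long-term (BiLSTM) features, flattened by max pooling.
    merged = layers.concatenate([layers.GlobalMaxPooling1D()(pooled),
                                 layers.GlobalMaxPooling1D()(bilstm)])
    dense = layers.Dropout(0.3)(layers.Dense(256, activation="relu")(merged))
    out = layers.Dense(num_codes, activation="softmax")(dense)
    return models.Model(inp, out)

# 1,788 unique CodiEsp-D codes appear in the training set (Table 1).
model = build_classifier(vocab_size=50000, num_codes=1788)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```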
4 Experimental session

The final architecture of the model, described in Sect. 3.2, was obtained after conducting several experiments on 20% of the training dataset released for the CodiEsp-D subtask. In order to always select the same portion of the dataset, we randomly selected sentences using the value 42 as the seed of our random function. During our experimental session, we raised five different experimental questions:

– RQ1: Do recent deep learning models, such as LSTMs, outperform classical machine learning approaches?
– RQ2: Which is the best strategy for encoding text in the form of word embeddings?
– RQ3: Which of the proposed models allows us to achieve the best performance?
– RQ4: Can the use of a sentence pre-filtering classifier help to improve the performance of the model?
– RQ5: Is there a class probability threshold that allows us to choose more than one code as a result of the classification?

In order to validate our claims, we repeated the experiments also on the annotated test set released by the task organizers after the challenge.

4.1 Metrics and Settings

We trained different classification models in order to understand the best approach to solve the CodiEsp challenge. In particular, we developed the following models:

– LSTM, BiLSTM
– CNN, CNN + Self Attention
– BERT
– BiLSTM + CNN, BiLSTM + CNN + Self Attention
– (Pre-filtering) BERT - (Classification) BiLSTM + CNN + Self Attention
– (Pre-filtering) BiLSTM + CNN + Self Attention - (Classification) BERT
– (Pre-filtering) BERT - (Classification) BERT
– (Pre-filtering) BiLSTM + CNN + Self Attention - (Classification) BiLSTM + CNN + Self Attention

For the selection of the word embedding strategy used to encode the textual sentences, we evaluated the following resources (all available at https://github.com/dccuchile/spanish-word-embeddings):

– FastText Spanish official release [14]
– FastText Spanish Unannotated Corpora (SUC) by José Cañete
– GloVe Spanish Billion Word Corpus (SBWC) by George Pérez
– Word2Vec Spanish Billion Word Corpus (SBWC) by Cristian Cardellino

Finally, as baselines we chose classic machine learning approaches for classification problems. In particular, we implemented them in Python using the scikit-learn library [23]:

– Logistic Regression, C=0.1
– SVC (SVM Classifier) with RBF kernel, C=0.1
– Decision Tree Classifier
– Random Forest Classifier, n_estimators=500
– AdaBoost Classifier, n_estimators=100, learning_rate=0.01

The model performance has been evaluated using the standard metrics of precision, recall, and F1 in their macro-averaged versions and, on the test set, also the Mean Average Precision (MAP) [12]. We trained all the models for 30 epochs, with a fixed random seed of 42 and a batch size of 256 where needed.

Table 2. Results obtained running the baselines on 20% of the training set.

                                          Macro-P   Macro-R   Macro-F1
  Logistic Regression                     0.03367   0.03412   0.03071
  SVC - rbf                               0.00106   0.00128   0.00116
  Decision Tree Classifier                0.00729   0.00868   0.00771
  Random Forest Classifier                0.00531   0.00272   0.00275
  Ada Boost Classifier                    0.00106   0.00127   0.00116
  LSTM - no pre-trained word embeddings   0.08473   0.09060   0.07923

Table 3. Results obtained varying the pre-trained word embeddings on 20% of the training set.

                                     Macro-P   Macro-R   Macro-F1
  LSTM - FastText Spanish official   0.09299   0.09629   0.08589
  LSTM - FastText SUC                0.09357   0.09903   0.08845
  LSTM - GloVe SBWC                  0.09252   0.09722   0.08715
  LSTM - Word2Vec SBWC               0.09202   0.09672   0.08685

4.2 Discussion of results

Looking at the results in Tab. 2, it is worth noting that the strategy based on deep learning, i.e., the LSTM model, outperforms those based on classical machine learning approaches. Specifically, logistic regression is the best model among the "classic" ones, with an F1 score equal to 0.03071. The LSTM strategy proposed here, based on a layer of word embeddings trained only on the data provided as input (i.e., no pre-trained weights are used), achieves an F1 score of 0.07923, more than twice that of logistic regression. This result allows us to provide a positive answer to RQ1.
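As an illustration of the baseline setup and of the macro-averaged metrics used throughout this section, the sketch below trains the scikit-learn logistic regression (C=0.1) in a one-vs-rest fashion on toy data; the feature representation (TF-IDF here) and the multi-label wrapping are our assumptions, since the paper does not detail them:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy sentences with invented ICD10 labels, only to show the mechanics.
sentences = ["dolor toracico y disnea", "hipertension arterial", "fractura de femur"]
labels = [["r07.9", "r06.0"], ["i10"], ["s72.9"]]

Y = MultiLabelBinarizer().fit_transform(labels)   # binary indicator matrix
X = TfidfVectorizer().fit_transform(sentences)

clf = OneVsRestClassifier(LogisticRegression(C=0.1, max_iter=1000)).fit(X, Y)
pred = clf.predict(X)
p, r, f1, _ = precision_recall_fscore_support(Y, pred, average="macro",
                                              zero_division=0)
print(f"Macro-P={p:.3f}  Macro-R={r:.3f}  Macro-F1={f1:.3f}")
```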
Tab. 3 reports the results obtained by evaluating different pre-trained word embedding weights. As expected, the differences in the final F1 outcome are not significant, but the best score is obtained using the FastText approach pre-trained on the Spanish Unannotated Corpora (SUC). We decided not to evaluate word embedding weights released for English, because we preferred to focus on the portion of data released in Spanish rather than on its translated version. In this experiment, we increased the F1 score previously reported, reaching a value of 0.08845. These results allow us to answer RQ2.

Table 4. Results obtained holding the FastText SUC word embedding fixed and varying the model, using an evaluation on 20% of the training set.

                            Macro-P   Macro-R   Macro-F1
  LSTM                      0.09357   0.09903   0.08845
  BiLSTM                    0.10354   0.10909   0.09789
  BiLSTM + CNN              0.09995   0.11552   0.09831
  BiLSTM + CNN + SelfAtt.   0.10629   0.11887   0.10410
  CNN                       0.09511   0.10095   0.09100
  CNN + SelfAtt.            0.09279   0.09484   0.08706
  BERT (BETO)               0.10381   0.10821   0.10294

In Tab. 4, we report the results obtained by varying the architecture of our model while keeping the pre-trained word embeddings chosen at the previous experimental step. An unexpected result is obtained by observing the strategy based on BERT as a simple code classifier: it performs slightly worse than the one that uses BiLSTM, CNN, and self-attention layers. The difference between the two approaches is tiny and, from our point of view, not very relevant. For this reason, we decided to keep investigating both approaches. In this step, we obtained the best F1 score of 0.10410, i.e., around 0.02 points greater than the previous result. These results allow us to provide a valid answer to RQ3.

The following evaluation step consists of splitting the clinical coding task into two independent steps: pre-filtering and classification. We report the results obtained in this evaluation step in Tab. 5.

Table 5. Results obtained holding the FastText SUC word embedding and the two models (BERT, BiLSTM + CNN + SelfAttention) fixed and varying their combinations for pre-filtering and classification, using an evaluation on 20% of the training set.

  Pre-filtering             Classification            Macro-P   Macro-R   Macro-F1
  BiLSTM + CNN + SelfAtt.   BiLSTM + CNN + SelfAtt.   0.13241   0.10934   0.11534
  BiLSTM + CNN + SelfAtt.   BERT (BETO multi-class)   0.09180   0.08871   0.10022
  BERT (BETO)               BERT (BETO multi-class)   0.09704   0.10092   0.11734
  BERT (BETO)               BiLSTM + CNN + SelfAtt.   0.13823   0.12053   0.13632
It is possible to note that, among the different combinations of classifiers used for the two steps, the best configuration is the one using BERT as the pre-filtering strategy and BiLSTM + CNN + Self Attention as the classification approach. We were able to increase the F1 score by around 0.03 points with respect to the previous step, reaching the value of 0.13632. The behavior observed in the results allows us to answer RQ4 positively.

The last evaluation concerns the strategy for selecting multiple labels for a single sentence. We decided to use a threshold on the result of the softmax function that allows us to extract the labels the model is most certain about. We varied the thresholds using both fixed and dynamic values. The results are reported in Tab. 6. It is possible to observe that both the values 0.10 and 0.25 achieve good results: using a threshold of 0.10 we maximize the recall, while, on the contrary, with 0.25 we observe higher values of precision. Considering that, in a real scenario, a tool like this would be a decision support system for doctors, we decided to use the configuration with threshold 0.10 as the final model for the CodiEsp task. The results support our positive answer also for RQ5. The implemented model has been used to participate in both CodiEsp subtasks, i.e., CodiEsp-D and CodiEsp-P.

Table 6. Results obtained keeping fixed the FastText SUC word embedding and the two models (BERT for pre-filtering, BiLSTM + CNN + SelfAttention for classification) and varying the threshold on the probability returned by the architecture for each class, used for selecting multiple labels for the specific sentence. Also in this case, the evaluation has been performed on 20% of the training set.

  Threshold                    Macro-P   Macro-R   Macro-F1
  0.05                         0.09271   0.27911   0.12479
  0.10                         0.14951   0.24802   0.16011
  0.25                         0.22644   0.18310   0.16188
  0.50                         0.27531   0.09811   0.13210
  maxProb - (maxProb*0.10)     0.15545   0.13324   0.13584
  maxProb - (maxProb*0.25)     0.12746   0.13943   0.12802
  maxProb - 0.10               0.00434   0.29032   0.01028

4.3 Post Hoc Analysis

We performed a further investigation of the performance of our model on the gold annotated test set. In particular, in this phase, we take into account the score obtained for the MAP metric, because it is used by the organizers for calculating the final leaderboard. As we can observe in Tab. 7, the configuration of the model using a threshold equal to 0.10 is the best performing one, followed by those obtained using the thresholds 0.05 and 0.25, respectively. The results successfully support our choice of using a threshold that maximizes the recall more than the precision.

Table 7. Results obtained on the annotated test set of the CodiEsp-D subtask (BERT pre-filtering, BiLSTM + CNN + SelfAtt. classification).

  Threshold                      MAP      Macro-P   Macro-R   Macro-F1
  0.05                           0.194    0.12913   0.25016   0.16804
  0.10                           0.202    0.19238   0.21361   0.19931
  0.25                           0.181    0.28702   0.19935   0.2201
  0.50                           0.15     0.38739   0.11682   0.1765
  0.75                           0.118    0.425     0.08365   0.13742
  maxProb - 0.10                 0.156    0.00599   0.31974   0.01172
  maxProb - 0.25                 0.142    0.00394   0.50282   0.07813
  maxProb - (maxProb*0.10)       0.165    0.21341   0.15787   0.17897
  maxProb - (maxProb*0.25)       0.168    0.19305   0.16649   0.17636
  MeanValue                      0.096    0.00836   0.48201   0.0134
  MeanValue - MeanValue*0.25     0.093    0.00735   0.49042   0.01187
  MeanValue - MeanValue*0.35     0.0901   0.00698   0.49313   0.01118
  MeanValue - MeanValue*0.50     0.089    0.00643   0.49681   0.01022

The final results obtained at the CodiEsp task are reported in Tab. 8. Looking at the final scores of precision and recall, it is worth noting that our model is well suited for real use as a decision support system. Indeed, it is able to obtain a higher recall than precision and, as previously stated, this easily allows the selection of candidate codes for clinical reports.

Table 8. Official CodiEsp task results.

  CodiEsp-D
  Run                                     MAP     MAP30   MAP codes   MAP30 codes   P       R       F1
  BERT (BETO) — BiLSTM + CNN + SelfAtt.   0.202   0.236   0.202       0.236         0.295   0.323   0.308
  BERT (BETO)                             0.117   0.136   0.117       0.136         0.272   0.218   0.242
  BiLSTM + CNN + SelfAtt.                 0.169   0.195   0.16        0.185         0.135   0.442   0.207

  CodiEsp-P
  Run                                     MAP     MAP30   MAP codes   MAP30 codes   P       R       F1
  BERT — BiLSTM + CNN + SelfAtt.          0.221   0.25    0.219       0.247         0.186   0.38    0.25
  BiLSTM                                  0.137   0.152   0.127       0.141         0.122   0.399   0.187
  BERT                                    0.141   0.154   0.14        0.153         0.155   0.323   0.209
  BiLSTM + CNN + SelfAtt.                 0.17    0.191   0.15        0.168         0.097   0.513   0.164
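For concreteness, the fixed and dynamic label-selection rules compared in Tables 6 and 7 can be sketched as follows (the function name and the example values are ours; the final submission corresponds to the fixed 0.10 threshold):

```python
import numpy as np

def select_codes(probs, codes, mode="fixed", value=0.10):
    """Select one or more candidate codes from the softmax output of a sentence.

    mode="fixed":    keep codes with p >= value (0.10 in the final run)
    mode="relative": keep codes with p >= maxProb - maxProb * value
    mode="offset":   keep codes with p >= maxProb - value
    """
    if mode == "fixed":
        cut = value
    elif mode == "relative":
        cut = probs.max() - probs.max() * value
    else:
        cut = probs.max() - value
    return [c for c, p in zip(codes, probs) if p >= cut]

probs = np.array([0.62, 0.21, 0.08])   # toy softmax scores
codes = ["i10", "r07.9", "j18.9"]      # toy candidate codes
print(select_codes(probs, codes))      # ['i10', 'r07.9']
```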
5 Conclusion

In this paper, we faced the problem of clinical coding by applying several machine learning methods. We compared traditional classification approaches, such as logistic regression, random forests, and SVMs, with deep learning models, including LSTM, CNN, and BERT. The experimental analyses allowed us to propose a classification model based on two execution steps: pre-filtering and classification. In the pre-filtering phase, we use a BERT-based classification model to select a set of medical report sentences that we believe can be associated with one or more ICD10 codes. Later, we use a BiLSTM, CNN, and Self-Attention-based classifier to select the specific set of possible codes for each candidate sentence. The results obtained in the post hoc evaluation phase have shown that the approach proposed for the challenge is the best performing among those considered in this study. The results obtained in the challenge showed a MAP score of 0.202 for the CodiEsp-D task and 0.221 for the CodiEsp-P task. These are encouraging results given the difficulty of the task, and there is also the possibility of further improving the figures by better balancing the training data available among the various categories of codes. As future work, we plan to associate the ICD10 code with the corresponding portion of text, in order to implement a first strategy for explaining the results.

References

1. Almagro, M., Montalvo, S., de Ilarraza, A.D., Pérez, A.: MAMTRA-MED at CLEF eHealth 2018: A combination of information retrieval techniques and neural networks for ICD-10 coding of death certificates. In: Cappellato, L., Ferro, N., Nie, J., Soulier, L. (eds.) Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018. CEUR Workshop Proceedings, vol. 2125. CEUR-WS.org (2018), http://ceur-ws.org/Vol-2125/paper_110.pdf
2. Amin, S., Neumann, G., Dunfield, K., Vechkaeva, A., Chapman, K.A., Wixted, M.K.: MLT-DFKI at CLEF eHealth 2019: Multi-label classification of ICD-10 codes with BERT. In: CLEF (Working Notes) (2019)
3. Atutxa, A., Casillas, A., Ezeiza, N., Fresno, V., Goenaga, I., Gojenola, K., Martínez, R., Anchordoqui, M.O., Perez-de Viñaspre, O.: IxaMed at CLEF eHealth 2018 task 1: ICD-10 coding with a sequence-to-sequence approach. In: CLEF (Working Notes). p. 1 (2018)
4. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
5. Basile, P., Croce, D., Basile, V., Polignano, M.: Overview of the EVALITA 2018 aspect-based sentiment analysis task (ABSITA). EVALITA Evaluation of NLP and Speech Tools for Italian 12, 10 (2018)
6. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017)
7. Cabot, C., Soualmia, L.F., Darmoni, S.J.: SIBM at CLEF eHealth Evaluation Lab 2017: Multilingual information extraction with CIM-IND. In: CLEF (Working Notes) (2017)
8. Cañete, J., Chaperon, G., Fuentes, R., Pérez, J.: Spanish pre-trained BERT model and evaluation data. In: PML4DC at ICLR 2020 (2020)
9. Chapman, W.W., Haug, P.J.: Comparing expert systems for identifying chest x-ray reports that support pneumonia. In: Proceedings of the AMIA Symposium. p. 216.
American Medical Informatics Association (1999)
10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (1) (2019)
11. Fukushima, K., Miyake, S.: Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In: Competition and Cooperation in Neural Nets, pp. 267–285. Springer (1982)
12. Géron, A.: Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media (2019)
13. Goeuriot, L., Suominen, H., Kelly, L., Miranda-Escalada, A., Krallinger, M., Liu, Z., Pasi, G., Saez Gonzales, G., Viviani, M., Xu, C.: Overview of the CLEF eHealth evaluation lab 2020. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). LNCS, vol. 12260 (2020)
14. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018) (2018)
15. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
16. Hubel, D.H., Wiesel, T.N.: Receptive fields of single neurones in the cat's striate cortex. The Journal of Physiology 148(3), 574 (1959)
17. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
18. Kukafka, R., Bales, M.E., Burkhardt, A., Friedman, C.: Human and automated coding of rehabilitation discharge summaries according to the international classification of functioning, disability, and health. Journal of the American Medical Informatics Association 13(5), 508–515 (2006)
19. Miftahutdinov, Z., Tutubalina, E.: KFU at CLEF eHealth 2017 task 1: ICD-10 coding of English death certificates with recurrent neural networks. In: CLEF (Working Notes) (2017)
20. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
21. Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of CLEF eHealth 2020. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2020)
22. World Health Organization: The ICD-10 classification of mental and behavioural disorders: diagnostic criteria for research, vol. 2. World Health Organization (1993)
23. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
24. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
25. Polignano, M., Basile, P., de Gemmis, M., Semeraro, G.: A comparison of word-embeddings in emotion detection from text using BiLSTM, CNN and self-attention.
In: Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization. pp. 63–68 (2019)
26. Polignano, M., Basile, P., de Gemmis, M., Semeraro, G., Basile, V.: AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In: CLiC-it (2019)
27. Polignano, M., Basile, V., Basile, P., de Gemmis, M., Semeraro, G.: AlBERTo: Modeling Italian social media language with BERT. Italian Journal of Computational Linguistics - IJCOL 2, n. 2 (2019)
28. Sänger, M., Weber, L., Kittner, M., Leser, U.: Classifying German animal experiment summaries with multi-lingual BERT at CLEF eHealth 2019 task 1. In: CLEF (Working Notes) (2019)
29. Song, S., Huang, H., Ruan, T.: Abstractive text summarization using LSTM-CNN based deep learning. Multimedia Tools and Applications 78(1), 857–875 (2019)
30. Stanfill, M.H., Williams, M., Fenton, S.H., Jenders, R.A., Hersh, W.R.: A systematic literature review of automated clinical coding and classification systems. Journal of the American Medical Informatics Association 17(6), 646–651 (2010)
31. Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: LREC. vol. 2012, pp. 2214–2218 (2012)
32. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008 (2017)
33. Wang, S., Zhu, Y., Gao, W., Cao, M., Li, M.: Emotion-semantic-enhanced bidirectional LSTM with multi-head attention mechanism for microblog sentiment analysis. Information 11(5), 280 (2020)
34. Zheng, G., Mukherjee, S., Dong, X.L., Li, F.: OpenTag: Open attribute value extraction from product profiles. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 1049–1058. ACM (2018)
35. Zhou, J., Cao, Y., Wang, X., Li, P., Xu, W.: Deep recurrent models with fast-forward connections for neural machine translation. Transactions of the Association for Computational Linguistics 4, 371–383 (2016)