Explainable Classification of Medical Documents Through a Text-to-Text Transformer

Mihai Horia Popescu, Kevin Roitero and Vincenzo Della Mea
Dept. of Mathematics, Computer Science and Physics, University of Udine, Udine, Italy

Abstract
Death certificates are important medical records collected for public healthcare and statistical purposes by multiple organizations around the globe. Due to their importance, these certificates are compiled by experienced medical practitioners according to a standard defined by the World Health Organization, which includes rules to select an underlying cause of death (UCOD). For this reason, the coding of death certificates is a slow and costly process. To overcome these issues, the scientific community has proposed deep learning approaches to perform this task. Although those systems achieve high accuracy scores (close to 1), their complexity makes them opaque to the final user, making their adoption as decision support systems unfeasible. In this paper, we propose a model based on text-to-text transformers which is able to provide a UCOD as well as to generate a human-readable explanation for its classification. We compare the proposed approach to state-of-the-art interpretable rule-based systems.

Keywords
deep learning, XAI, automated coding, medical documentation, generative model

1. Introduction
Traditionally, natural language processing (NLP) applications have been built on techniques that are natively explainable. Such techniques are generally referred to as "white box" techniques, and are mainly implemented using rule-based heuristics, decision trees, hidden Markov models, etc. [1]. Recent advances in Deep Learning (DL), a "black box" machine learning technique, have dramatically improved the accuracy of Neural Networks (NNs) and increasingly gained interest from stakeholders. As a result, DL has become the dominant approach in NLP and has seen wide adoption in a large number of applications [2, 3]. This popularity of DL-based approaches has been pursued by focusing merely on the effectiveness of such systems, resulting in effective models that lack interpretability. Hence, concerns have been raised about the adoption of such black box methodologies in sensitive applications such as healthcare, decision making, and finance, settings in which it is fundamental to rely on interpretable models [4, 5]. As a result, for sensitive domains and real-world decision-making systems, the mere effectiveness of the system is not enough; those systems also need to support the reliability of the produced result, and thus provide feedback, e.g., in the form of a confidence score or a human-readable explanation, to inform the final user whether the produced result is likely to be correct and/or trustworthy, or to explain the rationale behind the model decisions [6].
For these reasons, in recent times we have observed an increase of interest from the community in developing and improving methods for the interpretability of DL models, especially towards the generation of human-readable explanations produced by explainable artificial intelligence (XAI) models [1, 3, 7]. Recently, many works have been developed to produce natural language explanations for DL systems [8, 9]. While diverse approaches to generating explanations exist, most of these methods can be categorized as producing post-hoc explanations: such techniques target models that are not interpretable by design, and are used to enhance the interpretability of the underlying model choices [7]. In this paper we propose a methodology able to generate a human-readable explanation for the predictions produced by a model designed to select the underlying cause of death (UCOD) from death certificates; the model achieves very high accuracy scores (close to 1) [10, 11, 12], but it is not being adopted in practice due to its lack of interpretability.

2. Background and Related Work
In general, XAI approaches can be categorized from different perspectives: local versus global [13], transparent models versus post-hoc explainability [7], or based on XAI goals (such as trustworthiness, causality, transferability, etc.) [7]. Our work relies on local and post-hoc explainability, given that from the generated explanations it is only possible to understand the reason for the predicted UCOD. We have identified two major goals that users may desire and which those explanations can support: trustworthiness and informativeness. Different approaches to enhance interpretability exist in the literature. Ribeiro et al. [14] studied the explainability of a model's predictions using feature importance-based explanations. Other approaches, such as the one proposed by Camburu et al. [15], first generate a free-form natural language explanation, then use such an explanation to infer the classification prediction. Similarly, Brand et al. [16, 17] showed that one can jointly predict and generate an explanation when classifying the veracity of statements. From a different perspective, other works used the confidence of the model as a reliability measure for the correctness of the predictions, by computing a calibrated confidence score [6]. Finally, works such as the one by Agarwal et al. [18] leveraged alternative measures, like the variance of gradients, to estimate model reliability and instance difficulty.

3. Data
3.1. The Death Certificate
The death certificate is the main source of mortality data. Such data is supposed to be collected in compliance with the standard death certificate format defined in [19] and [20]. The death certificate contains administrative details, a part called Frame A, and a part called Frame B. Frame A is used to record the sequence of events leading directly to death, and may also contain conditions that do not belong to the sequence but whose presence contributed to death. Conversely, Frame B contains additional health conditions, such as previous surgery, mode of death, or place of occurrence. It should be noted that while Frame A contains the textual expression of the conditions as filled in by physicians, their corresponding ICD-10 codes are generally provided by expert coders. The coded version of the certificate is the format used for the selection of the UCOD; a minimal sketch of such a coded record is shown below.
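To make the record structure described above concrete, the following is a minimal sketch of a coded certificate record in Python; the class and field names are hypothetical and do not reflect the actual NCHS file layout.

```python
# Hypothetical sketch of a coded death-certificate record; names are
# illustrative, not the actual NCHS record layout.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DeathCertificate:
    sex: str                  # administrative data, e.g. "Male"
    age: int                  # age in years
    frame_a: List[List[str]]  # Frame A: one list of ICD-10 codes per line,
                              # from the immediate cause of death down to the
                              # originating condition
    frame_b: List[str] = field(default_factory=list)  # Frame B: contributing
                                                      # conditions outside the chain
    ucod: str = ""            # expert-selected underlying cause of death

# A certificate with a three-line Frame A chain and one Frame B condition.
cert = DeathCertificate(
    sex="Male",
    age=69,
    frame_a=[["I469"], ["I251"], ["I10"]],  # cardiac arrest <- ASHD <- hypertension
    frame_b=["J969"],                       # respiratory failure, unspecified
    ucod="I251",
)
```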
The UCOD is the most important information extracted from mortality data, and it is used for statistical comparison and public health purposes. It is defined as '(a) the disease or injury which initiated the train of morbid events leading directly to death, or (b) the circumstances of the accident or violence which produced the fatal injury' [19]. The UCOD is selected according to the coding rules defined in the reference guide. The chosen code is usually one of the conditions present in the chains reported by the certifying doctor in Frame A.

3.2. Generation of Ground Truth Explanations
The system used for the generation of the gold explanations is DORIS [21], a prototype rule-based system for mortality coding based on ICD-10 and ICD-11. Its rules can be subdivided into two categories: selection rules and modification rules. Currently, the system fully supports 18 out of 38 selection rules and about 95% of the modification rules; the remaining rules are only partially implemented. The system was evaluated on datasets for both ICD-10 and ICD-11: DORIS is unable to code 8.2% of the total certificates and has an accuracy of 78% for ICD-10 [21]. The explanation generated by DORIS describes the coding instructions used to reach the selection of the UCOD and the conditions on which each rule is applied. In Table 1 we present the explanations used by DORIS for two coding instructions, together with the associated rule descriptions from the reference guide. Multiple coding instructions may be used to select the UCOD; in that case, the explanations are concatenated.

3.3. Data Source and Preparation
The death certificate data files were collected from the U.S. National Center for Health Statistics (NCHS)¹. The dataset contains a total of 12,919,268 records for the years 2014–2017, including administrative data, coded conditions for Frames A and B, and the UCOD, which we used as ground truth. From the full dataset, we extracted 510,000 records for which the rule-based system presented in Section 3.2 was able to correctly select the UCOD. The data were then pre-processed to select only the data needed for our experiment: we chose to use the sex and age features from the administrative data and the conditions from Frame A. The dataset has been split into three smaller parts using randomization and stratified sampling by the target UCOD (a minimal sketch of this split follows Table 1): we selected 400,000 records for the train set, 100,000 records for the test set, and the remaining 10,000 certificates for the validation set. The dataset contains the same records, dataset split, and reverse coding format used by the NLP model for the selection of the underlying cause of death proposed by Della Mea et al. [10], as detailed in the following.

¹ https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm

Table 1
Example explanations generated by DORIS.

Coding instruction | Explanation used | Rule description
SP1 | Malignant neoplasm of prostate is the unique condition reported in the certificate and is the new tentative starting point (TUC). | If there is only one condition reported on the certificate, this is the new TUC.
SP2 | Unspecified injury of head is the first condition reported on the single used line, which is selected as the new tentative starting point. | If only one line is used but multiple conditions are present, select the first condition as the new TUC.
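As referenced above, the following is a minimal sketch of the stratified split, assuming the records are held in a pandas DataFrame with a `ucod` column; both the variable and the column name are placeholders rather than the actual NCHS field names.

```python
# Minimal sketch of the 400k/100k/10k split, stratified by target UCOD;
# `records_df` and the "ucod" column are assumed placeholder names.
from sklearn.model_selection import train_test_split

# First carve out the 400,000-record train set...
train_df, rest_df = train_test_split(
    records_df,
    train_size=400_000,
    stratify=records_df["ucod"],
    random_state=0,  # fixed seed for reproducibility
)
# ...then split the remainder into 100,000 test and 10,000 validation records.
test_df, val_df = train_test_split(
    rest_df,
    train_size=100_000,
    stratify=rest_df["ucod"],
    random_state=0,
)
```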
As input, the proposed model takes the text-encoded version of the certificates. Since the certificates do not have the original textual conditions present, we had to reverse the work done by the coders, bringing the certificates back to text. The text-encoded certificate needs to include both administrative data and conditions. The administrative data is put in an explicit form (e.g., Female, 39y old). Each line is encoded with the title of its entity; when multiple codes appear on a line, the titles are merged using the "or" expression and the entire line is put between parentheses. The sequence of lines is then concatenated with the expression "due to"; Part 2, if present, is concatenated to the last line of Part 1 using "in the context of".

4. Methods
4.1. Generating Explanations
We develop and train our models by relying on both the PyTorch² and HuggingFace³ frameworks. The experiments have been carried out on a Linux server equipped with 16x Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz, 70GB of RAM, and 2x Nvidia GeForce RTX 3090 GPUs. We make the trained model available to the community.⁴

² https://pytorch.org/
³ https://huggingface.co/
⁴ To request access to the model, send an email to the paper authors.

T5 [22] is a transformer-based model trained on a mixture of both supervised and unsupervised tasks (i.e., summarization, translation, etc.) [22, Appendix Section]. In this work, we rely on the T5-base model⁵, a 220 million parameter model composed of an encoder-decoder stack of 12 blocks, each of which implements a self-attention mechanism, an encoder-decoder attention mechanism, and a feed-forward network.

⁵ https://huggingface.co/t5-base

[Figure 1: Model training (above) and inference (below).]

Many available transformer-based architectures leverage separate transformer models for either discriminative (e.g., classification) or generative (e.g., text-generation) tasks. As opposed to this approach, we take inspiration from E-BART [16, 17], a model designed in the context of misinformation and veracity assessment to perform a discriminative task (i.e., classifying the truthfulness of statements) and a generative one (i.e., generating a human-readable explanation for the former step) at the same time. In a similar fashion, we develop a model which is capable of classifying the UCOD of a certificate and generating a human-readable explanation for such a classification. The model training and inference phases are detailed in the following, and summarized in Figure 1. Given that the focus of this work is on the generated explanations, in the following we omit the description of the discriminative model (which is in any case a standard BERT-based model equipped with a classification head) and we focus only on the generative one. To generate the explanations, the model takes as input, separated by the [SEP] token, the death certificate encoded as text as described in Section 3.3, the string generated as a report by the rule-based system presented in Section 3.2, and the UCOD, i.e., the code predicted by the discriminative model representing the underlying cause of death.
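As an illustration of the input construction just described, the following is a minimal sketch of the reverse coding of Section 3.3, assuming a title lookup table and hypothetical helper names; it is not the actual preprocessing code.

```python
# Minimal sketch of the reverse coding to text (Section 3.3); the title-mapping
# excerpt and helper names are illustrative assumptions.

ICD10_TITLES = {  # excerpt; in practice the full ICD-10 title index is used
    "I469": "Cardiac arrest, unspecified",
    "I251": "Atherosclerotic heart disease",
    "I10":  "Essential (primary) hypertension",
    "J969": "Respiratory failure, unspecified",
    "I609": "Subarachnoid haemorrhage, unspecified",
}

def encode_line(codes):
    """One certificate line: entity titles merged with 'or' and put between
    parentheses when the line carries multiple codes."""
    titles = [ICD10_TITLES[c] for c in codes]
    return titles[0] if len(titles) == 1 else "(" + " or ".join(titles) + ")"

def certificate_to_text(sex, age, part1_lines, part2_codes=()):
    """Administrative data in explicit form, Part 1 lines joined by 'due to',
    Part 2 (if present) appended with 'in the context of'."""
    text = f"{sex}, {age}y old: "
    text += " due to ".join(encode_line(line) for line in part1_lines)
    if part2_codes:
        text += " in the context of " + encode_line(part2_codes)
    return text

print(certificate_to_text(
    "Male", 69,
    part1_lines=[["I469"], ["I251"], ["I10"]],
    part2_codes=["J969", "I609"],
))
# Male, 69y old: Cardiac arrest, unspecified due to Atherosclerotic heart
# disease due to Essential (primary) hypertension in the context of
# (Respiratory failure, unspecified or Subarachnoid haemorrhage, unspecified)
```

This reproduces the certificate format shown in the first column of Table 3.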
The model is then trained in a causal fashion, i.e., trained to auto-regressively predict the gold explanation (shifted), which is the description generated by the rule-based system detailed in the previous sections. The model loss is the conventional multi-class cross-entropy loss, where the number of classes is equal to the size of the vocabulary:

$$ \mathcal{L} = - \frac{1}{B} \sum_{b=1}^{B} \sum_{k=1}^{|V|} y^b_k \log\left(\hat{y}^b_k\right) $$

where $b$ indexes the elements of the batch and $B$ is the batch size, $|V|$ is the vocabulary size, $y$ is the (one-hot) true token to be predicted by the model, and $\hat{y}$ is the output probability distribution over the vocabulary at each time step.

During the inference step, the model generates the text by leveraging beam search, thus producing the explanation token-by-token by feeding the input tokens via the cross-attention layers to the decoder, and then auto-regressively generating the decoder output (a minimal sketch of this step is reported before Table 3). To optimize the generation process, we set the early stopping parameter to true, so that beam generation stops when all beam hypotheses reach the EOS token. Experimentally, we found that such a generation procedure is suitable for the task and generates relevant explanations for each input string; hence, we found no need to implement constrained search techniques or alternatives to beam search. For the same reason, we always select the output sequence with the highest likelihood as computed by the model.

4.2. Metrics
We evaluate the generated explanations using the Rouge score [23], a recall-oriented measure designed to compare a generated textual summary to an ideal one, usually produced by a human [24, 25]. More in detail, Rouge–N denotes an n-gram overlap metric between a candidate summary and the reference summary. In this work we consider Rouge–1 (uni-gram based), Rouge–2 (bi-gram based), and Rouge–L, which is computed by considering the Longest Common Subsequence (LCS). Rouge precision is defined as the number of overlapping n-grams between the candidate and the reference summary divided by the number of n-grams in the candidate summary; Rouge recall is defined as the number of overlapping n-grams between the candidate and the reference summary divided by the number of n-grams in the reference summary; and Rouge F1 is the harmonic mean of precision and recall.

Table 2
Effectiveness of the model on the generated explanations.

Dataset | Rouge–1 (Prec / Rec / F1) | Rouge–2 (Prec / Rec / F1) | Rouge–L (Prec / Rec / F1)
CDC-Test 100K | 0.9988 / 0.9985 / 0.9986 | 0.9983 / 0.9980 / 0.9981 | 0.9986 / 0.9983 / 0.9983

5. Results and Discussion
Table 2 shows the Rouge scores for the considered dataset. As we can see, we reach an overall score near 1 for all the evaluated n-gram metrics, with high precision, recall, and F1 values throughout. The recall shows that in almost all cases the n-grams in the gold explanation are also present in the generated explanation, while the precision shows that almost all the n-grams in the generated explanations are present in the reference explanation. Among the considered metrics, the bi-gram based one (Rouge–2) has the lowest F1 score, with a value of 0.9981. Since the overall scores are very high, most of the generated explanations match the gold explanation perfectly. For the remaining cases we also perform a qualitative analysis of the generated explanations, comparing them to the rule-based output with respect to the structure of the rule, the conditions involved, and the terminology.
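Before turning to the qualitative examples, the following is a minimal sketch of the inference step described in Section 4.1, assuming the trained model exposes the standard HuggingFace generation interface; the checkpoint path, input string, beam width, and length bound are placeholders, not the exact values used in our experiments.

```python
# Minimal sketch of beam-search inference (Section 4.1); checkpoint path,
# input text, num_beams, and max_length are illustrative assumptions.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("path/to/finetuned-t5")

input_text = "Male, 69y old: Cardiac arrest, unspecified due to ... [SEP] I251"
inputs = tokenizer(input_text, return_tensors="pt", truncation=True)

output_ids = model.generate(
    **inputs,
    num_beams=4,          # beam search over candidate explanations
    early_stopping=True,  # stop when all beam hypotheses reach the EOS token
    max_length=256,       # assumed upper bound on explanation length
)
# generate() returns the highest-likelihood beam first
explanation = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(explanation)
```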
Table 3 shows the certificate, the gold explanation, and the generated explanation for a sample of the instances present in the dataset. As we can see from the table, the generated explanation is not always fully correct. In particular, in both erroneous cases the error occurs in the selection of obvious causes: multiple conditions are obvious causes of the TUC, but the generated description was not able to identify one of them. In the first case the error propagated to the selection of the UCOD, while in the second case it did not influence the final result. In all the cases where the description was incorrect, we noticed that the terminology used was always consistent, the rule structure was correctly applied, and the conditions used were always consistent with those of the certificate. The errors mainly concerned the SP6 (obvious causes) and M1 (special instructions) coding instructions, where some categories were not recognized as part of the rule; most likely, those cases were not part of the training set.

6. Conclusions
We have presented a system that is able to enhance the interpretability of a classification model by generating explanations using a rule-based system as reference. The model was not only able to consistently generate appropriate explanations (about 0.998 F1 score), but it was also able to correctly learn and use the structure of the rules and their terminology. The proposed model also has the ability to predict the UCOD, since the last sentence of the explanation always specifies the suggested category; this feature is very important because the rule-based system used to learn the explanations does not reach the same accuracy as the classification model, and the UCOD suggested in the explanation can be used to cross-check the UCOD of the classification model, so as to understand when the explanation is likely to be incorrect. Some limitations of this preliminary experiment come from the dataset used. In fact, for this experiment we used a dataset of the same size as the one used for the preliminary evaluation of the classification model, and the certificates had to be encoded as text by reverse encoding from the coded conditions. These limitations stem from the lack of certificates with native textual conditions; the available data was nonetheless sufficient to evaluate the feasibility of the approach. This paper opens up plenty of future work. More in detail, for future experimentation and evaluation, we plan to retrieve and use a dataset with original textual conditions, that is, where plain text is available natively. Furthermore, we plan to evaluate and compare the generated explanations on the death certificates for which DORIS fails, that is, where DORIS is not able to correctly predict the underlying cause of death; extrapolating from the results discussed in this paper, we expect those explanations to be well structured but with an incorrect rule applied, although this needs to be proven with further experimentation. We also plan to use and extend the dataset for the training and evaluation phases by employing the full dataset used by the model that selects the UCOD [11], also including the cases originated after 2017, where recent cases of COVID-19 may have changed the data distribution and patterns. Finally, we plan to extend the training dataset for generating the explanations with hand-written explanations produced by trained experts, to be able to compare and support the cases where DORIS is not able to correctly identify the UCOD.

Table 3
Qualitative evaluation of the generated explanations.
Case 1
Certificate: Male, 79y old: Unspecified dementia in the context of (Other specified degenerative diseases of nervous system or Alzheimer disease, unspecified or Cerebrovascular disease, unspecified or Mental and behavioural disorders due to use of tobacco: Unspecified mental and behavioural disorder). code: G318
Gold explanation: Unspecified dementia is the first condition reported on the single used line, which is selected as the new tentative starting point. Other specified degenerative diseases of nervous system,Alzheimer disease, unspecified,Cerebrovascular disease, unspecified (HIV) is an obvious cause of the current TUC Unspecified dementia (Infectious Diseases). Other specified degenerative diseases of nervous system is the underlying cause of death.
Generated explanation: Unspecified dementia is the first condition reported on the single used line, which is selected as the new tentative starting point. Alzheimer disease, unspecified,Cerebrovascular disease, unspecified (HIV) is an obvious cause of the current TUC Unspecified dementia (Infectious Diseases). Alzheimer disease, unspecified is the underlying cause of death.

Case 2
Certificate: Male, 69y old: Cardiac arrest, unspecified due to Atherosclerotic heart disease due to Essential (primary) hypertension in the context of (Respiratory failure, unspecified or Subarachnoid haemorrhage, unspecified or Subdural haemorrhage (acute)(nontraumatic) or Epilepsy, unspecified or Immobility). code: I251
Gold explanation: Essential (primary) hypertension present on last used line causes all the conditions reported on the lines above, which is selected as the new tentative starting point (TUC). Special instruction apply where Essential (primary) hypertension is with mention of Atherosclerotic heart disease,Subarachnoid haemorrhage, unspecified,Nontraumatic subdural haemorrhage. Atherosclerotic heart disease is selected using priority order. Atherosclerotic heart disease is the underlying cause of death.
Generated explanation: Essential (primary) hypertension present on last used line causes all the conditions reported on the lines above, which is selected as the new tentative starting point (TUC). Special instruction apply where Essential (primary) hypertension is with mention of Atherosclerotic heart disease,Nontraumatic subdural haemorrhage. Atherosclerotic heart disease is selected using priority order. Atherosclerotic heart disease is the underlying cause of death.

Case 3
Certificate: Male, 54y old: Cardiomegaly due to Atherosclerotic heart disease due to Myocarditis, unspecified due to Generalized and unspecified atherosclerosis in the context of Atherosclerotic cardiovascular disease, so described. code: I251
Gold explanation: Generalized and unspecified atherosclerosis present on last used line causes all the conditions reported on the lines above, which is selected as the new tentative starting point (TUC). Special instruction apply where Generalized and unspecified atherosclerosis is with mention of Atherosclerotic heart disease,Myocarditis, unspecified,Atherosclerotic cardiovascular disease, so described. Myocarditis, unspecified is selected using priority order. Special instruction apply where Myocarditis, unspecified is with mention of Atherosclerotic heart disease,Atherosclerotic cardiovascular disease, so described. Atherosclerotic heart disease is selected using priority order. Atherosclerotic heart disease is the underlying cause of death.
Generated explanation: Generalized and unspecified atherosclerosis present on last used line causes all the conditions reported on the lines above, which is selected as the new tentative starting point (TUC). Special instruction apply where Generalized and unspecified atherosclerosis is with mention of Atherosclerotic heart disease,Atherosclerotic cardiovascular disease, so described. Atherosclerotic heart disease is selected using priority order. Atherosclerotic heart disease is the underlying cause of death.

References
[1] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, R. Chatila, F. Herrera, Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, 2019. URL: https://arxiv.org/abs/1910.10045. doi:10.48550/ARXIV.1910.10045.
[2] M. Danilevsky, K. Qian, R. Aharonov, Y. Katsis, B. Kawas, P. Sen, A survey of the state of explainable AI for natural language processing, arXiv (2020). URL: https://arxiv.org/abs/2010.00711. doi:10.48550/ARXIV.2010.00711.
[3] J. Yu, A. I. Cristea, A. Harit, Z. Sun, O. T. Aduragba, L. Shi, N. A. Moubayed, INTERACTION: A generative XAI framework for natural language inference explanations, 2022. URL: https://arxiv.org/abs/2209.01061. doi:10.48550/ARXIV.2209.01061.
[4] R. McAllister, Y. Gal, A. Kendall, M. van der Wilk, A. Shah, R. Cipolla, A. Weller, Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning, in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 2017, pp. 4745–4753. URL: https://doi.org/10.24963/ijcai.2017/661. doi:10.24963/ijcai.2017/661.
[5] R. Challen, J. Denny, M. Pitt, L. Gompels, T. Edwards, K. Tsaneva-Atanasova, Artificial intelligence, bias and clinical safety, BMJ Quality & Safety 28 (2019) 231–237. URL: https://qualitysafety.bmj.com/content/28/3/231. doi:10.1136/bmjqs-2018-008370.
[6] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, 2017. URL: https://arxiv.org/abs/1706.04599. doi:10.48550/ARXIV.1706.04599.
[7] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, R. Chatila, F. Herrera, Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, 2019. URL: https://arxiv.org/abs/1910.10045. doi:10.48550/ARXIV.1910.10045.
[8] D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, M. Rohrbach, Multimodal explanations: Justifying decisions and pointing to the evidence, 2018. URL: https://arxiv.org/abs/1802.08129. doi:10.48550/ARXIV.1802.08129.
[9] S. Kumar, P. Talukdar, NILE: Natural language inference with faithful natural language explanations, 2020. URL: https://arxiv.org/abs/2005.12116. doi:10.48550/ARXIV.2005.12116.
[10] V. Della Mea, M. H. Popescu, K. Roitero, Underlying cause of death identification from death certificates using reverse coding to text and a NLP based deep learning approach, Informatics in Medicine Unlocked 21 (2020) 100456. URL: https://www.sciencedirect.com/science/article/pii/S2352914820306067. doi:10.1016/j.imu.2020.100456.
[11] K. Roitero, B. Portelli, M. H. Popescu, V. Della Mea, DilBERT2: Cheap embeddings for disease related medical NLP, IEEE Access 9 (2021) 159714–159723. doi:10.1109/ACCESS.2021.3131386.
[12] M. H. Popescu, K. Roitero, S. Travasci, V. Della Mea, Automatic assignment of ICD-10 codes to diagnostic texts using transformers based techniques, in: 2021 IEEE 9th International Conference on Healthcare Informatics (ICHI), 2021, pp. 188–192. doi:10.1109/ICHI52183.2021.00037.
[13] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Comput. Surv. 51 (2018). URL: https://doi.org/10.1145/3236009. doi:10.1145/3236009.
[14] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 1135–1144. URL: https://doi.org/10.1145/2939672.2939778. doi:10.1145/2939672.2939778.
[15] O.-M. Camburu, T. Rocktäschel, T. Lukasiewicz, P. Blunsom, e-SNLI: Natural language inference with natural language explanations, in: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., 2018. URL: https://proceedings.neurips.cc/paper/2018/file/4c7a167bb329bd92580a99ce422d6fa6-Paper.pdf.
[16] E. Brand, K. Roitero, M. Soprano, G. Demartini, E-BART: Jointly predicting and explaining truthfulness, in: TTO, 2021, pp. 18–27.
[17] E. Brand, K. Roitero, M. Soprano, A. Rahimi, G. Demartini, A neural model to jointly predict and explain truthfulness of statements, J. Data and Information Quality (2022). URL: https://doi.org/10.1145/3546917. doi:10.1145/3546917.
[18] C. Agarwal, D. D'souza, S. Hooker, Estimating example difficulty using variance of gradients, 2020. URL: https://arxiv.org/abs/2008.11600. doi:10.48550/ARXIV.2008.11600.
[19] World Health Organization, International statistical classification of diseases and related health problems, 10th revision, Volume 2, https://icd.who.int/browse10/Content/statichtml/ICD10Volume2_en_2016.pdf, 2016. [Online; accessed 21-September-2022].
[20] World Health Organization, International statistical classification of diseases and related health problems, 11th revision, https://icd.who.int/en, 2022. [Online; accessed 21-September-2022].
[21] M. H. Popescu, C. Celik, V. Della Mea, R. Jakob, Preliminary validation of a rule-based system for mortality coding using ICD-11, Stud. Health Technol. Inform. 294 (2022) 679–683.
[22] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67.
[23] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, 2004, pp. 74–81.
[24] C.-Y. Lin, F. Och, Looking for a few good metrics: ROUGE and its evaluation, in: NTCIR Workshop, 2004.
[25] F. Liu, Y. Liu, Correlation between ROUGE and human evaluation of extractive meeting summaries, in: Proceedings of ACL-08: HLT, Short Papers, 2008, pp. 201–204.