Language Transfer for Identifying Diagnostic Paragraphs in Clinical Notes

Luca Di Liello, Olga Uryupina and Alessandro Moschitti
University of Trento, Italy
{luca.diliello,moschitti}@unitn.it, uryupina@gmail.com

Abstract

English. This paper aims at uncovering the structure of clinical documents, in particular, identifying paragraphs describing "diagnosis" or "procedures". We present transformer-based architectures for approaching this task in a monolingual setting (English), exploring a weak supervision scheme. We further extend our contribution to a cross-lingual scenario, mitigating the need for expensive manual data annotation and taxonomy engineering for Italian.

Italian. In this work we study in depth the structure of clinical documents and, in particular, build automatic systems for the extraction of paragraphs containing diagnoses and procedures. Using models based on the transformer architecture, we extract diagnoses and procedures in the monolingual (English) setting. We then extend our research to the multilingual scenario, reducing the need for large manually annotated Italian datasets through machine translation and transfer learning.

1 Introduction

Big Data approaches have been shown to yield a breakthrough in a variety of healthcare-related tasks, ranging from eHealth governance and policy making to precision medicine and smart solutions/suites for hospitals or individual doctors. They rely on large-scale and reliable automatic processing of vast amounts of heterogeneous data, i.e., images, lab reports and, most importantly, textual medical documentation.

The current paper focuses on Medical Discourse Analysis: imposing structure on digitalized health reports through document segmentation and labeling of relevant segments (e.g., diagnoses). Identifying and interpreting discourse fragments is essential for accurate and robust Information Extraction from medical documents. In terms of doctor assistance, such a system could quickly and reliably identify the most crucial parts of voluminous health records, allowing to highlight them for improved visibility and thus reducing cognitive load on doctors. For example, a highlighted problematic diagnosis can alert a doctor perusing a large medical dossier. In terms of automated data analytics, discourse structure is crucial for correct interpretation of extracted information. For example, if we want to study a possible correlation between the use of a specific medicine and some outcome, we should only consider documents where this medicine is mentioned as a part of therapy, but not as a part of allergies.

Some medical documents are generated using task-specific eHealth software imposing a certain discourse structure. In Italy, however, there is no single software adopted at either national or regional level. While there is a general agreement on the nature of the information to be included, there are no guidelines or programmatic implementations for structuring it. In addition, historical records, produced before the adoption of recording software, follow the logic of individual doctors and thus show even more variability. We therefore aim at a statistical model that is able to infer the discourse structure without making any assumptions about the recording software.

An important advantage of our approach is its adaptability to new domains (e.g., radiology reports) or languages, as well as its robustness in the (highly probable) scenario where new report-generating systems appear on the market.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Several recent studies (Sec. 2) focus on segment labeling for medical records in English. To our knowledge, no approach has been proposed so far to analyze medical discourse structure automatically in other languages, including, most importantly, Italian. The required research is hampered by the lack of resources in other languages, ranging from the absence of data annotated for discourse structure, either for training or for benchmarking, to the lack of high-coverage resources, e.g., taxonomies. In our study, we propose a language transfer approach to the problem of medical discourse analysis in Italian. We first investigate possibilities for training robust monolingual models (Sec. 4) and then build upon our monolingual results to transfer the model to another language (Sec. 5).

2 State of the Art

In the past decade, a massive effort has been invested into automatically analyzing textual medical data (clinical notes). The notes' internal logic is crucial for interpreting their underlying semantics, thus enabling better understanding and interoperability. This has given rise to empirical studies on medical document structure: reliable and interpretable annotation guidelines and systems for automatically segmenting clinical notes and annotating segments with labels such as allergy or diagnosis.

The most thorough attempt at defining clinical records' structure via a taxonomy of section headers has been undertaken by Denny et al. (2008). This study developed SecTag, a hierarchical header terminology supporting mappings to LOINC and other taxonomies. Table 1 shows some SecTag entries related to diagnosis and their parameters relevant for the present study.¹ The SecTag concepts (column 1) are organized hierarchically, with specific diagnoses (e.g., admission or discharge diagnoses) being subnodes (column 2) of the main diagnosis concept (SecTag node "5.22"). Different ways of expressing the specific semantics via headers (column 3) are then linked to the corresponding nodes. SecTag advocates a practical data-driven approach, thus listing headers that are not always grammatical (e.g., "admit diagnosis"), provided they are commonly used by practicing clinicians. Most importantly, SecTag goes beyond a superficial view of the task, not only linking easily identifiable headers (e.g., most common spellings, headers containing important key words), but also organising hierarchically concepts that are normally expressed in very distinct ways (e.g., linking "cause of death" or "gaf" to diagnoses). In total, SecTag provides 94 entries just for diagnosis. This shows that considerable medical expertise is required for creating a similar resource for other languages from scratch.

The SecTag release has led to the development of a related method for automatic identification of sections in clinical notes (Denny et al., 2009), via a combination of NLP techniques, terminology-based rules, and naive Bayes classification.

While the SecTag approach exhibits remarkable performance, creation and maintenance of the header taxonomy is a very expensive task requiring considerable medical expertise. More data-driven approaches have been proposed recently for English (Rosenthal et al., 2019; Dai et al., 2015), among others. These systems, however, require manually labeled data.

3 Data for Identifying Diagnoses and Procedures Segments

3.1 English Data: MIMIC-III

Several large collections of medical data, with partial NLP annotations, have been released recently, for example, MIMIC (Johnson et al., 2016) or I2B2.² Unfortunately, none of these resources provide annotation for discourse structure. Our study relies on the MIMIC-III dataset, extending it with an extra layer to label diagnosis and procedure fragments. Our choice follows practical motivations: it is the largest available dataset, most commonly used by the AI community. We only rely on the textual data from MIMIC discharge notes (the NOTEEVENTS table); however, future work can explore possibilities of joint modeling of textual and numeric data (e.g., lab measurements).
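The hierarchical SecTag lookup described above can be sketched as follows. This is our own toy illustration using only the Table 1 entries; the function and dictionary names are ours, not part of the SecTag release.

```python
# Toy sketch of a SecTag-style header lookup (illustrative names, Table 1 data):
# each header string maps to a concept with a hierarchical tree id, and a header
# counts as diagnostic when its id descends from the "diagnoses" node ("5.22").

SECTAG_SAMPLE = {
    "diagnosis": ("diagnoses", "5.22"),
    "primary diagnoses": ("principle diagnosis", "5.22.39"),
    "cause of death": ("diagnosis at death", "5.22.41"),
    "admit diagnosis": ("admission diagnosis", "5.22.44"),
    "discharge diagnosis": ("discharge diagnosis", "5.22.45"),
    "gaf": ("global assessment functioning", "5.22.49.58.11"),
}

def is_diagnostic_header(header: str, root: str = "5.22") -> bool:
    """True if the normalized header maps to a node under the diagnosis root."""
    entry = SECTAG_SAMPLE.get(header.strip().lower())
    if entry is None:
        return False  # header unknown to the taxonomy
    _, tree_id = entry
    return tree_id == root or tree_id.startswith(root + ".")
```

Note how the tree-id prefix check is what links superficially unrelated headers such as "gaf" or "cause of death" back to the diagnosis concept.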
concept                         taxonomy tree id   header
diagnoses                       5.22               diagnosis
principle diagnosis             5.22.39            primary diagnoses
diagnosis at death              5.22.41            cause of death
admission diagnosis             5.22.44            admit diagnosis
discharge diagnosis             5.22.45            discharge diagnosis
global assessment functioning   5.22.49.58.11      gaf

Table 1: Examples of diagnostic headers in the SecTag taxonomy.

                              MIMIC discharge   exprivia-10   exprivia-100
total documents               59652             10            100
paragraphs per doc            30.57             7.7           26.77
diagnoses per doc             1.22              0.8           1.28
documents with no diagnosis   8674 (14.5%)      2 (20%)       27 (27%)
procedures per doc            0.71              N/A           N/A
documents with no procedure   20797 (35.86%)    N/A           N/A

Table 2: MIMIC-III discharge (silver annotation with SecTag) vs. Exprivia datasets (gold annotation).

¹ SecTag entries contain 16 parameters, inheriting information from referenced taxonomies such as LOINC; most of them are of no practical relevance in our case and, moreover, are typically set to NULL.
² https://www.i2b2.org/

We have built a rule-based algorithm for annotating MIMIC with diagnosis/procedure fragments. We segment a note into fragments and label them based on their headers, looking them up in SecTag (Section 2). For fragments with no header, we propagate the label from the previous fragment. Fragments with headers not found in SecTag are considered −diagnosis, −procedure. The headers are then removed from the document, thus forcing the model to learn paragraph classification from the textual content, relying on headers as a silver supervision signal.

While a typical MIMIC note has a single diagnostic paragraph, some contain multiple diagnostic fragments: (i) some notes span multiple related reports, where each report comes with its own diagnosis; (ii) some notes contain semantically different diagnostic sections (e.g., "admitting diagnosis" and "discharge diagnosis"); (iii) some notes cover complex cases where the diagnostic section is expressed in several (consecutive) paragraphs.

Since SecTag predates the major MIMIC releases, some popular headers are missing; we have therefore manually extended the taxonomy (6.7k headers) to cover another 75 of the most popular headers. The expansion yielded a considerable increase in procedure paragraphs, drastically augmenting the number of positive examples for training the procedure classifier. At the same time, the overall precision improved, eliminating some consistent errors with diagnosis paragraphs. In what follows, we always rely on data preprocessed with the expanded SecTag.

3.2 Italian Data: Exprivia Datasets

A large collection of discharge reports in Italian has been provided by Exprivia S.p.a. The documents show some similarity to MIMIC discharge reports: they are typically 0.5-1 page long, they can be split into paragraphs rather reliably, and they exhibit considerable variability in terms of the underlying discourse structure. Each document is associated with a set of ICD-9 codes for discharge diagnoses. Yet, similarly to MIMIC, no inline manual annotation is provided for identifying textual segments referring to diagnoses/procedures.

To provide accurate test data for our multilingual approach, a human expert has conducted a manual annotation of the Italian set. We have labeled a pilot of 10 notes and a random sample of 100 notes. The annotation only covered diagnosis, as our pilot phase revealed that labeling procedure required considerably more elaborate guidelines and medical training.

Table 2 compares document statistics for discharge notes from the MIMIC-III and Exprivia datasets. It suggests that the pilot can only be used as a very preliminary sample of the data: the notes are rather small and contain few diagnoses. The Italian documents from exprivia-100 show a striking similarity to MIMIC: there are on average around 25-30 paragraphs per document, 1.2-1.3 of which are diagnostic. The major difference comes from the documents with no diagnosis (27% in Italian, 14.5% in English). We believe that this similarity reflects the fact that, despite differences in national and local healthcare regulations as well as individual practicing/recording approaches, clinical notes reflect a common underlying semantics; thus a language transfer model can be successful for our task, mitigating the need for very time-consuming and costly expert effort on constructing taxonomies similar to SecTag in Italian.
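The rule-based silver annotation of Section 3.1 (label from header lookup, propagation to header-less fragments, negative label for unknown headers, header removal) can be sketched as follows. This is a minimal sketch under our own assumptions: the function names are hypothetical, the lexicon is a toy stand-in for SecTag, and we assume headers are lines ending with a colon.

```python
# Illustrative sketch of the silver-labeling procedure (not the authors' code).
# Toy stand-in for the SecTag lookup: normalized header -> label.
HEADER_LABELS = {
    "discharge diagnosis": "diagnosis",
    "admit diagnosis": "diagnosis",
    "major surgical or invasive procedure": "procedure",
}

def silver_label(note: str):
    """Split a note into paragraphs, label each from its header (if any),
    propagate labels to header-less paragraphs, and strip the headers."""
    labeled = []
    current = "none"
    for para in filter(None, (p.strip() for p in note.split("\n\n"))):
        first_line, _, rest = para.partition("\n")
        if first_line.endswith(":"):          # paragraph opens with a header
            header = first_line[:-1].strip().lower()
            # headers missing from the lexicon yield a negative label
            current = HEADER_LABELS.get(header, "none")
            text = rest.strip()               # header removed from the text
        else:                                 # no header: propagate the label
            text = para
        labeled.append((text, current))
    return labeled
```

The header is dropped from the emitted text, so a downstream classifier must learn the label from the paragraph content alone, which is exactly the silver-supervision effect described above.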
4 Transformer-Based Architectures for Diagnosis and Procedure Extraction

Transformer-based models have recently become the standard in NLP. Models like BERT (Devlin et al., 2019) and ELECTRA (Clark et al., 2020) have shown impressive performance when compared to the previous state of the art. These models are based on the Transformer block (Vaswani et al., 2017), which exploits the attention mechanism to find relations between all pairs of tokens in the input text and thus creates deep contextualized representations. Transformer layers can be stacked to create more powerful and refined models. For computational efficiency, we focus on architectures with no more than 12 layers.

Transformer         Language   parameters
BERT-base-uncased   English    109M
BERT-base-cased     English    108M
ELECTRA-small       English    13M
BERT-Ita            Italian    110M
BERTino             Italian    68M

Table 3: Transformer models used in the empirical evaluation.

Tokenization. Raw text cannot be provided directly to a transformer-based model: it is first tokenized using a fixed-size vocabulary, created via a segmentation algorithm, e.g., WordPiece. We extended the BERT vocabulary to account for possible de-identified medical input.

Pre-training and fine-tuning. Transformer-based models are usually trained in a 2-step fashion. The model is first pretrained on a huge amount of artificially labelled text taken from sources like Wikipedia or CommonCrawl. At the fine-tuning stage, the model is adapted to a specific task, e.g., Question Answering or Diagnosis Extraction. Since the model is already able to create good contextualized representations, the fine-tuning requires only a small amount of manually labelled examples. Following common transformer fine-tuning practices, we classify paragraphs into ±diagnosis with a binary classification head on top of the first token output.

5 Language Transfer for Diagnosis Identification

The main bottleneck for NLP on medical data in Italian lies in the lack of annotated data and professionally created resources similar to SecTag. To mitigate this issue, we advocate a language transfer approach, combining our transformer models (Section 4) with state-of-the-art machine translation (MT).

We investigate three cross-lingual settings. In the baseline setup, we do not perform any translation, relying on BERT's tokenizer and cross-lingual embeddings to learn informative sub-word clues for diagnostic paragraphs.

Our second cross-lingual pipeline builds directly upon the model presented in Section 4. We use an MT component to translate test documents from Italian into English, run our diagnosis identification model and then port the results to the Italian original via a trivial paragraph-level alignment. Note that this model is trained on high-quality data in English and tested on noisy automatically translated data.

For the third pipeline, we first translate the whole training set from English into Italian, while keeping paragraphs aligned. We follow the methodology from Section 4 to train a new model, operating on Italian directly. Note that, unlike the second pipeline, this approach implies training on noisy automatically translated data while testing on high-quality Italian. The effect of this is twofold: on the one hand, the task becomes more difficult to learn; on the other hand, the resulting classifier should be more robust.

To obtain a satisfactory translation using open-source architectures, we rely on the transformer encoder-decoder models (Tiedemann and Thottingal, 2020) trained on the OPUS corpus.³ While the OPUS corpus is not tailored specifically to the medical domain, its large size and generic nature allow for training very robust MT models. We exploit the two models to translate from English to Italian⁴ and from Italian to English⁵. Both are transformer encoder-decoder models trained with the Causal Language Modeling objective.

³ https://opus.nlpl.eu
⁴ https://huggingface.co/Helsinki-NLP/opus-mt-en-it
⁵ https://huggingface.co/Helsinki-NLP/opus-mt-it-en

6 Experiments

6.1 Setup

Data processing. We split the MIMIC-III discharge dataset into training, development and testing sets (60%, 20% and 20% respectively). We used the first for training all the models presented in this study, while we use the other two for checkpoint selection, hyper-parameter tuning (batch size and learning rate) and evaluating the monolingual model. We used the exprivia-10 set for validation and the exprivia-100 set for testing in the cross-lingual (language transfer) experiments.

Transformer Models. We run most experiments in two modes: (i) with powerful transformer components comprising a large number of parameters and providing top performance, such as BERT (Devlin et al., 2019) and BERT-ita,⁶ and (ii) with small and efficient transformer models such as ELECTRA-small (Clark et al., 2020) and BERTino (Muffo and Bertino, 2020). The objective of this setup was to measure the performance/efficiency trade-off. Table 3 presents all the used transformer models with the respective number of parameters.

Task        Filt. Accuracy   Precision@1
Paragraph-level granularity
Diagnosis   92.4             95.9
Procedure   97.1             98.4

Table 4: Diagnosis and procedure discourse segment identification, monolingual setting (English), document-level view: training, fine-tuning and testing on subsets of MIMIC-III discharge.
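The 60/20/20 data split described in the setup can be sketched as below. The seeded shuffle is our own assumption for reproducibility; the paper does not specify how documents were assigned to splits.

```python
# Sketch (our assumption, not the authors' code) of a reproducible
# 60% / 20% / 20% train / dev / test split over document ids.
import random

def split_dataset(doc_ids, seed=0):
    """Shuffle document ids with a fixed seed and split them 60/20/20."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)       # deterministic for a given seed
    n_train = int(0.6 * len(ids))
    n_dev = int(0.2 * len(ids))
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]           # remainder, roughly 20%
    return train, dev, test
```

Splitting at the document level (rather than the paragraph level) keeps all paragraphs of a note in the same split, which matters for document-level metrics such as Filtering Accuracy.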
Evaluation metrics. The Diagnosis/Procedure classification task shows a very skewed label distribution. For this reason, we approach it from an information retrieval viewpoint, i.e., we rank paragraphs based on their probability of containing a diagnosis. We use Mean Average Precision and Precision@1 to evaluate the ranking quality. The former takes into account the whole ranking and is therefore the best indicator of the ranking quality. The latter indicates the number of times a correct diagnosis is returned in the first position. To provide a better comparison, we report MAP and P@1 averaging only over the documents that contain at least one diagnosis. We also report model accuracy in recognizing documents with no diagnoses (Filtering Accuracy). This metric was introduced because a relevant fraction of documents does not contain a diagnosis, see Table 2.

⁶ https://huggingface.co/dbmdz/bert-base-italian-xxl-cased

Figure 1: Learning curves on the exprivia-10 validation set in the Italian pipeline: BERT-Ita (top) vs. BERTino (bottom). MAP (y-axis) for a given number of training steps (x-axis).

6.2 Results

Monolingual results. Table 4 summarizes the English results. The numbers refer to a BERT-base-cased model fine-tuned with a batch size of 64 and a learning rate of 2·10⁻⁶. The model is able to identify very accurately documents with no diagnoses/procedures (92.4% and 97.1% accuracy respectively). Moreover, the binary classification of paragraphs into diagnoses (or not) and procedures (or not) is very reliable: 95.9% and 98.4% P@1 at document level.

Cross-lingual experiments. Table 5 shows the results of our language transfer experiments. A moderate performance (58.8% Filtering Accuracy, 49.2% P@1) can be achieved via a BERT model trained on English MIMIC data and directly tested on the Italian exprivia-100 set. Multilingual-BERT does slightly better, as it was trained on 104 languages, English and Italian included. This approach relies on joint multilingual embeddings and fine tokenization. It can, for example, identify and align stems of Latin origin for some disease names. However, it cannot go much beyond that: it is not able to model deep semantics related to medical processes.

Model                     Development set   Filt. Accuracy   Precision@1   MAP
Cross-Lingual BERT
BERT-base-uncased         exprivia-10       58.8             49.2          58.5
Multilingual-BERT-cased   exprivia-10       51.2             73.5          75.6
MT-based pipeline-2, train on English (MIMIC), test on English translation of exprivia-100
BERT-cased                exprivia-10       31.8 (7.6)       67.4 (6.8)    69.2 (3.3)
BERT-cased                MIMIC dev         53.1 (9.0)       73.9 (6.6)    73.3 (4.9)
ELECTRA-small             exprivia-10       64.6 (9.5)       60.5 (12.6)   71.2 (9.0)
ELECTRA-small             MIMIC dev         54.2 (8.7)       62.4 (11.2)   73.2 (7.9)
MT-based pipeline-3, train on Italian translation of MIMIC, test on Italian (exprivia-100)
BERT-ita                  exprivia-10       69.8 (6.2)       78.6 (7.3)    81.5 (3.8)
BERT-ita                  MIMIC dev         67.1 (7.8)       73.7 (3.0)    77.2 (3.1)
BERTino                   exprivia-10       72.0 (7.5)       74.9 (2.9)    81.9 (2.6)
BERTino                   MIMIC dev         67.7 (4.1)       77.3 (2.5)    83.3 (1.9)

Table 5: Language transfer models, fine-tuning on the MIMIC training set and evaluation on the exprivia-100 test set; boldface indicates the best results. Standard deviation across 5 runs shown in brackets.

The use of MT shows a considerable improvement over the baseline. The results suggest a better performance for the setting where the training set is translated into Italian and the diagnosis extraction model is then learned on (noisy) Italian data.
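The ranking metrics from Section 6.1 can be sketched as follows. This is our own minimal implementation, not the authors' evaluation code: each document is a list of (score, is_diagnosis) pairs, paragraphs are ranked by predicted score, and AP and P@1 are averaged only over documents containing at least one diagnostic paragraph, as in the paper.

```python
# Sketch of MAP and Precision@1 over ranked paragraphs (our implementation).

def average_precision(ranked_labels):
    """AP for one document given labels in ranked order (True = diagnosis)."""
    hits, precisions = 0, []
    for rank, is_diag in enumerate(ranked_labels, start=1):
        if is_diag:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def evaluate(documents):
    """MAP and P@1, averaged over documents with >= 1 diagnostic paragraph."""
    aps, p_at_1 = [], []
    for doc in documents:
        ranked = sorted(doc, key=lambda pair: pair[0], reverse=True)
        labels = [is_diag for _, is_diag in ranked]
        if any(labels):                     # skip documents with no diagnosis
            aps.append(average_precision(labels))
            p_at_1.append(1.0 if labels[0] else 0.0)
    n = len(aps)
    return sum(aps) / n, sum(p_at_1) / n
```

Documents with no diagnostic paragraph are excluded here and scored separately by Filtering Accuracy, i.e., the accuracy of recognizing that a document contains no diagnosis at all.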
Moreover, this approach is much faster when used as a service, as it operates directly on Italian input.

We performed all the MT-based experiments 5 times using random seeds to enable a better statistical assessment of the results. While in general the standard deviation is rather small considering the very small test set, the setting with a translated test set leads to unstable benchmarking, especially for the smaller ELECTRA transformer.

Finally, smaller transformer models, especially BERTino, exhibit very small performance drops compared to larger transformers. This suggests that they are robust enough to capture paragraph-level diagnosis semantics. Therefore, it is possible to run the extraction service with low computational resources, e.g., using CPUs. Figure 1 shows the stability of the learning with translated training data. Small models are able to match the performance of larger models, while also being faster to converge. We believe that smaller models overfit the MIMIC training data less, thus providing a better final performance on the Exprivia data. Note that training was stopped after a fixed amount of time for every experiment; BERTino, being smaller, is able to do more steps in the same amount of time.

7 Conclusion

We present a language transfer approach to unraveling the discourse structure of clinical notes, focusing on diagnosis and procedure. We combine transformer-based paragraph modeling with state-of-the-art MT architectures in a novel application that is essential for eHealth big data analytics. Most importantly, our language transfer approach helps mitigate the need for expensive and time-consuming medical resource creation (annotated training data as well as a header taxonomy) in Italian.

We empirically investigate two translation-based architectures, showing that both of them outperform a generic cross-lingual pipeline. The approach based on translating the training data is more robust and efficient (at runtime) compared to translating the test data, yielding more stable performance.

In the future, we plan to expand our study to other discourse segments, such as allergy or history. However, our first experiments with procedure segments show that, unlike diagnosis, modeling and even annotating other headers require a tighter collaboration with medical experts.

8 Acknowledgements

The research presented in this paper has been supported by the Autonomous Province of Trento (project CareGenius). The computational power has been provided by the High Performance Computing department of the CINECA Consortium (ISCRA project CareGeni).

References

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators.

Hong-Jie Dai, Shabbir Syed-Abdul, Chih-Wei Chen, and Chieh-Chen Wu. 2015. Recognition and evaluation of clinical section headings in clinical documents using token-based formulation with conditional random fields. BioMed Research International, 2015.

Joshua Denny, Randolph Miller, Kevin Johnson, and Anderson Spickard. 2008. Development and evaluation of a clinical note section header terminology. In Proceedings of the AMIA Annual Symposium, pages 156–160.

Joshua Denny, Anderson Spickard, Kevin Johnson, Neeraja Peterson, Josh Peterson, and Randolph Miller. 2009. Evaluation of a method to identify and categorize section headers in clinical documents. Journal of the American Medical Informatics Association: JAMIA, 16(6):806–815.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.

Alistair E.W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3.

Matteo Muffo and E. Bertino. 2020. BERTino: An Italian DistilBERT model. In CLiC-it.
Sara Rosenthal, Ken Barker, and Zhicheng Liang. 2019. Leveraging medical literature for section prediction in electronic health records. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4864–4873, Hong Kong, China, November. Association for Computational Linguistics.

Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT — Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.