Language Transfer for Identifying Diagnostic Paragraphs in Clinical Notes

Luca Di Liello, Olga Uryupina and Alessandro Moschitti
University of Trento, Italy
{luca.diliello,moschitti}@unitn.it, uryupina@gmail.com

Abstract

English. This paper aims at uncovering the structure of clinical documents, in particular, identifying paragraphs describing "diagnosis" or "procedures". We present transformer-based architectures for approaching this task in a monolingual setting (English), exploring a weak supervision scheme. We further extend our contribution to a cross-lingual scenario, mitigating the need for expensive manual data annotation and taxonomy engineering for Italian.

Italian. In this work we study in depth the structure of clinical documents and, in particular, build automatic systems for the extraction of paragraphs containing diagnoses and procedures. Using models based on the transformer architecture, we extract diagnoses and procedures in the monolingual (English) setting. We then extend our research to the multilingual scenario, reducing the need for large manually annotated Italian datasets through machine translation and transfer learning.

1 Introduction

Big Data approaches have been shown to yield a breakthrough in a variety of healthcare-related tasks, ranging from eHealth governance and policy making to precision medicine and smart solutions/suites for hospitals or individual doctors. They rely on large-scale and reliable automatic processing of vast amounts of heterogeneous data, i.e., images, lab reports and, most importantly, textual medical documentation.

The current paper focuses on Medical Discourse Analysis: imposing structure on digitalized health reports through document segmentation and labeling of relevant segments (e.g., diagnoses). Identifying and interpreting discourse fragments is essential for accurate and robust Information Extraction from medical documents. In terms of doctor assistance, such a system could quickly and reliably identify the most crucial parts of voluminous health records, allowing to highlight them for improved visibility and thus reducing cognitive load on doctors. For example, a highlighted problematic diagnosis can alert a doctor perusing a large medical dossier. In terms of automated data analytics, discourse structure is crucial for correct interpretation of extracted information. For example, if we want to study a possible correlation between the use of a specific medicine and some outcome, we should only consider documents where this medicine is mentioned as a part of therapy, but not as a part of allergies.

Some medical documents are generated using task-specific eHealth software imposing a certain discourse structure. In Italy, however, there is no single software adopted at either national or regional level. While there is a general agreement on the nature of the information to be included, there are no guidelines or programmatic implementations for structuring it. In addition, historical records, produced before the adoption of recording software, follow the logic of individual doctors and thus show even more variability. We therefore aim at a statistical model that is able to infer the discourse structure without making any assumptions about the recording software.

An important advantage of our approach is its adaptability to new domains (e.g., radiology reports) or languages, as well as its robustness in the (highly probable) scenario where new report-generating systems appear on the market.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Several recent studies (Sec. 2) focus on segment labeling for medical records in English. To our knowledge, no approach has been proposed so far to analyze medical discourse structure automatically in other languages, including, most importantly, Italian. The required research is hampered by the lack of resources in other languages, ranging from the absence of data annotated for discourse structure, either for training or for benchmarking, to the lack of high-coverage resources, e.g., taxonomies. In our study, we propose a language transfer approach to the problem of medical discourse analysis in Italian. We first investigate possibilities for training robust monolingual models (Sec. 4) and then build upon our monolingual results to transfer the model to another language (Sec. 5).

2 State of the Art

In the past decade, a massive effort has been invested into automatically analyzing textual medical data (clinical notes). The notes' internal logic is crucial for interpreting their underlying semantics, thus enabling better understanding and interoperability. This has given rise to empirical studies on medical document structure: reliable and interpretable annotation guidelines and systems for automatically segmenting clinical notes and annotating segments with labels such as allergy or diagnosis.

The most thorough attempt at defining clinical records' structure via a taxonomy of section headers has been undertaken by Denny et al. (2008). This study developed SecTag, a hierarchical header terminology supporting mappings to LOINC and other taxonomies. Table 1 shows some SecTag entries related to diagnosis and their parameters relevant for the present study.¹ The SecTag concepts (column 1) are organized hierarchically, with specific diagnoses (e.g., admission or discharge diagnoses) being subnodes (column 2) of the main diagnosis concept (SecTag node "5.22"). Different ways of expressing the specific semantics via headers (column 3) are then linked to the corresponding nodes. SecTag advocates a practical data-driven approach, thus listing headers that are not always grammatical (e.g., "admit diagnosis"), provided they are commonly used by practicing clinicians. Most importantly, SecTag goes beyond a superficial view of the task, not only linking easily identifiable headers (e.g., most common spellings, headers containing important key words), but also organising hierarchically concepts that are normally expressed in very distinct ways (e.g., linking "cause of death" or "gaf" to diagnoses). In total, SecTag provides 94 entries just for diagnosis. This shows that considerable medical expertise is required for creating a similar resource for other languages from scratch.

The SecTag release has led to the development of a related method for automatic identification of sections in clinical notes (Denny et al., 2009), via a combination of NLP techniques, terminology-based rules, and naive Bayes classification.

While the SecTag approach exhibits remarkable performance, creation and maintenance of the header taxonomy is a very expensive task requiring considerable medical expertise. More data-driven approaches have been proposed recently for English (Rosenthal et al., 2019; Dai et al., 2015), among others. These systems, however, require manually labeled data.

3 Data for Identifying Diagnoses and Procedures Segments

3.1 English Data: MIMIC-III

Several large collections of medical data, with partial NLP annotations, have been released recently, for example, MIMIC (Johnson et al., 2016) or I2B2.² Unfortunately, none of these resources provide annotation for discourse structure. Our study relies on the MIMIC-III dataset, extending it with an extra layer to label diagnosis and procedure fragments. Our choice follows practical motivations: it is the largest available dataset, most commonly used by the AI community. We only rely on the textual data from MIMIC discharge notes (the NOTEEVENTS table); however, future work can explore possibilities of joint modeling of textual and numeric data (e.g., lab measurements).
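The hierarchical SecTag lookup described above can be sketched as follows. This is our own toy illustration using only the Table 1 entries; the function and dictionary names are ours, not part of the SecTag release.

```python
# Toy sketch of a SecTag-style header lookup (illustrative names, Table 1 data):
# each header string maps to a concept with a hierarchical tree id, and a header
# counts as diagnostic when its id descends from the "diagnoses" node ("5.22").

SECTAG_SAMPLE = {
    "diagnosis": ("diagnoses", "5.22"),
    "primary diagnoses": ("principle diagnosis", "5.22.39"),
    "cause of death": ("diagnosis at death", "5.22.41"),
    "admit diagnosis": ("admission diagnosis", "5.22.44"),
    "discharge diagnosis": ("discharge diagnosis", "5.22.45"),
    "gaf": ("global assessment functioning", "5.22.49.58.11"),
}

def is_diagnostic_header(header: str, root: str = "5.22") -> bool:
    """True if the normalized header maps to a node under the diagnosis root."""
    entry = SECTAG_SAMPLE.get(header.strip().lower())
    if entry is None:
        return False  # header unknown to the taxonomy
    _, tree_id = entry
    return tree_id == root or tree_id.startswith(root + ".")
```

Note how the tree-id prefix check is what links superficially unrelated headers such as "gaf" or "cause of death" back to the diagnosis concept.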
concept                         taxonomy tree id   header
diagnoses                       5.22               diagnosis
principle diagnosis             5.22.39            primary diagnoses
diagnosis at death              5.22.41            cause of death
admission diagnosis             5.22.44            admit diagnosis
discharge diagnosis             5.22.45            discharge diagnosis
global assessment functioning   5.22.49.58.11      gaf

Table 1: Examples of diagnostic headers in the SecTag taxonomy.

                              MIMIC discharge   exprivia-10   exprivia-100
total documents               59652             10            100
paragraphs per doc            30.57             7.7           26.77
diagnoses per doc             1.22              0.8           1.28
documents with no diagnosis   8674 (14.5%)      2 (20%)       27 (27%)
procedures per doc            0.71              N/A           N/A
documents with no procedure   20797 (35.86%)    N/A           N/A

Table 2: MIMIC-III discharge (silver annotation with SecTag) vs. Exprivia datasets (gold annotation).

¹ SecTag entries contain 16 parameters, inheriting information from referenced taxonomies such as LOINC; most of them are of no practical relevance in our case and, moreover, are typically set to NULL.
² https://www.i2b2.org/

We have built a rule-based algorithm for annotating MIMIC with diagnosis/procedure fragments. We segment a note into fragments and label them based on their headers, looking them up in SecTag (Section 2). For fragments with no header, we propagate the label from the previous fragment. Fragments with headers not found in SecTag are considered −diagnosis, −procedure. The headers are then removed from the document, thus forcing the model to learn paragraph classification from the textual content, relying on headers as a silver supervision signal.

While a typical MIMIC note has a single diagnostic paragraph, some contain multiple diagnostic fragments: (i) some notes span multiple related reports, where each report comes with its own diagnosis; (ii) some notes contain semantically different diagnostic sections (e.g., "admitting diagnosis" and "discharge diagnosis"); (iii) some notes cover complex cases where the diagnostic section is expressed in several (consecutive) paragraphs.

Since SecTag predates the major MIMIC releases, some popular headers are missing; we have therefore manually extended the taxonomy (6.7k headers) to cover another 75 of the most popular headers. The expansion yielded a considerable increase in procedure paragraphs, drastically augmenting the number of positive examples for training the procedure classifier. At the same time, the overall precision improved, eliminating some consistent errors with diagnosis paragraphs. In what follows, we always rely on data preprocessed with the expanded SecTag.

3.2 Italian Data: Exprivia Datasets

A large collection of discharge reports in Italian has been provided by Exprivia S.p.a. The documents show some similarity to MIMIC discharge reports: they are typically 0.5-1 page long, they can be split into paragraphs rather reliably, and they exhibit considerable variability in terms of the underlying discourse structure. Each document is associated with a set of ICD-9 codes for discharge diagnoses. Yet, similarly to MIMIC, no inline manual annotation is provided for identifying textual segments referring to diagnoses/procedures.

To provide accurate test data for our multilingual approach, a human expert has conducted a manual annotation of the Italian set. We have labeled a pilot of 10 notes and a random sample of 100 notes. The annotation only covered diagnosis, as our pilot phase revealed that labeling procedure required considerably more elaborate guidelines and medical training.

Table 2 compares document statistics for discharge notes from the MIMIC-III and Exprivia datasets. It suggests that the pilot can only be used as a very preliminary sample of the data: the notes are rather small and contain few diagnoses. The Italian documents from exprivia-100 show a striking similarity to MIMIC: there are on average around 25-30 paragraphs per document, 1.2-1.3 of which are diagnostic. The major difference comes from the documents with no diagnosis (27% in Italian, 14.5% in English). We believe that this similarity reflects the fact that, despite differences in national and local healthcare regulations as well as individual practicing/recording approaches, clinical notes reflect a common underlying semantics; thus a language transfer model can be successful for our task, mitigating the need for very time-consuming and costly expert effort on constructing taxonomies similar to SecTag in Italian.
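The rule-based silver annotation of Section 3.1 (label from header lookup, propagation to header-less fragments, negative label for unknown headers, header removal) can be sketched as follows. This is a minimal sketch under our own assumptions: the function names are hypothetical, the lexicon is a toy stand-in for SecTag, and we assume headers are lines ending with a colon.

```python
# Illustrative sketch of the silver-labeling procedure (not the authors' code).
# Toy stand-in for the SecTag lookup: normalized header -> label.
HEADER_LABELS = {
    "discharge diagnosis": "diagnosis",
    "admit diagnosis": "diagnosis",
    "major surgical or invasive procedure": "procedure",
}

def silver_label(note: str):
    """Split a note into paragraphs, label each from its header (if any),
    propagate labels to header-less paragraphs, and strip the headers."""
    labeled = []
    current = "none"
    for para in filter(None, (p.strip() for p in note.split("\n\n"))):
        first_line, _, rest = para.partition("\n")
        if first_line.endswith(":"):          # paragraph opens with a header
            header = first_line[:-1].strip().lower()
            # headers missing from the lexicon yield a negative label
            current = HEADER_LABELS.get(header, "none")
            text = rest.strip()               # header removed from the text
        else:                                 # no header: propagate the label
            text = para
        labeled.append((text, current))
    return labeled
```

The header is dropped from the emitted text, so a downstream classifier must learn the label from the paragraph content alone, which is exactly the silver-supervision effect described above.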
4 Transformer-Based Architectures for Diagnosis and Procedure Extraction

Transformer-based models have recently become the standard in NLP. Models like BERT (Devlin et al., 2019) and ELECTRA (Clark et al., 2020) have shown impressive performance when compared to the previous state of the art. These models are based on the Transformer block (Vaswani et al., 2017), which exploits the attention mechanism to find relations between all pairs of tokens in the input text and thus creates deep contextualized representations. Transformer layers can be stacked to create more powerful and refined models. For computational efficiency, we focus on architectures with no more than 12 layers.

Transformer         Language   parameters
BERT-base-uncased   English    109M
BERT-base-cased     English    108M
ELECTRA-small       English    13M
BERT-Ita            Italian    110M
BERTino             Italian    68M

Table 3: Transformer models used in the empirical evaluation.

Tokenization. Raw text cannot be provided directly to a transformer-based model: it is first tokenized using a fixed-size vocabulary, created via a segmentation algorithm, e.g., WordPiece. We extended the BERT vocabulary to account for possible de-identified medical input.

Pre-training and fine-tuning. Transformer-based models are usually trained in a 2-step fashion. The model is first pretrained on a huge amount of artificially labelled text taken from sources like Wikipedia or CommonCrawl. At the fine-tuning stage, the model is adapted to a specific task, e.g., Question Answering or Diagnosis Extraction. Since the model is already able to create good contextualized representations, the fine-tuning requires only a small amount of manually labelled examples. Following common transformer fine-tuning practices, we classify paragraphs into ±diagnosis with a binary classification head on top of the first token output.

5 Language Transfer for Diagnosis Identification

The main bottleneck for NLP on medical data in Italian lies in the lack of annotated data and professionally created resources similar to SecTag. To mitigate this issue, we advocate a language transfer approach, combining our transformer models (Section 4) with state-of-the-art machine translation (MT).

We investigate three cross-lingual settings. In the baseline setup, we do not perform any translation, relying on BERT's tokenizer and cross-lingual embeddings to learn informative sub-word clues for diagnostic paragraphs.

Our second cross-lingual pipeline builds directly upon the model presented in Section 4. We use an MT component to translate test documents from Italian into English, run our diagnosis identification model and then port the results to the Italian original via a trivial paragraph-level alignment. Note that this model is trained on high-quality data in English and tested on noisy automatically translated data.

For the third pipeline, we first translate the whole training set from English into Italian, while keeping paragraphs aligned. We follow the methodology from Section 4 to train a new model, operating on Italian directly. Note that, unlike the second pipeline, this approach implies training on noisy automatically translated data while testing on high-quality Italian. The effect of this is twofold: on the one hand, the task becomes more difficult to learn; on the other hand, the resulting classifier should be more robust.

To obtain a satisfactory translation using open-source architectures, we rely on the transformer encoder-decoder models (Tiedemann and Thottingal, 2020) trained on the OPUS corpus.³ While the OPUS corpus is not tailored specifically to the medical domain, its large size and generic nature allow for training very robust MT models. We exploit the two models to translate from English to Italian⁴ and from Italian to English⁵. Both are transformer encoder-decoder models trained with the Causal Language Modeling objective.

³ https://opus.nlpl.eu
⁴ https://huggingface.co/Helsinki-NLP/opus-mt-en-it
⁵ https://huggingface.co/Helsinki-NLP/opus-mt-it-en

6 Experiments

6.1 Setup

Data processing. We split the MIMIC-III discharge dataset into training, development and testing sets (60%, 20% and 20% respectively). We used the first for training all the models presented in this study, while we use the other two for checkpoint selection, hyper-parameter tuning (batch size and learning rate) and evaluating the monolingual model. We used the exprivia-10 set for validation and the exprivia-100 set for testing in the cross-lingual (language transfer) experiments.

Transformer Models. We run most experiments in two modes: (i) with powerful transformer components comprising a large number of parameters and providing top performance, such as BERT (Devlin et al., 2019) and BERT-ita,⁶ and (ii) with small and efficient transformer models such as ELECTRA-small (Clark et al., 2020) and BERTino (Muffo and Bertino, 2020). The objective of this setup was to measure the performance/efficiency trade-off. Table 3 presents all the used transformer models with the respective number of parameters.

Task        Filt. Accuracy   Precision@1
Paragraph-level granularity
Diagnosis   92.4             95.9
Procedure   97.1             98.4

Table 4: Diagnosis and procedure discourse segment identification, monolingual setting (English), document-level view: training, fine-tuning and testing on subsets of MIMIC-III discharge.
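The 60/20/20 data split described in the setup can be sketched as below. The seeded shuffle is our own assumption for reproducibility; the paper does not specify how documents were assigned to splits.

```python
# Sketch (our assumption, not the authors' code) of a reproducible
# 60% / 20% / 20% train / dev / test split over document ids.
import random

def split_dataset(doc_ids, seed=0):
    """Shuffle document ids with a fixed seed and split them 60/20/20."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)       # deterministic for a given seed
    n_train = int(0.6 * len(ids))
    n_dev = int(0.2 * len(ids))
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]           # remainder, roughly 20%
    return train, dev, test
```

Splitting at the document level (rather than the paragraph level) keeps all paragraphs of a note in the same split, which matters for document-level metrics such as Filtering Accuracy.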
Evaluation metrics. The Diagnosis/Procedure classification task shows a very skewed label distribution. For this reason, we approach it from an information retrieval viewpoint, i.e., we rank paragraphs based on their probability of containing a diagnosis. We use Mean Average Precision and Precision@1 to evaluate the ranking quality. The former takes into account the whole ranking and is therefore the best indicator of the ranking quality. The latter indicates the number of times a correct diagnosis is returned in the first position. To provide a better comparison, we report MAP and P@1 averaging only over the documents that contain at least one diagnosis. We also report model accuracy in recognizing documents with no diagnoses (Filtering Accuracy). This metric was introduced because a relevant fraction of documents does not contain a diagnosis, see Table 2.

⁶ https://huggingface.co/dbmdz/bert-base-italian-xxl-cased

Figure 1: Learning curves on the exprivia-10 validation set in the Italian pipeline: BERT-Ita (top) vs. BERTino (bottom). MAP (y-axis) for a given number of training steps (x-axis).

6.2 Results

Monolingual results. Table 4 summarizes the English results. The numbers refer to a BERT-base-cased model fine-tuned with a batch size of 64 and a learning rate of 2·10⁻⁶. The model is able to identify very accurately documents with no diagnoses/procedures (92.4% and 97.1% accuracy respectively). Moreover, the binary classification of paragraphs into diagnoses (or not) and procedures (or not) is very reliable: 95.9% and 98.4% P@1 at document level.

Cross-lingual experiments. Table 5 shows the results of our language transfer experiments. A moderate performance (58.8% Filtering Accuracy, 49.2% P@1) can be achieved via a BERT model trained on English MIMIC data and directly tested on the Italian exprivia-100 set. Multilingual-BERT does slightly better, as it was trained on 104 languages, English and Italian included. This approach relies on joint multilingual embeddings and fine tokenization. It can, for example, identify and align stems of Latin origin for some disease names. However, it cannot go much beyond that: it is not able to model deep semantics related to medical processes.

Model                     Development set   Filt. Accuracy   Precision@1   MAP
Cross-Lingual BERT
BERT-base-uncased         exprivia-10       58.8             49.2          58.5
Multilingual-BERT-cased   exprivia-10       51.2             73.5          75.6
MT-based pipeline-2, train on English (MIMIC), test on English translation of exprivia-100
BERT-cased                exprivia-10       31.8 (7.6)       67.4 (6.8)    69.2 (3.3)
BERT-cased                MIMIC dev         53.1 (9.0)       73.9 (6.6)    73.3 (4.9)
ELECTRA-small             exprivia-10       64.6 (9.5)       60.5 (12.6)   71.2 (9.0)
ELECTRA-small             MIMIC dev         54.2 (8.7)       62.4 (11.2)   73.2 (7.9)
MT-based pipeline-3, train on Italian translation of MIMIC, test on Italian (exprivia-100)
BERT-ita                  exprivia-10       69.8 (6.2)       78.6 (7.3)    81.5 (3.8)
BERT-ita                  MIMIC dev         67.1 (7.8)       73.7 (3.0)    77.2 (3.1)
BERTino                   exprivia-10       72.0 (7.5)       74.9 (2.9)    81.9 (2.6)
BERTino                   MIMIC dev         67.7 (4.1)       77.3 (2.5)    83.3 (1.9)

Table 5: Language transfer models, fine-tuning on the MIMIC training set and evaluation on the exprivia-100 test set; boldface indicates the best results. Standard deviation across 5 runs shown in brackets.

The use of MT shows a considerable improvement over the baseline. The results suggest a better performance for the setting where the training set is translated into Italian and the diagnosis extraction model is then learned on (noisy) Italian data.
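The ranking metrics from Section 6.1 can be sketched as follows. This is our own minimal implementation, not the authors' evaluation code: each document is a list of (score, is_diagnosis) pairs, paragraphs are ranked by predicted score, and AP and P@1 are averaged only over documents containing at least one diagnostic paragraph, as in the paper.

```python
# Sketch of MAP and Precision@1 over ranked paragraphs (our implementation).

def average_precision(ranked_labels):
    """AP for one document given labels in ranked order (True = diagnosis)."""
    hits, precisions = 0, []
    for rank, is_diag in enumerate(ranked_labels, start=1):
        if is_diag:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def evaluate(documents):
    """MAP and P@1, averaged over documents with >= 1 diagnostic paragraph."""
    aps, p_at_1 = [], []
    for doc in documents:
        ranked = sorted(doc, key=lambda pair: pair[0], reverse=True)
        labels = [is_diag for _, is_diag in ranked]
        if any(labels):                     # skip documents with no diagnosis
            aps.append(average_precision(labels))
            p_at_1.append(1.0 if labels[0] else 0.0)
    n = len(aps)
    return sum(aps) / n, sum(p_at_1) / n
```

Documents with no diagnostic paragraph are excluded here and scored separately by Filtering Accuracy, i.e., the accuracy of recognizing that a document contains no diagnosis at all.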
Moreover, this approach is much faster when used as a service, as it operates directly on Italian input.

We performed all the MT-based experiments 5 times using random seeds to enable a better statistical assessment of the results. While in general the standard deviation is rather small considering the very small test set, the setting with a translated test set leads to unstable benchmarking, especially for the smaller ELECTRA transformer.

Finally, smaller transformer models, especially BERTino, exhibit very small performance drops compared to larger transformers. This suggests that they are robust enough to capture paragraph-level diagnosis semantics. Therefore, it is possible to run the extraction service with low computational resources, e.g., using CPUs. Figure 1 shows the stability of the learning with translated training data. Small models are able to match the performance of larger models, while also being faster to converge. We believe that smaller models overfit the MIMIC training data less, thus providing a better final performance on the Exprivia data. Note that training was stopped after a fixed amount of time for every experiment; BERTino, being smaller, is able to do more steps in the same amount of time.

7 Conclusion

We present a language transfer approach to unraveling the discourse structure of clinical notes, focusing on diagnosis and procedure. We combine transformer-based paragraph modeling with state-of-the-art MT architectures in a novel application that is essential for eHealth big data analytics. Most importantly, our language transfer approach helps mitigate the need for expensive and time-consuming medical resource creation (annotated training data as well as a header taxonomy) in Italian.

We empirically investigate two translation-based architectures, showing that both of them outperform a generic cross-lingual pipeline. The approach based on translating the training data is more robust and efficient (at runtime) compared to translating the test data, yielding more stable performance.

In the future, we plan to expand our study to other discourse segments, such as allergy or history. However, our first experiments with procedure segments show that, unlike diagnosis, modeling and even annotating other headers require a tighter collaboration with medical experts.

8 Acknowledgements

The research presented in this paper has been supported by the Autonomous Province of Trento (project CareGenius). The computational power has been provided by the High Performance Computing department of the CINECA Consortium (ISCRA project CareGeni).

References

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators.

Hong-Jie Dai, Shabbir Syed-Abdul, Chih-Wei Chen, and Chieh-Chen Wu. 2015. Recognition and evaluation of clinical section headings in clinical documents using token-based formulation with conditional random fields. BioMed Research International, 2015.

Joshua Denny, Randolph Miller, Kevin Johnson, and Anderson Spickard. 2008. Development and evaluation of a clinical note section header terminology. In Proceedings of the AMIA Annual Symposium, pages 156–160.

Joshua Denny, Anderson Spickard, Kevin Johnson, Neeraja Peterson, Josh Peterson, and Randolph Miller. 2009. Evaluation of a method to identify and categorize section headers in clinical documents. Journal of the American Medical Informatics Association: JAMIA, 16(6):806–815.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.

Alistair E.W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3.

Matteo Muffo and E. Bertino. 2020. BERTino: An Italian DistilBERT model. In CLiC-it.
Sara Rosenthal, Ken Barker, and Zhicheng Liang. 2019. Leveraging medical literature for section prediction in electronic health records. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4864–4873, Hong Kong, China, November. Association for Computational Linguistics.

Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT — Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.