<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Applicability of Models Trained on Generated Clinical German Datasets on Out-domain Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oğuz Şerbetçi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ulf Leser</string-name>
          <email>ulf.leser@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Humboldt-Universität zu Berlin</institution>
          <addr-line>Berlin</addr-line>
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LWDA'23: Lernen, Wissen, Daten, Analysen</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Strong privacy constraints heavily limit the public availability of German health record data, which hinders the development of advanced NLP methods for clinical texts. As a remedy, work by Frei and Kramer [1] has leveraged recent breakthroughs in generative large language models to generate a synthetic dataset for training Named Entity Recognition (NER) models in the clinical domain. Because the basis is synthetic, both the corpus and the NER model are publicly available. However, as clinical text is highly idiosyncratic, it is not clear how well this approach performs on real data. We evaluate the model on two real-world German clinical datasets from cardiology and oncology departments. Our analysis shows that NER models trained on generated data do not transfer well to real-world data.</p>
      </abstract>
      <kwd-group>
        <kwd>German medical natural language processing</kwd>
        <kwd>large language models</kwd>
        <kwd>medical informatics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Some of the important information in electronic health records is found only in unstructured
text and needs to be extracted using natural language processing (NLP). An important task
to this end is named entity recognition (NER) [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. However, this task is difficult in the
clinical domain because the goal of medical text is to concisely capture large amounts of factual
information, resulting in abbreviations, domain-specific terminology and awkward grammar.
Differences between processes, hospitals and practitioners make it difficult to transfer models
between different datasets [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In addition, privacy concerns limit the sharing of such data and
make annotation expensive due to the need for anonymisation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This problem is particularly
exacerbated for German-language applications, not only due to strong EU privacy regulations
such as GDPR, but also due to stronger national privacy regulations [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Recently, the first annotated German clinical datasets BRONCO150 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and CARDIO:DE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for
NER have been published, which are distributable under a data use agreement (DUA). Both use
anonymisation. Kittner et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] go a step further and shuffle and filter sentences such that
a clinical document cannot be reconstructed and that parts with frequent personal information
are not distributed. Annotation of both datasets was an iterative and time-consuming process
to improve inter-annotator agreement due to the aforementioned challenges of medical text. In
contrast, Frei and Kramer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] attempt to circumvent the privacy issue and the costly annotation
process by using recent advances in large language models. They generate both the clinical text
and its NER annotations for diagnosis, medication and dosage using prompts with manually
modified examples of clinical note sentences, allowing both the data and the corresponding
NER model to be made publicly available without a DUA. However, it is not yet clear how well such
data, and models trained on it, can deal with the idiosyncrasies and heterogeneity of the real
clinical setting, as clinical text varies widely between practitioners, departments and institutions.
There are also differences in how the text has been pre-processed between applications,
how anonymisation is performed, and which parts of clinical documents are included.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Experimental setup</title>
      <p>
        We want to evaluate how the approach proposed by Frei and Kramer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], i.e. models trained on
annotated clinical text generated by a language model, performs on the two real-world clinical NER
datasets BRONCO150 and CARDIO:DE, which are described in Table 1.
      </p>
      <p>
        Frei and Kramer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] first generate 23,411 clinical sentences that are annotated with diagnosis,
medication, and dosage entities using the large language model GPT-NeoX and example clinical
sentences that have been handwritten based on real clinical text. They then fine-tune different
pretrained German medical language models based on the BERT architecture [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to obtain the
GPTNERMED model that can extract diagnosis, medication and dosage entities.
      </p>
      <p>
        The model is published using the spaCy library, and we use it to evaluate and train all NER
models in our experiments. We always use the best-performing public model for GPTNERMED
that is based on the German-MedBERT (G-MedBERT) medical language model [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. For the
evaluation on BRONCO150, we use the first predefined 20% split as the test and all the other
splits as the training data. For the evaluation on CARDIO:DE, which does not have predefined
splits, we use the first 80% of the documents as training and the rest as test data. Entity
predictions are evaluated with exact span matches, which is challenging due to variance in
annotation schemes across datasets and disagreement among human annotators, as seen
in the lower inter-annotator agreement compared to token-based evaluation for both CARDIO:DE
and BRONCO150 [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Despite the difficulty, correct span prediction is crucial for downstream
tasks like entity normalization and linking.
      </p>
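The exact-span matching criterion described above can be sketched as follows (a minimal, self-contained illustration; the actual experiments use spaCy's scorer, and the label names here are illustrative rather than the datasets' exact tag sets):

```python
def span_f1(gold, pred):
    """Precision, recall and F1 under exact span matching: a predicted
    entity counts as correct only if its (start, end, label) triple
    exactly matches a gold annotation."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # true positives: exact matches only
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A boundary that is off by even one character counts as a full miss:
gold = {(0, 9, "Diagnose"), (15, 22, "Medikation")}
pred = {(0, 9, "Diagnose"), (15, 20, "Medikation")}
print(span_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

This strictness is what makes exact-span evaluation harder than token-based evaluation: a near-miss on the boundary contributes both a false positive and a false negative.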
      <p>
        Our first experiment evaluates the GPTNERMED model as-is, i.e. without any fine-tuning
on BRONCO150 and CARDIO:DE. As a second experiment, we fine-tune the best-performing
GPTNERMED base model, G-MedBERT, with the training configuration published by Frei and
Kramer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] on both datasets, which means the resulting models have not seen the GPTNERMED
data. We also provide the results reported in the respective dataset papers for BRONCO150 and
CARDIO:DE for comparison [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. In a third experiment, we try different fine-tuning strategies
to identify how to best utilize the work of Frei and Kramer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We try the following setups: (1)
fine-tuning G-MedBERT on BRONCO150, (2) fine-tuning the GPTNERMED model on BRONCO150, and (3) fine-tuning
G-MedBERT on both the BRONCO150 and GPTNERMED data.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Results</title>
      <p>Results for GPTNERMED are shown in Table 2. We observe unsatisfactory performance of the
off-the-shelf GPTNERMED model on the real-world datasets BRONCO150 and CARDIO:DE. Results from
the experiment probing the best fine-tuning strategy using the GPTNERMED data are presented in
Table 3. We see that none of the strategies brings any considerable improvement.</p>
      <p>When we inspect the GPTNERMED predictions, we see that it predicts 74% of tokens containing
more than one consecutive capital letter as medication or diagnosis, whereas only 1% of
these tokens are annotated as medication or diagnosis. False positives include BRONCO150’s
anonymisation replacement tokens such as “PATIENTen” and “KRANKENHAUS” being
predicted as medication, and B-SALUTE and B-PERSON being predicted as diagnosis. Some common
treatments, e.g. “CT”, and diagnoses with capitalized abbreviations, e.g. “HCC-suspekten
Laesionen”, are also predicted as medication.</p>
      <p>Similar problems also exist with dosage predictions in CARDIO:DE, as it includes full doctor’s
reports, whereas the GPTNERMED dataset only includes sentences with a mention of a medication and
potentially its dosage. This results in a lot of non-medical numerical information in CARDIO:DE,
e.g. the age and birth date of the patient, being confused with the dosage label by GPTNERMED. Of all
the tokens with at least one digit, GPTNERMED predicts dosage for 61%, whereas only 5% are
labeled as Dosage.</p>
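The two surface patterns behind this error analysis can be checked with simple regular expressions (a sketch; the exact tokenization used in our analysis may differ):

```python
import re

# Tokens with a run of two or more capital letters, such as anonymisation
# placeholders ("KRANKENHAUS") or abbreviations ("CT", "HCC").
MULTI_CAPS = re.compile(r"[A-ZÄÖÜ]{2,}")
# Tokens containing at least one digit, i.e. candidate dosage mentions,
# but also ages, dates and other non-medical numbers.
HAS_DIGIT = re.compile(r"\d")

def has_consecutive_caps(token: str) -> bool:
    return bool(MULTI_CAPS.search(token))

def has_digit(token: str) -> bool:
    return bool(HAS_DIGIT.search(token))

print(has_consecutive_caps("PATIENTen"))   # True
print(has_consecutive_caps("Lymphknoten")) # False
print(has_digit("5 mg"))                   # True
```

Counting predictions against gold labels within these two token groups yields the 74%/1% and 61%/5% figures reported above.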
      <p>
        We suspect that both covariate and label shift occur with both datasets, which can potentially
be addressed by different methods. For a good overview, we refer the reader to the review by
Hupkes et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion and Future Work</title>
      <p>Generating annotated medical text using large language models can circumvent the privacy
issues during dataset curation in the clinical domain and address the lack of available training
data. However, our evaluations show that current solutions cannot deal with the idiosyncrasies
of real medical text. It is, however, important that published models consider downstream
use cases and are evaluated accordingly. For clinical application, a model must be able
to deal with anonymisation artifacts and with sentences that contain no medical named entities.
Furthermore, generated datasets should reflect real-world scenarios, which include typos
and incorrect punctuation.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Funded by the Gemeinsamer Bundesausschuss (G-BA, 01VSF22041).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Frei, F. Kramer, Annotated Dataset Creation through General Purpose Language Models for non-English Medical NLP, 2022. arXiv:2208.14493.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] S. R. Kundeti, J. Vijayananda, S. Mujjiga, M. Kalyan, Clinical named entity recognition: Challenges and opportunities, in: 2016 IEEE International Conference on Big Data (Big Data), 2016, pp. 1937-1945. doi:10.1109/BigData.2016.7840814.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] J. Starlinger, S. Pallarz, J. Ševa, D. Rieke, C. Sers, U. Keilholz, U. Leser, Variant information systems for precision oncology, BMC Medical Informatics and Decision Making 18 (2018) 107. doi:10.1186/s12911-018-0665-z.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Kormilitzin, N. Vaci, Q. Liu, A. Nevado-Holgado, Med7: A transferable clinical natural language processing model for electronic health records, Artificial Intelligence in Medicine 118 (2021) 102086. doi:10.1016/j.artmed.2021.102086.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] I. Spasic, G. Nenadic, Clinical Text Data in Machine Learning: Systematic Review, JMIR Medical Informatics 8 (2020) e17984. doi:10.2196/17984.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] T. Kolditz, C. Lohr, J. Hellrich, L. Modersohn, B. Betz, M. Kiehntopf, U. Hahn, Annotating German clinical documents for de-identification, in: MedInfo, 2019, pp. 203-207.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] M. Kittner, M. Lamping, D. T. Rieke, J. Götze, B. Bajwa, I. Jelas, G. Rüter, H. Hautow, M. Sänger, M. Habibi, M. Zettwitz, T. de Bortoli, L. Ostermann, J. Ševa, J. Starlinger, O. Kohlbacher, N. P. Malek, U. Keilholz, U. Leser, Annotation and initial evaluation of a large annotated German oncological corpus, JAMIA Open 4 (2021) ooab025. doi:10.1093/jamiaopen/ooab025.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] P. Richter-Pechanski, P. Wiesenbach, D. M. Schwab, C. Kiriakou, M. He, M. M. Allers, A. S. Tiefenbacher, N. Kunz, A. Martynova, N. Spiller, J. Mierisch, F. Borchert, C. Schwind, N. Frey, C. Dieterich, N. A. Geis, A distributable German clinical corpus containing cardiovascular clinical routine doctor's letters, Scientific Data 10 (2023) 207. doi:10.1038/s41597-023-02128-9.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171-4186. doi:10.18653/v1/N19-1423.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] M. Shrestha, Development of a Language Model for Medical Domain, Ph.D. thesis, Hochschule Rhein-Waal, 2021.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] D. Hupkes, M. Giulianelli, V. Dankers, M. Artetxe, Y. Elazar, T. Pimentel, C. Christodoulopoulos, K. Lasri, N. Saphra, A. Sinclair, D. Ulmer, F. Schottmann, K. Batsuren, K. Sun, K. Sinha, L. Khalatbari, M. Ryskina, R. Frieske, R. Cotterell, Z. Jin, State-of-the-art generalisation research in NLP: A taxonomy and review, 2023. arXiv:2210.03050.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>