<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Embedding-Based Acronym Disambiguation Supported by Large Language Models in German Clinical Narratives</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amila Kugic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Schulz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Kreuzthaler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>An embedding-based approach for acronym disambiguation in German was developed using a combination of large language model (LLM) prompts and the MedBERT.de model, without human-annotated training data. Both zero-shot and few-shot prompting techniques were employed with the Generative Pretrained Transformer model GPT-4o to generate training examples for the creation of embedding spaces, which were then indexed using Faiss (Facebook AI Similarity Search) for nearest-neighbor search. For three distinct acronyms in context, each embedding-based search identified the closest long-form resolutions based on distances between embeddings. Acronym disambiguation achieved a maximum accuracy of 0.69 [0.64-0.73], comparable to the baseline accuracy of 0.65 [0.61-0.69] obtained using LLMs. However, synthetic training examples generated by zero-shot prompting to build the embedding spaces resulted in a lower accuracy of 0.46 [0.41-0.50], in comparison to few-shot prompting of synthetic clinical narratives. The results underscore the challenge of accurately disambiguating acronyms in real-world clinical narratives without human-labeled data and highlight the contextual complexity involved.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Electronic Health Records</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Short forms, i.e., abbreviations and acronyms, are typically found in technical language, such as in the
narrative content of electronic health records. Compact language expressions are often preferred
by clinicians to quickly and concisely communicate and document information about patients. The
drawbacks of short forms are a decrease in readability and an increase in ambiguity [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A systematic
review covering the past 30 years, conducted on the readability of patient information, such as discharge
instructions or educational health information in various clinical specialties, showed that the reading
level of clinical information was too high for patients [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Consequently, resolving short forms could be
one of the steps taken to increase readability for patients. Correct disambiguation can only be achieved
using natural language processing (NLP) methods that are sensitive to the surrounding context. In 2024, a
systematic scoping review on the processing of short forms in clinical narratives with NLP [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] illustrated
the need for further research on this topic in languages other than English. Additionally,
embeddings-based methods have demonstrated state-of-the-art performance in disambiguating short forms in
English clinical texts. These methods leverage embedding representations to position semantically
similar n-grams close to one another, facilitating information retrieval. However, building effective
machine learning (ML) models for this task is expensive, as it relies on datasets annotated by domain
experts.
      </p>
      <p>The aim of this paper was to investigate the use of large language models (LLMs) to generate
silver-standard training examples for a special form of short forms, viz. acronyms, and to create embedding
spaces with these examples. Silver-standard annotations are defined as automatically generated training
labels that approximate gold-standard annotations. A clinical text corpus, used as a test set, was
applied to gauge the applicability of this method. Three research questions guided the focus of this
investigation: (i) Is the application of embeddings with an LLM-generated training set feasible for
acronym disambiguation? (ii) What difference does a zero-shot vs. few-shot application of LLM prompts
have on the training set, and consequently on the performance results? (iii) How does the performance of
embeddings for acronym resolution compare to a straightforward LLM-based disambiguation approach?</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        For the generation of silver-standard annotation, Li et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] compared both zero-shot and few-shot
synthetic data generation for a variety of openly available datasets for text classification tasks, such
as news and reviews, to gauge the classification performance. With BERT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and RoBERTa [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
models, training in few-shot scenarios consistently outperformed training on zero-shot labels, and
real-world data almost always outperformed models trained on synthetic datasets. Kruschwitz and
Schmidhuber [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] analyzed the possibility of creating synthetic datasets in English for online toxicity
detection. The authors concluded that while the method seems promising for further research, it did
not improve the classification task compared to applying original (human-annotated) data. For acronym
disambiguation in English clinical narratives, Adams et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used contextualized word representations,
Gaussian embeddings, and a Bayesian skip-gram model to improve short-form expansion, resulting in a
performance of weighted mean F1-scores of 0.69 for the MIMIC-III [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] dataset, and 0.51 for the CASI
(Clinical Abbreviation Sense Inventory) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] dataset. Jaber and Martínez [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] outperformed Adams et
al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] on the CASI dataset by using a masked language modeling approach with three pretrained BERT
models, and incorporating context and expansions without fine-tuning, achieving 0.99 in accuracy.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Data</title>
      <sec id="sec-3-1">
        <title>3.1. Clinical Narrative Dataset</title>
        <p>Clinical narratives from cardiology, dermatology, and oncology departments of KAGes, an Austrian
hospital network, were used to create training and test sets for acronym disambiguation. For the creation
of the dataset, a rule-based approach with the pattern [\s[A-Z][A-Z0-9]{2}\s] was applied to extract ambiguous
two-letter acronyms with a context width of 100 characters, with the matching acronym placed in the
middle. The dataset comprised three two-letter acronyms (“AP”, “HT”, “VA”) with multiple senses per
acronym. The training set, used for the creation of the synthetic dataset in the next step,
consisted of two examples of acronyms in context from clinical narratives per possible target sense. The
test set of 500 contextual examples had already been applied for acronym disambiguation
with LLMs [12], denoted as the German 3A dataset.</p>
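        <p>As a minimal sketch of the extraction step (the exact script is not part of the paper, and the sample text below is illustrative), the rule-based extraction can be applied as follows; a two-letter variant of the printed pattern [\s[A-Z][A-Z0-9]{2}\s] is used here so that it matches the dataset’s two-letter acronyms:</p>

```python
import re

def extract_contexts(text, width=100):
    """Return (acronym, context) pairs with the acronym roughly centered
    in a `width`-character window, as described for the dataset creation."""
    # Two-letter variant of the paper's printed pattern [\s[A-Z][A-Z0-9]{2}\s].
    pattern = re.compile(r"(?<=\s)[A-Z][A-Z0-9](?=\s)")
    results = []
    for m in pattern.finditer(text):
        center = (m.start() + m.end()) // 2
        start = max(0, center - width // 2)
        results.append((m.group(), text[start:start + width]))
    return results
```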
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Synthetic Training Dataset</title>
        <p>Two training datasets were created by prompting the language model GPT-4o (release:
gpt-4o-2024-08-06) from OpenAI via the application programming interface (API). The choice of language model was
informed by a prior study on acronym resolution, which had found GPT-4 to be the best-performing
among four LLMs [12]. We created two separate training datasets in order to investigate the effect
of zero-shot vs. few-shot prompting on the results, i.e., prompting without examples vs. prompting
with examples. The prompts for both datasets were identical except for the addition of examples in
the few-shot (two-shot) prompt, e.g., “Generate 50 snippets that are 100 characters long each, in German.
Each snippet should use the acronym “AP” in the middle, and correspond to the long form “Alkalische
Phosphatase”.” The prompts were executed six times for each long form, due to LLM response length
constraints, so that a total of 300 examples per long form could be obtained. Each prompt required the
creation of 50 examples with a length of approximately 100 characters each, with the acronym appearing
in the middle of the example, approximately after 50 characters. The target sense for the acronym was
included in the prompt for accurate context creation. The 100-character context was required to
be similar to clinical narrative documentation practices in German, with incomplete sentences, use of
short forms, laboratory results, etc. This aimed to create contexts for the synthetic datasets that
resemble clinical narratives. The few-shot prompt included two examples per long
form from the training dataset. Each row of the dataset consisted of a unique row number, the acronym,
the target sense, and the synthetic example context.</p>
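        <p>The prompt assembly described above can be sketched as follows; the helper name and example strings are illustrative, not the authors’ exact code:</p>

```python
def build_prompt(acronym, long_form, n=50, examples=None):
    """Assemble the generation prompt of Section 3.2. With `examples`
    (two per long form) this is the few-shot variant; without, zero-shot."""
    prompt = (
        f"Generate {n} snippets that are 100 characters long each, in German. "
        f"Each snippet should use the acronym \"{acronym}\" in the middle, "
        f"and correspond to the long form \"{long_form}\"."
    )
    if examples:
        # Few-shot (two-shot): append the two training examples verbatim.
        prompt += " Examples: " + " ".join(examples)
    return prompt
```

        <p>Executing such a prompt six times per long form, at 50 snippets per response, yields the 300 examples per long form described above.</p>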
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Pre-processing</title>
        <p>Input texts for embedding generation and embedding search were uniformly pre-processed.
Any characters outside of [a-zA-Z0-9üäöÄÖÜß-] were replaced with whitespace characters, and
whitespaces were collapsed, so that consecutive whitespaces created by pre-processing were removed.</p>
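        <p>The pre-processing rule above can be sketched in a few lines; the function name is illustrative:</p>

```python
import re

# Characters outside this set are replaced with whitespace (Section 4.1).
DISALLOWED = re.compile(r"[^a-zA-Z0-9üäöÄÖÜß-]")

def preprocess(text):
    """Replace disallowed characters with whitespace, then collapse
    consecutive whitespace introduced by the substitution."""
    text = DISALLOWED.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```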
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Embedding Space Generation</title>
        <p>Prior to building the index, each short form was replaced with the corresponding long form in each
synthetic example via reverse substitution to facilitate the resolution of acronyms during an
embedding-based search. For 5-gram and 10-gram decompositions of the synthetic training dataset, two
768-dimensional embedding spaces were built, each for zero-shot and few-shot created examples, and the
resulting vectors were indexed using Faiss [13]. The language model MedBERT.de [14]
(https://huggingface.co/GerMedBERT/medbert-512) was applied to create the embedding spaces, because it was
trained on a large corpus of German medical documents and achieved state-of-the-art results in a variety of NLP tasks.</p>
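        <p>The reverse substitution and n-gram decomposition can be sketched as below; embedding each n-gram with MedBERT.de and adding the 768-dimensional vectors to a Faiss index is omitted here, as it requires the model weights:</p>

```python
def reverse_substitute(snippet, acronym, long_form):
    """Replace the short form with its long form before embedding."""
    return snippet.replace(acronym, long_form)

def ngrams(tokens, n):
    """Token n-grams (n = 5 or 10) used to populate the embedding space."""
    if len(tokens) < n:
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```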
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Search Procedure</title>
        <p>For each acronym, a context-based search in the embedding spaces was performed. All long
forms found as possible senses in the corresponding clinical narrative dataset were recorded in a lookup
table prior to starting the search. For disambiguation, the complete 100-character context around the
indicated acronym was used for the search. Distances recorded for the nearest neighbors of each
example were grouped to calculate the mean distance per possible acronym resolution. The mean
embedding distances were ordered in ascending order, and the shortest mean distance determined the
long-form classification, i.e., that long form was assigned as the resolution candidate.</p>
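        <p>The sense-selection step can be sketched as follows; the input format (a list of long-form/distance pairs from the index search) is an assumption for illustration:</p>

```python
from collections import defaultdict

def resolve(neighbors):
    """Choose the long form with the smallest mean nearest-neighbor
    distance, as described in Section 4.3."""
    totals = defaultdict(lambda: [0.0, 0])
    for long_form, dist in neighbors:
        totals[long_form][0] += dist
        totals[long_form][1] += 1
    means = {lf: s / c for lf, (s, c) in totals.items()}
    # Ascending order of mean distance; the smallest wins.
    return min(means, key=means.get)
```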
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Evaluation</title>
        <p>Two domain experts annotated the assigned long forms for correctness, i.e., the labels “correct” or
“incorrect” were allocated for all acronym resolutions in the German 3A dataset. The evaluation
of the results was performed with the metric accuracy and a 95% confidence interval (CI). Accuracy
was calculated by dividing the number of correctly labeled long forms by the total number of
annotations in the dataset, while the 95% CI provided a measure of statistical significance for comparisons
with other baselines, i.e., previously published works.</p>
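        <p>As a hedged sketch (the paper does not state which CI method was used), a common normal-approximation (Wald) interval for accuracy looks as follows:</p>

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Accuracy with a normal-approximation 95% confidence interval.
    The Wald interval is one common choice; the paper does not specify
    its CI method."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)
```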
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Baseline Comparison</title>
        <p>The best-performing run in previous results [12], i.e., prompting GPT-4 for the resolution of acronyms,
was used as a baseline comparison. The baseline aimed to resolve acronyms in a single step, where
the placeholders “ACRONYM” and “CONTEXT” were substituted with the corresponding information
from clinical narratives. The complete baseline zero-shot prompt: “What is the resolution of the acronym
ACRONYM in the following clinical context: CONTEXT. The answer should be kept short and concise. The
acronym resolution should be given out in the following format: short form, long form. The answer should
not contain any further explanations.”</p>
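        <p>The placeholder substitution for the baseline prompt is straightforward; as an illustrative sketch:</p>

```python
# Baseline zero-shot prompt from Section 4.5, with placeholders.
TEMPLATE = (
    "What is the resolution of the acronym ACRONYM in the following "
    "clinical context: CONTEXT. The answer should be kept short and concise. "
    "The acronym resolution should be given out in the following format: "
    "short form, long form. The answer should not contain any further "
    "explanations."
)

def baseline_prompt(acronym, context):
    """Substitute the placeholders with information from the narrative."""
    return TEMPLATE.replace("ACRONYM", acronym).replace("CONTEXT", context)
```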
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In Table 1, the performance results for this acronym disambiguation task are listed. The
inter-rater agreement was calculated for the test set; a Cohen’s kappa above 0.9 indicates high
agreement between the annotators [15, 16]. The way the silver-standard labels were generated
significantly impacted performance, with zero-shot and few-shot prompting techniques differing based
on the confidence intervals. Few-shot prompting for silver-standard labels generally resulted in higher
accuracy compared to zero-shot prompting. Across prompt types and n-gram decompositions, the
maximum accuracy of 0.69 was recorded for a 5-gram embedding space with few-shot synthetic training
examples. However, this was not a statistically significant improvement over the baseline
of 0.65, which consisted of prompting GPT-4 in a zero-shot manner directly for the disambiguation of
the acronyms.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <sec id="sec-6-1">
        <title>6.1. Research Questions</title>
        <sec id="sec-6-1-1">
          <title>6.1.1. Is the application of embeddings with an LLM-generated training set feasible for acronym disambiguation?</title>
          <p>
            While the application of embeddings with an LLM-generated training set is feasible for acronym
disambiguation, the results do not yet yield performance levels adequate for deployment in a clinical
context, e.g., for the disambiguation of acronyms in discharge summaries. The performance was
similar to that of Adams et al. [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], although Jaber and Martínez [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] have shown that even
better acronym disambiguation would be possible. However, both publications used real-world datasets
in English, i.e., they did not use synthetically created training datasets. Consequently, similarly to related
works in other domains by Li et al. [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] and Kruschwitz and Schmidhuber [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], the use of on-premise
datasets, labeled by annotators, would probably offer higher performance, but at the cost of
human annotation hours.
          </p>
        </sec>
        <sec id="sec-6-1-2">
          <title>6.1.2. What difference does a zero-shot vs. few-shot application of LLM prompts have on the training set, and consequently on the performance results?</title>
          <p>Based on the performance results, a statistically significant improvement (95% CI) was achieved by
using few-shot prompting, i.e., by merely introducing the same two examples into each prompt.
Synthetic examples from zero-shot prompts often included wording unrepresentative of clinical narratives,
including nonsensical hallucinated information that did not adhere to the initial prompt and was in
large proportions dissimilar to real-world clinical text, e.g., “Herzton haufen alternierende Molmyklene,
rhythmische Herzton [...]” (heart sound heap alternating Molmyklene, rhythmic heart sound). Typical
errors included hallucinated words and phrases, non-adherence to length requirements (often
generating fewer than the required 100 characters), and, while abbreviations were used in a minority of
cases, the majority of information was written out in full sentences, which might have been the largest
difference between the zero-shot and few-shot examples. Conversely, variability across examples
was given, i.e., no examples repeated themselves, and each was unique within its silver-standard dataset.
In the few-shot prompted synthetic dataset, these erroneous effects were still present, although less so in
comparison to the zero-shot approach, and the examples were largely similar to real-world clinical narrative
datasets. An example would be the following: “Sättigung auf 96%, pulsierender HT, BD leicht erhöht,
kein sezernierendes Exanthem, temp. norm” (saturation at 96%, pulsating heart sound, blood pressure
slightly elevated, no secreting exanthem, temperature normal).</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>6.1.3. How does the performance of embeddings for acronym resolution compare to a straightforward LLM-based disambiguation approach?</title>
          <p>The straightforward LLM-based approach slightly underperformed compared to the best-performing
results for embeddings-based acronym resolution, but without statistical significance based on the
confidence intervals. Given the rapid advancements in LLM technology, and considering that the baseline
method did in fact reach 0.98 in accuracy for a subset of the CASI dataset in English, future LLMs would
probably outperform the baseline on the German 3A dataset. From a resources perspective, the creation of the
embedding spaces, indexing, and search procedures were far less resource-intensive in comparison to
the computational and memory resources needed to train and prompt LLMs on premises.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Error Analysis</title>
        <p>A summary error analysis of acronym disambiguation revealed that for zero-shot prompted synthetic
examples, the context was too dissimilar to real-world clinical narrative datasets. As a result, in
30% of nearest-neighbor searches, no nearest neighbor could be found and therefore no resolution
candidate was chosen. For few-shot prompted training examples, all embedding searches found a
resolution candidate. Similar reporting styles across clinicians, negation variations, and
discontinuous spans made distinctions particularly challenging in the embedding space.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. System Limitation</title>
        <p>One limitation of this embeddings-based approach was that, prior to starting the disambiguation, all
possible long forms of the short forms needed to be known, so that the embedding space included
representative examples for each long form. For example, a search with the context around the acronym
“HT” for “Herzton” would have succeeded, because the generated examples contain that sense. If a
search was performed for the acronym “HT”, but with the resolution “Hydroxytryptamin” as part of
“5-HT Rezeptor”, an abbreviation for serotonin receptors, this sense would not have been found, as it
was not part of the list of possible senses. The latter case had no negative impact on model performance,
as this sense was not represented in the German 3A dataset. Another limitation was the restriction to
two-letter abbreviations. Even though these show high contextual ambiguity based on previous
results [12], a larger and more diverse dataset would have been more representative of the acronym
disambiguation capability of the embeddings-based approach.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Outlook</title>
      <p>We presented a method to create synthetic training datasets for clinical narratives, labeled automatically
by LLMs, to be used as input for an embeddings-based acronym disambiguation task. The results
demonstrate acceptable performance, sufficient for initial deployment testing with German clinical
narratives. Embeddings-based acronym resolution shows great promise. In future investigations, other
methods for acronym resolution will be investigated on the same dataset, which is interesting due to
the high ambiguity of its acronyms. One possibility would be the annotation of a subset of clinical
narratives by human annotators to compare the performance when the embedding space is trained on
the same real-world dataset.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools during the preparation of this work.</p>
      <p>[12] A. Kugic, S. Schulz, M. Kreuzthaler, Disambiguation of acronyms in clinical narratives with large language models, Journal of the American Medical Informatics Association 31 (2024) 2040–2046. doi:10.1093/jamia/ocae157.</p>
      <p>[13] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data 7 (2019) 535–547.</p>
      <p>[14] K. K. Bressem, et al., medBERT.de: A comprehensive German BERT model for the medical domain, Expert Systems with Applications 237 (2024) 121598. doi:10.1016/j.eswa.2023.121598.</p>
      <p>[15] M. L. McHugh, Interrater reliability: the kappa statistic, Biochemia Medica 22 (2012) 276–282.</p>
      <p>[16] C. O’Connor, H. Joffe, Intercoder reliability in qualitative research: Debates and practical guidelines, International Journal of Qualitative Methods 19 (2020) 1609406919899220. doi:10.1177/1609406919899220.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] C. M. Schwarz, M. Hofmann, C. Smolle, M. Eiber, B. Stoiser, G. Pregartner, L. Kamolz, G. Sendlhofer, Structure, content, unsafe abbreviations, and completeness of discharge summaries: A retrospective analysis in a University Hospital in Austria, Journal of Evaluation in Clinical Practice 27 (2021) 1243–1251. doi:10.1111/jep.13533.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] T. Okuhara, E. Furukawa, H. Okada, R. Yokota, T. Kiuchi, Readability of written information for patients across 30 years: A systematic review of systematic reviews, Patient Education and Counseling (2025) 108656. doi:10.1016/j.pec.2025.108656.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] A. Kugic, I. Martin, L. Modersohn, P. Pallaoro, M. Kreuzthaler, S. Schulz, M. Boeker, Processing of Short-Form Content in Clinical Narratives: Systematic Scoping Review, Journal of Medical Internet Research 26 (2024) e57852. doi:10.2196/57852.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Z. Li, H. Zhu, Z. Lu, M. Yin, Synthetic data generation with large language models for text classification: Potential and limitations, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 10443–10461. doi:10.18653/v1/2023.emnlp-main.647.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          , CoRR abs/1907.11692 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>U.</given-names>
            <surname>Kruschwitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>LLM-based synthetic datasets: Applications and limitations in toxicity detection</article-title>
          , in:
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Ojha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lahiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ratan</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Fourth Workshop on Threat, Aggression &amp; Cyberbullying @ LREC-COLING-2024</source>
          , ELRA and ICCL, Torino, Italia,
          <year>2024</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>51</lpage>
          . URL: https://aclanthology.org/2024.trac-1.6/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ketenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perotte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Elhadad</surname>
          </string-name>
          ,
          <article-title>Zero-Shot Clinical Acronym Expansion via Latent Meaning Cells</article-title>
          , in:
          <source>Proceedings of Machine Learning Research</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Pollard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-w. H.</given-names>
            <surname>Lehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghassemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Moody</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Szolovits</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Anthony</given-names>
            <surname>Celi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Mark</surname>
          </string-name>
          ,
          <article-title>MIMIC-III, a freely accessible critical care database</article-title>
          ,
          <source>Scientific Data</source>
          <volume>3</volume>
          (
          <year>2016</year>
          )
          <fpage>160035</fpage>
          . URL: http://www.nature.com/articles/sdata201635. doi:10.1038/sdata.2016.35.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Moon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pakhomov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Melton</surname>
          </string-name>
          ,
          <article-title>Clinical Abbreviation Sense Inventory</article-title>
          ,
          <year>2012</year>
          . URL: http://conservancy.umn.edu/handle/11299/137703, accepted: 2012-10-31T19:58:41Z.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martínez</surname>
          </string-name>
          ,
          <article-title>Disambiguating clinical abbreviations using a one-fits-all classifier based on deep learning techniques</article-title>
          ,
          <source>Methods of Information in Medicine</source>
          <volume>61</volume>
          (
          <year>2022</year>
          )
          <fpage>e28</fpage>
          -
          <lpage>e34</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>