<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.7937/QXK2-QG03</article-id>
      <title-group>
        <article-title>Synthetic Annotated Data for Named Entity Recognition in Computed Tomography Scan Reports</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Platas</string-name>
          <email>aplatas@vicomtech.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Zotova</string-name>
          <email>ezotova@vicomtech.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paola Martínez-Arias</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karen López-Linares</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Montse Cuadros</string-name>
          <email>mcuadros@vicomtech.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Languages and Computer Systems, University of the Basque Country</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fundación Vicomtech, Basque Research and Technology Alliance (BRTA)</institution>
          ,
          <addr-line>Mikeletegi 57, 20009 Donostia-San Sebastián</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>27</volume>
      <fpage>2193</fpage>
      <lpage>2201</lpage>
      <abstract>
<p>It is widely acknowledged that clinical data, in general, is scarce, and this scarcity worsens when focusing on specific domains. Moreover, the challenge escalates when annotated data is required. In this paper, we propose an approach to create synthetic annotated datasets for Named Entity Recognition (NER) tasks in Computed Tomography Reports (CTR) by leveraging large language models (LLMs). We investigate the potential of LLMs to generate meaningful texts in the healthcare domain through a combination of text generation techniques and automatic annotation using LLMs. Additionally, we conducted a series of experiments to demonstrate the efficacy of using synthetic data compared to real data for solving NER tasks.</p>
      </abstract>
      <kwd-group>
<kwd>Biomedical NER</kwd>
        <kwd>text generation</kwd>
        <kwd>data synthesis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        This work presents a method for creating synthetic annotated datasets for Named Entity Recognition (NER) in Computed Tomography Reports (CTR). We experiment with text generation and automatic annotation with large language models (LLMs), considering their capacity to produce meaningful texts on a given topic and zero-shot learning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. LLMs have already shown potential in extracting valuable information from unstructured data, such as electronic health records (EHRs) and digital medical data. Instead of applying LLMs in a zero-shot setting, we propose creating synthetic labelled data using LLMs for further fine-tuning of supervised NER models. Our research is motivated by the following challenges in Biomedical Natural Language Processing (BioNLP).
      </p>
      <p>
        High-quality annotated corpora are essential to train and validate predictive models in healthcare. Manual annotation requires personnel time and preparation, and the challenge is even more difficult in BioNLP, as the cost of expertise for annotation is higher than in general-purpose NLP, which makes using crowd-sourcing platforms for annotations almost impossible. This scarcity of annotated clinical narratives poses a significant challenge for machine learning (ML) and deep learning (DL) techniques, as they rely on large supervised corpora for training models [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. BioNLP also addresses sensitive information and privacy concerns, such as private information in electronic health records (EHR), so most datasets are not publicly available for research and development purposes. Concerns regarding patient privacy and lack of reliable de-identification techniques have made hospitals and clinics highly reluctant to allow researchers to access clinical data outside the association [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        We explore the new possibilities of synthetic textual data to overcome the above-mentioned factors. Synthetic data, in general, according to The Alan Turing Institute, is “data that has been generated using a purpose-built mathematical model or algorithm, with the aim of solving a (set of) data science task(s)" [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This type of data can statistically replicate real-world data’s underlying patterns and characteristics despite its artificial nature, so its defining feature is this ability to mimic real-world characteristics. Synthetic data can be classified into three broad categories: fully synthetic, partially synthetic, and hybrid. Fully synthetic data does not contain any original information; partially synthetic data replaces only the values of the selected sensitive attributes with synthetic values; and hybrid synthetic data, which we have generated, uses both the original and synthetic data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>The contributions of this paper are the following:</p>
      <p>• We propose a hybrid method for generating a synthetic annotated corpus from real-world structured data, using an existing dataset of Computed Tomography (CT) scan reports. This synthetic data is used as a training corpus for fine-tuning of language models for the biomedical NER task. Our method provides various prompting techniques for data generation with LLMs and an analysis of the effectiveness of synthetic data as data augmentation. Leveraging real-world data in the text synthesis helps get good-quality training data. The synthetic annotated corpus will be publicly available1.</p>
      <p>• Experiments with the models fine-tuned for the NER task show that the synthetic data can help to improve the models’ performance in the situation of annotated data scarcity.</p>
      <p>This paper is organised as follows. In Section 2 we overview works related to synthetic data and methods to get augmented corpora, both in biomedical and general-purpose NLP. Section 3 describes the task, the corpus we created with LLMs, and the corpus with the original data manually annotated. Section 4 is dedicated to the methodology of creating new corpora, and in Section 5 we explain the details of the experimentation with the corpora. In Section 6 the results of the experiments are shown, and Section 7 concludes our paper and discusses future work.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Related Work</title>
      <sec id="sec-1-1-1">
        <title>2.1. General-purpose NLP</title>
        <p>
          An upsurge in data synthesis and augmentation in general-purpose NLP began with rule-based approaches, such as grammar and lexicon replacement [
          <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
          ], and then adopted model-based approaches, such as sentence retrieval and backtranslation with machine learning techniques [
          <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
          ]. The interest in synthetic data generation is also related to the emergence of new architectures of deep neural networks and pre-trained language models. Various authors use BERT [12], BART [13], and GPT-2 [14] to generate data for classification and commonsense reasoning tasks, and experiment with conditioning on labels by prepending the label to training data during fine-tuning [15, 16, 17, 18]. [19] propose a task augmentation approach that utilises conditional generation to create in-domain synthetic data for an auxiliary Natural Language Inference (NLI) task, which is then employed to initialise the target task classifier. These works show better results with synthetic data, but observe that one needs to detect and discard low-quality labelled data or optionally re-label it. In the work of [20], the authors try to overcome these problems by knowledge distillation and self-training on domain-specific data.
        </p>
        <p>The most recent works explore the capacity of Large Language Models (LLMs) to annotate corpora automatically. [21] report that GPT-3.5-turbo2 outperforms crowd-source workers for the annotation of such tasks as relevance, stance, topics, and frame detection. The authors provided the corpora collected from Twitter and news, together with the annotation guides, to the LLM as a prompt. A similar approach [22] leverages LLMs to generate a few-shot prompt with explanations, which is then used to annotate unlabelled data for query and keyword relevance assessment, a question-answering task, and disambiguating word senses through binary classification of sentence pairs. [23] and [24] use LLMs for annotation with noisy labels and an active learning loop to determine what to annotate efficiently.</p>
        <p>In a multilingual setting, a fine-tuned 5-billion-parameter multilingual sequence-to-sequence model was used to generate annotated data for intent classification and slot tagging [25], and it was reported to perform better than the back-translation method.</p>
      </sec>
      <sec id="sec-1-1-2">
        <title>2.2. Biomedical NLP</title>
        <p>Synthetic data generation has also witnessed a marked increase in research publications in biomedical NLP, suggesting a potential for broader adoption. The surveys carried out by [26, 27] provide evidence that synthetic data is helpful in different aspects of healthcare and has possibilities to bridge data access gaps in research and evidence-based policy making. [28], on the contrary, explore the problem of synthetic data in healthcare: although it promises various positive opportunities, synthetic data carries concerns such as the risk of bias amplification, low interpretability, and an absence of robust methods for examining data quality.</p>
        <p>In [29], the authors tackle the task of generation of medical imaging reports using a hierarchical recurrent neural network decoder, which generates a sequence of topic representations conditioned on image information, and this then conditions the generation of the respective sentences. [30] propose an approach based on encoder-decoder Transformer models [31] trained for the gap-filling task to generate discharge summaries from a large mental healthcare provider and an intensive care unit. The model learns a sequence-to-sequence task where the clinical information and the key phrases are in the input, and the full original EHR record is in the output. A classification model trained on synthetic data shows results comparable to the models trained on original data.</p>
        <p>The methods for creating synthetic data with text generation models are explored by [32]: CharRNN [33], SegGAN [34], GPT-2 [14], and CTRL [35]. Then, the authors annotated the resulting data manually for a Named Entity Recognition (NER) task. The best-performing generation model was GPT-2. [36] explores the ability of LLMs to extract structured information from unstructured healthcare texts, specifically for biological NER and relation extraction (RE) tasks, in a zero-shot setting. The quality of the synthetic corpora is evaluated by fine-tuning supervised models; the authors report improvements in the performance of downstream tasks, compared to the zero-shot scenario, but not on original data, although the performance is comparable.</p>
        <p>We should note that most existing works experiment with corpora in English. There are only a few attempts to create multilingual datasets, for instance, a corpus for Health Question Answering that compares various LLMs [37], including T5 [38], BART [13] and GPT-3.53.</p>
        <p>Table 1. Corpora statistics: number of reports and tokens in the Authentic and Synthetic datasets (Synthetic Train: 197 reports, 44,272 tokens).</p>
        <p>1 The corpus will be released when the paper is accepted.
2 https://platform.openai.com/docs/models/gpt-3-5-turbo</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Task Definition and Corpora</title>
      <sec id="sec-2-1">
        <p>Named Entity Recognition (NER) [39] in the biomedical domain is crucial, as it aims at extracting concepts (known in general-purpose NLP as named entities), such as locations, treatment plans, medicines/drugs, diagnoses, etc., from clinical narratives. NER uses an IOB (Inside, Outside, Begin) tagging scheme, where each word is assigned a tag indicating whether it is the beginning of a named entity (B), inside a named entity (I), or outside a named entity (O). Formally, a sentence in a medical text is denoted as a sequence of words w = (w1, w2, . . . , wn), and the corresponding tags for each word in the sentence are denoted as t = (t1, t2, . . . , tn), where each tag ti is an element of the tag set {B, I, O}.</p>
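The IOB scheme described above can be illustrated with a short sketch; the sentence, tokenisation, and entity spans below are hypothetical examples in the spirit of the CTSR entities, not items from the corpus:

```python
def to_iob(tokens, entities):
    """Assign IOB tags to tokens.

    entities: list of (start_token, end_token_exclusive, label) tuples.
    """
    tags = ["O"] * len(tokens)  # everything outside an entity by default
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # tokens inside the entity
    return tags

# Hypothetical example:
tokens = ["65", "year", "old", "woman", "with", "a", "1.1", "cm", "tumor"]
entities = [(0, 3, "AGE"), (3, 4, "SEX"), (6, 8, "TUMOR_SIZE")]
print(to_iob(tokens, entities))
# ['B-AGE', 'I-AGE', 'I-AGE', 'B-SEX', 'O', 'O', 'B-TUMOR_SIZE', 'I-TUMOR_SIZE', 'O']
```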
        <p>Our goal is to train a NER model for detecting the following named entities in the Computed Tomography Scan Reports (CTSR): SEX (patient’s sex), AGE (patient’s age), HEPATOPATHY (type of hepatopathy found), TUMOR_SIZE (liver tumor size), and PROCEDURE (procedure performed). We consider two types of annotated corpora for the experimentation: (1) authentic data from liver cancer cases collected in a hospital, and (2) a synthetic dataset generated and annotated by an LLM.</p>
        <p>The first type of data is a private dataset in Spanish comprising 100 CTSRs performed on 66 patients. This corpus is manually annotated by experts and is used as the gold standard for the systems. Additionally, we used six real samples as examples in instructions for LLMs; these are not included in the training data and are used only to show report details such as structure, length and vocabulary. The second type of corpus consists of 197 reports, created and annotated by the LLM (see details of text generation and annotation in Section 4). The authentic corpus is split into train, development and test sets, as shown in Table 1, while the synthetic dataset is used in the training split only. The test set is used to evaluate the NER systems. Authentic reports are annotated with 635 entities and synthetic reports contain a total of 1311 entities, as we can observe in Table 2. We can point out that the classes SEX and AGE are unbalanced, appearing in only one report in the authentic dataset.</p>
        <p>Table 2. Entities: number of entities, average entities per report, and average tokens per entity.
Synthetic data — SEX: 195, 0.99, 1.24; AGE: 199, 1.01, 2; HEPATOPATHY: 286, 1.45, 3.52; TUMOR_SIZE: 433, 2.20, 2.09; PROCEDURE: 198, 1.01, 3.06; Total: 1311, 6.65, 2.41.
Authentic data — SEX: 1, 0.01, 1; AGE: 1, 0.01, 2; HEPATOPATHY: 249, 2.49, 2.61; TUMOR_SIZE: 237, 2.37, 1.59; PROCEDURE: 147, 1.47, 3.31; Total: 635, 6.35, 2.39.</p>
      </sec>
      <sec id="sec-2-4">
        <title>4. CT Reports Generation</title>
        <p>In this Section, we describe how we create the synthetic CT reports. In our case, synthetic data generation aims to create realistic clinical narratives similar to real reports while making them as diverse as possible. We reduce the probability of errors or hallucinations by incorporating information from real-world structured data.</p>
        <p>The generated data were semi-automatically annotated by the GPT-3.5-turbo model under human supervision to correct any potential annotation errors, such as entities left unlabeled or the annotation of words that were not entities. Our choice is explained by the model’s state-of-the-art capabilities of coherent text generation with a given prompt, which is an instruction or an example of how to complete a task. Given that this dataset consists solely of 197 reports, we manually verified these annotations.</p>
        <p>Unlike other experiments carried out recently [40, 36], we compose prompts for LLM instruction with real-world data from the “Colorectal-Liver-Metastases” dataset [41]. This dataset contains CT images from 197 patients with liver cancer. It also includes structured data in a tabular format, as we can observe in Table 3, with 36 attributes for each patient, mostly numerical, covering demographic, pathological, and survival data.</p>
        <p>To create a prompt for the model, the role “system” is described as an expert oncologist, and the patient ID is provided to retrieve information from the structured dataset. For each column that must be included in the text, we wrote a brief description to help ChatGPT understand the meaning of each column. Then, the model is instructed to generate a medical report. We observed a significant difference during the initial text generations when we changed the type of text requested in the prompt. As we can see in Table 4, using the term “informe” (report, in English) we obtain a much more schematic generation, while with the term “redacción” (writing, in English) we obtain an output more similar to the required one.</p>
        <p>Prompt 2: “Write a medical writing for patient &lt;Patient-ID&gt;”. Output: “The patient with the code CRLM-CT-1001 is a 65-year-old woman who has been diagnosed with liver cancer. She has colorectal liver metastases as the primary disease. The patient has a tumor of 1.1 cm in size [...]”</p>
        <p>Once the desired text style is obtained, we provide the model with a real sample as an example to generate a report with a similar structure. Providing real samples may result in the inclusion of information from those samples in the generated data. Therefore, instead of providing a real sample, we only show the structure of the report and a description of the content it should include in each section, as we can see in Table 5. When using structured data for report generation, the model creates identical reports by only changing the provided data. Furthermore, we add various synonyms, making annotated entities richer in vocabulary, as evidenced in Table 6. To achieve vocabulary variety, we employed high randomness in report generation and automatically replaced repeated phrases with a list of synonyms.</p>
        <p>Finally, we obtained the optimal prompt as shown in Table 7, where we specify the type of text, the report structure, and the patient ID, which is used as an index to get the patient’s information from the structured dataset.</p>
      </sec>
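The prompt-building step described above can be sketched as follows. The column names, descriptions, and patient row are illustrative placeholders, not the actual 36-attribute schema of the Colorectal-Liver-Metastases dataset; the system and user strings mirror the wording reported for the optimal prompt:

```python
# Hypothetical stand-ins for the real column schema and its descriptions.
COLUMN_DESCRIPTIONS = {
    "age": "patient age in years",
    "sex": "patient sex",
    "tumor_size_cm": "size of the largest liver tumor in centimeters",
}

def build_messages(patient_id, row):
    """Compose chat messages in the role/content format used by chat LLM APIs."""
    facts = "\n".join(
        f"- {col} ({desc}): {row[col]}" for col, desc in COLUMN_DESCRIPTIONS.items()
    )
    system = "You are an expert oncologist"
    user = (
        f"Write a medical narrative with short and concise sentences "
        f"for patient {patient_id}.\n"
        f"Use the following structured data:\n{facts}\n"
        "Do not include the patient ID in the report."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

messages = build_messages("CRLM-CT-1001", {"age": 65, "sex": "F", "tumor_size_cm": 1.1})
print(messages[1]["content"])
```

The resulting message list would then be sent to the generation model; the API call itself is omitted here.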
      <p>Table 7. Optimal prompt. System: “You are an expert oncologist”. User: “Write a medical narrative with short and concise sentences for patient &lt;Patient-ID&gt;. The generated text should have the following text structure: &lt;Text-Structure&gt;. Do not include the patient ID in the report.”</p>
      <p>An example of a CT report generated using this prompt is visualised in Figure 1. We can see a coherent, grammatically correct text with the required entities annotated.</p>
      <p>Comparing the generated texts among themselves, we have observed that, due to the high randomness used, the reports vary significantly. For instance, the lengths of the reports differ, the order in which the data is provided varies, and some reports repeat information in different parts of the text. However, the NER entities remain correctly annotated, and even though all reports are generated from the same prompt and the same structured dataset, the texts can still be distinguished from each other.</p>
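The synonym replacement used for vocabulary variety, and a simple way to check that generated reports remain distinguishable, can be sketched as follows. The Spanish synonym lists are illustrative, not the lists used for the corpus, and token-set Jaccard overlap is one possible similarity measure, not the analysis performed in the paper:

```python
import random

# Hypothetical synonym lists; the lists used for the actual corpus differ.
SYNONYMS = {
    "presenta": ["presenta", "muestra", "exhibe"],
    "tumor": ["tumor", "lesión", "masa"],
}

def diversify(report, rng):
    """Replace each listed repeated word with a randomly chosen synonym."""
    return " ".join(rng.choice(SYNONYMS.get(w, [w])) for w in report.split())

def jaccard(a, b):
    """Token-set overlap between two reports (1.0 = identical vocabulary)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

rng = random.Random(0)
base = "La paciente presenta un tumor de 1.1 cm"
variants = [diversify(base, rng) for _ in range(2)]
print(variants, round(jaccard(variants[0], variants[1]), 2))
```

Note that in practice, replacements that fall inside annotated entity spans must keep the entity annotations aligned with the new tokens.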
    </sec>
    <sec id="sec-3">
      <title>5. Experiments</title>
      <p>To evaluate the effectiveness of the generated data, we used different combinations of authentic and synthetic data in the training set. These experiments can be divided into two types based on their objective, and many of the experiments belong to both types, as shown in Table 8. All experiments have been evaluated with the same authentic test set, as shown in Table 1.</p>
      <p>The first trial, composed of five experiments, used the entire training set of authentic reports and introduced different amounts of randomly selected generated data. The objective is to determine whether synthetic reports can provide any improvement and how much data would be necessary.</p>
      <p>In the second trial, composed of seven experiments, we compared the metrics obtained using different amounts of authentic reports with and without the addition of synthetic reports, with the aim of verifying their effectiveness across different corpus sizes.</p>
    </sec>
    <sec id="sec-3-1">
      <title>6. Results</title>
      <sec id="sec-3-1-1">
        <title>6.1. Increasing Synthetic Data</title>
        <p>The results of the first trial can be observed in Figure 2, where the amount of synthetic data in the training set has been progressively increased. As we can see, all models achieve better results when synthetic data is introduced into the authentic dataset, especially the models based on RoBERTa [44], which show an increase in F1 score of between 8 and 10 points. On the other hand, the improvement achieved in the BERT models is much lower, between 2 and 3 points. We can highlight that the mBERT F1 score drops considerably when adding the entire set of synthetic data (+197), which might indicate potential overfitting. However, none of the experiments shows a decrease in performance compared to the baseline results (+0).</p>
        <p>From the first insertion, where we introduced 25 reports, or about 33% of the original data, the metrics stabilise, meaning that despite this data improving the results, the quantity added after 25 examples becomes irrelevant. The high lexical and stylistic similarity between synthetic reports could cause this; synthetic data could lead to greater improvement if we had generated more diverse reports using more samples as a reference.</p>
      </sec>
      <sec id="sec-3-1-2">
        <title>6.2. Increasing Authentic Data</title>
        <p>In this second trial, the effectiveness of synthetic data across different amounts of authentic data was tested. The average micro F1 score obtained and the standard deviation for each experiment are presented in Table 9.</p>
        <p>We observe a significant improvement when introducing synthetic data into a small training set (25 real reports) in any of the four models tested. However, as in the previous trial, we can see a notable difference in the improvement obtained between the models based on RoBERTa and those based on BERT. Both XLM-RoBERTa [43] and Biomedical-Clinical RoBERTa [45] reach 80 F1-score points after the addition of synthetic reports, more than 50 points higher than without using them, representing the greatest improvement achieved in this trial.</p>
        <p>On the other hand, the models mBERT [12] and BETO [46] are more robust: although significant improvements are achieved on small datasets, we observe that, using 50 reports, the F1 score already reaches 70 points without using synthetic data. Therefore, the difference between using them or not is smaller (an improvement of between 2 and 12 points of F1 score).</p>
        <p>In the experiment with only synthetic data, we can observe that the obtained metrics are very low, comparable to using only 25 real reports. Therefore, we can deduce that synthetic reports are effective only when combined with real data. We can also observe that the results are less stable when training with smaller datasets, as the standard deviation exceeds 5 points in many experiments that use only real reports. This deviation is considerably reduced when introducing synthetic data (to less than 2 points on average), as the size of the training set increases significantly.</p>
      </sec>
    </sec>
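The entity-level micro F1 scores and standard deviations reported in this kind of evaluation can be computed along the following lines. This is a minimal sketch over IOB tag sequences; standard evaluation tooling offers the same functionality, and the run scores at the end are illustrative numbers, not the paper's results:

```python
from statistics import mean, stdev

def entity_spans(tags):
    """Extract (start, end, label) entity spans from an IOB tag sequence."""
    out, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != label):
            out.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return out

def micro_f1(gold, pred):
    """Entity-level micro F1 over parallel lists of IOB tag sequences."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gs, ps = set(entity_spans(g)), set(entity_spans(p))
        tp += len(gs & ps)   # exact span-and-label matches
        fp += len(ps - gs)
        fn += len(gs - ps)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Aggregate repeated runs as mean and standard deviation (illustrative numbers):
runs = [0.78, 0.81, 0.80]
print(f"micro F1 = {mean(runs):.2f} ± {stdev(runs):.2f}")
```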
    <sec id="sec-4">
      <title>7. Conclusion and Future Work</title>
      <p>By transforming structured data into medical reports using a generative LLM, we have explored the benefits that such synthetic data can offer in fine-tuning pre-trained language models for NER tasks. We have developed a new synthetic NER corpus of 197 CT scan reports in Spanish, each from a different patient. We used structured and numerical data originating from an image dataset and took six samples of real reports as references.</p>
      <p>During the experiments, we have demonstrated that the addition of synthetic data to the training set can lead to considerable improvements in the results of all tested models, especially those based on RoBERTa: one of them likely because it was trained on data from the same domain, and the other due to its large number of parameters, which enhances its capabilities in this type of task.</p>
      <p>Our research leads to two valuable conclusions, which reveal some keys to generating effective reports. On the one hand, the generated reports should be as similar as possible to real data. Authentic reports typically contain a rich vocabulary, so this can be achieved by using high randomness during generation or by inserting or replacing synonyms in the text. On the other hand, the similarity between the generated texts themselves should be kept minimal, so that each one contributes relevant information while also avoiding overfitting. To this end, different text structures could be used in generation, or even different generative models apart from GPT-3.5-turbo.</p>
      <p>It is worth noting that even though we apply the best
techniques and models to create synthetic data, due to the
textual complexity of the medical domain, there is still
no technology capable of generating data that perfectly
simulates real data. However, this synthetic data can be
very useful when combined with authentic data.</p>
      <p>We believe that the proposed methods can be useful for generating new datasets from information extracted from structured data, especially for languages such as Spanish, where more datasets are needed to improve the performance of language models.</p>
    </sec>
    <sec id="sec-5">
      <title>8. Acknowledgments</title>
      <sec id="sec-5-1">
        <title>This work is partially funded by the STEER project, a Multi-Area Internal initiative from Vicomtech, and the EMPHASIS project (ZE-2021/00039), supported by the Basque Business Development Agency, SPRI.</title>
        <p>… Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 3291–3301.</p>
        <p>[12] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.</p>
        <p>[13] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7871–7880. doi:10.18653/v1/2020.acl-main.703.</p>
        <p>[14] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.</p>
        <p>[15] V. Kumar, A. Choudhary, E. Cho, Data augmentation using pre-trained transformer models, 2021. arXiv:2003.02245.</p>
        <p>[16] A. Anaby-Tavor, B. Carmeli, E. Goldbraich, A. Kantor, G. Kour, S. Shlomov, N. Tepper, N. Zwerdling, Do not have enough data? Deep learning to the rescue!, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 7383–7390.</p>
        <p>[17] Y. Yang, C. Malaviya, J. Fernandez, S. Swayamdipta, R. Le Bras, J.-P. Wang, C. Bhagavatula, Y. Choi, D. Downey, Generative data augmentation for commonsense reasoning, in: T. Cohn, Y. He, Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 1008–1025.</p>
        <p>[18] Y. Meng, J. Huang, Y. Zhang, J. Han, Generating Training Data with Language Models: Towards Zero-Shot Language Understanding, in: A. H. Oh, A. Agarwal, D. Belgrave, K. Cho (Eds.), Advances in Neural Information Processing Systems, 2022. URL: https://openreview.net/forum?id=4G1Sfp_1sz7.</p>
        <p>[19] T. Vu, M.-T. Luong, Q. Le, G. Simon, M. Iyyer, STraTA: Self-Training with Task Augmentation for Better Few-shot Learning, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 5715–5731. URL: https://aclanthology.org/2021.emnlp-main.462. doi:10.18653/v1/2021.emnlp-main.462.</p>
        <p>[20] X. He, I. Nassar, J. Kiros, G. Haffari, M. Norouzi, Generate, Annotate, and Learn: NLP with Synthetic Text, Transactions of the Association for Computational Linguistics 10 (2022) 826–842. doi:10.1162/tacl_a_00492.</p>
        <p>[21] F. Gilardi, M. Alizadeh, M. Kubli, ChatGPT outperforms crowd workers for text-annotation tasks, Proceedings of the National Academy of Sciences 120 (2023) e2305016120. doi:10.1073/pnas.2305016120.</p>
        <p>[22] X. He, Z.-W. Lin, Y. Gong, A. Jin, H. Zhang, C. Lin, J. Jiao, S. M. Yiu, N. Duan, W. Chen, AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators, ArXiv abs/2303.16854 (2023). URL: https://api.semanticscholar.org/CorpusID:257805087.</p>
        <p>[23] P. Bansal, A. Sharma, Large language models as annotators: Enhancing generalization of NLP models at minimal cost, arXiv preprint arXiv:2306.15766 (2023).</p>
        <p>[24] R. Zhang, Y. Li, Y. Ma, M. Zhou, L. Zou, LLMaAA: Making Large Language Models as Active Annotators, in: Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 13088–13103.</p>
        <p>[25] A. Rosenbaum, S. Soltan, W. Hamza, Y. Versley, M. Boese, LINGUIST: Language model instruction tuning to generate annotated utterances for intent classification and slot tagging, in: COLING 2022, 2022. URL: https://arxiv.org/abs/2209.09900.</p>
        <p>[26] A. Gonzales, G. Guruswamy, S. R. Smith, Synthetic data in health care: A narrative review, PLOS Digital Health 2 (2023) 1–16. doi:10.1371/journal.pdig.0000082.</p>
        <p>[27] H. Murtaza, M. Ahmed, N. F. Khan, G. Murtaza, S. Zafar, A. Bano, Synthetic data generation: State of the art in health care domain, Computer Science Review 48 (2023) 100546. doi:10.1016/j.cosrev.2023.100546.</p>
        <p>[28] M. Giuffrè, D. Shung, Harnessing the power of synthetic data in healthcare: innovation, application, and privacy, npj Digital Medicine 6 (2023). doi:10.1038/s41746-023-00927-3.</p>
        <p>[29] B. Jing, P. Xie, E. Xing, On the automatic generation of medical imaging reports, in: I. Gurevych, Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 2577–2586. URL: https://aclanthology.org/P18-1240. doi:10.18653/v1/P18-1240.</p>
        <p>[30] J. Ive, N. Viani, J. Kam, L. Yin, S. Verma, S. Puntis, R. N. Cardinal, A. Roberts, R. Stewart, S. Velupillai, Generation and evaluation of artificial mental health records for natural language processing, NPJ Digital Medicine 3 (2020) 69.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Patil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gudivada</surname>
          </string-name>
          ,
          <article-title>A Review of Current Trends, Techniques, and Challenges in Large Language Models (LLMs)</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>14</volume>
          (
          <year>2024</year>
          ). URL: https://www.mdpi.com/2076-3417/14/5/2074. doi:10.3390/app14052074.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fiorini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Wilbur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Bridging the Gap: Incorporating a Semantic Similarity Measure for Effectively Mapping PubMed Queries to Documents</article-title>
          ,
          <source>Journal of Biomedical Informatics</source>
          <volume>75</volume>
          (
          <year>2017</year>
          )
          <fpage>122</fpage>
          -
          <lpage>127</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Nadkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hirschman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. W.</given-names>
            <surname>D'Avolio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Savova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Uzuner</surname>
          </string-name>
          ,
          <article-title>Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions</article-title>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Szpruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Houssiau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bottarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cherubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Maple</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <article-title>Synthetic Data - what, why and how?</article-title>
          ,
          <year>2022</year>
          . arXiv:2205.03257
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Surendra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mohan</surname>
          </string-name>
          ,
          <article-title>A review of synthetic data generation methods for privacy preserving data publishing</article-title>
          ,
          <source>International Journal of Scientific &amp; Technology Research</source>
          <volume>6</volume>
          (
          <year>2017</year>
          )
          <fpage>95</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets</article-title>
          ,
          <source>in: Proceedings of the 2015 conference on empirical methods in natural language processing</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>2557</fpage>
          -
          <lpage>2563</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <article-title>Character-level convolutional networks for text classification</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>28</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marzoev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Kaashoek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Cafarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Andreas</surname>
          </string-name>
          ,
          <article-title>Unnatural Language Processing: Bridging the Gap Between Synthetic and Natural Language Data</article-title>
          , ArXiv abs/2004.13645 (
          <year>2020</year>
          ). URL: https://api.semanticscholar.org/CorpusID:216562596.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gangal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chandar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mitamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>A survey of data augmentation approaches for nlp</article-title>
          ,
          <source>arXiv preprint arXiv:2105.03075</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          ,
          <article-title>Contextual augmentation: Data augmentation by words with paradigmatic relations</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stent</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>2</volume>
          (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>452</fpage>
          -
          <lpage>457</lpage>
          . URL: https://aclanthology.org/N18-2072.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lichtarge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Alberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <article-title>Corpora Generation for Grammatical Error Correction</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Burstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Doran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Solorio</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>