<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.7937/QXK2-QG03</article-id>
      <title-group>
        <article-title>Synthetic Annotated Data for Named Entity Recognition in Computed Tomography Scan Reports</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Platas</string-name>
          <email>aplatas@vicomtech.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Zotova</string-name>
          <email>ezotova@vicomtech.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paola Martínez-Arias</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karen López-Linares</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Montse Cuadros</string-name>
          <email>mcuadros@vicomtech.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Languages and Computer Systems, University of the Basque Country</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fundación Vicomtech, Basque Research and Technology Alliance (BRTA)</institution>
          ,
          <addr-line>Mikeletegi 57, 20009 Donostia-San Sebastián</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>27</volume>
      <fpage>2193</fpage>
      <lpage>2201</lpage>
      <abstract>
<p>It is widely acknowledged that clinical data, in general, is scarce, and this scarcity worsens when focusing on specific domains. Moreover, the challenge escalates when annotated data is required. In this paper, we propose an approach to create synthetic annotated datasets for Named Entity Recognition (NER) tasks in Computed Tomography Reports (CTR) by leveraging large language models (LLMs). We investigate the potential of LLMs to generate meaningful texts in the healthcare domain through a combination of text generation techniques and automatic annotation using LLMs. Additionally, we conducted a series of experiments to demonstrate the efficacy of using synthetic data compared to real data for solving NER tasks.</p>
      </abstract>
      <kwd-group>
<kwd>Biomedical NER</kwd>
        <kwd>text generation</kwd>
        <kwd>data synthesis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        This work presents a method for creating synthetic annotated datasets for Named Entity Recognition (NER) in Computed Tomography Reports (CTR). We experiment with text generation and automatic annotation with large language models (LLMs), considering their capacity to produce meaningful texts on a given topic and zero-shot learning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. LLMs have already shown potential in extracting valuable information from unstructured data, such as electronic health records (EHRs) and digital medical data. Instead of applying LLMs in a zero-shot setting, we propose creating synthetic labelled data using LLMs for further fine-tuning of supervised NER models. Our research is motivated by the following challenges in Biomedical Natural Language Processing (BioNLP).
      </p>
      <p>
        High-quality annotated corpora are essential to train and validate predictive models in healthcare. Manual annotation requires personnel time and preparation, and the challenge is even more difficult in BioNLP, as the cost of expertise for annotation is higher than in general-purpose NLP, which makes using crowd-sourcing platforms for annotations almost impossible. This scarcity of annotated clinical narratives poses a significant challenge for machine learning (ML) and deep learning (DL) techniques, as they rely on large supervised corpora for training models [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. BioNLP also addresses sensitive information and privacy concerns, such as private information in electronic health records (EHR), so most datasets are not publicly available for research and development purposes. Concerns regarding patient privacy and lack of reliable de-identification techniques have made hospitals and clinics highly reluctant to allow researchers to access clinical data outside the association [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        We explore the new possibilities of synthetic textual data to overcome the above-mentioned factors. Synthetic data, in general, according to The Alan Turing Institute, is “data that has been generated using a purpose-built mathematical model or algorithm, with the aim of solving a (set of) data science task(s)" [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This type of data can statistically replicate real-world data’s underlying patterns and characteristics despite its artificial nature, so its defining feature is this ability to mimic real-world characteristics. Synthetic data can be classified into three broad categories: fully synthetic, partially synthetic, and hybrid. Fully synthetic data does not contain any original information; partially synthetic data replaces only the values of the selected sensitive attributes with synthetic values; and hybrid synthetic data, which we have generated, uses both the original and synthetic data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>The contributions of this paper are the following:</p>
      <p>• We propose a hybrid method for generating a synthetic annotated corpus from real-world structured data, using an existing dataset of Computed Tomography (CT) scan reports. This synthetic data is used as a training corpus for fine-tuning of language models for the biomedical NER task. Our method provides various prompting techniques for data generation with LLMs and an analysis of the effectiveness of synthetic data as data augmentation. Leveraging real-world data in the text synthesis helps get good-quality training data. The synthetic annotated corpus will be publicly available1.</p>
      <p>• Experiments with the models fine-tuned for the NER task show that the synthetic data can help to improve the models’ performance in the situation of annotated data scarcity.</p>
      <p>This paper is organised as follows. In Section 2 we overview works related to synthetic data and methods to get augmented corpora, both in biomedical and general-purpose NLP. Section 3 describes the task, the corpus we created with LLMs, and the corpus with the original data manually annotated. Section 4 is dedicated to the methodology of creating new corpora, and in Section 5 we explain the details of the experimentation with the corpora. In Section 6 the results of the experiments are shown, and Section 7 concludes our paper and discusses future work.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Related Work</title>
      <sec id="sec-1-1-1">
        <title>2.1. General-purpose NLP</title>
        <p>
          An upsurge in data synthesis and augmentation in general-purpose NLP began with rule-based approaches, such as grammar and lexicon replacement [
          <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
          ], and then adopted model-based approaches, such as sentence retrieval and backtranslation with machine learning techniques [
          <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
          ]. The interest in synthetic data generation is also related to the emergence of new architectures of deep neural networks and pre-trained language models. Various authors use BERT [12], BART [13], and GPT-2 [14] to generate data for classification and commonsense reasoning tasks, and experiment with conditioning on labels by prepending the label to training data during fine-tuning [15, 16, 17, 18]. [19] propose a task augmentation approach that utilises conditional generation to create in-domain synthetic data for an auxiliary Natural Language Inference (NLI) task, which is then employed to initialise the target task classifier. These works show better results with synthetic data, but observe that one needs to detect and discard low-quality labelled data or optionally re-label it. In the work of [20], the authors try to overcome these problems by knowledge distillation and self-training on domain-specific data.
        </p>
        <p>The most recent works explore the capacity of Large Language Models (LLMs) to annotate corpora automatically. [21] report that GPT-3.5-turbo2 outperforms crowd-source workers for the annotation of such tasks as relevance, stance, topics, and frame detection. The authors provided the corpora collected from Twitter and news, together with the annotation guides, to the LLM as a prompt. A similar approach [22] leverages LLMs to generate a few-shot prompt with explanations, which is then used to annotate unlabelled data for query and keyword relevance assessment, a question-answering task, and disambiguating word senses through binary classification of sentence pairs. [23] and [24] use LLMs for annotation with noisy labels and an active learning loop to determine what to annotate efficiently.</p>
        <p>In a multilingual setting, a fine-tuned 5-billion-parameter multilingual sequence-to-sequence model was used to generate annotated data for intent classification and slot tagging [25], and it was reported to perform better than the back-translation method.</p>
      </sec>
      <sec id="sec-1-1-2">
        <title>2.2. Biomedical NLP</title>
        <p>Synthetic data generation has also witnessed a marked increase in research publications in biomedical NLP, suggesting a potential for broader adoption. The surveys carried out by [26, 27] provide evidence that synthetic data is helpful in different aspects of healthcare and has possibilities to bridge data access gaps in research and evidence-based policy making. [28], on the contrary, explore the problem of synthetic data in healthcare: although it promises various positive opportunities, synthetic data carries concerns such as the risk of bias amplification, low interpretability, and an absence of robust methods for examining data quality.</p>
        <p>In [29], the authors tackle the task of generation of medical imaging reports using a hierarchical recurrent neural network decoder, which generates a sequence of topic representations conditioned on image information, and this then conditions the generation of the respective sentences. [30] propose an approach based on encoder-decoder Transformer models [31] trained for the gap-filling task to generate discharge summaries from a large mental healthcare provider and an intensive care unit. The model learns a sequence-to-sequence task where the clinical information and the key phrases are in the input, and the full original EHR record is in the output. A classification model trained on synthetic data shows results comparable to the models trained on original data.</p>
        <p>The methods for creating synthetic data with text generation models are explored by [32]: CharRNN [33], SegGAN [34], GPT-2 [14], and CTRL [35]. Then, the authors annotated the resulting data manually for a Named Entity Recognition (NER) task. The best-performing generation model was GPT-2. [36] explores the ability of LLMs to extract structured information from unstructured healthcare texts, specifically for biological NER and relation extraction (RE) tasks, in a zero-shot setting. The quality of the synthetic corpora is evaluated by fine-tuning supervised models; the authors report improvements in the performance of downstream tasks, compared to the zero-shot scenario, but not on original data, although the performance is comparable.</p>
        <p>We should note that most existing works experiment with corpora in English. There are only a few attempts to create multilingual datasets, for instance, a corpus for Health Question Answering that compares various LLMs [37], including T5 [38], BART [13] and GPT-3.53.</p>
        <p>Table 1. Corpora statistics: number of reports and tokens in the Authentic and Synthetic datasets (Synthetic Train: 197 reports, 44,272 tokens).</p>
        <p>1 The corpus will be released when the paper is accepted.
2 https://platform.openai.com/docs/models/gpt-3-5-turbo</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Task Definition and Corpora</title>
      <sec id="sec-2-1">
        <p>Named Entity Recognition (NER) [39] in the biomedical domain is crucial, as it aims at extracting concepts (known in general-purpose NLP as named entities), such as locations, treatment plans, medicines/drugs, diagnoses, etc., from clinical narratives. NER uses an IOB (Inside, Outside, Begin) tagging scheme, where each word is assigned a tag indicating whether it is the beginning of a named entity (B), inside a named entity (I), or outside a named entity (O). Formally, a sentence in a medical text is denoted as a sequence of words w = (w1, w2, . . . , wn), and the corresponding tags for each word in the sentence are denoted as t = (t1, t2, . . . , tn), where each tag ti is an element of the tag set {B, I, O}.</p>
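The IOB scheme described above can be illustrated with a short sketch; the sentence, tokenisation, and entity spans below are hypothetical examples in the spirit of the CTSR entities, not items from the corpus:

```python
def to_iob(tokens, entities):
    """Assign IOB tags to tokens.

    entities: list of (start_token, end_token_exclusive, label) tuples.
    """
    tags = ["O"] * len(tokens)  # everything outside an entity by default
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # tokens inside the entity
    return tags

# Hypothetical example:
tokens = ["65", "year", "old", "woman", "with", "a", "1.1", "cm", "tumor"]
entities = [(0, 3, "AGE"), (3, 4, "SEX"), (6, 8, "TUMOR_SIZE")]
print(to_iob(tokens, entities))
# ['B-AGE', 'I-AGE', 'I-AGE', 'B-SEX', 'O', 'O', 'B-TUMOR_SIZE', 'I-TUMOR_SIZE', 'O']
```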
        <p>Our goal is to train a NER model for detecting the following named entities in the Computed Tomography Scan Reports (CTSR): SEX (patient’s sex), AGE (patient’s age), HEPATOPATHY (type of hepatopathy found), TUMOR_SIZE (liver tumor size), and PROCEDURE (procedure performed). We consider two types of annotated corpora for the experimentation: (1) authentic data from liver cancer cases collected in a hospital, and (2) a synthetic dataset generated and annotated by an LLM.</p>
        <p>The first type of data is a private dataset in Spanish comprising 100 CTSRs performed on 66 patients. This corpus is manually annotated by experts and is used as the gold standard for the systems. Additionally, we used six real samples as examples in instructions for LLMs; these are not included in the training data and are used only to show report details such as structure, length and vocabulary. The second type of corpus consists of 197 reports, created and annotated by the LLM (see details of text generation and annotation in Section 4). The authentic corpus is split into train, development and test sets, as shown in Table 1, while the synthetic dataset is used in the training split only. The test set is used to evaluate the NER systems. Authentic reports are annotated with 635 entities and synthetic reports contain a total of 1311 entities, as we can observe in Table 2. We can point out that the classes SEX and AGE are unbalanced, appearing in only one report in the authentic dataset.</p>
        <p>Table 2. Entities: number of entities, average entities per report, and average tokens per entity.
Synthetic data — SEX: 195, 0.99, 1.24; AGE: 199, 1.01, 2; HEPATOPATHY: 286, 1.45, 3.52; TUMOR_SIZE: 433, 2.20, 2.09; PROCEDURE: 198, 1.01, 3.06; Total: 1311, 6.65, 2.41.
Authentic data — SEX: 1, 0.01, 1; AGE: 1, 0.01, 2; HEPATOPATHY: 249, 2.49, 2.61; TUMOR_SIZE: 237, 2.37, 1.59; PROCEDURE: 147, 1.47, 3.31; Total: 635, 6.35, 2.39.</p>
      </sec>
      <sec id="sec-2-4">
        <title>4. CT Reports Generation</title>
        <p>In this Section, we describe how we create the synthetic CT reports. In our case, synthetic data generation aims to create realistic clinical narratives similar to real reports while making them as diverse as possible. We reduce the probability of errors or hallucinations by incorporating information from real-world structured data.</p>
        <p>The generated data were semi-automatically annotated by the GPT-3.5-turbo model under human supervision to correct any potential annotation errors, such as entities left unlabeled or the annotation of words that were not entities. Our choice is explained by the model’s state-of-the-art capabilities of coherent text generation with a given prompt, which is an instruction or an example of how to complete a task. Given that this dataset consists solely of 197 reports, we manually verified these annotations.</p>
        <p>Unlike other experiments carried out recently [40, 36], we compose prompts for LLM instruction with real-world data from the “Colorectal-Liver-Metastases” dataset [41]. This dataset contains CT images from 197 patients with liver cancer. It also includes structured data in a tabular format, as we can observe in Table 3, with 36 attributes for each patient, mostly numerical, covering demographic, pathological, and survival data.</p>
        <p>To create a prompt for the model, the role “system” is described as an expert oncologist, and the patient ID is provided to retrieve information from the structured dataset. For each column that must be included in the text, we wrote a brief description to help ChatGPT understand the meaning of each column. Then, the model is instructed to generate a medical report. We observed a significant difference during the initial text generations when we changed the type of text requested in the prompt. As we can see in Table 4, using the term “informe” (report, in English) we obtain a much more schematic generation, while with the term “redacción” (writing, in English) we obtain an output more similar to the required one.</p>
        <p>Prompt 2: “Write a medical writing for patient &lt;Patient-ID&gt;”. Output: “The patient with the code CRLM-CT-1001 is a 65-year-old woman who has been diagnosed with liver cancer. She has colorectal liver metastases as the primary disease. The patient has a tumor of 1.1 cm in size [...]”</p>
        <p>Once the desired text style is obtained, we provide the model with a real sample as an example to generate a report with a similar structure. Providing real samples may result in the inclusion of information from those samples in the generated data. Therefore, instead of providing a real sample, we only show the structure of the report and a description of the content it should include in each section, as we can see in Table 5. When using structured data for report generation, the model creates identical reports by only changing the provided data. Furthermore, we add various synonyms, making annotated entities richer in vocabulary, as evidenced in Table 6. To achieve vocabulary variety, we employed high randomness in report generation and automatically replaced repeated phrases with a list of synonyms.</p>
        <p>Finally, we obtained the optimal prompt as shown in Table 7, where we specify the type of text, the report structure, and the patient ID, which is used as an index to get the patient’s information from the structured dataset.</p>
      </sec>
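The prompt-building step described above can be sketched as follows. The column names, descriptions, and patient row are illustrative placeholders, not the actual 36-attribute schema of the Colorectal-Liver-Metastases dataset; the system and user strings mirror the wording reported for the optimal prompt:

```python
# Hypothetical stand-ins for the real column schema and its descriptions.
COLUMN_DESCRIPTIONS = {
    "age": "patient age in years",
    "sex": "patient sex",
    "tumor_size_cm": "size of the largest liver tumor in centimeters",
}

def build_messages(patient_id, row):
    """Compose chat messages in the role/content format used by chat LLM APIs."""
    facts = "\n".join(
        f"- {col} ({desc}): {row[col]}" for col, desc in COLUMN_DESCRIPTIONS.items()
    )
    system = "You are an expert oncologist"
    user = (
        f"Write a medical narrative with short and concise sentences "
        f"for patient {patient_id}.\n"
        f"Use the following structured data:\n{facts}\n"
        "Do not include the patient ID in the report."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

messages = build_messages("CRLM-CT-1001", {"age": 65, "sex": "F", "tumor_size_cm": 1.1})
print(messages[1]["content"])
```

The resulting message list would then be sent to the generation model; the API call itself is omitted here.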
      <p>Table 7. Optimal prompt. System: “You are an expert oncologist”. User: “Write a medical narrative with short and concise sentences for patient &lt;Patient-ID&gt;. The generated text should have the following text structure: &lt;Text-Structure&gt;. Do not include the patient ID in the report.”</p>
      <p>An example of a CT report generated using this prompt is visualised in Figure 1. We can see a coherent, grammatically correct text with the required entities annotated.</p>
      <p>Comparing the generated texts among themselves, we have observed that, due to the high randomness used, the reports vary significantly. For instance, the lengths of the reports differ, the order in which the data is provided varies, and some reports repeat information in different parts of the text. However, the NER entities remain correctly annotated, and even though all reports are generated from the same prompt and the same structured dataset, the texts can still be distinguished from each other.</p>
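The synonym replacement used for vocabulary variety, and a simple way to check that generated reports remain distinguishable, can be sketched as follows. The Spanish synonym lists are illustrative, not the lists used for the corpus, and token-set Jaccard overlap is one possible similarity measure, not the analysis performed in the paper:

```python
import random

# Hypothetical synonym lists; the lists used for the actual corpus differ.
SYNONYMS = {
    "presenta": ["presenta", "muestra", "exhibe"],
    "tumor": ["tumor", "lesión", "masa"],
}

def diversify(report, rng):
    """Replace each listed repeated word with a randomly chosen synonym."""
    return " ".join(rng.choice(SYNONYMS.get(w, [w])) for w in report.split())

def jaccard(a, b):
    """Token-set overlap between two reports (1.0 = identical vocabulary)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

rng = random.Random(0)
base = "La paciente presenta un tumor de 1.1 cm"
variants = [diversify(base, rng) for _ in range(2)]
print(variants, round(jaccard(variants[0], variants[1]), 2))
```

Note that in practice, replacements that fall inside annotated entity spans must keep the entity annotations aligned with the new tokens.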
    </sec>
    <sec id="sec-3">
      <title>5. Experiments</title>
      <p>To evaluate the effectiveness of the generated data, we used different combinations of authentic and synthetic data in the training set. These experiments can be divided into two types based on their objective, and many of the experiments belong to both types, as shown in Table 8. All experiments have been evaluated with the same authentic test set, as shown in Table 1.</p>
      <p>The first trial, composed of five experiments, used the entire training set of authentic reports and introduced different amounts of randomly selected generated data. The objective is to determine whether synthetic reports can provide any improvement and how much data would be necessary.</p>
      <p>In the second trial, composed of seven experiments, we compared the metrics obtained using different amounts of authentic reports with and without the addition of synthetic reports, with the aim of verifying their effectiveness across different corpus sizes.</p>
    </sec>
    <sec id="sec-3-1">
      <title>6. Results</title>
      <sec id="sec-3-1-1">
        <title>6.1. Increasing Synthetic Data</title>
        <p>The results of the first trial can be observed in Figure 2, where the amount of synthetic data in the training set has been progressively increased. As we can see, all models achieve better results when synthetic data is introduced into the authentic dataset, especially the models based on RoBERTa [44], which show an increase in F1 score of between 8 and 10 points. On the other hand, the improvement achieved in the BERT models is much lower, between 2 and 3 points. We can highlight that the mBERT F1 score drops considerably when adding the entire set of synthetic data (+197), which might indicate potential overfitting. However, none of the experiments shows a decrease in performance compared to the baseline results (+0).</p>
        <p>From the first insertion, where we introduced 25 reports, or about 33% of the original data, the metrics stabilise, meaning that despite this data improving the results, the quantity added after 25 examples becomes irrelevant. The high lexical and stylistic similarity between synthetic reports could cause this; synthetic data could lead to greater improvement if we had generated more diverse reports using more samples as a reference.</p>
      </sec>
      <sec id="sec-3-1-2">
        <title>6.2. Increasing Authentic Data</title>
        <p>In this second trial, the effectiveness of synthetic data across different amounts of authentic data was tested. The average micro F1 score obtained and the standard deviation for each experiment are presented in Table 9.</p>
        <p>We observe a significant improvement when introducing synthetic data into a small training set (25 real reports) in any of the four models tested. However, as in the previous trial, we can see a notable difference in the improvement obtained between the models based on RoBERTa and those based on BERT. Both XLM-RoBERTa [43] and Biomedical-Clinical RoBERTa [45] reach 80 F1-score points after the addition of synthetic reports, more than 50 points higher than without using them, representing the greatest improvement achieved in this trial.</p>
        <p>On the other hand, the models mBERT [12] and BETO [46] are more robust: although significant improvements are achieved on small datasets, we observe that, using 50 reports, the F1 score already reaches 70 points without using synthetic data. Therefore, the difference between using them or not is smaller (an improvement of between 2 and 12 points of F1 score).</p>
        <p>In the experiment with only synthetic data, we can observe that the obtained metrics are very low, comparable to using only 25 real reports. Therefore, we can deduce that synthetic reports are effective only when combined with real data. We can also observe that the results are less stable when training with smaller datasets, as the standard deviation exceeds 5 points in many experiments that use only real reports. This deviation is considerably reduced when introducing synthetic data (to less than 2 points on average), as the size of the training set increases significantly.</p>
      </sec>
    </sec>
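The entity-level micro F1 scores and standard deviations reported in this kind of evaluation can be computed along the following lines. This is a minimal sketch over IOB tag sequences; standard evaluation tooling offers the same functionality, and the run scores at the end are illustrative numbers, not the paper's results:

```python
from statistics import mean, stdev

def entity_spans(tags):
    """Extract (start, end, label) entity spans from an IOB tag sequence."""
    out, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != label):
            out.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return out

def micro_f1(gold, pred):
    """Entity-level micro F1 over parallel lists of IOB tag sequences."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gs, ps = set(entity_spans(g)), set(entity_spans(p))
        tp += len(gs & ps)   # exact span-and-label matches
        fp += len(ps - gs)
        fn += len(gs - ps)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Aggregate repeated runs as mean and standard deviation (illustrative numbers):
runs = [0.78, 0.81, 0.80]
print(f"micro F1 = {mean(runs):.2f} ± {stdev(runs):.2f}")
```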
    <sec id="sec-4">
      <title>7. Conclusion and Future Work</title>
      <p>By transforming structured data into medical reports using a generative LLM, we have explored the benefits that such synthetic data can offer in fine-tuning pre-trained language models for NER tasks. We have developed a new synthetic NER corpus of 197 CT scan reports in Spanish, each from a different patient. We used structured and numerical data originating from an image dataset and took six samples of real reports as references.</p>
      <p>During the experiments, we have demonstrated that the addition of synthetic data to the training set can lead to considerable improvements in the results of all tested models, especially those based on RoBERTa: one of them likely because it was trained on data from the same domain, and the other due to its large number of parameters, which enhances its capabilities in this type of task.</p>
      <p>Our research leads to two valuable conclusions, which reveal some keys to generating effective reports. On the one hand, the generated reports should be as similar as possible to real data. Authentic reports typically contain a rich vocabulary, so this can be achieved by using high randomness during generation or by inserting or replacing synonyms in the text. On the other hand, the similarity between the generated texts themselves should be kept minimal, so that each one contributes relevant information while also avoiding overfitting. To this end, different text structures could be used in generation, or even different generative models apart from GPT-3.5-turbo.</p>
      <p>It is worth noting that even though we apply the best
techniques and models to create synthetic data, due to the
textual complexity of the medical domain, there is still
no technology capable of generating data that perfectly
simulates real data. However, this synthetic data can be
very useful when combined with authentic data.</p>
      <p>We believe that the proposed methods can be useful for generating new datasets from information extracted from structured data, especially for languages such as Spanish, where more datasets are needed to improve the performance of language models.</p>
    </sec>
    <sec id="sec-5">
      <title>8. Acknowledgments</title>
      <sec id="sec-5-1">
        <title>This work is partially funded by the STEER project, a Multi-Area Internal initiative from Vicomtech, and the EMPHASIS project (ZE-2021/00039), supported by the Basque Business Development Agency, SPRI.</title>
        <p>… Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 3291–3301.</p>
        <p>[12] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.</p>
        <p>[13] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7871–7880. doi:10.18653/v1/2020.acl-main.703.</p>
        <p>[14] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.</p>
        <p>[15] V. Kumar, A. Choudhary, E. Cho, Data augmentation using pre-trained transformer models, 2021. arXiv:2003.02245.</p>
        <p>[16] A. Anaby-Tavor, B. Carmeli, E. Goldbraich, A. Kantor, G. Kour, S. Shlomov, N. Tepper, N. Zwerdling, Do not have enough data? Deep learning to the rescue!, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 7383–7390.</p>
        <p>[17] Y. Yang, C. Malaviya, J. Fernandez, S. Swayamdipta, R. Le Bras, J.-P. Wang, C. Bhagavatula, Y. Choi, D. Downey, Generative data augmentation for commonsense reasoning, in: T. Cohn, Y. He, Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 1008–1025.</p>
        <p>[18] Y. Meng, J. Huang, Y. Zhang, J. Han, Generating Training Data with Language Models: Towards Zero-Shot Language Understanding, in: A. H. Oh, A. Agarwal, D. Belgrave, K. Cho (Eds.), Advances in Neural Information Processing Systems, 2022. URL: https://openreview.net/forum?id=4G1Sfp_1sz7.</p>
        <p>[19] T. Vu, M.-T. Luong, Q. Le, G. Simon, M. Iyyer, STraTA: Self-Training with Task Augmentation for Better Few-shot Learning, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 5715–5731. URL: https://aclanthology.org/2021.emnlp-main.462. doi:10.18653/v1/2021.emnlp-main.462.</p>
        <p>[20] X. He, I. Nassar, J. Kiros, G. Haffari, M. Norouzi, Generate, Annotate, and Learn: NLP with Synthetic Text, Transactions of the Association for Computational Linguistics 10 (2022) 826–842. doi:10.1162/tacl_a_00492.</p>
        <p>[21] F. Gilardi, M. Alizadeh, M. Kubli, ChatGPT outperforms crowd workers for text-annotation tasks, Proceedings of the National Academy of Sciences 120 (2023) e2305016120. doi:10.1073/pnas.2305016120.</p>
        <p>[22] X. He, Z.-W. Lin, Y. Gong, A. Jin, H. Zhang, C. Lin, J. Jiao, S. M. Yiu, N. Duan, W. Chen, AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators, ArXiv abs/2303.16854 (2023). URL: https://api.semanticscholar.org/CorpusID:257805087.</p>
        <p>[23] P. Bansal, A. Sharma, Large language models as annotators: Enhancing generalization of NLP models at minimal cost, arXiv preprint arXiv:2306.15766 (2023).</p>
        <p>[24] R. Zhang, Y. Li, Y. Ma, M. Zhou, L. Zou, LLMaAA: Making Large Language Models as Active Annotators, in: Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 13088–13103.</p>
        <p>[25] A. Rosenbaum, S. Soltan, W. Hamza, Y. Versley, M. Boese, LINGUIST: Language model instruction tuning to generate annotated utterances for intent classification and slot tagging, in: COLING 2022, 2022. URL: https://arxiv.org/abs/2209.09900.</p>
        <p>[26] A. Gonzales, G. Guruswamy, S. R. Smith, Synthetic data in health care: A narrative review, PLOS Digital Health 2 (2023) 1–16. doi:10.1371/journal.pdig.0000082.</p>
        <p>[27] H. Murtaza, M. Ahmed, N. F. Khan, G. Murtaza, S. Zafar, A. Bano, Synthetic data generation: State of the art in health care domain, Computer Science Review 48 (2023) 100546. doi:10.1016/j.cosrev.2023.100546.</p>
        <p>[28] M. Giuffrè, D. Shung, Harnessing the power of synthetic data in healthcare: innovation, application, and privacy, npj Digital Medicine 6 (2023). doi:10.1038/s41746-023-00927-3.</p>
        <p>[29] B. Jing, P. Xie, E. Xing, On the automatic generation of medical imaging reports, in: I. Gurevych, Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 2577–2586. URL: https://aclanthology.org/P18-1240. doi:10.18653/v1/P18-1240.</p>
        <p>[30] J. Ive, N. Viani, J. Kam, L. Yin, S. Verma, S. Puntis, R. N. Cardinal, A. Roberts, R. Stewart, S. Velupillai, Generation and evaluation of artificial mental health records for natural language processing, NPJ Digital Medicine 3 (2020) 69.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Patil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gudivada</surname>
          </string-name>
          ,
          <article-title>A Review of Current Trends, Techniques, and Challenges in Large Language Models (LLMs)</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>14</volume>
          (
          <year>2024</year>
          ). URL: https://www.mdpi.com/2076-3417/14/5/2074. doi:10.3390/app14052074.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fiorini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Wilbur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Bridging the Gap: Incorporating a Semantic Similarity Measure for Effectively Mapping PubMed Queries to Documents</article-title>
          ,
          <source>Journal of Biomedical Informatics</source>
          <volume>75</volume>
          (
          <year>2017</year>
          )
          <fpage>122</fpage>
          -
          <lpage>127</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Nadkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hirschman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. W.</given-names>
            <surname>D'Avolio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Savova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Uzuner</surname>
          </string-name>
          ,
          <article-title>Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions</article-title>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Szpruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Houssiau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bottarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cherubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Maple</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <article-title>Synthetic Data - what, why and how?</article-title>
          ,
          <year>2022</year>
          . arXiv:2205.03257
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Surendra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mohan</surname>
          </string-name>
          ,
          <article-title>A review of synthetic data generation methods for privacy preserving data publishing</article-title>
          ,
          <source>International Journal of Scientific &amp; Technology Research</source>
          <volume>6</volume>
          (
          <year>2017</year>
          )
          <fpage>95</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets</article-title>
          ,
          <source>in: Proceedings of the 2015 conference on empirical methods in natural language processing</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>2557</fpage>
          -
          <lpage>2563</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <article-title>Character-level convolutional networks for text classification</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>28</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marzoev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Kaashoek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Cafarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Andreas</surname>
          </string-name>
          ,
          <article-title>Unnatural Language Processing: Bridging the Gap Between Synthetic and Natural Language Data</article-title>
          , ArXiv abs/2004.13645 (
          <year>2020</year>
          ). URL: https://api.semanticscholar.org/CorpusID:216562596.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gangal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chandar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mitamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>A survey of data augmentation approaches for nlp</article-title>
          ,
          <source>arXiv preprint arXiv:2105.03075</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          ,
          <article-title>Contextual augmentation: Data augmentation by words with paradigmatic relations</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stent</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>2</volume>
          (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>452</fpage>
          -
          <lpage>457</lpage>
          . URL: https://aclanthology.org/N18-2072.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lichtarge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Alberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <article-title>Corpora Generation for Grammatical Error Correction</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Burstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Doran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Solorio</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>