<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Artificial Conversations, Real Results: Fostering Language Detection with Synthetic Data</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Milan</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Collecting high-quality training data is essential for fine-tuning Large Language Models (LLMs). However, acquiring such data is often costly and time-consuming, especially for non-English languages such as Italian. Recently, researchers have begun to explore the use of LLMs to generate synthetic datasets as a viable alternative. This study proposes a pipeline for generating synthetic data and a comprehensive approach for investigating the factors that influence the validity of synthetic data generated by LLMs by examining how model performance is afected by metrics such as prompt strategy, text length and target position in a specific task, i.e. inclusive language detection in Italian job advertisements. Our results show that, in most cases and across diferent metrics, the fine-tuned models trained on synthetic data consistently outperformed other models on both seed and synthetic test datasets. The study discusses the practical implications and limitations of using synthetic data for language detection tasks with LLMs.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Large Language Models</kwd>
        <kwd>synthetic data</kwd>
        <kwd>inference</kwd>
        <kwd>finetuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, Large Language Models (LLMs) have received considerable attention in language
recognition tasks. However, the efectiveness of these models largely depends on appropriate
finetuning procedures, which require access to large, diverse, and high-quality datasets for training and
evaluation. Obtaining such datasets is challenging due to issues such as data scarcity, privacy constraints,
class imbalance, an absence of edge cases, and the high costs of collection and annotation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Synthetic data has been proposed as a potential solution to address certain challenges associated
with language detection tasks, including limited data availability [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and the psychological impact on
annotators [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Synthetic data helps to overcome these limitations by ofering greater control over data
properties, allowing tailored augmentation for specific tasks, such as testing models under diferent
conditions or rare scenarios [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        It also accelerates the iterative model development process by providing readily available and
customizable datasets [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In addition, synthetic data helps mitigate biases present in real-world datasets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
and improves model generalization by exposing algorithms to a wider range of input variations [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This
approach is particularly valuable in areas such as healthcare [8], complex systems [9], and natural
language processing [10], where obtaining high-quality labeled data is often dificult or resource-intensive.
      </p>
      <p>The capabilities of language models for most language detection tasks have been extensively discussed
[11, 12]. However, an investigation of the utility of LLMs for detecting inclusive language is still lacking.
In particular, an in-depth evaluation of the capabilities of such models, e.g. their ability to achieve
high performance even when fine-tuned using synthetic data. Therefore, this paper aims to address
the challenges associated with acquiring high-quality training data for fine-tuning LLMs, especially in
low-resource language settings, such as many non-English languages. In this paper, we address these
challenges through the following contributions: (1) Proposing a synthetic data generation pipeline to
address data scarcity in low-resource language settings. (2) Outline a workflow that involves fine-tuning
an LLM on synthetic training data, followed by inference with fine-tuned and pre-trained models on
synthetic test data to evaluate the efectiveness of synthetic data. (3) Focusing on inclusive language
detection, an under-researched and challenging task, especially in gendered languages such as Italian.
(4) Demonstrate the potential of synthetic data as a cost-efective, scalable solution by showing that
ifne-tuned models trained on synthetic data outperform other models on both seed and synthetic test
data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. Synthetic data generation</title>
        <p>Synthetic data generation has become an essential approach to mitigate challenges related to data
scarcity, privacy, and the need for diverse datasets when training machine learning models [13]. Diferent
techniques, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and
LLMs, ofer diferent capabilities tailored to specific data types. GANs are particularly efective for
generating tabular data, while VAEs are widely used for generating synthetic images and tabular datasets
[14].</p>
        <p>For synthetic text data, LLMs are the most suitable choice. These models produce coherent,
contextually accurate text that is virtually indistinguishable from human-written content, making them ideal for
natural language processing tasks [15]. By providing well-crafted prompts or instructions, LLMs can
generate diverse and realistic textual data, such as dialogues, narratives, and domain-specific content
[14]. Such synthetic data can be used for a variety of purposes, including fine-tuning LLMs, increasing
the diversity of datasets, and simulating scenarios for testing and development.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Using synthetic data for language detection tasks</title>
        <p>Several studies have explored using synthetic data for language detection tasks. Casula et al. [16] showed
that instruction-based rewriting of original English texts can produce synthetic data efective enough
to train classifiers that perform as well as, or better than, those trained on original data. Maheshwari et
al. [17] demonstrated that synthetic data properties can predict performance in intent detection tasks.
Khullar et al. [18] generated synthetic data for hate speech detection in low-resource languages, such
as Hindi and Vietnamese, using machine translation and contextual entity substitution, finding that
models trained on synthetic data can match or exceed the performance of those trained on available
target-domain examples.</p>
        <p>Despite existing research on the use of synthetic data for tasks such as detecting hate speech and
abusive language, there is a notable lack of focus on inclusive language. Specifically, in the context of
generating synthetic data for job advertisements, to our knowledge, only one study has been developed
[19]. This study presented SkillSkape, an open-source synthetic dataset of job advertisements designed
for skill-matching tasks rather than comprehensive language detection. Due to the gendered nature
of the Italian language and the lack of research on this topic, our study is highly novel and fills an
important gap in the field.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Fine-tuning LLMs using synthetic data</title>
        <p>The use of LLMs to generate synthetic data is growing; however, there is limited research on fine-tuning
LLMs specifically using synthetic data. One notable study [ 20] introduced Generalised Instruction
Tuning (GLAN), a method for generating synthetic data tailored to instruction tuning tasks. Extensive
experiments on the Mistral model showed that training with GLAN enabled the model to excel in several
domains, including mathematical reasoning, coding, academic exams, logical reasoning, and general
instruction. Remarkably, this was achieved without the use of task-specific training data for these
applications. The study most closely related to our methodology is [21], which fine-tuned a model using
a combination of seed and synthetic data. However, their focus difers, as their synthetic data relates to
therapy sessions. Their experimental results showed that the hybrid model consistently outperformed
others in specific vertical applications, achieving superior performance across all metrics. Additional
tests confirmed the hybrid model’s enhanced adaptability and contextual understanding in diferent
scenarios. These results highlight the potential of combining seed and synthetic data to improve LLMs’
robustness and contextual sensitivity, particularly for domain-specific and specialized applications.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section outlines our proposed approach for developing a framework driven by LLMs, designed to
generate synthetic data and evaluate it using diferent prompt strategies across diferent language models.
In this study, we used this framework to detect non-inclusive language in Italian job advertisements.
Our approach consists of four main parts: (i) creation of a synthetic dataset by combining real and
generated data, (ii) study of diferent prompting techniques and their impact on the performance of
pre-trained models, (iii) fine-tuning of a model on the synthetic data, (iv) and finally inference using
our fine-tuned model and other pre-trained models, on both synthetic and seed test datasets. Figure 1
shows an overview of the entire framework.</p>
      <p>Dataset Generation
Pipeline (Details in</p>
      <p>Figure 2)
Synthetic
Train Data
Synthetic
Test Data</p>
      <p>Used for
Used for
testing</p>
      <p>Finetuning</p>
      <p>Finetuned Model
Pretrained Models</p>
      <p>Inference</p>
      <p>Evaluation &amp;
Comparison
ed lse
tin d
ra o
e M
rP ed
on tn
tlsu i-eun
seR F&amp;</p>
      <sec id="sec-3-1">
        <title>3.1. Synthetic data generation</title>
        <p>Synthetic data generation is a key part of our methodology, enabling the creation of large datasets for
both fine-tuning and evaluation purposes. A major challenge in this process is the risk of repetition
within the dataset, where variation is limited to minor changes in the words that replace the placeholders.
To mitigate this, we implemented a novel strategy: pre-splitting the templates into separate training
and test sets before data generation. By separating the templates at the outset, we ensure that the
test set remains suficiently distinct from the training set, reducing the overlap of similar sentence
structures. This approach minimizes overfitting and improves the overall robustness and performance
of our fine-tuned model.</p>
        <p>As shown in Figure 2, we allocate 70% of the dataset for training and the remaining 30% for evaluation.
By using this partitioning strategy, we achieve a balance between optimizing the model’s learning
process and validating its generalization capabilities, thereby reducing the risk of data leakage. This
ensures a robust and reliable evaluation of the model’s performance on previously unseen data.</p>
        <p>The process begins with a seed dataset composed of 99 real-world job descriptions, manually collected
and annotated by domain experts in law and linguistics, which are then deconstructed into individual
sentences (No. 1). Sentences containing words that can be masked are identified and reused as templates
for dataset construction. Each sentence is given a binary label: TODO for sentences with maskable words
that require further processing, and INCLUSIVE for sentences that are inherently neutral and cannot
be discriminated. For example, a sentence like “Sarai [VERB] per un colloquio conoscitivo” is labeled
TODO, while a neutral sentence like “Descrizione del ruolo:” is categorized as INCLUSIVE (No. 2).</p>
        <p>Research has shown that text length plays a crucial role in shaping the performance and behavior of
LLMs, afecting aspects such as coherence, contextual understanding, and generation quality [ 22]. A
notable advantage of our synthetic data generation pipeline is the inclusion of a chunk merger module
(No. 3), which allows precise control over text length. This feature allows the creation of synthetic data
sets of diferent lengths, facilitating in-depth analysis of how text length afects LLM performance. We
create templates containing placeholders for job titles, work-related adjectives, and verbs, categorized by
gender.</p>
        <p>In the next step (No. 4), we replace the placeholders in the annotated dataset with a corresponding
word from a substitution list, where each vocabulary is labeled as neutral, masculine, or feminine.
This list is validated by human native annotators to check if the genders are correctly assigned. This
substitution process generates a large number of possible text combinations (Chunk Merger module).
The generated dataset can be found here.</p>
        <p>After generating synthetic test data, the next step is to generate labeled training data, which will
be used for fine-tuning the LLM. To achieve this, the Response Maker module (No. 5) is used to
assign labels to each sentence generated by the Data Generator module. Sentences containing only
one label masculine or feminine are classified as NONINCLUSIVE, while those containing only
neutral elements are classified as INCLUSIVE. For example, if the placeholder [JOB] in the template
sentence “[JOB] will play a key role (In Italian: svolgerà un ruolo chiave)” is replaced by insegnante
(teacher), the sentence is marked as INCLUSIVE, because insegnante is a gender-neutral term in
Italian. Conversely, replacing [JOB] with infermiere (nurse) results in a label of NONINCLUSIVE, as
infermiere is a masculine term that excludes female candidates. This systematic labeling approach,
where labels are directly tied to gender references (i.e., masculine/feminine forms as non-inclusive and
neutral forms as inclusive) and validated by human annotators, ensures the accurate identification of
inclusive language within the dataset.</p>
        <p>To fine-tune the LLM using chat template data, we use the Prompt Generator module along with the
labeled sentences from the previous step. By feeding these two inputs into the Chat Maker module (No.
7), we generate 10,424 rows of data, which will serve as the training data set for fine-tuning the LLM.
The details of the Prompt Generator module will be discussed in the next section.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Prompts generation based on diferent approaches</title>
        <p>One of the primary objectives of this study is to examine the impact of various prompting methods on
the responses generated by LLMs, with the ultimate goal of enhancing response quality. Recent research
has shown that even small variations in prompting - such as rephrasing or changing the structure
can have a significant impact on LLM performance [ 23]. For example, strategies such as encouraging
step-by-step reasoning or rephrasing objectives can lead to markedly diferent outcomes [ 24]. To
streamline this exploration, we are implementing an automated system for designing and managing
prompts using a modular prompting framework. This framework includes methods such as zero-shot
learning (ZSL), few-shot learning (FSL), and ZSL with chain-of-thought strategy, which we call (ZSLCOT)
prompting [25]. Using these prompting methods, we generated four diferent prompts: two for ZSL and
one for each of the other strategies. The results provide valuable insights into how to design prompts
to maximize LLM performance and response quality efectively.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. LLM fine-tuning</title>
        <p>We fine-tuned a pre-trained language model using the Unsloth library, known for its powerful
finetuning capabilities and accelerated processing speed [26]. For this study, we chose the Phi3-mini model,
a compact and cost-efective architecture with 29,884,416 trainable parameters. To further optimize
eficiency, we used a 4-bit quantized version of the pre-trained model, which significantly reduced
computational requirements and processing time without compromising performance. The fine-tuning</p>
        <p>Chunks Merger
(Text Length Control)</p>
        <p>Data
Generator</p>
        <p>Test</p>
        <p>Train
Prompt
Generator</p>
        <p>Output
2</p>
        <p>Templates Maker
(Split in sentences,</p>
        <p>Masking,...)
6
7</p>
        <p>Chat Maker
(Finetuning chat
template)
dataset consisted of 5,712 synthesized samples formatted as chat data containing questions, text, and
responses. These samples were tokenized using a custom chat template designed specifically for the
Phi-3 architecture to ensure compatibility and maximize the efectiveness of the fine-tuning process.</p>
        <p>The tuning used Parameter-Eficient Fine-Tuning (PEFT) with a LoFTQ configuration [ 27], which
allows tuning without changing all model parameters, making the process more resource eficient.
Using SFTTrainer from the Hugging Face library [28], a single epoch of 360 training steps was run on a
Tesla T4 GPU with 14.748 GB of RAM, achieving a speed of 0.28 iterations per second and completing
in 26.55 minutes. The resulting fine-tuned model was then uploaded to a model hub and made available
for further use.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Inference using fine-tuned and pre-trained models</title>
        <p>The inference process defined in this study refers to the procedure by which both pre-trained and
ifne-tuned language models generate responses based on input data. This process is applied consistently
to both the synthetic test dataset and the manually annotated real-world dataset, ensuring a consistent
evaluation framework. This dual evaluation approach minimizes the risk of overfitting by validating
model performance on diferent data sources. Answers are generated using the automatically generated
prompts, as described in Section 3.2.</p>
        <p>We compared the results obtained by our fine-tuned model with five diferent LLMs: LLaMA 3 7B
from Meta, Phi-3-mini 3.8B from Microsoft, Mistral 7B, Qwen 2 7B from Alibaba and Gemma 2 9B from
Google. The Phi-3-mini used in the comparison is diferent from the Phi-3-mini we have fine-tuned. Our
data collection yielded a substantial dataset of 10, 424 responses, for a total of 62, 544 data points. This
large dataset enables a thorough evaluation of each model, prompt, and inferred label for the examples.
The code developed in this study for prompt construction, data generation, and model finetuning is
available in this GitHub repository.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Comprehensive comparative analysis</title>
        <p>In this paper, the detection of non-inclusive language is framed as a binary classification problem, and
we evaluate the models from several perspectives. Our evaluation includes the following key aspects.
(i) The structure of the top responses generated by diferent LLMs, highlighting how well the output
matches the provided prompts. (ii) A comparison of the accuracy of diferent prompt strategies applied
to both synthetic and seed datasets, identifying the most efective models and prompts for each. (iii)
Having identified the best-performing prompt for each model, we evaluate model performance using
these optimized prompts. We use precision, recall, specificity , F1 score, and accuracy as the primary
metrics. However, due to the issue of data imbalance, often cited in the literature as a source of metric
skewness and potential bias, we also include balanced accuracy (bACC) as a more reliable metric [29].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Evaluation of the structure of the produced responses</title>
        <p>After generating responses using LLMs, the first step is to pre-process the output to ensure that the
responses are standardized. The primary objective is to extract the binary labels - INCLUSIVE and
NON-INCLUSIVE - from these responses. This pre-processing step is essential because the generated
output often contains extraneous information, such as reasoning and explanations, which must be
removed to produce a clean set of responses for accurate analysis.</p>
        <p>Figures 3a and 3b show the top 10 responses from GPT-4o-mini and fine-tuned Phi3 models as examples
evaluated on the same test dataset. While the pre-trained models such as GPT-4o-mini produce output
in the correct format, the fine-tuned model often introduces additional noise, such as explanations or
special characters. To handle the varying outputs, several preprocessing functions were implemented
to extract the desired labels. As a result, 99.90% of the responses were successfully transformed into the
desired format through automated preprocessing, with less than 0.1% of responses falling outside the
expected format.</p>
        <p>(a) Top 10 responses by pre-trained GPT-4o-mini.
(b) Top 10 responses by finetuned Phi-3.</p>
        <p>Another key observation is the class imbalance in the responses generated. The distribution of
responses generated by the LLMs for the synthetic dataset shows that INCLUSIVE labels appear twice
as often as NON INCLUSIVE labels. This imbalance supports the choice of balanced accuracy (bACC) as
a more appropriate evaluation metric.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation of the prompting strategies</title>
        <p>As discussed in the methodology section, the quality of the prompt is crucial for obtaining optimal
responses from both pre-trained and fine-tuned LLMs. To save time and computational resources, we
decided to use a single prompt strategy during the inference process. Before starting, we automatically
compared the diferent prompts generated by our pipeline based on their accuracy to select the best
one. The base ZSL prompt employed was: “You are an assistant who reads and analyzes sentences
from Italian job advertisements. Your task is to determine whether the text contains profession names
and whether the sentence refers to both genders. Respond only with the label ‘NON INCLUSIVE’ or
‘INCLUSIVE’.” For other prompting strategies, such as FSL, this basic ZSL prompt was extended by
incorporating illustrative examples or by specifying more detailed instructions. Table 1, shows the
accuracy of our four tested prompts for the synthetic and seed datasets, respectively.</p>
        <p>The results show that the FSL strategy generally performs better on synthetic data, achieving the
highest accuracy in five of the seven models tested. In particular, the fine-tuned model shows the best
performance of all strategies on the synthetic test data. On the seed dataset, our fine-tuned model also
shows superior accuracy compared to the others. Overall, both the FSL and ZSL methods are efective
on this dataset, with FSL outperforming three models and ZSL outperforming four of the seven models
tested.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation of the models’ performance</title>
        <p>Having identified the best-performing prompt for each model, we performed a comparative analysis
of the key performance metrics. The results presented in Table 2 show that our fine-tuned model
outperforms all metrics for the synthetic test data and three out of six metrics for the seed data. This
shows that training LLMs with synthetic data can be highly efective on seed data, highlighting the
potential of synthetic data for language detection tasks, even in complex contexts such as non-inclusive
language. A notable strength of almost all models is recall, which measures how efectively the model
identifies true positives - in this case, inclusive labels. In addition, Table 2 shows that GPT-4o-mini
performs well in terms of specificity and precision on seed data, demonstrating the potential of this
latest compact model from OpenAI for language detection tasks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we propose a comprehensive pipeline for generating and evaluating synthetic data using
LLMs. Our approach involves extracting sentences from seed data and systematically replacing job
titles or adjectives with alternatives that have diferent grammatical endings while adhering to Italian
language rules. The resulting synthetic dataset was used to fine-tune Phi3, a model from Microsoft. We
evaluated the performance of this fine-tuned model against six other pre-trained models. The results
demonstrate that the LLM fine-tuned on our synthetic data outperformed the others, achieving superior
performance on both the synthetic and seed test datasets. These findings indicate that our approach
has the potential to produce reliable synthetic data, a contribution that is particularly significant in
the context of Trustworthy AI. The quality and validity of training data are foundational to ensuring
fairness, transparency, and accountability in downstream applications. Future work will therefore
extend this methodology beyond conventional performance metrics to include systematic evaluations
of bias and fairness. In addition, we aim to fine-tune a broader set of LLMs, thereby strengthening the
generalizability and ethical robustness of our approach.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The work reported in this paper has been partly funded by the European Union - NextGenerationEU,
under the National Recovery and Resilience Plan (NRRP) Mission 4 Component 2 Investment Line 1.5:
Strengthening of research structures and creation of R&amp;D “innovation ecosystems”, set up of “territorial
leaders in R&amp;D”, within the project “MUSA - Multilayered Urban Sustainability Action” (contract no.
ECS 00000037).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-5 in order to: Grammar and spelling
check. After using these services, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.
[8] V. Bellandi, P. Ceravolo, E. Damiani, S. Maghool, M. Cesari, I. Basdekis, E. Iliadou, M. D. Marzan,
A methodology to engineering continuous monitoring of intrinsic capacity for elderly people,
Complex &amp; Intelligent Systems 8 (2022) 3953–3971.
[9] S. Maghool, N. Maleki-Jirsaraei, Epidemic spreading phenomena on a scale-free network with
time-varying transmission rate due to social responses, International Journal of Modern Physics
C 31 (2020) 2050148.
[10] P. Eigenschink, T. Reutterer, S. Vamosi, R. Vamosi, C. Sun, K. Kalcher, Deep generative models for
synthetic data: A survey, IEEE Access 11 (2023) 47304–47320.
[11] T. Sen, A. Das, M. Sen, Hatetinyllm: Hate speech detection using tiny large language models,
arXiv preprint arXiv:2405.01577 (2024).
[12] K. Guo, A. Hu, J. Mu, Z. Shi, Z. Zhao, N. Vishwamitra, H. Hu, An investigation of large language
models for real-world hate speech detection, in: 2023 International Conference on Machine
Learning and Applications (ICMLA), IEEE, 2023, pp. 1568–1573.
[13] Y. Lu, M. Shen, H. Wang, X. Wang, C. van Rechem, T. Fu, W. Wei, Machine learning for synthetic
data generation: a review, arXiv preprint arXiv:2302.04062 (2023).
[14] M. Goyal, Q. H. Mahmoud, A systematic review of synthetic data generation techniques using
generative ai, Electronics 13 (2024) 3509.
[15] Z. Li, H. Zhu, Z. Lu, M. Yin, Synthetic data generation with large language models for text
classification: Potential and limitations, arXiv preprint arXiv:2310.07849 (2023).
[16] C. Casula, E. Leonardelli, S. Tonelli, Don’t augment, rewrite? assessing abusive language detection
with synthetic data, in: Findings of the Association for Computational Linguistics ACL 2024, 2024,
pp. 11240–11247.
[17] G. Maheshwari, D. Ivanov, K. E. Haddad, Eficacy of synthetic data as a benchmark, arXiv preprint
arXiv:2409.11968 (2024).
[18] A. Khullar, D. Nkemelu, V. C. Nguyen, M. L. Best, Hate speech detection in limited data contexts
using synthetic data generation, ACM Journal on Computing and Sustainable Societies 2 (2024)
1–18.
[19] A. Magron, A. Dai, M. Zhang, S. Montariol, A. Bosselut, Jobskape: A framework for generating
synthetic job postings to enhance skill matching, arXiv preprint arXiv:2402.03242 (2024).
[20] H. Li, Q. Dong, Z. Tang, C. Wang, X. Zhang, H. Huang, S. Huang, X. Huang, Z. Huang, D. Zhang,
et al., Synthetic data (almost) from scratch: Generalized instruction tuning for language models,
arXiv preprint arXiv:2402.13064 (2024).
[21] A. Zhezherau, A. Yanockin, Hybrid training approaches for llms: Leveraging real and synthetic data
to enhance model performance in domain-specific applications, arXiv preprint arXiv:2410.09168
(2024).
[22] J.-T. Baillargeon, L. Lamontagne, Assessing the impact of sequence length learning on classification
tasks for transformer encoder models, in: The International FLAIRS Conference Proceedings,
volume 37, 2024.
[23] L. Beurer-Kellner, M. Fischer, M. Vechev, Prompting is programming: A query language for large
language models, Proceedings of the ACM on Programming Languages 7 (2023) 1946–1969.
[24] L. Reynolds, K. McDonell, Prompt programming for large language models: Beyond the few-shot
paradigm, in: Extended abstracts of the 2021 CHI conference on human factors in computing
systems, 2021, pp. 1–7.
[25] G.-G. Lee, E. Latif, X. Wu, N. Liu, X. Zhai, Applying large language models and chain-of-thought
for automatic scoring, Computers and Education: Artificial Intelligence 6 (2024) 100213.
[26] M. Labonne, Fine-tune llama 3.1 ultra-eficiently with unsloth, https://huggingface.co/blog/
mlabonne/sft-llama3, 2024. Online; accessed 2024-12-06.
[27] Y. Li, Y. Yu, C. Liang, P. He, N. Karampatziakis, W. Chen, T. Zhao, Loftq: Lora-fine-tuning-aware
quantization for large language models, arXiv preprint arXiv:2310.08659 (2023).
[28] S. Jain, Hugging face. in introduction to transformers for nlp: With the hugging face library and
models to solve problems, 2022.
[29] K. H. Brodersen, C. S. Ong, K. E. Stephan, J. M. Buhmann, The balanced accuracy and its posterior
distribution, in: 2010 20th international conference on pattern recognition, IEEE, 2010, pp. 3121–
3124.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Si</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Best practices and lessons learned on synthetic data</article-title>
          ,
          <source>in: First Conference on Language Modeling</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Founta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Djouvas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chatzakou</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Leontiadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Blackburn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Stringhini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vakali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sirivianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kourtellis</surname>
          </string-name>
          ,
          <article-title>Large scale crowdsourcing and characterization of twitter abusive behavior</article-title>
          ,
          <source>in: Proceedings of the international AAAI conference on web and social media</source>
          , volume
          <volume>12</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Masullo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. N.</given-names>
            <surname>Whipple</surname>
          </string-name>
          ,
          <article-title>The downsides of digital labor: Exploring the toll incivility takes on online comment moderators</article-title>
          ,
          <source>Computers in Human Behavior</source>
          <volume>107</volume>
          (
          <year>2020</year>
          )
          <fpage>106262</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Prakash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Boochoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brophy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Acuna</surname>
          </string-name>
          , E. Cameracci,
          <string-name>
            <given-names>G.</given-names>
            <surname>State</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Shapira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Birchfield</surname>
          </string-name>
          ,
          <article-title>Structured domain randomization: Bridging the reality gap by context-aware synthetic data</article-title>
          ,
          <source>in: 2019 International Conference on Robotics and Automation (ICRA)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>7249</fpage>
          -
          <lpage>7255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Q.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Machine learning with synthetic data-a new way to learn and classify the pictorial augmented reality markers in real-time</article-title>
          ,
          <source>in: 2020 35th International Conference on Image and Vision Computing New Zealand (IVCNZ)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Maghool</surname>
          </string-name>
          , E. Casiraghi,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ceravolo</surname>
          </string-name>
          ,
          <article-title>Enhancing fairness and accuracy in machine learning through similarity networks</article-title>
          ,
          <source>in: International conference on cooperative information systems</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kuang</surname>
          </string-name>
          , Domaindif:
          <article-title>Boost out-of-distribution generalization with synthetic data</article-title>
          ,
          <source>in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>5640</fpage>
          -
          <lpage>5644</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>