<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Shot Prompting: An Empirical Comparison for Short Answer Grading</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Joel Walsh</string-name>
          <email>jwalsh@ict.usc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siddarth Mamidanna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Nye</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Core</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Auerbach</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of California, Santa Cruz</institution>
          ,
          <addr-line>Santa Cruz, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Southern California - Institute for Creative Technologies</institution>
          ,
          <addr-line>Los Angeles, CA</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>3</fpage>
      <lpage>8</lpage>
      <abstract>
        <p>Research to improve Automated Short Answer Grading has recently focused on Large Language Models (LLMs) with prompt engineering and zero- or few-shot prompting to achieve the best results. This contrasts with the fine-tuning approach, which has historically required large-scale compute clusters inaccessible to most users. New closed-model services such as OpenAI's fine-tuning API promise results with as few as 100 examples, while open-weight methods such as quantized low-rank adapters (QLoRA) can be used to fine-tune models on consumer GPUs. We evaluate both of these fine-tuning methods, measuring their interaction with few-shot prompting for automated short answer grading (ASAG) with structured (JSON) outputs. Our results show that fine-tuning with small amounts of data has limited utility for Llama open-weight models, but that fine-tuning can outperform the few-shot baseline instruction-tuned LLM for OpenAI's closed models. While our evaluation set is limited, we find some evidence that the observed benefits of fine-tuning may depend on the domain subject matter. Lastly, we observed dramatic improvement with the Llama 3.1 8B-Instruct open-weight model by seeding the initial training examples with a significant amount of cheaply generated synthetic training data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The widespread adoption of Massive Open Online Courses (MOOCs) and Learning Management
Systems has created an increasing amount of learning assessments in online, machine-readable
formats. Due to the volume of assessment responses and the sometimes teacher-less nature of these
courses, Automated Short Answer Grading (ASAG) has become a robust research target. Most
attempts have used few-shot or zero-shot prompting with baseline LLMs. The practice of fine-tuning LLMs
offers some promise as a way to increase the performance of LLMs on specific tasks, but the cost
of compute and the data collection budget are significant constraints. This paper explores fine-tuning
under realistic constraints. Specifically, given a modest set of labeled examples (N=148) and
a single-GPU training envelope, when does parameter-efficient fine-tuning (QLoRA or OpenAI's
in-house tuning) yield statistically and practically meaningful gains over few-shot prompting for
multi-concept ASAG? We find that the answer to this question is somewhat nuanced, as some
methods of fine-tuning modestly improve ASAG in this context, while other methods do not.</p>
      <p>Rather than train the models on a specific content domain, this fine-tuning approach trains
across a varied set of domains and with different numbers of few-shot prompts, with the goal
of tuning a model that uses few-shot prompts more effectively for new content areas. For this
specific task, the model must assign binary labels for demonstrated understanding of user-defined
concepts, or learning objectives. The evaluation set consists of expert human-graded responses
from different domain areas. This particular type of structured JSON output is particularly
useful as agents and multi-agent systems have shown their usefulness at linking LLM outputs to
software decisions.</p>
      <sec id="sec-1-1">
        <title>1.1. Related Work</title>
        <p>
          Deep learning for automated text scoring has been the focus of numerous public dataset challenges,
but the approaches that have come from these competitions often focus on essay-length text
and often ignore issues such as textual entailment [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Since the widespread adoption of neural
methods in Natural Language Processing, there have been many attempts to utilize the latest
network architectures for Short Answer Grading (SAG). These approaches included using LSTMs
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], mixing sentence level and token level embeddings [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], and a variety of transformer-based
approaches [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          As LLMs began to perform well on benchmark tasks similar to ASAG, such as Question
Answering [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], several groups began to experiment with leveraging the no-shot and few-shot
capabilities of LLMs for ASAG [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Since the release of industry open-weight models
(e.g., Meta’s Llama family), there has been limited research comparing ASAG performance with
fine-tuned closed models (i.e., different sizes of OpenAI’s GPT-4) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          Supervised fine-tuning (SFT) and instruction-tuned models [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] like ChatGPT have no doubt
changed the world in the past few years, but training often requires a lot of gas (i.e., large
numbers of examples). In the case of SFT, optimal prompt-response pairs are the gas. Since the
proliferation of instruction tuned models, researchers have had success mitigating this need by
using existing human-annotated data to drive reinforcement learning [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and using high-quality
examples as a seed to create additional SFT data [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
      <p>Although our approach is similar to these empirical studies on ASAG, we expand the
comparisons in a few areas. In this study, we also fine-tune a large open-weight model for ASAG.
We feel that this is an important departure, as closed models do not disclose details about the
parameter count, hyperparameters, or techniques used in fine-tuning. As many organizations are
compute-limited, we focused on training quantized 4-bit models that can fit on one NVIDIA A40
GPU. We also study the effect of varying the number N of few-shot examples (i.e., N-shot) used
to prompt each model at test time. We focus on producing JSON-structured outputs, which
play an important role in agentic and LLM-based software. Lastly, we seed synthetic data using
a small number of examples created by real subject matter experts and learners. This approach
will enable low-resource educational organizations to create supervised fine-tuning data on a scale
that makes the technology viable.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <sec id="sec-2-1">
        <title>2.1. Fine-tuning and baseline models</title>
        <p>
          For the closed models, we used OpenAI’s GPT-4o-mini. OpenAI does not publish a parameter
count for this model, or key information about its architecture. OpenAI has offered
fine-tuning services since August 2023 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The documentation for the fine-tuning API
states that, "We typically see improvements from fine-tuning on 50 to 100 training examples"
[15]. However, OpenAI does not allow users to download the fine-tuned models, or any of its
models.
        </p>
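        <p>As a minimal sketch of this closed-model workflow (assuming the official openai Python SDK; the file name, JSONL layout, and model snapshot identifier below are illustrative rather than the exact configuration used in this study), a fine-tuning job can be launched as follows:</p>
        <preformat>
# Sketch: launch an OpenAI fine-tuning job on chat-formatted grading examples.
# Assumes OPENAI_API_KEY is set; file name and model snapshot are illustrative.
from openai import OpenAI

client = OpenAI()

# Each line of train.jsonl is one chat example, e.g.
# {"messages": [{"role": "user", "content": "...grading prompt..."},
#               {"role": "assistant", "content": "...JSON grade..."}]}
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # snapshot name is an assumption
)
print(job.id, job.status)
        </preformat>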
        <p>For the open models, we used Llama 3.1 8B-Instruct [16]. Like GPT-4o-mini, this model is
already instruction-tuned. The model has 8 billion parameters, which is the largest size
that can be fine-tuned on one NVIDIA A40 GPU with 48 GB of RAM using a Quantized Low
Rank Adapter (QLoRA) [17] approach. Each model was evaluated on a set of 148 distinct
short-answer labeled examples, which was effectively extended to 17,820 prompts (Sec. 2.5);
the full evaluation took approximately 40-50 hours to run on our university’s NVIDIA A40 GPUs. Due to
these compute limitations at test time, we were not able to run extensive ablation tests on model
architecture, or on hyperparameters like LoRA rank or number of hidden dimensions.</p>
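        <p>For reference, a QLoRA setup along the following lines fits the single-A40 budget described above. This is a sketch assuming the Hugging Face transformers, peft, and bitsandbytes libraries; the LoRA rank, alpha, and target modules shown are common defaults, not the exact hyperparameters used in this study.</p>
        <preformat>
# Sketch: 4-bit quantized Llama 3.1 8B-Instruct with a LoRA adapter (QLoRA).
# Rank, alpha, and target modules are illustrative defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.1-8B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # NF4 4-bit base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only adapter weights are trained
model.print_trainable_parameters()
        </preformat>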
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data collection</title>
        <p>Data was gathered with permission from the OpenTutor project [18]. OpenTutor allows
curriculum creators to author tutoring dialogues, where assessment of concept understanding is
conducted via multi-concept ASAG on student responses. Subject matter experts created the
dialogues, and also graded student responses in order to improve the system. The subject matter
of the training set ranged from technical subjects to which the models certainly had exposure
(biology and computer science) to subjects where the models likely had much less exposure, or
where best practices may have changed over time, such as military leadership or culture. Each
authored dialogue contains several concepts, or key points that the responses were supposed to
address. For example, a dialogue on invasive species contains three concepts:
1. Monitoring and collecting data on the invasive species is key to fighting them,
2. Rapid or a quick response is another key to fighting invasive species, and
3. AI and machine learning can be used to find and track the spread of invasive species faster.</p>
        <p>When a subject matter expert grades an answer, they provide either a 0 or a 1 label for each
concept: a 0 if the answer does not demonstrate knowledge of the concept and a 1 if it does.
Some answers contain correct statements about one concept without mentioning the others; this
is reflected in the grades as a skipped concept.</p>
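        <p>As an illustration, a single graded response for the invasive species dialogue above might be stored along the following lines; the field names are a simplification for exposition, not the actual OpenTutor record format.</p>
        <preformat>
# Illustrative grading record: 1 = concept demonstrated, 0 = not demonstrated,
# and a concept omitted from the grades is treated as skipped by the grader.
graded_response = {
    "answer_text": "Tracking where the species spreads helps us fight it faster.",
    "grades": {
        "concept_1": 1,   # monitoring / collecting data
        "concept_3": 1,   # AI and ML to track the spread
        # concept_2 (rapid response) skipped: not addressed by this answer
    },
}
        </preformat>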
        <p>All Short Answer questions and training examples were generated as a byproduct of user
interaction with the OpenTutor dialogue-based tutoring system. The corpus of graded responses
for this analysis comes from multiple studies conducted with adult learners, including both
students and Mechanical Turk workers. The MTurk workers were screened for appropriate study
behavior (e.g., enough time spent, expert human review of answers). Subject matter experts were
then asked to grade each answer, specifying whether or not it addressed the concepts dictated
by the expert.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Fine-Tuning Training Data</title>
        <p>Subject matter experts labeled only a portion of user responses from OpenTutor dialogues; only
labeled responses were used in the current research. The training data included all labeled
responses from 42 lessons, excluding three lessons that were set aside for evaluation. Lessons
without graded responses were omitted.</p>
        <p>Due to varying levels of use and grading by experts, each lesson had a different number of
graded responses (ranging from as few as 4 to over 100). To generate training data, multiple
subsets of graded responses were randomly selected from each lesson, with each subset containing
between 1 and 40 examples. Each subset was then individually paired with another distinct
graded response from the same lesson, forming multiple complete training examples. Each of
these pairings represented an n-shot example, where n denotes the number of responses in the
subset.</p>
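        <p>A minimal sketch of how one such n-shot training example can be assembled from a lesson's graded responses follows; the function and field names are hypothetical illustrations rather than the project's actual pipeline code.</p>
        <preformat>
# Sketch: build n-shot fine-tuning examples from one lesson's graded responses.
# Names (graded_responses, shots, target) are hypothetical.
import random

def make_nshot_example(graded_responses, n):
    """Pick n context responses plus one distinct target response to grade."""
    shots = random.sample(graded_responses, n)
    remaining = [r for r in graded_responses if r not in shots]
    target = random.choice(remaining)
    return {"shots": shots, "target": target}

def make_training_set(graded_responses, subsets_per_lesson=20, max_n=40):
    """Draw several subsets of size 1..40, capped by the data available."""
    examples = []
    for _ in range(subsets_per_lesson):
        n = random.randint(1, min(max_n, len(graded_responses) - 1))
        examples.append(make_nshot_example(graded_responses, n))
    return examples
        </preformat>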
        <p>Initially, the graded responses contained only binary labels (true/false) for each concept,
without confidence scores or justifications. To enhance the quality and informativeness of the
data, we used GPT-4o to generate justifications and assign confidence scores for each concept.
Specifically, each concept label within a response was individually passed to GPT-4o along
with the corresponding student response, generating a justification and confidence rating. To
better calibrate the confidence ratings provided by GPT-4o, we averaged the confidence scores
with those obtained from an existing logistic regression classifier previously trained on the same
dataset [19]. This resulted in a more accurate reflection of the true confidence. All samples (a
total of 148 training examples) were manually reviewed afterward to ensure their quality and
correctness.</p>
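        <p>The calibration step itself is a simple average of the two confidence sources; a sketch with illustrative names, where the second argument is the probability from the previously trained logistic regression classifier [19]:</p>
        <preformat>
# Sketch: average the LLM-reported confidence with the logistic regression
# probability to calibrate each concept-level confidence score.
def calibrate_confidence(gpt4o_confidence, logreg_probability):
    return 0.5 * (gpt4o_confidence + logreg_probability)

# e.g. GPT-4o reports 0.95 but the classifier assigns 0.61, giving 0.78
calibrated = calibrate_confidence(0.95, 0.61)
        </preformat>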
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Generating Synthetic data</title>
        <p>To extend this research beyond this small set of hand-labeled examples, we also investigated
the impact of training Llama 3.1 8B-Instruct using synthetic data. To generate synthetic data,
the training data was split into a 90:10 train-validation split. Google Gemini’s 1.5 Flash model
was used for generation. Our process involved randomly choosing 1-3 examples from either the
train or validation split, and appending them to a prompt to “generate one additional example from
an academic, corporate, or military training domain”. If the generated example was not valid JSON,
it was discarded. These one thousand synthetic examples augmented the existing training and validation
sets. The test set remained entirely real data. For 1.5 Flash, this process cost 45 cents (US),
making it a relatively cost-effective approach. Surprisingly, the latest Gemini thinking model, 2.5
Flash, could not create valid JSON with any regularity. A limitation of this study is that the
amount of synthetic data was chosen somewhat arbitrarily, as we did not have the test-time
compute resources to test models trained on several different amounts of synthetic data. Also,
we were unable to fine-tune GPT-4o-mini with synthetic data to see how synthetic data would affect
that model. This will be explored in future work.</p>
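        <p>A compact sketch of this generation-and-filtering loop is shown below (Appendix B gives the full algorithm); it assumes the google-generativeai Python SDK, and the prompt text, API key handling, and target count are placeholders rather than the exact values used.</p>
        <preformat>
# Sketch: generate synthetic grading examples with Gemini 1.5 Flash and keep
# only responses that parse as valid JSON. Prompt text is a placeholder.
import json
import random
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")        # placeholder key handling
model = genai.GenerativeModel("gemini-1.5-flash")

def generate_synthetic(seed_examples, target_count=1000):
    synthetic = []
    while len(synthetic) &lt; target_count:
        shots = random.sample(seed_examples, k=random.randint(1, 3))
        prompt = ("Generate one additional example from an academic, corporate, "
                  "or military training domain.\n"
                  + "\n".join(json.dumps(s) for s in shots))
        response = model.generate_content(prompt)
        try:
            synthetic.append(json.loads(response.text))  # keep valid JSON only
        except json.JSONDecodeError:
            continue                                     # discard invalid output
    return synthetic
        </preformat>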
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Generating Evaluation Data</title>
        <p>Evaluation data consisted of three gold lessons withheld from training, each covering a distinct
topic:</p>
        <p>1. Diode Breakdown (technical)
2. Reaching Out (leadership-focused military context)
3. Suicide Prevention (general risk factor awareness)</p>
          <p>Each gold lesson had manually graded responses to ensure high evaluation quality: Diode
Breakdown (50 responses), Reaching Out (50 responses), and Suicide Prevention (120 responses).</p>
          <p>To simulate realistic scenarios of incremental data annotation and model updating, we evaluated
the system using an N-shot strategy at different numbers of examples. For example, when N=5,
five graded responses were appended to the prompt as context examples. To ensure robust
evaluation despite limited available data, we structured the evaluation as follows (see Fig. 1):
1. Chunking. Responses from each gold lesson were grouped into test sets containing 10
responses each. This followed an approach similar to cross-validation, so for a lesson with
50 responses, there were 5 test sets.
2. Creating N-shot Examples. For each test set, multiple N-shot contexts were created by
drawing graded examples from the remaining responses within that lesson. For instance, if
a lesson contained 50 responses grouped into 5 test sets (each containing 10 items), each
test set would have 8 distinct N-shot context sets. The total number of N-shot sets was
calculated as (50 - 10) / 5 = 8 sets. This process was repeated for N=0, N=5, N=10, ...,
and while the N-shot contexts sometimes overlapped with each other, test items never
overlapped with any N-shot examples. Consequently, each test response was evaluated
multiple times, each time across different values of N and N-shot contexts.
3. Multiple Trials for Robustness. When N &gt; 0, we generated multiple evaluation trials
(typically 10 trials per set) to ensure stability and reliability in our assessment. For
each trial, we reshuffled the selected N-shot examples, not just altering their order but
also simulating realistic variability by randomly ablating certain labeled concepts. This
meant that, across trials, some contexts were richly annotated, while others were sparsely
annotated, mimicking real-world annotation inconsistencies.
4. Comprehensive Evaluation. Every response in the test sets was evaluated across all trials
and corresponding N-shot contexts. This extensive evaluation approach ensured a thorough
and robust assessment, enabling confident observations and reliable conclusions. As we
mentioned earlier, the N-shot prompts also made evaluation somewhat slow (approximately
50 hours) for the open-weight models, as self-attention inference scales quadratically with
context window length.</p>
          <p>Ultimately, this expanded our final test size across all lessons to n=17,820 examples,
providing a rigorous and thorough evaluation procedure.</p>
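          <p>A sketch of the chunking and N-shot context construction described above follows; the helper names are illustrative, and the real pipeline additionally performs the concept-ablation and repeated-trial steps.</p>
          <preformat>
# Sketch: split a lesson's graded responses into 10-item test chunks and draw
# N-shot contexts only from the remaining responses. Names are illustrative.
import random

def make_eval_sets(responses, chunk_size=10, n_shot=5, contexts_per_chunk=8):
    random.shuffle(responses)
    chunks = [responses[i:i + chunk_size]
              for i in range(0, len(responses), chunk_size)]
    eval_sets = []
    for chunk in chunks:
        # test items never overlap with the N-shot pool for their chunk
        pool = [r for r in responses if r not in chunk]
        for _ in range(contexts_per_chunk):
            context = random.sample(pool, n_shot) if n_shot else []
            eval_sets.append({"test_items": chunk, "shots": context})
    return eval_sets
          </preformat>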
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>Our analysis finds that, for this particular type of structured output, fine-tuning GPT-4o-mini
on just ∼ 150 examples was impactful, resulting in an F1 increase from 0.68 to 0.73. The largest
improvement occurs in the Reaching Out and Suicide Prevention domains, which are based on
highly specific content domains and whose evaluations require using labeled examples rather
than prior knowledge. The objective of fine-tuning the model on a series of few-shot examples is
to allow it to adapt far more quickly to the grading criteria and knowledge in the few-shot examples,
as opposed to relying on prior knowledge and idiosyncratic judgment. This alignment is not just
about boosting raw scores; it also makes the model’s output more interpretable and consistent
with human feedback loops. By internalizing each specific lesson’s grading framework, the
fine-tuned model better generalizes to new topics with lower shot counts, delivering reliable,
rubric-compliant assessments.</p>
      <p>Additionally, the fine-tuned GPT-4o-mini model’s F1 score improved as the number of N-shots
increased. The F1 score graph for the Diodes lesson is striking (Fig. 3), as it shows that while
the base model had a higher initial F1, it did not improve with increasing N-shots. However, the
F1 for the fine-tuned model started significantly lower than the base and progressively improved
with the number of N-shots, eventually matching the performance of the base model. The
overall F1 score comparison (Fig. 2) also shows the fine-tuned model adapting to new
data more effectively than the base model, as its F1 score increases at a slightly higher rate
with the number of N-shots.</p>
      <p>On the other hand, fine-tuning Llama 3.1 8B-Instruct using QLoRA on the initial data was
not as successful. The baseline and 1-epoch fine-tuned models sometimes failed to stop generating,
entered repetitive loops, and mainly predicted one class (False). While
the model did show some signs of life at epoch 6, the F1 score was still only 0.408. This, however,
still represents a large gain over the baseline model (see Table 1). The fine-tuning (for 6
epochs) ultimately provided a significant increase in performance over the base model. At 9
epochs, the model’s performance began to degrade, suggesting that the threshold for overfitting
had been reached. Adding in the synthetic data caused the most dramatic improvement. The
best model in both instances was trained for 6 epochs, and the synthetic data boosted the F1 score
from 0.408 to 0.653, nearly matching the baseline GPT-4o-mini model. This suggests that synthetic
data can be highly effective, and also that this family of models requires much more data than 50-100
examples in order to grasp structured ASAG tasks like this.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion and Future Research</title>
      <p>While the Llama QLoRA fine-tuning did not work as well as the GPT-4o-mini fine-tuning, there
are some major advantages to tuning and serving open-weight models. For one, OpenAI will
never let a user download the model weights or architecture. Any system built on OpenAI’s
technology will have vendor lock-in and a dependence on an internet connection. Llama 3.1
8B-Instruct with synthetic data was close to GPT-4o-mini performance and trending enough in
a positive direction to justify serving this model as a viable alternative.</p>
      <p>Recent research into structured LLM outputs has shown that constraining LLMs to structured
outputs can have a deleterious effect on model reasoning and domain knowledge capabilities [20].
Future research will explore using an agentic process, whereby a model is prompted to provide a
free-form justification and a grade, and then an agent is prompted to parse the information into a
JSON using some combination of handwritten rules or additional LLM calls. Building learning
engineering systems with structured outputs is a constant battle with the noisy output of LLMs.
Agents built on handwritten rules or additional LLM calls can also serve as a means to tweak
"almost there" responses that might be off by a data type, or a slightly incorrect key value.</p>
      <p>[Figure: Precision vs. N-shots for the Reaching Out (Leadership) lesson.]</p>
      <p>Other levers for improvement include prompt complexity and the size of both synthetic and
real training datasets. The real dataset required content creation, human short answers, and a
round of subject matter expert grading, which is time consuming and expensive to collect. Rather
than collecting thousands of real training examples at great cost or generating synthetic
data, a potential alternative could be to use the DeepSeek approach of “cold start” supervised
fine-tuning followed by Group Relative Policy Optimization (GRPO) reinforcement learning to
further refine the model [21]. This process is thought to be much more sample efficient than
supervised fine-tuning.</p>
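      <p>Returning to the agentic repair idea above, one concrete illustration of a handwritten-rule pass that coerces "almost there" outputs before (or instead of) an additional LLM call might look as follows; this is a sketch with hypothetical key names, not a component of the current system.</p>
      <preformat>
# Sketch: rule-based repair of near-miss model output, e.g. a boolean returned
# as a string or a slightly wrong key name. Key names are hypothetical.
import json

def repair_grade(raw_text):
    data = json.loads(raw_text)
    concepts = data.get("answer", {}).get("concepts", {})
    for concept in concepts.values():
        # Rule 1: coerce "true"/"false" strings to booleans.
        if isinstance(concept.get("is_known"), str):
            concept["is_known"] = concept["is_known"].strip().lower() == "true"
        # Rule 2: accept a common near-miss key and rename it.
        if "confidence_score" in concept and "confidence" not in concept:
            concept["confidence"] = float(concept.pop("confidence_score"))
    return data
      </preformat>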
      <p>In terms of training with synthetic data, this process can be optimized substantially. The prompt
could be optimized, the amount of data could be increased, and much more advanced models
than Gemini 1.5 Flash now exist. However, the iterative nature of data annotation could lend
itself quite well to simulation by a multi-agent system, where teacher and student agents with
different profiles, prior knowledge, and directives work in conjunction to create new examples.
This could create higher-quality synthetic examples.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This study explored realistic approaches to supervised fine-tuning of LLMs for automated short
answer grading tasks. Our findings have shown that fine-tuning large commercial models (such
as GPT-4o-mini) with relatively small amounts of data enhances performance over baseline
few-shot prompting methods, even in highly context-dependent domains. Meanwhile, QLoRA
fine-tuning of Llama 3.1 8B models initially showed poor performance for this particular task, but
additional infusions of synthetic data made them competitive with GPT models.</p>
      <p>Overall, the effectiveness of this synthetic data-infused training approach holds great promise
for developing reliable ASAG systems that can function without relying on large corporations,
server-scale GPUs, or the presence of an Internet connection. Continued exploration into synthetic
data generation and hybrid agentic pipelines suggests further performance gains. Collectively,
these approaches can enable resource-limited educational organizations to deploy and use strong
ASAG technologies, helping democratize access to reliable AI-driven assessment methods.</p>
    </sec>
    <sec id="sec-6">
      <title>Appendix A. Prompt template, N=30</title>
      <p>{</p>
      <sec id="sec-6-1">
        <title>Listing 1: Abridged sample of 30-shot prompt</title>
        <p>The user provided an answer to a tutoring question. The answer is provided in JSON
format. You are a tutor who is evaluating if the answer is sufficient to show
that the user knows a one or more "concepts" which will be labeled "concept_1"
to "concept_N". For each concept you evaluate, you must also express how
confident you are in your evaluation of how well the answer shows knowledge of
each concept. You will also provide a brief justification.

These are the concepts for this lesson.
{
"concept_1": "They're much less likely to commit suicide now or in the future,
because suicidal urges last less than an hour.",
"concept_2": "The person is still at risk compared to other people, because they
still have suicidal thoughts."
}
This is my answer provided in JSON to be evaluated.
{
"answer_text": "it lowers"
}
Please respond in the following format:
{
"answer": {
"answer_text": "string // State the text of the particular answer being
classified.",
"concepts": {
"concept_N": {
"is_known": "string // true or false. If the input answer implies that the
concept is known, the classification should be true. Otherwise it should
be false.",
"confidence": "float // A 0 to 1 score indicating certainty that a
classification is correct. Confidence scores closer to 1 represent
higher certainty, and confidence scores closer to 0 represent lower
certainty.",
"justification": "string // Why you believe the user answer is or is not
sufficient to determine if they know the concepts."
}
}
}
}
Only respond with the JSON output in the exact format of the template and no other
words or symbols. The output must be valid JSON. Check that the output is valid
JSON.

Here are some examples that have already been labeled (although they may not be
fully labeled). They are presented in JSON format, where the answer is given,
followed by a concept and a true or false label. Consider these to be ground
truth examples.
"answer_1": {
"answer": "It'll deter them from wanting to commit suicide.",
"concept_1": "true"
},
"answer_2": {
"answer": "they're still more likely as they have suicidal ideation",
"concept_1": "false"
},
.
.
.
"answer_30": {
"answer": "they may not have anything else to use",
"concept_1": "false"
}
}</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>B. Synthetic Content Generation Algorithm</title>
      <p>Algorithm 1 Synthetic Content Generation Algorithm
1: function GeneratePrompt(few_shot_examples)
2:   prompt ← "You are an expert AI assistant..."
3:   prompt ← prompt + "Your goal is to create new examples..."
4:   prompt ← prompt + "Please generate ONE new example..."
5:   prompt ← prompt + "The content of the new example should belong to..."
6:   prompt ← prompt + "Ensure the generated examples are distinct..."
7:   prompt ← prompt + "Here are some examples to learn from..."
8:   for each example in few_shot_examples do
9:     prompt ← prompt + "\n— Example —\n" + ToJson(example)
10:  prompt ← prompt + "\nNow, generate a new, unique example..."
11:  return prompt
12: function GenerateFromModel(model, prompt)
13:   response ← model.generate(prompt, GenerationConfig, SafetySettings)
14:   json_object ← ExtractJson(response.text)
15:   return json_object
16: procedure Main
17:   originalExamples ← LoadExamples(InputFile)
18:   WriteExamples(OutputFile, originalExamples)
19:   generatedCount ← 0
20:   failedAttempts ← 0
21:   model ← new GenerativeModel(ModelName)
22:   while generatedCount &lt; TargetCount and failedAttempts &lt; MaxFailedAttempts do
23:     k ← RandomInt(1, 3)
24:     fewShotSamples ← RandomSample(originalExamples, k)
25:     promptText ← GeneratePrompt(fewShotSamples)
26:     newExample ← GenerateFromModel(model, promptText)
27:     if newExample is valid then
28:       AppendExample(OutputFile, newExample)
29:       generatedCount ← generatedCount + 1
30:       failedAttempts ← 0
31:     else
32:       failedAttempts ← failedAttempts + 1
33:   Print("Process finished. Total generated: " + generatedCount)</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>The project or effort depicted was or is sponsored by the U.S. Government under contract
numbers W912CG-24-D-0001 and W911NF-14-D-0005, as part of the USC ICT UARC and the
AI Research Center of Excellence in Education (AIRCOEE). The content of the information does
not necessarily reflect the position or the policy of the Government, and no official endorsement
should be inferred.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Alikaniotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yannakoudakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <article-title>Automatic text scoring using neural networks</article-title>
          ,
          <source>in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>715</fpage>
          -
          <lpage>725</lpage>
          . URL: https://www.aclweb.org/anthology/P16-1068.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <article-title>Earth mover's distance pooling over siamese LSTMs for automatic short answer grading</article-title>
          ,
          <source>in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2046</fpage>
          -
          <lpage>2052</lpage>
          . URL: https://aclanthology.org/D17-1217.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. I.</given-names>
            <surname>Dhamecha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marvaniya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sindhgatta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sengupta</surname>
          </string-name>
          ,
          <article-title>Sentence level or token level features for automatic short answer grading?: Use both</article-title>
          ,
          <source>in: Natural Language Processing and Information Systems</source>
          , volume
          <volume>10859</volume>
          of Lecture Notes in Computer Science, Springer International Publishing, Cham,
          <year>2018</year>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>47</lpage>
          . doi:10.1007/978-3-319-91947-8_3.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Automatic short answer grading via multiway attention networks</article-title>
          , arXiv preprint arXiv:1909.10166 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1909.10166.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Waters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Grimaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Baraniuk</surname>
          </string-name>
          ,
          <article-title>A meta-learning augmented bidirectional transformer model for automatic short answer grading</article-title>
          ,
          <source>in: Proceedings of the 12th International Conference on Educational Data Mining (EDM</source>
          <year>2019</year>
          ),
          <source>International Educational Data Mining Society (IEDMS)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>156</fpage>
          -
          <lpage>163</lpage>
          . URL: https://educationaldatamining.org/files/conferences/EDM2019/papers/paper_156.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Herbert-Voss</surname>
            , G. Krueger,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Henighan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            , E. Sigler,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , arXiv preprint arXiv:2005.14165 (
          <year>2020</year>
          ). URL: http://arxiv.org/abs/2005.14165.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.-Y.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <article-title>Short answer grading using one-shot prompting and text similarity scoring model</article-title>
          ,
          <source>arXiv preprint arXiv:2305.18638</source>
          (
          <year>2023</year>
          ). URL: http://arxiv.org/abs/2305.18638.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Niklaus</surname>
          </string-name>
          ,
          <article-title>Towards LLM-based autograding for short textual answers</article-title>
          ,
          <source>arXiv preprint arXiv:2309.11508</source>
          (
          <year>2024</year>
          ). URL: http://arxiv.org/abs/2309.11508.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ivanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Handschuh</surname>
          </string-name>
          ,
          <article-title>Evaluating LLMs' performance at automatic short-answer grading</article-title>
          ,
          <source>in: Proceedings of the Workshop on Automatic Evaluation of Learning and Assessment Content (EvalLAC</source>
          <year>2024</year>
          ), volume
          <volume>3772</volume>
          <source>of CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Recife, Brazil,
          <year>2024</year>
          , pp.
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          . URL: https://ceur-ws.org/Vol-3772/paper10short.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chamieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zesch</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. Giebermann,</surname>
          </string-name>
          <article-title>LLMs in short answer scoring: Limitations and promise of zero-shot and few-shot approaches</article-title>
          ,
          <source>in: Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA</source>
          <year>2024</year>
          ),
          <article-title>Association for Computational Linguistics</article-title>
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>309</fpage>
          -
          <lpage>315</lpage>
          . URL: https://aclanthology.org/2024.bea-1.28.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Finetuned language models are zero-shot learners</article-title>
          ,
          <source>arXiv preprint arXiv:2109.01652</source>
          (
          <year>2022</year>
          ). URL: http://arxiv.org/abs/2109.01652.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <article-title>Self-play fine-tuning converts weak language models to strong language models</article-title>
          ,
          <source>arXiv preprint arXiv:2401.01335</source>
          (
          <year>2024</year>
          ). URL: http: //arxiv.org/abs/2401.01335.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Asawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Q.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hanin</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Stoica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Zaharia, BARE: Leveraging base language models for few-shot synthetic data generation</article-title>
          ,
          <source>arXiv preprint arXiv:2502.01697</source>
          (
          <year>2025</year>
          ). URL: http://arxiv.org/abs/2502.01697.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>Introducing improvements to the fine-tuning API and expanding our custom models program</article-title>
          , https://openai.com/index/introducing-improvements-to-the-fine-tuning-api-and-expanding-our-custom-models-program/,
          <year>2023</year>
          . Accessed on 2025-07-08.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] OpenAI, OpenAI platform, https://platform.openai.com. Accessed on 2025-07-08.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, ..., Z. Ma, The Llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024). URL: http://arxiv.org/abs/2407.21783.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, arXiv preprint arXiv:2305.14314 (2023). URL: http://arxiv.org/abs/2305.14314.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] B. D. Nye, R. Sanghrajka, V. Bodhwani, M. Acob, D. Budziwojski, K. Carr, W. R. Swartout, OpenTutor: Designing a rapid-authored tutor that learns as you grade, in: The International FLAIRS Conference Proceedings, volume 34, 2021. doi:10.32473/flairs.v34i1.128576.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] B. D. Nye, R. Sanghrajka, V. Bodhwani, M. Acob, D. Budziwojski, K. Carr, W. R. Swartout, OpenTutor: Designing a rapid-authored tutor that learns as you grade, in: The International FLAIRS Conference Proceedings, volume 34, 2021. doi:10.32473/flairs.v34i1.128576.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Z. R. Tam, C.-K. Wu, Y.-L. Tsai, C.-Y. Lin, H.-y. Lee, Y.-N. Chen, Let me speak freely? A study on the impact of format restrictions on performance of large language models, arXiv preprint arXiv:2408.02442 (2024). URL: http://arxiv.org/abs/2408.02442.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, et al., DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, arXiv preprint arXiv:2501.12948 (2025). URL: http://arxiv.org/abs/2501.12948.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>