<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Sixth Workshop on Intelligent Textbooks, July</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>LLM-powered Framework for Automatic Generation of Metacognitive Scaffolding Cues for Introductory Programming in Higher Education</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anushka Durg</string-name>
          <email>adurg@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Can Kultur</string-name>
          <email>ckultur@cs.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adam Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaromir Savelka</string-name>
          <email>jsavelka@cs.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Carnegie Mellon University</institution>
          ,
          <addr-line>5000 Forbes Avenue, Pittsburgh, PA 15213</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>26</volume>
      <issue>2025</issue>
      <abstract>
<p>Scaffolding cues are instructional prompts designed to guide students through structured reasoning phases - understanding a task, planning a solution, and reflecting on their work - in order to support deeper learning during programming exercises. In this paper, we evaluate the capabilities of large language models (LLMs) to generate scaffolding cues for an introductory Python programming course in higher education. We used GPT-4 and TinyLlama to generate 126 scaffolding cues focused on understanding, planning, and reflecting for 14 programming exercises. We found that LLMs can reliably generate scaffolding cues that align with their intended reasoning type (Understand, Plan, Reflect), with expert annotators confirming the correctness of the reasoning type in 92% of cases overall. Further, we found that LLM-generated cues generally met instructional quality standards for clarity and relevance, though cues involving deeper reasoning (e.g., reflective depth) showed more variation and were harder to evaluate consistently by multiple annotators. The cues generated by GPT-4, in particular, were more likely to meet the quality criteria compared to TinyLlama, especially for Reflect-type cues. Overall, our findings suggest that LLMs could generate scaffolding cues that are clear, relevant, and useful for instruction. This could help reduce the time instructors spend authoring scaffolding cues or potentially enable personalized cues tailored to the needs of each individual student.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Finding the right answer often begins with asking the right question, especially for students learning
to solve problems independently. Expert instructors can guide students toward productive lines of
inquiry, but their time is scarce compared to the scale of the curriculum and the number of learners.
Automatically generated scaffolding cues may help students better understand problems, plan solutions,
and reflect on completed work. This may in turn lead to better learning outcomes. Scafolding cues that
encourage students to engage in structured reasoning about a learning activity—such as asking them to
explain their plan, reflect on what worked, or connect the task to underlying logic—are well-established
pedagogical techniques in computing education. Thinking before coding has been shown to reduce
confusion and increase student success in logic-rich domains, while post-coding reflection supports
transfer and long-term understanding [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. In online platforms and other interactive environments,
such reasoning can be encouraged by inclusion of scaffolding cues within problem statements to guide
students toward deeper and more productive engagement [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Our study uses Sail(), an online
platform that functions as an interactive textbook—delivering curated instructional content alongside
embedded exercises and cues. Similar to online textbooks like OLI1 and Runestone,2 Sail() serves as a
real-world instantiation of the smart textbook vision.
      </p>
      <p>However, manually authoring multiple scaffolding cues for every programming task may be
prohibitively expensive. Instructors must not only write the cues for many tasks, but also tailor them across
multiple reasoning types - typically centered around understanding, planning, and reflecting. The high
cost of such effort has resulted in limited adoption of this useful approach at scale. Recent advances in
large language models (LLMs) suggest that AI systems may be capable of generating such scaffolding
cues automatically.</p>
      <p>
        In this paper, we present a human-in-the-loop framework to automatically generate cues intended
to scaffold reasoning in coding tasks. We focus on three reasoning-support types, inspired by the
structure of Polya’s problem-solving method: “Understand” (interpreting the task), “Plan” (deciding how
to proceed), and “Reflect” (evaluating or revising one’s thinking) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Our contributions are twofold:
• We demonstrate that LLMs are capable of generating scaffolding cues of the intended
reasoning type. In a blinded classification experiment, a human expert accurately identified
the type of most cues (e.g., Understand vs. Reflect) using only rubric definitions, indicating that
LLM-generated reasoning structures are distinct and recognizable.
• We evaluate LLM-generated scaffolding cues for programming instruction, focused
specifically on reasoning types (Understand, Plan, Reflect). Using a structured 9-point rubric and
three reviewers, we show that GPT-4 can reliably produce cues that are clear, relevant, and
pedagogically aligned across reasoning types.
      </p>
      <p>To guide our investigations, we pose the following two research questions:
• RQ1: To what degree can LLMs generate scaffolding cues that are semantically recognizable as
“Understand,” “Plan,” or “Reflect”?
• RQ2: How well do LLM-generated scaffolding cues meet quality standards in terms of clarity,
relevance, and reasoning depth (defined in Table 2)?</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Reasoning Before and After Programming Tasks</title>
        <p>
          Many studies in computing education emphasize how important it is for students to reason before,
during, and after coding. Planning, which takes place in natural language, is essential for coding, which uses
formal languages [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Explaining code and writing code are processes of transformation [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] that can be
aided by careful reflections [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. VanLehn’s analysis showed that self-explanation in tutoring systems can
lead to meaningful learning gains in structured problem domains [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Grover and Pea also highlighted
how reflection and debugging help students build a deeper understanding of how programs work
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Koedinger et al. found that encouraging students to think through their approach before coding
improved engagement and performance in large-scale online courses [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          These approaches can be implemented in online platforms by including scaffolding cues to help
students engage in the desired types of reasoning. Studies by Roll et al. and Kinnebrew et al. showed
that when students are asked to explain their thinking at different points in time (i.e., before, during,
and after coding), they demonstrate better self-regulation and problem-solving skills [
          <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
          ]. Based on
these findings, we focus our work on three student reasoning types commonly recognized in curricular
design: Understand, Plan, and Reflect.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Large Language Models for Generating Educational Content</title>
        <p>
          Since 2022, there have been considerable advancements in the educational applications of
LLMs. The technology has been used to generate a wide variety of natural language artifacts in general
educational contexts as well as in computing education. Leiker et al. developed a whole course utilizing
an LLM while keeping human experts in the loop to ensure high quality of the generated content
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Multiple research groups explored LLMs’ potential to support learning by explaining a given
code snippet [
          <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
          ]. Researchers have also shown that LLMs can be used to create programming
exercises [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ]. There is a large body of work focused on LLMs’ capabilities in generating model
solutions to programming tasks [
          <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
          ]. There is ongoing work exploring the possibilities of
generating real-time feedback or answers to student support requests in computing education [
          <xref ref-type="bibr" rid="ref19 ref20 ref21 ref22 ref23 ref24">19, 20,
21, 22, 23, 24</xref>
          ]. Other examples include personalized Parsons puzzles [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], MCQs [
          <xref ref-type="bibr" rid="ref26 ref27">26, 27</xref>
          ], or learning
objectives [
          <xref ref-type="bibr" rid="ref28 ref29">28, 29</xref>
          ].
        </p>
        <p>There is comparatively little work on using LLMs to generate scaffolding cues that are explicitly
categorized by reasoning type. Most studies focus on correctness, explanation, or answer quality,
rather than metacognitive structure. Our work provides a novel contribution by assessing whether
LLMs can reliably generate scaffolding cues for different types of reasoning (i.e., understanding the
task, planning the solution, and reflecting on performance).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>To support the experiments in this paper (see Section 4), we assembled a dataset of 14 coding exercises
from the Practical Programming with Python course delivered on the Sail() platform.3 This course has
an interactive introductory Python programming curriculum that emphasizes practical data processing
applications. The course is structured into eight instructional units, each containing auto-graded
projects, quizzes, and online discussion. We focus on two specific units in this study: Unit 2 (Control
Flow and String Manipulation) and Unit 3 (Data Structures).</p>
      <p>
        We chose these two units for a key methodological reason: both include programming exercises in
which students implement functions directly in a simple in-browser environment [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. These exercises
are relatively small and focused, which makes them suitable for insertion of scaffolding cues and
evaluation of the cues’ quality.
      </p>
      <p>Unit 2 includes three tasks:
• Color Game: Implement time_color() and is_correct() functions to compute game logic
from player input and track correctness.
• Tabular Reports: Use iteration and formatting to align numerical output in a structured report.
• Directory Contents Analysis: Simulate parsing and analyze file metadata using loops and
string conditions.</p>
      <p>Unit 3 includes one task:
• Container Type Agnostic API: Write reusable logic to update, test, or convert among dicts,
lists, tuples, and sets using unified condition checks (see the sketch after this list).</p>
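      <p>To make the Unit 3 task concrete, the sketch below illustrates the kind of container-type-agnostic logic the exercise asks for; the function name and exact behavior are our own illustration, not the course’s actual starter code.</p>
      <preformat>
# Hypothetical sketch of container-type-agnostic update logic
# (illustrative names; not the actual course starter code).
def add_item(container, item):
    """Add item to a dict, list, tuple, or set, returning the updated container."""
    if isinstance(container, dict):
        key, value = item           # dict items arrive as (key, value) pairs
        container[key] = value
        return container
    if isinstance(container, list):
        container.append(item)
        return container
    if isinstance(container, tuple):
        return container + (item,)  # tuples are immutable, so build a new one
    if isinstance(container, set):
        container.add(item)
        return container
    raise TypeError(f"Unsupported container type: {type(container).__name__}")
      </preformat>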
      <p>Each task consists of several coding exercises which provide ideal focused points for inserting scaffolding
cues. For example, in the Color Game task, one coding exercise asking students to “Write a function that
formats integers with commas between every three digits, e.g., 1234567 → ‘1,234,567’” is aligned with
the scaffolding cue “This function isn’t about math—it’s about display. Why do we care about making
large numbers easier to read, and what part of the logic must handle digit grouping?”. A sample coding
exercise from the Color Game is shown in Figure 1 in the form it is displayed on the online learning
platform. To illustrate the kinds of scaffolding cues generated, Table 1 shows the full set of Understand,
Plan, and Reflect prompts generated for the time_color() function, which maps time remaining to a
visual color display.</p>
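      <p>As an illustration of the exercise quoted above, a minimal solution sketch might look as follows; the course’s actual reference solution may differ, and Python’s built-in format specifier f"{n:,}" performs the same grouping, while the explicit loop makes the digit-grouping logic visible.</p>
      <preformat>
def format_with_commas(n):
    """Format an integer with commas between every three digits."""
    digits = str(abs(n))
    groups = []
    # Walk the digits right to left, slicing off three at a time.
    while len(digits) > 3:
        groups.append(digits[-3:])
        digits = digits[:-3]
    groups.append(digits)
    sign = "-" if n &lt; 0 else ""
    return sign + ",".join(reversed(groups))

assert format_with_commas(1234567) == "1,234,567"  # the example from the text
      </preformat>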
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In this section, we present our two-part evaluation framework, organized around the two research
questions. RQ1 concerns whether LLM-generated scafolding cues are semantically recognizable as a
specific reasoning type (i.e., Understand, Plan, Reflect). RQ2 evaluates the instructional quality of these
scaffolding cues using a structured rubric.</p>
      <sec id="sec-4-1">
        <title>Scaffolding Cue Generation</title>
        <p>The scaffolding cue generation process is schematically depicted in Figure 2. Each exercise includes
embedded “TODO” comments which are ideal anchors for inserting scaffolding cues. For each exercise,
we generated scaffolding cues targeting the three reasoning types:
• Understand: Helps the student interpret the task and underlying logic.
• Plan: Breaks the solution into steps, decisions, or code structure.</p>
        <p>• Reflect: Encourages the student to analyze, revise, or explain their thinking.</p>
        <p>The prompt submitted to an LLM contained:
• A brief task description
• The code context around the exercise
• The target reasoning type (U/P/R)
• A rubric-aligned scaffolding cue template (few-shot for GPT-4)
We used two models in our experiments:
• GPT-4 (gpt-4-0613) via the OpenAI API
• TinyLlama (1.1B) running locally on a quantized inference engine
GPT-4 was prompted using a few-shot format that included exemplar scaffolding cues for each reasoning
type. In contrast, TinyLlama was prompted using a simplified zero-shot format with only the task
description, reasoning type, and instruction, as few-shot examples significantly degraded its output
quality due to its limited context window size.</p>
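        <p>To make the prompt structure concrete, the following sketch shows how such a request might be assembled and submitted for the GPT-4 condition; the template wording, helper names, and exemplar handling are illustrative assumptions, not our verbatim prompts.</p>
        <preformat>
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_cues(task_description, code_context, reasoning_type, exemplars, n=10):
    """Request n candidate scaffolding cues of one reasoning type (U/P/R)."""
    system = ("You write scaffolding cues for introductory Python exercises. "
              f"Produce a cue of type '{reasoning_type}' only.")
    # Few-shot exemplars precede the task (GPT-4 condition); the TinyLlama
    # condition omitted them and used a simplified zero-shot prompt.
    shots = "\n".join(f"Example ({reasoning_type}): {e}" for e in exemplars)
    user = (f"{shots}\n\nTask: {task_description}\n\n"
            f"Code context:\n{code_context}\n\n"
            f"Write one {reasoning_type} scaffolding cue.")
    response = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
        n=n,  # ten candidates per exercise/reasoning-type combination
    )
    return [choice.message.content for choice in response.choices]
        </preformat>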
        <p>For each combination of an exercise and scaffolding cue type, an LLM was prompted to generate 10
candidate scaffolding cues. One of the authors (the one not included in the expert evaluations described
below) reviewed these and selected the best three scaffolding cues based on clarity, relevance, and
conceptual alignment with the exercise. The selection also respected the following constraints:
• 2 scaffolding cues from GPT-4
• 1 scaffolding cue from TinyLlama
We repeated this process across the 14 exercises, which resulted in:</p>
        <p>14 exercises × 3 cue types × 3 scaffolding cues = 126 scaffolding cues in total.</p>
        <p>The pipeline implements a human-AI interaction by combining automated and manual components:
• Scaffolding cue generation was fully automated via LLMs.
• Scaffolding cue selection was manual: we chose the top 3 of 10 cues for each reasoning type.
This setup combines model-generated content with comparatively inexpensive human supervision to
ensure instructional value.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Scaffolding cues generated for the time_color() function, by reasoning type.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Reasoning Type</th><th>Scaffolding Cue Prompt</th></tr>
            </thead>
            <tbody>
              <tr><td>Understand</td><td>You’re mapping time to a display color — black for high time, red for low. Try to predict the thresholds before programming: what value should trigger red? What range should feel like a warning? Picture a timeline and reason through which time values should fall into each category and why.</td></tr>
              <tr><td>Understand</td><td>Try to imagine how a player would experience this function: if time is dropping quickly, when should the display change to orange or red? What would happen if your time conditions overlapped or left gaps? Could it return no color — or the wrong one — in those cases?</td></tr>
              <tr><td>Understand</td><td>If you see a color like ’red’, what does that really mean in the game? Can you match that feeling of urgency to a number? Think about what these colors communicate and how the function needs to convert a numeric signal into a visual one.</td></tr>
              <tr><td>Plan</td><td>Plan your if-elif-else structure: what will be your first condition? Which time values should be caught early? Use a number line if it helps — plot out values like 5, 10, 11 to see what range needs to go where.</td></tr>
              <tr><td>Plan</td><td>Break the function into small reasoning steps: How do you test for ‘less than or equal to’? Will you handle red first or black first? Why? Which values will you use to confirm your code is behaving as expected at boundaries?</td></tr>
              <tr><td>Plan</td><td>Try programming the condition for red first, then build upward. Just write each rule as a test on time and check what color it returns.</td></tr>
              <tr><td>Reflect</td><td>Now that you’ve implemented the logic, go back and ask: which time range was trickiest to classify? Did 5 go where you expected? Did 10?</td></tr>
              <tr><td>Reflect</td><td>How did you decide the order of your conditions — and did it matter? Would changing the order of if/elif affect your outcome?</td></tr>
              <tr><td>Reflect</td><td>If you tested time = 7 and got the wrong color, what would you look at first in your logic to fix it? What did you learn from doing that?</td></tr>
            </tbody>
          </table>
        </table-wrap>
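        <p>The cues in Table 1 all point toward a threshold-based implementation. A minimal sketch consistent with the cues is shown below; the course solution is not reproduced here, so the exact thresholds and color names are assumptions.</p>
        <preformat>
def time_color(time_left):
    """Map remaining time to a display color (illustrative thresholds)."""
    if time_left &lt;= 5:       # little time left: urgent
        return "red"
    elif time_left &lt;= 10:    # warning range
        return "orange"
    else:                    # plenty of time remaining
        return "black"

# Boundary checks of the kind the Plan cues suggest (values 5, 10, 11):
assert time_color(5) == "red"
assert time_color(10) == "orange"
assert time_color(11) == "black"
        </preformat>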
      </sec>
      <sec id="sec-4-2">
        <title>4.1. RQ1: Cue Type Correctness</title>
        <p>To evaluate whether the generated scaffolding cues exhibit distinct semantic structures that correspond
to given reasoning types, one of the authors performed a simple annotation task. We randomly sampled
a single high-quality scaffolding cue of each reasoning type (U/P/R) for each exercise from the GPT-4
outputs. This yielded 42 scaffolding cues. These scaffolding cues were then randomly shuffled,
removing all indicators of reasoning type.</p>
        <p>Annotation Procedure. One of the authors classified each scaffolding cue into one of three categories:
Understand, Plan, or Reflect. The author was provided the rubric shown in Table 2. The task was to
assign each scaffolding cue to the category it most closely matched, using only the rubric definitions.
Evaluation Metrics. We compared the original label the model was prompted with to the manual annotations.
We calculated accuracy for each scaffolding cue type. We also reviewed the mistakes to understand
which types were often generated improperly.</p>
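        <p>A minimal sketch of this computation is shown below; the names are illustrative, with pairs holding (intended type, annotated type) tuples from the blinded annotation task.</p>
        <preformat>
from collections import Counter

def per_type_accuracy(pairs):
    """Compute per-type accuracy and a tally of misclassifications."""
    correct, total, confusions = Counter(), Counter(), Counter()
    for intended, annotated in pairs:
        total[intended] += 1
        if intended == annotated:
            correct[intended] += 1
        else:
            confusions[(intended, annotated)] += 1  # e.g., ("Plan", "Understand")
    accuracy = {t: correct[t] / total[t] for t in total}
    return accuracy, confusions
        </preformat>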
      </sec>
      <sec id="sec-4-3">
        <title>4.2. RQ2: Scaffolding Cue Quality</title>
        <p>To assess the quality of the generated scaffolding cues, we scored the full set of 126 scaffolding cues
using a structured rubric focused on features such as clarity, relevance, and reasoning depth (see Table 2
for details).</p>
        <p>Annotator Assignment. Two authors (different from the author that performed the annotations for
RQ1) annotated all 126 scaffolding cues. Each annotator scored the scaffolding cues independently
using the rubric shown in Table 2.</p>
        <p>Rubric-Based Evaluation. Each scaffolding cue was scored using the structured 9-item binary rubric
shown in Table 2. Each reasoning type had three corresponding rubric criteria. Annotators marked
each item as either “Yes” (1) or “No” (0). Scores were aggregated to compute the proportion of criteria
met per scaffolding cue.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Evaluation</title>
        <p>Each scaffolding cue was scored using the following formula:</p>
        <p>Rubric Score = (# of “Yes” ratings on relevant criteria) / 3</p>
        <p>For example, a Plan-type scaffolding cue marked “Yes” on “Clarity” and “Step-Based” but “No” on
“Code-Focused” would receive a score of 2/3 ≈ 0.67. Scores were then averaged per scaffolding cue type
and per trait to produce the summaries shown in Figure 3.</p>
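        <p>The sketch below mirrors this computation; the example ratings are the ones from the Plan-type illustration above.</p>
        <preformat>
def rubric_score(ratings):
    """ratings: three binary (0/1) marks on the cue's type-specific criteria."""
    return sum(ratings) / 3

# Plan-type cue: "Yes" on Clarity and Step-Based, "No" on Code-Focused.
plan_cue = [1, 1, 0]
assert round(rubric_score(plan_cue), 2) == 0.67
        </preformat>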
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We report results for both research questions outlined in Section 1, corresponding to the correctness of
reasoning types (RQ1) and rubric-based quality assessment (RQ2).</p>
      <sec id="sec-5-1">
        <title>5.1. RQ1: Cue Type Correctness</title>
        <p>To assess whether the intended type of scaffolding cues was generated, one of the authors of this paper
annotated 42 scaffolding cues as either Understand, Plan, or Reflect. Table 3 shows the results of this
experiment. The results suggest that LLMs can reliably generate cues for the prescribed category. The
only disagreements in the expert annotations were Plan- and Reflect-type cues generated by the LLM
that the annotator classified as Understand:
• Understand: 20/20 correct (100%)
• Plan: 29/34 correct (85.3%) – 5 misclassified as Understand
• Reflect: 32/34 correct (94.1%) – 2 misclassified as Understand
Most errors occurred between Plan and Understand, possibly due to overlap in language used to scaffold
pre-solution reasoning. Notably, no Reflect scaffolding cues were misclassified as Plan, suggesting that
the reflective structure was likely distinctive.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Rubric criteria used to evaluate scaffolding cues, grouped by reasoning type.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Scaffolding Cue Type</th><th>Criterion</th><th>Definition</th></tr>
            </thead>
            <tbody>
              <tr><td>Understand</td><td>Clarity</td><td>Is the question phrased clearly and without confusing language? A student should be able to rephrase it confidently.</td></tr>
              <tr><td>Understand</td><td>Relevance</td><td>Is the question about this exact task — not a general Python idea or unrelated topic? It should mention key features like time, logic, or structure.</td></tr>
              <tr><td>Understand</td><td>Conceptual Clarity</td><td>Does it help the student understand how the task works — not just what to do? It should bring out a reasoning point.</td></tr>
              <tr><td>Plan</td><td>Clarity</td><td>Is the suggestion or thinking path easy to follow? A student should be able to imagine the steps.</td></tr>
              <tr><td>Plan</td><td>Step-Based</td><td>Does it break the task into logical parts — like what to check first, what to test, or how to structure decisions?</td></tr>
              <tr><td>Plan</td><td>Code-Focused</td><td>Does it relate directly to the kind of code they’ll write — like if-statements, range logic, or what values to handle?</td></tr>
              <tr><td>Reflect</td><td>Clarity</td><td>Can the student tell what part of their thinking or code they’re supposed to reflect on?</td></tr>
              <tr><td>Reflect</td><td>Relevance</td><td>Is it clearly about this task’s logic — not how hard it was or how they felt?</td></tr>
              <tr><td>Reflect</td><td>Reflective Depth</td><td>Does it ask them to explain what they figured out, fixed, or might reuse — not just if it worked?</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. RQ2: Scafolding Cue Quality</title>
        <p>Figure 3 shows the rubric-based evaluation of scaffolding cues as scored independently by two human
annotators—Reviewer A and Reviewer B—on the same dataset. This setup allows us to analyze how
differences in reviewer interpretation impact rubric-based scoring.</p>
        <p>Note that there was very little agreement between the two annotators, which suggests that the rubric
needs to be further improved. For this reason, we report the raw results of the annotation process.
• Understand: Both reviewers rated Understand-type cues highly, with Reviewer B assigning
slightly higher average scores across traits such as “Understand – Clarity” (0.94 vs. 0.85) and
“Understand – Conceptual Clarity” (0.83 vs. 0.77).
• Plan: Reviewer B again scored consistently higher on planning traits, especially for “Plan –
Clarity” (0.91 vs. 0.72), suggesting differences in how the two reviewers interpreted the clarity of
planning steps.
• Reflect: The largest disagreement appeared in Reflect-type traits. Reviewer B rated “Reflect –
Clarity” almost perfectly (0.99) compared to Reviewer A’s 0.77. A similar gap was observed for
“Reflect – Reflective Depth” (0.84 vs. 0.49).</p>
        <p>To better understand these disparities, Figure 3 breaks down rubric scores by individual criterion.
Reviewer B tended to assign higher scores across most traits, whereas Reviewer A’s scores reflected
more variation—particularly on traits involving deeper reasoning such as “Plan – Clarity” and “Reflect
– Reflective Depth.” These differences highlight not only the difficulty of scoring certain instructional
traits consistently, but also suggest that rubric criteria such as “depth” may require clearer anchors
or reviewer calibration. Rather than treating variation as noise, we interpret it as insight into how
instructors with different pedagogical lenses might differently value the same cue.
Since each task included two scaffolding cues from GPT-4 and one from TinyLlama, we can compare
average rubric scores between the models. Across all reasoning types and rubric traits:
• GPT-4 scaffolding cues (n = 84) had a mean rubric score of 0.83.</p>
        <p>• TinyLlama scaffolding cues (n = 42) had a mean rubric score of 0.67.</p>
        <p>This reflects a clear advantage for GPT-4 in producing scaffolding cues that meet human-annotated
instructional standards, particularly in reflective depth and task relevance.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Our results suggest that LLMs—especially a frontier model such as GPT-4—can generate scaffolding
cues that are both rubric-aligned and semantically distinct across reasoning types. The cues generated
for specific reasoning types (Understand, Plan, Reflect) were annotated by reviewers in alignment
with those intended types (RQ1). This supports the conclusion that LLMs are capable of producing
reasoning-aligned cues consistent with the type they were instructed to generate. Reviewer agreement
with the LLM’s intended category serves as evidence that the cues exhibit recognizable structural and
semantic features tied to each reasoning type.</p>
      <p>
        GPT-4 consistently outperformed TinyLlama across Understand, Plan, and Reflect categories, with
particularly strong performance on Reflect cues (RQ2). These results support prior findings that LLMs
are especially well-suited to post-coding reasoning support, particularly for generating reflective cues
that help students evaluate and revise their thinking [
        <xref ref-type="bibr" rid="ref31 ref4 ref5">4, 5, 31</xref>
        ]. However, trait-level analysis revealed
variability in how cues were rated, particularly in Plan Clarity and Reflective Depth.
      </p>
      <p>During our evaluation, we noticed that the two reviewers often gave different scores for the same
scaffolding cues, especially for traits like Reflective Depth and Plan Clarity. This variation was due to
the fact that the rubric itself was still being refined. As a result, some criteria left room for interpretation,
leading reviewers to apply different expectations or teaching philosophies when making judgments.
We chose to report the reviewers’ ratings separately as it better reflects real-world teaching, where
instructors often bring different perspectives to how student thinking is evaluated. It also
acknowledges that some instructional traits—especially deeper reasoning skills—are naturally harder to score
consistently without more calibration or a more mature rubric.</p>
      <p>From a teaching perspective, these findings highlight both the potential and the limits of using LLMs
to generate reasoning-aligned scaffolding cues. On the one hand, GPT-4 shows strong capability in
generating cues that support conceptual understanding and post-coding reflection. This can reduce
the time instructors spend drafting scaffolding cues from scratch. On the other hand, the variation in
reviewer scores emphasizes the importance of human review, especially for traits involving reflection.
As one of the reviewers noted in their comments, “the question is often not whether a prompt is
technically clear, but whether it reflects the kind of reasoning or level of support the instructor intends
to foster.” Reviewer B also raised concerns about reasoning structure, stating that a cue “doesn’t provide
thinking through steps,” and pointed out limits in reflective support, noting “it is not clear what to
reuse, and fix what for what.” Such comments highlight the importance of aligning scaffolding cues not
only with rubric traits, but also with instructional intent.</p>
      <p>Our findings also have broader implications for instructional design. When two reviewers disagree
about whether a cue demonstrates “clarity” or “depth,” it often reflects not rubric failure but differences
in pedagogical goals. Rather than enforcing universal definitions of these traits, it may be more useful
to view LLM-generated cues as adaptable resources that instructors can tune to their own teaching
style or course context. LLMs may not eliminate the need for expert judgment, but they can speed up
the process of drafting, revising, and contextualizing reasoning supports at scale.</p>
      <sec id="sec-6-1">
        <title>Limitations</title>
        <p>While our findings are promising, several limitations remain. We did not test the generated scaffolding
cues in live instructional settings. As a result, we cannot make claims about their actual impact on
student learning, engagement, or performance. We did not assess whether the scaffolding cues helped
students achieve specific learning goals or improved understanding. Future work should pair scaffolding
cue exposure with outcome metrics. Although our study involved 126 scaffolding cues and 14 exercises,
the number of scaffolding cue types and reviewers remains limited. Including more types of tasks and
more reviewers would help test whether the findings apply more broadly. Scaffolding cue selection, model
completions, and rubric scoring all involve human interpretation. Bias may have influenced both
scaffolding cue filtering and annotation.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>In this paper, we explore whether LLMs can generate high-quality reasoning scaffolding cues for small
programming tasks. We focus on three types of reasoning support: Understand, Plan, and Reflect.
Using exercises from an introductory Python course on the Sail() platform, we generated a dataset of
126 scaffolding cues using GPT-4 and TinyLlama, and evaluated them across two research questions.</p>
      <p>A reviewer was able to confirm the correctness of scaffolding cues with respect to their intended
types (U/P/R), showing that the structure of reasoning was clear and distinguishable (RQ1). Two other
reviewers rated the scaffolding cues using a structured rubric and found that most scaffolding cues met
important criteria for clarity, relevance, and reasoning depth (RQ2). GPT-4 scaffolding cues performed
consistently well, especially for Reflect-type reasoning.</p>
      <p>These results suggest that LLMs—especially GPT-4—can help instructors generate reasoning-aligned
scaffolding cues that reduce manual effort while supporting student thinking in code-based tasks. Even
with a small dataset, the scaffolding cues showed strong alignment to pedagogical goals and reflected
recognizable reasoning structures.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Future Work</title>
      <p>Future work will involve testing the scaffolding cues in real classrooms to study their impact on learning
outcomes, utilizing scaffolding cues generated by both LLMs and expert instructors in a single-blinded
A/B test. We also plan to refine the evaluation process by improving the clarity and granularity of the
rubric itself, informed by the discrepancies observed between the two annotators. As seen in our results,
reviewer disagreement was especially pronounced for traits like Plan Clarity and Reflective Depth,
indicating that rubric refinement may be necessary to support consistent scoring. Beyond evaluation
design, we will also explore whether LLMs can be guided to improve their performance on specific
weak traits and whether models can be fine-tuned or prompted to generate personalized scaffolding
cues based on student-level performance data. Larger-scale evaluations with more tasks and reviewers
will help validate the generalizability of these findings and support broader classroom integration.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT-4 and TinyLlama in order to: generate
candidate scaffolding cues for programming exercises, check grammar and spelling, improve writing
style, and paraphrase/reword text. After using these tools and services, the authors reviewed and edited
the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>VanLehn,</surname>
          </string-name>
          <article-title>The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems</article-title>
          ,
          <source>Educational Psychologist</source>
          <volume>46</volume>
          (
          <year>2011</year>
          )
          <fpage>197</fpage>
          -
          <lpage>221</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pea</surname>
          </string-name>
          ,
          <article-title>Computational thinking in k-12: A review of the state of the field</article-title>
          ,
          <source>Educational Researcher</source>
          <volume>42</volume>
          (
          <year>2013</year>
          )
          <fpage>38</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Koedinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>McLaughlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. L.</given-names>
            <surname>Bier</surname>
          </string-name>
          ,
          <article-title>Learning is not a spectator sport: Doing is better than watching for learning from a mooc</article-title>
          ,
          <source>ACM Transactions on Computing Education (TOCE)</source>
          <volume>15</volume>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Roll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Aleven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>McLaren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Koedinger</surname>
          </string-name>
          ,
          <article-title>Metacognitive scafolding for learning programming</article-title>
          ,
          <source>in: International Conference on Artificial Intelligence in Education</source>
          , Springer,
          <year>2011</year>
          , pp.
          <fpage>789</fpage>
          -
          <lpage>791</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Kinnebrew</surname>
          </string-name>
          , G. Biswas, Planning, reflection, and
          <article-title>self-regulation in open-ended learning environments</article-title>
          ,
          <source>in: International Conference on Artificial Intelligence in Education</source>
          , Springer,
          <year>2013</year>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Polya</surname>
          </string-name>
          , How to Solve It: A New Aspect of Mathematical Method, Princeton University Press,
          <year>1945</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Winslow</surname>
          </string-name>
          ,
          <article-title>Programming pedagogy-a psychological overview</article-title>
          ,
          <source>SIGCSE Bull</source>
          .
          <volume>28</volume>
          (
          <year>1996</year>
          )
          <fpage>17</fpage>
          -
          <lpage>22</lpage>
          . URL: https://doi.org/10.1145/234867.234872. doi:10.1145/234867.234872.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Pane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Ratanamahatana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Myers</surname>
          </string-name>
          ,
          <article-title>Studying the language and structure in non-programmers' solutions to programming problems</article-title>
          ,
          <source>International Journal of Human-Computer Studies</source>
          <volume>54</volume>
          (
          <year>2001</year>
          )
          <fpage>237</fpage>
          -
          <lpage>264</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1071581900904105. doi:10.1006/ijhc.2000.0410.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Naik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ma</surname>
          </string-name>
          , S. T. Wu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bogart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sakr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Rose</surname>
          </string-name>
          ,
          <article-title>Generating situated reflection triggers about alternative solution paths: A case study of generative ai for computer-supported collaborative learning</article-title>
          ,
          <source>in: Artificial Intelligence in Education: 25th International Conference, AIED 2024, Recife, Brazil, July</source>
          <volume>8</volume>
          -
          <issue>12</issue>
          ,
          <year>2024</year>
          , Proceedings, Part I, Springer-Verlag, Berlin, Heidelberg,
          <year>2024</year>
          , p.
          <fpage>46</fpage>
          -
          <lpage>59</lpage>
          . URL: https://doi.org/10.1007/978-3-031-64302-6_4. doi:10.1007/978-3-031-64302-6_4.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Leiker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Finnigan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Gyllen</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Cukurova, Prototyping the use of large language models (llms) for adult learning content creation at scale</article-title>
          ,
          <source>in: LLM@AIED</source>
          ,
          <year>2023</year>
          . URL: https://api.semanticscholar.org/CorpusID:259076210.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacNeil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mogil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Generating diverse code explanations using the gpt-3 large language model</article-title>
          ,
          <source>ICER '22</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          . URL: https://doi.org/10.1145/3501709.3544280. doi:10.1145/3501709.3544280.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacNeil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <article-title>Experiences from using code explanations generated by large language models in a web software development e-book</article-title>
          ,
          <source>SIGCSE</source>
          <year>2023</year>
          , ACM, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>931</fpage>
          -
          <lpage>937</lpage>
          . URL: https://doi.org/10.1145/3545945.3569785. doi:10.1145/3545945.3569785.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          , S. MacNeil, S. Sarsa,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <article-title>Comparing code explanations created by students and large language models</article-title>
          ,
          <source>Proceedings of the 2023 Conference on Innovation and Technology in Computer Science</source>
          Education V.
          <volume>1</volume>
          (
          <year>2023</year>
          ). URL: https://api.semanticscholar.org/CorpusID:258049009.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <article-title>Automatic generation of programming exercises and code explanations using large language models</article-title>
          ,
          <source>ACM</source>
          ,
          <year>2022</year>
          . URL: https://doi.org/10.1145/3501385.3543957. doi:10.1145/3501385.3543957.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Del Carpio Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Luxton-Reilly</surname>
          </string-name>
          ,
          <article-title>Evaluating Automatically Generated Contextualised Programming Exercises</article-title>
          ,
          <source>in: Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          ,
          <source>Portland OR USA</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>289</fpage>
          -
          <lpage>295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Giacaman</surname>
          </string-name>
          ,
          <article-title>Conversing with copilot: Exploring prompt engineering for solving cs1 problems using natural language</article-title>
          ,
          <source>in: Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1</source>
          ,
          SIGCSE
          <year>2023</year>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>1136</fpage>
          -
          <lpage>1142</lpage>
          . URL: https://doi.org/10.1145/3545945.3569823. doi:10.1145/3545945.3569823.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Piccolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Luxton-Reilly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Payne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Ridge</surname>
          </string-name>
          ,
          <article-title>Many bioinformatics programming tasks can be automated with chatgpt</article-title>
          ,
          <source>arXiv preprint arXiv:2303.13528</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Savelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bogart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sakr</surname>
          </string-name>
          ,
          <article-title>Thrilled by your progress! large language models (gpt-4) no longer struggle to pass assessments in higher education programming courses</article-title>
          ,
          <source>in: Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1</source>
          , ICER '23,
          Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>78</fpage>
          -
          <lpage>92</lpage>
          . URL: https://doi.org/10.1145/3568813.3600142. doi:10.1145/3568813.3600142.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liffiton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sheese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Savelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <article-title>Codehelp: Using large language models with guardrails for scalable support in programming classes</article-title>
          ,
          <source>arXiv preprint arXiv:2308.06921</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sheese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liffiton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Savelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <article-title>Patterns of Student Help-Seeking When Using a Large Language Model-Powered Programming Assistant</article-title>
          ,
          <source>in: Proceedings of the 26th Australasian Computing Education Conference</source>
          , ACM,
          <source>Sydney NSW Australia</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kazemitabaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Henley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Craig</surname>
          </string-name>
          , T. Grossman,
          <article-title>CodeAid: Evaluating a Classroom Deployment of an LLM-based Programming Assistant that Balances Student and Educator Needs</article-title>
          ,
          <source>in: Proceedings of the CHI Conference on Human Factors in Computing Systems</source>
          , ACM, Honolulu, HI, USA,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bassner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Frankford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krusche</surname>
          </string-name>
          ,
          <article-title>Iris: An AI-Driven Virtual Tutor For Computer Science Education</article-title>
          ,
          <year>2024</year>
          . URL: http://arxiv.org/abs/2405.08008, arXiv:2405.08008 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Zamfirescu-Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>DeNero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <article-title>61A-Bot: AI homework assistance in CS1 is fast and cheap - but is it helpful?</article-title>
          ,
          <year>2024</year>
          . URL: http://arxiv.org/abs/2406.05600, arXiv:2406.05600 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zenke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Thornton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Malan</surname>
          </string-name>
          ,
          <article-title>Teaching CS50 with AI: Leveraging Generative Artificial Intelligence in Computer Science Education</article-title>
          ,
          <source>in: Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1</source>
          , ACM, Portland, OR, USA,
          <year>2024</year>
          , pp.
          <fpage>750</fpage>
          -
          <lpage>756</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3626252.3630938.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>X.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Ericson</surname>
          </string-name>
          ,
          <article-title>CodeTailor: LLM-Powered Personalized Parsons Puzzles for Engaging Support While Learning Programming</article-title>
          ,
          <year>2024</year>
          . URL: http://arxiv.org/abs/2401.12125, arXiv:2401.12125 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Doughty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bompelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qayum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doyle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sridhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          , et al.,
          <article-title>A comparative study of AI-generated (GPT-4) and human-crafted MCQs in programming education</article-title>
          ,
          <source>in: Proceedings of the 26th Australasian Computing Education Conference</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>114</fpage>
          -
          <lpage>123</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Angelikas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Okechukwu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>MacNeil</surname>
          </string-name>
          ,
          <article-title>Generating multiple choice questions for computing courses using large language models</article-title>
          ,
          <source>in: 2023 IEEE Frontiers in Education Conference (FIE)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sridhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doyle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bogart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Savelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sakr</surname>
          </string-name>
          ,
          <article-title>Harnessing LLMs in curricular design: Using GPT-4 to support authoring of learning objectives</article-title>
          ,
          <source>arXiv preprint arXiv:2306.17459</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Doyle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sridhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Savelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sakr</surname>
          </string-name>
          ,
          <article-title>A comparative study of AI-generated and human-crafted learning objectives in computing education</article-title>
          ,
          <source>Journal of Computer Assisted Learning</source>
          <volume>41</volume>
          (
          <year>2025</year>
          ) e13092.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bogart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Šavelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sakr</surname>
          </string-name>
          ,
          <article-title>Examining the trade-offs between simplified and realistic coding environments in an introductory Python programming class</article-title>
          ,
          <source>in: European Conference on Technology Enhanced Learning</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>315</fpage>
          -
          <lpage>329</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Naik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bogart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sakr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Rose</surname>
          </string-name>
          ,
          <article-title>Generating situated reflection triggers about alternative solution paths: A case study of generative AI for computer-supported collaborative learning</article-title>
          ,
          <source>in: Artificial Intelligence in Education: 25th International Conference, AIED 2024, Recife, Brazil, July 8-12, 2024, Proceedings, Part I</source>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>2024</year>
          , pp.
          <fpage>46</fpage>
          -
          <lpage>59</lpage>
          . URL: https://doi.org/10.1007/978-3-031-64302-6_4. doi:10.1007/978-3-031-64302-6_4.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>