<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Open-Ended Questions Need Personalized Feedback: Analyzing LLM-Enabled Features with Student Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rachel Van Campenhout</string-name>
          <email>Rachel.vancampenhout@vitalsource.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jeffrey S. Dittel</string-name>
          <email>jeff.dittel@vitalsource.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bill Jerome</string-name>
          <email>bill.jerome@vitalsource.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michelle W. Clark</string-name>
          <email>michelle.clark@vitalsource.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benny G. Johnson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>VitalSource Technologies</institution>
          ,
          <addr-line>Raleigh, NC 27601</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Large language models (LLMs) offer new opportunities to support deeper learning through open-ended, formative practice. This paper investigates two novel types of automatically generated questions: compare-and-contrast prompts and student-authored exam questions. These question types are integrated into an ereader platform alongside conventional fill-in-the-blank items. To enable meaningful interaction with these open-ended tasks, an LLM is used to generate personalized feedback grounded in textbook content. A dataset of more than 90,000 student-question interactions is analyzed to evaluate how these new question types perform in terms of engagement, difficulty, persistence, and non-genuine responses, and how students interact with the LLM-generated feedback. Results are compared across contexts where questions were assigned as part of a course versus used voluntarily. Assigned usage dramatically increases engagement and improves performance across most metrics. To understand how students respond to the feedback itself, timing and textual overlap between the initial LLM-generated feedback and the student's second attempt are examined, revealing distinct patterns of reflection, revision, and potential feedback reuse. These results highlight both the promise and complexity of using LLMs to expand the cognitive scope of automated formative practice while maintaining pedagogical value at scale.</p>
      </abstract>
      <kwd-group>
        <kwd>automatic question generation</kwd>
        <kwd>open-ended questions</kwd>
        <kwd>personalized feedback</kwd>
        <kwd>large language models</kwd>
        <kwd>performance metrics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Automatic question generation (AQG) has been a prolific area of research and development over the
past decade, enabled by advances in natural language processing tools, machine learning
techniques, and artificial intelligence. Many approaches have been used to develop AQG pipelines,
for equally varied use cases. However, in their systematic review of 92 AQG studies, Kurdi et
al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] found only one study that evaluated automatically generated (AG) questions using student
data and called for quantitative evaluations of question performance metrics. The AQG system
studied in this investigation is an expert-designed, rule-based system that uses textbook content as
the corpus for natural language processing in order to select important sentences and key terms and
transform them into formative practice questions for students to answer as they read. Formative
practice significantly benefits all students, particularly those who struggle or are disadvantaged [
        <xref ref-type="bibr" rid="ref2 ref3">2,
3</xref>
        ], with integrated practice achieving six times the effect size compared to reading alone [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Given
this robust causal relationship [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], leveraging AQG to scale formative practice widely was
pragmatic. To support equity and access at scale, the AG formative practice was made available for
free to any learner who uses textbooks containing it. Prior research on this AQG system in recent
years has compared engagement, difficulty, persistence, and discrimination performance metrics of
intermixed AG and human-authored questions within a courseware environment [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], and
evaluated these same performance metrics at scale with more than seven million student-question
interactions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Further research studied student learning behaviors via question interaction
patterns [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the most effective type of automatically generated feedback for student persistence
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and performance metrics with faculty and student perceptions from classroom implementations
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Not only does studying AG questions using student data provide valuable performance
benchmarks for formative learning contexts, but this research leads to iterative improvement cycles
of the AQG system itself—ultimately benefiting the learners who use it.
      </p>
      <p>
        The AQG pipeline as originally created does not use large language models (LLMs) for question
generation, for two primary reasons: first, LLMs were far less robust when the AQG pipeline was
being developed; and second, LLMs can produce factual inaccuracies, and the scale at which
questions are generated is far too great for human review, an ethical barrier.
However, LLM technologies do have many advantages that could be applied to AQG pipelines if
done so responsibly. Personalized feedback is one such opportunity. Generating open-ended
questions is not challenging, but providing feedback is. Personalized, error-specific feedback is a
hallmark feature of intelligent tutoring systems, well established for being the most effective
computer-based learning environments [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. Once again, scale for this type of feedback was
largely prohibitive, but text comparison is a strength of LLMs and could provide a solution to this
challenge.
      </p>
      <p>
        In the fall of 2024, two new open-ended question types were added to the existing AG practice
question types: a glossary term compare and contrast (C&amp;C) question and a write your own exam
question. These two new question types were selected to engage students in higher-level cognitive
process dimensions [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. An LLM is then harnessed to compare the student’s response to the relevant
section of textbook content and provide constructive feedback. Although the rule-based AQG system
was capable of generating such open-ended questions before, implementing them without providing
feedback would have left students uncertain about their correctness, risking the perpetuation of
misconceptions. Therefore, including these question types required the ability to provide
accompanying personalized feedback.
      </p>
      <p>
        All AG questions are presented within a dedicated “CoachMe” panel alongside the ebook text,
allowing learners to interact with questions while reading (Figure 1). Students may attempt each
question as many times as desired, receiving immediate feedback indicating whether their response
is correct or incorrect. The fill-in-the-blank (FITB) questions have contextual hints generated using
related sentences of the same textbook section [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Students can revisit the textbook content for
support before retrying, or when needed, can choose to reveal the correct answer. Additionally,
students may rate the question after submitting a response using the thumbs up and down icon.
      </p>
      <p>
        Each student interaction with the ereader platform generates microlevel clickstream data, and
these “digital traces of student actions promise a more scalable and finer-grained understanding of
learning processes” [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. These high-quality data allow for investigation of learner behaviors as well
as learning technologies, enabling old research questions to be answered in new ways and new
research questions to arise from novel data [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The clickstream data are stored with an anonymized student
identifier, so no personally identifiable information is connected with engagement data. The platform
does not capture any student demographic data. The analysis in this study includes all students who
have answered these questions, with an interest in studying the difference between self-motivated
usage and usage when assigned in a course context. Investigating the effectiveness of AI is required
to ensure its application is beneficial and performing as intended for learners. The use of AI in
educational technology should adhere to AI principles (such as accountability, transparency and
explainability, responsibility and ethics, and efficacy) both during its conceptualization and
development [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] as well as reporting efficacy findings and continuing to engage in iterative
improvement [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        Beyond evaluating the two new question types, this study contributes to the broader theoretical
understanding of how generative AI can facilitate higher-order cognitive engagement [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and
constructive learning activities [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Practically, our findings provide educators and educational
technology designers concrete evidence supporting structured integration of cognitively demanding
open-ended questions paired with AI-generated personalized feedback. By empirically
demonstrating significant differences between assigned and unassigned contexts, this research
underscores the critical role of instructional design in maximizing the benefits of automated
formative practice at scale.
      </p>
      <p>The primary research questions for this paper are:</p>
      <p>1. What are the performance metrics for the new open-ended question types and how do they
compare to the existing FITB questions as a benchmark?</p>
      <p>2. How do the performance benchmarks differ between contexts where the questions are
unassigned (students self-selecting to answer) and assigned (known classroom
implementations)?</p>
      <p>3. How does the LLM-generated personalized feedback perform?</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <sec id="sec-2-1">
        <title>2.1. Automatic Question Generation</title>
        <p>
          A rule-based AQG pipeline underpins the generation of the standard FITB questions used for
comparison in this investigation. While full implementation details can be found in earlier work (e.g.,
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]), a brief overview is provided here for context. The pipeline uses spaCy [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] to perform syntactic
and semantic analysis of textbook content and applies TextRank [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] to identify sentences deemed
important. Very short (fewer than five words) or very long (more than 40 words) sentences are
discarded. For each remaining sentence, a set of rule-based filters removes trivial or ambiguous terms
(e.g., function words, overly predictable words [23], or list items), leaving only key terms as blank
candidates. If multiple terms survive, each is turned into a separate FITB question. These items are
placed at major subsection boundaries so that learners regularly encounter formative practice
questions while reading the textbook.
        </p>
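        <p>The selection and blanking steps above can be sketched as follows. This is a minimal illustration only: the production pipeline uses spaCy and TextRank for importance scoring, whereas the stopword list, key-term filter, and blank format here are simplified placeholders.</p>

```python
# Minimal sketch of the rule-based FITB generation steps described above.
# The real pipeline uses spaCy and TextRank; the stopword set and blank
# format here are placeholders to show the control flow only.

STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "to", "and"}  # placeholder filter

def candidate_sentences(sentences):
    """Keep sentences of 5-40 words, mirroring the length filter."""
    return [s for s in sentences if 5 <= len(s.split()) <= 40]

def fitb_questions(sentence, key_terms):
    """Turn each surviving key term into a separate fill-in-the-blank item."""
    questions = []
    for term in key_terms:
        if term.lower() in STOPWORDS:
            continue  # trivial or ambiguous terms are discarded by the filters
        if term in sentence:
            questions.append(sentence.replace(term, "_____", 1))
    return questions

sents = ["Short one.", "The lactate threshold marks the intensity at which lactate accumulates."]
kept = candidate_sentences(sents)       # the 2-word sentence is discarded
items = fitb_questions(kept[0], ["lactate threshold", "intensity"])  # two separate items
```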
      </sec>
      <sec id="sec-2-2">
        <title>2.1.1. Open-Ended Questions with LLM-Enabled Feedback</title>
        <p>
          Building on the rule-based pipeline described above, two new question types extend CoachMe into
more open-ended tasks designed to foster deeper cognitive engagement (shown in Figure 1). The
existing questions (including the FITB questions used here for comparison) are primarily focused on
basic comprehension and most closely align with lower-level recognition and recall cognitive
processes in Bloom’s taxonomy [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], and according to the ICAP framework [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], help maintain an
“active” mode of engagement. Despite the seemingly modest cognitive demands, these question types
have demonstrated effectiveness in supporting learning, as evidenced by the doer effect [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ]. While
the standard items attend to essential knowledge-building, the newly introduced open-ended
questions aim to elevate learners further along Bloom’s taxonomy and into a more “constructive”
mode within the ICAP framework.
        </p>
        <p>The student-authored exam questions direct students to “Write a test question for the section
‘[Textbook Section Title]’ as if you are the instructor preparing an exam.” This templated prompt is
placed at the end of each major section of the textbook. This aims to promote higher-order thinking
by requiring students to reflect on and synthesize key concepts. Having students compose their own
exam questions fosters metacognitive awareness and shifts them from simply receiving content to a
more constructive level of engagement. Research has found this type of student question creation
can increase engagement and significantly enhance comprehension and academic performance,
particularly when feedback is provided [24, 25].</p>
        <p>
          The compare-and-contrast (C&amp;C) questions focus on conceptual clarity by having students
compare related glossary terms. The system automatically identifies pairs of “coordinate” terms that
share a common final word (e.g., “lactate threshold” and “ventilatory threshold”) appearing close to
each other in the same textbook section. It then inserts the standardized question stem, “Explain the
difference between [Term 1] and [Term 2].” This task asks students to identify subtle distinctions,
thereby engaging them in elaboration and deeper processing consistent with the “analyze” cognitive
process dimension [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and constructive engagement [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Research on C&amp;C tasks suggests that
recognizing similarities and differences and drawing comparisons improves conceptual clarity,
facilitates retention beyond mere recall, supports the formation of conceptual categories, and aids in
establishing meaningful links among ideas [26, 27, 28].
        </p>
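        <p>The coordinate-term pairing described above can be sketched as a grouping on the final word of each glossary term. This is an illustrative sketch only: it assumes glossary terms arrive as plain strings, and it omits the proximity check that restricts pairs to terms appearing close together in the same section.</p>

```python
from collections import defaultdict
from itertools import combinations

def coordinate_pairs(glossary_terms):
    """Group multi-word glossary terms by their shared final word and
    pair terms within each group, as for the C&C questions."""
    groups = defaultdict(list)
    for term in glossary_terms:
        words = term.split()
        if len(words) >= 2:  # single-word terms have no coordinate structure
            groups[words[-1].lower()].append(term)
    pairs = []
    for terms in groups.values():
        pairs.extend(combinations(terms, 2))
    return pairs

def cc_stem(t1, t2):
    """Insert the standardized question stem from the paper."""
    return f"Explain the difference between {t1} and {t2}."

terms = ["lactate threshold", "ventilatory threshold", "stroke volume"]
pairs = coordinate_pairs(terms)  # only the two "threshold" terms pair up
```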
        <p>These new question types do not supplant the standard items; rather, they fulfill complementary
roles. The standard questions help ensure students do not passively skim the text without active
reflection of foundational content. The new open-ended questions require students to produce new
representations of knowledge. This higher-order interaction can bridge connections between
concepts more effectively and strengthen long-term retention.</p>
        <p>Once a student submits an answer, the platform gathers that response along with relevant
textbook passages or glossary entries and forwards them to an LLM-based evaluator. The evaluation
process proceeds as follows:</p>
        <p>1. An excerpt from the textbook or glossary is supplied to the LLM, ensuring feedback remains
grounded in the source material and aligned with the textbook’s terminology. This
“textbook-centered” approach is designed to minimize hallucinations and maintain consistency with
established vocabulary.</p>
        <p>2. The LLM is instructed to gauge the accuracy, completeness, and clarity of the student’s
submission. In the case of a C&amp;C question, for example, the LLM checks whether the
student’s explanation clearly differentiates the two related concepts. Because these question
types are intrinsically open-ended, the system does not classify answers as strictly “correct”
or “incorrect.” Instead, the evaluator identifies strengths in the student’s work and points out
areas that might need additional clarification or elaboration.</p>
        <p>3. The LLM produces a concise textual critique, which may include praise for effectively
capturing key points, suggestions for further detail, or corrections if inaccuracies are evident.
In the current implementation, GPT-4o [29] is used for feedback generation, with the
temperature parameter set to 0 to decrease the likelihood of hallucinations. Since these
question types often call for a higher level of cognitive engagement than the standard items,
the feedback is intended to encourage iterative refinement, allowing students to revise and
resubmit their answers if they choose.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.2. Data Collection</title>
        <p>The dataset consists of all student-question interaction events for the LLM-enabled questions
gathered from August 15, 2024, through February 9, 2025. The ereader platform stores the raw
clickstream data with anonymous identifiers. Student consent for research and analytics is obtained
through acceptance of the platform’s terms of use and privacy policy. No student characteristics are
collected and the learner context is in general not known, though the majority of data comes from
higher education institutions in the United States. Data were grouped into student-question sessions,
consisting of all actions of an individual student on a single question, ordered chronologically. A
session may include multiple attempts on the question and, optionally, a thumbs up or down rating
by the student (see Section 3.1.4).</p>
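        <p>The grouping into student-question sessions can be sketched as follows. This is a minimal illustration only; the actual event schema of the platform is not specified here, so the field names are assumptions.</p>

```python
from collections import defaultdict

def build_sessions(events):
    """Group raw clickstream events into student-question sessions,
    each ordered chronologically, as described above."""
    sessions = defaultdict(list)
    for e in events:
        # one session per (student, question) pair; field names are illustrative
        sessions[(e["student_id"], e["question_id"])].append(e)
    for key in sessions:
        sessions[key].sort(key=lambda e: e["timestamp"])
    return dict(sessions)

events = [
    {"student_id": "s1", "question_id": "q1", "timestamp": 20, "action": "attempt"},
    {"student_id": "s1", "question_id": "q1", "timestamp": 10, "action": "attempt"},
    {"student_id": "s2", "question_id": "q1", "timestamp": 15, "action": "rating"},
]
sessions = build_sessions(events)  # two sessions; s1's attempts ordered by time
```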
        <p>This resulted in a dataset of 83,624 LLM-enabled question sessions (56,944 exam question and
26,680 C&amp;C), encompassing 92,719 interaction events, 23,750 questions, 14,696 students, and 1,929
textbooks. (Because only 544 of these textbooks included a glossary, C&amp;C questions could only be
generated for those particular books.) For comparative purposes, data from the standard FITB
questions were retrieved for the same textbooks and timeframe, resulting in 1,142,891 sessions
spanning 236,511 questions. The datasets are made available in our open data repository [30].</p>
        <p>These usage data reflect real-world learning contexts in which some courses assigned the
questions as part of a participation grade, while in other cases the questions remained optional.
Questions were categorized as either assigned or unassigned based on whether they were part of a
known classroom implementation. Specifically, 21 course sections across four institutions were
identified in which instructors explicitly required students to complete the practice questions; these
constitute the assigned group. All other usage is considered unassigned and typically reflects
voluntary student engagement, allowing for comparative analysis between contexts.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.3. Analysis</title>
      </sec>
      <sec id="sec-2-5">
        <title>2.3.1. Question Performance Metrics</title>
        <p>
          Previous research on AI-generated questions has relied on several core metrics to characterize
question performance, including engagement, difficulty, persistence, non-genuine response rates [
          <xref ref-type="bibr" rid="ref10 ref9">9,
10</xref>
          ], and student ratings [31]. The present study adopts these metrics to compare the performance of
new LLM-enabled items and standard FITB questions (detailed in the Results and Discussion section).
We adopt an exploratory approach for this study, using mean rates or proportions and quartiles for
each metric. If notable differences emerge, future investigations may employ more advanced
statistical approaches (e.g., mixed effects regression) to address variables such as subject domain or
student-level factors.
        </p>
        <p>However, because the new question types lack predefined correct answers, an LLM-based
approach was used to determine correctness (for C&amp;C items) and to detect non-genuine responses
(for both types). GPT-4o mini [32] was used to examine each C&amp;C response and its accompanying
feedback to decide whether a typical college instructor would reasonably consider it “complete and
correct.” Exam questions receive no correctness label but are checked for non-genuine attempts. Any
submission that does not address the prompt meaningfully (e.g., random text, “idk”) was flagged as
non-genuine. To ensure reliability, the prompts were iteratively developed using a subset of
responses, refining them until the LLM’s outputs were consistent with typical college-level
evaluation. It was verified that the LLM correctly identified non-genuine answers and assessed C&amp;C
accuracy in a way that reflected domain-reasonable expectations. After prompt refinement, spot
checks were performed on additional cases in the full dataset to confirm the LLM was applying these
criteria consistently. While not a formal validation study, this process ensured that the LLM's
classifications were consistent with our instructional intent. Although sufficient for the present
analysis, we acknowledge that a more systematic validation, such as expert annotation of a sample
set, would further strengthen the reliability of these measures. This remains an area for future work.</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.3.2. Feedback Usage</title>
        <p>
          To examine how students might use the LLM-generated feedback, the time interval until a second
attempt was computed. Specifically, if a student’s initial attempt was incorrect or non-genuine and
a follow-up attempt occurred, the elapsed time (in seconds) between the submissions was calculated.
This is similar to prior work in intelligent tutoring systems, where response latency often serves as
a proxy for reflection or cognitive engagement [
          <xref ref-type="bibr" rid="ref13 ref28">13, 28</xref>
          ]. Because these intervals tend to be skewed,
we report the first quartile, median, and third quartile (Q1–Q3).
        </p>
        <p>To assess whether LLM feedback fosters learning, the analysis focuses on sessions in which the
first attempt was incorrect or non-genuine. The time interval data are stratified by the initial attempt
category and the outcome of the second attempt (correct, incorrect, or non-genuine). This framework
highlights pivotal transitions, such as moving from a non-genuine to a correct response, and
establishes a basis for comparing revision times with textual overlap of the revised attempt with the
feedback. Because the LLM’s feedback can occasionally provide near-complete model answers,
recognizing such overlap is relevant for distinguishing between independent construction of new
text and reuse of provided material.</p>
        <p>For this analysis, the LLM’s feedback and the student’s second answer were lowercased and
stripped of punctuation to help mitigate superficial differences, then tokenized on whitespace. A
token-level gestalt sequence-matching approach [33], implemented via Python’s
difflib.SequenceMatcher, produced a similarity percentage score, where 100% indicates a
verbatim match. Reordered text reduces the similarity score, penalizing partial rearrangements. This
method is intended to capture literal copying more effectively than simpler distance metrics, as it
identifies matching subsequences across the entire submission. These findings are then related to the
time interval results, exploring whether rapid resubmission coincides with higher textual overlap.</p>
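        <p>The normalization and token-level matching described above can be sketched as follows; a minimal illustration of the approach, not the exact analysis script.</p>

```python
import difflib
import string

def similarity_pct(feedback, second_attempt):
    """Lowercase, strip punctuation, tokenize on whitespace, then score
    token-level overlap with gestalt sequence matching (100 = verbatim)."""
    def tokens(text):
        cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
        return cleaned.split()
    matcher = difflib.SequenceMatcher(None, tokens(feedback), tokens(second_attempt))
    return 100 * matcher.ratio()

# Identical token sequences score 100 despite case and punctuation differences.
same = similarity_pct("The lactate threshold is lower.", "the lactate threshold is lower")
```
Matching on token lists rather than raw characters means that reordered text lowers the score, which is what penalizes partial rearrangements while still rewarding verbatim reuse of feedback phrases.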
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Discussion</title>
      <sec id="sec-3-1">
        <title>3.1. Performance Metrics</title>
      </sec>
      <sec id="sec-3-2">
        <title>3.1.1. Engagement</title>
        <p>Engagement measures whether students choose to attempt a given question upon encountering it. It
serves as a proxy for how appealing or approachable a question is to students in a given context.
Lower engagement may indicate a question type is perceived as more time-consuming, overly
difficult, or less beneficial. Because engagement serves as a core driver of the doer effect in formative
practice, it remains a critical baseline for understanding how new question types fare relative to
standard items. In this analysis, engagement is measured as the number of students who answered
each question, which provides a straightforward indicator of how often a question drew student
participation when encountered.</p>
        <p>Table 1 reports student engagement for each question type. For the assigned group, where
engagement was more substantial, the mean and Q1–Q3 are reported. For the unassigned group,
engagement was consistently low, so only the mean is reported. (In Tables 1–3, all cells represent
over 1,000 sessions.) For unassigned questions, most were answered by only a few students. The
mean number of students answering each exam question was 2.4, and 2.8 for C&amp;C. The assigned
courses show a very different pattern of behavior. A Mann–Whitney U test confirmed that
significantly more LLM-enabled questions were answered in assigned contexts than in unassigned
contexts (U = 1.23 × 10<sup>6</sup>, p &lt; .001). The mean numbers of students answering the exam and C&amp;C
questions are very close (51.7 and 55.5, respectively), which seems reasonable given the similarity in
effort involved. The FITB questions are considerably higher at 84.5, and indeed are answered at a
much higher rate at each quartile.</p>
        <p>Across the assigned group, FITB questions were answered at a substantially higher rate than exam
questions at every quartile, with more than double the number of students answering at the 75th
percentile. C&amp;C questions showed stronger engagement than exam questions as well, with 75th
percentile participation nearly 50% higher. These patterns suggest that students were more
consistently willing to attempt FITB and C&amp;C items when assigned. Faculty practices may contribute
to this behavior; for example, many instructors assign participation credit for completing a portion
(e.g., 80%) of the available questions, which could lead students to selectively skip certain question
types.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.1.2. Difficulty and Persistence</title>
        <p>
          Difficulty is reflected by the percentage of correct first attempts (sometimes referred to as the
difficulty index). While the open-ended exam questions are less amenable to an objective correctness
classification, C&amp;C responses can be more readily evaluated because they involve specific key
distinctions. GPT-4o mini [32] was employed post hoc to analyze each student’s submission together
with its LLM-generated feedback, instructed to determine whether a typical college professor would
regard it as “complete and correct.” This offline classification does not affect the real-time feedback
students receive, but rather serves as a means to compare overall difficulty of the C&amp;C items to that
of standard FITB questions. As shown in Table 2, the results for C&amp;C and FITB confirm trends from
prior research [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ] that the mean difficulties are higher when the practice is assigned, meaning
students get the questions correct more frequently in a classroom context when they are assigned.
Specifically, a chi-square test showed that the proportion of correct first attempts for C&amp;C was
significantly higher in the assigned context compared to unassigned (χ² = 207.87, p &lt; .001). The C&amp;C
questions had lower mean scores than the FITB questions, which is not unexpected given the higher
level of cognitive effort and content comprehension required to answer the C&amp;C compared to a
single-term FITB. However, the difficulty index of 59.8 for the assigned C&amp;C is within a reasonable
range for such a complex question type.
        </p>
        <p>
          Persistence occurs when a learner continues after an initial incorrect attempt until they
eventually arrive at a correct response. As with difficulty, persistence applies only to question types
where correctness is defined (C&amp;C and FITB). Although the system’s generative feedback focuses on
iterative improvement rather than binary correctness, persistence nevertheless provides insight into
how willing students are to revise more demanding items. The persistence data are a subset of the
difficulty dataset, as it is only the students who were incorrect on their first attempt. Also consistent
with prior research [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ], persistence increases when questions are assigned. For C&amp;C questions, a
chi-square test indicated that persistence was significantly higher in assigned contexts (χ² = 204.21,
p &lt; .001). Persistence for C&amp;C is much lower than for FITB. This could be related to two factors. First,
the effort to answer C&amp;C questions is much higher than for FITB, so it is not unexpected students
would be less inclined to attempt them more than once. Second, the post-submission experience
differs considerably: FITB initially provides correctness without revealing answers, prompting
retries or answer reveals, whereas incorrect C&amp;C responses immediately receive comprehensive
LLM-generated corrective feedback, reducing incentives to retry. Given this, students who persist
may show added effort to rephrase the correct response on their own, but students who don’t persist
have still received personalized corrective feedback—both beneficial learning experiences.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>3.1.3. Non-Genuine Responses</title>
        <p>Non-genuine answers are those that do not constitute a legitimate effort. For FITB items, a
rule-based filter detects obviously invalid submissions (e.g., a single character, “idk”). As previously
discussed, an LLM was used to evaluate whether a response substantively engaged with the C&amp;C
terms or proposed a meaningful exam question; if not, the answer is flagged as non-genuine.
Non-genuine responses are lower for students in the assigned group for the open-ended questions: exam
questions 11.7% assigned versus 16.8% unassigned, and C&amp;C questions 15.2% assigned compared to
19.1% unassigned. Chi-square tests confirm these differences are statistically significant for both
exam questions (χ² = 200.01, p &lt; .001) and C&amp;C questions (χ² = 63.86, p &lt; .001). The FITB questions
have 6.6% non-genuine responses for assigned versus 3.9% unassigned. C&amp;C questions had the
highest non-genuine response rate for both groups. Given the cognitive demand combined with the
need for understanding of two domain-specific terms, this is perhaps not surprising.</p>
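A rule-based filter of the kind described for FITB items can be sketched as follows. The specific rules and the `is_non_genuine` helper are hypothetical illustrations consistent with the examples given (single character, “idk”), not the production filter.

```python
# Minimal sketch of a rule-based filter for obviously non-genuine
# FITB submissions. Rules are illustrative, not the production system's.
import re

NON_GENUINE_TOKENS = {"idk", "i don't know", "n/a", "none", "?"}

def is_non_genuine(answer: str) -> bool:
    """Flag answers that are clearly not a legitimate attempt."""
    text = answer.strip().lower()
    if len(text) <= 1:                 # empty or a single character
        return True
    if text in NON_GENUINE_TOKENS:     # common throwaway responses
        return True
    if re.fullmatch(r"(.)\1+", text):  # one character repeated, e.g. "aaaa"
        return True
    return False

print(is_non_genuine("idk"), is_non_genuine("mitochondria"))
```

In contrast, the open-ended question types require the LLM-based evaluator described earlier, since "substantive engagement" cannot be captured by surface rules like these.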
      </sec>
      <sec id="sec-3-5">
        <title>3.1.4. Student Ratings</title>
        <p>
Student “thumbs up/down” ratings (Figure 1) provide a mechanism for detecting problematic
questions. Students could give a rating after submitting an answer, with one rating opportunity per
question session. Table 3 shows a higher overall rating frequency for unassigned questions. This
initially seems counterintuitive given that engagement is much lower for unassigned use. We attribute
this finding to rating fatigue [31]: students are more willing to rate early questions but rate less often
as they continue to answer. Because students in the assigned group answer dramatically more questions,
their rating frequency is driven down. We also see an inverse relationship between the groups: the
unassigned group gave more thumbs up than thumbs down ratings, while the assigned group gave more
thumbs down ratings. This could reflect students in the assigned group becoming more selective about
when to rate, letting questions they like pass without comment and negatively rating those they
dislike. These findings are consistent with prior research analyzing aggregate ratings [
          <xref ref-type="bibr" rid="ref10">10, 34</xref>
          ].
Exam questions received the highest rates of both thumbs up and thumbs down ratings in both groups. However, because
exam questions use only a templated prompt, they are not susceptible to some of the common reasons
FITB questions receive thumbs down, such as being drawn from an example or from content students consider
less helpful. The thumbs down ratings for exam questions are therefore more likely related to dislike
of the question type itself.
        </p>
        <sec id="sec-3-5-1">
          <title>Table 3 (partial): Rating Frequency by Group</title>
          <p>Thumbs Up: Unassigned 3.67, 2.22, 2.30; Assigned 0.57, 0.21, 0.06. Thumbs Down: Unassigned 2.08, 0.93, 1.45.</p>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <title>3.2. Feedback Usage</title>
        <p>To investigate how students engaged with the personalized feedback, we examined both how quickly
they revised their answers and how extensively they incorporated the LLM’s feedback text. Short
intervals may indicate minimal attention to the feedback, whereas longer intervals could suggest
more deliberate review. This also facilitates assessing whether rapid resubmissions align with
potential “copy-paste” behavior.</p>
        <p>The analysis focuses on cases in which the first attempt was incorrect (C&amp;C 28.4%) or
non-genuine (exam question 15.5%, C&amp;C 17.7%). Although FITB items show high persistence (61.2%
unassigned, 94.7% assigned), only 18.2% of exam-question sessions and 13.2% of C&amp;C sessions with
a non-correct first attempt proceeded to a second attempt. Tables 4 and 5 present the elapsed time between
first and second attempts and the overlap score between each second attempt and the LLM feedback,
disaggregated by question type, answer pattern (e.g., incorrect → correct), and assignment context.
Second attempts on exam questions are classified only as genuine or non-genuine. All cells represent
more than 100 sessions, except for most incorrect or non-genuine second attempts (37–107 sessions)
and two specific cells, incorrect → non-genuine for unassigned and assigned, which contain
just 9 and 11 sessions, respectively.</p>
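The exact definition of the overlap score is not given in this excerpt. One plausible sketch, assuming the Ratcliff-Obershelp gestalt similarity of [33] (which Python's difflib implements) applied to word sequences, is shown below; both the word-level tokenization and the `overlap_score` helper are assumptions for illustration.

```python
# Sketch of a feedback-overlap score using Ratcliff-Obershelp similarity,
# the gestalt pattern-matching approach of [33], as implemented by difflib.
# Word-level comparison is an assumption, not the study's stated method.
from difflib import SequenceMatcher

def overlap_score(feedback: str, second_attempt: str) -> float:
    """Similarity in [0, 1] between LLM feedback and the revised answer."""
    a = feedback.lower().split()
    b = second_attempt.lower().split()
    # ratio() = 2 * (matched elements) / (total elements in both sequences)
    return SequenceMatcher(None, a, b).ratio()

fb = "mitosis produces two identical diploid cells"
ans = "mitosis produces two identical diploid cells in somatic tissue"
print(round(overlap_score(fb, ans), 3))
```

A score near 1 would suggest near-verbatim reuse of the feedback, while low scores with long elapsed times are consistent with the rewriting and paraphrasing patterns discussed below.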
        <p>Several overall patterns are noticeable in the time intervals. First, for every
response pattern, across each quartile except one, the assigned group took less time to respond the
second time; in many cases roughly half the time, as seen in the C&amp;C incorrect → correct
response pattern. At first this seemed counterintuitive, as one might assume that students in the
unassigned group would put in less effort (i.e., less time) than their assigned peers. However, the
number of questions each group answers changes the interpretation.
Students in the unassigned group answer a mean of only 2.4 exam questions and 2.8 C&amp;C questions, while
students in the assigned group answer a mean of 51.7 exam questions and 55.5 C&amp;C questions. Students in the
assigned group, familiar with the expectations, respond faster, whereas the unassigned group likely
requires additional time due to limited experience with the question types.</p>
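The quartile summaries of elapsed time reported in Tables 4 and 5 can be computed with the standard library; the sample values below are made up for illustration, and inclusive interpolation is an assumed choice.

```python
# Quartiles of elapsed time (seconds) between first and second attempts.
# The sample values are illustrative, not the study's data.
from statistics import quantiles

elapsed_seconds = [12, 18, 25, 31, 44, 58, 73, 95, 120, 240]

# method="inclusive" interpolates linearly between order statistics,
# treating the observed sessions as the full population of interest
q1, median, q3 = quantiles(elapsed_seconds, n=4, method="inclusive")
print(q1, median, q3)
```

Comparing such quartiles across assignment contexts and response patterns is the basis for the observations above, e.g. that assigned students' second attempts arrive in roughly half the time.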
        <p>
Another intriguing finding is how similar the elapsed times are in each quartile for both assigned
and unassigned groups for the exam question non-genuine → genuine and C&amp;C non-genuine → correct
response patterns. Prior research established that a percentage of students who input non-genuine
responses to FITB questions follow them with the correct response, indicating a strategy of revealing
feedback to use as scaffolding [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ]. The similarity of elapsed times for the non-genuine to genuine/correct response
patterns suggests a similar strategy is being employed here.
        </p>
        <p>The overlap for non-genuine → non-genuine responses, for both question types and both
groups, was 10.4% or less. Students who continued to enter non-genuine responses
after receiving feedback did not appear to consider the feedback or attempt to reuse it.
For C&amp;C questions where students were incorrect on both attempts, the time intervals were among
the highest across all quartiles, yet overlap was low (≤ 29.2%). This may reflect prolonged struggle or
repeated guesswork.</p>
        <p>However, overlap scores for exam questions (non-genuine → genuine) and C&amp;C questions
(incorrect → correct) reveal a wider range. Although the upper end of the overlap range still suggests
significant reliance on the LLM’s explanation, the lower overlap scores and longer time intervals
may imply more genuine reflection and partial rewriting or paraphrasing rather than copying
verbatim. The literal reuse of feedback does not necessarily impede learning—some learners may
paraphrase or synthesize the feedback effectively—yet identifying instances of minimal revision can
clarify the extent of students’ engagement with the system’s feedback.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Incorporating generative AI into educational technology should maintain focus on research-based
methods that benefit student learning and adhere to responsible AI principles. Adding
open-ended questions that engage higher-order cognitive processes, combined with personalized feedback,
to an existing AQG system provides students with a robust formative learning tool. This large-scale
investigation of open-ended question types with LLM-enabled feedback provides a valuable
comparison of performance metrics against established AQG practice benchmarks. Addressing our second
research question, we find that assigning questions has a profound effect on engagement, with clear
impact on the remaining performance metrics, indicating that structured classroom use encourages
students to invest more effort in these cognitively demanding tasks. Identifying effective
strategies to encourage engagement in unassigned contexts remains an important direction for
future research. Regarding research question one, assigned contexts showed higher difficulty and
lower persistence for the new question types compared to FITB items, as expected given the greater
effort required. The exam questions saw notably less engagement, which, combined with their thumbs
down ratings, indicates a need for further consideration of how frequently they appear
in the textbook.</p>
      <p>Studying the use of feedback (research question three) through both the time intervals between first
and second attempts and the text overlap between the feedback and student responses
revealed several patterns in student behavior. The time interval between responses was shorter in
assigned contexts, suggesting that experience answering more open-ended questions decreased the time
it took students to craft a second attempt. The overlap analysis shows that many learners with
incorrect or non-genuine first responses incorporated moderate to large portions of the LLM’s
feedback into their next correct submission. However, rephrasing or revising after
a copy-paste does not necessarily preclude learning. A promising area for future research is further
analyzing subgroups of student responses, including non-genuine responses, to reveal additional
ways LLM-enabled feedback could scaffold learners.</p>
      <p>As more domain-level, student-level, and other factors emerge from continued usage data, future
work may employ more rigorous statistical modeling (e.g., mixed effects regression) to examine these
factors in greater depth. In addition, because correctness and non-genuine responses were
determined via an LLM-based evaluator, there is a possibility of classification errors or biases. Future
analyses can consider sampling student responses for expert review and refining LLM prompts if
necessary. Overall, these findings highlight both the promise and complexity of leveraging LLM
technology to expand the cognitive range of automated practice. As generative AI continues to
advance, maintaining rigorous analyses of usage patterns and performance metrics will remain
crucial for ensuring that new capabilities genuinely advance student learning rather than merely
accelerating the completion of tasks. In this case, we are satisfied that this first investigation shows
a valid application of LLM abilities to provide the personalized feedback required by open-ended
questions to support learning.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used OpenAI o3 and GPT-4.5 for refining draft
content, paraphrasing and rewording, and grammar and spelling checks. After using these tools, the authors
reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
      <p>[23] Evert, S. (2009). Corpora and collocations. In A. Lüdeling &amp; M. Kytö (Eds.), Corpus linguistics: An international handbook (Vol. 2, pp. 1212–1248). Mouton de Gruyter. https://doi.org/10.1515/9783110213881.2.1212</p>
      <p>[24] Yu, F.-Y., &amp; Pan, C.-C. (2014). The effects of student question-generation with online prompts on learning. Educational Technology &amp; Society, 17(3), 267–279. https://www.jstor.org/stable/jeductechsoci.17.3.267</p>
      <p>[25] Rosenshine, B., Meister, C., &amp; Chapman, S. (1996). Teaching students to generate questions: A review of the intervention studies. Review of Educational Research, 66(2), 181–221. https://doi.org/10.3102/00346543066002181</p>
      <p>[26] Alfieri, L., Nokes-Malach, T. J., &amp; Schunn, C. D. (2013). Learning through case comparisons: A meta-analytic review. Educational Psychologist, 48(2), 87–113. https://doi.org/10.1080/00461520.2013.775712</p>
      <p>[27] Gentner, D., &amp; Namy, L. L. (1999). Comparison in the development of categories. Cognitive Development, 14(4), 487–513. https://doi.org/10.1016/S0885-2014(99)00016-7</p>
      <p>[28] Aleven, V., &amp; Koedinger, K. (2000). Limitations of student control: Do students know when they need help? In Proceedings of the 5th International Conference on Intelligent Tutoring Systems (pp. 292–303).</p>
      <p>[29] OpenAI. (2024, August 8). GPT-4o system card. https://openai.com/index/gpt-4o-system-card/</p>
      <p>[30] VitalSource Supplemental Data Repository. (2025). https://github.com/vitalsource/data</p>
      <p>[31] Johnson, B. G., Dittel, J., &amp; Van Campenhout, R. (2024). Investigating student ratings with features of automatically generated questions: A large-scale analysis using data from natural learning contexts. In Proceedings of the 17th International Conference on Educational Data Mining (pp. 194–202). https://doi.org/10.5281/zenodo.12729796</p>
      <p>[32] OpenAI. (2024, July 18). GPT-4o mini: advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/</p>
      <p>[33] Ratcliff, J. W., &amp; Metzener, D. E. (1988). Pattern matching: The gestalt approach. Dr. Dobb’s Journal, 13(7), 46–51.</p>
      <p>[34] Jerome, B., Van Campenhout, R., Dittel, J. S., Benton, R., &amp; Johnson, B. G. (2023). Iterative improvement of automatically generated practice with the Content Improvement Service. In R. Sottilare &amp; J. Schwarz (Eds.), Adaptive Instructional Systems. HCII 2023. Lecture Notes in Computer Science (pp. 312–324). Springer, Cham. https://doi.org/10.1007/978-3-031-34735-1_22</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Kurdi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Al-Emari</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>A systematic review of automatic question generation for educational purposes</article-title>
          .
          <source>International Journal of Artificial Intelligence in Education</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          ),
          <fpage>121</fpage>
          -
          <lpage>204</lpage>
          . https://doi.org/10.1007/s40593-019-00186-y
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Black</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wiliam</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Inside the black box: raising standards through classroom assessment</article-title>
          .
          <source>Phi Delta Kappan</source>
          ,
          <volume>92</volume>
          (
          <issue>1</issue>
          ),
          <fpage>81</fpage>
          -
          <lpage>90</lpage>
          . https://doi.org/10.1177/003172171009200119
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Theobald</surname>
            ,
            <given-names>E. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arroyo</surname>
            ,
            <given-names>E. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Behling</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chambwe</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cintr</surname>
            ,
            <given-names>D. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cooper</surname>
            ,
            <given-names>J. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dunster</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grummer</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hennessey</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsiao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Iranon, N.,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordt</surname>
          </string-name>
          , H., Keller, M.,
          <string-name>
            <surname>Lacey</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Littlefield</surname>
            ,
            <given-names>C. E.</given-names>
          </string-name>
          , …
          <string-name>
            <surname>Freeman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Active learning narrows achievement gaps for underrepresented students in undergraduate science, technology, engineering, and math</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          ,
          <volume>117</volume>
          (
          <issue>12</issue>
          ),
          <fpage>6476</fpage>
          -
          <lpage>6483</lpage>
          . https://doi.org/10.1073/pnas.1916903117
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Koedinger</surname>
            ,
            <given-names>K. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McLaughlin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Bier</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Learning is not a spectator sport: Doing is better than watching for learning from a MOOC</article-title>
          .
          <source>In Proceedings of the Second ACM Conference on Learning@Scale</source>
          (pp.
          <fpage>111</fpage>
          -
          <lpage>120</lpage>
          ). https://doi.org/10.1145/2724660.2724681
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Koedinger</surname>
            ,
            <given-names>K. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McLaughlin</surname>
            ,
            <given-names>E. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>J. Z.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Bier</surname>
            ,
            <given-names>N. L.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Is the doer effect a causal relationship? How can we tell and why it's important</article-title>
          .
          <source>In Proceedings of the Sixth International Conference on Learning Analytics &amp; Knowledge</source>
          (pp.
          <fpage>388</fpage>
          -
          <lpage>397</lpage>
          ). http://dx.doi.org/10.1145/2883851.2883957
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Van</surname>
            <given-names>Campenhout</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Jerome</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Dittel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            , &amp;
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. G.</surname>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>The doer effect at scale: Investigating correlation and causation across seven courses</article-title>
          .
          <source>In Proceedings of the 13th International Learning Analytics and Knowledge Conference (LAK 2023)</source>
          (pp.
          <fpage>357</fpage>
          -
          <lpage>365</lpage>
          ). https://doi.org/10.1145/3576050.3576103
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Van</surname>
            <given-names>Campenhout</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Dittel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            ,
            <surname>Jerome</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            , &amp;
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. G.</surname>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Transforming textbooks into learning by doing environments: An evaluation of textbook-based automatic question generation</article-title>
          .
          <source>In Third Workshop on Intelligent Textbooks at the 22nd International Conference on Artificial Intelligence in Education CEUR Workshop Proceedings</source>
          (pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          ). https://ceur-ws.org/Vol-2895/paper06.pdf
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>B. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dittel</surname>
            ,
            <given-names>J. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van</surname>
            <given-names>Campenhout</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            , &amp;
            <surname>Jerome</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Discrimination of automatically generated questions used as formative practice</article-title>
          .
          <source>In Proceedings of the Ninth ACM Conference on Learning@Scale</source>
          (pp.
          <fpage>325</fpage>
          -
          <lpage>329</lpage>
          ). https://doi.org/10.1145/3491140.3528323
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Van</surname>
            <given-names>Campenhout</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Jerome</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Dittel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            , &amp;
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. G.</surname>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Advancing intelligent textbooks with automatically generated practice: A large-scale analysis of student data</article-title>
          .
          <source>5th Workshop on Intelligent Textbooks. The 24th International Conference on Artificial Intelligence in Education</source>
          (pp.
          <fpage>15</fpage>
          -
          <lpage>28</lpage>
          ). https://intextbooks.science.uu.nl/workshop2023/files/itb23_s1p2.pdf
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Van</surname>
            <given-names>Campenhout</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Dittel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            ,
            <surname>Brown</surname>
          </string-name>
          , N.,
          <string-name>
            <surname>Benton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>B. G.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Exploring student persistence with automatically generated practice using interaction patterns</article-title>
          .
          <source>2023 International Conference on Software, Telecommunications and Computer Networks (SoftCOM)</source>
          (pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ). https://doi.org/10.23919/SoftCOM58365.2023.10271578
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Van</surname>
            <given-names>Campenhout</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Kimball</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Dittel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            ,
            <surname>Jerome</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            , &amp;
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. G.</surname>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>An investigation of automatically generated feedback on student behavior and learning</article-title>
          .
          <source>In Proceedings of LAK24: 14th International Learning Analytics and Knowledge Conference</source>
          (pp.
          <fpage>850</fpage>
          -
          <lpage>856</lpage>
          ). https://doi.org/10.1145/3636555.3636901
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Van</surname>
            <given-names>Campenhout</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. G.</given-names>
            ,
            <surname>Deininger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Harper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Odenweller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            , &amp;
            <surname>Wilgenbusch</surname>
          </string-name>
          ,
          <string-name>
            <surname>E.</surname>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Automatically generated practice in the classroom: Exploring performance and impact across courses</article-title>
          .
          <source>In Proceedings of the 32nd International Conference on Software, Telecommunications and Computer Networks (SoftCOM 2024)</source>
          (pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ). https://doi.org/10.23919/SoftCOM62040.
          <year>2024</year>
          .10721828
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>VanLehn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems</article-title>
          .
          <source>Educational Psychologist</source>
          ,
          <volume>46</volume>
          (
          <issue>4</issue>
          ),
          <fpage>197</fpage>
          -
          <lpage>221</lpage>
          . https://doi.org/10.1080/00461520.2011.611369
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Kulik</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Fletcher</surname>
            ,
            <given-names>J. D.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Effectiveness of intelligent tutoring systems: A meta-analytic review</article-title>
          .
          <source>Review of Educational Research</source>
          ,
          <volume>86</volume>
          (
          <issue>1</issue>
          ),
          <fpage>42</fpage>
          -
          <lpage>78</lpage>
          . https://doi.org/10.3102/0034654315581420
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Anderson</surname>
            <given-names>L. W.</given-names>
          </string-name>
          (Ed.),
          <string-name>
            <surname>Krathwohl</surname>
            <given-names>D. R.</given-names>
          </string-name>
          (Ed.),
          <string-name>
            <surname>Airasian</surname>
            <given-names>P. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cruikshank</surname>
            <given-names>K. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mayer</surname>
            <given-names>R. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pintrich</surname>
            <given-names>P. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raths</surname>
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wittrock</surname>
            <given-names>M. C.</given-names>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>A taxonomy for learning, teaching, and assessing: A revision of Bloom's Taxonomy of Educational Objectives (Complete edition)</article-title>
          .
          <source>Longman</source>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pardos</surname>
            ,
            <given-names>Z. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>R. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smyth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Slater</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Warschauer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Mining big data in education: Affordances and challenges</article-title>
          .
          <source>Review of Research in Education</source>
          ,
          <volume>44</volume>
          (
          <issue>1</issue>
          ),
          <fpage>130</fpage>
          -
          <lpage>160</lpage>
          . https://doi.org/10.3102/0091732X20903304
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>McFarland</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khanna</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingue</surname>
            ,
            <given-names>B. W.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Pardos</surname>
            ,
            <given-names>Z. A.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Education data science: Past, present, future</article-title>
          .
          <source>AERA Open</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . https://doi.org/10.1177/23328584211052055
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Van Campenhout</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soto-Karlin</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Selinger</surname>
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jerome</surname>
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2025</year>
          ).
          <article-title>Learning engineering in practice: A case study on developing LLM-based educational tools</article-title>
          . In
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Sottilare</surname>
          </string-name>
          &amp;
          <string-name>
            <given-names>J.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          (Eds.),
          <source>Adaptive instructional systems. HCII 2025. Lecture Notes in Computer Science</source>
          (Vol.
          <volume>15813</volume>
          , pp.
          <fpage>132</fpage>
          -
          <lpage>150</lpage>
          ). Springer. https://doi.org/10.1007/978-3-031-92970-0_10
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Van Campenhout</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Johnson</surname>
            <given-names>B. G.</given-names>
          </string-name>
          (
          <year>2025</year>
          ).
          <article-title>AI principles in practice with a learning engineering framework</article-title>
          .
          <source>In Proceedings of the 17th International Conference on Computer Supported Education</source>
          (pp.
          <fpage>312</fpage>
          -
          <lpage>318</lpage>
          ). https://doi.org/10.5220/0013358600003932
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Chi</surname>
            ,
            <given-names>M. T. H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wylie</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>The ICAP framework: Linking cognitive engagement to active learning outcomes</article-title>
          .
          <source>Educational Psychologist</source>
          ,
          <volume>49</volume>
          (
          <issue>4</issue>
          ),
          <fpage>219</fpage>
          -
          <lpage>243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Honnibal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montani</surname>
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Landeghem</surname>
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Boyd</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>spaCy: Industrial-strength natural language processing in Python</article-title>
          . https://doi.org/10.5281/zenodo.1212303
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Mihalcea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Tarau</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>TextRank: Bringing order into text</article-title>
          .
          <source>In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing</source>
          (pp.
          <fpage>404</fpage>
          -
          <lpage>411</lpage>
          ). https://aclanthology.org/W04-3252
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>