<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Large Language Models for Evaluating Automatically Generated Questions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jeffrey S. Dittel</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michelle W. Clark</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rachel Van Campenhout</string-name>
          <email>rachel@acrobatiq.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benny G. Johnson</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>VitalSource Technologies</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raleigh</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>North Carolina</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Automatic question generation has emerged as an effective and efficient method for incorporating formative practice into electronic textbooks on a large scale. This advancement, however, introduces new challenges in ensuring the quality of the generated questions. Traditionally, analyzing student responses has been effective in identifying low-quality questions. However, preemptively filtering out substandard questions before they reach students would be more desirable. In this study, we present preliminary findings on a promising technique that leverages a large language model (LLM) to identify potentially low-quality questions. Our hypothesis is that questions an LLM fails to answer correctly may contain quality issues, particularly since LLMs generally outperform students in answering automatically generated questions. Using a data set of questions from an open-source textbook, our method successfully identified nearly 30% of the questions that were rejected through analysis of student answer data. These results suggest that LLMs can be a valuable tool in improving the quality control process of automatically generated questions.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Automatic question generation</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>formative practice</kwd>
        <kwd>question evaluation</kwd>
        <kwd>content improvement service</kwd>
        <kwd>iterative improvement1</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Using artificial intelligence for automatic question generation (AQG) has emerged as an
effective and efficient way to add formative practice at scale to electronic textbooks. Placing
formative practice alongside the electronic text (etext) significantly enhances learning
outcomes by implementing a learning science principle known as the doer effect. The doer
effect is proven to be beneficial for all students and has approximately six times the impact on
learning than reading alone [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2, 3</xref>
        ]. Since 2022, more than 2.5 million automatically generated
(AG) questions have been placed into over 9,000 etexts as a free learning feature, named
CoachMe, within VitalSource’s Bookshelf electronic reader platform.
      </p>
      <p>This practice feature contains several types of AG questions, including fill-in-the-blank
(FITB), matching, multiple choice, and free response. The FITB questions, which comprise the
majority of the AG questions, are the focus of the present study. For details of the AQG process
used, see [4]. As formative practice, students are allowed as many attempts to answer as they
like, receiving immediate feedback, and can also reveal the answer if stuck.</p>
      <p>Research on questions generated through this artificial intelligence (AI) process found that
they performed equally as well with students as human-authored questions on several key
metrics [5, 6]. However, as anyone who has created educational content is aware, no content is
perfect, and it is inevitable that problems or errors occur. Just as no human could write millions
of perfectly performing questions, neither does AI always generate perfect questions. Using
AQG on this unprecedented scale presents a new challenge in how to monitor and perform
quality assurance on this enormous question set.</p>
      <p>The Content Improvement Service (CIS) is an automated adaptive system that was developed
in response to this practical need [7, 8]. The CIS is a platform-level system that monitors
realtime clickstream data for all questions delivered in all e-textbooks. The CIS evaluates question
quality using a variety of modular plug-in tools and operates independently without requiring
human involvement. There are currently two primary evaluation methods used. One is a
Bayesian analysis that removes questions with a mean score likely to be below a minimum
acceptable threshold. The other uses student feedback given through a thumbs up or down
rating mechanism. Questions with more than one thumbs down rating within the first 100
students answering are removed. For more details on the CIS analysis methods, see [8].</p>
      <p>Even though the CIS can detect unacceptable questions using a relatively small amount of
data [8], it would be more desirable to preemptively remove these before releasing them to
students. To this end, post-AQG human assessment of question quality is common due to the
well-known limitations of AG questions. From a review of the AQG evaluation literature by
Kurdi et al. [9], studies that reported on “question acceptability or overall quality” from human
review found percentages of acceptable questions ranging from 50% to 93%. Early in the
CoachMe release, a human review pass was performed by the AQG development team to check
for common AQG quality issues that were not subject matter-related and did not require
pedagogical expertise. For details of this review process, see [4].</p>
      <p>There were practical difficulties with the human review process, however. For one, it was
not scalable to keep up with the large volume of questions needing review. For expedience, it
was also necessary to review the questions in isolation in spreadsheets outside of the etext, i.e.,
without the context of the textbook material for which the questions serve as practice. The pace
at which questions could be reviewed would have been much slower within the etext. While
isolated review increased the volume of questions that could be reviewed, this potentially
negatively affected review quality. Specifically, this led to retaining some questions that may
not have been considered acceptable with the benefit of contextual information. Furthermore,
reviewers were not subject matter experts (SMEs) in most domains they reviewed. It was not
practical to recruit SMEs (e.g., higher education instructors) for question review across multiple
disciplines at the scale needed. In a regression model on a large question usage data set from a
recent study [10], whether a question had passed a manual review (under the conditions
described) did not have a statistically significant impact on whether a student rated the question
as not helpful.</p>
      <p>Not surprisingly, the manual review step was discontinued as the scale of CoachMe
increased. However, the major difficulties with manual review, namely scale, inability to
consider context, and lack of reviewer subject matter expertise, could potentially be mitigated
or eliminated by incorporating a large language model (LLM) into an automated review process.
Volume is not an issue for an LLM, and both time and cost are significantly reduced compared
to human review. Unlike with human review, it is practical (and prudent) to give an LLM
contextual information to consider. And, given their massive training data sets, it seems
reasonable that an LLM could have greater subject matter expertise in many domains than a
non-SME human reviewer. For example, as seen in the Results and Discussion section, the LLM
used in this work was able to answer CoachMe questions correctly at a much higher rate than
students.</p>
      <p>Use of an LLM may therefore have potential to meaningfully enhance automated question
review and reduce the number of unacceptable questions that must be detected through
collection and analysis of student data by the CIS. In this work, we take a first step in
investigating this possibility. Our hypothesis is questions that an LLM fails to answer correctly
may contain issues that would lead to their rejection based on analysis of student answer data.
To be clear, here we are using an LLM to attempt to improve assessment of the quality of
questions generated by the CoachMe AQG process; CoachMe does not yet use LLMs in the
question generation process itself.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>The data set in this study is from use of CoachMe in the Chemistry 101 course at a U.S. major
public university in the Fall 2023 semester. A total of 744 students were enrolled in the course.
The textbook used in the course was Chemistry: Atoms First [11] from OpenStax. The textbook
contained 282 AG formative practice questions added by CoachMe, of which 106 were FITB
cloze questions. The implementation of CoachMe by the course instructors was that students
who answered 75% of the questions in relevant chapters (regardless of correctness) would have
an additional reading quiz dropped at the end of the semester.</p>
      <p>Since the CIS rejects questions receiving more than one thumbs down within the first 100
students, questions having at or near 100 answering students were selected for analysis, using
a natural break in the data observed at 88 students. (The mean score analysis can typically reach
a decision on question rejection in fewer than 100 students.) This resulted in 54 of the 106 FITB
questions being selected. The mean number of students answering each question was 118.3 and
the overall question mean score on the first attempt was 41.2%. A total of 247 students answered
questions in the data set. The mean number of questions answered per student was 25.9, with
71 students answering all 54 questions in the data set. Of these 54 questions, 27 (50%) would be
flagged for rejection using one or both CIS criteria (21 by mean score and 13 by student ratings).
Note that all selected questions had passed the human review process, even though half ended
up being rejected by the CIS analyses.</p>
      <p>The LLM evaluation work was done using GPT-4 [12]. The first evaluation performed was
simply to direct the LLM to answer the question as if it were a student. GPT-4’s temperature
parameter was set to 0 (the lowest value), which causes it to respond with the answer word that
is most probable under the model. GPT-4 uses a system prompt to provide instructions for the
conversation, such as setting a role for the LLM to assume, and a user prompt that provides the
actual query. The system prompt used was “You are a college student with a 4.0 GPA.”
The user prompt, illustrated using one of the questions from the data set, was
Here is a fill-in-the-blank question from your textbook. Please answer with
the word that best fits in the blank. Answer only with a single word.
Question: The uncertainty principle can be shown to be a consequence of
wave-particle duality, which lies at the heart of what distinguishes modern
______ theory from classical mechanics.</p>
      <p>Answer:</p>
      <p>For the above question, GPT-4 responded with “quantum”, which is correct.  Answering a
question incorrectly was taken as a predictor that the question will be rejected by the CIS. Why
might this simple criterion be useful? Since the LLM is much better at answering the questions
than students, an incorrect answer may be due to a defect in the question rather than a limitation
in the LLM’s knowledge; examples are presented below. Precision and recall were calculated
for this criterion.</p>
      <p>
        When CoachMe is used as intended, students answer the questions while reading the
textbook content; this leads to improved learning via the doer effect [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2, 3</xref>
        ]. The evaluation
was therefore repeated providing the LLM with the complete paragraph in which the question’s
sentence appeared, with the same answer word removed. This is intended to resemble how
students should use the formative practice more closely, i.e., answering the questions
immediately after reading the relevant textbook content.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Discussion</title>
      <p>When the LLM was given each question to answer without additional context, 43 of 54 (79.6%)
were answered correctly. This is in stark contrast with the first attempts by students, which
were only 41.2% correct. A z test of two proportions shows this difference is statistically
significant (p &lt;&lt; .001).</p>
      <p>Taking an incorrect answer as predicting rejection by the CIS had precision 72.7% and recall
29.6%. This precision is perfectly acceptable. Maximizing precision is not a pressing concern
since the AQG process generates more questions than needed, with the rest held in reserve as
replacements for questions rejected by the CIS [8]. Recall is also reasonable considering the
minimal effort involved in obtaining it, i.e., merely checking if the LLM answers the question
incorrectly (more on this topic below).</p>
      <p>An example of a question correctly predicted for rejection by the CIS is</p>
      <p>The order of a(n) ______ bond is a guide to its strength; a bond between
two given atoms becomes stronger as the bond order increases.</p>
      <p>The correct answer is “covalent” and the LLM’s answer was “chemical”. While both are
reasonable words for completing the sentence in isolation, “covalent” is more specific to the
textbook context on the topic of nolecular orbitals for diatomic molecules, which concerns
covalent bonding.</p>
      <p>Another question correctly predicted for rejection is</p>
      <p>One particularly characteristic ______ of waves results when two or more
waves come into contact: They interfere with each other.</p>
      <p>The correct answer word (i.e., appearing in the textbook sentence) is “phenomenon” while the
LLM answered “property”. Here, these words are synonymous, completing the sentence equally
well (even considering context), but “property” would be counted as incorrect. This illustrates
how the LLM-based answer criterion can be useful, by identifying when an equally good answer
word as the one used by the textbook author exists.</p>
      <p>Providing the textbook paragraph in which the sentence occurred as context is expected to
increase the proportion of questions correctly answered. This was observed, with 46 of 54
(85.2%) correct, compared to 79.6% correct without this contextual information. More correctly
answered questions means fewer questions predicted as rejected, and so recall should decrease
and precision increase. While this was the case, with precision 75.0% and recall 22.2%,
interestingly, the difference made by this additional information was small.</p>
      <p>Although many questions rejected by the CIS were not identified by this method, it is
important to note that it is only one of several used to identify unacceptable questions without
student data. The focus is detecting cases where the answer word is not sufficiently predictable
from the question and the LLM’s background knowledge of the subject. This is only one way
AG questions can prove unacceptable [4, 9], and thus high recall of rejected questions is not
necessary as a measure of success. An example question that the LLM answered correctly but
was given thumbs down by multiple students illustrates this point.</p>
      <p>Because a hydrogen ______ molecule contains two oxygen atoms, as opposed
to the water molecule, which has only one, the two substances exhibit very
different properties.</p>
      <p>The LLM correctly answered “peroxide”, which is highly predictable in this context, and thus
the question was not predicted for rejection. However, students viewed the question as not
helpful because this sentence was serving as an example to illustrate a central concept (the
chemical mole concept), and not as an important fact that needed to be retained. This reason
was not related to the answer word’s predictability.</p>
      <p>We have investigated a very simple but powerful method of identifying unacceptable AG
questions without student data. Preliminary results are promising, identifying almost 30% of
questions that would be rejected by data analysis if released to students. Directions for
continued work involve analyzing a larger and more varied question sample, improving
precision and recall by involving the LLM more directly in the question generation process (e.g.,
in selecting the answer words, not just assessing them post-AQG), and extension to other AG
question types like multiple choice.</p>
      <p>The data for this study (AG questions, anonymized student interaction events, and LLM
analysis results) are available at our open-source data repository [13].
[3] R. Van Campenhout, B. Jerome, and B. G. Johnson, The doer effect at scale: Investigating
correlation and causation across seven courses, in LAK23: 13th International Learning
Analytics and Knowledge Conference (LAK 2023), 2023, pp. 357–365. doi:
10.1145/3576050.3576103.
[4] B. G. Johnson, R. Van Campenhout, B. Jerome, M. F. Castro, R. Bistolfi, and J. S. Dittel,
Automatic question generation for Spanish textbooks: Evaluating Spanish questions
generated with the parallel construction method, International Journal of Artificial
Intelligence in Education, vol. 34, no. 2, pp. 123-137, Apr. 2024. doi:
10.1007/s40593-02400394-1.
[5] R. Van Campenhout, J. S. Dittel, B. Jerome, and B. G. Johnson, Transforming textbooks into
learning by doing environments: An evaluation of textbook-based automatic question
generation, in Third Workshop on Intelligent Textbooks at the 22nd International
Conference on Artificial Intelligence in Education CEUR Workshop Proceedings, 2021, pp.
1–12. [Online]. Available: https://ceur-ws.org/Vol-2895/paper06.pdf.
[6] B. G. Johnson, J. S. Dittel, R. Van Campenhout, and B. Jerome, Discrimination of
automatically generated questions used as formative practice, in Proceedings of the Ninth
ACM Conference on Learning@Scale, 2022, pp. 325–329. doi: 10.1145/3491140.3528323.
[7] B. Jerome, R. Van Campenhout, J. S. Dittel, R. Benton, S. Greenberg, and B. G. Johnson, The
Content Improvement Service: An adaptive system for continuous improvement at scale,
in Interaction in New Media, Learning and Games. HCII 2022. Lecture Notes in Computer
Science, A. Meiselwitz et al., Eds. Cham: Springer, 2022, vol. 13517, pp. 286–296. doi:
10.1007/978-3-031-22131-6_22.
[8] B. Jerome, R. Van Campenhout, J. S. Dittel, R. Benton, and B. G. Johnson, Iterative
improvement of automatically generated practice with the Content Improvement Service,
in Adaptive Instructional Systems. HCII 2023. Lecture Notes in Computer Science, R.
Sottilare and J. Schwarz, Eds. Cham: Springer, 2023, pp. 312-324. doi:
10.1007/978-3-03134735-1_22.
[9] G. Kurdi, J. Leo, B. Parsia, U. Sattler, and S. Al-Emari, A systematic review of automatic
question generation for educational purposes, International Journal of Artificial
Intelligence in Education, vol. 30, no. 1, pp. 121–204, 2020. doi:
10.1007/s40593-019-00186y.
[10] B. G. Johnson, J. S. Dittel, and R. Van Campenhout, Investigating student ratings with
features of automatically generated questions: A large-scale analysis using data from
natural learning contexts, accepted for publication in EDM 2024: The 17th International
Conference on Educational Data Mining, Atlanta, Georgia, USA, July 14–17, 2024.
[11] P. Flowers, E. J. Neth, W. R. Robinson, K. Theopold, and R. Langley, Chemistry: Atoms</p>
      <p>First, 2nd ed. Houston: OpenStax, 2019.
[12] OpenAI, GPT-4 Technical Report, ArXiv, 2023. [Online]. Available:
https://arxiv.org/abs/2303.08774.
[13] VitalSource Technologies, VitalSource Supplemental Data Repository, 2024. [Online].</p>
      <p>Available: https://github.com/vitalsource/data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Koedinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>McLaughlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and N. L.</given-names>
            <surname>Bier</surname>
          </string-name>
          ,
          <article-title>Learning is not a spectator sport: Doing is better than watching for learning from a MOOC</article-title>
          ,
          <source>in Proceedings of the Second ACM Conference on Learning @ Scale (L@S '15)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>120</lpage>
          . doi:
          <volume>10</volume>
          .1145/2724660.2724681.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Koedinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>McLaughlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Jia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. L.</given-names>
            <surname>Bier</surname>
          </string-name>
          ,
          <article-title>Is the doer effect a causal relationship? How can we tell and why it's important</article-title>
          ,
          <source>in Proceedings of the Sixth International Conference on Learning Analytics &amp; Knowledge (LAK '16)</source>
          , Edinburgh, United Kingdom,
          <year>2016</year>
          , pp.
          <fpage>388</fpage>
          -
          <lpage>397</lpage>
          . doi:
          <volume>10</volume>
          .1145/2883851.2883957.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>