<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Developing an Automated Evaluation Tool for Multiple-Choice Questions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Steven Moore</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carnegie Mellon University</institution>
          ,
          <addr-line>5000 Forbes Ave, Pittsburgh, Pennsylvania, 15213</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The use of multiple-choice questions (MCQs) in higher education has increased due to their efficiency, objective grading, ability to generate item-analysis data, and short response time. Recently, learnersourcing has emerged as a method to scale up MCQ creation by involving students in the question creation process. While previous research has shown that students can effectively generate high-quality questions, the evaluation of student-generated questions remains a challenge due to subjectivity in human evaluation. The Item-Writing Flaws (IWF) rubric provides a standardized way to evaluate MCQs, but its application has relied on experts, making it difficult to scale. With recent advances in natural language processing, it may be possible to automatically apply the IWF rubric to student-generated questions, providing real-time feedback to students during the question creation process. Additionally, the Bloom's Taxonomy level and knowledge components (KCs) of these student-generated questions can also be automatically identified. The goal of this research is to develop a tool that provides automatic evaluation and feedback for student-generated questions in the learnersourcing process, resulting in the creation of higher quality questions in a more efficient manner. First, we will investigate methods for automatically assessing educational MCQs for their quality and pedagogical usefulness. Second, we propose an extensive two-phase study to evaluate the effectiveness of our tool for generating and evaluating multiple-choice questions in higher education. This will be done by observing how students utilize the tool across different academic domains within online courseware.</p>
      </abstract>
      <kwd-group>
        <kwd>Learnersourcing</kwd>
        <kwd>Question Generation</kwd>
        <kwd>Automatic Question Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Multiple-choice questions (MCQs) are a
widely used form of assessment in higher
education, both for formative and summative
evaluations. MCQs are advantageous because
of their efficiency to score, objective grading,
ability to generate item-analysis data, and the
shorter time required for students to respond
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In recent years, authoring educational
MCQs has extended beyond instructors, and
has been scaled up by leveraging students in the
process of question creation [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. This is known
as learnersourcing, where students complete
activities that produce new content that can then
be leveraged by future students [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Previous
research has demonstrated that despite having a
range of expertise, students can effectively
generate short-answer and multiple-choice
questions that are of comparable quality to
expert-generated ones [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Additionally,
research supports the act of question generation
as a beneficial learning activity, so the benefit
is mutual. Learnersourcing efforts have also led
to the creation of several systems that allow
students to create and answer questions
generated by their peers [
        <xref ref-type="bibr" rid="ref10 ref6">6, 10</xref>
        ].
      </p>
      <p>
        These systems often rely on students evaluating other student-generated questions as a method of assessing question quality and usefulness. A common challenge in human evaluation of educational material is subjectivity, particularly when the evaluator's expertise is not in line with the question content. These student questions nevertheless need to be evaluated before other students work on them, as low-quality questions can not only waste student time but also be detrimental to learning [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. One common evaluation method is the Item-Writing Flaws (IWF) rubric, which has typically been applied by experts. This rubric contains 19 criteria and provides a standardized way to evaluate multiple-choice questions that accounts for their quality and pedagogical usefulness [
        <xref ref-type="bibr" rid="ref1 ref17">1, 17</xref>
        ]. Using this rubric helps to avoid the common pitfalls of evaluating questions based on subjective features or evaluator opinion about what constitutes a question's quality.
      </p>
      <p>
        While the use of the IWF rubric is an
effective way to evaluate multiple-choice
questions, previous efforts have relied on
experts to apply it, making it challenging to
scale. However, with recent advances in the
natural language processing domain, many of
the criteria for the rubric can be applied using a
series of rules implemented via most
programming languages [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Therefore, it may be possible to automatically apply the criteria to student-generated questions at the time they are being created, rather than evaluating them after the fact. Providing real-time feedback while students are actively authoring can streamline the generation of high-quality multiple-choice questions and prevent the need for students to evaluate other student-generated questions.
Leveraging recent advances, the Bloom’s
Taxonomy level of the questions could also be
automatically mapped, along with the skills or
knowledge components (KCs) required to solve
the question [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. After students use the automatic feedback from the IWF rubric to create a high-quality question, they can be shown the potential Bloom's Taxonomy level and KCs their question assesses and make any necessary changes or approve them.
      </p>
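      <p>As a minimal illustration of such rule-based checks (the specific criteria, patterns, and thresholds below are assumptions for this sketch, not the finalized rubric implementation), a few item-writing flaws can be flagged with simple string and length rules over a question's stem and options:</p>
      <preformat preformat-type="code"><![CDATA[
# Illustrative, simplified rule-based checks for three item-writing flaws.
# The real tool would cover the full 19-criterion rubric and may rely on
# NLP models rather than string rules for several criteria (assumption).
import re

def has_all_of_the_above(options):
    """Flag an 'all of the above' / 'all of these' style option."""
    return any(re.search(r"\ball of (the above|these)\b", o, re.I) for o in options)

def has_negative_stem(stem):
    """Flag negatively worded stems (e.g., NOT, EXCEPT)."""
    return re.search(r"\b(not|except|false)\b", stem, re.I) is not None

def longest_option_is_key(options, key_index):
    """Flag a correct answer that is conspicuously longer than the distractors."""
    lengths = [len(o) for o in options]
    mean_len = sum(lengths) / len(lengths)
    return lengths[key_index] == max(lengths) and lengths[key_index] > 1.5 * mean_len

def detect_flaws(stem, options, key_index):
    """Return the names of the (illustrative) flaws detected for one MCQ."""
    checks = {
        "all_of_the_above": has_all_of_the_above(options),
        "negative_stem": has_negative_stem(stem),
        "longest_option_is_key": longest_option_is_key(options, key_index),
    }
    return [name for name, flagged in checks.items() if flagged]

print(detect_flaws(
    stem="Which of the following is NOT a noble gas?",
    options=["Neon", "Argon", "Nitrogen", "All of the above"],
    key_index=2,
))  # ['all_of_the_above', 'negative_stem']
]]></preformat>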
      <p>Ultimately, if students’ effort is going to be
applied to a learning activity that involves them
generating multiple-choice questions, then the
process should be mindful of their time and learning. By receiving automated feedback as they generate questions, students should produce not only high-quality questions, but questions that are mapped to a Bloom's Taxonomy label and a set of skills or knowledge components required to answer them. In doing so, when the
questions are used by other students, the proper
learning analytics can be leveraged to better
monitor student learning. Towards this goal, we
pursue these research questions:
1. Is it possible to automatically assess
student-generated multiple-choice questions
with the item-writing flaws rubric?
2. Can a tool be developed that provides
automatic evaluation and feedback for
student-generated questions in the
learnersourcing process, alleviating the need
for human evaluation?
3. To what extent does the use of the tool
result in students creating higher quality
questions in a more efficient manner
compared to traditional methods?</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. Automatic</title>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
    </sec>
    <sec id="sec-5">
      <title>Question</title>
      <p>
        Educational MCQs generated by instructors,
students, or automatically are all susceptible to
flaws that impact their efficacy and quality [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
One challenge in evaluating MCQs’ quality lies
in determining what criteria are sufficient to
quantify a question as being high-quality and
effective for use in an educational context. To
overcome this subjectivity, different item
response theory and statistical methods have
been utilized to evaluate student-generated
MCQs [
        <xref ref-type="bibr" rid="ref12 ref9">9, 12</xref>
        ]. However, these techniques require post-hoc analysis of student performance data, which can be detrimental to the learning process: if the questions have not first been vetted for quality, poorly constructed items can negatively impact students' performance and achievement [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To help
overcome this, recent research has leveraged
different methods for automatically evaluating
questions.
      </p>
      <p>
        The automatic evaluation of questions often
utilizes metrics related to readability and
explainability, including natural language
processing (NLP) metrics like BLEU and
METEOR [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. However, these metrics have been shown not to correlate with human evaluation and to lack pedagogical implications.
Recent efforts towards automatic evaluation of
educational questions have relied on large
datasets of student responses, which are then
used to train different classification models [
        <xref ref-type="bibr" rid="ref15 ref16">15,
16</xref>
        ]. Obtaining datasets across diverse subject
areas poses challenges for these methods,
which often rely on limited publicly available
datasets consisting of basic reading
comprehension or lower grade-level academic
questions. These methods infrequently utilize
questions from complex domains that go
beyond the cognitive process of recall [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
Additionally, the model architectures used in these methods often lack interpretability, due to their simplistic evaluation criteria or black-box training methods.
      </p>
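      <p>For context, the sketch below (using NLTK's sentence_bleu; the example question strings are invented) shows what such a reference-overlap metric actually measures: n-gram similarity to a reference question, with no notion of pedagogical quality.</p>
      <preformat preformat-type="code"><![CDATA[
# Illustration only: BLEU rewards n-gram overlap with a reference question
# and says nothing about whether the generated question is pedagogically sound.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "what is the powerhouse of the cell".split()
candidate = "what is the powerhouse of a cell".split()

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")  # high surface overlap, no measure of question quality
]]></preformat>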
    </sec>
    <sec id="sec-6">
      <title>2.2. Item-Writing Flaws Rubric</title>
      <p>
        For evaluating the quality and pedagogical usefulness of educational multiple-choice questions, human evaluation remains the benchmark [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. While different rubrics have
been employed for this evaluation process, the
item-writing flaws (IWFs) rubric containing 19
criteria for assessing educational questions has
been standardized and evaluated via previous
research [
        <xref ref-type="bibr" rid="ref1 ref14 ref17">1, 14, 17</xref>
        ]. A previous study that
utilized this 19-item IWFs rubric assessed the
quality of over two thousand MCQs [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
Using two human evaluators, the study found that nearly half of the questions were unacceptable for educational use due to having more than one IWF. Such flawed questions may be skewed too easy or too hard, which in turn misleads students and related learning analytics [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
While this rubric is effective at evaluating educational questions, applying it often requires substantial human effort and is time-consuming, especially when evaluating large numbers of questions across multiple subject areas [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, many of the rubric criteria
can be automatically evaluated for questions,
reducing much of this effort.
      </p>
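      <p>Following the acceptability criterion reported in [<xref ref-type="bibr" rid="ref21">21</xref>], where a question with more than one IWF was deemed unacceptable, automated per-criterion flags could be aggregated as in the sketch below; the intermediate "acceptable with revision" category is an assumption added for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
# Sketch: aggregate automated per-criterion flags into the acceptability
# judgment used by Tarrant et al. [21], where more than one item-writing
# flaw makes a question unacceptable for educational use.
def acceptability(flaws_detected):
    """flaws_detected: list of criterion names flagged for one question."""
    if len(flaws_detected) == 0:
        return "acceptable"
    if len(flaws_detected) == 1:
        return "acceptable with revision"  # assumed intermediate category
    return "unacceptable"

print(acceptability([]))                                          # acceptable
print(acceptability(["negative_stem"]))                           # acceptable with revision
print(acceptability(["negative_stem", "longest_option_is_key"]))  # unacceptable
]]></preformat>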
    </sec>
    <sec id="sec-7">
      <title>2.3. Labeling Bloom’s Taxonomy and Knowledge Components</title>
      <p>
        Multiple-choice questions are often used to
assess lower levels of Bloom's Taxonomy, such
as remember and understand. However, MCQs
also have the potential to assess higher levels
such as application and analysis when they are
properly designed [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Assessing these higher
levels of Bloom’s Taxonomy is desirable, as the
higher cognitive processes are associated with
better student learning. Additionally, in order to
solve a problem, a student must also possess a
specific set of skills or knowledge components
(KCs). KCs are formally defined as specific
pieces of information necessary to solve a
problem [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Recent research has
demonstrated success in automatically
classifying the Bloom's Taxonomy label of
MCQs [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. This line of work has also shown promise in automatically suggesting KCs for questions. Combining these automated methods with previous work on learnersourcing KCs from students could lead to more expert-level results. Having the Bloom's Taxonomy label
and the set of KCs for a question is beneficial
in that it can fuel learning analytics systems to
better measure student learning. These labels
are also essential for many methods of
measuring student learning and providing
adaptivity, such as knowledge tracing [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
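      <p>One plausible route, shown below purely as a sketch and not as the classification method of [<xref ref-type="bibr" rid="ref22">22</xref>], is to treat the six Bloom's Taxonomy levels as candidate labels for a zero-shot classifier; the model choice and example stem are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
# Sketch: zero-shot labeling of a question stem with a Bloom's Taxonomy level.
# The model and the use of the six levels as candidate labels are assumptions
# for illustration, not the method used in prior work [22].
from transformers import pipeline

BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

stem = "Explain why increasing the temperature usually increases a reaction rate."
result = classifier(stem, candidate_labels=BLOOM_LEVELS)
print(result["labels"][0])  # highest-scoring candidate level for this stem
]]></preformat>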
    </sec>
    <sec id="sec-8">
      <title>3. Research Methods</title>
      <p>The initial phase of this research will
involve conducting a thorough literature review
of current learnersourcing systems and
automatic question evaluation methods. This review will examine how the domain of the course in which students are asked to generate questions might impact their success. This will help to identify potential challenges that exist between different domains for question generation and inform how we can work towards a domain-agnostic evaluation method.
We will also investigate the many different
question evaluation criteria used in prior
studies, focusing on criteria that are used in
educational contexts and include the
pedagogical applicability of the questions, such
as the IWF rubric. Finally, the literature review
will provide insights into how existing
learnersourcing systems for question
generation are used by students, which will help
inform the design decisions of our tool. The review's findings will serve as a foundation for a multiple-choice question evaluation process that can be combined with or expanded into a question authoring tool.</p>
      <p>
        Once an initial version of the tool is
developed, the sequential study has two phases:
evaluation of existing question datasets and a
user-study involving student utilization of the
tool. For the evaluation using existing datasets,
we will leverage educational multiple-choice
questions from previous studies that have
varying levels of question quality, such as from
the PeerWise platform or the LearningQ dataset
[
        <xref ref-type="bibr" rid="ref3 ref6">3, 6</xref>
        ]. We will also leverage expert-generated
questions from a variety of domains from
courses on the Open Learning Initiative (OLI)
platform, which can still contain potential flaws
that our system should be able to identify. We
will manually evaluate these questions using
our defined question evaluation criteria. From
there, we will run the automatic evaluation to
determine which criteria we might need to
improve upon. This will also help inform how our automatic evaluation is affected by different domains and question content, which may cause some criteria to be evaluated more successfully than others.
      </p>
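      <p>One way to quantify which criteria need improvement, sketched below with invented toy labels, is to compute a per-criterion agreement statistic between the manual and automatic judgments (Cohen's kappa here; the choice of statistic is an assumption on our part):</p>
      <preformat preformat-type="code"><![CDATA[
# Sketch: per-criterion agreement between manual and automatic IWF judgments,
# used to decide which criteria the automatic evaluation should improve on.
# Cohen's kappa is one reasonable statistic; the toy labels are invented.
from sklearn.metrics import cohen_kappa_score

def per_criterion_agreement(manual, automatic, criteria):
    """manual/automatic map each criterion name to 0/1 labels per question,
    in the same question order."""
    return {c: cohen_kappa_score(manual[c], automatic[c]) for c in criteria}

criteria = ["negative_stem", "all_of_the_above", "longest_option_is_key"]
manual = {c: [0, 1, 0, 1, 0, 0] for c in criteria}     # toy expert labels
automatic = {c: [0, 1, 1, 1, 0, 0] for c in criteria}  # toy tool output
print(per_criterion_agreement(manual, automatic, criteria))
]]></preformat>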
      <p>Following this, the user-study involving
students will be deployed through the OLI
platform, which hosts a plethora of open
educational courses used at higher education
institutions across the world. Through
embedding our tool within OLI courses of
different domains and with students of different
knowledge levels, we can gain a diverse sample
representative of more students. Students in the
courses we select will opt in, as part of our IRB
protocol, and in doing so they will be presented
with an activity that has them generate a
multiple-choice question as they work through
certain parts of their respective course. We will
utilize metrics collected from the platform, such as time on task, the quality of the student-generated questions, the amount of feedback they received from the tool, and which flaws students commonly encountered. These will help us determine not only whether the tool helps students create high-quality multiple-choice questions, but also whether it benefits their learning and which types of students are making these high-quality questions. We have also previously
collected student-generated questions from
several existing OLI courses, where students
created them without the use of a tool and received no feedback. The questions students create through our tool can then be compared to this baseline to help measure how much the tool helps or hinders the authoring process.</p>
    </sec>
    <sec id="sec-9">
      <title>4. Contributions</title>
      <p>Learnersourcing continues to grow, as students have already authored over a million questions using tools that might not support their learning or time as effectively as possible. This work will contribute an open
source tool that can be used to help students
author these questions. Additionally, the
methods can be retroactively applied to
questions from online courses and datasets to
evaluate existing questions and indicate how
they might be improved. Through the use of this
tool, we will also contribute a dataset of
student-generated questions from a variety of
domains in higher education, which will cover
more complex topics than existing educational
multiple-choice question datasets. The results
of this research will inform how we can better
create questions across a variety of domains and make improvements to assessments that are actively being used by students in existing online courseware.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Breakall</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Randles</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Tasker</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Development and use of a multiple-choice item writing flaws evaluation instrument in the context of general chemistry</article-title>
          .
          <source>Chemistry Education Research and Practice</source>
          .
          <volume>20</volume>
          ,
          <issue>2</issue>
          (
          <year>2019</year>
          ),
          <fpage>369</fpage>
          -
          <lpage>382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Butler</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Multiple-choice testing in education: Are the best practices for assessment also good for learning?</article-title>
          <source>Journal of Applied Research in Memory and Cognition. 7</source>
          ,
          <issue>3</issue>
          (
          <year>2018</year>
          ),
          <fpage>323</fpage>
          -
          <lpage>331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hauff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Houben</surname>
          </string-name>
          , G.-J.
          <year>2018</year>
          .
          <article-title>LearningQ: a large-scale dataset for educational question generation</article-title>
          .
          <source>Twelfth International AAAI Conference on Web and Social Media</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Clifton</surname>
            ,
            <given-names>S.L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Schriner</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Assessing the quality of multiple-choice test items</article-title>
          .
          <source>Nurse educator. 35</source>
          ,
          <issue>1</issue>
          (
          <year>2010</year>
          ),
          <fpage>12</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Corbett</surname>
            ,
            <given-names>A.T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          <year>1994</year>
          .
          <article-title>Knowledge tracing: Modeling the acquisition of procedural knowledge. User modeling and user-adapted interaction</article-title>
          .
          <volume>4</volume>
          ,
          <issue>4</issue>
          (
          <year>1994</year>
          ),
          <fpage>253</fpage>
          -
          <lpage>278</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Denny</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luxton-Reilly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Purchase</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>PeerWise: students sharing their multiple choice questions</article-title>
          .
          <source>Proceedings of the fourth international workshop on computing education research</source>
          (
          <year>2008</year>
          ),
          <fpage>51</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Downing</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>The effects of violating standard item writing principles on tests and students: the consequences of using flawed test items on achievement examinations in medical education</article-title>
          .
          <source>Advances in health sciences education. 10</source>
          , (
          <year>2005</year>
          ),
          <fpage>133</fpage>
          -
          <lpage>143</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lyu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Graham</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2022</year>
          .
          <article-title>QAScore-An Unsupervised Unreferenced Metric for the Question Generation Evaluation</article-title>
          . Entropy.
          <volume>24</volume>
          ,
          <issue>11</issue>
          (
          <year>2022</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Khairani</surname>
            ,
            <given-names>A.Z.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Shamsuddin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Assessing item difficulty and discrimination indices of teacher-developed multiple-choice tests. Assessment for Learning Within and Beyond the Classroom: Taylor's 8th Teaching and Learning Conference (</article-title>
          <year>2015</year>
          ),
          <fpage>417</fpage>
          -
          <lpage>426</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Khosravi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kitto</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Ripple: a crowdsourced adaptive platform for recommendation of learning activities</article-title>
          . arXiv preprint arXiv:1910.05522. (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Koedinger</surname>
            ,
            <given-names>K.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corbett</surname>
            ,
            <given-names>A.T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Perfetti</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>The Knowledge-Learning-Instruction framework: Bridging the science-practice chasm to enhance robust student learning</article-title>
          .
          <source>Cognitive science. 36</source>
          ,
          <issue>5</issue>
          (
          <year>2012</year>
          ),
          <fpage>757</fpage>
          -
          <lpage>798</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Kurdi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Al-Emari</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>A systematic review of automatic question generation for educational purposes</article-title>
          .
          <source>International Journal of Artificial Intelligence in Education</source>
          .
          <volume>30</volume>
          , (
          <year>2020</year>
          ),
          <fpage>121</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bier</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domadia</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Stamper</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2022</year>
          .
          <article-title>Assessing the Quality of Student-Generated Short Answer Questions Using GPT-3</article-title>
          .
          <source>Educating for a New Future: Making Sense of Technology-Enhanced Learning Adoption: 17th European Conference on Technology Enhanced Learning, EC-TEL</source>
          <year>2022</year>
          , Toulouse, France,
          <source>September 12-16</source>
          ,
          <year>2022</year>
          ,
          <string-name>
            <surname>Proceedings</surname>
          </string-name>
          (
          <year>2022</year>
          ),
          <fpage>243</fpage>
          -
          <lpage>257</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Stamper</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>Examining the Effects of Student Participation and Performance on the Quality of Learnersourcing MultipleChoice Questions</article-title>
          .
          <source>Proceedings of the Eighth ACM Conference on Learning @ Scale</source>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Ni</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bao</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Denny</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witbrock</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2022</year>
          .
          <article-title>Deepqr: Neural-based quality ratings for learnersourced multiple-choice questions</article-title>
          .
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          (
          <year>2022</year>
          ),
          <fpage>12826</fpage>
          -
          <lpage>12834</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Ruseti</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dascalu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balyan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kopp</surname>
            ,
            <given-names>K.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNamara</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crossley</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Trausan-Matu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Predicting question quality using recurrent neural networks</article-title>
          .
          <source>Artificial Intelligence in Education: 19th International Conference, AIED</source>
          <year>2018</year>
          , London, UK, June 27-30,
          <year>2018</year>
          , Proceedings, Part I 19 (
          <year>2018</year>
          ),
          <fpage>491</fpage>
          -
          <lpage>502</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Rush</surname>
            ,
            <given-names>B.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rankin</surname>
            ,
            <given-names>D.C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>White</surname>
            ,
            <given-names>B.J.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>The impact of item-writing flaws and item complexity on examination item difficulty and discrimination value</article-title>
          .
          <source>BMC medical education. 16</source>
          ,
          <issue>1</issue>
          (
          <year>2016</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Schurmeier</surname>
            ,
            <given-names>K.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atwood</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shepler</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lautenschlager</surname>
            ,
            <given-names>G.J.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Using item response theory to assess changes in student performance based on changes in question wording</article-title>
          .
          <source>Journal of chemical education. 87</source>
          ,
          <issue>11</issue>
          (
          <year>2010</year>
          ),
          <fpage>1268</fpage>
          -
          <lpage>1272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Scialom</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Staiano</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Ask to Learn: A Study on Curiosity-driven Question Generation</article-title>
          .
          <source>Proceedings of the 28th International Conference on Computational Linguistics</source>
          (
          <year>2020</year>
          ),
          <fpage>2224</fpage>
          -
          <lpage>2235</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brooks</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Doroudi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2022</year>
          .
          <article-title>Learnersourcing in Theory and Practice: Synthesizing the Literature and Charting the Future</article-title>
          .
          <source>Proceedings of the Ninth ACM Conference on Learning@ Scale</source>
          (
          <year>2022</year>
          ),
          <fpage>234</fpage>
          -
          <lpage>245</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Tarrant</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knierim</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayes</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ware</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>The frequency of item writing flaws in multiple-choice questions used in high stakes nursing assessments</article-title>
          .
          <source>Nurse Education Today</source>
          .
          <volume>26</volume>
          ,
          <issue>8</issue>
          (
          <year>2006</year>
          ),
          <fpage>662</fpage>
          -
          <lpage>671</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mallick</surname>
            ,
            <given-names>D.B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Baraniuk</surname>
            ,
            <given-names>R.G.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>Towards blooms taxonomy classification without labels</article-title>
          .
          <source>Artificial Intelligence in Education: 22nd International Conference</source>
          , AIED (
          <year>2021</year>
          ),
          <fpage>433</fpage>
          -
          <lpage>445</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>