<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Transforming Textbooks into Learning by Doing Environments: An Evaluation of Textbook-Based Automatic Question Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rachel Van Campenhout</string-name>
          <email>rachel.vancampenhout@vitalsource.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jeffrey S. Dittel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bill Jerome</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benny G. Johnson</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Britlan, Ltd.</institution>
          ,
          <addr-line>Oconomowoc, WI 53066</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>VitalSource Technologies</institution>
          ,
          <addr-line>Pittsburgh, PA 15218</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Textbooks have been the traditional method of providing learning content to students for decades, and therefore have become the standard in high-quality content. Yet the static textbook format is unable to take advantage of the cognitive and learning science research on effective interactive learning methods. This gap between quality content and highly efficient methods of learning can be closed with advances in artificial intelligence. This paper will contextualize the need for improving textbooks as a learning resource using research-based cognitive and learning science methods, and describe a process by which artificial intelligence transforms textbooks into more effective online learning environments. The goal of this paper is to evaluate textbook-based automatic question generation using student data from a variety of natural learning environments. We believe this analysis, based on 786,242 total observations of student-question interactions, is the largest evaluation of automatically generated questions using performance metrics and student data from natural learning contexts known to date, and will provide valuable insights into how automatic question generation can continue to enhance content. The implications for this integration of textbook content and learning science for effective learning at scale will be discussed.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Question Generation</kwd>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Textbooks</kwd>
        <kwd>Formative Practice</kwd>
        <kwd>Learning by Doing</kwd>
        <kwd>Doer Effect</kwd>
        <kwd>Courseware</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Textbooks are the de facto standard in quality educational content, and yet are not the
standard in effective learning. Students encounter long sections of content that they
must find a way to absorb, and risk reading passively with little retention. Entire
disciplines of study have arisen from the need to identify techniques that will help students
learn this content (see [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]). Instructors may assign reading with the expectation that said
readings will be absorbed prior to a learning activity, and yet have no way of monitoring
student progress in real time. Students often do not read the textbook as instructors
intend, limiting learning gains [
        <xref ref-type="bibr" rid="ref19 ref9">9, 19</xref>
        ]. This disconnect between textbook content and
learning produces a tension that ultimately puts students at a disadvantage.
      </p>
      <p>
        Research in cognitive and learning science has proven that certain methods are
markedly more effective for learning content in online contexts. For example,
integrating formative practice questions with short sections of content creates a method of
learning by doing that has been shown to increase student learning gains while also
decreasing the amount of time students spend studying [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The relationship between doing practice while reading and
learning outcomes was six times larger than that for reading the content alone
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and follow-up research has shown this “doer effect” to be causal [
        <xref ref-type="bibr" rid="ref13 ref14 ref17">13, 14, 17</xref>
        ].
Doing practice is widely understood to be beneficial for learning, but learning science
has identified that doing practice at frequent intervals while reading causes learning.
      </p>
      <p>Given this simple yet highly effective learning method, why has it not been
incorporated into every online textbook? The main contributing factor is an issue of scale.
The volume of formative practice items needed to engage students in this type of
learning by doing is in the hundreds or thousands for a typical full-semester course—a scale
that becomes prohibitive in both time and cost. Question writing is a labor-intensive
process that requires both subject matter and item writing expertise. The formative
practice element of the courseware learning environment is therefore often a barrier too
high for either content providers or teaching faculty to overcome.</p>
      <p>
        Artificial intelligence is a promising solution for overcoming this barrier. Recent
advances in natural language processing (NLP) and machine learning (ML) have
provided tools needed to take high-quality textbook content and transform it to
courseware—a process that organizes content into shorter, topical lessons and creates
and embeds formative practice within those lessons. Automatic question generation
(AQG) has gained increasing focus in recent years, and yet few studies have evaluated
these questions empirically in a natural learning context using student data [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. This
paper makes the following contributions to the AQG literature beyond the recent
systematic review by Kurdi et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Data are evaluated from 945 students across six
textbook-based courseware environments containing 2,610 automatically generated
(AG) questions, making this the largest study on natural student usage of AG questions
reported as of that review. We evaluate AG questions alongside human-authored (HA)
questions in the same courseware on three key performance metrics: engagement,
difficulty, and persistence. Our initial smaller-scale research found AG and HA questions
to be similar on these metrics, suggesting students did not perceive a difference that
caused them to behave differently with the AG questions [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Prior studies have
focused on difficulty but not on engagement-based metrics crucial to formative practice.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Methods</title>
      <sec id="sec-2-1">
        <title>2.1 The SmartStart Process</title>
        <p>
          VitalSource is engaged in a large-scale project, called SmartStart [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], for automatically
creating learning by doing courseware from textbooks. SmartStart uses NLP and ML
methods to accomplish three main tasks: identify short lessons of content to appear on
a courseware page, find and align learning objectives to those lessons of content, and
generate formative practice questions for each lesson page. These actions are completed
in an application interface with options for customization by a course designer. The
goal is to use this process to easily and quickly create foundational courseware that
engages students in learning by doing as they read the traditional textbook material.
The courseware that this process produces can be used as is, or can be further
customized in an authoring interface to include adaptivity and additional assessments.
        </p>
        <p>
          Content Sections. Most often, textbooks contain units or chapters on broad topics that
cover significant amounts of content in contiguous blocks. While there may be a series
of subheadings, the content flows together in a single unit. This large volume of reading
poses a long-held concern about students becoming passive readers and having
difficulty reading for understanding [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. A set of expert rules for the content chunking
process was derived from prior experience with several dozen custom-built courseware
developed with the methods of Carnegie Mellon University’s Open Learning Initiative
[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. These environments were used by thousands of students and historical data
provided insights on how different lesson lengths affected student retention and activity.
Using these rules, SmartStart analyzes the textbook structure and proposes how the
content could be presented in shorter, topically aligned lessons in a courseware
interface. It also uses expert rules to identify any potential issues with the content sections
for human evaluation.
        </p>
        <p>
          Learning Objectives. In addition to shorter sections of content, learning objectives
help students construct mental models by providing clear guidance on what they are
expected to learn, as well as an indication as to how they will be evaluated on that
content. Learning objectives also provide a practical function in courseware
environments, as they are tagged to formative practice and feed data to instructor dashboards
that are organized by objective in addition to post hoc analyses. Most textbooks provide
students with these types of learning statements, but their phrasing and location are
highly variable. Therefore, SmartStart must first locate the learning objectives. The lack
of consistency in phrasing, format, placement, and HTML markup observed across
even just a few dozen textbooks made a rule-based system for locating them infeasible,
so instead, a supervised ML model is used. The model includes features that represent
specific identifiers, placement characteristics, and Bloom’s Taxonomy verbs [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The
learning objective identification model and other ML models used in SmartStart were
developed using the scikit-learn library [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Once the learning objectives have been
located and extracted, the next task is to place them with the lessons created in the
structure task. Another model evaluates the content of each learning objective and
lesson and proposes the best placement. Placing the learning objective with lesson content
will also tag any formative questions placed on that lesson page, completing the data
collection and instrumentation architecture of the courseware.
        </p>
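        <p>As a rough illustration of the three feature families named above (specific identifiers, placement characteristics, and Bloom's Taxonomy verbs), the following Python sketch turns one candidate text block into a feature dictionary that a supervised classifier could consume. The verb list, the identifier pattern, and the function name objective_features are assumptions made for this example; the paper does not specify the actual features or model.</p>
        <preformat><![CDATA[
```python
import re

# Illustrative subset of Bloom's Taxonomy verbs (assumption: the real list is not given).
BLOOM_VERBS = {"define", "describe", "explain", "identify",
               "compare", "apply", "analyze", "evaluate"}

# Hypothetical identifier pattern such as "LO 2.1" or "Learning Objective 3".
IDENTIFIER = re.compile(r"^(LO|Learning Objective)\s*\d+(\.\d+)*", re.IGNORECASE)


def objective_features(text, index, total):
    """Return a feature dict for one candidate text block.

    index and total give the block's position within its chapter,
    which serves as a simple placement feature.
    """
    stripped = IDENTIFIER.sub("", text).strip(" :.-")
    words = re.findall(r"[A-Za-z']+", stripped)
    first_word = words[0].lower() if words else ""
    return {
        "has_identifier": bool(IDENTIFIER.match(text)),
        "starts_with_bloom_verb": first_word in BLOOM_VERBS,
        "relative_position": index / max(total - 1, 1),
        "word_count": len(words),
    }
```
]]></preformat>
        <p>Dictionaries of this shape could be vectorized and passed to any scikit-learn classifier; which estimator SmartStart uses is not stated, so none is assumed here.</p>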
        <p>
          Automatic Question Generation. The key feature that turns static textbook content
into a learning by doing environment that takes advantage of the doer effect is formative
practice questions. Two types of formative cloze questions are generated by this AQG
process: fill-in-the-blank (FITB) and matching (as seen in Figure 1). The FITB provides
a sentence with a term missing for students to enter, making this question a recall type
on Bloom’s cognitive process dimension [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The matching question provides a
sentence with three missing terms and the student must drag and drop the terms into the
correct location, making this a recognition type on that dimension. Both recognition
and recall questions have long been researched for their learning value [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and are both
on the first level of Bloom’s Taxonomy [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. As formative practice, students can
continue to answer until they reach the correct response and receive immediate feedback.
        </p>
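        <p>The two cloze formats described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the sentence and answer terms have already been chosen; the function names make_fitb and make_matching and the dictionary layout are hypothetical and are not taken from SmartStart.</p>
        <preformat><![CDATA[
```python
import random

BLANK = "_____"


def make_fitb(sentence, answer):
    """Blank out the first occurrence of `answer` to form a recall (FITB) item."""
    if answer not in sentence:
        raise ValueError("answer term must appear in the sentence")
    return {"stem": sentence.replace(answer, BLANK, 1), "answer": answer}


def make_matching(sentence, answers, seed=0):
    """Blank out three terms; the student drags the shuffled options into place."""
    if len(answers) != 3:
        raise ValueError("a matching item uses exactly three terms")
    stem = sentence
    for term in answers:
        if term not in stem:
            raise ValueError(f"term not found: {term}")
        stem = stem.replace(term, BLANK, 1)
    # Record the key in left-to-right order of appearance in the sentence.
    key = sorted(answers, key=sentence.index)
    options = list(answers)
    random.Random(seed).shuffle(options)
    return {"stem": stem, "options": options, "key": key}
```
]]></preformat>
        <p>The sketch does not handle answer terms that overlap or repeat within a sentence, which a production system would need to address.</p>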
        <p>
          AQG is the most complex step in the SmartStart process given the number of
requirements and variables involved. AQG has been an increasingly researched topic
given its relevance to several fields, and there have been many differing approaches
developed. To describe the current approach, we will use the classification system
developed in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]: level(s) of understanding and procedure(s) of transformation. The level
of understanding for this AQG uses both syntactic and semantic information from the
textbook. The NLP analyses are carried out using the spaCy library [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. This
information is used to accomplish two primary tasks: selecting the content sentences for the
questions and selecting the term(s) to be used as the answer(s). Syntactic information,
such as part-of-speech tagging and dependency parsing, is used in both sentence
selection and answer term selection. Semantic knowledge is also used for detecting
important content. The procedure of transformation is primarily a rule-based method. A
set of rules is used to select the question sentences and answer terms, and these rules
use both syntactic and semantic information to select the best options.
        </p>
        <p>
          After the syntactic and semantic processing of the textbook has been completed
and the selection rules applied, a set of questions has been generated. However, this set
contains more questions than will ultimately appear to students. The SmartStart AQG
uses an overgenerate-and-rank approach [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to select only the top questions of each
type to appear on the lesson page. As these questions are to be used by students in their
natural learning environment and not as part of an experiment, the question sets were
further scrutinized in a human review pass. The goal of this review was not to evaluate
the questions from the perspective of a subject matter expert, but rather to search for
quality issues common to the field of AQG. For example, some questions may be too
easily guessable (e.g., “The father of psychoanalysis was Sigmund _____.”) or have
grammatical problems such as an unresolved anaphoric reference in the question stem.
For two of the six courses in this study (Communication A and Accounting, Table 1),
the textbook’s publisher also did a subsequent human review pass and made additional
minor modifications to some of the remaining questions.
        </p>
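        <p>The selection step of an overgenerate-and-rank approach can be illustrated as follows. This Python sketch assumes each candidate question already carries a numeric score; the scoring criteria, the per_type limit, and the name select_top_questions are placeholders, since the paper does not describe how SmartStart ranks its candidates.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def select_top_questions(candidates, per_type=2):
    """Keep the highest-scoring candidates of each question type for one lesson page.

    Each candidate is a dict with "type", "stem", and "score" keys; the score
    stands in for whatever ranking criteria a real system applies.
    """
    by_type = defaultdict(list)
    for cand in candidates:
        by_type[cand["type"]].append(cand)
    selected = []
    for qtype in sorted(by_type):
        ranked = sorted(by_type[qtype], key=lambda c: c["score"], reverse=True)
        selected.extend(ranked[:per_type])
    return selected
```
]]></preformat>
        <p>Questions that survive this step would then go to the human review pass described above.</p>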
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Question Evaluation</title>
        <p>
          The purpose of this work is to expand upon the current AQG literature by furthering
the empirical evaluation of AG questions. As noted in Kurdi et al.’s 2020 systematic
review [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], the majority of studies generate questions for experimental settings. Only
one study reported using AG questions in a class setting [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], but this study used
template-based programming exercises and evaluated student pre- and post-test scores, not
individual item characteristics. Furthermore, only 14 of the 93 studies that met the
criteria for the review evaluated question difficulty. These studies primarily used small
samples (under 100 questions) and evaluation was based on expert review, not natural
student data (see [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]). This type of expert review was critical to the development of
the current AQG system, but the true test is how the AG questions perform with
students in authentic learning contexts.
        </p>
        <p>
          These AG questions were evaluated using data from a set of six SmartStart courses
that were created from existing textbooks and used by students in their natural learning
environments. For example, an introductory behavioral neuroscience textbook [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] was
used to generate the courseware discussed in detail in the next section. These courses
were also enhanced with human-authored questions post-generation, providing a
unique opportunity to compare AG and HA questions that the same students completed
on the same lesson pages; details are given in Table 1. The manually added HA
questions can also be categorized as recognition or recall. The recall category includes the
AG FITB as well as the HA FITB (the most direct counterpart) and, in the Neuroscience
course, HA numeric input. The recognition category includes the AG matching and all
other HA question types. Most similar to the AG matching are the HA drag-and-drop
(D&amp;D) types and the pulldown type, where students select a term from a dropdown
menu to complete a question stem. There are three types of HA multiple-choice
variations: conventional multiple-choice (MC), multiple-choice multiple-select (MCMS),
and multiple-choice grid (MC grid). The Neuroscience course also contains HA passage
selection questions, in which students select content in a short passage of text according
to the instructions.
        </p>
        <p>
          This type of in vivo experimentation across a variety of courses provides greater
external validity, while comparing interactions of the same students with both types of
question improves internal validity. To evaluate both question types through an
empirical approach, the performance metrics of engagement, difficulty, and persistence
provide a basis for comparison [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
        <p>The first metric studied is engagement—whether or not students chose to answer
questions they encountered on a lesson page. For questions that were answered, a
difficulty metric can provide insights into whether questions may be too easy or difficult.
The last metric is persistence—when students initially answer a question incorrectly,
how often do they continue to answer until they reach the correct answer? While mean
performance metric values are insightful, a mixed effects logistic regression model will
also be used to analyze these metrics more rigorously, including controlling for
covariates. The results will be presented and discussed in detail for a single course, followed
by discussion of patterns observed across all courses.</p>
        <p>
          Table 1. Course: Neuroscience [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]; Communication A [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]; Microbiology [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]; Psychology [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]; Communication B [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]; Accounting [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Results &amp; Discussion</title>
      <sec id="sec-3-1">
        <title>3.1 Neuroscience</title>
        <p>Engagement. Doing the formative questions is what generates the doer effect, which
helps students learn, so the first step is to evaluate how students choose to engage with
the different question types. We hypothesize that if students perceived problems with
the AG questions, they would engage with them less than HA questions or simply
wouldn’t do them at all. The Neuroscience course will be used as a detailed example.
This course has the largest data set, and also has the largest variety of HA question
types, allowing broader comparison with the AG questions. This textbook-based
courseware was used at 18 institutions in 21 total course sections, providing a wide
range of contexts and the highest likelihood of heterogeneity of students. The data set
was constructed as the set of all opportunities a student had to engage with a question.
While at first a simple cross of all students with all questions would seem appropriate,
there are many cases where a student did not visit a lesson page, and therefore did not
have the opportunity to choose whether to answer those questions. Rather, engagement
opportunities were taken as all student-question pairs on pages that the student visited
(very short page visits of under 5 seconds were excluded).</p>
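        <p>The construction of the engagement data set described above can be sketched in Python. The input shapes (visit tuples, a page-to-questions mapping, and a set of answered pairs) and the name engagement_opportunities are assumptions for illustration; only the 5-second threshold comes from the text.</p>
        <preformat><![CDATA[
```python
MIN_VISIT_SECONDS = 5  # visits shorter than this are excluded, per the text


def engagement_opportunities(page_visits, page_questions, answered):
    """Build one row per (student, question) the student had a chance to answer.

    page_visits:    iterable of (student, page, seconds_on_page)
    page_questions: dict mapping page -> list of question ids on that page
    answered:       set of (student, question) pairs with at least one attempt
    """
    visited = {(s, p) for s, p, secs in page_visits if secs >= MIN_VISIT_SECONDS}
    rows = []
    for student, page in sorted(visited):
        for question in page_questions.get(page, []):
            rows.append({
                "student": student,
                "question": question,
                "answered": int((student, question) in answered),
            })
    return rows
```
]]></preformat>
        <p>Averaging the answered column over these rows gives mean engagement; questions on pages a student never visited contribute no rows, so they do not count against engagement.</p>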
        <p>
          Given this data set of student engagement opportunities, why not just use mean
engagement to assess the question types? Data from courseware show that engagement
typically declines over the course of a semester, and even within modules/chapters and
pages [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. The location of a question within the course may therefore impact the
likelihood that students will engage with it, and so a more sophisticated model is needed to
take this into consideration. Logistic regression can be used to model the probability
that a student will answer a question they encountered as a function of question type,
while also taking question location variables into account as covariates. Furthermore,
because there are multiple observations per student and question, these are not
independent, and a mixed effects model is required. The AG FITB is the baseline for the
question type categorical variable, facilitating comparison between this AG recall type
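and the other AG and HA types. The R formula that expresses the model is:
          <preformat># glmer() is provided by the lme4 package
library(lme4)
# Random intercepts for student and question; question_type is a factor with AG FITB as its reference level
glmer(answered ~ course_page_number + module_page_number
      + page_question_number + question_type
      + (1|student) + (1|question),
      family=binomial(link=logit), data=df)</preformat>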
        </p>
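        <p>Written out as an equation, a model of this form can be expressed as below. The notation is introduced here for illustration and does not appear in the original: y_sq is 1 if student s answered question q, the beta terms are the fixed effects for the three location covariates, gamma is the fixed effect of the question's type (zero for the AG FITB baseline), and u_s and v_q are the random intercepts for student and question.</p>
        <preformat><![CDATA[
```latex
\[
\operatorname{logit}\,\Pr(y_{sq} = 1) = \beta_0
  + \beta_1\,\mathrm{course\_page}_{q}
  + \beta_2\,\mathrm{module\_page}_{q}
  + \beta_3\,\mathrm{page\_question}_{q}
  + \gamma_{\mathrm{type}(q)} + u_s + v_q,
\]
\[
u_s \sim \mathcal{N}(0, \sigma_u^2), \qquad
v_q \sim \mathcal{N}(0, \sigma_v^2), \qquad
\gamma_{\mathrm{AG\ FITB}} = 0 .
\]
```
]]></preformat>
        <p>Under this parameterization, a positive estimate of gamma for a question type means students were more likely to answer that type than the AG FITB, after accounting for question location.</p>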
        <p>The data set for the Neuroscience course consists of 286,129 individual
student-question observations. An answered question was recorded as 1 and an unanswered
question as 0. If we first consider the mean engagement of each question type, the range
is from 43.4% to 29.7%. There are clusters of question types with similar means. For
instance, AG matching, HA MC, HA MCMS, and HA numeric input all had
engagement around 43%. Next, AG FITB (41.1%), HA pulldown, and HA D&amp;D table all had
between 40% and 42% engagement. HA D&amp;D image, HA FITB, and HA passage
selection all had engagement below 38%. This information on its own is useful; however,
reviewing the results of the model in Table 2 gives additional insights.</p>
        <p>The variables for the location of the questions were all significant (p &lt; 0.001) and
negative, verifying that students are less likely to engage with the practice as they get
to the end of a page, module, and course. After controlling for the effects of question
location, we can examine differences in engagement for the question types. The HA
MC and AG matching questions, which had nearly identical mean engagement scores,
also had very similar estimates and significance (p &lt; 0.001) compared to the AG FITB.
Both question types are recognition types, but one was human-authored and one
automatically generated. The HA pulldown and HA MCMS were both similar in mean
scores and both more likely to be answered (p &lt; 0.01 and p &lt; 0.05 respectively). The
AG FITB also had similar mean engagement with the HA numeric input, and the model
showed no significant difference for engagement between these recall question types.
Neither of the D&amp;D question types was significantly different from the AG FITB,
despite one having similar mean engagement scores and the other having much lower
mean scores. The HA passage selection had the lowest mean engagement, and also was
significantly less likely to be engaged with (p &lt; 0.001). This could be due to the
complexity of the question type—requiring students to interpret instructions, read content,
and then select a segment of that content. Finally, the HA FITB (as the most direct
comparison to the AG FITB) had a lower mean score and the model showed that
students were less likely to engage with this type than the AG FITB (p &lt; 0.01).</p>
        <p>Difficulty. When students choose to answer the formative questions, we can evaluate
the question difficulty through their first attempt accuracy. This difficulty data set
consists of all attempted questions for a total of 120,098 observations, with a correct answer
recorded as 1 and an incorrect answer recorded as 0. When we first consider the mean
difficulty scores per question type (Table 3), we see that there is a very wide range,
from a high of 86.4% correct for HA D&amp;D image to a low of 33.7% for HA passage
selection. There are three general groupings of question types. The easiest questions,
with means above 80%, were the HA D&amp;D table, AG matching, and HA D&amp;D image.
The next group ranges from a high of 70.0% to a low of 64.1% (AG FITB) and includes
the HA recognition types pulldown and MC, as well as all HA and AG recall types (HA
FITB, HA numeric input, AG FITB). The most difficult question types were HA
MCMS at 43.3% and HA passage selection at 33.7%.</p>
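        <p>The unadjusted difficulty means reported above are first-attempt accuracy rates per question type. A minimal Python sketch of that calculation follows, assuming a hypothetical attempt log in which each record carries a question type, a 1-based attempt number, and a correctness flag; the field names are illustrative only.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def first_attempt_accuracy(attempts):
    """Mean first-attempt correctness per question type.

    attempts: iterable of dicts with keys student, question, question_type,
    attempt_number (1-based), and correct (bool).
    """
    totals = defaultdict(lambda: [0, 0])  # type -> [correct first attempts, first attempts]
    for a in attempts:
        if a["attempt_number"] != 1:
            continue  # later attempts do not affect the difficulty metric
        tally = totals[a["question_type"]]
        tally[0] += int(a["correct"])
        tally[1] += 1
    return {qtype: right / n for qtype, (right, n) in totals.items()}
```
]]></preformat>
        <p>A higher value indicates an easier question type, matching the direction of the percentages in the text.</p>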
        <p>
          The same mixed effects logistic regression model was used, but with the difficulty
data set (Table 3). The location variables, unlike for engagement, were not all
significant. The location of the question in the module was significant (p &lt; 0.01), but neither
the course page nor the location on the page was significant. Unlike for engagement,
there was no consistent trend of location significance for difficulty or persistence, and
so for brevity the location variable results are omitted from the remaining tables (full
results are available at [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]), though the question type regression results are still
controlled for location. The regression model results for the question types were generally
consistent with the trends for the unadjusted mean scores. Nearly all the questions with
higher mean scores than the AG FITB (64.1%) were also easier than the AG FITB with
varying degrees of significance. This also includes the HA FITB—the most direct
counterpart to the AG FITB. The two question types with lower means—HA passage
selection and HA MCMS—were also more difficult than the AG FITB in the model.
Interestingly, the question type that was not statistically different from the AG FITB was
another recall type, HA numeric input.
        </p>
        <p>
          Persistence. The last performance metric is persistence: when a student gets a
question incorrect on their first attempt, do they continue to answer until they reach the
correct response? The persistence data set is therefore a subset of the difficulty data set,
including only the incorrect first attempts for questions for a total of 34,124
observations. If a student eventually achieves the correct response the outcome was recorded
as 1, and if they did not persist until reaching the correct response the outcome was
recorded as 0. We hypothesize that persistence could be related to the difficulty of a
question type; if a student perceives a question type as too easy or too difficult, they
may not persist until reaching the correct response as often. It may seem that this
persistence metric is less impactful than engagement or difficulty, but VanLehn notes in a
meta-analysis of literature on human and computer-based tutoring that getting students
to finish problems correctly instead of giving up has a strong impact on learning [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
The persistence metric should therefore not be overlooked.
        </p>
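        <p>The persistence outcome defined above can be derived from an attempt log as in the following Python sketch. The tuple layout (student, question, attempt number, correct) and the name persistence_outcomes are assumptions for illustration; the logic follows the definition in the text, keeping only items missed on the first attempt and recording 1 when a later attempt was correct.</p>
        <preformat><![CDATA[
```python
def persistence_outcomes(attempts):
    """Return {(student, question): 0 or 1} for items missed on the first attempt.

    attempts: iterable of (student, question, attempt_number, correct) tuples.
    The outcome is 1 if any later attempt was correct, else 0.
    """
    first_wrong = set()
    eventually_right = set()
    # Process attempts in order of attempt number so first attempts are seen first.
    for student, question, number, correct in sorted(attempts, key=lambda a: a[2]):
        key = (student, question)
        if number == 1 and not correct:
            first_wrong.add(key)
        elif correct and key in first_wrong:
            eventually_right.add(key)
    return {key: int(key in eventually_right) for key in first_wrong}
```
]]></preformat>
        <p>Student-question pairs answered correctly on the first attempt are absent from the result, which is why the persistence data set is a subset of the difficulty data set.</p>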
        <p>
          The table for persistence was omitted for brevity (see [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]), as the trends were less
complex. There are six question types between 98.9% and 100% persistence, and all
are recognition types, including the AG matching. The next group is the three recall
question types: AG FITB (89.0%), HA FITB (85.7%), and HA numeric input (80.8%).
The outlier is the HA passage selection at 42.3%. Comparatively, the regression model
provides interesting results. Only three question types are significantly more likely (p
&lt; 0.001) to be answered until correct compared to the AG FITB: HA pulldown, AG
matching, and HA MC. Note that these question types were statistically easier than the
AG FITB, but they ranged from 67.8% to 84.3% correct. All three are recognition type
questions, which may account for the range in difficulty yet similarities in persistence.
        </p>
        <p>The HA MCMS was an outlier in terms of its high degree of difficulty, and yet had
statistically higher persistence than the AG FITB. There were also three question types
with significantly lower likelihood of persistence. The HA passage selection is the most
expected, given its high difficulty and low mean persistence. However, the other two
HA recall types had statistically lower persistence than the AG FITB. Students were
more likely to persist in the AG FITB than either the HA FITB or HA numeric input.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Trends Across Courses</title>
        <p>
          As seen in the regression models for each of the performance metrics above, there are
valuable insights into how students engage with different question types. In this section,
we present the mean values with the effects from the regression model for each course,
for each metric. Trends as well as anomalies can be detected when looking across
courses that can provide a higher degree of generalizability for findings and suggest
interesting areas for future research. Each course was different in terms of the HA
question types that were added to the SmartStart courseware. To discern trends that may be
characteristic of a question type, only types appearing in at least three of the six courses
are presented in the tables below; however, every question type in a course was
included in that course's regression model (see [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]). If a question type was not used in a course,
its cells contain n/a. The mean is presented for every question type, but only significant
effects from the model are presented with their signs for ease of interpretation.
        </p>
        <p>
          Engagement for All Courses. First, the mean engagement trends for question types
(Table 4) show similar patterns to the Neuroscience course. The AG matching questions
have means within a few percentage points of the HA MC for all courses, and often
close to other types such as the HA pulldown or HA MCMS. This shows that
engagement is similar for all these recognition types, regardless of the AG or HA origins.
Similarly, the mean engagement for the AG FITB is typically within a few percentage
points of the HA FITB, with three of four courses showing the AG FITB with slightly
higher engagement, indicating the recall question types are generally close in mean
engagement.
        </p>
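        <p>As an illustrative aside, the cell convention used in the tables (a mean for every question type, n/a where a type was not used, and a signed marker only for significant effects) can be sketched in Python. The names <monospace>stars</monospace> and <monospace>format_cell</monospace>, their arguments, and the star thresholds of 0.05, 0.01, and 0.001 are assumptions made for this sketch; they follow a common convention and are not taken from the regression output.</p>
        <preformat>
```python
from typing import Optional

# Assumed star thresholds (a common convention, not stated in the text).
STAR_LEVELS = [(0.001, "***"), (0.01, "**"), (0.05, "*")]


def stars(p_value: float) -> str:
    """Return the marker for the strictest assumed threshold that p_value meets."""
    for threshold, marker in STAR_LEVELS:
        if threshold > p_value:
            return marker
    return ""  # not significant: no marker is shown


def format_cell(mean: Optional[float], coef: Optional[float] = None,
                p_value: Optional[float] = None) -> str:
    """Build one table cell: the mean, plus a signed marker when significant."""
    if mean is None:
        return "n/a"  # question type not used in this course
    cell = f"{mean:.1f}"
    if coef is None or p_value is None:
        return cell  # e.g. the AG FITB reference level has no effect to report
    marker = stars(p_value)
    if marker:
        sign = "+" if coef > 0 else "-"
        cell += f" {sign}{marker}"
    return cell
```
        </preformat>
        <p>Under these assumed thresholds, a hypothetical effect with p = 0.0521 yields a cell containing only the mean, which matches the way marginal results are left unmarked.</p>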
        <p>
          The mean engagement values also show variation between courses in their overall
spread. In some courses, mean engagement for all question types is very close, such as
for Communication A (43.4–50.5%), Communication B (44.8–53.3%), or Psychology
(84.9–87.4%). Other courses have a wider variation in question type engagement
means, such as Accounting (52.0–77.9%) or Microbiology (42.3–69.3%). Courses also
have different ranges from one another; while Communication A and B have low mean
engagement across all question types, Psychology has much higher mean engagement
for all question types. While the reason for these differences is not discernible from the
data, the implementation of the courseware in the classroom is likely a
strong contributing factor, as data have shown that instructor implementation practices
can greatly influence student engagement with formative practice [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
        </p>
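        <p>The spread comparison above is simple arithmetic, and a short Python sketch makes it explicit. The ranges are the ones quoted in the preceding paragraph; the names <monospace>ENGAGEMENT_RANGE</monospace> and <monospace>spread</monospace> are hypothetical and used only for illustration. Neuroscience is omitted because its range is not quoted in that paragraph.</p>
        <preformat>
```python
# (lowest, highest) mean engagement in percent across question types,
# copied from the ranges quoted in the paragraph above.
ENGAGEMENT_RANGE = {
    "Communication A": (43.4, 50.5),
    "Communication B": (44.8, 53.3),
    "Psychology": (84.9, 87.4),
    "Accounting": (52.0, 77.9),
    "Microbiology": (42.3, 69.3),
}


def spread(course: str) -> float:
    """Width of the engagement range for one course, in percentage points."""
    lowest, highest = ENGAGEMENT_RANGE[course]
    return round(highest - lowest, 1)  # round away floating-point noise


# Spread per course, then course names ordered from narrowest to widest.
spreads = {course: spread(course) for course in ENGAGEMENT_RANGE}
narrow_to_wide = sorted(spreads, key=spreads.get)
```
        </preformat>
        <p>With these figures the spread runs from 2.5 percentage points (Psychology) to 27.0 (Microbiology), which is the contrast between tightly grouped and widely varying courses described above.</p>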
        <p>Considering the effects from the regression model, none of the question types had
consistent positive or negative significant differences in engagement compared to the
AG FITB across all courses. The AG matching had significantly higher engagement in five of
six courses, and the HA MC had significantly higher engagement in four of six
courses. The HA pulldown had significantly higher engagement in four courses and significantly
lower engagement in one. Interestingly, the HA MCMS was not significantly different from the
AG FITB in three of five courses. While this question type is a recognition type, it is
also more complex than others like the HA MC or pulldown, which could be impacting
how often students choose to engage. The HA FITB was not significantly different from
the AG FITB in three courses, and had significantly lower engagement in one. Because the HA FITB
is the recall counterpart to the AG FITB, this strongly suggests that students treat recall
questions similarly with regard to engagement. These trends indicate that the context of the course implementation
will likely influence the overall engagement patterns, but there is no evidence to
suggest that students engaged with AG question types differently than similar HA types.
        </p>
        <p>Difficulty for All Courses. The difficulty means show trends consistent across courses
(Table 5). For instance, each course shows a range of difficulties across question types,
generally from the mid-forties to eighty percent accuracy. This aligns with
expectations that some question types may be easier or more difficult than others.</p>
        <p>The effects from the regression model show differences in question difficulty when
we compare question types to the AG FITB. The AG matching question type is
significantly easier (p &lt; 0.001) than the AG FITB in every course. The HA pulldown is also
statistically easier in four courses, with no statistical difference in one course. The HA
MC questions present an interesting mix across courses. They are statistically easier
than the AG FITB in three courses, not statistically different in one course, and
statistically more difficult in two courses. While the HA MC is a recognition type, which
typically trends easier than recall, the question type does not solely determine the
difficulty level. In particular, a question’s content also contributes to its
difficulty; one possible explanation for the observed results is that the MC format offers more
flexibility for question authoring than certain other formats, including FITB, as the
entire text of the stem and distractors is up to the author. For example, it is more feasible
to create higher-level Bloom’s questions in an MC format than in one where the answer is
a single word. Although beyond the scope of the present study, this is an interesting
area for future investigation. Similarly, the HA MCMS is a recognition type but is
statistically more difficult than AG FITB in three courses and not different in two courses.</p>
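        <p>The difficulty comparisons for the recognition types can be summarized with a small tally, sketched here in Python. The counts are transcribed from the two preceding paragraphs; the value of six for the AG matching assumes it appears in all six courses, as "every course" implies. The names <monospace>DIFFICULTY_TALLY</monospace> and <monospace>classify</monospace> and the label wording are hypothetical.</p>
        <preformat>
```python
# Number of courses in which each type was significantly easier than,
# not significantly different from, or significantly harder than the AG FITB.
DIFFICULTY_TALLY = {
    "AG Matching": {"easier": 6, "no difference": 0, "harder": 0},
    "HA Pulldown": {"easier": 4, "no difference": 1, "harder": 0},
    "HA MC": {"easier": 3, "no difference": 1, "harder": 2},
    "HA MCMS": {"easier": 0, "no difference": 2, "harder": 3},
}


def classify(tally: dict) -> str:
    """Describe how consistently a type differs from the AG FITB."""
    courses = sum(tally.values())
    if tally["easier"] == courses:
        return "easier in every course"
    if tally["harder"] == courses:
        return "harder in every course"
    if tally["easier"] and tally["harder"]:
        return "mixed"  # significant effects in both directions
    if tally["easier"]:
        return "easier or no different"
    if tally["harder"]:
        return "harder or no different"
    return "no different"


summary = {name: classify(tally) for name, tally in DIFFICULTY_TALLY.items()}
```
        </preformat>
        <p>Only the HA MC is labeled mixed under this scheme, which is why it is singled out in the discussion above.</p>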
        <p>
          For the recall question types, the HA FITB was statistically easier than the AG
FITB in one course and more difficult in one. For the remaining two courses, the HA
FITB was negative with marginal significance (p = 0.0521 and p = 0.06233); while not
reported as significant, this points to an interesting trend. Given the similarity of these
question types, differences in difficulty should be investigated further in future work.
        </p>
        <p>Persistence for All Courses. The table of persistence means and regression model
effects is omitted for brevity (see [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]). Of the three metrics, persistence had the least
variation across question types and courses, and so we simply summarize the results.
The persistence data sets ranged from 34,124 observations (Neuroscience) to 2,027
observations (Accounting). Mean persistence is generally high across all question types
and courses. High persistence is ideal, as this means students continue to answer until
they reach the correct response. Mean persistence was over 80% in all but a few
instances. Several recognition types had consistently high persistence in all courses: HA
pulldown, AG matching, and HA MC. These also all have positive significance
compared to the AG FITB, with the exception of one HA MC case that was not significantly
different. These are encouraging findings, as the AG matching questions are grouped very closely
with other HA recognition types. The AG matching was generally one of the least
difficult question types, and yet that did not discourage students from persisting when they
did answer them incorrectly. The HA FITB had negative significant results (p &lt; 0.001)
compared to the AG FITB in each course. Students were less likely to persist on the
HA FITB, which is interesting given how similar the question type is to the AG FITB.
This trend will be a focus of future research.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>The current advances in artificial intelligence, natural language processing, and
machine learning make it possible to take high-quality textbook content and automatically
transform it into learning by doing courseware designed to be more effective for student
learning. AQG that is directly based on the textbook content is a practical solution to
achieve this goal of combining content and learning science-based practices at large
scale. However, the AG questions should be rigorously evaluated to ensure they meet
certain standards, and a comparison of performance metrics to HA questions provides
the first step to ensuring the quality of these questions.</p>
      <p>Through the analysis of data from 945 students who used six textbook-based
courseware containing a total of 2,610 AG questions, several interesting trends were
revealed. Student engagement with AG matching questions was very similar to other
HA types such as multiple choice or pulldown. The AG FITB was similar in
engagement to the HA FITB and numeric input. This indicates that there was not a difference
in engagement between comparable AG and HA question types, but rather that the
recognition vs. recall nature of the question type had the greatest impact. When
considering difficulty, the AG matching was generally one of the easiest question types while
the AG FITB was typically in the middle of the range. The difficulty of a given
question type also varied across courses. The HA MC questions were sometimes easier
and sometimes more difficult than the AG FITB, suggesting the influence that content
can have on the difficulty of question types. Finally, we see high persistence rates for
most question types in general. The AG matching was very similar in persistence to
other HA recognition types, indicating the easier questions did not deter students from
answering them until correct. However, the HA FITB, which was sometimes more difficult than
the AG FITB, had significantly lower persistence than its AG recall counterpart.</p>
      <p>
        Overall, the trends from this large-scale data analysis indicate that students in
natural learning contexts do not treat the automatically generated questions differently than
their human-authored counterparts. Along with continuing to expand the scope of
courses analyzed, future research will involve comparison of AG and HA questions on
additional metrics such as discrimination and alignment. As previously noted, some of
the unanticipated results also suggest interesting avenues for investigation. The ultimate
validation, however, will be to investigate the impact of AG questions on student
learning directly, such as in replication of studies of the doer effect on summative
assessments [
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref17">12–14, 17</xref>
        ]. There is still much more to learn, but these findings give optimism
that textbook-based, automatically generated questions could provide a scalable path to
delivering the learning benefits that have been shown with human-authored questions.
This will continue to be a major research focus for some time to come.
      </p>
      <sec id="sec-4-1">
        <title>Acknowledgment</title>
        <p>
          The authors gratefully acknowledge Cathleen Profitko for identifying courses for these
analyses, Nick Brown for assistance with [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], and the anonymous reviewers for their
thoughtful and constructive comments.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Adler</surname>
            ,
            <given-names>R. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>du Pré</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Essential Communication (2nd ed</article-title>
          .). New York: Oxford University Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Adler</surname>
            ,
            <given-names>R. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosenfeld</surname>
            ,
            <given-names>L. B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Proctor</surname>
            <given-names>II</given-names>
          </string-name>
          ,
          <string-name>
            <surname>R. F.</surname>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Interplay: The process of interpersonal communication (15th ed</article-title>
          .). New York: Oxford University Press.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Adler</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Van Doren</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>1940</year>
          ).
          <article-title>How to read a book</article-title>
          . New York, NY: Touchstone/Simon &amp; Schuster.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>L. W.</given-names>
          </string-name>
          (Ed.), Krathwohl,
          <string-name>
            <surname>D. R</surname>
          </string-name>
          . (Ed.), Airasian,
          <string-name>
            <given-names>P. W.</given-names>
            ,
            <surname>Cruikshank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            ,
            <surname>Mayer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            ,
            <surname>Pintrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R.</given-names>
            ,
            <surname>Raths</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            , &amp;
            <surname>Wittrock</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. C.</surname>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>A taxonomy for learning, teaching, and assessing: A revision of Bloom's Taxonomy of Educational Objectives (Complete edition)</article-title>
          . New York: Longman.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Andrew</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>1938</year>
          ).
          <article-title>A comparison of two new-type questions: recall and recognition</article-title>
          .
          <source>Journal of Educational Psychology</source>
          ,
          <volume>29</volume>
          (
          <issue>3</issue>
          ),
          <fpage>175</fpage>
          -
          <lpage>193</lpage>
          . https://doi.org/10.1037/h0062394
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bosson</surname>
            ,
            <given-names>J. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vandello</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Buckner</surname>
            ,
            <given-names>C. V.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>The psychology of sex and gender (1st ed</article-title>
          .). Thousand Oaks, California: SAGE Publications.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dittel</surname>
            ,
            <given-names>J. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jerome</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
          </string-name>
          , N.,
          <string-name>
            <surname>Benton</surname>
          </string-name>
          , R.,
          <string-name>
            <surname>Van</surname>
            <given-names>Campenhout</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Kimball</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            ,
            <surname>Profitko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            , &amp;
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. G.</surname>
          </string-name>
          (
          <year>2019</year>
          ).
          <source>SmartStart: Artificial Intelligence Technology for Automated Textbook-to-Courseware Transformation, Version 1.0</source>
          . Raleigh, NC: VitalSource Technologies.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Dunlosky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rawson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marsh</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nathan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Willingham</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Improving students' learning with effective learning techniques: promising directions from cognitive and educational psychology</article-title>
          .
          <source>Psychological Science in the Public Interest</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ),
          <fpage>4</fpage>
          -
          <lpage>58</lpage>
          . https://doi.org/10.1177/1529100612453266
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fitzpatrick</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>McConnell</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Student reading strategies and textbook use: an inquiry into economics and accounting courses</article-title>
          .
          <source>Research in Higher Education Journal</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Heilman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N. A.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Question Generation via Overgenerating Transformations and Ranking</article-title>
          .
          Retrieved February 9, 2021, from www.lti.cs.cmu.edu
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Honnibal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montani</surname>
            , I., Van Landeghem,
            <given-names>S.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Boyd</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>spaCy: Industrial-strength Natural Language Processing in Python</article-title>
          . https://doi.org/10.5281/zenodo.1212303
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Koedinger</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McLaughlin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Bier</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Learning is not a spectator sport: doing is better than watching for learning from a MOOC</article-title>
          .
          <source>In: Learning at Scale</source>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>120</lpage>
          . Vancouver, Canada. http://dx.doi.org/10.1145/2724660.2724681
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Koedinger</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McLaughlin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Bier</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Is the doer effect a causal relationship? How can we tell and why it's important</article-title>
          .
          <source>Learning Analytics and Knowledge</source>
          . Edinburgh, United Kingdom. http://dx.doi.org/10.1145/2883851.2883957
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Koedinger</surname>
            ,
            <given-names>K. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scheines</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Schaldenbrand</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Is the doer effect robust across multiple data sets?</article-title>
          <source>Proceedings of the 11th International Conference on Educational Data Mining, EDM 2018</source>
          ,
          <fpage>369</fpage>
          -
          <lpage>375</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Kurdi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Al-Emari</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>A Systematic Review of Automatic Question Generation for Educational Purposes</article-title>
          .
          <source>International Journal of Artificial Intelligence in Education</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          ),
          <fpage>121</fpage>
          -
          <lpage>204</lpage>
          . https://doi.org/10.1007/s40593-019-00186-y
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Lovett</surname>
            , M., Meyer,
            <given-names>O.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Thille</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>The Open Learning Initiative: Measuring the Effectiveness of the OLI Statistics Course in Accelerating Student Learning</article-title>
          .
          <source>Journal of Interactive Media in Education</source>
          . http://doi.org/10.5334/2008-14
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Olsen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>B. G.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Deeper collaborations: a finding that may have gone unnoticed</article-title>
          .
          <source>Paper Presented at the IMS Global Learning Impact Leadership Institute</source>
          , San Diego, CA.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Duchesnay</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>In Journal of Machine Learning Research</source>
          (Vol.
          <volume>12</volume>
          ). http://scikit-learn.sourceforge.net
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Phillips</surname>
            ,
            <given-names>B. J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Phillips</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Sink or Skim: Textbook Reading Behaviors of Introductory Accounting Students</article-title>
          .
          <source>Issues in Accounting Education</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ),
          <fpage>21</fpage>
          -
          <lpage>44</lpage>
          . https://doi.org/10.2308/iace.2007.22.1.21
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Scott</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Accounting for business (3rd ed</article-title>
          .). New York: Oxford University Press.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Swanson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reguera</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schaechter</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Neidhardt</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <source>Microbe (2nd ed.)</source>
          . Washington, DC: ASM Press.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Van Campenhout</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jerome</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dittel</surname>
            ,
            <given-names>J. S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>B. G.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Toward Effective Courseware at Scale: Investigating Automatically Generated Questions as Formative Practice</article-title>
          . Learning at Scale. https://doi.org/10.1145/3430895.3460162
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Van Campenhout</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kimball</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>At the intersection of technology and teaching: The critical role of educators in implementing technology solutions</article-title>
          .
          <source>IICE 2021: The 6th IAFOR International Conference on Education</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>VanLehn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems</article-title>
          .
          <source>Educational Psychologist</source>
          ,
          <volume>46</volume>
          (
          <issue>4</issue>
          ),
          <fpage>197</fpage>
          -
          <lpage>221</lpage>
          . https://doi.org/10.1080/00461520.2011.611369
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <article-title>VitalSource Supplemental Data Repository</article-title>
          . https://github.com/vitalsource/data
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Watson</surname>
            ,
            <given-names>N. V.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Breedlove</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <source>The mind's machine: Foundations of brain and behavior (4th ed.)</source>
          . Sunderland, Massachusetts: Sinauer Associates.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Zavala</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Mendoza</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>On the use of semantic-based AIG to automatically generate programming exercises</article-title>
          .
          In:
          <source>The 49th ACM Technical Symposium on Computer Science Education</source>
          , ACM
          , pp.
          <fpage>14</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>