<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Student answers to OSPE questions tend
to be short sentence fragments that lack proper grammati-
cal structure. A preliminary examination using NLP on the
OSPE answers suggested that there was insufficient infor-
Introduction mation for the algorithm to derive much meaning. Hence,
due to the differences in the answer structure and the early
NLP assessment, existing approaches were not evaluated.It
was observed that the student answers, while short, gener-
ally used the unique, technical words of the anatomical sci-
ences, although not often the same words used in the fac-
ulty derived answer key. Therefore, it was hypothesized that
due to the technical nature of the anatomical sciences, par-
ticular and unique words should appear in correct answers,
even if they are not the words expected by faculty. Further-
more, decision tree learning can use a series of derived sim-
false rules to determine the combinations of such
words that infer a correct or incorrect answer. While the tool
created is likely not generally useful for short answer ques-
tions, an evaluation showed it is able to grade OSPE ques-
tions using the students' lexicon with</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Grading OSPE Questions with Decision Learning Trees: A First Step Towards an Intelligent Tutoring System for Anatomical Education</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jason Bernard</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bruce Wainman</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>O'Lencia Walker</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Courney Pitt</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilana Bayer</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josh Mitchell</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alex Bak</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anthony Saraco</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ranil Sonnadara</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>96</volume>
      <issue>8</issue>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Intelligent tutoring systems (ITSs) have been used for
decades as a means for improving the quality of education for
learners primarily by providing guidance to students based
on a student model, e.g., predicting their knowledge level
on a subject. There have been few attempts to incorporate
ITSs into anatomical education. Objective structured
practical examinations (OSPEs) are an important, albeit
challenging, means of evaluation in anatomical education. This
research aims to create an ITS for anatomical OSPEs, and as a
crucial first step looks to create a machine learning-based
approach for grading OSPEs. To that end, decision tree learning
was evaluated with, and without, spellchecking to produce a
grading tool using the answer key developed by instructional
assistants. Using answers from 428 learners, the tool obtained
an average accuracy of 96:8% (SD = 3:4%) across 60
questions.
Intelligent tutoring systems (ITSs) in educational
technology have been researched since at least the 1960s (Regian
and Shute 1966). An ITS works by interacting with the
learner and utilizing student modelling techniques to provide
a customized experience based on their cognitive
characteristics, such as affect, knowledge level, and interests
        <xref ref-type="bibr" rid="ref1 ref5 ref7">(Regian
and Shute 1966; Bakhshinategh et al. 2018; Joshi et al. 2019;
Xu et al. 2019; Mousavinasab et al. 2021)</xref>
        . This is often done
using adaptive learning material; however, other
pedagogical techniques can be used, e.g., gamification, prompting the
learner to reflect on their answer, or proving an immediate
review
        <xref ref-type="bibr" rid="ref1 ref5 ref7">(Regian and Shute 1966; Bakhshinategh et al. 2018;
Joshi et al. 2019; Xu et al. 2019; Mousavinasab et al. 2021)</xref>
        .
Overall, the experience with ITSs suggest that they have a
positive effect on learning outcomes
        <xref ref-type="bibr" rid="ref5 ref7">(Joshi et al. 2019; Xu
et al. 2019)</xref>
        .
      </p>
      <p>Medical education has not been much of a focus for
educational technology in general, and ITSs in particular. This
research was undertaken to develop an ITS for anatomical
sciences education. In anatomical education, the objective</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>This section describes the approach used in this research to
evaluate decision trees (DTs) to grade OSPE questions. It
begins with an overview of the decision tree learning (DTL)
algorithm. This is followed by a description of how decision
trees are used to evaluate OSPE questions. Afterwards, the
data and data gathering approach is discussed and then the
metrics used to evaluate the OSPE grading tool.</p>
      <sec id="sec-2-1">
        <title>Decision Tree Learning</title>
        <p>
          Previously, it was shown that DTL could be useful for
parsing grammatical structures and using the resulting tree
to aide in grading short answer questions
          <xref ref-type="bibr" rid="ref6">(Leacock and
Chodorow 2003)</xref>
          . While their approach differs as they use
the structure as input to other NLP approaches, it suggests
that the decision tree structure can be useful for this kind of
problem. It is certainly possible that other algorithms may be
effective for identifying correct answers to OSPE questions
(e.g., potentially an unevaluated NLP algorithm or
clustering algorithms). However, DTs seem particularly well suited
to use in an ITS. Firstly, unlike most NLP algorithms that
use neural networks, which are black box algorithms, DTs
provide a transparent reasoning that can be expressed to
students along with a confidence level of correctness. Secondly,
other algorithms may struggle to define the relationship in an
efficient way, e.g., clustering algorithms would likely create
overlapping word spaces that would be difficult to evaluate.
Hence, for this research study, DTs were used to produce a
set of rules that describe a relationship between the words in
an OSPE answer and correctness. The following description
of DTs is summarized from Quinlan (1996) unless otherwise
noted.
        </p>
        <p>The aim of a DT is to produce a classification of a sample
based on a sequence of true/false rules relating to a feature in
the data. For example, to predict whether it will rain, a tree
may consist of the rules “is it cloudy?”, and if so then “is the
humidity above 60%?”. If the answer to both questions is
“true” (or yes), then predict it will rain, and if the answer to
either question is “false” (or no), then predict it will not rain.
Each rule is represented structurally as a node consisting of:
1. the Boolean rule, 2. the certainty for each possible
classification (described below), and 3. an optional connection
to two other nodes (called child nodes). One node is
designated for when the rule evaluates to “true” for a sample, and
the other for “false”.</p>
        <p>The collection of nodes is arranged in an (upside down)
tree-like structure (a simplified sample from this research
is shown in Figure 1), with a root node at the top and its
children below it, and their children on the next level down,
and so on. The bottommost nodes have no children and are
referred to as leaf nodes. Each child is a subtree of its
parent and referred to as ST (subtree “true”) and SF (subtree
“false”). By convention, the ST is to the left, and SF on
the right. Any movement to the right does not automatically
mean an answer is incorrect as there may be many different
combinations of words that are correct. Hence, for this tree,
it is possible that an answer lacking the word “muscles” is
still correct, and would be identified by the DT as such. For
those cases, the word at the second level would likely be an
alternative word that appears in correct answers. The effect
of the DT is to create a serial of rules. If “subvavular
apparatus” is a correct answer, then effectively the result of the
tree traversal is to ask:</p>
        <p>Does the answer not contain “muscles” and contains
“subvalvular” and contains “apparatus”, and if so, then the
answer is classified as correct.</p>
        <p>The process of building a tree from data is called DTL.
DTL is an iterative process that examines a data set to find
the Boolean rule that has the greatest information gain
measured by reducing entropy. In other words, all possible rules
are considered, and the rule that allows for the most clarity
to the prediction is selected. For example, it is difficult for it
to rain if there are no clouds in the sky, so this is a
reasonable first question to determine if it will rain. If there are no
clouds, then it will clearly not rain.</p>
        <p>The general case can then be described as follows. Let D0
be the initial dataset. Starting with the first iteration, every
possible rule that can be applied to D0 is considered. The
rule that is most effective at clarifying the prediction (called
information gain or I G) is selected and D0 is split into two
datasets DT 1 and DF 1. DT 1 contains all samples for which
the selected is true and DF 1 all those for which the rule
evaluates to false. The process iterates using DT 1, and then DF 1,
which will each produce two additional datasets, and so on
until no rule can split the dataset.</p>
        <p>Mathematically, this is done by computing entropy (E) as
shown in Equation 1 where T and F are the count of the
samples in the dataset that would evaluate to true and false
respectively if the rule were selected. For some iteration, let
there be n possible rules. Information gain (I G) can then be
computed by subtracting the entropy of a candidate rule n
from the entropy using the current dataset (Ecurr), e.g., in
the first iteration D0, in the second iteration, Ecurr is
computed using DT 1, and then separately using DF 1, and so on.
This is shown in Equation 2. Finally, the rule with the best
I G value is selected. The probability of each possible
classification (correct and incorrect for this research) is computed
and stored by count of the number of samples with each
label divided by the total. The classification associated to the
node is the one with the greatest probability. The certainty of
the classification being right is the probability of that
classification. For example, if 56% of all samples in a data set are
labelled as “correct”, then the “correct” classification will be
associated to the node with 56% certainty.
(1)
(2)
E(T ; F ) =</p>
        <p>T =(T + F ) logT =(T + F )</p>
        <p>F =(T + F ) logF =(T + F )
I G(Tn; Fn) = Ecurr</p>
        <p>E(Tn; Fn)</p>
        <p>Once the DT is built using this process, a new sample is
classified by traversing the tree starting from the root node.
At each node, the Boolean rule is applied to the sample and
if it evaluates to “true”, then the true connection is followed;
otherwise, the false connection is taken. If no traversal is
possible (which can happen if not all samples have the same
features), or if the node is a leaf node, then the process
terminates, and the associated classification and certainty are
returned.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Grading OSPE with a Decision Tree</title>
        <p>The previous subsection described how a DT can
mechanically grade a question based on some set of features. This
subsection describes the features utilized and principles by
which the tool functions. To begin, the feature set is simply
all of the unique words that exists across all student answers
for a question. It was hypothesized that students should have
a particular shared lexicon of words for describing the
correct answer, even if it is not exactly the same as the textbook
answer. Such a lexicon can be used to train a decision tree as
there will be a set of unique words that belong only to correct
answers, and a set that belong to only incorrect answers. So
while some will belong to both, if a student answer contains
the words associated with a correct answer as determined
by the decision tree, then this is positive evidence that the
answer is correct, and vice versa.</p>
        <p>
          Some preprocessing was done to the data set prior to
training the algorithm. First, all blank answers were removed
since they are trivially incorrect and uninteresting. Second,
all answers were spellchecked using the Jazzy spellchecker
v0.5.2
          <xref ref-type="bibr" rid="ref4">(Idzelis 2005)</xref>
          . The dictionary included with the Jazzy
library was used; however, since it does not contain many
medical terms all words in the master answer key were
included. In all cases where a misspelling was identified, the
top word in the correction list was taken. This is one area
where perhaps the algorithm could be improved by
considering different possible corrections. Finally, the following
common English words found in the student answers were
removed as they do not provide any indication of
correctness: “a”, “an”, “and”, “are”, “as”, “at”, “be”, “but”, “by”,
“did”, “for”, “had”, “has”, “have”, “I”, “in”, “is”, “it”, “of”,
“on”, “or”, “so”, “than”, “that”, “the”, “then”, “they”, “this”,
“to”, “was”, “with”.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Data</title>
        <p>The data for this research consisted of the answers
from a 60 question OSPE in McMaster University’s
Health Sciences Human Anatomy and Physiology
(HTHSCI 2F03/2FF3/2L03/2LL3/1D06) undergraduate course.
The exam consisted of 20 two-dimensional images from
the Stereoscopic Atlas of Human Anatomy (Massachusetts
General Hospital 2017). Digital markings (pins, asterisks,
arrows, etc.) were added to the images to indicate the
anatomical structure for students to consider. Each image
had three associated questions. The exam was conducted
online using “Desire to Learn”, McMaster University’s
learning management system. The exam was completed by 428
students, who had 50 minutes to complete the exam. Virtual
proctoring software Respondus R was used or virtual
proctoring with a TA if Respondus R would not work on a
student’s computer. The questions and the marking master for
the OSPE were produced by the five senior faculty teaching
the course. All student answers were graded by two
teaching assistants (TAs), who then reviewed any differences and
came to a consensus, referred to as the initial grade. All of
the grades were then reviewed by the two instructional
assistants and a final grade was produced. Overall, approximately
5% of initial grades were altered by the instructional
assistants.</p>
        <p>As DTL was used for this research, a training and test
data set were required. The training set was used to produce
the tree by examining the student answers as described in
Subsection 2.1. With the trained DT, a grade was produced
for each student in the test set for each question. A 10-fold,
cross-validation approach was used. Hence for each fold, 42
students were randomly selected as the test set, and the
remainder was used as the training set. Students were selected
such that no student appeared in more than one test set. The
test set is treated as if their answer previously provided and
evaluated, and is therefore referred to as the “student key”.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Metrics</title>
        <p>The performance of the OSPE grading tool was measured
by comparing grade produced by the DT to the actual grade.
Specifically, for each of the 42 students in each fold, and
for each of the 60 questions, the grade produced by DT is
compared to the actual grade. The fold accuracy for each
question is then the average number of matches divided by
42. The accuracy for the question is the average across the
10 folds. While the average accuracy is useful to produce
a general sense of the effectiveness of the tool for
grading OSPE questions, the final grade, produced grade, and
certainty were recorded to allow for a deeper analysis into
the logic of the algorithm, especially when the DT does not
agree with the final grade.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>The computed accuracy using the student answers for the
OSPE grading tool is shown in Figure 2, along with average
grade.</p>
      <p>The key result is the accuracy when determining a final
grade as this has the greatest effect on the students and is
essential for building an ITS. It can be seen from the results
that the accuracy when using the “Student Key” has an
average accuracy of 96:8% (SD = 3:4%), and lowest accuracy
of 84:8% (Q27). These results suggest that students develop
their own collective lexicon for answering anatomical
questions, but are still considered correct. Pedagogically, while
unexplored, it is possible that adapting learning material to
the students’ lexicon may be valuable in promoting a better
learning outcome.</p>
      <p>The result with spellchecking was not any better. In 11
questions, the average accuracy was slightly lower, while in
2 questions it was higher (&lt; 0:5% higher or lower). For the
most part, the DT algorithm seemed to learn the misspellings
as frequently the mistakes were identical. Where it would
make a difference, the spellchecker struggled with
medical terms despite making some effort to adjust it to
properly correct them. For this reason, the experiment will be
done again with a spellchecking algorithm and dictionary
designed specifically for medical use. For these results, no
significant conclusion can be reached.</p>
      <p>The relationship between accuracy and the average grade
was determined using the Pearson correlation coefficients
r = 0:153 and p = 0:244. While the R value suggests that
there is no correlation, p value is greater than 0:05 so it is
possible that the results are occurring by chance. Therefore,
an algorithmic and practical evaluation of the reasoning was
taken.</p>
      <p>Algorithmically, if the AI has a bias then it should favour
either guessing correct or incorrect. If the AI has a correct
bias, then questions with a grade less than 50% should have
consistently low accuracy. If the AI has an incorrect bias,
then questions with a grade of 50% or higher should have
a consistently low accuracy. It is evident from Figure 2 that
this is not the case. Questions with grades between 40% to
60% have accuracy values throughout the distribution.</p>
      <p>From a practical perspective, an anatomical expert was
recruited to duplicate the first step taken by the tool. The expert
was asked to examine the student answers for each question
and pick from the lexicon which word would be most likely
to indicate that the student had answered the question
correctly. The expert made the selection without knowing what
had been selected by the AI. Ideally, it would be better to
have the expert completely duplicate the process of the tree;
however, this would be quite time consuming for them, so
this was taken as a rough approximation of logical
agreement. The expert picked the same word as the tool in 45
of the 60 questions. For 4 of the 15 questions, the expert
picked an acronym that was not frequently used by the
students who preferred to write the answer out in full (which
itself is potentially pedagogically interesting), indicating the
expert was relying somewhat on their own expert knowledge
beyond just the student answers. In the other 11 cases, the
word selected by the expert was the OSPE grader’s second
choice. In particular it is notable, that when the answers were
longest (10 or more words on average) the expert and the AI
either agreed or it was the AI’s second choice. This indicates
that the AI is doing reasonably well at finding the important
words by mirroring the human choice. Overall, in
combination with the other observations discussed, this suggests that
the reasoning used by the AI is valid, and that it is not simply
guessing.</p>
    </sec>
    <sec id="sec-4">
      <title>Limitations</title>
      <p>While the result are promising, there are some noteworthy
limitations to the findings. The long term goal of this
research is to develop an ITS for anatomical education;
however, this research has assumed that questions have been
answered by a cohort of students from which the DT can be
built. This is not necessarily ideal as it would be more
practical to add a question to the hypothetical ITS and have the
AI simply work. This work would suggest that over time
the ITS would get better at grading the questions, which is
welcome, but it does not address what happens early on.
Of course, it is possible to simply take exam questions
after they have been used and add them to the ITS; however,
this requires the instructional assistants to constantly come
up with a new pool of questions. Therefore, this tool needs
to be evaluated without training using the student answers.
To address these limitations, two steps have recently been
taken. A group of third and fourth year university students
of the anatomical sciences have been recruited to produce
many OSPE questions. This will provide both the necessary
questions for the ITS, and an initial seed of student answers
to train the DT. Faculty will also add responses to the
answer key. Additionally, an evaluation is being performed
using only the faculty-derived answer key.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>
        This paper has presented an early look at a machine
learning-based tool for grading objective structured
practical examinations (OSPEs), which are frequently used and
viewed as an important aspect of anatomical sciences
education
        <xref ref-type="bibr" rid="ref2">(Chan et al. 2019)</xref>
        . As OSPE questions are short
answers, consisting of a few words or a short sentence, it is
more difficult to grade than multiple choice (for example),
where a student answer is definitively correct or incorrect. It
was hypothesized that a decision tree could learn the lexicon
used by learners to answer questions, and distinguish from
that lexicon the words associated with correct and incorrect
answers.
      </p>
      <p>Using the answers obtained from 428 anatomical sciences
students on a 60 question OSPE, the tool was trained using a
10-fold cross validation method. Overall, the algorithm
obtained a 96:8% accuracy (SD = 3:4%) for correctly
grading the student answers. Based on a multifaceted analysis of
the results, it was determined that the tool was not simply a
guess. Firstly, the algorithm shows no bias towards guessing
“correct” or “incorrect” based on an examination of
questions with grades ranging from 40% to 60%. Secondly, an
anatomical expert was recruited to examine the algorithms
selected root words and the AI choices were found to be
reasonable and matching 45 out of 60 times, and being the
second choice for 11 questions. For the remaining four
questions, the expert made a choice not possible for the AI by
using an acronym not used by the students. Overall, the
evidence suggests that the OSPE grading tool is using reasoning
and not guessing.</p>
      <p>While the average result was promising, three questions
(Q27, 30 and 33) were notably lower than the mean with
accuracies about 85%. The underlying causes for the errors
were examined and some anatomical terminology was not
recognized as words. For example, “C5”, “C6”, “CN11”,
and “CNXI” were not considered words, let alone unique
words, by the DT and therefore, they were included in the
solution space. For Q30, there were many different
variations of words that appeared in correct answers, e.g.,
movement, motion, forward, extension, hyperextension; however,
these words also appeared in incorrect answers as well. The
potential solution is to blend the student key with the faculty
derived answer key and have certain words marked as
critical. Other solutions would be to use a more complex natural
language processing solution that may be required to
understand the words in context or have the tool learn the weighted
importance of the different words.</p>
      <p>The future of this research, beyond addressing the issues
around accuracy, is to expand the grading tool into an ITS.
The ITS can then be evaluated on a student cohort to see
if it improves the learning outcomes. As part of building
an ITS, an investigation will be conducted on learning
outcomes when using the students’ lexicon versus textbook
answers. Recently, work has begun by developing an online
OSPE practice tool for students using the AI-based grader.</p>
      <p>Massachusetts General Hospital. 2017. Bassett Collection:
Stereoscopic Atlas of Human Anatomy.</p>
      <p>Mousavinasab, E.; Zarifsanaiey, N.; R. Niakan Kalhori, S.;
Rakhshan, M.; Keikha, L.; and Ghazi Saeedi, M. 2021.
Intelligent tutoring systems: A systematic review of
characteristics, applications, and evaluation methods. Interactive
Learning Environments 29(1): 142–163.</p>
      <p>Quinlan, J. R. 1996. Learning decision tree classifiers. ACM
Computing Surveys 28(1): 71–72.</p>
      <p>Regian, J.; and Shute, V. 1966. Arificial intelligence in
training: The evolution of intelligent tutoring systems. In
Proceedings of the Conference on Technology and Training In
Education.</p>
      <p>Shermis, M. D.; Burstein, J.; Brew, C.; Higgins, D.; and
Zechner, K. 2015. Recent Innovations in Machine Scoring
of Student and Test Taker Written and Spoken Responses.
In Handbook of Test Development, 351–370. Routledge.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bakhshinategh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zaiane</surname>
            ,
            <given-names>O. R.</given-names>
          </string-name>
          ; ElAtia, S.; and
          <string-name>
            <surname>Ipperciel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Educational data mining applications and tasks: A survey of the last 10 years</article-title>
          .
          <source>Education and Information Technologies</source>
          <volume>23</volume>
          (
          <issue>1</issue>
          ):
          <fpage>537</fpage>
          -
          <lpage>553</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Chan</surname>
            ,
            <given-names>A. Y.-C. C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Custers</surname>
            , E. J.; van Leeuwen,
            <given-names>M. S.</given-names>
          </string-name>
          ; Bleys,
          <string-name>
            <surname>R. L.</surname>
          </string-name>
          ;
          <article-title>and ten</article-title>
          <string-name>
            <surname>Cate</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Does an Additional Online Anatomy Course Improve Performance of Medical Students on Gross Anatomy Examinations?</article-title>
          <source>Medical Science Educator</source>
          <volume>29</volume>
          (
          <issue>3</issue>
          ):
          <fpage>697</fpage>
          -
          <lpage>707</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S. T.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Latent semantic analysis</article-title>
          .
          <source>Annual Review of Information Science and Technology</source>
          <volume>38</volume>
          (
          <issue>1</issue>
          ):
          <fpage>188</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Idzelis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>Jazzy: The Java open source spell checker</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Allessio</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Magee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Whitehill</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Arroyo</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Woolf</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sclaroff</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and Betke,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>Affect-driven learning outcomes prediction in intelligent tutoring systems</article-title>
          .
          <source>In 14th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG</source>
          <year>2019</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Leacock</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and Chodorow,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2003</year>
          .
          <article-title>C-rater: Automated scoring of short-answer questions</article-title>
          .
          <source>Computers and the Humanities</source>
          <volume>37</volume>
          (
          <issue>4</issue>
          ):
          <fpage>389</fpage>
          -
          <lpage>405</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wijekumar</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ramirez</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ; and Irey,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>The effectiveness of intelligent tutoring systems on K-12 students' reading comprehension: A meta-analysis</article-title>
          .
          <source>British Journal of Educational Technology</source>
          <volume>50</volume>
          (
          <issue>6</issue>
          ):
          <fpage>3119</fpage>
          -
          <lpage>3137</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>