Grading OSPE Questions with Decision Learning Trees: A First Step Towards an Intelligent Tutoring System for Anatomical Education

Jason Bernard1,2, Bruce Wainman3,5, O'Lencia Walker3, Courney Pitt3, Ilana Bayer3,5, Josh Mitchell3, Alex Bak4, Anthony Saraco3, Ranil Sonnadara1,2

1 Department of Surgery, McMaster University, Hamilton, Ontario, Canada
2 Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
3 Education Program in Anatomy, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
4 Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
5 Department of Pathology and Molecular Medicine, McMaster University, Hamilton, Ontario, Canada

bernac12@mcmaster.ca, wainmanb@mcmaster.ca, ranil@mcmaster.ca

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Intelligent tutoring systems (ITSs) have been used for decades as a means of improving the quality of education for learners, primarily by providing guidance to students based on a student model, e.g., predicting their knowledge level on a subject. There have been few attempts to incorporate ITSs into anatomical education. Objective structured practical examinations (OSPEs) are an important, albeit challenging, means of evaluation in anatomical education. This research aims to create an ITS for anatomical OSPEs, and as a crucial first step looks to create a machine learning-based approach for grading OSPEs. To that end, decision tree learning was evaluated with, and without, spellchecking to produce a grading tool using the answer key developed by instructional assistants. Using answers from 428 learners, the tool obtained an average accuracy of 96.8% (SD = 3.4%) across 60 questions.

Introduction

Intelligent tutoring systems (ITSs) in educational technology have been researched since at least the 1960s (Regian and Shute 1966). An ITS works by interacting with the learner and utilizing student modelling techniques to provide a customized experience based on the learner's cognitive characteristics, such as affect, knowledge level, and interests (Regian and Shute 1966; Bakhshinategh et al. 2018; Joshi et al. 2019; Xu et al. 2019; Mousavinasab et al. 2021). This is often done using adaptive learning material; however, other pedagogical techniques can be used, e.g., gamification, prompting the learner to reflect on their answer, or providing an immediate review (Regian and Shute 1966; Bakhshinategh et al. 2018; Joshi et al. 2019; Xu et al. 2019; Mousavinasab et al. 2021). Overall, the experience with ITSs suggests that they have a positive effect on learning outcomes (Joshi et al. 2019; Xu et al. 2019).

Medical education has not been much of a focus for educational technology in general, and ITSs in particular. This research was undertaken to develop an ITS for anatomical sciences education. In anatomical education, the objective structured practical examination (OSPE) is considered an important part of the curriculum (Chan et al. 2019); however, it is an exam with which many learners struggle. OSPE questions are in the form of an image (or sample) with a pin indicating the anatomical structure to be considered by the student. The student is typically asked to either identify the structure or its function in the form of a short sentence (or sentence fragment). Therefore, an algorithm is needed that can grade short answer OSPE-style questions. While there has been much work on grading short answer questions (Leacock and Chodorow 2003; Shermis et al. 2015; Dumais 2004), these approaches use natural language processing (NLP) techniques that are intended to work with short paragraphs and to generalize across many topics (mainly in a K-12 context). Student answers to OSPE questions tend to be short sentence fragments that lack proper grammatical structure. A preliminary examination using NLP on the OSPE answers suggested that there was insufficient information for the algorithm to derive much meaning. Hence, due to the differences in answer structure and the early NLP assessment, existing approaches were not evaluated. It was observed that the student answers, while short, generally used the unique, technical words of the anatomical sciences, although not often the same words used in the faculty-derived answer key.
Therefore, it was hypothesized that, due to the technical nature of the anatomical sciences, particular and unique words should appear in correct answers, even if they are not the words expected by faculty. Furthermore, decision tree learning can use a series of derived simple true/false rules to determine the combinations of such words that indicate a correct or incorrect answer. While the tool created is likely not generally useful for short answer questions, an evaluation showed it is able to grade OSPE questions using the students' lexicon with 96.8% accuracy.

Methodology

This section describes the approach used in this research to evaluate decision trees (DTs) for grading OSPE questions. It begins with an overview of the decision tree learning (DTL) algorithm. This is followed by a description of how decision trees are used to evaluate OSPE questions. Afterwards, the data and data gathering approach are discussed, and then the metrics used to evaluate the OSPE grading tool.

Decision Tree Learning

Previously, it was shown that DTL could be useful for parsing grammatical structures and using the resulting tree to aid in grading short answer questions (Leacock and Chodorow 2003). While their approach differs, as they use the structure as input to other NLP approaches, it suggests that the decision tree structure can be useful for this kind of problem. It is certainly possible that other algorithms may be effective for identifying correct answers to OSPE questions (e.g., potentially an unevaluated NLP algorithm or clustering algorithms). However, DTs seem particularly well suited to use in an ITS. Firstly, unlike most NLP algorithms that use neural networks, which are black box algorithms, DTs provide transparent reasoning that can be expressed to students along with a confidence level of correctness. Secondly, other algorithms may struggle to define the relationship in an efficient way; e.g., clustering algorithms would likely create overlapping word spaces that would be difficult to evaluate. Hence, for this research study, DTs were used to produce a set of rules that describe a relationship between the words in an OSPE answer and correctness. The following description of DTs is summarized from Quinlan (1996) unless otherwise noted.

The aim of a DT is to produce a classification of a sample based on a sequence of true/false rules relating to a feature in the data. For example, to predict whether it will rain, a tree may consist of the rules "is it cloudy?", and if so then "is the humidity above 60%?". If the answer to both questions is "true" (or yes), then predict it will rain, and if the answer to either question is "false" (or no), then predict it will not rain. Each rule is represented structurally as a node consisting of: 1. the Boolean rule, 2. the certainty for each possible classification (described below), and 3. an optional connection to two other nodes (called child nodes). One child node is designated for when the rule evaluates to "true" for a sample, and the other for "false".

The collection of nodes is arranged in an (upside down) tree-like structure (a simplified sample from this research is shown in Figure 1), with a root node at the top, its children below it, their children on the next level down, and so on. The bottommost nodes have no children and are referred to as leaf nodes. Each child is a subtree of its parent, referred to as ST (subtree "true") or SF (subtree "false"). By convention, the ST is drawn on the left and the SF on the right.
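To make the node structure concrete, the sketch below encodes it in Python. This is an illustration under stated assumptions, not the authors' implementation: the `Node` name, the dictionary of certainties, and the callable rule are choices made here, with the rain example above encoded as a small tree.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A minimal sketch of the node described above: 1. the Boolean rule,
# 2. the certainty of each possible classification, and 3. optional
# connections to a "true" child (ST) and a "false" child (SF).
@dataclass
class Node:
    rule: Optional[Callable]        # None marks a leaf node in this sketch
    certainty: dict                 # e.g., {"rain": 0.56, "no rain": 0.44}
    st: Optional["Node"] = None     # subtree followed when the rule is true
    sf: Optional["Node"] = None     # subtree followed when the rule is false

# The rain example as a tree: "is it cloudy?" and, if so,
# "is the humidity above 60%?" (certainties invented for illustration).
tree = Node(
    rule=lambda s: s["cloudy"],
    certainty={"rain": 0.30, "no rain": 0.70},
    st=Node(
        rule=lambda s: s["humidity"] > 60,
        certainty={"rain": 0.60, "no rain": 0.40},
        st=Node(None, {"rain": 0.90, "no rain": 0.10}),   # cloudy and humid
        sf=Node(None, {"rain": 0.20, "no rain": 0.80}),   # cloudy, not humid
    ),
    sf=Node(None, {"rain": 0.05, "no rain": 0.95}),       # no clouds
)
```

Each node carries the three parts listed above; in this sketch a leaf is simply a node with no rule and no children.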
The process of building a tree from data is called DTL. DTL is an iterative process that examines a data set to find the Boolean rule with the greatest information gain, measured by the reduction in entropy. In other words, all possible rules are considered, and the rule that brings the most clarity to the prediction is selected. For example, it is difficult for it to rain if there are no clouds in the sky, so this is a reasonable first question to determine if it will rain. If there are no clouds, then it will clearly not rain.

The general case can be described as follows. Let $D_0$ be the initial dataset. Starting with the first iteration, every possible rule that can be applied to $D_0$ is considered. The rule that is most effective at clarifying the prediction (measured by information gain, or IG) is selected, and $D_0$ is split into two datasets, $D_1^T$ and $D_1^F$. $D_1^T$ contains all samples for which the selected rule is true, and $D_1^F$ all those for which the rule evaluates to false. The process iterates using $D_1^T$, and then $D_1^F$, each of which produces two additional datasets, and so on until no rule can split the dataset.

Mathematically, this is done by computing entropy ($E$) as shown in Equation 1, where $T$ and $F$ are the counts of the samples in the dataset that would evaluate to true and false respectively if the rule were selected. For some iteration, let there be $n$ possible rules. Information gain (IG) is then computed by subtracting the entropy of a candidate rule $n$ from the entropy of the current dataset ($E_{curr}$); e.g., in the first iteration $E_{curr}$ is computed using $D_0$, while in the second iteration it is computed using $D_1^T$, and then separately using $D_1^F$, and so on. This is shown in Equation 2. Finally, the rule with the best IG value is selected.

$$E(T, F) = -\frac{T}{T+F}\log\frac{T}{T+F} - \frac{F}{T+F}\log\frac{F}{T+F} \quad (1)$$

$$IG(T_n, F_n) = E_{curr} - E(T_n, F_n) \quad (2)$$

The probability of each possible classification (correct and incorrect for this research) is computed and stored as the count of the samples with each label divided by the total. The classification associated with a node is the one with the greatest probability, and the certainty of that classification being right is its probability. For example, if 56% of all samples in a data set are labelled as "correct", then the "correct" classification will be associated with the node with 56% certainty.

Once the DT is built using this process, a new sample is classified by traversing the tree starting from the root node. At each node, the Boolean rule is applied to the sample; if it evaluates to "true", then the true connection is followed; otherwise, the false connection is taken. If no traversal is possible (which can happen if not all samples have the same features), or if the node is a leaf node, then the process terminates, and the associated classification and certainty are returned.
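The full learning-and-classification loop can be sketched as follows, reusing the `Node` class from the sketch above. This is a toy illustration, not the study's code: it follows the common ID3-style formulation, in which the label entropy of each branch is weighted by branch size when computing the gain of Equation 2, and it assumes each sample is a (set-of-words, label) pair.

```python
import math

def entropy(t: int, f: int) -> float:
    """Equation 1: E(T, F), with 0 * log(0) taken as 0."""
    total = t + f
    return -sum(c / total * math.log2(c / total) for c in (t, f) if c)

def label_entropy(samples):
    """Entropy of the correct/incorrect labels in a list of
    (word_set, label) samples."""
    correct = sum(1 for _, label in samples if label == "correct")
    return entropy(correct, len(samples) - correct)

def best_rule(samples, words):
    """Select the word whose presence/absence split has the greatest
    information gain (Equation 2, branches weighted by size)."""
    def gain(word):
        d_t = [s for s in samples if word in s[0]]
        d_f = [s for s in samples if word not in s[0]]
        n = len(samples)
        return label_entropy(samples) - (len(d_t) / n * label_entropy(d_t)
                                         + len(d_f) / n * label_entropy(d_f))
    return max(words, key=gain)

def build(samples, vocabulary):
    """Recursively split on the best rule until no rule splits the data."""
    correct = sum(1 for _, label in samples if label == "correct")
    certainty = {"correct": correct / len(samples),
                 "incorrect": (len(samples) - correct) / len(samples)}
    # Only words present in some, but not all, samples can split the set.
    splitting = [w for w in vocabulary
                 if 0 < sum(w in s[0] for s in samples) < len(samples)]
    if label_entropy(samples) == 0 or not splitting:
        return Node(None, certainty)                    # leaf node
    word = best_rule(samples, splitting)
    return Node(lambda s, w=word: w in s, certainty,
                st=build([s for s in samples if word in s[0]], splitting),
                sf=build([s for s in samples if word not in s[0]], splitting))

def classify(node, words):
    """Traverse from the root; at a leaf, return the classification with
    the greatest probability along with its certainty."""
    while node.rule is not None:
        node = node.st if node.rule(words) else node.sf
    label = max(node.certainty, key=node.certainty.get)
    return label, node.certainty[label]
```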
Any movement to the right does not automatically mean an answer is incorrect, as there may be many different combinations of words that are correct. Hence, for the tree in Figure 1, it is possible that an answer lacking the word "muscles" is still correct and would be identified by the DT as such. For those cases, the word at the second level would likely be an alternative word that appears in correct answers. The effect of the DT is to create a series of rules. If "subvalvular apparatus" is a correct answer, then effectively the result of the tree traversal is to ask: does the answer not contain "muscles" but contain "subvalvular" and "apparatus"? If so, the answer is classified as correct.

Figure 1: A decision tree (DT) is a series of nodes containing unique words that are connected by Boolean (true/false) decisions. A node is described as either a "root node" (a node that has nodes stemming from it) or a "leaf/terminal node" (a node that has no nodes stemming from it). All nodes reached when the Boolean decision returns "true" form the left subtree (outlined in blue), while all nodes reached when the decision returns "false" form the right subtree (outlined in pink). The asterisks (*) are a "wild card", representing any word that is not one of the unique words. In this case, a correct OSPE answer was "atrial papillary muscle(s)" or "subvalvular apparatus", and all other answers would have been found to be false by the DT.

Grading OSPE with a Decision Tree

The previous subsection described how a DT can mechanically grade a question based on some set of features. This subsection describes the features utilized and the principles by which the tool functions. To begin, the feature set is simply all of the unique words that exist across all student answers for a question. It was hypothesized that students should have a particular shared lexicon of words for describing the correct answer, even if it is not exactly the same as the textbook answer. Such a lexicon can be used to train a decision tree, as there will be a set of unique words that belong only to correct answers, and a set that belong only to incorrect answers. So while some words will belong to both, if a student answer contains the words associated with a correct answer as determined by the decision tree, then this is positive evidence that the answer is correct, and vice versa.
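Putting the pieces together for a single question, the hypothetical worked example below (the answers and grades are invented for illustration) derives the vocabulary of unique words, trains a toy tree with the `build` function sketched earlier, and grades a new answer:

```python
# Toy graded answers for one question (invented for illustration).
graded = [("atrial papillary muscles", "correct"),
          ("papillary muscle", "correct"),
          ("subvalvular apparatus", "correct"),
          ("mitral valve", "incorrect"),
          ("chordae tendineae", "incorrect")]

# Each answer is reduced to its set of words; the feature set is every
# unique word that appears across all student answers for the question.
samples = [(set(answer.lower().split()), label) for answer, label in graded]
vocabulary = set().union(*(words for words, _ in samples))

tree = build(samples, vocabulary)
print(classify(tree, {"papillary", "muscles"}))   # ('correct', 1.0)
```

Note that "muscle" and "muscles" are distinct features here, which is exactly why the spellchecking and lexicon questions discussed below matter.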
Some preprocessing was done to the data set prior to training the algorithm. First, all blank answers were removed, since they are trivially incorrect and uninteresting. Second, all answers were spellchecked using the Jazzy spellchecker v0.5.2 (Idzelis 2005). The dictionary included with the Jazzy library was used; however, since it does not contain many medical terms, all words in the master answer key were added to it. In all cases where a misspelling was identified, the top word in the correction list was taken. This is one area where the algorithm could perhaps be improved by considering different possible corrections. Finally, the following common English words found in the student answers were removed, as they do not provide any indication of correctness: "a", "an", "and", "are", "as", "at", "be", "but", "by", "did", "for", "had", "has", "have", "I", "in", "is", "it", "of", "on", "or", "so", "than", "that", "the", "then", "they", "this", "to", "was", "with".

Data

The data for this research consisted of the answers from a 60 question OSPE in McMaster University's Health Sciences Human Anatomy and Physiology (HTHSCI 2F03/2FF3/2L03/2LL3/1D06) undergraduate course. The exam consisted of 20 two-dimensional images from the Stereoscopic Atlas of Human Anatomy (Massachusetts General Hospital 2017). Digital markings (pins, asterisks, arrows, etc.) were added to the images to indicate the anatomical structure for students to consider. Each image had three associated questions. The exam was conducted online using "Desire to Learn", McMaster University's learning management system. The exam was completed by 428 students, who had 50 minutes to complete it. The virtual proctoring software Respondus® was used, or virtual proctoring with a TA if Respondus® would not work on a student's computer. The questions and the marking master for the OSPE were produced by the five senior faculty teaching the course. All student answers were graded by two teaching assistants (TAs), who then reviewed any differences and came to a consensus, referred to as the initial grade. All of the grades were then reviewed by the two instructional assistants, and a final grade was produced. Overall, approximately 5% of initial grades were altered by the instructional assistants.

As DTL was used for this research, training and test data sets were required. The training set was used to produce the tree by examining the student answers as described in the Decision Tree Learning subsection. With the trained DT, a grade was produced for each student in the test set for each question. A 10-fold cross-validation approach was used. Hence, for each fold, 42 students were randomly selected as the test set, and the remainder was used as the training set. Students were selected such that no student appeared in more than one test set. The test set is treated as if the answers had been previously provided and evaluated, and is therefore referred to as the "student key".
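The preprocessing and fold construction described above might be sketched as follows. The stopword list is taken from the paper; the spellchecking pass is only marked by a comment, since Jazzy is a Java library and any Python stand-in would be an assumption:

```python
import random

# The common English words removed from student answers.
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by", "did",
             "for", "had", "has", "have", "i", "in", "is", "it", "of", "on",
             "or", "so", "than", "that", "the", "then", "they", "this",
             "to", "was", "with"}

def preprocess(answer: str) -> set:
    """Lower-case, tokenize, and drop stopwords. Blank answers are
    filtered out upstream; the Jazzy spellchecking pass (with the master
    answer key added to its dictionary) would run before this step."""
    return {w for w in answer.lower().split() if w not in STOPWORDS}

def ten_folds(student_ids, folds=10, fold_size=42, seed=0):
    """Carve disjoint test sets of 42 students each, so that no student
    appears in more than one test set; the rest form the training set."""
    ids = list(student_ids)
    random.Random(seed).shuffle(ids)
    for k in range(folds):
        test = ids[k * fold_size:(k + 1) * fold_size]
        train = [s for s in ids if s not in test]
        yield train, test
```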
Metrics

The performance of the OSPE grading tool was measured by comparing the grade produced by the DT to the actual grade. Specifically, for each of the 42 students in each fold, and for each of the 60 questions, the grade produced by the DT is compared to the actual grade. The fold accuracy for each question is then the number of matches divided by 42, and the accuracy for the question is the average across the 10 folds. While the average accuracy is useful for giving a general sense of the effectiveness of the tool for grading OSPE questions, the final grade, the produced grade, and the certainty were also recorded to allow for a deeper analysis of the logic of the algorithm, especially when the DT does not agree with the final grade.
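Given these definitions, the accuracy bookkeeping reduces to a few lines. The sketch below assumes `predicted` and `actual` are dictionaries keyed by (student, question) pairs, which is an illustrative data layout rather than the study's:

```python
def question_accuracy(predicted, actual, folds, question):
    """Per-question accuracy: for each fold, the fraction of its 42 test
    students whose DT-produced grade matches the final grade, averaged
    over the 10 folds."""
    fold_accuracies = []
    for _, test_students in folds:
        matches = sum(predicted[(s, question)] == actual[(s, question)]
                      for s in test_students)
        fold_accuracies.append(matches / len(test_students))
    return sum(fold_accuracies) / len(fold_accuracies)

# Usage: folds = list(ten_folds(range(428))), materialized once so the
# same splits are reused across all 60 questions.
```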
Results

The computed accuracy of the OSPE grading tool using the student answers is shown in Figure 2, along with the average grade. The key result is the accuracy when determining a final grade, as this has the greatest effect on the students and is essential for building an ITS. It can be seen from the results that the accuracy when using the "student key" averages 96.8% (SD = 3.4%), with a lowest value of 84.8% (Q27). These results suggest that students develop their own collective lexicon for answering anatomical questions, and that answers expressed in it are still considered correct. Pedagogically, while unexplored, it is possible that adapting learning material to the students' lexicon may be valuable in promoting better learning outcomes.

Figure 2: The accuracy when using the student key (red line), ordered from highest to lowest. Additionally, the percentage of students who answered each question correctly, as determined by the faculty-generated mark master, is shown as a background bar graph.

The results with spellchecking were not any better. In 11 questions, the average accuracy was slightly lower, while in 2 questions it was higher (< 0.5% higher or lower). For the most part, the DT algorithm seemed to learn the misspellings, as the mistakes were frequently identical. Where it would have made a difference, the spellchecker struggled with medical terms despite some effort to adjust it to correct them properly. For this reason, the experiment will be repeated with a spellchecking algorithm and dictionary designed specifically for medical use. For these results, no significant conclusion can be reached.

The relationship between accuracy and the average grade was examined using a Pearson correlation (r = 0.153, p = 0.244). While the r value suggests that there is no correlation, the p value is greater than 0.05, so it is possible that this result occurred by chance. Therefore, an algorithmic and practical evaluation of the reasoning was undertaken.

Algorithmically, if the AI has a bias, then it should favour guessing either correct or incorrect. If the AI has a correct bias, then questions with a grade less than 50% should have consistently low accuracy. If the AI has an incorrect bias, then questions with a grade of 50% or higher should have consistently low accuracy. It is evident from Figure 2 that this is not the case: questions with grades between 40% and 60% have accuracy values throughout the distribution.

From a practical perspective, an anatomical expert was recruited to duplicate the first step taken by the tool. The expert was asked to examine the student answers for each question and pick from the lexicon the word most likely to indicate that the student had answered the question correctly. The expert made the selection without knowing what had been selected by the AI. Ideally, it would be better to have the expert completely duplicate the process of the tree; however, this would be quite time consuming for them, so this was taken as a rough approximation of logical agreement. The expert picked the same word as the tool in 45 of the 60 questions. For 4 of the remaining 15 questions, the expert picked an acronym that was not frequently used by the students, who preferred to write the answer out in full (which is itself potentially pedagogically interesting), indicating the expert was relying somewhat on their own expert knowledge beyond just the student answers. In the other 11 cases, the word selected by the expert was the OSPE grader's second choice. In particular, it is notable that when the answers were longest (10 or more words on average), the expert and the AI either agreed or the expert's word was the AI's second choice. This indicates that the AI is doing reasonably well at finding the important words, mirroring the human choice. Overall, in combination with the other observations discussed, this suggests that the reasoning used by the AI is valid, and that it is not simply guessing.

Limitations

While the results are promising, there are some noteworthy limitations to the findings. The long term goal of this research is to develop an ITS for anatomical education; however, this research has assumed that questions have been answered by a cohort of students from which the DT can be built. This is not necessarily ideal, as it would be more practical to add a question to the hypothetical ITS and have the AI simply work. This work would suggest that over time the ITS would get better at grading the questions, which is welcome, but it does not address what happens early on. Of course, it is possible to simply take exam questions after they have been used and add them to the ITS; however, this requires the instructional assistants to constantly come up with a new pool of questions. Therefore, this tool needs to be evaluated without training on the student answers.

To address these limitations, two steps have recently been taken. A group of third and fourth year university students of the anatomical sciences has been recruited to produce many OSPE questions. This will provide both the necessary questions for the ITS and an initial seed of student answers to train the DT. Faculty will also add responses to the answer key. Additionally, an evaluation is being performed using only the faculty-derived answer key.

Conclusions

This paper has presented an early look at a machine learning-based tool for grading objective structured practical examinations (OSPEs), which are frequently used and viewed as an important aspect of anatomical sciences education (Chan et al. 2019). As OSPE questions are short answers, consisting of a few words or a short sentence, they are more difficult to grade than, for example, multiple choice questions, where a student answer is definitively correct or incorrect. It was hypothesized that a decision tree could learn the lexicon used by learners to answer questions, and distinguish from that lexicon the words associated with correct and incorrect answers.

Using the answers obtained from 428 anatomical sciences students on a 60 question OSPE, the tool was trained using a 10-fold cross-validation method. Overall, the algorithm obtained 96.8% accuracy (SD = 3.4%) for correctly grading the student answers. Based on a multifaceted analysis of the results, it was determined that the tool was not simply guessing. Firstly, the algorithm shows no bias towards guessing "correct" or "incorrect", based on an examination of questions with grades ranging from 40% to 60%. Secondly, an anatomical expert was recruited to examine the algorithm's selected root words; the AI choices were found to be reasonable, matching the expert's 45 out of 60 times, with the expert's word being the AI's second choice for 11 questions. For the remaining four questions, the expert made a choice not possible for the AI by using an acronym not used by the students. Overall, the evidence suggests that the OSPE grading tool is using reasoning and not guessing.

While the average result was promising, three questions (Q27, Q30, and Q33) were notably lower than the mean, with accuracies of about 85%. The underlying causes of the errors were examined; some anatomical terminology was not recognized as words. For example, "C5", "C6", "CN11", and "CNXI" were not considered words, let alone unique words, by the DT, and therefore they were not included in the solution space. For Q30, there were many different variations of words that appeared in correct answers, e.g., movement, motion, forward, extension, hyperextension; however, these words also appeared in incorrect answers. A potential solution is to blend the student key with the faculty-derived answer key and have certain words marked as critical. Other solutions would be to use a more complex natural language processing approach, which may be required to understand the words in context, or to have the tool learn the weighted importance of the different words.

The future of this research, beyond addressing the issues around accuracy, is to expand the grading tool into an ITS. The ITS can then be evaluated on a student cohort to see if it improves the learning outcomes.
As part of building the results, it was determined that the tool was not simply a an ITS, an investigation will be conducted on learning out- guess. Firstly, the algorithm shows no bias towards guessing comes when using the students’ lexicon versus textbook an- “correct” or “incorrect” based on an examination of ques- swers. Recently, work has begun by developing an online tions with grades ranging from 40% to 60%. Secondly, an OSPE practice tool for students using the AI-based grader. anatomical expert was recruited to examine the algorithms selected root words and the AI choices were found to be References reasonable and matching 45 out of 60 times, and being the second choice for 11 questions. For the remaining four ques- Bakhshinategh, B.; Zaiane, O. R.; ElAtia, S.; and Ipperciel, tions, the expert made a choice not possible for the AI by D. 2018. Educational data mining applications and tasks: using an acronym not used by the students. Overall, the evi- A survey of the last 10 years. Education and Information dence suggests that the OSPE grading tool is using reasoning Technologies 23(1): 537–553. and not guessing. Chan, A. Y.-C. C.; Custers, E. J.; van Leeuwen, M. S.; Bleys, While the average result was promising, three questions R. L.; and ten Cate, O. 2019. Does an Additional Online (Q27, 30 and 33) were notably lower than the mean with Anatomy Course Improve Performance of Medical Students on Gross Anatomy Examinations? Medical Science Educa- tor 29(3): 697–707. Dumais, S. T. 2004. Latent semantic analysis. Annual Re- view of Information Science and Technology 38(1): 188– 230. Idzelis, M. 2005. Jazzy: The Java open source spell checker. Joshi, A.; Allessio, D.; Magee, J.; Whitehill, J.; Arroyo, I.; Woolf, B.; Sclaroff, S.; and Betke, M. 2019. Affect-driven learning outcomes prediction in intelligent tutoring systems. In 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), 1–5. IEEE. Leacock, C.; and Chodorow, M. 2003. C-rater: Automated scoring of short-answer questions. Computers and the Hu- manities 37(4): 389–405. Massachusetts General Hospital. 2017. Bassett Collection: Stereoscopic Atlas of Human Anatomy. Mousavinasab, E.; Zarifsanaiey, N.; R. Niakan Kalhori, S.; Rakhshan, M.; Keikha, L.; and Ghazi Saeedi, M. 2021. In- telligent tutoring systems: A systematic review of charac- teristics, applications, and evaluation methods. Interactive Learning Environments 29(1): 142–163. Quinlan, J. R. 1996. Learning decision tree classifiers. ACM Computing Surveys 28(1): 71–72. Regian, J.; and Shute, V. 1966. Arificial intelligence in train- ing: The evolution of intelligent tutoring systems. In Pro- ceedings of the Conference on Technology and Training In Education. Shermis, M. D.; Burstein, J.; Brew, C.; Higgins, D.; and Zechner, K. 2015. Recent Innovations in Machine Scoring of Student and Test Taker Written and Spoken Responses. In Handbook of Test Development, 351–370. Routledge. Xu, Z.; Wijekumar, K.; Ramirez, G.; Hu, X.; and Irey, R. 2019. The effectiveness of intelligent tutoring systems on K-12 students’ reading comprehension: A meta-analysis. British Journal of Educational Technology 50(6): 3119– 3137.