Grading OSPE Questions with Decision Learning Trees: A First Step Towards an Intelligent Tutoring System for Anatomical Education

Jason Bernard1,2, Bruce Wainman3,5, O'Lencia Walker3, Courney Pitt3, Ilana Bayer3,5, Josh Mitchell3, Alex Bak4, Anthony Saraco3, Ranil Sonnadara1,2

1 Department of Surgery, McMaster University, Hamilton, Ontario, Canada
2 Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
3 Education Program in Anatomy, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
4 Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
5 Department of Pathology and Molecular Medicine, McMaster University, Hamilton, Ontario, Canada

bernac12@mcmaster.ca, wainmanb@mcmaster.ca, ranil@mcmaster.ca

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Intelligent tutoring systems (ITSs) have been used for decades as a means of improving the quality of education for learners, primarily by providing guidance to students based on a student model, e.g., predicting their knowledge level on a subject. There have been few attempts to incorporate ITSs into anatomical education. Objective structured practical examinations (OSPEs) are an important, albeit challenging, means of evaluation in anatomical education. This research aims to create an ITS for anatomical OSPEs, and as a crucial first step looks to create a machine learning-based approach for grading OSPEs. To that end, decision tree learning was evaluated with, and without, spellchecking to produce a grading tool using the answer key developed by instructional assistants. Using answers from 428 learners, the tool obtained an average accuracy of 96.8% (SD = 3.4%) across 60 questions.

Introduction

Intelligent tutoring systems (ITSs) in educational technology have been researched since at least the 1960s (Regian and Shute 1966). An ITS works by interacting with the learner and utilizing student modelling techniques to provide a customized experience based on the learner's cognitive characteristics, such as affect, knowledge level, and interests (Regian and Shute 1966; Bakhshinategh et al. 2018; Joshi et al. 2019; Xu et al. 2019; Mousavinasab et al. 2021). This is often done using adaptive learning material; however, other pedagogical techniques can be used, e.g., gamification, prompting the learner to reflect on their answer, or providing an immediate review (Regian and Shute 1966; Bakhshinategh et al. 2018; Joshi et al. 2019; Xu et al. 2019; Mousavinasab et al. 2021). Overall, the experience with ITSs suggests that they have a positive effect on learning outcomes (Joshi et al. 2019; Xu et al. 2019).

Medical education has not been much of a focus for educational technology in general, and ITSs in particular. This research was undertaken to develop an ITS for anatomical sciences education. In anatomical education, the objective structured practical examination (OSPE) is considered an important part of the curriculum (Chan et al. 2019); however, it is an exam with which many learners struggle. OSPE questions are in the form of an image (or sample) with a pin indicating the anatomical structure to be considered by the student. The student is typically asked to either identify the structure or its function in the form of a short sentence (or sentence fragment). Therefore, an algorithm is needed that can grade short answer OSPE-style questions. While there has been much work on grading short answer questions (Leacock and Chodorow 2003; Shermis et al. 2015; Dumais 2004), these approaches use natural language processing (NLP) techniques that are intended to work with short paragraphs and to generalize across many topics (mainly in a K-12 context). Student answers to OSPE questions tend to be short sentence fragments that lack proper grammatical structure. A preliminary examination using NLP on the OSPE answers suggested that there was insufficient information for the algorithm to derive much meaning. Hence, due to the differences in answer structure and the early NLP assessment, existing approaches were not evaluated. It was observed that the student answers, while short, generally used the unique, technical words of the anatomical sciences, although not often the same words used in the faculty-derived answer key.
Therefore, it was hypothesized that, due to the technical nature of the anatomical sciences, particular and unique words should appear in correct answers, even if they are not the words expected by faculty. Furthermore, decision tree learning can use a series of derived simple true/false rules to determine the combinations of such words that indicate a correct or incorrect answer. While the tool created is likely not generally useful for short answer questions, an evaluation showed it is able to grade OSPE questions using the students' lexicon with 96.8% accuracy.

Methodology

This section describes the approach used in this research to evaluate decision trees (DTs) for grading OSPE questions. It begins with an overview of the decision tree learning (DTL) algorithm. This is followed by a description of how decision trees are used to evaluate OSPE questions. Afterwards, the data and data gathering approach are discussed, and then the metrics used to evaluate the OSPE grading tool.

Decision Tree Learning

Previously, it was shown that DTL could be useful for parsing grammatical structures and using the resulting tree to aid in grading short answer questions (Leacock and Chodorow 2003). While their approach differs, as they use the structure as input to other NLP approaches, it suggests that the decision tree structure can be useful for this kind of problem. It is certainly possible that other algorithms may be effective for identifying correct answers to OSPE questions (e.g., potentially an unevaluated NLP algorithm or clustering algorithms). However, DTs seem particularly well suited to use in an ITS. Firstly, unlike most NLP algorithms that use neural networks, which are black box algorithms, DTs provide transparent reasoning that can be expressed to students along with a confidence level of correctness. Secondly, other algorithms may struggle to define the relationship in an efficient way; e.g., clustering algorithms would likely create overlapping word spaces that would be difficult to evaluate. Hence, for this research study, DTs were used to produce a set of rules that describe a relationship between the words in an OSPE answer and correctness. The following description of DTs is summarized from Quinlan (1996) unless otherwise noted.

The aim of a DT is to produce a classification of a sample based on a sequence of true/false rules relating to a feature in the data. For example, to predict whether it will rain, a tree may consist of the rules "is it cloudy?", and if so then "is the humidity above 60%?". If the answer to both questions is "true" (or yes), then predict it will rain, and if the answer to either question is "false" (or no), then predict it will not rain. Each rule is represented structurally as a node consisting of: 1. the Boolean rule, 2. the certainty for each possible classification (described below), and 3. an optional connection to two other nodes (called child nodes). One child node is designated for when the rule evaluates to "true" for a sample, and the other for "false".

The collection of nodes is arranged in an (upside down) tree-like structure (a simplified sample from this research is shown in Figure 1), with a root node at the top, its children below it, their children on the next level down, and so on. The bottommost nodes have no children and are referred to as leaf nodes. Each child is a subtree of its parent, referred to as ST (subtree "true") or SF (subtree "false"). By convention, the ST is drawn on the left and the SF on the right.
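To make the node structure concrete, the sketch below encodes it in Python. This is an illustration under stated assumptions, not the authors' implementation: the `Node` name, the dictionary of certainties, and the callable rule are choices made here, with the rain example above encoded as a small tree.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A minimal sketch of the node described above: 1. the Boolean rule,
# 2. the certainty of each possible classification, and 3. optional
# connections to a "true" child (ST) and a "false" child (SF).
@dataclass
class Node:
    rule: Optional[Callable]        # None marks a leaf node in this sketch
    certainty: dict                 # e.g., {"rain": 0.56, "no rain": 0.44}
    st: Optional["Node"] = None     # subtree followed when the rule is true
    sf: Optional["Node"] = None     # subtree followed when the rule is false

# The rain example as a tree: "is it cloudy?" and, if so,
# "is the humidity above 60%?" (certainties invented for illustration).
tree = Node(
    rule=lambda s: s["cloudy"],
    certainty={"rain": 0.30, "no rain": 0.70},
    st=Node(
        rule=lambda s: s["humidity"] > 60,
        certainty={"rain": 0.60, "no rain": 0.40},
        st=Node(None, {"rain": 0.90, "no rain": 0.10}),   # cloudy and humid
        sf=Node(None, {"rain": 0.20, "no rain": 0.80}),   # cloudy, not humid
    ),
    sf=Node(None, {"rain": 0.05, "no rain": 0.95}),       # no clouds
)
```

Each node carries the three parts listed above; in this sketch a leaf is simply a node with no rule and no children.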
The process of building a tree from data is called DTL. DTL is an iterative process that examines a data set to find the Boolean rule with the greatest information gain, measured by the reduction in entropy. In other words, all possible rules are considered, and the rule that brings the most clarity to the prediction is selected. For example, it is difficult for it to rain if there are no clouds in the sky, so this is a reasonable first question to determine if it will rain. If there are no clouds, then it will clearly not rain.

The general case can be described as follows. Let $D_0$ be the initial dataset. Starting with the first iteration, every possible rule that can be applied to $D_0$ is considered. The rule that is most effective at clarifying the prediction (measured by information gain, or IG) is selected, and $D_0$ is split into two datasets, $D_1^T$ and $D_1^F$. $D_1^T$ contains all samples for which the selected rule is true, and $D_1^F$ all those for which the rule evaluates to false. The process iterates using $D_1^T$, and then $D_1^F$, each of which produces two additional datasets, and so on until no rule can split the dataset.

Mathematically, this is done by computing entropy ($E$) as shown in Equation 1, where $T$ and $F$ are the counts of the samples in the dataset that would evaluate to true and false respectively if the rule were selected. For some iteration, let there be $n$ possible rules. Information gain (IG) is then computed by subtracting the entropy of a candidate rule $n$ from the entropy of the current dataset ($E_{curr}$); e.g., in the first iteration $E_{curr}$ is computed using $D_0$, while in the second iteration it is computed using $D_1^T$, and then separately using $D_1^F$, and so on. This is shown in Equation 2. Finally, the rule with the best IG value is selected.

$$E(T, F) = -\frac{T}{T+F}\log\frac{T}{T+F} - \frac{F}{T+F}\log\frac{F}{T+F} \quad (1)$$

$$IG(T_n, F_n) = E_{curr} - E(T_n, F_n) \quad (2)$$

The probability of each possible classification (correct and incorrect for this research) is computed and stored as the count of the samples with each label divided by the total. The classification associated with a node is the one with the greatest probability, and the certainty of that classification being right is its probability. For example, if 56% of all samples in a data set are labelled as "correct", then the "correct" classification will be associated with the node with 56% certainty.

Once the DT is built using this process, a new sample is classified by traversing the tree starting from the root node. At each node, the Boolean rule is applied to the sample; if it evaluates to "true", then the true connection is followed; otherwise, the false connection is taken. If no traversal is possible (which can happen if not all samples have the same features), or if the node is a leaf node, then the process terminates, and the associated classification and certainty are returned.
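The full learning-and-classification loop can be sketched as follows, reusing the `Node` class from the sketch above. This is a toy illustration, not the study's code: it follows the common ID3-style formulation, in which the label entropy of each branch is weighted by branch size when computing the gain of Equation 2, and it assumes each sample is a (set-of-words, label) pair.

```python
import math

def entropy(t: int, f: int) -> float:
    """Equation 1: E(T, F), with 0 * log(0) taken as 0."""
    total = t + f
    return -sum(c / total * math.log2(c / total) for c in (t, f) if c)

def label_entropy(samples):
    """Entropy of the correct/incorrect labels in a list of
    (word_set, label) samples."""
    correct = sum(1 for _, label in samples if label == "correct")
    return entropy(correct, len(samples) - correct)

def best_rule(samples, words):
    """Select the word whose presence/absence split has the greatest
    information gain (Equation 2, branches weighted by size)."""
    def gain(word):
        d_t = [s for s in samples if word in s[0]]
        d_f = [s for s in samples if word not in s[0]]
        n = len(samples)
        return label_entropy(samples) - (len(d_t) / n * label_entropy(d_t)
                                         + len(d_f) / n * label_entropy(d_f))
    return max(words, key=gain)

def build(samples, vocabulary):
    """Recursively split on the best rule until no rule splits the data."""
    correct = sum(1 for _, label in samples if label == "correct")
    certainty = {"correct": correct / len(samples),
                 "incorrect": (len(samples) - correct) / len(samples)}
    # Only words present in some, but not all, samples can split the set.
    splitting = [w for w in vocabulary
                 if 0 < sum(w in s[0] for s in samples) < len(samples)]
    if label_entropy(samples) == 0 or not splitting:
        return Node(None, certainty)                    # leaf node
    word = best_rule(samples, splitting)
    return Node(lambda s, w=word: w in s, certainty,
                st=build([s for s in samples if word in s[0]], splitting),
                sf=build([s for s in samples if word not in s[0]], splitting))

def classify(node, words):
    """Traverse from the root; at a leaf, return the classification with
    the greatest probability along with its certainty."""
    while node.rule is not None:
        node = node.st if node.rule(words) else node.sf
    label = max(node.certainty, key=node.certainty.get)
    return label, node.certainty[label]
```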
Any movement to the right does not automatically mean an answer is incorrect, as there may be many different combinations of words that are correct. Hence, for the tree in Figure 1, it is possible that an answer lacking the word "muscles" is still correct and would be identified by the DT as such. For those cases, the word at the second level would likely be an alternative word that appears in correct answers. The effect of the DT is to create a series of rules. If "subvalvular apparatus" is a correct answer, then effectively the result of the tree traversal is to ask: does the answer not contain "muscles" but contain "subvalvular" and "apparatus"? If so, the answer is classified as correct.

Figure 1: A decision tree (DT) is a series of nodes containing unique words that are connected by Boolean (true/false) decisions. A node is described as either a "root node" (a node that has nodes stemming from it) or a "leaf/terminal node" (a node that has no nodes stemming from it). All nodes reached when the Boolean decision returns "true" form the left subtree (outlined in blue), while all nodes reached when the decision returns "false" form the right subtree (outlined in pink). The asterisks (*) are a "wild card", representing any word that is not one of the unique words. In this case, a correct OSPE answer was "atrial papillary muscle(s)" or "subvalvular apparatus", and all other answers would have been found to be false by the DT.

Grading OSPE with a Decision Tree

The previous subsection described how a DT can mechanically grade a question based on some set of features. This subsection describes the features utilized and the principles by which the tool functions. To begin, the feature set is simply all of the unique words that exist across all student answers for a question. It was hypothesized that students should have a particular shared lexicon of words for describing the correct answer, even if it is not exactly the same as the textbook answer. Such a lexicon can be used to train a decision tree, as there will be a set of unique words that belong only to correct answers, and a set that belong only to incorrect answers. So while some words will belong to both, if a student answer contains the words associated with a correct answer as determined by the decision tree, then this is positive evidence that the answer is correct, and vice versa.
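Putting the pieces together for a single question, the hypothetical worked example below (the answers and grades are invented for illustration) derives the vocabulary of unique words, trains a toy tree with the `build` function sketched earlier, and grades a new answer:

```python
# Toy graded answers for one question (invented for illustration).
graded = [("atrial papillary muscles", "correct"),
          ("papillary muscle", "correct"),
          ("subvalvular apparatus", "correct"),
          ("mitral valve", "incorrect"),
          ("chordae tendineae", "incorrect")]

# Each answer is reduced to its set of words; the feature set is every
# unique word that appears across all student answers for the question.
samples = [(set(answer.lower().split()), label) for answer, label in graded]
vocabulary = set().union(*(words for words, _ in samples))

tree = build(samples, vocabulary)
print(classify(tree, {"papillary", "muscles"}))   # ('correct', 1.0)
```

Note that "muscle" and "muscles" are distinct features here, which is exactly why the spellchecking and lexicon questions discussed below matter.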
Some preprocessing was done to the data set prior to training the algorithm. First, all blank answers were removed, since they are trivially incorrect and uninteresting. Second, all answers were spellchecked using the Jazzy spellchecker v0.5.2 (Idzelis 2005). The dictionary included with the Jazzy library was used; however, since it does not contain many medical terms, all words in the master answer key were added to it. In all cases where a misspelling was identified, the top word in the correction list was taken. This is one area where the algorithm could perhaps be improved by considering different possible corrections. Finally, the following common English words found in the student answers were removed, as they do not provide any indication of correctness: "a", "an", "and", "are", "as", "at", "be", "but", "by", "did", "for", "had", "has", "have", "I", "in", "is", "it", "of", "on", "or", "so", "than", "that", "the", "then", "they", "this", "to", "was", "with".

Data

The data for this research consisted of the answers from a 60 question OSPE in McMaster University's Health Sciences Human Anatomy and Physiology (HTHSCI 2F03/2FF3/2L03/2LL3/1D06) undergraduate course. The exam consisted of 20 two-dimensional images from the Stereoscopic Atlas of Human Anatomy (Massachusetts General Hospital 2017). Digital markings (pins, asterisks, arrows, etc.) were added to the images to indicate the anatomical structure for students to consider. Each image had three associated questions. The exam was conducted online using "Desire to Learn", McMaster University's learning management system. The exam was completed by 428 students, who had 50 minutes to complete it. The virtual proctoring software Respondus® was used, or virtual proctoring with a TA if Respondus® would not work on a student's computer. The questions and the marking master for the OSPE were produced by the five senior faculty teaching the course. All student answers were graded by two teaching assistants (TAs), who then reviewed any differences and came to a consensus, referred to as the initial grade. All of the grades were then reviewed by the two instructional assistants, and a final grade was produced. Overall, approximately 5% of initial grades were altered by the instructional assistants.

As DTL was used for this research, training and test data sets were required. The training set was used to produce the tree by examining the student answers as described in the Decision Tree Learning subsection. With the trained DT, a grade was produced for each student in the test set for each question. A 10-fold cross-validation approach was used. Hence, for each fold, 42 students were randomly selected as the test set, and the remainder was used as the training set. Students were selected such that no student appeared in more than one test set. The test set is treated as if the answers had been previously provided and evaluated, and is therefore referred to as the "student key".
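The preprocessing and fold construction described above might be sketched as follows. The stopword list is taken from the paper; the spellchecking pass is only marked by a comment, since Jazzy is a Java library and any Python stand-in would be an assumption:

```python
import random

# The common English words removed from student answers.
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by", "did",
             "for", "had", "has", "have", "i", "in", "is", "it", "of", "on",
             "or", "so", "than", "that", "the", "then", "they", "this",
             "to", "was", "with"}

def preprocess(answer: str) -> set:
    """Lower-case, tokenize, and drop stopwords. Blank answers are
    filtered out upstream; the Jazzy spellchecking pass (with the master
    answer key added to its dictionary) would run before this step."""
    return {w for w in answer.lower().split() if w not in STOPWORDS}

def ten_folds(student_ids, folds=10, fold_size=42, seed=0):
    """Carve disjoint test sets of 42 students each, so that no student
    appears in more than one test set; the rest form the training set."""
    ids = list(student_ids)
    random.Random(seed).shuffle(ids)
    for k in range(folds):
        test = ids[k * fold_size:(k + 1) * fold_size]
        train = [s for s in ids if s not in test]
        yield train, test
```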
Metrics

The performance of the OSPE grading tool was measured by comparing the grade produced by the DT to the actual grade. Specifically, for each of the 42 students in each fold, and for each of the 60 questions, the grade produced by the DT is compared to the actual grade. The fold accuracy for each question is then the number of matches divided by 42, and the accuracy for the question is the average across the 10 folds. While the average accuracy is useful for giving a general sense of the effectiveness of the tool for grading OSPE questions, the final grade, the produced grade, and the certainty were also recorded to allow for a deeper analysis of the logic of the algorithm, especially when the DT does not agree with the final grade.
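Given these definitions, the accuracy bookkeeping reduces to a few lines. The sketch below assumes `predicted` and `actual` are dictionaries keyed by (student, question) pairs, which is an illustrative data layout rather than the study's:

```python
def question_accuracy(predicted, actual, folds, question):
    """Per-question accuracy: for each fold, the fraction of its 42 test
    students whose DT-produced grade matches the final grade, averaged
    over the 10 folds."""
    fold_accuracies = []
    for _, test_students in folds:
        matches = sum(predicted[(s, question)] == actual[(s, question)]
                      for s in test_students)
        fold_accuracies.append(matches / len(test_students))
    return sum(fold_accuracies) / len(fold_accuracies)

# Usage: folds = list(ten_folds(range(428))), materialized once so the
# same splits are reused across all 60 questions.
```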
Results

The computed accuracy of the OSPE grading tool using the student answers is shown in Figure 2, along with the average grade. The key result is the accuracy when determining a final grade, as this has the greatest effect on the students and is essential for building an ITS. It can be seen from the results that the accuracy when using the "student key" averages 96.8% (SD = 3.4%), with a lowest value of 84.8% (Q27). These results suggest that students develop their own collective lexicon for answering anatomical questions, and that answers expressed in it are still considered correct. Pedagogically, while unexplored, it is possible that adapting learning material to the students' lexicon may be valuable in promoting better learning outcomes.

Figure 2: The accuracy when using the student key (red line), ordered from highest to lowest. Additionally, the percentage of students who answered each question correctly, as determined by the faculty-generated mark master, is shown as a background bar graph.

The results with spellchecking were not any better. In 11 questions, the average accuracy was slightly lower, while in 2 questions it was higher (< 0.5% higher or lower). For the most part, the DT algorithm seemed to learn the misspellings, as the mistakes were frequently identical. Where it would have made a difference, the spellchecker struggled with medical terms despite some effort to adjust it to correct them properly. For this reason, the experiment will be repeated with a spellchecking algorithm and dictionary designed specifically for medical use. For these results, no significant conclusion can be reached.

The relationship between accuracy and the average grade was examined using a Pearson correlation (r = 0.153, p = 0.244). While the r value suggests that there is no correlation, the p value is greater than 0.05, so it is possible that this result occurred by chance. Therefore, an algorithmic and practical evaluation of the reasoning was undertaken.

Algorithmically, if the AI has a bias, then it should favour guessing either correct or incorrect. If the AI has a correct bias, then questions with a grade less than 50% should have consistently low accuracy. If the AI has an incorrect bias, then questions with a grade of 50% or higher should have consistently low accuracy. It is evident from Figure 2 that this is not the case: questions with grades between 40% and 60% have accuracy values throughout the distribution.

From a practical perspective, an anatomical expert was recruited to duplicate the first step taken by the tool. The expert was asked to examine the student answers for each question and pick from the lexicon the word most likely to indicate that the student had answered the question correctly. The expert made the selection without knowing what had been selected by the AI. Ideally, it would be better to have the expert completely duplicate the process of the tree; however, this would be quite time consuming for them, so this was taken as a rough approximation of logical agreement. The expert picked the same word as the tool in 45 of the 60 questions. For 4 of the remaining 15 questions, the expert picked an acronym that was not frequently used by the students, who preferred to write the answer out in full (which is itself potentially pedagogically interesting), indicating the expert was relying somewhat on their own expert knowledge beyond just the student answers. In the other 11 cases, the word selected by the expert was the OSPE grader's second choice. In particular, it is notable that when the answers were longest (10 or more words on average), the expert and the AI either agreed or the expert's word was the AI's second choice. This indicates that the AI is doing reasonably well at finding the important words, mirroring the human choice. Overall, in combination with the other observations discussed, this suggests that the reasoning used by the AI is valid, and that it is not simply guessing.

Limitations

While the results are promising, there are some noteworthy limitations to the findings. The long term goal of this research is to develop an ITS for anatomical education; however, this research has assumed that questions have been answered by a cohort of students from which the DT can be built. This is not necessarily ideal, as it would be more practical to add a question to the hypothetical ITS and have the AI simply work. This work would suggest that over time the ITS would get better at grading the questions, which is welcome, but it does not address what happens early on. Of course, it is possible to simply take exam questions after they have been used and add them to the ITS; however, this requires the instructional assistants to constantly come up with a new pool of questions. Therefore, this tool needs to be evaluated without training on the student answers.

To address these limitations, two steps have recently been taken. A group of third and fourth year university students of the anatomical sciences has been recruited to produce many OSPE questions. This will provide both the necessary questions for the ITS and an initial seed of student answers to train the DT. Faculty will also add responses to the answer key. Additionally, an evaluation is being performed using only the faculty-derived answer key.

Conclusions

This paper has presented an early look at a machine learning-based tool for grading objective structured practical examinations (OSPEs), which are frequently used and viewed as an important aspect of anatomical sciences education (Chan et al. 2019). As OSPE questions are short answers, consisting of a few words or a short sentence, they are more difficult to grade than, for example, multiple choice questions, where a student answer is definitively correct or incorrect. It was hypothesized that a decision tree could learn the lexicon used by learners to answer questions, and distinguish from that lexicon the words associated with correct and incorrect answers.

Using the answers obtained from 428 anatomical sciences students on a 60 question OSPE, the tool was trained using a 10-fold cross-validation method. Overall, the algorithm obtained 96.8% accuracy (SD = 3.4%) for correctly grading the student answers. Based on a multifaceted analysis of the results, it was determined that the tool was not simply guessing. Firstly, the algorithm shows no bias towards guessing "correct" or "incorrect", based on an examination of questions with grades ranging from 40% to 60%. Secondly, an anatomical expert was recruited to examine the algorithm's selected root words; the AI choices were found to be reasonable, matching the expert's 45 out of 60 times, with the expert's word being the AI's second choice for 11 questions. For the remaining four questions, the expert made a choice not possible for the AI by using an acronym not used by the students. Overall, the evidence suggests that the OSPE grading tool is using reasoning and not guessing.

While the average result was promising, three questions (Q27, Q30, and Q33) were notably lower than the mean, with accuracies of about 85%. The underlying causes of the errors were examined; some anatomical terminology was not recognized as words. For example, "C5", "C6", "CN11", and "CNXI" were not considered words, let alone unique words, by the DT, and therefore they were not included in the solution space. For Q30, there were many different variations of words that appeared in correct answers, e.g., movement, motion, forward, extension, hyperextension; however, these words also appeared in incorrect answers. A potential solution is to blend the student key with the faculty-derived answer key and have certain words marked as critical. Other solutions would be to use a more complex natural language processing approach, which may be required to understand the words in context, or to have the tool learn the weighted importance of the different words.

The future of this research, beyond addressing the issues around accuracy, is to expand the grading tool into an ITS. The ITS can then be evaluated on a student cohort to see if it improves the learning outcomes.
As part of building the results, it was determined that the tool was not simply a an ITS, an investigation will be conducted on learning out- guess. Firstly, the algorithm shows no bias towards guessing comes when using the students’ lexicon versus textbook an- “correct” or “incorrect” based on an examination of ques- swers. Recently, work has begun by developing an online tions with grades ranging from 40% to 60%. Secondly, an OSPE practice tool for students using the AI-based grader. anatomical expert was recruited to examine the algorithms selected root words and the AI choices were found to be References reasonable and matching 45 out of 60 times, and being the second choice for 11 questions. For the remaining four ques- Bakhshinategh, B.; Zaiane, O. R.; ElAtia, S.; and Ipperciel, tions, the expert made a choice not possible for the AI by D. 2018. Educational data mining applications and tasks: using an acronym not used by the students. Overall, the evi- A survey of the last 10 years. Education and Information dence suggests that the OSPE grading tool is using reasoning Technologies 23(1): 537–553. and not guessing. Chan, A. Y.-C. C.; Custers, E. J.; van Leeuwen, M. S.; Bleys, While the average result was promising, three questions R. L.; and ten Cate, O. 2019. Does an Additional Online (Q27, 30 and 33) were notably lower than the mean with Anatomy Course Improve Performance of Medical Students on Gross Anatomy Examinations? Medical Science Educa- tor 29(3): 697–707. Dumais, S. T. 2004. Latent semantic analysis. Annual Re- view of Information Science and Technology 38(1): 188– 230. Idzelis, M. 2005. Jazzy: The Java open source spell checker. Joshi, A.; Allessio, D.; Magee, J.; Whitehill, J.; Arroyo, I.; Woolf, B.; Sclaroff, S.; and Betke, M. 2019. Affect-driven learning outcomes prediction in intelligent tutoring systems. In 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), 1–5. IEEE. Leacock, C.; and Chodorow, M. 2003. C-rater: Automated scoring of short-answer questions. Computers and the Hu- manities 37(4): 389–405. Massachusetts General Hospital. 2017. Bassett Collection: Stereoscopic Atlas of Human Anatomy. Mousavinasab, E.; Zarifsanaiey, N.; R. Niakan Kalhori, S.; Rakhshan, M.; Keikha, L.; and Ghazi Saeedi, M. 2021. In- telligent tutoring systems: A systematic review of charac- teristics, applications, and evaluation methods. Interactive Learning Environments 29(1): 142–163. Quinlan, J. R. 1996. Learning decision tree classifiers. ACM Computing Surveys 28(1): 71–72. Regian, J.; and Shute, V. 1966. Arificial intelligence in train- ing: The evolution of intelligent tutoring systems. In Pro- ceedings of the Conference on Technology and Training In Education. Shermis, M. D.; Burstein, J.; Brew, C.; Higgins, D.; and Zechner, K. 2015. Recent Innovations in Machine Scoring of Student and Test Taker Written and Spoken Responses. In Handbook of Test Development, 351–370. Routledge. Xu, Z.; Wijekumar, K.; Ramirez, G.; Hu, X.; and Irey, R. 2019. The effectiveness of intelligent tutoring systems on K-12 students’ reading comprehension: A meta-analysis. British Journal of Educational Technology 50(6): 3119– 3137.