Computer Adaptive Testing Using the Same-Decision Probability

Suming Chen and Arthur Choi and Adnan Darwiche
Computer Science Department
University of California, Los Angeles
{suming,aychoi,darwiche}@cs.ucla.edu

Abstract

Computer Adaptive Tests dynamically allocate questions to students based on their previous responses. This involves several challenges, such as determining when the test should terminate, as well as which questions should be asked. In this paper, we introduce a Computer Adaptive Test that uses a Bayesian network as the underlying model. Additionally, we show how the notion of the Same-Decision Probability can be used as an information gathering criterion in this context: to determine whether further questions are needed and, if so, which questions should be asked. We show empirically that the Same-Decision Probability is a viable and intuitive approach for question selection in Bayesian-network-based Computer Adaptive Tests, as its usage allows us to ask fewer questions while still maintaining the same level of precision and recall in terms of classifying competent students.

1 INTRODUCTION

Computer Adaptive Tests have recently become increasingly popular, as they have the ability to adapt to each individual user (Vomlel, 2004; Almond, DiBello, Moulder, & Zapata-Rivera, 2007; Almond, Mislevy, Steinberg, Yan, & Williamson, 2015). This in turn allows the test to tailor itself based on user responses and its current estimate of the user's knowledge level. For example, if the user answers a series of questions correctly, the test can adjust and select more difficult questions.
On the other hand, if the user answers a series of questions incorrectly, the test can adjust and present easier questions.

Bayesian networks have been used as the principal underlying model for many Computer Adaptive Tests (Millán & Pérez-De-La-Cruz, 2002; Vomlel, 2004; Munie & Shoham, 2008; Almond et al., 2015), as they offer powerful approaches to inferring a user's knowledge level given uncertain information (unanswered questions). Additionally, a significant body of work has been done on using Bayesian networks to perform a wide array of tasks in the field of educational diagnosis (VanLehn & Niu, 2001; Conati, Gertner, & VanLehn, 2002; Suebnukarn & Haddawy, 2006).

A key question in this domain is when the test should terminate. Some students may perform so well or so poorly that the system can recognize that further testing is unnecessary, as asking further questions would have very little probability of reversing the initial diagnosis. (Millán & Pérez-De-La-Cruz, 2002) discusses some stopping criteria that are used to determine when further questioning is needed, as well as some selection criteria that are used to determine which questions are actually asked.

In this paper, we discuss the creation of a Computer Adaptive Test, and then take a recently introduced notion, the Same-Decision Probability (SDP) (Darwiche & Choi, 2010), and show its usefulness as an information gathering criterion in this domain, in contrast to standard criteria. The SDP quantifies the stability of threshold-based decisions in Bayesian networks and is defined as the probability that a current decision would stay the same had we observed further information.

The paper is structured as follows. We first present the motivation for the constructed Computer Adaptive Test. We then discuss related work in the educational diagnosis field. Following that, we show how the Same-Decision Probability can be used as a stopping and selection criterion for our Computer Adaptive Test. Finally, we discuss the experiment setup, present empirical results demonstrating the usefulness of the Same-Decision Probability, and conclude the paper.

2 MOTIVATION

The Berkeley Free Clinic (http://www.berkeleyfreeclinic.org/) is a clinic staffed entirely by volunteers and offers medical and dental treatment for the surrounding community. In particular, the dental section of the clinic offers a wide variety of services, ranging from cleanings and x-ray services to fillings and extractions. Due to the expertise required for these services, the volunteers need to be highly trained to apply their knowledge in a practical setting. The volunteers are thus required to undergo a training period in order to prepare them to serve at the clinic.

Recently, it was decided that in order to better evaluate the knowledge of the volunteers, they would go through a competency exam to determine whether or not they had sufficient knowledge. The idea of this test is that if a volunteer is found to be inadequate, he or she must undergo further instruction. The coordinators of the clinic worked in conjunction with the volunteer dentists to create a written test that thoroughly covers the different concepts and skills necessary for a volunteer. The written test consists of 63 questions, ranging from showing a picture of a certain instrument (e.g., a perioprobe) and asking "Name this instrument", to "What is in a bridge and crown set up tray?". A portion of this test can be seen in Figures 1, 2, 3, and 4.

Figure 1: A portion of the competency test that measures general dental clinic knowledge.
Figure 2: A portion of the competency test that measures procedural knowledge.
Figure 3: A portion of the competency test that measures tool recognition knowledge.
Figure 4: A portion of the competency test that measures conceptual knowledge.

This test was given to 22 subjects. Sixteen of the subjects were evaluated prior to the test-taking to be competent volunteers, whereas the other 6 subjects were evaluated to be non-competent volunteers. The clinic coordinators set the pass threshold at 60%, meaning that volunteers needed to correctly answer 60% of the questions in order to be considered competent. The test proved to be effective in that, of the 16 competent volunteers, 15 passed, and of the 6 non-competent volunteers, none passed. The test-taking duration ranged from 30 minutes to 1 hour. Feedback from the participants indicated that they felt the test was fair and covered the necessary bases for a volunteer, but that it was too time-consuming.

The complaints about the duration of the test are in line with the findings of (García, Amandi, Schiaffino, & Campo, 2007), who found that a significant problem with web-based tests is that they are too long, which may hinder students' ability to perform as they simply become bored and careless.
This serves as the chief motivation for our work: we want to turn this test into a Computer Adaptive Test (CAT), so that we can present students a test with fewer questions, but just as many relevant questions. This way we can decrease the overall test duration without compromising the effectiveness of the test.

3 RELATED WORK

There has been a surge of interest in applying artificial intelligence techniques to educational diagnosis. Educational diagnosis is a broad field that covers Intelligent Tutoring Systems (ITS) (Conati et al., 2002; Suebnukarn & Haddawy, 2006; Gujarathi & Sonawane, 2012), as well as Computer Adaptive Tests (CAT) (Munie & Shoham, 2008; Millán, Descalco, Castillo, Oliveira, & Diogo, 2013). When compared to traditional techniques, these applications have been shown to be highly effective in increasing the efficiency of student learning (Millán & Pérez-De-La-Cruz, 2002; Vomlel, 2004; Sinharay, 2006; Brusilovsky & Millán, 2007; Beal, Arroyo, Cohen, Woolf, & Beal, 2010; Millán et al., 2013).

In (Millán & Pérez-De-La-Cruz, 2002), the authors use Bayesian networks as the framework for constructing Computer Adaptive Tests (CATs). They introduce a model for representing student knowledge, where knowledge is modeled as different interrelated concepts. They noted that student knowledge can be diagnosed (inferred) by treating student answers as evidence. In addition to this model, they introduced an adaptive testing algorithm. To evaluate their model, they used 180 simulated students, and found that the introduction of adaptive question selection improves both accuracy and efficiency. Their model incorporates notions such as "slipping", where a student might answer a question incorrectly even if the concept is known. They generate students randomly by sampling from the network, where each student is assumed to hold knowledge of different concepts.

Similar results were shown in (Vomlel, 2004), which discusses applications of Bayesian networks in educational diagnosis, especially in skill diagnosis. They show that modeling the dependence between various skills allows for a higher quality of diagnosis. Additionally, they show that Computer Adaptive Tests allow them to select the question that is best fitted to test for competency in certain skills. They found that computer adaptive testing substantially reduces the number of questions that need to be asked.

Just like (Millán & Pérez-De-La-Cruz, 2002), (Vomlel, 2004) and (Almond et al., 2015) found Bayesian networks to be useful for modeling the links between various related proficiencies, and modeled unseen questions as unobserved variables.
They use diagnostic Bayesian networks in order to model uncertainty, where experts build a Bayesian network and calibrate it from data.

Another application of Bayesian networks to model student knowledge is discussed in (Munie & Shoham, 2008). They use a Bayesian network to model qualifying exams for graduate students at Stanford, where their goal is to select the questions (observations) that best measure knowledge in order to determine whether or not the student should pass. They also use a threshold function and stress that reducing the number of questions asked is important. They also prove the NP-hardness of deciding an optimal set of questions to assign a student.

Additionally, a variety of work has been done on using Bayesian networks in the educational diagnosis field for threshold-based decision making under uncertainty, such as 1) in (Xenos, 2004), where the authors model the educational experience of various students and use threshold-based decisions to determine whether or not a student is likely to fail; 2) in (Arroyo & Woolf, 2005), where the goal is to determine the type of learner a student is (in terms of attitude); 3) in (Gertner, Conati, & VanLehn, 1998), where the goal is to infer what part of a problem a student is having trouble with, in order to provide hint-based remediation; and 4) in (Butz, Hua, & Maguire, 2004), where the constructed ITS can assess knowledge using a Bayesian network as well as recommend certain remediation, and a threshold-based decision is used to determine whether a student is competent in an area or not.

Some other work in the field of educational diagnosis is found in (Brusilovsky & Millán, 2007), which describes a model that details the relationship between the knowledge of certain concepts and how the student will perform on certain questions. They note that if a user demonstrates a lack of knowledge, the model can be used to locate the concepts that are most likely to remedy the situation. They found that using a Bayesian model allows them to compute the probability of a student answering a question correctly given competency in some domain.

More recently, work on educational diagnosis modeling has focused on even more intricate modeling and remediation methods. For instance, (Rajendran, Iyer, Murthy, Wilson, & Sheard, 2013) focuses on modeling user emotional states, in order to detect when users are frustrated as well as the cause of the frustration. By finding the root cause of the frustration, special measures may be taken to remediate. Similarly, (Chrysafiadi & Virvou, 2013) has done significant work in modeling a student's performance, progress, and behavior; the developed e-learning system has been deployed into production. (Chrysafiadi & Virvou, 2014) finds that modeling a user's detailed characteristics, such as knowledge, errors, and motivation, can improve the quality of a student's learning process, as it allows for better remediation. The need for personalized remediation is further stressed by (Vandewaetere & Clarebout, 2014), who study the importance of giving learners instruction and support directly. They stress that learner models need to be adjusted and updated with new information about the learners' knowledge, affective states, and behavior in order to be maximally effective.

4 COMPUTER ADAPTIVE TESTING

4.1 USING A BAYESIAN NETWORK

To address the problem of having such a time-consuming test, we believe that using a Computer Adaptive Test (CAT) could differentiate the competent students from the non-competent students with fewer questions than a standard test. Asking fewer questions while still accurately measuring a student's ability is a canonical problem in educational diagnosis, and is discussed thoroughly in (Millán & Pérez-De-La-Cruz, 2002; García et al., 2007; Munie & Shoham, 2008; Millán et al., 2013). Our main goal is to best measure whether or not a student is competent with a limited subset of questions. This means that we have to determine when enough questions have been asked, as well as which additional questions should be asked, so as to ensure that the Computer Adaptive Test as a whole comprises fewer questions.

Bayesian networks have been found to be especially useful for decision making under uncertainty (Nielsen & Jensen, 2009). We believe that they provide a natural mechanism to model both 1) student knowledge and 2) questions that may potentially be asked to measure student knowledge. In these models, Bayesian networks capture the relationships between various proficiencies, detailing the relationship between knowledge of certain concepts and how the student will perform on certain questions.

Using a Bayesian network thus allows us to predict how likely a student is to have knowledge of a certain concept based on the student's answers. Hence, we worked with the clinic coordinators and experts to create a Bayesian network structure and elicit its parameters for this exam. Our model is similar to the models developed by (Millán & Pérez-De-La-Cruz, 2002; Vomlel, 2004; Almond et al., 2007; Munie & Shoham, 2008; Brusilovsky & Millán, 2007; Millán et al., 2013), where Bayesian networks are used to evaluate and, in a sense, "diagnose" a student's degree of knowledge or competency.

In our network, we have a main variable of interest that represents the student's total level of knowledge. We refer to this variable as the decision variable (D), and it serves as an overall measure of our belief that a student is competent. General competency is determined by competency in a collection of specialized fields of knowledge, which are represented by a variety of latent concept variables. The final type of variable is a question variable, which represents a question that may be asked to the student. An answered question, whether correct or incorrect, acts as evidence that can influence our belief about the student's competency. The graphical structure of the model can be seen in Figure 5.

Similarly to (Gertner et al., 1998; VanLehn & Niu, 2001; Cantarel, Weaver, McNeill, Zhang, Mackey, & Reese, 2014), the clinic coordinators/experts determined that the "pass threshold" should be set at 0.8, meaning that if, given some evidence (questions answered), the posterior probability of the decision variable is found to be over 0.8, then the student is considered competent. (Another commonly used threshold in the educational diagnosis field is 0.7 (Butz et al., 2004; Arroyo & Woolf, 2005).) According to this pass threshold, we found once again that of the 16 competent volunteers, 15 passed, and of the 6 non-competent volunteers, none passed.
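To make the decision rule concrete, the following is a minimal sketch of a toy network in this style (a competency variable D, one latent concept, and two question variables) together with the threshold decision on the posterior of D. The structure, parameter values, and variable names here are illustrative stand-ins; they are not the clinic's elicited network.

```python
# Hypothetical toy network in the spirit of Figure 5: a competency (decision)
# variable D, one latent concept C, and two question variables Q1, Q2, all binary.
# The parameter values below are illustrative, not the elicited clinic parameters.
P_D = {1: 0.7, 0: 0.3}                       # prior belief that a volunteer is competent
P_C_GIVEN_D = {1: {1: 0.9, 0: 0.1},          # Pr(concept mastered | D)
               0: {1: 0.3, 0: 0.7}}
P_Q_GIVEN_C = {1: {1: 0.85, 0: 0.15},        # Pr(question correct | concept)
               0: {1: 0.25, 0: 0.75}}

def joint(d, c, answers):
    """Joint probability of D=d, C=c and the observed answers {question: 1/0}."""
    p = P_D[d] * P_C_GIVEN_D[d][c]
    for correct in answers.values():
        p *= P_Q_GIVEN_C[c][correct]
    return p

def posterior_competent(answers):
    """Pr(D = + | answers), obtained by summing out the latent concept C."""
    num = sum(joint(1, c, answers) for c in (0, 1))
    den = sum(joint(d, c, answers) for d in (0, 1) for c in (0, 1))
    return num / den

PASS_THRESHOLD = 0.8   # the pass threshold chosen by the clinic coordinators/experts

def is_competent(answers):
    return posterior_competent(answers) >= PASS_THRESHOLD

if __name__ == "__main__":
    evidence = {"Q1": 1, "Q2": 0}            # one correct and one incorrect answer
    p = posterior_competent(evidence)
    print(f"Pr(D = + | e) = {p:.3f}, decide competent: {is_competent(evidence)}")
```

In the actual model, each answered question simply enters as evidence on its question variable, and the competency decision is re-evaluated from the posterior on D after every answer.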
The volunteers' scores are shown in Table 1. From this table we can see that our constructed Bayesian model allows us to accurately predict a student's competency.

Student #   Percentage Correct   Posterior Probability
1           0.952                0.957
2           0.921                0.941
3           0.714                0.951
4           0.539                0.704
5           0.762                0.956
6           0.746                0.950
7           0.825                0.957
8           0.619                0.948
9           0.777                0.952
10          0.857                0.954
11          0.857                0.957
12          0.809                0.956
13          0.349                0.153
14          0.238                0.023
15          0.539                0.178
16          0.809                0.923
17          0.603                0.883
18          0.635                0.954
19          0.873                0.954
20          0.524                0.135
21          0.492                0.153
22          0.413                0.314

Table 1: This table shows, for each student, the percentage of questions answered correctly (out of 63 questions), and the posterior probability of the student being competent given all the answered questions. The non-competent volunteers are bolded.

4.2 STOPPING CRITERION FOR INFORMATION GATHERING

Although the test has a total of 63 questions, there may be a point during the adaptive test where we can determine that it is unnecessary to ask any further questions. For educational diagnosis in Bayesian networks, (Millán & Pérez-De-La-Cruz, 2002; Millán et al., 2013) show that there are two standard ways to determine whether more information gathering (additional questions posed) is necessary. The first involves a fixed test that asks a set number of questions, and then determines if the student is competent after all questions have been answered. This static method is clearly ineffective, as the student is only evaluated at the conclusion of the test. The second involves terminating the test once the posterior probability of the decision variable is above or below some threshold. This stopping criterion is also seen in (Hamscher, Console, & de Kleer, 1992; Heckerman, Breese, & Rommelse, 1995; Kruegel, Mutz, Robertson, & Valeur, 2003; Lu & Przytula, 2006). Note that (Millán & Pérez-De-La-Cruz, 2002) compares an approach that utilizes a set number of questions with a threshold-based approach: with a set number of questions they were able to diagnose students correctly 90.27% of the time, whereas using an adaptive criterion they were able to diagnose students correctly 94.54% of the time, while requiring fewer questions to be asked. This is a clear indication that using a less trivial stopping criterion can be ultimately beneficial.

Another possible stopping criterion involves computing the value of information of some observations and, if that value is not high enough, stopping information gathering. However, this involves either computing the myopic value of information, which considers only the usefulness of making one observation, or computing the non-myopic value of information, which considers the usefulness of making several observations. The former is fairly easy to compute, but can prove to be very limited, as the combined usefulness of a set of observations is often greater than that of its individual parts. For example, if a student answers a question incorrectly, it may not be very telling. However, if that student also answers a question incorrectly in an entirely orthogonal field, these two mistakes combined may be indicative of a student's lack of competency. Note that computing the non-myopic value of information is useful in determining if a set of observations has significant value, but it is intractable to compute if the set of observations is large (Krause & Guestrin, 2009).

To decide whether or not enough information has been gathered, we use a non-myopic stopping criterion and compute the Same-Decision Probability (SDP) (Darwiche & Choi, 2010; Chen, Choi, & Darwiche, 2014).

Definition 1 Suppose we are making a decision based on whether Pr(D = d | e) ≥ T for some evidence e and threshold T. If H is a set of variables that are available to observe, then the SDP is:

    SDP(d, H, e, T) = Σ_h [Pr(d | h, e) ≥ T] · Pr(h | e).    (1)

Here, [α] is an indicator function which is 1 if α is true, and 0 otherwise.

In short, the SDP is a measure of how robust a decision is with respect to some unobserved variables, and it will help determine how much more information is necessary.
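Below is a minimal sketch of Equation (1), computed by brute-force enumeration over the unanswered questions. It assumes a hypothetical model in which D directly parents each question (the clinic's actual network also contains latent concept variables), and all parameters are made up; our experiments instead use the exact algorithm of (Chen, Choi, & Darwiche, 2013).

```python
from itertools import product

# Hypothetical model: D directly parents four binary questions Q1..Q4.
# All parameters below are made-up; e is the evidence gathered so far.
P_D = {1: 0.7, 0: 0.3}
P_CORRECT = {1: 0.85, 0: 0.35}              # Pr(question answered correctly | D)
QUESTIONS = ["Q1", "Q2", "Q3", "Q4"]
T = 0.8                                     # decision threshold

def joint(d, answers):
    """Pr(D=d, answers) under the naive-Bayes style toy model."""
    p = P_D[d]
    for correct in answers.values():
        p *= P_CORRECT[d] if correct == 1 else 1.0 - P_CORRECT[d]
    return p

def pr(answers):
    """Marginal probability of a (partial) set of answers."""
    return sum(joint(d, answers) for d in (0, 1))

def posterior(answers):
    """Pr(D = + | answers)."""
    return joint(1, answers) / pr(answers)

def sdp(hidden, e):
    """Equation (1): probability that the current decision survives observing `hidden`."""
    decision = posterior(e) >= T
    total = 0.0
    for values in product((0, 1), repeat=len(hidden)):
        h = dict(zip(hidden, values))
        if (posterior({**e, **h}) >= T) == decision:
            total += pr({**e, **h}) / pr(e)   # Pr(h | e)
    return total

if __name__ == "__main__":
    e = {"Q1": 1, "Q2": 1}
    unanswered = [q for q in QUESTIONS if q not in e]
    print("Pr(D = + | e) =", round(posterior(e), 3))
    print("SDP of the current decision =", round(sdp(unanswered, e), 3))
```

The enumeration over h is exponential in the number of unanswered questions, which is why exact SDP algorithms with pruning and caching are needed in practice; the sketch is only meant to mirror the definition.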
In this case, based on the student's responses, at any point in time we can compute the posterior probability that they are competent and thus determine a decision of whether or not they are competent. Keep in mind that this decision is temporary and can be reversed if more questions are answered: we can compute the SDP over the remaining unanswered questions to determine how likely it is that the current decision (deciding whether a student is competent or non-competent) would remain the same even if the remaining questions were answered. If the SDP is high, that is an indication that we can terminate the test early. We found that using the SDP allowed us to cut the test duration significantly while maintaining diagnosis accuracy. Details of our experiment setup and results can be found in Section 5.1.

4.3 SELECTION CRITERION FOR INFORMATION GATHERING

We have now motivated the usefulness of the SDP as a stopping criterion for Computer Adaptive Tests, as it tells us when we can terminate the test. However, the question remains: if computing the SDP indicates that more questions are necessary, which questions should we ask?

In a standard, non-adaptive test, the ordering of the questions cannot be controlled. In the adaptive setting we have more control: based on a test taker's answers on the test so far, we can select questions that have the most potential to give us further insight on the test taker. Our goal here is to select questions such that the expected SDP after observing those questions is maximal. The expected SDP (Chen, Choi, & Darwiche, 2015) is defined as follows:

Definition 2 Let G be a subset of the available features H and let D(ge) be the decision made after observing features G, i.e., D(ge) = d if Pr(d | g, e) ≥ T. The E-SDP is then:

    E-SDP(D, G, H, e, T) = Σ_g SDP(D(ge), H \ G, ge, T) · Pr(g | e).    (2)

The expected SDP is thus a measure of how similar our decision after observing G is to the decision made after observing H. We want to find a question that will maximize the expected SDP. In other words, we want to ask questions that, no matter how they are answered, will on average minimize the usefulness of any remaining questions. By maximizing this objective, we reduce the overall number of questions that need to be answered before test termination.
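The following sketch illustrates Equation (2) in the special case relevant to our adaptive test, where G is a single candidate question, along with the greedy rule of ranking questions by their expected SDP. The toy model, its per-question reliabilities, and all names are hypothetical, and the brute-force enumeration is for illustration only.

```python
from itertools import product

# Hypothetical toy model D -> Q1..Q4 with per-question reliabilities; parameters
# are illustrative only. G is taken to be a single candidate question, matching
# the one-question-at-a-time setting of our adaptive test.
P_D = {1: 0.7, 0: 0.3}
P_CORRECT = {"Q1": {1: 0.90, 0: 0.20},      # Pr(correct | D) for each question
             "Q2": {1: 0.80, 0: 0.40},
             "Q3": {1: 0.70, 0: 0.45},
             "Q4": {1: 0.95, 0: 0.10}}
T = 0.8

def joint(d, answers):
    p = P_D[d]
    for q, correct in answers.items():
        p *= P_CORRECT[q][d] if correct == 1 else 1.0 - P_CORRECT[q][d]
    return p

def pr(answers):
    return sum(joint(d, answers) for d in (0, 1))

def posterior(answers):
    return joint(1, answers) / pr(answers)

def sdp(hidden, e):
    """Equation (1) for the decision currently implied by e."""
    decision = posterior(e) >= T
    return sum(pr({**e, **dict(zip(hidden, v))}) / pr(e)
               for v in product((0, 1), repeat=len(hidden))
               if (posterior({**e, **dict(zip(hidden, v))}) >= T) == decision)

def expected_sdp(question, e, hidden):
    """Equation (2) with G = {question}: weight the SDP after each possible answer."""
    rest = [q for q in hidden if q != question]
    return sum(sdp(rest, {**e, question: a}) * pr({**e, question: a}) / pr(e)
               for a in (0, 1))

if __name__ == "__main__":
    e = {"Q1": 0}                            # the student has missed Q1 so far
    hidden = [q for q in P_CORRECT if q not in e]
    ranked = sorted(hidden, key=lambda q: expected_sdp(q, e, hidden), reverse=True)
    print("questions ranked by expected SDP:", ranked)
```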
As stated before, computing the value of information (VOI) is essential to the process of information gathering. Since it is intractable to compute the non-myopic value of information for our model, that leaves only the myopic VOI as an available possibility. The VOI of observing a variable may depend on various objective functions, for instance, how much the observation reduces the entropy of the decision variable. There is a comprehensive overview of these different objective functions in (Krause & Guestrin, 2009). (Additionally, (Golovin & Krause, 2011) studies the use of adaptive submodularity for selection criteria and shows that objective functions satisfying this notion can be readily approximated; for our problem, the objective function does not satisfy adaptive submodularity.) (Chen et al., 2014, 2015) compares the approach of maximizing these standard objective functions to maximizing the expected SDP, and finds that maximizing the expected SDP is more effective in reducing the number of questions that need to be answered, particularly in cases where the decision threshold is extreme.

Our approach involves selecting the question that leads to the highest expected SDP. We found that selecting variables to optimize the expected SDP allowed us to substantially decrease the number of questions selected compared to other selection criteria. Details of the experiment setup and experimental results can be found in Section 5.2.

5 EXPERIMENTS

In this section, we empirically evaluate the SDP as a stopping and selection criterion for our adaptive test. We compare the SDP against the standard criteria used by other Bayesian Computer Adaptive Tests. First, we introduce some notation used throughout the experimental section.

We use standard notation for variables and their instantiations, where variables are denoted by upper case letters (e.g., X) and their instantiations by lower case letters (e.g., x). Sets of variables are denoted by bold upper case letters (e.g., X) and their instantiations by bold lower case letters (e.g., x). The primary decision variable, measuring a student's overall competency, is denoted by D, with two states + (competent) and − (non-competent).

We have the completed tests for 22 subjects, where for each subject we have the responses to the n = 63 test questions (more precisely, we know whether each question was answered correctly or incorrectly). We denote this dataset by T, where

    T = {e_1, ..., e_22},

and where each e_i ∈ T is an instantiation of the n test responses for student i. Hence, the probability Pr(D = + | e_i) denotes the posterior probability that student i is competent given their test results. Using our Bayesian network, we decide that student i is competent if Pr(D = + | e_i) ≥ 0.80 (otherwise, we decide that they are not sufficiently competent). We report the quantity Pr(D = + | e_i) for each student in Table 1.

To evaluate the SDP as a stopping and selection criterion, we simulate partially-completed tests from the fully-completed tests T. In particular, we take the fully-completed test results e_i for each student i, and generate a set of partially-completed tests Q_i = {q_i,1, ..., q_i,n}: we randomly permute the questions and take the first j questions of the permutation as a partially-completed test q_i,j. Hence, a partially-completed test q_i,j+1 adds one additional test question to test q_i,j, and test q_i,n corresponds to the fully-completed test e_i.

In our experiments, we use those partially-completed tests q_i,j that have at least 10 questions, i.e., where j ≥ 10 (we assume 10 questions to be the minimum number of questions at which we can begin to evaluate the competency of a student). Moreover, for each of the 22 students i, we simulated 50 sets of partially-completed tests Q_i based on 50 random permutations, giving us a total of 22 · 50 = 1,100 sets of partially-completed tests.
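A minimal sketch of this simulation procedure is shown below; the helper and variable names are ours, and the stand-in student answers are random rather than taken from the actual test data.

```python
import random

# Sketch of the simulation of partially-completed tests: for one student's fully-
# completed test e_i (a dict from the 63 question names to 1/0 answers), each set
# Q_i consists of the prefixes of one random permutation of the questions.
def partial_tests(full_test, rng):
    """Return [q_i_1, ..., q_i_n], the prefixes of one random permutation of full_test."""
    order = list(full_test)
    rng.shuffle(order)
    return [{q: full_test[q] for q in order[:j]} for j in range(1, len(order) + 1)]

if __name__ == "__main__":
    rng = random.Random(0)
    e_i = {f"Q{k}": rng.randint(0, 1) for k in range(1, 64)}          # stand-in for a real student
    sets_for_student = [partial_tests(e_i, rng) for _ in range(50)]   # 50 random permutations
    usable = [q for tests in sets_for_student for q in tests if len(q) >= 10]   # keep j >= 10
    print(len(sets_for_student), "sets;", len(usable), "partially-completed tests with j >= 10")
```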
5.1 STOPPING CRITERION EXPERIMENTS

Using our partially-completed tests, we evaluate the SDP as a stopping criterion against more traditional methods. In particular, we take each set of partially-completed tests Q_i and, going from test q_i,10 up to test q_i,n−1, we check whether each stopping criterion is satisfied, i.e., whether a decision is made to stop asking questions. Note that given test q_i,n, the only decision is to stop, since there are no more test questions to ask. For the SDP, when we evaluate the test q_i,j, we compute the SDP with respect to the remaining n − j unanswered questions (i.e., we treat them as the set of available observables H).

As for other, more traditional, stopping criteria, we consider (1) stopping after a fixed number of questions have been answered (after which a student's competence is determined), or (2) stopping early once the posterior probability Pr(D = + | q_i,j) surpasses a given threshold T, after which we deem a student to be competent.

In Table 2, we report the results where our stopping criterion is based on asking a fixed number of questions. In Table 3, we highlight the results where we instead use a posterior probability threshold (Millán & Pérez-De-La-Cruz, 2002; Xenos, 2004; Arroyo & Woolf, 2005; Munie & Shoham, 2008). These results are based on averages over our 1,100 sets of partially-completed tests Q_i, which we simulated from our original dataset. In both tables, we see that there is a clear trade-off between the accuracy of the test and the number of questions that we ask. (Note again that students were already evaluated to be competent/non-competent prior to the test, and that precision/recall is based on this prior evaluation.)

# Questions   Precision   Recall
10            0.566       0.403
15            0.752       0.659
20            0.805       0.755
25            0.815       0.812
30            0.835       0.835
35            0.874       0.875
40            0.901       0.909
45            0.915       0.914
50            0.918       0.926
55            0.924       0.938
60            0.950       0.938
62            0.954       0.938

Table 2: Precision and recall for a set number of questions.

T       Precision   Recall   # Questions Asked
0.750   0.803       0.758    10.13
0.775   0.813       0.821    13.04
0.800   0.836       0.832    16.02
0.825   0.867       0.872    19.26
0.850   0.903       0.915    23.13
0.875   0.919       0.912    27.14
0.900   0.908       0.911    31.57
0.925   0.968       0.884    36.82
0.950   0.974       0.818    44.25

Table 3: Precision, recall, and average number of questions asked, for varying thresholds T on the posterior probability.
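For clarity, the precision and recall reported in Tables 2-4 treat "competent" as the positive class and are measured against the volunteers' prior competent/non-competent evaluation. The following sketch shows the computation on stand-in data; the decision and label values are illustrative only.

```python
# Stand-in computation of the precision/recall columns: one decision per simulated
# partially-completed test, paired with the student's prior evaluation (True = competent).
def precision_recall(decisions, labels):
    tp = sum(1 for d, y in zip(decisions, labels) if d and y)
    fp = sum(1 for d, y in zip(decisions, labels) if d and not y)
    fn = sum(1 for d, y in zip(decisions, labels) if not d and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

if __name__ == "__main__":
    decisions = [True, True, False, True, False]   # illustrative decisions
    labels    = [True, True, True, False, False]   # illustrative prior evaluations
    print(precision_recall(decisions, labels))     # precision = recall = 2/3 here
```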
We next consider the SDP as a stopping criterion. In particular, we compute the SDP with respect to all unanswered questions, and then make a stopping decision when the computed SDP surpasses a given threshold T. Note that when we use a threshold T = 1.0, we commit to a stopping decision only when no further observations can change the decision (i.e., the probability of making the same decision is 1.0). In Table 4, we report the results of using the SDP as a stopping criterion.

We see that even when our threshold is set to a relatively small value (0.850), we still attain precision and recall rates that are comparable to those obtained by asking nearly all questions (as in Table 2). For the stopping criterion based on posterior probability thresholds (in Table 3), we can see that we can attain a higher precision, but at the expense of a lower recall. In fact, we see that once the threshold is set high enough, the recall actually drops: some students who should be diagnosed as competent are being diagnosed incorrectly as non-competent once the threshold is set too high.

Consider now the case where we set a very large threshold on the SDP (0.999). In this case, the precision and recall are equivalent to the case where our criterion is to ask a set number of n − 1 = 62 questions (as in Table 2). In contrast, using the SDP criterion, we ask only an average of 42.05 questions, meaning that the SDP as a stopping criterion has the same robustness as the "set number of questions" stopping criterion, while asking nearly 20 fewer questions.

SDP T   Precision   Recall   # Questions Asked
0.850   0.927       0.932    39.10
0.875   0.938       0.938    39.73
0.900   0.942       0.938    39.61
0.925   0.946       0.938    39.77
0.950   0.946       0.938    40.09
0.975   0.950       0.938    40.61
0.990   0.950       0.938    41.14
0.995   0.950       0.938    41.41
0.999   0.954       0.938    42.05

Table 4: Precision, recall, and average number of questions asked for different SDP thresholds.

It is clear from our results that if our goal is to ask fewer questions while maintaining competitive precision and recall rates, using the SDP as a stopping criterion is a compelling alternative to the traditional stopping criteria of 1) asking a set number of questions and 2) checking whether the posterior probability of competency surpasses a threshold. Using the SDP as a stopping criterion allows us to reduce the number of questions asked while still maintaining the same precision and recall.
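Putting the pieces of this section together, the following sketch shows the control flow of an adaptive test that uses the SDP as its stopping rule: questions are asked until the SDP of the current decision exceeds a threshold, or the questions run out. The compute_sdp, pick_question, and get_answer callbacks are placeholders standing in for the real inference engine, selection criterion, and test taker.

```python
import random

# Minimal sketch of the adaptive test loop with the SDP as the stopping rule.
def run_adaptive_test(questions, compute_sdp, pick_question, get_answer,
                      sdp_threshold=0.95, min_questions=10):
    evidence = {}
    unanswered = list(questions)
    while unanswered:
        if len(evidence) >= min_questions and compute_sdp(evidence, unanswered) >= sdp_threshold:
            break                                  # current decision is unlikely to change
        q = pick_question(evidence, unanswered)
        evidence[q] = get_answer(q)
        unanswered.remove(q)
    return evidence

if __name__ == "__main__":
    rng = random.Random(1)
    qs = [f"Q{k}" for k in range(1, 64)]
    asked = run_adaptive_test(
        qs,
        compute_sdp=lambda e, u: min(1.0, 0.5 + 0.02 * len(e)),   # placeholder SDP
        pick_question=lambda e, u: u[0],                          # placeholder selection
        get_answer=lambda q: rng.randint(0, 1))                   # placeholder test taker
    print("questions asked:", len(asked))
```

The min_questions parameter mirrors our assumption that at least 10 questions are answered before competency is evaluated.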
5.2 SELECTION CRITERION EXPERIMENTS

We next consider experiments that compare various question selection criteria: 1) random selection, where we select the next question randomly (as in non-adaptive or linear testing), 2) information gain (mutual information) (Millán & Pérez-De-La-Cruz, 2002; Vomlel, 2004), and 3) margins of confidence (Krause & Guestrin, 2009). Note that as this model does not fit the decision-theoretic setting (there are no assigned utilities), we do not consider utility maximization as a selection criterion. In addition to the above, we evaluate the SDP as a selection criterion.

Since our adaptive test involves selecting one question at a time, our goal is to select a variable H ∈ H (corresponding to a question not yet presented) that leads to the greatest gain in the SDP, i.e., the SDP gain. Our selection criterion is based on asking the question which yields the highest SDP gain, using information gain as a tie-breaker when multiple questions have the same SDP gain. A tie-breaker is needed because the SDP gain of observing a single variable can be zero: in this case, observing a single variable is not enough to change our decision (multiple observations may be needed). In our experiments, we compute the SDP gain using a branch-and-bound algorithm, as in (Chen, Choi, & Darwiche, 2013). The advantages of this algorithm are that (1) it can prune the corresponding search space, and (2) it can cache and reuse intermediate calculations, resulting in significant computational savings.

We compare how quickly these selection criteria allow us to stop information gathering, each using the SDP as a stopping criterion; see Table 5. Here, we took partially-completed tests (all tests where at least 30 questions were answered) and used each selection criterion to select additional questions to ask, until the SDP dictates that we stop asking questions. On average, we find that (1) the random selection criterion asks an additional 8.18 questions, (2) information gain asks an additional 3.67 questions, (3) margins of confidence asks an additional 3.57 questions, and (4) our approach based on the SDP gain requires the fewest additional questions: 2.77.

Note that while the random question selection criterion clearly has the poorest performance, there is a relatively modest improvement from our approach, using the SDP gain, compared to the more common approaches based on information gain (Millán & Pérez-De-La-Cruz, 2002; Vomlel, 2004) and margins of confidence (Krause & Guestrin, 2009). Nevertheless, the benefits of our approach as a selection criterion are still evident. Further, the results suggest the potential of using the SDP gain in a less greedy way, where we ask multiple questions at a time, which we consider a promising direction for future work.

SDP T   Random   IG     Margins   SDP
0.850   4.77     2.98   2.71      2.05
0.875   4.78     3.08   2.87      2.14
0.900   5.08     3.20   3.12      2.20
0.925   5.95     3.19   3.04      2.31
0.950   6.81     3.37   3.44      2.47
0.975   7.86     3.73   3.56      2.96
0.990   10.64    4.19   4.29      3.18
0.995   12.95    4.32   4.22      3.43
0.999   14.80    4.97   4.91      4.25

Table 5: Number of additional observations necessary before stopping, for the random selection criterion, the information gain criterion, the margins of confidence criterion, and the SDP hybrid criterion.
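The following sketch illustrates this hybrid selection rule (SDP gain with information gain as a tie-breaker) on a hypothetical toy model by brute-force enumeration; in our experiments, the SDP gain is instead computed with the branch-and-bound algorithm of (Chen, Choi, & Darwiche, 2013). All names and parameters below are illustrative assumptions.

```python
import math
from itertools import product

# Hypothetical toy model D -> Q1..Q4; parameters are illustrative only.
P_D = {1: 0.7, 0: 0.3}
P_CORRECT = {"Q1": {1: 0.90, 0: 0.20}, "Q2": {1: 0.80, 0: 0.40},
             "Q3": {1: 0.70, 0: 0.45}, "Q4": {1: 0.95, 0: 0.10}}
T = 0.8

def joint(d, ans):
    p = P_D[d]
    for q, v in ans.items():
        p *= P_CORRECT[q][d] if v == 1 else 1.0 - P_CORRECT[q][d]
    return p

def pr(ans):
    return sum(joint(d, ans) for d in (0, 1))

def post(ans):                               # Pr(D = + | ans)
    return joint(1, ans) / pr(ans)

def sdp(hidden, e):
    """SDP of the decision currently implied by e, w.r.t. the unanswered `hidden`."""
    decision = post(e) >= T
    total = 0.0
    for vals in product((0, 1), repeat=len(hidden)):
        h = dict(zip(hidden, vals))
        if (post({**e, **h}) >= T) == decision:
            total += pr({**e, **h}) / pr(e)
    return total

def sdp_gain(q, e, hidden):
    """Expected SDP after asking q, minus the SDP before asking it."""
    rest = [x for x in hidden if x != q]
    expected = sum(sdp(rest, {**e, q: a}) * pr({**e, q: a}) / pr(e) for a in (0, 1))
    return expected - sdp(hidden, e)

def info_gain(q, e):
    """Mutual information between D and question q, given the evidence e."""
    def entropy(p):
        return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    cond = sum((pr({**e, q: a}) / pr(e)) * entropy(post({**e, q: a})) for a in (0, 1))
    return entropy(post(e)) - cond

def select(e, hidden):
    # Highest SDP gain wins; information gain breaks ties (e.g., when all gains are zero).
    return max(hidden, key=lambda q: (sdp_gain(q, e, hidden), info_gain(q, e)))

if __name__ == "__main__":
    e = {"Q1": 0}
    hidden = [q for q in P_CORRECT if q not in e]
    print("next question:", select(e, hidden))
```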
5.3 RUNNING TIMES

The SDP has been shown to be highly intractable, being PP^PP-complete (Choi, Xue, & Darwiche, 2012). Therefore, the computational requirements of the SDP, as a stopping and selection criterion, may be higher than those of other, more common approaches. This complexity depends largely on the number of unasked questions that must be considered. Figure 6 shows a plot of the running times for the SDP stopping and selection criteria, in comparison to other measures. Note that the SDP selection criterion is on average more efficient than the SDP stopping criterion, even though the selection criterion includes an SDP gain computation as well as an information gain computation (as a tie-breaker). Here, there are cases that can be detected that allow us to skip the SDP gain computation (Chen et al., 2013), leaving just the relatively efficient information gain computation. In general, we note that there is a trade-off: the SDP, as a stopping and selection criterion, provides valuable information (leading to fewer questions asked), but it is also more computationally demanding.

Figure 6: Runtimes of different algorithms (average running time in seconds versus the number of unasked questions, log scale). SDP-STOP and SDP-SEL are respectively the stopping and selection criterion algorithms, THRESH represents the standard posterior stopping criterion, and IG represents the standard information gain selection criterion.

6 CONCLUSION

We created a Computer Adaptive Test using a Bayesian network as the underlying model and showed how the notion of the SDP can be used as an information gathering criterion in this context. We showed that it can act as a stopping criterion for determining if further questions are needed, and as a selection criterion for determining which questions should be asked. Finally, we have shown empirically that the SDP is a valuable information gathering tool, as its usage allows us to ask fewer questions while still maintaining the same level of precision and recall for diagnosis.

Acknowledgments

This work has been partially supported by ONR grant #N00014-12-1-0423, NSF grants #IIS-1514253 and #IIS-1118122, and a Google Research Award. We also thank the coordinators and volunteers at the Berkeley Free Clinic.

References

Almond, R. G., DiBello, L. V., Moulder, B., & Zapata-Rivera, J.-D. (2007). Modeling diagnostic assessments with Bayesian networks. Journal of Educational Measurement, 44(4), 341–359.

Almond, R. G., Mislevy, R. J., Steinberg, L. S., Yan, D., & Williamson, D. M. (2015). Bayesian Networks in Educational Assessment. Springer.

Arroyo, I., & Woolf, B. (2005). Inferring learning and attitudes from a Bayesian network of log file data. In Proceedings of the 12th International Conference on Artificial Intelligence in Education, pp. 33–40.

Beal, C. R., Arroyo, I., Cohen, P. R., Woolf, B. P., & Beal, C. R. (2010). Evaluation of AnimalWatch: An intelligent tutoring system for arithmetic and fractions. Journal of Interactive Online Learning, 9(1), 64–77.

Brusilovsky, P., & Millán, E. (2007). User models for adaptive hypermedia and adaptive educational systems. In The Adaptive Web, pp. 3–53. Springer-Verlag.

Butz, C. J., Hua, S., & Maguire, R. B. (2004). A web-based intelligent tutoring system for computer programming. In Web Intelligence, pp. 159–165.

Cantarel, B. L., Weaver, D., McNeill, N., Zhang, J., Mackey, A. J., & Reese, J. (2014). BAYSIC: A Bayesian method for combining sets of genome variants with improved specificity and sensitivity. BMC Bioinformatics, 15(1), 104.

Chen, S., Choi, A., & Darwiche, A. (2013). An exact algorithm for computing the Same-Decision Probability. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pp. 2525–2531.

Chen, S., Choi, A., & Darwiche, A. (2014). Algorithms and applications for the Same-Decision Probability. Journal of Artificial Intelligence Research (JAIR), 49, 601–633.

Chen, S., Choi, A., & Darwiche, A. (2015). Value of information based on decision robustness. In Proceedings of the 29th Conference on Artificial Intelligence (AAAI).

Choi, A., Xue, Y., & Darwiche, A. (2012). Same-Decision Probability: A confidence measure for threshold-based decisions. International Journal of Approximate Reasoning (IJAR), 53(9), 1415–1428.

Chrysafiadi, K., & Virvou, M. (2013). Persiva: An empirical evaluation method of a student model of an intelligent e-learning environment for computer programming. Computers & Education, 68, 322–333.

Chrysafiadi, K., & Virvou, M. (2014). Kem cs: A set of student's characteristics for modeling in adaptive programming tutoring systems. In Information, Intelligence, Systems and Applications (IISA 2014), The 5th International Conference on, pp. 106–110. IEEE.

Conati, C., Gertner, A., & VanLehn, K. (2002). Using Bayesian networks to manage uncertainty in student modeling. User Modeling and User-Adapted Interaction, 12(4), 371–417.

Darwiche, A., & Choi, A. (2010). Same-Decision Probability: A confidence measure for threshold-based decisions under noisy sensors. In Proceedings of the 5th European Workshop on Probabilistic Graphical Models (PGM), pp. 113–120.

García, P., Amandi, A., Schiaffino, S., & Campo, M. (2007). Evaluating Bayesian networks' precision for detecting students' learning styles. Computers & Education, 49(3), 794–808.

Gertner, A. S., Conati, C., & VanLehn, K. (1998). Procedural help in Andes: Generating hints using a Bayesian network student model. In Proceedings of the National Conference on Artificial Intelligence, pp. 106–111.

Golovin, D., & Krause, A. (2011). Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research (JAIR), 42, 427–486.

Gujarathi, M. V., & Sonawane, M. S. (2012). Intelligent tutoring system: A case study of mobile mentoring for diabetes. IJAIS, 3(8), 41–43.
Hamscher, W., Console, L., & de Kleer, J. (Eds.). (1992). Readings in Model-Based Diagnosis. Morgan Kaufmann Publishers Inc.

Heckerman, D., Breese, J. S., & Rommelse, K. (1995). Decision-theoretic troubleshooting. Communications of the ACM, 38(3), 49–57.

Krause, A., & Guestrin, C. (2009). Optimal value of information in graphical models. Journal of Artificial Intelligence Research (JAIR), 35, 557–591.

Kruegel, C., Mutz, D., Robertson, W., & Valeur, F. (2003). Bayesian event classification for intrusion detection. In Proceedings of the Annual Computer Security Applications Conference (ACSAC).

Lu, T.-C., & Przytula, K. W. (2006). Focusing strategies for multiple fault diagnosis. In Proceedings of the 19th International FLAIRS Conference, pp. 842–847.

Millán, E., Descalco, L., Castillo, G., Oliveira, P., & Diogo, S. (2013). Using Bayesian networks to improve knowledge assessment. Computers & Education, 60(1).

Millán, E., & Pérez-De-La-Cruz, J. L. (2002). A Bayesian diagnostic algorithm for student modeling and its evaluation. User Modeling and User-Adapted Interaction, 12(2-3), 281–330.

Munie, M., & Shoham, Y. (2008). Optimal testing of structured knowledge. In Proceedings of the 23rd National Conference on Artificial Intelligence, pp. 1069–1074.

Nielsen, T. D., & Jensen, F. V. (2009). Bayesian Networks and Decision Graphs. Springer Science & Business Media.

Rajendran, R., Iyer, S., Murthy, S., Wilson, C., & Sheard, J. (2013). A theory-driven approach to predict frustration in an ITS. IEEE Transactions on Learning Technologies, 6(4), 378–388.

Sinharay, S. (2006). Model diagnostics for Bayesian networks. Journal of Educational and Behavioral Statistics, 31(1), 1–33.

Suebnukarn, S., & Haddawy, P. (2006). A Bayesian approach to generating tutorial hints in a collaborative medical problem-based learning system. Artificial Intelligence in Medicine, 38(1), 5–24.
Vandewaetere, M., & Clarebout, G. (2014). Advanced technologies for personalized learning, instruction, and performance. In Handbook of Research on Educational Communications and Technology, pp. 425–437. Springer.

VanLehn, K., & Niu, Z. (2001). Bayesian student modeling, user interfaces and feedback: A sensitivity analysis. International Journal of Artificial Intelligence in Education, 12(2), 154–184.

Vomlel, J. (2004). Bayesian networks in educational testing. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 12(supp01), 83–100.

Xenos, M. (2004). Prediction and assessment of student behaviour in open and distance education in computers using Bayesian networks. Computers & Education, 43(4), 345–359.

Figure 5: Bayesian network modeling clinic volunteer knowledge. The 63 questions are labeled from Q1 to Q63. The primary decision variable Adequacy is located in the top center.