Computer Adaptive Testing Using the Same-Decision Probability

Suming Chen and Arthur Choi and Adnan Darwiche
Computer Science Department
University of California, Los Angeles
{suming,aychoi,darwiche}@cs.ucla.edu

Abstract

Computer Adaptive Tests dynamically allocate questions to students based on their previous responses. This involves several challenges, such as determining when the test should terminate, as well as which questions should be asked. In this paper, we introduce a Computer Adaptive Test that uses a Bayesian network as the underlying model. Additionally, we show how the notion of the Same-Decision Probability can be used as an information gathering criterion in this context: to determine whether further questions are needed and, if so, which questions should be asked. We show empirically that the Same-Decision Probability is a viable and intuitive approach for question selection in Bayesian-network-based Computer Adaptive Tests, as its usage allows us to ask fewer questions while still maintaining the same level of precision and recall in terms of classifying competent students.

1 INTRODUCTION

Computer Adaptive Tests have recently become increasingly popular, as they have the ability to adapt to each individual user (Vomlel, 2004; Almond, DiBello, Moulder, & Zapata-Rivera, 2007; Almond, Mislevy, Steinberg, Yan, & Williamson, 2015). This in turn allows the test to tailor itself based on user responses and its current estimate of the user's knowledge level. For example, if the user answers a series of questions correctly, the test can adjust and select more difficult questions.
On the other hand, if the user answers a series of questions incorrectly, the test can adjust and present easier questions.

Bayesian networks have been used as the principal underlying model for many Computer Adaptive Tests (Millán & Pérez-De-La-Cruz, 2002; Vomlel, 2004; Munie & Shoham, 2008; Almond et al., 2015), as they offer powerful approaches to inferring a user's knowledge level given uncertain information (unanswered questions). Additionally, a significant body of work has been done on using Bayesian networks to perform a wide array of tasks in the field of educational diagnosis (VanLehn & Niu, 2001; Conati, Gertner, & VanLehn, 2002; Suebnukarn & Haddawy, 2006).

A key question in this domain is when the test should terminate. Some students may perform so well or so poorly that the system can recognize that further testing is unnecessary, as asking further questions would have very little probability of reversing the initial diagnosis. (Millán & Pérez-De-La-Cruz, 2002) discusses some stopping criteria that are used to determine when further questioning is needed, as well as some selection criteria that are used to determine which questions are actually asked.

In this paper, we discuss the creation of a Computer Adaptive Test, and then take a recently introduced notion, the Same-Decision Probability (SDP) (Darwiche & Choi, 2010), and show its usefulness as an information gathering criterion in this domain, in contrast to standard criteria. The SDP quantifies the stability of threshold-based decisions in Bayesian networks and is defined as the probability that a current decision would stay the same had we observed further information.

The paper is structured as follows. We first present the motivation for the constructed Computer Adaptive Test. We then discuss related work in the educational diagnosis field. Following that, we show how the Same-Decision Probability can be used as a stopping and selection criterion for our Computer Adaptive Test. Finally, we discuss the experiment setup, present empirical results demonstrating the usefulness of the Same-Decision Probability, and conclude the paper.

2 MOTIVATION

The Berkeley Free Clinic (http://www.berkeleyfreeclinic.org/) is a clinic staffed entirely by volunteers and offers medical and dental treatment for the surrounding community. In particular, the dental section of the clinic offers a wide variety of services, ranging from cleanings and x-ray services to fillings and extractions. Due to the expertise required for these services, the volunteers need to be highly trained to apply their knowledge in a practical setting. The volunteers are thus required to undergo a training period in order to prepare them to serve at the clinic.

Recently, it was decided that in order to better evaluate the knowledge of the volunteers, they would go through a competency exam to determine whether or not they had sufficient knowledge. The idea of this test is that if a volunteer is found to be inadequate, he or she must undergo further instruction. The coordinators of the clinic worked in conjunction with the volunteer dentists to create a written test that thoroughly covers the different concepts and skills necessary for a volunteer. The written test consists of 63 questions, ranging from showing a picture of a certain instrument (e.g., a perioprobe) and asking "Name this instrument", to "What is in a bridge and crown set up tray?". A portion of this test can be seen in Figures 1, 2, 3, and 4.

Figure 1: A portion of the competency test that measures general dental clinic knowledge.
Figure 2: A portion of the competency test that measures procedural knowledge.
Figure 3: A portion of the competency test that measures tool recognition knowledge.
Figure 4: A portion of the competency test that measures conceptual knowledge.

This test was given to 22 subjects. Sixteen of the subjects were evaluated prior to the test-taking to be competent volunteers, whereas the other 6 subjects were evaluated to be non-competent volunteers. The clinic coordinators set the pass threshold at 60%, meaning that volunteers needed to correctly answer 60% of the questions in order to be considered competent. The test proved to be effective in that, of the 16 competent volunteers, 15 passed, and of the 6 non-competent volunteers, none passed. The test-taking duration ranged from 30 minutes to 1 hour. Feedback from the participants indicated that they felt the test was fair and covered the necessary bases for a volunteer, but that it was too time-consuming.

The complaints about the duration of the test are in line with the findings of (García, Amandi, Schiaffino, & Campo, 2007), who found that a significant problem with web-based tests is that they are too long, which may hinder students' ability to perform as they simply become bored and careless.
This serves as the chief motivation for our work: we want to turn this test into a Computer Adaptive Test (CAT), so that we can present students a test with fewer questions, but just as many relevant questions. This way we can decrease the overall test duration without compromising the effectiveness of the test.

3 RELATED WORK

There has been a surge of interest in applying artificial intelligence techniques to educational diagnosis. Educational diagnosis is a broad field that covers Intelligent Tutoring Systems (ITS) (Conati et al., 2002; Suebnukarn & Haddawy, 2006; Gujarathi & Sonawane, 2012), as well as Computer Adaptive Tests (CAT) (Munie & Shoham, 2008; Millán, Descalco, Castillo, Oliveira, & Diogo, 2013). When compared to traditional techniques, these applications have been shown to be highly effective in increasing the efficiency of student learning (Millán & Pérez-De-La-Cruz, 2002; Vomlel, 2004; Sinharay, 2006; Brusilovsky & Millán, 2007; Beal, Arroyo, Cohen, Woolf, & Beal, 2010; Millán et al., 2013).

In (Millán & Pérez-De-La-Cruz, 2002), the authors use Bayesian networks as the framework for constructing Computer Adaptive Tests (CATs). They introduce a model for representing student knowledge, where knowledge is modeled as different interrelated concepts. They noted that student knowledge can be diagnosed (inferred) by treating student answers as evidence. In addition to this model, they introduced an adaptive testing algorithm. To evaluate their model, they used 180 simulated students, and found that the introduction of adaptive question selection improves both accuracy and efficiency. Their model incorporates notions such as "slipping", where a student might answer a question incorrectly even if the concept is known. They generate students randomly by sampling from the network, where each student is assumed to hold knowledge of different concepts.

Similar results were shown in (Vomlel, 2004), which discusses applications of Bayesian networks in educational diagnosis, especially in skill diagnosis. They show that modeling the dependence between various skills allows for a higher quality of diagnosis. Additionally, they show that Computer Adaptive Tests allow them to select the question that is best fitted to test for competency in certain skills. They found that computer adaptive testing substantially reduces the number of questions that need to be asked.

Just like (Millán & Pérez-De-La-Cruz, 2002), (Vomlel, 2004) and (Almond et al., 2015) found Bayesian networks to be useful for modeling the links between various related proficiencies, and modeled unseen questions as unobserved variables.
They use diagnostic Bayesian networks in order to model uncertainty, where experts build a Bayesian network and calibrate it from data.

Another application of Bayesian networks to model student knowledge is discussed in (Munie & Shoham, 2008). They use a Bayesian network to model qualifying exams for graduate students at Stanford, where their goal is to select the questions (observations) that best measure knowledge in order to determine whether or not the student should pass. They also use a threshold function and stress that reducing the number of questions asked is important. They also prove the NP-hardness of deciding an optimal set of questions to assign a student.

Additionally, a variety of work has been done on using Bayesian networks in the educational diagnosis field for threshold-based decision making under uncertainty, such as 1) in (Xenos, 2004), where the authors model the educational experience of various students and use threshold-based decisions to determine whether or not a student is likely to fail; 2) in (Arroyo & Woolf, 2005), where the goal is to determine the type of learner a student is (in terms of attitude); 3) in (Gertner, Conati, & VanLehn, 1998), where the goal is to infer what part of a problem a student is having trouble with, in order to provide hint-based remediation; and 4) in (Butz, Hua, & Maguire, 2004), where the constructed ITS can assess knowledge using a Bayesian network as well as recommend certain remediation, and a threshold-based decision is used to determine whether a student is competent in an area or not.

Some other work in the field of educational diagnosis is found in (Brusilovsky & Millán, 2007), which describes a model that details the relationship between the knowledge of certain concepts and how the student will perform on certain questions. They note that if a user demonstrates a lack of knowledge, the model can be used to locate the concepts that are most likely to remedy the situation. They found that using a Bayesian model allows them to compute the probability of a student answering a question correctly given competency in some domain.

More recently, work on educational diagnosis modeling has focused on even more intricate modeling and remediation methods. For instance, (Rajendran, Iyer, Murthy, Wilson, & Sheard, 2013) focuses on modeling user emotional states, in order to detect when users are frustrated as well as the cause of the frustration. By finding the root cause of the frustration, special measures may be taken to remediate. Similarly, (Chrysafiadi & Virvou, 2013) has done significant work in modeling a student's performance, progress, and behavior; the developed e-learning system has been deployed into production. (Chrysafiadi & Virvou, 2014) finds that modeling a user's detailed characteristics, such as knowledge, errors, and motivation, can improve the quality of a student's learning process, as it allows for better remediation. The need for personalized remediation is further stressed by (Vandewaetere & Clarebout, 2014), who study the importance of giving learners instruction and support directly. They stress that learner models need to be adjusted and updated with new information about the learners' knowledge, affective states, and behavior in order to be maximally effective.

4 COMPUTER ADAPTIVE TESTING

4.1 USING A BAYESIAN NETWORK

To address the problem of having such a time-consuming test, we believe that using a Computer Adaptive Test (CAT) could differentiate the competent students from the non-competent students with fewer questions than a standard test. Asking fewer questions while still accurately measuring a student's ability is a canonical problem in educational diagnosis, and is discussed thoroughly in (Millán & Pérez-De-La-Cruz, 2002; García et al., 2007; Munie & Shoham, 2008; Millán et al., 2013). Our main goal is to best measure whether or not a student is competent with a limited subset of questions. This means that we have to determine when enough questions have been asked, as well as which additional questions should be asked, so as to ensure that the Computer Adaptive Test as a whole comprises fewer questions.

Bayesian networks have been found to be especially useful for decision making under uncertainty (Nielsen & Jensen, 2009). We believe that they provide a natural mechanism to model both 1) student knowledge and 2) questions that may potentially be asked to measure student knowledge. In these models, Bayesian networks capture the relationships between various proficiencies, detailing the relationship between knowledge of certain concepts and how the student will perform on certain questions.

Using a Bayesian network thus allows us to predict how likely a student is to have knowledge of a certain concept based on the student's answers. Hence, we worked with the clinic coordinators and experts to create a Bayesian network structure and elicit its parameters for this exam. Our model is similar to the models developed by (Millán & Pérez-De-La-Cruz, 2002; Vomlel, 2004; Almond et al., 2007; Munie & Shoham, 2008; Brusilovsky & Millán, 2007; Millán et al., 2013), where Bayesian networks are used to evaluate and, in a sense, "diagnose" a student's degree of knowledge or competency.

In our network, we have a main variable of interest that represents the student's total level of knowledge. We refer to this variable as the decision variable (D), and it serves as an overall measure of our belief that a student is competent. General competency is determined by competency in a collection of specialized fields of knowledge, which are represented by a variety of latent concept variables. The final type of variable is a question variable, which represents a question that may be asked to the student. An answered question, whether correct or incorrect, acts as evidence that can influence our belief about the student's competency. The graphical structure of the model can be seen in Figure 5.

Similarly to (Gertner et al., 1998; VanLehn & Niu, 2001; Cantarel, Weaver, McNeill, Zhang, Mackey, & Reese, 2014), the clinic coordinators/experts determined that the "pass threshold" should be set at 0.8, meaning that if, given some evidence (questions answered), the posterior probability of the decision variable is found to be over 0.8, then the student is considered competent. (Another commonly used threshold in the educational diagnosis field is 0.7 (Butz et al., 2004; Arroyo & Woolf, 2005).) According to this pass threshold, we found once again that of the 16 competent volunteers, 15 passed, and of the 6 non-competent volunteers, none passed.
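To make the decision rule concrete, the following is a minimal sketch of a toy network in this style (a competency variable D, one latent concept, and two question variables) together with the threshold decision on the posterior of D. The structure, parameter values, and variable names here are illustrative stand-ins; they are not the clinic's elicited network.

```python
# Hypothetical toy network in the spirit of Figure 5: a competency (decision)
# variable D, one latent concept C, and two question variables Q1, Q2, all binary.
# The parameter values below are illustrative, not the elicited clinic parameters.
P_D = {1: 0.7, 0: 0.3}                       # prior belief that a volunteer is competent
P_C_GIVEN_D = {1: {1: 0.9, 0: 0.1},          # Pr(concept mastered | D)
               0: {1: 0.3, 0: 0.7}}
P_Q_GIVEN_C = {1: {1: 0.85, 0: 0.15},        # Pr(question correct | concept)
               0: {1: 0.25, 0: 0.75}}

def joint(d, c, answers):
    """Joint probability of D=d, C=c and the observed answers {question: 1/0}."""
    p = P_D[d] * P_C_GIVEN_D[d][c]
    for correct in answers.values():
        p *= P_Q_GIVEN_C[c][correct]
    return p

def posterior_competent(answers):
    """Pr(D = + | answers), obtained by summing out the latent concept C."""
    num = sum(joint(1, c, answers) for c in (0, 1))
    den = sum(joint(d, c, answers) for d in (0, 1) for c in (0, 1))
    return num / den

PASS_THRESHOLD = 0.8   # the pass threshold chosen by the clinic coordinators/experts

def is_competent(answers):
    return posterior_competent(answers) >= PASS_THRESHOLD

if __name__ == "__main__":
    evidence = {"Q1": 1, "Q2": 0}            # one correct and one incorrect answer
    p = posterior_competent(evidence)
    print(f"Pr(D = + | e) = {p:.3f}, decide competent: {is_competent(evidence)}")
```

In the actual model, each answered question simply enters as evidence on its question variable, and the competency decision is re-evaluated from the posterior on D after every answer.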
The volunteers' scores are shown in Table 1. From this table we can see that our constructed Bayesian model allows us to accurately predict a student's competency.

Student #   Percentage Correct   Posterior Probability
1           0.952                0.957
2           0.921                0.941
3           0.714                0.951
4           0.539                0.704
5           0.762                0.956
6           0.746                0.950
7           0.825                0.957
8           0.619                0.948
9           0.777                0.952
10          0.857                0.954
11          0.857                0.957
12          0.809                0.956
13          0.349                0.153
14          0.238                0.023
15          0.539                0.178
16          0.809                0.923
17          0.603                0.883
18          0.635                0.954
19          0.873                0.954
20          0.524                0.135
21          0.492                0.153
22          0.413                0.314

Table 1: This table shows, for each student, the percentage of questions answered correctly (out of 63 questions), and the posterior probability of the student being competent given all the answered questions. The non-competent volunteers are bolded.

4.2 STOPPING CRITERION FOR INFORMATION GATHERING

Although the test has a total of 63 questions, there may be a point during the adaptive test where we can determine that it is unnecessary to ask any further questions. For educational diagnosis in Bayesian networks, (Millán & Pérez-De-La-Cruz, 2002; Millán et al., 2013) show that there are two standard ways to determine whether more information gathering (additional questions posed) is necessary. The first involves a fixed test that asks a set number of questions, and then determines if the student is competent after all questions have been answered. This static method is clearly ineffective, as the student is only evaluated at the conclusion of the test. The second involves terminating the test once the posterior probability of the decision variable is above or below some threshold. This stopping criterion is also seen in (Hamscher, Console, & de Kleer, 1992; Heckerman, Breese, & Rommelse, 1995; Kruegel, Mutz, Robertson, & Valeur, 2003; Lu & Przytula, 2006). Note that (Millán & Pérez-De-La-Cruz, 2002) compares an approach that utilizes a set number of questions with a threshold-based approach: with a set number of questions they were able to diagnose students correctly 90.27% of the time, whereas using an adaptive criterion they were able to diagnose students correctly 94.54% of the time, while requiring fewer questions to be asked. This is a clear indication that using a less trivial stopping criterion can be ultimately beneficial.

Another possible stopping criterion involves computing the value of information of some observations and, if that value is not high enough, stopping information gathering. However, this involves either computing the myopic value of information, which considers only the usefulness of making one observation, or computing the non-myopic value of information, which considers the usefulness of making several observations. The former is fairly easy to compute, but can prove to be very limited, as the combined usefulness of a set of observations is often greater than that of its individual parts. For example, if a student answers a question incorrectly, it may not be very telling. However, if that student also answers a question incorrectly in an entirely orthogonal field, these two mistakes combined may be indicative of a student's lack of competency. Note that computing the non-myopic value of information is useful in determining if a set of observations has significant value, but it is intractable to compute if the set of observations is large (Krause & Guestrin, 2009).

To decide whether or not enough information has been gathered, we use a non-myopic stopping criterion and compute the Same-Decision Probability (SDP) (Darwiche & Choi, 2010; Chen, Choi, & Darwiche, 2014).

Definition 1 Suppose we are making a decision based on whether Pr(D = d | e) ≥ T for some evidence e and threshold T. If H is a set of variables that are available to observe, then the SDP is:

    SDP(d, H, e, T) = Σ_h [Pr(d | h, e) ≥ T] · Pr(h | e).    (1)

Here, [α] is an indicator function which is 1 if α is true, and 0 otherwise.

In short, the SDP is a measure of how robust a decision is with respect to some unobserved variables, and it will help determine how much more information is necessary.
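Below is a minimal sketch of Equation (1), computed by brute-force enumeration over the unanswered questions. It assumes a hypothetical model in which D directly parents each question (the clinic's actual network also contains latent concept variables), and all parameters are made up; our experiments instead use the exact algorithm of (Chen, Choi, & Darwiche, 2013).

```python
from itertools import product

# Hypothetical model: D directly parents four binary questions Q1..Q4.
# All parameters below are made-up; e is the evidence gathered so far.
P_D = {1: 0.7, 0: 0.3}
P_CORRECT = {1: 0.85, 0: 0.35}              # Pr(question answered correctly | D)
QUESTIONS = ["Q1", "Q2", "Q3", "Q4"]
T = 0.8                                     # decision threshold

def joint(d, answers):
    """Pr(D=d, answers) under the naive-Bayes style toy model."""
    p = P_D[d]
    for correct in answers.values():
        p *= P_CORRECT[d] if correct == 1 else 1.0 - P_CORRECT[d]
    return p

def pr(answers):
    """Marginal probability of a (partial) set of answers."""
    return sum(joint(d, answers) for d in (0, 1))

def posterior(answers):
    """Pr(D = + | answers)."""
    return joint(1, answers) / pr(answers)

def sdp(hidden, e):
    """Equation (1): probability that the current decision survives observing `hidden`."""
    decision = posterior(e) >= T
    total = 0.0
    for values in product((0, 1), repeat=len(hidden)):
        h = dict(zip(hidden, values))
        if (posterior({**e, **h}) >= T) == decision:
            total += pr({**e, **h}) / pr(e)   # Pr(h | e)
    return total

if __name__ == "__main__":
    e = {"Q1": 1, "Q2": 1}
    unanswered = [q for q in QUESTIONS if q not in e]
    print("Pr(D = + | e) =", round(posterior(e), 3))
    print("SDP of the current decision =", round(sdp(unanswered, e), 3))
```

The enumeration over h is exponential in the number of unanswered questions, which is why exact SDP algorithms with pruning and caching are needed in practice; the sketch is only meant to mirror the definition.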
In this case, based on the student's responses, at any point in time we can compute the posterior probability that they are competent and thus determine a decision of whether or not they are competent. Keep in mind that this decision is temporary and can be reversed if more questions are answered: we can compute the SDP over the remaining unanswered questions to determine how likely it is that the current decision (deciding whether a student is competent or non-competent) would remain the same even if the remaining questions were answered. If the SDP is high, that is an indication that we can terminate the test early. We found that using the SDP allowed us to cut the test duration significantly while maintaining diagnosis accuracy. Details of our experiment setup and results can be found in Section 5.1.

4.3 SELECTION CRITERION FOR INFORMATION GATHERING

We have now motivated the usefulness of the SDP as a stopping criterion for Computer Adaptive Tests, as it tells us when we can terminate the test. However, the question remains: if computing the SDP indicates that more questions are necessary, which questions should we ask?

In a standard, non-adaptive test, the ordering of the questions cannot be controlled. In the adaptive setting we have more control: based on a test taker's answers on the test so far, we can select questions that have the most potential to give us further insight on the test taker. Our goal here is to select questions such that the expected SDP after observing those questions is maximal. The expected SDP (Chen, Choi, & Darwiche, 2015) is defined as follows:

Definition 2 Let G be a subset of the available features H and let D(ge) be the decision made after observing features G, i.e., D(ge) = d if Pr(d | g, e) ≥ T. The E-SDP is then:

    E-SDP(D, G, H, e, T) = Σ_g SDP(D(ge), H \ G, ge, T) · Pr(g | e).    (2)

The expected SDP is thus a measure of how similar our decision after observing G is to the decision made after observing H. We want to find a question that will maximize the expected SDP. In other words, we want to ask questions that, no matter how they are answered, will on average minimize the usefulness of any remaining questions. By maximizing this objective, we reduce the overall number of questions that need to be answered before test termination.
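The following sketch illustrates Equation (2) in the special case relevant to our adaptive test, where G is a single candidate question, along with the greedy rule of ranking questions by their expected SDP. The toy model, its per-question reliabilities, and all names are hypothetical, and the brute-force enumeration is for illustration only.

```python
from itertools import product

# Hypothetical toy model D -> Q1..Q4 with per-question reliabilities; parameters
# are illustrative only. G is taken to be a single candidate question, matching
# the one-question-at-a-time setting of our adaptive test.
P_D = {1: 0.7, 0: 0.3}
P_CORRECT = {"Q1": {1: 0.90, 0: 0.20},      # Pr(correct | D) for each question
             "Q2": {1: 0.80, 0: 0.40},
             "Q3": {1: 0.70, 0: 0.45},
             "Q4": {1: 0.95, 0: 0.10}}
T = 0.8

def joint(d, answers):
    p = P_D[d]
    for q, correct in answers.items():
        p *= P_CORRECT[q][d] if correct == 1 else 1.0 - P_CORRECT[q][d]
    return p

def pr(answers):
    return sum(joint(d, answers) for d in (0, 1))

def posterior(answers):
    return joint(1, answers) / pr(answers)

def sdp(hidden, e):
    """Equation (1) for the decision currently implied by e."""
    decision = posterior(e) >= T
    return sum(pr({**e, **dict(zip(hidden, v))}) / pr(e)
               for v in product((0, 1), repeat=len(hidden))
               if (posterior({**e, **dict(zip(hidden, v))}) >= T) == decision)

def expected_sdp(question, e, hidden):
    """Equation (2) with G = {question}: weight the SDP after each possible answer."""
    rest = [q for q in hidden if q != question]
    return sum(sdp(rest, {**e, question: a}) * pr({**e, question: a}) / pr(e)
               for a in (0, 1))

if __name__ == "__main__":
    e = {"Q1": 0}                            # the student has missed Q1 so far
    hidden = [q for q in P_CORRECT if q not in e]
    ranked = sorted(hidden, key=lambda q: expected_sdp(q, e, hidden), reverse=True)
    print("questions ranked by expected SDP:", ranked)
```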
As stated before, computing the value of information (VOI) is essential to the process of information gathering. Since it is intractable to compute the non-myopic value of information for our model, that leaves only the myopic VOI as an available possibility. The VOI of observing a variable may depend on various objective functions, for instance, how much the observation reduces the entropy of the decision variable. There is a comprehensive overview of these different objective functions in (Krause & Guestrin, 2009). (Additionally, (Golovin & Krause, 2011) studies the use of adaptive submodularity for selection criteria and shows that objective functions satisfying this notion can be readily approximated; for our problem, the objective function does not satisfy adaptive submodularity.) (Chen et al., 2014, 2015) compares the approach of maximizing these standard objective functions to maximizing the expected SDP, and finds that maximizing the expected SDP is more effective in reducing the number of questions that need to be answered, particularly in cases where the decision threshold is extreme.

Our approach involves selecting the question that leads to the highest expected SDP. We found that selecting variables to optimize the expected SDP allowed us to substantially decrease the number of questions selected compared to other selection criteria. Details of the experiment setup and experimental results can be found in Section 5.2.

5 EXPERIMENTS

In this section, we empirically evaluate the SDP as a stopping and selection criterion for our adaptive test. We compare the SDP against the standard criteria used by other Bayesian Computer Adaptive Tests. First, we introduce some notation used throughout the experimental section.

We use standard notation for variables and their instantiations, where variables are denoted by upper case letters (e.g., X) and their instantiations by lower case letters (e.g., x). Sets of variables are denoted by bold upper case letters (e.g., X) and their instantiations by bold lower case letters (e.g., x). The primary decision variable, measuring a student's overall competency, is denoted by D, with two states + (competent) and − (non-competent).

We have the completed tests for 22 subjects, where for each subject we have the responses to the n = 63 test questions (more precisely, we know whether each question was answered correctly or incorrectly). We denote this dataset by T, where

    T = {e_1, ..., e_22},

and where each e_i ∈ T is an instantiation of the n test responses for student i. Hence, the probability Pr(D = + | e_i) denotes the posterior probability that student i is competent given their test results. Using our Bayesian network, we decide that student i is competent if Pr(D = + | e_i) ≥ 0.80 (otherwise, we decide that they are not sufficiently competent). We report the quantity Pr(D = + | e_i) for each student in Table 1.

To evaluate the SDP as a stopping and selection criterion, we simulate partially-completed tests from the fully-completed tests T. In particular, we take the fully-completed test results e_i for each student i, and generate a set of partially-completed tests Q_i = {q_i,1, ..., q_i,n}: we randomly permute the questions and take the first j questions of the permutation as a partially-completed test q_i,j. Hence, a partially-completed test q_i,j+1 adds one additional test question to test q_i,j, and test q_i,n corresponds to the fully-completed test e_i.

In our experiments, we use those partially-completed tests q_i,j that have at least 10 questions, i.e., where j ≥ 10 (we assume 10 questions to be the minimum number of questions at which we can begin to evaluate the competency of a student). Moreover, for each of the 22 students i, we simulated 50 sets of partially-completed tests Q_i based on 50 random permutations, giving us a total of 22 · 50 = 1,100 sets of partially-completed tests.
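A minimal sketch of this simulation procedure is shown below; the helper and variable names are ours, and the stand-in student answers are random rather than taken from the actual test data.

```python
import random

# Sketch of the simulation of partially-completed tests: for one student's fully-
# completed test e_i (a dict from the 63 question names to 1/0 answers), each set
# Q_i consists of the prefixes of one random permutation of the questions.
def partial_tests(full_test, rng):
    """Return [q_i_1, ..., q_i_n], the prefixes of one random permutation of full_test."""
    order = list(full_test)
    rng.shuffle(order)
    return [{q: full_test[q] for q in order[:j]} for j in range(1, len(order) + 1)]

if __name__ == "__main__":
    rng = random.Random(0)
    e_i = {f"Q{k}": rng.randint(0, 1) for k in range(1, 64)}          # stand-in for a real student
    sets_for_student = [partial_tests(e_i, rng) for _ in range(50)]   # 50 random permutations
    usable = [q for tests in sets_for_student for q in tests if len(q) >= 10]   # keep j >= 10
    print(len(sets_for_student), "sets;", len(usable), "partially-completed tests with j >= 10")
```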
5.1 STOPPING CRITERION EXPERIMENTS

Using our partially-completed tests, we evaluate the SDP as a stopping criterion against more traditional methods. In particular, we take each set of partially-completed tests Q_i and, going from test q_i,10 up to test q_i,n−1, we check whether each stopping criterion is satisfied, i.e., whether a decision is made to stop asking questions. Note that given test q_i,n, the only decision is to stop, since there are no more test questions to ask. For the SDP, when we evaluate the test q_i,j, we compute the SDP with respect to the remaining n − j unanswered questions (i.e., we treat them as the set of available observables H).

As for other, more traditional, stopping criteria, we consider (1) stopping after a fixed number of questions have been answered (after which a student's competence is determined), or (2) stopping early once the posterior probability Pr(D = + | q_i,j) surpasses a given threshold T, after which we deem a student to be competent.

In Table 2, we report the results where our stopping criterion is based on asking a fixed number of questions. In Table 3, we highlight the results where we instead use a posterior probability threshold (Millán & Pérez-De-La-Cruz, 2002; Xenos, 2004; Arroyo & Woolf, 2005; Munie & Shoham, 2008). These results are based on averages over our 1,100 sets of partially-completed tests Q_i, which we simulated from our original dataset. In both tables, we see that there is a clear trade-off between the accuracy of the test and the number of questions that we ask. (Note again that students were already evaluated to be competent/non-competent prior to the test, and that precision/recall is based on this prior evaluation.)

# Questions   Precision   Recall
10            0.566       0.403
15            0.752       0.659
20            0.805       0.755
25            0.815       0.812
30            0.835       0.835
35            0.874       0.875
40            0.901       0.909
45            0.915       0.914
50            0.918       0.926
55            0.924       0.938
60            0.950       0.938
62            0.954       0.938

Table 2: Precision and recall for a set number of questions.

T       Precision   Recall   # Questions Asked
0.750   0.803       0.758    10.13
0.775   0.813       0.821    13.04
0.800   0.836       0.832    16.02
0.825   0.867       0.872    19.26
0.850   0.903       0.915    23.13
0.875   0.919       0.912    27.14
0.900   0.908       0.911    31.57
0.925   0.968       0.884    36.82
0.950   0.974       0.818    44.25

Table 3: Precision, recall, and average number of questions asked, for varying thresholds T on the posterior probability.
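For clarity, the precision and recall reported in Tables 2-4 treat "competent" as the positive class and are measured against the volunteers' prior competent/non-competent evaluation. The following sketch shows the computation on stand-in data; the decision and label values are illustrative only.

```python
# Stand-in computation of the precision/recall columns: one decision per simulated
# partially-completed test, paired with the student's prior evaluation (True = competent).
def precision_recall(decisions, labels):
    tp = sum(1 for d, y in zip(decisions, labels) if d and y)
    fp = sum(1 for d, y in zip(decisions, labels) if d and not y)
    fn = sum(1 for d, y in zip(decisions, labels) if not d and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

if __name__ == "__main__":
    decisions = [True, True, False, True, False]   # illustrative decisions
    labels    = [True, True, True, False, False]   # illustrative prior evaluations
    print(precision_recall(decisions, labels))     # precision = recall = 2/3 here
```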
We next consider the SDP as a stopping criterion. In particular, we compute the SDP with respect to all unanswered questions, and then make a stopping decision when the computed SDP surpasses a given threshold T. Note that when we use a threshold T = 1.0, we commit to a stopping decision only when no further observations can change the decision (i.e., the probability of making the same decision is 1.0). In Table 4, we report the results of using the SDP as a stopping criterion.

We see that even when our threshold is set to a relatively small value (0.850), we still attain precision and recall rates that are comparable to those obtained by asking nearly all questions (as in Table 2). For the stopping criterion based on posterior probability thresholds (in Table 3), we can see that we can attain a higher precision, but at the expense of a lower recall. In fact, we see that once the threshold is set high enough, the recall actually drops: some students who should be diagnosed as competent are being diagnosed incorrectly as non-competent once the threshold is set too high.

Consider now the case where we set a very large threshold on the SDP (0.999). In this case, the precision and recall are equivalent to the case where our criterion is to ask a set number of n − 1 = 62 questions (as in Table 2). In contrast, using the SDP criterion, we ask only an average of 42.05 questions, meaning that the SDP as a stopping criterion has the same robustness as the "set number of questions" stopping criterion, while asking nearly 20 fewer questions.

SDP T   Precision   Recall   # Questions Asked
0.850   0.927       0.932    39.10
0.875   0.938       0.938    39.73
0.900   0.942       0.938    39.61
0.925   0.946       0.938    39.77
0.950   0.946       0.938    40.09
0.975   0.950       0.938    40.61
0.990   0.950       0.938    41.14
0.995   0.950       0.938    41.41
0.999   0.954       0.938    42.05

Table 4: Precision, recall, and average number of questions asked for different SDP thresholds.

It is clear from our results that if our goal is to ask fewer questions while maintaining competitive precision and recall rates, using the SDP as a stopping criterion is a compelling alternative to the traditional stopping criteria of 1) asking a set number of questions and 2) checking whether the posterior probability of competency surpasses a threshold. Using the SDP as a stopping criterion allows us to reduce the number of questions asked while still maintaining the same precision and recall.
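Putting the pieces of this section together, the following sketch shows the control flow of an adaptive test that uses the SDP as its stopping rule: questions are asked until the SDP of the current decision exceeds a threshold, or the questions run out. The compute_sdp, pick_question, and get_answer callbacks are placeholders standing in for the real inference engine, selection criterion, and test taker.

```python
import random

# Minimal sketch of the adaptive test loop with the SDP as the stopping rule.
def run_adaptive_test(questions, compute_sdp, pick_question, get_answer,
                      sdp_threshold=0.95, min_questions=10):
    evidence = {}
    unanswered = list(questions)
    while unanswered:
        if len(evidence) >= min_questions and compute_sdp(evidence, unanswered) >= sdp_threshold:
            break                                  # current decision is unlikely to change
        q = pick_question(evidence, unanswered)
        evidence[q] = get_answer(q)
        unanswered.remove(q)
    return evidence

if __name__ == "__main__":
    rng = random.Random(1)
    qs = [f"Q{k}" for k in range(1, 64)]
    asked = run_adaptive_test(
        qs,
        compute_sdp=lambda e, u: min(1.0, 0.5 + 0.02 * len(e)),   # placeholder SDP
        pick_question=lambda e, u: u[0],                          # placeholder selection
        get_answer=lambda q: rng.randint(0, 1))                   # placeholder test taker
    print("questions asked:", len(asked))
```

The min_questions parameter mirrors our assumption that at least 10 questions are answered before competency is evaluated.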
5.2 SELECTION CRITERION EXPERIMENTS

We next consider experiments that compare various question selection criteria: 1) random selection, where we select the next question randomly (as in non-adaptive or linear testing), 2) information gain (mutual information) (Millán & Pérez-De-La-Cruz, 2002; Vomlel, 2004), and 3) margins of confidence (Krause & Guestrin, 2009). Note that as this model does not fit the decision-theoretic setting (there are no assigned utilities), we do not consider utility maximization as a selection criterion. In addition to the above, we evaluate the SDP as a selection criterion.

Since our adaptive test involves selecting one question at a time, our goal is to select a variable H ∈ H (corresponding to a question not yet presented) that leads to the greatest gain in the SDP, i.e., the SDP gain. Our selection criterion is based on asking the question which yields the highest SDP gain, using information gain as a tie-breaker when multiple questions have the same SDP gain. A tie-breaker is needed because the SDP gain of observing a single variable can be zero: in this case, observing a single variable is not enough to change our decision (multiple observations may be needed). In our experiments, we compute the SDP gain using a branch-and-bound algorithm, as in (Chen, Choi, & Darwiche, 2013). The advantages of this algorithm are that (1) it can prune the corresponding search space, and (2) it can cache and reuse intermediate calculations, resulting in significant computational savings.

We compare how quickly these selection criteria allow us to stop information gathering, each using the SDP as a stopping criterion; see Table 5. Here, we took partially-completed tests (all tests where at least 30 questions were answered) and used each selection criterion to select additional questions to ask, until the SDP dictates that we stop asking questions. On average, we find that (1) the random selection criterion asks an additional 8.18 questions, (2) information gain asks an additional 3.67 questions, (3) margins of confidence asks an additional 3.57 questions, and (4) our approach based on the SDP gain requires the fewest additional questions: 2.77.

Note that while the random question selection criterion clearly has the poorest performance, there is a relatively modest improvement from our approach, using the SDP gain, compared to the more common approaches based on information gain (Millán & Pérez-De-La-Cruz, 2002; Vomlel, 2004) and margins of confidence (Krause & Guestrin, 2009). Nevertheless, the benefits of our approach as a selection criterion are still evident. Further, the results suggest the potential of using the SDP gain in a less greedy way, where we ask multiple questions at a time, which we consider a promising direction for future work.

SDP T   Random   IG     Margins   SDP
0.850   4.77     2.98   2.71      2.05
0.875   4.78     3.08   2.87      2.14
0.900   5.08     3.20   3.12      2.20
0.925   5.95     3.19   3.04      2.31
0.950   6.81     3.37   3.44      2.47
0.975   7.86     3.73   3.56      2.96
0.990   10.64    4.19   4.29      3.18
0.995   12.95    4.32   4.22      3.43
0.999   14.80    4.97   4.91      4.25

Table 5: Number of additional observations necessary before stopping, for the random selection criterion, the information gain criterion, the margins of confidence criterion, and the SDP hybrid criterion.
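The following sketch illustrates this hybrid selection rule (SDP gain with information gain as a tie-breaker) on a hypothetical toy model by brute-force enumeration; in our experiments, the SDP gain is instead computed with the branch-and-bound algorithm of (Chen, Choi, & Darwiche, 2013). All names and parameters below are illustrative assumptions.

```python
import math
from itertools import product

# Hypothetical toy model D -> Q1..Q4; parameters are illustrative only.
P_D = {1: 0.7, 0: 0.3}
P_CORRECT = {"Q1": {1: 0.90, 0: 0.20}, "Q2": {1: 0.80, 0: 0.40},
             "Q3": {1: 0.70, 0: 0.45}, "Q4": {1: 0.95, 0: 0.10}}
T = 0.8

def joint(d, ans):
    p = P_D[d]
    for q, v in ans.items():
        p *= P_CORRECT[q][d] if v == 1 else 1.0 - P_CORRECT[q][d]
    return p

def pr(ans):
    return sum(joint(d, ans) for d in (0, 1))

def post(ans):                               # Pr(D = + | ans)
    return joint(1, ans) / pr(ans)

def sdp(hidden, e):
    """SDP of the decision currently implied by e, w.r.t. the unanswered `hidden`."""
    decision = post(e) >= T
    total = 0.0
    for vals in product((0, 1), repeat=len(hidden)):
        h = dict(zip(hidden, vals))
        if (post({**e, **h}) >= T) == decision:
            total += pr({**e, **h}) / pr(e)
    return total

def sdp_gain(q, e, hidden):
    """Expected SDP after asking q, minus the SDP before asking it."""
    rest = [x for x in hidden if x != q]
    expected = sum(sdp(rest, {**e, q: a}) * pr({**e, q: a}) / pr(e) for a in (0, 1))
    return expected - sdp(hidden, e)

def info_gain(q, e):
    """Mutual information between D and question q, given the evidence e."""
    def entropy(p):
        return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    cond = sum((pr({**e, q: a}) / pr(e)) * entropy(post({**e, q: a})) for a in (0, 1))
    return entropy(post(e)) - cond

def select(e, hidden):
    # Highest SDP gain wins; information gain breaks ties (e.g., when all gains are zero).
    return max(hidden, key=lambda q: (sdp_gain(q, e, hidden), info_gain(q, e)))

if __name__ == "__main__":
    e = {"Q1": 0}
    hidden = [q for q in P_CORRECT if q not in e]
    print("next question:", select(e, hidden))
```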
5.3 RUNNING TIMES

The SDP has been shown to be highly intractable, being PP^PP-complete (Choi, Xue, & Darwiche, 2012). Therefore, the computational requirements of the SDP, as a stopping and selection criterion, may be higher than those of other, more common approaches. This complexity depends largely on the number of unasked questions that must be considered. Figure 6 shows a plot of the running times for the SDP stopping and selection criteria, in comparison to other measures. Note that the SDP selection criterion is on average more efficient than the SDP stopping criterion, even though the selection criterion includes an SDP gain computation as well as an information gain computation (as a tie-breaker). Here, there are cases that can be detected that allow us to skip the SDP gain computation (Chen et al., 2013), leaving just the relatively efficient information gain computation. In general, we note that there is a trade-off: the SDP, as a stopping and selection criterion, provides valuable information (leading to fewer questions asked), but it is also more computationally demanding.

Figure 6: Runtimes of different algorithms (average running time in seconds versus the number of unasked questions, log scale). SDP-STOP and SDP-SEL are respectively the stopping and selection criterion algorithms, THRESH represents the standard posterior stopping criterion, and IG represents the standard information gain selection criterion.

6 CONCLUSION

We created a Computer Adaptive Test using a Bayesian network as the underlying model and showed how the notion of the SDP can be used as an information gathering criterion in this context. We showed that it can act as a stopping criterion for determining if further questions are needed, and as a selection criterion for determining which questions should be asked. Finally, we have shown empirically that the SDP is a valuable information gathering tool, as its usage allows us to ask fewer questions while still maintaining the same level of precision and recall for diagnosis.

Acknowledgments

This work has been partially supported by ONR grant #N00014-12-1-0423, NSF grants #IIS-1514253 and #IIS-1118122, and a Google Research Award. We also thank the coordinators and volunteers at the Berkeley Free Clinic.

References

Almond, R. G., DiBello, L. V., Moulder, B., & Zapata-Rivera, J.-D. (2007). Modeling diagnostic assessments with Bayesian networks. Journal of Educational Measurement, 44(4), 341–359.

Almond, R. G., Mislevy, R. J., Steinberg, L. S., Yan, D., & Williamson, D. M. (2015). Bayesian Networks in Educational Assessment. Springer.

Arroyo, I., & Woolf, B. (2005). Inferring learning and attitudes from a Bayesian network of log file data. In Proceedings of the 12th International Conference on Artificial Intelligence in Education, pp. 33–40.

Beal, C. R., Arroyo, I., Cohen, P. R., Woolf, B. P., & Beal, C. R. (2010). Evaluation of AnimalWatch: An intelligent tutoring system for arithmetic and fractions. Journal of Interactive Online Learning, 9(1), 64–77.

Brusilovsky, P., & Millán, E. (2007). User models for adaptive hypermedia and adaptive educational systems. In The Adaptive Web, pp. 3–53. Springer-Verlag.

Butz, C. J., Hua, S., & Maguire, R. B. (2004). A web-based intelligent tutoring system for computer programming. In Web Intelligence, pp. 159–165.

Cantarel, B. L., Weaver, D., McNeill, N., Zhang, J., Mackey, A. J., & Reese, J. (2014). BAYSIC: A Bayesian method for combining sets of genome variants with improved specificity and sensitivity. BMC Bioinformatics, 15(1), 104.

Chen, S., Choi, A., & Darwiche, A. (2013). An exact algorithm for computing the Same-Decision Probability. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pp. 2525–2531.

Chen, S., Choi, A., & Darwiche, A. (2014). Algorithms and applications for the Same-Decision Probability. Journal of Artificial Intelligence Research (JAIR), 49, 601–633.

Chen, S., Choi, A., & Darwiche, A. (2015). Value of information based on decision robustness. In Proceedings of the 29th Conference on Artificial Intelligence (AAAI).

Choi, A., Xue, Y., & Darwiche, A. (2012). Same-Decision Probability: A confidence measure for threshold-based decisions. International Journal of Approximate Reasoning (IJAR), 53(9), 1415–1428.

Chrysafiadi, K., & Virvou, M. (2013). Persiva: An empirical evaluation method of a student model of an intelligent e-learning environment for computer programming. Computers & Education, 68, 322–333.

Chrysafiadi, K., & Virvou, M. (2014). Kem cs: A set of student's characteristics for modeling in adaptive programming tutoring systems. In Information, Intelligence, Systems and Applications (IISA 2014), The 5th International Conference on, pp. 106–110. IEEE.

Conati, C., Gertner, A., & VanLehn, K. (2002). Using Bayesian networks to manage uncertainty in student modeling. User Modeling and User-Adapted Interaction, 12(4), 371–417.

Darwiche, A., & Choi, A. (2010). Same-Decision Probability: A confidence measure for threshold-based decisions under noisy sensors. In Proceedings of the 5th European Workshop on Probabilistic Graphical Models (PGM), pp. 113–120.

García, P., Amandi, A., Schiaffino, S., & Campo, M. (2007). Evaluating Bayesian networks' precision for detecting students' learning styles. Computers & Education, 49(3), 794–808.

Gertner, A. S., Conati, C., & VanLehn, K. (1998). Procedural help in Andes: Generating hints using a Bayesian network student model. In Proceedings of the National Conference on Artificial Intelligence, pp. 106–111.

Golovin, D., & Krause, A. (2011). Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research (JAIR), 42, 427–486.

Gujarathi, M. V., & Sonawane, M. S. (2012). Intelligent tutoring system: A case study of mobile mentoring for diabetes. IJAIS, 3(8), 41–43.
Hamscher, W., Console, L., & de Kleer, J. (Eds.). (1992). Readings in Model-Based Diagnosis. Morgan Kaufmann Publishers Inc.

Heckerman, D., Breese, J. S., & Rommelse, K. (1995). Decision-theoretic troubleshooting. Communications of the ACM, 38(3), 49–57.

Krause, A., & Guestrin, C. (2009). Optimal value of information in graphical models. Journal of Artificial Intelligence Research (JAIR), 35, 557–591.

Kruegel, C., Mutz, D., Robertson, W., & Valeur, F. (2003). Bayesian event classification for intrusion detection. In Proceedings of the Annual Computer Security Applications Conference (ACSAC).

Lu, T.-C., & Przytula, K. W. (2006). Focusing strategies for multiple fault diagnosis. In Proceedings of the 19th International FLAIRS Conference, pp. 842–847.

Millán, E., Descalco, L., Castillo, G., Oliveira, P., & Diogo, S. (2013). Using Bayesian networks to improve knowledge assessment. Computers & Education, 60(1).

Millán, E., & Pérez-De-La-Cruz, J. L. (2002). A Bayesian diagnostic algorithm for student modeling and its evaluation. User Modeling and User-Adapted Interaction, 12(2-3), 281–330.

Munie, M., & Shoham, Y. (2008). Optimal testing of structured knowledge. In Proceedings of the 23rd National Conference on Artificial Intelligence, pp. 1069–1074.

Nielsen, T. D., & Jensen, F. V. (2009). Bayesian Networks and Decision Graphs. Springer Science & Business Media.

Rajendran, R., Iyer, S., Murthy, S., Wilson, C., & Sheard, J. (2013). A theory-driven approach to predict frustration in an ITS. IEEE Transactions on Learning Technologies, 6(4), 378–388.

Sinharay, S. (2006). Model diagnostics for Bayesian networks. Journal of Educational and Behavioral Statistics, 31(1), 1–33.

Suebnukarn, S., & Haddawy, P. (2006). A Bayesian approach to generating tutorial hints in a collaborative medical problem-based learning system. Artificial Intelligence in Medicine, 38(1), 5–24.
Vandewaetere, M., & Clarebout, G. (2014). Advanced technologies for personalized learning, instruction, and performance. In Handbook of Research on Educational Communications and Technology, pp. 425–437. Springer.

VanLehn, K., & Niu, Z. (2001). Bayesian student modeling, user interfaces and feedback: A sensitivity analysis. International Journal of Artificial Intelligence in Education, 12(2), 154–184.

Vomlel, J. (2004). Bayesian networks in educational testing. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 12(supp01), 83–100.

Xenos, M. (2004). Prediction and assessment of student behaviour in open and distance education in computers using Bayesian networks. Computers & Education, 43(4), 345–359.

Figure 5: Bayesian network modeling clinic volunteer knowledge. The 63 questions are labeled from Q1 to Q63. The primary decision variable Adequacy is located in the top center.