Bayesian Network Models for Adaptive Testing

Martin Plajner and Jiří Vomlel
Institute of Information Theory and Automation
Academy of Sciences of the Czech Republic
Pod vodárenskou věží 4, Prague 8, CZ-182 08, Czech Republic

Abstract

Computerized adaptive testing (CAT) is an interesting and promising approach to testing human abilities. In our research we use Bayesian networks to create a model of tested students. We collected data from paper tests performed with grammar school students. In this article we first summarize the data used in our experiments. We then propose several different Bayesian networks, which we tested and compared by cross-validation. The results are discussed in the paper; the analysis has brought a clearer view of the model selection problem. Future research is outlined in the concluding part of the paper.

1 INTRODUCTION

The testing of human knowledge is a very large field of human effort. We encounter different ability and skill checks almost daily. The computerized form of testing is also getting increased attention with the growing spread of computers, smart phones, and other devices which allow easy access to the target groups. In this paper we focus on Computerized Adaptive Testing (CAT) (van der Linden and Glas, 2000; Almond and Mislevy, 1999).

CAT aims at creating shorter tests that take less time without sacrificing reliability. This type of test is computer administered. The test is accompanied by a model of the student (a student model), which is constructed from samples of previous students. During the testing the model is updated to reflect the abilities of the particular student being tested. At the same time we use the model to adaptively select the most appropriate question to ask next. This leads to the collection of significant information in a shorter time and allows us to ask fewer questions. We provide an additional description of the testing process in Section 4; more information can also be found in (Millán et al., 2000). There is a large potential for applications of CAT in the domain of educational testing (Vomlel, 2004a; Weiss and Kingsbury, 1984).

In this paper we look into the problem of using Bayesian network models (Kjærulff and Madsen, 2008) for adaptive testing (Millán et al., 2010). A Bayesian network is a conditional independence structure, and its usage for CAT can be understood as an expansion of Item Response Theory (IRT) (Almond and Mislevy, 1999). IRT has been used successfully in testing for many years, and experiments using Bayesian networks in CAT have also been made (Mislevy, 1994; Vomlel, 2004b).

We discuss the construction of Bayesian network models for data collected in paper tests organized at grammar schools. We propose and experimentally compare different Bayesian network models. To evaluate the models we simulate tests using parts of the collected data. Results of all proposed models are discussed, and further research is outlined in the last section of this paper.

2 DATA COLLECTION

We designed a paper test of the mathematical knowledge of grammar school students, focused on simple functions (mostly polynomial, trigonometric, and exponential/logarithmic). Students were asked to solve different mathematical problems¹, including graph drawing and reading, calculation of points on the graph, root finding, and the description of function shape and other function properties.

¹ In this case we use the term mathematical "problem" due to its nature. In general tests, the terms "question" or "item" are often used. In this article all of these terms are interchangeable.

The test design went through two rounds. First, we prepared an initial version of the test, which was carried out by a small group of students. We evaluated this first version and, based on the evaluation, made changes before the main test cycle. Problems were updated and changed to be better understood by students. A few problems were removed from the test completely, mainly because their information benefit was too low due to excessively high or low difficulty. Moreover, we divided problems into subproblems such that either

(a) it is possible to separate the subproblem from the main problem and solve it independently, or

(b) it is not possible to separate the subproblem, but it represents a subroutine of the main problem solution.

Note that each subproblem of the first type can be viewed as a completely separate problem. On the other hand, subproblems of the second type are inseparable pieces of a problem. Next we present an example of a problem that appeared in the test.

Example 2.1. Decide which of the following functions

f(x) = x² − 2x − 8        g(x) = −x² + 2x + 8

is decreasing in the interval (−∞, −1].

Table 1 shows the average scores at the four grammar schools (the higher the score, the better the results). We also computed correlations between the score and the average grades in Mathematics, Physics, and Chemistry from the previous three school terms. The grades are from the set {1, 2, 3, 4, 5}, with 1 being the best grade and 5 the worst. These correlations are shown in Table 2. The negative values mean that a better grade is correlated with a better result, which confirms our expectation.

Table 1: Average test scores of the four grammar schools.

GS1      GS2      GS3      GS4      Total
42.76    46.68    46.35    43.65    44.53

Table 2: Correlation of the grades and the total test score.

Mathematics    Physics    Chemistry
-0.60          -0.42      -0.41
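The values in Table 2 are ordinary (Pearson) correlation coefficients between a student's total score and his/her average grade. A minimal sketch of the computation is below; the `records` list is hypothetical illustration data, not the collected dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical records: (total test score, average Mathematics grade 1-5).
records = [(95, 1.3), (80, 1.7), (61, 2.3), (44, 3.0), (30, 3.7), (22, 4.3)]
scores = [r[0] for r in records]
grades = [r[1] for r in records]

# Negative value: a better (numerically lower) grade goes with a higher score.
r = pearson(scores, grades)
```

For data where scores decrease as grades worsen, as in the sketch, `r` is strongly negative, matching the sign convention of Table 2.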
The final version of the test contains 29 mathematical problems. Each of them is graded with 0–4 points. These problems have been further divided into 53 subproblems. Subproblems are graded so that the sum of their grades is the grade of the parent problem, i.e., it falls into the set {0, . . . , 4}. Usually a question is divided into two parts, each graded by at most two points². The granularity of the subproblems is not the same for all of them and is a subset of the set {0, . . . , 4}. Altogether, the maximal possible score in the test is 120 points. In an alternative evaluation approach, each subproblem is evaluated using Boolean values (correct/wrong). An answer is evaluated as correct only if both the solution of the subproblem and the solution method are correct, unless there is an obvious numerical mistake.

² There is one exception to this rule: the first problem is very simple and it is divided into 8 parts, each graded by zero or one point (summing to the total maximum of 8).

We organized tests at four grammar schools. In total 281 students participated in the testing. In addition to problem solutions, we also collected basic personal data from the students, including age, gender, name, and their grades in mathematics, physics, and chemistry from the previous three school terms. The primary goal of the tests was not student evaluation; the goal was to provide the students with valuable information about their weak and strong points. They could view their results (the scores obtained in each individual problem) as well as a comparison with the rest of the test group. The comparisons were provided in the form of quantiles within their class, their school, and all participants.

3 BAYESIAN NETWORK MODELS

In this section we discuss the different Bayesian network models we used to model relations between students' math skills and students' results when solving mathematical problems. All models discussed in this paper consist of the following:

• A set of n variables we want to estimate, {S1, . . . , Sn}. We will call them skills or skill variables. We will use the symbol S to denote the multivariable (S1, . . . , Sn) taking states s = (s1, . . . , sn).

• A set of m questions (math problems), {X1, . . . , Xm}. We will use the symbol X to denote the multivariable (X1, . . . , Xm) taking states x = (x1, . . . , xm).

• A set of arcs between variables that define relations between skills and questions and, possibly, also between skills and between questions.

The ultimate goal is to estimate the values of the skills, i.e., the probabilities of the states of the variables S1, . . . , Sn.

3.1 QUESTIONS

The solutions of the math problems were evaluated either on a numeric scale or on a Boolean scale, as explained in the previous section. Although the numeric scale carries more information and thus seems to be the better alternative, there are other aspects discouraging such a choice. The main problem is model learning: the more states, the higher the number of model parameters to be learned. With limited training data it may be difficult to estimate the model parameters reliably.

We consider two alternatives in our models. Variables corresponding to problems' solutions (questions) can be either

• Boolean, i.e., they have only the two states 0 and 1, or

• integer, i.e., each Xi takes mi states {1, . . . , mi}, mi ∈ N, where mi is the maximal number of points for the corresponding math problem.

In Section 5 we present results of experiments with both options.

3.2 SKILL NODES

We assume the student responses can be explained by skill nodes that are parents of the questions. Skill nodes model the student's abilities and, generally, they are not directly observable. Several decisions have to be made during model creation.

The first decision is the number of skill nodes itself. Should we expect one common skill, or rather several different skills, each related to a subset of the questions only? In the latter case it is necessary to specify which skills are required to solve each particular question (i.e., math problem). The skills required for the successful solution of a question become parents of that question.

Most networks proposed in this paper have only one skill node. This node is connected to all questions. The student is thus modelled by a single variable. Ordinarily, it is not possible to give a precise interpretation to this variable.

We created two models with more than one skill node: one with the Boolean scale of question nodes and the other with the numeric scale. We used our expert knowledge of the field of secondary school mathematics and the experience gained during the evaluation of the paper tests. In these models we included 7 skill nodes, with arcs connecting each of them to 1–4 problems.

Another issue is the state space of the skill nodes. For an unobserved variable, it is hard to decide how many states it should have. An alternative would be to use a continuous skill variable instead of a discrete one, but we did not elaborate on this option. In our models we have used skill nodes with either 2 or 3 states (si ∈ {1, 2} or si ∈ {1, 2, 3}).

We also tried the possibility of replacing the unobserved skill variable by a variable representing the total score of the test. To do this we had to use a coarse discretization. We divided the scores into three equally sized groups and thus obtained an observed variable with three possible states. The states represent groups of students with "bad", "average", and "good" achieved scores. The state of this variable is known if all questions were included in the test. Thus, during the learning phase the variable is observed and this information is used for learning. On the other hand, during the testing the resulting score is not known – we are trying to estimate the group into which the test subject would fall. In the testing phase the variable is therefore hidden (unobserved).

3.3 ADDITIONAL INFORMATION

As mentioned above, we collected not only solutions to problems but also additional personal information about the students. This additional information may improve the quality of the student model. On the other hand, it makes the model more complex (more parameters need to be estimated), and it may mislead the reasoning based solely on question answers (especially later in the test, when sufficient information about a student has been collected from his/her answers). The additional variables are Y1, . . . , Yℓ and they take states y1, . . . , yℓ. We tested both versions of most of the models, i.e., models with and without the additional information.

3.4 PROPOSED MODELS

In total we created 14 different models that differ in the factors discussed above. The combinations of parameter settings are displayed in Table 3. One model type is shown in Figure 1: the case of "tf plus", a network with one hidden skill node and with the additional information³. Models that differ only in the number of states of variables have the same structure. Models with the "obs" infix in the name and "o" in the ID have the skill variable modified to represent score groups rather than skill (as explained in Section 3.2). Models without additional information do not contain the variables to the right of the skill variable S1. Figure 2 shows the structure of the expert models, with 7 skill variables in the middle part of the figure.

³ Please note that the missing problems and problem numbers are due to the two-cycle test creation and the removal of problems.

Table 3: Overview of Bayesian network models.

ID     Model name           No. of       No. of states    Problem     Additional
                            skill nodes  of skill nodes   variables   info
b2     tf simple            1            2                Boolean     no
b2+    tf plus              1            2                Boolean     yes
b3     tf3s simple          1            3                Boolean     no
b3+    tf3s plus            1            3                Boolean     yes
b3o    tf3s obssimple       1            3                Boolean     no
b3o+   tf3s obsplus         1            3                Boolean     yes
b2e    tf expert            7            2                Boolean     no
n2     points simple        1            2                numeric     no
n2+    points plus          1            2                numeric     yes
n3     points3s simple      1            3                numeric     no
n3+    points3s plus        1            3                numeric     yes
n3o    points3s obssimple   1            3                numeric     no
n3o+   points3s obsplus     1            3                numeric     yes
n2e    points expert        7            2                numeric     no

Figure 1: Bayesian network with one hidden variable and personal information about students.

Figure 2: Bayesian network with 7 hidden variables (the expert model).

4 ADAPTIVE TESTS

All proposed models are supposed to serve for adaptive testing. In this section we describe the process of adaptive testing with the help of these models.

First, we select the model which we want to use. If this model contains additional information variables, it is necessary to insert the observed states of these variables before we start selecting and asking questions. Then the following steps are repeated:

• The next question to be asked is selected.

• The question is asked and a result is obtained.

• The result is inserted into the network as evidence.

• The network is updated with this evidence.

• (optional) Subsequent answers are estimated.

This procedure is repeated as long as necessary, i.e., until we reach a termination criterion, which can be a time restriction, the number of questions, or a confidence interval of the estimated variables. Each of these criteria could lead to a different optimal strategy (Vomlel, 2004b), but because finding such a strategy is NP-hard (Lín, 2005), we have chosen a heuristic approach based on greedy entropy minimization.

4.1 SELECTING THE NEXT QUESTION

One task to solve during the procedure is the selection of the next question. It is repeated in every step of the testing and it is described below.

Let the test be in the state after s − 1 steps, where

X_s = {X_i1, . . . , X_in | i1, . . . , in ∈ {1, . . . , m}}

are the unobserved (unanswered) variables and

e = {X_k1 = x_k1, . . . , X_ko = x_ko | k1, . . . , ko ∈ {1, . . . , m}}

is the evidence of the observed variables – questions which were already answered and, possibly, the initial information. The goal is to select a variable from X_s to be asked as the next question. We select the question with the largest expected information gain.

We compute the cumulative Shannon entropy over all skill variables of S given evidence e. It is given by the following formula:

H(e) = Σ_{i=1..n} Σ_{s_i} −P(S_i = s_i | e) · log P(S_i = s_i | e) .

Assume we decide to ask a question X′ ∈ X_s with possible outcomes x′_1, . . . , x′_p. After inserting the observed outcome, the entropy over all skills changes. We can compute the value of the new entropy for the evidence extended by X′ = x′_j, j ∈ {1, . . . , p}, as

H(e, X′ = x′_j) = Σ_{i=1..n} Σ_{s_i} −P(S_i = s_i | e, X′ = x′_j) · log P(S_i = s_i | e, X′ = x′_j) .

Now we can compute the expected entropy after answering question X′:

EH(X′, e) = Σ_{j=1..p} P(X′ = x′_j | e) · H(e, X′ = x′_j) .

Finally, we choose a question X* that maximizes the information gain IG(X′, e):

X* = arg max_{X′ ∈ X_s} IG(X′, e) , where
IG(X′, e) = H(e) − EH(X′, e) .

The entropy H(e, X′ = x′_j) is the sum of the individual entropies over all skill nodes. Another option would be to compute the entropy of the joint probability distribution of all skill nodes. This would take into account correlations between these nodes. In our task, however, we want to estimate the marginal probabilities of all skill nodes. In the case of high correlations between two (or more) skills, the second criterion would assign them a lower significance in the model; this is behavior we wanted to avoid. The first criterion assigns the same significance to all skill nodes, which seems to us the better solution. Given the objective of the question selection, the greedy strategy based on the sum of entropies provides good results. Moreover, the computational time required for the proposed method is lower.

4.2 INSERTION OF THE SELECTED QUESTION

The selected question X* is given to the student and his/her answer is obtained. This answer changes the state of the variable X* from unobserved to an observed state x*. Next, the question together with its answer is inserted into the vector of evidence e. We update the probability distributions P(S_i | e) of the skill variables with the updated evidence e. We also recompute the value of the entropy H(e). The question X* is removed from X_s, forming the set of unobserved variables X_{s+1} for the next step, and the selection process can be repeated.

4.3 ESTIMATING SUBSEQUENT ANSWERS

In the experiments presented in the next section we use the individual models to estimate the answers to all subsequent questions in X_{s+1}. This is easy, since we enter the evidence e and perform inference to compute P(X′ = x′ | e) for all states of X′ ∈ X_{s+1} by invoking the distribute and collect evidence procedures in the BN model.

ID/Step   0       1       5       15      25      30
b2        0.714   0.761   0.766   0.778   0.798   0.835
b2+       0.749   0.768   0.768   0.778   0.797   0.829
b3        0.714   0.745   0.776   0.803   0.843   0.857
b3+       0.746   0.754   0.780   0.801   0.831   0.859
b3o       0.714   0.747   0.782   0.800   0.832   0.864
b3o+      0.747   0.761   0.785   0.799   0.830   0.865
b2e       0.715   0.730   0.767   0.776   0.781   0.768
n2        0.684   0.708   0.730   0.713   0.745   0.776
n2+       0.717   0.732   0.731   0.717   0.750   0.778
n3        0.684   0.723   0.745   0.758   0.781   0.790
n3+       0.684   0.724   0.743   0.757   0.770   0.776
n3o       0.686   0.721   0.745   0.751   0.770   0.779
n3o+      0.716   0.729   0.743   0.752   0.773   0.779
n2e       0.684   0.699   0.735   0.738   0.737   0.715
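The greedy selection rule of Section 4.1 can be sketched for the simplest model family (the "b2" type: one Boolean skill node S that is a parent of all Boolean questions). This is a minimal illustration, not the paper's implementation: all probabilities below are hypothetical placeholders rather than learned parameters, and with a single skill variable the cumulative entropy H(e) reduces to the entropy of P(S | e).

```python
import math

# Hypothetical parameters of a single-skill model ("b2"-like structure):
# prior P(S = 1) and, per question, (P(X = 1 | S = 0), P(X = 1 | S = 1)).
PRIOR = 0.6
P_X_GIVEN_S = {
    "X1": (0.20, 0.90),
    "X2": (0.40, 0.60),
    "X3": (0.10, 0.95),
}

def posterior_skill(evidence):
    """P(S = 1 | e) for evidence mapping question id -> observed answer 0/1."""
    num, den = PRIOR, 1.0 - PRIOR          # likelihood * prior for S=1, S=0
    for q, x in evidence.items():
        p0, p1 = P_X_GIVEN_S[q]
        num *= p1 if x == 1 else 1.0 - p1
        den *= p0 if x == 1 else 1.0 - p0
    return num / (num + den)

def entropy(p):
    """Shannon entropy (in bits) of a Bernoulli distribution."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def predictive(q, evidence):
    """P(X_q = 1 | e), marginalizing over the skill."""
    ps = posterior_skill(evidence)
    p0, p1 = P_X_GIVEN_S[q]
    return ps * p1 + (1.0 - ps) * p0

def information_gain(q, evidence):
    """IG(q, e) = H(e) - EH(q, e) over the single skill variable."""
    h = entropy(posterior_skill(evidence))
    px1 = predictive(q, evidence)
    eh = (px1 * entropy(posterior_skill({**evidence, q: 1}))
          + (1.0 - px1) * entropy(posterior_skill({**evidence, q: 0})))
    return h - eh

def next_question(evidence):
    """Greedy step: pick the unanswered question with the largest IG."""
    unanswered = [q for q in P_X_GIVEN_S if q not in evidence]
    return max(unanswered, key=lambda q: information_gain(q, evidence))
```

With the placeholder numbers above, the most discriminative question "X3" is selected first, while the nearly uninformative "X2" is left for the end of the test, which mirrors the behavior discussed in Section 5.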
Table 4: Success ratios of the Bayesian network models.

5 MODEL EVALUATION

In this section we report the results of tests performed with the networks proposed in Section 3. The testing was done by 10-fold cross-validation. For each model we learned the corresponding Bayesian network from 9/10 of the randomly divided data. The model parameters were learned using Hugin's (Hugin, 2014) implementation of the EM algorithm. The remaining 1/10 of the dataset served as the testing set. This procedure was repeated 10 times to obtain 10 networks for each model type.

The testing was done as described in Section 4. For every model and for each student in the testing data we simulated a test run. The collected initial evidence and answers were inserted into the model. During testing we estimated the answers of the current student based on the evidence collected so far. At the end of step s we computed the probability distributions P(X_i | e) for all unobserved questions X_i ∈ X_{s+1}. Then we selected the most probable state of X_i:

x*_i = arg max_{x_l} P(X_i = x_l | e) .

By comparing this value to the real answer x′_i we obtained the success ratio of the response estimation over all questions X_i ∈ X_{s+1} of test (student) t in step s:

SR^t_s = ( Σ_{X_i ∈ X_{s+1}} I(x*_i = x′_i) ) / |X_{s+1}| , where

I(expr) = 1 if expr is true, and 0 otherwise.

The total success ratio of one model in step s over all test data (N = 281) is defined as

SR_s = ( Σ_{t=1..N} SR^t_s ) / N .

We will refer to the success rates in the steps s as elements of sr = (SR_0, SR_1, . . .), where SR_0 is the success rate of the prediction before asking any question.

Table 4 shows the success rates of the proposed networks for selected steps s = 0, 1, 5, 15, 25, 30. The network IDs correspond to the IDs from Table 3. The most important part of the tests is the first few steps, because of the nature of CAT: we prefer shorter tests, and therefore we are interested in the early progression of the model (in this case approximately up to step 20). During the final stages of testing we estimate the results of only a couple of questions, which in some cases may cause rapid changes of the success rates. Questions which are left to the end of the test do not carry a large amount of information (because of the entropy selection strategy). This may have two possible causes. The first is that the state of the question is almost certain, so knowing it does not bring any additional information. The second is that the question's connection with the rest of the model is weak, and because of that it does not change the entropy of the skill variables much. In the latter case it is also hard to predict the state of such a question, because its probability distribution likewise does not change much with additional evidence.

From an analysis of the success rates we have identified clusters of models with similar behavior. For models with integer valued questions, and also for models with Boolean questions, three clusters of models with similar success ratio emerged:

• models with a skill variable with 3 states,

• models with a skill variable with 2 states, and

• the expert model.

We selected the best model from each cluster to display the success ratios SR_s in steps s in Figure 3 for Boolean questions and in Figure 4 for integer valued questions. We made the following observations:

• Models with a skill variable with 3 states were more successful.

• Models with a skill variable with 2 states were better at the very end of the tests, but this test stage is not very important for CAT, since tests usually terminate at early stages, as explained above.

• The expert model achieved medium quality prediction in the middle stage, but its prediction ability decreases in the second half of the tests.

We would like to point out that the distinction between the models lies essentially only in the skill variables used. The influence of the additional information is visible only at the very beginning of testing. As can be seen in Table 4, the "+" models score better in the initial estimation and in the first step. After that, both variants follow almost the same track. In the late stages of the test, models with additional information estimate worse than their counterparts without it. This suggests that models without additional information are able to derive the same information by getting answers to a few questions (in the order of a couple of steps).

It is easy to observe that the expert model does not provide as good results as the other models, especially during the second half of the testing. As stated above, the second part of the testing is not as important; nevertheless, we have investigated the causes of these inaccuracies. The main possible reason for this behavior may be the complexity of this type of model. With seven skill nodes and various connections to question nodes, this model contains a significantly higher number of parameters to be fitted. It is possible that our limited learning sample leads to over-fitting. We have explored the conditional probability tables (CPTs) of the models used during the cross-validation procedure to see how sparse they are. Our observations are shown in Table 5. The number AZT is the average of the total number of zeros in the cross-validation models for the specific configuration, and AS is the average sparsity of the CPT rows in these models.

Table 5: Average number of zeros (AZT) and average sparsity (AS) of different models.

       b2+      b3       b2e      n2+      n3       n2e
AZT    0.5      1.9      7.5      18.1     47.4     81.7
AS     0.002    0.006    0.026    0.047    0.081    0.121

We can see that, within the same type of scale (Boolean or numeric), the sparsity of the expert models is significantly higher. This can be improved by increasing the data volume or decreasing the model's complexity. This finding is consistent with the possible cause of the inaccuracies explained above. In addition, we can observe that there is also an increase in sparsity when more skill variable states are introduced. It seems to us a good idea to further explore the space between one skill variable and seven skill variables, as well as the number of their states, to provide better insight into this problem and to draw more general conclusions.

In Figures 5 and 6 we compare which questions were often selected by the tested models at different stages of the tests. Figure 5 is for Boolean questions and Figure 6 for integer valued questions. Only three models (the same as for the success ratio plots) were selected, because the other models share common behavior with others from the same cluster. On the horizontal axis there is the step in which the question was asked; on the vertical axis are the questions by their ID. The darker the cell in the graph, the more tests used the corresponding variable at the corresponding time. Even though this provides only a rough presentation, it is possible to notice different patterns of behavior. Especially, we would like to point out the clouded area of the expert model, where it is clear that the individual tests were very different. The expert models are apparently less sure about the selection of the next question. This may be caused by the large set of skill variables, which divides the effort of the model into many directions. This behavior is not necessarily unwanted, because it provides a very different test for every test subject, which may be considered positive, but it is necessary to maintain the prediction success rates.

6 CONCLUSION AND FUTURE RESEARCH

In this paper we presented several Bayesian network models designed for adaptive testing. We evaluated their performance using data from paper tests organized at grammar schools. In the experiments we observed that:

• A larger state space of the skill variables is beneficial. Clearly, models with 3 states of the hidden skill variable behave better during the most important stages of the tests. Tests with hidden variables with more than 3 states are still to be done.

• The expert model did not score as well as the simpler models, but it showed a potential for improvement. The proposed expert model is much more complex than the other models in this paper, and it can probably improve its performance with more data collected.

• The additional information improves results only during the initial stage. This fact is positive, because obtaining such additional information may be hard in practice. Additionally, it can be considered politically incorrect to make assumptions about student skills using this type of information.

In the future we plan to explore models with one or two hidden variables having more than three states and expert models with skill nodes with more than 2 states, and to try to add relations between the skills of the expert model to improve its performance. We would also like to compare our current results with standard models used in adaptive testing, such as the Rasch and IRT models.

Figure 3: Success ratios for models with Boolean questions ("b2+", "b3", "b2e"; success rate vs. step).

Figure 4: Success ratios for models with integer valued questions ("n2+", "n3", "n2e"; success rate vs. step).

Figure 5: Relative occurrence of questions (on vertical axis) in models with Boolean scale.
From left: "b2+", "b3", "b2e".

Figure 6: Relative occurrence of questions (on vertical axis) in models with numeric scale. From left: "n2+", "n3", "n2e".

Acknowledgements

The work on this paper has been supported by GACR project n. 13-20012S.

References

Almond, R. G. and Mislevy, R. J. (1999). Graphical Models and Computerized Adaptive Testing. Applied Psychological Measurement, 23(3):223–237.

Hugin (2014). Explorer, ver. 8.0, computer software, http://www.hugin.com.

Kjærulff, U. B. and Madsen, A. L. (2008). Bayesian Networks and Influence Diagrams. Springer.

Lín, V. (2005). Complexity of finding optimal observation strategies for Bayesian network models. In Proceedings of the conference Znalosti, Vysoké Tatry.

Millán, E., Loboda, T., and Pérez-de-la-Cruz, J. L. (2010). Bayesian networks for student model engineering. Computers & Education, 55(4):1663–1683.

Millán, E., Trella, M., Pérez-de-la-Cruz, J., and Conejo, R. (2000). Using Bayesian networks in computerized adaptive tests. In Ortega, M. and Bravo, J., editors, Computers and Education in the 21st Century, pages 217–228. Springer.

Mislevy, R. J. (1994). Evidence and inference in educational assessment. Psychometrika, 59(4):439–483.

van der Linden, W. J. and Glas, C. A. (2000). Computerized Adaptive Testing: Theory and Practice. Springer.

Vomlel, J. (2004a). Bayesian networks in educational testing. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 12(supp01):83–100.

Vomlel, J. (2004b). Building Adaptive Tests Using Bayesian Networks. Kybernetika, 40(3):333–348.

Weiss, D. J. and Kingsbury, G. G. (1984). Application of Computerized Adaptive Testing to Educational Problems. Journal of Educational Measurement, 21:361–375.