An IRT-based Parameterization for Conditional Probability Tables

Russell G. Almond
Educational Psychology and Learning Systems
College of Education
Florida State University
Tallahassee, FL 32312

Abstract

In educational assessment, as in many other areas of application for Bayesian networks, most variables are ordinal. Additionally, conditional probability tables need to express monotonic relationships; e.g., increasing skill should mean an increasing chance of a better performance on an assessment task. This paper describes a flexible parameterization for conditional probability tables based on item response theory (IRT) that preserves monotonicity. The parameterization is extensible because it rests on three auxiliary functions: a mapping function, which maps discrete parent states to real values; a combination function, which combines the parent values into a sequence of real numbers corresponding to the child variable states; and a link function, which maps that vector of numbers to conditional probabilities. The paper also describes an EM algorithm for estimating the parameters, and describes a hybrid implementation using both R and Netica, available for free download.

1 INTRODUCTION

The most commonly used parameterization for learning conditional probability tables (CPTs) in discrete Bayesian networks with known structure is the hyper-Dirichlet model (Spiegelhalter & Lauritzen, 1990). Spiegelhalter and Lauritzen show that under two assumptions, global parameter independence and local parameter independence, the hyper-Dirichlet distribution is a natural conjugate of the conditional multinomial distribution, that is, a Bayesian network. In the complete data case, parameter learning is accomplished by generating contingency tables corresponding to the CPT and simply counting the number of events corresponding to each combination in the training data. This counting algorithm is easily extended to an EM algorithm when some of the variables are missing at random (e.g., some are latent), and this EM algorithm is implemented in many common Bayesian network software packages (e.g., Netica; Norsys, 2012).

In many applications, all of the variables in the Bayesian network are ordinal and the network is expected to be monotonic (van der Gaag, Bodlaender, & Feelders, 2004): higher values of parent variables are associated with higher values of child variables. For example, in education, increasing ability should result in a better performance (say, a higher partial credit score on a short answer assessment item). This monotonicity condition is a violation of the local parameter independence assumption, which assumes that the distributions of the parameters for the rows of the CPT are independent.

Even ignoring this assumption violation, the hyper-Dirichlet model may not provide stable estimates of the CPT. If the parent variables of a particular node are moderately to strongly correlated, then certain row configurations may be rare in the training set. For example, if Skill 1 and Skill 2 are correlated, few individuals for whom Skill 1 is high and Skill 2 is low may be sampled. The problem gets worse as the number of parents increases, as the number of parameters of the hyper-Dirichlet distribution grows exponentially with the number of parents.

The solution is to build parametric models for the CPTs. A common parametric family is based on noisy-min and noisy-max models (Díez, 1993; Srinivas, 1993). Almond et al. (2001) proposed a method based on adapting models from item response theory (IRT). This model class has three parts: (1) a mapping from the discrete parent variables to an effective theta, a value on an equal interval real scale, (2) a combination function that combines the effective thetas for each parent variable into a single effective theta for the item, and (3) a link function, based on the graded response model (Samejima, 1969). Almond (2010) and Almond, Mislevy, Steinberg, Yan, and Williamson (2015) extend and develop this model, adding a new mapping function and a new link function based on the probit function. Almond, Kim, Shute, and Ventura (2013) provide an additional extension based on the generalized partial credit model (Muraki, 1992).

This paper organizes these IRT-based parametric models into an extensible framework that can be implemented in the R language (R Core Team, 2015). It introduces a Parameterized Node object which has two functional attributes: rules, which specifies the combination rules, and link, which specifies the link function. It also has a parent tvals method, which specifies the mapping from discrete states to effective thetas and can be overridden to extend the model. Because R is a functional language (functions can be stored as data), this creates an open implementation protocol (Maeda, Lee, Murphy, & Kiczales, 1997) which can be easily extended by an analyst familiar with R.

2 ITEM RESPONSE THEORY FOR CONDITIONAL PROBABILITY TABLES

Item response theory (IRT; Hambleton, Swaminathan, & Rogers, 1991) describes a family of models for the performance of an examinee on an assessment. Let X_ij be a binary variable representing whether Examinee i got Item j correct or incorrect, and let θ_i be the (latent) ability of the examinee. The probability of a correct response is modeled as a function of the examinee ability and item parameters. A typical parameterization is the 2-parameter logistic (2PL) model:

    Pr(X_ij = 1 | θ_i) = exp(1.7 α_j (θ_i − β_j)) / (1 + exp(1.7 α_j (θ_i − β_j))),    (1)

also written as logit⁻¹(1.7 α_j (θ_i − β_j)). The scale parameter α_j is called the discrimination and the location parameter β_j is called the difficulty. The constant 1.7 is a scaling factor used to make the inverse logistic function approximately the same as the normal ogive (probit) function. In order to identify the latent scale, the mean and variance of the latent ability variables θ_i are set to 0 and 1 in the target population. There are many variations on this basic model; two variations for polytomous items are the graded response model (GRM; Samejima, 1969) and the generalized partial credit model (GPC; Muraki, 1992).

A key insight of Lou DiBello (Almond et al., 2001) is that if each configuration of parent variables can be mapped to an effective theta, a point on a latent normal scale, then standard IRT models can be used to calculate conditional probability tables. This section reviews prior work in IRT-based parameterizations of CPTs. The general framework can be described in three steps:

Mapping  Each configuration, i′, of the K parent variables is mapped into a real-valued vector of effective thetas, θ̃(i′) = (θ̃_1(i′), …, θ̃_K(i′)).

Combination Rule  A combination function, Z_j(θ̃(i′)), is applied to the effective thetas to yield a combined effective theta for each row of the CPT for Variable j.

Link  The link function, g_j(z), is evaluated at Z_j(θ̃(i′)) to produce the conditional probabilities for Row i′ of the CPT for Variable j.

This is an extension of the generalized linear model (McCullagh & Nelder, 1989) with a linear predictor Z_j(θ̃) and a link function g_j(z).

Section 2.1 reviews the hyper-Dirichlet model of Spiegelhalter and Lauritzen (1990). Section 2.2 describes how to map discrete parent variables onto the theta scale. Section 2.3 describes rules for combining multiple parent variables. Section 2.4 describes the generalized partial credit, graded response and probit link functions.

2.1 HYPER-DIRICHLET MODEL

The work of constructing a Bayesian network consists of two parts: constructing the model graph, and constructing the conditional probabilities Pr(X | pa(X)), the conditional probability tables (CPTs). A CPT is usually expressed as a matrix in which each row corresponds to a configuration of the parent variables and each column corresponds to a state of the child variable. A configuration is a mapping of each of the parent variables into a possible state, and the corresponding row of the CPT is the conditional probability distribution over the possible values of the child variable given that the parent variables are in the corresponding configuration.

A commonly used parameterization for Bayesian networks is the hyper-Dirichlet model (Spiegelhalter & Lauritzen, 1990). Note that in a discrete Bayesian network, a node which has no parents follows a categorical or multinomial distribution. The natural conjugate prior for the multinomial distribution is the Dirichlet distribution. If the node has parents, then each row of the CPT is a multinomial distribution, and the natural conjugate for that row is also a Dirichlet distribution.

Spiegelhalter and Lauritzen (1990) introduce two additional assumptions. The global parameter independence assumption states that the probabilities for the conditional probability tables for any pair of variables are independent. (In the psychometric context, this is equivalent to the assumption that the parameters for different items are independent.) The local parameter independence assumption states that the probabilities for any two rows in a CPT are independent. Under these two assumptions, the hyper-Dirichlet distribution, the distribution where every row of every CPT is given an independent Dirichlet distribution, is the natural conjugate of the Bayesian network.
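Concretely, the row-wise conjugate updating just described can be sketched as follows. This is an illustrative Python sketch rather than the paper's R implementation, and the parent state labels and counts are hypothetical:

```python
def dirichlet_update(prior_alphas, counts):
    """Row-wise conjugate update: each CPT row gets an independent
    Dirichlet posterior with parameter vector a_row + y_row."""
    return {row: [a + y for a, y in zip(prior_alphas[row], counts[row])]
            for row in prior_alphas}

def posterior_mean_cpt(posterior_alphas):
    """Posterior mean of each row: normalize the Dirichlet parameters."""
    return {row: [a / sum(alphas) for a in alphas]
            for row, alphas in posterior_alphas.items()}

# Hypothetical node: one binary parent ("Low"/"High"), binary child.
prior = {"Low": [1.0, 1.0], "High": [1.0, 1.0]}   # effective sample size 2 per row
counts = {"Low": [8, 2], "High": [3, 7]}          # observed contingency table
post = dirichlet_update(prior, counts)
cpt = posterior_mean_cpt(post)
```

The weight given to the data versus the prior in each row is exactly the observed row count versus the prior's effective sample size, as in the text.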
Let the matrix A_X be the parameter for the CPT of Variable X. The rows of A_X correspond to the possible configurations of pa(X) and the columns of A_X correspond to the possible states of X, indexed by the numbers 0 to S.[1] Thus, a_i′ = (a_i′0, …, a_i′S) (the i′th row of A_X) is the parameter of the Dirichlet distribution for Pr(X | pa(X) = i′).

[1] In the equations, the states run from lowest to highest, but in the implementation the states run from highest to lowest.

Let Ŷ_X be the matrix of observed counts associated with Variable X; in other words, let y_i′v be the number of times that V = v when pa(V) = i′. Under the global independence assumptions, this is the sufficient statistic for the CPT. Under the hyper-Dirichlet distribution, the posterior parameter for the CPT for X is Ŷ_X + A_X. The posterior probability for Row i′ is a Dirichlet distribution with parameter (ŷ_i′0 + a_i′0, …, ŷ_i′S + a_i′S). The weight given to the data, the observed sample size for the row, is y_i′+ = Σ_s y_i′s, and the weight given to the prior, or effective sample size of the prior, is a_i′+ = Σ_s a_i′s.

The number of elements of A_X grows exponentially with the number of parents. Furthermore, the observed sample size of certain rows will be higher, sometimes much higher, than others. In particular, when the parent variables are moderately correlated, rows corresponding to configurations where one variable has a high value and the other a low value will be less common than cases where both parents have similar values. In extreme cases, or with small training samples, ŷ_i′s will be very close to a_i′s for those rows.

A second problem is that in many applications the CPTs should be monotonically non-decreasing: when the parent variables take on higher values, the probability that the child variable takes on higher values should not decrease. It is possible, especially with a small training sample, for the estimated CPT to have an unexpected non-monotonic pattern.

The IRT-based CPT parameterizations described below mitigate both of these problems. First, the number of parameters grows linearly in the number of parents. Second, the parameterizations force the CPTs to be monotonically increasing.

2.2 MAPPING DISCRETE VARIABLES ONTO CONTINUOUS SCALES

IRT models assume that the parent variables are continuous. Basing CPTs for discrete Bayesian networks on IRT models requires first assigning a real value θ̃_mk to each possible state, m, of each parent variable V_k. Following the IRT convention, these numbers should be on the same scale as a unit normal distribution, with 0.0 representing the median individual in the population and 1.0 representing an individual one standard deviation above the median.

Almond (2010) suggested using equally spaced quantiles of the normal distribution as effective theta values. For a parent variable V_k that takes one of M_k possible values, the effective theta value for state m is Φ⁻¹((2m + 1)/(2M_k)), where Φ(·) is the cumulative distribution function of the standard normal distribution. The quantiles need not be equally spaced, but can be matched to any marginal distribution (Almond et al., 2015). However, equally spaced quantiles form a uniform marginal distribution over the parent variable, and thus are ready to be combined with information about the distribution of the parent variable from elsewhere in the network.

The function which assigns a configuration of the parent variables, i′, to a vector of effective theta values, θ̃(i′), is called the mapping function. In most cases, this is a composition function consisting of separate mapping functions for each parent variable. The function based on normal quantiles given above is one example of a mapping function; however, any function that maps the states of the parent variables to values on the real line is a potential mapping function. To preserve monotonicity, the mapping should also be monotonic.

2.3 COMBINATION RULES

In the case of a binary child variable, the IRT 2PL function g_j(z) = logit⁻¹(1.7z) is a natural link function. The combination rule Z_j(θ̃_1(i′), …, θ̃_K(i′)) must map the effective thetas for the parent variables onto a single dimension representing the item. Also, in order to preserve monotonicity, Z_j(·) must be monotonic.

Almond et al. (2001) noted that changing the functional form of Z_j(·) changed the design pattern associated with the item. They suggested the following functions:

Compensatory  This structure function is a weighted average of the relevant skill variables: Z_j(θ̃(i′)) = (1/√K) Σ_k α_jk θ̃_k(i′) − β_j. Here 1/√K is a variance stabilization term which keeps the variance of Z_j(·) from growing as more parents are added.

Conjunctive  This structure function is based on the minimum of the relevant skill variables (the weakest skill dominates performance): Z_j(θ̃(i′)) = min_k α_jk θ̃_k(i′) − β_j.

Disjunctive  This structure function is based on the maximum of the relevant skill variables (the strongest skill dominates performance): Z_j(θ̃(i′)) = max_k α_jk θ̃_k(i′) − β_j.

Inhibitor  A conditional function where, if the first skill is lower than a threshold, θ*, the value is at a nominal low value (usually based on the lowest possible value of the second skill, θ̃_2(0)), but if the first skill exceeds that threshold, the other skill dominates the value of the function. This models tasks where a certain minimal level of the first skill is necessary, but after that threshold is achieved more of the first skill is not relevant:

    Z_j(θ̃_1(i′), θ̃_2(i′)) = β_j + α_j θ̃_2(0)    if θ̃_1(i′) < θ*,
                              β_j + α_j θ̃_2(i′)   if θ̃_1(i′) ≥ θ*.    (2)

Note that the difficulty and discrimination parameters are built into the structure functions here. When there are multiple parent variables, there are multiple slope parameters, α_jk, giving the relative importance of the parent variables. This makes a lot of sense in the case of the compensatory rule (which is just a linear model). In the case of the conjunctive and disjunctive rules, it often makes more sense to have multiple intercepts (indicating that different minimum levels of skill are required for successful performance):

Offset Conjunctive  This structure function is based on the minimum of the relevant skill variables with different thresholds: Z_j(θ̃(i′)) = α_j min_k (θ̃_k(i′) − β_jk).

Offset Disjunctive  This structure function is based on the maximum of the relevant skill variables with different thresholds: Z_j(θ̃(i′)) = α_j max_k (θ̃_k(i′) − β_jk).

The link function for polytomous items (variables that take on more than two states) typically requires a different value Z_js(θ̃(i′)) for each state s of the child variable. Often this is achieved by simply using different values of the difficulty parameter for each s. In the more general case, each level of the dependent variable would have a different set of discriminations, α_js, and difficulties, β_js, or even potentially a different functional form for Z_js(θ̃(i′)).

Von Davier (2008) proposed a similar class of generalized diagnostic models based on an IRT framework. Von Davier expresses the relationship between proficiency variables and observable outcome variables through a Q-matrix (Tatsuoka, 1984). This is a matrix where the columns represent proficiency variables and the rows represent observable outcomes. When Variable k is thought to be relevant for Item j, then q_jk = 1; otherwise, q_jk = 0. Note that the Q-matrix defines a large part of the graphical structure of the Bayesian network (Almond, 2010): there is a directed edge between the Proficiency Variable k and the Observable Outcome Variable j if and only if q_jk = 1.

In a Bayesian network, the structure of the graph is used in place of the Q-matrix (the Q-matrix approach is really only useful for bipartite graphs where the parent variables and child variables form distinct sets). However, this notation is useful for situations in which certain parent variables have relevance for certain state transitions. Let q_jsk = 1 if Parent k of Item j is relevant for the transition from state s − 1 to s, and zero otherwise, so that Q_j is an item-specific Q-matrix with columns corresponding to the parent variables. Let the combination rule for state s be Z_js(θ̃(i′)[q_js]), where the square bracket represents a selection operator (it is styled after the R selection operator) which selects those values of θ̃(i′) for which q_jsk = 1. Note that in this case, the parameters for different states of the child variable (e.g., α_js and α_js′) could have different dimensions.

2.4 LINK FUNCTIONS

The expression Z_js(θ̃(i′)[q_js]) associates a real value with each configuration of parent variables, i′, and each state of the child variable, s (although usually Z_j0(·) is set to a constant such as zero or infinity). The goal of the link function g_j(·) is to change these values into the conditional probability function. In the case of a binary variable, the inverse logit function is a possible link function. This section describes three more possibilities for cases where the child variable has more than two states.

Graded Response Link.  Almond et al. (2001) suggested using the graded response function (Samejima, 1969) for the link function. The graded response model is based on a series of curves representing Pr(X_j ≥ s | θ), and the probability that X_j = s is given by subtracting two adjacent curves. In this case, let P*_js(i′) = logit⁻¹(1.7 Z_js(θ̃(i′)[q_js])) and set P*_j0 = 1 and P*_j,S+1 = 0, where the states are represented by integers going from 0 to S. Then the entries for Row i′ of the conditional probability table will be p_js(i′) = P*_js(i′) − P*_j,s+1(i′).

Note that in order for the graded response function not to produce negative probabilities, it must be the case that Z_j1(θ̃(i′)[q_j1]) > Z_j2(θ̃(i′)[q_j2]) > ⋯ > Z_jS(θ̃(i′)[q_jS]) for all i′, so that the curves Pr(X_j ≥ s | θ) do not cross. The easiest way to achieve this is to have all of the Z_js(·) share the same functional form and slopes, differing only in the intercepts.

Probit Link.  Almond et al. (2015) present an alternative way of thinking about the link function based on a regression model. The child variable, like the parent variables, is mapped onto an effective theta scale. In particular, a series of cut points c_0 = −∞ < c_1 < ⋯ < c_S < c_S+1 = ∞ are established so that Φ(c_s+1) − Φ(c_s) provides a desired marginal probability (Almond, 2010). The values Z_j(θ̃(i′)) are a series of linear predictors for each row, i′, of the CPT, and the conditional probability Pr(X_j = s | i′) is calculated by finding the probability that a normal random variable with mean Z_j(θ̃(i′)) and standard deviation σ_j falls between c_s and c_s+1.

Note that this link function uses a single combination rule for all states of the child variable. It also introduces a link scale parameter, σ_j. Guo, Levina, Michailidis, and Zhu (2015) introduce a similar model with σ_j = 1.

Partial Credit Link.  Muraki (1992) introduces an IRT model appropriate for a constructed response item where the solution to the problem requires several steps. Assume that the examinee is assigned a score on the item based on how many of the steps were completed. If Item j has S_j steps, then the possible scores are 0, …, S_j. For s > 0, let P_js|s−1(θ̃(i′)) = Pr(X_j ≥ s | X_j ≥ s − 1, θ̃(i′)) = logit⁻¹(1.7 Z_js(θ̃(i′)[q_js])); that is, let P_js|s−1(θ̃(i′)) be the probability that the examinee completes Step s, given that the examinee has completed Steps 0, …, s − 1. Note that Pr(X_j ≥ 0) = 1, and so let P_j0|−1 = 1 and Z_j0(θ̃(i′)) = 0. The probability that an examinee whose parent proficiency variables are in Configuration i′ will achieve Score s on Item j is then:

    Pr(X_j = s | θ̃(i′)) = ( ∏_{r=0}^{s} P_jr|r−1(θ̃(i′)) ) / C,

where C is a normalization constant. Following Muraki (1992), note that this collapses to:

    Pr(X_j = s | θ̃(i′)) = exp( 1.7 Σ_{r=0}^{s} Z_jr(θ̃(i′)[q_jr]) ) / Σ_{R=0}^{S_j} exp( 1.7 Σ_{r=0}^{R} Z_jr(θ̃(i′)[q_jr]) ).    (3)

The partial credit link function is the most flexible of the three. Note that each state of the child variable can have a different combination rule Z_js(·), as well as potentially different slope and intercept parameters.

3 THE PARAMETERIZED NETWORK AND NODE OBJECT MODELS

The mapping, combination rule and link functions described above represent a non-exhaustive survey of those known to the author at the time this paper was written. Software implementing this framework should be extensible to cover additional possibilities. The implementation in R (R Core Team, 2015) described in this paper uses an open implementation framework (Maeda et al., 1997) to ensure extensibility. The design is opened in two ways. First, because R is a functional language, the combination rule and link functions can be stored as fields of a parameterized node object. While library functions are available for the combination rule and link functions described above, adding variants is straightforward. Second, it uses standard object-oriented conventions to allow functions such as the mapping function and the function that computes the CPT from the parameters to be overridden by subclasses.

3.1 PACKAGE STRUCTURE

The implementation in R uses four R packages to take advantage of previous work (Figure 1) and to minimize dependencies on the specific Bayes net engine used for the initial development (Netica; Norsys, 2012). The packages are as follows:

CPTtools  CPTtools (Almond, 2015) is an existing collection of R code that implements many of the mapping, combination rule and link functions described above. In particular, it creates CPTs as R data frames (tables that can store both factor and numeric data). The first several factor-valued columns of the data frame describe the configurations of the parent variables, and the last few numeric columns give the probabilities for each of the states.

RNetica  RNetica (Almond, 2015) is basically a binding in R of the C API of Netica. It binds Netica network and node objects into R objects so that they can be accessed from R. It is able to set the CPT of a Netica node from the data frame representation calculated in the CPTtools package.

Peanut  Peanut[2] is a new package providing abstract classes and generic functions. The goal is similar to that of the DBI package (R Special Interest Group on Databases, 2013); that is, to isolate the code that depends on a specific implementation, to allow for easy extensibility.

PNetica  PNetica is a new package that provides an implementation of the Peanut generic functions using RNetica objects. In particular, fields of Peanut objects are serialized and stored as node user data in Netica nodes and networks. Collections of nodes are represented as node sets in Netica.

[2] The name is a corruption of parameterized net.

Peanut uses the S3 class system (Becker, Chambers, & Wilks, 1988), which is quite loose. In particular, any object can register itself as a parameterized node or network; it just needs to implement the methods described in the object model below. This should make it straightforward to replace the RNetica (which requires a license from Norsys for substantial problems) and PNetica packages with similar packages which adapt the algorithm to a different Bayes net engine.
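To make the three-stage construction of Section 2 concrete, the normal-quantile mapping, the compensatory combination rule, and the graded response link can be sketched end to end. This is an illustrative Python translation, not the CPTtools implementation; the item (two 3-state parents, a 3-state child, unit slopes, state-specific difficulties) is hypothetical:

```python
from statistics import NormalDist
import itertools
import math

PHI = NormalDist()  # standard normal distribution

def effective_thetas(n_states):
    """Equally spaced normal quantiles Phi^{-1}((2m + 1) / (2M)),
    m = 0..M-1, lowest state first."""
    return [PHI.inv_cdf((2 * m + 1) / (2 * n_states)) for m in range(n_states)]

def compensatory(thetas, alphas, beta):
    """Weighted average of parent thetas with 1/sqrt(K) variance stabilization."""
    return sum(a * t for a, t in zip(alphas, thetas)) / math.sqrt(len(thetas)) - beta

def graded_response(zs):
    """Child-state probabilities 0..S from Z_{j1}, ..., Z_{jS} (decreasing in s,
    so the adjacent differences of P* stay non-negative)."""
    p_star = [1.0] + [1.0 / (1.0 + math.exp(-1.7 * z)) for z in zs] + [0.0]
    return [p_star[s] - p_star[s + 1] for s in range(len(zs) + 1)]

# Hypothetical item parameters: common slopes, difficulties rising with state.
alphas, betas = [1.0, 1.0], [-0.5, 0.5]
parent_scales = [effective_thetas(3), effective_thetas(3)]

cpt = {}
for config in itertools.product(range(3), repeat=2):
    thetas = [scale[state] for scale, state in zip(parent_scales, config)]
    zs = [compensatory(thetas, alphas, beta) for beta in betas]
    cpt[config] = graded_response(zs)
```

Because the mapping, the rule, and the link are all monotonic, every row of the resulting CPT shifts probability toward higher child states as the parent configuration improves, which is the monotonicity property the parameterization is designed to guarantee.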
18 peatedly, once for each state except for the last one (CPTtools assumes the states are ordered highest to lowest), the result is a matrix with one fewer columns than the child variable has states. The link function takes this matrix and produces the conditional proba- bility tables. Note that there is an optional link scale parameter which is used for the probit method, but not for the partial credit or graded response method. prior(log alpha,beta,link scale) This takes a set of pa- rameters and returns the log prior probability for that Figure 1: Package Structure parameter set. Note that any of the parameters may be vectors or lists. Figure 3 describes the calcDPCTable function (part of the CPTtools package) which shows how these func- tional arguments are used. Key to understanding this code is that the R function do.call takes two arguments, a function and a list of arguments and returns the value of evaluating the call. Consequently, the rules and link argument are functions or names of functions. Also, R is Figure 2: Parameterized Network (Peanut) Class diagram vectorized. So if the rules argument is a list of rules, then a different function will be applied for each state of the child variable. Similarly if the lnAlphas4 and betas are lists, then different values are used for each 3.2 OBJECT MODEL state. The build table() method of the Parameterized Figure 2 shows the object model for Peanut. There are Node class is just a wrapper which calls calcDPCTable two important classes, Parameterized Network and Param- with arguments taken from its attributes and uses it to set eterized Node, both of which extend ordinary Bayesian the CPT of the node. network and node classes. The Parameterized network is The use of functional arguments makes calcDPCtable mostly a container for Parameterized Node and Observable (and hence the Parameterized Node) class easy to extend. Node (nodes which are referenced in the case data). 
Most Any function that fits the generic model can be used in this of its methods iterate over the the parameterized nodes. application, so the set of functions provided by CPTtools The attributes3 can be divided into two groups: numeric at- is easily extended. The mapping function is slightly dif- tributes representing parameters and functional attributes ferent, as the mapping is usually associate with the parent which take functions (or names of functions) as values. variable. As a consequence, a generic function parent In the key methods, build tables() and max CPT tvals() is used to calculated the effective values. The params() the functional attributes are applied to numeric current implementation for Netica nodes retrieves them values to construct the tables. from the parent nodes, but this could be overridden using a subclass of the parameterized node. The three functional attributes, rules, link and prior all must have specific signatures as they perform specific functions. 4 AN EM ALGORITHM FOR ESTIMATING PARAMETERIZED rule(theta,alpha,beta) A combination rule must take ta- NODE PARAMETERS ble of effective thetas, corresponding to the configu- ration of parents, and return a vector of values, one A large challenge in learning educational models from data for each row. Note that in R all values can be either is that the variables representing the proficiencies are sel- scalars or vectors, so that there can be either a differ- dom directly observed. The EM algorithm (Dempster, ent alpha or different beta for each parent (depending Laird, & Rubin, 1977) provides a framework for MAP or on the specific rule. 
MLE estimation in problems with latent or missing vari- 4 link(et,link scale=NULL) When the rule is applied re- Because in educational settings, the discrimination parame- ters are restricted to be strictly positive, a log transformation of 3 These are expressed attributes in the object diagrams, but im- the alphas is used in parameter learning instead of the original plemented through accessor methods. scale. 19 ables. The EM algorithm alternates between two opera- tions: E-Step Calculate the expected value of the likelihood or posterior by integrating out over the missing values using the current set of parameter values. When the likelihood comes from an exponential family, as the conditional multinomial distribution does, this is equivalent to setting the sufficient statistics for the dis- tribution to their expected values. c a l c D P C T a b l e <− M-Step Find values for the parameters which maximize function ( s k i l l L e v e l s , obsLevels , the complete data log posterior distribution. lnAlphas , betas , r u l e s =” C o m p e n s a t o r y ” , Dempster, Laird and Rubin show that this algorithm will l i n k =” p a r t i a l C r e d i t ” , converge to a local mode of the posterior distribution. l i n k S c a l e =NULL,Q=TRUE , t v a l s =lapply ( skillLevels , For Bayesian network models, the E-Step can take ad- function ( sl ) vantage of the conditional independence assumptions en- e f f e c t i v e T h e t a s ( length ( s l ) ) ) coded in the model graph. Adopting the global parameter ) { independence assumption of Spiegelhalter and Lauritzen # # E r r o r c h e c k i n g and a r g u m e n t (1990) allows the M-Step to take advantage of the graph- ## p r o c e s s i n g code o m i t t e d ical structure as well. The global parameter independence assumption says that conditioned on the data being com- ## Create t a b l e o f e f f e c t i v e t h e t a s pletely observed, the parameters of the CPTs for different t h e t a s <− do . 
c a l l ( ” e x p a n d . g r i d ” , t v a l s ) variables in the network are independent. This means that the structural EM algorithm (Meng & van Dyk, 1997) can ## Apply r u l e s f o r each s t a t e ( e x c e p t be used, allowing the M-Step to be performed separately ## l a s t ) t o b u i l d per s t a t e t a b l e o f for each CPT. ## e f f e c t i v e t h e t a s . Another simplification can be found through the use of e t <− m a t r i x ( 0 , nrow ( t h e t a s ) , k −1) sufficient statistics. Under the global global parameter f o r ( kk i n 1 : ( k −1)) { independence assumption, the contingency table formed e t [ , kk ] <− by observing Xj and pa(Xj ), ŶXj , is a sufficient statis- do . c a l l ( r u l e s [ [ kk ] ] , tic for Pr(Xj |pa(Xj ) (Spiegelhalter & Lauritzen, 1990). l i s t ( t h e t a s [ ,Q[ s , ] ] , Thus in the E-Step it is sufficient to work out the ex- exp ( l n A l p h a s [ [ kk ] ] ) , pected value for this contingency table. If Oi is the ob- b e t a s [ [ kk ] ] ) ) served P data for Participant i, then this can be found by } ỸXj = i Pr(Xij , pa(Xij )|Oi ); the inner term of the sum can be calculated by normal Bayesian network operations. # # A p p l y L i n k f u n c t i o n t o b u i l d t h e CPT In this case the M-Step consists of finding the values of the do . c a l l ( l i n k , l i s t ( e t , l i n k S c a l e , parameters for Pr (Xj | pa(Xj )), θ j , that maximize obsLevels ) ) } X Pr(ỹi0 Xj |pa(Xj ) = i0 , θ j ) (4) i0 Figure 3: Function to build CPT It is quite likely that sparseness in the data will cause cells in ỸXj to have zero or near-zero values. This can come about for two reasons. One is that certain responses may be rare under certain configurations of the parent variables. 
The second reason is that Bayesian networks are often used to score complex constructed response tasks, and some variables may not be observed unless the examinee employs a certain approach to the problem, producing a low effective sample size for that variable. In either case, the zero values are a problem for maximum likelihood estimation. Equally problematic are values that are small integers, because the corresponding parameter values may be determined by only a handful of observations.

A common approach used in the analysis of contingency tables is to add a value less than one (usually 0.5) to each cell of the table. (Note that the Dirichlet distribution with all parameters equal to 0.5 is the non-informative prior produced by applying Jeffrey's rule.) Applying any proper prior distribution mitigates the problem with low cell counts, and often the Bayesian network construction process elicits a proper prior from the subject matter experts. In this case, suppose the experts supply parameter values $\psi_j^{(0)}$; then $\Pr(X_j \mid \mathrm{pa}(X_j), \psi_j^{(0)})$ is a matrix containing the expected value of a Dirichlet prior for the CPT for $X_j$. To get the Dirichlet prior, choose a vector of weights $w_j$, where $w_{i'j}$ is the weight, in terms of number of observations, to be given to parent configuration $i'$. Then set the Dirichlet prior $A_j = w_j^{t} \Pr(X_j \mid \mathrm{pa}(X_j), \psi_j^{(0)})$. The expression to maximize in the M-step becomes:

$$\sum_{i'} \Pr\left(\tilde{y}_{i'X_j} + a_{i'j} \mid \mathrm{pa}(X_j) = i', \psi_j\right) \qquad (5)$$

This provides a semi-Bayesian approach to estimating the CPT parameters.

If the discrete partial credit model is used for the conditional probability table, Equation 5 finds the points on the logistic curve that will maximize the likelihood of the observed data. However, the full logistic curve does not enter the equation, just the points corresponding to possible configurations of the parent variables. Consider a very simple model where the target variable has one parent with two states, and assume that the conditional probabilities for one outcome level given those states differ by .1. There are multiple points on the logistic curve that provide that probability difference; Figure 4 illustrates this. The two points marked with circles ($Z_1$ and $Z_2$) differ by .1 on the y-axis and correspond to a solution with a difficulty of zero and a discrimination of around 0.3. The two points marked with crosses ($ZZ_1$ and $ZZ_2$) also differ by .1 on the y-axis and correspond to a solution with a difficulty of 2.2 and a discrimination of 2.3.

[Figure 4 here: a logistic curve plotted against $Z(\theta_i)$ from $-6$ to $6$, probability on the vertical axis, with the point pairs $Z_1$/$Z_2$ and $ZZ_1$/$ZZ_2$ marked.]

Figure 4: Two possible solutions (labeled Z and ZZ) for a probability difference of .1

A fully Bayesian estimation would not have this identifiability problem: even if the solutions shown in Figure 4 have identical likelihoods, the posterior distributions would differ, providing a preference for the solution which is closer to the normal range of the parameters. Assuming that the experts have provided initial values for the parameters, $\psi_j^{(0)}$, a weakly informative prior can be created by using a normal distribution with a mean at $\psi_j^{(0)}$ and the variance chosen so that the 95% interval includes all of the reasonable values for the parameters. Let the resulting distribution be $\pi(\psi_j)$. The M-Step for the fully Bayesian solution then maximizes:

$$\sum_{i'} \Pr\left(\tilde{y}_{i'X_j} + a_{i'j} \mid \mathrm{pa}(X_j) = i', \psi_j\right) \pi(\psi_j) \qquad (6)$$

4.1 IMPLEMENTATION OF THE GENERALIZED EM ALGORITHM IN PEANUT

The GEMfit method takes advantage of the fact that the E-step of the EM algorithm is already implemented in Netica. Netica, like many other Bayes net packages, offers a version of the EM algorithm for hyper-Dirichlet distributions described in Spiegelhalter and Lauritzen (1990). In Netica, running the learning algorithm on the case file (basically, a table of values of observable nodes for various subjects) produces both a new CPT for each node and a set of posterior weights (called node experience in Netica). Multiplying the new table by the posterior weights produces the table of expected counts, which is the sufficient statistic required for the E-step.

Figure 5 shows the EM algorithm implementation in Peanut. Note that most of the work is done by four generic functions, BuildAllTables(), calcPnetLLike(), calcExpTables(), and maxAllTableParams(), allowing the EM algorithm to be customized by overriding those methods. The functions BuildAllTables() and maxAllTableParams() iterate over the Parameterized Nodes, allowing for further customization at the node rather than the net level.
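The bookkeeping just described, with the learned CPT times the per-row weights yielding an expected count table, and the expert-implied CPT times the prior weights yielding the Dirichlet prior $A_j$, can be sketched as follows. This is an illustration only, not the RNetica or Peanut API, and all numbers used with it are hypothetical.

```python
def expected_counts(cpt, weights):
    # Row-wise product: each CPT row (a probability distribution over
    # the child's states) times that row's weight gives expected counts.
    return [[w * p for p in row] for row, w in zip(cpt, weights)]

def dirichlet_prior(expert_cpt, prior_weights):
    # A_j = w_j^t Pr(X_j | pa(X_j), psi_j^(0)): the expert-implied CPT
    # scaled row-wise by the prior weight of each parent configuration.
    return expected_counts(expert_cpt, prior_weights)

def posterior_counts(prior, observed):
    # Cell-wise sum, the y~ + a term entering the M-step objective (5).
    return [[a + y for a, y in zip(prow, orow)]
            for prow, orow in zip(prior, observed)]
```

For example, a parent configuration with weight 20 and probabilities (0.7, 0.3) contributes expected counts of about 14 and 6; adding the observed counts for that row gives the combined table maximized in the M-step.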
In particular, these functions perform the following roles:

BuildAllTables() This calculates the CPTs for all Parameterized Nodes and sets the CPT of each node as well as its prior weight.

calcPnetLLike(cases) This calculates the log likelihood of the case file using the current parameters. In the Netica implementation, it loops over the rows in the case file and calculates the probability of each set of findings. The log probabilities are added together to produce the log likelihood for the current parameter values.

calcExpTables(cases) This runs the E-step. In the Netica implementation it calls Netica's EM learning algorithm.

maxAllTableParams() This runs the M-step. It iterates over the maxCPTParam() methods of the Parameterized Node objects. The Netica implementation calculates the expected table from the current CPT and posterior weights. It then calls the CPTtools function mapDPC(). Like calcDPCTable(), this function takes most of the attributes of the Parameterized Node as arguments. It then finds a new set of parameters (of the same shape as its input) to fit the expected table using a gradient descent algorithm.

GEMfit <-
  function (net, cases, tol = 1e-6,
            maxit = 100, Estepit = 1,
            Mstepit = 3) {
    ## Base case
    converged <- FALSE
    llike <- rep(NA, maxit + 1)
    iter <- 1

    ## This next function sets both the
    ## prior CPTs and the prior weights
    ## for each node.
    BuildAllTables(net)

    ## Initial value of likelihood for
    ## convergence test
    llike[iter] <- calcPnetLLike(net, cases)

    while (!converged && iter <= maxit) {
      ## E-step
      calcExpTables(net, cases,
                    Estepit = Estepit, tol = tol)

      ## M-step
      maxAllTableParams(net, Mstepit = Mstepit,
                        tol = tol)

      ## Update parameters and
      ## do convergence test
      iter <- iter + 1
      BuildAllTables(net)
      llike[iter] <- calcPnetLLike(net, cases)
      converged <-
        (abs(llike[iter] - llike[iter - 1]) < tol)
    }

    list(converged = converged, iter = iter,
         llikes = llike[1:iter])
  }

Figure 5: Generalized EM algorithm for Parameterized Networks

Note that neither the EM algorithm embedded in the E-step (the one run natively in Netica) nor the gradient descent algorithm in the M-step needs to be run to convergence. The convergence test is done at the level of the whole algorithm (this is what makes it a Generalized EM algorithm).

The prior weight is a tuning parameter for the algorithm. High values of the prior weight will cause the estimates to shrink towards the prior (expert) values. Low values will cause the estimates to be unstable when the data are sparse. I have found that values around 10 provide a good compromise.

5 SOFTWARE AVAILABILITY

The software is available for download from http://pluto.coe.fsu.edu/RNetica. All of the software is open source, which should also assist anyone trying to extend it, either to cover more distribution types or different Bayes net engines; however, RNetica links to the Netica API, which requires a license from Norsys.

The framework described here is very flexible, perhaps too flexible.
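Part of that flexibility is that the GEMfit driver in Figure 5 is engine-neutral: all engine-specific work happens inside the four generic functions it calls. As an illustration of the control flow only (not the Peanut API), here is the same loop in Python with a deliberately tiny model, a single Bernoulli parameter with missing observations, standing in for the Bayesian network; the hook names in the comments map to the generics in Figure 5.

```python
import math

def gem_fit(cases, tol=1e-6, maxit=100):
    # Toy generalized-EM loop mirroring GEMfit's control flow. The
    # "network" is one Bernoulli parameter p; cases are 0/1 values,
    # with None marking a missing observation.
    p = 0.5  # initial parameter value

    def llike(q):
        # ~ calcPnetLLike: log likelihood of the observed cases
        return sum(math.log(q if x == 1 else 1.0 - q)
                   for x in cases if x is not None)

    llikes = [llike(p)]
    converged, iters = False, 0
    while not converged and iters < maxit:
        # E-step (~ calcExpTables): expected sufficient statistic,
        # imputing each missing observation by its expectation p
        n_miss = sum(x is None for x in cases)
        exp_heads = sum(x for x in cases if x is not None) + p * n_miss
        # M-step (~ maxAllTableParams): maximize given expected counts
        p = exp_heads / len(cases)
        # Convergence is tested only at the level of the whole loop,
        # exactly as in GEMfit.
        llikes.append(llike(p))
        iters += 1
        converged = abs(llikes[-1] - llikes[-2]) < tol
    return p, converged, iters
```

With cases [1, 1, 0, None, None] the loop converges to p = 2/3, the maximum likelihood estimate from the observed cases alone; the point is the loop shape, not the toy model.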
First, it allows an arbitrary number of parameters for each CPT. There is no guarantee that those parameters are identifiable from the data; in many cases, the posterior distribution may look just like the prior distribution. Second, R has a very weak type system. The framework assumes that the combination rules and link functions are paired with appropriate parameters (alphas, betas and link scale parameters). This is not checked until the calcDPCTable function is called to build the table, meaning that errors may not be caught until the analyst tries to update the network with data.

Despite these weaknesses, the framework is a flexible one that supports the most common design patterns used in educational testing applications (Mislevy et al., 2003; Almond, 2010; Almond et al., 2013, 2015). Hopefully, these design patterns will be useful in other application areas as well. If not, the open implementation protocol used in the framework design makes it easy to extend.

References

Almond, R. G. (2010). 'I can name that Bayesian network in two matrixes'. International Journal of Approximate Reasoning, 51, 167–178. doi: 10.1016/j.ijar.2009.04.005

Almond, R. G. (2015, May). RNetica: Binding the Netica API in R (-3.4 ed.) [Computer software manual]. Retrieved from http://pluto.coe.fsu.edu/RNetica/RNetica.html (Open source software package)

Almond, R. G., DiBello, L., Jenkins, F., Mislevy, R. J., Senturk, D., Steinberg, L. S., & Yan, D. (2001). Models for conditional probability tables in educational assessment. In T. Jaakkola & T. Richardson (Eds.), Artificial intelligence and statistics 2001 (pp. 137–143). Morgan Kaufmann.

Almond, R. G., Kim, Y. J., Shute, V. J., & Ventura, M. (2013). Debugging the evidence chain. In R. G. Almond & O. Mengshoel (Eds.), Proceedings of the 2013 UAI application workshops: Big data meet complex models and models for spatial, temporal and network data (UAI2013AW) (pp. 1–10). Aachen. Retrieved from http://ceur-ws.org/Vol-1024/paper-01.pdf

Almond, R. G., Mislevy, R. J., Steinberg, L. S., Yan, D., & Williamson, D. M. (2015). Bayesian networks in educational assessment. Springer.

Becker, R. A., Chambers, J. M., & Wilks, A. R. (1988). The new S language: A programming environment for data analysis and graphics. Wadsworth & Brooks/Cole.

Dempster, A. P., Laird, N., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Díez, F. J. (1993). Parameter adjustment in Bayes networks: The generalized noisy or-gate. In D. Heckerman & A. Mamdani (Eds.), Uncertainty in artificial intelligence '93 (pp. 99–105). Morgan Kaufmann.

Guo, J., Levina, E., Michailidis, G., & Zhu, J. (2015). Graphical models for ordinal data. Journal of Computational and Graphical Statistics, 24(1), 183–204. doi: 10.1080/10618600.2014.889023

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Sage.

Maeda, C., Lee, A., Murphy, G., & Kiczales, G. (1997). Open implementation analysis and design. In SSR '97: Proceedings of the 1997 symposium on software reusability (pp. 44–52). New York, NY: ACM. doi: 10.1145/258366.258383

McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman and Hall.

Meng, X.-L., & van Dyk, D. (1997). The EM algorithm: An old folk-song sung to a fast new tune. Journal of the Royal Statistical Society, Series B, 59(3), 511–567.

Mislevy, R. J., Hamel, L., Fried, R. G., Gaffney, T., Haertel, G., Hafter, A., ... Wenk, A. (2003). Design patterns for assessing science inquiry (PADI Technical Report No. 1). SRI International. Retrieved from http://padi.sri.com/publications.html

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159–176. doi: 10.1177/014662169201600206

Norsys. (2012). Netica-C programmer's module (5.04 ed.) [Computer software manual]. Retrieved from http://norsys.com/netica_c_api.htm (Bayesian network software)

R Core Team. (2015). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from http://www.R-project.org/

R Special Interest Group on Databases. (2013). DBI: R database interface [Computer software manual]. Retrieved from http://CRAN.R-project.org/package=DBI (R package version 0.2-7)

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph No. 17, 34(4, Part 2).

Spiegelhalter, D. J., & Lauritzen, S. L. (1990). Sequential updating of conditional probabilities on directed graphical structures. Networks, 20, 579–605.

Srinivas, S. (1993). A generalization of the noisy-or model: The generalized noisy or-gate. In D. Heckerman & A. Mamdani (Eds.), Uncertainty in artificial intelligence '93 (pp. 208–215). Morgan Kaufmann.

Tatsuoka, K. K. (1984). Analysis of errors in fraction addition and subtraction problems (Vol. 20; NIE Final Report No. NIE-G-81-002). University of Illinois, Computer-based Education Research.

van der Gaag, L. C., Bodlaender, H. L., & Feelders, A. (2004). Monotonicity in Bayesian networks. In M. Chickering & J. Halpern (Eds.), Proceedings of the twentieth conference on uncertainty in artificial intelligence (pp. 569–576). AUAI.

von Davier, M. (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61, 287–307.