<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An IRT-based Parameterization for Conditional Probability Tables</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Russell G. Almond Educational Psychology</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Learning Systems</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Education Florida State Univeristy Tallahassee</institution>
          ,
          <addr-line>FL 32312</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>14</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>In educational assessment, as in many other areas of application for Bayesian networks, most variables are ordinal. Additionally conditional probability tables need to express monotonic relationships; e.g., increasing skill should mean increasing chance of a better performances on an assessment task. This paper describes a flexible parameterization for conditional probability tables based on item response theory (IRT) that preserves monotonicity. The parameterization is extensible because it rests on three auxiliary function: a mapping function which maps discrete parent states to real values, a combination function which combines the parent values into a sequence of real numbers corresponding to the child variable states, and a link function which maps that vector of numbers to conditional probabilities. The paper also describes an EM-algorithm for estimating the parameters, and describes a hybrid implementation using both R and Netica, available for free download.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The most commonly used parameterization for learning
conditional probability tables (CPTs) in discrete Bayesian
networks with known structure is the hyper-Dirichlet
model
        <xref ref-type="bibr" rid="ref20">(Spiegelhalter &amp; Lauritzen, 1990)</xref>
        . Spiegelhalter
and Lauritzen show that under two assumptions, global
parameter independence and local parameter independence,
the hyper-Dirichlet distribution is a natural conjugate of
the conditional multinomial distribution, that is, a Bayesian
network. In the complete data case, parameter learning
is accomplished by generating contingency tables
corresponding to the CPT and simply counting the number of
events corresponding to each combination in the training
data. This counting algorithm is easily extended to an EM
algorithm when some of the variable are missing at
random (e.g., some are latent), and this EM algorithm is
implemented in many common Bayesian network software
packages
        <xref ref-type="bibr" rid="ref16">(e.g., Netica; Norsys, 2012)</xref>
        .
      </p>
      <p>
        In many applications, all of the variables in the Bayesian
network are ordinal and the network is expected to be
monotonic
        <xref ref-type="bibr" rid="ref23">(van der Gaag, Bodlaender, &amp; Feelders, 2004)</xref>
        ,
higher values of parent variables are associated with higher
values of child variables. For example, in education,
increasing ability should result in a better performance (say
a higher partial credit score on short answer assessment
item). This monotonicity condition is a violation of the
local parameter independence assumption, which assumes
that the distribution of the parameters for the rows of the
CPT are independent.
      </p>
      <p>Even ignoring this assumption violation, the
hyperDirichlet results may not provide stable estimates of the
CPT. If the parent variables of a particular node are
moderately to strongly correlated, then certain row
configurations may be rare in the training set. For example if Skill 1
and Skill 2 are correlated, few individuals for whom Skill 1
is high and Skill 2 is low may be sampled. The problem
gets worse as the number of parents increases, as number
of parameters of the hyper-Dirichlet distribution grows
exponentially with the number of parents.</p>
      <p>
        The solution is to build parametric models for the CPTs.
A common parametric family is based on noisy-min and
noisy-max models
        <xref ref-type="bibr" rid="ref21 ref8">(D´ıez, 1993; Srinivas, 1993)</xref>
        .
        <xref ref-type="bibr" rid="ref3">Almond et
al. (2001)</xref>
        proposed a method based adapting models from
item response theory (IRT). This new model class has three
parts: (1) a mapping from the discrete parent variables to
an effective theta, a value on an equal interval real scale,
(2) a combination function that combined the effective
thetas for each parent variable into a single effective theta
for the item, and (3) a link function, based on the graded
response model
        <xref ref-type="bibr" rid="ref19">(Samejima, 1969)</xref>
        .
        <xref ref-type="bibr" rid="ref1">Almond (2010)</xref>
        and
Almond, Mislevy, Steinberg, Yan, and Williamson (2015)
extend and develop this model adding a new mapping
function and new link function based on the probit function.
Almond, Kim, Shute, and Ventura (2013) provides an
additional extension based on the generalized partial credit
model
        <xref ref-type="bibr" rid="ref15">(Muraki, 1992)</xref>
        .
      </p>
      <p>
        This paper organizes these IRT-based parametric models
into an extensible framework that can be implemented
in the R language
        <xref ref-type="bibr" rid="ref17 ref2 ref5">(R Core Team, 2015)</xref>
        . It
introduces a Parameterized Node object which has two
functional attributes: rules—which specifies the combination
rules,—and link—which specifies the link function. It
also has a parent tvals method which specifies the
mapping from discrete states to effective thetas, which can
be overridden to extend the model Because R is a
functional language (functions can be stored as data), this
creates an open implementation protocol
        <xref ref-type="bibr" rid="ref11">(Maeda, Lee,
Murphy, &amp; Kiczales, 1997)</xref>
        which can be easily extended by an
analyst familiar with R.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>ITEM RESPONSE THEORY FOR</title>
    </sec>
    <sec id="sec-3">
      <title>CONDITIONAL PROBABILITY</title>
    </sec>
    <sec id="sec-4">
      <title>TABLES</title>
      <p>
        Item response theory
        <xref ref-type="bibr" rid="ref10">(IRT; Hambleton, Swaminathan, &amp;
Rogers, 1991)</xref>
        describes a family of models for the
performance of an examinee on a assessment. Let Xij be a
binary variable representing whether Examinee i got Item j
correct or incorrect and let i be the (latent) ability of the
examinee. The probability of a correct response is
modeled as a function of the examinee ability and item
parameters. A typical parameterization is the 2-parameter logistic
(2PL) model:
exp (1:7 j ( i
1 + exp (1:7 j ( i
j ))
j ))
;
Pr (Xij = 1 j pai(Xj )) =
(1)
also written as logit 1 (1:7 j ( i j )). The scale
parameter j is called the discrimination and the location
parameter j is called the difficulty. The constant 1.7 a scaling
factor used to make the inverse logistic function
approximately the same as the normal ogive (probit) function. In
order to identify the latent scale, the mean and variance of
the latent ability variables i are set to 0 and 1 in the target
population. There are many variations on this basic model;
two variations for polytomous options are the graded
response model
        <xref ref-type="bibr" rid="ref19">(GRM; Samejima, 1969)</xref>
        and the generalized
partial credit model
        <xref ref-type="bibr" rid="ref15">(GPC; Muraki, 1992)</xref>
        .
      </p>
      <p>
        A key insight of Lou DiBello
        <xref ref-type="bibr" rid="ref3">(Almond et al., 2001)</xref>
        is that
if each configuration of parent variables can be mapped to
an effective theta, a point on a latent normal scale, then
standard IRT models can be used to calculate conditional
probability tables. This section reviews the following prior
work in IRT-based parameterizations of CPTs. The general
framework can be described in three steps:
Mapping Each configuration, i0, of the K parent variables
is mapped into a real value vector of effective thetas,
(i0) =
~1(i0); : : : ; ~K (i0) .
      </p>
      <p>Combination Rule A combination function, Zj
is applied to the effective thetas to yield a combined
effective theta for each row of the CPT for Variable j.
Link The link function, gj (z), is evaluated at Zj ( (i0)) to
produce the conditional probabilities for Row i0 of the
CPT for Variable j.</p>
      <p>
        This is an extension of the generalized linear model
        <xref ref-type="bibr" rid="ref12">(McCullagh &amp; Nelder, 1989)</xref>
        with a linear predictor Zj (~),
and a link function, gj (z).
      </p>
      <p>
        Section 2.1 reviews the hyper-Dirichlet model of
        <xref ref-type="bibr" rid="ref20">Spiegelhalter and Lauritzen (1990)</xref>
        . Section 2.2
describes how to map discrete parent variables on the theta
scale. Section 2.3 describes rules for combining multiple
parent variables. Section 2.4 describes the generalized
partial credit model, graded response and probit link
functions.
2.1
      </p>
      <sec id="sec-4-1">
        <title>HYPER-DIRICHLET MODEL</title>
        <p>The work of constructing a Bayesian network consists of
two parts: constructing the model graph, and
constructing the conditional probabilities Pr (Xj pa(X)), the
conditional probability tables (CPTs). A CPT is usually
expressed as a matrix in which each row corresponds to a
configuration of the parent variables and each column
corresponds to a state of the child variable. A configuration
is a mapping of each of the parent variables into a possible
state, and the corresponding row of the CPT is the
conditional probability distribution over the possible values of
the child variable given that the parent variables are in the
corresponding configuration.</p>
        <p>
          A commonly used parameterization for Bayesian networks
is the hyper-Dirichlet model
          <xref ref-type="bibr" rid="ref20">(Spiegelhalter &amp; Lauritzen,
1990)</xref>
          . Note that in a discrete Bayesian network, a node
which has no parents follows a categorical or multinomial
distribution. The natural conjugate prior for the
multinomial distribution is the Dirichlet distribution. If the node
has parents, then each row of the CPT is a multinomial
distribution, and the natural conjugate for that row is also a
Dirichlet distribution.
        </p>
        <p>
          <xref ref-type="bibr" rid="ref20">Spiegelhalter and Lauritzen (1990)</xref>
          introduce two
additional assumptions. The global parameter independence
assumption states that the probabilities for the conditional
probability tables for any pair of variables are
independent. (In the psychometric context, this is equivalent to
the assumption that the parameters for different items are
independent.) The local parameter independence
assumption states that the probabilities for any two rows in a CPT
are independent. Under these two assumptions, the
hyperDirichlet distribution, the distribution where every row of
every CPT is given an independent Dirichlet distribution, is
the natural conjugate of the Bayesian network.
        </p>
        <p>Let the matrix AX be the parameter for the CPT of
Variable X. The rows of AX correspond to the possible
configurations of pa(X) and the columns of AX
correspond to the possible states of X, indexed by the
numbers 0 to S.1 Thus, ai0 = (ai00; : : : ; ai0S ) (the i0th row
of AX ) is the parameters of the Dirichlet distribution for
Pr (X j pa(X) = i0).</p>
        <p>Let Y^X be the matrix of observed counts associated with
Variable X. In other words, let yi0v be then number of
times when pa(V ) = i0 that V = v. Under the global
independence assumptions, this is the sufficient statistic for
the CPT. Under the hyper-Dirichlet distribution, the
posterior parameter for the CPT for X is Y^X + AX . The
posterior probability for Row i0 is a Dirichlet distribution
with parameter, (y^i00 + ai00; : : : ; y^i0S + ai0S ). The weight
given to the data, the observed sample size for the row, is
yi0+ = Ps yi0s and the weight given the prior, or effective
sample size of the prior, is ai0+ = Ps ai0s.</p>
        <p>The number of elements of AX grows exponentially with
the number of parents. Furthermore, the observed
sample of certain rows will be higher, sometimes much higher
than others. In particular, when the parent variables are
moderately correlated, rows corresponding to
configurations where one variable has a high value and the other a
low variable will be less common than cases where both
parents have similar values. In extreme cases, or with small
training samples, Y^i0s will be very close to ai0s for those
rows.</p>
        <p>A second problem is that in many applications the CPTs
should be monotonically non-decreasing. When the
parent variables take on higher values, the probability that
the child variable should take on higher values should not
decrease. It is possible, especially with a small training
sample, for the estimated CPT to have an unexpected
nonmonotonic pattern.</p>
        <p>The IRT-based CPT parameterizations described below
mitigate both of these problems. First, the number of
parameters grows linearly in the number of parents. Second,
the parameterizations force the CPTs to be monotonically
increasing.
2.2</p>
      </sec>
      <sec id="sec-4-2">
        <title>MAPPING DISCRETE VARIABLES ONTO</title>
      </sec>
      <sec id="sec-4-3">
        <title>CONTINUOUS SCALES</title>
        <p>
          IRT models assume that the parent variables are
continuous. Basing CPTs for discrete Bayesian networks on IRT
models requires first assigning a real value ~mk to each
possible state, m, of each parent variable Vk. Following
the IRT convention, these numbers should be on the same
scale as a unit normal distribution with 0:0 representing the
median individual in the population and 1:0 representing an
1In the equations, the states run from lowest to highest, but in
the implementation the states run from highest to lowest.
individual one standard deviation from the median.
          <xref ref-type="bibr" rid="ref1">Almond (2010)</xref>
          suggested using equally spaced quantiles
of the normal distribution as effective theta values. For a
parent variable Vk that takes one of Mk possible values, the
effective theta values for state m is 1((2m + 1)=2Mk),
where ( ) is the cumulative distribution function of the
standard normal distribution. The quantiles need not be
equally spaced, but can be matched to any marginal
distribution
          <xref ref-type="bibr" rid="ref2 ref5">(Almond et al., 2015)</xref>
          . However, equally spaced
quantiles form a uniform marginal distribution over the
parent variable, and thus are ready to be combined with
information about the distribution of the parent variable from
elsewhere in the network.
        </p>
        <p>The function which assigns a configuration of the parent
variables, i0 to a vector of effective theta values, ~(i0), is
called the mapping function. In most cases, this is a
composition function consisting of separate mapping functions
for each parent variable. The function based on normal
quantiles given above is one example of a mapping
function; however, any function that maps the states of the
parent variables to the values on the real line is a potential
mapping function. To preserve monotonicity, the mapping
should also be monotonic.
2.3</p>
      </sec>
      <sec id="sec-4-4">
        <title>COMBINATION RULES</title>
        <p>In the case of a binary child variable, the IRT 2PL function
gj (z) = logit 1(1:7z) is a natural link function. The
combination rule Zj ~1(i0); : : : ; ~K (i0) must map the
effective thetas for the parent variables onto a single dimension
representing the item. Also, in order to preserve
monotonicity, Zj ( ) must by monotonic.</p>
        <p>
          <xref ref-type="bibr" rid="ref3">Almond et al. (2001)</xref>
          noted that changing the functional
form of Zj ( ) changed the design pattern associate with the
item. They suggested the following functions:
Compensatory This structure function is a weighted
average of relevant skill variables. Zj ~(i0) =
p1K Pk jk ~k(i0) j . Here 1=pK is a variance
stabilization term which keeps the variance of Zj ( )
from growing as more parents are added.
        </p>
        <p>Conjunctive This structure function is based on the
minimum of the relevant skill variables (weakest skill
dominates performance). Zj ~(i0) = mink jk ~k(i0)
j .</p>
        <p>Disjunctive This structure function is based on the
maximum of the relevant skill variables (strongest
skill dominates performance). Zj ~(i0) =
maxk jk ~k(i0)</p>
        <p>j .</p>
        <p>Inhibitor A conditional function where the if the first skill
is lower than a threshold, , then the value is at a
nominal low value (usually based on the lowest
possible value of the second skill, ~2(0)), but if the first
skill exceeds that threshold, the other skill dominates
the value of the function. This models tasks where a
certain minimal level of the first skill is necessary, but
after that threshold is achieved more of the first skill
is not relevant.</p>
        <p>Zj
~1(i0); ~2(i0) =
( j + j ~2(0) if ~1(i0) &lt;
j + j ~2(i0) if ~1(i0)
(2)
Note that the difficulty and discrimination parameters are
built into the structure functions here. When there are
multiple parent variables, there are multiple slope parameters,
jk, giving the relative importance of the parent variables.
This makes a lot of sense in the case of the compensatory
rule (which is just a linear model). In the case of the
conjunctive and disjunctive rules, it often makes more sense
to have multiple intercepts (indicating different minimum
levels of skill are required for successful performance).
Offset Conjunctive Structure function is based on the
minimum of the relevant skill variables with different
thresholds. Zj ~(i0) = j mink(~k(i0) jk).
Offset Disjunctive Structure function is based on the
maximum of the relevant skill variables with different
thresholds. Zj ~(i0) = j maxk(~k(i0) jk).
The link function for polytomous items (variables that take
on more than two states) typically require a different value
Zjs ~(i0) for each state s of the child variable. Often
this is achieved by simply using different values of the
difficultly parameter for each s. In the more general case,
each level of the dependent variable would have a different
set of discriminations, js, and difficulties, js, or even
potentially a different functional form for Zjs
~(i0) .</p>
        <p>
          <xref ref-type="bibr" rid="ref24">Von Davier (2008</xref>
          ) proposed a similar class of generalized
diagnostic models based on an IRT framework. Von Davier
expresses the parent-child relationship proficiency
variables and observable outcome variables through a Q-matrix
          <xref ref-type="bibr" rid="ref22">(Tatsuoka, 1984)</xref>
          . This is a matrix where the columns
represent proficiency variables and the rows represent
observable outcomes. When Variable k is thought to be relevant
for Item j, then qjk = 1; otherwise, qjk = 0. Note that the
Q-matrix defines a large part of the graphical structure of
the Bayesian network
          <xref ref-type="bibr" rid="ref1">(Almond, 2010)</xref>
          : there is a directed
edge between the Proficiency Variable k and the
Observable Outcome Variable j if and only if qjk = 1.
In a Bayesian network, the structure of the graph is used
in place of the Q-matrix (the Q-matrix approach is really
only useful for bipartite graphs where the parent variables
and child variables each form distinct sets). However, this
notation is useful for situations in which certain parent
variables have relevance for certain state transitions. Let
qjsk = 1 if Parent k of Item j is relevant for the
transition from state s 1 to s and zero otherwise, so that Qj is
an item specific Q-matrix with columns corresponding to
the parent variables. Let the combination rule for state s
be Zjs ~(i0)[qjs] , where the square bracket represents a
selection operator (it is styled after the R selection
operator) which selects those values of ~(i0) for which qjsk = 1.
Note that in this case, the parameters for different states of
the child variable (e.g., js and js0 ) could have different
dimensions.
2.4
        </p>
      </sec>
      <sec id="sec-4-5">
        <title>LINK FUNCTIONS</title>
        <p>The expressions Zjs ( (i0)[qjs]) associates a real value
with each configuration of parent variables, i0, and each
state of the child variable, s (although usually Zj0( ) is set
to a constant such as zero or infinity). The goal of the link
function gj ( ) is to change these values into the conditional
probability function. In the case of the binary variable, the
inverse logit function is a possible link function. This
section describes three more possibilities for cases where there
the child variable has more than one state.</p>
        <p>
          Graded Response Link.
          <xref ref-type="bibr" rid="ref3">Almond et al. (2001)</xref>
          suggested
using the graded response function
          <xref ref-type="bibr" rid="ref19">(Samejima, 1969)</xref>
          for
the link function. The graded response is based off of a
series of curves representing Pr(Xj sj ), and the
probability that Xj = s is given by subtracting two adjacent curves.
In this case, let Pjs(i0) = logit 1 1:7Zjs
~(i0)[qj s]
and set Pj0 = 1 and Pj;S+1 = 0, where the states are
represented by integers going from 0 to S. Then the
entries for Row i0 of the conditional probability table will be
pjs(i0) = Pjs(i0) Pj;s+1(i0).
        </p>
        <p>Note that in order for the graded response function to
not produce negative probabilities, it must be the case
that Zj1 ~(i0)[qj1] &lt; Zj2 ~(i0)[qj2] &lt; &lt;
ZjS</p>
        <p>
          ~(i0)[qjS ] for all i0. Note that the easiest way to
achieve this is to have all of the Zjs( ) with the same
functional form and slopes, differing only in the intercepts.
Probit Link.
          <xref ref-type="bibr" rid="ref2 ref5">Almond et al. (2015)</xref>
          presents an alternative
way of thinking about the link function based on a
regression model. The child variable like the parent variables in
mapped onto an effective theta scale. In particular a series
of cut points c0 = 1 &lt; c1 &lt; &lt; cS &lt; cS+1 = 1
are established so that (cs+1) (cs) provides a
desired marginal probability
          <xref ref-type="bibr" rid="ref1">(Almond, 2010)</xref>
          . The values
Zj ~(i0) are a series of linear predictors for each row, i0,
of the CPT, and the conditional probabilities Pr(Xj = sji0)
is calculated by finding the probability that a normal
random variable with mean Zj
j falls between cs and cs+1.
        </p>
        <p>~(i0) and standard deviation
Note that this link function uses a single combination rule
for all states of the child variable. It also introduces a link
scale parameter, j . Guo, Levina, Michailidis, and Zhu
(2015) introduce a similar model with j = 1.</p>
        <p>
          Partial Credit Link.
          <xref ref-type="bibr" rid="ref15">Muraki (1992)</xref>
          introduces an IRT
model appropriate for a constructed response item where
the solution to the problem requires several steps. Assume
that the examinee is assigned a score on that item based on
how many of the steps were completed. If Item j has Sj
steps, then the possible scores are 0; : : : ; Sj . For s &gt; 0, let
Pjsjs 1
        </p>
        <p>~(i0)
logit 1 1:7Zjs
= Pr Xj
s j Xj
s
1; ~(i0)
=
~(i0)[qj s] ; that is, let Pjsjs 1
~(i0)
be the probability that the examinee completes Step s,
given that the examinee has completed steps 0; : : : ; s 1.
Note that Pr(Xj 0) = 1, and so let Pj0j 1 = 1, and
Zj0
~(i0)</p>
        <p>= 0. The probability that examinee whose
parent proficiency variables are in Configuration i0 will
achieve Score s on Item j is then:</p>
        <p>Pr Xj = s j ~(i0) =</p>
        <p>Qs
r=0 Pjrjr 1</p>
        <p>
          ~(i0)
C
where C is a normalization constant. Following
          <xref ref-type="bibr" rid="ref15">Muraki
(1992)</xref>
          , note that this collapses to:
        </p>
        <p>Pr Xj = s j ~(i0) =
exp 1:7 Ps
r=0 Zjr</p>
        <p>~(i0)[qj r]
PSj r=0 Zjr</p>
        <p>R=0 exp 1:7 PR
~(i0)[qj r]
:
;
(3)
The partial credit link function is the most flexible of the
three. Note that each state of the child variable can have
a different combination rule Zjs( ) as well as potentially
different slope and intercept parameters.
3</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>THE PARAMETERIZED NETWORK</title>
    </sec>
    <sec id="sec-6">
      <title>AND NODE OBJECT MODELS</title>
      <p>
        The mapping, combination rule and link functions
described above represent a non-exhaustive survey of those
known to the author at the time this paper was written.
Software implementing this framework should be
extensible to cover additional possibilities. The implementation in
R
        <xref ref-type="bibr" rid="ref17 ref2 ref5">(R Core Team, 2015)</xref>
        described in this paper uses an open
implementation framework
        <xref ref-type="bibr" rid="ref11">(Maeda et al., 1997)</xref>
        to ensure
extensibility. The design is opened in two ways. First,
because R is a functional language, the combination rule
and link functions can be stored as fields of a
parameterized node object. While library functions are available for
the combination rule and link functions described above,
adding variants is straightforward. Second, it uses
standard object oriented conventions to allow functions such
the mapping function and the function that computes the
CPT from the parameters to be overridden by subclasses.
3.1
      </p>
      <sec id="sec-6-1">
        <title>PACKAGE STRUCTURE</title>
        <p>
          The implementation in R uses four R packages to take
advantage of previous work (Figure 1) and to minimize
dependencies on the specific Bayes net engine used for the
initial development
          <xref ref-type="bibr" rid="ref16">(Netica; Norsys, 2012)</xref>
          . The packages
are as follows:
CPTtools CPTtools
          <xref ref-type="bibr" rid="ref2 ref5">(Almond, 2015)</xref>
          is an existing
collection of R code that implements many of the mapping,
combination rule and link functions described above.
In particular, it creates CPTs as R data frames (tables
that can store both factor and numeric data). The first
several factor-valued columns of the data frame
describe the configurations of the parent variables and
the last few numeric columns the probabilities for
each of the states.
        </p>
        <p>
          RNetica RNetica
          <xref ref-type="bibr" rid="ref2 ref5">(Almond, 2015)</xref>
          is basically a binding in
R of the C API of Netica. It binds Netica network
and node objects into R objects so that they can be
accessed from R. It is able to set the CPT of a Netica
node from the data frame representation calculated in
the CPTtools package.
        </p>
        <p>
          Peanut Peanut2 is a new package providing abstract
classes and generic functions. The goal is similar
to the DBI package
          <xref ref-type="bibr" rid="ref18 ref4">(R Special Interest Group on
Databases, 2013)</xref>
          ; that is, to isolate the code that
depends a specific implementation to allow for easy
extensibility.
        </p>
        <p>PNetica PNetica is a new package that provides an
implementation of the Peanut generic functions using
RNetica objects. In particular, fields of Peanut objects are
serialized and stored as node user data in Netica nodes
and networks. Collections of nodes are represented as
node sets in Netica.</p>
        <p>
          Peanut uses the S3 class system
          <xref ref-type="bibr" rid="ref6">(Becker, Chambers, &amp;
Wilks, 1988)</xref>
          which is quite loose. In particular, any
object can register itself as a parameterized node or network,
it just needs to implement the methods described in the
object model below. This should make it straightforward to
replace the RNetica (which requires a license from Norsys
for substantial problems) and PNetica packages with
similar packages which adapt the algorithm to a different Bayes
net engine.
        </p>
        <sec id="sec-6-1-1">
          <title>2The name is a corruption of parameterized net.</title>
          <p>The attributes3 can be divided into two groups: numeric
attributes representing parameters and functional attributes
which take functions (or names of functions) as values.
In the key methods, build tables() and max CPT
params() the functional attributes are applied to numeric
values to construct the tables.</p>
          <p>The three functional attributes, rules, link and prior
all must have specific signatures as they perform specific
functions.
rule(theta,alpha,beta) A combination rule must take
table of effective thetas, corresponding to the
configuration of parents, and return a vector of values, one
for each row. Note that in R all values can be either
scalars or vectors, so that there can be either a
different alpha or different beta for each parent (depending
on the specific rule.
link(et,link scale=NULL) When the rule is applied
re3These are expressed attributes in the object diagrams, but
implemented through accessor methods.
peatedly, once for each state except for the last one
(CPTtools assumes the states are ordered highest to
lowest), the result is a matrix with one fewer columns
than the child variable has states. The link function
takes this matrix and produces the conditional
probability tables. Note that there is an optional link scale
parameter which is used for the probit method, but not
for the partial credit or graded response method.
prior(log alpha,beta,link scale) This takes a set of
parameters and returns the log prior probability for that
parameter set. Note that any of the parameters may be
vectors or lists.</p>
          <p>Figure 3 describes the calcDPCTable function (part of
the CPTtools package) which shows how these
functional arguments are used. Key to understanding this code
is that the R function do.call takes two arguments, a
function and a list of arguments and returns the value of
evaluating the call. Consequently, the rules and link
argument are functions or names of functions. Also, R is
vectorized. So if the rules argument is a list of rules,
then a different function will be applied for each state
of the child variable. Similarly if the lnAlphas4 and
betas are lists, then different values are used for each
state. The build table() method of the Parameterized
Node class is just a wrapper which calls calcDPCTable
with arguments taken from its attributes and uses it to set
the CPT of the node.</p>
          <p>The use of functional arguments makes calcDPCtable
(and hence the Parameterized Node) class easy to extend.
Any function that fits the generic model can be used in this
application, so the set of functions provided by CPTtools
is easily extended. The mapping function is slightly
different, as the mapping is usually associate with the parent
variable. As a consequence, a generic function parent
tvals() is used to calculated the effective values. The
current implementation for Netica nodes retrieves them
from the parent nodes, but this could be overridden using a
subclass of the parameterized node.
4</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>AN EM ALGORITHM FOR</title>
    </sec>
    <sec id="sec-8">
      <title>ESTIMATING PARAMETERIZED</title>
    </sec>
    <sec id="sec-9">
      <title>NODE PARAMETERS</title>
      <p>
        A large challenge in learning educational models from data
is that the variables representing the proficiencies are
seldom directly observed. The EM algorithm
        <xref ref-type="bibr" rid="ref7">(Dempster,
Laird, &amp; Rubin, 1977)</xref>
        provides a framework for MAP or
MLE estimation in problems with latent or missing
vari4Because in educational settings, the discrimination
parameters are restricted to be strictly positive, a log transformation of
the alphas is used in parameter learning instead of the original
scale.
c a l c D P C T a b l e &lt;
f u n c t i o n ( s k i l l L e v e l s , o b s L e v e l s ,
l n A l p h a s , b e t a s ,
r u l e s = ” C o m p e n s a t o r y ” ,
l i n k = ” p a r t i a l C r e d i t ” ,
l i n k S c a l e =NULL , Q=TRUE ,
t v a l s = l a p p l y ( s k i l l L e v e l s ,
f u n c t i o n ( s l )
      </p>
      <p>e f f e c t i v e T h e t a s ( l e n g t h ( s l ) ) )
) f
# # E r r o r c h e c k i n g and a r g u m e n t
# # p r o c e s s i n g c o d e o m i t t e d
# # C r e a t e t a b l e o f e f f e c t i v e t h e t a s
t h e t a s &lt; do . c a l l ( ” e x p a n d . g r i d ” , t v a l s )
# # A p p l y r u l e s f o r e a c h s t a t e ( e x c e p t
# # l a s t ) t o b u i l d p e r s t a t e t a b l e o f
# # e f f e c t i v e t h e t a s .
e t &lt; m a t r i x ( 0 , nrow ( t h e t a s ) , k 1)
f o r ( kk i n 1 : ( k 1 ) ) f
e t [ , kk ] &lt;
do . c a l l ( r u l e s [ [ kk ] ] ,
l i s t ( t h e t a s [ ,Q[ s , ] ] ,
exp ( l n A l p h a s [ [ kk ] ] ) ,
b e t a s [ [ kk ] ] ) )
g</p>
      <p>g
# # A p p l y L i n k f u n c t i o n t o b u i l d t h e CPT
do . c a l l ( l i n k , l i s t ( e t , l i n k S c a l e ,</p>
      <p>o b s L e v e l s ) )
ables. The EM algorithm alternates between two
operations:
E-Step Calculate the expected value of the likelihood or
posterior by integrating out over the missing values
using the current set of parameter values. When
the likelihood comes from an exponential family, as
the conditional multinomial distribution does, this is
equivalent to setting the sufficient statistics for the
distribution to their expected values.</p>
      <p>M-Step Find values for the parameters which maximize
the complete data log posterior distribution.</p>
      <p>
        Dempster, Laird and Rubin show that this algorithm will
converge to a local mode of the posterior distribution.
For Bayesian network models, the E-Step can take
advantage of the conditional independence assumptions
encoded in the model graph. Adopting the global parameter
independence assumption of
        <xref ref-type="bibr" rid="ref20">Spiegelhalter and Lauritzen
(1990)</xref>
        allows the M-Step to take advantage of the
graphical structure as well. The global parameter independence
assumption says that conditioned on the data being
completely observed, the parameters of the CPTs for different
variables in the network are independent. This means that
the structural EM algorithm
        <xref ref-type="bibr" rid="ref13">(Meng &amp; van Dyk, 1997)</xref>
        can
be used, allowing the M-Step to be performed separately
for each CPT.
      </p>
      <p>
        Another simplification can be found through the use of
sufficient statistics. Under the global global parameter
independence assumption, the contingency table formed
by observing Xj and pa(Xj ), Y^Xj , is a sufficient
statistic for Pr(Xj jpa(Xj )
        <xref ref-type="bibr" rid="ref20">(Spiegelhalter &amp; Lauritzen, 1990)</xref>
        .
Thus in the E-Step it is sufficient to work out the
expected value for this contingency table. If Oi is the
observed data for Participant i, then this can be found by
Y~Xj = Pi Pr(Xij ; pa(Xij )jOi); the inner term of the sum
can be calculated by normal Bayesian network operations.
In this case the M-Step consists of finding the values of the
parameters for Pr (Xj j pa(Xj )), j , that maximize
X Pr(y~i0Xj jpa(Xj ) = i0; j )
i0
(4)
It is quite likely that sparseness in the data will cause cells
in Y~Xj to have zero or near-zero values. This can come
about for two reasons. One is that certain responses may
be rare under certain configurations of the parent variables.
The second is that often Bayesian networks are used to
score complex constructed response tasks, and some
variables may not be observed unless the examinee employs
a certain approach to problem, producing a low effective
sample size for that variable. In either case, the zero values
are a problem for maximum likelihood estimation. Equally
problematic are values that are small integers, because the
parameter values may be determined by only a handful of
observations.
      </p>
      <p>A common approach used in the analysis of contingency
tables is to add a value less than one (usually 0.5) to each cell
of the table. (Note that the Dirichlet distribution with all
parameters equal to 0.5 is the non-informative prior
produced by applying Jeffrey’s rule.) Applying any proper
prior distribution mitigates the problem with low cell count,
and often the Bayesian network construction process elicits
a proper prior parameters from the subject matter experts.
In this case, suppose the experts supply parameter values
(j0), then Pr(Xj jpa(Xj ); (j0)) is a matrix containing the
expected value of a Dirichlet prior for the CPT for Xj . To
get the Dirichlet prior, choose a vector of weights wj where
wi0j is the weight in terms of number of observations to
be given to Parent Configuration i0. Then set the Dirichlet
prior Aj = wjt Pr(Xj jpa(Xj ); (j0)). The the expression
to maximize in the M-step becomes:</p>
      <p>X Pr y~i0Xj + ai0j j pa(Xj ) = i0; j
i0
(5)
This provides a semi-Bayesian approach to estimating the
CPT parameters.</p>
      <p>If the discrete partial credit model is used for the
conditional probability table, Equation 5 finds the points on the
logistic curve that will maximize the likelihood of the
observed data. However, the full logistic curve does not enter
the equation, just the points corresponding to possible
configurations of the parent variables. Consider a very simple
model where the target variable has one parent with two
states. Assume that the conditional probabilities for one
outcome level given those states differ by :1. There are
multiple points on the logistic curve that provide that
probability difference, Figure 4 illustrates this. The two points
marked with circles (Z1 and Z2) differ by :1 on the y-axis
and correspond to a solution with a difficulty of zero and a
discrimination of around 0:3. The two points marked with
crosses (ZZ1 and ZZ2) also differ by :1 on the y-axis and
correspond to a solution with a difficulty of 2:2 and a
discrimination of 2:3.</p>
      <p>A fully Bayesian estimation would not have this
identifiability problem: even if the solutions shown in Figure 4
have identical likelihoods, the posterior distribution would
differ, and provide a preference for the solution which is
closer to the normal range of the parameters. Assuming
that the experts have provided initial values for the
parameters, (j0), a weakly informative prior can be created by
using a normal distribution with a mean at (j0) and the
variance chosen so that the 95% interval includes all of the
reasonable values for the parameters. Let the resulting
distribution be ( j ). The M-Step for the fully Bayesian
solution then maximizes:</p>
      <p>X Pr y~i0Xj + ai0j j pa(Xj ) = i0; j
i0
( j ) :
(6)</p>
      <sec id="sec-9-1">
        <title>4.1 IMPLEMENTATION OF THE GENERALIZED</title>
      </sec>
      <sec id="sec-9-2">
        <title>EM ALGORITHM IN PEANUT</title>
        <p>
          The GEMfit method takes advantage of the fact that the
E-step of the EM algorithm is already implemented in
Netica! Netica, like many other Bayes net packages, offers a
version of the EM algorithm for hyper-Dirichlet
distributions described in
          <xref ref-type="bibr" rid="ref20">Spiegelhalter and Lauritzen (1990)</xref>
          . In
Netica, running the learning algorithm on the case file
(basically, a table of values of observable nodes for various
subjects) produces both a new CPT for each node as well
as a set of posterior weights.5 Multiplying the new table
by the prior weights produces the table of expected counts
which is the sufficient statistic required for the E-step.
Figure 5 shows the EM algorithm implementation
in Peanut. Note the most of the work is done
by four generic functions BuildAllTables(),
calcPnetLLike(), calcExpTables(), and
maxAllTableParameters(), allowing the EM
algorithm to be customized by overriding those
methods. The functions BuildAllTables() and
maxAllTableParameters() iterate over the
Parameterized Nodes, allowing for further customization at the
node rather than the net level.
        </p>
        <p>In particular, these functions perform the following roles:
PnetBuildTables() This calculates CPTs for all
Parameterized Nodes and sets the CPT of the node as well as
its prior weight.
calcPnetLLike(cases) This calculates the log likelihood
of the case files using the current parameters. In the
Netica implementation, it loops over the rows in the
case file and calculates the probability of each set of
findings. The log probabilities are added together to
produce the log-likelihood for the current parameter
values.</p>
        <sec id="sec-9-2-1">
          <title>5These are called node experience in Netica.</title>
        </sec>
        <sec id="sec-9-2-2">
          <title>GEMfit &lt;</title>
          <p>f u n c t i o n ( n e t , c a s e s , t o l =1 e 6 ,
m a x i t = 1 0 0 , E s t e p i t = 1 ,</p>
          <p>M s t e p i t = 3 ) f
# # B a s e c a s e
c o n v e r g e d &lt; FALSE
l l i k e &lt; r e p (NA, m a x i t + 1 )
i t e r &lt; 1
# # T h i s n e x t f u n c t i o n s e t s b o t h t h e
# # p r i o r CPTs and t h e p r i o r w e i g h t s
# # f o r e a c h n o d e .</p>
          <p>B u i l d A l l T a b l e s ( n e t )
# # I n i t i a l v a l u e o f l i k e l i h o o d f o r
# # c o n v e r g e n c e t e s t
l l i k e [ i t e r ] &lt;</p>
          <p>c a l c P n e t L L i k e ( n e t , c a s e s )
w h i l e ( ! c o n v e r g e d &amp;&amp; i t e r &lt;= m a x i t ) f
# # E s t e p
c a l c E x p T a b l e s ( n e t , c a s e s ,</p>
          <p>E s t e p i t = E s t e p i t , t o l = t o l )
# # M s t e p
m a x A l l T a b l e P a r a m s ( n e t ,</p>
          <p>M s t e p i t = M s t e p i t , t o l = t o l )
# # U p d a t e p a r a m e t e r s and
# # do c o n v e r g e n c e t e s t
i t e r &lt; i t e r + 1
B u i l d A l l T a b l e s ( n e t )
l l i k e [ i t e r ] &lt;</p>
          <p>c a l c P n e t L L i k e ( n e t , c a s e s )
c o n v e r g e d &lt;
( a b s ( l l i k e [ i t e r ]
l l i k e [ i t e r
1 ] ) &lt; t o l )
g
g
l i s t ( c o n v e r g e d = c o n v e r g e d , i t e r = i t e r ,
l l i k e s = l l i k e s [ 1 : i t e r ] )
calcExpTables(cases) This runs the E-step. In the
Netica implementation it calls Netica’s EM learning
algorithm.
maxAllTableParams() This runs the M-step. It iterates
over the max CPT params() methods for the
Parameterized Node objects. The Netica implementation
calculates the expected table from the current CPT and
posterior weights. It then calls the CPTtools
function mapDPC(). Like calcDPCTable() this
function takes most of the attributes of the Parameterized
Node as arguments. It then finds a new set of
parameters (of the same shape as its input) to fit the expected
table using a gradient decent algorithm.</p>
          <p>Note that neither the EM algorithm embedded in the
Estep (the one run natively in Netica) nor the gradient decent
algorithm in the M-step need to be run to convergence. The
convergence test is done at the level of the whole algorithm
(this makes it a Generalized EM algorithm).</p>
          <p>The prior weight is a tuning parameter for the
algorithm. High values of prior weight will cause the
estimates to shrink towards the prior (expert) values. Low
values of prior weight will cause the estimates to be
unstable when the data are sparse. I have found that values
around 10 provide a good compromise.
5</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>SOFTWARE AVAILABILITY</title>
      <p>The software is available for download from http://
pluto.coe.fsu.edu/RNetica. All of the software
is open source, which also should assist anyone trying to
extend it either to cover more distribution types or different
Bayes net engines; however, RNetica links to the Netica
API which requires a license from Norsys.</p>
      <p>The framework described here is very flexible, perhaps too
flexible. First, it allows an arbitrary number of parameters
for each CPT. There is no guarantee that those parameters
are identifiable from the data. In many cases, the posterior
distribution may look like the prior distribution. Second, R
has a very weak type system. The framework assumes that
the combination rules and link functions are paired with
appropriate parameters (alphas, betas and link scale
parameters). This is not checked until the calcDPCTable
function is called to build the table meaning that errors could be
not caught until the analyst tries to update the network with
data.</p>
      <p>
        Despite these weaknesses, the framework is a flexible one
that supports the most common design patterns used in
educational testing applications
        <xref ref-type="bibr" rid="ref1 ref14 ref2 ref4 ref5">(Mislevy et al., 2003; Almond,
2010; Almond et al., 2013, 2015)</xref>
        . Hopefully, these design
patterns will be useful in other application areas as well. If
not, the open implementation protocol used in the
framework design makes it easy to extend.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Almond</surname>
            ,
            <given-names>R. G.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>'I can name that Bayesian network in two matrixes'</article-title>
          .
          <source>International Journal of Approximate Reasoning</source>
          ,
          <volume>51</volume>
          ,
          <fpage>167</fpage>
          -
          <lpage>178</lpage>
          . Retrieved from http://dx.doi.org/10.1016/ j.ijar.
          <year>2009</year>
          .
          <volume>04</volume>
          .005 doi: 10.1016/j.ijar.
          <year>2009</year>
          .
          <volume>04</volume>
          .005
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Almond</surname>
            ,
            <given-names>R. G.</given-names>
          </string-name>
          (
          <year>2015</year>
          , May).
          <source>RNetica: Binding the Netica API in R (-3</source>
          .4 ed.) [Computer software manual]. Retrieved from http://pluto.coe.fsu .edu/RNetica/RNetica.html (Open source software package)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Almond</surname>
            ,
            <given-names>R. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DiBello</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jenkins</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mislevy</surname>
            ,
            <given-names>R. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senturk</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steinberg</surname>
            ,
            <given-names>L. S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>Models for conditional probability tables in educational assessment</article-title>
          . In T.
          <string-name>
            <surname>Jaakkola</surname>
            &amp;
            <given-names>T</given-names>
          </string-name>
          . Richardson (Eds.),
          <source>Artificial intelligence and statistics</source>
          <year>2001</year>
          (p.
          <fpage>137</fpage>
          -
          <lpage>143</lpage>
          ). Morgan Kaufmann.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Almond</surname>
            ,
            <given-names>R. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shute</surname>
            ,
            <given-names>V. J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ventura</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Debugging the evidence chain</article-title>
          . In R. G. Almond &amp; O.
          <string-name>
            <surname>Mengshoel</surname>
          </string-name>
          (Eds.),
          <article-title>Proceedings of the 2013 uai application workshops: Big data meet complex models and models for spatial, temporal and network data (UAI2013AW) (pp</article-title>
          .
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          ). Aachen. Retrieved from http://ceur-ws .
          <source>org/</source>
          Vol-
          <volume>1024</volume>
          /paper-01.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Almond</surname>
            ,
            <given-names>R. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mislevy</surname>
            ,
            <given-names>R. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steinberg</surname>
            ,
            <given-names>L. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Williamson</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Bayesian networks in educational assessment</article-title>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Becker</surname>
            ,
            <given-names>R. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chambers</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wilks</surname>
            ,
            <given-names>A. R.</given-names>
          </string-name>
          (
          <year>1988</year>
          ).
          <article-title>The new S language: a programming environment for data analysis and graphics</article-title>
          . Wadworth &amp; Brooks/Cole.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Dempster</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laird</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rubin</surname>
            ,
            <given-names>D. B.</given-names>
          </string-name>
          (
          <year>1977</year>
          ).
          <article-title>Maximum likelihood from incomplete data via the em algorithm</article-title>
          .
          <source>JRSS B</source>
          ,
          <volume>39</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>D´ıez</surname>
            ,
            <given-names>F. J.</given-names>
          </string-name>
          (
          <year>1993</year>
          ).
          <article-title>Parameter adjustment in Bayes networks. the generalized noisy or-gate</article-title>
          . In D. Heckerman &amp; A.
          <string-name>
            <surname>Mamdani</surname>
          </string-name>
          (Eds.),
          <source>In uncertainty in artificial intelligence 93</source>
          (pp.
          <fpage>99</fpage>
          -
          <lpage>105</lpage>
          ). MorganKaufmann.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levina</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michailidis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Graphical models for ordinal data</article-title>
          .
          <source>Journal of Computational and Graphical Statistics</source>
          ,
          <volume>24</volume>
          (
          <issue>1</issue>
          ),
          <fpage>183</fpage>
          -
          <lpage>204</lpage>
          . Retrieved from http://dx.doi.org/10 .1080/10618600.
          <year>2014</year>
          .889023 doi: 10 .1080/10618600.
          <year>2014</year>
          .889023
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Hambleton</surname>
            ,
            <given-names>R. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Swaminathan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>H. J.</given-names>
          </string-name>
          (
          <year>1991</year>
          ).
          <article-title>Fundamentals of item response theory</article-title>
          . Sage.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Maeda</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kiczales</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>Open implementation analysis and design</article-title>
          .
          <source>In Ssr '97: Proceedings of the 1997 symposium on software reusability</source>
          (pp.
          <fpage>44</fpage>
          -
          <lpage>52</lpage>
          ). New York, NY, USA: ACM. doi: http://doi.acm.
          <source>org/10</source>
          .1145/258366.258383
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>McCullagh</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Nelder</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          (
          <year>1989</year>
          ).
          <article-title>Generalized linear models</article-title>
          .
          <source>(2nd edition)</source>
          .
          <source>Chapman and Hall.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>X.-L.</given-names>
          </string-name>
          , &amp; van
          <string-name>
            <surname>Dyk</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>The EM algorithm - an old folk-song sung to a fast new tune</article-title>
          .
          <source>Journal of the Royal Statsitical Society</source>
          , Series B,
          <volume>59</volume>
          (
          <issue>3</issue>
          ),
          <fpage>511</fpage>
          -
          <lpage>567</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Mislevy</surname>
            ,
            <given-names>R. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamel</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fried</surname>
            ,
            <given-names>R. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaffney</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haertel</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hafter</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , . . .
          <string-name>
            <surname>Wenk</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Design patterns for assessing science inquiry (</article-title>
          <source>PADI Technical Report No. 1)</source>
          . SRI International. Retrieved from http://padi.sri.com/ publications.html
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Muraki</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>1992</year>
          ).
          <article-title>A generalized partial credit model: Application of an em algorithm</article-title>
          .
          <source>Applied Psychological Measurement</source>
          ,
          <volume>16</volume>
          (
          <issue>2</issue>
          ),
          <fpage>159</fpage>
          -
          <lpage>176</lpage>
          . doi:
          <volume>10</volume>
          .1177/ 014662169201600206
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Norsys.</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Netica-c programmer's module (5</article-title>
          .04 ed.) [Computer software manual]. Retrieved from http://norsys.com/netica c api.
          <source>htm (Bayesian Network Software)</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>R Core</given-names>
            <surname>Team</surname>
          </string-name>
          . (
          <year>2015</year>
          ).
          <article-title>R: A language and environment for statistical computing [Computer software manual]</article-title>
          . Vienna, Austria. Retrieved from http:// www.R-project.org/
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          R Special Interest Group on Databases. (
          <year>2013</year>
          ).
          <article-title>DBI: R database interface [Computer software manual]</article-title>
          . Retrieved from http://CRAN.R-project.
          <source>org/ package=DBI (R package version 0.2-7)</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Samejima</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>1969</year>
          ).
          <article-title>Estimation of latent ability using a response pattern of graded scores</article-title>
          .
          <source>Psychometrika Monograph No. 17</source>
          ,
          <issue>34</issue>
          (
          <issue>4</issue>
          ),
          <source>(Part 2).</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Spiegelhalter</surname>
            ,
            <given-names>D. J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lauritzen</surname>
            ,
            <given-names>S. L.</given-names>
          </string-name>
          (
          <year>1990</year>
          ).
          <article-title>Sequential updating of conditional probabilities on directed graphical structures</article-title>
          .
          <source>Networks</source>
          ,
          <volume>20</volume>
          ,
          <fpage>579</fpage>
          -
          <lpage>605</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Srinivas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>1993</year>
          ).
          <article-title>A generalization of the noisy-or model, the generalized noisy or-gate</article-title>
          . In D. Heckerman &amp; A.
          <string-name>
            <surname>Mamdani</surname>
          </string-name>
          (Eds.),
          <source>Uncertainty in artificial intelligence '93</source>
          (pp.
          <fpage>208</fpage>
          -
          <lpage>215</lpage>
          ). Morgan-Kaufmann.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Tatsuoka</surname>
            ,
            <given-names>K. K.</given-names>
          </string-name>
          (
          <year>1984</year>
          ).
          <article-title>Analysis of errors in fraction addition and subtraction problems (Vol. 20; NIE Final report No</article-title>
          . NIE-G-
          <volume>81</volume>
          -002). University of Illinois,
          <source>Computer-based Education Research.</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>van der Gaag</surname>
            ,
            <given-names>L. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bodlaender</surname>
            ,
            <given-names>H. L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Feelders</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Monotonicity in Bayesian networks</article-title>
          . In M. Chickering &amp; J.
          <string-name>
            <surname>Halpern</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the twentieth conference on uncertainty in artificial intelligence</source>
          (pp.
          <fpage>569</fpage>
          -
          <lpage>576</lpage>
          ). AUAI.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>von Davier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>A general diagnostic model applied to language testing data</article-title>
          .
          <source>British Journal of Mathematical and Statistical Psychology</source>
          ,
          <volume>61</volume>
          ,
          <fpage>287</fpage>
          -
          <lpage>307</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>