<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Constructing Bayesian Networks for Linear Dynamic Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sander Evers Peter J.F. Lucas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Radboud University Nijmegen Radboud University Nijmegen Inst. for Computing and Information Sciences Inst. for Computing and Information Sciences Nijmegen</institution>
          ,
          <addr-line>the Netherlands Nijmegen</addr-line>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Building probabilistic models for industrial applications cannot be done e ectively without making use of knowledge engineering methods that are geared to the industrial setting. In this paper, we build on well-known modelling methods from linear dynamic system theory as commonly used by the engineering community to facilitate the systematic creation of probabilistic graphical models. In particular, we explore a direction of research where the parameters of a linear dynamic system are assumed to be uncertain. As a case study, the heating process of paper in a printer is modelled. Di erent options for the representation, inference and learning of such a system are discussed, and experimental results obtained by this approach are presented. We conclude that the methods developed in this paper o er an attractive foundation for a methodology for building industrial, probabilistic graphical models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        As part of the manual construction process of a
Bayesian network for an actual problem one needs to
somehow translate knowledge that is available in the
problem domain to an appropriate graphical structure.
Problem characteristics also play a major role in the
choice of the type of the associated local probability
distributions. The whole process can be looked on as a
form of knowledge engineering, where the actual
construction of the Bayesian network is only one of the
many needed activities in the development process.
The acquisition of knowledge from domain experts is
traditionally seen as one of the most important
bottlenecks in knowledge engineering. It is well known
that this can be greatly alleviated if there is an easy
mapping from the informal ways domain knowledge
is described, in documents or verbally by experts, to
the Bayesian network formalism [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A typical
example of such a mapping is the exploitation of available
causal knowledge in a particular domain; often, causal
knowledge can be easily translated to an initial graph
structure of a Bayesian network, which can be re ned
later, for example by examining the conditional
independence relationship represented by the resulting
network. Fields where causal knowledge has been
successfully used in knowledge engineering for Bayesian
networks include medicine and biology.
      </p>
      <p>However, for industrial applications the situation is
somewhat di erent. This is mainly because in time,
engineers have developed their own notational
conventions and formalisms to get a grip on the domain of
concern in the system-development process. In
addition, industrial artifacts are designed by humans, and
already early in the design process models are available
that also can be used for other purposes: model-based
design and development is here the central paradigm.
Thus, rather than replacing methods from
engineering by some new and unrelated methods, a better
option seems to be to deploy as many of the
engineering principles, methods, and assumptions as possible.
Linearity is one of the assumptions frequently used,
as it facilitates the development of complex models as
industrial applications often are. Another commonly
used method is the use of diagrams that act as
abstractions of the system being developed. Diagrams
act as important means for communication. Ideally,
one would like to use similar diagrams as a starting
point for building Bayesian networks or related
probabilistic graphical models.</p>
      <p>
        In this paper we explore these ideas by taking Linear
Dynamic Systems, LDS for short, as a start for the
construction process. LDS models enjoy a well-developed
theory and practice, as they are widely used
throughout many engineering disciplines for tracking system
behaviour (cf. the well-known Kalman lter [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) as well
as for controlling this behaviour. Like Bayesian
networks, an LDS can often be represented by a graphical
diagram, which facilitates documentation and
communication of the model among experts and non-experts.
Although an LDS is deterministic in nature, it is
often used in situations that involve uncertainty. The
Kalman lter is the canonical example here: given
some noisy observations, it determines the expected
current state of the system (mean and variance).
However, the Kalman lter only accounts for one speci c
type of uncertainty: additive linear Gaussian noise on
the state and output variables.
      </p>
      <p>
        In this article, we explore a di erent direction of
augmenting an LDS with probabilities: we regard the
parameters as unknown. This allows us for example to
model a printer that heats di erent types of paper, in
which it is uncertain what the current type is.
Previous Bayesian networks modelling this situation were
developed in a laborious ad hoc manner by close
cooperation of domain experts and probabilistic modelling
experts [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; in this article we aim for a more systematic
approach.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>LDS models and their role in the engineering process</title>
      <p>2.1</p>
      <sec id="sec-2-1">
        <title>Basic de nitions</title>
        <p>In its most basic form, a Linear Dynamic System or
LDS describes a system's behaviour by the di erential
equation</p>
        <p>ddt x(t) = Ax(t) + Bu(t)
known as the state-space representation. Here, vector
x(t) represents the system's state at time t, u(t) its
input at time t, and matrices A and B describe how the
current state change depends linearly on the current
state and input.</p>
        <p>This is a continuous-time representation; in order to
calculate with LDS models, time is usually discretized.
In this article, we therefore use the simple discretized
model
xt+1</p>
        <p>xt = Axt + But
in which the size of the discretization steps
conveniently equals the unit of the time domain in order
to simplify the exposition (in practice one can use
discretization steps of size t and scale the A and B
matrices with this factor). The equation can then be
rewritten to:
xt+1 = Adxt + Bdut</p>
        <p>Ad = A + I
Bd = B
(1)
r
r0</p>
        <p>Ptin
c
c0
In the engineering process, LDS models of systems are
often represented by means of diagrams. We exemplify
the role of these models and diagrams using a case
study which remains our running example thoughout
the paper. The case study originates from a
manufacturer of industrial printers. To ensure print quality,
paper needs to be heated to a certain temperature, which
is accomplished by passing the paper along a metal
heater plate. It is quite important that the paper
reaches the right temperature. If it is too cold, print
quality su ers; if it is too hot, energy (and money)
is wasted or worse: the printer might malfunction.
Therefore, engineers have put a lot of e ort in the
accurate modelling of the heating process. This results
in models such as Fig. 1, in which the heater is
modelled as two distinct heat masses: when the heater is
powered, the rst mass is directly heated, thereby
indirectly heating the second mass, which transfers the
heat to the paper.1 In the diagram, the heating
dynamics are represented as an electrical circuit, where
temperature plays the part of voltage and heat ow
that of current. A diagram like this has important
advantages for the engineering process:</p>
        <p>It is very well suited for documentation and
communication. A trained engineer can read the
system's basic dynamic behaviour o this diagram
in a blink of an eye; for a non-expert with a
science background it is relatively easy to gain some
intuition.</p>
        <p>It has a formal meaning as an LDS; it
translates into the state-space equations in the form
of Eq. (1), connecting it to a vast body of
theoretical and practical knowledge.</p>
        <p>It separates qualitative and quantitative aspects of
the model; the former are determined by the
diagram structure, the latter by the parameters.
It is composable: other models like this can be
developed independently and joined into a larger
system.</p>
        <p>
          It is supported by software: drawing and
managing modules of electrical circuits (and also
other graphical forms like bond graphs [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and
schematic diagrams) can be done by tools like
20sim [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which can also perform the translation
to the state-space representation. This
representation can be used for simulation, e.g. in
MATLAB.
        </p>
        <p>However, it is con ned to modelling deterministic
behaviour. In the realm of probabilistic modelling, the
formalism of Bayesian networks shares the above
attractive properties. A natural question is therefore:
how can we combine these well-known LDS models
with Bayesian networks?
Speci cally, this paper will explore the situation where
the parameters of the system (in this case: r; c; r0; c0)
involve uncertainty. This direction is induced by the
following use case: the paper heater modelled above
is used with di erent paper types fpt1; pt2; pt3g (for
example: 80 g=m2 A4, 200 g=m2 A4, 80 g=m2 Letter).
We have no direct knowledge about which type is in
the heater, and would therefore like to model it as a
probabilistic variable PT. Each paper type leads to a
di erent value for the system's r0 parameter (the heat
resistance between plate and paper). The question we
1This is known as a lumped element model ; in
contrast, the heat distribution could also be modelled as a
3-dimensional temperature eld over the plate.
X1</p>
        <p>X2</p>
        <p>Y</p>
        <p>Y</p>
        <p>Y
X1</p>
        <p>X2
X1</p>
        <p>X2
ask ourselves is: How can we join the paper type
variable to the LDS model, so we can infer probabilistically
which paper type is in the heater, by observing the T 0
values of the system?
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Augmenting LDS models with uncertainty</title>
      <p>3.1</p>
      <sec id="sec-3-1">
        <title>Bayesian network representation</title>
        <p>
          A Bayesian network B = (G; ) is an acyclic directed
graph G, consisting of nodes and arcs, that is faithful
to a joint probability distribution factored as , which
contains for each node Y (with parents X1; X2; : : :) a
family of local conditional probability density or mass
functions P(Y jX1; X2; : : :). A dynamic Bayesian
network [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is a special case of this, where the nodes are
partitioned in time slices all consisting of the same
structure and distributions. Furthermore, arcs are
only allowed between nodes in the same or adjacent
time slice.
        </p>
        <p>For modelling linear dynamic systems as (dynamic)
Bayesian networks, only the following types of nodes
are needed:</p>
      </sec>
      <sec id="sec-3-2">
        <title>Deterministic nodes: a node Y with parents</title>
        <p>X1; X2; : : : is called deterministic if its conditional
probability distribution is</p>
        <p>P(y jx1; x2; : : :) =
(1 if y = f (x1; x2; : : :)</p>
        <p>0 if y 6= f (x1; x2; : : :)
for a certain function f ; in this article, these
functions are mostly linear, i.e.</p>
        <p>y =
+
x2 + : : :
We use a special notation for these linear
deterministic nodes shown in Fig. 2 (left).</p>
      </sec>
      <sec id="sec-3-3">
        <title>Linear Gaussians: a node Y with</title>
        <p>
          X1; X2; : : : is known in Bayesian
literature as a linear Gaussian if
parents
network
P(Y jx1; x2; : : :) = N ( +
x2 + : : : ; 2)
Networks that consist only of linear Gaussians
(with &gt; 0) have theoretical signi cance: their
joint distribution is multivariate Gaussian, and
exact inference is easy and e cient (e.g. see [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]).
A linear Gaussian without parents N ( ; 2) is
simply called Gaussian; the Gaussian N (0; 1) is
called a standard Gaussian. A linear Gaussian
can be written as a linear deterministic node with
two extra parents; see Fig. 2 (middle).
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Conditional linear Gaussians: a node Y with par</title>
        <p>ents X1; X2; : : : and discrete parent is
conditional linear Gaussian if
P(Y jx1; x2; : : : ; ) = N (
+
x2 +: : : ; 2)
i.e. it is linear Gaussian for each value . If
X1; X2; : : : are Gaussian, the marginal
distribution over Y is a mixture of Gaussians:</p>
        <p>P(Y ) =</p>
        <p>X P( )N (^ ; ^2)
^ =
^2 =
2</p>
        <p>+
2 +</p>
        <p>X1 +
2 X21 +</p>
        <p>X2 + : : :
2 X22 + : : :
Again, this also holds for complete networks: if all
nodes are (conditional) linear Gaussian, the joint
distribution is a mixture of multivariate
Gaussians. However, this number of components in
this mixture is exponential in the number of
variables, which can make inference hard. A
conditional linear Gaussian can also be written as a
deterministic node with extra parents. For this
we use a special notation shown in Fig. 2 (right),
to which we will return later.</p>
        <p>As these three node types can all be written as
deterministic nodes, we will henceforth use the convention
that all non-root nodes in our networks are
deterministic.
3.2</p>
      </sec>
      <sec id="sec-3-5">
        <title>LDS models as Bayesian networks</title>
        <p>The paper heater model in Fig. 1 translates to the
following discrete-time state-space equations in the form
of Eq. (1):</p>
        <p>The state of the system consists of the temperatures Tt
and Tt0 of the two heat masses. In fact, translating the
electrical diagram by tools such as 20-sim rst leads to
a more elaborate form in which auxiliary power
variables are present. It is instructive to represent this
form as a Bayesian network consisting only of linear
deterministic nodes; this is shown at the right side of
Fig. 1. As the network is completely deterministic,
it might also be read as a system of equations over
ordinary variables:</p>
        <p>Each node represents the left-hand side of an
equation, consisting of one variable.</p>
        <p>The incoming arcs represent the right-hand side:
a linear combination of the parent variables, with
coe cients as speci ed on the arcs. Note: we
follow the convention that empty arcs carry a
coe cient of 1.</p>
        <sec id="sec-3-5-1">
          <title>For example, the gure shows that</title>
          <p>Pttrans = 1r Tt +
1
r</p>
          <p>Tt0 =</p>
          <p>Tt</p>
          <p>Tt0
r
Tt+1 = 1c Ptin + Tt +
c
1 Pttrans = Tt +</p>
          <p>P in
t
We will now start to add uncertainty to the LDS. First,
as is often done (e.g. in the Kalman lter), we augment
the state variables with additive zero-mean Gaussian
noise:
The noise is represented by two independent variables
Wt and Wt0. A graphical representation of this system
is shown in Fig. 3; we have only replaced Wt and Wt0 by
two anonymous standard Gaussian variables N (0; 1)
whose value is multiplied by and 0. As a result, the
Tt variables now have the linear Gaussian form from
Fig. 2 (middle).</p>
          <p>In fact, this makes the whole system Gaussian:
although the Pttrans and P out nodes are not linear
Gaussian, we have already seen that they can be
marginalized out, making the Tt+1; Tt0+1 nodes directly depend
on Tt; Tt0. As for the P in node: as it is always used
t
with concrete evidence pitn, it never represents a
probability distribution, and can be reformulated to take</p>
          <p>Tt+1
Tt0+1</p>
          <p>0
N (0; 1)</p>
          <p>R
R0</p>
          <p>Tt</p>
          <p>Tt0</p>
          <p>Tt+1
Tt0+1</p>
          <p>0
N (0; 1)</p>
          <p>C
C0
N (0; I)
the place of in the Tt+1 distribution. Thus, Fig. 3
is a dynamic Bayesian network with a Gaussian joint
distribution. This means that we can use a variant of
the standard Kalman lter to do exact inference on
this network; we discuss this in detail in Sect. 4.
3.3</p>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <title>LDS models with uncertain parameters</title>
        <p>Above, we have shown how to represent an LDS with
additive zero-mean Gaussian noise as a Bayesian
network; while it may be instructive, thus far it is only a
convenient reformulation of known theory. However,
to model the relation with the paper type variable,
we need uncertainty in the parameters, i.e. the coe
cients of the linear relation. Our solution is to
transform these coe cients into probabilistic variables. We
accomplish this by introducing a new node type:</p>
      </sec>
      <sec id="sec-3-7">
        <title>Conditional linear deterministic node: a linear</title>
        <p>deterministic node extended with a special
parent, which is distinguished from the rest because
its arc ends in a diamond (and carries no coe
cient); we call this parent the conditioning parent.
The coe cients on the other arcs can depend on
the value of the conditioning parent. This
dependence is shown by putting a bullet in the place
where this value is to be lled in.</p>
        <p>We have already given an example of such a node:
the conditional linear Gaussian node in Fig. 2. Just
like the linear Gaussian node is an instance of a linear
deterministic node, viz. having speci c parents 1 and
N (0; 1), a conditional linear Gaussian node is a speci c
instance of conditional linear deterministic node.
By using conditional linear deterministic nodes, we can
extend our already noisy paper heater model with
uncertain parameters: we replace parameter r by
variable R|which becomes the conditioning parent of the
node that depended on r|and do the same for the
other parameters. The result is shown in Fig. 4.
We can now proceed to connect a discrete paper type
variable to the model, with an example distribution
assigning to pt1; pt2; pt3 the probabilities 0:3, 0:5 and 0:2.
Like we mentioned, the paper type determines the R0
parameter, but for generalization's sake we will assume
that it in uences all the parameters. The resulting
model, in Fig. 5, also shows that the notation for
(conditional) linear deterministic nodes extends very
naturally to vector-valued variables: coe cients become
matrices. These are the matrices from Eq. (2), written
as functions Ad( ) and Bd( ) of = (r; c; r0; c0). Thus,
we have made a graphical summary of the model which
is linked very clearly to the state-space equations.
Although this hides some of the model's internal
structure, it is useful for keeping an overview.</p>
        <p>Discrete : The paper types pt1; pt2; pt3
deterministically set a value 1; 2; 3 (resp.) for . In
fact, this turns Tt and Tt+1 into conditional
linear Gaussians (conditioned by the paper type), so
the joint distribution is a mixture of 3 Gaussians.
Continuous : The paper types pt1; pt2; pt3
determine the parameters ( 1; 1), ( 2; 2),
( 2; 2) for a Gaussian-distributed . This
model is no longer linear.</p>
        <p>These options have an in uence on inference and
learning in the model, which we discuss in the next sections.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Inference</title>
      <p>In this section, we shortly discuss inference in the
uncertain parameter model of Fig. 4, for both the
discrete and continuous given above. Assume
we observe the system's Tt0 variable responding to
the Ptin input for a while, resulting in data D =
fpi0n; t00; : : : ; pimn 1; t0m 1; t0mg, and the goal is to nd out
the paper type.</p>
      <p>This can be done by a forward pass over the model as
known from dynamic Bayesian network literature. We
start with the prior distribution</p>
      <p>P(t0; ; t00) = P(t0)P( )P(t00)
and perform a recursive forward pass from t = 0 to
t + 1 = m:
Z</p>
      <p>P(tt+1; ; pi0n::t; t00::t+1) =</p>
      <sec id="sec-4-1">
        <title>Finally, we marginalize out tm:</title>
        <p>P( ; D) =</p>
        <p>Z</p>
        <p>P(tm; ; D) dtm
The details of the inference algorithm depend on the
model used. For the discrete , all the distributions
above are linear Gaussian, so we can multiply and
integrate exactly. To be precise, P(tt+1 jtt; t0t; pitn; ) and
P(t0t+1 jtt; t0t; ) are linear Gaussian for each of the 3
individual i values; the algorithm thus independently
works on 3 Gaussians.</p>
        <p>For the continuous , the conditional distributions of
Tt+1 and Tt0+1 are not linear Gaussian; we can do
approximate inference by linearizing these distributions
at each timeslice around the means of Tt; (given the
Regarding the probability distribution of the
able, we give two examples:</p>
        <p>
          P(tt; ; pi0n::t 1; t00::t)P(tt+1 jtt; t0t; pitn; )P(t0t+1 jtt; t0t; ) dtt ing on the model for .
A second type of inference that we do with the model
is smoothing. In particular, we want to calculate
P( ; tt; tt+1 jD) for each timeslice, in order to do EM
learning (see the next section). We have used a
RauchTung-Striebel -type smoother [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. This uses a forward
pass like discussed above, with the adjustment that it
stores the distributions over two time slices. The last
of these, i.e. P(tm 1; tm; ; D), is used as the input for
a recursive backward pass de ned as follows:
        </p>
        <p>P(tt 1; tt; ; D) =</p>
        <p>Z</p>
        <p>P(tt; tt+1; ; D)P(tt 1 jtt; ; D) dtt+1
where the rst factor in the integral is the recursive
one, and the second is calculated from the distribution
over tt 1; tt stored in the forward pass:</p>
        <p>P(tt 1 jtt; ; D) = P(tt 1 jtt; pi0n::t 1; t00::t)</p>
        <p>P(tt 1; tt; pi0n::t 1; t00::t)
= R P(tt 1; tt; pi0n::t 1; t00::t) dtt 1
The advantage of such a smoother over an independent
backward pass is that it does not linearize the
distribution over Tt 1; T in two di erent ways (the backward
pass uses the linearization of the forward pass).
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Learning</title>
      <p>For learning the model, we discuss the situation
where we know the paper type (assume it is pt1)
and observe the system like before, i.e. D =
fpi0n; t00; : : : ; pimn 1; t0m 1; t0mg. The goal is to learn
the parameter set that maximizes the likelihood
P(Djpt1; ). The situation is a little di erent
depend5.1</p>
      <sec id="sec-5-1">
        <title>EM for continuous</title>
        <p>
          For the continuous , the restriction to pt1 means that
we are learning the parameters for one multivariate
Gaussian variable (actually consisting of four
independent variables R, C, R0, C0) and the ; 0
process noise parameters. Thus, = ( 1; 1; ; 0).
This can be done by a standard EM algorithm [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] for
Bayesian networks: given a set of initial parameters
i, the approximate smoother infers the distributions
P(tt; tt+1; jD; pt1; i). From these, the expected
sufcient statistics are gathered for maximizing
        </p>
        <p>P(D; T0::m; jpt1; i+1)
expected under the old parameters i; this is repeated
until convergence.
Note that the role of is di erent here; we are not
learning the optimal parameters for a distribution
over , but the optimal single value. This requires
some adjustments to the smoother: it should store
distributions over (Tt; Tt+1) instead of over (Tt; Tt+1; ),
and should not use a prior distribution over
either. Because all the probability distributions are
linear Gaussian again, smoothing is exact now.
However, maximizing the expected likelihood is not so
trivial now: we are looking for the optimal linear
Gaussian distribution P(tt+1; t0t+1 jtt; t0t; pitn; ) constrained
to a certain form prescribed by A and B. Speci cally,
the log likelihood for an individual time slice is:
log P(tt+1; t0t+1 jtt; t0t; pitn; ) =
where
= h 02 002 i and
1 T
2
1
1</p>
        <p>log j2
2
j
is an abbreviation for
= tt+1
t0t+1</p>
        <p>Ad( ) tt0t
t</p>
        <p>pin
Bd( ) 20t C
Separating variables and parameters, we can write:
= D( )xt
D( ) = I</p>
        <p>Ad( )</p>
        <p>Bd( )
xt = tt+1 t0t+1
tt t0t pitn
The log likelihood for all the time slices (ignoring the
term 21 log j2 j for now) is then:</p>
        <p>log P(D; t0::m j ) =
=
where Di denotes the ith row of D. The goal is to
maximize the expected value of this expression. At
rst sight, it seems that the two terms are dependent
through , but on closer inspection</p>
        <p>D( ) =
h 1 0 1+1=rc
0 1 1=rc0</p>
        <p>1=rc 1=c
1+1=rc0+1=r0c0 0
we see that the values in the rst row do not
constrain those in the second, or vice versa. We can
therefore minimize the expected value of the two
terms independently. We can also see that there are
linear constraints for the values within a row, e.g.
D1;3( )+D1;4( ) = 1. We record these constraints in
a matrix C and vector c such that CD1T ( ) = c.
Substituting d = D1T ( ), for the rst term we are looking
for the d that minimizes</p>
        <p>X
t=0::m</p>
        <p>E(xtT ddT xt) = dT
"</p>
        <p>X
t=0::m</p>
        <p>#
E(xtxtT ) d
under the constraint Cd = c. This is a linearly
constrained quadratic optimization problem that can be
solved by the method of Lagrange multipliers. The
second term can be minimized in the same way.
In conclusion, we have derived the M-phase for the
discrete model; in the E-phase, we therefore have to
collect the expected su cient statistics E(xtxtT ).
5.3</p>
      </sec>
      <sec id="sec-5-2">
        <title>Comparison</title>
        <p>It is interesting to compare learning for continuous and
discrete . In order to do this, we have simulated the
system in Fig. 1 for 150 time slices, with a sine wave
as P in input and random Gaussian disturbance. We
provide the EM algorithms discussed above with the
P in and the generated Tt0 data. Typical results of a
60-iterations run are shown in Fig. 6.</p>
        <p>The most interesting fact to observe is that the two
approaches converge to di erent values (but the same
log likelihood). This probably means that the system
is not identi able: several choices for lead to the
same likelihood of the observed behaviour Tt0. To test
this hypothesis, we also generated synthetic Tt; Tt0 data
(without disturbance) for systems with the learned
parameters. The results are plotted in Fig. 7. These
results indeed show that both methods arrive at the
same approximation (green, red) of the original (blue)
Tt0 data; however, the di erent values for the
parameters lead to a di erent behavior of Tt.</p>
        <p>
          A second observation from Fig. 6 is that learning
continuous converges faster than learning discrete .
The explanation for this is as follows: the EM
algorithm for the continuous parameter space uses
inference to compute a posterior distribution over the
variable . In this algorithm, the posterior distribution
is updated for each time slice. However, we can also
regard the algorithm as doing full Bayesian learning
where is viewed as a parameter; the algorithm is
then performing incremental EM [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which is known
to converge faster.
6
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>The central scienti c hypothesis which initiated the
research described in this paper was that knowledge
engineering methods for industrial applications of
probabilistic graphical models should be based as much as
0.0135
4000 10 20 30 40 50 60 104.00030 10 20 30 40 50 60</p>
      <p>C C’
2000 10 20 30 40 50 60 8000 10 20 30 40 50 60
T1 0 log lik
R
possible on existing methods from engineering. We
have developed a systematic framework, where we
start with linear system theory and associated
diagrams and notations as the rst step in the
knowledge engineering process. The advantage of this
approach is that engineers have already dealt with the
unavoidable complexity issues in representing realistic
models. Subsequently, it was shown how linear
dynamic system models can be augmented with
probabilistic variables for uncertain parameters,
transforming them into dynamic Bayesian networks with
conditionally linear nodes. We introduced a concise
notation that combines LDS and Bayesian network
concepts in a natural way and demonstrated methods for
inference and learning from data in these models. The
practical usefulness of the framework was illustrated
by a case study from the domain of large production
printers.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work has been carried out as part of the
OCTOPUS project under the responsibility of the
Embedded Systems Institute. This project is partially
supported by the Netherlands Ministry of Economic
A airs under the Embedded Systems Institute
program.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Dean</surname>
          </string-name>
          and
          <string-name>
            <given-names>Keiji</given-names>
            <surname>Kanazawa</surname>
          </string-name>
          .
          <article-title>A model for reasoning about persistence and causation</article-title>
          .
          <source>Computational Intelligence</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <volume>142</volume>
          {
          <fpage>150</fpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Dempster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Laird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and D. B.</given-names>
            <surname>Rubin</surname>
          </string-name>
          .
          <article-title>Maximum likelihood from incomplete data via the EM algorithm</article-title>
          .
          <source>Journal of the Royal Statistical Society. Series B (Methodological)</source>
          ,
          <volume>39</volume>
          (
          <issue>1</issue>
          ):1{
          <fpage>38</fpage>
          ,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.J.</given-names>
            <surname>Hommersom</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.J.F.</given-names>
            <surname>Lucas</surname>
          </string-name>
          .
          <article-title>Using Bayesian networks in an industrial setting: Making printing systems adaptive</article-title>
          .
          <source>In 19th European Conference on Articial Intelligence</source>
          , pages
          <fpage>401</fpage>
          {
          <fpage>406</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.E.</given-names>
            <surname>Kalman</surname>
          </string-name>
          et al.
          <article-title>A new approach to linear ltering and prediction problems</article-title>
          .
          <source>Journal of Basic Engineering</source>
          ,
          <volume>82</volume>
          (
          <issue>1</issue>
          ):
          <volume>35</volume>
          {
          <fpage>45</fpage>
          ,
          <year>1960</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.C.</given-names>
            <surname>Karnopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.L.</given-names>
            <surname>Margolis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.C.</given-names>
            <surname>Rosenberg</surname>
          </string-name>
          .
          <article-title>System dynamics: modeling and simulation of mechatronic systems</article-title>
          . John Wiley &amp; Sons,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Koller</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <article-title>Probabilistic graphical models: Principles and techniques</article-title>
          . The MIT Press,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.B.</given-names>
            <surname>Korb</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.E.</given-names>
            <surname>Nicholson</surname>
          </string-name>
          .
          <source>Bayesian Arti cial Intelligence</source>
          . Chapman &amp; Hall/CRC,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Radford</given-names>
            <surname>Neal</surname>
          </string-name>
          and Geo rey
          <string-name>
            <given-names>E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>A view of the EM algorithm that justi es incremental, sparse, and other variants</article-title>
          . In M.I. Jordan, editor,
          <source>Learning in Graphical Models</source>
          , pages
          <volume>355</volume>
          {
          <fpage>368</fpage>
          . Kluwer Academic Publishers,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.E.</given-names>
            <surname>Rauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tung</surname>
          </string-name>
          , and
          <string-name>
            <given-names>CT</given-names>
            <surname>Striebel</surname>
          </string-name>
          .
          <article-title>Maximum likelihood estimates of linear dynamic systems</article-title>
          .
          <source>AIAA journal</source>
          ,
          <volume>3</volume>
          (
          <issue>8</issue>
          ):
          <volume>1445</volume>
          {
          <fpage>1450</fpage>
          ,
          <year>1965</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.W.</given-names>
            <surname>Sorenson</surname>
          </string-name>
          .
          <article-title>Kalman ltering: theory and application</article-title>
          . IEEE,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] http://www.20sim.com/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>