Constructing Bayesian Networks for Linear Dynamic Systems

Sander Evers and Peter J.F. Lucas
Institute for Computing and Information Sciences, Radboud University Nijmegen,
Nijmegen, the Netherlands



Abstract

Building probabilistic models for industrial applications cannot be done effectively without making use of knowledge engineering methods that are geared to the industrial setting. In this paper, we build on well-known modelling methods from linear dynamic system theory as commonly used by the engineering community to facilitate the systematic creation of probabilistic graphical models. In particular, we explore a direction of research where the parameters of a linear dynamic system are assumed to be uncertain. As a case study, the heating process of paper in a printer is modelled. Different options for the representation, inference and learning of such a system are discussed, and experimental results obtained by this approach are presented. We conclude that the methods developed in this paper offer an attractive foundation for a methodology for building industrial, probabilistic graphical models.

1 Introduction

As part of the manual construction process of a Bayesian network for an actual problem, one needs to somehow translate knowledge that is available in the problem domain to an appropriate graphical structure. Problem characteristics also play a major role in the choice of the type of the associated local probability distributions. The whole process can be looked upon as a form of knowledge engineering, where the actual construction of the Bayesian network is only one of the many activities needed in the development process.

The acquisition of knowledge from domain experts is traditionally seen as one of the most important bottlenecks in knowledge engineering. It is well known that this can be greatly alleviated if there is an easy mapping from the informal ways domain knowledge is described, in documents or verbally by experts, to the Bayesian network formalism [7]. A typical example of such a mapping is the exploitation of available causal knowledge in a particular domain; often, causal knowledge can be easily translated to an initial graph structure of a Bayesian network, which can be refined later, for example by examining the conditional independence relationships represented by the resulting network. Fields where causal knowledge has been successfully used in knowledge engineering for Bayesian networks include medicine and biology.

However, for industrial applications the situation is somewhat different. This is mainly because, over time, engineers have developed their own notational conventions and formalisms to get a grip on the domain of concern in the system-development process. In addition, industrial artifacts are designed by humans, and already early in the design process models are available that can also be used for other purposes: model-based design and development is here the central paradigm. Thus, rather than replacing methods from engineering by new and unrelated ones, a better option seems to be to deploy as many of the engineering principles, methods, and assumptions as possible. Linearity is one of the assumptions frequently used, as it facilitates the development of the complex models that industrial applications often require. Another commonly used method is the use of diagrams that act as abstractions of the system being developed; diagrams are an important means of communication. Ideally, one would like to use similar diagrams as a starting point for building Bayesian networks or related probabilistic graphical models.

In this paper we explore these ideas by taking Linear Dynamic Systems, LDS for short, as the starting point for the construction process. LDS models enjoy a well-developed theory and practice, as they are widely used throughout many engineering disciplines for tracking system behaviour (cf. the well-known Kalman filter [4]) as well as for controlling this behaviour.
Like Bayesian networks, an LDS can often be represented by a graphical diagram, which facilitates documentation and communication of the model among experts and non-experts.

Although an LDS is deterministic in nature, it is often used in situations that involve uncertainty. The Kalman filter is the canonical example here: given some noisy observations, it determines the expected current state of the system (mean and variance). However, the Kalman filter only accounts for one specific type of uncertainty: additive linear Gaussian noise on the state and output variables.

In this article, we explore a different direction of augmenting an LDS with probabilities: we regard the parameters as unknown. This allows us, for example, to model a printer that heats different types of paper, in which it is uncertain what the current type is. Previous Bayesian networks modelling this situation were developed in a laborious ad hoc manner by close cooperation of domain experts and probabilistic modelling experts [3]; in this article we aim for a more systematic approach.

[Figure 1: LDS model of paper heater. Left: as an electrical diagram, where voltage represents temperature and current represents heat flow. From top to bottom, the diagram consists of (1) a time-variable heat source, (2) a heat mass with capacity c, (3) a heat resistance r, (4) a heat mass with capacity c', (5) a heat resistance r', (6) a heat sink with temperature 20 °C modelling the constant flow of paper. Right: as a graphical representation of the state-space equations for two discrete time points t, t+1. The state of the system is described by the T variables, which represent the temperatures of the two heat masses. Its input is P^in_t, the power entering the system. Auxiliary variables represent the power flowing from the first heat mass to the second (P^trans_t), and from the second to the heat sink (P^out_t). Note: the parameters of the system are r, c, r', c' (instantiated with concrete values in a real system).]
2 LDS models and their role in the engineering process

2.1 Basic definitions

In its most basic form, a Linear Dynamic System or LDS describes a system's behaviour by the differential equation

$$\frac{d}{dt}\,x(t) = A\,x(t) + B\,u(t)$$

known as the state-space representation. Here, vector x(t) represents the system's state at time t, u(t) its input at time t, and matrices A and B describe how the change of the state depends linearly on the current state and input.

This is a continuous-time representation; in order to calculate with LDS models, time is usually discretized. In this article, we therefore use the simple discretized model

$$x_{t+1} - x_t = A\,x_t + B\,u_t$$

in which the size of the discretization steps conveniently equals the unit of the time domain in order to simplify the exposition (in practice one can use discretization steps of size $\Delta t$ and scale the A and B matrices with this factor). The equation can then be rewritten to:

$$x_{t+1} = A_d\,x_t + B_d\,u_t, \qquad A_d = A + I, \qquad B_d = B \tag{1}$$
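To make the discretization concrete, here is a small simulation sketch of ours (not from the paper): it applies the update of Eq. (1) to a hypothetical two-state system, including the $\Delta t$ scaling mentioned above; all matrices and values are invented.

```python
import numpy as np

# Hypothetical continuous-time matrices, chosen only for illustration.
A = np.array([[-0.5,  0.5],
              [ 0.3, -0.8]])
B = np.array([[1.0, 0.0],
              [0.0, 0.2]])

dt = 0.1                     # discretization step size (Delta t)
Ad = np.eye(2) + dt * A      # A_d = I + (Delta t) A; Eq. (1) takes Delta t = 1
Bd = dt * B                  # B_d = (Delta t) B

x = np.zeros(2)              # initial state x_0
for t in range(100):
    u = np.array([np.sin(0.1 * t), 20.0])  # input u_t (e.g. power, ambient temp)
    x = Ad @ x + Bd @ u                    # x_{t+1} = A_d x_t + B_d u_t
```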
2.2 Role in engineering

In the engineering process, LDS models of systems are often represented by means of diagrams. We exemplify the role of these models and diagrams using a case study which remains our running example throughout the paper. The case study originates from a manufacturer of industrial printers. To ensure print quality, paper needs to be heated to a certain temperature, which is accomplished by passing the paper along a metal heater plate. It is quite important that the paper reaches the right temperature: if it is too cold, print quality suffers; if it is too hot, energy (and money) is wasted or, worse, the printer might malfunction. Therefore, engineers have put a lot of effort into the accurate modelling of the heating process. This results in models such as Fig. 1, in which the heater is modelled as two distinct heat masses: when the heater is powered, the first mass is directly heated, thereby indirectly heating the second mass, which transfers the heat to the paper.¹ In the diagram, the heating dynamics are represented as an electrical circuit, where temperature plays the part of voltage and heat flow that of current. A diagram like this has important advantages for the engineering process:

• It is very well suited for documentation and communication. A trained engineer can read the system's basic dynamic behaviour off this diagram in the blink of an eye; for a non-expert with a science background it is relatively easy to gain some intuition.

• It has a formal meaning as an LDS; it translates into the state-space equations in the form of Eq. (1), connecting it to a vast body of theoretical and practical knowledge.

• It separates qualitative and quantitative aspects of the model; the former are determined by the diagram structure, the latter by the parameters.

• It is composable: other models like this can be developed independently and joined into a larger system.

• It is supported by software: drawing and managing modules of electrical circuits (and also other graphical forms like bond graphs [5] and schematic diagrams) can be done by tools like 20-sim [11], which can also perform the translation to the state-space representation. This representation can be used for simulation, e.g. in MATLAB.

¹ This is known as a lumped element model; in contrast, the heat distribution could also be modelled as a 3-dimensional temperature field over the plate.

However, this formalism is confined to modelling deterministic behaviour. In the realm of probabilistic modelling, the formalism of Bayesian networks shares the above attractive properties. A natural question is therefore: how can we combine these well-known LDS models with Bayesian networks?

Specifically, this paper will explore the situation where the parameters of the system (in this case: r, c, r', c') involve uncertainty. This direction is induced by the following use case: the paper heater modelled above is used with different paper types {pt1, pt2, pt3} (for example: 80 g/m² A4, 200 g/m² A4, 80 g/m² Letter). We have no direct knowledge about which type is in the heater, and would therefore like to model it as a probabilistic variable PT. Each paper type leads to a different value for the system's r' parameter (the heat resistance between plate and paper). The question we ask ourselves is: how can we join the paper type variable to the LDS model, so that we can infer probabilistically which paper type is in the heater, by observing the T' values of the system?

[Figure 2: The three basic types of Bayesian network nodes we will use for LDS models. Left: linear deterministic node, P(y|x1, x2) = 1 if y = µ + αx1 + βx2. Middle: linear Gaussian node, P(Y|x1, x2) = N(µ + αx1 + βx2, σ²). Right: conditional linear Gaussian node, P(Y|x1, x2, θ) = N(µ_θ + α_θ x1 + β_θ x2, σ_θ²) (for discrete Θ). The notation is ours and is introduced in Section 3.3.]

3 Augmenting LDS models with uncertainty

3.1 Bayesian network representation

A Bayesian network B = (G, Φ) is an acyclic directed graph G, consisting of nodes and arcs, that is faithful to a joint probability distribution factored as Φ, which contains for each node Y (with parents X1, X2, ...) a family of local conditional probability density or mass functions P(Y|X1, X2, ...). A dynamic Bayesian network [1] is a special case of this, where the nodes are partitioned into time slices, all consisting of the same structure and distributions. Furthermore, arcs are only allowed between nodes in the same or adjacent time slices.

For modelling linear dynamic systems as (dynamic) Bayesian networks, only the following types of nodes are needed:

Deterministic nodes: a node Y with parents X1, X2, ... is called deterministic if its conditional probability distribution is

$$P(y \mid x_1, x_2, \ldots) = \begin{cases} 1 & \text{if } y = f(x_1, x_2, \ldots) \\ 0 & \text{if } y \neq f(x_1, x_2, \ldots) \end{cases}$$

for a certain function f; in this article, these functions are mostly linear, i.e.

$$y = \mu + \alpha x_1 + \beta x_2 + \ldots$$

We use a special notation for these linear deterministic nodes, shown in Fig. 2 (left).
Linear Gaussians: a node Y with parents X1, X2, ... is known in the Bayesian network literature as a linear Gaussian if

$$P(Y \mid x_1, x_2, \ldots) = N(\mu + \alpha x_1 + \beta x_2 + \ldots,\; \sigma^2)$$

Networks that consist only of linear Gaussians (with σ > 0) have theoretical significance: their joint distribution is multivariate Gaussian, and exact inference is easy and efficient (e.g. see [6]). A linear Gaussian without parents, N(µ, σ²), is simply called Gaussian; the Gaussian N(0, 1) is called a standard Gaussian. A linear Gaussian can be written as a linear deterministic node with two extra parents; see Fig. 2 (middle).

Conditional linear Gaussians: a node Y with parents X1, X2, ... and discrete parent Θ is conditional linear Gaussian if

$$P(Y \mid x_1, x_2, \ldots, \theta) = N(\mu_\theta + \alpha_\theta x_1 + \beta_\theta x_2 + \ldots,\; \sigma_\theta^2)$$

i.e. it is linear Gaussian for each value θ. If X1, X2, ... are independent Gaussians, the marginal distribution over Y is a mixture of Gaussians:

$$P(Y) = \sum_{\theta \in \Theta} P(\theta)\, N(\hat\mu_\theta, \hat\sigma_\theta^2)$$
$$\hat\mu_\theta = \mu_\theta + \alpha_\theta \mu_{X_1} + \beta_\theta \mu_{X_2} + \ldots$$
$$\hat\sigma_\theta^2 = \sigma_\theta^2 + \alpha_\theta^2 \sigma_{X_1}^2 + \beta_\theta^2 \sigma_{X_2}^2 + \ldots$$

Again, this also holds for complete networks: if all nodes are (conditional) linear Gaussian, the joint distribution is a mixture of multivariate Gaussians. However, the number of components in this mixture is exponential in the number of Θ variables, which can make inference hard. A conditional linear Gaussian can also be written as a deterministic node with extra parents. For this we use a special notation, shown in Fig. 2 (right), to which we will return later.

As these three node types can all be written as deterministic nodes, we will henceforth use the convention that all non-root nodes in our networks are deterministic.
3.2 LDS models as Bayesian networks

The paper heater model in Fig. 1 translates to the following discrete-time state-space equations in the form of Eq. (1):

$$\begin{pmatrix} T_{t+1} \\ T'_{t+1} \end{pmatrix} = A_d \begin{pmatrix} T_t \\ T'_t \end{pmatrix} + B_d \begin{pmatrix} P^{in}_t \\ 20\,^\circ C \end{pmatrix}$$
$$A_d = \begin{pmatrix} 1 - 1/rc & 1/rc \\ 1/rc' & 1 - 1/rc' - 1/r'c' \end{pmatrix}, \qquad B_d = \begin{pmatrix} 1/c & 0 \\ 0 & 1/r'c' \end{pmatrix} \tag{2}$$

The state of the system consists of the temperatures T_t and T'_t of the two heat masses. In fact, translating the electrical diagram by tools such as 20-sim first leads to a more elaborate form in which auxiliary power variables are present. It is instructive to represent this form as a Bayesian network consisting only of linear deterministic nodes; this is shown at the right side of Fig. 1. As the network is completely deterministic, it might also be read as a system of equations over ordinary variables:

• Each node represents the left-hand side of an equation, consisting of one variable.

• The incoming arcs represent the right-hand side: a linear combination of the parent variables, with coefficients as specified on the arcs. Note: we follow the convention that empty arcs carry a coefficient of 1.

For example, the figure shows that

$$P^{trans}_t = \frac{1}{r}\,T_t + \frac{-1}{r}\,T'_t = \frac{T_t - T'_t}{r}$$
$$T_{t+1} = \frac{1}{c}\,P^{in}_t + T_t + \frac{-1}{c}\,P^{trans}_t = T_t + \frac{P^{in}_t - P^{trans}_t}{c}$$

The state-space equations (2) are obtained from this system by substituting the P^trans and P^out variables for their right-hand sides. Interpreted as a Bayesian network, this corresponds to marginalization.

We will now start to add uncertainty to the LDS. First, as is often done (e.g. in the Kalman filter), we augment the state variables with additive zero-mean Gaussian noise:

$$\begin{pmatrix} T_{t+1} \\ T'_{t+1} \end{pmatrix} = A_d \begin{pmatrix} T_t \\ T'_t \end{pmatrix} + B_d \begin{pmatrix} P^{in}_t \\ 20\,^\circ C \end{pmatrix} + \begin{pmatrix} W_t \\ W'_t \end{pmatrix}, \qquad W_t \sim N(0, \sigma^2), \quad W'_t \sim N(0, \sigma'^2) \tag{3}$$

The noise is represented by two independent variables W_t and W'_t. A graphical representation of this system is shown in Fig. 3; we have only replaced W_t and W'_t by two anonymous standard Gaussian variables N(0, 1) whose value is multiplied by σ and σ'. As a result, the T_t variables now have the linear Gaussian form from Fig. 2 (middle).

In fact, this makes the whole system Gaussian: although the P^trans_t and P^out_t nodes are not linear Gaussian, we have already seen that they can be marginalized out, making the T_{t+1}, T'_{t+1} nodes directly depend on T_t, T'_t. As for the P^in_t node: as it is always used with concrete evidence p^in_t, it never represents a probability distribution, and can be reformulated to take the place of µ in the T_{t+1} distribution. Thus, Fig. 3 is a dynamic Bayesian network with a Gaussian joint distribution. This means that we can use a variant of the standard Kalman filter to do exact inference on this network; we discuss this in detail in Sect. 4.
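As a concrete rendering of Eqs. (2) and (3), the following sketch (ours; the parameter and noise values are invented, not taken from the printer) builds A_d and B_d from θ = (r, c, r', c') and simulates the noisy system.

```python
import numpy as np

def Ad(r, c, rp, cp):
    """State matrix of Eq. (2); rp, cp stand for r' and c'."""
    return np.array([[1 - 1/(r*c),        1/(r*c)],
                     [1/(r*cp),  1 - 1/(r*cp) - 1/(rp*cp)]])

def Bd(r, c, rp, cp):
    """Input matrix of Eq. (2)."""
    return np.array([[1/c,  0.0],
                     [0.0,  1/(rp*cp)]])

# Simulate Eq. (3). The theta and noise values below are invented.
rng = np.random.default_rng(0)
r, c, rp, cp = 2.0, 300.0, 1.5, 1000.0
sigma, sigma_p = 0.5, 0.5
A, B = Ad(r, c, rp, cp), Bd(r, c, rp, cp)

T = np.array([20.0, 20.0])                          # initial (T_0, T'_0)
trace = []
for t in range(150):
    u = np.array([500 + 200*np.sin(0.1*t), 20.0])   # (P_in, 20 degrees C)
    w = rng.normal(0.0, [sigma, sigma_p])           # process noise (W_t, W'_t)
    T = A @ T + B @ u + w
    trace.append(T.copy())
```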
[Figure 3: LDS of Eq. (3), containing additive zero-mean Gaussian noise on the state variables.]

[Figure 4: LDS of Eq. (3) augmented with uncertain parameters. These are modelled by conditional linear deterministic nodes.]

[Figure 5: The model from Fig. 4, summarized using vector variables (where Θ represents R, C, R', C') and augmented with a paper type variable with a discrete distribution (pt1: 0.3, pt2: 0.5, pt3: 0.2). The relation between time slices t−1 and t is also shown. There are several options for the relation between paper type and Θ (the dashed arc here is not a linear influence).]


3.3 LDS models with uncertain parameters

Above, we have shown how to represent an LDS with additive zero-mean Gaussian noise as a Bayesian network; while it may be instructive, thus far it is only a convenient reformulation of known theory. However, to model the relation with the paper type variable, we need uncertainty in the parameters, i.e. the coefficients of the linear relation. Our solution is to transform these coefficients into probabilistic variables. We accomplish this by introducing a new node type:

Conditional linear deterministic node: a linear deterministic node extended with a special parent, which is distinguished from the rest because its arc ends in a diamond (and carries no coefficient); we call this parent the conditioning parent. The coefficients on the other arcs can depend on the value of the conditioning parent. This dependence is shown by putting a bullet in the place where this value is to be filled in.

We have already given an example of such a node: the conditional linear Gaussian node in Fig. 2. Just like the linear Gaussian node is an instance of a linear deterministic node, viz. one having specific parents 1 and N(0, 1), a conditional linear Gaussian node is a specific instance of a conditional linear deterministic node.

By using conditional linear deterministic nodes, we can extend our already noisy paper heater model with uncertain parameters: we replace parameter r by a variable R, which becomes the conditioning parent of the node that depended on r, and do the same for the other parameters. The result is shown in Fig. 4.

We can now proceed to connect a discrete paper type variable to the model, with an example distribution assigning to pt1, pt2, pt3 the probabilities 0.3, 0.5 and 0.2. As we mentioned, the paper type determines the R' parameter, but for generality's sake we will assume that it influences all the parameters. The resulting model, in Fig. 5, also shows that the notation for (conditional) linear deterministic nodes extends very naturally to vector-valued variables: coefficients become matrices. These are the matrices from Eq. (2), written as functions A_d(θ) and B_d(θ) of θ = (r, c, r', c'). Thus, we have made a graphical summary of the model which is linked very clearly to the state-space equations. Although this hides some of the model's internal structure, it is useful for keeping an overview.

Regarding the probability distribution of the Θ variable, we give two examples:

Discrete Θ: The paper types pt1, pt2, pt3 deterministically set a value θ1, θ2, θ3 (resp.) for Θ. In fact, this turns T_t and T_{t+1} into conditional linear Gaussians (conditioned by the paper type), so the joint distribution is a mixture of 3 Gaussians.

Continuous Θ: The paper types pt1, pt2, pt3 determine the parameters (µ_θ1, σ_θ1), (µ_θ2, σ_θ2), (µ_θ3, σ_θ3) for a Gaussian-distributed Θ. This model is no longer linear.

These options have an influence on inference and learning in the model, which we discuss in the next sections.
4 Inference

In this section, we briefly discuss inference in the uncertain parameter model of Fig. 4, for both the discrete and continuous Θ given above. Assume we observe the system's T'_t variable responding to the P^in_t input for a while, resulting in data D = {p^in_0, t'_0, ..., p^in_{m−1}, t'_{m−1}, t'_m}, and the goal is to find out the paper type.

This can be done by a forward pass over the model, as known from the dynamic Bayesian network literature. We start with the prior distribution

$$P(t_0, \theta, t'_0) = P(t_0)\,P(\theta)\,P(t'_0)$$

and perform a recursive forward pass from t = 0 to t + 1 = m:

$$P(t_{t+1}, \theta, p^{in}_{0..t}, t'_{0..t+1}) = \int P(t_t, \theta, p^{in}_{0..t-1}, t'_{0..t})\; P(t_{t+1} \mid t_t, t'_t, p^{in}_t, \theta)\; P(t'_{t+1} \mid t_t, t'_t, \theta)\; dt_t$$

Finally, we marginalize out t_m:

$$P(\theta, D) = \int P(t_m, \theta, D)\; dt_m$$

The details of the inference algorithm depend on the model used. For the discrete Θ, all the distributions above are linear Gaussian, so we can multiply and integrate exactly. To be precise, P(t_{t+1} | t_t, t'_t, p^in_t, θ) and P(t'_{t+1} | t_t, t'_t, θ) are linear Gaussian for each of the 3 individual θ_i values; the algorithm thus independently works on 3 Gaussians.
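The following sketch (ours, not the authors' implementation) illustrates this forward pass for the discrete Θ: one exact Gaussian filter per paper type, where each observed t'_t is incorporated by Gaussian conditioning on the second state component and the paper-type weights accumulate the observation likelihoods. The helper names and the way the inputs are packaged are our assumptions.

```python
import numpy as np

def log_npdf(y, mean, var):
    """Log density of N(mean, var) at y."""
    return -0.5 * ((y - mean)**2 / var + np.log(2 * np.pi * var))

def forward_pass(models, priors, p_in, t_obs, Q, m0, P0):
    """Return P(paper type | D) for the discrete-Theta model.

    models: list of (A_d, B_d) pairs, one per paper type (cf. Eq. (2));
    p_in, t_obs: observed inputs p^in_t and observed outputs t'_t;
    Q: process noise covariance diag(sigma^2, sigma'^2);
    m0, P0: Gaussian prior over the initial state (T_0, T'_0).
    """
    log_w = np.log(np.asarray(priors, float))        # log P(theta) so far
    filters = [(m0.copy(), P0.copy()) for _ in models]
    for u, y in zip(p_in, t_obs):
        for i, (A, B) in enumerate(models):
            m, P = filters[i]
            m = A @ m + B @ np.array([u, 20.0])      # predict with Eq. (3)
            P = A @ P @ A.T + Q
            log_w[i] += log_npdf(y, m[1], P[1, 1])   # likelihood of t'_t
            k = P[:, 1] / P[1, 1]                    # condition on T'_t = y
            m = m + k * (y - m[1])
            P = P - np.outer(k, P[1, :])
            filters[i] = (m, P)
    w = np.exp(log_w - log_w.max())                  # normalize stably
    return w / w.sum()
```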
For the continuous Θ, the conditional distributions of T_{t+1} and T'_{t+1} are not linear Gaussian; we can do approximate inference by linearizing these distributions at each time slice around the means of T_t, Θ (given the data up to t), in analogy to the Extended Kalman Filter (see e.g. [6, 10]).

A second type of inference that we do with the model is smoothing. In particular, we want to calculate P(θ, t_t, t_{t+1} | D) for each time slice, in order to do EM learning (see the next section). We have used a Rauch-Tung-Striebel-type smoother [9]. This uses a forward pass like the one discussed above, with the adjustment that it stores the distributions over two time slices. The last of these, i.e. P(t_{m−1}, t_m, θ, D), is used as the input for a recursive backward pass defined as follows:

$$P(t_{t-1}, t_t, \theta, D) = \int P(t_t, t_{t+1}, \theta, D)\; P(t_{t-1} \mid t_t, \theta, D)\; dt_{t+1}$$

where the first factor in the integral is the recursive one, and the second is calculated from the distribution over t_{t−1}, t_t stored in the forward pass:

$$P(t_{t-1} \mid t_t, \theta, D) = P(t_{t-1} \mid t_t, p^{in}_{0..t-1}, t'_{0..t}) = \frac{P(t_{t-1}, t_t, p^{in}_{0..t-1}, t'_{0..t})}{\int P(t_{t-1}, t_t, p^{in}_{0..t-1}, t'_{0..t})\; dt_{t-1}}$$

The advantage of such a smoother over an independent backward pass is that it does not linearize the distribution over T_{t−1}, T_t in two different ways (the backward pass uses the linearization of the forward pass).

5 Learning

For learning the model, we discuss the situation where we know the paper type (assume it is pt1) and observe the system like before, i.e. D = {p^in_0, t'_0, ..., p^in_{m−1}, t'_{m−1}, t'_m}. The goal is to learn the parameter set ρ that maximizes the likelihood P(D | pt1; ρ). The situation is a little different depending on the model for Θ.

5.1 EM for continuous Θ

For the continuous Θ, the restriction to pt1 means that we are learning the parameters for one multivariate Gaussian variable Θ (actually consisting of four independent variables R, C, R', C') and the σ, σ' process noise parameters. Thus, ρ = (µ_θ1, σ_θ1, σ, σ'). This can be done by a standard EM algorithm [2] for Bayesian networks: given a set of initial parameters ρ_i, the approximate smoother infers the distributions P(t_t, t_{t+1}, θ | D, pt1; ρ_i). From these, the expected sufficient statistics are gathered for maximizing the expectation of

$$\log P(D, T_{0..m}, \Theta \mid pt1;\; \rho_{i+1})$$

under the old parameters ρ_i; this is repeated until convergence.
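For the Θ-part of this M-phase, the maximizing Gaussian is simply the one matching the smoothed moments of Θ; a minimal sketch of that step (our reading of the algorithm, with hypothetical inputs) is:

```python
import numpy as np

# M-step for the Gaussian over Theta, given the expected sufficient statistics
# E[Theta] and E[Theta Theta^T] gathered in the E-phase. The maximizing
# Gaussian is the moment-matching one; since the paper treats R, C, R', C'
# as independent, one would keep only the diagonal of Sigma.
def m_step_gaussian_theta(E_theta, E_theta_outer):
    mu = E_theta                                   # new mean of Theta
    Sigma = E_theta_outer - np.outer(mu, mu)       # new covariance
    return mu, Sigma
```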
5.2 EM for discrete Θ

For the discrete parameter space model, we are looking for the parameter set (θ, σ, σ') that maximizes the likelihood

$$P(D \mid pt1, \Theta = \theta;\; \sigma, \sigma')$$

Note that the role of Θ is different here: we are not learning the optimal parameters for a distribution over Θ, but the optimal single value. This requires some adjustments to the smoother: it should store distributions over (T_t, T_{t+1}) instead of over (T_t, T_{t+1}, Θ), and should not use a prior distribution over Θ either. Because all the probability distributions are linear Gaussian again, smoothing is exact now.

However, maximizing the expected likelihood is not so trivial now: we are looking for the optimal linear Gaussian distribution P(t_{t+1}, t'_{t+1} | t_t, t'_t, p^in_t, θ) constrained to a certain form prescribed by A and B. Specifically, the log likelihood for an individual time slice is:

$$\log P(t_{t+1}, t'_{t+1} \mid t_t, t'_t, p^{in}_t, \theta) = -\frac{1}{2}\,\delta^T \Sigma^{-1} \delta - \frac{1}{2}\log|2\pi\Sigma|$$

where $\Sigma = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma'^2 \end{pmatrix}$ and δ is an abbreviation for

$$\delta = \begin{pmatrix} t_{t+1} \\ t'_{t+1} \end{pmatrix} - A_d(\theta) \begin{pmatrix} t_t \\ t'_t \end{pmatrix} - B_d(\theta) \begin{pmatrix} p^{in}_t \\ 20\,^\circ C \end{pmatrix}$$

Separating variables and parameters, we can write:

$$\delta = D(\theta)\,x_t, \qquad D(\theta) = \begin{pmatrix} I & -A_d(\theta) & -B_d(\theta) \end{pmatrix}, \qquad x_t = \begin{pmatrix} t_{t+1} & t'_{t+1} & t_t & t'_t & p^{in}_t & 20\,^\circ C \end{pmatrix}^T$$

The log likelihood for all the time slices (ignoring the term $-\frac{1}{2}\log|2\pi\Sigma|$ for now) is then:

$$\log P(D, t_{0..m} \mid \theta) = -\frac{1}{2} \sum_{t=0..m} (D(\theta)x_t)^T\, \Sigma^{-1}\, D(\theta)x_t = -\frac{1}{2} \sum_{t=0..m} \left[ \frac{(D_1(\theta)x_t)^2}{\sigma^2} + \frac{(D_2(\theta)x_t)^2}{\sigma'^2} \right]$$

where D_i denotes the i-th row of D. The goal is to maximize the expected value of this expression. At first sight, it seems that the two terms are dependent through θ, but on closer inspection

$$D(\theta) = \begin{pmatrix} 1 & 0 & -1 + 1/rc & -1/rc & -1/c & 0 \\ 0 & 1 & -1/rc' & -1 + 1/rc' + 1/r'c' & 0 & -1/r'c' \end{pmatrix}$$

we see that the values in the first row do not constrain those in the second, or vice versa. We can therefore minimize the expected value of the two terms independently. We can also see that there are linear constraints for the values within a row, e.g. D_{1,3}(θ) + D_{1,4}(θ) = −1. We record these constraints in a matrix C and vector c such that C D_1^T(θ) = c. Substituting d = D_1^T(θ), for the first term we are looking for the d that minimizes

$$\sum_{t=0..m} E(x_t^T\, d\, d^T x_t) = d^T \Big[ \sum_{t=0..m} E(x_t x_t^T) \Big]\, d$$

under the constraint Cd = c. This is a linearly constrained quadratic optimization problem that can be solved by the method of Lagrange multipliers. The second term can be minimized in the same way.

In conclusion, we have derived the M-phase for the discrete Θ model; in the E-phase, we therefore have to collect the expected sufficient statistics E(x_t x_t^T).
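The constrained minimization just described has a closed-form solution via the Lagrange (KKT) system; a minimal sketch of ours follows, where M stands for Σ_t E(x_t x_t^T) and the constraint encoding shown is one possible reading of the matrix above, not necessarily the authors' exact choice.

```python
import numpy as np

def constrained_min(M, C, c):
    """Minimize d^T M d subject to C d = c via Lagrange multipliers.

    Stationarity gives 2 M d + C^T lam = 0 together with C d = c,
    i.e. one linear (KKT) system in the unknowns (d, lam).
    """
    n, k = M.shape[0], C.shape[0]
    kkt = np.block([[2 * M,  C.T],
                    [C,      np.zeros((k, k))]])
    rhs = np.concatenate([np.zeros(n), c])
    return np.linalg.solve(kkt, rhs)[:n]    # the optimal d = D_1(theta)^T

# One possible encoding of the first-row constraints (cf. D(theta) above):
# d_1 = 1, d_2 = 0, d_3 + d_4 = -1, d_6 = 0; d_5 stays free.
C = np.array([[1, 0, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 1, 1, 0, 0],
              [0, 0, 0, 0, 0, 1]], float)
c = np.array([1.0, 0.0, -1.0, 0.0])
```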
5.3 Comparison

It is interesting to compare learning for continuous and discrete Θ. In order to do this, we have simulated the system in Fig. 1 for 150 time slices, with a sine wave as P^in input and random Gaussian disturbance. We provide the EM algorithms discussed above with the P^in and the generated T'_t data. Typical results of a 60-iteration run are shown in Fig. 6.

The most interesting fact to observe is that the two approaches converge to different values (but the same log likelihood). This probably means that the system is not identifiable: several choices for Θ lead to the same likelihood of the observed behaviour T'_t. To test this hypothesis, we also generated synthetic T_t, T'_t data (without disturbance) for systems with the learned parameters. The results are plotted in Fig. 7. These results indeed show that both methods arrive at the same approximation (green, red) of the original (blue) T'_t data; however, the different values for the parameters lead to a different behaviour of T_t.

A second observation from Fig. 6 is that learning continuous Θ converges faster than learning discrete Θ. The explanation for this is as follows: the EM algorithm for the continuous parameter space uses inference to compute a posterior distribution over the variable Θ. In this algorithm, the posterior distribution is updated for each time slice. However, we can also regard the algorithm as doing full Bayesian learning where Θ is viewed as a parameter; the algorithm is then performing incremental EM [8], which is known to converge faster.
[Figure 6: EM learning of the Θ parameters: comparison of discrete parameter space (parameters are single values; shown in green) and continuous parameter space (parameters are Gaussians; µ values shown in red). The horizontal axis represents 60 EM iterations. Also shown are the learned distribution over T_1 (Gaussian, µ value) and the log likelihood.]

[Figure 7: Blue: synthetic data used for learning (T_t and T'_t are shown, but only the latter is given to the learning algorithms). As input P^in_t, we used a sine wave. We disturbed the system with additive zero-mean Gaussian noise. Green, red: response of an undisturbed deterministic system using the learned parameters (with discrete and continuous parameter space, resp.) to the same P^in_t sine wave.]


6 Conclusion

The central scientific hypothesis which initiated the research described in this paper was that knowledge engineering methods for industrial applications of probabilistic graphical models should be based as much as possible on existing methods from engineering. We have developed a systematic framework, where we start with linear system theory and its associated diagrams and notations as the first step in the knowledge engineering process. The advantage of this approach is that engineers have already dealt with the unavoidable complexity issues in representing realistic models. Subsequently, it was shown how linear dynamic system models can be augmented with probabilistic variables for uncertain parameters, transforming them into dynamic Bayesian networks with conditionally linear nodes. We introduced a concise notation that combines LDS and Bayesian network concepts in a natural way and demonstrated methods for inference and learning from data in these models. The practical usefulness of the framework was illustrated by a case study from the domain of large production printers.

Acknowledgements

This work has been carried out as part of the OCTOPUS project under the responsibility of the Embedded Systems Institute. This project is partially supported by the Netherlands Ministry of Economic Affairs under the Embedded Systems Institute program.

References

[1] Thomas Dean and Keiji Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5(3):142–150, 1989.

[2] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.

[3] A.J. Hommersom and P.J.F. Lucas. Using Bayesian networks in an industrial setting: Making printing systems adaptive. In 19th European Conference on Artificial Intelligence, pages 401–406, 2010.

[4] R.E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.

[5] D.C. Karnopp, D.L. Margolis, and R.C. Rosenberg. System Dynamics: Modeling and Simulation of Mechatronic Systems. John Wiley & Sons, 2006.

[6] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.

[7] K.B. Korb and A.E. Nicholson. Bayesian Artificial Intelligence. Chapman & Hall/CRC, 2004.

[8] Radford Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M.I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998.

[9] H.E. Rauch, F. Tung, and C.T. Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8):1445–1450, 1965.

[10] H.W. Sorenson. Kalman Filtering: Theory and Application. IEEE, 1985.

[11] http://www.20sim.com/