-

Constructing Bayesian Networks for Linear Dynamic Systems

Sander Evers Peter J.F. Lucas

0 0 Radboud University Nijmegen Radboud University Nijmegen Inst. for Computing and Information Sciences Inst. for Computing and Information Sciences Nijmegen , the Netherlands Nijmegen , the Netherlands

Building probabilistic models for industrial applications cannot be done e ectively without making use of knowledge engineering methods that are geared to the industrial setting. In this paper, we build on well-known modelling methods from linear dynamic system theory as commonly used by the engineering community to facilitate the systematic creation of probabilistic graphical models. In particular, we explore a direction of research where the parameters of a linear dynamic system are assumed to be uncertain. As a case study, the heating process of paper in a printer is modelled. Di erent options for the representation, inference and learning of such a system are discussed, and experimental results obtained by this approach are presented. We conclude that the methods developed in this paper o er an attractive foundation for a methodology for building industrial, probabilistic graphical models.

As part of the manual construction process of a Bayesian network for an actual problem one needs to somehow translate knowledge that is available in the problem domain to an appropriate graphical structure. Problem characteristics also play a major role in the choice of the type of the associated local probability distributions. The whole process can be looked on as a form of knowledge engineering, where the actual construction of the Bayesian network is only one of the many needed activities in the development process. The acquisition of knowledge from domain experts is traditionally seen as one of the most important bottlenecks in knowledge engineering. It is well known that this can be greatly alleviated if there is an easy mapping from the informal ways domain knowledge is described, in documents or verbally by experts, to the Bayesian network formalism [ 7 ]. A typical example of such a mapping is the exploitation of available causal knowledge in a particular domain; often, causal knowledge can be easily translated to an initial graph structure of a Bayesian network, which can be re ned later, for example by examining the conditional independence relationship represented by the resulting network. Fields where causal knowledge has been successfully used in knowledge engineering for Bayesian networks include medicine and biology.

However, for industrial applications the situation is somewhat di erent. This is mainly because in time, engineers have developed their own notational conventions and formalisms to get a grip on the domain of concern in the system-development process. In addition, industrial artifacts are designed by humans, and already early in the design process models are available that also can be used for other purposes: model-based design and development is here the central paradigm. Thus, rather than replacing methods from engineering by some new and unrelated methods, a better option seems to be to deploy as many of the engineering principles, methods, and assumptions as possible. Linearity is one of the assumptions frequently used, as it facilitates the development of complex models as industrial applications often are. Another commonly used method is the use of diagrams that act as abstractions of the system being developed. Diagrams act as important means for communication. Ideally, one would like to use similar diagrams as a starting point for building Bayesian networks or related probabilistic graphical models.

In this paper we explore these ideas by taking Linear Dynamic Systems, LDS for short, as a start for the construction process. LDS models enjoy a well-developed theory and practice, as they are widely used throughout many engineering disciplines for tracking system behaviour (cf. the well-known Kalman lter [ 4 ]) as well as for controlling this behaviour. Like Bayesian networks, an LDS can often be represented by a graphical diagram, which facilitates documentation and communication of the model among experts and non-experts. Although an LDS is deterministic in nature, it is often used in situations that involve uncertainty. The Kalman lter is the canonical example here: given some noisy observations, it determines the expected current state of the system (mean and variance). However, the Kalman lter only accounts for one speci c type of uncertainty: additive linear Gaussian noise on the state and output variables.

In this article, we explore a di erent direction of augmenting an LDS with probabilities: we regard the parameters as unknown. This allows us for example to model a printer that heats di erent types of paper, in which it is uncertain what the current type is. Previous Bayesian networks modelling this situation were developed in a laborious ad hoc manner by close cooperation of domain experts and probabilistic modelling experts [ 3 ]; in this article we aim for a more systematic approach. 2

LDS models and their role in the engineering process

2.1

Basic de nitions

In its most basic form, a Linear Dynamic System or LDS describes a system's behaviour by the di erential equation

ddt x(t) = Ax(t) + Bu(t) known as the state-space representation. Here, vector x(t) represents the system's state at time t, u(t) its input at time t, and matrices A and B describe how the current state change depends linearly on the current state and input.

This is a continuous-time representation; in order to calculate with LDS models, time is usually discretized. In this article, we therefore use the simple discretized model xt+1

xt = Axt + But in which the size of the discretization steps conveniently equals the unit of the time domain in order to simplify the exposition (in practice one can use discretization steps of size t and scale the A and B matrices with this factor). The equation can then be rewritten to: xt+1 = Adxt + Bdut

Ad = A + I Bd = B (1) r r0

Ptin c c0 In the engineering process, LDS models of systems are often represented by means of diagrams. We exemplify the role of these models and diagrams using a case study which remains our running example thoughout the paper. The case study originates from a manufacturer of industrial printers. To ensure print quality, paper needs to be heated to a certain temperature, which is accomplished by passing the paper along a metal heater plate. It is quite important that the paper reaches the right temperature. If it is too cold, print quality su ers; if it is too hot, energy (and money) is wasted or worse: the printer might malfunction. Therefore, engineers have put a lot of e ort in the accurate modelling of the heating process. This results in models such as Fig. 1, in which the heater is modelled as two distinct heat masses: when the heater is powered, the rst mass is directly heated, thereby indirectly heating the second mass, which transfers the heat to the paper.1 In the diagram, the heating dynamics are represented as an electrical circuit, where temperature plays the part of voltage and heat ow that of current. A diagram like this has important advantages for the engineering process:

It is very well suited for documentation and communication. A trained engineer can read the system's basic dynamic behaviour o this diagram in a blink of an eye; for a non-expert with a science background it is relatively easy to gain some intuition.

It has a formal meaning as an LDS; it translates into the state-space equations in the form of Eq. (1), connecting it to a vast body of theoretical and practical knowledge.

It separates qualitative and quantitative aspects of the model; the former are determined by the diagram structure, the latter by the parameters. It is composable: other models like this can be developed independently and joined into a larger system.

It is supported by software: drawing and managing modules of electrical circuits (and also other graphical forms like bond graphs [ 5 ] and schematic diagrams) can be done by tools like 20sim [ 11 ], which can also perform the translation to the state-space representation. This representation can be used for simulation, e.g. in MATLAB.

However, it is con ned to modelling deterministic behaviour. In the realm of probabilistic modelling, the formalism of Bayesian networks shares the above attractive properties. A natural question is therefore: how can we combine these well-known LDS models with Bayesian networks? Speci cally, this paper will explore the situation where the parameters of the system (in this case: r; c; r0; c0) involve uncertainty. This direction is induced by the following use case: the paper heater modelled above is used with di erent paper types fpt1; pt2; pt3g (for example: 80 g=m2 A4, 200 g=m2 A4, 80 g=m2 Letter). We have no direct knowledge about which type is in the heater, and would therefore like to model it as a probabilistic variable PT. Each paper type leads to a di erent value for the system's r0 parameter (the heat resistance between plate and paper). The question we 1This is known as a lumped element model ; in contrast, the heat distribution could also be modelled as a 3-dimensional temperature eld over the plate. X1

Y X1

X2 X1

X2 ask ourselves is: How can we join the paper type variable to the LDS model, so we can infer probabilistically which paper type is in the heater, by observing the T 0 values of the system? 3

Augmenting LDS models with uncertainty

3.1

Bayesian network representation

A Bayesian network B = (G; ) is an acyclic directed graph G, consisting of nodes and arcs, that is faithful to a joint probability distribution factored as , which contains for each node Y (with parents X1; X2; : : :) a family of local conditional probability density or mass functions P(Y jX1; X2; : : :). A dynamic Bayesian network [ 1 ] is a special case of this, where the nodes are partitioned in time slices all consisting of the same structure and distributions. Furthermore, arcs are only allowed between nodes in the same or adjacent time slice.

For modelling linear dynamic systems as (dynamic) Bayesian networks, only the following types of nodes are needed:

Deterministic nodes: a node Y with parents

X1; X2; : : : is called deterministic if its conditional probability distribution is

P(y jx1; x2; : : :) = (1 if y = f (x1; x2; : : :)

0 if y 6= f (x1; x2; : : :) for a certain function f ; in this article, these functions are mostly linear, i.e.

y = + x2 + : : : We use a special notation for these linear deterministic nodes shown in Fig. 2 (left).

Linear Gaussians: a node Y with

X1; X2; : : : is known in Bayesian literature as a linear Gaussian if parents network P(Y jx1; x2; : : :) = N ( + x2 + : : : ; 2) Networks that consist only of linear Gaussians (with > 0) have theoretical signi cance: their joint distribution is multivariate Gaussian, and exact inference is easy and e cient (e.g. see [ 6 ]). A linear Gaussian without parents N ( ; 2) is simply called Gaussian; the Gaussian N (0; 1) is called a standard Gaussian. A linear Gaussian can be written as a linear deterministic node with two extra parents; see Fig. 2 (middle).

Conditional linear Gaussians: a node Y with par

ents X1; X2; : : : and discrete parent is conditional linear Gaussian if P(Y jx1; x2; : : : ; ) = N ( + x2 +: : : ; 2) i.e. it is linear Gaussian for each value . If X1; X2; : : : are Gaussian, the marginal distribution over Y is a mixture of Gaussians:

P(Y ) =

X P( )N (^ ; ^2) ^ = ^2 = 2

+ 2 +

X1 + 2 X21 +

X2 + : : : 2 X22 + : : : Again, this also holds for complete networks: if all nodes are (conditional) linear Gaussian, the joint distribution is a mixture of multivariate Gaussians. However, this number of components in this mixture is exponential in the number of variables, which can make inference hard. A conditional linear Gaussian can also be written as a deterministic node with extra parents. For this we use a special notation shown in Fig. 2 (right), to which we will return later.

As these three node types can all be written as deterministic nodes, we will henceforth use the convention that all non-root nodes in our networks are deterministic. 3.2

LDS models as Bayesian networks

The paper heater model in Fig. 1 translates to the following discrete-time state-space equations in the form of Eq. (1):

The state of the system consists of the temperatures Tt and Tt0 of the two heat masses. In fact, translating the electrical diagram by tools such as 20-sim rst leads to a more elaborate form in which auxiliary power variables are present. It is instructive to represent this form as a Bayesian network consisting only of linear deterministic nodes; this is shown at the right side of Fig. 1. As the network is completely deterministic, it might also be read as a system of equations over ordinary variables:

Each node represents the left-hand side of an equation, consisting of one variable.

The incoming arcs represent the right-hand side: a linear combination of the parent variables, with coe cients as speci ed on the arcs. Note: we follow the convention that empty arcs carry a coe cient of 1.

For example, the gure shows that

Pttrans = 1r Tt + 1 r

Tt0 =

Tt0 r Tt+1 = 1c Ptin + Tt + c 1 Pttrans = Tt +

P in t We will now start to add uncertainty to the LDS. First, as is often done (e.g. in the Kalman lter), we augment the state variables with additive zero-mean Gaussian noise: The noise is represented by two independent variables Wt and Wt0. A graphical representation of this system is shown in Fig. 3; we have only replaced Wt and Wt0 by two anonymous standard Gaussian variables N (0; 1) whose value is multiplied by and 0. As a result, the Tt variables now have the linear Gaussian form from Fig. 2 (middle).

In fact, this makes the whole system Gaussian: although the Pttrans and P out nodes are not linear Gaussian, we have already seen that they can be marginalized out, making the Tt+1; Tt0+1 nodes directly depend on Tt; Tt0. As for the P in node: as it is always used t with concrete evidence pitn, it never represents a probability distribution, and can be reformulated to take

Tt+1 Tt0+1

0 N (0; 1)

R R0

Tt0

Tt+1 Tt0+1

0 N (0; 1)

C C0 N (0; I) the place of in the Tt+1 distribution. Thus, Fig. 3 is a dynamic Bayesian network with a Gaussian joint distribution. This means that we can use a variant of the standard Kalman lter to do exact inference on this network; we discuss this in detail in Sect. 4. 3.3

LDS models with uncertain parameters

Above, we have shown how to represent an LDS with additive zero-mean Gaussian noise as a Bayesian network; while it may be instructive, thus far it is only a convenient reformulation of known theory. However, to model the relation with the paper type variable, we need uncertainty in the parameters, i.e. the coe cients of the linear relation. Our solution is to transform these coe cients into probabilistic variables. We accomplish this by introducing a new node type:

Conditional linear deterministic node: a linear

deterministic node extended with a special parent, which is distinguished from the rest because its arc ends in a diamond (and carries no coe cient); we call this parent the conditioning parent. The coe cients on the other arcs can depend on the value of the conditioning parent. This dependence is shown by putting a bullet in the place where this value is to be lled in.

We have already given an example of such a node: the conditional linear Gaussian node in Fig. 2. Just like the linear Gaussian node is an instance of a linear deterministic node, viz. having speci c parents 1 and N (0; 1), a conditional linear Gaussian node is a speci c instance of conditional linear deterministic node. By using conditional linear deterministic nodes, we can extend our already noisy paper heater model with uncertain parameters: we replace parameter r by variable R|which becomes the conditioning parent of the node that depended on r|and do the same for the other parameters. The result is shown in Fig. 4. We can now proceed to connect a discrete paper type variable to the model, with an example distribution assigning to pt1; pt2; pt3 the probabilities 0:3, 0:5 and 0:2. Like we mentioned, the paper type determines the R0 parameter, but for generalization's sake we will assume that it in uences all the parameters. The resulting model, in Fig. 5, also shows that the notation for (conditional) linear deterministic nodes extends very naturally to vector-valued variables: coe cients become matrices. These are the matrices from Eq. (2), written as functions Ad( ) and Bd( ) of = (r; c; r0; c0). Thus, we have made a graphical summary of the model which is linked very clearly to the state-space equations. Although this hides some of the model's internal structure, it is useful for keeping an overview.

Discrete : The paper types pt1; pt2; pt3 deterministically set a value 1; 2; 3 (resp.) for . In fact, this turns Tt and Tt+1 into conditional linear Gaussians (conditioned by the paper type), so the joint distribution is a mixture of 3 Gaussians. Continuous : The paper types pt1; pt2; pt3 determine the parameters ( 1; 1), ( 2; 2), ( 2; 2) for a Gaussian-distributed . This model is no longer linear.

These options have an in uence on inference and learning in the model, which we discuss in the next sections.

4 Inference

In this section, we shortly discuss inference in the uncertain parameter model of Fig. 4, for both the discrete and continuous given above. Assume we observe the system's Tt0 variable responding to the Ptin input for a while, resulting in data D = fpi0n; t00; : : : ; pimn 1; t0m 1; t0mg, and the goal is to nd out the paper type.

This can be done by a forward pass over the model as known from dynamic Bayesian network literature. We start with the prior distribution

P(t0; ; t00) = P(t0)P( )P(t00) and perform a recursive forward pass from t = 0 to t + 1 = m: Z

P(tt+1; ; pi0n::t; t00::t+1) =

Finally, we marginalize out tm:

P( ; D) =

P(tm; ; D) dtm The details of the inference algorithm depend on the model used. For the discrete , all the distributions above are linear Gaussian, so we can multiply and integrate exactly. To be precise, P(tt+1 jtt; t0t; pitn; ) and P(t0t+1 jtt; t0t; ) are linear Gaussian for each of the 3 individual i values; the algorithm thus independently works on 3 Gaussians.

For the continuous , the conditional distributions of Tt+1 and Tt0+1 are not linear Gaussian; we can do approximate inference by linearizing these distributions at each timeslice around the means of Tt; (given the Regarding the probability distribution of the able, we give two examples:

P(tt; ; pi0n::t 1; t00::t)P(tt+1 jtt; t0t; pitn; )P(t0t+1 jtt; t0t; ) dtt ing on the model for . A second type of inference that we do with the model is smoothing. In particular, we want to calculate P( ; tt; tt+1 jD) for each timeslice, in order to do EM learning (see the next section). We have used a RauchTung-Striebel -type smoother [ 9 ]. This uses a forward pass like discussed above, with the adjustment that it stores the distributions over two time slices. The last of these, i.e. P(tm 1; tm; ; D), is used as the input for a recursive backward pass de ned as follows:

P(tt 1; tt; ; D) =

P(tt; tt+1; ; D)P(tt 1 jtt; ; D) dtt+1 where the rst factor in the integral is the recursive one, and the second is calculated from the distribution over tt 1; tt stored in the forward pass:

P(tt 1 jtt; ; D) = P(tt 1 jtt; pi0n::t 1; t00::t)

P(tt 1; tt; pi0n::t 1; t00::t) = R P(tt 1; tt; pi0n::t 1; t00::t) dtt 1 The advantage of such a smoother over an independent backward pass is that it does not linearize the distribution over Tt 1; T in two di erent ways (the backward pass uses the linearization of the forward pass). 5

Learning

For learning the model, we discuss the situation where we know the paper type (assume it is pt1) and observe the system like before, i.e. D = fpi0n; t00; : : : ; pimn 1; t0m 1; t0mg. The goal is to learn the parameter set that maximizes the likelihood P(Djpt1; ). The situation is a little di erent depend5.1

EM for continuous

For the continuous , the restriction to pt1 means that we are learning the parameters for one multivariate Gaussian variable (actually consisting of four independent variables R, C, R0, C0) and the ; 0 process noise parameters. Thus, = ( 1; 1; ; 0). This can be done by a standard EM algorithm [ 2 ] for Bayesian networks: given a set of initial parameters i, the approximate smoother infers the distributions P(tt; tt+1; jD; pt1; i). From these, the expected sufcient statistics are gathered for maximizing

P(D; T0::m; jpt1; i+1) expected under the old parameters i; this is repeated until convergence. Note that the role of is di erent here; we are not learning the optimal parameters for a distribution over , but the optimal single value. This requires some adjustments to the smoother: it should store distributions over (Tt; Tt+1) instead of over (Tt; Tt+1; ), and should not use a prior distribution over either. Because all the probability distributions are linear Gaussian again, smoothing is exact now. However, maximizing the expected likelihood is not so trivial now: we are looking for the optimal linear Gaussian distribution P(tt+1; t0t+1 jtt; t0t; pitn; ) constrained to a certain form prescribed by A and B. Speci cally, the log likelihood for an individual time slice is: log P(tt+1; t0t+1 jtt; t0t; pitn; ) = where = h 02 002 i and 1 T 2 1 1

log j2 2 j is an abbreviation for = tt+1 t0t+1

Ad( ) tt0t t

pin Bd( ) 20t C Separating variables and parameters, we can write: = D( )xt D( ) = I

Ad( )

Bd( ) xt = tt+1 t0t+1 tt t0t pitn The log likelihood for all the time slices (ignoring the term 21 log j2 j for now) is then:

log P(D; t0::m j ) = = where Di denotes the ith row of D. The goal is to maximize the expected value of this expression. At rst sight, it seems that the two terms are dependent through , but on closer inspection

D( ) = h 1 0 1+1=rc 0 1 1=rc0

1=rc 1=c 1+1=rc0+1=r0c0 0 we see that the values in the rst row do not constrain those in the second, or vice versa. We can therefore minimize the expected value of the two terms independently. We can also see that there are linear constraints for the values within a row, e.g. D1;3( )+D1;4( ) = 1. We record these constraints in a matrix C and vector c such that CD1T ( ) = c. Substituting d = D1T ( ), for the rst term we are looking for the d that minimizes

X t=0::m

E(xtT ddT xt) = dT "

X t=0::m

# E(xtxtT ) d under the constraint Cd = c. This is a linearly constrained quadratic optimization problem that can be solved by the method of Lagrange multipliers. The second term can be minimized in the same way. In conclusion, we have derived the M-phase for the discrete model; in the E-phase, we therefore have to collect the expected su cient statistics E(xtxtT ). 5.3

Comparison

It is interesting to compare learning for continuous and discrete . In order to do this, we have simulated the system in Fig. 1 for 150 time slices, with a sine wave as P in input and random Gaussian disturbance. We provide the EM algorithms discussed above with the P in and the generated Tt0 data. Typical results of a 60-iterations run are shown in Fig. 6.

The most interesting fact to observe is that the two approaches converge to di erent values (but the same log likelihood). This probably means that the system is not identi able: several choices for lead to the same likelihood of the observed behaviour Tt0. To test this hypothesis, we also generated synthetic Tt; Tt0 data (without disturbance) for systems with the learned parameters. The results are plotted in Fig. 7. These results indeed show that both methods arrive at the same approximation (green, red) of the original (blue) Tt0 data; however, the di erent values for the parameters lead to a di erent behavior of Tt.

A second observation from Fig. 6 is that learning continuous converges faster than learning discrete . The explanation for this is as follows: the EM algorithm for the continuous parameter space uses inference to compute a posterior distribution over the variable . In this algorithm, the posterior distribution is updated for each time slice. However, we can also regard the algorithm as doing full Bayesian learning where is viewed as a parameter; the algorithm is then performing incremental EM [ 8 ], which is known to converge faster. 6

Conclusion

The central scienti c hypothesis which initiated the research described in this paper was that knowledge engineering methods for industrial applications of probabilistic graphical models should be based as much as 0.0135 4000 10 20 30 40 50 60 104.00030 10 20 30 40 50 60

C C’ 2000 10 20 30 40 50 60 8000 10 20 30 40 50 60 T1 0 log lik R possible on existing methods from engineering. We have developed a systematic framework, where we start with linear system theory and associated diagrams and notations as the rst step in the knowledge engineering process. The advantage of this approach is that engineers have already dealt with the unavoidable complexity issues in representing realistic models. Subsequently, it was shown how linear dynamic system models can be augmented with probabilistic variables for uncertain parameters, transforming them into dynamic Bayesian networks with conditionally linear nodes. We introduced a concise notation that combines LDS and Bayesian network concepts in a natural way and demonstrated methods for inference and learning from data in these models. The practical usefulness of the framework was illustrated by a case study from the domain of large production printers.

Acknowledgements

This work has been carried out as part of the OCTOPUS project under the responsibility of the Embedded Systems Institute. This project is partially supported by the Netherlands Ministry of Economic A airs under the Embedded Systems Institute program.

[1]

Thomas

Dean and

Keiji

Kanazawa . A model for reasoning about persistence and causation . Computational Intelligence , 5 ( 3 ): 142 { 150 , 1989 .

[2]

A. P.

Dempster ,

N. M.

Laird ,

and D. B.

Rubin . Maximum likelihood from incomplete data via the EM algorithm . Journal of the Royal Statistical Society. Series B (Methodological) , 39 ( 1 ):1{ 38 , 1977 .

[3]

A.J.

Hommersom and

P.J.F.

Lucas . Using Bayesian networks in an industrial setting: Making printing systems adaptive . In 19th European Conference on Articial Intelligence , pages 401 { 406 , 2010 .

[4]

R.E.

Kalman et al. A new approach to linear ltering and prediction problems . Journal of Basic Engineering , 82 ( 1 ): 35 { 45 , 1960 .

[5]

D.C.

Karnopp ,

D.L.

Margolis , and

R.C.

Rosenberg . System dynamics: modeling and simulation of mechatronic systems . John Wiley & Sons, 2006 .

[6]

Koller and

Friedman . Probabilistic graphical models: Principles and techniques . The MIT Press, 2009 .

[7]

K.B.

Korb and

A.E.

Nicholson . Bayesian Arti cial Intelligence . Chapman & Hall/CRC, 2004 .

[8]

Radford

Neal and Geo rey

Hinton . A view of the EM algorithm that justi es incremental, sparse, and other variants . In M.I. Jordan, editor, Learning in Graphical Models , pages 355 { 368 . Kluwer Academic Publishers, 1998 .

[9]

H.E.

Rauch ,

Tung , and

Striebel . Maximum likelihood estimates of linear dynamic systems . AIAA journal , 3 ( 8 ): 1445 { 1450 , 1965 .

[10]

H.W.

Sorenson . Kalman ltering: theory and application . IEEE, 1985 .

[11] http://www.20sim.com/.