=Paper=
{{Paper
|id=None
|storemode=property
|title=Constructing Bayesian Networks for Linear Dynamic Systems
|pdfUrl=https://ceur-ws.org/Vol-818/paper4.pdf
|volume=Vol-818
}}
==Constructing Bayesian Networks for Linear Dynamic Systems==
Sander Evers and Peter J.F. Lucas
Radboud University Nijmegen, Institute for Computing and Information Sciences, Nijmegen, the Netherlands

Abstract

Building probabilistic models for industrial applications cannot be done effectively without making use of knowledge engineering methods that are geared to the industrial setting. In this paper, we build on well-known modelling methods from linear dynamic system theory as commonly used by the engineering community to facilitate the systematic creation of probabilistic graphical models. In particular, we explore a direction of research where the parameters of a linear dynamic system are assumed to be uncertain. As a case study, the heating process of paper in a printer is modelled. Different options for the representation, inference and learning of such a system are discussed, and experimental results obtained by this approach are presented. We conclude that the methods developed in this paper offer an attractive foundation for a methodology for building industrial, probabilistic graphical models.

1 Introduction

As part of the manual construction process of a Bayesian network for an actual problem, one needs to somehow translate knowledge that is available in the problem domain to an appropriate graphical structure. Problem characteristics also play a major role in the choice of the type of the associated local probability distributions. The whole process can be looked on as a form of knowledge engineering, where the actual construction of the Bayesian network is only one of the many needed activities in the development process. The acquisition of knowledge from domain experts is traditionally seen as one of the most important bottlenecks in knowledge engineering. It is well known that this can be greatly alleviated if there is an easy mapping from the informal ways domain knowledge is described, in documents or verbally by experts, to the Bayesian network formalism [7]. A typical example of such a mapping is the exploitation of available causal knowledge in a particular domain; often, causal knowledge can be easily translated to an initial graph structure of a Bayesian network, which can be refined later, for example by examining the conditional independence relationships represented by the resulting network. Fields where causal knowledge has been successfully used in knowledge engineering for Bayesian networks include medicine and biology.

However, for industrial applications the situation is somewhat different. This is mainly because, in time, engineers have developed their own notational conventions and formalisms to get a grip on the domain of concern in the system-development process. In addition, industrial artifacts are designed by humans, and already early in the design process models are available that can also be used for other purposes: model-based design and development is here the central paradigm. Thus, rather than replacing methods from engineering by some new and unrelated methods, a better option seems to be to deploy as many of the engineering principles, methods, and assumptions as possible. Linearity is one of the assumptions frequently used, as it facilitates the development of complex models, as industrial applications often are. Another commonly used method is the use of diagrams that act as abstractions of the system being developed. Diagrams act as important means for communication. Ideally, one would like to use similar diagrams as a starting point for building Bayesian networks or related probabilistic graphical models.
In this paper we explore these ideas by taking Linear Dynamic Systems, LDS for short, as a start for the construction process. LDS models enjoy a well-developed theory and practice, as they are widely used throughout many engineering disciplines for tracking system behaviour (cf. the well-known Kalman filter [4]) as well as for controlling this behaviour. Like Bayesian networks, an LDS can often be represented by a graphical diagram, which facilitates documentation and communication of the model among experts and non-experts.

Although an LDS is deterministic in nature, it is often used in situations that involve uncertainty. The Kalman filter is the canonical example here: given some noisy observations, it determines the expected current state of the system (mean and variance). However, the Kalman filter only accounts for one specific type of uncertainty: additive linear Gaussian noise on the state and output variables.

In this article, we explore a different direction of augmenting an LDS with probabilities: we regard the parameters as unknown. This allows us for example to model a printer that heats different types of paper, in which it is uncertain what the current type is. Previous Bayesian networks modelling this situation were developed in a laborious ad hoc manner by close cooperation of domain experts and probabilistic modelling experts [3]; in this article we aim for a more systematic approach.

Figure 1: LDS model of paper heater. Left: as an electrical diagram, where voltage represents temperature and current represents heat flow. From top to bottom, the diagram consists of (1) a time-variable heat source, (2) a heat mass with capacity c, (3) a heat resistance r, (4) a heat mass with capacity c', (5) a heat resistance r', (6) a heat sink with temperature 20°C modelling the constant flow of paper. Right: as a graphical representation of the state-space equations for two discrete time points t, t+1. The state of the system is described by the T variables, which represent the temperatures of the two heat masses. Its input is P^{in}_t, the power entering the system. Auxiliary variables represent the power flowing from the first heat mass to the second (P^{trans}_t), and from the second to the heat sink (P^{out}_t). Note: the parameters of the system are r, c, r', c' (instantiated with concrete values in a real system).

2 LDS models and their role in the engineering process

2.1 Basic definitions

In its most basic form, a Linear Dynamic System or LDS describes a system's behaviour by the differential equation

\frac{d}{dt} x(t) = A x(t) + B u(t)

known as the state-space representation. Here, vector x(t) represents the system's state at time t, u(t) its input at time t, and matrices A and B describe how the current state change depends linearly on the current state and input.

This is a continuous-time representation; in order to calculate with LDS models, time is usually discretized. In this article, we therefore use the simple discretized model

x_{t+1} - x_t = A x_t + B u_t

in which the size of the discretization steps conveniently equals the unit of the time domain in order to simplify the exposition (in practice one can use discretization steps of size Δt and scale the A and B matrices with this factor). The equation can then be rewritten to:

x_{t+1} = A_d x_t + B_d u_t,   A_d = A + I,   B_d = B    (1)
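To make the discretization step concrete, the following minimal NumPy sketch (ours, not part of the paper; the example matrices and input are placeholder values) builds (A_d, B_d) from a continuous-time pair (A, B) via Eq. (1) and iterates the resulting difference equation:

```python
import numpy as np

def discretize(A, B):
    """Eq. (1): unit-step discretization, A_d = A + I, B_d = B."""
    return A + np.eye(A.shape[0]), B

def simulate(A_d, B_d, x0, inputs):
    """Iterate x_{t+1} = A_d x_t + B_d u_t and return the state trajectory."""
    xs = [np.asarray(x0, dtype=float)]
    for u in inputs:
        xs.append(A_d @ xs[-1] + B_d @ u)
    return np.array(xs)

# Placeholder example: a 2-state system driven by a constant 2-dimensional input.
A = np.array([[-0.5, 0.5],
              [ 0.2, -0.3]])
B = np.array([[1.0, 0.0],
              [0.0, 0.1]])
A_d, B_d = discretize(A, B)
traj = simulate(A_d, B_d, x0=np.zeros(2), inputs=[np.array([1.0, 20.0])] * 10)
print(traj[-1])
```

As noted above, for a step size Δt other than the time unit one would scale A and B by Δt before adding the identity.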
2.2 Role in engineering

In the engineering process, LDS models of systems are often represented by means of diagrams. We exemplify the role of these models and diagrams using a case study which remains our running example throughout the paper. The case study originates from a manufacturer of industrial printers. To ensure print quality, paper needs to be heated to a certain temperature, which is accomplished by passing the paper along a metal heater plate. It is quite important that the paper reaches the right temperature. If it is too cold, print quality suffers; if it is too hot, energy (and money) is wasted or worse: the printer might malfunction. Therefore, engineers have put a lot of effort in the accurate modelling of the heating process. This results in models such as Fig. 1, in which the heater is modelled as two distinct heat masses: when the heater is powered, the first mass is directly heated, thereby indirectly heating the second mass, which transfers the heat to the paper.¹ In the diagram, the heating dynamics are represented as an electrical circuit, where temperature plays the part of voltage and heat flow that of current.

¹ This is known as a lumped element model; in contrast, the heat distribution could also be modelled as a 3-dimensional temperature field over the plate.

A diagram like this has important advantages for the engineering process:

• It is very well suited for documentation and communication. A trained engineer can read the system's basic dynamic behaviour off this diagram in a blink of an eye; for a non-expert with a science background it is relatively easy to gain some intuition.

• It has a formal meaning as an LDS; it translates into the state-space equations in the form of Eq. (1), connecting it to a vast body of theoretical and practical knowledge.

• It separates qualitative and quantitative aspects of the model; the former are determined by the diagram structure, the latter by the parameters.

• It is composable: other models like this can be developed independently and joined into a larger system.

• It is supported by software: drawing and managing modules of electrical circuits (and also other graphical forms like bond graphs [5] and schematic diagrams) can be done by tools like 20-sim [11], which can also perform the translation to the state-space representation. This representation can be used for simulation, e.g. in MATLAB.

However, it is confined to modelling deterministic behaviour. In the realm of probabilistic modelling, the formalism of Bayesian networks shares the above attractive properties. A natural question is therefore: how can we combine these well-known LDS models with Bayesian networks?

Specifically, this paper will explore the situation where the parameters of the system (in this case: r, c, r', c') involve uncertainty. This direction is induced by the following use case: the paper heater modelled above is used with different paper types {pt1, pt2, pt3} (for example: 80 g/m² A4, 200 g/m² A4, 80 g/m² Letter). We have no direct knowledge about which type is in the heater, and would therefore like to model it as a probabilistic variable PT. Each paper type leads to a different value for the system's r' parameter (the heat resistance between plate and paper). The question we ask ourselves is: How can we join the paper type variable to the LDS model, so we can infer probabilistically which paper type is in the heater, by observing the T' values of the system?
3 Augmenting LDS models with uncertainty

3.1 Bayesian network representation

A Bayesian network B = (G, Φ) is an acyclic directed graph G, consisting of nodes and arcs, that is faithful to a joint probability distribution factored as Φ, which contains for each node Y (with parents X_1, X_2, ...) a family of local conditional probability density or mass functions P(Y | X_1, X_2, ...). A dynamic Bayesian network [1] is a special case of this, where the nodes are partitioned in time slices all consisting of the same structure and distributions. Furthermore, arcs are only allowed between nodes in the same or adjacent time slice.

For modelling linear dynamic systems as (dynamic) Bayesian networks, only the following types of nodes are needed:

Deterministic nodes: a node Y with parents X_1, X_2, ... is called deterministic if its conditional probability distribution is

P(y | x_1, x_2, ...) = 1 if y = f(x_1, x_2, ...), and 0 if y ≠ f(x_1, x_2, ...)

for a certain function f; in this article, these functions are mostly linear, i.e.

y = µ + α x_1 + β x_2 + ...

We use a special notation for these linear deterministic nodes, shown in Fig. 2 (left).

Linear Gaussians: a node Y with parents X_1, X_2, ... is known in the Bayesian network literature as a linear Gaussian if

P(Y | x_1, x_2, ...) = N(µ + α x_1 + β x_2 + ..., σ²)

Networks that consist only of linear Gaussians (with σ > 0) have theoretical significance: their joint distribution is multivariate Gaussian, and exact inference is easy and efficient (e.g. see [6]). A linear Gaussian without parents, N(µ, σ²), is simply called Gaussian; the Gaussian N(0, 1) is called a standard Gaussian. A linear Gaussian can be written as a linear deterministic node with two extra parents; see Fig. 2 (middle).

Conditional linear Gaussians: a node Y with parents X_1, X_2, ... and discrete parent Θ is conditional linear Gaussian if

P(Y | x_1, x_2, ..., θ) = N(µ_θ + α_θ x_1 + β_θ x_2 + ..., σ²_θ)

i.e. it is linear Gaussian for each value θ. If X_1, X_2, ... are Gaussian, the marginal distribution over Y is a mixture of Gaussians:

P(Y) = \sum_{θ∈Θ} P(θ) N(µ̂_θ, σ̂²_θ)
µ̂_θ = µ_θ + α_θ µ_{X1} + β_θ µ_{X2} + ...
σ̂²_θ = σ²_θ + α²_θ σ²_{X1} + β²_θ σ²_{X2} + ...

Again, this also holds for complete networks: if all nodes are (conditional) linear Gaussian, the joint distribution is a mixture of multivariate Gaussians. However, the number of components in this mixture is exponential in the number of Θ variables, which can make inference hard. A conditional linear Gaussian can also be written as a deterministic node with extra parents. For this we use a special notation shown in Fig. 2 (right), to which we will return later.

Figure 2: The three basic types of Bayesian network nodes we will use for LDS models. Left: linear deterministic node P(y|x_1, x_2) = 1 if y = µ + αx_1 + βx_2. Middle: linear Gaussian node P(Y|x_1, x_2) = N(µ + αx_1 + βx_2, σ²). Right: conditional linear Gaussian node P(Y|x_1, x_2, θ) = N(µ_θ + α_θ x_1 + β_θ x_2, σ²_θ) (for discrete Θ). The notation is ours and is introduced in section 3.3.

As these three node types can all be written as deterministic nodes, we will henceforth use the convention that all non-root nodes in our networks are deterministic.
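To illustrate the conditional linear Gaussian case, the following sketch (ours; all numeric values are made up) evaluates the µ̂_θ and σ̂²_θ formulas above for a node with two Gaussian parents, yielding the components of the mixture P(Y):

```python
import numpy as np

# Conditional linear Gaussian node Y with Gaussian parents X1, X2 and discrete parent Theta.
# Per-theta parameters (mu_theta, alpha_theta, beta_theta, sigma_theta); made-up values.
p_theta = {"th1": 0.3, "th2": 0.5, "th3": 0.2}
params  = {"th1": (1.0, 0.5, 1.0, 0.1),
           "th2": (0.0, 1.0, 0.5, 0.2),
           "th3": (2.0, 0.2, 0.2, 0.3)}
mu_x1, sigma_x1 = 0.0, 1.0          # X1 ~ N(0, 1)
mu_x2, sigma_x2 = 3.0, 0.5          # X2 ~ N(3, 0.25)

def marginal_mixture():
    """P(Y) = sum_theta P(theta) * N(mu_hat_theta, sigma_hat_theta^2)."""
    components = []
    for th, (mu, a, b, s) in params.items():
        mu_hat  = mu + a * mu_x1 + b * mu_x2
        var_hat = s ** 2 + a ** 2 * sigma_x1 ** 2 + b ** 2 * sigma_x2 ** 2
        components.append((p_theta[th], mu_hat, var_hat))
    return components

for w, m, v in marginal_mixture():
    print(f"weight {w:.1f}: N({m:.2f}, {v:.3f})")
```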
3.2 LDS models as Bayesian networks

The paper heater model in Fig. 1 translates to the following discrete-time state-space equations in the form of Eq. (1):

\begin{pmatrix} T_{t+1} \\ T'_{t+1} \end{pmatrix} = A_d \begin{pmatrix} T_t \\ T'_t \end{pmatrix} + B_d \begin{pmatrix} P^{in}_t \\ 20°C \end{pmatrix}

A_d = \begin{pmatrix} 1 - 1/rc & 1/rc \\ 1/rc' & 1 - 1/rc' - 1/r'c' \end{pmatrix}    (2)

B_d = \begin{pmatrix} 1/c & 0 \\ 0 & 1/r'c' \end{pmatrix}

The state of the system consists of the temperatures T_t and T'_t of the two heat masses. In fact, translating the electrical diagram by tools such as 20-sim first leads to a more elaborate form in which auxiliary power variables are present. It is instructive to represent this form as a Bayesian network consisting only of linear deterministic nodes; this is shown at the right side of Fig. 1. As the network is completely deterministic, it might also be read as a system of equations over ordinary variables:

• Each node represents the left-hand side of an equation, consisting of one variable.

• The incoming arcs represent the right-hand side: a linear combination of the parent variables, with coefficients as specified on the arcs. Note: we follow the convention that empty arcs carry a coefficient of 1.

For example, the figure shows that

P^{trans}_t = (1/r) T_t + (-1/r) T'_t = (T_t - T'_t)/r

T_{t+1} = (1/c) P^{in}_t + T_t + (-1/c) P^{trans}_t = T_t + (P^{in}_t - P^{trans}_t)/c

The state-space equations (2) are obtained from this system by substituting the P^{trans} and P^{out} variables for their right-hand sides. Interpreted as a Bayesian network, this corresponds to marginalization.

We will now start to add uncertainty to the LDS. First, as is often done (e.g. in the Kalman filter), we augment the state variables with additive zero-mean Gaussian noise:

\begin{pmatrix} T_{t+1} \\ T'_{t+1} \end{pmatrix} = A_d \begin{pmatrix} T_t \\ T'_t \end{pmatrix} + B_d \begin{pmatrix} P^{in}_t \\ 20°C \end{pmatrix} + \begin{pmatrix} W_t \\ W'_t \end{pmatrix}    (3)

W_t ~ N(0, σ²),   W'_t ~ N(0, σ'²)

The noise is represented by two independent variables W_t and W'_t. A graphical representation of this system is shown in Fig. 3; we have only replaced W_t and W'_t by two anonymous standard Gaussian variables N(0, 1) whose value is multiplied by σ and σ'. As a result, the T_t variables now have the linear Gaussian form from Fig. 2 (middle).

Figure 3: LDS of Eq. (3), containing additive zero-mean Gaussian noise on the state variables.

In fact, this makes the whole system Gaussian: although the P^{trans}_t and P^{out}_t nodes are not linear Gaussian, we have already seen that they can be marginalized out, making the T_{t+1}, T'_{t+1} nodes directly depend on T_t, T'_t. As for the P^{in}_t node: as it is always used with concrete evidence p^{in}_t, it never represents a probability distribution, and can be reformulated to take the place of µ in the T_{t+1} distribution. Thus, Fig. 3 is a dynamic Bayesian network with a Gaussian joint distribution. This means that we can use a variant of the standard Kalman filter to do exact inference on this network; we discuss this in detail in Sect. 4.
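As an illustration of equations (2) and (3), here is a small NumPy sketch (ours, not part of the paper; the parameter values θ = (r, c, r', c') and noise levels are made-up placeholders) that builds A_d(θ) and B_d(θ) and simulates the noisy heater state:

```python
import numpy as np

def A_d(r, c, r2, c2):
    """State matrix of Eq. (2); r2, c2 stand for r' and c'."""
    return np.array([[1 - 1/(r*c), 1/(r*c)                 ],
                     [1/(r*c2),    1 - 1/(r*c2) - 1/(r2*c2)]])

def B_d(r, c, r2, c2):
    """Input matrix of Eq. (2); the second input component is the 20°C heat sink."""
    return np.array([[1/c, 0.0      ],
                     [0.0, 1/(r2*c2)]])

def simulate(theta, p_in, sigma=0.5, sigma2=0.5, T0=(20.0, 20.0), rng=None):
    """Eq. (3): (T_{t+1}, T'_{t+1}) = A_d (T_t, T'_t) + B_d (P_in_t, 20) + (W_t, W'_t)."""
    if rng is None:
        rng = np.random.default_rng(0)
    Ad, Bd = A_d(*theta), B_d(*theta)
    T = np.array(T0, dtype=float)
    traj = [T.copy()]
    for p in p_in:
        noise = rng.normal(0.0, [sigma, sigma2])
        T = Ad @ T + Bd @ np.array([p, 20.0]) + noise
        traj.append(T.copy())
    return np.array(traj)

theta = (3.0, 10.0, 5.0, 8.0)                       # made-up (r, c, r', c')
p_in  = 100 * (1 + np.sin(np.linspace(0, 6, 150)))  # sine-wave input, cf. Sect. 5.3
print(simulate(theta, p_in)[-1])
```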
3.3 LDS models with uncertain parameters

Above, we have shown how to represent an LDS with additive zero-mean Gaussian noise as a Bayesian network; while it may be instructive, thus far it is only a convenient reformulation of known theory. However, to model the relation with the paper type variable, we need uncertainty in the parameters, i.e. the coefficients of the linear relation. Our solution is to transform these coefficients into probabilistic variables. We accomplish this by introducing a new node type:

Conditional linear deterministic node: a linear deterministic node extended with a special parent, which is distinguished from the rest because its arc ends in a diamond (and carries no coefficient); we call this parent the conditioning parent. The coefficients on the other arcs can depend on the value of the conditioning parent. This dependence is shown by putting a bullet in the place where this value is to be filled in.

We have already given an example of such a node: the conditional linear Gaussian node in Fig. 2. Just like the linear Gaussian node is an instance of a linear deterministic node, viz. having specific parents 1 and N(0, 1), a conditional linear Gaussian node is a specific instance of a conditional linear deterministic node.

By using conditional linear deterministic nodes, we can extend our already noisy paper heater model with uncertain parameters: we replace parameter r by variable R—which becomes the conditioning parent of the node that depended on r—and do the same for the other parameters. The result is shown in Fig. 4.

Figure 4: LDS of Eq. (3) augmented with uncertain parameters. These are modelled by conditional linear deterministic nodes.

We can now proceed to connect a discrete paper type variable to the model, with an example distribution assigning to pt1, pt2, pt3 the probabilities 0.3, 0.5 and 0.2. Like we mentioned, the paper type determines the R' parameter, but for generalization's sake we will assume that it influences all the parameters. The resulting model, in Fig. 5, also shows that the notation for (conditional) linear deterministic nodes extends very naturally to vector-valued variables: coefficients become matrices. These are the matrices from Eq. (2), written as functions A_d(θ) and B_d(θ) of θ = (r, c, r', c'). Thus, we have made a graphical summary of the model which is linked very clearly to the state-space equations. Although this hides some of the model's internal structure, it is useful for keeping an overview.

Figure 5: The model from Fig. 4, summarized using vector variables (where Θ represents R, C, R', C') and augmented with a paper type variable with a discrete distribution. The relation between time slices t−1 and t is also shown. There are several options for the relation between paper type and Θ (the dashed arc here is not a linear influence).

Regarding the probability distribution of the Θ variable, we give two examples:

Discrete Θ: The paper types pt1, pt2, pt3 deterministically set a value θ1, θ2, θ3 (resp.) for Θ. In fact, this turns T_t and T_{t+1} into conditional linear Gaussians (conditioned by the paper type), so the joint distribution is a mixture of 3 Gaussians.

Continuous Θ: The paper types pt1, pt2, pt3 determine the parameters (µ_θ1, σ_θ1), (µ_θ2, σ_θ2), (µ_θ3, σ_θ3) for a Gaussian-distributed Θ. This model is no longer linear.
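A minimal sketch of these two options (ours; all numeric values are placeholders) maps each paper type either to a single parameter vector θ (discrete Θ) or to the mean and standard deviation of a Gaussian over Θ (continuous Θ), and samples from the resulting prior:

```python
import numpy as np

paper_type_prior = {"pt1": 0.3, "pt2": 0.5, "pt3": 0.2}

# Discrete Theta: each paper type deterministically fixes theta = (r, c, r', c').
theta_discrete = {"pt1": np.array([3.0, 10.0, 5.0, 8.0]),
                  "pt2": np.array([3.0, 10.0, 7.0, 8.0]),
                  "pt3": np.array([3.0, 10.0, 4.0, 8.0])}

# Continuous Theta: each paper type fixes the mean and standard deviation of Theta.
theta_continuous = {pt: (mu, np.array([0.2, 0.5, 0.3, 0.4]))
                    for pt, mu in theta_discrete.items()}

def sample_theta(rng, continuous=False):
    """Sample a paper type from its prior, then theta according to the chosen model."""
    pt = rng.choice(list(paper_type_prior), p=list(paper_type_prior.values()))
    if not continuous:
        return pt, theta_discrete[pt]
    mu, sd = theta_continuous[pt]
    return pt, rng.normal(mu, sd)

print(sample_theta(np.random.default_rng(0)))
print(sample_theta(np.random.default_rng(0), continuous=True))
```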
These options have an influence on inference and learning in the model, which we discuss in the next sections.

4 Inference

In this section, we shortly discuss inference in the uncertain parameter model of Fig. 4, for both the discrete and continuous Θ given above. Assume we observe the system's T'_t variable responding to the P^{in}_t input for a while, resulting in data D = {p^{in}_0, t'_0, ..., p^{in}_{m−1}, t'_{m−1}, t'_m}, and the goal is to find out the paper type.

This can be done by a forward pass over the model as known from the dynamic Bayesian network literature. We start with the prior distribution

P(t_0, θ, t'_0) = P(t_0) P(θ) P(t'_0)

and perform a recursive forward pass from t = 0 to t + 1 = m:

P(t_{t+1}, θ, p^{in}_{0..t}, t'_{0..t+1}) = ∫ P(t_t, θ, p^{in}_{0..t−1}, t'_{0..t}) P(t_{t+1} | t_t, t'_t, p^{in}_t, θ) P(t'_{t+1} | t_t, t'_t, θ) dt_t

Finally, we marginalize out t_m:

P(θ, D) = ∫ P(t_m, θ, D) dt_m

The details of the inference algorithm depend on the model used. For the discrete Θ, all the distributions above are linear Gaussian, so we can multiply and integrate exactly. To be precise, P(t_{t+1} | t_t, t'_t, p^{in}_t, θ) and P(t'_{t+1} | t_t, t'_t, θ) are linear Gaussian for each of the 3 individual θ_i values; the algorithm thus independently works on 3 Gaussians.

For the continuous Θ, the conditional distributions of T_{t+1} and T'_{t+1} are not linear Gaussian; we can do approximate inference by linearizing these distributions at each timeslice around the means of T_t, Θ (given the data up to t), in analogy to the Extended Kalman Filter (see e.g. [6, 10]).

A second type of inference that we do with the model is smoothing. In particular, we want to calculate P(θ, t_t, t_{t+1} | D) for each timeslice, in order to do EM learning (see the next section). We have used a Rauch-Tung-Striebel-type smoother [9]. This uses a forward pass like the one discussed above, with the adjustment that it stores the distributions over two time slices. The last of these, i.e. P(t_{m−1}, t_m, θ, D), is used as the input for a recursive backward pass defined as follows:

P(t_{t−1}, t_t, θ, D) = ∫ P(t_t, t_{t+1}, θ, D) P(t_{t−1} | t_t, θ, D) dt_{t+1}

where the first factor in the integral is the recursive one, and the second is calculated from the distribution over t_{t−1}, t_t stored in the forward pass:

P(t_{t−1} | t_t, θ, D) = P(t_{t−1} | t_t, p^{in}_{0..t−1}, t'_{0..t}) = P(t_{t−1}, t_t, p^{in}_{0..t−1}, t'_{0..t}) / ∫ P(t_{t−1}, t_t, p^{in}_{0..t−1}, t'_{0..t}) dt_{t−1}

The advantage of such a smoother over an independent backward pass is that it does not linearize the distribution over T_{t−1}, T_t in two different ways (the backward pass uses the linearization of the forward pass).
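For the discrete Θ case, the forward pass described at the beginning of this section amounts to running one exact Gaussian filter per candidate θ_i and weighting the accumulated likelihoods by the prior P(θ). The following sketch (ours, not the authors' implementation; the initial state guess and the noise levels are assumptions) conditions on the observed T'_t at every slice and returns the posterior over paper types:

```python
import numpy as np

def forward_pass(candidates, prior, p_in, t2_obs, sigma=0.5, sigma2=0.5):
    """Forward pass for the discrete-Theta model: one exact Gaussian filter per
    candidate theta (passed as a precomputed (A_d, B_d) pair), conditioning on
    the observed T'_t at every slice and accumulating log P(theta, D).
    Returns the normalized posterior over paper types."""
    Q = np.diag([sigma ** 2, sigma2 ** 2])           # process noise of Eq. (3)
    log_joint = {}
    for pt, (Ad, Bd) in candidates.items():
        m = np.array([t2_obs[0], t2_obs[0]], float)  # crude initial guess for (T_0, T'_0)
        P = np.diag([25.0, 0.0])                     # T'_0 is observed, T_0 is uncertain
        ll = np.log(prior[pt])
        for p, y in zip(p_in, t2_obs[1:]):
            m = Ad @ m + Bd @ np.array([p, 20.0])    # predict (T_{t+1}, T'_{t+1})
            P = Ad @ P @ Ad.T + Q
            ll += -0.5 * np.log(2 * np.pi * P[1, 1]) - 0.5 * (y - m[1]) ** 2 / P[1, 1]
            gain = P[0, 1] / P[1, 1]                 # condition T_{t+1} on T'_{t+1} = y
            m = np.array([m[0] + gain * (y - m[1]), y])
            P = np.array([[P[0, 0] - gain * P[0, 1], 0.0], [0.0, 0.0]])
        log_joint[pt] = ll
    z = np.logaddexp.reduce(np.array(list(log_joint.values())))
    return {pt: float(np.exp(v - z)) for pt, v in log_joint.items()}
```

Here `candidates` maps each paper type to its precomputed (A_d(θ_i), B_d(θ_i)) pair (e.g. built with helper functions like those sketched in Sect. 3.2), and `p_in`, `t2_obs` are the observed input and T' sequences from D.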
5 Learning

For learning the model, we discuss the situation where we know the paper type (assume it is pt1) and observe the system like before, i.e. D = {p^{in}_0, t'_0, ..., p^{in}_{m−1}, t'_{m−1}, t'_m}. The goal is to learn the parameter set ρ that maximizes the likelihood P(D | pt1; ρ). The situation is a little different depending on the model for Θ.

5.1 EM for continuous Θ

For the continuous Θ, the restriction to pt1 means that we are learning the parameters for one multivariate Gaussian variable Θ (actually consisting of four independent variables R, C, R', C') and the σ, σ' process noise parameters. Thus, ρ = (µ_θ1, σ_θ1, σ, σ'). This can be done by a standard EM algorithm [2] for Bayesian networks: given a set of initial parameters ρ_i, the approximate smoother infers the distributions P(t_t, t_{t+1}, θ | D, pt1; ρ_i). From these, the expected sufficient statistics are gathered for maximizing

P(D, T_{0..m}, Θ | pt1; ρ_{i+1})

expected under the old parameters ρ_i; this is repeated until convergence.

5.2 EM for discrete Θ

For the discrete parameter space model, we are looking for the parameter set (θ, σ, σ') that maximizes the likelihood

P(D | pt1, Θ = θ; σ, σ')

Note that the role of Θ is different here: we are not learning the optimal parameters for a distribution over Θ, but the optimal single value. This requires some adjustments to the smoother: it should store distributions over (T_t, T_{t+1}) instead of over (T_t, T_{t+1}, Θ), and should not use a prior distribution over Θ either. Because all the probability distributions are linear Gaussian again, smoothing is exact now.

However, maximizing the expected likelihood is not so trivial now: we are looking for the optimal linear Gaussian distribution P(t_{t+1}, t'_{t+1} | t_t, t'_t, p^{in}_t, θ) constrained to a certain form prescribed by A and B. Specifically, the log likelihood for an individual time slice is:

log P(t_{t+1}, t'_{t+1} | t_t, t'_t, p^{in}_t, θ) = -(1/2) δ^T Σ^{-1} δ - (1/2) log |2πΣ|

where Σ = \begin{pmatrix} σ² & 0 \\ 0 & σ'² \end{pmatrix} and δ is an abbreviation for

δ = \begin{pmatrix} t_{t+1} \\ t'_{t+1} \end{pmatrix} - A_d(θ) \begin{pmatrix} t_t \\ t'_t \end{pmatrix} - B_d(θ) \begin{pmatrix} p^{in}_t \\ 20°C \end{pmatrix}

Separating variables and parameters, we can write:

δ = D(θ) x_t
D(θ) = ( I   -A_d(θ)   -B_d(θ) )
x_t = ( t_{t+1}, t'_{t+1}, t_t, t'_t, p^{in}_t, 20°C )^T

The log likelihood for all the time slices (ignoring the term -(1/2) log |2πΣ| for now) is then:

log P(D, t_{0..m} | θ) = -(1/2) \sum_{t=0..m} (D(θ) x_t)^T Σ^{-1} D(θ) x_t
                       = -(1/2) \sum_{t=0..m} [ (D_1(θ) x_t)^T D_1(θ) x_t / σ²  +  (D_2(θ) x_t)^T D_2(θ) x_t / σ'² ]

where D_i denotes the i-th row of D. The goal is to maximize the expected value of this expression. At first sight, it seems that the two terms are dependent through θ, but on closer inspection

D(θ) = \begin{pmatrix} 1 & 0 & -1 + 1/rc & -1/rc & -1/c & 0 \\ 0 & 1 & -1/rc' & -1 + 1/rc' + 1/r'c' & 0 & -1/r'c' \end{pmatrix}

we see that the values in the first row do not constrain those in the second, or vice versa. We can therefore minimize the expected value of the two terms independently. We can also see that there are linear constraints for the values within a row, e.g. D_{1,3}(θ) + D_{1,4}(θ) = -1. We record these constraints in a matrix C and vector c such that C D_1(θ)^T = c. Substituting d = D_1(θ)^T, for the first term we are looking for the d that minimizes

\sum_{t=0..m} E(x_t^T d d^T x_t) = d^T ( \sum_{t=0..m} E(x_t x_t^T) ) d

under the constraint Cd = c. This is a linearly constrained quadratic optimization problem that can be solved by the method of Lagrange multipliers. The second term can be minimized in the same way.

In conclusion, we have derived the M-phase for the discrete Θ model; in the E-phase, we therefore have to collect the expected sufficient statistics E(x_t x_t^T).
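To make the M-phase concrete, the following sketch (ours; the matrix M and the constraints C, c are placeholders for the expected sufficient statistics \sum_t E(x_t x_t^T) and for the row constraints discussed above) solves the linearly constrained quadratic problem min_d d^T M d subject to C d = c via the Lagrange-multiplier (KKT) system:

```python
import numpy as np

def constrained_min(M, C, c):
    """Minimize d^T M d subject to C d = c via Lagrange multipliers:
    solve the KKT system [[2M, C^T], [C, 0]] [d; lam] = [0; c]."""
    n, k = M.shape[0], C.shape[0]
    kkt = np.block([[2 * M, C.T],
                    [C, np.zeros((k, k))]])
    rhs = np.concatenate([np.zeros(n), c])
    return np.linalg.solve(kkt, rhs)[:n]     # optimal row d = D_1(theta)^T

# Placeholder example: a 6-dimensional d (one row of D(theta)) and two example constraints.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))
M = sum(np.outer(x, x) for x in X)           # stands in for sum_t E(x_t x_t^T)
C = np.array([[1.0, 0, 0, 0, 0, 0],          # e.g. d_1 = 1 (identity block of D)
              [0, 0, 1.0, 1.0, 0, 0]])       # e.g. D_{1,3} + D_{1,4} = -1
c = np.array([1.0, -1.0])
d = constrained_min(M, C, c)
print(d, C @ d)                              # the constraints hold at the optimum
```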
5.3 Comparison

It is interesting to compare learning for continuous and discrete Θ. In order to do this, we have simulated the system in Fig. 1 for 150 time slices, with a sine wave as P^{in} input and random Gaussian disturbance. We provide the EM algorithms discussed above with the P^{in} and the generated T'_t data. Typical results of a 60-iteration run are shown in Fig. 6.

Figure 6: EM learning of the Θ parameters: comparison of discrete parameter space (parameters are single values; shown in green) and continuous parameter space (parameters are Gaussians; µ values shown in red). The horizontal axis represents 60 EM iterations. Also shown are the learned distribution over T_1 (Gaussian, µ value) and the log likelihood.

Figure 7: Blue: synthetic data used for learning (T_t and T'_t are shown, but only the latter is given to the learning algorithms). As input P^{in}_t, we used a sine wave. We disturbed the system with additive zero-mean Gaussian noise. Green, red: response of an undisturbed deterministic system using the learned parameters (with discrete and continuous parameter space, resp.) to the same P^{in}_t sine wave.

The most interesting fact to observe is that the two approaches converge to different values (but the same log likelihood). This probably means that the system is not identifiable: several choices for Θ lead to the same likelihood of the observed behaviour T'_t. To test this hypothesis, we also generated synthetic T_t, T'_t data (without disturbance) for systems with the learned parameters. The results are plotted in Fig. 7. These results indeed show that both methods arrive at the same approximation (green, red) of the original (blue) T'_t data; however, the different values for the parameters lead to a different behaviour of T_t.

A second observation from Fig. 6 is that learning continuous Θ converges faster than learning discrete Θ. The explanation for this is as follows: the EM algorithm for the continuous parameter space uses inference to compute a posterior distribution over the variable Θ. In this algorithm, the posterior distribution is updated for each time slice. However, we can also regard the algorithm as doing full Bayesian learning where Θ is viewed as a parameter; the algorithm is then performing incremental EM [8], which is known to converge faster.

6 Conclusion

The central scientific hypothesis which initiated the research described in this paper was that knowledge engineering methods for industrial applications of probabilistic graphical models should be based as much as possible on existing methods from engineering. We have developed a systematic framework, where we start with linear system theory and associated diagrams and notations as the first step in the knowledge engineering process. The advantage of this approach is that engineers have already dealt with the unavoidable complexity issues in representing realistic models. Subsequently, it was shown how linear dynamic system models can be augmented with probabilistic variables for uncertain parameters, transforming them into dynamic Bayesian networks with conditionally linear nodes. We introduced a concise notation that combines LDS and Bayesian network concepts in a natural way and demonstrated methods for inference and learning from data in these models. The practical usefulness of the framework was illustrated by a case study from the domain of large production printers.
Acknowledgements

This work has been carried out as part of the OCTOPUS project under the responsibility of the Embedded Systems Institute. This project is partially supported by the Netherlands Ministry of Economic Affairs under the Embedded Systems Institute program.

References

[1] Thomas Dean and Keiji Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5(3):142–150, 1989.

[2] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.

[3] A. J. Hommersom and P. J. F. Lucas. Using Bayesian networks in an industrial setting: Making printing systems adaptive. In 19th European Conference on Artificial Intelligence, pages 401–406, 2010.

[4] R. E. Kalman et al. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.

[5] D. C. Karnopp, D. L. Margolis, and R. C. Rosenberg. System Dynamics: Modeling and Simulation of Mechatronic Systems. John Wiley & Sons, 2006.

[6] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.

[7] K. B. Korb and A. E. Nicholson. Bayesian Artificial Intelligence. Chapman & Hall/CRC, 2004.

[8] Radford Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998.

[9] H. E. Rauch, F. Tung, and C. T. Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8):1445–1450, 1965.

[10] H. W. Sorenson. Kalman Filtering: Theory and Application. IEEE, 1985.

[11] http://www.20sim.com/