Bayesian Network Parameter Learning using EM with Parameter Sharing

Erik Reed and Ole J. Mengshoel
Electrical and Computer Engineering, Carnegie Mellon University

Abstract

This paper explores the effects of parameter sharing on Bayesian network (BN) parameter learning when there is incomplete data. Using the Expectation Maximization (EM) algorithm, we investigate how varying degrees of parameter sharing, varying numbers of hidden nodes, and different dataset sizes impact EM performance. The specific metrics of EM performance examined are: likelihood, error, and the number of iterations required for convergence. These metrics are important in a number of applications, and we emphasize learning of BNs for diagnosis of electrical power systems. One main point, which we investigate both analytically and empirically, is how parameter sharing impacts the error associated with EM's parameter estimates.

1 INTRODUCTION

Bayesian network (BN) conditional probability tables (CPTs) can be learned when the BN structure is known, for either complete or incomplete data. Different algorithms have been explored in the case of incomplete data, including: Expectation Maximization [8, 14, 15, 28], Markov Chain Monte Carlo methods such as Gibbs sampling [17], and gradient descent methods [9]. Expectation Maximization (EM) seeks to maximize the likelihood, or the Maximum a Posteriori (MAP) estimate, for the BN CPTs.

We focus in this paper on EM [8, 14, 15, 28], an iterative algorithm that converges to a maximum likelihood estimate (MLE). While EM is powerful and popular, there are several challenges that motivate our research. First, when computing MLEs, EM is easily trapped in local optima and is typically very sensitive to the placement of initial CPT values. Methods of making EM less prone to getting trapped in local optima have been investigated [11, 18, 34, 38]. Second, EM is often computationally demanding, especially when the BN is complex and there is much data [2, 3, 29, 35]. Third, the parameters that EM converges to can be far from the true probability distribution, yet still have a high likelihood. This is a limitation of EM based on MLE.

In this paper we investigate, for known BN structures, how varying degrees of parameter sharing [17, 25, 26], varying numbers of hidden nodes, and different dataset sizes impact EM performance. Specifically, we are:

• running many random initializations (or random restarts) of EM, a technique known to effectively counteract premature convergence [10, 22];
• recording for each EM run the following metrics: (i) log-likelihood (ℓℓ) of estimated BN parameters, (ii) error (the Euclidean distance between true and estimated BN parameters), and (iii) number of EM iterations until convergence; and
• testing BNs with great potential for parameter sharing, with a focus on electrical power system BNs (reflecting electrical power system components known to exhibit similar behavior).

Even when EM converges to a high-likelihood MLE, the error can be large and vary depending on initial conditions. This is a fundamental limitation of EM using MLE; even a BN with high likelihood may be far from the true distribution and thus have a large error. Error as a metric for the EM algorithm for BN parameter learning has not been discussed extensively in the existing literature. The analysis and experiments in this paper provide new insights in this area.

Our main application is electrical power systems, in particular NASA's Advanced Diagnostics and Prognostics Testbed (ADAPT) [27]. ADAPT has already been represented as BNs, which have proven themselves as very well-suited to electrical power system health management [12, 19–21, 30–33].
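The random-restart protocol described above can be illustrated with a minimal sketch. The two-node BN (a hidden node H with one observed child O), the function names, and the parameter settings below are illustrative assumptions, not the paper's LibDAI-based implementation; the sketch only shows the E-step/M-step alternation, the tolerance-based stopping rule, and how per-run metrics are kept.

```python
import math
import random

def run_em(data, eps=1e-3, t_max=100, rng=None):
    """One EM run on a toy BN H -> O (H hidden, O in {0, 1} observed),
    from a random initialization of pH = Pr(H=1) and pO[h] = Pr(O=1 | H=h)."""
    rng = rng or random.Random()
    pH, pO = rng.random(), [rng.random(), rng.random()]
    prev_ll = -math.inf
    for t in range(1, t_max + 1):
        exp_h = [1e-9, 1e-9]   # expected count of H=h (tiny smoothing)
        exp_o1 = [0.0, 0.0]    # expected count of (H=h, O=1)
        ll = 0.0
        for o in data:
            # E-step: joint Pr(H=h, O=o), then posterior weights for H.
            joint = [(1 - pH) * (pO[0] if o else 1 - pO[0]),
                     pH * (pO[1] if o else 1 - pO[1])]
            z = joint[0] + joint[1]
            ll += math.log(z)
            for h in (0, 1):
                exp_h[h] += joint[h] / z
                exp_o1[h] += (joint[h] / z) * o
        # M-step: re-estimate parameters from expected counts.
        pH = exp_h[1] / len(data)
        pO = [exp_o1[h] / exp_h[h] for h in (0, 1)]
        if abs(ll - prev_ll) <= eps:   # log-likelihood change below tolerance
            break
        prev_ll = ll
    return ll, (pH, pO), t

def random_restarts(data, runs=20, seed=0):
    """Run EM from many random initializations; record (ll, params, iters)
    per run and return the highest-likelihood run."""
    rng = random.Random(seed)
    results = [run_em(data, rng=rng) for _ in range(runs)]
    return max(results, key=lambda res: res[0])
```

Because this toy model is non-identifiable (only Pr(O = 1) is pinned down by the data), different restarts reach parameter vectors with nearly identical likelihood but different distances from the generating parameters, which is exactly the likelihood-versus-error gap studied in this paper.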
Through compilation of BNs to arithmetic circuits [4, 5], a broad range of discrete and continuous faults can be detected and diagnosed in a computationally efficient and predictable manner, resulting in award-winning performance in international diagnostic competitions [30].^1 From a machine learning and EM perspective, as considered in this paper, it is hypothesized that the learning of ADAPT BNs may benefit from parameter sharing. This is because there are several repeated BN nodes and fragments in these BNs. In addition to parameter sharing, we study in this paper the impact on EM of varying the number of hidden nodes, reflecting different sensing capabilities.

Footnote 1: Further information can be found here: https://sites.google.com/site/dxcompetition/.

Why are BNs and arithmetic circuits useful for electrical power system diagnostics? First, power systems exhibit multi-variate uncertainty, for example regarding component and sensor health (are they working or failing?) as well as noisy sensor readings. Second, there is substantial local structure, as reflected in an EPS schematic, that can be taken advantage of when constructing a BN automatically or semi-automatically [19, 20, 30]. Consequently, BN treewidth is small enough for exact computation using junction trees, variable elimination, or arithmetic circuits to be feasible [19, 30]. Third, we compile BNs into arithmetic circuits [4, 5], which are fast and predictable in addition to being exact. These are all important benefits in cyber-physical systems, including electrical power systems.

The rest of this paper is structured as follows. In Section 2, we introduce BNs, parameter learning for incomplete data using EM, and related research. Section 3 presents our main application area, electrical power systems. In Section 4, we define the sharing concept, discuss sharing in EM for BN parameter learning, and provide analytical results. In Section 5 we present experimental results for parameter sharing in BNs when using EM, emphasizing electrical power system fault diagnosis using BNs. Finally, we conclude and outline future research opportunities in Section 6.

Notation         Explanation
X                BN nodes
W                BN edges
θ                BN CPTs
θ̂                estimated CPTs
θ*               true CPTs
O                observable nodes
H                hidden nodes
S                (actually) shared nodes
P                (potentially) shared nodes
U                unshared nodes
Y                set partition of X
T_P              number of wrong CPTs
E                evidence nodes
R                non-evidence nodes
t_min            min # of EM iterations
t_max            max # of EM iterations
t_0              iteration # at EM convergence
ε                tolerance for EM
err(θ̂)           error of θ̂ relative to θ*
n_A = |A|        cardinality of the set A
ℓ                likelihood
ℓℓ               log-likelihood
B = (X, W, θ)    Bayesian network (BN)
E(Z)             expectation of r.v. Z
V(Z)             variance of r.v. Z
r                Pearson's corr. coeff.
θ ∈ [0, 1]       CPT parameter
θ̂ ∈ [0, 1]       estimated CPT parameter
θ* ∈ [0, 1]      true CPT parameter
δ                error bound for θ

Table 1: Notation used in this paper.

2 BACKGROUND

This section presents preliminaries, including notation (see also Table 1).

2.1 BAYESIAN NETWORKS

Consider a BN B = (X, W, θ), where X are discrete nodes, W are edges, and θ are CPT parameters. Let E ⊆ X be evidence nodes, and e the evidence. A BN factors a joint distribution Pr(X), enabling different probabilistic queries to be answered by efficient algorithms; these queries assume that the nodes E are clamped to the values e. One query of interest is to compute a most probable explanation (MPE) over the remaining nodes R = X \ E, or MPE(e). Computation of marginals (or beliefs) amounts to inferring the posterior probabilities over one or more query nodes Q ⊆ R, specifically BEL(Q, e), where Q ∈ Q.

In this paper, we focus on situations where θ needs to be estimated but the BN structure (X and W) is known. Data is complete or incomplete; in other words, there may be hidden nodes H, where H = X \ O and O are observed. A dataset is defined as (x_1, ..., x_m) with m samples (observations), where x_i is a vector of instantiations of the nodes X in the complete data case. When the data is complete, the BN parameters θ are often estimated to maximize the data likelihood (MLE).

In this paper, for a given dataset, a variable X ∈ X is either observable (X ∈ O) or hidden (X ∈ H); it is not hidden for just a strict subset of the samples.^2 Let n_H > 0. For each hidden node H ∈ H there is then a "?" or "N/A" in each sample.

Footnote 2: In other words, a variable that is completely hidden in the training data is a latent variable. Consequently, its true distribution is not identifiable.

Learning from incomplete data also relies on a likelihood function, similar to the complete data case. However, for incomplete data, several properties of the complete-data likelihood function (such as unimodality, a closed-form representation, and decomposition into a product form) are lost. As a consequence, the computational issues associated with BN parameter learning are more complex, as we now discuss.

When data is incomplete, the BN parameter estimation problem is in general non-identifiable. There may be several parameter estimates θ̂_1, ..., θ̂_m that have the same likelihood, given the dataset [36]. Thus, we need to be careful when applying standard asymptotic theory from statistics (which assumes identifiability) and when interpreting a learned model. Section 4.2 introduces an error measure that provides some insight regarding identifiability, since it measures distance from the true distribution θ*.

2.2 TRADITIONAL EM

For the problem of optimizing such multi-dimensional, highly non-linear, and multimodal functions, several algorithms have been developed. They include EM, our focus in this paper. EM performs a type of hill-climbing in which an estimate in the form of an expected likelihood function is used in place of the true likelihood ℓ.

Specifically, we examine the EM approach to learning BN parameters θ from incomplete data sets.^3 The traditional EM algorithm, without sharing, initializes the parameters to θ(0). Then, EM alternates between an E-step and an M-step. In the t-th E-step, using parameters θ(t) and the observables from the dataset, EM generates the likelihood ℓ(t), taking into account the hidden nodes H. In the M-step, EM modifies the parameters to θ(t+1) to maximize the data likelihood. While |ℓ(t) − ℓ(t−1)| ≥ ε, where ε is a tolerance, EM repeats from the E-step.

Footnote 3: Using EM, learning from complete data is a special case of learning from incomplete data.

EM monotonically increases the likelihood function ℓ, or the log-likelihood function ℓℓ; thus EM converges to a point θ̂ or a set of points (a region). Since ℓℓ is bounded, EM is guaranteed to converge. Typically, EM converges to a local maximum [37] at some iteration t_0, bounded as follows: t_min ≤ t_0 ≤ t_max. Due to the use of ε above, it is for practical implementations of EM with restart more precise to discuss regions of convergence, even when there is point-convergence in theory.

One topic that has been discussed is the initialization phase of the EM algorithm [8]. A second research topic is stochastic variants of EM, typically known as Stochastic EM [7, 11]. Generally, Stochastic EM is concerned with improving the computational efficiency of EM's E-step. Several other methods for increasing the efficacy of the EM algorithm for BNs exist. These include parameter constraints [1, 6, 25], parameter inequalities [26], exploiting domain knowledge [17, 24], and parameter sharing [13, 17, 25, 26].

3 ELECTRICAL POWER SYSTEMS

Electrical Power Systems (EPSs) are critical in today's society; for instance, they are essential for the safe operation of aircraft and spacecraft. The terrestrial power grid's transition into a smart grid is also very important, and the emergence of electrical power in hybrid and all-electric cars is a striking trend in the automotive industry.

ADAPT (Advanced Diagnostics and Prognostics Testbed) is an EPS testbed developed at NASA [27]. Publicly available data from ADAPT is being used to develop, evaluate, and mature diagnosis and prognosis algorithms. The EPS functions of ADAPT are as follows. For power generation, it currently uses utility power with battery chargers (there are also plans to investigate solar power generation). For power storage, ADAPT contains three sets of 24 VDC 100 Amp-hr sealed lead acid batteries. Power distribution is aided by electromechanical relays, and there are two load banks with AC and DC outputs. For control and monitoring, there are two National Instruments Compact FieldPoint backplanes. Finally, there are sensors of several types, including for: voltage, current, temperature, light, and relay positions.

ADAPT has been used in different configurations and represented in several fault detection and diagnosis BNs [12, 19–21, 30–33], some of which are investigated in this paper (see Table 2). Each ADAPT BN node typically has two to five discrete states. BN nodes represent, for instance: sensors (measuring, for example, voltage or temperature); components (for example, batteries, loads, or relays); or the system health of components and sensors (broadly, they can be in "healthy" or "faulty" states).

3.1 SHARED VERSUS UNSHARED

In BN instances where parameters are not shared, the CPT for a node in the BN is treated as separate from the other CPTs. This assumption is not always reasonable, however. ADAPT BNs may benefit from shared parameters, because there are typically several repeated nodes or fragments in these BNs [21, 30]. It is reasonable to assume that "identical" power system sensors and components will behave in similar ways. More broadly, sub-networks of identical components should function in a similar way to other "identical" sub-networks in a power system, and such knowledge can be the basis for parameter sharing.

For parameter sharing as investigated in this paper, the CPTs of some nodes are assumed to be approximately equal to the CPTs of different nodes elsewhere in the BN.^4 Data for one set of nodes can be used elsewhere in the BN if the corresponding nodes are shared during EM learning. This is a case of parameter sharing involving the global structure of the BN, where different CPTs are shared, as opposed to parameter sharing within a single CPT [13].

Footnote 4: It is unrealistic to assume that several engineered physical objects, even when it is desired that they are exactly the same, in fact turn out to have exactly the same behavior. A similar argument has been made for object-oriented BNs [14], describing it as "violating the OO assumption." We thus say that CPTs are approximately equal rather than equal. Under sharing, however, we are making the simplifying assumption that shared CPTs are equal.

In some ADAPT BNs, one sub-network is essentially duplicated three times, reflecting triple redundancy in ADAPT's power storage and distribution network [27]. Such a six-node sub-network from an ADAPT BN is shown in Figure 1. This sub-network was, in this paper, manually selected for further study of sharing due to this duplication. The sub-network is part of the mini-ADAPT BN used in the experiments in Section 5.3. In the mini-ADAPT sharing condition, only the node SB was shared between all three BN fragments. Generally, we define shared nodes S and unshared nodes U, with U = X \ S.

3.2 OBSERVABLE VERSUS HIDDEN

Consider a complex engineered system, such as an electrical power system. After construction, but before it is put into production, it is typically tested extensively. The sensors used during testing lead to one set of observation nodes in the BN, O_T. The sensors used during production lead to another set of observations in the BN, O_P. For reasons including cost, fewer sensors are typically used during production than during testing; thus we assume O_P ⊆ O_T.

As an example, in Figure 1 we denote O_P = {CB, S} as production observation nodes, and O_T = {CB, S, HB, HS} as testing observation nodes.

Figure 1: One of three similar sub-networks in the mini-ADAPT BN. Under parameter sharing, the SB node is shared between the three sub-networks.

In all our experiments, shared nodes are also hidden, or S ⊆ H. Typically, there are hidden nodes that are not necessarily shared, or S ⊂ H. This is, for example, the case for the mini-ADAPT BN, as reflected in the sub-network in Figure 1.

4 EM WITH SHARING

4.1 SHARING IN BAYESIAN NETWORKS

Consider a BN B = (X, W, θ). A sharing set partition for nodes X is a set partition Y of X with subsets Y_1, ..., Y_k, with Y_i ⊆ X and k ≥ 1. For each Y_i with k ≥ i ≥ 1, the nodes X ∈ Y_i share a CPT during EM learning, as discussed in Section 4.2. We assume that the nodes in Y_i have exactly the same number of states. The same applies to their respective parents in B, leading to each Y ∈ Y_i having the same number of parent instantiations and exactly the same CPTs.

Traditional non-sharing is a special case of sharing in the following way. We assign each BN node to a separate subset, such that for X = {X_1, ..., X_n} we have Y_1, ..., Y_n with Y_i = {X_i}.

One key research goal is to better understand the behavior of EM as the shared nodes S and the observable nodes O vary. We examine three cases: complete data O_C (no hidden nodes); a sensor-rich testing setting with observations O_T; and a sensor-poor production setting with observations O_P. Understanding the impact of varying the observations O is important due to the cost and effort associated with observation or sensing.
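The sharing set partition just defined, and the non-sharing special case Y_i = {X_i}, can be made concrete with a small sketch. The helper name and the list-of-blocks representation are assumptions for illustration, not part of the paper's implementation.

```python
def sharing_partition(nodes, shared_block=()):
    """Build a sharing set partition Y of the BN nodes X: one block for the
    shared nodes plus a singleton block per remaining node. An empty
    shared_block yields the traditional non-sharing case Y_i = {X_i}."""
    shared = [x for x in nodes if x in set(shared_block)]
    blocks = ([shared] if shared else []) + \
             [[x] for x in nodes if x not in set(shared_block)]
    # A set partition must cover X with pairwise disjoint blocks.
    assert sorted(x for b in blocks for x in b) == sorted(nodes)
    return blocks
```

In the mini-ADAPT sharing condition described earlier, the single non-singleton block would roughly correspond to the tied SB CPTs across the three sub-networks, with every other node in its own block.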
4.2 SHARING EM

Similar to traditional EM (see Section 2.2), the sharing EM algorithm also takes as input a dataset and estimates a vector of BN parameters θ̂ by iteratively improving ℓℓ until convergence. The main difference of sharing EM compared to traditional EM is that we now set some nodes as shared, S, according to Y. To arrive at the sharing EM algorithm from the traditional EM algorithm, we modify the likelihood function to introduce parameter sharing and combine parameters (see also [13, 17]). When running sharing EM, the nodes X are treated as separate for the E-step. There is a slightly modified M-step, using Y, to aggregate the shared CPTs and parameters.^5 This is the use of aggregate sufficient statistics, which considers sufficient statistics from more than one BN node [13].

Footnote 5: For the relevant LibDAI source code, please see: https://github.com/erikreed/HadoopBNEM/blob/master/src/emalg.cpp#L167. In words, it is a modification of the collection of sufficient statistics during the maximization step.

Let θ̂_{i,j} be the j-th estimated probability parameter for BN node X_i ∈ X. We define the error of θ̂ for a BN B = (X, W, θ̂) as the L2 distance from the true probability distribution θ* from which the data is sampled:

    err(θ̂) = √( Σ_i Σ_j (θ*_{i,j} − θ̂_{i,j})² ).    (1)

This error is the summation of the Euclidean distance between true and estimated CPT parameters, or the L2 distance, providing an overall distance metric.

Why do we use Euclidean distance to measure error? One could, after all, argue that this distance metric is poor because it does not agree with likelihood. We use Euclidean distance because we are interested not only in the black-box performance of the BN, but also in its validity and understandability to a human expert.^6 This is important when an expert needs to evaluate, validate, or refine a BN model, for example a BN for an electrical power system.

Footnote 6: We assume that (1) is better than likelihood in this regard, if the original BN was manually constructed. The BNs experimented with in Section 5 were manually constructed, for example.

4.3 EM'S BEHAVIOR UNDER SHARING

We now provide a simple analysis of certain aspects of traditional EM (TEM) and sharing EM (SEM). For simplicity, we only consider EM runs that converge and exclude runs that time out.^7

Footnote 7: In practice, runs that time out are very rare with the parameter settings we use in experiments.

For a node X_i ∈ X, TEM will converge to one among potentially several convergence regions. Suppose that the CPT of node X_i has γ(X_i) convergence regions. Then the actual number of convergence regions γ(B_U) for a non-shared BN B_U with nodes X = {X_1, ..., X_n} is upper bounded by γ̄(B_U) as follows:

    γ(B_U) ≤ γ̄(B_U) = ∏_{i=1}^{n} γ(X_i).    (2)

Due to the sharing, SEM intuitively has fewer convergence regions than TEM. This is due to SEM's slightly modified M-step, which aggregates the shared CPTs and parameters. Consider a BN B_S with exactly the same nodes and edges as B_U, but with sharing, specifically with sharing set partition Y_1, ..., Y_k and k < n. Without loss of generality, assume that X_i ∈ Y_i. Then the actual number of convergence regions γ(B_S) is upper bounded by γ̄(B_S) as follows:

    γ(B_S) ≤ γ̄(B_S) = ∏_{i=1}^{k} γ(X_i),    (3)

assuming that the γ(X_i) convergence regions used for X_i in (2) carry over to Y_i.

A special case of (3) is when there exists exactly one Y' ∈ Y such that |Y'| ≥ 2, while for any Z ∈ Y \ {Y'} we have |Z| = 1. The experiments performed in Section 5 are all for this special case. Specifically, but without loss of generality, let Y_i = {X_i} for 1 ≤ i < k and Y_k = {X_k, ..., X_n}, with n_S = n − k + 1 (i.e., S = Y_k has n_S sharing nodes). It is illustrative to consider the ratio:

    γ̄(B_U) / γ̄(B_S) = ∏_{i=1}^{n} γ(X_i) / ∏_{i=1}^{k} γ(X_i) = ∏_{i=k+1}^{n} γ(X_i).    (4)

Here, we assume that X_i has γ(X_i) convergence regions in both B_U and B_S, and take into account that for the shared nodes Y_k = S, the CPTs are tied together.

The simple analysis above suggests a non-trivial impact of sharing, given the multiplicative effect of the γ(X_i)'s for k + 1 ≤ i ≤ n in (4). However, since upper bounds are the focus in this analysis, only a partial and conservative picture is painted. The experiments in Section 5 (see, for example, Figure 3, Figure 5, and Figure 6) provide further details.

4.4 ANALYSIS OF ERROR

We now consider the number of erroneous CPTs as estimated by SEM when sharing is varied. Clearly, a CPT parameter is continuous, and its EM estimate is extremely unlikely to be exactly equal to the original parameter. Thus we consider here a discrete variable, based on forming an interval in the one-parameter case. Generally, let a discrete BN node X ∈ X have k states, such that x_i ∈ {x_1, ..., x_k}. Consider θ_{x_i|z} = Pr(X = x_i | Z = z) for a parent instantiation z. We now have an original CPT parameter θ_i ∈ {θ_{x_1|z}, ..., θ_{x_{k−1}|z}} and its EM estimate θ̂_i ∈ {θ̂_{x_1|z}, ..., θ̂_{x_{k−1}|z}}. Let us jointly consider the original CPT parameter θ*_i and its estimate θ̂_i. If θ̂_i ∈ [θ*_i − δ_i, θ*_i + δ_i], we count θ̂_i as correct, and say θ̂_i = θ*_i; else it is incorrect or wrong, and we say θ̂_i ≠ θ*_i. This analysis clearly carries over to multiple CPT parameters, parent instantiations, and BN nodes. This shows how we go from a continuous quantity (the estimated CPT parameters θ̂) to a discrete value (the number of wrong or incorrect CPT estimates), where the latter is used in this analysis.

Suppose that up to n_P nodes can be shared. Further suppose that n_S nodes are actually shared, while n_U nodes are unshared, with n_U + n_S = n_P. Let T_P be a random variable representing the total number of wrong or incorrect CPTs, T_S the total for shared nodes, and T_U the total for unshared nodes. Clearly, we have T_P = T_S + T_U.

Let us first consider the expectation E(T_P). Due to linearity, we have E(T_P) = E(T_S) + E(T_U). In the non-shared case, assume for simplicity that errors are iid and follow a Bernoulli distribution, with probability p of error and (1 − p) = q of no error.^8 This gives E(T_U) = n_U p, using the fact that a sum of Bernoulli random variables follows a Binomial distribution.^9 In the shared case, all shared nodes either have an incorrect CPT (θ̂ ≠ θ*) or the correct CPT (θ̂ = θ*). Assuming again probability p of error^10 and (1 − p) = q of no error, and by using the definition of the expectation of Binomials, we obtain E(T_S) = n_S p. Substituting into E(T_P), we get

    E(T_P) = n_U p + n_S p = n_P p.    (5)

Footnote 8: This is a simplification, since our use of the Bernoulli assumes that each CPT is either "correct" or "incorrect." When learned from data, the estimated parameters are clearly almost never exactly correct, but close to or far from their respective original values.

Footnote 9: If X is Binomial with parameters n and p, it is well known that the expected value is E(X) = np.

Footnote 10: The error probabilities of T_S and T_U are assumed to be the same, as a simplifying assumption.

Let us next consider the variance V(T_P). While variance in general is not linear, we assume linearity (that is, independence of T_U and T_S) for simplicity, and obtain

    V(T_P) = V(T_U) + V(T_S).    (6)

In the non-shared case, we again have a Binomial distribution, with well-known variance

    V(T_U) = n_U p(1 − p).    (7)

In the shared case, we use the definition of variance, put p_1 = (1 − p), p_2 = p, and µ = n_S p, note that T_S takes the value x_1 = 0 with probability p_1 and x_2 = n_S with probability p_2, and obtain after some simple manipulations:

    V(T_S) = Σ_{i=1}^{2} p_i (x_i − µ)² = n_S² p(1 − p),    (8)

and by substituting (7) and (8) into (6) we get

    V(T_P) = p(1 − p)((n_P − n_S) + n_S²).    (9)

In words, (9) tells us that as the number n_S of shared nodes increases at the expense of the number n_U of unshared nodes, the variance due to non-shared nodes decreases linearly, but the variance due to sharing increases quadratically. The net effect shown in (9) is that the variance V(T_P) of the error increases with the number of shared nodes, according to our analysis above. The expectation, on the other hand, remains constant (5), regardless of how many nodes are shared. These analytical results have empirical counterparts, as discussed in Section 5; see, for example, the error sub-plot at the bottom of Figure 2.

5 EXPERIMENTS

We now report on EM experiments for several different BNs, using varying degrees of sharing. We also vary the number of hidden nodes and the dataset size. We used ε = 1e−3 as the EM convergence criterion, meaning that EM stopped at iteration t_0 when the ℓℓ-score changed by a value ≤ ε between iterations t_0 − 1 and t_0, for t_min ≤ t_0 ≤ t_max. In these experiments, t_max = 100 and t_min = 3.

5.1 METHODS AND DATA

Bayesian networks. Table 2 presents the BNs used in the experiments.^11 Except for the BN Pigs, these BNs all represent (parts of) the ADAPT electrical power system (see Section 3). The BN Pigs has the largest number of nodes that can be shared (n_P = 296), comprising 67% of the entire BN. The largest BN used, in terms of node count, edges, and total CPT size, is ADAPT T2.

Footnote 11: ADAPT BNs can be found here: http://works.bepress.com/ole_mengshoel/.

Datasets. Data for EM learning of parameters for these BNs were generated using forward sampling.^12 Each sample in a dataset is a vector x (see Section 2.1). The larger BNs were tested with increasing numbers of samples, ranging from 25 to 400, while mini-ADAPT was tested with 25 to 2000 samples.

Footnote 12: Our experiments are limited in that we are only learning the parameters of BNs, using data generated from those BNs. Clearly, in most applications, data is not generated from a BN, and the true distribution does not conform exactly to some BN structure. However, our analytical and experimental investigation of error would not have been possible without this simplifying assumption.
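The forward sampling used to generate such datasets can be sketched as follows. The tiny two-node network, the CPT encoding, and the '?' marker for hidden values are illustrative assumptions rather than the paper's actual data-generation code.

```python
import random

def forward_sample(cpts, n, hidden=(), seed=0):
    """Draw n samples from a binary BN by forward sampling: visit nodes in
    topological order, sample each node given its already-sampled parents,
    then mask hidden nodes with '?' to make the dataset incomplete."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = {}
        for node, parents, table in cpts:  # cpts listed in topological order
            p1 = table[tuple(x[p] for p in parents)]  # Pr(node=1 | parents)
            x[node] = 1 if rng.random() < p1 else 0
        data.append({v: '?' if v in hidden else val for v, val in x.items()})
    return data

# Example: a health node H with a sensor child O; H is hidden in the dataset.
cpts = [("H", (), {(): 0.3}),
        ("O", ("H",), {(0,): 0.2, (1,): 0.9})]
dataset = forward_sample(cpts, 1000, hidden=("H",))
```

Each returned sample corresponds to one vector x, with a "?" entry for every hidden node, matching the incomplete-data setting of Section 2.1.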
Sharing. Each BN has a different number of parameters that can be shared, where a set of nodes Y_i ∈ Y with equal CPTs is deemed sharable. In most cases, there were multiple sets of nodes Y_1, ..., Y_k with |Y_i| ≥ 2 for k ≥ i ≥ 1. When multiple sets were available, the largest set was selected for experimentation, as shown in Section 5.2's pseudo-code.

NAME          |X|    |P|    |W|     CPT
ADAPT T1      120     26    136    1504
ADAPT T2      671    107    789   13281
ADAPT P1      172     33    224    4182
ADAPT P2      494     99    602   10973
mini-ADAPT     18      3     15     108
Pigs          441    296    592    8427

Table 2: Bayesian networks used in experiments. The |P| column presents the number of potentially shared nodes, with actually shared nodes S ⊆ P. The CPT column denotes the total number of parameters in the conditional probability tables.

Metrics. After an EM trial converged to an estimate θ̂, we collected the following three metrics:

1. the number of iterations t_0 needed to converge to θ̂;
2. the log-likelihood ℓℓ of θ̂; and
3. the error: the distance between θ̂ and the original θ* (see (1) for the definition).

Software. Among the available software implementations of the EM algorithm for BNs, we have based our work on LibDAI [23].^13 LibDAI uses factor graphs for its internal representation of BNs, and has several BN inference algorithms implemented. During EM, the exact junction tree inference algorithm [16] was used, since it has performed well previously [19].

Footnote 13: www.libdai.org

5.2 VARYING NUMBER OF SHARED NODES

Here we investigate how varying the number of shared nodes impacts EM. A set of hidden nodes H was created for each BN by selecting H = arg max_{Y_i ∈ Y} |Y_i|, where Y is a sharing set partition for the BN nodes X (see Section 4.1). In other words, each experimental BN had its largest set of shareable nodes hidden, giving n_H = 12 nodes for ADAPT T1, n_H = 66 nodes for ADAPT T2, n_H = 32 nodes for ADAPT P1, and n_H = 145 nodes for Pigs.

The following gradual sharing method is used to vary sharing. Given a fixed set of hidden nodes H and an initially empty set of shared nodes S:

1. Randomly add ∆n_S ≥ 1 hidden nodes that are not yet shared to the set of shared nodes S. Since we only have a single sharing set, this means moving ∆n_S nodes from the set H \ S to the set S.
2. Perform m sharing EM trials in this configuration, and record the three metrics for each trial.
3. Repeat until all hidden nodes are shared; that is, S = H.

Using the gradual sharing method above, BN nodes were picked as hidden and then gradually shared. When increasing the number of shared nodes, the new set of shared nodes was a superset of the previous set, and a certain number of EM trials was performed for each set.

5.2.1 One Network

For ADAPT T2, m = 200 samples were generated and n_H = 66 nodes were hidden. We used, in the gradual sharing method, ∆n_S = 4, from n_S = 2 to n_S = 66 (every hidden node was eventually set as shared). To provide reliable statistics on the mean and standard deviation, many randomly initialized EM trials were run.

Figure 2: The number of iterations (top), log-likelihood or ℓℓ (middle), and error (bottom) for a varying number of shared nodes n_S (along the x-axis) for the BN ADAPT T2. Here, n_H = 66 nodes are hidden. Shared nodes are a random subset of the hidden nodes, so n_S ≤ n_H.
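The gradual sharing method of Section 5.2 can be sketched as a schedule generator. The function name and the list representation are assumptions, and the m sharing-EM trials per configuration are left as a comment.

```python
import random

def gradual_sharing_schedule(hidden, delta, seed=0):
    """Enumerate successive shared-node sets S for the gradual sharing
    method: repeatedly move delta randomly chosen, not-yet-shared hidden
    nodes into the single sharing set until S = H."""
    rng = random.Random(seed)
    pool = list(hidden)
    rng.shuffle(pool)
    shared, schedule = [], []
    while pool:
        for _ in range(min(delta, len(pool))):
            shared.append(pool.pop())
        schedule.append(sorted(shared))  # run m sharing-EM trials per step
    return schedule
```

Each configuration in the returned schedule is a superset of the previous one, matching the method's requirement that the shared set only grows until it equals the hidden set.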
Figure 2 summarizes the results from this experiment. Here, n_S is varied along the x-axis, while the y-axis shows statistics for different metrics in each of the three sub-plots. For example, in the top plot of Figure 2, each marker is the mean number of iterations (µ), and the error bar is +/- one standard deviation (σ). The main trends^14 in this figure are: parameter sharing increased the mean number of iterations required for EM and slowly decreased the mean ℓℓ. Increasing the number of shared nodes resulted in a corresponding increase in the standard deviation of the number of iterations, the ℓℓ, and the error for the BN ADAPT T2. For the standard deviation of error, this is in line with our analysis in Section 4.4.

Footnote 14: We say "main trends" because the curves for the metrics mean number of iterations and ℓℓ are in fact reversing their respective trends and dropping close to the maximum of 66 shared nodes.

5.2.2 Multiple Networks

We now investigate how varying the number of shared nodes impacts EM for several BNs, specifically the correlation between the number of parameters shared and the mean µ and standard deviation σ for our three metrics. To measure correlation, we use Pearson's sample correlation coefficient r:

    r(X, Y) = Σ_{i=1}^{m} (x_i − x̄)(y_i − ȳ) / ((m − 1) s_x s_y),    (10)

where x̄ and ȳ are the sample means of two random variables X and Y, and s_x and s_y are the sample standard deviations of X and Y, respectively. Here, (10) measures the correlation between m samples from X and Y.^15

Footnote 15: In this case, X is the independent variable, specifically the number of shared nodes. We treat the metric Y as a function of X. When X is highly correlated with Y, this is expressed in r through extreme (positive or negative) correlation values.

The number of samples, m, here refers to the number of (X, Y) sharing samples we have for this correlation analysis (and not the number of samples used to learn the BN parameters). In all these experiments, we do a trial for each number of shared nodes, giving several (X, Y) pairs. Consequently, each number of shared nodes tested would be an X, and the metric measured would be a Y. For example, if we use n_S ∈ {2, 4, 6, 8, 10} shared nodes, then m = 5.

We now tie r(X, Y) in (10) to the r(µ) and r(σ) used in Table 3. In Table 3, µ and σ denote the mean and standard deviation, respectively, of a metric. Thus, r(µ) is the correlation between the number of shared nodes and the mean of a metric, while r(σ) is the correlation between the number of shared nodes and the standard deviation of a metric.

Figure 2 helps in understanding exactly what is being correlated, as µ and σ for all three metrics are shown for the BN ADAPT T2. In the top plot, r(µ) is the correlation between the number of shared nodes (x-axis) and the mean number of iterations (y-axis). In other words, each mean y_i = µ_i is for a batch of 50 trials of EM. The mean ȳ used in Pearson's r is, in this case, a mean of means, namely the mean over the 50-EM-trial means for different numbers of shared nodes.

ADAPT T1     Iterations          Likelihood          Error
Size         r(µ)      r(σ)      r(µ)      r(σ)      r(µ)      r(σ)
25           -0.937    -0.135    0.997     0.882     -0.068    0.888
50           -0.983    -0.765    0.992     0.941     0.383     0.988
100          -0.825    0.702     0.494     0.941     0.786     0.985
200          0.544     0.926     -0.352    0.892     0.517     0.939
400          0.810     0.814     -0.206    0.963     0.863     0.908

ADAPT T2     Iterations          Likelihood          Error
Size         r(µ)      r(σ)      r(µ)      r(σ)      r(µ)      r(σ)
25           0.585     0.852     0.749     0.842     0.276     0.997
50           0.811     0.709     0.377     0.764     0.332     0.981
100          0.772     0.722     -0.465    0.855     -0.387    0.992
200          0.837     0.693     -0.668    0.788     -0.141    0.989
400          0.935     0.677     -0.680    0.784     -0.0951   0.987

ADAPT P1     Iterations          Likelihood          Error
Size         r(µ)      r(σ)      r(µ)      r(σ)      r(µ)      r(σ)
25           -0.769    -0.0818   -0.926    0.278     -0.107    0.774
50           -0.864    -0.0549   -0.939    0.218     -0.331    0.188
100          -0.741    0.436     -0.871    0.678     -0.458    0.457
200          -0.768    0.408     -0.879    0.677     -0.196    0.700
400          -0.665    0.560     -0.862    0.681     0.00444   0.653

PIGS         Iterations          Likelihood          Error
Size         r(µ)      r(σ)      r(µ)      r(σ)      r(µ)      r(σ)
25           0.954     0.763     0.970     0.614     0.965     0.887
50           0.966     0.956     0.137     0.908     0.740     0.869
100          -0.922    -0.0167   -0.994    -0.800    -0.893    0.343
200          -0.827    -0.394    -0.985    -0.746    -0.577    0.0157
400          -0.884    -0.680    -0.942    -0.699    -0.339    0.285

Table 3: Values of Pearson's r for the correlation between the number of shared nodes and statistics for these metrics: number of iterations, log-likelihood (ℓℓ), and error. r(µ) denotes the correlation between the number of shared nodes and the mean of the metric, while r(σ) denotes the correlation for the standard deviation.

Table 3 summarizes the experimental results for four BNs. In this table, a positive correlation implies that parameter sharing increased the corresponding metric statistic.
For example, the highest correlation between For standard deviation of error, this is in line with our number of shared nodes and mean likelihood is for analysis in Section 4.4. ADAPT T1 at 25 samples, where r(µ) = 0.997. This suggests that increasing the number of shared nodes was highly correlated with an increase in the likelihood 5.2.2 Multiple Networks of EM. Negative coefficients show that increasing the We now investigate how varying the number of shared number of shared nodes resulted in a decrease of the nodes impacts EM for several BNs, specifically the cor- corresponding metric statistic. relation between the number of parameters shared and A prominent trend in Table 3 is the consistently pos- the mean µ and standard deviation for our three itive correlation between the number of shared nodes metrics. To measure correlation, we use Pearson’s 15 In this case, X is the independent variable, specifically 14 We say “main trends” because the curves for the met- the number of shared nodes. We treat the metric Y as a rics mean number of iterations and `` are in fact reversing function of X. When X is highly correlated with Y , this their respective trends and dropping close to the maximum is expressed in r through extreme (positive or negative) of 66 shared nodes. correlation values. 55 NO SHARING Observable Error Likelihood Iterations µ µ µ OC 16.325 (-) -3.047e4 (-) (-) (-) OP 48.38 3.74 -2.009e4 43.94 16.39 4.98 OT 33.25 7.45 -2.492e4 1190 8.65 1.99 SHARING Observable Error Likelihood Iterations µ µ µ OC 16.323 (-) -3.047e4 (-) (-) (-) OP 48.56 3.92 -2.010e4 62.87 15.98 4.95 OT 34.12 14.3 -2.629e4 2630 6.59 2.48 Table 4: Comparison of No Sharing (top) versus Shar- (a) O P – No Sharing (b) O P – Sharing ing (bottom) for di↵erent observable node sets O C , O P , and O T during 600 EM trials for mini-ADAPT. nS and the standard deviation of error, r( ), for all 4 BNs. This is in line with the analytical result involving nS in (9). 
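The correlation analysis above reduces to a short computation: for each tested nS, take the mean µ and standard deviation σ of a metric over its batch of EM trials, then correlate those statistics against nS as in (10). A minimal sketch, using the standard library only; the batch values below are made-up illustration data, not the paper's results.

```python
import statistics

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient, as in (10)."""
    m = len(xs)
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)  # (m - 1) denominators
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return cov / ((m - 1) * sx * sy)

# Hypothetical batches: one metric (e.g. iteration counts) from several
# EM trials at each number of shared nodes n_S.
batches = {
    2: [102, 98, 105, 99],
    4: [110, 104, 113, 107],
    6: [118, 109, 121, 112],
    8: [125, 113, 130, 118],
}
ns = sorted(batches)
r_mu = pearson_r(ns, [statistics.mean(batches[n]) for n in ns])      # r(µ)
r_sigma = pearson_r(ns, [statistics.stdev(batches[n]) for n in ns])  # r(σ)
```

Note that `statistics.stdev` uses the (m − 1) sample denominator, matching s_x and s_y in (10); here both r(µ) and r(σ) come out strongly positive because the fabricated batches drift upward and spread out as nS grows.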
The number of samples was shown to have a significant impact on these correlations. The Pigs network showed a highly correlated increase in the mean number of iterations for 25 and 50 samples. However, for 100, 200, and 400 samples there was a decrease in the mean number of iterations. The opposite behavior is observed in ADAPT T1, where fewer samples resulted in better performance for parameter sharing (reducing the mean number of iterations), while for 200 and 400 samples we found that parameter sharing increased the mean number of iterations. Further experimentation and analysis may improve the understanding of the interaction between sharing and the number of samples.

5.3 CONVERGENCE REGIONS

5.3.1 Small Bayesian Networks

First, we show how sharing influences EM parameter interactions for the mini-ADAPT BN shown in Figure 1 and demonstrate how shared parameters jointly converge.

Earlier we introduced OP as the observable nodes in a production system and OT as the observable nodes in a testing system. Complementing OP, the hidden nodes are HP = {HB, HS, VB, SB}. Complementing OT, the hidden nodes are HT = {VB, SB}. When a node is hidden, the EM algorithm will converge to one among its potentially many convergence regions. For OP, EM had much less observed data to work with than for OT (see Figure 1). For OP, the health breaker node HB was, for instance, not observed or even connected to any nodes that were observed. In contrast, OT was designed to allow better observation of the components' behaviors, and HB ∈ OT. From mini-ADAPT, 500 samples were generated. Depending on the observable set used, either nH = |HT| = 2 or nH = |HP| = 4 nodes were hidden, and 600 random EM trials were executed with and without sharing.

Table 4 shows results, in terms of means µ and standard deviations σ, for these EM trials. For OP, with nH = 4, the means µ of the metrics error, likelihood, and number of iterations showed minor differences when parameter sharing was introduced. The largest change due to sharing was an increase in σ of likelihood. For OT, where nH = 2, differences were greater. The µ of likelihood for sharing was lower, with over a 2x increase in σ. The µ for error demonstrated only a minor change, but nearly a 2x increase in σ. This is consistent with our analysis in Section 4.4.

Figure 3a and Figure 3c show how log-likelihood or ℓℓ (x-axis) and error (y-axis) changed during 15 EM trials for OP and OT, respectively. These EM trials were selected randomly among the trials reported on in Table 4. Parameter sharing is introduced in Figure 3b and Figure 3d. For OP, the progress of the EM trials is similar for sharing (Figure 3b) and non-sharing (Figure 3a), although for a few trials in the sharing condition the error is more extreme (and mostly smaller!). This is also displayed in Table 4, where the difference in number of iterations, error, and likelihood was minor (relative to OT). On the other hand, there is a clear difference in the regions of convergence for OT when parameter sharing is introduced, consistent with the analysis in Section 4.3. Figure 3d shows how the EM trials typically followed a path heading to optima far above or far below the mean error, with two of the EM trials plotted converging in the middle region of the error.

Figure 3: The progress of different random EM trials for the mini-ADAPT BN. Both the degree of sharing (No Sharing versus Sharing) and the set of observable nodes (OP versus OT) are varied.

Figure 4: Four 20-bin histograms, for mini-ADAPT, of the error values of 600 randomly initialized EM trials at convergence. Both the degree of sharing (No Sharing versus Sharing) and the set of observable nodes (OP versus OT) are varied.

Figure 5: The progress of 15 random EM trials for ADAPT T2. The No Sharing condition (a) shows more locally optimal convergence regions than Sharing (b), where there appears to be only four locally optimal convergence regions.
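The per-trial bookkeeping behind these plots is straightforward: run EM from many random initializations and, at each iteration, record the log-likelihood and the Euclidean distance between the estimated and true parameter vectors. A schematic sketch follows; the EM step itself is left abstract (run_em_iteration and loglik are stand-in callables, not LibDAI calls), and the toy demo at the end is purely illustrative.

```python
import math
import random

def param_error(theta_hat, theta_true):
    """Euclidean distance between flattened parameter vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_hat, theta_true)))

def random_restarts(theta_true, run_em_iteration, loglik, n_trials=15,
                    max_iters=100, tol=1e-4, seed=0):
    """Run EM from random starts; record a (loglik, error) pair per iteration."""
    rng = random.Random(seed)
    trajectories = []
    for _ in range(n_trials):
        theta = [rng.random() for _ in theta_true]  # random initial parameters
        path, prev_ll = [], -float("inf")
        for _ in range(max_iters):
            theta = run_em_iteration(theta)
            ll = loglik(theta)
            path.append((ll, param_error(theta, theta_true)))
            if ll - prev_ll < tol:  # converged: likelihood stopped improving
                break
            prev_ll = ll
        trajectories.append(path)
    return trajectories

# Toy demo: a fake "EM step" that moves parameters halfway toward the truth,
# with -error standing in for the log-likelihood.
theta_true = [0.3, 0.7]
step = lambda th: [0.5 * (a + b) for a, b in zip(th, theta_true)]
trajs = random_restarts(theta_true, step,
                        lambda th: -param_error(th, theta_true), n_trials=3)
```

Plotting each trajectory's (ℓℓ, error) pairs reproduces the style of Figures 3, 5, and 6, and histogramming the final errors reproduces Figure 4.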
Histograms for the 600 EM trials used in Table 4 are shown in Figure 4. The 20-bin histograms show the error err(θ̂) at convergence. The OP and OT sets are shown without parameter sharing in Figure 4a and Figure 4c, respectively. Parameter sharing is introduced in Figure 4b and Figure 4d. There is an increased σ of error due to parameter sharing for OT. When comparing Figure 4c (No Sharing) and Figure 4d (Sharing), we notice different regions of error for EM convergence due to sharing. Figure 4c appears to show four main error regions, with the middle two being greatest in frequency, while Figure 4d appears to show three regions of error, with the outer two being most frequent. The outer two regions in Figure 4d are further apart than their non-sharing counterparts, showing that parameter sharing yielded a larger range of error for OT; see Section 4.4.

5.3.2 Large Bayesian Networks

Next, large BNs are used to investigate the effects of parameter sharing, using a varying number of shared nodes. The larger ADAPT networks and Pigs were run with 50 EM trials16 for each configuration of observable nodes, number of samples, and number of shared nodes. Some of the results are reported here.

16 The decrease in the number of EM trials performed relative to mini-ADAPT was due to the substantial increase in CPU time required (days to weeks).

Figure 5a shows results for ADAPT T2 without parameter sharing during 15 EM trials using nH = 66 hidden nodes and 200 samples. Figure 5b shows a substantial change in error when the nH = 66 hidden nodes were shared. The range of the error for EM is much larger in Figure 5b, while the upper and lower error curves have a symmetric quality. Four regions for the converged error are visible in Figure 5b, with the inner two terminating at a lower ℓℓ than the outer two. The lowest error region of Figure 5b is also lower than the lowest error of Figure 5a, while retaining a similar ℓℓ.

Figure 6 uses a smaller ADAPT BN, containing 172 nodes instead of 671 nodes (see Table 2). Here, nH = 33 nodes were hidden and 200 samples were used. In several respects, the results are similar to those obtained for ADAPT T2 and mini-ADAPT. However, Figure 6a shows that EM terminates at different likelihoods, which is not observed in Figure 5a. The error also appears to fluctuate more in Figure 6a, whereas the error changes the most during later iterations in Figure 5a. Figure 6b applies parameter sharing to the nH = 33 hidden nodes. A symmetric effect is visible between high and low error, reflecting the analysis in Section 4. Of the 15 trials shown in Figure 6b, two attained ℓℓ > −1.2e4, while the rest converged at ℓℓ ≈ −1.21e4. Additionally, the ℓℓs of these two trials were greater than any of the non-sharing ℓℓs shown in Figure 6a.

Figure 6: The progress of random EM trials for ADAPT P1. While the number of EM trials is the same for both conditions, the No Sharing condition (a) clearly shows more local optima than Sharing (b).

6 CONCLUSION

Bayesian networks have proven themselves as very suitable for electrical power system diagnostics [19, 20, 30-33]. By compiling Bayesian networks to arithmetic circuits [4, 5], a broad range of discrete and continuous faults can be handled in a computationally efficient and predictable manner. This approach has resulted in award-winning performance on public data from ADAPT, an electrical power system at NASA [30].

The goal of this paper is to investigate the effect, on EM's behavior, of parameter sharing in Bayesian networks. We emphasize electrical power systems as an application, and in particular examine EM for ADAPT Bayesian networks. In these networks, there is considerable opportunity for parameter sharing.

Our results suggest complex interactions between varying degrees of parameter sharing, varying numbers of hidden nodes, and different dataset sizes when it comes to impact on EM performance, specifically likelihood, error, and the number of iterations required for convergence. One main point, which we investigated both analytically and empirically, is how parameter sharing impacts the error associated with EM's parameter estimates. In particular, we have found analytically that the error variance increases with the number of shared parameters. Experiments with several BNs, mostly for fault diagnosis of electrical power systems, are in line with the analysis. The good news here is that there is, in the sharing case, smaller error some of the time.

Further theoretical research to better understand parameter sharing is required. Since parameter sharing was demonstrated to perform poorly in certain cases, further investigations appear promising. Parameter sharing sometimes reduced the number of EM iterations required for parameter learning, while at other times it increased the number of EM iterations. Improving the understanding of the joint impact of parameter sharing and the number of samples on the number of EM iterations would be useful, for example. Finally, it would be interesting to investigate the connection to object-oriented and relational BNs in future work.

References

[1] E. E. Altendorf, A. C. Restificar, and T. G. Dietterich. Learning from sparse data by exploiting monotonicity constraints. In Proceedings of UAI, volume 5, 2005.

[2] A. Basak, I. Brinster, X. Ma, and O. J. Mengshoel. Accelerating Bayesian network parameter learning using Hadoop and MapReduce. In Proc. of BigMine-12, Beijing, China, August 2012.

[3] P. S. Bradley, U. Fayyad, and C. Reina. Scaling EM (Expectation-Maximization) clustering to large databases. Microsoft Research Report, MSR-TR-98-35, 1998.

[4] M. Chavira and A. Darwiche. Compiling Bayesian networks with local structure. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), pages 1306-1312, 2005.

[5] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280-305, 2003.

[6] C. P. de Campos and Q. Ji. Improving Bayesian network parameter learning using constraints. In Proc. 19th International Conference on Pattern Recognition (ICPR), pages 1-4. IEEE, 2008.

[7] B. Delyon, M. Lavielle, and E. Moulines. Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, (27):94-128, 1999.

[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.

[9] R. Greiner, X. Su, B. Shen, and W. Zhou. Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. Machine Learning, 59(3):297-322, 2005.

[10] S. H. Jacobson and E. Yucesan. Global optimization performance measures for generalized hill climbing algorithms. Journal of Global Optimization, 29(2):173-190, 2004.

[11] W. Jank. The EM algorithm, its randomized implementation and global optimization: Some challenges and opportunities for operations research. In F. B. Alt, M. C. Fu, and B. L. Golden, editors, Perspectives in Operations Research: Papers in Honor of Saul Gass' 80th Birthday. Springer, 2006.

[12] W. B. Knox and O. J. Mengshoel. Diagnosis and reconfiguration using Bayesian networks: An electrical power system case study. In Proc. of the IJCAI-09 Workshop on Self-* and Autonomous Systems (SAS): Reasoning and Integration Challenges, pages 67-74, 2009.

[13] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. The MIT Press, 2009.

[14] H. Langseth and O. Bangsø. Parameter learning in object-oriented Bayesian networks. Annals of Mathematics and Artificial Intelligence, 32(1):221-243, 2001.

[15] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics & Data Analysis, 19(2):191-201, 1995.

[16] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), pages 157-224, 1988.

[17] W. Liao and Q. Ji. Learning Bayesian network parameters under incomplete data with domain knowledge. Pattern Recognition, 42(11):3046-3056, 2009.

[18] G. J. McLachlan and D. Peel. Finite mixture models. Wiley, 2000.

[19] O. J. Mengshoel, M. Chavira, K. Cascio, S. Poll, A. Darwiche, and S. Uckun. Probabilistic model-based diagnosis: An electrical power system case study. IEEE Trans. on Systems, Man, and Cybernetics, 40(5):874-885, 2010.

[20] O. J. Mengshoel, A. Darwiche, K. Cascio, M. Chavira, S. Poll, and S. Uckun. Diagnosing faults in electrical power systems of spacecraft and aircraft. In Proceedings of the Twentieth Innovative Applications of Artificial Intelligence Conference (IAAI-08), pages 1699-1705, Chicago, IL, 2008.

[21] O. J. Mengshoel, S. Poll, and T. Kurtoglu. Developing large-scale Bayesian networks by composition: Fault diagnosis of electrical power systems in aircraft and spacecraft. In Proc. of the IJCAI-09 Workshop on Self-* and Autonomous Systems (SAS): Reasoning and Integration Challenges, pages 59-66, 2009.

[22] O. J. Mengshoel, D. C. Wilkins, and D. Roth. Initialization and restart in stochastic local search: Computing a most probable explanation in Bayesian networks. IEEE Transactions on Knowledge and Data Engineering, 23(2):235-247, 2011.

[23] J. M. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. Journal of Machine Learning Research, 11:2169-2173, August 2010.

[24] S. Natarajan, P. Tadepalli, E. Altendorf, T. G. Dietterich, A. Fern, and A. C. Restificar. Learning first-order probabilistic models with combining rules. In ICML, pages 609-616, 2005.

[25] R. S. Niculescu, T. M. Mitchell, and R. B. Rao. Bayesian network learning with parameter constraints. The Journal of Machine Learning Research, 7:1357-1383, 2006.

[26] R. S. Niculescu, T. M. Mitchell, and R. B. Rao. A theoretical framework for learning Bayesian networks with parameter inequality constraints. In Proc. of the 20th International Joint Conference on Artificial Intelligence, pages 155-160, 2007.

[27] S. Poll, A. Patterson-Hine, J. Camisa, D. Garcia, D. Hall, C. Lee, O. J. Mengshoel, C. Neukom, D. Nishikawa, J. Ossenfort, A. Sweet, S. Yentus, I. Roychoudhury, M. Daigle, G. Biswas, and X. Koutsoukos. Advanced diagnostics and prognostics testbed. In Proc. of the 18th International Workshop on Principles of Diagnosis (DX-07), pages 178-185, 2007.

[28] M. Ramoni and P. Sebastiani. Robust learning with missing data. Machine Learning, 45(2):147-170, 2001.

[29] E. Reed and O. J. Mengshoel. Scaling Bayesian network parameter learning with expectation maximization using MapReduce. In Proc. of the Big Learning Workshop at Neural Information Processing Systems (NIPS-12), 2012.

[30] B. Ricks and O. J. Mengshoel. Diagnosis for uncertain, dynamic and hybrid domains using Bayesian networks and arithmetic circuits. International Journal of Approximate Reasoning, 55(5):1207-1234, 2014.

[31] B. W. Ricks, C. Harrison, and O. J. Mengshoel. Integrating probabilistic reasoning and statistical quality control techniques for fault diagnosis in hybrid domains. In Proc. of the Annual Conference of the Prognostics and Health Management Society 2011 (PHM-11), Montreal, Canada, 2011.

[32] B. W. Ricks and O. J. Mengshoel. Methods for probabilistic fault diagnosis: An electrical power system case study. In Proc. of the Annual Conference of the PHM Society, 2009 (PHM-09), San Diego, CA, 2009.

[33] B. W. Ricks and O. J. Mengshoel. Diagnosing intermittent and persistent faults using static Bayesian networks. In Proc. of the 21st International Workshop on Principles of Diagnosis (DX-10), Portland, OR, 2010.

[34] A. Saluja, P. K. Sundararajan, and O. J. Mengshoel. Age-Layered Expectation Maximization for parameter learning in Bayesian networks. In Proceedings of Artificial Intelligence and Statistics (AISTATS), La Palma, Canary Islands, 2012.

[35] B. Thiesson, C. Meek, and D. Heckerman. Accelerating EM for large databases. Machine Learning, 45(3):279-299, 2001.

[36] S. Watanabe. Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13(4):899-933, 2001.

[37] C. F. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95-103, 1983.

[38] Z. Zhang, B. T. Dai, and A. K. H. Tung. Estimating local optimums in EM algorithm over Gaussian mixture model. In Proc. of the 25th International Conference on Machine Learning, pages 1240-1247. ACM, 2008.