            Bayesian Network Parameter Learning using EM with
                            Parameter Sharing



                           Erik Reed                                    Ole J. Mengshoel
             Electrical and Computer Engineering               Electrical and Computer Engineering
                  Carnegie Mellon University                        Carnegie Mellon University



                     Abstract

This paper explores the effects of parameter sharing on Bayesian network (BN) parameter learning when there is incomplete data. Using the Expectation Maximization (EM) algorithm, we investigate how varying degrees of parameter sharing, varying numbers of hidden nodes, and different dataset sizes impact EM performance. The specific metrics of EM performance examined are: likelihood, error, and the number of iterations required for convergence. These metrics are important in a number of applications, and we emphasize learning of BNs for diagnosis of electrical power systems. One main point, which we investigate both analytically and empirically, is how parameter sharing impacts the error associated with EM's parameter estimates.

1   INTRODUCTION

Bayesian network (BN) conditional probability tables (CPTs) can be learned when the BN structure is known, for either complete or incomplete data. Different algorithms have been explored in the case of incomplete data, including: Expectation Maximization [8, 14, 15, 28], Markov Chain Monte Carlo methods such as Gibbs sampling [17], and gradient descent methods [9]. Expectation Maximization (EM) seeks to maximize the likelihood, or the Maximum a Posteriori (MAP) estimate, for the BN CPTs.

We focus in this paper on EM [8, 14, 15, 28], an iterative algorithm that converges to a maximum likelihood estimate (MLE). While EM is powerful and popular, there are several challenges that motivate our research. First, when computing MLEs, EM is easily trapped in local optima and is typically very sensitive to the placement of initial CPT values. Methods of making EM less prone to getting trapped in local optima have been investigated [11, 18, 34, 38]. Second, EM is often computationally demanding, especially when the BN is complex and there is much data [2, 3, 29, 35]. Third, the parameters that EM converges to can be far from the true probability distribution, yet still have a high likelihood. This is a limitation of EM based on MLE.

In this paper we investigate, for known BN structures, how varying degrees of parameter sharing [17, 25, 26], varying numbers of hidden nodes, and different dataset sizes impact EM performance. Specifically, we are:

  • running many random initializations (or random restarts) of EM, a technique known to effectively counteract premature convergence [10, 22];

  • recording for each EM run the following metrics: (i) the log-likelihood (ℓℓ) of the estimated BN parameters, (ii) the error (the Euclidean distance between the true and estimated BN parameters), and (iii) the number of EM iterations until convergence; and

  • testing BNs with great potential for parameter sharing, with a focus on electrical power system BNs (reflecting electrical power system components known to exhibit similar behavior).

Even when EM converges to a high-likelihood MLE, the error can be large and vary depending on initial conditions. This is a fundamental limitation of EM using MLE; even a BN with high likelihood may be far from the true distribution and thus have a large error. Error as a metric for the EM algorithm for BN parameter learning has not been discussed extensively in the existing literature. The analysis and experiments in this paper provide new insights in this area.

Our main application is electrical power systems, and in particular NASA's Advanced Diagnostics and Prognostics Testbed (ADAPT) [27]. ADAPT has already been represented as BNs, which have proven themselves as very well-suited to electrical power system
health management [12, 19-21, 30-33]. Through compilation of BNs to arithmetic circuits [4, 5], a broad range of discrete and continuous faults can be detected and diagnosed in a computationally efficient and predictable manner, resulting in award-winning performance in international diagnostic competitions [30].¹ From a machine learning and EM perspective, as considered in this paper, it is hypothesized that the learning of ADAPT BNs may benefit from parameter sharing. This is because there are several repeated BN nodes and fragments in these BNs. In addition to parameter sharing, we study in this paper the impact on EM of varying the number of hidden nodes, reflecting different sensing capabilities.

    ¹ Further information can be found here: https://sites.google.com/site/dxcompetition/.

Why are BNs and arithmetic circuits useful for electrical power system diagnostics? First, power systems exhibit multi-variate uncertainty, for example regarding component and sensor health (are they working or failing?) as well as noisy sensor readings. Second, there is substantial local structure, as reflected in an EPS schematic, that can be taken advantage of when constructing a BN automatically or semi-automatically [19, 20, 30]. Consequently, BN treewidth is small enough for exact computation using junction trees, variable elimination, or arithmetic circuits to be feasible [19, 30]. Third, we compile BNs into arithmetic circuits [4, 5], which are fast and predictable in addition to being exact. These are all important benefits in cyber-physical systems including electrical power systems.

The rest of this paper is structured as follows. In Section 2, we introduce BNs, parameter learning for incomplete data using EM, and related research. Section 3 presents our main application area, electrical power systems. In Section 4, we define the sharing concept, discuss sharing in EM for BN parameter learning, and provide analytical results. In Section 5 we present experimental results for parameter sharing in BNs when using EM, emphasizing electrical power system fault diagnosis using BNs. Finally, we conclude and outline future research opportunities in Section 6.

2   BACKGROUND

This section presents preliminaries including notation (see also Table 1).

    Notation          Explanation
    X                 BN nodes
    W                 BN edges
    θ                 BN CPTs
    θ̂                 estimated CPTs
    θ*                true CPTs
    O                 observable nodes
    H                 hidden nodes
    S                 (actually) shared nodes
    P                 (potentially) shared nodes
    U                 unshared nodes
    Y                 set partition of X
    TP                number of wrong CPTs
    E                 evidence nodes
    R                 non-evidence nodes
    tmin              min # of EM iterations
    tmax              max # of EM iterations
    t0                iteration # at EM convergence
    ε                 tolerance for EM
    err(θ̂)            error of θ̂ relative to θ*
    nA = |A|          cardinality of the set A
    ℓ                 likelihood
    ℓℓ                log-likelihood
    β = (X, W, θ)     Bayesian network (BN)
    E(Z)              expectation of r.v. Z
    V(Z)              variance of r.v. Z
    r                 Pearson's corr. coeff.
    θ ∈ [0, 1]        CPT parameter
    θ̂ ∈ [0, 1]        estimated CPT parameter
    θ* ∈ [0, 1]       true CPT parameter
    δ                 error bound for θ

Table 1: Notation used in this paper.

2.1   BAYESIAN NETWORKS

Consider a BN β = (X, W, θ), where X are discrete nodes, W are edges, and θ are CPT parameters. Let E ⊆ X be evidence nodes, and e the evidence. A BN factors a joint distribution Pr(X), enabling different probabilistic queries to be answered by efficient algorithms; they assume that the nodes E are clamped to values e. One query of interest is to compute a most probable explanation (MPE) over the remaining nodes R = X \ E, or MPE(e). Computation of marginals (or beliefs) amounts to inferring the posterior probabilities over one or more query nodes Q ⊆ R, specifically BEL(Q, e), where Q ∈ Q.

In this paper, we focus on situations where θ needs to be estimated but the BN structure (X and W) is known. Data is complete or incomplete; in other words, there may be hidden nodes H where H = X \ O and O are observed. A dataset is defined as (x1, ..., xm) with m samples (observations), where xi is a vector of instantiations of the nodes X in the complete data case. When the data is complete, the BN parameters θ are often estimated to maximize the data likelihood (MLE). In this paper, for a given dataset, a variable X ∈ X is either observable (X ∈ O) or hidden (X ∈ H); it is not hidden for just a strict subset of the samples.² Let nH > 0. For each hidden node H ∈ H there is then a "?" or "N/A" in each sample.

    ² In other words, a variable that is completely hidden in the training data is a latent variable. Consequently, its true distribution is not identifiable.
Learning from incomplete data also relies on a likelihood function, similar to the complete data case. However, for incomplete data several properties of the complete-data likelihood function (such as unimodality, a closed-form representation, and decomposition into a product form) are lost. As a consequence, the computational issues associated with BN parameter learning are more complex, as we now discuss.

2.2   TRADITIONAL EM

For the problem of optimizing such multi-dimensional, highly non-linear, and multimodal functions, several algorithms have been developed. They include EM, our focus in this paper. EM performs a type of hill-climbing in which an estimate, in the form of an expected likelihood function, is used in place of the true likelihood ℓ.

Specifically, we examine the EM approach to learning BN parameters θ from incomplete data sets.³ The traditional EM algorithm, without sharing, initializes the parameters to θ(0). Then, EM alternates between an E-step and an M-step. In the t-th E-step, using the parameters θ(t) and the observables from the dataset, EM generates the likelihood ℓ(t), taking into account the hidden nodes H. In the M-step, EM modifies the parameters to θ(t+1) to maximize the data likelihood. While |ℓ(t) − ℓ(t−1)| ≥ ε, where ε is a tolerance, EM repeats from the E-step.

    ³ Using EM, learning from complete data is a special case of learning from incomplete data.
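A minimal sketch of this loop in Python (illustrative only; the hypothetical callbacks e_step and m_step stand in for BN inference and parameter re-estimation, and this is not the LibDAI implementation used in Section 5):

    def em_sketch(theta0, data, e_step, m_step, eps=1e-3, t_min=3, t_max=100):
        # Traditional EM control flow: alternate E- and M-steps until the
        # log-likelihood changes by at most eps (Section 5 uses these defaults).
        theta, ll_prev = theta0, None
        for t in range(1, t_max + 1):
            stats, ll = e_step(theta, data)   # E-step: expected statistics and log-likelihood
            theta = m_step(stats)             # M-step: maximize the expected likelihood
            if ll_prev is not None and t >= t_min and abs(ll - ll_prev) <= eps:
                return theta, ll, t           # converged at iteration t0 = t
            ll_prev = ll
        return theta, ll, t_max               # hit the iteration cap (rare in practice)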
EM monotonically increases the likelihood function ℓ or the log-likelihood function ℓℓ, thus EM converges to a point θ̂ or a set of points (a region). Since ℓℓ is bounded, EM is guaranteed to converge. Typically, EM converges to a local maximum [37] at some iteration t0, bounded as follows: tmax ≥ t0 ≥ tmin. Due to the use of ε above, it is, for practical implementations of EM with restart, more precise to discuss regions of convergence, even when there is point-convergence in theory.

One topic that has been discussed is the initialization phase of the EM algorithm [8]. A second research topic is stochastic variants of EM, typically known as Stochastic EM [7, 11]. Generally, Stochastic EM is concerned with improving the computational efficiency of EM's E-step. Several other methods for increasing the efficacy of the EM algorithm for BNs exist. These include parameter constraints [1, 6, 25], parameter inequalities [26], exploiting domain knowledge [17, 24], and parameter sharing [13, 17, 25, 26].

When data is incomplete, the BN parameter estimation problem is in general non-identifiable. There may be several parameter estimates θ̂1, ..., θ̂m that have the same likelihood, given the dataset [36]. Thus, we need to be careful when applying standard asymptotic theory from statistics (which assumes identifiability) and when interpreting a learned model. Section 4.2 introduces an error measure that provides some insight regarding identifiability, since it measures distance from the true distribution θ*.

3   ELECTRICAL POWER SYSTEMS

Electrical Power Systems (EPSs) are critical in today's society; for instance, they are essential for the safe operation of aircraft and spacecraft. The terrestrial power grid's transition into a smart grid is also very important, and the emergence of electrical power in hybrid and all-electric cars is a striking trend in the automotive industry.

ADAPT (Advanced Diagnostics and Prognostics Testbed) is an EPS testbed developed at NASA [27]. Publicly available data from ADAPT is being used to develop, evaluate, and mature diagnosis and prognosis algorithms. The EPS functions of ADAPT are as follows. For power generation, it currently uses utility power with battery chargers (there are also plans to investigate solar power generation). For power storage, ADAPT contains three sets of 24 VDC 100 Amp-hr sealed lead acid batteries. Power distribution is aided by electromechanical relays, and there are two load banks with AC and DC outputs. For control and monitoring there are two National Instruments compact FieldPoint backplanes. Finally, there are sensors of several types, including for: voltage, current, temperature, light, and relay positions.

ADAPT has been used in different configurations and represented in several fault detection and diagnosis BNs [12, 19-21, 30-33], some of which are investigated in this paper (see Table 2). Each ADAPT BN node typically has two to five discrete states. BN nodes represent, for instance: sensors (measuring, for example, voltage or temperature); components (for example batteries, loads, or relays); or the system health of components and sensors (broadly, they can be in "healthy" or "faulty" states).

3.1   SHARED VERSUS UNSHARED

In BN instances where parameters are not shared, the CPT for a node in the BN is treated as separate from the other CPTs. This assumption is not always reasonable, however. ADAPT BNs may benefit from shared parameters, because there are typically several
repeated nodes or fragments in these BNs [21, 30]. It is reasonable to assume that "identical" power system sensors and components will behave in similar ways. More broadly, sub-networks of identical components should function in a similar way to other "identical" sub-networks in a power system, and such knowledge can be the basis for parameter sharing.

For parameter sharing as investigated in this paper, the CPTs of some nodes are assumed to be approximately equal to the CPTs of different nodes elsewhere in the BN.⁴ Data for one set of nodes can be used elsewhere in the BN if the corresponding nodes are shared during EM learning. This is a case of parameter sharing involving the global structure of the BN, where different CPTs are shared, as opposed to parameter sharing within a single CPT [13].

    ⁴ It is unrealistic to assume that several engineered physical objects, even when it is desired that they are exactly the same, in fact turn out to have exactly the same behavior. A similar argument has been made for object-oriented BNs [14], describing it as "violating the OO assumption." We thus say that CPTs are approximately equal rather than equal. Under sharing, however, we are making the simplifying assumption that shared CPTs are equal.

In some ADAPT BNs, one sub-network is essentially duplicated three times, reflecting triple redundancy in ADAPT's power storage and distribution network [27]. Such a six-node sub-network from an ADAPT BN is shown in Figure 1. This sub-network was, in this paper, manually selected for further study of sharing due to this duplication. The sub-network is part of the mini-ADAPT BN used in experiments in Section 5.3. In the mini-ADAPT sharing condition, only the node SB was shared between all three BN fragments. Generally, we define shared nodes S and unshared nodes U, with U = X \ S.

[Figure 1]
Figure 1: One of three similar sub-networks in the mini-ADAPT BN. Under parameter sharing, the SB node is shared between the three sub-networks.

3.2   OBSERVABLE VERSUS HIDDEN

Consider a complex engineered system, such as an electrical power system. After construction, but before it is put into production, it is typically tested extensively. The sensors used during testing lead to one set of observation nodes in the BN, OT. The sensors used during production lead to another set of observations in the BN, OP. For reasons including cost, fewer sensors are typically used during production than during testing, thus we assume OP ⊆ OT.

As an example, in Figure 1 we denote OP = {CB, S} as production observation nodes, and OT = {CB, S, HB, HS} as testing observation nodes.

In all our experiments, shared nodes are also hidden, or S ⊆ H. Typically, there are hidden nodes that are not necessarily shared, or S ⊂ H. This is, for example, the case for the mini-ADAPT BN as reflected in the sub-network in Figure 1.

4   EM WITH SHARING

4.1   SHARING IN BAYESIAN NETWORKS

Consider a BN β = (X, W, θ). A sharing set partition for the nodes X is a set partition Y of X with subsets Y1, ..., Yk, with Yi ⊆ X and k ≥ 1. For each Yi with k ≥ i ≥ 1, the nodes X ∈ Yi share a CPT during EM learning, as discussed in Section 4.2. We assume that the nodes in Yi have exactly the same number of states. The same applies to their respective parents in β, leading to each Y ∈ Yi having the same number of parent instantiations and exactly the same CPTs.

Traditional non-sharing is a special case of sharing in the following way. We assign each BN node to a separate subset, such that for X = {X1, ..., Xn} we have Y1, ..., Yn with Yi = {Xi}.
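As a small illustration of these definitions (hypothetical node names, not code from the paper), a sharing set partition can be represented as a list of node sets, with traditional non-sharing as the all-singletons special case:

    # Nodes of a BN (names are illustrative only).
    X = ["X1", "X2", "X3", "X4", "X5"]

    # A sharing set partition Y: X3, X4, and X5 share one CPT during EM
    # learning, while X1 and X2 keep individual CPTs (k = 3 subsets).
    Y_sharing = [{"X1"}, {"X2"}, {"X3", "X4", "X5"}]

    # Traditional non-sharing as the special case: one singleton per node.
    Y_no_sharing = [{x} for x in X]

    # Sanity check that Y is a set partition of X.
    assert set().union(*Y_sharing) == set(X)
    assert sum(len(s) for s in Y_sharing) == len(X)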
One key research goal is to better understand the behavior of EM as the sharing nodes S and the observable nodes O vary. We examine three cases: complete data OC (no hidden nodes); a sensor-rich testing setting with observations OT; and a sensor-poor production setting with observations OP. Understanding the impact of varying the observations O is important due to the cost and effort associated with observation or sensing.

4.2   SHARING EM

Similar to traditional EM (see Section 2.2), the sharing EM algorithm also takes as input a dataset and estimates a vector of BN parameters θ̂ by iteratively improving ℓℓ until convergence. The main difference of sharing EM compared to traditional EM is that we are now setting some nodes as shared S, according to Y.
To arrive at the sharing EM algorithm from the traditional EM algorithm, we modify the likelihood function to introduce parameter sharing and combine parameters (see also [13, 17]). When running sharing EM, the nodes X are treated as separate in the E-step. There is a slightly modified M-step, using Y, to aggregate the shared CPTs and parameters.⁵ This is the use of aggregate sufficient statistics, which considers sufficient statistics from more than one BN node [13].

    ⁵ For the relevant LibDAI source code, please see here: https://github.com/erikreed/HadoopBNEM/blob/master/src/emalg.cpp#L167. In words, it is a modification of the collection of sufficient statistics during the maximization step.
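The footnote above points to the authors' LibDAI modification; the following Python sketch shows one way such aggregation could look, under the assumption that the E-step produces expected-count arrays of identical shape for every node in a sharing subset (a schematic illustration, not the actual emalg.cpp code):

    import numpy as np

    def m_step_with_sharing(expected_counts, partition):
        # expected_counts: dict mapping node name -> array of expected counts,
        # indexed by (parent instantiation, node state); arrays for nodes in the
        # same sharing subset are assumed to have identical shapes (Section 4.1).
        theta_hat = {}
        for subset in partition:
            pooled = sum(expected_counts[node] for node in subset)  # aggregate statistics
            cpt = pooled / pooled.sum(axis=-1, keepdims=True)       # normalize per parent state
            for node in subset:
                theta_hat[node] = cpt                               # tied CPT for the subset
        return theta_hat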
Let θ̂i,j be the j-th estimated probability parameter for BN node Xi ∈ X. We define the error of θ̂ for the BN (X, W, θ̂) as the L2 distance from the true probability distribution θ* from which the data is sampled:

    err(θ̂) = sqrt( ∑i ∑j (θ*i,j − θ̂i,j)² ).                                      (1)

This error is the summation of the Euclidean distances between the true and estimated CPT parameters, or the L2 distance, providing an overall distance metric.

Why do we use Euclidean distance to measure error? One could, after all, argue that this distance metric is poor because it does not agree with likelihood. We use Euclidean distance because we are interested not only in the black-box performance of the BN, but also in its validity and understandability to a human expert.⁶ This is important when an expert needs to evaluate, validate, or refine a BN model, for example a BN for an electrical power system.

    ⁶ We assume that (1) is better than likelihood in this regard, if the original BN was manually constructed. The BNs experimented with in Section 5 were manually constructed, for example.
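A direct reading of (1) as a Python sketch, assuming the estimated and true CPTs are supplied as dictionaries of per-node parameter arrays with matching layout:

    import numpy as np

    def err(theta_hat, theta_true):
        # Equation (1): L2 distance between estimated and true CPT parameters,
        # summed over all nodes i and all CPT entries j.
        total = 0.0
        for node, true_cpt in theta_true.items():
            diff = np.asarray(true_cpt) - np.asarray(theta_hat[node])
            total += np.sum(diff ** 2)
        return float(np.sqrt(total))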
4.3   EM'S BEHAVIOR UNDER SHARING

We now provide a simple analysis of certain aspects of traditional EM (TEM) and sharing EM (SEM). For simplicity, we only consider EM runs that converge and exclude runs that time out.⁷

    ⁷ In practice, runs that time out are very rare with the parameter settings we use in the experiments.

For a node Xi ∈ X, TEM will converge to one among potentially several convergence regions. Suppose that the CPT of node Xi has γ(Xi) convergence regions. Then the actual number of convergence regions γ(βU) for a non-shared BN βU with nodes X = {X1, ..., Xn} is upper bounded by γ̄(βU) as follows:

    γ(βU) ≤ γ̄(βU) = ∏(i=1..n) γ(Xi).                                             (2)

Due to the sharing, SEM intuitively has fewer convergence regions than TEM. This is due to SEM's slightly modified M-step, which aggregates the shared CPTs and parameters. Consider a BN βS with exactly the same nodes and edges as βU, but with sharing, specifically with sharing set partitions Y1, ..., Yk and k < n. Without loss of generality, assume that Xi ∈ Yi. Then the actual number of convergence regions γ(βS) is upper bounded by γ̄(βS) as follows:

    γ(βS) ≤ γ̄(βS) = ∏(i=1..k) γ(Xi),                                             (3)

assuming that the γ(Xi) convergence regions used for Xi in (2) carry over to Yi.

A special case of (3) is when there exists exactly one Y' ∈ Y such that |Y'| ≥ 2, while for any Z ∈ Y \ Y' we have |Z| = 1. The experiments performed in Section 5 are all for this special case. Specifically, but without loss of generality, let Yi = {Xi} for 1 ≤ i < k and Yk = {Xk, ..., Xn} with nS = n − k + 1 (i.e., S = Yk has nS sharing nodes). It is illustrative to consider the ratio:

    γ̄(βU) / γ̄(βS) = ∏(i=1..n) γ(Xi) / ∏(i=1..k) γ(Xi) = ∏(i=k+1..n) γ(Xi).       (4)

Here, we assume that Xi has γ(Xi) convergence regions in both βU and βS, and take into account that for the shared nodes Yk = S, the CPTs are tied together.

The simple analysis above suggests a non-trivial impact of sharing, given the multiplicative effect of the γ(Xi)'s for k + 1 ≤ i ≤ n in (4). For example, if n = 5, every node has γ(Xi) = 3 convergence regions, and all five nodes are tied in a single sharing subset, the bound drops from 3^5 = 243 to 3, a reduction by the factor 3^4 = 81 given by (4). However, since upper bounds are the focus in this analysis, only a partial and conservative picture is painted. The experiments in Section 5 (see for example Figure 3, Figure 5, and Figure 6) provide further details.

4.4   ANALYSIS OF ERROR

We now consider the number of erroneous CPTs estimated by SEM when sharing is varied. Clearly, a CPT parameter is continuous, and its EM estimate is extremely unlikely to be exactly equal to the original parameter. Thus we consider here a discrete variable, based on forming an interval in the one-parameter case. Generally, let a discrete BN node X ∈ X have k states such that xi ∈ {x1, ..., xk}. Consider θxi|z = Pr(X = xi | Z = z) for a parent instantiation z. We now have an original CPT parameter θi ∈ {θx1|z, ..., θxk−1|z} and its EM estimate θ̂i ∈ {θ̂x1|z, ..., θ̂xk−1|z}. Let us jointly consider the original CPT parameter θ*i and its estimate θ̂i. If θ̂i ∈ [θ*i − δi, θ*i + δi] we count θ̂i as correct, and say θ̂i = θ*i; else it is incorrect or wrong, and we say θ̂i ≠ θ*i.
This analysis clearly carries over to multiple CPT parameters, parent instantiations, and BN nodes. It shows how we go from continuous estimated CPT parameters θ̂ to a discrete value (the number of wrong or incorrect CPT estimates), where the latter is used in this analysis.
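A minimal sketch of this discretization, with a hypothetical per-parameter tolerance delta and CPTs given as flat parameter sequences:

    def count_wrong_cpts(theta_hat, theta_true, delta):
        # TP: number of CPTs counted as wrong, where a CPT estimate is "correct"
        # only if every entry lies within +/- delta of the original parameter.
        wrong = 0
        for node, true_cpt in theta_true.items():
            est_cpt = theta_hat[node]
            if any(abs(e - t) > delta for e, t in zip(est_cpt, true_cpt)):
                wrong += 1
        return wrong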
Suppose that up to nP nodes can be shared. Further suppose that nS nodes are actually shared while nU nodes are unshared, with nU + nS = nP. Let TP be a random variable representing the total number of wrong or incorrect CPTs, TS the total for shared nodes, and TU the total for unshared nodes. Clearly, we have TP = TS + TU.

Let us first consider the expectation E(TP). Due to linearity, we have E(TP) = E(TS) + E(TU). In the non-shared case, assume for simplicity that errors are iid and follow a Bernoulli distribution, with probability p of error and (1 − p) = q of no error.⁸ This gives E(TU) = nU p, using the fact that a sum of Bernoulli random variables follows a Binomial distribution.⁹

In the shared case, all shared nodes either have an incorrect CPT θ̂ ≠ θ* or the correct CPT θ̂ = θ*. Assuming again probability p of error¹⁰ and (1 − p) = q of no error, and by using the definition of expectation of Binomials, we obtain E(TS) = nS p.

Substituting into E(TP) we get

    E(TP) = nU p + nS p = nP p.                                                   (5)

Let us next consider the variance V(TP). While variance in general is not linear, we assume linearity for simplicity, and obtain

    V(TP) = V(TU) + V(TS).                                                        (6)

In the non-shared case we have again a Binomial distribution, with well-known variance

    V(TU) = nU p(1 − p).                                                          (7)

In the shared case we use the definition of variance, put p1 = (1 − p), p2 = p, and µ = nS p (with outcomes x1 = 0 and x2 = nS), and obtain after some simple manipulations:

    V(TS) = ∑(i=1..2) pi (xi − µ)² = nS² p(1 − p),                                (8)

and by substituting (7) and (8) into (6) we get

    V(TP) = p(1 − p)((nP − nS) + nS²).                                            (9)

In words, (9) tells us that as the number nS of shared nodes increases at the expense of the number nU of unshared nodes, the variance due to non-shared nodes decreases linearly, but the variance due to sharing increases quadratically. The net effect shown in (9) is that the variance V(TP) of the error increases with the number of shared nodes, according to our analysis above. Expectation, on the other hand, remains constant (5) regardless of how many nodes are shared. These analytical results have empirical counterparts as discussed in Section 5; see for example the error sub-plot at the bottom of Figure 2.

    ⁸ This is a simplification, since our use of the Bernoulli assumes that each CPT is either "correct" or "incorrect." When learned from data, the estimated parameters are clearly almost never exactly correct, but close to or far from their respective original values.
    ⁹ If X is Binomial with parameters n and p, it is well-known that the expected value is E(X) = np.
    ¹⁰ The error probabilities of TS and TU are assumed to be the same as a simplifying assumption.
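Under the stated Bernoulli and independence assumptions, (5) and (9) are easy to check numerically; the following sketch (with hypothetical parameter values) treats each unshared CPT as wrong independently with probability p and all shared CPTs as wrong together with probability p:

    import numpy as np

    def simulate_tp(n_p, n_s, p, trials=100000, seed=0):
        # TP = TU + TS under the Section 4.4 assumptions: unshared CPTs are
        # wrong independently (Binomial), while all n_s shared CPTs are wrong
        # together (0 or n_s of them wrong at once).
        rng = np.random.default_rng(seed)
        t_u = rng.binomial(n_p - n_s, p, size=trials)
        t_s = n_s * rng.binomial(1, p, size=trials)
        t_p = t_u + t_s
        return t_p.mean(), t_p.var()

    # With n_P = 10 and p = 0.2: E(TP) = 2.0 for any n_S, while (9) gives
    # V(TP) = 0.16 * ((10 - n_S) + n_S^2), i.e., 1.6, 4.8, and 16.0 below.
    for n_s in (0, 5, 10):
        print(n_s, simulate_tp(10, n_s, 0.2))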
5   EXPERIMENTS

We now report on EM experiments for several different BNs, using varying degrees of sharing. We also vary the number of hidden nodes and the dataset size. We used ε = 1e-3 as the EM convergence criterion, meaning that EM stopped at iteration t0 when the ℓℓ-score changed by a value ≤ ε between iterations t0 − 1 and t0, for tmax ≥ t0 ≥ tmin. In these experiments, tmax = 100 and tmin = 3.

5.1   METHODS AND DATA

Bayesian networks. Table 2 presents the BNs used in the experiments.¹¹ Except for the BN Pigs, these BNs all represent (parts of) the ADAPT electrical power system (see Section 3). The BN Pigs has the largest number of nodes that can be shared (nP = 296), comprising 67% of the entire BN. The largest BN used, in terms of node count, edges, and total CPT size, is ADAPT T2.

    ¹¹ ADAPT BNs can be found here: http://works.bepress.com/ole_mengshoel/.

    NAME          |X|    |P|    |W|     CPT
    ADAPT T1      120     26    136    1504
    ADAPT T2      671    107    789   13281
    ADAPT P1      172     33    224    4182
    ADAPT P2      494     99    602   10973
    mini-ADAPT     18      3     15     108
    Pigs          441    296    592    8427

Table 2: Bayesian networks used in experiments. The |P| column presents the number of potentially shared nodes, with the actually shared nodes S ⊆ P. The CPT column denotes the total number of parameters in the conditional probability tables.

Datasets. Data for EM learning of parameters for these BNs were generated using forward sampling.¹² Each sample in a dataset is a vector x (see Section 2.1). The larger BNs were tested with increasing numbers of samples ranging from 25 to 400, while mini-ADAPT was tested with 25 to 2000 samples.

    ¹² Our experiments are limited in that we are only learning the parameters of BNs, using data generated from those BNs. Clearly, in most applications, data is not generated from a BN and the true distribution does not conform exactly to some BN structure. However, our analytical and experimental investigation of error would not have been possible without this simplifying assumption.
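Forward (ancestral) sampling instantiates nodes in topological order, each drawn from its CPT given the already-sampled parent values; a minimal sketch with a hypothetical BN representation (not the sampler actually used for these experiments) is:

    import numpy as np

    def forward_sample(nodes, parents, cpts, m, hidden=(), seed=0):
        # nodes: node names in topological order; parents[n]: tuple of parent names;
        # cpts[n]: numpy array indexed by the parents' states, whose last axis is a
        # distribution over the states of n. Hidden nodes are reported as "?"
        # in each sample, as described in Section 2.1.
        rng = np.random.default_rng(seed)
        samples = []
        for _ in range(m):
            x = {}
            for n in nodes:
                dist = cpts[n][tuple(x[p] for p in parents[n])]
                x[n] = int(rng.choice(len(dist), p=dist))
            samples.append({n: ("?" if n in hidden else v) for n, v in x.items()})
        return samples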
Sharing. Each BN has a different number of parameters that can be shared, where a set of nodes Yi ∈ Y with equal CPTs is deemed sharable. In most cases, there were multiple sets of nodes Y1, ..., Yk with |Yi| ≥ 2 for k ≥ i ≥ 1. When multiple sets were available, the largest set was selected for experimentation, as shown in Section 5.2's pseudo-code.
Metrics. After an EM trial converged to an estimate θ̂, we collected the following three metrics:

 1. the number of iterations t0 needed to converge to θ̂,

 2. the log-likelihood ℓℓ of θ̂, and

 3. the error: the distance between θ̂ and the original θ* (see (1) for the definition).

To provide reliable statistics on the mean and standard deviation, many randomly initialized EM trials were run.

Software. Among the available software implementations of the EM algorithm for BNs, we have based our work on LibDAI [23].¹³ LibDAI uses factor graphs for its internal representation of BNs, and has several BN inference algorithms implemented. During EM, the exact junction tree inference algorithm [16] was used, since it has performed well previously [19].

    ¹³ www.libdai.org

5.2   VARYING NUMBER OF SHARED NODES

Here we investigate how varying the number of shared nodes impacts EM. A set of hidden nodes H was created for each BN by selecting

    H = arg max(Yi ∈ Y) |Yi|,

where Y is a sharing set partition for the BN nodes X (see Section 4.1). In other words, each experimental BN had its largest set of shareable nodes hidden, giving nH = 12 nodes for ADAPT T1, nH = 66 nodes for ADAPT T2, nH = 32 nodes for ADAPT P1, and nH = 145 nodes for Pigs.

The following gradual sharing method is used to vary sharing. Given a fixed set of hidden nodes H and an initially empty set of shared nodes S:

 1. Randomly add ΔnS ≥ 1 hidden nodes that are not yet shared to the set of shared nodes S. Since we only have a single sharing set, this means moving ΔnS nodes from the set H \ S to the set S.

 2. Perform m sharing EM trials in this configuration, and record the three metrics for each trial.

 3. Repeat until all hidden nodes are shared; that is, S = H.

Using the gradual sharing method above, BN nodes were picked as hidden and then gradually shared. When increasing the number of shared nodes, the new set of shared nodes was a superset of the previous set, and a certain number of EM trials was performed for each set.
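A sketch of the gradual sharing method above; sharing_em_trial is a hypothetical stand-in for one randomly initialized sharing EM run that returns the three metrics:

    import random

    def gradual_sharing(hidden, delta_ns, trials_per_setting, sharing_em_trial, seed=0):
        # hidden: the fixed set H of hidden nodes (e.g., the largest shareable
        # subset, H = argmax |Yi|). Shared nodes S start empty and grow by
        # delta_ns randomly chosen hidden nodes per step until S = H.
        random.seed(seed)
        shared, results = set(), []
        while shared != set(hidden):
            candidates = [h for h in hidden if h not in shared]
            shared |= set(random.sample(candidates, min(delta_ns, len(candidates))))
            for _ in range(trials_per_setting):
                ll, error, iterations = sharing_em_trial(frozenset(shared))
                results.append((len(shared), ll, error, iterations))
        return results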
                                                           each set.
                                                  ˆ
 1. number of iterations t0 needed to converge to ✓,
                                                           5.2.1   One Network
                         ˆ and
 2. log-likelihood `` of ✓,
                                                           For ADAPT T2, m = 200 samples were generated and
 3. error: distance between ✓ˆ and the original ✓ ⇤ (see   nH = 66 nodes were hidden. We used, in the gradual
    (1) for the definition).                               sharing method, nS = 4 from nS = 2 to nS = 66
                                                           (every hidden node was eventually set as shared).
To provide reliable statistics on mean and standard
deviation, many randomly initialized EM trials were
run.
Software. Among the available software implementa-
tions of the EM algorithm for BNs, we have based our
work on LibDAI [23].13 LibDAI uses factor graphs for
its internal representation of BNs, and has several BN
inference algorithms implemented. During EM, the
exact junction tree inference algorithm [16] was used,
since it has performed well previously [19].

5.2      VARYING NUMBER OF SHARED
         NODES

Here we investigate how varying the number of shared
nodes impacts EM. A set of hidden nodes H was cre-
ated for each BN by selecting

                   H = arg max |Y i |,
                          Y i 2Y

where Y is a sharing set partition for BN nodes X
(see Section 4.1). In other words, each experimental       Figure 2: The number of iterations (top), log-
BN had its largest set of shareable nodes hidden, giv-     likelihood or `` (middle), and error (bottom) for a
ing nH = 12 nodes for ADAPT T1, nH = 66 nodes              varying number of shared nodes nS (along the x-axis)
for ADAPT T2, nH = 32 nodes for ADAPT P1, and              for the BN ADAPT T2. Here, nH = 66 nodes are hid-
nH = 145 nodes for Pigs.                                   den. Shared nodes are a random subset of the hidden
  13
       www.libdai.org                                      nodes, so nS  nH .

                                                     54
Figure 2 summarizes the results from this experiment. Here, nS is varied along the x-axis while the y-axis shows statistics for different metrics in each of the three sub-plots. For example, in the top plot of Figure 2, each marker is the mean number of iterations (µ), and the error bar is ± one standard deviation (σ). The main trends¹⁴ in this figure are: parameter sharing increased the mean number of iterations required for EM and slowly decreased the mean ℓℓ. Increasing the number of shared nodes resulted in a corresponding increase in the standard deviation of the number of iterations, ℓℓ, and error of the BN ADAPT T2. For the standard deviation of error, this is in line with our analysis in Section 4.4.

    ¹⁴ We say "main trends" because the curves for the metrics mean number of iterations and ℓℓ are in fact reversing their respective trends and dropping close to the maximum of 66 shared nodes.

5.2.2   Multiple Networks

We now investigate how varying the number of shared nodes impacts EM for several BNs, specifically the correlation between the number of parameters shared and the mean µ and standard deviation σ for our three metrics. To measure correlation, we use Pearson's sample correlation coefficient r:

    r(X, Y) = ∑(i=1..m) (xi − x̄)(yi − ȳ) / ((m − 1) sx sy),                      (10)

where x̄ and ȳ are the sample means of two random variables X and Y, and sx and sy are the sample standard deviations of X and Y, respectively. Here, (10) measures the correlation between m samples from X and Y.¹⁵

    ¹⁵ In this case, X is the independent variable, specifically the number of shared nodes. We treat the metric Y as a function of X. When X is highly correlated with Y, this is expressed in r through extreme (positive or negative) correlation values.

The number of samples, m, refers to the number of (X, Y) sharing samples we have for this correlation analysis (and not the number of samples used to learn BN parameters). In all these experiments, we do a trial for a number of shared nodes, giving several (X, Y) pairs. Consequently, each number of shared nodes tested would be an X, and the metric measured would be a Y. For example, if we use nS ∈ {2, 4, 6, 8, 10} shared nodes, then m = 5.

We now tie r(X, Y) in (10) to the r(µ) and r(σ) used in Table 3. In Table 3, µ and σ show the mean and standard deviation, respectively, for a metric. Thus, r(µ) is the correlation between the number of shared nodes and the mean of a metric, while r(σ) is the correlation between the number of shared nodes and the standard deviation of that metric.
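For example, an r(µ) entry of Table 3 can be computed as in the following sketch (the metric values are hypothetical; numpy's corrcoef implements the same Pearson coefficient as (10)):

    import numpy as np

    # Hypothetical example: the numbers of shared nodes tested (X in (10)) and
    # the corresponding per-setting means of a metric over 50 EM trials each (Y).
    n_shared = np.array([2, 4, 6, 8, 10])              # m = 5 sharing samples
    metric_mean = np.array([14.1, 15.0, 15.8, 16.3, 17.2])

    r_mu = np.corrcoef(n_shared, metric_mean)[0, 1]    # Pearson's r, as in (10)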
Figure 2 helps in understanding exactly what is being correlated, as µ and σ for all three metrics are shown for the BN ADAPT T2. In the top plot, r(µ) is the correlation between the number of shared nodes (x-axis) and the mean number of iterations (y-axis). In other words, each mean yi = µi is for a batch of 50 trials of EM. The mean ȳ used in Pearson's r is, in this case, a mean of means, namely the mean over the 50-EM-trial means over different numbers of shared nodes.

                               ADAPT T1
             Iterations          Likelihood           Error
     Size    r(µ)     r(σ)      r(µ)     r(σ)      r(µ)      r(σ)
      25    -0.937   -0.135     0.997    0.882    -0.068     0.888
      50    -0.983   -0.765     0.992    0.941     0.383     0.988
     100    -0.825    0.702     0.494    0.941     0.786     0.985
     200     0.544    0.926    -0.352    0.892     0.517     0.939
     400     0.810    0.814    -0.206    0.963     0.863     0.908

                               ADAPT T2
             Iterations          Likelihood           Error
     Size    r(µ)     r(σ)      r(µ)     r(σ)      r(µ)      r(σ)
      25     0.585    0.852     0.749    0.842     0.276     0.997
      50     0.811    0.709     0.377    0.764     0.332     0.981
     100     0.772    0.722    -0.465    0.855    -0.387     0.992
     200     0.837    0.693    -0.668    0.788    -0.141     0.989
     400     0.935    0.677    -0.680    0.784    -0.0951    0.987

                               ADAPT P1
             Iterations          Likelihood           Error
     Size    r(µ)     r(σ)      r(µ)     r(σ)      r(µ)      r(σ)
      25    -0.769   -0.0818   -0.926    0.278    -0.107     0.774
      50    -0.864   -0.0549   -0.939    0.218    -0.331     0.188
     100    -0.741    0.436    -0.871    0.678    -0.458     0.457
     200    -0.768    0.408    -0.879    0.677    -0.196     0.700
     400    -0.665    0.560    -0.862    0.681     0.00444   0.653

                                 PIGS
             Iterations          Likelihood           Error
     Size    r(µ)     r(σ)      r(µ)     r(σ)      r(µ)      r(σ)
      25     0.954    0.763     0.970    0.614     0.965     0.887
      50     0.966    0.956     0.137    0.908     0.740     0.869
     100    -0.922   -0.0167   -0.994   -0.800    -0.893     0.343
     200    -0.827   -0.394    -0.985   -0.746    -0.577     0.0157
     400    -0.884   -0.680    -0.942   -0.699    -0.339     0.285

Table 3: Values for Pearson's r for the correlation between the number of shared nodes and statistics for these metrics: number of iterations, log-likelihood (ℓℓ), and error. r(µ) gives the correlation between the number of shared nodes and the mean of the metric, while r(σ) gives the correlation for the standard deviation.

Table 3 summarizes the experimental results for four BNs. In this table, a positive correlation implies that parameter sharing increased the corresponding metric statistic. For example, the highest correlation between the number of shared nodes and mean likelihood is for ADAPT T1 at 25 samples, where r(µ) = 0.997. This suggests that increasing the number of shared nodes was highly correlated with an increase in the likelihood of EM. Negative coefficients show that increasing the number of shared nodes resulted in a decrease of the corresponding metric statistic.

A prominent trend in Table 3 is the consistently positive correlation between the number of shared nodes nS and the standard deviation of error, r(σ), for all 4 BNs. This is in line with the analytical result involving nS in (9).
The number of samples was shown to have a significant impact on these correlations. The Pigs network showed a highly correlated increase in the mean number of iterations for 25 and 50 samples. However, for 100, 200, and 400 samples there was a decrease in the mean number of iterations. The opposite behavior is observed in ADAPT T1, where fewer samples resulted in better performance for parameter sharing (reducing the mean number of iterations), while for 200 and 400 samples we found that parameter sharing increased the mean number of iterations. Further experimentation and analysis may improve the understanding of the interaction between sharing and the number of samples.

5.3   CONVERGENCE REGIONS

5.3.1   Small Bayesian Networks

First, we will show how sharing influences EM parameter interactions for the mini-ADAPT BN shown in Figure 1 and demonstrate how shared parameters jointly converge.

Earlier we introduced OP as observable nodes in a production system and OT as observable nodes in a testing system. Complementing OP, the hidden nodes are HP = {HB, HS, VB, SB}. Complementing OT, the hidden nodes are HT = {VB, SB}. When a node is hidden, the EM algorithm will converge to one among its potentially many convergence regions. For OP, EM had much less observed data to work with than for OT (see Figure 1). For OP, the health breaker node HB was, for instance, not observed or even connected to any nodes that were observed. In contrast, OT was designed to allow better observation of the components' behaviors, and HB ∈ OT. From mini-ADAPT, 500 samples were generated. Depending on the observable set used, either nH = |HT| = 2 or nH = |HP| = 4 nodes were hidden, and 600 random EM trials were executed with and without sharing.

                                NO SHARING
    Observable        Error               Likelihood            Iterations
                    µ        σ          µ          σ           µ       σ
    OC           16.325     (-)      -3.047e4     (-)         (-)     (-)
    OP            48.38    3.74      -2.009e4    43.94       16.39   4.98
    OT            33.25    7.45      -2.492e4     1190        8.65   1.99

                                 SHARING
    Observable        Error               Likelihood            Iterations
                    µ        σ          µ          σ           µ       σ
    OC           16.323     (-)      -3.047e4     (-)         (-)     (-)
    OP            48.56    3.92      -2.010e4    62.87       15.98   4.95
    OT            34.12    14.3      -2.629e4     2630        6.59   2.48

Table 4: Comparison of No Sharing (top) versus Sharing (bottom) for different observable node sets OC, OP, and OT during 600 EM trials for mini-ADAPT.

Table 4 shows results, in terms of means µ and standard deviations σ, for these EM trials. For OP, with nH = 4, the means µ of the metrics error, likelihood, and number of iterations showed minor differences when parameter sharing was introduced. The largest change due to sharing was an increase in the σ of likelihood. For OT, where nH = 2, the differences were greater. The µ of likelihood for sharing was lower, with over a 2x increase in σ. The µ for error showed only a minor change, but nearly a 2x increase in σ. This is consistent with our analysis in Section 4.4.

[Figure 3: panels (a) OP – No Sharing, (b) OP – Sharing, (c) OT – No Sharing, (d) OT – Sharing]
Figure 3: The progress of different random EM trials for the mini-ADAPT BN. Both the degree of sharing (No Sharing versus Sharing) and the number of observable nodes (OP versus OT) are varied.

Figure 3a and Figure 3c show how the log-likelihood ℓℓ (x-axis) and error (y-axis) changed during 15 EM trials for OP and OT, respectively. These EM trials were selected randomly among the trials reported on in Table 4. Parameter sharing is introduced in Figure 3b and Figure 3d. For OP, the progress of the EM trials is similar for sharing (Figure 3b) and non-sharing (Figure 3a), although for a few trials in the sharing condition the error is more extreme (and mostly smaller!). This is also displayed in Table 4, where the difference in number of iterations, error, and likelihood was minor (relative to OT). On the other hand, there is a clear difference in the regions of convergence for OT when parameter sharing is introduced, consistent with the analysis in Section 4.3. Figure 3d shows how the EM trials typically followed a path heading to optima far above or far below the mean error, with two of the EM trials plotted converging in the middle region of the error.
With parameter sharing, there was over a 2x increase in σ. The µ for error demonstrated only a minor change, but nearly a 2x increase in σ. This is consistent with our analysis in Section 4.4.

Figure 3a and Figure 3c show how log-likelihood (ℓℓ, x-axis) and error (y-axis) changed during 15 EM trials for O_P and O_T, respectively. These EM trials were selected randomly among the trials reported on in Table 4. Parameter sharing is introduced in Figure 3b and Figure 3d. For O_P, the progress of the EM trials is similar for sharing (Figure 3b) and non-sharing (Figure 3a), although for a few trials in the sharing condition the error is more extreme (and mostly smaller!). This is also displayed in Table 4, where the differences in number of iterations, error, and likelihood were minor (relative to O_T). On the other hand, there is a clear difference in the regions of convergence for O_T when parameter sharing is introduced, consistent with the analysis in Section 4.3. Figure 3d shows how the EM trials typically followed a path heading to optima far above or far below the mean error, with two of the plotted EM trials converging in the middle region of the error.

Figure 4: Four 20-bin histograms, for mini-ADAPT, of the error values of 600 randomly initialized EM trials at convergence. Both the degree of sharing (No Sharing versus Sharing) and the number of observable nodes (O_P versus O_T) are varied: (a) O_P – No Sharing, (b) O_P – Sharing, (c) O_T – No Sharing, (d) O_T – Sharing.
Histograms for the 600 EM trials used in Table 4 are shown in Figure 4. The 20-bin histograms show the error err(θ̂) at convergence. The O_P and O_T sets are shown without parameter sharing in Figure 4a and Figure 4c, respectively. Parameter sharing is introduced in Figure 4b and Figure 4d. There is an increase in error due to parameter sharing for O_T. When comparing Figure 4c (No Sharing) and Figure 4d (Sharing), we notice different regions of error for EM convergence due to sharing. Figure 4c appears to show four main error regions, with the middle two being greatest in frequency, while Figure 4d appears to show three regions of error, with the outer two being most frequent. The outer two regions in Figure 4d are further apart than their non-sharing counterparts, showing that parameter sharing yielded a larger range of error for O_T; see Section 4.4.
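The error metric used in these histograms, err(θ̂), is the Euclidean distance between the estimated and true CPT entries. A minimal helper is sketched below; the list-of-arrays representation of a network's parameters is an illustrative assumption, not the data structure used in the paper.

    import numpy as np

    def err(cpts_hat, cpts_true):
        """Euclidean distance between estimated and true BN parameters.

        Both arguments are lists of numpy arrays (one CPT per node) given in the
        same node order; all CPT entries are concatenated into one long vector.
        """
        flat_hat = np.concatenate([np.asarray(c, dtype=float).ravel() for c in cpts_hat])
        flat_true = np.concatenate([np.asarray(c, dtype=float).ravel() for c in cpts_true])
        return float(np.linalg.norm(flat_hat - flat_true))

    # Toy usage: two binary nodes, the second conditioned on the first.
    theta_true = [np.array([0.3, 0.7]), np.array([[0.2, 0.8], [0.9, 0.1]])]
    theta_hat  = [np.array([0.4, 0.6]), np.array([[0.25, 0.75], [0.85, 0.15]])]
    print(err(theta_hat, theta_true))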
5.3.2 Large Bayesian Networks

Next, large BNs are used to investigate the effects of parameter sharing, using a varying number of shared nodes. The larger ADAPT networks and Pigs were run with 50 EM trials¹⁶ for each configuration of observable nodes, number of samples, and number of shared nodes. Some of the results are reported here.

¹⁶ The decrease in the number of EM trials performed, relative to mini-ADAPT, was due to the substantial increase in CPU time required (days to weeks).
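In these experiments, parameter sharing constrains a group of CPTs to be identical; with EM, such tying is typically enforced in the M-step by pooling the expected counts of all CPTs in a shared group before normalizing. The sketch below illustrates that pooling step; the node names and the dict-based count representation are assumptions made for illustration, not the paper's implementation.

    import numpy as np

    def m_step_shared(expected_counts, shared_groups):
        """M-step with parameter tying.

        expected_counts: dict mapping node name -> expected-count array of shape
            (num_parent_configs, num_states), accumulated in the E-step.
        shared_groups: list of lists of node names whose CPTs are tied; nodes in
            the same group are assumed to have identically shaped count tables.
        Returns a dict of CPTs whose rows sum to 1.
        """
        cpts = {}
        tied = {n for group in shared_groups for n in group}
        # Shared nodes: pool counts over the whole group, normalize once,
        # and give every node in the group the same estimate.
        for group in shared_groups:
            pooled = sum(expected_counts[n] for n in group)
            cpt = pooled / pooled.sum(axis=1, keepdims=True)
            for n in group:
                cpts[n] = cpt
        # Unshared nodes: normalize their own counts.
        for n, counts in expected_counts.items():
            if n not in tied:
                cpts[n] = counts / counts.sum(axis=1, keepdims=True)
        return cpts

    # Toy usage: two tied breaker nodes and one untied node.
    counts = {
        "breaker1": np.array([[8.0, 2.0], [1.0, 9.0]]),
        "breaker2": np.array([[6.0, 4.0], [2.0, 8.0]]),
        "sensor":   np.array([[5.0, 5.0]]),
    }
    print(m_step_shared(counts, shared_groups=[["breaker1", "breaker2"]]))

Because every node in a group receives the same estimate, a single deviation in the pooled estimate is repeated across all tied CPTs, which is one way the number of shared nodes can influence the spread of err(θ̂).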
Figure 5a shows results for ADAPT T2 without parameter sharing during 15 EM trials using n_H = 66 hidden nodes and 200 samples. Figure 5b shows a substantial change in error when the n_H = 66 hidden nodes were shared. The range of the error for EM is much larger in Figure 5b, while the upper and lower error curves have a symmetric quality. Four regions for the converged error are visible in Figure 5b, with the inner two terminating at a lower ℓℓ than the outer two. The lowest error region of Figure 5b is also lower than the lowest error of Figure 5a, while retaining a similar ℓℓ.

Figure 5: The progress of 15 random EM trials for ADAPT T2: (a) No Sharing, (b) Sharing. The No Sharing condition (a) shows more locally optimal convergence regions than Sharing (b), where there appear to be only four locally optimal convergence regions.
Figure 6 uses a smaller ADAPT BN, containing 172 nodes instead of 671 nodes (see Table 2). Here, n_H = 33 nodes were hidden and 200 samples were used. In several respects, the results are similar to those obtained for ADAPT T2 and mini-ADAPT. However, Figure 6a shows that EM terminates on different likelihoods, which is not observed in Figure 5a. The error also appears to fluctuate more in Figure 6a, whereas the error changes the most during later iterations in Figure 5a. Figure 6b applies parameter sharing to the n_H = 33 hidden nodes. A symmetric effect is visible between high and low error, reflecting the analysis in Section 4. Of the 15 trials shown in Figure 6b, two attained ℓℓ > −1.2e4, while the rest converged at ℓℓ ≈ −1.21e4. Additionally, the ℓℓs of these two trials were greater than any of the non-sharing ℓℓs shown in Figure 6a.

Figure 6: The progress of random EM trials for ADAPT P1: (a) No Sharing, (b) Sharing. While the number of EM trials is the same for both conditions, the No Sharing condition (a) clearly shows more local optima than Sharing (b).

6 CONCLUSION

Bayesian networks have proven themselves as very suitable for electrical power system diagnostics [19, 20, 30–33]. By compiling Bayesian networks to arithmetic circuits [4, 5], a broad range of discrete and continuous faults can be handled in a computationally efficient and predictable manner. This approach has resulted in award-winning performance on public data from ADAPT, an electrical power system at NASA [30].

The goal of this paper is to investigate the effect of parameter sharing in Bayesian networks on EM's behavior. We emphasize electrical power systems as an application, and in particular examine EM for ADAPT Bayesian networks. In these networks, there is considerable opportunity for parameter sharing.

Our results suggest complex interactions between the degree of parameter sharing, the number of hidden nodes, and the dataset size in their impact on EM performance, specifically likelihood, error, and the number of iterations required for convergence. One main point, which we investigated both analytically and empirically, is how parameter sharing impacts the error associated with EM's parameter estimates. In particular, we have found analytically that the error variance increases with the number of shared parameters. Experiments with several BNs, mostly for fault diagnosis of electrical power systems, are in line with the analysis. The good news is that, in the sharing case, the error is sometimes smaller.

Further theoretical research to better understand parameter sharing is required. Since parameter sharing was demonstrated to perform poorly in certain cases, further investigation appears promising. Parameter sharing sometimes reduced the number of EM iterations required for parameter learning, while at other times it increased them. Improving the understanding of the joint impact of parameter sharing and the number of samples on the number of EM iterations would be useful, for example. Finally, it would be interesting to investigate the connection to object-oriented and relational BNs in future work.

References

[1] E.E. Altendorf, A.C. Restificar, and T.G. Dietterich. Learning from sparse data by exploiting monotonicity constraints. In Proceedings of UAI, volume 5, 2005.

[2] A. Basak, I. Brinster, X. Ma, and O. J. Mengshoel. Accelerating Bayesian network parameter learning using Hadoop and MapReduce. In Proc. of BigMine-12, Beijing, China, August 2012.

[3] P.S. Bradley, U. Fayyad, and C. Reina. Scaling EM (Expectation-Maximization) clustering to large databases. Microsoft Research Report, MSR-TR-98-35, 1998.

[4] M. Chavira and A. Darwiche. Compiling Bayesian networks with local structure. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), pages 1306–1312, 2005.

[5] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280–305, 2003.

[6] C.P. de Campos and Q. Ji. Improving Bayesian network parameter learning using constraints. In Proc. 19th International Conference on Pattern Recognition (ICPR), pages 1–4. IEEE, 2008.

[7] B. Delyon, M. Lavielle, and E. Moulines. Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, 27:94–128, 1999.

[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[9] R. Greiner, X. Su, B. Shen, and W. Zhou. Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. Machine Learning, 59(3):297–322, 2005.

[10] S.H. Jacobson and E. Yucesan. Global optimization performance measures for generalized hill climbing algorithms. Journal of Global Optimization, 29(2):173–190, 2004.

[11] W. Jank. The EM algorithm, its randomized implementation and global optimization: Some challenges and opportunities for operations research. In F. B. Alt, M. C. Fu, and B. L. Golden, editors, Perspectives in Operations Research: Papers in Honor of Saul Gass' 80th Birthday. Springer, 2006.

[12] W. B. Knox and O. J. Mengshoel. Diagnosis and reconfiguration using Bayesian networks: An electrical power system case study. In Proc. of the IJCAI-09 Workshop on Self-* and Autonomous Systems (SAS): Reasoning and Integration Challenges, pages 67–74, 2009.

[13] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. The MIT Press, 2009.

[14] H. Langseth and O. Bangsø. Parameter learning in object-oriented Bayesian networks. Annals of Mathematics and Artificial Intelligence, 32(1):221–243, 2001.

[15] S.L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics & Data Analysis, 19(2):191–201, 1995.

[16] S.L. Lauritzen and D.J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), pages 157–224, 1988.

[17] W. Liao and Q. Ji. Learning Bayesian network parameters under incomplete data with domain knowledge. Pattern Recognition, 42(11):3046–3056, 2009.

[18] G.J. McLachlan and D. Peel. Finite mixture models. Wiley, 2000.

[19] O. J. Mengshoel, M. Chavira, K. Cascio, S. Poll, A. Darwiche, and S. Uckun. Probabilistic model-based diagnosis: An electrical power system case study. IEEE Trans. on Systems, Man, and Cybernetics, 40(5):874–885, 2010.

[20] O. J. Mengshoel, A. Darwiche, K. Cascio, M. Chavira, S. Poll, and S. Uckun. Diagnosing faults in electrical power systems of spacecraft and aircraft. In Proceedings of the Twentieth Innovative Applications of Artificial Intelligence Conference (IAAI-08), pages 1699–1705, Chicago, IL, 2008.

[21] O. J. Mengshoel, S. Poll, and T. Kurtoglu. Developing large-scale Bayesian networks by composition: Fault diagnosis of electrical power systems in aircraft and spacecraft. In Proc. of the IJCAI-09 Workshop on Self-* and Autonomous Systems (SAS): Reasoning and Integration Challenges, pages 59–66, 2009.

[22] O.J. Mengshoel, D.C. Wilkins, and D. Roth. Initialization and restart in stochastic local search: Computing a most probable explanation in Bayesian networks. IEEE Transactions on Knowledge and Data Engineering, 23(2):235–247, 2011.

[23] J.M. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. Journal of Machine Learning Research, 11:2169–2173, August 2010.

[24] S. Natarajan, P. Tadepalli, E. Altendorf, T.G. Dietterich, A. Fern, and A.C. Restificar. Learning first-order probabilistic models with combining rules. In ICML, pages 609–616, 2005.

[25] R.S. Niculescu, T.M. Mitchell, and R.B. Rao. Bayesian network learning with parameter constraints. The Journal of Machine Learning Research, 7:1357–1383, 2006.

[26] R.S. Niculescu, T.M. Mitchell, and R.B. Rao. A theoretical framework for learning Bayesian networks with parameter inequality constraints. In Proc. of the 20th International Joint Conference on Artificial Intelligence, pages 155–160, 2007.

[27] S. Poll, A. Patterson-Hine, J. Camisa, D. Garcia, D. Hall, C. Lee, O.J. Mengshoel, C. Neukom, D. Nishikawa, J. Ossenfort, A. Sweet, S. Yentus, I. Roychoudhury, M. Daigle, G. Biswas, and X. Koutsoukos. Advanced diagnostics and prognostics testbed. In Proc. of the 18th International Workshop on Principles of Diagnosis (DX-07), pages 178–185, 2007.

[28] M. Ramoni and P. Sebastiani. Robust learning with missing data. Machine Learning, 45(2):147–170, 2001.

[29] E. Reed and O.J. Mengshoel. Scaling Bayesian network parameter learning with expectation maximization using MapReduce. In Proc. of the Big Learning Workshop on Neural Information Processing Systems (NIPS-12), 2012.

[30] B. Ricks and O. J. Mengshoel. Diagnosis for uncertain, dynamic and hybrid domains using Bayesian networks and arithmetic circuits. International Journal of Approximate Reasoning, 55(5):1207–1234, 2014.

[31] B. W. Ricks, C. Harrison, and O. J. Mengshoel. Integrating probabilistic reasoning and statistical quality control techniques for fault diagnosis in hybrid domains. In Proc. of the Annual Conference of the Prognostics and Health Management Society 2011 (PHM-11), Montreal, Canada, 2011.

[32] B. W. Ricks and O. J. Mengshoel. Methods for probabilistic fault diagnosis: An electrical power system case study. In Proc. of the Annual Conference of the PHM Society 2009 (PHM-09), San Diego, CA, 2009.

[33] B. W. Ricks and O. J. Mengshoel. Diagnosing intermittent and persistent faults using static Bayesian networks. In Proc. of the 21st International Workshop on Principles of Diagnosis (DX-10), Portland, OR, 2010.

[34] A. Saluja, P.K. Sundararajan, and O.J. Mengshoel. Age-Layered Expectation Maximization for parameter learning in Bayesian networks. In Proceedings of Artificial Intelligence and Statistics (AISTATS), La Palma, Canary Islands, 2012.

[35] B. Thiesson, C. Meek, and D. Heckerman. Accelerating EM for large databases. Machine Learning, 45(3):279–299, 2001.

[36] S. Watanabe. Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13(4):899–933, 2001.

[37] C.F. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983.

[38] Z. Zhang, B.T. Dai, and A.K.H. Tung. Estimating local optimums in EM algorithm over Gaussian mixture model. In Proc. of the 25th International Conference on Machine Learning, pages 1240–1247. ACM, 2008.
