            Bayesian Network Parameter Learning using EM with
                            Parameter Sharing



                           Erik Reed                                    Ole J. Mengshoel
             Electrical and Computer Engineering               Electrical and Computer Engineering
                  Carnegie Mellon University                        Carnegie Mellon University



                     Abstract

This paper explores the effects of parameter sharing on Bayesian network (BN) parameter learning when there is incomplete data. Using the Expectation Maximization (EM) algorithm, we investigate how varying degrees of parameter sharing, varying numbers of hidden nodes, and different dataset sizes impact EM performance. The specific metrics of EM performance examined are: likelihood, error, and the number of iterations required for convergence. These metrics are important in a number of applications, and we emphasize learning of BNs for diagnosis of electrical power systems. One main point, which we investigate both analytically and empirically, is how parameter sharing impacts the error associated with EM's parameter estimates.

1   INTRODUCTION

Bayesian network (BN) conditional probability tables (CPTs) can be learned when the BN structure is known, for either complete or incomplete data. Different algorithms have been explored in the case of incomplete data, including: Expectation Maximization [8, 14, 15, 28], Markov Chain Monte Carlo methods such as Gibbs sampling [17], and gradient descent methods [9]. Expectation Maximization (EM) seeks to maximize the likelihood, or the Maximum a Posteriori (MAP) estimate, for the BN CPTs.

We focus in this paper on EM [8, 14, 15, 28], an iterative algorithm that converges to a maximum likelihood estimate (MLE). While EM is powerful and popular, there are several challenges that motivate our research. First, when computing MLEs, EM is easily trapped in local optima and is typically very sensitive to the placement of initial CPT values. Methods of making EM less prone to getting trapped in local optima have been investigated [11, 18, 34, 38]. Second, EM is often computationally demanding, especially when the BN is complex and there is much data [2, 3, 29, 35]. Third, the parameters that EM converges to can be far from the true probability distribution, yet still have a high likelihood. This is a limitation of EM based on MLE.

In this paper we investigate, for known BN structures, how varying degrees of parameter sharing [17, 25, 26], varying numbers of hidden nodes, and different dataset sizes impact EM performance. Specifically, we are:

  • running many random initializations (or random restarts) of EM, a technique known to effectively counteract premature convergence [10, 22];

  • recording for each EM run the following metrics: (i) the log-likelihood (ℓℓ) of the estimated BN parameters, (ii) the error (the Euclidean distance between the true and estimated BN parameters), and (iii) the number of EM iterations until convergence; and

  • testing BNs with great potential for parameter sharing, with a focus on electrical power system BNs (reflecting electrical power system components known to exhibit similar behavior).

Even when EM converges to a high-likelihood MLE, the error can be large and vary depending on initial conditions. This is a fundamental limitation of EM using MLE; even a BN with high likelihood may be far from the true distribution and thus have a large error. Error as a metric for the EM algorithm for BN parameter learning has not been discussed extensively in the existing literature. The analysis and experiments in this paper provide new insights in this area.

Our main application is electrical power systems, and in particular NASA's Advanced Diagnostics and Prognostics Testbed (ADAPT) [27]. ADAPT has already been represented as BNs, which have proven themselves as very well-suited to electrical power system
health management [12, 19-21, 30-33]. Through compilation of BNs to arithmetic circuits [4, 5], a broad range of discrete and continuous faults can be detected and diagnosed in a computationally efficient and predictable manner, resulting in award-winning performance in international diagnostic competitions [30].¹ From a machine learning and EM perspective, as considered in this paper, it is hypothesized that the learning of ADAPT BNs may benefit from parameter sharing. This is because there are several repeated BN nodes and fragments in these BNs. In addition to parameter sharing, we study in this paper the impact on EM of varying the number of hidden nodes, reflecting different sensing capabilities.

    ¹ Further information can be found here: https://sites.google.com/site/dxcompetition/.

Why are BNs and arithmetic circuits useful for electrical power system diagnostics? First, power systems exhibit multi-variate uncertainty, for example regarding component and sensor health (are they working or failing?) as well as noisy sensor readings. Second, there is substantial local structure, as reflected in an EPS schematic, that can be taken advantage of when constructing a BN automatically or semi-automatically [19, 20, 30]. Consequently, BN treewidth is small enough for exact computation using junction trees, variable elimination, or arithmetic circuits to be feasible [19, 30]. Third, we compile BNs into arithmetic circuits [4, 5], which are fast and predictable in addition to being exact. These are all important benefits in cyber-physical systems including electrical power systems.

The rest of this paper is structured as follows. In Section 2, we introduce BNs, parameter learning for incomplete data using EM, and related research. Section 3 presents our main application area, electrical power systems. In Section 4, we define the sharing concept, discuss sharing in EM for BN parameter learning, and provide analytical results. In Section 5 we present experimental results for parameter sharing in BNs when using EM, emphasizing electrical power system fault diagnosis using BNs. Finally, we conclude and outline future research opportunities in Section 6.

2   BACKGROUND

This section presents preliminaries including notation (see also Table 1).

    Notation          Explanation
    X                 BN nodes
    W                 BN edges
    θ                 BN CPTs
    θ̂                 estimated CPTs
    θ*                true CPTs
    O                 observable nodes
    H                 hidden nodes
    S                 (actually) shared nodes
    P                 (potentially) shared nodes
    U                 unshared nodes
    Y                 set partition of X
    TP                number of wrong CPTs
    E                 evidence nodes
    R                 non-evidence nodes
    tmin              min # of EM iterations
    tmax              max # of EM iterations
    t0                iteration # at EM convergence
    ε                 tolerance for EM
    err(θ̂)            error of θ̂ relative to θ*
    nA = |A|          cardinality of the set A
    ℓ                 likelihood
    ℓℓ                log-likelihood
    β = (X, W, θ)     Bayesian network (BN)
    E(Z)              expectation of r.v. Z
    V(Z)              variance of r.v. Z
    r                 Pearson's corr. coeff.
    θ ∈ [0, 1]        CPT parameter
    θ̂ ∈ [0, 1]        estimated CPT parameter
    θ* ∈ [0, 1]       true CPT parameter
    δ                 error bound for θ

Table 1: Notation used in this paper.

2.1   BAYESIAN NETWORKS

Consider a BN β = (X, W, θ), where X are discrete nodes, W are edges, and θ are CPT parameters. Let E ⊆ X be evidence nodes, and e the evidence. A BN factors a joint distribution Pr(X), enabling different probabilistic queries to be answered by efficient algorithms; they assume that the nodes E are clamped to values e. One query of interest is to compute a most probable explanation (MPE) over the remaining nodes R = X \ E, or MPE(e). Computation of marginals (or beliefs) amounts to inferring the posterior probabilities over one or more query nodes Q ⊆ R, specifically BEL(Q, e), where Q ∈ Q.

In this paper, we focus on situations where θ needs to be estimated but the BN structure (X and W) is known. Data is complete or incomplete; in other words, there may be hidden nodes H where H = X \ O and O are observed. A dataset is defined as (x1, ..., xm) with m samples (observations), where xi is a vector of instantiations of the nodes X in the complete data case. When the data is complete, the BN parameters θ are often estimated to maximize the data likelihood (MLE). In this paper, for a given dataset, a variable X ∈ X is either observable (X ∈ O) or hidden (X ∈ H); it is not hidden for just a strict subset of the samples.² Let nH > 0. For each hidden node H ∈ H there is then a "?" or "N/A" in each sample.

    ² In other words, a variable that is completely hidden in the training data is a latent variable. Consequently, its true distribution is not identifiable.
Learning from incomplete data also relies on a likelihood function, similar to the complete data case. However, for incomplete data several properties of the complete-data likelihood function (such as unimodality, a closed-form representation, and decomposition into a product form) are lost. As a consequence, the computational issues associated with BN parameter learning are more complex, as we now discuss.

2.2   TRADITIONAL EM

For the problem of optimizing such multi-dimensional, highly non-linear, and multimodal functions, several algorithms have been developed. They include EM, our focus in this paper. EM performs a type of hill-climbing in which an estimate, in the form of an expected likelihood function, is used in place of the true likelihood ℓ.

Specifically, we examine the EM approach to learning BN parameters θ from incomplete data sets.³ The traditional EM algorithm, without sharing, initializes the parameters to θ(0). Then, EM alternates between an E-step and an M-step. In the t-th E-step, using the parameters θ(t) and the observables from the dataset, EM generates the likelihood ℓ(t), taking into account the hidden nodes H. In the M-step, EM modifies the parameters to θ(t+1) to maximize the data likelihood. While |ℓ(t) − ℓ(t−1)| ≥ ε, where ε is a tolerance, EM repeats from the E-step.

    ³ Using EM, learning from complete data is a special case of learning from incomplete data.
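A minimal sketch of this loop in Python (illustrative only; the hypothetical callbacks e_step and m_step stand in for BN inference and parameter re-estimation, and this is not the LibDAI implementation used in Section 5):

    def em_sketch(theta0, data, e_step, m_step, eps=1e-3, t_min=3, t_max=100):
        # Traditional EM control flow: alternate E- and M-steps until the
        # log-likelihood changes by at most eps (Section 5 uses these defaults).
        theta, ll_prev = theta0, None
        for t in range(1, t_max + 1):
            stats, ll = e_step(theta, data)   # E-step: expected statistics and log-likelihood
            theta = m_step(stats)             # M-step: maximize the expected likelihood
            if ll_prev is not None and t >= t_min and abs(ll - ll_prev) <= eps:
                return theta, ll, t           # converged at iteration t0 = t
            ll_prev = ll
        return theta, ll, t_max               # hit the iteration cap (rare in practice)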
EM monotonically increases the likelihood function ℓ or the log-likelihood function ℓℓ, thus EM converges to a point θ̂ or a set of points (a region). Since ℓℓ is bounded, EM is guaranteed to converge. Typically, EM converges to a local maximum [37] at some iteration t0, bounded as follows: tmax ≥ t0 ≥ tmin. Due to the use of ε above, it is, for practical implementations of EM with restart, more precise to discuss regions of convergence, even when there is point-convergence in theory.

One topic that has been discussed is the initialization phase of the EM algorithm [8]. A second research topic is stochastic variants of EM, typically known as Stochastic EM [7, 11]. Generally, Stochastic EM is concerned with improving the computational efficiency of EM's E-step. Several other methods for increasing the efficacy of the EM algorithm for BNs exist. These include parameter constraints [1, 6, 25], parameter inequalities [26], exploiting domain knowledge [17, 24], and parameter sharing [13, 17, 25, 26].

When data is incomplete, the BN parameter estimation problem is in general non-identifiable. There may be several parameter estimates θ̂1, ..., θ̂m that have the same likelihood, given the dataset [36]. Thus, we need to be careful when applying standard asymptotic theory from statistics (which assumes identifiability) and when interpreting a learned model. Section 4.2 introduces an error measure that provides some insight regarding identifiability, since it measures distance from the true distribution θ*.

3   ELECTRICAL POWER SYSTEMS

Electrical Power Systems (EPSs) are critical in today's society; for instance, they are essential for the safe operation of aircraft and spacecraft. The terrestrial power grid's transition into a smart grid is also very important, and the emergence of electrical power in hybrid and all-electric cars is a striking trend in the automotive industry.

ADAPT (Advanced Diagnostics and Prognostics Testbed) is an EPS testbed developed at NASA [27]. Publicly available data from ADAPT is being used to develop, evaluate, and mature diagnosis and prognosis algorithms. The EPS functions of ADAPT are as follows. For power generation, it currently uses utility power with battery chargers (there are also plans to investigate solar power generation). For power storage, ADAPT contains three sets of 24 VDC 100 Amp-hr sealed lead acid batteries. Power distribution is aided by electromechanical relays, and there are two load banks with AC and DC outputs. For control and monitoring there are two National Instruments compact FieldPoint backplanes. Finally, there are sensors of several types, including for: voltage, current, temperature, light, and relay positions.

ADAPT has been used in different configurations and represented in several fault detection and diagnosis BNs [12, 19-21, 30-33], some of which are investigated in this paper (see Table 2). Each ADAPT BN node typically has two to five discrete states. BN nodes represent, for instance: sensors (measuring, for example, voltage or temperature); components (for example batteries, loads, or relays); or the system health of components and sensors (broadly, they can be in "healthy" or "faulty" states).

3.1   SHARED VERSUS UNSHARED

In BN instances where parameters are not shared, the CPT for a node in the BN is treated as separate from the other CPTs. This assumption is not always reasonable, however. ADAPT BNs may benefit from shared parameters, because there are typically several
repeated nodes or fragments in these BNs [21, 30]. It is reasonable to assume that "identical" power system sensors and components will behave in similar ways. More broadly, sub-networks of identical components should function in a similar way to other "identical" sub-networks in a power system, and such knowledge can be the basis for parameter sharing.

For parameter sharing as investigated in this paper, the CPTs of some nodes are assumed to be approximately equal to the CPTs of different nodes elsewhere in the BN.⁴ Data for one set of nodes can be used elsewhere in the BN if the corresponding nodes are shared during EM learning. This is a case of parameter sharing involving the global structure of the BN, where different CPTs are shared, as opposed to parameter sharing within a single CPT [13].

    ⁴ It is unrealistic to assume that several engineered physical objects, even when it is desired that they are exactly the same, in fact turn out to have exactly the same behavior. A similar argument has been made for object-oriented BNs [14], describing it as "violating the OO assumption." We thus say that CPTs are approximately equal rather than equal. Under sharing, however, we are making the simplifying assumption that shared CPTs are equal.

In some ADAPT BNs, one sub-network is essentially duplicated three times, reflecting triple redundancy in ADAPT's power storage and distribution network [27]. Such a six-node sub-network from an ADAPT BN is shown in Figure 1. This sub-network was, in this paper, manually selected for further study of sharing due to this duplication. The sub-network is part of the mini-ADAPT BN used in experiments in Section 5.3. In the mini-ADAPT sharing condition, only the node SB was shared between all three BN fragments. Generally, we define shared nodes S and unshared nodes U, with U = X \ S.

[Figure 1]
Figure 1: One of three similar sub-networks in the mini-ADAPT BN. Under parameter sharing, the SB node is shared between the three sub-networks.

3.2   OBSERVABLE VERSUS HIDDEN

Consider a complex engineered system, such as an electrical power system. After construction, but before it is put into production, it is typically tested extensively. The sensors used during testing lead to one set of observation nodes in the BN, OT. The sensors used during production lead to another set of observations in the BN, OP. For reasons including cost, fewer sensors are typically used during production than during testing, thus we assume OP ⊆ OT.

As an example, in Figure 1 we denote OP = {CB, S} as production observation nodes, and OT = {CB, S, HB, HS} as testing observation nodes.

In all our experiments, shared nodes are also hidden, or S ⊆ H. Typically, there are hidden nodes that are not necessarily shared, or S ⊂ H. This is, for example, the case for the mini-ADAPT BN as reflected in the sub-network in Figure 1.

4   EM WITH SHARING

4.1   SHARING IN BAYESIAN NETWORKS

Consider a BN β = (X, W, θ). A sharing set partition for the nodes X is a set partition Y of X with subsets Y1, ..., Yk, with Yi ⊆ X and k ≥ 1. For each Yi with k ≥ i ≥ 1, the nodes X ∈ Yi share a CPT during EM learning, as discussed in Section 4.2. We assume that the nodes in Yi have exactly the same number of states. The same applies to their respective parents in β, leading to each Y ∈ Yi having the same number of parent instantiations and exactly the same CPTs.

Traditional non-sharing is a special case of sharing in the following way. We assign each BN node to a separate subset, such that for X = {X1, ..., Xn} we have Y1, ..., Yn with Yi = {Xi}.
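As a small illustration of these definitions (hypothetical node names, not code from the paper), a sharing set partition can be represented as a list of node sets, with traditional non-sharing as the all-singletons special case:

    # Nodes of a BN (names are illustrative only).
    X = ["X1", "X2", "X3", "X4", "X5"]

    # A sharing set partition Y: X3, X4, and X5 share one CPT during EM
    # learning, while X1 and X2 keep individual CPTs (k = 3 subsets).
    Y_sharing = [{"X1"}, {"X2"}, {"X3", "X4", "X5"}]

    # Traditional non-sharing as the special case: one singleton per node.
    Y_no_sharing = [{x} for x in X]

    # Sanity check that Y is a set partition of X.
    assert set().union(*Y_sharing) == set(X)
    assert sum(len(s) for s in Y_sharing) == len(X)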
One key research goal is to better understand the behavior of EM as the sharing nodes S and the observable nodes O vary. We examine three cases: complete data OC (no hidden nodes); a sensor-rich testing setting with observations OT; and a sensor-poor production setting with observations OP. Understanding the impact of varying the observations O is important due to the cost and effort associated with observation or sensing.

4.2   SHARING EM

Similar to traditional EM (see Section 2.2), the sharing EM algorithm also takes as input a dataset and estimates a vector of BN parameters θ̂ by iteratively improving ℓℓ until convergence. The main difference of sharing EM compared to traditional EM is that we are now setting some nodes as shared S, according to Y.
To arrive at the sharing EM algorithm from the traditional EM algorithm, we modify the likelihood function to introduce parameter sharing and combine parameters (see also [13, 17]). When running sharing EM, the nodes X are treated as separate in the E-step. There is a slightly modified M-step, using Y, to aggregate the shared CPTs and parameters.⁵ This is the use of aggregate sufficient statistics, which considers sufficient statistics from more than one BN node [13].

    ⁵ For the relevant LibDAI source code, please see here: https://github.com/erikreed/HadoopBNEM/blob/master/src/emalg.cpp#L167. In words, it is a modification of the collection of sufficient statistics during the maximization step.
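The footnote above points to the authors' LibDAI modification; the following Python sketch shows one way such aggregation could look, under the assumption that the E-step produces expected-count arrays of identical shape for every node in a sharing subset (a schematic illustration, not the actual emalg.cpp code):

    import numpy as np

    def m_step_with_sharing(expected_counts, partition):
        # expected_counts: dict mapping node name -> array of expected counts,
        # indexed by (parent instantiation, node state); arrays for nodes in the
        # same sharing subset are assumed to have identical shapes (Section 4.1).
        theta_hat = {}
        for subset in partition:
            pooled = sum(expected_counts[node] for node in subset)  # aggregate statistics
            cpt = pooled / pooled.sum(axis=-1, keepdims=True)       # normalize per parent state
            for node in subset:
                theta_hat[node] = cpt                               # tied CPT for the subset
        return theta_hat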
Let θ̂i,j be the j-th estimated probability parameter for BN node Xi ∈ X. We define the error of θ̂ for the BN (X, W, θ̂) as the L2 distance from the true probability distribution θ* from which the data is sampled:

    err(θ̂) = sqrt( ∑i ∑j (θ*i,j − θ̂i,j)² ).                                      (1)

This error is the summation of the Euclidean distances between the true and estimated CPT parameters, or the L2 distance, providing an overall distance metric.

Why do we use Euclidean distance to measure error? One could, after all, argue that this distance metric is poor because it does not agree with likelihood. We use Euclidean distance because we are interested not only in the black-box performance of the BN, but also in its validity and understandability to a human expert.⁶ This is important when an expert needs to evaluate, validate, or refine a BN model, for example a BN for an electrical power system.

    ⁶ We assume that (1) is better than likelihood in this regard, if the original BN was manually constructed. The BNs experimented with in Section 5 were manually constructed, for example.
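A direct reading of (1) as a Python sketch, assuming the estimated and true CPTs are supplied as dictionaries of per-node parameter arrays with matching layout:

    import numpy as np

    def err(theta_hat, theta_true):
        # Equation (1): L2 distance between estimated and true CPT parameters,
        # summed over all nodes i and all CPT entries j.
        total = 0.0
        for node, true_cpt in theta_true.items():
            diff = np.asarray(true_cpt) - np.asarray(theta_hat[node])
            total += np.sum(diff ** 2)
        return float(np.sqrt(total))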
4.3   EM'S BEHAVIOR UNDER SHARING

We now provide a simple analysis of certain aspects of traditional EM (TEM) and sharing EM (SEM). For simplicity, we only consider EM runs that converge and exclude runs that time out.⁷

    ⁷ In practice, runs that time out are very rare with the parameter settings we use in the experiments.

For a node Xi ∈ X, TEM will converge to one among potentially several convergence regions. Suppose that the CPT of node Xi has γ(Xi) convergence regions. Then the actual number of convergence regions γ(βU) for a non-shared BN βU with nodes X = {X1, ..., Xn} is upper bounded by γ̄(βU) as follows:

    γ(βU) ≤ γ̄(βU) = ∏(i=1..n) γ(Xi).                                             (2)

Due to the sharing, SEM intuitively has fewer convergence regions than TEM. This is due to SEM's slightly modified M-step, which aggregates the shared CPTs and parameters. Consider a BN βS with exactly the same nodes and edges as βU, but with sharing, specifically with sharing set partitions Y1, ..., Yk and k < n. Without loss of generality, assume that Xi ∈ Yi. Then the actual number of convergence regions γ(βS) is upper bounded by γ̄(βS) as follows:

    γ(βS) ≤ γ̄(βS) = ∏(i=1..k) γ(Xi),                                             (3)

assuming that the γ(Xi) convergence regions used for Xi in (2) carry over to Yi.

A special case of (3) is when there exists exactly one Y' ∈ Y such that |Y'| ≥ 2, while for any Z ∈ Y \ Y' we have |Z| = 1. The experiments performed in Section 5 are all for this special case. Specifically, but without loss of generality, let Yi = {Xi} for 1 ≤ i < k and Yk = {Xk, ..., Xn} with nS = n − k + 1 (i.e., S = Yk has nS sharing nodes). It is illustrative to consider the ratio:

    γ̄(βU) / γ̄(βS) = ∏(i=1..n) γ(Xi) / ∏(i=1..k) γ(Xi) = ∏(i=k+1..n) γ(Xi).       (4)

Here, we assume that Xi has γ(Xi) convergence regions in both βU and βS, and take into account that for the shared nodes Yk = S, the CPTs are tied together.

The simple analysis above suggests a non-trivial impact of sharing, given the multiplicative effect of the γ(Xi)'s for k + 1 ≤ i ≤ n in (4). For example, if n = 5, every node has γ(Xi) = 3 convergence regions, and all five nodes are tied in a single sharing subset, the bound drops from 3^5 = 243 to 3, a reduction by the factor 3^4 = 81 given by (4). However, since upper bounds are the focus in this analysis, only a partial and conservative picture is painted. The experiments in Section 5 (see for example Figure 3, Figure 5, and Figure 6) provide further details.

4.4   ANALYSIS OF ERROR

We now consider the number of erroneous CPTs estimated by SEM when sharing is varied. Clearly, a CPT parameter is continuous, and its EM estimate is extremely unlikely to be exactly equal to the original parameter. Thus we consider here a discrete variable, based on forming an interval in the one-parameter case. Generally, let a discrete BN node X ∈ X have k states such that xi ∈ {x1, ..., xk}. Consider θxi|z = Pr(X = xi | Z = z) for a parent instantiation z. We now have an original CPT parameter θi ∈ {θx1|z, ..., θxk−1|z} and its EM estimate θ̂i ∈ {θ̂x1|z, ..., θ̂xk−1|z}. Let us jointly consider the original CPT parameter θ*i and its estimate θ̂i. If θ̂i ∈ [θ*i − δi, θ*i + δi] we count θ̂i as correct, and say θ̂i = θ*i; else it is incorrect or wrong, and we say θ̂i ≠ θ*i.
This analysis clearly carries over to multiple CPT parameters, parent instantiations, and BN nodes. It shows how we go from continuous estimated CPT parameters θ̂ to a discrete value (the number of wrong or incorrect CPT estimates), where the latter is used in this analysis.
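A minimal sketch of this discretization, with a hypothetical per-parameter tolerance delta and CPTs given as flat parameter sequences:

    def count_wrong_cpts(theta_hat, theta_true, delta):
        # TP: number of CPTs counted as wrong, where a CPT estimate is "correct"
        # only if every entry lies within +/- delta of the original parameter.
        wrong = 0
        for node, true_cpt in theta_true.items():
            est_cpt = theta_hat[node]
            if any(abs(e - t) > delta for e, t in zip(est_cpt, true_cpt)):
                wrong += 1
        return wrong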
Suppose that up to nP nodes can be shared. Further suppose that nS nodes are actually shared while nU nodes are unshared, with nU + nS = nP. Let TP be a random variable representing the total number of wrong or incorrect CPTs, TS the total for shared nodes, and TU the total for unshared nodes. Clearly, we have TP = TS + TU.

Let us first consider the expectation E(TP). Due to linearity, we have E(TP) = E(TS) + E(TU). In the non-shared case, assume for simplicity that errors are iid and follow a Bernoulli distribution, with probability p of error and (1 − p) = q of no error.⁸ This gives E(TU) = nU p, using the fact that a sum of Bernoulli random variables follows a Binomial distribution.⁹

In the shared case, all shared nodes either have an incorrect CPT θ̂ ≠ θ* or the correct CPT θ̂ = θ*. Assuming again probability p of error¹⁰ and (1 − p) = q of no error, and by using the definition of expectation of Binomials, we obtain E(TS) = nS p.

Substituting into E(TP) we get

    E(TP) = nU p + nS p = nP p.                                                   (5)

Let us next consider the variance V(TP). While variance in general is not linear, we assume linearity for simplicity, and obtain

    V(TP) = V(TU) + V(TS).                                                        (6)

In the non-shared case we have again a Binomial distribution, with well-known variance

    V(TU) = nU p(1 − p).                                                          (7)

In the shared case we use the definition of variance, put p1 = (1 − p), p2 = p, and µ = nS p (with outcomes x1 = 0 and x2 = nS), and obtain after some simple manipulations:

    V(TS) = ∑(i=1..2) pi (xi − µ)² = nS² p(1 − p),                                (8)

and by substituting (7) and (8) into (6) we get

    V(TP) = p(1 − p)((nP − nS) + nS²).                                            (9)

In words, (9) tells us that as the number nS of shared nodes increases at the expense of the number nU of unshared nodes, the variance due to non-shared nodes decreases linearly, but the variance due to sharing increases quadratically. The net effect shown in (9) is that the variance V(TP) of the error increases with the number of shared nodes, according to our analysis above. Expectation, on the other hand, remains constant (5) regardless of how many nodes are shared. These analytical results have empirical counterparts as discussed in Section 5; see for example the error sub-plot at the bottom of Figure 2.

    ⁸ This is a simplification, since our use of the Bernoulli assumes that each CPT is either "correct" or "incorrect." When learned from data, the estimated parameters are clearly almost never exactly correct, but close to or far from their respective original values.
    ⁹ If X is Binomial with parameters n and p, it is well-known that the expected value is E(X) = np.
    ¹⁰ The error probabilities of TS and TU are assumed to be the same as a simplifying assumption.
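Under the stated Bernoulli and independence assumptions, (5) and (9) are easy to check numerically; the following sketch (with hypothetical parameter values) treats each unshared CPT as wrong independently with probability p and all shared CPTs as wrong together with probability p:

    import numpy as np

    def simulate_tp(n_p, n_s, p, trials=100000, seed=0):
        # TP = TU + TS under the Section 4.4 assumptions: unshared CPTs are
        # wrong independently (Binomial), while all n_s shared CPTs are wrong
        # together (0 or n_s of them wrong at once).
        rng = np.random.default_rng(seed)
        t_u = rng.binomial(n_p - n_s, p, size=trials)
        t_s = n_s * rng.binomial(1, p, size=trials)
        t_p = t_u + t_s
        return t_p.mean(), t_p.var()

    # With n_P = 10 and p = 0.2: E(TP) = 2.0 for any n_S, while (9) gives
    # V(TP) = 0.16 * ((10 - n_S) + n_S^2), i.e., 1.6, 4.8, and 16.0 below.
    for n_s in (0, 5, 10):
        print(n_s, simulate_tp(10, n_s, 0.2))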
5   EXPERIMENTS

We now report on EM experiments for several different BNs, using varying degrees of sharing. We also vary the number of hidden nodes and the dataset size. We used ε = 1e-3 as the EM convergence criterion, meaning that EM stopped at iteration t0 when the ℓℓ-score changed by a value ≤ ε between iterations t0 − 1 and t0, for tmax ≥ t0 ≥ tmin. In these experiments, tmax = 100 and tmin = 3.

5.1   METHODS AND DATA

Bayesian networks. Table 2 presents the BNs used in the experiments.¹¹ Except for the BN Pigs, these BNs all represent (parts of) the ADAPT electrical power system (see Section 3). The BN Pigs has the largest number of nodes that can be shared (nP = 296), comprising 67% of the entire BN. The largest BN used, in terms of node count, edges, and total CPT size, is ADAPT T2.

    ¹¹ ADAPT BNs can be found here: http://works.bepress.com/ole_mengshoel/.

    NAME          |X|    |P|    |W|     CPT
    ADAPT T1      120     26    136    1504
    ADAPT T2      671    107    789   13281
    ADAPT P1      172     33    224    4182
    ADAPT P2      494     99    602   10973
    mini-ADAPT     18      3     15     108
    Pigs          441    296    592    8427

Table 2: Bayesian networks used in experiments. The |P| column presents the number of potentially shared nodes, with the actually shared nodes S ⊆ P. The CPT column denotes the total number of parameters in the conditional probability tables.

Datasets. Data for EM learning of parameters for these BNs were generated using forward sampling.¹² Each sample in a dataset is a vector x (see Section 2.1). The larger BNs were tested with increasing numbers of samples ranging from 25 to 400, while mini-ADAPT was tested with 25 to 2000 samples.

    ¹² Our experiments are limited in that we are only learning the parameters of BNs, using data generated from those BNs. Clearly, in most applications, data is not generated from a BN and the true distribution does not conform exactly to some BN structure. However, our analytical and experimental investigation of error would not have been possible without this simplifying assumption.
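Forward (ancestral) sampling instantiates nodes in topological order, each drawn from its CPT given the already-sampled parent values; a minimal sketch with a hypothetical BN representation (not the sampler actually used for these experiments) is:

    import numpy as np

    def forward_sample(nodes, parents, cpts, m, hidden=(), seed=0):
        # nodes: node names in topological order; parents[n]: tuple of parent names;
        # cpts[n]: numpy array indexed by the parents' states, whose last axis is a
        # distribution over the states of n. Hidden nodes are reported as "?"
        # in each sample, as described in Section 2.1.
        rng = np.random.default_rng(seed)
        samples = []
        for _ in range(m):
            x = {}
            for n in nodes:
                dist = cpts[n][tuple(x[p] for p in parents[n])]
                x[n] = int(rng.choice(len(dist), p=dist))
            samples.append({n: ("?" if n in hidden else v) for n, v in x.items()})
        return samples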
Sharing. Each BN has a different number of parameters that can be shared, where a set of nodes Yi ∈ Y with equal CPTs is deemed sharable. In most cases, there were multiple sets of nodes Y1, ..., Yk with |Yi| ≥ 2 for k ≥ i ≥ 1. When multiple sets were available, the largest set was selected for experimentation, as shown in Section 5.2's pseudo-code.
Metrics. After an EM trial converged to an estimate θ̂, we collected the following three metrics:

 1. the number of iterations t0 needed to converge to θ̂,

 2. the log-likelihood ℓℓ of θ̂, and

 3. the error: the distance between θ̂ and the original θ* (see (1) for the definition).

To provide reliable statistics on the mean and standard deviation, many randomly initialized EM trials were run.

Software. Among the available software implementations of the EM algorithm for BNs, we have based our work on LibDAI [23].¹³ LibDAI uses factor graphs for its internal representation of BNs, and has several BN inference algorithms implemented. During EM, the exact junction tree inference algorithm [16] was used, since it has performed well previously [19].

    ¹³ www.libdai.org

5.2   VARYING NUMBER OF SHARED NODES

Here we investigate how varying the number of shared nodes impacts EM. A set of hidden nodes H was created for each BN by selecting

    H = arg max(Yi ∈ Y) |Yi|,

where Y is a sharing set partition for the BN nodes X (see Section 4.1). In other words, each experimental BN had its largest set of shareable nodes hidden, giving nH = 12 nodes for ADAPT T1, nH = 66 nodes for ADAPT T2, nH = 32 nodes for ADAPT P1, and nH = 145 nodes for Pigs.

The following gradual sharing method is used to vary sharing. Given a fixed set of hidden nodes H and an initially empty set of shared nodes S:

 1. Randomly add ΔnS ≥ 1 hidden nodes that are not yet shared to the set of shared nodes S. Since we only have a single sharing set, this means moving ΔnS nodes from the set H \ S to the set S.

 2. Perform m sharing EM trials in this configuration, and record the three metrics for each trial.

 3. Repeat until all hidden nodes are shared; that is, S = H.

Using the gradual sharing method above, BN nodes were picked as hidden and then gradually shared. When increasing the number of shared nodes, the new set of shared nodes was a superset of the previous set, and a certain number of EM trials was performed for each set.
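A sketch of the gradual sharing method above; sharing_em_trial is a hypothetical stand-in for one randomly initialized sharing EM run that returns the three metrics:

    import random

    def gradual_sharing(hidden, delta_ns, trials_per_setting, sharing_em_trial, seed=0):
        # hidden: the fixed set H of hidden nodes (e.g., the largest shareable
        # subset, H = argmax |Yi|). Shared nodes S start empty and grow by
        # delta_ns randomly chosen hidden nodes per step until S = H.
        random.seed(seed)
        shared, results = set(), []
        while shared != set(hidden):
            candidates = [h for h in hidden if h not in shared]
            shared |= set(random.sample(candidates, min(delta_ns, len(candidates))))
            for _ in range(trials_per_setting):
                ll, error, iterations = sharing_em_trial(frozenset(shared))
                results.append((len(shared), ll, error, iterations))
        return results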
                                                           each set.
                                                  ˆ
 1. number of iterations t0 needed to converge to ✓,
                                                           5.2.1   One Network
                         ˆ and
 2. log-likelihood `` of ✓,
                                                           For ADAPT T2, m = 200 samples were generated and
 3. error: distance between ✓ˆ and the original ✓ ⇤ (see   nH = 66 nodes were hidden. We used, in the gradual
    (1) for the definition).                               sharing method, nS = 4 from nS = 2 to nS = 66
                                                           (every hidden node was eventually set as shared).
To provide reliable statistics on mean and standard
deviation, many randomly initialized EM trials were
run.
Software. Among the available software implementa-
tions of the EM algorithm for BNs, we have based our
work on LibDAI [23].13 LibDAI uses factor graphs for
its internal representation of BNs, and has several BN
inference algorithms implemented. During EM, the
exact junction tree inference algorithm [16] was used,
since it has performed well previously [19].

5.2      VARYING NUMBER OF SHARED
         NODES

Here we investigate how varying the number of shared
nodes impacts EM. A set of hidden nodes H was cre-
ated for each BN by selecting

                   H = arg max |Y i |,
                          Y i 2Y

where Y is a sharing set partition for BN nodes X
(see Section 4.1). In other words, each experimental       Figure 2: The number of iterations (top), log-
BN had its largest set of shareable nodes hidden, giv-     likelihood or `` (middle), and error (bottom) for a
ing nH = 12 nodes for ADAPT T1, nH = 66 nodes              varying number of shared nodes nS (along the x-axis)
for ADAPT T2, nH = 32 nodes for ADAPT P1, and              for the BN ADAPT T2. Here, nH = 66 nodes are hid-
nH = 145 nodes for Pigs.                                   den. Shared nodes are a random subset of the hidden
  13
       www.libdai.org                                      nodes, so nS  nH .

                                                     54
Figure 2 summarizes the results from this experiment. Here, nS is varied along the x-axis while the y-axis shows statistics for different metrics in each of the three sub-plots. For example, in the top plot of Figure 2, each marker is the mean number of iterations (µ), and the error bar is ± one standard deviation (σ). The main trends¹⁴ in this figure are: parameter sharing increased the mean number of iterations required for EM and slowly decreased the mean ℓℓ. Increasing the number of shared nodes resulted in a corresponding increase in the standard deviation of the number of iterations, ℓℓ, and error of the BN ADAPT T2. For the standard deviation of error, this is in line with our analysis in Section 4.4.

    ¹⁴ We say "main trends" because the curves for the metrics mean number of iterations and ℓℓ are in fact reversing their respective trends and dropping close to the maximum of 66 shared nodes.

5.2.2   Multiple Networks

We now investigate how varying the number of shared nodes impacts EM for several BNs, specifically the correlation between the number of parameters shared and the mean µ and standard deviation σ for our three metrics. To measure correlation, we use Pearson's sample correlation coefficient r:

    r(X, Y) = ∑(i=1..m) (xi − x̄)(yi − ȳ) / ((m − 1) sx sy),                      (10)

where x̄ and ȳ are the sample means of two random variables X and Y, and sx and sy are the sample standard deviations of X and Y, respectively. Here, (10) measures the correlation between m samples from X and Y.¹⁵

    ¹⁵ In this case, X is the independent variable, specifically the number of shared nodes. We treat the metric Y as a function of X. When X is highly correlated with Y, this is expressed in r through extreme (positive or negative) correlation values.

The number of samples, m, refers to the number of (X, Y) sharing samples we have for this correlation analysis (and not the number of samples used to learn BN parameters). In all these experiments, we do a trial for a number of shared nodes, giving several (X, Y) pairs. Consequently, each number of shared nodes tested would be an X, and the metric measured would be a Y. For example, if we use nS ∈ {2, 4, 6, 8, 10} shared nodes, then m = 5.

We now tie r(X, Y) in (10) to the r(µ) and r(σ) used in Table 3. In Table 3, µ and σ show the mean and standard deviation, respectively, for a metric. Thus, r(µ) is the correlation between the number of shared nodes and the mean of a metric, while r(σ) is the correlation between the number of shared nodes and the standard deviation of that metric.
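For example, an r(µ) entry of Table 3 can be computed as in the following sketch (the metric values are hypothetical; numpy's corrcoef implements the same Pearson coefficient as (10)):

    import numpy as np

    # Hypothetical example: the numbers of shared nodes tested (X in (10)) and
    # the corresponding per-setting means of a metric over 50 EM trials each (Y).
    n_shared = np.array([2, 4, 6, 8, 10])              # m = 5 sharing samples
    metric_mean = np.array([14.1, 15.0, 15.8, 16.3, 17.2])

    r_mu = np.corrcoef(n_shared, metric_mean)[0, 1]    # Pearson's r, as in (10)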
Figure 2 helps in understanding exactly what is being correlated, as µ and σ for all three metrics are shown for the BN ADAPT T2. In the top plot, r(µ) is the correlation between the number of shared nodes (x-axis) and the mean number of iterations (y-axis). In other words, each mean yi = µi is for a batch of 50 trials of EM. The mean ȳ used in Pearson's r is, in this case, a mean of means, namely the mean over the 50-EM-trial means over different numbers of shared nodes.

                               ADAPT T1
             Iterations          Likelihood           Error
     Size    r(µ)     r(σ)      r(µ)     r(σ)      r(µ)      r(σ)
      25    -0.937   -0.135     0.997    0.882    -0.068     0.888
      50    -0.983   -0.765     0.992    0.941     0.383     0.988
     100    -0.825    0.702     0.494    0.941     0.786     0.985
     200     0.544    0.926    -0.352    0.892     0.517     0.939
     400     0.810    0.814    -0.206    0.963     0.863     0.908

                               ADAPT T2
             Iterations          Likelihood           Error
     Size    r(µ)     r(σ)      r(µ)     r(σ)      r(µ)      r(σ)
      25     0.585    0.852     0.749    0.842     0.276     0.997
      50     0.811    0.709     0.377    0.764     0.332     0.981
     100     0.772    0.722    -0.465    0.855    -0.387     0.992
     200     0.837    0.693    -0.668    0.788    -0.141     0.989
     400     0.935    0.677    -0.680    0.784    -0.0951    0.987

                               ADAPT P1
             Iterations          Likelihood           Error
     Size    r(µ)     r(σ)      r(µ)     r(σ)      r(µ)      r(σ)
      25    -0.769   -0.0818   -0.926    0.278    -0.107     0.774
      50    -0.864   -0.0549   -0.939    0.218    -0.331     0.188
     100    -0.741    0.436    -0.871    0.678    -0.458     0.457
     200    -0.768    0.408    -0.879    0.677    -0.196     0.700
     400    -0.665    0.560    -0.862    0.681     0.00444   0.653

                                 PIGS
             Iterations          Likelihood           Error
     Size    r(µ)     r(σ)      r(µ)     r(σ)      r(µ)      r(σ)
      25     0.954    0.763     0.970    0.614     0.965     0.887
      50     0.966    0.956     0.137    0.908     0.740     0.869
     100    -0.922   -0.0167   -0.994   -0.800    -0.893     0.343
     200    -0.827   -0.394    -0.985   -0.746    -0.577     0.0157
     400    -0.884   -0.680    -0.942   -0.699    -0.339     0.285

Table 3: Values for Pearson's r for the correlation between the number of shared nodes and statistics for these metrics: number of iterations, log-likelihood (ℓℓ), and error. r(µ) gives the correlation between the number of shared nodes and the mean of the metric, while r(σ) gives the correlation for the standard deviation.

Table 3 summarizes the experimental results for four BNs. In this table, a positive correlation implies that parameter sharing increased the corresponding metric statistic. For example, the highest correlation between the number of shared nodes and mean likelihood is for ADAPT T1 at 25 samples, where r(µ) = 0.997. This suggests that increasing the number of shared nodes was highly correlated with an increase in the likelihood of EM. Negative coefficients show that increasing the number of shared nodes resulted in a decrease of the corresponding metric statistic.

A prominent trend in Table 3 is the consistently positive correlation between the number of shared nodes nS and the standard deviation of error, r(σ), for all 4 BNs. This is in line with the analytical result involving nS in (9).
The number of samples was shown to have a significant impact on these correlations. The Pigs network showed a highly correlated increase in the mean number of iterations for 25 and 50 samples. However, for 100, 200, and 400 samples there was a decrease in the mean number of iterations. The opposite behavior is observed in ADAPT T1, where fewer samples resulted in better performance for parameter sharing (reducing the mean number of iterations), while for 200 and 400 samples we found that parameter sharing increased the mean number of iterations. Further experimentation and analysis may improve the understanding of the interaction between sharing and the number of samples.

5.3   CONVERGENCE REGIONS

5.3.1   Small Bayesian Networks

First, we will show how sharing influences EM parameter interactions for the mini-ADAPT BN shown in Figure 1 and demonstrate how shared parameters jointly converge.

Earlier we introduced OP as observable nodes in a production system and OT as observable nodes in a testing system. Complementing OP, the hidden nodes are HP = {HB, HS, VB, SB}. Complementing OT, the hidden nodes are HT = {VB, SB}. When a node is hidden, the EM algorithm will converge to one among its potentially many convergence regions. For OP, EM had much less observed data to work with than for OT (see Figure 1). For OP, the health breaker node HB was, for instance, not observed or even connected to any nodes that were observed. In contrast, OT was designed to allow better observation of the components' behaviors, and HB ∈ OT. From mini-ADAPT, 500 samples were generated. Depending on the observable set used, either nH = |HT| = 2 or nH = |HP| = 4 nodes were hidden, and 600 random EM trials were executed with and without sharing.

                                NO SHARING
    Observable        Error               Likelihood            Iterations
                    µ        σ          µ          σ           µ       σ
    OC           16.325     (-)      -3.047e4     (-)         (-)     (-)
    OP            48.38    3.74      -2.009e4    43.94       16.39   4.98
    OT            33.25    7.45      -2.492e4     1190        8.65   1.99

                                 SHARING
    Observable        Error               Likelihood            Iterations
                    µ        σ          µ          σ           µ       σ
    OC           16.323     (-)      -3.047e4     (-)         (-)     (-)
    OP            48.56    3.92      -2.010e4    62.87       15.98   4.95
    OT            34.12    14.3      -2.629e4     2630        6.59   2.48

Table 4: Comparison of No Sharing (top) versus Sharing (bottom) for different observable node sets OC, OP, and OT during 600 EM trials for mini-ADAPT.

Table 4 shows results, in terms of means µ and standard deviations σ, for these EM trials. For OP, with nH = 4, the means µ of the metrics error, likelihood, and number of iterations showed minor differences when parameter sharing was introduced. The largest change due to sharing was an increase in the σ of likelihood. For OT, where nH = 2, the differences were greater. The µ of likelihood for sharing was lower, with over a 2x increase in σ. The µ for error showed only a minor change, but nearly a 2x increase in σ. This is consistent with our analysis in Section 4.4.

[Figure 3: panels (a) OP – No Sharing, (b) OP – Sharing, (c) OT – No Sharing, (d) OT – Sharing]
Figure 3: The progress of different random EM trials for the mini-ADAPT BN. Both the degree of sharing (No Sharing versus Sharing) and the number of observable nodes (OP versus OT) are varied.

Figure 3a and Figure 3c show how the log-likelihood ℓℓ (x-axis) and error (y-axis) changed during 15 EM trials for OP and OT, respectively. These EM trials were selected randomly among the trials reported on in Table 4. Parameter sharing is introduced in Figure 3b and Figure 3d. For OP, the progress of the EM trials is similar for sharing (Figure 3b) and non-sharing (Figure 3a), although for a few trials in the sharing condition the error is more extreme (and mostly smaller!). This is also displayed in Table 4, where the difference in number of iterations, error, and likelihood was minor (relative to OT). On the other hand, there is a clear difference in the regions of convergence for OT when parameter sharing is introduced, consistent with the analysis in Section 4.3. Figure 3d shows how the EM trials typically followed a path heading to optima far above or far below the mean error, with two of the EM trials plotted converging in the middle region of the error.
With parameter sharing, there was over a 2x increase in σ. The µ for error demonstrated only a minor change, but nearly a 2x increase in σ. This is consistent with our analysis in Section 4.4.

Figure 3a and Figure 3c show how log-likelihood (ℓℓ, x-axis) and error (y-axis) changed during 15 EM trials for O_P and O_T, respectively. These EM trials were selected randomly among the trials reported on in Table 4. Parameter sharing is introduced in Figure 3b and Figure 3d. For O_P, the progress of the EM trials is similar for sharing (Figure 3b) and non-sharing (Figure 3a), although for a few trials in the sharing condition the error is more extreme (and mostly smaller!). This is also displayed in Table 4, where the differences in number of iterations, error, and likelihood were minor (relative to O_T). On the other hand, there is a clear difference in the regions of convergence for O_T when parameter sharing is introduced, consistent with the analysis in Section 4.3. Figure 3d shows how the EM trials typically followed a path heading to optima far above or far below the mean error, with two of the plotted EM trials converging in the middle region of the error.

Figure 4: Four 20-bin histograms, for mini-ADAPT, of the error values of 600 randomly initialized EM trials at convergence. Both the degree of sharing (No Sharing versus Sharing) and the number of observable nodes (O_P versus O_T) are varied: (a) O_P – No Sharing, (b) O_P – Sharing, (c) O_T – No Sharing, (d) O_T – Sharing.
Histograms for the 600 EM trials used in Table 4 are shown in Figure 4. The 20-bin histograms show the error err(θ̂) at convergence. The O_P and O_T sets are shown without parameter sharing in Figure 4a and Figure 4c, respectively. Parameter sharing is introduced in Figure 4b and Figure 4d. There is an increase in error due to parameter sharing for O_T. When comparing Figure 4c (No Sharing) and Figure 4d (Sharing), we notice different regions of error for EM convergence due to sharing. Figure 4c appears to show four main error regions, with the middle two being greatest in frequency, while Figure 4d appears to show three regions of error, with the outer two being most frequent. The outer two regions in Figure 4d are further apart than their non-sharing counterparts, showing that parameter sharing yielded a larger range of error for O_T; see Section 4.4.
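The error metric used in these histograms, err(θ̂), is the Euclidean distance between the estimated and true CPT entries. A minimal helper is sketched below; the list-of-arrays representation of a network's parameters is an illustrative assumption, not the data structure used in the paper.

    import numpy as np

    def err(cpts_hat, cpts_true):
        """Euclidean distance between estimated and true BN parameters.

        Both arguments are lists of numpy arrays (one CPT per node) given in the
        same node order; all CPT entries are concatenated into one long vector.
        """
        flat_hat = np.concatenate([np.asarray(c, dtype=float).ravel() for c in cpts_hat])
        flat_true = np.concatenate([np.asarray(c, dtype=float).ravel() for c in cpts_true])
        return float(np.linalg.norm(flat_hat - flat_true))

    # Toy usage: two binary nodes, the second conditioned on the first.
    theta_true = [np.array([0.3, 0.7]), np.array([[0.2, 0.8], [0.9, 0.1]])]
    theta_hat  = [np.array([0.4, 0.6]), np.array([[0.25, 0.75], [0.85, 0.15]])]
    print(err(theta_hat, theta_true))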
5.3.2 Large Bayesian Networks

Next, large BNs are used to investigate the effects of parameter sharing, using a varying number of shared nodes. The larger ADAPT networks and Pigs were run with 50 EM trials¹⁶ for each configuration of observable nodes, number of samples, and number of shared nodes. Some of the results are reported here.

¹⁶ The decrease in the number of EM trials performed, relative to mini-ADAPT, was due to the substantial increase in CPU time required (days to weeks).
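In these experiments, parameter sharing constrains a group of CPTs to be identical; with EM, such tying is typically enforced in the M-step by pooling the expected counts of all CPTs in a shared group before normalizing. The sketch below illustrates that pooling step; the node names and the dict-based count representation are assumptions made for illustration, not the paper's implementation.

    import numpy as np

    def m_step_shared(expected_counts, shared_groups):
        """M-step with parameter tying.

        expected_counts: dict mapping node name -> expected-count array of shape
            (num_parent_configs, num_states), accumulated in the E-step.
        shared_groups: list of lists of node names whose CPTs are tied; nodes in
            the same group are assumed to have identically shaped count tables.
        Returns a dict of CPTs whose rows sum to 1.
        """
        cpts = {}
        tied = {n for group in shared_groups for n in group}
        # Shared nodes: pool counts over the whole group, normalize once,
        # and give every node in the group the same estimate.
        for group in shared_groups:
            pooled = sum(expected_counts[n] for n in group)
            cpt = pooled / pooled.sum(axis=1, keepdims=True)
            for n in group:
                cpts[n] = cpt
        # Unshared nodes: normalize their own counts.
        for n, counts in expected_counts.items():
            if n not in tied:
                cpts[n] = counts / counts.sum(axis=1, keepdims=True)
        return cpts

    # Toy usage: two tied breaker nodes and one untied node.
    counts = {
        "breaker1": np.array([[8.0, 2.0], [1.0, 9.0]]),
        "breaker2": np.array([[6.0, 4.0], [2.0, 8.0]]),
        "sensor":   np.array([[5.0, 5.0]]),
    }
    print(m_step_shared(counts, shared_groups=[["breaker1", "breaker2"]]))

Because every node in a group receives the same estimate, a single deviation in the pooled estimate is repeated across all tied CPTs, which is one way the number of shared nodes can influence the spread of err(θ̂).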
Figure 5a shows results for ADAPT T2 without parameter sharing during 15 EM trials using n_H = 66 hidden nodes and 200 samples. Figure 5b shows a substantial change in error when the n_H = 66 hidden nodes were shared. The range of the error for EM is much larger in Figure 5b, while the upper and lower error curves have a symmetric quality. Four regions for the converged error are visible in Figure 5b, with the inner two terminating at a lower ℓℓ than the outer two. The lowest error region of Figure 5b is also lower than the lowest error of Figure 5a, while retaining a similar ℓℓ.

Figure 5: The progress of 15 random EM trials for ADAPT T2: (a) No Sharing, (b) Sharing. The No Sharing condition (a) shows more locally optimal convergence regions than Sharing (b), where there appear to be only four locally optimal convergence regions.
Figure 6 uses a smaller ADAPT BN, containing 172 nodes instead of 671 nodes (see Table 2). Here, n_H = 33 nodes were hidden and 200 samples were used. In several respects, the results are similar to those obtained for ADAPT T2 and mini-ADAPT. However, Figure 6a shows that EM terminates on different likelihoods, which is not observed in Figure 5a. The error also appears to fluctuate more in Figure 6a, whereas the error changes the most during later iterations in Figure 5a. Figure 6b applies parameter sharing to the n_H = 33 hidden nodes. A symmetric effect is visible between high and low error, reflecting the analysis in Section 4. Of the 15 trials shown in Figure 6b, two attained ℓℓ > −1.2e4, while the rest converged at ℓℓ ≈ −1.21e4. Additionally, the ℓℓs of these two trials were greater than any of the non-sharing ℓℓs shown in Figure 6a.

Figure 6: The progress of random EM trials for ADAPT P1: (a) No Sharing, (b) Sharing. While the number of EM trials is the same for both conditions, the No Sharing condition (a) clearly shows more local optima than Sharing (b).

6 CONCLUSION

Bayesian networks have proven themselves as very suitable for electrical power system diagnostics [19, 20, 30–33]. By compiling Bayesian networks to arithmetic circuits [4, 5], a broad range of discrete and continuous faults can be handled in a computationally efficient and predictable manner. This approach has resulted in award-winning performance on public data from ADAPT, an electrical power system at NASA [30].

The goal of this paper is to investigate the effect of parameter sharing in Bayesian networks on EM's behavior. We emphasize electrical power systems as an application, and in particular examine EM for ADAPT Bayesian networks. In these networks, there is considerable opportunity for parameter sharing.

Our results suggest complex interactions between the degree of parameter sharing, the number of hidden nodes, and the dataset size in their impact on EM performance, specifically likelihood, error, and the number of iterations required for convergence. One main point, which we investigated both analytically and empirically, is how parameter sharing impacts the error associated with EM's parameter estimates. In particular, we have found analytically that the error variance increases with the number of shared parameters. Experiments with several BNs, mostly for fault diagnosis of electrical power systems, are in line with the analysis. The good news is that, in the sharing case, the error is sometimes smaller.

Further theoretical research to better understand parameter sharing is required. Since parameter sharing was demonstrated to perform poorly in certain cases, further investigation appears promising. Parameter sharing sometimes reduced the number of EM iterations required for parameter learning, while at other times it increased them. Improving the understanding of the joint impact of parameter sharing and the number of samples on the number of EM iterations would be useful, for example. Finally, it would be interesting to investigate the connection to object-oriented and relational BNs in future work.

References

[1] E.E. Altendorf, A.C. Restificar, and T.G. Dietterich. Learning from sparse data by exploiting monotonicity constraints. In Proceedings of UAI, volume 5, 2005.

[2] A. Basak, I. Brinster, X. Ma, and O. J. Mengshoel. Accelerating Bayesian network parameter learning using Hadoop and MapReduce. In Proc. of BigMine-12, Beijing, China, August 2012.

[3] P.S. Bradley, U. Fayyad, and C. Reina. Scaling EM (Expectation-Maximization) clustering to large databases. Microsoft Research Report, MSR-TR-98-35, 1998.

[4] M. Chavira and A. Darwiche. Compiling Bayesian networks with local structure. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), pages 1306–1312, 2005.

[5] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280–305, 2003.

[6] C.P. de Campos and Q. Ji. Improving Bayesian network parameter learning using constraints. In Proc. 19th International Conference on Pattern Recognition (ICPR), pages 1–4. IEEE, 2008.

[7] B. Delyon, M. Lavielle, and E. Moulines. Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, 27:94–128, 1999.

[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[9] R. Greiner, X. Su, B. Shen, and W. Zhou. Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. Machine Learning, 59(3):297–322, 2005.

[10] S.H. Jacobson and E. Yucesan. Global optimization performance measures for generalized hill climbing algorithms. Journal of Global Optimization, 29(2):173–190, 2004.

[11] W. Jank. The EM algorithm, its randomized implementation and global optimization: Some challenges and opportunities for operations research. In F. B. Alt, M. C. Fu, and B. L. Golden, editors, Perspectives in Operations Research: Papers in Honor of Saul Gass' 80th Birthday. Springer, 2006.

[12] W. B. Knox and O. J. Mengshoel. Diagnosis and reconfiguration using Bayesian networks: An electrical power system case study. In Proc. of the IJCAI-09 Workshop on Self-* and Autonomous Systems (SAS): Reasoning and Integration Challenges, pages 67–74, 2009.

[13] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. The MIT Press, 2009.

[14] H. Langseth and O. Bangsø. Parameter learning in object-oriented Bayesian networks. Annals of Mathematics and Artificial Intelligence, 32(1):221–243, 2001.

[15] S.L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics & Data Analysis, 19(2):191–201, 1995.

[16] S.L. Lauritzen and D.J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), pages 157–224, 1988.

[17] W. Liao and Q. Ji. Learning Bayesian network parameters under incomplete data with domain knowledge. Pattern Recognition, 42(11):3046–3056, 2009.

[18] G.J. McLachlan and D. Peel. Finite mixture models. Wiley, 2000.

[19] O. J. Mengshoel, M. Chavira, K. Cascio, S. Poll, A. Darwiche, and S. Uckun. Probabilistic model-based diagnosis: An electrical power system case study. IEEE Trans. on Systems, Man, and Cybernetics, 40(5):874–885, 2010.

[20] O. J. Mengshoel, A. Darwiche, K. Cascio, M. Chavira, S. Poll, and S. Uckun. Diagnosing faults in electrical power systems of spacecraft and aircraft. In Proceedings of the Twentieth Innovative Applications of Artificial Intelligence Conference (IAAI-08), pages 1699–1705, Chicago, IL, 2008.

[21] O. J. Mengshoel, S. Poll, and T. Kurtoglu. Developing large-scale Bayesian networks by composition: Fault diagnosis of electrical power systems in aircraft and spacecraft. In Proc. of the IJCAI-09 Workshop on Self-* and Autonomous Systems (SAS): Reasoning and Integration Challenges, pages 59–66, 2009.

[22] O.J. Mengshoel, D.C. Wilkins, and D. Roth. Initialization and restart in stochastic local search: Computing a most probable explanation in Bayesian networks. IEEE Transactions on Knowledge and Data Engineering, 23(2):235–247, 2011.

[23] J.M. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. Journal of Machine Learning Research, 11:2169–2173, August 2010.

[24] S. Natarajan, P. Tadepalli, E. Altendorf, T.G. Dietterich, A. Fern, and A.C. Restificar. Learning first-order probabilistic models with combining rules. In ICML, pages 609–616, 2005.

[25] R.S. Niculescu, T.M. Mitchell, and R.B. Rao. Bayesian network learning with parameter constraints. The Journal of Machine Learning Research, 7:1357–1383, 2006.

[26] R.S. Niculescu, T.M. Mitchell, and R.B. Rao. A theoretical framework for learning Bayesian networks with parameter inequality constraints. In Proc. of the 20th International Joint Conference on Artificial Intelligence, pages 155–160, 2007.

[27] S. Poll, A. Patterson-Hine, J. Camisa, D. Garcia, D. Hall, C. Lee, O.J. Mengshoel, C. Neukom, D. Nishikawa, J. Ossenfort, A. Sweet, S. Yentus, I. Roychoudhury, M. Daigle, G. Biswas, and X. Koutsoukos. Advanced diagnostics and prognostics testbed. In Proc. of the 18th International Workshop on Principles of Diagnosis (DX-07), pages 178–185, 2007.

[28] M. Ramoni and P. Sebastiani. Robust learning with missing data. Machine Learning, 45(2):147–170, 2001.

[29] E. Reed and O.J. Mengshoel. Scaling Bayesian network parameter learning with expectation maximization using MapReduce. In Proc. of the Big Learning Workshop on Neural Information Processing Systems (NIPS-12), 2012.

[30] B. Ricks and O. J. Mengshoel. Diagnosis for uncertain, dynamic and hybrid domains using Bayesian networks and arithmetic circuits. International Journal of Approximate Reasoning, 55(5):1207–1234, 2014.

[31] B. W. Ricks, C. Harrison, and O. J. Mengshoel. Integrating probabilistic reasoning and statistical quality control techniques for fault diagnosis in hybrid domains. In Proc. of the Annual Conference of the Prognostics and Health Management Society 2011 (PHM-11), Montreal, Canada, 2011.

[32] B. W. Ricks and O. J. Mengshoel. Methods for probabilistic fault diagnosis: An electrical power system case study. In Proc. of the Annual Conference of the PHM Society 2009 (PHM-09), San Diego, CA, 2009.

[33] B. W. Ricks and O. J. Mengshoel. Diagnosing intermittent and persistent faults using static Bayesian networks. In Proc. of the 21st International Workshop on Principles of Diagnosis (DX-10), Portland, OR, 2010.

[34] A. Saluja, P.K. Sundararajan, and O.J. Mengshoel. Age-Layered Expectation Maximization for parameter learning in Bayesian networks. In Proceedings of Artificial Intelligence and Statistics (AISTATS), La Palma, Canary Islands, 2012.

[35] B. Thiesson, C. Meek, and D. Heckerman. Accelerating EM for large databases. Machine Learning, 45(3):279–299, 2001.

[36] S. Watanabe. Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13(4):899–933, 2001.

[37] C.F. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983.

[38] Z. Zhang, B.T. Dai, and A.K.H. Tung. Estimating local optimums in EM algorithm over Gaussian mixture model. In Proc. of the 25th International Conference on Machine Learning, pages 1240–1247. ACM, 2008.
