Semantics for Reducing Complexity and Improving Accuracy in Model Creation Using Bayesian Network Decision Tools Oscar Kipersztok Boeing Research & Technology P.O.Box 3707, MC: 4C-77 Seattle, WA 98124 oscar.kipersztok@boeing.com Abstract likelihood, impact and timing of events and trends (Kipersztok, 2004). The work presented simplifies and makes accessible the process of using advanced probabilistic models to reason DecAid is aimed at strategic decision making where the about complex scenarios without the need for advanced risk of making the wrong decision can be very costly and training. More specifically, it greatly simplifies the effort where there is need for argumentative rigor and careful involved in building Bayesian Networks for making documentation of ideas, associations and assumptions probabilistic predictions in complex domains. These leading to the final decision. The modeling methodology methods typically require trained users with a was created to enable domain experts to create Bayesian sophisticated understanding of how to build and use these networks (BN) without having to familiarize with the networks to predict future events. It entails the creation of theory of graphical probabilistic networks or the practice simplified semantics that keeps the complexity of the of how to build them. Such users may not also require the methodology transparent to users. We provide more involvement of a knowledge engineer. At the levels where precise semantics to the definition of concept variables in high impact decisions are made, requiring high-level of the domain model, as well as using those semantics to abstraction and dealing with large number of variables assign more precise and robust meaning to predicted and interdependencies, it is less likely that decision outcomes. This work is presented in the context of a tool makers will use advanced decision analytic tools and methodology, called DecAid, where complex requiring learning specialized methodology to define and cognitive models are created by defining domain-specific represent complex domain knowledge. The overall goals concepts using free language and defining relations and and requirements identified for the development of the causal weights between them. In response to a user query DecAid tool were described in (Kipersztok, 2007). the DecAid, unconstrained, directed graph is converted In a world of rapid change it is incresingly challenging to into a Bayesian network to enable predictions of events stay abreast of occurring events and trends, making it and trends. more difficult to process information without the use of advanced technology tools designed to manage 1 INTRODUCTION complexity and large volumes of information. Furthermore, strategic decision makers recognize the need DecAid is a hypothesis-driven decision support tool that for argumentative explanations to strategic decisions that facilitates complex strategic decisions with features that capture the hypothetical reasoning and the evidential allow for easy, fast, knowledge capture and modeling in context behind each decision. For these reasons the need complex domains. It identifies the key variables relevant arises to rely on advanced methods to gather, organize, to a specific query. While the cognitive, unconstrained, process and analyze data and knowledge. model is built, the defined concepts are used to create a probabilistic model to forecast events and trends. Bayesian networks practitioners recognize the need to Similarly, the free-language used to define and label the make the technology more accessible to end users due to concepts is used to generate a document search classifier the challenges presented during the model creation to retrieve evidence for validation of hypotheses raised by process. Some of the most significant challenges that the predictive model. DecAid’s   goal   is   to   predict DecAid aims to address are: 1) the complexity in eliciting expert knowledge, 2) defining a, potentially, large number 41 of parameters and relations in a particular domain, 3) remain unchanged, or decrease. Various levels of adhering to conditional independence constraint in the granularity can be selected to define the trend concept definition of causal variables, and 4) requiring to avoid states. feedback reasoning during model creation that may result In this section we describe the formal definitions that in graphs with cycles. enable the creation of a DN and its subsequent conversion The first challenge has been addressed by various into a BN. software packages (e.g., Netica, GeNIe, Hugin, etc.) that enable users to build BN with user-friendly interfaces 2.1 Definition of a DecAid Network (DN) equipped with knowledge elicitation tools. Learning Similar to a Bayesian network, each DecAid variable algorithms have also provided the means for automated (DV) represents a concept, which is some aspect of the construction of BN structures and their parameters from domain modeled. More specifically, a DV defines a data. To address the second challenge, canonical probability distribution over its possible values and it is structures have been defined that reduce the number of discrete—i.e., finite-valued and typically taking 2, 3, 5, or parameters needed to construct conditional probability 7 values. For example, we might have a DV named tables (CPT). (Farry et al, 2008) review several canonical ‘Barometric   Pressure’   that   has   3   values:   ‘decreasing’,   models, including Influence Networks (Rose and Smith, ‘unchanged’,  and   ‘increasing’.   The set of values is taken 1996), Noisy-OR, Noisy-MAX, Qualitative Probabilistic to have some natural ordering so that we can speak of Networks (QPN) and Causal Influence Models (CIM). high values versus low values. If the variable is binary, They, in particular, emphasize usability of CIM models we would say that values such as false / off / does-not- where the causal influence of each parent is captured by a occur  would  be  “low”  compared  to  true  /  on  /  occurs. single number and the combined influence of all parents is the mean of the individual parent values. (Pfautz et al, More formally, a DecAid model M includes a set V of 2007) address the first three challenges and describe DVs and, taken together, the variables in V jointly additional ones in findings from in-depth analyses of their describe a distribution over the entire scenario modeled experience in facilitation of model construction from by M. Along with the set V, the model M includes a numerous projects. directed graph structure G connecting the variables of V. Each variable in V is a node of G and each arc denotes a The purpose of this work is to describe formal semantics direct  probabilistic  influence  of  the  parent’s  value  on  the   that enable DecAid to be directly accessible to domain distribution  over  the  child’s  values.  The  directed  graph  G experts to create BN models without having to concern is unconstrained—all connections are allowed and cycles themselves with these challenges. These semantics are are permitted. Each arc is labeled with a single real aimed at easing the constraints imposed by the number  between  −1  and  1  called  the   weight. Intuitively, aforementioned challenges by enabling users to define the closer |w| is to 1, the stronger the influence of the concepts and their relations in free-association mode. parent over the child and the closer |w| is to 0, the weaker Concepts are defined and labeled using free language and the influence. If the weight is positive, a high parent a single numerical weight is assigned to each parent-child value makes high child values more likely and a low relation. This effort results in the creation of the DecAid parent value make low child values more likely. A (unconstrained) network (DN), a directed graph, which negative weight flips the influence so that a high parent allows cycles. The step of creating a BN from the DN value makes low child values more likely and a low starts with a query definition, and it involves the parent value makes high child values more likely (other identification of the query-specific sub graph and removal things being equal). Note that a moderate parent value of its cycles by, optimally, minimizing the information will make moderate child values more likely. loss. The result is BN directed acyclic graph specific to the query. 2 FROM DECAID NETWORKS TO BAYESIAN NETWORKS DecAid is a system for simple but powerful probabilistic modeling of arbitrary scenarios. It enables domain expert to create DecAid networks by defining concepts with free language and causal relations between them. For each pair of relations, the user assigns a weight of causal belief. There are two types of concepts: a) Event concepts that Figure 2.1: A DecAid Unconstrained Model represent quantities that can occur or not-occur; and b) Trend concepts that represent quantities that increase, 42 Figure 2.1 shows an example of an unconstrained DN What follows is a description of the method used to representing three concept variables in the Aviation express the random variable (RV) encoded by a DV. That Safety domain and five parent-child relations with their is, we show how to calculate a conditional probability corresponding weights. table (CPT) for each variable in the DecAid model given its parent set and the size of each variable. Once, the unconstrained model is built, DecAid is capable of transforming the DN into a BN in order to make predictions in response to queries. 3.1 Concepts Defined as Random Variables Let X be an n-valued DV from a DecAid model D. We say that the sample space S for X is the real interval [0,1). 2.2 Transforming a DecAid Network (DN) That is, we can suppose that X describes an experiment into a Bayesian Network structure whose outcome is a real number r such   that   0   ≤   r < 1. The values of the random variable X break the sample A user can make a query to the DN by defining a set of space into n disjoint events—namely, half-open intervals observation variables and a target variable. In response of equal length. The set of events is thus: to the query, DecAid is capable of transforming the unconstrained (directed graph) model to a Bayesian { r  [k/n , (k+1)/n)    :  for  0  ≤  k < n } network by carrying out the following sequence of steps: Example (3.1.1) 1) Identifying all cycles in the unconstrained model. We If X has 2 states, the events corresponding to the states of use an algorithm by (Johnson, 1975) that finds the X are: elementary cycles in the directed graph by improving over the original algorithm by (Tarjan, 1973); { r  [0.0, 0.5) , r  [0.5,1.0) }. 2) Eliminating the cycles in the unconstrained model by Example (3.1.2) removing the weak edges. This is done, optimally, in If X has 5 states, the events corresponding to the states of order to minimize the information loss in the X are: unconstrained model. This step constitutes a tradeoff between increased expressive power for domain-expert { r  [0.0, 0.2), r  [0.2, 0.4), r  [0.4, 0.6), r  [0.6, users and modest information loss resulting from removal 0.8), and r  [0.8, 1.0) }. of edges that least contribute to the information flow. 3) Identifying the sub graph relevant to the query by 3.2 Conditional Probability Tables pruning the non relevant variables from the resulting The heart of the probabilistic semantics is the definition Bayesian network (Geiger et al, 1990). This step of local conditional probability distributions for DecAid constitutes an important feature of DecAid in that it can variables. We consider the various cases below: a) where list all the relevant parameters to the user that are relevant the variable has no parents, b) where it has one parent of to a specific user query. weight 1, c) where it has one parent of arbitrary weight, The last step in the creation of a query specific Bayesian and finally, d) where it has any number of parents. network is the creation of the conditional probability Case 3.2.1 -Variables without parents tables (CPT). The semantics to achieve that are described in section 3. If X has no parents in D, then it is simply given a uniform distribution: For practitioners involved in high-level, strategic, decision making the use of Bayesian network building P(X = xk) = 1/n for  0  ≤  k < n . tools can be counterintuitive and may require significant That is, the event X = xk corresponds to r  [k/n, (k+1)/n). training time, unavailable to such intended users. Making, The probability equals the proportion of the total length of however, the BN technology accessible through tools like S contributed by X=xk. Since the total length of S is 1.0, DecAid not only will improve the accuracy of decision it is simply equal to the length of the interval, which is making but will also provide the means to document and (k+1  −  k)/n = 1/n. track the chain of causal reasoning behind each decision. Example (3.2.1.1) If X has 2 states, P(X = xk)  =  0.5    for  0  ≤  k ≤  1. 3 SEMANTICS TO CREATE Example (3.2.1.2) CODITIONAL PROBABILITY TABLES If X has 5 states, P(X = xk)  =  0.2  for  0  ≤  k ≤  4. Case 3.2.2 – Variables with one parent and |w| = 1 43 We first describe the case where we have a single parent P(x | y) x0 x1 x2 x3 x4 Y and where the link from Y to its child X has weight 1. We need to show how to calculate the conditional y0 0.4 0.4 0.2 0 0 probability P(X = xk | Y = yj). This is given by the y1 0 0 0.2 0.4 0.4 formula: P(X = xk | Y = yj , w = 1) = P(X = xk & Y = yj) / P(Y = yj) . Case 3.2.3 – Variables with one parent and |w| < 1 That is, the conditional probability of the event X = xk given that Y = yj is equal to the intersection of the We next look at the case where the weight is different intervals corresponding to these events divided by the than 1. It is useful to refer to the distribution defined in length of the interval corresponding to Y = yj. Case 2a as the full-weight distribution—i.e., where w=1. Example (3.2.2.1) Let Pfull(X | yj ) be the distribution over the values of X given Y = yj under the assumption that the arc from Y to X Suppose YoX and Y has 5 states and X has 2 states, has weight w = 1. Let U(X) be the uniform distribution P(X = x0 | Y = y2) = | Intersection of [0, 0.5) & [0.4, 0.6) | / over the values of X.  Then,  if  the  weight  is  0  ≤  w < 1, we | 0.6 – 0.4 |= 0.5 have The full CPT would be: P(X | yj ,  0  ≤  w < 1 ) = w·Pfull(X | yj ) + (1 – w)·U(X) P(x | y) x0 x1 That is, the final distribution is a weighted combination of the distribution calculated in Case 3.2.1 and the uniform y0 1 0 distribution—which is the default distribution if there y1 1 0 were no parent. Note that the weight acts as the probability that we get the full-weight distribution instead y2 0.5 0.5 of a uniform distribution. y3 0 1 y4 0 1 Example (3.2.3.1) Following the previous example (II.2.3), suppose YoX and Y has 2 states and X has 5 states. But now suppose Example (3.2.2.2) that the weight of the arc is w = 0.6, then we have Suppose ZoX and Z has 3 states and X has 5 states, P(X = x0 | Y = y0) = w·Pfull(X | y0 ) + (1 – w)·U(X) P(X = x0 | Z = z0) = | Intersection of [0, 0.2) & [0, 0.33) | / | = 0.6·0.4 + (1.0 – 0.6)·(1/5) 0.33 – 0 | = 0.6 = 0.24 + 0.4·0.2 = 0.3 + .08 = 0.32 The full CPT is: The full CPT is: P(x | z) x0 x1 x2 x3 x4 P(x | y) x0 x1 x2 x3 x4 z0 0.6 0.4 0 0 0 z1 0 0.2 0.6 0.2 0 y0 0.32 0.32 0.2 0.08 0.08 z2 0 0 0 0.4 0.6 y1 0.08 0.08 0.2 0.32 0.32 Example (3.2.2.3) If   the   weight   is   negative,   the   direction   of   the   parent’s   influence is reversed. If Y is an m-valued variable, we can Suppose YoX and Y has 2 states and X has 5 states, calculate the resulting distribution using a similar P(X = x0 | Y = y0) = | Intersection of [0, 0.2) & [0, 0.5) | / | calculation   above   but   for   the   “opposed”   value   of   the   0.5 – 0 | = 0.2 / 0.5 = 0.4 parent.    By  “opposed”  we  mean  the value at the other side of the range—i.e., highest is opposed to lowest, second- highest is opposed to second-lowest, etc. More The full CPT is: specifically, if the weight w < 0, we have 44 P(X | yj , –1  ≤  w < 0) = w·Pfull(X | ym-j-1 ) + (1 – w)·U(X) = c·[ 0.03, 0.03, 0.02, 0.03, 0.04 ] = [0.2, 0.2, 0.13, 0.2, 0.27] Example (3.2.3.2) The full CPT is: Following the previous example (II.2.3), suppose YoX P(x | y, z) x0 x1 x2 x3 x4 and Y has 2 states and X has 5 states. But now suppose that the weight of the arc is w = –0.6, then we have y0 , z0 0.2 0.2 0.13 0.2 0.27 P(X = x2 | Y = y1) = 0.6·0.2 + (1.0 – 0.6)·(1/5) = 0.12 + y0 , z1 0.15 0.3 0.4 0.1 0.05 0.4·0.2 = 0.12 + .08 = 0.2 y0 , z2 0.48 0.36 0.08 0.04 0.04 The full CPT is: y1 , z0 0.04 0.04 0.08 0.36 0.48 P(x | y) x0 x1 x2 x3 x4 y1 , z1 0.05 0.1 0.4 0.3 0.15 y0 0.08 0.08 0.2 0.32 0.32 y1 , z2 0.27 0.2 0.13 0.2 0.2 y1 0.32 0.32 0.2 0.08 0.08 Case 3.2.4 – Variables with multiple parents 4 DISCUSSION DecAid is used for strategic decision making. Here are a The remaining situation is when we have a variable with few examples of such decisions: a) when to launch a new multiple parents. In this situation, we assume that the product into a specific market, b) how close is a rouge influence of each parent is independent of the influence of country to achieving nuclear weapon capability, or c) other parents. So, if X has parents Y1, Y2,  …,  YN, we set whether to invest in a particular emerging technology. P(X | Y1, Y2,  …,  YN) = c·P(X | Y1) ·P(X | Y2)  ·…·P(X  |  YN) These are decisions that involve several variables and their inter relations. The system enables decision makers where c is normalization constant to make the distribution to define concepts of the problem in a simple, intuitive, sum to 1. manner using free language. As the user defines the concepts and relations, the system is creating an unconstrained model. Once, the model is built, DecAid is Example (3.2.4.1) capable of making predictions in response to queries by Suppose X has 5 states and two parents: Y with 2 states converting the unconstrained model into a Bayesian and weight 0.5 and Z with 3 states and weight –0.5. As network. we saw above from examples (3.2.2.1) and (3.2.2.2), if we ignore the weights of the arcs and the fact that there are multiple parents, we have for parent Z: Pfull(X | z0) = [ 0.6, 0.4, 0.0, 0.0, 0.0 ] And for parent Y: Pfull(X | y0) = [ 0.4, 0.4, 0.2, 0.0, 0.0 ] Next, adding in the effect of the weights on the Figure 4.2: Predictions made by DecAid Model distributions, we get: Figure 4.2 shows two such predictions derived from the P(X | z0, w= –.5) = [ 0.1, 0.1, 0.1, 0.3, 0.4 ] model in Figure 2.1. The first prediction forecasts a 0.85 probability  that  “Public  concern”  will  increase  given  that   and “Occurrence   of   accidents”   has   increased   and   P(X | y0, w= .5) = [ 0.3, 0.3, 0.2, 0.1, 0.1 ] “Government   oversight”   does   not   occur.   The   second   prediction lowers   the   forecast   that   “Public   concern”   will   Now, combining both parents we get increase  to  a  0.53  probability,  if  “Government  oversight”   P(X | y0 , z0) = c·P(X | y0)·P(X | z0) occurs. = c·[ 0.3, 0.3, 0.2, 0.1, 0.1 ]·[ 0.1, 0.1, 0.1, 0.3, At the final stage of making decisions, summarization and 0.4 ] argumentation becomes critical steps. It is the aim of DecAid to facilitate the capture of knowledge and 45 information for that last stage, as well, by combining the probability is defined as the size of the intersection of predictive analytic capability obtained from the cognitive parent and child intervals divided by the size of parent models with the ability to retrieve evidential data and interval. 3) The magnitude of the weight is used as the information to validated predictive hypotheses, which is probability that you get the full-weight conditional outside the scope of this paper. distribution instead of a uniform distribution. 4) The sign of the weight is used to reverse the direction of influence. The complete probabilistic semantics of a DecAid model And 5) the probabilistic influence of multiple parents on a include how the local probability models are combined child are assumed to be independent of one another. (not discussed here). The cornerstone of the semantics, however, is the definition given here for the complete The semantics described in this paper enable the creation CPT of a local variable from the simple numeric weights of a Bayesian networks from an unconstrained, directed associated with its parents as provided by the end-user graph model created by a user within a simpler, more creating the model. intuitive, framework implemented in a tool called DecAid, without requiring specialized training in how to build DecAid variables represent a tradeoff between simplicity Bayesian networks. of model definition and expressive power. Aside from adding temporal modeling (Nodelman, et. al. 2002, 2003) to DecAid, there are additional areas where the balance Acknowledgement between simplicity and expressivity could be further enhanced. In one such area there is, currently, complete This work would have not been possible without the symmetry between the positive effect of a parent taking contributions and valuable, long-term, collaboration with on a high value and the negative effect of a parent taking Uri Nodelman for which the author and The Boeing on a low value. Sometimes this symmetry is warranted Company are greatly thankful. but sometimes it is not. For example, consider a child variable   “Strength of a   Fire”   (“fire”)   with   a parent References “Oxygen   Level   Present”   (“oxygen”).   Increasing   oxygen   Farry M, Pfautz J, Cox Z, Bisantz A, Stone R, and Roth E will tend to increase the fire and decreasing oxygen will (2008),  “An  Experimental  Procedure  for  Evaluating  User- tend to lessen the fire. But now consider an alternative Centered methods for Rapid Bayesian Network parent  “Use  of  fire-extinguisher”.    If  the  fire-extinguisher Construction”,  Proceedings of the 6th Bayesian Modeling is used, that will tend to lessen the fire. But lack of fire- Applications Workshop at the 24th Annual Conference on extinguisher use does not, in itself, increase the fire. So Uncertainty in AI: UAI 2008, Helsinki, Finland. we may, in general, want to allow an asymmetry between the impact of a high-value parent and a low-value parent Geiger, Verma, and Pearl, (1990), "Identifying where the high-value has the regular effect but the low- Independence in Bayesian Networks", Networks 20:507- value has no special impact on the child. 534. Furthermore, the assumption that the effects of multiple Johnson, B. (1975),  “Finding  all  the  elementary  circuits  in   parents are independent of each other is strong. a  directed  graph”,  SIAM Journal of Computing, 4(1). Obviously, there are many cases where this assumption is unwarranted. The problem would be to find a simple, Kipersztok, O (2007). “Using Human Factors, Reasoning understandable way for end-users to convey extra and Text Processing for Hypothesis Validation.” Third information about covariance and to find an algorithm International Workshop on Knowledge and Reasoning for that could examine the link structure in other parts of the Answering Questions. International Joint Conference in DecAid model and extract some useful information about Artificial Intelligence, Hydrabad, India. the dependencies among the parents. Kipersztok O (2004). “Combining Cognitive Causal 5. SUMMARY Models with Reasoning and Text Processing Methods for Decision Support.” Second Bayesian Modeling The semantics definition is given in this paper for the Applications Workshop at the Uncertainty in AI complete CPT of a local variable from simple numeric Conference, Banff Canada. weights associated with its parents as provided by the end-user creating a DecAid model. Explicitly, 1) The Nodelman, U., Shelton, C. R., and Koller, D. (2002). values of a DV are represented as equal-length “Continuous Time Bayesian Networks.” Proceedings of subintervals of the unit interval and making explicit that the Eighteenth Conference on Uncertainty in Artificial they have a natural ordering so they can be seen as Intelligence, pp. 378-387, 2002. coming in opposed pairs (except for a possible middle- most value). 2) A single parent full-weight conditional 46 Nodelman, U., Shelton, C.R., and Koller, D. (2003). “Learning Continuous Time Bayesian Networks.” Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, pp. 451-458, 2003. Pfautz J, Cox Z, Catto G, Koelle D, Campolongo J, Roth E  (2007),”User-Centered Methods for Rapid Creation and Validation of Bayesian Networks”.  In Proceedings of 5th Bayesian Applications Workshop at Uncertainty in Artificial Intelligence (UAI 07). Rosen J and Smith W (1996), "Influence Net Modeling with Causal Strengths: An Evolutionary Approach," In the Proceedings of the 1996 Command and Control Research and Technology Symposium, Monterey CA. Tarjan, R. E. (1972), "Depth-first search and linear graph algorithms", SIAM Journal on Computing 1 (2): 146–160. Tarjan, R.E. (1973), “Enumeration of the elementary circuits of a directed graph”, SIAM Journal of Computing 2 :211-216. . 47