The computational power of dynamic bayesian networks Joshua Brulé Department of Computer Science University of Maryland, College Park jbrule@cs.umd.edu Abstract. This paper considers the computational power of constant size, dynamic Bayesian networks. Although discrete dynamic Bayesian networks are no more powerful than hidden Markov models, dynamic Bayesian networks with continuous random variables and discrete chil- dren of continuous parents are capable of performing Turing-complete computation. With modified versions of existing algorithms for belief propagation, such a simulation can be carried out in real time. This re- sult suggests that dynamic Bayesian networks may be more powerful than previously considered. Relationships to causal models and neural networks are also discussed. 1 Introduction Bayesian networks are probabilistic graphical models that represent a set of ran- dom variables and their conditional dependencies via a directed acyclic graph. Explicitly modeling the conditional dependencies between random variables per- mit efficient algorithms to perform inference and learning in the network. Causal Bayesian networks have the additional requirement that all edges in the network model a causal relationship. Dynamic Bayesian networks are the time-generalization of Bayesian networks and relate variables to each other over adjacent time steps. Dynamic Bayesian networks unify and extend a number of state-space models including hidden Markov models, hierarchical hidden Markov models and Kalman filters. Dynamic Bayesian networks can also be seen as the natural extension of acyclic causal models to models that permit cyclic causal relationships, while avoiding problems with causal models that try to model temporal relationships with an atemporal description [1]. A natural question is, ‘What is the expressive power of such networks?’ The result in this paper shows that although discrete dynamic Bayesian networks are sub-Turing in computational power, introducing continuous random vari- ables with discrete children is sufficient to model Turing-complete computation. In particular, there exists a mapping between binary strings (representing a bi- nary stack) into a single continuous random variable, with operations on the stack modeled by appropriate conditional probability densities in the dynamic Bayesian network. The distributions used in the construction are such that the marginal posterior probabilities of random variables in the network can be ef- fectively computed with modified versions of existing algorithms. Ignoring the overhead from arbitrary precision arithmetic, the simulation can be conducted with only a constant time penalty. 2 The Model and Main Results A Bayesian network consists of a directed-acyclic graph, G over a set V = {V1 , . . . , Vn } of vertices and a probability distribution P (v) over the set of vari- ables that correspond to the vertices in G [2]. A Bayesian network “factorizes” the probability distribution over its variables, by requiring that each variable, vi , is conditionally independent of its non-descendants, given its parents (denoted pa(vi )). This is the Markov condition [3]: Y P (x1 , . . . , xn ) = P (xi |pai ) (1) i Dynamic Bayesian networks (DBN) extend Bayesian networks to model a probability distribution over a semi-infinite collection of random variables, with each collection of random variables modeling the system at a point in time [4]. Following the conventions in [5], the collections are denoted Z1 , Z2 , . . . and variables are partitioned Zt = (Ut , Xt , Yt ) to represent input, hidden and output variables of a state space model. Such a network is “dynamic” in the sense that it can model a dynamic system, not that the network topology changes over time. A DBN is defined as a pair (B1 , B→ ), where B1 is a Bayesian network that defines the prior P (Z1 ) and B→ is a two-slice temporal Bayes net (2TBN) that defines P (Zt |Zt−1 ) via a directed acyclic graph: N Y P (Zt |Zt−1 ) = P (Zti |pa(Zti )) (2) i=1 where Zti is the ith node at time t, and pa(Zti ) are the parents of Zti in the graph. The parents of a node can either be in the same time slice or in the previous time slice (i.e. the model is first-order Markov). The semantics of a DBN can be defined by “unrolling” the 2TBN until there are T time-slices; the joint distribution is then given by: T Y Y N P (Z1:T ) = P (ZTi |pa(Zti )) (3) t=1 i=1 Analyzing the computational power of a DBN requires defining what it means for a DBN to accept (and halt) or reject an input. Define an input sequence, {Ut } of Bernoulli random variables to model the binary input. Similarly, define an output sequence {Yt } (Yt ∈ {run, halt0 , halt1 }) to represent whether the machine has halted and the answer that it gives. Given an input, in1 , in2 , . . . , int , to a decision problem, the machine modeled by the DBN has halted and accepted at time t, if and only if P (Yt = halt1 |U1 = in1 , . . . , Un = int ) > 0.5 and halted and rejected if and only if P (Yt = halt0 |U1 = in1 , . . . , Un = int ) > 0.5. 2.1 Discrete Dynamic Bayesian Networks Are Not Turing-complete “Discrete” Bayesian networks are Bayesian networks where all random variables have some finite number of outcomes, i.e. Bernoulli or categorical random vari- ables. If dynamic Bayesian networks are permitted to increase the number of random variables in the network over time, then simulating a Turing-machine becomes trivial: simply add a new variable each time step to model a newly reachable cell on the Turing machine’s tape. However, this requires some ‘first- order’ features in the language used to specify the network and the computational effort required at each step of the simulation will grow without bound. With a fixed number of random variables at each time step and the property that DBNs are first-order Markov, the computational effort per step remains constant. However, discrete DBNs have sub-Turing computational power. Intu- itively, a discrete DBN cannot possibly simulate a Turing machine since there is no way to store the contents of the machine’s tape. More formally, any discrete Bayesian network can be converted into a hidden Markov model [5]. This is done by ‘collapsing’ the hidden variables (Xt ) of the DBN into a single random variable by taking the Cartesian product of their sample space. The ‘collapsed’ DBN models a probability distribution over a exponentially larger, but still finite sample space. Hidden Markov models are equivalent to probabilistic finite automata [6] which recognize the stochastic languages. Stochastic languages are in the RP-complexity class and thus discrete DBNs are not Turing complete. 2.2 A Dynamic Bayesian Network with Continuous and Discrete Variables A 2TBN can be constructed to simulate the transitions of a two stack push-down automaton (PDA), which is equivalent to the standard one tape Turing machine. A two stack PDA consists of a finite control, two unbounded binary stacks and an input tape. At each step of computation, the machine reads and advances the input tape, reads the top element of each stack and can either push a new element, pop the top element or leave each stack unchanged. The state of the control can change as function of previous state and the read symbols. When the control reaches one of two possible halt states ({halt0 , halt1 }), the machine stops and its output to the decision problem it was computing is defined which of the halt states it stops on. A key part of the construction is using a Dirac distribution to simulate a stack. A Dirac distribution centered at µ can be defined as the limit of normal distributions: 1 −x2 δ(µ) ≡ lim √ e 2σ2 (4) σ↓0 σ 2π A single Dirac distributed random variable is sufficient to simulate a stack. The stack construction adapted from [7] encodes a binary string ω = ω1 ω2 . . . ωn into the number: n X 2ωi + 1 q= (5) i=1 4i Note that if the string begins with the value 1, then q has a value of at least 3/4 and if the string begins with 0, then q is less than 1/2 - there is never a need to distinguish among two very close numbers to read the most significant digit. In addition, the empty string is encoded as q = 0, but any non-empty string has value at least 1/4. All random variables, except for the stack random variables, are categorically distributed - thus, the conditional probabilities densities between them can be represented using standard conditional probability tables. Extracting the top value from a stack requires a conditional probability distri- bution for a Bernoulli random variable (T op ∈ {0, 1}), given a Dirac (Stack ∈ R) distributed parent. The Heavyside step function meets this requirement and can be defined as the limit of logistic functions (or, more generally, softmax func- tions), centered at 1/2: 1 H(x) ≡ lim (6) k→∞ 1 + e−k(x−1/2) The linear operation 4q − 2 transfers the range of q to at least 1 when the top element of the stack is 1 and no more than 0 when the top element of the stack is 0. Then, the conditional probability density function: P (T op|Stack = q) = H(4q − 2) (7) yields P (T op) = 1 whenever the top element of the stack is 1 and P (T op) = 0 whenever the top element of the stack is 0. Similarly, a conditional probability distribution can be defined for Bernoulli random variable Empty ∈ {0, 1}, as: P (Empty|Stack = q) = 1 − H(4q) (8) to check if a stack is empty. Finally, the linear operations 4q + 2b+1 4 and 4q − (2b + 1) push and pop b, respectively, from a stack. The conditional probability density for a stack at time t + 1, given a stack at time t, the top of the stack at time t, and action to be performed on the stack (Actiont ∈ {push0 , push1 , pop, noop}) is fully described as follows: P (Stackt+1 |T opt = p, Stackt = q, Actiont = push0 ) = δ(q/4 + 1/4) P (Stackt+1 |T opt = p, Stackt = q, Actiont = push1 ) = δ(q/4 + 3/4) P (Stackt+1 |T opt = p, Stackt = q, Actiont = pop) = δ(4q − (2p + 1)) P (Stackt+1 |T opt = p, Stackt = q, Actiont = noop) = δ(q) (9) Since there are two stacks in the full construction, they are labeled, at time t, as Stacka,t and Stackb,t . The rest of the construction is straightforward. Statet , Actiona and Actionb are functions of Statet−1 , T opa,t , Emptya,t , T opb,t , Emptyb,t and int . Since all of these are discrete random variables, the conditional prob- ability densities is simply the transition function of the PDA, written as a (0, 1) stochastic matrix. As expected P (Y = halti |State) = 1 if State is that halt state, and 0 otherwise. Finally, the priors for the dynamic Bayesian network are simply P (Stacka,1 ) = P (Stackb,1 ) = δ(0), P (State1 = q0 ) = 1, where q0 is the initial state. As described, this construction is somewhat of an abuse of the term ‘proba- bilistic graphical model’ - all probability mass is concentrated into a single event for every random variable in the system, for every time step. However, it is easy to see this construction faithfully simulates a two stack machine, as each random variable in the construction corresponds exactly to a component of the simulated automaton. 2.3 Exact Inference in Continuous-discrete Bayesian Networks This construction requires continuous random variables, which raise concerns as to whether the marginal posterior probabilities can be effectively computed. The original junction tree algorithm [8] and cut-set conditioning [9] approaches to belief propagation compute exact marginals for arbitrary DAGs, but require discrete random variables. Lauritzen’s algorithm [10] conducts inference in mixed graphical models, but is limited to conditional linear Gaussian (CLG) continuous random variables. In a CLG model, let X be a continuous node, A be its discrete parents, and Y1 , . . . , Yk be continuous parents. Then k X p(X|a, y) = N (wa,0 + wa,i yi ; σa2 ) (10) i=1 Lauritzen’s algorithm can only conduct approximate inference, since the true posterior marginals may be some multimodal mix of Gaussians, while the algo- rithm itself only supports CLG random variables. However, the algorithm is exact in the sense that it computes exact first and second moments for the posterior marginals which is sufficient for the Turing machine simulation. Laurientz’s algorithm does not permit discrete random variables to be chil- dren of continuous random variables. Lerner’s algorithm [11] extends Lauritzen’s algorithm to support softmax conditional probability densities for discrete chil- dren of continuous parents. Let A be a discrete node with the possible values a1 , . . . , am and let Y1 , . . . , Yk be its parents. Then: Pn exp(bi + l=1 wli yl ) P (A = ai |y1 , . . . , yk ) = Pm j Pn i (11) j=1 exp(b + l=1 wl yl ) Like Lauritzen’s algorithm, Lerner’s algorithm computes approximate poste- rior marginals - relying on the observation that the product of a softmax and a Gaussian is approximately Gaussian - but exact first and second moments, up to errors in the numerical integration used to compute the best Gaussian approximation of the product of a Gaussian and a softmax. This calculation is actually simpler in the case where the softmax is replaced with a Heavyside and the Lerner algorithm can run essentially unmodified with a mixture of Heavyside and softmax conditional probability densities. In the case of Dirac-distributed parents, with Heavyside conditional probability densities, numeric integration is unnecessary and no errors are introduced in computing the first and second moments of the posterior distribution. Any non-zero variance for the continuous variables will ‘leak’ probability to other values for the ‘stack’ random variables in the Turing machine simulation, eventually leading to errors. Lauritzen’s original algorithm assumes positive- definite covariance matrices for the continuous random variables, but can be extend to handle degenerate Gaussians [12]. In summary: posterior marginals for the Turing machine simulation can be computed exactly, using a modified version of the Lerner algorithm when restricted to Dirac distributed continuous random variables with Heavside conditional probability densities. If Gaussian random variables and softmax conditional probability densities are also intro- duced, then the first and second moments of the posterior marginals can be computed ‘exactly’, up to errors in numerical integration, although this will slowly degrade the quality of the Turing machine simulation in later time steps. Inference in Bayesian networks is NP-hard [13]. However, assuming that arithmetic operations can be computed in unit time over arbitrary-precision numbers (e.g. the real RAM model), the work necessary at each time step is constant. Thus, dynamic Bayesian networks can simulate Turing-machines with only a constant time overhead in the real RAM model, and slowdown propor- tional to the time complexity of arbitrary precision arithmetic otherwise. 3 Discussion This result suggests that causal Bayesian networks may be a richer language for modeling causality than currently appreciated. Halpern [14] suggests that for general causal reasoning, a richer language, including some-first order features may be needed. First-order features will likely be useful for causal modeling in practice, but the Turing-complete power of dynamic Bayesian networks suggests that first-order features may be unnecessary. This result for dynamic Bayesian networks is analogous to Siegelmann and Sontag’s proof that a recurrent neural network can simulate a Turing machine in real time [7]. In fact, it turns out that neural networks and Bayesian networks have very similar expressive power: – Single perceptron ≈ Gaussian naive Bayes (Logistic regression) [15] – Multilayer perceptron ≈ Full Bayesian network (Universal function approx- imation) [16] [17] – Recurrent neural network ≈ Dynamic Bayesian network (Turing complete) Although the conditional probability distributions used in this construction are likely not biologically plausible, their density functions are the limit of the very ‘natural’ Gaussian and logistic functions commonly used in neural model- ing. This suggests there is no theoretical barrier to casting a neural system as a dynamic probabilistic system and conducting analysis accordingly, although there are still likely many practical difficulties in such an approach. While simple recurrent neural networks are theoretically capable of perform- ing arbitrary computations, practical extensions include higher-order connec- tions [18], ‘gates’ in long short-term memory [19], and even connections to an ‘external’ Turing machine [20]. These additions enrich the capabilities of stan- dard neural networks, making it easier to train them for complex algorithmic tasks. An interesting question is to what degree dynamic Bayesian networks can be similarly extended and how the ‘core’ dynamic Bayesian network being ca- pable of Turing-complete computation affects the overall performance of such networks. There is a very small gap in decidability - it takes little to turn a sub- Turing framework for modeling into a Turing-complete one. In the case of neural networks, a single recurrent layer, with arbitrary-precision rational weights and a saturating linear transfer function is sufficient. With dynamic Bayesian networks, two time-slices, continuous-valued random variables with a combination of linear and step function conditional probability densities is sufficient. While Turing- completeness is often desirable in trying to model complex systems, it can lead to difficulties in analyzing and predicting their behavior. For example, given any Turing-complete DBN, it is impossible to find a general-purpose method to determine the stationary distribution or even whether a stationary distribution exists. Ultimately, the Turing-completeness of DBNs is both helpful and unhelpful - Bayesian models are powerful enough to model neural systems, but also inherit the same fundamental difficulties in analyzing their behavior. Acknowledgements I would like to thank James Reggia, William Gasarch, Brendan Good and Kristo- pher Micinski for their discussions and helpful comments on early drafts of this paper. References 1. Poole, D., Crowley, M.: Cyclic causal models with discrete variables: Markov chain equilibrium semantics and sample ordering. In: Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, AAAI Press (2013) 1060– 1068 2. Pearl, J.: Bayesian networks: A model of self-activated memory for evidential reasoning. In: Proceedings of the 7th Conference of the Cognitive Science Society, University of California, Irvine. (August 1985) 329–334 3. Bareinboim, E., Brito, C., Pearl, J.: Local Characterizations of Causal Bayesian Networks. In: Graph Structures for Knowledge Representation and Reasoning: Sec- ond International Workshop, GKR 2011, Barcelona, Spain, July 16, 2011. Revised Selected Papers. Springer Berlin Heidelberg, Berlin, Heidelberg (2012) 1–17 4. Dean, T., Kanazawa, K.: A model for reasoning about persistence and causation. Comput. Intell. 5(3) (December 1989) 142–150 5. Murphy, K.: Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California, Berkeley (2002) 6. Dupont, P., Denis, F., Esposito, Y.: Links between probabilistic automata and hidden markov models: probability distributions, learning models and induction algorithms. Pattern Recognition 38(9) (2005) 1349 – 1371 Grammatical Inference. 7. Siegelmann, H.T., Sontag, E.D.: On the computational power of neural nets. Jour- nal of computer and system sciences 50(1) (1995) 132–150 8. Lauritzen, S.L., Spiegelhalter, D.J.: Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society. Series B (Methodological) (1988) 157–224 9. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc. (1988) 10. Lauritzen, S.L.: Propagation of probabilities, means, and variances in mixed graph- ical association models. Journal of the American Statistical Association 87(420) (1992) 1098–1108 11. Lerner, U., Segal, E., Koller, D.: Exact inference in networks with discrete children of continuous parents. In: Proceedings of the seventeenth conference on uncertainty in artificial intelligence, Morgan Kaufmann Publishers Inc. (2001) 319–328 12. Raphael, C.: Bayesian networks with degenerate gaussian distributions. Method- ology and Computing in Applied Probability 5(2) (2003) 235–263 13. Cooper, G.F.: The computational complexity of probabilistic inference using bayesian belief networks. Artificial intelligence 42(2) (1990) 393–405 14. Halpern, J.Y.: Axiomatizing causal reasoning. Journal of Artificial Intelligence Research (2000) 317–337 15. Ng, A., Jordan, M.: On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. Advances in neural information processing systems 14 (2002) 841 16. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Mathe- matics of control, signals and systems 2(4) (1989) 303–314 17. Varando, G., Bielza, C., Larrañaga, P.: Expressive power of binary relevance and chain classifiers based on bayesian networks for multi-label classification. In: Prob- abilistic Graphical Models. Springer (2014) 519–534 18. Pineda, F.J.: Generalization of back propagation to recurrent and higher order neural networks. In: Neural information processing systems. (1988) 602–611 19. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8) (1997) 1735–1780 20. Graves, A., Wayne, G., Danihelka, I.: Neural turing machines. arXiv preprint arXiv:1410.5401 (2014)