Constrained Training of Neural Networks via Theorem Proving

Mark Chevallier, Matthew Whyte and Jacques D. Fleuriot
Artificial Intelligence and its Applications Institute, School of Informatics, University of Edinburgh, UK

Abstract

We introduce a theorem proving approach to the specification and generation of temporal logical constraints for training neural networks. The distinctive benefit of our approach is the fully rigorous implementation of generalised constrained training, which guarantees correctness of implementation at every step.

1. Introduction

Neural networks are powerful tools that can be used for low-level action and perception in complicated environments, e.g. autonomous driving, but whether they subsequently implement the intended behaviour is difficult to assert, let alone guarantee [1]. We want a neural network that can be taught constraints such as β€œnever go past this threshold” or β€œalways avoid this region” and actually obey them as intended when deployed.

Learning that incorporates such constraints has been the subject of recent efforts, with many approaches used to express and inject knowledge into model training. These range from logical encodings written directly in PyTorch [2] and TensorFlow [3] to a variety of other bespoke encoding and integration mechanisms [4, 5, 6, 7, 8, 9]. Given the ad hoc nature of these implementation mechanisms, it is difficult to provide strong guarantees about the correctness of the logical constraints and hence about the actual training.

Our work introduces a neurosymbolic pipeline that injects a loss function, which measures the breach of a constraint expressed in linear temporal logic over finite traces [10] (LTL𝑓), into the training of any neural network, with formal guarantees of correctness. To achieve this, we use the proof assistant Isabelle/HOL [11] to formally verify our model of the logical system, the soundness of the loss function β„’, and the loss function derivative 𝑑ℒ, which is used for backpropagation in training neural networks. The formal proof that β„’ is sound with respect to the semantics of LTL𝑓 guarantees that it will enforce the intended behaviour during training. We then automatically generate OCaml code from our provably correct formal specifications [12] and integrate it into a PyTorch neural network via a library that provides OCaml bindings for Python [13]. The use of code generation here, rather than an ad hoc, manually written implementation, guarantees that the implementation will match the specification.

Lastly, to demonstrate the approach's guarantees, we build an example implementation and perform experiments confirming that it behaves as expected. The results show that the method yields a practical implementation with formal guarantees.
2. Formalising linear temporal logic

We define a state as a function from the integers to the reals. It tracks the values in a learning problem that might be compared: a constant, or some measurement that a neural network uses in its training, e.g. how far a robotic hand is from a barrier at a given moment in time. The integers passed to a state function are simply the indices of those values of interest. As we are working with LTL𝑓 (see below), we allow the state functions to vary over time. With each state function encoding information about a system at a specific time-step, we define a path of length 𝑁 as a sequence of state functions encoding the evolution of a system over 𝑁 time-steps.

Our formalisation uses a variant of linear temporal logic [14] known as LTL𝑓 [10], which is interpreted over finite paths and is often viewed as a more natural choice for applications in AI, e.g. planning [15], where processes are usually of finite duration. Following Fischer et al. [5], we take comparisons of a state function's values as our atomic propositions 𝜌, namely 𝑑 < 𝑑′, 𝑑 ≀ 𝑑′, 𝑑 = 𝑑′ and 𝑑 β‰  𝑑′. We thus build constraints in LTL𝑓 from these comparisons (comp) and the constraints (constraint) arising from LTL𝑓's operators. The LTL𝑓 formulae are: 𝜌, 𝜌1 ∧ 𝜌2, 𝜌1 ∨ 𝜌2, 𝒩𝜌 (Next), β–‘πœŒ (Always), β™’πœŒ (Eventually), 𝜌1 𝒰 𝜌2 (Weak Until) and 𝜌1 β„› 𝜌2 (Strong Release). In LTL𝑓, these have the usual semantics of LTL [14], except at the end of a path. In particular, Β¬(π’©πœŒ) holds for all 𝜌 at the final time-step [10].

We define Isabelle datatypes comp and constraint for the real-valued comparisons and LTL𝑓 constraints respectively. This approach to specifying the language in higher-order logic is known as a deep embedding [16]. Doing so enables us to prove that the loss function is sound and, importantly, to generate fully self-contained code for defining the LTL𝑓 constraints that will be used as part of the training of a neural network.

Next, we formalise the eval function, which characterises the semantics of LTL𝑓 by recursively evaluating the truth-value of a constraint over a path. Given the complexity of LTL𝑓, we also prove a number of LTL𝑓 equivalences, which confirm that our eval behaves as expected and matches LTL𝑓 semantics.

3. An LTL-based loss function and its derivative

The loss function β„’ takes an LTL𝑓 constraint 𝜌, a path 𝑃 and a relaxation factor 𝛾, and returns a real value. Its purpose is to return a proportional positive value if 𝜌 is breached under path 𝑃, and 0 otherwise. It needs to satisfy several important properties:

1. β„’(𝜌, 𝑃, 𝛾) β‰₯ 0;
2. β„’ is differentiable w.r.t. any of the terms that the constraint compares, for 𝛾 > 0;
3. (Soundness) lim_{𝛾→0} β„’(𝜌, 𝑃, 𝛾) = 0 ⟺ 𝜌(𝑃), where 𝜌(𝑃) is the truth value of 𝜌 on 𝑃.

β„’ must be differentiable so that the neural network can learn from it via backpropagation (see Section 4). We therefore use soft, differentiable versions of various functions to quantify the satisfaction of constraints. Based on the work by Cuturi and Blondel [17], we define a binary softmax function max_𝛾 (and a softmin analogously) as:

\[
\max\nolimits_{\gamma}(a_1, a_2) =
\begin{cases}
\max(a_1, a_2) & \gamma \leq 0 \\
\gamma \log\!\left(e^{a_1/\gamma} + e^{a_2/\gamma}\right) & \text{otherwise}
\end{cases}
\]

Every soft function takes an additional parameter 𝛾, the relaxation factor used for β„’. The intention behind this parameter is that max_𝛾 β†’ max as 𝛾 β†’ 0, and that max_𝛾 is differentiable when 𝛾 > 0.
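To make the preceding definitions concrete, the sketch below transcribes the deep embedding of Section 2 and the soft relaxation above into Python. It is purely illustrative: the authoritative definitions are the Isabelle theories, and all names here (Comp, evaluate, softmax_gamma, and so on) are our own; the Until and Release operators are elided for brevity.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Union

State = Callable[[int], float]   # index of a tracked value -> its real value
Path = List[State]               # one state function per time-step

@dataclass
class Comp:                      # atomic comparison between two tracked values
    op: str                      # one of "<", "<=", "=", "!="
    i: int                       # index of the left-hand value
    j: int                       # index of the right-hand value

# Deep embedding of the LTLf constraint language (Until/Release elided).
@dataclass
class Atom: c: Comp
@dataclass
class And: a: "Constraint"; b: "Constraint"
@dataclass
class Or: a: "Constraint"; b: "Constraint"
@dataclass
class Next: a: "Constraint"
@dataclass
class Always: a: "Constraint"
@dataclass
class Eventually: a: "Constraint"

Constraint = Union[Atom, And, Or, Next, Always, Eventually]

_OPS = {"<": lambda x, y: x < y, "<=": lambda x, y: x <= y,
        "=": lambda x, y: x == y, "!=": lambda x, y: x != y}

def evaluate(rho: Constraint, P: Path) -> bool:
    """Truth value of an LTLf constraint on a non-empty finite path."""
    s, rest = P[0], P[1:]
    if isinstance(rho, Atom):
        return _OPS[rho.c.op](s(rho.c.i), s(rho.c.j))
    if isinstance(rho, And):
        return evaluate(rho.a, P) and evaluate(rho.b, P)
    if isinstance(rho, Or):
        return evaluate(rho.a, P) or evaluate(rho.b, P)
    if isinstance(rho, Next):        # Next is false at the final time-step
        return bool(rest) and evaluate(rho.a, rest)
    if isinstance(rho, Always):      # body holds on every suffix of the path
        return evaluate(rho.a, P) and (not rest or evaluate(rho, rest))
    if isinstance(rho, Eventually):  # body holds on some suffix of the path
        return evaluate(rho.a, P) or (bool(rest) and evaluate(rho, rest))
    raise TypeError(f"unknown constraint: {rho!r}")

def softmax_gamma(a1: float, a2: float, gamma: float) -> float:
    """Binary soft maximum: exact max for gamma <= 0, smooth for gamma > 0."""
    if gamma <= 0:
        return max(a1, a2)
    m = max(a1, a2)                  # shift to stabilise the exponentials
    return m + gamma * math.log(math.exp((a1 - m) / gamma)
                                + math.exp((a2 - m) / gamma))

def softmin_gamma(a1: float, a2: float, gamma: float) -> float:
    return -softmax_gamma(-a1, -a2, gamma)
```

For instance, the Avoid constraint of Section 5 is an Always over a comparison of a distance measurement against a constant radius.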
We formalise this in Isabelle and prove that it has the desired properties, going on to formalise its derivative and prove that it is correct. Further soft functions capture losses from the 𝑑 β‰  𝑑′ comparison, again using 𝛾 as a parameter in the same way.

We proceed to define β„’ recursively over the constraint and path, with 𝛾 as a parameter. For all the LTL𝑓 operators, the β„’ function, in common with our eval function, recurses over the constraint from the outside in, and recurses down the path as required for the temporal operators. Once we have formulated β„’, we show, via a series of lemmas and an inductive proof on the constraint datatype, that it has the expected soundness property with respect to the eval function.

We next construct a derivative 𝑑ℒ for the β„’ function, to be used for gradient-based methods in PyTorch (see Section 4). The derivative must be defined with respect to each time-step 𝑖 and each state-value index 𝑗 at that time-step along the finite trace. The formalisation of 𝑑ℒ is structured similarly to that of β„’: it is defined recursively over the components of the LTL𝑓 constraint passed to it and essentially follows from repeated applications of the chain rule. In defining it, we make extensive use of the derivatives of the soft functions defined earlier. We formally prove that 𝑑ℒ is the correct derivative of our loss function and thus guarantee that, when used for backpropagation, it will achieve the desired results.

4. A PyTorch-compatible LTL loss function

With our formalisation and proofs complete, what remains is to integrate them into the PyTorch environment. Unfortunately, no mechanism exists for generating Python code from Isabelle functions. Instead, we choose to generate intermediate representations of β„’ and 𝑑ℒ in OCaml, since our recursive Isabelle functions can be straightforwardly translated to type-safe OCaml ones using the code generation machinery of Isabelle. Moreover, there exists a Python library that can be used to call OCaml functions from within Python code [13]. In order to produce computable code, we need to map the real numbers of Isabelle to floating-point numbers [18]. As this is an approximation, it naturally leaves some scope for machine arithmetic errors, although the code generated for the various functions is fully faithful to their definitions in Isabelle.

We generate an OCaml module LTL_Loss. Python bindings for the OCaml definitions of β„’ and 𝑑ℒ are incorporated into a PyTorch autograd.Function object through the forward and backward methods, respectively [19]. These methods are required for the loss function to form part of a computational graph in PyTorch and enable training based on gradient descent. Consequently, our LTL_Loss module, implemented as a subclass of autograd.Function, is functionally identical to a differentiable PyTorch operation on tensors. Significantly, we know exactly how LTL_Loss should behave when computing gradients with autograd, as it is solely characterised by its formalisation in Isabelle.
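As an illustration only (the actual bindings are generated, not hand-written), the following sketch shows the general shape of such an autograd.Function wrapper. Here ocaml_loss and ocaml_dloss are hypothetical stand-ins for the Python bindings to the generated OCaml implementations of β„’ and 𝑑ℒ.

```python
import torch

def make_ltl_loss(constraint, gamma, ocaml_loss, ocaml_dloss):
    """Wrap generated L and dL as a differentiable PyTorch operation."""

    class LTLLoss(torch.autograd.Function):
        @staticmethod
        def forward(ctx, path):
            # path: an (N, d) tensor -- N time-steps, d tracked state values.
            ctx.save_for_backward(path)
            loss = ocaml_loss(constraint, path.tolist(), gamma)
            return path.new_tensor(loss)

        @staticmethod
        def backward(ctx, grad_output):
            (path,) = ctx.saved_tensors
            # dL yields one partial derivative per time-step i and index j.
            grads = ocaml_dloss(constraint, path.tolist(), gamma)
            return grad_output * path.new_tensor(grads)

    return LTLLoss.apply
```

The returned callable can then be added to an ordinary task loss, so the constraint term participates in gradient descent like any other PyTorch operation.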
5. Experiments and Discussion

We base our experiments on those of Innes and Ramamoorthy [2], where they assess LTL𝑓-constrained training in a spatial environment. Each of the tests takes place in a 2-dimensional planar environment with Cartesian co-ordinates. The training data consist of 𝑁 = 100 sequenced points in the plane that follow a spline-shaped curve with small random perturbations, simulating a demonstrator moving to some destination. We train a feed-forward neural network to learn a Dynamic Movement Primitive (DMP) [20] that follows this trajectory. We lay out four different constraints to test:

Avoid: The trajectory (always) avoids an open ball of radius 0.1 around the point (0.4, 0.4).
Patrol: The trajectory eventually reaches (0.2, 0.4) and (0.85, 0.6) in the plane.
Until: The 𝑦 co-ordinate of the trajectory cannot exceed 0.4 until its π‘₯ co-ordinate is at least 0.6.
Compound: A more complicated test combining several conditions. The trajectory should avoid an open ball of radius 0.1 around the point (0.5, 0.5), while eventually touching the point (0.7, 0.5). Further, the 𝑦 co-ordinate of the trajectory should not exceed 0.8.

When we run our tests, we first train the neural network ignoring the logical constraint, and then train it with the constraint included in its loss calculations, as sketched below. The differences between the two sets of results demonstrate the effectiveness of the logical constraint as used in the loss function for the training. Visual depictions of the results are given in Fig. 1.

[Figure 1: Avoid, Patrol, Until and Compound tests. The unconstrained output trajectory is in yellow, and the constrained output trajectory is in blue.]
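The sketch below illustrates this two-phase protocol under stated assumptions: model maps a start state to an (N, 2) trajectory tensor, demos yields demonstration pairs, and ltl_loss is a wrapped constraint loss as in Section 4. The names, optimiser and weighting are ours, not taken from the paper.

```python
import torch

def train(model, demos, ltl_loss=None, weight=1.0, epochs=500, lr=1e-3):
    """Imitation training, optionally augmented with an LTLf constraint loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = torch.nn.MSELoss()
    for _ in range(epochs):
        for start, target in demos:      # target: demonstrated (N, 2) path
            opt.zero_grad()
            traj = model(start)          # predicted (N, 2) trajectory
            loss = mse(traj, target)     # plain imitation loss
            if ltl_loss is not None:     # add the constraint-breach term
                loss = loss + weight * ltl_loss(traj)
            loss.backward()
            opt.step()
    return model

# First the unconstrained baseline, then the constrained run:
#   train(model, demos)
#   train(model, demos, ltl_loss=make_ltl_loss(...))
```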
These results demonstrate that our approach has the desired intent: a theorem-proving-based approach (with associated code generation) can be used to rigorously inject constraints into neural network training and so guarantee the expected behaviour. Our approach is generic, so in principle a different formalism, e.g. a continuous-time logic, could be used instead of LTL𝑓, and applied to any suitable neural network. Moreover, by formalising the derivative of the loss function, we unlock the potential to reason formally about the traversal of the loss surface during gradient descent. Given our results, we believe this work opens the way to a tighter integration between fully-formal symbolic reasoning in a theorem prover and machine learning.

References

[1] N. Kalra, S. M. Paddock, Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?, Transportation Research Part A: Policy and Practice 94 (2016) 182–193. doi:10.1016/j.tra.2016.09.010.
[2] C. Innes, R. Ramamoorthy, Elaborating on learned demonstrations with temporal logic specifications, in: M. Toussaint, A. Bicchi, T. Hermans (Eds.), Proceedings of Robotics: Science and Systems XVI, 2020. doi:10.15607/RSS.2020.XVI.004.
[3] G. Marra, F. Giannini, M. Diligenti, M. Gori, LYRICS: A general interface layer to integrate logic inference and deep learning, in: U. Brefeld, E. Fromont, A. Hotho, A. Knobbe, M. Maathuis, C. Robardet (Eds.), Machine Learning and Knowledge Discovery in Databases, Springer International Publishing, Cham, 2020, pp. 283–298.
[4] S. Badreddine, A. d'Avila Garcez, L. Serafini, M. Spranger, Logic tensor networks, Artificial Intelligence 303 (2022) 103649. doi:10.1016/j.artint.2021.103649.
[5] M. Fischer, M. Balunovic, D. Drachsler-Cohen, T. Gehr, C. Zhang, M. Vechev, DL2: Training and querying neural networks with logic, in: International Conference on Machine Learning, 2019, pp. 1931–1941.
[6] Z. Hu, X. Ma, Z. Liu, E. Hovy, E. Xing, Harnessing deep neural networks with logic rules, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 2410–2420. doi:10.18653/v1/P16-1228.
[7] T. Li, V. Srikumar, Augmenting neural networks with first-order logic, in: A. Korhonen, D. R. Traum, L. MΓ rquez (Eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, 2019, pp. 292–302. doi:10.18653/v1/p19-1028.
[8] R. Stewart, S. Ermon, Label-free supervision of neural networks with physics and domain knowledge, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[9] J. Xu, Z. Zhang, T. Friedman, Y. Liang, G. Van den Broeck, A semantic loss function for deep learning with symbolic knowledge, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, PMLR, Stockholm, Sweden, 2018, pp. 5502–5511. URL: http://proceedings.mlr.press/v80/xu18h.html.
[10] G. De Giacomo, M. Y. Vardi, Linear temporal logic and linear dynamic logic on finite traces, in: IJCAI '13: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Association for Computing Machinery, 2013, pp. 854–860.
[11] T. Nipkow, L. C. Paulson, M. Wenzel, Isabelle/HOL: A Proof Assistant for Higher-Order Logic, volume 2283, Springer Science & Business Media, 2002.
[12] F. Haftmann, L. Bulwahn, Code generation from Isabelle/HOL theories, Isabelle documentation (2021). URL: https://isabelle.in.tum.de/doc/codegen.pdf.
[13] L. Mazare, Using Python and OCaml in the same Jupyter notebook (2019). URL: https://blog.janestreet.com/using-python-and-ocaml-in-the-same-jupyter-notebook.
[14] A. Pnueli, The temporal logic of programs, in: 18th Annual Symposium on Foundations of Computer Science (SFCS 1977), 1977, pp. 46–57. doi:10.1109/SFCS.1977.32.
[15] J. Baier, S. McIlraith, Planning with first-order temporally extended goals using heuristic search, in: Proceedings of the National Conference on Artificial Intelligence, volume 1, 2006.
[16] R. J. Boulton, A. D. Gordon, M. J. Gordon, J. Harrison, J. Herbert, J. Van Tassel, Experience with embedding hardware description languages in HOL, in: TPCD, volume 10, Citeseer, 1992, pp. 129–156.
[17] M. Cuturi, M. Blondel, Soft-DTW: A differentiable loss function for time-series, arXiv preprint arXiv:1703.01541 (2017).
[18] F. Haftmann, J. HΓΆlzl, T. Nipkow, Code_Real_Approx_By_Float, HOL-Library (2021). URL: https://isabelle.in.tum.de/library/HOL/HOL-Library/Code_Real_Approx_By_Float.html.
[19] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems 32 (2019) 8026–8037.
[20] S. Schaal, J. Peters, J. Nakanishi, A. Ijspeert, Learning movement primitives, in: Robotics Research: The Eleventh International Symposium, Springer, 2005, pp. 561–572.
Acknowledgements

The authors would like to express their gratitude to the Artificial Intelligence and its Applications Institute, the Informatics Graduate School at the University of Edinburgh, and the Engineering and Physical Sciences Research Council for funding this work.