Using Neural Networks to Control Glut in the Active Logic Machine

Justin D. Brody1,2, Bobby Austin1, Omar Khater1, Christopher Maxey3, Matthew D. Goldberg3, Timothy Clausner2, Darsana Josyula4 and Donald Perlis2

1 Franklin and Marshall College, PO Box 3003, Lancaster PA 17604
2 University of Maryland Institute for Advanced Computer Studies, College Park MD 20740
3 University of Maryland Department of Computer Science, College Park MD 20740
4 Bowie State University Department of Computer Science, 14000 Jericho Park Road, Bowie MD 20715

1. Introduction

In this paper, we address the problem of inferential glut: specifically, the problem that symbolic reasoners tend to produce vastly more deductive information than they can tractably process. As a concrete example, imagine an agent searching for a specific book. Entering a study, it encounters, among other things, a bookshelf, a lamp, and a table. If the agent has a knowledge base with a large number of facts about lamps and tables, it may spend an inordinate amount of resources reasoning about lamps and tables, and such irrelevant reasoning can prevent it from drawing conclusions that are helpful for finding the book. In fact, such swamping can also arise immediately and devastatingly even without distractor entities, as discussed below. Of course, the agent could also have special information guiding the reasoning in ways appropriate to the specific task (e.g., books tend to be on shelves). But how is this special information obtained? Here we address this as a learning issue and report on preliminary results from our approach.

As an example, consider a round library with one continuous circular shelf along its circumference, where each book B has exactly one "left" neighbor left(B) and one "right" neighbor right(B), where these function symbols simply refer to those two neighbors of B. This might be encoded by the following rules 𝐾 for some book π‘Ž:

1. book(a)
2. book(X) β†’ book(left(X))
3. book(X) β†’ book(right(X))

Figure 1: The Knowledge Base 𝐾

NeSy'20/21: Workshop on Neuro-Symbolic Learning and Reasoning, October 25–27, 2021 (Virtual)
jbrody2@fandm.edu (J. D. Brody)
Β© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

We will consider an agent using the Active Logic reasoning engine ALMA [1], [2], [3], [4]. Active logic is a timestep logic in which reasoning proceeds by applying the available inference rules to the contents of the knowledge base at the current timestep, which produces a revised knowledge base at the next timestep. It does not apply "deductive closure" to derive consequences of consequences of sentences in the knowledge base, and there is no "final step" at which deductive closure is achieved. Reasoning continues in real time, with the agent revising its knowledge base as it goes. The intuition is that active logic reflects the reasoning of human agents, whose deductions take time and who can reason about the time a deductive process is taking.

For example, considering the theory 𝐾 above, applying modus ponens allows ALMA to derive the sentences book(left(a)) and book(right(a)); these will thus enter the knowledge base at the next timestep. On the other hand, book(right(right(a))) requires two applications of modus ponens to derive, so it will not enter the knowledge base in the next timestep (but will in two timesteps). The ALMA reasoning engine supports the three basic kinds of inference found in [5]: resolution, forward inference and backward inference. This limitation on the number of new derivations in a timestep has a number of advantages: one is that it avoids the logical omniscience problem [6], [1]; another is that it constrains the growth rate of the knowledge base.
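The single-step character of active-logic inference can be sketched as follows. This toy code is ours, not the ALMA engine; it represents sentences as plain strings and applies modus ponens with the two rules of 𝐾 exactly once per step, with no deductive closure within a step.

```python
# Toy sketch (ours, not ALMA) of one active-logic timestep on the theory K:
# inference rules are applied once to the current knowledge base to produce
# the next one; no deductive closure is taken within a single step.

def step(kb):
    """Apply modus ponens once: book(T) yields book(left(T)) and book(right(T))."""
    derived = set()
    for fact in kb:
        if fact.startswith("book(") and fact.endswith(")"):
            t = fact[len("book("):-1]         # the term T inside book(T)
            derived.add(f"book(left({t}))")   # rule 2: book(X) -> book(left(X))
            derived.add(f"book(right({t}))")  # rule 3: book(X) -> book(right(X))
    return kb | derived

kb1 = step({"book(a)"})
# book(left(a)) and book(right(a)) enter at the next timestep, but
# book(right(right(a))) needs a second application of modus ponens:
"book(right(right(a)))" in kb1        # False
"book(right(right(a)))" in step(kb1)  # True
```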
However, it turns out that in the simple example given by 𝐾 above, reasoning with these rules leads to enormous growth in inferences, far beyond the capacity of any current computer. In fact, empirical evidence indicates that in 𝑛 steps of reasoning over the knowledge base above, the reasoner gains Θ(2^(2^n)) new inferences. The vast majority of these inferences will not be useful for any one particular reasoning task, and inferential glut is thus a serious issue even for active logic. In this paper, we propose an architecture for taming such glut based on using a neural network to assign priorities to potential inferences; we will present some preliminary results for this architecture. Our main contributions are:

1. We develop an architecture for controlling glut.
2. We demonstrate the effectiveness of our architecture in a supervised learning task (learning when potential inferences will unify).
3. We demonstrate the effectiveness of our architecture in a simple reinforcement learning setting.

While our focus in this paper is on abstract reasoning, our ultimate aim is to apply our techniques to embodied agents reasoning in an environment. This changes the nature of the glut control problem in some immediate ways. For example, the derivations a situated agent needs are prima facie much shorter than those needed by an agent attempting to prove mathematical theorems. Also, the theories we work with are incomplete in that a number of the sentences that enter the knowledge base will come from observation rather than deduction. Thus the relevance of any particular premise will be at least partially determined by the environment rather than being implicitly present in the structure of the knowledge base at any given time. This affects what kinds of models a network needs to learn: pure logical structure in the case of abstract reasoning, versus logical structure in the context of a specific environment in our envisioned use case.

2.
Related Work

Wang et al. [7] represented logical sentences with syntax trees and fed the corresponding graphs into deep neural networks to perform premise selection, i.e. to pick which available sentences in the knowledge base would likely be useful for reasoning toward the proof of a given conjecture. Our work differs in that we select potential inferences rather than premises, and rather than reasoning toward a particular goal we reason according to a more generally defined heuristic function. In particular, this need not be goal-driven (in the sense of proving a given theorem); for example, in Section 5 we describe the training of a network heuristic which is aimed solely at ensuring that potential inferences can lead to new knowledge. In addition to [7], a number of other works have used deep learning methods to guide reasoning, for example [8], [9] and [10]. With the notable exception of [8], most of these work in the context of automatic theorem proving. Our target is quite different: reasoning with an incomplete theory in an environment, without an explicit sentence as a reasoning goal.

3. Glut

We divide glut into two varieties: pre-inferential and inferential. These arise from different sources and, in this preliminary work, we take different approaches to addressing them. The fundamental difference in this work is in how the models are trained; the basic architecture will be the same in both cases.

Pre-inferential glut arises from the way in which our reasoning engine works: each time a sentence is added to the knowledge base, a very simple pattern match is performed to determine whether it can combine with any existing sentence to deduce new knowledge via an inference rule. Positive matches result in a potential inference, a pair of sentences which the system prepares to combine via an inference rule. Importantly, the simplicity of the pattern match can result in false positives.
For example, if an ALMA knowledge base contains the rule f(X, 0) β†’ g(X) and the sentence f(a, 1) is added to the knowledge base, then the pair (f(X, 0) β†’ g(X), f(a, 1)) will be taken as a potential inference even though f(a, 1) cannot be unified with f(X, 0) because of the conflicting second argument. We consider such potential inferences, which will not unify to produce new knowledge, to be pre-inferential glut. Because we can algorithmically check whether a potential inference will lead to a new deduction, training our architecture to filter out pre-inferential glut can be done with supervised learning. While filtering out pre-inferential glut could be achieved simply by attempting to apply resolution to each potential inference, using a neural network allows us to detect pre-inferential glut in constant time. Further, since we will ultimately apply a neural heuristic to every potential inference anyway, we are essentially able to do this filtering at no additional cost.

The second kind of glut we consider is inferential glut: potential inferences which will combine to produce new knowledge that is not relevant to the task at hand. Revisiting our original example of the agent searching for a book in a study, we note that the agent's goal will not be well served by a long chain of reasoning about the lamp in the room. The associated utility of such potential inferences is inherently tied to the task and environment in which the agent operates; thus training our network to diminish inferential glut seems best suited to reinforcement learning.

4. Architecture

Figure 2 below shows an outline of our system's operation. Let us imagine that the knowledge engine begins with sentences 1 through 5 upon initialization and then adds sentences 6, 7 and 8. When adding these sentences, the system determines the potential inferences¹ as described above and stores each such pair in a buffer.
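The gap between the cheap pattern match and full unification can be illustrated with a small sketch. The (predicate, argument-list) encoding and the uppercase-variable convention below are our assumptions for illustration, not ALMA's actual internal representation.

```python
# Sketch (our toy encoding, not ALMA's): a literal is (predicate, args),
# with uppercase strings taken as variables. The cheap match that creates a
# potential inference checks only predicate name and arity; unification
# additionally checks that the arguments are compatible.

def cheap_match(l1, l2):
    """Predicate name + arity only: enough to create a potential inference."""
    return l1[0] == l2[0] and len(l1[1]) == len(l2[1])

def unify(l1, l2):
    """One-sided unification (variables only in l1): substitution dict or None."""
    if not cheap_match(l1, l2):
        return None
    subst = {}
    for a, b in zip(l1[1], l2[1]):
        if isinstance(a, str) and a.isupper():   # a is a variable
            if subst.setdefault(a, b) != b:      # bind it, or check old binding
                return None
        elif a != b:                             # conflicting constants
            return None
    return subst

antecedent = ("f", ["X", 0])   # from the rule f(X, 0) -> g(X)
fact       = ("f", ["a", 1])
cheap_match(antecedent, fact)        # True: becomes a potential inference
unify(antecedent, fact)              # None: pre-inferential glut
unify(antecedent, ("f", ["a", 0]))   # {'X': 'a'}: a genuinely useful pair
```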
Each potential inference is then given a vector representation which is fed into one of a number of neural networks that define a heuristic function. This function assigns a priority 𝑝 (0 < 𝑝 < 1) to each potential inference; the potential inference is then placed on a priority queue accordingly, provided that 𝑝 < πœƒ for some threshold πœƒ (note that lower 𝑝 indicates a more immediate priority). Then, for some 𝑇 > 0, the engine processes the top 𝑇 elements of the priority queue² and attempts to apply the appropriate inference rule.

Figure 2: Glut Control Architecture. [Figure: the ALMA Reasoning Engine holds the rules 1. isA(bookshelf, X) --> contains(X, books), 2. isA(book, X) --> isA(book, sequel(X)), 3. isA(book, X) --> isA(book, prequel(X)), 4. isA(bookshelf, X) --> isA(furniture, X), 5. isA(table, X) --> contains(X, legs), and receives the observations 6. isA(bookshelf, object47), 7. isA(table, object51), 8. isA(book, object61). Adding these observations fills the Potential Inference Buffer with pairs such as [isA(bookshelf, object47); isA(bookshelf, X) --> contains(X, books)], [isA(table, object51); isA(table, X) --> contains(X, legs)] and [isA(book, object61); isA(book, X) --> isA(book, sequel(X))], which the Neural Heuristic then orders onto the Resolution Priority Queue.]

It is important to note that our system uses an atomic step which differs from the step of classical ALMA. Specifically, an atomic step is a single application of an inference rule to a single potential inference (with the exception of time-update rules, which are always applied). In classical ALMA, a step consists of a single application of the available inference rules to all potential inferences.
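The enqueue-and-process cycle just described might be sketched as below. The use of `heapq` and all function names are our illustration, not ALMA's implementation; the heuristic is a stand-in.

```python
import heapq

# Sketch of the control loop: the heuristic assigns each potential
# inference a priority p in (0, 1); pairs with p >= theta are discarded,
# the rest go onto a min-heap, and the top T entries are processed per
# atomic step (lower p = more immediate priority).

def enqueue(queue, potential_inferences, heuristic, theta=0.7):
    for counter, pi in enumerate(potential_inferences):
        p = heuristic(pi)
        if p < theta:                                # threshold filter
            heapq.heappush(queue, (p, counter, pi))  # counter breaks ties

def process_step(queue, apply_rule, T=1):
    derived = []
    for _ in range(min(T, len(queue))):
        _, _, pi = heapq.heappop(queue)   # lowest p first
        result = apply_rule(pi)           # attempt the inference rule
        if result is not None:
            derived.append(result)
    return derived

# Toy usage with a stand-in heuristic and rule application:
queue = []
enqueue(queue, ["good pair", "bad pair"],
        heuristic=lambda pi: 0.1 if "good" in pi else 0.9)
process_step(queue, apply_rule=lambda pi: pi.upper())  # ['GOOD PAIR']
```

With ΞΈ = 0.7, the "bad pair" (priority 0.9) never enters the queue at all; this is where the constant-time filtering of pre-inferential glut happens.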
Since our system's notion of a step differs from that of classical ALMA, some remarks on how ALMA's theoretical properties differ are in order. First, we note that because ALMA is designed to operate as a time-sensitive reasoner [2], there are some inferences which cannot be deferred to a later time. In particular, ALMA has a built-in notion of time: at step 𝑑 of reasoning, Now(t) is in the knowledge base. This is updated by the rule Now(t) β†’ Now(t + 1), and it is crucial that such updates occur at each timestep. We thus assign such rules a priority of 0 and ensure that priority-0 potential inferences are applied independently of the process described above.

The soundness and completeness of classical ALMA have been discussed elsewhere [2], [3]; here we note that our system shares ALMA's soundness (in the sense that if the knowledge base is consistent, then any application of an inference rule will preserve that consistency)³. Because ALMA does not, in general, apply every available inference at every step, reasoning will not usually be complete. It is worth noting, however, that for πœƒ = 1 and 𝑇 = ∞ (i.e. using the entire priority queue at each step), our system reduces to classical ALMA and will thus inherit ALMA's completeness [3]. Thus our system provides a set of parameters that tune a tradeoff between completeness and efficiency. As we will see in the next section, the efficiency gains are both in terms of time and space: our control of the inference process can both reduce the number of sentences that must be stored and allow the system to draw valuable conclusions more quickly.

¹ At present, the system only works with applications of the resolution inference rule. Applications of forward and backward inference will be treated in future work.
² In a sense, this makes the inference trajectory in our system follow a kind of A* search, albeit without the proven guarantees of consistency on the heuristic function.

4.1.
Neural Heuristic

While the training of the neural network that defines our heuristic function will vary between the two examples discussed below, the vectorization algorithm and network architecture will remain constant. Vectorization in this context refers to the process of turning the pair of sentences in a potential inference into vectors which can become the input to a neural network. Ideally, such a transformation will balance preserving critical information about the structure of the sentences in the potential inference with representing such sentences concisely. In this preliminary work, we draw on the work of Wang et al. [7], who proposed representing such sentences using graph neural networks. While we do not currently use a full implementation of graph neural networks, we use a vectorization algorithm that is inspired by them and could be thought of as "graph neural networks without message passing". Future work will explore using full graph networks; among other things, we expect that this will allow our representations to effectively scale up to larger knowledge bases with longer sentences.

Our particular vectorization algorithm works with a fixed vocabulary for each knowledge base. This consists of the names of the constants, functions and predicates that occur in the knowledge base. Each of these is assigned a numerical index to use in encoding. We also assign such indices to syntactic tags. For example, for our knowledge base 𝐾 (Figure 1) we might define an indexing function i by i(book) = 0, i(a) = 1, i(right) = 2, i(left) = 3, i(if) = 4, i(and) = 5, i(or) = 6, i(func) = 7, i(var) = 8. Such an indexing scheme allows us to decorate each node of a syntax tree with a one-hot feature vector. See Figure 3 for an example of representing the sentence if(book(X), book(right(X))). Each node of the syntax tree is decorated with a feature vector, shown on the left of the figure.
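The indexing scheme can be turned into a sentence vector as sketched below. The node ordering, edge direction, and the fixed maximum of 50 nodes are our assumptions for illustration (the padding and flattening steps match what the paper describes next).

```python
import numpy as np

# Sketch of the vectorization: one-hot node features over the vocabulary
# plus an n x n incidence matrix for the syntax tree, both zero-padded to
# 50 nodes, flattened, and concatenated per sentence. Node ordering and
# edge direction here are our assumptions.

VOCAB = ["a", "book", "left", "right", "if", "and", "or", "func", "var"]
MAX_NODES = 50

def vectorize(node_labels, edges):
    """node_labels[i] is the vocabulary tag of node i; edges are (parent, child)."""
    feats = np.zeros((MAX_NODES, len(VOCAB)))
    for nid, label in enumerate(node_labels):
        feats[nid, VOCAB.index(label)] = 1.0
    adj = np.zeros((MAX_NODES, MAX_NODES))
    for parent, child in edges:
        adj[parent, child] = 1.0
    return np.concatenate([feats.ravel(), adj.ravel()])

# if(book(X), book(right(X))), with node ids and tags as in Figure 3
labels = ["if", "book", "var", "book", "func", "var"]
edges = [(0, 1), (0, 3), (1, 2), (3, 4), (4, 5)]
sentence_vec = vectorize(labels, edges)     # length 50*9 + 50*50 = 2950
pair_input = np.concatenate([sentence_vec, sentence_vec])  # a potential inference
```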
Each feature vector is a one-hot vector indicating, for the given node, which vocabulary word is present or which syntactic token that node instantiates. Each sentence with an 𝑛-node syntax tree is then represented by a list of 𝑛 feature vectors along with an 𝑛 Γ— 𝑛 incidence matrix for the structure of the syntax tree itself. In our initial work, both structures were zero-padded to represent a 50-node syntax tree with possibly empty nodes (nodes with no connections and feature vectors of all 0s). The tree and feature-vector list were then flattened into a single vector which represented the sentence. Pairs of such vectors, representing the sentences of a potential inference, were then concatenated to form the input to the neural network.

³ Importantly, we do not assert that starting from a consistent initial knowledge base guarantees that the knowledge base will always be consistent. In particular, observations can contradict previously held assumptions. As a paraconsistent logic, ALMA is designed specifically to detect and correct such inconsistency.

nID | a | book | left | right | if | and | or | func | var
 0  | 0 |  0   |  0   |   0   | 1  |  0  | 0  |  0   |  0
 1  | 0 |  1   |  0   |   0   | 0  |  0  | 0  |  0   |  0
 2  | 0 |  0   |  0   |   0   | 0  |  0  | 0  |  0   |  1
 3  | 0 |  1   |  0   |   0   | 0  |  0  | 0  |  0   |  0
 4  | 0 |  0   |  0   |   0   | 0  |  0  | 0  |  1   |  0
 5  | 0 |  0   |  0   |   0   | 0  |  0  | 0  |  0   |  1

Figure 3: Vectorization: every sentence is represented via an annotated graph of its syntax tree as shown to the right (node ids are written in blue to the left of each node). The graph structure is encoded as an incidence matrix, while the relevant information about each node in the tree is encoded by a one-hot representation.

The network itself had two hidden layers, each with 8 nodes and tanh activations. The output layer has a sigmoidal activation to keep our priorities between 0 and 1 (having bounded priorities is helpful for analysis). The loss and training differ from scenario to scenario and will be described individually in the sections which follow.

5.
Pre-Inferential Glut Control

To study the system's ability to control pre-inferential glut, we worked with an initial knowledge base having a single sentence of the form

𝑓(π‘₯0, 𝑦0, 𝑑) ∧ 𝑓(𝑏, 𝑦1, 𝑑) β†’ 𝑔(π‘₯0, 𝑏, 𝑦0, 𝑑)

along with 50 observations of the form

β€’ 𝑓(π‘Ž, 𝑑𝑖, 𝑖)
β€’ 𝑓(𝑏, 𝑑𝑖, 𝑖)

for 𝑖 an integer with 0 ≀ 𝑖 < 50 and 𝑑𝑖 random integers with 0 ≀ 𝑑𝑖 < 100. Logically, there are 100 conclusions that can be derived from these initial sentences (of the forms 𝑔(π‘Ž, 𝑏, 𝑑𝑖, 𝑖) and 𝑔(𝑏, 𝑏, 𝑑𝑖, 𝑖) for 0 ≀ 𝑖 < 50). The potential inferences generated by this system include pairs such as 𝑔(π‘Ž, 𝑏, 𝑑0, 𝑑0), 𝑔(π‘Ž, 𝑏, 𝑑1, 𝑑1); these will unify if, and only if, 𝑑0 = 𝑑1 and 𝑑0 = 𝑑1. This leads to an overall sparsity of potential inferences which will unify within the space of generated potential inferences, indicating a real need for glut control.

Figure 4: After training, our system is able to keep the size of the priority queue small and draw most of the inferences available to it. (a) Without learning priorities or limiting inferences, a great deal of glut is produced which the system cannot reason past. (b) Learning priorities while also limiting inferences allows the system to efficiently find the most useful formulae to combine, reducing glut and improving reasoning efficiency.

During training, an epoch consisted of running the reasoner for 5000 steps (without priority control, i.e. 𝑇 = 1, πœƒ = 1.0) and training on the potential inferences generated, with ground-truth labels generated algorithmically. Specifically, for each potential inference, we generate a label of 0 if the pair unifies and a label of 1 if the pair does not unify. Thus, for any potential inference 𝑝, the output of the network can be thought of as an estimate of 1 βˆ’ πœπ‘, where πœπ‘ denotes the probability that 𝑝 will unify. This has the effect of prioritizing potential inferences which are likely to unify.
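Putting the pieces together, the supervised setup might be sketched as below. The small input dimension, toy unification check, random "vectorizations", and weight initialization are stand-ins of ours, and the Adam update itself is omitted; only the labeling scheme, the 2-hidden-layer tanh/sigmoid architecture, and the MSE objective follow the paper.

```python
import numpy as np

# Sketch of the supervised setup: potential inferences get algorithmic
# labels (0 = unifies, 1 = does not), and the heuristic network (two
# hidden layers of 8 tanh units, sigmoid output) is trained to predict
# them with MSE loss. IN_DIM, the toy pairs, and the initialization are
# stand-ins; the Adam optimization step is not shown.

rng = np.random.default_rng(0)
IN_DIM = 16   # stand-in for the real concatenated-pair vector length

def label(pair):
    """0 if the candidate conclusions unify (d0 == d1 and t0 == t1), else 1."""
    (d0, t0), (d1, t1) = pair
    return 0.0 if (d0 == d1 and t0 == t1) else 1.0

def init_params(in_dim=IN_DIM, hidden=8):
    return [(rng.normal(0, 0.1, (in_dim, hidden)), np.zeros(hidden)),
            (rng.normal(0, 0.1, (hidden, hidden)), np.zeros(hidden)),
            (rng.normal(0, 0.1, (hidden, 1)), np.zeros(1))]

def priority(params, x):
    (W1, b1), (W2, b2), (W3, b3) = params
    h = np.tanh(x @ W1 + b1)
    h = np.tanh(h @ W2 + b2)
    return 1.0 / (1.0 + np.exp(-(h @ W3 + b3)))   # sigmoid keeps p in (0, 1)

def mse(params, xs, ys):
    preds = np.array([priority(params, x)[0] for x in xs])
    return float(np.mean((preds - ys) ** 2))

params = init_params()
pairs = [((3, 7), (3, 7)), ((3, 7), (4, 7))]    # unifies / does not unify
ys = np.array([label(p) for p in pairs])        # [0.0, 1.0]
xs = [rng.normal(size=IN_DIM) for _ in pairs]   # stand-in vectorizations
loss = mse(params, xs, ys)                      # training drives this down
```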
The network was trained with mean-squared-error loss and the Adam optimizer [11]. We trained the network until the accuracy was above 0.999 and the loss was below 1 Γ— 10⁻⁡. To test the system, we ran with 𝑇 = 1, πœƒ = 0.7. The results, shown in Figure 4, illustrate that after training the heuristic network, the system is able to limit the number of potential inferences that need to be considered and to reason past the glut. In the first run (Figure 4a) the system used random priorities. In this situation, the number of potential inferences grows very rapidly and, since the reasoner is effectively only making random choices from the available potential inferences, it is able to combine fewer than 50 potential inferences to infer new sentences. On the other hand, after a modest amount of training (Figure 4b) the network is very easily able to derive over 90 of the available inferences, and the number of potential inferences goes down very quickly.

6. Inferential Glut Control

Our ultimate aim is to control inferential glut: potential inferences which will lead to new information that is not useful for the task at hand. To control this, we set up the problem of glut control as a reinforcement learning problem. Any given scenario will have an associated reward function that is based on the particular sentences an agent has in its knowledge base. Recall that for a trained heuristic function and given parameters πœƒ, 𝑇, we have a well-determined way of choosing which potential inferences will be applied at any given timestep. In reinforcement learning this corresponds to a policy πœ‹. We then define our glut control problem as a reinforcement learning problem as follows:

1. An action π‘Ž corresponds to the selection of a particular potential inference.
2. The state 𝑠 of the agent is determined by the set of sentences in the agent's knowledge base.
3.
The value (with respect to πœ‹) of any pair (𝑠, π‘Ž) is the expected return (total reward for executing the policy after selecting π‘Ž) that the agent will receive from selecting π‘Ž when the knowledge base is in state 𝑠, if the policy πœ‹ is followed for the rest of the episode. It is denoted by π‘„πœ‹(𝑠, π‘Ž), or just 𝑄(𝑠, π‘Ž) since our policy will always be that determined by the neural heuristic.

We approximate the function 𝑄 by training the neural heuristic network via reinforcement learning. In our case we specifically employed 𝑄-learning to train a deep 𝑄 network to approximate the value, using the standard techniques in [12]. To date, our network encodes only the action and not the state; thus we are currently learning an approximation of 𝑄(Β·, π‘Ž) rather than 𝑄 itself. In the situation we will examine, this still proves tremendously useful.

To test our setup, we worked with the knowledge base 𝐾 from the introduction. While simple, these axioms seem to lead to a very rapid (Θ(2^(2^n))) growth in the size of the knowledge base⁴; this is due to the fact that with resolution the system learns not only new literals (e.g. statements like book(left(left(right(a))))) but also general rules (such as book(X) β†’ book(left(left(left(right(X)))))), and the maximal number of function applications involved in the latter doubles at each time step. Thus this simple axiom system already presents a real challenge for a glut-control system.

We trained our system using a network structure identical to that used in the pre-inferential scenario above. Our reward function assigned a reward equal to the total number of new instances of the function right in the knowledge base; e.g. a derivation of book(right(left(right(a)))) would yield a reward of 2. We trained our network as a deep Q network [12] using epsilon-greedy action selection, with πœ– decaying linearly from 1 to 0.1 over the course of 300,000 iterations, and a discount factor of 0.99.
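The reward signal, exploration schedule, and Q-learning target can be sketched concretely as below. Representing sentences as strings and using scalar stand-ins for the Q networks are our simplifications; we also write the discount factor Ξ³ explicitly in the target.

```python
# Sketch of the RL setup: the reward counts new instances of `right`,
# epsilon decays linearly from 1.0 to 0.1 over 300,000 iterations, and the
# Q-learning target uses a target network (parameters theta') with the
# discount gamma written explicitly. Strings-as-sentences and the scalar
# stand-in Q functions below are our simplifications.

GAMMA = 0.99

def reward(derived_sentence):
    """Number of new instances of the function `right` in a derivation."""
    return derived_sentence.count("right(")

def epsilon(iteration, decay_iters=300_000):
    """Linear epsilon-greedy decay from 1.0 down to 0.1."""
    return max(0.1, 1.0 - 0.9 * iteration / decay_iters)

def td_loss(q_current, q_target, action, r, next_actions):
    """(Q_theta(a) - (r + gamma * max_{a'} Q_theta'(a')))^2"""
    target = r + GAMMA * max(q_target(a) for a in next_actions)
    return (q_current(action) - target) ** 2

reward("book(right(left(right(a))))")   # 2, as in the example above
epsilon(0), epsilon(300_000)            # 1.0, 0.1
td_loss(lambda a: 2.0 * a, lambda a: a,
        action=1.0, r=2.0, next_actions=[0.0, 1.0])  # ~ (2.0 - 2.99)^2
```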
Some specifics of our training process are as follows. Following [12], we used experience replay: saving sequences of (action, reward) pairs in a buffer, we trained on random samples from the buffer every 100 iterations; this gives the network a better chance of training on representative samples rather than on a highly correlated sequence of immediately succeeding states. Stability was also improved by keeping a separate target network: we maintained both a target model with parameters πœƒβ€² and a current model with parameters πœƒ; the latter used the former in training as described below, with the target model being reset to the current model every 300 iterations. Each training episode consisted of a run of 10 atomic steps, and we trained for 1,000,000 episodes. The loss function associated with any potential inference π‘Ž was the standard one for 𝑄-learning:

𝐿(𝑠, π‘Ž) = (π‘„πœƒ(π‘Ž) βˆ’ (π‘Ÿπ‘ ,π‘Ž + 𝛾 π‘„πœƒβ€²(π‘Žβ€²)))Β²

where 𝑠 denotes the state of the knowledge base at a given time, π‘Ž denotes a particular action, π‘Ÿπ‘ ,π‘Ž denotes the reward associated with taking action π‘Ž from state 𝑠, 𝛾 denotes the discount factor, 𝑠′ denotes the state of the knowledge base that results from applying action π‘Ž in state 𝑠, and π‘Žβ€² is the action with maximal π‘„πœƒβ€²-value from state 𝑠′. Essentially, 𝐿 measures the degree to which the 𝑄 function satisfies the Bellman equation [12], [13], modified so that the estimate of future rewards uses the target network rather than the current network.

⁴ The actual computer simulation used different predicate names; we use these here to connect with the theory 𝐾 discussed in the introduction.

(a) Training for 10 atomic steps: as the system trains, performance converges on the greedy strategy of increasing the number of rights in a token by one at each step. (b) Testing on 1024 steps: since the system only trained on 10 steps, the results show that the system has learned the strategy of applying a single rule to obtain a new right at each step.
Figure 5: Training Results: the strategy learned for 10 steps extrapolates to a large number of steps.

Despite the massive growth discussed above, and despite being hobbled by taking only actions (and not states) as input, our system was able to learn an effective strategy for increasing rewards, namely applying Rule 3 to the book with the most compositions of right in the knowledge base. For example, after 13 steps the system has learned that a composition of 13 rights with a is a book. At step 14 it applies Rule 3 to what it has just learned, deriving that a composition of 14 rights with a is a book. Since our learning algorithm is stateless, the heuristic function cannot have any notion of which reasoning step it is working with. Under such a constraint, the policy learned must treat all time steps equally. Thus, since the policy learned is clearly the optimal policy for an agent with a single timestep, it must be the optimal policy among those which treat all timesteps equally⁡.

⁡ For an agent that has exploitable information about how many time steps it will reason for, this greedy policy can be beaten by deriving general rules with as many occurrences of right in them as possible and combining such a rule with book(a) as late as possible.

It is worth emphasizing that learning the greedy strategy is a massive improvement over what would have been possible in classical ALMA. Whereas our system was able to efficiently reach a conclusion with 1024 compositions of the function right, because of the massive glut produced, classical ALMA would not have been able to reach even 10 compositions. In fact, running the system for even 5 steps crashed our workstation with 384 gigabytes of memory. Thus even simple reasoning systems need some mechanism for glut control, and our system has the potential to effectively control glut without excessive sacrifices in reasoning power. In future work we will develop the system further and study its effectiveness in more complex reasoning domains.

7. Conclusions and Future Work

We have presented neural network-based architectures to control both pre-inferential and inferential glut.
While the studies described herein are preliminary in nature, they show promise for our approach in that we were able to effectively control glut in some simple scenarios. While these initial problems were simple, they already demonstrated significant levels of glut needing control. Future work will make the networks involved more robust (critically, we will represent state in our reinforcement-learning network) and explore the use of our architecture in increasingly complex environments.

8. Acknowledgements

Work by Brody, Maxey, Goldberg, Clausner, Josyula and Perlis was partially supported by DARPA grant DARPA-PA-19-03-01-01-FP-037 awarded to Perlis. Work by Austin and Khater was supported by a Hackman Summer Scholar award to Brody by Franklin and Marshall College.

References

[1] J. J. Elgot-Drapkin, D. Perlis, Reasoning situated in time I: Basic concepts, Journal of Experimental & Theoretical Artificial Intelligence 2 (1990) 75–98.
[2] J. Elgot-Drapkin, S. Kraus, M. Miller, M. Nirkhe, D. Perlis, Active logics: A unified formal approach to episodic reasoning, Technical Report, 1999.
[3] M. L. Anderson, W. Gomaa, J. Grant, D. Perlis, Active logic semantics for a single agent in a static world, Artificial Intelligence 172 (2008) 1045–1063.
[4] K. Purang, Alma/Carne: Implementation of a time-situated meta-reasoner, in: Proceedings 13th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2001), IEEE, 2001, pp. 103–110.
[5] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 2020.
[6] R. Parikh, Knowledge and the problem of logical omniscience, in: ISMIS, volume 87, Citeseer, 1987, pp. 432–439.
[7] M. Wang, Y. Tang, J. Wang, J. Deng, Premise selection for theorem proving by deep graph embedding, arXiv preprint arXiv:1709.09994 (2017).
[8] D. Huang, P. Dhariwal, D.
Song, I. Sutskever, GamePad: A learning environment for theorem proving, in: International Conference on Learning Representations, 2018.
[9] S. Polu, I. Sutskever, Generative language modeling for automated theorem proving, arXiv preprint arXiv:2009.03393 (2020).
[10] M. Rawson, G. Reger, A neurally-guided, parallel theorem prover, in: International Symposium on Frontiers of Combining Systems, Springer, 2019, pp. 40–56.
[11] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[12] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533.
[13] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.