=Paper=
{{Paper
|id=Vol-1651/12340026
|storemode=property
|title=Crowdsourcing Theorem Proving via Natural Games
|pdfUrl=https://ceur-ws.org/Vol-1651/12340026.pdf
|volume=Vol-1651
|authors=Naveen Sundar Govindarajulu,Selmer Bringsjord
|dblpUrl=https://dblp.org/rec/conf/ijcai/GovindarajuluB16
}}
==Crowdsourcing Theorem Proving via Natural Games==
Naveen Sundar Govindarajulu (Yahoo, Sunnyvale, CA) and Selmer Bringsjord (RAIR Lab, Dept. of Computer Science and Dept. of Cognitive Science, Rensselaer Polytechnic Institute)

Note: While this submission draws partly from the first author's dissertation [1], it contains results and discussions not present in the dissertation. The new data can be downloaded here: https://s3.amazonaws.com/www.catabotrescue.com/data/AnonoynmizedLevelCompletion.json

1 Introduction

Despite the science of modern formal reasoning being more than a century old, mechanized formal reasoning is nowhere near what expert human reasoners (formal and informal) can achieve. Meanwhile, there is a steadily increasing need for automated theorem proving in various fields of science and engineering. Proof discovery and verification in science and mathematics [2,3,4,5], formal verification for hardware and software [6,7], and logic-based AI [8,9,10] are all based on formal theorem proving. We present initial results from a project on augmenting automated theorem provers with crowdsourced theorem proving through games. The project was started based on the following two seemingly unrelated observations:

Observation 1. Non-trivial formal theorem proving requires insight into the structure of the problem being solved. (This insight is provided by the problem domain's experts.)

What do we mean by insight? While it is hard to quantify, a rough characterization follows. Say we are modelling a domain (e.g. standard arithmetic over N, or biological processes) with a set of axioms A, and we are interested in a conjecture Γ. Assume A can derive Γ, that is, Γ is a theorem of A. Also assume that the smallest proof of Γ from A via a standard proof calculus ρ is of length L. One form of insight might be the possession of a library of lemmas by experts.

Definition 1 (Insight via Lemmas). An expert in the domain might possess insight in the form of a set of lemmas L (such that ∀φ ∈ L : A ⊢ φ) that enables the expert to derive Γ from A ∪ L using a proof of length L′ with L′ ≪ L.

Of course, experts could also possess insight not just through lemmas but also through common patterns of reasoning that can be reused across problems.

Definition 2 (Insight via an Extended Proof Calculus). An expert in the domain might possess insight in the form of an extended proof calculus ρ′ that is sound (that is, for all Φ and φ, if Φ ⊢ρ′ φ then Φ ⊨ φ). This proof calculus enables the expert to derive Γ from A through ρ′ using a proof of length L′ with L′ ≪ L.

Insight through lemmas is usually domain- and problem-dependent, while insight through extended proof calculi usually transfers between domains and problems. Unless such insight is provided to a theorem prover in some form, such problems can be very hard to solve fully automatically. In practice, help is provided by human experts via lemmas [5], proof tactics [11], proof methods [12], etc. This requires that the human experts understand not only the problem domain, but also the formalization of the domain and other formal tools that might be needed for the task. But for many domains, e.g. hardware verification, there are not enough experts trained in both theorem proving and the problem domain to help with proving non-trivial problems.
Observation 2. Lay games (such as chess or sudoku) which require the player to think logically or deeply are in fact formal reasoning schemes in disguise. (A lay game is simply any game which does not require extensive formal training on the order of what is required for someone to prove formal theorems in logic. For example, even if we rename all the pieces in chess, or any similar game, it is not wrong to assume that players, after some acclimatization, will revert to their skill levels in unmodified chess within a short time compared with the time they took to learn chess from scratch.)

The two observations lead us to the question of whether hard formal problems can be cast into a form, for example a natural game like chess, which removes the need for domain knowledge by completely abstracting away from the original domain. Ideally, such a game should be able to exploit a human player's insight into the structure of the game, and this insight should then be automatically transferrable into solving the problem. We present initial promising results from the Uncomputable Games project (first proposed in [13]) aimed at designing and implementing such natural games. While we have designed games for first-order logic, we present initial results for propositional logic only; the full set of games for first-order natural deduction and first-order resolution theorem proving can be found in [1]. We focus on these proof calculi for the following reasons: resolution is the proof calculus of choice for industrial-strength theorem provers, and natural deduction is the proof calculus of choice for formal proofs by humans [12]. The overall vision for this project is summarized in Figure 1.

Fig. 1: The Uncomputable Games Project Architecture. The abstract architecture for crowdsourced theorem proving. ATP systems pass problem instances (from proof discovery in mathematics and science, software and hardware verification, classical logic-based AI, etc.) to the Uncomputable Games platform, which turns them into game instances for the general crowd; game plays flow back as proofs. Gamers are not visible to the theorem-proving users, and gamers do not know about the problems they are solving.

2 Related Work

While [1] contains a more detailed discussion of related work, we quickly go through two important related systems. While General Game Playing [14] uses the machinery of first-order logic (FOL) to specify its games, the games themselves are neither designed to capture FOL nor to be used in crowdsourced theorem proving. Since each state in such games has only a finite number of moves, such games are at the propositional level and cannot capture theoremhood in FOL. Tiling problems [15] are undecidable and do capture theoremhood in FOL. This is an attractive property for us, as such systems are powerful enough to capture FOL while also being in the form of a lay "game."
Unfortunately, we cannot use tiling systems, for the following reason. Tiling problems generally ask whether or not an infinite plane can be tiled with a given set of tiles. We cannot use such systems because there is no clear notion of certification for an answer: affirmative answers have as proof either infinitely tiled planes or a proof in some non-tile system (usually informal proofs in mathematical English or formal proofs in FOL), and negative answers do not have a proof using only tiles (even an infinite number of them) and require a proof in some other system.

3 Deriving Game Criteria

Given that we want games for crowdsourcing theorem proving, we present some general criteria to aid in the design of games that might help us reach this goal. If we are lax with our requirements, we end up with trivial games that are not useful for us. One such game could just present a logic problem "Γ ⊢ φ?" directly as a game. Another such game could present to the player a non-deterministic Turing machine that accepts the string ⟨Γ, φ⟩ only if Γ ⊢ φ. (A non-deterministic Turing machine m is said to accept an input string s if there is a sequence of choices that results in an accepting state.) At any stage of the computation, a non-deterministic Turing machine has one or more choices, and such a game could challenge the player to select choices which result in an accepting state given some input! It is obvious that approaches such as these would not help us. The overarching requirement that players with no knowledge of formal logic be able to prove theorems by playing games helps us derive some general criteria such games should satisfy. We need a few simple definitions to make our criteria clearer.

3.1 A Simple Formalization

We need a simple vocabulary to help us talk clearly about the aspects of games we are interested in. Some straightforward definitions are given below:

Game Objects. Game objects are all the objects that a player can interact with and manipulate in the game. The universe of all game objects is denoted by G. Note that each game object on its own contains enough information to assemble the game at any point in time when combined with all other game objects. The game state at any point in time is a function of the game objects.

Game State. The game state at any point in time is just a set of game objects. But not every set of game objects corresponds to a meaningful configuration of the game; a game state is any valid set of game objects.

Game. A game G is a quadruple ⟨G, γ ⊆ G × 2^G, I, F⟩ composed of the universe of all game objects G, a relation γ specifying valid operations on the game objects, the initial game states I ⊆ 2^G, and the final states F ⊆ 2^G.

Game Instance. The initial game states are also called game instances.

Game Size. The game size |g| of a game instance g is simply the number of game objects in that game instance.

Game Play. A terminating game play for a game instance g1 is a sequence of game states g1, g2, ..., gn such that g_{i+1} = (g_i − {u_{g_i}}) ∪ ν(u_{g_i}), where u_{g_i} is some game object in g_i and ν(u_{g_i}) is a set of game objects such that γ(u_{g_i}, ν(u_{g_i})) holds. Game play proceeds by replacing game objects with one or more other game objects.

Captures FOL. A game captures FOL, or theorem proving in a proof calculus ρ for FOL, iff every problem Γ ⊢ γ is isomorphic with exactly one game instance g_{Γ⊢γ} and every proof for the problem is isomorphic with a terminating game play for that instance. g_{Γ⊢γ} is called the game instance for Γ ⊢ γ.
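For concreteness, the formalization above can be rendered as a small sketch in Python (our illustration, not code from the project; the names Game, GameState, and step are hypothetical). The sketch shows the game quadruple and a single game-play move, i.e. replacing one game object with the objects ν licenses for it.

```python
from typing import Callable, FrozenSet, Set

GameObject = str                     # any hashable stand-in for a game object
GameState = FrozenSet[GameObject]    # a game state is a set of game objects

class Game:
    """G = <G, gamma, I, F>: the universe of game objects, a relation gamma
    saying which replacements are valid, the initial states (game instances),
    and the final states (given here as a membership test for convenience)."""
    def __init__(self,
                 objects: Set[GameObject],
                 gamma: Callable[[GameObject, FrozenSet[GameObject]], bool],
                 initial: Set[GameState],
                 is_final: Callable[[GameState], bool]):
        self.objects, self.gamma = objects, gamma
        self.initial, self.is_final = initial, is_final

def step(state: GameState, u: GameObject,
         nu: FrozenSet[GameObject], game: Game) -> GameState:
    """One move of a game play: g_{i+1} = (g_i - {u}) ∪ ν(u),
    allowed only when u is in the state and γ(u, ν(u)) holds."""
    assert u in state and game.gamma(u, nu), "invalid move"
    return (state - {u}) | nu
```

A terminating game play is then any sequence of such steps that starts at an initial state and reaches a final state.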
3.2 Game Criteria for Naturalness and Usefulness

Any game designed to help us should satisfy the following criteria:

1. Independently Incentivized: The games themselves should be captivating and engaging enough that users play them of their own accord, without the incentive of good theorems being proved. This sets us apart from games such as FoldIt, where players deal with the problem domain directly and an incentive structure is superimposed on the domain.

2. No Knowledge Needed: In order to play the games, the players need not have any knowledge of first-order logic or of how first-order theorem proving is captured in the game. To play, users need to know only the rules of the game.

3. Non-Linguistic: Games that are linguistic in nature generally have a smaller user base. This is a vague criterion which will be sharpened through the course of this project. The general idea is that a player should not have to understand, for example, how Turing machines work in order to play the game. One way to sharpen this criterion is to stipulate that γ has physical correlates or interpretations in naïve physics (e.g. dropping objects, spatially locking objects, etc.).

4. Readable Proof: If a game play instance shows us that a theorem φ can be proved from premises Γ, the game play instance should also provide us with a readable proof. This rules out tiling problems [15], as there is no clear notion of a proof.

5. Tractable Representation: Given a problem Γ ⊢ φ? with logical signature Σ, we can set a bound on the game size. If g is the game instance which represents the problem, then we require |g| ≤ |Γ| + 1 + |Σ|. Another condition can be derived by considering the length of the smallest proof. If the minimum proof length, defined as the number of inference-rule applications, is denoted by |Γ ⊢ φ|, we require that the instance g for the problem have a terminating game play of length ‖g‖ such that ‖g‖ ≤ k · |Γ ⊢ φ|, where k is a non-negative constant (a concrete instantiation is given below).
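To make criterion 5 concrete (the instantiation is ours, not a computation from the paper): Level 7 of the game presented later has eight premise clauses over the four-variable signature {P1, P2, P3, P4}, and its shortest known refutation, the four-step human proof reported in Section 8, bounds the minimum proof length from above; the two requirements then read:

```latex
% Criterion 5 on Level 7: |Γ| = 8 clauses, |Σ| = 4 variables,
% and |Γ ⊢ φ| ≤ 4 (witnessed by the four-step human proof in Section 8).
|g| \le |\Gamma| + 1 + |\Sigma| = 8 + 1 + 4 = 13,
\qquad
\|g\| \le k \cdot |\Gamma \vdash \varphi| \le 4k .
```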
4 Catabot Rescue Games

The Catabot Rescue Games were designed with the above five criteria in mind. The games capture first-order logic with binary resolution and natural deduction. They are set in a world populated by entities called catabots, which resemble robotic caterpillars or bugs that consume objects in the game for fuel and can reproduce with other catabots. Most of the operations in the game have natural physical correlates, such as dropping objects, dissolving objects, breaking up catabots, etc. No elements of the underlying logic problem are directly shown in the game, yet a record of the game play can be directly translated into a proof in either binary resolution or natural deduction. The basic design metaphors used in these games could also be used to invent and design other games which satisfy the criteria. We present only the propositional subset of one of the games; the full set of games can be found in [1]. Before we present the game, we go through a brief introduction to the resolution rule used in theorem proving.

5 Resolution-based Theorem Proving

There are several sound and complete proof calculi for propositional logic, with different advantages and strengths. Resolution, invented by Robinson [16], is widely implemented in automated provers and logic-programming systems due to its simplicity. Compared with proof calculi adhering closely to natural deduction, resolution proofs are a bit harder for humans to read, but they have a much simpler formal structure, which makes resolution more amenable to automation. Before we present the game for resolution, we briefly review a formal specification of resolution. Given two clauses p1 ∨ ... ∨ pi ∨ ... ∨ pn and q1 ∨ ... ∨ qj ∨ ... ∨ qm, such that pi ≡ r and qj ≡ ¬r, where r is a propositional variable, the binary resolution rule produces:

p1 ∨ ... ∨ p_{i−1} ∨ p_{i+1} ∨ ... ∨ pn ∨ q1 ∨ ... ∨ q_{j−1} ∨ q_{j+1} ∨ ... ∨ qm

A proof of Γ ⊢ φ is obtained via resolution when there is a series of binary resolution steps resulting in the empty clause (denoting false), using only the clausal form of the sentences in Γ ∪ {¬φ} and any sentence generated by earlier applications of the rule.
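The rule is easy to state operationally. Below is a minimal Python rendering (ours, for illustration only): a clause is a set of non-zero integers, with n standing for Pn and -n for ¬Pn, so duplicate literals merge automatically because clauses are sets.

```python
from typing import FrozenSet, Optional

Clause = FrozenSet[int]        # e.g. frozenset({3, -4}) stands for P3 ∨ ¬P4
EMPTY: Clause = frozenset()    # the empty clause, i.e. false

def resolve(c1: Clause, c2: Clause, var: int) -> Optional[Clause]:
    """Binary resolution of c1 and c2 on the variable `var`, which must occur
    positively in c1 and negatively in c2. Returns the resolvent, or None if
    the rule does not apply."""
    if var not in c1 or -var not in c2:
        return None
    return (c1 - {var}) | (c2 - {-var})
```

A refutation of Γ ∪ {¬φ} is then a sequence of such steps ending in EMPTY; the Level 7 walkthrough in Section 8 below uses this sketch to check the four-step human proof.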
We now present the Catabot Rescue Game designed to capture propositional resolution.

6 Catabot Rescue Game 0

The Catabot Rescue Game 0, CRG0, captures propositional resolution. The catabots in the game are isomorphic with clauses in a propositional language, and the game process is isomorphic with the steps in a resolution proof. The game satisfies the criteria defined above and adheres to the formalization presented. We start with a description of the game world. The backstory for the game, given below, is presented to the user along with a figure similar to Figure 4:

There are a bunch of catabots imprisoned in a room by a horrible blob. The room has a small door to the outside. None of the catabots are small enough to use the door. Only a body-less catabot with just the head can go through the door! Catabots can reproduce and download their minds to any child's body (or any other catabot). Reproduction has certain rules. Help them escape by mating them in a fashion so as to produce a body-less catabot.

Each catabot has a head and a body composed of zero or more blocks. The blocks have a certain number of limbs or legs of either positive or negative polarity, such as those shown in Figure 2.

Fig. 2: CRG0 Blocks. Blocks with different polarities (the literals P3 and ¬P2 are shown, in both visual and alternate representations).

The catabots represent clauses, and the blocks on the body represent different atoms. Assume the supply of propositional variables is {P1, ..., Pk}. Then, e.g., Pn would be a block with n limbs of positive polarity and ¬Pn a block with n limbs of negative polarity. Figure 2 shows the literals P3 and ¬P2. With this representation it is easier to visualize binary resolution operating on two clauses. Two simple examples are shown in Figure 3.

Fig. 3: CRG0 Mating Example. Two mating examples (P3 ∨ P2 mated with ¬P3 yields P2; ¬P2 ∨ P4 mated with P3 ∨ P2 yields P3 ∨ P4). The children are shown to the left of the parents. The children have all the blocks from the parents, save for the pair of blocks which were mated in the parents.

6.1 Mating Rules

Catabot mating rules are very simple and correspond to propositional binary resolution. The rules for catabot reproduction are given below:

Mated Pairs Cancel. The user can choose any two catabots for mating. Two catabots can mate if and only if they carry a pair of blocks with the same number of limbs and differing polarities; such a pair is called a mated pair.

Child Catabot. The child is a catabot with all the same blocks as its parents, except for the mated blocks.

Same Blocks Merge. If two blocks on different parents are visually the same, they merge into one block in the child.

7 Initial Results

An initial industrial-strength implementation of a subset of the operations in CRG0 is available for iPads and the Web. (The web version is available at http://compete.catabotrescue.com and the iPad version at https://itunes.apple.com/us/app/catabot-rescue-lite/id645249674?ls=1&mt=8. Please note that the iPad version is not compatible with more recent versions of the iPad; we suggest that reviewers use the web version.) Figure 4 shows a screenshot of the game.

Fig. 4: Catabot Rescue Game. A screenshot of Catabot Rescue Game 0.

This initial version has 50 levels, of which 44 were generated automatically (the first six levels are tutorial levels). The generation of the problems is as follows. We randomly generate a certain number of clauses, each having a minimum number min and a maximum number max of literals. Given that we want n clauses, we sample the discrete uniform distribution on [min, max] n times to generate the size of each clause. Then, for each clause, we uniformly select literals from the set of all possible literals. We consider only the literals {p1, p2, p3, p4, ¬p1, ¬p2, ¬p3, ¬p4}. Most of the clause sets generated in this fashion do not result in a contradiction, so the Prover9 automated theorem prover [17] was used to filter out sets of clauses which did not lead to a refutation. Prover9's ability to search for multiple proofs was used, and problems for which the smallest proof was below a certain length were filtered out. Prover9 also reports the depth of the resolution proof tree; problems with proofs less than four levels deep were removed. This process produces quite complex problems.
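The generation procedure just described can be sketched as follows (our reconstruction; the function name and parameters are hypothetical, and the Prover9 filtering step is only indicated in a comment, since the paper's actual generator is not reproduced here):

```python
import random
from typing import FrozenSet, List

# The eight literals used by the generator: p1..p4 and their negations,
# encoded as non-zero integers as in the resolution sketch above.
LITERALS = [1, 2, 3, 4, -1, -2, -3, -4]

def random_clause_set(n_clauses: int, min_lits: int, max_lits: int,
                      rng: random.Random) -> List[FrozenSet[int]]:
    """Sample n_clauses clauses: each clause size is drawn uniformly from
    [min_lits, max_lits], and each literal uniformly from LITERALS."""
    clauses = []
    for _ in range(n_clauses):
        size = rng.randint(min_lits, max_lits)
        clauses.append(frozenset(rng.choice(LITERALS) for _ in range(size)))
    return clauses

# Candidate sets are then filtered with Prover9: keep only unsatisfiable sets
# whose shortest refutation is long enough and whose proof tree is at least
# four levels deep (that external Prover9 call is omitted in this sketch).
```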
Though the problems are quite complex and hard, users without any formal training in logic found the levels engaging and were able to solve them. The scoring mechanism for the overall game is in terms of dollar rewards for the user: upon completion of a level n with k operations, the total reward for the user becomes $(n + 100 · 1/k).

8 Data from Game Plays

We conducted two experiments with the game. In the first experiment, we released the game to the public and got game-play results from players who were mostly untrained in logic. While we do not have exact numbers on the fraction of the players that were trained in logic, this experiment is meant to mirror the conditions we might eventually face if the games are used on a large scale. In the second experiment, we held a contest in an introductory logic class. The students participating in the class were trained in propositional logic but had no knowledge of resolution.

8.1 Experiment 1: Untrained Humans

Over the course of a few days (fewer than ten) after the release of the iPad version, the game was downloaded around 60 times. This resulted in 400 terminating game plays collected from around 15 players. The total number of players is not exact, as users generally hand over played levels to other players and sometimes reset levels to try again and get a better score. We are not interested in metrics which require an accurate count of the number of users: ultimately, when we want to solve a very hard theorem through games, we only need one single player or team to solve the corresponding game instance. Among all the players, only four were able to solve all the levels. Among these four, we were able to verify that at least two did not have any training in formal logic. Of these two players, one was able to produce smaller proofs than Prover9 for all of the 44 levels save one. The other players who completed all the levels also produced proofs smaller than Prover9's for some of the levels.

8.2 Experiment 2: Partially Trained Humans

Over the course of a week we had 1766 terminating game plays from around 80 users. Of all these users, 22 completed all the levels.

8.3 Results

Fig. 5: Prover9 versus humans. Shortest proofs for the 44 levels: minimum proof length by Prover9+SNARK, by untrained humans, and by partially trained humans.

Figure 5 shows the minimum proof lengths produced by untrained humans, partially trained humans, and Prover9 and SNARK combined. Quite surprisingly, minimally trained humans were able to find proofs smaller than those of the theorem provers. It should be noted that the two players from the first experiment who did not have any training in formal logic were able to solve problems as hard as the one given below, which is represented in Level 50 of the game. We need to derive the empty clause from this given set of clauses:

{ P1 ∨ P3 ∨ ¬P2 ∨ P4,  P4 ∨ ¬P2 ∨ ¬P1,  ¬P4 ∨ ¬P3 ∨ ¬P2 ∨ P1,  ¬P2 ∨ P4 ∨ ¬P3,  P2 ∨ ¬P1,  P3 ∨ P2,  ¬P4 ∨ P1,  P1 ∨ P2,  P1 ∨ ¬P4 ∨ P3 ∨ ¬P2 }

Illustration of a Human-Dominated Level. We now walk through a simple level in which it is easy to see how humans performed better than the ATP. The level we illustrate is Level 7 in the game.
The logic problem represented is given below:

{ ¬P1 ∨ ¬P3,  P4 ∨ P3,  P2 ∨ ¬P4,  ¬P3 ∨ ¬P2,  P2 ∨ P4,  P3 ∨ ¬P4,  P3 ∨ ¬P1 ∨ ¬P2,  ¬P3 ∨ P1 }

Figures 6, 7, 8, and 9 show one possible proof of only four steps. Prover9 was unable to find any proof with fewer than six binary resolution steps. Though Prover9's configuration for the maximum number of proofs to search for was set at 50, Prover9 generated only two proofs; the two proofs are shown in Figure 10. We also ran the SNARK [18] ATP on this problem. While SNARK does have the provision to search for multiple proofs or to produce only binary resolution proofs, it is easy to see that the proof by SNARK shown in Figure 11 does not translate into a resolution proof of fewer than five steps.

Fig. 6: Step 1 of the Shortest Human Proof for Level 7. Catabots corresponding to the premises P3 ∨ ¬P4 and P4 ∨ P3 combine to give the catabot for P3.

Fig. 7: Step 2 of the Shortest Human Proof for Level 7. Use the catabot corresponding to P3 from the previous step with the catabot for premise ¬P3 ∨ ¬P2 to get ¬P2.

Fig. 8: Step 3 of the Shortest Human Proof for Level 7. Combine catabots for premises P2 ∨ ¬P4 and P2 ∨ P4 to get the catabot for P2. Note: the two blocks for P2 are automatically merged into one block in the child catabot.

Fig. 9: Step 4 of the Shortest Human Proof for Level 7. Finally, mate the catabots for P2 and ¬P2 obtained in the last two steps. This gives us a body-less catabot representing false. This catabot then escapes!
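Using the clause encoding from the resolution sketch in Section 5, the four-step human proof pictured in Figures 6–9 can be checked mechanically. This is an illustrative script of ours, not part of the game; `resolve` is restated here (without the applicability check) so the snippet runs on its own.

```python
def resolve(c1, c2, var):
    """Binary resolution on `var` (positive in c1, negated in c2); see Section 5."""
    return (c1 - {var}) | (c2 - {-var})

# Premises of Level 7 used by the human proof (n stands for Pn, -n for ¬Pn).
p3_or_not_p4     = frozenset({3, -4})
p4_or_p3         = frozenset({4, 3})
not_p3_or_not_p2 = frozenset({-3, -2})
p2_or_not_p4     = frozenset({2, -4})
p2_or_p4         = frozenset({2, 4})

s1 = resolve(p4_or_p3, p3_or_not_p4, 4)    # Step 1: P3 (the two P3 blocks merge)
s2 = resolve(s1, not_p3_or_not_p2, 3)      # Step 2: ¬P2
s3 = resolve(p2_or_p4, p2_or_not_p4, 4)    # Step 3: P2 (merged)
s4 = resolve(s3, s2, 2)                    # Step 4: the empty clause

assert s1 == frozenset({3}) and s2 == frozenset({-2})
assert s3 == frozenset({2}) and s4 == frozenset()   # refutation in four steps
```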
Fig. 10: Prover9's two proofs for Level 7. Both proofs are longer than the shortest proof found by humans.

Proof 1 (length 11, level 4):
2  -P1 | -P3.  [assumption].
3  P4 | P3.  [assumption].
5  -P3 | -P2.  [assumption].
6  P2 | P4.  [assumption].
7  P3 | -P4.  [assumption].
9  -P4 | P1.  [assumption].
10  -P2 | P4.  [resolve(5,a,3,b)].
11A  P4 | P4.  [resolve(10,a,6,a)].
11  P4.  [copy(11A),merge(b)].
12  P1.  [resolve(11,a,9,a)].
13  P3.  [resolve(11,a,7,b)].
15A  -P3.  [resolve(12,a,2,a)].
15  $F.  [resolve(13,a,15A,a)].

Proof 2 (length 10, level 4):
3  P4 | P3.  [assumption].
4  P2 | -P4.  [assumption].
5  -P3 | -P2.  [assumption].
6  P2 | P4.  [assumption].
7  P3 | -P4.  [assumption].
10  -P2 | P4.  [resolve(5,a,3,b)].
11A  P4 | P4.  [resolve(10,a,6,a)].
11  P4.  [copy(11A),merge(b)].
13  P3.  [resolve(11,a,7,b)].
14  P2.  [resolve(11,a,4,b)].
16A  -P2.  [resolve(13,a,5,a)].
16  $F.  [resolve(14,a,16A,a)].

Fig. 11: SNARK on Level 7. The proof from the SNARK ATP is still longer than the shortest proof found by humans.

9 Observations

While nowhere close to a full validation of our goal, these initial results give us some promise that such games could enable crowds of untrained humans to help machines solve hard problems. It is quite notable that humans without much formal training in logic can produce proofs shorter than those produced by a machine in a calculus specialized for machine reasoning. This is remarkable because, in addition to propositional theorem proving being hard, many associated problems in propositional logic are hard: for example, determining whether there is a proof of a certain length for a tautology, determining the length of the shortest proof for a given tautology, and finding the shortest proof given the length of the shortest proof are all believed to be quite hard [19,20]. These results, while humble, indicate that theorem provers might be able to benefit from untrained humans. Future experiments will involve a more extensive set of problems (e.g. bins of problems of the same minimum proof length) in order to draw statistically stronger conclusions comparing human performance with that of machines.

10 Data and Source Code

1. The data from Experiment 2 can be downloaded from: https://s3.amazonaws.com/www.catabotrescue.com/data/AnonoynmizedLevelCompletion.json
2. The cross-platform source code for the game is available at: https://github.com/naveensundarg/catabotrescue-unity
3. The game can be played at: http://compete.catabotrescue.com

References

1. Govindarajulu, N.S.: Uncomputable Games: Games for Crowdsourcing Formal Reasoning. PhD thesis, Rensselaer Polytechnic Institute (RPI) (2013). Available at: https://s3.amazonaws.com/naveensundarg/Dissertation.pdf
2. Naumowicz, A., Kornilowicz, A.: A Brief Overview of Mizar. In Berghofer, S., Nipkow, T., Urban, C., Wenzel, M., eds.: Theorem Proving in Higher Order Logics. Volume 5674 of Lecture Notes in Computer Science (LNCS).
Springer, Berlin (2009) 67–72
3. Wos, L.: The Legacy of a Great Researcher. In Bonacina, M.P., Stickel, M.E., eds.: Automated Reasoning and Mathematics: Essays in Memory of William McCune. Springer, Berlin (2013) 1–14
4. Govindarajulu, N.S., Licato, J., Bringsjord, S.: Small Steps toward Hypercomputation via Infinitary Machine Proof Verification and Proof Generation. In Mauri, G., Dennunzio, A., Manzoni, L., Porreca, A., eds.: Unconventional Computation and Natural Computation. Volume 7956 of Lecture Notes in Computer Science (LNCS). Springer, Berlin (2013) 102–112
5. Govindarajulu, N.S., Bringsjord, S., Taylor, J.: Proof Verification and Proof Discovery for Relativity. Synthese 192 (2015) 2077–2094
6. Dean, D.: Crowd Sourced Formal Verification (CSFV) (2013). Retrieved on July 26, 2013. http://www.darpa.mil/Our Work/I2O/Programs/Crowd Sourced Formal Verification (CSFV).aspx
7. Almeida, J.B., Frade, M.J., Pinto, J.S., de Sousa, S.M.: Rigorous Software Development: An Introduction to Program Verification (Undergraduate Topics in Computer Science). Springer, Berlin (2011)
8. Charniak, E., McDermott, D.: Introduction to Artificial Intelligence. Addison-Wesley, Reading (1985)
9. Genesereth, M., Nilsson, N.J.: Logical Foundations of Artificial Intelligence. Morgan Kaufmann, Los Altos (1987)
10. Bringsjord, S.: The Logicist Manifesto: At Long Last Let Logic-Based AI Become a Field Unto Itself. Journal of Applied Logic 6(4) (2008) 502–525
11. Delahaye, D.: A Tactic Language for the System Coq. In: Logic for Programming and Automated Reasoning, Springer (2000) 85–95
12. Arkoudas, K.: Denotational Proof Languages. PhD thesis, MIT (2000)
13. Govindarajulu, N.S.: Uncomputable Games: Toward Crowd-sourced Solving of Truly Difficult Problems. In: Turing Centenary Conference 2012: How the World Computes (Abstracts of Informal Presentations). CiE (2012) 63. http://www.mathcomp.leeds.ac.uk/turing2012/WScie12/Images/abstracts-booklet.pdf
14. Genesereth, M., Love, N., Pell, B.: General Game Playing: Overview of the AAAI Competition. AI Magazine 26(2) (2005) 62–72
15. Berger, R.: The Undecidability of the Domino Problem. Memoirs of the American Mathematical Society 66 (1966) 1–73
16. Robinson, J.A.: A Machine-Oriented Logic Based on the Resolution Principle. Journal of the ACM 12(1) (1965) 23–41
17. McCune, W.: Prover9 and Mace4 (2013). Retrieved on April 25, 2016. http://www.cs.unm.edu/~mccune/prover9/
18. Stickel, M.E.: SNARK - SRI's New Automated Reasoning Kit (2008). Retrieved on April 25, 2016. http://www.ai.sri.com/~stickel/snark.html
19. Alekhnovich, M., Buss, S., Moran, S., Pitassi, T.: Minimum Propositional Proof Length is NP-Hard to Linearly Approximate. Journal of Symbolic Logic 66(1) (2001) 171–191
20. Buss, S.R.: Some Remarks on Lengths of Propositional Proofs. Archive for Mathematical Logic 34(6) (1995) 377–394