Prover-independent Axiom Selection for Automated Theorem Proving in Ontohub Eugen Kuksa Till Mossakowski University of Bremen University of Magdeburg Bremen, Germany Magdeburg, Germany eugenk@informatik.uni-bremen.de till@iws.cs.uni-magdeburg.de Abstract In recent years, formalised large theories in form of ontologies, system specifications or mathematical libraries with thousands of sentences have become more and more common. Automatically proving prop- erties in such theories is an important task in research as well as in industry, for instance, as part of system specifications. However, auto- mated theorem provers struggle finding proofs when there are too many axioms available. This paper presents and investigates two approaches to reduce the load on automated theorem provers by preselecting ax- ioms: 1) a generalisation of the SInE selection heuristics to arbitrary first-order provers, and 2) frequent item set mining. These selection methods are implemented in the web application Ontohub. One of the implementations’ usefulness is validated while others’ is refuted by an evaluation on well-known benchmarks for automated theorem proving. 1 Introduction In recent years, the ability of systems to reason over large logical theories has become more and more important. Large theories, i.e. theories with many sentences and symbols, are becoming more common since large knowledge bases are arising in forms of ontologies and comprehensive mathematical libraries that have been translated to formal languages that are suitable for automated reasoning. At the same time, theorem provers have evolved to feature efficient heuristics for finding proofs. These theorem provers, however, were traditionally designed for working with small to medium-sized collections of sentences, for example, an axiomatisation of set theory or of algebraic constructs like groups or vector spaces. Supplying these provers with larger theories reduces their chance of success because the search space grows exponentially with the number of sentences. In most cases, only a few axioms are actually needed for proving a theorem in a theory containing thousands of axioms. Instead of only focusing on the improvement of the proving algorithm and the prover’s heuristics, one can also reduce proving time by supplying the prover with a modified theory omitting sentences that are not needed. This significantly reduces the search space: instead of thousands of axioms, there might only be a few dozens or hundred axioms available for proving. This work presents an implementation of two axiom selection algorithms that is not bound to a specific prover but rather works independently of the prover. This paper is structured as follows: Section 2 provides basic knowledge of the concepts and systems used in this work, namely the software tools Ontohub and Hets (suited for proving with first-order theories) and Copyright c by the paper’s authors. Copying permitted for private and academic purposes. In: P. Fontaine, S. Schulz, J. Urban (eds.): Proceedings of the 5th Workshop on Practical Aspects of Automated Reasoning (PAAR 2016), Coimbra, Portugal, 02-07-2016, published at http://ceur-ws.org 56 the AI method of frequent item set mining (which we will employ for axiom selection). Section 3 presents the axiom selection algorithms whose implementation is described briefly in section 4 and evaluated in section 5. Afterwards, section 6 gives an overview of related work while section 7 concludes this paper. 2 Preliminaries 2.1 Ontohub Ontohub [12] is an open source repository engine for managing distributed logical theories. While Ontohub focuses on ontologies, it supports arbitrary logical theories, let them be ontologies, models, specifications, or others. The distributed nature enables communities to share and exchange their contributions easily. Ontohub supports (among others) first-order logic based on the TPTP format [26] It also supports various structuring and modularity constructs and inter-theory (concept) mappings, building on the OMG-standardized DOL language. Ontohub has several distinctive features. The following list gives a short overview of them. • Different repositories organise logical theories. For each repository, access rights and ownership can be managed on a per-user or per-team basis. • Logical theories (and all other files) are versioned. The underlying git1 [27] version control system is an integral part of Ontohub. This feature allows the user to browse through the history of the repository and inspect any change in any file. • Ontohub features an integrated editor for logic theories, which however may also be edited outside of Ontohub: files can either be uploaded via the web front end or via git with SSH. • Ontohub is fully linked-data compliant. One and the same URL can be used for referencing a logical theory, downloading its defining file or examining it in a user-friendly presentation in the browser. The representation served by Ontohub depends on the requested MIME-type. • Logical theories can be modular and distributed, which means they can be structured into many simpler theories depending on each other. • As in BioPortal [17] and the NeOn Toolkit [8], logical theories can be aligned and combined along alignments. • Logical relations like interpretations of theories, conservative extensions, etc. between logical theories are supported. • Diverse structuring languages are supported, including DOL2 [15] and CASL [4]. • As a novel contribution of this work, logical theories may contain conjectures which Ontohub recognises. There is an interface for proving these with different parametrisations. 2.2 Hets Ontohub itself does only little analysis of logical theories. Instead, we use the Heterogeneous Tool Set (Hets)3 [14] as a back end to parse and evaluate a logical theory. Hets features a RESTful web API: As soon as a logical theory is sent to Hets’s HTTP server with the command to parse it, it analyses the theory file and answers the request with an XML or JSON representation of its details. Ontohub then parses the answer, post-processes it and then stores the theory’s details in the database. The whole evaluation process is done asynchronously in the background, so that the user can still work with the web application. Currently, Hets has an interface to several first-order automated provers (ATPs) as, for example, SPASS [28], E [21], Darwin [5], and via the MathServe Broker [30]: Vampire [19]. Similar to the analysis of logical theories, Hets can be called to run a proof attempt via the RESTful web API. Hets also features user interfaces other than the web API, for example, a graphical user interface, a stateless command line interface or an interactive command prompt, but we do not use them in the context of Ontohub. 1 Git is a popular decentralised version control system, usually utilised in software development. Many open source software projects use git in combination with Github: https://github.com (last checked: 13 June 2016). 2 Details on DOL and its background can be found on http://dol-omg.org (last checked: 13 June 2016). 3 Details on HETS can be found on http://hets.eu (last checked: 13 June 2016). 57 Before starting this work, Ontohub was not able to recognise conjectures in a logical theory. Hets, however, already provided proving functionality and thus, its analysis of the theory already discovered conjectures. How- ever, Hets’s proving interface has not been suitable for the use with Ontohub yet. For the preparation of the axiom selection, adjustments of Hets’s interface and the creation of an infrastructure for proving conjectures in Ontohub were required. By this work, Ontohub finds conjectures (called theorems for simplicity) and allows the user to run proof attempts with each of Hets’s ATP that are suitable for the logic translation’s target logic. For each finished proof attempt, Ontohub provides an overview of its details like the time it took, the axioms that were used and the prover’s output. It is even possible to run multiple proof attempts in parallel with different ATP which yields a higher success rate [6]. 2.3 Frequent Item Set Mining Frequent item set mining [1] is a method commonly used for market basket analysis. It aims at revealing regularities in the shopping behaviour of customers of, for example, supermarkets and online shops: It finds products that are usually bought together. From a more technical perspective, it can also be used for dependence analysis. In the context of axiom selection, our approach is to find sentences that have symbols in common and hence are related to each other. This section briefly presents the basic notions of frequent item set mining. Details of implementations are not relevant at this point. The following definitions are based on Borgelt’s lecture slides [7] on frequent item set mining. Definition 2.1 Let B = {i1 , . . . , im } be a set of items. B is called the item base. A subset I ⊆ B of the item base is called an item set. A multiset4 T = {t1 , . . . , tn } with ∀m ∈ {1, . . . , k} : tm ⊆ B is called a transaction database over B. An element tm of T is called a transaction over B. Items are these entities whose occurrence together with other entities is analysed, for example, shopping products. A transaction database can be, for example, a collection of purchases in a shop over a given period of time. Note that the same transaction can occur multiple times in a transaction database because it is defined as a multiset. In practice, the item base is often given implicitly as the union of all the item sets in the transaction database. Definition 2.2 Let T = {t1 , . . . , tn } be a transaction database, I ⊆ B an item set of an item base. A transaction t ∈ T covers I iff I ⊆ t. We define the multiset KT (I) ..= {t ∈ T | I ⊆ t} with the characteristic ∀t ∈ KT (I) : multiplicityKT (I) (t) = multiplicityT (t) meaning that each element of KT (I) occurs as many times in KT (I) as it occurs in T . KT (I) is called the cover of I with respect to T . The cover is the multiset of transactions that cover the given item set. It is a tool for defining two of the central notions of this topic: the absolute and the relative support. Definition 2.3 sT (I) ..= | KT (I) | is the absolute support of I with respect to T. . 1 σT (I) .= n · | KT (I) | is the relative support of I with respect to T . The support is the number or fraction of transactions in the transaction database that cover an item set. It is also called the absolute or relative frequency. Definition 2.4 Given the size of the transaction base n ∈ N an item base B = {i1 , . . . , im } a transaction database T = {t1 , . . . , tn } and the minimum support of either smin ∈ {0, . . . , n} ⊂ N or σmin ∈ [0, 1] ⊂ R 4 A multiset (also called a bag) M is an unordered collection of elements. It can contain the same elements multiple times. How many times an element is included is given by the function multiplicityM : M → N. 58 frequent item set mining is about finding the set of frequent item sets: either FT (smin ) = {I ⊆ B | sT (I) ≥ smin } or analogously ΦT (σmin ) = {I ⊆ B | σT (I) ≥ σmin } There are algorithms which compute the set of frequent item sets efficiently. In the worst case, the time complexity is exponential in the number of items. However, algorithms used in practice, like FP-Growth [9], Eclat [29] and Apriori [2], prune the search space in a way that mining is not only feasible but also finishes quickly with real world data. Except for some special cases, FP-Growth is the fastest algorithm in general [7]. 3 Axiom Selection Algorithms The basic idea of many selection methods is to find sentences that share a symbol with the already selected sentences. Standing out, SInE concentrates on the characteristics of definitions: They describe more special terms by using more general terms. It assumes that the more general terms occur more often in a theory. An entirely different observation we made when manually solving mathematics-textbook exercises is that sentences are probably relevant if they share many symbols with the conjecture or the already selected sentences. This section covers axiom selection ideas based on this observation which are novel contributions of this work. In the sequel, let symbols(ϕ) denote the set of non-logical symbols of a first-order sentence ϕ. 3.1 SInE SInE operates heuristically on the syntax of first-order sentences and was first implemented in the Vampire Prover. It was first developed in Hoder’s masters thesis ‘Automated Reasoning in Large Knowledge Bases’. It entered the public as an implementation in Python with the Vampire V9 and E 0.999 as a back end at CASC- J45 [23]. Later, an implementation was included in both, the Vampire Prover and the E Theorem Prover. A detailed description of the algorithm can be found in [10]. This section only reiterates the original work while being formulated slightly differently since the definitions are also needed for the axiom selection methods with frequent item set mining. Definition 3.1 Let k ∈ N, s be a symbol, ϕ, ψ be sentences. Furthermore, let ψ be the conjecture. Let trigger ⊆ Sym × Sen be an arbitrary relation. 1. s is 0-step triggered iff s ∈ symbols(ψ). 2. ϕ is k + 1-step triggered iff s is k-step triggered and trigger(s, ϕ). 3. s is k-step triggered iff there is a k-step triggered ϕ and s ∈ symbols(ϕ). This recursive definition can be illustrated as follows: 3 trigger 3 trigger 3 ψ s0 ϕ1 s1 ϕ2 ... We now define a trigger relation as needed for Def. 3.1. Definition 3.2 The commonness of a symbol s, denoted by occ(s), is the number of axioms which contain s. Note that (quantified) variables are not treated as symbols. Definition 3.3 Let s be a symbol, ϕ be an axiom, 1 ≤ t ∈ R the tolerance parameter, 1 ≤ g ∈ N the generality threshold parameter. Let furthermore slcs ∈ symbols(ϕ) with for all s ∈ symbols(ϕ) : occ(slcs ) ≤ occ(s) be the least common symbol of ϕ. The trigger relation of SInE is defined as: trigger(s, ϕ) iff s ∈ symbols(ϕ) and (either occ(s) ≤ g or occ(s) ≤ t · occ(slcs )). This means that either a sentence is triggered because it contains very uncommon symbols (due to the generality threshold; occ(s) ≤ g) or it is triggered by its least common symbol (or as close to it as the tolerance allows; occ(s) ≤ t · occ(slcs )). SInE then selects all axioms that are at most d-step triggered given a d ∈ N∪{∞} as the depth limit parameter. 5 The used SInE version and more information about it is available at http://www.cs.man.ac.uk/ hoderk/sine/. Last checked: ~ 13 June 2016. 59 3.2 Simple Axiom Selection using Frequent Item Set Mining Finding sentences that share many symbols with the conjecture is similar to finding frequent item sets: Frequent item set mining answers the question which items occur often in combination. Definition 3.4 Let smin ∈ N and σmin ∈ [0, 1] ⊆ R be a minimum absolute respectively relative support. Let T = (A, C) be a theory consisting of a set of axioms A and a set of conjectures C. Let the multiset T = {symbols(ϕ) | ϕ ∈ A ∪ C} be a transaction database and let the set [ B= symbols(ϕ) ϕ∈A∪C be the item set for frequent item set mining. Then, the frequent symbol sets are the elements of either FT (smin ) = {I ⊆ B | sT (I) ≥ smin } or equivalently ΦT (σmin ) = {I ⊆ B | σT (I) ≥ σmin } Frequent symbol set mining helps finding the symbols that occur together in a sentence and that occur in the conjecture or the already selected sentences as well. One can then select the sentences that have all the symbols of the frequent symbol sets. This would, however, ignore the short sentences which have less symbols than the absolute minimum support. To consider them as well, one can select all the short sentences that share a symbol with the already selected sentences. Definition 3.5 Let F be a set of frequent symbol sets for the corresponding theory (A, C) with a defined minimum support. Let s be a symbol and ϕ ∈ A be an axiom while ψ ∈ C is the conjecture. Let 2 ≤ n ∈ N be a minimal symbol set size, t ∈ N be a short axiom tolerance, d ∈ N ∪ {∞} be a depth limit. 1. ϕ is 0-step selected iff | symbols(ϕ) | < n + t. 2. s is 0-step selected iff s ∈ symbols(ψ). 3. S ∈ F is selection-enabled iff n ≤ | S |. 4. ϕ is k + 1-step selected iff there is some S ∈ F with: S is selection-enabled and S ⊆ symbols(ϕ) and s ∈ S for a k-step selected symbol s. 5. s is k-step selected iff ϕ is k-step selected and s ∈ symbols(ϕ), for k ≥ 1. Then, the union of j-step selected axioms for all j ≤ d form the final axiom selection. This definition describes a simple axiom selection idea that was implemented in Ontohub in this work. The recursive definition can be illustrated as follows: F F ∈ ∈ 3 3 3 ψ s0 ∈ S0 ⊆ symbols(ϕ1 ) s1 ∈ S1 ⊆ symbols(ϕ2 ) ... 3.3 SInE with Frequent Symbol Set Mining Another idea to circumvent the problem of short sentences is to combine frequent item set mining with SInE. The k-step-trigger relation from Def. 3.1 is modified. Apart from this definition, the SInE algorithm is used as is. Definition 3.6 Let k ∈ N, s, s0 be symbols, ϕ, ψ be sentences. Furthermore, let ψ be the conjecture, trigger ⊆ Sym × Sen be a relation and F be the set of frequent symbol sets for the corresponding theory with a defined minimum support. Let 1 ≤ σ ∈ R be a tolerance value. 60 1. s is 0-step triggered iff s ∈ symbols(ψ). 2. s is k-step-trigger-enabled iff there is an S ∈ F : {sk , s} ⊆ S with sk being k-step triggered and for all s0 ∈ S : occ(s0 ) · σ ≥ occ(smcs ) where smcs ∈ S is the most common symbol of S. 3. ϕ is k + 1-step triggered iff s is k-step-trigger-enabled and trigger(s, ϕ). 4. s is k-step triggered iff ϕ is k-step triggered and s ∈ symbols(ϕ). This recursive definition can be illustrated as follows (note that the condition involving smcs is not illustrated): F F ∈ ∈ 3 trigger 3 trigger 3 ψ s00 ∈ S0 3 s0 ϕ1 s01 ∈ S1 3 s1 ϕ2 ... This definition says that not the symbols that occur in a triggered sentence are allowed to trigger further axioms. Instead, those symbols are allowed that occur in a frequent symbol set in combination with a symbol from a triggered sentence. Additionally, only the most common symbols (with a tolerance) of such a frequent symbol set may trigger an axiom. Specifying a high tolerance value σ allows any symbol from such a set to trigger. The evaluation of SInE showed that there were still many axioms selected and the idea was to reduce the selection even further. 4 Implementation The described axiom selection methods were implemented in Ontohub which is based on the web framework Ruby on Rails. One of Ontohub’s basic features is the analysis of logical theories which recognises sentences and symbols inside the sentences. These are stored in a relational database and queried when an axiom selection heuristic is invoked. In our implementation, association objects are stored in the database such that SInE’s commonness values and the frequent symbol sets don’t need to be re-computed. Furthermore, the minimal tolerance parameter of SInE for each axiom/symbol combination for the symbol to trigger the axiom is saved. This way, the axiom selection of consecutive proof attempts with different parametrisations of our selection methods or for different conjectures in the same logical theory is very efficient. The axiom selections that are using frequent symbol set mining are based on the mining algorithm FP-Growth. The FP-Growth application itself is implemented in C by Borgelt6 [7] and invoked from Ontohub’s Ruby code. 5 Evaluation 5.1 Setup The axiom selection was evaluated on first-order logic with the MPTP20787 [3] benchmark which is used in the TPTP format and is based on the Mizar Mathematical Library (MML) [20]. Each problem of the MPTP2078 benchmark comes in a ‘bushy’ and a ‘chainy’ version. The ‘bushy’ version contains only the axioms that are used in a human-made proof while the ‘chainy’ version contains all axioms that were introduced in the MML before the conjecture. Proof attempts were started for each problem in both versions without any axiom selection. Their results are used as a baseline: The problems that could not be solved in the bushy category are considered too hard for the used ATP with the given resources because the bushy ones only contain axioms which contribute to a known human-made proof. The other ones are the ones whose corresponding chainy version might be solved with an appropriate axiom selection. For evaluating the axiom selection methods, only the chainy versions were used. The problems that could be solved in the chainy category without an axiom selection are supposed to be also provable with an axiom selection. An axiom selection is 6 The FP-Growth implementations sources or binaries for various systems can be downloaded from http://borgelt.net/fpgrowth. html (last checked: 13 June 2016). 7 Information about the MPTP2078 benchmark is available on http://wiki.mizar.org/twiki/bin/view/Mizar/MpTP2078 (last checked: 13 June 2016). 61 SInE with Simple Axiom Selection Frequent Symbol using Frequent Item Set parameter SInE Set Mining Mining generality threshold g 0 0 - depth limit d 5, ∞ ∞ ∞ tolerance value t 1.0, 1.2, 1.5, 2.0 1.0, 2.0 - symbol set tolerance σ - 1.0, 2.0 - min. absolute support smin - 1, 2 1, 2 short axiom tolerance t - - 0, 2 minimal symbol set size n - - 2 Table 1: Parameters used in the evaluation for each axiom selection method considered not effective if fewer conjectures can be proven using it than without using it in the chainy category. It is possible that a selection hides axioms that are relevant for a proof and, hence, leads to no conclusion. However, a well performing selection should enable proving more conjectures by selecting all the relevant axioms and hiding only irrelevant ones. The experiments were conducted on an instance of Ontohub specifically set up for this work. This instance runs on a machine with two Intel R Xeon R E5-2667 v2 processors with HyperThreading (resulting in 32 logical processor cores) typically clocked at 3.3 GHz along with 256 GB of RAM. For analysis and proving, 30 Hets instances were used in parallel. This puts heavy load on 30 of the cores. The asynchronous (to the web interface) processing of Hets responses is done by another 30 Ruby processes. Their workload alternates with that of the Hets processes, such that no additional cores are used extensively at the same time. The web user interface’s workload is very low. During the evaluation, it is only called by Hets to fetch the theory files and it does not put heavy use to another core. In total 30 cores were used completely and one additional core was used with little load. This leaves one spare processor core for other tasks. During the experiments, processor utilisation of around 70% was observed. For the benchmark, SPASS v3.7 and the E Theorem Prover 1.8 were used. These are among the best-rated first-order provers. Since E features an own implementation of SInE, this was not activated by only using the parameters -xAuto -tAuto among (soft) CPU time and memory limiting as well as input and output parameters. The parameters for the axiom selections are shown in Table 1. Proof attempts were made with each combi- nation of these parameters. Each of the provers was given a timeout of 10 seconds. During the run with the first SInE with Frequent Symbol Set Mining configuration, a very slow runtime behaviour was observed for some examples. The selection for these examples was cancelled after about eight hours of selection runtime. As a consequence, a timeout for the axiom selection was implemented: If the axiom selection itself took longer than two minutes, the proof attempt was cancelled. The choice of two minutes is arbitrary. The selection timeout must be chosen considerably large to make sure that much CPU time is consumed because it was measured using wall clock time. The controlled environment ensured that no other CPU-hungry processes were running other than the axiom selection and proving (which are limited to less than one process per core). Hence, it is unlikely that other processes used much CPU time that should have been given to the selection for a perfect timing evaluation (as it would have happened by measuring CPU time instead of wall clock time). The selection timeout has to be much longer than the chosen proving timeout, though, because we want to analyse the selection run time as well. An axiom selection that consequently takes this much longer than proving a conjecture is not an acceptable method because the goal is not only to enable proofs but also shorten the overall proving time which includes the axiom selection. 5.2 Results In the MPTP2078 benchmark, only 2044 problems are non-trivial. The other 34 conjectures are equal to an axiom and could be proved by Hets during static analysis using reflexivity of an entailment relation without invoking an external prover. Only the 2044 non-trivial problems were used for the axiom selection method evaluation. The first figures in this section show stacked column charts of the performance: Figure 1 shows how many problems could be solved using SInE. The same goes for Figure 2 using SInE with Frequent Symbol Set Mining. Finally, Figure 3 shows the performance of the Simple Axiom Selection using Frequent Item Set Mining. 62 Top-rated first-order provers return a status of the proof attempt telling the result in a single word. These statuses are defined in the SZS ontology [22]. The most common statuses used by provers are (i) THM (Theorem): All models of the axioms are models of the conjecture. (ii) CSA (CounterSatisfiable): Some models of the axioms are models of the conjecture’s negation. (iii) TMO (Timeout): The prover exceeded the time limit. Of these statuses, ‘THM’ and ‘CSA’ indicate a successful prover run, while ‘TMO’ shows that the prover did not finish successfully. There are more such values indicating failures which are actually used by software, but ATP only use the two values ‘THM’ and ‘CSA’ in case of success. We extended this ontology8 by a status specifically for proving with reduced axiom sets: (iv) CSAS (CounterSatisfiableWithSubset): Some models of the selected axioms are models of the conjecture’s negation. If a refutation of the conjecture is found using a strict subset of the axioms (which means that the prover returns with ‘CSA’), we do not know whether the conjecture is really false or we have excluded an axiom that is crucial to a potentially existing proof. If the prover returns with ‘THM’, we know by monotonicity of entailment relations that the found proof is also a proof with the full axiom set. When a prover returns with a proof (marked in the diagrams as ‘THM’) or a refutation (marked as ‘CSA’), a problem is considered solved. When it returns with a refutation while only a true subset of the axioms was used (marked as ‘CSAS’), or with a proving timeout (marked as ‘TMO (Prover)’) or when the axiom selection takes more than two minutes (marked as ‘TMO (Selection)’), the problem is considered still unsolved. 2500 2500 THM CSA CSAS THM CSA CSAS TMO (Prover) TMO (Selection) TMO (Prover) TMO (Selection) 2000 2000 Number of Problems Number of Problems 1500 1500 1000 1000 500 500 0 0 Bushy ) (1.0, 5) ) (1.2, 5) ) (1.5, 5) ) (2.0, 5) Chainy Bushy ) (1.0, 5) ) (1.2, 5) ) (1.5, 5) ) (2.0, 5) Chainy (1.0, (1.2, (1.5, (2.0, (1.0, (1.2, (1.5, (2.0, Axiom Selection Parameters (t, d) Axiom Selection Parameters (t, d) (a) E Theorem Prover (b) SPASS Figure 1: Performance of the provers in MPTP2078 using SInE Figure 4 shows the average runtime in milliseconds of each axiom selection method over all parametrisations. It contains two values for each method. The first one is the average over all selections that did not reach the two-minute timeout. The second one is the average runtime of the selection, regardless whether or not the timeout was reached. This, of course, counts every selection that was cancelled as two minutes. Figure 5 shows how many axioms were selected on average using SInE. Both versions of the MPTP2078 without a selection are displayed for reference. Due to a high amount of timeouts in the other two selection methods, they are not included in these statistics. The distortion caused by leaving out selections that encountered a timeout would be too high. 8 Our modified SZS ontology can be found on http://ontohub.org/meta/proof_statuses. 63 2500 2500 THM CSA CSAS THM CSA CSAS TMO (Prover) TMO (Selection) TMO (Prover) TMO (Selection) 2000 2000 Number of Problems Number of Problems 1500 1500 1000 1000 500 500 0 0 Bushy (1.0, 1, 1.0) (1.0, 1, 2.0) (1.0, 2, 1.0) (1.0, 2, 2.0) (2.0, 1, 1.0) (2.0, 1, 2.0) (2.0, 2, 1.0) (2.0, 2, 2.0) Chainy Bushy (1.0, 1, 1.0) (1.0, 1, 2.0) (1.0, 2, 1.0) (1.0, 2, 2.0) (2.0, 1, 1.0) (2.0, 1, 2.0) (2.0, 2, 1.0) (2.0, 2, 2.0) Chainy Axiom Selection Parameters ( , smin, t) Axiom Selection Parameters ( , smin, t) (a) E Theorem Prover (b) SPASS Figure 2: Performance of the provers in MPTP2078 using SInE with Frequent Symbol Set Mining Benchmark Size. To interpret the results properly, we need to know the size of the benchmarks. For implementation reasons, there is no very easy way to count the symbols including duplicates. So, only the distinct symbols are taken into account. This is regarded as a sufficient approximation for interpreting the results. The MPTP2078 benchmark in the bushy version has between 2 and 232 axioms per problem. Its average is 31.5 axioms with a standard deviation of 23.4. It has between 2 and 93 distinct symbols per problem with an average of 18.8 and a standard deviation of 12.1. Counting the distinct symbols per axiom, there are between 0 and 26 symbols, 2.9 on average with a standard deviation of 2.7. The chainy version of the MPTP2078 has between 10 and 4563 axioms per problem. The average axiom count is 1876.4 with a standard deviation of 1288.6. The number of distinct symbols per problem ranges from 5 to 784 with an average of 266.7 and a standard deviation of 229.2. In the axioms, there are between 0 and 26 distinct symbols, 4.9 on average with a standard deviation of 3.1. 5.3 Discussion SInE. Inspecting Figure 1, one can see that the E Theorem Prover improves by about 50% for most of the parametri- sations of SInE (45% on average) compared to the chainy versions without any selection. SPASS also performs much better with a selection than without one: Its manages to solve 64% more problems on average across the SInE parametrisations. As a side note, the E Theorem Prover performs better than SPASS, even when com- paring the E Theorem Prover without preselecting axioms against the best result with the selection for SPASS. Considering all SInE parametrisations, 769 problems could be solved, 330 of them could not be solved without an axiom selection. Without axiom selection, 444 problems could be solved. For both provers, the number of ‘CSA’ and ‘CSAS’ results is negligible. For the largest tolerance value without a depth limit, meaning (t, d) = (2.0, ∞), the axiom selection takes more than two minutes for 294 examples. These examples have 2165.5 axioms on average with a standard deviation of 1680.1, ranging from 51 to 4563 and 332.5 distinct symbols on average with a standard deviation of 298.7, ranging from 12 to 784. This means that not only the largest problems take long during axiom selection but also some of the small problems. Having a timeout on some of the small problems may be a result of a high CPU load caused by other tasks performed on the hardware. It is imaginable that the timeout at the larger problems is due to the problem size, though. 64 2500 2500 THM CSA CSAS THM CSA CSAS TMO (Prover) TMO (Selection) TMO (Prover) TMO (Selection) 2000 2000 Number of Problems Number of Problems 1500 1500 1000 1000 500 500 0 0 Bushy (1, 0) (1, 2) (2, 0) (2, 2) Chainy Bushy (1, 0) (1, 2) (2, 0) (2, 2) Chainy Axiom Selection Parameters (smin, t) Axiom Selection Parameters (smin, t) (a) E Theorem Prover (b) SPASS Figure 3: Performance of the provers in MPTP2078 with the Simple Axiom Selection using Frequent Item Set Mining SInE with Frequent Symbol Set Mining. Figure 2 clearly shows that there is no proving performance gain by SInE with Frequent Symbol Set Mining. Not only do both provers perform maximally as well as without the axiom selection but also the selection itself is cancelled frequently due to timeouts. The number of ‘CSAS’ results increases for some parametrisations, but regarding there is such a high amount of selection timeouts, this can be ignored. In total, 475 MPTP2078 problems could be solved using SInE with Frequent Symbol Set Mining, only 35 of which were not solved without an axiom selection. Simple Axiom Selection using Frequent Item Set Mining. Figure 3 shows that the timeout of the axiom selection is dominating the results. This means that for most of the problems, two minutes of selection time are not enough. Only 239 MPTP2078 problems could be solved in total with this method, 9 of which could not be solved without a preselection in the chainy versions. Selection Runtime. Figure 4 shows the average runtime over all parametrisations of the presented selection methods. When only considering the problems that did not reach a timeout (red/left columns), SInE is the slowest method with an average runtime of 18.6 seconds per problem. This is not a fair comparison though, because the other two methods often reached the two-minute timeout while SInE did not. When counting two minutes for every timeout-reached selection (which actually passed during the selection), SInE achieves the best result (yellow/right columns). Two minutes are the smallest value that can be used without distorting the data too much. The distortion caused by counting two minutes is in favour of the axiom selection methods using frequent item set mining which are slower nevertheless. A runtime of several hours for a single problem was observed with SInE with Frequent Symbol Set Mining, before a timeout was implemented. This time was mainly consumed by the external frequent item set mining process. The same behaviour is suspected for the Simple Axiom Selection using Frequent Item Set Mining. Since the MPTP2078 comprises large theories, these two slower methods are again considered unsuitable. Number of Selected Axioms. Figure 5 shows the average number of selected axioms using SInE. When the depth limit is set to 5, considerably less axioms get selected than without a limitation in MPTP2078. Also, the difference between infinite depth 65 average of unexpired selections average of all selections 2000 110000 101858 Number of selected Axioms 100000 1500 90000 selection runtime [ms] 80000 1000 70000 60000 50000 500 40000 31402 30000 18592 19721 0 20000 13630 Bushy ) (1.0, 5) ) (1.2, 5) ) (1.5, 5) ) (2.0, 5) Chainy 10000 3800 (1.0, (1.2, (1.5, (2.0, 0 SInE SInE with FSSM Simple AS with FSSM Figure 5: Average number of selected ax- Figure 4: Runtime of axiom selection methods ioms by SInE limit and a limit of 5 increases with a growing tolerance value. As expected, the number of selected axioms highly increases with a growing tolerance value. Conclusion. Considering the evaluation results, of the three selection methods, SInE clearly is the only usable method for large theories. The methods based on frequent symbol set mining take too much time for large problems and sometimes even for small problems. Even if the selection itself finishes quickly, the selected axioms do not enable proving considerably more conjectures. SInE on the other hand, allows to prove significantly more conjectures. For this reason, only SInE will be kept in Ontohub. 6 Related Work Axiom selection was actively researched in the recent years. A simple algorithm called RedAx [24] was de- veloped to provide evidence for the effectiveness of preselected axioms when proving. SRASS [25] is a more sophisticated semantically operating algorithm to find the right axioms for a proof. MePo [13] as part of Is- abelle/Sledgehammer [16, 18] which is adding axioms that share a feature with the conjecture or already added axioms in several iterations. In each iteration, a pass mark is increased which makes it harder for axioms to be added to the selection. This method was followed by MaSh [11] to improve Isabelle/Sledgehammer’s axiom selection with machine learning methods that rank axioms by the amount and kind of features they share with the conjecture and takes into account previous proofs in the same theory. Alama et al. [3] use learning in MOR, in particular kernel methods, to determine how similar sentences are to each other and uses these similarity values to find feature weights and to eventually rank the available axioms. SInE [10] is a heuristic that, similarly to MePo, iteratively adds axioms to the selection that share symbols with the conjecture or already selected axioms. Instead of increasing a pass mark, it only selects axioms that contain symbols whose number of occur- rences in the overall theory meets certain requirements. This method proved very successful by executing fast and enabling about as many proofs as some learning methods. Because of saving intermediate information in a database, our implementation of SInE is very efficient for consecutive runs, although because of the implementation in Ruby and because of being bound to a relational database the first run performs slower than an implementation with native/compiled code such as it exists directly in Vampire or E. One notable feature is the prover-independent operability of the axiom selection. This implementation is usable with any prover that interfaces with Hets. 7 Conclusions and Future Work An implementation of axiom selection heuristics was presented: the already existing SInE and two methods based on frequent item set mining. These are implemented in the combination of the web application Ontohub and Hets and run independently of the selected prover, which is a first step towards an implementation of logic- independent axiom selections in Ontohub. Currently, this implementation is compatible to every Hets-interfaced 66 first-order prover. The evaluation of the selection methods showed once again the effectiveness of the SInE algorithm but also revealed that frequent item set mining is not an appropriate tool for selecting axioms because of the slow computation of frequent item sets. Future work in this field should extend the feature of independence such that the axiom selections operate independently of the logic as well. A new evaluation could feature a comparison between SInE as it is implemented in provers to the SInE implementation of Ontohub, measuring the performance of consecutive axiom selections that are applied to the same theory (with different conjectures) such that the commonness values can be read from the database. Also learning methods could be incorporated into the selection, e.g. learning the best parameters for SInE or entirely learning-based methods similar to MaSh or MOR could be incorporated. This is very promising in the context of Ontohub because of its repository structure which stores information on every logical theory including proof attempts – both successful and failed. References [1] Charu C. Aggarwal and Jiawei Han. Frequent Pattern Mining. Springer Publishing Company, Incorporated, 2014. [2] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In Proc. 20th int. conf. very large data bases, VLDB, volume 1215, pages 487–499, 1994. [3] Jesse Alama, Tom Heskes, Daniel Kühlwein, Evgeni Tsivtsivadze, and Josef Urban. Premise Selection for Mathematics by Corpus Analysis and Kernel Methods. Journal of Automated Reasoning, 52(2):191–213, 2014. [4] Egidio Astesiano, Michel Bidoit, Hélene Kirchner, Bernd Krieg-Brückner, Peter D Mosses, Donald Sannella, and Andrzej Tarlecki. CASL: The Common Algebraic Specification Language. Theoretical Computer Science, 286(2):153–196, 2002. [5] Peter Baumgartner, Alexander Fuchs, and Cesare Tinelli. Darwin: A Theorem Prover for the Model Evolution Calculus. In Stephan Schulz, Geoff Sutcliffe, and Tanel Tammet, editors, IJCAR Workshop on Empirically Successful First Order Reasoning (ESFOR (aka S4)), Electronic Notes in Theoretical Computer Science, 2004. [6] Sascha Böhme and Tobias Nipkow. Sledgehammer: Judgement Day. In Automated Reasoning, pages 107– 121. Springer, 2010. [7] Christian Borgelt. Frequent Pattern Mining. Lecture slides. URL http://www.borgelt.net/teach/fpm/ slides.html. last checked: 13 June 2016. [8] Peter Haase, Holger Lewen, Rudi Studer, Duc Thanh Tran, Michael Erdmann, Mathieu dAquin, and Enrico Motta. The Neon Ontology Engineering Toolkit. WWW, 2008. [9] Jiawei Han, Jian Pei, and Yiwen Yin. Mining Frequent Patterns Without Candidate Generation. In ACM Sigmod Record, volume 29, pages 1–12. ACM, 2000. [10] Krystof Hoder and Andrei Voronkov. Sine Qua Non for Large Theory Reasoning. In Automated Deduction - CADE-23 - 23rd International Conference on Automated Deduction, Wroclaw, Poland, July 31 - August 5, 2011. Proceedings, pages 299–314, 2011. [11] Daniel Kühlwein, Jasmin Christian Blanchette, Cezary Kaliszyk, and Josef Urban. MaSh: Machine Learning for Sledgehammer. In Interactive Theorem Proving, pages 35–50. Springer, 2013. [12] Eugen Kuksa and Till Mossakowski. Ontohub — version control, linked data and theorem proving for ontologies. In FOIS 2016. To appear. [13] Jia Meng and Lawrence C Paulson. Lightweight Relevance Filtering for Machine-Generated Resolution Problems. Journal of Applied Logic, 7(1):41–57, 2009. [14] Till Mossakowski. Heterogeneous Specification and the Heterogeneous Tool Set. University of Bremen Germany, 2005. Habilitation thesis. 67 [15] Till Mossakowski, Mihai Codescu, Fabian Neuhaus, and Oliver Kutz. The Distributed Ontology, Modeling, and Specification Language - DOL. In A. Koslow and A. Buchsbaum, editors, The Road to Universal Logic, volume II of Studies in Universal Logic. Springer, 2015. [16] Tobias Nipkow, Lawrence C Paulson, and Markus Wenzel. Isabelle/HOL: A Proof Assistant for Higher- Order Logic, volume 2283. Springer Science & Business Media, 2002. [17] Natalya F Noy, Nigam H Shah, Patricia L Whetzel, Benjamin Dai, Michael Dorf, Nicholas Griffith, Clement Jonquet, Daniel L Rubin, Margaret-Anne Storey, Christopher G Chute, et al. BioPortal: Ontologies and Integrated Data Resources at the Click of a Mouse. Nucleic acids research, page gkp440, 2009. [18] Lawrence C Paulson and Jasmin Christian Blanchette. Three Years of Experience With Sledgehammer, a Practical Link Between Automatic and Interactive Theorem Provers. IWIL-2010, 2010. [19] Alexandre Riazanov and Andrei Voronkov. The Design and Implementation of VAMPIRE. AI communica- tions, 15(2, 3):91–110, 2002. [20] Piotr Rudnicki. An Overview of the Mizar Project. In Proceedings of the 1992 Workshop on Types for Proofs and Programs, pages 311–330, 1992. [21] Stephan Schulz. System Description: E 1.8. In Ken McMillan, Aart Middeldorp, and Andrei Voronkov, editors, Proc. of the 19th LPAR, Stellenbosch, volume 8312 of LNCS, pages 735–743. Springer, 2013. [22] G. Sutcliffe. The SZS Ontologies for Automated Reasoning Software. In G. Sutcliffe, P. Rudnicki, R. Schmidt, B. Konev, and S. Schulz, editors, Proceedings of the LPAR Workshops: Knowledge Exchange: Automated Provers and Proof Assistants, and The 7th International Workshop on the Implementation of Logics, number 418 in CEUR Workshop Proceedings, pages 38–49, 2008. [23] Geoff Sutcliffe. The 4th ijcar automated theorem proving system competition–casc-j4. Ai Communications, 22(1):59–72, 2009. [24] Geoff Sutcliffe and Alexander Dvorskỳ. Proving Harder Theorems by Axiom Reduction. In FLAIRS Con- ference, pages 108–113, 2003. [25] Geoff Sutcliffe and Yury Puzis. SRASS - A Semantic Relevance Axiom Selection System. In Automated Deduction–CADE-21, pages 295–310. Springer, 2007. [26] Geoff Sutcliffe, Jürgen Zimmer, and Stephan Schulz. TSTP Data-Exchange Formats for Automated Theorem Proving Tools. Distributed Constraint Problem Solving and Reasoning in Multi-Agent Systems, 112:201–215, 2004. [27] Linus Torvalds and Junio Hamano. Git: Fast Version Control System, 2010. http://git-scm.com last checked: 13 June 2016. [28] Christoph Weidenbach, Dilyana Dimova, Arnaud Fietzke, Rohit Kumar, Martin Suda, and Patrick Wis- chnewski. SPASS Version 3.5. In Automated Deduction–CADE-22, pages 140–145. Springer, 2009. [29] Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, and Wei Li. New Algorithms for Fast Discovery of Association Rules. In KDD, volume 97, pages 283–286, 1997. [30] Jürgen Zimmer and Serge Autexier. The MathServe System for Semantic Web Reasoning Services. In Automated Reasoning, pages 140–144. Springer, 2006. 68