KB Bio 101 : A Challenge for OWL Reasoners Vinay K. Chaudhri, Michael A. Wessel, Stijn Heymans SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025-3493, USA firstname.lastname@sri.com Abstract. We describe the axiomatic content of a biology knowledge base that poses both theoretical and empirical challenges for OWL reasoning. The knowl- edge base is organized hierarchically as a set of classes with necessary and suf- ficient properties. The relations have domain and range restrictions, are orga- nized into a hierarchy, can have cardinality constraints and composition axioms stated for them. The necessary and sufficient properties of classes induce general graphs for which there are no known decidable reasoners. The OWL version of the knowledge base presented in this paper is an approximation of the original knowledge base. The knowledge content is practically motivated by an education application and has been extensively tested for quality. 1 Introduction The goal of Project Halo is to develop a “Digital Aristotle” - a reasoning system ca- pable of answering novel questions and solving problems in a broad range of scientific disciplines and related human affairs [11]. As part of this effort, SRI has created a sys- tem called Automated User-Centered Reasoning and Acquisition System (AURA) [8], which enables educators to encode knowledge from science textbooks in a way that it can be used for answering questions by reasoning. A team of biologists used AURA to encode a significant subset of a popular biology textbook that is used in advanced high school and introductory college courses in the United States [15]. The knowledge base called KB Bio 101 (for short: KBB101) is an outcome of this effort. KBB101 is a central component of an electronic textbook application called Inquire Biology [13] aimed at students studying from it. AURA uses a frame-based knowledge representation and reasoning system called Knowledge Machine (KM) [7]. We have translated the original KM-version of KBB101 into first-order logic with equality. By using this representation as a common basis, we have translated it into multiple different formats including SILK [9], OWL2 functional [17], answer set programming [5], and the TPTP FOF syntax [6]. In this paper, we describe the OWL2 translation. The translations are available for download [4].1 2 Modeling in the AURA Project – The Role of Skolem Functions AURA provides a graphical knowledge authoring environment for biologists. For ex- ample, the knowledge Every Cell has a Ribosome part and a Chromosome part is ex- 1 This work is owned by Vulcan Inc. and is licensed for use under the Creative Commons Attribution-NonCommerical-ShareAlike 3.0 license (http://creativecommons.org/ licenses/by-nc-sa/3.0/). Fig. 1. (Simplified) Class Graphs for Cell and EukaryoticCell in AURA. pressed graphically as shown in the left half of Fig. 1. Here, white nodes represent universally quantified variables, and grey nodes represent existentially quantified vari- ables. Cell hence corresponds to the following first-order formula: ∀x : Cell (x) ⇒ ∃y1 , y2 : hasPart(x, y1 ) ∧ hasPart(x, y2 ) ∧ Ribosome(y1 ) ∧ Chromosome(y2 ) Using the well-known technique of Skolemization, we can also write this as follows; the advantages of Skolem functions will become clear shortly: ∀x : Cell (x) ⇒ 1 2 hasPart(x, fCell (x)) ∧ hasPart(x, fCell (x))∧ 1 2 Ribosome(fCell (x)) ∧ Chromosome(fCell (x)) The system supports inheritance. Consider the subclass EukaryoticCell , which in- herits knowledge from Cell , see the right half of Fig. 1. The Chromosome in EukaryoticCell was inherited from Cell , and then specialized into a Eukaryotic- Chromosome, and likewise for Ribosome. The N ucleus was added locally in EukaryoticCell . The advantage of using Skolem functions is that the inheritance can be 3 2 2 made explicit by means of equality atoms: by adding fECell (x) = fCell (x), fECell (x) = 1 fCell (x) to the Skolemized formula for EukaryoticCell it is made explicit that the EukaryoticChromosome in EukaryoticCell is a specialization of the Chromosome in Cell and consequently, all knowledge which was modeled for that Chromosome in the context of Cell also applies to the EukaryoticChromosome in the context of EukaryoticCell (in addition to what was modeled for Chromosome itself, of course): ∀x : EukaryoticCell(x) ⇒ Cell(x)∧ 1 2 3 hasPart(x, fECell (x)) ∧ hasPart(x, fECell (x)) ∧ hasPart(x, fECell (x))∧ 3 1 EukaryoticChromosome(fECell (x)) ∧ N ucleus(fECell (x))∧ 2 3 1 EukaryoticRibosome(fECell (x)) ∧ isInside(fECell (x), fECell (x))∧ 3 2 2 1 fECell (x) = fCell (x) ∧ fECell (x) = fCell (x) The employed graphical modeling paradigm can be described as inherit, specialize, and extend. During the modeling process, the system keeps track of the specialized and extended Skolem functions and records the inheritance structures as demonstrated. Since the above axiom defines a graph, it is not expressible in the known decidable description logics [12]. 3 The Axiomatic Content of KB Bio 101 We first describe the signature of an AURA KB in first order logic. KBB101 is an AURA KB. Let CN be a set of class names (e.g., Cell ∈ CN ), and RN be a set of relation names (e.g., hasPart ∈ RN ). Let AN ⊆ RN be a set of attribute names (e.g., color , temperature ∈ AN ). Let C, C1 , C2 , . . . , D, D1 , D2 , . . . , E, E1 , E2 , . . . , F, F1 , F2 , . . . be class names, and R, R1 , R2 , . . . , S, S1 , S2 , . . . , T, T1 , T2 , . . . be relation names. Let {x, y, z, x1 , x2 , . . .} be a set of variables, and, for every C ∈ CN , let {f n1C , f n2C , . . .} be a set of function symbols. We have the following sets of constants: scalar constant values SCs = {small , big, . . .}, categorical constant values CCs = {blue, green, . . .}, cardinal unit classes CUCs = {meter , year , . . .}, and CN ∪ RN are considered con- stants as well. There are three kinds of attributes; they are used in so-called value atoms, see below: Cardinal attribute values: For example, t is 43 years would be represented as age(t, t1 ), theCardinalValue(t1 , 43), cardinalUnitClass(t1 , year ). Categorial attribute values: For example, t has color green would be represented as color (t, t1 ), theCategoricalValue(t1 , green). Scalar attribute values: For example, t is big w.r.t. a house (where house is a class) would be represented as size(t, t1 ), theScalarValue(t1 , big), scalarUnitClass(t1 , house). Next we describe the axiomatic content. An AURA KB is a tuple (CTAs, CAs, RAs, EQAs), where CTAs is a set of constant type assertions, RAs is a set of relation axioms, CAs is a set of class axioms, and EQAs is a set of equality atoms. Those axioms are described in the following: CTAs : The AURA KB contains, for every c ∈ SCs ∪ CCs ∪ CUCs, 1 to n type assertions of the form C(c), where C ∈ CN (the types of the constant). EQAs : A set of equality atoms for C, of the form t = f n(t0 ), where t, t0 ∈ {x, f n1C (x), f n2C (x), . . .}, and f n ∈ {f n1D , f n2D , . . .}, with C 6= D, for some D (D is a class mentioned in C, or a direct or indirect superclass of C). CAs : For every class name C ∈ CN , it may contain the following kinds of ax- ioms: DAs : disjointness axioms: ∀x : C(x) ⇒ ¬D(x); TAs : taxonomic axioms: ∀x : C(x) ⇒ E(x); NCAs : necessary conditions: ∀x : C(x) ⇒ Φ [x], where Φ [x] is a conjunction of unary (class) atoms and binary (relation) atoms over terms {x, f n1C (x), f n2C (x), . . .}. There are two special equality relations, namely equal , notEqual , which are user asserted equality atoms. The intended semantics is the semantics of first-order equal- ity resp. in-equality. In order to distinguish them from the equalities in EQAs we use different predicate names. Moreover, Φ [x] can contain the following value atoms: for a term t, let float be a floating point number, scalar ∈ SCs, categorical ∈ CCs, cardinalUnitClass ∈ CUCs, and scalarUnitClass ∈ CN , then the following atoms are value atoms: theCardinalValue(t, float), theScalarValue(t, scalar ), theCategoricalValue(t, categorical ), cardinalUnitClass(t, cardinalUnitClass), and scalarUnitClass(t, scalarUnitClass). In addition, an AURA KB can contain qualified number restrictions. Due to a lack of counting quantifiers, we represent them by means of quadrary atoms maxCardinality(t, R, n, C) (resp. minCardinality and exactCardinality), where n is a non-negative integer, C is a class, and R is a relation name. SCAs : sufficient conditions: ∀x : Θ [x, . . .] ⇒ C(x)∧EQs [x, . . .], where Θ [x, . . .] is a conjunction of unary, binary, value and qualified number restriction atoms over terms {x, x1 , x2 , . . .}, the sufficient conditions, and EQs [X, . . .] is a conjunction of equality atoms of the form t1 = t2 , where t1 ∈ {x, x1 , x2 , . . .}, and t2 ∈ {x, f n1C (x), f n2C (x), . . .}, linking the variables in the antecedent to the Skolem function values in the consequent of the necessary conditions, Φ(x). Obviously, requiring the use of the Skolem functions in the antecedent of the sufficient condition would be a too strong re- quirement and render the sufficient condition inapplicable in many cases. Also note that Θ0 [x] ⊆ Φ [x], where Θ0 [x] is the result of substituting the variables Θ [x] with their re- spective Skolem terms from EQs [x, . . .]: Θ0 [x] = Θ [x]{t1 7→t2 ,t1 =t2 ∈EQs[x,...]} . Hence, every sufficient condition is also necessary (a byproduct of the graphical modeling). For a given class name C, we refer to the corresponding axioms as DAs(C), TAs(C), and EQAs(C). We refer to the union of all axioms for C as CAs(C). RAs : For every relation name R ∈ RN , RAs may contain the following: DRAs : relation domain restrictions ∀x, y : R(x, y) ⇒ C1 (x) ∨ . . . ∨ Cn (x); RRAs : relation range restrictions ∀x, y : R(x, y) ⇒ D1 (y) ∨ . . . ∨ Dm (y); RHAs : simple relation hierarchy ∀x, y : R(x, y) ⇒ S(x, y); QRHAs : qualified relation hierarchy ∀x, y : R(x, y)∧C(x)∧D(y) ⇒ S(x, y); IRAs : inverse relations ∀x, y : R(x, y) ⇒ S(y, x); 12NAs : 1-to-N cardinality ∀x, y, z : R(x, y) ∧ R(z, y) ⇒ x = z; N21As : N-to-1 cardinality ∀x, y, z : R(x, y)∧R(x, z) ⇒ y = z; TRANSAs : simple transitive closure axioms ∀x, y, z : R(x, y) ∧ Rstar (y, z) ∧ C(x) ∧ D(y) ∧ E(z) ⇒ Rstar (x, z), where Rstar (x, z) = R∗ (x, z); GTRANSLAs : generalized transitive closure axioms (left composition) ∀x, y, z : R(x, y) ∧ S(y, z) ∧ C(x) ∧ D(y) ∧ E(z) ⇒ Rstar (x, z); and GTRANSRAs : generalized transitive closure axioms (right composition) ∀x, y, z : R(x, y) ∧ S(y, z) ∧ C(x) ∧ D(y) ∧ E(z) ⇒ Sstar (x , z ). We refer to the axioms for a relation R by DRAs(R) etc. We refer to the union of all axioms for R as RAs(R). 4 The OWL Translations of KB Bio 101 Our OWL translator produces OWL2 functional syntax [17], which has good human readability and is readily processed by most OWL2 reasoners. The generated KBs have been syntax-tested with Protégé 4.2 [14] (utilizing the OWLAPI parser) as well as Rac- erPro [16] (which has its own proprietary parser). The following features of KBB101 might be challenging for OWL reasoners: Cycles: KBB101 contains terminological cycles. It does not have the finite model property, nor the tree model property [1]. Size: the most complete export is 16 MBs big. Complexity: the most complete export exploits SHOIQ(Dn ) [3] (potentially we could use SROIQ(Dn ) [10], but we currently do not include complex role inclusions, see below for a discussion). Graph structures: we cannot represent the graph structures truthfully in OWL2. The original graph structures have to be approximated. We do this by rewriting and exporting the KBB101 in two flavors. Flavor 1 - Unraveling: We unravel the graph structures up to a certain maximal depth n. Unraveling is a standard technique from modal logics - let us give the following intuition: Given a class graph C (see Fig. 1), an up to max. depth n unraveled version of C can be produced by an n-bounded depth-first graph traversal of C, starting from the root node x, that outputs the (inverse) edge label whenever an (inverse) edge is traversed to visit a successor node, together with the node label. This produces a tree. This tree with max-depth n is then translated into OWL2 functional syntax as described. It results in an approximation of the original KBB101 which gets better the larger the value of n is. Note that nodes reachable only over paths of length > n are excluded. The filenames of KBB101s which were produced using un- raveling start with kb-owl-syntax-unraveled-depth-n. We currently vary n from 0 to 4 and produce the corresponding KBB101s. With n = 0, the axioms in NCAs and SCAs are basically ignored, as the unraveled tree consists of the root node only (hence, only the taxonomy is exported). Flavor 2 - Node IDs: We can represent the graph structure by introducing symbolic node identifiers in the OWL2 class expres- sions. Even though the OWL2 reasoner will be blind to the intended semantic meaning of these node IDs, modeling graph structure and co-references, the original graph struc- ture is at least represented and could, in principle, be exploited for reasoning by some powerful extended future OWL2 reasoner. Note that node IDs are only introduced if required (in tree-shaped class descriptions they are not required). Moreover, those node IDs can either be rendered as atomic classes, or introduced as nominals. The filenames of the respective KBB101s start with kb-owl-syntax-coreference-IDs. Explicit inheritance and equality: the inter-class co-references between Skolem function values and equality atoms cannot be represented in OWL2. We hence skip all the axioms in EQNs. We consider the OWL2 export underspecified. In principle, we could preserve some of those by using functional properties and encoding tricks, but even then, feature agreements or role value maps might be required, and already ALCF with general TBoxes is undecidable [2]. Rendering of axioms We can en- and disable the export of certain axiom types, e.g, there is a switch which determines whether DAs are exported or not, and likewise for other axiom types. We produce all KBB101s for all possible combinations of those switches. Let us describe the rendering of class axioms and relation axioms. In the following, C’ denotes the OWL2 version of class C, and R’ the corresponding property of relation R. The class axioms CA(C) are exported as follows: The axioms TAs(C) and NCA(C) are combined into one axiom of the form ∀x : C(x) ⇒ Ω, which is then rendered as a SubClassOf(C Ω 0 ) axiom. Here, Ω 0 is either an – up to depth n – unraveled ver- sion of Ω as an OWL2 class expression, or the Ω 0 class expression uses node IDs for representing the graph structure as described. Note that the DAs(C) and EQAs(C) are excluded here. Moreover, if C has a user-description or -comment, then this is ren- dered as AnnotationAssertion(C’ string). During rendering of SCAs, we are omitting the EQs [x, . . .] from ∀x : Θ [x, . . .] ⇒ C(x) ∧ EQs [x, . . .] ∈ SCAs. KBB101s with SCAs preserved have a triggers in their file names. We generate a SubClassOf(Θ0 C) axiom, where Θ0 is Θ [x, . . .] as an OWL2 class expres- sion, unraveled up to depth n. Disjointness axioms DAs are represented by means of DisjointClasses. The rendering of DAs can be suppressed; KBB101s with DAs preserved have -disjointness in their file names. The rendering of cardi- nality constraints in necessary conditions NCAs can be omitted. Also, we may choose to only export the cardinality constraints with cardinalities 0 and 1, as those are the only cardinality constraints used by the KM reasoning system [7] (the other cardinal- ity constraints are ignored). KBB101s with cardinality constraints preserved have a cardinalities resp. km-relevant-cardinalities in their file names. We also employ class and property annotation axioms to represent user descriptions and documentations. The inter-class equality axioms EQAs are ignored – as explained, there is no straight- forward way to model our Skolem function inheritance in OWL2. However, the user asserted intra-class equality and in-equality atoms are retained, and we are using the :same-as and :not-equal object properties for that purpose. Exporting the relation axioms RAs is straightforward, too. KBB101s with rela- tion axioms retained have a relation-axioms in their file names: The axioms TRANSAs, GTRANSLAs, GTRANSRAs are analyzed. If an axiom can be truth- fully encoded as an OWL2 complex role inclusion axiom obeying the regularity con- dition [10], then it is included in the file (unfortunately, none are, so the KBB101 ends up in SHOIQ(Dn ) instead of SROIQ(Dn )). If a relations R turns out to be transi- tive, then this is declared by means of TransitiveObjectProperty(R) axiom. RDAs(R) are rendered as ObjectPropertyDomain(R, C), for every ∀x, y : R(x, y) ⇒ C(x) ∈ RDAs(R). RRAs(R) are rendered as ObjectProperty- Range(R, C), for every ∀x, y : R(x, y) ⇒ D(y) ∈ RRAs(R). RHAs(R) are rendered as SubObjectProperty(R,S), for every ∀x, y : R(x, y) ⇒ S(x, y) ∈ RHAs(R). IRAs(R) are rendered as InverseObjectProperties(R, S), for every ∀x, y : R(x, y) ⇒ S(y, x) ∈ IRAs(R). If N21As(R) 6= ∅, then we declare FunctionalObjectProperty(R), and ObjectProperty(R) otherwise. If R has a user-description string, then this is rendered as an AnnotationAssertion(R string). Rendering of terms: OWL2 is a term-free language. However, there is the analog of first-order constants, so-called nominals, and we may choose to use them for the rep- resentation of categorical property values (such as green) and scalar symbolic prop- erty values (such as big). A categorical property value such as green can either be represented as a type / instance assertion of the form ClassAssertion(:Color- Constant :green) and then used as a nominal object property filler in class sub- expressions such as ObjectHasValue(:color :green), or :green might be a special subclass of :ColorConstant, SubClassOf(:green :ColorCon- stant), and then used in an ObjectSomeValuesFrom(:color :green) ex- pression to represent the color of some object. However, for string- and float-based property values we need to use a datatype property-based representation, e.g. Data- HasValue(:theCardinalValue "43.0e0"ˆˆxsd:float). KBB101s us- ing the nominal representation have a value-nominals in their file names, and oth- erwise value-classes. The rendering of value classes and nominals can also be switched off completely. 5 Conclusion An initial version of the KB Bio 101 in OWL2 is now available [4] and we are very interested to actively engage with the research community to facilitate its use. We are also looking forward to seeing the reasoning runtimes of different systems participating in the ORE 2013 reasoner competition for the different OWL2 variants of KBB101. The reasoning problems we are currently interested in are consistency checking and classification. Acknowledgment: This work has been funded by Vulcan Inc. References 1. F. Baader. Terminological Cycles in a Description Logic with Existential Restrictions. In International Joint Conference on Atificial Intelligence, 2003. 2. F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003. 3. F. Baader and B. Hollunder. Qualifying Number Restrictions in Concept Languages. In International Conference on Knowledge Representation and Reasoning, 1991. 4. V. K. Chaudhri, S. Heymans, and M. A. Wessel. The Bio KB 101 Download Page, 2012. See http://www.ai.sri.com/halo/halobook2010/exported-kb/biokb. html. 5. V. K. Chaudhri, M. W. Stijn Heymans, and S. C. Tran. Object-Oriented Knowledge Bases in Logic Programming. Technical report, SRI International, 2013. Available from http: //www.ai.sri.com/pub_list/1935. 6. V. K. Chaudhri, M. A. Wessel, and S. Heymans. KB Bio 101: A Challenge for TPTP First- Order Reasoners. In CADE-24 Workshop on Knowledge Intensive Automated Reasoning, 2013. Available from http://www.ai.sri.com/pub_list/1937. 7. P. Clark and B. Porter. Building Concept Representations from Reusable Components. In AAAI. AAAI Press, 1997. 8. D. Gunning and V. Chaudhri et al. Project Halo Update Progress Toward Digital Aristotle. AI Magazine, Fall 2010. 9. B. Grosof. The SILK Project: Semantic Inferencing on Large Knowledge, 2012. See http: //silk.semwebcentral.org/. 10. I. Horrocks, O. Kutz, and U. Sattler. The Even More Irresistible SROIQ. In International Conference on Knowledge Representation and Reasoning, 2006. 11. V. Inc. Project Halo, 2012. See http://www.projecthalo.com/. 12. B. Motik, B. C. Grau, I. Horrocks, and U. Sattler. Representing Ontologies Using Description Logics, Description Graphs, and Rules. Artificial Intelligence, 173(14), Sept. 2009. 13. A. Overholtzer, A. Spaulding, V. K. Chaudhri, and D. Gunning. Inquire: An Intelligent Textbook. In Proceedings of the Vidoe Track of AAAI Conference on Artificial Intelli- gence. AAAI Press, 2012. See http://www.aaaivideos.org/2012/inquire_ intelligent_textbook/. 14. Protégé Group. The Protégé Ontology Editor and Knowledge Acquisition System, 2012. See http://protege.stanford.edu. 15. J. B. Reece, L. A. Urry, M. L. Cain, S. A. Wasserman, P. V. Minorsky, and R. B. Jackson. Campbell Biology, 9th ed. Harlow: Pearson Education, 2011. 16. V. Haarslev and K. Hidde and R. Möller and M. Wessel. The RacerPro Knowledge Repre- sentation and Reasoning System. Semantic Web Journal, 3(3):267–277, 2012. 17. W3C OWL Working Group. OWL 2 Web Ontology Language: Document Overview. W3C Recommendation, 27 October 2009. Available at http://www.w3.org/TR/ owl2-overview/.