KB Bio 101 : A Challenge for OWL Reasoners

                  Vinay K. Chaudhri, Michael A. Wessel, Stijn Heymans

         SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025-3493, USA
                              firstname.lastname@sri.com


         Abstract. We describe the axiomatic content of a biology knowledge base that
         poses both theoretical and empirical challenges for OWL reasoning. The knowl-
         edge base is organized hierarchically as a set of classes with necessary and suf-
         ficient properties. The relations have domain and range restrictions, are orga-
         nized into a hierarchy, can have cardinality constraints and composition axioms
         stated for them. The necessary and sufficient properties of classes induce general
         graphs for which there are no known decidable reasoners. The OWL version of
         the knowledge base presented in this paper is an approximation of the original
         knowledge base. The knowledge content is practically motivated by an education
         application and has been extensively tested for quality.


1     Introduction
The goal of Project Halo is to develop a “Digital Aristotle” - a reasoning system ca-
pable of answering novel questions and solving problems in a broad range of scientific
disciplines and related human affairs [11]. As part of this effort, SRI has created a sys-
tem called Automated User-Centered Reasoning and Acquisition System (AURA) [8],
which enables educators to encode knowledge from science textbooks in a way that it
can be used for answering questions by reasoning.
    A team of biologists used AURA to encode a significant subset of a popular biology
textbook that is used in advanced high school and introductory college courses in the
United States [15]. The knowledge base called KB Bio 101 (for short: KBB101) is
an outcome of this effort. KBB101 is a central component of an electronic textbook
application called Inquire Biology [13] aimed at students studying from it.
    AURA uses a frame-based knowledge representation and reasoning system called
Knowledge Machine (KM) [7]. We have translated the original KM-version of KBB101
into first-order logic with equality. By using this representation as a common basis, we
have translated it into multiple different formats including SILK [9], OWL2 functional
[17], answer set programming [5], and the TPTP FOF syntax [6]. In this paper, we
describe the OWL2 translation. The translations are available for download [4].1

2     Modeling in the AURA Project – The Role of Skolem Functions
AURA provides a graphical knowledge authoring environment for biologists. For ex-
ample, the knowledge Every Cell has a Ribosome part and a Chromosome part is ex-
 1
     This work is owned by Vulcan Inc. and is licensed for use under the Creative Commons
     Attribution-NonCommerical-ShareAlike 3.0 license (http://creativecommons.org/
     licenses/by-nc-sa/3.0/).
            Fig. 1. (Simplified) Class Graphs for Cell and EukaryoticCell in AURA.

pressed graphically as shown in the left half of Fig. 1. Here, white nodes represent
universally quantified variables, and grey nodes represent existentially quantified vari-
ables. Cell hence corresponds to the following first-order formula:
    ∀x : Cell (x) ⇒
         ∃y1 , y2 : hasPart(x, y1 ) ∧ hasPart(x, y2 ) ∧ Ribosome(y1 ) ∧ Chromosome(y2 )
    Using the well-known technique of Skolemization, we can also write this as follows;
the advantages of Skolem functions will become clear shortly:
                      ∀x : Cell (x) ⇒
                                        1                      2
                           hasPart(x, fCell (x)) ∧ hasPart(x, fCell (x))∧
                                       1                         2
                           Ribosome(fCell (x)) ∧ Chromosome(fCell     (x))
    The system supports inheritance. Consider the subclass EukaryoticCell , which in-
herits knowledge from Cell , see the right half of Fig. 1. The Chromosome in
EukaryoticCell was inherited from Cell , and then specialized into a Eukaryotic-
Chromosome, and likewise for Ribosome. The N ucleus was added locally in
EukaryoticCell . The advantage of using Skolem functions is that the inheritance can be
                                                     3             2         2
made explicit by means of equality atoms: by adding fECell (x) = fCell (x), fECell (x) =
  1
fCell (x) to the Skolemized formula for EukaryoticCell it is made explicit that the
EukaryoticChromosome in EukaryoticCell is a specialization of the Chromosome
in Cell and consequently, all knowledge which was modeled for that Chromosome
in the context of Cell also applies to the EukaryoticChromosome in the context of
EukaryoticCell (in addition to what was modeled for Chromosome itself, of course):
      ∀x : EukaryoticCell(x) ⇒ Cell(x)∧
                        1                          2                         3
           hasPart(x, fECell  (x)) ∧ hasPart(x, fECell  (x)) ∧ hasPart(x, fECell  (x))∧
                                        3                        1
           EukaryoticChromosome(fECell         (x)) ∧ N ucleus(fECell  (x))∧
                                    2                       3           1
           EukaryoticRibosome(fECell    (x)) ∧ isInside(fECell   (x), fECell (x))∧
            3             2           2            1
           fECell (x) = fCell (x) ∧ fECell (x) = fCell (x)
    The employed graphical modeling paradigm can be described as inherit, specialize,
and extend. During the modeling process, the system keeps track of the specialized and
extended Skolem functions and records the inheritance structures as demonstrated.
    Since the above axiom defines a graph, it is not expressible in the known decidable
description logics [12].

3    The Axiomatic Content of KB Bio 101

We first describe the signature of an AURA KB in first order logic. KBB101 is an
AURA KB. Let CN be a set of class names (e.g., Cell ∈ CN ), and RN be a set of
relation names (e.g., hasPart ∈ RN ). Let AN ⊆ RN be a set of attribute names (e.g.,
color , temperature ∈ AN ). Let C, C1 , C2 , . . . , D, D1 , D2 , . . . , E, E1 , E2 , . . . , F, F1 ,
F2 , . . . be class names, and R, R1 , R2 , . . . , S, S1 , S2 , . . . , T, T1 , T2 , . . . be relation names.
Let {x, y, z, x1 , x2 , . . .} be a set of variables, and, for every C ∈ CN , let {f n1C , f n2C , . . .}
be a set of function symbols. We have the following sets of constants: scalar constant
values SCs = {small , big, . . .}, categorical constant values CCs = {blue, green, . . .},
cardinal unit classes CUCs = {meter , year , . . .}, and CN ∪ RN are considered con-
stants as well. There are three kinds of attributes; they are used in so-called value atoms,
see below:
Cardinal attribute values: For example, t is 43 years would be represented as
    age(t, t1 ), theCardinalValue(t1 , 43), cardinalUnitClass(t1 , year ).
Categorial attribute values: For example, t has color green would be represented as
    color (t, t1 ), theCategoricalValue(t1 , green).
Scalar attribute values: For example, t is big w.r.t. a house (where house is a class)
    would be represented as size(t, t1 ), theScalarValue(t1 , big), scalarUnitClass(t1 ,
    house).
     Next we describe the axiomatic content. An AURA KB is a tuple
(CTAs, CAs, RAs, EQAs), where CTAs is a set of constant type assertions, RAs
is a set of relation axioms, CAs is a set of class axioms, and EQAs is a set of equality
atoms. Those axioms are described in the following:
     CTAs : The AURA KB contains, for every c ∈ SCs ∪ CCs ∪ CUCs, 1 to n type
assertions of the form C(c), where C ∈ CN (the types of the constant).
     EQAs : A set of equality atoms for C, of the form t = f n(t0 ), where t, t0 ∈
{x, f n1C (x), f n2C (x), . . .}, and f n ∈ {f n1D , f n2D , . . .}, with C 6= D, for some D (D
is a class mentioned in C, or a direct or indirect superclass of C).
     CAs : For every class name C ∈ CN , it may contain the following kinds of ax-
ioms: DAs : disjointness axioms: ∀x : C(x) ⇒ ¬D(x); TAs : taxonomic axioms:
∀x : C(x) ⇒ E(x); NCAs : necessary conditions: ∀x : C(x) ⇒ Φ [x], where
Φ [x] is a conjunction of unary (class) atoms and binary (relation) atoms over terms
{x, f n1C (x), f n2C (x), . . .}.
     There are two special equality relations, namely equal , notEqual , which are user
asserted equality atoms. The intended semantics is the semantics of first-order equal-
ity resp. in-equality. In order to distinguish them from the equalities in EQAs we use
different predicate names.
     Moreover, Φ [x] can contain the following value atoms: for a term t, let float be
a floating point number, scalar ∈ SCs, categorical ∈ CCs, cardinalUnitClass ∈
CUCs, and scalarUnitClass ∈ CN , then the following atoms are value atoms:
theCardinalValue(t, float), theScalarValue(t, scalar ), theCategoricalValue(t,
categorical ), cardinalUnitClass(t, cardinalUnitClass), and scalarUnitClass(t,
scalarUnitClass).
     In addition, an AURA KB can contain qualified number restrictions. Due to a lack of
counting quantifiers, we represent them by means of quadrary atoms
maxCardinality(t, R, n, C) (resp. minCardinality and exactCardinality), where n
is a non-negative integer, C is a class, and R is a relation name.
     SCAs : sufficient conditions: ∀x : Θ [x, . . .] ⇒ C(x)∧EQs [x, . . .], where Θ [x, . . .]
is a conjunction of unary, binary, value and qualified number restriction atoms over
terms {x, x1 , x2 , . . .}, the sufficient conditions, and EQs [X, . . .] is a conjunction of
equality atoms of the form t1 = t2 , where t1 ∈ {x, x1 , x2 , . . .}, and t2 ∈ {x, f n1C (x),
f n2C (x), . . .}, linking the variables in the antecedent to the Skolem function values in
the consequent of the necessary conditions, Φ(x). Obviously, requiring the use of the
Skolem functions in the antecedent of the sufficient condition would be a too strong re-
quirement and render the sufficient condition inapplicable in many cases. Also note that
Θ0 [x] ⊆ Φ [x], where Θ0 [x] is the result of substituting the variables Θ [x] with their re-
spective Skolem terms from EQs [x, . . .]: Θ0 [x] = Θ [x]{t1 7→t2 ,t1 =t2 ∈EQs[x,...]} . Hence,
every sufficient condition is also necessary (a byproduct of the graphical modeling).
     For a given class name C, we refer to the corresponding axioms as DAs(C),
TAs(C), and EQAs(C). We refer to the union of all axioms for C as CAs(C).
     RAs : For every relation name R ∈ RN , RAs may contain the following: DRAs :
relation domain restrictions ∀x, y : R(x, y) ⇒ C1 (x) ∨ . . . ∨ Cn (x); RRAs : relation
range restrictions ∀x, y : R(x, y) ⇒ D1 (y) ∨ . . . ∨ Dm (y); RHAs : simple relation
hierarchy ∀x, y : R(x, y) ⇒ S(x, y); QRHAs : qualified relation hierarchy ∀x, y :
R(x, y)∧C(x)∧D(y) ⇒ S(x, y); IRAs : inverse relations ∀x, y : R(x, y) ⇒ S(y, x);
12NAs : 1-to-N cardinality ∀x, y, z : R(x, y) ∧ R(z, y) ⇒ x = z; N21As : N-to-1
cardinality ∀x, y, z : R(x, y)∧R(x, z) ⇒ y = z; TRANSAs : simple transitive closure
axioms ∀x, y, z : R(x, y) ∧ Rstar (y, z) ∧ C(x) ∧ D(y) ∧ E(z) ⇒ Rstar (x, z), where
Rstar (x, z) = R∗ (x, z); GTRANSLAs : generalized transitive closure axioms (left
composition) ∀x, y, z : R(x, y) ∧ S(y, z) ∧ C(x) ∧ D(y) ∧ E(z) ⇒ Rstar (x, z); and
GTRANSRAs : generalized transitive closure axioms (right composition) ∀x, y, z :
R(x, y) ∧ S(y, z) ∧ C(x) ∧ D(y) ∧ E(z) ⇒ Sstar (x , z ).
     We refer to the axioms for a relation R by DRAs(R) etc. We refer to the union of
all axioms for R as RAs(R).

4   The OWL Translations of KB Bio 101

Our OWL translator produces OWL2 functional syntax [17], which has good human
readability and is readily processed by most OWL2 reasoners. The generated KBs have
been syntax-tested with Protégé 4.2 [14] (utilizing the OWLAPI parser) as well as Rac-
erPro [16] (which has its own proprietary parser).
    The following features of KBB101 might be challenging for OWL reasoners:
    Cycles: KBB101 contains terminological cycles. It does not have the finite model
property, nor the tree model property [1].
    Size: the most complete export is 16 MBs big.
    Complexity: the most complete export exploits SHOIQ(Dn ) [3] (potentially we
could use SROIQ(Dn ) [10], but we currently do not include complex role inclusions,
see below for a discussion).
    Graph structures: we cannot represent the graph structures truthfully in OWL2.
The original graph structures have to be approximated. We do this by rewriting and
exporting the KBB101 in two flavors. Flavor 1 - Unraveling: We unravel the graph
structures up to a certain maximal depth n. Unraveling is a standard technique from
modal logics - let us give the following intuition: Given a class graph C (see Fig. 1), an
up to max. depth n unraveled version of C can be produced by an n-bounded depth-first
graph traversal of C, starting from the root node x, that outputs the (inverse) edge label
whenever an (inverse) edge is traversed to visit a successor node, together with the node
label. This produces a tree. This tree with max-depth n is then translated into OWL2
functional syntax as described. It results in an approximation of the original KBB101
which gets better the larger the value of n is. Note that nodes reachable only over paths
of length > n are excluded. The filenames of KBB101s which were produced using un-
raveling start with kb-owl-syntax-unraveled-depth-n. We currently vary
n from 0 to 4 and produce the corresponding KBB101s. With n = 0, the axioms in
NCAs and SCAs are basically ignored, as the unraveled tree consists of the root node
only (hence, only the taxonomy is exported). Flavor 2 - Node IDs: We can represent
the graph structure by introducing symbolic node identifiers in the OWL2 class expres-
sions. Even though the OWL2 reasoner will be blind to the intended semantic meaning
of these node IDs, modeling graph structure and co-references, the original graph struc-
ture is at least represented and could, in principle, be exploited for reasoning by some
powerful extended future OWL2 reasoner. Note that node IDs are only introduced if
required (in tree-shaped class descriptions they are not required). Moreover, those node
IDs can either be rendered as atomic classes, or introduced as nominals. The filenames
of the respective KBB101s start with kb-owl-syntax-coreference-IDs.
     Explicit inheritance and equality: the inter-class co-references between Skolem
function values and equality atoms cannot be represented in OWL2. We hence skip
all the axioms in EQNs. We consider the OWL2 export underspecified. In principle,
we could preserve some of those by using functional properties and encoding tricks,
but even then, feature agreements or role value maps might be required, and already
ALCF with general TBoxes is undecidable [2].
    Rendering of axioms We can en- and disable the export of certain axiom types,
e.g, there is a switch which determines whether DAs are exported or not, and likewise
for other axiom types. We produce all KBB101s for all possible combinations of those
switches. Let us describe the rendering of class axioms and relation axioms. In the
following, C’ denotes the OWL2 version of class C, and R’ the corresponding property
of relation R.
    The class axioms CA(C) are exported as follows: The axioms TAs(C) and NCA(C)
are combined into one axiom of the form ∀x : C(x) ⇒ Ω, which is then rendered as a
SubClassOf(C Ω 0 ) axiom. Here, Ω 0 is either an – up to depth n – unraveled ver-
sion of Ω as an OWL2 class expression, or the Ω 0 class expression uses node IDs for
representing the graph structure as described. Note that the DAs(C) and EQAs(C)
are excluded here. Moreover, if C has a user-description or -comment, then this is ren-
dered as AnnotationAssertion(C’ string). During rendering of SCAs, we
are omitting the EQs [x, . . .] from ∀x : Θ [x, . . .] ⇒ C(x) ∧ EQs [x, . . .] ∈ SCAs.
KBB101s with SCAs preserved have a triggers in their file names. We generate
a SubClassOf(Θ0 C) axiom, where Θ0 is Θ [x, . . .] as an OWL2 class expres-
sion, unraveled up to depth n. Disjointness axioms DAs are represented by means
of DisjointClasses. The rendering of DAs can be suppressed; KBB101s with
DAs preserved have -disjointness in their file names. The rendering of cardi-
nality constraints in necessary conditions NCAs can be omitted. Also, we may choose
to only export the cardinality constraints with cardinalities 0 and 1, as those are the
only cardinality constraints used by the KM reasoning system [7] (the other cardinal-
ity constraints are ignored). KBB101s with cardinality constraints preserved have a
cardinalities resp. km-relevant-cardinalities in their file names.
     We also employ class and property annotation axioms to represent user descriptions
and documentations.
     The inter-class equality axioms EQAs are ignored – as explained, there is no straight-
forward way to model our Skolem function inheritance in OWL2. However, the user
asserted intra-class equality and in-equality atoms are retained, and we are using the
:same-as and :not-equal object properties for that purpose.
     Exporting the relation axioms RAs is straightforward, too. KBB101s with rela-
tion axioms retained have a relation-axioms in their file names: The axioms
TRANSAs, GTRANSLAs, GTRANSRAs are analyzed. If an axiom can be truth-
fully encoded as an OWL2 complex role inclusion axiom obeying the regularity con-
dition [10], then it is included in the file (unfortunately, none are, so the KBB101 ends
up in SHOIQ(Dn ) instead of SROIQ(Dn )). If a relations R turns out to be transi-
tive, then this is declared by means of TransitiveObjectProperty(R) axiom.
RDAs(R) are rendered as ObjectPropertyDomain(R, C), for every ∀x, y :
R(x, y) ⇒ C(x) ∈ RDAs(R). RRAs(R) are rendered as ObjectProperty-
Range(R, C), for every ∀x, y : R(x, y) ⇒ D(y) ∈ RRAs(R). RHAs(R) are
rendered as SubObjectProperty(R,S), for every ∀x, y : R(x, y) ⇒ S(x, y) ∈
RHAs(R). IRAs(R) are rendered as InverseObjectProperties(R, S), for
every ∀x, y : R(x, y) ⇒ S(y, x) ∈ IRAs(R). If N21As(R) 6= ∅, then we declare
FunctionalObjectProperty(R), and ObjectProperty(R) otherwise. If R
has a user-description string, then this is rendered as an AnnotationAssertion(R
string).
     Rendering of terms: OWL2 is a term-free language. However, there is the analog
of first-order constants, so-called nominals, and we may choose to use them for the rep-
resentation of categorical property values (such as green) and scalar symbolic prop-
erty values (such as big). A categorical property value such as green can either be
represented as a type / instance assertion of the form ClassAssertion(:Color-
Constant :green) and then used as a nominal object property filler in class sub-
expressions such as ObjectHasValue(:color :green), or :green might be
a special subclass of :ColorConstant, SubClassOf(:green :ColorCon-
stant), and then used in an ObjectSomeValuesFrom(:color :green) ex-
pression to represent the color of some object. However, for string- and float-based
property values we need to use a datatype property-based representation, e.g. Data-
HasValue(:theCardinalValue "43.0e0"ˆˆxsd:float). KBB101s us-
ing the nominal representation have a value-nominals in their file names, and oth-
erwise value-classes. The rendering of value classes and nominals can also be
switched off completely.

5   Conclusion

An initial version of the KB Bio 101 in OWL2 is now available [4] and we are very
interested to actively engage with the research community to facilitate its use. We are
also looking forward to seeing the reasoning runtimes of different systems participating
in the ORE 2013 reasoner competition for the different OWL2 variants of KBB101.
The reasoning problems we are currently interested in are consistency checking and
classification.
Acknowledgment: This work has been funded by Vulcan Inc.


References
 1. F. Baader. Terminological Cycles in a Description Logic with Existential Restrictions. In
    International Joint Conference on Atificial Intelligence, 2003.
 2. F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors.
    The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge
    University Press, 2003.
 3. F. Baader and B. Hollunder. Qualifying Number Restrictions in Concept Languages. In
    International Conference on Knowledge Representation and Reasoning, 1991.
 4. V. K. Chaudhri, S. Heymans, and M. A. Wessel. The Bio KB 101 Download Page, 2012.
    See http://www.ai.sri.com/halo/halobook2010/exported-kb/biokb.
    html.
 5. V. K. Chaudhri, M. W. Stijn Heymans, and S. C. Tran. Object-Oriented Knowledge Bases
    in Logic Programming. Technical report, SRI International, 2013. Available from http:
    //www.ai.sri.com/pub_list/1935.
 6. V. K. Chaudhri, M. A. Wessel, and S. Heymans. KB Bio 101: A Challenge for TPTP First-
    Order Reasoners. In CADE-24 Workshop on Knowledge Intensive Automated Reasoning,
    2013. Available from http://www.ai.sri.com/pub_list/1937.
 7. P. Clark and B. Porter. Building Concept Representations from Reusable Components. In
    AAAI. AAAI Press, 1997.
 8. D. Gunning and V. Chaudhri et al. Project Halo Update Progress Toward Digital Aristotle.
    AI Magazine, Fall 2010.
 9. B. Grosof. The SILK Project: Semantic Inferencing on Large Knowledge, 2012. See http:
    //silk.semwebcentral.org/.
10. I. Horrocks, O. Kutz, and U. Sattler. The Even More Irresistible SROIQ. In International
    Conference on Knowledge Representation and Reasoning, 2006.
11. V. Inc. Project Halo, 2012. See http://www.projecthalo.com/.
12. B. Motik, B. C. Grau, I. Horrocks, and U. Sattler. Representing Ontologies Using Description
    Logics, Description Graphs, and Rules. Artificial Intelligence, 173(14), Sept. 2009.
13. A. Overholtzer, A. Spaulding, V. K. Chaudhri, and D. Gunning. Inquire: An Intelligent
    Textbook. In Proceedings of the Vidoe Track of AAAI Conference on Artificial Intelli-
    gence. AAAI Press, 2012. See http://www.aaaivideos.org/2012/inquire_
    intelligent_textbook/.
14. Protégé Group. The Protégé Ontology Editor and Knowledge Acquisition System, 2012. See
    http://protege.stanford.edu.
15. J. B. Reece, L. A. Urry, M. L. Cain, S. A. Wasserman, P. V. Minorsky, and R. B. Jackson.
    Campbell Biology, 9th ed. Harlow: Pearson Education, 2011.
16. V. Haarslev and K. Hidde and R. Möller and M. Wessel. The RacerPro Knowledge Repre-
    sentation and Reasoning System. Semantic Web Journal, 3(3):267–277, 2012.
17. W3C OWL Working Group. OWL 2 Web Ontology Language: Document Overview.
    W3C Recommendation, 27 October 2009. Available at http://www.w3.org/TR/
    owl2-overview/.