Completing Description Logic Knowledge Bases using Formal Concept Analysis? Franz Baader1 , Bernhard Ganter1 , Ulrike Sattler2 and Barış Sertkaya1 1 TU Dresden, Germany 2 The University of Manchester, UK Abstract. We propose an approach for extending both the terminolog- ical and the assertional part of a Description Logic knowledge base by using information provided by the knowledge base and by a domain ex- pert. The use of techniques from Formal Concept Analysis ensures that, on the one hand, the interaction with the expert is kept to a minimum, and, on the other hand, we can show that the extended knowledge base is complete in a certain, well-defined sense. 1 Introduction Description Logics (DLs) [1] are employed in various application domains, such as natural language processing, configuration, databases, and bio-medical ontolo- gies, but their most notable success so far is due to the fact that DLs provide the logical underpinning of OWL, the standard ontology language for the seman- tic web [6]. As a consequence of this standardization, several ontology editors support OWL [9, 10, 8], and ontologies written in OWL are employed in more and more applications. As the size of these ontologies grows, tools that sup- port improving their quality become more important. The tools available until now use DL reasoning to detect inconsistencies and to infer consequences, i.e., implicit knowledge that can be deduced from the explicitly represented knowl- edge. There are also promising approaches that allow to pinpoint the reasons for inconsistencies and for certain consequences, and that help the ontology en- gineer to resolve inconsistencies and to remove unwanted consequences [13, 7]. These approaches address the quality dimension of soundness of an ontology, both within itself (consistency) and w.r.t. the intended application domain (no unwanted consequences). In the present paper, we are concerned with a different quality dimension: completeness. We provide a basis for formally well-founded techniques and tools that support the ontology engineer in checking whether an ontology contains all the relevant information about the application domain, and to extend the ontology appropriately if this is not the case. A DL knowledge base (nowadays often called ontology) usually consists of two parts, the terminological part (TBox), which defines concepts and also states ? Supported by DFG (GRK 334/3) and the EU (IST-2005-7603 FET project TONES and NoE 507505 Semantic Mining). additional constraints (so-called general concept inclusions, GCIs) on the inter- pretation of these concepts, and the assertional part (ABox), which describes individuals and their relationship to each other and to concepts. Given an appli- cation domain and a DL knowledge base (KB) describing it, we can ask whether the KB contains all the relevant information about the domain: Are all the rel- evant constraints that hold between concepts in the domain captured by the TBox? Are all the relevant individuals existing in the domain represented in the ABox? As an example, consider the OWL ontology for human protein phosphatases that has been described and used in [16]. This ontology was developed based on information from peer-reviewed publications. The human protein phosphatase family has been well characterised experimentally, and detailed knowledge about different classes of such proteins is available. This knowledge is represented in the terminological part of the ontology. Moreover, a large set of human phosphatases has been identified and documented by expert biologists. These are described as individuals in the assertional part of the ontology. One can now ask whether the information about protein phosphatases contained in this ontology is complete. Are all the relationships that hold among the introduced classes of phosphatases captured by the constraints in the TBox, or are there relationships that hold in the domain, but do not follow from the TBox? Are all possible kinds of human protein phosphatases represented by individuals in the ABox, or are there phosphatases that have not yet been included in the ontology or even not yet been identified? Such questions cannot be answered by an automated tool alone. Clearly, to check whether a given relationship between concepts—which does not follow from the TBox—holds in the domain, one needs to ask a domain expert, and the same is true for questions regarding the existence of individuals not described in the ABox. The rôle of the automated tool is to ensure that the expert is asked as few questions as possible; in particular, she should not be asked trivial questions, i.e., questions that could actually be answered based on the represented knowl- edge. In the above example, answering a non-trivial question regarding human protein phosphatases may require the biologist to study the relevant literature, query existing protein databases, or even carry out new experiments. Thus, the expert may be prompted to acquire new biological knowledge. Attribute exploration is an approach developed in Formal Concept Analysis (FCA) [5] that is used to acquire knowledge about an application domain by querying an expert. One of the earliest applications of this approach is described in [15], where the domain is lattice theory, and the goal of the exploration process is to find, on the one hand, all valid relationships between properties of lattices (like being distributive), and, on the other hand, to find counterexamples to all the relationships that do not hold. To answer a query whether a certain rela- tionship holds, the lattice theory expert must either confirm the relationship (by using results from the literature or carrying out a new proof for this fact), or give a counterexample (again, by either finding one in the literature or constructing a new one). Although this sounds very similar to what is needed in our context, we can- not directly use this approach. The main reason is the open-world semantics of description logic knowledge bases. In DLs, if we cannot deduce from the knowl- edge base that an individual i is an instance of the concept C, we do not assume that i belongs to ¬C. In contrast classical FCA assumes complete knowledge in the sense that, whenever it is not explicitly mentioned that an object o has a property p, we assume that o does not have p. There has been some work on how to extend FCA and attribute exploration from complete knowledge to the case of partial knowledge [11, 4]. However, this work is based on assumptions that are different from ours. In particular, it assumes that the expert cannot answer all queries and, as a consequence, the knowledge obtained after the exploration process may still be incomplete. In contrast, our intention is to complete the KB, i.e., in the end we want to have complete knowledge about these relation- ships. What may be incomplete is the description of individuals used during the exploration process. Due to space limitation, we introduce our variant of FCA that can deal with the open-world semantics of DL knowledge bases only briefly. For a more comprehensive description, we refer the reader to the long version of this paper [3] or to the technical report [2], which also contains full proofs of our results. 2 Exploring Partial Contexts In this section, we extend the classical approach to Formal Concept Analysis (FCA), as described in detail in [5], to the case of individuals (called objects in FCA) that have only a partial description in the sense that, for some prop- erties (called attributes in FCA), it is not known whether they are satisfied by the individual or not. The connection between this approach and the classical approach is explained in [2]. In the following, we assume that we have a finite set M of attributes and a (possibly infinite) set of objects. Definition 1. A partial object description (pod) is a tuple (A, S) where A, S ⊆ M are such that A ∩ S = ∅. We call such a pod a full object description (fod) if A ∪ S = M . A set of pods is called a partial context and a set of fods a full context. Intuitively, the pod (A, S) says that the object it describes satisfies all attributes from A and does not satisfy any attribute from S. For the attributes not con- tained in A ∪ S, nothing is known w.r.t. this object. Definition 2. We say that the pod (A0 , S 0 ) extends the pod (A, S), and write this as (A, S) ≤ (A0 , S 0 ), if A ⊆ A0 and S ⊆ S 0 . Similarly, we say that the partial context K0 extends the partial context K, and write this as K ≤ K0 , if every pod in K is extended by some pod in K0 . If K is a full context and K ≤ K, then K is called a realizer of K. If (A, S) is a fod and (A, S) ≤ (A, S), then we also say that (A, S) realizes (A, S). The following notion of implication formalizes the informal notion “relation- ship between properties” used in the introduction. Definition 3. An implication is of the form L → R where L, R ⊆ M . This implication is refuted by the pod (A, S) if L ⊆ A and R ∩ S 6= ∅. It is refuted by the partial context K if it is refuted by at least one element of K. The set of implications that are not refuted by a given partial context K is denoted by Imp(K). The set of all fods that do not refute a given set of implications L is denoted by Mod (L). For a set of implications L and a set P ⊆ M , the implicational closure of P with respect to L, denoted by L(P ), is the smallest subset Q of M such that – P ⊆ Q, and – L → R ∈ L and L ⊆ Q imply R ⊆ Q. A set P ⊆ M is called L-closed if L(P ) = P . Definition 4. The implication L → R is said to follow from a set J of impli- cations if R ⊆ J (L). The set of implications J is called complete for a set of implications L if every implication in L follows from J . It is called sound for L if every implication that follows from J is contained in L. A set of implications J is called a base for a set of implications L if it is both sound and complete for L, and no strict subset of J satisfies this property. The following is a trivial fact regarding the connection between partial con- texts and the implications they do not refute. Proposition S 1. For a given set P ⊆ M and a partial context K, K(P ) := M \ {S | (A, S) ∈ K, P ⊆ A} is the largest subset of M such that P → K(P ) is not refuted by K. Attribute exploration with partial contexts The classical attribute ex- ploration algorithm of FCA assumes that there is a domain expert that can answer questions regarding the validity of implications in the application do- main. Accordingly, our approach requires an expert that can decide whether an implication is refuted in the application domain or not. In contrast to existing work on extending FCA to the case of partial knowledge [11, 4], we do not as- sume that the expert has only partial knowledge and thus cannot answer all implication questions. To be more precise, we consider the following setting. We are given an initial (possibly empty) partial context K, an initially empty set of implications L, and a full context K that is a realizer of K. The expert answers implication questions “L → R?” w.r.t. the full context K. More precisely, if the answer is “yes,” then K does not refute L → R. The implication L → R is then added to L. Otherwise, the expert extends the current context K such that the extended context refutes L → R and still has K as a realizer. Consequently, the following invariant will be satisfied by K, K, L: K ≤ K ⊆ Mod (L). Our aim is to enrich K and L such that eventually L is not only sound, but also complete for Imp(K), and K refutes all other implications (i.e., all the implications refuted by K). As in the classical case, we want to do this by asking as few as possible questions to the expert. Definition 5. Let L be a set of implications and K a partial context. An im- plication is called undecided w.r.t. K and L if it neither follows from L nor is refuted by K. It is decided w.r.t. K and L if it is not undecided w.r.t. K and L. In principle, our attribute exploration algorithm tries to decide each undecided implications by either adding it to L or extending K such that it refutes the implication. If all implications are decided, then our goal is achieved. Proposition 2. Assume that K ≤ K ⊆ Mod (L) and that all implications are decided w.r.t. K and L. Then L is complete for Imp(K) and K refutes all impli- cations not belonging to Imp(K). How can we find—and let the expert decide—all undecided implications with- out considering all implications? The following proposition motivates why it is sufficient to consider implications whose left-hand sides are L-closed. Proposition 3. Let L be a set of implications and L → R an implication. Then, L → R follows from L iff L(L) → R follows from L. Concerning right-hand sides, Proposition 1 says that the largest right-hand side R such that L → R is not refuted by K is R = K(L). Putting these two observations together, we only need to consider implications of the form L → K(L) where L is L-closed. In order to enumerate all relevant left-hand sides, we can thus use the well-known approach from FCA for enumerating closed sets in the lectic order [5]. Definition 6. Assume that M = {m1 , . . . , mn } and fix some linear order m1 < m2 < · · · mn on M . The lectic order < is defined as follows: for mi ∈ M and A, B ⊆ M we define A