Axiomatizing ℰℒKgfp -General Concept Inclusions in the Presence of Untrusted Individuals Daniel Borchmann* TU Dresden Abstract. To extract terminological knowledge from data, Baader and Distel have proposed an effective method that allows for the extraction of a base of all valid general concept inclusions of a given finite interpre- tation. In previous works, to be able to handle small amounts of errors in our data, we have extended this approach to also extract general concept inclusions which are “almost valid” in the interpretation. This has been done by demanding that general concept inclusions which are “almost valid” are those having only an allowed percentage of counterexamples in the interpretation. In this work, we shall further extend our previous work to allow the interpretation to contain both trusted and untrusted individuals, i. e. individuals from which we know and do not know that they are correct, respectively. The problem we then want to solve is to find a compact representation of all terminological knowledge that is valid for all trusted individuals and is almost valid for all others. 1 Introduction Constructing description logic knowledge bases is an expensive, time-consuming and often cumbersome task. The main reason for this is that it almost always has to be conducted by human experts, since they provide the means to (more or less) reliably transform informally stated knowledge into a formal reformulation. Thus, methods to assist human experts in constructing these ontologies would be highly helpful. For example, one could provide a first, rough approximation of the desired knowledge base, i. e. a sketch of the ontology, that the expert then could build upon. The extraction of such first knowledge bases is a widely studied topic. Interest- ing approaches in this direction have been proposed by Völker and Niepert [14] with their method of Statistical Schema Induction, which relies on methods from data mining to extract certain terminological axioms from data given as a set of RDF triples. Another approach has been discussed by Baader and Distel [3], which proposes a method to extract a finite base of all valid ℰℒK -general concept inclusions (GCIs) of a given finite interpretation, relying on methodology from the mathematical field of formal concept analysis. Such a finite base could then be used as a starting point for the terminological part of a future knowledge base. The latter approach has a particular appeal: since bases are complete in the sense that all valid GCIs expressible in ℰℒK already follow from it, such a base provides * Supported by DFG Graduiertenkolleg 1763 (QuantLA) all the knowledge which is present in the data. However, this approach also has a drawback: since only valid GCIs are considered, even very rare counterexamples could invalidate otherwise correct (and useful) GCIs, leading the algorithm to come up with a lot of special cases to circumvent such errors. Thus, based upon the approach of Baader and Distel, the author has developed an extension that allows for finding finite bases of “almost valid” GCIs of a given finite interpretation [8, 6, 7], an extension which turns out to be somehow similar to Statistical Schema Induction. To define the notion of “almost valid”, we use the notion of confidence from data mining [1]: GCIs are “almost valid” if and only if their confidence is high enough. In other words, the amount of counterexamples for such a GCI has to be rather small compared to all individuals to which this GCIs applies. However, within this extension, all individuals in a given interpretation are suspected to be erroneous, an assumption which may not be correct in general. We thus extend these results in the present work to include the possibility that we “trust” certain individuals, i. e. that we state a-priori that they are correct as they are. This implies that all GCIs which are invalidated by such trusted individuals are false in our original domain of interest, and should thus not be included in an according knowledge base. The target description logic we would like to consider is ℰℒK , i. e. we would like to extract bases of ℰℒK -GCIs which are “almost valid” for untrusted individuals and valid for trusted individuals. However, due to technical reasons, we have to take a detour and need to also consider ℰℒK K gfp , an extension of ℰℒ by means of cyclic concept descriptions. More precisely, we shall show in this work how to find bases of ℰℒK gfp -GCIs in the presence of untrusted individuals. To then obtain from this base of ℰℒK K gfp -GCIs a base of ℰℒ -GCIs we can use results from [6, 10] showing theses ℰℒgfp -bases can effectively be turned into equivalent ℰℒK -bases. However, K due to space restrictions, we shall not consider this transformation here. The paper is structured as follows. Firstly, we shall introduce the necessary notions from the field of description logics as they are needed in this work. More precisely, we shall introduce the logics ℰℒK and ℰℒK gfp as well as the notion of model-based most-specific concept descriptions from [10]. Then, in Section 3, we shall give a formalization of “almost valid” GCIs and the notion of “trusted” and “untrusted” individuals. Based upon this, we shall give a first finite base of all such GCIs in Section 4.3, and extend this result in Section 4.4 to allow us to perform this computation by means of formal concept analysis. Finally, we close this work by some conclusions and outlook on future work. 2 Preliminaries We introduce the necessary definitions from the field of description logics that are necessary for our further considerations. More precisely, we shall introduce the description logics ℰℒK and ℰℒKgfp in Section 2.1 as well as the notion of general concept inclusions in Section 2.2. In Section 2.3 we shall discuss the notion of model-based most-specific concept descriptions. 2.1 The Description Logics ℰℒK and ℰℒK gfp Let 𝑁𝐶 and 𝑁𝑅 be two disjoint sets. An ℰℒ-concept description 𝐶 is formed according to the rule 𝐶 ::“ 𝐴 | J | 𝐶 [ 𝐶 | D𝑟.𝐶 where 𝐴 P 𝑁𝐶 and 𝑟 P 𝑁𝑅 . An ℰℒK -concept description is either K or an ℰℒ-concept description. The semantics of ℰℒK -concept descriptions are defined through the notion of interpretations. An interpretation ℐ “ p𝛥ℐ , ¨ℐ q consists of a set 𝛥ℐ of individuals and an interpretation function ¨ℐ which maps concept names 𝐴 P 𝑁𝐶 to subsets 𝐴ℐ of 𝛥ℐ and role name 𝑟 P 𝑁𝑅 to subsets 𝑟ℐ of 𝛥ℐ ˆ 𝛥ℐ . The interpretation function ¨ℐ can naturally be extended to all ℰℒK -concept descriptions 𝐶, 𝐷 in the usual way, i. e. Kℐ “ H, Jℐ “ 𝛥ℐ , p𝐶 [ 𝐷qℐ “ 𝐶 ℐ X 𝐷ℐ , pD𝑟.𝐶qℐ “ t 𝑑 P 𝛥ℐ | D𝑒 P 𝛥ℐ : p𝑑, 𝑒q P 𝑟ℐ ^ 𝑒 P 𝐶 ℐ u, where 𝑟 P 𝑁𝑅 . We shall call the set 𝐶 ℐ the extension of 𝐶 in ℐ. For each 𝑥 P 𝛥ℐ we shall say that 𝑥 satisfies 𝐶 if and only if 𝑥 P 𝐶 ℐ . The main distinction between ℰℒK K gfp and ℰℒ is that the former allows for cyclic concept descriptions. More formally, let 𝑁𝐷 be a set disjoint to both 𝑁𝐶 and 𝑁𝑅 . We call this set the set of defined concept names. A concept definition is then an expression of the form 𝐴 ” 𝐶, where 𝐴 P 𝑁𝐷 and 𝐶 is an ℰℒK -concept description which can use in the place of concept names from 𝑁𝐶 also concept names from 𝑁𝐷 . Let 𝒯 be a finite set of concept definitions where each defined concept names appears exactly once on the left-hand side of a concept definition in 𝒯 . Then 𝒯 is called a cyclic TBox. We define 𝑁𝐷 p𝒯 q as the set of all concept names from 𝑁𝐷 that appear in some concept definition in 𝒯 . Then, an ℰℒgfp -concept description is defined as a pair p𝐴, 𝒯 q, where 𝒯 is a cyclic TBox and 𝐴 P 𝑁𝐷 appears on the left-hand side of a concept definition p𝐴 ” 𝐶q P 𝒯 . An ℰℒK gfp -concept description is either of the form K or is an ℰℒgfp -concept description. As an example of an ℰℒK gfp -concept description we can consider the concept description 𝐸, where 𝐸 “ p𝐴, t 𝐴 ” Cat [ Dhunts.𝐵 𝐵 ” Mouse [ Dhunts.𝐴 uq. Intuitively, 𝐸 represents all cats hunting a mouse which again hunts a cat — a common situation in the old cartoon series “Tom and Jerry.” In other words, given an interpretation ℐ, the semantics of 𝐸 can be understood as follows: let 𝐴ℐ and 𝐵 ℐ be the Ď-maximal subsets of 𝛥ℐ such that – all individuals in 𝐴ℐ satisfy Cat and have a hunts-successor in 𝐵 ℐ , and – all individuals in 𝐵 ℐ satisfy Mouse and have a hunts-successor in 𝐴ℐ . It can be shown that those maximal sets always exist. We then define 𝐸 ℐ :“ 𝐴ℐ . To define the semantics for ℰℒK gfp formally and in general, it is necessary to resolve the cyclic dependencies within ℰℒK gfp -concept descriptions. This is done using greatest fixpoint semantics [13, 2]. Since the exact definition of the semantics of ℰℒKgfp exceeds the space available for this publication, we leave out the details and refer the reader to the corresponding publications. 2.2 General Concept Inclusions Terminological knowledge is represented in the form of general concept inclusions (GCIs). These are expressions of the form 𝐶 Ď 𝐷, where 𝐶 and 𝐷 are concept descriptions. We speak of ℰℒK -GCIs if both 𝐶 and 𝐷 are ℰℒK -concept descriptions, and likewise for other description logics. The semantics of GCIs is again defined via interpretations. We say that an interpretation ℐ is a model of a GCI 𝐶 Ď 𝐷 if and only if 𝐶 ℐ Ď 𝐷ℐ . If ℐ is a model of 𝐶 Ď 𝐷, then we shall also say that 𝐶 Ď 𝐷 is valid in ℐ. If 𝐶 Ď 𝐷 is valid in every possible interpretation we shall say that 𝐶 is subsumed by 𝐷. This fact is commonly also denoted by 𝐶 Ď 𝐷 (as a statement, not an expression). If ℒ is a set of GCIs and 𝐶 Ď 𝐷 is another GCI, then we say that ℒ entails 𝐶 Ď 𝐷 and write ℒ |ù p𝐶 Ď 𝐷q, if and only if for every interpretation 𝒥 which is a model of all GCIs in ℒ, 𝒥 is also a model of 𝐶 Ď 𝐷. Since we are going to extract terminological knowledge from interpretations, we can ask for the set of all ℰℒKgfp -GCIs for which ℐ is a model. We shall denote this set Thpℐq and call it the theory of ℐ. A base of ℐ is a set ℬ Ď Thpℐq such that every GCI from Thpℐq is already entailed by ℬ. 2.3 Model-Based Most-Specific Concept Descriptions Let us fix a finite interpretation ℐ “ p𝛥ℐ , ¨ℐ q. A first attempt to extract termino- logical knowledge from ℐ is to consider the set Thpℐq of all valid ℰℒK -GCIs of ℐ. However, it is quite easy to see that the number of valid ℰℒK -GCIs of ℐ is infinite in general, for if 𝐶 Ď 𝐷 is such a GCI, then D𝑟.𝐶 Ď D𝑟.𝐷 for 𝑟 P 𝑁𝑅 is a valid ℰℒK -GCI as well. Therefore, we cannot simply use the set Thpℐq as a TBox for an ontology. Instead, we try to find a finite base ℬ of Thpℐq. Such a base would contain the same information as Thpℐq, and since it is finite it could be used as a TBox for an ontology. One of the main results from [10] is to prove that such bases always exist, and also to give an effective method to compute them. These results have been achieved using formal concept analysis [11]. The central notion that has been introduced in [10] for bringing together the description logic ℰℒK and formal concept analysis is the one of model-based most- specific concept descriptions. Roughly, for a set 𝑋 Ď 𝛥ℐ we are looking for a concept description that describes the individuals in 𝑋 in the best way possible. More formally, we call an ℰℒK -concept description 𝐶 a model-based most-specific concept description for 𝑋 (in ℰℒK ) if and only if – 𝑋 Ď 𝐶 ℐ and – for all ℰℒK -concept descriptions 𝐷 such that 𝑋 Ď 𝐷ℐ , it is true that 𝐶 Ď 𝐷. It is clear that, if a model-based most-specific concept description for 𝑋 exists, it is unique up to equivalence. In this case, we shall denote it with 𝑋 ℐ , since computing model-based most-specific concept descriptions is somehow “dual” to computing extensions 𝐶 ℐ of concept descriptions 𝐶. If 𝐶 is a concept description, we shall write 𝐶 ℐℐ instead of p𝐶 ℐ qℐ , and likewise for 𝑋 ℐℐ “ p𝑋 ℐ qℐ for 𝑋 Ď 𝛥ℐ . However, it may happen that model-based most-specific concept descriptions in ℰℒK may not exists. To see this, consider the example interpretation 𝑁𝐶 “ H, 𝑁𝑅 “ t 𝑟 u, 𝛥ℐ “ t 𝑥 u, 𝑟ℐ “ t p𝑥, 𝑥q u. Then all ℰℒK -concept descriptions D𝑟.D𝑟.. . . D𝑟.J have the set 𝑋 “ t 𝑥 u in their extension, but there does not exist a most specific one. On the other hand, it can be seen quite easily that the ℰℒKgfp -concept description p𝐴, t 𝐴 ” D𝑟.𝐴 uq is a model- based most-specific concept description of 𝑋, if we consider ℰℒK K gfp instead of ℰℒ in the above definition. Indeed, this not a coincidence, as the following result shows. Theorem 1 (Lemma 4.5 of [10]). Let ℐ be a finite interpretation and 𝑋 Ď 𝛥ℐ . Then there exists a model-based most-specific concept description of 𝑋 in ℰℒK gfp . Because of this result we shall implicitly assume from now on that we are talking about model-based most-specific concept descriptions in ℰℒK gfp . Before we continue, let us note two facts about model-base most-specific concept ℐℐ descriptions. Firstly, if 𝐶 is an ℰℒK gfp -concept description, then 𝐶 Ď 𝐶 is always ℐℐ K a valid GCI of ℐ. Furthermore, 𝐶 is subsumed by 𝐶, again for each ℰℒgfp -concept description 𝐶. Establishing these two facts is not difficult, see [10]. 3 Confidence and Trusted Individuals Recall that we have fixed a finite interpretation ℐ. A first attempt to learn termi- nological knowledge from ℐ was to consider Thpℐq. However, this approach is not really applicable if we regard this interpretation as somehow faulty, in the sense that for certain individuals we are not quite sure whether their concept names or role successors are correct. Example 2. In [9] the approach by Baader and Distel has been applied to an interpretation ℐDBpedia that has been extracted from RDF Triples from the DB- pedia data set by considering the child-relation only. This interpretation had 5262 individuals and ThpℐDBpedia q can be axiomatized by 1252 GCIs. However, the GCI Dchild.J Ď Person was found not to be valid in ℐDBpedia , because of the presence of 4 erroneous counterexamples. However, such a GCI would be considered correct, and it would be preferable if it could be learned, too. Furthermore, 2547 individuals in ℐDBpedia were positive examples for this GCI, i. e. they satisfied both Dchild.J and Person. One could then argue that the 4 counterexamples are “not enough” to invalidate the GCI Dchild.J Ď Person, i. e. this GCI should have been learned as well. However, it might be the case that for certain individuals we are indeed sure that the are correct. These individuals we shall call trusted individuals, whereas the other ones are called untrusted individuals. Example 3. We consider a classical example here. Suppose that we want to learn terminological knowledge about birds, and we consider all GCIs were the number of counterexamples is “small” compared to the number of positive examples (we shall give a formalization for this shortly). Then, the GCI Birds Ď Flies could be extracted from the data set, as counterexamples to this are quite rare (penguins, os- triches and the like). However, these counterexamples are proper counterexamples, i. e. they are correct. We can understand the set of all untrusted individuals as a subinterpretation of ℐ. Definition 4. Let ℐ “ p𝛥ℐ , ¨ℐ q be a finite interpretation. A subinterpretation of ℐ is an interpretation 𝒥 “ p𝛥𝒥 , ¨𝒥 q such that i. 𝛥𝒥 Ď 𝛥ℐ , ii. t 𝐴 P 𝑁𝐶 | 𝑥 P 𝐴𝒥 u “ t 𝐴 P 𝑁𝐶 | 𝑥 P 𝐴ℐ u for all 𝑥 P 𝛥𝒥 , and iii. t 𝑦 P 𝛥𝒥 | p𝑥, 𝑦q P 𝑟𝒥 u Ď t 𝑦 P 𝛥ℐ | p𝑥, 𝑦q P 𝑟ℐ u for all 𝑥 P 𝛥𝒥 and 𝑟 P 𝑁𝑅 . As already stated in the introduction, we are interested in extracting a compact representation of all GCIs of ℐ which hold for all trusted individuals and are “almost valid” for all untrusted ones. To formalize the notion of “almost valid”, we shall make use of the notion of confidence as follows. Definition 5. Let ℐ be a finite interpretation, and let 𝐶 Ď 𝐷 be an ℰℒK gfp -GCI. Then the confidence of 𝐶 Ď 𝐷 in ℐ is defined as # 1 if 𝐶 ℐ “ H conf ℐ p𝐶 Ď 𝐷q :“ |p𝐶[𝐷qℐ | |𝐶 ℐ | otherwise It can be seen quite easily that conf ℐ p𝐶 Ď 𝐷q is the largest number 𝑘 ď 1 such that |p𝐶 [ 𝐷qℐ | ě 𝑘 ¨ |𝐶 ℐ |. We shall now formally define the set of GCIs we are interested in. Definition 6. Let ℐ be a finite interpretation, let 𝒥 be a subinterpretation of ℐ and let 𝑐 P r0, 1s. The set of confident GCIs in ℐ with untrusted individuals 𝒥 is defined to be the following set of ℰℒK gfp -GCIs: Th𝑐 pℐ, 𝒥 q :“ t 𝐶 Ď 𝐷 | 𝐶 ℐ z𝛥𝒥 Ď 𝐷ℐ z𝛥𝒥 and |p𝐶 [ 𝐷qℐ X 𝛥𝒥 | ě 𝑐 ¨ |𝐶 ℐ X 𝛥𝒥 | u. Note that we cannot define Th𝑐 pℐ, 𝒥 q to consist of all GCIs satisfying conf 𝒥 p𝐶 Ď 𝐷q ě 𝑐 instead of |p𝐶 [𝐷qℐ X𝛥𝒥 | ě 𝑐¨|𝐶 ℐ X𝛥𝒥 |. This is because 𝐶 ℐ X𝛥𝒥 “ 𝐶 𝒥 is not true in general. However, if one restricts one attention to closed subinter- pretations 𝒥 , i. e. subinterpretations where no role edges exist between elements from 𝛥𝒥 and 𝛥ℐ z𝛥𝒥 and vice versa, then 𝐶 ℐ X 𝛥𝒥 “ 𝐶 𝒥 is indeed true, as it has been shown in [10, Lemma 6.12]. Another observation about Th𝑐 pℐ, 𝒥 q is that this set is not necessarily closed under entailment. In other words, it is possible that Th𝑐 pℐ, 𝒥 q |ù p𝐶 Ď 𝐷q but p𝐶 Ď 𝐷q R Th𝑐 pℐ, 𝒥 q. However, we can consider GCIs in Th𝑐 pℐ, 𝒥 q as valid in a certain domain of interest, and counterexamples in 𝒥 as erroneous. Then p𝐶 Ď 𝐷q R Th𝑐 pℐ, 𝒥 q, which just means that the amount of counterexamples in 𝒥 is too high, indicates that either the data ℐ is not sufficiently good enough for learning 𝐶 Ď 𝐷 (if it is not valid in our domain) or that some GCIs in Th𝑐 pℐ, 𝒥 q are actually not valid in our domain. In both cases, further refinement is necessary, for the first case by checking completeness with respect to the underlying domain of interest [5], and in the second case by finding reasons for errors in Th𝑐 pℐ, 𝒥 q, for example using axiom pinpointing [4]. 4 Axiomatizing GCIs in the Presence of Untrusted Individuals We shall now show how to axiomatize Th𝑐 pℐ, 𝒥 q. To this end, we show in Sec- tions 4.3 and 4.4 how one can compute finite bases of Th𝑐 pℐ, 𝒥 q, i. e. finite sets ℬ Ď Th𝑐 pℐ, 𝒥 q such that all GCIs in Th𝑐 pℐ, 𝒥 q are already entailed by ℬ. To this end, we shall make use of results obtained by Baader and Distel [10, 3], which are introduced in Section 4.2. As these results make use of formal concept analysis, we shall first introduce some notions of it first. 4.1 Formal Concept Analysis Formal concept analysis is a subfield of mathematical order theory that is usually concerned with investigating different representations of complete lattices. However, is has also been used in different areas, as in data mining and classification. The fundamental notion of formal concept analysis [11] is the one of a formal context. A formal context is a triple K “ p𝐺, 𝑀, 𝐼q where 𝐺 and 𝑀 are sets and 𝐼 Ď 𝐺ˆ𝑀 . Intuitively, we think of the set 𝐺 as the set of objects, the set 𝑀 as the set of attributes, and of the set 𝐼 as an incidence relation between objects and attributes. If 𝑔 P 𝐺 and 𝑚 P 𝑀 , we say that 𝑔 has the attribute 𝑚 if and only if p𝑔, 𝑚q P 𝐼. A for- mal context L “ p𝐻, 𝑁, 𝐽q is a subcontext of K if and only if 𝐻 Ď 𝐺, 𝑁 Ď 𝑀, 𝐽 Ď 𝐼. For a set of objects 𝐵 Ď 𝐺, we can ask for the set 𝐵 1 of common attributes of all objects in 𝐵, i. e. 𝐵 1 “ t 𝑚 P 𝑀 | @𝑔 P 𝐵 : p𝑔, 𝑚q P 𝐼 u. The set 𝐵 1 is called the derivation of 𝐵 in K. Dually, we define for 𝐴 Ď 𝑀 the set 𝐴1 of all objects satisfying all attributes in 𝐴. In the formal context K we can ask the question whether an object 𝑔 that has all attributes from a set 𝐴1 always also has all attributes from a set 𝐴2 , i. e. whether it is true that 𝑔 P 𝐴11 implies 𝑔 P 𝐴12 . We can formalize this question as follows: we call a pair p𝐴1 , 𝐴2 q with 𝐴1 , 𝐴2 Ď 𝑀 an implication, usually written as 𝐴1 Ñ 𝐴2 . We shall say that the implication 𝐴1 Ñ 𝐴2 is valid in K if and only if 𝐴11 Ď 𝐴12 . The set of all implications 𝐴1 Ñ 𝐴2 with 𝐴1 , 𝐴2 Ď 𝑀 is denoted by Impp𝑀 q, and the set of all valid implications of K is called its theory and is denoted by ThpKq. A set ℒ Ď Impp𝑀 q of implications entails an implication 𝐴1 Ñ 𝐴2 if and only if for all contexts in which ℒ is true, the implication 𝐴1 Ñ 𝐴2 is true as well. A set ℒ Ď ThpKq is called a base of K if and only if all valid implications of K are entailed by ℒ. It is obvious that we can extend each set 𝒦 Ď ThpKq to a base of K, provided that K is finite. We call ℒ Ď ThpKq a base with background knowledge 𝒦 if and only if ℒ Y 𝒦 is a base of K. If 𝒦 “ H, then bases with background knowledge 𝒦 are just bases of K. A particularly interesting base is the so called canonical base CanpK, 𝒦q of K, for some given background knowledge 𝒦. Making the definition of this base understandable is hardly possible in the given amount of space, and we refer the reader to [11] for further details. However, we still note that it is well known that CanpK, 𝒦q is a base of smallest cardinality with background knowledge 𝒦, i. e. every set of implications with less elements than CanpK, 𝒦q cannot be a base of K with background knowledge 𝒦. 4.2 Results by Baader and Distel To connect description logics with formal concept analysis, Baader and Distel make use of the notion of of model-based most-specific concept descriptions, and define a formal context Kℐ which captures all relevant information on the valid ℰℒK gfp -GCIs of ℐ. For this, we define 𝑀ℐ :“ t K u Y 𝑁𝐶 Y t D𝑟.𝑋 ℐ | 𝑟 P 𝑁𝑅 , 𝑋 Ď 𝛥ℐ , 𝑋 ‰ H u. The set 𝑀ℐ has the particular property that all model-based most-specific concept descriptions d in ℐ are expressible in terms of 𝑀ℐ [10]: let us denote for a set 𝑈 Ď 𝑀ℐ with 𝑈 the concept description that is either J, when 𝑈 is empty, or 𝑉1 [ . . . [ 𝑉𝑛 , when 𝑈 “ t 𝑉1 , . . . , 𝑉𝑛 u. Then a concept description 𝐶 is expressible d in terms of 𝑀ℐ if and only if there exists a set 𝑁 Ď 𝑀ℐ such that 𝐶 ” 𝑁 . Having defined the set 𝑀ℐ , we can introduce the notion of the induced context of ℐ. This is the formal context Kℐ “ p𝛥ℐ , 𝑀ℐ , ∇q, where for all 𝑥 P 𝛥ℐ and 𝐶 P 𝑀ℐ , it is true that 𝑥∇𝐶 if and only if 𝑥 P 𝐶 ℐ . The derivation operators in Kℐ , the interpretation function ¨ℐ and model-based most-specific concept descriptions are closely related. Proposition 7. Let 𝐴 Ď 𝛥ℐ , 𝐵 Ď 𝑀𝐼 . Then 𝐴ℐ ” 𝐴1 and 𝐵 1 “ p 𝐵qℐ , where d d the derivations are conducted in Kℐ . With some more technical machinery it can even be shown that p 𝐴qℐℐ “ 𝐴2 is d d true for each 𝐴 Ď 𝑀ℐ , i. e. model-based most-specific concept descriptions and the derivation operators in the induced context of ℐ are closely related. This connection also extends to the canonical base of Kℐ and ℰℒK gfp -bases of ℐ. Theorem 8 (5.13 and 5.18 of [10]). Let ℐ be a finite interpretation and define 𝑆ℐ “ t t 𝐶 u Ñ t 𝐷 u | 𝐶, 𝐷 P 𝑀ℐ , 𝐶 Ď 𝐷 u. Then the set l l ℬCan :“ t 𝑈 Ďp 𝑈 qℐℐ | p𝑈 Ñ 𝑈 2 q P CanpKℐ , 𝑆ℐ q u is a finite ℰℒK gfp -base of ℐ of minimal cardinality. Note that the set 𝑆ℐ contains knowledge which is trivially true in every interpre- tation, but not necessarily in every formal context. More precisely, if 𝐶 Ď 𝐷, then we do not need to state this GCI explicitly in a base. However, the corresponding implication t 𝐶 u Ñ t 𝐷 u may not necessarily be valid in every formal context, and therefore bases of K have to contain information to entail such implications. However, we are only interested in bases of ℐ, which is why we add the set 𝑆ℐ as background knowledge. 4.3 A First Finite Base The aim of this section is to provide a first result for axiomatizing the interpretation ℐ with untrusted individuals 𝒥 . To this end we shall make use of ideas from M. Luxenburger’s theory of partial implications in formal contexts [12]. Due to space restrictions, however, we shall not recall his ideas in the original setting here, but shall discuss them in the appropriate reformulation to our setting only. The two ideas from Luxenburger’s work we shall make use of are actually quite simple. First of all, we observe that Thpℐq Ď Th𝑐 pℐ, 𝒥 q. Therefore, to finitely axiomatize the set Th𝑐 pℐ, 𝒥 q, i. e. to find a finite set ℬ Ď Th𝑐 pℐ, 𝒥 q that entails all GCIs in Th𝑐 pℐ, 𝒥 q, it suffices to find a finite base 𝒞 of Th𝑐 pℐ, 𝒥 qz Thpℐq, since then the set ℬCan Y 𝒞 will be a finite base of Th𝑐 pℐ, 𝒥 q. We thus concentrate on finding such a finite base 𝒞 of Th𝑐 pℐ, 𝒥 qz Thpℐq, i. e. on finding a finite set 𝒞 Ď Th𝑐 pℐ, 𝒥 qz Thpℐq which already entails all GCIs in Th𝑐 pℐ, 𝒥 qz Thpℐq. For this, we make two observations for GCIs p𝐶 Ď 𝐷q P Th𝑐 pℐ, 𝒥 qz Thpℐq. Firstly, it is true that p𝐶 Ď 𝐷q P Th𝑐 pℐ, 𝒥 q ðñ p𝐶 ℐℐ Ď 𝐷ℐℐ q P Th𝑐 pℐ, 𝒥 q. This is mainly due to the fact that 𝐸 ℐℐℐ “ 𝐸 ℐ and p𝐸 [ 𝐹 qℐ “ p𝐸 ℐℐ [ 𝐹 ℐℐ qℐ are true for all ℰℒK gfp -concept descriptions 𝐸, 𝐹 , as can be seen easily from the definition of model-based most-specific concept descriptions. We thus obtain 𝐶 ℐ z𝛥𝒥 Ď 𝐷ℐ z𝛥𝒥 ðñ 𝐶 ℐℐℐ z𝛥𝒥 Ď 𝐷ℐℐℐ z𝛥𝒥 |p𝐶 [ 𝐷qℐ X 𝛥𝒥 | ě 𝑐 ¨ |𝐶 ℐ X 𝛥𝒥 | ðñ |p𝐶 ℐℐ [ 𝐷ℐℐ qℐ X 𝛥𝒥 | ě 𝑐 ¨ |𝐶 ℐℐℐ X 𝛥𝒥 | Secondly, if ℬ is a base of Thpℐq, then ℬ Y t 𝐶 ℐℐ Ď 𝐷ℐℐ u |ù p𝐶 Ď 𝐷q, since ℬ |ù p𝐶 Ď 𝐶 ℐℐ q, and 𝐷ℐℐ Ď 𝐷. Thus, let us define Confpℐ, 𝑐, 𝒥 q :“ t 𝐶 ℐℐ Ď 𝐷ℐℐ | p𝐶 ℐℐ Ď 𝐷ℐℐ q P Th𝑐 pℐ, 𝒥 q u, It has been shown in [10] that if ℐ is finite, there exist only finitely many different model-based most-specific concept descriptions up to equivalence. Therefore, we immediately obtain the following result: Theorem 9. Let ℐ be a finite interpretation, 𝒥 be a subinterpretation of ℐ and 𝑐 P r0, 1s. If ℬ is a finite base of ℐ, then the set ℬ Y Confpℐ, 𝑐, 𝒥 q is a finite base of Th𝑐 pℐ, 𝒥 q. 4.4 Computing a Base by means of FCA The result of the previous section provides us with an effective method to compute bases of confident GCIs in the presence of trusted and untrusted individuals. In this section, we shall extend these result to allow us a computation of such a base using methods from formal concept analysis. More precisely, we shall see that we can view our initial data ℐ and 𝒥 as a certain formal context, and computing a base of ℐ with untrusted individuals 𝒥 can then be understood as computing a certain set of implications from this context. There are two reasons why we are interested in such a transformation. Firstly, the overall goal of our considerations is to adapt the algorithm of attribute exploration from formal concept analysis to our setting of extracting terminological knowledge from interpretations (see also Section 5). Secondly, such a reformulation might allow us to use high performance algorithms from the field of data mining to support us in our task, because the field of formal concept analysis can be understood as a theoretical foundation of data mining [15]. The actual transformation is now quite simple: as for general concept inclusions, we can define the notion of confidence for implications as well, i. e. if K is a formal context and 𝐴 Ñ 𝐵 an implication then conf K p𝐴 Ñ 𝐵q :“ 1 if 𝐴1 “ H and conf K p𝐴 Ñ 𝐵q “ |p𝐴 Y 𝐵q1 |{|𝐴1 | otherwise. We shall denote with Imp𝑐 pKq the set of all implications from Impp𝑀 q which have confidence at least 𝑐 in K. The transformation then uses the induced context Kℐ of ℐ and asserts that a certain set of implications yields a base of Th𝑐 pℐ, 𝒥 q. Theorem 10. Let ℐ be a finite interpretation, 𝒥 be a subinterpretation of ℐ and 𝑐 P r0, 1s. Let us denote for 𝑋 Ď 𝛥ℐ with Kℐ |𝑋 the subcontext of Kℐ restricted to the object set 𝑋, and let us define 𝑇 :“ Imp𝑐 pKℐ |𝛥𝒥 q X ThpKℐ |𝛥ℐ z𝛥𝒥 q. If then ℒ Ď 𝑇 is complete for 𝑇 , then l l l ℒ :“ t 𝐴 Ď 𝐵 | p𝐴 Ñ 𝐵q P ℒ u is a base of Th𝑐 pℐ, 𝒥 q. A proof of this theorem can be found in Section A of the appendix. 5 Conclusions and Further Work In this work we have presented an extension to our previous work of axiomatizing general concept inclusions using confidence, that takes into account the existence of trusted individuals from which we can be sure that their properties in the given data are correct. Essentially, we not only consider general concept inclusions whose confidence is high enough, but we also require for such general concept inclusions that they are not falsified by trusted individuals. For these general concept inclu- sions we have discussed the existence and the computation of bases. In particular, we have shown that we can utilize methodology from the field of formal concept analysis to obtain such bases. This has the particular appeal that in the long run we might be able to utilize fast data mining algorithms to compute bases of confident general concept inclusions which respect trusted individuals. The considerations we have presented in this work are part of a larger attempt to apply the algorithm of attribute exploration from formal concept analysis to our setting of confident general concept inclusions. This algorithm has been used before to complete knowledge bases [5]. Essentially, the algorithm generates logical consequences which can be drawn from the given knowledge and asks the expert for their validity. If the expert declines, then the given knowledge is incomplete and the expert has to provide additional information. This algorithm is repeated as long as new consequences can be generated. An attribute exploration algorithm in the setting of confident general concept in- clusions would provide an attempt to solve the issue of rare counterexamples which comes from the heuristic approach of confident general concept inclusions, as it has been indicated in Example 3: if we consider all general concept inclusions whose con- fidence is “high enough”, we may neglect certain counterexamples which are appear seldom in the data but are correct otherwise. An attribute exploration algorithm could help to resolve this issue by consulting the expert whether certain confident general concept inclusions are valid or not. If the expert is then asked a general con- cept inclusion which has rare counterexample, the expert could add (or mark) these counterexamples as trusted individuals, thus prohibiting the algorithm from consid- ering them as errors. Thus, the approach of confident general concept inclusions may benefit from an attribute exploration algorithm, and its design is part of future work. Acknowledgments The author is deeply grateful for the thorough, constructive and motivating comments by the anonymous reviewers, who helped improving the quality of this work. References [1] R. Agrawal, T. Imielinski, and A. Swami. “Mining Association Rules between Sets of Items in Large Databases”. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 1993, pp. 207–216. [2] Franz Baader. “Terminological Cycles in a Description Logic with Existential Restrictions”. In: Proceedings of the Eighteenth International Joint Confer- ence on Artificial Intelligence. (Acapulco, Mexico). Ed. by Georg Gottlob and Toby Walsh. Morgan Kaufmann, 2003, pp. 325–330. [3] Franz Baader and Felix Distel. “A Finite Basis for the Set of ℰℒ-Implications Holding in a Finite Model”. In: Proceedings of the 6th International Conference on Formal Concept Analysis. (Montreal, Canada). Ed. by Raoul Medina and Sergei A. Obiedkov. Vol. 4933. Lecture Notes in Computer Science. Springer, 2008, pp. 46–61. [4] Franz Baader and Rafael Peñaloza. “Axiom Pinpointing in General Tableaux”. In: TABLEAUX. Ed. by Nicola Olivetti. Vol. 4548. Lecture Notes in Computer Science. Springer, 2007, pp. 11–27. isbn: 978-3-540-73098-9. [5] Franz Baader et al. “Completing Description Logic Knowledge Bases using Formal Concept Analysis”. In: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence. AAAI Press, 2007, pp. 230–235. [6] Daniel Borchmann. Axiomatizing Confident ℰℒK gfp -GCIs of Finite Interpre- tations. Report MATH-AL-08-2012. Dresden, Germany: Chair of Algebraic Structure Theory, Institute of Algebra, Technische Universität Dresden, 2012. [7] Daniel Borchmann. Axiomatizing ℰℒK -Expressible Terminological Knowledge from Erroneous Data. to appear in: Proceedings of the Seventh Internation Conference on Knowledge Capture, KCAP 2013. [8] Daniel Borchmann. On Confident GCIs of Finite Interpretations. LTCS- Report 12-06. See http://lat.inf.tu-dresden.de/research/reports. html. Dresden: Institute for Theoretical Computer Science, TU Dresden, 2012. [9] Daniel Borchmann and Felix Distel. “Mining of ℰℒ-GCIs”. In: ICDM Work- shops. Ed. by Myra Spiliopoulou et al. IEEE, 2011, pp. 1083–1090. isbn: 978-0-7695-4409-0. [10] Felix Distel. “Learning Description Logic Knowledge Bases from Data Using Methods from Formal Concept Analysis”. PhD thesis. TU Dresden, 2011. [11] Bernhard Ganter and Rudolf Wille. Formal Concept Analysis: Mathematical Foundations. Berlin-Heidelberg: Springer, 1999. [12] Michael Luxenburger. “Implikationen, Abhängigkeiten und Galois-Abbildungen”. PhD thesis. TH Darmstadt, 1993. [13] Bernhard Nebel. “Terminological Cycles: Semantics and Computational Properties”. In: Principles of Semantic Networks. Morgan Kaufmann, 1991, pp. 331–362. [14] Johanna Völker and Mathias Niepert. “Statistical Schema Induction”. In: The Semantic Web: Research and Applications - 8th Extended Semantic Web Conference, Proceedings, Part I. (Heraklion, Crete, Greece). Ed. by Grigoris Antoniou et al. Vol. 6643. Lecture Notes in Computer Science. Springer, 2011, pp. 124–138. isbn: 978-3-642-21033-4. [15] Mohammed Javeed Zaki and Mitsunori Ogihara. “Theoretical foundation of association rules”. In: SIGMOD’98 Workshop on Research Issues in Data Mining and Knowledge Discovery (SIGMOD-DMKD’98). 1998, pp. 1–8. A Proofs for Section 4.4 For completeness, we provide the proofs we have omitted in Section 4.4. d d The main idea of the proof is to show that set ℒ satisfies ℒ |ù ℬCan Y Confpℐ, 𝑐, 𝒥 q, and thus Theorem 9 yields the desired result. A crucial cornerstone in this argumentation is the following lemma. Lemma 11. Let 𝑀 be a set of concept descriptions, d and dlet ℒ ĎdImpp𝑀 q and p𝑋 Ñ 𝑌 q P Impp𝑀 q. Then ℒ |ù p𝑋 Ñ 𝑌 q implies ℒ |ù p 𝑋 Ď 𝑌 q, where l l l ℒ :“ t 𝑋Ď 𝑌 | p𝑋 Ñ 𝑌 q P ℒ u. Proof. Let 𝒥 “ p𝛥𝒥 , ¨𝒥 q be an interpretation such that 𝒥 |ù d ℒ. Let us define a formal context K𝒥 ,𝑀 “ p𝛥𝒥 , 𝑀, ∇q via 𝑥∇𝐶 ðñ 𝑥 P 𝐶 𝒥 for all 𝑥 P 𝛥𝒥 , 𝐶 P 𝑀 . Then p 𝐸q𝒥 Ď p 𝐹 q𝒥 , d d show now that K𝒥 ,𝑀 |ù ℒ. Let p𝐸 We shall d d Ñ𝒥𝐹 q P ℒ. since 𝒥 |ù ℒ. It is not hard to see that p 𝐸q “ 𝐸 1 , where the derivation has been done in K𝒥 ,𝑀 . Therefore, it is true that 𝐸 1 Ď 𝐹 1 , and thus K𝒥 ,𝑀 |ù p𝐸 Ñ 𝐹 q. dSince ℒ |ù p𝑋 Ñ 𝑌 q, it is truedthat K𝒥 ,𝑀d |ù p𝑋 Ñ d𝑌 q, i. e.d𝑋 1 Ď d 𝑌 1 . As 𝒥 1 𝒥 𝒥 p 𝑋q “ 𝑋 , it is thus true that p 𝑋q Ď p 𝑌 q , i. e. ℒ |ù p 𝑋 Ď 𝑌 q. Theorem 11. Let ℐ be a finite interpretation, 𝒥 be a subinterpretation of ℐ and 𝑐 P r0, 1s. Let us denote for 𝑋 Ď 𝛥ℐ with Kℐ |𝑋 the subcontext of Kℐ restricted to the object set 𝑋, and let us define 𝑇 :“ Imp𝑐 pKℐ |𝛥𝒥 q X ThpKℐ |𝛥ℐ z𝛥𝒥 q. If ℒ Ď 𝑇 is complete for 𝑇 , then l l l ℒ :“ t 𝐴Ď 𝐵 | p𝐴 Ñ 𝐵q P ℒ u is a base of Th𝑐 pℐ, 𝒥 q. Proof. We need to show two claims, namely d i. ℒ Ď Th𝑐 pℐ, 𝒥 q and d ii. ℒ is complete for Th𝑐 pℐ, 𝒥 q. d d d For the first claim we need to show that for every GCI p 𝑋 Ď 𝑌 q P ℒ it is true that i. |p 𝑋 [ 𝑌 qℐ X 𝛥𝒥 | ě 𝑐 ¨ |p 𝑋qℐ X 𝛥𝒥 | d d d d ℐ 𝒥 d ℐ 𝒥 ii. p 𝑋q z𝛥 Ď p 𝑌 q z𝛥 For the first subclaim, we observe that conf Kℐ |𝛥𝒥 p𝑋 Ñ 𝑌 q ě 𝑐 for all p𝑋 Ñ 𝑌 q P ℒ, i. e. |p𝑋 Y 𝑌 q1 X 𝛥𝒥 | ě 𝑐 ¨ |𝑋 1 X 𝛥𝒥 | d ℐ Since p 𝑋q “ 𝑋 1 , we obtain l l |p p𝑋 Y 𝑌 qqℐ X 𝛥𝒥 | ě 𝑐 ¨ |p 𝑋qℐ X 𝛥𝒥 |, d d d and from p𝑋 Y 𝑌 q ” 𝑋 [ 𝑌 it follows immediately that l l l |p 𝑋 [ 𝑌 qℐ X 𝛥𝒥 | ě 𝑐 ¨ |p 𝑋qℐ X 𝛥𝒥 | as required. For the second subclaim, we observe that 𝑋 1 z𝛥𝒥dĎ 𝑌 1 z𝛥𝒥 , since d 𝑋 Ñ 𝑌 is valid in the formal context Kℐ |𝛥ℐ z𝛥𝒥 . Since 𝑋 1 “ p 𝑋qℐ and 𝑌 1 “ p 𝑌 qℐ the claim follows. d We have therefore shown that ℒ Ď Th𝑐 pℐ, 𝒥 q. d We now show that ℒ is complete for Th𝑐 pℐ, 𝒥 q. To this end, we shall show that i. d ℒ |ù p 𝑈 Ď p 𝑈 qℐℐ q for all 𝑈 Ď 𝑀ℐ , in particular, ℒ |ù ℬCan ; d d d d ii. ℒ |ù Confpℐ, 𝑐, 𝒥 q. d If we can establish these claims, then by d Theorem 9 we obtain from ℒ |ù ℬCan Y Confpℐ, 𝑐, 𝒥 q the completeness of ℒ for Th𝑐 pℐ, 𝒥 q. Let 𝑈 Ď 𝑀ℐ . Since ℒ entails all valid implications of Kℐ , we obtain ℒ |ù p𝑈 Ñ 𝑈 2 q. By Lemma 11, it follows that ℒ |ù p 𝑈 Ď p 𝑈 2 qq. Since p 𝑈 2 q ” p 𝑈 qℐℐ , we d d d d obtain the validity of the subclaim. ℐℐ ℐℐ For the second subclaim, ℐ ℐ ℐ d let1 p𝐶 ℐĎ 𝐷d q1 P Confpℐ, 𝑐, 𝒥 q. We define 𝑈 :“ 𝐶 , 𝑉 :“ 𝐷 . Then 𝑈 ” 𝑈 and 𝑉 ” 𝑉 , so l l l l ℒ |ù p𝑈 ℐ Ď 𝑉 ℐ q ðñ ℒ |ù p 𝑈 1 Ď 𝑉 1 q. We show that ℒ |ù p𝑈 1 Ñ 𝑉 1 q. For this we recall that |p𝐶 ℐℐ [ 𝐷ℐℐ qℐ X 𝛥𝒥 | ě 𝑐 ¨ |p𝐶 ℐℐ qℐ X 𝛥𝒥 | 𝑈 1 ” 𝑈 ℐ ” 𝐶 ℐℐ and 𝑉 1 ” 𝐷ℐℐ , we obtain d d Now since l l l |p 𝑈 1 [ 𝑉 1 qℐ X 𝛥𝒥 | ě 𝑐 ¨ |p 𝑈 1 qℐ X 𝛥𝒥 | As shown before, this implies that |p𝑈 1 Y 𝑉 1 q1 X 𝛥𝒥 | ě 𝑐 ¨ |𝑈 2 X 𝛥𝒥 |, where the derivations are conducted in Kℐ . In other words, it is true that conf Kℐ |𝛥𝒥 p𝑈 1 Ñ 𝑉 1 q ě 𝑐. d d d 1 Thus, ℒ |ù p𝑈 1 Ñ 𝑉 1 q, and Lemma 11 implies ℒ |ù p 𝑈 1 Ď 𝑉 q, thus ℒ |ù p𝑈 ℐ Ď 𝑉 ℐ q “ p𝐶 ℐℐ Ď 𝐷ℐℐ q, as required. d