Refining Ontologies via Pattern-based Clustering Francesca A. Lisi and Floriana Esposito 1 Abstract. In this paper we consider the problem of finding subcon- is an O-query Q ∈ L and ext(C) is the set answerset(Q, B) of cepts of a known concept (reference concept) in a given ontology in correct answers to Q w.r.t. B. The DAG G is structured according to the light of new knowledge coming from a data source. These sub- the subset relation between concept extensions. concepts are discovered by looking for frequent association patterns The problem in hand can be considered as a case of tha form of between the reference concept and other concepts also occurring in unsupervised learning, known under the name of Conceptual Clus- the existing ontology. As an illustration, we report preliminary re- tering [10], that aims at determining not only the clusters but also sults obtained from the refinement of an ALC ontology with respect their descriptions expressed in some representation formalism. As a to DATALOG data extracted from the on-line CIA World Fact Book. solution approach to the problem, we follow a recent trend in Cluster Analysis: using frequent (association) patterns as candidate clusters [13]. A frequent pattern is an intensional description, expressed in a 1 INTRODUCTION language L, of a subset of a given data set r whose cardinality ex- Ontology Refinement is the adaptation of an existing ontology to a ceeds a user-defined threshold (minimum support). Note that patterns specific domain or the needs of a particular user [8]. In this paper can refer to multiple levels of description granularity (multi-grained we consider the problem of finding subconcepts of a known concept patterns). In any case, a frequent pattern highlights a regularity in Cref , called reference concept, in the existing ontology Σ in the light r, therefore it can be considered as the clue of a data cluster. In the of new knowledge coming from a data source Π. We assume that a context of Ontology Refinement these clues are called emerging con- concept C consists of two parts: an intension int(C) and an extension cepts because they are concepts whose only extension is determined. ext(C). The former is an expression belonging to a logical language In [4] it has been proposed to extend [6] in order to provide a pattern- L whereas the latter is a set of objects that satisfy the former. More based approach to Conceptual Clustering. formally, given The paper is organized as follows. Section 2 illustrates our ap- proach to the problem. Section 3 reports a preliminary empirical eval- • a reference concept Cref ∈ Σ, uation of the approach. Section 4 concludes with final remarks and • a data set r = Σ ∪ Π, directions of future work. • a language L our Ontology Refinement problem is to find a directed acyclic graph 2 PATTERN-BASED CLUSTERING (DAG) G of concepts Ci such that (i) int(Ci ) ∈ L and (ii) ext(Ci ) ⊂ ext(Cref ). Note that Cref is among both the concepts defined in When faced with a pattern-based approach to Conceptual Clustering, Σ and the symbols of L. Furthermore ext(Ci ) relies on a notion of the Ontology Refinement problem stated in Section 1 is decomposed satisfiability of int(Ci ) w.r.t. r. Note that r includes Σ because in On- in two subproblems: tology Refinement, as opposite to other forms of Ontology Learning such as Ontology Extraction (or Building), it is mandatory to con- I. discovery of frequent patterns in data sider the existing ontology and its existing connections. II. generation of clusters from frequent patterns A Knowledge Representation and Reasoning (KR&R) framework that turns out to be suitable for our problem is the one offered by the In particular, the subproblem I is actually a variant of frequent pat- hybrid system AL-log [2]. It allows for the specification of both re- tern discovery which aims at obtaining descriptions of the data set r lational and structural data: the former is based on DATALOG [1], the at different levels of granularity [3]. Here r typically encompasses a latter on ALC [11]. The integration of the two logical formalisms is taxonomy T . More precisely, the problem of frequent pattern discov- provided by the so-called constrained DATALOG clause, i.e. a DAT- ery at l levels of description granularity, 1 ≤ l ≤ maxG, is to find ALOG clause with variables possibly constrained by concepts ex- the set F of all the frequent patterns expressible in a multi-grained pressed in ALC. Within this KR&R framework, the data set r is rep- language L = {Ll }1≤l≤maxG and evaluated against r w.r.t. a set resented as a AL-log knowledge base B and the language L contains {minsupl }1≤l≤maxG of minimum support thresholds by means of expressions, called O-queries, of the form the evaluation function supp. In this case, P ∈ Ll with support s is frequent in r if (i) s ≥ minsupl and (ii) all ancestors of P w.r.t. T Q = q(X) ← α1 , . . . , αm &X : Cref , γ1 , . . . , γn , are frequent in r. The method proposed for solving one such decomposed problem relating individuals of Cref to individuals of other concepts (task- extends the levelwise search method [9] for frequent pattern discov- relevant concepts) also occurring in Σ. Thus, for a concept C, int(C) ery with an additional post-processing step to solve the subproblem 1 Dipartimento di Informatica, Università degli Studi di Bari, Via Orabona 4, II. This method searches the space (L, ) of patterns organized ac- 70125 Bari, Italy, email: {lisi, esposito}@di.uniba.it cording to a generality order  in a breadth-first manner, starting from the most general pattern in L and alternating candidate gener- C40 ∈ F52 ation and candidate evaluation phases. The underlying assumption is q(A) ← speaks(A,B), believes(A,C) & that  is a quasi-order monotonic w.r.t. supp. For L being a multi- A:MiddleEastCountry, grained language of O-queries, supp supplies the percentage of in- B:AfroAsiaticLanguage, C:MonotheisticReligion dividuals of Cref that satisfy an O-query Q and  is based on the {IR, SA} B-subsumption relation [6]. It has been proved that B is a quasi- order that fulfills the condition of monotonicity w.r.t. supp [6]. Also C50 ∈ F52 q(A) ← believes(A,B), believes(A,C) & the search for patterns is depth-bounded (up to maxD). A:MiddleEastCountry, The subproblem II concerns choosing a description for each clus- B:MonotheisticReligion, C:MonotheisticReligion ter. In [5] it has been proposed a criterion obtained by combining {BRN, IR, IRQ, IL, JOR, RL, SYR} two orthogonal biases: a language bias and a search bias. The lan- guage bias allows the user to define conditions on the form of O- C60 ∈ F33 queries to be accepted as concept intensions. In particular, it is pos- q(A) ← believes(A,’Druze’) & A:MiddleEastCountry sible to state which is the minimum level of description granularity {IL, SYR} and whether (all) the variables must be ontologically constrained or not. The search bias allows the user to define a preference criterion C70 ∈ F33 based on B-subsumption. In particular, it is possible to state whether q(A) ← believes(A,B) & the most general description (m.g.d.) or the most specific description A:MiddleEastCountry, B:JewishReligion {IR, IL, SYR} (m.s.d.) w.r.t. B has to be preferrred. Since B is not a total or- der, it can happen that two patterns P and Q, belonging to the same C80 ∈ F33 language L, can not be compared w.r.t. B . In this case, the m.g.d. q(A) ← believes(A,B) & (resp. m.s.d) of P and Q is the union (resp. conjunction) of P and Q. A:MiddleEastCountry, B:ChristianReligion Note that this method for Conceptual Clustering is top-down and {ARM, IR, IRQ, IL, JOR, RL, SYR} incremental due to the features of the levelwise search. Also it is not hierarchical because it returns a DAG instead of a tree of concepts. C90 ∈ F33 q(A) ← believes(A,B) & A:MiddleEastCountry, B:MuslimReligion 3 PRELIMINARY EXPERIMENTS {BRN, IR, IRQ, IL, JOR, KWT, RL, OM, Q, SA, SYR, TR, UAE} As an illustration, we report the results of four experiments con- 0 ducted on the AL-log knowledge base BCIA that has been obtained C10 ∈ F53 by adding DATALOG facts2 extracted from the on-line 1996 CIA q(A) ← believes(A,B), believes(A,C) & A:MiddleEastCountry, World Fact Book3 to an ALC ontology ΣCIA concerning the concepts B:ChristianReligion, C:MuslimReligion Country, EthnicGroup, Language, and Religion. The parameter {IR, IRQ, IL, JOR, RL, SYR} settings are: Cref = MiddleEastCountry, maxD = 5, maxG = 3, minsup1 = 20%, minsup2 = 13%, and minsup3 = 10%. 0 C11 ∈ F53 Thus each of them started from the same set F of 53 frequent pat- q(A) ← believes(A,B), believes(A,C) & terns out of 99 candidate patterns. A:MiddleEastCountry, Case for l ≥ 2. The first two experiments both require the descrip- B:MuslimReligion, C:MuslimReligion tions to have all the variables ontologically constrained by concepts {BRN, IR, SYR} from the second granularity level on. When the m.g.d. criterion is adopted, the procedure of graph building returns the following twelve 0 organized in the DAG GCIA . They are numbered according to the concepts: chronological order of insertion in GCIA0 and annotated with informa- C00 ∈ F11 tion of the generation step. From a qualitative point of view, concepts q(A) ← A:MiddleEastCountry C20 4 and C90 well characterize Middle East countries. Armenia (ARM), {ARM, BRN, IR, IRQ, IL, JOR, KWT, RL, OM, Q, SA, SYR, TR, UAE, YE} as opposite to Iran (IR), does not fall in these concepts. It rather be- longs to the weaker characterizations C30 and C80 . This proves that C10 ∈ F32 our procedure performs a ’sensible’ clustering. Indeed Armenia is a q(A) ← believes(A,B) & well-known borderline case for the geo-political concept of Middle A:MiddleEastCountry, B:MonotheisticReligion East, though the Armenian is usually listed among Middle Eastern {ARM, BRN, IR, IRQ, IL, JOR, KWT, RL, OM, Q, SA, SYR, TR, UAE} ethnic groups. Modern experts tend nowadays to consider it as part C20 ∈ F32 of Europe, therefore out of Middle East. But in 1996 the on-line CIA q(A) ← speaks(A,B) & World Fact Book still considered Armenia as part of Asia. A:MiddleEastCountry, B:AfroAsiaticLanguage When the m.s.d. criterion is adopted, the intensions for the con- {IR, SA, YE} cepts C40 , C60 and C70 change as follows: C30 ∈ F32 C40 ∈ F52 q(A) ← speaks(A,B) & q(A) ← speaks(A,B), believes(A,C) & A:MiddleEastCountry, B:IndoEuropeanLanguage A:MiddleEastCountry, {ARM, IR} B:ArabicLanguage, C:MuslimReligion {IR, SA} 2 http://www.dbis.informatik.uni-goettingen.de/ Mondial/mondial-rel-facts.flp 4 C 0 is less populated than expected because B 2 CIA does not provide facts on 3 http://www.odci.gov/cia/publications/factbook/ the languages spoken for all countries. 00 0 C60 ∈ F33 organized in a DAG GCIA which partially reproduces GCIA . Note that q(A) ← believes(A,’Druze’), believes(A,B), the stricter conditions set in the language bias cause two concepts 0 00 believes(A,C), believes(A,D) & occurring in GCIA not to appear in GCIA : the scarsely significant C10 A:MiddleEastCountry, B:JewishReligion, and the quite interesting C30 . C:ChristianReligion, D:MuslimReligion When the m.s.d. condition is chosen, the intensions for the con- {IL, SYR} cepts C200 and C300 change analogously to C60 and C70 . C70 ∈ F33 q(A) ← believes(A,B), believes(A,C), believes(A,D) & 4 CONCLUSIONS AND FUTURE WORK A:MiddleEastCountry, B:JewishReligion, C:ChristianReligion, D:MuslimReligion Ontology Refinement can be considered as the problem of contextu- {IR, IL, SYR} alizing an input ontology. In our case, context is conveyed by task- relevant concepts and is attached to the reference concept by dis- In particular C60 and C70 look quite overfitted to data. Yet overfitting al- covering strong associations between the reference concepts and the lows us to realize that what distinguishes Israel (IL) and Syria (SYR) task-relevant concepts w.r.t. the input ontology. The idea of applying from Iran is just the presence of Druze people. association rules in Ontology Learning has been already investigated Case for l ≥ 3. The other two experiments further restrict the in [7]. Yet there are several differences: [7] is conceived for Ontology conditions of the language bias specification. Here only descriptions Extraction instead of Ontology Refinement, uses generalized associ- with variables constrained by concepts of granularity from the third ation rules (bottom-up search) instead of multi-level association rules level on are considered. When the m.g.d. criterion is adopted, the procedure for graph building returns the following nine concepts: (top-down search), adopts propositional logic instead of First Order Logic. Also our work has contact points with Vrain’s proposal [12] C000 ∈ F11 of a top-down incremental but distance-based method for Conceptual q(A) ← A:MiddleEastCountry Clustering in a mixed object-logical representation. {ARM, BRN, IR, IRQ, IL, JOR, KWT, RL, OM, Q, SA, SYR, TR, UAE, YE} For the future we plan to extensively evaluate our approach. Ex- periments will show, among the other things, how emerging concepts C100 ∈ F33 depend on the minimum support thresholds set for the stage of fre- q(A) ← speaks(A,B) & quent pattern discovery. A:MiddleEastCountry, B:ArabicLanguage {IR, SA, YE} References C200 ∈ F33 q(A) ← believes(A,’Druze’) & A:MiddleEastCountry [1] S. Ceri, G. Gottlob, and L. Tanca, Logic Programming and Databases, Springer, 1990. {IL, SYR} [2] F.M. Donini, M. Lenzerini, D. Nardi, and A. Schaerf, ‘AL-log: Inte- grating Datalog and Description Logics’, Journal of Intelligent Infor- C300 ∈ F33 mation Systems, 10(3), 227–252, (1998). q(A) ← believes(A,B) & [3] J. Han and Y. Fu, ‘Mining multiple-level association rules in large A:MiddleEastCountry, B:JewishReligion databases’, IEEE Transactions on Knowledge and Data Engineering, {IR, IL, SYR} 11(5), (1999). [4] F.A. Lisi, ‘A Pattern-Based Approach to Conceptual Clustering in C400 ∈ F33 FOL.’, in Conceptual Structures: Inspiration and Application, eds., H. Schärfe, P. Hitzler, and P. Øhrstrøm, volume 4068 of Lecture Notes q(A) ← believes(A,B) & in Artificial Intelligence, 346–359, Springer, (2006). A:MiddleEastCountry, B:ChristianReligion [5] F.A. Lisi and F. Esposito, ‘Two Orthogonal Biases for Choosing the {ARM, IR, IRQ, IL, JOR, RL, SYR} Intensions of Emerging Concepts’, in ECAI 2006. IOS Press, (2006). [6] F.A. Lisi and D. Malerba, ‘Inducing Multi-Level Association Rules C500 ∈ F33 from Multiple Relations’, Machine Learning, 55, 175–210, (2004). q(A) ← believes(A,B) & [7] A. Maedche and S. Staab, ‘Discovering Conceptual Relations from A:MiddleEastCountry, B:MuslimReligion Text’, in Proceedings of the 14th European Conference on Artificial {BRN, IR, IRQ, IL, JOR, KWT, RL, OM, Q, SA, SYR, TR, UAE} Intelligence, ed., W. Horn, pp. 321–325. IOS Press, (2000). [8] A. Maedche and S. Staab, ‘Ontology Learning’, in Handbook on On- tologies, eds., S. Staab and R. Studer, Springer, (2004). C600 ∈ F53 [9] H. Mannila and H. Toivonen, ‘Levelwise search and borders of theories q(A) ← speaks(A,B), believes(A,C) & in knowledge discovery’, Data Mining and Knowledge Discovery, 1(3), A:MiddleEastCountry, 241–258, (1997). B:ArabicLanguage, C:MuslimReligion [10] R.S. Michalski and R.E. Stepp, ‘Learning from observation: Concep- {IR, SA} tual clustering’, in Machine Learning: an artificial intelligence ap- proach, eds., R.S. Michalski, J.G. Carbonell, and T.M. Mitchell, vol- C700 ∈ F53 ume I, Morgan Kaufmann, San Mateo, CA, (1983). q(A) ← believes(A,B), believes(A,C) & [11] M. Schmidt-Schauss and G. Smolka, ‘Attributive concept descriptions A:MiddleEastCountry, with complements’, Artificial Intelligence, 48(1), 1–26, (1991). [12] C. Vrain, ‘Hierarchical conceptual clustering in a first order repre- B:ChristianReligion, C:MuslimReligion sentation.’, in Foundations of Intelligent Systems, eds., Z.W. Ras and {IR, IRQ, IL, JOR, RL, SYR} M. Michalewicz, volume 1079 of Lecture Notes in Computer Science, 643–652, Springer, (1996). C800 ∈ F53 [13] H. Xiong, M. Steinbach, A. Ruslim, and V. Kumar, ‘Characterizing pat- q(A) ← believes(A,B), believes(A,C) & tern based clustering’, Technical Report TR 05-015, Dept. of Computer A:MiddleEastCountry, Science and Engineering, University of Minnesota, Minneapolis, USA, B:MuslimReligion, C:MuslimReligion (2005). {BRN, IR, SYR}