Uniformly Querying Knowledge Bases and Data Bases Paolo Bresciani IRST, I-38050 Trento Povo, TN, Italy bresciani@irst.it Abstract uniformly accessed. A technique to tightly couple Present kl-one-like knowledge base man- KBMS with DBMS [Borgida and Brachman,1993] agement systems (KBMS), whilst o er- is described. As in [Devanbu,1993; Borgida and ing highly structured description languages Brachman,1993] we let primitive concepts and rela- aside ecient concepts classi cation, have tions in a KB correspond respectively to unary and limited capability to manage large amounts binary tables/views in a DB. Unlike [Devanbu,1993; of individuals. Data base management sys- Borgida and Brachman,1993] we provide a tight cou- pling between KBMS and DBMS, i.e., a on demand tems (DBMS) can, instead, manage large access to the DB, instead of a loose coupling, that re- amounts of data eciently, but give scarce quires a pre-loading of the data from the DB into the formalism to organize them in a structured KB. In this way we obtain the following advantages: way, and to reason with them. This paper shows how assertional knowl-  more complex queries than simply asking for the edge of KBMS and data of DBMS can be instances of concepts can be done; just as an ex- uniformly accessed. The query answering ample, in our system queries like C(x) R(x; y) ^ ^ capability of an arbitrary KBMS is aug- D(y) can be made. mented with the possibility of accessing ex-  no memory space is wasted in the KBMS to keep ternal databases (DB) as a supplemental descriptions of DBMS instances. source of extensional knowledge. answers are given on the basis of the current The techniques presented in this paper can  state of the KB and the DB. be easily adapted to several sources of in- formation. From a knowledge acquisition  no periodical updating of the KB with new or perspective, we believe that they can be modi ed data from the DB is needed. usefully applied in all those applications Basically, in our system the query answering ca- where several sources of informations are pability of an arbitrary KBMS1 is augmented with available independently from the knowl- the possibility of accessing external information as edge bases. a supplemental source of extensional knowledge. In particular a database is seen as an extension of the 1 Introduction KBMS ABox. The two basic components of a KBMS of the kl- 2 DBox as Extension of the ABox one family are the terminological box (TBox) and the assertional box (ABox). One of the tradeo of The ABox is the component of a DBMS where as- these KBMS is between the expressiveness of the sertions about single individuals are stated. In the description languages characterizing their TBox and present paper we describe how the ABox can be ex- the ineciency in managing large amounts of data tended with an external source of extensional data. in the ABox, even when they have a quite schematic We call this extension a `DBox'. In the following, form and their classi cation is completely a priori we adopt the notation of [Nebel,1990] and call a set given. DBMS, instead, are suited to manage data of term descriptions (concepts and roles) a termi- eciently, with little concern about their dimension, nology 2, and a set of individual assertions a world T but their formalism for organizing them in a struc- tured way is quite absent, as it is the capability to 1 Even if we implemented the ideas presented here as infer new information from the existing ones. an extension of LOOM [MacGregor,1991], they can be Here we propose to cope with both KBMS and easily applied to any kl-one-like KBMS system with a DBMS together, using them in an integrated way to rst-order-logic query-language. manage with several kinds of information. Of course 2 An important task of a KBMS is to organize the terms in a taxonomy accordingly with a specialization a uniform way to retrieve information from a mixed relation, i.e., to classify them; in the following, we often KBMS/DBMS is needed. use T to denote just the set of atomic terms appearing In the present paper it is shown how assertional in T , and consider them correctly classi ed in the tax- knowledge of KBMS and data of DBMS can be onomy on the basis of their de nitions. description W 3 ; we say also that the set of data ex- mixed (KBMS/DBMS) queries can be answered in a pressed in a DBox constitutes a data base D. As- coherent way, but, to this extent, we need to couple suming that two complete query answering functions the terminology in with the data base . This T KB D separately exist, for both the ABox and the DBox, coupling consists in the association of some particu- a knowledge base = ; ; can be de ned in KB hT W Di lar terms of with tables, in the DB representing , T D such a way that a uniform query function, based on where the extension of these terms are to be found. the two answering functions, can be implemented. For sake of simplicity we adopt, next, some re- We do not require any special capability from the strictions on the form of , even if, as it will be KB DBox, except the one of (quickly) retrieving lists noted later, they can be, at least in part, released. of tuples of items satisfying requested conditions. Given = ; ; , we assume that the follow- KB hT W Di These conditions are of the kind of being in a class ing conditions are satis ed: { i.e., belonging to a unary table/view { or being in - non intermediate db extension: every in- relation with other items { i.e., belonging to a bi- D dividual must be realized under a leaf term in , nary table/view { and logical combinations of these, T i.e., a term in specialized by no other terms as it can be, for example, expressed in SQL. Since T in . our implementation relies, in fact, on a DBMS with T SQL, we assume that is somehow represented by - homogeneous extension: for each leaf term D means of a relational database, and queries to the of its associated instances are either all in T W DBox can be done in SQL. Therefore in the follow- or all in . D ing we will refer to tables/views { or simply tables { - db isolation: all the leaf terms of whose T as they usually are intended in relational databases, instances are in are primitive and are not used D and to the query answering function of the DBox as in any other term de nition in . T to those of SQL.4 Consider that it is not dicult to design in >From the point of view of users of KBMS, our KB such a way that a primitive term is introduced in experience [Bresciani,1992] suggests that, in realis- T for each class of individuals present in : by this the D tic applications, knowledge bases not only can be homogeneous extension hypothesis can always be complex, but can also involve a large number of in- satis ed. The db isolation and the non interme- dividuals: often most of them are already completely diate db extension conditions re ect the hypoth- and suitably described in some database. We faced esis that is just a at collection of unstructured several times, in the past, the task of transferring D tables of records of data, without any reasoning ca- data from these databases to our knowledge bases. pability. Using the techniques described in the present arti- Under these assumptions, all the information cle it is, of course, much more recommended to link needed to correctly drive the query mechanism is the databases to the knowledge bases. Using a DBox the association of those terms in whose extensions T paradigm we obtain the advantage of reducing to the are in with the corresponding tables in the DB. minimum the e ort of transferring data and, more- D To this extent it is enough to know this association over, they are automatically kept updated as far as for those terms that are leaf, for the non interme- the linked databases are. But also when5 data must diate db extension condition above. Therefore, be collected ex-novo it can be convenient to manage we assume that a partial mapping PM : most of them by means of a DBMS. PT ! DBtable is given, where is the set of primi- PT tive terms in , and DBtable the set of tables in 3 Coupling T the DB. So, DBtable we can de ne the marking function As mentioned, in our , is assumed to contain M : T !2 , s.t. M(t) = PM(x) x f j 2 KB D no structural knowledge, but just raw data, and is subs(t) and PM(x) is de ned , where subs(t) is the g not supposed to have any inferential power. The set of all the terms classi ed under t in (includ- T single instances are, therefore, already placed in the ing t). The marking function gives the (possibly right tables. That is, speaking in KR terms, they are empty) set of tables necessary to retrieve all the in- completely a-priori realized under the right concept. stances (pairs) of a given concept (relation). There- This corresponds to having each table of associ- fore, it is an important part of our , whose de - KB D ated with a primitive term6 of . We will show how nition, to be more precise, has now to be rephrased: T KB = ; ; ;PM . hT W D i By W we denote also the set of individuals described 4 Query Answering 3 in W . 4 Of course, the external source of information where We are now ready to describe the task of answer- the DBox searches data can be of any kind, provided ing a query. Here we will assume that a query to only that it can be accessed via a rst-order-logic query KB = ; ; ;PM is an expression of the kind hT W D i language. x:P1 : : : Pn, where P1 ; : : :; Pn are predicates 5 if not necessary, considering that most KBMS cannot ^ ^ of the form C(x) or R(x; y), where C and R are re- cope with more than some hundreds or few thousands of spectively a concept and a relation in and each individuals. T 6 A primitive term is a term whose de nition gives of x and y appears in x = x1; : : :; xm or is an in- h i only necessary but not sucient conditions; individuals stance in . As a rst, informal example, let us W [D cannot be realized under one of these terms unless it is consider the case in which all the P1 ; : : :; Pn can be not explicitly asserted that they belong to it or to a more managed by the DBox only, that is: M(t) = for 6 ; speci c term in the taxonomy. each t subs(P1 ) : : : subs(Pn ). 2 [ [ 4.1 Translating Queries into SQL relevant result of answering an unconnected query is When each predicate in a query q = x:P1 : : : Pn equivalent to the union of the single results of sep- can be made correspond to a set of tables in the ^ ^ arately answering the clusters, in the sense that all DB, where the answers have to be found, it can be the information is included in it. But, if we consider translated into an equivalent SQL query. Of course, the formal de nition of answer, we must consider the sets of tables can be easily found via the mark- the fact that the overall result must contain tuples ing function M. At this point we have just to cope longer than those resulting by submitting the sin- with the union set of tables T1 ; : : :; Th and their gle clusters; to obtain all the tuples satisfying the f g bindings via the variables in x. For simplicity, let de nition of answer the single answers have to be us suppose that the tables returned by M are com- combined by a sort of Cartesian product. More ex- posed by one column in the case of a concept (let actly, if, after having reordered the variables, un un- it be called left), and two in the case of a rela- conected query is written as x:'1 (x1 ) : : : 'n (xn ) ^ ^ tion (let them be called left and right). The SQL { where x is the concatenation of the other vec- translation is of the kind: tors (x = x1 . .xn), and '1 (x1); : : :; 'n (xn ) cor-  SELECT DISTINCT select-body responds to the single clusters { and given that FROM from-body the asnwers sets of a generic cluster xi :'i(xi ) is WHERE where-body Si = I 1i ; : : :; I lii , the answer set of the whole query f g where the select-body is a list of column names of the is S = I j11 . .I jnn I j11 S1 ; : : :; I jnn Sn . f  j 2 2 g kind M(Pxi ):left or M(Pxi ):right, one for each The case of a connected (i.e., non unconnected) variable xi in x, according to the fact that the vari- query x:'(y) with unbound variables can be re- able xi appears for the rst time in the predicate duced to the case of an unconnected query x:'(y) ^ Pxi in the rst place7 or in the second place, respec- T(z), where z = z1 ; : : :; zk contains all the vari- h i tively. The from-body is the list of all the tables in- ables appearing in x but not in y, and T(z) = volved { i.e., all the M(Pi). The where-body is a list top(z1 ) : : : top(zk ), where top correspond to the ^ ^ of SQL where-conditions of the kind field2=field1 most generic concept in . T or field2=constant, where the rst form has to be It is now clear that unconnected queries and used for each variable that is used more than once, queries with unbound variables may have unreason- each time it is reused, and the second form occurs for ably large answer sets, without giving any further ca- each use of constants. In both the forms field2 is a pability to the system. Therefore, we consider only selector similar to those in select-body, correspond- connected queries with only bound variables. ing to positions in the query where the variable is To a ord the answering of a query we need to split further used or where the constant appears, respec- it into sub-queries that can be answered by the two tively; field1 corresponds to the rst occurence of specialized query answering functions of the KBMS the variable. and the DBMS. To this extent we need, as a rst step, to mark all the possible atomic predicates, cor- 4.2 The General Case responding to the terms in , and say that a term T In general answering, a query is more complex and P is: requires the merging of results from the DBMS and - DB-marked i for each t subs(P) 2 PM(t) \P T the KBMS. Answering a query in means nding KB is de ned . a set x 1; : : :; x m of tuples of instances s.t., for each f g - KB-marked i for each t subs(P) , tuple x i , x:(P1 : : : Pn)[x i ] holds in . We call 2 \ PT ^ ^ KB PM(t) is unde ned. such tuples answers of the query and the set of all - Mixed-marked otherwise. of them its answer set. Due to the de nition of answer of a query, it is ob- These three markings re ect the fact that the in- vious that, in order to avoid the generation of huge stances (pairs) of P are all in , all in , or part W D answer sets, free variables should not be used, i.e., in and part in , respectively. The strategy for W D each variable appearing in x must appear also in answering to a query is based on this information. the query body (i.e., the part at the right of the Let us, rst, observe that it is easy to answer to an dot). Indeed, we adopt a stronger restriction, be- atomic query where the predicate is a KB-marked or cause the former one still allows for some undesired a DB-marked term. In the rst case it is enough to situations. Let us consider, for example, the query: submit it to the KBMS. In the second it is enough  x; y; z :A(x) R(x; y) C(z). All the variables to translate the query in a SQL equivalent, as shown h i ^ ^ appear in the body, but, nevertheless, the answer above, and submit it to the associated DB. More- set of the query can be unreasonably large, due over, if the query is not atomic, but made up by to the fact that all the answers of the sub-query atomic subexpression all with the same marking, the  x; y :A(x) R(x; y) have to be combined with all same strategy is applied. More dicult is the case h i ^ the answers of the sub-query  z :C(z). We say that of queries with Mixed-marked predicates. Even the h i such a query is unconnected. More in general, we say atomic case is quite dicult; it is necessary to trans- that a query is unconnected when it can be split into form the atomic query into the (possibly non atomic) two or more sub-queries s.t. all the variables appear- one whose predicates correspond to all the leaf terms ing in each of them does not appear in any other. We that specialize the only term in the original atomic call these sub-queries clusters. It is obvious that the query, proceed as before, and collect all the results. Let us now consider a generic non atomic query: 7 or the only one in the case of concept. x:P1KB : : : PlKB ^ ^ KB P1 DB : : : P DB P M : : : P M ^ ^lDB 1 ^ ^lM ^ ^ where the PiKB corresponds to the KB-marked 4.3 merge AS1 and AS2 by collecting only those an- terms, the PiDB to the DB-marked terms, and the swers in AS1 where each non-? lled position is PiM to the Mixed-marked terms. The query can be lled by the same individual or by ? in some split in the sub-queries: qKB = x:P1KB : : : PlKB , answers in AS2 , and replace in the collected an- M = x:P M :KB swers each ? with the individuals in the corre- ^ ^ qDB = x:P1DB : : : PlDB ^ ^ DB , and q 1 : : ^ ^ sponding position in all the matching answers PlMM . of AS2 4.3 The Algorithms 4.4 replace AS1 and AS2 in result-list with their As we said, the sub-queries qKB , qDB , qM can be merging computed in step 4.3 easily processed. The only diculty is that some of 4.5 REPEAT from step 4.2 UNTIL only one item the variables in x could be unbound in a sub-query. is left in result-list. In this case, as shown before, the answer sets have to 4.6 RETURN the only item left in result-list. be completed, that is, the unbound variables should be made correspond to each instance in , for all Now it is easy to explain how to collect the answers of the sub-queries qiM of step 2. It is enough, for each KB the found answers, by all the possible combinations. But, in this way, huge answer sets are generated, as qiM q1M : : :qhM , to collect all the answers of all its 2 f g in the following sketch of the query-answering algo- descendant queries, and complete these answer sets rithm: generating ASx;M1; : : :; ASx;h M , as described above; it 1 split the query as sketched above into qKB , qDB is now clear that, in the above algorithm for step 4, and qM . step 4.1 has to be so rephrased: 2 submit qKB to KBMS, qDB to SQL (after trans- 4.1-bis lation) and transform each of the atomic sub- let queries qiM of qM into a set of atomic queries result-list= ASxKB ; ASxDB ; ASx; f M ; : : :; AS M . 1 x;h g corresponding to the leaf terms in that spe- T cialize qiM ; submit them to the speci c retriev- The resulting algorithm, composed by steps 1, 2, 3 ers. (modi ed as shown), 4.1-bis, and 4.2 to 4.6 has been 3 collect all the answers respectively in the answer implemented. In our system the KBMS currently in sets ASxKB , ASxDB M DB , and ASxM , and complete use is LOOM [MacGregor,1991], and the database them withKBthe whole domain in the place of un- query language is SQL, but, as mentioned, also other bound variables, as mentioned above, generat- systems could be easily used. ing ASxKB , ASxDB , and ASxM . 4 the overall answer set is just ASxKB ASxDB 5 Conclusion and Future Developments \ \ ASxM . Of course this rst algorithm is widely space wast- We have shown how a third component, a DBox { ing. Moreover, in step 3 it is not clearly stated how allowing for the extensional data to be distributed to collect the answers of the sub-queries qiM . We try among the ABox and the DBox { can be added to here to shortly describe this operation and to show the traditional TBox/ABox architecture of KBMS. how the completions of ASxKB DB M KB , ASxDB , and ASxM By means of the DBox is possible to couple the in step 3, and their following intersection in step 4, KBMS with, for example, a DBMS, and use both can be obtained more eciently. To solve these prob- the systems to uniformly answering queries to knowl- lems, from step 3 ahead a compact representation for edge bases realized by this extended paradigm. The ASxKB , ASxDB , and ASxM is needed. Let a generic presented query language has some restrictions, and partial answer set be written as ASy , where the vari- some constraints have been imposed to the form of ables of the original complete variable tuple x miss- the knowledge bases. To overcome these limitations, ing in y are, xp1 ; : : :; xpk . Its completionScan be rep- some extensions of the present work can be proposed. resented in a compact way with ASx = I 2ASy I ? , where I ? are equivalent to I except that are length- f g 5.1 Constraints on the Form of KB ened by lling the k missing positions p1 ; : : :; pk with In section 3 we assumed that some constraints any marker, e.g., a star `?'. The star stands for all should be imposed on the form of . Indeed they KB the individuals in . Using this representation for can be in part released, even if this more general ap- KB the completion in step 3, it is now easy to rephrase proach would require a deeper discussion and a re- step 4 of the algorithm as a merging operation. In formulation of the algorithms. Here we try to give a fact answer sets ASxKB , ASxDB , and ASxM can be very short account on possible developments in this merged into a single answer set as follow: direction. First, consider the homogeneous ex- tension condition. It is important because it allows 4.1 let result-list= ASxKB ; ASxDB ; ASxM f g to make the search of the answers simpler, giving 4.2 choose two answer sets, AS1 and AS2 , in the basis for a neat separation between KB-marked, result-list, where answers have at least one DB-marked, and Mixed-marked predicates9 . But it common position lled by individuals, i.e., not with connected queries. ?.8 9 and giving also the way to decompose the Mixed- 8 Such two sets do always exist, otherwise the query marked predicates in sets of KB-marked and DB-marked would be unconnected, while we assumed to deal only ones. is even more important when considered in conjunc- 5.3 Aknowledgments tion with the db isolation condition. In fact we Our thanks must be addressed to Enrico Franconi, can easily cope with leaf terms having instances from for his careful reading of several preliminary copies both and by submitting the corresponding sub- W D of the present paper and the useful suggestions he queries to both the specialized retrieving functions, made about it. We also thank Fabio Rinaldi, for his and then proceeding with the merging as usual. But, implementation of the SQL interface. allowing this ambiguity would make more complex the formulation of the db isolation condition, that References could become: [Borgida and Brachman, 1993] Alex Borgida and - db isolation: all the leaf terms of whose in- T Ronald J. Brachman. Loading data into descrip- stances are even only in part in are primitive D tion reasoners. In Proceeding of ACM SIGMOD and are not used in any other term de nition in '93, 1993. . [Bresciani, 1991] Paolo Bresciani. Logical account T Indeed we can, at least in part, give up also with of a terminological tool. In Proc. of the IX Con- this condition. In fact, while keeping the fact that ference on Applications of Arti cial Intelligence, such term must be primitive { this is pragmatically Orlando, FL, 1991. coherent with the fact that the raw information com- [Bresciani, 1992] Paolo Bresciani. Representation of ing from the DB cannot be inferred { we can allow such term to be used inside new, eventually even non the domain for a natural language dialogue sys- primitive, de nition. To this extent we need a much tem. Technical Report 9203-01, IRST, Povo TN, more complex schema for translating queries on DB- 1992. marked term into SQL. For example, if the query is [Buchheit et al., 1994] Martin of the kind  x :C(x) where C =: some(R; D), its h i Buchheit, Manfred A. Jeusfeld, Werner Nutt, and SQL translation could be: Martin Staudt. Subsumption between queries to SELECT M(R):left object-oriented databases. Information Systems, FROM M(R) 19(1):33{54, 1994. WHERE M(R):right IN M(D) [Devanbu, 1993] Premkumar T. Devanbu. Trans- Similarly, a translation for the all operator could lating description logics to information server be given, as in [Borgida and Brachman,1993], but in queries. In Proceedings of Second Conference on this case some extra considerations about the ade- Information and Knowledge Management (CIKM quacy of the standard extensional semantics of this '93), 1993. operator, when used in a database context, would [Franconi, 1992] E. Franconi. Extending hybridity arise. In fact, the empty satis ability of an all clause within the YAK knowledge representation system. would be hardly suited for a DB.10 AI*IA notizie, the Italian Association for Arti - In the example above D is supposed to be a cial Intelligence Journal, 5(2):55{58, June 1992. primitive atomic DB-marked concept. Another ex- [MacGregor, 1991] R. MacGregor. Inside the tension to be explored is about releasing this con- LOOM description classi er. SIGART Bulletin, straint. Again, some concerns about semantics ade- 2(3):88{92, 1991. quacy should probably be adressed. [Nebel, 1990] B. Nebel. Reasoning and Revision in Also the non intermediate db extension con- Hybrid Representation Systems, volume 422 of dition has, after the considerations above, to be re- Lecture Notes in Arti cial Intelligence. Springer- vised. In fact, even if we must still consider the Verlag, Berlin, Heidelberg, New York, 1990. informaton of , as they are given, as being a priori D fully realized in the leaves of the taxonomy, because the tables in the DB, where the instances of are D described, are not structured in a hierarchy, it could happen that non primitive concepts specialize the DB-marked ones, as in the previous example on the some operator. 5.2 The Query Language Another iussue to be explored regards the query lan- guage. Currently our system support existentially quanti ed conjuntions of atomic formulae. We plan to expand its capability with the possi- bility of answering any rst-order-logic query. We foresee that, to this extent, much attention has to be paid on the optimization of the queries.11 10 As we argued even for standard knowledge bases [Bresciani,1991] the every operator [Franconi,1992] would be more adequate in this case. managed in a uniform way, the approach of [Buchheit et al.,1994] can be usefully applied. 11 Because in our system queries to KB and to DB are