Translation of Ontology Retrieval Problem into Relational Queries Jaroslav Pokorný1 , Jana Pribolová2 , and Peter Vojtáš1 1 Department of Software Engineering, Charles University, Prague, Czech Republic {Pokorny, Vojtas}@ksi.mff.cuni.cz 2 Institute of Computer Science, Faculty of Science, University of P. J. Šafárik, {Jana.Pribolova}@upjs.sk Abstract. Ontology as a knowledge base can provide different reasoning tasks, e.g. to check consistency of the ontology or to check whether a resource is instance of a concept or not. In this paper we want to focus on retrieval problem. There already exist systems resolving this problem, but they are not effective within large datasets. Our idea is to transform ontology into a relational database. We present particular algorithms of this transformation both on the scheme level and SQL level with special handling of functional roles and definitions. This enables to query such database by usual tools of SQL, i.e. to solve the retrieval problem. Keywords: ontology, translation, relational database, SQL 1 Introduction OWL enables the creation of ontologies and provides extensive semantics for Web data. This language is heavily influenced by description logics (DL, see [1]). Research on DL reasoners consists of the solving reasoner problems such as satisfiability, consistency or retrieving instances of the concept. Relational database is an excellent system for storing and querying data, but their infer- encing capabilities are limited just to querying. In this paper we describe the method to extend relational database to store ontology. In present some research groups are interesting in topic of ontology storing and maintenance. Some of them are trying to limit expressive power of language to speed up reasoner tasks. But some of them are trying to fuse two or more kind of systems. One of the prominent directions in this area is blending ontologies and logic databases [5, 3]. They create so called description logic programs which are de- scription logic expressions and logic programs mixed together. Another direction of the research area is combining ontologies and rela- tional databases. First experiments focus on XML document translation into corresponding relational tables [9]. Inspired by the XML storing in relational K. Richta, J. Pokorný, V. Snášel (Eds.): Dateso 2009, pp. 44–55, ISBN 978-80-01-04323-3. Translation of Ontology Retrieval Problem into Relational Queries 45 databases there are some research projects concerned with ontology storing in the relational databases. One of the first projects is published in [4]. However, many others occurred, e.g., the system HAWK [10] and its ancestor the system DLDB [7], as well as the projects described in [2, 6]. In Section 2 we write about a knowledge base represented through DL and relational data model. Further, in subsection 2.1 we deal with potential and valid domains as well as with ranges. The valid domains are important input parameters of the algorithm to construct relational scheme which we present in Section 2.2. Subsection 2.3 explains creating the relational database. Section 3 concentrates on ontology implementation in an SQL environment. This enables to query such database by usual tools of SQL, i.e. to solve the retrieval problem. In fact, it means creating tables and views (Subsection 3.1) in the SQL language. Then we explain how to insert data in the tables (Subsection 3.2). Section 4 concludes the paper. 2 Knowledge Base in Relational Data Model Basics of DL, ontologies and all used symbols are described in [8], here we refer to unexplained notions. Our language consists of names for atomic concepts A, B ∈ NC, names for roles R ∈ NR. Roles are either functional (NFR) or non-functional (NNR). Concept constructions we usually denote by C, D and we understand them as nonterminal symbols. The main idea of the knowledge base representation with relational data model is shown in Figure 1. For every concept or role we create unary or binary relation, respectively, except for functional roles. For every functional role we create only one attribute in a relation or one attribute in more than one relation depending on so called valid domains (see Section 2.1). The ABox assertions are translated by inserting some tuples into relations. TBox assertions of the type ⊑ are translated into integrity constrains and equalities into special relations called TCview . In practice, such a relation is represented by view in SQL. Fig. 1. Translation mapping the ontology elements to database structures. 46 Jaroslav Pokorný, Jana Pribolová, Peter Vojtáš Note that ontology as a relational database offers the users some kind of inferencing. The main feature is to retrieve all instances of the concept. Moreover such a database supports query construction. In DL, an equivalent of the query is a concept defined as equality. However, in practice ontology-based systems are applied for more complicated queries than concepts can be. The W3C standard for such a set of queries is the query language called SPARQL [13]. A lot of such queries are supported by our system. Section 2.2 presents the solution to find out all instances of the concept, not only instances that are explicitly known, i.e. those expressed by C(a). 2.1 Domains and Ranges of Roles In DL the knowledge base comprises TBox and ABox. We understand both of them, ABox A about and TBox T , as sets of assertions (about individuals or con- cepts, respectively). Each set of assertions can be divided into two categories. One, denoted by subscript E, includes extensional assertions, the second one comprehends as set of additionally deduced assertions and it is denoted by sub- script D. The set TE consists of acyclic definitions A := C and axioms C ≡ D and C ⊑ D. The set TD is derived with respect to (symmetric, transitive) properties of the assertions ≡ and ⊑. For all TBox sets the following holds: – T = TE ∪ TD , – TE ∩ TD = ∅. ABox AE consists of statements of the form B(a) and R(a, b). Let us also mention that A = AE ∪ ATD . The set ATD depends on the TBox T , because we derive assertions on basis of AE and TBox assertions. If it is evident which T inducted ATD , we omit superscript T . Example 1. Let us have a knowledge base O with concepts shown in Figure 2. The concept Student is defined as follows: Student := P erson ⊓ ∃takesCourse.Course The set of assertions T includes previous definition of concept Student and also subsumption assertions shown in Figure 2 with solid line. Also there are some roles in O, i.e. NFR = {hasN ame, hasAddress} and NNR = {takesCourse}. Note that the name of the concept GraduateCourse is abridged to GCourse. The deduced assertions are shown in Figure 2 with dotted arrows. Also ABox consists of the extensional assertions: AE = { P erson(S2 ), GCourse(C2 ), hasN ame(P1 , N ameb ), Student(S1 ), takesCourse(S1 , C1 ), hasN ame(C1 , N amec ), P ublication(P1 ), takesCourse(S2 , C1 ), hasN ame(C2 , N amed ), Article(P1 ), hasN ame(S1 , N amea ) hasAddress(S1 , Addressa ), Course(C1 ), } Translation of Ontology Retrieval Problem into Relational Queries 47 Fig. 2. IS-A hierarchy of the knowledge base O. and also of the deduced ones: ATD = {P erson(S1 ), Course(C2 ), Student(S2 )} As we can see the assertion P ublication(P1 ) belongs to the set of extensional assertions. Though it can be deduced also, because the assertion Article ⊑ P ublication is in the TBox T and Article(P1 ) is an extensional assertion. In DL there are two kinds of the equality meanings. The equality whose left-hand side is atomic concept means the definition. In this paper we denote definition equality as the symbol := instead of ≡ to distinguish definition equal- ities from the rest. We assume, there are no cycles in definitions. In OWL language there is a chance to define domain and range of a role but it is not a necessary condition. Emphasize that our approach tries to reduce the number of join operations within the query construction. The main idea is to encode each functional property as an attribute of the relation, that represents domain of the role, not as standalone relation. In case without defined domain (range) of the functional roles we need to find concepts of instances that are related to another through the role. Sometimes domain or range are defined as union of concepts. This case is similar to the previous one. Therefore we define so called potential and valid domains as follows. Definition 1. A concept B ∈ NC is said to be a potential domain for role R ∈ NR with respect to the set assertions A if there is R(a, b) ∈ A such that B(a) ∈ A. The set of potential domains for role R with respect to A is denoted PDAR. Example 2. Let us illustrate the Definition 1 on Example 1. The role hasN ame has the following potential domains with respect to AE : PDA hasN ame = {P ublication, Article, Course, GCourse, Student} E The role hasAddress has the following potential domain with respect to AE : AE PDhasAddress = {Student} 48 Jaroslav Pokorný, Jana Pribolová, Peter Vojtáš In OOP (Object-Oriented Programming) if the ancestor class has defined a function, the descendant class can use it. We map the concept into meaning of OOP class and functional role into OOP function. That means that we do not need translate the same functional role for ancestor concept and for descendant concept separately. Therefore let us define valid domains – concepts for which we consider the functional role to translate into relational scheme. Definition 2. A valid domain for role R ∈ NFR with respect to A is a potential domain A ∈ PDA A R with property that does not exists B, B ∈ PDR so that A ⊑ B ∈ T and A ≡ B 6∈ T , as well as B ⊑ A 6∈ T . The set of valid domains for role R is denoted as VDA,T R . If we are interested in valid domains with respect to whole TBox assertions, e.g. T , we can omit the superscript T . Note that if ABox is changed it is necessary to revise the potential and valid domains again. This process prevents ontology deformation of the original mod- eling intent. Example 3. With assistance of the previous examples we can present the valid domains of the role hasN ame with respect to AE : VDA hasN ame = {P ublication, Course, Student}. E The concept GCourse does not belong to the set VDA R because there exists E AE Course ∈ VDR so that GCourse ⊑ Course and neither Course ⊑ GCourse nor Course ≡ GCourse is in T . It is useful to define ”inverse” function that can find for any concept A all roles for which the concept is valid domain. Definition 3. The role R ∈ NFR is said to be a role defined on the concept B with respect to the set of assertions A if B ∈ VDA,T R . We denote the set of all roles defined on the concept B as isInVDA,T B . Similarly as in Definition 2 if we use all assertions of T , we can omit the superscript and denote the set of all roles defined on the concept B as isInVDA B. Example 4. In the running example of this paper an interesting point is to com- AE pute isInVDStudent : isInVDA Student = {hasN ame, hasAddress} E Our solution requires valid domains to encode functional properties. To keep some integrity constraints it is useful to define also range of the roles. Definition 4. A potential range for role R ∈ NR (with respect to A) is concept B for which there exists R(a, b) ∈ A and B(b) ∈ A. We denote the set of potential ranges of role R as PRR . Example 5. We compute potential ranges for all roles as follows: PRA hasN ame = {String}, PRA takesCourse = {Course}. Computation of valid ranges is unnecessary because we use ranges only to keep integrity. Translation of Ontology Retrieval Problem into Relational Queries 49 2.2 Construction of a Relational Scheme The first part of ontology translation into relational database consists of creating a relational scheme. Algorithm 1 Let O be a knowledge base with TBox T and ABox A. T , A, concept’s names, and role’s name are translated into relational database D = (R, I). Here R denotes a relational database scheme consisting of basic relational schemes and view definitions using relational algebra expressions (RA expressions). Second, I denotes a set of integrity constraints. The translation is done by induction as described below. First part of translation depends only on the language, the second part de- pends also on ABox and the last depends on the TBox too. Note that names of attributes are motivated by RDF (subject, predicate, object) and resource terminology. The construction is based on the following steps: First translation steps are based solely on the description logic language 1. For all A ∈ NC we add to R new relation TA with scheme TA (resource). 2. For all R ∈ NNR we add to R a relation scheme TR (subject, object). Following translation steps depend on the ABox (and deduced valid domains) 3. For all A ∈ NC for which isInVDA A = {R1 , R2 , . . . , Rn } ⊆ NFR, n ≥ 1 E we modify TA (resource) ∈ R to relation TAmod ∈ R with scheme TAmod (resource, R1 .object, . . . , Rn .object) If R ∈ NFR and VDA R = ∅, then we add to R a new relational scheme E TR (subject, object). The following translations depend on the TBox. First we deal with defi- nitions: 4. For all A ∈ NC such that there is a concept construction C with A := C ∈ T we add to R a new relation TAview with scheme TAview (resource) and view definitions so that (SD and SE are defined in step 5): – If C := D ⊓ E then TAview = TA ∪ (SD ∩ SE ) AE – If C := ∃R.D and R ∈ NNR or VDR = ∅ then view TA = TA ∪ (TR (subject, object) [TR .object = SD .resource] SD (resource))[TR .subject]. – If C = ∃R.D and R ∈ NFR and VDA R 6= ∅, n > 0 then E view rec TA = TA ∪ (SR (subject, object) rec [SR .object = SD .resource] rec SD (resource))[SR .subject] 50 Jaroslav Pokorný, Jana Pribolová, Peter Vojtáš rec where SR is the RA expression for reconstruction of the role R from appropriate columns of TBmod tables [ rec TBmod [resource, R.object]  SR (subject, object) = A B∈VDR E Here we assume that this is a lossless encoding of all ABox information about R. 5. For a non-atomic concept construction C such that there is in T no definition with right hand side C and C is a sub construction of a concept definition in T , then we create a new RA expression SC with the only attribute resource so that: – If C = D ⊓ E then SC = (SD ∩ SE ) – If C = ∃R.D and R ∈ NNR or VDA R = ∅ then E SC = (TR (subject, object)[TR .object = SD .resource]SD (resource)) [TR .subject] – If C = ∃R.D and R ∈ NFR and VDA R 6= ∅ then E rec rec SC = (SR (subject, object)[SR .object = SD .resource]SD (resource)) rec [SR .subject] 6. To transform axioms in T , we add the following integrity constraint to I: – if C ≡ D ∈ T and C, D are non atomic concept constructions then SC = SD ∈ I, – if C ⊑ D ∈ T then SC ⊆ SD ∈ I. The following interesting observations result from the previous algorithm. 1. For all R ∈ NNR andSfor all R ∈ NFR for which VDA R = ∅: E – TR [subject] ⊆ TB , A B∈PDR E S – TR [object] ⊆ TB . B∈PRR AE 2. For all B ∈ NC for which isInVDB 6= ∅ and for all R ∈ isInVDA B : E [ TBmod [R.object] ⊆ TA . A∈PRR These assertions check ”integrity” of the translation of role assertions. In more detail, the mentioned assertions take care to preserve the subject in area of role domains and the object in area of the role ranges. The previous algorithm is illustrated on the following example. Example 6. According to previous examples and applying Algorithm 1 we re- ceive the following database scheme: TPmod ublication (resource, hasN ame.object), TArticle (resource), mod TCourse (resource, hasN ame.object), TGCourse (resource), mod TStudent (resource, hasN ame.object), TP erson (resource), TtakesCourse (subject, object). Translation of Ontology Retrieval Problem into Relational Queries 51 and I = {TArticle [resource] ⊆ TP ublication [resource], TGCourse [resource] ⊆ TCourse [resource], TStudent [resource] ⊆ TP erson [resource]}. Also relation defined as follows belongs to the database scheme: view TStudent = TStudent [resource] ∪ TP erson ∩ TtakesCourse [TtakesCourse .object = TCourse .resource] [TtakesCourse .subject] and the following assertions hold for given D: – TtakesCourse [subject] ⊆ TP erson ∪ TStudent – TtakesCourse [object] ⊆ TCourse Note that, the attributes of the relations representing non-functional roles called subject and object have the same domain. Also note that all instances of a concept B ∈ NC, so that B := D, are stored in TBview . 2.3 Construction of a Relational Database In previous section we have created a relational scheme. Now we present the algorithm to insert the data in the database relations. Algorithm 2 Suppose that T , NC and NR are translated into database D. ABox A is transferred into D by induction as follows: 1. If B(a) ∈ A and also B ∈ NC, then hai ∈ TB . 2. If R(a, b) ∈ A and R ∈ NNR, then ha, bi ∈ TR . 3. If R(a, b) ∈ A and R ∈ NFR, then one of the following items: AE (a) if VDR = ∅ then ha, bi ∈ TR , AE (b) if there exists A ∈ VDR so that A(a) ∈ A, then ha, bi ∈ TAmod [TA .resource, R.object], (c) if there exists A ∈ PDR \ VDA R so that A(a) ∈ A, then there exists a E AE maximal sequence A = B1 , B2 , . . . , Bn ∈ NC so that Bn ∈ VDR and Bi ⊑ Bi+1 ∈ TD for i = 1, . . . , n − 1 Then ha, bi ∈ TBmod n [B.resource, R.object]. Note that the last step of algorithm the information A(a) is not lost and moreover information Bn (a) is a consequence of TBox axioms. It is important to remember that the valid domains are built with respect to the set of extensional assertions AE (Algorithm 2, Steps 3b and 3c). On the other hand, we consider all assertions from ABox A to insert them in the proper tables (Step 1). Example 7. After applying Algorithm 2 we obtain: TPmod ublication = {hP1 , N ameb i}, TP erson = {hS1 i, hS2 i}, mod TCourse = {hC1 , N amec i, hC2 , N amed i}, TArticle = {hP1 i}, mod view TStudent = {hS1 , N amea i}, TStudent = {hS2 i}. 52 Jaroslav Pokorný, Jana Pribolová, Peter Vojtáš 3 Ontology Implementation in an SQL Environment We designed and implemented translation of a description logic knowledge base into an SQL environment, which works in accordance with Algorithms 1, 2. In fact, we create the database scheme which consists of CREATE TABLE and CREATE VIEW statements of the SQL language. 3.1 Create Statements First we create tables representing concepts as it is stated in Step 1 of Algorithm 1. A special case is the top concept. For technical reasons, we assign a numerical identifier to every URI which represents the associated instance. for all A ∈ NC do: if (A = ⊤) then CREATE TABLE T⊤ ( resource INT NOT NULL PRIMARY KEY, uri VARCHAR NOT NULL ); else CREATE TABLE TA ( resource INT NOT NULL PRIMARY KEY ); Next step is the Step 2 of Algorithm 1. Therefore we create tables for non- functional roles in this way: for all R ∈ NNR do: CREATE TABLE TR ( subject INT NOT NULL, object INT NOT NULL, PRIMARY KEY(subject, objects) ); After that we deal with functional properties – Step 3 of the Algorithm 1: for all R ∈ NFR do: AE for all A ∈ VDR do: ALTER TABLE TA ADD COLUMN R object INT; Let us mention that in DL we do not distinguish different kinds of roles. However, in OWL there are two kinds of roles. Practically a role represents a relationship type (in OWL so called object property) or an attribute type (in OWL data type property). In case of relationship roles there is relationship between two instances represented by URIs. On the other hand, attribute roles represents relationship between instance represented by URI and literal value. It may appear that the problem can became if in the column representing the object functional role there is a literal value, or URI in the case of the data type role. But practically the role can be either object type or data type, not both types simultaneously. So it could not happen the mentioned collision. For all atomic concepts that are equivalent to other concept, we also cre- ate view in the database as it is defined in Algorithm 1 in Step 4. Note, that in [8] we have proved that to any concept defined via other atomic and non- atomic concepts we can construct an SQL view whose definition contains only INTERSECTION operations, SELECTs and TABLE R expressions. Each SELECT uses Translation of Ontology Retrieval Problem into Relational Queries 53 only join conditions. To achieve this it is necessary to normalize concept defini- tion on the relational algebra level. We omit details of this procedure here. After this comment Step 4 looks like: for all A ∈ NC: C ≡ D do: String[] elements = getExpressionsOf(A) if (elements.length > 1) then for i=2 to elements.length do: elements[1] += " INTERSECT " + elements[i] CREATE VIEW ViewA AS elements[1]; where the function getExpressionsOf is defined as follows: String[] getExpressionsOf(Concept A){ String[] result; d for all Di : A := Di do i≥1 if (Di ∈ NC) then result[i] = "SELECT resource FROM TDi ;" else if (Di = ∃R.E and E ∈ NC) then result[i] = "SELECT TR .subject FROM returnRole(R) JOIN TE ON TR .object = TE .resource;" else if (Di = ∃R.E) then result[i] = "SELECT TR .subject FROM " + returnRole(R) + " AS TR JOIN " + getExpressionsOf(E) + " AS TE ON TR .object = TE .resource;" return result; } and the function returnRole is: String returnRole(Role R){ String result; if (R ∈ NNR) then result = "TR " else int i = 0; for all A ∈ VDA R E do: i++; if (i = 1) then result = "SELECT resource,R object FROM TA " else result =+ "UNION SELECT resource,R object FROM TA " return result; } At the end of the Algorithm 1 we have do one more thing – to add some integrity constraints generated in Step 6. for all t: t = A ⊑ B do: ALTER TABLE TA ADD FOREIGN KEY (resource) REFERENCES TB ; The equality of two tables implementing ≡ leads to cyclic referential integrity in relational database schema. Therefore, we omit it from further consideration. 54 Jaroslav Pokorný, Jana Pribolová, Peter Vojtáš 3.2 Insert the Data Let us to suppose that the database scheme in SQL is created. Now we will deal with data – instances as it is described in Algorithm 2. We will show how to insert them into created tables. for all A ∈ NC \ {⊤} do: for all A(a) ∈ A do: if (a ∈ T⊤ ) then id = (SELECT resource FROM T⊤ WHERE uri = ’a’) else INSERT INTO T⊤ (resource, uri) VALUES ( generateId(a),’a’); INSERT INTO TA (resource) VALUES ( generateId(a)); This code fragment implements Step 1. It uses the function int generateId( String URI) which generates a numerical identifier to use it within condition in join operation instead of string identifier. Step 2 of the Algorithm 2 is implemented in this way: for all R ∈ NNR do: for all R(a, b) ∈ A do: INSERT INTO TR VALUES (getId(a),getId(b)); Finally, Step 3 of the Algorithm is interpreted as follows: for all R ∈ NFR do: for all R(a, b) ∈ A do: AE if(VDR = ∅ then INSERTINTO TR VALUES (getId(a), getId(b)) else AE if(∃A ∈ VDR and A(a) ∈ A) then UPDATE TA SET R object = getId(b) WHERE resource = getId(a) else find B ∈ VDA R E : A⊑B UPDATE TB SET R object = getId(b) WHERE resource = getId(a) The function int getId(String URI) returns the numerical identifier assigned to a given URI. 4 Conclusion The paper describes an approach to mapping ontology into relational database. The process of mapping is autonomous that means that there is no need of human interaction. We present algorithms of translation ontology into relational scheme and relational data model. In this paper we focused on implementations of the mentioned algorithms – transformation of algorithm’s steps in SQL statements. We work with so called EL description logic which contains top, intersect and full existential quantification constructor. We would like to extend the logic with additional concept constructors inspired by relational database operators. The important area of our research relates to valid domains. We want to do a research about valid domains and the assumption that valid domains preserve ISA-hierarchy. Translation of Ontology Retrieval Problem into Relational Queries 55 Acknowledgment This paper was supported by Slovak project VEGA 1/0131/09, Slovak project VVGS/UPJ Š/45/09-10, Czech project GAČR 201/09/0990 and Czech project IS 1ET100300517. References 1. F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. Patel-Schneider. The Description Logic Handbook. Theory, implementation, and application. Cambridge University Press, 2003 United Kingdom. 2. J. Dokulil, J. Tykal, J. Yaghob and F. Zavoral. Semantic Web Repository and Interfaces. International Conference on Mobile Ubiquitous Computing, Systems, Services and Technologies. In Proc. of UBICOMM 2007, IEEE Computer Society, 2007, pp. 223–228. 3. T. Eiter, T. Lukasiewicz, R. Schindlauer, and H. Tompits. Combining Answer Set Programming with Description Logics for the Semantic Web. Artificial Intelligence Vol. 172, Issues 12–13, 2008, pp. 1495–1539. 4. A. Gali, C. X. Chen, K. T. Claypool, and R. Uceda-Sosa. From Ontology to Rela- tional Databases. Springer Verlag, LNCS 3289, 2004, pp. 278–289. 5. B. N. Grosof, I. Horrocks, R. Volz, and S. Decker. Description Logic Programs: Combining Logic Programs with Description Logic. In Proc. of WW2003, Hungary, 2003, pp. 48–57. 6. N. Kottmann and T. Studer: Improving semantic query answering. DEXA 2007, LNSC 4653, Springer, 2007, pp. 671 – 679. 7. Z. Pan and J. Heflin: DLDB: Extending relational databases to support semantic web queries. In Workshop on Practical and Scaleable Semantic Web Systems, ISWC 2003, 2003, pp. 109–113. 8. J. Pokorný, J. Pribolová, and P. Vojtáš. Ontology Engineering Relationally. Tech- nical Report 2009-2, Dep. of Software Engineering, Faculty of Mathematics and Physics, Charles University, 2009, 20 p. 9. J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, and J. Naughton. Relational databases for querying xml documents: Limitations and opportunities. In Proceedings of 25 VLDB Conference, 1999, pp. 302–314. 10. HAWK. http://swat.cse.lehigh.edu/projects/index.html#hawk 11. MySQL. http://www.mysql.com/. 12. Sesame. http://www.openrdf.org/. 13. SPARQL. http://www.w3.org/TR/rdf-sparql-query/. 14. SWAT Projects - the Lehigh University Benchmark (LUBM). http://swat.cse. lehigh.edu/projects/lubm/.