Logical reconstruction of normative RDF Jos de Bruijn1 , Enrico Franconi2 , and Sergio Tessaris2 1 Digital Enterprise Research Institute, University of Innsbruck, Austria jos.debruijn@deri.org 2 Faculty of Computer Science, Free University of Bozen-Bolzano, Italy lastname @inf.unibz.it Abstract. In this sketchy paper we introduce a logical reconstruction of the RDF family of languages and the OWL-DL family of languages. We prove that our logical framework is equivalent to the normative W3C definitions of RDF/RDFS and OWL-DL/Lite. The main aim is to have a unified model theoretic semantics for both worlds. As a consequence we get various complexity results and a model theoretic semantics for basic SPARQL. 1 Introduction The main aim of this sketchy paper is to recast the RDF and RDFS model theory in a more classical logic framework, and to use this characterisation to shed new light on the ontology languages layering in the semantic web, and to lay down the logic based semantics of SPARQL. In particular, we will show how the models of RDF can be related to the models of DL based ontology languages, without requiring any change on the existing syntactic or semantic definitions in the RDF and OWL-DL realms. We first introduce the notion of herbrand and canonical models for RDF graphs, and we use this notion to characterise RDF entailment. RDF herbrand models can also be seen as classical first order structures, that we call FO in- terpretations. These structures provide the semantic bridge between RDF and classical logics, such as description logics (DL) based languages (e.g., OWL-DL). The intuition beyond FO interpretations is that it singles out the concepts and the individuals from an RDF herbrand model – possibly in a polymorphic way when the same node is given both the meaning as a class and as an individual. Once we have characterised RDF graphs in terms of their herbrand models, it is possible to understand the notion of logical implication between RDF graphs and classical logic formulae. At the end of this paper we analyse the problem This work has been partially supported by the EU projects KnowledgeWeb, In- terop, Tones, Sekt, and Asg. This paper extends a paper previously published at the PPSWR-05 workshop [de Bruijn et al., 2005]. of querying RDF graphs with OWL-DL ontologies. We prove an important re- duction result. That is, given an RDF graph S and a query Q, the answer set of Q to S (as defined by W3C) is the same as the answer of Q to S given the empty KB. This shows a complete interoperability between RDF and OWL-DL. For example, in absence of ontologies, it would be possible to use OWL-QL to answer queries to RDF graphs, or to use SPARQL to answer queries to ABoxes. In this paper we assume that the reader is familiar with the definitions asso- ciated to RDF. 2 RDF Model Theory revisited In this paper we consider an extended notion of RDF graph, in which we are less restrictive on the kind of triples. In particular we allow – literals in subject positions; – blank nodes in property positions. Note that the first kind of extension has been already considered by W3C working groups (e.g. see Section 2.2 of [Prud’hommeaux and Seaborne, 2005]). All the results shown in this paper still holds for the standard definition of RDF graph. From now on, by RDF graph we intend the extended definition. Also note that reification is not considered as not being part of the normative semantic definition of RDF. We indicate with RDF U the set of all RDF URI references together with the set of all literals in their canonical representation1. An RDF graph is said to be well typed if doesn’t contain the triple h“xxx”^^rdf:XMLLiteral, rdf:type, rdf:XMLLiterali where “xxx”^^rdf:XMLLiteral is an ill-typed XML literal string (see the RDF semantic conditions in Section 3.1 of [Hayes, 2004]). We first define the notion of herbrand and canonical models for an RDF graph. Definition 1. (Herbrand and canonical models) A herbrand model of an RDF graph S is a well typed ground instantiation of the graph obtained by replacing each bnode in the completed S with some element in RDF U . A graph is completed if it is augmented by the RDF and RDFS axiomatic triples, it is extended by applying the RDF and RDFS entailment rules (see sections 3.1, 4.1, 7.2 and 7.3 in [Hayes, 2004])2 and all the literals are in their canonical representation. 1 The canonical representation of a literal is a chosen representative of all the literals associated to the same value, if the literal is non ill-typed, otherwise it is the literal itself. 2 Note that, since we allow literals as subject in RDF triples, we need to add a dual rule to RDF2 and to RDFS1 acting on literals in the subject of a triple; moreover, rules RDF2 and RDFS1 should have the proper literal instead of a bnode in the “then add” part. The canonical model Sb of an RDF graph S is the herbrand model of S obtained by skolemisation, i.e. by replacing each distinct bnode in S with a distinct fresh URI – that is, a skolem constant not appearing elsewhere in S nor in the context in which S is used (e.g. in queries). Note that a herbrand model is always finite if the RDF graph is finite, that a ground RDF graph has a unique herbrand model that it is also its canonical model, and that a herbrand model is a ground RDF graph. As the following theorem shows, the herbrand models of an RDF graph con- tain explicitly all the information entailed by the graph itself. Theorem 2. (RDF entailment) An RDF graph S entails an RDF graph E (as defined in [Hayes, 2004]), written S ` E, if and only if some herbrand model of E is a subgraph of the canonical model of S. Corollary 3. (Complexity of entailment) 1. RDF entailment is NP-complete in the size of the RDF graphs. 2. RDF entailment is polynomial in the size of the entailing graph S. 3. RDF entailment is polynomial in the size of the graphs if E is acyclic or ground. The proofs are based on a reduction to the problem of conjunctive query containment, and by using the interpolation lemma in [Hayes, 2004]. The above theorem and corollary (without the polynomial results) have been already sketched in [Gutierrez et al., 2004]. However, the results in [Gutierrez et al., 2004] are imprecise since the role of axiomatic triples and the completion (as defined here) are neglected, and literals are not taken in careful account. 2.1 The Semantics of Basic SPARQL Let’s now consider SPARQL queries on RDF graphs. If we restrict our attention to SPARQL query basic graph patterns [Prud’hommeaux and Seaborne, 2005], we can define the semantics of query answering in the usual logic based way (as, e.g., is defined for classical relational databases, or for description logics). We also disallow in this paper the answer to a query to contain blank nodes. Relaxing this restriction raises several issues regarding the redundancy of answers, which are not taken into account in [Prud’hommeaux and Seaborne, 2005]. Definition 4. (Semantics of basic SPARQL) A SPARQL query basic graph pattern to an RDF graph S is a (possibly ground) RDF graph Qx where, in addition to URIs and bnodes, variables are allowed; the elements in the set x (possibly empty) of n variables of a query are called distin- guished variables, and the bnodes play the role of non-distinguished variables. The answer set of Qx is the set of all substitutions of the distinguished variables with some arbitrary URI from RDF U , such that the for each substitution the instantiated query is entailed by S, i.e., {hc1 . . . cn i ∈ (RDF U )n | S ` Q[x1 7→c1 ,...,xn 7→cn ] }. Note that according to our extended definition of RDF graphs we allow blank nodes and variables in property position. The complexity results presented in Corollary 3 can be rephrased in the con- text of SPARQL. We consider the graph to be queried as the data against which the given query is verified; so, in this way, we can define the data complexity of the problem of query answering. Corollary 5. (Complexity of SPARQL) Query answering for SPARQL query basic graph patterns is polynomial in data complexity. Note that the above definition together with the correspondence stated in Theorem 2 supports the idea of implementing SPARQL by means of a rela- tional DBMS. In fact, the theorem suggests a straightforward query answering technique based on canonical models. The same technique can be extended in order to provide blank nodes in the query answers. However, it can be shown that guaranteeing non redundant answers increase the complexity of the problem, which rises to be NP-complete in the size of the answer set (i.e. data complexity). The result of Corollary 5 has been already sketched in [Gutierrez et al., 2004] for a richer query language, with the same imprecision we mentioned before. 2.2 The FO Model Theory for RDF A FO interpretation (first order interpretation) of an RDF graph shows how models of RDF can be seen as interpretations of classical first order logic. Definition 6. (FO interpretation of an RDF herbrand model) A FO interpretation I(·) of an RDF herbrand model IRDF is a first order type structure I(IRDF ) = h∆, .IO , .IC , .IR i, where ∆ is a non-empty abstract domain corresponding to RDF U . An RDF herbrand model IRDF with RDFS vocabulary and containing an XML clash has no FO interpretation. The interpretation of the elements of IRDF is given by the interpretation functions .IO , .IC , .IR , whose domain is RDF U , and the range is respectively all elements of ∆, all subsets of ∆, and all binary relations over ∆. The interpretation functions state which of the elements of the graph play the role of individuals, concepts, and roles. For each u ∈ RDF U , I(IRDF ) should be such that: uIO = u uIC = {o | ho, rdf:type, ui ∈ IRDF } uIR = {(o1 , o2 ) | ho1 , u, o2 i ∈ IRDF } An URI reference is associated to more than one syntactic type, e.g., an URI may refer to an individual and to a class at the same time: polymorphic meanings of URIs are allowed. However note that, just like in the case of contextual predicate calculus (as defined in [Chen et al., 1993]) and of π-semantics of [Hustadt et al., 2005], in the above definition there is no semantic interaction between the distinct occurrences of the same URI as a concept name, or as a role name, or as an individual. This absence of interaction is required for classical first order (description) logic fragments such as OWL-Lite or OWL-DL. For example, given the triple hex:o, rdf:type, ex:oi within an RDF herbrand model, in the FO interpretation associated to it the URI ex:o is interpreted as both an individual and a concept, and the individual ex:o is in the extension of the concept ex:o. We say that the FO interpretations of an RDF graph are the FO interpreta- tions of its herbrand models. The main theorem of this Section states that we can correctly define RDF entailment and queries using a classical logic with FO interpretations. Theorem 7. (FO entailment and query) 1. An RDF graph S entails an RDF graph E (as defined in [Hayes, 2004] ), written as S ` E, if and only if the set of all the FO interpretations of S is included in the set of all the FO interpretations of E, written S |= E. 2. The answer set of a SPARQL query basic graph pattern Qx to an RDF graph S, as defined in Definition 4, is equal to {hc1 . . . cn i ∈ (RDF U )n | S |= Q[x1 7→c1 ,...,xn 7→cn ] }. 3 Classical Logic Interoperability In this Section we define the interoperability between RDF graphs and first order classical logics. We show how a tower of classical logics (e.g., from OWL- Lite to OWL-DL, to full first order logic, or any arbitrary logic equipped with classical first order models) can be built on top of the language of RDF: the interoperability is grounded on the notion of FO interpretations. First, we need to define the notion of non high order graphs, that basically do not have bnodes in any property or class position. Definition 8. (Non-high order RDF graph) An RDF graph is non-high order if bnodes and variables are not in property position of any triple, nor in object position of rdf:type triples, nor in any triple with rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range predicates. Note that herbrand models and canonical models are always non-high order RDF graphs, since they are always ground graphs. Given a classical first order logic C, we define now the translation of a non- high order graph into some formula of C. We require that in C the interpretation of well-typed literals is subject to the Unique Name Assumption. Definition 9. (Classical logic translation) The classical logic translation FO(S) of a non-high order well-typed RDF graph S is a predicate logic formula, where URIs and literals (in their canonical rep- resentation) are constants and blank nodes are existentially quantified variables, and the body is a conjunction of the ground binary atomic formulae in corre- spondence with the triples of S, where to each binary atomic formula of the kind “ rdf:type(a, b)” a ground unary atomic formula of the kind “ b(a)” is added, to each binary atomic formula of the kind “ rdfs:subClassOf(a, b)” a formula of the kind “ ∀x. a(x) → b(x)” is added, to each binary atomic formula of the kind “ rdfs:subPropertyOf(a, b)” a formula of the kind “ ∀xy. a(x, y) → b(x, y)” is added, to each binary atomic formula of the kind “ rdfs:domain(a, b)” a formula of the kind “ ∀xy. a(x, y) → b(x)” is added, to each binary atomic formula of the kind “ rdfs:range(a, b)” a formula of the kind “ ∀xy. a(x, y) → b(y)” is added. If S has RDFS vocabulary and contains an XML clash, then FO(S) is equal to ⊥. We now introduce the general problem of reasoning and query answering in a classical first order logic C given an RDF graph. Definition 10. (Classical logic RDF extension) 1. The logical implication problem in a classical logic C given an RDF graph S is defined as follows: b φ |=C ψ FO(S), where φ and ψ are formulae in C, φ does not contain any symbol from the RDF and RDFS vocabularies, and |=C is entailment in C. 2. The query answering problem in a classical logic C given an RDF graph S is defined as follows: b φ |=C ψ[x 7→c ,...,x 7→c ] }. {hc1 . . . cn i ∈ (RDF U )n | FO(S), 1 1 n n where φ is a formula in C which does not contain any symbol from the RDF and RDFS vocabularies, and ψx is an open formula in C (expressing the query) with x being the free (distinguished) variables, and |=C is entailment in C. The above general definition of reasoning and querying given an RDF graph is actually an abstraction of basic reasoning and querying for RDF graphs only, as the following reduction theorem shows. Theorem 11. (Reduction theorem) 1. Given an RDF graph S and a non-high order graph E, S ` E if and only if FO(S)b |=C FO(E) 2. Given an RDF graph S and a SPARQL non-high order query basic graph pattern Qx , its answer set is equal to b |=C FO(Q[x 7→c ,...,x 7→c ] )}. {hc1 . . . cn i ∈ (RDF U )n | FO(S) 1 1 n n The proof of the reduction theorem is based on the following lemma. Lemma 12. (Canonical entailment) An RDF graph S entails an RDF graph E (as defined in [Hayes, 2004]), i.e., S ` E, if and only if the FO interpretation corresponding to the canonical model of S is in the set of all the FO interpretations of E. This lemma together with the reduction theorem justifies the use of datalog-like implementations for SPARQL. We believe that the classical logic RDF extension presented here is a mean- ingful way to build up logical languages on top of RDF, and it is a formal justi- fication of the semantic web tower of languages proposed by Tim Berners-Lee. As a matter of fact, in our proposed framework it is possible to add (first-order) logic-based knowledge on top of RDF graphs, written as a knowledge base φ in the logic C. Note that such knowledge base φ should not contain any RDF and RDFS vocabularies; this restriction is not really limiting, since – as it is evident by looking at the FO translation of RDF graphs – it is possible to write directly in C itself the RDF/RDFS properties. Also note that any use of the RDF and RDFS vocabularies in the entailed formula ψ (or in the query) is affected only by the RDF graph and not by the knowledge base expressed by φ. This observation suggests the definition of a pure classical logic RDF extension by considering the classical logic translation of the simple skolemisation (and not of the canonical model) of S, and by restricting both φ and ψ to contain no symbol from the RDF and RDFS vocabularies. The encoding of the RDF graph in the logic C would be obtained by only skolemising the bnodes in S without completing the graph itself, i.e. without adding the axiomatic triples and without applying the entailment rules. Definition 13. (Pure classical logic RDF extension) 1. The pure logical implication problem in a classical logic C given an RDF graph S is defined as follows: FO(SK(S)), φ |=C ψ where φ and ψ are formulae in C which do not contain any symbol from the RDF and RDFS vocabularies, and |=C is entailment in C. 2. The query answering problem in a classical logic C given an RDF graph S is defined as follows: {hc1 . . . cn i ∈ (RDF − n U ) | FO(SK(S)), φ |=C ψ[x1 7→c1 ,...,xn 7→cn ] }. where RDF − U does not include the RDF and RDF vocabularies, φ is a formula in C which does not contain any symbol from the RDF and RDFS vocabular- ies, and ψx is an open formula in C (expressing the query) with x being the free (distinguished) variables and not containing any symbol from the RDF and RDFS vocabularies, and |=C is entailment in C. As the following lemma shows, the classical logic translation of an RDF graph seems to be mostly insensible to the computation of the canonical model. Lemma 14. (Minimal models of classical logic translations) Given an RDF graph S, 1. FO(SK(S)), i.e. the classical logic translation of the skolemisation of S, has a unique minimal model; b i.e. the classical logic translation of the canonical model of S, has a 2. FO(S), unique minimal model; 3. the minimal models of FO(SK(S)) and FO(S) b are isomorphic up to renaming of the skolem constants, and by not considering the parts involving the RDF and RDFS vocabularies. The following important equivalence result can now be proved. Theorem 15. (Equivalence theorem) The pure classical logic RDF extension and the classical logic RDF extension are equivalent if ψ does not contain any symbol from the RDF and RDFS vo- cabularies, and the query answers containing symbols from the RDF and RDFS vocabularies are discarded. 4 Interoperability between RDF and OWL-DL The results presented so far have several immediate consequences when consid- ering the interoperability between OWL-DL/Lite with RDF. Note that in Section 3 we assume that the theory written in logic C, which interoperates with RDF(S), is not encoded in the graph itself (see Definition 10). For this reason, we do not run into the problems of dealing with triples which correspond to the peculiar serialisation of OWL into RDF. However, nothing prevents any specific implementation to use RDF to represent syntactically the formulae of the logic C. First of all, it is possible to have an implementation for free of a query evaluation engine for SPARQL non-high order query basic graph patterns over RDF graphs, using any of the existing description logics based query system available. In fact, it is enough to encode the (arbitrary) RDF graph to query as an ABox in the system (by considering its canonical model), and to query it by encoding the SPARQL non-high order query basic graph pattern as a standard conjunctive query. The reduction theorem above guarantees that we will get the correct answer. Moreover, it is possible to extend the query problem of an RDF graph in SPARQL to the query problem of an RDF graph (possibly with RDFS vocab- ulary) given an ontology in OWL-DL, again by exploiting standard description logics based query systems. This is achieved by just adding the encoding of the RDF graph to query as an OWL-DL knowledge base corresponding to the FO translation of its canonical model or of its simple skolemisation (depending whether the pure classical logic RDF extension has been chosen or not). This work also shows how it is possible to give a semantics to OWL-DL based on RDF, generalising the recommended semantics given by [Patel-Schneider et al., 2004]. Our proposal is fully compatible with the W3C recommended seman- tics, but removes some of the non necessary limitations related to the polymor- phism of URIs, enforced in [Patel-Schneider et al., 2004] by the “vocabulary par- titioning” and “explicit typing” restrictions. As a matter of fact, [Patel-Schneider et al., 2004] would allow interoperation and queries only with RDF graphs not containing any meta information (for example, of the kind represented by the triple hex:o, rdf:type, ex:oi); a slightly more liberal restriction has been pro- posed in [Pan and Horrocks, 2003] where acyclic regularly stratified meta-classes are allowed but still disallowing the above cyclic example. References [Chen et al., 1993] Weidong Chen, Michael Kifer, and David Warren. HILOG: a foun- dation for higher-order logic programming. Journal of Logic Programming, 15(3):187– 230, February 1993. [de Bruijn et al., 2005] Jos de Bruijn, Sergio Tessaris, and Enrico Franconi. Logical reconstruction of RDF and ontology languages. In Proc. of PPSWR-05, 2005. [Gutierrez et al., 2004] C. Gutierrez, C. Hurtado, and A. Mendelzon. Foundations of semantic web databases. In Proceedings of the 2004 ACM SIGART SIGMOD SIGART Symposium on Principles of Database Systems (PODS’04), 2004. [Hayes, 2004] Patrick Hayes. RDF semantics. Technical report, W3C, February 2004. W3C recommendation, URL http://www.w3.org/TR/rdf-mt/. [Hustadt et al., 2005] U. Hustadt, Boris Motik, and U. Sattler. Rea- soning for description logics around SHIQ in a resolution framework. Technical Report 3-8-04/04, FZI, Karlsruhe, Germany, 2005. URL http://www.fzi.de/ipe/publikationen.php?id=1172. [Pan and Horrocks, 2003] Jeff Pan and Ian Horrocks. RDFS(FA) and RDF MT: Two semantics for RDFS. In Proc. 2003 International Semantic Web Conference (ISWC 2003), 2003. [Patel-Schneider et al., 2004] Peter F. Patel-Schneider, Patrick Hayes, and Ian Horrocks. OWL web ontology language semantics and abstract syn- tax. Technical report, W3C, February 2004. W3C recommendation, URL http://www.w3.org/TR/owl-semantics/. [Prud’hommeaux and Seaborne, 2005] Eric Prud’hommeaux and Andy Seaborne. SPARQL query language for RDF. Technical report, W3C, July 2005. W3C working draft, URL http://www.w3.org/TR/rdf-sparql-query/.