Integrity Constraints for Linked Data Alan Jeffrey and Peter F. Patel–Schneider Alcatel–Lucent Bell Labs Abstract. Linked Data makes one central addition to the Semantic Web principles: all entity URIs should be dereferenceable to provide an authoritative RDF representation. URIs in a linked dataset can be par- titioned into the exported URIs for which the dataset is authoritative versus the imported URIs the dataset is linking against. This partition- ing has an impact on integrity constraints, as a Closed World Assumption applies to the exported URIs, while a Open World Assumption applies to the imported URIs. We provide a definition of integrity constraint satisfaction in the presence of partitioning, and show that it leads to a formal interpretation of dependency graphs which describe the hyper- linking relations between datasets. We prove that datasets with integrity constraints form a symmetric monoidal category, from which the sound- ness of acyclic dependency graphs follows. 1 Introduction Motivation. In the Semantic Web, entities are named by URIs, and are de- scribed by RDF documents. Linked Data [5] adds the constraint that entity URIs should be dereferenceable (HTTP URIs which accept GET requests), and dereferencing an entity URI returns an RDF representation of that entity. The W3C web architecture [8] calls such representations authoritative. The RDF triples contained in an entity representation will generally refer to entities for which the representation is not authoritative. Such hyperlinks between datasets are often visualized as a dependency graph, such as the popular Linking Open Data cloud diagram [6] shown in Figure 1. Linked Data puts a new spin on the open world stance of the Semantic Web: from the point of view of a given URI owner, the world is partitioned into local entities, for which the owner is authoritative, and imported entities, for which the owner is not authoritative. In this paper, we provide a formal model of this partitioning which includes: – a partitioning of entities into imported and exported nodes, in addition to the familiar blank nodes, – a definition of what it means for a dataset to satisfy its integrity constraints, based on the minimal models of Motik et al. [13], but adapted to partitioning, – a model of acyclic dependency graphs, which can be built compositionally, and where integrity constraint satisfaction can be performed locally, and – a proof that graphical reasoning for datasets is sound, by showing that datasets form a symmetric monoidal category. 2 Alan Jeffrey and Peter F. Patel–Schneider Fig. 1. Linking Open Data cloud diagram (detail) This paper gives the first formal treatment of authoritative resource, integrity constraints for linked data, dependency graphs, and the categorical structure of semantic data. In this introduction, we provide informal examples to motivate our model, which are made precise in Sections 4 and 5. Authoritative representations, imports and exports. The W3C web ar- chitecture [8] recommends as good practice that a URI owner “should provide authoritative representations of the resource it identifies”. Typically, these are HTTP URIs, which respond to GET requests. Linked Data [5] applies this prac- tice to the Semantic Web: URI owners provide authoritative representations of their URIs in RDF (for datasets) or OWL (for ontologies). Semantic reasoners can make deductions from Linked Data. For example, consider a URI bob: (in examples, we will use URI prefixes such as alice: and bob:) which dereferences to the Turtle [4] representation: bob: foaf:primaryTopic bob:me . bob:me foaf:knows [ foaf:homepage alice: ] . Now, if alice: dereferences to: alice: foaf:primaryTopic alice:me . then a reasoner can deduce (using the FOAF [2] specification’s definitions): bob:me foaf:knows alice:me . In Linked Data, the entities in a dataset can be partitioned into: – exported nodes (enodes): local entities, which the representation is authori- tative for, with a publicly defined name that other datasets may link against, – blank nodes (bnodes): local entities, which the representation is authoritative for, but without a publicly defined name, and – imported nodes (inodes): all other entities. For example, the RDF representation of bob: given above contains inode alice:, enodes bob: and bob:me, and an anonymous bnode (called _:anon below). Integrity Constraints for Linked Data 3 Ontologies and integrity constraints. A consumer of Linked Data may wish to assume a notion of correctness of the data it is consuming. Rather than considering integrity constraints to be given in a separate formalism such as a rules engine [14] or epistemic logic [15], we will use ontologies to express both deductive reasoning (the standard ontology), and the correctness criteria (the constraint ontology). A similar approach was taken by Motik et al. [13] and Tao et al. [17]. For example, consider the standard ontology: . homepage = primaryTopic− PersonalHomePage v Document Document v ≤1 primaryTopic and the constraint ontology: PersonalHomePage v ∃primaryTopic . Person Person v ∀knows . Person The above example is correct with respect to the exported interface: Person(bob:me) PersonalHomePage(bob:) under the assumption of the imported interface: PersonalHomePage(alice:) We reason informally as follows (this will be made formal in later sections). – The constraint PersonalHomePage v ∃primaryTopic . Person is satisfied be- cause the only new PersonalHomePage entity is bob:, and we have a witness bob:me for the role primaryTopic. – The constraint Person v ∀knows . Person is satisfied because the only new Person entity is bob:me, and the only entity which bob:me knows is _:anon. Now, in any world where PersonalHomePage(alice:), there must be some individual i such that primaryTopic(alice:, i) and Person(i). We can then reason using the standard ontology that i = _:anon and so Person(_:anon). This example, shows the use of two different styles of reasoning. – When reasoning about exported or blank nodes, we can assume that the only properties are ones which can be deduced from information we have asserted, using the standard ontology. For example, this form of reasoning is used in “because the only new PersonalHomePage entity is bob:” and “the only entity in a knows role with bob:me is _:anon.” – We reason differently about imported nodes. All we know about the imported world is that it satisfies the imported interface, the standard ontology and the constraint ontology. For example, this form of reasoning is used in “in any world where PersonalHomePage(alice:), there must be some individual i such that primaryTopic(alice:, i) and Person(i).” More succinctly, we use a Closed World Assumption for blank and exported nodes, and an Open World Assumption for imported nodes. 4 Alan Jeffrey and Peter F. Patel–Schneider Dependency graphs. Dependency graphs such as Figure 1 are a common way of visualizing linked data, but have, until now, remained informal. We propose a formalization of such graphs as directed graphs where nodes are datasets such as ALICE (the authoritative representation of alice:) and BOB (the authoritative representation of bob:), and edges indicate the existence of hyperlinks between datasets. These edges are labeled by interfaces to make the contract between datasets explicit, for example: PersonalHomePage(alice:) ALICE BOB Dependency graphs can be regarded as datasets, given by taking the union of all their constituent datasets (with a bit of bookkeeping to rename nodes to ensure no name clashes). Since dependency graphs form datasets, they can be nested, for example a GROUP which includes ALICE and BOB might be built: GROUP ALICE BOB Ensuring correctness should be compositional, for example knowing that ALICE and BOB are correct should ensure correctness of GROUP. Moreover, nested graphs should respect equivalence of datasets: if ALICE is replaced by an equiva- lent ALICE0 , then GROUP should be equivalent to GROUP0 . Finally, isomorphic graphs should be equivalent, irrespective of how they are composed, for example: DEPT CHARLIE DEPT GROUP CHARLIE ≡ GROUP ALICE BOB ALICE BOB Symmetric monoidal categories. Our goals for dependency graphs are: – Nodes describe datasets, edges describe hyperlink relationships. – Graphs can be built compositionally, with local checking of correctness. – Graph construction respects equivalence of datasets. – Isomorphism of dependency graphs implies equivalence of datasets. Proving these properties directly would be difficult, but fortunately there is an existing structure which guarantees these properties: a symmetric monoidal category. Category theory forms a foundational framework for mathematics, but our need of it is quite pragmatic: the equational theory of a symmetric monoidal category is precisely that of direct acyclic graphs (shown by Joyal and Street [10], see, for example, Selinger [16]). Figure 2 sketches how directed acyclic graphs form a symmetric monoidal category: Integrity Constraints for Linked Data 5 .. .. .. .. . F . . . .. .. .. .. .. .. .. .. .. . . . F . G . . G . . . 1 F;G F ⊗G σ Fig. 2. Directed acyclic graphs form a (strict) symmetric monoidal category. – The identity graph 1 just connects its source and target edges. – The composition F ; G of graphs takes the disjoint union of F and G, and unifies the target edges of F with the source edges of G. – The tensor F ⊗ G of graphs takes the disjoint union of F and G. – The symmetry graph σ just permutes its source and target edges. Since the equational theory of a symmetric monoidal category is precisely that of directed acyclic graphs, we can replace our goals for dependency graphs by the goal of showing that datasets form a symmetric monoidal category. This is a matter of proving a handful of equations, which is a easier than proving directly that graph isomorphism implies dataset equivalence. Summary. The remainder of this paper will make this motivational section precise. We will define a notion of integrity constraint suitable for partitioning, and show that datasets with integrity constraints form a symmetric monoidal category, and hence can be formalized by dependency graphs. This is the first such investigation of integrity constraints for Linked Data. All results presented in this paper have been mechanically verified, using the Agda [1] mechanical proof assistant; all proofs are publicly available [9]. 2 Preliminaries In this paper, we consider a Description Logic SHIN + 1 , which includes role hierarchies, role inverses, disjoint, reflexive, irreflexive and transitive roles, and singleton cardinality restrictions. We expect the results to apply to other descrip- tion logics. Spelling this out, roles and concepts are defined by the grammars (where r and c are drawn from sets of atomic role names and concept names): R ::= r | r− C ::= c | ¬c | ⊥ | > | C1 u C2 | C1 t C2 | ∀R . C | ∃R . C | ≤1 R | >1 R A TBox is a finite set of axioms of the form: C1 v C2 or R1 v R2 or Dis(R1 , R2 ) or Ref(R) or Irr(R) or Tra(R) 6 Alan Jeffrey and Peter F. Patel–Schneider Following Motik et al. [13], we assume ambient TBoxes S (the standard TBox) and T (the constraint TBox). For any finite set X, an ABox over X is a finite set of assertions of the form: c(x) or r(x, y) or x ≈ y where x, y ∈ X Note that ABoxes are restricted to contain only positive statements, and so have a monotone semantics. In many cases, this does not impact expressivity, as S can give names for arbitrary concepts, and T can introduce an irreflexive role differentFrom used in place of 6≈ assertions. In practice, RDF is limited to positive atomic statements. An interpretation I over X consists of a set ∆I , together with cI ⊆ ∆I for each concept name c, rI ⊆ ∆I × ∆I for each role name r, and xI ∈ ∆I for each x ∈ X. The satisfaction relations I  A (for an ABox A over X) and I  T (for a TBox T ) are standard. Note that if I is an interpretation over X ⊇ Y , then I can be regarded as an interpretation over Y . In the following, we will write X ] Y for the disjoint union of X and Y : for simplicity, we will assume that X and Y are disjoint, and so X ⊆ X ] Y ⊇ Y . In the mechanized proofs [9], we use explicit tagging to ensure disjointness. 3 Initial interpretations Consider ABoxes A over X, B over Y , and F over (X ] V ] Y ). We can think of A as the imported interface (where X is the set of inodes), B as the exported interface (where Y is the set of enodes) and F as the dataset (where V is the set of bnodes). Now, what does it mean for F to import A and export B, in the presence of ambient TBoxes S and T ? F can be thought of as a recipe for adding new assertions to an existing interpretation. Given any interpretation I over X which satisfies (S, T, A), we require there to be a canonical interpretation J over (X ] V ] Y ) which extends I with (S, F ), and we require J to satisfy (T, B). Motik et al. [13] use a similar notion of constraint satisfaction, although they consider all minimal J , rather than a canonical J , with respect to subset or- der on Herbrand models of Skolemized formulae. As they note, Skolemization has an impact on the notion of equivalence of TBoxes, for example (c v ∃r . d) is not equivalent to (c v ∃r . d, c v ∃r . d) because they Skolemize differently (each existential quantifier introduces a new Skolem function, which may be in- terpreted differently). We avoid Skolemization by considering initial interpreta- tions (relative to homomorphisms between interpretations) rather than minimal interpretations (relative to subset order). Tao et al. [17] also consider minimal models, with respect to a partial or- der ≺= which preserves concept membership, role membership and equality of named individuals. They avoid Skolemization by an alternate semantics, where quantification only ranges over named individuals. Integrity Constraints for Linked Data 7 A homomorphism between interpretations I and J over X is a function h : ∆I → ∆J such that, for all x, i and j: h(xI ) = xJ (i ∈ cI ) ⇒ (h(i) ∈ cJ ) ((i, j) ∈ rI ) ⇒ ((h(i), h(j)) ∈ rJ ) We will write I . J whenever there is a homomorphism from I to J . Consider an interpretation I, and a family of interpretations Ji with a chosen family of homomorphisms hi : I → Ji . An initial Ji is one with a unique family of homomorphisms: gj : Ji → Jj such that gj ◦ hi = hj . Note that initial interpretations do not always exist, but that when they do they are unique up to isomorphism. Definition 1. For any interpretation I over X and ABox F over Z ⊇ X, let I ⊕ (S, F ) be the initial interpretation J over Z such that I . J and J  S, F . Note that I ⊕ (S, F ) does not always exist, as S may contain existentials or disjunctions which do not have canonical witnesses. For example there is no initial extension of ∅ by: Bool v True t False True t False v Bool Bool(x) since there are two incomparable extensions, one with True(x) and one with False(x). However, there is a syntactic restriction which guarantees the existence of initial interpretations. Let S be minimizable whenever any axiom C v D has C built from atoms, ⊥, >, t, u and ∃, and D built from atoms, >, u, ∀ and ≤. Proposition 1. If S is minimizable, then I ⊕ (S, F ) exists. 4 Integrity constraints Having defined initiality, we can now define constraint satisfaction. This is a variant of Motik et al.’s definition: rather than considering all minimal interpre- tations, we require a canonical initial interpretation to exist, and for it to satisfy the integrity constraints. Definition 2. For ABoxes A over X, B over Y and F over (X ] V ] Y ), define F : A ⇒ B whenever, for any interpretation I over X such that I  S, T, A, we have I ⊕ (S, F )  T, B. For example, in the example from Section 1 we have that in any I which satisfies the ambient TBoxes and PersonalHomePage(alice:), there must be some i such that (alice:I , i) ∈ primaryTopicI , so we can pick fresh j and k and define J as the smallest extension of I where: bob:J = j bob:meJ = k :anonJ = alice:I j ∈ PersonalHomePageJ j ∈ DocumentJ k ∈ PersonJ J J (j, k) ∈ primaryTopic (k, j) ∈ homepage (k, i) ∈ knowsJ 8 Alan Jeffrey and Peter F. Patel–Schneider and so we have: PersonalHomePage(bob:),    Person(bob:me),   PersonalHomePage(bob:),   primaryTopic(bob:, bob:me),  : (PersonalHomePage(alice:)) ⇒    knows(bob:me, :anon),  Person(bob:me) homepage( :anon, alice:) 5 Symmetric monoidal category Having defined our notion of integrity constraints for Linked Data, we give our main result, which is that ABoxes with integrity constraints form a symmetric monoidal category, and hence (as shown by Joyal and Street [10] and surveyed, for example, by Selinger [16]) can be modeled formally by directed acyclic graphs. A symmetric monoidal category C consists of: – A collection Obj(C) of objects, including: • a chosen object I, and • for each pair of objects A and B, an object A ⊗ B. – For each pair of objects, A and B, a collection of morphisms C[A, B], in- cluding (where we write f : A → B whenever f is in C[A, B]): • for each f : A → B and g : B → C, a morphism (f ; g) : A → C, • for each f : A → C and g : B → D, a morphism (f ⊗ g) : (A ⊗ B) → (C ⊗ D), and • chosen families of morphisms: 1A : A → A σAB : (A ⊗ B) → (B ⊗ A) αABC : ((A ⊗ B) ⊗ C) → (A ⊗ (B ⊗ C)) λA : (A ⊗ I) → A −1 αABC : (A ⊗ (B ⊗ C)) → ((A ⊗ B) ⊗ C) λ−1 A : A → (A ⊗ I) satisfying certain equations (see, for example Mac Lane [12] for details). The objects of our symmetric monoidal category ABox will be ABoxes, which we will think of as interfaces. – Obj(ABox) is the collection of all ABoxes. – The chosen object I is the empty ABox. – Given two ABoxes A over X and B over Y , the object (A ⊗ B) is the ABox (A, B) over (X ] Y ). The morphisms of the category ABox will also be ABoxes, this time thought of as datasets satisfying integrity constraints. – ABox[A, B] is the collection of all ABoxes F such that F : A ⇒ B. – Given two ABoxes F over (X ]V ]Y ) and G over (Y ]W ]Z), the morphism (F ; G) is the ABox (F, G) over (X ] (V ] Y ] W ) ] Z). – Given two ABoxes F1 over (X1 ] V1 ] Y1 ) and F2 over (X2 ] V2 ] Y2 ), the morphism (F1 ⊗F2 ) is the ABox (F1 , F2 ) over ((X1 ]X2 )](V1 ]V2 )](Y1 ]Y2 )). Integrity Constraints for Linked Data 9 To verify that this definition is well-formed, we have to verify that checking integrity constraints is compositional, that is we only have to check integrity locally, and know it is preserved by composition and tensor. Proposition 2. 1. If F : A ⇒ B and G : B ⇒ C, then (F ; G) : A ⇒ C. 2. If F1 : A1 ⇒ B1 and F2 : A2 ⇒ B2 , then (F1 ⊗ F2 ) : (A1 ⊗ A2 ) ⇒ (B1 ⊗ B2 ). Note that the composition (F ; G) may introduce bnodes, since the intermediate names which are exported by F and imported by G become bnodes (indeed, this is why bnodes are present in this model). For example: (knows(alice:me, bob:me)); (knows(bob:me; charlie:me)) ≡ (knows(alice:me, :anon), knows( :anon, charlie:me)) As well as composition of ABoxes, we have to provide the “wiring” combinators for identity, symmetry, unit and associativity. These are all constructed in the same way: given any function f : Y → X on finite sets, we define the ABox wiring(f ) over (X ] Y ) as containing f (y) ≈ y for each y ∈ Y . We can then show that wiring(f ) respects renaming of ABoxes. Given any ABox A over Y , let f [A] be the ABox over X given by replacing any individual y in A by f (y). Proposition 3. If f : Y → X and B ⊆ f [A], then wiring(f ) : A ⇒ B. This suffices to define the combinators of a symmetric monoidal category, for example 1A : A ⇒ A is given by wiring the identity function. Finally, we have to prove the equations of a symmetric monoidal category. These equations are not true up to syntactic equality of ABoxes, due to intro- duction of bnodes, for example a counter-example to 1; F = F is: (alice:me ≈ alice:me0 ); (knows(alice:me0 , bob:me)) ≡ (alice:me ≈ :anon, knows( :anon, bob:me)) 6= (knows(alice:me, bob:me)) The equations are true when we consider ABoxes up to equivalence (in the presence of S, T and A), that is: F ≡ G : A ⇒ B whenever S, T, A, F  G and S, T, A, G  F We therefore consider the morphisms of ABox up to equivalence, which requires us to show that composition and tensor respect equivalence: Proposition 4. 1. If F ≡ F 0 : A ⇒ B and G ≡ G0 : B ⇒ C then (F ; G) ≡ (F 0 ; G0 ) : A ⇒ C. 2. If F1 ≡ F10 : A1 ⇒ B1 and F2 ≡ F20 : A2 ⇒ B2 then (F1 ⊗ F2 ) ≡ (F10 ⊗ F20 ) : (A1 ⊗ A2 ) ⇒ (B1 ⊗ B2 ). 10 Alan Jeffrey and Peter F. Patel–Schneider Fig. 3. Example of Agda proof mechanization The proofs that ABoxes satisfy the equations of a symmetric monoidal category are then direct. The coherence properties (which only involve compositions of wiring morphisms) follow because wiring respects composition and tensor: wiring(f ); wiring(g) ≡ wiring(f ◦ g) wiring(f ) ⊗ wiring(g) ≡ wiring(f ] g) Theorem 1. ABox forms a symmetric monoidal category. The proof of this theorem, including the definitions it relies on, is approximately 3,000 lines of Agda code [9]. An example lemma is shown in Figure 3. 6 Conclusions and further work We have presented the first treatment of integrity constraints for Linked Data which makes use of a partition between local entities, for which a dataset is authoritative, and imported entities, where complete information is not known. We have given the first categorical presentation of datasets, and as a consequence, we have the first formal treatment of acyclic dependency graphs. There are open questions raised by this model, of which the most important is its algorithmic properties: is integrity constraint satisfaction decidable, and if so, what is its complexity, and can it be reduced to existing decision problems? Our model only treats acyclic dependency graphs, via symmetric monoidal categories. A categorical treatment of cyclic graphs uses traced monoidal cate- gories (introduced by Joyal, Street and Verity [11], and discussed by Selinger [16]). Cyclic graphs require the existence of fixed points which unfortunately do not re- spect integrity constraint satisfaction, for example the fixed point of the identity morphism is equivalent to an empty dataset, which will not satisfy existential Integrity Constraints for Linked Data 11 or disjunctive integrity constraints. The situation is similar to that of complete metric spaces: not all functions have fixed points, but contraction maps do. Our model assumes the existence of ambient TBoxes S and T , which must be agreed upon by all datasets. This requirement is quite strong, and the model would be improved by allowing authoritative ontologies as well as datasets. This is related to the notion of modularity of ontologies [7]. The mechanized proofs of our model [9] are given in Agda [1], which as well as a proof assistant is a programming language which compiles to Haskell [3]. We hope to extend our proofs to a Semantic Web library, which will support the development of provably correct programs to process Linked Data. References 1. The Agda programming language. http://wiki.portal.chalmers.se/agda/ 2. The friend of a friend (FOAF) project, http://www.foaf-project.org/ 3. The Haskell programming language. http://haskell.org/ 4. Beckett, D., Berners-Lee, T.: Turtle - terse RDF triple language (2008), http: //www.w3.org/TeamSubmission/turtle/ 5. Berners-Lee, T.: Linked data (2006), http://www.w3.org/DesignIssues/ LinkedData.html 6. Cyganiak, R., Jentzsch, A.: Linking open data cloud diagram, http://lod-cloud. net/ 7. Grau, B.C., Horrocks, I., Kazakov, Y., Sattler, U.: Modular reuse of ontologies: Theory and practice. J. Artificial Intelligence Research 31, 273–318 (2008) 8. Jacobs, I., Walsh, N.: Architecture of the World Wide Web, volume one. W3C Recommendation (2004), http://www.w3.org/TR/webarch/ 9. Jeffrey, A.S.A.: Agda libraries for the semantic web. https://github.com/agda/ agda-web-semantic/ (2011) 10. Joyal, A., Street, R.: The geometry of tensor calculus I. Advances in Mathematics 88(1), 55–112 (1991) 11. Joyal, A., Street, R., Verity, D.: Traced monoidal categories. Math. Proc. Cam- bridge Phil. Soc. 3, 447–468 (1996) 12. Mac Lane, S.: Categories for the Working Mathematician. Springer, 2nd edn. (1998) 13. Motik, B., Horrocks, I., Sattler, U.: Bridging the gap between OWL and relational databases. J. Web Semantics 7(2), 74–89 (2009) 14. Motik, B., Rosati, R.: Reconciling Description Logics and Rules. J. ACM 57(5), 1–62 (2010) 15. Reiter, R.: What should a database know? J. Log. Program. 14, 127–153 (1992) 16. Selinger, P.: A survey of graphical languages for monoidal categories. In: Coecke, B. (ed.) New Structures for Physics, Lecture Notes in Physics, vol. 813, chap. 4, pp. 289–356. Springer (2011) 17. Tao, J., Sirin, E., Bao, J., McGuinness, D.L.: Extending OWL with integrity con- straints. In: Proc. Workshop on Description Logics. pp. 137–148 (2010)