Integrity Constraints for Linked Data

Alcatel-Lucent Bell Labs

Linked Data makes one central addition to the Semantic Web principles: all entity URIs should be dereferenceable to provide an authoritative RDF representation. URIs in a linked dataset can be partitioned into the exported URIs, for which the dataset is authoritative, and the imported URIs, which the dataset links against. This partitioning has an impact on integrity constraints, as a Closed World Assumption applies to the exported URIs, while an Open World Assumption applies to the imported URIs. We provide a definition of integrity constraint satisfaction in the presence of partitioning, and show that it leads to a formal interpretation of dependency graphs which describe the hyperlinking relations between datasets. We prove that datasets with integrity constraints form a symmetric monoidal category, from which the soundness of acyclic dependency graphs follows.

1 Introduction

Motivation. In the Semantic Web, entities are named by URIs, and are described by RDF documents. Linked Data [5] adds the constraint that entity URIs should be dereferenceable (HTTP URIs which accept GET requests), and that dereferencing an entity URI returns an RDF representation of that entity. The W3C web architecture [8] calls such representations authoritative.

The RDF triples contained in an entity representation will generally refer to entities for which the representation is not authoritative. Such hyperlinks between datasets are often visualized as a dependency graph, such as the popular Linking Open Data cloud diagram [6] shown in Figure 1.

Linked Data puts a new spin on the open world stance of the Semantic Web: from the point of view of a given URI owner, the world is partitioned into local entities, for which the owner is authoritative, and imported entities, for which the owner is not authoritative. In this paper, we provide a formal model of this partitioning which includes:

- a partitioning of entities into imported and exported nodes, in addition to the familiar blank nodes,
- a definition of what it means for a dataset to satisfy its integrity constraints, based on the minimal models of Motik et al. [13], but adapted to partitioning,
- a model of acyclic dependency graphs, which can be built compositionally, and where integrity constraint satisfaction can be performed locally, and
- a proof that graphical reasoning for datasets is sound, by showing that datasets form a symmetric monoidal category.

This paper gives the first formal treatment of authoritative resource, integrity constraints for linked data, dependency graphs, and the categorical structure of semantic data. In this introduction, we provide informal examples to motivate our model, which are made precise in Sections 4 and 5.

Authoritative representations, imports and exports. The W3C web architecture [8] recommends as good practice that a URI owner "should provide authoritative representations of the resource it identifies". Typically, these are HTTP URIs, which respond to GET requests. Linked Data [5] applies this practice to the Semantic Web: URI owners provide authoritative representations of their URIs in RDF (for datasets) or OWL (for ontologies).

Semantic reasoners can make deductions from Linked Data. For example, consider a URI bob: (in examples, we will use URI prefixes such as alice: and bob:) which dereferences to the Turtle [4] representation:

bob: foaf:primaryTopic bob:me .

bob:me foaf:knows [ foaf:homepage alice: ] .

Now, if alice: dereferences to:

alice: foaf:primaryTopic alice:me .

then a reasoner can deduce (using the FOAF [2] specification's definitions):

bob:me foaf:knows alice:me .
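The deduction above can be sketched in a few lines of Python. This is an illustrative sketch rather than a real reasoner: it hard-codes two assumptions (homepage is the inverse of primaryTopic, and a document has at most one primaryTopic), and the helper name `deduce_knows` is hypothetical.

```python
# Triples are (subject, predicate, object) tuples of plain strings.
bob_doc = {
    ("bob:", "foaf:primaryTopic", "bob:me"),
    ("bob:me", "foaf:knows", "_:anon"),
    ("_:anon", "foaf:homepage", "alice:"),
}
alice_doc = {("alice:", "foaf:primaryTopic", "alice:me")}


def deduce_knows(triples):
    """Apply the two assumed rules and return the derived foaf:knows triples."""
    facts = set(triples)
    # Rule 1 (assumed): homepage is the inverse of primaryTopic.
    facts |= {(o, "foaf:primaryTopic", s)
              for (s, p, o) in triples if p == "foaf:homepage"}
    # Rule 2 (assumed): a document has at most one primaryTopic, so all
    # topics of the same document denote the same individual.
    topics = {}
    for (s, p, o) in facts:
        if p == "foaf:primaryTopic":
            topics.setdefault(s, set()).add(o)
    same_as = {}
    for group in topics.values():
        named = sorted(n for n in group if not n.startswith("_:"))
        if named:
            for n in group:
                if n.startswith("_:"):
                    same_as[n] = named[0]  # identify the bnode with a named node
    return {(same_as.get(s, s), p, same_as.get(o, o))
            for (s, p, o) in facts if p == "foaf:knows"}


derived = deduce_knows(bob_doc | alice_doc)
```

Under these assumptions, `derived` contains the single triple relating bob:me to alice:me by foaf:knows, because the blank node is identified with alice:me.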

In Linked Data, the entities in a dataset can be partitioned into:

- exported nodes (enodes): local entities, which the representation is authoritative for, with a publicly defined name that other datasets may link against,
- blank nodes (bnodes): local entities, which the representation is authoritative for, but without a publicly defined name, and
- imported nodes (inodes): all other entities.

For example, the RDF representation of bob: given above contains inode alice:, enodes bob: and bob:me, and an anonymous bnode (called _:anon below).
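The partition can be illustrated with a small Python sketch. It makes a simplifying assumption that is not part of the model: a node counts as exported when its name starts with the prefix the document is authoritative for, and as blank when it starts with `_:`. The function name `partition_nodes` is hypothetical.

```python
def partition_nodes(triples, authority):
    """Split the subjects and objects of a triple set into enodes, bnodes, inodes.

    `authority` is the URI prefix the document is authoritative for
    (an assumed, simplified notion of authority).
    """
    nodes = {n for (s, _p, o) in triples for n in (s, o)}
    bnodes = {n for n in nodes if n.startswith("_:")}
    enodes = {n for n in nodes - bnodes if n.startswith(authority)}
    inodes = nodes - bnodes - enodes
    return enodes, bnodes, inodes


bob_triples = {
    ("bob:", "foaf:primaryTopic", "bob:me"),
    ("bob:me", "foaf:knows", "_:anon"),
    ("_:anon", "foaf:homepage", "alice:"),
}
enodes, bnodes, inodes = partition_nodes(bob_triples, "bob:")
```

For the representation of bob:, this yields the enodes bob: and bob:me, the bnode _:anon, and the inode alice:, matching the classification given in the text.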

Ontologies and integrity constraints. A consumer of Linked Data may wish to assume a notion of correctness of the data it is consuming. Rather than considering integrity constraints to be given in a separate formalism such as a rules engine [14] or epistemic logic [15], we will use ontologies to express both deductive reasoning (the standard ontology), and the correctness criteria (the constraint ontology). A similar approach was taken by Motik et al. [13] and Tao et al. [17]. For example, consider the standard ontology:

$$\mathit{homepage} \equiv \mathit{primaryTopic}^{-}$$
$$\mathit{PersonalHomePage} \sqsubseteq \mathit{Document}$$
$$\mathit{Document} \sqsubseteq {\leq}1\, \mathit{primaryTopic}$$

and the constraint ontology:

$$\mathit{PersonalHomePage} \sqsubseteq \exists \mathit{primaryTopic} . \mathit{Person}$$
$$\mathit{Person} \sqsubseteq \forall \mathit{knows} . \mathit{Person}$$

The above example is correct with respect to the exported interface:

Person(bob:me) PersonalHomePage(bob:)

under the assumption of the imported interface:

PersonalHomePage(alice:)

We reason informally as follows (this will be made formal in later sections).

- The constraint $\mathit{PersonalHomePage} \sqsubseteq \exists \mathit{primaryTopic} . \mathit{Person}$ is satisfied because the only new PersonalHomePage entity is bob:, and we have a witness bob:me for the role primaryTopic.
- The constraint $\mathit{Person} \sqsubseteq \forall \mathit{knows} . \mathit{Person}$ is satisfied because the only new Person entity is bob:me, and the only entity which bob:me knows is _:anon. Now, in any world where PersonalHomePage(alice:), there must be some individual i such that primaryTopic(alice:, i) and Person(i). We can then reason using the standard ontology that i = _:anon and so Person(_:anon).

This example shows the use of two different styles of reasoning.

- When reasoning about exported or blank nodes, we can assume that the only properties are ones which can be deduced from information we have asserted, using the standard ontology. For example, this form of reasoning is used in "because the only new PersonalHomePage entity is bob:" and "the only entity in a knows role with bob:me is _:anon."
- We reason differently about imported nodes. All we know about the imported world is that it satisfies the imported interface, the standard ontology and the constraint ontology. For example, this form of reasoning is used in "in any world where PersonalHomePage(alice:), there must be some individual i such that primaryTopic(alice:, i) and Person(i)."

More succinctly, we use a Closed World Assumption for blank and exported nodes, and an Open World Assumption for imported nodes.

Dependency graphs. Dependency graphs such as Figure 1 are a common way of visualizing linked data, but have, until now, remained informal. We propose a formalization of such graphs as directed graphs where nodes are datasets such as ALICE (the authoritative representation of alice:) and BOB (the authoritative representation of bob:), and edges indicate the existence of hyperlinks between datasets. These edges are labeled by interfaces to make the contract between datasets explicit, for example:

[Diagram: ALICE and BOB joined by an edge labeled PersonalHomePage(alice:)]

Dependency graphs can be regarded as datasets, given by taking the union of all their constituent datasets (with a bit of bookkeeping to rename nodes to ensure no name clashes). Since dependency graphs form datasets, they can be nested, for example a GROUP which includes ALICE and BOB might be built:

[Diagram: GROUP, containing ALICE and BOB]

Ensuring correctness should be compositional, for example knowing that ALICE and BOB are correct should ensure correctness of GROUP. Moreover, nested graphs should respect equivalence of datasets: if ALICE is replaced by an equivalent ALICE′, then GROUP should be equivalent to GROUP′. Finally, isomorphic graphs should be equivalent, irrespective of how they are composed, for example:

[Diagram: two isomorphic dependency graphs DEPT, each built from GROUP, ALICE, BOB and CHARLIE]

Symmetric monoidal categories. Our goals for dependency graphs are:

- Nodes describe datasets, edges describe hyperlink relationships.
- Graphs can be built compositionally, with local checking of correctness.
- Graph construction respects equivalence of datasets.
- Isomorphism of dependency graphs implies equivalence of datasets.

Proving these properties directly would be difficult, but fortunately there is an existing structure which guarantees these properties: a symmetric monoidal category. Category theory forms a foundational framework for mathematics, but our need of it is quite pragmatic: the equational theory of a symmetric monoidal category is precisely that of directed acyclic graphs (shown by Joyal and Street [10], see, for example, Selinger [16]). Figure 2 sketches how directed acyclic graphs form a symmetric monoidal category:

[Figure 2: directed acyclic graphs as a symmetric monoidal category; the surviving panel labels are F, G, 1 and F ; G]

- The identity graph 1 just connects its source and target edges.
- The composition F ; G of graphs takes the disjoint union of F and G, and unifies the target edges of F with the source edges of G.
- The tensor $F \otimes G$ of graphs takes the disjoint union of F and G.
- The symmetry graph just permutes its source and target edges.

Since the equational theory of a symmetric monoidal category is precisely that of directed acyclic graphs, we can replace our goals for dependency graphs by the goal of showing that datasets form a symmetric monoidal category. This is a matter of proving a handful of equations, which is easier than proving directly that graph isomorphism implies dataset equivalence.

Summary. The remainder of this paper will make this motivational section precise. We will define a notion of integrity constraint suitable for partitioning, and show that datasets with integrity constraints form a symmetric monoidal category, and hence can be formalized by dependency graphs. This is the first such investigation of integrity constraints for Linked Data. All results presented in this paper have been mechanically verified, using the Agda [1] mechanical proof assistant; all proofs are publicly available [9].

2 Preliminaries

In this paper, we consider a Description Logic $\mathcal{SHIN}^{+}_{1}$, which includes role hierarchies, role inverses, disjoint, reflexive, irreflexive and transitive roles, and singleton cardinality restrictions. We expect the results to apply to other description logics. Spelling this out, roles and concepts are defined by the grammars (where r and c are drawn from sets of atomic role names and concept names):

$$R ::= r \mid r^{-}$$
$$C ::= c \mid \neg c \mid \bot \mid \top \mid C_1 \sqcap C_2 \mid C_1 \sqcup C_2 \mid \forall R . C \mid \exists R . C \mid {\leq}1\, R \mid {>}1\, R$$

A TBox is a finite set of axioms of the form:

$$C_1 \sqsubseteq C_2 \quad\text{or}\quad R_1 \sqsubseteq R_2 \quad\text{or}\quad \mathrm{Dis}(R_1, R_2) \quad\text{or}\quad \mathrm{Ref}(R) \quad\text{or}\quad \mathrm{Irr}(R) \quad\text{or}\quad \mathrm{Tra}(R)$$

Following Motik et al. [13], we assume ambient TBoxes S (the standard TBox) and T (the constraint TBox). For any finite set X, an ABox over X is a finite set of assertions of the form:

$$c(x) \quad\text{or}\quad r(x, y) \quad\text{or}\quad x \approx y \qquad \text{where } x, y \in X$$

Note that ABoxes are restricted to contain only positive statements, and so have a monotone semantics. In many cases, this does not impact expressivity, as S can give names for arbitrary concepts, and T can introduce an irreflexive role differentFrom used in place of $\not\approx$ assertions. In practice, RDF is limited to positive atomic statements.

An interpretation I over X consists of a set $\Delta^{I}$, together with $c^{I} \subseteq \Delta^{I}$ for each concept name c, $r^{I} \subseteq \Delta^{I} \times \Delta^{I}$ for each role name r, and $x^{I} \in \Delta^{I}$ for each $x \in X$. The satisfaction relations $I \models A$ (for an ABox A over X) and $I \models T$ (for a TBox T) are standard. Note that if I is an interpretation over $X \supseteq Y$, then I can be regarded as an interpretation over Y.

In the following, we will write $X \uplus Y$ for the disjoint union of X and Y: for simplicity, we will assume that X and Y are disjoint, and so $X \subseteq X \uplus Y \supseteq Y$.

In the mechanized proofs [9], we use explicit tagging to ensure disjointness.
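As a rough illustration of what explicit tagging buys, the following Python sketch builds a disjoint union by tagging each element with the side it came from. The tag names `inl` and `inr` are an assumption made for this sketch and are not taken from the mechanized proofs.

```python
def disjoint_union(xs, ys):
    """Tagged disjoint union: elements stay distinct even if xs and ys overlap."""
    return {("inl", x) for x in xs} | {("inr", y) for y in ys}


# "bob:me" occurs on both sides, but the tags keep the two copies apart.
tagged = disjoint_union({"alice:me", "bob:me"}, {"bob:me"})
```

Here `tagged` has three elements, whereas the plain union of the two sets would have only two.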

3 Initial interpretations

Consider ABoxes A over X, B over Y, and F over $(X \uplus V \uplus Y)$. We can think of A as the imported interface (where X is the set of inodes), B as the exported interface (where Y is the set of enodes) and F as the dataset (where V is the set of bnodes). Now, what does it mean for F to import A and export B, in the presence of ambient TBoxes S and T?

F can be thought of as a recipe for adding new assertions to an existing interpretation. Given any interpretation I over X which satisfies (S, T, A), we require there to be a canonical interpretation J over $(X \uplus V \uplus Y)$ which extends I with (S, F), and we require J to satisfy (T, B).

Motik et al. [13] use a similar notion of constraint satisfaction, although they consider all minimal J, rather than a canonical J, with respect to subset order on Herbrand models of Skolemized formulae. As they note, Skolemization has an impact on the notion of equivalence of TBoxes, for example $(c \sqsubseteq \exists r . d)$ is not equivalent to $(c \sqsubseteq \exists r . d,\ c \sqsubseteq \exists r . d)$ because they Skolemize differently (each existential quantifier introduces a new Skolem function, which may be interpreted differently). We avoid Skolemization by considering initial interpretations (relative to homomorphisms between interpretations) rather than minimal interpretations (relative to subset order).

Tao et al. [17] also consider minimal models, with respect to a partial order which preserves concept membership, role membership and equality of named individuals. They avoid Skolemization by an alternate semantics, where quantification only ranges over named individuals.

A homomorphism between interpretations I and J over X is a function $h : \Delta^{I} \to \Delta^{J}$ such that, for all x, i and j:

$$h(x^{I}) = x^{J} \qquad (i \in c^{I}) \Rightarrow (h(i) \in c^{J}) \qquad ((i, j) \in r^{I}) \Rightarrow ((h(i), h(j)) \in r^{J})$$

We will write $I \lesssim J$ whenever there is a homomorphism from I to J. Consider an interpretation I, and a family of interpretations $J_i$ with a chosen family of homomorphisms $h_i : I \to J_i$. An initial $J_i$ is one with a unique family of homomorphisms $g_j : J_i \to J_j$ such that $g_j \circ h_i = h_j$. Note that initial interpretations do not always exist, but that when they do they are unique up to isomorphism.
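For finite interpretations, the three homomorphism conditions can be checked directly. The Python sketch below assumes an interpretation is represented as a dictionary with keys `domain`, `names`, `concepts` and `roles`; this representation, the function name `is_homomorphism`, and the concept name `Agent` in the example are all assumptions made for illustration.

```python
def is_homomorphism(h, I, J):
    """Check the homomorphism conditions for a candidate map h (a dict)."""
    # h must be a total function from the domain of I into the domain of J.
    if set(h) != I["domain"] or not set(h.values()) <= J["domain"]:
        return False
    # Named individuals are preserved: h(x^I) = x^J.
    for x, elem in I["names"].items():
        if J["names"].get(x) != h[elem]:
            return False
    # Concept membership is preserved.
    for c, members in I["concepts"].items():
        if any(h[i] not in J["concepts"].get(c, set()) for i in members):
            return False
    # Role membership is preserved.
    for r, pairs in I["roles"].items():
        if any((h[i], h[j]) not in J["roles"].get(r, set()) for (i, j) in pairs):
            return False
    return True


small = {"domain": {0}, "names": {"x": 0},
         "concepts": {"Person": {0}}, "roles": {}}
large = {"domain": {"a"}, "names": {"x": "a"},
         "concepts": {"Person": {"a"}, "Agent": {"a"}}, "roles": {}}

forward = is_homomorphism({0: "a"}, small, large)
backward = is_homomorphism({"a": 0}, large, small)
```

In this example `forward` holds, since every fact of `small` is preserved in `large`, while `backward` fails because the `Agent` membership has no counterpart in `small`.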

Definition 1. For any interpretation I over X and ABox F over $Z \supseteq X$, let $I \oplus (S, F)$ be the initial interpretation J over Z such that $I \lesssim J$ and $J \models S, F$.

Note that $I \oplus (S, F)$ does not always exist, as S may contain existentials or disjunctions which do not have canonical witnesses. For example there is no initial extension of $\emptyset$ by:

$$\mathit{Bool} \sqsubseteq \mathit{True} \sqcup \mathit{False}$$
$$\mathit{True} \sqcup \mathit{False} \sqsubseteq \mathit{Bool}$$
$$\mathit{Bool}(x)$$

since there are two incomparable extensions, one with True(x) and one with False(x). However, there is a syntactic restriction which guarantees the existence of initial interpretations. Let S be minimizable whenever any axiom $C \sqsubseteq D$ has C built from atoms, $\bot$, $\top$, $\sqcup$, $\sqcap$ and $\exists$, and D built from atoms, $\top$, $\sqcap$, $\forall$ and ${\leq}1$.

Proposition 1. If S is minimizable, then $I \oplus (S, F)$ exists.

4 Integrity constraints

Having defined initiality, we can now define constraint satisfaction. This is a variant of Motik et al.'s definition: rather than considering all minimal interpretations, we require a canonical initial interpretation to exist, and for it to satisfy the integrity constraints.

Definition 2. For ABoxes A over X, B over Y and F over $(X \uplus V \uplus Y)$, define $F : A \Rightarrow B$ whenever, for any interpretation I over X such that $I \models S, T, A$, we have $I \oplus (S, F) \models T, B$.

For example, in the example from Section 1 we have that in any I which satisfies the ambient TBoxes and PersonalHomePage(alice:), there must be some i such that $(\text{alice:}^{I}, i) \in \mathit{primaryTopic}^{I}$, so we can pick fresh j and k and define J as the smallest extension of I where:

$$\begin{array}{lll}
\text{bob:}^{J} = j & j \in \mathit{PersonalHomePage}^{J} & (j, k) \in \mathit{primaryTopic}^{J} \\
\text{bob:me}^{J} = k & j \in \mathit{Document}^{J} & (k, j) \in \mathit{homepage}^{J} \\
\text{\_:anon}^{J} = i & k \in \mathit{Person}^{J} & (k, i) \in \mathit{knows}^{J}
\end{array}$$

and so we have:

$$\left(\begin{array}{l}
\mathit{PersonalHomePage}(\text{bob:}), \\
\mathit{Person}(\text{bob:me}), \\
\mathit{primaryTopic}(\text{bob:}, \text{bob:me}), \\
\mathit{knows}(\text{bob:me}, \text{\_:anon}), \\
\mathit{homepage}(\text{\_:anon}, \text{alice:})
\end{array}\right) : (\mathit{PersonalHomePage}(\text{alice:})) \Rightarrow (\mathit{PersonalHomePage}(\text{bob:}),\ \mathit{Person}(\text{bob:me}))$$

5 Symmetric monoidal category

Having defined our notion of integrity constraints for Linked Data, we give our main result, which is that ABoxes with integrity constraints form a symmetric monoidal category, and hence (as shown by Joyal and Street [10] and surveyed, for example, by Selinger [16]) can be modeled formally by directed acyclic graphs.

A symmetric monoidal category C consists of:

- A collection Obj(C) of objects, including:
  - a chosen object I, and
  - for each pair of objects A and B, an object $A \otimes B$.
- For each pair of objects A and B, a collection of morphisms C[A, B], including (where we write $f : A \to B$ whenever f is in C[A, B]):
  - for each $f : A \to B$ and $g : B \to C$, a morphism $(f ; g) : A \to C$,
  - for each $f : A \to C$ and $g : B \to D$, a morphism $(f \otimes g) : (A \otimes B) \to (C \otimes D)$, and
  - chosen families of morphisms:

$$\begin{array}{ll}
1_A : A \to A & \\
\alpha_{ABC} : ((A \otimes B) \otimes C) \to (A \otimes (B \otimes C)) & \alpha^{-1}_{ABC} : (A \otimes (B \otimes C)) \to ((A \otimes B) \otimes C) \\
\sigma_{AB} : (A \otimes B) \to (B \otimes A) & \\
\rho_A : (A \otimes I) \to A & \rho^{-1}_A : A \to (A \otimes I)
\end{array}$$

satisfying certain equations (see, for example, Mac Lane [12] for details).

The objects of our symmetric monoidal category ABox will be ABoxes, which we will think of as interfaces.

- Obj(ABox) is the collection of all ABoxes.
- The chosen object I is the empty ABox.
- Given two ABoxes A over X and B over Y, the object $(A \otimes B)$ is the ABox (A, B) over $(X \uplus Y)$.

The morphisms of the category ABox will also be ABoxes, this time thought of as datasets satisfying integrity constraints.

- ABox[A, B] is the collection of all ABoxes F such that $F : A \Rightarrow B$.
- Given two ABoxes F over $(X \uplus V \uplus Y)$ and G over $(Y \uplus W \uplus Z)$, the morphism $(F ; G)$ is the ABox (F, G) over $(X \uplus (V \uplus Y \uplus W) \uplus Z)$.
- Given two ABoxes $F_1$ over $(X_1 \uplus V_1 \uplus Y_1)$ and $F_2$ over $(X_2 \uplus V_2 \uplus Y_2)$, the morphism $(F_1 \otimes F_2)$ is the ABox $(F_1, F_2)$ over $((X_1 \uplus X_2) \uplus (V_1 \uplus V_2) \uplus (Y_1 \uplus Y_2))$.

To verify that this definition is well-formed, we have to verify that checking integrity constraints is compositional, that is we only have to check integrity locally, and know it is preserved by composition and tensor.

Proposition 2.

1. If $F : A \Rightarrow B$ and $G : B \Rightarrow C$, then $(F ; G) : A \Rightarrow C$.
2. If $F_1 : A_1 \Rightarrow B_1$ and $F_2 : A_2 \Rightarrow B_2$, then $(F_1 \otimes F_2) : (A_1 \otimes A_2) \Rightarrow (B_1 \otimes B_2)$.

Note that the composition (F ; G) may introduce bnodes, since the intermediate names which are exported by F and imported by G become bnodes (indeed, this is why bnodes are present in this model). For example:

$$(\mathit{knows}(\text{alice:me}, \text{bob:me})) ; (\mathit{knows}(\text{bob:me}, \text{charlie:me})) = (\mathit{knows}(\text{alice:me}, \text{\_:anon}),\ \mathit{knows}(\text{\_:anon}, \text{charlie:me}))$$
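The bookkeeping behind composition and tensor can be sketched in Python. This is only a sketch of the syntactic construction: it does not check integrity constraints, it assumes that the inodes, bnodes and enodes of each dataset are pairwise disjoint, and the class name `Dataset`, the fresh labels `_:b0`, `_:b1`, ... and the names dave:me and erin:me are hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Dataset:
    inodes: frozenset      # X: imported names
    bnodes: frozenset      # V: blank nodes
    enodes: frozenset      # Y: exported names
    assertions: frozenset  # tuples such as ("knows", "alice:me", "bob:me")


def rename(assertions, mapping):
    """Rename individuals; the first tuple component is the concept or role name."""
    return frozenset(
        (a[0],) + tuple(mapping.get(n, n) for n in a[1:]) for a in assertions
    )


def compose(f, g):
    """Sketch of (F ; G): names exported by F and imported by G become bnodes."""
    if f.enodes != g.inodes:
        raise ValueError("the exports of F must be the imports of G")
    if f.bnodes & g.bnodes:
        raise ValueError("rename the bnodes of F and G apart first")
    taken = f.inodes | f.bnodes | g.bnodes | g.enodes
    fresh = (f"_:b{i}" for i in count())
    hide = {}
    for name in sorted(f.enodes):
        # Pick the next fresh blank label that is not already in use.
        hide[name] = next(label for label in fresh if label not in taken)
    return Dataset(
        inodes=f.inodes,
        bnodes=f.bnodes | frozenset(hide.values()) | g.bnodes,
        enodes=g.enodes,
        assertions=rename(f.assertions, hide) | rename(g.assertions, hide),
    )


def tensor(f1, f2):
    """Sketch of the tensor: side-by-side union, assuming all names are disjoint."""
    return Dataset(f1.inodes | f2.inodes, f1.bnodes | f2.bnodes,
                   f1.enodes | f2.enodes, f1.assertions | f2.assertions)


F = Dataset(frozenset({"alice:me"}), frozenset(), frozenset({"bob:me"}),
            frozenset({("knows", "alice:me", "bob:me")}))
G = Dataset(frozenset({"bob:me"}), frozenset(), frozenset({"charlie:me"}),
            frozenset({("knows", "bob:me", "charlie:me")}))
H = Dataset(frozenset({"dave:me"}), frozenset(), frozenset({"erin:me"}),
            frozenset({("knows", "dave:me", "erin:me")}))
FG = compose(F, G)
FH = tensor(F, H)
```

In `FG`, the intermediate name bob:me has been replaced by the fresh blank label `_:b0`, which plays the role of _:anon in the example above.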

As well as composition of ABoxes, we have to provide the "wiring" combinators for identity, symmetry, unit and associativity. These are all constructed in the same way: given any function $f : Y \to X$ on finite sets, we define the ABox wiring(f) over $(X \uplus Y)$ as containing $f(y) \approx y$ for each $y \in Y$. We can then show that wiring(f) respects renaming of ABoxes. Given any ABox A over Y, let f[A] be the ABox over X given by replacing any individual y in A by f(y).

Proposition 3. If $f : Y \to X$ and $B \models f[A]$, then $\mathrm{wiring}(f) : A \Rightarrow B$.

This suffices to define the combinators of a symmetric monoidal category, for example $1_A : A \Rightarrow A$ is given by wiring the identity function.
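The wiring construction and the renaming f[A] are both simple to write down. The Python sketch below assumes a finite function is given as a dictionary from each y to f(y), and that an equality assertion is represented by a tuple tagged with the string "≈"; both representations, and the function names, are assumptions of this sketch.

```python
def wiring(f):
    """ABox wiring(f): one equality assertion between f(y) and y for each y."""
    return frozenset(("≈", x, y) for y, x in f.items())


def rename_abox(f, abox):
    """f[A]: replace every individual y occurring in A by f(y)."""
    return frozenset((a[0],) + tuple(f[n] for n in a[1:]) for a in abox)


# Identity wiring on a single name, with the exported copy written alice:me'.
identity = wiring({"alice:me'": "alice:me"})

# Renaming may identify individuals when f is not injective.
collapsed = rename_abox({"y1": "x", "y2": "x"}, {("knows", "y1", "y2")})
```

The identity wiring consists of the single assertion that alice:me and its exported copy are equal, which is the first ABox in the counter-example to 1 ; F = F discussed in the text.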

Finally, we have to prove the equations of a symmetric monoidal category. These equations are not true up to syntactic equality of ABoxes, due to introduction of bnodes, for example a counter-example to 1 ; F = F is:

$$(\text{alice:me} \approx \text{alice:me}') ; (\mathit{knows}(\text{alice:me}', \text{bob:me})) = (\text{alice:me} \approx \text{\_:anon},\ \mathit{knows}(\text{\_:anon}, \text{bob:me})) \neq (\mathit{knows}(\text{alice:me}, \text{bob:me}))$$

The equations are true when we consider ABoxes up to equivalence (in the presence of S, T and A), that is:

$$F \equiv G : A \Rightarrow B \quad\text{whenever}\quad S, T, A, F \models G \ \text{ and } \ S, T, A, G \models F$$

We therefore consider the morphisms of ABox up to equivalence, which requires us to show that composition and tensor respect equivalence:

Proposition 4.

1. If $F \equiv F' : A \Rightarrow B$ and $G \equiv G' : B \Rightarrow C$ then $(F ; G) \equiv (F' ; G') : A \Rightarrow C$.
2. If $F_1 \equiv F_1' : A_1 \Rightarrow B_1$ and $F_2 \equiv F_2' : A_2 \Rightarrow B_2$ then $(F_1 \otimes F_2) \equiv (F_1' \otimes F_2') : (A_1 \otimes A_2) \Rightarrow (B_1 \otimes B_2)$.

The proofs that ABoxes satisfy the equations of a symmetric monoidal category are then direct. The coherence properties (which only involve compositions of wiring morphisms) follow because wiring respects composition and tensor:

$$\mathrm{wiring}(f) ; \mathrm{wiring}(g) \equiv \mathrm{wiring}(f \circ g) \qquad \mathrm{wiring}(f) \otimes \mathrm{wiring}(g) \equiv \mathrm{wiring}(f \uplus g)$$

Theorem 1. ABox forms a symmetric monoidal category.

The proof of this theorem, including the definitions it relies on, is approximately 3,000 lines of Agda code [9]. An example lemma is shown in Figure 3.

6 Conclusions and further work

We have presented the first treatment of integrity constraints for Linked Data which makes use of a partition between local entities, for which a dataset is authoritative, and imported entities, where complete information is not known. We have given the first categorical presentation of datasets, and as a consequence, we have the first formal treatment of acyclic dependency graphs.

There are open questions raised by this model, of which the most important is its algorithmic properties: is integrity constraint satisfaction decidable, and if so, what is its complexity, and can it be reduced to existing decision problems?

Our model only treats acyclic dependency graphs, via symmetric monoidal categories. A categorical treatment of cyclic graphs uses traced monoidal categories (introduced by Joyal, Street and Verity [11], and discussed by Selinger [16]). Cyclic graphs require the existence of fixed points which unfortunately do not respect integrity constraint satisfaction, for example the fixed point of the identity morphism is equivalent to an empty dataset, which will not satisfy existential or disjunctive integrity constraints. The situation is similar to that of complete metric spaces: not all functions have fixed points, but contraction maps do.

Our model assumes the existence of ambient TBoxes S and T, which must be agreed upon by all datasets. This requirement is quite strong, and the model would be improved by allowing authoritative ontologies as well as datasets. This is related to the notion of modularity of ontologies [7].

The mechanized proofs of our model [9] are given in Agda [1], which as well as a proof assistant is a programming language which compiles to Haskell [3]. We hope to extend our proofs to a Semantic Web library, which will support the development of provably correct programs to process Linked Data.

References

1. The Agda programming language. http://wiki.portal.chalmers.se/agda/

2. The friend of a friend (FOAF) project. http://www.foaf-project.org/

3. The Haskell programming language. http://haskell.org/

4. Beckett, D., Berners-Lee, T.: Turtle - terse RDF triple language (2008), http://www.w3.org/TeamSubmission/turtle/

5. Berners-Lee, T.: Linked data (2006), http://www.w3.org/DesignIssues/LinkedData.html

6. Cyganiak, R., Jentzsch, A.: Linking open data cloud diagram, http://lod-cloud.net/

7. Grau, B.C., Horrocks, I., Kazakov, Y., Sattler, U.: Modular reuse of ontologies: Theory and practice. J. Artificial Intelligence Research 31, 273–318 (2008)

8. Jacobs, I., Walsh, N.: Architecture of the World Wide Web, volume one. W3C Recommendation (2004), http://www.w3.org/TR/webarch/

9. Jeffrey, A.S.A.: Agda libraries for the semantic web. https://github.com/agda/agda-web-semantic/ (2011)

10. Joyal, A., Street, R.: The geometry of tensor calculus I. Advances in Mathematics 88(1), 55–112 (1991)

11. Joyal, A., Street, R., Verity, D.: Traced monoidal categories. Math. Proc. Cambridge Phil. Soc. 119(3), 447–468 (1996)

12. Mac Lane, S.: Categories for the Working Mathematician. Springer, 2nd edn. (1998)

13. Motik, B., Horrocks, I., Sattler, U.: Bridging the gap between OWL and relational databases. J. Web Semantics 7(2), 74–89 (2009)

14. Motik, B., Rosati, R.: Reconciling Description Logics and Rules. J. ACM 57(5), 1–62 (2010)

15. Reiter, R.: What should a database know? J. Log. Program. 14, 127–153 (1992)

16. Selinger, P.: A survey of graphical languages for monoidal categories. In: Coecke, B. (ed.) New Structures for Physics, Lecture Notes in Physics, vol. 813, chap. 4, pp. 289–356. Springer (2011)

17. Tao, J., Sirin, E., Bao, J., McGuinness, D.L.: Extending OWL with integrity constraints. In: Proc. Workshop on Description Logics. pp. 137–148 (2010)