Towards a Robust Semantics for SHACL: Preliminary Discussion Julien Corman1 , Juan L. Reutter2 , and Ognjen Savković1 1 Free University of Bozen-Bolzano, Bolzano, Italy 2 PUC Chile and Center for Semantic Web Research, Santiago, Chile Abstract. Validating RDF graphs against constraints has gained in- terest in recent years, due to the popularity of RDF and the growth of knowledge bases. SHACL, a constraint language for RDF, has re- cently become a W3C recommendation, with a specification detailing syntax, semantics and common use cases. Unfortunately, this (otherwise complete) specification does not cover validation against recursive con- straints. This omission is important, because SHACL by design favors constraint references. We investigate the possibility of a formal semantics for SHACL which covers the recursive case, while being compliant with the current standard. 1 Introduction Recent years have seen an increased interest in standardizing a declarative con- straint language for RDF, and devising mechanisms to detect violations of such constraints. One of the most promising candidate languages is SHACL, or Shapes Constraint Language,3 which has become a W3C recommendation in 2017. SHACL constraints are grouped in so-called “shapes” to be verified by certain nodes of the graph under validation, and such that a shape may reference an- other. Figure 1 presents two SHACL shapes. The left one, named :AddressShape, is meant to define valid addresses. A node v satisfying this shape must satisfy two (nested) constraints: the first one states that there must be at least one suc- cessor (sh:minCount 1) of v via property :telephone, and the second one states that there must be at most one successor of v via property :postalCode. Validating an RDF graph against a set of shapes is based on the notion of “target nodes”, which mandates for each shape which nodes have to conform to it. In Figure 1, the right shape contains the triple :PersonShape sh:targetClass :Person, stating that its target nodes are all nodes of type :Person. Nodes may have to conform to additional shapes, due to shape references. For example, the right shape of Figure 1 contains one (non-recursive) shape reference, stat- ing that every node v conforming to :PersonShape has at most one :address, which must conform to :AddressShape, and one recursive reference, stating that each foaf:friend of v must conform to :PersonShape. By recursion, we refer to (possibly n-ary) shape reference cycles. Unfortunately, the validation of a graph 3 https://www.w3.org/TR/shacl : PersonShape : AddressShape a sh : NodeShape ; a sh : NodeShape ; sh : t a r g e t C l a s s : P e r s o n ; sh : p r o p e r t y [ sh : p r o p e r t y [ sh : path : t e l e p h o n e ; sh : path : a d d r e s s ; sh : minCount 1 sh : maxCount 1 ; ] ; sh : node : AddressShape sh : p r o p e r t y [ ] ; sh : path : p o s t a l C o d e ; sh : p r o p e r t y [ sh : maxCount 1 sh : path f o a f : f r i e n d ; ] . sh : node : PersonShape ] . Fig. 1. Two SHACL shapes, about persons and addresses against recursive shapes is left explicitly undefined in the SHACL specification. This is an important limitation for RDF graphs which are not tree-shaped. As an illustration, in Example 1, the RDF graph G1 is valid w.r.t. the shapes of Figure 1, whereas G2 is invalid, and the validity of G3 is left unspecified. Example 1. G1 = {:p1 rdf:type :Person ; :address :a2 . :a2 :telephone :t3 .} G2 = G1 ∪ {:a2 :postalcode :c4 . :a2 :postalcode :c5 .} G3 = G1 ∪ {:p1 foaf:friend :p6 . :p6 foaf:friend :p1 .} In this work, we set up the task of defining a robust semantics for graph validation in SHACL (including circular references), and studying its computa- tional properties. A report is available online, which reflects the current state of our investigations: https://www.inf.unibz.it/krdb/KRDB%20files/tech-reports/ KRDB18-01.pdf. Related Work. Boneva et al. [2, 7, 1, 3] have embarked on a study of a similar constraint language for RDF, called Shape Expression Language, or ShEx,4 providing in [3] an abstract syntax and semantics. A different abstract language in provided in [1], which encodes a common fragment of ShEx and SHACL. In both cases, validating a graph against a recursive set of shapes is based on the existence of a typing (of nodes with shape names) verifying constraints and targets, which is also the approach we follow (with the notion of shape assignment described below). Perhaps the most important difference with our investigations is that these two studies are restricted to constraints with stratified negation (in the datalog sense), whereas we consider constraints with arbitrary negation. Another key difference (in the stratified case) is that the semantics given in [3] relies on a specific maximal typing (defined inductively on the constraint’s strata), which is not the only possible one. We hope that both works may complement each other, towards an optimal constraint language for RDF. Another related line of work, advocated in [5] as a semantic grounding for SHACL, is inspired by Description Logics under closed-world assumption [8], effectively reduc- ing graph validation to FO satisfiability with closed (binary) predicates [6]. But as illustrated with Example 2, (2-valued) FO satisfiability may not be well-suited for validating target nodes against non-stratified constraints. 4 http://shex.io/shex-semantics 2 Validating a graph against recursive constraints In order to devise a formal semantics, we abstract away from the concrete syntax of SHACL. For shape constraints, we chose a concise notation inspired by Description Logics, expressive enough to encode the SHACL “Core Constraint Components” of the specification. For instance, the constraints in Figure 1 would be translated into two constraint definitions, one for each shape: . AddressShape = (≥1 telephone.>) ∧ (≤1 postalCode.>) . PersonShape = (≤1 address.>) ∧ (≤0 address.¬AddressShape) ∧ (≤0 friend.¬PersonShape) The semantics given in the SHACL specification needs to be extended in order to handle recursive shapes. To this end, we follow the same path as in [3], and consider the evaluation of a constraint formula given a shape assignment, which may be intuitively viewed as labeling nodes of the graph with (sets of) shape names. Then a graph is valid iff there exists an assignment verifying targets and constraints. For instance, one may validate graph G3 in Example 1, by assigning :PersonShape to :p1 and :p6, and :AddressShape to :a2. How to handle non-stratified constraints? For an input graph G and set S of shapes, let atoms(G, S) be the set of all atoms of the form s(v) built from shape names in S and nodes in G. Then a shape assignment may be viewed as a function σ from atoms(G, S) to truth values. But should σ be required to be a total function to {true, false}? Such semantics may not behave well in the presence of non-stratified negation, as illustrated in Example 2. The only target node is v1 , for shape s1 . v1 has an r1 -successor (namely v2 ) in G, therefore it verifies the left disjunct ≥1 r1 .> of the constraint for s1 . So intuitively, one would want to validate this unique target, and therefore the whole graph. But s2 (v1 ) cannot be assigned a truth value which complies with the constraint for s2 . If σ(s2 (v1 )) = true, then v1 does not verify ≥1 r2 .¬s2 given σ. Conversely, if σ(s2 (v1 )) = false, then v1 does not violate ≥1 r2 .¬s2 given σ. Example 2. G = {(v1 , r1 , v2 ), (v1 , r2 , v1 )} . s1 = (≥1 r1 .>) ∨ (≥1 r2 .s2 ) . s2 = ≥1 r2 .¬s2 target: v1 for s1 A solution is to allow σ to be a partial function, or equivalently a total function from atoms(G, S) to {true, unknown, false}, in order to define constraint evaluation given σ. This is a key difference from the approach followed in [3]. For instance, in Example 2, the assignment σ = {s1 (v1 ) 7→ true, s2 (v1 ) 7→ unknown, s1 (v2 ) 7→ false, s2 (v2 ) 7→ false} could be used to validate the graph. Such 3-valued semantics should still comply with the SHACL specification in the non-recursive case though, which may not be trivial for constraints with counting. How to define partial validation? The SHACL specification describes a graph val- idation mechanism (for non-recursive shapes) which starts from a target atom, and propagates the constraints required by this target only. One way to generalize this approach to recursive shapes is to allow assigning unknown to an atom, even though the corresponding constraint evaluates to true or false. For instance, in Example 2, assigning a truth value to s1 (v2 ) and s2 (v2 ) seems unnecessary for the (unique) tar- get under validation. So could one also validate the graph with another assignment σ 0 = {s1 (v1 ) 7→ true, s2 (v1 ) 7→ unknown, s1 (v2 ) 7→ unknown, s2 (v2 ) 7→ unknown}, even though s1 (v2 ) and s2 (v2 ) evaluate to false given σ 0 ? And in which cases (if any) do these two semantics differ? Is validation scalable? Our investigations so far indicate that graph validation un- der such requirements is intractable in data complexity (i.e. when the constraints are assumed to be fixed), even for constraints with stratified negation, as opposed to seem- ingly comparable frameworks, such as stratified datalog [4]. To overcome this limita- tion, we are currently investigating two options: tractable fragments of the constraint language on the one hand, and approximation algorithms on the other hand. Acknowledgements. This work was supported by the QUEST, ROBAST and OBATS projects at the Free University of Bozen-Bolzano, and the Millenium Institute for Foun- dational Research on Data, Chile. References [1] I. Boneva. Comparative expressiveness of ShEx and SHACL (Early working draft), 2016. [2] I. Boneva, J. E. L. Gayo, S. Hym, E. G. Prud’hommeau, H. R. Solbrig, and S. Sta- worko. Validating RDF with shape expressions. CoRR, abs/1404.1270, 2014. [3] I. Boneva, J. E. L. Gayo, and E. G. Prud’hommeaux. Semantics and Validation of Shapes Schemas for RDF. In ISWC , 2017. [4] E. Dantsin, T. Eiter, G. Gottlob, and A. Voronkov. Complexity and expressive power of logic programming. ACM Comput. Surv., 33(3):374–425, 2001. [5] P. F. Patel-Schneider. Using Description Logics for RDF Constraint Checking and Closed-World Recognition. In AAAI , 2015. [6] P. F. Patel-Schneider and E. Franconi. Ontology constraints in incomplete and complete data. In ISWC , 2012. [7] S. Staworko, I. Boneva, J. E. Labra Gayo, S. Hym, E. G. Prud’hommeaux, and H. Solbrig. Complexity and Expressiveness of ShEx for RDF. In ICDT , 2015. [8] J. Tao, E. Sirin, J. Bao, and D. L. McGuinness. Integrity Constraints in OWL. In AAAI , 2010.