=Paper=
{{Paper
|id=Vol-3739/abstract-23
|storemode=property
|title=Inferring SHACL Constraints for Results of Composable Graph Queries (Extended Abstract)
|pdfUrl=https://ceur-ws.org/Vol-3739/abstract-23.pdf
|volume=Vol-3739
|authors=Philipp Seifer,Daniel Hernández,Ralf Lämmel,Steffen Staab
|dblpUrl=https://dblp.org/rec/conf/dlog/Seifer0LS24
}}
==Inferring SHACL Constraints for Results of Composable Graph Queries (Extended Abstract)==
<pdf width="1500px">https://ceur-ws.org/Vol-3739/abstract-23.pdf</pdf>
<pre>
                         Inferring SHACL Constraints for Results of Composable
                         Graph Queries (Extended Abstract)
                         Philipp Seifer¹, Daniel Hernández², Ralf Lämmel¹ and Steffen Staab²,³

                         ¹ University of Koblenz, Koblenz, Germany
                         ² University of Stuttgart, Stuttgart, Germany
                         ³ University of Southampton, Southampton, UK


                                     Abstract
                                     SPARQL CONSTRUCT queries allow for the specification of data processing pipelines that transform given input
                                     graphs into new output graphs. Input graphs are now commonly constrained through SHACL shapes allowing
                                     for both their validation and aiding users (as well as tools) in understanding their structure. However, it becomes
                                     challenging to understand what graph data can be expected at the end of a data processing pipeline without
                                     knowing the particular input data: Shape constraints on the input graph may affect the output graph, but may no
                                     longer apply literally, and new shapes may be imposed by the query itself.
                                         In our recent work, From Shapes to Shapes: Inferring SHACL Shapes for Results of SPARQL CONSTRUCT
                                     Queries, we studied the derivation of shape constraints that hold on all possible output graphs of a given SPARQL
                                     CONSTRUCT query by axiomatizing the query and the shapes with the 𝒜ℒ𝒞ℋ𝒪ℐ description logic. This extended
                                     abstract summarizes our previous work.

                                     Keywords
                                     Semantic Queries, Data Pipelines, SHACL, SPARQL CONSTRUCT


                         1. Introduction
                         Some graph query languages are composable (i. e., they construct new graphs as results) and thereby
                         allow for the fruitful composition of queries into data processing pipelines. Examples of such languages
                         are SPARQL (in particular, its CONSTRUCT [1] queries) for RDF graphs and, more recently, G-CORE [2]
                         for property graphs.
                             When graphs existing in the context of such composable query languages are validated using
                         constraints specified in a shape description language such as SHACL [3] or ProGS [4], an interesting
                         problem arises: Even though the input to a query is well-defined with SHACL or ProGS shapes, it
                         becomes unclear which shapes apply after executing the query. Constraints that applied to the input
                         graph may be invalidated by the query, e. g., because some required edges were removed, and new
                         shapes may arise from the query's construction template as well. Indeed, both downstream applications
                         (e. g., programming languages [5]) and software developers working with graph data must understand
                         what a query may output.

                         Problem Description In our recent paper [6], we define the problem of computing a set of SHACL
                         shapes that characterises the possible output graphs of a SPARQL CONSTRUCT query. We represent
                         queries as rules 𝐻 ← 𝑃 where the template 𝐻 and pattern 𝑃 are sets of assertions with variables.
                         Bindings for variables in 𝑃 are used to construct a new graph according to 𝐻. We express SHACL
                         shapes as 𝒜ℒ𝒞ℋ𝒪ℐ axioms (cf. [7]).
                           Formally, we define the problem OutputShapes with signature (𝒮in, 𝑞) ↦ 𝒮out-max, where 𝑞 is a
                         query and 𝒮in and 𝒮out-max are two sets of shapes, called the input and output shapes. The set 𝒮out-max
                         is the maximum set of shapes such that for every graph 𝐺in that satisfies the input shapes – we write
                         valid(𝐺in, 𝒮in) – the graph resulting from evaluating 𝑞 on 𝐺in, denoted ⟦𝑞⟧𝐺in, satisfies the output
                         shapes, i. e., valid(⟦𝑞⟧𝐺in, 𝒮out-max).


                         DL 2024: 37th International Workshop on Description Logics, June 18–21, 2024, Bergen, Norway
                         Email: pseifer@uni-koblenz.de (P. Seifer); daniel.hernandez@ki.uni-stuttgart.de (D. Hernández);
                         laemmel@uni-koblenz.de (R. Lämmel); steffen.staab@ki.uni-stuttgart.de (S. Staab)
                         ORCID: 0000-0002-7421-2060 (P. Seifer); 0000-0002-7896-0875 (D. Hernández);
                         0000-0001-9946-4363 (R. Lämmel); 0000-0002-0780-4154 (S. Staab)
                         © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                         CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073


[Figure 1: four graph diagrams, panels (a) 𝐺1, (b) 𝐺2, (c) ⟦𝑞1⟧𝐺1, (d) ⟦𝑞1⟧𝐺2]

Figure 1: (a,b) Input graphs for 𝑞1 valid with respect to 𝑆1. (c,d) Results of 𝑞1 on 𝐺1 and 𝐺2.


Running Example Consider a problem OutputShapes with a query 𝑞1 and a set of shapes 𝑆1 as
input. Let 𝑞1 = {y:𝐸, z:𝐵, (y, z):𝑝} ← {(w, y):𝑝, y:𝐵, (x, z):𝑝, z:𝐸}. Given a graph 𝐺in , query 𝑞1 finds
bindings for variable y that are 𝐵, and bindings for variable z that are 𝐸, both with incoming edges
with role name 𝑝 in 𝐺in. Then, the query constructs the output graph, 𝐺out = ⟦𝑞1⟧𝐺in, consisting only
of these bindings for z and y, with inverted concept annotations and 𝑝-edges between them. Note that
we color-code the namespace of the input versus output graphs, since the extensions of the involved
concept and role names are not the same.
   Let 𝑆1 = {𝑠1 , 𝑠2 , 𝑠3 } where 𝑠1 = 𝐴 ⊑ ∃𝑝.𝐵, 𝑠2 = ∃𝑟.⊤ ⊑ 𝐵 and 𝑠3 = 𝐵 ⊑ 𝐸. The graphs 𝐺1 (a)
and 𝐺2 (b) in Figure 1 are valid with respect to 𝑆1. Graphs ⟦𝑞1⟧𝐺1 (c) and ⟦𝑞1⟧𝐺2 (d) in Figure 1 are the
output graphs for 𝑞1 over 𝐺1 and 𝐺2, respectively.
   Some expected output shapes are 𝐸 ⊑ ∃𝑝.𝐵 and 𝐸 ⊑ 𝐵. The first shape follows directly from the
query template. Each node labelled 𝐸 has an outgoing edge to a node labelled 𝐵. The second shape
𝐸 ⊑ 𝐵 is valid because 𝐵 ⊑ 𝐸 holds on all input graphs: As a result, we can infer that all bindings for
y are also bindings for z, such that 𝐸 ⊑ 𝐵 follows from the query template.
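
   To make the running example concrete, the following is a minimal Python sketch (not the
implementation from [22]) that evaluates 𝑞1 by brute force and checks the two expected output shapes
on the result. The tuple encoding of assertions, the "?" prefix for variables, all function names,
and the small input graph are assumptions made for illustration; the graph is chosen to satisfy 𝑆1
and is not copied from Figure 1.

```python
from itertools import product

# Assertions are tuples: ("b", "B") encodes b:B, ("a", "p", "b") encodes (a, b):p.
# Hypothetical input graph chosen for illustration; it satisfies s1, s2 and s3.
G_IN = {("a", "A"), ("b", "B"), ("b", "E"), ("a", "p", "b")}

# Query q1 = H <- P; terms starting with "?" are variables (an encoding assumption).
PATTERN = [("?w", "p", "?y"), ("?y", "B"), ("?x", "p", "?z"), ("?z", "E")]
TEMPLATE = [("?y", "E"), ("?z", "B"), ("?y", "p", "?z")]


def nodes(graph):
    """All individual names occurring in a graph."""
    return sorted({a[0] for a in graph} | {a[2] for a in graph if len(a) == 3})


def variables(atoms):
    """All variables occurring in a list of atoms."""
    return sorted({t for atom in atoms for t in atom if t.startswith("?")})


def ground(atom, mu):
    """Apply a valuation mu to one atom; non-variable entries stay unchanged."""
    return tuple(mu.get(t, t) for t in atom)


def evaluate(template, pattern, graph):
    """Brute-force evaluation: instantiate the template for every matching valuation."""
    vs = variables(pattern)
    out = set()
    for values in product(nodes(graph), repeat=len(vs)):
        mu = dict(zip(vs, values))
        if all(ground(a, mu) in graph for a in pattern):
            out |= {ground(a, mu) for a in template}
    return out


def holds_e_sub_b(graph):
    """Check E ⊑ B: every node labelled E is also labelled B."""
    return all((n, "B") in graph for n in nodes(graph) if (n, "E") in graph)


def holds_e_sub_exists_p_b(graph):
    """Check E ⊑ ∃p.B: every node labelled E has a p-edge to a node labelled B."""
    return all(
        any((n, "p", m) in graph and (m, "B") in graph for m in nodes(graph))
        for n in nodes(graph)
        if (n, "E") in graph
    )


G_OUT = evaluate(TEMPLATE, PATTERN, G_IN)
print(sorted(G_OUT))
print(holds_e_sub_b(G_OUT), holds_e_sub_exists_p_b(G_OUT))
```

Checking shapes on a single output graph only illustrates the claim; the OutputShapes problem asks
for shapes that hold on the outputs of all valid input graphs, which is why the approach reasons over
axioms rather than over instance data.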

Proposed Algorithm In this paper, we summarize the algorithm we presented in [6] (an extended
version with proofs is available in [8]), which constructs a sound (but not complete) approximation
𝒮out ⊆ 𝒮out-max .
   To this end, we split the problem into two subproblems: First, the problem of deciding whether
any given shape must be included in the output, and secondly the problem of generating a set of
candidate shapes. The second problem turns out to be rather trivial, as the set of candidates is finite:
We consider a subset of SHACL that is (syntactically) finite for a finite vocabulary, and show that all
relevant candidates are indeed from the finite vocabulary of the query. In our extended version, we
show that this also holds for a much larger subset of SHACL.
   Thus, we focus on the first problem in Section 2 and Section 3.


2. Axiomatizations Over Query Executions
As hinted earlier, different occurrences of the same concept name do not have the same extensions: A
query matches on input graphs, determines valuations (as subsets of the input), and constructs new
graphs. We distinguish between the inputs, intermediate bindings, as well as the constructed output by
rewriting input symbols 𝐴, 𝑝 into fresh symbols 𝐴̇, 𝑝̇ after the first step, and into 𝐴̈, 𝑝̈ after the second
step. These rewritten symbols allow us to encode assertions that are valid for only specific states of
query execution. Variable bindings, on the other hand, hold throughout: We codify a variable binding
𝜇(𝑥) = 𝑎 as a concept assertion 𝑎:𝑉𝑥, where 𝑉𝑥 is a fresh concept name.
   In order to axiomatize how (all possible) input graphs are connected to (all possible) output graphs,
we define a (virtual) extended graph 𝐺ext that unifies the different steps and therefore allows us to
reason about them: 𝐺ext := 𝐺in ∪ 𝐺̇med ∪ 𝐺V ∪ 𝐺̈out, where 𝐺med is ⋃ {𝜇(𝑃) | 𝜇(𝑃) ⊆ 𝐺in} (i.e., the union
of all graphs 𝜇(𝑃) resulting from replacing every variable 𝑥 in 𝑃 with 𝜇(𝑥)), 𝐺out is ⟦𝑞⟧𝐺in, and 𝐺V is
the graph containing an assertion 𝑎:𝑉𝑥 if and only if there exists a valuation 𝜇 such that 𝜇(𝑃) ⊆ 𝐺in
and 𝜇(𝑥) = 𝑎. Figure 2 shows the extended graph and its components for the running example query
𝑞1 and the graph 𝐺1 in Figure 1.

[Figure 2: graph diagrams; 𝐺ext on the left, panels (a) 𝐺in, (b) 𝐺̇med, (c) 𝐺V, (d) 𝐺̈out]

Figure 2: On the left the graph 𝐺ext as the union of 𝐺in (a), 𝐺̇med (b), 𝐺V (c), and 𝐺̈out (d).
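
   The construction of the extended graph can be sketched in a few lines of Python. This is an
illustrative sketch under stated assumptions, not the authors' implementation: assertions are encoded
as tuples, the dotted and double-dotted names are written with the hypothetical suffixes "_dot" and
"_ddot", variable concepts are written "V_w", "V_x", and so on, and the input graph is a small
example chosen by hand.

```python
from itertools import product

# Hypothetical input graph: ("b", "B") encodes b:B, ("a", "p", "b") encodes (a, b):p.
G_IN = {("a", "A"), ("b", "B"), ("b", "E"), ("a", "p", "b")}
NODES = ["a", "b"]

# Query q1 = H <- P; terms starting with "?" are variables (an encoding assumption).
PATTERN = [("?w", "p", "?y"), ("?y", "B"), ("?x", "p", "?z"), ("?z", "E")]
TEMPLATE = [("?y", "E"), ("?z", "B"), ("?y", "p", "?z")]
VARS = ["?w", "?x", "?y", "?z"]


def ground(atom, mu):
    """Apply a valuation mu to one atom; non-variable entries stay unchanged."""
    return tuple(mu.get(t, t) for t in atom)


def rename(atom, suffix):
    """Append a suffix to the concept or role name (position 1) of an assertion."""
    return atom[:1] + (atom[1] + suffix,) + atom[2:]


# All valuations mu with mu(P) contained in G_in, found by brute force.
MATCHES = []
for values in product(NODES, repeat=len(VARS)):
    mu = dict(zip(VARS, values))
    if all(ground(a, mu) in G_IN for a in PATTERN):
        MATCHES.append(mu)

# G_med with dotted names: the union of all matched pattern instances.
G_MED = {rename(ground(a, mu), "_dot") for mu in MATCHES for a in PATTERN}
# G_V: one assertion a:V_x for every binding mu(x) = a.
G_V = {(mu[v], "V_" + v[1:]) for mu in MATCHES for v in VARS}
# G_out with double-dotted names: the instantiated template.
G_OUT = {rename(ground(a, mu), "_ddot") for mu in MATCHES for a in TEMPLATE}

G_EXT = G_IN | G_MED | G_V | G_OUT
print(sorted(G_EXT))
```

Because the three derived components use names that are disjoint from the input names, the union
keeps the states of query execution apart while still sharing the individuals a and b.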

Proposition 1. Given a graph 𝐺in and a query 𝑞, let the graphs 𝐺ext, 𝐺med, and 𝐺out be the extended
graph and its components. For every axiom 𝜙 that does not include names with dots (e. g., names 𝐴̇, 𝐴̈, 𝑝̇,
or 𝑝̈), the following equivalences hold: valid(𝐺in, {𝜙}) if and only if valid(𝐺ext, {𝜙}), valid(𝐺med, {𝜙})
if and only if valid(𝐺ext, {𝜙̇}), and valid(𝐺out, {𝜙}) if and only if valid(𝐺ext, {𝜙̈}).

  Given the notion of extended graphs, we can prove Proposition 1 (see [8] for the proof), which is
essential to our method. Utilizing this proposition, we can show as a corollary that given a set of axioms
Σ such that valid(𝐺ext, Σ) for all extended graphs of a query 𝑞, if Σ ⊨ 𝑠̈ then valid(𝐺out, {𝑠}) for
every output graph 𝐺out of 𝑞. Thus, what remains is to show how to construct such a set of axioms Σ.


3. Axioms Valid on Extended Graphs
Let us now consider what axioms can be inferred, by inspecting the running example. First, we can
notice that the input shapes 𝒮in are valid on all input graphs, by definition. Thus, we include them in
our knowledge base Σ.
   Some axioms of the validation knowledge base (unique name, closed world and domain closure
assumptions) can be approximated by investigating the query 𝑞. The unique name assumption is limited
to individual names that occur in the query; in the running example there are none. Since a query does
not determine the set of individual names, no axioms related to the domain closure assumption can be
inferred. On the other hand, a query does restrict concept names that appear in the query with respect
to the closed world assumption (CWA).
   For the running example, we can thus infer {𝐵̇ ≡ 𝐵 ⊓ 𝑉𝑦, 𝐸̇ ≡ 𝐸 ⊓ 𝑉𝑧}, because, e. g., concept 𝐵̇ in
the extended graph is defined by filtering 𝐵 with variable concept 𝑉𝑦, based on the query pattern y:𝐵 in 𝑞1.
We can also infer axioms {𝑉𝑤 ≡ ∃𝑝.𝑉𝑦, 𝑉𝑥 ≡ ∃𝑝.𝑉𝑧, 𝑉𝑦 ≡ ∃𝑝⁻.𝑉𝑤 ⊓ 𝐵, 𝑉𝑧 ≡ ∃𝑝⁻.𝑉𝑥 ⊓ 𝐸} since variable
concepts are defined by the constraints on the variable in the query pattern. For example, 𝑉𝑦 is constrained
by patterns (w, y):𝑝 and y:𝐵 in 𝑞1, and thus bound by ∃𝑝⁻.𝑉𝑤 ⊓ 𝐵. This is a crucial step, since concept and
role names in the extended graph are defined in terms of these variable concepts: {𝐵̈ ≡ 𝑉𝑧, 𝐸̈ ≡ 𝑉𝑦}
since, e. g., concept 𝐵̈ in the extended graph is defined by 𝑉𝑧, as it only occurs in the single construct
pattern z:𝐵. With similar reasoning, {∃𝑝̇.𝑉𝑦 ≡ 𝑉𝑤, ∃𝑝̇.𝑉𝑧 ≡ 𝑉𝑥, ∃𝑝̇.⊤ ≡ (𝑉𝑤 ⊓ ∃𝑝̇.𝑉𝑦) ⊔ (𝑉𝑥 ⊓ ∃𝑝̇.𝑉𝑧)}
as well as {∃𝑝̈.𝑉𝑧 ≡ 𝑉𝑦, ∃𝑝̈.⊤ ≡ 𝑉𝑦 ⊓ ∃𝑝̈.𝑉𝑧} (and their inverse counterparts) can be constructed with
respect to role names.
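
   How the variable-concept definitions of the running example can be read off the query is
sketched below in Python. This only mirrors the example and is not the general encoding from [6]:
it assumes that every term in the query is a variable, renders an incoming edge with an inverse
role (written p⁻), and uses the hypothetical suffix "_ddot" for double-dotted output names. All
function names are illustrative.

```python
# Query q1 = H <- P; terms starting with "?" are variables (an encoding assumption).
PATTERN = [("?w", "p", "?y"), ("?y", "B"), ("?x", "p", "?z"), ("?z", "E")]
TEMPLATE = [("?y", "E"), ("?z", "B"), ("?y", "p", "?z")]


def variable_definitions(pattern):
    """Map each variable concept V_x to the conjunction the pattern imposes on x."""
    conjuncts = {}
    for atom in pattern:
        if len(atom) == 2:
            # Concept atom x:C contributes the conjunct C to V_x.
            var, concept = atom
            conjuncts.setdefault(var, []).append(concept)
        else:
            # Role atom (s, o):p contributes an outgoing edge to V_s
            # and an incoming (inverse) edge to V_o.
            s, role, o = atom
            conjuncts.setdefault(s, []).append(f"∃{role}.V_{o[1:]}")
            conjuncts.setdefault(o, []).append(f"∃{role}⁻.V_{s[1:]}")
    return {f"V_{v[1:]}": " ⊓ ".join(cs) for v, cs in sorted(conjuncts.items())}


def output_concept_definitions(template):
    """Map each output concept name to the union of variable concepts it is built from."""
    disjuncts = {}
    for atom in template:
        if len(atom) == 2:
            var, concept = atom
            disjuncts.setdefault(concept + "_ddot", []).append(f"V_{var[1:]}")
    return {c: " ⊔ ".join(vs) for c, vs in sorted(disjuncts.items())}


for name, definition in variable_definitions(PATTERN).items():
    print(f"{name} ≡ {definition}")
for name, definition in output_concept_definitions(TEMPLATE).items():
    print(f"{name} ≡ {definition}")
```

The role-name axioms for the dotted and double-dotted roles are omitted here; they would need the
same pass over the role atoms of the pattern and the template.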
   Query 𝑞1 has two components 𝑃1 = {(w, y):𝑝, y:𝐵} and 𝑃2 = {(x, z):𝑝, z:𝐸} that do not share variables.
The CWA encoding does not entail 𝑉𝑦 ⊑ 𝑉𝑧, even though this axiom is both valid in all extended graphs
and required for inferring, e. g., the result shape 𝐸 ⊑ 𝐵. In another step of the algorithm, we infer
these additional subsumptions by constructing variable mappings ℎ between query components that
are potentially extended with input shapes. In this case, we know from input shape 𝑠3 ∈ 𝑆1 that 𝐵 ⊑ 𝐸.
We can utilize this knowledge to extend component 𝑃1 by adding the pattern y:𝐸, which does
not alter the query's results. Then, we can find the mapping ℎ(x) = w, ℎ(z) = y such that ℎ(𝑃2) ⊆ 𝑃1 ∪ {y:𝐸},
which implies 𝑉𝑤 ⊑ 𝑉𝑥 and 𝑉𝑦 ⊑ 𝑉𝑧.
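
   The search for such a mapping can be sketched as a brute-force homomorphism test in Python.
This is a minimal illustration under stated assumptions rather than the algorithm from [6]: only
atomic subsumptions such as 𝐵 ⊑ 𝐸 are used to extend a component, atoms are encoded as tuples with
"?"-prefixed variables, and all function names are hypothetical.

```python
from itertools import product

# The two components of q1; ("?y", "B") encodes y:B, ("?w", "p", "?y") encodes (w, y):p.
P1 = {("?w", "p", "?y"), ("?y", "B")}
P2 = {("?x", "p", "?z"), ("?z", "E")}
# Atomic input subsumptions available for extension; here only s3 = B ⊑ E.
SUBSUMPTIONS = [("B", "E")]


def extend(component, subsumptions):
    """Saturate a component: whenever x:C is present and C ⊑ D is known, add x:D."""
    extended = set(component)
    changed = True
    while changed:
        changed = False
        for atom in list(extended):
            if len(atom) != 2:
                continue
            for sub, sup in subsumptions:
                if atom[1] == sub and (atom[0], sup) not in extended:
                    extended.add((atom[0], sup))
                    changed = True
    return extended


def variables(component):
    """All variables occurring in a component."""
    return sorted({t for atom in component for t in atom if t.startswith("?")})


def find_mappings(source, target):
    """Yield every variable mapping h with h(source) contained in target."""
    src_vars, tgt_vars = variables(source), variables(target)
    for image in product(tgt_vars, repeat=len(src_vars)):
        h = dict(zip(src_vars, image))
        if all(tuple(h.get(t, t) for t in atom) in target for atom in source):
            yield h


def implied_subsumptions(h):
    """A mapping h yields V_h(v) ⊑ V_v for every mapped variable v."""
    return [f"V_{h[v][1:]} ⊑ V_{v[1:]}" for v in sorted(h)]


P1_EXT = extend(P1, SUBSUMPTIONS)
MAPPINGS = list(find_mappings(P2, P1_EXT))
for h in MAPPINGS:
    print(h, implied_subsumptions(h))
```

Without the extension step no mapping from 𝑃2 into 𝑃1 exists, because 𝑃1 contains no atom of the
form y:𝐸; this is the role the input shape plays in the example.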


4. Conclusion
We presented an algorithm for inferring a set of shapes that validate the possible output graphs of a
CONSTRUCT query, where input graphs of this query can be constrained by a set of shapes as well. This
enables the inference of shapes over result graphs of data processing pipelines (i. e., compositions of
CONSTRUCT queries), which can be used both for validation purposes when working with these result
graphs, and informatively, aiding developers directly.
   Our approach differs from related work (e.g., [9, 10, 11, 12, 13, 14]) in that we infer shapes from
statically known information (query and input shapes) and not from instance data. Thus, it is similar
to approaches for inferring constraints over views on relational databases (e.g., [15, 16, 17, 18]), but
utilizing a modelling approach that is feasible and supports crucial constraints for typing knowledge
graphs. Some approaches construct SHACL from static information [19, 20, 21], such as RML rules or
direct mappings, though they either rely on a less expressive mapping language or lack support
for constraints on the input data. An implementation [22] of the algorithms presented in our work is
available on GitHub¹.

Acknowledgments This work was partially funded by Deutsche Forschungsgemeinschaft (DFG) under
SPP 1921 – 318363223, and the DFG Germany’s Excellence Strategy – EXC 2120/1 – 390831618.


References
 [1] S. Harris, A. Seaborne, E. Prud’hommeaux, SPARQL 1.1 query language, 2013. URL:
     https://www.w3.org/TR/sparql11-query/.
 [2] R. Angles, M. Arenas, P. Barceló, P. A. Boncz, G. H. L. Fletcher, C. Gutierrez, T. Lindaaker,
     M. Paradies, S. Plantikow, J. F. Sequeda, O. van Rest, H. Voigt, G-CORE: A core for future
     graph query languages, in: Proceedings of SIGMOD, ACM, 2018, pp. 1421–1432.
     doi:10.1145/3183713.3190654.
 [3] H. Knublauch, D. Kontokostas, Shapes Constraint Language (SHACL), 2017. URL:
     https://www.w3.org/TR/shacl/.
 [4] P. Seifer, R. Lämmel, S. Staab, ProGS: Property graph shapes language, in: Proceedings of ISWC,
     volume 12922 of LNCS, Springer, 2021, pp. 392–409. doi:10.1007/978-3-030-88361-4_23.
 [5] M. Leinberger, P. Seifer, C. Schon, R. Lämmel, S. Staab, Type checking program code using SHACL,
     in: Proceedings of ISWC, volume 11778 of LNCS, Springer, 2019, pp. 399–417.
     doi:10.1007/978-3-030-30793-6_23.
 [6] P. Seifer, D. Hernández, R. Lämmel, S. Staab, From shapes to shapes: Inferring SHACL shapes
     for results of SPARQL CONSTRUCT queries, in: Proceedings of the ACM Web Conference 2024,
     WWW 2024, ACM, 2024, pp. 2064–2074. doi:10.1145/3589334.3645550.
 [7] B. Bogaerts, M. Jakubowski, J. V. den Bussche, SHACL: A description logic in disguise, in:
     Proceedings of Logic Programming and Nonmonotonic Reasoning, volume 13416 of LNCS, Springer,
     2022, pp. 75–88. doi:10.1007/978-3-031-15707-3_7.
 [8] P. Seifer, D. Hernández, R. Lämmel, S. Staab, From shapes to shapes: Inferring SHACL shapes for
     results of SPARQL CONSTRUCT queries (extended version), 2024. arXiv:2402.08509.
 [9] B. Spahiu, A. Maurino, M. Palmonari, Towards improving the quality of knowledge graphs with
     data-driven ontology patterns and SHACL, in: Proceedings of the Workshop on Ontology Design
     and Patterns (WOP 2018) co-located with ISWC 2018, volume 2195 of CEUR Workshop Proceedings,
     CEUR-WS.org, 2018, pp. 52–66. URL: https://ceur-ws.org/Vol-2195/research_paper_2.pdf.
[10] D. Fernández-Álvarez, J. E. L. Gayo, D. Gayo-Avello, Automatic extraction of shapes using sheXer,
     Knowledge-Based Systems 238 (2022) 107975. doi:10.1016/J.KNOSYS.2021.107975.
[11] K. Rabbani, M. Lissandrini, K. Hose, Extraction of validating shapes from very large knowledge
     graphs, Proc. VLDB Endow. 16 (2023) 1023–1032. doi:10.14778/3579075.3579078.
¹ https://github.com/softlang/s2s
[12] N. Mihindukulasooriya, M. R. A. Rashid, G. Rizzo, R. García-Castro, Ó. Corcho, M. Torchiano,
     RDF shape induction using knowledge base profiling, in: Proc. of the Symposium on Applied
     Computing, ACM, 2018, pp. 1952–1959. doi:10.1145/3167132.3167341.
[13] P. G. Omran, K. Taylor, S. J. R. Méndez, A. Haller, Learning SHACL shapes from knowledge graphs,
     Semantic Web 14 (2023) 101–121. doi:10.3233/SW-223063.
[14] B. Groz, A. Lemay, S. Staworko, P. Wieczorek, Inference of shape graphs for graph databases,
     in: ICDT, volume 220 of LIPIcs, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022, pp.
     14:1–14:20. doi:10.4230/LIPICS.ICDT.2022.14.
[15] A. C. Klug, R. Price, Determining view dependencies using tableaux, ACM Trans. Database Syst. 7
     (1982) 361–380. doi:10.1145/319732.319738.
[16] W. Fan, S. Ma, Y. Hu, J. Liu, Y. Wu, Propagating functional dependencies with conditions,
     Proc. VLDB Endow. 1 (2008) 391–407. doi:10.14778/1453856.1453901.
[17] M. Stonebraker, Implementation of integrity constraints and views by query modification, in:
     Proceedings of SIGMOD, ACM, 1975, pp. 65–78. doi:10.1145/500080.500091.
[18] B. E. Jacobs, A. R. Aronson, A. C. Klug, On interpretations of relational languages and solutions
     to the implied constraint problem, ACM Trans. Database Syst. 7 (1982) 291–315.
     doi:10.1145/319702.319730.
[19] R. B. Thapa, M. Giese, Mapping relational database constraints to SHACL, in: Proceedings of ISWC,
     volume 13489 of LNCS, Springer, 2022, pp. 214–230. doi:10.1007/978-3-031-19433-7_13.
[20] R. B. Thapa, M. Giese, A source-to-target constraint rewriting for direct mapping, in: Proceedings of
     ISWC, volume 12922 of LNCS, Springer, 2021, pp. 21–38. doi:10.1007/978-3-030-88361-4_2.
[21] T. Delva, B. D. Smedt, S. M. Oo, D. V. Assche, S. Lieber, A. Dimou, RML2SHACL: RDF generation
     taking shape, in: Proceedings of Knowledge Capture Conference, ACM, 2021, pp. 153–160.
     doi:10.1145/3460210.3493562.
[22] P. Seifer, D. Hernández, R. Lämmel, S. Staab, Code for from shapes to shapes, 2024.
     doi:10.18419/darus-3977.

</pre>