=Paper= {{Paper |id=Vol-3739/abstract-23 |storemode=property |title=Inferring SHACL Constraints for Results of Composable Graph Queries (Extended Abstract) |pdfUrl=https://ceur-ws.org/Vol-3739/abstract-23.pdf |volume=Vol-3739 |authors=Philipp Seifer,Daniel Hernรกndez,Ralf Lรคmmel,Steffen Staab |dblpUrl=https://dblp.org/rec/conf/dlog/Seifer0LS24 }} ==Inferring SHACL Constraints for Results of Composable Graph Queries (Extended Abstract)== https://ceur-ws.org/Vol-3739/abstract-23.pdf
                         Inferring SHACL Constraints for Results of Composable
                         Graph Queries (Extended Abstract)
                         Philipp Seifer1 , Daniel Hernรกndez2 , Ralf Lรคmmel1 and Steffen Staab2,3
                         1
                           University of Koblenz, Koblenz, Germany
                         2
                           University of Stuttgart, Stuttgart, Germany
                         3
                           University of Southampton, Southampton, UK


                                     Abstract
                                     SPARQL CONSTRUCT queries allow for the specification of data processing pipelines that transform given input
                                     graphs into new output graphs. Input graphs are now commonly constrained through SHACL shapes allowing
                                     for both their validation and aiding users (as well as tools) in understanding their structure. However, it becomes
                                     challenging to understand what graph data can be expected at the end of a data processing pipeline without
                                     knowing the particular input data: Shape constraints on the input graph may affect the output graph, but may no
                                     longer apply literally, and new shapes may be imposed by the query itself.
                                         In our recent work, From Shapes to Shapes: Inferring SHACL Shapes for Results of SPARQL CONSTRUCT
                                     Queries, we studied the derivation of shape constraints that hold on all possible output graphs of a given SPARQL
                                     CONSTRUCT query by axiomatizing the query and the shapes with the ๐’œโ„’๐’žโ„‹๐’ชโ„ description logic. This extended
                                     abstract summarizes our previous work.

                                     Keywords
                                     Semantic Queries, Data Pipelines, SHACL, SPARQL CONSTRUCT




                         1. Introduction
                         Some graph query languages are composable (i. e., they construct new graphs as results) and thereby
                         allow for the fruitful composition of queries into data processing pipelines. Examples for such languages
                         are SPARQL (in particular, its CONSTRUCT [1] queries) for RDF graphs and more recently G-CORE [2]
                         for property graphs.
                             When graphs existing in the context of such composable query languages are validated using
                         constraints specified in a shape description language such as SHACL [3] or ProGS [4], an interesting
                         problem arises: Even though the input to a query is well-defined with SHACL or ProGS shapes, it
                         becomes unclear which shapes apply after executing the query. Constraints that applied to the input
                         graph may be invalidated by the query, e. g., because some required edges were removed, and new
                         shapes may arise from the queriesโ€™ construction template as well. Indeed, both downstream applications
                         (e. g., programming languages [5]) and software developers working with graph data must understand
                         what a query may output.

                         Problem Description In our recent paper [6], we define the problem of computing a set of SHACL
                         shapes that characterises the possible output graphs of a SPARQL CONSTRUCT query. We represent
                         queries as rules ๐ป โ† ๐‘ƒ where the template ๐ป and pattern ๐‘ƒ are sets of assertions with variables.
                         Bindings for variables in ๐‘ƒ are used to construct a new graph according to ๐ป. We express SHACL
                         shapes as ๐’œโ„’๐’žโ„‹๐’ชโ„ axioms (cf. [7]).
                           Formally, we define the problem OutputShapes with signature (๐’ฎin , ๐‘ž) โ†ฆโ†’ ๐’ฎout-max , where ๐‘ž is a
                         query and ๐’ฎin and ๐’ฎout-max are two sets of shapes, called the input and output shapes. The set ๐’ฎout-max


                              DL 2024: 37th International Workshop on Description Logics, June 18โ€“21, 2024, Bergen, Norway
                          $ pseifer@uni-koblenz.de (P. Seifer); daniel.hernandez@ki.uni-stuttgart.de (D. Hernรกndez); laemmel@uni-koblenz.de
                          (R. Lรคmmel); steffen.staab@ki.uni-stuttgart.de (S. Staab)
                           0000-0002-7421-2060 (P. Seifer); 0000-0002-7896-0875 (D. Hernรกndez); 0000-0001-9946-4363 (R. Lรคmmel);
                          0000-0002-0780-4154 (S. Staab)
                                     ยฉ 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
           ๐‘                                   ๐ต, ๐ธ                                         ๐ต        ๐ต, ๐ธ
                                ๐ธ         ๐ด
    ๐ด๐‘Ž           ๐‘ ๐ต, ๐ธ                                          ๐‘       ๐‘ ๐ต, ๐ธ             ๐‘’             ๐‘
          ๐‘, ๐‘Ÿ
                                ๐‘’
                                     ๐‘
                                          ๐‘Ž
                                              ๐‘ ๐‘                                                   ๐‘ ๐‘

           (a) ๐บ1                        (b) ๐บ2                      (c) J๐‘ž1 K๐บ1                (d) J๐‘ž1 K๐บ2

Figure 1: (a,b) Input graphs for ๐‘ž1 valid with respect to ๐‘†1 . (c,d) Results of ๐‘ž1 on ๐บ1 and ๐บ2 .


is the maximum set of shapes such that for every graph ๐บin that satisfies the input shapes โ€“ we write
valid(๐บin , ๐’ฎin ) โ€“ the graph resulting from evaluating ๐‘ž on ๐บin , denoted J๐‘žK๐บin , satisfies the output
shapes, i. e., valid(J๐‘žK๐บin , ๐’ฎout-max ).

Running Example Consider a problem OutputShapes with a query ๐‘ž1 and a set of shapes ๐‘†1 as
input. Let ๐‘ž1 = {y:๐ธ, z:๐ต, (y, z):๐‘} โ† {(w, y):๐‘, y:๐ต, (x, z):๐‘, z:๐ธ}. Given a graph ๐บin , query ๐‘ž1 finds
bindings for variable y that are ๐ต, and bindings for variable z that are ๐ธ, both with incoming edges
with role name ๐‘ in ๐บin . Then, the query constructs the output graph, ๐บout = J๐‘ž1 K๐บin , consisting only
of these bindings for z and y, with inverted concept annotations and ๐‘-edges between them. Note, that
we color-code the namespace of the input versus output graphs, since the extensions of the involved
concept and role names are not the same.
   Let ๐‘†1 = {๐‘ 1 , ๐‘ 2 , ๐‘ 3 } where ๐‘ 1 = ๐ด โŠ‘ โˆƒ๐‘.๐ต, ๐‘ 2 = โˆƒ๐‘Ÿ.โŠค โŠ‘ ๐ต and ๐‘ 3 = ๐ต โŠ‘ ๐ธ. The graphs ๐บ1 (a)
and ๐บ2 (b) in Figure 1 are valid with respect to ๐‘†1 . Graphs J๐‘ž1 K๐บ1 (c) and J๐‘ž1 K๐บ2 (d) in Figure 1 are the
respective output graphs for ๐‘ž1 over ๐บ1 and ๐บ2 , respectively.
   Some expected output shapes are ๐ธ โŠ‘ โˆƒ๐‘.๐ต and ๐ธ โŠ‘ ๐ต. The first shape follows directly from the
query template. Each node labelled ๐ธ has an outgoing edge to a node labelled ๐ต. The second shape
๐ธ โŠ‘ ๐ต is valid because ๐ต โŠ‘ ๐ธ holds on all input graphs: As a result, we can infer that all bindings for
y are also bindings for z, such that ๐ธ โŠ‘ ๐ต follows from the query template.

Proposed Algorithm In this paper, we summarize the algorithm we presented in [6] (an extended
version with proofs is available in [8]), which constructs a sound (but not complete) approximation
๐’ฎout โІ ๐’ฎout-max .
   To this end, we split the problem into two subproblems: First, the problem of deciding whether
any given shape must be included in the output, and secondly the problem of generating a set of
candidate shapes. The second problem turns out to be rather trivial, as the set of candidates is finite:
We consider a subset of SHACL that is (syntactically) finite for a finite vocabulary, and show that all
relevant candidates are indeed from the finite vocabulary of the query. In our extended version, we
show that this also holds for a much larger subset of SHACL.
   Thus, we focus on the first problem in Section 2 and Section 3.


2. Axiomatizations Over Query Executions
As hinted earlier, different occurrences of the same concept name do not have the same extensions: A
query matches on input graphs, determines valuations (as subsets of the input), and constructs new
graphs. We distinguish between the inputs, intermediate bindings, as well as the constructed output by
rewriting input symbols ๐ด, ๐‘ into fresh symbols ๐ด ห™ , ๐‘ห™ after the first step, and into ๐ด   ยจ after the second
                                                                                         ยจ, ๐‘
step. These rewritten symbols allow us to encode assertions that are valid for only specific states of
query execution. Variable bindings, on the other hand, hold throughout: We codify a variable binding
๐œ‡(๐‘ฅ) = ๐‘Ž as a concept assertion ๐‘Ž:๐‘‰๐‘ฅ , where ๐‘‰๐‘ฅ is a fresh concept name.
   In order to axiomatize how (all possible) input graphs are connected to (all possible) output graphs,
we define a (virtual) extended graph ๐บext , that unifies the different steps,     and therefore allows us to
reason about them: ๐บext := ๐บin โˆช ๐บ   ห™ med โˆช ๐บV โˆช ๐บยจ out , where ๐บmed is โ‹ƒ๏ธ€
                                                                               ๐œ‡(๐‘ƒ )โІ๐บin ๐œ‡(๐‘ƒ ) (i.e., the union
of all graphs ๐œ‡(๐‘ƒ ) resulting from replacing every variable ๐‘ฅ in ๐‘ƒ with ๐œ‡(๐‘ฅ)), ๐บout is J๐‘žK๐บin , and ๐บV is
the graph containing an assertion ๐‘Ž:๐‘‰๐‘ฅ if and only if there exists a valuation ๐œ‡ such that ๐œ‡(๐‘ƒ ) โІ ๐บin
               ๐‘, ๐‘ห™                              ๐‘                   ๐‘ห™                ๐‘Ž ๐‘‰๐‘ค , ๐‘‰๐‘ฅ
                         ๐ต, ๐ตห™ , ๐ต
                                 ยจ , ๐‘‰๐‘ฆ ,   ๐ด๐‘Ž          ๐‘ ๐ต, ๐ธ   ๐‘Ž          ๐‘ ๐ตห™ , ๐ธห™               ๐‘
                                                                                                    ยจ        ยจ,๐ธ
                                                                                                           ๐‘ ๐ต ยจ
๐ด, ๐‘‰๐‘ค , ๐‘‰๐‘ฅ ๐‘Ž    ๐‘
                ยจ      ๐‘                         ๐‘, ๐‘Ÿ                ๐‘, ๐‘Ÿ               ๐‘ ๐‘‰๐‘ฆ , ๐‘‰๐‘ง
                             ห™   ยจ
                         ๐ธ, ๐ธ , ๐ธ , ๐‘‰๐‘ง
               ๐‘, ๐‘Ÿ                              (a) ๐บin             (b) ๐บ
                                                                         ห™ med          (c) ๐บV          (d) ๐บ
                                                                                                            ยจ out


                                                                      ห™ med (b), ๐บV (c), and ๐บ
        Figure 2: On the left the graph ๐บext as the union of ๐บin (a), ๐บ                      ยจ out (d).



and ๐œ‡(๐‘ฅ) = ๐‘Ž. Figure 2 shows the extended graph and its components for the running example query
๐‘ž1 , and the graph ๐บ1 in Figure 1 .

Proposition 1. Given a graph ๐บin and a query ๐‘ž, let the graphs ๐บext , ๐บmed , and ๐บout be the extended
graph and its components. For every axiom ๐œ™ that does not include names with dots (e. g., names ๐ด   ห™,๐ด ยจ , ๐‘ห™ ,
   ยจ), the following equivalences hold: valid(๐บin , {๐œ™}) if and only if valid(๐บext , {๐œ™}), valid(๐บmed , {๐œ™})
or ๐‘
if and only if valid(๐บext , {๐œ™ห™ }), and valid(๐บout , {๐œ™}) if and only if valid(๐บext , {๐œ™
                                                                                       ยจ }).

  Given the notion of extended graphs, we can prove Proposition 1 (see [8] for the proof), which is
essential to our method. Utilizing this proposition, we can show as a corollary that given a set of axioms
ฮฃ such that valid(๐บext , ฮฃ) for all extended graphs of a query ๐‘ž, if ฮฃ |= ยจ๐‘  then valid(๐บout , {๐‘ }) for
every output graph ๐บout of q. Thus, what remains is to show how to construct such a set of axioms ฮฃ.


3. Axioms Valid on Extended Graphs
Let us now consider what axioms can be inferred, by inspecting the running example. First, we can
notice that the input shapes ๐’ฎin are valid on all input graphs, by definition. Thus, we include them in
our knowledge base ฮฃ.
   Some axioms of the validation knowledge base (unique name, closed world and domain closure
assumptions) can be approximated by investigating the query ๐‘ž. The unique name assumption is limited
to individual names that occur in the query; in the running example there are none. Since a query does
not determine the set of individual names, no axioms related to the domain closure assumption can be
inferred. On the other hand, a query does restrict concept names that appear in the query with respect
to the closed world assumption (CWA).
   For the running example, we can thus infer {๐ตห™ โ‰ก ๐ต โŠ“ ๐‘‰๐‘ฆ , ๐ธห™ โ‰ก ๐ธ โŠ“ ๐‘‰๐‘ง }, because e. g., concept ๐ตห™ in
the extended graph is defined by filtering ๐ต with variable ๐‘‰๐‘ฆ , based on the query pattern y:๐ต in ๐‘ž1 .
We can also infer axioms {๐‘‰๐‘ค โ‰ก โˆƒ๐‘.๐‘‰๐‘ฆ , ๐‘‰๐‘ฅ โ‰ก โˆƒ๐‘.๐‘‰๐‘ง , ๐‘‰๐‘ฆ โ‰ก โˆƒ๐‘.๐‘‰๐‘ค โŠ“ ๐ต, ๐‘‰๐‘ง โ‰ก โˆƒ๐‘.๐‘‰๐‘ฅ โŠ“ ๐ธ} since variable
concepts are defined by constraints to the variable in the query pattern. For example, ๐‘‰๐‘ฆ is constrained
by patterns (w, y):๐‘ and y:๐ต in ๐‘ž1 , and thus bound by โˆƒ๐‘.๐‘‰๐‘ค โŠ“๐ต. This is a crucial step, since concept and
role names in the extended graph are defined in terms of these variable concepts: {๐ต       ยจ โ‰ก ๐‘‰๐‘ง , ๐ธ
                                                                                                    ยจ โ‰ก ๐‘‰๐‘ฆ }
since e. g., concept ๐ต in the extended graph is defined by ๐‘‰๐‘ง , as it only occurs in the single construct
                     ยจ
pattern z:๐ต. With similar reasoning, {โˆƒ๐‘ห™ .๐‘‰๐‘ฆ โ‰ก ๐‘‰๐‘ค , โˆƒ๐‘ห™ .๐‘‰๐‘ง โ‰ก ๐‘‰๐‘ฅ , โˆƒ๐‘ห™ .โŠค โ‰ก (๐‘‰๐‘ค โŠ“ โˆƒ๐‘ห™ .๐‘‰๐‘ฆ ) โŠ” (๐‘‰๐‘ฅ โŠ“ โˆƒ๐‘ห™ .๐‘‰๐‘ง )}
as well as {โˆƒ๐‘ ยจ.๐‘‰๐‘ง โ‰ก ๐‘‰๐‘ฆ , โˆƒ๐‘
                            ยจ.โŠค โ‰ก ๐‘‰๐‘ฆ โŠ“ โˆƒ๐‘ ยจ.๐‘‰๐‘ง } (and their inverse counterparts) can be constructed with
respect to role names.
   Query ๐‘ž1 has two components ๐‘ƒ1 = {(w, y):๐‘, y:๐ต} and ๐‘ƒ2 = {(x, z):๐‘, z:๐ธ} not sharing variables.
The CWA encoding does not entail ๐‘‰๐‘ฆ โŠ‘ ๐‘‰๐‘ง , even though this axiom is both valid in all extended graphs,
and required for inferring, e. g., the result shape ๐ธ โŠ‘ ๐ต. In another step of the algorithm, we infer
these additional subsumptions by constructing variable mappings โ„Ž between query components, that
are potentially extended with input shapes. In this case, we know based on input shape ๐‘†1 that ๐ต โŠ‘ ๐ธ.
We can utilize this knowledge to extend component (w, y):๐‘, y:๐ต, adding the pattern y:๐ธ which does
not alter the queries results. Then, we can find the mapping โ„Ž(x) = w, โ„Ž(z) = y such that โ„Ž(๐‘ƒ2 ) โІ ๐‘ƒ1 ,
which implies ๐‘‰๐‘ค โŠ‘ ๐‘‰๐‘ฅ and ๐‘‰๐‘ฆ โŠ‘ ๐‘‰๐‘ง .
4. Conclusion
We presented an algorithm for inferring a set of shapes that validate the possible output graphs of a
CONSTRUCT query, where input graphs of this query can be constrained by a set of shapes as well. This
enables the inference of shapes over result graphs of data processing pipelines (i. e., compositions of
CONSTRUCT queries), which can be used both for validation purposes when working with these result
graphs, and informatively, aiding developers directly.
   Our approach differs from related work (e.g., [9, 10, 11, 12, 13, 14]) in that we infer shapes from
statically known information (query and input shapes) and not from instance data. Thus, it is similar
to approaches for inferring constraints over views on relational databases (e.g., [15, 16, 17, 18]), but
utilizing a modelling approach that is feasible and supports crucial constraints for typing knowledge
graphs. Some approaches construct SHACL from static information [19, 20, 21], such as RML rules or
direct mappings, though they are limited in either a less expressive mapping language, or lack support
for constraints on the input data. An implementation [22] for the algorithms presented in our work is
available on GitHub1 .

Acknowledgments This work was partially funded by Deutsche Forschungsgemeinschaft (DFG) under
SPP 1921 โ€“ 318363223, and the DFG Germanyโ€™s Excellence Strategy โ€“ EXC 2120/1 โ€“ 390831618.


References
 [1] S. Harris, A. Seaborne, E. Prudโ€™hommeaux, SPARQL 1.1 query language, 2013. URL: https://www.
     w3.org/TR/sparql11-query/.
 [2] R. Angles, M. Arenas, P. Barcelรณ, P. A. Boncz, G. H. L. Fletcher, C. Gutierrez, T. Lindaaker,
     M. Paradies, S. Plantikow, J. F. Sequeda, O. van Rest, H. Voigt, G-CORE: A core for future
     graph query languages, in: Proceedings of SIGMOD, ACM, 2018, pp. 1421โ€“1432. doi:10.1145/
     3183713.3190654.
 [3] H. Knublauch, D. Kontokostas, Shapes Constraint Language (SHACL), 2017. URL: https://www.w3.
     org/TR/shacl/.
 [4] P. Seifer, R. Lรคmmel, S. Staab, ProGS: Property graph shapes language, in: Proceedings of ISWC,
     volume 12922 of LNCS, Springer, 2021, pp. 392โ€“409. doi:10.1007/978-3-030-88361-4_23.
 [5] M. Leinberger, P. Seifer, C. Schon, R. Lรคmmel, S. Staab, Type checking program code using SHACL,
     in: Proceedings of ISWC, volume 11778 of LNCS, Springer, 2019, pp. 399โ€“417. doi:10.1007/
     978-3-030-30793-6_23.
 [6] P. Seifer, D. Hernรกndez, R. Lรคmmel, S. Staab, From shapes to shapes: Inferring SHACL shapes
     for results of SPARQL CONSTRUCT queries, in: Proceedings of the ACM Web Conference 2024,
     WWW 2024, ACM, 2024, pp. 2064โ€“2074. doi:10.1145/3589334.3645550.
 [7] B. Bogaerts, M. Jakubowski, J. V. den Bussche, SHACL: A description logic in disguise, in:
     Proceedings of Logic Programming and Nonmonotonic Reasoning, volume 13416 of LNCS, Springer,
     2022, pp. 75โ€“88. doi:10.1007/978-3-031-15707-3_7.
 [8] P. Seifer, D. Hernรกndez, R. Lรคmmel, S. Staab, From shapes to shapes: Inferring SHACL shapes for
     results of SPARQL CONSTRUCT queries (extended version), 2024. arXiv:2402.08509.
 [9] B. Spahiu, A. Maurino, M. Palmonari, Towards improving the quality of knowledge graphs with
     data-driven ontology patterns and SHACL, in: Proceedings of the Workshop on Ontology Design
     and Patterns (WOP 2018) co-located with ISWC 2018, volume 2195 of CEUR Workshop Proceedings,
     CEUR-WS.org, 2018, pp. 52โ€“66. URL: https://ceur-ws.org/Vol-2195/research_paper_2.pdf.
[10] D. Fernรกndez-รlvarez, J. E. L. Gayo, D. Gayo-Avello, Automatic extraction of shapes using sheXer,
     Knowledge-Based Systems 238 (2022) 107975. doi:10.1016/J.KNOSYS.2021.107975.
[11] K. Rabbani, M. Lissandrini, K. Hose, Extraction of validating shapes from very large knowledge
     graphs, Proceedings VLDB Endow. 16 (2023) 1023โ€“1032. doi:10.14778/3579075.3579078.
1
    https://github.com/softlang/s2s
[12] N. Mihindukulasooriya, M. R. A. Rashid, G. Rizzo, R. Garcรญa-Castro, ร“. Corcho, M. Torchiano,
     RDF shape induction using knowledge base profiling, in: Prov. of the Symposium on Applied
     Computing, ACM, 2018, pp. 1952โ€“1959. doi:10.1145/3167132.3167341.
[13] P. G. Omran, K. Taylor, S. J. R. Mรฉndez, A. Haller, Learning SHACL shapes from knowledge graphs,
     Semantic Web 14 (2023) 101โ€“121. doi:10.3233/SW-223063.
[14] B. Groz, A. Lemay, S. Staworko, P. Wieczorek, Inference of shape graphs for graph databases,
     in: ICDT, volume 220 of LIPIcs, Schloss Dagstuhl - Leibniz-Zentrum fรผr Informatik, 2022, pp.
     14:1โ€“14:20. doi:10.4230/LIPICS.ICDT.2022.14.
[15] A. C. Klug, R. Price, Determining view dependencies using tableaux, ACM Trans. Database Syst. 7
     (1982) 361โ€“380. doi:10.1145/319732.319738.
[16] W. Fan, S. Ma, Y. Hu, J. Liu, Y. Wu, Propagating functional dependencies with conditions, Proceed-
     ings VLDB Endow. 1 (2008) 391โ€“407. doi:10.14778/1453856.1453901.
[17] M. Stonebraker, Implementation of integrity constraints and views by query modification, in:
     Proceedings of SIGMOD, ACM, 1975, pp. 65โ€“78. doi:10.1145/500080.500091.
[18] B. E. Jacobs, A. R. Aronson, A. C. Klug, On interpretations of relational languages and solutions
     to the implied constraint problem, ACM Trans. Database Syst. 7 (1982) 291โ€“315. doi:10.1145/
     319702.319730.
[19] R. B. Thapa, M. Giese, Mapping relational database constraints to SHACL, in: Proceedings of ISWC,
     volume 13489 of LNCS, Springer, 2022, pp. 214โ€“230. doi:10.1007/978-3-031-19433-7_13.
[20] R. B. Thapa, M. Giese, A source-to-target constraint rewriting for direct mapping, in: Proceedings of
     ISWC, volume 12922 of LNCS, Springer, 2021, pp. 21โ€“38. doi:10.1007/978-3-030-88361-4_2.
[21] T. Delva, B. D. Smedt, S. M. Oo, D. V. Assche, S. Lieber, A. Dimou, RML2SHACL: RDF generation
     taking shape, in: Proceedings of Knowledge Capture Conference, ACM, 2021, pp. 153โ€“160.
     doi:10.1145/3460210.3493562.
[22] P. Seifer, D. Hernรกndez, R. Lรคmmel, S. Staab, Code for from shapes to shapes, 2024. doi:10.18419/
     darus-3977.