=Paper=
{{Paper
|id=Vol-3739/abstract-23
|storemode=property
|title=Inferring SHACL Constraints for Results of Composable Graph Queries (Extended Abstract)
|pdfUrl=https://ceur-ws.org/Vol-3739/abstract-23.pdf
|volume=Vol-3739
|authors=Philipp Seifer,Daniel Hernรกndez,Ralf Lรคmmel,Steffen Staab
|dblpUrl=https://dblp.org/rec/conf/dlog/Seifer0LS24
}}
==Inferring SHACL Constraints for Results of Composable Graph Queries (Extended Abstract)==
Inferring SHACL Constraints for Results of Composable Graph Queries (Extended Abstract) Philipp Seifer1 , Daniel Hernรกndez2 , Ralf Lรคmmel1 and Steffen Staab2,3 1 University of Koblenz, Koblenz, Germany 2 University of Stuttgart, Stuttgart, Germany 3 University of Southampton, Southampton, UK Abstract SPARQL CONSTRUCT queries allow for the specification of data processing pipelines that transform given input graphs into new output graphs. Input graphs are now commonly constrained through SHACL shapes allowing for both their validation and aiding users (as well as tools) in understanding their structure. However, it becomes challenging to understand what graph data can be expected at the end of a data processing pipeline without knowing the particular input data: Shape constraints on the input graph may affect the output graph, but may no longer apply literally, and new shapes may be imposed by the query itself. In our recent work, From Shapes to Shapes: Inferring SHACL Shapes for Results of SPARQL CONSTRUCT Queries, we studied the derivation of shape constraints that hold on all possible output graphs of a given SPARQL CONSTRUCT query by axiomatizing the query and the shapes with the ๐โ๐โ๐ชโ description logic. This extended abstract summarizes our previous work. Keywords Semantic Queries, Data Pipelines, SHACL, SPARQL CONSTRUCT 1. Introduction Some graph query languages are composable (i. e., they construct new graphs as results) and thereby allow for the fruitful composition of queries into data processing pipelines. Examples for such languages are SPARQL (in particular, its CONSTRUCT [1] queries) for RDF graphs and more recently G-CORE [2] for property graphs. When graphs existing in the context of such composable query languages are validated using constraints specified in a shape description language such as SHACL [3] or ProGS [4], an interesting problem arises: Even though the input to a query is well-defined with SHACL or ProGS shapes, it becomes unclear which shapes apply after executing the query. Constraints that applied to the input graph may be invalidated by the query, e. g., because some required edges were removed, and new shapes may arise from the queriesโ construction template as well. Indeed, both downstream applications (e. g., programming languages [5]) and software developers working with graph data must understand what a query may output. Problem Description In our recent paper [6], we define the problem of computing a set of SHACL shapes that characterises the possible output graphs of a SPARQL CONSTRUCT query. We represent queries as rules ๐ป โ ๐ where the template ๐ป and pattern ๐ are sets of assertions with variables. Bindings for variables in ๐ are used to construct a new graph according to ๐ป. We express SHACL shapes as ๐โ๐โ๐ชโ axioms (cf. [7]). Formally, we define the problem OutputShapes with signature (๐ฎin , ๐) โฆโ ๐ฎout-max , where ๐ is a query and ๐ฎin and ๐ฎout-max are two sets of shapes, called the input and output shapes. The set ๐ฎout-max DL 2024: 37th International Workshop on Description Logics, June 18โ21, 2024, Bergen, Norway $ pseifer@uni-koblenz.de (P. Seifer); daniel.hernandez@ki.uni-stuttgart.de (D. Hernรกndez); laemmel@uni-koblenz.de (R. Lรคmmel); steffen.staab@ki.uni-stuttgart.de (S. Staab) 0000-0002-7421-2060 (P. Seifer); 0000-0002-7896-0875 (D. Hernรกndez); 0000-0001-9946-4363 (R. Lรคmmel); 0000-0002-0780-4154 (S. Staab) ยฉ 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings ๐ ๐ต, ๐ธ ๐ต ๐ต, ๐ธ ๐ธ ๐ด ๐ด๐ ๐ ๐ต, ๐ธ ๐ ๐ ๐ต, ๐ธ ๐ ๐ ๐, ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ (a) ๐บ1 (b) ๐บ2 (c) J๐1 K๐บ1 (d) J๐1 K๐บ2 Figure 1: (a,b) Input graphs for ๐1 valid with respect to ๐1 . (c,d) Results of ๐1 on ๐บ1 and ๐บ2 . is the maximum set of shapes such that for every graph ๐บin that satisfies the input shapes โ we write valid(๐บin , ๐ฎin ) โ the graph resulting from evaluating ๐ on ๐บin , denoted J๐K๐บin , satisfies the output shapes, i. e., valid(J๐K๐บin , ๐ฎout-max ). Running Example Consider a problem OutputShapes with a query ๐1 and a set of shapes ๐1 as input. Let ๐1 = {y:๐ธ, z:๐ต, (y, z):๐} โ {(w, y):๐, y:๐ต, (x, z):๐, z:๐ธ}. Given a graph ๐บin , query ๐1 finds bindings for variable y that are ๐ต, and bindings for variable z that are ๐ธ, both with incoming edges with role name ๐ in ๐บin . Then, the query constructs the output graph, ๐บout = J๐1 K๐บin , consisting only of these bindings for z and y, with inverted concept annotations and ๐-edges between them. Note, that we color-code the namespace of the input versus output graphs, since the extensions of the involved concept and role names are not the same. Let ๐1 = {๐ 1 , ๐ 2 , ๐ 3 } where ๐ 1 = ๐ด โ โ๐.๐ต, ๐ 2 = โ๐.โค โ ๐ต and ๐ 3 = ๐ต โ ๐ธ. The graphs ๐บ1 (a) and ๐บ2 (b) in Figure 1 are valid with respect to ๐1 . Graphs J๐1 K๐บ1 (c) and J๐1 K๐บ2 (d) in Figure 1 are the respective output graphs for ๐1 over ๐บ1 and ๐บ2 , respectively. Some expected output shapes are ๐ธ โ โ๐.๐ต and ๐ธ โ ๐ต. The first shape follows directly from the query template. Each node labelled ๐ธ has an outgoing edge to a node labelled ๐ต. The second shape ๐ธ โ ๐ต is valid because ๐ต โ ๐ธ holds on all input graphs: As a result, we can infer that all bindings for y are also bindings for z, such that ๐ธ โ ๐ต follows from the query template. Proposed Algorithm In this paper, we summarize the algorithm we presented in [6] (an extended version with proofs is available in [8]), which constructs a sound (but not complete) approximation ๐ฎout โ ๐ฎout-max . To this end, we split the problem into two subproblems: First, the problem of deciding whether any given shape must be included in the output, and secondly the problem of generating a set of candidate shapes. The second problem turns out to be rather trivial, as the set of candidates is finite: We consider a subset of SHACL that is (syntactically) finite for a finite vocabulary, and show that all relevant candidates are indeed from the finite vocabulary of the query. In our extended version, we show that this also holds for a much larger subset of SHACL. Thus, we focus on the first problem in Section 2 and Section 3. 2. Axiomatizations Over Query Executions As hinted earlier, different occurrences of the same concept name do not have the same extensions: A query matches on input graphs, determines valuations (as subsets of the input), and constructs new graphs. We distinguish between the inputs, intermediate bindings, as well as the constructed output by rewriting input symbols ๐ด, ๐ into fresh symbols ๐ด ห , ๐ห after the first step, and into ๐ด ยจ after the second ยจ, ๐ step. These rewritten symbols allow us to encode assertions that are valid for only specific states of query execution. Variable bindings, on the other hand, hold throughout: We codify a variable binding ๐(๐ฅ) = ๐ as a concept assertion ๐:๐๐ฅ , where ๐๐ฅ is a fresh concept name. In order to axiomatize how (all possible) input graphs are connected to (all possible) output graphs, we define a (virtual) extended graph ๐บext , that unifies the different steps, and therefore allows us to reason about them: ๐บext := ๐บin โช ๐บ ห med โช ๐บV โช ๐บยจ out , where ๐บmed is โ๏ธ ๐(๐ )โ๐บin ๐(๐ ) (i.e., the union of all graphs ๐(๐ ) resulting from replacing every variable ๐ฅ in ๐ with ๐(๐ฅ)), ๐บout is J๐K๐บin , and ๐บV is the graph containing an assertion ๐:๐๐ฅ if and only if there exists a valuation ๐ such that ๐(๐ ) โ ๐บin ๐, ๐ห ๐ ๐ห ๐ ๐๐ค , ๐๐ฅ ๐ต, ๐ตห , ๐ต ยจ , ๐๐ฆ , ๐ด๐ ๐ ๐ต, ๐ธ ๐ ๐ ๐ตห , ๐ธห ๐ ยจ ยจ,๐ธ ๐ ๐ต ยจ ๐ด, ๐๐ค , ๐๐ฅ ๐ ๐ ยจ ๐ ๐, ๐ ๐, ๐ ๐ ๐๐ฆ , ๐๐ง ห ยจ ๐ธ, ๐ธ , ๐ธ , ๐๐ง ๐, ๐ (a) ๐บin (b) ๐บ ห med (c) ๐บV (d) ๐บ ยจ out ห med (b), ๐บV (c), and ๐บ Figure 2: On the left the graph ๐บext as the union of ๐บin (a), ๐บ ยจ out (d). and ๐(๐ฅ) = ๐. Figure 2 shows the extended graph and its components for the running example query ๐1 , and the graph ๐บ1 in Figure 1 . Proposition 1. Given a graph ๐บin and a query ๐, let the graphs ๐บext , ๐บmed , and ๐บout be the extended graph and its components. For every axiom ๐ that does not include names with dots (e. g., names ๐ด ห,๐ด ยจ , ๐ห , ยจ), the following equivalences hold: valid(๐บin , {๐}) if and only if valid(๐บext , {๐}), valid(๐บmed , {๐}) or ๐ if and only if valid(๐บext , {๐ห }), and valid(๐บout , {๐}) if and only if valid(๐บext , {๐ ยจ }). Given the notion of extended graphs, we can prove Proposition 1 (see [8] for the proof), which is essential to our method. Utilizing this proposition, we can show as a corollary that given a set of axioms ฮฃ such that valid(๐บext , ฮฃ) for all extended graphs of a query ๐, if ฮฃ |= ยจ๐ then valid(๐บout , {๐ }) for every output graph ๐บout of q. Thus, what remains is to show how to construct such a set of axioms ฮฃ. 3. Axioms Valid on Extended Graphs Let us now consider what axioms can be inferred, by inspecting the running example. First, we can notice that the input shapes ๐ฎin are valid on all input graphs, by definition. Thus, we include them in our knowledge base ฮฃ. Some axioms of the validation knowledge base (unique name, closed world and domain closure assumptions) can be approximated by investigating the query ๐. The unique name assumption is limited to individual names that occur in the query; in the running example there are none. Since a query does not determine the set of individual names, no axioms related to the domain closure assumption can be inferred. On the other hand, a query does restrict concept names that appear in the query with respect to the closed world assumption (CWA). For the running example, we can thus infer {๐ตห โก ๐ต โ ๐๐ฆ , ๐ธห โก ๐ธ โ ๐๐ง }, because e. g., concept ๐ตห in the extended graph is defined by filtering ๐ต with variable ๐๐ฆ , based on the query pattern y:๐ต in ๐1 . We can also infer axioms {๐๐ค โก โ๐.๐๐ฆ , ๐๐ฅ โก โ๐.๐๐ง , ๐๐ฆ โก โ๐.๐๐ค โ ๐ต, ๐๐ง โก โ๐.๐๐ฅ โ ๐ธ} since variable concepts are defined by constraints to the variable in the query pattern. For example, ๐๐ฆ is constrained by patterns (w, y):๐ and y:๐ต in ๐1 , and thus bound by โ๐.๐๐ค โ๐ต. This is a crucial step, since concept and role names in the extended graph are defined in terms of these variable concepts: {๐ต ยจ โก ๐๐ง , ๐ธ ยจ โก ๐๐ฆ } since e. g., concept ๐ต in the extended graph is defined by ๐๐ง , as it only occurs in the single construct ยจ pattern z:๐ต. With similar reasoning, {โ๐ห .๐๐ฆ โก ๐๐ค , โ๐ห .๐๐ง โก ๐๐ฅ , โ๐ห .โค โก (๐๐ค โ โ๐ห .๐๐ฆ ) โ (๐๐ฅ โ โ๐ห .๐๐ง )} as well as {โ๐ ยจ.๐๐ง โก ๐๐ฆ , โ๐ ยจ.โค โก ๐๐ฆ โ โ๐ ยจ.๐๐ง } (and their inverse counterparts) can be constructed with respect to role names. Query ๐1 has two components ๐1 = {(w, y):๐, y:๐ต} and ๐2 = {(x, z):๐, z:๐ธ} not sharing variables. The CWA encoding does not entail ๐๐ฆ โ ๐๐ง , even though this axiom is both valid in all extended graphs, and required for inferring, e. g., the result shape ๐ธ โ ๐ต. In another step of the algorithm, we infer these additional subsumptions by constructing variable mappings โ between query components, that are potentially extended with input shapes. In this case, we know based on input shape ๐1 that ๐ต โ ๐ธ. We can utilize this knowledge to extend component (w, y):๐, y:๐ต, adding the pattern y:๐ธ which does not alter the queries results. Then, we can find the mapping โ(x) = w, โ(z) = y such that โ(๐2 ) โ ๐1 , which implies ๐๐ค โ ๐๐ฅ and ๐๐ฆ โ ๐๐ง . 4. Conclusion We presented an algorithm for inferring a set of shapes that validate the possible output graphs of a CONSTRUCT query, where input graphs of this query can be constrained by a set of shapes as well. This enables the inference of shapes over result graphs of data processing pipelines (i. e., compositions of CONSTRUCT queries), which can be used both for validation purposes when working with these result graphs, and informatively, aiding developers directly. Our approach differs from related work (e.g., [9, 10, 11, 12, 13, 14]) in that we infer shapes from statically known information (query and input shapes) and not from instance data. Thus, it is similar to approaches for inferring constraints over views on relational databases (e.g., [15, 16, 17, 18]), but utilizing a modelling approach that is feasible and supports crucial constraints for typing knowledge graphs. Some approaches construct SHACL from static information [19, 20, 21], such as RML rules or direct mappings, though they are limited in either a less expressive mapping language, or lack support for constraints on the input data. An implementation [22] for the algorithms presented in our work is available on GitHub1 . Acknowledgments This work was partially funded by Deutsche Forschungsgemeinschaft (DFG) under SPP 1921 โ 318363223, and the DFG Germanyโs Excellence Strategy โ EXC 2120/1 โ 390831618. References [1] S. Harris, A. Seaborne, E. Prudโhommeaux, SPARQL 1.1 query language, 2013. URL: https://www. w3.org/TR/sparql11-query/. [2] R. Angles, M. Arenas, P. Barcelรณ, P. A. Boncz, G. H. L. Fletcher, C. Gutierrez, T. Lindaaker, M. Paradies, S. Plantikow, J. F. Sequeda, O. van Rest, H. Voigt, G-CORE: A core for future graph query languages, in: Proceedings of SIGMOD, ACM, 2018, pp. 1421โ1432. doi:10.1145/ 3183713.3190654. [3] H. Knublauch, D. Kontokostas, Shapes Constraint Language (SHACL), 2017. URL: https://www.w3. org/TR/shacl/. [4] P. Seifer, R. Lรคmmel, S. Staab, ProGS: Property graph shapes language, in: Proceedings of ISWC, volume 12922 of LNCS, Springer, 2021, pp. 392โ409. doi:10.1007/978-3-030-88361-4_23. [5] M. Leinberger, P. Seifer, C. Schon, R. Lรคmmel, S. Staab, Type checking program code using SHACL, in: Proceedings of ISWC, volume 11778 of LNCS, Springer, 2019, pp. 399โ417. doi:10.1007/ 978-3-030-30793-6_23. [6] P. Seifer, D. Hernรกndez, R. Lรคmmel, S. Staab, From shapes to shapes: Inferring SHACL shapes for results of SPARQL CONSTRUCT queries, in: Proceedings of the ACM Web Conference 2024, WWW 2024, ACM, 2024, pp. 2064โ2074. doi:10.1145/3589334.3645550. [7] B. Bogaerts, M. Jakubowski, J. V. den Bussche, SHACL: A description logic in disguise, in: Proceedings of Logic Programming and Nonmonotonic Reasoning, volume 13416 of LNCS, Springer, 2022, pp. 75โ88. doi:10.1007/978-3-031-15707-3_7. [8] P. Seifer, D. Hernรกndez, R. Lรคmmel, S. Staab, From shapes to shapes: Inferring SHACL shapes for results of SPARQL CONSTRUCT queries (extended version), 2024. arXiv:2402.08509. [9] B. Spahiu, A. Maurino, M. Palmonari, Towards improving the quality of knowledge graphs with data-driven ontology patterns and SHACL, in: Proceedings of the Workshop on Ontology Design and Patterns (WOP 2018) co-located with ISWC 2018, volume 2195 of CEUR Workshop Proceedings, CEUR-WS.org, 2018, pp. 52โ66. URL: https://ceur-ws.org/Vol-2195/research_paper_2.pdf. [10] D. Fernรกndez-รlvarez, J. E. L. Gayo, D. Gayo-Avello, Automatic extraction of shapes using sheXer, Knowledge-Based Systems 238 (2022) 107975. doi:10.1016/J.KNOSYS.2021.107975. [11] K. Rabbani, M. Lissandrini, K. Hose, Extraction of validating shapes from very large knowledge graphs, Proceedings VLDB Endow. 16 (2023) 1023โ1032. doi:10.14778/3579075.3579078. 1 https://github.com/softlang/s2s [12] N. Mihindukulasooriya, M. R. A. Rashid, G. Rizzo, R. Garcรญa-Castro, ร. Corcho, M. Torchiano, RDF shape induction using knowledge base profiling, in: Prov. of the Symposium on Applied Computing, ACM, 2018, pp. 1952โ1959. doi:10.1145/3167132.3167341. [13] P. G. Omran, K. Taylor, S. J. R. Mรฉndez, A. Haller, Learning SHACL shapes from knowledge graphs, Semantic Web 14 (2023) 101โ121. doi:10.3233/SW-223063. [14] B. Groz, A. Lemay, S. Staworko, P. Wieczorek, Inference of shape graphs for graph databases, in: ICDT, volume 220 of LIPIcs, Schloss Dagstuhl - Leibniz-Zentrum fรผr Informatik, 2022, pp. 14:1โ14:20. doi:10.4230/LIPICS.ICDT.2022.14. [15] A. C. Klug, R. Price, Determining view dependencies using tableaux, ACM Trans. Database Syst. 7 (1982) 361โ380. doi:10.1145/319732.319738. [16] W. Fan, S. Ma, Y. Hu, J. Liu, Y. Wu, Propagating functional dependencies with conditions, Proceed- ings VLDB Endow. 1 (2008) 391โ407. doi:10.14778/1453856.1453901. [17] M. Stonebraker, Implementation of integrity constraints and views by query modification, in: Proceedings of SIGMOD, ACM, 1975, pp. 65โ78. doi:10.1145/500080.500091. [18] B. E. Jacobs, A. R. Aronson, A. C. Klug, On interpretations of relational languages and solutions to the implied constraint problem, ACM Trans. Database Syst. 7 (1982) 291โ315. doi:10.1145/ 319702.319730. [19] R. B. Thapa, M. Giese, Mapping relational database constraints to SHACL, in: Proceedings of ISWC, volume 13489 of LNCS, Springer, 2022, pp. 214โ230. doi:10.1007/978-3-031-19433-7_13. [20] R. B. Thapa, M. Giese, A source-to-target constraint rewriting for direct mapping, in: Proceedings of ISWC, volume 12922 of LNCS, Springer, 2021, pp. 21โ38. doi:10.1007/978-3-030-88361-4_2. [21] T. Delva, B. D. Smedt, S. M. Oo, D. V. Assche, S. Lieber, A. Dimou, RML2SHACL: RDF generation taking shape, in: Proceedings of Knowledge Capture Conference, ACM, 2021, pp. 153โ160. doi:10.1145/3460210.3493562. [22] P. Seifer, D. Hernรกndez, R. Lรคmmel, S. Staab, Code for from shapes to shapes, 2024. doi:10.18419/ darus-3977.