=Paper= {{Paper |id=Vol-2971/paper05 |storemode=property |title=SHACL Constraint Validation during SPARQL Query Processing |pdfUrl=https://ceur-ws.org/Vol-2971/paper05.pdf |volume=Vol-2971 |authors=Philipp D. Rohde |dblpUrl=https://dblp.org/rec/conf/vldb/Rohde21 }} ==SHACL Constraint Validation during SPARQL Query Processing== https://ceur-ws.org/Vol-2971/paper05.pdf
SHACL Constraint Validation during SPARQL Query Processing
                                                                         Philipp D. Rohde
                                                          supervised by Maria-Esther Vidal
                                                           Leibniz University of Hannover
                                             TIB Leibniz Information Centre for Science and Technology
                                                                Hannover, Germany
                                                                philipp.rohde@tib.eu

ABSTRACT                                                                               While SHACL exhibits clarity and readability, SHACL shape schema
The importance of knowledge graphs is increasing. Due to their                         validation can face tractability issues. In general, the problem is
application in more and more real-world use-cases the data qual-                       intractable [8]. State-of-the-art approaches have identified tractable
ity issue has to be addressed. The Shapes Constraint Language                          fragments which cover most real-world use-cases and provided
(SHACL) is the W3C recommendation language for defining in-                            means for the efficient validation of those fragments [6, 9].
tegrity constraints over knowledge graphs expressed in the Re-                            SPARQL Protocol And RDF Query Language (SPARQL)3 is the
source Description Framework (RDF). Annotating SPARQL query                            W3C recommendation language to query RDF data. Much work
results with metadata from the SHACL validation provides a better                      has been done by the community to study different approaches to
understanding of the knowledge graph and its data quality. We                          improve the query performance of SPARQL [2]. Recent work [1, 14]
propose a query engine that is able to efficiently evaluate which                      also considers the integrity constraint schema for cost-based ap-
instances in the knowledge graph fulfill the requirements from the                     proaches to query optimization. However, many publicly accessi-
SHACL shape schema and annotate the SPARQL query result with                           ble knowledge graphs suffer from quality issues even if they are
this metadata. Hence, adding the dimension of explainability to                        curated [11]. This work aims at increasing the explainability of
SPARQL query processing. Our preliminary analysis shows that the                       SPARQL query results by annotating it with the result from the
proposed optimizations performed for SHACL validation during                           SHACL shape schema validation while minimizing execution time
SPARQL query processing increase the performance compared to                           to evaluate which instances are valid.
a naive approach. However, in some queries the naive approach
outperforms the optimizations. This shows that more work needs to                      2    MOTIVATION
be done in this topic to fully comprehend all impacting factors and                    Consider an RDF knowledge graph containing data about universi-
to identify the amount of overhead added to the query execution.                       ties from the LUBM [10] benchmark. Given this data, the SPARQL
                                                                                       query in Figure 1a retrieves the names of full professors that have
                                                                                       an email address and work at Department0 of University1, their
                                                                                       research interest, and the URI of the university they got their PhD
1    INTRODUCTION                                                                      from. Figure 1b depicts the integrity constraints defined for profes-
Knowledge graphs still experience an exponential growth [11]                           sors, departments, and universities. Instances of all three classes
and became expressive data structures that provide a unified view                      have to have exactly one name. Departments are a sub-organization
of a multitude of data sources. Knowledge graphs are used in                           of at least one university. Professors additionally have at least one
large IT companies [13] as well as for domain-specific data like in                    email address and research interest. Furthermore, professors work
biomedicine [12]. These use-cases prove the potential of knowledge                     for at least one department and obtained their PhD from at least
graphs but also show the need for efficient means of knowledge                         one university. An instance associated with the professor shape is
graph creation, curation, and understanding.                                           valid if the instance meets all requirements, i.e., the instance fulfills
   The Shapes Constraint Language (SHACL)1 is the W3C recom-                           the intra-shape constraints, i.e., the constraints not linking to other
mendation language for declaratively defining integrity constraints                    shapes, as well as the inter-shape constraints, i.e., the constraints
over data expressed in the Resource Description Framework (RDF)2 .                     linking to other shapes. An instance meets the inter-shape con-
SHACL models integrity constraints as a network of shapes, called                      straints if the instances linked to fulfill all requirements inflicted on
shape schema. A shape consists of all constraints against attributes                   them. The result of the query in Figure 1a is presented in Figure 1c.
of a specific RDF target resource. Requirements on properties as-                      Note that in the example data only eight of 1,000 universities have
sociating two targets are modeled as links between the shapes.                         a name. Hence, only three of the seven query results meet all in-
SHACL is being used in real-world scenarios, and it is also adopted                    tegrity constraints defined for the data. That does not falsify the
in industrial consortia, e.g., the International Data Space (IDS) [5],                 result reported. However, it provides an explanation for why the
to represent integrity constraints in the reference architectures.                     query has only three answers when asking for the university name
                                                                                       as well; by adding the triple pattern ?uni ub:name ?uname to the
Proceedings of the VLDB 2021 PhD Workshop, August 16th, 2021. Copenhagen, Den-         query. Adding metadata, e.g., from the validation of a SHACL shape
mark. Copyright (C) 2021 for this paper by its authors. Use permitted under Creative   schema, to the result of a SPARQL query helps in understanding
Commons License Attribution 4.0 International (CC BY 4.0).
                                                                                       the answers retrieved from an RDF knowledge graph.
1 https://www.w3.org/TR/2017/REC-shacl-20170720/
2 https://www.w3.org/TR/1999/REC-rdf-syntax-19990222/                                  3 https://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/
PREFIX ub: 
PREFIX rdf: 

SELECT ?name ?ri ?uni WHERE {
  ?prof rdf:type ub:FullProfessor ;
     ub:name ?name ;
     ub:worksFor  ;
     ub:doctoralDegreeFrom ?uni ;
     ub:emailAddress ?email ;
     ub:researchInterest ?ri .
} ORDER BY ?prof

                          (a) SPARQL Query                                                       (b) SHACL Shape Schema

                 name               ri             uni                                __meta__
                 FullProfessor0     Research6      http://www.University6.edu         all requirements met
                 FullProfessor1     Research23     http://www.University7.edu         all requirements met
                 FullProfessor2     Research15     http://www.University1.edu         all requirements met
                 FullProfessor3     Research10     http://www.University888.edu       University888 violates name constraint
                 FullProfessor4     Research19     http://www.University358.edu       University358 violates name constraint
                 FullProfessor5     Research8      http://www.University996.edu       University996 violates name constraint
                 FullProfessor6     Research12     http://www.University87.edu        University87 violates name constraint
                                          (c) Query Result with Annotations from Quality Assessment

Figure 1: Motivating Example over an RDF University System (LUBM Benchmark). (a) A SPARQL query retrieving the names
of the FullProfessors working for dept0 of uni1, their research interest, and the URI of the university they got their PhD from.
(b) The SHACL shape schema used for validating the data. (c) The result of the query in (a) annotated with the results from the
SHACL validation of the shape schema in (b). Three out of seven query results meet all the constraints. The other four results
include universities violating the constraint on the predicate ub:name, i.e., they have no or more than one name in the data.


3    RELATED WORK                                                             ShEx schema. The triple patterns are ranked based on the hierar-
SHACL Validation. Due to the increasing importance of RDF data,               chical structure of the shapes within the ShEx schema, i.e., shape
the need for efficiently checking integrity constraints over RDF data         inclusion, as well as the predicate distribution, i.e., the generality
emerge. Corman et al. [8] propose a new semantics for recursive               of the predicate. Shapes that are being included in other shapes
SHACL since the semantics of recursions are left open in the SHACL            get ranked higher as they are assumed to be more selective due to
specification. Based on this work, they identify three tractable frag-        the well-formation cardinality rule. This rule forces 𝑚-to-𝑛 relations
ments of SHACL [7] and develop an algorithm able to validate                  where 𝑚 > 𝑛 to be defined on the 𝑚-side. Triple patterns with
SHACL shape schemas of these fragements [6]. Andreßel et al. [4]              predicates that are unique for one shape are ranked higher than
propose to use stable models known from Answer Set Programming                the ones with more general predicates as they are assumed to be
(ASP) to validate SHACL shape schemas. This results in an even                more selective. Rabbani et al. [14] propose an extension of SHACL
stricter semantics for recursive SHACL. This approach allows to               including statistics into the definition of a SHACL shape. These
translate the validation into a logic progamm that can be solved us-          statistics capture the total triple count, minimum and maximum
ing of-the-shelf ASP solvers. In contrast to these logic approaches to        number of triples for each instance, and the number of distinct ob-
the validation of a SHACL shape schema, Figuera et al. [9] propose            jects. The proposed query optimizer uses the information encoded
Trav-SHACL which aims at improving the incremental behavior                   in the SHACL shapes for cardinality estimation as part of a cost-
and scalability of the validation problem. They make use of the               based query planner. The authors show that the approach generates
fragments and algorithms identified by Corman et al. [6, 7] and               query plans that are cheaper or at most of the same cost as state-
improve the performance by interleaving the data retrieval, rule              of-the-art heuristic query planners. Additionally, their approach
grounding, and saturation steps. Additionally, Trav-SHACL opti-               requires considerably less time and space for the pre-processing
mizes the queries sent to the endpoint in order to identify invalid           compared to other cost-based approaches which rely on the exis-
entities as fast as possible. While this work is dependent on efficient       tence of hard-to-compute statistics. These optimization techniques
validation of SHACL, it is not the main focus.                                are sound, but they assume that the data complies to the integrity
Query Processing. Recently integrity constraints over RDF data                constraints. However, especially in publicly available RDF data
– expressed either in SHACL or ShEx [15] – have been included                 sources, this assumption does not hold. On the contrary, most of
in studies about optimization of SPARQL queries. Abbas et al. [1]             the data suffers from low quality, e.g., missing values. Our approach
propose rules for well-formed ShEx schemas and SPARQL query                   aims at explaining the query result based on the validation of the
optimization by triple pattern reordering based on a well-formed              integrity constraints, i.e., annotating the query result.

                                                                          2
                                                                             Figure 3: Preliminary Results over WatDiv. Naive approach
                                                                             is evaluating the complete shape schema while proposed ap-
                                                                             proach implements the proposed optimizations. For queries
                                                                             Q1 and Q2 the optimizations increase the performance.
Figure 2: Approach Overview. A SPARQL query is executed                      However, for Q3 the naive approach performs better.
over an RDF knowledge graph. A SHACL shape schema is
validated against the same knowledge graph. The result of                       iii) SPARQL Query Result Annotation: For a SHACL shape it is
the query and the validation are joined to form the anno-                    common to impose integrity constraints on a set of instances that
tated query result improving explainability.                                 share the same properties. Hence, it makes sense to decompose
                                                                             the original SPARQL query into subject star-shaped sub-queries
4   PROBLEM DEFINITION                                                       and annotate the sub-queries first. Figure 2 shows the approach to
Problem Statement. Given an RDF graph G = ⟹𝑉 G , 𝐾 G ⟩, a SHACL              annotate an exemplary sub-query with the result from the SHACL
shape schema S = ⟹𝑆, targ, def⟩, and a SPARQL query Q, the                   validation based on the motivating example (see Figure 1). The
                                                                             complete shape schema might also include requirements for stu-
problem of annotating the SPARQL query result [[Q]] G with the
                                                                             dents and research assistants. Once all the sub-queries have been
SHACL validation result [S] G is to match the instances in [[Q]] G
                                                                             executed and annotated, the SPARQL operators, like join and union,
with the instances in [S] G such that the execution time required
                                                                             have to combine the results from each sub-query. Part of this work
for the annotation of the query result, i.e., evaluating the shape
                                                                             would be the formalization of the semantics and SPARQL operators
schema and matching it against the query result, is minimized.
                                                                             in presence of annotations from the SHACL validation.
Solution. The work necessary to address the problem of enriching
SPARQL query results with metadata to increase the explainability
can be broken down into the following dimensions:                            5   RESULTS SO FAR
   i) Query Decomposition: SPARQL queries can be decomposed                  So far, we studied the efficient validation of SHACL shape schemas.
into subject star-shaped sub-queries [17]. All triple patterns of such       In [9] we reported our work on the topic. We proposed new opti-
a sub-query have their subject in common with the other triple               mization techniques to increase the performance and continuous
patterns. It is likely that the instances in the data that match the         behavior of SHACL shape schema validation. Furthermore, we
star-shaped sub-query represent one class of instances.                      started to investigate the impact of annotating the result of star-
   ii) SHACL Validation: A crucial part of this work is the efficient        shaped SPARQL queries with the result from the aforementioned
validation of SHACL shape schemas since either the complete RDF              validation. In order to annotate the query result, we modified the
knowledge graph or an appropriate subset has to be validated.                XJoin [16] operator to match instances from the query result with
Interleaving the different steps of the SHACL validation and op-             the instances from the SHACL validation. In this process, we iden-
timizing the queries sent to the SPARQL endpoint improve the                 tified several factors that impact the performance of such a system.
performance [9]. However, in this use-case more optimizations can            To increase the performance, we defined further optimization tech-
be done. This is due to the fact that most queries will only include a       niques that can be applied in case the annotation of the query result
subset of the SHACL shapes present in the shape schema. Shapes in            is needed. The results of a preliminary analysis over WatDiv [3]
the shape schema that do not play a role in the validation result of         are shown in Figure 3. All queries reported are subject star-shaped
the shapes covered by the query unnecessarily consume resources              queries created from the original WatDiv queries. The queries are
and can be omitted during the validation. For the query and SHACL            highly selective and cover the classes Role and ProductCategory. For
shape schema in Figure 1 no shapes can be removed from the vali-             two of the three queries, the proposed optimizations increase the
dation since the query targets the Professor shape which is linked           performance. However, for the third query, the optimizations lead
to all other shapes via inter-shape constraints.                             to a worse performance. This needs to be studied further.
                                                                         3
6   RESEARCH PLAN                                                            7    CONCLUSIONS
This work is based on previous work in SHACL validation as well              RDF knowledge graphs are gaining momentum. Therefore, the
as SPARQL query processing. The goal is to create a SPARQL query             issue of data quality must be addressed. SHACL is the W3C recom-
engine with explainable results through annotations from the vali-           mendation language for integrity constraints over RDF. SHACL is
dation of a SHACL shape schema associated with the data.                     used in more and more use-cases. Hence, the efficient validation of
    SHACL Validation. As a first step, we started to investigate the         SHACL shape schemas is necessary, so that SHACL is able to scale
problem of efficiently validating SHACL shape schemas (see Section           up to real-world scenarios with big data. Explainability and bias
5). Based on the tractable fragments of SHACL identified by Corman           detection are hot topics at the moment. Enriching SPARQL query
et al. [7] we proposed Trav-SHACL [9], a SHACL validator that                results with annotations from the SHACL shape schema validation
interleaves the data collection, rule grounding, and saturation steps.       is a first step towards explainable query results.
Additionally, Trav-SHACL rewrites the SPARQL queries sent to the
endpoint in order to optimize the queries in terms of execution time.        ACKNOWLEDGMENTS
We found that interleaving the stages of the validation process and          This work has been partially supported by the EU H2020 RIA funded
optimizing the queries improves the performance. Trav-SHACL                  projects QualiChain (No 822404) and CLARIFY (No 875160), and
benefits from scenarios with data of low quality as it is designed           the ERAMed project P4-LUCAT (No 53000015).
to find invalid instances fast; by skipping the evaluation of further
integrity constraints of already invalid instances which does comply         REFERENCES
with the SHACL specification. This behavior might need to be                  [1] Abdullah Abbas, Pierre GenevÚs, Cécile Roison, and Nabil Laya"ida. 2018. Selec-
turned off in the case a complete (detailed) explanation is required.             tivity Estimation for SPARQL Triple Patterns with Shape Expressions. In Web
                                                                                  Engineering. ICWE 2018.
    SPARQL Query Result Annotation. First, we plan to annotate                [2] Waqas Ali, Mohammad Saleem, Bin Yao, Aidan Hogan, and Axel-Cyrille Ngonga
the query result of subject star-shaped queries, i.e., SPARQL queries             Ngomo. 2021. A Survey of RDF Stores & SPARQL Engines for Querying Knowl-
where all triple patterns of the query have the same subject variable.            edge Graphs. CoRR abs/2102.13027 (2021).
                                                                              [3] GĂŒneß Aluç, Olaf Hartig, M. Tamer Özsu, and Khuzaima Daudjee. 2014. Diver-
As described in Section 5, we already made some progress on that.                 sified Stress Testing of RDF Data Management Systems. In The Semantic Web –
In order to reduce the overall execution time, the validation time                ISWC 2014.
                                                                              [4] Medina Andreßel, Julien Corman, Magdalena Ortiz, Juan L. Reutter, Ognjen
of the SHACL shape schema needs to be reduced as this is the                      Savković, and Mantas Ơimkus. 2020. Stable Model Semantics for Recursive
main bottleneck in this approach. In order to achieve that, we                    SHACL. In ACM – The Web Conference.
defined optimization techniques to limit the number of integrity              [5] Sebastian R. Bader, Jaroslav Pullmann, Christian Mader, Sebastian Tramp,
                                                                                  Christoph Quix, Andreas W. MĂŒller, Haydar AkyĂŒrek, Matthias Böckmann,
constraints and instances that need to be checked during the SHACL                Benedikt T. Imbusch, Johannes Lipp, Sandra Geisler, and Christoph Lange. 2020.
validation. Our current results show that this approach can be done.              The International Data Spaces Information Model - An Ontology for Sovereign
The next step is to formalize the optimization techniques we came                 Exchange of Digital Content. In International Semantic Web Conference ISWC.
                                                                              [6] Julien Corman, Fernando Florenzano, Juan L. Reutter, and Ognjen Savković. 2019.
up with so far. Afterwards, we plan to extend the SPARQL query                    SHACL2SPARQL: Validating a SPARQL Endpoint against Recursive SHACL
result annotation to queries containing any basic graph pattern                   Constraints. In International Semantic Web Conference ISWC Satellite Events.
                                                                              [7] Julien Corman, Fernando Florenzano, Juan L. Reutter, and Ognjen Savković.
(BGP) based on the decomposition of the BGP into star-shaped sub-                 2019. Validating SHACL Constraints over a SPARQL Endpoint. In International
queries (SSQ) [17]. This requires us to formalize an extension of the             Semantic Web Conference ISWC.
SPARQL algebra to include and aggregate the annotation of SSQs                [8] Julien Corman, Juan L. Reutter, and Ognjen Savković. 2018. Semantics and
                                                                                  Validation of Recursive SHACL. In International Semantic Web Conference ISWC.
in the SPARQL operators, e.g., join and union. The ultimate goal is           [9] MĂłnica Figuera, Philipp D. Rohde, and Maria-Esther Vidal. 2021. Trav-SHACL:
to have a SPARQL query engine that is able to provide explainable                 Efficiently Validating Networks of SHACL Constraints. In ACM – The Web Con-
query results by adding metadata from the quality assessment – in                 ference.
                                                                             [10] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. 2005. LUBM: A Benchmark for
our case SHACL validation – to the individual results of the query.               OWL Knowledge Base Systems. Web Semantics 3, 2–3 (2005).
    Additional Annotations. Adding information about the data                [11] Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo,
                                                                                  Claudio Gutierrez, José Emilio Labra Gayo, Sabrina Kirrane, Sebastian Neumaier,
quality, e.g., from the validation of a SHACL shape schema asso-                  Axel Polleres, Roberto Navigli, Axel-Cyrille Ngonga Ngomo, Sabbir M. Rashid,
ciated with the data, is only a first step towards explainability of              Anisa Rula, Lukas Schmelzeisen, Juan F. Sequeda, Steffen Staab, and Antoine
SPARQL query results. Imagine a recruitment system linked to                      Zimmermann. 2020. Knowledge Graphs. CoRR abs/2003.02320 (2020).
                                                                             [12] David N. Nicholson and Casey S. Greene. 2020. Constructing knowledge graphs
multiple SPARQL endpoints. Some of the data sources might be                      and their biomedical applications. Comput Struct Biotechnol J. 18 (2020).
private knowledge graphs containing personal information, e.g.,              [13] Natalya Fridman Noy, Yuqing Gao, Anshu Jain, Anant Narayanan, Alan Patterson,
the degrees obtained from a university. In the process of reviewing               and Jamie Taylor. 2019. Industry-scale knowledge graphs: lessons and challenges.
                                                                                  Commun. ACM 62, 8 (2019).
a job application, a recruiter gets read access to the personal in-          [14] Kashif Rabbani, Matteo Lissandrini, and Katja Hose. 2021. Optimizing SPARQL
formation of the applicant. Currently, the recruiter does not know                Queries using Shape Statistics. In Advances in Database Technology – EDBT 2021.
                                                                             [15] Katherine Thornton, Harold Solbrig, Gregory S. Stupp, Jose Emilio Labra Gayo,
if the data retrieved is actually true. Now assume that the system                Daniel Mietchen, Eric Prud’hommeaux, and Andra Waagmeester. 2019. Using
uses a blockchain – or similar technology – to record all transac-                Shape Expressions (ShEx) to Share RDF Data Models and to Guide Curation with
tions, e.g., adding RDF triples to a knowledge graph. In this case,               Rigorous Validation. In The Semantic Web – ESWC 2019.
                                                                             [16] Tolga Urhan and Micheal J. Franklin. 2000. XJoin: A Reactively-Scheduled
it is possible to annotate the query result with the information of               Pipelined Join Operator. IEEE Data Eng. Bull. 23, 2 (6 2000).
who actually added each triple. Back to the recruiter, the systems           [17] Maria-Esther Vidal, Edna Ruckhaus, Tomas Lampo, AmadĂ­s MartĂ­nez, Javier
also shows the name of the university as the entity who added the                 Sierra, and Axel Polleres. 2010. Efficiently Joining Group Patterns in SPARQL
                                                                                  Queries. In The Semantic Web: Research and Applications. ESWC 2010.
information about the applicant’s degree, so the chances are high
that the applicant really holds the degree.

                                                                         4