Reusable SHACL Constraint Components for Validating Geospatial Linked Data

Christophe Debruyne¹, Kris McGlinn²

¹ Smals Research, Avenue Fonsny 20, 1060 Brussels, Belgium
christophe.debruyne@smals.be
² ADAPT Centre, Trinity College Dublin, Dublin 2, Ireland
kris.mcglinn@adaptcentre.ie

Abstract. SHACL provides a powerful way of declaring validation rules for datasets. The built-in functions are quite limited, but we can use SPARQL to create custom constraint components. The problem is that one could end up reinventing the wheel for constraints that hold in many contexts, such as topological relationships. We present GeoSHACL, a set of GeoSPARQL-based SHACL constraint components published as Linked Data. We thus provide constraint components that can be shared and reused. By starting with the topological relations of simple features, our goal is to provide a reusable set of such constraints. This article elaborates on some of the technical design decisions and provides a brief demonstration.

Keywords: Data Quality, Data Validation, SHACL, Geospatial Linked Data

1 Introduction

The Shapes Constraint Language (SHACL) [1] is a W3C Recommendation for validating RDF [2] graphs.² While it is often presented as a means of validating Linked Data, the reality is a bit more nuanced; SHACL can validate RDF graphs in general. SHACL provides a set of “core” constructs for declaring rules (value and datatype checking, cardinality, value ranges, comparisons, etc., which can be combined with a set of logical operators). While those core constructs are arguably “limited” for modeling domain-specific constraints, SHACL does allow one to create custom components. One uses the SHACL vocabulary to declare new constraint components, but their "implementation" is done with SPARQL [3].³ The problem, however, is that many constraints may be "generic", i.e., general enough to be applicable in many domains.
It is thus more than likely that different domain experts end up reinventing the wheel when creating constraints. For instance, most (Linked Data) datasets have some geospatial dimension [4]. That geospatial dimension is often a convenient way of aligning, integrating, and relating information on the Web. However, we may want to check whether certain topological constraints hold when doing so: Belgium should border France and countries should not overlap, for instance. These simple topological constraints are arguably commonplace and should be checked to ensure a geospatial dataset’s quality. SHACL, however, does not provide support for topological relationships.

* Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
² http://www.w3.org/ns/shacl# (with the usual namespace prefix sh)
³ SHACL Advanced Features also specifies a JavaScript extension for implementing rules and constraints in that language. This is, however, not relevant to this paper.

To avoid implementing such constraint components over and over again, we propose GeoSHACL. GeoSHACL is a set of constraint components that have been published according to best practices as a Linked Data vocabulary on the Web. The contributions of this paper are the dataset and its demonstration. The dataset can easily be retrieved and used by others as part of their validation processes. Thus, we also use this paper to advocate for (a repository of) interoperable SHACL constraint components. While one currently has to include the dataset in one's own shapes graphs, it is hoped that the community will consider providing support for “importing” constraints, either via extensions of the SHACL standard or via tooling.

In this paper, we first introduce GeoSPARQL, a standard for representing and querying geospatial information on the Linked Data Web, as it provides the foundation for our topological relationships.
Then we present GeoSHACL. We will focus on some of the design considerations and implementation details. GeoSHACL will be demonstrated with a simple example. We end the paper with some concluding remarks and some of the next steps that could be undertaken after this study.

2 GeoSPARQL

The OGC GeoSPARQL [5] standard provides two things. First, it provides a vocabulary to represent geographical features and geometries, of which the latter can be expressed in either Well-Known Text (WKT) format or Geography Markup Language (GML). Features represent the "things" with a spatial dimension, such as a building, and geometries represent the spatial dimension (point, boundary, etc.). Features can be related to geometries with predicates such as geo:hasGeometry, or specializations thereof. Secondly, as its name implies, it specifies an extension to SPARQL for formulating geospatial queries. That extension consists mainly of two things:

1. Functions that yield a value. Examples include geof:sfDisjoint and geof:sfIntersects to determine whether the lexical representations of two geometries are, respectively, disjoint or intersecting.
2. A set of query-transformation rules to facilitate queries. For instance, a query with the triple pattern ?a geo:sfDisjoint ?b looking for two features or geometries that are disjoint will be rewritten as a SPARQL query with UNION keywords to match five alternatives: one looking for ?a geo:sfDisjoint ?b as an asserted triple in the graph and four for all possible combinations of ?a and ?b being bound to either a feature or a geometry. These four alternatives then avail of the geof:sfDisjoint function.

Notice that there is a difference between geo:sfDisjoint and geof:sfDisjoint. While they share the same local name, they are declared in different namespaces. The former is declared in the namespace of the GeoSPARQL ontology⁴ and refers to the predicate.
The latter is declared in GeoSPARQL’s functions namespace⁵ and refers to the functions that can be used in, for instance, filters.

It is important to note that while those transformation rules are part of the specification, not all implementations support them (by default). Some implementations require one to enable those rules explicitly.

3 GeoSHACL

GeoSPARQL is the standard for representing (complex) geospatial data on the Linked Data Web. There are some other (simple) standards for representing points in a coordinate system (e.g., longitude and latitude). We focus on GeoSPARQL for this study as these points can be converted into GeoSPARQL coordinates and GeoSPARQL supports more complex geometries. To demonstrate the viability of our approach, we first decided to focus on the so-called “simple feature relation family”, which comprises the topological relations (and corresponding functions) that a GeoSPARQL-compliant system should support. These relations and functions are sfEquals, sfDisjoint, sfIntersects, sfTouches, sfCrosses, sfWithin, sfContains, and sfOverlaps. GeoSHACL must provide support for these eight relations.

One counterintuitive quirk of GeoSPARQL is that points can never be equal, even when one compares a point with itself. This is because a criterion for equality is that boundaries must be non-empty and shared, and points have, by definition, empty boundaries. Therefore, we have provided support for an “intuitive equals”, which is based on two other relations (sfContains and sfWithin).

Two design decisions informed the development of GeoSHACL:

• We will not assume that transformation rules have been enabled, which means that the use of these constraints will rely on the lexical representations of geometries.
• A user should be able to compare the lexical representation of a geometry (via a path) with either a constant or the lexical representation of another geometry via a predicate.
The behavior thus resembles that of the SHACL core comparison operators.

We present below the specific implementation of one of the relations. All eight relations follow a similar pattern. The “intuitive equals” also uses the same pattern but uses two functions. Rather than “reusing” the predicates from GeoSPARQL, we declared predicates for the constraints in our GeoSHACL namespace. This is to avoid any ambiguity when one would provide the shapes graph to a GeoSPARQL-enabled triplestore.

⁴ http://www.opengis.net/ont/geosparql# (with the usual namespace prefix geo)
⁵ http://www.opengis.net/def/function/geosparql/ (with the usual namespace prefix geof)

On line 14, we test the case in which a constant was provided for $touches. If no lexical representation (e.g., a WKT or GML literal) is bound to either $touches or $value, this branch fails. Lines 16-18 consider the case in which a predicate was provided by the user. In that case, $touches should contain an IRI (line 16) used as the predicate of a triple pattern (line 17). The value obtained via that predicate is then passed along with the value of $value to the GeoSPARQL function (line 18). We finally note that the classes sh:ConstraintComponent and sh:SPARQLAskValidator are, of course, declared in the SHACL vocabulary.

 1. # Implementation of geof:sfTouches constraints
 2. geosh:touchesConstraint
 3.   a sh:ConstraintComponent ;
 4.   sh:parameter [
 5.     sh:path geosh:touches ;
 6.   ] ;
 7.   sh:validator [
 8.     a sh:SPARQLAskValidator ;
 9.     sh:message "Value does not touch {$touches}." ;
10.     sh:ask """
11.       PREFIX geo: <http://www.opengis.net/ont/geosparql#>
12.       PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
13.       ASK {
14.         { FILTER( geof:sfTouches($value, $touches) ) }
15.         UNION {
16.           FILTER( isIRI($touches) )
17.           $this $touches ?otherValue .
18.           FILTER( geof:sfTouches($value, ?otherValue) )
19.         }
20.       }""" ;
21.   ] ;
22. .

One could argue that the function could have been provided as an argument for the constraint component, thereby reducing the number of components.
SHACL does not, however, support the use of a variable in the position where a function call is made, as variables must contain RDF terms. Another approach could have been to test for the different values in one large query. Not only would that have impeded the efficiency of the approach (i.e., computational overhead), it would also have made our approach less extensible. With separate components, the inclusion of a new relation merely requires extending the shapes graph and not changing an existing query.

While not an ontology in the traditional sense (e.g., an OWL 2 ontology), we have stored GeoSHACL as an instance of an ontology⁶. This allowed us to provide metadata for both the ontology and the constraint components we have developed. The documentation of the ontology was generated with WIDOCO [6]. The artifact has been published according to best practices and guidelines within the community (e.g., the use of permanent URIs, content negotiation, and a permissive license). GeoSHACL contains the implementation of nine constraint components. These correspond to the eight simple relations and our "intuitive equals".

⁶ Namespace geosh: https://w3id.org/geoshacl#

4 Demonstration

In this section we demonstrate GeoSHACL. This simple example will illustrate the use of the intuitive equals and the use of both a constant and a predicate in our shapes. Our data graph looks as follows (prefixes are omitted):

# Two point features, each with its own geometry
ex:Point1 a geo:Feature, ex:Point ;
  geo:hasGeometry [
    a geo:Geometry ;
    geo:asWKT "Point(1 1)"^^geo:wktLiteral
  ] ;
.
ex:Point2 a geo:Feature, ex:Point ;
  geo:hasGeometry [
    a geo:Geometry ;
    geo:asWKT "Point(2 2)"^^geo:wktLiteral ;
  ] ;
.
# A square geometry that asserts it contains both points
ex:SquareGeom a geo:Geometry ;
  geo:sfContains ex:Point2, ex:Point1 ;
  geo:asWKT "POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))"^^geo:wktLiteral ;
.
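
The data graph above omits its prefix declarations. As a minimal sketch of what those declarations could look like: the sh, geo, and geosh namespaces below are the ones stated in this paper, whereas the ex namespace is not given in the paper and is an assumption made purely for illustration.

```turtle
# Namespaces as stated in this paper
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix geo:   <http://www.opengis.net/ont/geosparql#> .
@prefix geosh: <https://w3id.org/geoshacl#> .

# Assumption: the paper does not state the ex: namespace
@prefix ex:    <http://example.org/> .
```

The geof prefix is only needed inside the SPARQL queries of the constraint components, which declare it themselves.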
In our data graph, we have two features (points), each with a geometry, and another geometry representing a square. The square asserts that it contains both points, though one can see that one of the points lies outside the square. Our shapes graph using GeoSHACL looks as follows (prefixes are omitted):

# Check whether all points are equal with Point(1 1)
ex:PointShape
  a sh:NodeShape ;
  sh:targetClass ex:Point ;
  sh:property [
    sh:path (geo:hasGeometry geo:asWKT) ;
    geosh:intuitiveEquals "Point(1 1)"^^geo:wktLiteral
  ] ;
.
# Are things that geometries contain actually within?
ex:SquareGeomShape
  a sh:NodeShape ;
  sh:targetClass geo:Geometry ;
  sh:property [
    sh:path (geo:sfContains geo:hasGeometry geo:asWKT) ;
    geosh:within geo:asWKT ;
  ] ;
.

After passing both the data graph and the shapes graph to a SHACL engine, which for our demo is the Apache Jena implementation⁷, the engine can detect the errors using the GeoSPARQL functions (see Listing 1).

⁷ https://jena.apache.org/documentation/shacl/

Node=ex:Point2
  Path=geo:hasGeometry/geo:asWKT
  Value: "Point(2 2)"^^geo:wktLiteral
  Message: Value is not intuitively equal to Point(1 1)
Node=ex:SquareGeom
  Path=(geo:sfContains/geo:hasGeometry)/geo:asWKT
  Value: "Point(2 2)"^^geo:wktLiteral
  Message: Value is not within geo:asWKT.

Listing 1. The validation report after validating the data graph with the shapes graph

Even though our example uses WKT to represent geometries, this does not mean that our solution is solely intended for WKT. Apache Jena's GeoSPARQL implementation also supports GML. For GeoSHACL to work, the underlying SPARQL engine needs to (correctly) support GeoSPARQL. If that is not the case, the user may not notice that the validation process fails and that the validation report may be invalid. For example, when GeoSPARQL functions are not supported, one will observe that those FILTERs "fail gracefully": the error within the FILTER results in a solution not being retained, so no violation is reported.
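
One way to make such a silent failure visible is a sanity-check shape that reports a violation whenever a GeoSPARQL function cannot be evaluated. The following is a minimal sketch, not part of GeoSHACL: the names ex:GeoSPARQLSanityShape and ex:sanityCheckNode and the ex namespace are hypothetical, and the inline PREFIX declarations follow the style of the listing in Section 3. It assumes an engine that treats a call to an unknown function as an evaluation error; an engine that rejects the whole query instead would abort rather than report a violation.

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .   # assumed namespace, not from the paper

# Hypothetical sanity-check shape: yields a violation when geof:sfWithin
# cannot be evaluated or returns an unexpected result.
ex:GeoSPARQLSanityShape
  a sh:NodeShape ;
  # A target node need not occur in the data graph.
  sh:targetNode ex:sanityCheckNode ;
  sh:sparql [
    a sh:SPARQLConstraint ;
    sh:message "geof:sfWithin is unavailable or returned an unexpected result." ;
    sh:select """
      PREFIX geo: <http://www.opengis.net/ont/geosparql#>
      PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
      SELECT $this WHERE {
        # POINT(1 1) lies strictly inside this square, so sfWithin should be true.
        BIND( geof:sfWithin(
                "POINT(1 1)"^^geo:wktLiteral,
                "POLYGON((0 0, 0 2, 2 2, 2 0, 0 0))"^^geo:wktLiteral ) AS ?result )
        # An evaluation error leaves ?result unbound; every solution is a violation.
        FILTER( !bound(?result) || !?result )
      }""" ;
  ] ;
.
```

The sketch inverts the graceful failure: because a SELECT-based constraint reports one violation per solution, an unbound ?result now produces a visible validation result instead of being dropped. One such shape per function would be needed to cover all relations.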
When using GeoSHACL, one has to ensure that the GeoSPARQL engine is compliant by using benchmarks such as [7]. It is possible to define SHACL rules that test the existence of GeoSPARQL functions, however.

5 Discussion

Due to the flexible nature of data represented using Semantic Web technologies, SHACL constraints have been proposed as a solution for validating geospatial datasets by the W3C working group that developed the “Spatial Data on the Web Best Practices” [8]. In practice, there are few examples of their application. In [9], Huang et al. present the use of SHACL for validating the results of the integration of geospatial and traffic data to ensure that semantic correctness is maintained at different levels of detail within the representations of geospatial data. In [10], Stolk et al. demonstrate the use of SHACL constraints for validating a Building Information Model standard called Industry Foundation Classes, which itself has been derived from a GIS geometry.

In this study, we considered the support for the eight topological relations that are part of the simple features specifications. While seemingly limited, we consider this an important step toward shareable and reusable SHACL constraints. Given the growing interest in representing geospatial data as Linked Data, we expect this work to be useful in providing a set of reusable SHACL constraints for researchers who wish to validate their data represented using GeoSPARQL. We foresee the support of other constraint components and consider validating the relationships between features and geometries (next to referring to literals) as logical next steps.

6 Conclusions

We presented GeoSHACL, which provides shareable and reusable constraint components for topological relations between simple features. GeoSHACL is built on top of GeoSPARQL.
The motivation of this work is that while SHACL is powerful, one also has to consider sharing and reusing constraint components that may hold in many domains, applications, etc. While seemingly simple, we hope that this paper provides a first step towards realizing this.

With respect to GeoSHACL, there is room for future work. One direction is the inclusion of the other topological relations that GeoSPARQL provides. As we do not assume that SPARQL engines support the transformation rules, we aim to investigate how we could rewrite or extend the queries in GeoSHACL to refer to features and geometries next to the lexical representations. And while the SPARQL engine's support for (and compliance with) GeoSPARQL falls outside the scope of GeoSHACL, GeoSHACL can be extended to test the availability of GeoSPARQL functions.

Acknowledgements. Kris McGlinn is supported by the ADAPT Centre for Digital Content Technology, which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

References

1. Knublauch, H., Kontokostas, D.: Shapes Constraint Language (SHACL), https://www.w3.org/TR/shacl/.
2. Cyganiak, R., Wood, D., Lanthaler, M.: RDF 1.1 Concepts and Abstract Syntax, https://www.w3.org/TR/rdf11-concepts/.
3. Seaborne, A., Harris, S.: SPARQL 1.1 Query Language, https://www.w3.org/TR/sparql11-query/.
4. Shadbolt, N., O’Hara, K., Berners-Lee, T., Gibbins, N., Glaser, H., Hall, W., Schraefel, M.C.: Linked Open Government Data: Lessons from Data.gov.uk. IEEE Intell. Syst. 27, 16–24 (2012). https://doi.org/10.1109/MIS.2012.23.
5. Open Geospatial Consortium: OGC GeoSPARQL - A Geographic Query Language for RDF Data, https://www.ogc.org/standards/geosparql.
6. Garijo, D.: WIDOCO: A wizard for documenting ontologies. In: The Semantic Web - ISWC 2017 - 16th International Semantic Web Conference, Vienna, Austria, October 21-25, 2017, Proceedings, Part II. pp. 94–102. Springer (2017).
https://doi.org/10.1007/978-3-319-68204-4_9.
7. Jovanovik, M., Homburg, T., Spasić, M.: Software for the GeoSPARQL compliance benchmark. Softw. Impacts. 8, 100071 (2021). https://doi.org/10.1016/j.simpa.2021.100071.
8. Tandy, J., van den Brink, L., Barnaghi, P.: Spatial Data on the Web Best Practices, https://www.w3.org/TR/sdw-bp/.
9. Huang, W., Kazemzadeh, K., Mansourian, A., Harrie, L.: Towards Knowledge-Based Geospatial Data Integration and Visualization: A Case of Visualizing Urban Bicycling Suitability. IEEE Access. 8, 85473–85489 (2020). https://doi.org/10.1109/ACCESS.2020.2992023.
10. Stolk, S., McGlinn, K.: Validation of IfcOWL datasets using SHACL. In: Proceedings of the 8th Linked Data in Architecture and Construction Workshop, Dublin, Ireland, June 17-19, 2020 (virtually hosted). pp. 91–104 (2020).