SoCK: SHACL on Completeness Knowledge
Muhammad Jilham Luthfi, Fariz Darari and Amanda Carrisa Ashardian
Faculty of Computer Science, Universitas Indonesia, Depok, West Java, Indonesia
Abstract
The proliferation of applications based on knowledge graphs (KGs) in recent years has created
increasing demands for high-quality KGs. Completeness is an important quality aspect concerning the
breadth, depth, and scope of data in KGs. In that light, describing and validating completeness over KGs
have become a must in order to make an informed decision on whether or not to use (parts of) KGs. In this
paper, we propose SoCK (short for SHACL on Completeness Knowledge), a pattern-oriented framework
to support the creation and validation of knowledge about completeness in KGs. The framework relies
on SHACL, a W3C recommendation for validating KGs against a collection of constraints. In SoCK,
we first offer a number of patterns capturing how completeness requirements are typically expressed
in a high-level way. Such completeness patterns can then be instantiated in various manners over
different KG domains. These instantiations result in SHACL shapes that can be validated against KGs to
provide a completeness profile of the KGs. As a proof-of-concept, we implement and demonstrate the
SoCK framework as a Python library, creating over 360k SHACL shapes for real-world KGs (in our case,
DBpedia and Wikidata) based on the aforementioned completeness patterns. We also develop a web app
to serve as an information point for anything about SoCK, available at https://sock.cs.ui.ac.id/.
Keywords
SHACL, Completeness, Patterns, Shapes, Validation, DBpedia, Wikidata
1. Introduction
The current massive development of knowledge graphs (KGs) may cause various problems
related to data quality [1]. The quality of data can affect application quality and is related to
problems in publishing and using data [2]. An aspect of quality is completeness of information.
Completeness measures how much information is contained in a dataset [3]. It is an aspect of
data quality related to the breadth, depth, and scope of information contained in data [4]. The
aspect of completeness is one of the most important ones in measuring data quality and can
indirectly affect other aspects, such as data accuracy and consistency [5]. As for open KGs (e.g.,
DBpedia and Wikidata), their collaborative nature leads to a higher diversity of data, so that data
quality, especially the aspect of completeness, is of particular concern [6].
Figure 1 shows two Wikidata entities of type hotel with different levels of completeness:
Beverly Hills Hotel is missing the hotel rating information, as opposed to Hotel
Indonesia (as of July 10, 2022). Nevertheless, both hotels are still complete with respect to the
information of: instance of, country, and review score. Such a case indicates that the quality of
applications using hotel data on Wikidata may suffer from data incompleteness, depending on
whether or not hotel rating information is required by the applications.
WOP2022: 13th Workshop on Ontology Design and Patterns, October 23-24, 2022, Hangzhou, China
Envelope-Open muhammad.jilham@ui.ac.id (M. J. Luthfi); fariz@ui.ac.id (F. Darari); amanda.carrisa@ui.ac.id (A. C. Ashardian)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
Figure 1: Data on Wikidata about (a) Beverly Hills Hotel and (b) Hotel Indonesia Kempinski Jakarta
One way to be better informed about completeness in KGs is by means of data validation. In
2017, W3C introduced a standard for validating KG data called Shapes Constraint Language
(SHACL) [7]. SHACL addresses the growing need for consuming data of good quality through
a validation process. SHACL validates data by applying a set of conditions that are combined
into a “shape” and expressed in an RDF graph. Nevertheless, to the best of our knowledge
there has been no research that specifically and systematically prioritizes SHACL as a basis for
checking the quality of KG data on the aspect of completeness.
Validating data completeness in a KG could be done by collecting similar cases of completeness
to form a completeness pattern. Such a pattern can then be reused and adapted to various
domains in different KGs. An ontology design pattern (ODP) is a modeling solution related
to repeated problems in ontologies and KGs [8]. Our study aims to fill the gap from previous
studies in developing a pattern-oriented solution based on SHACL for checking KG completeness.
Particularly, we focus on four issues that revolve around the problem of completeness patterns:
(𝑖) identification of completeness patterns; (𝑖𝑖) instantiation of completeness patterns into SHACL
shapes; (𝑖𝑖𝑖) development of a (Python) library to support our completeness validation process
and a web app to provide information points; and (𝑖𝑣) evaluation of completeness over real-world
KGs, that is, DBpedia and Wikidata.
2. Preliminaries
Data Completeness. Data quality reflects the fitness for use of data by data con-
sumers [4]. Poor data quality can pose substantial risks to decision making in organizations.
Beyond accuracy, data quality aspects include relevancy, timeliness, and completeness. Data
completeness concerns the breadth, depth, and scope of data w.r.t. the task at hand [4].
In the context of KGs, the quality of completeness can be classified into seven types [5].
The first two are schema completeness, which is the degree to which properties of classes
are sufficiently represented, and property completeness, which concerns the completeness
of property values (e.g., Albert Einstein was married twice). The third type is population
completeness, which checks the coverage of real-world objects as stored in a KG. The fourth
one is interlinking completeness, referring to the degree to which entities in different KGs
are interlinked. The last three types are: currency completeness, examining how property
values are well represented over time; metadata completeness, observing if sufficient metadata
is available; and labeling completeness, which has to do with the existence of human- and
machine-readable labels.
SHACL. Shapes Constraint Language (SHACL) is a W3C recommendation to describe con-
straints and validate KGs against those constraints [7]. Such constraints are provided as SHACL
shapes, which themselves can be written as an RDF graph [9]. In SHACL, RDF graphs
representing shapes are called “shapes graphs”, and these shapes graphs can then be employed
to validate RDF graphs (= “data graphs”). SHACL offers some advantages in that it is
high-level, declarative, concise, and yet feature-rich [10].
SHACL comes in two flavors: SHACL Core and SHACL-SPARQL. The former captures
features frequently needed for the representation of shapes and constraints, whereas the latter
extends the former by adding advanced features of constraints based on SPARQL, a query
language for RDF KGs. SHACL Core supports a variety of basic constraint components, such as
value types, cardinalities, and string filters. On the other hand, SHACL-SPARQL provides more
expressiveness due to the flexibility of SPARQL SELECT queries in defining SHACL constraints.
SHACL use cases include documentation, user interface generation, validation during RDF
data production/consumption, and quality control [11]. In the context of quality control, a typical
scenario for SHACL usage is that first a domain expert expresses quality requirements in natural
language and then a knowledge engineer translates these requirements into SHACL shapes, to
be validated against KGs of interest. Figure 2 illustrates a SHACL shape for the constraint that
all instances of type person must have a name and a birth date. Validating a KG against that
constraint may uncover the quality issue as to whether there exist person instances in that KG
without a name or a birth date.
# these prefixes are used throughout this paper
@prefix ex: <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:PersonShape a sh:NodeShape;
    sh:targetClass ex:Person;
    sh:property [ a sh:PropertyShape;
        sh:path ex:name;
        sh:minCount 1 ];
    sh:property [ a sh:PropertyShape;
        sh:path ex:birthDate;
        sh:minCount 1 ].
Figure 2: A SHACL shape in Turtle for “every person must have a name and a birth date”
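For intuition, a minimal data graph that this shape could be validated against might look as follows (the entities are illustrative, reusing the example ex: namespace); validating it would flag ex:alice for the missing birth date:

```turtle
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# ex:alice violates the shape: a name is present, the birth date is missing
ex:alice a ex:Person ;
    ex:name "Alice" .

# ex:bob conforms: both required properties are present
ex:bob a ex:Person ;
    ex:name "Bob" ;
    ex:birthDate "1990-01-01"^^xsd:date .
```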
Patterns. Ontology design patterns (ODPs) or for simplicity, patterns, are a modeling solu-
tion to recurrent problems in ontology designs [12, 8]. Patterns capture problems commonly
encountered in the development of ontologies & KGs, and make more explicit and reusable
best practices in solving such problems. Benefits of using patterns include easier ontology
design processes by both knowledge engineers and domain experts, as well as increased quality
& interoperability of developed ontologies & KGs [13]. Several interesting ontologies and
KG-based applications developed using pattern-based approaches are chess games [14], historic
slave trade [15], and open data conversion to Wikidata [16].
Gangemi and Presutti [8] present a classification of patterns for ontology design. Those
six families of patterns are: structural patterns, correspondence patterns, content patterns,
reasoning patterns, presentation patterns, and lexico-syntactic patterns. In this work, we focus
on content patterns, which encode conceptual solutions for ontology & KG (quality) modeling
problems. In principle, content patterns do not depend on any specific language. Nevertheless,
in our work we rely on SHACL as a reference formalism for representing reusable, practical
building blocks of content patterns.
3. Completeness Patterns and Instantiations
This section introduces data completeness patterns developed based on SHACL and approaches
to instantiating them.
3.1. Completeness Patterns
Issa et al. [5] synthesized previous studies that discussed the completeness aspect of KGs. Of
the seven completeness types proposed, we use five types as completeness patterns. We also
add one additional completeness pattern, which we will see below. Each completeness pattern
may have several SHACL patterns, that is, patterns that generalize the SHACL syntax [7]
with %%VARIABLE%% placeholders that can later be instantiated into SHACL shapes. Table 1
shows eight SHACL patterns according to the six completeness patterns.
Schema completeness pattern (SC). A completeness pattern about the extent to which
entities in a KG have the required properties for a class, such as entities of the class “Human”
must have the properties (among others) of name and date of birth. The definition of mandatory
properties in the same class may differ depending on the specific needs for the task at hand.
Property completeness pattern (PC). A completeness pattern related to the degree to
which property values of a specific property in a particular entity are present. For example, the
entity Joe Biden must have four values for a “child” property. This completeness only focuses
on one property with varying cardinalities among entities.
Table 1: List of SHACL patterns based on completeness patterns

SC1
  %%SHAPE-NAME%% a sh:NodeShape;
      sh:targetClass %%CLASS%%;
      sh:property [ a sh:PropertyShape;
          sh:path %%PROPERTY-01%%;
          sh:minCount 1 ];
      ...
      sh:property [ a sh:PropertyShape;
          sh:path %%PROPERTY-NN%%;
          sh:minCount 1 ].
  All members of a class are required to have the mandatory properties from PROPERTY-01
  to PROPERTY-NN. A SHACL shape instantiating such a pattern is exemplified in Figure 2.

PC1
  %%SHAPE-NAME%% a sh:NodeShape;
      sh:targetNode %%NODE%%;
      sh:property [ a sh:PropertyShape;
          sh:path %%PROPERTY%%;
          sh:minCount %%COUNT%% ].
  An entity (= node) is complete for PROPERTY whenever the number of the property
  values equals COUNT.

NVC1
  %%SHAPE-NAME%% a sh:NodeShape;
      sh:targetNode %%NODE%%;
      sh:property [ a sh:PropertyShape;
          sh:path %%PROPERTY%%;
          sh:minCount 0 ].
  An entity (= node) is complete for PROPERTY despite the absence of property values.

POC1
  %%SHAPE-NAME%% a sh:NodeShape;
      sh:targetNode %%NODE%%;
      sh:property [ a sh:PropertyShape;
          sh:path [ sh:inversePath %%TYPE-PROPERTY%% ];
          sh:minCount %%COUNT%% ].
  A population is complete whenever the number of its members equals COUNT. The use of
  sh:inversePath makes it possible to count the subjects of TYPE-PROPERTY.

LDC1
  %%SHAPE-NAME%% a sh:NodeShape;
      sh:targetClass %%CLASS%%;
      sh:property [ a sh:PropertyShape;
          sh:path %%LABEL-OR-DESCRIPTION-PROPERTY%%;
          sh:minCount 1 ].
  All members of a class are required to have a LABEL-OR-DESCRIPTION-PROPERTY. The
  pattern can be modified to specific nodes using sh:targetNode.

LDC2
  %%SHAPE-NAME%% a sh:NodeShape;
      sh:targetClass %%CLASS%%;
      sh:property [ a sh:PropertyShape;
          sh:path %%LABEL-OR-DESCRIPTION-PROPERTY%%;
          sh:qualifiedMinCount 1;
          sh:qualifiedValueShape [
              sh:languageIn (%%LANGUAGE%%) ] ].
  This pattern extends the pattern LDC1 by adding the LANGUAGE requirement for the given
  LABEL-OR-DESCRIPTION-PROPERTY.

IC1
  %%SHAPE-NAME%% a sh:NodeShape;
      sh:targetClass %%CLASS%%;
      sh:property [ a sh:PropertyShape;
          sh:path %%INTERLINKING-PROPERTY%%;
          sh:minCount 1 ].
  All members of a class are required to have an INTERLINKING-PROPERTY, which can be
  a generic one (e.g., owl:sameAs, schema:sameAs, and skos:exactMatch) or a specific one
  (e.g., dbo:isbn for DBpedia and wdt:P214 as VIAF ID for Wikidata).

IC2
  %%SHAPE-NAME%% a sh:NodeShape;
      sh:targetClass %%CLASS%%;
      sh:property [ a sh:PropertyShape;
          sh:path %%INTERLINKING-PROPERTY%%;
          sh:qualifiedMinCount 1;
          sh:qualifiedValueShape [
              sh:pattern %%NAMESPACE-URI%% ] ].
  This pattern extends that of IC1 by adding a requirement that the linked entity must
  come from a specific NAMESPACE-URI (e.g., http://dbpedia.org/resource/).
No-value completeness pattern (NVC). A completeness pattern that deals with the extent to which
entities in a KG capture information about the absence of a property value. In this context,
property values might be “unavailable” for two possible reasons: the property values do not
exist in the real world, or they are intentionally removed for privacy or confidentiality reasons.
For example, the entity Keanu Reeves has no children. Strictly speaking, this pattern is a
special case of the property completeness pattern. Nevertheless, the no-value completeness
pattern deserves special attention due to its peculiarity: shapes instantiated from the
no-value completeness pattern may represent negative facts (i.e., non-existence) [17], and
any KG would trivially satisfy such shapes.
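To contrast the two patterns, here are hypothetical instantiations of PC1 and NVC1 for the two running examples (the entity and property IRIs use an illustrative ex: namespace, not actual DBpedia/Wikidata identifiers):

```turtle
@prefix ex: <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

# PC1 instance: Joe Biden is complete for "child" with four values
ex:JoeBidenChildShape a sh:NodeShape ;
    sh:targetNode ex:JoeBiden ;
    sh:property [ a sh:PropertyShape ;
        sh:path ex:child ;
        sh:minCount 4 ] .

# NVC1 instance: Keanu Reeves has no children; the absence itself is complete
ex:KeanuReevesChildShape a sh:NodeShape ;
    sh:targetNode ex:KeanuReeves ;
    sh:property [ a sh:PropertyShape ;
        sh:path ex:child ;
        sh:minCount 0 ] .
```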
Population completeness pattern (POC). A completeness pattern related to the extent to
which entities in the real world are present as members (or instances) of a population. For
example, all provincial entities in Indonesia are members of the class “Provinces in Indonesia”. In
such a pattern, the membership of a population is usually expressed through typing properties
like rdf:type, dct:subject, dbo:type (for DBpedia), and wdt:P31 (for Wikidata).
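An illustrative POC1 instance counting the members of a population via sh:inversePath could look as follows (the ex: IRIs and the member count are hypothetical, for illustration only):

```turtle
@prefix ex: <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

# The class node is the target; members are counted via the inverse of the
# typing property, i.e., all subjects typed as ex:ProvinceOfIndonesia.
ex:IndonesianProvincesShape a sh:NodeShape ;
    sh:targetNode ex:ProvinceOfIndonesia ;
    sh:property [ a sh:PropertyShape ;
        sh:path [ sh:inversePath rdf:type ] ;
        sh:minCount 34 ] .  # hypothetical member count
```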
Label & description completeness pattern (LDC). A completeness pattern related to the
degree to which entities in a KG have human-readable labels and descriptions. One of the
most commonly used label properties is rdfs:label. An alternative property is skos:prefLabel.
In addition, the language aspect of the label can be considered in this pattern. Apart from the
label, the pattern also concerns description properties, such as rdfs:comment, schema:description,
and dbo:abstract (for DBpedia).
Interlinking completeness pattern (IC). A completeness pattern concerning the degree to
which the same entity links to each other in various KGs. For example, the entity “Indonesia”
on DBpedia is linked to the same entity on Wikidata. One of the commonly used properties
to define such linking is owl:sameAs. Other general properties with the same intention are
schema:sameAs and skos:exactMatch. Furthermore, there are also properties that specifically link
to particular external authorities, such as ISBN, DOI, and VIAF. It is worth noting that the
pattern IC1 in Table 1 would only check the existence of interlinking properties regardless of
the number of property values. We argue that attempting to impose some value counting on
interlinking properties would be tricky to do since there can be an arbitrary number of KGs
which may contain the entities to be linked. However, should one want to check the existence of
interlinking properties to entities in some specific KG, the pattern IC2 which features namespace
qualifiers can be utilized.
3.2. Completeness Pattern Instantiations
Given the above completeness patterns, a question arises as to how one may instantiate those
patterns. We identify a number of such approaches: manual, spreadsheet, automatic, ontology,
and statistics. We also list the advantages and disadvantages of each approach in Table 2.
Manual. In this approach, a user instantiates a completeness pattern by hand. For example,
with respect to the pattern SC1, the user will selectively choose the properties of a class that
must exist. Afterwards, the selected properties will be substituted into the property variables of
the SHACL pattern.
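A minimal sketch of this substitution step in Python (the %%...%% placeholder syntax follows Table 1; the shape, class, and property names below are illustrative):

```python
# Instantiate the SC1 pattern by substituting its placeholder variables.
SC1_TEMPLATE = """{shape_name} a sh:NodeShape ;
    sh:targetClass {cls} ;
{props} ."""

PROP_TEMPLATE = """    sh:property [ a sh:PropertyShape ;
        sh:path {prop} ;
        sh:minCount 1 ]"""

def instantiate_sc1(shape_name: str, cls: str, properties: list[str]) -> str:
    """Fill the SC1 SHACL pattern with a shape name, a target class, and
    the manually selected mandatory properties."""
    props = " ;\n".join(PROP_TEMPLATE.format(prop=p) for p in properties)
    return SC1_TEMPLATE.format(shape_name=shape_name, cls=cls, props=props)

shape = instantiate_sc1("ex:PersonShape", "ex:Person", ["ex:name", "ex:birthDate"])
print(shape)
```

The result is the shape of Figure 2 (modulo whitespace), obtained mechanically from the user's property selection.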
Spreadsheet. This approach gathers completeness information (e.g., number of children and
number of authors) into a spreadsheet. Next, a program reads the spreadsheet and generates
the corresponding SHACL shapes based on the collected completeness information. Here we
use this approach to instantiate the property completeness pattern.
Automatic. The automatic approach relies on data from KGs themselves to provide com-
pleteness information. Wikidata, for example, has the information of “properties for this type”
(P1963) which lists properties that normally apply for a class. Furthermore, Wikidata also owns
counting properties like “number of children” (P1971) which may give hints on the cardinality
of property values (in this case, the property “child”). External APIs may also be leveraged
to provide completeness information, such as Crossref for the number of complete authors of
a scholarly article.1
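On the Wikidata side, this approach essentially boils down to a SPARQL lookup of P1963 values for a class. A hedged sketch of constructing such a query (to be sent, e.g., via SPARQLWrapper to the Wikidata endpoint; here we only build the string):

```python
# Build a SPARQL query fetching "properties for this type" (P1963) of a class.
PREFIXES = """PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
"""

def properties_for_type_query(class_qid: str) -> str:
    """Return a query listing the expected properties of a Wikidata class."""
    return PREFIXES + ("SELECT ?prop WHERE { wd:%s wdt:P1963 ?prop }" % class_qid)

query = properties_for_type_query("Q5")  # Q5 = human
print(query)
```

The returned properties can then be substituted into SC1 as in the manual approach, but without hand-picking them.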
Ontology. One may also instantiate completeness patterns by utilizing the ontology structure
of a KG, such as using rdfs:domain.2 Such ontological information provides properties that
typically apply for a class. More specifically, referring to the statement P rdfs:domain C, we
can assume that property P associates with instances of C.
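A sketch of this idea over a flat list of (property, rdfs:domain) pairs, e.g., as extracted from an ontology file (the DBpedia-style names below are illustrative):

```python
# Collect the properties whose rdfs:domain is a given class; these become
# the mandatory properties when instantiating SC1 for that class.
def properties_with_domain(domain_pairs, cls):
    """domain_pairs: iterable of (property, domain_class) statements."""
    return [prop for prop, domain in domain_pairs if domain == cls]

pairs = [("dbo:elevation", "dbo:Place"),
         ("dbo:birthDate", "dbo:Person"),
         ("dbo:deathDate", "dbo:Person")]
print(properties_with_domain(pairs, "dbo:Person"))
# → ['dbo:birthDate', 'dbo:deathDate']
```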
1 https://www.crossref.org/documentation/retrieve-metadata/rest-api/
2 https://www.w3.org/TR/rdf-schema/#ch_domain
Statistics. Statistical information may be useful for instantiating completeness patterns. Over
a class, one may list the frequency of property usage in a class, and then take the top-N most
frequent properties as the mandatory properties of the class.
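A minimal sketch of this top-N selection, given (entity, property) usage pairs collected from a class in a KG (the names below are illustrative):

```python
from collections import Counter

def top_n_properties(usage_pairs, n):
    """usage_pairs: iterable of (entity, property) occurrences in a class.
    Returns the n most frequently used properties."""
    counts = Counter(prop for _entity, prop in usage_pairs)
    return [prop for prop, _freq in counts.most_common(n)]

usage = [("e1", "ex:name"), ("e1", "ex:birthDate"), ("e2", "ex:name"),
         ("e3", "ex:name"), ("e3", "ex:spouse")]
print(top_n_properties(usage, 2))
# → ['ex:name', 'ex:birthDate']
```

The selected properties are then treated as the mandatory properties of the class when instantiating SC1.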
Table 2: Advantages and Disadvantages of Various Completeness Instantiation Approaches

Manual
  Advantages: Enables a more detailed and customized making of shapes, thus resulting in a
  completeness evaluation with high precision and quality.
  Disadvantages: Manual checking can be time consuming, requires prior knowledge about
  SHACL, and is prone to human errors.

Spreadsheet
  Advantages: Enables easier data collection with high quality.
  Disadvantages: It is time consuming and is not as flexible nor as detailed as the manual
  approach.

Automatic
  Advantages: Gives the benefit of quick and easy generations over an abundance of
  completeness instantiations (of some form).
  Disadvantages: Requires quality checking over generated instantiations.

Ontology
  Advantages: Uses terminological knowledge from ontologies made by domain experts, thus
  enabling the reuse of existing ontologies.
  Disadvantages: The quality of the results depends on the source ontology being used; a
  poor-quality ontology would tend to produce lower-quality instantiations.

Statistics
  Advantages: Data driven and does not require domain understanding.
  Disadvantages: Manual checking is still required to ensure that the captured
  instantiations make sense.
4. SoCK Library and Web App
In this section, we present our SoCK library for completeness instantiation and validation, as
well as our SoCK web app as an information point for the SoCK framework.
4.1. SoCK Library
The SoCK library provides a Python-based implementation of instantiating and validating
completeness patterns.3 The library supports the pattern instantiation process as discussed
in Subsection 3.2. We use the library to create SHACL shapes based on the aforementioned
completeness patterns for real-world KGs like DBpedia and Wikidata.
The SoCK library can also be used to validate the completeness of KGs based on the instanti-
ated patterns. The validation process requires both the data graph and (completeness) shapes
graph as the input. Thus, the first step is SPARQL querying to get the corresponding data of
3 Our library is available at https://github.com/JillyCS15/sock-validator
relevant entities and properties. Here we rely on SPARQLWrapper4 for querying and RDFLib5
for RDF data creation and manipulation. The library currently does not support validation over
remote data graphs; hence, the data graphs to be validated have to be available locally on the
same machine where the library resides. Next, the collected data graph is validated against the
(completeness) shapes graph from the instantiation process as described before. The validation
results in validation reports, informing the completeness profiles of KG entities. We reuse
PySHACL6 to support the validation step. The whole flow of validation process is displayed in
Figure 3.
Figure 3: Validation flow in SoCK
4.2. SoCK Web App
The SoCK web app is developed to serve as an information point about the SoCK framework.
The web app is available at https://sock.cs.ui.ac.id, including a demo video at
https://youtu.be/FtJ3HO_6YHcq. This web app offers features for users in learning about completeness patterns
and creating their instances or shapes. For example, an instance of the label & description
pattern is the SHACL shape of “American Film”.7 Another example is an instance of the
population completeness pattern, about “G20 Nations”.8 Further pattern instances can be
viewed at https://sock.cs.ui.ac.id/instance/?page=1.
The SoCK web app uses the client-server architecture. In particular, it adapts a three-tier
architecture that separates the application into three logical layers: presentation, application,
and database. The presentation relies on basic web stack (i.e., HTML, CSS, and JavaScript),
whereas the database makes use of SQLite. In developing the application, we employ the Django
framework.
5. Case Studies: Wikidata and DBpedia
In this section, we discuss the evaluation results of our SoCK framework on real-world KGs,
that is, Wikidata and DBpedia. We create SHACL shapes out of our presented completeness
patterns and validate those shapes against the KGs. Overall, we successfully validate 928,310 entities
4 https://sparqlwrapper.readthedocs.io/
5 https://rdflib.readthedocs.io/
6 https://pypi.org/project/pyshacl/
7 https://sock.cs.ui.ac.id/instance/show-pattern-instance-detail/12/
8 https://sock.cs.ui.ac.id/instance/show-pattern-instance-detail/5/
from Wikidata and DBpedia with 360,162 completeness pattern instances (= shapes). Further
evaluation discussion for each completeness pattern is provided as follows.
Schema Completeness Validation. Here we instantiate the pattern SC1 via three approaches,
that is, automatic, ontology, and statistics, collecting in total 1,106 SHACL shapes. Through the
automatic approach using “properties for this type” (P1963), we successfully validate 469,891
Wikidata entities using 1,095 pattern instances. In the validation, we record for each entity the
completeness percentage of properties listed in “properties for this type”. The validation results
are as follows: 35% of entities are complete for 0–19% of properties, 17% of entities for 20–39% of
properties, 17% of entities for 40–59% of properties, 11% of entities for 60–79% of properties, and
20% of entities for 80–100% of properties.
Next, through the ontology approach, we validate 5,000 DBpedia sample entities using five
pattern instances from the classes “Hotel”, “Magazine”, “Museum”, “University”, and “Person”.
The validation results show that the completeness of DBpedia entities from those classes is
still quite low, in that 78% of all entities have no more than 10% of the corresponding class
properties obtained via the ontology approach.
As for the last schema completeness experiment, we validate 6,000 DBpedia sample entities
from the classes “Country”, “Person”, “Activity”, “Film”, “Museum”, and “University” using
six pattern instances generated with the statistics approach, where we take the top-10 most
frequent properties from each class. The result indicates that entities on DBpedia with property
selection using a statistical approach have a good average completeness value, in that almost
70% of the entities are complete for more than 65% of class properties.
Property Completeness Validation. In this validation scenario, we make 357,892 shapes of
the pattern PC1 to validate the property completeness of entities. With the automatic approach,
we validate 357,749 Wikidata sample entities for the completeness of the number of authors
from “Scholarly Article” (Q13442814), the number of children from “Human” (Q5), the number
of episodes from “Television Series Season” (Q3464665), the number of seasons from “Television
Series” (Q5398426), and the number of participants from “Sports Season” (Q27020041). Then, on
the spreadsheet approach, we gather completeness information about the number of children
of Indonesian actors, the number of cities from Indonesian provinces, and so on. The validation
results from both approaches show a relatively good completeness value, ranging from 60–70%.
No-Value Completeness Validation. We conduct experiments using this pattern through
an automatic approach, instantiating the pattern NVC1 based on Wikidata. Here we take all
entities having no children, as stated by the value of zero (0) for the property of “number of
children” (P1971). There are in total 1,277 SHACL shapes generated for no-value completeness.
The validation results are trivially 100% complete by the definition of no-value completeness.
Population Completeness Validation. We create eleven population completeness pattern
instances with the code POC1. We examine on DBpedia the populations of cantons of Switzer-
land, continents, countries in Africa and Europe, G20 nations, Indonesian active volcanoes,
Indonesian legislative election events, Indonesian provinces, NATO nations, oceans, and Sum-
mer Olympics events. Based on the experiment results, DBpedia includes in total 92.8% of the
entities from all the populations above. This shows that DBpedia has covered almost all the
population entities tested.
Label & Description Completeness Validation. Using nine label & description pattern
instances, we conduct validation experiments involving 69,045 entities on DBpedia and Wikidata.
On DBpedia, we validate 5,000 sample entities from the class of mountain, politician, country,
disease, and musical artist, for the existence of a basic label & description property according to
the pattern of the code LDC1. Based on the validation results, more than 90% of the entities have
a complete label and description property, such as rdfs:label (99.96%), rdfs:comment (94.96%), and
dbo:abstract (94.96%). As for Wikidata, we validate 64,045 entities using instantiations of the
pattern with the code LDC2 to check whether the entities have a label and description property
with a proper language tag. We examine the entities from the class “National Hero of Indonesia”,
“South Korean Music Group”, “American Film”, and “Japanese Manga”. The validation results
show that over 98.72% of entities have an rdfs:label and 93.80% of entities have a schema:description
with a proper language tag. However, only 28.41% of the entities capture the alias property
(skos:altLabel), indicating that the skos:altLabel property is rather incomplete. One possible
reason is that there is sometimes no alias for the entities of interest.
We also compare the label & description completeness of DBpedia entities that have equivalent
Wikidata entities through owl:sameAs. This way, we can have a fair comparison between DBpedia
and Wikidata. The validation checks whether entities from the class of country, mountain, film,
hotel, and song, have a label & description in English according to the pattern code LDC2. Based
on the validation results of 5,000 entities, DBpedia entities have completeness of 99.86% for the
label property (rdfs:label) and 99.06% for the description property (rdfs:comment). Meanwhile, the
corresponding Wikidata entities have completeness of 99.5% for the label property (rdfs:label)
and 95.04% for the description property (schema:description). Note that Wikidata relies on
schema:description for describing entities instead of rdfs:comment. From the above results, we
can say that DBpedia entities have slightly higher label & description completeness than those
of Wikidata.
Interlinking Completeness Validation. We experiment with interlinking completeness
patterns, creating ten completeness pattern instances (i.e., five with the code IC1 and five with
IC2), to validate 9,194 sample entities of DBpedia and Wikidata. On DBpedia, we check entities
from the country, actor, island, museum, and hotel classes. The validation result shows that
the completeness of the owl:sameAs property is the highest, with 95.24% of DBpedia entities linked
to Wikidata and 72.36% linked to YAGO. Meanwhile, the completeness of the schema:sameAs
property is only about 20.80%, and even smaller for skos:exactMatch, which is only 1.40%. This
shows that most entities on DBpedia are not yet complete in covering external entities through
the schema:sameAs and skos:exactMatch properties.
On Wikidata, we examine the existence of external ID properties for entities of the class
country, hotel, film, singer-songwriter, and university. The validation result shows that inter-
linking completeness varies: countries are mostly complete for VIAF ID (96%) and GND ID
(96%); hotels are somewhat incomplete – reaching the highest of only 25% for Agoda hotel ID;9
films are the most complete for IMDb ID (99%); singers & songwriters are the most complete for
VIAF ID (97%); and universities are also the most complete for VIAF ID (86%).
6. Related Work
A number of research studies have investigated the problem of describing and checking KG
completeness. Prasojo et al. [18] introduce SP-statements, in the form of (𝑠, 𝑝), declaring that
the entity 𝑠 is complete for all values of the property 𝑝 for the KG of interest. Such SP-statements
are practically relevant for entity-centric KGs, such as DBpedia and Wikidata. A more generic
approach to describe completeness is provided in [19, 20] in that more expressive completeness
statements based on arbitrary basic graph patterns (BGPs), including their consequences to
query completeness, are well-studied.
In [21], Wisesa et al. develop a framework for completeness profiling. The framework analyzes
the completeness of a KG based on the Class-Facet-Attribute (CFA) profiles. A class represents
a set of entities with a number of facets, which allow attribute completeness to be analyzed.
The framework can be used, for example, to answer the question: how does the completeness
of the attributes “residence” and “eye color” fare for humans with the occupation of actor?
We now report on several use cases of SHACL for data quality. Hammond et
al. [22] discuss how SHACL can be leveraged to monitor the quality of ETL pipelines from various
data sources used to build the Springer Nature SciGraph. The graph describes the Springer Nature
publishing world, containing data about sponsors, research projects, conferences, publications,
and affiliations from across the research landscape. Cimmino et al. [23] introduce an approach
to generate SHACL shapes automatically from OWL ontologies. To this end, they provide two
resources: (𝑖) Astrea-KG, a KG for mappings between ontology constraint patterns and SHACL
constraint patterns; and (𝑖𝑖) Astrea, a tool implementing the mappings from the Astrea-KG for
concrete ontologies. The tool has been tested over various real-world ontologies to automatically
generate SHACL constraints, such as sh:datatype and sh:minInclusive, in order to
maintain KG quality. Pandit et al. [24] envision reusing ODP axioms to define SHACL shapes.
They provide a simple case study in the context of validating the data quality of microblogs
(e.g., Twitter posts). In that study, constraints and relationships within the ODP, defined using
RDFS subclasses and OWL cardinality restrictions, are mapped to the corresponding SHACL
shapes using sh:class and sh:qualified(Max/Min)Count conditions.
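To give a flavor of such a mapping, the sketch below pairs a hypothetical ODP-style OWL cardinality restriction with a corresponding SHACL shape; all ex: names are illustrative assumptions of ours, not taken from [24]:

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/odp#> .

# ODP axiom (sketch): every microblog post has exactly one creation date.
ex:Post rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty ex:createdAt ;
    owl:cardinality "1"^^xsd:nonNegativeInteger ] .

# Corresponding SHACL shape using qualified cardinality conditions.
ex:PostShape
    a sh:NodeShape ;
    sh:targetClass ex:Post ;
    sh:property [
        sh:path ex:createdAt ;
        sh:qualifiedValueShape [ sh:datatype xsd:dateTime ] ;
        sh:qualifiedMinCount 1 ;
        sh:qualifiedMaxCount 1 ] .
```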
7. Conclusions
In this paper, we have presented SoCK (SHACL on Completeness Knowledge), a pattern-oriented
framework for describing and validating the completeness of KGs. We propose six completeness
patterns based on general data quality problems related to completeness, namely those patterns
of schema completeness, property completeness, no-value completeness, population complete-
ness, label & description completeness, and interlinking completeness. From those patterns,
we manage to create over 360k instances (i.e., SHACL shapes) and validate nearly 1 million
Wikidata and DBpedia entities using our SoCK Python library (https://github.com/JillyCS15/
sock-validator). To disseminate our framework, we develop a SoCK web app, accessible via
https://sock.cs.ui.ac.id, which provides a catalog of completeness patterns as well as instances
and use cases.
9 Out of the external ID properties of VIAF ID, Booking.com numeric ID, Agoda hotel ID, Hotels.com hotel ID, and TripAdvisor ID.
There are a few things that can be done to improve the SoCK framework. Currently, five
patterns are formed based on [5]. The other two patterns, namely currency completeness
and metadata completeness, are not yet investigated in this work due to time limitations,
and are left for future research. At the moment, our evaluation is limited to Wikidata and
DBpedia entities. Using entities in other KGs, such as YAGO and KBpedia, or even enterprise
KGs, may help further explore the general quality of KGs. The current version of our work
concerns only the syntactic level of RDF graphs and not yet the semantic level. Investigating
the consequences of incorporating the semantics (and inferences) of RDFS and OWL would
be a relevant future direction. Working on the semantic level would also have the potential
to better generalize our SHACL-based approach and to align SHACL & ODPs for describing
and validating KG completeness. Last but not least, we are also interested in analyzing the
completeness of ontologies, that is, how complete an ontology is in capturing some domain of
interest (i.e., whether its classes and properties are sufficient).
References
[1] J. Debattista, C. Lange, S. Auer, D. Cortis, Evaluating the quality of the LOD cloud: An
empirical investigation, Semantic Web 9 (2018).
[2] B. F. Lóscio, C. Burle, N. Calegari (Eds.), Data on the Web Best Practices, W3C Recommen-
dation, 31 January 2017. https://www.w3.org/TR/dwbp/.
[3] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, S. Auer, Quality assessment for
linked data: A survey, Semantic Web 7 (2015).
[4] R. Y. Wang, D. M. Strong, Beyond accuracy: What data quality means to data consumers,
J. Manag. Inf. Syst. 12 (1996).
[5] S. Issa, O. Adekunle, F. Hamdi, S. S. Cherfi, M. Dumontier, A. Zaveri, Knowledge graph
completeness: A systematic literature review, IEEE Access 9 (2021).
[6] M. Luggen, D. E. Difallah, C. Sarasua, G. Demartini, P. Cudré-Mauroux, Non-parametric
Class Completeness Estimators for Collaborative Knowledge Graphs - The Case of Wiki-
data, in: ISWC, 2019.
[7] H. Knublauch, D. Kontokostas (Eds.), Shapes Constraint Language (SHACL), W3C Recom-
mendation, 20 July 2017. https://www.w3.org/TR/shacl/.
[8] A. Gangemi, V. Presutti, Ontology design patterns, in: S. Staab, R. Studer (Eds.), Handbook
on Ontologies, International Handbooks on Information Systems, Springer, 2009.
[9] F. Manola, E. Miller (Eds.), RDF Primer, W3C Recommendation, 10 February 2004. http:
//www.w3.org/TR/rdf-primer/.
[10] S. Steyskal, K. Coyle (Eds.), SHACL Use Cases and Requirements, W3C Working Group
Note, 20 July 2017. https://www.w3.org/TR/shacl-ucr/.
[11] J. E. L. Gayo, E. Prud’hommeaux, I. Boneva, D. Kontokostas, Validating RDF Data, Synthesis
Lectures on the Semantic Web: Theory and Technology, Morgan & Claypool Publishers,
2017.
[12] A. Gangemi, Ontology Design Patterns for Semantic Web Content, in: ISWC, 2005.
[13] K. Hammar, Ontology Design Patterns in WebProtege, in: ISWC Posters & Demos, 2015.
[14] A. Krisnadhi, V. Rodríguez-Doncel, P. Hitzler, M. Cheatham, N. Karima, R. Amini, A. Cole-
man, An ontology design pattern for chess games, in: WOP, 2015.
[15] C. Shimizu, P. Hitzler, Q. Hirt, D. Rehberger, S. G. Estrecha, C. Foley, A. M. Sheill,
W. Hawthorne, J. Mixter, E. Watrall, R. Carty, D. Tarr, The enslaved ontology: Peoples of
the historic slave trade, JWS 63 (2020).
[16] M. Faiz, G. M. F. Wisesa, A. A. Krisnadhi, F. Darari, OD2WD: From Open Data to Wikidata
through Patterns, in: WOP, 2019.
[17] F. Darari, Representing and querying negative knowledge in RDF, in: ESWC Posters and
Demos, 2013.
[18] R. E. Prasojo, F. Darari, S. Razniewski, W. Nutt, Managing and Consuming Completeness
Information for Wikidata Using COOL-WD, in: COLD@ISWC, 2016.
[19] F. Darari, W. Nutt, G. Pirrò, S. Razniewski, Completeness statements about RDF data
sources and their use for query answering, in: ISWC, 2013.
[20] F. Darari, S. Razniewski, R. E. Prasojo, W. Nutt, Enabling fine-grained RDF data complete-
ness assessment, in: ICWE, 2016.
[21] A. Wisesa, F. Darari, A. Krisnadhi, W. Nutt, S. Razniewski, Wikidata Completeness Profiling
Using ProWD, in: K-CAP, 2019.
[22] T. Hammond, M. Pasin, E. Theodoridis, Data integration and disintegration: Managing
Springer Nature SciGraph with SHACL and OWL, in: ISWC Posters & Demos, 2017.
[23] A. Cimmino, A. Fernández-Izquierdo, R. García-Castro, Astrea: Automatic Generation of
SHACL Shapes from Ontologies, in: ESWC, 2020.
[24] H. J. Pandit, D. O’Sullivan, D. Lewis, Using ontology design patterns to define SHACL
shapes, in: WOP, 2018.