<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Improving the Quality of Knowledge Graphs with Data-driven Ontology Patterns and SHACL</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Blerina Spahiu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Maurino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Palmonari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Milano-Bicocca</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>As Linked Data available on the Web continue to grow, understanding their structure and assessing their quality remain challenging tasks, making them the bottleneck for reuse. ABSTAT is an online semantic profiling tool which helps data consumers better understand the data by extracting data-driven ontology patterns and statistics about the data. The SHACL Shapes Constraint Language helps users capture quality issues in the data by means of constraints. In this paper we propose a methodology to improve the quality of different versions of the data by means of SHACL constraints learned from the semantic profiles produced by ABSTAT.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Millions of facts describing entities such as people, places, events, films, etc., are
stored in knowledge repositories. In the Semantic Web, knowledge is encoded
into graphs represented using the RDF data model1. Nodes in these graphs
represent entities or literals, while arcs represent relations between entities and
between entities and literals, whose semantics is specified by RDF properties.
Entities and literals are usually associated with a type, i.e., a class or a
datatype respectively. The sets of possible types and properties are organized into schemas
or ontologies, which define the meaning of the terms used in the knowledge
base through logical axioms. KGs are often very large and continuously
evolving; as an example, Linked Data has grown to
roughly 1 184 data sets as of April 20182. However, understanding and exploring
the content of large and complex knowledge bases is very challenging for users
[
        <xref ref-type="bibr" rid="ref11 ref14 ref19 ref21 ref4">11, 21, 4, 19, 14</xref>
        ].
      </p>
      <p>As the number and size of published KGs increase, the need for
methodologies and tools able to support the quality assessment of such datasets increases
as well. Many data producers take care of the quality of the data they
publish, resulting in many data sets in the LD cloud that are of high quality.
However, there are also many data sets which are extracted from unstructured</p>
      <sec id="sec-1-1">
        <title>1 https://www.w3.org/RDF/</title>
      </sec>
      <sec id="sec-1-2">
        <title>2 http://lod-cloud.net/</title>
        <p>
          or semi-structured information, which makes them vulnerable to quality issues [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Many
metrics, methodologies and approaches have been proposed to capture quality issues
in linked data sets [
          <xref ref-type="bibr" rid="ref10 ref13 ref2 ref20">20, 13, 10, 2</xref>
          ]. In this context, it may happen that a new version
of a dataset shows a reduced quality level with respect to a previous one.
In addition, when an error is discovered in a given version of a dataset, there
may be the need to check whether the error is also present in previous versions of the
dataset that are still in use.
        </p>
        <p>
          In this paper we propose an approach for improving the quality of
different versions of knowledge graphs by means of ABSTAT [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. In practice,
profiles computed with ABSTAT provide data-driven ontology patterns and related
statistics. Given an RDF data set and, optionally, an ontology (used in the
data set), ABSTAT computes a semantic profile which consists of a summary
that provides an abstract but complete description of the data set content, and
statistics. ABSTAT's summary is a collection of patterns known as Abstract
Knowledge Patterns (AKPs) of the form &lt;subjectType, pred, objectType&gt;,
which represent the occurrence of triples &lt;sub, pred, obj&gt; in the data, such
that subjectType is a minimal type of the subject and objectType is a minimal
type of the object. With the term type we refer to either an ontology class (e.g.,
foaf:Person) or a datatype (e.g., xsd:DateTime). By considering only minimal
types of resources, computed with the help of the data ontology, we exclude
several redundant AKPs from the summary, making it compact and complete.
Summaries are published and made accessible via web interfaces, in such a way
that the information they contain can be consumed by users and machines
(via APIs).
        </p>
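The minimal-type logic behind AKP extraction can be sketched in a few lines of Python. The subtype graph, function names and triples below are illustrative toy stand-ins, not ABSTAT's actual implementation:

```python
# Toy sketch of ABSTAT-style minimal-type pattern (AKP) extraction.
# SUBTYPE maps each class to its direct supertype (child -> parent).
SUBTYPE = {
    "Guitarist": "Instrumentalist", "Instrumentalist": "MusicalArtist",
    "MusicalArtist": "Artist", "Artist": "Person", "Person": "Agent",
    "Agent": "Thing",
    "Settlement": "PopulatedPlace", "PopulatedPlace": "Place",
    "Place": "Thing",
}

def ancestors(t):
    """All strict supertypes of t in the subtype graph."""
    out = set()
    while t in SUBTYPE:
        t = SUBTYPE[t]
        out.add(t)
    return out

def minimal_types(types):
    """Keep only the types that have no asserted subtype among `types`."""
    return {t for t in types if not any(t in ancestors(u) for u in types)}

def akps(types_of, relational):
    """One AKP (subjectType, pred, objectType) per relational assertion,
    built from the minimal types of its subject and object."""
    patterns = set()
    for s, p, o in relational:
        for st in minimal_types(types_of.get(s, {"Thing"})):
            for ot in minimal_types(types_of.get(o, {"Thing"})):
                patterns.add((st, p, ot))
    return patterns

types_of = {
    "Kurt_Cobain": {"Person", "Guitarist"},
    "Washington": {"Settlement", "PopulatedPlace"},
}
relational = [("Kurt_Cobain", "birthPlace", "Washington")]
print(akps(types_of, relational))
# {('Guitarist', 'birthPlace', 'Settlement')}
```

Only the minimal pattern survives: Person and PopulatedPlace are discarded because a more specific asserted type exists for each resource.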
        <p>SHACL3 (Shapes Constraint Language) is a highly expressive language,
recommended by the W3C, for writing constraints to validate RDF graphs.
Constraints are written in RDF. In this paper we show not only how ABSTAT
patterns encode valuable information to evaluate and improve the quality of
knowledge graphs (RDF data sets), but also how this information can be
translated into SHACL profiles that can be validated against the data. More precisely,
we introduce the first methodology to build SHACL profiles using data-driven
ontology patterns extracted automatically from the data. These SHACL profiles
can be derived automatically, refined by the users (also with the help of
information provided by the semantic profiling tool) and then validated automatically
using a SHACL validator. The overall idea is that ABSTAT profiles and SHACL
profiles can be used in combination to improve the quality of RDF data sets
over time.</p>
        <p>Our contributions can be summarized as follows: (i) we provide a
methodology for quality assessment and validation across different versions of a
dataset, in particular an approach to detect quality issues in the
data by means of cardinality statistics; (ii) we represent ABSTAT semantic
profiles in the SHACL language; and (iii) we use SHACL profiles as templates for quality
detection.</p>
      </sec>
      <sec id="sec-1-3">
        <title>3 https://www.w3.org/TR/shacl/</title>
        <p>The rest of the paper is organized as follows. Related work is discussed in
Section 2. In Section 3 we introduce the methodology for improving the general
quality of the data sets. In Section 4.1 we describe how pattern-based profiles
can support data quality assessment. We first introduce ABSTAT
and the generated profile consisting of data-driven ontology patterns. We then
describe how to transform ABSTAT profiles into the SHACL language in order to
use them as templates for validation. An example of how such an approach can be
applied to DBpedia is given in Section 5, while conclusions end the paper in
Section 6.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The identification of quality issues in RDF data
sets has recently been studied. The most straightforward way to represent constraints is the use of OWL
24 by means of three axioms: owl:minCardinality, owl:maxCardinality, and
owl:exactCardinality, on ObjectProperty as well as on DatatypeProperty.
However, such declarations are optional and consistency with the data is not
ensured. In particular, it is very difficult to evaluate consistency when relations are
not explicitly stated, as in the case of lightweight ontologies. The
expressiveness of OWL 2 is also limited for expressing integrity constraints [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        The quality of DBpedia as one of the most known and used data set of the
LOD cloud has been studied in di erent works such as [
        <xref ref-type="bibr" rid="ref18 ref20 ref6 ref8">6, 20, 8, 18</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref18">18</xref>
          ] the
authors study property domain/range constraints, enrich the ontology
with axioms obtained through Inductive Logic Programming (ILP), and use
them to detect inconsistencies in DBpedia. The creation of suitable correction
suggestions helps users identify and correct existing errors.
      </p>
      <p>
        Authors in [
        <xref ref-type="bibr" rid="ref3">3</xref>
          ] present a study of the usability of 115 constraints on
vocabularies commonly used in the social, behavioral, and economic sciences in order
to assess the quality of the data. These constraints were validated on 15 694 data
sets. The authors gain a better understanding of the role of certain constraints
for assessing the quality of RDF data. The main findings concern the language
used to formulate constraints: 1/2 of all constraints are informational, 1/3 are
error, and 1/5 are warning constraints, etc.
      </p>
      <p>
        The problem of mining cardinality bounds for properties, in order to discover
structural characteristics of knowledge bases and assess their completeness, is
introduced in [
        <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The algorithm for mining cardinality patterns has two
implementations, (1) based on SPARQL and (2) based on Apache Spark, and is
evaluated against five different real-world and synthetic datasets. The findings
show that cardinality bounds can be mined efficiently and are useful to
understand the structure of data. Such an approach allows analyzing the completeness
and the consistency of data. There are also different tools and frameworks for
generic quality assessment, such as [
        <xref ref-type="bibr" rid="ref1 ref2 ref5">2, 5, 1</xref>
        ]. [
        <xref ref-type="bibr" rid="ref2">2</xref>
          ] is a Linked Data quality
assessment framework that helps non-programming experts create their quality
      <sec id="sec-2-1">
        <title>4 https://www.w3.org/TR/owl2-syntax/</title>
        <p>
          metrics either procedurally, through Java classes, or declaratively, through a
quality metric. In [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] the definition of metrics for the quality assessment task should
be specified in an XML file. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] harvests data dumps from the Web in various
RDF formats, cleans up quality issues relating to the serialization, and republishes
the dumps in standards-conformant formats.
        </p>
        <p>With respect to the related work described in this section, this paper differs
in several ways: i) it provides an iterative way to improve quality over
upcoming versions of the data; ii) it introduces the concept of SHACL profiles,
which serve as templates for validation of the data; and iii) it proposes the use of
minimal ontology patterns to detect quality issues in the data.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>The Proposed Methodology</title>
      <p>In this section we introduce the methodology of the approach, which learns
validation rules from semantic profiles. For representing such rules we propose
to use SHACL (Shapes Constraint Language), a versatile constraint language for
validating RDF. Such rules are used as templates for detecting quality issues in
the upcoming versions of the data. The process of validating the quality of different
versions of the data set through the use of SHACL profiles is given in Figure 1.
Validation of the data set is done in the following phases:</p>
      <p>
        Phase 1: As a first step, for a data set (D1), semantic profiles (Ap1) are
generated with the help of ABSTAT (Section 4.1). Such semantic profiles allow
users to detect quality issues in the data [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>Phase 2: In order to assess quality issues in the data set (D1), a
custom-built converter transforms the semantic profiles into SHACL
mappings (Sp1) (Section 4.3). To the best of our knowledge, there are no
converters in the state of the art to support such a transformation.</p>
      <p>Phase 3: SHACL profiles specify severities (sh:severity sh:Warning) to
identify non-critical constraint violations for those properties for which the maximum
cardinality is higher than twice the average (Section 4.4). We chose such a
threshold in our experiments because we wanted to capture as many cases as possible
that may violate the quality of the data. The threshold is a tunable parameter.</p>
      <p>Phase 4: The domain expert takes the resulting report and verifies whether the cases
identified as Warning are real violations or not. Gradually checking such cases,
the user updates the SHACL profiles (Sp1') if the violations are not real, or
otherwise updates the data set. As a first step in validating our approach we
check manually whether violations are real or not. We are investigating methods
to capture such real violations automatically.</p>
      <p>Phase 5: When a different version of the dataset is available (D2), the domain
expert can directly validate the new data against the constraints Sp1' and use
semantic profiles for exploring the data.</p>
      <p>This process iterates every time a new version of the data arrives. In this way
the quality of the upcoming versions of the data, as well as of the data set
itself, improves over time, as shown in Figure 1.</p>
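The five phases can be sketched as a minimal end-to-end loop over two dataset versions. Everything here is a toy stand-in: real profiles come from ABSTAT, real validation from a SHACL processor, and all function names are illustrative:

```python
# Toy sketch of the Phase 1-5 loop: profile version 1, derive constraints,
# validate version 2 against them.
from collections import defaultdict

def make_profile(triples):
    """Phase 1 (toy): distinct-object counts per subject, grouped by predicate."""
    per_subj = defaultdict(set)
    for s, p, o in triples:
        per_subj[(s, p)].add(o)
    prof = defaultdict(list)
    for (s, p), objs in per_subj.items():
        prof[p].append(len(objs))
    return prof

def to_constraints(profile, factor=2):
    """Phases 2-3 (toy): warn when fan-out exceeds factor * average,
    the paper's tunable threshold."""
    return {p: factor * sum(c) / len(c) for p, c in profile.items()}

def validate(triples, constraints):
    """Phase 5 (toy): report (subject, predicate) pairs breaking a constraint."""
    per_subj = defaultdict(set)
    for s, p, o in triples:
        per_subj[(s, p)].add(o)
    return [(s, p) for (s, p), objs in per_subj.items()
            if p in constraints and len(objs) > constraints[p]]

v1 = [("a", "keyPerson", "x"), ("b", "keyPerson", "x")]
v2 = v1 + [("c", "keyPerson", "x"), ("c", "keyPerson", "y"),
           ("c", "keyPerson", "z")]
warnings = validate(v2, to_constraints(make_profile(v1)))
print(warnings)  # [('c', 'keyPerson')]
```

In Phase 4, a domain expert would inspect each warning and either relax the constraint or fix the offending triples before the next iteration.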
    </sec>
    <sec id="sec-4">
      <title>Data Quality Insights with Pattern-based Profiles</title>
      <p>In this section we show how to apply the methodology by describing each phase
and presenting a real example.</p>
      <sec id="sec-4-1">
        <title>Pattern-based Data Summarization with ABSTAT</title>
        <p>
          In the following we describe the main features of ABSTAT, recalling the main
definitions from our previous paper [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. We borrow the definition of data set
from the definition of Knowledge Base in Description Logics (DLs). A
Knowledge Base has two components: a terminology, defining the vocabulary of an
application domain, and a set of assertions, describing RDF resources in terms
of this vocabulary. A data set is defined as a couple (T, A), where T is a set
of terminological axioms and A is a set of assertions. The domain vocabulary
of a data set contains a set NC of types, where by type we refer to either a
named class or a datatype, a set NP of named properties, a set of named
individuals (resource identifiers) NI and a set of literals L. In this paper we use
symbols like C, C', ..., and D, D', ..., to denote types, symbols P, Q to denote
properties, and symbols a, b to denote named individuals or literals. Types and
properties are defined in the terminology and occur in assertions. Assertions in
A are of two kinds: typing assertions of the form C(a), and relational assertions of
the form P(a, b), where a is a named individual and b is either a named individual
or a literal. We denote the sets of typing and relational assertions by AC and
AP respectively. Assertions can be extracted directly from RDF data (even in
the absence of an input terminology). Typing assertions occur in a data set as RDF
triples &lt;x, rdf:type, C&gt; where x and C are URIs, or can be derived from
triples &lt;x, P, y^^C&gt; where y is a typed literal, with
C being its datatype. Without loss of generality, we say that x is an instance of
a type C, denoted by C(x), whether x is a named individual or a typed literal.
Every resource identifier that has no type is considered to be of type owl:Thing
and every literal that has no type is considered to be of type rdfs:Literal. A
literal occurring in a triple can have at most one type, and at most one typing
assertion can be extracted for each triple. Conversely, an instance can be the
subject of several typing assertions. A relational assertion P(x, y) is any triple
&lt;x, P, y&gt; such that P ≠ Q, where Q is either rdf:type or one of the
properties used to model a terminology (e.g., rdfs:subClassOf).
        </p>
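The split of raw triples into typing assertions (AC) and relational assertions (AP) described above can be sketched as follows. The triples and the terminological-property list are illustrative; a real pipeline would use an RDF parser:

```python
# Toy sketch: partition triples into typing assertions C(a) and
# relational assertions P(a, b), dropping terminological triples.
TERMINOLOGICAL = {"rdf:type", "rdfs:subClassOf", "rdfs:subPropertyOf",
                  "rdfs:domain", "rdfs:range"}

def split_assertions(triples):
    types_of, relational = {}, []
    for s, p, o in triples:
        if p == "rdf:type":
            types_of.setdefault(s, set()).add(o)   # typing assertion C(a)
        elif p not in TERMINOLOGICAL:
            relational.append((s, p, o))           # relational assertion P(a, b)
    return types_of, relational

triples = [
    ("dbr:Kurt_Cobain", "rdf:type", "dbo:Guitarist"),
    ("dbr:Kurt_Cobain", "dbo:birthPlace", "dbr:Washington"),
    ("dbo:Guitarist", "rdfs:subClassOf", "dbo:Instrumentalist"),
]
types_of, rel = split_assertions(triples)
print(types_of)  # {'dbr:Kurt_Cobain': {'dbo:Guitarist'}}
print(rel)       # [('dbr:Kurt_Cobain', 'dbo:birthPlace', 'dbr:Washington')]
```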
        <p>
          Patterns are abstract representations of Knowledge Patterns, i.e., constraints
over a piece of domain knowledge defined by axioms of a logical language, in the
vein of Ontology Design Patterns [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. A pattern is a triple (C, P, D) such that C
and D are types and P is a property. In a pattern (C, P, D), we refer to C as the
subject type and to D as the object type. Intuitively, a pattern states that there
are instances of type C that are linked to instances of a type D by a property
P. Instead of representing every pattern occurring in the data set, ABSTAT
summaries include only a base of minimal type patterns, i.e., a subset of the
patterns such that every other pattern can be derived using a subtype graph.
        </p>
        <p>A piece of a subtype graph is shown in Figure 2. Two branches of the class
hierarchy are shown with respect to the Thing class. The first branch
has the subtype relations between Agent, Person, Artist, MusicalArtist,
Instrumentalist and Guitarist, while the second one has the subtype relations
between Place, PopulatedPlace and Settlement. Consider in the data set
the following triples: &lt;dbo:Kurt Cobain dbo:birthPlace dbo:Washington&gt;,
&lt;dbo:Kurt Cobain rdf:type dbo:Guitarist&gt;, &lt;dbo:Kurt Cobain rdf:type
dbo:Person&gt;, &lt;dbo:Washington rdf:type dbo:Settlement&gt; and &lt;dbo:Washington
rdf:type dbo:PopulatedPlace&gt;. In these triples, Kurt Cobain
has two types, Person and Guitarist, while the resource Washington has the
types PopulatedPlace and Settlement. The approach of ABSTAT, which
considers only the minimal type, extracts the pattern &lt;dbo:Guitarist, dbo:birthPlace,
dbo:Settlement&gt;, as Guitarist is the minimal type between Person and
Guitarist for the resource Kurt Cobain, while the type Settlement is
minimal because it is a subtype of PopulatedPlace. For the triple with
birthDate as predicate, ABSTAT would extract the pattern &lt;dbo:Guitarist,
dbo:birthDate, xmls:Date&gt;. Along with patterns, ABSTAT is able to extract
other statistical information such as:
- frequency, which shows how many times a pattern occurs in the data set
as minimal type pattern.
- instances, which shows how many times a pattern occurs based on the
ontology, regardless of pattern minimality. In other words, this number tells
how many relation instances (i.e., relational assertions) are represented by
the pattern, counting also the relation instances that have a more
specific pattern as minimal type pattern. For example, if we have two patterns
(C, P, D) and (C, P, D') with D being a subtype of D', (C, P, D) is more
specific than (C, P, D') (this can be inferred using the subtype graph); thus,
instances of (C, P, D') include occurrences of (C, P, D) and of (C, P, D') itself.
- Max (Min, Avg) subjs-obj cardinality is the maximum (minimum,
average) number of distinct subjects associated with a same object through the
predicate P, with subjects and objects belonging respectively to the subject
type and the object type.
- Max (Min, Avg) subj-objs cardinality is the maximum (minimum, average) number
of distinct entities of the type in the object position linked to a single entity of
the type in the subject position through the predicate P.</p>
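The two cardinality statistics can be computed with a few lines of Python. The (subject, object) pairs below are toy occurrences of a single pattern, and the averages are rounded for readability; this is an illustration, not ABSTAT's implementation:

```python
# Toy sketch of subjs-obj / subj-objs cardinality statistics for one
# pattern (C, P, D).
from collections import defaultdict

def cardinality_stats(pairs):
    """pairs: (subject, object) occurrences of one (C, P, D) pattern.
    Returns (max, min, avg) of distinct objects per subject (subj-objs)
    and distinct subjects per object (subjs-obj)."""
    subj_objs = defaultdict(set)   # subject -> distinct objects
    obj_subjs = defaultdict(set)   # object  -> distinct subjects
    for s, o in pairs:
        subj_objs[s].add(o)
        obj_subjs[o].add(s)

    def mma(groups):
        sizes = [len(v) for v in groups.values()]
        return max(sizes), min(sizes), round(sum(sizes) / len(sizes), 2)

    return {"subj-objs": mma(subj_objs), "subjs-obj": mma(obj_subjs)}

# Three guitarists born in settlement s1, one of them also linked to s2.
pairs = [("g1", "s1"), ("g2", "s1"), ("g3", "s1"), ("g1", "s2")]
print(cardinality_stats(pairs))
# {'subj-objs': (2, 1, 1.33), 'subjs-obj': (3, 1, 2.0)}
```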
        <p>
          Figure 3 shows the semantic profile generated by ABSTAT for the example
above. The statistical information can be understood as follows. Frequency shows
that the pattern &lt;dbo:Guitarist, dbo:birthPlace, dbo:Settlement&gt; occurs 36 times
in the data set. The number of instances shows that there are 50 relational
assertions that are represented by this pattern, regardless of minimality (e.g.,
here we also sum relational assertions that have (Guitarist, birthPlace, City) as
minimal type pattern). Max (Min, Avg) subjs-obj cardinality is the maximum
(minimum, average) number of distinct entities of type Guitarist linked to a
same entity of type Settlement through the predicate birthPlace: there are at
most 6 (minimum 1, on average 1) distinct entities of type Guitarist linked
to a single entity of type Settlement. Max (Min, Avg) subj-objs is the maximum
(minimum, average) number of distinct entities of type Settlement linked to a
single entity of type Guitarist through the predicate birthPlace. Notice that
occurrence counts are also given for types and predicates: the type Guitarist occurs
151 times, the predicate birthPlace occurs 1 168 459 times and the type Settlement
occurs 238 436 times.
        </p>
        <p>Shapes Constraint Language (SHACL) has been, since July 2017, a W3C
recommendation language for defining constraints on RDF graphs. The scope of SHACL is
primarily validation, but it can be extended to data integration, code generation
and interface building. RDF graphs are validated against a set of shapes. A
SHACL processor has two inputs: a data graph that contains the RDF data to
validate and a shapes graph that contains the shapes [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The result of the validation process is a
validation report. There are two types of shapes: node shapes declare constraints
directly on a node, e.g., node kind (IRI, literal or blank), IRI regex, etc.;
property shapes declare constraints on the values associated with a node through
a path, e.g., constraints on a certain outgoing or incoming property of a focus
node: cardinality, datatype, numeric min/max, etc. SHACL shapes may define several
target declarations. Target declarations specify the set of nodes that will be
validated against a shape, e.g., by directly pointing to a node, or to all nodes that are
subjects of a certain predicate, etc. The validation report produced by SHACL
distinguishes three different severity levels: Violation, Warning and Info. Severity does
not affect the validation result and can be specified by users while writing
constraints. SHACL is divided into SHACL Core, which describes a core RDF
vocabulary to define common shapes and constraints, and SHACL-SPARQL, which
describes an extension mechanism in terms of SPARQL. In SHACL-SPARQL,
there are two types of validators: SELECT-based and ASK-based.
SELECT-based validators return no results to indicate conformance with the
constraint and a non-empty set when it is violated, whereas ASK-based validators return
true to indicate conformance.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>ABSTAT Profiles into SHACL</title>
        <p>Figure 4 shows the syntax of the example pattern of Figure 3 in the SHACL
language (Phase 2 of the methodology). Not all the information in the semantic
profile produced by ABSTAT can be represented in SHACL: only a subset of the
information, such as the minimum and maximum cardinality for both subj-objs
and subjs-obj, can be represented. We cannot represent the information about
the frequency of types, predicates and patterns, the occurrence, and the average
cardinality for both subj-objs and subjs-obj. For such statistics we will extend
SHACL in the future, as described in Section 6.</p>
        <p>The semantic profile produced by ABSTAT can be transformed into SHACL
by representing each pattern (C, P, D) as follows (in parentheses we map the
process to the example in Figure 4):
- For each type C (Guitarist) we extract all patterns with C (Guitarist) as the
subject type. The subject type C (Guitarist) is mapped into a node shape
(sh:NodeShape) targeting the class.
- The property P (birthPlace or birthDate) is mapped into a property shape
(sh:property) with a value for sh:path equal to the property.</p>
        <p>If for all patterns (C, P, Z) (Guitarist, birthPlace), Z (Settlement or
City) is a class, then a "global" sh:path specification is added that
specifies the sh:nodeKind as sh:IRI. If for all patterns (C, P, Z), Z
(xmls:date) is a datatype, then a "global" sh:path specification is added
that specifies the sh:datatype as rdfs:Literal. Observe that this step
is executed only for the first pattern with C (Guitarist) as subject type
and P (birthPlace, birthDate) as property.
A "local" sh:path specification is added to specify characteristics for each
object type Z (Settlement, City) of the property P (birthPlace). For
each path P (birthPlace), Z (Settlement, City), cardinality constraint
components are set, namely sh:minCount and sh:maxCount.
The sh:inversePath property is used to describe the inverse cardinality of
the properties.</p>
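The mapping above can be sketched as a small generator that emits a node shape as Turtle text. Prefixes and layout follow Figure 4 only loosely; the function and the `ex:` shape names are illustrative, not the actual ABSTAT converter:

```python
# Toy sketch: emit a SHACL node shape (Turtle text) for one subject type
# and its patterns.
def node_shape(subject_type, patterns):
    """patterns: list of (pred, object_type, is_datatype, min_c, max_c)."""
    lines = [f"ex:{subject_type}Shape a sh:NodeShape ;",
             f"    sh:targetClass dbo:{subject_type} ;"]
    for pred, obj_type, is_datatype, min_c, max_c in patterns:
        # "global" node-kind / datatype choice per the mapping rules above
        kind = (f"sh:datatype {obj_type}" if is_datatype
                else "sh:nodeKind sh:IRI")
        # "local" cardinality constraint components per path
        lines.append(
            f"    sh:property [ sh:path dbo:{pred} ; {kind} ; "
            f"sh:minCount {min_c} ; sh:maxCount {max_c} ] ;")
    lines[-1] = lines[-1].rstrip(" ;") + " ."   # close the shape
    return "\n".join(lines)

print(node_shape("Guitarist", [
    ("birthPlace", "dbo:Settlement", False, 1, 6),
    ("birthDate", "xsd:date", True, 1, 1),
]))
```

A real converter would also emit the sh:inversePath shapes for the subjs-obj cardinalities, which this sketch omits.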
      </sec>
      <sec id="sec-4-3">
        <title>Capturing Quality Issues by Applying Heuristics</title>
        <p>In this section we describe how the domain expert can apply some heuristics to
the semantic profiles and use SHACL to validate the data (Phase 3).</p>
        <p>For all patterns for which the maximum cardinality is higher
than twice the average, we generate a severity report with sh:Warning. We set such a
threshold in order to capture as many cases as possible where the quality might
be affected. For those patterns we use SHACL-SPARQL to retrieve the entities
that violate such constraints, as in Figure 5.</p>
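The heuristic itself is a one-line test; the sketch below applies it over a small profile. The profile values for dbo:keyPerson come from the case study in the next section, the rest are illustrative:

```python
# Toy sketch of the sh:Warning heuristic: flag a pattern when its
# maximum cardinality exceeds factor * average (factor is the tunable
# threshold from the text, default 2).
def severity(max_card, avg_card, factor=2):
    return "sh:Warning" if max_card > factor * avg_card else None

def flag_patterns(profile, factor=2):
    """profile: pattern -> (max subjs-obj, avg subjs-obj).
    Returns the patterns that receive a sh:Warning severity."""
    return [pat for pat, (mx, avg) in profile.items()
            if severity(mx, avg, factor)]

profile = {
    ("dbo:Company", "dbo:keyPerson", "owl:Thing"): (5268, 3),
    ("dbo:Guitarist", "dbo:birthPlace", "dbo:Settlement"): (6, 1),
    ("dbo:Company", "dbo:foundingYear", "xsd:gYear"): (1, 1),
}
print(flag_patterns(profile))
# flags the first two patterns; (1, 1) is within threshold
```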
      </sec>
    </sec>
    <sec id="sec-5">
      <title>DBpedia Case Study</title>
      <p>In the following we consider the triple patterns extracted by ABSTAT in Figure
6, which contain descriptive information about companies in DBpedia 2015-10.
Note that the triple patterns in the figure are just a sample of the patterns
containing dbo:Company as the subject type. In this example we have also fixed
the predicate, which is dbo:keyPerson. As described above, apart from triple
patterns, ABSTAT is also able to produce statistics that describe some
characteristics of the data. The type Company is used in the data 51 898 times,
while the predicate dbo:keyPerson is used 31 078 times. The first triple
pattern &lt;dbo:Company, dbo:keyPerson, owl:Thing&gt; occurs 18 710 times in the data,
while the number of instances having such a pattern, including those for which the
types Company and Thing and the predicate keyPerson can be inferred, is 29 884.
Moreover, ABSTAT is able to describe some characteristics of the data (in Figure
6) as below:</p>
      <p>R1. There are at most 5 268 distinct entities of type Company linked to a single
entity of type Thing through the predicate keyPerson (Max subjs-obj).</p>
      <p>R2. For each entity of type Company there exists at least one value for the
predicate keyPerson (Min subjs-obj).</p>
      <p>R3. The maximum number of distinct entities of type Thing linked to a single entity of
type Company through the predicate keyPerson is 23 (Max subj-objs).</p>
      <p>R4. For each entity of type Thing there exists at least one entity of type
Company for the predicate keyPerson (Min subj-objs).</p>
      <p>R5. The predicate keyPerson is an object property, thus all triples which have
it as predicate take an entity as object (the OP symbol in the predicate field
of the pattern).</p>
      <p>R6. The domain and range of the property keyPerson are, respectively, the union of the
types in the subject position and the union of the types in the object position of
the triple patterns that have keyPerson as predicate.</p>
      <p>
        By looking at the descriptive information gained from ABSTAT in the
example above, we can see that it reveals quality issues in the data. R1, R2, R3
and R4 are known as cardinality constraints (used to specify a minimum and
maximum bound for the relationships that an entity can have through a property [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ])
while constraints of the same form as R5 are known as integrity constraints
(used to ensure accuracy and consistency of data). Constraints such as R6 are
domain and range constraints, respectively; they restrict the types that
entities (in the domain) and values (in the range) of relations for a given
property can have. The statistics about the average cardinality are indicators that
can help users identify violations of such constraints. In the example
considered, it is very suspicious that the average number of distinct entities of
type Company linked with one entity of a given type (here categorized
as Thing) is 3, while the maximum is 5 268. It means that of the 18 710 times
that this pattern occurs in the data, more than 25% (5 268) of the occurrences
are created by distinct entities of type Company that are linked with the
same entity of type Thing. The high imbalance between the
average and maximum subjs-obj cardinality prompted us to make further explorations
and checks. Querying the endpoint of DBpedia, we could identify the entity of
type Thing which is linked with 5 268 different entities of type Company. The
entity is &lt;http://dbpedia.org/resource/Chief_executive_officer&gt;, which in DBpedia
does not have a type. This means that a Chief Executive Officer (represented as
an entity) might be the CEO of 5 268 distinct companies. Examples of triples of
entities whose type is Company linked with the entity CEO are:
&lt;dbr:Kodak dbo:keyPerson dbr:Chief_executive_officer&gt;
&lt;dbr:Telefónica dbo:keyPerson dbr:Chief_executive_officer&gt;
&lt;dbr:Allianz dbo:keyPerson dbr:Chief_executive_officer&gt;
      </p>
      <p>Similarly, we could identify such a quality issue for the triple pattern
&lt;dbo:Company, foaf:homepage, owl:Thing&gt;. The max subjs-obj cardinality is
27 while the minimum and the average cardinality are 1. Looking at the triples
from which ABSTAT derived the above pattern, there are different entities of
type Company linked through the foaf:homepage predicate to the
http://www.centurylink.com webpage, which is an instance of type Thing.
Some examples of such triples are the following:
&lt;dbr:Embarq_Florida foaf:homepage http://www.centurylink.com/&gt;
&lt;dbr:Central_Telephone foaf:homepage http://www.centurylink.com&gt;
&lt;dbr:United_Telephone_Company_of_Kansas foaf:homepage
http://www.centurylink.com&gt;</p>
      <p>In the version of DBpedia considered in this paper we did not find any
violation of R5. Such violations are easy to identify in version 3.9 of
DBpedia with Infoboxes, where, e.g., birthDate is used as an object property (with
the dbp namespace) as well as a datatype property (with the dbo namespace).</p>
      <p>The violation of R6 is caused by triple patterns of the form
&lt;dbo:Company, dbo:keyPerson, dbo:Company&gt;. In the DBpedia ontology, the
range of dbo:keyPerson is Person, but in the data the property is also used
with instances of the types Company (254 patterns), Organization (308 patterns),
University (15 patterns), Bank (13 patterns) and Software (7 patterns), etc. In
the other example, the semantics of the predicate foaf:homepage implies the
domain to be a Thing while the range should be a Document5. Such a restriction
is violated by the use of this predicate in DBpedia 2015-10.</p>
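An R6-style range check over the extracted patterns can be sketched as follows. The tiny supertype table and function names are illustrative; a real check would walk the DBpedia ontology:

```python
# Toy sketch of a range-constraint check: a pattern violates R6 when its
# object type is neither the declared range nor a subtype of it.
SUPER = {"dbo:Agent": "owl:Thing", "dbo:Person": "dbo:Agent",
         "dbo:Organisation": "dbo:Agent", "dbo:Company": "dbo:Organisation",
         "dbo:University": "dbo:Organisation"}

def is_subtype(t, ancestor):
    """True if t equals ancestor or reaches it via the SUPER chain."""
    while t is not None:
        if t == ancestor:
            return True
        t = SUPER.get(t)
    return False

def range_violations(patterns, prop, declared_range):
    return [(s, p, o) for (s, p, o) in patterns
            if p == prop and not is_subtype(o, declared_range)]

patterns = [("dbo:Company", "dbo:keyPerson", "dbo:Person"),
            ("dbo:Company", "dbo:keyPerson", "dbo:Company"),
            ("dbo:Company", "dbo:keyPerson", "dbo:University")]
print(range_violations(patterns, "dbo:keyPerson", "dbo:Person"))
# the Company and University object types violate the Person range
```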
      <p>In this work, we limit ourselves to studying mainly the violation of the cardinality
constraints and integrity constraints described above, because of the impact that
they have on the quality analysis of the data. Most Linked Data quality
assessment approaches require users to have knowledge about the structure of the data,
whereas we aim to provide users with a method to capture quality issues by
identifying constraints that are violated against the semantic profiles produced by ABSTAT.</p>
      <p>In our experiments we use DBpedia, as it covers different domains, is one of the
most used data sets, and is one of the most curated data sets in the LOD cloud.
However, single-topic data sets, such as media or life science data sets, are
planned to be considered in our future experiments.
5 http://xmlns.com/foaf/spec/#term_homepage</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>In this paper we propose a methodology for assessing the quality of data sets and
their versions by means of an ontology pattern profiling tool. For each data set,
ABSTAT generates a semantic profile that describes some characteristics of
the data. The semantic profiles are then transformed into SHACL for the
validation of constraints. Cases where the maximum and the average cardinality are
highly imbalanced are reported as warnings. A user further verifies such warnings
and consequently updates the data set and the SHACL profile. Although this
research is at an early stage, samples extracted from DBpedia have shown that
it is an effective way to capture quality issues and improve the data over time.</p>
      <p>As a future direction we plan to implement the full pipeline for the generation
of SHACL profiles and SHACL validators and to make the code available. We plan
to extend SHACL by including other statistical information produced by
ABSTAT, such as the frequency and occurrence of a pattern, in order to apply more
sophisticated heuristics to capture quality issues. Moreover, we plan to run the
experiments at large scale, thus including more data sets that belong to different
topical domains, and to analyze the most frequent quality issues.</p>
      <p>Another interesting future direction is the investigation of inconsistencies or
quality issues over time for those data sets that have versioned ontologies. The
semantic profiles produced by different versions of the data and the ontology
can be compared by exploring summaries, while SHACL profiles can be used as
templates for the verification of errors found in the other versions of the data.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This research has been supported in part by the EU H2020 projects EW-Shopp
(Grant n. 732590) and EuBusinessGraph (Grant n. 732003). This work will
be published as part of the book "Emerging Topics in Semantic Technologies.
ISWC 2018 Satellite Events. E. Demidova, A.J. Zaveri, E. Simperl (Eds.), ISBN:
978-3-89838-736-1, 2018, AKA Verlag Berlin".</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Beek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rietveld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Bazoobandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wielemaker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Schlobach</surname>
          </string-name>
          .
          <article-title>LOD Laundromat: a uniform way of publishing other people's dirty data</article-title>
          .
          <source>In International Semantic Web Conference</source>
          , pages
          <fpage>213</fpage>
          -
          <lpage>228</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Debattista</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Lange</surname>
          </string-name>
          .
          <article-title>Luzzu - a framework for linked data quality assessment</article-title>
          .
          <source>In Semantic Computing (ICSC)</source>
          ,
          <source>2016 IEEE Tenth International Conference on</source>
          , pages
          <fpage>124</fpage>
          -
          <lpage>131</lpage>
          . IEEE,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zapilko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wackerow</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Eckert</surname>
          </string-name>
          .
          <article-title>Directing the development of constraint languages by checking constraints on rdf data</article-title>
          .
          <source>International Journal of Semantic Computing</source>
          ,
          <volume>10</volume>
          (
          <issue>02</issue>
          ):
          <fpage>193</fpage>
          -
          <lpage>217</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Consens</surname>
          </string-name>
          .
          <article-title>ExpLOD: summary-based exploration of interlinking and RDF usage in the linked open data cloud</article-title>
          .
          <source>In Extended Semantic Web Conference</source>
          , pages
          <fpage>272</fpage>
          -
          <lpage>287</lpage>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Westphal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cornelissen</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          .
          <article-title>Test-driven evaluation of linked data quality</article-title>
          .
          <source>In Proceedings of the 23rd international conference on World Wide Web</source>
          , pages
          <fpage>747</fpage>
          -
          <lpage>758</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <article-title>Triplecheckmate: A tool for crowdsourcing the quality assessment of linked data</article-title>
          .
          <source>In International Conference on Knowledge Engineering and the Semantic Web</source>
          , pages
          <fpage>265</fpage>
          -
          <lpage>272</lpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. E. Labra</given-names>
            <surname>Gayo</surname>
          </string-name>
          , E. Prud'hommeaux, I. Boneva, and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          .
          <source>Validating RDF Data, volume 7 of Synthesis Lectures on the Semantic Web: Theory and Technology</source>
          . Morgan &amp; Claypool Publishers LLC, sep
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morsey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Van Kleef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , et al.
          <article-title>DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia</article-title>
          .
          <source>Semantic Web</source>
          ,
          <volume>6</volume>
          (
          <issue>2</issue>
          ):
          <fpage>167</fpage>
          -
          <lpage>195</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Liddle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Embley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Woodfield</surname>
          </string-name>
          .
          <article-title>Cardinality constraints in semantic data models</article-title>
          .
          <source>Data &amp; Knowledge Engineering</source>
          ,
          <volume>11</volume>
          (
          <issue>3</issue>
          ):
          <fpage>235</fpage>
          -
          <lpage>270</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          , H. Muhleisen, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Sieve: linked data quality assessment and fusion</article-title>
          .
          <source>In Proceedings of the 2012 Joint EDBT/ICDT Workshops</source>
          , pages
          <fpage>116</fpage>
          -
          <lpage>123</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mihindukulasooriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Poveda-Villalon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>García-Castro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gómez-Pérez</surname>
          </string-name>
          .
          <article-title>Loupe - an online tool for inspecting datasets in the linked data cloud</article-title>
          .
          <source>In International Semantic Web Conference (Posters &amp; Demos)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Muñoz</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Nickles</surname>
          </string-name>
          .
          <article-title>Mining cardinalities from knowledge bases</article-title>
          .
          <source>In International Conference on Database and Expert Systems Applications</source>
          , pages
          <fpage>447</fpage>
          -
          <lpage>462</lpage>
          . Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Improving the quality of linked data using statistical distributions</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems (IJSWIS)</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ):
          <fpage>63</fpage>
          -
          <lpage>86</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Peroni</surname>
          </string-name>
          , E. Motta, and
          <string-name>
            <given-names>M.</given-names>
            <surname>d'Aquin</surname>
          </string-name>
          .
          <article-title>Identifying key concepts in an ontology, through the integration of cognitive principles with statistical and topological measures</article-title>
          .
          <source>In Asian Semantic Web Conference</source>
          , pages
          <fpage>242</fpage>
          -
          <lpage>256</lpage>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Spahiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Porrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          .
          <article-title>ABSTAT: ontology-driven linked data summaries with pattern minimalization</article-title>
          .
          <source>In International Semantic Web Conference</source>
          , pages
          <fpage>381</fpage>
          -
          <lpage>395</lpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Studer</surname>
          </string-name>
          .
          <source>Handbook on ontologies</source>
          . Springer Science &amp; Business Media,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sirin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. L.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          .
          <article-title>Integrity constraints in OWL</article-title>
          .
          <source>In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence</source>
          ,
          <source>AAAI 2010</source>
          , Atlanta, Georgia, USA, July 11-15,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] G. Topper, M. Knuth, and
          <string-name>
            <given-names>H.</given-names>
            <surname>Sack</surname>
          </string-name>
          .
          <article-title>DBpedia ontology enrichment for inconsistency detection</article-title>
          .
          <source>In Proceedings of the 8th International Conference on Semantic Systems</source>
          , pages
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.</given-names>
            <surname>Troullinou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kondylakis</surname>
          </string-name>
          , E. Daskalaki, and
          <string-name>
            <given-names>D.</given-names>
            <surname>Plexousakis</surname>
          </string-name>
          .
          <article-title>RDF Digest: efficient summarization of RDF/S KBs</article-title>
          .
          <source>In European Semantic Web Conference</source>
          , pages
          <fpage>119</fpage>
          -
          <lpage>134</lpage>
          . Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pietrobon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>Quality assessment for linked data: A survey</article-title>
          .
          <source>Semantic Web</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>63</fpage>
          -
          <lpage>93</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , G. Cheng, and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          .
          <article-title>Ontology summarization based on rdf sentence graph</article-title>
          .
          <source>In Proceedings of the 16th international conference on World Wide Web</source>
          , pages
          <fpage>707</fpage>
          -
          <lpage>716</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>