<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mining Validating Shapes for Large Knowledge Graphs via Dynamic Reservoir Sampling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matteo Lissandrini</string-name>
          <email>matteo.lissandrini@univr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kashif Rabbani</string-name>
          <email>kashifrabbani@cs.aau.dk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katja Hose</string-name>
          <email>katja.hose@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aalborg University</institution>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TU Wien</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Verona</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowledge Graphs (KGs) are databases that model knowledge from heterogeneous domains using the graph data model. Shape constraint languages have been adopted in KGs to ensure their data quality. They encode the equivalent of a schema in the Resource Description Framework (RDF). Unfortunately, few KGs are accompanied by a corresponding set of validating shapes. When validating shapes are missing, the solution is to extract them from the graph via mining techniques. Current shape extraction methods are often incomplete, not scalable, and generate spurious shapes. Thus, in this discussion paper, we present our recent contribution: a novel Quality Shapes Extraction (QSE) method for large graphs. QSE computes confidence and support for shape constraints via a novel Dynamic Reservoir Sampling method, enabling the identification of informative and reliable shapes. QSE is the first method (validated on WikiData and DBpedia) to extract a complete set of shapes from large real-world KGs.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Data Mining</kwd>
        <kwd>Data Quality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Knowledge Graphs (KGs) are databases of collections of triples using the Resource Description
Framework (RDF) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and represented in the form ⟨subject, predicate, object⟩. They are in
widespread use both within companies [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ] and on the Web [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. Yet, their heterogeneous
nature and the semi-automatic way in which they are built (usually by crawling data from
many sources) leads to important issues of data quality [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ]. To face this situation, shape
constraint languages such as SHACL [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and ShEx [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] have been designed to offer a way to
enforce validation rules. In practice, they work as schema languages for KGs; for this reason,
they are generally called validating shapes. Consider a simple KG representing entities of type
Student (see Figure 1): a validating shape would require these entities to have
a name, a registration number, and an enrollment in some courses. The shape would also
specify that these attributes should be of type string, integer, and Course, respectively. In an
ideal scenario, validating shapes would be manually specified by domain experts before data
is inserted into the database. Nonetheless, for already-existing
large-scale KGs, this manual process is often very cumbersome, so data curators need tools that
can speed up this process [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>[Figure 1: (a) an example KG with entities such as :alice and :bob and classes :Student, :FullProfessor, :Chair, :Department, :University, and :Course; (b) the corresponding shape graph, where, e.g., sh:Chair requires ≥ 1 headOf sh:Department and sh:Student requires = 1 name, ≥ 1 advisor, and ≥ 1 takesCourse.]</p>
      <p>
        To address this need, a few tools have been proposed to produce automatically [
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15">12, 13, 14, 15</xref>
        ] or semi-automatically [
        <xref ref-type="bibr" rid="ref16">16, 17, 18</xref>
        ] a set of validating shapes for a
target KG. In our work [19], we identify three important limitations of existing systems: (1) they
produce incomplete shapes, as they support a very limited family of constraints, e.g., they do
not support the extraction of cardinality constraints; (2) their extraction process is easily affected
by errors and inconsistencies in the KG, e.g., if some departments, by mistake, have attached
the property hasAdvisor, a corresponding spurious shape is extracted; and, above all, (3) they do
not scale to large KGs, e.g., they take days to process just a subset of WikiData.
      </p>
      <p>Among these issues, we verified that spuriousness poses important challenges to automatic
shape extraction methods [19]. For instance, in one of the snapshots of DBpedia [20] we analyzed,
some of the entities representing musical bands were wrongly assigned to the class dbo:City
because of some faulty automatic text-matching process. As a consequence, when shapes are
extracted from this dataset using existing approaches, the resulting node shape for dbo:City
specifies that cities are allowed optional properties like dbo:genre and dbo:formerBandMember.
Therefore, our position is that the key to solving these challenges lies in efficient
extraction algorithms that take into account the prevalence of shapes. These algorithms should
extract shapes annotated with the number of entities and triples that satisfy them (in
mining terms, we talk of support and confidence [21]). Further, they should be able to do so
even on commodity machines. Due to the very restricted nature of validating shapes (i.e., they
are not arbitrary graph patterns), we propose a highly optimized mining algorithm (QSE-Exact).
Then, to tackle the issue of scalability in extremely large KGs, we devised a novel sampling
process, called Dynamic Reservoir Sampling, which allows us to propose QSE-Approximate. As
a result, QSE can filter out shapes affected by spurious or erroneous data based on robust and easily
understandable measures. QSE allows shape extraction both from KGs available as files and
from SPARQL endpoints. Finally, our efficient approximation algorithm enables shape extraction
even on a commodity machine by sampling the KG entities via a dynamic multi-tiered reservoir
sampling technique. Our experiments on real-world KGs further prove that our sampling
strategy is accurate and efficient.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RDF Shapes and the QSE Problem</title>
      <p>
        The standard model for encoding KGs is the Resource Description Framework (RDF [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), which
describes data as a set of triples ⟨s, p, o⟩ ∈ 𝒢, each stating that a subject s is in a relationship with an
object o through predicate p. In these triples, subjects, predicates, and objects can be IRIs (ℐ)
or (only for objects) literal values. Moreover, we distinguish two special subsets of the IRIs ℐ:
predicates 𝒫 and classes 𝒞. The set of predicates 𝒫 ⊂ ℐ is the subset of IRIs that appear in the
predicate position p in any ⟨s, p, o⟩ ∈ 𝒢. Among the predicates 𝒫, we identify the type predicate
a ∈ 𝒫, which corresponds to the IRI rdf:type [22] or wdt:P31 in WikiData [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], as the predicate that
connects all entities that are instances of a class to the node representing the class itself, i.e.,
their type. Thus, all the IRIs that are classes form the subset 𝒞 := {c ∈ ℐ | ∃s ∈ ℐ s.t. ⟨s, a, c⟩ ∈ 𝒢}.
      </p>
      <p>Given a KG 𝒢, a set of validating shapes represents integrity constraints in the form of a
shape schema 𝒮 over 𝒢. Since the shape schema describes shapes associated with node types
and their connections to other attributes and other node types, we can also visualize the shape
schema 𝒮 as a particular type of graph (see Figure 1b). Therefore, in the following, we refer to
two concepts: the data graph 𝒢 and the shape graph derived from 𝒢. The data graph is the RDF
graph 𝒢 to be validated, while the shape graph consists of constraints in the form of the shape
schema 𝒮 against which entities of the data graph are validated. These constraints are defined
using node and property shapes. In the following, we adopt the previously defined syntax [23]
to refer to the set 𝒮 according to the SHACL core constraint components [24]. Finally, without
loss of generality, we focus on the current standard for SHACL shapes, but our
methods can be applied to ShEx via converters [25].</p>
      <p>When validating a graph 𝒢 against a shape schema 𝒮 having a node shape ⟨s, τ_s, Φ_s⟩ ∈ 𝒮,
where s is the shape for the target class τ_s ∈ 𝒞 and Φ_s defines which properties each instance
of τ_s can or should be associated with, we verify that each entity e ∈ 𝒢 that is an instance of
τ_s satisfies all the constraints Φ_s. Note that we use the terms entity and node interchangeably
throughout the paper.</p>
      <p>
        Shapes Extraction. Here we study the case where 𝒢 is given and we want to extract from it the
set of validating shapes 𝒮 that validates every class in 𝒞. This is the shapes extraction
problem. In this case, existing automatic approaches [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] assume the graph to be correct, then
iterate over all entities in it, and extract for each entity e all the shapes necessary to validate
e. The union of all such shapes is assumed to be the final schema 𝒮. This is useful when we
want to validate new data that will be added to the KG in the future, so that it conforms to
the data already in the graph. Unfortunately, this approach also produces spurious shapes. For
instance, in Figure 1, since :alice has both type FullProfessor and Chair, when parsing the triple
(:alice, :headOf, :CS_Faculty), the property shape headOf (the red dotted arrow in Figure 1b) is
assigned to both node shapes, instead of being assigned to the Chair node shape only.
      </p>
      <p>Shapes Support and Confidence. To counter the effect of spuriousness, we want to exploit
statistics on how often properties are applied to entities of a given type. Therefore, we introduce
the notion of support and confidence for shape constraints to study the reliability of extracted
shapes. These concepts are inspired by the well-known theory developed for the task of frequent
pattern mining [26]. In our approach, a property shape corresponds to a node- and edge-labeled
graph pattern. Thus, given a node shape s: ⟨s, τ_s, Φ_s⟩ ∈ 𝒮, its support is the number of entities that
are of type τ_s, while the support of a property shape φ: ⟨p, T_p, C_p⟩ ∈ Φ_s is the number of
entities conforming to it. Finally, the confidence of a constraint φ measures the ratio between
how many entities conform to φ and the total number of entities that are instances of the target
class of the shape (for brevity, we leave out the full computation and formalization [19]).</p>
      <p>In practice, as in the case of frequent pattern mining [26], when extracting
validating shapes, the support (supp(φ)) provides insights into how frequently a constraint is
matched in the graph, i.e., the number of entities e satisfying a constraint φ. Similarly
to the task of itemset mining [27], the confidence (conf(φ)) can tell us how strong the
association between a node type and a specific constraint is, i.e., the proportion of entities e
satisfying a constraint φ among all the entities that are instances of the node type τ_s of s ∈ 𝒮.
For instance, the confidence of the property shape headOf (Figure 1b) in our snapshot of LUBM is
10% for the FullProfessor node shape and 100% for Chair, which indicates a strong association
of the headOf property shape with the latter and a weak association with the former.</p>
      <p>Given the need to extract shapes from a large existing graph 𝒢, we formally define the
problem of extracting high-quality shapes from KGs as follows:
Problem 1 (Quality Shapes Extraction). Given an RDF graph 𝒢, a threshold σ for support,
and a threshold γ for confidence, the problem of quality shapes extraction over 𝒢 is to find the set of shapes 𝒮
such that for all node shapes ⟨s, τ_s, Φ_s⟩ ∈ 𝒮 it holds that supp(s) &gt; σ, and for all property shapes
φ: ⟨p, T_p, C_p⟩ ∈ Φ_s, supp(φ) &gt; σ and conf(φ) &gt; γ.</p>
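      <p>The pruning rule of Problem 1 can be sketched as follows; the function names, tuple layout, and toy numbers are ours for illustration, not the authors’ implementation:</p>
      <p>```python
def conf(supp_phi, class_count):
    """Confidence: entities conforming to the constraint divided by the
    number of instances of the shape's target class."""
    return supp_phi / class_count if class_count else 0.0

def qse_filter(candidates, class_counts, min_supp, min_conf):
    """Keep only property-shape candidates whose support and confidence
    exceed the thresholds, as in Problem 1.
    candidates: (target_class, property, object_type, support) tuples."""
    kept = []
    for tc, prop, obj_type, supp in candidates:
        c = conf(supp, class_counts[tc])
        if supp > min_supp and c > min_conf:
            kept.append((tc, prop, obj_type, supp, c))
    return kept

# Toy numbers echoing Figure 1: headOf has 100% confidence for Chair but
# only 10% for FullProfessor, so the latter is pruned at a 25% threshold.
counts = {":Chair": 1, ":FullProfessor": 10}
cands = [(":Chair", ":headOf", ":Department", 1),
         (":FullProfessor", ":headOf", ":Department", 1)]
kept = qse_filter(cands, counts, min_supp=0, min_conf=0.25)
```</p>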
    </sec>
    <sec id="sec-3">
      <title>3. QSE-Exact: Exact Shape Extraction algorithm</title>
      <p>Extracting shapes 𝒮 from an RDF graph 𝒢 requires processing its triples and analyzing the
types of the nodes involved. At a high level, we need to know, for each entity, all its types; these will
become node shapes. Then, for each entity type, we identify property shapes, which requires,
in turn, knowing the types of the objects as well. Furthermore, we need to keep frequency
counts to know how often a specific property connects nodes of two given types compared
to how many entities exist of those types. Therefore, the extraction of shapes 𝒮 from an RDF
graph 𝒢 involves a computation over all of its triples. During this process, the system surveys
which types are associated with all distinct subjects and objects within these triples. Further, for
each entity type, property shapes must be determined; this involves examining the predicates
associated with pairs of subject and object types. This process thus maintains frequency counts to
establish the prevalence of a specific property connecting nodes of two particular types relative
to the total number of entities of those types.</p>
      <p>In our solution [19], this is done in four steps (see Figure 2): (1) entity extraction, (2) entity
constraints extraction, (3) support and confidence computation, and (4) shapes extraction.
These steps apply similarly both when the graph is stored as a complete dump in a
single file and when it is stored within a triplestore [19]. Here, for simplicity, we focus on the file-based
approach. The endpoint version of the algorithm instead requires the execution of multiple
SPARQL queries to extract equivalent information.</p>
      <p>QSE-Exact (file-based). One of the most common ways to store an RDF graph 𝒢 in a file F
is to represent it as a sequence of triples. Therefore, QSE reads F line by line and processes it as a
stream of ⟨s, p, o⟩ triples. In the entity extraction phase, the algorithm parses each triple
containing a type declaration (e.g., rdf:type or wdt:P31 – this can be configured) and, for each
entity, it stores the set of its entity types and the global count of their frequencies, i.e., the number
of instances of each class, in the maps Ψetd (Entity-to-Data) and Ψcec (Class-to-Entity-Count),
respectively. For example, Figure 2 (phase 1) presents two example entities :bob and :alice
(from the example graph of Figure 1) having entity types :Student, :FullProfessor, and :Chair,
respectively. Figure 2 also presents the structure of the Entity-to-Data Ψetd dictionary map to
help understand the captured entities and their information.</p>
      <p>[Figure 2: the four phases of QSE-Exact on the running example, showing the Ψetd dictionary for :bob and :alice, their candidate property constraints (e.g., :name [:String], :takesCourse [:Course], :advisor [:FullProfessor, :Chair]), and the per-class reservoirs R1…Rn.]</p>
      <p>In the second phase, i.e., entity
constraints extraction, the algorithm performs a second pass over F to collect the constraints
and the meta-data required to compute support and confidence of each candidate property
shape. Specifically, it parses all triples except triples containing type declarations (which can be
skipped now) to obtain for each predicate the subject and object types from the map Ψetd that
was populated in the previous step. The type of a literal object is inferred from its value, while
for a non-literal object it is obtained from Ψetd. For example, Ψetd records that the types of :alice
are :FullProfessor and :Chair. Then, the Entity-to-Property-Data map Ψetpd is updated to add
the candidate property constraints associated with each subject entity. Figure 2 (phase 2) shows
the meta-data captured for the properties of :bob and :alice.</p>
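      <p>The first two phases can be sketched as follows; the triple stream, the map names, and the crude literal-type inference are simplified stand-ins for the structures described above, not QSE’s actual code:</p>
      <p>```python
from collections import defaultdict

TYPE_PREDICATE = "rdf:type"  # configurable, e.g. wdt:P31 for WikiData

def pass1(triples):
    """Entity extraction: record each entity's types and per-class counts."""
    psi_etd, psi_cec = defaultdict(set), defaultdict(int)
    for s, p, o in triples:
        if p == TYPE_PREDICATE:
            psi_etd[s].add(o)
            psi_cec[o] += 1
    return psi_etd, psi_cec

def pass2(triples, psi_etd):
    """Entity-constraints extraction: for non-type triples, pair each
    predicate with the object's types (from psi_etd, or inferred for
    literals -- here crudely via the Python type name)."""
    psi_etpd = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        if p == TYPE_PREDICATE:
            continue  # type declarations can be skipped in the second pass
        obj_types = psi_etd.get(o) or {type(o).__name__}
        psi_etpd[s][p] |= obj_types
    return psi_etpd

triples = [(":alice", "rdf:type", ":FullProfessor"),
           (":alice", "rdf:type", ":Chair"),
           (":alice", ":headOf", ":CS_Faculty"),
           (":CS_Faculty", "rdf:type", ":Department")]
etd, cec = pass1(triples)
etpd = pass2(triples, etd)
```</p>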
      <p>In the third phase, i.e., support and confidence computation, the constraints’ information
stored in the maps (Ψetd, Ψcec) is used to compute support and confidence for specific constraints.
The algorithm iterates over the map Ψetd to get the inner map Ψetpd mapping entities to
candidate property shapes φ: ⟨p, T_p, C_p⟩ ∈ Φ_s, and retrieves the type of each entity using the type
information stored in Ψetd to build triplets of the form ⟨τ_s, p, τ_o⟩ and compute their support
and confidence. Figure 2 (phase 3) highlights some of these triplets for τ_s = :Student. The value
of support and confidence for each distinct triplet is incremented in each iteration and stored
in the ΨSupp and ΨConf maps. Additionally, a map ΨPTT (Property-to-Types) is populated with
the frequencies of distinct properties and their object types to establish the corresponding min/max
cardinality constraints.</p>
      <p>Finally, in the shapes extraction phase, the algorithm iterates over the values of the Ψctp
map and defines the shape name s, the shape’s target definition τ_s, and the set of shape
constraints Φ_s for each candidate class. The property shapes for a given node shape are
then extracted from the map Map⟨Property, Set⟩. An example shapes graph for our running
example is shown in Figure 1. The C_p constraint can have three types of values:
sh:Literal, sh:IRI, and sh:BlankNode. In the case of literal types, the literal object types such as
xsd:string, xsd:integer, or xsd:date are used. In the case of non-literal object types, however,
the sh:class constraint is used to declare the type of object allowed as a value for the
candidate property. It is possible to have more than one value for the sh:class and sh:datatype
constraints of a candidate property shape, e.g., to state that a property can accept both integers
and floats as values; in such cases, we use the sh:or constraint to encapsulate multiple values. A
more detailed explanation of each phase is available in the extended version of the paper [19].</p>
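      <p>To make the output concrete, the following hand-written SHACL Turtle sketch (ours, not QSE-generated output) shows what a node shape for Student could look like, following the constraints of Figure 1b; the ex: namespace and the regNumber property are illustrative:</p>
      <p>```turtle
# Prefix declarations for sh:, xsd:, and the illustrative ex: are omitted.
ex:StudentShape a sh:NodeShape ;
    sh:targetClass ex:Student ;
    sh:property [ sh:path ex:name ;
                  sh:datatype xsd:string ;
                  sh:minCount 1 ; sh:maxCount 1 ] ;
    sh:property [ sh:path ex:takesCourse ;
                  sh:class ex:Course ;
                  sh:minCount 1 ] ;
    # sh:or encapsulates multiple admissible value types for one property:
    sh:property [ sh:path ex:regNumber ;
                  sh:or ( [ sh:datatype xsd:integer ]
                          [ sh:datatype xsd:string ] ) ] .
```</p>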
      <p>Cardinality Constraints. QSE supports assigning cardinality constraints (sh:minCount and
sh:maxCount) to each property shape constraint φ: ⟨p, T_p, C_p⟩. Following the open-world
assumption, all shape constraints are initially assigned a minimum cardinality of 0, making
them optional. However, in some cases, certain properties must be mandatory (min count: 1),
while others should appear exactly once per entity (i.e., should be assigned both a min and
a max count equal to 1). Trivially, one can assign minimum cardinality 1 to property shapes
having confidence 100%, i.e., to those cases in which all entities have that property. In the case of
incomplete KGs, QSE allows users to provide a different confidence threshold value for adding
the min cardinality constraints. To achieve this, we extend the fourth phase and add a min
cardinality constraint for property shapes based on the min-confidence provided by the user.
QSE also keeps track of properties having maximum cardinality equal to 1 in the second phase
and assigns sh:maxCount=1 to those property shapes in the fourth phase of shapes extraction.</p>
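      <p>The cardinality rule just described can be sketched as follows; the function name and signature are illustrative, not QSE’s API:</p>
      <p>```python
def cardinality(confidence, max_seen, min_conf_threshold=1.0):
    """Sketch of the rule above: min count defaults to 0 (open-world) and
    is raised to 1 once confidence reaches the user-provided threshold;
    sh:maxCount 1 is added when no entity repeats the property."""
    min_count = 1 if confidence >= min_conf_threshold else 0
    max_count = 1 if max_seen == 1 else None  # None: leave sh:maxCount unset
    return min_count, max_count

assert cardinality(1.0, 1) == (1, 1)                           # exactly once
assert cardinality(0.9, 3, min_conf_threshold=0.8) == (1, None)
assert cardinality(0.4, 1, min_conf_threshold=0.8) == (0, 1)
```</p>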
      <p>Our analysis [19] shows that QSE-Exact requires only two passes over the file containing the
triples. On the other hand, QSE-Exact keeps type and property information for each entity in
memory while extracting shapes. As a result, its memory requirements are prohibitively large
when dealing with very large KGs. Therefore, we propose QSE-Approximate to enable shape
extraction from very large KGs with reduced memory requirements.</p>
    </sec>
    <sec id="sec-4">
      <title>4. QSE-Approximate</title>
      <p>QSE-Approximate solves the scalability issue of shape extraction approaches by employing a
sampling technique. Thanks to this technique, we are able to drastically reduce the
memory requirements of QSE-Exact. QSE-Approximate is based on a multi-tiered dynamic
reservoir-sampling (DRS) algorithm. Reservoir sampling is a technique for selecting a random
sample of items from a stream of data. Given a sample size fixed in advance, it ensures each
item has an equal probability of being chosen. In our algorithm, we maintain as many reservoirs
as there are types in the graph. Yet, entities have a very skewed distribution, so fixing the size of each
reservoir in advance leads to a huge waste of memory, as most reservoirs remain almost empty.
Our method, instead, dynamically resizes each reservoir as new triples are parsed. Moreover,
the replacement of nodes in a reservoir is performed based on the number of node types
across reservoirs. The resulting algorithm replaces the first phase of QSE. After sampling, the
information about the sampled entities is used in the remaining phases
of our exact algorithm exactly as before. Hence, we maintain in memory information only for a representative
sample of entities, forming an induced sampled graph, enough to detect all shapes.</p>
      <p>QSE-Approximate receives as input a graph file F, a sampling percentage (Sampling%), and a
maximum reservoir size per class. After initialization, the triples of F are parsed and
filtered based on whether they contain a type declaration. From the type triples, we extract the entities
to populate the Entity-to-Data map Ψetd, while non-type triples are parsed to keep counts of
distinct properties in the Property-Count map Ψpc. For instance, :alice is an entity of type
:FullProfessor and :Chair in Ψetd shown in Figure 2. QSE-Approximate maintains a reservoir for
each distinct entity type τ, e.g., a distinct reservoir for entities of type :Student (R1),
:FullProfessor (R2), and :Chair (R3) shown in Figure 2, using a map of sampled entities per
class (Ψsepc). The reservoir capacity map (Ψrcpc) stores the current maximum capacity of the
reservoir for each τ. If τ does not yet have a reservoir, i.e., it appears in neither Ψsepc nor Ψrcpc, one is
created. Then, the entity e is inserted into the reservoir for τ, e.g., :alice is inserted into both reservoirs
R2 and R3 shown in Figure 2. If the reservoir has reached its current capacity limit, we may
have to replace an entity in the reservoir with the current one. Hence, neighbor-based dynamic
reservoir sampling is performed, i.e., a random number r is generated between zero and the
current number of type declarations read from F. If r falls within the reservoir size, then a node
in the reservoir is replaced with e. To select which node to replace, we consider the target
node at index r together with its neighbors at indexes r − 1 and r + 1. Among
these, the node having minimum scope (i.e., the minimum number of types that are known at
this point in time) is selected to be replaced by the current entity e. Additionally, the algorithm keeps
track of the actual Class-to-Entity-Count in Ψcec, i.e., the exact count of how many entities of each
type we have seen. Once the reservoir for τ is updated, we compute the proportion of entities
sampled so far with type τ over the total number of entities of that type seen up to now. Given
the current and target sampling ratio (Sampling%) provided as input, the algorithm evaluates
whether to resize the reservoir for τ, if it has not already reached the maximum reservoir size.</p>
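      <p>A simplified sketch of the per-class reservoirs with dynamic resizing follows; the paper’s neighbor-based minimum-scope replacement is reduced here to replacing the node at the drawn index, and all names are illustrative:</p>
      <p>```python
import random
from collections import defaultdict

class DRS:
    """Per-class reservoirs with standard reservoir replacement and a
    dynamic resize step (a simplified stand-in for QSE-Approximate's
    multi-tiered dynamic reservoir sampling)."""

    def __init__(self, target_ratio=0.5, cap=1000, init_size=2):
        self.res = defaultdict(list)                 # sampled entities per class
        self.size = defaultdict(lambda: init_size)   # current reservoir capacities
        self.seen = defaultdict(int)                 # exact per-class entity counts
        self.target, self.cap = target_ratio, cap

    def offer(self, entity, cls):
        """Process one type declaration (entity, cls) from the stream."""
        self.seen[cls] += 1
        r = self.res[cls]
        if self.size[cls] > len(r):
            r.append(entity)                         # reservoir not yet full
        else:
            j = random.randrange(self.seen[cls])     # 0 .. seen-1
            if len(r) > j:
                r[j] = entity                        # replace a sampled node
        # dynamic resize: grow while the sampled ratio is under the target
        if self.target > len(r) / self.seen[cls] and self.cap > self.size[cls]:
            self.size[cls] += 1
```</p>
      <p>Growing capacities on demand, instead of pre-allocating a fixed reservoir per class, mirrors the paper’s point that skewed type distributions would otherwise leave most reservoirs almost empty.</p>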
      <p>While performing shapes pruning using counts over the sampled entities, QSE-Approximate
needs to estimate the actual support and confidence of a property shape φ from the
values computed on the sampled data. To estimate the effective support of a property
shape φ, we scale the support s computed for φ in the current sample by the sampling ratio, i.e.,
ŝ = s / min(|T*_p| / |T_p|, |E*_τ| / |E_τ|), where T_p represents all triples in 𝒢 having property p, T*_p
represents the triples having property p across all entities in all reservoirs, E_τ represents all entities
of type τ in 𝒢, and E*_τ represents all entities of type τ in the reservoir. Similarly, the confidence
of a property shape is estimated by dividing the estimated support by |E_τ|.</p>
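      <p>As one plausible reading of this estimator (the combining function was garbled in our source; the sketch below assumes the scaling factor is the minimum of the two sampling ratios, and all names are illustrative):</p>
      <p>```python
def estimate_support(s_sample, triples_p_sampled, triples_p_total,
                     entities_tau_sampled, entities_tau_total):
    """Scale the sampled support by the smaller of the two sampling ratios
    (an assumed reading of the estimator, not the authors' exact formula)."""
    ratio = min(triples_p_sampled / triples_p_total,
                entities_tau_sampled / entities_tau_total)
    return s_sample / ratio if ratio else 0.0

def estimate_confidence(s_est, entities_tau_total):
    """Estimated confidence: estimated support over all entities of the type."""
    return s_est / entities_tau_total

# e.g. 40 conforming entities in a 12.5% sample extrapolate to 320
s_hat = estimate_support(40, 1_000, 8_000, 500, 4_000)
```</p>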
      <p>Space Analysis. QSE-Approximate’s space complexity depends on the values of the target
Sampling%, the maximum reservoir size, and the number of entity types |T| in 𝒢. In the
worst case, it requires space proportional to 2·|T| times the maximum reservoir size. Therefore, while 𝒢 can contain hundreds of millions of
entities, we can still easily estimate how many distinct types are in the graph and select the maximum reservoir size
to fit the available memory.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation and Discussion</title>
      <p>
        We selected a synthetic dataset, LUBM-500 [28], and three real-world datasets: DBpedia [20]
downloaded on 01.10.2020; YAGO-4 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], for which we use the subset containing instances from
the English Wikipedia, downloaded on 01.12.2020; and WikiData [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], in two variants, i.e., a
dump from 2015 [29] (Wdt15), used in the original evaluation of SheXer [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and the truthy
dump from September 2021 (Wdt21) filtered by removing non-English strings. Among these,
Wdt21 is the largest and contains almost 2B triples, with 82K node types, and 9K property
types. We have implemented the QSE algorithms in Java 11. All experiments are performed on a
single machine with 16 cores and 256 GB RAM. Yet, we also test the algorithms’ performance in a
resource-constrained environment. The full experimental setup of QSE is available online [30].
      </p>
      <p>
        QSE-Exact. We initially considered SheXer [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], ShapeDesigner [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], and SHACLGEN [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
as state-of-the-art approaches [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to compare against QSE. Yet, from our initial experiments, we
verified that their current implementations cannot handle large KGs with more than a few million
triples and do not manage to extract shapes from KGs having more than a few hundred classes.
Therefore, in the following, we focus our comparison on SheXer. Table 1 shows the running
time and memory consumption to extract shapes for all datasets using the File (F) and Query-based
(Q) variants of SheXer, QSE-Exact, and QSE-Approximate. Among the file-based approaches,
QSE-Exact is one order of magnitude faster than SheXer for all datasets. Further, it consumes up
to 50% less memory than SheXer. We note that SheXer goes out of memory (OutM) for Wdt21.
      </p>
      <p>When looking at the shapes produced after pruning via support and confidence [19], as
expected, the results show that the higher we set the thresholds for support and confidence, the
higher the percentage of PSc and PS that are pruned. Precisely, DBpedia contains 11 Property
Shapes (PS), with 38 non-literal and 5 literal Property Shape constraints (PSc); when QSE
performs pruning with confidence &gt;25% and support ≥ 1, it prunes out 99% of the PSc and PS. Similarly,
for Wdt21, QSE prunes 85% of non-literal and 97% of literal constraints, and 66% of PS for confidence
&gt;25% and support ≥ 1. A manual inspection showed a very high correlation
between high-support shapes and shapes that should be valid in the KG.</p>
      <p>QSE-Approximate. Table 1 shows how QSE-Approximate reduces the memory
requirements of the exact approach by allowing users to specify the sampling percentage (Sampling%,
S% for short) and the maximum limit of the reservoir size, i.e., the maximum number of
entities to be sampled per class, to reduce the number of entities kept in memory. Here (with
a maximum reservoir size of 1000 and S%=100%), to extract shapes from Wdt21, QSE-Approximate required almost
half the time with one order of magnitude less memory than QSE-Exact, while SheXer could not
complete the computation. Similarly, among the query-based approaches, QSE-Approximate proved
to be the only approach able to extract shapes from the Wdt21 endpoint, in 5.7 hours with 64 GB of
memory consumption. In contrast, QSE-Exact and SheXer timed out (24 hours).</p>
      <p>Practical Implications of QSE. We tested the practical utility of QSE by evaluating the
correctness of the extracted shapes and their effect when used to validate the KG [19]. The results
of this analysis showed that QSE extracts shapes with 100% precision in terms of correct
shape constraints that should be part of the final set of shapes (qualifying as quality shapes),
removing spurious shape constraints. Further, we used these 10 shapes, extracted by QSE, to
validate DBpedia using a SHACL validator and found 20,916 missing triples and 155 erroneous
triples. The detailed results of this analysis are contained in the extended version [19]. Overall, this
experiment shows that by using our technique the user is provided with a refined set of valid
shapes that can effectively identify errors in the KG.</p>
      <p>[17] TopQuadrant, TopBraid, https://www.topquadrant.com/products/topbraid-composer/, 2023. Accessed 6th May, 2024.
[18] H. J. Pandit, D. O’Sullivan, D. Lewis, Using ontology design patterns to define SHACL
shapes, in: Proceedings of the 9th Workshop on Ontology Design and Patterns, volume
2195 of CEUR Workshop Proceedings, CEUR-WS.org, Monterey, USA, 2018, pp. 67–71.
[19] K. Rabbani, M. Lissandrini, K. Hose, Extraction of validating shapes from very large
knowledge graphs, Proc. VLDB Endow. 16 (2023) 1023–1032. URL: https://doi.org/10.14778/
3579075.3579078. doi:10.14778/3579075.3579078.
[20] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. G. Ives, Dbpedia: A nucleus for
a web of open data, in: The Semantic Web, 6th International Semantic Web Conference,
volume 4825 of Lecture Notes in Computer Science, Springer, Busan, Korea, 2007, pp. 722–735.
[21] G. Preti, M. Lissandrini, D. Mottin, Y. Velegrakis, Mining patterns in graphs with
multiple weights, Distributed and Parallel Databases (2019). URL: https://doi.org/10.1007/
s10619-019-07259-w. doi:10.1007/s10619-019-07259-w.
[22] W3C, W3C: RDF Type, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, 2023. Accessed
6th May, 2024.
[23] O. Savkovic, E. Kharlamov, S. Lamparter, Validation of SHACL constraints over kgs
with OWL 2 QL ontologies via rewriting, in: The Semantic Web - 16th International
Conference, ESWC 2019, Portorož, volume 11503 of Lecture Notes in Computer Science,
Springer, Slovenia, 2019, pp. 314–329.
[24] W3C, SHACL- core constraint components, https://www.w3.org/TR/shacl/
#core-components, 2023. Accessed 6th May, 2024.
[25] WesoShaclConvert, SHACL to ShEx converter, https://rdfshape.weso.es/shaclConvert,
2023. Accessed 6th May, 2024.
[26] J. Han, H. Cheng, D. Xin, X. Yan, Frequent pattern mining: current status and future
directions, Data mining and knowledge discovery 15 (2007) 55–86.
[27] C. Borgelt, Frequent item set mining, Wiley interdisciplinary reviews: data mining and
knowledge discovery 2 (2012) 437–456.
[28] Y. Guo, Z. Pan, J. Heflin, Lubm: A benchmark for owl knowledge base systems, Journal of</p>
      <p>Web Semantics 3 (2005) 158–182.
[29] WikiData, WikiData-2015, https://archive.org/details/wikidata-json-20150518, 2023.
Accessed 6th May, 2024.
[30] K. Rabbani, Quality Shape Extraction - resources and source code, https://github.com/
dkw-aau/qse, 2023. Accessed 6th May, 2024.</p>
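      <p>The validation step described above can be illustrated with a small SHACL shape of the kind a SHACL validator consumes. This is a hypothetical sketch: the shape name (ex:PersonShape), the target class dbo:Person, the property dbo:birthDate, and the cardinalities are illustrative assumptions, not constraints actually extracted by QSE.</p>

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/shapes#> .

# Hypothetical node shape targeting all instances of dbo:Person.
ex:PersonShape
    a sh:NodeShape ;
    sh:targetClass dbo:Person ;
    sh:property [
        sh:path dbo:birthDate ;   # each person should carry exactly one birth date
        sh:datatype xsd:date ;
        sh:minCount 1 ;           # a missing birth date is reported as a violation
        sh:maxCount 1 ;           # duplicate or conflicting dates are reported as well
    ] .
```

      <p>Validating the KG against such a shapes graph with any conformant SHACL engine yields a validation report in which entities violating sh:minCount surface as missing triples and values violating sh:datatype surface as erroneous triples, mirroring the kind of errors reported in the experiment above.</p>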
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>W3C</surname>
          </string-name>
          , RDF 1.1, https://w3.org/RDF/,
          <year>2014</year>
          . Accessed 6th May,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Henson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <article-title>Using knowledge graphs to search an enterprise data lake</article-title>
          ,
          <source>in: The Semantic Web: ESWC 2019 Satellite Events - ESWC</source>
          , volume
          <volume>11762</volume>
          of Lecture Notes in Computer Science, Springer, Portorož, Slovenia,
          <year>2019</year>
          , pp.
          <fpage>262</fpage>
          -
          <lpage>266</lpage>
          . URL: https://doi.org/10.1007/978-3-030-32327-1_46.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sequeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Lassila</surname>
          </string-name>
          ,
          <article-title>Designing and building enterprise knowledge graphs</article-title>
          ,
          <source>Synthesis Lectures on Data, Semantics, and Knowledge</source>
          <volume>11</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>Industry-scale knowledge graphs: lessons and challenges</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>62</volume>
          (
          <year>2019</year>
          )
          <fpage>36</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Tanon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <article-title>YAGO 4: A reason-able knowledge base</article-title>
          ,
          <source>in: The Semantic Web - 17th International Conference, ESWC</source>
          , volume
          <volume>12123</volume>
          of Lecture Notes in Computer Science, Springer, Heraklion, Crete, Greece,
          <year>2020</year>
          , pp.
          <fpage>583</fpage>
          -
          <lpage>596</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>WESO</surname>
          </string-name>
          , RDFShape, http://rdfshape.weso.es,
          <year>2023</year>
          . Accessed 6th May,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Rabbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lissandrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          ,
          <article-title>SHACL and ShEx in the wild: A community survey on validating shapes generation and adoption</article-title>
          ,
          <source>in: Proceedings of the ACM Web Conference</source>
          <year>2022</year>
          , ACM, Online, Lyon, France,
          <year>2022</year>
          , pp.
          <fpage>260</fpage>
          -
          <lpage>263</lpage>
          . URL: https://www2022.thewebconf.org/PaperFiles/65.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Prud'hommeaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Solbrig</surname>
          </string-name>
          ,
          <article-title>Shape expressions: an RDF validation and transformation language</article-title>
          ,
          <source>in: Proceedings of the 10th International Conference on Semantic Systems, SEMANTiCS</source>
          <year>2014</year>
          , Leipzig, Germany, September 4-5, 2014, ACM
          , pp.
          <fpage>32</fpage>
          -
          <lpage>40</lpage>
          . URL: https://doi.org/10.1145/2660517.2660523. doi:10.1145/2660517.2660523.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Knublauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <article-title>Shapes constraint language (SHACL)</article-title>
          ,
          <source>W3C Candidate Recommendation</source>
          <volume>11</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Prud'hommeaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Solbrig</surname>
          </string-name>
          ,
          <article-title>Shape expressions: an RDF validation and transformation language</article-title>
          ,
          <source>in: Proceedings of the 10th International Conference on Semantic Systems</source>
          , SEMANTiCS, ACM, Leipzig, Germany,
          <year>2014</year>
          , pp.
          <fpage>32</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Keely</surname>
          </string-name>
          , SHACLGEN, https://pypi.org/project/shaclgen/,
          <year>2023</year>
          . Accessed 6th May,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Fernandez-Álvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Labra-Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gayo-Avello</surname>
          </string-name>
          ,
          <article-title>Automatic extraction of shapes using sheXer</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>238</volume>
          (
          <year>2022</year>
          )
          <fpage>107975</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimmino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fernández-Izquierdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>García-Castro</surname>
          </string-name>
          ,
          <article-title>Astrea: Automatic generation of SHACL shapes from ontologies</article-title>
          ,
          <source>in: ESWC</source>
          , volume
          <volume>12123</volume>
          of Lecture Notes in Computer Science, Springer, Heraklion, Crete, Greece,
          <year>2020</year>
          , p.
          <fpage>497</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mihindukulasooriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R. A.</given-names>
            <surname>Rashid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>García-Castro</surname>
          </string-name>
          , Ó. Corcho,
          <string-name>
            <given-names>M.</given-names>
            <surname>Torchiano</surname>
          </string-name>
          ,
          <article-title>RDF shape induction using knowledge base profiling</article-title>
          ,
          <source>in: Proceedings of the 33rd Annual ACM Symposium on Applied Computing</source>
          , SAC
          , ACM, Pau, France,
          <year>2018</year>
          , pp.
          <fpage>1952</fpage>
          -
          <lpage>1959</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I.</given-names>
            <surname>Boneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dusart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fernández-Álvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <article-title>Shape designer for shex and SHACL constraints</article-title>
          ,
          <source>in: Proceedings of the ISWC 2019 Satellite Tracks</source>
          , volume
          <volume>2456</volume>
          <source>of CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Auckland, New Zealand,
          <year>2019</year>
          , pp.
          <fpage>269</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>