<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Leiden, The Netherlands.
* Corresponding author.
Email: fernandezalvdaniel@uniovi.es (D. Fernández-Álvarez); yy@dbcls.rois.ac.jp (Y. Yamamoto); labra@uniovi.es
(J. E. Labra-Gayo); andra@micel.io (A. Waagmeester)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Extracting shapes from large RDF data collections</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Fernández-Álvarez</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yasunori Yamamoto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose Emilio Labra-Gayo</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andra Waagmeester</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Database Center for Life Science</institution>
          ,
          <addr-line>ROIS-DS, Kashiwa</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Micelio</institution>
          ,
          <addr-line>Antwerp</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>WESO Research Group, University of Oviedo</institution>
          ,
          <addr-line>Oviedo</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>There is an increasing number of projects based on RDF graphs. Shape languages, such as SHACL and ShEx, have been proposed to support the evolution of such projects in two main aspects: description and validation of RDF content. However, producing shapes for an existing knowledge graph is an arduous and time-consuming task when dealing with large data sources. Automatic shape extractors are software tools that tackle this issue: they produce RDF shapes by exploring existing RDF content. However, these tools usually suffer from scalability issues related to memory availability in precisely those scenarios where they would be most useful: large data sources. To deal with these situations, some extractors implement sampling strategies. They extract shapes from a representative part of the input data rather than using the whole dataset. However, such mechanisms may miss features that are infrequent in the input data. We propose an alternative approach based on splitting the original input into parts, running the extraction process over each part, and consolidating the obtained results into a single schema. We demonstrate through experimentation that our approach can outperform sampling with respect to the quality of the obtained results. The software used for these experiments is publicly available.</p>
      </abstract>
      <kwd-group>
        <kwd>RDF</kwd>
        <kwd>RDF Shapes</kwd>
        <kwd>Shape Extraction</kwd>
        <kwd>Large Graphs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the last decade, various declarative languages have been developed and discussed to define
RDF data patterns or shapes. These languages serve multiple purposes, from automatic validation
of RDF data sources [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to documenting the presented data or expressing expectations about
consumed data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. One such language dedicated to constraint validation is the Shapes Constraint
Language (SHACL) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which is a W3C recommendation. Shape Expressions (ShEx) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is a
similar declarative language which is the technology chosen by some significant projects such
as Wikidata [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Although these languages are not entirely equivalent, they are both based
on the concept of data shape. A data shape is a structure that describes the kind and number
of relations that one should expect to find for a specific type of node in a specific RDF graph.
Shapes are organized in schemas that describe a set of structures in a certain data source.
      </p>
      <p>Writing shape schemas is not a trivial task. It requires 1) skill in a shape language, 2)
enough knowledge of the data to describe its structure, and 3) time, both for writing
the schema and for maintaining it when the graph structure evolves. The complexity of this process
grows with factors such as the number of concepts to describe, their internal complexity (variety of
relations for each concept), the size of the data source, or the frequency of events that can trigger
a schema evolution.</p>
      <p>
        To tackle this issue, several approaches to perform automatic shape extraction have been
proposed [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Such approaches can generate shape schemas by inferring or mining data from
existing RDF content. Shape extractors reduce the time needed to create and maintain
up-to-date shape schemas, both when they are used to produce final shapes [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and when they produce shapes that
act as a draft that domain experts can refine a posteriori [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Shape extractors can be especially useful when dealing with large data sources, as it is usually
much harder to produce a shape schema and keep it up to date in this type of scenario. However,
most of the existing approaches have scalability problems related to memory usage in such
contexts.</p>
      <p>
        Some existing tools are able to extract shapes from large data sources by using sampling-based
strategies [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. However, such techniques can lead to knowledge loss when a certain
feature of the target data is infrequent and is not observed in the chosen sample.
      </p>
      <p>In this paper, we propose an alternative approach to extract shapes from large graphs. Our
proposal is based on slicing the target input, performing independent shape extractions over
each slice, and then merging the obtained results into a single schema. With this, we tackle the
described scalability issues while still using all the input’s knowledge.</p>
      <p>
        We have developed a software prototype implementing our proposal and used it to test its
potential. We extracted shapes from a subset of UniProt [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] using an approach with no memory
restrictions, whose produced schema was used as a baseline. Then, we extracted shapes with our
proposal and with a sampling-based one using different settings. In our experiments, our prototype
produced schemas more similar to the baseline while using the same amount of memory
as the sampling technique. Our prototype works with ShEx, but both the proposal and the
experimentation could be trivially extended to SHACL.
      </p>
      <p>In this document, we describe and discuss our proposal and prototype. Both the software and
data used in our experiments are publicly available, so the obtained results are fully reproducible.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Several tools following different approaches have been proposed for the automatic extraction
of RDF shapes from existing RDF content. We can distinguish two main types of
approaches: T-BOX-based and A-BOX-based.</p>
      <p>
        On the one hand, there are techniques based on mapping T-BOX concepts into shapes, i.e.,
ontology definitions involving classes, properties, and their domain, range, and expected
cardinality [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
        ]. However, when the data in a graph is built using contextual constraints
concerning the use of properties/classes that are not defined in the ontologies of those
properties/classes, then the shapes produced by this type of approach may not describe the actual
topologies in a graph as precisely as possible.
      </p>
      <p>
        On the other hand, there are techniques based on mining the neighborhood of some nodes
used as examples, assuming that such a mining process will reveal generalizable structures [
        <xref ref-type="bibr" rid="ref12 ref14 ref15 ref16 ref17 ref18 ref19 ref8 ref9">8,
9, 12, 14, 15, 16, 17, 18, 19</xref>
        ]. These approaches can capture contextual restrictions for the use of
a certain property/class, but they may not be able to 1) detect ontology constraint violations, or
2) add features to a shape that cannot be clearly distinguished among the example data but are
in an ontology [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The extraction process that we propose in this paper is another example of
an A-BOX-based proposal.
      </p>
      <p>Both T-BOX-based and A-BOX-based approaches can suffer from scalability issues related to
lack of main memory when handling large inputs, but this type of issue is more frequent with
A-BOX-based tools. When we explore the data contained in big well-known graphs, such as
UniProt or Wikidata, we can see that the number of abstract concepts is orders of magnitude
smaller than the number of entities that are instances of those concepts. Since A-BOX-based tools
work on instances rather than on concept definitions, they have far more data to process.</p>
      <p>
        A limitation of many A-BOX-based extractors is the size of the input graph, as they need to
load such content in an in-memory data structure. However, there are some other approaches,
such as sheXer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and QSE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which are based on iterative parsing strategies that do not
require allocating in main memory the whole target graph, but only the information that is
relevant for the extraction process. Still, the data structures required for such information can
lead to scalability issues.
      </p>
      <p>Both sheXer and QSE implement extra mechanisms to be able to perform shape extractions
in those scenarios. Those mechanisms are based on the same core idea: sampling. Rather than
using the whole set of available instances/examples to extract a shape, the tool selects a subset
of them that should be representative enough. However, as with any sampling-based approach,
there is a chance of missing some information when the sample is not adequate for detecting
infrequent and yet meaningful features.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposal description</title>
      <p>The core idea of our proposal consists of slicing the input content into several parts, running a
shape extraction process over each part, and, finally, merging the obtained results in a single
schema. To test this proposal, we have developed a software prototype built on top of existing
solutions for automatic shape extraction. In this section, we will explain each stage of our
proposed workflow and describe our implemented prototype.</p>
      <sec id="sec-3-1">
        <title>3.1. Input slicing</title>
        <p>In this stage of the process, we need to split the input into smaller parts or slices. For this
paper’s experimentation, we slice the input into chunks of N consecutive triples. To do so,
we used the Java ConvRDF library<sup>1</sup> to transform different RDF syntaxes into N-Triples and then
used bash commands to split the resulting file into chunks of a fixed number of lines.</p>
        <p>[Figure 1, fragment: :Sarah a :Person ; :age 22 ; :name "Sarah" .]</p>
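        <p>The line-based slicing described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the experiments: the function name slice_ntriples, its parameters and the output file naming are assumptions, and the input is assumed to be an N-Triples file with one triple per line.</p>
        <preformat><![CDATA[
```python
from itertools import islice
from pathlib import Path


def slice_ntriples(source, out_dir, triples_per_slice):
    """Split an N-Triples file into files of at most `triples_per_slice` lines.

    Hypothetical helper: names and file layout are illustrative only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    slice_paths = []
    with open(source, encoding="utf-8") as lines:
        while True:
            # Read the next block of consecutive lines without loading the whole file.
            chunk = list(islice(lines, triples_per_slice))
            if not chunk:
                break
            path = out_dir / f"slice_{len(slice_paths):05d}.nt"
            path.write_text("".join(chunk), encoding="utf-8")
            slice_paths.append(path)
    return slice_paths
```
]]></preformat>
        <p>Each returned path can then be handed to an independent extraction run, so only one slice needs to be processed at a time.</p>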
        <p>However, this slicing approach may cause some knowledge loss. Let us consider a graph such
as the one described on the left side of Figure 1. When we split such a file into groups of three
consecutive triples, we obtain the three sub-graphs shown in the upper part of Figure 2. Each
sub-graph contains a complete example of a :Person instance. These three inputs will allow a
shape extractor to produce the shape shown on the left side of Figure 3.</p>
        <p>Let us suppose now that we perform the same slicing approach with the triples on the right
side of Figure 1, which describe the same information with a different serialization. Then, we
would obtain the three sub-graphs shown at the bottom of Figure 2. An A-BOX-based extractor
aiming to produce a shape for the class :Person will look for :Person instances and check
their neighborhood. The only sub-graph of those three containing instances of :Person is
the first one, so that would be the only useful input for the extractor. However, the :Person
instances of this sub-graph are not connected to any other information but their type. Then,
the shape obtained with this sub-graph will be the one shown on the right side of Figure 3.</p>
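        <p>The effect of the serialization order can be reproduced with a toy example. The sketch below is only an illustration under assumptions: triples are plain Python tuples, :Sarah comes from the running example while :Bob and :Eve are made-up instances, and person_properties is a hypothetical stand-in for what an A-BOX-based extractor can observe in one slice.</p>
        <preformat><![CDATA[
```python
def slices(triples, size):
    """Split a list of triples into consecutive chunks of `size` triples."""
    return [triples[i:i + size] for i in range(0, len(triples), size)]


def person_properties(chunk):
    """Predicates observed for subjects typed as :Person within the same chunk."""
    people = {s for s, p, o in chunk if p == "a" and o == ":Person"}
    return {p for s, p, o in chunk if s in people}


# :Sarah comes from the running example; :Bob and :Eve are made-up instances.
people = [(":Sarah", 22, "Sarah"), (":Bob", 30, "Bob"), (":Eve", 41, "Eve")]
by_subject = [
    triple
    for iri, age, name in people
    for triple in ((iri, "a", ":Person"), (iri, ":age", age), (iri, ":name", name))
]
# Same triples, grouped by predicate instead of by subject (sorted() is stable).
by_predicate = sorted(by_subject, key=lambda t: ("a", ":age", ":name").index(t[1]))

subject_view = [person_properties(c) for c in slices(by_subject, 3)]
predicate_view = [person_properties(c) for c in slices(by_predicate, 3)]
```
]]></preformat>
        <p>With the subject-grouped order, every slice exposes the type, :age and :name of a person. With the predicate-grouped order, only the first slice contains :Person instances, and it exposes nothing but their type.</p>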
        <p>Our current prototype can be affected by this kind of information loss when the knowledge
about an entity is not contained in a single slice. However, Turtle serializations are frequently
organized such that the triples with a common subject are placed next to each other. For this
reason, our experimental results have not been heavily affected by this issue.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Shape extraction</title>
        <p>We have used sheXer to carry out the shape extraction stage of our proposed workflow. sheXer is
an A-BOX-based tool for extracting shapes from large RDF input sources, implemented
in Python<sup>2</sup>. sheXer meets our requirements for this proposal, as it 1) can be configured to use
sampling with very large inputs, and 2) can handle quite large inputs even without sampling.
This allows us to perform the experiments described in section 4.</p>
        <p>sheXer uses an iterative parsing strategy that avoids placing the whole target content in
the main memory. However, it keeps an in-memory structure in which general features of the
entities used as examples to extract shapes are annotated. This may cause a memory overflow
when the number of example entities is too large. The library prevents such issues by letting
the user configure a limit for the number of entities that could be considered to extract a shape.</p>
        <p>
          sheXer allows for configuring many relevant aspects of the extraction process, but reviewing
them all falls outside the scope of this paper. A thorough review of this tool can be found in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Shape consolidation</title>
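        <p>The last stage of our proposed workflow consists of merging the obtained schemas into a single
one. We reduced the problem of merging N schemas to the problem of merging two. We merge
two schemas <inline-formula><tex-math>S_1</tex-math></inline-formula> and <inline-formula><tex-math>S_2</tex-math></inline-formula> of the original set <inline-formula><tex-math>O</tex-math></inline-formula> and get a resulting schema <inline-formula><tex-math>R</tex-math></inline-formula>. <inline-formula><tex-math>S_1</tex-math></inline-formula> and <inline-formula><tex-math>S_2</tex-math></inline-formula> are added
to a set of explored schemas <inline-formula><tex-math>E</tex-math></inline-formula>. <inline-formula><tex-math>R</tex-math></inline-formula> is then merged with another schema <inline-formula><tex-math>S_3 \in O</tex-math></inline-formula> such that <inline-formula><tex-math>S_3 \notin E</tex-math></inline-formula>. <inline-formula><tex-math>R</tex-math></inline-formula> now
contains the merged information of <inline-formula><tex-math>S_1</tex-math></inline-formula>, <inline-formula><tex-math>S_2</tex-math></inline-formula>, and <inline-formula><tex-math>S_3</tex-math></inline-formula>, and <inline-formula><tex-math>S_3</tex-math></inline-formula> is added to <inline-formula><tex-math>E</tex-math></inline-formula>. This process goes on
until <inline-formula><tex-math>\nexists S \in O</tex-math></inline-formula> such that <inline-formula><tex-math>S \notin E</tex-math></inline-formula>. At this point, <inline-formula><tex-math>R</tex-math></inline-formula> contains the merged information of every <inline-formula><tex-math>S \in O</tex-math></inline-formula>.</p>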
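        <p>The reduction from N schemas to pairwise merges amounts to a left fold. The following sketch shows that idea under stated assumptions: consolidate and toy_merge are illustrative names, a schema is modelled as a dictionary from shape URI to a set of properties, and toy_merge is only a stand-in for the actual pairwise merge.</p>
        <preformat><![CDATA[
```python
from functools import reduce


def consolidate(schemas, merge_two):
    """Merge every schema in `schemas` into one, two at a time."""
    schemas = list(schemas)
    if not schemas:
        raise ValueError("at least one schema is required")
    # reduce() keeps the accumulated result and merges it with the next
    # unexplored schema until none is left.
    return reduce(merge_two, schemas)


def toy_merge(left, right):
    """Stand-in for the pairwise merge: union of properties per shape URI."""
    merged = {uri: set(props) for uri, props in left.items()}
    for uri, props in right.items():
        merged.setdefault(uri, set()).update(props)
    return merged
```
]]></preformat>
        <p>Because only the accumulated result and one further schema are needed at each step, the schemas obtained from the slices never have to be held in memory all at once.</p>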
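        <p>Given a pair of schemas <inline-formula><tex-math>S_1</tex-math></inline-formula> and <inline-formula><tex-math>S_2</tex-math></inline-formula>, merging them into a schema <inline-formula><tex-math>R</tex-math></inline-formula> takes the following steps:</p>
        <p>1. We create a set <inline-formula><tex-math>T</tex-math></inline-formula> that will contain the shapes of the resulting schema <inline-formula><tex-math>R</tex-math></inline-formula>. Initially, <inline-formula><tex-math>T</tex-math></inline-formula>
contains those shapes that are only in <inline-formula><tex-math>S_1</tex-math></inline-formula> or in <inline-formula><tex-math>S_2</tex-math></inline-formula>, but not in both schemas. Formally,
<disp-formula><tex-math>T \leftarrow \{ s \mid s \in \mathit{shapes}(S_1) \wedge \nexists s' \in \mathit{shapes}(S_2) : \mathit{uri}(s') = \mathit{uri}(s) \} \cup \{ s \mid s \in \mathit{shapes}(S_2) \wedge \nexists s' \in \mathit{shapes}(S_1) : \mathit{uri}(s') = \mathit{uri}(s) \}</tex-math></disp-formula>
where <inline-formula><tex-math>\mathit{shapes}(S)</tex-math></inline-formula> is a function that returns the set of shapes in the
schema <inline-formula><tex-math>S</tex-math></inline-formula>, and <inline-formula><tex-math>\mathit{uri}(s)</tex-math></inline-formula> is a function that returns the URI of the shape <inline-formula><tex-math>s</tex-math></inline-formula>.</p>
        <p>2. For each pair of shapes <inline-formula><tex-math>s_1 \in \mathit{shapes}(S_1)</tex-math></inline-formula> and <inline-formula><tex-math>s_2 \in \mathit{shapes}(S_2)</tex-math></inline-formula> such that <inline-formula><tex-math>\mathit{uri}(s_1) = \mathit{uri}(s_2)</tex-math></inline-formula>, we add to <inline-formula><tex-math>T</tex-math></inline-formula>
a new shape that merges the constraints of <inline-formula><tex-math>s_1</tex-math></inline-formula> and <inline-formula><tex-math>s_2</tex-math></inline-formula>. The merging process assumes that
1) each property will be used in at most one of the shape’s constraints (except rdf:type,
which is treated specially), and 2) the node constraint of a property will not contain logical
expressions. These conditions can be enforced with a certain configuration of sheXer.</p>
        <p>The consolidation process of two shapes <inline-formula><tex-math>s_1</tex-math></inline-formula> and <inline-formula><tex-math>s_2</tex-math></inline-formula> into a single shape <inline-formula><tex-math>s_r</tex-math></inline-formula> consists of the
following steps:</p>
        <p>a) We create a set <inline-formula><tex-math>C</tex-math></inline-formula> containing those constraints which are exactly equal in both
shapes, i.e., whose property, node constraint, and cardinality are identical. Formally,
<disp-formula><tex-math>C \leftarrow \{ c \mid c \in \mathit{cons}(s_1) \wedge \exists c' \in \mathit{cons}(s_2) : c = c' \}</tex-math></disp-formula>
where <inline-formula><tex-math>\mathit{cons}(s)</tex-math></inline-formula> is a function that
returns the set of constraints of the shape <inline-formula><tex-math>s</tex-math></inline-formula>.</p>
        <p>b) We find the set <inline-formula><tex-math>C'</tex-math></inline-formula> containing constraints of one of the shapes whose property is not
used in any constraint of the other shape. Formally,
<disp-formula><tex-math>C' \leftarrow \{ c \mid c \in \mathit{cons}(s_1) \wedge \nexists c' \in \mathit{cons}(s_2) : \mathit{prop}(c') = \mathit{prop}(c) \} \cup \{ c \mid c \in \mathit{cons}(s_2) \wedge \nexists c' \in \mathit{cons}(s_1) : \mathit{prop}(c') = \mathit{prop}(c) \}</tex-math></disp-formula>
where <inline-formula><tex-math>\mathit{prop}(c)</tex-math></inline-formula> is a function that returns the property of the constraint <inline-formula><tex-math>c</tex-math></inline-formula>.
Then, we add to <inline-formula><tex-math>C</tex-math></inline-formula> every <inline-formula><tex-math>c \in C'</tex-math></inline-formula>, but changing their minimal cardinality to zero.</p>
        <p>c) We add to <inline-formula><tex-math>C</tex-math></inline-formula> a constraint <inline-formula><tex-math>c_m</tex-math></inline-formula> that merges each pair of constraints <inline-formula><tex-math>c_i \in \mathit{cons}(s_1)</tex-math></inline-formula> and
<inline-formula><tex-math>c_j \in \mathit{cons}(s_2)</tex-math></inline-formula> such that they have the same property but differ in cardinality or node
constraint. The node constraint and cardinality of <inline-formula><tex-math>c_m</tex-math></inline-formula> are selected such that they
describe the minimum superset that conforms with both <inline-formula><tex-math>c_i</tex-math></inline-formula> and <inline-formula><tex-math>c_j</tex-math></inline-formula>.</p>
        <p>d) Finally, we create a new shape <inline-formula><tex-math>s_r</tex-math></inline-formula> whose constraints are the ones in the set <inline-formula><tex-math>C</tex-math></inline-formula>.</p>
        <p><sup>2</sup> https://github.com/DaniFdezAlvarez/shexer. The current features of this library are described in its GitHub
repository. The experiments were performed using sheXer 2.2.2.</p>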
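        <p>The steps above can be sketched as follows. This is a simplified illustration, not the code of our prototype: the Constraint class and all function names are assumptions, a shape is modelled as a dictionary with one constraint per property, rdf:type receives no special treatment, and two differing node constraints are generalized to the ShEx wildcard ".", which is a valid superset but not necessarily the minimal one.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass, replace
from typing import Optional

ANY_NODE = "."  # ShEx wildcard; used here as a coarse common superset (assumption)


@dataclass(frozen=True)
class Constraint:
    prop: str
    node: str
    min_card: int = 1
    max_card: Optional[int] = 1  # None means unbounded


def merge_constraints(a, b):
    """Step c: widen node constraint and cardinality so both inputs conform."""
    node = a.node if a.node == b.node else ANY_NODE
    unbounded = a.max_card is None or b.max_card is None
    max_card = None if unbounded else max(a.max_card, b.max_card)
    return Constraint(a.prop, node, min(a.min_card, b.min_card), max_card)


def merge_shapes(s1, s2):
    """Merge two shapes given as dicts mapping property -> Constraint."""
    merged = {}
    for prop in s1.keys() | s2.keys():
        c1, c2 = s1.get(prop), s2.get(prop)
        if c1 is None or c2 is None:
            # Step b: property used in one shape only, so it becomes optional.
            merged[prop] = replace(c1 if c1 is not None else c2, min_card=0)
        elif c1 == c2:
            merged[prop] = c1  # Step a: identical constraints are kept as-is.
        else:
            merged[prop] = merge_constraints(c1, c2)
    return merged


def merge_schemas(schema1, schema2):
    """Merge two schemas given as dicts mapping shape URI -> shape."""
    result = {}
    for uri in schema1.keys() | schema2.keys():
        if uri in schema1 and uri in schema2:
            result[uri] = merge_shapes(schema1[uri], schema2[uri])  # Step 2
        else:
            # Step 1: shapes present in only one schema are copied unchanged.
            result[uri] = schema1[uri] if uri in schema1 else schema2[uri]
    return result
```
]]></preformat>
        <p>For instance, merging a :Person shape that requires exactly one :age with another :Person shape that never uses :age yields a constraint on :age whose minimal cardinality is zero.</p>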
        <p>
          When our prototype merges two constraints, it also merges some statistical information
generated by sheXer. These statistics indicate the support of a certain constraint among the
example data. To mitigate the knowledge loss that may occur after slicing the input, we define
a threshold <inline-formula><tex-math>t \in [0, 1]</tex-math></inline-formula> that takes advantage of those statistics. Whenever the ratio of instances
supporting a certain constraint is higher than <inline-formula><tex-math>1 - t</tex-math></inline-formula>, we promote that support to 1, assuming
that the lack of total support was caused by knowledge loss.
        </p>
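        <p>The promotion rule can be written as a one-line check. The sketch below is illustrative only: promote_support and its parameter names are assumptions, and the ratio is assumed to be the fraction of example instances that support the constraint.</p>
        <preformat><![CDATA[
```python
def promote_support(ratio, threshold):
    """Promote a support ratio to 1.0 when it is higher than 1 - threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    # Support that is almost total is assumed to be incomplete only
    # because some entities were split across slices.
    return 1.0 if ratio > 1.0 - threshold else ratio
```
]]></preformat>
        <p>With a threshold of 0.05, a constraint supported by 98% of the instances is promoted to full support, whereas one supported by 90% keeps its ratio.</p>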
        <p>We have implemented the described approach for shape consolidation in Python<sup>3</sup>.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>To evaluate our proposal, we have extracted shapes from a large RDF dataset using both our
approach and a sampling-based one. Sampling is the current state-of-the-art approach
for handling large datasets with affordable hardware.
We used sheXer to generate the sampling-based schemas.</p>
      <p>We designed an experiment involving a real use case with a subset of UniProt containing
83,855,205 triples<sup>4</sup>. We extracted a shape for every class without any memory restriction, which
took 32.96 GB of RAM. The obtained schema was used as a baseline.</p>
      <p>We ran the consolidation and sampling approaches using different slice sizes and instance limits,
respectively. As the motivation of both approaches is tackling an issue of memory availability,
we think that it is fair to compare these approaches by using settings that produce the same peak
of memory usage. With this premise, we ran different experiments to compare the sampling
and consolidation approaches using the settings listed in Table 1<sup>5</sup>.</p>
      <p>[Table 1, recoverable entries. Memory peak 0.5 GB: consolidation with 1.1M triples per file, sampling with max. 3.1k instances per shape. Memory peak 1 GB: consolidation with 2.2M triples per file, sampling with max. 6.7k instances per shape.]</p>
      <p><sup>3</sup> https://github.com/shex-consolidator/shex-consolidator. This prototype was implemented to work with sheXer’s
outputs. This source code may fail when trying to merge schemas that were not originally produced by sheXer.</p>
      <p><sup>4</sup> This content can be downloaded from https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/uniprotkb_reviewed_eukaryota_opisthokonta_metazoa_33208_0.rdf.xz</p>
      <p>The comparison between each approach and the baseline consists of checking different notions
of similarity. At shape level, we check:</p>
      <p>• S_EQ: Ratio of shapes which are in both schemas and are exactly equal.</p>
      <p>• S_CAR: Ratio of shapes which are in both schemas and whose constraints differ
only in cardinality.</p>
      <p>• S_NOD: Ratio of shapes which are in both schemas and whose constraints use the same
properties, but at least one of those constraints has a different node constraint.</p>
      <p>• S_PRO: Ratio of shapes which are in both schemas, but one of the shapes has a constraint
whose property is not used in the other shape.</p>
      <p>• S_ERR: Ratio of shapes that are not in both schemas.</p>
      <p>Also, we perform similar measures at constraint level. For those shapes whose URI exists in
both schemas, we pair a constraint of a version of the shape with a constraint of its counterpart
that uses the same property (if possible). Considering those paired elements, we check:</p>
      <p>• C_EQ: Ratio of constraints which are exactly equal in both shapes.</p>
      <p>• C_CAR: Ratio of constraints that have the same property and node constraint, but different
cardinality.</p>
      <p>• C_NOD: Ratio of constraints that have the same property, but different node constraint.</p>
      <p>• C_ERR: Ratio of constraints that have a property that is not used in both shapes, i.e., that
could not be paired with a similar constraint in the other version of its shape.</p>
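      <p>The constraint-level measures can be illustrated with a short sketch. It is not the evaluation code used in our experiments: compare_constraints is a hypothetical name, each shape is modelled as a dictionary from property to a (node constraint, cardinality) pair, and the denominator (one count per property used in either version of the shape) is an assumption.</p>
      <preformat><![CDATA[
```python
from collections import Counter

LABELS = ("C_EQ", "C_CAR", "C_NOD", "C_ERR")


def compare_constraints(baseline, candidate):
    """Classify the constraints of two versions of a shape and return ratios.

    Both arguments map a property to a (node_constraint, cardinality) pair.
    """
    counts = Counter()
    for prop in baseline.keys() | candidate.keys():
        if prop not in baseline or prop not in candidate:
            counts["C_ERR"] += 1  # the constraint could not be paired
            continue
        node_b, card_b = baseline[prop]
        node_c, card_c = candidate[prop]
        if node_b != node_c:
            counts["C_NOD"] += 1  # same property, different node constraint
        elif card_b != card_c:
            counts["C_CAR"] += 1  # only the cardinality differs
        else:
            counts["C_EQ"] += 1
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}
```
]]></preformat>
      <p>The shape-level measures follow the same pattern, labelling each shape according to the kind of difference found among its constraints.</p>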
      <p>In Table 2, we show the comparison between the baseline and every extraction process run for
both consolidation and sampling approaches. Note that the columns in this table are primarily
organized w.r.t. memory usage. The number of lines per slice or the instance cap used for the
consolidation (C) and sampling (S) approaches respectively are shown in Table 1. Accumulated
results of each measurement are shown in Figure 4, so one can see the ratio of elements that
have, at least, a certain level of similarity with their counterpart element in the baseline.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>All values for the row S_ERR in Table 2 are 0, which means that, in every case, both approaches
were able to produce shapes for the same classes as the baseline. Also, even with
the lowest memory peak, more than 80% of the extracted constraints are exactly equal to
the constraints in the baseline with both approaches. This percentage only improves when
we increase the memory peak. As shown in Figure 4, the ratio of identical shapes is usually
remarkably lower than the ratio of identical constraints. Together, these numbers indicate that,
even when the produced shapes are not exactly equal, they frequently have few internal
differences. Note also that the consolidation approach (C) performs better than or nearly equal to
the sampling approach (S) at almost every memory peak tested.</p>
      <p>[Table 2: rows S_EQ, S_CAR, S_NOD, S_PRO, S_ERR, C_EQ, C_CAR, C_NOD and C_ERR; columns for the consolidation (C) and sampling (S) approaches at each memory peak in GB; cell values not available.]</p>
      <p>[Figure 4: cumulative ratios per memory peak (GB; 0.5, 1, 2, 3, 4, 5, 6) for the C and S approaches; one panel for C_EQ, C_CAR, C_NOD and C_ERR, and one for S_EQ, S_CAR, S_NOD, S_PRO and S_ERR.]</p>
      <p><sup>5</sup> We empirically determined that setting the threshold <inline-formula><tex-math>t = 5 \cdot n</tex-math></inline-formula>, where <inline-formula><tex-math>n</tex-math></inline-formula> is the number of chunks, was adequate to mitigate the
slicing knowledge loss. However, more experimentation is required to obtain optimal values of <inline-formula><tex-math>t</tex-math></inline-formula>.</p>
      <p>The ratio of shapes and constraints whose differences with the baseline’s counterparts are
only related to cardinality is remarkably better with C. Cardinality-related issues in C are mostly
caused by situations where the baseline has a constraint whose minimal cardinality is 1, but the
extractor produces 0. Cardinality-related issues observed among the shapes obtained with S
tend to happen because the extracted constraints are more specific than the ones in
the baseline. This happens when there is a frequent feature observed among a class’ instances,
but there are some instances that do not manifest it. If none of those instances are part of the
used sample, the obtained cardinality will be wrong.</p>
      <p>Most node constraint-related issues occur when this element should be a specific shape label,
but the extractor uses the macro IRI instead. With both approaches, this situation arises when
the sub-graph computed does not contain the type of a certain entity used as object in a triple.
However, note that the ratio of elements which at least have the same property as their baseline
counterparts is consistently higher with C.</p>
      <p>Finally, note that property-related issues are infrequent and tend to disappear when we use a
high enough memory peak. However, note also that, with C, these cases are very rare even for
the lowest memory peaks tested.</p>
      <p>The results observed in our experiments are promising but could be biased by the type of
input. As explained in section 3.1, our prototype could be affected by a type of graph serialization
that has not been observed in the chosen use case. On the other hand, we hypothesize that the
obtained results could be remarkably improved by implementing a more complex slicing strategy
that enforces an adequate knowledge organization in the slices. More experimentation
would be required to prove such a hypothesis.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>In this paper, we describe a novel approach for tackling memory-related issues when performing
automatic extraction of shapes from existing RDF content. We define a workflow where 1) we
slice the input source, 2) perform an independent shape extraction process over each slice, and
3) merge the obtained results. This allows us to use the whole content of the target graph while
controlling memory usage. We have developed a prototype implementing that workflow and
compared it with a state-of-the-art tool based on sampling. We extracted a shape schema from
a UniProt subset with a tool with no memory restrictions. Then, we extracted schemas using
the evaluated approaches. In our experiments, our prototype produced shapes more similar to
the baseline ones while using the same amount of memory as the sampling-based approach.</p>
      <p>Although these results are promising, this is ongoing work. Experimentation with different
types of sources and different slicing strategies should be performed to determine the actual
potential of our proposal.</p>
      <p>Acknowledgements This work has received partial funding from the project ANGLIRU
(Applying Knowledge Graphs to Research Data Interoperability and Reusability,
MCI-21-PID2020117912RB-C21) funded by the Spanish Agency for Research. This work was partially supported
by the National Bioscience Database Center (NBDC) of the Japan Science and Technology
Agency (JST). We also thank the organizers of the DBCLS Biohackathon 2023, which enabled
the cross-collaboration that triggered this research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Labra-Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>García-González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fernández-Alvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Prud'hommeaux</surname>
          </string-name>
          ,
          <article-title>Challenges in RDF validation</article-title>
          ,
          <source>Current Trends in Semantic Web Technologies: Theory and Practice</source>
          (
          <year>2019</year>
          )
          <fpage>121</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Waagmeester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Willighagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kutmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fernández-Álvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Groom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Schaap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Verhagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Koehorst</surname>
          </string-name>
          ,
          <article-title>A protocol for adding knowledge to Wikidata: aligning resources on human coronaviruses</article-title>
          ,
          <source>BMC Biol</source>
          <volume>19</volume>
          (
          <year>2021</year>
          )
          <fpage>12</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Knublauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <article-title>Shapes Constraint Language (SHACL)</article-title>
          ,
          <source>W3C Candidate Recommendation</source>
          <volume>11</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Prud'hommeaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Labra Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Solbrig</surname>
          </string-name>
          ,
          <article-title>Shape expressions: an RDF validation and transformation language</article-title>
          ,
          <source>in: Proceedings of the 10th International Conference on Semantic Systems</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>32</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Rabbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lissandrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          ,
          <article-title>Shacl and shex in the wild: A community survey on validating shapes generation and adoption</article-title>
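          , in:
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Gentile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pasquale</surname>
          </string-name>
          (Eds.),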
          <source>Companion Proceedings of the Web Conference, WWW'22 Companion</source>
          , ACM,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cifuentes-Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fernández-Álvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Labra-Gayo</surname>
          </string-name>
          ,
          <article-title>National budget as linked open data: New tools for supporting the sustainability of public finances</article-title>
          ,
          <source>Sustainability</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <fpage>4551</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Rabbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lissandrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          ,
          <article-title>Extraction of validating shapes from very large knowledge graphs</article-title>
          ,
          <source>Proceedings of the VLDB Endowment</source>
          <volume>16</volume>
          (
          <year>2023</year>
          )
          <fpage>1023</fpage>
          -
          <lpage>1032</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Fernández-Álvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Labra-Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gayo-Avello</surname>
          </string-name>
          ,
          <article-title>Automatic extraction of shapes using shexer</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>238</volume>
          (
          <year>2022</year>
          )
          <fpage>107975</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>UniProt Consortium</surname>
          </string-name>
          ,
          <article-title>Uniprot: a worldwide hub of protein knowledge</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>47</volume>
          (
          <year>2019</year>
          )
          <fpage>D506</fpage>
          -
          <lpage>D515</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimmino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fernández-Izquierdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>García-Castro</surname>
          </string-name>
          ,
          <article-title>Astrea: automatic generation of shacl shapes from ontologies</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>497</fpage>
          -
          <lpage>513</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Keely</surname>
          </string-name>
          ,
          <article-title>Shaclgen</article-title>
          ,
          <year>2023</year>
          . URL: https://pypi.org/project/shaclgen/.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Pandit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>O'Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <article-title>Using ontology design patterns to define shacl shapes</article-title>
          ,
          <source>in: WOP@ISWC</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Boneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dusart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Alvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <article-title>Shape designer for shex and shacl constraints</article-title>
          ,
          <source>in: ISWC 2019-18th International Semantic Web Conference</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mihindukulasooriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R. A.</given-names>
            <surname>Rashid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>García-Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Torchiano</surname>
          </string-name>
          ,
          <article-title>Rdf shape induction using knowledge base profiling</article-title>
          ,
          <source>in: Proceedings of the 33rd Annual ACM Symposium on Applied Computing</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1952</fpage>
          -
          <lpage>1959</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Groz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lemay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staworko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wieczorek</surname>
          </string-name>
          ,
          <article-title>Inference of shape graphs for graph databases</article-title>
          ,
          <source>in: 25th International Conference on Database Theory (ICDT 2022)</source>
          , Schloss Dagstuhl-Leibniz-Zentrum für Informatik,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Omran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J. R.</given-names>
            <surname>Méndez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Haller</surname>
          </string-name>
          ,
          <article-title>Towards shacl learning from knowledge graphs</article-title>
          ,
          <source>in: ISWC (Demos/Industry)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>94</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B.</given-names>
            <surname>Spahiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          ,
          <article-title>Towards improving the quality of knowledge graphs with data-driven ontology patterns and shacl</article-title>
          ,
          <source>in: ISWC</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Rabbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lissandrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          ,
          <article-title>Shactor: Improving the quality of large-scale knowledge graphs with validating shapes</article-title>
          ,
          <source>in: Companion of the 2023 International Conference on Management of Data</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>151</fpage>
          -
          <lpage>154</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bechhofer</surname>
          </string-name>
          ,
          <article-title>Owl: Web ontology language</article-title>
          ,
          <source>in: Encyclopedia of Database Systems</source>
          , Springer,
          <year>2009</year>
          , pp.
          <fpage>2008</fpage>
          -
          <lpage>2009</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>