<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Statistics about Data Shape Use in RDF Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sven Lieber</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ben De Meester</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Dimou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruben Verborgh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ghent University - imec - IDLab, Department of Electronics and Information Systems</institution>
          ,
          <addr-line>Technologiepark-Zwijnaarde 122, 9052 Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Statistics about constraint use in RDF data give insight into common practices for addressing data quality. However, such statistics exist only for OWL axioms, not for constraint languages such as SHACL or ShEx, which have recently become more popular. We extended previous work on axiom statistics to provide evidence of constraint type use. In this poster<sup>1</sup>, we present preliminary statistics about the use of SHACL core constraints in data shapes found on GitHub. We found that class, datatype, and cardinality constraints are predominantly used, similar to the dominant use of domain and range in ontologies. Less-used constraint types need further attention in visualization or modeling tools to address data quality issues. More SHACL constraints, as well as ShEx constraints, need to be included to deepen the understanding. Data quality researchers and tool designers can make informed decisions based on the provided statistics.</p>
      </abstract>
      <kwd-group>
        <kwd>SHACL</kwd>
        <kwd>Statistics</kwd>
        <kwd>RDF</kwd>
        <kwd>Constraints</kwd>
        <kwd>Montolo</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recently, RDF constraint languages, such as SHACL [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or ShEx [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], have been developed to model restrictions on data in the form of constraints. Statistics for OWL ontologies showed that only a subset of the possible axioms is commonly used [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], but such evidence does not yet exist for constraints. This gap leaves users to either anticipate possible use cases or cover whole specifications.
      </p>
      <p>
        Insights about constraint type use can be drawn from generated constraints or from curated repositories. Astrea [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and OSLO [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which generate shapes from existing sources, cover specific subsets of SHACL, but this is due to limited mappings rather than evidence of broad use. To the best of our knowledge, only small repositories of SHACL constraints with fewer than 5 entries exist<sup>2,3</sup>.
      </p>
      <p>
        In this poster paper, we present preliminary statistics generated by a constraint type extension of Montolo [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], our framework for collecting RDF Data Cube compliant statistics about axiom use. Following the same approach, we used the vocabulary of Montolo<sup>4</sup> to create definitions for all SHACL core constraints and created statistics for data shapes identified on GitHub.
      </p>
      <p><sup>1</sup> Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p><sup>2</sup> https://schreckl.inspirito.de/</p>
      <p><sup>3</sup> http://shacl-play.sparna.fr/catalog</p>
      <p>Our work provides insight into constraint type use and can be extended to constraint types of other RDF constraint languages. The preliminary results, the created corpus of SHACL shapes, and the tool to download the shapes are available under a persistent identifier (DOI: 10.5281/zenodo.3988930<sup>5</sup>) and under an open license<sup>6</sup> to encourage further research.</p>
      <p>Constraint Type Statistics</p>
      <p>We explain the framework to collect constraint type statistics and the sources we consider, then present and discuss preliminary results.</p>
      <p>
        Framework We briefly describe the framework to collect constraint type statistics and the selection of SHACL data shapes. Montolo uses an extension of LODStats [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to define statistical modules that detect (patterns of) RDF terms<sup>7</sup>. We created a statistical module for each core constraint of SHACL to detect SHACL serializations of constraint types, e.g., sh:class or sh:minCount. Additionally, we created definitions for SHACL core constraints with the Montolo vocabulary.
      </p>
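      <p>As a minimal sketch of what such a statistical module does, the following Python snippet counts occurrences of a few SHACL constraint predicates. It is not the Montolo or LODStats implementation: representing triples as plain tuples, the function name count_constraint_types, the example data, and the mapping from predicates to the constraint type labels shown in the figures are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
from collections import Counter

SH = "http://www.w3.org/ns/shacl#"

# Assumed mapping from SHACL predicates to constraint type labels;
# only a handful of the SHACL core constraints are listed here.
CONSTRAINT_TYPES = {
    SH + "class": "valueTypeClass",
    SH + "datatype": "valueTypeDatatype",
    SH + "minCount": "cardinalityMinCount",
    SH + "maxCount": "cardinalityMaxCount",
    SH + "pattern": "stringPattern",
    SH + "or": "logicalDisjunction",
}


def count_constraint_types(triples):
    """Count how often each known constraint predicate occurs."""
    counts = Counter()
    for _subject, predicate, _obj in triples:
        label = CONSTRAINT_TYPES.get(predicate)
        if label is not None:
            counts[label] += 1
    return counts


# Hypothetical triples of two property shapes (subject, predicate, object).
EXAMPLE_TRIPLES = [
    ("_:b0", SH + "path", "http://example.org/name"),
    ("_:b0", SH + "datatype", "http://www.w3.org/2001/XMLSchema#string"),
    ("_:b0", SH + "minCount", "1"),
    ("_:b1", SH + "path", "http://example.org/knows"),
    ("_:b1", SH + "class", "http://example.org/Person"),
    ("_:b1", SH + "minCount", "0"),
]

counts = count_constraint_types(EXAMPLE_TRIPLES)
```
]]></preformat>
      <p>For the six example triples, the sketch counts sh:minCount twice and sh:class and sh:datatype once each; sh:path is not in the assumed mapping and is therefore ignored.</p>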
      <p>We searched GitHub for the term “SHACL” and manually selected repositories that contain valid SHACL shapes which do not appear to be simple examples. We also considered common SHACL shapes, such as Schema.org’s SHACL<sup>8</sup> and the SHACL constraints of SHACL itself<sup>9</sup>. We implemented a tool to download data shapes and merge those that conceptually belong together, e.g., because they are in the same repository; the tool is part of the resource accompanying this paper.</p>
      <p>Results In total, we analyzed the SHACL RDF files of 13 projects containing 1,978 NodeShapes. Two of the projects, the aforementioned OSLO and the SHACL version of schema.org, are similar to the Astrea examples, i.e., their data shapes are generated based on a subset of SHACL. We describe statistics about constraint types of potentially manually curated SHACL shapes and compare them with the generated SHACL shapes of OSLO, schema.org, and Astrea.</p>
      <p>
        All constraint types are used (Fig. 2), but constraint types regarding cardinality, class, and datatype of properties are most frequently used by total number (Fig. 1). Class and datatype constraints are primarily found in our corpus and are likewise generated by Astrea, OSLO, and the SHACL of schema.org. This suggests that class and datatype constraints are the main use cases among commonly used constraint types; this appears similar to the dominance of domain and range axioms for ontologies [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p><sup>4</sup> http://w3id.org/montolo/ns/montolo-voc</p>
      <p><sup>5</sup> https://zenodo.org/record/3988930</p>
      <p><sup>6</sup> https://creativecommons.org/publicdomain/zero/1.0/</p>
      <p><sup>7</sup> https://github.com/IDLabResearch/lovstats</p>
      <p><sup>8</sup> http://datashapes.org/schema</p>
      <p><sup>9</sup> https://www.w3.org/ns/shacl-shacl</p>
    </sec>
    <sec id="sec-2">
      <title>Constraint Type Occurrence</title>
      <p>Fig. 1: Constraints on properties, their cardinality, and their datatype or class are most frequently used in manually curated data shapes (excluding OSLO &amp; schema.org’s SHACL). Constraint types used fewer than 20 times are not shown. [Bar chart; vertical axis: constraint types (otherHasValue, shapeNode, stringPattern, logicalDisjunction, targetSubjectsOfTarget, targetClassTarget, valueTypeDatatype, valueTypeClass, cardinalityMaxCount, cardinalityMinCount, shapeProperty); horizontal axis: constraint type occurrence, 0 to 2500.]</p>
      <p>
        Disjunction constraints (sh:or) are used by more than 75% of the analyzed repositories and to a large extent by the automatically generated SHACL for schema.org. This can be explained by the flexibility of schema.org: properties are specified to expect one of several possible types. However, disjunction is almost non-existent in Astrea, showing that the selected ontologies barely contain owl:unionOf statements. Value range constraints (sh:minExclusive, sh:maxInclusive, etc.) are barely found in our corpus and are generated neither for the Astrea examples nor for OSLO, which suggests less future use; the same holds for other constraint types.
      </p>
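      <p>The share of projects that use a constraint type can be derived from per-project counts. The following sketch illustrates this computation under stated assumptions: the function name percent_of_projects, the project names, and all counts are hypothetical and are not taken from our corpus.</p>
      <preformat><![CDATA[
```python
def percent_of_projects(project_counts):
    """Return, per constraint type, the percentage of projects using it."""
    total = len(project_counts)
    if total == 0:
        return {}
    usage = {}
    for counts in project_counts.values():
        for constraint_type, occurrences in counts.items():
            # A project counts once, no matter how often it uses the type.
            if occurrences > 0:
                usage[constraint_type] = usage.get(constraint_type, 0) + 1
    return {ctype: 100.0 * used / total for ctype, used in usage.items()}


# Made-up counts for four hypothetical projects (not data from the corpus).
EXAMPLE_PROJECTS = {
    "project-a": {"logicalDisjunction": 12, "valueTypeClass": 40},
    "project-b": {"logicalDisjunction": 3, "valueTypeDatatype": 7},
    "project-c": {"valueTypeClass": 5, "logicalDisjunction": 1},
    "project-d": {"valueTypeClass": 9, "stringPattern": 0},
}

percentages = percent_of_projects(EXAMPLE_PROJECTS)

# Keep only constraint types used by more than 20% of the projects.
shown = {ctype: pct for ctype, pct in percentages.items() if pct > 20.0}
```
]]></preformat>
      <p>Counting each project at most once per constraint type keeps a single large repository from dominating the percentage, in contrast to the total occurrence counts.</p>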
      <p>Fig. 2: Projects using Constraint Types (&gt; 20%). [Bar chart; vertical axis: constraint types (logicalConjunction, stringUniqueLang, otherClosed, otherValueIn, otherHasValue, shapeNode, stringPattern, logicalDisjunction, targetClassTarget, valueTypeDatatype, valueTypeClass, cardinalityMaxCount, cardinalityMinCount, shapeProperty); horizontal axis: percent of projects, 0 to 100.]</p>
    </sec>
    <sec id="sec-3">
      <title>Percent of projects</title>
      <p>Discussion Constraint types complement ontology restrictions, yet both show a similar use pattern. Our previous study on restrictions in ontologies found that taxonomic relationships (rdfs:domain, rdfs:range, rdfs:subClassOf) are extensively used, whereas restrictions on literals were barely found. Constraint use follows a similar pattern to axiom use: relationships between concepts are restricted to certain classes or datatypes. However, the current analysis suggests that, with respect to literals, at least string patterns (sh:pattern) find some use in shapes, which complements the missing use of literal restrictions in ontologies.</p>
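      <p>To illustrate what such literal restrictions check, the following sketch mimics a string pattern constraint (sh:pattern) and an inclusive value range (sh:minInclusive, sh:maxInclusive) in plain Python. The helper names and example values are hypothetical, Python's re module only approximates the regular expression dialect used by SHACL, and a real validator additionally handles datatypes and flags.</p>
      <preformat><![CDATA[
```python
import re


def violates_pattern(value, pattern):
    """True if the lexical form does not match the regular expression.

    The match is unanchored, so a pattern must use ^ and $ itself
    if the whole string is supposed to match.
    """
    return re.search(pattern, value) is None


def violates_range(value, min_inclusive=None, max_inclusive=None):
    """True if a numeric value lies outside the given inclusive bounds."""
    if min_inclusive is not None and value < min_inclusive:
        return True
    if max_inclusive is not None and value > max_inclusive:
        return True
    return False


# Hypothetical literals: postal codes as strings, percentages as numbers.
POSTAL_CODE = "^[0-9]{4}$"
bad_codes = [v for v in ["9052", "Ghent"] if violates_pattern(v, POSTAL_CODE)]
bad_percentages = [v for v in [50, 120, -1] if violates_range(v, 0, 100)]
```
]]></preformat>
      <p>Neither check can be expressed with a class or datatype constraint alone: both example strings are valid xsd:string literals, but only one of them is a plausible postal code.</p>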
      <p>
        However, we see more potential in the use of constraints with respect to literals. One out of seven RDF statements in large knowledge graphs contains a literal as object [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. SHACL and ShEx define several string and literal value range constraint types which can be used to impose precise restrictions on literals. We have no insight into which tools were used to create the shapes, yet this might be important: current tools might focus too much on classes and datatypes while neglecting other constraint types. Appropriate tools with user-friendly interfaces are crucial and should be available, such that users are made aware of possible constraint types and are assisted in using them.
      </p>
      <p>Conclusion and Future Work Our preliminary results identified cardinality, class, datatype, and disjunction constraints as commonly used. Developers of tools related to RDF constraints can implement their tools iteratively by first covering these commonly used constraint types. However, to exploit the existing data quality potential, developers should not completely neglect other constraint types, especially those regarding literal values. Future work can extend the statistics by including ShEx and increasing the sample size. We are currently working on visual notations for RDF constraints<sup>10</sup>, which will benefit from this and future insights into constraint type use.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demter</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>LODStats - An Extensible Framework for High-Performance Dataset Analytics</article-title>
          . In: EKAW (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Beek</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilievski</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debattista</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlobach</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wielemaker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Literally better: Analyzing and improving the quality of literals</article-title>
          .
          <source>Semantic Web</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cimmino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernández-Izquierdo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Castro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Astrea: Automatic Generation of SHACL Shapes from Ontologies</article-title>
          .
          <source>The Semantic Web</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>De Paepe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thijs</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buyle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Automated UML-based ontology generation in OSLO<sup>2</sup></article-title>
          .
          <source>In: The Semantic Web: Satellite Events</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Knublauch</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Shapes Constraint Language (SHACL)</article-title>
          .
          <source>Recommendation, World Wide Web Consortium</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lieber</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Meester</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>MontoloStats - Ontology Modeling Statistics</article-title>
          .
          <source>In: Proceedings of the 10th K-Cap Conference</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Prud'hommeaux</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labra Gayo</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solbrig</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Shape expressions: an RDF validation and transformation language</article-title>
          .
          <source>In: Proceedings of the 10th International Conference on Semantic Systems</source>
          . New York, NY, United States (
          <year>2014</year>
          )
        </mixed-citation>
        <note>
          <p><sup>10</sup> https://w3id.org/imec/unshacled/spec/shape-vowl and https://w3id.org/imec/unshacled/spec/shape-uml</p>
        </note>
      </ref>
    </ref-list>
  </back>
</article>