<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SEMANTiCS</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davan Chiem Dao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christophe Debruyne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Montefiore Institute of Electrical Engineering and Computer Science, University of Liège</institution>
          ,
          <addr-line>Liège</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>20</volume>
      <fpage>17</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This paper describes security vulnerabilities that arise when SHACL is used for data validation in knowledge graphs (KGs), with the aim of raising security awareness around SHACL and KG technologies. We employ a KG developed in a prior study in the social security domain to illustrate potential security issues. The vulnerability identified is SHACL's inability to distinguish user input from historical information in the KG, which results in two potential exploits: validation bypass and data leakage. This study improves our understanding of security risks in SHACL and KG technologies and provides insights into this underexplored area of the literature.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graphs</kwd>
        <kwd>SHACL</kwd>
        <kwd>Security</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Knowledge Graphs (KGs) have emerged as versatile tools for representing and organizing complex
relationships between entities in various domains [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. As applications utilizing KGs continue to
proliferate, ensuring the accuracy and reliability of their data becomes increasingly important.
One technique for doing so is the Shapes Constraint Language (SHACL), which provides a standardized way
to validate RDF graphs.
      </p>
      <p>
        Despite the increase in popularity of KGs, there is limited research on related security issues.
In 2023, [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] stated that security risks of KG (reasoning) remain largely unexplored and provided
one of the first systematic studies on these risks. However, this study and others such as [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]
present security issues that are mostly related to machine learning on top of KGs rather than
problems pertaining to KG technologies.
      </p>
      <p>
        Motivated by this gap in the literature, this paper presents a case study to uncover security
vulnerabilities inherent to KGs, focusing on the use of SHACL for data validation. SHACL is
often used as a simple data validation tool, either internally or for users to check the structure
of their documents. However, it is flexible enough to be considered in more complex contexts
where multiple stakeholders might interact with the KG. We employ a KG developed in a prior
study to validate social security declarations and illustrate potential security issues, showcasing
how this flexibility, while beneficial, also introduces risks.
      </p>
    </sec>
    <sec>
      <title>2. SHACL</title>
      <p>
        The Shapes Constraint Language (SHACL) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is a standardized language for expressing
conditions that RDF graphs must adhere to, commonly referred to as “shapes.” The set of shapes
forms an RDF graph called a “shapes graph.” On the other hand, the RDF graph undergoing
validation against these shapes is called a “data graph,” and the RDF graph reporting the result
is called a “validation report.”
      </p>
      <p>The validation process works as follows. For each shape, the focus nodes are selected among
the nodes of the data graph according to the shape’s target. These focus nodes are then validated against
the shape’s constraints. Depending on the validation results, the validation report indicates
either that the data graph conforms, when all focus nodes conform, or that it does not conform,
when at least one node fails. In the latter case, the cause of each failure is reported.</p>
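      <p>The validation process described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not a SHACL implementation: triples are plain tuples, a shape has a single class target and a single constraint function, and the names Shape, focus_nodes, validate, at_most_one_code and all ex: terms are hypothetical.</p>
      <preformat>
```python
from typing import Callable, NamedTuple

Triple = tuple[str, str, str]  # (subject, predicate, object)
Graph = set[Triple]


class Shape(NamedTuple):
    name: str
    target_class: str                         # stands in for sh:targetClass
    constraint: Callable[[Graph, str], bool]  # True when the focus node conforms


def focus_nodes(data_graph: Graph, shape: Shape) -> list[str]:
    """Select the nodes of the data graph that the shape targets."""
    return sorted(s for (s, p, o) in data_graph
                  if p == "rdf:type" and o == shape.target_class)


def validate(data_graph: Graph, shapes: list[Shape]) -> dict:
    """Build a minimal report: one result per focus node that fails a shape."""
    results = [
        {"focusNode": node, "sourceShape": shape.name}
        for shape in shapes
        for node in focus_nodes(data_graph, shape)
        if not shape.constraint(data_graph, node)
    ]
    return {"conforms": not results, "results": results}


def at_most_one_code(graph: Graph, node: str) -> bool:
    """Hypothetical cardinality constraint: at most one ex:code value."""
    values = {o for (s, p, o) in graph if s == node and p == "ex:code"}
    return not (len(values) > 1)


shape = Shape("ex:WorkerShape", "ex:Worker", at_most_one_code)
data_graph = {
    ("ex:a", "rdf:type", "ex:Worker"), ("ex:a", "ex:code", "58"),
    ("ex:b", "rdf:type", "ex:Worker"), ("ex:b", "ex:code", "58"),
    ("ex:b", "ex:code", "61"),
}
report = validate(data_graph, [shape])
```
      </preformat>
      <p>In this sketch, the node ex:b carries two codes, so the report does not conform and lists ex:b as the failing focus node. Note that validate only ever sees one undifferentiated set of triples.</p>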
      <p>The types of constraints that can be expressed include value type constraints, cardinality
constraints, value range constraints, and logical operators, among others. These constraints,
along with the general workings of the validation, form SHACL-CORE. In addition to
SHACL-CORE, SHACL-SPARQL extends these capabilities by enabling the creation of custom constraints
using SPARQL queries. This component allows one to define more complex validation rules
that may not be expressible using standard SHACL constraints alone.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Social Security Artifact</title>
      <p>
        In a prior study [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a validation system using KG technologies was built to validate social
security declarations stored as XML. This study demonstrated the feasibility of using SHACL to
express complex domain rules, whereas current practice only validates the syntax and
structure of these XML documents with an XSD schema. The system gives users
a more complete and transparent validation of their declarations (i.e., user input), as SHACL is
more expressive than XSD<sup>1</sup>. Moreover, the shapes can be shared with the public and used in
several applications.
      </p>
      <p>The constraints in the shapes graph were developed by translating constraints from an
XSD schema and a well-documented glossary into SHACL. They include simple constraints
expressed with SHACL-CORE, such as datatype, value range, cardinality, and regular-expression
constraints, and complex constraints expressed with SHACL-SPARQL, such as checksums,
membership in an enumeration, and value unicity<sup>2</sup>.
Listing 1 shows an example of a shape. This shape constrains a field to be a two-digit integer
(line 5), occur at most once (end of line 4), and be part of an enumeration (lines 6-11).</p>
      <p>For the enumeration constraint, the user input must be complemented with additional data
in the data graph, e.g., to check whether the provided values are valid w.r.t. rules documented
elsewhere. We refer to this data as public data, as it concerns information that is publicly
available in the glossary’s annexes. It describes the values allowed for specific fields and
their meaning. For instance, the position code in Listing 1<sup>3</sup> specifies the code for a person’s job,
e.g., 58 for barman, 61 for hunter, 62 for valet. As not all values are allowed and the allowed values can vary
over time, the public data contains an enumeration of the currently allowed values.</p>
      <p><sup>1</sup> E.g., SHACL allows arithmetic computation, rules between different graphs, and recursion.</p>
      <p><sup>2</sup> Value unicity refers to ensuring that each value following a specific path in the KG does not occur more than once.</p>
      <p><sup>3</sup> More detailed examples are available at https://github.com/Ikeragnell/shaclExploits</p>
      <p>Similarly, some rules may require additional data to be included in the data graph that
should remain private. This information concerns user data relevant to social security, such as the
number of working hours, amount of contributions, and amount of social rights. For example,
Listing 2 states that one cannot be employed and receive unemployment benefits simultaneously.</p>
      <preformat>ont:NaturalPersonShape a sh:NodeShape ;
  sh:targetClass ont:NaturalPerson ;
  sh:sparql [
    sh:prefixes &lt;&gt; ;
    sh:select """SELECT $this WHERE {
      $this ont:INSS ?nbr ; ont:R_90017_90012 ?workerRecord .
      ?workerRecord a ont:WorkerRecord .
      ?person ont:INSS ?nbr ; ont:R_90017_901234 ?unemploymentBenefit .
      ?unemploymentBenefit a ont:UnemploymentBenefit . }""" ] .</preformat>
      <sec id="sec-3-1">
        <title>Listing 2: Shape example requiring private data</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Exploits</title>
      <p>In the previous section, we explained that the data graph includes not only the transformed user
input but also public and private data. Although these components are distinguished before
they are provided to the SHACL processor, it is critical to note that the SHACL processor treats
the data graph as a whole. This aspect becomes a potential source of vulnerability, as SHACL
cannot differentiate user input from historical and trusted information.</p>
      <p>This lack of distinction between user input and historical data potentially enables attackers to
manipulate the validation process or extract sensitive information from the KG. The following
sections present two exploits leveraging this vulnerability: validation bypass and data leakage.</p>
      <p>
        While some access control mechanisms might help mitigate these risks, current access control
solutions for KGs are not yet mature [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and, to the best of our knowledge, there are no thorough
implementations yet. Moreover, one might argue that it is the responsibility of KG engineers
to manage these risks by implementing their own solutions on top of SHACL. However, by
identifying these issues at the SHACL level, we can envisage solutions that are more standardized
and applicable across various projects.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Validation Bypass</title>
        <p>Validation bypass occurs when a declaration that should be erroneous is reported as conforming
to the shapes. This exploit capitalizes on SHACL’s inability to differentiate between
user-provided data and historical information within the KG. Essentially, attackers add
triples that do not concern the declaration being evaluated<sup>4</sup>. This additional data pretends to be
trusted data and masks the underlying inaccuracies in the provided data.</p>
        <p>To illustrate this exploit, we will now describe how to bypass the shape in Listing 1, specifically
the constraint of being part of an enumeration. For example, the value 31 is not part of
the enumeration allowed for a position code. In normal circumstances, when a declaration
containing this value is validated, the validation report generated states the node does not
conform as it has a position code that does not exist.</p>
        <p>An attacker can make the value 31 valid by creating a position code with this value and adding it
to the original declaration. Listing 3 shows the poisonous triples to be added. This kind of data
should only be in the public data. However, as there is no distinction between the user input
and the public data in the data graph, the SHACL validation process considers all asserted
data as true.</p>
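        <p>The bypass can be reproduced with a small self-contained Python sketch. It does not run a SHACL processor; it only mimics the enumeration check over the union of user input and public data. The property ex:positionCode, the pub: node names, and the helper names are assumptions made for illustration, while the codes 58, 61, 62 and the malicious node come from the text.</p>
        <preformat>
```python
Triple = tuple[str, str, str]  # (subject, predicate, object)

# Trusted public data: the enumeration of currently allowed position codes.
PUBLIC_DATA: set[Triple] = {
    ("pub:PositionCode58", "rdf:type", "an9:PositionCode"),
    ("pub:PositionCode58", "an9:Code", "58"),
    ("pub:PositionCode61", "rdf:type", "an9:PositionCode"),
    ("pub:PositionCode61", "an9:Code", "61"),
    ("pub:PositionCode62", "rdf:type", "an9:PositionCode"),
    ("pub:PositionCode62", "an9:Code", "62"),
}


def allowed_codes(data_graph: set[Triple]) -> set[str]:
    """Codes of every an9:PositionCode node found anywhere in the data graph."""
    code_nodes = {s for (s, p, o) in data_graph
                  if p == "rdf:type" and o == "an9:PositionCode"}
    return {o for (s, p, o) in data_graph
            if s in code_nodes and p == "an9:Code"}


def conforms(user_input: set[Triple]) -> bool:
    """Mimic the enumeration constraint on the merged data graph."""
    data_graph = user_input | PUBLIC_DATA  # provenance is lost in this union
    declared = {o for (s, p, o) in data_graph if p == "ex:positionCode"}
    return declared.issubset(allowed_codes(data_graph))


# A declaration using the non-existent position code 31.
honest_input = {("decl:Worker1", "ex:positionCode", "31")}

# The same declaration plus the poisonous triples of Listing 3.
malicious = "http://maliciousInput.be/PositionCode31"
poisoned_input = honest_input | {
    (malicious, "rdf:type", "an9:PositionCode"),
    (malicious, "an9:Code", "31"),
}
```
        </preformat>
        <p>Under these assumptions, conforms(honest_input) is False while conforms(poisoned_input) is True: the check cannot tell that the extra position code did not originate from the public data.</p>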
        <p>In this case, the vulnerable constraint is a SPARQL constraint. A SPARQL constraint component
can use arbitrary paths to express complex rules, even ones that reach beyond closed SHACL shapes. Thus,
closing the shape does not solve the problem. Moreover, this SPARQL constraint can also be expressed
with SHACL-CORE, which shows that the exploit does not depend on SHACL-SPARQL.</p>
        <preformat>&lt;http://maliciousInput.be/PositionCode31&gt; a an9:PositionCode ; an9:Code 31 .</preformat>
        <sec id="sec-4-1-1">
          <title>Listing 3: Poisonous data</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Data Leak</title>
        <p>Data leakage occurs when sensitive information stored within the KG is exposed during
validation. Users only interact with the SHACL processor and do not have direct access to private
data. However, some rules require private data to be included in the data graph. Thus, an
attacker could forge declarations and deduce sensitive data based on their validity.</p>
        <p>To illustrate this exploit, we now describe how the shape in Listing 2 can leak
sensitive information. Suppose that the private data contains the triples in the first part of Listing 4, which state
that someone with the social security number 77101500172 is unemployed. We will refer to this
person as Bob. In normal circumstances, no employer would declare Bob as an employee.</p>
        <preformat># Private data
ont:UnempPers0 ont:INSS 77101500172 .
ont:UnempPers0 ont:R_90017_901234 ont:UnempBen0 .
ont:UnempBen0 a ont:UnemploymentBenefit .

# Forged declaration
ont:NatPers0 a ont:NaturalPerson .
ont:NatPers0 ont:INSS 77101500172 .
ont:NatPers0 ont:R_90017_90012 ont:WorkRec0 .
ont:WorkRec0 a ont:WorkerRecord .</preformat>
        <sec id="sec-4-2-1">
          <title>Listing 4: Private data (first block) and forged declaration (second block)</title>
          <p><sup>4</sup> Note that poisoned data should not concern the declaration; otherwise, only fraud would occur instead of an
exploit. For instance, if a declaration is erroneous due to a missing field, adding some fake data can make it pass
the validation. This would be equivalent to fraudulently filling out a declaration.</p>
          <p>An attacker can forge a declaration stating that they employ Bob. The relevant data from
the forged declaration can be seen in Listing 4. The attacker can then infer Bob's status
from the validation result. Bob is unemployed if the validation report states
that the declaration does not conform (which is the case with this forged declaration). On the
other hand, if the declaration conforms, no unemployment benefit is recorded for Bob, which suggests that he is employed.</p>
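          <p>The inference above can be sketched as a validation oracle in Python. The sketch hard-codes the rule of Listing 2 instead of running a SHACL processor, and the function names conforms, forged_declaration and receives_unemployment_benefit are hypothetical. The triples mirror Listing 4.</p>
          <preformat>
```python
Triple = tuple[str, str, str]  # (subject, predicate, object)

# Private data held by the validation service (never shown to users).
PRIVATE_DATA: set[Triple] = {
    ("ont:UnempPers0", "ont:INSS", "77101500172"),
    ("ont:UnempPers0", "ont:R_90017_901234", "ont:UnempBen0"),
    ("ont:UnempBen0", "rdf:type", "ont:UnemploymentBenefit"),
}


def forged_declaration(inss: str) -> set[Triple]:
    """Attacker side: claim to employ the person with the given INSS number."""
    return {
        ("ont:NatPers0", "rdf:type", "ont:NaturalPerson"),
        ("ont:NatPers0", "ont:INSS", inss),
        ("ont:NatPers0", "ont:R_90017_90012", "ont:WorkRec0"),
        ("ont:WorkRec0", "rdf:type", "ont:WorkerRecord"),
    }


def conforms(declaration: set[Triple]) -> bool:
    """Service side: apply the rule of Listing 2 to declaration + private data."""
    graph = declaration | PRIVATE_DATA
    benefit_holders = {
        s for (s, p, o) in graph
        if p == "ont:R_90017_901234"
        and (o, "rdf:type", "ont:UnemploymentBenefit") in graph
    }
    workers = {
        s for (s, p, o) in graph
        if p == "ont:R_90017_90012"
        and (o, "rdf:type", "ont:WorkerRecord") in graph
        and (s, "rdf:type", "ont:NaturalPerson") in graph
    }
    unemployed_inss = {o for (s, p, o) in graph
                       if p == "ont:INSS" and s in benefit_holders}
    worker_inss = {o for (s, p, o) in graph
                   if p == "ont:INSS" and s in workers}
    # The rule is violated when a declared worker shares an INSS number
    # with someone who holds an unemployment benefit.
    return worker_inss.isdisjoint(unemployed_inss)


def receives_unemployment_benefit(inss: str) -> bool:
    """Attacker side: infer private status from the validation outcome alone."""
    return not conforms(forged_declaration(inss))
```
          </preformat>
          <p>In this sketch, the attacker only observes the conforms flag, yet calling receives_unemployment_benefit with Bob's number reveals one bit of private data. Repeating the call for other numbers would reveal the same bit for other people.</p>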
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper marks a first step in understanding the vulnerabilities present in SHACL by describing
two exploits: validation bypass and data leakage. We highlight the risks stemming from
SHACL’s inability to differentiate triples coming from different graphs. Moving forward, we
will develop and implement effective strategies to address these security concerns. In the
social security domain, where personal or sensitive data is used, convoluted approaches can
be conceived in which validation reports are not communicated to users or where humans are
kept in the loop in the validation process. Such solutions delegate all responsibilities to the
applications built on the KG. Ideally, approaches are embedded into and part of the KG. Our
goal is to address these concerns declaratively using KG technologies. One potential avenue is to use
SHACL extensions<sup>5</sup>. Through these endeavors, we aim to enhance the overall understanding of
the vulnerabilities inherent in KG technologies and to fortify them against potential exploits.</p>
      <sec id="sec-5-1">
        <title><sup>5</sup> For instance, https://afs.github.io/shacl-datasets.html</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. N.</given-names>
            <surname>Al-Aswadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gaurav</surname>
          </string-name>
          ,
          <article-title>Recent trends in knowledge graphs: theory and practice</article-title>
          ,
          <source>Soft Computing</source>
          <volume>25</volume>
          (
          <year>2021</year>
          )
          <fpage>8337</fpage>
          -
          <lpage>8355</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>On the security risks of knowledge graph reasoning</article-title>
          ,
          <year>2023</year>
          . arXiv:2305.02383.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhardwaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kelleher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Costabello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>O'Sullivan</surname>
          </string-name>
          ,
          <article-title>Adversarial attacks on knowledge graph embeddings via instance attribution methods</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>8225</fpage>
          -
          <lpage>8239</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>Data poisoning attack against knowledge graph embedding</article-title>
          ,
          <year>2019</year>
          . arXiv:1904.12052.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <article-title>Shapes Constraint Language (SHACL)</article-title>
          ,
          <year>2017</year>
          . URL: https://www.w3.org/TR/shacl/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chiem Dao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Debruyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stijfhals</surname>
          </string-name>
          ,
          <article-title>Using knowledge graphs and SHACL to validate declaration forms: An experiment in the social security domain to assess SHACL's applicability</article-title>
          ,
          <source>in: 2nd EuropeaN Data conference On Reference data and SEmantics ENDORSE 2023, 14-16 March 2023 - Proceedings</source>
          , Publications Office of the European Union
          ,
          <year>2023</year>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Valzelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Spahiu</surname>
          </string-name>
          ,
          <article-title>Towards an access control model for knowledge graphs (discussion paper)</article-title>
          ,
          <source>in: Proceedings of the 29th Italian Symposium on Advanced Database Systems, SEBD 2021, Pizzo Calabro (VV), Italy, September 5-9, 2021</source>
          , volume
          <volume>2994</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org
          ,
          <year>2021</year>
          , pp.
          <fpage>339</fpage>
          -
          <lpage>346</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>