Data Leakage and Validation Bypass in SHACL
                                Davan Chiem Dao¹,∗, Christophe Debruyne¹
                                ¹ Montefiore Institute of Electrical Engineering and Computer Science, University of Liège, Liège, Belgium


                                               Abstract
                                               This paper describes security vulnerabilities that arise when using SHACL for data validation in
                                               knowledge graphs (KGs) and aims to raise security awareness around SHACL and KG technologies.
                                               We employ a KG developed in a prior study in the social security domain to illustrate potential
                                               security issues. The vulnerability identified is SHACL’s inability to distinguish user input from
                                               historical information in the KG, resulting in two potential exploits: validation bypass and data
                                               leakage. This study improves our understanding of security risks in SHACL and KG technologies,
                                               providing insights into this underexplored area of the literature.

                                               Keywords
                                               Knowledge Graphs, SHACL, Security


                                1. Introduction
                                Knowledge Graphs (KGs) have emerged as versatile tools for representing and organizing complex
                                relationships between entities in various domains [1]. As applications utilizing KGs continue to
                                proliferate, ensuring the accuracy and reliability of their data becomes increasingly important.
                                One technique for this is the Shapes Constraint Language (SHACL), which provides a standardized
                                way to validate RDF graphs.
                                   Despite the increase in popularity of KGs, there is limited research on related security issues.
                                In 2023, [2] stated that security risks of KG (reasoning) remain largely unexplored and provided
                                one of the first systematic studies on these risks. However, this study and others such as [3, 4]
                                present security issues that are mostly related to machine learning on top of KGs rather than
                                problems pertaining to KG technologies.
                                   Motivated by this gap in the literature, this paper presents a case study to uncover security
                                vulnerabilities inherent to KGs, focusing on the use of SHACL for data validation. SHACL is
                                often used as a simple data validation tool, either internally or for users to check the structure
                                of their documents. However, it is flexible enough to be considered in more complex contexts
                                where multiple stakeholders might interact with the KG. We employ a KG developed in a prior
                                study to validate social security declarations and illustrate potential security issues, showcasing
                                how this flexibility, while beneficial, also introduces risks.


                                SEMANTiCS 2024: 20th International Conference on Semantic Systems, September 17–19, 2024, Amsterdam, The
                                Netherlands
                                ∗ Corresponding author.
                                Orcid 0009-0004-8139-1927 (D. Chiem Dao); 0000-0003-4734-3847 (C. Debruyne)
                                             © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
2. SHACL
The Shapes Constraint Language (SHACL) [5] is a standardized language for expressing
conditions that RDF graphs must adhere to, commonly referred to as “shapes.” The set of shapes
forms an RDF graph called a “shapes graph.” On the other hand, the RDF graph undergoing
validation against these shapes is called a “data graph,” and the RDF graph reporting the result
is called a “validation report.”
   The validation process works as follows. For each shape, the focus nodes are selected among
the nodes of the data graph based on the shape’s target declarations. Then, these focus nodes are
validated against the shape’s constraints. The validation report indicates either that the data
graph conforms, when all focus nodes conform, or that it does not conform, when at least one
focus node fails. In the latter case, the reason for each failure is reported.
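
The process described above can be illustrated with a minimal Python sketch. It is a toy model and not a SHACL processor: a graph is assumed to be a plain set of (subject, predicate, object) tuples, class targets ignore subclass reasoning, and the node names (`ex:occ1`, `ex:occ2`) and helper functions are hypothetical.

```python
# Toy model (assumption): an RDF graph is a set of (subject, predicate, object) tuples.
RDF_TYPE = "rdf:type"

def focus_nodes(data_graph, target_class):
    """Select focus nodes: every subject typed with the shape's target class."""
    return {s for (s, p, o) in data_graph if p == RDF_TYPE and o == target_class}

def max_count(path, limit):
    """Cardinality constraint: a node may have at most `limit` values for `path`."""
    def check(graph, node):
        return len({o for (s, p, o) in graph if s == node and p == path}) <= limit
    return check

def validate(data_graph, shapes):
    """Return (conforms, results); results lists each failing focus node with a reason."""
    results = []
    for shape in shapes:
        for node in sorted(focus_nodes(data_graph, shape["target_class"])):
            for message, check in shape["constraints"]:
                if not check(data_graph, node):
                    results.append((node, message))
    return (not results, results)

# Hypothetical shape: an occupation has at most one position code.
occupation_shape = {
    "target_class": "ont:Occupation",
    "constraints": [("more than one position code", max_count("ont:PositionCode", 1))],
}

# Hypothetical data graph with one conforming and one failing node.
data_graph = {
    ("ex:occ1", RDF_TYPE, "ont:Occupation"),
    ("ex:occ1", "ont:PositionCode", 58),
    ("ex:occ2", RDF_TYPE, "ont:Occupation"),
    ("ex:occ2", "ont:PositionCode", 58),
    ("ex:occ2", "ont:PositionCode", 61),
}

conforms, results = validate(data_graph, [occupation_shape])
print(conforms, results)
```

A single failing focus node is enough to make the whole data graph non-conforming, and the report names that node together with the violated constraint.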
   The types of constraints that can be expressed include value type constraints, cardinality
constraints, value range constraints and logical operators, among others. These constraints,
along with the general workings of the validation, form SHACL-CORE. In addition to SHACL-
CORE, SHACL-SPARQL extends its capabilities by enabling the creation of custom constraints
using SPARQL queries. This component allows one to define more complex validation rules
that may not be expressible using standard SHACL constraints alone.


3. Social Security Artifact
In a prior study [6], a validation system using KG technologies was built to validate social
security declarations stored as XML. This study demonstrated the feasibility of SHACL in
expressing complex domain rules where current practice can only validate the syntax and
structure of these XML documents with an XSD schema. The built system allows users to have
a more complete and transparent validation of their declarations (i.e., user input) as SHACL is
more expressive than XSD¹. Moreover, the shapes can be shared with the public and used in
several applications.
   The constraints in the shapes graph were developed by translating constraints from an
XSD schema and a well-documented glossary into SHACL. They include simple constraints
expressed with SHACL-CORE, such as datatype, value range, cardinality, and regular expression
constraints, and complex constraints expressed with SHACL-SPARQL, such as checksums,
membership in an enumeration, and value unicity². Listing 1 shows an example of a shape.
This shape constrains a field to be a two-digit integer (line 5), occur at most once (end of
line 4), and be part of an enumeration (lines 6-11).
   For the enumeration constraint, the user input must be complemented with additional data
in the data graph, e.g., to check whether the provided values are valid w.r.t. rules documented
elsewhere. We refer to this as public data, as it concerns information that is publicly
available in the glossary’s annexes. It describes the values allowed for some specific fields and
their meaning. For instance, the position code in Listing 1³ specifies the code for a person’s job:

¹ E.g., SHACL allows arithmetic computation, rules between different graphs, and recursion.
² Value unicity refers to ensuring that each value following a specific path in the KG does not occur more than once.
³ More detailed examples are available at https://github.com/Ikeragnell/shaclExploits
 1   ont:OccupationShape a sh:NodeShape;
 2       sh:targetClass ont:Occupation;
 3       sh:property [
 4           sh:path ont:PositionCode; sh:maxCount 1;
 5           sh:minLength 2; sh:maxLength 2; sh:minInclusive 0; sh:maxInclusive 99; sh:datatype xs:integer;];
 6       sh:sparql [
 7           sh:prefixes <> ;
 8           sh:select """SELECT $this ?value WHERE {
 9                            $this $PATH ?value.
10                            OPTIONAL{?p a an9:PositionCode ; an9:Code ?value.}
11                            FILTER(!BOUND(?p))}"""] .


                               Listing 1: Shape example requiring public data


     e.g., 58–barman, 61–hunter, 62–valet. As not all values are allowed and allowed values can vary
     over time, the public data contains an enumeration of the currently allowed values.
        Similarly, some rules require additional data in the data graph that should remain
     private. This information concerns user data relevant to social security, such as the
     number of working hours, amount of contributions, and amount of social rights. For example,
     Listing 2 states that one cannot be employed and have unemployment benefits simultaneously.
 1   ont:NaturalPersonShape a sh:NodeShape ;
 2       sh:targetClass ont:NaturalPerson ;
 3       sh:sparql [
 4           sh:prefixes <> ;
 5           sh:select """SELECT $this WHERE {
 6                   $this ont:INSS ?nbr; ont:R_90017_90012 ?workerRecord.
 7                   ?workerRecord a ont:WorkerRecord.
 8                   ?person ont:INSS ?nbr; ont:R_90017_901234 ?unemploymentBenefit.
 9                   ?unemploymentBenefit a ont:UnemploymentBenefit.}"""] .


                              Listing 2: Shape example requiring private data


     4. Exploits
     In the previous section, we explained that the data graph included not only the transformed user
     input but also public and private data. Despite these components being distinguished before
     they are provided to the SHACL processor, it is critical to note that the SHACL processor treats
     the data graph as a whole. This aspect becomes a potential source of vulnerability, as SHACL
     cannot differentiate user input from historical and trusted information.
        This lack of distinction between user input and historical data potentially enables attackers to
     manipulate the validation process or extract sensitive information from the KG. The following
     sections present two exploits leveraging this vulnerability: validation bypass and data leakage.
        While some access control mechanisms might help mitigate these risks, current access control
     solutions for KGs are not yet mature [7] and, to the best of our knowledge, there are no thorough
     implementations yet. Moreover, one might argue that it is the responsibility of KG engineers
     to manage these risks by implementing their own solution on top of SHACL. However, by
     identifying these issues at the SHACL level, we can envisage solutions that are more standardized
     and applicable across various projects.
    4.1. Validation Bypass
    Validation bypass occurs when a declaration that should be erroneous is reported as conforming
    to the shapes. This exploit capitalizes on SHACL’s inability to differentiate between user-
    provided data and historical information within the KG. Essentially, attackers add extra
    triples that do not concern the declaration being evaluated⁴. This extra data poses as
    trusted data and masks the underlying inaccuracies in the provided data.
       To illustrate this exploit, we will now describe how to bypass the shape in Listing 1, specifically
    the constraint of being part of an enumeration. For example, the value 31 is not part of
    the enumeration allowed for a position code. In normal circumstances, when a declaration
    containing this value is validated, the validation report generated states the node does not
    conform as it has a position code that does not exist.
       An attacker can make the value 31 valid by creating a position code with this value and
    adding it to the original declaration. Listing 3 shows the poisonous triples to be added. This
    kind of data should only be in the public data. However, as there is no distinction between the
    user input and the public data in the data graph, the SHACL validation process considers all
    asserted data as true.
       In this case, the vulnerable constraint is a SPARQL constraint. The SPARQL constraint
    component can use any path to express complex rules, even rules that reach beyond what a closed
    SHACL shape covers. Thus, closing the shape does not solve the problem. Moreover, this SPARQL
    constraint can also be expressed with SHACL-CORE, which shows that this exploit does not depend
    on SHACL-SPARQL.
1   <http://maliciousInput.be/PositionCode31> a an9:PositionCode ; an9:Code 31.


                                                 Listing 3: Poisonous data
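
The effect of the poisonous triples can be reproduced with a minimal Python sketch. It does not run a SHACL processor: the function below only mimics the logic of the SPARQL constraint in Listing 1 over a set of tuples, and the node names `ex:occ1` and `an9:PositionCode58` are hypothetical.

```python
RDF_TYPE = "rdf:type"

def enumeration_violations(data_graph):
    """Mimic the SPARQL constraint of Listing 1: report every occupation whose
    position code has no matching an9:PositionCode node in the data graph."""
    allowed = {
        o for (s, p, o) in data_graph
        if p == "an9:Code" and (s, RDF_TYPE, "an9:PositionCode") in data_graph
    }
    return sorted(
        (s, o) for (s, p, o) in data_graph
        if p == "ont:PositionCode"
        and (s, RDF_TYPE, "ont:Occupation") in data_graph
        and o not in allowed
    )

# Trusted enumeration (hypothetical excerpt of the public data).
public_data = {
    ("an9:PositionCode58", RDF_TYPE, "an9:PositionCode"),
    ("an9:PositionCode58", "an9:Code", 58),
}

# User input: a declaration using the invalid position code 31.
declaration = {
    ("ex:occ1", RDF_TYPE, "ont:Occupation"),
    ("ex:occ1", "ont:PositionCode", 31),
}

# The poisonous triples of Listing 3, submitted together with the declaration.
poison = {
    ("http://maliciousInput.be/PositionCode31", RDF_TYPE, "an9:PositionCode"),
    ("http://maliciousInput.be/PositionCode31", "an9:Code", 31),
}

honest = enumeration_violations(public_data | declaration)
poisoned = enumeration_violations(public_data | declaration | poison)
print(honest)    # the invalid code is reported
print(poisoned)  # no violation is reported any more
```

Because the check runs over the union of all triples, it has no way to tell that the enumeration entry for 31 came from the user rather than from the public data.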


    4.2. Data Leakage
    Data leakage occurs when sensitive information stored within the KG is exposed during valida-
    tion. Users only interact with the SHACL processor and do not have direct access to private
    data. However, some rules require private data to be included in the data graph. Thus, an
    attacker could forge declarations and deduce sensitive data based on their validity.
       To illustrate this exploit, we will now describe how the shape in Listing 2 can leak
    sensitive information. Suppose that the private data contains the triples shown in Listing 4,
    which state that someone with the social security number 77101500172 receives unemployment
    benefits. We will refer to this person as Bob. In normal circumstances, no employer would
    declare Bob as an employee.
    Private data:
    1   ont:UnempPers0 ont:INSS 77101500172.
    2   ont:UnempPers0 ont:R_90017_901234 ont:UnempBen0.
    3   ont:UnempBen0 a ont:UnemploymentBenefit.

    Forged declaration:
    1   ont:NatPers0 a ont:NaturalPerson.
    2   ont:NatPers0 ont:INSS 77101500172.
    3   ont:NatPers0 ont:R_90017_90012 ont:WorkRec0.
    4   ont:WorkRec0 a ont:WorkerRecord.

                             Listing 4: Private data (top) and forged declaration (bottom)

    ⁴ Note that poisoned data should not concern the declaration; otherwise, only fraud would occur instead of an
      exploit. For instance, if a declaration is erroneous due to a missing field, adding some fake data can make it pass
      the validation. This would be equivalent to fraudulently filling out a declaration.
  An attacker can forge a declaration stating that they employ Bob. The relevant data from
the forged declaration can be seen in Listing 4. The attacker can determine whether Bob
receives unemployment benefits from the validation result alone. If the validation report states
that the declaration does not conform (which is the case with this forged declaration), Bob is
unemployed. On the other hand, if the declaration conforms, no unemployment benefit is
recorded for Bob.
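
The inference described above can be sketched in Python as a yes/no oracle. This is a toy model under stated assumptions: `validation_service` is a hypothetical stand-in for the validation endpoint, it mimics only the pattern selected by Listing 2 over sets of tuples, and the second social security number used in the usage example is made up.

```python
RDF_TYPE = "rdf:type"

# Private data of Listing 4; the attacker never sees these triples.
PRIVATE_DATA = {
    ("ont:UnempPers0", "ont:INSS", 77101500172),
    ("ont:UnempPers0", "ont:R_90017_901234", "ont:UnempBen0"),
    ("ont:UnempBen0", RDF_TYPE, "ont:UnemploymentBenefit"),
}

def violates_listing_2(graph, person):
    """True if `person` has a worker record while some node sharing its INSS
    number has an unemployment benefit (the pattern selected by Listing 2)."""
    has_record = any(
        s == person and p == "ont:R_90017_90012"
        and (o, RDF_TYPE, "ont:WorkerRecord") in graph
        for (s, p, o) in graph
    )
    numbers = {o for (s, p, o) in graph if s == person and p == "ont:INSS"}
    benefit_holders = {
        s for (s, p, o) in graph
        if p == "ont:R_90017_901234"
        and (o, RDF_TYPE, "ont:UnemploymentBenefit") in graph
    }
    shares_number = any(
        (holder, "ont:INSS", n) in graph for holder in benefit_holders for n in numbers
    )
    return has_record and shares_number

def validation_service(user_input):
    """Hypothetical endpoint: merges user input with the private data and
    returns only whether the resulting data graph conforms."""
    graph = PRIVATE_DATA | user_input
    persons = {s for (s, p, o) in graph if p == RDF_TYPE and o == "ont:NaturalPerson"}
    return not any(violates_listing_2(graph, person) for person in persons)

def forged_declaration(inss):
    """Forged declaration of Listing 4 for an arbitrary INSS number."""
    return {
        ("ont:NatPers0", RDF_TYPE, "ont:NaturalPerson"),
        ("ont:NatPers0", "ont:INSS", inss),
        ("ont:NatPers0", "ont:R_90017_90012", "ont:WorkRec0"),
        ("ont:WorkRec0", RDF_TYPE, "ont:WorkerRecord"),
    }

def receives_unemployment_benefit(inss):
    """Attacker's inference, based solely on the conformance result."""
    return not validation_service(forged_declaration(inss))

print(receives_unemployment_benefit(77101500172))  # Bob
print(receives_unemployment_benefit(12345678901))  # made-up number
```

The attacker only calls `validation_service`, yet each call reveals one bit about the private data, so repeating it over many numbers would enumerate who receives benefits.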


5. Conclusion
This paper marks a first step in understanding the vulnerabilities present in SHACL by describing
exploits such as validation bypass and data leakage. We highlight the risks stemming from
SHACL’s inability to differentiate triples coming from different graphs. Moving forward, we
will develop and implement effective strategies to address these security concerns. In the
social security domain, where personal or sensitive data is used, convoluted approaches can
be conceived in which validation reports are not communicated to users or where humans are
kept in the loop in the validation process. Such solutions delegate all responsibilities to the
applications built on the KG. Ideally, approaches are embedded into and part of the KG. Our
goal is to address these declaratively using KG technologies. One potential avenue is to use
SHACL extensions⁵. This study is important to understand the vulnerabilities inherent in KGs.
Through these endeavors, we aim to enhance the overall understanding of security issues of
KG technologies and fortify them against potential exploits.
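
To make the idea of differentiating triples by origin concrete, the following Python sketch keeps user input and trusted data in separate sets and resolves the enumeration only against the trusted one. It is an illustration of the principle under our own assumptions, not SHACL and not the extension mentioned above; the function and the node names `ex:occ1` and `an9:PositionCode58` are hypothetical.

```python
RDF_TYPE = "rdf:type"

def enumeration_violations(user_graph, trusted_graph):
    """Check position codes found in the user input against the enumeration
    found in the trusted graph only; user-asserted codes are ignored."""
    allowed = {
        o for (s, p, o) in trusted_graph
        if p == "an9:Code" and (s, RDF_TYPE, "an9:PositionCode") in trusted_graph
    }
    return sorted(
        (s, o) for (s, p, o) in user_graph
        if p == "ont:PositionCode" and o not in allowed
    )

# Trusted enumeration (hypothetical excerpt of the public data).
trusted_graph = {
    ("an9:PositionCode58", RDF_TYPE, "an9:PositionCode"),
    ("an9:PositionCode58", "an9:Code", 58),
}

# User input containing both the declaration and the poisonous triples.
user_graph = {
    ("ex:occ1", RDF_TYPE, "ont:Occupation"),
    ("ex:occ1", "ont:PositionCode", 31),
    ("http://maliciousInput.be/PositionCode31", RDF_TYPE, "an9:PositionCode"),
    ("http://maliciousInput.be/PositionCode31", "an9:Code", 31),
}

violations = enumeration_violations(user_graph, trusted_graph)
print(violations)
```

With the origin of each triple preserved, the poisonous position code no longer masks the invalid value. This separation addresses the validation bypass only; the data leakage exploit relies on the conformance result itself and is not affected by it.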


References
[1] S. Tiwari, F. N. Al-Aswadi, D. Gaurav, Recent trends in knowledge graphs: theory and
    practice, Soft Computing 25 (2021) 8337–8355.
[2] Z. Xi, T. Du, C. Li, R. Pang, S. Ji, X. Luo, X. Xiao, F. Ma, T. Wang, On the security risks of
    knowledge graph reasoning, 2023. arXiv:2305.02383 .
[3] P. Bhardwaj, J. Kelleher, L. Costabello, D. O’Sullivan, Adversarial attacks on knowledge
    graph embeddings via instance attribution methods, in: Proceedings of the 2021 Conference
    on Empirical Methods in Natural Language Processing, Association for Computational
    Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 8225–8239.
[4] H. Zhang, T. Zheng, J. Gao, C. Miao, L. Su, Y. Li, K. Ren, Data poisoning attack against
    knowledge graph embedding, 2019. arXiv:1904.12052 .
[5] Shapes constraint language (SHACL), 2017. URL: https://www.w3.org/TR/shacl/.
[6] D. Chiem Dao, C. Debruyne, P. Stijfhals, Using knowledge graphs and SHACL to validate
    declaration forms: An experiment in the social security domain to assess SHACL’s applicability,
    in: 2nd EuropeaN Data conference On Reference data and SEmantics ENDORSE 2023, 14-16
    March 2023 - Proceedings, Publications Office of the European Union, 2023, pp. 85–96.
[7] M. Valzelli, A. Maurino, M. Palmonari, B. Spahiu, Towards an access control model for
    knowledge graphs (discussion paper), in: Proceedings of the 29th Italian Symposium on
    Advanced Database Systems, SEBD 2021, Pizzo Calabro (VV), Italy, September 5-9, 2021,
    volume 2994 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 339–346.
⁵ For instance, https://afs.github.io/shacl-datasets.html