<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowledge Graph Anonymization using Semantic Anatomization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maxime Thouvenot</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivier Curé</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philippe Calvez</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Université Paris Est</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>France.</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>firstname.lastname}@u-pem.fr</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>ENGIE CRIGEN - LAB CSAI philippe.calvez</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>@engie.com</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Organizations and companies that are using Knowledge Graphs need to consider the privacy preservation of the data they contain. This is generally performed by anonymization techniques such as triple suppression and generalization which are known to reduce the utility of the released datasets. This paper presents semantic anatomization, a novel anonymization technique, that retains all quasi-identifier and sensitive values in the RDF graph. Due to an aggregating mechanism and the exploitation of the semantics contained in ontologies, this technique preserves data correlation and supports high quality analysis from anonymized graphs. We demonstrate the potential of semantic anatomization in a real-world setting on client information of a multinational company.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>As the usage of Knowledge Graphs (KGs) is going mainstream, it becomes necessary
for organizations and companies to consider privacy-preserving data publishing (PPDP)
approaches. In PPDP, we refer to the targets of an anonymization technique as "entities
of interest" (EoI). Each EoI is identified by a set of values called attributes which can
be split into four distinct categories: (i) Explicit identifiers (EID) explicitly identify an
entity (e.g., a social security number), (ii) Quasi identifiers (QID) are sets of attributes
that together can potentially identify an entity, e.g., date of birth, zipcode and gender,
(iii) Sensitive attributes (SA) describe some sensitive information concerning an entity,
e.g., disease, political opinion, and (iv) Non-sensitive attributes (NSA) do not belong to
any of the previous categories. EID, QID and SA can be considered as private attributes
of the EoI while NSA are generally not concerned with privacy issues.</p>
      <p>
        In the context of data represented using the RDF data model, several approaches
and systems have been proposed. The anonymisation techniques that they are using
mainly concern the suppression of information [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], i.e., removal of some triples or
substitution of some RDF terms by blank nodes, and the generalization of some values
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In general, the adoption of these techniques favor the non-disclosure of entities over
the utility aspect, i.e., usefulness of the anonymized dataset to a group of users.
0 Copyright c 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
      </p>
      <p>In this work, we have developed an anonymisation technique that focuses on the
utility aspect but still provides high privacy guarantees. Denoted semantic
anatomization, rather than suppressing or generalizing entity QIDs, this technique focuses on a
semantic-aware grouping of SA. Additionally, it quantifies the amount of individuals
in each group. As a result, it preserves the correlation between precise entity QIDs and
semantically related SAs.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Semantic anatomization</title>
      <p>
        Semantic anatomization adapts to KGs the main idea of the anatomy anonymization
technique which has been applied to relational database [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. To the best of our
knowledge, this is the first implementation of an anatomy-based approach for KGs (note that
in an RDF graph context, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] only mentions it as a possible solution).
      </p>
      <p>Adapted to RDF graphs, anatomy implies that the relationships between the QIDs
and their SAs are removed by inserting intermediate nodes which will be used to gather
different SAs into groups. Unlike other anonymization methods, anatomization only
modifies the graph at the structural level by adding new nodes and edges. The fact
that QID values are not transformed allows us to avoid the information loss that would
be caused by some generalization mechanism and thus favor a quality analysis of the
released data.</p>
      <p>We now provide a comprehensive example. Fig. 1(a) presents an extract of an
original graph representing three patients (John Doe, Jane Doe and Smith) and some of
their QIDs. In Fig. 1(b), we present the anatomized graph of this extract. We can see
that the QID values have not been modified, the EIDs (John, Jane and Smith’s URI)
have been replaced by blank nodes and the hasDisease property does not point to a
disease concept but to a sub-graph group which is used to break sensitive relationships.
Several properties have been added to the original graph, namely inGroup, value and
cardinality, which belong to our Anatomy ontology. The latter two keep information
about SAs, their value and cardinality, i.e., number of occurrences in the dataset.</p>
      <p>Consider an adversary who has access to the data in Fig. 1(b), possesses personal
details about Jane Doe (e.g., her age, gender and zipcode) and knows that she appears
in this dataset. The attacker knows that Jane belongs to Group 1 but cannot infer any
additional information about her disease. However, using the cardinalities, he is able to
deduce that 23 of the entities contained in the group suffer from the Flu (resp. 13 from
HIV ). Hence, he can only make a probabilistic guess about the actual disease that Jane
has. Nevertheless, considering the utility aspect of this anonymization approach, it is
still possible to extract valuable statistics on the disease of a group of people with a
given age, zipcode or gender.</p>
      <p>
        The objective of a semantic-enabled anatomization approach is to make sure that
each group contains SA entries that are semantically related, hence, we are using an
ontology associated to the dataset. In particular, we are making an intensive usage of
the TBox’s concept hierarchy. By leveraging these structures and a similarity measure
denoted taxonomy similarity [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we are able to compare the different concepts in our
dataset and gather similar concepts together.
      </p>
      <p>The semantic anatomization algorithm we are presenting in this section computes
groups of SAs. Its result is used to create SPARQL update queries.</p>
      <p>We chose to adopt a bottom-up approach which can be divided into two parts. First,
we retrieve a set of initial clusters i.e., clusters comprised of only one SA. Second, a
merging step is performed and consists of the following operations: storing of all the
clusters in a list L then we compute the taxonomy similarity between all these clusters.
Similar clusters are merged and are thus defined by a set of concepts over which we
search for a least common ancestor (LCA). This LCA is then used as the concept for
the newly formed cluster. The process stops once every cluster contains at least 2
attributes. Note that the algorithm mainly takes as input the ontology and not the original
dataset (except for retrieving initial clusters). This is important when considering the
processing complexity of an algorithm since TBoxes are known to be orders of
magnitude smaller than ABoxes (this is particularly true in our anonymization context). In
fact, the complexity of our algorithm is O(n2) where n is the number of clusters
computed from the ontology.</p>
      <p>Once the final clusters are computed, we can produce the update operations to be
performed to apply anatomization. We iterate on each of the clusters and on each of the
attributes they contain in order to create the appropriate SPARQL queries (query
template as well as a detailed clustering algorithm are available on GitHub3). Considering
the graph of Fig 1(a), Its execution produces the graph presented in Fig. 1(b). Hence a
deterministic set of queries is executed to produce the released dataset.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>We now provide a preliminary evaluation of our approach on its effectiveness to fulfill
the end-users’ analysis need, e.g., counting queries. The experimentation is conducted
3 https://github.com/mthouv/SemanticAnatomization
over an extension of LUBM datasets. In order to support a concept-based SA, we
introduced a four-level concept hierarchy related to religion and assigned a religion to each
person instance in our different datasets (ranging from 300,000 to 1,000,000 triples).
We also added another attribute with no underlying ontology. Finally, we experimented
with different value distributions for theses attributes which are displayed in the GitHub
companion of this paper.</p>
      <p>In order to measure the efficiency of the approach, we designed a set of counting
queries involving 2 to 4 attributes (including 1 or 2 SAs) and computed the average
relative error in answering these queries. The relative error can be described as j Act
Est j =Act where Act and Est correspond respectively to the actual value derived from
the dataset and an estimated value computed from the anonymized graph.</p>
      <p>We provide the results for queries with 2 attributes in Figure 2 and show that the
average error is below 10%. This is much better than what is generally achieved with
generalization or suppression. Due to space constraint, the rest of the results can be
found on our GitHub repository.</p>
      <p>Regardless of the various parameters (queries’ size, value distribution, etc...), a
common fact is observed: the larger the dataset (and consequently the more EoI we have),
the more precise our approach is. This is due to the lesser selectivity of queries on
larger datasets, i.e., we compute our estimates on more significant subsets of the EoI.
Most importantly, we showed that, even in an anatomized KG, it is still possible to
retrieve valuable information. Furthermore, anatomization is not greatly affected by the
distribution of SA values and becomes more efficient as the number of EoI grows. We
observe slightly worse results when dealing with queries involving multiple SA but this
is expected considering the method used to compute our approximation. Finally, query
cardinality is not a limitation either.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We presented semantic anatomization as a new anonymization technique to help in
the PPDP process for KGs. This new technique reaches a data analysis quality that
can hardly be met with the standard generalization and suppression techniques. This is
mainly due to the preservation of data correlation between QIDs and SAs. Nevertheless,
generalization provides privacy guarantees that would be hard to obtain with any form
of anatomization. Hence, a future work is to integrate generalization in our system.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>B. Cuenca</given-names>
            <surname>Grau</surname>
          </string-name>
          and
          <string-name>
            <given-names>E. V.</given-names>
            <surname>Kostylev</surname>
          </string-name>
          .
          <article-title>Logical foundations of privacy-preserving publishing of linked data</article-title>
          . In D. Schuurmans and
          <string-name>
            <surname>M. P.</surname>
          </string-name>
          Wellman, editors,
          <source>AAAI</source>
          , pages
          <fpage>943</fpage>
          -
          <lpage>949</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Delanaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bonifati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rousset</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Thion</surname>
          </string-name>
          .
          <article-title>Query-based linked data anonymization</article-title>
          .
          <source>In ISWC 2018</source>
          , pages
          <fpage>530</fpage>
          -
          <lpage>546</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>B.</given-names>
            <surname>Heitmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hermsen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker.</surname>
          </string-name>
          k
          <article-title>-RDF-neighbourhood anonymity: Combining structural and attribute-based anonymisation for linked data</article-title>
          .
          <source>In PrivOn at ISWC</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Maedche</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Zacharias</surname>
          </string-name>
          .
          <article-title>Clustering ontology-based metadata in the semantic web</article-title>
          .
          <source>In PKDD 2002</source>
          , pages
          <fpage>348</fpage>
          -
          <lpage>360</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>F.</given-names>
            <surname>Radulovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>García-Castro</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Gómez-Pérez</surname>
          </string-name>
          .
          <article-title>Towards the anonymisation of RDF data</article-title>
          .
          <source>In SEKE</source>
          , pages
          <fpage>646</fpage>
          -
          <lpage>651</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tao</surname>
          </string-name>
          . Anatomy:
          <article-title>Simple and effective privacy preservation</article-title>
          .
          <source>In Proceedings of VLDB</source>
          , Seoul, Korea,
          <source>September 12-15</source>
          ,
          <year>2006</year>
          , pages
          <fpage>139</fpage>
          -
          <lpage>150</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>