<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A methodology for GDPR compliant data processing</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Salerno</institution>
          ,
          <addr-line>via Giovanni Paolo II n.132, 84084 Fisciano (SA)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>24</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>Nowadays new laws and regulations to prevent the privacy of users have been proposed. For instance, the General Data Protection Regulation (GDPR) is taking e ect in Europe, requiring organizations to de ne privacy policies complying with the preferences of their users. One way to abide by GDPR is to obscure sensitive data. However, in order not to limit the usage of data, it is vital to limit the amount of data to be obscured. To this end, we propose a methodology exploiting relaxed functional dependencies (rfds) to automatically identify attributes from which sensitive values can be derived. The methodology prescribes to partially encrypt database values causing data privacy threats, identi ed through the automatically discovered rfds.</p>
      </abstract>
      <kwd-group>
        <kwd>Data privacy</kwd>
        <kwd>Anonymity</kwd>
        <kwd>Data management</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>When a user provides personal data to use a services on the web, s/he will no
longer own them, rather they became property of the organization running the
services. To this end, the European Community has issued the General Data
Protection Regulation (GDPR), in order to ensure the protection of personal
user data while they are processed by organizations.</p>
      <p>Standard privacy prevention techniques, such as cryptography and anonymity,
could lead to the impossibility of using the data, even if part of them do not
represent sensitive data. For this reason, it is necessary to detect the data to be
considered sensitive, and those that would not a ect user's privacy.</p>
      <p>
        In this paper we present a new methodology that analyzes data correlations
detected by means of relaxed functional dependencies rfds [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], aiming to identify
potentially sensitive data that could break privacy preservation. In particular,
the proposed methodology aims to: (1) classify the data potentially yielding
violations users' anonymity, and (2) enhance privacy prevention, by determining
whether data declared as sensitive could be implied by identifying data that
could imply the values of sensitive data.
      </p>
      <p>The paper is organized as follows. In Section 2 we provide a formalization
of the privacy prevention problem, based on which the proposed methodology is
described Section 3. In Section 4 we present the results of several experiments
in order to validate the proposed methodology. Finally, conclusions and future
research directions are discussed in Section 5.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Problem description</title>
      <p>
        The two main concerns in data privacy are: anonymity and information con
dentiality. Anonymity can be intended as non-identi ability. Thus, organizations
must prevent the possibility to associate data to legitimate owners when letting
third parts access them [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To formalize the concept of anonymity, we de ne
the concept of anonimity-violating attribute set.
      </p>
      <p>Anonimity-violating attribute set. Given a relation schema R containing
user personal data, an attribute set X = fX1; : : : ; Xkg, X attr(R), and a
relation instance r of R, X represents an anonimity-violating attribute set if
and only if it permits to identify data tuples in r (we denote this set by X ).</p>
      <p>A relation R preserves the anonymity if and only if R does not contain an
anonymity-violating attribute set X . For instance, if a third-part knows an
identi er value, then s/he is able to identify a user with a certainty degree of
100%. However, in order to limit third-part's power, we must also deny access
to attributes enabling user's identi cation with a high certainty degree, even if
less than 100%.</p>
      <p>Information con dentiality is more a general concept. Here, the user would
protect data s/he considers as sensitive. In this case, starting from a set of user
speci ed sensitive data, we need to detect attributes from which it is possible to
derive them. To formalize the concept of information con dentiality, we
introduce the concept of con dentiality-violating attribute set.</p>
      <p>Con dentiality-violating attribute set. Given a relation schema R
containing user personal data, a relation instance r of R, and two attribute sets
X; Y attr(R), where Y = fY1; : : : ; Yhg is the set of user speci ed sensitive
attributes by user, then X represents a con dentiality-violating attribute set, if
and only if it is not a key, but it determines at least one Yi, one Yi 2 Y (we
denote this set by X ).</p>
      <p>A relation R preserves the information con dentiality if and only if (i) it does
not contain user speci ed sensitive attributes or (ii) they are obscured and R
does not contain con dentiality-violating attribute sets X . To this end, we use
the concept of functional determination in order to exclude the possibility to
derive values of attributes declared as sensitive. Thus, given a sensitive attribute
A, if a third-part knows values of attributes determining those of A, then s/he
could be able to discover values of A with a certainty degree of 100%, and with
a maximum accuracy degree. However, it would be useful to limit third-part's
power not only by decreasing the certainty degree (as de ned for anonymity
violations), but also by excluding the possibility to use values that are similar
to those determining sensitive ones, i.e. by considering similarity-based matches.</p>
      <p>A GDPR compliant data privacy preservation
For this reason, in this case we can identify a con dentiality violating attribute
set X by using: (i) the above de ned measure and a threshold ", and (ii) a
set of constraints containing similarity-based matching predicates.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>
        We propose a methodology that guarantees user privacy for both anonymity
and information con dentiality while permitting to continue use data. It
exploits Relaxed Functional Dependencies (rfds) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which enable us to detect
sensitiveness of data, and to reduce the encryption processes only to them.
      </p>
      <p>Rfds extend Functional Dependencies (fds) by relaxing some constraints
of their de nition. In particular, they might relax on the attribute comparison
method, and or on the fact that the dependency must be satis ed by the
entire database. Relaxing on the attribute comparison method means adopting an
approximate tuple comparison operator, instead of the \equality" operator. In
order to de ne the type of attribute comparison used within an rfd, we use the
concept of constraint. Instead, a dependency holding for \almost" all tuples or
for a \subset" of them is said to relax on the extent. In this case, a coverage
measure or a condition is speci ed to quantify the subset of tuples on which the
rfd holds.</p>
      <p>More formally, the following rfd</p>
      <p>X 1
"
! Y 2
(1)
holds on a relation instance r of R i : 8 (t1; t2) 2 r, if t1[X] and t2[X] agree with
the constraints speci ed by 1, then t1[Y ] and t2[Y ] agree with the constraints
speci ed by 2 with a degree of certainty (measured by ) greater than ".</p>
      <p>Our methodology exploits rfds discovered from data to identify sensitive
data, using block ciphers to encrypt a minimal set of attributes among those
containing sensitive data. Then, ranking techniques are applied to discovered
rfds, in order to detect a minimal set of attributes to be encrypted to guarantee
anonymity and information con dentiality.</p>
      <p>
        Anonimity. Given a relation R, we need to identify all of its sets X , de ning
a way to make them no longer accessible on R. Formally, to identify such set
of attributes X we map the concept of anonymity to that of key dependency.
In particular, since a set X identi es a user, it will also be the Left-Hand-Side
(LHS) of a key dependency, to preserve the anonymity of R we need to identify
the minimum attribute set Z attr(R), such that by obscuring all attributes in
Z from R, then no anonymity violation can be found. To automatically obtain Z
we must use a metric to rank rfds discovered from data. We de ned two simple
and e ective ranking metrics, but they do not always guarantee the minimality
of the attribute set to be encrypted in order to satisfy privacy requirements, due
to the fact that this problem can be reduced to the Minimum Feedback Vertex
Set [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that is NP-Complete.
      </p>
      <p>Information con dentiality. Given a relation R, we need to identify all the
con dentiality violating attribute sets X in R, and de ne a way through which
X is no longer accessible on R. Formally, to identify a set X , we map the
concept of information con dentiality to rfds relaxing on the extent by a coverage
measure, and to attribute comparison by means of similarity constraints. The
latter represents the required accuracy degree.</p>
      <p>Given the set of all X s in R, to preserve the con dentiality of R we need to
identify a minimal set of attributes Z attr(R), such that by obscuring Z from
R, no more con dentiality violation can be found.</p>
      <p>
        The cryptographic technique that we use in the proposed methodology is
block cipher [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In particular, given the set of \sensitive" attributes calculated
for the i-th user (according to the user choices and the application of rfds),
denoted by Xi = X1i ; : : : Xmi, we encrypt each Xi with a di erent key. The
user's key permits to decrypt his/her set of sensitive data.
4
      </p>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>We validated the proposed methodology, by conducting experiments on six
datasets derived from the real-world datasets Customers, Cancer, Job, Votes,
WholeSale and Echocardiogram, available from the UCI machine learning
repository, to which we added same personal data. They show that the number of
attributes to be encrypted is in general small, and that the proposed
methodology can e ectively help to detect anonymity and/or con dentiality threats.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>
        We have proposed a new methodology to automatically identify and partially
encrypt \sensitive" data in order to guarantee anonymity and information
condentiality. It can help organizations comply with the GDPR privacy
prevention regulations. The identi cation procedure exploits automatically discovered
rfds [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], in order to derive the minimal set of data to be encrypted.
      </p>
      <p>
        In the future, we would like to apply this methodology in the context of data
manipulation processes that can potentially break data privacy, such as data
integration [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and schema evolution.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Caruccio</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deufemia</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polese</surname>
          </string-name>
          , G.:
          <article-title>On the discovery of relaxed functional dependencies</article-title>
          .
          <source>In: IDEAS</source>
          . pp.
          <volume>53</volume>
          {
          <issue>61</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Caruccio</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deufemia</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polese</surname>
          </string-name>
          , G.:
          <article-title>Relaxed functional dependencies - A survey of approaches</article-title>
          .
          <source>IEEE TKDE 28(1)</source>
          ,
          <volume>147</volume>
          {
          <fpage>165</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fomin</surname>
            ,
            <given-names>F.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaspers</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pyatkin</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Razgon</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>On the minimum feedback vertex set problem: Exact and enumeration algorithms</article-title>
          .
          <source>Algorithmica</source>
          <volume>52</volume>
          (
          <issue>2</issue>
          ),
          <volume>293</volume>
          {
          <fpage>307</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mohammed</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fung</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hung</surname>
            ,
            <given-names>P.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>C.K.</given-names>
          </string-name>
          :
          <article-title>Centralized and distributed anonymization for high-dimensional healthcare data</article-title>
          .
          <source>ACM TKDD 4</source>
          (
          <issue>4</issue>
          ),
          <volume>18</volume>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Stallings</surname>
            ,
            <given-names>W.:</given-names>
          </string-name>
          <article-title>The o set codebook (ocb) block cipher mode of operation for authenticated encryption</article-title>
          .
          <source>Cryptologia</source>
          <volume>42</volume>
          (
          <issue>2</issue>
          ),
          <volume>135</volume>
          {
          <fpage>145</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>