
A methodology for GDPR compliant data processing

Department of Computer Science, University of Salerno, via Giovanni Paolo II n.132, 84084 Fisciano (SA), Italy

2018

24 27

Nowadays, new laws and regulations to protect the privacy of users have been proposed. For instance, the General Data Protection Regulation (GDPR) is taking effect in Europe, requiring organizations to define privacy policies complying with the preferences of their users. One way to abide by the GDPR is to obscure sensitive data. However, in order not to limit the usage of data, it is vital to limit the amount of data to be obscured. To this end, we propose a methodology exploiting relaxed functional dependencies (rfds) to automatically identify attributes from which sensitive values can be derived. The methodology prescribes partially encrypting the database values that cause data privacy threats, identified through the automatically discovered rfds.

Keywords: Data privacy · Anonymity · Data management

When a user provides personal data to use a service on the web, s/he no longer owns them; rather, they become the property of the organization running the service. For this reason, the European Union has issued the General Data Protection Regulation (GDPR), in order to ensure the protection of users' personal data while they are processed by organizations.

Standard privacy preservation techniques, such as cryptography and anonymization, can make the data unusable, even when part of them is not sensitive. For this reason, it is necessary to distinguish the data to be considered sensitive from those that would not affect users' privacy.

In this paper we present a new methodology that analyzes data correlations detected by means of relaxed functional dependencies (rfds) [2], aiming to identify potentially sensitive data that could break privacy preservation. In particular, the proposed methodology aims to: (1) classify the data potentially yielding violations of users' anonymity, and (2) enhance privacy preservation by identifying data from which the values of data declared as sensitive could be derived.

The paper is organized as follows. In Section 2 we provide a formalization of the privacy preservation problem, based on which the proposed methodology is described in Section 3. In Section 4 we present the results of several experiments carried out to validate the proposed methodology. Finally, conclusions and future research directions are discussed in Section 5.

2 Problem description

The two main concerns in data privacy are anonymity and information confidentiality. Anonymity can be understood as non-identifiability. Thus, organizations must prevent data from being associated with their legitimate owners when letting third parties access them [4]. To formalize the concept of anonymity, we define the concept of anonymity-violating attribute set.

Anonymity-violating attribute set. Given a relation schema $R$ containing user personal data, an attribute set $X = \{X_1, \ldots, X_k\}$, $X \subseteq attr(R)$, and a relation instance $r$ of $R$, $X$ represents an anonymity-violating attribute set if and only if it permits identifying data tuples in $r$ (we denote this set by $X$).

A relation $R$ preserves anonymity if and only if it does not contain an anonymity-violating attribute set $X$. For instance, if a third party knows an identifier value, then s/he is able to identify a user with a certainty degree of 100%. However, in order to limit the third party's power, we must also deny access to attributes enabling user identification with a high certainty degree, even if less than 100%.
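As a minimal sketch of the notion above, the following Python snippet estimates how strongly an attribute set identifies tuples. The measure used here (the fraction of tuples whose projection on the attribute set is unique), the names `identification_certainty` and `is_anonymity_violating`, and the toy relation are illustrative assumptions, not the definition adopted in the paper.

```python
from collections import Counter


def identification_certainty(rows, attrs):
    """Fraction of tuples whose projection on `attrs` is unique in `rows`.

    Assumed stand-in for the "certainty degree" mentioned in the text.
    """
    if not rows:
        return 0.0
    projections = [tuple(row[a] for a in attrs) for row in rows]
    counts = Counter(projections)
    unique = sum(1 for p in projections if counts[p] == 1)
    return unique / len(rows)


def is_anonymity_violating(rows, attrs, threshold=1.0):
    """True if `attrs` identifies tuples with certainty >= `threshold`."""
    return identification_certainty(rows, attrs) >= threshold


# Hypothetical relation instance with some personal data.
rows = [
    {"ssn": "A1", "zip": "84084", "birth_year": 1980, "city": "Fisciano"},
    {"ssn": "B2", "zip": "84084", "birth_year": 1980, "city": "Fisciano"},
    {"ssn": "C3", "zip": "84084", "birth_year": 1991, "city": "Fisciano"},
    {"ssn": "D4", "zip": "84100", "birth_year": 1975, "city": "Salerno"},
]

print(identification_certainty(rows, ["ssn"]))                # 1.0
print(identification_certainty(rows, ["zip", "birth_year"]))  # 0.5
print(is_anonymity_violating(rows, ["zip", "birth_year"], threshold=0.5))  # True
```

Lowering `threshold` below 1.0 corresponds to also flagging attribute sets that identify users with a high, but not total, certainty degree.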

Information confidentiality is a more general concept. Here, the user wants to protect the data s/he considers sensitive. In this case, starting from a set of user-specified sensitive data, we need to detect the attributes from which it is possible to derive them. To formalize the concept of information confidentiality, we introduce the concept of confidentiality-violating attribute set.

Confidentiality-violating attribute set. Given a relation schema $R$ containing user personal data, a relation instance $r$ of $R$, and two attribute sets $X, Y \subseteq attr(R)$, where $Y = \{Y_1, \ldots, Y_h\}$ is the set of user-specified sensitive attributes, $X$ represents a confidentiality-violating attribute set if and only if it is not a key but determines at least one $Y_i \in Y$ (we denote this set by $X$).

A relation $R$ preserves information confidentiality if and only if (i) it does not contain user-specified sensitive attributes, or (ii) they are obscured and $R$ does not contain confidentiality-violating attribute sets $X$. To this end, we use the concept of functional determination in order to exclude the possibility of deriving the values of attributes declared as sensitive. Thus, given a sensitive attribute $A$, if a third party knows the values of attributes determining those of $A$, then s/he could discover the values of $A$ with a certainty degree of 100% and with maximum accuracy. However, it would be useful to limit the third party's power not only by decreasing the certainty degree (as defined for anonymity violations), but also by excluding the possibility of using values that are similar to those determining sensitive ones, i.e., by considering similarity-based matches.
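The exact (100% certainty, equality-based) case of a confidentiality-violating attribute set can be sketched as follows. The helper names `is_key`, `determines`, and `is_confidentiality_violating`, as well as the toy patient data, are hypothetical and only illustrate the definition; they are not taken from the paper.

```python
def is_key(rows, attrs):
    """True if no two tuples share the same projection on `attrs`."""
    seen = set()
    for row in rows:
        projection = tuple(row[a] for a in attrs)
        if projection in seen:
            return False
        seen.add(projection)
    return True


def determines(rows, lhs, rhs_attr):
    """True if the functional dependency lhs -> rhs_attr holds on `rows`."""
    mapping = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # setdefault stores the first value seen for this LHS projection
        if mapping.setdefault(key, row[rhs_attr]) != row[rhs_attr]:
            return False
    return True


def is_confidentiality_violating(rows, attrs, sensitive):
    """`attrs` is not a key but determines at least one sensitive attribute."""
    if is_key(rows, attrs):
        return False
    return any(determines(rows, attrs, y) for y in sensitive if y not in attrs)


# Hypothetical relation: "diagnosis" is declared sensitive by the user.
patients = [
    {"id": 1, "ward": "oncology", "age": 50, "diagnosis": "cancer"},
    {"id": 2, "ward": "oncology", "age": 61, "diagnosis": "cancer"},
    {"id": 3, "ward": "cardiology", "age": 50, "diagnosis": "arrhythmia"},
    {"id": 4, "ward": "cardiology", "age": 47, "diagnosis": "arrhythmia"},
]

print(is_confidentiality_violating(patients, ["ward"], ["diagnosis"]))  # True
print(is_confidentiality_violating(patients, ["age"], ["diagnosis"]))   # False
```

In this toy instance, obscuring `diagnosis` alone would not be enough, because `ward` still reveals it.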

For this reason, in this case we can identify a confidentiality-violating attribute set $X$ by using: (i) the certainty degree measure defined above and a threshold $\varepsilon$, and (ii) a set of constraints containing similarity-based matching predicates.

3 Methodology

We propose a methodology that guarantees user privacy in terms of both anonymity and information confidentiality while still allowing the data to be used. It exploits relaxed functional dependencies (rfds) [2], which enable us to detect the sensitiveness of data and to restrict encryption to sensitive data only.

Rfds extend functional dependencies (fds) by relaxing some constraints of their definition. In particular, they might relax on the attribute comparison method and/or on the requirement that the dependency be satisfied by the entire database. Relaxing on the attribute comparison method means adopting an approximate tuple comparison operator instead of the "equality" operator. In order to define the type of attribute comparison used within an rfd, we use the concept of constraint. Instead, a dependency holding for "almost" all tuples or for a "subset" of them is said to relax on the extent. In this case, a coverage measure or a condition is specified to quantify the subset of tuples on which the rfd holds.

More formally, the following rfd

$$X_{\Phi_1} \xrightarrow{\Psi > \varepsilon} Y_{\Phi_2} \qquad (1)$$

holds on a relation instance $r$ of $R$ iff: $\forall\, t_1, t_2 \in r$, if $t_1[X]$ and $t_2[X]$ agree with the constraints specified by $\Phi_1$, then $t_1[Y]$ and $t_2[Y]$ agree with the constraints specified by $\Phi_2$ with a degree of certainty (measured by $\Psi$) greater than $\varepsilon$.
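To make the definition more concrete, the sketch below checks an rfd on a small relation instance. It assumes that constraints are per-attribute similarity predicates and that the degree of certainty is the fraction of tuple pairs agreeing on the LHS that also agree on the RHS; both choices, together with the names `within`, `rfd_certainty`, and `rfd_holds`, are illustrative assumptions rather than the measure used in the paper.

```python
from itertools import combinations


def within(tolerance):
    """Similarity predicate: two numeric values differ by at most `tolerance`."""
    return lambda a, b: abs(a - b) <= tolerance


def rfd_certainty(rows, lhs, rhs):
    """Among tuple pairs similar on the LHS, fraction also similar on the RHS.

    `lhs` and `rhs` map attribute names to similarity predicates.
    """
    matched = satisfied = 0
    for t1, t2 in combinations(rows, 2):
        if all(sim(t1[a], t2[a]) for a, sim in lhs.items()):
            matched += 1
            if all(sim(t1[a], t2[a]) for a, sim in rhs.items()):
                satisfied += 1
    # With no pair matching the LHS, the dependency holds vacuously.
    return satisfied / matched if matched else 1.0


def rfd_holds(rows, lhs, rhs, epsilon):
    """True if the assumed certainty degree is greater than `epsilon`."""
    return rfd_certainty(rows, lhs, rhs) > epsilon


# Hypothetical data: does a similar age imply a similar salary?
employees = [
    {"age": 30, "salary": 30000},
    {"age": 31, "salary": 30500},
    {"age": 32, "salary": 34000},
    {"age": 50, "salary": 60000},
]
lhs = {"age": within(1)}
rhs = {"salary": within(1000)}

print(rfd_certainty(employees, lhs, rhs))   # 0.5
print(rfd_holds(employees, lhs, rhs, 0.4))  # True
```

Setting every tolerance to 0 and requiring full certainty recovers the behavior of a canonical functional dependency under this assumed measure.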

Our methodology exploits rfds discovered from data to identify sensitive data, and uses block ciphers to encrypt a minimal set of attributes among those containing sensitive data. To this end, ranking techniques are applied to the discovered rfds in order to detect a minimal set of attributes to be encrypted to guarantee anonymity and information confidentiality.

Anonymity. Given a relation $R$, we need to identify all of its anonymity-violating attribute sets $X$ and define a way to make them no longer accessible on $R$. Formally, to identify such attribute sets we map the concept of anonymity to that of key dependency. In particular, since a set $X$ identifies a user, it is also the Left-Hand-Side (LHS) of a key dependency. To preserve the anonymity of $R$ we need to identify the minimum attribute set $Z \subseteq attr(R)$ such that, by obscuring all attributes in $Z$ from $R$, no anonymity violation can be found. To automatically obtain $Z$ we must use a metric to rank the rfds discovered from data. We defined two simple and effective ranking metrics, but they do not always guarantee the minimality of the attribute set to be encrypted in order to satisfy privacy requirements, because this problem can be reduced to the Minimum Feedback Vertex Set problem [3], which is NP-complete.

Information confidentiality. Given a relation $R$, we need to identify all the confidentiality-violating attribute sets $X$ in $R$ and define a way through which each $X$ is no longer accessible on $R$. Formally, to identify a set $X$, we map the concept of information confidentiality to rfds relaxing on the extent by means of a coverage measure, and on the attribute comparison by means of similarity constraints. The latter represent the required accuracy degree.

Given the set of all confidentiality-violating attribute sets $X$ in $R$, to preserve the confidentiality of $R$ we need to identify a minimal set of attributes $Z \subseteq attr(R)$ such that, by obscuring $Z$ from $R$, no confidentiality violation can be found.
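One simple way to approximate the attribute set $Z$ is a greedy heuristic that repeatedly obscures the attribute appearing in the largest number of still-unbroken violating sets. This is not one of the two ranking metrics mentioned above, which are not detailed here; the function name `greedy_attributes_to_obscure` and the input sets are hypothetical, and the sketch assumes that obscuring any single attribute of a violating set is enough to break it. Like any heuristic for this problem, it does not guarantee minimality.

```python
from collections import Counter


def greedy_attributes_to_obscure(violating_sets):
    """Greedily pick attributes until every violating set loses one attribute."""
    remaining = [set(s) for s in violating_sets if s]
    chosen = []
    while remaining:
        counts = Counter(a for s in remaining for a in s)
        # Most frequent attribute first; ties broken alphabetically
        # so that the result is deterministic.
        best = min(counts, key=lambda a: (-counts[a], a))
        chosen.append(best)
        remaining = [s for s in remaining if best not in s]
    return chosen


# Hypothetical violating attribute sets found through discovered rfds.
violating_sets = [{"ssn"}, {"zip", "birth_year"}, {"zip", "phone"}]

print(greedy_attributes_to_obscure(violating_sets))  # ['zip', 'ssn']
```

Here two attributes suffice to break all three violating sets, leaving `birth_year` and `phone` usable in clear form.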

The cryptographic technique that we use in the proposed methodology is the block cipher [5]. In particular, given the set of "sensitive" attributes calculated for the $i$-th user (according to the user's choices and the application of rfds), denoted by $X^i = X^i_1, \ldots, X^i_m$, we encrypt each $X^i$ with a different key. The user's key permits decrypting his/her set of sensitive data.

4 Evaluation

We validated the proposed methodology by conducting experiments on six datasets derived from the real-world datasets Customers, Cancer, Job, Votes, WholeSale and Echocardiogram, available from the UCI machine learning repository, to which we added some personal data. The results show that the number of attributes to be encrypted is in general small, and that the proposed methodology can effectively help to detect anonymity and/or confidentiality threats.

5 Conclusions and Future Work

We have proposed a new methodology to automatically identify and partially encrypt "sensitive" data in order to guarantee anonymity and information confidentiality. It can help organizations comply with the privacy protection requirements of the GDPR. The identification procedure exploits automatically discovered rfds [1] in order to derive the minimal set of data to be encrypted.

In the future, we would like to apply this methodology in the context of data manipulation processes that can potentially break data privacy, such as data integration [2] and schema evolution.

1. Caruccio, L., Deufemia, V., Polese, G.: On the discovery of relaxed functional dependencies. In: IDEAS. pp. 53–61 (2016)

2. Caruccio, L., Deufemia, V., Polese, G.: Relaxed functional dependencies - A survey of approaches. IEEE TKDE 28(1), 147–165 (2016)

3. Fomin, F.V., Gaspers, S., Pyatkin, A.V., Razgon, I.: On the minimum feedback vertex set problem: Exact and enumeration algorithms. Algorithmica 52(2), 293–307 (2008)

4. Mohammed, N., Fung, B., Hung, P.C., Lee, C.K.: Centralized and distributed anonymization for high-dimensional healthcare data. ACM TKDD 4(4), 18 (2010)

5. Stallings, W.: The offset codebook (OCB) block cipher mode of operation for authenticated encryption. Cryptologia 42(2), 135–145 (2018)