ICAASE'2014 A New Distributed Anonymization Protocol to Satisfy Multiple Data Provider’s Privacy Requirements A new distributed anonymization protocol to satisfy multiple data provider’s privacy requirements Salheddine kabou Sidi Mohamed Benslimane EEDIS Laboratory Computer Sciences Department Djillali Liabes University of Sidi Bel Abbes Djillali Liabes University of Sidi Bel Abbes salheddine.kabou@univ-sba.dz benslimane@univ-sba.dz Abstract –Privacy and security concerns are among the main obstacles facing the widespread adoption of this new technology. Data anonymization makes data worthless to anyone except the owner of the data. It is one of the methods for transforming the data in such a way that it prevents identification of key information from an unauthorized person. Most of the existing works use a k-anonymat model for preserving privacy for data subject that offers lower utility. Motivated by this, we develop a new distributed anonymization protocol to satisfy multiple data providers privacy requirements based on a k-concealment model that offers a higher utility. Keywords – Privacy, Concealment, Anonymization, Cloud Computing 1. INTRODUCTION 2. As the output of the protocol, each private dataset produces a local The confidentiality of this data must be preserved anonymized dataset that satisfies each before outsourcing to the commercial public data provider’s privacy constraints and cloud, i.e. any sensitive information should not be their union forms a global virtual disclosed. Data anonymization is one of the database that meets a global privacy preserving techniques that translate the anonymization principle (k-conealment). information, making the original data worthless to anybody except the owners [1]. It has been widely The rest of the paper is organized as follows. In discussed in the literature such as k-anonymity Section II, we give we discuss related works to [2], l-diversity [3], k-concealment [4]. The our research. In Section III, presents our objective of our work is to adopt a new distributed distributed anonymization protocol, and Section anonymization protocol over cloud servers, which IV explain a R*-tree generalization principal to can satisfy different data provider’s personal achieve the protocol. Finally, we conclude our needs and maximize the utility of the anonymous data still remains a challenging problem using an discussion in Section V. alternative model. In this paper, we offer the 2. RELATED WORK following contributions: 2.1 Privacy preserving data publishing for 1. We present an algorithm which inserts single databases data subjects into an R*-tree for anonymization protocol using a k- In recent years, Privacy preserving data concealment model. publishing for centralized databases has been International Conference on Advanced Aspects of Software Engineering ICAASE, November, 2-4, 2014, Constantine, Algeria. 201 ICAASE'2014 A New Distributed Anonymization Protocol to Satisfy Multiple Data Provider’s Privacy Requirements studied extensively [5](.K-anonymity,l-diversity) site and assumes that participants are fully K-concealment proposes to generalize the table available. Ding et al [10] presented a distributed records so that each one of them becomes anonymization algorithm using R-tree index computationally - indistinguishable from at least structure [11]. Their algorithm uses a greedy k−1 others. In this study, our distributed approach to recursively insert the data objects anonymization protocol is built on top if the k- into the quasi-identifier domain space. concealment principles. The above works have used a k-anonymity model 2.2 Privacy preserving data publishing for for the privacy of data subjects. They choose it multiple databases because it’s useful in several practical applications; also it suffers in data utility. Here, There are a number of possible solutions that can our work is aimed at outsourcing data provider’s be applied for anonymizing distributed data private dataset to cloud servers for data sharing. The integrate-and-anonymize [8] solution is More importantly, our anonymization protocol the simplest approach that assumes an existence aims to achieve anonymity for both (1) data of third party that can be trusted by each of the subjects: by using a k-concealment model which data owners. Here data owners send their data ensures the same level of security and offers to the trusted third party where data integration higher utility than others models that have used and anonymization are performed. This approach in the above works, and (2) data providers: by is not feasible for many scenarios [7]. Finding designing a new distributed anonymization such a trusted third party is not always feasible. algorithm that uses a R*-tree structure [12]. Its Compromise of the server by hackers could lead offers a best choosing of appropriate insertion of to a complete privacy loss for all the participating data objects into the quasi-identifier domain parties and data subjects. space, and it find an adequate partitioning of the quasi-identifier domain space. The basic idea of an alternative approach anonymize-and-integrate is for each data 3. DISTRIBUTED ANONYMIZATION provider to perform data anonymization PROTOCOL independently. In case of horizontal data 3.1 Algorithm distribution [8], this tends to have a negative impact on data quality. For example, enforcing There are a large number of algorithms proposed local k-anonymity can cause the data utility to to achieve k-anonymity. These algorithms have suffer more than enforcing global k-anonymity. If been successfully used to enhance anonymity in data is distributed vertically [9], the integration of distributed environments. Although to our best knowledge, the model k-concealment has not yet the locally anonymous datasets can result in been used for the cloud environment. global non-anonymous datasets. Algorithm.1 Master In the Virtual anonymization [6] [10] data Phase 1: Reading data (slave) providers participate in distributed protocols to Read data (B, num) from each slave node into a set r produce a virtual integrated and anonymized Phase 2: Insertion on R*-tree (Generalisation) database. Important to note is that the For each rectangle B ϵ r, do Finds on every level of the tree, the most suitable subtree anonymized data still resides at individual to accommodate the new entry (see process of R*-tree databases and the integration and anonymization generalisation) If split is possible, do split (see process of split) of the data is performed through the secure End for distributed protocols. Phase 3: Modify the local ki-concealment databases In [6] the authors propose an adaptation of the For each equivalence class Ei of the K-concealment table, do Get rectangles set P contained in R* of Ei Mondrian generalization-based algorithm for For each rectangle B ϵ P, do send B and R* to slave producing a k-anonymous release of the union of node the datasets by using a set of secure multi-party End for End for computation protocols (i.e., secure computing sum, secure median protocols). The execution Figure.2 Distributed anonymization algorithm- sequence is under the control of a single leading Master node International Conference on Advanced Aspects of Software Engineering International ICAASE, Conference November, 2-4, on Advanced 2014, Aspects Constantine, of Software Engineering Algeria. 202 ICAASE, November, 2-4, 2014, Constantine, Algeria. ICAASE'2014 A New Distributed Anonymization Protocol to Satisfy Multiple Data Provider’s Privacy Requirements Algorithm.2 Slave node i (i>0) The R*-tree uses the following steps to find a Phase 1: Sending data to the master better split. In a first step, the split axis has to be For each equivalence class Ej of ki concealment table, do chosen. Along each axis, the entries are first Send information of Ej to the Master in form (B, num) end For sorted by the lower value, and then sorted by the upper value of their rectangles (rectangles of Phase 2: Receive modification from the master Read the data B and R* from the Master equivalence classes). For each sort M-2m+2 For each equivalence class E j of ki-concealment table, do distributions of the M+1 (node a) entries into two if rectangle of Ej equals B groups are determined (M is the maximum Enlarge the rectangle of Ej to R* end if number of entries that will fit in one node and m end For is a minimum number of entries), where the k-th distribution (k=1, …., M-2m+2) is determined as Figure.2 Distributed anonymization algorithm- follows: The first group contains the first (m–1) + Slave node k entries, the second group contains the remaining entries. We assume that a master node is selected from cloud server for the main protocol, and all the For each of the M–2m+2 distributions, the other local databases are consider as slave margin-value is determined by summarizing the nodes. The protocols for the master node and margin length of the two minimum bounding other slave nodes are presented in Algorithm1 boxes (rectangles of equivalence classes) of both and 2. In the first, the slave node sends a data that is abstract information of each equivalence distributions. Finally, the axis that returns the class to the Master node in form (B, num). Where minimum margin value is selected a split axis. In B refers a d-dimensional rectangle which is the the next step, an adequate partitioning of the bounding box of the equivalence class’s spacial entries along the split axis determined. The QI values [l1,u1], ...,[ld,ud], and num is the total overlap-value for each of the 2(M–2m+2) number of data subjects in the equivalence class. distributions is considered where the overlap- value denotes the size of the area of the overlap In Phase 1, the master node reads data (B, num) between the two minimum bounding boxes of of each equivalence class that had sent from both partitions. Here, the secure sum protocols each slave node into a set r. In phase 2, the [13] can be used to securely compute the algorithm inserts each rectangle B from r into an minimum area-value denoting the sum of the size R*-tree: for the first time, it finds on every level of the tree, the most suitable subtree to of the areas covered by both minimum bounding accommodate the new entry (minimum boxes of both partitions [12]. overlapping), and if split is possible, do split. 4. R*-TREE GENERALIZATION In phase 3, the master node modifies all the initial local ki-concealment databases by traversing Our algorithm uses an R*-tree. It generalizes the each equivalence class Ei of the K-concealment data by inserting it into the minimum overlapping table, to find the rectangles set P contained in it. quasi-identifier domain space and when overflow occurs, the quasi-identifier domain space will be For each rectangle B ϵ P that contained in the split into two parts along the best axis chosen. It bounding box R* of Ei, we send data B and R* recursively chooses the best branch to inset the back to a corresponding slave node. When data objects for gathering the closest data receive the data B and R* from the master, the objects. When all the data tuples are inserted into slave node finds out the equivalence class whose the R*-tree, the generalization table was built. rectangle equals B and then enlarges its rectangle to R*. A non-leaf node contains entries of the form (cp, B) where cp is the address of the child node in the 3.2. Split process R*-tree and B is the minimum bounding rectangle of all rectangles which are entries in that child The objective is to split the data as much as node. A leaf node contains entries of the form possible while satisfying the privacy constraints to (Oid, B) where Oid refers to a record in the maximize the utility of anonymized data. database, and B = (B1, B2,…, Bd) is a d- dimensional rectangle which is the bounding box International Conference on Advanced Aspects of Software Engineering International ICAASE, Conference November, 2-4, on Advanced 2014, Aspects Constantine, of Software Engineering Algeria. 203 ICAASE, November, 2-4, 2014, Constantine, Algeria. ICAASE'2014 A New Distributed Anonymization Protocol to Satisfy Multiple Data Provider’s Privacy Requirements of one equivalence class’s spacial QI values. at 6. REFERENCE this point, d is the number of dimensions and Bi is [1] Sedayao, “Enhancing cloud security using a bounded interval [x,y] describing the QI value Data Anonymization”, Intel white paper, June along dimension. 2012. [2] P. Samarati and L. Sweeney. “Generalizing The R*-tree construction is based on the insertion data to provide anonymity when disclosing algorithm [12], which focuses on: (1) information”. In ACM-SIGMOD Symposium on Principles of Database Systems (PODS), Choossubtree: Beginning in the root, descending page 188, 1998. to a leaf, it finds on every level the most suitable [3] A. Machanavajjhala, J. Gehrke, D. Kifer, and subtree to accommodate the new object. (2) The M. Venkitasubramaniam.” l-diversity: Privacy beyond k-anonymity”. In International split: If Choosesubtree ends in a node filled with Conference on Data Engineering (ICDE), the maximum number of objects M, split should page 24, 2006. distribute M+ 1 rectangles into two nodes in the [4] T.Tassa, A.Mazza, A.Gionis, “ k- Concealment: An Alternative Model of k-Type most appropriate manner. Anonymity”, Transactions on Data Privacy 5, pp189–222, 2012 We create the R*-tree by inserting every object [5] B. Fung, K. Wang, R. Chen, “ Privacy- into the indexing structure. Given a new object, preserving data publishing”, A survey of the authors in [13] consider only the area. Our recent developments. ACM Computing algorithm tests the parameters area and overlap Surveys, CSUR (2010) in different combinations. The overlap of an entry [6] P. Jurczyk, , L. Xiong, “Distributed Anonymization: Achieving Privacy for Both is defined as follows. Let E1,...,Ep be the entries Data Subjects and Data Providers”. In: in the current node. Then, Gudes, E., Vaidya, J. (eds.) Data and Applications Security 2009. LNCS, vol. 5645, pp. 191–207. Springer, Heidelberg (2009) [7] F. Kohlmayer, F. Prasser, C. Eckert, A. Kuhn, “A flexible approach to distributed data anonymization”, In Journal of Biomedical For choosing the best non-leaf node, we can use Informatics, August 2013. the method used in R-tree [11]. For the leaf [8] N. Mohammed, B. Fung , “Centralized and distributed anonymization for high- nodes, minimizing the overlap performed slightly dimensional healthcare data”. ACM Trans better. Knowl Discovery Data 2010;4(4):1–33. [9] W.Jiang, , C.Clifton, “A secure distributed 5. CONCLUSION framework for achieving k-anonymity”. VLDB Journal 15(4), 316–333 (2006) We have presented a distributed anonymization [10] X.Ding, Q.Yu, J.LI, J.Liu, H.Jin, “Distributed Anonymization for Multiple Data Providers in protocol for privacy-preserving data publishing a Cloud System”, In: W. Meng et al. (Eds.): from multiple data providers in a cloud DASFAA 2013, Part I, LNCS 7825, pp. 346– 360, 2013. environment. Our work addresses two important [11] A .Guttman,. “R-trees: a dynamic index issues, privacy of data subjects and privacy of structure for spatial searching”. In: SIGMOD data providers. For the privacy of data subjects (1984) (individuals), we have used a k-concealment [12] N. Beckmann, H.Kriegel, R.Schneider, B. model that offers a higher utility with less Seeger, “The R*-tree: An efficient and robust access method for points and rectan-gles”. generalisation than that which is required by k- In: Proceedings of the International anonymity used in [6] and [10]. For the privacy of Conference on Manage-ment of Data (SIGMOD’90), Atlantic City, NJ (1990) pp. data providers, we have adopted a bottom-up 322–331 algorithm instead of top-down approach used in [13] B¨ ottcher, S., Obermeier, S, “Secure set [6] while using R*-tree index for better union and bag union computation for guar- generalization. We also illustrated that the R*-tree anteeing anonymity of distrustful participants”, JSW 3(1), 9–17 (2008) strategy leads to a more effective insertions and splitting than that of the R-tree. We are now working on implementing this algorithm to validate our contribution. International Conference on Advanced Aspects of Software Engineering International ICAASE, Conference November, 2-4, on Advanced 2014, Aspects Constantine, of Software Engineering Algeria. 204 ICAASE, November, 2-4, 2014, Constantine, Algeria.