ICAASE'2014                  A New Distributed Anonymization Protocol to Satisfy Multiple Data Provider’s Privacy Requirements


A new distributed anonymization protocol to
   satisfy multiple data provider’s privacy
                requirements


               Salheddine kabou                                                     Sidi Mohamed Benslimane
               EEDIS Laboratory                                                 Computer Sciences Department
 Djillali Liabes University of Sidi Bel Abbes                             Djillali Liabes University of Sidi Bel Abbes
       salheddine.kabou@univ-sba.dz                                                  benslimane@univ-sba.dz


Abstract –Privacy and security concerns are among the main obstacles facing the widespread adoption of this
new technology. Data anonymization makes data worthless to anyone except the owner of the data. It is one
of the methods for transforming the data in such a way that it prevents identification of key information from
an unauthorized person. Most of the existing works use a k-anonymat model for preserving privacy for data
subject that offers lower utility. Motivated by this, we develop a new distributed anonymization protocol to
satisfy multiple data providers privacy requirements based on a k-concealment model that offers a higher
utility.

Keywords – Privacy, Concealment, Anonymization, Cloud Computing


1. INTRODUCTION                                                         2. As the output of the protocol, each
                                                                           private dataset produces a local
The confidentiality of this data must be preserved                         anonymized dataset that satisfies each
before outsourcing to the commercial public                                data provider’s privacy constraints and
cloud, i.e. any sensitive information should not be                        their union forms a global virtual
disclosed. Data anonymization is one of the                                database    that    meets    a    global
privacy preserving techniques that translate the                           anonymization principle (k-conealment).
information, making the original data worthless to
anybody except the owners [1]. It has been widely                  The rest of the paper is organized as follows. In
discussed in the literature such as k-anonymity                    Section II, we give we discuss related works to
[2], l-diversity [3], k-concealment [4]. The                       our research. In Section III, presents our
objective of our work is to adopt a new distributed                distributed anonymization protocol, and Section
anonymization protocol over cloud servers, which                   IV explain a R*-tree generalization principal to
can satisfy different data provider’s personal
                                                                   achieve the protocol. Finally, we conclude our
needs and maximize the utility of the anonymous
data still remains a challenging problem using an                  discussion in Section V.
alternative model. In this paper, we offer the                     2. RELATED WORK
following contributions:
                                                                   2.1 Privacy preserving data publishing for
    1. We present an algorithm which inserts                       single databases
       data subjects into an R*-tree for
       anonymization protocol using a k-                            In recent years, Privacy preserving data
       concealment model.                                          publishing for centralized databases has been


International Conference on Advanced Aspects of Software Engineering
ICAASE, November, 2-4, 2014, Constantine, Algeria.                                                                        201
ICAASE'2014                   A New Distributed Anonymization Protocol to Satisfy Multiple Data Provider’s Privacy Requirements


studied extensively [5](.K-anonymity,l-diversity)                    site and assumes that participants are fully
K-concealment proposes to generalize the table                       available. Ding et al [10] presented a distributed
records so that each one of them becomes                             anonymization algorithm using R-tree index
computationally - indistinguishable from at least                    structure [11]. Their algorithm uses a greedy
k−1 others. In this study, our distributed                           approach to recursively insert the data objects
anonymization protocol is built on top if the k-                     into the quasi-identifier domain space.
concealment principles.
                                                                     The above works have used a k-anonymity model
2.2 Privacy preserving data publishing for                           for the privacy of data subjects. They choose it
multiple databases                                                   because it’s useful in several practical
                                                                     applications; also it suffers in data utility. Here,
There are a number of possible solutions that can                    our work is aimed at outsourcing data provider’s
be applied for anonymizing distributed data                          private dataset to cloud servers for data sharing.
   The integrate-and-anonymize [8] solution is                       More importantly, our anonymization protocol
the simplest approach that assumes an existence                      aims to achieve anonymity for both (1) data
of third party that can be trusted by each of the                    subjects: by using a k-concealment model which
data owners. Here data owners send their data                        ensures the same level of security and offers
to the trusted third party where data integration                    higher utility than others models that have used
and anonymization are performed. This approach                       in the above works, and (2) data providers: by
is not feasible for many scenarios [7]. Finding                      designing a new distributed anonymization
such a trusted third party is not always feasible.                   algorithm that uses a R*-tree structure [12]. Its
Compromise of the server by hackers could lead                       offers a best choosing of appropriate insertion of
to a complete privacy loss for all the participating                 data objects into the quasi-identifier domain
parties and data subjects.                                           space, and it find an adequate partitioning of the
                                                                     quasi-identifier domain space.
   The basic idea of an alternative approach
anonymize-and-integrate is for each data                             3. DISTRIBUTED ANONYMIZATION
provider to perform data anonymization                                  PROTOCOL
independently. In case of horizontal data                                  3.1 Algorithm
distribution [8], this tends to have a negative
impact on data quality. For example, enforcing                       There are a large number of algorithms proposed
local k-anonymity can cause the data utility to                      to achieve k-anonymity. These algorithms have
suffer more than enforcing global k-anonymity. If                    been successfully used to enhance anonymity in
data is distributed vertically [9], the integration of               distributed environments. Although to our best
                                                                     knowledge, the model k-concealment has not yet
the locally anonymous datasets can result in
                                                                     been used for the cloud environment.
global non-anonymous datasets.
                                                                         Algorithm.1 Master
   In the Virtual anonymization [6] [10] data                            Phase 1: Reading data (slave)
providers participate in distributed protocols to                        Read data (B, num) from each slave node into a set r
produce a virtual integrated and anonymized                              Phase 2: Insertion on R*-tree (Generalisation)
database. Important to note is that the                                  For each rectangle B ϵ r, do
                                                                            Finds on every level of the tree, the most suitable subtree
anonymized data still resides at individual                                 to accommodate the new entry (see process of R*-tree
databases and the integration and anonymization                             generalisation)
                                                                            If split is possible, do split (see process of split)
of the data is performed through the secure
                                                                         End for
distributed protocols.
                                                                         Phase 3: Modify the local ki-concealment databases
   In [6] the authors propose an adaptation of the                       For each equivalence class Ei of the K-concealment table, do
                                                                         Get rectangles set P contained in R* of Ei
Mondrian generalization-based algorithm for                                   For each rectangle B ϵ P, do send B and R* to slave
producing a k-anonymous release of the union of                                   node
the datasets by using a set of secure multi-party                             End for
                                                                         End for
computation protocols (i.e., secure computing
sum, secure median protocols). The execution                              Figure.2 Distributed anonymization algorithm-
sequence is under the control of a single leading                                          Master node


International Conference on Advanced Aspects of Software Engineering
International
ICAASE,       Conference
          November,  2-4, on Advanced
                          2014,       Aspects
                                Constantine,   of Software Engineering
                                             Algeria.                                                                             202
ICAASE, November, 2-4, 2014, Constantine, Algeria.
ICAASE'2014                    A New Distributed Anonymization Protocol to Satisfy Multiple Data Provider’s Privacy Requirements


Algorithm.2 Slave node i (i>0)
                                                                     The R*-tree uses the following steps to find a
Phase 1: Sending data to the master                                  better split. In a first step, the split axis has to be
 For each equivalence class Ej of ki concealment table, do           chosen. Along each axis, the entries are first
   Send information of Ej to the Master in form (B, num)
end For                                                              sorted by the lower value, and then sorted by the
                                                                     upper value of their rectangles (rectangles of
Phase 2: Receive modification from the master
Read the data B and R* from the Master                               equivalence classes). For each sort M-2m+2
For each equivalence class E j of ki-concealment table, do           distributions of the M+1 (node a) entries into two
   if rectangle of Ej equals B                                       groups are determined (M is the maximum
       Enlarge the rectangle of Ej to R*
  end if                                                             number of entries that will fit in one node and m
end For                                                              is a minimum number of entries), where the k-th
                                                                     distribution (k=1, …., M-2m+2) is determined as
  Figure.2 Distributed anonymization algorithm-                      follows: The first group contains the first (m–1) +
                    Slave node                                       k entries, the second group contains the
                                                                     remaining entries.
We assume that a master node is selected from
cloud server for the main protocol, and all the                      For each of the M–2m+2 distributions, the
other local databases are consider as slave                          margin-value is determined by summarizing the
nodes. The protocols for the master node and
                                                                     margin length of the two minimum bounding
other slave nodes are presented in Algorithm1
                                                                     boxes (rectangles of equivalence classes) of both
and 2. In the first, the slave node sends a data
that is abstract information of each equivalence                     distributions. Finally, the axis that returns the
class to the Master node in form (B, num). Where                     minimum margin value is selected a split axis. In
B refers a d-dimensional rectangle which is the                      the next step, an adequate partitioning of the
bounding box of the equivalence class’s spacial                      entries along the split axis determined. The
QI values [l1,u1], ...,[ld,ud], and num is the total                 overlap-value for each of the 2(M–2m+2)
number of data subjects in the equivalence class.                    distributions is considered where the overlap-
                                                                     value denotes the size of the area of the overlap
In Phase 1, the master node reads data (B, num)                      between the two minimum bounding boxes of
of each equivalence class that had sent from                         both partitions. Here, the secure sum protocols
each slave node into a set r. In phase 2, the
                                                                     [13] can be used to securely compute the
algorithm inserts each rectangle B from r into an
                                                                     minimum area-value denoting the sum of the size
R*-tree: for the first time, it finds on every level of
the tree, the most suitable subtree to                               of the areas covered by both minimum bounding
accommodate the new entry (minimum                                   boxes of both partitions [12].
overlapping), and if split is possible, do split.
                                                                     4. R*-TREE GENERALIZATION
In phase 3, the master node modifies all the initial
local ki-concealment databases by traversing                          Our algorithm uses an R*-tree. It generalizes the
each equivalence class Ei of the K-concealment                       data by inserting it into the minimum overlapping
table, to find the rectangles set P contained in it.                 quasi-identifier domain space and when overflow
                                                                     occurs, the quasi-identifier domain space will be
For each rectangle B ϵ P that contained in the                       split into two parts along the best axis chosen. It
bounding box R* of Ei, we send data B and R*                         recursively chooses the best branch to inset the
back to a corresponding slave node. When                             data objects for gathering the closest data
receive the data B and R* from the master, the                       objects. When all the data tuples are inserted into
slave node finds out the equivalence class whose                     the R*-tree, the generalization table was built.
rectangle equals B and then enlarges its
rectangle to R*.                                                     A non-leaf node contains entries of the form (cp,
                                                                     B) where cp is the address of the child node in the
3.2. Split process                                                   R*-tree and B is the minimum bounding rectangle
                                                                     of all rectangles which are entries in that child
The objective is to split the data as much as
                                                                     node. A leaf node contains entries of the form
possible while satisfying the privacy constraints to
                                                                     (Oid, B) where Oid refers to a record in the
maximize the utility of anonymized data.                             database, and B = (B1, B2,…, Bd) is a d-
                                                                     dimensional rectangle which is the bounding box


International Conference on Advanced Aspects of Software Engineering
International
ICAASE,       Conference
          November,  2-4, on Advanced
                          2014,       Aspects
                                Constantine,   of Software Engineering
                                             Algeria.                                                                       203
ICAASE, November, 2-4, 2014, Constantine, Algeria.
ICAASE'2014                   A New Distributed Anonymization Protocol to Satisfy Multiple Data Provider’s Privacy Requirements


of one equivalence class’s spacial QI values. at                     6. REFERENCE
this point, d is the number of dimensions and Bi is
                                                                     [1] Sedayao, “Enhancing cloud security using
a bounded interval [x,y] describing the QI value                          Data Anonymization”, Intel white paper, June
along dimension.                                                          2012.
                                                                     [2] P. Samarati and L. Sweeney. “Generalizing
The R*-tree construction is based on the insertion                        data to provide anonymity when disclosing
algorithm [12], which focuses on: (1)                                     information”. In ACM-SIGMOD Symposium
                                                                          on Principles of Database Systems (PODS),
Choossubtree: Beginning in the root, descending                           page 188, 1998.
to a leaf, it finds on every level the most suitable                 [3] A. Machanavajjhala, J. Gehrke, D. Kifer, and
subtree to accommodate the new object. (2) The                            M. Venkitasubramaniam.” l-diversity: Privacy
                                                                          beyond k-anonymity”. In International
split: If Choosesubtree ends in a node filled with                        Conference on Data Engineering (ICDE),
the maximum number of objects M, split should                             page 24, 2006.
distribute M+ 1 rectangles into two nodes in the                     [4] T.Tassa,      A.Mazza,      A.Gionis,     “    k-
                                                                          Concealment: An Alternative Model of k-Type
most appropriate manner.                                                  Anonymity”, Transactions on Data Privacy 5,
                                                                          pp189–222, 2012
 We create the R*-tree by inserting every object                     [5] B. Fung, K. Wang, R. Chen, “ Privacy-
into the indexing structure. Given a new object,                          preserving data publishing”, A survey of
the authors in [13] consider only the area. Our                           recent developments. ACM Computing
algorithm tests the parameters area and overlap                           Surveys, CSUR (2010)
in different combinations. The overlap of an entry                   [6] P. Jurczyk, , L. Xiong, “Distributed
                                                                          Anonymization: Achieving Privacy for Both
is defined as follows. Let E1,...,Ep be the entries                       Data Subjects and Data Providers”. In:
in the current node. Then,                                                Gudes, E., Vaidya, J. (eds.) Data and
                                                                          Applications Security 2009. LNCS, vol. 5645,
                                                                          pp. 191–207. Springer, Heidelberg (2009)
                                                                     [7] F. Kohlmayer, F. Prasser, C. Eckert, A. Kuhn,
                                                                          “A flexible approach to distributed data
                                                                          anonymization”, In Journal of Biomedical
For choosing the best non-leaf node, we can use                           Informatics, August 2013.
the method used in R-tree [11]. For the leaf                         [8] N. Mohammed, B. Fung , “Centralized and
                                                                          distributed     anonymization      for     high-
nodes, minimizing the overlap performed slightly                          dimensional healthcare data”. ACM Trans
better.                                                                   Knowl Discovery Data 2010;4(4):1–33.
                                                                     [9] W.Jiang, , C.Clifton, “A secure distributed
5. CONCLUSION                                                             framework for achieving k-anonymity”. VLDB
                                                                          Journal 15(4), 316–333 (2006)
We have presented a distributed anonymization                        [10] X.Ding, Q.Yu, J.LI, J.Liu, H.Jin, “Distributed
                                                                          Anonymization for Multiple Data Providers in
protocol for privacy-preserving data publishing                           a Cloud System”, In: W. Meng et al. (Eds.):
from multiple data providers in a cloud                                   DASFAA 2013, Part I, LNCS 7825, pp. 346–
                                                                          360, 2013.
environment. Our work addresses two important
                                                                     [11] A .Guttman,. “R-trees: a dynamic index
issues, privacy of data subjects and privacy of                           structure for spatial searching”. In: SIGMOD
data providers. For the privacy of data subjects                          (1984)
(individuals), we have used a k-concealment                          [12] N. Beckmann, H.Kriegel, R.Schneider, B.
model that offers a higher utility with less                              Seeger, “The R*-tree: An efficient and robust
                                                                          access method for points and rectan-gles”.
generalisation than that which is required by k-                          In: Proceedings of the International
anonymity used in [6] and [10]. For the privacy of                        Conference on Manage-ment of Data
                                                                          (SIGMOD’90), Atlantic City, NJ (1990) pp.
data providers, we have adopted a bottom-up                               322–331
algorithm instead of top-down approach used in                       [13] B¨ ottcher, S., Obermeier, S, “Secure set
[6] while using R*-tree index for better                                  union and bag union computation for guar-
generalization. We also illustrated that the R*-tree                      anteeing       anonymity      of     distrustful
                                                                          participants”, JSW 3(1), 9–17 (2008)
strategy leads to a more effective insertions and
splitting than that of the R-tree. We are now
working on implementing this algorithm to
validate our contribution.


International Conference on Advanced Aspects of Software Engineering
International
ICAASE,       Conference
          November,  2-4, on Advanced
                          2014,       Aspects
                                Constantine,   of Software Engineering
                                             Algeria.                                                                      204
ICAASE, November, 2-4, 2014, Constantine, Algeria.