=Paper=
{{Paper
|id=Vol-2482/paper3
|storemode=property
|title=Random Projection to Preserve Patient Privacy
|pdfUrl=https://ceur-ws.org/Vol-2482/paper3.pdf
|volume=Vol-2482
|authors=Aris Anagnostopoulos,Fabio Angeletti,Federico Arcangeli,Chris Schwiegelshohn,Andrea Vitaletti
|dblpUrl=https://dblp.org/rec/conf/cikm/Anagnostopoulos18
}}
==Random Projection to Preserve Patient Privacy==
<pdf width="1500px">https://ceur-ws.org/Vol-2482/paper3.pdf</pdf>
<pre>
           Random Projection to Preserve Patient Privacy

                         Aris Anagnostopoulos                  Fabio Angeletti
                     Sapienza University of Rome         Sapienza University of Rome
                         aris@diag.uniroma1.it            angeletti@diag.uniroma1.it
                        Federico Arcangeli                   Chris Schwiegelshohn
                   Sapienza University of Rome           Sapienza University of Rome
                  federico.arcangeli1@gmail.com        schwiegelshohn@diag.uniroma1.it
                                             Andrea Vitaletti
                                       Sapienza University of Rome
                                        vitaletti@diag.uniroma1.it


                                                                     against known input/output attacks while of-
                                                                     fering high quality data, as long as the pro-
                          Abstract                                   jected space is smaller than the original space,
                                                                     and as long as the amount of leaked data avail-
     With the availability of accessible and widely                  able to the adversary is limited.
     used cloud services, it is natural that large
     components of healthcare systems migrate to
                                                                 1   Introduction
     them; for example, patient databases can be
     stored and processed in the cloud. Such cloud               During the recent years, we witnessed the tremendous
     services provide enhanced flexibility and addi-             progress made in the field of wireless sensor networks.
     tional gains, such as availability, ease of data            This paved the way and facilitated the wide adop-
     share, and so on. This trend poses serious                  tion of small electronic devices with interconnection
     threats regarding the privacy of the patients               capabilities. These devices composed the majority of
     and the trust that an individual must put into              the so-called Internet of Things (IoT). The IoT is a
     the healthcare system itself. Thus, there is a              highly dynamic and radically distributed networked
     strong need of privacy preservation, achieved               system, composed of an incredible high number of ob-
     through a variety of different approaches.                  jects [MSDPC12]. It is vastly considered as one of the
                                                                 most expanding area within future technologies and
     In this paper, we study the application of
                                                                 it is attracting vast attention in different industry ap-
     a random projection-based approach to pa-
                                                                 plications [LL15], ranging from smart cities to home
     tient data as a means to achieve two goals:
                                                                 automation, farming, and many more fields of appli-
     (1) provably mask the identity of users un-
                                                                 cation. Ubiquitous sensors, smart objects and devices
     der some adversarial-attack settings, (2) pre-
                                                                 involved in IoT can generate a tremendous amount of
     serve enough information to allow for aggre-
                                                                 data [JDXC+ 14]. This flow of data requires robust,
     gate data analysis and application of machine-
                                                                 available and fast storage solutions and builds the
     learning techniques. As far as we know, such
                                                                 bases to very effective and powerful algorithms in the
     approaches have not been applied and tested
                                                                 fields of machine learning and data mining [CDW+ 15].
     on medical data. We analyze the trade-
     off between the loss of accuracy on the out-                    Electronics health-care solutions and, generally
     come of machine-learning algorithms and the                 speaking, the Internet of Health Things (IoHT) fol-
     resilience against an adversary. We show                    lows the same trend. About 73% of healthcare exec-
     that random projections proved to be strong                 utives say that IoHT is a disrupting techology for the
                                                                 next years and it is becoming one of the most funded
Copyright © CIKM 2018 for the individual papers by the papers'   areas in IoT. Pervasive IoHT enables cost savings for
authors. Copyright © CIKM 2018 for the volume as a collection    both the administrations and the individuals, but on
by its editors. This volume and its papers are published under
                                                                 the other hand it has some barriers, like: privacy and
the Creative Commons License Attribution 4.0 International (CC
BY 4.0).
security concerns, lack of skilled workers, poor inter-      dom, one cannot use the projected data to obtain the
operability and more [acc17].                                real data: each datapoint appears to be random. This,
    Given the sensitive nature of healthcare data, there     unless the attacker has some significant power. This
is a strong need to protect the information of the           trade off is studied in previous works (e.g. [BBM16],
patients. Furthermore, the recent adoption of Gen-           [Liu07a]).
eral Data Protection Regulation (GDPR) strength-                 It is not clear a priori that this approach could work
ens data protection and now it must be applied to            in the application on medical dataset that interests us.
any organisation or individual that collects and pro-        For instance, the lemma by Johnson and Lindenstrauss
cesses information related to EU citizens, regardless        is typically applied on settings where the original data
where the data is physically stored or where they are        lie on a very high-dimensional space. However, in prac-
based[Alb16, Tan16]. At the same time, analysis of           tice, the original dimension may be low (for instance
such data are crucial for medical research and the drug      in our dataset it is about 50). In this paper we look
industry. Consequently, there is a need to design ap-        at this and other issues by applying the random pro-
proaches that allow data processing without exposing         jection to a dataset containing information about 70K
the personal underlying information.                         cases of diabetes [SDG+ 14]. We show that it is possi-
    For this reason, there has been a series of tech-        ble to reduce the dimensionality of the data and still
niques for perturbing data such that information on          obtain accuracy scores that are comparable with the
individual data points cannot be leaked, while aggre-        ones obtained from the original non-projected data.
gate information is preserved. Examples of such ap-          At the same time, we also show how sensible and pri-
proaches are k-anonymity [SS98] and differential pri-        vate patient information such their age or gender are
vacy [Dwo06]. The various approaches put different           safe against attacks that try to reconstruct the map-
importance on the privacy requirements; for instance,        ping between the original data and the projected data
differential privacy attempts to alter the data such as      after applying random projections.
to provide very strong privacy guarantees, typically,
without specifying the usefulness of the resulting data.        Structure of the paper. The rest of this paper
for general-purpose data analysis.                           is organized as follows. In Section 2 we present cur-
    In this paper we apply a method, which can be            rent solutions and the state of the art on random pro-
found in [Liu07a], where the explicit goal is to ob-         jections and other privacy-preserving techniques. In
tain a dataset that remains useful after the perturba-       Section 3 we present the goals of our work and the ap-
tion (still providing some privacy guarantees). More         proach we used for achieving them, leveraging random
specifically, our approach is based on random projec-        projections. In Section 4 we show our experimental
tions (RP), a technique that is typically applied pri-       results, where we explore the limits of our approach
marily for efficiency reasons. It is based on a fun-         both in terms of accuracy and privacy protection. We
damental result from the work of Johnson and Lin-            conclude in Section 5 where we also propose future
denstrauss [JL84] and the idea is the following: As-         work.
sume that we have large amounts of data (n data-
points), lying on a high-dimensional space. Then, if
                                                             2    State of the art
we project each point to a random subspace of dimen-
sion O(log n/2 ), with high probability all pairwise dis-   Data leakages are very common [Wik14]. In this work,
tances between the data points are preserved within a        we are more interested in reducing the ability of an
factor of 1 ± . This technique has found multiple ap-       attacker to reconstruct non-yet-leaked data from the
plications in streaming algorihtms, in finding nearest       leaked one. Within medical premises, there are multi-
neighbors in high-dimensional spaces, in reducing the        ple individuals who could obtain access to protected
dimension of databases, and so on.                           information from the rightous doctor to untrustful
    Our idea of applying a random projection approach        workers. This could lead to multiple entities knowing
on privacy-preserving data mining is the following. As-      protected personal health information.
sume that we have the records of multiple patients.             Before 2003, with the enforcing of HIPAA rules,
Then we can consider a random projection of these            some private medical information were regularly
data. The result of Johnson and Lindenstrauss guar-          shared among professional [Bro00]. Following the
antees that if we execute algorithms that depend on          guidelines from Health Insurance Portability and Ac-
the pairwise distances on the data (e.g., several clus-      countability Act (HIPAA) [fDCP+ 03], the US govern-
tering or classification algorithms), then the results ob-   ment made the first concrete attempt to mitigate the
tained are with high probability similar to those ob-        chance of re-identification of patients. In 2009, it was
tained on the real data set (and the error can be quan-      clear that HIPAA is not sufficient to protect the pri-
tified). Furthermore, because the projections are ran-       vacy of an individual. In fact, the HIPAA was not
able to protect the user personal information after the      areas [KEDR12], to enhance tomography [FMR10], to
anonimization process that substituted HIPAA param-          cluster DNA microarrays[AV09] or to classify cancer
eters with IDs. In a famous case [New09], some re-           [XLZW16].
searchers were able to re-identify users and also their         In [LKR06], RPs were used to mask clear data pro-
sexual orientation and other information. Moreover,          jecting them in smaller spaces, while in [BBM16] and
the availability of correlated data (coming from the         [KKMM12b], similarly to our work, the authors dis-
same source or other sources) could greatly help to          cuss how to exploit RPs to enhance data privacy. The
identify a patient. Data breaches continue to increase       authors in [LKR06] also discuss the utility of the RP
year after year, between 2005 and 2014, only in the          in reducing complexity of problems while maintaining
US, more than 26 million of people had some form of          the usefulness of the projected data for algorithms. It
personal health information leaked [Wik14].                  is anticipated that by 2020 there will be more than
    Therefore, more elaborate tecniques, which add           26 billion devices involved in IoT related applications
noise to the data, have been developed in the last years     [RvdM14]. Surely, not all of them will be part of
to protect users’ privacy and still maintain a good level    the healthcare field, however we expect a very large
of accuracy when exploring and analyzing the data.           amount of information to process. The usefulness of
One of these is differential privacy [Dwo06]. This ap-       RP in reducing problem complexity (or resource re-
proach focuses on providing statistically coherent re-       quirements) is well understood and exploited as use-
sponses querying a database, i.e. third parties are in-      ful resource in the literature [CEM+ 15, LKR06, FB03,
terested to query for information about a sample of          BZD10, AWY+ 99]. For example, in [FB03], the au-
a population, not a single individual. Instead, we are       thors explore some ways to reduce high dimensional
interested also in providing data about a specific indi-     data for clustering while, in [LF12], is presented a work
vidual, for example investigating if he or she is suitable   on classification of small patches of images from a very
for a clinical trial.                                        large database that takes advantage of the properties
    In [KKMM12a] the authors proved that the                 offered by RPs.
Johnson-Lindenstrauss transform can be used as an al-           During the last two decades, the contribution of ma-
ternative approach to achieve differential privacy. The      chine learning and data mining algorithms in health-
method is then compared against other techniques,            care applications became more frequent year after
such as adding Gaussian noise to data or random-             year. This is well demonstrated in the literature,
ized response. The proposed approach has superior            for example in [CHH+ 17, MP99, ZWC+ 09, HHC+ 14].
accuracy bounds than the others, while still keeping         One last aspect to consider is the chance to link to-
secure the privacy of the records. The authors also          gether multiple datasets. For example in [LJJ+ 09], the
criticize the work of Liu et al. about releasing data        authors presented the infrastructure of a databank in
to third parties after applying random projections in        order to enable record-linkage research studies. This
order to protect sensitive information while still pre-      linkage on one hand could deeply help the development
serving accuracy of different data mining algorithms:        of newer treatments or drugs, but on the other hand
an adversary that has some background knowledge can          poses threats to the privacy of the individuals.
infer approximations of the original data. We address
this issue in the scenario of known input-output data        3    Problem Formulation
(section 3.2) and show how in real world scenario re-
garding medical data, under reasonable assumptions           We consider a reference scenario in which a group of
for the power of the adversary, it is difficult for an at-   users, characterized by private features, are potentially
tacker to discover private information from projected        suitable for a clinical trial. Only a limited number of
records.                                                     users in the group will be actually enrolled in the trial.
    In the literature there exist a very large number of     For the enrolled users, namely the patients, the private
works regarding the re-identification of person start-       features will be eventually made public to participate
ing from various data, within some degree it is called       to the clinical trial in the most effective way. Some
“breaking the k-anonimity”. For example in [NS08]            knowledge on the group is of primary importance for
the authors presents a method to re-identify a user          the researchers to understand the size and the char-
from its preferences.                                        acteristics of potential patients. In general, users are
    In this paper, we aim at investigating to what ex-       well disposed to support this need of the researchers
tent RPs can provide useful data for machine learning        provided that their privacy is preserved. The main
algorithms (e.g. classification) on a group of potential     problem we want to address in this paper is:
patients while preserving at the same time the privacy
of individuals. RPs have been employed in a number of          Can we learn something on the group of users as a
healthcare applications, for example to segment tumor        whole, while preserving the privacy of the individuals
who will not participate in the trial?                         We now elaborate on these two dimensions.

    More formally, we consider a group of n users, where    3.1   Accuracy
each user u is characterized by m features. We repre-
sent the corresponding dataset as a matrix X ∈ Rm×n ,       Lemma 3.1 provides a technique to generate a low-
with m rows (the features) and n columns (the users).       dimensional representation of the original data main-
As already observed, in the era of big data, m and n        taining the pairwise distance within an error . Since
can be particularly big.                                    the pairwise distance is the key ingredient for many
    Giannella et al. [GLH13] show how it is possible to     classification tasks performed by machine learning al-
break the privacy in some contexts of distance preserv-     gorithms, this property allow us to have some guar-
ing mappings. Liu [Liu07b] instead, highlights how          antees that the solution found in the low-dimensional
mappings that do preserve distances within certain          space is a good approximation of the solution in the
bounds like random projections can boost the privacy        original and higher dimensional space. Furthermore,
guarantees. We will apply these techniques in order         reducing the size of the input data speeds-up the ex-
to prove that users’ privacy can be kept safe against       ecution time of the algorithms and limits the amount
malicious attackers.                                        of resources needed.
    We are interested in understanding to what ex-
tent the random projection technique, which has been        Lemma 3.1 (Johnson and Lindenstrauss) Given  >
originally conceived to reduce the dimensionality of a      0 and an integer n let k be a positive integer such that
dataset, can also be used to preserve the privacy of the    k ≥ k0 = O( log(n)
                                                                           2 ). For every set P of n points in
users. In particular, we apply a random projection to       Rm there exists a mapping f : Rm → Rk such that for
X, such that if R ∈ Rk×m is the random-projection           all u, v ∈ P
matrix Y = RX is the transformed matrix after ap-
plying the random projection, with Y ∈ Rk×n . We            (1 − ) k u − v k2 ≤k f (u) − f (v) k2 ≤ (1 + ) k u − v k2
denote by xui the column in X associated to user ui ,
                                                               It can be proved that a random projection, is a map-
and with yiu the corresponding column in Y . In the
                                                            ping f that fulfills the previous lemma with positive
scenario we are describing the projected matrix Y is
                                                            probability. This is often referred as JL-embeddings.
known to the public, it is indeed the dataset on which
researchers try to distill information on the group; the
                                                            3.2   Privacy: Known Input–Output Attack
transformation matrix R and the original data X are
private. Some columns of X may become public once           We now try to answer one of the questions we raised
the corresponding users will eventually decide to par-      in the previous section: Can a a malicious third party
ticipate to a clinical trial, in other words some pairs     who knows some pairs (xui , yiu ) (i.e. that a particular
(xui , yiu ) will become public.                            record xui is associated to yiu after its projection) learn
We can now better describe the problem, splitting it        information about other records?
into two sub-problems:                                         Liu in his Ph.D. thesis [Liu07b] describes a Bayes
                                                            privacy model to measure the privacy offered by a
Accuracy. Can we learn something on the group ex-
                                                            perturbation technique. The model considers the at-
   ploiting Y ? Here we want to understand if the
                                                            tacker’s apriori and a posteriori beliefs about the data
   results of some machine-learning algorithms on Y
                                                            and uses Bayesian inference to evaluate the privacy.
   are a good approximation of the ones obtained on
                                                            For completeness, we repeat his framework here.
   X. If we answer positively to this question, we
                                                               Let x be the unknown private data, y the perturbed
   can at least conclude that what can be learned
                                                            data and θ the attacker’s additional knowledge about
   from the original data can be also learned from
                                                            the data. Then the MAP estimate of x given y and θ
   the projected data.
                                                            is
Privacy. Can we preserve the privacy of the individ-                  x̂M AP (y, θ) = arg max fx|y,θ (x|y, θ)
                                                                                         x
    uals that will not participate in the trial? As al-
    ready observed, Y is public whereas only some           with fx|y,θ the conditional probability density of x
    columns of X will eventually become public when         given y and θ.
    the corresponding users will decide to partici-            Let Xp denote the first p columns of X and Xn−p
    pate in a clinical trial. Consequently, some pairs      the remaining columns. We define similarly Yp and
    (xui , yiu ) will become public. Here we want to un-    Yn−p . We further assume that the columns of Xp are
    derstand if an attacker knowing Y and the some          linearly independent and that Xp is known to the at-
    pairs (xui , yiu ) can possibly know something about    tacker (i.e., the attacker has full knowledge of p pa-
    the other users that do not participate in the trial.   tients). Y is entirely known to the attacker, because
as we stated before, it is publicly available to conduct          We want to maximize this function in order to solve
experiments on the projected data.                                the problem of finding the best estimate of x given the
   For the next reasoning the following hypothesis                observation of Xp .
must be verified:                                                    Liu proposes an algorithm to estimate the nondis-
                                                                  closed records of a certain dataset. Experimental re-
  • The original data arose from as a sample from a               sults have shown that while decreasing the number
    matrix variate distribution.                                  of column records known to the attacker (denoted by
  • The projection matrix R is a k × m random ma-                 p) the relative error of the estimation increases. The
    trix with each entry indipendent and identically              error in the estimation increases also decreasing the
    distributed with 0 mean and unit variance. R has              dimensionality of the projected subspace (denoted by
    a matrix variate Gaussian distribution with mean              k). In particular the algorithm uses the Nelder–Mead
    matrix M = 0 and covariance matrix Σ = Ik ⊗In .1              simplex algorithm to find the optimal solution of the
                                                                  maximization problem.
  • Y has a matrix variate Gaussian distribution with
    mean matrix M = 0 and covariance matrix Σ =                   4    Experimental results
    Ik ⊗ k1 X T X
                                                        In this section, we present experimental results ob-
   The attacker will try to produce x̂i , with 1 ≤ i ≤  tained on a dataset containing information about
m−p, such that x̂i is a good estimate of the undisclosed70000 cases of diabetes diagnosticated in 130 US hos-
private record xi . In other words the attacker’s targetpitals during the decade 1999-2008 [SDG+ 14] 2 . From
is to try to give an estimation of one of the records   now on we will refer to this dataset as the diabetes
contained in Xn−p , given that he knows the records in  dataset.
Xp and their randomly projected counterpart in Yp .         We focus on the classification of patients based
   We now derive the MAP estimate of x given y = Rx     on their privatized data. Following the work in
and the known matrices Xp and Yp                        [DMS+ 17, HXY+ 16], we choose to use random forest
                                            1     1     classifier in our dataset to classify the users. More-
x̂M AP (y, θ) = arg max fx|y,θ (x = x| √ Rx = y, √ RXp =over,
                                                         Yp ) from the work in [Jol17], we know that random
                        x                    k     k    forest classifiers works really well with random pro-
which can be simplified in                              jections. In Figure 1 we report the effectiveness in
                                                        terms of accuracy running the random forest classi-
                                    1                   fier [Pal05] on the original data and on the projected
                  arg max fx,y,θ ( √ RX = Y )
                      x              k                  data in multiple lower dimensional spaces. To run and
                                                        validate the classification algorithm, we divided the
where X = [xXp ] and Y = [yYp ].                        whole dataset into two parts: train and test. In the
    We further suppose that the attacker has no other   dataset we decided to predict the range of glucose level
background knowledge about the private data, so we      in the blood. So that, the algorithm was firstly trained
can assume that θ = 0.                                  with the records within the train part of the diabetes
    The previous result can be written as               dataset, providing all the target values. Thus, we made
                      1                                 the random forest classifier algorithm predicts the tar-
   arg max fx,y ( √ RX = Y ) =                          get values in the test part giving its features as in-
       x                k
                                                        put. Moreover, we tested the effectiveness of RPs also
                          1
   arg max f √1 RZ|Z ( √ RZ = Y |Z = X)fZ (Z = X)       with k-nearest neighbors (k-NN) classifier, the results
       x          k        k                            were reported in Figure 2. Our approach was inspired
If we assume that fZ is distributed uniformly over an   by [AC06]. The results are quite different because in
interval, we finally get                                the first experiment we taken a feature of the dataset
                                                        (the range of glucose level in the blood) as the value to
                                        1               predict, instead with the second experiment we choose
  x̂M AP (y) = arg max f √1 RZ|Z ( √ RZ = Y |Z = X)
                       x      k          k              to run firstly a kMeans clustering algorithm (on the
    In [Liu07b, Theorem 5.3.8] is shown that the prob-  whole dataset) to obtain labeled groups and then, with
ability density function we obtained has the following  the k-nearest neighbors (k-NN) classifier we predicted
form                                                    the values.
                                                            The blue line represents the accuracy of the machine
      − 21 k(p+1)      1 T −1k             1 1 T −1 T
(2π)              det( X X)      2  etr{− Y ( X X) Y }     2 The   dataset is called “Diabetes 130-US hospi-
                       k                   2 k
                                                                  tals for years 1999–2008 Data Set” and is available at
  1 ⊗ indicates the Kroenecker product of two matrices [Liu07b]   https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+ye
learning algorithm on the original data. The orange         wisker again. Instead, for the accuracy of classifica-
line, instead, represents the accuracy of the same al-      tion on the projected data, we ran the algorithm more
gorithm on the projected (obfuscated) data. We tested       than 100 times. In each round the algorithm generated
the classification algorithm on projected spaces in dif-    a value for each projected space. The results were ob-
ferent sizes, starting from only 2 components up to 10      tained using the scikit-learn package on Python 3.6.
components.                                                    In [KLR06, LGK06] the authors explore the security
                                                            of such techniques: they show how it is possible to use
                                                            data dimensionality reduction techniques to lower the
                                                            complexity of data mining algorithms while preserving
                                                            their accuracy and how those techniques preserve the
                                                            privacy of users.
                                                               The authors start from the same privacy hypothesis
                                                            we have presented in 3.2 and study how an attacker in
                                                            possession of a collection of linearly independent pri-
                                                            vate data records and their corresponding transformed
                                                            part can gather some insight about other records.
                                                               We present the results we got running the algorithm
                                                            of [Liu07b] on this dataset. After choosing a number p
                                                            of record pairs (xp , yp ) we select a record x for which
Figure 1: Accuracy of the random forest classifier al-      we do not know the mapping; the algorithm we are
gorithm on the original data (blue line) and on the         using will try to give an estimation x̂ of the original
projected data (orange line), varying the projection        record x.
space (# of components). Mean values are reported              We used two techniques to evaluate how similar to
as lines and 95% confidence intervals are reported as       the original records the algorithm’s estimations were.
vertical lines.                                             We measured the distance between the estimation x̂
                                                            provided by the algorithm and the original record x.
                                                            We compute the relative error between the two vectors
                                                            with the following:
                                                                                           ||x − x̂||2
                                                                              E(x, x̂) =
                                                                                              ||x||2
                                                            The error E increases with the Euclidean distance be-
                                                            tween the two. Notice that with this notation it may
                                                            happen that the error is greater than one: this could
                                                            verify in the case that the distance between x and its
                                                            estimations x̂ is high and the norm of x is a small value.
                                                            This could happen if the algorithm’s estimation is very
                                                            far off from the original record.
Figure 2: Accuracy of the k-nearest neighbors (k-NN)           This measuring has the drawback to lack an upper
algorithm on the original data (blue line) and on the       bound for the dissimilarity. Neither the cosine similar-
projected data (orange line), varying the projection        ity helps, since in our case we are not interested only
space (# of components). Mean values are reported           in the direction of vectors but also in their magnitude.
as lines and 95% confidence intervals are reported as          A solution is provided in [JNY07], where a radial
vertical lines.                                             basis function kernel can be used for representing sim-
                                                                                                    1
   The lines plotted in Figure 1 presents the average       ilarities: we are going to use 1 − edist(x,x̂) as a similar-
values for each projection space, while the vertical        ity function between x and its estimation x̂, where
wiskers represent the confidence interval correspond-       dist(x, x̂) = ||x − x̂||2 . The bigger the Euclidean
ing to a specific projection space. For the baseline        distance between two vectors, the bigger the error
(classification on the clear data) we ran the classifica-   edist(x,x̂) will be. In this way we have a [0, 1) bound
tion algorithm 50 times, in each round starting from        for the similarity of the estimations. By applying the
a random state of the random forest classifier. Since       inverse we get a value in the range [0,1): if x and x̂
the original data is not projected into any space, we       are the same vector (perfect reconstruction performed
                                                                                          1
have only a baseline with the associated mean value         by the algorithm) then edist(x,x̂ )
                                                                                                = 1.
and confidence interval. Thus, we reported the con-            Our workplan is the following: for every subspace
fidence interval only at the lefties part of line using     of dimensionality k we apply the algorithm with differ-
ent knowledge about the number of pairs (xp , yp ) the
attacker knows. We go from p = k − 1 to p = 1. In the
next figures we display the results of our experiments,
with the two different measuring techniques we used
to quantify the similarity between the original records
and the estimated ones. We report the mean of the
errors for every pair (k, p) and the variance. On the
X axis are placed the tuples (k, p) for which we have
conducted the experiments, on the Y axis we placed
the reconstruction errors.
On low-dimensionality subspaces we get a high rela-
tive error, meaning that it is not possible to give an
effective approximation of the original (private) data
records. In higher dimensions the approximation is
closer to the original data. We ran our experiments
with 10 features of the dataset, since with vectors of
higher dimensionality it becomes more difficult to run      Figure 3: Mean and variance of the relative error while
the reconstruction algorithm in reasonable times; also      using the formula ||x−x̂||
                                                                                ||x||2
                                                                                       2


with higher dimensionalities the algorithm we are us-
ing outputs vector reconstructions that are very dis-
similar from the original ones.
   We applied the random projection to reduce the fea-
ture space in different dimensions, from 10 to 3. No-
tice, however, that even when the projected space has
the same dimension of the original space, we already
get a significant relative error, meaning that on the av-
erage it is not possible for the attacker to extrapolate
any useful information about the patients’ records. So
for records of higher dimensionality there is already a
safe privacy bound when applying random projection
to them, at least against this kind of attacks.
   We assigned an increasing numerical value to nom-
inal features, that is, we assigned 0 to the text male
and 1 to text female in the gender feature.
   We applied random projection to this records, from
k = 10 (no dimensionality reduction) to k = 3; the          Figure 4: Mean and variance of the similarity between
number p of pairs (original record, projected record)       original records and their reconstruction while using
                                                                                            1
known to the attacker is in the range k − 1 ≤ p ≤ 1.        the similarity function 1 − e||x−x̂|| 2

   With k = 2 we obviously have only p = 1: we              rors when applying the techniques we have explained.
omit this result since it is not meaningful with respect    This is another confirmation of the thesis that random
the other results we get for higher k and p, because it     projections help keep the privacy of users when their
does not show how knowing less (or more) information        information is shared among research institutes.
about the original data changes the reliability of the
reconsturction we get.
                                                            5   Conclusions
   In the next figures we show the mean and variances
of the errors for every tuple (k, p) for which we have      In this work, we applied an random-projections ap-
conducted the experiment. It can be seen from the           proach to privacy-preserving data mining of medical
charts that as the number of known input-output pairs       data.
p decreases, the reconstruction error increases. To-           First we demonstrated the usefulness of RP in in-
gether with the dimensionality reduction, disclosing a      creasing privacy of personal health data. The pro-
scarce number of known input-output pairs can help          jected data are useful for machine learning algorithms
with the task of preserving the privacy of users in-        (for example, in clustering) while allows the sharing of
volved in clinical trials.                                  information between parties without revealing the pa-
   In this case we are projecting low dimensionality        tients’ clear data. In this particular application, this
vectors (k = 10) but we still get high reconstruction er-   is of notable importance since allows entities involved
in different health branches to cooperate effectively      [BZD10]      Christos Boutsidis, Anastasios Zouzias,
without sharing clear data. Second, we investigated                     and Petros Drineas. Random projections
to what extent an attacker can discover additional in-                  for k-means clustering. In Advances in
formation starting from leaked data. As long as the                     Neural Information Processing Systems,
projected space is smaller than the original space, and                 pages 298–306, 2010.
as long as the amount of data leaked is small, than the
proposed approach is robust and mantains very good         [CDW+ 15]    Feng Chen, Pan Deng, Jiafu Wan,
performance in both accuracy and privacy.                               Daqiang Zhang, Athanasios V Vasilakos,
                                                                        and Xiaohui Rong. Data mining for the
   We analyzed the ratio behind and the performances
                                                                        internet of things: literature review and
(in terms of accuracy) of the RP applied on sensible
                                                                        challenges. International Journal of Dis-
healthcare data. The results shows that the use of RP
                                                                        tributed Sensor Networks, 11(8):431047,
offers great enhancements in privacy protection. This
                                                                        2015.
was a first step into developing a full-fledged platform
that allows the effective share of medical data. In fu-    [CEM+ 15]    Michael B Cohen, Sam Elder, Cameron
ture we are planning a bigger real-world deployment                     Musco,     Christopher Musco,        and
of such platform to further validate our results, plus                  Madalina Persu. Dimensionality reduc-
an audit to check privacy protection against real third                 tion for k-means clustering and low rank
parties.                                                                approximation. In Proceedings of the
                                                                        forty-seventh annual ACM symposium
References                                                              on Theory of computing, pages 163–172.
                                                                        ACM, 2015.
[AC06]        Nir Ailon and Bernard Chazelle. Ap-
              proximate nearest neighbors and the fast     [CHH+ 17]    Min Chen, Yixue Hao, Kai Hwang,
              Johnson-Lindenstrauss transform. In                       Lu Wang, and Lin Wang. Disease predic-
              Proceedings of the 38th Annual ACM                        tion by machine learning over big data
              Symposium on Theory of Computing,                         from healthcare communities. IEEE Ac-
              Seattle, WA, USA, May 21-23, 2006,                        cess, 5:8869–8879, 2017.
              pages 557–563, 2006.
                                                           [DMS+ 17]    Sarah DuBrava, Jack Mardekian, Alesia
[acc17]       accentureconsulting. Internet of health                   Sadosky, E Jay Bienen, Bruce Parsons,
              things survey, 2017.                                      Markay Hopps, and John Markman. Us-
                                                                        ing random forest models to identify
[Alb16]       Jan Philipp Albrecht. How the gdpr will                   correlates of a diabetic peripheral neu-
              change the world. Eur. Data Prot. L.                      ropathy diagnosis from electronic health
              Rev., 2:287, 2016.                                        record data. Pain Medicine, 18(1):107–
                                                                        115, 2017.
[AV09]        Roberto Avogadri and Giorgio Valen-
              tini. Fuzzy ensemble clustering based on     [Dwo06]      Cynthia Dwork. Differential privacy.
              random projections for dna microarray                     In Proceedings of the 33rd Interna-
              data analysis. Artificial Intelligence in                 tional Conference on Automata, Lan-
              Medicine, 45(2-3):173–183, 2009.                          guages and Programming - Volume Part
                                                                        II, ICALP’06, pages 1–12, Berlin, Hei-
[AWY+ 99]     Charu C Aggarwal, Joel L Wolf, Philip S                   delberg, 2006. Springer-Verlag.
              Yu, Cecilia Procopiuc, and Jong Soo
              Park. Fast algorithms for projected clus-    [FB03]       Xiaoli Z Fern and Carla E Brodley.
              tering. In ACM SIGMoD Record, vol-                        Random projection for high dimensional
              ume 28, pages 61–72. ACM, 1999.                           data clustering: A cluster ensemble ap-
                                                                        proach. In Proceedings of the 20th Inter-
[BBM16]       Tiziano Bianchi, Valerio Bioglio, and                     national Conference on Machine Learn-
              Enrico Magli. Analysis of one-time ran-                   ing (ICML-03), pages 186–193, 2003.
              dom projections for privacy preserving
              compressed sensing. IEEE Transactions        [fDCP+ 03]   Centers for Disease Control, Prevention,
              on Information Forensics and Security,                    et al. Hipaa privacy rule and public
              11(2):313–327, 2016.                                      health. guidance from cdc and the us de-
                                                                        partment of health and human services.
[Bro00]       Anthony The Guardian Browne. Lives                        MMWR: Morbidity and mortality weekly
              ruined as nhs leaks patients’ notes, 2000.                report, 52(Suppl. 1):1–17, 2003.
[FMR10]      Yi Fang, Sundar Murugappan, and                          segmentation of tumor areas in breast
             Karthik Ramani. Estimating view pa-                      histology images. In Medical Image Un-
             rameters from random projections for                     derstanding and Analysis, pages 17–23,
             tomography using spherical mds. BMC                      2012.
             medical imaging, 10(1):12, 2010.
                                                          [KKMM12a] K. Kenthapadi, A. Korolova, I. Mironov,
[GLH13]      C. Giannella, K. Liu, and Kargupta H.                  and N. Mishra. Privacy via the Johnson-
             Breaching euclidean distance preserving                Lindenstrauss Transform.      ArXiv e-
             data perturbation using few known in-                  prints, April 2012.
             puts. Data and Knowledge Engineering,
             2013.                                        [KKMM12b] Krishnaram Kenthapadi, Aleksandra
      +                                                             Korolova, Ilya Mironov, and Nina
[HHC 14]     Shu-Yu Hsu, Yingchieh Ho, Po-Yao
                                                                    Mishra.     Privacy via the johnson-
             Chang, Chauchin Su, and Chen-Yi Lee.
                                                                    lindenstrauss transform. arXiv preprint
             A 48.6-to-105.2 µw machine learning
                                                                    arXiv:1204.2606, 2012.
             assisted cardiac sensor soc for mobile
             healthcare applications. IEEE Journal        [KLR06]     H. Kargupta, K. Liu, and J. Ryan.
             of Solid-State Circuits, 49(4):801–811,                  Random projection-based multiplicative
             2014.                                                    data perturbation for privacy preserving
[HXY+ 16]    Jian-Hua Huang, Hua-Lin Xie, Jun                         distributed data mining. IEEE Transac-
             Yan, Dong-Sheng Cao, Hong-Mei Lu,                        tions on Knowledge & Data Engineering,
             Qing-Song Xu, and Yi-Zeng Liang.                         18:92–106, 01 2006.
             Correction:   Interpretation of type
             2 diabetes mellitus relevant gc-ms           [LF12]      Li Liu and Paul Fieguth.            Tex-
             metabolomics fingerprints by using ran-                  ture classification from random features.
             dom forests.      Analytical Methods,                    IEEE Transactions on Pattern Analy-
             8(8):1950–1951, 2016.                                    sis and Machine Intelligence, 34(3):574–
                                                                      586, 2012.
[JDXC+ 14]   Lihong Jiang, Li Da Xu, Hongming Cai,
             Zuhai Jiang, Fenglin Bu, and Boyi Xu.        [LGK06]     Kun Liu, Chris Giannella, and Hillol
             An iot-oriented data storage framework                   Kargupta. An attacker’s view of dis-
             in cloud computing platform. IEEE                        tance preserving maps for privacy pre-
             Transactions on Industrial Informatics,                  serving data mining. In Proceedings of
             10(2):1443–1451, 2014.                                   the 10th European Conference on Prin-
                                                                      ciple and Practice of Knowledge Discov-
[JL84]       William B Johnson and Joram Linden-                      ery in Databases, PKDD’06, pages 297–
             strauss. Extensions of lipschitz map-                    308, Berlin, Heidelberg, 2006. Springer-
             pings into a hilbert space. Contemporary                 Verlag.
             mathematics, 26(189-206):1, 1984.
                                                          [Liu07a]    Kun Liu. Multiplicative Data Perturba-
[JNY07]      Yu-Gang Jiang, Chong-Wah Ngo, and
                                                                      tion for Privacy Preserving Data Min-
             Jun Yang. Towards optimal bag-of-
                                                                      ing. PhD thesis, University of Maryland,
             features for object categorization and se-
                                                                      2007.
             mantic video retrieval. ACM, 2007.
[Jol17]      Arnaud Joly. Exploiting random pro-          [Liu07b]    Kun Liu. Multiplicative Data Perturba-
             jections and sparsity with random                        tion for Privacy Preserving Data Min-
             forests and gradient boosting methods–                   ing. PhD thesis, University of Maryland,
             application to multi-label and multi-                    2007.
             output learning, random forest model
             compression and leveraging input spar-       [LJJ+ 09]   Ronan A Lyons, Kerina H Jones, Gareth
             sity. arXiv preprint arXiv:1704.08067,                   John, Caroline J Brooks, Jean-Philippe
             2017.                                                    Verplancke, David V Ford, Ginevra
                                                                      Brown, and Ken Leake. The sail data-
[KEDR12]     Adnan Mujahid Khan, Hesham El-Daly,                      bank: linking multiple health and social
             and Nasir Rajpoot. Ranpec: Random                        care datasets. BMC medical informatics
             projections with ensemble clustering for                 and decision making, 9(1):3, 2009.
[LKR06]     Kun Liu, Hillol Kargupta, and Jes-                        Seventeenth ACM SIGACT-SIGMOD-
            sica Ryan. Random projection-based                        SIGART Symposium on Principles of
            multiplicative data perturbation for pri-                 Database Systems, PODS ’98, pages
            vacy preserving distributed data mining.                  188–, New York, NY, USA, 1998. ACM.
            IEEE Transactions on knowledge and
            Data Engineering, 18(1):92–106, 2006.         [Tan16]     Colin Tankard. What the gdpr means for
                                                                      businesses. Network Security, 2016(6):5–
[LL15]      In Lee and Kyoochun Lee. The inter-                       8, 2016.
            net of things (iot): Applications, invest-
            ments, and challenges for enterprises.        [Wik14]     Suanu Bliss Wikina. What caused the
            Business Horizons, 58(4):431–440, 2015.                   breach? an examination of use of in-
                                                                      formation technology and health data
[MP99]      George D Magoulas and Andriana                            breaches. Perspectives in health infor-
            Prentza. Machine learning in medi-                        mation management, 11(Fall), 2014.
            cal applications. In Advanced Course
            on Artificial Intelligence, pages 300–307.    [XLZW16]    Haozhe Xie, Jie Li, Qiaosheng Zhang,
            Springer, 1999.                                           and Yadong Wang.           Comparison
                                                                      among dimensionality reduction tech-
[MSDPC12] Daniele Miorandi, Sabrina Sicari,                           niques based on random projection for
          Francesco De Pellegrini, and Imrich                         cancer classification. Computational bi-
          Chlamtac. Internet of things: Vision,                       ology and chemistry, 65:165–172, 2016.
          applications and research challenges. Ad
          hoc networks, 10(7):1497–1516, 2012.            [ZWC+ 09]   Evangelia I Zacharaki, Sumei Wang,
                                                                      Sanjeev Chawla, Dong Soo Yoo, Ronald
[New09]     Natalie Privacy Law Blog Newman. Net-                     Wolf, Elias R Melhem, and Christos
            flix sued for largest voluntary privacy                   Davatzikos. Classification of brain tu-
            breach to date, 2009.                                     mor type and grade using mri tex-
                                                                      ture and shape in a machine learn-
[NS08]      Arvind     Narayanan     and   Vitaly                     ing scheme.     Magnetic resonance in
            Shmatikov. Robust de-anonymization                        medicine, 62(6):1609–1618, 2009.
            of large sparse datasets. In Security
            and Privacy, 2008. SP 2008. IEEE
            Symposium on, pages 111–125. IEEE,
            2008.

[Pal05]     Mahesh Pal. Random forest classifier
            for remote sensing classification. In-
            ternational Journal of Remote Sensing,
            26(1):217–222, 2005.

[RvdM14]    J Rivera and R van der Meulen. Gart-
            ner says the internet of things will trans-
            form the data center. Retrieved August,
            5:2014, 2014.

[SDG+ 14]   Beata Strack, Jhonathan P. Deshazo,
            Chris Gennings, Juan L. Olmo, Sebas-
            tian Ventura, Krzysztof J. Cios, and
            John N. Clore. Impact of hba1c mea-
            surement on hospital readmission rates:
            Analysis of 70,000 clinical database pa-
            tient records. BioMed Research Interna-
            tional, 2014.

[SS98]      Pierangela Samarati and Latanya
            Sweeney. Generalizing data to provide
            anonymity when disclosing informa-
            tion (abstract). In Proceedings of the

</pre>