=Paper=
{{Paper
|id=Vol-2482/paper3
|storemode=property
|title=Random Projection to Preserve Patient Privacy
|pdfUrl=https://ceur-ws.org/Vol-2482/paper3.pdf
|volume=Vol-2482
|authors=Aris Anagnostopoulos,Fabio Angeletti,Federico Arcangeli,Chris Schwiegelshohn,Andrea Vitaletti
|dblpUrl=https://dblp.org/rec/conf/cikm/Anagnostopoulos18
}}
==Random Projection to Preserve Patient Privacy==
Random Projection to Preserve Patient Privacy
Aris Anagnostopoulos Fabio Angeletti
Sapienza University of Rome Sapienza University of Rome
aris@diag.uniroma1.it angeletti@diag.uniroma1.it
Federico Arcangeli Chris Schwiegelshohn
Sapienza University of Rome Sapienza University of Rome
federico.arcangeli1@gmail.com schwiegelshohn@diag.uniroma1.it
Andrea Vitaletti
Sapienza University of Rome
vitaletti@diag.uniroma1.it
against known input/output attacks while of-
fering high quality data, as long as the pro-
Abstract jected space is smaller than the original space,
and as long as the amount of leaked data avail-
With the availability of accessible and widely able to the adversary is limited.
used cloud services, it is natural that large
components of healthcare systems migrate to
1 Introduction
them; for example, patient databases can be
stored and processed in the cloud. Such cloud During the recent years, we witnessed the tremendous
services provide enhanced flexibility and addi- progress made in the field of wireless sensor networks.
tional gains, such as availability, ease of data This paved the way and facilitated the wide adop-
share, and so on. This trend poses serious tion of small electronic devices with interconnection
threats regarding the privacy of the patients capabilities. These devices composed the majority of
and the trust that an individual must put into the so-called Internet of Things (IoT). The IoT is a
the healthcare system itself. Thus, there is a highly dynamic and radically distributed networked
strong need of privacy preservation, achieved system, composed of an incredible high number of ob-
through a variety of different approaches. jects [MSDPC12]. It is vastly considered as one of the
most expanding area within future technologies and
In this paper, we study the application of
it is attracting vast attention in different industry ap-
a random projection-based approach to pa-
plications [LL15], ranging from smart cities to home
tient data as a means to achieve two goals:
automation, farming, and many more fields of appli-
(1) provably mask the identity of users un-
cation. Ubiquitous sensors, smart objects and devices
der some adversarial-attack settings, (2) pre-
involved in IoT can generate a tremendous amount of
serve enough information to allow for aggre-
data [JDXC+ 14]. This flow of data requires robust,
gate data analysis and application of machine-
available and fast storage solutions and builds the
learning techniques. As far as we know, such
bases to very effective and powerful algorithms in the
approaches have not been applied and tested
fields of machine learning and data mining [CDW+ 15].
on medical data. We analyze the trade-
off between the loss of accuracy on the out- Electronics health-care solutions and, generally
come of machine-learning algorithms and the speaking, the Internet of Health Things (IoHT) fol-
resilience against an adversary. We show lows the same trend. About 73% of healthcare exec-
that random projections proved to be strong utives say that IoHT is a disrupting techology for the
next years and it is becoming one of the most funded
Copyright © CIKM 2018 for the individual papers by the papers' areas in IoT. Pervasive IoHT enables cost savings for
authors. Copyright © CIKM 2018 for the volume as a collection both the administrations and the individuals, but on
by its editors. This volume and its papers are published under
the other hand it has some barriers, like: privacy and
the Creative Commons License Attribution 4.0 International (CC
BY 4.0).
security concerns, lack of skilled workers, poor inter- dom, one cannot use the projected data to obtain the
operability and more [acc17]. real data: each datapoint appears to be random. This,
Given the sensitive nature of healthcare data, there unless the attacker has some significant power. This
is a strong need to protect the information of the trade off is studied in previous works (e.g. [BBM16],
patients. Furthermore, the recent adoption of Gen- [Liu07a]).
eral Data Protection Regulation (GDPR) strength- It is not clear a priori that this approach could work
ens data protection and now it must be applied to in the application on medical dataset that interests us.
any organisation or individual that collects and pro- For instance, the lemma by Johnson and Lindenstrauss
cesses information related to EU citizens, regardless is typically applied on settings where the original data
where the data is physically stored or where they are lie on a very high-dimensional space. However, in prac-
based[Alb16, Tan16]. At the same time, analysis of tice, the original dimension may be low (for instance
such data are crucial for medical research and the drug in our dataset it is about 50). In this paper we look
industry. Consequently, there is a need to design ap- at this and other issues by applying the random pro-
proaches that allow data processing without exposing jection to a dataset containing information about 70K
the personal underlying information. cases of diabetes [SDG+ 14]. We show that it is possi-
For this reason, there has been a series of tech- ble to reduce the dimensionality of the data and still
niques for perturbing data such that information on obtain accuracy scores that are comparable with the
individual data points cannot be leaked, while aggre- ones obtained from the original non-projected data.
gate information is preserved. Examples of such ap- At the same time, we also show how sensible and pri-
proaches are k-anonymity [SS98] and differential pri- vate patient information such their age or gender are
vacy [Dwo06]. The various approaches put different safe against attacks that try to reconstruct the map-
importance on the privacy requirements; for instance, ping between the original data and the projected data
differential privacy attempts to alter the data such as after applying random projections.
to provide very strong privacy guarantees, typically,
without specifying the usefulness of the resulting data. Structure of the paper. The rest of this paper
for general-purpose data analysis. is organized as follows. In Section 2 we present cur-
In this paper we apply a method, which can be rent solutions and the state of the art on random pro-
found in [Liu07a], where the explicit goal is to ob- jections and other privacy-preserving techniques. In
tain a dataset that remains useful after the perturba- Section 3 we present the goals of our work and the ap-
tion (still providing some privacy guarantees). More proach we used for achieving them, leveraging random
specifically, our approach is based on random projec- projections. In Section 4 we show our experimental
tions (RP), a technique that is typically applied pri- results, where we explore the limits of our approach
marily for efficiency reasons. It is based on a fun- both in terms of accuracy and privacy protection. We
damental result from the work of Johnson and Lin- conclude in Section 5 where we also propose future
denstrauss [JL84] and the idea is the following: As- work.
sume that we have large amounts of data (n data-
points), lying on a high-dimensional space. Then, if
2 State of the art
we project each point to a random subspace of dimen-
sion O(log n/2 ), with high probability all pairwise dis- Data leakages are very common [Wik14]. In this work,
tances between the data points are preserved within a we are more interested in reducing the ability of an
factor of 1 ± . This technique has found multiple ap- attacker to reconstruct non-yet-leaked data from the
plications in streaming algorihtms, in finding nearest leaked one. Within medical premises, there are multi-
neighbors in high-dimensional spaces, in reducing the ple individuals who could obtain access to protected
dimension of databases, and so on. information from the rightous doctor to untrustful
Our idea of applying a random projection approach workers. This could lead to multiple entities knowing
on privacy-preserving data mining is the following. As- protected personal health information.
sume that we have the records of multiple patients. Before 2003, with the enforcing of HIPAA rules,
Then we can consider a random projection of these some private medical information were regularly
data. The result of Johnson and Lindenstrauss guar- shared among professional [Bro00]. Following the
antees that if we execute algorithms that depend on guidelines from Health Insurance Portability and Ac-
the pairwise distances on the data (e.g., several clus- countability Act (HIPAA) [fDCP+ 03], the US govern-
tering or classification algorithms), then the results ob- ment made the first concrete attempt to mitigate the
tained are with high probability similar to those ob- chance of re-identification of patients. In 2009, it was
tained on the real data set (and the error can be quan- clear that HIPAA is not sufficient to protect the pri-
tified). Furthermore, because the projections are ran- vacy of an individual. In fact, the HIPAA was not
able to protect the user personal information after the areas [KEDR12], to enhance tomography [FMR10], to
anonimization process that substituted HIPAA param- cluster DNA microarrays[AV09] or to classify cancer
eters with IDs. In a famous case [New09], some re- [XLZW16].
searchers were able to re-identify users and also their In [LKR06], RPs were used to mask clear data pro-
sexual orientation and other information. Moreover, jecting them in smaller spaces, while in [BBM16] and
the availability of correlated data (coming from the [KKMM12b], similarly to our work, the authors dis-
same source or other sources) could greatly help to cuss how to exploit RPs to enhance data privacy. The
identify a patient. Data breaches continue to increase authors in [LKR06] also discuss the utility of the RP
year after year, between 2005 and 2014, only in the in reducing complexity of problems while maintaining
US, more than 26 million of people had some form of the usefulness of the projected data for algorithms. It
personal health information leaked [Wik14]. is anticipated that by 2020 there will be more than
Therefore, more elaborate tecniques, which add 26 billion devices involved in IoT related applications
noise to the data, have been developed in the last years [RvdM14]. Surely, not all of them will be part of
to protect users’ privacy and still maintain a good level the healthcare field, however we expect a very large
of accuracy when exploring and analyzing the data. amount of information to process. The usefulness of
One of these is differential privacy [Dwo06]. This ap- RP in reducing problem complexity (or resource re-
proach focuses on providing statistically coherent re- quirements) is well understood and exploited as use-
sponses querying a database, i.e. third parties are in- ful resource in the literature [CEM+ 15, LKR06, FB03,
terested to query for information about a sample of BZD10, AWY+ 99]. For example, in [FB03], the au-
a population, not a single individual. Instead, we are thors explore some ways to reduce high dimensional
interested also in providing data about a specific indi- data for clustering while, in [LF12], is presented a work
vidual, for example investigating if he or she is suitable on classification of small patches of images from a very
for a clinical trial. large database that takes advantage of the properties
In [KKMM12a] the authors proved that the offered by RPs.
Johnson-Lindenstrauss transform can be used as an al- During the last two decades, the contribution of ma-
ternative approach to achieve differential privacy. The chine learning and data mining algorithms in health-
method is then compared against other techniques, care applications became more frequent year after
such as adding Gaussian noise to data or random- year. This is well demonstrated in the literature,
ized response. The proposed approach has superior for example in [CHH+ 17, MP99, ZWC+ 09, HHC+ 14].
accuracy bounds than the others, while still keeping One last aspect to consider is the chance to link to-
secure the privacy of the records. The authors also gether multiple datasets. For example in [LJJ+ 09], the
criticize the work of Liu et al. about releasing data authors presented the infrastructure of a databank in
to third parties after applying random projections in order to enable record-linkage research studies. This
order to protect sensitive information while still pre- linkage on one hand could deeply help the development
serving accuracy of different data mining algorithms: of newer treatments or drugs, but on the other hand
an adversary that has some background knowledge can poses threats to the privacy of the individuals.
infer approximations of the original data. We address
this issue in the scenario of known input-output data 3 Problem Formulation
(section 3.2) and show how in real world scenario re-
garding medical data, under reasonable assumptions We consider a reference scenario in which a group of
for the power of the adversary, it is difficult for an at- users, characterized by private features, are potentially
tacker to discover private information from projected suitable for a clinical trial. Only a limited number of
records. users in the group will be actually enrolled in the trial.
In the literature there exist a very large number of For the enrolled users, namely the patients, the private
works regarding the re-identification of person start- features will be eventually made public to participate
ing from various data, within some degree it is called to the clinical trial in the most effective way. Some
“breaking the k-anonimity”. For example in [NS08] knowledge on the group is of primary importance for
the authors presents a method to re-identify a user the researchers to understand the size and the char-
from its preferences. acteristics of potential patients. In general, users are
In this paper, we aim at investigating to what ex- well disposed to support this need of the researchers
tent RPs can provide useful data for machine learning provided that their privacy is preserved. The main
algorithms (e.g. classification) on a group of potential problem we want to address in this paper is:
patients while preserving at the same time the privacy
of individuals. RPs have been employed in a number of Can we learn something on the group of users as a
healthcare applications, for example to segment tumor whole, while preserving the privacy of the individuals
who will not participate in the trial? We now elaborate on these two dimensions.
More formally, we consider a group of n users, where 3.1 Accuracy
each user u is characterized by m features. We repre-
sent the corresponding dataset as a matrix X ∈ Rm×n , Lemma 3.1 provides a technique to generate a low-
with m rows (the features) and n columns (the users). dimensional representation of the original data main-
As already observed, in the era of big data, m and n taining the pairwise distance within an error . Since
can be particularly big. the pairwise distance is the key ingredient for many
Giannella et al. [GLH13] show how it is possible to classification tasks performed by machine learning al-
break the privacy in some contexts of distance preserv- gorithms, this property allow us to have some guar-
ing mappings. Liu [Liu07b] instead, highlights how antees that the solution found in the low-dimensional
mappings that do preserve distances within certain space is a good approximation of the solution in the
bounds like random projections can boost the privacy original and higher dimensional space. Furthermore,
guarantees. We will apply these techniques in order reducing the size of the input data speeds-up the ex-
to prove that users’ privacy can be kept safe against ecution time of the algorithms and limits the amount
malicious attackers. of resources needed.
We are interested in understanding to what ex-
tent the random projection technique, which has been Lemma 3.1 (Johnson and Lindenstrauss) Given >
originally conceived to reduce the dimensionality of a 0 and an integer n let k be a positive integer such that
dataset, can also be used to preserve the privacy of the k ≥ k0 = O( log(n)
2 ). For every set P of n points in
users. In particular, we apply a random projection to Rm there exists a mapping f : Rm → Rk such that for
X, such that if R ∈ Rk×m is the random-projection all u, v ∈ P
matrix Y = RX is the transformed matrix after ap-
plying the random projection, with Y ∈ Rk×n . We (1 − ) k u − v k2 ≤k f (u) − f (v) k2 ≤ (1 + ) k u − v k2
denote by xui the column in X associated to user ui ,
It can be proved that a random projection, is a map-
and with yiu the corresponding column in Y . In the
ping f that fulfills the previous lemma with positive
scenario we are describing the projected matrix Y is
probability. This is often referred as JL-embeddings.
known to the public, it is indeed the dataset on which
researchers try to distill information on the group; the
3.2 Privacy: Known Input–Output Attack
transformation matrix R and the original data X are
private. Some columns of X may become public once We now try to answer one of the questions we raised
the corresponding users will eventually decide to par- in the previous section: Can a a malicious third party
ticipate to a clinical trial, in other words some pairs who knows some pairs (xui , yiu ) (i.e. that a particular
(xui , yiu ) will become public. record xui is associated to yiu after its projection) learn
We can now better describe the problem, splitting it information about other records?
into two sub-problems: Liu in his Ph.D. thesis [Liu07b] describes a Bayes
privacy model to measure the privacy offered by a
Accuracy. Can we learn something on the group ex-
perturbation technique. The model considers the at-
ploiting Y ? Here we want to understand if the
tacker’s apriori and a posteriori beliefs about the data
results of some machine-learning algorithms on Y
and uses Bayesian inference to evaluate the privacy.
are a good approximation of the ones obtained on
For completeness, we repeat his framework here.
X. If we answer positively to this question, we
Let x be the unknown private data, y the perturbed
can at least conclude that what can be learned
data and θ the attacker’s additional knowledge about
from the original data can be also learned from
the data. Then the MAP estimate of x given y and θ
the projected data.
is
Privacy. Can we preserve the privacy of the individ- x̂M AP (y, θ) = arg max fx|y,θ (x|y, θ)
x
uals that will not participate in the trial? As al-
ready observed, Y is public whereas only some with fx|y,θ the conditional probability density of x
columns of X will eventually become public when given y and θ.
the corresponding users will decide to partici- Let Xp denote the first p columns of X and Xn−p
pate in a clinical trial. Consequently, some pairs the remaining columns. We define similarly Yp and
(xui , yiu ) will become public. Here we want to un- Yn−p . We further assume that the columns of Xp are
derstand if an attacker knowing Y and the some linearly independent and that Xp is known to the at-
pairs (xui , yiu ) can possibly know something about tacker (i.e., the attacker has full knowledge of p pa-
the other users that do not participate in the trial. tients). Y is entirely known to the attacker, because
as we stated before, it is publicly available to conduct We want to maximize this function in order to solve
experiments on the projected data. the problem of finding the best estimate of x given the
For the next reasoning the following hypothesis observation of Xp .
must be verified: Liu proposes an algorithm to estimate the nondis-
closed records of a certain dataset. Experimental re-
• The original data arose from as a sample from a sults have shown that while decreasing the number
matrix variate distribution. of column records known to the attacker (denoted by
• The projection matrix R is a k × m random ma- p) the relative error of the estimation increases. The
trix with each entry indipendent and identically error in the estimation increases also decreasing the
distributed with 0 mean and unit variance. R has dimensionality of the projected subspace (denoted by
a matrix variate Gaussian distribution with mean k). In particular the algorithm uses the Nelder–Mead
matrix M = 0 and covariance matrix Σ = Ik ⊗In .1 simplex algorithm to find the optimal solution of the
maximization problem.
• Y has a matrix variate Gaussian distribution with
mean matrix M = 0 and covariance matrix Σ = 4 Experimental results
Ik ⊗ k1 X T X
In this section, we present experimental results ob-
The attacker will try to produce x̂i , with 1 ≤ i ≤ tained on a dataset containing information about
m−p, such that x̂i is a good estimate of the undisclosed70000 cases of diabetes diagnosticated in 130 US hos-
private record xi . In other words the attacker’s targetpitals during the decade 1999-2008 [SDG+ 14] 2 . From
is to try to give an estimation of one of the records now on we will refer to this dataset as the diabetes
contained in Xn−p , given that he knows the records in dataset.
Xp and their randomly projected counterpart in Yp . We focus on the classification of patients based
We now derive the MAP estimate of x given y = Rx on their privatized data. Following the work in
and the known matrices Xp and Yp [DMS+ 17, HXY+ 16], we choose to use random forest
1 1 classifier in our dataset to classify the users. More-
x̂M AP (y, θ) = arg max fx|y,θ (x = x| √ Rx = y, √ RXp =over,
Yp ) from the work in [Jol17], we know that random
x k k forest classifiers works really well with random pro-
which can be simplified in jections. In Figure 1 we report the effectiveness in
terms of accuracy running the random forest classi-
1 fier [Pal05] on the original data and on the projected
arg max fx,y,θ ( √ RX = Y )
x k data in multiple lower dimensional spaces. To run and
validate the classification algorithm, we divided the
where X = [xXp ] and Y = [yYp ]. whole dataset into two parts: train and test. In the
We further suppose that the attacker has no other dataset we decided to predict the range of glucose level
background knowledge about the private data, so we in the blood. So that, the algorithm was firstly trained
can assume that θ = 0. with the records within the train part of the diabetes
The previous result can be written as dataset, providing all the target values. Thus, we made
1 the random forest classifier algorithm predicts the tar-
arg max fx,y ( √ RX = Y ) = get values in the test part giving its features as in-
x k
put. Moreover, we tested the effectiveness of RPs also
1
arg max f √1 RZ|Z ( √ RZ = Y |Z = X)fZ (Z = X) with k-nearest neighbors (k-NN) classifier, the results
x k k were reported in Figure 2. Our approach was inspired
If we assume that fZ is distributed uniformly over an by [AC06]. The results are quite different because in
interval, we finally get the first experiment we taken a feature of the dataset
(the range of glucose level in the blood) as the value to
1 predict, instead with the second experiment we choose
x̂M AP (y) = arg max f √1 RZ|Z ( √ RZ = Y |Z = X)
x k k to run firstly a kMeans clustering algorithm (on the
In [Liu07b, Theorem 5.3.8] is shown that the prob- whole dataset) to obtain labeled groups and then, with
ability density function we obtained has the following the k-nearest neighbors (k-NN) classifier we predicted
form the values.
The blue line represents the accuracy of the machine
− 21 k(p+1) 1 T −1k 1 1 T −1 T
(2π) det( X X) 2 etr{− Y ( X X) Y } 2 The dataset is called “Diabetes 130-US hospi-
k 2 k
tals for years 1999–2008 Data Set” and is available at
1 ⊗ indicates the Kroenecker product of two matrices [Liu07b] https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+ye
learning algorithm on the original data. The orange wisker again. Instead, for the accuracy of classifica-
line, instead, represents the accuracy of the same al- tion on the projected data, we ran the algorithm more
gorithm on the projected (obfuscated) data. We tested than 100 times. In each round the algorithm generated
the classification algorithm on projected spaces in dif- a value for each projected space. The results were ob-
ferent sizes, starting from only 2 components up to 10 tained using the scikit-learn package on Python 3.6.
components. In [KLR06, LGK06] the authors explore the security
of such techniques: they show how it is possible to use
data dimensionality reduction techniques to lower the
complexity of data mining algorithms while preserving
their accuracy and how those techniques preserve the
privacy of users.
The authors start from the same privacy hypothesis
we have presented in 3.2 and study how an attacker in
possession of a collection of linearly independent pri-
vate data records and their corresponding transformed
part can gather some insight about other records.
We present the results we got running the algorithm
of [Liu07b] on this dataset. After choosing a number p
of record pairs (xp , yp ) we select a record x for which
Figure 1: Accuracy of the random forest classifier al- we do not know the mapping; the algorithm we are
gorithm on the original data (blue line) and on the using will try to give an estimation x̂ of the original
projected data (orange line), varying the projection record x.
space (# of components). Mean values are reported We used two techniques to evaluate how similar to
as lines and 95% confidence intervals are reported as the original records the algorithm’s estimations were.
vertical lines. We measured the distance between the estimation x̂
provided by the algorithm and the original record x.
We compute the relative error between the two vectors
with the following:
||x − x̂||2
E(x, x̂) =
||x||2
The error E increases with the Euclidean distance be-
tween the two. Notice that with this notation it may
happen that the error is greater than one: this could
verify in the case that the distance between x and its
estimations x̂ is high and the norm of x is a small value.
This could happen if the algorithm’s estimation is very
far off from the original record.
Figure 2: Accuracy of the k-nearest neighbors (k-NN) This measuring has the drawback to lack an upper
algorithm on the original data (blue line) and on the bound for the dissimilarity. Neither the cosine similar-
projected data (orange line), varying the projection ity helps, since in our case we are not interested only
space (# of components). Mean values are reported in the direction of vectors but also in their magnitude.
as lines and 95% confidence intervals are reported as A solution is provided in [JNY07], where a radial
vertical lines. basis function kernel can be used for representing sim-
1
The lines plotted in Figure 1 presents the average ilarities: we are going to use 1 − edist(x,x̂) as a similar-
values for each projection space, while the vertical ity function between x and its estimation x̂, where
wiskers represent the confidence interval correspond- dist(x, x̂) = ||x − x̂||2 . The bigger the Euclidean
ing to a specific projection space. For the baseline distance between two vectors, the bigger the error
(classification on the clear data) we ran the classifica- edist(x,x̂) will be. In this way we have a [0, 1) bound
tion algorithm 50 times, in each round starting from for the similarity of the estimations. By applying the
a random state of the random forest classifier. Since inverse we get a value in the range [0,1): if x and x̂
the original data is not projected into any space, we are the same vector (perfect reconstruction performed
1
have only a baseline with the associated mean value by the algorithm) then edist(x,x̂ )
= 1.
and confidence interval. Thus, we reported the con- Our workplan is the following: for every subspace
fidence interval only at the lefties part of line using of dimensionality k we apply the algorithm with differ-
ent knowledge about the number of pairs (xp , yp ) the
attacker knows. We go from p = k − 1 to p = 1. In the
next figures we display the results of our experiments,
with the two different measuring techniques we used
to quantify the similarity between the original records
and the estimated ones. We report the mean of the
errors for every pair (k, p) and the variance. On the
X axis are placed the tuples (k, p) for which we have
conducted the experiments, on the Y axis we placed
the reconstruction errors.
On low-dimensionality subspaces we get a high rela-
tive error, meaning that it is not possible to give an
effective approximation of the original (private) data
records. In higher dimensions the approximation is
closer to the original data. We ran our experiments
with 10 features of the dataset, since with vectors of
higher dimensionality it becomes more difficult to run Figure 3: Mean and variance of the relative error while
the reconstruction algorithm in reasonable times; also using the formula ||x−x̂||
||x||2
2
with higher dimensionalities the algorithm we are us-
ing outputs vector reconstructions that are very dis-
similar from the original ones.
We applied the random projection to reduce the fea-
ture space in different dimensions, from 10 to 3. No-
tice, however, that even when the projected space has
the same dimension of the original space, we already
get a significant relative error, meaning that on the av-
erage it is not possible for the attacker to extrapolate
any useful information about the patients’ records. So
for records of higher dimensionality there is already a
safe privacy bound when applying random projection
to them, at least against this kind of attacks.
We assigned an increasing numerical value to nom-
inal features, that is, we assigned 0 to the text male
and 1 to text female in the gender feature.
We applied random projection to this records, from
k = 10 (no dimensionality reduction) to k = 3; the Figure 4: Mean and variance of the similarity between
number p of pairs (original record, projected record) original records and their reconstruction while using
1
known to the attacker is in the range k − 1 ≤ p ≤ 1. the similarity function 1 − e||x−x̂|| 2
With k = 2 we obviously have only p = 1: we rors when applying the techniques we have explained.
omit this result since it is not meaningful with respect This is another confirmation of the thesis that random
the other results we get for higher k and p, because it projections help keep the privacy of users when their
does not show how knowing less (or more) information information is shared among research institutes.
about the original data changes the reliability of the
reconsturction we get.
5 Conclusions
In the next figures we show the mean and variances
of the errors for every tuple (k, p) for which we have In this work, we applied an random-projections ap-
conducted the experiment. It can be seen from the proach to privacy-preserving data mining of medical
charts that as the number of known input-output pairs data.
p decreases, the reconstruction error increases. To- First we demonstrated the usefulness of RP in in-
gether with the dimensionality reduction, disclosing a creasing privacy of personal health data. The pro-
scarce number of known input-output pairs can help jected data are useful for machine learning algorithms
with the task of preserving the privacy of users in- (for example, in clustering) while allows the sharing of
volved in clinical trials. information between parties without revealing the pa-
In this case we are projecting low dimensionality tients’ clear data. In this particular application, this
vectors (k = 10) but we still get high reconstruction er- is of notable importance since allows entities involved
in different health branches to cooperate effectively [BZD10] Christos Boutsidis, Anastasios Zouzias,
without sharing clear data. Second, we investigated and Petros Drineas. Random projections
to what extent an attacker can discover additional in- for k-means clustering. In Advances in
formation starting from leaked data. As long as the Neural Information Processing Systems,
projected space is smaller than the original space, and pages 298–306, 2010.
as long as the amount of data leaked is small, than the
proposed approach is robust and mantains very good [CDW+ 15] Feng Chen, Pan Deng, Jiafu Wan,
performance in both accuracy and privacy. Daqiang Zhang, Athanasios V Vasilakos,
and Xiaohui Rong. Data mining for the
We analyzed the ratio behind and the performances
internet of things: literature review and
(in terms of accuracy) of the RP applied on sensible
challenges. International Journal of Dis-
healthcare data. The results shows that the use of RP
tributed Sensor Networks, 11(8):431047,
offers great enhancements in privacy protection. This
2015.
was a first step into developing a full-fledged platform
that allows the effective share of medical data. In fu- [CEM+ 15] Michael B Cohen, Sam Elder, Cameron
ture we are planning a bigger real-world deployment Musco, Christopher Musco, and
of such platform to further validate our results, plus Madalina Persu. Dimensionality reduc-
an audit to check privacy protection against real third tion for k-means clustering and low rank
parties. approximation. In Proceedings of the
forty-seventh annual ACM symposium
References on Theory of computing, pages 163–172.
ACM, 2015.
[AC06] Nir Ailon and Bernard Chazelle. Ap-
proximate nearest neighbors and the fast [CHH+ 17] Min Chen, Yixue Hao, Kai Hwang,
Johnson-Lindenstrauss transform. In Lu Wang, and Lin Wang. Disease predic-
Proceedings of the 38th Annual ACM tion by machine learning over big data
Symposium on Theory of Computing, from healthcare communities. IEEE Ac-
Seattle, WA, USA, May 21-23, 2006, cess, 5:8869–8879, 2017.
pages 557–563, 2006.
[DMS+ 17] Sarah DuBrava, Jack Mardekian, Alesia
[acc17] accentureconsulting. Internet of health Sadosky, E Jay Bienen, Bruce Parsons,
things survey, 2017. Markay Hopps, and John Markman. Us-
ing random forest models to identify
[Alb16] Jan Philipp Albrecht. How the gdpr will correlates of a diabetic peripheral neu-
change the world. Eur. Data Prot. L. ropathy diagnosis from electronic health
Rev., 2:287, 2016. record data. Pain Medicine, 18(1):107–
115, 2017.
[AV09] Roberto Avogadri and Giorgio Valen-
tini. Fuzzy ensemble clustering based on [Dwo06] Cynthia Dwork. Differential privacy.
random projections for dna microarray In Proceedings of the 33rd Interna-
data analysis. Artificial Intelligence in tional Conference on Automata, Lan-
Medicine, 45(2-3):173–183, 2009. guages and Programming - Volume Part
II, ICALP’06, pages 1–12, Berlin, Hei-
[AWY+ 99] Charu C Aggarwal, Joel L Wolf, Philip S delberg, 2006. Springer-Verlag.
Yu, Cecilia Procopiuc, and Jong Soo
Park. Fast algorithms for projected clus- [FB03] Xiaoli Z Fern and Carla E Brodley.
tering. In ACM SIGMoD Record, vol- Random projection for high dimensional
ume 28, pages 61–72. ACM, 1999. data clustering: A cluster ensemble ap-
proach. In Proceedings of the 20th Inter-
[BBM16] Tiziano Bianchi, Valerio Bioglio, and national Conference on Machine Learn-
Enrico Magli. Analysis of one-time ran- ing (ICML-03), pages 186–193, 2003.
dom projections for privacy preserving
compressed sensing. IEEE Transactions [fDCP+ 03] Centers for Disease Control, Prevention,
on Information Forensics and Security, et al. Hipaa privacy rule and public
11(2):313–327, 2016. health. guidance from cdc and the us de-
partment of health and human services.
[Bro00] Anthony The Guardian Browne. Lives MMWR: Morbidity and mortality weekly
ruined as nhs leaks patients’ notes, 2000. report, 52(Suppl. 1):1–17, 2003.
[FMR10] Yi Fang, Sundar Murugappan, and segmentation of tumor areas in breast
Karthik Ramani. Estimating view pa- histology images. In Medical Image Un-
rameters from random projections for derstanding and Analysis, pages 17–23,
tomography using spherical mds. BMC 2012.
medical imaging, 10(1):12, 2010.
[KKMM12a] K. Kenthapadi, A. Korolova, I. Mironov,
[GLH13] C. Giannella, K. Liu, and Kargupta H. and N. Mishra. Privacy via the Johnson-
Breaching euclidean distance preserving Lindenstrauss Transform. ArXiv e-
data perturbation using few known in- prints, April 2012.
puts. Data and Knowledge Engineering,
2013. [KKMM12b] Krishnaram Kenthapadi, Aleksandra
+ Korolova, Ilya Mironov, and Nina
[HHC 14] Shu-Yu Hsu, Yingchieh Ho, Po-Yao
Mishra. Privacy via the johnson-
Chang, Chauchin Su, and Chen-Yi Lee.
lindenstrauss transform. arXiv preprint
A 48.6-to-105.2 µw machine learning
arXiv:1204.2606, 2012.
assisted cardiac sensor soc for mobile
healthcare applications. IEEE Journal [KLR06] H. Kargupta, K. Liu, and J. Ryan.
of Solid-State Circuits, 49(4):801–811, Random projection-based multiplicative
2014. data perturbation for privacy preserving
[HXY+ 16] Jian-Hua Huang, Hua-Lin Xie, Jun distributed data mining. IEEE Transac-
Yan, Dong-Sheng Cao, Hong-Mei Lu, tions on Knowledge & Data Engineering,
Qing-Song Xu, and Yi-Zeng Liang. 18:92–106, 01 2006.
Correction: Interpretation of type
2 diabetes mellitus relevant gc-ms [LF12] Li Liu and Paul Fieguth. Tex-
metabolomics fingerprints by using ran- ture classification from random features.
dom forests. Analytical Methods, IEEE Transactions on Pattern Analy-
8(8):1950–1951, 2016. sis and Machine Intelligence, 34(3):574–
586, 2012.
[JDXC+ 14] Lihong Jiang, Li Da Xu, Hongming Cai,
Zuhai Jiang, Fenglin Bu, and Boyi Xu. [LGK06] Kun Liu, Chris Giannella, and Hillol
An iot-oriented data storage framework Kargupta. An attacker’s view of dis-
in cloud computing platform. IEEE tance preserving maps for privacy pre-
Transactions on Industrial Informatics, serving data mining. In Proceedings of
10(2):1443–1451, 2014. the 10th European Conference on Prin-
ciple and Practice of Knowledge Discov-
[JL84] William B Johnson and Joram Linden- ery in Databases, PKDD’06, pages 297–
strauss. Extensions of lipschitz map- 308, Berlin, Heidelberg, 2006. Springer-
pings into a hilbert space. Contemporary Verlag.
mathematics, 26(189-206):1, 1984.
[Liu07a] Kun Liu. Multiplicative Data Perturba-
[JNY07] Yu-Gang Jiang, Chong-Wah Ngo, and
tion for Privacy Preserving Data Min-
Jun Yang. Towards optimal bag-of-
ing. PhD thesis, University of Maryland,
features for object categorization and se-
2007.
mantic video retrieval. ACM, 2007.
[Jol17] Arnaud Joly. Exploiting random pro- [Liu07b] Kun Liu. Multiplicative Data Perturba-
jections and sparsity with random tion for Privacy Preserving Data Min-
forests and gradient boosting methods– ing. PhD thesis, University of Maryland,
application to multi-label and multi- 2007.
output learning, random forest model
compression and leveraging input spar- [LJJ+ 09] Ronan A Lyons, Kerina H Jones, Gareth
sity. arXiv preprint arXiv:1704.08067, John, Caroline J Brooks, Jean-Philippe
2017. Verplancke, David V Ford, Ginevra
Brown, and Ken Leake. The sail data-
[KEDR12] Adnan Mujahid Khan, Hesham El-Daly, bank: linking multiple health and social
and Nasir Rajpoot. Ranpec: Random care datasets. BMC medical informatics
projections with ensemble clustering for and decision making, 9(1):3, 2009.
[LKR06] Kun Liu, Hillol Kargupta, and Jes- Seventeenth ACM SIGACT-SIGMOD-
sica Ryan. Random projection-based SIGART Symposium on Principles of
multiplicative data perturbation for pri- Database Systems, PODS ’98, pages
vacy preserving distributed data mining. 188–, New York, NY, USA, 1998. ACM.
IEEE Transactions on knowledge and
Data Engineering, 18(1):92–106, 2006. [Tan16] Colin Tankard. What the gdpr means for
businesses. Network Security, 2016(6):5–
[LL15] In Lee and Kyoochun Lee. The inter- 8, 2016.
net of things (iot): Applications, invest-
ments, and challenges for enterprises. [Wik14] Suanu Bliss Wikina. What caused the
Business Horizons, 58(4):431–440, 2015. breach? an examination of use of in-
formation technology and health data
[MP99] George D Magoulas and Andriana breaches. Perspectives in health infor-
Prentza. Machine learning in medi- mation management, 11(Fall), 2014.
cal applications. In Advanced Course
on Artificial Intelligence, pages 300–307. [XLZW16] Haozhe Xie, Jie Li, Qiaosheng Zhang,
Springer, 1999. and Yadong Wang. Comparison
among dimensionality reduction tech-
[MSDPC12] Daniele Miorandi, Sabrina Sicari, niques based on random projection for
Francesco De Pellegrini, and Imrich cancer classification. Computational bi-
Chlamtac. Internet of things: Vision, ology and chemistry, 65:165–172, 2016.
applications and research challenges. Ad
hoc networks, 10(7):1497–1516, 2012. [ZWC+ 09] Evangelia I Zacharaki, Sumei Wang,
Sanjeev Chawla, Dong Soo Yoo, Ronald
[New09] Natalie Privacy Law Blog Newman. Net- Wolf, Elias R Melhem, and Christos
flix sued for largest voluntary privacy Davatzikos. Classification of brain tu-
breach to date, 2009. mor type and grade using mri tex-
ture and shape in a machine learn-
[NS08] Arvind Narayanan and Vitaly ing scheme. Magnetic resonance in
Shmatikov. Robust de-anonymization medicine, 62(6):1609–1618, 2009.
of large sparse datasets. In Security
and Privacy, 2008. SP 2008. IEEE
Symposium on, pages 111–125. IEEE,
2008.
[Pal05] Mahesh Pal. Random forest classifier
for remote sensing classification. In-
ternational Journal of Remote Sensing,
26(1):217–222, 2005.
[RvdM14] J Rivera and R van der Meulen. Gart-
ner says the internet of things will trans-
form the data center. Retrieved August,
5:2014, 2014.
[SDG+ 14] Beata Strack, Jhonathan P. Deshazo,
Chris Gennings, Juan L. Olmo, Sebas-
tian Ventura, Krzysztof J. Cios, and
John N. Clore. Impact of hba1c mea-
surement on hospital readmission rates:
Analysis of 70,000 clinical database pa-
tient records. BioMed Research Interna-
tional, 2014.
[SS98] Pierangela Samarati and Latanya
Sweeney. Generalizing data to provide
anonymity when disclosing informa-
tion (abstract). In Proceedings of the