<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Random Pro jection to Preserve Patient Privacy</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Aris Anagnostopoulos Fabio Angeletti Sapienza University of Rome Sapienza University of Rome</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1998</year>
      </pub-date>
      <abstract>
        <p>With the availability of accessible and widely used cloud services, it is natural that large components of healthcare systems migrate to them; for example, patient databases can be stored and processed in the cloud. Such cloud services provide enhanced exibility and additional gains, such as availability, ease of data share, and so on. This trend poses serious threats regarding the privacy of the patients and the trust that an individual must put into the healthcare system itself. Thus, there is a strong need of privacy preservation, achieved through a variety of di erent approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>In this paper, we study the application of
a random projection-based approach to
patient data as a means to achieve two goals:
(1) provably mask the identity of users
under some adversarial-attack settings, (2)
preserve enough information to allow for
aggregate data analysis and application of
machinelearning techniques. As far as we know, such
approaches have not been applied and tested
on medical data. We analyze the
tradeo between the loss of accuracy on the
outcome of machine-learning algorithms and the
resilience against an adversary. We show
that random projections proved to be strong
Copyright © CIKM 2018 for the individual papers by the papers'
authors. Copyright © CIKM 2018 for the volume as a collection
by its editors. This volume and its papers are published under
the Creative Commons License Attribution 4.0 International (CC
BY 4.0).
against known input/output attacks while
offering high quality data, as long as the
projected space is smaller than the original space,
and as long as the amount of leaked data
available to the adversary is limited.
1</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>During the recent years, we witnessed the tremendous
progress made in the eld of wireless sensor networks.
This paved the way and facilitated the wide
adoption of small electronic devices with interconnection
capabilities. These devices composed the majority of
the so-called Internet of Things (IoT). The IoT is a
highly dynamic and radically distributed networked
system, composed of an incredible high number of
objects [MSDPC12]. It is vastly considered as one of the
most expanding area within future technologies and
it is attracting vast attention in di erent industry
applications [LL15], ranging from smart cities to home
automation, farming, and many more elds of
application. Ubiquitous sensors, smart objects and devices
involved in IoT can generate a tremendous amount of
data [JDXC+14]. This ow of data requires robust,
available and fast storage solutions and builds the
bases to very e ective and powerful algorithms in the
elds of machine learning and data mining [CDW+15].</p>
      <p>Electronics health-care solutions and, generally
speaking, the Internet of Health Things (IoHT)
follows the same trend. About 73% of healthcare
executives say that IoHT is a disrupting techology for the
next years and it is becoming one of the most funded
areas in IoT. Pervasive IoHT enables cost savings for
both the administrations and the individuals, but on
the other hand it has some barriers, like: privacy and
security concerns, lack of skilled workers, poor
interoperability and more [acc17].</p>
      <p>Given the sensitive nature of healthcare data, there
is a strong need to protect the information of the
patients. Furthermore, the recent adoption of
General Data Protection Regulation (GDPR)
strengthens data protection and now it must be applied to
any organisation or individual that collects and
processes information related to EU citizens, regardless
where the data is physically stored or where they are
based[Alb16, Tan16]. At the same time, analysis of
such data are crucial for medical research and the drug
industry. Consequently, there is a need to design
approaches that allow data processing without exposing
the personal underlying information.</p>
      <p>For this reason, there has been a series of
techniques for perturbing data such that information on
individual data points cannot be leaked, while
aggregate information is preserved. Examples of such
approaches are k-anonymity [SS98] and di erential
privacy [Dwo06]. The various approaches put di erent
importance on the privacy requirements; for instance,
di erential privacy attempts to alter the data such as
to provide very strong privacy guarantees, typically,
without specifying the usefulness of the resulting data.
for general-purpose data analysis.</p>
      <p>In this paper we apply a method, which can be
found in [Liu07a], where the explicit goal is to
obtain a dataset that remains useful after the
perturbation (still providing some privacy guarantees). More
speci cally, our approach is based on random
projections (RP), a technique that is typically applied
primarily for e ciency reasons. It is based on a
fundamental result from the work of Johnson and
Lindenstrauss [JL84] and the idea is the following:
Assume that we have large amounts of data (n
datapoints), lying on a high-dimensional space. Then, if
we project each point to a random subspace of
dimension O(log n= 2), with high probability all pairwise
distances between the data points are preserved within a
factor of 1 . This technique has found multiple
applications in streaming algorihtms, in nding nearest
neighbors in high-dimensional spaces, in reducing the
dimension of databases, and so on.</p>
      <p>Our idea of applying a random projection approach
on privacy-preserving data mining is the following.
Assume that we have the records of multiple patients.
Then we can consider a random projection of these
data. The result of Johnson and Lindenstrauss
guarantees that if we execute algorithms that depend on
the pairwise distances on the data (e.g., several
clustering or classi cation algorithms), then the results
obtained are with high probability similar to those
obtained on the real data set (and the error can be
quanti ed). Furthermore, because the projections are
random, one cannot use the projected data to obtain the
real data: each datapoint appears to be random. This,
unless the attacker has some signi cant power. This
trade o is studied in previous works (e.g. [BBM16],
[Liu07a]).</p>
      <p>It is not clear a priori that this approach could work
in the application on medical dataset that interests us.
For instance, the lemma by Johnson and Lindenstrauss
is typically applied on settings where the original data
lie on a very high-dimensional space. However, in
practice, the original dimension may be low (for instance
in our dataset it is about 50). In this paper we look
at this and other issues by applying the random
projection to a dataset containing information about 70K
cases of diabetes [SDG+14]. We show that it is
possible to reduce the dimensionality of the data and still
obtain accuracy scores that are comparable with the
ones obtained from the original non-projected data.
At the same time, we also show how sensible and
private patient information such their age or gender are
safe against attacks that try to reconstruct the
mapping between the original data and the projected data
after applying random projections.</p>
      <p>Structure of the paper. The rest of this paper
is organized as follows. In Section 2 we present
current solutions and the state of the art on random
projections and other privacy-preserving techniques. In
Section 3 we present the goals of our work and the
approach we used for achieving them, leveraging random
projections. In Section 4 we show our experimental
results, where we explore the limits of our approach
both in terms of accuracy and privacy protection. We
conclude in Section 5 where we also propose future
work.
2</p>
    </sec>
    <sec id="sec-3">
      <title>State of the art</title>
      <p>Data leakages are very common [Wik14]. In this work,
we are more interested in reducing the ability of an
attacker to reconstruct non-yet-leaked data from the
leaked one. Within medical premises, there are
multiple individuals who could obtain access to protected
information from the rightous doctor to untrustful
workers. This could lead to multiple entities knowing
protected personal health information.</p>
      <p>Before 2003, with the enforcing of HIPAA rules,
some private medical information were regularly
shared among professional [Bro00]. Following the
guidelines from Health Insurance Portability and
Accountability Act (HIPAA) [fDCP+03], the US
government made the rst concrete attempt to mitigate the
chance of re-identi cation of patients. In 2009, it was
clear that HIPAA is not su cient to protect the
privacy of an individual. In fact, the HIPAA was not
able to protect the user personal information after the
anonimization process that substituted HIPAA
parameters with IDs. In a famous case [New09], some
researchers were able to re-identify users and also their
sexual orientation and other information. Moreover,
the availability of correlated data (coming from the
same source or other sources) could greatly help to
identify a patient. Data breaches continue to increase
year after year, between 2005 and 2014, only in the
US, more than 26 million of people had some form of
personal health information leaked [Wik14].</p>
      <p>Therefore, more elaborate tecniques, which add
noise to the data, have been developed in the last years
to protect users' privacy and still maintain a good level
of accuracy when exploring and analyzing the data.
One of these is di erential privacy [Dwo06]. This
approach focuses on providing statistically coherent
responses querying a database, i.e. third parties are
interested to query for information about a sample of
a population, not a single individual. Instead, we are
interested also in providing data about a speci c
individual, for example investigating if he or she is suitable
for a clinical trial.</p>
      <p>In [KKMM12a] the authors proved that the
Johnson-Lindenstrauss transform can be used as an
alternative approach to achieve di erential privacy. The
method is then compared against other techniques,
such as adding Gaussian noise to data or
randomized response. The proposed approach has superior
accuracy bounds than the others, while still keeping
secure the privacy of the records. The authors also
criticize the work of Liu et al. about releasing data
to third parties after applying random projections in
order to protect sensitive information while still
preserving accuracy of di erent data mining algorithms:
an adversary that has some background knowledge can
infer approximations of the original data. We address
this issue in the scenario of known input-output data
(section 3.2) and show how in real world scenario
regarding medical data, under reasonable assumptions
for the power of the adversary, it is di cult for an
attacker to discover private information from projected
records.</p>
      <p>In the literature there exist a very large number of
works regarding the re-identi cation of person
starting from various data, within some degree it is called
\breaking the k-anonimity". For example in [NS08]
the authors presents a method to re-identify a user
from its preferences.</p>
      <p>In this paper, we aim at investigating to what
extent RPs can provide useful data for machine learning
algorithms (e.g. classi cation) on a group of potential
patients while preserving at the same time the privacy
of individuals. RPs have been employed in a number of
healthcare applications, for example to segment tumor
areas [KEDR12], to enhance tomography [FMR10], to
cluster DNA microarrays[AV09] or to classify cancer
[XLZW16].</p>
      <p>In [LKR06], RPs were used to mask clear data
projecting them in smaller spaces, while in [BBM16] and
[KKMM12b], similarly to our work, the authors
discuss how to exploit RPs to enhance data privacy. The
authors in [LKR06] also discuss the utility of the RP
in reducing complexity of problems while maintaining
the usefulness of the projected data for algorithms. It
is anticipated that by 2020 there will be more than
26 billion devices involved in IoT related applications
[RvdM14]. Surely, not all of them will be part of
the healthcare eld, however we expect a very large
amount of information to process. The usefulness of
RP in reducing problem complexity (or resource
requirements) is well understood and exploited as
useful resource in the literature [CEM+15, LKR06, FB03,
BZD10, AWY+99]. For example, in [FB03], the
authors explore some ways to reduce high dimensional
data for clustering while, in [LF12], is presented a work
on classi cation of small patches of images from a very
large database that takes advantage of the properties
o ered by RPs.</p>
      <p>During the last two decades, the contribution of
machine learning and data mining algorithms in
healthcare applications became more frequent year after
year. This is well demonstrated in the literature,
for example in [CHH+17, MP99, ZWC+09, HHC+14].
One last aspect to consider is the chance to link
together multiple datasets. For example in [LJJ+09], the
authors presented the infrastructure of a databank in
order to enable record-linkage research studies. This
linkage on one hand could deeply help the development
of newer treatments or drugs, but on the other hand
poses threats to the privacy of the individuals.
3</p>
    </sec>
    <sec id="sec-4">
      <title>Problem Formulation</title>
      <p>We consider a reference scenario in which a group of
users, characterized by private features, are potentially
suitable for a clinical trial. Only a limited number of
users in the group will be actually enrolled in the trial.
For the enrolled users, namely the patients, the private
features will be eventually made public to participate
to the clinical trial in the most e ective way. Some
knowledge on the group is of primary importance for
the researchers to understand the size and the
characteristics of potential patients. In general, users are
well disposed to support this need of the researchers
provided that their privacy is preserved. The main
problem we want to address in this paper is:</p>
      <p>Can we learn something on the group of users as a
whole, while preserving the privacy of the individuals
who will not participate in the trial?</p>
      <sec id="sec-4-1">
        <title>We now elaborate on these two dimensions.</title>
        <p>More formally, we consider a group of n users, where
each user u is characterized by m features. We
represent the corresponding dataset as a matrix X 2 Rm n,
with m rows (the features) and n columns (the users).
As already observed, in the era of big data, m and n
can be particularly big.</p>
        <p>Giannella et al. [GLH13] show how it is possible to
break the privacy in some contexts of distance
preserving mappings. Liu [Liu07b] instead, highlights how
mappings that do preserve distances within certain
bounds like random projections can boost the privacy
guarantees. We will apply these techniques in order
to prove that users' privacy can be kept safe against
malicious attackers.</p>
        <p>We are interested in understanding to what
extent the random projection technique, which has been
originally conceived to reduce the dimensionality of a
dataset, can also be used to preserve the privacy of the
users. In particular, we apply a random projection to
X, such that if R 2 Rk m is the random-projection
matrix Y = RX is the transformed matrix after
applying the random projection, with Y 2 Rk n. We
denote by xiu the column in X associated to user ui,
and with yiu the corresponding column in Y . In the
scenario we are describing the projected matrix Y is
known to the public, it is indeed the dataset on which
researchers try to distill information on the group; the
transformation matrix R and the original data X are
private. Some columns of X may become public once
the corresponding users will eventually decide to
participate to a clinical trial, in other words some pairs
(xiu; yiu) will become public.</p>
        <p>We can now better describe the problem, splitting it
into two sub-problems:
Accuracy. Can we learn something on the group
exploiting Y ? Here we want to understand if the
results of some machine-learning algorithms on Y
are a good approximation of the ones obtained on
X. If we answer positively to this question, we
can at least conclude that what can be learned
from the original data can be also learned from
the projected data.</p>
        <p>Privacy. Can we preserve the privacy of the
individuals that will not participate in the trial? As
already observed, Y is public whereas only some
columns of X will eventually become public when
the corresponding users will decide to
participate in a clinical trial. Consequently, some pairs
(xiu; yiu) will become public. Here we want to
understand if an attacker knowing Y and the some
pairs (xiu; yiu) can possibly know something about
the other users that do not participate in the trial.
3.1</p>
        <p>Accuracy
Lemma 3.1 provides a technique to generate a
lowdimensional representation of the original data
maintaining the pairwise distance within an error . Since
the pairwise distance is the key ingredient for many
classi cation tasks performed by machine learning
algorithms, this property allow us to have some
guarantees that the solution found in the low-dimensional
space is a good approximation of the solution in the
original and higher dimensional space. Furthermore,
reducing the size of the input data speeds-up the
execution time of the algorithms and limits the amount
of resources needed.</p>
        <p>Lemma 3.1 (Johnson and Lindenstrauss) Given &gt;
0 and an integer n let k be a positive integer such that
k k0 = O( log2(n) ). For every set P of n points in
Rm there exists a mapping f : Rm ! Rk such that for
all u; v 2 P
(1
) k u
v k2
k f (u)
f (v) k2
(1 + ) k u
v k2</p>
        <p>It can be proved that a random projection, is a
mapping f that ful lls the previous lemma with positive
probability. This is often referred as JL-embeddings.
3.2</p>
        <p>Privacy: Known Input{Output Attack
We now try to answer one of the questions we raised
in the previous section: Can a a malicious third party
who knows some pairs (xiu; yiu) (i.e. that a particular
record xiu is associated to yiu after its projection) learn
information about other records?</p>
        <p>Liu in his Ph.D. thesis [Liu07b] describes a Bayes
privacy model to measure the privacy o ered by a
perturbation technique. The model considers the
attacker's apriori and a posteriori beliefs about the data
and uses Bayesian inference to evaluate the privacy.
For completeness, we repeat his framework here.</p>
        <p>Let x be the unknown private data, y the perturbed
data and the attacker's additional knowledge about
the data. Then the MAP estimate of x given y and
is
x^MAP (y; ) = arg max fxjy; (xjy; )
x
with fxjy; the conditional probability density of x
given y and .</p>
        <p>Let Xp denote the rst p columns of X and Xn p
the remaining columns. We de ne similarly Yp and
Yn p. We further assume that the columns of Xp are
linearly independent and that Xp is known to the
attacker (i.e., the attacker has full knowledge of p
patients). Y is entirely known to the attacker, because
as we stated before, it is publicly available to conduct
experiments on the projected data.</p>
        <p>For the next reasoning the following hypothesis
must be veri ed:</p>
        <p>The original data arose from as a sample from a
matrix variate distribution.</p>
        <p>The projection matrix R is a k m random
matrix with each entry indipendent and identically
distributed with 0 mean and unit variance. R has
a matrix variate Gaussian distribution with mean
matrix M = 0 and covariance matrix = Ik In.1
Y has a matrix variate Gaussian distribution with
mean matrix M = 0 and covariance matrix =
Ik k1 XT X</p>
        <p>The attacker will try to produce x^i , with 1 i
m p, such that x^i is a good estimate of the undisclosed
private record xi. In other words the attacker's target
is to try to give an estimation of one of the records
contained in Xn p, given that he knows the records in
Xp and their randomly projected counterpart in Yp.</p>
        <p>We now derive the MAP estimate of x given y = Rx
and the known matrices Xp and Yp
1
x^MAP (y; ) = arg max fxjy; (x = xj p
x k
1
Rx = y; p
k
which can be simpli ed in
1
arg max fx;y; ( p
x k</p>
        <p>RX = Y )
1
arg max fx;y( p
x k</p>
        <p>1
arg max f p1k RZjZ ( p
x k</p>
        <p>RX = Y ) =</p>
        <p>RZ = Y jZ = X)fZ (Z = X)
We want to maximize this function in order to solve
the problem of nding the best estimate of x given the
observation of Xp.</p>
        <p>Liu proposes an algorithm to estimate the
nondisclosed records of a certain dataset. Experimental
results have shown that while decreasing the number
of column records known to the attacker (denoted by
p) the relative error of the estimation increases. The
error in the estimation increases also decreasing the
dimensionality of the projected subspace (denoted by
k). In particular the algorithm uses the Nelder{Mead
simplex algorithm to nd the optimal solution of the
maximization problem.
4</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental results</title>
      <p>In this section, we present experimental results
obtained on a dataset containing information about
70000 cases of diabetes diagnosticated in 130 US
hospitals during the decade 1999-2008 [SDG+14] 2. From
now on we will refer to this dataset as the diabetes
dataset.</p>
      <p>We focus on the classi cation of patients based
on their privatized data. Following the work in
[DMS+17, HXY+16], we choose to use random forest
classi er in our dataset to classify the users.
MoreRXp =oYvpe)r, from the work in [Jol17], we know that random
forest classi ers works really well with random
projections. In Figure 1 we report the e ectiveness in
terms of accuracy running the random forest
classier [Pal05] on the original data and on the projected
data in multiple lower dimensional spaces. To run and
validate the classi cation algorithm, we divided the
whole dataset into two parts: train and test. In the
dataset we decided to predict the range of glucose level
in the blood. So that, the algorithm was rstly trained
with the records within the train part of the diabetes
dataset, providing all the target values. Thus, we made
the random forest classi er algorithm predicts the
target values in the test part giving its features as
input. Moreover, we tested the e ectiveness of RPs also
with k-nearest neighbors (k-NN) classi er, the results
were reported in Figure 2. Our approach was inspired
by [AC06]. The results are quite di erent because in
the rst experiment we taken a feature of the dataset
(the range of glucose level in the blood) as the value to
predict, instead with the second experiment we choose
to run rstly a kMeans clustering algorithm (on the
whole dataset) to obtain labeled groups and then, with
the k-nearest neighbors (k-NN) classi er we predicted
the values.</p>
      <p>The blue line represents the accuracy of the machine
If we assume that fZ is distributed uniformly over an
interval, we nally get
1
x^MAP (y) = arg max f p1k RZjZ ( p
x k</p>
      <p>RZ = Y jZ = X)</p>
      <p>In [Liu07b, Theorem 5.3.8] is shown that the
probability density function we obtained has the following
form
(2 ) 12 k(p+1)det( 1 XT X) 12 ketrf
k
1 indicates the Kroenecker product of two matrices [Liu07b]
where X = [xXp] and Y = [yYp].</p>
      <p>We further suppose that the attacker has no other
background knowledge about the private data, so we
can assume that = 0.</p>
      <p>The previous result can be written as
2The dataset is called \Diabetes 130-US
hospitals for years 1999{2008 Data Set" and is available at
https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+y
learning algorithm on the original data. The orange
line, instead, represents the accuracy of the same
algorithm on the projected (obfuscated) data. We tested
the classi cation algorithm on projected spaces in
different sizes, starting from only 2 components up to 10
components.</p>
      <p>The lines plotted in Figure 1 presents the average
values for each projection space, while the vertical
wiskers represent the con dence interval
corresponding to a speci c projection space. For the baseline
(classi cation on the clear data) we ran the classi
cation algorithm 50 times, in each round starting from
a random state of the random forest classi er. Since
the original data is not projected into any space, we
have only a baseline with the associated mean value
and con dence interval. Thus, we reported the
condence interval only at the lefties part of line using
wisker again. Instead, for the accuracy of classi
cation on the projected data, we ran the algorithm more
than 100 times. In each round the algorithm generated
a value for each projected space. The results were
obtained using the scikit-learn package on Python 3.6.</p>
      <p>In [KLR06, LGK06] the authors explore the security
of such techniques: they show how it is possible to use
data dimensionality reduction techniques to lower the
complexity of data mining algorithms while preserving
their accuracy and how those techniques preserve the
privacy of users.</p>
      <p>The authors start from the same privacy hypothesis
we have presented in 3.2 and study how an attacker in
possession of a collection of linearly independent
private data records and their corresponding transformed
part can gather some insight about other records.</p>
      <p>We present the results we got running the algorithm
of [Liu07b] on this dataset. After choosing a number p
of record pairs (xp; yp) we select a record x for which
we do not know the mapping; the algorithm we are
using will try to give an estimation x^ of the original
record x.</p>
      <p>We used two techniques to evaluate how similar to
the original records the algorithm's estimations were.
We measured the distance between the estimation x^
provided by the algorithm and the original record x.
We compute the relative error between the two vectors
with the following:</p>
      <p>E(x; x^) = jjx
jjxjj2
x^jj2
The error E increases with the Euclidean distance
between the two. Notice that with this notation it may
happen that the error is greater than one: this could
verify in the case that the distance between x and its
estimations x^ is high and the norm of x is a small value.
This could happen if the algorithm's estimation is very
far o from the original record.</p>
      <p>This measuring has the drawback to lack an upper
bound for the dissimilarity. Neither the cosine
similarity helps, since in our case we are not interested only
in the direction of vectors but also in their magnitude.</p>
      <p>A solution is provided in [JNY07], where a radial
basis function kernel can be used for representing
similarities: we are going to use 1 edist1(x;x^) as a
similarity function between x and its estimation x^, where
dist(x; x^) = jjx x^jj2. The bigger the Euclidean
distance between two vectors, the bigger the error
edist(x;x^) will be. In this way we have a [0; 1) bound
for the similarity of the estimations. By applying the
inverse we get a value in the range [0,1): if x and x^
are the same vector (perfect reconstruction performed
1
by the algorithm) then edist(x;x^) = 1.</p>
      <p>Our workplan is the following: for every subspace
of dimensionality k we apply the algorithm with di
erent knowledge about the number of pairs (xp; yp) the
attacker knows. We go from p = k 1 to p = 1. In the
next gures we display the results of our experiments,
with the two di erent measuring techniques we used
to quantify the similarity between the original records
and the estimated ones. We report the mean of the
errors for every pair (k; p) and the variance. On the
X axis are placed the tuples (k; p) for which we have
conducted the experiments, on the Y axis we placed
the reconstruction errors.</p>
      <p>On low-dimensionality subspaces we get a high
relative error, meaning that it is not possible to give an
e ective approximation of the original (private) data
records. In higher dimensions the approximation is
closer to the original data. We ran our experiments
with 10 features of the dataset, since with vectors of
higher dimensionality it becomes more di cult to run
the reconstruction algorithm in reasonable times; also
with higher dimensionalities the algorithm we are
using outputs vector reconstructions that are very
dissimilar from the original ones.</p>
      <p>We applied the random projection to reduce the
feature space in di erent dimensions, from 10 to 3.
Notice, however, that even when the projected space has
the same dimension of the original space, we already
get a signi cant relative error, meaning that on the
average it is not possible for the attacker to extrapolate
any useful information about the patients' records. So
for records of higher dimensionality there is already a
safe privacy bound when applying random projection
to them, at least against this kind of attacks.</p>
      <p>We assigned an increasing numerical value to
nominal features, that is, we assigned 0 to the text male
and 1 to text female in the gender feature.</p>
      <p>We applied random projection to this records, from
k = 10 (no dimensionality reduction) to k = 3; the
number p of pairs (original record, projected record)
known to the attacker is in the range k 1 p 1.</p>
      <p>With k = 2 we obviously have only p = 1: we
omit this result since it is not meaningful with respect
the other results we get for higher k and p, because it
does not show how knowing less (or more) information
about the original data changes the reliability of the
reconsturction we get.</p>
      <p>In the next gures we show the mean and variances
of the errors for every tuple (k; p) for which we have
conducted the experiment. It can be seen from the
charts that as the number of known input-output pairs
p decreases, the reconstruction error increases.
Together with the dimensionality reduction, disclosing a
scarce number of known input-output pairs can help
with the task of preserving the privacy of users
involved in clinical trials.</p>
      <p>In this case we are projecting low dimensionality
vectors (k = 10) but we still get high reconstruction
errors when applying the techniques we have explained.
This is another con rmation of the thesis that random
projections help keep the privacy of users when their
information is shared among research institutes.
5</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>In this work, we applied an random-projections
approach to privacy-preserving data mining of medical
data.</p>
      <p>First we demonstrated the usefulness of RP in
increasing privacy of personal health data. The
projected data are useful for machine learning algorithms
(for example, in clustering) while allows the sharing of
information between parties without revealing the
patients' clear data. In this particular application, this
is of notable importance since allows entities involved
in di erent health branches to cooperate e ectively
without sharing clear data. Second, we investigated
to what extent an attacker can discover additional
information starting from leaked data. As long as the
projected space is smaller than the original space, and
as long as the amount of data leaked is small, than the
proposed approach is robust and mantains very good
performance in both accuracy and privacy.</p>
      <p>We analyzed the ratio behind and the performances
(in terms of accuracy) of the RP applied on sensible
healthcare data. The results shows that the use of RP
o ers great enhancements in privacy protection. This
was a rst step into developing a full- edged platform
that allows the e ective share of medical data. In
future we are planning a bigger real-world deployment
of such platform to further validate our results, plus
an audit to check privacy protection against real third
parties.
[AC06]</p>
      <p>Anthony The Guardian Browne. Lives
ruined as nhs leaks patients' notes, 2000.</p>
      <sec id="sec-6-1">
        <title>Feng Chen, Pan Deng, Jiafu Wan,</title>
        <p>Daqiang Zhang, Athanasios V Vasilakos,
and Xiaohui Rong. Data mining for the
internet of things: literature review and
challenges. International Journal of
Distributed Sensor Networks, 11(8):431047,
2015.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Michael B Cohen, Sam Elder, Cameron</title>
        <p>Musco, Christopher Musco, and
Madalina Persu. Dimensionality
reduction for k-means clustering and low rank
approximation. In Proceedings of the
forty-seventh annual ACM symposium
on Theory of computing, pages 163{172.
ACM, 2015.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Min Chen, Yixue Hao, Kai Hwang,</title>
        <p>Lu Wang, and Lin Wang. Disease
prediction by machine learning over big data
from healthcare communities. IEEE
Access, 5:8869{8879, 2017.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Sarah DuBrava, Jack Mardekian, Alesia</title>
        <p>Sadosky, E Jay Bienen, Bruce Parsons,
Markay Hopps, and John Markman.
Using random forest models to identify
correlates of a diabetic peripheral
neuropathy diagnosis from electronic health
record data. Pain Medicine, 18(1):107{
115, 2017.</p>
      </sec>
      <sec id="sec-6-5">
        <title>Cynthia Dwork. Di erential privacy.</title>
        <p>In Proceedings of the 33rd
International Conference on Automata,
Languages and Programming - Volume Part
II, ICALP'06, pages 1{12, Berlin,
Heidelberg, 2006. Springer-Verlag.</p>
      </sec>
      <sec id="sec-6-6">
        <title>Xiaoli Z Fern and Carla E Brodley.</title>
        <p>Random projection for high dimensional
data clustering: A cluster ensemble
approach. In Proceedings of the 20th
International Conference on Machine
Learning (ICML-03), pages 186{193, 2003.</p>
      </sec>
      <sec id="sec-6-7">
        <title>Centers for Disease Control, Prevention, et al. Hipaa privacy rule and public health. guidance from cdc and the us department of health and human services.</title>
        <p>MMWR: Morbidity and mortality weekly
report, 52(Suppl. 1):1{17, 2003.</p>
      </sec>
      <sec id="sec-6-8">
        <title>Shu-Yu Hsu, Yingchieh Ho, Po-Yao</title>
        <p>Chang, Chauchin Su, and Chen-Yi Lee.
A 48.6-to-105.2 w machine learning
assisted cardiac sensor soc for mobile
healthcare applications. IEEE Journal
of Solid-State Circuits, 49(4):801{811,
2014.</p>
      </sec>
      <sec id="sec-6-9">
        <title>Jian-Hua Huang, Hua-Lin Xie, Jun</title>
        <p>Yan, Dong-Sheng Cao, Hong-Mei Lu,
Qing-Song Xu, and Yi-Zeng Liang.
Correction: Interpretation of type
2 diabetes mellitus relevant gc-ms
metabolomics ngerprints by using
random forests. Analytical Methods,
8(8):1950{1951, 2016.</p>
      </sec>
      <sec id="sec-6-10">
        <title>Lihong Jiang, Li Da Xu, Hongming Cai, Zuhai Jiang, Fenglin Bu, and Boyi Xu. An iot-oriented data storage framework in cloud computing platform. IEEE</title>
        <p>Transactions on Industrial Informatics,
10(2):1443{1451, 2014.</p>
      </sec>
      <sec id="sec-6-11">
        <title>William B Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26(189-206):1, 1984.</title>
      </sec>
      <sec id="sec-6-12">
        <title>Yu-Gang Jiang, Chong-Wah Ngo, and Jun Yang. Towards optimal bag-offeatures for object categorization and semantic video retrieval. ACM, 2007.</title>
        <p>Arnaud Joly. Exploiting random
projections and sparsity with random
forests and gradient boosting methods{
application to multi-label and
multioutput learning, random forest model
compression and leveraging input
sparsity. arXiv preprint arXiv:1704.08067,
2017.</p>
        <p>Adnan Mujahid Khan, Hesham El-Daly,
and Nasir Rajpoot. Ranpec: Random
projections with ensemble clustering for
segmentation of tumor areas in breast
histology images. In Medical Image
Understanding and Analysis, pages 17{23,
2012.</p>
      </sec>
      <sec id="sec-6-13">
        <title>Ronan A Lyons, Kerina H Jones, Gareth</title>
        <p>John, Caroline J Brooks, Jean-Philippe
Verplancke, David V Ford, Ginevra
Brown, and Ken Leake. The sail
databank: linking multiple health and social
care datasets. BMC medical informatics
and decision making, 9(1):3, 2009.
[LL15]</p>
      </sec>
      <sec id="sec-6-14">
        <title>Pierangela Samarati and Latanya Sweeney. Generalizing data to provide anonymity when disclosing information (abstract). In Proceedings of the</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [MP99] Kun Liu, Hillol Kargupta, and
          <string-name>
            <given-names>Jessica</given-names>
            <surname>Ryan</surname>
          </string-name>
          .
          <article-title>Random projection-based multiplicative data perturbation for privacy preserving distributed data mining</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>IEEE Transactions on knowledge and Data Engineering</source>
          ,
          <volume>18</volume>
          (
          <issue>1</issue>
          ):
          <volume>92</volume>
          {
          <fpage>106</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Business</given-names>
            <surname>Horizons</surname>
          </string-name>
          ,
          <volume>58</volume>
          (
          <issue>4</issue>
          ):
          <volume>431</volume>
          {
          <fpage>440</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>George D Magoulas</surname>
            and
            <given-names>Andriana</given-names>
          </string-name>
          <string-name>
            <surname>Prentza</surname>
          </string-name>
          .
          <article-title>Machine learning in medical applications</article-title>
          .
          <source>In Advanced Course on Arti cial Intelligence</source>
          , pages
          <fpage>300</fpage>
          {
          <fpage>307</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Springer</surname>
          </string-name>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [MSDPC12]
          <string-name>
            <given-names>Daniele</given-names>
            <surname>Miorandi</surname>
          </string-name>
          , Sabrina Sicari, Francesco De Pellegrini, and
          <string-name>
            <given-names>Imrich</given-names>
            <surname>Chlamtac</surname>
          </string-name>
          .
          <article-title>Internet of things: Vision, applications and research challenges</article-title>
          .
          <source>Ad hoc networks</source>
          ,
          <volume>10</volume>
          (
          <issue>7</issue>
          ):
          <volume>1497</volume>
          {
          <fpage>1516</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>[Tan16] [Wik14] [XLZW16] [ZWC+09] [New09] [NS08] [Pal05] [RvdM14] [SDG+14] [SS98] Natalie Privacy Law Blog Newman</source>
          .
          <article-title>Netix sued for largest voluntary privacy breach to date</article-title>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Arvind</given-names>
            <surname>Narayanan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vitaly</given-names>
            <surname>Shmatikov. Robust</surname>
          </string-name>
          de
          <article-title>-anonymization of large sparse datasets</article-title>
          .
          <source>In Security and Privacy</source>
          ,
          <year>2008</year>
          .
          <article-title>SP 2008</article-title>
          .
          <article-title>IEEE Symposium on</article-title>
          , pages
          <volume>111</volume>
          {
          <fpage>125</fpage>
          . IEEE,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Mahesh</given-names>
            <surname>Pal</surname>
          </string-name>
          .
          <article-title>Random forest classi er for remote sensing classi cation</article-title>
          .
          <source>International Journal of Remote Sensing</source>
          ,
          <volume>26</volume>
          (
          <issue>1</issue>
          ):
          <volume>217</volume>
          {
          <fpage>222</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>J</given-names>
            <surname>Rivera and R van der Meulen</surname>
          </string-name>
          .
          <article-title>Gartner says the internet of things will transform the data center</article-title>
          .
          <source>Retrieved August</source>
          ,
          <volume>5</volume>
          :
          <year>2014</year>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Beata</given-names>
            <surname>Strack</surname>
          </string-name>
          , Jhonathan P. Deshazo, Chris Gennings, Juan L.
          <string-name>
            <surname>Olmo</surname>
            , Sebastian Ventura,
            <given-names>Krzysztof J.</given-names>
          </string-name>
          <string-name>
            <surname>Cios</surname>
            ,
            <given-names>and John N. Clore.</given-names>
          </string-name>
          <article-title>Impact of hba1c measurement on hospital readmission rates: Analysis of 70,000 clinical database patient records</article-title>
          .
          <source>BioMed Research International</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Colin</given-names>
            <surname>Tankard</surname>
          </string-name>
          .
          <article-title>What the gdpr means for businesses</article-title>
          .
          <source>Network Security</source>
          ,
          <year>2016</year>
          (
          <volume>6):5</volume>
          {
          <issue>8</issue>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Suanu</given-names>
            <surname>Bliss Wikina</surname>
          </string-name>
          .
          <article-title>What caused the breach? an examination of use of information technology and health data breaches</article-title>
          .
          <source>Perspectives in health information management</source>
          ,
          <volume>11</volume>
          (Fall),
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Haozhe</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jie</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qiaosheng</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yadong</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Comparison among dimensionality reduction techniques based on random projection for cancer classi cation</article-title>
          .
          <source>Computational biology and chemistry</source>
          ,
          <volume>65</volume>
          :
          <fpage>165</fpage>
          {
          <fpage>172</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Evangelia</surname>
            <given-names>I Zacharaki</given-names>
          </string-name>
          , Sumei Wang, Sanjeev Chawla, Dong Soo Yoo, Ronald Wolf,
          <string-name>
            <surname>Elias R Melhem</surname>
            , and
            <given-names>Christos</given-names>
          </string-name>
          <string-name>
            <surname>Davatzikos</surname>
          </string-name>
          .
          <article-title>Classi cation of brain tumor type and grade using mri texture and shape in a machine learning scheme</article-title>
          .
          <source>Magnetic resonance in medicine</source>
          ,
          <volume>62</volume>
          (
          <issue>6</issue>
          ):
          <volume>1609</volume>
          {
          <fpage>1618</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>