=Paper=
{{Paper
|id=Vol-1671/paper6
|storemode=property
|title=Rigorous and Flexible Privacy Models for Utilizing Personal Spatiotemporal Data
|pdfUrl=https://ceur-ws.org/Vol-1671/paper6.pdf
|volume=Vol-1671
|authors=Yang Cao
|dblpUrl=https://dblp.org/rec/conf/vldb/Cao16
}}
==Rigorous and Flexible Privacy Models for Utilizing Personal Spatiotemporal Data==
Yang Cao
Supervised by Masatoshi Yoshikawa
Kyoto University, Japan
soyo@db.soc.i.kyoto-u.ac.jp
Proceedings of the VLDB 2016 PhD Workshop, September 9, 2016. New Delhi, India. Copyright (c) 2016 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

Personal data are the new oil. Vast amounts of spatiotemporal data generated by individuals have been collected and analyzed, such as check-in data, trajectories, web browsing data and timestamped medical records. These personal data are a valuable resource for businesses and also have the potential to provide significant social benefits through sharing and reuse. However, privacy concerns hinder the wider use of these personal data. To this end, dozens of privacy protection models have been proposed for privacy-preserving data publishing (PPDP). ε-differential privacy has emerged as a de facto privacy model for PPDP because of its rigorous theoretical guarantees, but it has also been criticized as impractical in many scenarios. We attempt to design rigorous and flexible privacy models that are able to bridge the gap between theoretical limitations and practical needs. In this article, we first motivate the importance of rigorousness and flexibility for privacy models and then present two privacy models that extend differential privacy in a practical manner.

1. INTRODUCTION

New opportunities are arising to enrich our understanding of the physical world by utilizing personal big data. Such data generated by individuals that possess spatial and temporal properties are called personal spatiotemporal data and include information such as people's movements, search logs, shopping records, social activities and medical histories. Companies such as Google, Amazon and Facebook have created enormous wealth by using such personal data. These data also have the potential to support significant initiatives for the public good, such as innovative services, traffic monitoring, disease surveillance and health promotion. Therefore, it is important to enable the sharing and reuse of personal spatiotemporal data.

However, several well-known cases of privacy leakage have arisen in attempts to provide privacy-preserving data publishing (PPDP) [1]. One of the most famous privacy leakage cases occurred with regard to the Massachusetts Group Insurance Commission (GIC) medical dataset, which contains anonymized medical records and was first made available in 1996 for medical research and public health purposes. Latanya Sweeney [2] linked the anonymized GIC database (which retained the birthday, sex, and ZIP code of each patient) with voter registration records to identify the medical records of the governor of Massachusetts. In 2006, AOL Inc. released more than 650 thousand users' anonymized search logs for research purposes, but researchers [3] found it is possible to de-anonymize some users by mining these timestamped records. A recent privacy leakage case was related to the datasets released for the Netflix Prize competition for recommender algorithms. Researchers [4] verified that the datasets could be de-anonymized by linking them with the IMDB dataset; as a result, Netflix became the target of a lawsuit and was obliged to terminate this well-known competition in 2010. In Japan, the East Japan Railway Company (JR) faced a scandal in 2013 because of the selling of customer data collected through the use of SUICA cards. These privacy leakage cases illustrate the conflict between increasing public concerns about data privacy and the need for personal data publishing. Therefore, privacy issues have become a critical problem in the big data era.

Personal spatiotemporal data are highly sensitive even if they are anonymized. Studies [5, 6] have shown that only four spatiotemporal data points from anonymized mobile datasets and credit card records (with purchase times and locations) are sufficient to uniquely identify 90% of individuals. Another study [7] has demonstrated that it is possible to track smart phone locations simply by monitoring battery consumption data (timestamps and remaining battery life), which do not require the users' permission to be shared. Therefore, because of the unique and rich patterns that are present in personal spatiotemporal data, a rigorous privacy model is needed to utilize personal spatiotemporal data in a privacy-preserving manner.

At the same time, privacy models should be flexible for the following two reasons. First, because privacy protection implies the hiding or perturbing of sensitive information¹, the privacy level should be tunable (to an appropriate level) for different applications to achieve a suitable trade-off between privacy and utility. Second, privacy as a complex social concept should be modifiable in accordance with different social contexts. For example, different users might have different preferences regarding privacy; for a single user, the desired protection levels might also be different at different times and places.

¹ Although some cryptographic approaches (e.g., homomorphic encryption) can be used for PPDP, because they limit the possible data recipients, we here discuss primarily perturbation-based approaches that enable a wider sharing of data.

In the following, we briefly review the background for this research in Section 2 and then present two proposed privacy models in Section 3 and Section 4.

[Figure 1: Raw data and aggregate information. (a) Raw data (rows of the form <uID, time, loc>); (b) trajectory data (one location sequence per user); (c) aggregate data (location counts per timestamp).]
2. BACKGROUND

2.1 Previous Privacy Models

Previous studies have considered two major families of privacy models: k-anonymity [2] and ε-differential privacy [8]. k-anonymity [2] was proposed in 2002 and was the first privacy model to be widely accepted by practitioners. However, Machanavajjhala [9] identified a serious flaw of k-anonymity by demonstrating the possibility of homogeneity attacks and background knowledge attacks and, to resolve this flaw, proposed an improved privacy model based on k-anonymity named ℓ-diversity. However, ℓ-diversity had its own shortcomings, which motivated further improvement to yield subsequent new privacy models, such as t-closeness [10] and m-invariance [11]. Researchers have realized that the essential defect of the k-anonymity family is that the privacy guarantee depends on the background knowledge possessed by adversaries. In 2006, Dwork et al. proposed ε-differential privacy [8], which has received increasing attention because it is guaranteed to be rigorous and mathematically provable. Differential privacy ensures protection against adversaries with arbitrary background knowledge.
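To make the k-anonymity requirement concrete, the following minimal sketch (ours, not from the paper; the table, attribute names and values are hypothetical) checks whether every combination of quasi-identifier values in a released table appears at least k times. It also hints at why homogeneity attacks remain possible: a group can be k-anonymous while still sharing a single sensitive value.

```python
from collections import Counter

def satisfies_k_anonymity(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier combination appears in at least k
    rows, i.e., each record is indistinguishable from at least k-1 others."""
    groups = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical released table: ZIP prefix and birth year are quasi-identifiers.
records = [
    {"zip": "606", "birth_year": 1980, "diagnosis": "flu"},
    {"zip": "606", "birth_year": 1980, "diagnosis": "cold"},
    {"zip": "605", "birth_year": 1975, "diagnosis": "asthma"},
]
print(satisfies_k_anonymity(records, ["zip", "birth_year"], k=2))  # False: the 1975 group has only 1 row
```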
2.2 Differential Privacy

Intuitively, differential privacy guarantees that any single user's data in the database have only a "slight" (bounded by ε) impact on changes to the outputs. The parameter ε that is used to control the privacy level is defined as a positive real number. A lower value of ε corresponds to a lower likelihood of privacy leakage. In the case of a suitably small ε, an adversary cannot associate any piece of information with a specific individual by mining the answers to queries.

Definition 1 (Neighboring Databases [8]). If D is a database and D′ is nearly identical to D but differs by one record, then we say that D and D′ are two neighboring databases.

Definition 2 (ε-Differential Privacy, ε-DP [8]). Let Λ be a randomized algorithm, and let R represent all possible outputs of Λ. The algorithm Λ satisfies ε-DP if the following holds for any r ∈ R and any two neighboring databases D and D′:

    Pr[Λ(D) = r] ≤ exp(ε) · Pr[Λ(D′) = r]    (1)

In the inequality expressed in (1), a positive parameter ε, called the privacy budget, is given in advance. It is used to control the level of privacy protection. A lower value of ε corresponds to higher privacy and more randomness, and vice versa. The global sensitivity (or sensitivity for short) GS(Q) of a query Q is the maximum L1 distance between the query results for any two neighboring databases.

Two widely used methods of achieving differential privacy are the Laplace mechanism [12], which adds random noise to the real data to prevent the disclosure of sensitive information, and the Exponential mechanism [13], which randomly returns the desired item with some calibrated probability.
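As a concrete illustration of the Laplace mechanism, the sketch below (ours, not code from the paper; function and parameter names are illustrative) calibrates the noise scale to GS(Q)/ε, the standard calibration from [12].

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Perturb a (scalar or vector) query answer with Laplace noise of scale
    GS(Q)/epsilon, which suffices for epsilon-DP under Definition 2."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=scale, size=np.shape(true_answer))

# A counting query has global sensitivity 1: adding or removing one record
# changes the count by at most 1.
noisy_count = laplace_mechanism(true_answer=42, sensitivity=1.0, epsilon=0.5)
```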
Below, we summarize two crucial weaknesses of differential privacy that hinder its extensive use in data publishing.

• One size fits all. ε-differential privacy uses a single parameter ε to represent the protection level for all users in a database rather than providing a personalized privacy protection level.

• Data independence assumption. Differential privacy assumes that the tuples in a database are independent; however, real-world data tend to be temporally or spatially correlated.

2.3 Problem Setting

Here, we describe a scenario of publishing aggregate spatiotemporal data: a trusted server collects spatiotemporal data points from users and continuously stores them in a database D_i at each timestamp i ∈ [1, t]. Let t denote the current timestamp. Each data point <uID, time, loc> is a row in D_i (Fig. 1(a)). Suppose that locs is the set of all locations in which we are interested (POIs); then, the server wishes to publish a vector r_i at each timestamp i that consists of the counts of each loc ∈ locs that appears in D_i (Fig. 1(c)), i.e., the answer to the count query Q_c: D_i → R^|locs|. Without loss of generality, we assume that each user appears at only one location at most at each timestamp. We need to transform the aggregate data depicted in Figure 1(c) into a secure form for publishing because they are computed directly from the sensitive raw data. The most straightforward method is to employ the Laplace mechanism [12] (add random Laplace noise) to achieve DP. In accordance with the limitations of DP discussed above, two problems are encountered when applying DP in practice, as follows:

• How can personalized differential privacy be achieved?

• How can differential privacy be guaranteed even if the data are temporally or spatially correlated?

We present two privacy models, ℓ-trajectory privacy (Section 3) and spatiotemporal privacy (Section 4), to solve the above problems.
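Before turning to these models, the baseline pipeline of Section 2.3 can be sketched as follows (a minimal illustration under our own assumptions, not the paper's implementation; the POI list and names are hypothetical). Under Definition 1's add/remove-one-record notion of neighboring databases, adding or removing one data point changes one count by at most 1, so the L1 sensitivity of Q_c is 1.

```python
import numpy as np

LOCS = ["l1", "l2", "l3", "l4"]  # the POIs of interest (illustrative)

def count_vector(D_i, locs=LOCS):
    """Answer Q_c: count how many records of D_i fall in each POI.
    D_i is a list of (uID, time, loc) rows as in Fig. 1(a)."""
    counts = np.zeros(len(locs))
    index = {loc: j for j, loc in enumerate(locs)}
    for _, _, loc in D_i:
        if loc in index:
            counts[index[loc]] += 1
    return counts

def publish(D_i, epsilon, rng=None):
    """Release a noisy version of r_i; the per-timestamp count query has
    L1 sensitivity 1, so Lap(1/epsilon) noise per entry gives epsilon-DP."""
    rng = rng or np.random.default_rng()
    return count_vector(D_i) + rng.laplace(scale=1.0 / epsilon, size=len(LOCS))

D_1 = [("u1", "t1", "l1"), ("u3", "t1", "l1")]
r_1_noisy = publish(D_1, epsilon=1.0)
```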
3. ℓ-TRAJECTORY PRIVACY

In a previous work [14], we defined a new privacy model that we called ℓ-trajectory privacy to guarantee that any ℓ-length trajectories for each user are protected under differential privacy. We formalized our privacy definition, proved its feasibility, and designed efficient mechanisms to implement it. This privacy model is personalized, i.e., different users can specify different lengths of ℓ-trajectories (ℓ successive data points) depending on their privacy requirements, as illustrated in Figure 2(b). A closely related privacy model called w-event privacy [15] protects the data within sliding windows (all data points in w successive timestamps), as shown in Figure 2(a). In the following, we present the definition of the ℓ-trajectory privacy model and related experimental results.

[Figure 2: Illustration of ℓ-trajectory privacy compared with w-event privacy [15]. (a) w-event privacy (w = 3); (b) ℓ-trajectory privacy (ℓ = 3).]
We first define the neighboring databases in our setting.

Definition 3 (ℓ-Trajectory Neighboring Stream Prefixes). Let S_t′ be a near copy of the trajectory stream prefixes S_t that differs in ℓ neighboring databases D_i′. S_t and S_t′ are neighboring ℓ-trajectory stream prefixes if one can be obtained from the other by modifying a single loc or multiple locs in any one ℓ-trajectory ℓ_{u,k} (recall that an ℓ-trajectory is a set of ℓ spatiotemporal data points). We say that S_t and S_t′ are neighboring with respect to ℓ_{u,k}.

Intuitively, ℓ-trajectory privacy attempts to ensure that any ℓ-trajectory of any single user has only a "slight" (bounded by ε) impact on the outputs.

Definition 4 (ℓ-Trajectory ε-Differential Privacy). Let Λ be an algorithm that takes prefixes of trajectory streams S_t = {D_1, ..., D_t} as inputs. Let N_t = {n_1, ..., n_t} be a possible perturbed output stream of Λ. If the following holds for any ℓ-trajectory neighboring S_t and S_t′,

    Pr[Λ(S_t) = N_t] ≤ exp(ε) · Pr[Λ(S_t′) = N_t]    (2)

then we say that Λ satisfies ℓ-trajectory ε-differential privacy (or ℓ-trajectory privacy for brevity).

The following theorem provides insight regarding the implementation of ℓ-trajectory privacy.

Theorem 1. Let Λ be an integrated algorithm that takes stream prefixes S_t = {D_i, i ∈ [1, t]} as inputs and produces N_t = {n_i, i ∈ [1, t]} as outputs. Λ consists of a series of sub-algorithms {A_i, i ∈ [1, t]}, each of which takes D_i as its input and outputs noisy data n_i with independent randomness. Presume that A_i ensures ε_i-DP, and let τ_{u,k} be the set of timestamps dominated by an arbitrary trajectory ℓ_{u,k}; then, Λ satisfies ℓ-trajectory privacy if

    ∀u, ∀k:  Σ_{i ∈ τ_{u,k}} ε_i ≤ ε    (3)

Based on the above theorem, we designed an algorithmic framework for publishing differentially private aggregates that satisfy ℓ-trajectory privacy. The experimental results show that our approach is effective and efficient compared with previous works [16] and [15]. More details can be found in our paper [14].
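Condition (3) suggests a simple runtime check before spending budget at the current timestamp: the per-timestamp budgets on any user's current ℓ-trajectory must sum to at most ε. The sketch below is our own illustration of that bookkeeping (not the allocation mechanism of [14]); it assumes the set τ_{u,k} for each user is already known.

```python
def can_release(spent, tau, epsilon_total, epsilon_i):
    """Check condition (3) of Theorem 1 before spending epsilon_i now.

    spent         -- dict: timestamp -> budget already consumed there
    tau           -- dict: user -> set of timestamps in that user's current
                     l-trajectory (tau_{u,k}), including the current one
    epsilon_total -- the overall budget epsilon of l-trajectory privacy
    epsilon_i     -- the budget sub-algorithm A_i wants to spend now
    """
    for user, timestamps in tau.items():
        already = sum(spent.get(ts, 0.0) for ts in timestamps)
        if already + epsilon_i > epsilon_total:
            return False  # would violate sum over tau_{u,k} of eps_i <= eps
    return True

# Illustrative use: user u1 chose l = 3 and has reported at timestamps 3, 4, 5.
spent = {3: 0.3, 4: 0.3}
tau = {"u1": {3, 4, 5}}
print(can_release(spent, tau, epsilon_total=1.0, epsilon_i=0.3))  # True  (0.9 <= 1.0)
print(can_release(spent, tau, epsilon_total=1.0, epsilon_i=0.5))  # False (1.1 > 1.0)
```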
4. ST-DIFFERENTIAL PRIVACY

In existing studies, traditional DP mechanisms (e.g., the Laplace mechanism) are employed as primitives. These mechanisms assume that the data are independent or that adversaries do not have knowledge of the data correlations. However, data collected from the real world tend to be temporally or spatially correlated, and such correlations could be acquired by adversaries. In this study, we introduce a new class of adversaries named realistic adversaries who have a varying degree of knowledge of the spatiotemporal correlations of the data of interest. We demonstrate that the risk of a DP mechanism may increase over time given the existence of such adversaries. To prevent such privacy leakage, we propose a new privacy definition, ST-differential privacy, and present novel mechanisms to satisfy this privacy definition. In the following, we first demonstrate the problem by means of a motivating example, present the definition of realistic adversaries, and briefly describe measures for protecting against such adversaries. This work has been submitted to an international conference.

4.1 A Motivating Example

The following example (illustrated in Figure 3) shows that the existence of adversaries with knowledge of the spatiotemporal correlations of the data may degrade the expected privacy guarantee of DP.

[Figure 3: The vulnerability of differential privacy to correlated spatiotemporal data. (a) Road network (temporal correlation for a single user); (b) social ties (spatial correlation between users); (c) spatiotemporal data; (d) raw aggregates.]

Example 1. Consider the scenario of spatiotemporal data publishing illustrated in Figure 3. A trusted server continuously collects users' locations at each time point. Our goal is to publish the aggregates shown in Figure 3(d) while protecting against adversaries who may have some knowledge regarding membership in the database. We assume that each user appears at only one location at most at each time point. Then, according to the Laplace mechanism [12], adding Lap(1/ε) noise² to perturb the data can achieve ε-DP. However, from auxiliary information such as road network constraints, an attacker may know the mobility patterns of a targeted victim; for example, the adversary may know that the victim "always arrives at loc5 after visiting loc4," as illustrated in Figure 3(a), which leads to the patterns indicated by the solid lines in Figure 3(c) and (d). Consequently, because of the composability of DP, adding Lap(1/ε) noise guarantees only 2ε-DP against this attacker. We refer to the transition pattern of a single user as temporal correlation. By contrast, when an attacker has the following knowledge (illustrated in Figure 3(b)), privacy leakage will also occur: (1) users u1 and u2 are always at the same location during working hours (e.g., they are colleagues); (2) users u2 and u3 are always at the same location during leisure time (e.g., they are newlyweds). We refer to such geographical similarities between users as spatial correlations. In Figure 3(c) and (d), it is easy to see the resulting patterns, which are illustrated as dashed lines. Therefore, because of the high sensitivity induced by the highly correlated nature of the data, the addition of Lap(1/ε) noise achieves only 2ε-DP.

² Lap(b) denotes a Laplace distribution with variance 2b².
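The 2ε figure in Example 1 can be checked with a small calculation. The sketch below is our own illustration (not from the paper; the helper name is ours): when the known pattern "loc4 then loc5" ties the victim's presence to two released counts, the joint release behaves like a query whose outputs differ by an L1 distance of 2 between neighboring worlds, so independent Lap(1/ε) noise per release only bounds the total privacy loss by 2ε.

```python
import numpy as np

def worst_case_privacy_loss(delta, scale):
    """Upper bound on |log p(output | world A) / p(output | world B)| when
    independent Laplace(scale) noise is added to answers that differ
    coordinate-wise by `delta`; for Laplace noise this is sum(|delta|)/scale."""
    return np.sum(np.abs(delta)) / scale

epsilon = 1.0
scale = 1.0 / epsilon  # Lap(1/epsilon) noise on each released count

# Victim present vs. absent with the known pattern: loc4's count at time t and
# loc5's count at time t+1 both change by 1.
print(worst_case_privacy_loss(np.array([1.0, 1.0]), scale))  # 2.0 -> only 2*epsilon-DP

# Without the known pattern, a single release is affected in at most one count.
print(worst_case_privacy_loss(np.array([1.0]), scale))       # 1.0 -> epsilon-DP
```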
4.2 Realistic Adversaries

To formalize realistic adversaries (RAs) and their impact on privacy leakage, we first study commonly used models for describing temporal and spatial correlations as means of expressing the prior knowledge of RAs. We then define a new privacy model, ST-Differential Privacy (ST-DP), to account for RAs.

Definition 5 (Temporal Correlations). The temporal correlation of user i is described by a transition matrix P_i ∈ R^(|loc|×|loc|), which represents Pr(l_i^(t-1) | l_i^t), where l_i^t is the location of user i at time t.

Definition 6 (Spatial Correlations). Let x_i denote Gaussian random variables that represent the possible locations of each user i ∈ [1, n]. A GMRF G, which we call the correlation graph, describes the spatial correlations among x_1, ..., x_n.

We define realistic adversaries in a comprehensive and flexible manner such that various scenarios can be represented.

Definition 7 (Realistic Adversaries). The adversaries targeting user i who possess various levels of prior knowledge, denoted by R_i(P_i, G_i, m), are called the realistic adversaries of that user. There are three basic types of realistic adversaries: (1) R_i(P_i, ∅, m), (2) R_i(∅, G_i, m), and (3) R_i(P_i, G_i, m).

Attack by realistic adversaries is Bayesian in nature [17]. Let D_K^t ⊂ D^t represent the prior knowledge of such adversaries, and let D_U^t be the unknown tuples, i.e., D^t = D_K^t ∪ D_U^t ∪ {l_i^t}. The adversaries attempt to infer the location of user i at time t. To this end, they first infer D_U^t based on D_K^t and a guessed value of l_i^t (or l_i^t′, to represent a different guess) and then attempt to distinguish the difference between the two neighboring databases D^t and D^t′, where D^t′ = D_K^t ∪ D_U^t ∪ {l_i^t′}.

Definition 8 (ε-DP_ST). At each time t, a DP mechanism M^t takes D^t as input and produces r^t as output. The privacy leakage of M^t against R_i(P_i, G_i, m) is defined as follows:

    PL(R_i, M^t) := sup_{|D_K^t| = m} log [ Pr(r^1, ..., r^t | l_i^t, D_K^t) / Pr(r^1, ..., r^t | l_i^t′, D_K^t) ]
                  = sup_{|D_K^t| = m} log [ Σ_{D_U^t} Pr(r^1, ..., r^t | D^t) Pr(D_U^t | l_i^t, D_K^t) / Σ_{D_U^t} Pr(r^1, ..., r^t | D^t′) Pr(D_U^t | l_i^t′, D_K^t) ]    (4)

M^t satisfies ε-DP_ST (i.e., DP under spatiotemporal correlations) if, for all R_i, i ∈ [1, n], it holds that PL(R_i, M^t) ≤ ε.

4.3 Quantifying Privacy Leakage

The primary challenge of protecting against realistic adversaries is to finely quantify the privacy leakage, i.e., Equation (4). Because of space limitations, we briefly describe a concept for how this can be achieved and present a preliminary result. We analytically divide Equation (4) into two parts, namely, the privacy leakages w.r.t. adversaries of type (1) (in Definition 7) and those of type (2). The corresponding problems of quantification can be translated into inference on the GMRF (for type (1)) and a linear-fractional programming problem (for type (2)). We then analyze the combination of these problems, i.e., the privacy leakage w.r.t. type (3).

The result reveals that the privacy leakage of a DP mechanism w.r.t. realistic adversaries may increase over time, as shown in Example 2.

Example 2. According to the Laplace mechanism, traditionally, adding Lap(1/0.1) noise to published counts can achieve 0.1-DP at each time point. However, in an extreme case, if a user's data at any two successive time points are correlated with probability 1, say, P_a = (1 0; 0 1), then continual data publishing is equivalent to releasing the same data multiple times. Hence, the privacy leakage at each time point will accumulate with respect to previous time points, exhibiting a linear increase (Figure 4(a)). In another extreme case, if there is no temporal correlation, or a uniform correlation such as P_c = (0 1; 0 1), then the privacy leakage at each time point does not increase, as shown in Figure 4(c). Figure 4(b) depicts the privacy leakage induced by P_b = (0.8 0.2; 0 1), which can be finely calculated using our algorithm.

[Figure 4: Privacy leakage may increase over time. Temporal privacy leakage (TPL) vs. t = 1, ..., 10 for (a) strong, (b) moderate, and (c) no temporal correlation.]
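The two extreme cases in Example 2 follow from a simple bound rather than from the full quantification algorithm (which is needed for intermediate matrices such as P_b). Under perfect correlation (P_a), every release re-exposes the same underlying value, so the per-release worst-case losses of ε add up to tε; with no temporal correlation (P_c), releases at other timestamps carry no information about the current location, so the leakage stays at ε. The sketch below is our back-of-the-envelope bookkeeping of that bound, not the paper's algorithm.

```python
def cumulative_leakage_bound(t, epsilon, perfectly_correlated):
    """Worst-case temporal privacy leakage after t releases of Lap(1/epsilon)
    noise in the two extreme cases of Example 2 (illustrative bound only).

    perfectly_correlated=True  -> P_a-like case: per-release losses add up.
    perfectly_correlated=False -> P_c-like case: leakage stays at epsilon.
    """
    return t * epsilon if perfectly_correlated else epsilon

epsilon = 0.1
for t in (1, 2, 3, 10):
    print(t, cumulative_leakage_bound(t, epsilon, perfectly_correlated=True))
# grows linearly in t (cf. Figure 4(a)); with no correlation it stays at 0.1 (Figure 4(c))
```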
5. CONCLUSION AND FUTURE WORK

In this research, we addressed two fundamental limitations of differential privacy by developing rigorous and flexible privacy models for the use of personal spatiotemporal data. An interesting future direction of research will be to extend our privacy models to other data types.

Acknowledgments

We appreciate the comments of the anonymous reviewers. This research is partially supported by the Center of Innovation Program of the Japan Science and Technology Agency, JST.

6. REFERENCES

[1] Benjamin C. M. Fung, Ke Wang, Rui Chen, and Philip S. Yu. Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv., 42(4):14:1–14:53, 2010.
[2] Latanya Sweeney. k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5):557–570, 2002.
[3] Lars Backstrom, Cynthia Dwork, and Jon Kleinberg. Wherefore art thou r3579x?: Anonymized social networks, hidden patterns, and structural steganography. In WWW, pages 181–190, 2007.
[4] A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. In S&P '08, pages 111–125, 2008.
[5] Yves-Alexandre de Montjoye, César A. Hidalgo, Michel Verleysen, and Vincent D. Blondel. Unique in the crowd: The privacy bounds of human mobility. Sci. Rep., 3, 2013.
[6] Yves-Alexandre de Montjoye, Laura Radaelli, Vivek Kumar Singh, and Alex "Sandy" Pentland. Unique in the shopping mall: On the reidentifiability of credit card metadata. Science, 347(6221):536–539, 2015.
[7] Yan Michalevsky, Gabi Nakibly, Gunaa Arumugam Veerapandian, and Dan Boneh. PowerSpy: Location tracking using mobile device power analysis. In SEC, pages 785–800, 2015.
[8] Cynthia Dwork. Differential privacy. In ICALP, pages 1–12, 2006.
[9] Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. ℓ-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1):3–es, 2007.
[10] Ninghui Li, Tiancheng Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In ICDE, pages 106–115, 2007.
[11] Xiaokui Xiao and Yufei Tao. m-invariance: Towards privacy preserving re-publication of dynamic datasets. In SIGMOD, pages 689–700, 2007.
[12] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
[13] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103, 2007.
[14] Yang Cao and Masatoshi Yoshikawa. Differentially private real-time data publishing over infinite trajectory streams. IEICE Trans. Inf. & Syst., E99-D(1):163–175, 2016.
[15] Georgios Kellaris, Stavros Papadopoulos, Xiaokui Xiao, and Dimitris Papadias. Differentially private event sequences over infinite streams. PVLDB, 7(12):1155–1166, 2014.
[16] Liyue Fan, Li Xiong, and Vaidy Sunderam. FAST: Differentially private real-time aggregate monitor with filtering and adaptive sampling. In SIGMOD, pages 1065–1068, 2013.
[17] Daniel Kifer and Ashwin Machanavajjhala. Pufferfish: A framework for mathematical privacy definitions. ACM Trans. Database Syst., 39(1):3:1–3:36, 2014.