On the Impact of User Movement Simulations in the Evaluation of LBS Privacy-Preserving Techniques

Sergio Mascetti1, Dario Freni1, Claudio Bettini1, X. Sean Wang2, and Sushil Jajodia3

1 DICo, Università di Milano
2 Dept. of CS, University of Vermont
3 CSIS, George Mason University

Abstract. The evaluation of privacy-preserving techniques for LBS is often based on simulations of mostly random user movements that only partially capture real deployment scenarios. We claim that benchmarks tailored to specific scenarios are needed, and we report preliminary results on how they may be generated through an agent-based context-aware simulator. We consider privacy-preserving algorithms based on spatial cloaking and compare the experimental results obtained on two benchmarks: the first based on mostly random movements, and the second obtained from the context-aware simulator. The specific deployment scenario is the provisioning of a friend-finder-like service on weekend nights in a big city. Our results show that, compared to the context-aware simulator, the random user movement simulator leads to significantly different results for a spatial-cloaking algorithm, under-protecting in some cases and over-protecting in others.

1 Introduction

Location-based services (LBS) are often cited as killer applications for the latest GPS-equipped 3G phones, which are slated to be massively distributed in 70 countries. While car navigation and identification of the nearest points of interest are already widely used services, the so-called friend-finder services, a class of LBS that will once more change the way we interact, are generating increasing interest. A friend-finder service reveals to a participating user the presence of other close-by participants belonging to a particular group (friends being only one example), possibly showing their position on a map.
From a technical point of view, in contrast to services that find the nearest points of interest, this service is characterized by a sequence of LBS requests instead of single ones, since a user may want to periodically check for close-by participants, while moving or even while staying in the same place. Sociological studies have shown that a large number of users perceive the release of their precise location, as part of an LBS request, as a possible privacy threat [1]. Considering friend-finder services, it is easy to identify two types of privacy threats: a) the association of the identity of the user with the specific group of persons she is interested in may reveal her religious, sexual, or political orientation, and b) the association of the identity of the user with her precise location may reveal what kind of places she has been to, or that she has not been where she was supposed to be at that time. As formally shown in [2], the likelihood of a privacy violation, and consequently the defense techniques to be applied, strongly depend on the knowledge that an adversary may have. In the friend-finder service, the service provider (SP) may not be trusted or the communication channels may be insecure; in that case, the adversary's knowledge may include the precise identity and location information submitted with each request, and both of the above privacy threats would become real privacy breaches. The substitution of identities with pseudonyms does not entirely solve the problem if, for example, the adversary happens to know who is at the location at the time reported in the request (e.g., in the case the issuer of the request uses a fidelity card at a store). In some cases, the adversary may also be able to recognize sequences of requests as issued by the same anonymous user (e.g., by observing the same pseudonym or by spatio-temporal tracking) and use this information to re-identify the issuer.
Several defense techniques against both threats under different adversary models have been proposed, and may be applied to the friend-finder service; however, current proposals very rarely offer formal assessments of the privacy protection they provide, and are generally supported by experimental results based either on real datasets of questionable significance for real LBS services (e.g., truck or school-bus traces) or on simulations based on mostly random user movements that hardly match the specific deployment scenario of an LBS service. In order to understand whether the use of simulations based on mostly random user movements may be a real problem, or whether it is actually useful and safe to use these simulations, we considered a typical deployment scenario for a friend-finder service: a large number of young people using the service on a weekend night in a large city like Milan, Italy. We performed an in-depth study, using different sources including on-line surveys, of the parameters characterizing this scenario. We then used the Brinkhoff simulator [3], widely used in testing LBS privacy preservation, to generate, based on these parameters, a first dataset of user movements. A second dataset was created with a customized version of the Siafu agent-based context-aware simulator [4], which is able to capture many more details of our scenario. Then, based on a common metric for privacy and quality of service, we ran a large number of tests on both datasets, considering different re-identification abilities of the adversary as well as different privacy-preserving techniques. Our results consistently show that (i) in some cases the evaluation on random movement simulations leads to the definition of overprotective techniques, and (ii) in other cases, techniques that are shown to meet privacy requirements based on those simulations do not meet them when tested with more realistic context-aware simulations.
We focus our technical treatment on protecting the association of the user with the request she has issued (e.g., with the group of people she is interested in, as in threat (a) described above), even if we believe that our arguments can be easily extended to techniques aimed only at protecting the location.

Related work

We are not aware of related work in this area that specifically considers the relevance of realistic simulations in LBS. There are, however, several studies on user movements with impact on many different application areas, including epidemiology, transportation, computer networks, marketing, as well as LBS. A very interesting study supporting an argument against random movement simulations recently appeared [5]. In the following we briefly report the main techniques currently proposed for LBS privacy preservation, identifying those similar to the techniques tested in our experiments, and those using simulations to generate the datasets for experiments. Privacy-preserving solutions based on cryptographic techniques that totally hide the location information in requests, even from the SP, have recently been proposed [6] for LBS based on 1-NN queries, and may probably be adapted to the service we consider. If proven to be correct, no simulation would be needed for these techniques, since no information would leak from any request and the above privacy threats would not apply. However, this adaptation is still to be investigated, and there are some general concerns with these approaches regarding efficiency and flexibility. A popular alternative technique is spatial cloaking, which consists in the generalization of the spatial information transmitted to the SP as part of a service request.
By receiving generalized locations, the SP can only return approximate results on the presence of close-by group members and their positions; while it may be possible to have a trusted entity in the middle filtering the communication and improving the precision, the related overhead costs should be taken into account in evaluating the trade-off between generalization and quality of service. While in this paper we consider techniques based on spatial cloaking as in [7-9, 2, 10], other proposals have considered different techniques, including the generation of dummy requests, the use of incremental requests, or the substitution in the request of the position of the issuer with a region that does not include her (see, among others, [11-13]). Most of the proposals for LBS privacy have only considered requests in isolation, while a few have also addressed the cases in which sequences of requests can be exploited by the adversary ([14, 15] among others), as in the friend-finder service. A related problem is the privacy-aware publication of trajectories [16, 17]; even if this has some aspects more similar to database publication than to service request privacy preservation, we believe that our results may be important for these studies as well. Synthetic, mostly random, user movements obtained from the Brinkhoff simulator or other simulators have been used in most of the papers cited above, as well as in our own previous work.

Organization

The rest of the paper is organized as follows. In Section 2 we formally define how we evaluate the privacy of LBS requests, or equivalently, how we measure the risk of a privacy violation upon issuing a request. In Section 3 we explain how the two datasets were obtained from the generators based on the parameters characterizing the deployment scenario. In Section 4 we briefly explain the privacy preservation algorithms being used and we report our experimental results. Section 5 concludes the paper.
2 Privacy metric of generalized requests

As mentioned in the introduction, we are concerned with privacy protection via location generalization (also called spatial cloaking). In this section, we formalize the adversary model we consider in this paper, and give a metric to measure the privacy provided by a set of generalized requests against the adversary.

2.1 Requests, original requests, and generalized requests

We first formally define requests and generalized requests for LBS. A request issued by a user without alteration is called an original request, and a generalized request is one that is sent to the service provider and has been altered from the original one for the purpose of privacy protection. Both kinds of requests are called requests and denoted r. A convention in this paper is to use r′ to denote generalized requests, to emphasize their generalized nature, and r to denote original requests, if not specified otherwise. Either the client software or a trusted medium transforms (or generalizes) an original request into a generalized one. In this paper, we are not concerned with how the generalization happens, but rather with the resulting generalized requests and their privacy properties. In the experimental section, we evaluate the performance of generalization algorithms based on the generalized requests they generate. Each LBS request r, either original or generalized, is logically divided into three parts: IDdata, STdata, and SSdata, containing user identification data, location and time of the request, and other service parameters, respectively. In the sequel, the spatial and temporal components of STdata are denoted Sdata and Tdata, respectively. In this paper, for the sake of simplicity, we consider space and time as discrete domains. However, our results can be easily extended to the case in which these two domains are continuous.
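As an illustration only (this is our own sketch, not code from the paper), the three-part request structure and the generalization step described above can be modelled as follows; the field names mirror the paper's notation, and a spatial region is represented here as a set of discrete cells:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    IDdata: str          # user identification data (or a pseudonym)
    Sdata: frozenset     # spatial component: a set of discrete cells
    Tdata: int           # temporal component: a discrete time instant
    SSdata: str          # other service-specific parameters

def generalize(r: Request, wider_region: frozenset) -> Request:
    """Return a generalized request r' that differs from r only in Sdata.

    Enforces the paper's constraint r.Sdata ⊆ r'.Sdata."""
    assert r.Sdata <= wider_region, "generalized region must contain the original"
    return Request(r.IDdata, frozenset(wider_region), r.Tdata, r.SSdata)

# Hypothetical usage: cloak a single-cell location into a 2x2 block of cells.
r = Request("alice", frozenset({(3, 4)}), 10, "group=hiking")
r_prime = generalize(r, frozenset({(3, 4), (3, 5), (4, 4), (4, 5)}))
```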
Each generalized request r′ must correspond to an original request r such that the difference between r and r′ is only in Sdata and, furthermore, r.Sdata ⊆ r′.Sdata, i.e., the spatial region of the generalized request must contain (or be equal to) the spatial region of the original request⁴. We use issuer(r) to denote the actual issuer of the (original or generalized) request r.

⁴ Here "region" can be a point.

2.2 Adversary model

The objective here is to provide an adversary model that captures a general class of adversary models. In a sense, our adversary model is an adversary "meta-model". This adversary meta-model concerns two aspects of knowledge that an adversary might have: (1) knowledge of users' whereabouts (i.e., their locations), and (2) correlation of (generalized) requests. These two aspects cover the (explicit or implicit) assumptions that have appeared in the relevant literature. For users' locations, we assume that the adversary has the knowledge expressed by the following Ident function:

Ident_t : Areas → User sets,

that is, given an area A, Ident_t(A) is the set of users whom, through certain means, the adversary has identified to be located in area A at time t. In the following, when no confusion arises, we omit the time instant t. We further assume that this knowledge is correct, in the sense that the identified users are indeed in area A at that time. For a given user i, if there exists an area A such that i ∈ Ident(A), then we say i is identified by the adversary. Furthermore, we say that i is identified in A. Note that there may be users who are also in A but whom the adversary does not identify. This may happen either because the adversary is not aware of the presence of users in A, or because the adversary cannot identify these users even if he is aware of their presence.
We do not distinguish these two cases in our adversary model since, as we shall see later, the distinction does not make any perceptible difference in the ability of the adversary when the total population is large. Clearly, in reality, there are many different sources of external information that can lead the adversary to estimate the location of users. Some may lead the adversary to know that a user is in a certain area, but not her exact location. For example, an adversary may know that Bob is in a pub (due to his use of a fidelity card at the pub), but may not know which room he is in. Some statistical analysis could be done to derive the probability that Bob is in a particular room, but this is beyond the scope of this paper. The most conservative assumption regarding this capability of the adversary is that Ident(A) gives exactly all the users in each area A. It can be seen that if the privacy of the user is guaranteed under this most conservative assumption, then privacy is also guaranteed against any less precise Ident function. However, this conservative assumption is unlikely to hold in reality, and some have observed that it unnecessarily degrades the quality of service. It is therefore interesting to see how much privacy and quality of service change with more realistic Ident functions. This is partly the goal of our paper. As part of the adversary model regarding locations and users, we also assume another function:

Num_t : Areas → [0, ∞),

that is, given an area A, Num_t(A) gives an estimate of the number of users in the area at time t. This function can be derived from statistical information publicly available or through some kind of counting mechanism, such as tickets to a theater. Again, when no confusion arises, we do not indicate the time instant t. The second part of the adversary model is his ability to correlate requests.
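A minimal sketch (our own illustration, not part of the paper) of the two location-knowledge functions for a fixed time instant: Ident is built from a set of observations mapping areas to identified users, and Num is a head-count estimate for known areas. Areas are again modelled as sets of discrete cells.

```python
def make_ident(observations):
    """observations: dict mapping an area (frozenset of cells) to the set of
    users the adversary identified inside it. Returns Ident for that instant."""
    def ident(area):
        users = set()
        for obs_area, obs_users in observations.items():
            if obs_area <= area:       # the observed area lies inside `area`
                users |= obs_users
        return users
    return ident

def make_num(counts):
    """Num(A): estimated head-count for known areas (0 when unknown),
    e.g., derived from ticket sales or public statistics."""
    return lambda area: counts.get(area, 0)

# Hypothetical instant: the adversary has identified i1 in cell (0, 0),
# and estimates that three users are in that cell overall.
ident = make_ident({frozenset({(0, 0)}): {"i1"}})
num = make_num({frozenset({(0, 0)}): 3})
```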
We formalize this with the following function L:

L : Requests → Request sets,

that is, given a (generalized) request r′, L(r′) gives a set of requests that the adversary has concluded, through certain means, to be issued by the same user who issued r′. In other words, all the requests in L(r′) are linked to r′, although the adversary may still not know who the user is. Note that L(r) may only give an (often small) subset of all the requests issued by the issuer of r. On the other hand, we assume that the function L is correct, in the sense that each request in L(r) is indeed issued by the same user in reality. A set of requests is called a trace, denoted τ, if from the link function L we understand that all its requests are issued by the same user. The requests in τ are implicitly ordered along the time dimension. As in the case of the Ident function, the most conservative assumption on correlation is that L(r) gives exactly all the (generalized) requests that are issued by the issuer of r. This is a very strong assumption that may severely decrease the quality of service when accompanied by the most conservative assumption about the Ident function. Again, a partial goal of this paper is to study the impact of a less conservative but more realistic assumption on L. In [2], we proposed a formal framework to model LBS privacy attacks and defenses for the static case. The main idea is that the safety of a defense technique can be formally evaluated only if the context, i.e., the set of assumptions about the adversary's external knowledge, is explicitly stated. Following this methodology, in this paper a context C_H is given by the three functions Ident, Num, and L, that is, C_H = (Ident, Num, L). In the next section, we formalize the attack that an adversary can perform on the generalized requests in a context C_H.
A consequence of restricting to context C_H is that, analogously to the related work in this area, we focus our attention on using only STdata as a quasi-identifier. Intuitively, a quasi-identifier in a request is a combination of values that can be used to provide more information on who the actual issuer of a request may be than would be available without these values. For example, if the Ident function is given, the STdata in the request is a quasi-identifier, as it may provide information on the actual issuer, as shown in the next subsection. In principle, any information contained in a request should be carefully analyzed to see if it may serve as a quasi-identifier. For example, the IDdata part is an obvious target, and some service-specific parameters may be used to link the request to users. However, these aspects are outside the scope of this paper.

2.3 Privacy Evaluation

The general question for this subsection is: given a set of generalized requests and a context C_H, how much privacy do the users who issued these requests have? We want to find the following function:

Att : Requests × Users → [0, 1].

Intuitively, given a (generalized) request r′ and a user i, Att(r′, i) gives the probability that the adversary can derive from C_H that i is the issuer of r′ among all the users. In the rest of this section we show how to specify the attack function for context C_H. Once the attack function is specified, we can use the following formula to evaluate the privacy value of a request:

Privacy(r′) = 1 − Att_{C_H}(r′, issuer(r′))    (1)

Intuitively, this value is the probability that the attacker will not associate the issuer of request r′ with r′. In order to specify the Att function, we introduce the function Inside(i, r′), which indicates the probability that user i is located in r′.Sdata at the time of the request.
Intuitively, Inside(i, r′) = 1 if user i is identified by the adversary as one of the users located in r′.Sdata at time r′.Tdata, i.e., i ∈ Ident_t(r′.Sdata) with t = r′.Tdata. On the contrary, Inside(i, r′) = 0 if i is recognized by the adversary as one of the users located outside r′.Sdata at time r′.Tdata, i.e., there exists an area A with A ∩ r′.Sdata = ∅ such that i ∈ Ident(A). Finally, if neither of the above cases holds, then the adversary does not know where i is, and there is still a probability that i is in r′.Sdata. Theoretically, this probability is the number of users in r′.Sdata that are not recognized by the adversary (i.e., Num(r′.Sdata) − |Ident(r′.Sdata)|) divided by the number of users who are not recognized by the adversary anywhere (i.e., |I| − |Ident(Ω)|, where I is the set of all users and Ω is the entire area for the application). Formally,

Inside(i, r′) = 1, if i ∈ Ident(r′.Sdata);
Inside(i, r′) = 0, if ∃A : A ∩ r′.Sdata = ∅ and i ∈ Ident(A);
Inside(i, r′) = (Num(r′.Sdata) − |Ident(r′.Sdata)|) / (|I| − |Ident(Ω)|), otherwise.    (2)

We can now define the Att function in context C_H. For the sake of presentation, let us first consider the attack in the snapshot context C_snap = (Ident, Num, L_snap), where for each generalized request r′, L_snap(r′) = {r′}. In this special case, the probability of a user i being the issuer of r′ is given by the probability of i being in r′.Sdata at the time of the request, normalized over all the users in I. Formally, the attack can be defined as:

Att_{C_snap}(r′, i) = Inside(i, r′) / Σ_{i′∈I} Inside(i′, r′)    (3)

When the total population of users is large (relative to the number of users whose locations are known to the adversary), the value of the "otherwise" case in Formula 2 is very small, albeit nonzero. Intuitively, if a user i falls into this case, then the adversary cannot really distinguish this particular user from all the other users who also fall into this case.
For such a user i, we can simply give the value 1/|I| to Att_{C_snap}(r′, i). We could give 1/(|I| − Num(Ω)), but this does not make much impact in practice. Now it is easy to see that

Att_{C_snap}(r′, i) ≈ 1/Num(r′.Sdata), if Inside(i, r′) = 1;
Att_{C_snap}(r′, i) ≈ 0, if Inside(i, r′) = 0;
Att_{C_snap}(r′, i) ≈ 1/|I|, otherwise.    (4)

The above formula makes intuitive sense. Indeed, if i is recognized as inside r′.Sdata, without any other information the adversary cannot distinguish him/her from any of the Num(r′.Sdata) people in the area who might be the issuer. If i is recognized outside, then clearly i cannot be the issuer, due to our definition of (generalized) requests. If i is not recognized anywhere (meaning he/she could be anywhere), then the attacker cannot distinguish him/her from any of the other people who are not recognized. Since we assume the total population is much greater than Num(Ω), the probability that i is the issuer is close to 1/|I|.

Example 1. Consider the situation shown in Figure 1(a), in which there is a request r′ such that, at time r′.Tdata, there are three users in r′.Sdata: one of them is identified as i1, the other two are not identified. The adversary can also identify users i2 and i3 outside r′.Sdata at time r′.Tdata. Assume that the set I contains 100 users.

[Fig. 1. Example of attack: (a) first request r′; (b) second request r″.]

Clearly, i2 and i3 have zero probability of being the issuer, since they are identified outside r′.Sdata and the spatial region of any generalized request must contain the spatial region of the original request. On the contrary, the adversary is sure that i1 is located in r′.Sdata. By Equation 3, the attack associates i1 with r′ with likelihood 1/(Σ_{i′∈I} Inside(i′, r′)). By Formula 2, for each user i in I \ {i1, i2, i3}, Inside(i, r′) = 2/97. Therefore, Σ_{i′∈I} Inside(i′, r′) = 97 · 2/97 + 1 = 3.
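The snapshot attack of Formulas 2 and 3 can be checked on the numbers of Example 1 with a small script (a sketch in our own code, not the authors' implementation): |I| = 100 users, three users inside r′.Sdata of whom one is identified as i1, and i2, i3 identified outside.

```python
def inside_snapshot(i, identified_inside, identified_outside,
                    num_inside, total_users, total_identified):
    """Formula 2: probability that user i is in r'.Sdata."""
    if i in identified_inside:
        return 1.0
    if i in identified_outside:
        return 0.0
    return (num_inside - len(identified_inside)) / (total_users - total_identified)

I = [f"u{k}" for k in range(100)]
identified_in, identified_out = {"u0"}, {"u1", "u2"}   # u0 plays i1; u1, u2 play i2, i3

def att_snap(i):
    """Formula 3: normalize Inside over all users in I."""
    ins = lambda j: inside_snapshot(j, identified_in, identified_out, 3, 100, 3)
    return ins(i) / sum(ins(j) for j in I)

# The denominator is 97*(2/97) + 1 = 3, so the attack assigns 1/3 to i1.
print(round(att_snap("u0"), 4))   # ≈ 0.3333
print(att_snap("u1"))             # 0.0
```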
Consequently, the probability that i1 is the issuer of r′ is 1/3. Moreover, each user i ∈ I \ {i1, i2, i3} has a probability of being the issuer of (2/97)/3 = 2/291 ≈ 0.7%.

In the general case L(r′) ⊇ {r′}, we can evaluate, analogously to the snapshot case, the probability that a user is located in the generalized region of all the requests in the trace τ = L(r′). Thus, we extend the Inside function to traces: given a trace τ and a user i, Inside(i, τ) is the probability that user i is located, for each request r′ in τ, in r′.Sdata at time r′.Tdata. Then, the attack is

Att_{C_H}(r′, i) = Inside(i, L(r′)) / Σ_{i′∈I} Inside(i′, L(r′))    (5)

We now turn to consider how to compute Inside(i, τ). First consider some easy cases. If i ∈ Ident(r′.Sdata) for all requests r′ ∈ τ, then Inside(i, τ) = 1. If i ∈ Ident(A) and A ∩ r′.Sdata = ∅ for some area A and at least one request r′ ∈ τ, then Inside(i, τ) = 0. The remaining cases are harder. To calculate Inside(i, τ), we need to consider the likelihood of someone moving from one location to another at the specific times. In this paper, we advocate the following as a reasonable approach. We assume that for each pair of locations A and B and times t1 and t2, we know the probability of a user i being in B at time t2 conditioned on the fact that the user is in A at time t1. Formally, consider two random variables X: "i is inside A at time t1" and Y: "i is inside B at time t2", where A and B are two areas and t1 and t2 are two different times. We assume the adversary knows the value P(Y|X). We note that P(Y|X) is, in general, not the same as P(Y): indeed, how likely user i is to be in B can depend on how likely the same user is to be in A. Take two extreme examples: if A and B are very far away from each other and t1 and t2 are close, then i cannot be in B at t2 if i is in A at t1, i.e., P(Y|X) ≈ 0.
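The two easy cases for Inside(i, τ) can be expressed as a short dispatch; this is our own illustrative encoding (a trace as a list of (region, identified-inside, identified-outside) triples), with the harder Bayesian case deliberately left unresolved:

```python
def inside_trace_easy(i, trace):
    """Return 1.0 or 0.0 for the easy cases of Inside(i, τ); return None
    when the harder Bayesian-network computation would be required."""
    # Identified outside some request's generalized region: cannot be inside.
    if any(i in outside for _, _, outside in trace):
        return 0.0
    # Identified inside every request's generalized region: certainly inside.
    if all(i in inside for _, inside, _ in trace):
        return 1.0
    return None   # requires the P(Y|X) reasoning described in the text

# Trace mirroring Examples 1 and 2: i1 identified inside both requests,
# i2 and i3 identified outside the first one.
trace = [("r1_region", {"i1"}, {"i2", "i3"}),
         ("r2_region", {"i1", "i2"}, set())]
print(inside_trace_easy("i1", trace))   # 1.0
print(inside_trace_easy("i2", trace))   # 0.0
print(inside_trace_easy("i4", trace))   # None
```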
On the other hand, if A and B are two locations along a one-way road and the difference between t1 and t2 matches the time needed to move from A to B at a normal speed, then P(Y|X) ≈ 1. In practice, this value can be derived from historical observations and experience. Now, assume τ consists of the requests r′1, ..., r′k. We form a Bayesian network for each user i with X1, ..., Xk as the nodes, where each Xj corresponds to the random variable "user i is in r′j.Sdata at time r′j.Tdata". In this network, for each node Xh that satisfies the condition (denoted c) i ∈ Ident_t(r′h.Sdata) with t = r′h.Tdata, we draw an arc towards each node that does not satisfy condition c. In addition, for each pair of nodes Xh′ and Xh″ such that neither satisfies condition c, we draw an arc from Xh′ to Xh″ if r′h′.Tdata < r′h″.Tdata. (The resulting network is acyclic.) As we have assumed, we know the value P(Xh′|Xh) for each arc from Xh to Xh′. Denote by E the conjunctive fact that P(Xh) = 1 for each r′h ∈ τ that satisfies condition c. What we want to find is P(X1, ..., Xk|E) = Inside(i, τ). This is a well-studied belief revision problem, and many computation and approximation methods exist. (Note that if we apply this method to the easier cases mentioned earlier, we arrive at the correct values.)

Example 2. Continuing from Example 1, assume a second request r″ (see Figure 1(b)) is issued after r′ and that r″ is linked with r′, so τ consists of these two requests. At time r″.Tdata, there are four users inside r″.Sdata, two of whom are identified as i1 and i2. No user is identified outside r″.Sdata. From the above discussion, it follows that Inside(i2, τ) = Inside(i3, τ) = 0, since i2 and i3 are identified outside the first generalized request r′. All the other users have a non-zero probability of being inside the generalized region of each request in the trace.
In particular, Inside(i1, τ) = 1, since i1 is recognized in both requests. Consider a user i ∈ I \ {i1, i2, i3}. The Bayesian network for i has two nodes, X_r′ and X_r″, corresponding to the assertions "i is in r′.Sdata at time r′.Tdata" and "i is in r″.Sdata at time r″.Tdata", and there is an arc from X_r′ to X_r″, since r″ is issued after r′. Now let us assume P(X_r″|X_r′) = 0.75, i.e., there is a 75% likelihood that someone in r′.Sdata will move to r″.Sdata at the specified times. We compute Inside(i, τ) = P(X_r′, X_r″) = P(X_r′) · P(X_r″|X_r′) = (2/97) · 0.75. The sum of all the Inside(j, τ) values is then 1 + 0 + 0 + 97 · (2/97) · 0.75 = 2.5. The attack values under these assumptions are as follows: Att_{C_H}(r″, i1) = 1/2.5 = 40%, Att_{C_H}(r″, i2) = Att_{C_H}(r″, i3) = 0, and Att_{C_H}(r″, i) = (2/97) · 0.75 / 2.5 ≈ 0.6% for each of the other 97 users i. To make the situation more interesting, let us remove the fact that i2 was recognized outside at time r′.Tdata, and figure out the value Inside(i2, τ). In this case, let us assume P(X_r′|X_r″) = 0.75, namely, people who are in r″.Sdata have a 75% likelihood of coming from r′.Sdata. Under the fact E that i2 is in r″.Sdata, we know Inside(i2, τ) = P(X_r′, X_r″|E) = 0.75. The sum of the Inside values is then 1 + 0.75 + 0 + 97 · (2/97) · 0.75 = 3.25. Hence, Att_{C_H}(r″, i1) ≈ 31%, Att_{C_H}(r″, i2) ≈ 23%, Att_{C_H}(r″, i3) = 0, and Att_{C_H}(r″, i) = (2/97) · 0.75 / 3.25 ≈ 0.47% for each of the other 97 users i. This is an interesting exercise, as it reveals that if we allow i2 to possibly be in r′.Sdata (with 75% probability), then the likelihood that i1 is the issuer decreases, which is intuitively correct. It is worth noting that the definition of attack in context C_H is a proper extension of the attack that can be defined in the conservative context in which the adversary knows the location and the identity of each user at each time instant.
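The first part of Example 2 can be reproduced in a few lines (our own sketch, not the authors' implementation): unidentified users have prior P(X_r′) = 2/97 of being in r′.Sdata (the "otherwise" case of Formula 2), and the chain rule with P(X_r″|X_r′) = 0.75 gives their Inside value over the trace.

```python
n_unidentified = 97
prior = 2 / 97      # "otherwise" case of Formula 2 for request r'
p_move = 0.75       # assumed transition probability P(X_r'' | X_r')

inside = {"i1": 1.0, "i2": 0.0, "i3": 0.0}   # easy cases from the text
inside_other = prior * p_move                 # chain rule for each other user

total = sum(inside.values()) + n_unidentified * inside_other
att_i1 = inside["i1"] / total                 # Formula 5 for i1
att_other = inside_other / total              # Formula 5 for an unidentified user

print(round(total, 6))       # 2.5
print(round(att_i1, 6))      # 0.4, i.e., 40%
print(round(att_other, 4))   # 0.0062, i.e., ≈ 0.6%
```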
The historical attack in this context was first proposed in [14]. The idea is that the only users with a non-zero probability of being the issuer of a trace of requests are those whose spatio-temporal location is contained in the generalized region of every request in the trace. It can easily be seen that, if each user can be identified at each time instant, then the Inside() function returns either 0 or 1, and hence the attack we specified for context C_H assigns zero probability to each user located outside the generalized region of any request in the trace.

3 The MilanoByNight simulation

In order to evaluate privacy-preserving techniques applied to LBS, a dataset of users' movements is needed. In our experiments, we want to focus on the privacy threats that arise when using a friend-finder service, as described in Section 1. We assume that this kind of service is primarily used by people during entertainment hours, especially at night. Therefore, the ideal dataset for our experiments should represent the movements of people on a typical Friday or Saturday night in a big city, when users tend to move to entertainment places. To our knowledge, no such dataset is currently publicly available, especially since we want large-scale, individual, and precise location data (i.e., with the same approximation as current consumer GPS technology). In this section we describe how we generated this user movement dataset.

3.1 Relevant Parameters

For our experiments we want to artificially generate movements for 100,000 users on the road network of Milan⁵. The total area of the map is 324 km², and the resulting average density is 308 users/km². Very detailed digital vector maps of the city have been generously provided by the municipality of Milan.
The simulation includes a total of 30,000 home buildings and 1,000 entertainment places; the first value is strictly related to the considered number of users, while the second is based on real data from public sources, which also provide the geographical distribution of the places. Our simulation starts at 7 pm and ends at 1 am. During these hours, each user moves from her house to an entertainment place, spends some time in that place, and possibly moves to another entertainment place or goes back home. All the choices users make are modeled with probability distributions. For this specific data generation, some of the important parameters of the simulation are:

– Source and destination. These are the locations essential to define movements. They may be homes or entertainment places. Some places in some districts are more popular than others.
– StartingTime. The time at which a user leaves her home to go to the first entertainment place.
– Permanence. How long a user stays at one entertainment place.
– NumPlaces. How many entertainment places a user visits in one night.

In order to have a realistic model of these distributions, we prepared a survey to collect real user data. We are still collecting data, but the current parameters are based on interviews with more than 300 people in our target category.

⁵ 100,000 is an estimate of the number of people participating in the service we consider.

3.2 Weaknesses of mostly random movement simulations

Many papers in the field of privacy preservation in LBS use artificial data generated by moving-object simulators to evaluate their techniques. However, most of these simulators are not able to reproduce realistic user behavior. For example, objects generated by the Brinkhoff generator [3] cannot be aggregated in certain places (e.g., entertainment places).
Indeed, once an object is instantiated, the generator chooses a random destination point on the map; after reaching the destination, the object disappears from the dataset. For the same reason, it is not possible to reproduce simple movement patterns (e.g., a user going out from her home to another place and then coming back home), nor to simulate a user remaining in a place for a certain time. Despite these strong limitations, we made our best effort to use the Brinkhoff simulator to generate a set of user movements with characteristics as close as possible to those explained in Section 3.1. For example, in order to simulate entertainment places, we picked some random points on the map among those on the trajectories of users. This simulation has the main purpose of understanding whether testing privacy preservation over mostly random movement simulations gives significantly different results with respect to more realistic simulations.

3.3 Generation of user movements with a context simulator

In order to obtain a dataset consistent with the parameters specified in Section 3.1, we need a more sophisticated simulator. For our experiments, we chose to customize the Siafu context simulator [4]. With a context simulator it is possible to design models for agents, places, and context. Therefore, it is possible to define particular places of aggregation and make users dynamically choose which place to reach and how long to stay there. In our simulation, homes are distributed almost uniformly on the map, with a minor concentration in the central zones of the city. Entertainment places are mostly concentrated in 5 zones of the city. The distributions of the StartingTime, Permanence, and NumPlaces parameters introduced in Section 3.1 were modeled using the results of the survey.
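As an illustration, a bucketed survey distribution such as the one we use for Permanence can be sampled by first drawing a bucket according to its survey probability and then drawing a uniform time inside it. The following is a minimal Python sketch, not the actual simulator code; the bucket boundaries (in particular, capping "more than 4 hours" at the 6-hour simulation span) and all names are our own assumptions.

```python
import random

# Survey-derived permanence distribution: probability that a user stays
# in an entertainment place for a given amount of time (minutes).
# The probabilities are taken from the authors' survey; the numeric
# bucket bounds are illustrative assumptions.
PERMANENCE_BUCKETS = [
    ((0, 60), 0.0917),     # less than 1 hour
    ((60, 120), 0.3420),   # between 1 and 2 hours
    ((120, 180), 0.3292),  # between 2 and 3 hours
    ((180, 240), 0.1604),  # between 3 and 4 hours
    ((240, 360), 0.0768),  # more than 4 hours (capped at the 6 h span)
]

def sample_permanence(rng=random):
    """Draw a permanence time in minutes: pick a bucket by its survey
    probability, then pick a uniform time inside that bucket."""
    r = rng.random()
    cumulative = 0.0
    for (lo, hi), p in PERMANENCE_BUCKETS:
        cumulative += p
        if r <= cumulative:
            return rng.uniform(lo, hi)
    # Fallback for floating-point rounding: use the last bucket.
    lo, hi = PERMANENCE_BUCKETS[-1][0]
    return rng.uniform(lo, hi)
```

The same two-step scheme applies to StartingTime and NumPlaces once their survey buckets are given.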
For example, the time of permanence in an entertainment place was modeled according to the following percentages derived from the survey: 9.17% of the users stay less than 1 hour, 34.20% stay between 1 and 2 hours, 32.92% stay between 2 and 3 hours, 16.04% stay between 3 and 4 hours, and 7.68% stay more than 4 hours. Following these parameters, in our dataset users spend 50.87% of the time at home, 7.28% of the time moving from one place to another, and 41.85% of the time in entertainment places. When a user moves from one place to another, she decides whether to go on foot or by car. In general, if an entertainment place is farther than 500 meters, people tend to move by car, and this is reflected in the simulation. The average speed of movement by car is 20 km/h, while the average speed on foot is 3.6 km/h. With our parameters, 10.64% of the movements are done on foot, while all the others are done by car.

4 Experimental results

In this section we show the results of our experimental evaluation. We first define how we evaluate the quality of service in Section 4.1, then we describe the experimental setting in Section 4.2 and the generalization algorithms we used in Section 4.3. Finally, in Sections 4.4 and 4.5 we show the impact of the simulation parameters and of the user movements, respectively, on the evaluation of the generalization algorithms.

4.1 Evaluation of the Quality of Service

Different metrics can be defined to measure QoS for different kinds of services. For instance, for the friend-finder service we are considering, it would be possible to measure how many times the generalization leads the SP to return an incorrect result, i.e., the issuer is not notified of a close-by friend or, vice versa, the issuer is notified of a friend that is not close-by. While this metric is useful for this specific application, we want to measure the QoS independently of the specific kind of service.
For this reason, in this paper we evaluate how QoS degrades in terms of the perimeter of the generalized region: if the generalized region is too large, the service becomes useless. For this purpose, we introduce a new parameter, called maxP, that indicates this threshold as a maximum perimeter. We assume that no request is sent to the SP with a perimeter larger than maxP.

4.2 Experimental settings

In our experiments we used two datasets of user movements. The dataset AB (Agent-Based) was generated with the customized Siafu simulator as described in Section 3.3, while the dataset MRM (Mostly Random Movement) was created with the Brinkhoff simulator as described in Section 3.2. In both cases, we simulate LBS requests for the friend-finder service by choosing random users in the simulation; for each request we compute the generalization according to a given algorithm, evaluate QoS as explained in Section 4.1, and evaluate privacy according to formula (1) presented in Section 2. The most important parameters that characterize the simulations are reported in Table 1, with the values in bold denoting default values. The number of users indicates how many users are in the simulation, and the simulations are designed so that this number remains almost constant at each time instant. Every two minutes, each user has a probability Preq of issuing a request. For technical reasons, the reported tests are based on a time frame of three hours out of the total six hours of the MilanoByNight scenario. This implies that in the default case we consider a total of 45 requests (one every four minutes of the considered time frame). The parameter Pid−in indicates the probability that a user is identified when she is located in an entertainment place, while Pid−out is the probability that a user is identified in any other location (e.g., while moving from home to an entertainment place).
While we also perform experiments where the two probabilities are the same, our scenario suggests that a higher value for Pid−in is much more realistic (we consider it ten times higher than Pid−out). This is due to the fact that restaurants, pubs, movie theaters, and similar places are likely to have different ways to identify people (fidelity or membership cards, wifi hotspots, cameras, credit card payments, etc.), and in several cases more than one place is owned by the same company, which may have an interest in collecting data about its customers. Finally, Plink indicates the probability that two consecutive requests can be identified as issued by the same user.6 While we perform our tests considering a full range of values, the specific default value reported in the table is due to a recent study on the ability of linking positions based on spatio-temporal correlation [18].

Table 1. Parameter values

Parameter        Values
dataset          AB, MRM
number of users  10k, 20k, 30k, 40k, 50k, 60k, 70k, 80k, 90k, 100k
Preq             0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
Pid−in           0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
Pid−out          0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1
Plink            0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.87, 0.9, 1.0

The experimental results we show in this section are obtained by running the simulation for 100 issuers and then computing the average values.

4.3 The Generalization Algorithms Used in the Experiments

In our experiments we evaluate the privacy and the QoS of requests generalized using two algorithms previously proposed in the literature. The first one, called Grid, was presented in [19]; it is used as a representative of several algorithms aimed at guaranteeing k-anonymity in the snapshot case, i.e., algorithms that do not take into account the linking ability of the adversary. Intuitively, this algorithm partitions all users into blocks, each with cardinality at least k.
Then, it computes the generalized region as the minimum bounding rectangle (MBR) covering the locations of the users in the same block as the issuer. The second algorithm, Greedy, was first proposed in [14], and a similar idea was also described in [15]. The use of Greedy is intended to represent algorithms aimed at preserving privacy in the historical case, i.e., the general CH context, assuming that the attacker may actually obtain and recognize traces of requests from the same issuer. This algorithm computes the generalization of the first request r in a trace using an algorithm for the snapshot case. While doing this, the set A of users located in the generalized region is stored. The generalized region of a successive request r′ linked with r is then computed as the MBR of the locations of the users in A at the time of r′. In our implementation we use Grid as the snapshot algorithm to compute the generalization of the first request. For the purpose of our tests, we modified the two algorithms above so that each generalized region always has a perimeter smaller than maxP. To achieve this, if the perimeter of the generalized region is larger than maxP, the region is iteratively shrunk, by excluding from the MBR the user farthest from the issuer, until its perimeter is below maxP. In the Greedy algorithm, when a user is excluded from the generalized region, she is also excluded from the set A of users, and hence she is not used in the generalization of the successive requests. Eventually, when the set A contains only the issuer, a snapshot generalization is executed again and A is reinitialized. In addition to the input request r and the locations of all the users in the system, the considered algorithms require two additional parameters: the value k and the threshold maxP.

6 The limitation to consecutive requests is because, in our specific scenario, we assume linking is performed mainly through spatio-temporal correlation.
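The maxP-capping step described above can be sketched as follows. This is a simplified, non-authoritative Python sketch under our own assumptions: function names and the flat point representation are ours, Euclidean distance is assumed for "farthest from the issuer", and the real algorithms operate on the Grid/Greedy block structure rather than on bare point sets.

```python
import math

def mbr(points):
    """Minimum bounding rectangle of a set of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def perimeter(rect):
    """Perimeter of an axis-aligned rectangle (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = rect
    return 2 * ((x2 - x1) + (y2 - y1))

def shrink_to_max_perimeter(issuer, others, max_p):
    """Iteratively drop the user farthest from the issuer until the MBR
    of the remaining users has perimeter <= max_p.
    Returns (region, kept_users)."""
    # Sort nearest-first so the farthest user is always at the end.
    kept = sorted(others, key=lambda u: math.dist(issuer, u))
    while kept and perimeter(mbr([issuer] + kept)) > max_p:
        kept.pop()  # exclude the farthest remaining user
    return mbr([issuer] + kept), kept
```

In the Greedy variant, the users dropped here would also be removed from the set A, so that they do not contribute to the generalization of successive requests.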
In our tests, we used values for k between 10 and 60 (default 10) and values for maxP between 1000 and 4000 meters (default 1000 meters). In our experimental results we also evaluated the privacy threat when no privacy-preserving algorithm is applied. The label NoAlg is used in the figures to identify results in this case.

4.4 Impact of Simulation Parameters in the Evaluation of the Generalization Algorithms

The objective of the first set of experimental results we present is to show which parameters of the simulation most affect the evaluation of the generalization algorithms. In these tests we used the AB dataset only. Figure 2(a) shows that the average privacy obtained with Greedy and Grid is not significantly affected by the size of the total population. Indeed, both algorithms, independently of the total number of users, try to compute generalized regions that cover the locations of k users, so the privacy of the requests is not affected. However, when the density is high, the two algorithms can generalize to a small area, while when the density is low, a larger area is necessary to cover the locations of k users (see Figure 2(b)). On the contrary, the privacy obtained when no generalization is performed is significantly affected by the total population. Indeed, a higher density increases the probability that different users are in the same location, and hence it increases privacy even if the requests are not generalized.

Fig. 2. Performance evaluation for different values of the total population: (a) average privacy and (b) average perimeter (m), as functions of the number of users, for Greedy, Grid, and NoAlg.

A parameter that significantly affects the average privacy is the probability of identification of a user in a certain place. In Figure 3 we show the experimental results for different values of Pid−in when, in each test, Pid−out is set to Pid−in/10. As expected, considering a trace of requests, the higher the probability of identifying users in one or more of the regions from which the requests in the trace were performed, the smaller the level of privacy.

Fig. 3. Average privacy for different values of Pid−in (Pid−out = Pid−in/10).

Figure 4(a) shows the impact of Plink on the average privacy. As expected, high values of Plink lead to small values of privacy. Our results show that the relation between Plink and privacy is not linear. Indeed, privacy depends almost linearly on the average length of the traces identified by the adversary (Figure 4(b)). However, the average length of the traces grows almost exponentially with the value of Plink (Figure 5).

Fig. 4. Performance evaluation for different values of Plink: (a) average privacy as a function of Plink and (b) average privacy as a function of the average trace length, for Greedy, Grid, and NoAlg.

Fig. 5. Average trace length as a function of Plink.

To summarize the first set of experiments, our findings show that many parameters of the simulation significantly affect the evaluation of the generalization algorithms. This implies that, when a generalization algorithm is evaluated, it is necessary to carefully estimate realistic values for the simulation parameters; an error in the estimation may lead to misleading results.
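The superlinear growth of the average trace length with Plink can be reproduced with a toy model of our own: assume each pair of consecutive requests is linked independently with probability Plink, so that trace lengths are geometrically distributed with mean roughly 1/(1 − Plink). This is a simplifying assumption for illustration, not the linking model of [18].

```python
import random

def simulate_trace_lengths(n_requests, p_link, rng):
    """Split a sequence of n_requests into traces: each consecutive pair
    of requests is linked independently with probability p_link.
    Returns the list of trace lengths (a toy independence model)."""
    lengths, current = [], 1
    for _ in range(n_requests - 1):
        if rng.random() < p_link:
            current += 1          # still the same trace
        else:
            lengths.append(current)
            current = 1           # linking failed: a new trace starts
    lengths.append(current)
    return lengths

def mean_trace_length(p_link, n_requests=100_000, seed=42):
    """Monte Carlo estimate of the average trace length for a given p_link."""
    rng = random.Random(seed)
    lengths = simulate_trace_lengths(n_requests, p_link, rng)
    return sum(lengths) / len(lengths)
```

Under this model, moving Plink from 0.5 to 0.9 raises the mean trace length from about 2 to about 10, and Plink close to 1 makes it diverge, which matches the superlinear trend observed in Figure 5.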
4.5 Impact of the User Movements on the Evaluation of the Generalization Algorithms

The objective of the second set of experiments is to answer an important question posed in this paper: what is the impact of the different simulated user movements on the evaluation of the generalization algorithms? We answer this question with a set of tests performed on the two different datasets we obtained as described above. The first set of tests, reported in Figure 6, compares the privacy achieved by the Greedy algorithm on the two datasets for different values of k and for different values of QoS. The experiments on MRM were also repeated with larger values of the QoS threshold (maxP = 2000 and maxP = 4000), so three different versions of MRM appear in the figures. In order to focus on these parameters only, in these tests the probability of identification was set to the same value for every place (Pid−in = Pid−out = 0.1), and for the MRM dataset the issuers of the requests were randomly chosen only among the users that stay in the simulation for 3 hours, ignoring those staying for a much shorter time, which are inevitably part of this dataset. This setting allowed us to compare the results on the two datasets using the same average length of traces identified by the adversary. Figure 6(a) shows that the average privacy of the algorithm evaluated on the AB dataset is much higher than on the MRM dataset. This is mainly due to the fact that in AB users tend to concentrate in a few locations (the entertainment places), and this enhances privacy. This is also confirmed by a similar test performed without any generalization of locations; we obtained values consistently higher for the AB dataset (the average privacy is 0.67 in AB and 0.55 in MRM). In Figure 6(b) we show the QoS achieved by the algorithm on the two datasets with respect to the average privacy achieved.
This result confirms that the level of privacy evaluated on the AB dataset using small values of k and maxP for the algorithm cannot be observed on the MRM dataset even with much higher values for these parameters. From the experiments shown in Figure 6 we can conclude that, if the MRM dataset is used as a benchmark to estimate the values of k and maxP necessary to provide a desired average level of privacy, then the results will suggest over-protective values. As a consequence, it is possible that the service will exhibit a much lower QoS than the one that could be achieved with the same algorithm.

Fig. 6. Evaluation of the Greedy algorithm using the AB and MRM datasets (Pid−in = Pid−out = 0.1): (a) average privacy as a function of k and (b) average privacy as a function of the average perimeter (m), for AB with maxP = 1000 and MRM with maxP = 1000, 2000, 4000.

The above results may still suggest that using MRM is safe, since, according to what we have seen so far, a technique achieving a certain level of privacy on MRM may only do better in a real scenario. However, our second set of experiments shows that this is not the case. In Figure 7 we show the results obtained by varying the probability of identification. For this test, we considered two sets of issuers in the MRM dataset. One set is composed of users that stay in the simulation for 3 hours (MRM long traces in Figure 7), while the other contains issuers randomly chosen from the entire set of users (MRM all traces in Figure 7), hence including users staying in the simulation for a much shorter time. In Figures 7(a) and 7(b) we can observe that the evaluation on the MRM dataset yields a privacy level higher than the one obtained on the AB dataset.
In particular, the evaluation of the Grid algorithm using the MRM dataset (Figure 7(b)) would suggest that the algorithm is able to provide high privacy protection. However, when evaluating the same algorithm using the more realistic dataset AB, this conclusion turns out to be incorrect. In this case, the evaluation on the MRM dataset may lead to underestimating the privacy risk, and hence to deploying services based on generalization algorithms that do not provide the minimum required level of privacy.

Fig. 7. Average privacy using the AB and MRM datasets (Pid−out = Pid−in/10), as a function of Pid−in: (a) Greedy algorithm and (b) Grid algorithm, for AB, MRM long traces, and MRM all traces.

5 Conclusions and open issues

In this paper we claim that the experimental evaluation of LBS privacy-preserving techniques should be based on user movement datasets obtained through simulations tailored to the specific deployment scenario of the target services. Our results support this thesis for the class of LBS known as friend-finder services, for techniques based on spatial cloaking, and for adversary models that include the possibility for the adversary to occasionally recognize people in certain locations. We believe that these results can be generalized to other LBS, techniques, and adversary models. For example, as future work, it would be interesting to also evaluate defense techniques that generalize the issuer's location to an area that does not necessarily contain it. Moreover, in our experiments we only considered the first of the two privacy threats presented in the introduction. We have some ideas on how to extend them to consider the second, location privacy, as well.
Finally, we believe a significant effort should be devoted to the development of new flexible and efficient context-aware user movement simulators, as well as to the collection of real data, possibly even in aggregated form, to properly tune the simulations. In our opinion this is a necessary step toward significant common benchmarks for evaluating LBS privacy-preserving techniques.

Acknowledgments

The authors would like to thank Stefano Varesi for his contribution in writing the code used for our simulations. This work was partially supported by the National Science Foundation (NSF) under grant N. CNS-0716567, and by the Italian MIUR under grant InterLink II04C0EC1D.

References

1. Barkhuus, L., Dey, A.: Location-based services for mobile telephony: a study of users' privacy concerns. In: Proc. of the 9th International Conference on Human-Computer Interaction, IOS Press (2003) 709–712
2. Bettini, C., Mascetti, S., Wang, X.S., Jajodia, S.: Anonymity in location-based services: towards a general framework. In: Proc. of the 8th International Conference on Mobile Data Management, IEEE Computer Society (2007)
3. Brinkhoff, T.: A framework for generating network-based moving objects. GeoInformatica 6(2) (2002) 153–180
4. Martin, M., Nurmi, P.: A generic large scale simulator for ubiquitous computing. In: Proc. of the 3rd Annual International Conference on Mobile and Ubiquitous Systems: Networking & Services, IEEE Computer Society (July 2006)
5. Gonzalez, M.C., Hidalgo, C.A., Barabasi, A.L.: Understanding individual human mobility patterns. Nature 453 (June 2008) 779–782
6. Ghinita, G., Kalnis, P., Khoshgozaran, A., Shahabi, C., Tan, K.L.: Private queries in location based services: anonymizers are not necessary. In: Proc. of SIGMOD, ACM Press (2008)
7. Gruteser, M., Grunwald, D.: Anonymous usage of location-based services through spatial and temporal cloaking. In: Proc. of the 1st International Conference on Mobile Systems, Applications and Services (MobiSys), The USENIX Association (2003)
8. Mokbel, M.F., Chow, C.Y., Aref, W.G.: The new Casper: query processing for location services without compromising privacy. In: Proc. of the 32nd International Conference on Very Large Data Bases, VLDB Endowment (2006) 763–774
9. Gedik, B., Liu, L.: Protecting location privacy with personalized k-anonymity: architecture and algorithms. IEEE Transactions on Mobile Computing 7(1) (2008) 1–18
10. Kalnis, P., Ghinita, G., Mouratidis, K., Papadias, D.: Preventing location-based identity inference in anonymous spatial queries. IEEE Transactions on Knowledge and Data Engineering 19(12) (2007) 1719–1733
11. Kido, H., Yanagisawa, Y., Satoh, T.: Protection of location privacy using dummies for location-based services. In: Proc. of the 21st International Conference on Data Engineering Workshops, IEEE Computer Society (2005)
12. Yiu, M.L., Jensen, C.S., Huang, X., Lu, H.: SpaceTwist: managing the trade-offs among location privacy, query performance, and query accuracy in mobile services. In: Proc. of the 24th International Conference on Data Engineering, IEEE Computer Society (2008)
13. Ardagna, C.A., Cremonini, M., Damiani, E., di Vimercati, S.D.C., Samarati, P.: Location privacy protection through obfuscation-based techniques. In: Proc. of the 21st Conference on Data and Applications Security. Volume 4602 of Lecture Notes in Computer Science, Springer (2007) 47–60
14. Bettini, C., Wang, X.S., Jajodia, S.: Protecting privacy against location-based personal identification. In: Proc. of the 2nd Workshop on Secure Data Management. Volume 3674 of LNCS, Springer (2005) 185–199
15. Xu, T., Cai, Y.: Location anonymity in continuous location-based services. In: Proc. of the ACM International Symposium on Advances in Geographic Information Systems, ACM Press (2007)
16. Abul, O., Bonchi, F., Nanni, M.: Never walk alone: uncertainty for anonymity in moving objects databases. In: Proc. of the 24th International Conference on Data Engineering, IEEE Computer Society (2008)
17. Terrovitis, M., Mamoulis, N.: Privacy preservation in the publication of trajectories. In: Proc. of the 9th International Conference on Mobile Data Management, IEEE Computer Society (2008)
18. Vyahhi, N., Bakiras, S., Kalnis, P., Ghinita, G.: Tracking moving objects in anonymized trajectories. In: Proc. of the 19th International Conference on Database and Expert Systems Applications, Springer (2008, to appear)
19. Mascetti, S., Bettini, C., Freni, D., Wang, X.S.: Spatial generalization algorithms for LBS privacy preservation. Journal of Location Based Services 2(1) (2008)