<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the Impact of User Movement Simulations in the Evaluation of LBS Privacy-Preserving Techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergio Mascetti</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dario Freni</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Bettini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>X. Sean Wang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sushil Jajodia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CSIS, George Mason University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DICo, Università di Milano</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dept. of CS, University of Vermont</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The evaluation of privacy-preserving techniques for LBS is often based on simulations of mostly random user movements that only partially capture real deployment scenarios. We claim that benchmarks tailored to specific scenarios are needed, and we report preliminary results on how they may be generated through an agent-based context-aware simulator. We consider privacy-preserving algorithms based on spatial cloaking and compare the experimental results obtained on two benchmarks: the first based on mostly random movements, and the second obtained from the context-aware simulator. The specific deployment scenario is the provisioning of a friend-finder-like service on weekend nights in a big city. Our results show that, compared to the context-aware simulator, the random user movement simulator leads to significantly different results for a spatial-cloaking algorithm, under-protecting in some cases and over-protecting in others.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Location-based services (LBS) are often cited as killer applications for the latest GPS-equipped 3G phones, which are slated to be massively distributed in 70 countries. While car navigation and identification of nearest points of interest are already widely used services, growing interest surrounds the so-called friend-finder services, a class of LBS that will once more change the way we interact. A friend-finder service reveals to a participating user the presence of other close-by participants belonging to a particular group (friends being only one example), possibly showing their position on a map. From a technical point of view, in contrast to services that find nearest points of interest, this service is characterized by a sequence of LBS requests instead of single ones, since a user may want to periodically check for close-by participants, while moving or even while staying in the same place.</p>
      <p>
        Sociological studies have shown that a large number of users perceive the release of their precise location, as part of an LBS request, as a possible privacy threat [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Considering friend-finder services, it is easy to identify two types of privacy threats: (a) the association of the identity of the user with the specific group of persons she is interested in may reveal her religious, sexual, or political orientation, and (b) the association of the identity of the user with her precise location may reveal what kind of places she has been to, or that she has not been where she was supposed to be at that time.
      </p>
      <p>
        As formally shown in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] the likelihood of a privacy violation, and consequently the defense techniques to be applied, strongly depends on the knowledge that an adversary may have. In the friend-finder service the service provider (SP) may not be trusted, or the communication channels may be insecure; then the adversary's knowledge may include the precise identity and location information submitted with each request, and both of the above privacy threats would become real privacy breaches. The substitution of identities with pseudonyms does not entirely solve the problem if, for example, the adversary happens to know who is at the location at the time reported in the request (e.g., when the issuer of the request uses a fidelity card at a store). In some cases, the adversary may also be able to recognize sequences of requests as issued by the same anonymous user (e.g., by observing the same pseudonym or by spatio-temporal tracking) and use this information to re-identify the issuer.
      </p>
      <p>Several defense techniques against both threats under different adversary models have been proposed and may be applied to the friend-finder service; however, current proposals very rarely have formal assessments of the provided privacy preservation, and are generally supported by experimental results based either on real datasets of questionable significance for real LBS services (e.g., truck or school-bus traces) or on data simulations based on mostly random user movements that hardly match the specific deployment scenario of an LBS service.</p>
      <p>
        In order to understand whether the use of simulations based on mostly random user movements may be a real problem, or whether it is actually useful and safe to use these simulations, we considered a typical deployment scenario for a friend-finder service: a large number of young people using the service on a weekend night in a large city like Milan, Italy. We performed an in-depth study, using different sources including on-line surveys, of the parameters characterizing this scenario. We then used the Brinkhoff simulator [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], widely used in testing LBS privacy preservation, to generate, based on these parameters, a first dataset of user movements. A second dataset was created with a customized version of the Siafu agent-based context-aware simulator [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which is able to capture many more details of our scenario. Then, based on a common metric for privacy and quality-of-service evaluation, we ran a large number of tests on both datasets, considering different re-identification abilities of the adversary, as well as different privacy-preserving techniques.
      </p>
      <p>Our results consistently show that (i) in some cases the evaluation on random-movement simulations leads to the definition of overprotective techniques, and (ii) in other cases, techniques that are shown to meet privacy requirements based on those simulations do not meet them when tested with more realistic context-aware simulations.</p>
      <p>
        We focus our technical treatment on protecting the association of the user with the request she has issued (e.g., with the group of people she is interested in, as in threat (a) described above), even if we believe that our arguments can be easily extended to techniques aimed only at protecting the location.
Related work
We are not aware of related work in this area specifically considering the relevance of realistic simulations in LBS. There are, however, several studies on user movements with impact on many different application areas, including epidemiology, transportation, computer networks, marketing, as well as LBS. A very interesting study supporting an argument against random-movement simulations recently appeared [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In the following we briefly report the main techniques currently proposed for LBS privacy preservation, identifying the ones similar to those tested in our experiments, and the ones using simulations to generate the datasets for experiments.
      </p>
      <p>
        Privacy-preserving solutions based on cryptographic techniques that totally hide the location information in requests, even from the SP, have recently been proposed [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for LBS based on 1-NN queries, and may probably be adapted to the service we consider. If proven to be correct, no simulation would be needed for these techniques, since no information would leak from any request and the above privacy threats would not apply. However, this adaptation is still to be investigated, and there are some general concerns with these approaches regarding efficiency and flexibility.
      </p>
      <p>A popular alternative technique is spatial cloaking, consisting in the generalization of the spatial information transmitted to the SP as part of a service request. By receiving generalized locations, the SP can only return approximate results on the presence of close-by group members and their positions; while it may be possible to have a trusted entity in the middle filtering the communication and improving the precision, the related overhead costs should be taken into account in evaluating the trade-off between generalization and quality of service. While in this paper we consider techniques based on spatial cloaking as in [7-9, 2, 10], other proposals have considered different techniques, including the generation of dummy requests, the use of incremental requests, or the substitution in the request of the position of the issuer with a region that does not include her (see, among others, [11-13]).</p>
      <p>Most of the proposals for LBS privacy have only considered requests in isolation, while a few have also addressed the cases in which sequences of requests can be exploited by the adversary ([14, 15] among others), as in the friend-finder service. A related problem is the privacy-aware publication of trajectories [16, 17]; even if this has some aspects more similar to database publication than to service-request privacy preservation, we believe that our results may be important for these studies as well.</p>
      <p>Synthetic, mostly random, user movements obtained from the Brinkhoff simulator or other simulators have been used in most of the above-cited papers, as well as in our own previous work.
The rest of the paper is organized as follows. In Section 2 we formally define how we evaluate the privacy of LBS requests, or equivalently, how we measure the risk of a privacy violation upon issuing a request. In Section 3 we explain how the two datasets were obtained from the generators, based on the parameters characterizing the deployment scenario. In Section 4 we briefly explain the privacy preservation algorithms being used and we report our experimental results. Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Privacy metric of generalized requests</title>
      <p>As mentioned in the introduction, we are concerned with privacy protection via location generalization (also called spatial cloaking). In this section, we formalize the adversary model we consider in this paper, and give a metric to measure the privacy provided by a set of generalized requests against this adversary.</p>
      <p>Requests, original requests, and generalized requests
We first formally define requests and generalized requests for LBS. A request issued by a user without alteration is called an original request, and a generalized request is one that is sent to the service provider and has been altered from the original one for the purpose of privacy protection. Both kinds of requests are called requests and denoted r. A convention in this paper is to use r′ to denote generalized requests, to emphasize their generalized nature, while using r to denote original requests, if not specified otherwise.</p>
      <p>Either the client software or a trusted medium transforms (or generalizes) an original request into a generalized one. In this paper, we are not concerned with how the generalization happened, but rather with the resulting generalized requests and their privacy properties. In the experimental section, we evaluate the performance of generalization algorithms based on the generalized requests they generate.</p>
      <p>Each LBS request r, either original or generalized, is logically divided into three parts: IDdata, STdata, and SSdata, containing user identification data, location and time of the request, and other service parameters, respectively. In the sequel, the spatial and temporal components of STdata are denoted by Sdata and Tdata, respectively. In this paper, for the sake of simplicity, we consider space and time as discrete domains. However, our results can be easily extended to the case in which these two domains are continuous.</p>
      <p>Each generalized request r′ must correspond to an original request r such that the difference between r and r′ is only in Sdata and, furthermore, r.Sdata ⊆ r′.Sdata, i.e., the spatial region of the generalized request must contain (or be equal to) the spatial region of the original request (here a "region" can be a point). We use issuer(r) to denote the actual issuer of the (original or generalized) request r.</p>
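      <p>The request structure and the containment constraint above can be sketched in code. This is a minimal illustration under invented names (Request, Region, is_valid_generalization are not from the paper); spatial regions are modeled as axis-aligned rectangles, with a point as a degenerate rectangle.</p>

```python
from dataclasses import dataclass
from typing import Tuple

# An axis-aligned rectangle stands in for a spatial region; a point is a
# degenerate rectangle with x1 == x2 and y1 == y2.
Region = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class Request:
    id_data: str      # IDdata: user identification (or pseudonym)
    s_data: Region    # Sdata: spatial component of STdata
    t_data: int       # Tdata: temporal component of STdata
    ss_data: str      # SSdata: service-specific parameters

def contains(outer: Region, inner: Region) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def is_valid_generalization(r: Request, r_prime: Request) -> bool:
    """r' may differ from r only in Sdata, and r.Sdata must be
    contained in (or equal to) r'.Sdata."""
    return (r.id_data == r_prime.id_data
            and r.t_data == r_prime.t_data
            and r.ss_data == r_prime.ss_data
            and contains(r_prime.s_data, r.s_data))
```

      <p>For instance, a point request at (1, 1) is validly generalized by the square (0, 0, 2, 2), but not by a region that excludes the point.</p>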
      <p>Adversary model
The objective here is to provide an adversary model that captures a general class of adversary models; in a sense, our adversary model is an adversary "meta-model". This adversary meta-model concerns two aspects of the knowledge that an adversary might have: (1) knowledge of users' whereabouts (i.e., their locations), and (2) correlation of (generalized) requests. These two aspects cover the (explicit or implicit) assumptions that appear in the relevant literature.</p>
      <p>For users' locations, we assume that the adversary has the knowledge
expressed as the following Ident function:</p>
      <p>Ident_t : the Areas → the User sets;
that is, given an area A, Ident_t(A) is the set of users whom, through certain means, the adversary has identified as being located in area A at time t. In the following, when no confusion arises, we omit the time instant t. We further assume that this knowledge is correct, in the sense that these identified users are indeed in area A at that time.</p>
      <p>For a given user i, if there exists an area A such that i ∈ Ident(A), then we say i is identified by the adversary. Furthermore, we say that i is identified in A. Note that there may be users who are also in A but whom the adversary does not identify. This may happen either because the adversary is not aware of the presence of users in A, or because the adversary cannot identify these users even if he is aware of their presence. We do not distinguish these two cases in our adversary model since, as we shall see later, the distinction does not make any perceptible difference in the ability of the adversary when the total population is large.</p>
      <p>Clearly, in reality, there are many different sources of external information that can lead the adversary to estimate the location of users. Some may lead the adversary to know that a user is in a certain area, but not the exact location. For example, an adversary may know that Bob is in a pub (due to his use of a fidelity card at the pub), but may not know which room he is in. Some statistical analysis may be done to derive the probability that Bob is in a particular room, but this is beyond the scope of this paper.</p>
      <p>The most conservative assumption regarding this capability of the adversary is that Ident(A) gives exactly all the users in each area A. It can be seen that if the privacy of the user is guaranteed under this most conservative assumption, then privacy is also guaranteed against any less precise Ident function. However, this conservative assumption is unlikely to hold in reality, while some have observed that it degrades the quality of service unnecessarily. It is interesting to see how much privacy and quality of service change with more realistic Ident functions. This is partly the goal of our paper.</p>
      <p>As part of this adversary model regarding the location and users, we also
assume another function:</p>
      <p>Num_t : the Areas → [0, ∞);
that is, given an area A, Num_t(A) gives an estimate of the number of users in the area at time t. This function can be derived from statistical information publicly available or through some kind of counting mechanism, such as tickets to a theater. Again, when no confusion arises, we do not indicate the time instant t.</p>
      <p>The second part of the adversary model is his ability to correlate requests.
We formalize this with the following function L:</p>
      <sec id="sec-2-1">
        <title>L : the Requests</title>
        <p>! the Request sets;
that is, given a (generalized) request r0, L(r0) gives a set of requests such that
the adversary has concluded, through certain means, are issued by the same user
who issued the request r0. In other words, all the requests in L(r0) are linked to
r0, although the adversary may still not know who the user is.</p>
        <p>Note that L(r) may only give an (often small) subset of all the requests issued by the issuer of r. On the other hand, we assume that the function L is correct, in the sense that each request in L(r) is indeed issued by the same user in reality. A set of requests is called a trace, denoted τ, if from the link function L we understand that all its requests are issued by the same user. The requests in τ are implicitly ordered along the time dimension.</p>
        <p>As in the case of the Ident function, the most conservative assumption on correlation is that L(r) gives exactly all the (generalized) requests that are issued by the issuer of r. This is a very strong assumption that may lead to a severe decrease in quality of service when accompanied by the most conservative assumption about the Ident function. Again, a partial goal of this paper is to study the impact of a less conservative but more realistic assumption on L.</p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], we proposed a formal framework to model LBS privacy attacks and defenses for the static case. The main idea is that the safety of a defense technique can be formally evaluated only if the context, i.e., the set of assumptions about the adversary's external knowledge, is explicitly stated. Following this methodology, in this paper a context CH is given by three functions Ident, Num, and L, that is
        </p>
        <p>CH = (Ident, Num, L).
In the next section, we formalize the attack on the generalized requests that an adversary can perform in a context CH.</p>
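        <p>The context CH can be sketched as a small data structure bundling the three knowledge functions. The sketch below backs them with plain lookup tables; all names and the table representation are illustrative assumptions, not part of the paper's formalism.</p>

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple

AreaTime = Tuple[str, int]  # (area id, time instant)

@dataclass
class Context:
    """Adversary context CH = (Ident, Num, L), backed by lookup tables.
    The tables stand in for whatever sources the adversary really has
    (fidelity cards, ticket counts, spatio-temporal tracking)."""
    ident_table: Dict[AreaTime, FrozenSet[str]] = field(default_factory=dict)
    num_table: Dict[AreaTime, float] = field(default_factory=dict)
    link_table: Dict[str, List[str]] = field(default_factory=dict)

    def ident(self, area: str, t: int) -> FrozenSet[str]:
        # Ident_t(A): users the adversary has identified in area A at time t.
        return self.ident_table.get((area, t), frozenset())

    def num(self, area: str, t: int) -> float:
        # Num_t(A): estimated number of users in area A at time t.
        return self.num_table.get((area, t), 0.0)

    def link(self, request_id: str) -> List[str]:
        # L(r'): requests linked to r'; r' is always linked to itself.
        linked = self.link_table.get(request_id, [])
        return linked if request_id in linked else [request_id] + linked
```

        <p>With empty tables this Context models the weakest adversary: Ident returns no identified users, Num returns 0, and L links each request only to itself (the snapshot case).</p>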
        <p>A consequence of restricting to context CH is that, analogously to the related work in this area, we focus our attention on using only STdata as a quasi-identifier. Intuitively, a quasi-identifier in a request is a combination of values that can be used to provide more information on who the actual issuer of a request may be than would be available without these values. For example, if the Ident function is given, the STdata in the request is a quasi-identifier, as it may provide information on the actual issuer, as shown in the next subsection. In principle, any information contained in a request should be carefully analyzed to see if it may serve as a quasi-identifier. For example, the IDdata part is an obvious target, and some service-specific parameters may be used to link the request to users. However, these aspects are outside the scope of this paper.
The general question for this subsection is, given a set of generalized requests and a context CH, how much privacy the users who issued these requests have.</p>
        <p>We want to find the following function:</p>
      </sec>
      <sec id="sec-2-2">
        <title>Att : the Request set</title>
        <p>the U sers
! [0; 1];
Intuitively, given a (generalized) request r0 and a user i, Att(r0; i) gives the
probability that the adversary can derive from CH that i is the issuer of r0
among all the users.</p>
        <p>In the rest of this section we show how to specify the attack function for context CH. Once the attack function is specified, we can use the following formula to evaluate the privacy value of a request:</p>
        <p>Privacy(r′) = 1 −</p>
        <sec id="sec-2-2-1">
          <title>AttCH(r′, issuer(r′))</title>
          <p>(1)
Intuitively, this value is the probability that the attacker will not associate the issuer of request r′ with r′.</p>
          <p>In order to specify the Att function, we introduce the function Inside(i, r′), which indicates the probability of user i being located in r′.Sdata at the time of the request. Intuitively, Inside(i, r′) = 1 if user i is identified by the adversary as one of the users located in r′.Sdata at time r′.Tdata, i.e., i ∈ Ident_t(r′.Sdata) for t = r′.Tdata. On the contrary, Inside(i, r′) = 0 if i is recognized by the adversary as one of the users located outside r′.Sdata at time r′.Tdata, i.e., there exists an area A with A ∩ r′.Sdata = ∅ such that i ∈ Ident(A). Finally, if neither of the above cases holds, then the adversary does not know where i is; there is still a probability that i is in r′.Sdata. Theoretically, this probability is the number of users in r′.Sdata that are not recognized by the adversary (i.e., Num(r′.Sdata) − |Ident(r′.Sdata)|) divided by the number of users who are not recognized by the adversary anywhere (i.e., |I| − |Ident(Ω)|, where I is the set of all users and Ω is the entire area of the application). Formally,</p>
          <p>Inside(i, r′) =
1, if i ∈ Ident(r′.Sdata);
0, if ∃A : A ∩ r′.Sdata = ∅ and i ∈ Ident(A);
(Num(r′.Sdata) − |Ident(r′.Sdata)|) / (|I| − |Ident(Ω)|), otherwise.   (2)</p>
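          <p>The three cases of Formula 2 translate directly into code. In the sketch below, the adversary's knowledge is handed over as plain sets and counts; all parameter names are illustrative, not from the paper.</p>

```python
def inside(i, ident_in_r, ident_outside, num_r, total_users, total_ident):
    """Inside(i, r') per the three cases of Formula 2.
    ident_in_r:    users identified inside r'.Sdata at time r'.Tdata
    ident_outside: users identified in some area disjoint from r'.Sdata
    num_r:         Num(r'.Sdata), estimated population of r'.Sdata
    total_users:   |I|, the total user population
    total_ident:   |Ident(Omega)|, users identified anywhere in the map
    """
    if i in ident_in_r:       # identified inside the generalized region
        return 1.0
    if i in ident_outside:    # identified somewhere disjoint from it
        return 0.0
    # Unrecognized anywhere: unidentified seats in r'.Sdata divided by
    # unidentified users overall.
    return (num_r - len(ident_in_r)) / (total_users - total_ident)
```

          <p>With the data of Example 1 below (100 users, i1 identified inside, i2 and i3 identified outside), the exact "otherwise" value is (3 − 1)/(100 − 3) = 2/97.</p>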
          <p>We can now define the Att function in context CH. For the sake of presentation, let us first consider the attack in the snapshot context</p>
          <p>Csnap = (Ident, Num, Lsnap),
where for each generalized request r′, Lsnap(r′) = {r′}. In this special case, the probability of a user i being the issuer of r′ is given by the probability of i being in r′.Sdata at the time of the request, normalized over all the users in I. Formally, the attack can be defined as:</p>
          <p>AttCsnap(r′, i) = Inside(i, r′) / Σ_{i′∈I} Inside(i′, r′)   (3)</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Inside(i, r′)</title>
        <p>When the total population of users is large (relative to the number of users whose locations are known to the adversary), the "otherwise" value in Formula 2 is very small, albeit nonzero. Intuitively, if a user i falls into this case, then the adversary cannot really distinguish this particular user from all the other users who also fall into this case. For such a user i, we can simply give the value 1/|I| to AttCsnap(r′, i). We could give 1/(|I| − Num(Ω)), but this does not make much impact in practice. It is now easy to see that</p>
        <sec id="sec-2-3-1">
          <title>AttCsnap(r′, i)</title>
          <p>8 1=N um(r0:Sdata) if Inside(i; r0) = 1
&lt; 0 if Inside(i; r0) = 0
: 1=jIj otherwise
(4)
The above formula makes intuitively sense. Indeed, if i is recognized as
inside r0:Sdata, without any other information, the adversary cannot distinguish
him/her from any of the N um(r0:Sdata) people in the area who might be the
issuer. If i is recognized outside, then clearly i cannot be the issuer due to our
de nition of (generalized) requests. If i is not recognized anywhere (meaning
he/she can be anywhere), then the attacker cannot distinguish him/her from
any of the other people who are not recognized. Since we assume the total
population is much greater than N um( ), the probability that i is the issuer is close
to 1=jIj.</p>
          <p>Example 1. Consider the situation shown in Figure 1(a), in which there is a request r′ such that, at time r′.Tdata, there are three users in r′.Sdata: one of them is identified as i1, the other two are not identified. The adversary can also identify users i2 and i3 outside r′.Sdata at time r′.Tdata. Assume that the set I contains 100 users.</p>
          <p>Fig. 1. (a) First request, r′; (b) second request, r″.</p>
          <p>Clearly, i2 and i3 have zero probability of being the issuer, since they are identified outside r′.Sdata, and by the assumption that the spatial region of any generalized request must contain the spatial region of the original request. On the contrary, the adversary is sure that i1 is located in r′.Sdata. By Equation 3, the attack associates i1 with r′ with likelihood 1/(Σ_{i′∈I} Inside(i′, r′)). By Formula 2, for each user i in I ∖ {i1, i2, i3}, Inside(i, r′) ≈ 2/100. Therefore, Σ_{i′∈I} Inside(i′, r′) ≈ 97 · 2/100 + 1 ≈ 3. Consequently, the probability of i1 being the issuer of r′ is approximately 1/3. Moreover, each user i ∈ I ∖ {i1, i2, i3} has a probability of being the issuer of about (2/100)/3 = 2/300.</p>
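          <p>The arithmetic of Example 1 can be checked numerically. The snippet below follows the example's own approximation of 2/100 for the "otherwise" value (the exact Formula 2 value would be 2/97); variable names are illustrative.</p>

```python
# Example 1: 100 users; i1 identified inside r'.Sdata, i2 and i3
# identified outside, and the remaining 97 users each get the example's
# approximate "otherwise" probability of 2/100.
inside_vals = {"i1": 1.0, "i2": 0.0, "i3": 0.0}
for k in range(97):
    inside_vals[f"u{k}"] = 2 / 100

denom = sum(inside_vals.values())        # 97 * 2/100 + 1 = 2.94, close to 3
att_i1 = inside_vals["i1"] / denom       # close to 1/3
att_unknown = inside_vals["u0"] / denom  # close to 2/300
```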
          <p>In the general case L(r′) ⊇ {r′}, we can evaluate, analogously to the snapshot case, the probability that a user is located in the generalized region of all the requests in the trace τ = L(r′). So we can extend the Inside function to traces: given a trace τ and a user i, Inside(i, τ) is the probability that user i is located, for each request r′ in τ, in r′.Sdata at time r′.Tdata. Then, the attack is
AttCH(r′, i) = Inside(i, L(r′)) / Σ_{i′∈I} Inside(i′, L(r′))   (5)</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>Inside(i, L(r′))</title>
        <p>We now turn to consider how to compute Inside(i, τ).</p>
        <p>First consider some easy cases. If i ∈ Ident(r′.Sdata) for all requests r′ ∈ τ, then Inside(i, τ) = 1. If i ∈ Ident(A) and A ∩ r′.Sdata = ∅ for an area A and at least one request r′ ∈ τ, then Inside(i, τ) = 0.</p>
        <p>The remaining cases are difficult ones. To calculate Inside(i, τ), we need to consider the likelihood of someone moving from one location to/from another at the specific times. In this paper, we advocate the following as a reasonable approach. We assume that for each pair of locations A and B and two times t1 and t2, we know the probability of a user i being in B at time t2 conditioned on the fact that the user is in A at time t1. Formally, consider two random variables X: "i is inside A at time t1" and Y: "i is inside B at time t2", where A and B are two areas and t1 and t2 are two different times. We assume the adversary knows the value P(Y|X).</p>
        <p>We note that P(Y|X) in general is not the same as P(Y). Indeed, how likely user i is to be in B can depend on how likely the same user is to be in A. Take two extreme examples: if A and B are very far away and t1 and t2 are close to each other, then i cannot be in B at t2 if i is in A at t1, i.e., P(Y|X) ≈ 0. On the other hand, if A and B are just two locations along a one-way road and the difference between times t1 and t2 matches the time needed to move from A to B at a normal moving speed, then P(Y|X) ≈ 1. In practice, this value can be derived from historical observations and experiences.</p>
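        <p>One simple way to derive such a P(Y|X) from historical observations is frequency estimation over recorded traces. The function below is an illustrative sketch under invented names, not the paper's procedure.</p>

```python
def estimate_transition(traces, area_a, t1, area_b, t2):
    """Estimate P(user in B at t2 | user in A at t1) by counting over
    historical traces. Each trace is a dict mapping a time instant to
    the area id where the user was observed at that time."""
    in_a = [tr for tr in traces if tr.get(t1) == area_a]
    if not in_a:
        return None  # no historical evidence to condition on
    moved = sum(1 for tr in in_a if tr.get(t2) == area_b)
    return moved / len(in_a)
```

        <p>For instance, if two of three traces observed in A at t1 appear in B at t2, the estimate is 2/3.</p>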
        <p>Now, assume τ consists of the requests r′1, …, r′k. We form a Bayesian network for each user i with X1, …, Xk as the nodes, where each Xj corresponds to the random variable "user i is in r′j.Sdata at time r′j.Tdata". In this network, for each node Xh that satisfies the condition (denoted c) i ∈ Ident_t(r′h.Sdata) with t = r′h.Tdata, we draw an arc towards each other node Xh′ that does not satisfy condition c. In addition, for each pair of nodes Xh and Xh′ such that neither satisfies condition c, we draw an arc from Xh to Xh′ if r′h.Tdata &lt; r′h′.Tdata. (The resulting network is acyclic.) As we have assumed, we know the value P(Xh′|Xh) for each arc from Xh to Xh′. Denote by E the conjunctive fact that P(Xh) = 1 for each r′h ∈ τ that satisfies condition c. What we want to find is P(X1, …, Xk|E) = Inside(i, τ). This is a well-studied belief revision problem, and many computation and approximation methods exist. (Note that if we apply this method to the easier cases mentioned earlier, we arrive at the correct values.)
Example 2. Continuing from Example 1, assume a second request r″ (see Figure 1(b)) is issued after r′ and that r″ is linked with r′, so τ consists of these two requests. At time r″.Tdata, there are 4 users inside r″.Sdata, two of whom are identified as i1 and i2. No user is identified outside r″.Sdata. From the above discussion, it follows that Inside(i2, τ) = Inside(i3, τ) = 0, since i2 and i3 are identified outside the first generalized request r′. All the other users have a non-zero probability of being inside the generalized region of each request in the trace. In particular, Inside(i1, τ) = 1, since i1 is recognized in both requests. Consider a user i ∈ I ∖ {i1, i2, i3}, and denote by X_r′ and X_r″ the assertions "i is in r′.Sdata at time r′.Tdata" and "i is in r″.Sdata at time r″.Tdata". In this case, the Bayesian network for i has two nodes, X_r′ and X_r″, and there is an arc from X_r′ to X_r″ since r″ is issued after r′. Now let us assume P(X_r″|X_r′) = 0.75, i.e., there is a 75% likelihood that someone in r′.Sdata will move to r″.Sdata at the specified times. We compute Inside(i, τ) = P(X_r′, X_r″) = P(X_r′)·P(X_r″|X_r′) = (2/97)·0.75. The sum of all the Inside(j, τ) values is then 1 + 0 + 0 + 97·(2/97)·0.75 = 2.5. The attack values under these assumptions are as follows: AttCH(r″, i1) = 1/2.5 = 40%, AttCH(r″, i2) = AttCH(r″, i3) = 0, and AttCH(r″, i) = (2/97)·0.75/2.5 ≈ 0.6% for each of the other 97 users i.</p>
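        <p>The first scenario of Example 2 is a two-node chain, so Inside reduces to a simple product and the attack values can be checked numerically (variable names are illustrative):</p>

```python
# Example 2, first scenario: trace (r', r''), with P(X_r'' | X_r') = 0.75,
# i1 identified in both requests, i2 and i3 identified outside r'.Sdata,
# and 97 unidentified users with P(X_r') = 2/97 each.
p_unknown = (2 / 97) * 0.75            # Inside value for an unidentified user
total = 1 + 0 + 0 + 97 * p_unknown     # sum of all Inside values = 2.5
att_i1 = 1 / total                     # 40%
att_unknown = p_unknown / total        # about 0.6%
```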
        <p>To make the situation more interesting, let us remove the fact that i2 was recognized outside at time r′.Tdata, and let us figure out the value Inside(i2, τ). In this case, assume P(X_r′|X_r″) = 0.75, namely people who are in r″.Sdata have a 75% likelihood of coming from r′.Sdata. Under the fact E that i2 is in r″.Sdata, we then know Inside(i2, τ) = P(X_r′, X_r″|E) = 0.75. The sum of the Inside values is then 1 + 0.75 + 0 + 97·(2/97)·0.75 = 3.25. Hence, AttCH(r″, i1) ≈ 31%, AttCH(r″, i2) ≈ 23%, AttCH(r″, i3) = 0, and AttCH(r″, i) = (2/97)·0.75/3.25 ≈ 0.47% for each of the other 97 users i. This is an interesting exercise, as it reveals that if we allow i2 to possibly be in r′.Sdata (with 75% probability), then the likelihood that i1 is the issuer decreases, which is intuitively correct.</p>
        <p>It is worth noting that the definition of the attack in context CH is a proper extension of the attack that can be defined in the conservative context in which the adversary knows the location and the identity of each user at each time instant. The historical attack in this context was first proposed in [14]. The idea is that the only users who have a non-zero probability of being the issuer of a trace of requests are those whose spatio-temporal location is contained in the generalized region of every request in the trace. It can easily be seen that, if each user can be identified at each time instant, then the Inside() function returns either 0 or 1, and hence the attack we specified for context CH assigns zero probability to each user located outside the generalized region of any request in the trace.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>The MilanoByNight simulation</title>
      <p>In order to evaluate privacy-preserving techniques applied to LBS, a dataset of users' movements is needed. In our experiments, we want to focus on privacy threats that arise when using a friend-finder service, as described in Section 1. We suppose that this kind of service is primarily used by people during entertainment hours, especially at night. Therefore, the ideal dataset for our experiments should represent the movements of people on a typical Friday or Saturday night in a big city, when users tend to move to entertainment places. To our knowledge, no such datasets are currently publicly available, especially considering that we want large-scale, individual, and precise location data (i.e., with the same approximation as current consumer GPS technology). In this section we describe how we generated this user-movement dataset.</p>
3.1</p>
      <p>Relevant Parameters
For our experiments we want to arti cially generate movements for 100; 000 users
on the road network of Milan5. The total area of the map is 324 km2, and the
resulting average density is 308 users/km2. Very detailed digital vector maps
of the city have been generously provided by the municipality of Milan. The
simulation includes a total of 30,000 home buildings and 1,000 entertainment
places; the first value is strictly related to the considered number of users, while
the second is based on real data from public sources, which also provide the
geographical distribution of the places. Our simulation starts at 7 pm and ends at
1 am. During these hours, each user moves from home to an entertainment place,
spends some time in that place, and possibly moves to another entertainment
place or goes back home.</p>
      <p>All probabilities related to users' choices are modeled with probability
distributions. For this specific data generation, some of the important parameters
of the simulation are:
- Source and destination. These are the locations essential to define
movements. They may be homes or entertainment places. Some places in some
districts are more popular than others.
- StartingTime. The time at which a user leaves her home to go to the first
entertainment place.
- Permanence. How long a user stays at one entertainment place.
- NumPlaces. How many entertainment places a user visits in one night.</p>
      <p>In order to have a realistic model of these distributions, we prepared a survey
to collect real user data. We are still collecting data, but the current parameters
are based on interviews of more than 300 people in our target category.
5 100,000 is an estimate of the number of people participating in the service we
consider.</p>
      <p>3.2 Weaknesses of mostly random movement simulations</p>
      <p>
        Many papers in the field of privacy preservation in LBS use artificial data
generated by moving-object simulators to evaluate their techniques. However, most
of these simulators are not able to reproduce realistic user behavior.
For example, objects generated by the Brinkhoff generator [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] cannot be
aggregated in certain places (e.g., entertainment places). Indeed, once an object
is instantiated, the generator chooses a random destination point on the map;
after reaching the destination, the object disappears from the dataset. For the
same reason, it is not possible to reproduce simple movement patterns (e.g., a
user going out from her home to another place and then coming back home),
nor to simulate that a user remains for a certain time in a place.
      </p>
      <p>Despite these strong limitations, we made our best effort to use the Brinkhoff
simulator to generate a set of user movements with characteristics as close as
possible to those explained in Section 3.1. For example, in order to simulate
entertainment places, some random points on the map, among those points on
the trajectories of users, were picked. The main purpose of this simulation is
to understand whether testing privacy preservation over random movement simulations
gives significantly different results with respect to more realistic simulations.</p>
      <p>3.3 Generation of user movements with a context simulator</p>
      <p>
        In order to obtain a dataset consistent with the parameters specified in
Section 3.1, we need a more sophisticated simulator. For our experiments, we have
chosen to customize the Siafu context simulator [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. With a context simulator
it is possible to design models for agents, places, and context. Therefore, it is
possible to define particular places of aggregation and make users dynamically
choose which place to reach and how long to stay in that place. In our simulation,
homes are distributed almost uniformly on the map, with a minor concentration
in the central zones of the city. Entertainment places are mostly concentrated
in 5 zones of the city.
      </p>
      <p>The distributions for the StartingTime, Permanence, and NumPlaces parameters
introduced in Section 3.1 were modeled with the results of the survey. For
example, the time of permanence in an entertainment place was modeled according
to the following percentages derived from the survey: 9.17% of the users stay
less than 1 hour, 34.20% stay between 1 and 2 hours, 32.92% stay between 2
and 3 hours, 16.04% stay between 3 and 4 hours, and 7.68% stay more than 4
hours.</p>
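      <p>Drawing a user's permanence time from this survey-derived distribution can be sketched as follows. This is our own illustration, not the simulator's code: the bucket bounds in hours, the cap of the open-ended ">4 hours" bucket at 5 hours, and the uniform draw within a bucket are all assumptions.</p>

```python
# Sample a permanence time (in hours) from the survey-derived buckets.
import random

# (lower hour, upper hour, probability); ">4 hours" is capped at 5 here.
PERMANENCE_BUCKETS = [
    (0, 1, 0.0917),
    (1, 2, 0.3420),
    (2, 3, 0.3292),
    (3, 4, 0.1604),
    (4, 5, 0.0768),
]

def sample_permanence(rng=random):
    """Pick a bucket with the survey probabilities, then a uniform time in it."""
    r = rng.random()
    cumulative = 0.0
    for low, high, p in PERMANENCE_BUCKETS:
        cumulative += p
        if r <= cumulative:
            return rng.uniform(low, high)
    return PERMANENCE_BUCKETS[-1][1]  # guard against rounding of the probabilities
```

      <p>Over many draws, roughly 9% of the sampled times fall below one hour, matching the first survey bucket.</p>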
      <p>Following these parameters, in our dataset users spend 50.87% of the time
at home, 7.28% of the time moving from one place to another, and 41.85% of the
time in entertainment places. When a user moves from one place to another, she
decides whether to go on foot or by car. In general, if an entertainment place
is farther than 500 meters, people tend to move by car, and this is reflected in
the simulation. The average speed of movements by car is 20 km/h, while the
average speed on foot is 3.6 km/h. With our parameters, 10.64% of movements
are done on foot, while all the others are done by car.</p>
    </sec>
    <sec id="sec-4">
      <title>Experimental results</title>
      <p>In this section we show the results of our experimental evaluation. We first
define how we evaluate the quality of service in Section 4.1, then we describe
the experimental setting in Section 4.2 and the generalization algorithms we
used in Section 4.3. Finally, in Sections 4.4 and 4.5 we show the impact of the
simulation parameters and of the user movements, respectively, on the evaluation
of the generalization algorithms.</p>
      <p>Different metrics can be defined to measure QoS for different kinds of services. For
instance, for the friend-finder service we are considering, it would be possible to
measure how many times the generalization leads the SP to return an incorrect
result, i.e., the issuer is not notified of a close-by friend or, vice versa, the issuer
is notified of a friend that is not close-by. While this metric is useful for this
specific application, we want to measure the QoS independently of the specific
kind of service. For this reason, in this paper we evaluate how QoS degrades in
terms of the perimeter of the generalized region. If the generalized region is
too large, the service becomes useless. For this purpose, we introduce a new
parameter, called maxP, that indicates this threshold in terms of the maximum
perimeter. We assume that no request is sent to the SP with a perimeter larger
than maxP.</p>
      <p>In our experiments we used two datasets of user movements. The dataset AB
(Agent-Based) was generated with the customized Siafu simulator as described
in Section 3.3, while the dataset MRM (Mostly Random Movement) was created
with the Brinkhoff simulator as described in Section 3.2. In both cases, we
simulate LBS requests for the friend-finder service by choosing random users in the
simulation; for each request we compute the generalization according to a given
algorithm, we evaluate QoS as explained in Section 4.1, and we evaluate privacy according to
formula (1) presented in Section 2.</p>
      <p>The most important parameters that characterize the simulations are
reported in Table 1, with the values in bold denoting default values. The number
of users indicates how many users are in the simulation, and the simulations
are designed so that this number remains almost constant at each time instant.
Every two minutes, each user has a probability Preq of issuing a request. For
technical reasons, the reported tests are based on a time frame of three hours
over the total six hours of the MilanoByNight scenario. This implies that in the
default case we consider a total of 45 requests (one every four minutes of the
considered time frame). The parameter Pid_in indicates the probability that a
user is identified when she is located in an entertainment place, while Pid_out is
the probability that a user is identified in any other location (e.g., while moving
from home to an entertainment place). While we also perform experiments where
the two probabilities are the same, our scenario suggests that a higher value for
Pid_in is much more realistic (it is considered ten times higher than Pid_out). This
is due to the fact that restaurants, pubs, movie theaters, and similar places are
likely to have different ways of identifying people (fidelity or membership cards, wi-fi
hotspots, cameras, credit card payments, etc.), and in several cases more than
one place is owned by the same company, which may have an interest in collecting
data about its customers.</p>
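      <p>The identification model just described can be sketched in a few lines. This is our own toy formulation, not the paper's code; the default probability values here are placeholders chosen only to respect the stated ten-to-one ratio between Pid_in and Pid_out.</p>

```python
# Toy model: a user is identified with probability pid_in inside an
# entertainment place and pid_out (ten times lower by default) elsewhere.
import random

def is_identified(in_entertainment_place, pid_in=0.1, pid_out=0.01, rng=random):
    """Return True if the adversary identifies the user at this location."""
    p = pid_in if in_entertainment_place else pid_out
    return rng.random() < p
```
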
      <p>Finally, Plink indicates the probability that two consecutive requests can be
identified as issued by the same user.6 While we perform our tests considering
a full range of values, the specific default value reported in the table is due
to a recent study on the ability to link positions based on spatio-temporal
correlation [18].</p>
      <p>The experimental results we show in this section are obtained by running the
simulation for 100 issuers and then computing the average values.</p>
      <p>4.3 The Generalization Algorithms Used in the Experiments</p>
      <p>In our experiments we evaluate the privacy and the QoS of requests generalized
by using two algorithms previously proposed in the literature. The first one,
called Grid, was presented in [19], and it is used as a representative of several
algorithms aimed at guaranteeing k-anonymity in the snapshot case, i.e., these
algorithms do not take into account the linking ability of the adversary. Intuitively,
this particular algorithm partitions all users into blocks, each one having
cardinality at least k. Then, it computes the generalized region as the minimum
bounding rectangle (MBR) that covers the locations of the users in the same
block as the issuer.</p>
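      <p>The generalization step can be illustrated as follows. This is not the exact Grid algorithm of [19]; the blocking rule used here (chunking users sorted by their x coordinate) is our own simplification, kept only to show the block-then-MBR structure.</p>

```python
# Put the issuer in a block of at least k users and return the minimum
# bounding rectangle (MBR) of that block.

def generalize(issuer, users, k):
    """issuer: (x, y) tuple that must appear in users; returns (x1, y1, x2, y2)."""
    users = sorted(users, key=lambda u: u[0])          # simplistic 1-D blocking
    blocks = [users[i:i + k] for i in range(0, len(users), k)]
    if len(blocks) > 1 and len(blocks[-1]) < k:        # merge a too-small tail block
        blocks[-2].extend(blocks.pop())
    block = next(b for b in blocks if issuer in b)
    xs = [u[0] for u in block]
    ys = [u[1] for u in block]
    return (min(xs), min(ys), max(xs), max(ys))
```

      <p>Whatever blocking rule is used, every block has cardinality at least k, so the returned MBR always covers the locations of at least k users.</p>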
      <p>The second algorithm, Greedy, was first proposed in [14] and a similar idea
was also described in [15]. The use of Greedy is intended to represent algorithms
aimed at preserving privacy in the historical case, i.e., the general CH context,
assuming that the attacker may actually obtain and recognize traces of requests
from the same issuer. This algorithm computes the generalization of the first
request r in a trace using an algorithm for the snapshot case. While doing this,
the set A of users located in the generalized region is stored. The generalized
region of the successive request r' linked with r is then computed as the MBR
of the locations of the users in A at the time of r'. In our implementation we
use Grid as the snapshot algorithm to compute the generalization of the first
request.
6 The limitation to consecutive requests is because in our specific scenario we assume
linking is performed mainly through spatio-temporal correlation.</p>
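      <p>The core of the Greedy idea can be sketched with a minimal data model of our own (not the authors' implementation): remember the user set A covered by the snapshot generalization of the first request, then generalize each successive linked request as the MBR of A's current locations.</p>

```python
# Greedy-style historical generalization: keep the anonymity set A fixed
# across linked requests and re-cover its members' current positions.

def mbr(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

class Greedy:
    def __init__(self, snapshot_block):
        # snapshot_block: users covered by the snapshot generalization of r
        self.a_set = list(snapshot_block)

    def generalize_next(self, positions):
        """positions: dict user -> (x, y) at the time of the linked request r'."""
        return mbr([positions[u] for u in self.a_set])

g = Greedy(["issuer", "u1", "u2"])
box = g.generalize_next({"issuer": (0, 0), "u1": (10, 5), "u2": (4, 12)})
# box == (0, 0, 10, 12): the MBR follows A's members wherever they moved
```
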
      <p>For the purpose of our tests, we modified the two algorithms above so that
each generalized region has a perimeter always smaller than maxP. To achieve
this, if the perimeter of the generalized region is larger than maxP, the
region is iteratively shrunk, until its perimeter is below maxP, by excluding
from the MBR the user that is farthest from the issuer. In the Greedy algorithm,
when a user is excluded from the generalized region, she is also excluded from
the set A of users, and hence she is not used in the generalization of the
successive requests. Eventually, when the set A contains only the issuer, a snapshot
generalization is executed again and A is reinitialized.</p>
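      <p>The shrinking step described above can be sketched as follows, under our own geometric simplifications (points in the plane, axis-aligned MBR): while the MBR's perimeter exceeds maxP, drop the user farthest from the issuer and recompute the MBR.</p>

```python
# Iteratively shrink a generalized region until its perimeter is below maxP.

def mbr(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def perimeter(box):
    x1, y1, x2, y2 = box
    return 2 * ((x2 - x1) + (y2 - y1))

def shrink_to_max_perimeter(issuer, others, max_p):
    """issuer: (x, y); others: locations of the other users in the block."""
    # Sort by squared distance from the issuer, farthest last.
    block = sorted(others,
                   key=lambda p: (p[0] - issuer[0]) ** 2 + (p[1] - issuer[1]) ** 2)
    while block and perimeter(mbr([issuer] + block)) > max_p:
        block.pop()  # exclude the user farthest from the issuer
    return mbr([issuer] + block), block
```

      <p>Note that the region always keeps the issuer, so in the worst case it degenerates to the issuer's own location, which is when Greedy reinitializes A with a new snapshot generalization.</p>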
      <p>In addition to the input request r and the locations of all the users in the
system, the considered algorithms require two additional parameters: the value
k and the threshold maxP. In our tests, we used values for k between 10 and
60 (default is 10) and values for maxP between 1000 and 4000 meters (default is
1000 meters).</p>
      <p>In our experimental results we also evaluated the privacy threat when no
privacy-preserving algorithm is applied. The label NoAlg is used in the figures
to identify results in this particular case.</p>
      <p>4.4 Impact of Simulation Parameters in the Evaluation of the
Generalization Algorithms</p>
      <p>The objective of the first set of experimental results we present is to show which
parameters of the simulation most affect the evaluation of the generalization
algorithms. In these tests we used the AB dataset only.</p>
      <p>Figure 2(a) shows that the average privacy obtained with Greedy and Grid
is not significantly affected by the size of the total population. Indeed, both
algorithms, independently of the total number of users, try to obtain generalized
regions that cover the locations of k users, so the privacy of the requests is not
affected. However, when the density is high, the two algorithms can generalize
to a small area, while when the density is low, a larger area is necessary to
cover the locations of k users (see Figure 2(b)). On the contrary, the privacy
obtained when no generalization is performed is significantly affected by the
total population. Indeed, a higher density increases the probability of different
users being in the same location, and hence it increases privacy even if the requests
are not generalized.</p>
      <p>A parameter that significantly affects the average privacy is the probability
of identification of a user in a certain place. In Figure 3 we show the
experimental results for different values of Pid_in when, in each test, Pid_out is set to
Pid_in/10. As expected, considering a trace of requests, the higher the
probability of identifying users in one or more of the regions from which the requests
in the trace were performed, the smaller the level of privacy.</p>
      <p>[Figure 2: (a) average privacy and (b) size of the generalized region obtained by Greedy, Grid, and NoAlg as the number of users varies from 40000 to 100000.]</p>
      <p>Figure 4(a) shows the impact of Plink on the average privacy. As expected,
high values of Plink lead to small values of privacy. Our results show that the
relation between Plink and privacy is not linear. Indeed, privacy depends
almost linearly on the average length of the traces identified by the adversary
(Figure 4(b)). However, the average length of the traces grows almost
exponentially with the value of Plink (Figure 5).</p>
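      <p>The blow-up of trace length as Plink approaches 1 can be illustrated with a toy construction of our own (not the paper's experiment): if each pair of consecutive requests is linked independently with probability p, traces are runs whose expected length is 1/(1 - p), which grows very fast near p = 1.</p>

```python
# Estimate the average length of linked traces for a given link probability.
import random

def average_trace_length(n_requests, p_link, rng):
    lengths, current = [], 1
    for _ in range(n_requests - 1):
        if rng.random() < p_link:
            current += 1          # linked: the trace keeps growing
        else:
            lengths.append(current)
            current = 1           # link broken: a new trace starts
    lengths.append(current)
    return sum(lengths) / len(lengths)

rng = random.Random(0)
for p in (0.5, 0.9, 0.99):
    # approaches 1/(1 - p): about 2, 10, and 100 respectively
    print(p, average_trace_length(100000, p, rng))
```
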
      <p>To summarize the first set of experiments, our findings show that many
parameters of the simulation significantly affect the evaluation of the generalization
algorithms. This implies that when a generalization algorithm is evaluated, it is
necessary to carefully estimate realistic values for the parameters of the
simulation. Indeed, an error in the estimation may lead to misleading results.</p>
      <p>4.5 Impact of the User Movements on the Evaluation of the
Generalization Algorithms</p>
      <p>The objective of the second set of experiments is to answer an important
question posed in this paper: what is the impact of the different simulated user
movements on the evaluation of the generalization algorithms? We answer
this question with a set of tests performed on the two different datasets we
obtained as described above.</p>
      <p>The first set of tests, reported in Figure 6, compares the privacy achieved
by the Greedy algorithm on the two datasets for different values of k and for
different values of QoS. The experiments on MRM were repeated trying also
larger values for the QoS threshold (maxP = 2000 and maxP = 4000), so three
different versions of MRM appear in the figures. In order to focus on these
parameters only, in these tests the probability of identification was set to the
same value for every place (Pid_in = Pid_out = 0.1), and for the MRM dataset
the issuer of the requests was randomly chosen only among those users that stay
in the simulation for 3 hours, ignoring the ones staying for a much shorter time
that are inevitably part of this dataset. This setting allowed us to compare the
results on the two datasets using the same average length of traces identified by
the adversary.</p>
      <p>Figure 6(a) shows that the average privacy of the algorithm evaluated on
the AB dataset is much higher than on the MRM dataset. This is mainly
motivated by the fact that in AB users tend to concentrate in a few locations
(the entertainment places), and this enhances privacy. This is also confirmed by a
similar test performed without using any generalization of locations; we obtained
values constantly higher for the AB dataset (the average privacy is 0.67 in AB
and 0.55 in MRM).</p>
      <p>In Figure 6(b) we show the QoS achieved by the algorithm on the two datasets
with respect to the average privacy achieved. This result confirms that the level
of privacy evaluated on the AB dataset using small values of k and maxP for
the algorithm cannot be observed on the MRM dataset even with much higher
values for these parameters.</p>
      <p>From the experiments shown in Figure 6 we can conclude that if the MRM
dataset is used as a benchmark to estimate the values of k and maxP that
are necessary to provide a desired average level of privacy, then the results will
suggest the use of values that are over-protective. As a consequence, it is possible
that the service will exhibit a much lower QoS than the one that could be
achieved with the same algorithm.</p>
      <p>1
iycva 0.9
r
p
e
g
a
rve 0.8
A
1
0.95</p>
      <p>The above results may still suggest that using MRM is safe, since, according
to what we have seen above, a technique achieving a certain level of privacy may
only do better in a real scenario. However, our second set of experiments shows
that this is not the case.</p>
      <p>In Figure 7 we show the results we obtained by varying the probability of
identification. For this test, we considered two sets of issuers in the MRM
dataset. One set is composed of users that stay in the simulation for 3 hours (MRM
long traces in Figure 7), while the other contains issuers randomly chosen from the
entire set of users (MRM all traces in Figure 7), hence including users staying
in the simulation for a much shorter time.</p>
      <p>In Figures 7(a) and 7(b) we can observe that the execution on the MRM
dataset leads to evaluating a privacy level that is higher than the one obtained
on the AB dataset. In particular, the evaluation of the Grid algorithm using
the MRM dataset (Figure 7(b)) would suggest that the algorithm is able to
provide high privacy protection. However, when evaluating the same algorithm
using the more realistic dataset AB, this conclusion turns out to be incorrect. In
this case, the evaluation on the MRM dataset may lead to underestimating the
privacy risk, and hence to deploying services based on generalization algorithms
that may not provide the minimum required level of privacy.</p>
      <p>In this paper we claim that the experimental evaluation of LBS privacy-preserving
techniques should be based on user movement datasets obtained through
simulations tailored to the specific deployment scenario of the target services.
Our results support our thesis for the class of LBS known as friend-finder
services, for techniques based on spatial cloaking, and for adversary models that
include the possibility for the adversary to occasionally recognize people in
certain locations. We believe that these results can be generalized to other LBS,
techniques, and adversary models. For example, as future work, it would be
interesting to also evaluate defense techniques that generalize the issuer's
location to an area that does not necessarily contain the issuer's location.
Moreover, in our experiments we only considered the first of the two privacy threats
presented in the introduction. We do have some ideas on how to extend them
to consider the second, location privacy, as well. Finally, we believe a significant
effort should be devoted to the development of new flexible and efficient
context-aware user movement simulators, as well as to the collection of real
data, possibly even in an aggregated form, to properly tune the simulations. In
our opinion this is a necessary step to have significant common benchmarks to
evaluate LBS privacy-preserving techniques.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The authors would like to thank Stefano Varesi for his contribution in writing
the code that was used for our simulations. This work was partially supported
by the National Science Foundation (NSF) under grant N. CNS-0716567, and by
the Italian MIUR under grant InterLink II04C0EC1D.</p>
      <p>13. Ardagna, C.A., Cremonini, M., Damiani, E., di Vimercati, S.D.C., Samarati, P.:
Location privacy protection through obfuscation-based techniques. In: Proc. of the
21st Conference on Data and Applications Security. Volume 4602 of Lecture Notes
in Computer Science, Springer (2007) 47-60
14. Bettini, C., Wang, X.S., Jajodia, S.: Protecting privacy against location-based
personal identification. In: Proc. of the 2nd Workshop on Secure Data Management.
Volume 3674 of LNCS, Springer (2005) 185-199
15. Xu, T., Cai, Y.: Location anonymity in continuous location-based services. In:
Proc. of the ACM International Symposium on Advances in Geographic Information
Systems, ACM Press (2007)
16. Abul, O., Bonchi, F., Nanni, M.: Never walk alone: Uncertainty for anonymity in
moving objects databases. In: Proc. of the 24th International Conference on Data
Engineering, IEEE Computer Society (2008)
17. Terrovitis, M., Mamoulis, N.: Privacy preservation in the publication of
trajectories. In: Proc. of the 9th International Conference on Mobile Data Management,
IEEE Computer Society (2008)
18. Vyahhi, N., Bakiras, S., Kalnis, P., Ghinita, G.: Tracking moving objects in
anonymized trajectories. In: Proc. of the 19th International Conference on Database
and Expert Systems Applications, Springer (2008, to appear)
19. Mascetti, S., Bettini, C., Freni, D., Wang, X.S.: Spatial generalization algorithms
for LBS privacy preservation. Journal of Location Based Services 2(1) (2008)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Barkhuus</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dey</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Location-based services for mobile telephony: a study of users' privacy concerns</article-title>
          .
          <source>In: Proc. of the 9th International Conference on Human-Computer Interaction</source>
          , IOS Press (
          <year>2003</year>
          )
          <volume>709</volume>
          -
          <fpage>712</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bettini</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mascetti</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jajodia</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>Anonymity in location-based services: towards a general framework</article-title>
          .
          <source>In: Proc. of the 8th International Conference on Mobile Data Management, IEEE Computer Society</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Brinkhoff</surname>
          </string-name>
          , T.:
          <article-title>A framework for generating network-based moving objects</article-title>
          .
          <source>GeoInformatica</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ) (
          <year>2002</year>
          )
          <volume>153</volume>
          -
          <fpage>180</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nurmi</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>A generic large scale simulator for ubiquitous computing</article-title>
          .
          <source>In: 3rd Annual International Conference on Mobile and Ubiquitous Systems: Networking &amp; Services, IEEE Computer Society (July</source>
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hidalgo</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barabasi</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          :
          <article-title>Understanding individual human mobility patterns</article-title>
          .
          <source>Nature</source>
          <volume>453</volume>
          (
          <year>June 2008</year>
          )
          <volume>779</volume>
          -
          <fpage>782</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ghinita</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalnis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khoshgozaran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shahabi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>K.L.</given-names>
          </string-name>
          :
          <article-title>Private queries in location based services: Anonymizers are not necessary</article-title>
          .
          <source>In: Proc. of SIGMOD</source>
          , ACM Press (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gruteser</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grunwald</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Anonymous usage of location-based services through spatial and temporal cloaking</article-title>
          .
          <source>In: Proc. of the 1st International Conference on Mobile Systems, Applications and Services (MobiSys)</source>
          ,
          <source>The USENIX Association</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Mokbel</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chow</surname>
            ,
            <given-names>C.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aref</surname>
            ,
            <given-names>W.G.</given-names>
          </string-name>
          :
          <article-title>The new casper: query processing for location services without compromising privacy</article-title>
          .
          <source>In: Proc. of the 32nd International Conference on Very Large Data Bases, VLDB Endowment</source>
          (
          <year>2006</year>
          )
          <volume>763</volume>
          -
          <fpage>774</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gedik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Protecting location privacy with personalized k-anonymity: Architecture and algorithms</article-title>
          .
          <source>IEEE Transactions on Mobile Computing</source>
          <volume>7</volume>
          (
          <issue>1</issue>
          ) (
          <year>2008</year>
          )
          <volume>1</volume>
          -
          <fpage>18</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kalnis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghinita</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mouratidis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papadias</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Preventing location-based identity inference in anonymous spatial queries</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>19</volume>
          (
          <issue>12</issue>
          ) (
          <year>2007</year>
          )
          <volume>1719</volume>
          -
          <fpage>1733</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kido</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yanagisawa</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satoh</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Protection of location privacy using dummies for location-based services</article-title>
          .
          <source>In: Proc. of the 21st International Conference on Data Engineering Workshops</source>
          , IEEE Computer Society (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Yiu</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jensen</surname>
            ,
            <given-names>C.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
          </string-name>
          , H.:
          <article-title>Spacetwist: Managing the tradeo s among location privacy, query performance, and query accuracy in mobile services</article-title>
          .
          <source>In: Proc. of the 24th International Conference on Data Engineering</source>
          , IEEE Computer Society (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>