=Paper=
{{Paper
|id=Vol-1810/BIGQP_paper_06
|storemode=property
|title=Privacy Preservation of Semantic Trajectory Databases using Query Auditing Techniques
|pdfUrl=https://ceur-ws.org/Vol-1810/BIGQP_paper_06.pdf
|volume=Vol-1810
|authors=Despina Kopanaki,Nikos Pelekis
|dblpUrl=https://dblp.org/rec/conf/edbt/KopanakiP17
}}
==Privacy Preservation of Semantic Trajectory Databases using Query Auditing Techniques==
Despina Kopanaki
Dept. of Informatics
University of Piraeus, Greece
dkopanak@unipi.gr

Nikos Pelekis
Dept. of Statistics and Insurance Science
University of Piraeus, Greece
npelekis@unipi.gr
ABSTRACT
Existing approaches that publish anonymized spatiotemporal traces of mobile humans deal with the preservation of privacy operating under the assumption that most of the information in the original dataset can be disclosed without causing any privacy violation. However, an alternative strategy considers that data stays in-house at the hosting organization and privacy-preserving mobility data management systems are in charge of privacy-aware sharing of the mobility data. Furthermore, human trajectories are nowadays enriched with semantic information by using background geographic information and/or user-provided data via location-based social media. This new type of representation of personal movements, as sequences of places visited by a person during his/her movement, poses even greater privacy violation threats. To facilitate privacy-aware sharing of mobility data, we design a semantic-aware MOD engine where all potential privacy breaches that may occur when answering a query are prevented through an auditing mechanism. Moreover, in order to improve the user friendliness and system functionality of the aforementioned engine, we propose the Zoom-Out algorithm as a distinct component, whose objective is to modify the initial query, which cannot be answered at first due to privacy violation, into the ‘nearest’ query that can possibly be answered with ‘safety’.

Keywords
Privacy-aware query engine; mobility data; anonymity; semantic trajectories.

2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1. INTRODUCTION
Nowadays, the ease of collecting and storing data from an increasing variety of devices with embedded positioning technologies (GPS), along with their enhanced processing capabilities, has inevitably led to the desire of revealing useful information from them. Alongside this progress, potential breaches of individuals’ privacy have come to light. The space-time ‘fingerprints’ of each recorded moving entity (i.e., trajectories) may prove to be a dangerous tool in the hands of a malicious user. The scientific community has proposed various approaches to protect individuals’ privacy ([1][2][6][7][8][9][11][16]).

Most of the aforementioned studies define raw trajectories as sequences of points in a geometric space, focusing on their spatiotemporal nature without complementing the raw data with additional information from the application context. However, the increasing need of analyzing mobility data has led to the representation of trajectories with contextual data from external data sources, thus transforming raw trajectories into the so-called semantic trajectories. A semantically-annotated trajectory, in short semantic trajectory, is considered as a sequence of stops (i.e., places where the object remains “static”) and moves (i.e., parts of the object’s trajectory in between two stops) [12]. This alternative representation of trajectories may pose even greater privacy violation threats. Consider for example a malevolent user who is able to detect places of interest (POIs) where a moving object has stopped (e.g., home, hospital, betting office, etc.). This additional knowledge allows the inference of personal sensitive information about this specific individual. On the one hand, analyzing semantically-enriched movement traces of users can aid decision making in a wide spectrum of applications; on the other hand, the disclosure of such data to untrusted parties may expose the privacy of the users whose movement is recorded. Sharing user mobility data for analysis purposes should be done only after the data has been protected against potential privacy breaches.

Most of the methodologies that have been proposed in the literature aim at protecting users’ privacy by releasing an anonymized version of the original dataset ([1][2][6][7][8][9][11][16]). These approaches assume that in the anonymized dataset a malevolent user will not be able to link a specific user with a movement. In this paper, we employ a more conservative approach to privacy by assuming that data stays in-house at the hosting organization in order to prevent any privacy breach. An auditing mechanism is responsible for controlling the information that is released to third parties and ensuring privacy-aware data sharing.

Gkoulalas-Divanis et al. [5] first proposed an envisioned query engine where subscribed users gain restricted access to the database to accomplish various analysis tasks. Pelekis et al. [13] then proposed HERMES++, a privacy-aware query engine that can protect the trajectory database from potential attacks, while supporting popular queries for mobility data analysis. Both approaches deal with spatiotemporal trajectory databases.

Inspired by the work previously described, and considering the richer representation of semantic trajectories, we design a query-based auditing mechanism that can effectively identify and block a range of potential attacks that could lead to user identification or tracking, for privacy-aware sharing of in-house semantic mobility data. The proposed mechanism provides an answer if the k-anonymity principle is not violated w.r.t. the user’s current history. Moreover, we propose an algorithm, called Zoom-Out, which modifies the original query, which cannot be answered at first due to privacy restrictions, into the most similar query that can be safely answered. The algorithm generalizes the spatial, temporal and/or semantic dimension of one or more sub-queries w.r.t. a distortion threshold.
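The stop-and-move representation of semantic trajectories introduced above can be sketched as a small data structure. This is an illustrative sketch only; the class and field names are our own and are not part of the paper:

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class Stop:
    """A place where the moving object remains 'static' for a while (a POI)."""
    place: str    # semantic annotation, e.g. "home", "hospital"
    t_start: int  # timestamp at which the stay begins
    t_end: int    # timestamp at which the stay ends

@dataclass
class Move:
    """The part of the trajectory between two consecutive stops."""
    points: List[Tuple[float, float, int]] = field(default_factory=list)  # (x, y, t)

@dataclass
class SemanticTrajectory:
    """A semantic trajectory: an alternating sequence of stops and moves [12]."""
    traj_id: int
    episodes: List[Union[Stop, Move]] = field(default_factory=list)

    def visited_places(self) -> List[str]:
        """The sequence of POIs visited, i.e. the semantic 'fingerprint'."""
        return [e.place for e in self.episodes if isinstance(e, Stop)]

# Toy example: a stay at home, a move, then a stay at work.
traj = SemanticTrajectory(
    traj_id=1,
    episodes=[
        Stop("home", 0, 8),
        Move([(0.0, 0.0, 8), (1.5, 2.0, 9)]),
        Stop("work", 9, 17),
    ],
)
print(traj.visited_places())  # -> ['home', 'work']
```

It is exactly this visited-places sequence that the attacks discussed in the remainder of the paper try to exploit.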
Summarizing, in this paper we make the following contributions:

• We identify various types of attacks, and thus privacy violations, that malevolent users may try to pursue when querying the original semantic trajectory database.

• We design a query-based auditing mechanism that can effectively identify and block a range of potential attacks that could lead to user identification or tracking.

• We propose the Zoom-Out algorithm, aiming at increasing the user friendliness of the proposed mechanism by modifying the (original) query posed, which cannot be answered due to privacy violation, into the ‘nearest’ possible ‘safe’ query.

The rest of the paper is structured as follows: Section 2 presents related work. Section 3 introduces different types of attacks of a malevolent user. Section 4 presents the auditing mechanism that handles the previously described attacks, as well as the Zoom-Out algorithm. Finally, Section 5 concludes the paper.

2. Related Work
Methods that have been proposed so far to tackle the issue of privacy-preserving mobility data publication mostly adopt the principle of k-anonymity, which was originally proposed for relational databases [15]. The k-anonymity principle is the most common approach that has been adopted for the anonymization of both relational and mobility data. For mobility data, it states that a dataset must be anonymized so that every trajectory is indistinguishable from at least k-1 other trajectories.

Hoh and Gruteser [6] presented a data perturbation algorithm that is based on path crossing. When two non-intersecting trajectories are close enough, it generates a fake crossing in the sanitized dataset to prevent adversaries from tracking a complete user’s trajectory. Terrovitis and Mamoulis [16] consider datasets as sequences of places visited by users. Based on the assumption that a malevolent user holds partial information about users’ trajectories, a suppression technique is proposed that eradicates the least number of places from a user’s trajectory so that the remaining trajectory is k-anonymous.

Abul et al. [1] proposed a k-anonymity approach that relies on the inherent uncertainty of moving objects’ whereabouts, where a trajectory is considered as a cylinder. The anonymity algorithm identifies trajectories that lie close to each other in time, employs space translation and generates clusters of at least k trajectories. Each cluster of k trajectories forms an anonymity region and the co-clustered trajectories can be released. To achieve space-time translation, the authors proposed W4M [2], which uses a different distance measure that allows time-warping.

Nergiz et al. [11] proposed a coarsening strategy to generate a sanitized dataset that consists of k-anonymous sequences. The algorithm first generalizes a set of trajectories into a set of sequences of k-anonymized regions, then reconstructs and consolidates the trajectories of the original dataset into clusters of k, and anonymizes the trajectories in each cluster. Monreale et al. [9] proposed another anonymization approach that is based on the combination of spatial generalization and the k-anonymity principle. The geographical area covered by the trajectories belonging to the dataset is partitioned into sub-areas. The original trajectories are then generalized and transformed so as to satisfy the k-anonymity principle.

Mahdavifar et al. [8] introduced the idea of non-uniform privacy requirements, whereby each trajectory is associated with its own privacy level indicating the number of trajectories it should be indistinguishable from. Kopanaki et al. [7] introduced the problem of Personalized (K,∆)-anonymity, where user-specific privacy requirements are used to avoid over-anonymization and decrease information distortion. They proposed efficient modifications to state-of-the-art (k,δ)-anonymization algorithms by introducing techniques built upon users’ personalized privacy settings and trajectory segmentation.

Recently, in [10] the authors faced the problem of anonymizing semantic trajectories. To release a safe version of a semantic trajectory dataset, they propose a method that generalizes sequences of visited places based on a privacy place taxonomy.

On the other hand, in several sharing scenarios data should stay in-house at the hosting organization and the information must remain private. This is the case when a data holder is not willing, or is not able due to regulations, to publish the entire dataset. Assuming that at least part of the data has to become available to possibly untrusted third parties for analysis purposes, a mechanism is needed in order to ensure that no sensitive information will be released during this process. Along this direction, methodologies have been proposed for disclosure control in statistical databases [3]. These approaches support only count and/or sum queries, since no other information can be made available to the inquirer.

Gkoulalas-Divanis and Verykios [5] first described the design principles of a query engine that protects user privacy by generating fake trajectories. The idea behind [5] is that malevolent users who query the trajectory database should not be able to discover (with high confidence) any real trajectory that is returned as part of the answer set of their query, while they can use the returned data to support their analytic tasks.

In [13] and [14] the authors extended and developed a privacy-aware query engine along with a benchmark framework. The proposed engine audits queries for trajectory data to block potential attacks on user privacy, supports range, distance, and k-nearest neighbor spatial and spatiotemporal queries, and preserves user anonymity in answers to queries by returning realistic fake trajectories, while protecting user-specific sensitive locations.

Finally, in [17] the authors proposed a data stream management system aiming at preserving users’ privacy by enforcing Hippocratic principles. Limited collection, limited use and limited disclosure of data are the main privacy requirements that the system implements.

To the best of our knowledge, this is the first work that proposes a query auditing mechanism for semantically-enriched mobility data, able to provide answers while ensuring that no personal information will be disclosed to untrusted third parties.

3. Privacy Attacks
The main purpose of every attack of a malevolent user is to broaden her knowledge about an individual or a situation that interests her. This occurs when the attacker raises her confidence about an event that may be related to an individual who is the ‘target’, or about a situation for which she wishes to acquire more specific knowledge. Usually, a malevolent user has prior knowledge, i.e., time, place, type of event and/or semantics (or any possible combination thereof) about an individual.

In our setting, each query may contain one or more sub-queries (i.e., standalone spatiotemporal range queries with keyword constraints within another query). Moreover, overlapping queries are a sequence of at least two queries posed by a user, having as a characteristic that the criteria of these successive queries are overlapping. We assume that the queries differ only in one dimension (space / time / semantics) or in the number of the sub-queries that each one contains.

User Identification Attack. In this attack the identity of a user can be revealed by posing overlapping queries in the spatial and/or temporal dimension. The attacker poses a query and, if the number of trajectories is at least k, proceeds with one or more queries, modifying each time only the same dimension, such that every time the new query contains the previous one. Let’s assume that a user poses query Q1 that contains n trajectories, where n ≥ k, and then Q2, which returns as an answer n+m trajectories, where m < k.

Consider an example where the attacker poses Q1 with a sub-query that for sure contains the target (how many people stayed during the night in area A and during the day were working in area B). The number of trajectories that fulfil Q1 is 6. Q2 contains the same sub-queries as Q1, along with a new sub-query that asks for those that returned after work back to area A during the night. If the answer of Q2 returns 5 trajectories, then the malevolent user may assume that the target did not spend the night at home. If the result again was 6, then she would be certain that the target returned to her home. Either way, by posing two queries her confidence was increased.

4. Attack Prevention
To prevent the previously described attacks, an auditing mechanism is needed.

Algorithm 1. Zoom-Out Algorithm
Input: (1) anonymity threshold k, (2) initial query with sub-queries, Q = ⟨SQ1, SQ2, …, SQn⟩, (3) a semantic trajectory database D, (4)
distortion limit dist, (5) array H[tr_id, freq, SQ1, SQ2, …, SQn]
Output: Q’ = ⟨SQ’1, SQ’2, …, SQ’n⟩
1. Q’ ← Q; H ← ∅;
2. Ntr ← Count(Q’)
3. repeat
4.   Something_Changed ← False;
5.   for i=1 to n do
6.     Execute_Query(in SQ’i out tr_ids)
7.     Fill_Help_Table(in H, tr_ids out H)
8.   end for
9.   Find_Freq_Position(in H, n out i)
10.  episode_found ← False
11.  repeat
12.    Compute_Distortion_Units(in out episode_found, H, i)
13.    Select_best_candidate_episode(in H, dist, i out tr_id, ep_id)
14.    i ← i+1
15.  until episode_found or EOF
16.  if episode_found then
17.    Embed_New_Episode(in H, tr_id, ep_id, in out SQ’i, Something_Changed)
18.    Ntr ← Count(Q’)
19. until (not Something_Changed) or (Ntr = k)
20. Compute_Random_Number(in Rmin, Rmax out R)
21. Compute_New_Episodes(in Q’, R out Q’)
22. return Q’

Q’ will result in at least k trajectories. The sets of sub-queries in Q and Q’ are of the same size. Each sub-query of Q’ is either the same as, or generalized w.r.t., the corresponding sub-query contained in Q.

An obvious approach would be to apply the aforementioned method only in the case where a query marginally fails to be answered w.r.t. the k-anonymity threshold. A threshold would then be required, based on which the mechanism would be activated every time the query could not be answered. In such a case, where the mechanism is activated only if few trajectories are missing from the answer set, a privacy breach may occur: a malevolent user can easily assume that the modified query does contain a certain number of additional trajectories within the returned extra area. An obvious solution is to apply the Zoom-Out algorithm regardless of the number of trajectories that are needed to reach the k-anonymity principle.

The goal of the algorithm is to modify one or more sub-queries so as to provide an answer to the user. To enable the algorithm to decide which episode is preferable to be included in the answer of the modified query, the algorithm should be able to compare the distortion (the specific definition of a distortion metric is orthogonal to our approach) that is caused on each sub-query when trying to include two or more candidate episodes. A unit that calculates the distortion caused by the generalization of one or more dimensions on one or more sub-queries is required. The distortion should be as low as possible to maintain the information that the user required when posing the original query.

The Zoom-Out algorithm takes as input a semantic trajectory database, the original query posed by a user that cannot be answered, the anonymity threshold k and a matrix H. The output of the algorithm is the modified query along with the corresponding sub-queries.

After the initialization process (lines 1-2), the algorithm continues with a loop phase where each sub-query of the original query is executed individually and the trajectories that comprise each one are retrieved (line 6). Each trajectory id of these trajectories is inserted into the matrix H, along with a frequency indicating its appearance in all sub-queries (line 7). Thus, H is a tuple containing the trajectory id (traj_id), the frequency (freq) and the sub-queries (SQi). The maximum value that the counter (freq) can take in each record is equal to the number of sub-queries. Consider as an example a query with three sub-queries: the counter for each trajectory in matrix H will receive a value ranging from 1 up to 3. If the counter receives the maximum value, the episodes of this trajectory are identified in all sub-queries, and thus this trajectory is returned as an answer to the overall query.

Based on matrix H, the algorithm detects the trajectory or trajectories with a frequency (i.e., number of sub-queries) less than the maximum possible frequency, but at the same time with the highest value among the other trajectories in the matrix (line 9). Subsequently, a loop starts that ensures that, if no episode is found, the algorithm will search the trajectory that has the next smaller frequency. This loop ends either when permissible episodes can be integrated, or when all the remaining trajectories in the matrix have been investigated and no episode is found (line 15). To define the most appropriate candidate episodes, the algorithm employs a process called Compute_Distortion_Units. A metric function is used that calculates the distortion caused in a sub-query in order for it to be modified so as to include an episode from the trajectory. In case we have a distortion unit greater than a distortion limit (user-defined), the episode takes the tag INF and the algorithm proceeds with the next trajectory. Under these conditions, the algorithm selects as preferable the episode that has the lowest distortion unit value (line 13).

As a next step, the sub-query is modified in one or more dimensions to contain the episode that minimizes the distortion (line 17). The repetition ends (line 19) either if k trajectories have a frequency equal to the number of sub-queries, or if no episodes were integrated.

During the generalization process of the sub-queries, a privacy breach may occur. Let’s assume that the spatial dimension of the area that the query covers is enlarged so as to contain exactly k episodes. The spatial generalization should be the minimum possible, in order to keep the distortion caused by this process as low as possible. To achieve this, most of the episodes that are added will appear at the borders of the modified area. The malevolent user thus will be more confident that, between the query posed and the modified query, there will be at least one episode. In order to avoid such a violation, the modified query is expanded on each side by a randomly generated percentage R (line 21). Finally, we get as output the final modified query along with its sub-queries (line 22).

4.3 Query Auditing
The main goal of the Query-Auditing algorithm is to prevent any privacy violation that may occur. The input of the algorithm is the anonymity threshold k, the initial query posed by the user along with the corresponding sub-queries, a semantic trajectory database D and the id of the user posing the query. The algorithm first executes query Q and gets the number of trajectories that belong to the answer set (line 1). Then, the episodes that are considered as sensitive are defined and removed from the answer set (line 2).
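The auditing flow just described (execute the query, drop episodes marked as sensitive, and release the answer only if it still satisfies k-anonymity) can be sketched as follows. This is a minimal illustration under our own simplifications; the function name, the dictionary-based answer set and the refusal convention are ours, not the engine's API:

```python
from typing import Dict, List, Set

def audit_query(
    answer: Dict[int, List[str]],  # query result: traj_id -> visited-place episodes
    sensitive: Set[str],           # episodes considered sensitive, e.g. {"hospital"}
    k: int,                        # anonymity threshold
) -> Dict[int, List[str]]:
    """Release the answer set only if, after removing sensitive episodes,
    it still contains at least k trajectories; otherwise refuse (return {})."""
    # Find sensitive episodes and remove them from the answer set (cf. Algorithm 2, line 2).
    filtered = {
        tid: [ep for ep in eps if ep not in sensitive]
        for tid, eps in answer.items()
    }
    # Drop trajectories left with no episodes at all.
    filtered = {tid: eps for tid, eps in filtered.items() if eps}
    # If fewer than k trajectories remain, the query cannot be answered safely.
    if len(filtered) < k:
        return {}
    return filtered

# Example: 3 trajectories, k = 2, "hospital" is sensitive.
ans = {1: ["home", "work"], 2: ["home", "hospital"], 3: ["hospital"]}
safe = audit_query(ans, {"hospital"}, k=2)
print(sorted(safe))  # -> [1, 2]  (trajectory 3 is dropped entirely)
```

With k = 3 the same call would refuse to answer, since only two trajectories survive the sensitive-episode removal.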
Algorithm 2. Query Auditing Algorithm
Input: (1) anonymity threshold k, (2) initial query with sub-queries, Q = ⟨SQ1, SQ2, …, SQn⟩, (3) a semantic trajectory database, D, (4) user id, uid
Output: FQ
1. FQ ← Execute Q
2. Find all sensitive episodes, remove them from FQ
3. if ||FQ|| < k

5. CONCLUSIONS
In this paper, we proposed an envisioned query engine able to provide safe answers to queries posed by users on semantic trajectory databases. Different types of privacy attacks have been addressed, and an effective auditing mechanism able to prevent privacy breaches has been proposed. Finally, the Zoom-Out algorithm is able to modify an initially not acceptable query to the closest