=Paper= {{Paper |id=Vol-1810/BIGQP_paper_06 |storemode=property |title=Privacy Preservation of Semantic Trajectory Databases using Query Auditing Techniques |pdfUrl=https://ceur-ws.org/Vol-1810/BIGQP_paper_06.pdf |volume=Vol-1810 |authors=Despina Kopanaki,Nikos Pelekis |dblpUrl=https://dblp.org/rec/conf/edbt/KopanakiP17 }} ==Privacy Preservation of Semantic Trajectory Databases using Query Auditing Techniques== https://ceur-ws.org/Vol-1810/BIGQP_paper_06.pdf
    Privacy Preservation of Semantic Trajectory Databases
              using Query Auditing Techniques
                      Despina Kopanaki                                                            Nikos Pelekis
                      Dept. of Informatics                                         Dept. of Statistics and Insurance Science
                  University of Piraeus, Greece                                          University of Piraeus, Greece
                     dkopanak@unipi.gr                                                         npelekis@unipi.gr

ABSTRACT
Existing approaches that publish anonymized spatiotemporal traces of mobile humans deal with the preservation of privacy operating under the assumption that most of the information in the original dataset can be disclosed without causing any privacy violation. However, an alternative strategy considers that data stays in-house to the hosting organization and privacy-preserving mobility data management systems are in charge of privacy-aware sharing of the mobility data. Furthermore, human trajectories are nowadays enriched with semantic information by using background geographic information and/or by user-provided data via location-based social media. This new type of representation of personal movements, as sequences of places visited by a person during his/her movement, poses even greater privacy violation threats. To facilitate privacy-aware sharing of mobility data, we design a semantic-aware MOD engine where all potential privacy breaches that may occur when answering a query are prevented through an auditing mechanism. Moreover, in order to improve the user friendliness and system functionality of the aforementioned engine, we propose the Zoom-Out algorithm as a distinct component, whose objective is to modify the initial query that cannot be answered at first due to privacy violation, to the ‘nearest’ query that can possibly be answered with ‘safety’.

Keywords
Privacy-aware query engine; mobility data; anonymity; semantic trajectories.

© 2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1. INTRODUCTION
Nowadays, the ease of collecting and storing data from an increasing variety of devices with embedded positioning technologies (GPS), along with their enhanced processing capabilities, has inevitably led to the desire of revealing useful information from them. Alongside this progress, potential breaches of individuals’ privacy have come to light. Space-time ‘fingerprints’ of each recorded moving entity (i.e., trajectories) may prove to be a dangerous tool in the hands of a malicious user. The scientific community has proposed various approaches to protect individuals’ privacy ([1][2][6][7][8][9][11][16]).

Most of the aforementioned studies define raw trajectories as sequences of points in a geometric space, focusing on their spatiotemporal nature without complementing the raw data with additional information from the application context. However, the increasing need of analyzing mobility data has led to the enrichment of trajectories with contextual data from external data sources, thus transforming raw trajectories into the so-called semantic trajectories. A semantically-annotated trajectory, in short semantic trajectory, is considered as a sequence of stops (i.e., places where the object remains “static”) and moves (i.e., parts of the object’s trajectory in between two stops) [12]. This alternative representation of trajectories may pose even greater privacy violation threats. Consider for example a malevolent user who is able to detect places of interest (POIs) where a moving object has stopped (e.g., home, hospital, betting office, etc.). This additional knowledge allows the inference of personal sensitive information about this specific individual. On the one hand, analyzing semantically-enriched movement traces of users can aid decision making in a wide spectrum of applications; on the other hand, the disclosure of such data to untrusted parties may expose the privacy of the users whose movement is recorded. Sharing user mobility data for analysis purposes should be done only after the data has been protected against potential privacy breaches.

Most of the methodologies that have been proposed in the literature aim at protecting users’ privacy by releasing an anonymized version of the original dataset ([1][2][6][7][8][9][11][16]). These approaches assume that in the anonymized dataset a malevolent user will not be able to link a specific user with a movement. In this paper, we employ a more conservative approach to privacy by assuming that data stays in-house to the hosting organization in order to prevent any privacy breach. An auditing mechanism is responsible to control the information that is released to third parties and ensure privacy-aware data sharing.

Gkoulalas-Divanis et al. [5] first proposed an envisioned query engine where subscribed users gain restricted access to the database to accomplish various analysis tasks. Pelekis et al. [13] then proposed HERMES++, a privacy-aware query engine that can protect the trajectory database from potential attacks, while supporting popular queries for mobility data analysis. Both approaches deal with spatiotemporal trajectory databases.

Inspired by the work previously described, and considering the richer representation of semantic trajectories, we design a query-based auditing mechanism that can effectively identify and block a range of potential attacks that could lead to user identification or tracking, for privacy-aware sharing of in-house semantic mobility data. The proposed mechanism provides an answer if the k-anonymity principle is not violated w.r.t. the user’s current history. Moreover, we propose an algorithm, called Zoom-Out, which modifies the original query that cannot be answered at first due to privacy restrictions, to the most similar query that can be safely answered. The algorithm generalizes the spatial, temporal and/or semantic dimension of one or more sub-queries w.r.t. a distortion threshold.

Summarizing, in this paper we make the following contributions:
     •    We identify various types of attacks, and thus privacy violations, that malevolent users may try to pursue when querying the original semantic trajectory database.

     •    We design a query-based auditing mechanism that can effectively identify and block a range of potential attacks that could lead to user identification or tracking.

     •    We propose the Zoom-Out algorithm, aiming at increasing the user friendliness of the proposed mechanism by modifying the (original) posed query that cannot be answered due to privacy violation, to the ‘nearest’ possible ‘safe’ query.

The rest of the paper is structured as follows: Section 2 presents related work. Section 3 introduces different types of attacks of a malevolent user. Section 4 provides the auditing mechanism that handles the previously described attacks, as well as the Zoom-Out algorithm. Finally, Section 5 concludes the paper.

2. Related Work
Methods that have been proposed so far to tackle the issue of privacy-preserving mobility data publication mostly adopt the principle of k-anonymity, which was originally proposed for relational databases [15]. The k-anonymity principle is the most common approach that has been adopted for the anonymization of both relational and mobility data. For mobility data, it states that a dataset must be anonymized so that every trajectory is indistinguishable from at least k-1 other trajectories.

Hoh and Gruteser [6] presented a data perturbation algorithm that is based on path crossing. When two non-intersecting trajectories are close enough, it generates a fake crossing in the sanitized dataset to prevent adversaries from tracking a complete user’s trajectory. Terrovitis and Mamoulis [16] consider datasets as sequences of places visited by users. Based on the assumption that a malevolent user holds partial information of users’ trajectories, a suppression technique is proposed that eradicates the least number of places from a user’s trajectory so that the remaining trajectory is k-anonymous.

Abul et al. [1] proposed a k-anonymity approach that relies on the inherent uncertainty of a moving object’s whereabouts, where a trajectory is considered as a cylinder. The anonymity algorithm identifies trajectories that lie close to each other in time, employs space translation and generates clusters of at least k trajectories. Each cluster of k trajectories forms an anonymity region, and the co-clustered trajectories can be released. To achieve space-time translation, the authors proposed W4M [2], which uses a different distance measure that allows time-warping.

Nergiz et al. [11] proposed a coarsening strategy to generate a sanitized dataset that consists of k-anonymous sequences. The algorithm first generalizes a set of trajectories into a set of sequences of k-anonymized regions, then reconstructs and consolidates the trajectories of the original dataset into clusters of k, and anonymizes the trajectories in each cluster. Monreale et al. [9] proposed another anonymization approach that is based on the combination of spatial generalization and the k-anonymity principle. The geographical area covered by the trajectories belonging to the dataset is partitioned into sub-areas. The original trajectories are then generalized and transformed so as to satisfy the k-anonymity principle.

Mahdavifar et al. [8] introduced the idea of non-uniform privacy requirements, whereby each trajectory is associated with its own privacy level indicating the number of trajectories it should be indistinguishable from. Kopanaki et al. [7] introduced the problem of Personalized (K,∆)-anonymity, where user-specific privacy requirements are used to avoid over-anonymization and decrease information distortion. They proposed efficient modifications to state-of-the-art (k,δ)-anonymization algorithms by introducing techniques built upon users’ personalized privacy settings and trajectory segmentation.

Recently, the authors of [10] faced the problem of anonymizing semantic trajectories. To release a safe version of a semantic trajectory dataset, they propose a method that generalizes sequences of visited places based on a privacy place taxonomy.

On the other hand, in several sharing scenarios data should stay in-house to the hosting organization and the information must remain private. This is the case when a data holder is not willing, or is not able due to regulations, to publish the entire dataset. Assuming that at least part of the data has to become available to possibly untrusted third parties for analysis purposes, a mechanism is needed in order to ensure that no sensitive information will be released during this process. Along this direction, methodologies have been proposed for disclosure control in statistical databases [3]. These approaches support only count and/or sum queries, since no other information can be made available to the inquirer.

Gkoulalas-Divanis and Verykios [5] first described the design principles of a query engine that protects user privacy by generating fake trajectories. The idea behind [5] is that malevolent users who query the trajectory database should not be able to discover (with high confidence) any real trajectory that is returned as part of the answer set of their query, while they can use the returned data to support their analytic tasks.

In [13] and [14] the authors extended and developed a privacy-aware query engine along with a benchmark framework. The proposed engine audits queries for trajectory data to block potential attacks to user privacy, supports range, distance, and k-nearest neighbor spatial and spatiotemporal queries, and preserves user anonymity in answers to queries by returning realistic fake trajectories, while protecting user-specific sensitive locations.

Finally, in [17] the authors proposed a data stream management system aiming at preserving users’ privacy by enforcing Hippocratic principles. Limited collection, limited use and limited disclosure of data are the main privacy requirements that the system implements.

To the best of our knowledge, this is the first work that proposes a query auditing mechanism for semantically-enriched mobility data, able to provide answers while ensuring that no personal information will be disclosed to untrusted third parties.

3. Privacy Attacks
The main purpose of every attack of a malevolent user is to broaden her knowledge about an individual or a situation that interests her. This occurs when the attacker raises her confidence about an event that may be related to an individual who is the ‘target’, or about a situation for which she wishes to acquire more specific knowledge. Usually, a malevolent user has prior knowledge, i.e., time, place, type of event and/or semantics (or any possible combination thereof) about an individual.

In our setting, each query may contain one or more sub-queries (i.e., standalone spatiotemporal range queries with keyword constraints within another query). Moreover, an overlapping query sequence is a sequence of at least two queries posed by a user, having as a
characteristic that the criteria of these successive queries are overlapping. We assume that the queries differ only in one dimension (space / time / semantics) or in the number of the sub-queries that each one contains.

User Identification Attack. In this attack the identity of a user can be revealed by posing overlapping queries in the spatial and/or temporal dimension. The attacker poses a query and, if the number of trajectories is at least k, proceeds with one or more queries, modifying each time only the same dimension, such that every time the new query contains the previous one. Let’s assume that a user poses query Q1 that contains n trajectories, where n ≥ k, and then Q2, which returns as an answer n+m trajectories, where m < k.

For example, the malevolent user poses Q1 with a sub-query that for sure contains the target (how many people stayed during the night in area A and during the day were working in area B). The number of trajectories that fulfil Q1 is 6. Q2 contains the same sub-queries as Q1, along with a new sub-query that asks for those that returned after work back to area A during the night. If the answer of Q2 returns 5 trajectories, then the malevolent user may assume that the target did not spend the night at home. If the result again was 6, then she would be certain that the target returned to her home. In both cases, by posing two queries her confidence was increased.

4. Attack Prevention
To prevent the previously described attacks, an auditing mechanism controls every query before it is answered. When the original query Q cannot be answered due to privacy restrictions, the Zoom-Out algorithm modifies it to a query Q’ that will result in at least k trajectories. The sets of sub-queries in Q and Q’ are of the same size. Each sub-query of Q’ is either the same or generalized w.r.t. the corresponding sub-query contained in Q.

An obvious approach would be to apply the aforementioned method only in the case where a query cannot be answered marginally w.r.t. the k-anonymity threshold. A threshold would then be required, based on which the mechanism would be activated every time the query could not be answered. In such a case, where the mechanism is activated only if few trajectories are missing from the answer set, a privacy breach may occur: a malevolent user can easily assume that the modified query does contain a certain number of additional trajectories within the returned extra area. An obvious solution is to apply the Zoom-Out algorithm regardless of the number of trajectories that are needed to reach the k-anonymity principle.

The goal of the algorithm is to modify one or more sub-queries so as to provide an answer to the user. To enable the algorithm to decide which episode is preferable to be included in the answer of the modified query, the algorithm should be able to compare the distortion (the specific definition of a distortion metric is orthogonal to our approach) that is caused on each sub-query when trying to include two or more candidate episodes. A unit that calculates the distortion caused due to the generalization of one or more dimensions of one or more sub-queries is required. The distortion should be as low as possible, to maintain the information that the user required when posing the original query.

The Zoom-Out algorithm takes as input a semantic trajectory database, the original query posed by a user that cannot be answered, the anonymity threshold k, a distortion limit dist and a matrix H. The output of the algorithm is the modified query along with the corresponding sub-queries.

Algorithm 1. Zoom-Out Algorithm
Input: (1) anonymity threshold k, (2) initial query with sub-queries, Q = <SQ1, SQ2, …, SQn>, (3) a semantic trajectory database D, (4) distortion limit dist, (5) array H[tr_id, freq, SQ1, SQ2, …, SQn]
Output: Q’ = <SQ’1, SQ’2, …, SQ’n>
1.   Q’ ← Q; H ← ∅;
2.   Ntr ← Count(Q’)
3.   repeat
4.         Something_Changed ← False;
5.         for i=1 to n do
6.             Execute_Query(in SQ’i out tr_ids)
7.             Fill_Help_Table(in H, tr_ids out H)
8.         end for
9.         Find_Freq_Position(in H, n out i)
10.        episode_found ← False
11.        repeat
12.            Compute_Distortion_Units(in out episode_found, H, i)
13.            Select_best_candidate_episode(in H, dist, i out tr_id, ep_id)
14.            i ← i+1
15.        until episode_found or EOF
16.        if episode_found then
17.            Embed_New_Episode(in H, tr_id, ep_id, in out SQ’i, Something_Changed)
18.            Ntr ← Count(Q’)
19. until (not Something_Changed) or (Ntr=k)
20. Compute_Random_Number(in Rmin, Rmax out R)
21. Compute_New_Episodes(in Q’, R out Q’)
22. return Q’

The algorithm, after the initialization process (lines 1-2), continues with a loop phase where each sub-query of the original query is executed individually and the trajectories that comprise each one are retrieved (line 6). Each trajectory id of these trajectories is inserted into a matrix (H), along with a frequency indicating its appearance (freq) in all sub-queries (line 7). Thus, H is a tuple containing the trajectory id (tr_id), the frequency (freq) and the sub-queries (SQi). The maximum value that the counter (freq) can take in each record is equal to the number of the sub-queries. Consider as an example a query with three sub-queries; the counter for each trajectory in matrix H will receive a value ranging from 1 up to 3. If the counter receives the maximum value, the episodes of this trajectory are identified in all sub-queries, thus this trajectory is returned as an answer to the overall query.

Based on matrix H, the algorithm detects the trajectory or trajectories with frequency (i.e., number of sub-queries) less than the maximum possible frequency, but at the same time with the highest value among the other trajectories in the matrix (line 9). Subsequently, a loop starts that ensures that, if no episode is found, the algorithm will search the trajectory that has the next smaller frequency. This loop ends either when permissible episodes can be integrated, or when all the remaining trajectories from the matrix have been investigated and no episode is found (line 15). To define the most appropriate candidate episodes, the algorithm employs a process called Compute_Distortion_Units. A metric function is used that calculates the distortion caused in a sub-query in order for it to be modified so as to include an episode from the trajectory. In case we have a distortion unit greater than a distortion limit (user-defined), the episode takes the tag INF and the algorithm proceeds with the next trajectory. Under these conditions, the algorithm selects as preferable the episode that has the lowest distortion unit value (line 13).

As a next step, the sub-query is modified in one or more dimensions to contain the episode that minimizes the distortion (line 17). The repetition ends (line 19) either when k trajectories have frequency equal to the number of sub-queries, or when no episodes were integrated.

During the generalization process of the sub-queries, a privacy breach may occur. Let’s assume that the spatial dimension of the area that the query covers is enlarged so as to contain exactly k episodes. The spatial generalization should be the minimum possible, in order to keep the distortion caused by this process as low as possible. To achieve this, most of the episodes that are added will appear on the borders of the modified area. The malevolent user thus will be more confident that between the query posed and the modified query there will be at least one episode. In order to avoid such a violation, the modified query is expanded on each side by a randomly generated percentage R (line 21). Finally, we get as output the final modified query along with its sub-queries (line 22).

4.3 Query Auditing
The main goal of the Query-Auditing algorithm is to prevent any privacy violation that may occur. The input of the algorithm is the anonymity threshold k, the initial query posed by the user along with the corresponding sub-queries, a semantic trajectory database D and the id of the user posing the query. The algorithm first executes query Q and gets the number of trajectories that belong to the answer set (line 1). Then, the episodes that are considered as sensitive are defined and removed from the answer set (line 2).
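The auditing flow just described (execute the query, remove trajectories with sensitive episodes, check k-anonymity, and otherwise fall back to query modification) can be sketched as follows. This is a minimal illustrative sketch, not the engine's actual implementation; the helper callables `execute_query`, `is_sensitive` and `zoom_out` are our own assumptions standing in for the corresponding components of the system:

```python
def audit_query(query, db, k, execute_query, is_sensitive, zoom_out):
    """Answer `query` only if it stays k-anonymous after sensitive episodes are removed."""
    # Line 1 of Algorithm 2: execute the original query, collect the answer set.
    answer = execute_query(query, db)
    # Line 2: drop every trajectory that contains a sensitive episode.
    answer = [tr for tr in answer if not any(is_sensitive(ep) for ep in tr)]
    # Line 3: if fewer than k trajectories remain, the query cannot be
    # answered as posed; delegate to the Zoom-Out modification instead.
    if len(answer) < k:
        return zoom_out(query, db, k)
    return answer
```

Note that the k-check is applied after the sensitive episodes are filtered out, so a query whose raw answer set is large enough can still be rejected once protected locations are excluded.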
Algorithm 2. Query Auditing Algorithm
Input: (1) anonymity threshold k, (2) initial query with sub-queries, Q = <SQ1, SQ2, …, SQn>, (3) a semantic trajectory database, D, (4) user id, uid
Output: FQ
1.   FQ ← Execute Q
2.   Find all sensitive episodes, remove them from FQ
3.   if ||FQ||

5. CONCLUSIONS
In this paper, we proposed an envisioned query engine able to provide safe answers to queries posed by users in semantic trajectory databases. Different types of privacy attacks have been addressed, and an effective auditing mechanism able to prevent privacy breaches has been proposed. Finally, the Zoom-Out algorithm is able to modify an initially not acceptable query to the closest query that can be safely answered.
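As a concrete illustration of the help-matrix bookkeeping in Algorithm 1 (lines 5-9), the following minimal sketch counts, for every trajectory, in how many sub-queries it appears, and then selects the trajectory whose frequency is highest while still below the number of sub-queries, i.e., the most promising candidate to complete by embedding an episode. Function names and data shapes here are our own illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

def fill_help_table(subquery_results):
    """subquery_results: list of sets of trajectory ids, one set per sub-query.
    Returns H mapping tr_id -> freq (number of sub-queries the trajectory hits)."""
    H = defaultdict(int)
    for tr_ids in subquery_results:   # lines 5-8: one pass per sub-query
        for tr_id in tr_ids:
            H[tr_id] += 1
    return dict(H)

def best_candidate(H, n_subqueries):
    """Line 9: trajectory with the highest frequency that is still < n_subqueries.
    Trajectories with freq == n_subqueries already satisfy every sub-query."""
    candidates = {t: f for t, f in H.items() if f < n_subqueries}
    return max(candidates, key=candidates.get) if candidates else None
```

For example, with three sub-queries returning {1, 2, 3}, {1, 2} and {1, 4}, trajectory 1 hits all three sub-queries and already belongs to the answer, so trajectory 2 (frequency 2) is the best candidate to complete via generalization.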