Scalable Linkage across Location Enhanced Services

Fuat BASIK
Supervised By: Hakan Ferhatosmanoğlu and Buğra Gedik
Department of Computer Engineering, Bilkent University, Turkey
fuat.basik@bilkent.edu.tr

ABSTRACT
In this work, we investigate methods for merging spatio-temporal usage and entity records across two location-enhanced services, even when the datasets are semantically different. To address both effectiveness and efficiency, we study this linkage problem in two parts: model and framework. First, we discuss models, including k-l diversity, a concept we developed to capture both the spatial and temporal diversity aspects of the linkage, and probabilistic linkage. Second, we aim to develop a framework that brings efficient computation and parallelization support for both models of linkage.

1. INTRODUCTION
An important portion of the digital footprint left behind by entities interacting with online services contains spatio-temporal references. This footprint is a fertile resource for business intelligence applications [11]. We refer to the services that create spatio-temporal records of their usage as Location Enhanced Services (LES). For instance, Foursquare/Swarm(1), a popular social networking service, records the locations of users when they check in at a point-of-interest (POI) registered in the system. Similarly, mobile phone service providers generate a record every time a call is made, which includes the cell tower whose coverage area contains the user's location.

(1) www.foursquare.com / www.swarmapp.com

Proceedings of the VLDB 2017 PhD Workshop, August 28, 2017. Munich, Germany. Copyright (c) 2017 for this paper by its authors. Copying permitted for private and academic purposes.

Records with similar location and time naturally observe similar phenomena. The data analyst can gather such data from multiple sources, which are typically anonymized due to privacy concerns. These sources could generate semantically different datasets, or the semantic link between the sources could have been lost due to anonymization. As most data science tasks require large amounts of data for accurate training with high confidence, scientists need to combine data from multiple sources to produce accurate aggregate patterns. For example, spatio-temporal usage records belonging to the same real-world user can be matched across records from two different location-enhanced services, even when the datasets are semantically different. Another example would be the linkage of sensor data from different vendors that are embedded in the same moving system, e.g., self-driving cars. This linkage enables data scientists and service providers to obtain information that they cannot derive by mining only one set of usage records. Consider a LES provider that combines user segmentation results derived from its own usage records with social segmentation results derived from publicly available Swarm records. There are several algorithmic and systems challenges in merging information from multiple sources of anonymized spatio-temporal data that are collected with the necessary permissions. To cover both effectiveness and efficiency, we divide this linkage problem into two parts: model and framework.

To develop effective models, one needs to define a similarity or probabilistic measure for linkage that considers time, location, and the relationship between the two. This is relatively simpler for many record linkage tasks [4], where linkage is defined based on a similarity measure defined over records (such as Minkowski distance or Jaccard similarity). In spatio-temporal linkage, for a pair of users from two different datasets to be considered as matching, their usage history must contain records that are close both in space and in time; and there must not be negative matches, such as records that are close in time but far apart in distance. We call such negative matches alibis. To address these challenges, we introduce two linkage models. The first one is based on k-l diversity, a new concept we have introduced to capture both the spatial and temporal diversity aspects of the linkage. A pair of entities, one from each dataset, is called k-l diverse if they have at least k co-occurring records (both temporally and spatially) in at least l different locations, and such pairs of entities must not have any alibis. The second model we aim to develop is based on probabilistic linkage, in which we seek to model the matching probability of two entities based on their spatio-temporal history. A pair of entities is called a match, or linked, with probability P, which is proportional to their common events aggregated on grids and timestamps. P is inversely proportional to the number of all other entities simultaneously acting in the same grid cell.

Considering that location-based social networks get millions of updates every day, linkage over hundreds of days of data would take an impractically long amount of time. Naïve record linkage algorithms that compare every pair of records take O(n²) time [6], where n is the number of records. Generic entity matching tools do not provide the necessary optimizations for the scalability and efficiency of spatio-temporal linkage [3]. In order to merge datasets in a reasonable time, we will develop a scalable framework that takes advantage of the spatio-temporal structure of the data. The ST-Link algorithm, which we have recently developed to realize the k-l diversity model in the real world, uses two filtering steps before pairwise comparisons of candidate users, and makes use of spatial index trees, temporal sliding windows, and log-structured merge trees [7]. In addition to effective indexing techniques, we believe efficiency could benefit from parallelization of the computation.
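To make the cost gap concrete, the sketch below contrasts the naive all-pairs comparison with a simple time-window blocking pass. The record layout (entity id, region id, timestamp) and the window width are illustrative assumptions; the actual filtering in ST-Link is considerably more elaborate.

```python
from collections import defaultdict

# Hypothetical record layout: (entity_id, region_id, timestamp_seconds).
ALPHA = 300  # time-closeness window in seconds (the paper's alpha parameter)

def naive_candidate_pairs(records_i, records_e):
    """O(n^2): compare every record pair across the two datasets."""
    return [(i, e) for i in records_i for e in records_e
            if abs(i[2] - e[2]) <= ALPHA and i[1] == e[1]]

def blocked_candidate_pairs(records_i, records_e):
    """Bucket records into ALPHA-sized time windows, so each record is only
    compared against records in the same or adjacent windows."""
    buckets = defaultdict(list)
    for e in records_e:
        buckets[e[2] // ALPHA].append(e)
    pairs = []
    for i in records_i:
        w = i[2] // ALPHA
        for e in buckets[w - 1] + buckets[w] + buckets[w + 1]:
            if abs(i[2] - e[2]) <= ALPHA and i[1] == e[1]:
                pairs.append((i, e))
    return pairs

I = [("u1", "r1", 100), ("u2", "r2", 5000)]
E = [("v1", "r1", 250), ("v2", "r3", 5100)]
print(naive_candidate_pairs(I, E) == blocked_candidate_pairs(I, E))  # True
```

Both passes produce the same candidate set; the blocked version simply avoids touching record pairs that cannot possibly satisfy the time window.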
2. LINKAGE MODELS
Datasets. We denote the two spatio-temporal usage record datasets, from the two LES across which the linkage is to be performed, as I and E.

Figure 1: The co-occurring event pairs are shown using dashed lines. Events of a given entity are shown within circles. Entities a, b, c, and y are from one LES, and entities d, e, f, and x are from the other LES.

Entities and events. Entities, or users, are real-world systems or people who use LES. We use the terms user and entity interchangeably. They are represented in the datasets with their ids, potentially anonymized, which are different for the two LES. Events correspond to usage records generated by a LES as a result of users interacting with the service. For an event e ∈ E (or i ∈ I), e.u (or i.u) represents the entity associated with the event. We use U_E and U_I to denote the sets of entity ids in the datasets E and I, respectively. We have U_E = {e.u : e ∈ E} and U_I = {i.u : i ∈ I}.
Location and time. Each event in the dataset contains location and time information. The location information is in the form of a region, denoted as e.r for event e. We do not use a point for location, as for most LES the location information is in the form of a region. We assume the time information is a point in time.

2.1 k-l Diversity
The core idea behind the k-l diversity model is to locate pairs of entities whose events satisfy k-l diversity. Furthermore, such pairs of entities must not have any alibis.
Co-occurrence. Two events from different datasets are called co-occurring if they are close in space and time. For two records i ∈ I and e ∈ E, closeness in space is defined in terms of the intersection of regions. To capture closeness in time, we use a parameter α, and call two events close in time if they are within a window of α time units of each other.
Alibi. While a definition of similarity is necessary to link events from two different datasets, a definition of dissimilarity is also required to rule out pairs of entities as potential matches in our linkage. Such negative matches enable us to rule out incorrect matches and also reduce the space of possible matches throughout the linkage process. We refer to these negative matches as alibis. In this work, we use alibi to describe events from two different datasets that happened around the same time but at different locations, such that it is not possible for a user to move from one of these locations to the other within the duration defined by the difference of the timestamps of the events.
Entity linkage. Let x ∈ U_I and y ∈ U_E be two entities. In order to decide whether two entities are the same, we search for k co-occurring event pairs, at least l of which are at diverse locations. However, each co-occurring event pair does not count as 1, since each of these events could co-occur with many other events. Let C(i, e) be the function representing the aforementioned co-occurrence relation of records i and e. We weight the co-occurring event pairs as:

w(i, e) = |{i1.u : C(i1, e) ∧ i1 ∈ I}|^(-1) · |{e1.u : C(i, e1) ∧ e1 ∈ E}|^(-1)   (1)

Given a co-occurring event pair between two entities, we check how many possible entities' events could be matched to these events. For instance, in Figure 1, consider the solid line at the top with the weight 1/6. The event on its left could be matched to events of 2 different entities, and the event on its right could be matched to events of 3 different entities. To compute the weight of a co-occurring pair, we multiply the inverses of these entity counts, assuming that the possibilities of matching from the two sides are independent. As such, in Figure 1, we get 1/2 · 1/3 = 1/6.
l-diverse event pairs. For an entity pair to be considered l-diverse, there need to be at least l unique locations among its co-occurring event pairs. However, for a location to be counted towards these l locations, the total weight of the co-occurring event pairs for that location must be at least 1. Here, one subtle issue is defining a unique location. Intuitively, when the datasets have different granularities for space, using the higher-granularity one to define uniqueness would give more accurate results. This could simply be a grid-based division of the space.
Entities x and y can be linked to each other if they have k co-occurring event pairs in l diverse locations and their datasets do not contain alibi event pairs. Moreover, we only consider entity pairs for which there is no ambiguity, i.e., no two pairs (x, y) and (x, z) that are both k-l diverse. Setting the k and l values too low would lead to many ambiguous pairs, while setting them too high would lead to many false negatives. To find the balance between these two, we apply elbow detection techniques on the k and l distributions.

2.2 Probabilistic Linkage
Besides k-l diversity, we aim to model the spatio-temporal linkage problem using a probabilistic model. For consistency, we use the same notation as in the k-l diversity model as much as possible.
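Both models build on the same space-time closeness notion. The following minimal sketch shows the co-occurrence test (region intersection plus the α window) and the alibi test; representing regions as sets of grid-cell ids and bounding user speed by a MAX_SPEED constant are assumptions made purely for illustration, as the concrete representation is left open here.

```python
# Hypothetical event representation: a region given as a set of grid-cell
# ids, plus a timestamp in seconds. Both are illustrative assumptions.
ALPHA = 600        # time-closeness window (seconds)
MAX_SPEED = 30.0   # assumed upper bound on user speed (meters/second)

def co_occurring(ev_i, ev_e):
    """Events co-occur if their regions intersect and their timestamps
    are within ALPHA time units of each other."""
    (region_i, t_i), (region_e, t_e) = ev_i, ev_e
    return bool(region_i & region_e) and abs(t_i - t_e) <= ALPHA

def is_alibi(ev_i, ev_e, distance_m):
    """An alibi: two events so far apart in space that no user could have
    travelled between them in the elapsed time."""
    (_, t_i), (_, t_e) = ev_i, ev_e
    return distance_m > MAX_SPEED * abs(t_i - t_e)

a = ({"c12", "c13"}, 1000)
b = ({"c13"}, 1200)
print(co_occurring(a, b))        # True: shared cell c13, 200 s apart
print(is_alibi(a, b, 50_000.0))  # True: 50 km in 200 s is impossible
```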
Figure 2: An example of 5 co-occurring events in 3 diverse locations from real-world data. All weights are assumed to be 1.

The probabilistic model starts by aggregating all of the entities on a common grid using the spatio-temporal features of the datasets. G_k denotes the set of entities in a specific cell of the grid, and |G_k^I(t)| denotes the number of entities from dataset I in cell G_k at some time interval [t, t + α]. Let x ∈ U_I and y ∈ U_E be two entities; the set of entities co-located with entity x in U_E, and the set of entities co-located with entity y in U_I, at time [t, t + α] in the grid are given as G_x(t) and G_y(t), respectively.

Assuming two sets S1 and S2, where |S1| = |S2| = n, the number of possible different complete matches (CM), in which each element in S1 has a partner in S2, is n!, using trivial combinations without repetition. If the numbers of elements in the sets are not equal, i.e., |S1| = n ≠ |S2| = m, then the problem turns into choosing m out of n (where n ≥ m) and calculating the complete match with m elements, i.e., CM(n, m) = C(n, m) · m!.

As the user set of one LES is typically not a subset of that of the second, we also define the partial match (PM), where only k out of the m elements in S2 match with k out of the n elements in S1. In this case, we additionally need to choose k out of m and use the complete match, i.e., PM(n, m, k) = C(m, k) · CM(n, k) = C(m, k) · C(n, k) · k!.

Let x ≡ y represent that entities x and y are the same real-world entity (a match). The probability that a specific pair of items match each other is calculated as the number of events where x and y match, divided by the number of events in the universal set of all possibilities.

If we are given a single snapshot of the grid at time t, the probability that a randomly chosen pair of co-located entities in the two services is the same entity (assuming a complete match case only) can be found as follows:

P(x ≡ y) = PM(n−1, m−1, m−1) / PM(n, m, m)   (2)
         = CM(n−1, m−1) / CM(n, m) = 1/n     (3)
         = 1 / max(|G_k^I(t)|, |G_k^E(t)|) if G_x(t) = G_y(t) = G_k, and 0 otherwise   (4)

This is an intuitive result, since a random entity from the smaller set can be equal to any element in the larger set with equal probability.

If we have more than one sample of the entity (for time slots t_0 to t_T, where we do not require the slots to be sequential in time) and the grid, we can then use the history of the entity. The probability is similar, except that it takes the tracks of the entities into account:

P(x ≡ y) = ( Σ_{k=1..m} PM(n−1, m−1, k−1) ) / ( Σ_{k=1..m} PM(n, m, k) ) if G_x(t_i) = G_y(t_i) for all t_i, and 0 otherwise   (5)

where n > m.

An important issue is the decision on the values of n and m. If the entities x and y have l events sharing the same cell and time interval, one must look for all possible pairs that satisfy this property. Since the number of users in the intersection is smaller, and is expected to decrease rapidly for large l values, the probability of two users in the respective services being the same user with the same track will be considerably high.

It is fair to assume that both systems have a considerable number of common entities that will match. This commonality needs to be estimated empirically from ground truth values. After this value is found, we can use it to limit the k values (e.g., k ∈ [k1 : k2]) when calculating the probabilities, and the probability in Equation (5) turns into:

P(x ≡ y) = ( Σ_{k=k1..k2} PM(n−1, m−1, k−1) ) / ( Σ_{k=k1..k2} PM(n, m, k) )   (6)

One can relax the condition of equality for the tracking based on the location accuracy of the services and the chosen grid sizes. Moreover, similar to the diversity concept of the k-l diversity model, the distance among the shared cells could be used to distinguish between multiple pairs sharing a common user with close matching probabilities, i.e., P(x ≡ y) = P(x ≡ z).

3. FRAMEWORK
The second component of this doctoral work is the framework to perform the linkage efficiently. Naïve record linkage algorithms that compare every pair of records take O(n²) time [6], where n is the number of records. Therefore, a number of techniques, e.g., indexing and blocking, have been developed to prune the search space of linkage. To perform the linkage in reasonable time, we take advantage of the spatio-temporal structure of the data. To realize the k-l diversity model, we developed an algorithm called ST-Link [2]. Our implementation of the probabilistic model is still ongoing.
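The match probabilities in Equations (2) through (6) follow directly from the CM and PM definitions; a minimal sketch, assuming Python's math.comb for the binomial coefficient C(n, k):

```python
from math import comb, factorial

def cm(n, m):
    """Complete matches CM(n, m): choose m of n, then match them in m! ways."""
    return comb(n, m) * factorial(m)

def pm(n, m, k):
    """Partial matches PM(n, m, k): k of the m elements match k of the n."""
    return comb(m, k) * cm(n, k)

def p_single_snapshot(n, m):
    """Equations (2)-(3): probability that two co-located entities are the
    same, given one snapshot with n and m entities in the shared cell."""
    return pm(n - 1, m - 1, m - 1) / pm(n, m, m)

def p_with_history(n, m, k_lo=1, k_hi=None):
    """Equations (5)-(6): the same probability when the entities share the
    same cell in every observed slot; k may be limited to [k_lo, k_hi]."""
    k_hi = m if k_hi is None else k_hi
    num = sum(pm(n - 1, m - 1, k - 1) for k in range(k_lo, k_hi + 1))
    den = sum(pm(n, m, k) for k in range(k_lo, k_hi + 1))
    return num / den

print(p_single_snapshot(7, 4) == 1 / 7)  # True: 1/max(n, m), as Equation (4) states
```

The single-snapshot probability reduces to 1/n for n ≥ m, matching the intuition noted after Equation (4).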
The ST-Link algorithm uses two filtering steps before pairwise comparisons of candidate entities are performed to compute the final linkage. It first distributes entities (users) over coarse-grained geographical regions that we call dominating grid cells. Such grid cells contain most of the activities of their users. For two users to link, they must have a common dominating grid cell. Once this step is over, the linkage is performed independently over each dominating grid cell. To identify the dominating grid cells, we make a sequential scan over all records and utilize a quad-tree based index, which limits the area of the smallest grid cell from below. During the temporal filtering step, ST-Link uses a sliding-window based scan to build candidate user pairs, while also pruning this list as alibis are encountered for the current candidate pairs. Finally, our complete linkage model is evaluated over the candidate pairs of users that remain after the spatial and temporal filtering steps. During this linkage step, we need the time-sorted events of the users at hand. For that purpose, during the forward scan, we also create a disk-based index sorted by user id and event time. This index enables us to quickly iterate over the events of a given user in timestamp order, an operation used by the linkage step. Also, if one of the datasets is sparser than the other, ST-Link performs the linkage by iterating over the users of the denser dataset first, making sure their events are loaded only once. This is akin to the classical join-ordering heuristic in databases.

Our experimental evaluation shows that the k-l diversity model is effective (up to 89% precision and 61% recall), yet its efficiency could benefit from a distributed approach. However, distributed processing is challenging due to the mobility of users and the scale of the data. First, distributing records based on their spatio-temporal features would spread the records of a single user over multiple processing nodes, and hence lead to high inter-machine communication costs. While the concept of dominating grid cells addresses this issue, scalability would still suffer from the spatial skew of real data (in our experiments, 18% of all records resided in a single grid cell out of 120). Second, since the temporal filtering techniques require at least one batch of data to reside on the same machine (this issue exists in both models, either for filtering or for aggregating), records cannot be written to machines in parallel, which would lead to low write performance. With these challenges identified, we are going to focus on optimizations of both models to create a single optimized framework that can efficiently perform linkage for both models. Such a framework would be beneficial for both industry and academia when aggregating semantically different datasets for social-good applications, and when benchmarking linkage research.

4. RELATED WORK
Record Linkage. One of the earliest appearances of the term record linkage is by Newcombe et al. [9]. In the literature, it is also referred to as entity resolution (ER), deduplication, object identification, and reference reconciliation, as discussed in [4]. Most of the work in this area focuses on a single type of database and defines the linked records with respect to a similarity metric. To the best of our knowledge, linking the users of usage records, specifically targeted at spatio-temporal datasets, is novel.
Spatial Record Linkage and Spatial Joins. Many join algorithms have been proposed in the literature for spatial data [8]. Spatial record linkage and join algorithms are not directly applicable to spatio-temporal data, as they are based on the intersection of minimum bounding boxes, one-sided nearest joins, or string similarity. Spatio-temporal joins have constraints on both the spatial and temporal domains [1]. [10] is a recent work with a similar motivation, which calculates weights of matching between users and applies maximum-weight partitioning techniques. Their experiments validate the accuracy of this approach, but they do not focus on scalability.
User Identification. Our work has commonalities with the work done in the area of user identification. For instance, de Montjoye et al. [5] have shown that, given a spatio-temporal dataset of call detail records, one can uniquely identify 95% of the population using 4 randomly selected spatio-temporal points. However, linking users is different from identification, as identification leaves the question of whose data to aggregate unanswered.

5. CONCLUSIONS & RESEARCH PLAN
In this paper, we introduced two linkage models for matching users across location enhanced services, and discussed implementation techniques. We have already realized a single-machine implementation of the k-l diversity model with the ST-Link algorithm. We are now working on the validation and implementation of the probabilistic model, and aim to compare the two models with each other. Our single-machine implementations showed that both models could benefit from a parallelized, distributed implementation. Therefore, we set the development of a distributed and generic framework as the future goal of this doctoral work.

6. REFERENCES
[1] P. Bakalov and V. Tsotras. Continuous spatiotemporal trajectory joins. In GeoSensor Networks, volume 4540 of Lecture Notes in Computer Science, pages 109-128. Springer Berlin Heidelberg, 2008.
[2] F. Basik, B. Gedik, C. Etemoglu, and H. Ferhatosmanoglu. Spatio-temporal linkage over location-enhanced services. IEEE Trans. on Mobile Computing, PP(99):1-1, 2017.
[3] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. Swoosh: A generic approach to entity resolution. The VLDB Journal, 18(1):255-276, Jan. 2009.
[4] P. Christen. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer Science & Business Media, 2012.
[5] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific Reports, 3, 2013.
[6] L. Getoor and A. Machanavajjhala. Entity resolution: Theory, practice & open challenges. In VLDB Conference (PVLDB), 2012.
[7] S. Ghemawat and J. Dean. LevelDB. https://github.com/google/leveldb, 2015.
[8] E. H. Jacox and H. Samet. Spatial join techniques. ACM Trans. Database Syst., 32(1), Mar. 2007.
[9] H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records: Computers can be used to extract "follow-up" statistics of families from files of routine records. Science, 130(3381):954-959, 1959.
[10] C. Riederer, Y. Kim, A. Chaintreau, N. Korula, and S. Lattanzi. Linking users across domains with location data: Theory and validation. In Proc. of the 25th Int. Conf. on WWW, pages 707-719, 2016.
[11] A. Skovsgaard, D. Sidlauskas, and C. Jensen. Scalable top-k spatio-temporal term querying. In IEEE Int. Conference on Data Engineering (ICDE), pages 148-159, March 2014.