Scalable Linkage across Location Enhanced Services

Fuat BASIK
Supervised By: Hakan Ferhatosmanoğlu and Buğra Gedik
Department of Computer Engineering, Bilkent University, Turkey
fuat.basik@bilkent.edu.tr

ABSTRACT
In this work, we investigate methods for merging spatio-temporal usage and entity records across two location-enhanced services, even when the datasets are semantically different. To address both effectiveness and efficiency, we study this linkage problem in two parts: model and framework. First, we discuss models, including k-l diversity, a concept we developed to capture both the spatial and temporal diversity aspects of the linkage, and probabilistic linkage. Second, we aim to develop a framework that brings efficient computation and parallelization support for both models of linkage.

1. INTRODUCTION
An important portion of the digital footprint left behind by entities interacting with online services contains spatio-temporal references. This footprint is a fertile resource for business intelligence applications [11]. We refer to the services that create spatio-temporal records of their usage as Location Enhanced Services (LES). For instance, Foursquare/Swarm(1), a popular social networking service, records the locations of users when they check in at a point-of-interest (POI) registered in the system. Similarly, mobile phone service providers generate a record every time a call is made, which includes the cell tower whose coverage area contains the user's location.

(1) www.foursquare.com / www.swarmapp.com

Proceedings of the VLDB 2017 PhD Workshop, August 28, 2017. Munich, Germany. Copyright (c) 2017 for this paper by its authors. Copying permitted for private and academic purposes.

Records with similar location and time naturally observe similar phenomena. The data analyst can gather such data from multiple sources, which are typically anonymized due to privacy concerns. These sources could generate semantically different datasets, or the semantic link between the sources could have been lost due to anonymization. As most data science tasks require large amounts of data for accurate training with high confidence, scientists need to combine data from multiple sources to produce accurate aggregate patterns. For example, spatio-temporal usage records belonging to the same real-world user can be matched across records from two different location-enhanced services, even when the datasets are semantically different. Another example would be the linkage of sensor data from different vendors that are embedded in the same moving system, e.g., self-driving cars. This linkage enables data scientists and service providers to obtain information that they cannot derive by mining only one set of usage records. Consider a LES provider that combines user segmentation results derived from its own usage records with social segmentation results derived from publicly available Swarm records. There are several algorithmic and systems challenges in merging information from multiple sources of anonymized spatio-temporal data that are collected with the necessary permissions. To cover both effectiveness and efficiency, we divide this linkage problem into two parts: model and framework.

To develop effective models, one needs to define a similarity or probabilistic measure for linkage that considers time, location, and the relationship between the two. This is relatively simpler for many record linkage tasks [4], where linkage is defined based on a similarity measure defined over records (such as Minkowski distance or Jaccard similarity). In spatio-temporal linkage, for a pair of users from two different datasets to be considered as matching, their usage history must contain records that are close both in space and in time; and there must not be negative matches, such as records that are close in time but far apart in distance. We call such negative matches alibis. To address these challenges, we introduce two linkage models. The first one is based on k-l diversity, a new concept we have introduced to capture both the spatial and temporal diversity aspects of the linkage. A pair of entities, one from each dataset, is called k-l diverse if they have at least k co-occurring records (both temporally and spatially) in at least l different locations, and such pairs of entities must not have any alibis. The second model we aim to develop is based on probabilistic linkage, in which we seek to model the matching probability of two entities based on their spatio-temporal history. A pair of entities is called a match, or linked, with probability P, which is proportional to their common events aggregated on grids and timestamps. P is inversely proportional to the number of all other entities simultaneously acting in the same grid cell.

Considering that location-based social networks get millions of updates every day, linkage over hundreds of days of data would take an impractically long amount of time. Naïve record linkage algorithms that compare every pair of records take O(n²) time [6], where n is the number of records. Generic entity matching tools do not provide the necessary optimizations for the scalability and efficiency of spatio-temporal linkage [3]. In order to merge datasets in a reasonable time, we will develop a scalable framework that takes advantage of the spatio-temporal structure of the data. The ST-Link algorithm, which we have recently developed to realize the k-l diversity model in the real world, uses two filtering steps before pairwise comparisons of candidate users, and makes use of spatial index trees, temporal sliding windows, and log-structured merge trees [7]. In addition to effective indexing techniques, we believe efficiency could benefit from parallelization of the computation.
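To make the cost gap concrete, the sketch below contrasts the naive all-pairs comparison with a simple time-window blocking pass. The record layout (entity id, region id, timestamp) and the window width are illustrative assumptions; the actual filtering in ST-Link is considerably more elaborate.

```python
from collections import defaultdict

# Hypothetical record layout: (entity_id, region_id, timestamp_seconds).
ALPHA = 300  # time-closeness window in seconds (the paper's alpha parameter)

def naive_candidate_pairs(records_i, records_e):
    """O(n^2): compare every record pair across the two datasets."""
    return [(i, e) for i in records_i for e in records_e
            if abs(i[2] - e[2]) <= ALPHA and i[1] == e[1]]

def blocked_candidate_pairs(records_i, records_e):
    """Bucket records into ALPHA-sized time windows, so each record is only
    compared against records in the same or adjacent windows."""
    buckets = defaultdict(list)
    for e in records_e:
        buckets[e[2] // ALPHA].append(e)
    pairs = []
    for i in records_i:
        w = i[2] // ALPHA
        for e in buckets[w - 1] + buckets[w] + buckets[w + 1]:
            if abs(i[2] - e[2]) <= ALPHA and i[1] == e[1]:
                pairs.append((i, e))
    return pairs

I = [("u1", "r1", 100), ("u2", "r2", 5000)]
E = [("v1", "r1", 250), ("v2", "r3", 5100)]
print(naive_candidate_pairs(I, E) == blocked_candidate_pairs(I, E))  # True
```

Both passes produce the same candidate set; the blocked version simply avoids touching record pairs that cannot possibly satisfy the time window.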
2. LINKAGE MODELS
Datasets. We denote the two spatio-temporal usage record datasets, from the two LES across which the linkage is to be performed, as I and E.

Figure 1: The co-occurring event pairs are shown using dashed lines. Events of a given entity are shown within circles. Entities a, b, c, and y are from one LES, and entities d, e, f, and x are from the other LES.

Entities and events. Entities, or users, are real-world systems or people who use LES. We use the terms user and entity interchangeably. They are represented in the datasets with their ids, potentially anonymized, which are different for the two LES. Events correspond to usage records generated by a LES as a result of users interacting with the service. For an event e ∈ E (or i ∈ I), e.u (or i.u) represents the entity associated with the event. We use U_E and U_I to denote the sets of entity ids in the datasets E and I, respectively. We have U_E = {e.u : e ∈ E} and U_I = {i.u : i ∈ I}.
Location and time. Each event in the dataset contains location and time information. The location information is in the form of a region, denoted as e.r for event e. We do not use a point for location, as for most LES the location information is in the form of a region. We assume the time information is a point in time.

2.1 k-l Diversity
The core idea behind the k-l diversity model is to locate pairs of entities whose events satisfy k-l diversity. Furthermore, such pairs of entities must not have any alibis.
Co-occurrence. Two events from different datasets are called co-occurring if they are close in space and time. For two records i ∈ I and e ∈ E, closeness in space is defined in terms of the intersection of regions. To capture closeness in time, we use a parameter α, and call two events close in time if they are within a window of α time units of each other.
Alibi. While a definition of similarity is necessary to link events from two different datasets, a definition of dissimilarity is also required to rule out pairs of entities as potential matches in our linkage. Such negative matches enable us to rule out incorrect matches and also reduce the space of possible matches throughout the linkage process. We refer to these negative matches as alibis. In this work, we use alibi to describe events from two different datasets that happened around the same time but at different locations, such that it is not possible for a user to move from one of these locations to the other within the duration defined by the difference of the timestamps of the events.
Entity linkage. Let x ∈ U_I and y ∈ U_E be two entities. In order to decide whether two entities are the same, we search for k co-occurring event pairs, at least l of which are at diverse locations. However, each co-occurring event pair does not count as 1, since each of these events could co-occur with many other events. Let C(i, e) be the function representing the aforementioned co-occurrence relation of records i and e. We weight the co-occurring event pairs as:

w(i, e) = |{i1.u : C(i1, e) ∧ i1 ∈ I}|^(-1) · |{e1.u : C(i, e1) ∧ e1 ∈ E}|^(-1)   (1)

Given a co-occurring event pair between two entities, we check how many possible entities' events could be matched to these events. For instance, in Figure 1, consider the solid line at the top with the weight 1/6. The event on its left could be matched to events of 2 different entities, and the event on its right could be matched to events of 3 different entities. To compute the weight of a co-occurring pair, we multiply the inverses of these entity counts, assuming that the possibilities of matching from the two sides are independent. As such, in Figure 1, we get 1/2 · 1/3 = 1/6.
l-diverse event pairs. For an entity pair to be considered l-diverse, there need to be at least l unique locations among its co-occurring event pairs. However, for a location to be counted towards these l locations, the total weight of the co-occurring event pairs for that location must be at least 1. Here, one subtle issue is defining a unique location. Intuitively, when the datasets have different granularities for space, using the higher-granularity one to define uniqueness would give more accurate results. This could simply be a grid-based division of the space.
Entities x and y can be linked to each other if they have k co-occurring event pairs in l diverse locations and their datasets do not contain alibi event pairs. Moreover, we only consider entity pairs for which there is no ambiguity, i.e., no two pairs (x, y) and (x, z) that are both k-l diverse. Setting the k and l values too low would lead to many ambiguous pairs, while setting them too high would lead to many false negatives. To find the balance between these two, we apply elbow detection techniques on the k and l distributions.

2.2 Probabilistic Linkage
Besides k-l diversity, we aim to model the spatio-temporal linkage problem using a probabilistic model. For consistency, we use the same notation as in the k-l diversity model as much as possible.
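Both models build on the same space-time closeness notion. The following minimal sketch shows the co-occurrence test (region intersection plus the α window) and the alibi test; representing regions as sets of grid-cell ids and bounding user speed by a MAX_SPEED constant are assumptions made purely for illustration, as the concrete representation is left open here.

```python
# Hypothetical event representation: a region given as a set of grid-cell
# ids, plus a timestamp in seconds. Both are illustrative assumptions.
ALPHA = 600        # time-closeness window (seconds)
MAX_SPEED = 30.0   # assumed upper bound on user speed (meters/second)

def co_occurring(ev_i, ev_e):
    """Events co-occur if their regions intersect and their timestamps
    are within ALPHA time units of each other."""
    (region_i, t_i), (region_e, t_e) = ev_i, ev_e
    return bool(region_i & region_e) and abs(t_i - t_e) <= ALPHA

def is_alibi(ev_i, ev_e, distance_m):
    """An alibi: two events so far apart in space that no user could have
    travelled between them in the elapsed time."""
    (_, t_i), (_, t_e) = ev_i, ev_e
    return distance_m > MAX_SPEED * abs(t_i - t_e)

a = ({"c12", "c13"}, 1000)
b = ({"c13"}, 1200)
print(co_occurring(a, b))        # True: shared cell c13, 200 s apart
print(is_alibi(a, b, 50_000.0))  # True: 50 km in 200 s is impossible
```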
Figure 2: An example of 5 co-occurring events in 3 diverse locations from real-world data. All weights are assumed to be 1.

The probabilistic model starts by aggregating all of the entities on a common grid using the spatio-temporal features of the datasets. G_k denotes the set of entities in a specific cell of the grid, and |G_k^I(t)| denotes the number of entities from dataset I in cell G_k at some time interval [t, t + α]. Let x ∈ U_I and y ∈ U_E be two entities; the set of entities co-located with entity x in U_E, and the set of entities co-located with entity y in U_I, at time [t, t + α] in the grid are given as G_x(t) and G_y(t), respectively.

Assuming two sets S1 and S2, where |S1| = |S2| = n, the number of possible different complete matches (CM), in which each element in S1 has a partner in S2, is n!, using trivial combinations without repetition. If the numbers of elements in the sets are not equal, i.e., |S1| = n ≠ |S2| = m, then the problem turns into choosing m out of n (where n ≥ m) and calculating the complete match with m elements, i.e., CM(n, m) = C(n, m) · m!.

As the user set of one LES is typically not a subset of that of the second, we also define the partial match (PM), where only k out of the m elements in S2 match with k out of the n elements in S1. In this case, we additionally need to choose k out of m and use the complete match, i.e., PM(n, m, k) = C(m, k) · CM(n, k) = C(m, k) · C(n, k) · k!.

Let x ≡ y represent that entities x and y are the same real-world entity (a match). The probability that a specific pair of items match each other is calculated as the number of events where x and y match, divided by the number of events in the universal set of all possibilities.

If we are given a single snapshot of the grid at time t, the probability that a randomly chosen pair of co-located entities in the two services is the same entity (assuming a complete match case only) can be found as follows:

P(x ≡ y) = PM(n−1, m−1, m−1) / PM(n, m, m)   (2)
         = CM(n−1, m−1) / CM(n, m) = 1/n     (3)
         = 1 / max(|G_k^I(t)|, |G_k^E(t)|) if G_x(t) = G_y(t) = G_k, and 0 otherwise   (4)

This is an intuitive result, since a random entity from the smaller set can be equal to any element in the larger set with equal probability.

If we have more than one sample of the entity (for time slots t_0 to t_T, where we do not require the slots to be sequential in time) and the grid, we can then use the history of the entity. The probability is similar, except that it takes the tracks of the entities into account:

P(x ≡ y) = ( Σ_{k=1..m} PM(n−1, m−1, k−1) ) / ( Σ_{k=1..m} PM(n, m, k) ) if G_x(t_i) = G_y(t_i) for all t_i, and 0 otherwise   (5)

where n > m.

An important issue is the decision on the values of n and m. If the entities x and y have l events sharing the same cell and time interval, one must look for all possible pairs that satisfy this property. Since the number of users in the intersection is smaller, and is expected to decrease rapidly for large l values, the probability of two users in the respective services being the same user with the same track will be considerably high.

It is fair to assume that both systems have a considerable number of common entities that will match. This commonality needs to be estimated empirically from ground truth values. After this value is found, we can use it to limit the k values (e.g., k ∈ [k1 : k2]) when calculating the probabilities, and the probability in Equation (5) turns into:

P(x ≡ y) = ( Σ_{k=k1..k2} PM(n−1, m−1, k−1) ) / ( Σ_{k=k1..k2} PM(n, m, k) )   (6)

One can relax the condition of equality for the tracking based on the location accuracy of the services and the chosen grid sizes. Moreover, similar to the diversity concept of the k-l diversity model, the distance among the shared cells could be used to distinguish between multiple pairs sharing a common user with close matching probabilities, i.e., P(x ≡ y) = P(x ≡ z).

3. FRAMEWORK
The second component of this doctoral work is the framework to perform the linkage efficiently. Naïve record linkage algorithms that compare every pair of records take O(n²) time [6], where n is the number of records. Therefore, a number of techniques, e.g., indexing and blocking, have been developed to prune the search space of linkage. To perform the linkage in reasonable time, we take advantage of the spatio-temporal structure of the data. To realize the k-l diversity model, we developed an algorithm called ST-Link [2]. Our implementation of the probabilistic model is still ongoing.
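The match probabilities in Equations (2) through (6) follow directly from the CM and PM definitions; a minimal sketch, assuming Python's math.comb for the binomial coefficient C(n, k):

```python
from math import comb, factorial

def cm(n, m):
    """Complete matches CM(n, m): choose m of n, then match them in m! ways."""
    return comb(n, m) * factorial(m)

def pm(n, m, k):
    """Partial matches PM(n, m, k): k of the m elements match k of the n."""
    return comb(m, k) * cm(n, k)

def p_single_snapshot(n, m):
    """Equations (2)-(3): probability that two co-located entities are the
    same, given one snapshot with n and m entities in the shared cell."""
    return pm(n - 1, m - 1, m - 1) / pm(n, m, m)

def p_with_history(n, m, k_lo=1, k_hi=None):
    """Equations (5)-(6): the same probability when the entities share the
    same cell in every observed slot; k may be limited to [k_lo, k_hi]."""
    k_hi = m if k_hi is None else k_hi
    num = sum(pm(n - 1, m - 1, k - 1) for k in range(k_lo, k_hi + 1))
    den = sum(pm(n, m, k) for k in range(k_lo, k_hi + 1))
    return num / den

print(p_single_snapshot(7, 4) == 1 / 7)  # True: 1/max(n, m), as Equation (4) states
```

The single-snapshot probability reduces to 1/n for n ≥ m, matching the intuition noted after Equation (4).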
The ST-Link algorithm uses two filtering steps before pairwise comparisons of candidate entities are performed to compute the final linkage. It first distributes entities (users) over coarse-grained geographical regions that we call dominating grid cells. Such grid cells contain most of the activities of their users. For two users to link, they must have a common dominating grid cell. Once this step is over, the linkage is performed independently over each dominating grid cell. To identify the dominating grid cells, we make a sequential scan over all records and utilize a quad-tree based index, which limits the area of the smallest grid cell from below. During the temporal filtering step, ST-Link uses a sliding-window based scan to build candidate user pairs, while also pruning this list as alibis are encountered for the current candidate pairs. Finally, our complete linkage model is evaluated over the candidate pairs of users that remain after the spatial and temporal filtering steps. During this linkage step, we need the time-sorted events of the users at hand. For that purpose, during the forward scan, we also create a disk-based index sorted by user id and event time. This index enables us to quickly iterate over the events of a given user in timestamp order, an operation used by the linkage step. Also, if one of the datasets is sparser than the other, ST-Link performs the linkage by iterating over the users of the denser dataset first, making sure their events are loaded only once. This is akin to the classical join-ordering heuristic in databases.

Our experimental evaluation shows that the k-l diversity model is effective (up to 89% precision and 61% recall), yet its efficiency could benefit from a distributed approach. However, distributed processing is challenging due to the mobility of users and the scale of the data. First, distributing records based on their spatio-temporal features would spread the records of a single user over multiple processing nodes, and hence lead to high inter-machine communication costs. While the concept of dominating grid cells addresses this issue, scalability would still suffer from the spatial skew of real data (in our experiments, 18% of all records resided in a single grid cell out of 120). Second, since the temporal filtering techniques require at least one batch of data to reside on the same machine (this issue exists in both models, either for filtering or for aggregating), records cannot be written to machines in parallel, which would lead to low write performance. With these challenges identified, we are going to focus on optimizations of both models to create a single optimized framework that can efficiently perform linkage for both models. Such a framework would be beneficial for both industry and academia when aggregating semantically different datasets for social-good applications, and when benchmarking linkage research.

4. RELATED WORK
Record Linkage. One of the earliest appearances of the term record linkage is by Newcombe et al. [9]. In the literature, it is also referred to as entity resolution (ER), deduplication, object identification, and reference reconciliation, as discussed in [4]. Most of the work in this area focuses on a single type of database and defines the linked records with respect to a similarity metric. To the best of our knowledge, linking the users of usage records, specifically targeted at spatio-temporal datasets, is novel.
Spatial Record Linkage and Spatial Joins. Many join algorithms have been proposed in the literature for spatial data [8]. Spatial record linkage and join algorithms are not directly applicable to spatio-temporal data, as they are based on the intersection of minimum bounding boxes, one-sided nearest joins, or string similarity. Spatio-temporal joins have constraints on both the spatial and temporal domains [1]. [10] is a recent work with a similar motivation, which calculates weights of matching between users and applies maximum-weight partitioning techniques. Their experiments validate the accuracy of this approach, but they do not focus on scalability.
User Identification. Our work has commonalities with the work done in the area of user identification. For instance, de Montjoye et al. [5] have shown that, given a spatio-temporal dataset of call detail records, one can uniquely identify 95% of the population using 4 randomly selected spatio-temporal points. However, linking users is different from identification, as identification leaves the question of whose data to aggregate unanswered.

5. CONCLUSIONS & RESEARCH PLAN
In this paper, we introduced two linkage models for matching users across location enhanced services, and discussed implementation techniques. We have already realized a single-machine implementation of the k-l diversity model with the ST-Link algorithm. We are now working on the validation and implementation of the probabilistic model, and aim to compare the two models with each other. Our single-machine implementations showed that both models could benefit from a parallelized, distributed implementation. Therefore, we set the development of a distributed and generic framework as the future goal of this doctoral work.

6. REFERENCES
[1] P. Bakalov and V. Tsotras. Continuous spatiotemporal trajectory joins. In GeoSensor Networks, volume 4540 of Lecture Notes in Computer Science, pages 109-128. Springer Berlin Heidelberg, 2008.
[2] F. Basik, B. Gedik, C. Etemoglu, and H. Ferhatosmanoglu. Spatio-temporal linkage over location-enhanced services. IEEE Trans. on Mobile Computing, PP(99):1-1, 2017.
[3] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. Swoosh: A generic approach to entity resolution. The VLDB Journal, 18(1):255-276, Jan. 2009.
[4] P. Christen. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer Science & Business Media, 2012.
[5] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific Reports, 3, 2013.
[6] L. Getoor and A. Machanavajjhala. Entity resolution: Theory, practice & open challenges. In VLDB Conference (PVLDB), 2012.
[7] S. Ghemawat and J. Dean. LevelDB. https://github.com/google/leveldb, 2015.
[8] E. H. Jacox and H. Samet. Spatial join techniques. ACM Trans. Database Syst., 32(1), Mar. 2007.
[9] H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records: Computers can be used to extract "follow-up" statistics of families from files of routine records. Science, 130(3381):954-959, 1959.
[10] C. Riederer, Y. Kim, A. Chaintreau, N. Korula, and S. Lattanzi. Linking users across domains with location data: Theory and validation. In Proc. of the 25th Int. Conf. on WWW, pages 707-719, 2016.
[11] A. Skovsgaard, D. Sidlauskas, and C. Jensen. Scalable top-k spatio-temporal term querying. In IEEE Int. Conference on Data Engineering (ICDE), pages 148-159, March 2014.