PRIVATEER:
              A Private Record Linkage Toolkit

    Alexandros Karakasidis1 , Georgia Koloniari2 , and Vassilios S. Verykios1
    1
      Hellenic Open University, School of Science & Technology, Patras, Greece
                      {a.karakasidis, verykios}@eap.gr
2
  University of Macedonia, Department of Applied Informatics, Thessaloniki, Greece
                             gkoloniari@uom.edu.gr


        Abstract. Privacy preserving record linkage (PPRL) is the process of
        integrating data across multiple heterogeneous data sources without com-
        promising their privacy. While many techniques have been developed for
        PPRL, there has not been, up to now, a universally accepted method
        providing both performance and quality of results in all cases. To this
        end, we present PRIVATEER, a toolkit which aims at enabling practi-
        tioners to compare various techniques involved in the PPRL process and
        determine the best for their needs. The toolkit is based on a simulator,
        designed to be highly configurable, modular and extensible, allowing the
        user to test different configurations by combining a number of privacy
        preserving blocking and matching methods with corresponding distance
        and similarity measures on her own or sample data. We showcase the us-
        ability of our toolkit by presenting experimental results measuring both
        quality and performance of state-of-the-art PPRL methods.


Keywords: Privacy, Record Linkage, Toolkit, Simulator


1   Introduction
Large volumes of data, describing individuals, are collected by public or pri-
vate organizations. Being able to interconnect and analyze these independently
stored data is beneficial for businesses, governments and research. For instance,
being able to interconnect data of distinct health organizations, hospitals and
physicians could prevent epidemic outbreaks or help us better understand the
mechanisms behind certain diseases.
    The task of linking such heterogeneous data, known as the record linkage
problem, is not trivial [4]. Since data are stored by independent organizations,
there is no unique identifier to determine common data. Also, such data suf-
fer from low quality, exhibiting typos, abbreviations and misspellings, requiring
approximate matching methods to link them. Additional challenges arise when
we need to protect the privacy of the subjects described by the data. There-
fore, Privacy Preserving Record Linkage (PPRL) has emerged as the problem
where data from heterogeneous sources are integrated without revealing to the
participating sources any extra knowledge not relating to their common records.
    To privately link data with accuracy, Privacy Preserving Matching (PPM )
methods focus on elaborately matching data, while maintaining privacy. Though
accurate, matching methods often incur high computational costs. To increase
their scalability, PPM methods are combined with Privacy Preserving Blocking
(PPB ) methods. PPB methods prune out unlikely to match candidate pairs of
records, while also maintaining privacy. Even though they improve efficiency,
PPB methods often compromise matching quality.
    To the best of our knowledge, there is no single solution that successfully
addresses the privacy preserving record linkage problem to its full extent. In
particular, state-of-the-art PPB and PPM methods usually focus on a specific
type of data. For instance, there are PPM approaches appropriate for string
data [11, 16] and others for numerical data [9, 10]. Moreover, there are two party
methods and others that require the deployment of a third party. As a result, each
time that a PPRL solution has to be deployed, the most appropriate algorithms
for the given application scenario have to be selected.
    To this end, we introduce PRIVATEER, a PRIVATE REcord linkage toolkit
designed to assist practitioners in this selection process, and also serve as a com-
parison tool for different methods. PRIVATEER is a modular simulator, con-
stantly under development, implementing a set of state-of-the-art PPRL meth-
ods. Considering the various tasks involved in PPRL, the aim of PRIVATEER is
to accommodate all this functionality into a single toolkit. It consists of clearly
separate modules, each of them performing a separate task, and uses differ-
ent components implementing various privacy preserving matching and blocking
methods, and auxiliary components that are deployed by matching and blocking
methods, such as different distance measures used in approximate matching and
reference set generators, where a reference set is used as an intermediate point of
comparison. Highly composable, PRIVATEER allows all compatible PPB meth-
ods to be combined with compatible PPM ones, and with corresponding auxiliary
methods, so as to form a complete PPRL solution.
    PRIVATEER is also configurable enabling the user to fine tune each of the
deployed methods and provides a variety of performance measures. While the
toolkit is equipped with sample data, it also enables the use of a user’s own data.
Towards extensibility, PRIVATEER’s clear and comprehensive design provides
an API for extending its functionality. Through the Java paradigm the user is
able to add its own modules and algorithms.
    Similar toolkits, such as TAILOR [6], have been developed for the classical
record linkage problem. Related tools for PPRL include data generators [3],
corrupters [2], and tools that combine both functionalities [18]. While these tools
aim at developing test data for PPRL, PRIVATEER aims at testing different
PPRL methods. A first version of our toolkit is presented in [14], but without
offering composability or a graphical user interface.
    The rest of this paper is organized as follows. In Section 2, we present the
architectural design of PRIVATEER and briefly describe the PPM and PPB
algorithms, distance measures, and reference set generation methods currently
implemented in the toolkit. Section 3 provides experimental results derived from
PRIVATEER and we conclude in Section 4.
                      Fig. 1. Architecture of PRIVATEER


2     PRIVATEER’s Architecture
The primary goals of PRIVATEER’s design is composability and extensibility.
We wanted PRIVATEER to easily enable the combination of different methods
to compose a complete PPRL solution, and the incorporation of new ones as
state-of-the-art PPM and PPB methods are constantly developed. To this end,
an object-oriented design is appropriate, thus, we used Java to build the simu-
lator comprising PRIVATEER. Also, Java has the property of Write Once Run
Anywhere, providing machine independence, and there are many out-of-the-box
libraries and data structures one can exploit.
    We have made a particular effort to limit the assumptions when designing
the architectural components of the simulator. In particular, we assume that all
operations are executed sequentially. In many PPB and PPM algorithms, there
are functions that require the parallel operation of the matching parties. PRI-
VATEER simulates this functionality by sequentially executing separate objects.
Also, we only focus on the processing cost and do not examine communication
costs and data propagation delays.

2.1   PRIVATEER’s Modules
We followed a modular design, separating data, execution functionality and im-
plemented methods. This approach improves PRIVATEER’s extensibility, as it
is very easy through its API to add new features and modules. Moreover, since
data is retrieved in a transparent way, new data sources may be added as well.
    Next, we present the architectural elements that comprise PRIVATEER.
The Base Module. The base module holds PRIVATEER’s main functional-
ity organizing the execution of each evaluation experiment. It coordinates the
different modules and enables us to execute each experiment multiple times. As
illustrated in Fig. 1(1), it hosts the configuration module, a PPB and a PPM
module, the results processing and the visualization module.
The Configuration Module. The configuration module is invoked by the base
module and is responsible for setting an experiment’s execution parameters. It
consists of a GUI implemented in JavaFX which receives input from the user and
a mechanism that maps user input to specific parameters. This mapping mech-
anism is important, since each of PRIVATEER’s modules has to be separately
configured using separate parameters provided by the same user interface.
The Nodes. The nodes implement the functionality of the data sources and a
third party, when one is needed. While a data source node has direct access to
its data, the third party neither holds nor has direct access to any data. Instead,
it is fed with data by the source nodes. Nodes are deployed by the PPB or PPM
module and implement their part of the respective algorithm (Fig. 1(2)).
The Data Module. The data module is used to provide direct access to external
data. It is invoked by each data source node. The used data may be stored in
different databases and the configuration module enables the user to select her
own input database, thanks to a transparent connectivity mechanism. Currently,
MySQL and SQLite databases are supported.
The Measure Module. This module implements the distance or similarity
measure to be used by the PPB and later PPM algorithm, in a transparent way.
Thus, it allows us to deploy any compatible measure with any PPB and PPM
method using the GUI. This way, the user has the ability to test the available
algorithms and assess their behavior with different configurations.
The Reference Set Generator Module. Similarly to the measure module,
the reference set generator provides different methods that can be used inter-
changeably with our PPB and PPM methods.
The PPB and PPM Modules. The privacy preserving blocking (PPB) and
the privacy preserving matching (PPM) modules implement the corresponding
type of protocols. Depending on the type of the protocol, as shown in Fig. 1(1),
each module initiates two nodes and, when necessary, a third party node. All
these nodes implement the same algorithm, which is chosen through the GUI and
provided to the corresponding module by the base module. For a PPB or PPM
algorithm to operate, the PPB or PPM module load a measure module, which
implements the measure used by the corresponding algorithm. If it is required
by the implemented algorithm, a reference set generator module is also loaded.
     PPRL protocols may vary regarding their implementation and the number of
participating nodes. Generally, however, they are either based on some type of
data transformation or they may exhibit a multiparty computational behavior. In
both cases, a third party may be required and all options may be implemented in
PRIVATEER. Moreover, to offer the ability of experimenting with a no blocking
setup, a mockup No Blocking method has also been implemented.
     Regarding its operation, the PPB passes on the parameters received from the
configuration module to the respective modules it hosts and executes the selected
PPB algorithm. When the execution concludes, it produces lists of blocks of
record ids which are then used by the PPM module. The PPM module receives
the blocks of record ids from the PPB module and applies the selected PPM
algorithm on each of them. Upon the conclusion of the execution of the PPM
module, a list with matching record ids is produced which is then passed to the
results processing module for processing.
The Results Processing Module. The results processing module processes
the matches list received from the PPM module and calculates the number of
total matches, the number of true and false matches, precision, recall and the
execution times of the PPM and PPB modules. If the input data have the same
unique identifiers (i.e., corrupted versions of the same dataset), the module au-
tomatically calculates the measurements based on the common primary key.
Otherwise, a matching csv file with the matches between different primary keys
is used.
The Visualization Module. The role of the visualization module is to visualize
the results of each experiment by collecting the statistics produced by the results
processing module and produce plots and tables of these statistics.

2.2    Implemented Algorithms and Measures
The modules of the architecture enable the selection of a method for each of
the different tasks in the PPRL process. Thus, the PRIVATEER is equipped
with a pool of implemented state-of-the-art approaches for each such task, i.e.,
PPB and PPM algorithms, distance and similarity measures and reference set
generators (Tables 1-3).
    The PPB algorithms currently implemented in PRIVATEER are illustrated
in the upper part of Table 1. The first column holds the name of the algorithm,
the second a short description, the third shows the type of data each algorithm
is designed for and the forth indicates whether a third party node is required.
Similarly, PPM algorithms are included in the lower part of Table 1.
    Many of the PPRL algorithms that have been suggested, use a distance or
similarity measure [4]. We designed PRIVATEER so as to offer the ability to
use in the PPRL algorithms alternative measures, other than the default ones.
Possible combinations are illustrated in the third column of Table 2. Default
algorithms that use the measure are indicated by bold.
    Reference sets (RS) are used in PPRL as an intermediate point of compari-
son to provide privacy. A reference set may be derived from a publicly available
database or be arbitrarily generated. Table 3 summarizes PRIVATEER’s cur-
rently available methods for RS generation. All methods apart from Arbitrary
String Generation [16] use a user defined publicly available dataset.

3     Execution Results
To illustrate PRIVATEER’s usability, we present sample experiments exploring
alternative combinations of PPB and PPM methods, beyond their default setup,
so as to assess their behavior and performance in different configurations.
    All experiments are conducted using an Intel i3 PC with 8GB of RAM. We
use a sample dataset, which originates from the North Carolina voters database3 .
We assume that the attributes: last name, first name, middle name, city of resi-
dence and precinct description comprise a candidate key, and use them for both
matching and blocking. After deduplicating the dataset based on the selected
candidate key, we generate two databases of 50 000 records each, so that the
25% of their records is common. Finally, we use the “master” table of Lahman’s
baseball database4 , as a source for reference set creation. Since we are interested
3
    Available at ftp://alt.ncsbe.gov/data/
4
    Available at http://www.seanlahman.com
        Name                                Description                            Datatype 3P
                                      PPB Algorithms
                Option when the use of blocking is not desired. Records to be
B1 No Blocking linked are organized in a single block for each of the two data       N/A
                sources.
                SNEF is a meta-blocking algorithm specifically designed to
       SNEF     shift the Sorted Neighborhood approach [8] to the PPRL Strings,
B2                                                                                          ✔
        [13]    paradigm. This algorithm uses reference set clustering for cre- Numbers
                ating blocks.
                A single attribute is utilized for blocking. Each data source cre-
      TPPB      ates a separate reference set from a publicly available database.
B3                                                                                  Strings
        [19]    Then, it merges its data with the reference set and lexicograph-
                ically sorts the result.
                Introduces hash signature transformations to securely repre-
      Simple    sent TF/IDF weight vectors. Blocks consist of records which
B4   Blocking   share common strings called tokens, with a token being an Strings ✔
         [1]    intact word. The Jaccard metric is used for comparing hash
                signatures.
     Phonetic   Uses Soundex’s inherent privacy characteristics, which are
     Blocking   based on the common information suppression property of all
B5                                                                                  Strings ✔
                phonetic encoding algorithms. Records are organized in blocks
        [12]
                based on their phonetic encoding similarity.
                                      PPM Algorithms
      Bloom     Bigrams of strings are inserted into a Bloom Filter. Then, a
M1   Bigrams    bitwise comparison is performed on Bloom Filters using the Strings ✔
        [17]    Dice Coefficient.
    Embedding Based on the edit distances between actual and reference set
M2    Spaces    data, a transformation of the data, called embedding, is derived Strings ✔
        [16]    and used for matching.
                Secures the use of edit distance through the use of Bloom fil-
    Secure Edit
     Distance   ters. Each character of a matching field is extracted and after
M3                                                                                  Strings
                appending its position in the string, it is inserted in a Bloom
        [12]
                filter which is encrypted using a secure hash function.
   Homomorphic Homomorphic encryption manages to calculate differences se-
M4 Encryption curely based on the properties of specific cryptographic algo- Numbers
       [9, 10]  rithms, such as RSA.
     Soundex
     Matching   Transforms attributes into Soundex codes and exploit its in-
M5                                                                                  Strings ✔
                herent information suppression properties to provide privacy.
        [11]
                Calculates the TF/IDF weights of all words, called tokens,
       Hash     within all fields. Then, a hash function is used to create arrays
M6  Signatures  of these weights, one for each record. This method is not suit- Strings ✔
         [1]    able for approximate matching, since the entire token (word)
                is encoded.
                    Table 1. Supported PPB and PPM algorithms.

    Measure                                 Description                            Compatibility
                                        Distance measures
                   Calculates the minimum number of operations (character
  Levenshtein      insertion, deletion and substitution) required to transform     B2, B3, M2
                   one string to another.
   Damerau -       Alternative to Levenshtein distance that also takes into ac-
                                                                                    B2, B3, M2
  Levenshtein      count the transposition of two adjacent characters.
                                       Similarity measures
                   Focused on short strings such as names. Produces a nor-
 Jaro - Winkler malized score between 0, when there is no similarity, and 1,        B2, B3, M2
                   where there is an exact match between two strings.
                   Used to measure set similarity. In the context of PRIVA-
    Jaccard
                   TEER, it is used with bit vector representations of strings,      B4, M1
   coefficient
                   as in [1].
                   Also used with bit vector representations of strings. Used as
Dice’s coefficient                                                                   B4, M1
                   a default similarity measure in [17].
Q-grams Dice’s Dice’s coefficient for plain text comparisons in combination
                                                                                    B2, B3, M2
   coefficient     with q-grams, as in [7].
                              Table 2. Supported measures.

    Method                                Description                              Compatibility
                  Tries to produce the most representative reference set from
 NN-Clustering                                                                      M2, B2, B3
                  the entire data corpus.
  K-Medoids       Elects medoids from the most representative reference set
                                                                                   M2, B2, B3
  Clustering      from the entire data corpus.
Arbitrary String
                 Generates random strings                                          M2, B2, B3
  Generation
                 Generates a reference set of specific size. Given an array of
  Durstenfeld
                 elements the first n of them are randomly exchanged with           M2, B2, B3
    Shuffle
                 others in the array and selected out of a data corpus.
                            Table 3. Reference set generators.
                                                               PRIVATEER           7


Fig. 2. Results illustrated by PRIVATEER’s visualization module: (left) PPM, (center)
PPB and (right) RS generation.


in linking low quality data, we corrupt one of the two databases using the cor-
rupter described in [2]. The corrupted records contain one error per attribute so
that a join with the original ones returns no matches. With equal probability,
an error might either be a character insertion, deletion or substitution.
    Figure 2(left) illustrates how PRIVATEER allows us to test a blocking method
with different matching methods. In particular, we combine the TPPB method
[19], with three PPM methods: Bloom Bigrams [17], originally used in [19] as
well, Hash Signatures [1], and Phonetics Matching [11]. We compare the three
combinations in terms of precision and recall. As we observe in Fig. 2(left), Hash
Signatures have zero recall, since this method is not designed for approximate
matching. On the other hand, the Bloom Bigrams’ high performance in terms
of recall justifies its overall popularity [10].
    In the second experiment, we combine a matching method with different
blocking methods. In particular, Bloom Bigrams is combined with Simple Block-
ing [1], SNEF [13] and TPPB [19]. As for blocking, efficiency is important, we
measure time to determine the most efficient blocking method for Bloom Bigrams
matching. Figure 2(center) illustrates the time required for PPB, for PPM and
for the overall PPRL process. We observe how the matching time varies depend-
ing on the blocking method used. While Simple Blocking is the fastest blocking
approach, it does not manage to significantly reduce the matching time. On the
other hand, while SNEF is slower while blocking, it achieves the best overall
time as it significantly reduces matching time. Finally, TPPB has the highest
blocking time, but achieves the best matching time. It clearly creates the optimal
blocks, but unfortunately the time blocking requires is greater than the overall
time for SNEF.
    Next, we evaluate three reference set generation methods with SNEF[13]
and use Bloom Bigrams for matching [17]. The first one is the arbitrary string
generation method [16], the second is a modification of the Durstenfeld shuffle
algorithm [5], while the third uses the K-medoids algorithm [15], which was
originally used in [13]. Figure 2(right) shows that with regards to precision and
recall, the Durstenfeld Shuffle, and not the K-medoids used in the default setup
of SNEF, is the best choice.
4    Conclusions
We have presented PRIVATEER, a modular, highly configurable privacy pre-
serving record linkage toolkit that enables combining and comparing different
PPRL solutions. We showed the usability of our toolkit by including evaluation
results derived from experiments run by the PRIVATEER on sample data.

Acknowledgment. Attendance to this conference was partially supported by
the Hellenic Artificial Intelligence Society (EETN).
References
 1. Al-Lawati, A., Lee, D., McDaniel, P.: Blocking - aware private record linkage. In:
    IQIS (2005)
 2. Bachteler, T., Reiher, J.: A test data generator for evaluating record linkage meth-
    ods. Tech. rep., German RLC Work. Paper No. wp-grlc-2012-01 (2012)
 3. Christen, P.: Febrl-: an open source data cleaning, deduplication and record linkage
    system with a graphical user interface. In: ACM SIGKDD (2008)
 4. Christen, P.: Data Matching. Data-Centric Systems and Applications, Springer
    (2012)
 5. Durstenfeld, R.: Algorithm 235: Random permutation. Communications of the
    ACM 7(7), 420– (Jul 1964)
 6. Elfeky, M.G., Verykios, V.S., Elmagarmid, A.K.: Tailor: a record linkage toolbox.
    In: IEEE ICDE (2002)
 7. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S.,
    Pietarinen, L., Srivastava, D.: Using q-grams in a dbms for approximate string
    processing. IEEE Data Engineering Bulletin 24(4), 28–34 (2001)
 8. Hernández, M.A., Stolfo, S.J.: Real-world data is dirty: Data cleansing and the
    merge/purge problem. Data Mining & Knowledge Discovery 2(1), 9–37 (1998)
 9. Inan, A., Kantarcioglu, M., Bertino, E., Scannapieco, M.: A hybrid approach to
    private record linkage. In: IEEE ICDE (2008)
10. Inan, A., Kantarcioglu, M., Ghinita, G., Bertino, E.: Private record matching using
    differential privacy. In: ACM EDBT (2010)
11. Karakasidis, A., Verykios, V.S.: Privacy preserving record linkage using phonetic
    codes. In: BCI (2009)
12. Karakasidis, A., Verykios, V.S.: Secure blocking + secure matching = secure record
    linkage. JCSE 5(3), 223–235 (2011)
13. Karakasidis, A., Verykios, V.S.: A sorted neighborhood approach to multidimen-
    sional privacy preserving blocking. In: IEEE ICDMW (2012)
14. Karakasidis, A., Verykios, V.S.: A simulator for privacy preserving record linkage.
    In: EANN, Part II (2013)
15. Kaufman, L., Rousseeuw, P.: Clustering by Means of Medoids. Reports of the
    Faculty of Mathematics and Informatics. Delft Univ. of Technology (1987)
16. Scannapieco, M., Figotin, I., Bertino, E., Elmagarmid, A.K.: Privacy preserving
    schema and data matching. In: ACM SIGMOD (2007)
17. Schnell, R., Bachteler, T., Reiher, J.: Privacy preserving record linkage using bloom
    filters. BMC Medical Informatics & Decision Making 9(1), 41+ (2009)
18. Tran, K., Vatsalan, D., Christen, P.: Geco: an online personal data generator and
    corruptor. In: ACM CIKM (2013)
19. Vatsalan, D., Christen, P., Verykios, V.S.: Efficient two-party private blocking
    based on sorted nearest neighborhood clustering. In: ACM CIKM (2013)