<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Fuzzy Vaults for Privacy Preserving Record Linkage</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xhino Mullaymeri</string-name>
          <email>dai16019@uom.edu.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandros Karakasidis</string-name>
          <email>a.karakasidis@uom.edu.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Applied Informatics, University of Macedonia</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <kwd-group>
        <kwd>Fuzzy Vault</kwd>
        <kwd>Privacy Preserving Record Linkage</kwd>
        <kwd>Approximate String Matching</kwd>
      </kwd-group>
      <abstract>
        <p>Approximate string matching has a variety of applications, one of them being record linkage, where records from different sources, without common unique identifiers, are matched. When these records refer to individuals, maintaining their privacy becomes a paramount requirement, leading us to develop privacy preserving record linkage methods. In this paper, we present a solution to the privacy preserving record linkage problem which is based on Fuzzy Vaults and has the advantage of being two-party based. A Fuzzy Vault is a cryptographic structure based on noise. Fuzzy Vaults have previously been used mainly in biometric applications. In this work, we explain our methodology in depth, providing detailed examples; we support our argument for privacy preservation with a solid discussion; and we provide extensive experimental results accompanied by a rigorous analysis, not only to indicate the stability and robustness of our method, but also to demonstrate that our setup achieves superior results when compared to a baseline three-party-based method.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>In the digital era we are living in, the number of agencies and
corporations which store personal data has dramatically increased.
A common need for these organizations is to exchange information
about the individuals described by these data and to perform analysis
on them. As a data integration step, this would traditionally boil
down to a database join, in the case that common unique
identifiers, such as a social id or a driving license number, are shared.
However, since this is not always the case, the next step is to
exploit fields that, when combined, may uniquely identify a record.
What we have just described is commonly known in the literature
as the Record Linkage Problem. Solving this problem, however,
does not take into consideration that the privacy of the individuals
described in these databases should be maintained. This
constitutes the Privacy Preserving Record Linkage problem, where
two parties aim to find common records without revealing any
additional information to each other, apart from the matching
ones. However, data are usually dirty, and even common fields
between the two parties’ databases might diverge, quite often
due to human error (e.g., typing errors during data entry). Since
such identifiers are quite often string values, such as name, surname
or address, one approach to overcome this complication is to use
approximate string matching methods which should also
consider privacy preservation.</p>
      <p>To indicate the importance of the Privacy Preserving Record
Linkage task, let us consider the COVID-19 pandemic we are
experiencing and the need for tracking the contacts of confirmed
cases, which undoubtedly constitutes a matter of great concern.
Suppose that a verified case of the disease is an airline passenger.
To ensure public health, all of the patient’s close contacts should be
tracked and informed that they might have been infected.
By applying Privacy Preserving Record Linkage on databases from health
agencies and airlines, authorities may locate and inform people
who might have been exposed to the virus, without
compromising privacy. In the same context, surveillance of quarantined
cases can also benefit. When someone is placed under house
quarantine, his identifying data may be added to a public
governmental database so as to be accessed by every agency which
offers public services (accommodation, transport, entertainment).
As these places are usually overcrowded, they become sites of
possible outbreaks. Before providing any services, such providers
should be able to check if their customers’ names are on the
quarantined cases list, protecting, in any case, the privacy of the
individuals at all sites.</p>
      <p>
        In this paper, we present in detail the operation and the
under-the-hood implementation details of the two-party method for
performing privacy preserving record linkage using Fuzzy Vaults
briefly sketched in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. A Fuzzy Vault scheme [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a
cryptographic structure where a secret is “locked” with a key, and
in order to “unlock” it someone must use a key which is similar
enough to the locking key. This “unlocking” phase of a Fuzzy
Vault is approximate by its nature, so it suits very well our goal
of approximate string matching. Originally, it was used
in biometric authentication algorithms, and such
implementations appear in the literature [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Despite the work
on biometric authentication, to the best of our knowledge, there
have been no works utilizing Fuzzy Vaults for performing
privacy preserving approximate string matching, focused on, but
not limited to, Privacy Preserving Record Linkage operations.
      </p>
      <p>As such, our contributions may be summarized as follows. First
of all, we illustrate, for the first time, all the details of our novel
methodology using Fuzzy Vaults for private approximate string
matching, so that it can be utilized in a privacy preserving record
linkage context. This methodology consists of a carefully designed
workflow employing reference sets, non-bijective mapping
functions and noise addition. We provide detailed examples in order
to allow in-depth comprehension of our method’s details, and
we support our argument of privacy preservation with a solid
discussion. In terms of empirical evaluation, we provide extensive
experimentation, not only to exhibit our method’s behavior,
but also to prove its superiority over a baseline three-party-based
method in terms of overall performance, a fact with additional
value considering that our fuzzy vault approach does not require
a third party.</p>
      <p>The rest of this paper is organized as follows. Section 2 presents
works related to ours. In Section 3, there is the background for
introducing the reader to our method, which is detailed in Section
4. Section 5 contains a discussion over the privacy properties of
our method. In Section 6 we provide the empirical analysis of
our method in order to demonstrate its robustness and stability.
Finally, in Section 7, we provide our conclusions and thoughts
for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Private approximate string matching techniques have been widely
developed in the context of privacy preserving record linkage
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], being mostly based either on Secure Multiparty
Computation (SMC), which usually incurs additional communication costs,
or on data perturbation, which often appears to be quite vulnerable
to attacks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In order to overcome some of these drawbacks and render SMC-based
techniques applicable to real-world-sized datasets,
hybrid methods have emerged. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] a hybrid method which
combines differential privacy and homomorphic encryption is
presented. Here, instead of sanitizing data, differential privacy is
applied to ensure data privacy. Afterwards, data partitioning into
blocks takes place. Finally, an SMC step based on homomorphic
encryption occurs and returns the matching pairs. This method
partially alleviates the bottleneck caused by the limited size
of data that SMC can endure.
      </p>
      <p>
        Another hybrid method, based upon blocking, the
utilization of differential privacy, and an SMC (homomorphic)
phase, is described in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. However, in this case the two parties
compare their data with the help of a third one. Moreover, even
recent techniques addressing this problem employ a third party
[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], an approach that may be rendered impractical by requiring the
availability of such a third party.
      </p>
      <p>
        In the last few years, there has been a particular effort towards
developing two-party based techniques [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. One of the methods
used to achieve this functionality is homomorphic encryption [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
which, however, is computationally expensive [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and also
liable to suffer certain attacks [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Another recent technique is
based on SPDZ [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] which, however, may also be expensive in
terms of communication and computation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Finally, Garbled
Circuits can offer a solution [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], but further evaluation is needed
in terms of execution time, size and reusability [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>
        To this end, we present a solution to the two-party
privacy preserving record linkage problem using Fuzzy Vaults. A
Fuzzy Vault scheme may be used for privacy preserving
matching, personal entropy systems, biometrics [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] or, as
indicated by the recent literature [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], for signature verification.
Our method has been briefly presented in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. In this paper,
we move one step further by providing a simple protocol for
two-party privacy-preserving record linkage using Fuzzy Vaults,
detailed examples, an in-depth privacy discussion exhibiting the
privacy characteristics of our method and, last but not least, an
extended experimental evaluation, also including a comparison
with a baseline method.
      </p>
    </sec>
    <sec id="sec-3">
      <title>PRELIMINARIES</title>
      <p>In this section, we define the problem we are solving and provide
the required background for laying out our method. The notation
used throughout this paper is summarized in Table 1.
</p>
    </sec>
    <sec id="sec-4">
      <title>Problem Formulation</title>
      <p>Let us consider two data sources, called Alice and Bob.
Privacy preserving record linkage is the problem of identifying
(linking) all pairs of their records that refer to the same real world
entity, so that no more information is disclosed to Alice, Bob, or
any third party involved in the process besides the identifiers of
the linked records.</p>
      <p>Alice and Bob will most probably use different schemas in
their databases. As such, they will have different attributes. Let
us assume that, between Alice’s schema and Bob’s schema, a
number of attributes are common to the two sources, forming a
composite key. These attributes might
be names, surnames, addresses and so on. As such, none of these
on its own may comprise a unique identifier that can be used to
identify a record. We refer to these attributes as matching fields.
This composite key is used to determine when two records match,
i.e., when they refer to the same entity. To determine whether two
records match, the respective attributes forming the composite
key need to be compared. Considering that our data is often dirty,
matching should rely on a similarity or distance function.</p>
      <p>Let us consider a similarity function F(·) → [0..1] and a
threshold t &gt; 0. Given the composite keys k_i and k_j for records r_i
and r_j, respectively, we define the following matching function
M → {0, 1}:</p>
      <p>M(k_i, k_j) = 1 if F(k_i, k_j) ≥ t, and 0 otherwise. (1)</p>
      <p>If M(k_i, k_j) = 1, then the pair (r_i, r_j) is a match.</p>
      <p>This process is the matching process. To preserve privacy, i.e.,
ensure privacy preserving matching (PPM), after the completion
of this process, the only information revealed is the identifiers
of the matched records. In our case, where we use Fuzzy Vaults
for matching, Formula 1 will have as input Bob’s key,
corresponding to Bob’s record, and a Fuzzy Vault for Alice’s record, and it
will examine their absolute matching status using the method we
describe in detail in Section 4, returning 1 when Bob successfully
unlocks Alice’s Fuzzy Vault with his key and 0 otherwise. As such,
the matching threshold is equal to 1.</p>
    </sec>
    <sec id="sec-5">
      <title>Lagrange Interpolation</title>
      <p>
        Lagrange interpolation [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] is the most common interpolation
method used in the fuzzy vault literature. Considering a
two-dimensional Euclidean space, and given a data set of n + 1 points
(x_0, y_0), (x_1, y_1), . . . , (x_n, y_n), with x_i ≠ x_j ∀ i, j ∈ {0, 1, . . . , n}, i ≠ j,
the approximation function using Lagrange interpolation is a
polynomial of degree n, defined as shown in Equation 2:
P(x) = Σ_{i=0}^{n} y_i · l_i(x) (2)
      </p>
      <p>This method is called Lagrange interpolation because it uses
Lagrange basis polynomials in order to interpolate the data set,
as illustrated in Equation 3:
l_i(x) = Π_{j=0, j≠i}^{n} (x − x_j) / (x_i − x_j) (3)</p>
      <p>By noticing the denominator of Equation 3, it is evident why
x_i ≠ x_j ∀ i, j should hold for the formula’s abscissas.</p>
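As a minimal sketch, Equations 2 and 3 can be implemented with exact rational arithmetic; the function name and coefficient representation below are ours, not the paper's:

```python
from fractions import Fraction

def lagrange_interpolate(points):
    """Recover, exactly, the coefficients (lowest degree first) of the unique
    polynomial of degree n passing through n + 1 points, using the Lagrange
    basis polynomials of Equation 3."""
    coeffs = [Fraction(0)] * len(points)
    for i, (xi, yi) in enumerate(points):
        # Build l_i(x) = prod_{j != i} (x - x_j) / (x_i - x_j), Equation 3.
        basis, denom = [Fraction(1)], Fraction(1)
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            # Multiply the basis polynomial by (x - x_j).
            new = [Fraction(0)] * (len(basis) + 1)
            for k, c in enumerate(basis):
                new[k + 1] += c       # contribution of x * c x^k
                new[k] -= xj * c      # contribution of -x_j * c x^k
            basis = new
            denom *= xi - xj
        # Accumulate y_i * l_i(x) into P(x), Equation 2.
        for k, c in enumerate(basis):
            coeffs[k] += yi * c / denom
    return coeffs
```

Exact rationals avoid the floating-point error that would otherwise make recovered coefficients fail an equality (or hash) check.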
    </sec>
    <sec id="sec-6">
      <title>Fuzzy Vault</title>
      <p>
        A Fuzzy Vault is a cryptographic construct used for secret sharing,
introduced in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and, in brief, it works as follows. Alice wants
to protect a secret by locking it up with a key, which is a set
of items. This is the Lock phase. Bob may have access to
this secret only if he uses a key, consisting of a set of items, which is
substantially similar to Alice’s key. This is the Unlock phase. To
assure privacy, noise is also injected during the Lock phase.
      </p>
      <p>Next, we will further explain the details of this process and
employ a running example to facilitate the reader. The original Fuzzy
Vault construct was designed for numerical values, which we
will use to illustrate its operation. However, as our work focuses
on strings, we will lay out in later sections our methodology for
achieving suitable transformations in order to adapt this concept
for alphanumeric data.</p>
      <p>3.3.1 Lock Phase. To illustrate the operation of a Fuzzy Vault,
we will commence with the Lock phase and consider, as stated
earlier, that Alice wants to lock her secret using her key, K.
For this purpose, she uses a polynomial P of degree n so that
P’s coefficients are, in fact, mappings of her secret. Then, she
uses the members of her key K as x-coordinates for P, creating
a collection of points R. As K is a set, its elements are unique,
thus the x-coordinates of R’s points are going to be unique
as well.</p>
      <p>
        Let us now denote by |K| the cardinality of K. Equation
4 illustrates the restriction that has to be satisfied with regard
to the size of the polynomial used. This is due to Lagrange’s
interpolation method, which requires n + 1 data points in order
to return a polynomial of degree at most n [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The key set K
holds the abscissas of the data points that Lagrange will later
interpolate in the unlock phase.
      </p>
      <p>|K| ≥ n + 1 (4)</p>
      <p>Now, Alice will also create noise in order to protect the Fuzzy
Vault’s contents, thus offering privacy. To do so, she introduces a
distraction set of points, designated by C and referred to as Chaff points.
These points do not belong to the polynomial’s curve. Their
purpose is to confuse an attacker by preventing her from identifying
the points in V that have resulted from the Lock phase. As such,
collection C may be considered as the protection of the vault.
The union of collections R and C comprises the vault V, or more
formally: V = R ∪ C. In fact, K is considered to be the key
of the vault V, since it identifies the collection of real points R.
As chaff points aim to confuse an attacker, the cardinality of C is
of a greater order of magnitude than the cardinality of set R. We
define this set of chaff points as illustrated in Equation 5:
C = {(x′, y′) : x′ ∉ K, y′ ≠ P(x′)}, with |C| &gt;&gt; |R| (5)
At this point, we will make this process clearer through
Example 3.1, with its illustration in Figure 1.</p>
      <p>Example 3.1. Let us assume that the secret which Alice wants
to lock is 2317 and that she holds a key, which is the
set K = {1, 2, 3, 4, 5, 6}. She creates a polynomial P of degree
n = 3 which has the secret embedded into its coefficients: P(x) =
2x^3 + 3x^2 + 1x + 7. The next step involves the
projection of her key on the polynomial P.</p>
      <sec id="sec-6-1">
        <title>Table 1: Notation</title>
        <p>n: Degree of polynomial;
K: Key;
f: Size of fragment;
P: Polynomial;
r: Plaintext record;
V: Fuzzy Vault;
RS: Reference Set;
C: Chaff Points;
R: Original Points;
H: Verification Hash;
Superscripted A (e.g., K^A): Alice’s;
Superscripted B (e.g., K^B): Bob’s;
|·|: Size (e.g., |RS|: Reference Set size).</p>
      </sec>
      <sec id="sec-6-2">
        <p>Alice’s key
is the set K = {1, 2, 3, 4, 5, 6}. By projecting each of these points,
Alice ends up with a collection R of points in a two-dimensional
space, R = {(x, P(x))}, ∀x ∈ K. For our example, this results in:
R = {(1, 13), (2, 37), (3, 91), (4, 187), (5, 337), (6, 553)}. In our example,
R’s cardinality is |R| = 6.</p>
        <p>Next, Alice creates the distraction set C, comprising random
chaff points which do not lie on the polynomial. Finally, Alice
combines and shuffles the two collections in order to create the
vault V, which contains both genuine and chaff points, which
are indistinguishable. As stated before, the size of C should
significantly exceed that of R. However, for presentation purposes
and without harm to the general case, we will only consider four
points in the chaff set, with abscissas {7, 8, 9, 10}. For our toy example, this
will yield: V = {(1, 13), (2, 37), (3, 91), (4, 187), (5, 337), (6, 553), . . . ,
(9, 560), (10, 700)}.</p>
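The Lock phase of Example 3.1 can be sketched as follows; the chaff_factor and field parameters are illustrative assumptions, not values from the paper:

```python
import random

def lock(secret_coeffs, key, chaff_factor=10, field=10**6):
    """Sketch of the Lock phase: project the key onto the secret polynomial
    and hide the genuine points R among random chaff points C (Equation 5)."""
    def P(x):
        # Evaluate the polynomial; coefficients are lowest degree first.
        return sum(c * x ** i for i, c in enumerate(secret_coeffs))

    genuine = [(x, P(x)) for x in key]                  # collection R
    chaff = []
    while len(chaff) != chaff_factor * len(genuine):    # |C| >> |R|
        x, y = random.randrange(field), random.randrange(field)
        if x not in key and y != P(x):                  # must not lie on P
            chaff.append((x, y))
    vault = genuine + chaff                             # V = R ∪ C
    random.shuffle(vault)                               # genuine and chaff indistinguishable
    return vault
```

Shuffling matters: without it, the first |R| positions of the vault would reveal exactly the genuine points.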
        <p>3.3.2 Unlock phase. Having seen how Alice locks a secret,
let us now describe the Unlock phase, where Bob will attempt to
unlock the Fuzzy Vault V he received from Alice and get access
to Alice’s secret. For this purpose, Bob will have to use a key
set of his own in order to unlock V. This will only happen in the case
that his key substantially overlaps with Alice’s key, used
to create V. If so, Bob may distinguish enough genuine points of
the vault V belonging to the collection R.</p>
        <p>
          However, Bob may also identify some distraction points
belonging to the chaff set C too. Bob’s ability to eventually
reconstruct the original polynomial depends on the cardinality of Bob’s
key. If the key’s cardinality is smaller than the polynomial’s
degree, then interpolation methods will not be able to recover the
proper polynomial [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Therefore, Equation 4 has to be satisfied
for Bob’s key as well. Unlocking the vault requires, first, that
Bob finds all the vault’s points that have the same abscissas as
any of his key’s elements, or more formally:
{(x, y) ∈ V : x belongs to Bob’s key} (6)
        </p>
        <p>Distraction points might exist because Bob’s key might also overlap
with the collection of random chaff points C. If the two keys do not
significantly overlap, then the polynomial reconstruction cannot
be performed. Example 3.2 illustrates this process.</p>
        <p>
          Example 3.2. Let us consider, in our running example, that Bob,
with his key set {1, 2, 4, 5, 8, 9}, tries to unlock V. The
collection of candidate points for Bob will be: {(1, 13), (2, 37), (4, 187), (5, 337),
(8, 360)}. Bob has to reconstruct P in order to access the vault’s
contents. For our case, let us consider that this occurs through
Lagrange interpolation [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. We note here that part of this collection for
Bob’s key is the point (8, 360), which comprises a noise element
of V and as such does not belong to the original polynomial.
Consequently, each of the combinations that include the noise
element will return a false polynomial. Note here that the
cardinality of Bob’s key equals that of Alice’s key, namely 6. In this case, we have chosen
both Alice and Bob to have keys of the same size.
        </p>
        <p>Nevertheless, Bob may have a key with
a different size than Alice’s, given that it has at least one element
more than the degree of the polynomial used. Since Lagrange
interpolation requires n + 1 genuine points in order to reconstruct
the polynomial which corresponds to them, if Bob’s key has fewer
than n + 1 elements, Bob will fail to unlock the vault. On the other
hand, if it has more than n + 1 elements, Bob has to create all
possible (n + 1)-sized combinations of his candidate points and use them
in order to interpolate the polynomial. Now, we should take into
consideration that each one of the possible combinations will
return a polynomial, but Bob somehow has to decide which one
is the correct one (there is no need to calculate all of them).</p>
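A minimal sketch of this combination-based unlocking follows, assuming an MD5 verification hash of the concatenated coefficients (as described in Section 4.2.2); the function names are ours:

```python
import hashlib
from fractions import Fraction
from itertools import combinations

def _interpolate(points):
    # Minimal exact Lagrange interpolation (coefficients lowest degree first).
    coeffs = [Fraction(0)] * len(points)
    for i, (xi, yi) in enumerate(points):
        basis, denom = [Fraction(1)], Fraction(1)
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            new = [Fraction(0)] * (len(basis) + 1)   # multiply basis by (x - x_j)
            for k, c in enumerate(basis):
                new[k + 1] += c
                new[k] -= xj * c
            basis = new
            denom *= xi - xj
        for k, c in enumerate(basis):
            coeffs[k] += yi * c / denom
    return coeffs

def unlock(vault, key, degree, expected_hash):
    """Sketch of the Unlock phase: keep the vault points whose abscissas occur
    in Bob's key (Equation 6), then try (degree + 1)-point combinations until
    a recovered polynomial's verification hash matches."""
    candidates = [(x, y) for (x, y) in vault if x in key]
    for combo in combinations(candidates, degree + 1):
        if len({x for x, _ in combo}) != degree + 1:   # abscissas must differ
            continue
        coeffs = _interpolate(list(combo))
        digest = hashlib.md5("".join(str(c) for c in coeffs).encode()).hexdigest()
        if digest == expected_hash:
            return coeffs          # vault unlocked: the secret polynomial
    return None                    # keys not similar enough
```

Only combinations consisting entirely of genuine points reproduce the original polynomial; any combination containing a chaff point yields a different polynomial, whose hash fails the check.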
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Reference Set</title>
      <p>
        A Reference Set is a corpus of data used as a means of
preprocessing when matching sensitive information. This corpus may
either be publicly available or pre-agreed among the matching
parties, Alice and Bob in our case. It works as follows.
Considering that Alice and Bob wish to match their data without revealing
any information to each other, they use some mapping function,
also pre-agreed and the same for both, which relates
each of the values in their dataset to one or more values in the
Reference Set. Then, instead of comparing their data directly,
they are able to compare the results of the mapping function.
Such a construct has been used in the privacy preserving record
linkage context either for matching [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] or for blocking [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Example 3.3 illustrates how a Reference Set may be used.
      </p>
      <p>
        Example 3.3. In this example we are using Levenshtein
distance [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], indicated by d(·), as the mapping function. We still
consider Alice and Bob, with Alice holding the word ‘error’, while
Bob holds the word ‘erasure’, and they wish to privately match
them. For the sake of presentation, the Reference Set contains the
single word ‘energy’. In this case, for Alice, d(error, energy) = 4,
while for Bob, d(erasure, energy) = 6. Since 4 ≠ 6, we may
easily deduce that Alice and Bob hold different values.
      </p>
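A standard dynamic-programming implementation of the Levenshtein distance used in this example might look as follows:

```python
def levenshtein(a, b):
    """Classic edit distance, kept to two rows of the DP matrix."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```

It reproduces the distances of Example 3.3: 4 for Alice and 6 for Bob against the reference word ‘energy’.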
    </sec>
    <sec id="sec-8">
      <title>METHOD DESCRIPTION</title>
      <p>Having examined the operation of the Fuzzy Vault for numerical
data, let us now describe our approach for exploiting this
mechanism for approximate private string matching, which will allow
us to perform privacy preserving record linkage. We consider two
honest-but-curious matching parties, Alice and Bob. This means
that neither of the two participants exhibits malicious
behavior attacking the operation of the protocol, but each tries
to infer as much information as possible without deviating from
the protocol.</p>
    </sec>
    <sec id="sec-9">
      <title>Protocol Description</title>
      <p>Let us begin by providing a simple protocol for using Fuzzy
Vaults for privacy preserving record linkage. For this purpose,
a separate Fuzzy Vault, and as such a separate polynomial, has
to be created and secretly shared for each record in a database.
We will consider, for our case, that alphanumeric identifiers are
used. As our focus is on privately matching records, we assume
that either Bob and Alice share a common schema, or they
have different schemas but corresponding fields on the use of
which they have agreed beforehand, or they use some
privacy-preserving schema matching method.</p>
      <p>The protocol we are describing is asymmetric. This means
that the two parties do not perform exactly the same steps, but they
cooperate in order to achieve their common goal of privately
linking their records. These steps are illustrated in Algorithm 1.
First, Alice converts all of the records in her database to Fuzzy
Vaults, resulting in one Fuzzy Vault for each record. Then, she
transmits these Fuzzy Vaults to Bob through a channel considered
secure. Bob, in his turn, creates his Fuzzy Vaults and, using the
resulting keys, attempts to unlock the received vaults, resulting
in a list of matches. When the process concludes, he sends the ids
of the matching Fuzzy Vaults back to Alice. The procedure
concludes with both Alice and Bob exchanging their actual matching
records.</p>
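These steps can be sketched as follows; the lock and try_unlock callables and the dictionary-based record ids are illustrative assumptions, not the paper's exact interfaces:

```python
def link_records(alice_records, bob_records, lock, try_unlock):
    """Hedged sketch of the asymmetric protocol: lock and try_unlock stand in
    for the vault routines described in Section 4."""
    # Alice: one Fuzzy Vault per record, transmitted over a secure channel.
    vaults = {a_id: lock(rec) for a_id, rec in alice_records.items()}
    # Bob: for each of his records, attempt to unlock every received vault.
    matches = [(a_id, b_id)
               for a_id, vault in vaults.items()
               for b_id, rec in bob_records.items()
               if try_unlock(vault, rec)]
    # Bob returns the matching vault ids to Alice; the parties then
    # exchange the actual matching records.
    return matches
```

Only the vaults (not plaintext records) cross the channel; the plaintext exchange at the end is restricted to the records already identified as matches.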
      <p>At this point, the question that arises is how these
alphanumeric fields are going to be transformed into polynomials and,
eventually, into Fuzzy Vaults. This is the process we are going to
describe next.</p>
    </sec>
    <sec id="sec-10">
      <title>String Encoding for Locking</title>
      <p>Beginning with the locking phase, Alice will secure strings inside
Fuzzy Vaults. Each string is the concatenation of all the string
fields in the record that will be used for matching. In order for
such a string to be locked, and to be able to be unlocked afterward
by Bob with a fairly similar string, it should be somehow associated
with the polynomial used to lock it. As such, our method
irreversibly embeds the string into the polynomial’s coefficients. We cannot
use a direct representation of the string, since this would
risk exposing the key as the abscissas of the genuine points
of the vault, allowing an adversary obtaining the key to directly
recover the string. Additionally, as we will describe later on, Bob
will use this property after unlocking the Fuzzy Vaults, in order
to verify whether the match he achieved with Alice’s secret is identical
or approximate.</p>
      <p>To be able to adapt the Fuzzy Vault scheme so as to exhibit
privacy preservation properties for record linkage, a series of
steps is necessary.</p>
      <sec id="sec-10-1">
        <title>Algorithm 2: Mapping a Record to a Polynomial.</title>
        <p>The broad picture of these steps is illustrated
in Figure 2. First of all, a record, which plays the role of the string
to be secured, has a double role in the whole process, as it is
used both for creating a polynomial and for building a key for
encoding this record. This polynomial and the key are then used
to create the set of Genuine Points, which, in combination with
the generated chaff points, will result in a Fuzzy Vault. At this
point, we have to mention that the polynomial representation of
the record is also used to produce the verification data, also used
during the unlocking phase.</p>
        <p>4.2.1 Polynomial generation. Let us begin by examining
the process that generates the aforementioned polynomial. The
mapping algorithm we propose is illustrated in Algorithm 2. First,
the alphanumeric fields of the record to be locked are
concatenated so as to form a single string (line 2). Then, this string is
fragmented into substrings of size equal to f (line 3), with f
being a user-defined parameter, an approach we adopted for
security purposes. This is particularly useful for long strings. Each
character of each fragment is converted to its ASCII code
equivalent. The concatenation of these codes creates a number, which
will form a polynomial coefficient (lines 4-11). The remaining
coefficients of the polynomial are set to 1. Example 4.1 further
clarifies the operation of this algorithm. On the other hand, if
the string exceeds the length of 27 characters, either a higher
degree polynomial should be created or some characters should
be skipped.</p>
        <p>Example 4.1. Let us assume, as illustrated in Figure 3, that f =
3, meaning that each fragment will consist of three characters,
and that we wish to encode the name JOE_DOE. Let us also
assume that the degree of the polynomial we wish to use is n = 8.
This will yield a polynomial of the form: P(x) = a_8 · x^8 + a_7 · x^7 + . . . +
a_2 · x^2 + a_1 · x^1 + a_0. Given our settings for f and n, this polynomial
has a capacity of embedding 9 · 3 = 27 characters, because it
has 9 coefficients. Moreover, there are 26^3 = 17576 possible
combinations for each coefficient, because each coefficient can
hold up to three letters and there are 26 available letters (we take
into consideration only capital letters), and 26^(3·9) total possible
polynomials. In our example, the encoded string has fewer than 27
characters. In this case, the remaining coefficients are all set to 1.</p>
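The fragmentation and ASCII encoding of Algorithm 2 can be sketched as follows; this is a simplified reading of lines 2-11, with the function name and error handling being ours:

```python
def record_to_coeffs(fields, f=3, n=8):
    """Sketch of Algorithm 2: concatenate the matching fields, split the
    string into fragments of size f, and turn each fragment's ASCII codes
    into one polynomial coefficient; unused coefficients are set to 1."""
    s = "".join(fields)
    frags = [s[i:i + f] for i in range(0, len(s), f)]
    if len(frags) > n + 1:
        # Longer strings need a higher-degree polynomial or skipped characters.
        raise ValueError("string exceeds the polynomial's capacity")
    coeffs = [int("".join(str(ord(ch)) for ch in frag)) for frag in frags]
    coeffs += [1] * (n + 1 - len(coeffs))   # remaining coefficients set to 1
    return coeffs
```

For JOE_DOE with f = 3 and n = 8 this yields the fragments JOE, _DO and E, encoded as 747969, 956879 and 69, followed by six coefficients equal to 1.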
        <p>
          4.2.2 Verification. The aim of the verification step is, as its
name suggests, to verify at the unlocking phase that the recovered
polynomial matches the encoded one. In our approach, we use
the method proposed in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. This method is based on auxiliary
data generated during the polynomial construction phase. An
MD5 hash H, which derives directly from the created polynomial,
is created having as input the concatenation of the polynomial’s
coefficients.
        </p>
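A minimal sketch of this verification data, assuming the coefficients are concatenated as decimal strings:

```python
import hashlib

def verification_hash(coeffs):
    """Verification data as in Section 4.2.2: an MD5 digest of the
    concatenated polynomial coefficients."""
    return hashlib.md5("".join(str(c) for c in coeffs).encode()).hexdigest()
```

At unlock time, Bob hashes each candidate polynomial's coefficients and compares the digest against H; a match confirms he recovered the original polynomial.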
        <p>
          4.2.3 Key Generation. Building a polynomial is one of the two
steps required toward generating a record’s set of genuine points.
The second step regards the generation of the key. Since the
locking and the unlocking keys are the two components which
are compared, the key in our method should be directly
linked with the string. The caveat here is to avoid a one-to-one
connection between the key and the string, as a measure
against record multiplicity attacks, described in [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. For instance,
in the case that an adversary somehow gets access to the key,
our method should not allow the recovery of the secret string.
        </p>
        <p>As such, the need to devise a one-way mapping emerged. To
this end, we employ a Reference Set in order to associate its values
with the string resulting from the record’s concatenated fields,
by means of a string similarity function. Then, the most similar
values are selected and they are transformed, in an irreversible
manner, to a set of numbers comprising the key for the locked
string, without, however, revealing any information about it.</p>
        <p>Let us now see how this works. The generation of a locking
key for a single record is detailed in Algorithm 3. This algorithm
should be applied on each record in the recordset, resulting in a
set of separate keys, one for each record. This algorithm requests
as input the concatenated alphanumeric fields of record r, as a
single string, to be encoded, a Reference Set RS that we are going
to use for encoding, and the size |K| of the key K which we are
going to produce.</p>
        <p>Our algorithm also requires the use of a string similarity, or
distance, function F(·). This is applied on the pairs formed by r
and each of the values rs_value of the Reference Set RS, resulting
in an array of similarity scores sim_scores (lines 1-4). sim_scores
is then used to partially sort RS. As our aim is to retrieve the top
|K| values of RS, a full sort is unnecessary. The sort order
depends on the type of measure used for F(·). If F(·) is a similarity
function, then sorting occurs in descending order, as we need to
retrieve the most similar of RS's values with r. Otherwise, sorting
in ascending order is performed. Afterward, closest_neighbors are
extracted from RS, which are the top |K| most similar Reference
Set values with r (lines 6-9). Next, we are going to build K by
applying a series of transformations on each of closest_neighbors,
resulting in a numerical set.</p>
      </sec>
      <sec id="sec-10-2">
        <title>Algorithm 3: Key Generation.</title>
        <p>Input:
- r: A record with concatenated alphanumeric fields
- RS: Reference Set
- |K|: Size of resulting key
Output:
- K: Resulting key
1 sim_scores ← ∅;
2 foreach rs_value ∈ RS do
3   sim_scores.append (F (r, rs_value));
4 end
5 Partial_Sort (RS, sim_scores, |K|);
6 closest_neighbors ← ∅;
7 for i ← 0 to |K| by 1 do
8   closest_neighbors.append (RS[i]);
9 end
10 K ← ∅;
11 for i ← 0 to |K| by 1 do
12   hash ← Hash (closest_neighbors[i]);
13   seed ← hash;
14   rand_value ← Rand (seed, 0, 10^6);
15   K.append (Cosine (rand_value));
16 end</p>
        <p>
          The transformations that each string closest_neighbors[i]
undergoes are as follows (lines 10-16). Initially, closest_neighbors[i] is
hashed through a simple hash function so as to create a numerical
equivalent hash (line 12). Then, hash is used as an input seed in a
random number generator which produces rand_value,
ranging between [0, 10^6] (lines 13-14). The series of transformations
concludes with applying the cosine function on rand_value,
resulting in a value between [−1, 1], which is now a part of K
(line 15).
        </p>
        <p>It is evident that after applying this series of mappings we
have managed to perform a one-way transformation between
the initial string that consists of the concatenation of the record's
fields and the numbers that comprise K. This is due to the use
of the numerical representation of the string as a seed and the
periodic nature of the cosine function.</p>
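The key-generation pipeline of Algorithm 3 can be sketched as follows. MD5 as the "simple hash function", Python's random module as the generator, and a full sort in place of Partial_Sort are illustrative assumptions, not the prototype's exact choices:

```python
import hashlib
import math
import random

def key_element(neighbor):
    """One-way transform of a Reference Set neighbor into a key element:
    hash -> seed -> pseudo-random integer in [0, 10^6] -> cosine in [-1, 1]."""
    seed = int(hashlib.md5(neighbor.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    rand_value = rng.randint(0, 10**6)
    return math.cos(rand_value)

def generate_key(record, reference_set, key_size, distance):
    """Algorithm 3 sketch: pick the key_size most similar Reference Set
    values (here, smallest distance) and transform each irreversibly."""
    neighbors = sorted(reference_set, key=lambda rs: distance(record, rs))[:key_size]
    return [key_element(n) for n in neighbors]
```

Because both parties use the same Reference Set and the same deterministic transform, similar records yield keys sharing many elements, which is what the unlocking step exploits.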
        <p>4.2.4 Fuzzy Vault Construction. At this point, three steps
remain for constructing a Fuzzy Vault: generating Genuine
Points, generating Chaff Points and, finally, blending these
together.</p>
        <p>
          Genuine Points Generation. Now that we have created the key
K and the polynomial p required by the Fuzzy Vault scheme to
encode a secret, we project K over p, as indicated in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]:
G = {(k, p(k))}, ∀k ∈ K, resulting in a set of |K| points considered to
be the set G of genuine points in the Fuzzy Vault.
        </p>
        <p>
          Noise as Chaff Points. After the creation of the genuine points,
chaff points should also be generated, ensuring the fuzzy vault's
security. Noise comprises a standard method for enhancing privacy
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In our method we adhere to this principle by generating
chaff points which aim at distracting a potential attacker from
identifying the points that belong to the actual polynomial. A
naive approach would be to randomly generate points based on
one or more distributions. However, we consider that this would
allow for identifying and phasing out noise quite easily. As such,
we designed a novel approach, more suitable for our case, which
is based on the use of Reference Sets. We use abscissas from
the Reference Set we employ, while the ordinates are randomly
chosen, with the restriction that the resulting points do not
lie on the polynomial. This way, the chaff set C is built.</p>
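Putting the pieces together, vault construction might be sketched as below. The function name, the ordinate range, and passing pre-transformed chaff abscissas as a parameter are assumptions for illustration:

```python
import random

def build_vault(key, coeffs, chaff_abscissas, rng=None):
    """Sketch of Fuzzy Vault construction: project key values onto the
    polynomial (genuine points), give Reference-Set-derived chaff
    abscissas ordinates off the polynomial, and shuffle everything."""
    rng = rng or random.Random()

    def p(x):  # Horner evaluation; coeffs = [a_d, ..., a_0]
        y = 0
        for a in coeffs:
            y = y * x + a
        return y

    genuine = [(x, p(x)) for x in key]
    chaff = []
    for x in chaff_abscissas:
        y = rng.uniform(-10**6, 10**6)  # illustrative ordinate range
        if y != p(x):  # chaff must not lie on the polynomial
            chaff.append((x, y))
    vault = genuine + chaff
    rng.shuffle(vault)
    return vault
```

Because chaff abscissas come from the same transform as genuine ones, the two populations are indistinguishable to anyone inspecting the vault.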
        <p>Final Step. Eventually, the points of these sets are shuffled in
order for genuine and chaff points to be mixed, applying formula
5 so as to create FV. A Fuzzy Vault created with this method is
illustrated in Figure 4. At this point, the data encoding from
Alice's side concludes. Alice sends Bob the Fuzzy Vaults she has
constructed together with the verification hashes, one pair for
each record. This concludes the first step (line 1) of the protocol
illustrated in Algorithm 1. To further clarify our approach, we
will use Example 4.2:</p>
        <p>
          Example 4.2. Let us assume that “JOHN_SNOW” is the string
Alice wishes to lock. Let us also assume that F(·) is going to
be the Levenshtein distance [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. First, she encodes the string into
a polynomial, using the fragmentation of the respective ASCII
codes, with f = 3: p(x) = 747972x^8 + 789583x^7 + 787987x^6 + 1x^5 +
1x^4 + 1x^3 + 1x^2 + 1x^1 + 1x^0.
        </p>
        <p>Based on the representation of the polynomial, she also
constructs the verification hash, by concatenating the coefficients
of the polynomial so as to form “747972789583787987111111”.
Applying MD5 results in: “2d73f7e7425e2d20264a799e24715317”,
which will be delivered together with the Fuzzy Vault FV to Bob.</p>
        <p>
          Then, Alice will have to build her key K. For this purpose,
she selects the top |K| = 15 elements of the Reference Set which
are most similar to the word “JOHN_SNOW”. For instance,
the first element is “JONES_OWEN”, which has F(JOHN_SNOW,
JONES_OWEN) = 5, while the last one is the string “XU_BO” with
F(JOHN_SNOW, XU_BO) = 7. Now, she hashes each of the selected
strings into numbers and then the numbers are mapped into the
range [−1, 1] using the cosine function; for “JONES_OWEN” this yields
−0.355866.
        </p>
        <p>Next, in order to create the genuine points, Alice projects the
elements of her key K onto the polynomial p. For instance, for
the first key element this yields p(−0.355866) = 1222.8648. The
rest are accordingly projected. When this step concludes, G will
have been built.</p>
        <p>Eventually, Alice will have to build the set of chaff points C
and then calculate its union with G so as to build the Fuzzy Vault
FV. FV, coupled with the verification hash V, is delivered to Bob.
At this point, we assume that the second step (line 2) of Algorithm
1 has been successfully completed, thus Alice's records have
been successfully transmitted to Bob in the form of Fuzzy Vaults,
accompanied by their Verification Data, and we are moving to
steps 3 and 4. Now, Bob will attempt to unlock these Fuzzy Vaults,
thus linking his records with Alice's. To do so, he will have to
perform the actions illustrated in Figure 5. First, he will have
to create a key for each of his records. Then, all of his keys are
going to be tested against each of Alice's Fuzzy Vaults, also using
each vault's Verification Data. Let us see these in detail.</p>
        <p>
          4.3.1 Verification. As described in Section 3.3, a Fuzzy Vault
is unlocked by a key similar to the one that locked it. It is evident
that, for this to happen, Bob should create his unlocking keys
in a way identical to the process that Alice followed to create
her locking keys. Thus, Bob builds each of his unlocking keys,
for each of his records, using Algorithm 3. Afterwards, he
will calculate the intersections of all the unlocking keys with
all the Fuzzy Vaults, forming a series of Common Sets, each of
which contains (x, y) pairs. Each Common Set CS is then used
for reconstructing the polynomial. Adhering to the literature
[
          <xref ref-type="bibr" rid="ref20 ref25 ref6">6, 20, 25</xref>
          ], this is achieved through Lagrange interpolation [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] on
the common points CS, to create an interpolating polynomial.
        </p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>4.4 Approximate Matching</title>
      <p>At this point, our methodology for approximate matching is
applied. The key point for achieving approximation with a Fuzzy
Vault is the step of the reconstruction of the polynomial described
by the Fuzzy Vault. As mentioned earlier, an interpolated
polynomial of d-th degree needs at least d + 1 interpolating points
in order to be created. In our case, the polynomial is secured
in a Fuzzy Vault as |K| genuine pairs (x, y). Therefore, when
|K| = d + 1, in order to interpolate a d-th degree polynomial, the
two keys have to be identical.</p>
      <p>However, if we build a key K having a size greater than
the minimum number of points that a d-th degree polynomial
needs in order to be recovered, i.e., |K| &gt; d + 1, we can exploit
these redundant (x, y) pairs to achieve approximation. In other
words, since d + 1 points are required to unlock a d-th degree
polynomial and we project a key with size larger than d + 1, we
may use any d + 1 points of K to unlock the Fuzzy Vault. As
such, an unlocking key need only match d + 1 of K's points, with
|K| &gt; d + 1, thus leading to linking non-identical records. On the other hand,
as |K| increases, the probability of having common elements
in different keys increases too. As such, too large keys would
result in increased numbers of false positives.</p>
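The subset-based unlocking described above can be sketched as follows. Exact rational arithmetic (Python's fractions) stands in for the prototype's floating-point interpolation so that recovered integer coefficients compare cleanly, and the coefficient comparison stands in for the verification-hash check:

```python
from fractions import Fraction
from itertools import combinations

def lagrange_coeffs(points):
    """Interpolating polynomial through `points`, returned as exact
    coefficients [a_0, ..., a_d] via Lagrange's formula."""
    n = len(points)
    coeffs = [Fraction(0)] * n
    for i, (xi, yi) in enumerate(points):
        basis = [Fraction(1)]           # coefficients of prod_{j!=i} (x - x_j)
        denom = Fraction(1)
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            nxt = [Fraction(0)] * (len(basis) + 1)
            for k, c in enumerate(basis):
                nxt[k] -= c * xj        # contribution of the constant -x_j
                nxt[k + 1] += c         # contribution of the x term
            basis = nxt
            denom *= Fraction(xi - xj)
        scale = Fraction(yi) / denom
        for k, c in enumerate(basis):
            coeffs[k] += scale * c
    return coeffs

def try_unlock(common_points, degree, expected_coeffs):
    """Try every (degree + 1)-subset of the common points; succeed if any
    subset reconstructs the expected polynomial (the real protocol would
    compare verification hashes instead of raw coefficients)."""
    for subset in combinations(common_points, degree + 1):
        if lagrange_coeffs(list(subset)) == list(expected_coeffs):
            return True
    return False
```

With |K| &gt; d + 1 common points, a handful of non-genuine pairs in the Common Set is tolerated, since only one clean (d + 1)-subset is needed.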
      <p>Now, after recovering the polynomial from CS, Bob creates
a Verification Hash V′ in a way identical to how Alice created
V. Then these two are checked for a match. If the unlocking key sufficiently
overlaps K and V = V′, the two polynomials will be the
same and there is a match. On the other hand, if the two sets
are not similar enough, verification will fail. If the polynomial
is classified as a match, then the string for which the vault has
been created is considered to be the same as the string Bob
holds. To facilitate the reader, we now provide Example 4.3:</p>
      <p>Example 4.3. Continuing Example 4.2, Bob tries to unlock
a Fuzzy Vault he received from Alice using his record
“JOHNY_SNOW”. First, he creates his key the same way Alice
did, using the same fragment size. After the creation of his key,
Bob has to find the common points between his key and the
fuzzy vault. In this case, Bob managed to find n = 10 common
points. Thereby, by utilizing Lagrange interpolation, he managed
to recover a polynomial of 8th degree by using 9 out of the 10
common points (he might have to try all the combinations) as
shown: p(x) = 747972x^8 + 789583x^7 + 787987x^6 + 1x^5 + 1x^4 +
1x^3 + 1x^2 + 1x^1 + 1x^0. Last but not least, Bob has to verify the
recovered polynomial, because he has no knowledge of whether it is the
correct one or not. Therefore, he calculates the MD5 hash of the
concatenation string of his polynomial's coefficients, which in
our case is “2d73f7e7425e2d20264a799e24715317”. Comparing the
two hashes, Bob can assess that the two strings match.</p>
    </sec>
    <sec id="sec-12">
      <title>4.5 Protocol Conclusion</title>
      <p>Eventually, after Bob has decided which of his records match
Alice's, he transmits a list of the matching records he recovered,
requesting their entire data. Alice responds and exchanges
her corresponding data with Bob (lines 5, 6 of Algorithm 1).</p>
    </sec>
    <sec id="sec-13">
      <title>5 PRIVACY DISCUSSION</title>
      <p>In our setup, we consider that Alice and Bob exhibit an honest-but-curious
behavior, meaning that they try to gather as much
additional information as they can while adhering to the protocol.</p>
      <p>
        A Fuzzy Vault's security properties rely on the number of chaff
points used while constructing it [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. We remind the reader of our notation,
where d represents the degree of the polynomial used, |C| the
number of generated chaff points and |G| the number of real,
genuine points. In order to correctly reconstruct the
polynomial, d + 1 of the genuine points
must be used. Therefore, any adversary who has access to a
fuzzy vault and wants to unlock it has to try all possible
combinations of d + 1 points and make reconstruction attempts. Even
in this case, the adversary has to know the polynomial degree d
which has been used.
      </p>
      <p>Without any extra information (other than the polynomial
degree), any adversary is required to examine binom(|G| + |C|, d + 1) point
combinations. From all the combinations of d + 1 points that can be
extracted from the vault, only binom(|G|, d + 1) combinations are able to
unlock it. Therefore, the probability of correctly guessing a suitable
combination is given by Equation 7.</p>
      <p>binom(|G|, d + 1) / binom(|G| + |C|, d + 1)    (7)</p>
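For illustration, Equation 7 can be evaluated directly; the figures |G| = 15, |C| = 200 and d = 8 match the running example of this section:

```python
from math import comb

def guess_probability(num_genuine, num_chaff, degree):
    """Probability that a randomly chosen (degree + 1)-subset of vault
    points consists solely of genuine points (Equation 7)."""
    return comb(num_genuine, degree + 1) / comb(num_genuine + num_chaff, degree + 1)
```

For |G| = 15, |C| = 200 and d = 8 this probability is on the order of 10^-12, which is what makes a brute-force attack impracticable.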
      <p>Consequently, a brute force attack on a well-concealed fuzzy
vault is considered impracticable. In our approach, to further
increase privacy, we employ a series of further enhancements.
First, there are the chaff points, which comprise the Fuzzy Vault.
Then, there is the use of a Reference Set and, finally, the use of
non-bijective mapping functions.</p>
      <p>Chaff points are responsible for concealing the vault's secret
which, in our case, is a record r. We have taken the following
measures in order to further enhance this property. First of all,
regarding the use of the Reference Set and mapping functions:
sensitive data are not encoded themselves within the Fuzzy Vault.
What is encoded with the Reference Set is the output of a distance
or similarity function. Furthermore, the output of this function is
fed to a non-bijective function, thus rendering both the
polynomial representation and the key generation processes irreversible.</p>
      <p>To continue, to avoid frequency attacks, instead of using a
random function to generate points, the chaff generation
technique we propose is based on the use of a Reference Set. This way,
the abscissas of the points delivered to Bob are indistinguishable
from the original points. Consider that the vault has 15 genuine
points and 200 chaff points whose abscissas are also derived
from the Reference Set, the same as for the genuine points. Bob
knows all the possible abscissas which are produced by the
Reference Set, but our method uses these abscissas as noise. Therefore,
there is no information leakage which can help Bob unlock
the vault without having a proper key.</p>
      <p>
        Furthermore, it should be noted that, in contrast
with the biometrics applications utilizing Fuzzy Vaults [
        <xref ref-type="bibr" rid="ref25 ref6">6, 25</xref>
        ],
the leakage of a key may only affect a single vault. This is due
to the fact that we employ a separate polynomial for each string
to be locked in the Fuzzy Vault. Additionally, the degree of the
polynomial and the size of the chaff set also affect a Fuzzy Vault's
offered privacy. To further harden the Fuzzy Vault, the degree
of the polynomial may be increased. Increasing the size of the
chaff set will favor privacy as well.
      </p>
    </sec>
    <sec id="sec-14">
      <title>6 EXPERIMENTAL EVALUATION</title>
      <p>
        In this section, we will provide empirical evidence regarding the
matching performance and the time characteristics of our method.
Finally, we will compare the Fuzzy Vault method we propose
against a baseline privacy preserving matching method based on
phonetic codes [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For our evaluation, we have implemented a
prototype in Python 3¹, enhanced by a series of packages².
      </p>
    </sec>
    <sec id="sec-15">
      <title>6.1 Setup</title>
      <p>
        In our experiments, we have used samples from real-world data
originating from the North Carolina voters database³. To evaluate
our approach, we have used ‘last name’ and ‘first name’ from this
database as matching attributes, assuming they comprise a
candidate key. Sampling this database, we constructed two datasets,
dataset_A and dataset_B. Dataset_A initially consisted of 1000
records and was then deduplicated to 990 ones. Dataset_B has been
built from dataset_A. Since our method is capable of achieving
approximate matching, and in order to evaluate its performance
in these terms, we corrupted dataset_B so as to contain one error
per attribute, as it is rather unusual for a word to contain more
than one to two typographical errors [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Experiments were
conducted on an 8-core virtual server with 16GB of RAM, utilizing,
however, only a single core for each assessment.
¹https://github.com/XhinoMullaymeri/Fuzzy_Vault_PPRL_DOLAP21
²Python 3 packages: hashlib, random, scipy.interpolate.lagrange, numpy,
numpy.polyfit, Levenshtein
³Available at: https://dl.ncsbe.gov/index.html?prefix=data/
      </p>
      <p>For the Reference Set, now, there are two aspects we are
interested in: its content and its size. As for the content, we used
names and surnames originating from the initial database. In
terms of Reference Set sizes, we experimented with three distinct
dataset sizes. For each letter of the latin alphabet, 15, 20 and 25
records were selected with a uniform distribution, resulting in
15 ∗ 26 = 390 records, 20 ∗ 26 = 520 records and 25 ∗ 26 = 650
records. For each size we created three different Reference Sets
in order to have more information about the effect of the set's
content on the results. Table 2 contains the configurations for
all these Reference Sets. Since for each size of Reference Set we
will use three different Reference Sets, each time we perform an
experiment with each of them, we maintain a fixed id. To this end,
experiment ids 1–3 are for the Reference Sets of 390 elements,
4–6 for those of 520 elements, and 7–9 for those of 650 elements.</p>
      <p>The evaluation of our method is going to be based on metrics
broadly used in information retrieval and data mining. These
metrics are Precision, Recall and Balanced Accuracy. Precision is the
fraction of the relevant elements among the retrieved elements
(P = TP / (TP + FP)). Recall is the fraction of the relevant
elements that are retrieved (R = TP / (TP + FN)).
Balanced Accuracy is used as a metric to indicate accuracy in
cases of data with imbalanced classes, where the number of
instances of one class is significantly larger than the
number of instances of the other one. This metric normalizes the
true positive and true negative predictions by the number of
actual positive and negative samples, respectively, and divides their
sum by two.</p>
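As a quick reference, the three metrics can be computed from confusion-matrix counts as follows (any example counts used with it are hypothetical):

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Precision, Recall and Balanced Accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    balanced_accuracy = (recall + specificity) / 2.0
    return precision, recall, balanced_accuracy
```

With the heavily imbalanced pair counts of record linkage, the true-negative rate dominates the balanced accuracy, which is why it stays high even when recall dips.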
      <p>Next, we will examine in detail the outcomes of our empirical
evaluation.</p>
      <p>6.2.1 Matching Performance. Let us begin our experimental
assessment by examining the matching performance achieved
by our Fuzzy Vault scheme. This is illustrated in Figure 6. The
horizontal axis represents the ids of the Reference Sets used,
while the vertical one represents the value for the respective
metric, all ranging between 0 and 1. Regarding the results, now,
first of all, we may observe that Recall levels, represented by the
yellow bars, are above 60% in almost every test. This means
that at least 60% of the fuzzy vaults which had been created
using dataset_A managed to be unlocked by the proper record of
dataset_B, resulting in a successful record linkage. Furthermore,
we may observe three clusters forming, one for each Reference
Set size, where the results are similar within each cluster.</p>
      <p>Comparing precision and recall, we may observe that they
move in opposite directions, which constitutes an expected
outcome. Regarding the Reference Set used, the following
situation occurs. When smaller Reference Sets are employed, more
keys are assigned to the same items of the Reference Set. Thus,
more keys share common elements. On the other hand, larger Reference
Sets associate fewer keys with each of their elements,
leading to higher precision, yet lower recall.</p>
      <p>It is evident that the size of the Reference Set used plays an
important role in the overall behavior of our method.
Nevertheless, we have to stress that no particular measures were taken
to optimize the outcome regarding the distribution of the values
of the Reference Sets. We have deliberately chosen this route
in order to illustrate the baseline characteristics our approach
exhibits.</p>
      <p>Considering Balanced Accuracy now, it is easy to see that it
remains at very high levels in all cases. This is an interesting result
by itself, since we are interested not only in the matching pairs of
records, but also in the non-matching ones, considering
that we aim at a 1-1 matching of each of dataset_A's records
with the ones of dataset_B. Considering 1,000 records for each
dataset, we would ideally end up with 1,000 pairs matched and
1, 000 ∗ 1, 000 − 1, 000 = 999, 000 pairs not matched. Precision
and recall do not take into account such imbalances, a situation
captured, however, by Balanced Accuracy.</p>
      <p>6.2.2 Total Execution Time. In this set of experiments, we will
assess the relation between the execution time and the size of
the Reference Set. The results of this evaluation are illustrated in
Figure 7. The vertical axis represents time in seconds, while the
horizontal one, again, stands for the Reference Set used. The first
observation we can make is that there are groups formed, based
on the size of the Reference Set. Within each group, execution
times are similar, as the sizes of these Reference Sets are identical
and for each key the same number of comparisons has to take
place. On the other hand, as the size of the Reference Set increases,
execution times increase as well. As a result, we can deduce that
total execution time is proportional to the size of the Reference
Set.</p>
      <p>This observation is particularly useful, especially when
combined with the conclusions from the previous set of experiments.
In more detail, using smaller Reference Sets results in
higher recall and lower precision. On the other hand, larger
Reference Sets increase not only execution time but also precision,
at the cost of lower recall. Selecting intermediate sizes for
the Reference Set offers a fair trade-off between precision and
recall, and average time performance.</p>
      <p>6.2.3 Locking and unlocking time shares. In this experiment,
we will focus on Reference Set 8, in order to illustrate time shares
for locking and unlocking the vaults. These are illustrated in
Figures 9a and 9b. We will begin by describing the locking phase.
Here, as we may see, the time required by the locking phase is
mostly occupied by the chaff points generation. This is, however,
one of the basic elements for offering strong privacy
characteristics. The rest of the steps of the locking phase occupy only a
small portion of the overall time required, making this part of
our approach open for further improvements.</p>
      <p>Regarding unlocking, it is easy to see that time
is almost equally shared between reconstruction and key
generation. Identifying common points requires just a tiny portion of
the overall time.</p>
      <p>At this point, we will make use of one more plot to be able to
extract safe conclusions regarding execution time. As we may
see, Figure 9c illustrates a time comparison between the locking
and the unlocking phase. It is evident that the time required to
unlock all vaults is much larger than the time required to lock
them. In particular, while 58 seconds are required for encoding,
6869 seconds are required for both decoding and performing an</p>
      <p>
        6.2.4 Comparison with privacy preserving Soundex. Finally,
we conclude our experimental evaluation by comparing against
Privacy Preserving Soundex [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For this experiment, we used
Reference Set 7, comprising 650 entries.
      </p>
      <p>This experiment is illustrated in Figure 8. Again, the
vertical axis displays values of the metrics, while the horizontal one
represents the name of each metric. Green bars correspond to
Privacy Preserving Soundex, while our method is illustrated with
the red bars. We may observe that only in terms of Precision is our
method inferior to Privacy Preserving Soundex. This, however,
is a result of the low Recall that Privacy Preserving Soundex
exhibited. On the other hand, Fuzzy Vault manages to exhibit a
more balanced behavior. This result is promising, also taking into
account that Privacy Preserving Soundex is a three-party method,
as opposed to our two-party method. Regarding execution time,
Private Soundex operates in less than a second. However, as the
results are simulated and this protocol requires a three-party
setting for achieving privacy, we may not extract valid conclusions
based on this comparison, since protocol communication times
are not taken into account.</p>
    </sec>
    <sec id="sec-16">
      <title>7 CONCLUSIONS AND FUTURE WORK</title>
      <p>In this paper, we presented the details of our method for
performing two-party privacy preserving record linkage using Fuzzy
Vaults. We discussed the privacy properties of our approach and
provided extensive empirical evidence that this new approach is
very promising. In our next steps, we aim at further refining our
method to increase matching performance while, at the
same time, improving execution times, maintaining, nevertheless,
our method's privacy characteristics. We will further experiment
with tuning our method's parameters, trying alternative
interpolation methods.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Luca</given-names>
            <surname>Bonomi</surname>
          </string-name>
          , Yingxiang Huang, and
          <string-name>
            <given-names>Lucila</given-names>
            <surname>Ohno-Machado</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Privacy challenges and research opportunities for genomic data sharing</article-title>
          .
          <source>Nature Genetics</source>
          (
          <year>2020</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Feng</given-names>
            <surname>Chen</surname>
          </string-name>
          , Xiaoqian Jiang, Shuang Wang,
          <string-name>
            <given-names>Lisa M</given-names>
            <surname>Schilling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Daniella</given-names>
            <surname>Meeker</surname>
          </string-name>
          , Toan Ong, Michael E Matheny, Jason N Doctor,
          <string-name>
            <given-names>Lucila</given-names>
            <surname>Ohno-Machado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jaideep</given-names>
            <surname>Vaidya</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Perfectly secure and efficient two-party electronic health-record linkage</article-title>
          .
          <source>IEEE internet computing 22</source>
          ,
          <issue>2</issue>
          (
          <year>2018</year>
          ),
          <fpage>32</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Yanling</given-names>
            <surname>Chen</surname>
          </string-name>
          . [n. d.].
          <article-title>Current approaches and challenges for the two-party privacy-preserving record linkage (PPRL)</article-title>
          .
          <source>Collaborative Technologies and Data Science in Artificial Intelligence Applications</source>
          <volume>(</volume>
          [n. d.]).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Christen</surname>
          </string-name>
          , Thilina Ranbaduge, and
          <string-name>
            <given-names>Rainer</given-names>
            <surname>Schnell</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Linking Sensitive Data: Methods and Techniques for Practical Privacy-Preserving Information Sharing</article-title>
          . Springer International Publishing AG.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Christen</surname>
          </string-name>
          , Thilina Ranbaduge, Dinusha Vatsalan, and
          <string-name>
            <given-names>Rainer</given-names>
            <surname>Schnell</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Precise and fast cryptanalysis for Bloom filter based privacy-preserving record linkage</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>31</volume>
          ,
          <issue>11</issue>
          (
          <year>2018</year>
          ),
          <fpage>2164</fpage>
          -
          <lpage>2177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T. Charles</given-names>
            <surname>Clancy</surname>
          </string-name>
          , Negar Kiyavash, and
          <string-name>
            <given-names>Dennis J</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Secure smartcardbased fingerprint authentication</article-title>
          .
          <source>In Proceedings of the 2003 ACM SIGMM workshop on Biometrics methods and applications</source>
          . 45-
          <fpage>52</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Samuel Daniel</given-names>
            <surname>Conte</surname>
          </string-name>
          and Carl De Boor.
          <year>2017</year>
          .
          <article-title>Elementary numerical analysis: an algorithmic approach</article-title>
          .
          <source>SIAM</source>
          .
          <volume>31</volume>
          -41 pages.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Fred J</given-names>
            <surname>Damerau</surname>
          </string-name>
          .
          <year>1964</year>
          .
          <article-title>A technique for computer detection and correction of spelling errors</article-title>
          .
          <source>Commun. ACM 7</source>
          ,
          <issue>3</issue>
          (
          <year>1964</year>
          ),
          <fpage>171</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Aleksander</given-names>
            <surname>Essex</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Secure Approximate String Matching for Privacy-Preserving Record Linkage</article-title>
          .
          <source>IEEE Transactions on Information Forensics and Security</source>
          <volume>14</volume>
          ,
          <issue>10</issue>
          (
          <year>2019</year>
          ),
          <fpage>2623</fpage>
          -
          <lpage>2632</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Michael T</given-names>
            <surname>Goodrich</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>The mastermind attack on genomic data</article-title>
          .
          <source>In 2009 30th IEEE Symposium on Security and Privacy. IEEE</source>
          ,
          <fpage>204</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Ali</given-names>
            <surname>Inan</surname>
          </string-name>
          , Murat Kantarcioglu, Gabriel Ghinita, and
          <string-name>
            <given-names>Elisa</given-names>
            <surname>Bertino</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Private record matching using differential privacy</article-title>
          .
          <source>In Proceedings of the 13th International Conference on Extending Database Technology</source>
          .
          <fpage>123</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Ari</given-names>
            <surname>Juels</surname>
          </string-name>
          and
          <string-name>
            <given-names>Madhu</given-names>
            <surname>Sudan</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>A fuzzy vault scheme</article-title>
          .
          <source>Designs, Codes and Cryptography</source>
          <volume>38</volume>
          ,
          <issue>2</issue>
          (
          <year>2006</year>
          ),
          <fpage>237</fpage>
          -
          <lpage>257</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Alexandros</given-names>
            <surname>Karakasidis</surname>
          </string-name>
          , Georgia Koloniari, and Vassilios S Verykios.
          <year>2015</year>
          .
          <article-title>Scalable blocking for privacy preserving record linkage</article-title>
          .
          <source>In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          .
          <fpage>527</fpage>
          -
          <lpage>536</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Alexandros</given-names>
            <surname>Karakasidis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vassilios S</given-names>
            <surname>Verykios</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Privacy preserving record linkage using phonetic codes</article-title>
          .
          <source>In 2009 Fourth Balkan Conference in Informatics. IEEE</source>
          ,
          <fpage>101</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Basit</given-names>
            <surname>Khurram</surname>
          </string-name>
          and
          <string-name>
            <given-names>Florian</given-names>
            <surname>Kerschbaum</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>SFour: A Protocol for Cryptographically Secure Record Linkage at Scale</article-title>
          .
          <source>In 2020 IEEE 36th International Conference on Data Engineering (ICDE)</source>
          . IEEE,
          <fpage>277</fpage>
          -
          <lpage>288</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Mehmet</given-names>
            <surname>Kuzu</surname>
          </string-name>
          , Murat Kantarcioglu, Ali Inan, Elisa Bertino, Elizabeth Durham, and
          <string-name>
            <given-names>Bradley</given-names>
            <surname>Malin</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Efficient privacy-aware record integration</article-title>
          .
          <source>In Proceedings of the 16th International Conference on Extending Database Technology</source>
          .
          <fpage>167</fpage>
          -
          <lpage>178</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Vladimir</given-names>
            <surname>Levenshtein</surname>
          </string-name>
          .
          <year>1965</year>
          .
          <article-title>Binary codes capable of correcting spurious insertions and deletions of ones</article-title>
          .
          <source>Problems of Information Transmission</source>
          <volume>1</volume>
          ,
          <issue>1</issue>
          (
          <year>1965</year>
          ),
          <fpage>8</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Xhino</given-names>
            <surname>Mullaymeri</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexandros</given-names>
            <surname>Karakasidis</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>A Two-Party Private String Matching Fuzzy Vault Scheme</article-title>
          .
          <source>In The 36th ACM/SIGAPP Symposium On Applied Computing. ACM.</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Chaoyi</given-names>
            <surname>Pang</surname>
          </string-name>
          , Lifang Gu, David Hansen, and
          <string-name>
            <given-names>Anthony</given-names>
            <surname>Maeder</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Privacy-preserving fuzzy matching using a public reference table</article-title>
          .
          <source>In Intelligent Patient Management</source>
          . Springer,
          <fpage>71</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Wendy</given-names>
            <surname>Ponce-Hernandez</surname>
          </string-name>
          , Ramon Blanco-Gonzalo, Judith Liu-Jimenez, and
          <string-name>
            <given-names>Raul</given-names>
            <surname>Sanchez-Reillo</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Fuzzy Vault Scheme Based on Fixed-Length Templates Applied to Dynamic Signature Verification</article-title>
          .
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          ),
          <fpage>11152</fpage>
          -
          <lpage>11164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Thilina</given-names>
            <surname>Ranbaduge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Christen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rainer</given-names>
            <surname>Schnell</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Secure and Accurate Two-Step Hash Encoding for Privacy-Preserving Record Linkage</article-title>
          .
          <source>In Pacific-Asia Conference on Knowledge Discovery and Data Mining</source>
          . Springer,
          <fpage>139</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Ahsan</given-names>
            <surname>Saleem</surname>
          </string-name>
          , Abid Khan, Furqan Shahid,
          <string-name>
            <given-names>M Masoom</given-names>
            <surname>Alam</surname>
          </string-name>
          , and Muhammad Khurram Khan.
          <year>2018</year>
          .
          <article-title>Recent advancements in garbled computing: how far have we come towards achieving secure, efficient and reusable garbled circuits</article-title>
          .
          <source>Journal of Network and Computer Applications</source>
          <volume>108</volume>
          (
          <year>2018</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Walter J</given-names>
            <surname>Scheirer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Terrance E</given-names>
            <surname>Boult</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Cracking fuzzy vaults and biometric encryption</article-title>
          .
          <source>In 2007 Biometrics Symposium. IEEE</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24] Springer Verlag GmbH, European Mathematical Society. [n. d.].
          <source>Encyclopedia of Mathematics</source>
          . Website. URL: https://www.encyclopediaofmath.org/. Accessed on 2020-08-11.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Umut</given-names>
            <surname>Uludag</surname>
          </string-name>
          , Sharath Pankanti, and
          <string-name>
            <given-names>Anil K</given-names>
            <surname>Jain</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Fuzzy vault for fingerprints</article-title>
          .
          <source>In International Conference on Audio-and Video-Based Biometric Person Authentication</source>
          . Springer,
          <fpage>310</fpage>
          -
          <lpage>319</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Dinusha</given-names>
            <surname>Vatsalan</surname>
          </string-name>
          , Ziad Sehili,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Christen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Erhard</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Privacy-preserving record linkage for big data: Current approaches and research challenges</article-title>
          .
          <source>In Handbook of Big Data Technologies</source>
          . Springer,
          <fpage>851</fpage>
          -
          <lpage>895</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>