<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Fuzzy Vaults for Privacy Preserving Record Linkage</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xhino Mullaymeri</string-name>
          <email>dai16019@uom.edu.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandros Karakasidis</string-name>
          <email>a.karakasidis@uom.edu.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Applied Informatics, University of Macedonia</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <kwd-group>
        <kwd>Fuzzy Vault</kwd>
        <kwd>Privacy Preserving Record Linkage</kwd>
        <kwd>Approximate String Matching</kwd>
      </kwd-group>
      <abstract>
        <p>Approximate string matching has a variety of applications, one of them being record linkage, where records from different sources, without common unique identifiers, are matched. When these records refer to individuals, maintaining their privacy becomes a paramount requirement, leading us to develop privacy preserving record linkage methods. In this paper, we present a solution to the privacy preserving record linkage problem which is based on Fuzzy Vaults and has the advantage of being two-party based. A Fuzzy Vault is a cryptographic structure based on noise. Fuzzy Vaults have previously been used mainly in biometric applications. In this work, we explain our methodology in depth, providing detailed examples; we support our argument for privacy preservation with a solid discussion; and we provide extensive experimental results accompanied by a rigorous analysis, not only to indicate the stability and robustness of our method, but also to demonstrate that our setup achieves superior results when compared to a baseline three-party-based method.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>In the digital era we are living in, the number of agencies and
corporations which store personal data has dramatically increased.
A common need for these organizations is to exchange information
about the individuals described by these data and to perform analysis
on them. As a data integration step, this would traditionally boil
down to a database join, in the case that common unique
identifiers, such as a social id or a driving license number, are shared.
However, since this is not always the case, the next step is to
exploit fields that, when combined, may uniquely identify a record.
What we have just described is commonly known in the literature
as the Record Linkage Problem. Solving this problem, however,
does not take into consideration that the privacy of the individuals
described in these databases should be maintained. This
constitutes the Privacy Preserving Record Linkage problem, where
two parties aim to find common records without revealing any
additional information to each other, apart from the matching
ones. However, data are usually dirty, and even common fields
between the two parties’ databases might diverge, quite often
due to human error (e.g., typing errors during data entry). Since
such identifiers are quite often string values, such as name, surname
or address, one approach to overcome this complication is to use
approximate string matching methods which should also
consider privacy preservation.</p>
      <p>To indicate the importance of the Privacy Preserving Record
Linkage task, let us consider the COVID-19 pandemic we are
experiencing and the need for tracking the contacts of confirmed
cases, which undoubtedly constitutes a matter of great concern.
Suppose that a verified case of the disease is an airline passenger.
To ensure public health, all of the patient’s close contacts should be
tracked and informed that they might have been infected.
By applying Privacy Preserving Record Linkage on databases from health
agencies and airlines, authorities may locate and inform people
who might have been exposed to the virus, without
compromising privacy. In the same context, surveillance of quarantined
cases can also benefit. When someone is placed under house
quarantine, his identifying data may be added to a public
governmental database so as to be accessed by every agency which
offers public services (accommodation, transport, entertainment).
As these places are usually overcrowded, they become sites of
possible outbreaks. Before providing any services, such providers
should be able to check if their customers’ names are on the
quarantined cases list, protecting, in any case, the privacy of the
individuals at all sites.</p>
      <p>
        In this paper, we present in detail the operation and the
under-the-hood implementation details of the two-party method for
performing privacy preserving record linkage using Fuzzy Vaults
briefly sketched in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. A Fuzzy Vault scheme [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a
cryptographic structure where a secret is “locked” with a key, and
in order to “unlock” it someone must use a key which is similar
enough to the locking key. This “unlocking” phase of a Fuzzy
Vault is approximate by its nature, so it suits very well our goal
of approximate string matching. Originally, it was used
in biometric authentication algorithms, and such
implementations appear in the literature [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Despite the work
on biometric authentication, to the best of our knowledge, there
have been no works utilizing Fuzzy Vaults for performing
privacy preserving approximate string matching, focused on, but
not limited to, Privacy Preserving Record Linkage operations.
      </p>
      <p>As such, our contributions may be summarized as follows. First
of all, we illustrate, for the first time, all the details of our novel
methodology using Fuzzy Vaults for private approximate string
matching, so that it can be utilized in a privacy preserving record
linkage context. This methodology consists of a carefully designed
workflow employing reference sets, non-bijective mapping
functions and noise addition. We provide detailed examples in order
to allow in-depth comprehension of our method’s details, and
we support our argument of privacy preservation with a solid
discussion. In terms of empirical evaluation, we provide extensive
experimentation, not only to exhibit our method’s behavior,
but also to prove its superiority over a baseline three-party-based
method in terms of overall performance, a fact with additional
value considering that our fuzzy vault approach does not require
a third party.</p>
      <p>The rest of this paper is organized as follows. Section 2 presents
works related to ours. In Section 3, there is the background for
introducing the reader to our method, which is detailed in Section
4. Section 5 contains a discussion over the privacy properties of
our method. In Section 6 we provide the empirical analysis of
our method in order to demonstrate its robustness and stability.
Finally, in Section 7, we provide our conclusions and thoughts
for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Private approximate string matching techniques have been widely
developed in the context of privacy preserving record linkage
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], being mostly based either on Secure Multiparty
Computation (SMC), which usually incurs additional communication costs,
or on data perturbation, which often appears to be quite vulnerable
to attacks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In order to overcome some of these drawbacks and render SMC-based
techniques applicable to real-world-sized datasets,
hybrid methods have emerged. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] a hybrid method which
combines differential privacy and homomorphic encryption is
presented. Here, instead of sanitizing data, differential privacy is
applied to ensure data privacy. Afterwards, data partitioning into
blocks takes place. Finally, an SMC step based on homomorphic
encryption occurs and returns the matching pairs. This method
partially alleviates the bottleneck caused by the limited size
of data that SMC can endure.
      </p>
      <p>
        Another hybrid method, based upon blocking, the
utilization of differential privacy, and an SMC (homomorphic)
phase, is described in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. However, in this case the two parties
compare their data with the help of a third one. Moreover, even
recent techniques addressing this problem employ a third party
[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], an approach that may be rendered impractical by requiring the
availability of such a third party.
      </p>
      <p>
        In the last few years, there has been a particular effort towards
developing two-party based techniques [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. One of the methods
used to achieve this functionality is homomorphic encryption [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
which, however, is computationally expensive [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and also
liable to suffer certain attacks [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Another recent technique is
based on SPDZ [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] which, however, may also be expensive in
terms of communication and computation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Finally, Garbled
Circuits can offer a solution [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], but further evaluation is needed
in terms of execution time, size and reusability [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>
        To this end, we present a solution to the two-party
privacy preserving record linkage problem using Fuzzy Vaults. A
Fuzzy Vault scheme may be used for privacy preserving
matching, personal entropy systems, biometrics [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] or, as
indicated by the recent literature [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], for signature verification.
Our method has been briefly presented in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. In this paper,
we move one step further by providing a simple protocol for
two-party privacy-preserving record linkage using Fuzzy Vaults,
detailed examples, an in-depth privacy discussion exhibiting the
privacy characteristics of our method and, last but not least, an
extended experimental evaluation, also including a comparison
with a baseline method.
      </p>
    </sec>
    <sec id="sec-3">
      <title>PRELIMINARIES</title>
      <p>In this section, we define the problem we are solving and provide
the required background for laying out our method. The notation
used throughout this paper is summarized in Table 1.
</p>
    </sec>
    <sec id="sec-4">
      <title>Problem Formulation</title>
      <p>Let us consider two data sources, called Alice and Bob.
Privacy preserving record linkage is the problem of identifying
(linking) all pairs of their records that refer to the same real world
entity, so that no more information is disclosed to Alice, Bob, or
any third party involved in the process besides the identifiers of
the linked records.</p>
      <p>Alice and Bob will most probably use different schemas in
their databases. As such, they will have different attributes. Let
us assume that, between Alice’s schema and Bob’s schema, a
number of attributes are common to the two sources, forming a
composite key. These attributes might
be names, surnames, addresses and so on. As such, none of these
on its own may comprise a unique identifier that can be used to
identify a record. We refer to these attributes as matching fields.
This composite key is used to determine when two records match,
i.e., when they refer to the same entity. To determine whether two
records match, the respective attributes forming the composite
key need to be compared. Considering that our data is often dirty,
matching should rely on a similarity or distance function.</p>
      <p>Let us consider a similarity function F(·) → [0..1] and a
threshold t &gt; 0. Given the composite keys k_i and k_j for records r_i
and r_j, respectively, we define the following matching function
M → {0, 1}:</p>
      <p>M(k_i, k_j) = 1 if F(k_i, k_j) ≥ t, and 0 otherwise. (1)</p>
      <p>If M(k_i, k_j) = 1, then the pair (r_i, r_j) is a match.</p>
      <p>This process is the matching process. To preserve privacy, i.e.,
ensure privacy preserving matching (PPM), after the completion
of this process, the only information revealed is the identifiers
of the matched records. In our case, where we use Fuzzy Vaults
for matching, Formula 1 will have as input Bob’s key,
corresponding to Bob’s record, and a Fuzzy Vault for Alice’s record, and it
will examine their absolute matching status using the method we
describe in detail in Section 4, returning 1 when Bob successfully
unlocks Alice’s Fuzzy Vault with his key and 0 otherwise. As such,
the matching threshold is equal to 1.</p>
    </sec>
    <sec id="sec-5">
      <title>Lagrange Interpolation</title>
      <p>
        Lagrange interpolation [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] is the most common interpolation
method used in the fuzzy vault literature. Considering a
two-dimensional Euclidean space, and given a data set of n + 1 points
(x_0, y_0), (x_1, y_1), . . . , (x_n, y_n), with x_i ≠ x_j ∀ i, j ∈ {0, 1, . . . , n}, i ≠ j,
the approximation function using Lagrange interpolation is a
polynomial of degree n, defined as shown in Equation 2:
P(x) = Σ_{i=0}^{n} y_i · l_i(x) (2)
      </p>
      <p>This method is called Lagrange interpolation because it uses
Lagrange basis polynomials in order to interpolate the data set,
as illustrated in Equation 3:
l_i(x) = Π_{j=0, j≠i}^{n} (x − x_j) / (x_i − x_j) (3)</p>
      <p>By noticing the denominator of Equation 3, it is evident why
x_i ≠ x_j ∀ i, j should hold for the formula’s abscissas.</p>
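As a minimal sketch, Equations 2 and 3 can be implemented with exact rational arithmetic; the function name and coefficient representation below are ours, not the paper's:

```python
from fractions import Fraction

def lagrange_interpolate(points):
    """Recover, exactly, the coefficients (lowest degree first) of the unique
    polynomial of degree n passing through n + 1 points, using the Lagrange
    basis polynomials of Equation 3."""
    coeffs = [Fraction(0)] * len(points)
    for i, (xi, yi) in enumerate(points):
        # Build l_i(x) = prod_{j != i} (x - x_j) / (x_i - x_j), Equation 3.
        basis, denom = [Fraction(1)], Fraction(1)
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            # Multiply the basis polynomial by (x - x_j).
            new = [Fraction(0)] * (len(basis) + 1)
            for k, c in enumerate(basis):
                new[k + 1] += c       # contribution of x * c x^k
                new[k] -= xj * c      # contribution of -x_j * c x^k
            basis = new
            denom *= xi - xj
        # Accumulate y_i * l_i(x) into P(x), Equation 2.
        for k, c in enumerate(basis):
            coeffs[k] += yi * c / denom
    return coeffs
```

Exact rationals avoid the floating-point error that would otherwise make recovered coefficients fail an equality (or hash) check.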
    </sec>
    <sec id="sec-6">
      <title>Fuzzy Vault</title>
      <p>
        A Fuzzy Vault is a cryptographic construct used for secret sharing,
introduced in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and, in brief, it works as follows. Alice wants
to protect a secret by locking it up with a key, which is a set
of items. This is the Lock phase. Bob may have access to
this secret only if he uses a key, consisting of a set of items, which is
substantially similar to Alice’s key. This is the Unlock phase. To
assure privacy, noise is also injected during the Lock phase.
      </p>
      <p>Next, we will further explain the details of this process and
employ a running example to facilitate the reader. The original Fuzzy
Vault construct was designed for numerical values, which we
will use to illustrate its operation. However, as our work focuses
on strings, we will lay out in later sections our methodology for
achieving suitable transformations in order to adapt this concept
for alphanumeric data.</p>
      <p>3.3.1 Lock Phase. To illustrate the operation of a Fuzzy Vault,
we will commence with the Lock phase and consider, as stated
earlier, that Alice wants to lock her secret using her key, K.
For this purpose, she uses a polynomial P of degree n so that
P’s coefficients are, in fact, mappings of her secret. Then, she
uses the members of her key K as x-coordinates for P, creating
a collection of points R. As K is a set, its elements are unique,
thus the x-coordinates of R’s points are going to be unique
as well.</p>
      <p>
        Let us now denote by |K| the cardinality of K. Equation
4 illustrates the restriction that has to be satisfied with regard
to the size of the polynomial used. This is due to Lagrange’s
interpolation method, which requires n + 1 data points in order
to return a polynomial of degree at most n [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The key set K
holds the abscissas of the data points that Lagrange will later
interpolate in the unlock phase.
      </p>
      <p>|K| ≥ n + 1 (4)</p>
      <p>Now, Alice will also create noise in order to protect the Fuzzy
Vault’s contents, thus offering privacy. To do so, she introduces a
distraction set of points, designated by C and referred to as Chaff points.
These points do not belong to the polynomial’s curve. Their
purpose is to confuse an attacker by preventing her from identifying
the points in V that have resulted from the Lock phase. As such,
collection C may be considered as the protection of the vault.
The union of collections R and C comprises the vault V, or more
formally: V = R ∪ C. In fact, K is considered to be the key
of the vault V, since it identifies the collection of real points R.
As chaff points aim to confuse an attacker, the cardinality of C is
of a greater order of magnitude than the cardinality of set R. We
define this set of chaff points as illustrated in Equation 5:
C = {(x′, y′) : x′ ∉ K, y′ ≠ P(x′)}, with |C| &gt;&gt; |R| (5)
At this point, we will make this process clearer through
Example 3.1, with its illustration in Figure 1.</p>
      <p>Example 3.1. Let us assume that the secret which Alice wants
to lock is 2317 and that she holds a key, which is the
set K = {1, 2, 3, 4, 5, 6}. She creates a polynomial P of degree
n = 3 which has the secret embedded into its coefficients: P(x) =
2x^3 + 3x^2 + 1x + 7. The next step involves the
projection of her key on the polynomial P.</p>
      <sec id="sec-6-1">
        <title>Table 1: Notation</title>
        <p>n: Degree of polynomial;
K: Key;
f: Size of fragment;
P: Polynomial;
r: Plaintext record;
V: Fuzzy Vault;
RS: Reference Set;
C: Chaff Points;
R: Original Points;
H: Verification Hash;
Superscripted A (e.g., K^A): Alice’s;
Superscripted B (e.g., K^B): Bob’s;
|·|: Size (e.g., |RS|: Reference Set size).</p>
      </sec>
      <sec id="sec-6-2">
        <p>Alice’s key
is the set K = {1, 2, 3, 4, 5, 6}. By projecting each of these points,
Alice ends up with a collection R of points in a two-dimensional
space, R = {(x, P(x))}, ∀x ∈ K. For our example, this results in:
R = {(1, 13), (2, 37), (3, 91), (4, 187), (5, 337), (6, 553)}. In our example,
R’s cardinality is |R| = 6.</p>
        <p>Next, Alice creates the distraction set C, comprising random
chaff points which do not lie on the polynomial. Finally, Alice
combines and shuffles the two collections in order to create the
vault V, which contains both genuine and chaff points, which
are indistinguishable. As stated before, the size of C should
significantly exceed that of R. However, for presentation purposes
and without harm to the general case, we will only consider four
points in the chaff set, with abscissas {7, 8, 9, 10}. For our toy example, this
will yield: V = {(1, 13), (2, 37), (3, 91), (4, 187), (5, 337), (6, 553), . . . ,
(9, 560), (10, 700)}.</p>
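The Lock phase of Example 3.1 can be sketched as follows; the chaff_factor and field parameters are illustrative assumptions, not values from the paper:

```python
import random

def lock(secret_coeffs, key, chaff_factor=10, field=10**6):
    """Sketch of the Lock phase: project the key onto the secret polynomial
    and hide the genuine points R among random chaff points C (Equation 5)."""
    def P(x):
        # Evaluate the polynomial; coefficients are lowest degree first.
        return sum(c * x ** i for i, c in enumerate(secret_coeffs))

    genuine = [(x, P(x)) for x in key]                  # collection R
    chaff = []
    while len(chaff) != chaff_factor * len(genuine):    # |C| >> |R|
        x, y = random.randrange(field), random.randrange(field)
        if x not in key and y != P(x):                  # must not lie on P
            chaff.append((x, y))
    vault = genuine + chaff                             # V = R ∪ C
    random.shuffle(vault)                               # genuine and chaff indistinguishable
    return vault
```

Shuffling matters: without it, the first |R| positions of the vault would reveal exactly the genuine points.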
        <p>3.3.2 Unlock phase. Having seen how Alice locks a secret,
let us now describe the Unlock phase, where Bob will attempt to
unlock the Fuzzy Vault V he received from Alice and get access
to Alice’s secret. For this purpose, Bob will have to use a key
set of his own in order to unlock V. This will only happen in the case
that his key substantially overlaps with Alice’s key, used
to create V. If so, Bob may distinguish enough genuine points of
the vault V belonging to the collection R.</p>
        <p>
          However, Bob may also identify some distraction points
belonging to the chaff set C too. Bob’s ability to eventually
reconstruct the original polynomial depends on the cardinality of Bob’s
key. If the key’s cardinality is smaller than the polynomial’s
degree, then interpolation methods will not be able to recover the
proper polynomial [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Therefore, Equation 4 has to be satisfied
for Bob’s key as well. Unlocking the vault requires, first, that
Bob finds all the vault’s points that have the same abscissas as
any of his key’s elements, or more formally:
{(x, y) ∈ V : x belongs to Bob’s key} (6)
        </p>
        <p>Distraction points might exist because Bob’s key might also overlap
with the collection of random chaff points C. If the two keys do not
significantly overlap, then the polynomial reconstruction cannot
be performed. Example 3.2 illustrates this process.</p>
        <p>
          Example 3.2. Let us consider, in our running example, that Bob,
with his key set {1, 2, 4, 5, 8, 9}, tries to unlock V. The
collection of candidate points for Bob will be: {(1, 13), (2, 37), (4, 187), (5, 337),
(8, 360)}. Bob has to reconstruct P in order to access the vault’s
contents. For our case, let us consider that this occurs through
Lagrange interpolation [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. We note here that part of this collection for
Bob’s key is the point (8, 360), which comprises a noise element
of V and as such does not belong to the original polynomial.
Consequently, each of the combinations that include the noise
element will return a false polynomial. Note here that the
cardinality of Bob’s key equals that of Alice’s key, namely 6. In this case, we have chosen
both Alice and Bob to have keys of the same size.
        </p>
        <p>Nevertheless, Bob may have a key with
a different size than Alice’s, given that it has at least one element
more than the degree of the polynomial used. Since Lagrange
interpolation requires n + 1 genuine points in order to reconstruct
the polynomial which corresponds to them, if Bob’s key has fewer
than n + 1 elements, Bob will fail to unlock the vault. On the other
hand, if it has more than n + 1 elements, Bob has to create all
possible (n + 1)-sized combinations of his candidate points and use them
in order to interpolate the polynomial. Now, we should take into
consideration that each one of the possible combinations will
return a polynomial, but Bob somehow has to decide which one
is the correct one (there is no need to calculate all of them).</p>
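A minimal sketch of this combination-based unlocking follows, assuming an MD5 verification hash of the concatenated coefficients (as described in Section 4.2.2); the function names are ours:

```python
import hashlib
from fractions import Fraction
from itertools import combinations

def _interpolate(points):
    # Minimal exact Lagrange interpolation (coefficients lowest degree first).
    coeffs = [Fraction(0)] * len(points)
    for i, (xi, yi) in enumerate(points):
        basis, denom = [Fraction(1)], Fraction(1)
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            new = [Fraction(0)] * (len(basis) + 1)   # multiply basis by (x - x_j)
            for k, c in enumerate(basis):
                new[k + 1] += c
                new[k] -= xj * c
            basis = new
            denom *= xi - xj
        for k, c in enumerate(basis):
            coeffs[k] += yi * c / denom
    return coeffs

def unlock(vault, key, degree, expected_hash):
    """Sketch of the Unlock phase: keep the vault points whose abscissas occur
    in Bob's key (Equation 6), then try (degree + 1)-point combinations until
    a recovered polynomial's verification hash matches."""
    candidates = [(x, y) for (x, y) in vault if x in key]
    for combo in combinations(candidates, degree + 1):
        if len({x for x, _ in combo}) != degree + 1:   # abscissas must differ
            continue
        coeffs = _interpolate(list(combo))
        digest = hashlib.md5("".join(str(c) for c in coeffs).encode()).hexdigest()
        if digest == expected_hash:
            return coeffs          # vault unlocked: the secret polynomial
    return None                    # keys not similar enough
```

Only combinations consisting entirely of genuine points reproduce the original polynomial; any combination containing a chaff point yields a different polynomial, whose hash fails the check.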
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Reference Set</title>
      <p>
        A Reference Set is a corpus of data used as a means of
preprocessing when matching sensitive information. This corpus may
either be publicly available or pre-agreed among the matching
parties, Alice and Bob in our case. It works as follows.
Considering that Alice and Bob wish to match their data without revealing
any information to each other, they use some mapping function,
also pre-agreed and the same for both, which relates
each of the values in their dataset to one or more values in the
Reference Set. Then, instead of comparing their data directly,
they are able to compare the results of the mapping function.
Such a construct has been used in the privacy preserving record
linkage context either for matching [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] or for blocking [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Example 3.3 illustrates how a Reference Set may be used.
      </p>
      <p>
        Example 3.3. In this example we are using Levenshtein
distance [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], indicated by d(·), as the mapping function. We still
consider Alice and Bob, with Alice holding the word ‘error’, while
Bob holds the word ‘erasure’, and they wish to privately match
them. For the sake of presentation, the Reference Set contains the
single word ‘energy’. In this case, for Alice, d(error, energy) = 4,
while for Bob, d(erasure, energy) = 6. Since 4 ≠ 6, we may
easily deduce that Alice and Bob hold different values.
      </p>
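A standard dynamic-programming implementation of the Levenshtein distance used in this example might look as follows:

```python
def levenshtein(a, b):
    """Classic edit distance, kept to two rows of the DP matrix."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```

It reproduces the distances of Example 3.3: 4 for Alice and 6 for Bob against the reference word ‘energy’.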
    </sec>
    <sec id="sec-8">
      <title>METHOD DESCRIPTION</title>
      <p>Having examined the operation of the Fuzzy Vault for numerical
data, let us now describe our approach for exploiting this
mechanism for approximate private string matching, which will allow
us to perform privacy preserving record linkage. We consider two
honest-but-curious matching parties, Alice and Bob. This means
that neither of the two participants exhibits malicious
behavior attacking the operation of the protocol, but each tries
to infer as much information as possible without deviating from
the protocol.</p>
    </sec>
    <sec id="sec-9">
      <title>Protocol Description</title>
      <p>Let us begin by providing a simple protocol for using Fuzzy
Vaults for privacy preserving record linkage. For this purpose,
a separate Fuzzy Vault, and as such a separate polynomial, has
to be created and secretly shared for each record in a database.
We will consider, for our case, that alphanumeric identifiers are
used. As our focus is on privately matching records, we assume
that either Bob and Alice share a common schema, or they
have different schemas but corresponding fields on the use of
which they have agreed beforehand, or they use some
privacy-preserving schema matching method.</p>
      <p>The protocol we are describing is asymmetric. This means
that the two parties do not perform exactly the same steps, but they
cooperate in order to achieve their common goal of privately
linking their records. These steps are illustrated in Algorithm 1.
First, Alice converts all of the records in her database to Fuzzy
Vaults, resulting in one Fuzzy Vault for each record. Then, she
transmits these Fuzzy Vaults to Bob through a channel considered
secure. Bob, in his turn, creates his Fuzzy Vaults and, using the
resulting keys, attempts to unlock the received vaults, resulting
in a list of matches. When the process concludes, he sends the ids
of the matching Fuzzy Vaults back to Alice. The procedure
concludes with both Alice and Bob exchanging their actual matching
records.</p>
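These steps can be sketched as follows; the lock and try_unlock callables and the dictionary-based record ids are illustrative assumptions, not the paper's exact interfaces:

```python
def link_records(alice_records, bob_records, lock, try_unlock):
    """Hedged sketch of the asymmetric protocol: lock and try_unlock stand in
    for the vault routines described in Section 4."""
    # Alice: one Fuzzy Vault per record, transmitted over a secure channel.
    vaults = {a_id: lock(rec) for a_id, rec in alice_records.items()}
    # Bob: for each of his records, attempt to unlock every received vault.
    matches = [(a_id, b_id)
               for a_id, vault in vaults.items()
               for b_id, rec in bob_records.items()
               if try_unlock(vault, rec)]
    # Bob returns the matching vault ids to Alice; the parties then
    # exchange the actual matching records.
    return matches
```

Only the vaults (not plaintext records) cross the channel; the plaintext exchange at the end is restricted to the records already identified as matches.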
      <p>At this point, the question that arises is how these
alphanumeric fields are going to be transformed into polynomials and,
eventually, into Fuzzy Vaults. This is the process we are going to
describe next.</p>
    </sec>
    <sec id="sec-10">
      <title>String Encoding for Locking</title>
      <p>Beginning with the locking phase, Alice will secure strings inside
Fuzzy Vaults. Each string is the concatenation of all the string
fields in the record that will be used for matching. In order for
such a string to be locked, and to be able to be unlocked afterward
by Bob with a fairly similar string, it should be somehow associated
with the polynomial used to lock it. As such, our method
irreversibly embeds the string into the polynomial’s coefficients. We cannot
use a direct representation of the string, since this would
risk exposing the key as the abscissas of the genuine points
of the vault, allowing an adversary obtaining the key to directly
recover the string. Additionally, as we will describe later on, Bob
will use this property after unlocking the Fuzzy Vaults, in order
to verify whether the match he achieved with Alice’s secret is identical
or approximate.</p>
      <p>To be able to adapt the Fuzzy Vault scheme so as to exhibit
privacy preservation properties for record linkage, a series of
steps is necessary.</p>
      <sec id="sec-10-1">
        <title>Algorithm 2: Mapping a Record to a Polynomial.</title>
        <p>The broad picture of these steps is illustrated
in Figure 2. First of all, a record, which plays the role of the string
to be secured, has a double role in the whole process, as it is
used both for creating a polynomial and for building a key for
encoding this record. This polynomial and the key are then used
to create the set of Genuine Points, which, in combination with
the generated chaff points, will result in a Fuzzy Vault. At this
point, we have to mention that the polynomial representation of
the record is also used to produce the verification data, also used
during the unlocking phase.</p>
        <p>4.2.1 Polynomial generation. Let us begin by examining
the process that generates the aforementioned polynomial. The
mapping algorithm we propose is illustrated in Algorithm 2. First,
the alphanumeric fields of the record to be locked are
concatenated so as to form a single string (line 2). Then, this string is
fragmented into substrings of size equal to f (line 3), with f
being a user-defined parameter, an approach we adopted for
security purposes. This is particularly useful for long strings. Each
character of each fragment is converted to its ASCII code
equivalent. The concatenation of these codes creates a number, which
will form a polynomial coefficient (lines 4-11). The remaining
coefficients of the polynomial are set to 1. Example 4.1 further
clarifies the operation of this algorithm. On the other hand, if
the string exceeds the length of 27 characters, either a higher
degree polynomial should be created or some characters should
be skipped.</p>
        <p>Example 4.1. Let us assume, as illustrated in Figure 3, that f =
3, meaning that each fragment will consist of three characters,
and that we wish to encode the name JOE_DOE. Let us also
assume that the degree of the polynomial we wish to use is n = 8.
This will yield a polynomial of the form: P(x) = a_8 · x^8 + a_7 · x^7 + . . . +
a_2 · x^2 + a_1 · x^1 + a_0. Given our settings for f and n, this polynomial
has a capacity of embedding 9 · 3 = 27 characters, because it
has 9 coefficients. Moreover, there are 26^3 = 17576 possible
combinations for each coefficient, because each coefficient can
hold up to three letters and there are 26 available letters (we take
into consideration only capital letters), and 26^(3·9) total possible
polynomials. In our example, the encoded string has fewer than 27
characters. In this case, the remaining coefficients are all set to 1.</p>
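The fragmentation and ASCII encoding of Algorithm 2 can be sketched as follows; this is a simplified reading of lines 2-11, with the function name and error handling being ours:

```python
def record_to_coeffs(fields, f=3, n=8):
    """Sketch of Algorithm 2: concatenate the matching fields, split the
    string into fragments of size f, and turn each fragment's ASCII codes
    into one polynomial coefficient; unused coefficients are set to 1."""
    s = "".join(fields)
    frags = [s[i:i + f] for i in range(0, len(s), f)]
    if len(frags) > n + 1:
        # Longer strings need a higher-degree polynomial or skipped characters.
        raise ValueError("string exceeds the polynomial's capacity")
    coeffs = [int("".join(str(ord(ch)) for ch in frag)) for frag in frags]
    coeffs += [1] * (n + 1 - len(coeffs))   # remaining coefficients set to 1
    return coeffs
```

For JOE_DOE with f = 3 and n = 8 this yields the fragments JOE, _DO and E, encoded as 747969, 956879 and 69, followed by six coefficients equal to 1.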
        <p>
          4.2.2 Verification. The aim of the verification step is, as its
name suggests, to verify at the unlocking phase that the recovered
polynomial matches the encoded one. In our approach, we use
the method proposed in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. This method is based on auxiliary
data generated during the polynomial construction phase. An
MD5 hash H, which derives directly from the created polynomial,
is created having as input the concatenation of the polynomial’s
coefficients.
        </p>
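A minimal sketch of this verification data, assuming the coefficients are concatenated as decimal strings:

```python
import hashlib

def verification_hash(coeffs):
    """Verification data as in Section 4.2.2: an MD5 digest of the
    concatenated polynomial coefficients."""
    return hashlib.md5("".join(str(c) for c in coeffs).encode()).hexdigest()
```

At unlock time, Bob hashes each candidate polynomial's coefficients and compares the digest against H; a match confirms he recovered the original polynomial.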
        <p>
          4.2.3 Key Generation. Building a polynomial is one of the two
steps required toward generating a record’s set of genuine points.
The second step regards the generation of the key. Since the
locking and the unlocking keys are the two components which
are compared, the key in our method should be directly
linked with the string. The caveat here is to avoid a one-to-one
connection between the key and the string, as a measure
against record multiplicity attacks, described in [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. For instance,
in the case that an adversary somehow gets access to the key,
our method should not allow the recovery of the secret string.
        </p>
        <p>As such, the need to devise a one-way mapping emerged. To
this end, we employ a Reference Set in order to associate its values
with the string resulting from the record’s concatenated fields,
by means of a string similarity function. Then, the most similar
values are selected and they are transformed, in an irreversible
manner, to a set of numbers comprising the key for the locked
string, without, however, revealing any information about it.</p>
        <p>Let us now see how this works. The generation of a locking
key for a single record is detailed in Algorithm 3. This algorithm
should be applied on each record in the recordset, resulting in a
set of separate keys, one for each record. This algorithm requests
as input the concatenated alphanumeric fields of record r, as a
single string, to be encoded, a Reference Set RS that we are going
to use for encoding, and the size |K| of the key K which we are
going to produce.</p>
        <p>Our algorithm also requires the use of a string similarity, or
distance, function F(·). This is applied on the pairs formed by r
and each of the values rs_value of the Reference Set RS, resulting
in an array of similarity scores sim_scores (lines 1-4). sim_scores
is then used to partially sort RS. As our aim is to retrieve the top
|K| values of RS, a full sort is unnecessary. The sort order
depends on the type of measure used for F(·). If F(·) is a similarity
function, then sorting occurs in descending order, as we need to
retrieve the most similar of RS's values with r. Otherwise, sorting
in ascending order is performed. Afterward, closest_neighbors are
extracted from RS, which are the top |K| most similar Reference
Set values with r (lines 6-9). Next, we are going to build K by
applying a series of transformations on each of closest_neighbors,
resulting in a numerical set.</p>
      </sec>
      <sec id="sec-10-2">
        <title>Algorithm 3: Key Generation.</title>
        <p>Input:
- r: A record with concatenated alphanumeric fields
- RS: Reference Set
- |K|: Size of resulting key
Output:
- K: Resulting key
1 sim_scores ← ∅;
2 foreach rs_value ∈ RS do
3   sim_scores.append (F (r, rs_value));
4 end
5 Partial_Sort (RS, sim_scores, |K|);
6 closest_neighbors ← ∅;
7 for i ← 0 to |K| by 1 do
8   closest_neighbors.append (RS[i]);
9 end
10 K ← ∅;
11 for i ← 0 to |K| by 1 do
12   hash ← Hash (closest_neighbors[i]);
13   seed ← hash;
14   rand_value ← Rand (seed, 0, 10^6);
15   K.append (Cosine (rand_value));
16 end</p>
        <p>
          The transformations that each string closest_neighbors[i]
undergoes are as follows (lines 10-16). Initially, closest_neighbors[i] is
hashed through a simple hash function so as to create a numerical
equivalent hash (line 12). Then, hash is used as an input seed in a
random number generator which produces rand_value,
ranging between [0, 10^6] (lines 13-14). The series of transformations
concludes with applying the cosine function on rand_value,
resulting in a value between [−1, 1], which is now a part of K
(line 15).
        </p>
        <p>It is evident that after applying this series of mappings we
have managed to perform a one-way transformation between
the initial string that consists of the concatenation of the record's
fields and the numbers that comprise K. This is due to the use
of the numerical representation of the string as a seed and the
periodic nature of the cosine function.</p>
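The key-generation pipeline of Algorithm 3 can be sketched as follows. MD5 as the "simple hash function", Python's random module as the generator, and a full sort in place of Partial_Sort are illustrative assumptions, not the prototype's exact choices:

```python
import hashlib
import math
import random

def key_element(neighbor):
    """One-way transform of a Reference Set neighbor into a key element:
    hash -> seed -> pseudo-random integer in [0, 10^6] -> cosine in [-1, 1]."""
    seed = int(hashlib.md5(neighbor.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    rand_value = rng.randint(0, 10**6)
    return math.cos(rand_value)

def generate_key(record, reference_set, key_size, distance):
    """Algorithm 3 sketch: pick the key_size most similar Reference Set
    values (here, smallest distance) and transform each irreversibly."""
    neighbors = sorted(reference_set, key=lambda rs: distance(record, rs))[:key_size]
    return [key_element(n) for n in neighbors]
```

Because both parties use the same Reference Set and the same deterministic transform, similar records yield keys sharing many elements, which is what the unlocking step exploits.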
        <p>4.2.4 Fuzzy Vault Construction. At this point, three steps
remain for constructing a Fuzzy Vault: generating Genuine
Points, generating Chaff Points and, finally, blending these
together.</p>
        <p>
          Genuine Points Generation. Now that we have created the key
K and the polynomial p required by the Fuzzy Vault scheme to
encode a secret, we project K over p, as indicated in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]:
G = {(k, p(k))}, ∀k ∈ K, resulting in a set of |K| points considered to
be the set G of genuine points in the Fuzzy Vault.
        </p>
        <p>
          Noise as Chaff Points. After the creation of the genuine points,
chaff points should also be generated, ensuring the fuzzy vault's
security. Noise comprises a standard method for enhancing privacy
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In our method we adhere to this principle by generating
chaff points which aim at distracting a potential attacker from
identifying the points that belong to the actual polynomial. A
naive approach would be to randomly generate points based on
one or more distributions. However, we consider that this would
allow for identifying and phasing out noise quite easily. As such,
we designed a novel approach, more suitable for our case, which
is based on the use of Reference Sets. We use abscissas from
the Reference Set we employ, while the ordinates are randomly
chosen, with the restriction that the resulting points do not
lie on the polynomial. This way, the chaff set C is built.</p>
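Putting the pieces together, vault construction might be sketched as below. The function name, the ordinate range, and passing pre-transformed chaff abscissas as a parameter are assumptions for illustration:

```python
import random

def build_vault(key, coeffs, chaff_abscissas, rng=None):
    """Sketch of Fuzzy Vault construction: project key values onto the
    polynomial (genuine points), give Reference-Set-derived chaff
    abscissas ordinates off the polynomial, and shuffle everything."""
    rng = rng or random.Random()

    def p(x):  # Horner evaluation; coeffs = [a_d, ..., a_0]
        y = 0
        for a in coeffs:
            y = y * x + a
        return y

    genuine = [(x, p(x)) for x in key]
    chaff = []
    for x in chaff_abscissas:
        y = rng.uniform(-10**6, 10**6)  # illustrative ordinate range
        if y != p(x):  # chaff must not lie on the polynomial
            chaff.append((x, y))
    vault = genuine + chaff
    rng.shuffle(vault)
    return vault
```

Because chaff abscissas come from the same transform as genuine ones, the two populations are indistinguishable to anyone inspecting the vault.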
        <p>Final Step. Eventually, the points of these sets are shuffled in
order for genuine and chaff points to be mixed, applying formula
5 so as to create FV. A Fuzzy Vault created with this method is
illustrated in Figure 4. At this point, the data encoding from
Alice's side concludes. Alice sends Bob the Fuzzy Vaults she has
constructed together with the verification hashes, one pair for
each record. This concludes the first step (line 1) of the protocol
illustrated in Algorithm 1. To further clarify our approach, we
will use Example 4.2:</p>
        <p>
          Example 4.2. Let us assume that “JOHN_SNOW” is the string
Alice wishes to lock. Let us also assume that F(·) is going to
be the Levenshtein distance [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. First, she encodes the string into
a polynomial, using the fragmentation of the respective ASCII
codes, with f = 3: p(x) = 747972x^8 + 789583x^7 + 787987x^6 + 1x^5 +
1x^4 + 1x^3 + 1x^2 + 1x^1 + 1x^0.
        </p>
        <p>Based on the representation of the polynomial, she also
constructs the verification hash, by concatenating the coefficients
of the polynomial so as to form “747972789583787987111111”.
Applying MD5 results in: “2d73f7e7425e2d20264a799e24715317”,
which will be delivered together with the Fuzzy Vault FV to Bob.</p>
        <p>
          Then, Alice will have to build her key K. For this purpose,
she selects the top |K| = 15 elements of the Reference Set which
are most similar to the word “JOHN_SNOW”. For instance,
the first element is “JONES_OWEN”, which has F(JOHN_SNOW,
JONES_OWEN) = 5, while the last one is the string “XU_BO” with
F(JOHN_SNOW, XU_BO) = 7. Now, she hashes each of the selected
strings into numbers and then the numbers are mapped into the
range [−1, 1] using the cosine function; for “JONES_OWEN” this yields
−0.355866.
        </p>
        <p>Next, in order to create the genuine points, Alice projects the
elements of her key K onto the polynomial p. For instance, for
the first key element this yields p(−0.355866) = 1222.8648. The
rest are accordingly projected. When this step concludes, G will
have been built.</p>
        <p>Eventually, Alice will have to build the set of chaff points C
and then calculate its union with G so as to build the Fuzzy Vault
FV. FV, coupled with the verification hash V, is delivered to Bob.
At this point, we assume that the second step (line 2) of Algorithm
1 has been successfully completed, thus Alice's records have
been successfully transmitted to Bob in the form of Fuzzy Vaults,
accompanied by their Verification Data, and we are moving to
steps 3 and 4. Now, Bob will attempt to unlock these Fuzzy Vaults,
thus linking his records with Alice's. To do so, he will have to
perform the actions illustrated in Figure 5. First, he will have
to create a key for each of his records. Then, all of his keys are
going to be tested against each of Alice's Fuzzy Vaults, also using
each vault's Verification Data. Let us see these in detail.</p>
        <p>
          4.3.1 Verification. As described in Section 3.3, a Fuzzy Vault
is unlocked by a key similar to the one that locked it. It is evident
that, for this to happen, Bob should create his unlocking keys
in a way identical to the process that Alice followed to create
her locking keys. Thus, Bob builds each of his unlocking keys,
for each of his records, using Algorithm 3. Afterwards, he
will calculate the intersections of all the unlocking keys with
all the Fuzzy Vaults, forming a series of Common Sets, each of
which contains (x, y) pairs. Each Common Set CS is then used
for reconstructing the polynomial. Adhering to the literature
[
          <xref ref-type="bibr" rid="ref20 ref25 ref6">6, 20, 25</xref>
          ], this is achieved through Lagrange interpolation [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] on
the common points CS, to create an interpolating polynomial.
        </p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>4.4 Approximate Matching</title>
      <p>At this point, our methodology for approximate matching is
applied. The key point for achieving approximation with a Fuzzy
Vault is the step of the reconstruction of the polynomial described
by the Fuzzy Vault. As mentioned earlier, an interpolated
polynomial of d-th degree needs at least d + 1 interpolating points
in order to be created. In our case, the polynomial is secured
in a Fuzzy Vault as |K| genuine pairs (x, y). Therefore, when
|K| = d + 1, in order to interpolate a d-th degree polynomial, the
two keys have to be identical.</p>
      <p>However, if we build a key K having a size greater than
the minimum number of points that a d-th degree polynomial
needs in order to be recovered, i.e., |K| &gt; d + 1, we can exploit
these redundant (x, y) pairs to achieve approximation. In other
words, since d + 1 points are required to unlock a d-th degree
polynomial and we project a key with size larger than d + 1, we
may use any d + 1 points of K to unlock the Fuzzy Vault. As
such, an unlocking key need only match d + 1 of K's points, with
|K| &gt; d + 1, thus leading to linking non-identical records. On the other hand,
as |K| increases, the probability of having common elements
in different keys increases too. As such, too large keys would
result in increased numbers of false positives.</p>
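The subset-based unlocking described above can be sketched as follows. Exact rational arithmetic (Python's fractions) stands in for the prototype's floating-point interpolation so that recovered integer coefficients compare cleanly, and the coefficient comparison stands in for the verification-hash check:

```python
from fractions import Fraction
from itertools import combinations

def lagrange_coeffs(points):
    """Interpolating polynomial through `points`, returned as exact
    coefficients [a_0, ..., a_d] via Lagrange's formula."""
    n = len(points)
    coeffs = [Fraction(0)] * n
    for i, (xi, yi) in enumerate(points):
        basis = [Fraction(1)]           # coefficients of prod_{j!=i} (x - x_j)
        denom = Fraction(1)
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            nxt = [Fraction(0)] * (len(basis) + 1)
            for k, c in enumerate(basis):
                nxt[k] -= c * xj        # contribution of the constant -x_j
                nxt[k + 1] += c         # contribution of the x term
            basis = nxt
            denom *= Fraction(xi - xj)
        scale = Fraction(yi) / denom
        for k, c in enumerate(basis):
            coeffs[k] += scale * c
    return coeffs

def try_unlock(common_points, degree, expected_coeffs):
    """Try every (degree + 1)-subset of the common points; succeed if any
    subset reconstructs the expected polynomial (the real protocol would
    compare verification hashes instead of raw coefficients)."""
    for subset in combinations(common_points, degree + 1):
        if lagrange_coeffs(list(subset)) == list(expected_coeffs):
            return True
    return False
```

With |K| &gt; d + 1 common points, a handful of non-genuine pairs in the Common Set is tolerated, since only one clean (d + 1)-subset is needed.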
      <p>Now, after recovering the polynomial from CS, Bob creates
a Verification Hash V′ in a way identical to how Alice created
V. Then these two are checked for a match. If the unlocking key sufficiently
overlaps K and V = V′, the two polynomials will be the
same and there is a match. On the other hand, if the two sets
are not similar enough, verification will fail. If the polynomial
is classified as a match, then the string for which the vault has
been created is considered to be the same as the string Bob
holds. To facilitate the reader, we now provide Example 4.3:</p>
      <p>Example 4.3. Continuing Example 4.2, Bob tries to unlock
a Fuzzy Vault he received from Alice using his record
“JOHNY_SNOW”. First, he creates his key the same way Alice
did, using the same fragment size. After the creation of his key,
Bob has to find the common points between his key and the
fuzzy vault. In this case, Bob managed to find n = 10 common
points. Thereby, by utilizing Lagrange interpolation, he managed
to recover a polynomial of 8th degree by using 9 out of the 10
common points (he might have to try all the combinations) as
shown: p(x) = 747972x^8 + 789583x^7 + 787987x^6 + 1x^5 + 1x^4 +
1x^3 + 1x^2 + 1x^1 + 1x^0. Last but not least, Bob has to verify the
recovered polynomial, because he has no knowledge of whether it is the
correct one or not. Therefore, he calculates the MD5 hash of the
concatenation string of his polynomial's coefficients, which in
our case is “2d73f7e7425e2d20264a799e24715317”. Comparing the
two hashes, Bob can assess that the two strings match.</p>
    </sec>
    <sec id="sec-12">
      <title>4.5 Protocol Conclusion</title>
      <p>Eventually, after Bob has decided which of his records match
Alice's, he transmits a list of the matching records he recovered,
requesting their entire data. Alice responds and exchanges
her corresponding data with Bob (lines 5, 6 of Algorithm 1).</p>
    </sec>
    <sec id="sec-13">
      <title>5 PRIVACY DISCUSSION</title>
      <p>In our setup, we consider that Alice and Bob exhibit an honest-but-curious
behavior, meaning that they try to gather as much
additional information as they can while adhering to the protocol.</p>
      <p>
        A Fuzzy Vault's security properties rely on the number of chaff
points used while constructing it [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. We remind the reader of our notation,
where d represents the degree of the polynomial used, |C| the
number of generated chaff points and |G| the number of real,
genuine points. In order to correctly reconstruct the
polynomial, d + 1 of the genuine points
must be used. Therefore, any adversary who has access to a
fuzzy vault and wants to unlock it has to try all possible
combinations of d + 1 points and make reconstruction attempts. Even
in this case, the adversary has to know the polynomial degree d
which has been used.
      </p>
      <p>Without any extra information (other than the polynomial
degree), any adversary is required to examine binom(|G| + |C|, d + 1) point
combinations. From all the combinations of d + 1 points that can be
extracted from the vault, only binom(|G|, d + 1) combinations are able to
unlock it. Therefore, the probability of correctly guessing a suitable
combination is given by Equation 7.</p>
      <p>binom(|G|, d + 1) / binom(|G| + |C|, d + 1)    (7)</p>
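For illustration, Equation 7 can be evaluated directly; the figures |G| = 15, |C| = 200 and d = 8 match the running example of this section:

```python
from math import comb

def guess_probability(num_genuine, num_chaff, degree):
    """Probability that a randomly chosen (degree + 1)-subset of vault
    points consists solely of genuine points (Equation 7)."""
    return comb(num_genuine, degree + 1) / comb(num_genuine + num_chaff, degree + 1)
```

For |G| = 15, |C| = 200 and d = 8 this probability is on the order of 10^-12, which is what makes a brute-force attack impracticable.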
      <p>Consequently, a brute force attack on a well-concealed fuzzy
vault is considered impracticable. In our approach, to further
increase privacy, we employ a series of further enhancements.
First, there are the chaff points, which comprise the Fuzzy Vault.
Then, there is the use of a Reference Set and, finally, the use of
non-bijective mapping functions.</p>
      <p>Chaff points are responsible for concealing the vault's secret
which, in our case, is a record r. We have taken the following
measures in order to further enhance this property. First of all,
regarding the use of the Reference Set and mapping functions:
sensitive data are not encoded themselves within the Fuzzy Vault.
What is encoded with the Reference Set is the output of a distance
or similarity function. Furthermore, the output of this function is
fed to a non-bijective function, thus rendering both the
polynomial representation and the key generation processes irreversible.</p>
      <p>To continue, to avoid frequency attacks, instead of using a
random function to generate points, the chaff generation
technique we propose is based on the use of a Reference Set. This way,
the abscissas of the points delivered to Bob are indistinguishable
from the original points. Consider that the vault has 15 genuine
points and 200 chaff points whose abscissas are also derived
from the Reference Set, the same as for the genuine points. Bob
knows all the possible abscissas which are produced by the
Reference Set, but our method uses these abscissas as noise. Therefore,
there is no information leakage which can help Bob unlock
the vault without having a proper key.</p>
      <p>
        Furthermore, it should be noted that, in contrast
with the biometrics applications utilizing Fuzzy Vaults [
        <xref ref-type="bibr" rid="ref25 ref6">6, 25</xref>
        ],
the leakage of a key may only affect a single vault. This is due
to the fact that we employ a separate polynomial for each string
to be locked in the Fuzzy Vault. Additionally, the degree of the
polynomial and the size of the chaff set also affect a Fuzzy Vault's
offered privacy. To further harden the Fuzzy Vault, the degree
of the polynomial may be increased. Increasing the size of the
chaff set will favor privacy as well.
      </p>
    </sec>
    <sec id="sec-14">
      <title>6 EXPERIMENTAL EVALUATION</title>
      <p>
        In this section, we will provide empirical evidence regarding the
matching performance and the time characteristics of our method.
Finally, we will compare the Fuzzy Vault method we propose
against a baseline privacy preserving matching method based on
phonetic codes [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For our evaluation, we have implemented a
prototype in Python 3¹, enhanced by a series of packages².
      </p>
    </sec>
    <sec id="sec-15">
      <title>6.1 Setup</title>
      <p>
        In our experiments, we have used samples from real-world data
originating from the North Carolina voters database³. To evaluate
our approach, we have used ‘last name’ and ‘first name’ from this
database as matching attributes, assuming they comprise a
candidate key. Sampling this database, we constructed two datasets,
dataset_A and dataset_B. Dataset_A initially consisted of 1000
records and was then deduplicated to 990 ones. Dataset_B has been
built from dataset_A. Since our method is capable of achieving
approximate matching, and in order to evaluate its performance
in these terms, we corrupted dataset_B so as to contain one error
per attribute, as it is rather unusual for a word to contain more
than one to two typographical errors [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Experiments were
conducted on an 8-core virtual server with 16GB of RAM, utilizing,
however, only a single core for each assessment.
¹https://github.com/XhinoMullaymeri/Fuzzy_Vault_PPRL_DOLAP21
²Python 3 packages: hashlib, random, scipy.interpolate.lagrange, numpy,
numpy.polyfit, Levenshtein
³Available at: https://dl.ncsbe.gov/index.html?prefix=data/
      </p>
      <p>For the Reference Set, now, there are two aspects we are
interested in: its content and its size. As for the content, we used
names and surnames originating from the initial database. In
terms of Reference Set sizes, we experimented with three distinct
dataset sizes. For each letter of the latin alphabet, 15, 20 and 25
records were selected with a uniform distribution, resulting in
15 ∗ 26 = 390 records, 20 ∗ 26 = 520 records and 25 ∗ 26 = 650
records. For each size we created three different Reference Sets
in order to have more information about the effect of the set's
content on the results. Table 2 contains the configurations for
all these Reference Sets. Since for each size of Reference Set we
will use three different Reference Sets, each time we perform an
experiment with each of them, we maintain a fixed id. To this end,
experiment ids 1–3 are for the Reference Sets of 390 elements,
4–6 for those of 520 elements, and 7–9 for those of 650 elements.</p>
      <p>The evaluation of our method is going to be based on metrics
broadly used in information retrieval and data mining. These
metrics are Precision, Recall and Balanced Accuracy. Precision is the
fraction of the relevant elements among the retrieved elements
(P = TP / (TP + FP)). Recall is the fraction of the relevant
elements that are retrieved (R = TP / (TP + FN)).
Balanced Accuracy is used as a metric to indicate accuracy in
cases of data with imbalanced classes, where the number of
instances of one class is significantly larger than the
number of instances of the other one. This metric normalizes the
true positive and true negative predictions by the number of
actual positive and negative samples, respectively, and divides their
sum by two.</p>
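As a quick reference, the three metrics can be computed from confusion-matrix counts as follows (any example counts used with it are hypothetical):

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Precision, Recall and Balanced Accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    balanced_accuracy = (recall + specificity) / 2.0
    return precision, recall, balanced_accuracy
```

With the heavily imbalanced pair counts of record linkage, the true-negative rate dominates the balanced accuracy, which is why it stays high even when recall dips.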
      <p>Next, we will examine in detail the outcomes of our empirical
evaluation.</p>
      <p>6.2.1 Matching Performance. Let us begin our experimental
assessment by examining the matching performance achieved
by our Fuzzy Vault scheme. This is illustrated in Figure 6. The
horizontal axis represents the ids of the Reference Sets used,
while the vertical one represents the value for the respective
metric, all ranging between 0 and 1. Regarding the results, now,
first of all, we may observe that Recall levels, represented by the
yellow bars, are above 60% in almost every test. This means
that at least 60% of the fuzzy vaults which had been created
using dataset_A managed to be unlocked by the proper record of
dataset_B, resulting in a successful record linkage. Furthermore,
we may observe three clusters forming, one for each Reference
Set size, where the results are similar within each cluster.</p>
      <p>Comparing precision and recall, we may observe that they
move in opposite directions, which constitutes an expected
outcome. Regarding the Reference Set used, the following
situation occurs. When smaller Reference Sets are employed, more
keys are assigned to the same items of the Reference Set. Thus,
more keys share common elements. On the other hand, larger Reference
Sets associate fewer keys with each of their elements,
leading to higher precision, yet lower recall.</p>
      <p>It is evident that the size of the Reference Set used plays an
important role in the overall behavior of our method.
Nevertheless, we have to stress that no particular measures were taken
to optimize the outcome regarding the distribution of the values
of the Reference Sets. We have deliberately chosen this route
in order to illustrate the baseline characteristics our approach
exhibits.</p>
      <p>Considering Balanced Accuracy now, it is easy to see that it
remains at very high levels in all cases. This is an interesting result
by itself, since we are interested not only in the matching pairs of
records, but also in the non-matching ones, considering
that we aim at a 1-1 matching of each of dataset_A's records
with the ones of dataset_B. Considering 1,000 records for each
dataset, we would ideally end up with 1,000 pairs matched and
1, 000 ∗ 1, 000 − 1, 000 = 999, 000 pairs not matched. Precision
and recall do not take into account such imbalances, a situation
captured, however, by Balanced Accuracy.</p>
      <p>6.2.2 Total Execution Time. In this set of experiments, we will
assess the relation between the execution time and the size of
the Reference Set. The results of this evaluation are illustrated in
Figure 7. The vertical axis represents time in seconds, while the
horizontal one, again, stands for the Reference Set used. The first
observation we can make is that there are groups formed, based
on the size of the Reference Set. Within each group, execution
times are similar, as the sizes of these Reference Sets are identical
and for each key the same number of comparisons has to take
place. On the other hand, as the size of the Reference Set increases,
execution times increase as well. As a result, we can deduce that
total execution time is proportional to the size of the Reference
Set.</p>
      <p>This observation is particularly useful, especially when
combined with the conclusions from the previous set of experiments.
In more detail, using smaller Reference Sets results in
higher recall and lower precision. On the other hand, larger
Reference Sets increase not only execution time but also precision,
at the cost of lower recall. Selecting intermediate sizes for
the Reference Set offers a fair trade-off between precision and
recall, and average time performance.</p>
      <p>6.2.3 Locking and unlocking time shares. In this experiment,
we will focus on Reference Set 8, in order to illustrate time shares
for locking and unlocking the vaults. These are illustrated in
Figures 9a and 9b. We will begin by describing the locking phase.
Here, as we may see, the time required by the locking phase is
mostly occupied by the chaff points generation. This is, however,
one of the basic elements for offering strong privacy
characteristics. The rest of the steps of the locking phase occupy only a
small portion of the overall time required, making this part of
our approach open for further improvements.</p>
      <p>Regarding unlocking, it is easy to see that time
is almost equally shared between reconstruction and key
generation. Identifying common points requires just a tiny portion of
the overall time.</p>
      <p>At this point, we will make use of one more plot to be able to
extract safe conclusions regarding execution time. As we may
see, Figure 9c illustrates a time comparison between the locking
and the unlocking phase. It is evident that the time required to
unlock all vaults is much larger than the time required to lock
them. In particular, while 58 seconds are required for encoding,
6869 seconds are required for both decoding and performing an</p>
      <p>
        6.2.4 Comparison with privacy preserving Soundex. Finally,
we conclude our experimental evaluation by comparing against
Privacy Preserving Soundex [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For this experiment, we used
Reference Set 7, comprising 650 entries.
      </p>
      <p>This experiment is illustrated in Figure 8. Again, the
vertical axis displays values of the metrics, while the horizontal one
represents the name of each metric. Green bars correspond to
Privacy Preserving Soundex, while our method is illustrated with
the red bars. We may observe that only in terms of Precision is our
method inferior to Privacy Preserving Soundex. This, however,
is a result of the low Recall that Privacy Preserving Soundex
exhibited. On the other hand, Fuzzy Vault manages to exhibit a
more balanced behavior. This result is promising, also taking into
account that Privacy Preserving Soundex is a three-party method,
as opposed to our two-party method. Regarding execution time,
Private Soundex operates in less than a second. However, as the
results are simulated and this protocol requires a three-party
setting for achieving privacy, we may not extract valid conclusions
based on this comparison, since protocol communication times
are not taken into account.</p>
    </sec>
    <sec id="sec-16">
      <title>7 CONCLUSIONS AND FUTURE WORK</title>
      <p>In this paper, we presented the details of our method for
performing two-party privacy preserving record linkage using Fuzzy
Vaults. We discussed the privacy properties of our approach and
provided extensive empirical evidence that this new approach is
very promising. In our next steps, we aim at further refining our
method to increase matching performance while, at the
same time, improving execution times, maintaining, nevertheless,
our method's privacy characteristics. We will further experiment
with tuning our method's parameters, trying alternative
interpolation methods.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Luca</given-names>
            <surname>Bonomi</surname>
          </string-name>
          , Yingxiang Huang, and
          <string-name>
            <given-names>Lucila</given-names>
            <surname>Ohno-Machado</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Privacy challenges and research opportunities for genomic data sharing</article-title>
          .
          <source>Nature Genetics</source>
          (
          <year>2020</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Feng</given-names>
            <surname>Chen</surname>
          </string-name>
          , Xiaoqian Jiang, Shuang Wang,
          <string-name>
            <given-names>Lisa M</given-names>
            <surname>Schilling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Daniella</given-names>
            <surname>Meeker</surname>
          </string-name>
          , Toan Ong, Michael E Matheny, Jason N Doctor,
          <string-name>
            <given-names>Lucila</given-names>
            <surname>Ohno-Machado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jaideep</given-names>
            <surname>Vaidya</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Perfectly secure and efficient two-party electronic health-record linkage</article-title>
          .
          <source>IEEE internet computing 22</source>
          ,
          <issue>2</issue>
          (
          <year>2018</year>
          ),
          <fpage>32</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Yanling</given-names>
            <surname>Chen</surname>
          </string-name>
          . [n. d.].
          <article-title>Current approaches and challenges for the two-party privacy-preserving record linkage (PPRL)</article-title>
          .
          <source>Collaborative Technologies and Data Science in Artificial Intelligence Applications</source>
          <volume>(</volume>
          [n. d.]).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Christen</surname>
          </string-name>
          , Thilina Ranbaduge, and
          <string-name>
            <given-names>Rainer</given-names>
            <surname>Schnell</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Linking Sensitive Data: Methods and Techniques for Practical Privacy-Preserving Information Sharing</article-title>
          . Springer International Publishing AG.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Christen</surname>
          </string-name>
          , Thilina Ranbaduge, Dinusha Vatsalan, and
          <string-name>
            <given-names>Rainer</given-names>
            <surname>Schnell</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Precise and fast cryptanalysis for Bloom filter based privacy-preserving record linkage</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>31</volume>
          ,
          <issue>11</issue>
          (
          <year>2018</year>
          ),
          <fpage>2164</fpage>
          -
          <lpage>2177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T. Charles</given-names>
            <surname>Clancy</surname>
          </string-name>
          , Negar Kiyavash, and
          <string-name>
            <given-names>Dennis J</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Secure smartcardbased fingerprint authentication</article-title>
          .
          <source>In Proceedings of the 2003 ACM SIGMM workshop on Biometrics methods and applications</source>
          . 45-
          <fpage>52</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Samuel Daniel</given-names>
            <surname>Conte</surname>
          </string-name>
          and Carl De Boor.
          <year>2017</year>
          .
          <article-title>Elementary numerical analysis: an algorithmic approach</article-title>
          .
          <source>SIAM</source>
          .
          <volume>31</volume>
          -41 pages.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Fred J</given-names>
            <surname>Damerau</surname>
          </string-name>
          .
          <year>1964</year>
          .
          <article-title>A technique for computer detection and correction of spelling errors</article-title>
          .
          <source>Commun. ACM 7</source>
          ,
          <issue>3</issue>
          (
          <year>1964</year>
          ),
          <fpage>171</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Aleksander</given-names>
            <surname>Essex</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Secure Approximate String Matching for Privacy-Preserving Record Linkage</article-title>
          .
          <source>IEEE Transactions on Information Forensics and Security</source>
          <volume>14</volume>
          ,
          <issue>10</issue>
          (
          <year>2019</year>
          ),
          <fpage>2623</fpage>
          -
          <lpage>2632</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Michael T</given-names>
            <surname>Goodrich</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>The mastermind attack on genomic data</article-title>
          .
          <source>In 2009 30th IEEE Symposium on Security and Privacy. IEEE</source>
          ,
          <fpage>204</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Ali</given-names>
            <surname>Inan</surname>
          </string-name>
          , Murat Kantarcioglu, Gabriel Ghinita, and
          <string-name>
            <given-names>Elisa</given-names>
            <surname>Bertino</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Private record matching using differential privacy</article-title>
          .
          <source>In Proceedings of the 13th International Conference on Extending Database Technology</source>
          .
          <fpage>123</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Ari</given-names>
            <surname>Juels</surname>
          </string-name>
          and
          <string-name>
            <given-names>Madhu</given-names>
            <surname>Sudan</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>A fuzzy vault scheme</article-title>
          .
          <source>Designs, Codes and Cryptography</source>
          <volume>38</volume>
          ,
          <issue>2</issue>
          (
          <year>2006</year>
          ),
          <fpage>237</fpage>
          -
          <lpage>257</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Alexandros</given-names>
            <surname>Karakasidis</surname>
          </string-name>
          , Georgia Koloniari, and Vassilios S Verykios.
          <year>2015</year>
          .
          <article-title>Scalable blocking for privacy preserving record linkage</article-title>
          .
          <source>In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          .
          <fpage>527</fpage>
          -
          <lpage>536</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Alexandros</given-names>
            <surname>Karakasidis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vassilios S</given-names>
            <surname>Verykios</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Privacy preserving record linkage using phonetic codes</article-title>
          .
          <source>In 2009 Fourth Balkan Conference in Informatics. IEEE</source>
          ,
          <fpage>101</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Basit</given-names>
            <surname>Khurram</surname>
          </string-name>
          and
          <string-name>
            <given-names>Florian</given-names>
            <surname>Kerschbaum</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>SFour: A Protocol for Cryptographically Secure Record Linkage at Scale</article-title>
          .
          <source>In 2020 IEEE 36th International Conference on Data Engineering (ICDE)</source>
          . IEEE,
          <fpage>277</fpage>
          -
          <lpage>288</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Mehmet</given-names>
            <surname>Kuzu</surname>
          </string-name>
          , Murat Kantarcioglu, Ali Inan, Elisa Bertino, Elizabeth Durham, and
          <string-name>
            <given-names>Bradley</given-names>
            <surname>Malin</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Efficient privacy-aware record integration</article-title>
          .
          <source>In Proceedings of the 16th International Conference on Extending Database Technology</source>
          .
          <fpage>167</fpage>
          -
          <lpage>178</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Vladimir</given-names>
            <surname>Levenshtein</surname>
          </string-name>
          .
          <year>1965</year>
          .
          <article-title>Binary codes capable of correcting spurious insertions and deletions of ones</article-title>
          .
          <source>Problems of Information Transmission</source>
          <volume>1</volume>
          ,
          <issue>1</issue>
          (
          <year>1965</year>
          ),
          <fpage>8</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Xhino</given-names>
            <surname>Mullaymeri</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexandros</given-names>
            <surname>Karakasidis</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>A Two-Party Private String Matching Fuzzy Vault Scheme</article-title>
          .
          <source>In The 36th ACM/SIGAPP Symposium On Applied Computing. ACM.</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Chaoyi</given-names>
            <surname>Pang</surname>
          </string-name>
          , Lifang Gu, David Hansen, and
          <string-name>
            <given-names>Anthony</given-names>
            <surname>Maeder</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Privacy-preserving fuzzy matching using a public reference table</article-title>
          .
          <source>In Intelligent Patient Management</source>
          . Springer,
          <fpage>71</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Wendy</given-names>
            <surname>Ponce-Hernandez</surname>
          </string-name>
          , Ramon Blanco-Gonzalo, Judith Liu-Jimenez, and
          <string-name>
            <given-names>Raul</given-names>
            <surname>Sanchez-Reillo</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Fuzzy Vault Scheme Based on Fixed-Length Templates Applied to Dynamic Signature Verification</article-title>
          .
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          ),
          <fpage>11152</fpage>
          -
          <lpage>11164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Thilina</given-names>
            <surname>Ranbaduge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Christen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rainer</given-names>
            <surname>Schnell</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Secure and Accurate Two-Step Hash Encoding for Privacy-Preserving Record Linkage</article-title>
          .
          <source>In Pacific-Asia Conference on Knowledge Discovery and Data Mining</source>
          . Springer,
          <fpage>139</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Ahsan</given-names>
            <surname>Saleem</surname>
          </string-name>
          , Abid Khan, Furqan Shahid,
          <string-name>
            <given-names>M Masoom</given-names>
            <surname>Alam</surname>
          </string-name>
          , and Muhammad Khurram Khan.
          <year>2018</year>
          .
          <article-title>Recent advancements in garbled computing: how far have we come towards achieving secure, efficient and reusable garbled circuits</article-title>
          .
          <source>Journal of Network and Computer Applications</source>
          <volume>108</volume>
          (
          <year>2018</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Walter J</given-names>
            <surname>Scheirer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Terrance E</given-names>
            <surname>Boult</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Cracking fuzzy vaults and biometric encryption</article-title>
          .
          <source>In 2007 Biometrics Symposium. IEEE</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24] Springer Verlag GmbH, European Mathematical Society. [n. d.].
          <source>Encyclopedia of Mathematics</source>
          . Website. URL: https://www.encyclopediaofmath.org/. Accessed on 2020-08-11.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Umut</given-names>
            <surname>Uludag</surname>
          </string-name>
          , Sharath Pankanti, and
          <string-name>
            <given-names>Anil K</given-names>
            <surname>Jain</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Fuzzy vault for fingerprints</article-title>
          .
          <source>In International Conference on Audio-and Video-Based Biometric Person Authentication</source>
          . Springer,
          <fpage>310</fpage>
          -
          <lpage>319</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Dinusha</given-names>
            <surname>Vatsalan</surname>
          </string-name>
          , Ziad Sehili,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Christen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Erhard</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Privacy-preserving record linkage for big data: Current approaches and research challenges</article-title>
          .
          <source>In Handbook of Big Data Technologies</source>
          . Springer,
          <fpage>851</fpage>
          -
          <lpage>895</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>