Using Fuzzy Vaults for Privacy Preserving Record Linkage Xhino Mullaymeri Alexandros Karakasidis Department of Applied Informatics Department of Applied Informatics University of Macedonia University of Macedonia Thessaloniki, Greece Thessaloniki, Greece dai16019@uom.edu.gr a.karakasidis@uom.edu.gr ABSTRACT cases, which undoubtedly constitutes a matter of great concern. Approximate string matching has a variety of applications, one of Suppose that a verified case of the disease is an airline passenger. them being record linkage, where records from different sources, To ensure public health, all patient’s close contacts should be without common unique identifiers, are matched. When these tracked and informed that they might have been infected. Apply- records refer to individuals, maintaining their privacy becomes a ing Privacy Preserving Record Linkage on databases from health paramount requirement, leading us to develop privacy preserv- agencies and airlines, authorities may locate and inform people ing record linkage methods. In this paper, we present a solution who might have been exposed to the virus, without compro- to the privacy preserving record linkage problem which is based mising privacy. In the same context, surveillance of quarantined on Fuzzy Vaults, having the advantage of being two-party based. cases can also be benefited. When someone is placed under house A Fuzzy Vault is a cryptographic structure based on noise. Fuzzy quarantine, his identifying data may be added on a public gov- Vaults have been previously used mainly in biometric applica- ernmental database so as to be accessed by every agency which tions. In this work, we explain our methodology in depth, pro- offers public services (accommodation, transport, entertainment). viding detailed examples, we support our argument for privacy As these places are usually overcrowded, makes them places of preservation by a solid discussion and provide extensive experi- possible outbreaks. Before providing any services, such providers mental results accompanied by a rigorous analysis, not only to should be able to check if their customers’ names are on the indicate the stability and robustness of our method, but also to quarantined cases list, protecting, in any case, the privacy of the demonstrate that our setup manages to achieve superior results individuals in all sites. when compared to a baseline three party - based method. In this paper, we present in detail the operation and the under- the-hood implementation details of the two-party method for KEYWORDS performing privacy preserving record linkage using Fuzzy Vaults briefly sketched in [18]. A Fuzzy Vault scheme [12] is a crypto- Fuzzy Vault, Privacy Preserving Record Linkage, Approximate graphic structure where a secret is being “locked” with a key and String Matching in order to “unlock” it someone must use a key which is similar enough to the locking key. This “unlocking” phase of a Fuzzy 1 INTRODUCTION Vault is by its nature approximate, so it suits very well our goal In this digital era we are living, the number of agencies and cor- for approximate string matching. Originally, It has been used porations which store personal data has dramatically increased. for biometric authentication algorithms and there are such im- A common need for these organizations to exchange information plementations displayed in the literature [25]. Despite the work for individuals described by these data and perform analysis on on biometric authentication, to the best of our knowledge, there them. As a data integration step, this would traditionally boil have been no works utilizing Fuzzy Vaults for performing pri- down to a database join in the case that common unique iden- vacy preserving approximate string matching, focused on, but tifiers are shared such as social id or a driving license number. not limited to, Privacy Preserving Record Linkage operations. However, since this is not always the case, the next step is to ex- As such, our contributions may be summarized as follows. First ploit fields that, when combined may uniquely identify a record. of all, we illustrate, for the first time, all the details of our novel What we have just described is commonly known in the literature methodology using Fuzzy Vaults for private approximate string as the Record Linkage Problem. Solving this problem, however, matching so as to be utilized in a privacy preserving record link- does not take into consideration that the privacy of the described age context. This methodology consists of a carefully designed individuals in these databases should be maintained. This con- workflow employing reference sets, non bijective mapping func- stitutes the Privacy Preserving Record Linkage problem, where tions and noise addition. We provide detailed examples in order two parties aim to find common records without revealing any to allow the in-depth comprehension of our method’s details and additional information to each other, apart from the matching we support our argument of privacy preservation by a solid dis- ones. However, data are usually dirty and even common fields cussion. In terms of empirical evaluation, we provide extensive between the two parties’ databases might diverge, quite often experimentation, not only for exhibiting our method’s behavior due to human error (i.e. typing errors during data entry). Since but for proving its superiority against a baseline three party-based quite often such identifiers are string values as name, surname method in terms of overall performance, a fact with additional or address, one approach to overcome this complication is by value considering that our fuzzy vault approach does not require using approximate string matching methods which should also a third party. consider privacy preservation. The rest of this paper is organized as follows. Section 2 presents To indicate the importance of the Privacy Preserving Record works related to ours. In Section 3, there is the background for Linkage task, let us consider the COVID-19 pandemic we are introducing the reader to our method, which is detailed in Section experiencing and the need for tracking the contacts of confirmed 4. Section 5 contains a discussion over the privacy properties of our method. In Section 6 we provide the empirical analysis of © Copyright 2021 for this paper held by its author(s). Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). our method in order to demonstrate its robustness and stability. Finally, in Section 7, we provide our conclusions and thoughts Alice and Bob will most probably use different schemas in for future research. their databases. As such, they will have different attributes. Let 𝑆 𝐴 be Alice’s schema and 𝑆 𝐵 be Bob’s schema and let us assume 2 RELATED WORK that in these schemas 𝑚 of the attributes are common between Private approximate string matching techniques have been widely the two sources forming a composite key. These attributes might developed in the context of privacy preserving record linkage be names, surnames, addresses and so on. As such, none of these [26], being mostly based either on Secure Multiparty Computa- on its own may comprise a unique identifier that can be used to tion (SMC), usually incurring additional communication costs, identify a record. We refer to these attributes as matching fields. or on data perturbation, that often seem to be quite vulnerable This composite key is used to determine when two records match, to attacks [5]. i.e., when they refer to the same entity. To determine when two In order to overcome some of the drawbacks and render SMC records match, the respective attributes forming the composite based techniques applicable as it comes to real world size datasets, key need to be compared. Considering that our data is often dirty, hybrid methods have emerged. In [11] a hybrid method which matching should rely on a similarity or distance function. combines differential privacy and homomorphic encryption is Let us consider a similarity function 𝐹 () → [0..1] and a thresh- presented. Here, instead of sanitizing data, differential privacy is old 𝑡 > 0. Given the composite keys 𝑐 𝐴 and 𝑐 𝐵 for records 𝑟 𝐴 applied to ensure data privacy. Afterwards, data partitioning into and 𝑟 𝐵 , respectively, we define the following matching function blocks takes place. Finally, an SMC step based on homomorphic 𝐺 → {0, 1}: encryption occurs and returns the matching pairs. This method ( 𝐴 𝐵 1, iff 𝐹 (𝑐 𝐴 , 𝑐 𝐵 ) ≥ 𝑡 manages to partially improve the situation with the bottleneck 𝐺(𝑐 , 𝑐 ) = (1) of the limited size of data that SMC can endure. 0, otherwise. Another hybrid method which is based upon blocking, uti- If 𝐺(𝑐 𝐴 , 𝑐 𝐵 ) = 1, then the pair (𝑟 𝐴 , 𝑟 𝐵 ) is a match. lization of differential privacy and on an SMC (homomorphic) This process is the matching process. To preserve privacy, i.e., phase is described in [16]. However, in this case the two parties ensure privacy preserving matching (PPM), after the completion compare their data with the help of a third one. Moreover, even of this process, the only information revealed is the identifiers recent techniques addressing this problem employ a third party of the matched records. In our case, where we use Fuzzy Vaults [21], an approach that may be rendered impractical requiring the for matching, Formula 1 will have as input Bob’s key, 𝐾 𝐵 corre- availability of such a third party. sponding to record 𝑟 𝐵 , and a Fuzzy Vault 𝑉 for Alice’s 𝑟 𝐴 and it In the last few years, there has been particular effort towards will examine their absolute matching status using the method we developing two-party based techniques [3]. One of the methods describe in detail in Section 4, returning 1 when Bob successfully used to achieve this functionality is homomorphic encryption [9], unlocks Alice’s Fuzzy Vault with his and 0 otherwise. As such, which, however is a computationally expensive method [1] also the matching threshold is equal to 1. liable to suffer certain attacks [10]. Another recent technique is based on SPDZ [15] which, however, may also be expensive in terms of communication and computation [3]. Finally, Garbled 3.2 Lagrange Interpolation Circuits can offer a solution [2], but further evaluation is needed Lagrange interpolation [24] is the most common interpolation in terms of execution time, size and reusability [22]. method used in the fuzzy vault literature. Considering a two- To this end, we are presenting a solution to the two-party dimensional Euclidean space, and given a data set of n + 1 points: privacy preserving record linkage problem using Fuzzy Vaults. A (𝑥 0, 𝑦0 ) , (𝑥 1, 𝑦2 ) , . . . , (𝑥𝑛 , 𝑦𝑛 ) 𝑥 i ̸= 𝑥 𝑗 ∀ 𝑖, 𝑗 ∈ {0, 1, . . . , 𝑛} Fuzzy Vault scheme may be used for privacy preserving match- the approximation function using Lagrange interpolation is a ing, personal entropy systems, biometrics [12], [25] or, as in- polynomial of nth degree and is defined as shown in Equation 2: dicated by the recent literature [20], for signature verification. 𝑛 Our method has been briefly presented in [18]. In this paper, X 𝑃 (𝑥) = 𝑦𝑖 ∗ 𝐿𝑛,𝑖 (𝑥) (2) we move one step further by providing a simple protocol for 𝑖=0 two-party privacy-preserving record linkage using Fuzzy Vaults, This method is called Lagrange interpolation because it uses detailed examples, an in-depth privacy discussion, exhibiting the Lagrange basis polynomials in order to interpolate the data set, privacy characteristics of our method and, last but not least, an as illustrated in Equation 3: extended experimental evaluation, also including a comparison with a baseline method. Y𝑛 𝑥 − 𝑥𝑗 𝐿𝑛,𝑖 = (3) 3 PRELIMINARIES 𝑗=0,𝑖̸= 𝑗 𝑖 − 𝑥 𝑗 𝑥 In this section, we define the problem we are solving and provide By noticing the denominator of Equation 3 it is evident why the required background for laying out our method. The notation 𝑥 i ̸= 𝑥 𝑗 ∀ 𝑖, 𝑗 should hold for the formula’s abscissas. used throughout this paper is summarized in Table 1. 3.3 Fuzzy Vault 3.1 Problem Formulation A Fuzzy vault is a cryptographic construct used for secret sharing, Let us consider two data sources, called Alice (𝐴) and Bob (𝐵). introduced in [12] and, in brief, it works as follows. Alice, wants Privacy preserving record linkage is the problem of identifying to protect a secret 𝑆 by locking it up with a key, which is a set (linking) all pairs of their records that refer to the same real world of items 𝐾 𝐴 . This is the Lock phase. Bob, may have access to entity, so that no more information is disclosed to either 𝐴, 𝐵 or this secret only if he uses a key, consisting of a set of items 𝐵 is any third party involved in the process besides the identifiers of substantially similar to Alice’s key. This is the Unlock phase. To the linked records. assure privacy, noise is also injected during the Lock phase. Table 1: Table of symbols used. Symbol Description n Degree of polynomial K Key f Size of fragment P Polynomial r Plaintext record Figure 1: Fuzzy Vault for locking 2317 into 𝑃(𝑥) = 2𝑥 3 + V Fuzzy Vault 3𝑥 2 + 1𝑥 + 7. Left: Polynomial creation. Right: Fuzzy Vault RS Reference Set combining genuine (green) and chaff (red) points. C Chaff Points R Original Points H Verification Hash Next, we will further explain the details of this process and em- ploy a running example to facilitate the reader. The original Fuzzy Superscripted A (𝐴 ) Alice (e.g. 𝐾 𝐴 : Alice’s Key) Vault construct was designed for numerical values, which we Superscripted B (𝐵 ) Bob (e.g. 𝐾 𝐴 : Bob’s Key) will use to illustrate its operation. However, as our work focuses |·| Size (e.g |RS|: Reference Set size) on strings, we will lay out in later sections our methodology for achieving suitable transformations in order to adapt this concept is the set 𝐴 = {1, 2, 3, 4, 5, 6} . By projecting each of these points for alphanumeric data. Alice ends up with a collection 𝑅 of points in a two-dimensional 3.3.1 Lock Phase. To illustrate the operation of a Fuzzy Vault, space, 𝑅 = {𝑥, 𝑃(𝑥)}, ∀𝑥 ∈ 𝐴. For our example, this results into: we will commence with the Lock phase and consider, as stated 𝑅 = (1, 13), (2, 37), (3, 91), (4, 187), (5, 337), (6, 553). In our example, earlier, that Alice wants to lock her secret 𝑆 using her key, 𝐾 𝐴 . 𝐾 𝐴 ’s cardinality is |𝐾 𝐴 |= 6. For this purpose, she uses a polynomial 𝑃 of degree 𝑛 so that Next, Alice creates the distraction set 𝐶 comprising of random 𝑃’s coefficients are, in fact, mappings of her secret 𝑆. Then, she chaff points which do not lie on the polynomial. Finally, Alice uses the members of her key 𝐾 𝐴 as x-coordinates for 𝑃 creating combines and shuffles the two collections in order to create the a collection of points 𝑅. As 𝐾 𝐴 is a set, its elements are unique, vault 𝑉 , which contains both genuine and chaff points, which thus the the x-coordinates of 𝑅’s points are going to be unique are indistinguishable. As stated before, the size of 𝐶 should sig- as well. nificantly exceed that of 𝑅. However, for presentation purposes Let us now denote with |𝐾 𝐴 | the cardinality of 𝐾 𝐴 . Equation and without harm to the general case, we will only consider four 4 illustrates the restriction that has to be satisfied with regard points in the chaff set: 𝐶 = {7, 8, 9, 10} For our toy example, this to the size of the polynomial used. This is due to Lagrange’s will yield: 𝑉 = {(1, 13), (2, 37), (3, 91), (4, 187), (5, 337), (6, 553), . . . , interpolation method which requires 𝑛 + 1 data points in order (9, 560), (10, 700)}. to return a polynomial of degree at most 𝑛 [7]. The key set 𝐾 𝐴 3.3.2 Unlock phase. Having seen how Alice locks a secret, holds the abscissas of the data points that Lagrange will later let us now describe the Unlock phase, where Bob will attempt to interpolate in the unlock phase. unlock the Fuzzy Vault 𝑉 he received from Alice and get access to Alice’s secret 𝑆. For this purpose, Bob will have to use a key |𝐾 𝐴 |≥ 𝑛 + 1 (4) set 𝐾 𝐵 in order to unlock 𝑉 . This will only happen in the case Now, Alice will also create noise in order to protect the Fuzzy that his key, 𝐾 𝐵 , substantially overlaps with Alice’s key 𝐾 𝐴 , used Vault’s, thus offering privacy. To do so, she introduces a distrac- to create 𝑉 . If so, Bob may distinguish enough genuine points of tion set of points designated by 𝐶 and referred to as Chaff points. the vault 𝑉 belonging to the collection 𝑅. These points do not belong to the polynomials’s curve. Their pur- However, Bob may also identify some distraction points be- pose is to confuse an attacker by refraining her from identifying longing to the chaff set 𝐶 too. Bob’s ability to eventually recon- the points in 𝑅 that have resulted from the Lock phase. As such, struct the original polynomial depends on the cardinality of Bob’s collection 𝐶 may be considered as the protection of the vault. key. If the key’s cardinality is smaller than the polynomial’s de- The union of collections 𝐶 and 𝑅 comprised the vault 𝑉 , or more gree then interpolation methods will not be able to recover the formally: V 𝑉 , 𝑉 = 𝑅 ∪ 𝐶. In fact, 𝐾 𝐴 is considered to be the key proper polynomial [7]. Therefore, Equation 4 has to be satisfied of the vault 𝑉 since it identifies the collection of real points 𝑅. for Bob’s key, 𝐾 𝐵 , as well. Unlocking the vault requires, first, that As chaff points aim to confuse an attacker, the cardinality of 𝐶 is Bob finds all the vault’s points that have the same abscissas with of a greater order of magnitude than the cardinality of set 𝑅. We any of his key’s elements, or more formally: define this set of chaff points as illustrated in Equation 5: 𝑀 = {(𝑥, 𝑦) ∈ 𝑉 |𝑥 ∈ 𝐾 𝐵 } (6) 𝐶 = 𝑥 ′ ̸= 𝑥, 𝑦 ̸= 𝑃(𝑥 ′ ) , ∀𝑥 ∈ 𝐴 & ∀𝑥 ∈/ 𝐴 & |𝐶 |>> |𝑅| (5)  Distraction points might exist because 𝐾 𝐵 might also overlap At this point, we will make this process more clear through with the collection of random chaff points 𝐶. If 𝐾 𝐴 and 𝐾 𝐵 do not Example 3.1, with its illustration in Figure 1. significantly overlap, then the polynomial reconstruction cannot be performed. Example 3.2 illustrates this process. Example 3.1. Let as assume that the secret which Alice wants to lock is 𝑆 = 2317 and that she holds a key, which is the Example 3.2. Let us consider, in our running example, that Bob, set 𝐾 𝐴 = {1, 2, 3, 4, 5, 6}. She creates a polynomial 𝑃 of degree with his key set 𝐾 𝐵 = {1, 2, 4, 5, 8, 9}, tries to unlock 𝑉 . The col- 𝑛 = 3 which has the secret embedded into its coefficients: 𝑃(𝑥) = lection of 𝑀 for Bob will be: 𝑀 = {(1, 13), (2, 37), (4, 187), (5, 337), 2 ∗ 𝑥 3 + 3 ∗ 𝑥 2 + 1 ∗ 𝑥 1 + 7 ∗ 𝑥 0 . The next step involves the pro- (8, 360)}. Bob has to reconstruct 𝑃 in order to access the vault’s jection of her key on the polynomial 𝑃. Assume that Alice’s key contents. For our case, let us consider that this occurs through Algorithm 1: Protocol Overview. /* Operations at Alice */ 1 Build_Fuzzy_Vaults (); 2 Transmit_To_Bob (); /* Operations at Bob */ 3 Generate_Keys (); 4 Unlock_Alice_Vaults () ; // Privacy Preserving Matching 5 Transmit_Results_To_Alice (); /* Operations by both Alice and Bob */ Figure 2: Steps for locking record 𝑟 𝐴 into a Fuzzy Vault. 6 Transmit_Matching_Records (); behavior, attacking the operation of the protocol, but they try Lagrange interpolation [24]. We note here that, part of 𝐾 for to infer as much information as possible without deviating from Bob’s key is a point (8, 360) which comprises a noise element the protocol. for 𝑅 and as such it does not belong to the original polynomial. Consequently, each of the combinations that include the noise 4.1 Protocol Description element will return a false polynomial. Note here that the cardi- Let us begin by providing a simple protocol for using Fuzzy nality of Bob’s key is |𝐾 𝐵 |= |𝐾 𝐴 |= 6. In this case, we have chosen Vaults for privacy preserving record linkage. For this purpose, both Alice and Bob to have keys of the same size. a separate Fuzzy Vault for each record in a database, as such, Nevertheless, there is the capacity for Bob to have a key with a separate polynomial, has to be created and secretly shared. a different size to Alice’s, given that it has at least one element We will consider for our case, that alphanumeric identifiers are more than the degree of the polynomial used. Since Lagrange used. As our focus is on privately matching records, we assume interpolation requires 𝑛 + 1 genuine points in order to reconstruct that either both Bob and Alice share a common schema, or they the polynomial which corresponds to them, if |𝑀 |< 𝑛 + 1, Bob have different schemas but corresponding fields on the use of will fail to unlock the vault. On the other hand, if |𝑀 |> 𝑛 + 1 Bob which they have agreed beforehand, or they use some privacy- has to create all possible 𝑛 + 1 combinations of 𝑀 and use them preserving schema matching method. in order to interpolate the polynomial. Now, we should take into The protocol we are describing is asymmetric. This means consideration that each, one of the possible combinations will that both parties do not perform exactly the same steps, but they return a polynomial but Bob has somehow to decide which one cooperate in order to achieve their common goal of privately is the correct (there is no need to calculate all of them). linking their records. These steps, are illustrated in Algorithm 1. First, Alice converts all of the records in her database to Fuzzy 3.4 Reference Set Vaults, resulting in one Fuzzy Vault for each record. Then, she A Reference Set is a corpus of data used as a means for prepro- transmits these Fuzzy Vaults to Bob through a considered secure cessing when matching sensitive information. This corpus may channel. Bob, in his turn, creates his Fuzzy Vaults and, using the either be publicly available or pre-agreed among the matching resulting keys, attempts to unlock the received vaults, resulting parties, Alice and Bob, in our case. It works as follows. Consider- in a list of matches. When the process concludes, he sends the ids ing that Alice and Bob wish to match their data without revealing of the matching Fuzzy Vaults back to Alice. The procedure con- any information to each other, they use some mapping function, cludes with both Alice and Bob exchanging their actual matching having also been preagreed and the same for both, which relates records. each of the values in their dataset to one or more values in the At this point, the question that rises is how these alphanu- Reference Set. Then, instead of comparing their data directly, meric fields are going to be transformed into polynomials and, they are able to compare the results of the mapping function. eventually, into Fuzzy Vaults. This is the process we are going to Such a construct has been used in the privacy preserving record describe next. linkage context either for matching [19] or for blocking [13]. Example 3.3 illustrates how a Reference Set may be used. 4.2 String Encoding for Locking Example 3.3. In this example we are using Levenshtein dis- Beginning with the locking phase, Alice will secure strings inside tance [17], indicated by 𝐿(·), as the mapping function. We still Fuzzy Vaults. Each string is the concatenation of all the string consider Alice and Bob, with Alice holding the word ‘error’, while fields in the record that will be used for matching. In order for Bob holds the word ‘erasure’ and they wish to privately match such a string to be locked, and to be able to be unlocked afterward them. For the sake of presentation, the Reference Set contains the by Bob by a fairly similar string, it should be somehow associated single word ‘energy’. In this case, for Alice 𝐿𝐴 (𝑒𝑟𝑟𝑜𝑟, 𝑒𝑛𝑒𝑟𝑔𝑦) = 4, with the polynomial used to lock it. As such, our method embeds while for Bob, 𝐿𝐵 (𝑒𝑟𝑎𝑠𝑢𝑟𝑒, 𝑒𝑛𝑒𝑟𝑔𝑦) = 6. Since 𝐿𝐴 ̸= 𝐿𝐵 , we may irreversibly the string into polynomial’s coefficients. We cannot easily deduce that Alice and Bob hold different values. use a direct representation of the string since this would jeop- ardize exposing the key as the abscissas of the genuine points 4 METHOD DESCRIPTION of the vault, allowing an adversary obtaining the key to directly Having examined the operation of the Fuzzy Vault for numerical recover the string. Additionally, as we will describe later on, Bob data, let us now describe our approach for exploiting this mecha- will use this property after unlocking the Fuzzy Vaults, in order nism for approximate private string matching, which will allow to verify if the match he achieved with Alice’s secret is identical us to perform privacy preserving record linkage. We consider two or approximate. honest but curious matching parties, Alice and Bob. This means To be able to adapt the Fuzzy Vault scheme so as to exhibit that, both of these two participants do not exhibit a malicious privacy preservation properties for record linkage, a series of Algorithm 2: Mapping a Record to a Polynomial. Input: - r: A record with alphanumeric fields - f: Size of fragment - n: Size of polynomial Output: - p: A polynomial representation of 𝑟 1 𝑛𝑐𝑜𝑢𝑛𝑡 ← 0; 2 𝑠 ← Concatenate (r); Figure 3: Example of polynomial representation for the 3 𝑓 𝑟𝑎𝑔𝑚𝑒𝑛𝑡𝑠[] ← Fragment(𝑓 , 𝑠); concatenated name and surname “JOE_DOE”. 4 foreach 𝑓 𝑟𝑎𝑔𝑚𝑒𝑛𝑡 ∈ 𝑓 𝑟𝑎𝑔𝑚𝑒𝑛𝑡𝑠[] do 5 𝐴𝑆𝐶𝐼 𝐼 _𝐶𝑜𝑑𝑒𝑠 ← ∅; 6 foreach 𝑐ℎ𝑎𝑟 ∈ 𝑓 𝑟𝑎𝑔𝑚𝑒𝑛𝑡 do polynomials. In our example, the encoded string has less than 27 7 𝐴𝑆𝐶𝐼 𝐼 _𝐶𝑜𝑑𝑒𝑠.append (to_ASCII (𝑐ℎ𝑎𝑟 )); characters. In this case, the remaining coefficients are all set to 1. 8 end 9 SetCoefficient (𝐴𝑆𝐶𝐼 𝐼 _𝐶𝑜𝑑𝑒𝑠, 𝑛 − 𝑛𝑐𝑜𝑢𝑛𝑡 ); 4.2.2 Verification. The aim of the verification step is, as its 10 𝑛𝑐𝑜𝑢𝑛𝑡 + = 1; name suggests, to verify at the unlocking phase that the recovered 11 end polynomial matches the encoded one. In our approach, we use 12 for 𝑖 ← 𝑛𝑐𝑜𝑢𝑛𝑡 to 𝑛 by 1 do the method proposed in [20]. This method, is based on auxiliary 13 SetCoefficient (1, 𝑛 − 𝑖); data generated during the polynomial construction phase. An 14 end MD5 hash 𝐻 which derives directly from the created polynomial, is created having as input the concatenation of the polynomial’s coefficients. steps is necessary. The broad image of these steps are illustrated 4.2.3 Key Generation. Building a polynomial is one of the two in Figure 2. First of all, a record, which plays the role of the string steps required toward generating a record’s set of genuine points. to be secured, plays a double role in the whole process, as it is The second step regards the generation of the key. Since the used both for creating a polynomial and for building a key for locking and the unlocking keys are the two components which encoding this record. This polynomial and the key are then used are being compared, the key in our method should be directly to create the set of Genuine Points, which, in combination with linked with the string. The caveat here is to avoid an one - to the generated chaff points, will result in a Fuzzy Vault. At this - one connection between the key and the string as a measure point, we have to mention that the polynomial representation of against record multiplicity attacks, described in [23]. For instance, the record is also used to produce the verification data, also used in the case that an adversary somehow gets access to the key, during the unlocking phase. our method should not allow the recovering of the secret string. As such, the need to devise an one-way mapping emerged. To 4.2.1 Polynomial generation. Let us begin with examining this end, we employ a Reference Set in order to associate its values the process that generates the aforementioned polynomial. The with the string resulting from the record’s concatenated fields mapping algorithm we propose is illustrated in Algorithm 2. First, by means of a string similarity function. Then, the most similar the alphanumeric fields of the record to be locked are concate- values are selected and they are transformed, in an irreversible nated so as to form a single string (line 2). Then, this string is manner, to a set of numbers comprising the key for the locked fragmented into substrings of size equal to 𝑓 (line 3), with 𝑓 string, without, however, revealing any information about it. being a user defined parameter, an approach we adopted for se- Let us now see in how this works. The generation of a locking curity purposes. This is particularly useful for long strings. Each key for a single record is detailed in Algorithm 3. This algorithm character of each fragment is converted to its ASCII code equiva- should be applied on each record in the recordset resulting in a lent. The concatenation of these codes creates a number, which set of separate keys, one for each record. This algorithm requests will form a polynomial’s coefficient (lines 4-11). The remaining as input the concatenated alphanumeric fields of record 𝑟 , as a coefficients of the polynomial are set to 1. Example 4.1 further single string, to be encoded, a Reference Set 𝑅𝑆 that we are going clarifies the operation of this algorithm. On the other hand, if to use for encoding and the size |𝐾 | of the key 𝐾 which we are the string exceeds the length of 27 characters either a higher going to produce. degree polynomial should be created or some characters should Our algorithm also requires the use of a string similarity, or be skipped. distance, function 𝐹 (·). This is applied on the pairs formed by 𝑟 and each of the values of the Reference Set 𝑅𝑆, 𝑅𝑆_𝑣𝑎𝑙𝑢𝑒 resulting Example 4.1. Let us assume, as illustrated in Figure 3, that 𝑓 = to an array of similarity scores 𝑠𝑖𝑚_𝑠𝑐𝑜𝑟𝑒𝑠 (lines 1-4). 𝑠𝑖𝑚_𝑠𝑐𝑜𝑟𝑒𝑠 3, meaning that the fragment size will consist of three characters is then used to partially sort 𝑅𝑆. As our aim is to retrieve the to and that we wish to encode the name JOE_DOE. Let us also |𝐾 | values of the 𝑅𝑆, a full sort is unnecessary. The sort order assume that the degree of the polynomial we wish to use is 𝑛 = 8. depends on the type of measure used for 𝐹 (·). If 𝐹 (·) is a similarity This will yield a polynomial of the form: 𝑃(𝑥) = 𝑐 8 ·𝑥 8 +𝑐 7 ·𝑥 7 +. . .+ function, then sorting occurs in descending order, as we need to 𝑐 2 ·𝑥 2 +𝑐 1 ·𝑥 1 +𝑐 0 . Given our settings for 𝑛 and 𝑓 , this polynomial retrieve the most similar of 𝑅𝑆’s values with 𝑟 . Otherwise, sort has a capacity of embedding 9 · 3 = 27 characters, because it in ascending order is performed. Afterward, 𝑡𝑜𝑝_𝑘_𝑣𝑎𝑙𝑢𝑒𝑠 are has 9 coefficients. Moreover, there are 263 = 17576 possible extracted from 𝑅𝑆, which are the top |𝐾 | most similar Reference combinations for each coefficient, because each coefficient can Set values with 𝑟 (lines 6-9). Next, we are going to build 𝐾 by have up to three letters and there are 26 available letters (we take applying a series of transformations on each of 𝑡𝑜𝑝_𝑘_𝑣𝑎𝑙𝑢𝑒𝑠, into consideration only capital letters) and 263∗9 total possible resulting in a numerical set. Algorithm 3: Key Generation. 1e6 'Fuzzy Vault' Input: CHAFF POINTS - r: A record with concatenated alphanumeric fields GENUINE POINTS - RS: Reference Set 1.5 - |K|: Size of resulting key Output: 1.0 - K: Resulting key 1 𝑠𝑖𝑚_𝑠𝑐𝑜𝑟𝑒𝑠 ← ∅; 2 foreach 𝑅𝑆_𝑣𝑎𝑙𝑢𝑒 ∈ 𝑅𝑆 do 0.5 3 𝑠𝑐𝑜𝑟𝑒𝑠.append (F (𝑟, 𝑅𝑆_𝑣𝑎𝑙𝑢𝑒)); 4 end 0.0 5 Partial_Sort (𝑅𝑆, 𝑠𝑖𝑚_𝑠𝑐𝑜𝑟𝑒𝑠, |𝐾 |); 6 𝑡𝑜𝑝_𝑘_𝑣𝑎𝑙𝑢𝑒𝑠 ← ∅; −0.5 7 for 𝑖 ← 0 to |𝐾 | by 1 do 8 𝑡𝑜𝑝_𝑘_𝑣𝑎𝑙𝑢𝑒𝑠.append (𝑅𝑆[𝑖]); −1.00 −0.75 −0.50 −0.25 0.00 0.25 0.50 0.75 1.00 9 end 10 𝐾 ← ∅; 11 for 𝑖 ← 0 to |𝐾 | by 1 do Figure 4: Fuzzy vault where chaff points have abscissas 12 ℎ𝑎𝑠ℎ ← Hash (𝑡𝑜𝑝_𝑘_𝑣𝑎𝑙𝑢𝑒𝑠[𝑖]); from the Reference Set and random ordinates. 13 𝑠𝑒𝑒𝑑 ← ℎ𝑎𝑠ℎ; 14 𝑟𝑎𝑛𝑑𝑜𝑚_𝑖𝑛𝑡𝑒𝑔𝑒𝑟 ← Rand (𝑠𝑒𝑒𝑑, 0, 106 ); 15 𝐾 .append (Cosine (𝑟𝑎𝑛𝑑𝑜𝑚_𝑖𝑛𝑡𝑒𝑔𝑒𝑟 )); 16 end chosen but with the restriction that the resulting points do not lie on the polynomial. This way, 𝐶 is built. The transformations that each string 𝑡𝑜𝑝_𝑘_𝑣𝑎𝑙𝑢𝑒𝑠[𝑖] under- Final Step. Eventually, the points of these sets are shuffled in goes are as follows (lines 10-16). Initially, 𝑡𝑜𝑝_𝑘_𝑣𝑎𝑙𝑢𝑒𝑠[𝑖] is order for genuine and chaff points to be mixed, applying formula hashed through a simple hash function so as to create a numerical 5 so as to create 𝑉 . A Fuzzy Vault created with this method is equivalent ℎ𝑎𝑠ℎ (line 12). Then, ℎ𝑎𝑠ℎ is used as an input seed in a illustrated in Figure 4. At this point, the data encoding from random number generator which produces 𝑟𝑎𝑛𝑑𝑜𝑚_𝑖𝑛𝑡𝑒𝑔𝑒𝑟 rang- Alice’s side concludes. Alice sends Bob the Fuzzy Vaults she has ing between [0, 106 ] (lines 13-14). The series of transformation constructed together with the verification hash, One pair for conclude with applying the cosine function on the 𝑟𝑎𝑛𝑑𝑜𝑚_𝑖𝑛𝑡𝑒𝑔𝑒𝑟 , each record. At this point the first step (line 1) of the protocol resulting in a value between [−1, 1], which is now a part of 𝐾 illustrated in Algorithm 1. To further clarify our approach, we (line 15). will use Example 4.2: It is evident that after applying this series of mappings we Example 4.2. Let us assume that “JOHN_SNOW” is the string have managed to perform an one-way transformation between Alice wishes to lock. Let us also assume that 𝐹 (·) is going to the initial string that consists of the concatenation of the records be Levenshtein distance [17]. First, she encodes the string into fields and the numbers that comprise 𝐾. This is due to the use a polynomial, using the fragmentation of the respective ASCII of the numerical representation of the string as a seed and the codes, with 𝑓 = 3: 𝑃(𝑥) = 747972𝑥 8 + 789583𝑥 7 + 787987𝑥 6 + 1𝑥 5 + periodic nature of the cosine function. 1𝑥 4 + 1𝑥 3 + 1𝑥 2 + 1𝑥 1 + 1𝑥 0 . 4.2.4 Fuzzy Vault Construction. At this point, three steps re- Based on the representation of the polynomial, she also con- maining for constructing a Fuzzy Vault: Generating Genuine structs the verification hash, by concatenating the coefficients Points, generating Chaff Points and, finally, blending these to- of the polynomial so as to form “747972786583787987111111”. gether. Applying MD5 results in: “2d73f7e7425e2d20264a799e24715317”, which will be delivered together with the Fuzzy Vault 𝑉 to Bob. Genuine Points Generation. Now that we have created the key Then, Alice will have to build her key 𝐾 𝐴 . For this purpose, 𝐾 and the polynomial 𝑃 required by the Fuzzy Vault scheme to she selects the top |𝐾 𝐴 |= 15 elements of the Reference Set which encode a secret, we will projected 𝐾 over 𝑃, as indicated in [12]: are more similar with the word “JOHN_SNOW”. For instance, 𝑅 = {𝑃(𝑥)}, ∀𝑥 ∈ 𝐾 resulting in a set of 𝐾 points considered to the first element is “JONES_OWEN” which has F(JOHN_SNOW, be the set 𝑅 of genuine points in the Fuzzy Vault. JONES_OWEN) = 5, while the last one is the string “XU_BO” with Noise as Chaff Points. After the creation of the genuine points, F(JOHN_SNOW, XU_BO) = 7. Now, we hash each of the strings chaff points should also be generated ensuring fuzzy vault’s secu- in 𝐾 𝐴 into numbers and then the numbers are mapped into the rity. Noise comprises a standard method for enhancing privacy range [−1, 1] using the cosine function: 𝐶𝑂𝑆(𝐽𝑂𝑁 𝐸𝑆_𝑂𝑊 𝐸𝑁 ) = [4]. In our method we adhere to this principle by generating −0.355866. chaff points which aim at distracting a potential attacker from Next, in order to create the genuine points, Alice projects the identifying the points that belong to the actual polynomial. A elements of her key 𝐾 𝐴 onto the polynomial 𝑃. For instance, for naive approach would be to randomly generate points based on the first key element this yields 𝑃(−0.355866) = 1222.8648. The one or more distributions. However, we consider that this would rest are accordingly projected. When this step concludes, 𝑅 will allow for identifying and phasing out noise quite easily. As such, have been built. we designed a novel approach, more suitable for our case, which Eventually, Alice will have to build the set of chaff points 𝐶 is based on the use of Reference Sets. We use abscissas from and then calculate its union with 𝑅 so as to build the Fuzzy Vault a Reference Set we employ, while the ordinates are randomly 𝑉 . 𝑉 coupled with the verification hash 𝐻 are delivered to Bob. 𝐻 𝐴 . Then these two are checked for matching. If 𝐾 𝐵 sufficiently overlaps 𝐾 𝐴 and 𝐻 𝐴 = 𝐻 𝐵 , the two polynomials will be the same and there is a match. On the other hand, if the two sets are not similar enough, verification will fail. If the polynomial is classified as a match, then the strings for which the vault has been created is considered to be the same with the string Bob holds. To facilitate the reader, we now provide Example 4.3: Example 4.3. Continuing Example 4.2 Bob tries to unlock Figure 5: Steps for linking record 𝑟 𝐵 with a secured record a Fuzzy Vault he received from Alice containing the record inside a Fuzzy Vault. “JOHNY_SNOW”. First, he creates his key the same way Alice did, using the same fragment size. After the creation of his key, Bob has to find the common points between his key and the 4.3 Matching through Unlocking fuzzy vault. In this case, Bob managed to find n = 10 common At this point, we assume that the second step (line 2) of Algorithm points. Thereby, by utilizing Lagrange interpolation, he managed 1 has been successfully completed, thus Alice’s records have to recover a polynomial of 8th degree by using 9 out of the 10 been successfully transmitted to Bob in the form of Fuzzy Vaults, common points (he might have to try all the combinations) as accompanied by their Verification Data and we are moving to shown: 𝑃𝐵 (𝑥) = 747972𝑥 8 + 789583𝑥 7 + 787987𝑥 6 + 1𝑥 5 + 1𝑥 4 + steps 3 and 4. Now, Bob will attempt to unlock these Fuzzy Vaults, 1𝑥 3 + 1𝑥 2 + 1𝑥 1 + 1𝑥 0 . Last but not least, Bob has to verify the thus, linking his records with Alice’s. To do so, he will have to recovered polynomial, because he has no knowledge if it is the perform the actions illustrated in Figure 5. First, he will have correct one or not. Therefore, he calculates the MD5 hash of the to create a key for each of his records. Then, all of his keys are concatenation string of his polynomial’s coefficients, which in going to be tested against each of Alice’s Fuzzy Vaults, also using our case is “2d73f7e7425e2d20264a799e24715317”. Comparing the each vault’s Verification Data. Let us see these in detail. two hashes Bob can asses that the two strings match. 4.3.1 Verification. As described in Section 3.3, a Fuzzy Vault 4.5 Protocol Conclusion is unlocked by a key similar to the one that locked it. It is evident Eventually, after Bob has decided which of his records match that, for this to happen, Bob should create his unlocking keys Alice’s, he transmits a list of the matching records he recovered in way identical to the process that Alice followed to create requesting their entire data. Alice responds and she exchanges her locking keys. Thus, Bob builds each of his unlocking keys her corresponding data with Bob (lines 5, 6 of Algorithm 1). 𝐾 𝐵 , for each of his records, using Algorithm 3. Afterwards, he will calculate the intersections of all the unlocking keys with 5 PRIVACY DISCUSSION all the Fuzzy Vaults, forming a series of Common Sets, each of In our setup, we consider that Alice and Bob exhibit an honest which contains (𝑥, 𝑦) pairs. Each of Common Set 𝑀 is then used but curious behavior, meaning that they try to gather as much for reconstructing the polynomial. Adhering to the literature additional information they can by adhering to the protocol. [6, 20, 25], this achieved through Lagrange interpolation [7] on A Fuzzy Vault’s security properties rely on the number of chaff the common points 𝑀, to create an interpolating polynomial. points used while constructing it [12]. We remind our notation, where 𝑛 represents the degree of the polynomial used, |𝐶 | the 4.4 Approximate Matching number of generated chaff points and |𝑅| the number of real, At this point, our methodology for approximate matching is ap- genuine points. Taking into account that, in order to have a plied. The key point for achieving approximation with a Fuzzy correctly reconstructed polynomial, 𝑛 + 1 of the genuine points Vault is the step of the reconstruction of the polynomial described must be used. Therefore, any adversary who has access to a by the Fuzzy Vault. As mentioned earlier, an interpolated poly- fuzzy vault and wants to unlock it has to find all possible 𝑛 + 1 nomial of 𝑛-th degree needs at least 𝑛 + 1 interpolating points combinations of points and make reconstruction attempts. Even in order to be created. In our case, the polynomial is secured in this case the adversary has to know the polynomial degree 𝑛 in a Fuzzy Vault as |𝐾 𝐴 | genuine pairs (𝑥, 𝑦). Therefore, when which has been used. |𝐴|= 𝑛 + 1, in order to interpolate an 𝑛-th degree polynomial, the Without any extra information (rather than the polynomial two keys have to be identical. degree) any adversary is required to find |𝑅𝑛+1 |+ |𝐶 |  point combi- However, if we build a key 𝐾 𝐴 , having a size greater than nations. From all the combinations of 𝑛 + 1 points that can be the minimum number of points that an 𝑛-th degree polynomial |𝑅 |  extracted by the vault only 𝑛+1 combinations are able to un- needs in order to be recovered, i.e., |𝐾 𝐴 |> 𝑛 + 1 we can exploit lock it. Therefore, the probability of guessing correctly a needed these redundant (𝑥, 𝑦) pairs to achieve approximation. In other combination is given by Equation 7. words, since 𝑛 + 1 points are required to unlock an 𝑛-th degree polynomial and we project a key with size larger than 𝑛 + 1, we |𝑅 |  may use any 𝑛 + 1 points of 𝐾 𝐴 to unlock the Fuzzy Vault. As 𝑛+1 (7) |𝑅 |+|𝐶 |  such, 𝐾 𝐵 should only match 𝑛 +1 of 𝐾 𝐴 ’s points, with |𝐾 𝐴 |> 𝑛 +1, 𝑛+1 thus leading to linking non-identical records. On the other hand, Consequently, a brute force attack in a well concealed fuzzy as |𝐾 𝐴 | increases, the probability of having common elements vault is considered impracticable. In our approach, to further in different keys is increased too. As such, too large keys would increase privacy, we employ a series of further enhancements. result into increased numbers of fault positives. First, there are the chaff points, which comprise the Fuzzy Vault. Now, after the recovering the polynomial from 𝑉 , Bob, creates Then, there is the use of a Reference Set and, finally, the use of a Verification Hash 𝐻 𝐵 in an identical way that Alice created non-bijective mapping functions. Matching Performance 1.0 0.85 0.83 0.84 0.83 0.82 0.83 0.82 0.81 0.8 0.80 0.69 0.74 0.66 0.68 0.65 0.65 0.67 0.64 0.61 0.59 0.62 0.63 0.6 0.55 0.48 0.46 0.46 0.44 0.43 0.4 Precision 0.2 Recall Balanced Accuracy 0.0 1 2 3 4 5 6 7 8 9 Experiment id Figure 6: Fuzzy Vault Matching Performance. Chaff points are responsible for concealing the vault’s secret database as matching attributes, assuming they comprise a can- which, in our case, is a record . We have taken the following didate key. Sampling this database, we constructed two datasets, measures in order to further enhance this property. First of all, dataset_A and dataset_B. Dataset_A initially consisted of 1000 regarding the use of the Reference Set and mapping functions. records and then deduplicated to 990 ones. Dataset_B has been Sensitive data are not encoded themselves within the Fuzzy Vault. built from dataset_A. Since our method is capable of achieving What is encoded with the Reference Set is the output of a distance approximate matching and in order to evaluate its performance or similarity function. Furthermore, the output of this function is in this terms, we corrupted Dataset_B so as to contain one error fed to a non-bijective function, thus constituting both the polyno- per attribute, as it is rather unusual for a word to contain more mial representation and the key generation processes irreversible. than one to two typographical errors [8]. Experiments were con- To continue, to avoid frequency attacks, instead of using a ducted on an 8-core virtual server with 16GB of RAM, utilizing, random function to generate points, the chaff generation tech- however, only a single core was for the assessment each time. nique we propose is based on the use of a Reference Set. This way, For the Reference Set, now, there are two aspects we are in- the abscissas of the points delivered to Bob are indistinguishable terested in. Its content and its size. As for the content, we used from the original points. Consider that the vault has 15 genuine names and surnames originating from the initial database. In points and 200 chaff points which their abscissas os also derived terms of Reference Set sizes, we experimented with three distinct from the Reference Set, the same with the genuine points. Bob dataset sizes. For each letter of the latin alphabet, 15, 20 and 25 knows all the possible abscissas which are produced by the Refer- records were selected with a uniform distribution, resulting in ence Set but our method uses these abscissas as noise. Therefore, 15 ∗ 26 =390 records, 20 ∗ 26 = 520 records and 25 ∗ 26 = 650 there is no information leakage which can help Bob to unlock records. For each size we created three different Reference Sets the vault without having a proper key. in order to have more information about the effect of the set’s Furthermore, At this point, it should be noted that, in contrast content on the results. Table 2 contains the configurations for with the biometrics applications utilizing Fuzzy Vaults, [6, 25] all these Reference Sets. Since for each size of Reference Set we the leakage of a key may only affect a single vault. This is due will use three different Reference Sets, each time we perform an to the fact that we employ a separate polynomial for each string experiment with each of them, we maintain a fixed id. To this end, to be locked in the Fuzzy Vault. Additionally, the degree of the experiment ids 1–3 are for the Reference sets of 390 elements, polynomial and the size of the chaff set also affect a Fuzzy Vault’s 4–6, for those of 520 ones and 7–9 for those with 600 elements. offered privacy. For further hardening the Fuzzy Vault, the degree The evaluation of our method is going to be based on metrics of the polynomial may be increased. Increasing the size of the broadly used in information retrieval and data mining. These met- chaff set will as well favor privacy. rics are Precision, Recall and Balanced Accuracy. Precision is the fraction of the relevant elements among the retrieved elements 6 EXPERIMENTAL EVALUATION (𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇 𝑃TP+𝐹 𝑃 ). Recall is the fraction of the retrieved ele- In this section, we will provide empirical evidence regarding the ments divided by the total relevant elements (𝑅𝑒𝑐𝑎𝑙𝑙 = TP ). 𝑇 𝑃 +𝐹 𝑁 matching performance and the time characteristics of our method. Balanced Accuracy is used as a metric to indicate accuracy in Finally, we will compare the Fuzzy Vault method we propose cases of data with imbalanced classes, where the number of in- against a baseline privacy preserving matching method based on stances of one class, overwhelms is significantly larger than the phonetic codes [14]. For our evaluation, we have implemented a number of instances of the other one. This metric normalizes the prototype in Python 31 enhanced by a series of packages2 . true positive and true negative predictions by the number of ac- tual positive and negative samples, respectively, and divides their 6.1 Setup In our experiments, we have used samples from real-world data originating from the North Carolina voters database3 . To evaluate Table 2: Reference sets used in the evaluation. our approach, we have used ‘last name’ and ‘first name’ from this Reference Set id Samples per letter Size 1 https://github.com/XhinoMullaymeri/Fuzzy_Vault_PPRL_DOLAP21 1, 2, 3 15 390 2 Python 3 packages: hashlib, random, scipy.interpolate.lagrange, numpy, 4, 5, 6 20 520 numpy.polyfit, Levenshtein 3 Available at: https://dl.ncsbe.gov/index.html?prefix=data/ 7, 8, 9 25 650 Execution Times Fuzzy Vault vs. Private Soundex 8000 6925 7089 7157 1.0 0.97 7000 6000 5511 5673 5597 0.8 0.81 0.71 0.70 5000 4256 4308 4255 0.61 Time(s) 4000 0.6 3000 0.4 0.40 2000 1000 0.2 Fuzzy Vault 0 Private Soundex 1 2 3 4 5 6 7 8 9 0.0 Recall Precision Balanced Accuracy Experiment id Figure 7: Fuzzy Vault Execution Time. Figure 8: Fuzzy Vault vs. Privacy Preserving Soundex sum by two as shown : 𝐵𝐴 = 𝑇 𝑃𝑅+𝑇 𝑁 𝑅 , where 𝑇 𝑃𝑅 = TP 2 𝑇 𝑃 +𝐹 𝑁 dataset, we would ideally end up with 1,000 pairs matched and and 𝑇 𝑁 𝑅 = TN . 𝑇 𝑁 +𝐹 𝑃 1, 000 ∗ 1, 000 − 1, 000 = 999, 000 pairs not matched. Precision In our experiments, we chose 𝑛 = 8 for the polynomials we and recall do not take into account such imbalances, a situation used and the key sizes to be equal to 15. Our interpolation method captured, however by Balanced Accuracy. was Lagrange interpolation. Also, we have fixed the number of chaff points to 150, which were selected from the Reference 6.2.2 Total Execution Time. In this set of experiments, we will Set using a Normal distribution. Concerning the algorithm for assess the relation between the execution time and the size of comparing the Reference Set with the strings resulting from the the Reference Set. The results of this evaluation are illustrated in records of our dataset we used Levenshtein distance [17]. Figure 7. The vertical axis represents time in seconds, while the horizontal one, again, stands for the Reference Set used. The first 6.2 Experimental results observation we can make is that there are groups formed, based Next, we will examine in detail the outcomes of our empirical on the size of the Reference Set. Within each group, execution evaluation. times are similar, as the sizes of these Reference Sets are identical and for each key the same number of comparisons has to take 6.2.1 Matching Performance. Let us begin our experimental place. On the other hand, as the size of the Reference Set increases, assessment by examining the matching performance achieved execution times increase as well. As a result, we can deduce that, by our Fuzzy Vault scheme. This is illustrated in Figure 6. The total execution time is proportional to the size of the Reference horizontal axis represents the ids of the Reference Sets used, Set. while the vertical one represents the value for the respective This observation is particularly useful, especially when com- metric, all ranging between 0 and 1. Regarding the results, now, bined with the conclusions from the previous set of experiments. first of all, we may observe that Recall levels, represented by the In more detail, using smaller Reference Sets has the result of yellow bars, are almost for every test above 60%. This means higher recall and lower precision. On the other hand larger Ref- that at least 60% of the fuzzy vaults which had been created erence Sets increase, not only execution time but also precision, using dataset_A, managed to be unlocked by the proper record of with the cost of lower recall. Selecting intermediate values for dataset_B resulting in a successful record linkage. Furthermore, the Reference Sets offer a fair trade-off between precision and we may observe three clusters forming, one for each Reference recall and average time performance. Set size, where the results are similar within each cluster. Comparing precision and recall, we may observe that they 6.2.3 Locking and unlocking time shares. In this experiment, follow opposite directions, a fact that constitutes an expected we will focus on Reference Set 8, in order to illustrate time shares outcome. Related with the Reference Set used, the following situ- for locking and unlocking the vaults. These are illustrated in ation occurs. When smaller Reference Sets are employed, more Figures 9a and 9b. We will begin by describing the locking phase. keys are assigned to the same item of the Reference Set. Thus, Here as we may see, the time required by the locking phase is more keys are associated. On the other hand, larger Reference mostly occupied by the chaff points generation. This is, however, Sets associate small numbers of keys in each of their elements, one of the basic elements for offering strong privacy character- leading to higher precision, yet lower recall. istics. The rest of the steps of the locking step occupy only a It is evident that the size of the Reference Set used plays an small portion of the overall time required, making this part of important role in the overall behavior of our method. Neverthe- our approach open for further improvements. less, we have to stress that no particular measures were taken Regarding unlocking, it is easy to see that for unlocking, time to optimize the outcome regarding the distribution of the values is almost equally shared between reconstruction and key genera- of the Reference Sets. We have deliberately chosen this route tion. Identifying common points requires just a tiny portion of in order to illustrate the baseline characteristics our approach the overall time. exhibits. At this point, we will make use of one more plot to be able to Considering Balanced Accuracy now, it is easy to see that it extract safe conclusions regarding execution time. As we may remains at very high levels in all cases. This is an interesting result see, Figure 9c illustrates a time comparison between the locking by itself, since it is not only for the matching pairs of records we and the unlocking phase. It is evident, that the time required to are interested, but also, for the non-matching ones. Considering unlock all vaults is much larger than the time required to lock that we aim at an 1-1 matching for each of dataset_A’s records them. In particular, while 58 seconds are required for encoding, with the ones of dataset_B. Considering 1,000 records for each 6869 seconds are required for both decoding and performing an Key phase 11.74% Validation Data phase 0.00% Genuine Points phase Key phase 1.58% 54.41% Chaff Points phase Common Points phase Lock time 86.44% 44.94% 0.81% Shuffling phase Polynomial Reconstruction Unlock time 0.23% phase 0.65% 99.19% (a) Locking. (b) Unlocking. (c) Successful Unlocking. Figure 9: Execution time shares. all-to-all comparison, since exhaustive all-to-all matching was record linkage. IEEE Transactions on Knowledge and Data Engineering 31, 11 employed in order to allow us extracting safe conclusions. (2018), 2164–2177. [6] T Charles Clancy, Negar Kiyavash, and Dennis J Lin. 2003. Secure smart- cardbased fingerprint authentication. In Proceedings of the 2003 ACM SIGMM 6.2.4 Comparison with privacy preserving Soundex. Finally, workshop on Biometrics methods and applications. 45–52. we conclude our experimental evaluation, comparing against [7] Samuel Daniel Conte and Carl De Boor. 2017. Elementary numerical analysis: Privacy Preserving Soundex [14]. For this experiment, we used an algorithmic approach. SIAM. 31–41 pages. [8] Fred J Damerau. 1964. A technique for computer detection and correction of Reference Set 7, comprising of 650 entries. spelling errors. Commun. ACM 7, 3 (1964), 171–176. This experiment is illustrated in Figure 8. Again, the verti- [9] Aleksander Essex. 2019. Secure Approximate String Matching for Privacy- Preserving Record Linkage. IEEE Transactions on Information Forensics and cal axis displays values of the metrics, while the horizontal one Security 14, 10 (2019), 2623–2632. represents the name of each metric. Green bars correspond to [10] Michael T Goodrich. 2009. The mastermind attack on genomic data. In 2009 Privacy Preserving Soundex, while our method is illustrated with 30th IEEE Symposium on Security and Privacy. IEEE, 204–218. [11] Ali Inan, Murat Kantarcioglu, Gabriel Ghinita, and Elisa Bertino. 2010. Pri- the red bars. We may observe that only in terms of Precision our vate record matching using differential privacy. In Proceedings of the 13th method is inferior to Privacy Preserving Soundex. This, however, International Conference on Extending Database Technology. 123–134. is a result of the low Recall that Privacy Preserving Soundex has [12] Ari Juels and Madhu Sudan. 2006. A fuzzy vault scheme. Designs, Codes and Cryptography 38, 2 (2006), 237–257. exhibited. On the other hand, Fuzzy Vault manages to exhibit a [13] Alexandros Karakasidis, Georgia Koloniari, and Vassilios S Verykios. 2015. more balanced behavior. This result is promising, also taking into Scalable blocking for privacy preserving record linkage. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data account that Privacy Preserving Soundex is a three-party method, Mining. 527–536. as opposed to our two-party method. Regarding execution time, [14] Alexandros Karakasidis and Vassilios S Verykios. 2009. Privacy preserving Private Soundex operates in less than a second. However, as the record linkage using phonetic codes. In 2009 Fourth Balkan Conference in Informatics. IEEE, 101–106. results are simulated and this protocol requires a three-party set- [15] Basit Khurram and Florian Kerschbaum. 2020. SFour: A Protocol for Crypto- ting for achieving privacy, we may not extract valid conclusions graphically Secure Record Linkage at Scale. In 2020 IEEE 36th International based on this comparison, since protocol communication times Conference on Data Engineering (ICDE). IEEE, 277–288. [16] Mehmet Kuzu, Murat Kantarcioglu, Ali Inan, Elisa Bertino, Elizabeth Durham, are not taken into account. and Bradley Malin. 2013. Efficient privacy-aware record integration. In Pro- ceedings of the 16th International Conference on Extending Database Technology. 167–178. 7 CONCLUSIONS AND FUTURE WORK [17] Vladimir Levenshtein. 1965. Binary codes capable of correcting spurious In this paper, we presented the details of our method for per- insertions and deletion of ones. Problems of information Transmission 1, 1 (1965), 8–17. forming two-party privacy preserving record linkage using Fuzzy [18] Xhino Mullaymeri and Alexandros Karakasidis. 2021. A Two-Party Private Vaults. We discussed the privacy properties of our approach and String Matching Fuzzy Vault Scheme. In The 36th ACM/SIGAPP Symposium provided extended empirical evidence that this new approach is On Applied Computing. ACM. [19] Chaoyi Pang, Lifang Gu, David Hansen, and Anthony Maeder. 2009. Privacy- very promising. In our next steps, we aim at further refining our preserving fuzzy matching using a public reference table. In Intelligent Patient method aiming at increasing matching performance while, at the Management. Springer, 71–89. [20] Wendy Ponce-Hernandez, Ramon Blanco-Gonzalo, Judith Liu-Jimenez, and same time, improving execution times, maintaining, nevertheless, Raul Sanchez-Reillo. 2020. Fuzzy Vault Scheme Based on Fixed-Length Tem- our method’s privacy characteristics. We will further experiment plates Applied to Dynamic Signature Verification. IEEE Access 8 (2020), 11152– with tuning our method’s parameters, trying alternative interpo- 11164. [21] Thilina Ranbaduge, Peter Christen, and Rainer Schnell. 2020. Secure and lation methods. Accurate Two-Step Hash Encoding for Privacy-Preserving Record Linkage. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 139–151. REFERENCES [22] Ahsan Saleem, Abid Khan, Furqan Shahid, M Masoom Alam, and Muham- [1] Luca Bonomi, Yingxiang Huang, and Lucila Ohno-Machado. 2020. Privacy mad Khurram Khan. 2018. Recent advancements in garbled computing: how challenges and research opportunities for genomic data sharing. Nature far have we come towards achieving secure, efficient and reusable garbled Genetics (2020), 1–9. circuits. Journal of Network and Computer Applications 108 (2018), 1–19. [2] Feng Chen, Xiaoqian Jiang, Shuang Wang, Lisa M Schilling, Daniella Meeker, [23] Walter J Scheirer and Terrance E Boult. 2007. Cracking fuzzy vaults and Toan Ong, Michael E Matheny, Jason N Doctor, Lucila Ohno-Machado, and biometric encryption. In 2007 Biometrics Symposium. IEEE, 1–6. Jaideep Vaidya. 2018. Perfectly secure and efficient two-party electronic- [24] Springer Verlag GmbH, European Mathematical Society. [n. d.]. Encyclope- health-record linkage. IEEE internet computing 22, 2 (2018), 32–41. dia of Mathematics. Website. URL: https://www.encyclopediaofmath.org/. [3] Yanling Chen. [n. d.]. Current approaches and challenges for the two-party Accessed on 2020-08-11. privacy-preserving record linkage (PPRL). Collaborative Technologies and Data [25] Umut Uludag, Sharath Pankanti, and Anil K Jain. 2005. Fuzzy vault for finger- Science in Artificial Intelligence Applications ([n. d.]). prints. In International Conference on Audio-and Video-Based Biometric Person [4] Peter Christen, Thilina Ranbaduge, and Rainer Schnell. 2020. Linking Sensitive Authentication. Springer, 310–319. Data: Methods and Techniques for Practical Privacy-Preserving Information [26] Dinusha Vatsalan, Ziad Sehili, Peter Christen, and Erhard Rahm. 2017. Privacy- Sharing. Springer International Publishing AG. preserving record linkage for big data: Current approaches and research [5] Peter Christen, Thilina Ranbaduge, Dinusha Vatsalan, and Rainer Schnell. challenges. In Handbook of Big Data Technologies. Springer, 851–895. 2018. Precise and fast cryptanalysis for Bloom filter based privacy-preserving