=Paper=
{{Paper
|id=Vol-1558/paper33
|storemode=property
|title=Using Robust Estimation Theory to Design Efficient Secure Multiparty Linear Regression
|pdfUrl=https://ceur-ws.org/Vol-1558/paper33.pdf
|volume=Vol-1558
|authors=Fida K. Dankar,Sabri Boughorbel,Radja Badji
|dblpUrl=https://dblp.org/rec/conf/edbt/DankarBB16
}}
==Using Robust Estimation Theory to Design Efficient Secure Multiparty Linear Regression==
Fida K. Dankar, Sabri Boughorbel, Radja Badji
Sidra Medical and Research Center, Doha, Qatar
fdankar@sidra.org, sboughorbel@sidra.org, rbadji@sidra.org

ABSTRACT
Various biomedical research studies, such as large-population studies and studies on rare diseases, require sharing of data across multiple sources or institutions. In fact, data sharing will enable the collection of more cases for analysis and thus increase the statistical power of the study. However, combining data from various sources poses privacy risks. A number of protocols have been proposed in the literature to address the privacy concerns, but these protocols do not fully deliver on either privacy or complexity. The main reason lies in the methodology used to design these secure algorithms: it is based on translating regular algorithms into secure versions using cryptographic procedures and tricks rather than on establishing a robust theory for designing secure and communication-free distributed algorithms. In this paper, we use well-established theoretical results to design a secure and low-communication linear regression protocol. The method used is comprehensive and can be generalized to other estimators.

CCS Concepts
• Security and Privacy➝Cryptography • Security and Privacy➝Security services➝Privacy-preserving protocols.

Keywords
Data sharing; secure multiparty computation; linear regression; information privacy.

(c) 2016, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bordeaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1. INTRODUCTION
To advance large-scale biomedical studies, and to strengthen the statistical power of research studies (such as complex associations), researchers frequently need to share data with colleagues from around the world. Such data sharing activities are contingent on the protection of patient anonymity. Traditionally, researchers would strip the data of its identifying information - such as names and unique IDs - before sharing it with each other. However, as many recent studies have shown [1]–[3], it is now possible to deduce the identity of research participants from genomic and/or clinical data that was considered anonymized.

In the face of these growing privacy concerns, researchers are looking more and more at cryptographic secure computations (also known as secure multiparty computations). These secure techniques enable data holders to compute a function over their data while keeping these data private. Only the final result of the computation is revealed to all parties. Although secure computation protocols exist for multiple data mining/statistical functions, these protocols are mostly inappropriate for real-world application due to their computational overhead. Communication cost is the main factor driving this inefficiency (extensive message passing between the different concerned parties is the main bottleneck of existing secure computations). The main reason lies in the methodology used to design these secure algorithms: it is based on translating regular algorithms into secure versions using cryptographic procedures and tricks rather than on establishing a robust theory for designing secure and communication-free distributed algorithms.

There is considerable research devoted to designing distributed algorithms that are communication free. In such algorithms, data is partitioned into multiple subsets, subsets are stored on different machines, processes run on the different subsets simultaneously, and the final outcome is calculated as a function of the individual ones, thus limiting communication to the final step. In this paper, we import some of this theory into the design of a secure and low-communication linear regression protocol. The method used is comprehensive and can be generalized to other estimators (geometric median, principal component analysis, etc.). The theory is described in the next section, followed by a description of the secure protocol. The paper concludes with a discussion of the limitations and future directions.

2. THEORY
We introduce the classical setting of a linear regression problem. Let X = {x_ij} be an n×p matrix of features and Y = (y_1, …, y_n)^T a corresponding n×1 response vector, where n is the number of samples and p is the number of features.
The linear regression such as names and unique IDs- before sharing it with each other. problem consists of solving the linear model, 𝑌 = 𝑋𝛽 + 𝜖 [4]. To However, as many recent studies have shown [1]–[3], it is now improve the computational efficiency of the linear regression possible to deduce the identity of research participants from problem, Wang et al [5] designed a distributed linear regression genomic and/or clinical data that was considered anonymized. algorithm. The algorithm, referred to as message, partitions the dataset into multiple subsets and processes these subsets in In the face of these growing privacy concerns, researchers are parallel. Formally, the data (𝑋, 𝑌) is divided horizontally into 𝑘 looking more and more at cryptographic secure computations (also known as secure multiparty computations). subsets (𝑋 / , 𝑌/ ; … ; (𝑋 : , 𝑌 : )}, with 𝑋 % = (𝑥/% , … , 𝑥;% ) the 𝑛% ×𝑝 feature matrix for subset 𝑖 and 𝑌 % = (𝑦/% , … , 𝑦1% = )3 the corresponding 𝑛% ×1 response vector. The algorithm then executes the following two steps: (c) 2016, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 1. In the first step, the feature selection vector for each 2016, Bordeaux, France) on CEUR-WS.org (ISSN 1613-0073). subset is calculated separately using the Lasso algorithm Distribution of this paper is permitted under the terms of the Creative [6]. After that, the overall model is obtained by Commons license CC-by-nc-nd 4.0 including the features selected by the majority of the subsets. Thus, if 𝛾 % = {𝛾/% , … , 𝛾;% } is the feature selection vector for site 𝑖, (with 𝛾'% = 1 if feature 𝑗 is 3.2 Protocol included and 0 otherwise), then the overall vector is The main algorithm is described in Algorithm 1; it can be roughly obtained by: divided into 4 steps: 𝛾 = {𝑚𝑜𝑑𝑒 𝛾// , … , 𝛾/: , … , 𝑚𝑜𝑑𝑒 𝛾:/ , … , 𝛾:: } 1. The third party generates the keys for the Paillier 2. 
In the second step, the coefficients of the selected cryptosystem and propagates the public key to all features are estimated separately for each subset, and parties the final result is obtained by averaging the coefficient 2. Each party calculates its local feature selection vector estimates of each feature. Thus, if 𝛽 % = {𝛽/% , … , 𝛽;% } is via Lasso, the overall feature selection vector is then the feature coefficients vector for site 𝑖, then the overall calculated via a secure mode protocol (Algorithm 2). vector is obtained by: The secure mode protocol retains all features that have 𝛾 = {𝐴𝑣𝑒𝑟𝑎𝑔𝑒 𝛽// , … , 𝛽/: , … , 𝐴𝑣𝑒𝑟𝑎𝑔𝑒 𝛽:/ , … , 𝛽:: } an overall inclusion probability greater than ½. The calculated mode vector is then propagated to the The message algorithm fairs well not only in terms of different parties without leaking any information about computational time, but also in terms of feature selection individual feature selection vectors. performance. In fact, the authors prove that under certain assumptions (that will be detailed in the discussion section) the feature inclusion vector converges exponentially to the optimal Initialization model [5]. A result that is better than applying Lasso to the whole 1: Input dataset. 𝑛, 𝑝, 𝑘, {𝑆/ , … , 𝑆: }, {(𝑋 / , 𝑌/ ); … ; (𝑋 : , 𝑌 : )}, {𝑛/ , … , 𝑛: } # Privacy is not a concern in the message algorithm. In fact, sharing 𝑛 is the total sample size, 𝑝 is the number of features, 𝑘 is of intermediate and local results between the different subsets is the number of sites, {𝑆/ , … , 𝑆: } are the different sites, supported in the algorithm. Prior work, [7], demonstrated through multiple scenarios, how identifying information can be inferred 𝑋 % , 𝑌 % represent the data set held by site 𝑖, and 𝑛% is the from local and intermediate statistics using various different sample size at site 𝑖. attacks. 
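The two aggregation steps above can be illustrated with a short, self-contained sketch. This is our illustration, not the authors' code: for brevity it substitutes a hard-thresholded ordinary least squares fit for the Lasso step of [6], and the threshold `tau`, data sizes, and partitioning scheme are all illustrative assumptions. The horizontal partitioning, the feature-wise majority vote (mode), and the coefficient averaging follow the two steps described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n samples, p features, only the first 3 features are active
n, p, k = 600, 10, 6
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
X = rng.normal(size=(n, p))
Y = X @ beta_true + 0.1 * rng.normal(size=n)

def fit_site(Xi, Yi, tau=0.5):
    """Per-site model: OLS fit plus a hard-thresholded inclusion vector
    (standing in here for the Lasso step of the message algorithm)."""
    beta_i = np.linalg.lstsq(Xi, Yi, rcond=None)[0]
    gamma_i = (np.abs(beta_i) > tau).astype(int)
    return gamma_i, beta_i

# Horizontal partition into k subsets, each processed independently
sites = [(X[i::k], Y[i::k]) for i in range(k)]
gammas, betas = zip(*(fit_site(Xi, Yi) for Xi, Yi in sites))

# Step 1: overall inclusion vector = feature-wise mode (majority vote)
gamma = (np.sum(gammas, axis=0) > k / 2).astype(int)

# Step 2: overall coefficients = feature-wise average, restricted to gamma
beta = np.mean(betas, axis=0) * gamma

assert gamma.tolist() == [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```

Communication here is limited to the final step: each site ships only its p-length vectors γ^i and β^i, never its raw data.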
This paper uses the theory behind the message algorithm to formulate an efficient and private linear regression algorithm. The proposed algorithm performs efficiently with regards to feature selection and model calculation under certain assumptions. The algorithm is presented in the next section and the assumptions are discussed in Section 5. The paper concludes with a discussion of future work.

3. SECURE PROTOCOL
We assume that a dataset (X, Y) is horizontally distributed among k ≥ 2 data holders (or sites) S_1, …, S_k. The different sites are interested in cooperatively performing linear regression on the union of their datasets; however, they are not willing to share their data. Only the final result of the computation should be revealed to all parties. The secure algorithm is depicted in Algorithms 1 and 2; it requires a semi-trusted third party whose role is the generation of keys for a public encryption cryptosystem as well as the decryption of some results. The Paillier cryptosystem [8] will be used due to its nice homomorphic properties. These properties are introduced in the next subsection, followed by a presentation of the secure protocol.

3.1 Paillier Public Cryptosystem
In public key cryptosystems, the ciphertext (the encryption) c of a message m (usually an integer) is obtained by applying an encryption function: c = Enc_pk(m), where pk is the public encryption key. The ciphertext c can be decrypted using another key sk (referred to as the secret key) and a decryption function: m = Dec_sk(c). The Paillier public cryptosystem is additively homomorphic; as such, the sum of two messages can be obtained from their respective ciphertexts. For Paillier, this translates to Enc_pk(m_1 + m_2) = Enc_pk(m_1) × Enc_pk(m_2) [8]. Moreover, Paillier allows a limited form of homomorphic multiplication, in that we can multiply an encrypted message by a plaintext. It is done as follows: Enc(m_1)^{m_2} = Enc(m_1 m_2).

3.2 Protocol
The main algorithm is described in Algorithm 1; it can be roughly divided into 4 steps:

1. The third party generates the keys for the Paillier cryptosystem and propagates the public key to all parties.
2. Each party calculates its local feature selection vector via Lasso; the overall feature selection vector is then calculated via a secure mode protocol (Algorithm 2). The secure mode protocol retains all features that have an overall inclusion probability greater than ½. The calculated mode vector is then propagated to the different parties without leaking any information about individual feature selection vectors.
3. After receiving the overall feature selection vector, each party calculates the coefficients of the selected features separately. The encrypted average of these coefficients is then securely computed using Paillier encryption as follows: each site i calculates Enc(β^i/k), then all sites sequentially calculate Enc(∑_{i=1}^k β^i/k) = ∏_{i=1}^k Enc(β^i/k). The encrypted result is sent to the third party.
4. The third party decrypts and propagates the estimated feature coefficients.

Initialization
1: Input n, p, k, {S_1, …, S_k}, {(X^1, Y^1); …; (X^k, Y^k)}, {n_1, …, n_k}
   # n is the total sample size, p is the number of features, k is the number of sites, {S_1, …, S_k} are the different sites, (X^i, Y^i) represent the data set held by site i, and n_i is the sample size at site i.
Third party
2: Propagates a public encryption key to all sites
Each party calculates
3: γ^i ∈ {0,1}^p  # the feature inclusion model via Lasso
All parties calculate
4: γ = mode(γ^i, i ∈ {1, …, k}) via the secure mode protocol
Each party calculates
5: β^i = ((X^i)^T X^i)^{-1} (X^i)^T Y^i  # the estimated coefficients for features γ
Each party calculates
6: Enc(β^i/k)
All parties calculate
7: Enc(β) = Enc(∑_{i=1}^k β^i/k) = ∏_{i=1}^k Enc(β^i/k)
   # Calculation is done sequentially: party 1 calculates Enc(β^1/k) and sends it to party 2, party 2 calculates Enc(β^1/k) × Enc(β^2/k) and sends it to party 3, and so on.
8: Enc(β) is sent to the third party
Third party
9: Decrypts and propagates β

Algorithm 1. Main algorithm
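The two homomorphic identities of Section 3.1, on which steps 6–7 of Algorithm 1 rely, can be checked with a minimal textbook Paillier implementation. This sketch is ours, uses tiny hard-coded primes, and is in no way secure; it also omits the fixed-point encoding a real deployment would need for the fractional values β^i/k.

```python
import math
import random

def keygen(p, q):
    """Toy Paillier key generation (small primes, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    L = lambda u: (u - 1) // n
    mu = pow(L(pow(g, lam, n * n)), -1, n)  # modular inverse (Python 3.8+)
    return (n, g), (lam, mu)

def enc(pk, m):
    n, g = pk
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def dec(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    L = lambda u: (u - 1) // n
    return (L(pow(c, lam, n * n)) * mu) % n

pk, sk = keygen(17, 19)
n2 = pk[0] ** 2
c1, c2 = enc(pk, 20), enc(pk, 22)
# Additive homomorphism: the product of ciphertexts decrypts to the sum
assert dec(pk, sk, (c1 * c2) % n2) == 42
# Scalar homomorphism: Enc(m)^x decrypts to m*x (mod n)
assert dec(pk, sk, pow(c1, 3, n2)) == 60
```

The first assertion is exactly the mechanism of step 7 (the sequential product of the parties' ciphertexts yields the encrypted sum); the second is the exponentiation used in step 6 of the secure mode protocol.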
The secure mode protocol is described in Algorithm 2; it computes the mode of the k feature inclusion vectors without revealing any information about the different sites (in other words, the inclusion information for the sites is kept confidential; only the mode is revealed to the different parties).

If ∂ = ∑_{i=1}^k γ^i is the sum of the feature inclusion vectors of the different parties, then w = ∂ − {k/2}^p would indicate whether each feature should be included or not through its sign: if w is positive at position j (w[j] > 0), this implies that the majority of sites had 1 at the jth position in their feature inclusion model, and thus the overall feature inclusion vector should be set as γ[j] = 1; otherwise it is set to 0. Thus, in order to calculate the feature inclusion vector, it is enough to securely calculate w. However, as w presents aggregated information about the sites (the number of sites that include every feature), our secure protocol will instead calculate an obfuscated version, wx, where x is a positive integer whose factors are distributed among the different parties: x = ∏_{i=1}^k x_i, where x_i is held by site i and is kept confidential. As x is positive, wx would still indicate whether each feature should be included or not through its sign. The protocol is explained in detail in Algorithm 2.

Initialization
1: θ = {1/2}^p
Each party i
2: Generates a random positive integer x_i  # denote x = ∏_{i=1}^k x_i
Each party calculates
3: w^i = (γ^i − θ)
4: Enc(w^i)
All parties calculate
5: Enc(w) = Enc(∑_{i=1}^k w^i) = ∏_{i=1}^k Enc(w^i)  # note if w is positive at position j (w[j] > 0), this implies that the majority of sites had 1 at the jth position in their feature inclusion model
Sequentially the parties calculate
6: Enc(wx) = {Enc(w[1])^x, …, Enc(w[p])^x}  # party 1 calculates Enc(wx_1) = {Enc(w[1])^{x_1}, …, Enc(w[p])^{x_1}} and sends it to party 2, party 2 calculates Enc(wx_1x_2) = {Enc(wx_1[1])^{x_2}, …, Enc(wx_1[p])^{x_2}}, and so on.
7: Enc(wx) is sent to the third party
Third party
8: Decrypts Enc(wx)
9: Propagates wx
Each site
10: Calculates γ as follows: γ[j] = 1 if wx[j] > 0, γ[j] = 0 otherwise

Algorithm 2. Secure mode protocol

3.3 Protocol Modification
It is important to note that the algorithm can be slightly changed to operate without the need of a third party. In such a case, a threshold Paillier cryptosystem can replace the third party [9]–[11]. In a threshold cryptosystem, the secret decryption key is distributed among a preset number, t, of entities (the data holders in our case). For the decryption to occur, each of the t entities has to perform its share of the decryption. The decryption shares are then combined to obtain the final result. This way, the protocol will be robust against the corruption of at most t − 1 entities [12] (note that the current version works under the assumption that all sites are non-corruptible, because if one party is corrupted, then it can communicate with the third party to decrypt intermediate results).

The threshold Paillier cryptosystem can be set up through a trusted party that generates and distributes the public and secret keys. The trusted party can then erase all information pertaining to the key generation. If no such trusted party is available, the keys can be generated using secure multiparty computations [13]. Although this requires more computation overhead from each data owner, it only has to be done once. As such, it is an acceptable tradeoff.
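The sign-obfuscation idea behind Algorithm 2 can be demonstrated in plaintext (our sketch; in the actual protocol steps 3–6 run under Paillier encryption, and all sizes here are illustrative). The vote offsets are doubled so that w stays integral, and the product of the sites' secret factors blinds the vote counts while preserving each sign.

```python
import random

random.seed(1)
p, k = 5, 7  # features, sites (k odd, so no ties occur)

# Local inclusion vectors gamma^i (1 = feature selected at site i)
gammas = [[random.randint(0, 1) for _ in range(p)] for _ in range(k)]

# 2*w[j] = sum_i (2*gamma^i[j] - 1): doubled to keep values in integers
w2 = [sum(2 * g[j] - 1 for g in gammas) for j in range(p)]

# Each site holds one secret positive factor x_i; their product x blinds
# the magnitude of w (which would leak vote counts) but not its sign
x = 1
for _ in range(k):
    x *= random.randint(2, 1000)
blinded = [x * v for v in w2]

# The mode is recovered from the signs alone
gamma = [1 if b > 0 else 0 for b in blinded]
majority = [1 if sum(g[j] for g in gammas) > k / 2 else 0 for j in range(p)]
assert gamma == majority
```

A decrypting party thus learns only which features won the majority vote, not how many sites voted for each one.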
4. COMPLEXITY
In this section, we evaluate the additional computational burden of the algorithm incurred to achieve security. This burden is evaluated on each participating party. The complexity will be expressed in terms of the number of messages and in terms of basic functional units. These units are homomorphic multiplication (HM) and homomorphic addition (HA). Assuming an instance of Paillier with modulus m^2 [8], HA is equivalent to multiplying two integers modulo m^2, and HM is equivalent to computing an exponentiation modulo m^2 where the exponent is at most m. As such, an HM operation is equivalent to log(m) HA. With these considerations, it follows that a message encryption is dominated by 2 HM, while a decryption is dominated by 1 HM.

The algorithm performs 2 encryptions per party (step 6 of the main algorithm and step 4 of the secure mode protocol) and one homomorphic multiplication per party (step 6 of the secure mode protocol). This amounts to a total of 5 HM per party. On the other hand, the number of messages passed per party is constant, amounting to 5 to 7 messages. Prior secure linear regression algorithms bear much higher computational burdens on the participants [7], [12], [14]–[16]. The best such algorithm [12] requires O(k + p^2/2) messages per party and O(p^3/2) HM operations per party.

5. DISCUSSION
Algorithm consistency for regression or classification is a well-studied property in the field of statistical learning. It gives a theoretical guarantee that an algorithm converges to an optimal algorithm when the number of training samples becomes very large. In practical situations, the number of samples is not that large, and therefore the (theoretical) consistency property is not of critical importance; what is more critical is the validation of the proposed algorithms on multiple diverse datasets, as it provides confidence in their performance on real datasets and under different settings.

In general, both consistency and good performance on experimental datasets are well-desired properties for regression algorithms. The algorithm message (that was used to derive our algorithm) was tried in various experiments using multiple datasets in [5], and it showed good consistency and good performance. Nonetheless, in what follows, we discuss the required assumptions to ensure the theoretical property of consistency. There are four assumptions for the consistency of the selection algorithm [5]:

• A.1 Consistency condition for estimation: this condition is usually required for high-dimensional model regression.
• A.2 Conditions on model parameters: it imposes a restriction on model noise, parameter estimates and the number of selected features.
• A.3 (Lasso) The strong irrepresentable condition.
• A.4 The sparse Riesz condition.

A.2 is a basic assumption that is required for high-dimensional model estimation [17], [18]; A.3 and A.4 are the specific conditions for model selection consistency in Lasso. In [5], a theoretical analysis of assumptions A.1, A.3 and A.4 is proposed. This analysis indicates that it is possible to ensure the validity of these three assumptions under the assumption that n ≥ k(A + kC), where n is the total number of samples, k is the number of sites, and the number of samples per site is assumed to be equal (n_i = n/k). A and C are complex terms that depend on the training data and model parameters. The previous inequality guarantees that the three assumptions are valid in probability provided that k and n are chosen such that k = O(n).

The experimental results in the paper that introduced the message algorithm are obtained for both synthetic and real-world data [5]. For the synthetic data, the number of features p is 10000, the number of sites k is 50 or 200, and the sample size n is varied between 2000 and 10000. For the first real-world dataset (household electric power consumption), the sample size is n = 2 million for training and n = 75259 for testing, and the number of sites is k = 200. For the second real-world dataset (HIGGS classification), n = 11 million and k = 1000. For the previously described experimental settings, the median selection algorithm showed very good results. Table 1 summarizes the values of n and k chosen in [5].

Table 1. Choices of sample sizes (n) and number of sites (k) in [5]
Experimental data | Number of samples (n) | Number of sites (k)
Synthetic | 2000-10000 | 50, 200
Household electric power consumption | 75259, 2 million | 200
HIGGS classification | 11 million | 1000

In summary, the previous experimental results showed that a ratio between the number of samples and the number of sites of about 50 might ensure a nice consistency property.

6. FUTURE WORK
As future work, we plan to investigate the optimal choice of the number of sites as a function of the number of samples to guarantee the trade-off between accurate regression and efficient computation. The consistency assumption gives an inequality condition between the number of sites and samples. We plan to look deeper into this relation and validate the theoretical findings using experimental data.

We also plan to test our algorithm on real data sets and compare it (i) to existing secure computation protocols, and (ii) to the ordinary Lasso technique (applied on the combined dataset) in terms of both accuracy and performance.

Additionally, we plan to extend our method to other estimators such as principal component analysis. This may require designing new secure protocols for dealing with intermediate statistical calculations among the different sites, analogous to the presented secure mode protocol.

7. REFERENCES
[1] Y. Erlich and A. Narayanan, "Routes for breaching and protecting genetic privacy," Nat. Rev. Genet., vol. 15, no. 6, pp. 409–421, 2014.
[2] M. Naveed, E. Ayday, E. W. Clayton, J. Fellay, C. A. Gunter, J.-P. Hubaux, B. A. Malin, and X. Wang, "Privacy in the genomic era," ACM Comput. Surv., vol. 48, no. 1, p. 6, 2015.
[3] E. Check Hayden, "Researchers wrestle with a privacy problem," Nature, vol. 525, no. 7570, pp. 440–442, Sep. 2015.
[4] D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to Linear Regression Analysis, vol. 821. John Wiley & Sons, 2012.
[5] X. Wang, P. Peng, and D. B. Dunson, "Median Selection Subset Aggregation for Parallel Inference," in Advances in Neural Information Processing Systems, 2014, pp. 2195–2203.
[6] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Stat. Soc. Ser. B Methodol., pp. 267–288, 1996.
[7] K. El Emam, S. Samet, L. Arbuckle, R. Tamblyn, C. Earle, and M. Kantarcioglu, "A secure distributed logistic regression protocol for the detection of rare adverse drug events," J. Am. Med. Inform. Assoc., vol. 20, no. 3, pp. 453–461, May 2013.
[8] P. Paillier, "Public-key cryptosystems based on composite degree residuosity classes," in Advances in Cryptology – EUROCRYPT '99, 1999, pp. 223–238.
[9] Y. Desmedt, "Threshold cryptosystems," in Advances in Cryptology – AUSCRYPT '92, J. Seberry and Y. Zheng, Eds. Springer Berlin Heidelberg, 1993, pp. 1–14.
[10] C. Hazay, G. L. Mikkelsen, T. Rabin, and T. Toft, "Efficient RSA key generation and threshold Paillier in the two-party setting," in Topics in Cryptology – CT-RSA 2012, Springer, 2012, pp. 313–331.
[11] R. Canetti and S. Goldwasser, "An efficient threshold public key cryptosystem secure against adaptive chosen ciphertext attack," in Advances in Cryptology – EUROCRYPT '99, 1999, pp. 90–106.
[12] F. Dankar, "Privacy Preserving Linear Regression on Distributed Databases," Trans. Data Priv., vol. 8, pp. 3–28, 2015.
[13] T. Nishide and K. Sakurai, "Distributed Paillier cryptosystem without trusted dealer," in Information Security Applications, Springer, 2011, pp. 44–60.
[14] R. Hall, S. E. Fienberg, and Y. Nardi, "Secure multiple linear regression based on homomorphic encryption," J. Off. Stat., vol. 27, no. 4, p. 669, 2011.
[15] V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft, "Privacy-Preserving Ridge Regression on Hundreds of Millions of Records," 2012.
[16] F. Dankar, R. Brien, C. Adams, and S. Matwin, "Secure Multi-Party Linear Regression," in EDBT/ICDT Workshops, 2014, pp. 406–414.
[17] P. Zhao and B. Yu, "On model selection consistency of Lasso," J. Mach. Learn. Res., vol. 7, pp. 2541–2563, 2006.
[18] Y. Kim, S. Kwon, and H. Choi, "Consistent model selection criteria on high dimensions," J. Mach. Learn. Res., vol. 13, no. 1, pp. 1037–1057, 2012.