Low-Latency Privacy-Preserving Deep Learning Design via Secure MPC

Ke Lin¹, Yasir Glani¹ and Ping Luo¹,*
¹ Tsinghua University, 30 Shuangqing Rd., Haidian District, Beijing, 100084, China

Abstract
Secure multi-party computation (MPC) facilitates privacy-preserving computation between multiple parties without leaking private information. While most secure deep learning techniques use MPC operations to achieve feasible privacy-preserving machine learning on downstream tasks, their computation and communication overhead still hampers practical application. This work proposes a low-latency secret-sharing-based MPC design that reduces unnecessary communication rounds during the execution of MPC protocols. We also present a method for improving the computation of commonly used nonlinear functions in deep learning by integrating multivariate multiplication and coalescing different packets into one to maximize network utilization. Our experimental results indicate that our method is effective in a variety of settings, with a speedup in communication latency of 10∼20%.

Keywords
Multi-party computation, deep learning, privacy-preserving

The IJCAI-2024 AISafety Workshop
* Corresponding author.
leonard.keilin@gmail.com (K. Lin); yasirglani@gmail.com (Y. Glani); luop@tsinghua.edu.cn (P. Luo)
ORCID: 0009-0002-5376-7881 (K. Lin); 0000-0003-0060-4771 (Y. Glani); 0000-0001-6171-3811 (P. Luo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Secure multi-party computation (MPC) [1, 2] enables parties to compute securely over their private data without revealing the data to each other. Its privacy-preserving property makes secure MPC suitable for most privacy-sensitive domains, such as medical research and finance. With the development of deep learning, the ability of neural models to capture important information from large datasets raises concerns about the surveillance of individuals [3]. In this context, secure MPC finds a natural application in secure machine learning and deep learning. While MPC-based deep learning frameworks achieve significant performance in general scenarios, most works suffer from limitations caused by (1) network communication, due to the exchange of intermediate information during MPC execution, and (2) excessive computation introduced by complex MPC protocols. Since the computation of an MPC protocol is largely determined by its sophisticated design, optimizing the protocol itself is difficult and often infeasible. Thus, some studies [4] focus on improving the communication stage of MPC protocols to make them more practical. In this paper, we present an approach that reduces the communication latency of MPC protocols through optimized multivariate multiplication.

In general, privacy-preserving deep learning frameworks adopt secret-sharing-based techniques to avoid extensive computational overhead [5, 6, 7, 8]. Consequently, secret-sharing-based methods require multiple exchanges of intermediate results to achieve collaborative MPC operations. As these MPC techniques are based on linear computations, such as addition and multiplication, modern deep learning techniques that inherently rely on linear algebra benefit significantly from them. Considering this heavy dependency on linear operations, our research aims to reduce unnecessary communication rounds, following [9], during the execution of MPC protocols.

Our main contributions are as follows:

• We propose a low-latency secret-sharing-based method for computing multivariate multiplications and univariate polynomials that communicates efficiently and reduces unnecessary communication rounds on the fly.

• We improve the computation of nonlinear functions by integrating the proposed multivariate multiplication and coalescing different packets into a single packet to maximize network utilization.

• We conduct experiments to evaluate the effectiveness of our method with respect to models of varying sizes, networks with different latency and bandwidth, the accuracy of downstream classification tasks, and the number of participants involved. The results indicate an overall improvement of 10∼20% in communication latency.

2. Background

2.1. Arithmetic Secret Sharing Based Scheme

Our setup primarily concerns arithmetic operations, so we represent all inputs and intermediate results as linear secret shares between n parties, specifically in an additive secret-sharing scheme. Apart from the general (n, t)-Shamir secret sharing scheme [10], which relies on degree-t polynomials over n parties, we adopt a simple arithmetic secret sharing scheme corresponding to (n, 0)-Shamir secret sharing. In other words, we share a scalar value x ∈ Z/QZ across the n parties 𝒫, where Z/QZ denotes a ring with Q elements, following the notation of [5]. The sharing of x is defined as [x] = {[x]_p}_{p∈𝒫}, where [x]_p is party p's share of x. The ground-truth value x can be reconstructed from the sum of the shares of all parties, i.e. x = Σ_{p∈𝒫} [x]_p.

When parties wish to share a value x, they generate a pseudorandom zero-share that sums to 0, and the party that possesses the value adds x to its own share in secret. To represent floating-point numbers, we adopt a fixed-point encoding that maps any floating-point number x_F to a fixed-point representation x. That is, each x is the result of multiplying a floating-point number x_F by a scaling factor B = 2^L and rounding to the nearest integer, i.e. x = ⌊B·x_F⌉, where L is the precision of the fixed-point encoding. To decode a ground-truth floating-point value x_F from x, we compute x_F ≈ x/B.

2.2. Arithmetic Secret Sharing Based MPC

It is noteworthy that arithmetic secret shares are homomorphic and can be used to implement secure MPC, especially for linear computation.

Addition. The sum of two secret-shared values [x] and [y] can be computed directly as [z] = [x] + [y], where each party p ∈ 𝒫 computes [z]_p = [x]_p + [y]_p without any multi-party communication.
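The sharing, reconstruction, fixed-point encoding, and local addition described above can be sketched in a few lines of plain Python. This is a local toy simulation for illustration only; the modulus Q, precision L, and helper names are our own assumptions, not the paper's implementation:

```python
import random

Q = 2**64          # assumed ring modulus for Z/QZ
L = 16             # fixed-point precision bits
B = 2**L           # scaling factor

def encode(x_f: float) -> int:
    """Fixed-point encode: x = round(B * x_F) mod Q."""
    return round(x_f * B) % Q

def decode(x: int) -> float:
    """Fixed-point decode; map back to a signed representative first."""
    if x >= Q // 2:
        x -= Q
    return x / B

def share(x: int, n: int) -> list:
    """Split x into n additive shares that sum to x mod Q."""
    shares = [random.randrange(Q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

def reconstruct(shares: list) -> int:
    return sum(shares) % Q

# Secure addition is purely local: each party adds its own two shares.
x, y = encode(1.5), encode(-0.25)
z_shares = [xs + ys for xs, ys in zip(share(x, 3), share(y, 3))]
assert abs(decode(reconstruct(z_shares)) - 1.25) < 1e-3
```

Mapping the top half of the ring to negative values in `decode` is one common convention for signed fixed-point arithmetic; frameworks may differ in how they handle the sign.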
Multiplication. Two secret-shared values [x] and [y] are multiplied using a random Beaver triple [11] generated by a Trusted Third Party (TTP): a triplet ([a], [b], [ab]). Note that the Beaver triple can be shared with each party in advance. The parties first compute [ε] = [x] − [a] and [δ] = [y] − [b]. The values [ε] and [δ] are then revealed to all parties (denoted Reveal(·)) without compromising information, since the ground-truth values a, b remain unknown to every party except the TTP. The final result is computed as [xy] = [ab] + ε[b] + [a]δ + εδ. Algorithm 1 illustrates multiplication using Beaver triples.

Algorithm 1 Beaver Multiplication Mul([x], [y])
Input: Secret-shared inputs [x], [y]; Beaver triple ([a], [b], [ab]).
Output: [xy].
1: ◁ Compute masked values
2: [ε] ← [x − a] = [x] − [a]
3: [δ] ← [y − b] = [y] − [b]
4: ◁ Reveal ε and δ through one round of communication
5: ε ← Reveal([ε])
6: δ ← Reveal([δ])
7: return εδ + ε[b] + [a]δ + [ab]

Linear functions. Functions consisting of linear operations can be implemented by combining additions and multiplications. Common operations in deep learning, such as the element-wise product and convolution, fit this linear paradigm.

Nonlinear functions. Since nonlinear functions are inherently infeasible in the standard arithmetic secret-sharing scheme, most works use approximation methods to simulate their outcome. In particular, Taylor expansion, Newton–Raphson, and Householder methods are commonly used to approximate nonlinear functions using only linear operations. Reciprocal functions, exponential functions, loss functions, kernel functions, and other functions useful in deep learning are all calculated this way.

2.3. Notations

This section summarizes the notations used throughout this work. We denote [x] as a secret sharing of x. Reveal([x]) means that the ground-truth value x is revealed to every party involved in the computation through one round of communication. Since most linear operations also apply to element-wise and matrix operations, x may also represent a vector, matrix, or even a tensor when there is no ambiguity.

3. Related Work

To achieve communication-efficient MPC, various approaches have been developed to optimize communication rounds and throughput. Ishai and Kushilevitz [12] propose a new representation of polynomials for round-efficient secure computation, dividing high-degree polynomials into multiple low-degree polynomials that are easy to solve. Mohassel and Franklin [13] perform operations directly on polynomials, such as polynomial multiplication and division. Dachman-Soled et al. [14] improve the evaluation of multivariate polynomials whose variables are held as private inputs by different parties. Lu et al. [4] propose an efficient method for evaluating high-degree polynomials with arbitrary numbers of variables. While this line of research focuses on improving the calculation of polynomials, our study aims to develop a communication-efficient and effective MPC system for modern deep learning frameworks by leveraging the arithmetic-tuple computation of Krips et al. [9]. Our system is not confined to computing polynomials within finite rings, as in previous studies.

In recent years, several privacy-preserving deep learning frameworks have emerged to enable secure inference of neural network models. Wagh et al. [8] implement a maliciously secure 3-party MPC protocol from SecureNN [15] and ABY3 [16]. Knott et al. [5] provide flexible machine-learning APIs with a rich set of functions for secure deep learning. Li et al. [7] present a fast and performant privacy-preserving MPC Transformer inference framework. Our low-latency linear MPC implementation is built on top of Knott et al.'s CrypTen framework and provides a significant improvement in communication latency.
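As a baseline for the methodology that follows, the binary Beaver multiplication of Algorithm 1 can be simulated locally in plaintext Python. The dealer, ring size, and helper names below are our own illustrative assumptions, not CrypTen's API:

```python
import random

Q = 2**64  # assumed ring modulus

def share(x, n=3):
    s = [random.randrange(Q) for _ in range(n - 1)]
    return s + [(x - sum(s)) % Q]

def reveal(shares):
    return sum(shares) % Q

def beaver_mul(x_sh, y_sh):
    """One-round Beaver multiplication of two shared values."""
    n = len(x_sh)
    # The dealer (TTP) samples a random triple (a, b, ab) and pre-shares it.
    a, b = random.randrange(Q), random.randrange(Q)
    a_sh, b_sh, ab_sh = share(a, n), share(b, n), share((a * b) % Q, n)
    # Each party locally masks its input shares ...
    eps_sh = [(xs - a_s) % Q for xs, a_s in zip(x_sh, a_sh)]
    dlt_sh = [(ys - b_s) % Q for ys, b_s in zip(y_sh, b_sh)]
    # ... then eps and dlt are revealed in a single communication round.
    eps, dlt = reveal(eps_sh), reveal(dlt_sh)
    # [xy] = [ab] + eps*[b] + [a]*dlt + eps*dlt (public term added by party 0).
    z_sh = [(ab_s + eps * b_s + a_s * dlt) % Q
            for ab_s, b_s, a_s in zip(ab_sh, b_sh, a_sh)]
    z_sh[0] = (z_sh[0] + eps * dlt) % Q
    return z_sh

x, y = 1234, 5678
assert reveal(beaver_mul(share(x), share(y))) == (x * y) % Q
```

The only values made public are eps = x − a and dlt = y − b, which are uniformly random in the ring because a and b are; this is the security argument the paper later generalizes to the multivariate case.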
4. Methodology

4.1. Multivariate Multiplication

Since Beaver triples show how to multiply two variables with pre-shared triplets, a classic multiplication of multiple variables, such as [xyz], requires several rounds of binary multiplication, i.e. Mul(Mul([x], [y]), [z]). This naive implementation introduces additional communication rounds through the on-the-fly Reveal process: in general, an n-ary multiplication requires n − 1 rounds of communication.

To reduce the communication rounds involved in multivariate multiplication, we extend the basic binary Beaver triple into a general n-ary Beaver triple, so that only one round of communication is required throughout the entire process.

Assume the n inputs are {[x_i]}_{i=1}^{n}. The precomputed and preshared information required by the extended protocol is {𝒜_i}_{i=1}^{n}. Here 𝒜_1 := {[a_j]}_{j=1}^{n} is the set of n auxiliary shared values used to blind the inputs {[x_i]}_{i=1}^{n}, similar to Beaver's idea. Then 𝒜_i (i ≥ 2) is defined as the set of shared degree-i cross-terms of the variables in 𝒜_1. For example, 𝒜_2 := {[a_i a_j] | i ≠ j ∧ 1 ≤ i, j ≤ n}, 𝒜_3 := {[a_i a_j a_k] | i ≠ j ≠ k ∧ 1 ≤ i, j, k ≤ n}, and so on. Similar to the construction of [ε] and [δ] in Section 2.2, we define the difference between the inputs and the masks as [δ_i] := [x_i] − [a_i]. The secret-shared [δ_i] is then made public across all parties without leaking the ground-truth value of either [x_i] or [a_i]. The improvement of our method originates from the following equation:

∏_{i=1}^{n} x_i = ∏_{i=1}^{n} (δ_i + a_i)
              = ∏_i δ_i + Σ_i δ_i · (∏ a_m)/a_i + Σ_{i,j,i≠j} δ_i δ_j · (∏ a_m)/(a_i a_j) + · · · + ∏_i a_i.    (1)

Here we informally use the fractional representation, such as (∏ a_m)/a_i, to denote the product of all the mask terms except certain ones; this fractional form does not involve any actual division. Each secret-shared term of the form [(∏ a_m)/(a_i · · · a_j)] can be found in the auxiliary sets {𝒜_i}_{i=1}^{n}, which are preshared across all parties.

The adaptation of Equation 1 to the secret-sharing scheme is as follows:

[∏_{i=1}^{n} x_i] = ∏_i δ_i + Σ_i δ_i [(∏ a_m)/a_i] + Σ_{i,j,i≠j} δ_i δ_j [(∏ a_m)/(a_i a_j)] + · · · + [∏_i a_i].    (2)

Since Equation 2 is linear in the secret-shared terms, all communications can be conducted in parallel, i.e. in a single round of communication. In this case, we simply reveal all the [δ_i] in one round and, using the preshared terms in {𝒜_i}_{i=1}^{n}, compute the sharing of the final result in constant round complexity. The protocol is formally described in Algorithm 2.

Algorithm 2 Multivariate Beaver Multiplication of n inputs Mul([x_1], [x_2], . . . , [x_n])
Input: Secret-shared inputs {[x_i]}_{i=1}^{n}; auxiliary sets {𝒜_i}_{i=1}^{n}.
Output: [∏ x_i].
1: ◁ Compute masked values
2: for i ∈ [1, n] do
3:   [δ_i] ← [x_i − a_i] = [x_i] − [a_i]
4: end for
5: ◁ Reveal δ_i through one round of communication
6: parallel for i ∈ [1, n] do
7:   δ_i ← Reveal([δ_i])
8: end parallel for
9: ◁ Compute results using preshared {𝒜_i}_{i=1}^{n}
10: return ∏ δ_i + Σ_i δ_i [(∏ a_m)/a_i] + Σ_{i,j,i≠j} δ_i δ_j [(∏ a_m)/(a_i a_j)] + · · · + [∏ a_i]

The total number of communication rounds is indeed reduced from n − 1 to a constant 1 for a regular n-ary multiplication, but the overall size of the communicated data increases from linear to exponential. In a naïve implementation, the data size of an n-ary multiplication is only 3(n − 1), for a total transmission of n − 1 Beaver triples, whereas the multivariate implementation transmits the 2^n − 1 terms of the auxiliary sets {𝒜_i}_{i=1}^{n}. Therefore, in practice, there is a trade-off between communication latency and throughput.

4.2. Univariate Polynomials

The formal form of a univariate polynomial is P(x) = Σ_{i=0}^{n} b_i x^i, where b_i is the coefficient of the degree-i term. Univariate polynomials enable efficient evaluation and manipulation of polynomial expressions. Following Damgård et al. [17], we can compute all required [x^i] in parallel using multivariate multiplications and then multiply them by the corresponding plaintext coefficients. Despite its benefits, this trick has the disadvantage of exponentially increasing the size of the transmitted data, which becomes unbearable when the exponent exceeds 5.

In practice, this method can be implemented by computing a tuple of base terms and then iteratively multiplying the tuple by a fixed term, as in exponentiation by squaring or the fast modulo algorithm. In other words, a tuple g = (1, x, . . . , x^{‖g‖−1}) of size ‖g‖ = m is multiplied by x^{‖g‖} repeatedly to iterate over all the x^i terms. An overview of the implementation of univariate polynomials is given in Algorithm 3, where b_{s:e} denotes the subvector of b from position s to e.

Algorithm 3 Univariate Polynomial Poly([x], b)
Input: Secret-shared input [x]; coefficients b = (b_0, b_1, . . . , b_n); base terms size ‖g‖.
Output: Σ_{i=0}^{n} b_i [x^i].
1: ◁ Construct base terms
2: parallel for i ∈ [1, ‖g‖] do
3:   [x^i] ← Mul([x], . . . , [x])  ◁ [x] multiplied i times
4: end parallel for
5: g ← (1, [x], . . . , [x^{‖g‖−1}])
6: ◁ Iterative exponentiation
7: t ← 0
8: for i ∈ [0, ⌊n/‖g‖⌋ − 1] do
9:   s ← i · ‖g‖
10:  e ← (i + 1) · ‖g‖
11:  t ← t + b_{s:e} · g
12:  ◁ Vectorized Beaver multiplication
13:  g ← [x^{‖g‖}] · g
14: end for
15: return t
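To make Algorithm 2 concrete, here is a plaintext simulation of the 3-ary case over a toy ring. The helper names and the local "dealer" are our own assumptions for illustration; a real deployment would pre-share the auxiliary sets via a TTP and reveal the deltas over the network:

```python
import random
from itertools import combinations
from math import prod

Q = 2**32  # assumed toy ring modulus

def share(x, n=3):
    s = [random.randrange(Q) for _ in range(n - 1)]
    return s + [(x - sum(s)) % Q]

def reveal(shares):
    return sum(shares) % Q

def mul3(x_shares):
    """One-round 3-ary multiplication using an extended Beaver tuple."""
    n = len(x_shares[0])
    a = [random.randrange(Q) for _ in range(3)]  # masks, i.e. the set A_1
    # Dealer pre-shares every cross-term a_i, a_i*a_j, a_0*a_1*a_2
    # (the auxiliary sets A_1, A_2, A_3 of the paper).
    aux = {S: share(prod(a[i] for i in S) % Q, n)
           for r in (1, 2, 3) for S in combinations(range(3), r)}
    # Mask each input; all three deltas are revealed in ONE parallel round.
    d = [reveal([(xs - a_s) % Q for xs, a_s in zip(x_shares[i], aux[(i,)])])
         for i in range(3)]
    # Recombine per Equation 2: for each non-empty mask subset S, weight the
    # shared product of the masks in S by the public deltas outside S.
    out = [0] * n
    for S in aux:
        coeff = prod(d[i] for i in range(3) if i not in S) % Q
        for p in range(n):
            out[p] = (out[p] + coeff * aux[S][p]) % Q
    out[0] = (out[0] + prod(d)) % Q  # the all-delta term is public
    return out

shares = [share(v) for v in (5, 7, 11)]
assert reveal(mul3(shares)) == 385  # 5 * 7 * 11, one communication round
```

The 2^3 − 1 = 7 entries of `aux` match the paper's exponential-data trade-off: one round of communication, but an auxiliary tuple whose size grows as 2^n − 1.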
4.3. Nonlinear Approximations

In this section, we go one step further and optimize commonly used nonlinear functions by leveraging the parallelization property of the proposed multivariate multiplication.

Exponentiation. Since exponential functions grow at a geometric rate, approximations based on series expansion generally suffer a significant loss of accuracy because we do not know the exact value of the input. Consequently, we resort to the naive iterative approximation, which can utilize multivariate multiplication effectively:

e^{[x]} = lim_{n→∞} (1 + [x]/d^n)^{d^n}.

During each iteration, the d-th power of the previous result is calculated. As the number of iterations n increases, the answer approaches the true result.

Logarithm. The calculation of logarithms relies on higher-order iterative methods for better convergence, i.e. Householder methods on y = ln x:

[h_n] = 1 − [x] e^{−[y_n]}
[y_{n+1}] = [y_n] − Σ_{k=1}^{∞} (1/k) [h_n^k]

Note that the implementation of the logarithm combines exponentiation and univariate polynomials; the degree of the polynomial determines the precision of the output.

Reciprocal. The reciprocal function y = 1/x is calculated using the Newton–Raphson method with an initial guess y_0:

[y_{n+1}] = [y_n](2 − [x][y_n]) = 2[y_n] − [x][y_n][y_n]

Trigonometry. Trigonometric functions can be treated as a special case of exponentiation with d = 2. The sine and cosine functions are calculated in the field of complex numbers:

[sin x] = Im([e^{ix}]),  [cos x] = Re([e^{ix}])

Using the above nonlinear functions, we can calculate most of the loss functions used in deep learning, such as the sigmoid, tanh, and cross-entropy functions. Various other common nonlinear functions, such as the softmax function and kernel functions, can also be calculated using the exponential and reciprocal functions.

4.4. Communication Coalescing

The key to achieving low-latency secret-sharing computation is to reduce the total number of rounds of communication among the parties. While we introduce a latency-friendly implementation of basic math operations, other kinds of communication, such as precision checking, still require an additional but independent communication round.

In general, the communication involved in multiple math operations can be abstracted as a communication graph, or more strictly, a communication tree. Accordingly, we observe that some independent communications that do not affect downstream results can be deferred and combined into one single round of communication. We refer to this process as communication coalescing; it eliminates unnecessary rounds of communication and improves the utilization of network bandwidth.

5. Security Analysis

The correctness of the multivariate multiplication follows directly from Equations 1 and 2. As univariate polynomials are implemented with the same method as the extended fast modulo algorithm, their correctness likewise follows from the correctness and security of multivariate multiplication. The coalescing mechanism only alters the order of communication rounds without modifying the payload, so it is also reliable and secure.

Multivariate computations are as secure as traditional Beaver multiplications under semi-honest conditions. Intuitively, since a_i is chosen at random by the TTP, the value δ_i = x_i − a_i is indistinguishable from a random number. Consequently, the disclosure of [δ_i] does not reveal any critical information about x_i. This holds even if multiple parties, excluding the TTP, collude.

To state the security of multivariate multiplication formally, we denote [x]_p as the secret share of x for party p ∈ 𝒫. The global equations of the multivariate system are as follows:

Σ_{p∈𝒫} [x_i]_p = x_i
Σ_{p∈𝒫} [a_i]_p = a_i
Σ_{p∈𝒫} [a_i a_j]_p = a_i a_j
  · · ·
Σ_{p∈𝒫} [a_1 · · · a_n]_p = a_1 · · · a_n
[x_i]_p − [a_i]_p = [δ_i]_p    (3)

with [δ_i]_p known to every party p ∈ 𝒫. From each party's view, these 2^n + 2n − 1 equations have Θ(2^n ‖𝒫‖) unknown variables, which indicates the difficulty of determining the exact value of x_i, as shown in [18].

A party's view represents all the values it can obtain during the execution. The following theorem then holds:

Theorem 1. Let {x'_i} and {x''_i} be random values. The distribution of the view of each party is identical when x_i = x'_i or x_i = x''_i.

This guarantees the security of multivariate multiplication by ensuring indistinguishability between the random distribution and the distribution of each party's view.

6. Experiments

6.1. Experimental Setup

We use CrypTen [5] as the base MPC deep learning framework, which already provides naïve implementations of secret-sharing-based computations. In most of our experiments, we use 3-party MPC on CPUs. Additionally, we allow a maximum of 4-ary multiplication as stated in Section 4.1, and we set d = 3 for exponentiation and k = 8 for the logarithm as described in Section 4.3.

To measure performance, we run experiments with deep learning models of different sizes: (a) Linear Support Vector Classification (LinearSVC) with L2 penalty; (b) LeNet [19], with shallow convolutional and linear layers along with ReLU activations; (c) ResNet-18 [20], with multiple convolutional, linear, pooling, and activation layers; (d) a Transformer Encoder model [21] with a single multi-head attention layer and BatchNorm [22] in place of LayerNorm [23]. We employ several classification datasets, adapted appropriately to the specific models, including MNIST [19], CIFAR-10 [24], ImageNet [25], and Sentiment140 [26].

Each experiment is conducted in a simulated multi-node environment using Docker; the TTP runs in an independent environment separate from the normal parties. To manually simulate different network environments with respect to bandwidth and latency, we use the docker-tc tool to adjust the Docker network settings accordingly.

Table 1
Communication Latency and Data with Different Network Settings. Latency and data are measured in milliseconds and MiB.

Model       | Dataset      | t_comp   | Data Size (Naïve / Ours) | t_comm N_low (Naïve / Ours) | t_comm N_med (Naïve / Ours) | t_comm N_high (Naïve / Ours)
LinearSVC   | MNIST        | 0.032    | 0.022 / 0.024            | 0.082 / 0.059               | 0.604 / 0.568               | 4.339 / 4.026
LeNet       | CIFAR-10     | 0.900    | 38.562 / 41.647          | 1.373 / 1.182               | 7.673 / 6.983               | 53.010 / 47.868
ResNet-18   | ImageNet     | 110.447  | 11571.294 / 12612.711    | 219.541 / 185.293           | 1681.685 / 1407.570         | ~11 760 / ~9 870
Transformer | Sentiment140 | 7.824    | 771.787 / 893.421        | 12.190 / 9.715              | 78.624 / 61.877             | 559.645 / 421.413

Table 2
Classification Accuracy using Different Methods.

Model       | Dataset      | Origin (%) | Naïve (%) | Ours (%)
LinearSVC   | MNIST        | 100.00     | 100.00    | 100.00
LeNet       | CIFAR-10     | 100.00     | 100.00    | 100.00
ResNet-18   | ImageNet     | 69.30      | 61.58     | 60.10
Transformer | Sentiment140 | 59.87      | 58.55     | 57.74

6.2. Metrics

To provide a comprehensive evaluation of our proposed method, we adopt metrics from a variety of perspectives.

• t_comp: The computational time cost for evaluating a single data sample in one round.
• t_comm: The time cost of communication associated with evaluating a single data sample in one round.

• Size of Transmission Data: The size of the transmitted network packets when evaluating a single data sample in one round.

• Accuracy: The classification accuracy when evaluated on a particular dataset.

Figure 1: Communication percentage of different models. (a) ResNet Basic Block: Conv2D 49.8%, BatchNorm 25.0%, ReLU 19.1%, Other 6.1%. (b) Attention Block: Softmax 60.5%, GeLU 23.1%, MatMul 14.4%, Other 2.0%.

6.3. Latency & Throughput

To assess the efficiency of our proposed method, we simulate networks with different latencies: (a) network N_low with 0.1 ms latency, (b) network N_med with 5 ms latency, and (c) network N_high with 40 ms latency. All of these networks have a bandwidth of 1 Gbps. Our simulated multi-node setting includes 3 nodes, with an additional TTP by default.

As shown in Table 1, the computation cost of each model is negligible compared to the communication cost in the medium- and high-latency network settings. We therefore focus on the communication costs associated with our proposed method. Compared to the naïve method implemented in CrypTen, the data size of our method (Section 4.1) remains close to the baseline, since it does not introduce a substantial amount of additional communication payload when the maximum number of input variables is set appropriately; for instance, a 3-ary or 4-ary multiplication does not significantly increase the total size of communication.

It is noteworthy that our proposed method reduces the communication cost in every network setting compared to the naïve MPC implementation. Overall, we achieve an improvement of 10∼20%, a significant performance enhancement for practical use in high-latency environments.

Furthermore, we observe that our proposed method behaves differently for neural models with different architectures. Figure 1 illustrates the communication breakdown of ResNet basic blocks and Attention blocks. As can be seen, the attention mechanism is constrained by the communication bottleneck of its Softmax operation, while the CNN is constrained by communication in its convolutional operations. Since our method particularly optimizes nonlinear functions, attention-based models show a significant latency improvement of almost 25%. This also explains the more limited improvement of 8∼15% for traditional machine-learning models and CNN-based models.

6.4. Evaluation

In this section, we examine the side effects and factors associated with the basic settings, such as downstream task accuracy, the number of parties involved, and the trade-off between network latency and bandwidth.

To evaluate the drop in accuracy, we compare our method with both the original baseline and the naïve implementation without a low-latency design. Table 2 shows that, in relatively small scenarios, both the naïve implementation and our method achieve the same perfect performance as the baselines. Nevertheless, in complex scenarios both MPC-based implementations obtain lower accuracy than the baseline, and our method performs slightly worse than the naïve implementation. We hypothesize that multivariate multiplication introduces additional precision requirements, which in turn reduce accuracy.

The throughput and latency of MPC-based methods are also affected by the number of parties involved in the computation. From Figure 2, it can be seen that the communication data size of both methods increases linearly as the number of parties increases. There is, however, a tendency for the latency to worsen when more parties are involved.

Figure 2: Transmission data and latency of the naïve and proposed methods when a different number of parties are involved: (a) size of transmission data w.r.t. the number of parties; (b) latency w.r.t. the number of parties. The experiment is conducted using the LeNet model on CIFAR-10 in a medium-latency network.

Moreover, Figure 3 illustrates how network bandwidth affects communication costs. When sufficient bandwidth is available, our method can still optimize network latency. It is important to note, however, that when bandwidth becomes the bottleneck, our method is no longer effective in reducing the overall cost of communication. This indicates that bandwidth remains an important factor in a multi-node MPC setting, especially as the number of nodes grows.

Figure 3: Latency of the naïve and proposed methods with respect to network bandwidth. The experiment is conducted using the LeNet model on CIFAR-10 in a medium-latency network.

7. Discussion

Since the proposed multivariate multiplication is based on a finite ring, it is likely to have precision issues that lead to incorrect results. Fortunately, a loss in precision does not significantly affect the overall performance of deep learning, since the loss can be interpreted as random noise and distortion in the input data.

Moreover, our proposed method is only applicable to functions based on linear MPC operations. To avoid heavy communication, a modern MPC-based deep learning framework may also involve other protocols, such as Homomorphic Encryption [27], Garbled Circuits [28], and Function Secret Sharing [29]. Though these approaches may require less communication, our approach can still be seamlessly integrated with the current secret-sharing framework and achieve a latency improvement of ~20% without adding excessive computational workload.

8. Conclusion

This study proposes a secret-sharing-based MPC method that enhances the linear computation required in deep learning through increased communication utilization. By using the multivariate multiplication and communication coalescing mechanisms, we reduce the number of unnecessary communication rounds during the execution of both linear and nonlinear deep learning functions. Our experiments demonstrate that the proposed methods achieve an overall latency improvement of 10∼20% compared to the naïve MPC implementation, while throughput and downstream task performance remain comparable to naïve implementations, demonstrating the method's validity and efficiency. We hope this work will inspire future improvements in privacy-preserving deep learning techniques and lead to more practical MPC applications.

Acknowledgments

This work is supported by the National Key R&D Program of China under grant No. 2022YFB2703001.

References

[1] I. Damgård, V. Pastro, N. Smart, S. Zakarias, Multiparty computation from somewhat homomorphic encryption, in: R. Safavi-Naini, R. Canetti (Eds.), CRYPTO 2012, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 643–662. doi:10.1007/978-3-642-32009-5_38.
[2] M. Keller, MP-SPDZ: A versatile framework for multi-party computation, in: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, CCS '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1575–1590. doi:10.1145/3372297.3417872.
[3] X. Liu, L. Xie, Y. Wang, J. Zou, J. Xiong, Z. Ying, A. V. Vasilakos, Privacy and security issues in deep learning: A survey, IEEE Access 9 (2021) 4566–4593. doi:10.1109/ACCESS.2020.3045078.
[4] D. Lu, A. Yu, A. Kate, H. Maji, Polymath: Low-latency MPC via secure polynomial evaluations and its applications, Proceedings on Privacy Enhancing Technologies 2022 (2021). doi:10.2478/popets-2022-0020.
[5] B. Knott, S. Venkataraman, A. Hannun, S. Sengupta, M. Ibrahim, L. van der Maaten, CrypTen: Secure multi-party computation meets machine learning, NeurIPS 34 (2021) 4961–4973. doi:10.48550/arXiv.2109.00984.
[6] S. Tan, B. Knott, Y. Tian, D. J. Wu, CryptGPU: Fast privacy-preserving machine learning on the GPU, in: IEEE S&P, 2021. doi:10.1109/SP40001.2021.00098.
[7] D. Li, H. Wang, R. Shao, H. Guo, E. Xing, H. Zhang, MPCFormer: Fast, performant and private transformer inference with MPC, in: The 11th International Conference on Learning Representations, 2023. doi:10.48550/arXiv.2211.01452.
[8] S. Wagh, S. Tople, F. Benhamouda, E. Kushilevitz, P. Mittal, T. Rabin, Falcon: Honest-majority maliciously secure framework for private deep learning, Proceedings on Privacy Enhancing Technologies 2021 (2020) 188–208. doi:10.2478/popets-2021-0011.
[9] T. Krips, R. Küsters, P. Reisert, M. Rivinius, Arithmetic tuples for MPC, Cryptology ePrint Archive (2022). URL: https://eprint.iacr.org/2022/667.
[10] A. Shamir, How to share a secret, Commun. ACM 22 (1979) 612–613. doi:10.1145/359168.359176.
[11] D. Beaver, Efficient multiparty protocols using circuit randomization, in: J. Feigenbaum (Ed.), CRYPTO 1991, Springer Berlin Heidelberg, Berlin, Heidelberg, 1992, pp. 420–432. doi:10.1007/3-540-46766-1_34.
[12] Y. Ishai, E. Kushilevitz, Randomizing polynomials: A new representation with applications to round-efficient secure computation, in: Proceedings 41st Annual Symposium on Foundations of Computer Science, 2000, pp. 294–304. doi:10.1109/SFCS.2000.892118.
[13] P. Mohassel, M. Franklin, Efficient polynomial operations in the shared-coefficients setting, in: M. Yung, Y. Dodis, A. Kiayias, T. Malkin (Eds.), PKC 2006, Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 44–57. doi:10.1007/11745853_4.
[14] D. Dachman-Soled, T. Malkin, M. Raykova, M. Yung, Secure efficient multiparty computing of multivariate polynomials and applications, in: J. Lopez, G. Tsudik (Eds.), Applied Cryptography and Network Security, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 130–146. doi:10.1007/978-3-642-21554-4_8.
[15] S. Wagh, D. Gupta, N. Chandran, SecureNN: Efficient and private neural network training, Cryptology ePrint Archive (2018). URL: https://eprint.iacr.org/2018/442.
[16] P. Mohassel, P. Rindal, ABY3: A mixed protocol framework for machine learning, in: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 35–52. doi:10.1145/3243734.3243760.
[17] I. Damgård, M. Fitzi, E. Kiltz, J. B. Nielsen, T. Toft, Unconditionally secure constant-rounds multi-party computation for equality, comparison, bits and exponentiation, in: S. Halevi, T. Rabin (Eds.), Theory of Cryptography, Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 285–304. doi:10.1007/11681878_15.
[18] G. Couteau, A note on the communication complexity of multiparty computation in the correlated randomness model, in: EUROCRYPT 2019, Springer-Verlag, Berlin, Heidelberg, 2019, pp. 473–503. doi:10.1007/978-3-030-17656-3_17.
[19] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998) 2278–2324. doi:10.1109/5.726791.
[20] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), NIPS, volume 30, Curran Associates, Inc., 2017. doi:10.5555/3295222.3295349.
[22] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Proc. 32nd Int. Conf. Machine Learning - Volume 37, ICML'15, JMLR.org, 2015, pp. 448–456. doi:10.5555/3045118.3045167.
[23] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, 2016. arXiv:1607.06450.
[24] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images (2009).
[25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: CVPR, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
[26] A. Go, R. Bhayani, L. Huang, Twitter sentiment classification using distant supervision, CS224N project report, Stanford 1 (2009).
[27] C. Gentry, Fully homomorphic encryption using ideal lattices, in: Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, STOC '09, Association for Computing Machinery, New York, NY, USA, 2009, pp. 169–178. doi:10.1145/1536414.1536440.
[28] A. C. Yao, Protocols for secure computations, in: 23rd Annual Symposium on Foundations of Computer Science (SFCS 1982), 1982, pp. 160–164. doi:10.1109/SFCS.1982.38.
[29] E. Boyle, N. Gilboa, Y. Ishai, Function secret sharing, in: E. Oswald, M. Fischlin (Eds.), EUROCRYPT 2015, Springer Berlin Heidelberg, Berlin, Heidelberg, 2015, pp. 337–367. doi:10.1007/978-3-662-46803-6_12.