<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Low-Latency Privacy-Preserving Deep Learning Design via Secure MPC</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ke</forename><surname>Lin</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Tsinghua University</orgName>
								<address>
									<addrLine>30 Shuangqing Rd., Haidian District</addrLine>
									<postCode>100084</postCode>
									<settlement>Beijing</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yasir</forename><surname>Glani</surname></persName>
							<email>yasirglani@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Tsinghua University</orgName>
								<address>
									<addrLine>30 Shuangqing Rd., Haidian District</addrLine>
									<postCode>100084</postCode>
									<settlement>Beijing</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ping</forename><surname>Luo</surname></persName>
							<email>luop@tsinghua.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="institution">Tsinghua University</orgName>
								<address>
									<addrLine>30 Shuangqing Rd., Haidian District</addrLine>
									<postCode>100084</postCode>
									<settlement>Beijing</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Low-Latency Privacy-Preserving Deep Learning Design via Secure MPC</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">13F124455D8F40F6B75A9EE4BC935051</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:21+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Multi-party computation</term>
					<term>deep learning</term>
					<term>privacy-preserving</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Secure multi-party computation (MPC) facilitates privacy-preserving computation between multiple parties without leaking private information. While most secure deep learning techniques utilize MPC operations to achieve feasible privacy-preserving machine learning on downstream tasks, their computational and communication overhead still hampers practical application. This work proposes a low-latency secret-sharing-based MPC design that reduces unnecessary communication rounds during the execution of MPC protocols. We also present a method for improving the computation of commonly used nonlinear functions in deep learning by integrating multivariate multiplication and coalescing different packets into one to maximize network utilization. Our experimental results indicate that our method is effective in a variety of settings, reducing communication latency by 10∼20%.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Secure multi-party computation (MPC) <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref> enables parties to compute securely over their private data without revealing the data to each other. Secure MPC offers a privacy-preserving guarantee, which makes it suitable for privacy-sensitive domains such as medical research and finance. With the development of deep learning techniques, the ability of neural models to capture sensitive information from large datasets raises concerns regarding the surveillance of individuals <ref type="bibr" target="#b2">[3]</ref>. This makes secure MPC a natural fit for secure machine learning and deep learning. While MPC-based deep learning frameworks have achieved significant performance in general scenarios, most works suffer from limitations caused by (1) network communication, owing to the exchange of intermediate information during MPC execution, and (2) excessive computation introduced by complex MPC protocols. Since the computation of MPC protocols is largely determined by their sophisticated design, optimizing the protocols themselves is difficult and often infeasible. Thus, some studies <ref type="bibr" target="#b3">[4]</ref> focus on improving the communication stage of MPC protocols to make them more practical. In this paper, we present an approach to reduce the communication latency of MPC protocols through optimized multivariate multiplication.</p><p>In general, privacy-preserving deep learning frameworks usually adopt secret-sharing-based techniques to avoid extensive computational overheads <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref>. However, secret-sharing-based methods require multiple exchanges of intermediate results to perform collaborative MPC operations. As these MPC techniques are based on linear computations, such as addition and multiplication, modern deep learning techniques that inherently rely on linear algebra benefit significantly from them. Given this heavy dependency on linear operations, our research aims to reduce unnecessary communication rounds during the execution of MPC protocols, following <ref type="bibr" target="#b8">[9]</ref>.</p><p>Our main contributions are as follows:</p><p>• We improve the computation of nonlinear functions by integrating the proposed multivariate multiplication and coalescing different packets into one single packet to maximize network utilization.</p><p>• We conducted experiments to evaluate the effectiveness of our method with respect to models of varying sizes, networks with different latency and bandwidth, the accuracy of downstream classification tasks, and the number of participants involved. The results indicate an overall improvement of 10∼20% in communication latency.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Arithmetic Secret Sharing Based Scheme</head><p>Our setup is primarily focused on arithmetic operations, so we represent all inputs and intermediate results as linear secret shares among 𝑛 parties, specifically under an additive secret-sharing scheme.</p><p>Apart from the general (𝑛, 𝑡)-Shamir secret sharing scheme <ref type="bibr" target="#b9">[10]</ref>, which relies on degree-𝑡 polynomials over 𝑛 parties, we adopt a simple arithmetic secret sharing scheme based on (𝑛, 0)-Shamir secret sharing. In other words, we share a scalar value 𝑥 ∈ Z/𝑄Z across 𝑛 parties 𝒫, where Z/𝑄Z denotes a ring with 𝑄 elements, following the notation of <ref type="bibr" target="#b4">[5]</ref>. The sharing of 𝑥 is defined as [𝑥] = {[𝑥]𝑝}𝑝∈𝒫, where [𝑥]𝑝 is party 𝑝's share of 𝑥. The ground-truth value 𝑥 can be reconstructed from the sum of the shares of all parties, i.e. 𝑥 = ∑_{𝑝∈𝒫} [𝑥]𝑝. When parties wish to share a value 𝑥, they generate a pseudorandom zero-share that sums to 0; the party that possesses the value then adds 𝑥 to its share in secret. To represent floating-point numbers, we adopt a fixed-point encoding that maps any floating-point number 𝑥𝐹 to a fixed-point representation 𝑥. Concretely, each 𝑥 is the result of multiplying a floating-point number 𝑥𝐹 by a scaling factor 𝐵 = 2^𝐿 and rounding to the nearest integer, i.e. 𝑥 = ⌊𝐵𝑥𝐹⌉, where 𝐿 is the precision of the fixed-point encoding. To decode the ground-truth floating-point value 𝑥𝐹 from 𝑥, we compute 𝑥𝐹 ≈ 𝑥/𝐵.</p></div>
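To make the encoding and sharing scheme concrete, the following standalone Python sketch (our illustration, not CrypTen's actual implementation) shares a fixed-point value across 𝑛 parties and reconstructs it; the ring size Q and precision L are assumptions chosen for the example.

```python
import random

Q = 2**64          # ring size of Z/QZ (assumed for this example)
L = 16             # fixed-point precision bits
B = 2**L           # scaling factor

def encode(x_f: float) -> int:
    """Fixed-point encode: round(B * x_F), reduced into the ring."""
    return round(x_f * B) % Q

def decode(x: int) -> float:
    """Fixed-point decode: reinterpret as a signed value, then divide by B."""
    if x >= Q // 2:            # map the upper half of the ring to negatives
        x -= Q
    return x / B

def share(x: int, n: int) -> list[int]:
    """Split x into n additive shares that sum to x mod Q."""
    shares = [random.randrange(Q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

def reconstruct(shares: list[int]) -> int:
    return sum(shares) % Q

if __name__ == "__main__":
    x = encode(-3.14159)
    sh = share(x, 3)
    print(decode(reconstruct(sh)))   # ~ -3.14159
```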
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Arithmetic Secret Sharing Based MPC</head><p>Arithmetic secret shares are additively homomorphic and can therefore be used to implement secure MPC, particularly for linear computation. Linear functions. Functions consisting of linear operations can be implemented by combining additions and multiplications. Common operations in deep learning, such as the element-wise product and convolution, fit this linear paradigm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Addition</head><p>The sum of two secret-shared values [𝑥] and [𝑦] can be directly computed as [𝑧] = [𝑥] + [𝑦], where each party 𝑝 ∈ 𝒫 computes [𝑧]𝑝 = [𝑥]𝑝 + [𝑦]𝑝 without multi-party communication.</p><p>Multiplication. Two secret-shared values [𝑥] and [𝑦] are multiplied using a random Beaver triple <ref type="bibr" target="#b10">[11]</ref> generated by the Trusted Third Party (TTP): a triplet ([𝑎], [𝑏], [𝑎𝑏]). Note that the Beaver triple can be distributed to each party in advance. The parties first calculate [𝜖] = [𝑥] − [𝑎] and [𝛿] = [𝑦] − [𝑏]. The values 𝜖 and 𝛿 are then revealed to all parties (denoted as Reveal(•)) without compromising information, since the ground-truth values 𝑎, 𝑏 remain unknown to every party except the TTP. The final result is computed as [𝑥𝑦] = [𝑎𝑏] + 𝜖[𝑏] + [𝑎]𝛿 + 𝜖𝛿. Algorithm 1 illustrates the multiplication using Beaver triples.</p><p>Nonlinear functions. Because nonlinear functions are inherently infeasible in the standard arithmetic secret-sharing scheme, most works use approximation methods to simulate their outputs. In particular, Taylor expansion, Newton-Raphson, and Householder methods are commonly used to approximate nonlinear functions using only linear operations. For example, reciprocal functions, exponential functions, loss functions, kernel functions, and other functions useful in deep learning are all calculated this way.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 Beaver Multiplication Mul([𝑥], [𝑦])</head><formula xml:id="formula_0">Input: Secret-shared inputs [𝑥], [𝑦], Beaver triple ([𝑎], [𝑏], [𝑎𝑏]). Output: [𝑥𝑦]. 1: ◁ Compute masked values 2: [𝜖] ← [𝑥 − 𝑎] = [𝑥] − [𝑎] 3: [𝛿] ← [𝑦 − 𝑏] = [𝑦] − [𝑏] 4: 𝜖 ← Reveal([𝜖]), 𝛿 ← Reveal([𝛿]) ◁ One round of communication 5: ◁ Combine 6: return [𝑎𝑏] + 𝜖[𝑏] + [𝑎]𝛿 + 𝜖𝛿</formula></div>
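A minimal Python simulation of Algorithm 1 may clarify the round structure. This is a plaintext sketch under our own assumptions: `share`/`reveal` stand in for the parties and the single communication round, and the TTP's triple is sampled inline rather than preshared.

```python
import random

Q = 2**64

def share(x, n=3):
    s = [random.randrange(Q) for _ in range(n - 1)]
    s.append((x - sum(s)) % Q)
    return s

def reveal(shares):
    return sum(shares) % Q

def beaver_mul(x_sh, y_sh, n=3):
    """One Beaver multiplication over additive shares of x and y."""
    # TTP samples a random triple ([a], [b], [ab]) ahead of time
    a, b = random.randrange(Q), random.randrange(Q)
    a_sh, b_sh, ab_sh = share(a, n), share(b, n), share((a * b) % Q, n)
    # each party locally computes its share of eps = x - a and delta = y - b
    eps_sh = [(x - ai) % Q for x, ai in zip(x_sh, a_sh)]
    dlt_sh = [(y - bi) % Q for y, bi in zip(y_sh, b_sh)]
    # one round of communication reveals eps and delta
    eps, dlt = reveal(eps_sh), reveal(dlt_sh)
    # [xy] = [ab] + eps*[b] + [a]*delta + eps*delta (public term added once)
    xy_sh = [(abi + eps * bi + ai * dlt) % Q
             for abi, ai, bi in zip(ab_sh, a_sh, b_sh)]
    xy_sh[0] = (xy_sh[0] + eps * dlt) % Q
    return xy_sh

if __name__ == "__main__":
    print(reveal(beaver_mul(share(7), share(11))))   # 77
```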
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Notations</head><p>This section summarizes the notations used throughout this work. We denote [𝑥] as a secret sharing of 𝑥. Reveal([𝑥]) means that the ground-truth value 𝑥 is revealed to every party involved in the computation through a single round of communication. Since most linear operations also apply element-wise and to matrices, 𝑥 can also represent a vector, matrix, or even a tensor when there is no ambiguity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Related Work</head><p>To achieve communication-efficient MPC, various approaches have been developed to optimize the communication rounds and the throughput of communication. Ishai and Kushilevitz <ref type="bibr" target="#b11">[12]</ref> propose a new representation of polynomials for round-efficient secure computation, dividing high-degree polynomials into multiple low-degree polynomials that are easy to solve. Mohassel and Franklin <ref type="bibr" target="#b12">[13]</ref> perform operations directly on polynomials, such as polynomial multiplication and division. Dachman-Soled et al. <ref type="bibr" target="#b13">[14]</ref> improve the evaluation of multivariate polynomials whose variables are held as private inputs by different parties. Lu et al. <ref type="bibr" target="#b3">[4]</ref> propose an efficient method for evaluating high-degree polynomials with arbitrary numbers of variables. While prior research has focused on improving the calculation of polynomials, our study aims to develop a communication-efficient and effective MPC system for modern deep learning frameworks by leveraging the arithmetic tuples computation of Krips et al. <ref type="bibr" target="#b8">[9]</ref>. Unlike previous studies, this system is not confined to computing polynomials within finite rings.</p><p>In recent years, several privacy-preserving deep learning frameworks have emerged to enable the secure inference of neural network models. Wagh et al. <ref type="bibr" target="#b7">[8]</ref> implement a maliciously secure 3-party MPC protocol building on SecureNN <ref type="bibr" target="#b14">[15]</ref> and ABY3 <ref type="bibr" target="#b15">[16]</ref>. Knott et al. <ref type="bibr" target="#b4">[5]</ref> provide flexible machine-learning APIs with a rich set of functions for secure deep learning. Li et al. <ref type="bibr" target="#b6">[7]</ref> present a fast and performant privacy-preserving MPC Transformer inference framework. Our low-latency linear MPC implementation is built on top of Knott et al.'s CrypTen framework and provides a significant improvement in communication latency.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Methodology</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Multivariate Multiplication</head><p>Since Beaver triples show how to multiply two variables with pre-shared triplets, a classic multiplication of several variables, such as [𝑥𝑦𝑧], requires multiple rounds of binary multiplication, i.e. Mul(Mul([𝑥], [𝑦]), [𝑧]). This naive implementation, however, introduces additional communication rounds during the on-the-fly Reveal process. In general, an 𝑛-ary multiplication requires 𝑛 − 1 rounds of communication.</p><p>To reduce the communication rounds involved in multivariate multiplication, we extend the basic binary Beaver triple into a general 𝑛-ary Beaver triple, so that only one round of communication is required throughout the entire process.</p><p>Assume the 𝑛 inputs are represented as {[𝑥𝑖]}_{𝑖=1}^{𝑛}. The precomputed and preshared information required by the extended protocol is {𝒜𝑖}_{𝑖=1}^{𝑛}. Here 𝒜1 := {[𝑎𝑗]}_{𝑗=1}^{𝑛} is the set of 𝑛 auxiliary shared values used to blind the inputs {[𝑥𝑖]}_{𝑖=1}^{𝑛}, mirroring Beaver's idea. Then 𝒜𝑖 (𝑖 ≥ 2) is the set of shared degree-𝑖 cross-terms of the variables in 𝒜1. For example,</p><formula xml:id="formula_1">𝒜2 := {[𝑎𝑖𝑎𝑗] | 𝑖 ≠ 𝑗 ∧ 1 ≤ 𝑖, 𝑗 ≤ 𝑛}, and 𝒜3 := {[𝑎𝑖𝑎𝑗𝑎𝑘] | 𝑖 ≠ 𝑗 ≠ 𝑘 ∧ 1 ≤ 𝑖, 𝑗, 𝑘 ≤ 𝑛},</formula><p>and so on. Similar to the construction of [𝜖] and [𝛿] in Section 2.2, we define the difference between the inputs and the masks as [𝛿𝑖] := [𝑥𝑖] − [𝑎𝑖]. The secret-shared [𝛿𝑖] is then made public across all parties without leaking the ground-truth values of either [𝑥𝑖] or [𝑎𝑖]. The improvement of our method originates from the following equation:</p><formula xml:id="formula_2">∏_{𝑖=1}^{𝑛} 𝑥𝑖 = ∏ 𝛿𝑖 + ∑_{𝑖} 𝛿𝑖 (∏ 𝑎𝑚)/𝑎𝑖 + ∑_{𝑖≠𝑗} 𝛿𝑖𝛿𝑗 (∏ 𝑎𝑚)/(𝑎𝑖𝑎𝑗) + ⋯ + ∏ 𝑎𝑖. <label>(1)</label></formula><p>Here we informally use the fractional representation, such as (∏ 𝑎𝑚)/𝑎𝑖, to denote the product of all the terms except certain ones; this fractional form does not involve any actual division. Each secret-shared term [(∏ 𝑎𝑚)/(𝑎𝑖 ⋯ 𝑎𝑗)] can be found in the auxiliary sets {𝒜𝑖}_{𝑖=1}^{𝑛}, which are preshared across all parties.</p><p>The adaptation of Equation 1 to the secret-sharing scheme is as follows:</p><formula xml:id="formula_5">[∏_{𝑖=1}^{𝑛} 𝑥𝑖] = ∏ 𝛿𝑖 + ∑_{𝑖} 𝛿𝑖 [(∏ 𝑎𝑚)/𝑎𝑖] + ∑_{𝑖≠𝑗} 𝛿𝑖𝛿𝑗 [(∏ 𝑎𝑚)/(𝑎𝑖𝑎𝑗)] + ⋯ + [∏ 𝑎𝑖]. <label>(2)</label></formula><p>Since Equation 2 is linear in the secret-shared terms, all communications can be conducted in parallel, i.e. in a single round. In this case, we simply reveal all the masked terms [𝛿𝑖] and compute the sharing of the final result in constant round complexity. The protocol is formally described in Algorithm 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 2 Multivariate Beaver Multiplication of 𝑛 inputs</head><formula xml:id="formula_6">Mul([𝑥1], [𝑥2], . . . , [𝑥𝑛]) Input: Secret-shared inputs {[𝑥𝑖]}_{𝑖=1}^{𝑛}, auxiliary sets {𝒜𝑖}_{𝑖=1}^{𝑛}. Output: [∏ 𝑥𝑖]. 1: ◁ Compute masked values 2: for 𝑖 ∈ [1, 𝑛] do 3: [𝛿𝑖] ← [𝑥𝑖 − 𝑎𝑖] = [𝑥𝑖] − [𝑎𝑖] 4: end for 5: 𝛿𝑖 ← Reveal([𝛿𝑖]) for all 𝑖 ◁ One round of communication 6: ◁ Combine following Equation 2 7: return ∏ 𝛿𝑖 + ∑_{𝑖} 𝛿𝑖 [(∏ 𝑎𝑚)/𝑎𝑖] + ∑_{𝑖≠𝑗} 𝛿𝑖𝛿𝑗 [(∏ 𝑎𝑚)/(𝑎𝑖𝑎𝑗)] + ⋯ + [∏ 𝑎𝑖]</formula><p>The total number of communication rounds is indeed reduced from 𝑛 − 1 to a constant 1 when an 𝑛-ary multiplication is performed, but the overall size of the communicated data grows from linear to exponential. In the naïve implementation, the data size of an 𝑛-ary multiplication is only 3(𝑛 − 1) elements, for a total transmission of 𝑛 − 1 Beaver triples; in the multivariate implementation, it is 2^𝑛 − 1 elements to transmit the auxiliary sets {𝒜𝑖}_{𝑖=1}^{𝑛}. Therefore, in practice, there is a trade-off between communication latency and throughput.</p></div>
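The sketch below simulates Algorithm 2 for arbitrary 𝑛: the TTP shares every cross-term in the auxiliary sets offline, all 𝛿𝑖 are opened in what would be a single communication round, and each party combines its result locally per Equation 2. It is an illustrative plaintext simulation under our own assumptions, not the CrypTen integration.

```python
import random
from functools import reduce
from itertools import combinations

Q = 2**64

def share(x, n=3):
    s = [random.randrange(Q) for _ in range(n - 1)]
    s.append((x - sum(s)) % Q)
    return s

def reveal(sh):
    return sum(sh) % Q

def prod(it):
    return reduce(lambda u, v: u * v % Q, it, 1)

def multivariate_mul(x_shares, n_parties=3):
    """One-round n-ary multiplication of secret-shared inputs (Equation 2)."""
    n = len(x_shares)
    idx = tuple(range(n))
    # --- offline: TTP samples masks a_i and shares every cross-term product ---
    a = [random.randrange(Q) for _ in range(n)]
    aux = {S: share(prod(a[i] for i in S), n_parties)
           for r in range(1, n + 1) for S in combinations(idx, r)}
    # --- online: mask the inputs; all delta_i are revealed in ONE round ---
    delta = []
    for i in range(n):
        d_sh = [(xs - ash) % Q for xs, ash in zip(x_shares[i], aux[(i,)])]
        delta.append(reveal(d_sh))
    # --- local combination: coefficient prod(delta outside S) times [prod a in S] ---
    out = [0] * n_parties
    for r in range(1, n + 1):
        for S in combinations(idx, r):
            coeff = prod(delta[i] for i in idx if i not in S)
            for p in range(n_parties):
                out[p] = (out[p] + coeff * aux[S][p]) % Q
    out[0] = (out[0] + prod(delta)) % Q   # public all-delta term, added once
    return out

if __name__ == "__main__":
    print(reveal(multivariate_mul([share(x) for x in (3, 5, 7)])))   # 105
```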
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Univariate Polynomials</head><p>The formal form of a univariate polynomial is 𝑃(𝑥) = ∑_{𝑖=0}^{𝑛} 𝑏𝑖𝑥^𝑖, where 𝑏𝑖 is the coefficient of the degree-𝑖 term. Univariate polynomials enable efficient evaluation and manipulation of polynomial expressions. Following Damgård et al. <ref type="bibr" target="#b16">[17]</ref>, we can compute all required [𝑥^𝑖] in parallel using multivariate multiplications and then multiply them with the corresponding plaintext coefficients. Despite its benefits, this trick has the disadvantage of exponentially increasing the size of the transmitted data, which becomes impractical when the degree exceeds 5.</p><p>In practice, this method is implemented by computing a tuple of base terms and then multiplying the tuple by a fixed term iteratively, as in exponentiation by squaring or the fast modular exponentiation algorithm. In other words, a tuple 𝑔 = (1, 𝑥, . . . , 𝑥^{𝑚−1}) of size ‖𝑔‖ = 𝑚 is multiplied by 𝑥^{‖𝑔‖} repeatedly to iterate over all the 𝑥^𝑖 terms. An overview of the implementation is given in Algorithm 3, where 𝑏𝑠:𝑒 denotes the subvector of 𝑏 from position 𝑠 to 𝑒.</p></div>
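To illustrate the blockwise evaluation of Algorithm 3, here is a plaintext Python rendering of its control flow; the variable names mirror the algorithm, and in the secret-shared version the base powers would come from parallel multivariate multiplications, while each tuple shift by 𝑥^𝑚 is one vectorized Beaver round.

```python
def poly_eval(x, b, m):
    """Algorithm 3 in the clear: evaluate sum(b[i] * x**i) blockwise.

    b must have length divisible by m. On shares, the m base powers are
    produced in parallel with multivariate multiplications, and each
    `g = [xm * gi ...]` step corresponds to one vectorized Beaver round.
    """
    g = [x**i for i in range(m)]      # base terms (1, x, ..., x^{m-1})
    xm = x**m
    t = 0
    for i in range(len(b) // m):
        s, e = i * m, (i + 1) * m
        t += sum(bi * gi for bi, gi in zip(b[s:e], g))  # t += b_{s:e} . g
        g = [xm * gi for gi in g]     # shift the tuple by x^m for next block
    return t

# 1 + 2x + 3x^2 + 4x^3 at x = 2 with block size m = 2  ->  49
print(poly_eval(2, [1, 2, 3, 4], 2))
```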
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Nonlinear Approximations</head><p>In this section, we go one step further and optimize commonly used nonlinear functions by leveraging the parallelism of our proposed multivariate multiplication.</p><p>Exponentiation. Since exponential functions grow geometrically, approximations based on series expansion generally suffer from a significant reduction in accuracy because the exact magnitude of the input is unknown. Consequently, we resort to the iterative limit approximation, which can utilize multivariate multiplication effectively:</p><formula xml:id="formula_8">𝑒^{[𝑥]} = lim_{𝑛→∞} (1 + [𝑥]/𝑑^𝑛)^{𝑑^𝑛}.</formula><p>During each iteration, the 𝑑-th power of the previous result is calculated. As the number of iterations 𝑛 increases, the approximation approaches the true value.</p><p>Logarithm. The calculation of logarithms relies on higher-order iterative methods for better convergence, i.e. the Householder method on 𝑦 = ln 𝑥:</p><formula xml:id="formula_9">[ℎ𝑛] = 1 − [𝑥]𝑒^{−[𝑦𝑛]}, [𝑦𝑛+1] = [𝑦𝑛] − ∑_{𝑘=1}^{∞} (1/𝑘)[ℎ𝑛^𝑘]</formula><p>Note that the implementation of the logarithm combines exponentiation and univariate polynomials. The degree of the polynomial determines the precision of the output.</p><p>Reciprocal. The reciprocal function 𝑦 = 1/𝑥 is calculated using the Newton-Raphson method with an initial guess 𝑦0:</p><formula xml:id="formula_10">[𝑦𝑛+1] = [𝑦𝑛](2 − [𝑥][𝑦𝑛]) = 2[𝑦𝑛] − [𝑥][𝑦𝑛][𝑦𝑛]</formula><p>Trigonometry. Trigonometric functions can be treated as a special case of exponentiation with 𝑑 = 2. The sine and cosine functions are calculated in the field of complex numbers:</p><formula xml:id="formula_11">[sin 𝑥] = Im([𝑒^{𝑖𝑥}]) [cos 𝑥] = Re([𝑒^{𝑖𝑥}])</formula><p>Using the above nonlinear functions, we can calculate most of the loss functions used in deep learning, such as the sigmoid, tanh, and cross-entropy functions. Various other common nonlinear functions, such as the softmax and kernel functions, can also be calculated from the exponential and reciprocal functions.</p></div>
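The iterations above are easy to sanity-check in the clear. The following Python sketch mirrors them with plain floats; the iteration counts (n = 8, three Householder steps, ten Newton-Raphson steps) are illustrative choices of ours, and in the MPC setting each 𝑑-th power or polynomial step maps onto the multivariate primitives of Section 4.1.

```python
def exp_approx(x, d=3, n=8):
    """exp(x) ~= (1 + x/d**n)**(d**n); each loop takes a d-th power,
    which on shares is one d-ary multivariate multiplication round."""
    y = 1 + x / d**n
    for _ in range(n):
        y = y**d
    return y

def log_approx(x, iters=3, k=8):
    """Householder iteration for ln(x): y <- y - sum_{j<=k} h**j / j,
    with h = 1 - x * exp(-y); the series is truncated at degree k."""
    y = 1.0                       # initial guess
    for _ in range(iters):
        h = 1 - x * exp_approx(-y)
        y -= sum(h**j / j for j in range(1, k + 1))
    return y

def reciprocal_approx(x, y0, iters=10):
    """Newton-Raphson for 1/x: y <- y * (2 - x*y) = 2y - x*y*y."""
    y = y0
    for _ in range(iters):
        y = 2 * y - x * y * y
    return y

print(exp_approx(1.0))               # ~2.718
print(log_approx(2.0))               # ~0.693
print(reciprocal_approx(4.0, 0.1))   # ~0.25
```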
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Communication Coalescing</head><p>The key to achieving low-latency secret-sharing computation is to reduce the total number of communication rounds among the parties. While we introduce a latency-friendly implementation of basic math operations, other kinds of communication, such as precision checking, still require additional but independent communication rounds.</p><p>In general, the communication involved in multiple math operations can be abstracted as a communication graph, or more strictly, a communication tree. We observe that independent communications that do not affect downstream results can be deferred and combined into one single round of communication. We refer to this process as communication coalescing; it eliminates unnecessary rounds of communication and improves the utilization of network bandwidth.</p></div>
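As an illustration of coalescing, the toy Python class below queues independent payloads and flushes them in a single exchange; `send_fn`, the tags, and the echoing network are hypothetical stand-ins for a real communicator, not part of our implementation.

```python
class CoalescingChannel:
    """Defer independent reveals and flush them as one network round.

    Payloads whose results are not yet needed downstream are queued,
    packed into a single batch, and exchanged together, turning k
    independent communication rounds into one.
    """
    def __init__(self, send_fn):
        self.send_fn = send_fn      # underlying one-round exchange
        self.pending = []           # (tag, payload) pairs awaiting flush

    def defer(self, tag, payload):
        self.pending.append((tag, payload))

    def flush(self):
        if not self.pending:
            return {}
        batch = dict(self.pending)  # one coalesced packet, not len(pending) rounds
        self.pending.clear()
        return self.send_fn(batch)

# Example: the "network" just echoes; a masked-value reveal and an
# independent precision check share the same communication round.
ch = CoalescingChannel(send_fn=lambda batch: batch)
ch.defer("delta_1", [12, 7])
ch.defer("precision_check", [0])
print(ch.flush())                   # a single round carries both payloads
```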
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Security Analysis</head><p>The correctness of the multivariate multiplication follows directly from Equations 1 and 2. As univariate polynomials are implemented using the same scheme as the extended fast modular exponentiation algorithm, their correctness also follows from the correctness and security of multivariate multiplication. The coalescing mechanism only alters the order of communication rounds without modifying the payloads, so it is likewise reliable and secure.</p><p>Multivariate computations are as secure as traditional Beaver multiplications under the semi-honest assumption. Intuitively, since 𝑎𝑖 is chosen at random by the TTP, the value 𝛿𝑖 = 𝑥𝑖 − 𝑎𝑖 is indistinguishable from a random number. Consequently, the disclosure of [𝛿𝑖] does not reveal any critical information about 𝑥𝑖. This holds even if multiple parties, except for the TTP, collude.</p><p>To state the security of multivariate multiplication formally, we denote [𝑥]𝑝 as the secret share of 𝑥 for party 𝑝 ∈ 𝒫. The global equations of the multivariate system are as follows:</p><formula xml:id="formula_12">∑_{𝑝∈𝒫} [𝑥𝑖]𝑝 = 𝑥𝑖, ∑_{𝑝∈𝒫} [𝑎𝑖]𝑝 = 𝑎𝑖, ∑_{𝑝∈𝒫} [𝑎𝑖𝑎𝑗]𝑝 = 𝑎𝑖𝑎𝑗, . . . , ∑_{𝑝∈𝒫} [𝑎1 ⋯ 𝑎𝑛]𝑝 = 𝑎1 ⋯ 𝑎𝑛, [𝑥𝑖]𝑝 − [𝑎𝑖]𝑝 = [𝛿𝑖]𝑝 <label>(3)</label></formula><p>with [𝛿𝑖]𝑝 known to each party for every 𝑝 ∈ 𝒫. From each party's view, these 2^𝑛 + 2𝑛 − 1 equations have Θ(2^𝑛 ‖𝒫‖) unknown variables. This indicates the difficulty of determining the exact value of 𝑥𝑖, as shown in <ref type="bibr" target="#b17">[18]</ref>. A party's view comprises all the values it can obtain during the execution. Then the following theorem holds (Theorem 1): let {𝑥′𝑖} and {𝑥″𝑖} be random values; the distribution of each party's view is identical whether 𝑥𝑖 = 𝑥′𝑖 or 𝑥𝑖 = 𝑥″𝑖. This guarantees the security of multivariate multiplication by ensuring indistinguishability between the random distribution and the view's distribution.</p></div>
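The intuition that 𝛿𝑖 = 𝑥𝑖 − 𝑎𝑖 is uniformly distributed regardless of the secret can be checked empirically. The small Python experiment below, with a toy ring Q = 17 chosen by us for readability, shows that the histogram of 𝛿 is statistically the same whether the secret is 3 or 14.

```python
import random
from collections import Counter

Q = 17  # toy ring, small enough to inspect the histogram directly

def delta_distribution(x, trials=100_000):
    """Empirical distribution of delta = x - a (mod Q) for a uniform mask a."""
    return Counter((x - random.randrange(Q)) % Q for _ in range(trials))

# Near-uniform counts (~trials/Q each) for any secret x: the revealed delta
# gives a view indistinguishable between x = x' and x = x'' (Theorem 1).
print(delta_distribution(3).most_common(3))
print(delta_distribution(14).most_common(3))
```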
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Experimental Setup</head><p>As part of our proposed methodology, we use CrypTen <ref type="bibr" target="#b4">[5]</ref> as the base MPC deep learning framework, which already provides naïve implementations of secret-sharing-based computations. In most of our experiments, we use 3-party MPC on CPUs. Additionally, we allow at most 4-ary multiplication as stated in Section 4.1, and we set 𝑑 = 3 for exponentiation and 𝑘 = 8 for the logarithm as described in Section 4.3.</p><p>To measure performance, we run experiments with deep learning models of different sizes: (a) Linear Support Vector Classification (LinearSVC) with an L2 penalty; (b) LeNet <ref type="bibr" target="#b18">[19]</ref> with shallow convolutional and linear layers along with ReLU activation functions; (c) a ResNet-18 model <ref type="bibr" target="#b19">[20]</ref> with multiple convolutional, linear, pooling, and activation layers; and (d) a Transformer encoder model <ref type="bibr" target="#b20">[21]</ref> with a single multi-head attention layer and BatchNorm <ref type="bibr" target="#b21">[22]</ref> in place of LayerNorm <ref type="bibr" target="#b22">[23]</ref>. We employ several datasets for classification tasks, with appropriate adaptation to the specific models, including the MNIST <ref type="bibr" target="#b18">[19]</ref>, CIFAR-10 <ref type="bibr" target="#b23">[24]</ref>, ImageNet <ref type="bibr" target="#b24">[25]</ref>, and Sentiment140 <ref type="bibr" target="#b25">[26]</ref> datasets. Each experiment is conducted in a simulated multi-node environment using Docker. The TTP runs in an independent environment separate from the normal parties. To simulate different network environments in terms of bandwidth and latency, we use the docker-tc tool to adjust the Docker network settings accordingly.</p></div>
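For reference, a minimal CrypTen usage sketch is shown below. It assumes CrypTen's documented public API (`crypten.init`, `crypten.cryptensor`, `crypten.mpc.run_multiprocess`) and simply runs a mixed linear/nonlinear computation under a simulated 3-party setup; it is not our patched low-latency build.

```python
import torch
import crypten
import crypten.mpc as mpc

crypten.init()

@mpc.run_multiprocess(world_size=3)
def private_eval():
    # the plaintext tensor is secret-shared; each party holds only a share
    x = crypten.cryptensor(torch.tensor([0.5, 1.0, 2.0]))
    y = x.exp() * x          # nonlinear (approximated) + linear operations
    return y.get_plain_text()

# run_multiprocess gathers one return value per simulated party
print(private_eval()[0])
```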
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Metrics</head><p>To provide a comprehensive evaluation of our proposed method, we adopt metrics from a variety of perspectives.</p><p>• 𝑡comp: The computational time cost for evaluating a single data sample in one round.</p><p>• 𝑡comm: The communication time cost for evaluating a single data sample in one round.</p><p>• Size of Transmitted Data: The size of the transmitted network packets when evaluating a single data sample in one round.</p><p>• Accuracy: The classification accuracy when evaluated on a particular dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">Latency &amp; Throughput</head><p>To assess the efficiency of our proposed method, we simulate networks with different latencies: (a) network 𝑁low with 0.1 ms latency, (b) network 𝑁med with 5 ms latency, and (c) network 𝑁high with 40 ms latency. All of these networks have a bandwidth of 1 Gbps. Our simulated multi-node setting includes 3 nodes plus an additional TTP by default.</p><p>As shown in Table <ref type="table" target="#tab_2">1</ref>, the computation cost of each model is negligible compared to the communication cost in the medium- and high-latency network settings. We therefore focus only on the communication costs associated with our proposed method.</p><p>Compared to the naïve method implemented by CrypTen, the total communication size of our method (Section 4.1) remains close, since it does not introduce a substantial amount of additional communication payload when the maximum number of input variables is set appropriately. For instance, a 3-ary or 4-ary multiplication does not significantly increase the total size of communication.</p><p>It is noteworthy that our proposed method reduces the communication cost in every network setting compared to the naïve MPC implementation. Overall, we achieve an improvement of 10∼20%, a significant enhancement for high-latency environments in practice.</p><p>Furthermore, we observe that our proposed method behaves differently on neural models with different architectures. Figure <ref type="figure" target="#fig_4">1</ref> illustrates the communication breakdown of different models. As can be seen, the attention mechanism is constrained by the communication bottleneck of the Softmax operation, while CNNs are constrained by the communication of convolutional operations. Since our method specifically optimizes nonlinear functions, attention-based models show a significant latency improvement of almost 25%. This also explains the more limited improvement of only 8∼15% for traditional machine-learning models and CNN-based models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.">Evaluation</head><p>In this section, we examine the side effects of and factors in the basic settings, such as downstream task accuracy, the number of parties involved, and the trade-off between network latency and bandwidth.</p><p>To evaluate the drop in accuracy, we compare our method with both the original baseline and the naïve implementation without the low-latency design. Table <ref type="table" target="#tab_3">2</ref> shows that, in relatively small scenarios, both the naïve implementation and our method match the baselines. Nevertheless, both MPC-based implementations obtain lower accuracy than the baseline in complex scenarios, and our method performs slightly worse than the naïve implementation. We hypothesize that multivariate multiplication introduces additional precision requirements, which in turn reduce accuracy.</p><p>The throughput and latency of MPC-based methods are also affected by the number of parties involved in the computation. From Figure <ref type="figure" target="#fig_5">2</ref>, it can be seen that the communication data size of both methods increases linearly with the number of parties, whereas the latency tends to worsen as more parties are involved. Moreover, Figure <ref type="figure" target="#fig_6">3</ref> illustrates how network bandwidth affects communication costs. When sufficient bandwidth is available, our method still reduces network latency. When bandwidth becomes the bottleneck, however, our method is no longer effective in reducing the overall communication cost. This indicates that bandwidth remains an important factor in multi-node MPC settings, especially as the number of nodes grows.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Discussion</head><p>Since the proposed multivariate multiplication operates over a finite ring, it may have precision issues that lead to incorrect results. Fortunately, a loss in precision does not significantly affect the overall performance of deep learning, since the loss can be interpreted as random noise.</p><p>Moreover, our proposed method is only applicable to functions built on linear MPC operations. To avoid heavy communication, a modern MPC-based deep learning framework would also involve other protocols, such as Homomorphic Encryption <ref type="bibr" target="#b26">[27]</ref>, Garbled Circuits <ref type="bibr" target="#b27">[28]</ref>, and Function Secret Sharing <ref type="bibr" target="#b28">[29]</ref>. Although these approaches may communicate less, our approach can still be seamlessly integrated with the current secret-sharing framework and achieve a latency improvement of ~20% without adding excessive computational workload.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Conclusion</head><p>This study proposes a secret-sharing-based MPC method that enhances the linear computation required in deep learning by increasing communication utilization. Using the multivariate multiplication and communication coalescing mechanisms, we reduce the number of unnecessary communication rounds during the execution of both linear and nonlinear deep learning functions. In our experiments, we demonstrate that our proposed methods achieve an overall latency improvement of 10∼20% compared to the naïve MPC implementation, while throughput and downstream task performance remain comparable to the naïve implementation, demonstrating the method's validity and efficiency. We hope that this work will inspire future improvements in privacy-preserving deep learning techniques and lead to more practical MPC applications.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Algorithm 3 Univariate Polynomial Poly([𝑥], 𝑏)</head><label>3</label><figDesc>Input: Secret-shared input [𝑥], coefficients 𝑏 = (𝑏0, 𝑏1, . . . , 𝑏𝑛), base term size ‖𝑔‖. Output: ∑_{𝑖=0}^{𝑛} 𝑏𝑖[𝑥^𝑖]. 1: ◁ Construct base terms 2: parallel for 𝑖 ∈ [1, ‖𝑔‖] do 3: [𝑥^𝑖] ← Mul([𝑥], . . . , [𝑥]) ◁ [𝑥] multiplied 𝑖 times 4: end parallel for 5: 𝑔 ← (1, [𝑥], . . . , [𝑥^{‖𝑔‖−1}]) 6: ◁ Iterative exponentiation 7: 𝑡 ← 0 8: for 𝑖 ∈ [0, ⌊𝑛/‖𝑔‖⌋ − 1] do 9: 𝑠 ← 𝑖 · ‖𝑔‖ 10: 𝑒 ← (𝑖 + 1) · ‖𝑔‖ 11: 𝑡 ← 𝑡 + 𝑏𝑠:𝑒 · 𝑔 12: 𝑔 ← [𝑥^{‖𝑔‖}] · 𝑔 13: ◁ Vectorized Beaver multiplication 14: end for 15: return 𝑡</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Communication percentage of different models. (a) ResNet basic block: Conv2D 49.8%, BatchNorm 25.0%, ReLU 19.1%, Other 6.1%. (b) Attention block: Softmax 60.5%, GeLU 23.1%, MatMul 14.4%, Other 2.0%.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Transmission data and latency of the naïve and our proposed methods when different numbers of parties are involved. The experiment is conducted using the LeNet model on CIFAR-10 over a medium-latency network.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Latency of the naïve and our proposed methods under different network bandwidths. The experiment is conducted using the LeNet model on CIFAR-10 over a medium-latency network.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1</head><label>1</label><figDesc>Communication Latency and Data with Different Network Settings. Latency is measured in milliseconds and data in MiB.</figDesc><table><row><cell>Model</cell><cell>Dataset</cell><cell>𝑡comp</cell><cell cols="2">Data Size</cell><cell cols="2">𝑡comm (𝑁low)</cell><cell cols="2">𝑡comm (𝑁med)</cell><cell cols="2">𝑡comm (𝑁high)</cell></row><row><cell></cell><cell></cell><cell></cell><cell>Naïve</cell><cell>Ours</cell><cell>Naïve</cell><cell>Ours</cell><cell>Naïve</cell><cell>Ours</cell><cell>Naïve</cell><cell>Ours</cell></row><row><cell>LinearSVC</cell><cell>MNIST</cell><cell>0.032</cell><cell>0.022</cell><cell>0.024</cell><cell>0.082</cell><cell>0.059</cell><cell>0.604</cell><cell>0.568</cell><cell>4.339</cell><cell>4.026</cell></row><row><cell>LeNet</cell><cell>CIFAR-10</cell><cell>0.900</cell><cell>38.562</cell><cell>41.647</cell><cell>1.373</cell><cell>1.182</cell><cell>7.673</cell><cell>6.983</cell><cell>53.010</cell><cell>47.868</cell></row><row><cell>ResNet-18</cell><cell>ImageNet</cell><cell>110.447</cell><cell>11571.294</cell><cell>12612.711</cell><cell>219.541</cell><cell>185.293</cell><cell>1681.685</cell><cell>1407.570</cell><cell>~11760</cell><cell>~9870</cell></row><row><cell>Transformer</cell><cell>Sentiment140</cell><cell>7.824</cell><cell>771.787</cell><cell>893.421</cell><cell>12.190</cell><cell>9.715</cell><cell>78.624</cell><cell>61.877</cell><cell>559.645</cell><cell>421.413</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Classification Accuracy using Different Methods.</figDesc><table><row><cell>Model</cell><cell>Dataset</cell><cell cols="3">Accuracy (%)</cell></row><row><cell></cell><cell></cell><cell>Origin</cell><cell>Naïve</cell><cell>Ours</cell></row><row><cell>LinearSVC</cell><cell>MNIST</cell><cell>100.00</cell><cell>100.00</cell><cell>100.00</cell></row><row><cell>LeNet</cell><cell>CIFAR-10</cell><cell>100.00</cell><cell>100.00</cell><cell>100.00</cell></row><row><cell>ResNet-18</cell><cell>ImageNet</cell><cell>69.30</cell><cell>61.58</cell><cell>60.10</cell></row><row><cell>Transformer</cell><cell>Sentiment140</cell><cell>59.87</cell><cell>58.55</cell><cell>57.74</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work is supported by the National Key R&amp;D Program of China under grant (No. 2022YFB2703001).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Multiparty computation from somewhat homomorphic encryption</title>
		<author>
			<persName><forename type="first">I</forename><surname>Damgård</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Pastro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Smart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zakarias</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-642-32009-5_38</idno>
	</analytic>
	<monogr>
		<title level="m">CRYPTO 2012</title>
				<editor>
			<persName><forename type="first">R</forename><surname>Safavi-Naini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Canetti</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg; Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="643" to="662" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Mp-spdz: A versatile framework for multiparty computation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Keller</surname></persName>
		</author>
		<idno type="DOI">10.1145/3372297.3417872</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, CCS &apos;20</title>
				<meeting>the 2020 ACM SIGSAC Conference on Computer and Communications Security, CCS &apos;20<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1575" to="1590" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Privacy and security issues in deep learning: A survey</title>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ying</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">V</forename><surname>Vasilakos</surname></persName>
		</author>
		<idno type="DOI">10.1109/ACCESS.2020.3045078</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="4566" to="4593" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Polymath: Low-latency mpc via secure polynomial evaluations and its applications</title>
		<author>
			<persName><forename type="first">D</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kate</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Maji</surname></persName>
		</author>
		<idno type="DOI">10.2478/popets-2022-0020</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings on Privacy Enhancing Technologies</title>
				<meeting>on Privacy Enhancing Technologies</meeting>
		<imprint>
			<date type="published" when="2021">2022. 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Crypten: Secure multiparty computation meets machine learning</title>
		<author>
			<persName><forename type="first">B</forename><surname>Knott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Venkataraman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hannun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sengupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ibrahim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Maaten</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2109.00984</idno>
	</analytic>
	<monogr>
		<title level="j">NIPS</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="4961" to="4973" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">CryptGPU: Fast privacy-preserving machine learning on the gpu</title>
		<author>
			<persName><forename type="first">S</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Knott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Wu</surname></persName>
		</author>
		<idno type="DOI">10.1109/SP40001.2021.00098</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE S&amp;P</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Mpcformer: fast, performant and private transformer inference with mpc</title>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Shao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Xing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2211.01452</idno>
	</analytic>
	<monogr>
		<title level="m">The 11th International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Falcon: Honest-majority maliciously secure framework for private deep learning</title>
		<author>
			<persName><forename type="first">S</forename><surname>Wagh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tople</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Benhamouda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kushilevitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mittal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rabin</surname></persName>
		</author>
		<idno type="DOI">10.2478/popets-2021-0011</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings on Privacy Enhancing Technologies</title>
				<meeting>on Privacy Enhancing Technologies</meeting>
		<imprint>
			<date type="published" when="2020">2021. 2020</date>
			<biblScope unit="page" from="188" to="208" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Arithmetic tuples for mpc</title>
		<author>
			<persName><forename type="first">T</forename><surname>Krips</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Küsters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Reisert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rivinius</surname></persName>
		</author>
		<ptr target="https://eprint.iacr.org/2022/667" />
	</analytic>
	<monogr>
		<title level="j">Cryptology ePrint Archive</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">How to share a secret</title>
		<author>
			<persName><forename type="first">A</forename><surname>Shamir</surname></persName>
		</author>
		<idno type="DOI">10.1145/359168.359176</idno>
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page" from="612" to="613" />
			<date type="published" when="1979">1979</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Efficient multiparty protocols using circuit randomization</title>
		<author>
			<persName><forename type="first">D</forename><surname>Beaver</surname></persName>
		</author>
		<idno type="DOI">10.1007/3-540-46766-1_34</idno>
	</analytic>
	<monogr>
		<title level="m">CRYPTO 1991</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Feigenbaum</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg; Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1992">1992</date>
			<biblScope unit="page" from="420" to="432" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Randomizing polynomials: A new representation with applications to roundefficient secure computation</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Ishai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kushilevitz</surname></persName>
		</author>
		<idno type="DOI">10.1109/SFCS.2000.892118</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings 41st Annual Symposium on Foundations of Computer Science</title>
				<meeting>41st Annual Symposium on Foundations of Computer Science</meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="294" to="304" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Efficient polynomial operations in the shared-coefficients setting</title>
		<author>
			<persName><forename type="first">P</forename><surname>Mohassel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Franklin</surname></persName>
		</author>
		<idno type="DOI">10.1007/11745853_4</idno>
	</analytic>
	<monogr>
		<title level="m">PKC 2006</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Yung</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Dodis</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Kiayias</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Malkin</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg; Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="44" to="57" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Secure efficient multiparty computing of multivariate polynomials and applications</title>
		<author>
			<persName><forename type="first">D</forename><surname>Dachman-Soled</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Malkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Raykova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yung</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-642-21554-4_8</idno>
	</analytic>
	<monogr>
		<title level="m">Applied Cryptography and Network Security</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Lopez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Tsudik</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg; Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="130" to="146" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<author>
			<persName><forename type="first">S</forename><surname>Wagh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Chandran</surname></persName>
		</author>
		<ptr target="https://eprint.iacr.org/2018/442" />
	</analytic>
	<monogr>
		<title level="m">Securenn: Efficient and private neural network training</title>
		<title level="s">Cryptology ePrint Archive</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Aby3: A mixed protocol framework for machine learning</title>
		<author>
			<persName><forename type="first">P</forename><surname>Mohassel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename></persName>
		</author>
		<idno type="DOI">10.1145/3243734.3243760</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS &apos;18</title>
				<meeting>the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS &apos;18<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="35" to="52" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Unconditionally secure constant-rounds multi-party computation for equality, comparison, bits and exponentiation</title>
		<author>
			<persName><forename type="first">I</forename><surname>Damgård</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fitzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kiltz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Nielsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Toft</surname></persName>
		</author>
		<idno type="DOI">10.1007/11681878_15</idno>
	</analytic>
	<monogr>
		<title level="m">Theory of Cryptography</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Halevi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Rabin</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg; Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="285" to="304" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A note on the communication complexity of multiparty computation in the correlated randomness model</title>
		<author>
			<persName><forename type="first">G</forename><surname>Couteau</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-17656-3_17</idno>
	</analytic>
	<monogr>
		<title level="m">EUROCRYPT 2019</title>
				<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="473" to="503" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Gradientbased learning applied to document recognition</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Haffner</surname></persName>
		</author>
		<idno type="DOI">10.1109/5.726791</idno>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the IEEE</title>
		<imprint>
			<biblScope unit="volume">86</biblScope>
			<biblScope unit="page" from="2278" to="2324" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2016.90</idno>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="770" to="778" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">U</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<idno type="DOI">10.5555/3295222.3295349</idno>
		<editor>I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett</editor>
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>Curran Associates, Inc</publisher>
			<biblScope unit="volume">30</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Batch normalization: accelerating deep network training by reducing internal covariate shift</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ioffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<idno type="DOI">10.5555/3045118.3045167</idno>
	</analytic>
	<monogr>
		<title level="m">Proc. 32nd Int. Conf. Machine Learning -Volume</title>
				<meeting>32nd Int. Conf. Machine Learning -Volume</meeting>
		<imprint>
			<publisher>JMLR</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="448" to="456" />
		</imprint>
	</monogr>
	<note>ICML&apos;15</note>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Ba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Kiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1607.06450</idno>
		<title level="m">Layer normalization</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<title level="m">Learning multiple layers of features from tiny images</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Imagenet: A large-scale hierarchical image database</title>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2009.5206848</idno>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
				<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="248" to="255" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Twitter sentiment classification using distant supervision, CS224N project report</title>
		<author>
			<persName><forename type="first">A</forename><surname>Go</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bhayani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CS224N Project Report</title>
		<imprint>
			<publisher>Stanford</publisher>
			<biblScope unit="volume">1</biblScope>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Fully homomorphic encryption using ideal lattices</title>
		<author>
			<persName><forename type="first">C</forename><surname>Gentry</surname></persName>
		</author>
		<idno type="DOI">10.1145/1536414.1536440</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, STOC &apos;09</title>
				<meeting>the Forty-First Annual ACM Symposium on Theory of Computing, STOC &apos;09<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="169" to="178" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Protocols for secure computations</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Yao</surname></persName>
		</author>
		<idno type="DOI">10.1109/SFCS.1982.38</idno>
	</analytic>
	<monogr>
		<title level="m">23rd Annual Symposium on Foundations of Computer Science</title>
				<meeting>23rd Annual Symposium on Foundations of Computer Science (sfcs 1982)</meeting>
		<imprint>
			<date type="published" when="1982">1982. 1982</date>
			<biblScope unit="page" from="160" to="164" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Function secret sharing</title>
		<author>
			<persName><forename type="first">E</forename><surname>Boyle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Gilboa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ishai</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-662-46803-6_12</idno>
	</analytic>
	<monogr>
		<title level="m">EUROCRYPT 2015</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Oswald</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Fischlin</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="337" to="367" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
