Low-Latency Privacy-Preserving Deep Learning Design via Secure MPC

Ke Lin¹, Yasir Glani¹ and Ping Luo¹,*
¹ Tsinghua University, 30 Shuangqing Rd., Haidian District, Beijing, 100084, China

Abstract
Secure multi-party computation (MPC) facilitates privacy-preserving computation between multiple parties without leaking private information. While most secure deep learning techniques use MPC operations to achieve feasible privacy-preserving machine learning on downstream tasks, their computation and communication overhead still hampers practical application. This work proposes a low-latency secret-sharing-based MPC design that reduces unnecessary communication rounds during the execution of MPC protocols. We also present a method for improving the computation of commonly used nonlinear functions in deep learning by integrating multivariate multiplication and coalescing different packets into one to maximize network utilization. Our experimental results indicate that our method is effective in a variety of settings, with a speedup in communication latency of 10∼20%.

Keywords
Multi-party computation, deep learning, privacy-preserving

The IJCAI-2024 AISafety Workshop
* Corresponding author.
leonard.keilin@gmail.com (K. Lin); yasirglani@gmail.com (Y. Glani); luop@tsinghua.edu.cn (P. Luo)
ORCID: 0009-0002-5376-7881 (K. Lin); 0000-0003-0060-4771 (Y. Glani); 0000-0001-6171-3811 (P. Luo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Secure multi-party computation (MPC) [1, 2] enables parties to compute securely over their private data without revealing the data to each other. Its privacy-preserving property makes secure MPC suitable for most privacy-sensitive domains, such as medical research and finance. With the development of deep learning, the ability of neural models to capture important information from large datasets raises concerns about the surveillance of individuals [3]. In this context, secure MPC finds a natural application in secure machine learning and deep learning. While MPC-based deep learning frameworks achieve significant performance in general scenarios, most works suffer from limitations caused by (1) network communication, due to the exchange of intermediate information during MPC execution, and (2) excessive computation introduced by complex MPC protocols. Since the computation of an MPC protocol is largely determined by its sophisticated design, optimizing the protocol itself is difficult and often infeasible. Thus, some studies [4] focus on improving the communication stage of MPC protocols to make them more practical. In this paper, we present an approach that reduces the communication latency of MPC protocols through optimized multivariate multiplication.

In general, privacy-preserving deep learning frameworks adopt secret-sharing-based techniques to avoid extensive computational overhead [5, 6, 7, 8]. Consequently, secret-sharing-based methods require multiple exchanges of intermediate results to achieve collaborative MPC operations. As these MPC techniques are based on linear computations, such as addition and multiplication, modern deep learning techniques that inherently rely on linear algebra benefit significantly from them. Considering this heavy dependency on linear operations, our research aims to reduce unnecessary communication rounds, following [9], during the execution of MPC protocols.

Our main contributions are as follows:

• We propose a low-latency secret-sharing-based method for computing multivariate multiplications and univariate polynomials that communicates efficiently and reduces unnecessary communication rounds on the fly.

• We improve the computation of nonlinear functions by integrating the proposed multivariate multiplication and coalescing different packets into a single packet to maximize network utilization.

• We conduct experiments to evaluate the effectiveness of our method with respect to models of varying sizes, networks with different latency and bandwidth, the accuracy of downstream classification tasks, and the number of participants involved. The results indicate an overall improvement of 10∼20% in communication latency.

2. Background

2.1. Arithmetic Secret Sharing Based Scheme

Our setup primarily concerns arithmetic operations, so we represent all inputs and intermediate results as linear secret shares between n parties, specifically in an additive secret-sharing scheme. Apart from the general (n, t)-Shamir secret sharing scheme [10], which relies on degree-t polynomials over n parties, we adopt a simple arithmetic secret sharing scheme corresponding to (n, 0)-Shamir secret sharing. In other words, we share a scalar value x ∈ Z/QZ across the n parties 𝒫, where Z/QZ denotes a ring with Q elements, following the notation of [5]. The sharing of x is defined as [x] = {[x]_p}_{p∈𝒫}, where [x]_p is party p's share of x. The ground-truth value x can be reconstructed from the sum of the shares of all parties, i.e. x = Σ_{p∈𝒫} [x]_p.

When parties wish to share a value x, they generate a pseudorandom zero-share that sums to 0, and the party that possesses the value adds x to its own share in secret. To represent floating-point numbers, we adopt a fixed-point encoding that maps any floating-point number x_F to a fixed-point representation x. That is, each x is the result of multiplying a floating-point number x_F by a scaling factor B = 2^L and rounding to the nearest integer, i.e. x = ⌊B·x_F⌉, where L is the precision of the fixed-point encoding. To decode a ground-truth floating-point value x_F from x, we compute x_F ≈ x/B.

2.2. Arithmetic Secret Sharing Based MPC

It is noteworthy that arithmetic secret shares are homomorphic and can be used to implement secure MPC, especially for linear computation.

Addition. The sum of two secret-shared values [x] and [y] can be computed directly as [z] = [x] + [y], where each party p ∈ 𝒫 computes [z]_p = [x]_p + [y]_p without any multi-party communication.
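The sharing, reconstruction, fixed-point encoding, and local addition described above can be sketched in a few lines of plain Python. This is a local toy simulation for illustration only; the modulus Q, precision L, and helper names are our own assumptions, not the paper's implementation:

```python
import random

Q = 2**64          # assumed ring modulus for Z/QZ
L = 16             # fixed-point precision bits
B = 2**L           # scaling factor

def encode(x_f: float) -> int:
    """Fixed-point encode: x = round(B * x_F) mod Q."""
    return round(x_f * B) % Q

def decode(x: int) -> float:
    """Fixed-point decode; map back to a signed representative first."""
    if x >= Q // 2:
        x -= Q
    return x / B

def share(x: int, n: int) -> list:
    """Split x into n additive shares that sum to x mod Q."""
    shares = [random.randrange(Q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

def reconstruct(shares: list) -> int:
    return sum(shares) % Q

# Secure addition is purely local: each party adds its own two shares.
x, y = encode(1.5), encode(-0.25)
z_shares = [xs + ys for xs, ys in zip(share(x, 3), share(y, 3))]
assert abs(decode(reconstruct(z_shares)) - 1.25) < 1e-3
```

Mapping the top half of the ring to negative values in `decode` is one common convention for signed fixed-point arithmetic; frameworks may differ in how they handle the sign.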
Multiplication. Two secret-shared values [x] and [y] are multiplied using a random Beaver triple [11] generated by a Trusted Third Party (TTP): a triplet ([a], [b], [ab]). Note that the Beaver triple can be shared with each party in advance. The parties first compute [ε] = [x] − [a] and [δ] = [y] − [b]. The values [ε] and [δ] are then revealed to all parties (denoted Reveal(·)) without compromising information, since the ground-truth values a, b remain unknown to every party except the TTP. The final result is computed as [xy] = [ab] + ε[b] + [a]δ + εδ. Algorithm 1 illustrates multiplication using Beaver triples.

Algorithm 1 Beaver Multiplication Mul([x], [y])
Input: Secret-shared inputs [x], [y]; Beaver triple ([a], [b], [ab]).
Output: [xy].
1: ◁ Compute masked values
2: [ε] ← [x − a] = [x] − [a]
3: [δ] ← [y − b] = [y] − [b]
4: ◁ Reveal ε and δ through one round of communication
5: ε ← Reveal([ε])
6: δ ← Reveal([δ])
7: return εδ + ε[b] + [a]δ + [ab]

Linear functions. Functions consisting of linear operations can be implemented by combining additions and multiplications. Common operations in deep learning, such as the element-wise product and convolution, fit this linear paradigm.

Nonlinear functions. Since nonlinear functions are inherently infeasible in the standard arithmetic secret-sharing scheme, most works use approximation methods to simulate their outcome. In particular, Taylor expansion, Newton–Raphson, and Householder methods are commonly used to approximate nonlinear functions using only linear operations. Reciprocal functions, exponential functions, loss functions, kernel functions, and other functions useful in deep learning are all calculated this way.

2.3. Notations

This section summarizes the notations used throughout this work. We denote [x] as a secret sharing of x. Reveal([x]) means that the ground-truth value x is revealed to every party involved in the computation through one round of communication. Since most linear operations also apply to element-wise and matrix operations, x may also represent a vector, matrix, or even a tensor when there is no ambiguity.

3. Related Work

To achieve communication-efficient MPC, various approaches have been developed to optimize communication rounds and throughput. Ishai and Kushilevitz [12] propose a new representation of polynomials for round-efficient secure computation, dividing high-degree polynomials into multiple low-degree polynomials that are easy to solve. Mohassel and Franklin [13] perform operations directly on polynomials, such as polynomial multiplication and division. Dachman-Soled et al. [14] improve the evaluation of multivariate polynomials whose variables are held as private inputs by different parties. Lu et al. [4] propose an efficient method for evaluating high-degree polynomials with arbitrary numbers of variables. While this line of research focuses on improving the calculation of polynomials, our study aims to develop a communication-efficient and effective MPC system for modern deep learning frameworks by leveraging the arithmetic-tuple computation of Krips et al. [9]. Our system is not confined to computing polynomials within finite rings, as in previous studies.

In recent years, several privacy-preserving deep learning frameworks have emerged to enable secure inference of neural network models. Wagh et al. [8] implement a maliciously secure 3-party MPC protocol from SecureNN [15] and ABY3 [16]. Knott et al. [5] provide flexible machine-learning APIs with a rich set of functions for secure deep learning. Li et al. [7] present a fast and performant privacy-preserving MPC Transformer inference framework. Our low-latency linear MPC implementation is built on top of Knott et al.'s CrypTen framework and provides a significant improvement in communication latency.
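As a baseline for the methodology that follows, the binary Beaver multiplication of Algorithm 1 can be simulated locally in plaintext Python. The dealer, ring size, and helper names below are our own illustrative assumptions, not CrypTen's API:

```python
import random

Q = 2**64  # assumed ring modulus

def share(x, n=3):
    s = [random.randrange(Q) for _ in range(n - 1)]
    return s + [(x - sum(s)) % Q]

def reveal(shares):
    return sum(shares) % Q

def beaver_mul(x_sh, y_sh):
    """One-round Beaver multiplication of two shared values."""
    n = len(x_sh)
    # The dealer (TTP) samples a random triple (a, b, ab) and pre-shares it.
    a, b = random.randrange(Q), random.randrange(Q)
    a_sh, b_sh, ab_sh = share(a, n), share(b, n), share((a * b) % Q, n)
    # Each party locally masks its input shares ...
    eps_sh = [(xs - a_s) % Q for xs, a_s in zip(x_sh, a_sh)]
    dlt_sh = [(ys - b_s) % Q for ys, b_s in zip(y_sh, b_sh)]
    # ... then eps and dlt are revealed in a single communication round.
    eps, dlt = reveal(eps_sh), reveal(dlt_sh)
    # [xy] = [ab] + eps*[b] + [a]*dlt + eps*dlt (public term added by party 0).
    z_sh = [(ab_s + eps * b_s + a_s * dlt) % Q
            for ab_s, b_s, a_s in zip(ab_sh, b_sh, a_sh)]
    z_sh[0] = (z_sh[0] + eps * dlt) % Q
    return z_sh

x, y = 1234, 5678
assert reveal(beaver_mul(share(x), share(y))) == (x * y) % Q
```

The only values made public are eps = x − a and dlt = y − b, which are uniformly random in the ring because a and b are; this is the security argument the paper later generalizes to the multivariate case.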
4. Methodology

4.1. Multivariate Multiplication

Since Beaver triples show how to multiply two variables with pre-shared triplets, a classic multiplication of multiple variables, such as [xyz], requires several rounds of binary multiplication, i.e. Mul(Mul([x], [y]), [z]). This naive implementation introduces additional communication rounds through the on-the-fly Reveal process: in general, an n-ary multiplication requires n − 1 rounds of communication.

To reduce the communication rounds involved in multivariate multiplication, we extend the basic binary Beaver triple into a general n-ary Beaver triple, so that only one round of communication is required throughout the entire process.

Assume the n inputs are {[x_i]}_{i=1}^{n}. The precomputed and preshared information required by the extended protocol is {𝒜_i}_{i=1}^{n}. Here 𝒜_1 := {[a_j]}_{j=1}^{n} is the set of n auxiliary shared values used to blind the inputs {[x_i]}_{i=1}^{n}, similar to Beaver's idea. Then 𝒜_i (i ≥ 2) is defined as the set of shared degree-i cross-terms of the variables in 𝒜_1. For example, 𝒜_2 := {[a_i a_j] | i ≠ j ∧ 1 ≤ i, j ≤ n}, 𝒜_3 := {[a_i a_j a_k] | i ≠ j ≠ k ∧ 1 ≤ i, j, k ≤ n}, and so on. Similar to the construction of [ε] and [δ] in Section 2.2, we define the difference between the inputs and the masks as [δ_i] := [x_i] − [a_i]. The secret-shared [δ_i] is then made public across all parties without leaking the ground-truth value of either [x_i] or [a_i]. The improvement of our method originates from the following equation:

∏_{i=1}^{n} x_i = ∏_{i=1}^{n} (δ_i + a_i)
              = ∏_i δ_i + Σ_i δ_i · (∏ a_m)/a_i + Σ_{i,j,i≠j} δ_i δ_j · (∏ a_m)/(a_i a_j) + · · · + ∏_i a_i.    (1)

Here we informally use the fractional representation, such as (∏ a_m)/a_i, to denote the product of all the mask terms except certain ones; this fractional form does not involve any actual division. Each secret-shared term of the form [(∏ a_m)/(a_i · · · a_j)] can be found in the auxiliary sets {𝒜_i}_{i=1}^{n}, which are preshared across all parties.

The adaptation of Equation 1 to the secret-sharing scheme is as follows:

[∏_{i=1}^{n} x_i] = ∏_i δ_i + Σ_i δ_i [(∏ a_m)/a_i] + Σ_{i,j,i≠j} δ_i δ_j [(∏ a_m)/(a_i a_j)] + · · · + [∏_i a_i].    (2)

Since Equation 2 is linear in the secret-shared terms, all communications can be conducted in parallel, i.e. in a single round of communication. In this case, we simply reveal all the [δ_i] in one round and, using the preshared terms in {𝒜_i}_{i=1}^{n}, compute the sharing of the final result in constant round complexity. The protocol is formally described in Algorithm 2.

Algorithm 2 Multivariate Beaver Multiplication of n inputs Mul([x_1], [x_2], . . . , [x_n])
Input: Secret-shared inputs {[x_i]}_{i=1}^{n}; auxiliary sets {𝒜_i}_{i=1}^{n}.
Output: [∏ x_i].
1: ◁ Compute masked values
2: for i ∈ [1, n] do
3:   [δ_i] ← [x_i − a_i] = [x_i] − [a_i]
4: end for
5: ◁ Reveal δ_i through one round of communication
6: parallel for i ∈ [1, n] do
7:   δ_i ← Reveal([δ_i])
8: end parallel for
9: ◁ Compute results using preshared {𝒜_i}_{i=1}^{n}
10: return ∏ δ_i + Σ_i δ_i [(∏ a_m)/a_i] + Σ_{i,j,i≠j} δ_i δ_j [(∏ a_m)/(a_i a_j)] + · · · + [∏ a_i]

The total number of communication rounds is indeed reduced from n − 1 to a constant 1 for a regular n-ary multiplication, but the overall size of the communicated data increases from linear to exponential. In a naïve implementation, the data size of an n-ary multiplication is only 3(n − 1), for a total transmission of n − 1 Beaver triples, whereas the multivariate implementation transmits the 2^n − 1 terms of the auxiliary sets {𝒜_i}_{i=1}^{n}. Therefore, in practice, there is a trade-off between communication latency and throughput.

4.2. Univariate Polynomials

The formal form of a univariate polynomial is P(x) = Σ_{i=0}^{n} b_i x^i, where b_i is the coefficient of the degree-i term. Univariate polynomials enable efficient evaluation and manipulation of polynomial expressions. Following Damgård et al. [17], we can compute all required [x^i] in parallel using multivariate multiplications and then multiply them by the corresponding plaintext coefficients. Despite its benefits, this trick has the disadvantage of exponentially increasing the size of the transmitted data, which becomes unbearable when the exponent exceeds 5.

In practice, this method can be implemented by computing a tuple of base terms and then iteratively multiplying the tuple by a fixed term, as in exponentiation by squaring or the fast modulo algorithm. In other words, a tuple g = (1, x, . . . , x^{‖g‖−1}) of size ‖g‖ = m is multiplied by x^{‖g‖} repeatedly to iterate over all the x^i terms. An overview of the implementation of univariate polynomials is given in Algorithm 3, where b_{s:e} denotes the subvector of b from position s to e.

Algorithm 3 Univariate Polynomial Poly([x], b)
Input: Secret-shared input [x]; coefficients b = (b_0, b_1, . . . , b_n); base terms size ‖g‖.
Output: Σ_{i=0}^{n} b_i [x^i].
1: ◁ Construct base terms
2: parallel for i ∈ [1, ‖g‖] do
3:   [x^i] ← Mul([x], . . . , [x])  ◁ [x] multiplied i times
4: end parallel for
5: g ← (1, [x], . . . , [x^{‖g‖−1}])
6: ◁ Iterative exponentiation
7: t ← 0
8: for i ∈ [0, ⌊n/‖g‖⌋ − 1] do
9:   s ← i · ‖g‖
10:  e ← (i + 1) · ‖g‖
11:  t ← t + b_{s:e} · g
12:  ◁ Vectorized Beaver multiplication
13:  g ← [x^{‖g‖}] · g
14: end for
15: return t
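To make Algorithm 2 concrete, here is a plaintext simulation of the 3-ary case over a toy ring. The helper names and the local "dealer" are our own assumptions for illustration; a real deployment would pre-share the auxiliary sets via a TTP and reveal the deltas over the network:

```python
import random
from itertools import combinations
from math import prod

Q = 2**32  # assumed toy ring modulus

def share(x, n=3):
    s = [random.randrange(Q) for _ in range(n - 1)]
    return s + [(x - sum(s)) % Q]

def reveal(shares):
    return sum(shares) % Q

def mul3(x_shares):
    """One-round 3-ary multiplication using an extended Beaver tuple."""
    n = len(x_shares[0])
    a = [random.randrange(Q) for _ in range(3)]  # masks, i.e. the set A_1
    # Dealer pre-shares every cross-term a_i, a_i*a_j, a_0*a_1*a_2
    # (the auxiliary sets A_1, A_2, A_3 of the paper).
    aux = {S: share(prod(a[i] for i in S) % Q, n)
           for r in (1, 2, 3) for S in combinations(range(3), r)}
    # Mask each input; all three deltas are revealed in ONE parallel round.
    d = [reveal([(xs - a_s) % Q for xs, a_s in zip(x_shares[i], aux[(i,)])])
         for i in range(3)]
    # Recombine per Equation 2: for each non-empty mask subset S, weight the
    # shared product of the masks in S by the public deltas outside S.
    out = [0] * n
    for S in aux:
        coeff = prod(d[i] for i in range(3) if i not in S) % Q
        for p in range(n):
            out[p] = (out[p] + coeff * aux[S][p]) % Q
    out[0] = (out[0] + prod(d)) % Q  # the all-delta term is public
    return out

shares = [share(v) for v in (5, 7, 11)]
assert reveal(mul3(shares)) == 385  # 5 * 7 * 11, one communication round
```

The 2^3 − 1 = 7 entries of `aux` match the paper's exponential-data trade-off: one round of communication, but an auxiliary tuple whose size grows as 2^n − 1.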
4.3. Nonlinear Approximations

In this section, we go one step further and optimize commonly used nonlinear functions by leveraging the parallelization property of the proposed multivariate multiplication.

Exponentiation. Since exponential functions grow at a geometric rate, approximations based on series expansion generally suffer a significant loss of accuracy because we do not know the exact value of the input. Consequently, we resort to the naive iterative approximation, which can utilize multivariate multiplication effectively:

e^{[x]} = lim_{n→∞} (1 + [x]/d^n)^{d^n}.

During each iteration, the d-th power of the previous result is calculated. As the number of iterations n increases, the answer approaches the true result.

Logarithm. The calculation of logarithms relies on higher-order iterative methods for better convergence, i.e. Householder methods on y = ln x:

[h_n] = 1 − [x] e^{−[y_n]}
[y_{n+1}] = [y_n] − Σ_{k=1}^{∞} (1/k) [h_n^k]

Note that the implementation of the logarithm combines exponentiation and univariate polynomials; the degree of the polynomial determines the precision of the output.

Reciprocal. The reciprocal function y = 1/x is calculated using the Newton–Raphson method with an initial guess y_0:

[y_{n+1}] = [y_n](2 − [x][y_n]) = 2[y_n] − [x][y_n][y_n]

Trigonometry. Trigonometric functions can be treated as a special case of exponentiation with d = 2. The sine and cosine functions are calculated in the field of complex numbers:

[sin x] = Im([e^{ix}]),  [cos x] = Re([e^{ix}])

Using the above nonlinear functions, we can calculate most of the loss functions used in deep learning, such as the sigmoid, tanh, and cross-entropy functions. Various other common nonlinear functions, such as the softmax function and kernel functions, can also be calculated using the exponential and reciprocal functions.

4.4. Communication Coalescing

The key to achieving low-latency secret-sharing computation is to reduce the total number of rounds of communication among the parties. While we introduce a latency-friendly implementation of basic math operations, other kinds of communication, such as precision checking, still require an additional but independent communication round.

In general, the communication involved in multiple math operations can be abstracted as a communication graph, or more strictly, a communication tree. Accordingly, we observe that some independent communications that do not affect downstream results can be deferred and combined into one single round of communication. We refer to this process as communication coalescing; it eliminates unnecessary rounds of communication and improves the utilization of network bandwidth.

5. Security Analysis

The correctness of the multivariate multiplication follows directly from Equations 1 and 2. As univariate polynomials are implemented with the same method as the extended fast modulo algorithm, their correctness likewise follows from the correctness and security of multivariate multiplication. The coalescing mechanism only alters the order of communication rounds without modifying the payload, so it is also reliable and secure.

Multivariate computations are as secure as traditional Beaver multiplications under semi-honest conditions. Intuitively, since a_i is chosen at random by the TTP, the value δ_i = x_i − a_i is indistinguishable from a random number. Consequently, the disclosure of [δ_i] does not reveal any critical information about x_i. This holds even if multiple parties, excluding the TTP, collude.

To state the security of multivariate multiplication formally, we denote [x]_p as the secret share of x for party p ∈ 𝒫. The global equations of the multivariate system are as follows:

Σ_{p∈𝒫} [x_i]_p = x_i
Σ_{p∈𝒫} [a_i]_p = a_i
Σ_{p∈𝒫} [a_i a_j]_p = a_i a_j
  · · ·
Σ_{p∈𝒫} [a_1 · · · a_n]_p = a_1 · · · a_n
[x_i]_p − [a_i]_p = [δ_i]_p    (3)

with [δ_i]_p known to every party p ∈ 𝒫. From each party's view, these 2^n + 2n − 1 equations have Θ(2^n ‖𝒫‖) unknown variables, which indicates the difficulty of determining the exact value of x_i, as shown in [18].

A party's view represents all the values it can obtain during the execution. The following theorem then holds:

Theorem 1. Let {x'_i} and {x''_i} be random values. The distribution of the view of each party is identical when x_i = x'_i or x_i = x''_i.

This guarantees the security of multivariate multiplication by ensuring indistinguishability between the random distribution and the distribution of each party's view.

6. Experiments

6.1. Experimental Setup

We use CrypTen [5] as the base MPC deep learning framework, which already provides naïve implementations of secret-sharing-based computations. In most of our experiments, we use 3-party MPC on CPUs. Additionally, we allow a maximum of 4-ary multiplication as stated in Section 4.1, and we set d = 3 for exponentiation and k = 8 for the logarithm as described in Section 4.3.

To measure performance, we run experiments with deep learning models of different sizes: (a) Linear Support Vector Classification (LinearSVC) with L2 penalty; (b) LeNet [19], with shallow convolutional and linear layers along with ReLU activations; (c) ResNet-18 [20], with multiple convolutional, linear, pooling, and activation layers; (d) a Transformer Encoder model [21] with a single multi-head attention layer and BatchNorm [22] in place of LayerNorm [23]. We employ several classification datasets, adapted appropriately to the specific models, including MNIST [19], CIFAR-10 [24], ImageNet [25], and Sentiment140 [26].

Each experiment is conducted in a simulated multi-node environment using Docker; the TTP runs in an independent environment separate from the normal parties. To manually simulate different network environments with respect to bandwidth and latency, we use the docker-tc tool to adjust the Docker network settings accordingly.

Table 1
Communication Latency and Data with Different Network Settings. Latency and data are measured in milliseconds and MiB.

Model       | Dataset      | t_comp   | Data Size (Naïve / Ours) | t_comm N_low (Naïve / Ours) | t_comm N_med (Naïve / Ours) | t_comm N_high (Naïve / Ours)
LinearSVC   | MNIST        | 0.032    | 0.022 / 0.024            | 0.082 / 0.059               | 0.604 / 0.568               | 4.339 / 4.026
LeNet       | CIFAR-10     | 0.900    | 38.562 / 41.647          | 1.373 / 1.182               | 7.673 / 6.983               | 53.010 / 47.868
ResNet-18   | ImageNet     | 110.447  | 11571.294 / 12612.711    | 219.541 / 185.293           | 1681.685 / 1407.570         | ~11 760 / ~9 870
Transformer | Sentiment140 | 7.824    | 771.787 / 893.421        | 12.190 / 9.715              | 78.624 / 61.877             | 559.645 / 421.413

Table 2
Classification Accuracy using Different Methods.

Model       | Dataset      | Origin (%) | Naïve (%) | Ours (%)
LinearSVC   | MNIST        | 100.00     | 100.00    | 100.00
LeNet       | CIFAR-10     | 100.00     | 100.00    | 100.00
ResNet-18   | ImageNet     | 69.30      | 61.58     | 60.10
Transformer | Sentiment140 | 59.87      | 58.55     | 57.74

6.2. Metrics

To provide a comprehensive evaluation of our proposed method, we adopt metrics from a variety of perspectives.

• t_comp: The computational time cost for evaluating a single data sample in one round.
• t_comm: The time cost of communication associated with evaluating a single data sample in one round.

• Size of Transmission Data: The size of the transmitted network packets when evaluating a single data sample in one round.

• Accuracy: The classification accuracy when evaluated on a particular dataset.

Figure 1: Communication percentage of different models. (a) ResNet Basic Block: Conv2D 49.8%, BatchNorm 25.0%, ReLU 19.1%, Other 6.1%. (b) Attention Block: Softmax 60.5%, GeLU 23.1%, MatMul 14.4%, Other 2.0%.

6.3. Latency & Throughput

To assess the efficiency of our proposed method, we simulate networks with different latencies: (a) network N_low with 0.1 ms latency, (b) network N_med with 5 ms latency, and (c) network N_high with 40 ms latency. All of these networks have a bandwidth of 1 Gbps. Our simulated multi-node setting includes 3 nodes, with an additional TTP by default.

As shown in Table 1, the computation cost of each model is negligible compared to the communication cost in the medium- and high-latency network settings. We therefore focus on the communication costs associated with our proposed method. Compared to the naïve method implemented in CrypTen, the data size of our method (Section 4.1) remains close to the baseline, since it does not introduce a substantial amount of additional communication payload when the maximum number of input variables is set appropriately; for instance, a 3-ary or 4-ary multiplication does not significantly increase the total size of communication.

It is noteworthy that our proposed method reduces the communication cost in every network setting compared to the naïve MPC implementation. Overall, we achieve an improvement of 10∼20%, a significant performance enhancement for practical use in high-latency environments.

Furthermore, we observe that our proposed method behaves differently for neural models with different architectures. Figure 1 illustrates the communication breakdown of ResNet basic blocks and Attention blocks. As can be seen, the attention mechanism is constrained by the communication bottleneck of its Softmax operation, while the CNN is constrained by communication in its convolutional operations. Since our method particularly optimizes nonlinear functions, attention-based models show a significant latency improvement of almost 25%. This also explains the more limited improvement of 8∼15% for traditional machine-learning models and CNN-based models.

6.4. Evaluation

In this section, we examine the side effects and factors associated with the basic settings, such as downstream task accuracy, the number of parties involved, and the trade-off between network latency and bandwidth.

To evaluate the drop in accuracy, we compare our method with both the original baseline and the naïve implementation without a low-latency design. Table 2 shows that, in relatively small scenarios, both the naïve implementation and our method achieve the same perfect performance as the baselines. Nevertheless, in complex scenarios both MPC-based implementations obtain lower accuracy than the baseline, and our method performs slightly worse than the naïve implementation. We hypothesize that multivariate multiplication introduces additional precision requirements, which in turn reduce accuracy.

The throughput and latency of MPC-based methods are also affected by the number of parties involved in the computation. From Figure 2, it can be seen that the communication data size of both methods increases linearly as the number of parties increases. There is, however, a tendency for the latency to worsen when more parties are involved.

Figure 2: Transmission data and latency of the naïve and proposed methods when a different number of parties are involved: (a) size of transmission data w.r.t. the number of parties; (b) latency w.r.t. the number of parties. The experiment is conducted using the LeNet model on CIFAR-10 in a medium-latency network.

Moreover, Figure 3 illustrates how network bandwidth affects communication costs. When sufficient bandwidth is available, our method can still optimize network latency. It is important to note, however, that when bandwidth becomes the bottleneck, our method is no longer effective in reducing the overall cost of communication. This indicates that bandwidth remains an important factor in a multi-node MPC setting, especially as the number of nodes grows.

Figure 3: Latency of the naïve and proposed methods with respect to network bandwidth. The experiment is conducted using the LeNet model on CIFAR-10 in a medium-latency network.

7. Discussion

Since the proposed multivariate multiplication is based on a finite ring, it is likely to have precision issues that lead to incorrect results. Fortunately, a loss in precision does not significantly affect the overall performance of deep learning, since the loss can be interpreted as random noise and distortion in the input data.

Moreover, our proposed method is only applicable to functions based on linear MPC operations. To avoid heavy communication, a modern MPC-based deep learning framework may also involve other protocols, such as Homomorphic Encryption [27], Garbled Circuits [28], and Function Secret Sharing [29]. Though these approaches may require less communication, our approach can still be seamlessly integrated with the current secret-sharing framework and achieve a latency improvement of ~20% without adding excessive computational workload.

8. Conclusion

This study proposes a secret-sharing-based MPC method that enhances the linear computation required in deep learning through increased communication utilization. By using the multivariate multiplication and communication coalescing mechanisms, we reduce the number of unnecessary communication rounds during the execution of both linear and nonlinear deep learning functions. Our experiments demonstrate that the proposed methods achieve an overall latency improvement of 10∼20% compared to the naïve MPC implementation, while throughput and downstream task performance remain comparable to naïve implementations, demonstrating the method's validity and efficiency. We hope this work will inspire future improvements in privacy-preserving deep learning techniques and lead to more practical MPC applications.

Acknowledgments

This work is supported by the National Key R&D Program of China under grant No. 2022YFB2703001.

References

[1] I. Damgård, V. Pastro, N. Smart, S. Zakarias, Multiparty computation from somewhat homomorphic encryption, in: R. Safavi-Naini, R. Canetti (Eds.), CRYPTO 2012, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 643–662. doi:10.1007/978-3-642-32009-5_38.
[2] M. Keller, MP-SPDZ: A versatile framework for multi-party computation, in: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, CCS '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1575–1590. doi:10.1145/3372297.3417872.
[3] X. Liu, L. Xie, Y. Wang, J. Zou, J. Xiong, Z. Ying, A. V. Vasilakos, Privacy and security issues in deep learning: A survey, IEEE Access 9 (2021) 4566–4593. doi:10.1109/ACCESS.2020.3045078.
[4] D. Lu, A. Yu, A. Kate, H. Maji, Polymath: Low-latency MPC via secure polynomial evaluations and its applications, Proceedings on Privacy Enhancing Technologies 2022 (2021). doi:10.2478/popets-2022-0020.
[5] B. Knott, S. Venkataraman, A. Hannun, S. Sengupta, M. Ibrahim, L. van der Maaten, CrypTen: Secure multi-party computation meets machine learning, NeurIPS 34 (2021) 4961–4973. doi:10.48550/arXiv.2109.00984.
[6] S. Tan, B. Knott, Y. Tian, D. J. Wu, CryptGPU: Fast privacy-preserving machine learning on the GPU, in: IEEE S&P, 2021. doi:10.1109/SP40001.2021.00098.
[7] D. Li, H. Wang, R. Shao, H. Guo, E. Xing, H. Zhang, MPCFormer: Fast, performant and private transformer inference with MPC, in: The 11th International Conference on Learning Representations, 2023. doi:10.48550/arXiv.2211.01452.
[8] S. Wagh, S. Tople, F. Benhamouda, E. Kushilevitz, P. Mittal, T. Rabin, Falcon: Honest-majority maliciously secure framework for private deep learning, Proceedings on Privacy Enhancing Technologies 2021 (2020) 188–208. doi:10.2478/popets-2021-0011.
[9] T. Krips, R. Küsters, P. Reisert, M. Rivinius, Arithmetic tuples for MPC, Cryptology ePrint Archive (2022). URL: https://eprint.iacr.org/2022/667.
[10] A. Shamir, How to share a secret, Commun. ACM 22 (1979) 612–613. doi:10.1145/359168.359176.
[11] D. Beaver, Efficient multiparty protocols using circuit randomization, in: J. Feigenbaum (Ed.), CRYPTO 1991, Springer Berlin Heidelberg, Berlin, Heidelberg, 1992, pp. 420–432. doi:10.1007/3-540-46766-1_34.
[12] Y. Ishai, E. Kushilevitz, Randomizing polynomials: A new representation with applications to round-efficient secure computation, in: Proceedings 41st Annual Symposium on Foundations of Computer Science, 2000, pp. 294–304. doi:10.1109/SFCS.2000.892118.
[13] P. Mohassel, M. Franklin, Efficient polynomial operations in the shared-coefficients setting, in: M. Yung, Y. Dodis, A. Kiayias, T. Malkin (Eds.), PKC 2006, Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 44–57. doi:10.1007/11745853_4.
[14] D. Dachman-Soled, T. Malkin, M. Raykova, M. Yung, Secure efficient multiparty computing of multivariate polynomials and applications, in: J. Lopez, G. Tsudik (Eds.), Applied Cryptography and Network Security, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 130–146. doi:10.1007/978-3-642-21554-4_8.
[15] S. Wagh, D. Gupta, N. Chandran, SecureNN: Efficient and private neural network training, Cryptology ePrint Archive (2018). URL: https://eprint.iacr.org/2018/442.
[16] P. Mohassel, P. Rindal, ABY3: A mixed protocol framework for machine learning, in: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 35–52. doi:10.1145/3243734.3243760.
[17] I. Damgård, M. Fitzi, E. Kiltz, J. B. Nielsen, T. Toft, Unconditionally secure constant-rounds multi-party computation for equality, comparison, bits and exponentiation, in: S. Halevi, T. Rabin (Eds.), Theory of Cryptography, Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 285–304. doi:10.1007/11681878_15.
[18] G. Couteau, A note on the communication complexity of multiparty computation in the correlated randomness model, in: EUROCRYPT 2019, Springer-Verlag, Berlin, Heidelberg, 2019, pp. 473–503. doi:10.1007/978-3-030-17656-3_17.
[19] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998) 2278–2324. doi:10.1109/5.726791.
[20] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), NIPS, volume 30, Curran Associates, Inc., 2017. doi:10.5555/3295222.3295349.
[22] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Proc. 32nd Int. Conf. Machine Learning - Volume 37, ICML'15, JMLR.org, 2015, pp. 448–456. doi:10.5555/3045118.3045167.
[23] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, 2016. arXiv:1607.06450.
[24] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images (2009).
[25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: CVPR, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
[26] A. Go, R. Bhayani, L. Huang, Twitter sentiment classification using distant supervision, CS224N project report, Stanford 1 (2009).
[27] C. Gentry, Fully homomorphic encryption using ideal lattices, in: Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, STOC '09, Association for Computing Machinery, New York, NY, USA, 2009, pp. 169–178. doi:10.1145/1536414.1536440.
[28] A. C. Yao, Protocols for secure computations, in: 23rd Annual Symposium on Foundations of Computer Science (SFCS 1982), 1982, pp. 160–164. doi:10.1109/SFCS.1982.38.
[29] E. Boyle, N. Gilboa, Y. Ishai, Function secret sharing, in: E. Oswald, M. Fischlin (Eds.), EUROCRYPT 2015, Springer Berlin Heidelberg, Berlin, Heidelberg, 2015, pp. 337–367. doi:10.1007/978-3-662-46803-6_12.