Curl: Private LLMs through Wavelet-Encoded
                         Look-Up Tables
                         Manuel B. Santos1,* , Dimitris Mouris1 , Mehmet Ugurbil1 , Stanislaw Jarecki1 , José Reis1 ,
                         Shubho Sengupta2 and Miguel de Vega1
                         1
                             Nillion, Zählerweg 5, 6300 Zug, Switzerland
                         2
                             Meta Platforms, Menlo Park, California 94025, United States


                                        Abstract
                                        Recent advancements in transformers have revolutionized machine learning, forming the core of Large language
                                        models (LLMs). However, integrating these systems into everyday applications raises privacy concerns as client
                                        queries are exposed to model owners. Secure multiparty computation (MPC) allows parties to evaluate machine
                                        learning applications while keeping sensitive user inputs and proprietary models private. Due to inherent MPC
                                        costs, recent works introduce model-specific optimizations that hinder widespread adoption by machine learning
                                        researchers. CrypTen (NeurIPS’21) aimed to solve this problem by exposing MPC primitives via common machine
                                        learning abstractions such as tensors and modular neural networks. Unfortunately, CrypTen and many other
                                        MPC frameworks rely on polynomial approximations of the non-linear functions, resulting in high errors and
                                        communication complexity.
                                            This paper introduces Curl, an easy-to-use MPC framework that evaluates non-linear functions as lookup
                                        tables, resulting in better approximations and significant round and communication reduction. Curl exposes a
                                        similar programming model as CrypTen and is highly parallelizable through tensors. At its core, Curl relies on
                                        discrete wavelet transformations to reduce the lookup table size without sacrificing accuracy, which results in up
                                        to 19× round and communication reduction compared to CrypTen for non-linear functions such as logarithms
                                        and reciprocals. We evaluate Curl on a diverse set of LLMs, including BERT, GPT-2, and GPT Neo, and compare
                                        against state-of-the-art related works such as Iron (NeurIPS’22) and Bolt (S&P’24) achieving at least 1.9× less
                                        communication and latency.
                                            Finally, we resolve a long-standing debate regarding the security of widely used probabilistic truncation
                                        protocols by proving their security in the stand-alone model. This is of independent interest as many related
                                        works rely on this truncation style.

                                        Keywords
                                        Large Language Models (LLMs), Privacy-Enhancing Technologies (PETs), Secure Multiparty Computation (MPC)


                         1. Introduction
                         Large language models (LLMs) like GPT-2, GPT-4 [1], BERT [2], and LLaMA [3] have emerged as prime
                         examples showcasing the capabilities of artificial intelligence. LLMs assist individuals and businesses
                         in everyday tasks; from machine translation [4], to text generation [5], and question answering [6],
                         among others. To generate human-like responses, LLMs have been trained in vast amounts of data and
                         continue learning through their interactions with users.
                            However, as LLMs become increasingly integrated into human lives, privacy becomes a critical
                         concern as individuals frequently share sensitive information, including names, addresses, and credit
                         card numbers, or even financial and healthcare information [7]. On top of that, there is a shift towards
                         personalized AI, with OpenAI enabling a memory feature to ChatGPT [8]. For an LLM-powered assistant
                         to be able to understand individual preferences, habits, and workflows and provide tailored assistance,
                         it would require access to an immense amount of personal data. As this personalized data is retained

                          CAMLIS’24: Conference on Applied Machine Learning for Information Security, October 24–25, 2024, Arlington, VA
                         *
                           Corresponding author.
                          $ manuel.santos@nillion.com (M. B. Santos); dimitris@nillion.com (D. Mouris); memo@nillion.com (M. Ugurbil);
                          stasio@ics.uci.edu (S. Jarecki); jose.reis@nillion.com (J. Reis); ssengupta@meta.com (S. Sengupta); miguel@nillion.com
                          (M. de Vega)
                           0000-0003-4156-8607 (M. B. Santos); 0000-0002-2601-203X (D. Mouris); 0009-0003-5042-6851 (M. Ugurbil);
                          0000-0002-5055-2407 (S. Jarecki); 0000-0001-8022-7499 (J. Reis)
                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
and used to continuously improve the LLMs, the possibility of a data breach or unauthorized access
poses huge risks to individuals’ privacy.
   Privacy-enhancing technologies (PETs) such as multiparty computation (MPC) [9, 10] enable a
plethora of privacy-preserving machine learning (PPML) use-cases. In the most prominent setting,
a server possesses a proprietary model and aims to provide it as a service to clients for use with
their private data [11, 12, 13, 14, 15]. The objective is for clients to receive only the inference results
without gaining any knowledge about the model, while the model provider does not learn anything
about clients’ input. Another popular PPML use case involves multiple distrusting parties to securely
train a model over their joint sensitive data without revealing anything about their data to each other
[16, 17, 18, 19, 20].
   Several MPC systems leverage linear secret sharing schemes as they allow linear operations to
be performed with minimal communication overhead [21, 22]. However, non-linear operations (e.g.,
square roots, logarithms) often present greater challenges and require specialized techniques such as
polynomial approximations [12], look-up table evaluations [23], or other protocol-specific approaches
[24, 25]. Furthermore, incorporating domain knowledge can significantly enhance the efficiency and
effectiveness of these protocols by tailoring the parameters to better fit specific use cases [26, 27, 28].

1.1. Related Work
1.1.1. Focusing on Function Secret Sharing
Recently, several secure computation works have emerged leveraging function secret sharing (FSS)
[29, 23, 30, 31, 32]. Pika [23] extends the prior work of [29] by showcasing a novel approach to securely
evaluate look-up tables (LUTs) through FSS. While it demonstrates efficacy in benchmarking against
popular datasets like MNIST and CIFAR-10, its scalability poses challenges. As noted by Grotto [30], for
large LUTs, the computational cost may render the protocol infeasible due to the extensive number of
distributed point functions (DPF) evaluations required. In contrast, Curl circumvents this challenge by
eliminating the need for DPF evaluations, thanks to the integration of the discrete wavelet transform
(DWT) technique. On top of that, our technique can also benefit FSS-based frameworks by reducing the
size of the LUTs, resulting in less computation and communication. On the other hand, Grotto [30]
introduces innovative protocols leveraging custom splines and DPFs to handle a subset of functions
efficiently. Yet, it faces challenges in terms of computational and communication overhead compared to
alternatives like Sigma [15]. Orca [20] showcases the potential of GPU acceleration in FSS protocols,
particularly tailored for convolutional neural networks (CNNs). However, its suitability for other
architectures like transformers is questioned due to its reliance on heavy non-linearities [20].
   Sigma [15] builds on top of Orca and distinguishes itself by relying on minimal-sized LUTs through
protocols tailored for small ranges. Despite its efficiency from custom protocols with small-sized LUTs,
it demands significant computing and communication resources (e.g., 1 TB RAM and around 9Gbps
communication link). Furthermore, its use of deterministic truncation, although claiming improved
security, falls short in speed compared to probabilistic truncation methods. Unfortunately, all FSS-based
works focus on the two-party setting as FSS becomes impractical with more than two parties.

1.1.2. Focusing on Preprocessing
The line of work initiated by Pika [23] focuses on the two-party setting with an additional dealer party.
In this scenario, the preprocessing phase necessitates the dealer to prepare and distribute an 𝒪(𝑛)-sized
element. Given the assumption that the dealer will not collude with any party, reducing dependency on
such assumptions is desirable. This has led to the development of protocols that eliminate the need for
a dealer. Notably, some previous works have aimed to achieve this in the two-party setting, enabling
both parties to independently generate the required preprocessing material [33, 34, 35].
   The one-time truth table (OTTT) protocol [33] employs a boolean circuit to represent the table
for every possible input, resulting in an exponential complexity relative to the bit size. The OP-LUT
protocol [34] attempts to enhance the OTTT preprocessing phase but only achieves improvements
for small bit sizes, with the communication cost remaining exponential. The SP-LUT protocol [34]
significantly enhances the preprocessing phase but modifies the online phase, leading to an exponential
communication cost in the bit size during the online phase. FLUTE [35] offers advancements over
previous works but still requires 𝒪(2𝑛 ) communication in the setup phase. Therefore, finding a
preprocessing phase with sub-exponential communication complexity while maintaining the efficiency
of the online phase remains an open challenge.

1.1.3. Focusing on Fully-Homomorphic Encryption
Fully homomorphic encryption (FHE) schemes have benefited significantly from the use of LUTs to
enhance performance and enable new applications. This concept was initially proposed by Ducas and
Micciancio [36], who devised a method to evaluate arbitrary binary gates in FHE using LUTs. Building
on this foundation, Chillotti et al. [37] implemented the evaluation of arbitrary functions through a tree
of leveled multiplexers, leading to the development of the Torus FHE (TFHE) scheme. Despite their
development of a fast bootstrapping mechanism, capable of execution in approximately 10 milliseconds
on a CPU, its adoption has been limited. This is primarily because the control inputs for the multiplexers
required fresh ciphertexts – meaning prior computation on them was not possible – and the approach
necessitated expressing programs as deterministic automata. A further improvement over TFHE was
presented in [38] with the introduction of the programmable bootstrapping (PBS) technique, allowing
for efficient and general-purpose LUT evaluation. To enhance adaptability, HELM [39] expanded on the
PBS technique and introduced a framework for automatically converting Verilog hardware description
language (HDL) into encrypted circuits. HELM employs three modes of operation. The first mode
exclusively processes binary gates. The second mode operates on integers using secure LUT evaluations.
The third, mixed mode, works with binary circuits and "bridges" over to integers for secure LUT
evaluations before returning to the binary domain. However, HELM is limited to very low-precision
LUTs. This limitation arises because converting an integer to multiple bits requires multiple n-to-1 LUTs.
Recently, Ripple [40] proposed using compression techniques based on discrete wavelet transform
(DWT) theory to decrease the size of LUT, thereby accelerating the PBS technique for smooth functions.

1.1.4. Focusing on Secure LLM Inference
Secure frameworks for LLM inference have only recently gained traction due to the inherent complexity
of transformers compared to traditional neural networks. Techniques to ensure secure inference include
notable implementations like THE-X [41] and Iron [26], which were among the first to use homomorphic
encryption for matrix multiplication and non-linear functions.
   Subsequently, several studies have explored MPC techniques for private transformer inference
system [27, 42]. MPCFormer [27] leverages the CrypTen [12] MPC engine, while Puma [42] utilizes the
SecretFlow-SPU [43] MPC engine to assess MPC’s suitability for LLM inference. MPCFormer introduces
a distillation process where stronger models train weaker models to improve accuracy, compensating
for the limitations of using small (2-degree) approximations of non-linear functions. Nevertheless,
MPCFormer faces challenges due to the loss of approximation accuracy, necessitating fine-tuning the
models. On the other hand, Puma uses a piece-wise approximation of the GeLU activation function,
requiring comparisons and polynomials up to degree 6.
   Recent research has increasingly focused on hybrid solutions, combining the best techniques for
specific operations [28, 20]. Bolt [28] developed a solution integrating both fully homomorphic encryp-
tion and MPC. Like Puma [42], Bolt uses a piece-wise approximation that requires comparison, which
increases round complexity and communication. To boost efficiency, Bolt employs word elimination
while maintaining accuracy. Additionally, it enhances MPCFormer’s polynomial approximations by
increasing the polynomial degree and mitigates the corresponding efficiency loss through Motzkin’s
polynomial pre-processing procedure. In contrast, Curl avoids both comparisons and polynomial
approximations by using an optimized lookup table protocol that reduces the round and communication
complexity typically added by polynomial approximations while maintaining accuracy.
1.1.5. Focusing on Secure Truncation
Machine learning tasks employ floating-point numbers to construct precise models [44]. However,
it is widely recognized that the use of floating-point numbers renders MPC impractical from an effi-
ciency standpoint. For this reason, most works on machine learning under MPC employ fixed-point
representation instead, requiring a truncation protocol to maintain constant precision.
    Among the available truncation methodologies – deterministic, nearest integer, and probabilistic –
the deterministic and probabilistic approaches (see Section 3) are the most widely used in the secure
computation setting [24, 45]. Moreover, a substantial body of research favors probabilistic truncation due
to its lower communication costs and resultant enhanced performance, e.g., [46, 16, 47, 11, 48, 13, 18, 49].
Despite these advantages, some recent works express concerns about probabilistic truncation, and opt
for more resource-intensive protocols to achieve deterministic truncation [45, 50, 15]. The concerns
raised with regard to probabilistic truncation include security and rounding errors.
Correctness error. Before we address the above two concerns we recall that a widely used and inexpen-
sive probabilistic truncation protocol over fields, due to Catrina and Hoogh [24], has a correctness errors
with probability 2−(𝑛−|𝑥|) where |𝑥| is the maximum bitsize of truncation input 𝑥. This necessitated
using significantly larger rings, setting 𝑛 ≥ |𝑥| + 𝜅 where 𝜅 is a statistical security parameter. Damgård
et al. [51] presented a probabilistic protocol with zero probability of correctness error, but it required
a non-constant-round bit-decomposition protocol. Subsequent works [11, 52] optimized [24, 51] and
showed a constant-round probabilistic truncation protocol with communication and computation costs
matching Catrina and Hoogh [24], but with no correctness error and allowing 𝑛 ≥ |𝑥| + 1. CrypTen [12]
proposed a protocol inspired by [53] for division by public value, which can be reduced to a probabilistic
truncation with non-zero correctness error.
Security. Recently, Li et al. [45] identified a security flaw in the security proof of the probabilistic
truncation of [24], revealing a discrepancy between the information leaked by that protocol and by
the ideal probabilistic truncation functionality. Note that the same issue pertains to the 𝑛-optimal
probabilistic truncation of [52]. This led several recent papers, e.g., [15, 15], to move away from the
inexpensive probabilistic truncation protocols because of concerns regarding their security. In this
paper, we show that the 𝑛-optimal probabilistic truncation protocol of [52] (as well as the original
probabilistic truncation of [24]) securely realizes a natural ideal probabilistic truncation functionality
which we define. This enables us to capitalize on the efficiency benefits of probabilistic truncation
without compromising security (Section 3).
Rounding Errors. The LLAMA paper [50] recently suggested the use of deterministic truncation,
claiming it improves inference accuracy. However, this assertion has not been backed by experimental
evidence, and earlier studies indicating a possible decrease in accuracy with probabilistic truncation
have not been confirmed through empirical testing [49, 11]. Intriguingly, Gupta et al. [44] demonstrated
that computation using 16-bit fixed-point with 12-bit precision and probabilistic truncation achieves
nearly equivalent accuracy to that of a 32-bit floating-point. Conversely, truncation to the nearest
integer fails to train under such conditions adequately. Additionally, although not the primary focus of
the work, Keller and Sun [54] provided experimental evidence indicating that in neural networks, there
is minimal disparity between probabilistic truncation and truncation to the nearest integer. Indeed,
out of 28 distinct experiments, 13 exhibited superior accuracy with probabilistic truncation, 13 with
truncation to the nearest integer, and 2 yielded identical accuracy.

1.2. Our Contributions
In this paper, we introduce Curl, a user-friendly MPC framework designed for efficiently evaluating
non-linear functions using lookup tables. Curl addresses three critical aspects essential to secure
protocols: efficiency, accuracy, and security.
   Curl employs a novel LUT compression technique using discrete wavelet transformation (DWT)
achieving efficient approximations with minimal errors. This effectively reduces the size of original
LUTs while preserving the accuracy of non-linear functions, thus surpassing traditional polynomial
piece-wise approximation methods. By minimizing communication costs, Curl significantly enhances
end-to-end runtime performance and supports a wide range of activation functions including GeLU,
SiLU, Gaussian error function, Sigmoid, and Hyperbolic tangent. Additionally, Curl generalizes the
piece-wise approximation used within the GeLU protocol from [28, 15] to encompass any bounded odd
or even function that converges to a piecewise polynomial.
   Curl’s core technique extends to complex non-linear functions used by LLMs, ensuring both client
input data privacy and safeguarding the intellectual property of service providers, such as their models.
We evaluate Curl across a spectrum of LLMs including BERT (tiny, base, large), GPT-2, and GPT-Neo,
including both CPU and GPU backends. Our evaluation demonstrates advancements over state-of-the-
art frameworks, achieving at least 6.5 times fewer rounds and 1.9 times less communication overhead.
   Finally, Curl addresses the security concerns surrounding efficient probabilistic truncation by in-
troducing a natural ideal functionality, a corresponding protocol, and providing a simulation security
proof. Many works that rely on this truncation style were proven to have a security flaw by [45] as
they cannot be simulated in the real-ideal paradigm. Our result is of independent interest since our
proof settles this debate and restores confidence in secure probabilistic truncation protocols.
   We summarize our contributions as follows:

    • A novel framework based on DWT compression that achieves high accuracy with low round and
      communication complexities. When applied to LLM inference, Curl demonstrates at least a 6.5×
      reduction in round complexity and a 1.9× reduction in communication when compared against
      existing methods.

    • Curl builds on top of the user-friendly CrypTen [12] framework, offering high flexibility and
      low adoption friction for developers and researchers, thereby democratizing access to secure
      computation techniques. As a proof-of-concept, we implement a variety of LLM models and
      numerous non-linear functions, all operable on both CPU and GPU.

    • We introduce a novel natural ideal functionality for efficient probabilistic truncation and prove
      Escudero et al. [52] to be secure. This result is of independent interest.


2. Preliminaries
2.1. Arithmetic & Binary Secret Sharing
We consider two main algebraic structures: the ring Z𝑄 where 𝑄 = 2𝑛 for some bit width 𝑛, and the
binary field Z2 . We consider additive secret sharing between a set of parties 𝒫 in both Z𝑄 and Z2 .
Arithmetic secret sharing. J𝑥K denotes additive shares in the ring Z𝑄 for parties 𝒫0 , . . . , 𝒫N−1 . This
means each party 𝒫𝑖 owns a share J𝑥K𝑖 , where the sum of all shares is equal to 𝑥. Private addition
of secret shared values is executed by local share addition and private multiplication is implemented
through Beaver triples [55]. Fixed-point representation of a floating-point number   𝑥𝑅⌉︀ is considered
using the standard integer encoding with nearest-integer approximation: 𝑥 = 𝑥𝑅 · 2𝑓 , for precision
                                                                               ⌊︀

𝑓 [24]. To decode, we simply divide by 2𝑓 .
Binary secret sharing. Some secure operations profit from a secret sharing representation in the
binary domain. For this reason, we consider a binary secret sharing scheme for 𝑥 ∈ Z2𝑛 , denoted as
⟨⟨𝑥⟩⟩. A binary secret share ⟨⟨𝑥⟩⟩ is formed by secret sharing all the bits of 𝑥 in Z2 and each party 𝒫𝑖
owns a share ⟨⟨𝑥⟩⟩𝑖 . To reconstruct the secret, the parties 𝒫0 , . . . , 𝒫N−1 add their shares in Z2𝑛 (i.e.,
equivalent to bitwise XOR).
Conversions. Arithmetic shares suit best arithmetic operations (+, ×) and binary shares perform best
in evaluating logical expressions (XOR, AND, right-shift and left-shift operations). For this reason,
following the line of research of mixed-mode secure computation [52, 12, 56], we make use of known
conversions between the arithmetic and the binary domain. We follow the approach provided by
Damgård et al. ([25], S3) to convert from arithmetic shares to binary shares (A2B) and the procedure
using algorithm 2 in [12] to convert binary shares into arithmetic shares (B2A). We denote these
protocols by ΠA2B             B2A
                 𝑛,𝑙 and Π𝑛,𝑙 , respectively, where 𝑛 represents the input bitwidth and 𝑙 the output
bitwidth.
   We note that the preprocessing procedure used for B2A conversion can be optimized using the
techniques proposed by Escudero et. al [52]. In particular, CrypTen’s approach requires 𝑛 tuples of
shares (J𝑟K, ⟨⟨𝑟⟩⟩) for a bit 𝑟, whereas [52] employs just 1 tuple of share (J𝑠K, ⟨⟨𝑠⟩⟩) for a ring element 𝑠,
with the former being comparatively more costly than the latter.

2.2. Large Language Models (LLMs)
In 2017, Vaswani et al. [57] introduced the Transformer architecture, a type of neural network that
utilizes attention mechanisms, enabling AI models to prioritize different segments of the input sequence
when generating an output. A transformer comprises an encoder-only architecture (like in BERT [2, 58]),
a decoder-only (e.g., GPT), or both encoder and decoder [57]. The encoder takes the input sequence and
maps it to a lower-dimensional representation to generate a sequence of hidden states. The decoder
generates an output sequence by iteratively generating tokens from the previously generated hidden
states. In practice, multiple encoder and decoder blocks are used together to achieve higher accuracy.
Attention. A core component of transformers is self-attention, a mechanism that weighs the importance
of different parts of an input sequence when making predictions. Self-attention takes a query matrix
𝑄 and a set of key-value matrix pairs 𝐾, 𝑉 , which are all produced by linear layers, and produces an
output as follows, where 𝑑 is the dimension of the keys:

                                                               𝑄 · 𝐾𝑇
                                                            (︂        )︂
                            Attention(𝑄, 𝐾, 𝑉 ) = softmax       √        · 𝑉.
                                                                   𝑑
Usually, the inputs to the softmax are masked so that each token is only influenced by previous tokens,
thereby creating a causal relationship.
Layers. In the context of secure LLM inference, all the components of transformers can be seen as
linear and non-linear layers. A linear layer consists of matrix multiplications and feed-forward neural
networks (FNN) which can be evaluated using fixed-point arithmetic. A crucial part of evaluating
the linear layers is to maintain the fixed-point precision using truncation.1 For instance, the matrix
multiplication 𝑄 · 𝐾 𝑇 in self-attention can be computed as multiple inner products, each of which
needs to be followed by a truncation. Next comes the FNN which relies on an inner product followed
by an activation function or a non-linear layer.
   A non-linear layer, like the rectified linear unit (ReLU) [59], Gaussian error linear unit (GeLU) [60],
sigmoid linear unit (SiLU), softmax, etc., is not straightforward to evaluate securely. ReLU can be
implemented with a secure comparison (i.e., ReLU(𝑥) = max(0, 𝑥)), while GeLU and other activation
functions are not as trivial. Several works like CrypTen [12] resort to polynomial approximations
which has a significant impact on accuracy. More recent works like Sigma [15] approximate GeLU
using ReLU and then using lookup table methods to evaluate the difference. Lastly, layer normalization
(LayerNorm) and root mean squared normalization (RMSNorm) require non-linear functions such as
reciprocals or reciprocal square roots.
Transformer Block. A transformer block comprises several linear and non-linear layers. This usually
includes attention, linear layers, non-linear activation functions, and layer normalization.
Embedding Layers. Like any language model, LLMs start off with a word embedding layer and a
positional embedding layer. These layers are LUT protocols based on the token value and position of
the token respectively. As LUT protocols, they can also be thought of as multiplications with a one-hot
vector. To implement these under MPC, we have several options.
  Firstly, even though the token value is a hidden input, its position is public. While we keep the
weights hidden under MPC, with the public input we can do a plaintext lookup. With the hidden token

1
    The precision on fixed-point values doubles after multiplication so truncation is used to bring it back down.
value, however, the lookup is not so straightforward. The client either has to secret share the token
as a one-hot vector, which can be quite large or we can use a random one-hot vector generated by
the provider. The alternative, which is to run equality with every possible value is also quite costly.
To relieve the client from sharing a one-hot vector for every token and to avoid running the costly
equality multiple times, we use the provider to generate the one-hot vector. This protocol then works
just like the LUT protocols, with the exception that the LUT is not public. Since the LUT, which is the
embedding weights matrix, is also private, we need to rely on Beaver triple multiplication.

2.3. The Discrete Wavelet Transform (DWT)
A discrete wavelet transformation (DWT) is a transformation that decomposes a discretely sampled
signal into approximation and detail coefficients [61]. The detail coefficients contain the information
about the error incurred by the approximation coefficients. Given the approximation and detail coef-
ficients, one can invert the application of a DWT to obtain the original signal. We convey this idea
in Fig. 1. Starting with the whole signal we can apply a DWT to obtain a first level of approximation
and detail coefficients. This first-level approximation and detail coefficients are enough to recover
the original signal. We can repeat this process on the approximation coefficient to obtain a second
level of approximation and detail coefficients. The second-level approximation and detail coefficients
together with the first-level detail coefficients allow us to recover the original signal. This process can
be repeated as necessary to reduce the size of the approximation coefficients.

                     Whole signal                                                          Details


 Level 0             Approximation


 Level 1


 Level 2
Figure 1: Overview of two DWT iterations. The signal is represented by the approximation (blue) and the details
(gray).

  There are many different wavelet families and each uses linear transformations (e.g., matrix multipli-
cation): Daubechy (Db) wavelets, Biorthogonal wavelets, and others. Let us explore in more detail two
types of wavelets: the Haar and the (5, 3) Biorthogonal transformations.

2.3.1. Haar DWT
A special case of the Db family is the Haar wavelet, also called Db1. In Haar, the approximation
coefficients are obtained by averaging every two consecutive samples of the original signal. One of the
properties of the Db wavelet of degree 𝑚 (Dbm) is that if a signal encodes data of polynomials of degree
strictly below 𝑚, then the corresponding detail coefficients for those polynomial parts will be zero.
   More specifically, the Haar DWT is a linear transformation represented by a matrix 𝑊 . When applied
to a one-dimensional signal 𝑣 of length 2𝑛 (where 𝑛 ∈ Z), this linear transformation is defined as:
                                            ⎡        ⎤ ⎡        ⎤
                                                 𝑣0          𝑎0
                                            ⎢ .. ⎥ ⎢ .. ⎥
                                            ⎢ . ⎥ ⎢ . ⎥
                                            ⎢        ⎥ ⎢        ⎥ [︂ ]︂
                                            ⎢ 𝑣𝑛−1 ⎥ ⎢𝑎𝑛−1 ⎥
                               𝑊 ·𝑣 =𝑊 ·⎢   ⎢        ⎥ =⎢⎢      ⎥= 𝑎 ,
                                                 𝑣
                                            ⎢ 𝑛 ⎥ ⎢ 0 ⎥
                                                     ⎥       𝑑  ⎥      𝑑
                                            ⎢ .. ⎥ ⎢ .. ⎥
                                            ⎣ . ⎦ ⎣ . ⎦
                                                𝑣2𝑛−1         𝑑𝑛−1
where 𝑎 and 𝑑 correspond to the approximation and detail coefficients and are computed as:
                                        𝑣2𝑘 + 𝑣2𝑘+1         𝑣2𝑘 − 𝑣2𝑘+1
                                  𝑎𝑘 =              , 𝑑𝑘 =              ,
                                              2                  2
for 𝑘 = 0, 1, . . . , 𝑛 − 1. We note that the Haar DWT operation 𝑊 can be applied multiple 𝑗 times. For
a vector 𝑣 with size 𝑛 divisible by 2𝑗 ,              ⎡ ⎤
                                                            ↶𝑗
                                                  𝑗     𝑎
                                               𝑊 · 𝑣 = ⎣↶𝑗 ⎦ ,
                                                        𝑑
                                     ↶𝑗                                           ↶𝑗
where the approximation vector 𝑎 has size 𝑛/2𝑗 and the detail vector 𝑑 amounts to the rest (i.e.,
𝑛 − 𝑛/2𝑗 ). Note that, since 𝑊 is an orthogonal matrix, its inverse is obtained by its transpose. One can
see that, for 𝑘 = 0, 1, . . . , 𝑛 − 1, the inverse is given by 𝑣2𝑘+1 = 𝑎𝑘 + 𝑑𝑘 and 𝑣2𝑘 = 𝑎𝑘 − 𝑑𝑘 .

2.3.2. Biorthogonal DWT
While the Dbm wavelets use an orthogonal matrix, the Biorthogonal wavelets require two different
matrices where the transpose of one matrix is the inverse of the other. One matrix, 𝐵, is used for the
decomposition of the signal and the other, 𝐵,
                                            ̃︀ is used for the reconstruction.
  In this work, we consider the (5, 3) Biorthogonal DWT defined by 𝐵 where the corresponding
approximation and detail coefficients are given by:
                               1
                       𝑎𝑘 =      (−𝑣2𝑘−2 + 2 · 𝑣2𝑘−1 + 6 · 𝑣2𝑘 + 2 · 𝑣2𝑘+1 − 𝑣2𝑘+2 )
                               8
                                                        1
                                                  𝑑𝑘 = (𝑣2𝑘 − 2 · 𝑣2𝑘+1 + 𝑣2𝑘+2 ) ,
                                                        4
for 𝑘 = 0, . . . , 𝑛 − 1. Note that the (5, 3) Biorthogonal DWT computes the approximation and detail
coefficients with a weighted average of 5 and 3 points, respectively. The inverse transformation 𝐵 −1 is
given by 𝐵̃︀ where 𝐵 −1 = 𝐵   ̃︀ ⊤ .

2.4. Threat & Computation Model
Curl allows two or more computing parties (𝒫0 , 𝒫1 , . . . ) to perform private inference and training of
machine-learning models. In the most prominent use case, one party (e.g., 𝒫0 ) owns a trained model,
while another party (e.g., 𝒫1 ) aims to utilize the model for inference on their private data. Both parties
wish to keep both the data and the model’s architecture private. For instance, imagine a scenario where
𝒫0 is a cybersecurity firm developing a malware detection machine learning model that they offer as a
service (MLaaS) to client organizations. The clients (𝒫1 ) would like to use the model in their internal
codebase while keeping sensitive implementation details private. In MLaaS, both the model owner
and the clients could benefit from outsourcing the model and their inputs respectively to an MPC
infrastructure and not participate in the computation.
   Curl is flexible and can facilitate multiple other use cases such as: (a) having different parties that
possess distinct sets of features jointly train a model without exchanging raw data, (b) unveiling
correlations where one party retains feature data and another party holds corresponding labels without
revealing anything else, among others. For the sake of presentation, we assume a two-party protocol
and we focus on the inference setting (i.e., 𝒫0 holds the private model and 𝒫1 holds the sensitive inputs),
as demonstrated in Fig. 2.
   We assume the parties are semi-honest (i.e., follow the protocol specification) in the pre-processing
model. This threat model enables a wide range of realistic use cases of secure machine learning
[62, 12, 63, 46, 64], while pre-processing is used to generate correlated randomness which the parties
consume in the online phase. There is a plethora of ways to generate this randomness; one can use
general-purpose MPC [65, 66, 67], specialized protocols [68, 34, 35], or more commonly a trusted dealer
[29, 69, 50, 20, 12, 49, 30]. In Curl, we resort to the latter. Lastly, as long as not all the parties collude, no
information is leaked about the private inputs (or private models depending on the use case).
                   𝒫0                                                                  𝒫1
                                                        Secret share inputs

                                         Secret share model

                                           Final inference

              Private model                                                       Private inputs

Figure 2: Overview of Curl for privacy-preserving inference.


3. Secure Truncation
In this work, we use truncation over the ring Z𝑄 and we define it as follows. For input J𝑥K, it outputs
J𝑥′ K where 𝑥′ = ⌊𝑥/2𝑓 ⌋ and 𝑓 is the precision of the fixed-point representation.
   The truncation operation is required whenever we work with fixed-point numbers. Multiplication of
two fixed-point rationals with precision 𝑓 obtains a fixed-point rational with precision 2𝑓 . Therefore,
we require truncation to go back to a fixed-point rational with precision 𝑓 . As we analyze in Section 3.3,
the number of truncations required in a GPT-2 model is at least 220 . Hence, it is paramount to use
an efficient truncation protocol. As mentioned in Section 1.1.5, works on secure computation over
fixed-point rationals use two types of truncation protocols: a deterministic truncation, where output
𝑥′ is set exactly to ⌊𝑥/2𝑓 ⌋; or a probabilistic truncation, where 𝑥′ = ⌊𝑥/2𝑓 ⌋ + 𝑢 where 𝑢 = 1 with
probability (𝑥 mod 2𝑓 )/2𝑓 and 𝑢 = 0 otherwise. Although implementing probabilistic truncations
incurs minimal costs, two concerns have been raised about it: security and the impact of rounding
errors in ML applications. In this section, we summarize the security issue raised by Li et al. [45] and
present a solution in the stand-alone secure computation framework [70, 71].

3.1. Revisiting Li et al. [45]
Li et al. [45] pointed out that some efficient probabilistic truncation protocols fail to meet the criteria
for standard simulation-based security [71], rendering them insecure in this framework. This includes
Catrina and Hoogh’s [24], CrypTen’s [12], and other proposals including [51, 11, 52].
   Li et al. use a variant of the protocol of [24] to exemplify that the above protocols are not simulatable
as realizations of a standard ideal probabilistic truncation functionality, because in these protocols
the rounding error 𝑢 is a deterministic function of input 𝑥 and an information which the protocol
reveals. Specifically, given the protocol execution, the rounding error 𝑢 is a deterministic function
of 𝑥(𝑓 ) = (𝑥 mod 2𝑓 ), i.e., the last 𝑓 bits of 𝑥. This violates an ideal functionality for probabilistic
truncation, where 𝑢 = 1 with probability 𝑥(𝑓 ) /2𝑓 (and 𝑢 = 0 with probability 1 − 𝑥(𝑓 ) /2𝑓 ), but whether
or not 𝑢 becomes 1 or 0 is not revealed.
   We show that the probabilistic rounding protocols cited above realize a modified probabilistic trun-
cation functionality, which we define in Fig. 3. The modification is natural: The functionality picks
and reveals a “cutoff” point 𝑠(𝑓 ) chosen at random in Z2𝑓 = [0, ..., 2𝑓 − 1], and sets the rounding error
as 𝑢 = 1 if 𝑥(𝑓 ) ≥ 𝑠(𝑓 ) , and as 𝑢 = 0 otherwise. In Appendix A we prove that the 𝑛-optimal proba-
bilistic truncation protocol of Escudero et al. [52], included in Fig. 9, realizes this modified truncation
functionality. We note that several other probabilistic truncation protocols, e.g., [24], also realize this
functionality, after adjustments related to correctness error.
   Technically, we show an efficient simulator Sim s.t. for every input 𝑥 and every set C of corrupted
parties,               (︀ ′
                          ˜ K, ˜𝑠(𝑓 ) , ViewC (J𝑥K) ≈𝜖 J𝑥K′ , 𝑠(𝑓 ) , Sim(J𝑥KC , J𝑥′ KC , 𝑠(𝑓 ) )
                                                   )︀ (︀                                         )︀
                         J𝑥
where J𝑥˜ ′ K and ˜𝑠(𝑓 ) are the output and the cutoff point set by a protocol execution, ViewC (J𝑥K) is the
adversarial view of that protocol execution, J𝑥′ K and 𝑠(𝑓 ) are the output and the cutoff point set by the
ideal functionality, and J𝑥KC and J𝑥′ KC are the shares of resp. J𝑥K and J𝑥′ K held by parties C.
3.2. Modified Probabilistic Truncation Functionality
If 𝑥 is an integer in Z2𝑛 , we define 𝑥(𝑓 ) = (𝑥 mod 2𝑓 ) as the 𝑓 least significant bits of 𝑥, and 𝑥(𝑓 ) =
⌊𝑥/2𝑓 ⌋ as the truncation of the last 𝑓 bits of 𝑥. In particular it holds that 𝑥 = 𝑥(𝑓 ) · 2𝑓 + 𝑥(𝑓 ) .
We present the modified probabilistic truncation functionality ℱ𝑛,𝑓     TruncPr (J𝑥K) in Fig. 3. Note that the

proposed functionality reveals the cutoff point 𝑠(𝑓 ) whereas the functionality considered by Li et al. ([45,
Functionality 2]) can be stated as following the same procedure except that it hides 𝑠(𝑓 ) . (In particular,
if 𝑢 is based on the random cutoff 𝑠(𝑓 ) as in Fig. 3 then 𝑢 = 1 with probability 𝑥(𝑓 ) /2𝑓 , as expected.)
   Note that this extra element conveys as much information as the correctness definition of the
probabilistic truncation itself, which has seen wide acceptance and use in the literature. By definition,
probabilistic truncation incurs a rounding error with probability 𝑝𝑥 = 𝑥(𝑓 ) /2𝑓 . Therefore, for a fixed
input value 𝑥, one can estimate probability 𝑝𝑥 given outputs 𝑥′ from multiple runs of this functionality,
and then approximate the 𝑓 least significant bits of 𝑥 as 𝑥(𝑓 ) ≈ 𝑝𝑥 · 2𝑓 . Each run of functionality
   TruncPr (J𝑥K) reveals in addition the cutoff points 𝑠
ℱ𝑛,𝑓                                                     (𝑓 ) , which can also be used to approximate 𝑥(𝑓 )
because each 𝑠(𝑓 ) partitions range [0, ..., 2 − 1], and bit 𝑢 = 𝑥′ − 𝑥(𝑓 ) reveals to which side of this
                                              𝑓

partition 𝑥(𝑓 ) belongs.

        TruncPr
       ℱ𝑛,𝑓     (J𝑥K)

      Input: Assume parties hold J𝑥K ∈ Z2𝑛 for 𝑥 < 2𝑛−1 .

       1. Reconstruct 𝑥 from sharing J𝑥K.
       2. Set 𝑥(𝑓 ) = ⌊𝑥/2𝑓 ⌋ and 𝑥(𝑓 ) = (𝑥 mod 2𝑓 ).
       3. Pick cutoff point 𝑠(𝑓 ) at random in Z2𝑓 .

       4. Set 𝑥′ = 𝑥(𝑓 ) + 𝑢 where
                                                                    if 𝑥(𝑓 ) ≤ 𝑠(𝑓 ) ,
                                                           {︂
                                                                0
                                                      𝑢=
                                                                1      otherwise.

       5. Generate a random sharing J𝑥′ K of 𝑥′ .
       6. Output J𝑥′ K and leak 𝑠(𝑓 ) to the adversary.

Figure 3: Ideal modified probabilistic truncation functionality

      We prove the following theorem in Appendix A:
Theorem 3.1. Protocol ΠTruncPr
                            𝑛,𝑓    (J𝑥K) of Escudero et al. [52], shown in Fig. 9 in Appendix A, securely
                        TruncPr (J𝑥K) shown in Fig. 3 with semi-honest security in the stand-alone security
realizes functionality ℱ𝑛,𝑓
framework [71].

3.3. Comparison to CryptTen’s Truncation
We chose to implement truncation using the protocol of Escudero et al. [52] instead of CrypTen’s
truncation [12] because of correctness and rounding errors in the latter protocol. First, as in the case of
the probabilistic truncation of [24] discussed in Section 1.1.5, CrypTen’s truncation incurs a correctness
error with probability 2−(𝑛−|𝑥|) where |𝑥| is the maximum bit size of the truncation input 𝑥.
   Consider the implications of the above for LLM inference. Assuming the maximum bit size of fixed-
point elements is 24 (8-bit integer part and 16-bit precision), after multiplication, the input bit size
doubles. Since 𝑛 is fixed at 64 bits, the correctness error occurs with a probability of 2−(64−48) = 2−16 .
For instance, the GPT-2 model requires at least 220 inner products.2 Assuming a truncation is executed
2
    For GPT-2, 𝑑model = 768, 𝑑𝑓 𝑓 = 3072, 𝑛heads = 12, and 𝑑k = 64. Considering an input of size 𝐿 = 64, the token embedding
    requires 𝑑model × 𝐿 inner products, each attention mechanism requires (𝑑k + 𝑑model ) × 𝐿, and each feed-forward layer requires
    at least (𝑑ff + 𝑑model ) × 𝐿 inner products. This amounts to at least 3, 637, 248 > 220 inner products.
after each inner product, we expect 16 random errors, undermining inference correctness.
   We verified this experimentally by running 220 CrypTen truncations of 48-bit elements 30 times,
resulting in an average error of 15.83, which aligns with the expected value of 24 . By contrast, the
probabilistic truncation protocol of Escudero et al. [52] is always correct.
   Furthermore, CrypTen’s truncation mechanism exhibits worse rounding errors compared to function-
ality in Fig. 3 realized by the protocol of [52], even for 𝑁 = 2 parties, increasing further if the number
of parties increases. CrypTen’s rounding ⌊︀ error⌋︀is linear in N because each party truncates their share of
the input value 𝑥 locally, i.e., [𝑥′ ]𝑖 = [𝑥]𝑖 /2𝑓 . Since each [𝑥′ ]𝑖 can be 1 away from the rational value,
summing N of them results in the output 𝑥′ which can be up to N away from 𝑥(𝑓 ) . In contrast, value 𝑥′
output by the truncation protocol of [52] is at most 1 away from 𝑥(𝑓 ) . For these reasons we opt for the
truncation protocol from Escudero et al. [52] instead of CrypTen’s truncation mechanism after each
multiplication.


4. Curl Overview
As we mentioned in Section 2.3, DWT is used to decompose signals into approximation and detail
coefficients. DWT is particularly effective for signals that represent smooth functions (e.g., logarithm,
square root) since the detail coefficients are relatively small compared to the approximation coefficients.
This unique observation allows us to set the detail coefficients to zero and invert the application of the
DWT to obtain an approximation of the original signal. The main benefit of this process is that for
smooth functions we acquire detailed approximations while requiring significantly fewer samples than
the original signals. We leverage this to develop a secure protocol with compressed LUTs achieving
high accuracy with low round and communication complexities.
  We begin by analyzing several DWT techniques, specifically Haar and Biorthogonal DWT, for
plaintext evaluation of LUTs and we translate the plaintext procedures into the secure setting. Next,
we apply our DWT LUT techniques and create Curl, an easy-to-use framework for privacy-preserving
machine learning that exposes a tensor-based programming model for both CPU and GPU backends.
Finally, we evaluate Curl on non-linear functions and LLMs securely, improving accuracy and reducing
communication compared to related works.

4.1. Improving Approximations with DWT-Encoded LUTs
Our key contribution lies in utilizing discrete wavelet transform (DWT) to securely evaluate a table.
We start by explaining the secure evaluation of a look-up table (LUT) and demonstrate the application
of both Haar and Biorthogonal DWT compression techniques for function evaluation using plaintext
values. Next, we extend those two techniques to the secure setting. Our protocols use some techniques
introduced in [72, 33, 23] but contrary to previous works, Curl can be easily extended to any number of
parties. For simplicity, we present our protocols in the two-party setting.

4.1.1. Secure LUT Evaluation
Let F : R → R be a function and consider its fixed-point encoded counterpart tF : Z2𝑛 → Z2𝑛 , with
                                                                       𝑛
precision 𝑓 . The function tF can be represented as a LUT in (Z2𝑛 )2 , where tF (𝑥) corresponds the
value at the 𝑥-th index. We briefly explain the method for securely computing tF (J𝑥K) = J𝑦K, which is
depicted in Fig. 4.
  Before initiating the protocol, both parties 𝒫0 and 𝒫1 can locally produce a LUT tF representation of
the function F. This evaluation needs to be performed only once and can be reused indefinitely. The
secure evaluation of the LUT begins with the dealer generating a random element 𝑟 ∈ Z2𝑛 . Then, it
generates a 1-hot vector 𝑣 with elements equal to 0 everywhere except in the 𝑟-th entry. Then, the
dealer secret-shares to both 𝒫0 and 𝒫1 the random value 𝑟 along with the 1-hot vector 𝑣. These steps
make up the preprocessing phase as they do not depend on the function input 𝑥. During the online
phase, the parties locally compute the offset value 𝛿 given by 𝛿 = 𝑥 − 𝑟. This offset value is revealed to
                                                   𝒫0               J𝑣K0                            J𝑦K0
                                                                            Rotate by 𝛿
                                             J𝑥K0 , J𝑟K0 , J𝑣K0
                                                                     J0K    & Multiply               J0K
                                                                     ...
                 𝒟                          Reveal 𝛿 = 𝑥 − 𝑟
                                                                                                    J119K
                                                                     J0K                             J0K
            𝑟←  −$
                    Z                                                J1K                  LUT tF     ...
 𝑣 = [0, . . . , 0, 1, 0, . . . , 0]                                 J0K                   47        J0K
    s.t. 𝑣 is 1-hot at 𝑟
                                                                                           119              𝑥th index
    Share 𝑣 to J𝑣K0 , J𝑣K1                                                                 268
    Share 𝑟 to J𝑟K0 , J𝑟K1                         𝒫1               J𝑣K1                   ...      J𝑦K1
                                                                                           723
                                             J𝑥K1 , J𝑟K1 , J𝑣K1
                                                                     J0K                             J0K
                                                                     ...
                                            Reveal 𝛿 = 𝑥 − 𝑟
                                                                                                    J119K
                                                                     J0K    Rotate by 𝛿              J0K
                                                                     J1K    & Multiply               ...
                                                                     J0K                             J0K


Figure 4: Secure evaluation tF (𝑥) = 𝑦 of a public LUT (tF ) on secret-shared 𝑥 (𝒫𝑖 holds J𝑥K𝑖 ) with the help of
dealer 𝒟.


both parties 𝒫0 and 𝒫1 as it is used to locally rotate the secret-shared vector 𝑣. Note that 𝛿 cannot be
revealed to the dealer, otherwise 𝒟 would be able to recover the underlying value of the input 𝑥. After
the parties rotate their shares of 𝑣 by 𝛿, the underlying value 1 is placed at 𝑟 + (𝑥 − 𝑟) = 𝑥-th index.
Finally, the parties compute the inner product between the shares of the rotated vector and the public
LUT tF . Indeed, this yields the share of tF at index 𝑥. All operations are executed in the ring Z2𝑛 .

4.2. Look-Up Table Approximation
As previously discussed, DWT provides a way to approximate signals. We observe that it is specifically
well-suited for smooth functions, such as exponential, sigmoid, logarithm, and square root, to name
a few. Additionally, it provides a compression mechanism that allows us to reduce bandwidth usage.
Next, we delve into two approaches for evaluating a DWT-approximated LUT in the plaintext setting.
This exploration will help inform the secure protocols for both Haar and (5, 3) Biorthogonal DWTs.
Approach 1. Recall from Section 2.3 that when DWT is applied to a vector 𝑣 𝑗 times, we end up
                            ↶𝑗          ↶𝑗              ↶𝑗
with two vectors 𝑎 and 𝑑 , where 𝑎 is 2𝑗 times smaller than the original vector. By setting the
detail coefficients to zero, we define the approximation of 𝑣 to be the approximation coefficients. This
effectively reduces the size of the original vector by 2𝑗 .
                                                                        𝑛
   Similar to how we apply the DWT to a general vector 𝑣 ∈ (Z2𝑛 )2 , we apply it to the LUT repre-
                                       ↶𝑗             𝑛−𝑗
sentation tF . We define tF ∈ (Z2𝑛 )2   to be the LUT containing only the approximation coefficients
of the original LUT after 𝑗 DWT operations, 𝑊 𝑗 · tF . Note that, in this process, the original LUT is
compressed by a factor of 2𝑗 , while the original input value 𝑥 remains within Z2𝑛 . This leads to a
                                                                                                   ↶𝑗
natural question: how do we find the original input 𝑥 in the 𝑗-times compressed table tF ? To locate 𝑥 in
↶𝑗                                                               ↶𝑗
tF , we truncate 𝑥 by 𝑗 bits and find the corresponding index in tF . This process yields the following
approximation:
                                                                     ↶𝑗
                                                             tF (𝑥) ≈ tF (𝑥 ≫ 𝑗),
                                                                                                   ↶𝑗
which is depicted in Fig. 5. Observe that the output bitwidth of the compressed LUT tF is the same as
the output bitwidth of the original table tF .
Approach 2. From DWT theory, we can retrieve the original signal from the approximation and
detail coefficients by evaluating the inverse transform over them. Moreover, it is known that the
inverse transform provides a method to amplify a signal, by considering the original signal as the
                     𝑥                tF
                                                                                                               ↶1
                   000            0000                                                        𝑥≫1              tF
                   001            0010
                                                                                                 00            0001
                   010            0100
                                                                                                 01            0101
                   011            0110
                                                                                                 10            1001
                   100            1000
                                                                                                 11            1101
                   101            1010
                   110            1100
                                                                       (b) Approach 1 for LUT approximation. Evaluation of
                   111            1110
                                                                           the Haar-compressed LUT over the truncated
                                                                                         ↶1
(a) Evaluation of the original LUT, i.e., tF (𝑥).                           input, i.e., tF .
                                                                                        ↶1
Figure 5: Execution of the original and Haar-compressed LUT (tF and tF ) on inputs 𝑥 and 𝑥 ≫ 1, respectively.


                                                                                                                      ↶𝑗
approximation coefficients [61]. We can use this method to amplify the approximated LUT tF and get a
better approximation than approach 1. We execute the following chain of operations:
                                               ⎡        ⎤
                                               ↶𝑗           [︃
                                                                 ↶𝑗
                                                                      ]︃
                                 𝑊𝑗    𝑗
                                                tF               tF          ↶𝑗𝑊 −𝑗
                                                                                                                           (1)
                            tF        𝑊 · tF = ↶𝑗 ⎦
                                              ⎣                            =: eF                ↶𝑗
                                                                                      𝑊 −𝑗 · eF =: ̃︀
                                                                                                   tF ,
                                                    𝑑            0


and we obtain tF (𝑥) ≈ ̃︀tF (𝑥) for any DWT type.
   Next, we explain both approaches for the Haar and (5, 3) Biorthogonal transformations. We discuss
how the structure of the DWT matrices affects the approximation, as this provides insight into translating
a plaintext approximation to a secure protocol. Finally, we extend these approaches to the secure setting
and present the corresponding protocols.

4.2.1. Haar-LUT Evaluation
The two aforementioned approaches presented in Section 4.2 can be applied to the Haar DWT. Consid-
ering vectors of size 2𝑛 , recall from Section 2.3.1 that the Haar DWT compresses data by computing
the average of two elements: 𝑎𝑘 = (𝑣2𝑘 + 𝑣2𝑘+1 )/2, for 𝑘 = 0, 1, . . . , 2𝑛−1 − 1. In other words, as 𝑘
represents the 𝑛 − 1 most significant bits (MSB) of the original index, the index elements with the same
𝑛 − 1 MSBs are averaged. Applying consecutively DWT 𝑗 times, the DWT averages the index elements
with the same 𝑛 − 𝑗 MSBs.
   We observe that for the Haar DWT, both approaches 1 and 2 yield the same approximation. As
described in Section 4.2, the inverse Haar transformation is given by 𝑣2𝑘+1 = 𝑎𝑘 + 𝑑𝑘 and 𝑣2𝑘 = 𝑎𝑘 − 𝑑𝑘 .
Since the detail coefficients are set to 0, we observe that 𝑣2𝑘 = 𝑣2𝑘+1 = 𝑎𝑘 , i.e., the elements from
indexes with the same 𝑛 − 1 MSBs are reconstructed to the same value. In other words, the inverse
transformation duplicates the entries of the indexes having the same 𝑛 − 1 MSBs. This yields the
                                                                                                          ↶1
same result as taking the 𝑛 − 1 MSB of 𝑥 and looking up the compressed table tF , i.e., more generally
↶𝑗
tF (𝑥 ≫ 𝑗) = ̃︀
             tF (𝑥).
Secure Haar LUT Evaluation. The techniques described above can be used together to securely
compute an approximated function F through a LUT. The full protocol is formally described in Fig. 6.
   The secure LUT procedure of Fig. 4 can be adapted to be compatible with the DWT compression
technique. Assume input bitwidth 𝑛, output bitwidth 𝑙, and the number of DWT transformations to be
𝑗. The secure Haar LUT protocol, ΠHaar-LUT
                                    𝑛,𝑙,𝑗,F  (Fig. 6) is given as follows. During the preprocessing phase,
similar to the secure LUT evaluation, the dealer produces a secret-shared random 1-hot vector. However,
to profit from DWT compression, the 1-hot vector has reduced size, i.e., 2𝑛−𝑗 , and 𝑟 ∈ Z2𝑛−𝑗 . Following
approach 1, we have that the look-up index is given by the 𝑛 − 𝑗 MSBs of the input 𝑥. Therefore, the
                                                    ↶𝑗
shift 𝛿 to be applied to the compressed LUT tF is computed using the 𝑛 − 𝑗 MSBs of 𝑥 and 𝑟. The online
   ΠHaar-LUT
    𝑛,𝑙,𝑗,F (J𝑥K)

  Parameters: 𝑛, 𝑙 for input bit-width and output bit-width. F is the function to be computed and
                                                                ↶𝑗
  represented by the corresponding 𝑗−times compressed LUT tF .
  Dealer setup:
   1. 𝒟 randomly generates 𝑟 ←
                             −
                             $
                               Z2𝑛−𝑗 .
   2. 𝒟 builds a 1-hot vector 𝑣 of size 2𝑛−𝑗 with value 1 at position 𝑟.
   3. 𝒟 shares 𝑟 and 𝑣 to 𝒫0 and 𝒫1 .

  LUT evaluation for parties 𝒫𝑖 for 𝑖 ∈ {0, 1}:
   4. 𝒫𝑖 truncates the input value J𝑥K by 𝑗: J𝑥* K = J𝑥 ≫ 𝑗K.               ◁ Truncation protocol [52, §5.1]
   5. 𝒫𝑖 reduces its share J𝑥* K𝑖 by locally computing J𝑥* K𝑖 mod 2𝑛−𝑗 .
   6. 𝒫𝑖 computes J𝛿K = J𝑥* K − J𝑟K and reveal 𝛿.
   7. 𝒫𝑖 rotates J𝑣K𝑖 by 𝛿.
                                             ↶𝑗
   8. 𝒫𝑖 applies the inner product between tF and the rotated vector, and gets the share J𝑦K.
   9. Output: J𝑦K.

Figure 6: Secure Haar protocol.


phase starts with the parties truncating the shares of 𝑥 by 𝑗 and computing its modulo in Z2𝑛−𝑗 . As we
saw, the truncation from the DWT-LUT evaluation approach 1 can be executed with the Escudero et
al. [52, §5.1] probabilistic truncation protocol over rings. The modulo operation is required because the
shares of 𝑟 belong to Z2𝑛−𝑗 and the shares of truncated 𝑥 belong to Z2𝑛 . We note that modulo can be
executed locally as described below. The rest of the protocol is similar to the online phase of the secure
LUT evaluation (Fig. 4).
Local modulo. In the literature, several protocols evaluate the modulo operation over shares in a ring or
field [24, 52]. Those protocols are used whenever the space of the modulo operation is the same for both
inputs and outputs. However, we observe that the modulo operation required for our protocol needs
to reduce the ring size of the operand. More specifically, the shares J𝑥* K in Z2𝑛 have to be reduced to
shares in Z2𝑛−𝑗 . This can be achieved locally by letting the parties compute the modulo operation over
their shares. For correctness, consider 𝑥 = (J𝑥K1 + J𝑥K2 ) mod 2𝑛 and 𝑥′ = 𝑥 mod 2𝑛−𝑗 . We have,

                      𝑥′ = 𝑥 mod 2𝑛−𝑗
                         = ((J𝑥K1 + J𝑥K2 )     mod 2𝑛 )    mod 2𝑛−𝑗
                         = (J𝑥K1 + J𝑥K2 ) mod 2𝑛−𝑗 ◁ 2𝑛−𝑗 is a factor of 2𝑛 .
                         = J𝑥K1 mod 2𝑛−𝑗 + J𝑥K2 mod 2𝑛−𝑗        mod 2𝑛−𝑗
                           (︀                              )︀

                         = J𝑥 K1 + J𝑥′ K2
                           (︀ ′
                                             mod 2𝑛−𝑗 .
                                          )︀


Thus, the shares of 𝑥′ can be computed locally from J𝑥K.

4.2.2. Biorthogonal-LUT Evaluation
Now, we explore the LUT DWT approximation for the (5, 3) Biorthogonal DWT. As described in
Section 2.3.2, the approximation coefficients are given by a weighted average of 5 elements and, similar
to Haar, approach 1 for the Biorthogonal compresses two indexes with the same 𝑛 − 1 MSBs to the
                                                                           ↶𝑗
same index. On the other hand, approach 2 requires computing 𝐵 −𝑗 · eF =: ̃︀
                                                                           tF as shown in Eq. (1). For
simplicity, we set 𝑗 = 1 and examine the structure of the inverted Biorthogonal DWT matrix 𝐵 −1 :
                                            Index ⎛                     ⎞
                                            0000 2       0       0
                                            0 0 0 1⎜ 1   1       0      ⎟ 1
                                                   ⎜                    ⎟
                                                   ⎜0
                                            0 0 1 0⎜     2       0 ... ⎟⎟ · 4.
                                            0 0 1 1⎝ 0   1       1      ⎠
                                                         ..        ..
                                                          .           .

To simplify the exposition, we present matrix 𝐵 −1 only with the elements that interact with the
                                       ↶𝑗
approximation coefficients, i.e., tF . Thus, the detail coefficients are effectively set to zero. Observe that
the line indexes with the same 𝑛 − 1 MSBs ( green ) constitute a 2 × 2 submatrix that combines two
                ↶1
elements from tF , i.e., its corresponding index element and the subsequent element. For instance, the
                                                                                                   ↶1
                 tF , combines the 001 -th and the ( 001 + 1)-th index element of tF . Note that the
001 -th block of ̃︀
LSB ( red ) of the line index plays a role in the weights of the sub-matrices. Indeed, we have:
                                                 (︂    )︂ (︂                  )︂
                                                   2 0      2− 0          0
                                                         =                         .
                                                   1 1      2− 1          1


Finally, consider the case 𝑗 = 2. We omit the (1/4)𝑗 constant for readability. The 2 LSBs of the indices
define the elements of 4 × 2 sub-matrices and the 𝑛 − 2 MSBs of the indices define the elements from
                          ↶2
the approximation LUT tF to be used.
                                                         ↶2
                     Index ⎛       𝐵 −2                  𝑇Fext   ⎛              tF
                                                                                ̃︀             ⎞
                                                   ⎞ ⎛
                                                                     2
                                                             ⎞
                     0 0 0 0⎜ 4   0     0                        ⎜ (2 − 0 ) · 𝑎 + 0 · 𝑎
                                                   ⎟ ⎜ 𝑎0 ⎟ ⎜ (22 − 1 ) · 𝑎 0 + 1 · 𝑎 0 +1 ⎟
                                                                                               ⎟
                                                   ⎟ ⎜ . ⎟ ⎜
                            ⎜3
                     0 0 0 1⎜     1     0                                        0
                                                   ⎟ ⎜ . ⎟ ⎜ (22 − 2 ) · 𝑎 + 2 · 𝑎        0 +1 ⎟
                                                   ⎟ ⎜ . ⎟ ⎜
                                                                                               ⎟
                     0 0 1 0⎜ 2   2     0                            2
                                                                                 0        0 +1 ⎟
                            ⎜
                     0 0 1 1⎜ 1   3     0
                                                   ⎟ ⎜
                                                   ⎟ ⎜ 𝑎2𝑛−2 ⎟
                                                                 ⎜ (2 − 3 ) · 𝑎 0 + 3 · 𝑎 0 +1 ⎟
                                                             ⎟ ⎜                               ⎟
                            ⎜
                     0 1 0 0⎜ 0   4     0     ... ⎟ ·
                                                   ⎟
                                                               =
                                                                 ⎜
                                                                     2
                                                                                               ⎟
                                                             ⎟ ⎜ (2 − 0 ) · 𝑎 1 + 0 · 𝑎 1 +1 ⎟
                                                      ⎜      ⎟   ⎜
                                                   ⎟ ⎜
                     0 1 0 1⎜ 0   3     1
                            ⎜                                                                  ⎟
                                                   ⎟ ⎜ 0 ⎟ ⎜ (22 − 1 ) · 𝑎 + 1 · 𝑎
                                                      ⎜      ⎟                                 ⎟
                                                                                 1        1 +1
                     0 1 1 0⎜ 0   2     2          ⎟ ⎜ .. ⎟
                            ⎜                      ⎟ ⎜           ⎜                             ⎟
                                                                 ⎜ (22 − 2 ) · 𝑎 + 2 · 𝑎
                                                   ⎟ ⎝ . ⎠ ⎜ 2
                                                                                               ⎟
                                                             ⎟                   1        1 +1 ⎟
                            ⎝0
                     0 1 1 1⎜     1     3                        ⎜ (2 − 3 ) · 𝑎 1 + 3 · 𝑎 1 +1 ⎟
                                  ..          ..
                                                   ⎠
                                   .             .       0                        ..
                                                                 ⎝                             ⎠
                                                                                   .

  We can generalize the above pattern for any 𝑗, which provides an approximation formula suited for
the secure setting:
                               (︂                                          )︂
                             1 (︀ 𝑗         )︀ ↶𝑗 (𝑗)           ↶𝑗 (𝑗)
                    tF (𝑥) = 𝑗
                    ̃︀             2 − 𝑥(𝑗) · tF (𝑥 ) + 𝑥(𝑗) · tF (𝑥 + 1) ,                      (2)
                            4

where 𝑥(𝑗) and 𝑥(𝑗) represent the 𝑛 − 𝑗 MSBs and 𝑗 LSBs of 𝑥, respectively. We recall that the 𝑛 − 𝑗
MSBs can be obtained by truncation and the 𝑗 LSB by the mod 2𝑗 operation.
Secure Biorthogonal LUT evaluation. Eq. (2) can be used to securely compute an approximated
function F through a look-up table. The full protocol is formally described in Fig. 7. Similarly, the
secure LUT procedure described before (Fig. 4) can be adapted to be compatible with the Biorthogonal
DWT compression technique. Assume input bitwidth 𝑛, output bitwidth 𝑙, and the number of DWT
transformations to be 𝑗.
   The secure Biorthogonal LUT protocol, ΠBior-LUT                           Haar-LUT with the difference
                                            𝑛,𝑙,𝑗,F (Fig. 7), is similar to Π𝑛,𝑙,𝑗,F
                                                  ↶𝑗
that it requires two LUT evaluations of tF . As an optimization, note that these two LUT evaluations
do not require two different one-hot vectors. Since the LUT evaluation is executed in two consecutive
indices, we can use the same vector 𝑣 to generate two shifted vectors: one shifted by 𝛿 and the other
by 𝛿 + 1. This allows us to use the same dealer setup phase as in secure Haar LUT evaluation. The
online phase starts with the parties computing 𝑥(𝑗) as the 𝑛 − 𝑗 MSBs and 𝑥(𝑗) as the 𝑗 LSBs of 𝑥. More
specifically, 𝑥(𝑗) is computed by truncating J𝑥K by 𝑗 (i.e., [52, §5.1]) and locally computing its modulo
in Z2𝑛−𝑗 , while 𝑥(𝑗) is computed locally based on 𝑥(𝑗) as 𝑥 = 2𝑗 · 𝑥(𝑗) + 𝑥(𝑗) . The rest of the protocol is
   ΠBior-LUT
    𝑛,𝑙,𝑗,F (J𝑥K)

  Parameters: 𝑛, 𝑙 for input bit-width and output bit-width. F is the function to be computed and
                                                                        ↶𝑗
  represented by the corresponding 𝑗−times compressed LUT tF .
  Dealer setup:
   1. 𝒟 randomly generates 𝑟 ←
                             −
                             $
                               Z2𝑛−𝑗 .
   2. 𝒟 builds a 1-hot vector 𝑣 of size 2𝑛−𝑗 with value 1 at position 𝑟.
   3. 𝒟 shares 𝑟 and 𝑣 to 𝒫0 and 𝒫1 .

  LUT evaluation for parties 𝒫𝑖 for 𝑖 ∈ {0, 1}:
   4. 𝒫𝑖 truncates the input value J𝑥K by 𝑗, J𝑥 ≫ 𝑗K := J𝑥(𝑗) K. Then, 𝒫𝑖 reduces its share J𝑥(𝑗) K𝑖 by
      locally computing J𝑥(𝑗) K𝑖 mod 2𝑛−𝑗 .
   5. 𝒫𝑖 computes the 𝑗 LSB of 𝑥 by locally computing J𝑥(𝑗) K := J𝑥K − 2𝑗 · J𝑥(𝑗) K.

   6. 𝒫𝑖 compute J𝛿K = J𝑥(𝑗) K − J𝑟K and reveal 𝛿.
   7. 𝒫𝑖 rotate J𝑣K𝑖 by 𝛿 and 𝛿 + 1.
                                               ↶𝑗
   8. 𝒫𝑖 apply the inner product between tF and the rotated vectors. We denote the resulting shares
          ↶𝑗                ↶𝑗
      by J tF (𝑥(𝑗) )K and J tF (𝑥(𝑗) + 1)K.
                                                (︂                                                             )︂
                                                                  )︀ ↶𝑗                         ↶𝑗
   9. 𝒫𝑖 computes the expression J𝑦K := 41𝑗            2 − J𝑥(𝑗) K · J tF (𝑥(𝑗) )K + J𝑥(𝑗) K · J tF (𝑥(𝑗) + 1)K .
                                                     (︀ 𝑗


  10. Output: J𝑦K.

Figure 7: Secure Biorthogonal protocol.


similar to the online phase of the secure LUT evaluation (Fig. 4) on 𝛿 and 𝛿 + 1. In the final step 10, the
parties compute the expression (2) as the output.
  Both Haar and Biorthogonal secure evaluation protocols can be used to directly evaluate a function
over a fixed interval. Furthermore, we note that our protocols can be combined with a piece-wise
approach commonly employed in prior works [28, 15]. We extend this method in Appendix B to
accommodate any bounded even or odd function (Fig 11). We demonstrate the applicability of this
protocol to various functions, including s-shape functions such as tanh, erf and sigmoid, as well as
activation functions like GeLU and SiLUq. Additionally, in Appendix B.3, we introduce a protocol for
both sin and cos.

4.3. Complexity Analysis
4.3.1. Memory
The secure DWT-LUT approach, using compressed tables, significantly minimizes the memory required
for LUT handling. For instance, consider the interval (0, 4096) with a precision of 𝑓 = 16. A conven-
tional LUT protocol would typically demand a table size of approximately 2.15 GB. However, employing
the DWT technique allows us to reduce the original table size by a factor of approximately 220 while
maintaining high accuracy levels as discussed in Section 5.1.2. The corresponding DWT-reduced LUT
occupies only 1 KB of memory, demonstrating a substantial reduction in storage requirements without
compromising accuracy.

4.3.2. Round & Communication
Both Haar and Biorthogonal protocols have the same preprocessing phase and both start the online phase
by truncating and opening a single field element. However, the Biorthogonal requires an additional
                                    Curl Latency       Curl MAE             CrypTen Latency     CrypTen MAE

           100                           10−1         100                           101         100                        10−1

                                         10−2                                       100                                    10−2
Latency


                                                                                               10−1


                                                                                                                                  MAE
                                         10−3                                       10−1
                                                     10−1                                                                  10−3

                                                                                    10−2       10−2
          10−1                           10−4                                                                              10−4
                 7     8     9      10                      6     7     8       9                     5    6       7   8
                     LUT bit size                                 LUT Size                                 LUT Size

(a) Logarithm (Fig. 7).                            (b) Square root (Fig. 7).                  (c) Inv. square root (Fig. 6).

           100                                        100                           10−2        100                        100


                                         10−4                                                                              10−1
Latency


                                                                                                                                  MAE
                                                                                               10−1
          10−1                                                                                                             10−2
                                                     10−1                           10−3
                                         10−5                                                  10−2                        10−3
                 5     6     7      8                       4     5     6       7                     5    6       7   8
                     LUT bit size                                 LUT Size                                 LUT Size

(d) Sigmoid (Fig. 11).                             (e) SiLU (App. B.2).                       (f) SiLU (Fig. 7).
Figure 8: Mean absolute error (MAE) and latency trade-offs for different DWT compressions. We plot CrypTen
as a baseline.


truncation and two share multiplications (see Fig. 7 Step 9). Observe that Escudero et al. [52, §5.1]
probabilistic truncation protocol over rings involves one round to open a ring element. Consequently,
while the Haar protocol operates within 2 rounds and transmits 2 ring elements, the Biorthogonal is
more resource-intensive, requiring 4 rounds and transmitting 5 elements.


5. Experimental Results
We implemented Curl by building on top of CrypTen and open-source it in https://github.com/jimouris/
curl . Curl natively supports both CPU and GPU backends, however, we focus on CPU experiments as
this is the most widely used setting in MPC. For completeness, we conduct some GPU experiments using
an NVIDIA GeForce RTX 4080. All parties run in separate processes on a c5n.9xlarge AWS instance
with 36 vCPUs and 96 GB of memory. As communication and round complexity form crucial metrics
for assessing MPC end-to-end runtime, we extensively report these numbers in all our benchmarks.
This can be used to glean WAN performance.
   Furthermore, as Curl’s novelty lies in improving approximations through DWT-encoded lookup
tables, we focus on reporting the mean absolute errors (MAE) and mean relative errors (MRE) over the
expected result. Finally, we focus on private LLM inference using secret-shared models on secret-shared
data. Note that Curl can be used for private training in a similar way, which we omit due to space
constraints. We use 16 bits of precision, following related works that typically use precision between 9
and 17 bits [46, 15, 73, 23].

5.1. Non-linear Functions
We experimentally verified that the Biorthogonal protocol achieves better approximations than Haar
with almost identical latency and communication costs (Appendix C). Therefore, we use the Biorthogonal
protocol for almost all our experiments. We choose constrained ranges of input values to enable a
meaningful comparison with CrypTen. For wider ranges, such as (0, 212 ), CrypTen’s approximations
exhibit significantly higher errors. Since previous MPC works [28, 27, 42] rely on similar polynomial
approximations, we can argue that they all have similar error levels. For example, for the log function,
Table 1
Overview of Curl’s improvement over CrypTen for selected tensors functions in terms of latency, mean absolute
error (MAE), and mean relative error (MRE), for tensors of 256 × 256 elements.
                                                                   Curl                                          CrypTen [12]
                                                                      ‡                     §
      Op.         Protocol    Domain            Latency       Com.                  Error            Latency      Com.         Error
                                         LUT
                                                 (sec.)†   Rounds MB             MAE    MRE           (sec.)   Rounds MB     MAE    MRE
          log       Fig. 7     (0, 64)    28       0.17        4          2.6    2.09e-2   5.48e-2    0.17      40    39.8   2.14e-2   6.36e-3
    reciprocal      Fig. 7     (1, 64)    2 7
                                                   0.09        4          2.6    7.18e-4   1.43e-3    0.11      59    38.3    1.7e-4   7.05e-3
      sqrt          Fig. 7    (0, 256)    26       0.06        4          2.6    1.23-1    1.11e-2    0.09      26    17.3   6.09e+0   4.04e-1
     invsqrt        Fig. 6    (0, 256)    2 6
                                                   0.04        2          1.0    1.45e-2   1.14e-1    0.09      24    15.7   2.69e-2   0.405e-1
          sin     App. B.3   (−64, 64)    25       0.08       16          20.4   4.55e-3   1.14e-2    0.11      37    24.1   8.52e-1   1.58e+0
          cos     App. B.3   (−64, 64)    25       0.08       16          20.4   4.77e-3   9.85e-2    0.10      37    24.1   8.86e-1   1.45e+0
                   Fig. 11   (−64, 64)    2 6
                                                   0.10       22          33.6   4.70e-5   7.83e-2    0.11      26    26.2   7.00e-5   3.49e+0
     sigmoid
                   Fig. 7    (−64, 64)    26       0.10        4           2.6   1.11e-2   6.59e-2    0.11      26    26.2   7.00e-5   3.49e+0
      tanh         Fig. 11   (−64, 64)    25       0.09       22          33.6   2.31e-4   3.96e-4    0.13      26    26.2   8.60e-5   1.19e-4
          erf      Fig. 11   (−64, 64)    2 3
                                                   0.09       22          33.6   8.98e-4   1.83e-3    0.21      56    36.2   3.39e+7   3.40e+7
                  App. B.2   (−64, 64)    24       0.10       30          47.7   5.95e-3   2.79e+0    N/A       N/A   N/A       N/A     N/A
      GeLU
                   Fig. 7     (−4, 4)     24       0.11        4          2.6    2.60e-3   5.02e-2    N/A       N/A   N/A       N/A     N/A
                  App. B.2   (−64, 64)    26       0.14       30          47.7   2.61e-3   5.48e-3    N/A       N/A   N/A       N/A     N/A
      SiLU
                   Fig. 7    (−64, 64)    26       0.09        4          2.6    1.54e-1   1.18e-1    N/A       N/A   N/A       N/A     N/A
      †
           Green background indicates the fastest runtime, ‡ orange indicates the lowest communication, and § blue indicates the lowest error
          (MAE or MRE).


CrypTen’s approximation yields a MAE of 1.09e+10, whereas the Biorthogonal with a compression
𝑗 = 21 achieves a substantially lower MAE of 2.66e-1. We note that the latency time reported is
dominated by computation time as the protocol is executed in the same machine.

5.1.1. Choosing LUT Sizes
We evaluate our secure DWT protocols to compute a set of smooth functions commonly used in
machine learning tasks such as logarithm (used for perplexity and cross-entropy that measure LLM’s
performance), square root and inverse square root (used for layer normalization), sigmoid, and SiLU. In
Fig. 8 we investigate the trade-offs between latency on the left y-axes (green) and MAE on the right
y-axes (red) for the aforementioned non-linear functions. As a baseline, we show CrypTen’s latency
and MAE with dashed green and red lines, respectively. In the case of SiLU, which we evaluate with
two different protocols, CrypTen does not support this so we only show Curl’s latency and MAE. Of
course, bigger LUTs in Curl achieve better approximations but require more computation and thus
higher latency. However, DWT reduces the size of the LUTs which therefore decreases the computation
complexity of Curl while still achieving good approximations. For a LAN setting, the total runtime is
expected to be dominated by the computation time. However, for WAN, the total runtime is mostly
dominated by communication. Therefore, we focus on achieving better errors than CrypTen while
keeping a similar latency level. As an example, consider running 265×265 log functions in a LAN
setting with 0.8ms ping-time and 3 Gbps bandwidth and in a WAN setting with 80ms ping-time and
200 Mbps bandwidth. We estimate3 Curl to take 180ms and CrypTen to take 313ms (×1.8 more) in
LAN and Curl to take 599ms and CrypTen to take 5, 039ms (×10.1 more) in WAN.
   For general and unbounded functions, we apply the secure Biorthogonal protocol (Fig. 7). Exception-
ally, we use Haar-encoded protocol (Fig. 6) for the invsqrt function as it achieves better errors whenever
the (0, 1) interval is included. For sqrt and invsqrt we consider the interval (0, 256). However, for
log we consider the interval to be (0, 64) as CrypTen’s MAE and MRE are of the order of 1e+6 for
(0, 256). This corresponds to an original LUT of size 224 for sqrt and invsqrt and of size 222 for log.
Observe the logarithm approximation in Fig. 8a in which both latency and MAE of Curl matches that
3
    We add the computation latency, the ping-time multiplied by the number of rounds and the amount of information sent
    divided by the bandwidth. We use the numbers reported in Table 1.
of CrypTen’s for LUT sizes of 8 bits. In Figs. 8b and 8c, Curl’s MAE is always below CrypTen’s. For
these two functions, we use LUT sizes with 6 bits as these provide lower latency for Curl. We set the
compression parameter to 𝑗 = 18 for both sqrt and invsqrt and 14 for log. For sigmoid, we use the
protocol presented in Fig. 11 and set the interval to be (−64, 64). As analyzed in Appendix B.1, sigmoid
is not an odd function but the last step of the protocol can be adapted to be applied to any s-shape
function, including sigmoid. We observe that Curl achieves a better approximation than CrypTen for
LUT sizes with 6 or more bits with runtime being marginally slower. Finally, for SiLU, we set the
interval to be (−64, 64) and we use two different protocols: the approach described in Appendix B.2
which is a derivation of Fig 11 protocol (Fig. 8e); and the Biorthogonal protocol (Fig. 8f). We observe
that Appendix B.2 approach achieves better MAE errors when compared to the Biorthogonal protocol.
The latency from Fig. 8e is less affected by an increase in the LUT when compared with Fig. 8f. The
same behavior happens with GeLU and SiLU as we can observe in Table 1. This is justified by the fact
that the approach from Appendix B.2 requires comparisons, whether the Biorthogonal protocol does
not. Indeed, note that in all functions where a DWT-LUT protocol is applied, the latency decreases by
half when we half the LUT size.

5.1.2. Evaluations
Having found the optimal tradeoffs for LUT sizes, we compare the Curl approximations with the DWT
technique to CrypTen’s polynomial approximations in terms of latency, communication, and errors.
In Table 1, we evaluate a set of commonly used non-linear functions including logarithm, inverse
square root, the sigmoid activation function, as well as GeLU and SiLU. For unbounded functions, we
use the protocol from Fig. 7, while for bounded functions (e.g., sin) we use our optimized protocols
which are referenced in the table. We experimentally verified that, in most cases, the Biorthogonal
protocol outperforms the Haar DWT in terms of both MAE and MRE, with the only exception being
invsqrt. Observe that for the same LUT size, the latency for Haar is 2/3 lower than the Biorthogonal
approach (compare sqrt 0.06s latency using Biorthogonal with invsqrt 0.04s latency using Haar). For
each function, we specify the evaluation domain for which we report the errors (i.e., MAE and MRE).
   Observe that all functions that utilize the protocol from Fig. 7 and Fig. 6 have constant rounds and
communication, regardless of the LUT size, as the one-hot vectors (which depend on the size) are
generated during preprocessing from the dealer. Latency is affected by the LUT size (e.g., sqrt vs log)
since the LUT output value is accumulated with shares of zero, which are as many as the LUT size
(see last step of Fig. 4). However, as discussed before, the end-to-end latency will be dominated by the
number of rounds and total communication size, for which Curl is almost always significantly better.
Regarding trigonometric functions (sin and cos) we use the approach proposed in Appendix B.3. We note
that CrypTen’s approximation of both functions already has a 1.2e+2 error for the range (−128, 128).
For this reason, we decrease the interval range to (−64, 64). On the other hand, Curl’s approximation
keeps the same level of errors independent of the input range. Moreover, Curl achieves that while
having less latency, rounds and communication when compared with CrypTen.

5.2. LLM Inference
We evaluate Curl with BERT Tiny (4 million parameters), BERT Base (110 million parameters), BERT
Large (340 million parameters), GPT-2 (124 million parameters), and GPT Neo (1.3 billion parameters).
In Table 2, we report the latency, number of rounds, and total communication for each LLM for 64
inputs. For completeness, we also evaluate BERT Tiny and Base on GPU for 64 elements and report
between 2 − 3× speedup compared to CPU (1.1 and 6.5 seconds, respectively). Finally, we use the QNLI
classification task from the GLUE benchmark [74] to evaluate the accuracy of our models. Curl achieves
an accuracy of 80.3% for encrypted BERT Tiny, which almost matches the plaintext accuracy (i.e.,
81.4%). This further verifies our claim regarding the probabilistic truncation protocol that is sufficiently
accurate to evaluate LLMs. However, we observed that in LLMs the LUT protocols are not great for
inverse square root as the input range varies and the output of the function changes rapidly in (0, 1].
Thus, to achieve high accuracy we resort to a comparison and two LUTs: a dense one between (0, 1]
and a sparse one in (1, 256), getting the best of both worlds.

Table 2
Runtime and communication of Curl with a sequence length of 64 items.
                             Model          Latency (s)           Rounds         Com. (GB)
                          BERT Tiny                3.55             409             1.34
                          BERT Base               13.63            1,629             2.8
                          BERT Large              33.93            3,093            5.66
                            GPT-2                 16.61            1,630            3.77
                           GPT-Neo                103.4            3,118            14.9

   Next, we compare with Iron, MPCFormer, Puma, and Bolt in Table 3 for BERT Base with 128 items.
In terms of end-to-end latency, Curl outperforms all other frameworks by at least 1.5×, with as much
as 20× in the case of Iron. More importantly, Curl achieves this with significantly less amount of
rounds and total communication. Specifically, Iron and Bolt require 8× and 6× more rounds than Curl,
respectively, while in terms of total communication, all other frameworks need more than 10 GBs (with
Iron needing 281 GBs) while Curl only needs 5.7 GBs. Although the end-to-end runtime in a real-world
instantiation would increase due to communication, Curl would not be affected as much by that since it
exhibits the lowest communication and number of rounds.

Table 3
Runtime and communication comparisons with a sequence length of 128 items for the full BERT Base LLM.
                           Framework             Latency (s)       Rounds          Com. (GB)
                          Iron [26]                  475            13,663           281
                       MPCFormer [27]                55.3             –              12.1
                         Puma [42]                   33.9             –              10.8
                          Bolt [28]                  185            10,509           59.6
                        Bolt (WE) [28]†               91            10,901           25.7
                               Curl                  22.5            1,629            5.7
                      †
                          In Bolt, WE stands for word elimination.

   Finally, in Table 4 we compare Curl with CrypTen for five LLMs with 64 inputs. This differs from
Tables 2 and 3 as CrypTen lacks a protocol to evaluate embeddings (Section 2.2). Thus, in Table 4, we
skip embedding for a fair comparison. Curl needs less than half the amount of rounds CrypTen does for
all five models and almost half the total communication. Although Curl’s runtimes are already faster
than CrypTen’s, the difference in end-to-end latency will increase further in a realistic instantiation due
to communication.

Table 4
Runtime and communication comparisons with CrypTen with a sequence length of 64 items. These models skip
the embeddings as CrypTen does not support lookups.
                                          Curl                                   CrypTen [12]
             Model
                          Latency (s)   Rounds       Com. (GB)       Latency (s)    Rounds     Com. (GB)
        BERT Tiny             0.51         252            0.02          0.86           539        0.05
        BERT Base            9.01         1,472           1.31          13.23         3,089      2.62
        BERT Large           28.79        2,936           4.11         38.59          6,149       7.6
          GPT-2               8.94        1,464           1.31         12.93          3,060       2.62
         GPT-Neo             98.44        2,952           11.95        137.20†        6,144      18.91
         †
             Approximated since CrypTen could not fit GPT Neo in RAM.
6. Concluding Remarks
We have introduced Curl, a framework for privacy-preserving machine learning that evaluates non-linear
functions as lookup tables, resulting in better approximations and significant round and communication
reduction. Curl achieves this by relying on discrete wavelet transformations to reduce the lookup table
sizes without sacrificing accuracy, which has resulted in up to 19× round and communication reduction
compared to related works. We have evaluated Curl on five large language models such as BERT, GPT-2,
and GPT Neo (1.3 billion parameters), and have compared it against state-of-the-art related works,
significantly reducing latency, the number of communication rounds, and the total communication. On
top of that, Curl is easy to use by exposing a tensor-based programming model that machine learning
researchers and developers are familiar with which allows for both CPU and GPU backends. Lastly,
we have introduced a new ideal functionality for probabilistic truncation protocols and proved their
security in the real-ideal paradigm. This is of independent interest, as truncation is a core component of
encrypted machine-learning models, and probabilistic truncation was previously considered insecure.


References
 [1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt,
     S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
 [2] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers
     for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019
     Conference of the North American Chapter of the Association for Computational Linguistics:
     Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
     Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
 [3] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal,
     E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint
     arXiv:2302.13971 (2023).
 [4] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
     G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information
     processing systems 33 (2020) 1877–1901.
 [5] T. Goyal, J. J. Li, G. Durrett, News summarization and evaluation in the era of GPT-3, arXiv
     preprint arXiv:2209.12356 (2022).
 [6] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders,
     et al., Webgpt: Browser-assisted question-answering with human feedback, arXiv preprint
     arXiv:2112.09332 (2021).
 [7] C. Peris, C. Dupuy, J. Majmudar, R. Parikh, S. Smaili, R. Zemel, R. Gupta, Privacy in the time of
     language models, in: Proceedings of the Sixteenth ACM International Conference on Web Search
     and Data Mining, 2023, pp. 1291–1292.
 [8] D. Gong, X. Wan, D. Wang, Working memory capacity of ChatGPT: An empirical study, in:
     Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 2024, pp. 10048–10056.
 [9] O. Goldreich, S. Micali, A. Wigderson, How to play any mental game or A completeness theorem
     for protocols with honest majority, in: A. Aho (Ed.), 19th Annual ACM Symposium on Theory of
     Computing, ACM Press, New York City, NY, USA, 1987, pp. 218–229. doi:10.1145/28395.28420.
[10] A. C.-C. Yao, Protocols for secure computations (extended abstract), in: 23rd Annual Symposium
     on Foundations of Computer Science, IEEE Computer Society Press, Chicago, Illinois, 1982, pp.
     160–164. doi:10.1109/SFCS.1982.38.
[11] A. P. K. Dalskov, D. Escudero, M. Keller, Secure evaluation of quantized neural networks, Proceed-
     ings on Privacy Enhancing Technologies 2020 (2020) 355–375. doi:10.2478/popets-2020-0077.
[12] B. Knott, S. Venkataraman, A. Hannun, S. Sengupta, M. Ibrahim, L. van der Maaten, Crypten: Secure
     multi-party computation meets machine learning, in: M. Ranzato, A. Beygelzimer, Y. Dauphin,
     P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, volume 34,
     Curran Associates, Inc., 2021, pp. 4961–4973.
[13] N. Koti, M. Pancholi, A. Patra, A. Suresh, SWIFT: Super-fast and robust privacy-preserving
     machine learning, in: M. Bailey, R. Greenstadt (Eds.), USENIX Security 2021: 30th USENIX Security
     Symposium, USENIX Association, 2021, pp. 2651–2668.
[14] Z. Huang, W. jie Lu, C. Hong, J. Ding, Cheetah: Lean and fast secure two-party deep neural
     network inference, in: K. R. B. Butler, K. Thomas (Eds.), USENIX Security 2022: 31st USENIX
     Security Symposium, USENIX Association, Boston, MA, USA, 2022, pp. 809–826.
[15] K. Gupta, N. Jawalkar, A. Mukherjee, N. Chandran, D. Gupta, A. Panwar, R. Sharma, SIGMA: Secure
     GPT Inference with Function Secret Sharing, Proceedings on Privacy Enhancing Technologies
     2024 (2024) 61–79. doi:10.56553/popets-2024-0107.
[16] P. Mohassel, P. Rindal, ABY3 : A mixed protocol framework for machine learning, in: D. Lie,
     M. Mannan, M. Backes, X. Wang (Eds.), ACM CCS 2018: 25th Conference on Computer and
     Communications Security, ACM Press, Toronto, ON, Canada, 2018, pp. 35–52. doi:10.1145/
     3243734.3243760.
[17] S. Tan, B. Knott, Y. Tian, D. J. Wu, CryptGPU: Fast privacy-preserving machine learning on the
     GPU, in: 2021 IEEE Symposium on Security and Privacy, IEEE Computer Society Press, San
     Francisco, CA, USA, 2021, pp. 1021–1038. doi:10.1109/SP40001.2021.00098.
[18] M. Keller, K. Sun, Secure quantized training for deep learning, in: International Conference on
     Machine Learning, PMLR, 2022, pp. 10912–10938.
[19] J.-L. Watson, S. Wagh, R. A. Popa, Piranha: A GPU platform for secure computation, in: K. R. B.
     Butler, K. Thomas (Eds.), USENIX Security 2022: 31st USENIX Security Symposium, USENIX
     Association, Boston, MA, USA, 2022, pp. 827–844.
[20] N. Jawalkar, K. Gupta, A. Basu, N. Chandran, D. Gupta, R. Sharma, Orca: FSS-based secure training
     with GPUs, Cryptology ePrint Archive, Report 2023/206, 2023. https://eprint.iacr.org/2023/206.
[21] I. Damgård, V. Pastro, N. P. Smart, S. Zakarias, Multiparty computation from somewhat homomor-
     phic encryption, in: R. Safavi-Naini, R. Canetti (Eds.), Advances in Cryptology – CRYPTO 2012,
     volume 7417 of Lecture Notes in Computer Science, Springer, Heidelberg, Germany, Santa Barbara,
     CA, USA, 2012, pp. 643–662. doi:10.1007/978-3-642-32009-5_38.
[22] D. Escudero, An introduction to secret-sharing-based secure multiparty computation, Cryptology
     ePrint Archive, Report 2022/062, 2022. https://eprint.iacr.org/2022/062.
[23] S. Wagh, Pika: Secure computation using function secret sharing over rings, Proceedings on
     Privacy Enhancing Technologies 2022 (2022) 351–377. doi:10.56553/popets-2022-0113.
[24] O. Catrina, S. de Hoogh, Improved primitives for secure multiparty integer computation, in: J. A.
     Garay, R. D. Prisco (Eds.), SCN 10: 7th International Conference on Security in Communication
     Networks, volume 6280 of Lecture Notes in Computer Science, Springer, Heidelberg, Germany,
     Amalfi, Italy, 2010, pp. 182–199. doi:10.1007/978-3-642-15317-4_13.
[25] I. Damgård, M. Fitzi, E. Kiltz, J. B. Nielsen, T. Toft, Unconditionally secure constant-rounds multi-
     party computation for equality, comparison, bits and exponentiation, in: S. Halevi, T. Rabin (Eds.),
     TCC 2006: 3rd Theory of Cryptography Conference, volume 3876 of Lecture Notes in Computer
     Science, Springer, Heidelberg, Germany, New York, NY, USA, 2006, pp. 285–304. doi:10.1007/
     11681878_15.
[26] M. Hao, H. Li, H. Chen, P. Xing, G. Xu, T. Zhang, Iron: Private inference on transformers,
     in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural
     Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 15718–15731.
[27] D. Li, R. Shao, H. Wang, H. Guo, E. P. Xing, H. Zhang, MPCFormer: Fast, performant and private
     transformer inference with MPC, in: International Conference on Learning Representations (ICLR),
     2023, pp. 1–16.
[28] Q. Pang, J. Zhu, H. Möllering, W. Zheng, T. Schneider, Bolt: Privacy-preserving, accurate and
     efficient inference for transformers, in: 2024 IEEE Symposium on Security and Privacy (SP),
     IEEE Computer Society, Los Alamitos, CA, USA, 2024, pp. 133–133. doi:10.1109/SP54263.2024.
     00130.
[29] E. Boyle, N. Chandran, N. Gilboa, D. Gupta, Y. Ishai, N. Kumar, M. Rathee, Function secret
     sharing for mixed-mode and fixed-point secure computation, in: A. Canteaut, F.-X. Standaert
     (Eds.), Advances in Cryptology – EUROCRYPT 2021, Part II, volume 12697 of Lecture Notes in
     Computer Science, Springer, Heidelberg, Germany, Zagreb, Croatia, 2021, pp. 871–900. doi:10.
     1007/978-3-030-77886-6_30.
[30] K. Storrier, A. Vadapalli, A. Lyons, R. Henry, Grotto: Screaming fast (2+1)-PC or Z2𝑛 via (2, 2)-DPFs,
     in: ACM CCS 2023: 30th Conference on Computer and Communications Security, ACM Press,
     2023, pp. 2143–2157. doi:10.1145/3576915.3623147.
[31] D. Mouris, P. Sarkar, N. G. Tsoutsos, PLASMA: Private, Lightweight Aggregated Statistics against
     Malicious Adversaries, Proceedings on Privacy Enhancing Technologies 2024 (2024) 1–19. doi:10.
     56553/popets-2024-0064.
[32] D. Mouris, C. Patton, H. Davis, P. Sarkar, N. G. Tsoutsos, Mastic: Private Weighted Heavy-Hitters
     and Attribute-Based Metrics, Proceedings on Privacy Enhancing Technologies 2025 (2025) 1–30.
[33] Y. Ishai, E. Kushilevitz, S. Meldgaard, C. Orlandi, A. Paskin-Cherniavsky, On the power of correlated
     randomness in secure computation, in: A. Sahai (Ed.), TCC 2013: 10th Theory of Cryptography
     Conference, volume 7785 of Lecture Notes in Computer Science, Springer, Heidelberg, Germany,
     Tokyo, Japan, 2013, pp. 600–620. doi:10.1007/978-3-642-36594-2_34.
[34] G. Dessouky, F. Koushanfar, A.-R. Sadeghi, T. Schneider, S. Zeitouni, M. Zohner, Pushing the com-
     munication barrier in secure computation using lookup tables, in: ISOC Network and Distributed
     System Security Symposium – NDSS 2017, The Internet Society, San Diego, CA, USA, 2017, pp.
     1–15.
[35] A. Brüggemann, R. Hundt, T. Schneider, A. Suresh, H. Yalame, FLUTE: Fast and secure lookup
     table evaluations, in: 2023 IEEE Symposium on Security and Privacy, IEEE Computer Society
     Press, San Francisco, CA, USA, 2023, pp. 515–533. doi:10.1109/SP46215.2023.10179345.
[36] L. Ducas, D. Micciancio, FHEW: Bootstrapping homomorphic encryption in less than a second,
     in: E. Oswald, M. Fischlin (Eds.), Advances in Cryptology – EUROCRYPT 2015, Part I, volume
     9056 of Lecture Notes in Computer Science, Springer, Heidelberg, Germany, Sofia, Bulgaria, 2015,
     pp. 617–640. doi:10.1007/978-3-662-46800-5_24.
[37] I. Chillotti, N. Gama, M. Georgieva, M. Izabachène, TFHE: Fast fully homomorphic encryption
     over the torus, Journal of Cryptology 33 (2020) 34–91. doi:10.1007/s00145-019-09319-x.
[38] I. Chillotti, M. Joye, P. Paillier, Programmable bootstrapping enables efficient homomorphic
     inference of deep neural networks, in: S. Dolev, O. Margalit, B. Pinkas, A. Schwarzmann (Eds.),
     Cyber Security Cryptography and Machine Learning, Springer International Publishing, Cham,
     2021, pp. 1–19.
[39] C. Gouert, D. Mouris, N. G. Tsoutsos, HELM: Navigating Homomorphic Encryption through Gates
     and Lookup Tables, Cryptology ePrint Archive, Paper 2023/1382, 2023. https://eprint.iacr.org/2023/
     1382.
[40] C. Gouert, M. Ugurbil, D. Mouris, M. de Vega, N. G. Tsoutsos, Ripple: Accelerating Programmable
     Bootstraps for FHE with Wavelet Approximations, in: International Conference on Information
     Security, Springer, 2024, pp. 1–20.
[41] T. Chen, H. Bao, S. Huang, L. Dong, B. Jiao, D. Jiang, H. Zhou, J. Li, F. Wei, THE-X: Privacy-
     preserving transformer inference with homomorphic encryption, in: S. Muresan, P. Nakov,
     A. Villavicencio (Eds.), Findings of the Association for Computational Linguistics: ACL 2022,
     Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3510–3520. doi:10.18653/
     v1/2022.findings-acl.277.
[42] Y. Dong, W. jie Lu, Y. Zheng, H. Wu, D. Zhao, J. Tan, Z. Huang, C. Hong, T. Wei, W. Cheng, Puma:
     Secure inference of llama-7b in five minutes, arXiv preprint arXiv:2307.12533 (2023).
[43] J. Ma, Y. Zheng, J. Feng, D. Zhao, H. Wu, W. Fang, J. Tan, C. Yu, B. Zhang, L. Wang, SecretFlow-SPU:
     A performant and User-Friendly framework for Privacy-Preserving machine learning, in: 2023
     USENIX Annual Technical Conference (USENIX ATC 23), USENIX Association, Boston, MA, 2023,
     pp. 17–33.
[44] S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan, Deep learning with limited numerical
     precision, in: Proceedings of the 32nd International Conference on International Conference on
     Machine Learning - Volume 37, ICML’15, JMLR.org, 2015, p. 1737–1746.
[45] Y. Li, Y. Duan, Z. Huang, C. Hong, C. Zhang, Y. Song, Efficient 3PC for binary circuits with
     application to Maliciously-Secure DNN inference, in: 32nd USENIX Security Symposium (USENIX
     Security 23), USENIX Association, Anaheim, CA, 2023, pp. 5377–5394.
[46] P. Mohassel, Y. Zhang, SecureML: A system for scalable privacy-preserving machine learning, in:
     2017 IEEE Symposium on Security and Privacy, IEEE Computer Society Press, San Jose, CA, USA,
     2017, pp. 19–38. doi:10.1109/SP.2017.12.
[47] A. Patra, A. Suresh, BLAZE: Blazing fast privacy-preserving machine learning, in: ISOC Network
     and Distributed System Security Symposium – NDSS 2020, The Internet Society, San Diego, CA,
     USA, 2020, pp. 1–18.
[48] A. P. K. Dalskov, D. Escudero, M. Keller, Fantastic four: Honest-majority four-party secure
     computation with malicious security, in: M. Bailey, R. Greenstadt (Eds.), USENIX Security 2021:
     30th USENIX Security Symposium, USENIX Association, 2021, pp. 2183–2200.
[49] T. Ryffel, P. Tholoniat, D. Pointcheval, F. R. Bach, AriaNN: Low-interaction privacy-preserving
     deep learning via function secret sharing, Proceedings on Privacy Enhancing Technologies 2022
     (2022) 291–316. doi:10.2478/popets-2022-0015.
[50] K. Gupta, D. Kumaraswamy, N. Chandran, D. Gupta, LLAMA: A low latency math library
     for secure inference, Proceedings on Privacy Enhancing Technologies 2022 (2022) 274–294.
     doi:10.56553/popets-2022-0109.
[51] I. Damgård, D. Escudero, T. K. Frederiksen, M. Keller, P. Scholl, N. Volgushev, New primitives
     for actively-secure MPC over rings with applications to private machine learning, in: 2019 IEEE
     Symposium on Security and Privacy, IEEE Computer Society Press, San Francisco, CA, USA, 2019,
     pp. 1102–1120. doi:10.1109/SP.2019.00078.
[52] D. Escudero, S. Ghosh, M. Keller, R. Rachuri, P. Scholl, Improved primitives for MPC over mixed
     arithmetic-binary circuits, in: D. Micciancio, T. Ristenpart (Eds.), Advances in Cryptology –
     CRYPTO 2020, Part II, volume 12171 of Lecture Notes in Computer Science, Springer, Heidelberg,
     Germany, Santa Barbara, CA, USA, 2020, pp. 823–852. doi:10.1007/978-3-030-56880-1_29.
[53] S. Wagh, D. Gupta, N. Chandran, SecureNN: 3-party secure computation for neural network
     training, Proceedings on Privacy Enhancing Technologies 2019 (2019) 26–49. doi:10.2478/
     popets-2019-0035.
[54] M. Keller, K. Sun, Effectiveness of MPC-friendly Softmax Replacement, arXiv, 2020. doi:10.48550/
     ARXIV.2011.11202.
[55] D. Beaver, Efficient multiparty protocols using circuit randomization, in: J. Feigenbaum (Ed.), Ad-
     vances in Cryptology – CRYPTO’91, volume 576 of Lecture Notes in Computer Science, Springer, Hei-
     delberg, Germany, Santa Barbara, CA, USA, 1992, pp. 420–432. doi:10.1007/3-540-46766-1_
     34.
[56] A. Patra, T. Schneider, A. Suresh, H. Yalame, ABY2.0: Improved mixed-protocol secure two-party
     computation, Cryptology ePrint Archive, Report 2020/1225, 2020. https://eprint.iacr.org/2020/1225.
[57] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
     Attention is All you Need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
     wanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30,
     Curran Associates, Inc., 2017, pp. 1–11.
[58] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, X. Hu, Harnessing the power of
     LLMs in practice: A survey on ChatGPT and beyond, ACM Transactions on Knowledge Discovery
     from Data (2023).
[59] K. Fukushima, Cognitron: A self-organizing multilayered neural network, Biological cybernetics
     20 (1975) 121–136.
[60] D. Hendrycks, K. Gimpel, Gaussian error linear units (GELUs), arXiv preprint arXiv:1606.08415
     (2016).
[61] P. Van Fleet, Discrete Wavelet Transformations: An Elementary Approach with Applications,
     Wiley, 2019.
[62] C. Juvekar, V. Vaikuntanathan, A. Chandrakasan, GAZELLE: A low latency framework for secure
     neural network inference, in: W. Enck, A. P. Felt (Eds.), USENIX Security 2018: 27th USENIX
     Security Symposium, USENIX Association, Baltimore, MD, USA, 2018, pp. 1651–1669.
[63] J. Liu, M. Juuti, Y. Lu, N. Asokan, Oblivious neural network predictions via MiniONN transforma-
     tions, in: B. M. Thuraisingham, D. Evans, T. Malkin, D. Xu (Eds.), ACM CCS 2017: 24th Conference
     on Computer and Communications Security, ACM Press, Dallas, TX, USA, 2017, pp. 619–631.
     doi:10.1145/3133956.3134056.
[64] M. S. Riazi, M. Samragh, H. Chen, K. Laine, K. E. Lauter, F. Koushanfar, XONN: XNOR-based
     oblivious deep neural network inference, in: N. Heninger, P. Traynor (Eds.), USENIX Security
     2019: 28th USENIX Security Symposium, USENIX Association, Santa Clara, CA, USA, 2019, pp.
     1501–1518.
[65] A. C.-C. Yao, How to generate and exchange secrets (extended abstract), in: 27th Annual
     Symposium on Foundations of Computer Science, IEEE Computer Society Press, Toronto, Ontario,
     Canada, 1986, pp. 162–167. doi:10.1109/SFCS.1986.25.
[66] M. Keller, E. Orsini, P. Scholl, MASCOT: Faster malicious arithmetic secure computation with
     oblivious transfer, in: E. R. Weippl, S. Katzenbeisser, C. Kruegel, A. C. Myers, S. Halevi (Eds.),
     ACM CCS 2016: 23rd Conference on Computer and Communications Security, ACM Press, Vienna,
     Austria, 2016, pp. 830–842. doi:10.1145/2976749.2978357.
[67] M. Keller, MP-SPDZ: A versatile framework for multi-party computation, in: J. Ligatti, X. Ou,
     J. Katz, G. Vigna (Eds.), ACM CCS 2020: 27th Conference on Computer and Communications
     Security, ACM Press, Virtual Event, USA, 2020, pp. 1575–1590. doi:10.1145/3372297.3417872.
[68] J. Doerner, a. shelat, Scaling ORAM for secure computation, in: B. M. Thuraisingham, D. Evans,
     T. Malkin, D. Xu (Eds.), ACM CCS 2017: 24th Conference on Computer and Communications
     Security, ACM Press, Dallas, TX, USA, 2017, pp. 523–535. doi:10.1145/3133956.3133967.
[69] E. Boyle, N. Gilboa, Y. Ishai, Secure computation with preprocessing via function secret sharing,
     in: D. Hofheinz, A. Rosen (Eds.), TCC 2019: 17th Theory of Cryptography Conference, Part I,
     volume 11891 of Lecture Notes in Computer Science, Springer, Heidelberg, Germany, Nuremberg,
     Germany, 2019, pp. 341–371. doi:10.1007/978-3-030-36030-6_14.
[70] R. Canetti, Security and composition of multiparty cryptographic protocols, Journal of CRYPTOL-
     OGY 13 (2000) 143–202.
[71] Y. Lindell, How to simulate it - A tutorial on the simulation proof technique, Cryptology ePrint
     Archive, Report 2016/046, 2016. https://eprint.iacr.org/2016/046.
[72] B. Chor, O. Goldreich, E. Kushilevitz, M. Sudan, Private information retrieval, in: 36th Annual
     Symposium on Foundations of Computer Science, IEEE Computer Society Press, Milwaukee,
     Wisconsin, 1995, pp. 41–50. doi:10.1109/SFCS.1995.492461.
[73] S. Wagh, S. Tople, F. Benhamouda, E. Kushilevitz, P. Mittal, T. Rabin, Falcon: Honest-majority
     maliciously secure framework for private deep learning, Proceedings on Privacy Enhancing
     Technologies 2021 (2021) 188–208. doi:10.2478/popets-2021-0011.
[74] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, GLUE: A multi-task benchmark and
     analysis platform for natural language understanding, in: T. Linzen, G. Chrupała, A. Alishahi
     (Eds.), Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting
     Neural Networks for NLP, Association for Computational Linguistics, Brussels, Belgium, 2018, pp.
     353–355. doi:10.18653/v1/W18-5446.
[75] S. Elfwing, E. Uchibe, K. Doya, Sigmoid-weighted linear units for neural network function
     approximation in reinforcement learning, Neural Networks 107 (2018) 3–11. doi:10.1016/j.
     neunet.2017.12.012.
[76] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language Models are Unsupervised
     Multitask Learners, 2019.
Appendices

A. Proof of Theorem 3.1
We consider the probabilistic truncation protocol of Escudero et al. [52], denoted ΠTruncPr 𝑛,𝑓   (J𝑥K) and
 shown in Fig. 9. Below we prove Theorem 3.1, i.e. we show that protocol Π𝑛,𝑓 TruncPr (J𝑥K) securely realizes
 the modified probabilistic truncation functionality ℱ𝑛,𝑓   TruncPr (J𝑥K) shown in Fig. 3 in Section 3. To
 simplify notation, we assume below that ℓ = 𝑛 − 1, in which case ˜𝑠 = 𝑠 = (𝑥 + ˜𝑟) mod 2𝑛 for
˜𝑟 = 2𝑛−1 𝑏 + 2𝑓 𝑟 + 𝑟′ . In the general case of ℓ < 𝑛 all arguments are the same except that computation
 is done over Z2ℓ+1 instead of Z2𝑛 .

   ΠTruncPr
    𝑛,𝑓     (J𝑥K)


   1. Assume parties hold J𝑥K ∈ Z2𝑛 for 𝑥 < 2ℓ , ℓ < 𝑛.
   2. Assume parties hold (J𝑟K, J𝑟′ K, J𝑏K) from pre-processing where 𝑟, 𝑟′ , 𝑏 are random in resp. Z2ℓ−𝑓 ,
      Z2𝑓 , and Z2 .
   3. Parties set J𝑠K = 2𝑛−(ℓ+1) · (J𝑥K + 2ℓ J𝑏K + 2𝑓 J𝑟K + J𝑟′ K) and publicly reconstruct 𝑠 from J𝑠K.
   4. Denote 𝑠 = 2𝑛−(ℓ+1) · ˜𝑠 and note that ˜𝑠 = (𝑥 + ˜𝑟) mod 2ℓ+1 where ˜𝑟 = 2ℓ 𝑏 + 2𝑓 𝑟 + 𝑟′ < 2ℓ+1 .
   5. Parties compute J𝑣K = J𝑏 ⊕ ˜𝑠ℓ K = J𝑏K + ˜𝑠ℓ · (1 − 2J𝑏K), where ˜𝑠ℓ is the ℓ-th bit of ˜𝑠.
   6. Parties output J𝑥′ K = 2ℓ−𝑓 J𝑣K + ⌊(𝑠˜ mod 2ℓ )/2𝑓 ⌋ − J𝑟K.

Figure 9: Probabilistic truncation protocol of [52].


Proof. The simulator receives the shares of corrupted parties in sharings J𝑥K, J𝑟K, J𝑟′ K, J𝑏K and a random
value 𝑠(𝑓 ) ∈ Z2𝑓 from functionality ℱ𝑛,𝑓  TruncPr (J𝑥K). On these inputs the simulator picks a random

value 𝑠(𝑓 ) in Z2𝑛−𝑓 , sets 𝑠 = 2𝑓 · 𝑠(𝑓 ) + 𝑠(𝑓 ) , and simulates the reconstruction of secret-sharing
J𝑠K = J𝑥 + ˜𝑟K = J𝑥K + 2𝑛−1 J𝑏K + 2𝑓 J𝑟K + J𝑟′ K into value 𝑠. Note that all other computations in the
protocol are local given the input shares and the revealed value 𝑠, so the simulator’s work is done.
  Note that ΠTruncPr
                𝑛,𝑓    (J𝑥K) reveals only 𝑠 reconstructed from secret-sharing J𝑠K = J𝑥 + ˜𝑟K. Since
𝑠 = 𝑥 + ˜𝑟 mod 2𝑛 and ˜𝑟 is random in Z2𝑛 it follows that 𝑠 is uniform in Z2𝑛 for every 𝑥, hence the
simulation above is perfect. Note also that since the simulator sets 𝑠 as 2𝑓 · 𝑠(𝑓 ) + 𝑠(𝑓 ) , output 𝑠(𝑓 ) in
the simulation agrees with value 𝑠(𝑓 ) set by the functionality.
  It remains to argue that distribution of 𝑥′ output by the ΠTruncPr𝑛,𝑓     (J𝑥K) is the same as in the ideal-
                       TruncPr
world functionality ℱ𝑛,𝑓       (J𝑥K). In the ideal-world functionality, value 𝑥′ is a function of 𝑥 and 𝑠(𝑓 ) ,
determined by the following formula:
                                           {︂ (𝑓 )
                                       ′        𝑥        if 𝑠(𝑓 ) ≥ 𝑥(𝑓 ) ,
                                      𝑥 =      (𝑓 )                                                        (3)
                                              𝑥 +1          otherwise.
We argue that 𝑥′ is determined in the same way in protocol ΠTruncPr        𝑛,𝑓   (J𝑥K). Recall that for any 𝑣, we
         (𝑓 )       𝑓                           𝑓                    (𝑓 )    𝑓
denote 𝑣 = ⌊𝑣/2 ⌋ and 𝑣(𝑓 ) = (𝑣 mod 2 ), hence 𝑣 = 𝑣 · 2 + 𝑣(𝑓 ) . Correctness of the protocol
is based on the observation that if 𝑠 = 𝑥 + ˜𝑟 mod 2𝑛 then 𝑠 = 𝑥 + (2𝑓 𝑟 + 𝑟′ ) − 2𝑛−1 𝑣 over integers,
where 𝑣 = 𝑏 ⊕ 𝑠𝑛−1 , equivalently 𝑠 = 2𝑓 (𝑥(𝑓 ) + (𝑟 − 𝑣˜)) + (𝑥(𝑓 ) + 𝑟′ ) where 𝑣˜ = (2(𝑛−1)−𝑓 )𝑣. There
are two cases for how 𝑠(𝑓 ) and 𝑠(𝑓 ) relate to 𝑥(𝑓 ) , 𝑟, 𝑣˜ and 𝑥(𝑓 ) , 𝑟′ :

                                                                  if 𝑥(𝑓 ) + 𝑟′ < 2𝑓 ,
                                    {︂ (𝑓 )
                             (𝑓 )        𝑥 + (𝑟 − 𝑣˜)
                            𝑠 =                                                                                (4)
                                      𝑥(𝑓 ) + (𝑟 − 𝑣˜) + 1             otherwise.
                                             𝑥(𝑓 ) + 𝑟′        if 𝑥(𝑓 ) + 𝑟′ < 2𝑓 ,
                                       {︂
                              𝑠(𝑓 ) =                                                                          (5)
                                          𝑥(𝑓 ) + 𝑟′ − 2𝑓           otherwise.
Since J𝑥′ K is set to 𝑠(𝑓 ) − J𝑟 − 𝑣˜K, equation (4) implies that:

                                                                   if 𝑥(𝑓 ) + 𝑟′ < 2𝑓 ,
                                                     {︂ (𝑓 )
                              ′    (𝑓 )                  𝑥
                             𝑥 = 𝑠 − (𝑟 − 𝑣˜) =         (𝑓 )                                                (6)
                                                       𝑥 +1            otherwise.

However, equation (5) implies that:

                      𝑥(𝑓 ) +𝑟′ < 2𝑓     if and only if      𝑠(𝑓 ) = 𝑥(𝑓 ) +𝑟′ ≥ 𝑥(𝑓 )
                                                                                                            (7)
                      𝑥(𝑓 ) +𝑟′ ≥ 2𝑓     if and only if 𝑠(𝑓 ) = 𝑥(𝑓 ) +(𝑟′ −2𝑓 ) < 𝑥(𝑓 ) .

Equations (6) and (7) together imply the same expression of 𝑥′ according to 𝑥(𝑓 ) and 𝑠(𝑓 ) as in equation
(3). We conclude that 𝑥′ output by protocol ΠTruncPr
                                               𝑛,𝑓    (J𝑥K) is set by the same formula as in the ideal
                TruncPr
functionality ℱ𝑛,𝑓      (J𝑥K).

  In the proof above, note that the randomness of the opened masked value is part of the view of the
adversary, so the rounding error is fixed in both the real and ideal worlds by the adversary’s view.
This argument can be extended to all probabilistic truncation protocols that follow the same pattern
regarding masking the input value with randomness. More specifically, the proof can be adapted for
the protocols with non-zero correctness error, including [24] and [51], and for other protocols with
perfect correctness, e.g., [11].


B. Bounded functions
Recall that an odd function F obeys the following property F(−𝑥) = −F(𝑥) and an even function F
is such that F(−𝑥) = F(𝑥). In this section, we generalize the approach taken by Sigma for the GeLU
function [15] and apply it to any bounded odd or even function that converges to some constant 𝑐 for
|𝑥| → ∞ (e.g., s-shape functions). This method can be applied to several important functions: s-shape
functions, used as activation functions (hyperbolic tangent tanh and error function erf); shifted s-shape
functions, like sigmoid; and more general functions such as GeLU and SiLU.
   The odd/even property guarantees that we only need to compute the function for positive values.
This effectively halves the size of the required LUT. To achieve this, we compute the sign of the input
and multiply it by its value, then get its absolute value. The convergence property ensures that, given
some acceptable error, we can clip the function for some interval [−𝑎, 𝑎] and assign a constant value 𝑐
to all elements outside that interval. We note that [15] sets 𝑐 = F(𝑎) but we set the constant 𝑐 to be
the limit value of the function F. Outside [−𝑎, 𝑎], we claim that the error incurred by [15] is bigger for
most of the inputs but for a short interval. Intuitively, Fig. 10 shows the errors incurred by Curl’s and
Sigma’s approach to the Sigmoid function. Observe that the error following Sigma’s approach ( red
area) is bigger than the error by Curl’s approach ( green area) outside the interval [2, 2.76].

                                𝜎(𝑥)
                                                                               Curl
                            1
                         0.94
                         0.88
                                                                               [20]
                                             [20] has              Curl has
                                           lower error           lower error
                          0.5
                                                                                          𝑥
                                            2    2.76

Figure 10: Analysis of error incurred by Curl’s ( green ) and Sigma’s ( red ) approach on approximating Sigmoid
function 𝜎(𝑥) for 𝑥 > 2. Curl’s approach sets the approximation 𝜎(𝑥)      ̃︀    = 1 and Sigma’s approach sets
𝜎(𝑥)
̃︀    = 𝜎(2) for 𝑥 > 2. Sigma’s approach error is lower in [2, 2.76) and Curl’s approach error is lower in
(2.76, +∞).
   ΠF𝑛,𝑎,𝑐,𝑝,𝑗 (J𝑥K)

  Parameters: 𝑛 is the full bit-width, 𝑎 defines the interval in which the bior-LUT protocol is applied,
  𝑐 is the constant value for the function outside the interval, 𝑝 is the parity of F (𝑝 = 1 for odd
  functions and 𝑝 = 0 for even functions) and 𝑗 is the compression variable for the bior-LUT protocol.
  Compute sign and absolute value, Jsgn(𝑥)K and J|𝑥|K:
   1. ⟨⟨𝑥⟩⟩ ← ΠA2B          𝑛
               𝑛,𝑛 (J𝑥K) ∈ Z2
                                                                                  Compute comparison bit:
   2. ⟨⟨𝑏⟩⟩ ← ⟨⟨𝑦⟩⟩ ≫ (𝑛 − 1) ∈ Z2
                                                                                       𝑏 = 𝑥 < 0.
   3. J𝑏K ← ΠB2A
             1,𝑛 (⟨⟨𝑏⟩⟩) ∈ Z2𝑛

   4. Jsgn(𝑥)K ← 1 − 2 · J𝑏K ∈ Z2𝑛
   5. J|𝑥|K ← Jsgn(𝑥)K · J𝑥K ∈ Z2𝑛

  Check |𝑥| < 𝑎:
   6. J𝑧K ← J|𝑥|K − 𝑎 ∈ Z2𝑛
   7. ⟨⟨𝑧⟩⟩ ← ΠA2B          𝑛
               𝑛,𝑛 (J𝑧K) ∈ Z2                                                     Compute comparison bit:
   8. ⟨⟨𝑖⟩⟩ ← ⟨⟨𝑧⟩⟩ ≫ (𝑛 − 1) ∈ Z2                                                    𝑖 = |𝑥| < 𝑎.

   9. J𝑖K ← ΠB2A
             1,𝑛 (⟨⟨𝑖⟩⟩) ∈ Z2𝑛

  Compute F:
  10. J𝑦 ′ K ← ΠBior-LUT
                𝑛,𝑛,𝑗,F (J|𝑥|K)
                                                                                           {︃
                                                                                                𝑦 ′ , if |𝑥| < 𝑎
  11. J𝑦K ← (J𝑖K · J𝑦 K + (1 − J𝑖K) · 𝑐)
                       ′
                                                                ◁ Computes share of: 𝑦 =
                                                                                                𝑐, otherwise.

  12. Output: Jsgn(𝑥)K𝑝 · J𝑦K ∈ Z2𝑛              ◁ Equivalent to J𝑦K for even functions and J−𝑦K otherwise.

Figure 11: Protocol for convergent bounded odd (𝑝 = 1) or even (𝑝 = 0) function F based on the secure
biorthogonal-encoded LUT protocol.


   The ΠF𝑛,𝑎,𝑐,𝑝,𝑗 (J𝑥K) protocol is formally shown in Fig. 11. Recall from Section 2.1, that ΠA2B        B2A
                                                                                               𝑛,𝑙 and Π𝑛,𝑙
represent the arithmetic to binary and binary to arithmetic conversion protocols. The protocol goes as
follows. For input value 𝑥, the parties start by computing the signal of 𝑥 and its absolute value. Note
that the signal value can be computed using the expression sgn(𝑥) = 1 − 2 · (𝑥 < 0) and the absolute
value of 𝑥 can be computed as |𝑥| = sgn(𝑥) · 𝑥. The absolute value is then used to check whether 𝑥
lies within a predetermined interval [−𝑎, 𝑎] or not. Then, parties run the secure Biorthogonal protocol
ΠBior-LUT                                         ′                                         ′
  𝑛,𝑙,𝑗,F (J𝑥K) to evaluate on |𝑥|, returning 𝑦 . In case 𝑥 ∈ [−𝑎, 𝑎], parties output 𝑦 , otherwise they
output the limit value 𝑐 of the function F. This is effectively computed by the operation 𝑖 · 𝑦 ′ + (1 − 𝑖) · 𝑐,
where 𝑖 is the comparison bit for |𝑥| < 𝑎. In the last step, the output value is corrected in case the
function is odd: returning sgn(𝑥) · 𝑦. This is set by the parity parameter 𝑝, which is 𝑝 = 0 for even
functions and 𝑝 = 1 for odd functions.

B.1. S-shape functions
In the previous section, we presented a protocol that can be applied to any bounded odd or even function
that is convergent. S-shape functions are bounded functions that converge to some value 𝑐 for |𝑥| → ∞
but are not necessarily odd or even. For example, we have that tanh and erf are odd functions but the
sigmoid function 𝜎(𝑥) is not. However, we note that we can transform any s-shape function to an odd
function by applying a shift. For example, the sigmoid function 𝜎(𝑥) can be shifted by 1/2, resulting in
an odd function: 𝜎(𝑥) − 1/2.
   Generally, observe that, for an s-shape function F, there exists a shift value 𝑠, such that F(−𝑥) − 𝑠 =
−F(𝑥) + 𝑠 ⇔ F(−𝑥) = 2 · 𝑠 − F(𝑥). So, setting for 𝑏 = 𝑥 < 0, we can change the last step (12) of
protocol ΠF𝑛,𝑎,𝑐,𝑝,𝑗 (J𝑥K) to evaluate the function F:
                              F(𝑥) = 𝑏 · F(−|𝑥|) + (1 − 𝑏) · F(|𝑥|)
                                   = 𝑏 · (2 · 𝑠 − F(|𝑥|)) + (1 − 𝑏) · F(|𝑥|)
                                   = 𝑏 · 2 · 𝑠 + sgn(𝑥) · F(|𝑥|),
where sgn(𝑥) = (1 − 2 · 𝑏). Thus, for the sigmoid function where 𝑠 = 1/2, the last step (12) from
Fig. 11 is as follows: J𝑏K + Jsgn(𝑥)K · J𝑦K. Note that for both tanh and erf 𝑏 = 0, recovering the initial
expression.
Parameters. Now, for each of the three s-shape functions, we need to define the following parameters:
bitwidth 𝑛, the interval [−𝑎, 𝑎], the constant 𝑐, the parity 𝑝 and the compression value 𝑗. We use the
hardcoded bitwidth value from CrypTen 𝑛 = 64, imposed by the PyTorch dependency. Also, 𝑐, 𝑝 = 1
for all three s-shape functions. We note that [15] clips the interval in GeLu and accepts a constant
error on the order of ∼ 10−5 . We follow a similar approach and accept errors < 10−5 . For the error
function (erf), hyperbolic tangent (tanh) and sigmoid (sigmoid) we set the interval to be [−4, 4], [−8, 8]
and [−16, 16], respectively. These intervals incurr a maximum error on the order of 10−8 , 10−7 and
10−7 , and the size of the original LUT is 218 , 219 and 220 , respectively.

B.2. Activation Functions: GeLU And SiLU
In this section, we consider two popular activation functions: GeLU [60] and SiLU [75], given by
                                                 (︂        (︂    )︂)︂
                                               𝑥              𝑥
                                  GeLU(𝑥) :=       1 + erf √          ,
                                               2               2
where erf(·) denotes the Gauss error function and
                                      SiLU(𝑥) := 𝑥 · sigmoid(𝑥).
These activation functions are sometimes considered smooth approximations to the classical ReLU of
[59]. We observe that the functions DGeLU := ReLU(𝑥) − GeLU(𝑥) and DSiLU := ReLU(𝑥) − SiLU(𝑥)
are even functions. Regarding DGeLU , note that
                                               (︂           (︂    )︂)︂
                                             1                 𝑥
                                   DGeLU =       |𝑥| − 𝑥 erf √         ,
                                             2                  2
so the claim follows as |𝑥| is even and 𝑥 and erf are both odd. Regarding DSiLU , observe that 𝜎(−𝑥) =
1 − 𝜎(𝑥), for sigmoid 𝜎. Thus, we have:
                                              {︃
                                                𝑥 · (1 − 𝜎(𝑥)) 𝑥 ≥ 0,
                                  DSiLU (𝑥) =
                                                −𝑥 · 𝜎(𝑥)        𝑥<0
                                              {︃
                                                𝑥 · 𝜎(−𝑥) 𝑥 ≥ 0,
                                            =
                                                −𝑥 · 𝜎(𝑥) 𝑥 < 0
                                           = DSiLU (−𝑥).
Since these two activation functions are also bounded and converge to 0 for |𝑥| → ∞, we can use
ΠF𝑛,𝑎,𝑐,𝑝,𝑗 protocol (Fig. 11) to compute both GeLU and SiLU by evaluating: GeLU(𝑥) = ReLU(𝑥) −
ΠD                                             DSiLU
  𝑛,𝑎,𝑐,𝑝,𝑗 (𝑥) and SiLU(𝑥) = ReLU(𝑥) − Π𝑛,𝑎,𝑐,𝑝,𝑗 (𝑥).
    GeLU

   Note that ReLU can be computed using ReLU(𝑥) = (𝑥 < 0) · 𝑥. Since we compute 𝑥 < 0 in step 3.
of the ΠF𝑛,𝑎,𝑐,𝑝,𝑗 protocol (Fig. 11), we can reuse this value.
Parameters. Again, we use the hardcoded bitwidth value from CrypTen 𝑛 = 64 and 𝑐, 𝑝 = 0. As noted
before, [15] clips the interval in GeLU and accepts a constant error on the order of ∼ 10−5 . We follow
a similar approach and accept errors < 10−5 . For GeLU and SiLU we set the interval to be [−4, 4] and
[−16, 16], respectively. These intervals incur a maximum error on the order of 10−8 and 10−7 , and the
size of the original LUT is 218 and 220 , respectively.
   Πsin
    𝑛,𝑓,𝑗 (J𝑥K)

  Parameters: 𝑛 is the full bit-width, 𝑓 is the number of bits used for the fractional part, and 𝑝 = 0 for
  even functions) and 𝑗 is the compression variable for the Biorthogonal protocol.
  Compute sign and absolute value, Jsgn(𝑥)K and J|𝑥|K:
   1. ⟨⟨𝑥⟩⟩ ← ΠA2B
               𝑛,𝑛 (J𝑥K)

   2. ⟨⟨𝑏⟩⟩ ← ⟨⟨𝑦⟩⟩ ≫ (𝑛 − 1)
   3. J𝑏K ← ΠB2A
             1,𝑛 (⟨⟨𝑏⟩⟩)

   4. Jsgn(𝑥)K ← 1 − 2 · J𝑏K
   5. J|𝑥|K ← Jsgn(𝑥)K · J𝑥K

  Compute sin:
   6. Use encoding of 2𝜋, 𝑞 := ⌊2𝜋 · 2𝑓 ⌋, and compute J|𝑥′ |K ← Πdiv (J|𝑥|K, 𝑞).
   7. J𝑦 ′ K ← ΠBior-LUT           ′
                𝑛,𝑛,𝑗,sin(2𝜋·) (J|𝑥 |K)

   8. Output: Jsgn(𝑥)K · J𝑦K ∈ Z2𝑛                                                  ◁ For cos it returns J𝑦K.

Figure 12: Protocol for sin function based on the secure Biorthogonal protocol.


B.3. Trigonometric functions
It is known that sin is an odd function and cos is an even function. However, these two functions do not
converge, making protocol ΠF𝑛,𝑎,𝑐,𝑝,𝑗 (Fig. 11) unsuitable. Despite this, we can still use the core concepts
from that protocol and adapt them to these trigonometric functions. The key insight is that both
functions are periodic with a period of 2𝜋. We leverage this periodicity by scaling these trigonometric
functions to have period 1. This is achieved by generating LUT for sin(2𝜋 · 𝑥) and cos(2𝜋𝑥), where
𝑥 ∈ [0, 1]. Then, for input 𝑥, we evaluate the LUT using the secure Biorthogonal protocol at point
𝑥/(2𝜋). We execute division (Πdiv ) based on the Escudero et al. [52, §5.1] probabilistic truncation
protocol over rings. The protocol for sin is formally described in Fig. 12. The cos protocol is similar but
does not require the sign correction in the last step.

B.4. The Softmax Function
Recall that softmax is given by the following expression:

                                                  𝑒𝑥𝑖 −max(𝑥)
                                          𝑦𝑖 = ∑︀𝑘−1              ,
                                                       𝑥𝑗 −max(𝑥)
                                                 𝑗=0 𝑒

where 𝑥 ∈ R𝑘 and max is the maximum operation over the vector 𝑥. To be able to compute the
expression above, we need three main functions: maximum, natural exponentiation, and reciprocal.
The maximum operation is computed using the protocol implemented in CrypTen (Section C.1.4 [12]).
We note that both CrypTen adopts a tree-reduction algorithm but also provides a pairwise method with
constant round complexity.
Reciprocal. From the softmax expression, we have that there exists at least one index 𝑗 such that
𝑒𝑥𝑗 −max(𝑥) = 1. Also, since there are at most 𝑘 terms, we have that the reciprocal function can be only
evaluated within the [1, 𝑘] interval. We set 𝑘 = 64 as per in GPT-2 model [76]. Then, we apply the
secure ΠBior-LUT
         𝑛,𝑙,𝑗,F to the interval [1, 64].
Table 5
Comparison of the mean absolute error (MAE) between Haar and Bioriothogonal LUT protocol with probabilistic
and deterministic truncation in (1, 64).
                                 Probabilistic Truncation      Deterministic Truncation
                      Op.
                                 Haar       Biorthogonal        Haar       Biorthogonal
                      log        8.63e-3        2.53e-2        9.76e-03       4.63e-04
                  reciprocal     2.26e-3        4.58e-4        1.87e-03       5.79e-04
                      sqrt       1.29e-1        8.84e-4        6.04e-02       6.41e-05
                    invsqrt      7.58e-4        5.40e-5        9.36e-04       2.20e-05
                  §
                      The light blue background indicates the lowest MAE error.


C. Truncation with Haar vs Biorthogonal
We compare the accuracy of Haar and Biorthogonal DWT approaches using both probabilistic and
deterministic truncation methods in Table 5. Our findings indicate that the Biorthogonal approach
consistently outperforms the Haar approach. The reason for this is that Haar averages out two consecu-
tive values whereas Birthogonal uses five consecutive elements, thus retaining more information and
capturing the trend of the function. Finally, for the both Haar and Biorthogonal approaches we observe
only a marginal difference between the probabilistic and deterministic truncation protocols.