Curl: Private LLMs through Wavelet-Encoded Look-Up Tables Manuel B. Santos1,* , Dimitris Mouris1 , Mehmet Ugurbil1 , Stanislaw Jarecki1 , José Reis1 , Shubho Sengupta2 and Miguel de Vega1 1 Nillion, Zählerweg 5, 6300 Zug, Switzerland 2 Meta Platforms, Menlo Park, California 94025, United States Abstract Recent advancements in transformers have revolutionized machine learning, forming the core of Large language models (LLMs). However, integrating these systems into everyday applications raises privacy concerns as client queries are exposed to model owners. Secure multiparty computation (MPC) allows parties to evaluate machine learning applications while keeping sensitive user inputs and proprietary models private. Due to inherent MPC costs, recent works introduce model-specific optimizations that hinder widespread adoption by machine learning researchers. CrypTen (NeurIPS’21) aimed to solve this problem by exposing MPC primitives via common machine learning abstractions such as tensors and modular neural networks. Unfortunately, CrypTen and many other MPC frameworks rely on polynomial approximations of the non-linear functions, resulting in high errors and communication complexity. This paper introduces Curl, an easy-to-use MPC framework that evaluates non-linear functions as lookup tables, resulting in better approximations and significant round and communication reduction. Curl exposes a similar programming model as CrypTen and is highly parallelizable through tensors. At its core, Curl relies on discrete wavelet transformations to reduce the lookup table size without sacrificing accuracy, which results in up to 19× round and communication reduction compared to CrypTen for non-linear functions such as logarithms and reciprocals. We evaluate Curl on a diverse set of LLMs, including BERT, GPT-2, and GPT Neo, and compare against state-of-the-art related works such as Iron (NeurIPS’22) and Bolt (S&P’24) achieving at least 1.9× less communication and latency. Finally, we resolve a long-standing debate regarding the security of widely used probabilistic truncation protocols by proving their security in the stand-alone model. This is of independent interest as many related works rely on this truncation style. Keywords Large Language Models (LLMs), Privacy-Enhancing Technologies (PETs), Secure Multiparty Computation (MPC) 1. Introduction Large language models (LLMs) like GPT-2, GPT-4 [1], BERT [2], and LLaMA [3] have emerged as prime examples showcasing the capabilities of artificial intelligence. LLMs assist individuals and businesses in everyday tasks; from machine translation [4], to text generation [5], and question answering [6], among others. To generate human-like responses, LLMs have been trained in vast amounts of data and continue learning through their interactions with users. However, as LLMs become increasingly integrated into human lives, privacy becomes a critical concern as individuals frequently share sensitive information, including names, addresses, and credit card numbers, or even financial and healthcare information [7]. On top of that, there is a shift towards personalized AI, with OpenAI enabling a memory feature to ChatGPT [8]. For an LLM-powered assistant to be able to understand individual preferences, habits, and workflows and provide tailored assistance, it would require access to an immense amount of personal data. As this personalized data is retained CAMLIS’24: Conference on Applied Machine Learning for Information Security, October 24–25, 2024, Arlington, VA * Corresponding author. $ manuel.santos@nillion.com (M. B. Santos); dimitris@nillion.com (D. Mouris); memo@nillion.com (M. Ugurbil); stasio@ics.uci.edu (S. Jarecki); jose.reis@nillion.com (J. Reis); ssengupta@meta.com (S. Sengupta); miguel@nillion.com (M. de Vega)  0000-0003-4156-8607 (M. B. Santos); 0000-0002-2601-203X (D. Mouris); 0009-0003-5042-6851 (M. Ugurbil); 0000-0002-5055-2407 (S. Jarecki); 0000-0001-8022-7499 (J. Reis) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings and used to continuously improve the LLMs, the possibility of a data breach or unauthorized access poses huge risks to individuals’ privacy. Privacy-enhancing technologies (PETs) such as multiparty computation (MPC) [9, 10] enable a plethora of privacy-preserving machine learning (PPML) use-cases. In the most prominent setting, a server possesses a proprietary model and aims to provide it as a service to clients for use with their private data [11, 12, 13, 14, 15]. The objective is for clients to receive only the inference results without gaining any knowledge about the model, while the model provider does not learn anything about clients’ input. Another popular PPML use case involves multiple distrusting parties to securely train a model over their joint sensitive data without revealing anything about their data to each other [16, 17, 18, 19, 20]. Several MPC systems leverage linear secret sharing schemes as they allow linear operations to be performed with minimal communication overhead [21, 22]. However, non-linear operations (e.g., square roots, logarithms) often present greater challenges and require specialized techniques such as polynomial approximations [12], look-up table evaluations [23], or other protocol-specific approaches [24, 25]. Furthermore, incorporating domain knowledge can significantly enhance the efficiency and effectiveness of these protocols by tailoring the parameters to better fit specific use cases [26, 27, 28]. 1.1. Related Work 1.1.1. Focusing on Function Secret Sharing Recently, several secure computation works have emerged leveraging function secret sharing (FSS) [29, 23, 30, 31, 32]. Pika [23] extends the prior work of [29] by showcasing a novel approach to securely evaluate look-up tables (LUTs) through FSS. While it demonstrates efficacy in benchmarking against popular datasets like MNIST and CIFAR-10, its scalability poses challenges. As noted by Grotto [30], for large LUTs, the computational cost may render the protocol infeasible due to the extensive number of distributed point functions (DPF) evaluations required. In contrast, Curl circumvents this challenge by eliminating the need for DPF evaluations, thanks to the integration of the discrete wavelet transform (DWT) technique. On top of that, our technique can also benefit FSS-based frameworks by reducing the size of the LUTs, resulting in less computation and communication. On the other hand, Grotto [30] introduces innovative protocols leveraging custom splines and DPFs to handle a subset of functions efficiently. Yet, it faces challenges in terms of computational and communication overhead compared to alternatives like Sigma [15]. Orca [20] showcases the potential of GPU acceleration in FSS protocols, particularly tailored for convolutional neural networks (CNNs). However, its suitability for other architectures like transformers is questioned due to its reliance on heavy non-linearities [20]. Sigma [15] builds on top of Orca and distinguishes itself by relying on minimal-sized LUTs through protocols tailored for small ranges. Despite its efficiency from custom protocols with small-sized LUTs, it demands significant computing and communication resources (e.g., 1 TB RAM and around 9Gbps communication link). Furthermore, its use of deterministic truncation, although claiming improved security, falls short in speed compared to probabilistic truncation methods. Unfortunately, all FSS-based works focus on the two-party setting as FSS becomes impractical with more than two parties. 1.1.2. Focusing on Preprocessing The line of work initiated by Pika [23] focuses on the two-party setting with an additional dealer party. In this scenario, the preprocessing phase necessitates the dealer to prepare and distribute an 𝒪(𝑛)-sized element. Given the assumption that the dealer will not collude with any party, reducing dependency on such assumptions is desirable. This has led to the development of protocols that eliminate the need for a dealer. Notably, some previous works have aimed to achieve this in the two-party setting, enabling both parties to independently generate the required preprocessing material [33, 34, 35]. The one-time truth table (OTTT) protocol [33] employs a boolean circuit to represent the table for every possible input, resulting in an exponential complexity relative to the bit size. The OP-LUT protocol [34] attempts to enhance the OTTT preprocessing phase but only achieves improvements for small bit sizes, with the communication cost remaining exponential. The SP-LUT protocol [34] significantly enhances the preprocessing phase but modifies the online phase, leading to an exponential communication cost in the bit size during the online phase. FLUTE [35] offers advancements over previous works but still requires 𝒪(2𝑛 ) communication in the setup phase. Therefore, finding a preprocessing phase with sub-exponential communication complexity while maintaining the efficiency of the online phase remains an open challenge. 1.1.3. Focusing on Fully-Homomorphic Encryption Fully homomorphic encryption (FHE) schemes have benefited significantly from the use of LUTs to enhance performance and enable new applications. This concept was initially proposed by Ducas and Micciancio [36], who devised a method to evaluate arbitrary binary gates in FHE using LUTs. Building on this foundation, Chillotti et al. [37] implemented the evaluation of arbitrary functions through a tree of leveled multiplexers, leading to the development of the Torus FHE (TFHE) scheme. Despite their development of a fast bootstrapping mechanism, capable of execution in approximately 10 milliseconds on a CPU, its adoption has been limited. This is primarily because the control inputs for the multiplexers required fresh ciphertexts – meaning prior computation on them was not possible – and the approach necessitated expressing programs as deterministic automata. A further improvement over TFHE was presented in [38] with the introduction of the programmable bootstrapping (PBS) technique, allowing for efficient and general-purpose LUT evaluation. To enhance adaptability, HELM [39] expanded on the PBS technique and introduced a framework for automatically converting Verilog hardware description language (HDL) into encrypted circuits. HELM employs three modes of operation. The first mode exclusively processes binary gates. The second mode operates on integers using secure LUT evaluations. The third, mixed mode, works with binary circuits and "bridges" over to integers for secure LUT evaluations before returning to the binary domain. However, HELM is limited to very low-precision LUTs. This limitation arises because converting an integer to multiple bits requires multiple n-to-1 LUTs. Recently, Ripple [40] proposed using compression techniques based on discrete wavelet transform (DWT) theory to decrease the size of LUT, thereby accelerating the PBS technique for smooth functions. 1.1.4. Focusing on Secure LLM Inference Secure frameworks for LLM inference have only recently gained traction due to the inherent complexity of transformers compared to traditional neural networks. Techniques to ensure secure inference include notable implementations like THE-X [41] and Iron [26], which were among the first to use homomorphic encryption for matrix multiplication and non-linear functions. Subsequently, several studies have explored MPC techniques for private transformer inference system [27, 42]. MPCFormer [27] leverages the CrypTen [12] MPC engine, while Puma [42] utilizes the SecretFlow-SPU [43] MPC engine to assess MPC’s suitability for LLM inference. MPCFormer introduces a distillation process where stronger models train weaker models to improve accuracy, compensating for the limitations of using small (2-degree) approximations of non-linear functions. Nevertheless, MPCFormer faces challenges due to the loss of approximation accuracy, necessitating fine-tuning the models. On the other hand, Puma uses a piece-wise approximation of the GeLU activation function, requiring comparisons and polynomials up to degree 6. Recent research has increasingly focused on hybrid solutions, combining the best techniques for specific operations [28, 20]. Bolt [28] developed a solution integrating both fully homomorphic encryp- tion and MPC. Like Puma [42], Bolt uses a piece-wise approximation that requires comparison, which increases round complexity and communication. To boost efficiency, Bolt employs word elimination while maintaining accuracy. Additionally, it enhances MPCFormer’s polynomial approximations by increasing the polynomial degree and mitigates the corresponding efficiency loss through Motzkin’s polynomial pre-processing procedure. In contrast, Curl avoids both comparisons and polynomial approximations by using an optimized lookup table protocol that reduces the round and communication complexity typically added by polynomial approximations while maintaining accuracy. 1.1.5. Focusing on Secure Truncation Machine learning tasks employ floating-point numbers to construct precise models [44]. However, it is widely recognized that the use of floating-point numbers renders MPC impractical from an effi- ciency standpoint. For this reason, most works on machine learning under MPC employ fixed-point representation instead, requiring a truncation protocol to maintain constant precision. Among the available truncation methodologies – deterministic, nearest integer, and probabilistic – the deterministic and probabilistic approaches (see Section 3) are the most widely used in the secure computation setting [24, 45]. Moreover, a substantial body of research favors probabilistic truncation due to its lower communication costs and resultant enhanced performance, e.g., [46, 16, 47, 11, 48, 13, 18, 49]. Despite these advantages, some recent works express concerns about probabilistic truncation, and opt for more resource-intensive protocols to achieve deterministic truncation [45, 50, 15]. The concerns raised with regard to probabilistic truncation include security and rounding errors. Correctness error. Before we address the above two concerns we recall that a widely used and inexpen- sive probabilistic truncation protocol over fields, due to Catrina and Hoogh [24], has a correctness errors with probability 2−(𝑛−|𝑥|) where |𝑥| is the maximum bitsize of truncation input 𝑥. This necessitated using significantly larger rings, setting 𝑛 ≥ |𝑥| + 𝜅 where 𝜅 is a statistical security parameter. Damgård et al. [51] presented a probabilistic protocol with zero probability of correctness error, but it required a non-constant-round bit-decomposition protocol. Subsequent works [11, 52] optimized [24, 51] and showed a constant-round probabilistic truncation protocol with communication and computation costs matching Catrina and Hoogh [24], but with no correctness error and allowing 𝑛 ≥ |𝑥| + 1. CrypTen [12] proposed a protocol inspired by [53] for division by public value, which can be reduced to a probabilistic truncation with non-zero correctness error. Security. Recently, Li et al. [45] identified a security flaw in the security proof of the probabilistic truncation of [24], revealing a discrepancy between the information leaked by that protocol and by the ideal probabilistic truncation functionality. Note that the same issue pertains to the 𝑛-optimal probabilistic truncation of [52]. This led several recent papers, e.g., [15, 15], to move away from the inexpensive probabilistic truncation protocols because of concerns regarding their security. In this paper, we show that the 𝑛-optimal probabilistic truncation protocol of [52] (as well as the original probabilistic truncation of [24]) securely realizes a natural ideal probabilistic truncation functionality which we define. This enables us to capitalize on the efficiency benefits of probabilistic truncation without compromising security (Section 3). Rounding Errors. The LLAMA paper [50] recently suggested the use of deterministic truncation, claiming it improves inference accuracy. However, this assertion has not been backed by experimental evidence, and earlier studies indicating a possible decrease in accuracy with probabilistic truncation have not been confirmed through empirical testing [49, 11]. Intriguingly, Gupta et al. [44] demonstrated that computation using 16-bit fixed-point with 12-bit precision and probabilistic truncation achieves nearly equivalent accuracy to that of a 32-bit floating-point. Conversely, truncation to the nearest integer fails to train under such conditions adequately. Additionally, although not the primary focus of the work, Keller and Sun [54] provided experimental evidence indicating that in neural networks, there is minimal disparity between probabilistic truncation and truncation to the nearest integer. Indeed, out of 28 distinct experiments, 13 exhibited superior accuracy with probabilistic truncation, 13 with truncation to the nearest integer, and 2 yielded identical accuracy. 1.2. Our Contributions In this paper, we introduce Curl, a user-friendly MPC framework designed for efficiently evaluating non-linear functions using lookup tables. Curl addresses three critical aspects essential to secure protocols: efficiency, accuracy, and security. Curl employs a novel LUT compression technique using discrete wavelet transformation (DWT) achieving efficient approximations with minimal errors. This effectively reduces the size of original LUTs while preserving the accuracy of non-linear functions, thus surpassing traditional polynomial piece-wise approximation methods. By minimizing communication costs, Curl significantly enhances end-to-end runtime performance and supports a wide range of activation functions including GeLU, SiLU, Gaussian error function, Sigmoid, and Hyperbolic tangent. Additionally, Curl generalizes the piece-wise approximation used within the GeLU protocol from [28, 15] to encompass any bounded odd or even function that converges to a piecewise polynomial. Curl’s core technique extends to complex non-linear functions used by LLMs, ensuring both client input data privacy and safeguarding the intellectual property of service providers, such as their models. We evaluate Curl across a spectrum of LLMs including BERT (tiny, base, large), GPT-2, and GPT-Neo, including both CPU and GPU backends. Our evaluation demonstrates advancements over state-of-the- art frameworks, achieving at least 6.5 times fewer rounds and 1.9 times less communication overhead. Finally, Curl addresses the security concerns surrounding efficient probabilistic truncation by in- troducing a natural ideal functionality, a corresponding protocol, and providing a simulation security proof. Many works that rely on this truncation style were proven to have a security flaw by [45] as they cannot be simulated in the real-ideal paradigm. Our result is of independent interest since our proof settles this debate and restores confidence in secure probabilistic truncation protocols. We summarize our contributions as follows: • A novel framework based on DWT compression that achieves high accuracy with low round and communication complexities. When applied to LLM inference, Curl demonstrates at least a 6.5× reduction in round complexity and a 1.9× reduction in communication when compared against existing methods. • Curl builds on top of the user-friendly CrypTen [12] framework, offering high flexibility and low adoption friction for developers and researchers, thereby democratizing access to secure computation techniques. As a proof-of-concept, we implement a variety of LLM models and numerous non-linear functions, all operable on both CPU and GPU. • We introduce a novel natural ideal functionality for efficient probabilistic truncation and prove Escudero et al. [52] to be secure. This result is of independent interest. 2. Preliminaries 2.1. Arithmetic & Binary Secret Sharing We consider two main algebraic structures: the ring Z𝑄 where 𝑄 = 2𝑛 for some bit width 𝑛, and the binary field Z2 . We consider additive secret sharing between a set of parties 𝒫 in both Z𝑄 and Z2 . Arithmetic secret sharing. J𝑥K denotes additive shares in the ring Z𝑄 for parties 𝒫0 , . . . , 𝒫N−1 . This means each party 𝒫𝑖 owns a share J𝑥K𝑖 , where the sum of all shares is equal to 𝑥. Private addition of secret shared values is executed by local share addition and private multiplication is implemented through Beaver triples [55]. Fixed-point representation of a floating-point number 𝑥𝑅⌉︀ is considered using the standard integer encoding with nearest-integer approximation: 𝑥 = 𝑥𝑅 · 2𝑓 , for precision ⌊︀ 𝑓 [24]. To decode, we simply divide by 2𝑓 . Binary secret sharing. Some secure operations profit from a secret sharing representation in the binary domain. For this reason, we consider a binary secret sharing scheme for 𝑥 ∈ Z2𝑛 , denoted as ⟨⟨𝑥⟩⟩. A binary secret share ⟨⟨𝑥⟩⟩ is formed by secret sharing all the bits of 𝑥 in Z2 and each party 𝒫𝑖 owns a share ⟨⟨𝑥⟩⟩𝑖 . To reconstruct the secret, the parties 𝒫0 , . . . , 𝒫N−1 add their shares in Z2𝑛 (i.e., equivalent to bitwise XOR). Conversions. Arithmetic shares suit best arithmetic operations (+, ×) and binary shares perform best in evaluating logical expressions (XOR, AND, right-shift and left-shift operations). For this reason, following the line of research of mixed-mode secure computation [52, 12, 56], we make use of known conversions between the arithmetic and the binary domain. We follow the approach provided by Damgård et al. ([25], S3) to convert from arithmetic shares to binary shares (A2B) and the procedure using algorithm 2 in [12] to convert binary shares into arithmetic shares (B2A). We denote these protocols by ΠA2B B2A 𝑛,𝑙 and Π𝑛,𝑙 , respectively, where 𝑛 represents the input bitwidth and 𝑙 the output bitwidth. We note that the preprocessing procedure used for B2A conversion can be optimized using the techniques proposed by Escudero et. al [52]. In particular, CrypTen’s approach requires 𝑛 tuples of shares (J𝑟K, ⟨⟨𝑟⟩⟩) for a bit 𝑟, whereas [52] employs just 1 tuple of share (J𝑠K, ⟨⟨𝑠⟩⟩) for a ring element 𝑠, with the former being comparatively more costly than the latter. 2.2. Large Language Models (LLMs) In 2017, Vaswani et al. [57] introduced the Transformer architecture, a type of neural network that utilizes attention mechanisms, enabling AI models to prioritize different segments of the input sequence when generating an output. A transformer comprises an encoder-only architecture (like in BERT [2, 58]), a decoder-only (e.g., GPT), or both encoder and decoder [57]. The encoder takes the input sequence and maps it to a lower-dimensional representation to generate a sequence of hidden states. The decoder generates an output sequence by iteratively generating tokens from the previously generated hidden states. In practice, multiple encoder and decoder blocks are used together to achieve higher accuracy. Attention. A core component of transformers is self-attention, a mechanism that weighs the importance of different parts of an input sequence when making predictions. Self-attention takes a query matrix 𝑄 and a set of key-value matrix pairs 𝐾, 𝑉 , which are all produced by linear layers, and produces an output as follows, where 𝑑 is the dimension of the keys: 𝑄 · 𝐾𝑇 (︂ )︂ Attention(𝑄, 𝐾, 𝑉 ) = softmax √ · 𝑉. 𝑑 Usually, the inputs to the softmax are masked so that each token is only influenced by previous tokens, thereby creating a causal relationship. Layers. In the context of secure LLM inference, all the components of transformers can be seen as linear and non-linear layers. A linear layer consists of matrix multiplications and feed-forward neural networks (FNN) which can be evaluated using fixed-point arithmetic. A crucial part of evaluating the linear layers is to maintain the fixed-point precision using truncation.1 For instance, the matrix multiplication 𝑄 · 𝐾 𝑇 in self-attention can be computed as multiple inner products, each of which needs to be followed by a truncation. Next comes the FNN which relies on an inner product followed by an activation function or a non-linear layer. A non-linear layer, like the rectified linear unit (ReLU) [59], Gaussian error linear unit (GeLU) [60], sigmoid linear unit (SiLU), softmax, etc., is not straightforward to evaluate securely. ReLU can be implemented with a secure comparison (i.e., ReLU(𝑥) = max(0, 𝑥)), while GeLU and other activation functions are not as trivial. Several works like CrypTen [12] resort to polynomial approximations which has a significant impact on accuracy. More recent works like Sigma [15] approximate GeLU using ReLU and then using lookup table methods to evaluate the difference. Lastly, layer normalization (LayerNorm) and root mean squared normalization (RMSNorm) require non-linear functions such as reciprocals or reciprocal square roots. Transformer Block. A transformer block comprises several linear and non-linear layers. This usually includes attention, linear layers, non-linear activation functions, and layer normalization. Embedding Layers. Like any language model, LLMs start off with a word embedding layer and a positional embedding layer. These layers are LUT protocols based on the token value and position of the token respectively. As LUT protocols, they can also be thought of as multiplications with a one-hot vector. To implement these under MPC, we have several options. Firstly, even though the token value is a hidden input, its position is public. While we keep the weights hidden under MPC, with the public input we can do a plaintext lookup. With the hidden token 1 The precision on fixed-point values doubles after multiplication so truncation is used to bring it back down. value, however, the lookup is not so straightforward. The client either has to secret share the token as a one-hot vector, which can be quite large or we can use a random one-hot vector generated by the provider. The alternative, which is to run equality with every possible value is also quite costly. To relieve the client from sharing a one-hot vector for every token and to avoid running the costly equality multiple times, we use the provider to generate the one-hot vector. This protocol then works just like the LUT protocols, with the exception that the LUT is not public. Since the LUT, which is the embedding weights matrix, is also private, we need to rely on Beaver triple multiplication. 2.3. The Discrete Wavelet Transform (DWT) A discrete wavelet transformation (DWT) is a transformation that decomposes a discretely sampled signal into approximation and detail coefficients [61]. The detail coefficients contain the information about the error incurred by the approximation coefficients. Given the approximation and detail coef- ficients, one can invert the application of a DWT to obtain the original signal. We convey this idea in Fig. 1. Starting with the whole signal we can apply a DWT to obtain a first level of approximation and detail coefficients. This first-level approximation and detail coefficients are enough to recover the original signal. We can repeat this process on the approximation coefficient to obtain a second level of approximation and detail coefficients. The second-level approximation and detail coefficients together with the first-level detail coefficients allow us to recover the original signal. This process can be repeated as necessary to reduce the size of the approximation coefficients. Whole signal Details Level 0 Approximation Level 1 Level 2 Figure 1: Overview of two DWT iterations. The signal is represented by the approximation (blue) and the details (gray). There are many different wavelet families and each uses linear transformations (e.g., matrix multipli- cation): Daubechy (Db) wavelets, Biorthogonal wavelets, and others. Let us explore in more detail two types of wavelets: the Haar and the (5, 3) Biorthogonal transformations. 2.3.1. Haar DWT A special case of the Db family is the Haar wavelet, also called Db1. In Haar, the approximation coefficients are obtained by averaging every two consecutive samples of the original signal. One of the properties of the Db wavelet of degree 𝑚 (Dbm) is that if a signal encodes data of polynomials of degree strictly below 𝑚, then the corresponding detail coefficients for those polynomial parts will be zero. More specifically, the Haar DWT is a linear transformation represented by a matrix 𝑊 . When applied to a one-dimensional signal 𝑣 of length 2𝑛 (where 𝑛 ∈ Z), this linear transformation is defined as: ⎡ ⎤ ⎡ ⎤ 𝑣0 𝑎0 ⎢ .. ⎥ ⎢ .. ⎥ ⎢ . ⎥ ⎢ . ⎥ ⎢ ⎥ ⎢ ⎥ [︂ ]︂ ⎢ 𝑣𝑛−1 ⎥ ⎢𝑎𝑛−1 ⎥ 𝑊 ·𝑣 =𝑊 ·⎢ ⎢ ⎥ =⎢⎢ ⎥= 𝑎 , 𝑣 ⎢ 𝑛 ⎥ ⎢ 0 ⎥ ⎥ 𝑑 ⎥ 𝑑 ⎢ .. ⎥ ⎢ .. ⎥ ⎣ . ⎦ ⎣ . ⎦ 𝑣2𝑛−1 𝑑𝑛−1 where 𝑎 and 𝑑 correspond to the approximation and detail coefficients and are computed as: 𝑣2𝑘 + 𝑣2𝑘+1 𝑣2𝑘 − 𝑣2𝑘+1 𝑎𝑘 = , 𝑑𝑘 = , 2 2 for 𝑘 = 0, 1, . . . , 𝑛 − 1. We note that the Haar DWT operation 𝑊 can be applied multiple 𝑗 times. For a vector 𝑣 with size 𝑛 divisible by 2𝑗 , ⎡ ⎤ ↶𝑗 𝑗 𝑎 𝑊 · 𝑣 = ⎣↶𝑗 ⎦ , 𝑑 ↶𝑗 ↶𝑗 where the approximation vector 𝑎 has size 𝑛/2𝑗 and the detail vector 𝑑 amounts to the rest (i.e., 𝑛 − 𝑛/2𝑗 ). Note that, since 𝑊 is an orthogonal matrix, its inverse is obtained by its transpose. One can see that, for 𝑘 = 0, 1, . . . , 𝑛 − 1, the inverse is given by 𝑣2𝑘+1 = 𝑎𝑘 + 𝑑𝑘 and 𝑣2𝑘 = 𝑎𝑘 − 𝑑𝑘 . 2.3.2. Biorthogonal DWT While the Dbm wavelets use an orthogonal matrix, the Biorthogonal wavelets require two different matrices where the transpose of one matrix is the inverse of the other. One matrix, 𝐵, is used for the decomposition of the signal and the other, 𝐵, ̃︀ is used for the reconstruction. In this work, we consider the (5, 3) Biorthogonal DWT defined by 𝐵 where the corresponding approximation and detail coefficients are given by: 1 𝑎𝑘 = (−𝑣2𝑘−2 + 2 · 𝑣2𝑘−1 + 6 · 𝑣2𝑘 + 2 · 𝑣2𝑘+1 − 𝑣2𝑘+2 ) 8 1 𝑑𝑘 = (𝑣2𝑘 − 2 · 𝑣2𝑘+1 + 𝑣2𝑘+2 ) , 4 for 𝑘 = 0, . . . , 𝑛 − 1. Note that the (5, 3) Biorthogonal DWT computes the approximation and detail coefficients with a weighted average of 5 and 3 points, respectively. The inverse transformation 𝐵 −1 is given by 𝐵̃︀ where 𝐵 −1 = 𝐵 ̃︀ ⊤ . 2.4. Threat & Computation Model Curl allows two or more computing parties (𝒫0 , 𝒫1 , . . . ) to perform private inference and training of machine-learning models. In the most prominent use case, one party (e.g., 𝒫0 ) owns a trained model, while another party (e.g., 𝒫1 ) aims to utilize the model for inference on their private data. Both parties wish to keep both the data and the model’s architecture private. For instance, imagine a scenario where 𝒫0 is a cybersecurity firm developing a malware detection machine learning model that they offer as a service (MLaaS) to client organizations. The clients (𝒫1 ) would like to use the model in their internal codebase while keeping sensitive implementation details private. In MLaaS, both the model owner and the clients could benefit from outsourcing the model and their inputs respectively to an MPC infrastructure and not participate in the computation. Curl is flexible and can facilitate multiple other use cases such as: (a) having different parties that possess distinct sets of features jointly train a model without exchanging raw data, (b) unveiling correlations where one party retains feature data and another party holds corresponding labels without revealing anything else, among others. For the sake of presentation, we assume a two-party protocol and we focus on the inference setting (i.e., 𝒫0 holds the private model and 𝒫1 holds the sensitive inputs), as demonstrated in Fig. 2. We assume the parties are semi-honest (i.e., follow the protocol specification) in the pre-processing model. This threat model enables a wide range of realistic use cases of secure machine learning [62, 12, 63, 46, 64], while pre-processing is used to generate correlated randomness which the parties consume in the online phase. There is a plethora of ways to generate this randomness; one can use general-purpose MPC [65, 66, 67], specialized protocols [68, 34, 35], or more commonly a trusted dealer [29, 69, 50, 20, 12, 49, 30]. In Curl, we resort to the latter. Lastly, as long as not all the parties collude, no information is leaked about the private inputs (or private models depending on the use case). 𝒫0 𝒫1 Secret share inputs Secret share model Final inference Private model Private inputs Figure 2: Overview of Curl for privacy-preserving inference. 3. Secure Truncation In this work, we use truncation over the ring Z𝑄 and we define it as follows. For input J𝑥K, it outputs J𝑥′ K where 𝑥′ = ⌊𝑥/2𝑓 ⌋ and 𝑓 is the precision of the fixed-point representation. The truncation operation is required whenever we work with fixed-point numbers. Multiplication of two fixed-point rationals with precision 𝑓 obtains a fixed-point rational with precision 2𝑓 . Therefore, we require truncation to go back to a fixed-point rational with precision 𝑓 . As we analyze in Section 3.3, the number of truncations required in a GPT-2 model is at least 220 . Hence, it is paramount to use an efficient truncation protocol. As mentioned in Section 1.1.5, works on secure computation over fixed-point rationals use two types of truncation protocols: a deterministic truncation, where output 𝑥′ is set exactly to ⌊𝑥/2𝑓 ⌋; or a probabilistic truncation, where 𝑥′ = ⌊𝑥/2𝑓 ⌋ + 𝑢 where 𝑢 = 1 with probability (𝑥 mod 2𝑓 )/2𝑓 and 𝑢 = 0 otherwise. Although implementing probabilistic truncations incurs minimal costs, two concerns have been raised about it: security and the impact of rounding errors in ML applications. In this section, we summarize the security issue raised by Li et al. [45] and present a solution in the stand-alone secure computation framework [70, 71]. 3.1. Revisiting Li et al. [45] Li et al. [45] pointed out that some efficient probabilistic truncation protocols fail to meet the criteria for standard simulation-based security [71], rendering them insecure in this framework. This includes Catrina and Hoogh’s [24], CrypTen’s [12], and other proposals including [51, 11, 52]. Li et al. use a variant of the protocol of [24] to exemplify that the above protocols are not simulatable as realizations of a standard ideal probabilistic truncation functionality, because in these protocols the rounding error 𝑢 is a deterministic function of input 𝑥 and an information which the protocol reveals. Specifically, given the protocol execution, the rounding error 𝑢 is a deterministic function of 𝑥(𝑓 ) = (𝑥 mod 2𝑓 ), i.e., the last 𝑓 bits of 𝑥. This violates an ideal functionality for probabilistic truncation, where 𝑢 = 1 with probability 𝑥(𝑓 ) /2𝑓 (and 𝑢 = 0 with probability 1 − 𝑥(𝑓 ) /2𝑓 ), but whether or not 𝑢 becomes 1 or 0 is not revealed. We show that the probabilistic rounding protocols cited above realize a modified probabilistic trun- cation functionality, which we define in Fig. 3. The modification is natural: The functionality picks and reveals a “cutoff” point 𝑠(𝑓 ) chosen at random in Z2𝑓 = [0, ..., 2𝑓 − 1], and sets the rounding error as 𝑢 = 1 if 𝑥(𝑓 ) ≥ 𝑠(𝑓 ) , and as 𝑢 = 0 otherwise. In Appendix A we prove that the 𝑛-optimal proba- bilistic truncation protocol of Escudero et al. [52], included in Fig. 9, realizes this modified truncation functionality. We note that several other probabilistic truncation protocols, e.g., [24], also realize this functionality, after adjustments related to correctness error. Technically, we show an efficient simulator Sim s.t. for every input 𝑥 and every set C of corrupted parties, (︀ ′ ˜ K, ˜𝑠(𝑓 ) , ViewC (J𝑥K) ≈𝜖 J𝑥K′ , 𝑠(𝑓 ) , Sim(J𝑥KC , J𝑥′ KC , 𝑠(𝑓 ) ) )︀ (︀ )︀ J𝑥 where J𝑥˜ ′ K and ˜𝑠(𝑓 ) are the output and the cutoff point set by a protocol execution, ViewC (J𝑥K) is the adversarial view of that protocol execution, J𝑥′ K and 𝑠(𝑓 ) are the output and the cutoff point set by the ideal functionality, and J𝑥KC and J𝑥′ KC are the shares of resp. J𝑥K and J𝑥′ K held by parties C. 3.2. Modified Probabilistic Truncation Functionality If 𝑥 is an integer in Z2𝑛 , we define 𝑥(𝑓 ) = (𝑥 mod 2𝑓 ) as the 𝑓 least significant bits of 𝑥, and 𝑥(𝑓 ) = ⌊𝑥/2𝑓 ⌋ as the truncation of the last 𝑓 bits of 𝑥. In particular it holds that 𝑥 = 𝑥(𝑓 ) · 2𝑓 + 𝑥(𝑓 ) . We present the modified probabilistic truncation functionality ℱ𝑛,𝑓 TruncPr (J𝑥K) in Fig. 3. Note that the proposed functionality reveals the cutoff point 𝑠(𝑓 ) whereas the functionality considered by Li et al. ([45, Functionality 2]) can be stated as following the same procedure except that it hides 𝑠(𝑓 ) . (In particular, if 𝑢 is based on the random cutoff 𝑠(𝑓 ) as in Fig. 3 then 𝑢 = 1 with probability 𝑥(𝑓 ) /2𝑓 , as expected.) Note that this extra element conveys as much information as the correctness definition of the probabilistic truncation itself, which has seen wide acceptance and use in the literature. By definition, probabilistic truncation incurs a rounding error with probability 𝑝𝑥 = 𝑥(𝑓 ) /2𝑓 . Therefore, for a fixed input value 𝑥, one can estimate probability 𝑝𝑥 given outputs 𝑥′ from multiple runs of this functionality, and then approximate the 𝑓 least significant bits of 𝑥 as 𝑥(𝑓 ) ≈ 𝑝𝑥 · 2𝑓 . Each run of functionality TruncPr (J𝑥K) reveals in addition the cutoff points 𝑠 ℱ𝑛,𝑓 (𝑓 ) , which can also be used to approximate 𝑥(𝑓 ) because each 𝑠(𝑓 ) partitions range [0, ..., 2 − 1], and bit 𝑢 = 𝑥′ − 𝑥(𝑓 ) reveals to which side of this 𝑓 partition 𝑥(𝑓 ) belongs. TruncPr ℱ𝑛,𝑓 (J𝑥K) Input: Assume parties hold J𝑥K ∈ Z2𝑛 for 𝑥 < 2𝑛−1 . 1. Reconstruct 𝑥 from sharing J𝑥K. 2. Set 𝑥(𝑓 ) = ⌊𝑥/2𝑓 ⌋ and 𝑥(𝑓 ) = (𝑥 mod 2𝑓 ). 3. Pick cutoff point 𝑠(𝑓 ) at random in Z2𝑓 . 4. Set 𝑥′ = 𝑥(𝑓 ) + 𝑢 where if 𝑥(𝑓 ) ≤ 𝑠(𝑓 ) , {︂ 0 𝑢= 1 otherwise. 5. Generate a random sharing J𝑥′ K of 𝑥′ . 6. Output J𝑥′ K and leak 𝑠(𝑓 ) to the adversary. Figure 3: Ideal modified probabilistic truncation functionality We prove the following theorem in Appendix A: Theorem 3.1. Protocol ΠTruncPr 𝑛,𝑓 (J𝑥K) of Escudero et al. [52], shown in Fig. 9 in Appendix A, securely TruncPr (J𝑥K) shown in Fig. 3 with semi-honest security in the stand-alone security realizes functionality ℱ𝑛,𝑓 framework [71]. 3.3. Comparison to CryptTen’s Truncation We chose to implement truncation using the protocol of Escudero et al. [52] instead of CrypTen’s truncation [12] because of correctness and rounding errors in the latter protocol. First, as in the case of the probabilistic truncation of [24] discussed in Section 1.1.5, CrypTen’s truncation incurs a correctness error with probability 2−(𝑛−|𝑥|) where |𝑥| is the maximum bit size of the truncation input 𝑥. Consider the implications of the above for LLM inference. Assuming the maximum bit size of fixed- point elements is 24 (8-bit integer part and 16-bit precision), after multiplication, the input bit size doubles. Since 𝑛 is fixed at 64 bits, the correctness error occurs with a probability of 2−(64−48) = 2−16 . For instance, the GPT-2 model requires at least 220 inner products.2 Assuming a truncation is executed 2 For GPT-2, 𝑑model = 768, 𝑑𝑓 𝑓 = 3072, 𝑛heads = 12, and 𝑑k = 64. Considering an input of size 𝐿 = 64, the token embedding requires 𝑑model × 𝐿 inner products, each attention mechanism requires (𝑑k + 𝑑model ) × 𝐿, and each feed-forward layer requires at least (𝑑ff + 𝑑model ) × 𝐿 inner products. This amounts to at least 3, 637, 248 > 220 inner products. after each inner product, we expect 16 random errors, undermining inference correctness. We verified this experimentally by running 220 CrypTen truncations of 48-bit elements 30 times, resulting in an average error of 15.83, which aligns with the expected value of 24 . By contrast, the probabilistic truncation protocol of Escudero et al. [52] is always correct. Furthermore, CrypTen’s truncation mechanism exhibits worse rounding errors compared to function- ality in Fig. 3 realized by the protocol of [52], even for 𝑁 = 2 parties, increasing further if the number of parties increases. CrypTen’s rounding ⌊︀ error⌋︀is linear in N because each party truncates their share of the input value 𝑥 locally, i.e., [𝑥′ ]𝑖 = [𝑥]𝑖 /2𝑓 . Since each [𝑥′ ]𝑖 can be 1 away from the rational value, summing N of them results in the output 𝑥′ which can be up to N away from 𝑥(𝑓 ) . In contrast, value 𝑥′ output by the truncation protocol of [52] is at most 1 away from 𝑥(𝑓 ) . For these reasons we opt for the truncation protocol from Escudero et al. [52] instead of CrypTen’s truncation mechanism after each multiplication. 4. Curl Overview As we mentioned in Section 2.3, DWT is used to decompose signals into approximation and detail coefficients. DWT is particularly effective for signals that represent smooth functions (e.g., logarithm, square root) since the detail coefficients are relatively small compared to the approximation coefficients. This unique observation allows us to set the detail coefficients to zero and invert the application of the DWT to obtain an approximation of the original signal. The main benefit of this process is that for smooth functions we acquire detailed approximations while requiring significantly fewer samples than the original signals. We leverage this to develop a secure protocol with compressed LUTs achieving high accuracy with low round and communication complexities. We begin by analyzing several DWT techniques, specifically Haar and Biorthogonal DWT, for plaintext evaluation of LUTs and we translate the plaintext procedures into the secure setting. Next, we apply our DWT LUT techniques and create Curl, an easy-to-use framework for privacy-preserving machine learning that exposes a tensor-based programming model for both CPU and GPU backends. Finally, we evaluate Curl on non-linear functions and LLMs securely, improving accuracy and reducing communication compared to related works. 4.1. Improving Approximations with DWT-Encoded LUTs Our key contribution lies in utilizing discrete wavelet transform (DWT) to securely evaluate a table. We start by explaining the secure evaluation of a look-up table (LUT) and demonstrate the application of both Haar and Biorthogonal DWT compression techniques for function evaluation using plaintext values. Next, we extend those two techniques to the secure setting. Our protocols use some techniques introduced in [72, 33, 23] but contrary to previous works, Curl can be easily extended to any number of parties. For simplicity, we present our protocols in the two-party setting. 4.1.1. Secure LUT Evaluation Let F : R → R be a function and consider its fixed-point encoded counterpart tF : Z2𝑛 → Z2𝑛 , with 𝑛 precision 𝑓 . The function tF can be represented as a LUT in (Z2𝑛 )2 , where tF (𝑥) corresponds the value at the 𝑥-th index. We briefly explain the method for securely computing tF (J𝑥K) = J𝑦K, which is depicted in Fig. 4. Before initiating the protocol, both parties 𝒫0 and 𝒫1 can locally produce a LUT tF representation of the function F. This evaluation needs to be performed only once and can be reused indefinitely. The secure evaluation of the LUT begins with the dealer generating a random element 𝑟 ∈ Z2𝑛 . Then, it generates a 1-hot vector 𝑣 with elements equal to 0 everywhere except in the 𝑟-th entry. Then, the dealer secret-shares to both 𝒫0 and 𝒫1 the random value 𝑟 along with the 1-hot vector 𝑣. These steps make up the preprocessing phase as they do not depend on the function input 𝑥. During the online phase, the parties locally compute the offset value 𝛿 given by 𝛿 = 𝑥 − 𝑟. This offset value is revealed to 𝒫0 J𝑣K0 J𝑦K0 Rotate by 𝛿 J𝑥K0 , J𝑟K0 , J𝑣K0 J0K & Multiply J0K ... 𝒟 Reveal 𝛿 = 𝑥 − 𝑟 J119K J0K J0K 𝑟← −$ Z J1K LUT tF ... 𝑣 = [0, . . . , 0, 1, 0, . . . , 0] J0K 47 J0K s.t. 𝑣 is 1-hot at 𝑟 119 𝑥th index Share 𝑣 to J𝑣K0 , J𝑣K1 268 Share 𝑟 to J𝑟K0 , J𝑟K1 𝒫1 J𝑣K1 ... J𝑦K1 723 J𝑥K1 , J𝑟K1 , J𝑣K1 J0K J0K ... Reveal 𝛿 = 𝑥 − 𝑟 J119K J0K Rotate by 𝛿 J0K J1K & Multiply ... J0K J0K Figure 4: Secure evaluation tF (𝑥) = 𝑦 of a public LUT (tF ) on secret-shared 𝑥 (𝒫𝑖 holds J𝑥K𝑖 ) with the help of dealer 𝒟. both parties 𝒫0 and 𝒫1 as it is used to locally rotate the secret-shared vector 𝑣. Note that 𝛿 cannot be revealed to the dealer, otherwise 𝒟 would be able to recover the underlying value of the input 𝑥. After the parties rotate their shares of 𝑣 by 𝛿, the underlying value 1 is placed at 𝑟 + (𝑥 − 𝑟) = 𝑥-th index. Finally, the parties compute the inner product between the shares of the rotated vector and the public LUT tF . Indeed, this yields the share of tF at index 𝑥. All operations are executed in the ring Z2𝑛 . 4.2. Look-Up Table Approximation As previously discussed, DWT provides a way to approximate signals. We observe that it is specifically well-suited for smooth functions, such as exponential, sigmoid, logarithm, and square root, to name a few. Additionally, it provides a compression mechanism that allows us to reduce bandwidth usage. Next, we delve into two approaches for evaluating a DWT-approximated LUT in the plaintext setting. This exploration will help inform the secure protocols for both Haar and (5, 3) Biorthogonal DWTs. Approach 1. Recall from Section 2.3 that when DWT is applied to a vector 𝑣 𝑗 times, we end up ↶𝑗 ↶𝑗 ↶𝑗 with two vectors 𝑎 and 𝑑 , where 𝑎 is 2𝑗 times smaller than the original vector. By setting the detail coefficients to zero, we define the approximation of 𝑣 to be the approximation coefficients. This effectively reduces the size of the original vector by 2𝑗 . 𝑛 Similar to how we apply the DWT to a general vector 𝑣 ∈ (Z2𝑛 )2 , we apply it to the LUT repre- ↶𝑗 𝑛−𝑗 sentation tF . We define tF ∈ (Z2𝑛 )2 to be the LUT containing only the approximation coefficients of the original LUT after 𝑗 DWT operations, 𝑊 𝑗 · tF . Note that, in this process, the original LUT is compressed by a factor of 2𝑗 , while the original input value 𝑥 remains within Z2𝑛 . This leads to a ↶𝑗 natural question: how do we find the original input 𝑥 in the 𝑗-times compressed table tF ? To locate 𝑥 in ↶𝑗 ↶𝑗 tF , we truncate 𝑥 by 𝑗 bits and find the corresponding index in tF . This process yields the following approximation: ↶𝑗 tF (𝑥) ≈ tF (𝑥 ≫ 𝑗), ↶𝑗 which is depicted in Fig. 5. Observe that the output bitwidth of the compressed LUT tF is the same as the output bitwidth of the original table tF . Approach 2. From DWT theory, we can retrieve the original signal from the approximation and detail coefficients by evaluating the inverse transform over them. Moreover, it is known that the inverse transform provides a method to amplify a signal, by considering the original signal as the 𝑥 tF ↶1 000 0000 𝑥≫1 tF 001 0010 00 0001 010 0100 01 0101 011 0110 10 1001 100 1000 11 1101 101 1010 110 1100 (b) Approach 1 for LUT approximation. Evaluation of 111 1110 the Haar-compressed LUT over the truncated ↶1 (a) Evaluation of the original LUT, i.e., tF (𝑥). input, i.e., tF . ↶1 Figure 5: Execution of the original and Haar-compressed LUT (tF and tF ) on inputs 𝑥 and 𝑥 ≫ 1, respectively. ↶𝑗 approximation coefficients [61]. We can use this method to amplify the approximated LUT tF and get a better approximation than approach 1. We execute the following chain of operations: ⎡ ⎤ ↶𝑗 [︃ ↶𝑗 ]︃ 𝑊𝑗 𝑗 tF tF ↶𝑗𝑊 −𝑗 (1) tF 𝑊 · tF = ↶𝑗 ⎦ ⎣ =: eF ↶𝑗 𝑊 −𝑗 · eF =: ̃︀ tF , 𝑑 0 and we obtain tF (𝑥) ≈ ̃︀tF (𝑥) for any DWT type. Next, we explain both approaches for the Haar and (5, 3) Biorthogonal transformations. We discuss how the structure of the DWT matrices affects the approximation, as this provides insight into translating a plaintext approximation to a secure protocol. Finally, we extend these approaches to the secure setting and present the corresponding protocols. 4.2.1. Haar-LUT Evaluation The two aforementioned approaches presented in Section 4.2 can be applied to the Haar DWT. Consid- ering vectors of size 2𝑛 , recall from Section 2.3.1 that the Haar DWT compresses data by computing the average of two elements: 𝑎𝑘 = (𝑣2𝑘 + 𝑣2𝑘+1 )/2, for 𝑘 = 0, 1, . . . , 2𝑛−1 − 1. In other words, as 𝑘 represents the 𝑛 − 1 most significant bits (MSB) of the original index, the index elements with the same 𝑛 − 1 MSBs are averaged. Applying consecutively DWT 𝑗 times, the DWT averages the index elements with the same 𝑛 − 𝑗 MSBs. We observe that for the Haar DWT, both approaches 1 and 2 yield the same approximation. As described in Section 4.2, the inverse Haar transformation is given by 𝑣2𝑘+1 = 𝑎𝑘 + 𝑑𝑘 and 𝑣2𝑘 = 𝑎𝑘 − 𝑑𝑘 . Since the detail coefficients are set to 0, we observe that 𝑣2𝑘 = 𝑣2𝑘+1 = 𝑎𝑘 , i.e., the elements from indexes with the same 𝑛 − 1 MSBs are reconstructed to the same value. In other words, the inverse transformation duplicates the entries of the indexes having the same 𝑛 − 1 MSBs. This yields the ↶1 same result as taking the 𝑛 − 1 MSB of 𝑥 and looking up the compressed table tF , i.e., more generally ↶𝑗 tF (𝑥 ≫ 𝑗) = ̃︀ tF (𝑥). Secure Haar LUT Evaluation. The techniques described above can be used together to securely compute an approximated function F through a LUT. The full protocol is formally described in Fig. 6. The secure LUT procedure of Fig. 4 can be adapted to be compatible with the DWT compression technique. Assume input bitwidth 𝑛, output bitwidth 𝑙, and the number of DWT transformations to be 𝑗. The secure Haar LUT protocol, ΠHaar-LUT 𝑛,𝑙,𝑗,F (Fig. 6) is given as follows. During the preprocessing phase, similar to the secure LUT evaluation, the dealer produces a secret-shared random 1-hot vector. However, to profit from DWT compression, the 1-hot vector has reduced size, i.e., 2𝑛−𝑗 , and 𝑟 ∈ Z2𝑛−𝑗 . Following approach 1, we have that the look-up index is given by the 𝑛 − 𝑗 MSBs of the input 𝑥. Therefore, the ↶𝑗 shift 𝛿 to be applied to the compressed LUT tF is computed using the 𝑛 − 𝑗 MSBs of 𝑥 and 𝑟. The online ΠHaar-LUT 𝑛,𝑙,𝑗,F (J𝑥K) Parameters: 𝑛, 𝑙 for input bit-width and output bit-width. F is the function to be computed and ↶𝑗 represented by the corresponding 𝑗−times compressed LUT tF . Dealer setup: 1. 𝒟 randomly generates 𝑟 ← − $ Z2𝑛−𝑗 . 2. 𝒟 builds a 1-hot vector 𝑣 of size 2𝑛−𝑗 with value 1 at position 𝑟. 3. 𝒟 shares 𝑟 and 𝑣 to 𝒫0 and 𝒫1 . LUT evaluation for parties 𝒫𝑖 for 𝑖 ∈ {0, 1}: 4. 𝒫𝑖 truncates the input value J𝑥K by 𝑗: J𝑥* K = J𝑥 ≫ 𝑗K. ◁ Truncation protocol [52, §5.1] 5. 𝒫𝑖 reduces its share J𝑥* K𝑖 by locally computing J𝑥* K𝑖 mod 2𝑛−𝑗 . 6. 𝒫𝑖 computes J𝛿K = J𝑥* K − J𝑟K and reveal 𝛿. 7. 𝒫𝑖 rotates J𝑣K𝑖 by 𝛿. ↶𝑗 8. 𝒫𝑖 applies the inner product between tF and the rotated vector, and gets the share J𝑦K. 9. Output: J𝑦K. Figure 6: Secure Haar protocol. phase starts with the parties truncating the shares of 𝑥 by 𝑗 and computing its modulo in Z2𝑛−𝑗 . As we saw, the truncation from the DWT-LUT evaluation approach 1 can be executed with the Escudero et al. [52, §5.1] probabilistic truncation protocol over rings. The modulo operation is required because the shares of 𝑟 belong to Z2𝑛−𝑗 and the shares of truncated 𝑥 belong to Z2𝑛 . We note that modulo can be executed locally as described below. The rest of the protocol is similar to the online phase of the secure LUT evaluation (Fig. 4). Local modulo. In the literature, several protocols evaluate the modulo operation over shares in a ring or field [24, 52]. Those protocols are used whenever the space of the modulo operation is the same for both inputs and outputs. However, we observe that the modulo operation required for our protocol needs to reduce the ring size of the operand. More specifically, the shares J𝑥* K in Z2𝑛 have to be reduced to shares in Z2𝑛−𝑗 . This can be achieved locally by letting the parties compute the modulo operation over their shares. For correctness, consider 𝑥 = (J𝑥K1 + J𝑥K2 ) mod 2𝑛 and 𝑥′ = 𝑥 mod 2𝑛−𝑗 . We have, 𝑥′ = 𝑥 mod 2𝑛−𝑗 = ((J𝑥K1 + J𝑥K2 ) mod 2𝑛 ) mod 2𝑛−𝑗 = (J𝑥K1 + J𝑥K2 ) mod 2𝑛−𝑗 ◁ 2𝑛−𝑗 is a factor of 2𝑛 . = J𝑥K1 mod 2𝑛−𝑗 + J𝑥K2 mod 2𝑛−𝑗 mod 2𝑛−𝑗 (︀ )︀ = J𝑥 K1 + J𝑥′ K2 (︀ ′ mod 2𝑛−𝑗 . )︀ Thus, the shares of 𝑥′ can be computed locally from J𝑥K. 4.2.2. Biorthogonal-LUT Evaluation Now, we explore the LUT DWT approximation for the (5, 3) Biorthogonal DWT. As described in Section 2.3.2, the approximation coefficients are given by a weighted average of 5 elements and, similar to Haar, approach 1 for the Biorthogonal compresses two indexes with the same 𝑛 − 1 MSBs to the ↶𝑗 same index. On the other hand, approach 2 requires computing 𝐵 −𝑗 · eF =: ̃︀ tF as shown in Eq. (1). For simplicity, we set 𝑗 = 1 and examine the structure of the inverted Biorthogonal DWT matrix 𝐵 −1 : Index ⎛ ⎞ 0000 2 0 0 0 0 0 1⎜ 1 1 0 ⎟ 1 ⎜ ⎟ ⎜0 0 0 1 0⎜ 2 0 ... ⎟⎟ · 4. 0 0 1 1⎝ 0 1 1 ⎠ .. .. . . To simplify the exposition, we present matrix 𝐵 −1 only with the elements that interact with the ↶𝑗 approximation coefficients, i.e., tF . Thus, the detail coefficients are effectively set to zero. Observe that the line indexes with the same 𝑛 − 1 MSBs ( green ) constitute a 2 × 2 submatrix that combines two ↶1 elements from tF , i.e., its corresponding index element and the subsequent element. For instance, the ↶1 tF , combines the 001 -th and the ( 001 + 1)-th index element of tF . Note that the 001 -th block of ̃︀ LSB ( red ) of the line index plays a role in the weights of the sub-matrices. Indeed, we have: (︂ )︂ (︂ )︂ 2 0 2− 0 0 = . 1 1 2− 1 1 Finally, consider the case 𝑗 = 2. We omit the (1/4)𝑗 constant for readability. The 2 LSBs of the indices define the elements of 4 × 2 sub-matrices and the 𝑛 − 2 MSBs of the indices define the elements from ↶2 the approximation LUT tF to be used. ↶2 Index ⎛ 𝐵 −2 𝑇Fext ⎛ tF ̃︀ ⎞ ⎞ ⎛ 2 ⎞ 0 0 0 0⎜ 4 0 0 ⎜ (2 − 0 ) · 𝑎 + 0 · 𝑎 ⎟ ⎜ 𝑎0 ⎟ ⎜ (22 − 1 ) · 𝑎 0 + 1 · 𝑎 0 +1 ⎟ ⎟ ⎟ ⎜ . ⎟ ⎜ ⎜3 0 0 0 1⎜ 1 0 0 ⎟ ⎜ . ⎟ ⎜ (22 − 2 ) · 𝑎 + 2 · 𝑎 0 +1 ⎟ ⎟ ⎜ . ⎟ ⎜ ⎟ 0 0 1 0⎜ 2 2 0 2 0 0 +1 ⎟ ⎜ 0 0 1 1⎜ 1 3 0 ⎟ ⎜ ⎟ ⎜ 𝑎2𝑛−2 ⎟ ⎜ (2 − 3 ) · 𝑎 0 + 3 · 𝑎 0 +1 ⎟ ⎟ ⎜ ⎟ ⎜ 0 1 0 0⎜ 0 4 0 ... ⎟ · ⎟ = ⎜ 2 ⎟ ⎟ ⎜ (2 − 0 ) · 𝑎 1 + 0 · 𝑎 1 +1 ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ 0 1 0 1⎜ 0 3 1 ⎜ ⎟ ⎟ ⎜ 0 ⎟ ⎜ (22 − 1 ) · 𝑎 + 1 · 𝑎 ⎜ ⎟ ⎟ 1 1 +1 0 1 1 0⎜ 0 2 2 ⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎜ ⎟ ⎜ (22 − 2 ) · 𝑎 + 2 · 𝑎 ⎟ ⎝ . ⎠ ⎜ 2 ⎟ ⎟ 1 1 +1 ⎟ ⎝0 0 1 1 1⎜ 1 3 ⎜ (2 − 3 ) · 𝑎 1 + 3 · 𝑎 1 +1 ⎟ .. .. ⎠ . . 0 .. ⎝ ⎠ . We can generalize the above pattern for any 𝑗, which provides an approximation formula suited for the secure setting: (︂ )︂ 1 (︀ 𝑗 )︀ ↶𝑗 (𝑗) ↶𝑗 (𝑗) tF (𝑥) = 𝑗 ̃︀ 2 − 𝑥(𝑗) · tF (𝑥 ) + 𝑥(𝑗) · tF (𝑥 + 1) , (2) 4 where 𝑥(𝑗) and 𝑥(𝑗) represent the 𝑛 − 𝑗 MSBs and 𝑗 LSBs of 𝑥, respectively. We recall that the 𝑛 − 𝑗 MSBs can be obtained by truncation and the 𝑗 LSB by the mod 2𝑗 operation. Secure Biorthogonal LUT evaluation. Eq. (2) can be used to securely compute an approximated function F through a look-up table. The full protocol is formally described in Fig. 7. Similarly, the secure LUT procedure described before (Fig. 4) can be adapted to be compatible with the Biorthogonal DWT compression technique. Assume input bitwidth 𝑛, output bitwidth 𝑙, and the number of DWT transformations to be 𝑗. The secure Biorthogonal LUT protocol, ΠBior-LUT Haar-LUT with the difference 𝑛,𝑙,𝑗,F (Fig. 7), is similar to Π𝑛,𝑙,𝑗,F ↶𝑗 that it requires two LUT evaluations of tF . As an optimization, note that these two LUT evaluations do not require two different one-hot vectors. Since the LUT evaluation is executed in two consecutive indices, we can use the same vector 𝑣 to generate two shifted vectors: one shifted by 𝛿 and the other by 𝛿 + 1. This allows us to use the same dealer setup phase as in secure Haar LUT evaluation. The online phase starts with the parties computing 𝑥(𝑗) as the 𝑛 − 𝑗 MSBs and 𝑥(𝑗) as the 𝑗 LSBs of 𝑥. More specifically, 𝑥(𝑗) is computed by truncating J𝑥K by 𝑗 (i.e., [52, §5.1]) and locally computing its modulo in Z2𝑛−𝑗 , while 𝑥(𝑗) is computed locally based on 𝑥(𝑗) as 𝑥 = 2𝑗 · 𝑥(𝑗) + 𝑥(𝑗) . The rest of the protocol is ΠBior-LUT 𝑛,𝑙,𝑗,F (J𝑥K) Parameters: 𝑛, 𝑙 for input bit-width and output bit-width. F is the function to be computed and ↶𝑗 represented by the corresponding 𝑗−times compressed LUT tF . Dealer setup: 1. 𝒟 randomly generates 𝑟 ← − $ Z2𝑛−𝑗 . 2. 𝒟 builds a 1-hot vector 𝑣 of size 2𝑛−𝑗 with value 1 at position 𝑟. 3. 𝒟 shares 𝑟 and 𝑣 to 𝒫0 and 𝒫1 . LUT evaluation for parties 𝒫𝑖 for 𝑖 ∈ {0, 1}: 4. 𝒫𝑖 truncates the input value J𝑥K by 𝑗, J𝑥 ≫ 𝑗K := J𝑥(𝑗) K. Then, 𝒫𝑖 reduces its share J𝑥(𝑗) K𝑖 by locally computing J𝑥(𝑗) K𝑖 mod 2𝑛−𝑗 . 5. 𝒫𝑖 computes the 𝑗 LSB of 𝑥 by locally computing J𝑥(𝑗) K := J𝑥K − 2𝑗 · J𝑥(𝑗) K. 6. 𝒫𝑖 compute J𝛿K = J𝑥(𝑗) K − J𝑟K and reveal 𝛿. 7. 𝒫𝑖 rotate J𝑣K𝑖 by 𝛿 and 𝛿 + 1. ↶𝑗 8. 𝒫𝑖 apply the inner product between tF and the rotated vectors. We denote the resulting shares ↶𝑗 ↶𝑗 by J tF (𝑥(𝑗) )K and J tF (𝑥(𝑗) + 1)K. (︂ )︂ )︀ ↶𝑗 ↶𝑗 9. 𝒫𝑖 computes the expression J𝑦K := 41𝑗 2 − J𝑥(𝑗) K · J tF (𝑥(𝑗) )K + J𝑥(𝑗) K · J tF (𝑥(𝑗) + 1)K . (︀ 𝑗 10. Output: J𝑦K. Figure 7: Secure Biorthogonal protocol. similar to the online phase of the secure LUT evaluation (Fig. 4) on 𝛿 and 𝛿 + 1. In the final step 10, the parties compute the expression (2) as the output. Both Haar and Biorthogonal secure evaluation protocols can be used to directly evaluate a function over a fixed interval. Furthermore, we note that our protocols can be combined with a piece-wise approach commonly employed in prior works [28, 15]. We extend this method in Appendix B to accommodate any bounded even or odd function (Fig 11). We demonstrate the applicability of this protocol to various functions, including s-shape functions such as tanh, erf and sigmoid, as well as activation functions like GeLU and SiLUq. Additionally, in Appendix B.3, we introduce a protocol for both sin and cos. 4.3. Complexity Analysis 4.3.1. Memory The secure DWT-LUT approach, using compressed tables, significantly minimizes the memory required for LUT handling. For instance, consider the interval (0, 4096) with a precision of 𝑓 = 16. A conven- tional LUT protocol would typically demand a table size of approximately 2.15 GB. However, employing the DWT technique allows us to reduce the original table size by a factor of approximately 220 while maintaining high accuracy levels as discussed in Section 5.1.2. The corresponding DWT-reduced LUT occupies only 1 KB of memory, demonstrating a substantial reduction in storage requirements without compromising accuracy. 4.3.2. Round & Communication Both Haar and Biorthogonal protocols have the same preprocessing phase and both start the online phase by truncating and opening a single field element. However, the Biorthogonal requires an additional Curl Latency Curl MAE CrypTen Latency CrypTen MAE 100 10−1 100 101 100 10−1 10−2 100 10−2 Latency 10−1 MAE 10−3 10−1 10−1 10−3 10−2 10−2 10−1 10−4 10−4 7 8 9 10 6 7 8 9 5 6 7 8 LUT bit size LUT Size LUT Size (a) Logarithm (Fig. 7). (b) Square root (Fig. 7). (c) Inv. square root (Fig. 6). 100 100 10−2 100 100 10−4 10−1 Latency MAE 10−1 10−1 10−2 10−1 10−3 10−5 10−2 10−3 5 6 7 8 4 5 6 7 5 6 7 8 LUT bit size LUT Size LUT Size (d) Sigmoid (Fig. 11). (e) SiLU (App. B.2). (f) SiLU (Fig. 7). Figure 8: Mean absolute error (MAE) and latency trade-offs for different DWT compressions. We plot CrypTen as a baseline. truncation and two share multiplications (see Fig. 7 Step 9). Observe that Escudero et al. [52, §5.1] probabilistic truncation protocol over rings involves one round to open a ring element. Consequently, while the Haar protocol operates within 2 rounds and transmits 2 ring elements, the Biorthogonal is more resource-intensive, requiring 4 rounds and transmitting 5 elements. 5. Experimental Results We implemented Curl by building on top of CrypTen and open-source it in https://github.com/jimouris/ curl . Curl natively supports both CPU and GPU backends, however, we focus on CPU experiments as this is the most widely used setting in MPC. For completeness, we conduct some GPU experiments using an NVIDIA GeForce RTX 4080. All parties run in separate processes on a c5n.9xlarge AWS instance with 36 vCPUs and 96 GB of memory. As communication and round complexity form crucial metrics for assessing MPC end-to-end runtime, we extensively report these numbers in all our benchmarks. This can be used to glean WAN performance. Furthermore, as Curl’s novelty lies in improving approximations through DWT-encoded lookup tables, we focus on reporting the mean absolute errors (MAE) and mean relative errors (MRE) over the expected result. Finally, we focus on private LLM inference using secret-shared models on secret-shared data. Note that Curl can be used for private training in a similar way, which we omit due to space constraints. We use 16 bits of precision, following related works that typically use precision between 9 and 17 bits [46, 15, 73, 23]. 5.1. Non-linear Functions We experimentally verified that the Biorthogonal protocol achieves better approximations than Haar with almost identical latency and communication costs (Appendix C). Therefore, we use the Biorthogonal protocol for almost all our experiments. We choose constrained ranges of input values to enable a meaningful comparison with CrypTen. For wider ranges, such as (0, 212 ), CrypTen’s approximations exhibit significantly higher errors. Since previous MPC works [28, 27, 42] rely on similar polynomial approximations, we can argue that they all have similar error levels. For example, for the log function, Table 1 Overview of Curl’s improvement over CrypTen for selected tensors functions in terms of latency, mean absolute error (MAE), and mean relative error (MRE), for tensors of 256 × 256 elements. Curl CrypTen [12] ‡ § Op. Protocol Domain Latency Com. Error Latency Com. Error LUT (sec.)† Rounds MB MAE MRE (sec.) Rounds MB MAE MRE log Fig. 7 (0, 64) 28 0.17 4 2.6 2.09e-2 5.48e-2 0.17 40 39.8 2.14e-2 6.36e-3 reciprocal Fig. 7 (1, 64) 2 7 0.09 4 2.6 7.18e-4 1.43e-3 0.11 59 38.3 1.7e-4 7.05e-3 sqrt Fig. 7 (0, 256) 26 0.06 4 2.6 1.23-1 1.11e-2 0.09 26 17.3 6.09e+0 4.04e-1 invsqrt Fig. 6 (0, 256) 2 6 0.04 2 1.0 1.45e-2 1.14e-1 0.09 24 15.7 2.69e-2 0.405e-1 sin App. B.3 (−64, 64) 25 0.08 16 20.4 4.55e-3 1.14e-2 0.11 37 24.1 8.52e-1 1.58e+0 cos App. B.3 (−64, 64) 25 0.08 16 20.4 4.77e-3 9.85e-2 0.10 37 24.1 8.86e-1 1.45e+0 Fig. 11 (−64, 64) 2 6 0.10 22 33.6 4.70e-5 7.83e-2 0.11 26 26.2 7.00e-5 3.49e+0 sigmoid Fig. 7 (−64, 64) 26 0.10 4 2.6 1.11e-2 6.59e-2 0.11 26 26.2 7.00e-5 3.49e+0 tanh Fig. 11 (−64, 64) 25 0.09 22 33.6 2.31e-4 3.96e-4 0.13 26 26.2 8.60e-5 1.19e-4 erf Fig. 11 (−64, 64) 2 3 0.09 22 33.6 8.98e-4 1.83e-3 0.21 56 36.2 3.39e+7 3.40e+7 App. B.2 (−64, 64) 24 0.10 30 47.7 5.95e-3 2.79e+0 N/A N/A N/A N/A N/A GeLU Fig. 7 (−4, 4) 24 0.11 4 2.6 2.60e-3 5.02e-2 N/A N/A N/A N/A N/A App. B.2 (−64, 64) 26 0.14 30 47.7 2.61e-3 5.48e-3 N/A N/A N/A N/A N/A SiLU Fig. 7 (−64, 64) 26 0.09 4 2.6 1.54e-1 1.18e-1 N/A N/A N/A N/A N/A † Green background indicates the fastest runtime, ‡ orange indicates the lowest communication, and § blue indicates the lowest error (MAE or MRE). CrypTen’s approximation yields a MAE of 1.09e+10, whereas the Biorthogonal with a compression 𝑗 = 21 achieves a substantially lower MAE of 2.66e-1. We note that the latency time reported is dominated by computation time as the protocol is executed in the same machine. 5.1.1. Choosing LUT Sizes We evaluate our secure DWT protocols to compute a set of smooth functions commonly used in machine learning tasks such as logarithm (used for perplexity and cross-entropy that measure LLM’s performance), square root and inverse square root (used for layer normalization), sigmoid, and SiLU. In Fig. 8 we investigate the trade-offs between latency on the left y-axes (green) and MAE on the right y-axes (red) for the aforementioned non-linear functions. As a baseline, we show CrypTen’s latency and MAE with dashed green and red lines, respectively. In the case of SiLU, which we evaluate with two different protocols, CrypTen does not support this so we only show Curl’s latency and MAE. Of course, bigger LUTs in Curl achieve better approximations but require more computation and thus higher latency. However, DWT reduces the size of the LUTs which therefore decreases the computation complexity of Curl while still achieving good approximations. For a LAN setting, the total runtime is expected to be dominated by the computation time. However, for WAN, the total runtime is mostly dominated by communication. Therefore, we focus on achieving better errors than CrypTen while keeping a similar latency level. As an example, consider running 265×265 log functions in a LAN setting with 0.8ms ping-time and 3 Gbps bandwidth and in a WAN setting with 80ms ping-time and 200 Mbps bandwidth. We estimate3 Curl to take 180ms and CrypTen to take 313ms (×1.8 more) in LAN and Curl to take 599ms and CrypTen to take 5, 039ms (×10.1 more) in WAN. For general and unbounded functions, we apply the secure Biorthogonal protocol (Fig. 7). Exception- ally, we use Haar-encoded protocol (Fig. 6) for the invsqrt function as it achieves better errors whenever the (0, 1) interval is included. For sqrt and invsqrt we consider the interval (0, 256). However, for log we consider the interval to be (0, 64) as CrypTen’s MAE and MRE are of the order of 1e+6 for (0, 256). This corresponds to an original LUT of size 224 for sqrt and invsqrt and of size 222 for log. Observe the logarithm approximation in Fig. 8a in which both latency and MAE of Curl matches that 3 We add the computation latency, the ping-time multiplied by the number of rounds and the amount of information sent divided by the bandwidth. We use the numbers reported in Table 1. of CrypTen’s for LUT sizes of 8 bits. In Figs. 8b and 8c, Curl’s MAE is always below CrypTen’s. For these two functions, we use LUT sizes with 6 bits as these provide lower latency for Curl. We set the compression parameter to 𝑗 = 18 for both sqrt and invsqrt and 14 for log. For sigmoid, we use the protocol presented in Fig. 11 and set the interval to be (−64, 64). As analyzed in Appendix B.1, sigmoid is not an odd function but the last step of the protocol can be adapted to be applied to any s-shape function, including sigmoid. We observe that Curl achieves a better approximation than CrypTen for LUT sizes with 6 or more bits with runtime being marginally slower. Finally, for SiLU, we set the interval to be (−64, 64) and we use two different protocols: the approach described in Appendix B.2 which is a derivation of Fig 11 protocol (Fig. 8e); and the Biorthogonal protocol (Fig. 8f). We observe that Appendix B.2 approach achieves better MAE errors when compared to the Biorthogonal protocol. The latency from Fig. 8e is less affected by an increase in the LUT when compared with Fig. 8f. The same behavior happens with GeLU and SiLU as we can observe in Table 1. This is justified by the fact that the approach from Appendix B.2 requires comparisons, whether the Biorthogonal protocol does not. Indeed, note that in all functions where a DWT-LUT protocol is applied, the latency decreases by half when we half the LUT size. 5.1.2. Evaluations Having found the optimal tradeoffs for LUT sizes, we compare the Curl approximations with the DWT technique to CrypTen’s polynomial approximations in terms of latency, communication, and errors. In Table 1, we evaluate a set of commonly used non-linear functions including logarithm, inverse square root, the sigmoid activation function, as well as GeLU and SiLU. For unbounded functions, we use the protocol from Fig. 7, while for bounded functions (e.g., sin) we use our optimized protocols which are referenced in the table. We experimentally verified that, in most cases, the Biorthogonal protocol outperforms the Haar DWT in terms of both MAE and MRE, with the only exception being invsqrt. Observe that for the same LUT size, the latency for Haar is 2/3 lower than the Biorthogonal approach (compare sqrt 0.06s latency using Biorthogonal with invsqrt 0.04s latency using Haar). For each function, we specify the evaluation domain for which we report the errors (i.e., MAE and MRE). Observe that all functions that utilize the protocol from Fig. 7 and Fig. 6 have constant rounds and communication, regardless of the LUT size, as the one-hot vectors (which depend on the size) are generated during preprocessing from the dealer. Latency is affected by the LUT size (e.g., sqrt vs log) since the LUT output value is accumulated with shares of zero, which are as many as the LUT size (see last step of Fig. 4). However, as discussed before, the end-to-end latency will be dominated by the number of rounds and total communication size, for which Curl is almost always significantly better. Regarding trigonometric functions (sin and cos) we use the approach proposed in Appendix B.3. We note that CrypTen’s approximation of both functions already has a 1.2e+2 error for the range (−128, 128). For this reason, we decrease the interval range to (−64, 64). On the other hand, Curl’s approximation keeps the same level of errors independent of the input range. Moreover, Curl achieves that while having less latency, rounds and communication when compared with CrypTen. 5.2. LLM Inference We evaluate Curl with BERT Tiny (4 million parameters), BERT Base (110 million parameters), BERT Large (340 million parameters), GPT-2 (124 million parameters), and GPT Neo (1.3 billion parameters). In Table 2, we report the latency, number of rounds, and total communication for each LLM for 64 inputs. For completeness, we also evaluate BERT Tiny and Base on GPU for 64 elements and report between 2 − 3× speedup compared to CPU (1.1 and 6.5 seconds, respectively). Finally, we use the QNLI classification task from the GLUE benchmark [74] to evaluate the accuracy of our models. Curl achieves an accuracy of 80.3% for encrypted BERT Tiny, which almost matches the plaintext accuracy (i.e., 81.4%). This further verifies our claim regarding the probabilistic truncation protocol that is sufficiently accurate to evaluate LLMs. However, we observed that in LLMs the LUT protocols are not great for inverse square root as the input range varies and the output of the function changes rapidly in (0, 1]. Thus, to achieve high accuracy we resort to a comparison and two LUTs: a dense one between (0, 1] and a sparse one in (1, 256), getting the best of both worlds. Table 2 Runtime and communication of Curl with a sequence length of 64 items. Model Latency (s) Rounds Com. (GB) BERT Tiny 3.55 409 1.34 BERT Base 13.63 1,629 2.8 BERT Large 33.93 3,093 5.66 GPT-2 16.61 1,630 3.77 GPT-Neo 103.4 3,118 14.9 Next, we compare with Iron, MPCFormer, Puma, and Bolt in Table 3 for BERT Base with 128 items. In terms of end-to-end latency, Curl outperforms all other frameworks by at least 1.5×, with as much as 20× in the case of Iron. More importantly, Curl achieves this with significantly less amount of rounds and total communication. Specifically, Iron and Bolt require 8× and 6× more rounds than Curl, respectively, while in terms of total communication, all other frameworks need more than 10 GBs (with Iron needing 281 GBs) while Curl only needs 5.7 GBs. Although the end-to-end runtime in a real-world instantiation would increase due to communication, Curl would not be affected as much by that since it exhibits the lowest communication and number of rounds. Table 3 Runtime and communication comparisons with a sequence length of 128 items for the full BERT Base LLM. Framework Latency (s) Rounds Com. (GB) Iron [26] 475 13,663 281 MPCFormer [27] 55.3 – 12.1 Puma [42] 33.9 – 10.8 Bolt [28] 185 10,509 59.6 Bolt (WE) [28]† 91 10,901 25.7 Curl 22.5 1,629 5.7 † In Bolt, WE stands for word elimination. Finally, in Table 4 we compare Curl with CrypTen for five LLMs with 64 inputs. This differs from Tables 2 and 3 as CrypTen lacks a protocol to evaluate embeddings (Section 2.2). Thus, in Table 4, we skip embedding for a fair comparison. Curl needs less than half the amount of rounds CrypTen does for all five models and almost half the total communication. Although Curl’s runtimes are already faster than CrypTen’s, the difference in end-to-end latency will increase further in a realistic instantiation due to communication. Table 4 Runtime and communication comparisons with CrypTen with a sequence length of 64 items. These models skip the embeddings as CrypTen does not support lookups. Curl CrypTen [12] Model Latency (s) Rounds Com. (GB) Latency (s) Rounds Com. (GB) BERT Tiny 0.51 252 0.02 0.86 539 0.05 BERT Base 9.01 1,472 1.31 13.23 3,089 2.62 BERT Large 28.79 2,936 4.11 38.59 6,149 7.6 GPT-2 8.94 1,464 1.31 12.93 3,060 2.62 GPT-Neo 98.44 2,952 11.95 137.20† 6,144 18.91 † Approximated since CrypTen could not fit GPT Neo in RAM. 6. Concluding Remarks We have introduced Curl, a framework for privacy-preserving machine learning that evaluates non-linear functions as lookup tables, resulting in better approximations and significant round and communication reduction. Curl achieves this by relying on discrete wavelet transformations to reduce the lookup table sizes without sacrificing accuracy, which has resulted in up to 19× round and communication reduction compared to related works. We have evaluated Curl on five large language models such as BERT, GPT-2, and GPT Neo (1.3 billion parameters), and have compared it against state-of-the-art related works, significantly reducing latency, the number of communication rounds, and the total communication. On top of that, Curl is easy to use by exposing a tensor-based programming model that machine learning researchers and developers are familiar with which allows for both CPU and GPU backends. Lastly, we have introduced a new ideal functionality for probabilistic truncation protocols and proved their security in the real-ideal paradigm. This is of independent interest, as truncation is a core component of encrypted machine-learning models, and probabilistic truncation was previously considered insecure. References [1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023). [2] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423. [3] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023). [4] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901. [5] T. Goyal, J. J. Li, G. Durrett, News summarization and evaluation in the era of GPT-3, arXiv preprint arXiv:2209.12356 (2022). [6] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al., Webgpt: Browser-assisted question-answering with human feedback, arXiv preprint arXiv:2112.09332 (2021). [7] C. Peris, C. Dupuy, J. Majmudar, R. Parikh, S. Smaili, R. Zemel, R. Gupta, Privacy in the time of language models, in: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, 2023, pp. 1291–1292. [8] D. Gong, X. Wan, D. Wang, Working memory capacity of ChatGPT: An empirical study, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 2024, pp. 10048–10056. [9] O. Goldreich, S. Micali, A. Wigderson, How to play any mental game or A completeness theorem for protocols with honest majority, in: A. Aho (Ed.), 19th Annual ACM Symposium on Theory of Computing, ACM Press, New York City, NY, USA, 1987, pp. 218–229. doi:10.1145/28395.28420. [10] A. C.-C. Yao, Protocols for secure computations (extended abstract), in: 23rd Annual Symposium on Foundations of Computer Science, IEEE Computer Society Press, Chicago, Illinois, 1982, pp. 160–164. doi:10.1109/SFCS.1982.38. [11] A. P. K. Dalskov, D. Escudero, M. Keller, Secure evaluation of quantized neural networks, Proceed- ings on Privacy Enhancing Technologies 2020 (2020) 355–375. doi:10.2478/popets-2020-0077. [12] B. Knott, S. Venkataraman, A. Hannun, S. Sengupta, M. Ibrahim, L. van der Maaten, Crypten: Secure multi-party computation meets machine learning, in: M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 4961–4973. [13] N. Koti, M. Pancholi, A. Patra, A. Suresh, SWIFT: Super-fast and robust privacy-preserving machine learning, in: M. Bailey, R. Greenstadt (Eds.), USENIX Security 2021: 30th USENIX Security Symposium, USENIX Association, 2021, pp. 2651–2668. [14] Z. Huang, W. jie Lu, C. Hong, J. Ding, Cheetah: Lean and fast secure two-party deep neural network inference, in: K. R. B. Butler, K. Thomas (Eds.), USENIX Security 2022: 31st USENIX Security Symposium, USENIX Association, Boston, MA, USA, 2022, pp. 809–826. [15] K. Gupta, N. Jawalkar, A. Mukherjee, N. Chandran, D. Gupta, A. Panwar, R. Sharma, SIGMA: Secure GPT Inference with Function Secret Sharing, Proceedings on Privacy Enhancing Technologies 2024 (2024) 61–79. doi:10.56553/popets-2024-0107. [16] P. Mohassel, P. Rindal, ABY3 : A mixed protocol framework for machine learning, in: D. Lie, M. Mannan, M. Backes, X. Wang (Eds.), ACM CCS 2018: 25th Conference on Computer and Communications Security, ACM Press, Toronto, ON, Canada, 2018, pp. 35–52. doi:10.1145/ 3243734.3243760. [17] S. Tan, B. Knott, Y. Tian, D. J. Wu, CryptGPU: Fast privacy-preserving machine learning on the GPU, in: 2021 IEEE Symposium on Security and Privacy, IEEE Computer Society Press, San Francisco, CA, USA, 2021, pp. 1021–1038. doi:10.1109/SP40001.2021.00098. [18] M. Keller, K. Sun, Secure quantized training for deep learning, in: International Conference on Machine Learning, PMLR, 2022, pp. 10912–10938. [19] J.-L. Watson, S. Wagh, R. A. Popa, Piranha: A GPU platform for secure computation, in: K. R. B. Butler, K. Thomas (Eds.), USENIX Security 2022: 31st USENIX Security Symposium, USENIX Association, Boston, MA, USA, 2022, pp. 827–844. [20] N. Jawalkar, K. Gupta, A. Basu, N. Chandran, D. Gupta, R. Sharma, Orca: FSS-based secure training with GPUs, Cryptology ePrint Archive, Report 2023/206, 2023. https://eprint.iacr.org/2023/206. [21] I. Damgård, V. Pastro, N. P. Smart, S. Zakarias, Multiparty computation from somewhat homomor- phic encryption, in: R. Safavi-Naini, R. Canetti (Eds.), Advances in Cryptology – CRYPTO 2012, volume 7417 of Lecture Notes in Computer Science, Springer, Heidelberg, Germany, Santa Barbara, CA, USA, 2012, pp. 643–662. doi:10.1007/978-3-642-32009-5_38. [22] D. Escudero, An introduction to secret-sharing-based secure multiparty computation, Cryptology ePrint Archive, Report 2022/062, 2022. https://eprint.iacr.org/2022/062. [23] S. Wagh, Pika: Secure computation using function secret sharing over rings, Proceedings on Privacy Enhancing Technologies 2022 (2022) 351–377. doi:10.56553/popets-2022-0113. [24] O. Catrina, S. de Hoogh, Improved primitives for secure multiparty integer computation, in: J. A. Garay, R. D. Prisco (Eds.), SCN 10: 7th International Conference on Security in Communication Networks, volume 6280 of Lecture Notes in Computer Science, Springer, Heidelberg, Germany, Amalfi, Italy, 2010, pp. 182–199. doi:10.1007/978-3-642-15317-4_13. [25] I. Damgård, M. Fitzi, E. Kiltz, J. B. Nielsen, T. Toft, Unconditionally secure constant-rounds multi- party computation for equality, comparison, bits and exponentiation, in: S. Halevi, T. Rabin (Eds.), TCC 2006: 3rd Theory of Cryptography Conference, volume 3876 of Lecture Notes in Computer Science, Springer, Heidelberg, Germany, New York, NY, USA, 2006, pp. 285–304. doi:10.1007/ 11681878_15. [26] M. Hao, H. Li, H. Chen, P. Xing, G. Xu, T. Zhang, Iron: Private inference on transformers, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 15718–15731. [27] D. Li, R. Shao, H. Wang, H. Guo, E. P. Xing, H. Zhang, MPCFormer: Fast, performant and private transformer inference with MPC, in: International Conference on Learning Representations (ICLR), 2023, pp. 1–16. [28] Q. Pang, J. Zhu, H. Möllering, W. Zheng, T. Schneider, Bolt: Privacy-preserving, accurate and efficient inference for transformers, in: 2024 IEEE Symposium on Security and Privacy (SP), IEEE Computer Society, Los Alamitos, CA, USA, 2024, pp. 133–133. doi:10.1109/SP54263.2024. 00130. [29] E. Boyle, N. Chandran, N. Gilboa, D. Gupta, Y. Ishai, N. Kumar, M. Rathee, Function secret sharing for mixed-mode and fixed-point secure computation, in: A. Canteaut, F.-X. Standaert (Eds.), Advances in Cryptology – EUROCRYPT 2021, Part II, volume 12697 of Lecture Notes in Computer Science, Springer, Heidelberg, Germany, Zagreb, Croatia, 2021, pp. 871–900. doi:10. 1007/978-3-030-77886-6_30. [30] K. Storrier, A. Vadapalli, A. Lyons, R. Henry, Grotto: Screaming fast (2+1)-PC or Z2𝑛 via (2, 2)-DPFs, in: ACM CCS 2023: 30th Conference on Computer and Communications Security, ACM Press, 2023, pp. 2143–2157. doi:10.1145/3576915.3623147. [31] D. Mouris, P. Sarkar, N. G. Tsoutsos, PLASMA: Private, Lightweight Aggregated Statistics against Malicious Adversaries, Proceedings on Privacy Enhancing Technologies 2024 (2024) 1–19. doi:10. 56553/popets-2024-0064. [32] D. Mouris, C. Patton, H. Davis, P. Sarkar, N. G. Tsoutsos, Mastic: Private Weighted Heavy-Hitters and Attribute-Based Metrics, Proceedings on Privacy Enhancing Technologies 2025 (2025) 1–30. [33] Y. Ishai, E. Kushilevitz, S. Meldgaard, C. Orlandi, A. Paskin-Cherniavsky, On the power of correlated randomness in secure computation, in: A. Sahai (Ed.), TCC 2013: 10th Theory of Cryptography Conference, volume 7785 of Lecture Notes in Computer Science, Springer, Heidelberg, Germany, Tokyo, Japan, 2013, pp. 600–620. doi:10.1007/978-3-642-36594-2_34. [34] G. Dessouky, F. Koushanfar, A.-R. Sadeghi, T. Schneider, S. Zeitouni, M. Zohner, Pushing the com- munication barrier in secure computation using lookup tables, in: ISOC Network and Distributed System Security Symposium – NDSS 2017, The Internet Society, San Diego, CA, USA, 2017, pp. 1–15. [35] A. Brüggemann, R. Hundt, T. Schneider, A. Suresh, H. Yalame, FLUTE: Fast and secure lookup table evaluations, in: 2023 IEEE Symposium on Security and Privacy, IEEE Computer Society Press, San Francisco, CA, USA, 2023, pp. 515–533. doi:10.1109/SP46215.2023.10179345. [36] L. Ducas, D. Micciancio, FHEW: Bootstrapping homomorphic encryption in less than a second, in: E. Oswald, M. Fischlin (Eds.), Advances in Cryptology – EUROCRYPT 2015, Part I, volume 9056 of Lecture Notes in Computer Science, Springer, Heidelberg, Germany, Sofia, Bulgaria, 2015, pp. 617–640. doi:10.1007/978-3-662-46800-5_24. [37] I. Chillotti, N. Gama, M. Georgieva, M. Izabachène, TFHE: Fast fully homomorphic encryption over the torus, Journal of Cryptology 33 (2020) 34–91. doi:10.1007/s00145-019-09319-x. [38] I. Chillotti, M. Joye, P. Paillier, Programmable bootstrapping enables efficient homomorphic inference of deep neural networks, in: S. Dolev, O. Margalit, B. Pinkas, A. Schwarzmann (Eds.), Cyber Security Cryptography and Machine Learning, Springer International Publishing, Cham, 2021, pp. 1–19. [39] C. Gouert, D. Mouris, N. G. Tsoutsos, HELM: Navigating Homomorphic Encryption through Gates and Lookup Tables, Cryptology ePrint Archive, Paper 2023/1382, 2023. https://eprint.iacr.org/2023/ 1382. [40] C. Gouert, M. Ugurbil, D. Mouris, M. de Vega, N. G. Tsoutsos, Ripple: Accelerating Programmable Bootstraps for FHE with Wavelet Approximations, in: International Conference on Information Security, Springer, 2024, pp. 1–20. [41] T. Chen, H. Bao, S. Huang, L. Dong, B. Jiao, D. Jiang, H. Zhou, J. Li, F. Wei, THE-X: Privacy- preserving transformer inference with homomorphic encryption, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Findings of the Association for Computational Linguistics: ACL 2022, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3510–3520. doi:10.18653/ v1/2022.findings-acl.277. [42] Y. Dong, W. jie Lu, Y. Zheng, H. Wu, D. Zhao, J. Tan, Z. Huang, C. Hong, T. Wei, W. Cheng, Puma: Secure inference of llama-7b in five minutes, arXiv preprint arXiv:2307.12533 (2023). [43] J. Ma, Y. Zheng, J. Feng, D. Zhao, H. Wu, W. Fang, J. Tan, C. Yu, B. Zhang, L. Wang, SecretFlow-SPU: A performant and User-Friendly framework for Privacy-Preserving machine learning, in: 2023 USENIX Annual Technical Conference (USENIX ATC 23), USENIX Association, Boston, MA, 2023, pp. 17–33. [44] S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan, Deep learning with limited numerical precision, in: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, JMLR.org, 2015, p. 1737–1746. [45] Y. Li, Y. Duan, Z. Huang, C. Hong, C. Zhang, Y. Song, Efficient 3PC for binary circuits with application to Maliciously-Secure DNN inference, in: 32nd USENIX Security Symposium (USENIX Security 23), USENIX Association, Anaheim, CA, 2023, pp. 5377–5394. [46] P. Mohassel, Y. Zhang, SecureML: A system for scalable privacy-preserving machine learning, in: 2017 IEEE Symposium on Security and Privacy, IEEE Computer Society Press, San Jose, CA, USA, 2017, pp. 19–38. doi:10.1109/SP.2017.12. [47] A. Patra, A. Suresh, BLAZE: Blazing fast privacy-preserving machine learning, in: ISOC Network and Distributed System Security Symposium – NDSS 2020, The Internet Society, San Diego, CA, USA, 2020, pp. 1–18. [48] A. P. K. Dalskov, D. Escudero, M. Keller, Fantastic four: Honest-majority four-party secure computation with malicious security, in: M. Bailey, R. Greenstadt (Eds.), USENIX Security 2021: 30th USENIX Security Symposium, USENIX Association, 2021, pp. 2183–2200. [49] T. Ryffel, P. Tholoniat, D. Pointcheval, F. R. Bach, AriaNN: Low-interaction privacy-preserving deep learning via function secret sharing, Proceedings on Privacy Enhancing Technologies 2022 (2022) 291–316. doi:10.2478/popets-2022-0015. [50] K. Gupta, D. Kumaraswamy, N. Chandran, D. Gupta, LLAMA: A low latency math library for secure inference, Proceedings on Privacy Enhancing Technologies 2022 (2022) 274–294. doi:10.56553/popets-2022-0109. [51] I. Damgård, D. Escudero, T. K. Frederiksen, M. Keller, P. Scholl, N. Volgushev, New primitives for actively-secure MPC over rings with applications to private machine learning, in: 2019 IEEE Symposium on Security and Privacy, IEEE Computer Society Press, San Francisco, CA, USA, 2019, pp. 1102–1120. doi:10.1109/SP.2019.00078. [52] D. Escudero, S. Ghosh, M. Keller, R. Rachuri, P. Scholl, Improved primitives for MPC over mixed arithmetic-binary circuits, in: D. Micciancio, T. Ristenpart (Eds.), Advances in Cryptology – CRYPTO 2020, Part II, volume 12171 of Lecture Notes in Computer Science, Springer, Heidelberg, Germany, Santa Barbara, CA, USA, 2020, pp. 823–852. doi:10.1007/978-3-030-56880-1_29. [53] S. Wagh, D. Gupta, N. Chandran, SecureNN: 3-party secure computation for neural network training, Proceedings on Privacy Enhancing Technologies 2019 (2019) 26–49. doi:10.2478/ popets-2019-0035. [54] M. Keller, K. Sun, Effectiveness of MPC-friendly Softmax Replacement, arXiv, 2020. doi:10.48550/ ARXIV.2011.11202. [55] D. Beaver, Efficient multiparty protocols using circuit randomization, in: J. Feigenbaum (Ed.), Ad- vances in Cryptology – CRYPTO’91, volume 576 of Lecture Notes in Computer Science, Springer, Hei- delberg, Germany, Santa Barbara, CA, USA, 1992, pp. 420–432. doi:10.1007/3-540-46766-1_ 34. [56] A. Patra, T. Schneider, A. Suresh, H. Yalame, ABY2.0: Improved mixed-protocol secure two-party computation, Cryptology ePrint Archive, Report 2020/1225, 2020. https://eprint.iacr.org/2020/1225. [57] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is All you Need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish- wanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017, pp. 1–11. [58] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, X. Hu, Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond, ACM Transactions on Knowledge Discovery from Data (2023). [59] K. Fukushima, Cognitron: A self-organizing multilayered neural network, Biological cybernetics 20 (1975) 121–136. [60] D. Hendrycks, K. Gimpel, Gaussian error linear units (GELUs), arXiv preprint arXiv:1606.08415 (2016). [61] P. Van Fleet, Discrete Wavelet Transformations: An Elementary Approach with Applications, Wiley, 2019. [62] C. Juvekar, V. Vaikuntanathan, A. Chandrakasan, GAZELLE: A low latency framework for secure neural network inference, in: W. Enck, A. P. Felt (Eds.), USENIX Security 2018: 27th USENIX Security Symposium, USENIX Association, Baltimore, MD, USA, 2018, pp. 1651–1669. [63] J. Liu, M. Juuti, Y. Lu, N. Asokan, Oblivious neural network predictions via MiniONN transforma- tions, in: B. M. Thuraisingham, D. Evans, T. Malkin, D. Xu (Eds.), ACM CCS 2017: 24th Conference on Computer and Communications Security, ACM Press, Dallas, TX, USA, 2017, pp. 619–631. doi:10.1145/3133956.3134056. [64] M. S. Riazi, M. Samragh, H. Chen, K. Laine, K. E. Lauter, F. Koushanfar, XONN: XNOR-based oblivious deep neural network inference, in: N. Heninger, P. Traynor (Eds.), USENIX Security 2019: 28th USENIX Security Symposium, USENIX Association, Santa Clara, CA, USA, 2019, pp. 1501–1518. [65] A. C.-C. Yao, How to generate and exchange secrets (extended abstract), in: 27th Annual Symposium on Foundations of Computer Science, IEEE Computer Society Press, Toronto, Ontario, Canada, 1986, pp. 162–167. doi:10.1109/SFCS.1986.25. [66] M. Keller, E. Orsini, P. Scholl, MASCOT: Faster malicious arithmetic secure computation with oblivious transfer, in: E. R. Weippl, S. Katzenbeisser, C. Kruegel, A. C. Myers, S. Halevi (Eds.), ACM CCS 2016: 23rd Conference on Computer and Communications Security, ACM Press, Vienna, Austria, 2016, pp. 830–842. doi:10.1145/2976749.2978357. [67] M. Keller, MP-SPDZ: A versatile framework for multi-party computation, in: J. Ligatti, X. Ou, J. Katz, G. Vigna (Eds.), ACM CCS 2020: 27th Conference on Computer and Communications Security, ACM Press, Virtual Event, USA, 2020, pp. 1575–1590. doi:10.1145/3372297.3417872. [68] J. Doerner, a. shelat, Scaling ORAM for secure computation, in: B. M. Thuraisingham, D. Evans, T. Malkin, D. Xu (Eds.), ACM CCS 2017: 24th Conference on Computer and Communications Security, ACM Press, Dallas, TX, USA, 2017, pp. 523–535. doi:10.1145/3133956.3133967. [69] E. Boyle, N. Gilboa, Y. Ishai, Secure computation with preprocessing via function secret sharing, in: D. Hofheinz, A. Rosen (Eds.), TCC 2019: 17th Theory of Cryptography Conference, Part I, volume 11891 of Lecture Notes in Computer Science, Springer, Heidelberg, Germany, Nuremberg, Germany, 2019, pp. 341–371. doi:10.1007/978-3-030-36030-6_14. [70] R. Canetti, Security and composition of multiparty cryptographic protocols, Journal of CRYPTOL- OGY 13 (2000) 143–202. [71] Y. Lindell, How to simulate it - A tutorial on the simulation proof technique, Cryptology ePrint Archive, Report 2016/046, 2016. https://eprint.iacr.org/2016/046. [72] B. Chor, O. Goldreich, E. Kushilevitz, M. Sudan, Private information retrieval, in: 36th Annual Symposium on Foundations of Computer Science, IEEE Computer Society Press, Milwaukee, Wisconsin, 1995, pp. 41–50. doi:10.1109/SFCS.1995.492461. [73] S. Wagh, S. Tople, F. Benhamouda, E. Kushilevitz, P. Mittal, T. Rabin, Falcon: Honest-majority maliciously secure framework for private deep learning, Proceedings on Privacy Enhancing Technologies 2021 (2021) 188–208. doi:10.2478/popets-2021-0011. [74] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: T. Linzen, G. Chrupała, A. Alishahi (Eds.), Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 353–355. doi:10.18653/v1/W18-5446. [75] S. Elfwing, E. Uchibe, K. Doya, Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, Neural Networks 107 (2018) 3–11. doi:10.1016/j. neunet.2017.12.012. [76] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language Models are Unsupervised Multitask Learners, 2019. Appendices A. Proof of Theorem 3.1 We consider the probabilistic truncation protocol of Escudero et al. [52], denoted ΠTruncPr 𝑛,𝑓 (J𝑥K) and shown in Fig. 9. Below we prove Theorem 3.1, i.e. we show that protocol Π𝑛,𝑓 TruncPr (J𝑥K) securely realizes the modified probabilistic truncation functionality ℱ𝑛,𝑓 TruncPr (J𝑥K) shown in Fig. 3 in Section 3. To simplify notation, we assume below that ℓ = 𝑛 − 1, in which case ˜𝑠 = 𝑠 = (𝑥 + ˜𝑟) mod 2𝑛 for ˜𝑟 = 2𝑛−1 𝑏 + 2𝑓 𝑟 + 𝑟′ . In the general case of ℓ < 𝑛 all arguments are the same except that computation is done over Z2ℓ+1 instead of Z2𝑛 . ΠTruncPr 𝑛,𝑓 (J𝑥K) 1. Assume parties hold J𝑥K ∈ Z2𝑛 for 𝑥 < 2ℓ , ℓ < 𝑛. 2. Assume parties hold (J𝑟K, J𝑟′ K, J𝑏K) from pre-processing where 𝑟, 𝑟′ , 𝑏 are random in resp. Z2ℓ−𝑓 , Z2𝑓 , and Z2 . 3. Parties set J𝑠K = 2𝑛−(ℓ+1) · (J𝑥K + 2ℓ J𝑏K + 2𝑓 J𝑟K + J𝑟′ K) and publicly reconstruct 𝑠 from J𝑠K. 4. Denote 𝑠 = 2𝑛−(ℓ+1) · ˜𝑠 and note that ˜𝑠 = (𝑥 + ˜𝑟) mod 2ℓ+1 where ˜𝑟 = 2ℓ 𝑏 + 2𝑓 𝑟 + 𝑟′ < 2ℓ+1 . 5. Parties compute J𝑣K = J𝑏 ⊕ ˜𝑠ℓ K = J𝑏K + ˜𝑠ℓ · (1 − 2J𝑏K), where ˜𝑠ℓ is the ℓ-th bit of ˜𝑠. 6. Parties output J𝑥′ K = 2ℓ−𝑓 J𝑣K + ⌊(𝑠˜ mod 2ℓ )/2𝑓 ⌋ − J𝑟K. Figure 9: Probabilistic truncation protocol of [52]. Proof. The simulator receives the shares of corrupted parties in sharings J𝑥K, J𝑟K, J𝑟′ K, J𝑏K and a random value 𝑠(𝑓 ) ∈ Z2𝑓 from functionality ℱ𝑛,𝑓 TruncPr (J𝑥K). On these inputs the simulator picks a random value 𝑠(𝑓 ) in Z2𝑛−𝑓 , sets 𝑠 = 2𝑓 · 𝑠(𝑓 ) + 𝑠(𝑓 ) , and simulates the reconstruction of secret-sharing J𝑠K = J𝑥 + ˜𝑟K = J𝑥K + 2𝑛−1 J𝑏K + 2𝑓 J𝑟K + J𝑟′ K into value 𝑠. Note that all other computations in the protocol are local given the input shares and the revealed value 𝑠, so the simulator’s work is done. Note that ΠTruncPr 𝑛,𝑓 (J𝑥K) reveals only 𝑠 reconstructed from secret-sharing J𝑠K = J𝑥 + ˜𝑟K. Since 𝑠 = 𝑥 + ˜𝑟 mod 2𝑛 and ˜𝑟 is random in Z2𝑛 it follows that 𝑠 is uniform in Z2𝑛 for every 𝑥, hence the simulation above is perfect. Note also that since the simulator sets 𝑠 as 2𝑓 · 𝑠(𝑓 ) + 𝑠(𝑓 ) , output 𝑠(𝑓 ) in the simulation agrees with value 𝑠(𝑓 ) set by the functionality. It remains to argue that distribution of 𝑥′ output by the ΠTruncPr𝑛,𝑓 (J𝑥K) is the same as in the ideal- TruncPr world functionality ℱ𝑛,𝑓 (J𝑥K). In the ideal-world functionality, value 𝑥′ is a function of 𝑥 and 𝑠(𝑓 ) , determined by the following formula: {︂ (𝑓 ) ′ 𝑥 if 𝑠(𝑓 ) ≥ 𝑥(𝑓 ) , 𝑥 = (𝑓 ) (3) 𝑥 +1 otherwise. We argue that 𝑥′ is determined in the same way in protocol ΠTruncPr 𝑛,𝑓 (J𝑥K). Recall that for any 𝑣, we (𝑓 ) 𝑓 𝑓 (𝑓 ) 𝑓 denote 𝑣 = ⌊𝑣/2 ⌋ and 𝑣(𝑓 ) = (𝑣 mod 2 ), hence 𝑣 = 𝑣 · 2 + 𝑣(𝑓 ) . Correctness of the protocol is based on the observation that if 𝑠 = 𝑥 + ˜𝑟 mod 2𝑛 then 𝑠 = 𝑥 + (2𝑓 𝑟 + 𝑟′ ) − 2𝑛−1 𝑣 over integers, where 𝑣 = 𝑏 ⊕ 𝑠𝑛−1 , equivalently 𝑠 = 2𝑓 (𝑥(𝑓 ) + (𝑟 − 𝑣˜)) + (𝑥(𝑓 ) + 𝑟′ ) where 𝑣˜ = (2(𝑛−1)−𝑓 )𝑣. There are two cases for how 𝑠(𝑓 ) and 𝑠(𝑓 ) relate to 𝑥(𝑓 ) , 𝑟, 𝑣˜ and 𝑥(𝑓 ) , 𝑟′ : if 𝑥(𝑓 ) + 𝑟′ < 2𝑓 , {︂ (𝑓 ) (𝑓 ) 𝑥 + (𝑟 − 𝑣˜) 𝑠 = (4) 𝑥(𝑓 ) + (𝑟 − 𝑣˜) + 1 otherwise. 𝑥(𝑓 ) + 𝑟′ if 𝑥(𝑓 ) + 𝑟′ < 2𝑓 , {︂ 𝑠(𝑓 ) = (5) 𝑥(𝑓 ) + 𝑟′ − 2𝑓 otherwise. Since J𝑥′ K is set to 𝑠(𝑓 ) − J𝑟 − 𝑣˜K, equation (4) implies that: if 𝑥(𝑓 ) + 𝑟′ < 2𝑓 , {︂ (𝑓 ) ′ (𝑓 ) 𝑥 𝑥 = 𝑠 − (𝑟 − 𝑣˜) = (𝑓 ) (6) 𝑥 +1 otherwise. However, equation (5) implies that: 𝑥(𝑓 ) +𝑟′ < 2𝑓 if and only if 𝑠(𝑓 ) = 𝑥(𝑓 ) +𝑟′ ≥ 𝑥(𝑓 ) (7) 𝑥(𝑓 ) +𝑟′ ≥ 2𝑓 if and only if 𝑠(𝑓 ) = 𝑥(𝑓 ) +(𝑟′ −2𝑓 ) < 𝑥(𝑓 ) . Equations (6) and (7) together imply the same expression of 𝑥′ according to 𝑥(𝑓 ) and 𝑠(𝑓 ) as in equation (3). We conclude that 𝑥′ output by protocol ΠTruncPr 𝑛,𝑓 (J𝑥K) is set by the same formula as in the ideal TruncPr functionality ℱ𝑛,𝑓 (J𝑥K). In the proof above, note that the randomness of the opened masked value is part of the view of the adversary, so the rounding error is fixed in both the real and ideal worlds by the adversary’s view. This argument can be extended to all probabilistic truncation protocols that follow the same pattern regarding masking the input value with randomness. More specifically, the proof can be adapted for the protocols with non-zero correctness error, including [24] and [51], and for other protocols with perfect correctness, e.g., [11]. B. Bounded functions Recall that an odd function F obeys the following property F(−𝑥) = −F(𝑥) and an even function F is such that F(−𝑥) = F(𝑥). In this section, we generalize the approach taken by Sigma for the GeLU function [15] and apply it to any bounded odd or even function that converges to some constant 𝑐 for |𝑥| → ∞ (e.g., s-shape functions). This method can be applied to several important functions: s-shape functions, used as activation functions (hyperbolic tangent tanh and error function erf); shifted s-shape functions, like sigmoid; and more general functions such as GeLU and SiLU. The odd/even property guarantees that we only need to compute the function for positive values. This effectively halves the size of the required LUT. To achieve this, we compute the sign of the input and multiply it by its value, then get its absolute value. The convergence property ensures that, given some acceptable error, we can clip the function for some interval [−𝑎, 𝑎] and assign a constant value 𝑐 to all elements outside that interval. We note that [15] sets 𝑐 = F(𝑎) but we set the constant 𝑐 to be the limit value of the function F. Outside [−𝑎, 𝑎], we claim that the error incurred by [15] is bigger for most of the inputs but for a short interval. Intuitively, Fig. 10 shows the errors incurred by Curl’s and Sigma’s approach to the Sigmoid function. Observe that the error following Sigma’s approach ( red area) is bigger than the error by Curl’s approach ( green area) outside the interval [2, 2.76]. 𝜎(𝑥) Curl 1 0.94 0.88 [20] [20] has Curl has lower error lower error 0.5 𝑥 2 2.76 Figure 10: Analysis of error incurred by Curl’s ( green ) and Sigma’s ( red ) approach on approximating Sigmoid function 𝜎(𝑥) for 𝑥 > 2. Curl’s approach sets the approximation 𝜎(𝑥) ̃︀ = 1 and Sigma’s approach sets 𝜎(𝑥) ̃︀ = 𝜎(2) for 𝑥 > 2. Sigma’s approach error is lower in [2, 2.76) and Curl’s approach error is lower in (2.76, +∞). ΠF𝑛,𝑎,𝑐,𝑝,𝑗 (J𝑥K) Parameters: 𝑛 is the full bit-width, 𝑎 defines the interval in which the bior-LUT protocol is applied, 𝑐 is the constant value for the function outside the interval, 𝑝 is the parity of F (𝑝 = 1 for odd functions and 𝑝 = 0 for even functions) and 𝑗 is the compression variable for the bior-LUT protocol. Compute sign and absolute value, Jsgn(𝑥)K and J|𝑥|K: 1. ⟨⟨𝑥⟩⟩ ← ΠA2B 𝑛 𝑛,𝑛 (J𝑥K) ∈ Z2 Compute comparison bit: 2. ⟨⟨𝑏⟩⟩ ← ⟨⟨𝑦⟩⟩ ≫ (𝑛 − 1) ∈ Z2 𝑏 = 𝑥 < 0. 3. J𝑏K ← ΠB2A 1,𝑛 (⟨⟨𝑏⟩⟩) ∈ Z2𝑛 4. Jsgn(𝑥)K ← 1 − 2 · J𝑏K ∈ Z2𝑛 5. J|𝑥|K ← Jsgn(𝑥)K · J𝑥K ∈ Z2𝑛 Check |𝑥| < 𝑎: 6. J𝑧K ← J|𝑥|K − 𝑎 ∈ Z2𝑛 7. ⟨⟨𝑧⟩⟩ ← ΠA2B 𝑛 𝑛,𝑛 (J𝑧K) ∈ Z2 Compute comparison bit: 8. ⟨⟨𝑖⟩⟩ ← ⟨⟨𝑧⟩⟩ ≫ (𝑛 − 1) ∈ Z2 𝑖 = |𝑥| < 𝑎. 9. J𝑖K ← ΠB2A 1,𝑛 (⟨⟨𝑖⟩⟩) ∈ Z2𝑛 Compute F: 10. J𝑦 ′ K ← ΠBior-LUT 𝑛,𝑛,𝑗,F (J|𝑥|K) {︃ 𝑦 ′ , if |𝑥| < 𝑎 11. J𝑦K ← (J𝑖K · J𝑦 K + (1 − J𝑖K) · 𝑐) ′ ◁ Computes share of: 𝑦 = 𝑐, otherwise. 12. Output: Jsgn(𝑥)K𝑝 · J𝑦K ∈ Z2𝑛 ◁ Equivalent to J𝑦K for even functions and J−𝑦K otherwise. Figure 11: Protocol for convergent bounded odd (𝑝 = 1) or even (𝑝 = 0) function F based on the secure biorthogonal-encoded LUT protocol. The ΠF𝑛,𝑎,𝑐,𝑝,𝑗 (J𝑥K) protocol is formally shown in Fig. 11. Recall from Section 2.1, that ΠA2B B2A 𝑛,𝑙 and Π𝑛,𝑙 represent the arithmetic to binary and binary to arithmetic conversion protocols. The protocol goes as follows. For input value 𝑥, the parties start by computing the signal of 𝑥 and its absolute value. Note that the signal value can be computed using the expression sgn(𝑥) = 1 − 2 · (𝑥 < 0) and the absolute value of 𝑥 can be computed as |𝑥| = sgn(𝑥) · 𝑥. The absolute value is then used to check whether 𝑥 lies within a predetermined interval [−𝑎, 𝑎] or not. Then, parties run the secure Biorthogonal protocol ΠBior-LUT ′ ′ 𝑛,𝑙,𝑗,F (J𝑥K) to evaluate on |𝑥|, returning 𝑦 . In case 𝑥 ∈ [−𝑎, 𝑎], parties output 𝑦 , otherwise they output the limit value 𝑐 of the function F. This is effectively computed by the operation 𝑖 · 𝑦 ′ + (1 − 𝑖) · 𝑐, where 𝑖 is the comparison bit for |𝑥| < 𝑎. In the last step, the output value is corrected in case the function is odd: returning sgn(𝑥) · 𝑦. This is set by the parity parameter 𝑝, which is 𝑝 = 0 for even functions and 𝑝 = 1 for odd functions. B.1. S-shape functions In the previous section, we presented a protocol that can be applied to any bounded odd or even function that is convergent. S-shape functions are bounded functions that converge to some value 𝑐 for |𝑥| → ∞ but are not necessarily odd or even. For example, we have that tanh and erf are odd functions but the sigmoid function 𝜎(𝑥) is not. However, we note that we can transform any s-shape function to an odd function by applying a shift. For example, the sigmoid function 𝜎(𝑥) can be shifted by 1/2, resulting in an odd function: 𝜎(𝑥) − 1/2. Generally, observe that, for an s-shape function F, there exists a shift value 𝑠, such that F(−𝑥) − 𝑠 = −F(𝑥) + 𝑠 ⇔ F(−𝑥) = 2 · 𝑠 − F(𝑥). So, setting for 𝑏 = 𝑥 < 0, we can change the last step (12) of protocol ΠF𝑛,𝑎,𝑐,𝑝,𝑗 (J𝑥K) to evaluate the function F: F(𝑥) = 𝑏 · F(−|𝑥|) + (1 − 𝑏) · F(|𝑥|) = 𝑏 · (2 · 𝑠 − F(|𝑥|)) + (1 − 𝑏) · F(|𝑥|) = 𝑏 · 2 · 𝑠 + sgn(𝑥) · F(|𝑥|), where sgn(𝑥) = (1 − 2 · 𝑏). Thus, for the sigmoid function where 𝑠 = 1/2, the last step (12) from Fig. 11 is as follows: J𝑏K + Jsgn(𝑥)K · J𝑦K. Note that for both tanh and erf 𝑏 = 0, recovering the initial expression. Parameters. Now, for each of the three s-shape functions, we need to define the following parameters: bitwidth 𝑛, the interval [−𝑎, 𝑎], the constant 𝑐, the parity 𝑝 and the compression value 𝑗. We use the hardcoded bitwidth value from CrypTen 𝑛 = 64, imposed by the PyTorch dependency. Also, 𝑐, 𝑝 = 1 for all three s-shape functions. We note that [15] clips the interval in GeLu and accepts a constant error on the order of ∼ 10−5 . We follow a similar approach and accept errors < 10−5 . For the error function (erf), hyperbolic tangent (tanh) and sigmoid (sigmoid) we set the interval to be [−4, 4], [−8, 8] and [−16, 16], respectively. These intervals incurr a maximum error on the order of 10−8 , 10−7 and 10−7 , and the size of the original LUT is 218 , 219 and 220 , respectively. B.2. Activation Functions: GeLU And SiLU In this section, we consider two popular activation functions: GeLU [60] and SiLU [75], given by (︂ (︂ )︂)︂ 𝑥 𝑥 GeLU(𝑥) := 1 + erf √ , 2 2 where erf(·) denotes the Gauss error function and SiLU(𝑥) := 𝑥 · sigmoid(𝑥). These activation functions are sometimes considered smooth approximations to the classical ReLU of [59]. We observe that the functions DGeLU := ReLU(𝑥) − GeLU(𝑥) and DSiLU := ReLU(𝑥) − SiLU(𝑥) are even functions. Regarding DGeLU , note that (︂ (︂ )︂)︂ 1 𝑥 DGeLU = |𝑥| − 𝑥 erf √ , 2 2 so the claim follows as |𝑥| is even and 𝑥 and erf are both odd. Regarding DSiLU , observe that 𝜎(−𝑥) = 1 − 𝜎(𝑥), for sigmoid 𝜎. Thus, we have: {︃ 𝑥 · (1 − 𝜎(𝑥)) 𝑥 ≥ 0, DSiLU (𝑥) = −𝑥 · 𝜎(𝑥) 𝑥<0 {︃ 𝑥 · 𝜎(−𝑥) 𝑥 ≥ 0, = −𝑥 · 𝜎(𝑥) 𝑥 < 0 = DSiLU (−𝑥). Since these two activation functions are also bounded and converge to 0 for |𝑥| → ∞, we can use ΠF𝑛,𝑎,𝑐,𝑝,𝑗 protocol (Fig. 11) to compute both GeLU and SiLU by evaluating: GeLU(𝑥) = ReLU(𝑥) − ΠD DSiLU 𝑛,𝑎,𝑐,𝑝,𝑗 (𝑥) and SiLU(𝑥) = ReLU(𝑥) − Π𝑛,𝑎,𝑐,𝑝,𝑗 (𝑥). GeLU Note that ReLU can be computed using ReLU(𝑥) = (𝑥 < 0) · 𝑥. Since we compute 𝑥 < 0 in step 3. of the ΠF𝑛,𝑎,𝑐,𝑝,𝑗 protocol (Fig. 11), we can reuse this value. Parameters. Again, we use the hardcoded bitwidth value from CrypTen 𝑛 = 64 and 𝑐, 𝑝 = 0. As noted before, [15] clips the interval in GeLU and accepts a constant error on the order of ∼ 10−5 . We follow a similar approach and accept errors < 10−5 . For GeLU and SiLU we set the interval to be [−4, 4] and [−16, 16], respectively. These intervals incur a maximum error on the order of 10−8 and 10−7 , and the size of the original LUT is 218 and 220 , respectively. Πsin 𝑛,𝑓,𝑗 (J𝑥K) Parameters: 𝑛 is the full bit-width, 𝑓 is the number of bits used for the fractional part, and 𝑝 = 0 for even functions) and 𝑗 is the compression variable for the Biorthogonal protocol. Compute sign and absolute value, Jsgn(𝑥)K and J|𝑥|K: 1. ⟨⟨𝑥⟩⟩ ← ΠA2B 𝑛,𝑛 (J𝑥K) 2. ⟨⟨𝑏⟩⟩ ← ⟨⟨𝑦⟩⟩ ≫ (𝑛 − 1) 3. J𝑏K ← ΠB2A 1,𝑛 (⟨⟨𝑏⟩⟩) 4. Jsgn(𝑥)K ← 1 − 2 · J𝑏K 5. J|𝑥|K ← Jsgn(𝑥)K · J𝑥K Compute sin: 6. Use encoding of 2𝜋, 𝑞 := ⌊2𝜋 · 2𝑓 ⌋, and compute J|𝑥′ |K ← Πdiv (J|𝑥|K, 𝑞). 7. J𝑦 ′ K ← ΠBior-LUT ′ 𝑛,𝑛,𝑗,sin(2𝜋·) (J|𝑥 |K) 8. Output: Jsgn(𝑥)K · J𝑦K ∈ Z2𝑛 ◁ For cos it returns J𝑦K. Figure 12: Protocol for sin function based on the secure Biorthogonal protocol. B.3. Trigonometric functions It is known that sin is an odd function and cos is an even function. However, these two functions do not converge, making protocol ΠF𝑛,𝑎,𝑐,𝑝,𝑗 (Fig. 11) unsuitable. Despite this, we can still use the core concepts from that protocol and adapt them to these trigonometric functions. The key insight is that both functions are periodic with a period of 2𝜋. We leverage this periodicity by scaling these trigonometric functions to have period 1. This is achieved by generating LUT for sin(2𝜋 · 𝑥) and cos(2𝜋𝑥), where 𝑥 ∈ [0, 1]. Then, for input 𝑥, we evaluate the LUT using the secure Biorthogonal protocol at point 𝑥/(2𝜋). We execute division (Πdiv ) based on the Escudero et al. [52, §5.1] probabilistic truncation protocol over rings. The protocol for sin is formally described in Fig. 12. The cos protocol is similar but does not require the sign correction in the last step. B.4. The Softmax Function Recall that softmax is given by the following expression: 𝑒𝑥𝑖 −max(𝑥) 𝑦𝑖 = ∑︀𝑘−1 , 𝑥𝑗 −max(𝑥) 𝑗=0 𝑒 where 𝑥 ∈ R𝑘 and max is the maximum operation over the vector 𝑥. To be able to compute the expression above, we need three main functions: maximum, natural exponentiation, and reciprocal. The maximum operation is computed using the protocol implemented in CrypTen (Section C.1.4 [12]). We note that both CrypTen adopts a tree-reduction algorithm but also provides a pairwise method with constant round complexity. Reciprocal. From the softmax expression, we have that there exists at least one index 𝑗 such that 𝑒𝑥𝑗 −max(𝑥) = 1. Also, since there are at most 𝑘 terms, we have that the reciprocal function can be only evaluated within the [1, 𝑘] interval. We set 𝑘 = 64 as per in GPT-2 model [76]. Then, we apply the secure ΠBior-LUT 𝑛,𝑙,𝑗,F to the interval [1, 64]. Table 5 Comparison of the mean absolute error (MAE) between Haar and Bioriothogonal LUT protocol with probabilistic and deterministic truncation in (1, 64). Probabilistic Truncation Deterministic Truncation Op. Haar Biorthogonal Haar Biorthogonal log 8.63e-3 2.53e-2 9.76e-03 4.63e-04 reciprocal 2.26e-3 4.58e-4 1.87e-03 5.79e-04 sqrt 1.29e-1 8.84e-4 6.04e-02 6.41e-05 invsqrt 7.58e-4 5.40e-5 9.36e-04 2.20e-05 § The light blue background indicates the lowest MAE error. C. Truncation with Haar vs Biorthogonal We compare the accuracy of Haar and Biorthogonal DWT approaches using both probabilistic and deterministic truncation methods in Table 5. Our findings indicate that the Biorthogonal approach consistently outperforms the Haar approach. The reason for this is that Haar averages out two consecu- tive values whereas Birthogonal uses five consecutive elements, thus retaining more information and capturing the trend of the function. Finally, for the both Haar and Biorthogonal approaches we observe only a marginal difference between the probabilistic and deterministic truncation protocols.