A SmartNIC-Based Secure Aggregation Scheme for Federated Learning

Shengyin Zang1, Jiawei Fei2, Xiaodong Ren1, Yan Wang1, Zhuang Cao3,*, Jiagui Wu4,*
1 College of Artificial Intelligence, Southwest University, Chongqing, China
2 Defense Innovation Institute, Beijing, China
3 College of Computer, National University of Defense Technology, Changsha, China
4 School of Physical Science and Technology, Southwest University, Chongqing, China
* Corresponding authors: caozhuang16@nudt.edu.cn (Zhuang Cao), mgh@swu.edu.cn (Jiagui Wu)

Abstract

Federated learning is a widely used distributed machine learning technique in which participants collaborate to train neural network models without disclosing their private training datasets. Technologies such as homomorphic encryption have been proposed to prevent private data leakage. However, they suffer from two problems: performance degradation caused by excessive computational and communication overhead, and insecurity due to data exposure on the parameter server. Focusing on these problems, we propose an efficient and privacy-preserving federated learning solution that improves both performance and security by offloading the aggregation procedure onto the hardware data plane, such as an FPGA-based SmartNIC. Furthermore, we combine it with differential privacy techniques. Our method achieves higher security because the data plane is hard to access. In addition, extensive experiments show that, compared to a system employing additively homomorphic encryption, our scheme reduces the communication cost by around 59.5% and offers around 2.5× speedup at the aggregation stage while significantly decreasing the participants' computational overhead.

Keywords

Federated Learning, Differential Privacy, SmartNIC, Secure Aggregation, Privacy-Preserving

1. Introduction

In the current era of big data, dispersed data cannot be integrated and utilized due to data security, privacy protection, and related concerns, making data silos a severe obstacle to the advancement of artificial intelligence [1] [2]. To overcome this dilemma, federated learning [3] [4] was proposed by Google Research and is now widely employed in various scenarios. Federated learning is a distributed machine learning technique that enables model training on a sizable amount of decentralized data. Its goal is to accomplish multi-party cooperative modeling while safeguarding data privacy and satisfying regulatory compliance requirements. Compared with conventional machine learning techniques, federated learning not only improves learning efficiency but also resolves the problem of data silos and preserves local data privacy. However, it still risks leaking private information because it does not by itself offer comprehensive and adequate privacy protection. For instance, a malicious server can still deduce sensitive user information from the local gradients during the aggregation phase [5] [6]. To cope with this challenge, many scholars have conducted research on privacy leakage in federated learning. Homomorphic encryption [7] is one of the most commonly applied privacy technologies in federated learning. Although it offers security advantages, its tremendous computational complexity and ciphertext expansion place a heavy computing and communication burden on the participants. Despite numerous optimization strategies for this issue [8] [9] [10], the performance bottlenecks remain fundamentally unresolved.
Additionally, since the parameter server has access to ciphertext data, user privacy still faces the risk of leakage. Therefore, how to design a scheme that meets the above challenges remains an urgent problem. In this paper, we present an efficient federated learning privacy computing scheme based on hardware security, which offloads the gradient aggregation operation to an FPGA-based SmartNIC [11] and combines it with differential privacy techniques [12] as an alternative to traditional software protections. Our contributions are summarized as follows:

• We propose a SmartNIC-based gradient aggregation algorithm, which improves the security and performance of federated learning by offloading the gradient aggregation operation onto the SmartNIC.
• We implement an efficient aggregation structure on the SmartNIC, which enables our scheme to complete privacy-preserving computations while ensuring computational efficiency.
• We evaluate the performance benefits of the SmartNIC-based gradient aggregation algorithm. Extensive experiments show that our scheme has lower communication and computational overhead than schemes using additively homomorphic encryption.

2. System model and threat model

2.1. System model

As shown in Figure 1, the SmartNIC-based federated learning system consists of two main components: the users and the SmartNIC-based parameter server on the cloud. All users agree on an identical initial model and common training objectives. During the training process, no participant directly shares its private data.

Figure 1 System model

On the users' side, each user first downloads the global network model and the initial parameters from the cloud server, then trains the model on the local dataset to obtain local gradients, encrypts and uploads those gradients to the server, and finally decrypts the global gradients returned by the server to update the local model parameters.

On the SmartNIC-based parameter server, the primary task of the SmartNIC is to aggregate gradients efficiently and securely by exploiting its hardware-isolated execution environment and high computing performance. The SmartNIC decrypts the local gradients uploaded by each participant, aggregates them to obtain the global gradients, and adds Gaussian noise perturbation to thwart differential attacks [13]. The global gradients are then encrypted and broadcast to all users. Through continuous iterations, until the loss function reaches its minimum value, the optimal neural network model is finally constructed.
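To make this workflow concrete, the following Python sketch models one user-side training round. It is purely illustrative: the toy linear model, the `channel` object, and the `cipher` object (standing in for whatever encryption method the server assigns to this participant) are hypothetical placeholders, not part of our implementation.

```python
# Illustrative sketch of one user-side training round (Section 2.1).
# `channel` and `cipher` are hypothetical placeholders, not part of our design.
import numpy as np

def local_sgd(weights, batch):
    """One mini-batch gradient on a toy linear model with squared loss."""
    x, y = batch
    residual = x @ weights - y
    return x.T @ residual / len(y)

def client_round(weights, lr, batch, channel, cipher):
    grad = local_sgd(weights, batch)              # 1. local training
    channel.send(cipher.encrypt(grad))            # 2. encrypt and upload local gradients
    global_grad = cipher.decrypt(channel.recv())  # 3. decrypt returned global gradients
    return weights - lr * global_grad             # 4. update local model parameters
```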
2.2. Threat model

Our proposed scheme aims to safeguard the users' private information throughout the training phase. The cloud server is assumed to be honest-but-curious: it adheres to the protocol when executing gradient aggregation, but it is curious about the users' raw data and may attempt to bypass security measures to access that data directly. Additionally, malicious participants may try to determine whether a particular user is involved in the training process by analyzing the shared global gradients, i.e., by performing membership inference attacks [14].

3. Proposed scheme

In this section, we propose a SmartNIC-based federated learning gradient aggregation scheme that serves as an alternative to traditional software protections. The fundamental idea is to provide a trusted execution environment [15] for the isolated execution of procedures and the processing of private data by offloading the gradient aggregation operation to the FPGA-based SmartNIC. Because it is difficult for the server to access data inside the SmartNIC, sensitive data can be processed privately. In addition, perturbation noise satisfying the Gaussian mechanism is added to the aggregated global gradients to protect user data from differential attacks and to mitigate model overfitting, although this may reduce model accuracy or increase the number of iterations required for convergence.

3.1. Overview of secure aggregation scheme

Figure 2 illustrates the proposed aggregation scheme. The process of joint modeling can be broadly divided into the following parts:

Figure 2 Schematic of aggregation scheme

Initialization. The cloud server broadcasts the global network model, the initial parameters ω0 of the model, and the learning rate η to all users participating in the training. At the same time, a different encryption method is assigned to each participant.

Local Training. Based on the network model and initial parameters distributed by the server, each participant trains with stochastic gradient descent [16] on mini-batches of its local dataset and calculates the local gradients. These gradients are then encrypted and sent to the server's SmartNIC for aggregation.

Secure Aggregation. The SmartNIC separately decrypts the local gradients from the different participants. After receiving all users' data, the SmartNIC performs the aggregation operation. The aggregated global gradients are perturbed with noise drawn from a Gaussian distribution for privacy-preserving purposes. Finally, the SmartNIC encrypts the perturbed global gradients and broadcasts them to all users.

Global Update. After receiving the global gradients returned by the SmartNIC, each user decrypts them and updates the model parameters ω according to the global gradients and the learning rate η.

The system repeats the above steps until the loss function reaches its minimum value. The final neural network model is constructed through continuous loop iterations between the SmartNIC and the users. During the whole training process, the server only manages the SmartNIC and cannot access the users' private data on it, which is precisely what protects the users' privacy.
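On the SmartNIC these steps are realized by dedicated hardware engines (Section 3.2); purely as a behavioral reference, the Secure Aggregation phase can be summarized in Python as follows. The per-user `cipher` objects are hypothetical placeholders, and the noise scale `sigma` is a free parameter to be calibrated to the desired privacy budget.

```python
# Behavioral reference for the Secure Aggregation phase (Section 3.1).
# `ciphers` maps each user ID to that user's assigned cipher (hypothetical
# placeholder objects); `sigma` must be calibrated to the privacy budget.
import numpy as np

def secure_aggregate(encrypted_grads, ciphers, sigma, rng=None):
    rng = rng or np.random.default_rng()
    # 1. Separately decrypt each participant's local gradients.
    grads = [ciphers[uid].decrypt(ct) for uid, ct in encrypted_grads.items()]
    # 2. Aggregate once all users' data has arrived.
    global_grad = np.sum(grads, axis=0)
    # 3. Perturb with Gaussian noise to thwart differential attacks.
    global_grad = global_grad + rng.normal(0.0, sigma, size=global_grad.shape)
    # 4. Re-encrypt per user and broadcast the perturbed global gradients.
    return {uid: ciphers[uid].encrypt(global_grad) for uid in encrypted_grads}
```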
3.2. Aggregation architecture on SmartNIC

Figure 3 outlines the aggregation structure on the SmartNIC. During the entire training process, the data handling is implemented wholly on the SmartNIC, while the cloud server only configures the SmartNIC in the initial stage of training. For input transactions, the Ethernet MAC parses the data from the Ethernet PHY and passes it to the Decrypt Engine for decryption. The decrypted user data is stored in the DDR of the Storage Engine, where separate memory space is allocated for each user. Once all the memory spaces contain user data, the Aggregate Engine initiates the gradient aggregation operation and transfers the resulting data to the Perturb Engine, which adds noise satisfying the Gaussian distribution. The perturbed global gradients are then encrypted by the Encrypt Engine and stored in the FIFO. For output transactions, once a complete packet is stored in the FIFO, the Ethernet MAC drives the Ethernet PHY to distribute the encrypted global gradients to all users. Both transactions use the standard 128-bit AXI-Stream bus to interact with the Ethernet interface.

Figure 3 Architecture of aggregation scheme

4. Implementation

In this section, we discuss the hardware structure of the components on the SmartNIC.

Decrypt Engine. We implemented the DES and 3DES encryption and decryption algorithms in a pipelined manner on the Decrypt Engine. After receiving data, the Decrypt Engine decrypts it based on the user ID and sequence number. The Encrypt Engine uses a similar structure for encryption.
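As a software reference for what the Decrypt and Encrypt Engines compute, the sketch below uses the 3DES implementation from the PyCryptodome library. ECB mode is chosen here only because it mirrors the engines' independent per-64-bit-block pipelining; binding each block to the user ID and sequence number (e.g., via a counter-style construction) is omitted from this sketch.

```python
# Software reference for the per-block 3DES processing of the Decrypt/Encrypt
# Engines; requires PyCryptodome (pip install pycryptodome).
from Crypto.Cipher import DES3
from Crypto.Random import get_random_bytes

# Draw a 24-byte three-key 3DES key with correct parity bits; adjust_key_parity
# rejects keys that degenerate to single DES, so retry until one is accepted.
while True:
    try:
        key = DES3.adjust_key_parity(get_random_bytes(24))
        break
    except ValueError:
        continue

def encrypt_blocks(key, payload):
    """Encrypt a payload that is a whole number of 8-byte (64-bit) blocks.
    ECB only mimics the independent per-block hardware pipeline; a deployed
    design would additionally bind blocks to user ID / sequence number."""
    assert len(payload) % 8 == 0
    return DES3.new(key, DES3.MODE_ECB).encrypt(payload)

def decrypt_blocks(key, ciphertext):
    return DES3.new(key, DES3.MODE_ECB).decrypt(ciphertext)

# Two 32-bit gradient words fit exactly into one 64-bit cipher block.
block = (0x12345678).to_bytes(4, "big") + (0x9ABCDEF0).to_bytes(4, "big")
assert decrypt_blocks(key, encrypt_blocks(key, block)) == block
```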
Storage Engine. The structure of the Storage Engine is shown in Figure 4. Each user is assigned a separate memory space in the DDR of the Storage Engine. When data enters the Storage Engine, the user ID and sequence number are used to determine where the data is stored and to check for packet loss. Once none of the address spaces are empty, the data is read from the DDR, synchronized by the FIFOs, and delivered to the Aggregate Engine.

Figure 4 Storage Engine architecture

Aggregate Engine. The structure of the Aggregate Engine is shown in Figure 5. Our implementation aggregates up to 64 sets of gradient data simultaneously, using a six-stage pipeline structure to improve processing performance. The first stage uses 32 Aggregate Blocks (ABs), each of which aggregates two of the incoming gradient sets; the second stage uses 16 ABs, and the number of ABs is halved in each subsequent stage until the final global gradients are output and submitted to the Perturb Engine. With more than 64 users, intermediate results are temporarily stored in the DDR pending further aggregation; with fewer than 64 users, the unused AB inputs in the pipeline's first stage are set to zero.

Figure 5 64-bit Aggregate Engine architecture

The structure of the Aggregate Block is shown in Figure 6. Since the Encrypt Engine implements only the DES and 3DES algorithms, whose ciphertext blocks are 64 bits, we assume that the input of an AB is also 64 bits. Gradient data are represented as 32-bit fixed-point numbers, so each 64-bit input word carries two 32-bit-aligned gradients. The AB splits the input and feeds the two halves to two adders; the results of the two adders are concatenated back into 64 bits and delivered to a register for temporary storage.

Figure 6 64-bit Aggregate Block architecture
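The datapath above can be checked against a small behavioral model. The sketch below packs two 32-bit fixed-point gradients into each 64-bit word, adds words lane-wise exactly as an AB does (two independent 32-bit adders with wrap-around), and reduces 64 inputs through the halving stages of the pipeline. The Q16.16 fixed-point format is our assumption for illustration; the design fixes only the 32-bit width.

```python
# Behavioral model of the Aggregate Engine / Aggregate Block (Figures 5-6).
# Q16.16 fixed point is an assumed format; the design fixes only 32-bit width.
FRAC_BITS = 16
MASK32 = 0xFFFFFFFF

def to_fixed(x):         # float -> 32-bit two's-complement fixed point
    return int(round(x * (1 << FRAC_BITS))) & MASK32

def from_fixed(u):       # 32-bit two's-complement fixed point -> float
    return (u - (1 << 32) if u & 0x80000000 else u) / (1 << FRAC_BITS)

def pack(g0, g1):        # two 32-bit gradients -> one 64-bit input word
    return (to_fixed(g0) << 32) | to_fixed(g1)

def aggregate_block(a, b):
    """One AB: split both 64-bit words, add the 32-bit lanes independently
    (wrapping like the hardware adders), and concatenate the two sums."""
    hi = ((a >> 32) + (b >> 32)) & MASK32
    lo = ((a & MASK32) + (b & MASK32)) & MASK32
    return (hi << 32) | lo

def aggregate_engine(words):
    """Tree reduction over 64 input words: 32 ABs, then 16, 8, 4, 2, 1."""
    assert len(words) == 64   # with fewer users, unused inputs are zero
    while len(words) > 1:
        words = [aggregate_block(words[i], words[i + 1])
                 for i in range(0, len(words), 2)]
    return words[0]

# 64 users each upload the packed pair (0.5, -0.25); the sums are (32, -16).
out = aggregate_engine([pack(0.5, -0.25)] * 64)
assert (from_fixed(out >> 32), from_fixed(out & MASK32)) == (32.0, -16.0)
```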
5. Experiment

In this section, we assess our proposed scheme in terms of communication overhead, computational overhead, and hardware resource consumption by comparing it with PPDL [8], a recent scheme that adopts additively homomorphic encryption as its privacy-preserving approach. We implemented the SmartNIC-based secure aggregation structure on the Xilinx Zynq UltraScale+ ZCU111 evaluation platform in Verilog and used Vivado 2018.3 for logic synthesis and implementation. Without timing violations, the final design processes data at up to 425 MHz and achieves a data throughput of approximately 27 Gbps. Table 1 reports the resource utilization of the SmartNIC-based aggregation scheme. PPDL's benchmarks are based on the TensorFlow 1.1.0 library over CUDA 8.0 with a Tesla K40m GPU and a Xeon E5-2660 v3 @ 2.60 GHz server, and assume that each user uses only one thread. Since we completely offload the gradient aggregation process in federated learning from the cloud server to the SmartNIC, and the cloud server only configures the SmartNIC, the resource utilization on the cloud server in our scheme can be considered approximately zero compared with that of PPDL. Next, we mainly compare the communication and computational overhead of the two schemes.

Table 1 The area utilization report of our scheme

Resource   Utilization   Available   Utilization %
LUT        40962         425280      9.63
LUTRAM     663           213600      0.31
FF         30366         850560      3.57
BRAM       256           1080        23.70
DSP        2             4272        0.05
BUFG       1             696         0.14

5.1. Communication overhead

Assuming that each participant uses only one thread for computation, we first compare the per-participant communication overhead of our scheme and PPDL. The relationship between communication overhead and the number of gradients is shown in Figure 7. The figure clearly indicates that the communication overhead of PPDL is more than twice that of our scheme, mainly because of the rapid growth of ciphertext volume caused by homomorphic encryption.

Figure 7 Comparison of communication overhead
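The gap in Figure 7 follows directly from ciphertext sizes: with DES/3DES the ciphertext is exactly as long as the plaintext, whereas an additively homomorphic scheme such as Paillier emits a ciphertext as large as the squared modulus no matter how few plaintext bits it packs. The sketch below makes this concrete; the 2048-bit modulus and the packing factor are illustrative assumptions of ours, not PPDL's actual parameters.

```python
# Back-of-the-envelope upload sizes per participant. The homomorphic-side
# parameters (modulus_bits, grads_per_ct) are illustrative assumptions only.
import math

def block_cipher_bytes(n_grads, grad_bits=32, block_bits=64):
    """DES/3DES: ciphertext length equals plaintext length (two grads/block)."""
    return math.ceil(n_grads * grad_bits / block_bits) * block_bits // 8

def additive_he_bytes(n_grads, modulus_bits=2048, grads_per_ct=50):
    """Paillier-style: each ciphertext occupies 2 * modulus_bits regardless of
    how many gradients are packed into its plaintext."""
    return math.ceil(n_grads / grads_per_ct) * 2 * modulus_bits // 8

n = 1_000_000
print(block_cipher_bytes(n) / 2**20)   # ~3.8 MiB with 3DES
print(additive_he_bytes(n) / 2**20)    # ~9.8 MiB under the assumed packing
```

With these assumed parameters the homomorphic upload happens to be roughly 2.5 times larger; the measured curves in Figure 7 reflect PPDL's actual parameters.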
5.2. Computational overhead

We compare the computational cost of our approach and PPDL during the encryption, aggregation, and decryption stages. Since each participant in our scheme may use a different encryption method, we pick the 3DES algorithm, which has the highest computational cost, to compare with the homomorphic encryption of PPDL. As shown in Figure 8, our encryption overhead is considerably lower than that of PPDL as the number of gradients increases, owing to the high computational complexity of homomorphic encryption. Figure 9 shows the difference in computational cost between our scheme and PPDL in the aggregation stage. It is worth noting that the aggregation stage in our scheme includes four sub-stages: decryption, aggregation, noise addition, and encryption. Benefiting from the high performance of the SmartNIC, our computational overhead in the aggregation phase is less than half that of PPDL. Similarly, as shown in Figure 10, as the number of gradients increases, our decryption overhead is also much smaller than PPDL's. Therefore, our scheme is better suited to training large-scale deep neural network models.

Figure 8 Comparison of computational overhead at the encryption phase

Figure 9 Comparison of computational overhead at the aggregation phase

Figure 10 Comparison of computational overhead at the decryption phase

6. Related work

Some recent work has studied improving the performance and security of federated learning systems that use homomorphic encryption. For example, Aono et al. [8] implemented the aggregation of gradients on the cloud server using an additively homomorphic encryption scheme with low computational load and guaranteed high system accuracy. Even so, this scheme still incurs substantial computational overhead as the neural network and the number of training samples involved in modeling grow. Moreover, it is vulnerable to differential attacks, so the privacy of honest participants' data can be threatened by analysis of the shared model [17]. For this reason, Hao et al. [9] proposed an efficient federated deep learning scheme that integrates a lightweight symmetric additively homomorphic encryption [10] with differential privacy [12]. This scheme is secure in the honest-but-curious server setting, even if the cloud server colludes with multiple users. Unfortunately, all participants in this scheme use a single encryption key, leaving the system exposed to security threats even though aggregation can be carried out smoothly. Therefore, how to design a scheme that meets the above challenges remains an urgent problem.

7. Conclusion

This paper proposes an efficient and privacy-preserving federated learning system based on hardware security, which offloads the gradient aggregation operation onto a SmartNIC as an alternative to homomorphic cryptography. Extensive experiments demonstrate that our scheme has lower communication and computational overhead than schemes using additively homomorphic encryption. Meanwhile, our scheme is secure against an honest-but-curious cloud server and, by using a different encryption method for each participant, offers better security than single-key homomorphic encryption. In addition, we add Gaussian noise satisfying differential privacy to the aggregated global gradients to defend against differential attacks in federated learning. In future work, we will train real-world deep neural networks to evaluate the impact of our scheme on model accuracy compared to traditional centralized machine learning, and we will investigate using SmartNIC clusters for secure aggregation to support federated learning in larger-scale applications.

8. Acknowledgment

This work is supported by the National Natural Science Foundation of China (61875168); the Chongqing Science Funds for Distinguished Young Scientists (cstc2021jcyj-jqX0027); the Innovation Research 2035 Pilot Plan of Southwest University (SWU-XDPY22012); and the Innovation Support Program for Overseas Students in Chongqing (cx2021008).

9. References

[1] T. Tuor, J. Lockhart, D. Magazzeni, "Asynchronous collaborative learning across data silos," arXiv preprint arXiv:2203.12637, 2022.
[2] P. Franz, S. Olga, R. Kay, G. Florian, J. Igor, S. Franz et al., "Embracing opportunities of livestock big data integration with privacy constraints," in Proceedings of the 9th International Conference on the Internet of Things, pp. 1–4, 2019.
[3] H. B. McMahan, E. Moore, D. Ramage, B. A. y Arcas, "Federated learning of deep networks using model averaging," arXiv preprint arXiv:1602.05629, 2016.
[4] Y. Cheng, Y. Liu, T. Chen, Q. Yang, "Federated learning for privacy-preserving AI," Communications of the ACM, vol. 63, no. 12, pp. 33–36, 2020.
[5] L. Zhu, Z. Liu, S. Han, "Deep leakage from gradients," arXiv preprint arXiv:1906.08935, 2019.
[6] X. Xu, J. Wu, M. Yang, T. Luo, X. Duan, W. Li et al., "Information leakage by model weights on federated learning," in Proceedings of the 2020 Workshop on Privacy-Preserving Machine Learning in Practice, pp. 31–36, 2020.
[7] M. Naehrig, K. Lauter, V. Vaikuntanathan, "Can homomorphic encryption be practical?," in Proceedings of the 3rd ACM Workshop on Cloud Computing Security, pp. 113–124, 2011.
[8] Y. Aono, T. Hayashi, L. Wang, S. Moriai, "Privacy-preserving deep learning via additively homomorphic encryption," IEEE Transactions on Information Forensics and Security, vol. 13, no. 5, pp. 1333–1345, 2018.
[9] M. Hao, H. Li, G. Xu, S. Liu, H. Yang, "Towards efficient and privacy-preserving federated deep learning," in 2019 IEEE International Conference on Communications (ICC), pp. 1–6, 2019.
[10] J. Zhou, Z. Cao, X. Dong, X. Lin, "PPDM: A privacy-preserving protocol for cloud-assisted e-healthcare systems," IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 7, pp. 1332–1344, 2015.
[11] M. Liu, T. Cui, H. Schuh, A. Krishnamurthy, S. Peter, K. Gupta, "Offloading distributed applications onto SmartNICs using iPipe," in Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM), pp. 318–333, 2019.
[12] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar et al., "Deep learning with differential privacy," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318, 2016.
[13] C. Zhou, Y. Sun, D. Wang, "Federated learning with Gaussian differential privacy," in Proceedings of the 2020 2nd International Conference on Robotics, Intelligent Control and Artificial Intelligence, pp. 296–301, 2020.
[14] R. Shokri, M. Stronati, C. Song, V. Shmatikov, "Membership inference attacks against machine learning models," in 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18, 2017.
[15] A. Mondal, Y. More, R. H. Rooparaghunath, D. Gupta, "Flatee: Federated learning across trusted execution environments," arXiv preprint arXiv:2111.06867, 2021.
[16] R. Shokri, V. Shmatikov, "Privacy-preserving deep learning," in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1310–1321, 2015.
[17] R. C. Geyer, T. Klein, M. Nabi, "Differentially private federated learning: A client level perspective," arXiv preprint arXiv:1712.07557, 2017.