<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Split Ways: Privacy-Preserving Training of Encrypted Data Using Split Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tanveer Khan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Khoa Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonis Michalas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>RISE Research Institutes of Sweden</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tampere University</institution>
          ,
          <addr-line>Tampere</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Split Learning (SL) is a new collaborative learning technique that allows participants, e.g., a client and a server, to train machine learning models without the client sharing raw data. In this setting, the client initially applies its part of the machine learning model to the raw data to generate activation maps and then sends them to the server to continue the training process. Previous works in the field demonstrated that reconstructing activation maps could result in privacy leakage of client data. In addition, existing techniques that mitigate the privacy leakage of SL do so at a significant cost in accuracy. In this paper, we improve upon previous works by constructing a protocol based on U-shaped SL that can operate on homomorphically encrypted data. More precisely, in our approach, the client applies Homomorphic Encryption (HE) to the activation maps before sending them to the server, thus protecting user privacy. This is an important improvement that reduces privacy leakage in comparison to other SL-based works. Finally, our results show that, with the optimum set of parameters, training with HE data in the U-shaped SL setting only reduces accuracy by 2.65% compared to training on plaintext. In addition, raw training data privacy is preserved.</p>
      </abstract>
      <kwd-group>
<kwd>Homomorphic Encryption</kwd>
        <kwd>Privacy-preserving Machine Learning</kwd>
        <kwd>Split Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Vision AI systems have proven to surpass people in recognizing abnormalities such as tumours on X-rays and ultrasound scans [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In addition, machines can reliably make diagnoses equal to those of human experts. All the evidence indicates that we can now build systems that achieve human expert performance in analyzing medical data: systems allowing humans to send their medical data to a remote AI service and receive an accurate automated diagnosis. An intelligent and efficient AI healthcare system of this type offers great potential, since it can improve human health while also having an important social impact. However, these opportunities come with certain pitfalls, mainly concerning privacy. With this in mind, we have designed a system that analyzes images in a privacy-preserving way. More precisely, we show how encrypted images can be analyzed with high accuracy without leaking information about their actual content. While this is still far from our big dream (namely automated AI diagnosis), we believe it is an important step that will eventually pave the way towards our ultimate goal.
      </p>
      <p>
        Contributions The main contributions are:
        • We designed a simplified version of the 1D CNN model presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and use it to classify ECG signals [8] in both local and SL settings. More specifically, we construct a U-shaped split 1D CNN model and experiment using plaintext activation maps (PAMs) sent from the client to the server. Through the U-shaped 1D CNN model, clients do not need to share the input training samples or the ground-truth labels with the server; this is an important improvement that reduces privacy leakage compared to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
        • We constructed the HE version of the U-shaped SL. In the encrypted U-shaped SL, the client encrypts the activation maps using HE and sends them to the server. The advantage of the HE-encrypted U-shaped SL over the plaintext U-shaped SL is that the server performs its computation over encrypted activation maps (EAMs).
        • To assess the applicability of our framework, we performed experiments on a heartbeat dataset (MIT-BIH [8]). We experimented with activation maps of size 256 for both plaintext and homomorphically encrypted activation maps, and we measured the model’s performance in terms of training duration, test accuracy, and communication cost.
      </p>
    </sec>
    <sec id="sec-related">
      <title>2. Related Work</title>
      <p>
        The SL approach proposed by Gupta and Raskar [9] offers a number of significant advantages over FL. Similar to FL [10], SL does not share raw data. In addition, it has the benefit of not disclosing the model’s architecture and weights. For example, [9] predicted that reconstructing raw data on the client-side while using SL would be difficult. In addition, the authors of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] employed the SL model in healthcare applications to protect users’ personal data. Vepakomma et al. found that SL outperforms FL in terms of accuracy [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Initially, it was believed that SL is a promising approach in terms of client raw data protection, since only intermediate activation maps are shared between the parties. However, different studies showed the possibility of privacy leakage in SL. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the authors analyzed the privacy leakage of SL and found considerable leakage from the split layer in a 2D CNN model. Furthermore, the authors mentioned that it is possible to reduce the distance correlation between the split layer and the raw data by slightly scaling the weights of all layers before the split. This type of scaling works well in models with a large number of hidden layers before the split.
      </p>
      <p>
        The work of Abuadbba et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is the first study exploring whether SL can deal with time-series data. It investigates (i) whether SL can achieve the same model accuracy for a 1D CNN model compared to the non-split version and (ii) whether it can be used to protect privacy in sequential data. According to the results, SL can be applied to such a model without degrading its classification accuracy. As for the second question, the authors proposed a privacy assessment framework and proved that it is possible to reconstruct the raw data (personal ECG signals) in the 1D CNN model using SL. They suggested three metrics: visual invertibility, distance correlation, and dynamic time warping. The results showed that directly adopting SL in 1D CNN models for time-series data can result in significant privacy leakage. Two mitigation techniques were employed to limit the potential privacy leakage in SL: (i) increasing the number of layers before the split on the client-side and (ii) applying differential privacy to the split-layer activation before sending the activation map to the server. However, both techniques suffer from a loss of model accuracy, particularly when differential privacy is used. The strongest differential privacy can increase the dissimilarity between the activation map and the corresponding raw data; however, it degrades the classification accuracy significantly, from 98.9% to 50%.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], during the forward propagation, the client sends the PAMs to the server, and the server can easily reconstruct the original raw data from the activated vector of the split layer, leading to clear privacy leakage. In our work, we constructed a training protocol where, instead of sending PAMs, the client first encrypts the activation maps using HE and then sends said maps to the server. In this way, the server is unable to reconstruct the original raw data, but it can still perform computations on the EAMs and realize the training process.
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. Architecture</title>
      <p>In this section, we first describe the non-split version, or
local model, of the 1D CNN used to classify the ECG
signal. Then, we discuss the process of splitting this local
model into a U-shaped split model. Furthermore, we also
describe the involved parties (a client and a server) in
the training process of the split model, focusing on their
roles and the parameters assigned to them throughout
the training process.</p>
      <p>[Figure 1: The 1D CNN model, consisting of 1D Convolution, Leaky ReLU, Max Pooling, fully connected, and Softmax layers, split between the client side and the server side.]</p>
      <sec id="sec-2-1">
        <title>3.1. 1D CNN Local Model Architecture</title>
        <p>
          We first implement and successfully reproduce the local
model results [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. This model contains two Conv1D
layers and two FC layers. The optimal test accuracy that
this model achieves is 98.9%. We implement a simplified
version where the model has one less FC layer compared
to the model from [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Our local model consists of all
the layers of Figure 1 without any split between the client
and the server. As can be seen in Figure 1, we limit our
model to two Conv1D layers and one linear layer, as we
aim to reduce computational costs when HE is applied
to activation maps in the model’s split version. Reducing
the number of FC layers leads to a drop in the accuracy
of the model. The best test accuracy we obtained after
training our local model for 10 epochs with a batch size
of 4 is 92.84%. Although reducing the number of layers
affects the model’s accuracy, it is not within our goals to
demonstrate how successful our ML model is for this task;
instead, our focus is to construct a split model where
training and evaluation on encrypted data are comparable to
training and evaluation on plaintext data.
        </p>
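        <p>For concreteness, the following is a minimal PyTorch sketch of such a simplified local model. The channel counts, kernel sizes, and the input length of 128 are illustrative assumptions chosen so that the flattened activation map has size 256, as used in section 5; they are not necessarily the exact values of our implementation (see the repository linked in section 5).</p>
        <preformat>
# Sketch: simplified local 1D CNN with two Conv1D layers and one FC layer.
import torch
import torch.nn as nn

class Local1DCNN(nn.Module):
    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 4, kernel_size=5, padding=2),   # Conv1D block 1
            nn.LeakyReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(4, 8, kernel_size=5, padding=2),   # Conv1D block 2
            nn.LeakyReLU(),
            nn.MaxPool1d(2),
        )
        self.classifier = nn.Linear(256, n_classes)      # the single FC layer

    def forward(self, x):
        a = self.features(x)        # (batch, 8, 32) for inputs of length 128
        a = a.flatten(start_dim=1)  # activation map of size (batch, 256)
        return self.classifier(a)   # Softmax is applied by the loss function

x = torch.randn(4, 1, 128)          # a batch of 4 ECG segments of length 128
print(Local1DCNN()(x).shape)        # torch.Size([4, 5])
        </preformat>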
        <p>In section 5, we detail the results for the non-split
version and compare them with the split version.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. U-shaped Split 1D CNN Model</title>
        <p>The SL protocol consists of two parties: the client
and the server. We split the local 1D CNN into multiple
parts, where each party trains its part(s) and
communicates with the other to complete the overall training
procedure. More specifically, we construct the U-shaped
split 1D CNN in such a way that the first few layers as
well as the last layer are on the client side, while the
remaining layers are on the server side.</p>
        <p>Actors in the Split Learning Model As mentioned
earlier, in our SL setting, we have two involved parties:
the client and the server. Each party plays a specific role
and has access to certain parameters. More specifically,
their roles and accesses are described as:
• Client: In the plaintext version, the client holds the two
Conv1D layers and can access their weights and biases
in plaintext. The other layers (Max Pooling layers, Leaky
ReLU layers, Softmax layer) do not have weights and
biases. Apart from these, in the HE encrypted version,
the client is also responsible for generating the context
for HE and has access to all context parameters: the
polynomial modulus, the coefficient modulus, the scaling
factor ∆, the public key pk, and the secret key sk. Note
that for both training on plaintext and on EAMs, the raw
data examples x and their corresponding labels y
reside on the client side and are never sent to the server
during the training process.
• Server: In our model, the computation performed on the
server side is limited to only one linear layer. Hence, the
server can exclusively access the weights and biases of
this linear layer. Regarding the HE context parameters,
the server has access to the polynomial modulus, the
coefficient modulus, ∆, and pk shared by the client, but
not to the sk. Not holding the sk, the server cannot
decrypt the EAMs sent from the client. The
hyperparameters shared between the client and the server are
the learning rate, the batch size, the number of batches
to be trained, and the number of training epochs.</p>
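        <p>As an illustration, the client-side context generation in TenSEAL (the HE library we use, see section 5) might look as follows; a minimal sketch using the best-performing parameter set from section 5, not the exact code of our implementation (which is in the linked repository).</p>
        <preformat>
# Sketch: the client generates the CKKS context and a public copy for the server.
import tenseal as ts

ctx_pri = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=4096,          # polynomial modulus
    coeff_mod_bit_sizes=[40, 20, 20],  # coefficient modulus
)
ctx_pri.global_scale = 2**21           # scaling factor (Delta)
ctx_pri.generate_galois_keys()         # enables ciphertext rotations

ctx_pub = ctx_pri.copy()               # same parameters and pk ...
ctx_pub.make_context_public()          # ... but the sk is dropped
assert not ctx_pub.is_private()        # safe to share with the server
        </preformat>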
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Split Model Training Protocols</title>
      <p>In this section, we first present the protocol for
training the U-shaped split 1D CNN on PAMs, followed by
the protocol for training the U-shaped split 1D CNN on
EAMs.</p>
      <sec id="sec-3-1">
        <title>4.1. Training U-shaped Split Learning with Plaintext Activation Maps</title>
        <p>We use algorithm 1 and algorithm 2 to train the
U-shaped split 1D CNN reported in subsection 3.2. First,
the client and server start the socket initialization process
and synchronize the hyperparameters (learning rate,
batch size, number of batches, number of epochs). They
also initialize the weights and biases of their respective
layers according to Φ.
During the forward propagation phase, the client
forward-propagates the input x up to the split layer and
sends the activation map a^(c) to the server. The server
continues the forward propagation and sends its output
a^(s) back to the client. Next, the client applies the
Softmax function to a^(s) to get ŷ and computes the loss
E = ℒ(ŷ, y).</p>
        <sec id="sec-3-1-1">
          <title>Algorithm 1: Client Side</title>
          <p>Initialization:
  socket initialized with port and address; s.connect()
  synchronize the hyperparameters (learning rate, batch size, number of batches, number of epochs) with the server
  initialize the client-side weights and biases according to Φ; set the activation maps and gradients to zero
Training:
  for each epoch do
    for each batch (x, y) generated from the dataset do
      Forward propagation:
        compute a^(c) by forward-propagating x through the Conv1D, Leaky ReLU, and Max Pooling layers
        send a^(c) to the server; receive a^(s)
        compute ŷ = Softmax(a^(s)) and the loss E = ℒ(ŷ, y)
      Backward propagation:
        compute ∂E/∂a^(s) and send it to the server; receive ∂E/∂a^(c)
        backpropagate to the first hidden layer and update the client-side weights and biases</p>
          <p>The client starts the backward propagation by
calculating ∂E/∂a^(s) and sending this gradient of the
error w.r.t. a^(s) to the server. The server continues the
backward propagation, calculates ∂E/∂a^(c), and sends it
to the client. After receiving ∂E/∂a^(c) from the server,
the backward propagation continues to the first hidden
layer on the client side. Note that the exchange of
information between the client and the server in these
algorithms takes place in plaintext: the client sends the
activation maps a^(c) to the server in plaintext and receives
the output of the linear layer a^(s) from the server in
plaintext, as can be seen in algorithm 1 and algorithm 2.
Sharif et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] showed that the exchange of
PAMs between client and server using SL reveals
important information regarding the client’s raw sequential
data. Later, in subsection 5.1, we show in detail how
passing the forward activation maps from the client to the
server in plaintext results in information leakage.</p>
          <p>To mitigate this privacy leakage, we propose a
protocol where the client encrypts the activation maps before
sending them to the server, as described in subsection 4.2.</p>
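          <p>The plaintext round trip can be simulated in a single process, as in the following PyTorch sketch; in the actual protocol, a^(c), a^(s), and the gradients are serialized over a socket, and the layer shapes here are the illustrative ones from section 3.</p>
          <preformat>
# Sketch: one plaintext U-shaped training step (optimizer steps omitted).
import torch
import torch.nn as nn

client_net = nn.Sequential(
    nn.Conv1d(1, 4, 5, padding=2), nn.LeakyReLU(), nn.MaxPool1d(2),
    nn.Conv1d(4, 8, 5, padding=2), nn.LeakyReLU(), nn.MaxPool1d(2),
    nn.Flatten(),
)
server_net = nn.Linear(256, 5)

x, y = torch.randn(4, 1, 128), torch.randint(0, 5, (4,))

# Forward: client -> server -> client (labels never leave the client).
a_c = client_net(x)                        # client activation map a^(c)
a_c_sent = a_c.detach().requires_grad_()   # what the server receives
a_s = server_net(a_c_sent)                 # server linear layer output a^(s)
a_s_sent = a_s.detach().requires_grad_()   # what the client receives

loss = nn.CrossEntropyLoss()(a_s_sent, y)  # Softmax + loss on the client

# Backward: client -> server -> client.
loss.backward()                            # fills dE/da^(s) on the client
a_s.backward(a_s_sent.grad)                # server continues; fills dE/da^(c)
a_c.backward(a_c_sent.grad)                # client finishes backpropagation
          </preformat>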
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Training U-shaped Split 1D CNN with</title>
      </sec>
      <sec id="sec-3-3">
        <title>Encrypted Activation Maps</title>
        <p>The protocol for training the U-shaped 1D CNN with
homomorphically EAMs consists of four phases:
initialization, forward propagation, classification, and
backward propagation. The initialization phase only takes
place once at the beginning of the procedure, whereas the
other phases repeat until the model iterates through
all epochs. Each of these phases is described in detail
in the following subsections.</p>
        <sec id="sec-3-3-1">
          <title>Algorithm 2: Server Side</title>
          <p>Initialization:
  socket initialized with port and address; s.connect()
  synchronize the hyperparameters with the client
  initialize the server-side weights and biases according to Φ
Training:
  for each epoch do
    for each batch do
      receive a^(c) from the client; compute a^(s) and send it to the client
      receive ∂E/∂a^(s) from the client; compute ∂E/∂a^(c) and send it to the client
      update w^(s) and b^(s)</p>
          <p>Initialization The initialization phase consists of
socket initialization, context generation, and random
weight loading. The client first establishes a socket
connection to the server and synchronizes the four
hyperparameters with the server, as shown in algorithm 3
and algorithm 4. These parameters must be synchronized
on both sides so that the two parts are trained in the
same way. Also, the weights on the client and the server
are initialized with the same set of corresponding weights
as in the local model, to accurately assess and compare
the influence of SL on performance. On both the client
and the server sides, the layers are initialized using the
corresponding parts of Φ. The activation maps, the output
tensors of the Conv1D layers, and the gradients are
initially set to zero. In this phase, the generated context
is a specific object that holds the encryption keys pk and
sk of the HE scheme, as well as additional parameters,
namely the polynomial modulus, the coefficient modulus,
and the scaling factor ∆.</p>
          <p>Further information on the HE parameters and how
to choose the best-suited parameters can be found in
TenSEAL’s benchmarks tutorial1. As shown in
algorithm 3 and algorithm 4, the context is either public
(ctxpub) or private (ctxpri), depending on whether it holds
the secret key sk. Both ctxpub and ctxpri have the same
parameters, though ctxpri holds a sk and ctxpub does not.
In the initialization phase, the client sends ctxpub to the
server, so the server has access to the public parameters
and pk but never to the sk. After that, the client and the
server proceed to the forward and backward propagation
phases.</p>
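          <p>A sketch of this exchange with TenSEAL follows; the variable names are hypothetical, and ctxpri is the private context generated as in section 3.</p>
          <preformat>
# Sketch: the client serializes ctxpub; the server rebuilds it without the sk.
import tenseal as ts

ctx_pri = ts.context(ts.SCHEME_TYPE.CKKS, 4096, coeff_mod_bit_sizes=[40, 20, 20])
ctx_pri.global_scale = 2**21
ctx_pri.generate_galois_keys()

ctx_pub = ctx_pri.copy()
ctx_pub.make_context_public()
payload = ctx_pub.serialize()            # bytes sent over the socket
server_ctx = ts.context_from(payload)    # server-side reconstruction
assert not server_ctx.is_private()       # the server never obtains sk
          </preformat>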
          <p>Forward propagation The forward propagation
starts on the client side. The client first zeroes out the
gradients for the batch of data (x, y). He then calculates
the activation maps a^(c) from x, as can be seen in
algorithm 3, where each block consists of a Conv1D layer
followed by the combination of the Max Pooling and
Leaky ReLU functions.</p>
          <p>The Conv1D layer can be described as follows: given
a 1D input signal that contains C channels, where each
channel x(i) is a 1D array (i ∈ {1, . . . , C}), a Conv1D
layer produces an output that contains C′ channels. The
j-th output channel y(j), where j ∈ {1, . . . , C′}, is2
y(j) = b(j) + ∑_{i=1}^{C} w(i, j) ⋆ x(i),   (1)
where w(i, j) are the weights, b(j) are the biases of the
Conv1D layer, and ⋆ is the 1D cross-correlation
operation. The ⋆ operation can be described as
z(k) = (w ⋆ x)(k) = ∑_{m=0}^{M−1} w(m) · x(k + m),   (2)
where z(k) denotes the k-th element of the output vector
z, k starts at 0, and M is the size of the 1D weight kernel.</p>
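          <p>To make equations (1) and (2) concrete, the following sketch checks a direct implementation of the cross-correlation against PyTorch’s conv1d; all sizes are made up for illustration.</p>
          <preformat>
# Sketch: verify equations (1)-(2) against torch.nn.functional.conv1d
# with illustrative sizes C=2 input channels, C'=1 output channel, M=3.
import torch
import torch.nn.functional as F

x = torch.randn(2, 8)        # input: C=2 channels of length 8
w = torch.randn(1, 2, 3)     # weights w(i, j): shape (C', C, M)
b = torch.randn(1)           # bias b(j)

out_len = x.shape[1] - w.shape[2] + 1
y_manual = torch.empty(1, out_len)
for k in range(out_len):     # equation (2), summed over channels as in (1)
    y_manual[0, k] = b[0] + (w[0] * x[:, k:k + 3]).sum()

y_torch = F.conv1d(x.unsqueeze(0), w, b).squeeze(0)
assert torch.allclose(y_manual, y_torch, atol=1e-5)
          </preformat>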
          <p>The final output activation map of the client-side
layers is a^(c). The client then homomorphically encrypts
a^(c) and sends the EAM to the server. The server
receives the encrypted a^(c) and performs its part of the
forward propagation, which is a linear layer evaluated on
HE encrypted data (algorithm 4):
a^(s) = a^(c) w^(s) + b^(s).   (3)
After that, the server sends a^(s) to the client. Upon
reception, the client decrypts a^(s), performs Softmax on
it to produce the predicted output ŷ, and calculates the
loss E (algorithm 3). Having finished the forward
propagation, we may move on to the backward propagation
part of the protocol.</p>
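          <p>As an illustration of equation (3), the following TenSEAL sketch evaluates a linear layer on an encrypted vector; the weight and bias stay in plaintext on the server, the sizes match the 256-element activation maps of section 5, and the variable names are hypothetical.</p>
          <preformat>
# Sketch: the server's linear layer evaluated on an encrypted a^(c).
import tenseal as ts
import torch

ctx = ts.context(ts.SCHEME_TYPE.CKKS, 4096, coeff_mod_bit_sizes=[40, 20, 20])
ctx.global_scale = 2**21
ctx.generate_galois_keys()

a_c = torch.randn(256)                       # client activation map a^(c)
w_s, b_s = torch.randn(256, 5), torch.randn(5)

enc_a_c = ts.ckks_vector(ctx, a_c.tolist())  # client encrypts a^(c)
enc_a_s = enc_a_c.mm(w_s.tolist()) + b_s.tolist()   # server: equation (3)

a_s = torch.tensor(enc_a_s.decrypt())        # client decrypts a^(s)
print(torch.allclose(a_s, a_c @ w_s + b_s, atol=1e-2))  # approximately True
          </preformat>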
          <p>Backward propagation After calculating the loss E,
the client starts the backward propagation by computing
∂E/∂ŷ and then ∂E/∂a^(s) and ∂E/∂w^(s) using the chain
rule (algorithm 3). Specifically, the client calculates
∂E/∂a^(s) = (∂E/∂ŷ) · (∂ŷ/∂a^(s)),   (4)
∂E/∂w^(s) = (∂E/∂a^(s)) · (∂a^(s)/∂w^(s)).   (5)
Following that, the client sends ∂E/∂a^(s) and ∂E/∂w^(s)
to the server. Upon reception, the server computes
∂E/∂b^(s) by simply setting ∂E/∂b^(s) = ∂E/∂a^(s), based
on equation (3). The server then updates the weights and
biases of its linear layer according to
w^(s)(t) = w^(s)(t−1) − η ∂E/∂w^(s), b^(s)(t) = b^(s)(t−1) − η ∂E/∂b^(s).   (6)
Next, the server calculates
∂E/∂a^(c) = (∂E/∂a^(s)) · (∂a^(s)/∂a^(c)),   (7)
and sends ∂E/∂a^(c) to the client. After receiving
∂E/∂a^(c), the client calculates the gradients of E with
respect to the weights and biases of the Conv1D layers
using the chain rule, which can generally be described as
∂E/∂w^(c) = (∂E/∂a^(c)) · (∂a^(c)/∂w^(c)),   (8)
∂E/∂b^(c) = (∂E/∂a^(c)) · (∂a^(c)/∂b^(c)).   (9)
Finally, after calculating the gradients ∂E/∂w^(c) and
∂E/∂b^(c), the client updates w^(c) and b^(c) using the
Adam optimization algorithm [11].
1https://bit.ly/3KY8ByN
2https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html</p>
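          <p>A plaintext sketch of how the client can derive the gradients of equations (4) and (5) with autograd, using its own plaintext copy of a^(c); sizes are illustrative.</p>
          <preformat>
# Sketch: client-side start of the backward pass (equations (4)-(5)).
import torch
import torch.nn.functional as F

a_c = torch.randn(4, 256)                    # client's plaintext copy of a^(c)
a_s = torch.randn(4, 5, requires_grad=True)  # decrypted server output a^(s)
y = torch.randint(0, 5, (4,))

loss = F.cross_entropy(a_s, y)  # Softmax + loss: E = L(y_hat, y)
loss.backward()                 # equation (4): dE/da^(s)

grad_a_s = a_s.grad             # sent to the server
grad_w_s = a_c.t() @ grad_a_s   # equation (5): dE/dw^(s), also sent
grad_b_s = grad_a_s.sum(dim=0)  # dE/db^(s): derived by the server from (3)
          </preformat>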
        </sec>
        <sec id="sec-3-3-2">
          <title>Algorithm 3: Client Side</title>
          <p>Context Initialization:
  ctxpri ← generate the CKKS context (polynomial modulus, coefficient modulus, ∆) holding pk and sk
  ctxpub ← copy of ctxpri without the sk
  send ctxpub to the server
Training:
  for each epoch do
    for each batch (x, y) generated from D do
      run the forward and backward propagation described above, encrypting a^(c) before sending it and decrypting a^(s) upon reception</p>
          <p>Note that in the backward pass, by sending both ∂E/∂a^(s)
and ∂E/∂w^(s) to the server, we help the server keep its
parameters in plaintext and prevent the multiplicative
depth of the HE computation from growing out of bound;
however, this leads to a privacy leakage of the activation
maps.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Performance Analysis</title>
      <p>
        We evaluate our method on the MIT-BIH dataset [8].
MIT-BIH We use the pre-processed dataset from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
which is based on the MIT-BIH arrhythmia (abnormal
heart rhythm) database [8]. The processed dataset
contains 26,490 heartbeat samples that belong to 5 different
classes.
      </p>
      <sec id="sec-4-1">
        <title>Algorithm 4: Server Side</title>
      </sec>
      <sec id="sec-4-2">
        <title>Context Initialization:</title>
        <p>.(ctxpub)
for e in E do</p>
        <p>The neural nets are constructed using the PyTorch
library version 1.8.1+cu102. For HE algorithms, we
employ the TenSEAL library version 0.3.10. We perform our
experiments in the localhost setting. The open-source
implementation of our work is publicly available3.</p>
        <p>In terms of hyperparameters, we train all networks
for 10 epochs with a learning rate of 0.001 and a training
batch size of 4. For the split neural network with HE
activation maps, we use the Adam optimizer for the client
model and mini-batch gradient descent for the server.</p>
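        <p>In PyTorch terms, this optimizer split can be sketched as follows; the module shapes are illustrative stand-ins for the two halves of the network rather than our exact architecture.</p>
        <preformat>
# Sketch: Adam on the client-side layers, mini-batch SGD on the server's layer.
import torch
import torch.nn as nn

client_part = nn.Sequential(nn.Conv1d(1, 4, 5, padding=2),
                            nn.Conv1d(4, 8, 5, padding=2))
server_part = nn.Linear(256, 5)

client_opt = torch.optim.Adam(client_part.parameters(), lr=0.001)
server_opt = torch.optim.SGD(server_part.parameters(), lr=0.001)
        </preformat>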
        <p>We use a GPU for the networks trained on plaintext. For
the U-shaped SL model on HE activation maps, we train
the client model on a GPU and the server model on a CPU.</p>
        <p>
          Visual Invertibility In the SL model, the activation
maps are sent from the client to the server to continue the
training process. A visual representation of the activation
maps reveals a high similarity between certain activation
maps and the input data from the client, as demonstrated
in Figure 4 for the models trained on the MIT-BIH dataset.
The figure indicates that, compared to the raw input data
from the client (the first row of Figure 4), some
activation maps (as plotted in the second row of Figure 4) have
exceedingly similar patterns. This phenomenon clearly
compromises the privacy of the client’s raw data. The
authors of [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] quantify the privacy leakage by measuring
the correlations between the activation maps and the
raw input signal using two metrics: distance correlation
and Dynamic Time Warping. This approach allows
        </p>
        <p>them to measure whether their solutions mitigate privacy
leakage. Since our work uses HE, said metrics are
unnecessary, as the activation maps are encrypted.</p>
        <p>We experiment with activation maps of size
[batch size, 256] for the MIT-BIH dataset. We denote
the 1D CNN model with an activation map sized
[batch size, 256] as m1.</p>
        <p>Training Locally Results when training m1 locally on
the MIT-BIH plaintext dataset are shown in Figure 3. The
neural network learns quickly and is able to decrease
the loss drastically from epochs 1 to 5. From epochs 6-10,
the loss begins to plateau. After training for 10 epochs,
we test the trained neural network on the test dataset
and get 88.06% accuracy. Training the model locally on
plaintext takes 4.8 seconds per epoch on average.</p>
        <p>
          U-shaped Split Learning using Plaintext Activation
Maps Our experiments show that training the U-shaped
split model on plaintext (reported in subsection 3.2) produces
the same results in terms of accuracy compared to local
training for model m1. This result is similar to the
findings of [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Even though the authors of [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] only used the
vanilla version of the split model, they too found that,
compared to training locally, accuracy was not reduced.
        </p>
        <p>We will now discuss the training time and
communication overhead of the U-shaped split models and compare
them to their local versions. For the split version of m1,
each training epoch takes 8.56 seconds on average, hence
43.9% longer than local training. The U-shaped split
models take longer to train due to the communication
between the client and the server. The communication cost
for one epoch of training split m1 is 33.06 Mb.
3https://github.com/khoaguin/HESplitNet</p>
        <p>U-shaped Split 1D CNN with Homomorphic
Encrypted Activation Maps We train the split neural
network m1 on the MIT-BIH dataset using EAMs
according to subsection 4.2. To encrypt the activation maps
on the client side (i.e. before sending them to the server),
we experiment with five different sets of HE parameters
for model m1. Additionally, we perform experiments
using different combinations of HE parameters. Table 1
shows the results in terms of training time, testing
accuracy, and communication overhead for the neural
networks with different configurations. For the U-shaped SL
version on plaintext, we captured all communication
between client and server. For training split models on
EAMs, we approximate the communication overhead for
one training epoch by taking the average communication
of training on the first ten batches of data and multiplying
that by the total number of training batches.</p>
        <p>For the m1 model, the best test accuracy was 85.41%,
obtained when using the HE parameters with polynomial
modulus 4096, coefficient modulus [40, 20, 20], and
scale ∆ = 2^21. The accuracy drop was 2.65%
compared to training the same network on plaintext. This
set of parameters achieves higher accuracy compared
to the bigger sets of parameters with polynomial modulus
8192, while requiring much lower training time and
communication overhead. The result when using the first
set of parameters with polynomial modulus 8192 is close
(85.31%), but with a much longer training time (3.67 times
longer) and communication overhead (8.43 times higher).</p>
        <p>Our experiments show that training on EAMs can
produce optimistic results, with accuracy dropping by 2-3%
for the best sets of HE parameters.</p>
        <p>The sets of parameters with polynomial modulus 8192
achieve the second-highest test accuracy, though they
incur the highest communication overhead and the longest
training time. The sets of parameters with polynomial
modulus 4096 offer a good trade-off, as they produce
on-par accuracy with polynomial modulus 8192 while
requiring significantly less communication and training
time. Experimental results show that the smallest set of
HE parameters, polynomial modulus 2048, coefficient
modulus [18, 18, 18], and ∆ = 2^16, requires the least
amount of communication and training time.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>This paper focused on how to train ML models in a privacy-preserving way using a combination of split learning and homomorphic encryption. We constructed protocols by which a client and a server can collaboratively train a model without revealing significant information about the raw data. As far as we are aware, this is the first time split learning is used on encrypted data.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was funded by the HARPOCRATES EU research project (No. 101069535) and the Technology Innovation Institute (TII), UAE, for the project ARROWSMITH.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Michalas</surname>
          </string-name>
          ,
          <article-title>Blind faith: Privacy-preserving machine learning using function approximation</article-title>
          ,
          <source>in: 2021 IEEE Symposium on Computers and Communications (ISCC)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vepakomma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raskar</surname>
          </string-name>
          ,
          <article-title>Reducing leakage in distributed deep learning for sensitive health data</article-title>
          , arXiv:
          <year>1812</year>
          .
          <volume>00564</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vepakomma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raskar</surname>
          </string-name>
          ,
          <article-title>Detailed comparison of communication efficiency of split learning and federated learning</article-title>
          , arXiv preprint arXiv:
          <year>1909</year>
          .
          <volume>09145</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vepakomma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Swedish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raskar</surname>
          </string-name>
          ,
          <article-title>Split learning for health: Distributed deep learning without sharing raw patient data</article-title>
          , arXiv preprint arXiv:
          <year>1812</year>
          .
          <volume>00564</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Cheon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>Homomorphic encryption for arithmetic of approximate numbers</article-title>
          ,
          <source>in: International Conference on the Theory and Application of Cryptology and Information Security</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>409</fpage>
          -
          <lpage>437</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Abuadbba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Thapa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Camtepe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nepal</surname>
          </string-name>
          ,
          <article-title>Can we use split learning on 1d cnn models for privacy preserving training?</article-title>
          ,
          <source>in: Proceedings of the 15th ACM Asia Conference on Computer and Communications Security</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>305</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wooldridge</surname>
          </string-name>
          , The Road to Conscious Machines:
          <article-title>This paper focused on how to train ML models in a The Story of AI, Pelican Books, Penguin Books Limprivacy-preserving way using a combination of split ited, 2020. learning and homomorphic encryption</article-title>
          .
          <source>We constructed</source>
          [8]
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Moody</surname>
          </string-name>
          , R. G. Mark,
          <article-title>The impact of the mit-bih protocols by which a client and a server could collabora- arrhythmia database, IEEE Engineering in Medicine tively train a model without revealing significant infor-</article-title>
          and
          <source>Biology Magazine</source>
          <volume>20</volume>
          (
          <year>2001</year>
          )
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          .
          <article-title>mation about the raw data. As far as we are aware</article-title>
          , this [9]
          <string-name>
            <given-names>O.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raskar</surname>
          </string-name>
          ,
          <article-title>Distributed learning of deep is the first time split learning is used on encrypted data. neural network over multiple agents</article-title>
          ,
          <source>Journal of Network and Computer Applications</source>
          <volume>116</volume>
          (
          <year>2018</year>
          ). Acknowledgments [10]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , Y. Cheng, Y. Kang,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Federated learning</article-title>
          ,
          <source>Synthesis Lectures on Artificial This work was funded by the HARPOCRATES EU re- Intelligence and Machine Learning</source>
          <volume>13</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>207</lpage>
          .
          <article-title>search project (No. 101069535) and</article-title>
          the Technology In- [11]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochasnovation Institute (TII), UAE, for the project ARROW- tic optimization</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6980 SMITH</source>
          . (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>