<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Software Quality Analysis, Monitoring, Improvement, and Applications, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Data Partitioning Effects in Federated Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mirwais Ahmadzai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giang Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Informatics and Information Technologies, STU in Bratislava</institution>
          ,
          <addr-line>Ilkovičova 2, Bratislava 84216</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>0</fpage>
      <lpage>13</lpage>
      <abstract>
        <p>Federated learning is a promising ML approach that promotes cooperative learning among many distributed systems while ensuring data privacy. In this study, we present a broad review of the design and evaluation of FL, with a particular focus on data partitioning. We discuss the challenges and solutions associated with FL implementation and demonstrate the design and execution of our proposed FL architecture. The main contribution of this paper is an investigation of data partitioning in FL and its impact on system performance. Using real-world public opinion data, we evaluate our proposed FL architecture and investigate performance measures such as binary accuracy, F1 score, loss, communication overhead, and data transmission between the server and clients. The experimental results provide useful information on the effective use of FL in various contexts. We underline the distinct advantages of various data partitioning algorithms based on data distribution and privacy requirements. Our findings contribute to the creation of successful FL systems that protect privacy.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Partitioning</kwd>
        <kwd>Federated Learning</kwd>
        <kwd>Architecture</kwd>
        <kwd>Design</kwd>
        <kwd>Implementation</kwd>
        <kwd>Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>This work also suggests best practices for selecting and implementing data partitioning techniques in FL for
public opinion survey data.</p>
      <p>Although a variety of model performance evaluation metrics are discussed, such as
communication efficiency, model performance, privacy, system performance, system and statistical
heterogeneity, and client motivation, our experimental evaluation focuses only on model performance
(accuracy and F1 score) and communication overhead using public opinion data. Furthermore,
this work does not address the ethical issues of FL, which would require further investigation.
Additional study is required to investigate the influence of data partitioning strategies on
other areas of FL system performance in order to provide an improved understanding of their
effects and potential best practices.</p>
      <p>In this context, the remainder of this paper is structured as follows: it starts with a brief
review of related work and highlights the differences and contributions of our paper in Section 2.
The contribution and motivation of the research in the context of data from the public opinion
survey are described in Section 2.2. The proposed design of the FL architecture is described in
Section 3. Data partitioning in FL is discussed in Section 4. The performance evaluation of the
FL architecture using public opinion data, together with the metrics and techniques used for
this evaluation, is presented in Sections 5 and 5.1. Finally, Section 6 concludes the work and
suggests potential research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Recent years have seen a considerable increase in the level of research on FL, and many studies
have been conducted on its application in various domains. In this section, a systematic
literature review is applied to select and highlight relevant work on its design, application, and
evaluation. The review relied on targeted searches in reputable databases using topic keywords to
ensure completeness. Table 1 summarizes the review according to each study's focus area, the
method used, and the main findings and limitations.</p>
      <p>
        The paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] presents a scalable production system for FL on mobile devices, with an
emphasis on the difficulties of privacy, security, and communication. Although the study
provides useful insights into the practical implementation of FL, it does not evaluate or compare
the performance of the system with other systems. The paper [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] looks at the latest developments
and problems in FL, including ways to mitigate privacy risks, without focusing on their limits.
The paper [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposes a practical FL method that reduces communication costs and is robust
to non-IID data distributions, but its limitations include experiments conducted on a limited
number of data sets and model architectures, as well as a lack of consideration for privacy
preservation. The paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposes an algorithm that minimizes learning loss within a given
resource budget, although it has constraints such as focusing on a certain class of ML models
and conducting experiments in a simulated environment. The paper [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] presents a comprehensive
study of current research on managing non-Independent and Identically Distributed
(non-IID) data in ML models in FL, but no concrete conclusions are presented.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Challenges and Solutions in Federated Learning Implementation</title>
        <p>
          Implementing FL can be difficult because it requires balancing the privacy and utility of local
data with the effectiveness of the ML process. Due to its distributed nature, FL encounters a
variety of challenges during training, including problems with communication, heterogeneity
of data and systems, and data privacy and security. In general, it requires careful consideration
when designing an FL system [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>Table 2 describes the main issues in FL architecture due to privacy requirements and data
volume, which lead to limited communication in FL networks. Local updating, compression
approaches, decentralized training, and importance-based updating are some of the solutions
suggested by researchers. These strategies are designed to maintain the balance between
effective communication, convergence, and accuracy of the model. Federated networks also face
the challenge of system heterogeneity, in which participants have varying communication,
processing, and storage capacities. To address this issue, asynchronous communication, client
participation, and fault tolerance are used. Client participation selects devices based on
their resources and data quality, while fault tolerance adds algorithmic redundancy or coded
computation to handle device failures.</p>
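        <p>To make resource-aware client participation concrete, the following minimal Python sketch ranks clients by a hypothetical resource score and selects the top fraction for a round; the class, weights, and score formula are illustrative assumptions rather than an algorithm from the surveyed systems:</p>
        <preformat>
import random
from dataclasses import dataclass

@dataclass
class Client:
    """Hypothetical participant with self-reported capacities."""
    cid: int
    bandwidth_mbps: float   # network capacity
    cpu_score: float        # relative processing power
    num_samples: int        # local data volume (rough proxy for data quality)

def resource_score(c):
    # Illustrative weighted score; the weights are arbitrary.
    return 0.4 * c.bandwidth_mbps + 0.3 * c.cpu_score + 0.3 * c.num_samples / 1000.0

def select_clients(clients, fraction=0.5):
    """Pick the top fraction of clients by resource score for this round."""
    ranked = sorted(clients, key=resource_score, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]

if __name__ == "__main__":
    pool = [Client(i, random.uniform(1, 100), random.random(),
                   random.randint(100, 5000)) for i in range(10)]
    print("selected:", [c.cid for c in select_clients(pool, fraction=0.3)])
        </preformat>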
        <p>
          The existence of non-IID data throughout the network causes challenges in statistical
heterogeneity in FL. To solve this, the researchers propose employing multitask learning, measuring
heterogeneity with measures such as local dissimilarity, and representing user preferences
with personalization layers. Recent research has revealed that FL may not always provide
adequate privacy guarantees during model updates and may be vulnerable to two types of
attack, including poisoning attacks and inference attacks [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Poisoning attacks can be
carried out during the model’s training phase or on the data. Inference attacks can occur during
model updates and expose participants’ private information to the adversary [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. There are
various privacy-preserving mechanisms, such as Secure Multiparty Computing (SMPC),
Differential Privacy (DP), and Homomorphic Encryption (HE), that can be used in FL. SMPC
maintains security by distributing the computation among multiple parties. To preserve individual
privacy, DP adds noise to the data, while HE enables computation directly on encrypted data, protecting user data throughout.
        </p>
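        <p>As a concrete, deliberately simplified illustration of the DP mechanism, the sketch below clips a client update to a bounded L2 norm and adds calibrated Gaussian noise before it leaves the client; the clipping norm and noise multiplier are illustrative values, not parameters from the surveyed work:</p>
        <preformat>
import numpy as np

def dp_sanitize(update, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip an update to a bounded L2 norm, then add Gaussian noise.

    Follows the usual DP-SGD recipe: sigma = noise_multiplier * clip_norm.
    """
    rng = rng or np.random.default_rng(0)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

if __name__ == "__main__":
    grad = np.array([0.5, -2.0, 3.0])
    print(dp_sanitize(grad))  # noisy, norm-bounded version of the update
        </preformat>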
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Motivation and Contribution</title>
        <p>Federated learning has limited use in the real world despite its benefits, such as improved model
accuracy and privacy preservation. These issues can be resolved, and its practical adoption
improved, by looking into the effects of data partitioning in FL. This article's goal is to investigate
how data partitioning techniques affect FL system performance, with the following contributions:
1. An examination of data partitioning techniques in FL, focusing on how they affect system
performance and communication effectiveness.
2. A novel approach to choosing and putting into practice the best data partitioning strategies
for certain use cases.
3. An evaluation of the effectiveness of the proposed methodology in increasing model accuracy
and decreasing communication overhead using data from public opinion surveys.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Federated Learning Architecture Design and Implementation</title>
      <p>
        The hardware and software requirements for implementing the FL architecture can vary
depending on the individual use case and the scale of the devices involved. However, in general,
clients, servers, models, and algorithms are components of the FL architecture [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Hardware
requirements include a collection of distributed devices (such as smartphones, laptops, and the
Internet of Things (IoT) devices) with enough processing power to locally train an ML model,
a server that meets certain criteria, and a reliable network connection that can interact with
the central server and other participating devices. Each client (local device) has its own data
set, which is used to train the ML model. The FL process is coordinated by the server, which
sends model updates to the clients and aggregates their updates. The server also keeps the
global model safe and secure. The model is an ML model trained using the FL
method, usually a deep neural network trained in a decentralized manner across clients. The
algorithm is the optimization technique used to train the model. The software
requirements are as follows [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ].
      </p>
      <p>ML frameworks that support the FL process, such as TensorFlow or PyTorch. A central server
software that manages the process, including model aggregation and device synchronization,
which can be built with technologies like Apache Kafka, RabbitMQ, and Redis. A client-side
software library that allows devices to participate in FL and communicate securely with the
central server. Protocols for secure and encrypted communication to protect the privacy of
data on participating devices. The FL architecture design workflow for the public opinion survey
example is depicted in Fig. 1: the central server distributes the initial model parameters to all
clients, clients train their local models with the initial parameters and exchange the results with
the central server, and the central server aggregates the local models and distributes the global
model to the clients.</p>
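      <p>This workflow can be summarized by a minimal FedAvg-style sketch in Python; the pure-NumPy linear model, local objective, and hyperparameters below are illustrative assumptions, and a real deployment would use one of the ML frameworks named above:</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(42)

def local_train(weights, X, y, lr=0.1, epochs=5):
    """Client side: a few gradient steps on a local least-squares objective."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg(client_weights, client_sizes):
    """Server side: sample-count-weighted average of the client models."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three synthetic clients holding different rows of the same feature space.
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for n in (40, 60, 100):
    X = rng.normal(size=(n, 3))
    clients.append((X, X @ true_w + 0.01 * rng.normal(size=n)))

global_w = np.zeros(3)
for _ in range(20):  # communication rounds
    local_models = [local_train(global_w, X, y) for X, y in clients]
    global_w = fedavg(local_models, [len(y) for _, y in clients])
print("recovered weights:", np.round(global_w, 3))
      </preformat>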
      <p>
        Depending on the particular use case and privacy restrictions, several methodologies can be
utilized to evaluate and deploy the model. Clients perform a local evaluation of the model and send
the results to the central server for aggregation; the server then assesses the overall
performance of the global model. Alternatively, a server-side validation data set can be used
for evaluation. When privacy is a concern, clients can also receive the aggregated global model
for local predictions. Conversely, the global model can be hosted by the
server and made available as a service to clients, who then send their data for predictions.
Data privacy, resource limitations, and the complexity of the model management process are a
few of the considerations that influence the choice between client-side and server-side
evaluation and deployment [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
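      <p>Under the client-side evaluation option, the server never sees raw data, only locally computed metrics. A minimal sketch of the server's aggregation step, with a report format that is our own hypothetical convention:</p>
      <preformat>
def aggregate_metrics(client_reports):
    """Combine locally computed metrics into a global estimate.

    Each report is (num_examples, accuracy, loss); the server weights
    every client's metric by how many examples it evaluated on.
    """
    total = sum(n for n, _, _ in client_reports)
    acc = sum(n * a for n, a, _ in client_reports) / total
    loss = sum(n * l for n, _, l in client_reports) / total
    return acc, loss

reports = [(120, 0.84, 0.35), (80, 0.79, 0.41), (200, 0.88, 0.30)]
print(aggregate_metrics(reports))  # weighted global accuracy and loss
      </preformat>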
    </sec>
    <sec id="sec-4">
      <title>4. Data Partitioning in Federated Learning</title>
      <p>
        FL partitioning distributes data across multiple parties who collaborate to increase the usefulness
of their combined data. This method overcomes the limitations of domain-specific data and
makes it easier for clients with various interests to work together. Based on data flow between
parties, FL data partitioning can include transfer learning, vertical partitioning, and horizontal
partitioning (Fig. 2). It takes careful preparation to bring together the interested parties and
partition the data in a way that produces an FL environment, as proper data partitioning is
crucial for the FL process [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>Horizontal FL (HFL) combines data from entities with similar features but different samples.
In the HFL example, two research organizations (regions A and B) collect data from a public
opinion survey but are only able to share limited information because of privacy concerns. The
purpose is to develop an ML model that uses parameters such as age, gender, and service type
to predict how satisfied clients are with government services. Each organization first trains
a local model using its own data, then shares model updates with a central server to create a
global model, and then deploys the aggregated model to clients.</p>
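      <p>In code, horizontal partitioning is simply a split over rows of a shared schema. The toy NumPy sketch below mirrors the regions A and B example; the column layout and the two-way split are our illustrative assumptions:</p>
      <preformat>
import numpy as np

# Shared schema across regions: columns are [age, gender, service_type].
survey = np.array([
    [35, 0, 1],
    [42, 1, 2],
    [55, 0, 3],
    [29, 1, 1],
])
satisfaction = np.array([1, 4, 5, 2])  # label per respondent

# Horizontal FL: same columns, disjoint rows per organization.
region_a_X, region_a_y = survey[:2], satisfaction[:2]
region_b_X, region_b_y = survey[2:], satisfaction[2:]
print(region_a_X.shape, region_b_X.shape)  # (2, 3) (2, 3)
      </preformat>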
      <p>Vertical FL (VFL) combines data from entities with the same sample IDs but distinct features.
VFL allows different respondents to share demographic data while maintaining the privacy of
survey responses. Each client trains a local model using local data and survey results and then
shares model updates with the server, which constructs a global model capable of generating
predictions across all attributes.</p>
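      <p>Vertical partitioning instead splits feature columns over the same participant IDs. A minimal standalone sketch, again under an assumed toy schema:</p>
      <preformat>
import numpy as np

# Same participants in both organizations, identified by a shared ParticipantID.
survey = np.array([
    [35, 0, 1],   # columns: age, gender, service_type
    [42, 1, 2],
    [55, 0, 3],
])
satisfaction = np.array([1, 4, 5])

# Vertical FL: disjoint feature columns over the same rows.
demographics_party = survey[:, :2]   # one client holds age and gender
service_party = survey[:, 2:]        # the other holds service_type,
                                     # and the satisfaction labels stay here too
print(demographics_party.shape, service_party.shape)  # (3, 2) (3, 1)
      </preformat>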
      <p>Transfer FL (TFL) involves the use of a previously trained model on a similar task to improve
the performance of a new model on a new task. TFL can be applied for both VFL and HFL. TFL
involves training a model on one set of data and then fine-tuning it on another. When one
region has more data than the other, the model is trained on the bigger data set first and then
fine-tuned on the smaller data set.</p>
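      <p>A minimal sketch of this pre-train-then-fine-tune pattern, using a toy least-squares trainer that we assume purely for illustration: the model is first trained on the larger region's data and then continued from those weights on the smaller region's data with a reduced learning rate.</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(0)

def sgd(w, X, y, lr, steps):
    """Plain least-squares gradient descent, standing in for the local trainer."""
    for _ in range(steps):
        w = w - lr * 2.0 * X.T @ (X @ w - y) / len(y)
    return w

true_w = np.array([1.0, -0.5])
big_X = rng.normal(size=(500, 2)); big_y = big_X @ true_w
small_X = rng.normal(size=(20, 2)); small_y = small_X @ true_w + 0.05 * rng.normal(size=20)

w = sgd(np.zeros(2), big_X, big_y, lr=0.1, steps=100)   # pre-train on the big region
w = sgd(w, small_X, small_y, lr=0.01, steps=50)         # fine-tune on the small region
print(np.round(w, 3))
      </preformat>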
      <p>The example data below (cf. Fig. 2) illustrates the partitioning: both regions hold the same participant IDs, with Region A holding demographic features and Region B holding service-related features and satisfaction levels.</p>
      <table-wrap id="tab-region-a">
        <caption>
          <p>Data in Region A</p>
        </caption>
        <table>
          <thead>
            <tr><th>ParticipantID</th><th>Age</th><th>...</th></tr>
          </thead>
          <tbody>
            <tr><td>001</td><td>35</td><td>...</td></tr>
            <tr><td>002</td><td>42</td><td>...</td></tr>
            <tr><td>003</td><td>55</td><td>...</td></tr>
            <tr><td>...</td><td>...</td><td>...</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tab-region-b">
        <caption>
          <p>Data in Region B</p>
        </caption>
        <table>
          <thead>
            <tr><th>ParticipantID</th><th>serviceType</th><th>Satisfactory_Level</th></tr>
          </thead>
          <tbody>
            <tr><td>001</td><td>A</td><td>1</td></tr>
            <tr><td>002</td><td>B</td><td>4</td></tr>
            <tr><td>003</td><td>C</td><td>5</td></tr>
            <tr><td>...</td><td>...</td><td>...</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-5">
      <title>5. Performance Evaluation of Data Partitioning in Federated</title>
    </sec>
    <sec id="sec-6">
      <title>Learning Architectures</title>
      <p>
        Evaluation of the FL architecture is a crucial component because it allows us to measure the
efficiency of the model and make additional improvements. Evaluation metrics, methods, and
best practices for FL architecture are discussed in this part. The metrics shown in Table 3
are frequently used to assess the effectiveness and efficiency of the FL approach. These metrics
include communication costs, model performance, system scalability and performance, attack
rates, computation and energy costs, convergence rates, statistical and system heterogeneity,
client motivation, and data and device security [
        <xref ref-type="bibr" rid="ref26 ref27">26, 27</xref>
        ].
      </p>
      <sec id="sec-6-1">
        <title>5.1. Experimental Results and Discussion</title>
        <p>The performance of HFL, VFL, and TFL was compared using data collected by the Asia
Foundation in a public opinion survey gathering civilian thoughts and impressions on a variety
of Afghanistan-related issues. The data set includes survey questions and responses related to
security, governance, and country development.</p>
        <p>Due to the nature of FL, clients send their local model updates to the server, which aggregates
these updates to improve the global model. Communication is a substantial bottleneck in the
FL process, especially if the network bandwidth is limited or a large number of clients are
participating. In our experiments, the quantization technique was investigated as a practical
solution to this problem for three FL architectures (HFL, VFL, and TFL). The findings in
Fig. 3 show that the communication overhead without quantization was HFL: 14.44 MB, VFL: 0.12
MB, TFL: 17.32 MB, while after applying quantization to the model updates it decreased
significantly to HFL: 7.22 MB, VFL: 0.06 MB, TFL: 8.70 MB.</p>
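        <p>The roughly 50% reduction observed is consistent with casting 32-bit model updates to 16-bit floats before transmission. The paper does not state the exact quantizer used, so the following sketch should be read as one plausible scheme rather than the implementation behind Fig. 3:</p>
        <preformat>
import numpy as np

def quantize_update(update_f32):
    """Halve the payload by sending updates as float16 instead of float32."""
    return update_f32.astype(np.float16)

update = np.random.default_rng(1).normal(size=250_000).astype(np.float32)
q = quantize_update(update)

print(f"full precision: {update.nbytes / 1e6:.2f} MB")   # ~1.00 MB
print(f"quantized:      {q.nbytes / 1e6:.2f} MB")        # ~0.50 MB
print("max abs error:", float(np.max(np.abs(update - q.astype(np.float32)))))
        </preformat>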
        <p>These findings indicate the efficiency of quantization in reducing communication overhead
in FL systems. Reduced overhead can result in faster convergence, improved scalability, and
lower communication costs. However, it is critical to assess the impact of quantization on the
accuracy and loss metrics of the model. In this research, we also conducted tests for binary
classification models in horizontal, vertical, and transfer FL setups. The results showed that
the use of quantization maintained acceptable levels of accuracy and loss, making it a feasible
solution to reduce communication overhead in FL systems.</p>
        <p>Table 4 summarizes the experimental methodology and analysis performed in our study. It
includes key elements such as the description of the data set, the comparison of different data
partitioning methods, the specific federated learning model used, the hyperparameters chosen
for training, the evaluation metrics used, an outline of the experimental procedure, details
about the analysis process, and the source of the results shown in Fig. 3, Fig. 4, Fig. 5, and Fig. 6.
The figures mentioned and presented represent the authors' contribution to this research.</p>
        <p>[Fig. 4 and Fig. 5: model performance (accuracy, F1 score, and loss) over training rounds for HFL and VFL. Source: authors' contribution.]</p>
        <p>The performance of the three FL approaches, HFL, VFL, and TFL, is studied. The
performance of each approach is evaluated using three metrics: test loss, accuracy, and F1 score.
The findings of the HFL experiment are shown in Fig. 4. The initial test loss of the HFL is 0.69,
the accuracy is 0.53, and the F1 score is 0.54. Model performance improved consistently
throughout 90 rounds, with the test loss dropping to 0.32, the accuracy increasing to 0.85, and
the F1 score increasing to 0.83. As shown in Fig. 5, the first test loss for vertical FL is 0.66, the
accuracy is 0.55, and the F1 score is 0.0. Over 90 rounds, the model improved in all metrics,
with the test loss decreasing to 0.30, the accuracy increasing to 0.85, and the F1 score increasing
to 0.84. Finally, for TFL in Fig. 6, the initial test loss is 0.32, the accuracy is 0.8461, and the F1
score is 0.8292. The model improved during 90 rounds, with the test loss dropping to 0.30 and
the accuracy and F1 score increasing slightly to 0.8480 and 0.8329, respectively (Fig. 6: TFL
model performance; source: authors' contribution).</p>
        <p>In summary, during 90 rounds, the three FL approaches showed a continuous improvement in
performance in all evaluation metrics. The findings show that FL can be efficiently applied to a
variety of situations, each strategy providing unique benefits based on specific data distribution
and privacy needs.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>The paper provides a thorough investigation of the architecture of FL, with a particular emphasis
on data partitioning. The importance of FL has been emphasized and the problems and
limitations of existing FL techniques have been studied. The design concepts and factors necessary to
establish an FL architecture have also been investigated. FL architectures have been evaluated
using metrics and approaches related to data partitioning strategies. The implementation and
evaluation of the FL architecture was carried out using various data partitioning architectures
and the results were thoroughly explained. Evaluating the FL system with new measures can
support the future development of more efficient, effective, and privacy-preserving FL systems.
These measures should address statistical and system heterogeneity, system performance, client
motivation, system scalability, and data privacy in particular. Taking these factors into account,
we can improve our understanding of FL, leading to the development of more efficient, effective,
and secure FL systems.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This publication has been written thanks to the support of the Operational Programme Integrated
Infrastructure for the project: International Center of Excellence for Research on Intelligent
and Secure Information and Communication Technologies and Systems – Phase II (ITMS code:
313021W404), co-funded by the European Regional Development Fund. It is also supported by
the Operational Program Integrated Infrastructure for the project: National infrastructure for
supporting technology transfer in Slovakia II – NITT SK II, co-funded by the European Regional
Development Fund, and the AI4EOSC project under grant number 101058593.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mathews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramaswamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Beaufays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Augenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Eichner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kiddon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramage</surname>
          </string-name>
          ,
          <article-title>Federated learning for mobile keyboard prediction</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1811</year>
          .03604.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bonawitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Eichner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Grieskamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ingerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ivanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kiddon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Konečný</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mazzocchi</surname>
          </string-name>
          , H. B.
          <string-name>
            <surname>McMahan</surname>
            ,
            <given-names>T. V.</given-names>
          </string-name>
          <string-name>
            <surname>Overveldt</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Petrou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Ramage</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Roselander</surname>
          </string-name>
          ,
          <article-title>Towards federated learning at scale: System design</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1902</year>
          .01046.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kairouz</surname>
          </string-name>
          , H. B.
          <string-name>
            <surname>McMahan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Avent</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bellet</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Bennis</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          <string-name>
            <surname>Bhagoji</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Bonawitz</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Charles</surname>
            , G. Cormode,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Cummings</surname>
          </string-name>
          , et al.,
          <article-title>Advances and open problems in federated learning</article-title>
          ,
          <source>Foundations and Trends® in Machine Learning</source>
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>210</lpage>
          . doi:
          <volume>10</volume>
          .1561/ 2200000083.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>McMahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hampson</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. A. y Arcas</surname>
          </string-name>
          ,
          <article-title>Communication-eficient learning of deep networks from decentralized data</article-title>
          ,
          <source>in: Artificial intelligence and statistics</source>
          , PMLR,
          <year>2017</year>
          , pp.
          <fpage>1273</fpage>
          -
          <lpage>1282</lpage>
          . URL: http://proceedings.mlr.
          <source>press/v54/mcmahan17a/ mcmahan17a.pdf.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salonidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Leung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Makaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <article-title>Adaptive federated learning in resource constrained edge computing systems</article-title>
          ,
          <source>IEEE journal on selected areas in communications 37</source>
          (
          <year>2019</year>
          )
          <fpage>1205</fpage>
          -
          <lpage>1221</lpage>
          . doi:
          <volume>10</volume>
          .1109/JSAC.
          <year>2019</year>
          .
          <volume>2904348</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>Federated learning on non-iid data: A survey</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>465</volume>
          (
          <year>2021</year>
          )
          <fpage>371</fpage>
          -
          <lpage>390</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.neucom.
          <year>2021</year>
          .
          <volume>07</volume>
          .098.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Applications of federated learning in smart cities: recent advances, taxonomy</article-title>
          , and open challenges,
          <source>Connection Science</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          . doi:
          <volume>10</volume>
          .1080/09540091.
          <year>2021</year>
          .
          <volume>1936455</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Mammen</surname>
          </string-name>
          ,
          <source>Federated learning: Opportunities and challenges</source>
          ,
          <year>2021</year>
          . arXiv:
          <volume>2101</volume>
          .
          <fpage>05428</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shafahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Suciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Studer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dumitras</surname>
          </string-name>
          , T. Goldstein,
          <article-title>Poison frogs! targeted clean-label poisoning attacks on neural networks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>31</volume>
          (
          <year>2018</year>
          ). URL: https://proceedings.neurips.cc/paper_files/ paper/2018/file/22722a343513ed45f14905eb07621686-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Bhagoji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Calo</surname>
          </string-name>
          ,
          <article-title>Analyzing federated learning through an adversarial lens</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>634</fpage>
          -
          <lpage>643</lpage>
          . URL: http://proceedings.mlr.press/v97/bhagoji19a/bhagoji19a.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Melis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Song</surname>
          </string-name>
          , E. De Cristofaro,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shmatikov</surname>
          </string-name>
          ,
          <article-title>Exploiting unintended feature leakage in collaborative learning</article-title>
          ,
          <source>in: 2019 IEEE symposium on security and privacy (SP)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>691</fpage>
          -
          <lpage>706</lpage>
          . doi:
          <volume>10</volume>
          .1109/SP.
          <year>2019</year>
          .
          <volume>00029</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Choromanska</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <article-title>LeCun, Deep learning with elastic averaging sgd</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>28</volume>
          (
          <year>2015</year>
          ). URL: https://proceedings.neurips.cc/ paper_files/paper/2015/file/d18f655c3fce66ca401d5f38b48c89af-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Dally</surname>
          </string-name>
          ,
          <article-title>Deep compression: Compressing deep neural networks with pruning, trained quantization</article-title>
          and hufman coding,
          <year>2016</year>
          . arXiv:
          <volume>1510</volume>
          .
          <fpage>00149</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. U.</given-names>
            <surname>Stich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jaggi</surname>
          </string-name>
          ,
          <article-title>Don't use large mini-batches</article-title>
          ,
          <source>use local sgd</source>
          ,
          <year>2020</year>
          . arXiv:
          <year>1808</year>
          .07217.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>esgd: Commutation eficient distributed deep learning on the edge</article-title>
          ,
          <source>HotEdge</source>
          (
          <year>2018</year>
          )
          <article-title>6</article-title>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2010/file/ abea47ba24142ed16b7d8fbf2c740e0d-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinkevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weimer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <article-title>Parallelized stochastic gradient descent</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>23</volume>
          (
          <year>2010</year>
          ). URL: https://proceedings. neurips.cc/paper_files/paper/2010/file/abea47ba24142ed16b7d8fbf2c740e0d-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nishio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yonetani</surname>
          </string-name>
          ,
          <article-title>Client selection for federated learning with heterogeneous resources in mobile edge</article-title>
          , in: ICC 2019
          <article-title>-2019 IEEE international conference on communications (ICC)</article-title>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICC.
          <year>2019</year>
          .
          <volume>8761315</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-K. Chiang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Sanjabi</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          <string-name>
            <surname>Talwalkar</surname>
          </string-name>
          ,
          <article-title>Federated multi-task learning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ). URL: https://proceedings.neurips.cc/ paper_files/paper/2017/file/6211080fa89981f66b1a0c9d55c61d0f-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>I. I.</given-names>
            <surname>Eliazar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. M.</given-names>
            <surname>Sokolov</surname>
          </string-name>
          ,
          <article-title>Measuring statistical heterogeneity: The pietra index</article-title>
          ,
          <source>Physica A: Statistical Mechanics and its Applications</source>
          <volume>389</volume>
          (
          <year>2010</year>
          )
          <fpage>117</fpage>
          -
          <lpage>125</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.physa.
          <year>2009</year>
          .
          <volume>08</volume>
          .006.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Arivazhagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Choudhary</surname>
          </string-name>
          ,
          <article-title>Federated learning with personalization layers</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1912</year>
          .00818.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bonawitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ivanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kreuter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marcedone</surname>
          </string-name>
          , H. B.
          <string-name>
            <surname>McMahan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Ramage</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Segal</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Seth</surname>
          </string-name>
          ,
          <article-title>Practical secure aggregation for privacy-preserving machine learning</article-title>
          ,
          <source>in: proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1175</fpage>
          -
          <lpage>1191</lpage>
          . doi:
          <volume>10</volume>
          .1145/3133956.3133982.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Behera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Otter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shetty</surname>
          </string-name>
          , et al.,
          <article-title>Federated learning using distributed messaging with entitlements for anonymous computation and secure delivery of model (</article-title>
          <year>2020</year>
          ). URL: https://www.academia.edu/download/70178855/25661363.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Á. Morell</surname>
          </string-name>
          , E. Alba,
          <article-title>Dynamic and adaptive fault-tolerant asynchronous federated learning using volunteer edge devices</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>133</volume>
          (
          <year>2022</year>
          )
          <fpage>53</fpage>
          -
          <lpage>67</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.future.
          <year>2022</year>
          .
          <volume>02</volume>
          .024.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T.</given-names>
            <surname>Yang</surname>
          </string-name>
          , G. Andrew,
          <string-name>
            <given-names>H.</given-names>
            <surname>Eichner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Beaufays</surname>
          </string-name>
          , Applied federated learning:
          <source>Improving google keyboard query suggestions</source>
          ,
          <year>2018</year>
          . arXiv:
          <year>1812</year>
          .02903.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mothukuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Parizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pouriyeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dehghantanha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>A survey on security and privacy of federated learning</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>115</volume>
          (
          <year>2021</year>
          )
          <fpage>619</fpage>
          -
          <lpage>640</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.future.
          <year>2020</year>
          .
          <volume>10</volume>
          .007.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>G.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Verifynet: Secure and verifiable federated learning</article-title>
          ,
          <source>IEEE Transactions on Information Forensics and Security</source>
          <volume>15</volume>
          (
          <year>2019</year>
          )
          <fpage>911</fpage>
          -
          <lpage>926</lpage>
          . doi:
          <volume>10</volume>
          .1109/ TIFS.
          <year>2019</year>
          .
          <volume>2929409</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          , H.-
          <string-name>
            <given-names>Y.</given-names>
            <surname>Paik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>A systematic literature review on federated machine learning: From a software engineering perspective</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 54</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>39</lpage>
          . doi:
          <volume>10</volume>
          .1145/3450288.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>