FedTCS: Federated Learning with Time-based Client Selection to Optimize Edge Resources

Saira Bano1,2, Nicola Tonellotto1, Pietro Cassarà2,3 and Alberto Gotta2,3
1 Department of Information Engineering, University of Pisa, Pisa, Italy
2 Information Science and Technology Institute "A. Faedo", National Research Council, Pisa, Italy
3 CNIT - National Inter-University Consortium for Telecommunications, Parma, Italy

Abstract
Client sampling in federated learning (FL) is a significant problem, especially in massive cross-device scenarios where communication with all devices is not possible. In this work, we study the client selection problem using a time-based back-off scheme in federated learning for a MEC-based network infrastructure. In the FL paradigm, where a group of nodes jointly trains a machine learning model with the help of a central server, client selection is expected to have a significant impact on FL applications deployed in future 6G networks, given the increasing number of connected devices. Our timer settings are based on an exponential distribution to obtain an expected number of clients for the FL process. Empirical results show that our technique is scalable and robust for a large number of clients and keeps data queues stable at the edge.

Keywords
Federated Learning, Client Selection, Mobile Edge Computing (MEC) framework

1. Introduction
Mobile telecommunications is currently undergoing a digital transformation triggered by the increasing use of artificial intelligence (AI) based services and by massive connection requests from a plethora of smart devices, autonomous vehicles and industrial IoT-based systems. In this scenario, a report published by CISCO [1] estimates that the number of IoT devices will reach 20 billion by 2023. The massive communication of these devices generates ever-increasing distributed traffic as they are integrated into many AI-based application scenarios such as computer vision [2] or autonomous vehicles [3]. The growth of these AI applications continues to drive the development of wireless networks, and 6G is expected to bring the evolution of mobile devices from "connected things" to "connected intelligence" [4]. One of the fundamental goals of 6G is to create a holistic system in which communication, computation and control are jointly orchestrated to achieve unprecedented levels of reliability, energy efficiency and sustainability [5]. In this 6G ecosystem, machine learning (ML) and AI play a critical role, particularly through their deployment at the edge of wireless networks, close to end users, where user data can be exploited to train AI models.
However, the challenges of trustworthiness and scalability, especially user privacy and security, are among the most important requirements for 6G smart services and applications: the General Data Protection Regulation (GDPR) guidelines must be met, and the direct transmission or collection of user data is prohibited [6]. To address the privacy problem, Google has proposed Federated Learning (FL) [7], an approach for distributed learning in which a shared global model is learned from locally generated models of remote clients without sharing private data. This approach has two advantages: first, it removes the need for a shared data repository for learning, and second, it preserves user privacy. In the FL process, clients willing to participate collaboratively execute training tasks orchestrated by edge or cloud entities. Each participating client downloads the current global shared model, improves it through training on its local data, and summarises the changes as a lightweight update in the form of weights or gradients. These client-generated updates are sent to the entity in charge of aggregating them into the global model. The updated model is then sent back to the clients, which replace their local models with it.

In this paper, we focus on the implementation of an FL-based protocol suitable for a network infrastructure relying on Mobile Edge Computing (MEC). In particular, we address the problem of client selection in the FL process. Even though an increasing number of clients can generate a more accurate shared global model, the amount of data generated by these clients to update the global model can lead to overflows and instability of the data queues at the edge of the MEC-based network infrastructure. One solution to prevent this overflow and keep the data queues stable is to select a subset of clients to send their models to the edge. Several client selection techniques have been proposed in the literature, such as [8], in which the authors address the challenge of selecting clients with heterogeneous resources: random clients receive the MEC operator's request to participate in the FL process, these clients inform the operator about their resources, and finally the operator selects only those clients that can complete the tasks within a certain time frame. In [9], the authors proposed a biased client selection technique and showed that selecting clients with higher losses of their learning model can improve the speed of error convergence, which significantly reduces the communication overhead. However, the above techniques are difficult to implement in many application scenarios, especially in the case of mobile nodes.

In this paper, we propose a new protocol called FedTCS, i.e., Federated Learning with Time-based Client Selection. In this protocol, the control agent at the edge of the MEC-based infrastructure selects the clients involved in the FL process to enable more efficient updating of the shared global model. We assume that the buffers for storing the model updates generated by the clients are placed at the edge. We propose a time-based protocol for selecting the clients involved in the FL process, so as to avoid buffer overflows in the edge storage caused by the excessive data flow due to the increasing number of clients.
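To make the round structure just described concrete, the following minimal sketch (our illustration, not the authors' implementation) shows one FL round with FedAvg-style weighted averaging of client updates; the local_train routine and the client data are hypothetical placeholders.

```python
import numpy as np

def local_train(global_weights, client_data):
    """Hypothetical local update: a real client would run several epochs
    of SGD on its private data and return the new weights."""
    return [w + 0.01 * np.random.randn(*w.shape) for w in global_weights]

def fl_round(global_weights, clients_data):
    """One FL round: every participating client trains locally, then the
    server aggregates the updates weighted by the local sample counts."""
    updates = [local_train(global_weights, data) for data in clients_data]
    sizes = np.array([len(data) for data in clients_data], dtype=float)
    weights = sizes / sizes.sum()
    # layer-wise weighted average of the client models (FedAvg-style)
    return [sum(a * upd[k] for a, upd in zip(weights, updates))
            for k in range(len(global_weights))]

# toy usage: two dense layers, three clients holding different data volumes
global_w = [np.zeros((784, 200)), np.zeros((200, 10))]
clients = [np.random.rand(n, 784) for n in (100, 300, 600)]
global_w = fl_round(global_w, clients)
```

FedProx [10], which is adopted later in this paper for aggregation, keeps this server-side weighted averaging but adds a proximal term to each client's local objective.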
We use Kafka (https://kafka.apache.org/), a distributed event streaming platform for processing data streams. Specifically, clients store their local updates in the Kafka broker, from which the entity in charge of averaging the updates reads them to generate the global model. After averaging, the global model is stored again in the broker, so that clients can download it once it is available.

The main contributions of this paper are summarised below:

• We propose a probabilistic time-based client selection scheme for the FL procedure based on exponentially distributed timers. Through simulation-based analysis, we show that our scheme maintains queue stability even as the number of clients increases, thus avoiding the implosion of model updates at the edge.
• We propose a client selection scheme that deals with straggler edge devices by using an equal selection probability during the FL procedure.
• We provide a preliminary performance analysis to prove the scalability of the proposed selection scheme, while keeping the expected number of clients optimal according to the size of the queue.

The remainder of this article is organized as follows: in Section 2, we present our proposed system for client selection based on exponentially distributed timers; the experimental setup and numerical results for FL are discussed in Section 3. Finally, in Section 4 we provide the conclusions and future developments.

2. Time-based Client Selection
Federated learning can involve communication between numerous clients and the model averaging entity, resulting in excessive latency and network congestion. We propose a client selection approach that reduces congestion and latency while keeping buffer queues stable at the edge. Our approach allows computing the average number of transmitting clients so as to maximise the time-averaged accuracy of the federated learning procedure.

In this work, a two-tier distributed network infrastructure based on the MEC architecture is used to make an optimal selection from a large number of clients for FL, as shown in Figure 1.

Figure 1: Two-tier distributed architecture for client selection, where d_i = d, i.e., all clients experience the same delay d from the edge.

In this distributed architecture, the FL server is located in the cloud, while a control agent is placed at the edge for efficient client selection. Since there may be many edge systems serving different areas, we place a control agent (CA) on each of these systems, close to the users. The control agent uses a short acknowledgment message (ACK) that it sends to all clients and to the server. The ACK generated by the CA is triggered by the arrival of the first message from any client in the given round of FL, as explained in more detail in the algorithm below. This edge-based system also makes use of the Kafka broker, which acts as a buffer to allow asynchronous reception of model updates from clients; it also controls the flow of updates to and from clients and makes them available to the aggregation server in the cloud. The goal of this work is to select a subset of clients such that the flow of information, in the form of client updates toward the edge, is reduced and does not saturate the Kafka buffer at the edge. For this purpose, we propose a timer-based algorithm in which the optimal number of clients is selected according to the size of the queue. The whole algorithm works as follows:

1. We assume that all clients train their model on locally available data and that all clients are available for all rounds of FL.
Before the FL task starts, the server broadcasts a configuration message in the form of a tuple (R, μ, T) and sends the initial model parameters to all the K clients, where R is the incremental FL round number, T is the time interval, and μ is the exponential distribution parameter.
2. Each client i receives the tuple from the server after a delay of 2d_i. In this work, due to space constraints, we assume that all clients have heterogeneous training times but homogeneous delays d_i = d from the edge, and that each edge has the same delay d from the server.
3. After receiving the request from the server, each client i draws a random exponentially distributed timer t_i, also called a back-off timer, based on the received parameters μ and the interval size T, and delays its training until the timer expires. Each back-off timer value lies in [0, T] according to a truncated exponential distribution.
4. The client with the smallest sum of back-off timer and training time sends its model parameters to the edge. Upon receiving these model parameters, the CA at the edge is triggered and sends an ACK message both to the server and to all the clients connected to the edge via a control topic managed by Kafka.
5. When clients send their model updates after their back-off timer has expired and their training has finished, some control parameters are appended to the message header, namely the round number, the training time, and the timer value (R, Tr_i, t_i). These parameters are used by the server to adjust T for the next round and to check whether a received message belongs to the current round.
6. Any client i that receives the ACK before it has finished its training, or while its timer has not yet expired, suppresses training to save computational resources. This happens when the sum of client i's training time Tr_i and its timer t_i is larger than the sum of the minimum training time, the minimum timer, and the round-trip delay 2d for receiving the ACK: Tr_i + t_i > Tr_min + t_min + 2d.
7. After receiving the ACK, the server reads the messages from the broker at the edge and calculates the average number of clients based on the timer values it receives from each client for the given round of FL.
8. After evaluating the expected number of clients E(X), the server performs global aggregation based on the FedProx algorithm [10], which achieves higher accuracy than FedAvg for heterogeneous networks. The updated model is then distributed to all the K clients.
9. The server also sends T and μ for the next round along with the updated model, based on the average number of clients it wants to receive and the maximum latency it can tolerate in a round.
10. All the above steps form a complete round of FL. This procedure continues until the desired accuracy of the global model is reached.

This mechanism of sending model updates based on probabilistic timers requires very low computational complexity at every IoT edge device.
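As a minimal sketch of the per-client behaviour in steps 3-6 (our illustration: the function and variable names are ours, the timer distribution is treated as a black box, and the Kafka transport is abstracted into plain timing arithmetic), the following code determines which clients actually transmit in a round under the homogeneous delay assumption d_i = d.

```python
import random

def fedtcs_transmitters(train_times, draw_timer, d):
    """Return the indices of the clients that send their update in one round.

    train_times : list of local training times Tr_i (one per client)
    draw_timer  : callable returning one back-off value t_i
    d           : common client-edge propagation delay (d_i = d for all i)
    """
    timers = [draw_timer() for _ in train_times]              # step 3: back-off before training
    finish = [t + tr for t, tr in zip(timers, train_times)]   # time at which client i is ready to send
    ack_deadline = min(finish) + 2.0 * d                      # step 4: the first update triggers the CA's ACK
    # step 6: clients still busy when the ACK arrives suppress training/upload
    return [i for i, f in enumerate(finish) if f <= ack_deadline]

# toy usage with a uniform timer in [0, T]; the truncated exponential of
# Section 2.1 would be plugged in here instead
T, d = 4.0, 1.0
sent = fedtcs_transmitters(train_times=[1.2, 0.8, 2.5, 1.0],
                           draw_timer=lambda: random.uniform(0.0, T), d=d)
print("clients transmitting this round:", sent)
```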
2.1. Exponentially Distributed Time-based Client Selection
Suppose that an exponential random variable with mean 1/μ is truncated on the right at T; the probability density function (PDF) of the back-off timer t_i following this truncated exponential distribution is then given by

f(t_i; \mu, T) = \begin{cases} \dfrac{\mu}{T}\, \dfrac{e^{\mu t_i / T}}{e^{\mu} - 1} & \text{if } 0 \le t_i \le T \\ 0 & \text{otherwise} \end{cases} \qquad (1)

where T is known and sent by the server. The exponential distribution has often been used to estimate the mean of a population belonging to a particular group, and the densities of truncated distributions are significant in modelling such populations [11]. To find the value of the timer t_i for each client i, we use inverse cumulative distribution function (CDF) sampling and compute t_i from the following expression:

t_i(u; \mu, T) = \dfrac{T}{\mu} \log\!\left( u \left( e^{\mu} - 1 \right) + 1 \right) \qquad (2)

where u is drawn from the uniform distribution on [0, 1]. The FL server adjusts the μ and T values based on the number of clients expected for a given number of rounds and the desired accuracy, while keeping the queue stable. As μ increases, the weight of the density shifts towards T, resulting in timer values concentrated near T [12].

To find E(X) for K = 1000 clients, we vary the interval T and the parameter μ, as shown in Figure 2a. The results in Figure 2a show the expected number of clients for T equal to 4d, 6d, and 8d, where d is the delay between the edge and the client devices. For simplicity, we chose the same delay between each device i ∈ K and the edge when formulating the problem. The results show that, for different values of T, the optimal number of clients is almost reached at μ = 10. A further increase in μ leads to a higher number of selected clients, and thus fewer clients suppress their model updates by discarding their training. For μ > 10 most timer values are close to T, while for μ < 10 they are close to zero; in both cases a large number of clients transmit, because the timer values are concentrated in a narrow range.

Figure 2: Expected number E(X) of selected clients for model updates given exponentially distributed timers. (a) E(X) vs. μ for T = 4d, 6d, 8d and K = 1000 clients; (b) E(X) vs. number of clients K for μ = 10 and T = 4d, 6d, 8d.

So, for the given scenario, if we set the parameter μ close or equal to 10, our queue stays below overflow. Moreover, all clients whose training time is close to or equal to the minimum complete their training and send their model updates before receiving the ACK from the edge, while only the clients with larger training times suppress both training and model updates. Hence, it is not feasible to set the timer interval equal to the delay time, as this would completely ignore the straggler devices. We repeat the experiment by fixing the value of the parameter μ and varying the values of T. From Figure 2b, we can see that for smaller values of T the expected number of clients in the queue E(X) increases with the total number of clients, whereas for larger values of T, E(X) remains stable even with a larger number of clients. This is because, for larger values of T, clients receive the ACK before their timer expires, so a large number of clients suppress their training.
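To illustrate how the timers of Eq. (2) shape the expected number of transmitting clients, the sketch below draws back-off values by inverse-CDF sampling and estimates E(X) by simulation. The uniform training-time range and the specific K, μ, T, d values are illustrative assumptions of ours and not the settings used for Figure 2.

```python
import math
import random

def sample_timer(mu, T):
    """Inverse-CDF sample of the truncated exponential in Eq. (2)."""
    u = random.random()
    return (T / mu) * math.log(u * (math.exp(mu) - 1.0) + 1.0)

def estimate_selected(K, mu, T, d, trials=100):
    """Monte Carlo estimate of E(X): only the clients whose timer plus
    (assumed) training time beats the ACK deadline of step 6 transmit."""
    total = 0
    for _ in range(trials):
        timers = [sample_timer(mu, T) for _ in range(K)]
        trains = [random.uniform(0.5 * d, 2.0 * d) for _ in range(K)]  # assumed spread
        finish = [t + tr for t, tr in zip(timers, trains)]
        deadline = min(finish) + 2.0 * d
        total += sum(f <= deadline for f in finish)
    return total / trials

d = 1.0
for T in (4 * d, 6 * d, 8 * d):
    row = [round(estimate_selected(K=200, mu=mu, T=T, d=d), 1) for mu in (2, 5, 10, 20)]
    print(f"T = {int(T / d)}d, E(X) for mu in (2, 5, 10, 20):", row)
```

This toy simulation is not calibrated to reproduce the curves of Figure 2; it only exposes the mechanism by which μ, T and d determine how many clients escape suppression.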
3. Experimental Environment and Results for Federated Learning
In the process of federated learning, each client sends its model updates to the server, which aggregates them in each round of communication. However, when the number of clients is large, the cost and overhead of communication between clients and the edge can be a significant bottleneck. In this paper, we propose to use a time-based technique to select only a small subset of clients and to communicate only the updates of this subset to the server. Our goal is to aggregate only a subset of client updates, where the subset of clients selected in each round is different and depends on the exponential timer, so that the final model update approximates an aggregation of all client updates. We run the experiments with 50 user nodes participating in the FL process; increasing the number of nodes would be possible with a more powerful test environment. To select different numbers of clients with the probabilistic time-based method based on exponentially distributed timers, we set the timer parameters μ and T so as to obtain 5, 10, 15, 20, 30, and 40 clients, and test how changing the number of clients affects the accuracy. Furthermore, during the FL process we assigned to each client a weight corresponding to the number of samples used for training. Each experiment is repeated 10 times to find the average number of rounds for the selected clients. We then calculate the confidence interval for the specified rounds and for the selected number of clients, as shown in Figure 3.

Figure 3: Expected number E(X) of selected clients for model updates with exponentially distributed timers, with μ = 10 and intervals of size T = 4d, 6d, and the number of FL rounds for a given number of clients K.

3.1. Dataset Distribution
We started with the MNIST dataset (http://yann.lecun.com/exdb/mnist/) for convenience. The MNIST dataset contains 60,000 training examples and 10,000 test examples. Each image has a fixed dimension of 28x28 pixels and is labelled with a digit between 0 and 9; each image is converted into an array of 784 features. We require each client to randomly select images from a different subset of the training data, so that the data are not independently and identically distributed (non-iid). For the training dataset of the experiment, each user selects 1,200 images at random, without repetition.

3.2. Experimental Results
For the MNIST classification we use a neural network consisting of three layers, with 200 neurons in each of the first two layers and 10 output neurons in the third layer, a learning rate of 0.01 and a batch size of 32; each client performs 10 local epochs before sending its model updates. As shown in Figure 3, the number of FL rounds decreases as we increase the number of clients: with 5 clients, a maximum of 20 rounds is needed, while with 50 clients 13 rounds are enough to achieve 97% accuracy. In these experiments, we distribute the updated model to all clients, and not only to the clients that participated in the aggregation process, so that whichever clients we select share the same updated global model. Selecting only a subset of clients to deliver their local parameters also reduces the communication cost and saves the computational resources of the remaining clients, leading to fast convergence in a small number of rounds. Moreover, the communication cost incurred by the server to distribute the model to all clients is compensated by the smaller number of rounds. In Figure 3, the intersection of the two lines shows the optimal working point at which we expect E(X) clients to be selected for the FL process, based on the parameters μ and T discussed above, and the corresponding number of rounds for a given number of clients to achieve the desired accuracy, which in these tests was up to 97%. After repeating each test more than 10 times, we also computed a 95% confidence interval for the number of rounds needed to achieve the desired accuracy. According to the empirical results, our approach improves learning efficiency and enables more uniform and fair performance among clients.
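For completeness, the local model and the per-client data split described above could be set up as in the following sketch. PyTorch is our choice purely for illustration (the paper does not state which framework was used); only the reported hyperparameters are taken from the text.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

# hyperparameters as reported in Sections 3.1 and 3.2
LR, BATCH, LOCAL_EPOCHS, CLIENTS, SAMPLES_PER_CLIENT = 0.01, 32, 10, 50, 1200

def make_model():
    """Three fully connected layers: 200, 200 and 10 neurons on 784 inputs."""
    return nn.Sequential(nn.Flatten(),
                         nn.Linear(784, 200), nn.ReLU(),
                         nn.Linear(200, 200), nn.ReLU(),
                         nn.Linear(200, 10))

def client_loaders(root="./data"):
    """Split the 60,000 MNIST training images into 50 disjoint shards of
    1,200 images each, drawn at random without repetition."""
    train = datasets.MNIST(root, train=True, download=True,
                           transform=transforms.ToTensor())
    perm = torch.randperm(len(train))
    shards = perm.split(SAMPLES_PER_CLIENT)[:CLIENTS]
    return [DataLoader(Subset(train, idx.tolist()), batch_size=BATCH, shuffle=True)
            for idx in shards]

def local_train(model, loader):
    """Ten local epochs of SGD before the client sends its weights."""
    opt = torch.optim.SGD(model.parameters(), lr=LR)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(LOCAL_EPOCHS):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict()
```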
4. Conclusion and Future Works
In this work, we have investigated a technique for selecting clients in the FL process by using probabilistic timers with a large number of clients. The proposed technique is robust and scalable: it controls the number of updates received, and hence the queue length at the edge, by adjusting the parameters that set the average number of clients and the latency of each round. We have also investigated the relationship between the parameters of the exponential distribution and the convergence time of FL. In future work, we will refine the client selection technique by also considering heterogeneous delays between edge and client devices and by analyzing the heterogeneous computational constraints of the devices. We will also implement the timers with different probability models to see which distribution is more effective. While the focus of this research is on the impact of client sampling on the global optimization goal of FL, future developments could investigate the impact of the client selection technique on the performance of individual users, especially in the context of heterogeneity.

Acknowledgments
This research was partially supported by the TEACHING project funded by the EU Horizon 2020 research and innovation programme under GA n. 871385, by the Italian Ministry of Education and Research in the framework of the CrossLab project (Departments of Excellence), and by the University of Pisa in the framework of the PRA 2020 program (AUTENS project, Sustainable Energy Autarky).

References
[1] Cisco annual internet report, 2021. URL: https://www.cisco.com/c/en/us/solutions/executive-perspectives/annual-internet-report/index.html.
[2] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[3] D. Bacciu, S. Akarmazyan, E. Armengaud, M. Bacco, G. Bravos, C. Calandra, E. Carlini, A. Carta, P. Cassarà, M. Coppola, et al., Teaching - trustworthy autonomous cyber-physical applications through human-centred intelligence, in: 2021 IEEE International Conference on Omni-Layer Intelligent Systems (COINS), IEEE, 2021, pp. 1–6.
[4] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, Y.-J. A. Zhang, The roadmap to 6G: AI empowered wireless networks, IEEE Communications Magazine 57 (2019) 84–90. doi:10.1109/MCOM.2019.1900271.
[5] Y. Shi, K. Yang, T. Jiang, J. Zhang, K. B. Letaief, Communication-efficient edge AI: Algorithms and systems, IEEE Communications Surveys & Tutorials 22 (2020) 2167–2191.
[6] K. B. Letaief, Y. Shi, J. Lu, J. Lu, Edge artificial intelligence for 6G: Vision, enabling technologies, and applications, IEEE Journal on Selected Areas in Communications 40 (2021) 5–36.
[7] B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. y Arcas, Communication-efficient learning of deep networks from decentralized data, in: Artificial Intelligence and Statistics, PMLR, 2017, pp. 1273–1282.
[8] T. Nishio, R. Yonetani, Client selection for federated learning with heterogeneous resources in mobile edge, in: ICC 2019 - IEEE International Conference on Communications (ICC), IEEE, 2019, pp. 1–7.
[9] Y. J. Cho, S. Gupta, G. Joshi, O. Yağan, Bandit-based communication-efficient client selection strategies for federated learning, in: 2020 54th Asilomar Conference on Signals, Systems, and Computers, IEEE, 2020, pp. 1066–1069.
[10] A. K. Sahu, T. Li, M. Sanjabi, M. Zaheer, A. Talwalkar, V. Smith, On the convergence of federated optimization in heterogeneous networks, arXiv preprint arXiv:1812.06127 (2018).
[11] M. F. M. Al-Athari, Estimation of the mean of truncated exponential distribution, Journal of Mathematics and Statistics 4 (2008) 284.
[12] J. Nonnenmacher, E. W. Biersack, Optimal multicast feedback, in: Proceedings of IEEE INFOCOM '98, the Conference on Computer Communications, volume 3, IEEE, 1998, pp. 964–971.