Adaptive Federated Learning for Electric Power Inspection with UAV System 1 Yu Liang, Ruifan Huang, Xun Li, Junjie Yang, Xinkai Zhang, Xuehe Wang* School of Artificial Intelligence, Sun Yat-sen University, Zhuhai, China Abstract With the rapid development of the national power grid, the demand for efficient and reliable power supply is increasing. As the labor cost and the size of the power grid are increasing, the Unmanned Aerial Vehicle (UAV) power inspection is a new and efficient way of detecting power grid abrasion. By analyzing the data collected by the UAVs, a smart detection and maintenance service can be provided. To improve the model robustness and accuracy, data collected from different companies and regions are required, which may violate the data privacy policy. As a distributed machine learning technique, Federated Learning (FL) can collaboratively train global models without sharing private data. In this article, in order to protect the data privacy between different systems, optimize models’ accuracy and convergence performance for non-Independently-and-Identically-Distributed (non-IID) data, we propose an adaptive method that jointly adjusts the learning rate and gradient based on the idea of FL. By recording global gradient information and using the momentum to accelerate the training process, our method adaptively controls the local gradient and learning rate in training of local models, and be more robust to local minima. Finally, we verify the superiority of our model compared with the generic FL model for non- IDD data through experiments. Keywords: Adaptive Federated Learning; Non-IDD Data; UAV; Smart Grid. 1. Introduction 1.1. Motivation and Background Recent times, China's power grid is undergoing rapid development. During the 12th Five-Year pe- riod, the scale of China's power grid has jumped to the first place in the world. So far, China has built six major cross-provincial power grids with a total transmission line length of over 1.15 million kilo- meters. However, the automation of distribution is in its infancy, and it takes a long time for fault di- agnosis, isolation and recovery. Purely by manual work in such a long line inspection, patrol person- nel will face large workload, high labor intensity, long patrol time, low patrol efficiency and other is- sues, and the complex structure of the power grid will make maintenance becomes a certain risk. At present, the Unmanned Aerial Vehicle (UAV) cruise detection is mainly used to detect the power grid fault, return and repair the fault problem. That is, a new UAV-enabled inspection method of automatic, intelligent, high efficiency and supervision is introduced [1]. Distribution network involves different departments, and the data collected in different systems is inconsistent. In addition, there is a lack of data sharing mechanism and information acquisition channels. This is reflected in the lack of management refinement and the limited exchange of data, graphics and information. To solve the problem, a kind of distributed machine learning technology called Federated Learning (FL) is introduced [2]. FL’s core idea is to train the distributed models across multiple data sources, which offers the possibility to construct global models based on virtual fusion data by exchanging ICCEIC2022@3rd International Conference on Computer Engineering and Intelligent Control EMAIL: *Corresponding author: wangxuehe@mail.sysu.edu.cn (Xuehe Wang) ©️ 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) 234 model parameters and intermediate results without exposing the local source data to each other. FL approaches a unique way to balance the data sharing and data privacy protection, making data "available but not visible". In previous distributed models, it is often assumed that the participants had Independently-and- Identically-Distributed (IDD) dataset. However, such an ideal scenario is often not available in most realistic scenarios. Participants are usually quite different from each other, as whose data are often non-IDD in practical questions. For example, in the power grid distributed across provinces, the climate in different regions causes different losses to cables. It also indicates that the server and participants need to communicate and exchange more times to achieve the required model accuracy. Unsatisfactory convergence performance, high communication cost and privacy guarantee also pose challenges to the optimization of FL’s non-IDD training model. 1.2. Related work and Our idea In the current research of FL, many training methods for models based on non-IDD data have been proposed. FedAVG [2] is the most classical and widely used federated optimization method, which can effectively reduce the communication cost compared with the traditional stochastic gradient de- cent (SGD) model. However, as it uses relatively static parameters, its convergence effect fluctuates greatly in the face of different optimization problems, thus it does not always have good enough convergence performance in the face of data with a higher heterogeneity [3]. In response to this problem, some adaptive federated optimization methods have attracted extensive attention. [4] proposed dynamic learning rate (DLR), which improved FedAVG algorithm by optimizing local learning rate to adapt the fading channel and realize efficient aggregation of wireless data. [5] proposed an adaptive data enhancement framework for imbalanced distributed training data to reduce communication traffic and accelerate convergence. [6] synthesized the current general ideas of adaptive optimization for FL and summarized the adaptive methods such as FedADAM, FedADAGRAD, FedYOGI by adaptively adjusting the learning rate. All related works list above extend our research ideas, nevertheless, few articles discuss the direction of integrated adaptive optimization of learning rate and gradient. In the training of deep learning network, the gradient and learning rate are both factors of great significance. The convergence performance of the model will be greatly improved theoretically by adapting them in local training and global aggregation. More specific method will be presented in next section. 2. Adaptive Optimization Federated Learning (FL) was approached by B. McMahan et al. in 2016 as a decentralized ma- chine learning mechanism [2], which is trained jointly by a central server coordinating a set of distrib- uted participating devices (which we refer to as clients). It avoids the direct aggregation of source data and protects the privacy of user data. 2.1. Generic Algorithm FedAVG algorithm is a classical federated learning algorithm [2]. This algorithm proposes the basic idea of federated learning based on stochastic gradient decent algorithm (SGD). 235 Figure 1 Background Architecture with FL Assume that is the total communication rounds, where , and the global parameter in the -th round is . At the beginning of each communication round, clients are randomly selected as participating samples, where . Sample clients has its local dataset size , then the total amount of sample data is denoted as . In each communication round, the client uses its source data locally to perform one-step gradient descent on the current model to obtain the model parameters of the client: where is the gradient at iteration of client , is the learning rate of the model. Af- ter receiving local parameters, servers perform model average operation to update global parameters: The updated global parameters are synchronized to all clients for the next round of local training, and the process is repeated (Show in Figure 1). Compared with the generic FedSGD model, the Fe-dAVG algorithm is much more accurate and robust to non-IID data to a certain extent. However, as it adopts static learning rate, gradient and other parameters in local training, its convergence speed will still be slow in the case of more imbalanced non-IDD data in practical problems. There is still room for its improvement in this case [8]. Recent times, the idea of adaptively updating static parameters has spawned many optimization methods of FL, which can achieve faster convergence on non-IDD datasets while ensuring robustness. FedADAM [6] is one of the most advanced algorithms. Based on SGD model, it adaptively adjusts the learning rate according to the momentum information of the gradient by tracking and calculating the first moment and second moment of gradient parameter of the model, and use Adam’s update method for iteration of its global parameter [7]. Compared with standard SGD, these features enable FedADAM to converge faster and be more robust to local minima [6][9]. 2.2. Our Adaptive Optimized Method Communication cost plays a dominant role in the optimization of federated learning. We consider reducing the number of communication rounds required for training the model by using additional computation, so as to achieve faster convergence. Therefore, we refer to FedADAM algorithm and propose a new adaptive method according to the idea of dynamic adjustment of static parameters such as learning rate and gradient. The pseudocode of the Algorithm is presented in Algorithm 1 and abstractly show in Figure 2: 236 Figure 2 Adaptive Algorithm Process Algorithm 1 shows our adaptive approach, is the fraction of clients participating in each round, is the set of all clients, are hyperparameters of the model which could be customize be- fore training, denotes the learning rate. Therefore, clients are sampled at the beginning of each global iteration. Assume the local parameter that the -th involved client received as , in each iteration the global parameter is transferred thus is assigned as: 237 Since the historical gradient information hasn’t been recorded in the first global iteration, the gra- dient of the loss function would be applied as the replacement of estimated local gradient function in the first epoch : , where As the gradient of the loss function of participated clients differ from each other, which leads to a situation that FedAVG may have more divergent results and poor convergence when it performs multiple local updates for non-IDD dataset. To mitigate the unstable convergence perfor- mance of the model for non-IDD data in practical problems, the model could be adaptively adjusted by introducing the estimated gradient function, which enables the model to achieve better results for non-IDD data after multiple local updates in the correct update direction. After the first local update, the central server calculates estimated global gradient function with the global parameters of current and the previous global iterations as: where , which will be delivered to the participated clients in the next global itera- tion for the update of the estimated gradient function of client at the remaining global iterations: , where As shown in Algorithm 1, the core idea of our adaptive method is to perform the update of in each epoch of client as follows: where is the gradient of the loss function with the local dataset at each epoch, is a hyper-parameter that we preset for adjusting . Then the update of local parameters at each epoch can be formulated as: In addition, in order to accelerate the convergence speed and improve the performance of the mod- el for non-IDD data, -th epoch value of local parameters would be used to calculate accord- ing to the idea of Adam [7]: During the aggregation phase, the central server receives the parameters and af- ter local updates from the involved clients, and aggregates their values to calculate the weighted average of the local 238 Figure 3 Testing accuracy of FedAVG, FedADAM, Our Adaptive Method over MNIST dataset, with non-IDD,IDD data Figure 4 Testing accuracy of FedAVG, FedADAM, Our Adaptive Method over CIFAR-10 dataset, with non-IDD data parameters of the involved clients, which can be formulated as: their values to calculate the weighted average of the local parameters of the involved clients, which can be formulated as: At the end of each global iteration, the first moment estimation and second moment estimation are calculated with Adam’s method, and finally the global parameters is updated by: Such a design has the following advantages: a) Source Data in most of the practical problems (e.g. Electric power inspection with UAV System in our background) is often distributed inconsistently, which makes the model trained at the case of non-IDD data. By utilizing the global gradient information to adaptively update local parameter, faster and more robust convergence can be achieved. b) According to the idea of FedADAM, we calculated the moment estimation using gradient information recorded in the history to adaptively regulate learning rate of training. It should receive much better effect in accelerating network training and inhibiting data oscillations. c) The adaptive adjustment of parameters such as gradient and learning rate does not involve direct transmission of source data, which ensures data privacy of participants in federated learning. d) Our adaptive adjustment focuses on the optimization in case of non-IDD data, in which participated clients can obtain more stable parameters through more local training. Through this way, 239 the whole model is more inclined to train on the client side, which will greatly reduce the communication rounds and lower the cost of communication between the server and client. 3. Experiment and Comparison In the following, we will compare Our Adaptive Method, FedAVG and FedADAM to show the superiority of Our Method under local multi-epoch training. There is no publicly available data about the electric power inspection with UAV on the Internet at present. Without loss of generality, we selected MNIST [10] datasets and CIFAR-10 [11] datasets which are widely used in machine learning area for simulation respectively. We divide the dataset into IDD data and non-IDD data, construct the CNN model, and select different combinations of local epochs and communication round for training. In the experiment, we set the learning rate , and set for local update of participants. At the same time, both FedADAM and Our Adaptive Method use hyper-parameters for training. As shown in Figure 3 and Figure 4, Our Adaptive Method can achieve the expected result. By analyzing the results of the algorithm, the following conclusions can be obtained: a) For IDD data and implementing only one local training, the performance of Our Adaptive Method is not necessarily better than generic FedAVG and FedADAM, or even more mediocre. The reason is the main purpose of our adaptive improvement is to optimize the convergence speed and robustness of the model under the condition of non-IDD data. To save communication cost, more local training epochs is required to run on client. Under this circumstance, this method can truly attribute much faster and robust convergence with non-IDD data and significantly reduce the communication cost between server and client at the meantime, which is more meaningful in solving practical problems. b) As the local epochs gradually increase, it can be observed that both FedAVG and FedADAM slow down the convergence speed to different degrees, while Our Adaptive Method keeps stable convergence performance. This is because there are often large differences between local data among participants for non-IID dataset. If the participated clients have gone through local updates for many times, the differences between participating clients will become larger and larger, thus the convergence efficiency will slow down a lot during server aggregation. However, Our Adaptive Method ensures higher accuracy and faster convergence speed through adaptive adjustment of parameters in this case. 4. Conclusion In this article, we propose an adaptive FL method by using the momentum and adaptive gradient to optimize the convergence performance of the model. To achieve fast convergence, we introduce the new local gradient by considering the difference between the local gradient and historical global gra- dient. Furthermore, by tracking the first and second moment estimation of the gradients for model pa- rameter, our algorithm adjusts the learning rate adaptively. At last, we perform simulation experi- ments using MNIST dataset and CIFAR-10 dataset to verify that our model can achieve faster con- vergence speed and higher accuracy than the widely used FL algorithm for non-IDD data. It is of great significance to accelerate the convergence of the model and reduce the communication cost in the practical problems of imbalanced data distribution. 5. Acknowledge This paper is funded by the Innovation and Entrepreneurship Training Program for College Stu- dents of Sun Yat-sen University. (Project number: 202211500) 240 6. Reference [1] S. C. Sun and L. Y. Zhang, et al, " Application of intelligent identification technology in UAV power inspection," The Journal of New Industrialization, 2020. [2] B. McMahan and E. Moore, et al., "Communication-efficient learning of deep networks from de- centralized data," in Artificial intelligence and statistics, pp. 1273-1282, 2017. [3] X. Li and K. Huang, et al., " On the Convergence of FedAvg on Non-IID Data," in International Conference on Learning Representations, 2020. [4] C. Xu and S. Liu, et al., "Learning rate optimization for federated learning exploiting over-the-air computation," IEEE Journal on Selected Areas in Communications, vol. 39, no. 12, pp. 3742- 3756, 2021. [5] M. Duan and D. Liu, et al., "Self-balancing federated learning with global imbalanced data in mobile systems," IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 1, pp. 59- 71, 2021. [6] S. Reddi and Z. Charles, et al., "Adaptive federated optimization," in International Conference on Learning Representations, 2021. [7] Kingma, D. P. and Ba, J., "Adam: A method for stochastic optimization," in International Con- ference on Learning Representations, 2015. [8] Hsu and T. M. H., et al. "Measuring the effects of non-identical data distribution for federated visual classification." arXiv preprint arXiv:1909.06335, 2019. [9] Mills, J and Hu, J., et al. "User-oriented multi-task federated deep learning for mobile edge com- puting." arXiv preprint arXiv: 2007.09236, 2020. [10] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010. [Online]. Available: MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges [11] A. Krizhevsky, V. Nair, and G. Hinton, “Cifar-10 (Canadian institute for advanced research),” 2010. [Online]. Available: CIFAR-10 and CIFAR-100 datasets (toronto.edu) 241