
10.3390/s21248282

Boosting Methods for Federated Learning

Roberto Esposito

Mirko Polato

Marco Aldinucci

Dipartimento di Informatica, Università di Torino, Corso Svizzera 185, 10145 Torino

2020


Federated Learning (FL) has been proposed to develop better AI systems without compromising the privacy of final users and the legitimate interests of private companies. Initially deployed by Google to predict text input on mobile devices, FL has since been deployed in many other industries. Since its introduction, Federated Learning has mainly exploited the inner workings of neural networks and other gradient descent-based algorithms by either exchanging the weights of the model or the gradients computed during learning. While this approach has been very successful, it rules out applying FL in contexts where other models are preferred, e.g., because they are easier to interpret or known to work better. This paper proposes to leverage distributed versions of the AdaBoost algorithm to acquire strong federated models. In contrast with previous approaches, our proposal does not put any constraint on the client-side learning models and does not rely on the inner workings of the learning algorithms used in the clients. We perform a large set of experiments on ten UCI datasets, comparing the algorithms in six non-iidness settings. Results show that the approach is effective: in the IID setting, results are often close to the theoretical optimum (i.e., the performance of AdaBoost on the complete dataset). In non-IID settings, results very much depend on the severity of the non-IIDness.

Keywords: federated learning, cross-silo, boosting, AdaBoost, ensemble learning

1. Introduction

Recent years have been characterized by crucial advances in artificial intelligence and machine learning systems, by the widespread availability of massive computational resources, and by the availability of huge datasets. The consequent deployment of AI and ML methods throughout many industries has been a welcome innovation that generated, nonetheless, newfound concerns about the fairness of the results and the privacy of the involved data. As a result, it is often the case that data is dispersed into many isolated islands, and ML practitioners are forbidden by laws and by the legitimate owners from collecting, fusing, and ultimately using the data to improve their systems. While protecting the privacy of users and the competitive advantages of companies is arguably a fair objective, it nonetheless hampers the development of learning models that, by leveraging all the available data, could make a difference in the quality of life of many people who are subjected to the decisions made using AI systems.

Federated Learning (FL) has been proposed by McMahan et al. [1] as a way out of this conundrum, i.e., as a way to develop better AI systems without compromising the privacy of final users and the legitimate interests of private companies.

FL is a learning paradigm where multiple parties (clients) collaborate in solving a machine learning task using their private data under the coordination of an aggregator (a.k.a. server or coordinator). Each client’s local data is not exchanged or transferred to any participant. The learning happens in rounds where model updates are computed by clients in isolation using local and private data, then aggregated on the server, then broadcast to the clients for the next round.

There are two main federated settings: cross-device and cross-silo. In cross-device FL, the parties can be edge devices (e.g., smart devices and laptops), and they can be numerous (on the order of thousands or even millions). Parties are considered unreliable and limited in computational power. In the cross-silo FL setting, the involved parties are instead organizations; the number of parties is limited, usually in the range [2, 100]. Given the nature of the parties, it can also be assumed that communication and computation are not real bottlenecks.

Since its introduction [1], Federated Learning has mainly exploited the inner workings of neural networks and other gradient descent-based algorithms by either exchanging the weights of the model or the gradients computed during learning. While this approach has been very successful, it rules out applying FL in contexts where other models would be preferred, either because they are more interpretable or known to work better. For instance, in the case of medical studies, it is often the case that data comes in tabular form, and that examples are not numerous and are distributed among several medical centers that need to respect hard privacy constraints. Also, medical doctors often require to be able to interpret the inferred models. In these situations, decision trees or rule-based systems are often justifiably preferred to neural networks, but they cannot be readily applied without collecting the data in one single place (e.g., [2]), which makes the whole process hard or impossible to implement due to the aforementioned privacy constraints.

This is a position paper based on the work in [3], where we proposed a series of cross-silo FL algorithms for classification based on distributed versions of the AdaBoost algorithm [4, 5, 6, 7, 8], allowing gradient-free federated learning. The algorithms pose minimal constraints on the learning settings of the clients, thus allowing a federation of models not specifically designed for FL, such as decision trees and SVMs. While there is no technical barrier to using our approach in cross-device federated learning settings, we have not conducted experiments to clarify the issue. Our intuition is that the approach will work best with reliable clients that own many examples, and when the communication cost is not high. We therefore believe that the algorithms are best suited for cross-silo settings and leave to future work the investigation of alternatives more akin to cross-device environments.

The main contributions of this work are: i) we propose two new FL algorithms inspired by distributed AdaBoost literature, namely DistBoost.F and PreWeak.F; ii) we introduce a third algorithm (AdaBoost.F) purposely developed for FL; iii) we present a comprehensive evaluation of our solutions on ten UCI datasets and six data distribution settings.

For reproducibility purposes, all the code used to perform the experiments in this paper is available at https://github.com/ml-unito/federation boosting.

2. Related Works

Ensemble Learning copes with the problem of strengthening the performance of a learning algorithm by iterating it and combining the results. Ensemble Learning is often employed by practitioners because it requires almost no parameters and can be used along with off-the-shelf algorithms to obtain strong models that are usually very robust to overfitting [6]. It is not surprising then that, at the beginning of this century, a large swath of research was devoted to the topic and that many flavors of ensemble learning were proposed during those years (e.g., Bagging [9], Boosting and its variants [5], Stacking [10], ECOC [11], etc.). In this context, the original boosting algorithm from Schapire [12] is fundamental because, by constructively solving the weak learnability problem [12], it spawned massive interest in the field and laid the basis for the development of AdaBoost [5], arguably the best-known algorithm in the field. The main idea in Schapire’s boosting algorithm [12], and hence in AdaBoost, is that, under the assumption that the base learning algorithm (the weak learner) will always be strictly better than random guessing, one can leverage the distribution of the examples to force the weak learner to focus on specific portions of the example space. This can then be used to drive down the error of the ensemble exponentially fast. AdaBoost appears particularly interesting as a candidate tool for FL, as it effectively combines classifiers that may be learned independently by the FL clients. Furthermore, it could be argued that, as long as at least one of the clients can find a model that is slightly better than random guessing over the complete dataset, AdaBoost should be able to drive the error of the ensemble on the training set to its theoretical minimum regardless of other factors (such as the possible non-iidness of the data distribution).

Most of the FL literature focuses on gradient-based methods, with very few exceptions. Federated Forest [13] is a lossless federated version of the classical Random Forest (RF) algorithm for vertically partitioned data. In this method, trees are built on node splits selected by the aggregator, which repeatedly asks clients for the impurity index and picks the minimum. Federated Forest guarantees privacy preservation mainly through the encoding of features and labels. However, label encoding may fail in the case of binary classification tasks. A very different approach to learning RFs is presented in [14], where the federation is managed using Blockchain technology that guarantees security even against adversarial participants. Vertical FL (VFL) is the learning setting in [15], which presents a federated algorithm for classification/regression trees based on Multi-Party Computation [16]. The authors also describe possible extensions of the methodology to gradient-boosting trees and linear regression. In [17], the VFL setting is considered in the context of kernel-based methods. The authors propose a privacy-preserving protocol to build dot-product kernel matrices, showing the technique’s effectiveness on top-N recommendation tasks. To the best of our knowledge, we are the first to propose federated versions of AdaBoost where the (weak) classifiers can be induced by any learning algorithm.

As briefly mentioned in the introduction, two of the algorithms presented in this paper are based on distributed versions of AdaBoost, namely DistBoost [7] and PreWeak [8], which we describe in Section 3. In [18], a distributed agnostic boosting algorithm is described. Differently from AdaBoost, the method uses a non-exponential multiplicative weight update rule that is further adjusted using the Bregman projection. Here, we propose a federated adaptation of AdaBoost, and we would argue that a similar methodology may also apply to the approach in [18]. Boosting-based FL has been little studied in the literature. All published works on the topic focus on gradient-boosting trees [19, 20] and most of them are designed for vertically partitioned data [21, 22, 23, 24]. Homomorphic encryption and secret sharing schemes are used to guarantee privacy, with the only exceptions of [21, 19], which use a differentially private approach. The cross-silo setting is considered in both [21] and [24] (decentralized FL).

We differ from these previous works because our federated boosting algorithms can be used with any weak learner, and our setting is horizontal FL. Even if our work focuses on a very specific case of federated learning (classification in a horizontal setting), we believe the techniques proposed could be extended and generalized to cover other learning tasks (e.g., regression, clustering, ...) and FL settings.

3. Ensemble Learning based Federated Learning

In this work, we place ourselves in a cross-silo FL setting: we assume that the clients are reliable and have enough computational power as well as a stable and secure connection [25, 26]. With these assumptions, our proposals expect a certain degree of synchronicity between the clients and the aggregator. However, all the proposed techniques can easily handle client failures, for instance, by using a timeout that excludes unresponsive clients from the current federated round.

In [5], Freund and Schapire formally proved that, provided that the weak learner can induce a decision rule that is consistently better than random guessing, AdaBoost reduces the ensemble error over the training set exponentially fast in the number of combined weak models. It is worth emphasizing that this is the only constraint posed by the algorithm. As shown by Freund and Schapire [4], this holds true even when the weak learner behaves adversarially towards the ensemble learner. While this is not relevant in most scenarios, in the federated learning case, the weak learners only work with a subset of the available data. In a sense, they can be thought of as malevolent learners that try to make the ensemble learner fail on the part of the data they do not own. This argument shows that, as long as at least one client can produce a model better than random guessing over the entire dataset, a distributed version of AdaBoost, modified to guarantee that no information about the local dataset is exchanged, should be able to drive the ensemble error to its minimum exponentially fast. This is the main idea at the basis of our work.
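The exponential decrease of the training error mentioned at the beginning of this section can be made concrete as follows, for the binary case analysed in [5]. The symbols are notation introduced here for illustration: $m$ is the number of training examples, $H$ the final ensemble after $T$ rounds, $\epsilon_t$ the weighted error of the $t$-th weak hypothesis, and $\gamma_t$ its edge over random guessing.

```latex
\frac{1}{m}\,\bigl|\{\, i : H(x_i) \neq y_i \,\}\bigr|
\;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t\,(1-\epsilon_t)}
\;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^{2}}
\;\le\; \exp\!\Bigl(-2 \sum_{t=1}^{T} \gamma_t^{2}\Bigr),
\qquad \gamma_t = \tfrac{1}{2} - \epsilon_t .
```

If every round achieves an edge $\gamma_t \ge \gamma > 0$, the right-hand side is at most $\exp(-2\gamma^{2}T)$, i.e., the bound shrinks exponentially in the number of rounds.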

In the past, there have been several attempts to build distributed versions of AdaBoost [7, 8]. In these works, the main aim was to distribute the computation; there was no attempt to provide privacy over the data and, indeed, all clients were supposed to hold the complete dataset. In [3], we have shown how to adapt two of these algorithms to work in an FL setting and also proposed an additional original algorithm. The main contributions were to provide mechanisms to cope with the fact that different clients hold different portions of the dataset, which has repercussions on the way the distribution over the examples is handled (e.g., how weights are normalized). For a detailed description of the working of the algorithms, we refer to the original publication [3]; for details of their implementation in actual (not simulated) FL environments, please refer to [27, 28]; here we only provide a brief summary of the ideas on which the algorithms are based. A common trait of the algorithms is that care is taken to ensure that the necessary statistics over the examples are computed in a privacy-preserving way. To do that, all clients maintain unnormalized statistics over the examples and communicate them to the aggregator. The aggregator collects all statistics and uses them to compute a common normalization factor. The normalization factor can then be used to properly compute the ε and α values that are central to the working of algorithms based on AdaBoost. These terms are then broadcast to all clients so that they can update their local set of statistics, and the process repeats.
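The exchange of unnormalized statistics described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [3]: the function names (`client_stats`, `aggregate`, `client_update`) are hypothetical, the statistics are assumed to be the misclassified weight mass and the total weight mass of each client, and the coefficient uses the SAMME-style formula [29] (with two classes it reduces to the binary AdaBoost coefficient).

```python
import math

def client_stats(weights, y_true, y_pred):
    """Client side: (misclassified weight mass, total weight mass), both unnormalized."""
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    return err, sum(weights)

def aggregate(stats, n_classes):
    """Aggregator side: turn per-client sums into a global weighted error and coefficient."""
    total_err = sum(e for e, _ in stats)
    norm = sum(z for _, z in stats)            # common normalization factor
    eps = total_err / norm
    eps = min(max(eps, 1e-10), 1 - 1e-10)      # guard against log(0)
    alpha = math.log((1 - eps) / eps) + math.log(n_classes - 1)  # SAMME-style (assumption)
    return eps, alpha

def client_update(weights, y_true, y_pred, alpha):
    """Client side: up-weight the local examples that the hypothesis misclassified."""
    return [w * math.exp(alpha) if t != p else w
            for w, t, p in zip(weights, y_true, y_pred)]

# Two clients, each described by (weights, true labels, predicted labels).
c1 = ([1.0, 1.0, 1.0], [0, 1, 1], [0, 1, 0])   # one local mistake
c2 = ([1.0, 1.0], [1, 0], [1, 0])              # no local mistakes
stats = [client_stats(*c1), client_stats(*c2)]
eps, alpha = aggregate(stats, n_classes=2)
new_w1 = client_update(*c1, alpha)
```

Only two numbers per client and per hypothesis travel to the aggregator; the weights themselves never leave the client, and no client needs to know the global normalization factor to apply its local update.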

DistBoost.F. At each round, all clients build weak hypotheses over their local dataset; the hypotheses are sent to the aggregator, which forms a bagged ensemble and uses it as the weak hypothesis for the current round. That weak hypothesis is transferred to each client so that they can measure the performance of the newly learnt hypothesis and communicate it back to the aggregator (this is needed to let everyone update the weight distribution).

PreWeak.F. In an initial step, all clients train an AdaBoost classifier over their local datasets. In this step, a fixed number of weak hypotheses are built in each client without exchanging any information with the aggregator. Once all clients complete, all the learnt weak hypotheses are transmitted to the aggregator, which starts a global AdaBoost process. In this step, only the weak hypotheses already learnt in the previous step are considered as candidates to be added to the ensemble. At the end of each round, the selected hypothesis is communicated to the clients so as to allow the computation of the statistics necessary to maintain the global distribution of weights.

AdaBoost.F. At each round, each client builds a weak hypothesis, which is communicated to the aggregator. The aggregator distributes these hypotheses to the clients so that they can compute the necessary statistics over the local dataset. These statistics are then used to pick the best weak hypothesis, which is then added to the ensemble.
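The AdaBoost.F round described above can be simulated in a few lines. The following is a sketch under simplifying assumptions rather than the implementation of [3]: labels are binary (+1/-1), the weak learner is a one-dimensional decision stump, hypotheses are passed around as Python closures instead of serialized models, and the coefficient is the binary AdaBoost one rather than SAMME. The names `train_stump`, `Client`, `adaboost_f`, and `predict` are hypothetical.

```python
import math

def train_stump(xs, ys, ws):
    """Weak learner (assumption of this sketch): best weighted threshold on a 1-D feature."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, ws)
                      if (sign if x >= thr else -sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x: sign if x >= thr else -sign

class Client:
    def __init__(self, xs, ys):
        self.xs, self.ys = xs, ys
        self.ws = [1.0] * len(xs)                 # unnormalized local weights

    def fit(self):
        return train_stump(self.xs, self.ys, self.ws)

    def stats(self, hyps):
        """Misclassified weight mass for every hypothesis, plus the local total mass."""
        errs = [sum(w for x, y, w in zip(self.xs, self.ys, self.ws) if h(x) != y)
                for h in hyps]
        return errs, sum(self.ws)

    def update(self, h, alpha):
        self.ws = [w * math.exp(alpha) if h(x) != y else w
                   for x, y, w in zip(self.xs, self.ys, self.ws)]

def adaboost_f(clients, rounds):
    ensemble = []                                 # list of (alpha, hypothesis)
    for _ in range(rounds):
        hyps = [c.fit() for c in clients]         # each client trains locally
        reports = [c.stats(hyps) for c in clients]  # every client scores every hypothesis
        norm = sum(z for _, z in reports)
        eps = [sum(r[0][j] for r in reports) / norm for j in range(len(hyps))]
        j = min(range(len(hyps)), key=eps.__getitem__)  # aggregator picks the best one
        e = min(max(eps[j], 1e-10), 1 - 1e-10)
        alpha = math.log((1 - e) / e)             # binary AdaBoost coefficient
        ensemble.append((alpha, hyps[j]))
        for c in clients:                         # clients update their local weights
            c.update(hyps[j], alpha)
    return ensemble

def predict(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Toy target: +1 on the interval [2, 5], -1 elsewhere; no single stump can fit it.
clients = [Client([0, 2, 4, 6], [-1, 1, 1, -1]),
           Client([1, 3, 5, 7], [-1, 1, 1, -1])]
ensemble = adaboost_f(clients, rounds=3)
preds = [predict(ensemble, x) for x in range(8)]
```

On this toy problem the ensemble built in three rounds classifies all eight points correctly, although no client ever sees the other client's examples: only hypotheses and per-hypothesis weight sums are exchanged.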

4. Experiments

We compare the federated algorithms introduced in [3], namely DistBoost.F, PreWeak.F, and AdaBoost.F, with the centralized algorithm SAMME [29] (multiclass AdaBoost). For all methods, we fix the number of weak learners (federated rounds) to 300. As weak learners, we employ Decision Trees with up to 10 leaves (as in [29]). However, it is worth mentioning that the proposed algorithms are agnostic to the choice of the weak learner; better still, nothing prevents building a system where each client adopts a different model. The simulated federation contains 10 clients, which is a standard choice [26] in the cross-silo setting. We assumed that all clients correctly participated in all rounds during the simulation.


We evaluated the methods on the following 10 datasets from the UCI [30] repository: adult, kr-vs-kp, forestcover, splice, vehicle, segmentation, sat, pendigits, vowel, letter. The datasets have been distributed across the clients using six different data distributions. Besides the iid case (uniform data distribution), we also consider the following types of non-iidness: quantity skew, prior shift (pathological, Dirichlet, and labels quantity), and covariate shift [26]. For more details about the implementation of these data distributions, please refer to [3]. It is worth noting that each client has at least two examples of different classes in every type of skewness.
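To give an idea of what a Dirichlet prior shift looks like, the sketch below shows one common way of building such a partition: for every class, the share of examples given to each client is drawn from a Dirichlet distribution. This is an illustrative assumption, not necessarily the procedure used in [3]; the function names and the concentration parameter `beta` are hypothetical, and the sketch does not enforce the constraint that every client receives examples of at least two classes.

```python
import random
from collections import defaultdict

def dirichlet(alphas, rng):
    """Sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def dirichlet_label_skew(labels, n_clients, beta, seed=0):
    """Assign example indices to clients; per-class client shares follow Dir(beta)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    parts = [[] for _ in range(n_clients)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        props = dirichlet([beta] * n_clients, rng)
        # Turn the sampled proportions into cut points over this class's examples.
        cuts, acc = [], 0.0
        for p in props[:-1]:
            acc += p
            cuts.append(int(round(acc * len(idxs))))
        start = 0
        for client, end in enumerate(cuts + [len(idxs)]):
            parts[client].extend(idxs[start:end])
            start = end
    return parts

labels = [i % 3 for i in range(300)]           # toy labels: three balanced classes
parts = dirichlet_label_skew(labels, n_clients=10, beta=0.5)
```

Smaller values of `beta` concentrate each class on fewer clients (a more severe skew), while large values approach the uniform, iid-like split.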

The methods have been compared using standard classification metrics such as accuracy, precision, recall, and F1. For space reasons, we only report the F1 score, which is the harmonic mean of precision and recall and takes into account how the data is distributed across classes.
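For reference, the per-class F1 score can be written as follows, where $TP$, $FP$, and $FN$ (notation introduced here) are the numbers of true positives, false positives, and false negatives when the class is treated as the positive one.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2\,P\,R}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}.
```

In the multiclass case, the per-class scores have to be averaged (e.g., macro or weighted averaging); the text does not state which averaging is used.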

Each experiment has been repeated five times. The reported results are the averages (with their standard deviation) over these runs. The Python implementation of the methods and their evaluation is available at https://github.com/ml-unito/federation boosting.

4.1. Results

We start by investigating how beneficial the federations built by the proposed algorithms are. To do that, we need to evaluate the performance of a possible competitor built only on local data. Thus, for each non-iidness type, we ran the SAMME algorithm on each client, using only the local data for training, and recorded the F1 score over a fixed independent test set.

Figure 1 shows, for all the clients and all the data distributions, the difference in F1 score (ΔF1) between the local run of SAMME (local SAMME in the following) and the best F1 score achieved by one of the federated algorithms on the pendigits and the letters datasets. The lower (more negative) ΔF1 is for a given point, the more beneficial the federation is for the corresponding client and data distribution setting.

Table 1: Average F1 scores (± standard deviation) of SAMME, DistBoost.F, PreWeak.F, and AdaBoost.F on the binary and multiclass datasets (adult, forestcover, kr-vs-kp, splice, vehicle, segmentation, sat, pendigits, vowel, letter) for each data distribution, together with the average rank of each model. [Table values not recovered.]

Barring small differences in the actual numbers, the two experiments tell the same story. The first thing to notice is that participating in the federation is generally beneficial to all clients, especially in non-iid data distributions.

An interesting observation is that, in the quantity skew scenario, clients with many examples (the head of the power law) can reach F1 scores that are even higher than those of the federation. This is reasonable because those clients are close to having all the available data; i.e., they run in a setting similar to running SAMME over the fused dataset, which is generally better than having to deal with the split dataset scenario. We can also observe that the scenarios with a prior shift (i.e., Labels Quantity, Dirichlet, and Pathological) are the most challenging ones. This is particularly apparent for the label quantity skew and the pathological label skew where, by design, we assign only a small subset of labels per client. We note that, contrary to what the figure might suggest, in absolute terms the performance of local SAMME in the label quantity skew case is worse than in the pathological skew case: the corresponding points (⊗ symbols) appear higher (w.r.t. ΔF1) because the federation does not perform well in this particular case. This is particularly apparent for the pendigits dataset, where the label quantity skew is not as detrimental to the performance as in the letters dataset.

In the uniform data distribution case, the federation is only slightly useful (pendigits) or slightly detrimental (letters).

Table 1 provides all the average F1 scores (± standard deviation) for all methods, datasets, and skewness types. Overall, the performance of PreWeak.F and AdaBoost.F is significantly better than that of DistBoost.F. We can observe that, in general, the federation tends to achieve F1 scores very close to the centralized SAMME on datasets with few labels (e.g., 2 and 3), even in non-iid settings. Clearly, as the number of classes increases, the prior shift scenario becomes more and more challenging. The Labels Quantity skew is the most demanding setting because each client only has two labels. Thus, their weak classifiers are not good enough to be boosted effectively.

Overall, we believe that the evidence presented here is enough to conclude that the approach is beneficial and that DistBoost.F does not perform as well as the other two algorithms. There is evidence, albeit not conclusive, that PreWeak.F outperforms AdaBoost.F in terms of performance and that PreWeak.F might suffer more than AdaBoost.F from overfitting problems.

5. Conclusions

The possibility of applying federated learning beyond gradient-based methods may broaden the adoption of this methodology. In this paper, we exploit ideas from distributed boosting literature to propose three algorithms, DistBoost.F, PreWeak.F, and AdaBoost.F, which allow, for the first time, the federation of parties without putting constraints on the type of models learned in the clients. Indeed, to the best of our knowledge, our proposal is also the first to allow each client to choose a different local model.

Our experiments show that the federation works. The generalization error of the federation is driven down by the three algorithms and, except in trivial cases, the federated model largely outperforms the models that could have been learned locally. Experiments also show that non-iid data distributions can harm the quality of the federated model. Specifically, when an extreme skew on the labels is present, the federation might suffer, especially when the problem is multi-class and the number of possible labels is large. We leave as future work a comparison between our approach and traditional (gradient-based) federated algorithms. The comparison would also allow us to assess how much the problems we observed in some non-iid settings are specific to our methodology.

This work opens the doors to many possible future directions. We aim to perform an in-depth analysis of these algorithms’ security and privacy aspects in our future work. As mentioned, we would like to compare their behavior against gradient-based alternatives.

[1] McMahan et al., Communication-efficient learning of deep networks from decentralized data, in: Artificial intelligence and statistics, PMLR, 2017, pp. 1273-1282.

[2] F. D'Ascenzo, O. De Filippo, G. Gallone, G. Mittone, M. A. Deriu, M. Iannaccone, A. Ariza-Solé, C. Liebetrau, S. Manzano-Fernández, G. Quadri, et al., Machine learning-based prediction of adverse events following an acute coronary syndrome (PRAISE): a modelling study of pooled datasets, The Lancet 397 (2021) 199-207.

[3] Polato, M. Aldinucci, Boosting the federation: Cross-silo federated learning without gradient descent, 2022 International Joint Conference on Neural Networks (IJCNN) (2022) 1-10.

[4] Freund, R. E. Schapire, Game theory, on-line prediction and boosting, in: Proceedings of the ninth annual conference on Computational learning theory, 1996, pp. 325-332.

[5] Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of computer and system sciences 55 (1997) 119-139.

[6] Freund, Schapire, Abe, A short introduction to boosting, Journal-Japanese Society For Artificial Intelligence 14 (1999) 1612.

[7] Lazarevic, Obradovic, Boosting algorithms for parallel and distributed learning, Distributed and Parallel Databases 11 (2002) 203-229. URL: https://doi.org/10.1023/A:1013992203485. doi:10.1023/A:1013992203485.

[8] Cooper, L. Reyzin, Improved algorithms for distributed boosting, in: 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2017, pp. 806-813. doi:10.1109/ALLERTON.2017.8262822.

[9] Breiman, Bagging predictors, Machine learning 24 (1996) 123-140.

[10] D. H. Wolpert, Stacked generalization, Neural networks 5 (1992) 241-259.

[11] E. B. Kong, T. G. Dietterich, Error-correcting output coding corrects bias and variance, in: Machine learning proceedings 1995, Elsevier, 1995, pp. 313-321.

[12] R. E. Schapire, The strength of weak learnability, Machine learning 5 (1990) 197-227.

[13] Liu, Zhang, Meng, Zheng, Federated forest (2019). doi:10.1109/TBDATA.2020.2992755. arXiv:1905.10053.

[14] L. A. C. de Souza, G. Antonio F. Rebello, G. F. Camilo, L. C. B. Guimarães, O. C. M. B. Duarte, Dfedforest: Decentralized federated forest, in: 2020 IEEE International Conference on Blockchain (Blockchain), 2020, pp. 90-97. doi:10.1109/Blockchain50366.2020.00019.

[15] Wu, Cai, Xiao, Chen, B. C. Ooi, Privacy preserving vertical federated learning for tree-based models, Proc. VLDB Endow. 13 (2020) 2090-2103. URL: https://doi.org/10.14778/3407790.3407811. doi:10.14778/3407790.3407811.

[16] Cramer, I. B. Damgård, J. B. Nielsen, Secure Multiparty Computation and Secret Sharing, Cambridge University Press, 2015. doi:10.1017/CBO9781107337756.

[17]

Polato ,

Gallinaro ,

Aiolli , Privacy-preserving kernel computation for ver-