=Paper=
{{Paper
|id=None
|storemode=property
|title=Communication-Efficient Distributed Online Prediction using Dynamic Model Synchronizations
|pdfUrl=https://ceur-ws.org/Vol-1018/paper6.pdf
|volume=Vol-1018
|dblpUrl=https://dblp.org/rec/conf/vldb/BoleyKKSS13
}}
==Communication-Efficient Distributed Online Prediction using Dynamic Model Synchronizations==
[Extended Abstract]

Mario Boley and Michael Kamp, Fraunhofer IAIS & University Bonn ({mario.boley,michael.kamp}@iais.fraunhofer.de); Daniel Keren, Haifa University (dkeren@cs.haifa.ac.il); Assaf Schuster and Izchak Sharfman, Technion, Israel Institute of Technology (assaf@technion.ac.il, tsachis@technion.ac.il)

ABSTRACT

We present the first protocol for distributed online prediction that aims to minimize online prediction loss and network communication at the same time. Applications include social content recommendation, algorithmic trading, and other scenarios where a configuration of local prediction models of high-frequency streams is used to provide a real-time service. For stationary data, the proposed protocol retains the asymptotically optimal regret of previous algorithms. At the same time, it can substantially reduce network communication, and, in contrast to previous approaches, it remains applicable when the data is non-stationary and shows rapid concept drift. The protocol is based on controlling the divergence of the local models in a decentralized way. Its beneficial properties are also confirmed empirically.

1. INTRODUCTION

We consider online prediction problems where data points are observed at local nodes in a distributed environment and there is a trade-off between maximizing prediction accuracy and minimizing network communication. This situation abounds in a wide range of machine learning applications in which communication induces a severe cost. Examples are parallel data mining [Zinkevich et al., 2009, Hsu et al.], where communication constitutes a performance bottleneck, learning with mobile sensors [Nguyen et al., 2004, Predd et al., 2006], where communication drains battery power, and, most centrally, prediction-based real-time services [Dekel et al., 2012] carried out by several servers, e.g., for social content promotion, ad placement, or algorithmic trading. In addition to the above, here the cost of communication can also be a loss of prediction quality itself when training examples have to be discarded due to network latency.

1.1 Setting and Related Work

Despite the communication costs it induces, decentralization is inevitable in many modern large-scale applications. Hence, recent articles [Balcan et al., 2012, Daumé III et al.] explicitly investigate the communication complexity of learning with decentralized data. They consider, however, the offline task of finding a good global model over the union of all data as a final computation result. The same applies to some work on parallel machine learning (e.g., Zinkevich et al. [2010], McDonald et al. [2010]), where data shards are distributed among several processors and all computation is then carried out independently in parallel except for one final model merging step. While these approaches avoid communication for performance reasons, they do not intend to optimize the predictive performance during the computation. In contrast, we are interested in the online in-place performance, i.e., for every data point the performance is assessed locally when and where it is received or sampled.

To this end, research has so far focused on specific environments with fixed communication constraints. Correspondingly, the learning strategies that are proposed and analyzed for these settings do not aim to minimize communication beyond the level that is enforced by these constraints. Zinkevich et al. [2009] considers a shared-memory model in which all local nodes can update a global model in a round-robin fashion as they process their training examples.
Since this approach is problematic if there is a notable communication latency, strategies have been investigated [Mann et al., 2009, Dekel et al., 2012] that communicate only periodically, after a statically fixed number of data points have been processed. Dekel et al. [2012] shows that for smooth loss functions and stationary environments, optimal asymptotic regret bounds can be retained by updating a global model only after mini-batches of O(∛m) data points. Here, m denotes the total number of data points observed throughout the lifetime of the system. For large values of m, the effect of bounded latency values is asymptotically outgrown by the increasing mini-batch size.

While a fixed periodic communication schedule reduces the communication by some fixed amount, further reduction is desirable: the above-mentioned costs of communication can have a severe impact on the practical performance, even if they are not reflected in asymptotic performance bounds. This is further amplified when a large number of modeling tasks are performed simultaneously sharing the same limited bandwidth. Moreover, distributed learning systems that are deployed for a long lifetime relative to their data throughput can experience periodical or singular target drifts (e.g., corresponding to micro-trends in social networks). In these settings, a static schedule is bound to either provide little to no communication reduction or to react insufficiently to changing data distributions.

1.2 Contributions and Outline

In this work, we give the first distributed prediction protocol for linear models that, at the same time, aims to provide a high online in-place prediction performance and explicitly tries to minimize communication. In terms of predictive power, as shown in Sec. 3.1, the protocol retains the asymptotically optimal regret of the distributed mini-batch algorithm of Dekel et al. [2012] for stationary data. In addition, it can substantially reduce the communication among the local nodes. This is achieved by a dynamic, data-dependent communication schedule which, in contrast to previous algorithms, remains applicable when the data is non-stationary and shows rapid concept drifts. The main idea is to synchronize the local models to their mean model in order to reduce their variance, but to do so only in system states that show a high divergence among the models. This divergence, measured by the average model distance to the mean model, indicates the synchronizations that are most important in terms of their correcting effects on the predictions. In stable phases this allows communicative quiescence, while in hard phases, where variance reduction is crucial, the protocol triggers many model synchronizations. In order to implement this strategy efficiently, one has to monitor the non-linear divergence function without communication overhead. We propose a solution to this problem that adapts recent ideas from distributed systems research based on local safe-zones in the function domain (Sec. 3.2). Experiments confirm the beneficial properties of the protocol (Sec. 4).

2. PRELIMINARIES

In this section we formally introduce the distributed online prediction task. As a simple local learning tool we recall stochastic gradient descent for linear models. Finally, we review the state-of-the-art communication protocol as a departure point for developing a more communication-efficient solution in subsequent sections.

2.1 Distributed Online Prediction

Throughout this paper we consider a distributed online prediction system of k local learners that maintain individual linear models w_{t,1}, ..., w_{t,k} ∈ R^n of some global environment through discrete time t ∈ [T], where T ∈ N denotes the total time horizon with respect to which we analyze the system's performance. This environment is represented by a target distribution D_t : X × Y → [0,1] that describes the relation between an input space X ⊆ R^n and an output space Y ⊆ R. The nature of Y varies with the learning task at hand; Y = {−1, 1} is used for binary classification, Y = R for regression. While we allow D_t to vary with time, we assume that it remains constant most of the time and only experiences a small number of rapid drifts. That is, there are drift points 0 = d_0 < d_1 < ... < d_p = T such that for all i ∈ [p] and t, t' ∈ [T] with d_{i−1} ≤ t ≤ t' < d_i it holds that D_t = D_{t'}. Hence, there are identically distributed episodes E_i = {d_i, ..., d_{i+1} − 1} between any two drift points. We assume that all learners sample from D_t independently in parallel using a constant and uniform sampling frequency, and we denote by (x_{t,l}, y_{t,l}) ∼ D_t the training example received at node l at time t. Generally, we assume that all training examples are bounded by a ball with radius R.

Conceptually, every learner first observes the input part x_{t,l} and performs a real-time service based on the linear prediction score p_{t,l} = ⟨w_{t,l}, x_{t,l}⟩, i.e., the inner product of x_{t,l} and the learner's current model vector. Only then does it receive as feedback the true label y_{t,l}, which it can use to locally update its model to w_{t+1,l} = φ(w_{t,l}, x_{t,l}, y_{t,l}) by some update rule φ : R^n × X × Y → R^n. Finally, the learners are connected by a communication infrastructure that allows them to jointly perform a synchronization operation σ : R^{k×n} → R^{k×n} that resets the whole model configuration to a new state and that may take into account the information of all local learners simultaneously. The performance of such a distributed online prediction system is measured by two quantities: 1) the predictive performance Σ_{t=1}^{T} Σ_{l=1}^{k} f(p_{t,l}, y_{t,l}), measured by a loss function f : R × Y → R_+ that assigns positive penalties to prediction scores; and 2) the amount of communication within the system, measured by the number of bits sent between learners to compute the sync operation.
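To make this round structure concrete, the following minimal sketch (our own illustration; the function and parameter names are assumptions, not the authors' code) shows the predict-then-update cycle of a single learner:

```python
import numpy as np

def local_round(w, x, y, phi):
    """One round at a local learner: serve a prediction based on the
    linear score, then update the model once the label is revealed."""
    p = float(np.dot(w, x))  # prediction score p_{t,l} = <w_{t,l}, x_{t,l}>
    # ... the real-time service would act on p here, before y is known ...
    w_next = phi(w, x, y)    # local update w_{t+1,l} = phi(w_{t,l}, x, y)
    return p, w_next
```

Here `phi` stands for the abstract update rule; concrete instantiations follow below.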
Next, we specify possible choices for the update rule, the loss function, and the synchronization operator.

2.2 Losses and Gradient Descent

Generally, the communication protocol developed in this paper is applicable to a wide range of online update rules for linear models, from, e.g., the passive-aggressive rule [Crammer and Singer, 2001] to regularized dual averaging [Xiao, 2010]. However, the regret bound given in Theorem 2 assumes that the updates are contractions. That is, there is some constant c < 1 such that for all w, w' ∈ R^n and (x, y) ∈ X × Y it holds that ‖φ(w, x, y) − φ(w', x, y)‖ ≤ c‖w − w'‖. For the sake of simplicity, in this paper we focus on rules based on l2-regularized stochastic gradient descent, for which this contraction property is readily available. We note that, by considering expected contractions, the result can be extended to rules that on average reduce the distance to a (regularized) loss minimizer.

Before we can define gradient descent updates, we have to introduce the underlying loss functions measuring predictive performance. Again for convenience, we restrict ourselves to functions that are differentiable, convex, and globally Lipschitz continuous in the prediction score, i.e., there is some constant L such that for all p, p' ∈ R and y ∈ Y it holds that |f(p, y) − f(p', y)| ≤ L|p − p'|. While these assumptions can be relaxed by spending some technical effort, they already include loss functions for all standard prediction tasks, such as the logistic loss f_lg(p, y) = ln(1 + exp(−yp)) for binary classification (the case Y = {−1, 1}) or the Huber loss for regression (the case Y = R):

$$f_{\mathrm{hu}}(p, y) = \begin{cases} \frac{1}{2}(p-y)^2, & \text{for } |p-y| \le 1 \\ |p-y| - \frac{1}{2}, & \text{otherwise.} \end{cases}$$

See, e.g., Zhang [2004] for further possible choices. In both of these cases the (best) Lipschitz constant is L = 1.
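For reference, a direct transcription of these two losses into code (a sketch under the definitions above; the names are ours):

```python
import numpy as np

def logistic_loss(p, y):
    """f_lg(p, y) = ln(1 + exp(-y * p)); Lipschitz in p with L = 1."""
    return float(np.log1p(np.exp(-y * p)))

def huber_loss(p, y):
    """f_hu(p, y): quadratic within distance 1 of the target, linear
    outside; Lipschitz in p with L = 1."""
    r = abs(p - y)
    return 0.5 * r * r if r <= 1.0 else r - 0.5
```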
With this we can define stochastic gradient descent (SGD) rules with l2-regularization, i.e., rules of the form

$$\varphi(w, x, y) = w - \eta_t \nabla_w \left( \frac{\lambda}{2}\|w\|^2 + f(\langle w, x\rangle, y) \right)$$

where λ ∈ R_+ is a strictly positive regularization parameter and the η_t ∈ R_+ are strictly positive learning rates for t ∈ N. For stationary target distributions, one often chooses a decreasing learning rate such as η_t = 1/√t in order to guarantee convergence of the learning process. For non-stationary targets this is infeasible, because for large t it would prevent sufficient model adaptation to target changes. However, one can show [Zinkevich et al., 2010] that stochastic gradient descent is a contraction for sufficiently small constant learning rates. Namely, for η ≤ (RL + λ)^{−1} the updates contract with constant c = 1 − ηλ. This can be used to show that the stochastic learning process converges to a distribution centered close to a regularized loss minimizer, even when the process is distributed among k nodes (see the analysis of Zinkevich et al. [2010]). This refers to the stochastic learning process defined by the mean of independent local models that result from SGD with iid samples from (episodes of) the target distribution. In this paper, the contraction property is used for the regret bound of Thm. 2.
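A minimal sketch of such an update rule, instantiated for the logistic loss (our own illustration, assuming a constant learning rate as discussed above):

```python
import numpy as np

def sgd_update(w, x, y, eta, lam):
    """l2-regularized SGD step phi(w, x, y) for the logistic loss."""
    p = np.dot(w, x)
    dloss = -y / (1.0 + np.exp(y * p))      # derivative of f_lg w.r.t. p
    return w - eta * (lam * w + dloss * x)  # w - eta * grad(lam/2 ||w||^2 + f)
```

For eta ≤ (RL + λ)^{−1} this step is the contraction with constant c = 1 − eta·λ used in Thm. 2.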
2.3 Communication and Mini-batches

For every episode E_i, the predictive performance of a distributed prediction system lies between two baselines that correspond to the two extremes in terms of communication behavior: complete centralization and no communication. Let T_i = |E_i| denote the length of episode E_i, and let R = Σ_{t∈E_i, l∈[k]} (f(p_{t,l}, y_{t,l}) − f*) denote the regret with respect to the optimal expected loss f* = min_{w∈R^n} E_{(x,y)∼D_i}[f(⟨w, x⟩, y)]. When all data points are centrally processed by one online learner, for long enough episodes one can achieve an expected regret of O(√(kT_i)), which is optimal (see Cesa-Bianchi and Lugosi [2006] and Abernethy et al. [2009]). In contrast, when the k nodes perform their learning processes in parallel without any communication, this results in an expected regret of O(k√T_i), which is worse than the centralized performance by a factor of √k. Therefore, we are interested in algorithms that lie between these two extremes and that show a beneficial trade-off between predictive performance and the amount of communication.

Mann et al. [2009] and Dekel et al. [2012] give algorithms where information between nodes is only exchanged every b rounds, where b ∈ N is referred to as the batch size. These algorithms can be written as a static model synchronization protocol similar to Alg. 1. Here, after a batch of kb examples has been processed globally in the system, all local models are reset to the mean model of the configuration, w̄, defined as w̄ = (1/k) Σ_{l=1}^{k} w_l. Formally, the synchronization operator that is implicitly employed in these algorithms is given by σ(w_t) = (w̄_t, ..., w̄_t). We refer to this operation as full mean synchronization.

Algorithm 1 Static Synchronization Protocol
  Initialization:
    local models w_{1,1}, ..., w_{1,k} ← (0, ..., 0)
  Round t at node l:
    observe x_{t,l} and provide service based on p_{t,l}
    observe y_{t,l} and update w_{t+1,l} ← φ(w_{t,l}, x_{t,l}, y_{t,l})
    if t mod b = 0 then send w_{t,l} to coordinator
  At coordinator every b rounds:
    receive local models {w_{t,l} : l ∈ [k]}
    send back w_{t,1}, ..., w_{t,k} ← (1/k) Σ_{l∈[k]} w_{t,l}

The choice of a (uniform) model mixture is often used for combining linear models that have been learned in parallel on independent training data (see Mann et al. [2009], McDonald et al. [2010], Zinkevich et al. [2010]). The motivation is that the mean of k models provides a variance reduction of √k over an individual random model (recall that all learners sample from the same distribution, hence their models are identically distributed).
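The coordinator step of Alg. 1 amounts to simple averaging; a sketch (assuming the k local models are collected into a k × n array):

```python
import numpy as np

def full_mean_sync(models):
    """Full mean synchronization sigma: reset every local model to the
    configuration mean; costs O(k) model messages per invocation."""
    w_mean = models.mean(axis=0)                  # (1/k) * sum_l w_l
    return np.tile(w_mean, (models.shape[0], 1))  # all nodes get w_mean
```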
Dekel et al. [2012] shows that when the gradient variance is bounded, the optimal regret can be asymptotically retained by setting b = O(∛T_i), even if a constant number of examples have to be discarded during each synchronization due to network latency. Note that this reference considers a slightly modified algorithm based on delayed gradient descent, which only applies (accumulated) updates at synchronization points. However, the expected loss of eager updates (as used in Alg. 1) is bounded by the expected loss of delayed updates (as used in Dekel et al. [2012]) as long as the updates reduce the distance to a loss minimizer on average (which is the case for sufficiently small learning rates and regularization parameters; see again Zhang [2004, Eq. 5]).

Closing this section, let us analyze the communication cost of this protocol. Using a designated coordinator node as in Alg. 1, σ can be computed simply by all nodes sending their current model to the coordinator, which in turn computes the mean model and sends it back to all the nodes. For assessing the communication cost of this operation, we only count the number of model vectors sent between the learners. This is feasible because, independently of the exact communication infrastructure, the number of model messages asymptotically determines the true bit-based cost. Hence, asymptotically, the communication cost of static model synchronization over k nodes with batch size b is O(kT/b). Dekel et al. [2012] assumes that the data distribution is stationary over all rounds, and b can therefore be set to O(∛T). This results in an automatic communication reduction that increases with a longer system lifetime. However, this strategy is not applicable when we want to stay adaptive towards changing data distributions. In this case, we have to set the batch size with respect to the expected episode length and not with respect to the overall system lifetime. This number can be much smaller than T, resulting in batch sizes that are too small to meet our communication reduction goal. In the following section, we therefore design a synchronization protocol that can substantially reduce this cost based on a data-dependent dynamic schedule.

3. DYNAMIC SYNCHRONIZATION

The synchronization protocol of Alg. 1 is static because it synchronizes after a fixed number of rounds, independently of the sampled data and its effect on the local models. Consequently, it incurs the communication cost of a full synchronization round even if the models are (almost) identical and thus receive little to no correction. In this section, we develop a dynamic protocol that triggers synchronizations based on quantifying their effect. After showing that this approach is sound from a learning perspective, we discuss how it can be implemented in a communication-efficient way.

3.1 Partial Synchronizations

A simple measure to quantify the correcting effect of synchronizations is given by the average Euclidean distance between the current local models and the result model. We refer to this quantity as the divergence of a model configuration, denoted by δ(·), i.e., δ(w) = (1/k) Σ_{l=1}^{k} ‖w̄ − w_l‖.
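Computed centrally, the divergence is a one-liner; the point of Sec. 3.2 is precisely to avoid this centralization (a sketch, with the models again as a k × n array):

```python
import numpy as np

def divergence(models):
    """delta(w): average Euclidean distance of the k local models to
    their mean model."""
    w_mean = models.mean(axis=0)
    return float(np.mean(np.linalg.norm(models - w_mean, axis=1)))
```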
In the following definition we provide a relaxation of the full mean synchronization operation that introduces some leeway in terms of this divergence.

Definition 1. A partial synchronization operator with a positive divergence threshold ∆ ∈ R_+ is an operator σ_∆ : R^{k×n} → R^{k×n} that 1) leaves the mean model invariant and 2) after its application bounds the model divergence by ∆. That is, for all model configurations w ∈ R^{k×n} it holds that the mean of σ_∆(w) is equal to w̄ and δ(σ_∆(w)) ≤ ∆.

An operator adhering to this definition does not generally put all nodes into sync (albeit we still refer to it as a synchronization operator). In particular, it allows leaving all models untouched as long as the divergence remains below the threshold ∆. The following theorem notes that partial synchronization has a controlled regret over full synchronization if the batch size is sufficiently large and the divergence threshold is set proportionally to the Lipschitz constant L of the losses and the data radius R.

Theorem 2. Suppose the update rule φ is a contraction with constant c. Then, for batch sizes b ≥ (log₂ c^{−1})^{−1} and divergence thresholds ∆ ≤ ε/(2RL), the average regret of using a partial synchronization operator σ_∆ instead of σ is bounded by ε, i.e., for all rounds t ∈ N it holds that the average regret (1/k) Σ_{l=1}^{k} |f(p^∆_{t,l}, y_{t,l}) − f(p_{t,l}, y_{t,l})| is bounded by ε, where p_{t,l} and p^∆_{t,l} denote the prediction scores at learner l and time t resulting from σ and σ_∆, respectively.

We omit the proof here, referring to the full version of this paper. While the contraction assumption is readily available for regularized SGD, as mentioned in Sec. 2, it can be relaxed: by requiring the updates to only contract in expectation, it is possible to extend the theorem to unregularized SGD updates as well as to other rules. Moreover, we remark that Thm. 2 implies that partial synchronizations retain the optimality of the static mini-batch algorithm of Dekel et al. [2012] for the case of stationary targets: by using a time-dependent divergence threshold ∆_t ∈ O(1/√t), the bound of O(√T) follows.

3.2 Communication-efficient Protocol

After seeing that partial synchronization operators are sound from the learning perspective, we now turn to how they can be implemented in a communication-efficient way. Every distributed learning protocol that implements a partial synchronization operator has to implicitly control the divergence of the model configuration. However, we cannot simply compute the divergence by centralizing all local models, because this would incur just as much communication as static full synchronization. Our strategy to overcome this problem is to first decompose the global condition δ(w) ≤ ∆ into a set of local conditions that can be monitored at their respective nodes without communication (see, e.g., Sharfman et al. [2007]). Secondly, we define a resolution protocol that transfers the system back into a valid state whenever one or more of these local conditions are violated. This includes carrying out a sufficient amount of synchronization to reduce the divergence to at most ∆.

For deriving local conditions we consider the domain of the divergence function restricted to an individual model vector. Here, we identify a safe-zone S (see Keren et al. [2012]) such that the global divergence cannot cross the ∆-threshold as long as all local models remain in S. (Note that a direct distribution of the threshold across the local nodes, as in, e.g., Keralapura et al. [2006], is infeasible, because the divergence function is non-linear.) The following statement, which we give again without proof, provides a valid spherical safe-zone S_r that is centered around some global reference point r.

Theorem 3. Let r ∈ R^n be some reference point. If for all nodes l ∈ {1, ..., k} it holds that ‖r − w_l‖ ≤ ∆/2, then we have for the model divergence that δ(w) ≤ ∆.

We now incorporate these local conditions into a distributed prediction protocol. As a first step, we have to guarantee that at all times all nodes use the same reference point. For a prediction round t, let us denote by t' the last time prior to t at which a full model synchronization was performed (resp. t' = 0 in case no full synchronization has happened until round t). The mean model w̄_{t'} is known to all local learners. We use this model as the reference model and set r = w̄_{t'}. A local learner l can then monitor its local condition ‖r − w_l‖ ≤ ∆/2 in a decentralized manner.
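The local condition itself is a communication-free test at each node (a sketch; by Theorem 3, if it holds everywhere, then δ(w) ≤ ∆):

```python
import numpy as np

def local_condition_holds(w_l, r, delta):
    """Safe-zone check at node l: the model must stay within the sphere
    of radius delta/2 around the shared reference point r."""
    return float(np.linalg.norm(r - w_l)) <= delta / 2.0
```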
It remains to design a resolution protocol that specifies how to react when one or several of the local conditions are violated. A direct solution is to trigger a full synchronization in that case. This approach, however, does not scale well with a high number of nodes in cases where model updates have a non-zero probability even in the asymptotic regime of the learning process. When, e.g., PAC models for the current target distribution are present at all local nodes, the probability of one local violation, albeit very low for an individual node, increases exponentially with the number of nodes. An alternative approach that can keep the amount of communication low relative to the number of nodes is to perform a local balancing procedure: on a violation, the respective node sends its model to a designated node that we refer to as the coordinator. The coordinator then tries to balance this violation by incrementally querying other nodes for their models. If the mean of all received models lies within the safe zone, it is transferred back as the new model to all participating nodes, and the resolution is finished. If all nodes have been queried, the result is equal to a full synchronization and the reference point can be updated. In both cases, the divergence of the model configuration is bounded by ∆ at the end of the balancing process, because all local conditions hold. Also, this protocol leaves the global mean model unchanged. Hence, it complies with Def. 1.

While balancing can achieve a high communication reduction over direct resolution, particularly for a large number of nodes, it can degenerate in certain special situations: we can end up in a stable regime in which local violations are likely to be balanced by a subset of the nodes, although a full synchronization would strongly reduce the expected number of violations in future rounds. In other words, balancing can delay crucial reference point updates indefinitely. A simple hedging mechanism for online optimization can be employed to avoid this situation: we count the number of local violations under the current reference point and trigger a full synchronization whenever this number exceeds the number of nodes. This concludes our dynamic protocol for distributed prediction. All components are summarized in Alg. 2.

Algorithm 2 Dynamic Synchronization Protocol
  Initialization:
    local models w_{1,1}, ..., w_{1,k} ← (0, ..., 0)
    reference point r ← (0, ..., 0)
    violation counter v ← 0
  Round t at node l:
    observe x_{t,l} and provide service based on p_{t,l}
    observe y_{t,l} and update w_{t+1,l} ← φ(w_{t,l}, x_{t,l}, y_{t,l})
    if t mod b = 0 and ‖r − w_{t,l}‖ > ∆/2 then send w_{t,l} to coordinator
  At coordinator on violation:
    let B be the set of nodes with violation
    v ← v + |B|
    if v = k then B ← [k], v ← 0
    while B ≠ [k] and ‖r − (1/|B|) Σ_{l∈B} w_l‖ > ∆/2 do
      augment B by augmentation strategy
      receive models from nodes added to B
    send to nodes in B the model w̄_B = (1/|B|) Σ_{l∈B} w_l
    if B = [k] also set new reference point r ← w̄_B
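The balancing loop of Alg. 2 can be sketched as follows (our illustration with a naive lowest-index-first augmentation strategy; `models` maps node ids to model vectors, and the hedging counter is kept outside this function):

```python
import numpy as np

def resolve_violations(models, violators, r, delta):
    """Coordinator-side balancing: grow the set B of queried nodes until
    their mean re-enters the safe zone, then reset all nodes in B to it."""
    k = len(models)
    B = set(violators)
    w_bal = np.mean([models[l] for l in B], axis=0)
    while len(B) < k and np.linalg.norm(r - w_bal) > delta / 2.0:
        B.add(min(set(range(k)) - B))  # augmentation: query one more node
        w_bal = np.mean([models[l] for l in B], axis=0)
    for l in B:                        # balanced nodes receive the mean;
        models[l] = w_bal.copy()       # the global mean model is unchanged
    return w_bal, len(B) == k          # on a full sync, set r <- w_bal
```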
4. EMPIRICAL EVALUATION

In this section we investigate the practical performance of the dynamic learning protocol in two controlled settings: one with linearly separable data and one with non-separable data. Our main goal is to empirically confirm that the predictive gain of static full synchronization (using a batch size of 8) over no synchronization can be approximately preserved for small enough thresholds, and to assess the amount of communication reduction achieved by these thresholds.

[Figure 1: Performance of static and dynamic model synchronization tracking (left) a rapidly drifting disjunction over 100-dimensional data with 512 nodes and (right) a neural network with one hidden layer and 150 output variables with 1024 nodes. Both panels plot the error against the number of messages for no synchronization, static synchronization with varying batch sizes, and dynamic synchronization with varying divergence thresholds.]

We start with the problem of tracking a rapidly drifting random disjunction. In this case the target distribution produces data that is episode-wise linearly separable. Hence, we can set up the individual learning processes so that they converge to a linear model with zero classification error within each episode. Formally, we identify a target disjunction with a binary vector z ∈ {0,1}^n. A data point x ∈ X = {0,1}^n is labeled positively, y = 1, if ⟨x, z⟩ ≥ 1, and otherwise receives a negative label, y = −1. The target disjunction is drawn randomly at the beginning of the learning process and is randomly reset after each round with a fixed drift probability of 0.0002. In order to have balanced classes, the disjunctions as well as the data points are generated such that each coordinate is set independently to 1 with probability 1 − 2^{−1/n}. As the loss function for the stochastic gradient descent we use the logistic loss. Corresponding to our setting of noise-free, linearly separable data, we choose the regularization parameter λ = 0 and the learning rate η = 1.
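For concreteness, this data generation can be sketched as follows (our transcription of the stated process; `rng` is, e.g., `np.random.default_rng()`, and the target z is drawn coordinate-wise by the same Bernoulli scheme):

```python
import numpy as np

def sample_disjunction_example(z, rng):
    """One data point of the drifting-disjunction experiment: each of
    the n coordinates is set to 1 with probability 1 - 2**(-1/n), and
    y = +1 iff x hits the disjunction z."""
    n = len(z)
    x = (rng.random(n) < 1.0 - 2.0 ** (-1.0 / n)).astype(int)
    y = 1 if int(np.dot(x, z)) >= 1 else -1
    return x, y
```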
In Fig. 1 (left) we present the results for dimensionality n = 100, with k = 512 nodes, processing m = 12.8M data points through T = 25000 rounds. For divergence thresholds up to 3.0, dynamic synchronization can retain the error number of statically synchronizing every 8 rounds. At the same time, the communication is reduced to 3.9% of the original number of messages. An approximately similar amount of communication reduction can also be achieved using static synchronization by increasing the batch size to 128. This approach, however, only retains 51.5% of the error reduction over no communication. Analyzing the development of the evaluation metrics over time reveals the following: at the beginning of each episode there is a relatively short phase in which additional errors are accumulated and the communicating protocols acquire an advantage over the baseline of never synchronizing. This is followed by a phase during which no additional error is made. Here, the communication curves of the dynamic protocols remain constant, acquiring a gain over the static protocols in terms of communication.

We now turn to a harder experimental setting, in which the target distribution is given by a rapidly drifting two-layer neural network. For this target even the Bayes optimal classifier per episode has a non-zero error, and, in particular, the generated data is not linearly separable. Intuitively, it is harder in this setting to save communication, because a non-zero residual error can cause the linear models to periodically fluctuate around a local loss minimizer, resulting in crossings of the divergence threshold even when the learning processes have reached their asymptotic regime. We choose the network structure and parameter ranges in a way that allows for a relatively good approximation by linear models (see Bshouty and Long [2012]). The process for generating a single labeled data point is as follows. First, the label y ∈ Y = {−1,1} is drawn uniformly from Y. Then, values are determined for hidden variables H_i, 1 ≤ i ≤ ⌈log n⌉, based on a Bernoulli distribution P[H_i = · | Y = y] = Ber(p^h_{i,y}). Finally, x ∈ X = {−1,1}^n is determined by drawing x_i for 1 ≤ i ≤ n according to P[X_i = x_i | H_{p(i)} = h] = Ber(p^o_{i,h}), where p(i) denotes the unique hidden-layer parent of x_i. In order to ensure linear approximability, the parameters of the output layer are drawn such that |p^o_{i,−1} − p^o_{i,1}| ≥ 0.9, i.e., their values have a high relevance in determining the hidden values. As in the disjunction case, all parameters are reset randomly after each round with a fixed drift probability (here, 0.005). For this non-separable setting we choose again to optimize the logistic loss, this time with parameters λ = 0.5 and η = 0.05, respectively. Also, in order to increase the stability of the learning process, we apply averaged updates over mini-batches of size 8.
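A sketch of this generative process (under our reading that hidden and output variables take values in {−1, 1}; `p_h[i][y]` and `p_o[i][h]` hold the Bernoulli parameters, and `parent[i]` is the hidden parent p(i)):

```python
import numpy as np

def sample_network_example(p_h, p_o, parent, rng):
    """One labeled point from the drifting two-layer network target."""
    y = int(rng.choice([-1, 1]))  # label drawn uniformly from Y
    h = [1 if rng.random() < p_h[i][y] else -1 for i in range(len(p_h))]
    x = np.array([1 if rng.random() < p_o[i][h[parent[i]]] else -1
                  for i in range(len(p_o))])
    return x, y
```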
Figure 1 (right) contains the results for dimensionality 150, with k = 1024 nodes, processing m = 2.56M data points through T = 2500 rounds. For divergence thresholds up to 0.05, dynamic synchronization can retain the error of the baseline. At the same time, the communication is reduced to 46% of the original number of messages.

5. CONCLUSION

We presented a protocol for distributed online prediction that aims to dynamically save on network communication in sufficiently easy phases of the modeling task. The protocol has a controlled predictive regret over its static counterpart, and experiments show that it can indeed reduce the communication substantially, by up to 95% in settings where the linear learning processes are suitable to model the data well and converge reasonably fast. Generally, the effectiveness of the approach appears to correspond to the effectiveness of linear modeling by SGD in the given setting.

For future research, a theoretical characterization of this behavior is desirable. A practically even more important direction is to extend the approach to other model classes that can tackle a wider range of learning problems. In principle, the approach of controlling model divergence remains applicable as long as the divergence is measured with respect to a distance function that induces a useful loss bound between two models. For probabilistic models this can, for instance, be the KL-divergence. However, more complex distance functions constitute more challenging distributed monitoring tasks, which are currently open problems.

References

Jacob Abernethy, Alekh Agarwal, Peter L. Bartlett, and Alexander Rakhlin. A stochastic view of optimal regret through minimax duality. In COLT 2009 - The 22nd Conference on Learning Theory, 2009.

Maria-Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. Distributed learning, communication complexity and privacy. Journal of Machine Learning Research - Proceedings Track, 23:26.1–26.22, 2012.

Nader H. Bshouty and Philip M. Long. Linear classifiers are nearly optimal when hidden variables have diverse effects. Machine Learning, 86(2):209–231, 2012.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. ISBN 978-0-521-84108-5.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.

Hal Daumé III, Jeff M. Phillips, Avishek Saha, and Suresh Venkatasubramanian. Efficient protocols for distributed classification and optimization. In ALT 2012.

Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.

Daniel Hsu, Nikos Karampatziakis, John Langford, and Alexander J. Smola. Parallel online learning. In Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press.

Ram Keralapura, Graham Cormode, and Jeyashankher Ramamirtham. Communication-efficient distributed monitoring of thresholded counts. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 2006), pages 289–300, 2006.

Daniel Keren, Izchak Sharfman, Assaf Schuster, and Avishay Livne. Shape sensitive geometric monitoring. IEEE Transactions on Knowledge and Data Engineering, 24(8):1520–1535, 2012.

G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems (NIPS 2009), volume 22, pages 1231–1239, 2009.

Ryan T. McDonald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: Conf. of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL), pages 456–464, 2010.

XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Decentralized detection and classification using kernel methods. In Proceedings of the 21st International Conference on Machine Learning (ICML 2004), page 80, 2004.

Joel B. Predd, S. B. Kulkarni, and H. Vincent Poor. Distributed learning in wireless sensor networks. IEEE Signal Processing Magazine, 23(4):56–69, 2006.

Izchak Sharfman, Assaf Schuster, and Daniel Keren. A geometric approach to monitoring threshold functions over distributed data streams. ACM Transactions on Database Systems, 32(4), 2007.

Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st International Conference on Machine Learning (ICML 2004), 2004.

Martin Zinkevich, Alex J. Smola, and John Langford. Slow learners are fast. In Proc. of the 23rd Annual Conference on Neural Information Processing Systems (NIPS 2009), pages 2331–2339, 2009.

Martin Zinkevich, Markus Weimer, Alexander J. Smola, and Lihong Li. Parallelized stochastic gradient descent. In Proc. of the 24th Annual Conference on Neural Information Processing Systems (NIPS 2010), pages 2595–2603, 2010.