=Paper= {{Paper |id=None |storemode=property |title=Communication-Efficient Distributed Online Prediction using Dynamic Model Synchronizations |pdfUrl=https://ceur-ws.org/Vol-1018/paper6.pdf |volume=Vol-1018 |dblpUrl=https://dblp.org/rec/conf/vldb/BoleyKKSS13 }} ==Communication-Efficient Distributed Online Prediction using Dynamic Model Synchronizations== https://ceur-ws.org/Vol-1018/paper6.pdf
     Communication-Efficient Distributed Online Prediction
          using Dynamic Model Synchronizations
                                                  [Extended Abstract]
                      Mario Boley and Michael Kamp                      Daniel Keren
                     Fraunhofer IAIS & University of Bonn                 Haifa University
              {mario.boley,michael.kamp}@iais.fraunhofer.de         dkeren@cs.haifa.ac.il
                                Assaf Schuster and Izchak Sharfman
                               Technion, Israel Institute of Technology
                            assaf@technion.ac.il & tsachis@technion.ac.il

ABSTRACT                                                               1.1    Setting and Related Work
We present the first protocol for distributed online predic-              Despite the communication costs it induces, decentraliza-
tion that aims to minimize online prediction loss and net-             tion is inevitable in many modern large-scale applications. Hence,
work communication at the same time. Applications in-                  recent articles [Balcan et al., 2012, Daumé III et al.] explic-
clude social content recommendation, algorithmic trading,              itly investigate the communication complexity of learning
and other scenarios where a configuration of local prediction          with decentralized data. They consider, however, the of-
models of high-frequency streams is used to provide a real-            fline task of finding a good global model over the union of
time service. For stationary data, the proposed protocol re-           all data as a final computation result. The same applies
tains the asymptotic optimal regret of previous algorithms.            to some work on parallel machine learning (e.g., Zinkevich
At the same time, it allows a substantial reduction of network            et al. [2010], McDonald et al. [2010]) where data shards are
communication, and, in contrast to previous approaches,                distributed among several processors and then all computa-
it remains applicable when the data is non-stationary and              tion is carried out independently in parallel except for one
shows rapid concept drift. The protocol is based on con-               final model merging step. While these approaches avoid
trolling the divergence of the local models in a decentralized         communication for performance reasons, they do not intend
way. Its beneficial properties are also confirmed empirically.         to optimize the predictive performance during the compu-
                                                                       tation. In contrast, we are interested in the online in-place
                                                                       performance, i.e., for every data point performance is as-
1.   INTRODUCTION                                                      sessed locally when and where it is received or sampled.
   We consider online prediction problems where data points               To this end, research focused so far on specific environ-
are observed at local nodes in a distributed environment               ments with fixed communication constraints. Correspond-
and there is a trade-off between maximizing prediction ac-             ingly, the learning strategies that are proposed and analyzed
curacy and minimizing network communication. This situ-                for these settings, do not aim to minimize communication
ation abounds in a wide range of machine learning applica-             beyond the level that is enforced by these constraints. Zinke-
tions, in which communication induces a severe cost. Ex-               vich et al. [2009] considers a shared-memory model, in which
amples are parallel data mining [Zinkevich et al., 2009, Hsu           all local nodes can update a global model in a round-robin
et al.] where communication constitutes a performance bot-             fashion as they process their training examples. Since this
tleneck, learning with mobile sensors [Nguyen et al., 2004,            approach is problematic if there is a notable communication
Predd et al., 2006] where communication drains battery                 latency, strategies have been investigated [Mann et al., 2009,
power, and, most centrally, prediction-based real-time ser-            Dekel et al., 2012] that communicate only periodically after
vices [Dekel et al., 2012] carried out by several servers, e.g.,       a statically fixed number of data points have been processed.
for social content promotion, ad placement, or algorithmic             Dekel et al. [2012] shows that for smooth loss functions and
trading. In addition to the above, here the cost of commu-             stationary environments optimal asymptotic regret bounds
nication can also be a loss of prediction quality itself when            can be retained by updating a global model only after mini-
training examples have to be discarded due to network la-                batches of O(∛m) data points. Here, m denotes the total
tency.                                                                 number of data points observed throughout the lifetime of
                                                                       the system. For large values of m, the effect of bounded
                                                                       latency values is asymptotically outgrown by the increasing
                                                                       mini-batch size.
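To make the mini-batch arithmetic above concrete, the following sketch counts synchronization messages over a system lifetime of m data points and k nodes (a hypothetical counting function; the batch-size constant of 1 and the two-messages-per-node cost model are simplifying assumptions, not taken from Dekel et al. [2012]):

```python
import math

def static_sync_messages(m: int, k: int) -> int:
    """Model messages for static synchronization with batch size b ~ m^(1/3).

    Every batch triggers one synchronization in which each of the k nodes
    sends its model to the coordinator and receives the mean model back
    (2k messages per synchronization).
    """
    b = round(m ** (1 / 3))      # mini-batch size on the order of the cube root of m
    syncs = m // b               # synchronizations over the system lifetime
    return 2 * k * syncs
```

For m = 10^6 and k = 8 this gives 160,000 messages, so the per-data-point communication vanishes as the lifetime grows.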
                                                                          While a fixed periodic communication schedule reduces
                                                                       the communication by some fixed amount, further reduction
                                                                       is desirable: The above mentioned costs of communication
                                                                       can have a severe impact on the practical performance—even
                                                                       if they are not reflected in asymptotic performance bounds.
                                                                       This is further amplified because a large number of modeling



tasks are performed simultaneously sharing the same limited              Ei = {di , . . . , di+1 − 1} between any two drift points. We
bandwidth. Moreover, distributed learning systems that are               assume that all learners sample from D independently in
deployed for a long lifetime relative to their data through-             parallel using a constant and uniform sampling frequency,
put can experience periodical or singular target drifts (e.g.,           and we denote by (xt,l , yt,l ) ∼ Dt the training example
corresponding to micro-trends in social networks). In these              received at node l at time t. Generally, we assume that all
settings, a static schedule is bound to either provide only lit-         training examples are bounded by a ball with radius R.
tle to no communication reduction or to insufficiently react                Conceptually, every learner first observes the input part
to changing data distributions.                                          xt,l and performs a real time service based on the linear
                                                                         prediction score pt,l = ⟨wt,l , xt,l ⟩, i.e., the inner prod-
1.2    Contributions and Outline                                         uct of xt,l and the learner’s current model vector. Only
   In this work, we give the first distributed prediction proto-         then it receives as feedback the true label yt,l , which it can
col for linear models that, at the same time, aims to provide            use to locally update its model to wt+1,l = ϕ(wt,l , xt,l , yt,l )
a high online in-place prediction performance and explic-                by some update rule ϕ : Rn × X × Y → Rn . Finally,
itly tries to minimize communication. In terms of predictive             the learners are connected by a communication infrastruc-
power, as shown in Sec. 3.1, the protocol retains the asymp-                ture that allows them to jointly perform a synchroniza-
totic optimal regret of the distributed mini-batch algorithm             tion operation σ : Rk×n → Rk×n that resets the whole
of Dekel et al. [2012] for stationary data. In addition, it              model configuration to a new state and that may take into
allows us to reduce the communication among the local nodes                 account the information of all local learners simultaneously.
substantially. This is achieved by a dynamic data depen-                 The performance of such a distributed online prediction sys-
dent communication schedule, which, in contrast to previ-                tem is measured by two quantities: 1) the predictive perfor-
ous algorithms, remains applicable when the data is non-                 mance ∑t∈[T ] ∑l∈[k] f (pt,l , yt,l ) measured by a loss function
stationary and shows rapid concept drifts. The main idea is              f : R × Y → R+ that assigns positive penalties to predic-
to synchronize the local models to their mean model in order             tion scores; and 2) the amount of communication within
to reduce their variance, but to do so only in system states             the system that is measured by the number of bits sent in-
that show a high divergence among the models. This diver-                between learners to compute the sync operation. Next, we spec-
gence, measured by the average model distance to the mean                ify possible choices for the update rule, the loss function, and
model, indicates the synchronizations that are most impor-               the synchronization operator.
tant in terms of their correcting effects on the predictions. In
stable phases this allows communicative quiescence, while,
in hard phases where variance reduction is crucial, the pro-             2.2    Losses and Gradient Descent
tocol will trigger a lot of model synchronizations. In order                Generally, the communication protocol developed in this
to efficiently implement this strategy one has to monitor the            paper is applicable to a wide range of online update rules for
non-linear divergence function without communication over-               linear models from, e.g., the passive aggressive rule [Cram-
head. We propose a solution to this problem that adapts re-              mer and Singer, 2001] to regularized dual averaging [Xiao,
cent ideas from distributed systems research based on local              2010]. However, the regret bound given in Theorem 2 as-
safe-zones in the function domain (Sec. 3.2). Experiments                sumes that the updates are contractions. That is, there
confirm the beneficial properties of the protocol (Sec. 4).              is some constant c < 1 such that for all w, w0 ∈ Rn , and
                                                                         x, y ∈ X × Y it holds that kϕ(w, x, y) − ϕ(w0 , x, y)k ≤
2.    PRELIMINARIES                                                      ckw − w0 k. For the sake of simplicity, in this paper, we focus
                                                                         on rules based on l2 -regularized stochastic gradient descent,
   In this section we formally introduce the distributed on-
                                                                         for which this contraction property is readily available. We
line prediction task. As simple local learning tool we recall
                                                                         note that by considering expected contractions the result can
stochastic gradient descent for linear models. Finally, we
                                                                         be extended to rules that reduce on average the distance to
review the state-of-the-art communication protocol as a de-
                                                                         a (regularized) loss minimizer.
parture point for developing a more communication-efficient
                                                                            Before we can define gradient descent updates, we have to
solution in subsequent sections.
                                                                         introduce the underlying loss functions measuring predictive
2.1    Distributed Online Prediction                                     performance. Again for convenience, we restrict ourselves to
                                                                         functions that are differentiable, convex, and globally Lips-
   Throughout this paper we consider a distributed online
                                                                         chitz continuous in the prediction score, i.e., there is some
prediction system of k local learners that maintain individ-
                                                                         constant L such that for all p, p′ ∈ R and y ∈ Y it holds that
ual linear models wt,1 , . . . , wt,k ∈ Rn of some global envi-
                                                                         |f (p, y) − f (p0 , y)| ≤ L|p − p0 |. While these assumptions can
ronment through discrete time t ∈ [T ] where T ∈ N denotes
                                                                         be relaxed by spending some technical effort, they already
the total time horizon with respect to which we analyze the
                                                                         include loss functions for all standard predictions tasks such
system’s performance. This environment is represented by
                                                                         as the logistic loss flg (p, y) = ln(1 + exp(−yp)) for binary
a target distribution Dt : X × Y → [0, 1] that describes
                                                                         classification (case Y = {−1, 1}) or the Huber loss for re-
the relation between an input space X ⊆ Rn and an output
                                                                         gression (in the case Y = R)
space Y ⊆ R. The nature of Y varies with the learning task
at hand; Y = {−1, 1} is used for binary classification, Y = R
for regression. While we allow Dt to vary with time, we as-                          fhu (p, y) = (1/2)(p − y)² ,   for |p − y| ≤ 1,
sume that it remains constant most of the time and only                              fhu (p, y) = |p − y| − 1/2 ,   otherwise.
experiences a small number of rapid drifts. That is, there
are drift points 0 = d0 < d1 < · · · < dp = T such that for
all i ∈ [p] and t, t0 ∈ [T ] with di−1 ≤ t ≤ t0 < di it holds that       See, e.g., Zhang [2004] for further possible choices. In both
Dt = Dt0 . Hence, there are identically distributed episodes             of these cases the (best) Lipschitz constant is L = 1.


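The predict-then-update step of Sec. 2.1 together with the l2-regularized SGD rule of Sec. 2.2 can be sketched as follows (a minimal sketch; the learning rate eta and regularization lam are illustrative constants chosen to satisfy η ≤ (RL + λ)⁻¹ on this data, not values prescribed by the paper):

```python
import math

def predict(w, x):
    """Linear prediction score p = <w, x>."""
    return sum(wi * xi for wi, xi in zip(w, x))

def logistic_loss(p, y):
    """f_lg(p, y) = ln(1 + exp(-y * p)); 1-Lipschitz in the score p."""
    return math.log1p(math.exp(-y * p))

def sgd_update(w, x, y, eta=0.1, lam=0.1):
    """phi(w, x, y) = w - eta * grad_w((lam/2) * ||w||^2 + f(<w, x>, y)).

    For the logistic loss the score gradient is df/dp = -y / (1 + exp(y * p)).
    """
    p = predict(w, x)
    dloss_dp = -y / (1.0 + math.exp(y * p))
    return [wi - eta * (lam * wi + dloss_dp * xi) for wi, xi in zip(w, x)]

# With a constant learning rate eta <= 1/(R*L + lam), one update contracts
# two model vectors toward each other by at least the factor c = 1 - eta*lam.
w1, w2 = [0.0, 0.0], [1.0, -1.0]
x, y = [0.5, 0.5], 1
u1, u2 = sgd_update(w1, x, y), sgd_update(w2, x, y)
```

On this particular example both prediction scores happen to be 0, so the distance between the two models shrinks exactly by the factor 1 − ηλ = 0.99.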
Algorithm 1 Static Synchronization Protocol                                 protocol similar to Alg. 1. Here, after a batch of kb examples
Initialization:                                                             has been processed globally in the system, all local models
                                                                            are re-set to the mean model w̄ of the configuration w, de-
  local models w1,1 , . . . , w1,k ← (0, . . . , 0)
                                                                            fined as w̄ = (1/k) ∑l∈[k] wl . Formally, the synchronization
Round t at node l:                                                          operator that is implicitly employed in these algorithms is
  observe xt,l and provide service based on pt,l                            given by σ(wt ) = (w̄t , . . . , w̄t ). We refer to this operation
  observe yt,l and update wt+1,l ← ϕ(wt,l , xt,l , yt,l )                       have been learned in parallel on independent training data
  if t mod b = 0 then                                                       model mixture is often used for combining linear models that
     send wt,l to coordinator                                               have been learned in parallel on independent training data
                                                                            (see Mann et al. [2009], McDonald et al. [2010], Zinkevich
At coordinator every b rounds:
                                                                            et al. [2010]). The motivation is that the mean of k models
  receive local models {wt,l : l ∈ [k]}
                                                                            provides a variance reduction of √k over an individual ran-
  send wt,1 , . . . , wt,k ← (1/k) ∑l∈[k] wt,l
                                                                            dom model (recall that all learners sample from the same
                                                                            distribution, hence their models are identically distributed).
                                                                            Dekel et al. [2012] shows that when the gradient variance is
  With this we can define stochastic gradient descent                       bounded then the optimal regret can be asymptotically re-
(SGD) rules with l2 -regularization, i.e., rules of the form                tained by setting b = O(∛Ti ) even if a constant number of
                                                                            examples have to be discarded during each synchronization
     ϕ(w, x, y) = w − ηt ∇w ((λ/2)‖w‖² + f (⟨w, x⟩, y))                     due to network latency. Note that this reference considers
                                                                            a slightly modified algorithm based on delayed gradient de-
where λ ∈ R+ is a strictly positive regularization pa-                      scent, which only applies (accumulated) updates at synchro-
rameter and ηt ∈ R+ are strictly positive learning rates                    nization points. However, the expected loss of eager updates
for t ∈ N. For stationary target distributions, one often                 (as used in Alg. 1) is bounded by the expected loss of de-
chooses a decreasing learning rate such as ηt = 1/√t in or-                 layed updates (as used in Dekel et al. [2012]) as long as the
der to guarantee convergence of the learning process. For                   updates reduce the distance to a loss minimizer on average
non-stationary targets this is infeasible, because for large t              (which is the case for sufficiently small learning rates and
it would prevent sufficient model adaption to target changes.               regularization parameters; see again Zhang [2004, Eq. 5]).
However, one can show [Zinkevich et al., 2010] that stochas-                   Closing this section, let us analyze the communication
tic gradient descent is a contraction for sufficiently small                cost of this protocol. Using a designated coordinator node as
constant learning rates. Namely, for η ≤ (RL + λ)−1 the                     in Alg. 1, σ can be computed simply by all nodes sending their
updates do contract with constant c = 1 − ηλ. This can be                   current model to the coordinator, who in turn computes the
used to show that the stochastic learning process converges                 mean model and sends it back to all the nodes. For assessing
to a distribution centered close to a regularized loss mini-                the communication cost of this operation, we only count the
mizer even when the process is distributed among k nodes                    number of model vectors sent between the learners. This is
(see the analysis of Zinkevich et al. [2010]). This refers to               feasible because, independently of the exact communication
the stochastic learning process defined by the mean of inde-                infrastructure, the number of model messages asymptoti-
pendent local models that result from SGD with iid samples                  cally determines the true bit-based cost. Hence, asymptot-
from (episodes of) the target distribution. In this paper, the              ically the communication cost of static model synchro-
contraction property is used for the regret bound of Thm. 2.                nization over k nodes with batch size b is O(kT /b). Dekel
                                                                            et al. [2012] assumes that the data distribution is station-
2.3     Communication and Mini-batches                                      ary over all rounds and b can therefore be set to O(∛T ).
   For every episode Ei , the predictive performance of a dis-              This results in an automatic communication reduction that
tributed prediction system lies between two baselines that                  increases with a longer system lifetime. However, this strat-
correspond to the two extremes in terms of communication                    egy is not applicable when we want to stay adaptive towards
behavior—complete centralization and no communication.                      changing data distributions. In this case, we have to set the
Let Ti = |Ei | denote the length of episode Ei and by R =                   not with respect to the overall system lifetime. This num-
∑t∈Ei ,l∈[k] (f (pt,l , yt,l ) − f ∗ ) the regret with respect to the op-       ber can be much smaller than T resulting in batch sizes that
                                                                            are too small to meet our communication reduction goal. In
timal expected loss f ∗ = minw∈Rn E(x,y)∼Di [f (⟨w, x⟩, y)].              the following section, we therefore design a synchronization
When all data points are centrally processed by one online                  are too small to meet our communication reduction goal. In
learner, for long enough episodes one can achieve an ex-          data-dependent dynamic schedule.
pected regret of O(√(kTi )) which is optimal (see Cesa-Bianchi
and Lugosi [2006] and Abernethy et al. [2009]). In contrast,
when the k nodes perform their learning processes in paral-
lel without any communication this results in an expected
regret of O(k√Ti ), which is worse than the centralized per-       3.   DYNAMIC SYNCHRONIZATION
formance by a factor of √k. Therefore, we are interested                        The synchronization protocol of Alg. 1 is static because it
in algorithms that lie between these two extremes and that                  synchronizes after a fixed number of rounds independently
show a beneficial trade-off between predictive performance                  of the sampled data and its effect on the local models. Con-
and the amount of communication.                                               sequently, it incurs the communication cost of a full synchro-
   Mann et al. [2009] and Dekel et al. [2012] give algorithms               thus receive little to no correction. In this section,
where information between nodes is only exchanged every                     thus only receive little to none correction. In this section,
b rounds where b ∈ N is referred to as batch size. These                    we develop a dynamic protocol for synchronizations based
algorithms can be written as static model synchronization                   on quantifying their effect. After showing that this approach


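Algorithm 1 can be simulated serially as follows (a sketch under simplifying assumptions: rounds are globally synchronized, communication is counted as 2k model messages per synchronization as in the text, and the local update rule ϕ is passed in as a parameter):

```python
def full_mean_sync(models):
    """sigma(w) = (mean, ..., mean): reset every local model to the mean model."""
    k, n = len(models), len(models[0])
    mean = [sum(m[j] for m in models) / k for j in range(n)]
    return [list(mean) for _ in range(k)]

def run_static_protocol(streams, update, b):
    """Run k learners round by round, synchronizing every b rounds.

    streams[l] is the sequence of (x, y) examples arriving at node l and
    update(w, x, y) is the local update rule phi. Returns the final model
    configuration and the number of model messages sent.
    """
    k, n = len(streams), len(streams[0][0][0])
    models = [[0.0] * n for _ in range(k)]
    messages = 0
    for t, batch in enumerate(zip(*streams), start=1):
        models = [update(w, x, y) for w, (x, y) in zip(models, batch)]
        if t % b == 0:
            messages += 2 * k        # k models to the coordinator, k means back
            models = full_mean_sync(models)
    return models, messages
```

With T = 6 rounds and b = 3 this performs two synchronizations, i.e., 2 · 2k messages, matching the O(kT /b) communication cost stated in the text.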
is sound from a learning perspective, we discuss how it can                     Algorithm 2 Dynamic Synchronization Protocol
be implemented in a communication-efficient way.                                Initialization:
3.1     Partial Synchronizations                                                  local models w1,1 , . . . , w1,k ← (0, . . . , 0)
                                                                                  reference point r ← (0, . . . , 0)
   A simple measure to quantify the correcting effect of syn-
                                                                                  violation counter v ← 0
chronizations is given by the average Euclidean distance be-
tween the current local models and the result model. We                         Round t at node l:
refer to this quantity as the divergence of a model con-                      observe xt,l and provide service based on pt,l
figuration, denoted by δ(·), i.e., δ(w) = (1/k) ∑l∈[k] ‖w̄ − wl ‖₂ .                    observe yt,l and update wt+1,l ← ϕ(wt,l , xt,l , yt,l )
In the following definition we provide a relaxation of the                        if t mod b = 0 and ‖r − wt,l ‖ > ∆/2 then
full mean synchronization operation that introduces some                             send wt,l to coordinator
leeway in terms of this divergence.
                                                                                At coordinator on violation:
  Definition 1. A partial synchronization operator with                           let B be set of nodes with violation
a positive divergence threshold ∆ ∈ R+ is an operator σ∆ :
Rk×n → Rk×n that 1) leaves the mean model invariant and                       if v = k then B ← [k], v ← 0
2) after its application the model divergence is bounded by                       while B ≠ [k] and ‖r − (1/|B|) ∑l∈B wl ‖ > ∆ do
∆. That is, for all model configurations w ∈ Rk×n it holds                           augment B by augmentation strategy
that σ∆ w has the same mean model w̄ as w and δ(σ∆ w) ≤ ∆.                                       receive models from nodes added to B
                                                                                   send to nodes in B the model w̄ = (1/|B|) ∑l∈B wl
An operator adhering to this definition does not generally
                                                                                   if B = [k] also set new reference model r ← w̄
put all nodes into sync (despite the fact that we still refer
to it as a synchronization operator). In particular it may
leave all models untouched as long as the divergence re-
mains below the threshold ∆. The following theorem notes                        that transfers the system back into a valid state whenever
that partial synchronization has a controlled regret over full                  one or more of these local conditions are violated. This in-
synchronization if the batch size is sufficiently large and the                 cludes carrying out a sufficient amount of synchronization
divergence threshold is set proportional to the Lipschitz con-                  to reduce the divergence to be less or equal than ∆.
stant L of the losses and the data radius R.                                       For deriving local conditions we consider the domain of
                                                                                the divergence function restricted to an individual model
   Theorem 2. Suppose the update rule ϕ is a contraction                        vector. Here, we identify a safe-zone S (see Keren et al.
with constant c. Then, for batch sizes b ≥ (log2 c−1 )−1 and                    [2012]) such that the global divergence cannot cross the
divergence thresholds ∆ ≤ ε/(2RL), the average regret of                        ∆-threshold as long as all local models remain in S.1 The
using a partial synchronization operator σ∆ instead of σ is                     following statement, which we give again without proof, pro-
bounded by ε, i.e., for all rounds t ∈ N it holds that the                      vides a valid spherical safe zone Sr that is centered around
average regret (1/k) ∑l∈[k] |f (p∆t,l , yt,l ) − f (pt,l , yt,l )| is bounded       some global reference point r.
by ε, where pt,l and p∆t,l denote the prediction scores at learner
l and time t resulting from σ and σ∆ , respectively.                              Theorem 3. Let r ∈ Rn be some reference point. If for
                                                                                all nodes l ∈ {1, . . . , k} it holds that kr − wl k ≤ ∆/2 then
We omit the proof here, referring to the full version of this                    we have for the model divergence that δ(w) ≤ ∆.
paper. While the contraction assumption is readily avail-
able for regularized SGD, as mentioned in Sec. 2, it can be                        We now incorporate these local conditions into a distributed
relaxed: by requiring the updates to only contract on expec-                    prediction protocol. As a first step, we have to guarantee
tation it is possible to extend the theorem to unregularized                    that at all times all nodes use the same reference point. For
SGD updates as well as to other rules. Moreover, we remark                      a prediction t, let us denote by t0 the last time prior to
that Thm. 2 implies that partial synchronizations retain the                    t when a full model synchronization was performed (resp.
optimality of the static mini-batch algorithm of Dekel et al.                   t0 = 0 in case no full synchronization has happened un-
[2012] for the case of stationary targets: By using a time-                    til round t). The mean model wt0 is known to all local
dependent divergence threshold εt ∈ O(1/√t) the                     learners. We use this model as the reference model and set
bound of O(√T ) follows.                                                        r = wt0 . Each local learner l can then monitor its local
                                                                                 condition ‖r − wl ‖ ≤ ∆/2 in a decentralized manner.
3.2     Communication-efficient Protocol                                           It remains to design a resolution protocol that specifies
   After seeing that partial synchronization operators are                      how to react when one or several of the local conditions are
sound from the learning perspective, we now turn to how                         violated. A direct solution is to trigger a full synchroniza-
they can be implemented in a communication-efficient way.                       tion in that case. This approach, however, does not scale
Every distributed learning protocol that implements a par-                      well with a high number of nodes in cases where model up-
tial synchronization operator has to implicitly control the                     dates have a non-zero probability even in the asymptotic
divergence of the model configuration. However, we cannot                       regime of the learning process. When, e.g., PAC models for
simply compute the divergence by centralizing all local mod-                    the current target distribution are present at all local nodes,
els, because this would incur just as much communication                        the probability of one local violation, albeit very low for an
as static full synchronization. Our strategy to overcome this                   individual node, increases exponentially with the number of
problem is to first decompose the global condition δ(w) ≤ ∆                     nodes. An alternative approach that can keep the amount
into a set of local conditions that can be monitored at their                   1
                                                                                 Note that a direct distribution of the threshold across the
respective nodes without communication (see, e.g., Sharf-                       local nodes (as in, e.g., Keralapura et al. [2006]) is in-
man et al. [2007]). Secondly, we define a resolution protocol                   feasible, because the divergence function is non-linear.
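The intuition behind the decomposition can be illustrated with a minimal sketch (all names are ours, and the maximum pairwise distance shown here is an illustrative stand-in for the paper's divergence function δ): if every node keeps its model within ∆/2 of the shared reference point r, then by the triangle inequality any two local models are at most ∆ apart, so the divergence of the configuration stays bounded without any communication.

```python
import numpy as np

def local_condition_ok(w_local, r, delta):
    """Local condition, checked at each node without communication."""
    return np.linalg.norm(r - w_local) <= delta / 2

def max_pairwise_divergence(models):
    """Illustrative global divergence: largest pairwise model distance."""
    return max(np.linalg.norm(a - b) for a in models for b in models)

rng = np.random.default_rng(0)
delta = 1.0
r = rng.normal(size=10)  # shared reference point (last synchronized mean model)
# Simulate small local SGD drift that respects the local conditions.
models = [r + 0.01 * rng.uniform(-1, 1, size=10) for _ in range(8)]

assert all(local_condition_ok(w, r, delta) for w in models)
# Triangle inequality: ||wi - wj|| <= ||wi - r|| + ||r - wj|| <= delta
assert max_pairwise_divergence(models) <= delta
```

As long as all local conditions hold, no messages need to be exchanged; communication is only triggered on a violation.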


[Figure 1 plots: prediction error (y-axis) versus number of messages (x-axis) for no synchronization, static synchronization with varying batch sizes, and dynamic synchronization with varying divergence thresholds.]
Figure 1: Performance of static and dynamic model synchronization when tracking (left) a rapidly drifting disjunction over 100-dimensional data with 512 nodes; and (right) a neural network with one hidden layer and 150 output variables, with 1024 nodes.


of communication low relative to the number of nodes is to perform a local balancing procedure: on a violation, the respective node sends its model to a designated node we refer to as the coordinator. The coordinator then tries to balance this violation by incrementally querying other nodes for their models. If the mean of all received models lies within the safe zone, it is transferred back as the new model to all participating nodes, and the resolution is finished. If all nodes have been queried, the result is equal to a full synchronization and the reference point can be updated. In both cases, the divergence of the model configuration is bounded by ∆ at the end of the balancing process, because all local conditions hold. This protocol also leaves the global mean model unchanged. Hence, it complies with Def. 1.

While balancing can achieve a high communication reduction over direct resolution, particularly for a large number of nodes, it potentially degenerates in certain special situations: we can end up in a stable regime in which local violations are likely to be balanced by a subset of the nodes, although a full synchronization would strongly reduce the expected number of violations in future rounds. In other words: balancing can delay crucial reference point updates indefinitely. A simple hedging mechanism for online optimization can be employed to avoid this situation: we count the number of local violations using the current reference point and trigger a full synchronization whenever this number exceeds the number of nodes. This concludes our dynamic protocol for distributed prediction. All components are summarized in Alg. 2.

4.   EMPIRICAL EVALUATION

In this section we investigate the practical performance of the dynamic learning protocol in two controlled settings: one with linearly separable data and one with non-separable data. Our main goal is to empirically confirm that the predictive gain of static full synchronization (using a batch size of 8) over no synchronization can be approximately preserved for small enough thresholds, and to assess the amount of communication reduction achieved by these thresholds.

We start with the problem of tracking a rapidly drifting random disjunction. In this case the target distribution produces data that is episode-wise linearly separable. Hence, we can set up the individual learning processes so that they converge to a linear model with zero classification error within each episode. Formally, we identify a target disjunction with a binary vector z ∈ {0, 1}^n. A data point x ∈ X = {0, 1}^n is labeled positively, y = 1, if ⟨x, z⟩ ≥ 1 and otherwise receives a negative label, y = −1. The target disjunction is drawn randomly at the beginning of the learning process and is randomly re-set after each round with a fixed drift probability of 0.0002. In order to have balanced classes, the disjunctions as well as the data points are generated such that each coordinate is set independently to 1 with probability √(1 − 2^{−1/n}). As loss function for the stochastic gradient descent we use the logistic loss. Corresponding to our setting of noise-free linearly separable data, we choose the regularization parameter λ = 0 and the learning rate η = 1.

In Fig. 1 (left) we present the results for dimensionality n = 100, with k = 512 nodes, processing m = 12.8M data points through T = 25000 rounds. For divergence thresholds up to 3.0, dynamic synchronization can retain the error of statically synchronizing every 8 rounds. At the same time, the communication is reduced to 3.9% of the original number of messages. An approximately similar communication reduction can also be achieved using static synchronization by increasing the batch size to 128. This approach, however, retains only 51.5% of the error reduction over no communication. Analyzing the development of the evaluation metrics over time reveals the following: at the beginning of each episode there is a relatively short phase in which additional errors are accumulated and the communicating protocols acquire an advantage over the baseline of never synchronizing. This is followed by a phase during which no additional error is made. Here, the communication curves of the dynamic protocols remain constant, acquiring a gain over the static protocols in terms of communication.

We now turn to a harder experimental setting, in which the target distribution is given by a rapidly drifting two-layer neural network. For this target even the Bayes optimal classifier per episode has a non-zero error, and, in particular, the generated data is not linearly separable. Intuitively, it is harder in this setting to save communication, because a non-zero residual error can cause the linear models to periodically fluctuate around a local loss minimizer, resulting in crossings of the divergence threshold even when the learning processes have reached their asymptotic regime.
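The drifting-disjunction generator described above can be sketched as follows (an illustrative sketch under our reading of the setup; the function names are ours). With both z and x drawn coordinate-wise from Ber(p) with p = √(1 − 2^{−1/n}), the probability of a positive label is 1 − (1 − p²)^n = 1 − (2^{−1/n})^n = 1/2, so the classes are balanced by construction:

```python
import numpy as np

def sample_disjunction_data(n, rng):
    """Draw a random disjunction z and a labeled data point (x, y)."""
    p = np.sqrt(1.0 - 2.0 ** (-1.0 / n))   # per-coordinate Bernoulli parameter
    z = (rng.random(n) < p).astype(int)     # target disjunction
    x = (rng.random(n) < p).astype(int)     # data point
    y = 1 if x @ z >= 1 else -1             # positive iff any literal fires
    return x, y, z

rng = np.random.default_rng(42)
n, m = 100, 20000
labels = np.array([sample_disjunction_data(n, rng)[1] for _ in range(m)])
pos_rate = np.mean(labels == 1)
# P(y = 1) = 1 - (1 - p^2)^n = 1/2 by construction, up to sampling noise.
assert abs(pos_rate - 0.5) < 0.02
```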
We choose the network structure and parameter ranges in a way that allows for a relatively good approximation by linear models (see Bshouty and Long [2012]). The process for generating a single labeled data point is as follows: first, the label y ∈ Y = {−1, 1} is drawn uniformly from Y. Then, values are determined for hidden variables H_i with 1 ≤ i ≤ ⌈log n⌉ based on a Bernoulli distribution P[H_i = · | Y = y] = Ber(p^h_{i,y}). Finally, x ∈ X = {−1, 1}^n is determined by drawing x_i for 1 ≤ i ≤ n according to P[X_i = x_i | H_{p(i)} = h] = Ber(p^o_{i,h}), where p(i) denotes the unique hidden layer parent of x_i. In order to ensure linear approximability, the parameters of the output layer are drawn such that |p^o_{i,−1} − p^o_{i,1}| ≥ 0.9, i.e., the hidden values have a high relevance in determining the observed values. As in the disjunction case, all parameters are re-set randomly after each round with a fixed drift probability (here, 0.005). For this non-separable setting we again choose to optimize the logistic loss, this time with parameters λ = 0.5 and η = 0.05, respectively. Also, in order to increase the stability of the learning process, we apply averaged updates over mini-batches of size 8.

Figure 1 (right) contains the results for dimensionality 150, with k = 1024 nodes, processing m = 2.56M data points through T = 2500 rounds. For divergence thresholds up to 0.05, dynamic synchronization can retain the error of the baseline. At the same time, the communication is reduced to 46% of the original number of messages.

5.   CONCLUSION

We presented a protocol for distributed online prediction that aims to dynamically save on network communication in sufficiently easy phases of the modeling task. The protocol has a controlled predictive regret over its static counterpart, and experiments show that it can indeed reduce the communication substantially, by up to 95%, in settings where the linear learning processes are suitable to model the data well and converge reasonably fast. Generally, the effectiveness of the approach appears to correspond to the effectiveness of linear modeling by SGD in the given setting.

For future research, a theoretical characterization of this behavior is desirable. A practically even more important direction is to extend the approach to other model classes that can tackle a wider range of learning problems. In principle, the approach of controlling model divergence remains applicable as long as the divergence is measured with respect to a distance function that induces a useful loss bound between two models. For probabilistic models this can, for instance, be the KL-divergence. However, more complex distance functions constitute more challenging distributed monitoring tasks, which currently are open problems.

References

Jacob Abernethy, Alekh Agarwal, Peter L. Bartlett, and Alexander Rakhlin. A stochastic view of optimal regret through minimax duality. In COLT 2009 - The 22nd Conference on Learning Theory, 2009.

Maria-Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. Distributed learning, communication complexity and privacy. Journal of Machine Learning Research - Proceedings Track, 23:26.1–26.22, 2012.

Nader H. Bshouty and Philip M. Long. Linear classifiers are nearly optimal when hidden variables have diverse effects. Machine Learning, 86(2):209–231, 2012.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. ISBN 978-0-521-84108-5.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.

Hal Daumé III, Jeff M. Phillips, Avishek Saha, and Suresh Venkatasubramanian. Efficient protocols for distributed classification and optimization. In ALT 2012.

Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.

Daniel Hsu, Nikos Karampatziakis, John Langford, and Alexander J. Smola. Parallel online learning. In Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press.

Ram Keralapura, Graham Cormode, and Jeyashankher Ramamirtham. Communication-efficient distributed monitoring of thresholded counts. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 2006), pages 289–300, 2006.

Daniel Keren, Izchak Sharfman, Assaf Schuster, and Avishay Livne. Shape sensitive geometric monitoring. Knowledge and Data Engineering, IEEE Transactions on, 24(8):1520–1535, 2012.

G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems (NIPS 2009), volume 22, pages 1231–1239, 2009.

Ryan T. McDonald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: Conf. of the North American Chapter of the Association of Computational Linguistics, Proceedings (HLT-NAACL), pages 456–464, 2010.

XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Decentralized detection and classification using kernel methods. In Proceedings of the 21st international conference on Machine learning (ICML 2004), page 80. ACM, 2004.

Joel B. Predd, S. B. Kulkarni, and H. Vincent Poor. Distributed learning in wireless sensor networks. Signal Processing Magazine, IEEE, 23(4):56–69, 2006.

Izchak Sharfman, Assaf Schuster, and Daniel Keren. A geometric approach to monitoring threshold functions over distributed data streams. ACM Trans. Database Syst., 32(4), 2007.

Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. The Journal of Machine Learning Research, 11:2543–2596, 2010.

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st int. conf. on Machine learning (ICML 2004), 2004.

Martin Zinkevich, Alex J. Smola, and John Langford. Slow learners are fast. In Proc. of 23rd Annual Conference on Neural Information Processing Systems (NIPS 2009), pages 2331–2339, 2009.

Martin Zinkevich, Markus Weimer, Alexander J. Smola, and Lihong Li. Parallelized stochastic gradient descent. In Proc. of 24th Annual Conference on Neural Information Processing Systems (NIPS 2010), pages 2595–2603, 2010.

