=Paper= {{Paper |id=None |storemode=property |title=Communication-Efficient Distributed Online Prediction using Dynamic Model Synchronizations |pdfUrl=https://ceur-ws.org/Vol-1018/paper6.pdf |volume=Vol-1018 |dblpUrl=https://dblp.org/rec/conf/vldb/BoleyKKSS13 }} ==Communication-Efficient Distributed Online Prediction using Dynamic Model Synchronizations== https://ceur-ws.org/Vol-1018/paper6.pdf
     Communication-Efficient Distributed Online Prediction
          using Dynamic Model Synchronizations
                                                  [Extended Abstract]
                      Mario Boley and Michael Kamp                      Daniel Keren
                     Fraunhofer IAIS & University of Bonn                 Haifa University
              {mario.boley,michael.kamp}@iais.fraunhofer.de         dkeren@cs.haifa.ac.il
                                Assaf Schuster and Izchak Sharfman
                               Technion, Israel Institute of Technology
                            assaf@technion.ac.il & tsachis@technion.ac.il

ABSTRACT                                                               1.1    Setting and Related Work
We present the first protocol for distributed online predic-              Despite the communication costs it induces, decentraliza-
tion that aims to minimize online prediction loss and net-             tion is inevitable in many modern large-scale applications. Hence,
work communication at the same time. Applications in-                  recent articles [Balcan et al., 2012, Daumé III et al.] explic-
clude social content recommendation, algorithmic trading,              itly investigate the communication complexity of learning
and other scenarios where a configuration of local prediction          with decentralized data. They consider, however, the of-
models of high-frequency streams is used to provide a real-            fline task of finding a good global model over the union of
time service. For stationary data, the proposed protocol re-           all data as a final computation result. The same applies
tains the asymptotic optimal regret of previous algorithms.            to some work on parallel machine learning (e.g., Zinkevich
At the same time, it allows a substantial reduction of network            et al. [2010], McDonald et al. [2010]) where data shards are
communication, and, in contrast to previous approaches,                distributed among several processors and then all computa-
it remains applicable when the data is non-stationary and              tion is carried out independently in parallel except for one
shows rapid concept drift. The protocol is based on con-               final model merging step. While these approaches avoid
trolling the divergence of the local models in a decentralized         communication for performance reasons, they do not intend
way. Its beneficial properties are also confirmed empirically.         to optimize the predictive performance during the compu-
                                                                       tation. In contrast, we are interested in the online in-place
                                                                       performance, i.e., for every data point performance is as-
1.   INTRODUCTION                                                      sessed locally when and where it is received or sampled.
   We consider online prediction problems where data points               To this end, research focused so far on specific environ-
are observed at local nodes in a distributed environment               ments with fixed communication constraints. Correspond-
and there is a trade-off between maximizing prediction ac-             ingly, the learning strategies that are proposed and analyzed
curacy and minimizing network communication. This situ-                for these settings, do not aim to minimize communication
ation abounds in a wide range of machine learning applica-             beyond the level that is enforced by these constraints. Zinke-
tions, in which communication induces a severe cost. Ex-               vich et al. [2009] considers a shared-memory model, in which
amples are parallel data mining [Zinkevich et al., 2009, Hsu           all local nodes can update a global model in a round-robin
et al.] where communication constitutes a performance bot-             fashion as they process their training examples. Since this
tleneck, learning with mobile sensors [Nguyen et al., 2004,            approach is problematic if there is a notable communication
Predd et al., 2006] where communication drains battery                 latency, strategies have been investigated [Mann et al., 2009,
power, and, most centrally, prediction-based real-time ser-            Dekel et al., 2012] that communicate only periodically after
vices [Dekel et al., 2012] carried out by several servers, e.g.,       a statically fixed number of data points have been processed.
for social content promotion, ad placement, or algorithmic             Dekel et al. [2012] shows that for smooth loss functions and
trading. In addition to the above, here the cost of commu-             stationary environments optimal asymptotic regret bounds
nication can also be a loss of prediction quality itself when            can be retained by updating a global model only after mini-
training examples have to be discarded due to network la-                batches of O(∛m) data points. Here, m denotes the total
tency.                                                                 number of data points observed throughout the lifetime of
                                                                       the system. For large values of m, the effect of bounded
                                                                       latency values is asymptotically outgrown by the increasing
                                                                       mini-batch size.
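To make the mini-batch arithmetic above concrete, the following sketch counts synchronization messages over a system lifetime of m data points and k nodes (a hypothetical counting function; the batch-size constant of 1 and the two-messages-per-node cost model are simplifying assumptions, not taken from Dekel et al. [2012]):

```python
import math

def static_sync_messages(m: int, k: int) -> int:
    """Model messages for static synchronization with batch size b ~ m^(1/3).

    Every batch triggers one synchronization in which each of the k nodes
    sends its model to the coordinator and receives the mean model back
    (2k messages per synchronization).
    """
    b = round(m ** (1 / 3))      # mini-batch size on the order of the cube root of m
    syncs = m // b               # synchronizations over the system lifetime
    return 2 * k * syncs
```

For m = 10^6 and k = 8 this gives 160,000 messages, so the per-data-point communication vanishes as the lifetime grows.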
                                                                          While a fixed periodic communication schedule reduces
                                                                       the communication by some fixed amount, further reduction
                                                                       is desirable: The above mentioned costs of communication
                                                                       can have a severe impact on the practical performance—even
                                                                       if they are not reflected in asymptotic performance bounds.
                                                                       This is further amplified because a large number of modeling



tasks are performed simultaneously sharing the same limited              Ei = {di , . . . , di+1 − 1} between any two drift points. We
bandwidth. Moreover, distributed learning systems that are               assume that all learners sample from D independently in
deployed for a long lifetime relative to their data through-             parallel using a constant and uniform sampling frequency,
put can experience periodical or singular target drifts (e.g.,           and we denote by (xt,l , yt,l ) ∼ Dt the training example
corresponding to micro-trends in social networks). In these              received at node l at time t. Generally, we assume that all
settings, a static schedule is bound to either provide only lit-         training examples are bounded by a ball with radius R.
tle to no communication reduction or to insufficiently react                Conceptually, every learner first observes the input part
to changing data distributions.                                          xt,l and performs a real time service based on the linear
                                                                         prediction score pt,l = ⟨wt,l , xt,l ⟩, i.e., the inner prod-
1.2    Contributions and Outline                                         uct of xt,l and the learner’s current model vector. Only
   In this work, we give the first distributed prediction proto-         then it receives as feedback the true label yt,l , which it can
col for linear models that, at the same time, aims to provide            use to locally update its model to wt+1,l = ϕ(wt,l , xt,l , yt,l )
a high online in-place prediction performance and explic-                by some update rule ϕ : Rn × X × Y → Rn . Finally,
itly tries to minimize communication. In terms of predictive             the learners are connected by a communication infrastruc-
power, as shown in Sec. 3.1, the protocol retains the asymp-                ture that allows them to jointly perform a synchroniza-
totic optimal regret of the distributed mini-batch algorithm             tion operation σ : Rk×n → Rk×n that resets the whole
of Dekel et al. [2012] for stationary data. In addition, it              model configuration to a new state and that may take into
allows us to reduce the communication among the local nodes                 account the information of all local learners simultaneously.
substantially. This is achieved by a dynamic data depen-                 The performance of such a distributed online prediction sys-
dent communication schedule, which, in contrast to previ-                tem is measured by two quantities: 1) the predictive perfor-
ous algorithms, remains applicable when the data is non-                 mance ∑t∈[T ] ∑l∈[k] f (pt,l , yt,l ) measured by a loss function
stationary and shows rapid concept drifts. The main idea is              f : R × Y → R+ that assigns positive penalties to predic-
to synchronize the local models to their mean model in order             tion scores; and 2) the amount of communication within
to reduce their variance, but to do so only in system states             the system that is measured by the number of bits sent in-
that show a high divergence among the models. This diver-                between learners to compute the sync operation. Next, we spec-
gence, measured by the average model distance to the mean                ify possible choices for the update rule, the loss function, and
model, indicates the synchronizations that are most impor-               the synchronization operator.
tant in terms of their correcting effects on the predictions. In
stable phases this allows communicative quiescence, while,
in hard phases where variance reduction is crucial, the pro-             2.2    Losses and Gradient Descent
tocol will trigger a lot of model synchronizations. In order                Generally, the communication protocol developed in this
to efficiently implement this strategy one has to monitor the            paper is applicable to a wide range of online update rules for
non-linear divergence function without communication over-               linear models from, e.g., the passive aggressive rule [Cram-
head. We propose a solution to this problem that adapts re-              mer and Singer, 2001] to regularized dual averaging [Xiao,
cent ideas from distributed systems research based on local              2010]. However, the regret bound given in Theorem 2 as-
safe-zones in the function domain (Sec. 3.2). Experiments                sumes that the updates are contractions. That is, there
confirm the beneficial properties of the protocol (Sec. 4).              is some constant c < 1 such that for all w, w0 ∈ Rn , and
                                                                         x, y ∈ X × Y it holds that kϕ(w, x, y) − ϕ(w0 , x, y)k ≤
2.    PRELIMINARIES                                                      ckw − w0 k. For the sake of simplicity, in this paper, we focus
                                                                         on rules based on l2 -regularized stochastic gradient descent,
   In this section we formally introduce the distributed on-
                                                                         for which this contraction property is readily available. We
line prediction task. As simple local learning tool we recall
                                                                         note that by considering expected contractions the result can
stochastic gradient descent for linear models. Finally, we
                                                                         be extended to rules that reduce on average the distance to
review the state-of-the-art communication protocol as a de-
                                                                         a (regularized) loss minimizer.
parture point for developing a more communication-efficient
                                                                            Before we can define gradient descent updates, we have to
solution in subsequent sections.
                                                                         introduce the underlying loss functions measuring predictive
2.1    Distributed Online Prediction                                     performance. Again for convenience, we restrict ourselves to
                                                                         functions that are differentiable, convex, and globally Lips-
   Throughout this paper we consider a distributed online
                                                                         chitz continuous in the prediction score, i.e., there is some
prediction system of k local learners that maintain individ-
                                                                         constant L such that for all p, p′ ∈ R and y ∈ Y it holds that
ual linear models wt,1 , . . . , wt,k ∈ Rn of some global envi-
                                                                         |f (p, y) − f (p0 , y)| ≤ L|p − p0 |. While these assumptions can
ronment through discrete time t ∈ [T ] where T ∈ N denotes
                                                                         be relaxed by spending some technical effort, they already
the total time horizon with respect to which we analyze the
                                                                         include loss functions for all standard predictions tasks such
system’s performance. This environment is represented by
                                                                         as the logistic loss flg (p, y) = ln(1 + exp(−yp)) for binary
a target distribution Dt : X × Y → [0, 1] that describes
                                                                         classification (case Y = {−1, 1}) or the Huber loss for re-
the relation between an input space X ⊆ Rn and an output
                                                                         gression (in the case Y = R)
space Y ⊆ R. The nature of Y varies with the learning task
at hand; Y = {−1, 1} is used for binary classification, Y = R
for regression. While we allow Dt to vary with time, we as-                          fhu (p, y) = (1/2)(p − y)² ,   for |p − y| ≤ 1,
sume that it remains constant most of the time and only                              fhu (p, y) = |p − y| − 1/2 ,   otherwise.
experiences a small number of rapid drifts. That is, there
are drift points 0 = d0 < d1 < · · · < dp = T such that for
all i ∈ [p] and t, t0 ∈ [T ] with di−1 ≤ t ≤ t0 < di it holds that       See, e.g., Zhang [2004] for further possible choices. In both
Dt = Dt0 . Hence, there are identically distributed episodes             of these cases the (best) Lipschitz constant is L = 1.


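The predict-then-update step of Sec. 2.1 together with the l2-regularized SGD rule of Sec. 2.2 can be sketched as follows (a minimal sketch; the learning rate eta and regularization lam are illustrative constants chosen to satisfy η ≤ (RL + λ)⁻¹ on this data, not values prescribed by the paper):

```python
import math

def predict(w, x):
    """Linear prediction score p = <w, x>."""
    return sum(wi * xi for wi, xi in zip(w, x))

def logistic_loss(p, y):
    """f_lg(p, y) = ln(1 + exp(-y * p)); 1-Lipschitz in the score p."""
    return math.log1p(math.exp(-y * p))

def sgd_update(w, x, y, eta=0.1, lam=0.1):
    """phi(w, x, y) = w - eta * grad_w((lam/2) * ||w||^2 + f(<w, x>, y)).

    For the logistic loss the score gradient is df/dp = -y / (1 + exp(y * p)).
    """
    p = predict(w, x)
    dloss_dp = -y / (1.0 + math.exp(y * p))
    return [wi - eta * (lam * wi + dloss_dp * xi) for wi, xi in zip(w, x)]

# With a constant learning rate eta <= 1/(R*L + lam), one update contracts
# two model vectors toward each other by at least the factor c = 1 - eta*lam.
w1, w2 = [0.0, 0.0], [1.0, -1.0]
x, y = [0.5, 0.5], 1
u1, u2 = sgd_update(w1, x, y), sgd_update(w2, x, y)
```

On this particular example both prediction scores happen to be 0, so the distance between the two models shrinks exactly by the factor 1 − ηλ = 0.99.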
Algorithm 1 Static Synchronization Protocol                                 protocol similar to Alg. 1. Here, after a batch of kb examples
Initialization:                                                             has been processed globally in the system, all local models
                                                                            are re-set to the mean model w̄ of the configuration w, de-
  local models w1,1 , . . . , w1,k ← (0, . . . , 0)
                                                                            fined as w̄ = (1/k) ∑l∈[k] wl . Formally, the synchronization
Round t at node l:                                                          operator that is implicitly employed in these algorithms is
  observe xt,l and provide service based on pt,l                            given by σ(wt ) = (w̄t , . . . , w̄t ). We refer to this operation
  observe yt,l and update wt+1,l ← ϕ(wt,l , xt,l , yt,l )                       have been learned in parallel on independent training data
  if t mod b = 0 then                                                       model mixture is often used for combining linear models that
     send wt,l to coordinator                                               have been learned in parallel on independent training data
                                                                            (see Mann et al. [2009], McDonald et al. [2010], Zinkevich
At coordinator every b rounds:
                                                                            et al. [2010]). The motivation is that the mean of k models
  receive local models {wt,l : l ∈ [k]}
                                                                            provides a variance reduction of √k over an individual ran-
  send wt,1 , . . . , wt,k ← (1/k) ∑l∈[k] wt,l
                                                                            dom model (recall that all learners sample from the same
                                                                            distribution, hence their models are identically distributed).
                                                                            Dekel et al. [2012] shows that when the gradient variance is
  With this we can define stochastic gradient descent                       bounded then the optimal regret can be asymptotically re-
(SGD) rules with l2 -regularization, i.e., rules of the form                tained by setting b = O(∛Ti ) even if a constant number of
                                                                            examples have to be discarded during each synchronization
     ϕ(w, x, y) = w − ηt ∇w ((λ/2)‖w‖² + f (⟨w, x⟩, y))                     due to network latency. Note that this reference considers
                                                                            a slightly modified algorithm based on delayed gradient de-
where λ ∈ R+ is a strictly positive regularization pa-                      scent, which only applies (accumulated) updates at synchro-
rameter and ηt ∈ R+ are strictly positive learning rates                    nization points. However, the expected loss of eager updates
for t ∈ N. For stationary target distributions, one often                 (as used in Alg. 1) is bounded by the expected loss of de-
chooses a decreasing learning rate such as ηt = 1/√t in or-                 layed updates (as used in Dekel et al. [2012]) as long as the
der to guarantee convergence of the learning process. For                   updates reduce the distance to a loss minimizer on average
non-stationary targets this is infeasible, because for large t              (which is the case for sufficiently small learning rates and
it would prevent sufficient model adaption to target changes.               regularization parameters; see again Zhang [2004, Eq. 5]).
However, one can show [Zinkevich et al., 2010] that stochas-                   Closing this section, let us analyze the communication
tic gradient descent is a contraction for sufficiently small                cost of this protocol. Using a designated coordinator node as
constant learning rates. Namely, for η ≤ (RL + λ)−1 the                     in Alg. 1, σ can be computed simply by all nodes sending their
updates do contract with constant c = 1 − ηλ. This can be                   current model to the coordinator, who in turn computes the
used to show that the stochastic learning process converges                 mean model and sends it back to all the nodes. For assessing
to a distribution centered close to a regularized loss mini-                the communication cost of this operation, we only count the
mizer even when the process is distributed among k nodes                    number of model vectors sent between the learners. This is
(see the analysis of Zinkevich et al. [2010]). This refers to               feasible because, independently of the exact communication
the stochastic learning process defined by the mean of inde-                infrastructure, the number of model messages asymptoti-
pendent local models that result from SGD with iid samples                  cally determines the true bit-based cost. Hence, asymptot-
from (episodes of) the target distribution. In this paper, the              ically the communication cost of static model synchro-
contraction property is used for the regret bound of Thm. 2.                nization over k nodes with batch size b is O(kT /b). Dekel
                                                                            et al. [2012] assumes that the data distribution is station-
2.3     Communication and Mini-batches                                      ary over all rounds and b can therefore be set to O(∛T ).
   For every episode Ei , the predictive performance of a dis-              This results in an automatic communication reduction that
tributed prediction system lies between two baselines that                  increases with a longer system lifetime. However, this strat-
correspond to the two extremes in terms of communication                    egy is not applicable when we want to stay adaptive towards
behavior—complete centralization and no communication.                      changing data distributions. In this case, we have to set the
Let Ti = |Ei | denote the length of episode Ei and by R =                   not with respect to the overall system lifetime. This num-
∑t∈Ei ,l∈[k] (f (pt,l , yt,l ) − f ∗ ) the regret with respect to the op-       ber can be much smaller than T resulting in batch sizes that
                                                                            are too small to meet our communication reduction goal. In
timal expected loss f ∗ = minw∈Rn E(x,y)∼Di [f (⟨w, x⟩, y)].              the following section, we therefore design a synchronization
When all data points are centrally processed by one online                  are too small to meet our communication reduction goal. In
learner, for long enough episodes one can achieve an ex-          data-dependent dynamic schedule.
pected regret of O(√(kTi )) which is optimal (see Cesa-Bianchi
and Lugosi [2006] and Abernethy et al. [2009]). In contrast,
when the k nodes perform their learning processes in paral-
lel without any communication this results in an expected
regret of O(k√Ti ), which is worse than the centralized per-       3.   DYNAMIC SYNCHRONIZATION
formance by a factor of √k. Therefore, we are interested                        The synchronization protocol of Alg. 1 is static because it
in algorithms that lie between these two extremes and that                  synchronizes after a fixed number of rounds independently
show a beneficial trade-off between predictive performance                  of the sampled data and its effect on the local models. Con-
and the amount of communication.                                               sequently, it incurs the communication cost of a full synchro-
   Mann et al. [2009] and Dekel et al. [2012] give algorithms               thus receive little to no correction. In this section,
where information between nodes is only exchanged every                     thus only receive little to none correction. In this section,
b rounds where b ∈ N is referred to as batch size. These                    we develop a dynamic protocol for synchronizations based
algorithms can be written as static model synchronization                   on quantifying their effect. After showing that this approach


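Algorithm 1 can be simulated serially as follows (a sketch under simplifying assumptions: rounds are globally synchronized, communication is counted as 2k model messages per synchronization as in the text, and the local update rule ϕ is passed in as a parameter):

```python
def full_mean_sync(models):
    """sigma(w) = (mean, ..., mean): reset every local model to the mean model."""
    k, n = len(models), len(models[0])
    mean = [sum(m[j] for m in models) / k for j in range(n)]
    return [list(mean) for _ in range(k)]

def run_static_protocol(streams, update, b):
    """Run k learners round by round, synchronizing every b rounds.

    streams[l] is the sequence of (x, y) examples arriving at node l and
    update(w, x, y) is the local update rule phi. Returns the final model
    configuration and the number of model messages sent.
    """
    k, n = len(streams), len(streams[0][0][0])
    models = [[0.0] * n for _ in range(k)]
    messages = 0
    for t, batch in enumerate(zip(*streams), start=1):
        models = [update(w, x, y) for w, (x, y) in zip(models, batch)]
        if t % b == 0:
            messages += 2 * k        # k models to the coordinator, k means back
            models = full_mean_sync(models)
    return models, messages
```

With T = 6 rounds and b = 3 this performs two synchronizations, i.e., 2 · 2k messages, matching the O(kT /b) communication cost stated in the text.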
is sound from a learning perspective, we discuss how it can                     Algorithm 2 Dynamic Synchronization Protocol
be implemented in a communication-efficient way.                                Initialization:
3.1     Partial Synchronizations                                                  local models w1,1 , . . . , w1,k ← (0, . . . , 0)
                                                                                  reference point r ← (0, . . . , 0)
   A simple measure to quantify the correcting effect of syn-
                                                                                  violation counter v ← 0
chronizations is given by the average Euclidean distance be-
tween the current local models and the result model. We                         Round t at node l:
refer to this quantity as the divergence of a model con-                      observe xt,l and provide service based on pt,l
figuration, denoted by δ(·), i.e., δ(w) = (1/k) ∑l∈[k] ‖w̄ − wl ‖₂ .                    observe yt,l and update wt+1,l ← ϕ(wt,l , xt,l , yt,l )
In the following definition we provide a relaxation of the                        if t mod b = 0 and ‖r − wt,l ‖ > ∆/2 then
full mean synchronization operation that introduces some                             send wt,l to coordinator
leeway in terms of this divergence.
                                                                                At coordinator on violation:
  Definition 1. A partial synchronization operator with                           let B be set of nodes with violation
a positive divergence threshold ∆ ∈ R+ is an operator σ∆ :
Rk×n → Rk×n that 1) leaves the mean model invariant and                       if v = k then B ← [k], v ← 0
2) after its application the model divergence is bounded by                       while B ≠ [k] and ‖r − (1/|B|) ∑l∈B wl ‖ > ∆ do
∆. That is, for all model configurations w ∈ Rk×n it holds                           augment B by augmentation strategy
that σ∆ w has the same mean model w̄ as w and δ(σ∆ w) ≤ ∆.                                       receive models from nodes added to B
                                                                                   send to nodes in B the model w̄ = (1/|B|) ∑l∈B wl
An operator adhering to this definition does not generally
                                                                                   if B = [k] also set new reference model r ← w̄
put all nodes into sync (despite the fact that we still refer
to it as a synchronization operator). In particular it may
leave all models untouched as long as the divergence re-
mains below the threshold ∆. The following theorem notes                        that transfers the system back into a valid state whenever
that partial synchronization has a controlled regret over full                  one or more of these local conditions are violated. This in-
synchronization if the batch size is sufficiently large and the                 cludes carrying out a sufficient amount of synchronization
divergence threshold is set proportional to the Lipschitz con-                  to reduce the divergence to be less or equal than ∆.
stant L of the losses and the data radius R.                                       For deriving local conditions we consider the domain of
                                                                                the divergence function restricted to an individual model
   Theorem 2. Suppose the update rule ϕ is a contraction                        vector. Here, we identify a safe-zone S (see Keren et al.
with constant c. Then, for batch sizes b ≥ (log2 c−1 )−1 and                    [2012]) such that the global divergence cannot cross the
divergence thresholds ∆ ≤ ε/(2RL), the average regret of                        ∆-threshold as long as all local models remain in S.1 The
using a partial synchronization operator σ∆ instead of σ is                     following statement, which we give again without proof, pro-
bounded by ε, i.e., for all rounds t ∈ N it holds that the                      vides a valid spherical safe zone Sr that is centered around
average regret (1/k) ∑l∈[k] |f (p∆t,l , yt,l ) − f (pt,l , yt,l )| is bounded       some global reference point r.
by ε, where pt,l and p∆t,l denote the prediction scores at learner
l and time t resulting from σ and σ∆ , respectively.                              Theorem 3. Let r ∈ Rn be some reference point. If for
                                                                                all nodes l ∈ {1, . . . , k} it holds that kr − wl k ≤ ∆/2 then
We omit the proof here, referring to the full version of this                    we have for the model divergence that δ(w) ≤ ∆.
paper. While the contraction assumption is readily avail-
able for regularized SGD, as mentioned in Sec. 2, it can be                        We now incorporate these local conditions into a distributed
relaxed: by requiring the updates to only contract on expec-                    prediction protocol. As a first step, we have to guarantee
tation it is possible to extend the theorem to unregularized                    that at all times all nodes use the same reference point. For
SGD updates as well as to other rules. Moreover, we remark                      a prediction t, let us denote by t0 the last time prior to
that Thm. 2 implies that partial synchronizations retain the                    t when a full model synchronization was performed (resp.
optimality of the static mini-batch algorithm of Dekel et al.                   t0 = 0 in case no full synchronization has happened un-
[2012] for the case of stationary targets: By using a time-                    til round t). The mean model wt0 is known to all local
dependent divergence threshold εt ∈ O(1/√t) the                     learners. We use this model as the reference model and set
bound of O(√T ) follows.                                                        r = wt0 . Each local learner l can then monitor its local
                                                                                 condition ‖r − wl ‖ ≤ ∆/2 in a decentralized manner.
3.2     Communication-efficient Protocol                                           It remains to design a resolution protocol that specifies
   After seeing that partial synchronization operators are                      how to react when one or several of the local conditions are
sound from the learning perspective, we now turn to how                         violated. A direct solution is to trigger a full synchroniza-
they can be implemented in a communication-efficient way.                       tion in that case. This approach, however, does not scale
Every distributed learning protocol that implements a par-                      well with a high number of nodes in cases where model up-
tial synchronization operator has to implicitly control the                     dates have a non-zero probability even in the asymptotic
divergence of the model configuration. However, we cannot                       regime of the learning process. When, e.g., PAC models for
simply compute the divergence by centralizing all local mod-                    the current target distribution are present at all local nodes,
els, because this would incur just as much communication                        the probability of one local violation, albeit very low for an
as static full synchronization. Our strategy to overcome this                   individual node, increases exponentially with the number of
problem is to first decompose the global condition δ(w) ≤ ∆                     nodes. An alternative approach that can keep the amount
into a set of local conditions that can be monitored at their                   1
                                                                                 Note that a direct distribution of the threshold across the
respective nodes without communication (see, e.g., Sharf-                       local nodes (as in, e.g., Keralapura et al. [2006]) is in-
man et al. [2007]). Secondly, we define a resolution protocol                   feasible, because the divergence function is non-linear.
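The intuition behind the decomposition can be illustrated with a minimal sketch (all names are ours, and the maximum pairwise distance shown here is an illustrative stand-in for the paper's divergence function δ): if every node keeps its model within ∆/2 of the shared reference point r, then by the triangle inequality any two local models are at most ∆ apart, so the divergence of the configuration stays bounded without any communication.

```python
import numpy as np

def local_condition_ok(w_local, r, delta):
    """Local condition, checked at each node without communication."""
    return np.linalg.norm(r - w_local) <= delta / 2

def max_pairwise_divergence(models):
    """Illustrative global divergence: largest pairwise model distance."""
    return max(np.linalg.norm(a - b) for a in models for b in models)

rng = np.random.default_rng(0)
delta = 1.0
r = rng.normal(size=10)  # shared reference point (last synchronized mean model)
# Simulate small local SGD drift that respects the local conditions.
models = [r + 0.01 * rng.uniform(-1, 1, size=10) for _ in range(8)]

assert all(local_condition_ok(w, r, delta) for w in models)
# Triangle inequality: ||wi - wj|| <= ||wi - r|| + ||r - wj|| <= delta
assert max_pairwise_divergence(models) <= delta
```

As long as all local conditions hold, no messages need to be exchanged; communication is only triggered on a violation.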


[Figure 1 plots: prediction error (y-axis) versus number of messages (x-axis) for no synchronization, static synchronization with varying batch sizes, and dynamic synchronization with varying divergence thresholds.]
Figure 1: Performance of static and dynamic model synchronization when tracking (left) a rapidly drifting disjunction over 100-dimensional data with 512 nodes; and (right) a neural network with one hidden layer and 150 output variables, with 1024 nodes.


of communication low relative to the number of nodes is to perform a local balancing procedure: on a violation, the respective node sends its model to a designated node we refer to as the coordinator. The coordinator then tries to balance this violation by incrementally querying other nodes for their models. If the mean of all received models lies within the safe zone, it is transferred back as the new model to all participating nodes, and the resolution is finished. If all nodes have been queried, the result is equal to a full synchronization and the reference point can be updated. In both cases, the divergence of the model configuration is bounded by ∆ at the end of the balancing process, because all local conditions hold. This protocol also leaves the global mean model unchanged. Hence, it complies with Def. 1.

While balancing can achieve a high communication reduction over direct resolution, particularly for a large number of nodes, it potentially degenerates in certain special situations: we can end up in a stable regime in which local violations are likely to be balanced by a subset of the nodes, although a full synchronization would strongly reduce the expected number of violations in future rounds. In other words: balancing can delay crucial reference point updates indefinitely. A simple hedging mechanism for online optimization can be employed to avoid this situation: we count the number of local violations using the current reference point and trigger a full synchronization whenever this number exceeds the number of nodes. This concludes our dynamic protocol for distributed prediction. All components are summarized in Alg. 2.

4.   EMPIRICAL EVALUATION

In this section we investigate the practical performance of the dynamic learning protocol in two controlled settings: one with linearly separable data and one with non-separable data. Our main goal is to empirically confirm that the predictive gain of static full synchronization (using a batch size of 8) over no synchronization can be approximately preserved for small enough thresholds, and to assess the amount of communication reduction achieved by these thresholds.

We start with the problem of tracking a rapidly drifting random disjunction. In this case the target distribution produces data that is episode-wise linearly separable. Hence, we can set up the individual learning processes so that they converge to a linear model with zero classification error within each episode. Formally, we identify a target disjunction with a binary vector z ∈ {0, 1}^n. A data point x ∈ X = {0, 1}^n is labeled positively, y = 1, if ⟨x, z⟩ ≥ 1 and otherwise receives a negative label, y = −1. The target disjunction is drawn randomly at the beginning of the learning process and is randomly re-set after each round with a fixed drift probability of 0.0002. In order to have balanced classes, the disjunctions as well as the data points are generated such that each coordinate is set independently to 1 with probability √(1 − 2^{−1/n}). As loss function for the stochastic gradient descent we use the logistic loss. Corresponding to our setting of noise-free linearly separable data, we choose the regularization parameter λ = 0 and the learning rate η = 1.

In Fig. 1 (left) we present the results for dimensionality n = 100, with k = 512 nodes, processing m = 12.8M data points through T = 25000 rounds. For divergence thresholds up to 3.0, dynamic synchronization can retain the error of statically synchronizing every 8 rounds. At the same time, the communication is reduced to 3.9% of the original number of messages. An approximately similar communication reduction can also be achieved using static synchronization by increasing the batch size to 128. This approach, however, retains only 51.5% of the error reduction over no communication. Analyzing the development of the evaluation metrics over time reveals the following: at the beginning of each episode there is a relatively short phase in which additional errors are accumulated and the communicating protocols acquire an advantage over the baseline of never synchronizing. This is followed by a phase during which no additional error is made. Here, the communication curves of the dynamic protocols remain constant, acquiring a gain over the static protocols in terms of communication.

We now turn to a harder experimental setting, in which the target distribution is given by a rapidly drifting two-layer neural network. For this target even the Bayes optimal classifier per episode has a non-zero error, and, in particular, the generated data is not linearly separable. Intuitively, it is harder in this setting to save communication, because a non-zero residual error can cause the linear models to periodically fluctuate around a local loss minimizer, resulting in crossings of the divergence threshold even when the learning processes have reached their asymptotic regime.
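The drifting-disjunction generator described above can be sketched as follows (an illustrative sketch under our reading of the setup; the function names are ours). With both z and x drawn coordinate-wise from Ber(p) with p = √(1 − 2^{−1/n}), the probability of a positive label is 1 − (1 − p²)^n = 1 − (2^{−1/n})^n = 1/2, so the classes are balanced by construction:

```python
import numpy as np

def sample_disjunction_data(n, rng):
    """Draw a random disjunction z and a labeled data point (x, y)."""
    p = np.sqrt(1.0 - 2.0 ** (-1.0 / n))   # per-coordinate Bernoulli parameter
    z = (rng.random(n) < p).astype(int)     # target disjunction
    x = (rng.random(n) < p).astype(int)     # data point
    y = 1 if x @ z >= 1 else -1             # positive iff any literal fires
    return x, y, z

rng = np.random.default_rng(42)
n, m = 100, 20000
labels = np.array([sample_disjunction_data(n, rng)[1] for _ in range(m)])
pos_rate = np.mean(labels == 1)
# P(y = 1) = 1 - (1 - p^2)^n = 1/2 by construction, up to sampling noise.
assert abs(pos_rate - 0.5) < 0.02
```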
We choose the network structure and parameter ranges in a way that allows for a relatively good approximation by linear models (see Bshouty and Long [2012]). The process for generating a single labeled data point is as follows: first, the label y ∈ Y = {−1, 1} is drawn uniformly from Y. Then, values are determined for hidden variables H_i with 1 ≤ i ≤ ⌈log n⌉ based on a Bernoulli distribution P[H_i = · | Y = y] = Ber(p^h_{i,y}). Finally, x ∈ X = {−1, 1}^n is determined by drawing x_i for 1 ≤ i ≤ n according to P[X_i = x_i | H_{p(i)} = h] = Ber(p^o_{i,h}), where p(i) denotes the unique hidden layer parent of x_i. In order to ensure linear approximability, the parameters of the output layer are drawn such that |p^o_{i,−1} − p^o_{i,1}| ≥ 0.9, i.e., the hidden values have a high relevance in determining the observed values. As in the disjunction case, all parameters are re-set randomly after each round with a fixed drift probability (here, 0.005). For this non-separable setting we again choose to optimize the logistic loss, this time with parameters λ = 0.5 and η = 0.05, respectively. Also, in order to increase the stability of the learning process, we apply averaged updates over mini-batches of size 8.

Figure 1 (right) contains the results for dimensionality 150, with k = 1024 nodes, processing m = 2.56M data points through T = 2500 rounds. For divergence thresholds up to 0.05, dynamic synchronization can retain the error of the baseline. At the same time, the communication is reduced to 46% of the original number of messages.

5.   CONCLUSION

We presented a protocol for distributed online prediction that aims to dynamically save on network communication in sufficiently easy phases of the modeling task. The protocol has a controlled predictive regret over its static counterpart, and experiments show that it can indeed reduce the communication substantially, by up to 95%, in settings where the linear learning processes are suitable to model the data well and converge reasonably fast. Generally, the effectiveness of the approach appears to correspond to the effectiveness of linear modeling by SGD in the given setting.

For future research, a theoretical characterization of this behavior is desirable. A practically even more important direction is to extend the approach to other model classes that can tackle a wider range of learning problems. In principle, the approach of controlling model divergence remains applicable as long as the divergence is measured with respect to a distance function that induces a useful loss bound between two models. For probabilistic models this can, for instance, be the KL-divergence. However, more complex distance functions constitute more challenging distributed monitoring tasks, which currently are open problems.

References

Jacob Abernethy, Alekh Agarwal, Peter L. Bartlett, and Alexander Rakhlin. A stochastic view of optimal regret through minimax duality. In COLT 2009 - The 22nd Conference on Learning Theory, 2009.

Maria-Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. Distributed learning, communication complexity and privacy. Journal of Machine Learning Research - Proceedings Track, 23:26.1–26.22, 2012.

Nader H. Bshouty and Philip M. Long. Linear classifiers are nearly optimal when hidden variables have diverse effects. Machine Learning, 86(2):209–231, 2012.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. ISBN 978-0-521-84108-5.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.

Hal Daumé III, Jeff M. Phillips, Avishek Saha, and Suresh Venkatasubramanian. Efficient protocols for distributed classification and optimization. In ALT 2012.

Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.

Daniel Hsu, Nikos Karampatziakis, John Langford, and Alexander J. Smola. Parallel online learning. In Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press.

Ram Keralapura, Graham Cormode, and Jeyashankher Ramamirtham. Communication-efficient distributed monitoring of thresholded counts. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 2006), pages 289–300, 2006.

Daniel Keren, Izchak Sharfman, Assaf Schuster, and Avishay Livne. Shape sensitive geometric monitoring. Knowledge and Data Engineering, IEEE Transactions on, 24(8):1520–1535, 2012.

G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems (NIPS 2009), volume 22, pages 1231–1239, 2009.

Ryan T. McDonald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: Conf. of the North American Chapter of the Association of Computational Linguistics, Proceedings (HLT-NAACL), pages 456–464, 2010.

XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Decentralized detection and classification using kernel methods. In Proceedings of the 21st international conference on Machine learning (ICML 2004), page 80. ACM, 2004.

Joel B. Predd, S. B. Kulkarni, and H. Vincent Poor. Distributed learning in wireless sensor networks. Signal Processing Magazine, IEEE, 23(4):56–69, 2006.

Izchak Sharfman, Assaf Schuster, and Daniel Keren. A geometric approach to monitoring threshold functions over distributed data streams. ACM Trans. Database Syst., 32(4), 2007.

Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. The Journal of Machine Learning Research, 11:2543–2596, 2010.

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st int. conf. on Machine learning (ICML 2004), 2004.

Martin Zinkevich, Alex J. Smola, and John Langford. Slow learners are fast. In Proc. of 23rd Annual Conference on Neural Information Processing Systems (NIPS 2009), pages 2331–2339, 2009.

Martin Zinkevich, Markus Weimer, Alexander J. Smola, and Lihong Li. Parallelized stochastic gradient descent. In Proc. of 24th Annual Conference on Neural Information Processing Systems (NIPS 2010), pages 2595–2603, 2010.

