<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploiting data distribution in distributed learning of deep classification models under the parameter server architecture.</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>National Technical University of Athens</institution>
          , Athens,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>5</lpage>
      <abstract>
        <p>Nowadays, deep learning models are used in a wide range of applications, including classification and recognition tasks. The constant growth in data size has led to the use of more complex model architectures for neural network classifiers. Both model complexity and data volume usually prohibit training on a single machine, due to time and memory limitations. Thus, distributed learning setups have been proposed to train deep networks when a vast amount of data is available. One common setup follows the parameter server approach, where worker tasks compute gradients to update the network stored in the servers, often in a synchronization-free manner. However, the lack of synchronization may harm the quality of the resulting model due to the effect of stale gradients, which are computed based on older model versions. In this PhD research, we aim to explore how asynchronous learning could benefit from data preprocessing tasks that reveal hidden traits of the data distribution.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        In recent years, deep learning has become widely used in a variety of applications. For example, in the image processing domain, neural networks are used in classification [<xref ref-type="bibr" rid="ref20">20</xref>] and tagging [<xref ref-type="bibr" rid="ref32">32</xref>] applications. Deep models are also widely used in other domains, such as speech recognition [<xref ref-type="bibr" rid="ref13">13</xref>] or text classification [<xref ref-type="bibr" rid="ref21">21</xref>].
      </p>
      <p>
        All the aforementioned applications are in fact classification tasks. To create accurate classifiers, a large amount of data is needed. As the volume of available data increases, more complex models are used to represent the patterns implied by the data. Common examples of such complex model architectures are ResNet [<xref ref-type="bibr" rid="ref14">14</xref>] and Inception [<xref ref-type="bibr" rid="ref29">29</xref>], used on the ImageNet [<xref ref-type="bibr" rid="ref9">9</xref>] dataset.
      </p>
      <p>
        Both the increase in model complexity and the amount of available data could prohibit model training on a single machine, since it would take numerous hours or days to create a generic and reliable model. For instance, without the use of accelerators such as GPUs, Stanford’s DAWNBench [<xref ref-type="bibr" rid="ref6">6</xref>] reports more than 10 days to train a ResNet-152 model on the ImageNet dataset [<xref ref-type="bibr" rid="ref2">2</xref>]. Thus, distributed architectures have been proposed to speed up the training process. Depending on what is distributed, multiple solutions could be adopted. The most common approach, especially when it comes to big data, is to distribute the available data to various workers, following data parallel learning architectures [<xref ref-type="bibr" rid="ref8">8</xref>]. The most prominent setups of this kind are the parameter server [<xref ref-type="bibr" rid="ref30">30</xref>] and the all-reduce [<xref ref-type="bibr" rid="ref7">7</xref>] architectures. Widely used deep learning systems, such as Google TensorFlow [<xref ref-type="bibr" rid="ref3">3</xref>], Apache MXNet [<xref ref-type="bibr" rid="ref5">5</xref>] and PyTorch [<xref ref-type="bibr" rid="ref25">25</xref>], have adopted distributed learning following one or more of the aforementioned architectures.
      </p>
      <p>
        While distributed learning enables faster deep neural network training, each distributed approach introduces new issues that might harm either the training speed or the quality of the resulting model. For example, all-reduce approaches introduce synchronization overheads on each training step. In the parameter server architecture, on the other hand, such overheads are avoided, as workers usually operate in an asynchronous manner [<xref ref-type="bibr" rid="ref18">18</xref>]. As a result, stale gradient effects may occur, which could either delay model convergence or cause the model’s training loss to diverge [<xref ref-type="bibr" rid="ref23">23</xref>]. However, we believe that such phenomena could be overcome if the data are used wisely.
      </p>
      <p>In this PhD research, we focus on studying how the distributed deep learning process could benefit from the distribution of the training data, especially under the parameter server architecture. Data preprocessing techniques can be used to obtain a priori knowledge of the data domain, which could be beneficial in the training process. We aim to study and propose systematic ways of assigning the data to the available training worker nodes. Furthermore, we will study whether random or algorithmic access patterns on the data are preferable during the training process, focusing on the distributed case. With such techniques, we aim to make training less sensitive to the undesirable effects that appear in asynchronous distributed learning setups.</p>
      <p>The rest of this paper is organized as follows. Section 2 presents the background knowledge necessary to understand the paper. Section 3 follows with a discussion on how prior knowledge of the data distribution could benefit the learning process, especially under a parameter server setup. Finally, the paper concludes with Section 4, which outlines the steps that will be followed to complete this research.</p>
    </sec>
    <sec id="sec-2">
      <title>BACKGROUND</title>
    </sec>
    <sec id="sec-3">
      <title>Optimization Related Preliminaries</title>
      <p>
        In the context of classification problems [<xref ref-type="bibr" rid="ref12">12</xref>], a neural network with weights represented by the vector <inline-formula><tex-math>\vec{w}</tex-math></inline-formula> is considered to approximate a function <inline-formula><tex-math>f: \mathbb{R}^{n} \to \mathbb{R}</tex-math></inline-formula> that, given an input feature vector <inline-formula><tex-math>\vec{x}</tex-math></inline-formula>, could be used to classify it to a category <inline-formula><tex-math>y</tex-math></inline-formula>, i.e. <inline-formula><tex-math>y = f(\vec{x}; \vec{w})</tex-math></inline-formula>. Given a set of feature vectors <inline-formula><tex-math>\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_m</tex-math></inline-formula> and their corresponding category labels <inline-formula><tex-math>y_1, y_2, \ldots, y_m</tex-math></inline-formula>, the function <inline-formula><tex-math>f</tex-math></inline-formula> used to describe the neural network can be identified by using the appropriate vector <inline-formula><tex-math>\vec{w}</tex-math></inline-formula>, derived as the solution of an optimization problem on a loss function <inline-formula><tex-math>L</tex-math></inline-formula>. Gradient Descent is considered to be among the most popular algorithms used in optimization problems [<xref ref-type="bibr" rid="ref27">27</xref>]. However, considering both the size of deep neural networks and the vast amount of data usually available, Gradient Descent is not preferred, since each iteration would be too slow. Mini-Batch Stochastic Gradient Descent (mini-batch SGD) is used instead, which uses only a subset of the training examples in each training iteration.
      </p>
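      <p>For concreteness, the optimization problem and the update rule described above can be written as follows. The notation is an assumption introduced here for illustration: <inline-formula><tex-math>\vec{w}</tex-math></inline-formula> denotes the weights, <inline-formula><tex-math>(\vec{x}_i, y_i)</tex-math></inline-formula> the <inline-formula><tex-math>m</tex-math></inline-formula> training examples, <inline-formula><tex-math>L</tex-math></inline-formula> the loss, <inline-formula><tex-math>\eta</tex-math></inline-formula> the learning rate and <inline-formula><tex-math>B_t</tex-math></inline-formula> the set of example indices used at step <inline-formula><tex-math>t</tex-math></inline-formula>.</p>
      <disp-formula><tex-math><![CDATA[
\vec{w}^{*} = \arg\min_{\vec{w}} \frac{1}{m} \sum_{i=1}^{m} L\big(f(\vec{x}_i; \vec{w}),\, y_i\big),
\qquad
\vec{w}_{t+1} = \vec{w}_{t} - \eta \, \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_{\vec{w}} L\big(f(\vec{x}_i; \vec{w}_{t}),\, y_i\big)
]]></tex-math></disp-formula>
      <p>Under this notation, Gradient Descent corresponds to <inline-formula><tex-math>B_t = \{1, \ldots, m\}</tex-math></inline-formula> at every step, while mini-batch SGD draws a small subset <inline-formula><tex-math>B_t</tex-math></inline-formula> at each step.</p>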
      <p>
        While an iteration of mini-batch SGD is faster than an iteration of Gradient Descent, it is important to note that mini-batch SGD usually needs more iterations to converge. Figure 1 presents a contour plot, which outlines how the gradients move the weights towards the optimization point. Gradient Descent takes into account the whole data distribution in each training step and thus continuously moves towards the optimization point. Mini-batch SGD, however, computes the gradient of a training step using only the examples of the current mini-batch, directing the weights to various directions before converging. The aforementioned algorithms cannot guarantee convergence to the global minimum, as the loss functions in neural network training are non-convex and the algorithms may get stuck in local minima. Alternatives, such as Adam, AdaGrad and others, have been proposed as less vulnerable to such phenomena [<xref ref-type="bibr" rid="ref27">27</xref>].
      </p>
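      <p>The difference between the two algorithms can be illustrated with a minimal sketch on a one-dimensional least-squares problem. The data generator, the learning rate and the batch size below are assumptions chosen only for illustration and are not taken from our experiments.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def make_data(n, true_w, seed=0):
    """Generate (x, y) pairs with y = true_w * x plus small Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_w * x + rng.gauss(0.0, 0.01)) for x in xs]


def gradient(w, examples):
    """Gradient of the mean of 0.5 * (w * x - y)^2 over `examples`."""
    return sum((w * x - y) * x for x, y in examples) / len(examples)


def gradient_descent(data, lr=0.5, steps=200):
    w = 0.0
    for _ in range(steps):
        w -= lr * gradient(w, data)  # every step sees the whole data set
    return w


def minibatch_sgd(data, batch_size=8, lr=0.5, steps=200, seed=1):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # every step sees a random subset
        w -= lr * gradient(w, batch)
    return w


data = make_data(n=512, true_w=3.0)
w_gd = gradient_descent(data)
w_sgd = minibatch_sgd(data)
print(w_gd, w_sgd)
```
]]></preformat>
      <p>Both runs approach the true weight on this convex toy problem; the mini-batch run touches only 8 examples per step instead of 512, at the cost of a noisier trajectory.</p>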
    </sec>
    <sec id="sec-4">
      <title>Parameter Server Architecture</title>
      <p>
        Following a data parallel scheme, the parameter server architecture [<xref ref-type="bibr" rid="ref15 ref30">15, 30</xref>] introduces two different entities in the learning process: the workers and the parameter servers. Parameter servers are used to store the neural network parameters in a distributed fashion. Workers use local copies of the network and a local part of the data to compute gradients based on some variant of the mini-batch SGD algorithm. While gradients can be aggregated under various synchronization schemes, the parameter server architecture usually follows an asynchronous parallel approach for updating the global model in the servers. The steps of training a model under the aforementioned setup are depicted in Figure 2.
      </p>
      <p>[Figure 2: Worker tasks 1 to N, each holding a local model copy and a data shard. Each worker task (1) receives the model parameters, (2) extracts a local mini-batch from its local data shard, (3) computes the local gradients, and (4) sends the local gradients to the parameter servers.]</p>
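      <p>The four worker steps and the origin of stale gradients can be sketched with a small single-process simulation. This is not how a real parameter server is implemented: the <monospace>ParameterServer</monospace> class, the round-robin scheduling and the fixed <monospace>delay</monospace> are hypothetical simplifications, introduced only to show that a gradient computed on an older model version is applied to a newer one.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


class ParameterServer:
    """Holds one global weight and applies gradients as they arrive."""

    def __init__(self, w0, lr):
        self.w, self.lr, self.version = w0, lr, 0

    def pull(self):
        return self.w, self.version

    def push(self, grad):
        self.w -= self.lr * grad
        self.version += 1


def local_gradient(w, batch):
    """Gradient of the mean of 0.5 * (w * x - y)^2 over the mini-batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def train_async(shards, steps, delay, lr=0.2, batch_size=8, seed=0):
    """Simulate workers whose gradients reach the server `delay` updates late."""
    rng = random.Random(seed)
    server = ParameterServer(0.0, lr)
    in_flight = []  # gradients computed but not yet delivered
    max_staleness = 0
    for step in range(steps):
        shard = shards[step % len(shards)]        # round-robin over workers
        w_local, pulled = server.pull()           # 1. receive model parameters
        batch = rng.sample(shard, batch_size)     # 2. extract local mini-batch
        grad = local_gradient(w_local, batch)     # 3. compute local gradient
        in_flight.append((grad, pulled))
        if len(in_flight) > delay:
            grad, version = in_flight.pop(0)
            # staleness = server updates applied since this gradient's pull
            max_staleness = max(max_staleness, server.version - version)
            server.push(grad)                     # 4. send gradient to server
    return server.w, max_staleness


rng = random.Random(42)
data = [(x, 3.0 * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(256))]
shards = [data[k::4] for k in range(4)]           # four workers
w_fresh, staleness_fresh = train_async(shards, steps=400, delay=0)
w_stale, staleness_stale = train_async(shards, steps=400, delay=4)
print(staleness_fresh, staleness_stale)
```
]]></preformat>
      <p>With a small learning rate and homogeneous shards, as assumed here, the delayed run still converges; the setting studied in this research is the one where shards are not representative and staleness becomes harmful.</p>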
    </sec>
    <sec id="sec-6">
      <title>EXPLOITING DATA DISTRIBUTION IN LEARNING</title>
      <p>In this section, we present how distribution-related traits that emerge from data preprocessing could be exploited in the learning process, especially in the distributed case.</p>
    </sec>
    <sec id="sec-7">
      <title>Exploit data distribution in single node training.</title>
      <p>
        As mentioned in Section 2.1, mini-batch SGD does not move the weights of the neural network directly towards the minimum, due to the restricted view it has of the data on each iteration, whereas Gradient Descent is able to move directly towards the optimization point. A usual approach to mitigate this problem when training deep learning models is to randomly shuffle the data before each mini-batch extraction, in order to obtain mini-batches with less correlated data [<xref ref-type="bibr" rid="ref12">12</xref>].
      </p>
      <p>However, it is reasonable to ask what will happen if each mini-batch is chosen so that it is actually representative of the whole dataset. Will this result in a more accurate model or a faster training process compared to the random sampling techniques commonly used? Is it worthwhile to preprocess the training data in order to understand their structure and guide the sampling process during mini-batch selection?</p>
      <p>
        Bengio et al. proposed Curriculum Learning [<xref ref-type="bibr" rid="ref4">4</xref>] as an approach in this direction and showed that training was able to converge to better local minima when traits of the data were used to guide the network training process. For instance, in an image classification case, they used only some easily distinguished data at first, and then included more complex images.
      </p>
      <p>
        In this PhD research, we aim to focus on how to select the mini-batch on each iteration so that it is representative of the whole data set and boosts the network training. As a proof of concept, we clustered each class of the CIFAR-10 [<xref ref-type="bibr" rid="ref19">19</xref>] dataset into two sub-clusters and chose each mini-batch to include data from every resulting sub-cluster of all classes. In this example, we noticed a 5% improvement in the validation loss and a 2% improvement in the validation error. We further aim to examine whether we could benefit from real-time training metrics in order to select the training examples for each upcoming mini-batch.
      </p>
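      <p>A minimal sketch of such a mini-batch construction is given below. It assumes that every example has already been assigned to a stratum, here a hypothetical (class, sub-cluster) pair produced by some clustering step, and it draws from all strata in round-robin order. The function name and the toy assignment are illustrative and do not reproduce the exact procedure of the CIFAR-10 experiment.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import defaultdict


def stratified_batches(strata_of, batch_size, seed=0):
    """Yield mini-batches that draw from every stratum in round-robin order.

    `strata_of` maps an example id to its stratum, e.g. (class, sub_cluster).
    """
    rng = random.Random(seed)
    pools = defaultdict(list)
    for example, stratum in strata_of.items():
        pools[stratum].append(example)
    for pool in pools.values():
        rng.shuffle(pool)  # random order inside each stratum
    keys = sorted(pools)
    if batch_size < len(keys):
        raise ValueError("batch_size must cover every stratum at least once")
    cursors = {k: 0 for k in keys}
    n_batches = len(strata_of) // batch_size
    for _ in range(n_batches):
        batch = []
        for i in range(batch_size):
            k = keys[i % len(keys)]
            pool = pools[k]
            # wrap around, so strata smaller than their share are re-used
            batch.append(pool[cursors[k] % len(pool)])
            cursors[k] += 1
        yield batch


# Toy assignment: 3 classes x 2 sub-clusters, 20 examples per stratum.
strata_of = {i: (i % 3, (i // 3) % 2) for i in range(120)}
batches = list(stratified_batches(strata_of, batch_size=12))
print(len(batches), sorted(strata_of[e] for e in batches[0]))
```
]]></preformat>
      <p>When strata are unbalanced, the wrap-around in this sketch oversamples the small strata; whether that is desirable is one of the questions this research intends to examine.</p>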
    </sec>
    <sec id="sec-8">
      <title>Exploit data distribution in parameter server training.</title>
      <p>
        Stale Gradients Effect. As stated in Section 1, parameter server training is usually harmed by the stale gradients effect. Stale gradients occur when a worker computes a gradient update using old model parameters. In 2018, Dutta et al. approached the problem of staleness by proposing an appropriate variable learning rate [<xref ref-type="bibr" rid="ref10">10</xref>]. In 2017, Jiang et al. also approached the staleness problem with learning rate techniques in a heterogeneous environment [<xref ref-type="bibr" rid="ref16">16</xref>]. Moreover, in 2018, Huang et al. proposed FlexPS [<xref ref-type="bibr" rid="ref15">15</xref>], which uses a staleness parameter that controls the aging of the parameters to avoid staleness effects.
      </p>
      <p>Better data assignment to workers. In the research works stated above, algorithmic solutions at the parameter server or learning rate level are proposed to smooth the staleness effect. However, since a common approach is to randomly shard the data to workers, we believe that if the part of the data accessible to each worker is not representative of the whole dataset, this may further aggravate the staleness effect. For instance, TensorFlow uses a modulo-based sharding approach, which assigns a training example to a worker based on the example index. To further support this claim, we discuss an example based on an ImageNet subset with images from Flickr (approx. 60 GB in size). Figure 3 presents a histogram of the image population in each class. In this figure, we can observe that most of the classes consist of approximately 100 images, while some of them consist of more than 1000 (and a few of more than 1500) images. Thus, it is possible that a random data assignment fails to provide a worker with data from some of the less populated classes, or biases another worker towards a highly populated class (data skew on some workers). Having trained on a stale parameter set in some iteration, such a worker could direct the weights not towards the true optimization point, but possibly towards another one that better optimizes its own part of the data, due to its lack of knowledge of the rest of the data space. In this research, we aim to study whether stratification in data sharding to workers and in mini-batch selection per worker (at the class level or at a hidden level according to the distribution) could be used to smooth the effects of staleness.</p>
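      <p>The contrast between index-based and class-stratified sharding can be sketched as follows. The label layout is a contrived worst case, assumed only to make the skew visible, and both helper functions are hypothetical illustrations rather than the sharding code of any particular framework.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter, defaultdict


def modulo_shards(labels, num_workers):
    """Assign example i to worker i % num_workers (index-based sharding)."""
    shards = [[] for _ in range(num_workers)]
    for i in range(len(labels)):
        shards[i % num_workers].append(i)
    return shards


def stratified_shards(labels, num_workers):
    """Deal the examples of each class to the workers in turn."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    shards = [[] for _ in range(num_workers)]
    offset = 0
    for label in sorted(by_class):
        for j, i in enumerate(by_class[label]):
            shards[(offset + j) % num_workers].append(i)
        offset += len(by_class[label])  # continue dealing where we stopped
    return shards


def class_counts(shards, labels):
    return [Counter(labels[i] for i in shard) for shard in shards]


# Contrived layout: every fourth example belongs to the rare class.
labels = ["rare" if i % 4 == 0 else "common" for i in range(40)]
by_index = class_counts(modulo_shards(labels, 4), labels)
by_class = class_counts(stratified_shards(labels, 4), labels)
print(by_index)
print(by_class)
```
]]></preformat>
      <p>In this layout, index-based sharding gives one worker only the rare class and the other workers none of it, while the stratified variant gives every worker a share of both classes with equal shard sizes.</p>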
      <p>
        Stratification is widely used when the computing task cannot view the data in their entirety, as for example in approximate query processing [<xref ref-type="bibr" rid="ref17">17</xref>]. In the context of learning, it is also used to facilitate learning from heterogeneous databases [<xref ref-type="bibr" rid="ref26">26</xref>]. Moreover, in a 2020 study [<xref ref-type="bibr" rid="ref24">24</xref>], hidden stratification appeared to crucially affect the quality of classification models for medical images.
      </p>
      <p>
        Extracting data distribution related information. Hidden stratification can be used to reveal how the data are organized in the distribution. A common approach to discover hidden patterns in the data distribution is the use of unsupervised learning techniques, such as clustering. In the big data context, multiple clustering techniques have been proposed. For instance, the authors of [<xref ref-type="bibr" rid="ref28">28</xref>] designed a clustering framework for big data, which is able to discover multiple distribution types. Others propose approximate and distributed versions of common clustering algorithms, such as DBSCAN [<xref ref-type="bibr" rid="ref22">22</xref>]. Apart from clustering, it would also be efficient to use functions that group similar points together, in the same manner as hash functions do. In this direction, and in the spirit of Locality Sensitive Hashing, Gao et al. proposed Data Sensitive Hashing [<xref ref-type="bibr" rid="ref11">11</xref>], which exploits the data distribution to hash close data points together in a high-dimensional space.
      </p>
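      <p>To illustrate the idea of hashing similar points together, the sketch below implements plain random-hyperplane Locality Sensitive Hashing for directional (cosine) similarity. It is not Data Sensitive Hashing, which additionally learns from the data distribution; the function names and parameters are assumptions made for this example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw `num_bits` random hyperplane normals in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(point, hyperplanes):
    """One bit per hyperplane: the side of the hyperplane the point lies on."""
    return tuple(
        1 if sum(h_i * p_i for h_i, p_i in zip(h, point)) >= 0 else 0
        for h in hyperplanes
    )


def bucketize(points, hyperplanes):
    """Group point indices by signature; similar directions tend to collide."""
    buckets = {}
    for idx, point in enumerate(points):
        buckets.setdefault(signature(point, hyperplanes), []).append(idx)
    return buckets


planes = make_hyperplanes(dim=3, num_bits=8)
points = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],     # same direction as the first point
    [-1.0, -2.0, -3.0],  # opposite direction
]
buckets = bucketize(points, planes)
print(buckets)
```
]]></preformat>
      <p>The resulting bucket ids could serve as inexpensive hidden strata for sharding or mini-batch selection, provided that the chosen similarity measure is meaningful for the feature space at hand.</p>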
    </sec>
    <sec id="sec-9">
      <title>RESEARCH PLAN</title>
      <p>Having presented the concept and the ideas behind this PhD research, we have designed a plan to follow in order to conduct it. The plan is outlined below.</p>
      <p>• Measure the effects of mini-batch design in single node training. The first part of our work is to propose and study efficient techniques that take distribution traits into account to systematically construct mini-batches that are representative of the whole data set used for neural network training. In the case of data sets with numerous classes, such as ImageNet, where we cannot create representative mini-batches of commonly used sizes, we plan to randomly omit different parts of the data from each mini-batch. Having experimented with the distribution traits, we further plan to take real-time training and validation metrics into account when creating the next mini-batch. For instance, if the model presents large loss values on some examples, we could attempt to provide the next mini-batch with more data following an equivalent distribution.
• Study and evaluate techniques to extract distribution traits from big data sets. A first approach to identify how data points are organized in the multidimensional space is with the help of clustering algorithms. However, since we want to focus on big data and distributed learning, we have to compare various techniques that could efficiently discover distribution-related information, such as the DSH technique mentioned earlier. Having studied existing approaches to this problem, we will attempt to design and propose our own method that efficiently computes any necessary information.
• Apply distribution related information in data sharding. In order to create representative data shards, our first goal is to consider class stratification and identify whether it is able to facilitate the learning process. Moreover, having efficiently discovered any necessary distribution traits, the next part of our research aims to exploit them in the shard creation process. The knowledge of stratification and distribution traits derived from the data is expected to further help parameter server training.
• Design and propose a streaming system for serving mini-batches to workers in the parameter server setup. Our research will conclude with the design and implementation of a system that collects data information and uses the distribution information extraction mechanisms to learn how to efficiently prepare mini-batches for the workers to train on in the parameter server setup. This system will be created taking into consideration any observations from the steps described above.</p>
    </sec>
    <sec id="sec-10">
      <title>Technologies</title>
      <p>Having presented our research plan, we briefly describe the state-of-the-art technologies that we plan to use in order to construct our various components.</p>
      <p>
        • Apache Spark [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] is a widely used general purpose big data
system. Among other libraries, Spark provides SparkML,
which offers some clustering algorithms that we could
exploit to identify initial data distribution traits. Moreover,
Spark can easily be used to compute any other interesting
metrics that we might need to consider.
• Regarding neural network training, we aim to use Google
TensorFlow [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which also operates under the parameter
server architecture.
• For streaming mini-batches we aim to examine the use of
Apache Arrow [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], since it optimizes data in a columnar
format for CPU and GPU analytical operations. Moreover,
TensorFlow can directly read from Arrow streams.
      </p>
      <p>We aim to design the final system as a layer over the training
cluster, which will encapsulate all the above technologies. Thus, a
user will be able to easily benefit from our system.</p>
    </sec>
    <sec id="sec-11">
      <title>Benchmarking Setup</title>
      <p>Having implemented each of our components, we aim to benchmark how they affect the asynchronous distributed training process in terms of speed and of the resulting training and validation metrics. As a baseline, we will use state-of-the-art neural networks trained under a parameter server setup with the optimal hyperparameters. For instance, we could train ResNet or Inception models on the ImageNet data set in a simple parameter server setup. These baseline models could then be compared with the ones trained taking the distribution traits into account.</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research has been co-financed by the European Union and
Greek national funds through the Operational Program
"Competitiveness, Entrepreneurship and Innovation", under the call
RESEARCH – CREATE – INNOVATE (project code: T1EDK-04605).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] [n.d.]. Apache Arrow. https://arrow.apache.org/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] [n.d.]. DAWNBench. https://dawn.cs.stanford.edu/benchmark/ImageNet/train.html</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Martín</given-names>
            <surname>Abadi</surname>
          </string-name>
          , Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis,
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Matthieu</given-names>
            <surname>Devin</surname>
          </string-name>
          , Sanjay Ghemawat, Geoffrey Irving, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Isard</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Tensorflow: A system for large-scale machine learning</article-title>
          .
          <source>In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)</source>
          .
          <fpage>265</fpage>–<lpage>283</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          , Jérôme Louradour, Ronan Collobert, and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Curriculum learning</article-title>
          .
          <source>In Proceedings of the 26th annual international conference on machine learning. 41–48.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Tianqi</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mu</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yutian</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Min</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Naiyan</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Minjie</given-names>
            <surname>Wang</surname>
          </string-name>
          , Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang.
          <year>2015</year>
          .
          <article-title>Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems</article-title>
          .
          <source>arXiv preprint arXiv:1512.01274</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Cody</given-names>
            <surname>Coleman</surname>
          </string-name>
          , Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Bailis</surname>
          </string-name>
          , Kunle Olukotun, Chris Ré, and
          <string-name>
            <given-names>Matei</given-names>
            <surname>Zaharia</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Dawnbench: An end-to-end deep learning benchmark and competition</article-title>
          .
          <source>Training</source>
          <volume>100</volume>
          ,
          <issue>101</issue>
          (
          <year>2017</year>
          ),
          <fpage>102</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jason Jinquan</given-names>
            <surname>Dai</surname>
          </string-name>
          , Yiheng Wang, Xin Qiu, Ding Ding, Yao Zhang, Yanzhang Wang, Xianyan Jia, Cherry Li Zhang, Yan Wan,
          <string-name>
            <given-names>Zhichao</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jiao</given-names>
            <surname>Wang</surname>
          </string-name>
          , Shengsheng Huang,
          <string-name>
            <given-names>Zhongyuan</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yang</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yuhao</given-names>
            <surname>Yang</surname>
          </string-name>
          , Bowen She, Dongjie Shi, Qi Lu, Kai Huang, and
          <string-name>
            <given-names>Guoqiong</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BigDL: A Distributed Deep Learning Framework for Big Data</article-title>
          .
          <source>In Proceedings of the ACM Symposium on Cloud Computing</source>
          (Santa Cruz, CA, USA) (
          <source>SoCC '19)</source>
          .
          <publisher-name>Association for Computing Machinery</publisher-name>
          , New York, NY, USA,
          <fpage>50</fpage>–<lpage>60</lpage>
          . https://doi.org/10.1145/3357223.3362707
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          , Greg S Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V Le,
          <string-name>
            <given-names>Mark Z</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marc'Aurelio</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Senior</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Paul</given-names>
            <surname>Tucker</surname>
          </string-name>
          , et al.
          <year>2012</year>
          .
          <article-title>Large scale distributed deep networks</article-title>
          .
          <source>In Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume</source>
          <volume>1</volume>
          .
          <fpage>1223</fpage>–<lpage>1231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>ImageNet: A Large-Scale Hierarchical Image Database</article-title>
          .
          <source>In CVPR09.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Sanghamitra</given-names>
            <surname>Dutta</surname>
          </string-name>
          , Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and
          <string-name>
            <given-names>Priya</given-names>
            <surname>Nagpurkar</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Slow and stale gradients can win the race: Error-runtime tradeoffs in distributed SGD</article-title>
          .
          <source>In International Conference on Artificial Intelligence and Statistics</source>
          . PMLR,
          <fpage>803</fpage>–<lpage>812</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Jinyang</given-names>
            <surname>Gao</surname>
          </string-name>
          , Hosagrahar Visvesvaraya Jagadish, Wei Lu, and Beng Chin Ooi.
          <year>2014</year>
          .
          <article-title>DSH: data sensitive hashing for high-dimensional k-nnsearch</article-title>
          .
          <source>In Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 1127–1138.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Ian</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          , Yoshua Bengio, and
          <string-name>
            <given-names>Aaron</given-names>
            <surname>Courville</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep learning</article-title>
          . MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Abdel-rahman</given-names>
            <surname>Mohamed</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Speech recognition with deep recurrent neural networks</article-title>
          .
          <source>In 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 6645–6649.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Yuzhen</given-names>
            <surname>Huang</surname>
          </string-name>
          , Tatiana Jin, Yidi Wu, Zhenkun Cai, Xiao Yan, Fan Yang,
          <string-name>
            <given-names>Jinfeng</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yuying</given-names>
            <surname>Guo</surname>
          </string-name>
          , and James Cheng.
          <year>2018</year>
          .
          <article-title>FlexPS: Flexible Parallelism Control in Parameter Server Architecture</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>11</volume>
          ,
          <issue>5</issue>
          (Jan.
          <year>2018</year>
          ),
          <fpage>566</fpage>–<lpage>579</lpage>
          . https://doi.org/10.1145/3177732.3177734
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Jiawei</given-names>
            <surname>Jiang</surname>
          </string-name>
          , Bin Cui, Ce Zhang, and
          <string-name>
            <given-names>Lele</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Heterogeneity-aware distributed parameter servers</article-title>
          .
          <source>In Proceedings of the 2017 ACM International Conference on Management of Data. 463–478.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Srikanth</given-names>
            <surname>Kandula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kukjin</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Surajit</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Marc</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Experiences with approximating queries in Microsoft's production big-data clusters</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>12</volume>
          ,
          <issue>12</issue>
          (
          <year>2019</year>
          ),
          <fpage>2131</fpage>–<lpage>2142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Evdokia</given-names>
            <surname>Kassela</surname>
          </string-name>
          , Nikodimos Provatas, Ioannis Konstantinou, Avrilia Floratou, and
          <string-name>
            <given-names>Nectarios</given-names>
            <surname>Koziris</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>General-Purpose vs Specialized Data Analytics Systems: A Game of ML &amp; SQL Thrones</article-title>
          .
          <source>In 2019 IEEE International Conference on Big Data (Big Data).</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Learning multiple layers of features from tiny images</article-title>
          .
          <source>Technical Report.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Geoffrey E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In Advances in neural information processing systems. 1097–1105.</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Jingzhou</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wei-Cheng</given-names>
            <surname>Chang</surname>
          </string-name>
          , Yuexin Wu, and
          <string-name>
            <given-names>Yiming</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Deep learning for extreme multi-label text classification</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          .
          <fpage>115</fpage>–<lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Lulli</surname>
          </string-name>
          , Matteo Dell'Amico,
          <string-name>
            <given-names>Pietro</given-names>
            <surname>Michiardi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Laura</given-names>
            <surname>Ricci</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>NG-DBSCAN: scalable density-based clustering for arbitrary data</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>10</volume>
          ,
          <issue>3</issue>
          (
          <year>2016</year>
          ),
          <fpage>157</fpage>–<lpage>168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Supun</given-names>
            <surname>Nakandala</surname>
          </string-name>
          , Yuhao Zhang, and
          <string-name>
            <given-names>Arun</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Cerebro: A data system for optimized deep learning model selection</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>13</volume>
          ,
          <issue>12</issue>
          (
          <year>2020</year>
          ),
          <fpage>2159</fpage>–<lpage>2173</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Luke</given-names>
            <surname>Oakden-Rayner</surname>
          </string-name>
          , Jared Dunnmon, Gustavo Carneiro, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Ré</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Hidden stratification causes clinically meaningful failures in machine learning for medical imaging</article-title>
          .
          <source>In Proceedings of the ACM conference on health, inference, and learning. 151–159.</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Adam</given-names>
            <surname>Paszke</surname>
          </string-name>
          , Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein,
          <string-name>
            <given-names>Luca</given-names>
            <surname>Antiga</surname>
          </string-name>
          , et al.
          <year>2019</year>
          .
          <article-title>PyTorch: An imperative style, high-performance deep learning library</article-title>
          .
          <source>In Advances in neural information processing systems. 8026–8037.</source>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Jose</given-names>
            <surname>Picado</surname>
          </string-name>
          , Arash Termehchy, and
          <string-name>
            <given-names>Sudhanshu</given-names>
            <surname>Pathak</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Learning efficiently over heterogeneous databases</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>11</volume>
          ,
          <issue>12</issue>
          (
          <year>2018</year>
          ),
          <fpage>2066</fpage>–<lpage>2069</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>An overview of gradient descent optimization algorithms</article-title>
          .
          <source>arXiv preprint arXiv:1609.04747</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Erich</given-names>
            <surname>Schubert</surname>
          </string-name>
          , Alexander Koos, Tobias Emrich, Andreas Züfle, Klaus Arthur Schmid, and
          <string-name>
            <given-names>Arthur</given-names>
            <surname>Zimek</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A Framework for Clustering Uncertain Data</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>8</volume>
          ,
          <issue>12</issue>
          (Aug.
          <year>2015</year>
          ),
          <fpage>1976</fpage>–<lpage>1979</lpage>
          . https://doi.org/10.14778/2824032.2824115
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ioffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wojna</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Rethinking the Inception Architecture for Computer Vision</article-title>
          . In
          <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          .
          <fpage>2818</fpage>–<lpage>2826</lpage>
          . https://doi.org/10.1109/CVPR.2016.308
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Eric P.</given-names>
            <surname>Xing</surname>
          </string-name>
          , Qirong Ho, Wei Dai,
          <string-name>
            <given-names>Jin-Kyu</given-names>
            <surname>Kim</surname>
          </string-name>
          , Jinliang Wei,
          <string-name>
            <given-names>Seunghak</given-names>
            <surname>Lee</surname>
          </string-name>
          , Xun Zheng, Pengtao Xie, Abhimanu Kumar, and
          <string-name>
            <given-names>Yaoliang</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Petuum: A New Platform for Distributed Machine Learning on Big Data</article-title>
          .
          <source>In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD '15)</source>
          . Association for Computing Machinery, New York, NY, USA,
          <fpage>1335</fpage>–<lpage>1344</lpage>
          . https://doi.org/10.1145/2783258.2783323
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Matei</given-names>
            <surname>Zaharia</surname>
          </string-name>
          , Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and
          <string-name>
            <given-names>Ion</given-names>
            <surname>Stoica</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Apache Spark: A Unified Engine for Big Data Processing</article-title>
          .
          <source>Commun. ACM</source>
          <volume>59</volume>
          ,
          <issue>11</issue>
          (Oct.
          <year>2016</year>
          ),
          <fpage>56</fpage>–<lpage>65</lpage>
          . https://doi.org/10.1145/2934664
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Fast Zero-Shot Image Tagging</article-title>
          .
          <source>In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          .
          <fpage>5985</fpage>–<lpage>5994</lpage>
          . https://doi.org/10.1109/CVPR.2016.644
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>