=Paper= {{Paper |id=Vol-2971/paper02 |storemode=property |title=Exploiting Data Distribution in Distributed Learning of Deep Classification Models under the Parameter Server Architecture |pdfUrl=https://ceur-ws.org/Vol-2971/paper02.pdf |volume=Vol-2971 |authors=Nikodimos Provatas |dblpUrl=https://dblp.org/rec/conf/vldb/Provatas21 }} ==Exploiting Data Distribution in Distributed Learning of Deep Classification Models under the Parameter Server Architecture== https://ceur-ws.org/Vol-2971/paper02.pdf
Exploiting Data Distribution in Distributed Learning of Deep Classification Models under the Parameter Server Architecture

Nikodimos Provatas
supervised by Prof. Nectarios Koziris
National Technical University of Athens
Athens, Greece
nprov@cslab.ece.ntua.gr

ABSTRACT

Nowadays, deep learning models are used in a wide range of applications, including classification and recognition tasks. The constant growth in data size has led to the use of more complex model architectures for creating neural network classifiers. Both the model complexity and the amount of data usually prohibit training on a single machine, due to time and memory limitations. Thus, distributed learning setups have been proposed to train deep networks when a vast amount of data is available. One such common setup follows the parameter server approach, where worker tasks compute gradients to update the network stored in the servers, often in a synchronization-free manner. However, the lack of synchronization may harm the resulting model quality due to the effect of stale gradients, which are computed based on older model versions. In this PhD research, we aim to explore how asynchronous learning could benefit from data preprocessing tasks that reveal hidden traits of the data distribution.

1 INTRODUCTION

In recent years, deep learning has become a widely used component in a variety of applications. For example, in the image processing domain, neural networks are widely used in classification [20] and tagging [32] applications. Deep models are also widely used in other domains, such as speech recognition [13] and text classification [21].

All the aforementioned applications are actually classification tasks. In order to create accurate classifiers, a large amount of data is needed. As the volume of available data increases, more complex models are used to represent the patterns implied by the data. Common examples of such complex model architectures are ResNet [14] and Inception [29], used on the Imagenet [9] dataset.

Both the increase in model complexity and the amount of data available can prohibit model training on a single machine, since it would take numerous hours or days to create a generic and reliable model. For instance, without the use of accelerators such as GPUs, Stanford's DAWNBench [6] took more than 10 days to train a ResNet-152 model on the Imagenet dataset [2]. Thus, distributed architectures have been proposed to speed up the training process. Depending on what is distributed, multiple solutions can be adopted. The most common approach, especially when it comes to big data, is to distribute the available data to various workers, following data parallel learning architectures [8]. The most prominent such distributed learning setups follow the parameter server [30] and the all-reduce [7] architectures. Widely used deep learning systems, such as Google TensorFlow [3], Apache MXNet [5] and PyTorch [25], have adopted the concept of distributed learning following one or more of the aforementioned architectures.

While distributed learning enables faster deep neural network training, each distributed approach introduces new issues that might harm either the training speed or the quality of the resulting model. For example, all-reduce approaches introduce synchronization overheads on each training step. On the other hand, in the parameter server architecture such overheads are avoided, as workers usually operate in an asynchronous manner [18]. However, stale gradient effects may then occur, which can either delay model convergence or lead the model's training loss function to diverge [23]. We believe that if the data are used wisely, such phenomena can be overcome.

In this PhD research, we focus on studying how the distributed deep learning process can benefit from the distribution of the training data, especially under the parameter server architecture. Data preprocessing techniques can be used to obtain a-priori knowledge of the data domain, which can be beneficial to the training process. We aim to study and propose systematic ways in which the data should be assigned to the available training worker nodes. Furthermore, we will also study whether random or algorithmic access patterns on the data are preferable during the training process, focusing on the distributed case. With such techniques, we aim to make training less sensitive to the undesirable effects that appear in asynchronous distributed learning setups.

The rest of this paper is organized in four sections. Section 2 covers the background knowledge necessary to easily understand the paper. Section 3 follows with a discussion on how prior knowledge of the data distribution could benefit the learning process, especially under a parameter server setup. Finally, the paper concludes with Section 4, which outlines the steps that will be followed to complete this research.

Proceedings of the VLDB 2021 PhD Workshop, August 16th, 2021. Copenhagen, Denmark. Copyright (C) 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 BACKGROUND

2.1 Optimization Related Preliminaries

In the context of classification problems [12], a neural network with weights represented by the vector 𝑤⃗ is considered to approximate a function 𝑓 : Rⁿ → R that, given an input feature vector 𝑥⃗, can be used to classify it to a category 𝑦, i.e. 𝑦 = 𝑓(𝑥⃗; 𝑤⃗). Given a set of feature vectors 𝑥⃗1, 𝑥⃗2, ..., 𝑥⃗𝑛 and their corresponding category labels 𝑦1, 𝑦2, ..., 𝑦𝑛, the function 𝑓 used to describe the neural network can be identified by finding the appropriate vector 𝑤⃗, derived as the
solution of an optimization problem on a loss function 𝐿. Gradient Descent is considered to be among the most popular algorithms used in optimization problems [27]. However, considering both the size of deep neural networks and the vast amount of data usually available, Gradient Descent is not preferred, since each iteration would be too slow. Mini-Batch Stochastic Gradient Descent (mini-batch SGD) is used as an alternative instead, which uses only a subset of the training examples in each training iteration.
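To make the contrast concrete, the two update rules can be written as follows; this is the standard formulation rather than one given explicitly in the paper, with η the learning rate, n the number of training examples of Section 2.1, and B_t the mini-batch of size 𝐵 drawn at step t:

w_{t+1} = w_t - \eta \, \nabla_{w} \frac{1}{n} \sum_{i=1}^{n} L\big(f(\vec{x}_i; w_t), y_i\big)        (Gradient Descent)

w_{t+1} = w_t - \eta \, \nabla_{w} \frac{1}{B} \sum_{i \in B_t} L\big(f(\vec{x}_i; w_t), y_i\big)        (mini-batch SGD)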
While an iteration of mini-batch SGD is faster than one of Gradient Descent, it usually needs more iterations to converge. Figure 1 presents a contour plot which outlines how the gradients move the weights towards the optimization point. Gradient Descent takes the whole data distribution into account in each training step and thus moves continuously towards the optimization point. Mini-batch SGD, however, computes the gradient of a training step with only 𝐵 examples, directing the weights in various directions before converging. Neither algorithm can guarantee convergence to the global minimization point, as the optimization functions in neural network training are non-convex and the weights may get stuck in local minima. Alternatives such as Adam, AdaGrad and others have been proposed as less vulnerable to such phenomena [27].

Figure 1: Contour plot outlining Gradient Descent and Mini-Batch Gradient Descent while converging to a local minimum point.
2.2 Parameter Server Architecture

Following a data parallel scheme, the parameter server architecture [15, 30] introduces two different entities in the learning process: the workers and the parameter servers. Parameter servers store the neural network parameters in a distributed fashion. Workers use local copies of the network and a local part of the data to compute gradients based on some variant of the mini-batch SGD algorithm. While gradients can be aggregated under various synchronization schemes, the parameter server usually follows an asynchronous parallel approach for updating the global model in the servers. The steps of training a model under this setup are fully depicted in Figure 2: each worker (1) receives the model parameters, (2) extracts a local mini-batch from its local data shard and (3) computes local gradients, which it (4) sends to the parameter servers, so that they (5) update the global model with the received gradients.

Figure 2: Training under the Parameter Server Architecture.
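As a minimal sketch, the worker side of this loop could look as follows; the objects ps, shard and model are hypothetical placeholders introduced only for illustration and do not correspond to the API of TensorFlow or of any other system cited here:

def worker_loop(ps, shard, model, num_steps, batch_size=32):
    """One asynchronous worker under the parameter server scheme of Figure 2."""
    for step in range(num_steps):
        weights = ps.pull_weights()                 # (1) receive current global parameters
        x, y = shard.sample_minibatch(batch_size)   # (2) extract a mini-batch from the local shard
        grads = model.gradients(weights, x, y)      # (3) compute local gradients
        ps.push_gradients(grads)                    # (4) send them to the parameter servers
        # (5) each server applies the update to its part of the global model
        #     (e.g. w <- w - lr * g) as soon as the gradients arrive, without
        #     waiting for the other workers; this is why gradients may be stale.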
3 EXPLOITING DATA DISTRIBUTION IN LEARNING

In this section, we present how we could exploit distribution related traits that emerge from data preprocessing for the learning process, especially in the distributed case.

3.1 Exploit data distribution in single node training

As we mentioned in Section 2.1, mini-batch SGD does not move the weights of the neural network directly to the minimization point, due to the restricted view it has of the data in each iteration. However, we know that Gradient Descent is able to move directly towards the optimization point. A usual approach to overcome such problems when training deep learning models is to randomly shuffle the data before each mini-batch extraction in order to obtain a mini-batch with less correlated data [12].

However, it is reasonable to ask what happens if the mini-batch is chosen such that it is actually representative of the whole dataset. Will this result in a more accurate model or in a faster training process compared to the random sampling techniques used? Is it important to perform some preprocessing on the training data in order to understand their structure and determine the sampling process during mini-batch selection?

Bengio proposed Curriculum Learning [4] as an approach in this direction and showed that training was able to converge to better local minima when traits of the data were used to help the network training process. For instance, in an image classification case, only some easily distinguished data were used at first, and more complex images were included later.

In this PhD research, we aim to focus on how to select the mini-batch in each iteration so that it is representative of the whole data set and boosts the network training. As a proof of concept, we clustered each class of the CIFAR-10 [19] dataset into two sub-clusters and chose each mini-batch to include data from each resulting sub-cluster of all classes. In this example, we noticed a 5% improvement in the validation loss and a 2% improvement in the validation error. We further aim to examine whether we could benefit from real-time training metrics in order to select the training examples for each upcoming mini-batch.
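A sketch of this kind of preprocessing, assuming flattened image features and using scikit-learn's KMeans (the exact configuration of our proof-of-concept experiment may differ), is the following:

import numpy as np
from sklearn.cluster import KMeans

def build_subclusters(features, labels, n_subclusters=2, seed=0):
    """Cluster every class into sub-clusters; return one index bucket per (class, sub-cluster)."""
    buckets = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        assignment = KMeans(n_clusters=n_subclusters, random_state=seed).fit_predict(features[idx])
        buckets.extend(idx[assignment == s] for s in range(n_subclusters))
    return buckets

def representative_minibatch(buckets, batch_size, rng=None):
    """Draw (roughly) equally from every bucket, e.g. 100 // (10 classes * 2 sub-clusters) = 5 each."""
    rng = rng or np.random.default_rng()
    per_bucket = max(1, batch_size // len(buckets))
    return np.concatenate([rng.choice(b, size=per_bucket, replace=False) for b in buckets])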
3.2 Exploit data distribution in parameter server training

Stale Gradients Effect. As we stated in the introductory Section 1, parameter server training is usually harmed by the stale gradients effect. Stale gradients occur when a worker computes a gradient update using old model parameters. Dutta et al. approached the problem of staleness by proposing an appropriate variable learning rate [10]. In 2017, Jiang et al. also approached the staleness problem with learning rate techniques in a heterogeneous environment [16]. Moreover, in 2018, Huang et al. proposed FlexPS [15], which provides a staleness parameter controlling the aging of the parameters to avoid staleness effects.
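A simple way to picture such staleness-aware schemes is to damp the learning rate as the gap between the current model version and the version a gradient was computed on grows; the rule below is only an illustrative assumption and does not reproduce the exact mechanisms of [10], [16] or [15]:

def staleness_aware_lr(base_lr, current_version, gradient_version, alpha=1.0):
    """Damp the learning rate for gradients computed on old parameters."""
    staleness = current_version - gradient_version  # number of global updates missed by the worker
    return base_lr / (1.0 + alpha * staleness)

def server_update(weights, grads, base_lr, current_version, gradient_version):
    """Apply a staleness-damped SGD step on the server side (illustrative only)."""
    lr = staleness_aware_lr(base_lr, current_version, gradient_version)
    return [w - lr * g for w, g in zip(weights, grads)]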
Better data assignment to workers. In the research works stated above, algorithmic solutions at the parameter server or learning rate level are proposed to smooth the staleness effect. However, we believe that if the part of the data that each worker can access is not representative of the whole dataset, this may further aggravate the staleness effect, since a common approach is to randomly shard the data to the workers. For instance, TensorFlow uses a modular sharding approach based on the training example index to assign it to a worker. To further support this claim, we discuss an example based on an Imagenet subset with images from Flickr (approx. 60GB in size). Figure 3 presents a histogram of the image population in each class. In this figure, we can observe that most of the classes consist of approximately 100 images, while some of them consist of more than 1000 (and even more than 1500) images. Thus, it is possible that a random data assignment approach does not provide a worker with data from some of the less populated classes, or biases another worker towards a highly populated class (data skew on some workers). Having trained on a stale parameter set in some iteration, such a worker could direct the weights not towards the true optimization point, but possibly towards another one that better optimizes its own part of the data, due to the lack of knowledge regarding the rest of the data space.

Figure 3: Class Population Histogram of Imagenet data obtained from Flickr.

In this research, we aim to study whether stratification in data sharding to workers and in mini-batch selection per worker (at the class level or at a hidden level according to the distribution) could be facilitated in order to smooth the effects of staleness; the sketch below contrasts index-based and class-stratified sharding.
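The index-modulo variant below mirrors what tf.data.Dataset.shard(num_workers, worker_index) effectively does, while the stratified variant is a minimal illustration of the class-stratified assignment we intend to study, not a finalized design:

import numpy as np

def modulo_shards(num_examples, num_workers):
    """Index-based sharding: example i is assigned to worker i % num_workers."""
    return [np.arange(num_examples)[w::num_workers] for w in range(num_workers)]

def stratified_shards(labels, num_workers, seed=0):
    """Class-stratified sharding: split every class evenly across workers,
    so each shard sees (roughly) the full class distribution."""
    rng = np.random.default_rng(seed)
    shards = [[] for _ in range(num_workers)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        for w in range(num_workers):
            shards[w].append(idx[w::num_workers])
    return [np.concatenate(s) for s in shards]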
Stratification is widely used when the computing task cannot view the data in their entirety, as for example in approximate query processing [17]. In the context of learning, it is also used to facilitate learning from heterogeneous databases [26]. Moreover, in a 2020 study [24], hidden stratification appeared to crucially affect the quality of classification models for medical images.

Extracting data distribution related information. Hidden stratification can be used to reveal how the data are organized in the distribution. A common approach to discover hidden patterns in the data distribution is the use of unsupervised learning techniques, such as clustering. In the big data context, multiple clustering techniques have been proposed. For instance, the authors of [28] designed a clustering framework for big data which is able to discover multiple distribution types. Others propose approximate and distributed versions of common clustering algorithms, such as DBSCAN [22]. Apart from clustering, it would also be efficient to utilize functions that cluster similar points together, in the same manner as hash functions do. Towards this approach, and in the spirit of Locality Sensitive Hashing, Gao et al. proposed Data Sensitive Hashing [11], which exploits the data distribution to hash close data points in a high dimensional space together.
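To illustrate the hashing idea, a minimal random-hyperplane sketch in the spirit of Locality Sensitive Hashing is given below; it is not the DSH algorithm of [11], whose hash functions are additionally adapted to the data distribution:

import numpy as np

def hyperplane_hashes(points, num_bits=16, seed=0):
    """Map each point to a bucket id so that nearby points tend to collide.

    Each bit records on which side of a random hyperplane a point falls;
    points with small angular distance share most bits, so buckets group
    similar points without scanning the whole data set.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((points.shape[1], num_bits))
    bits = (points @ planes) > 0
    return (bits * (1 << np.arange(num_bits))).sum(axis=1)  # pack the bit vector into an integer id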
4 RESEARCH PLAN

Having presented the concept and the ideas behind this PhD research, we have designed a plan that we should follow to conduct this research. The aforementioned plan is outlined below.

• Measure the effects of mini-batch design in single node training. The first part of our work is to propose and study efficient techniques that take distribution traits into account to systematically construct the mini-batches used for neural network training, as representatives of the whole data set. In the case of data sets with numerous classes, such as ImageNet, where we cannot create representative mini-batches of a commonly used size, we plan to randomly omit different parts of the data from each mini-batch. Having experimented with the distribution traits, we further plan to take the real-time training and validation metrics into account when creating the next mini-batch. For instance, in case the model presents large loss values on some examples, we could attempt to provide the next mini-batch with more data following an equivalent distribution (a rough sketch of such loss-aware selection is given after this list).
• Study and evaluate techniques to extract distribution traits from big data sets. A first approach to identify how data points are organized in the multidimensional space is with the help of clustering algorithms. However, since we want to focus on big data and distributed learning, we have to compare various techniques that can efficiently discover distribution related information, such as the DSH approach stated earlier. Having studied the existing approaches to this problem, we will attempt to design and propose our own method that will efficiently compute any necessary information.
• Apply distribution related information in data sharding. In order to create representative data shards, our first goal is to consider class stratification and identify whether it is able to facilitate the learning process. Moreover, having efficiently discovered any necessary distribution traits, the next part of our research aims to exploit them in the shard creation process. The knowledge of stratification and distribution traits derived from the data is expected to further help parameter server training.
• Design and propose a streaming system for serving mini-batches to workers in the parameter server setup. Our research will conclude with the design and implementation of a system that will collect data information and facilitate the distribution information extraction mechanisms to learn how to efficiently prepare mini-batches for the workers to train in the parameter server setup. This system will be created taking into consideration any observations from the steps described above.
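As referenced in the first item above, one plausible form of such loss-aware mini-batch selection is sketched below; the concrete sampling rule is an assumption for illustration, not a finalized part of the plan:

import numpy as np

def loss_aware_minibatch(per_example_loss, batch_size, temperature=1.0, rng=None):
    """Sample the next mini-batch with probability increasing in the last
    observed per-example loss, so poorly fit regions are revisited sooner."""
    rng = rng or np.random.default_rng()
    scores = np.asarray(per_example_loss, dtype=float) ** temperature
    probs = scores / scores.sum()
    return rng.choice(len(probs), size=batch_size, replace=False, p=probs)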
4.1 Technologies

Having presented our research plan, we briefly describe the state-of-the-art technologies that we plan to use in order to construct our various components.

• Apache Spark [31] is a widely used general purpose big data system. Among other libraries, Spark provides SparkML, which offers some clustering algorithms that we could exploit to identify initial data distribution traits (a small SparkML sketch follows this list). Moreover, Spark can easily be used to compute any other interesting metrics that we might need to consider.
• Regarding neural network training, we aim to use Google TensorFlow [3], which also operates under the parameter server architecture.
• For streaming mini-batches, we aim to examine the use of Apache Arrow [1], since it optimizes data in a columnar format for CPU and GPU analytical operations. Moreover, TensorFlow can directly read from Arrow streams (a producer-side sketch is given at the end of this subsection).
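On the Spark side, extracting coarse distribution traits could start from a SparkML sketch like the following; the Parquet path and the column names (label, f0, f1, ...) are hypothetical placeholders:

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("distribution-traits").getOrCreate()

# Hypothetical input: one row per training example, numeric feature columns
# f0, f1, ... and a class label column.
df = spark.read.parquet("hdfs:///datasets/imagenet_features.parquet")
features = VectorAssembler(
    inputCols=[c for c in df.columns if c.startswith("f")],
    outputCol="features",
).transform(df)

# Two sub-clusters, mirroring the CIFAR-10 proof of concept; in practice this
# would be run per class or with a larger k.
model = KMeans(k=2, seed=0, featuresCol="features").fit(features)
assigned = model.transform(features)  # adds a "prediction" column with the sub-cluster id
assigned.groupBy("label", "prediction").count().show()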
We aim to design the final system as a layer over the training cluster, which will encapsulate all the above technologies. Thus, a user will be able to easily benefit from our system.
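For the Arrow-based streaming, the producer half of a mini-batch stream could be sketched with pyarrow as below; the file sink and batch layout are illustrative assumptions, and the TensorFlow consumer side is not shown:

import numpy as np
import pyarrow as pa

def write_minibatch_stream(sink_path, minibatches):
    """Serialize a sequence of (features, labels) mini-batches as an Arrow IPC stream."""
    schema = pa.schema([("features", pa.list_(pa.float32())), ("labels", pa.int64())])
    with pa.OSFile(sink_path, "wb") as sink:
        with pa.ipc.new_stream(sink, schema) as writer:
            for features, labels in minibatches:
                batch = pa.record_batch(
                    [pa.array(list(features), type=pa.list_(pa.float32())),
                     pa.array(labels, type=pa.int64())],
                    schema=schema,
                )
                writer.write_batch(batch)

# e.g. write_minibatch_stream("/tmp/minibatches.arrow",
#          [(np.random.rand(32, 3072).astype("float32"), np.zeros(32, dtype="int64"))])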
4.2 Benchmarking Setup

Having implemented each of our components, we aim to benchmark how they affect the asynchronous distributed training process in terms of speed and the resulting training and validation metrics. As a baseline, we will use state-of-the-art neural networks trained under a parameter server setup with the optimal hyperparameters. For instance, we could train ResNet or Inception models using the ImageNet data set in a simple parameter server setup. These baseline models can then be compared with the ones trained taking the distribution traits into account.
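A minimal measurement harness of the kind we have in mind is sketched below; train_one_epoch and evaluate are placeholders for the actual routines of the compared setups (baseline versus distribution-aware):

import time

def run_trial(train_one_epoch, evaluate, num_epochs, tag):
    """Record wall-clock time and validation metrics per epoch for one setup."""
    history, start = [], time.time()
    for epoch in range(num_epochs):
        train_one_epoch()
        val_loss, val_acc = evaluate()
        history.append({"tag": tag, "epoch": epoch,
                        "elapsed_s": time.time() - start,
                        "val_loss": val_loss, "val_acc": val_acc})
    return history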
ACKNOWLEDGMENTS

This research has been co-financed by the European Union and Greek national funds through the Operational Program "Competitiveness, Entrepreneurship and Innovation", under the call RESEARCH – CREATE – INNOVATE (project code: T1EDK-04605).
REFERENCES
[1] Apache Arrow. [n.d.]. https://arrow.apache.org/
[2] DAWNBench. [n.d.]. https://dawn.cs.stanford.edu/benchmark/ImageNet/train.html
[3] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, and Michael Isard. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
[4] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning. 41–48.
[5] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
[6] Cody Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris Ré, and Matei Zaharia. 2017. DAWNBench: An end-to-end deep learning benchmark and competition. Training 100, 101 (2017), 102.
[7] Jason Jinquan Dai, Yiheng Wang, Xin Qiu, Ding Ding, Yao Zhang, Yanzhang Wang, Xianyan Jia, Cherry Li Zhang, Yan Wan, Zhichao Li, Jiao Wang, Shengsheng Huang, Zhongyuan Wu, Yang Wang, Yuhao Yang, Bowen She, Dongjie Shi, Qi Lu, Kai Huang, and Guoqiong Song. 2019. BigDL: A Distributed Deep Learning Framework for Big Data. In Proceedings of the ACM Symposium on Cloud Computing (Santa Cruz, CA, USA) (SoCC '19). Association for Computing Machinery, New York, NY, USA, 50–60. https://doi.org/10.1145/3357223.3362707
[8] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, et al. 2012. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1. 1223–1231.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
[10] Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and Priya Nagpurkar. 2018. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. In International Conference on Artificial Intelligence and Statistics. PMLR, 803–812.
[11] Jinyang Gao, Hosagrahar Visvesvaraya Jagadish, Wei Lu, and Beng Chin Ooi. 2014. DSH: Data sensitive hashing for high-dimensional k-NN search. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 1127–1138.
[12] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 6645–6649.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[15] Yuzhen Huang, Tatiana Jin, Yidi Wu, Zhenkun Cai, Xiao Yan, Fan Yang, Jinfeng Li, Yuying Guo, and James Cheng. 2018. FlexPS: Flexible Parallelism Control in Parameter Server Architecture. Proceedings of the VLDB Endowment 11, 5 (Jan. 2018), 566–579. https://doi.org/10.1145/3177732.3177734
[16] Jiawei Jiang, Bin Cui, Ce Zhang, and Lele Yu. 2017. Heterogeneity-aware distributed parameter servers. In Proceedings of the 2017 ACM International Conference on Management of Data. 463–478.
[17] Srikanth Kandula, Kukjin Lee, Surajit Chaudhuri, and Marc Friedman. 2019. Experiences with approximating queries in Microsoft's production big-data clusters. Proceedings of the VLDB Endowment 12, 12 (2019), 2131–2142.
[18] Evdokia Kassela, Nikodimos Provatas, Ioannis Konstantinou, Avrilia Floratou, and Nectarios Koziris. 2019. General-Purpose vs Specialized Data Analytics Systems: A Game of ML & SQL Thrones. In 2019 IEEE International Conference on Big Data (Big Data).
[19] Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Technical Report.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[21] Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. 2017. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 115–124.
[22] Alessandro Lulli, Matteo Dell'Amico, Pietro Michiardi, and Laura Ricci. 2016. NG-DBSCAN: Scalable density-based clustering for arbitrary data. Proceedings of the VLDB Endowment 10, 3 (2016), 157–168.
[23] Supun Nakandala, Yuhao Zhang, and Arun Kumar. 2020. Cerebro: A data system for optimized deep learning model selection. Proceedings of the VLDB Endowment 13, 12 (2020), 2159–2173.
[24] Luke Oakden-Rayner, Jared Dunnmon, Gustavo Carneiro, and Christopher Ré. 2020. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proceedings of the ACM Conference on Health, Inference, and Learning. 151–159.
[25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems. 8026–8037.
[26] Jose Picado, Arash Termehchy, and Sudhanshu Pathak. 2018. Learning efficiently over heterogeneous databases. Proceedings of the VLDB Endowment 11, 12 (2018), 2066–2069.
[27] Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016).
[28] Erich Schubert, Alexander Koos, Tobias Emrich, Andreas Züfle, Klaus Arthur Schmid, and Arthur Zimek. 2015. A Framework for Clustering Uncertain Data. Proceedings of the VLDB Endowment 8, 12 (Aug. 2015), 1976–1979. https://doi.org/10.14778/2824032.2824115
[29] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2818–2826. https://doi.org/10.1109/CVPR.2016.308
[30] Eric P. Xing, Qirong Ho, Wei Dai, Jin-Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. 2015. Petuum: A New Platform for Distributed Machine Learning on Big Data. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD '15). Association for Computing Machinery, New York, NY, USA, 1335–1344. https://doi.org/10.1145/2783258.2783323
[31] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: A Unified Engine for Big Data Processing. Commun. ACM 59, 11 (Oct. 2016), 56–65. https://doi.org/10.1145/2934664
[32] Y. Zhang, B. Gong, and M. Shah. 2016. Fast Zero-Shot Image Tagging. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5985–5994. https://doi.org/10.1109/CVPR.2016.644