=Paper=
{{Paper
|id=Vol-2971/paper02
|storemode=property
|title=Exploiting Data Distribution in Distributed Learning of Deep Classification Models under the Parameter Server Architecture
|pdfUrl=https://ceur-ws.org/Vol-2971/paper02.pdf
|volume=Vol-2971
|authors=Nikodimos Provatas
|dblpUrl=https://dblp.org/rec/conf/vldb/Provatas21
}}
==Exploiting Data Distribution in Distributed Learning of Deep Classification Models under the Parameter Server Architecture==
Nikodimos Provatas
supervised by Prof. Nectarios Koziris
National Technical University of Athens
Athens, Greece
nprov@cslab.ece.ntua.gr
Proceedings of the VLDB 2021 PhD Workshop, August 16th, 2021, Copenhagen, Denmark. Copyright (C) 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Nowadays, deep learning models are used in a wide range of applications, including classification and recognition tasks. The constant growth of data volumes has led to more complex model architectures for building neural network classifiers. Both the model complexity and the amount of data usually prohibit training on a single machine, due to time and memory limitations. Thus, distributed learning setups have been proposed to train deep networks when a vast amount of data is available. One common setup follows the parameter server approach, where worker tasks compute gradients to update the network stored in the servers, often in a synchronization-free manner. However, the lack of synchronization may harm the resulting model quality due to the effect of stale gradients, which are computed on older model versions. In this PhD research, we aim to explore how asynchronous learning could benefit from data preprocessing tasks that reveal hidden traits of the data distribution.

1 INTRODUCTION
In recent years, deep learning has become a widely used component of a variety of applications. For example, in the image processing domain, neural networks are widely used for classification [20] and tagging [32]. Deep models are also widely used in other domains, such as speech recognition [13] and text classification [21].

All the aforementioned applications are essentially classification tasks. Creating accurate classifiers requires large amounts of data, and as the volume of available data grows, more complex models are needed to represent the patterns it implies. Common examples of such complex architectures are ResNet [14] and Inception [29], used on the ImageNet [9] dataset.

Both the increase in model complexity and the amount of available data can prohibit model training on a single machine, since it would take many hours or days to create a generic and reliable model. For instance, without accelerators such as GPUs, Stanford's DAWNBench [6] reports more than 10 days to train a ResNet-152 model on the ImageNet dataset [2]. Thus, distributed architectures have been proposed to speed up the training process. Depending on what is distributed, multiple solutions can be adopted. The most common approach, especially when it comes to big data, is to distribute the available data across several workers, following data-parallel learning architectures [8]. The most prominent such setups are the parameter server [30] and the all-reduce [7] architectures. Widely used deep learning systems, such as Google TensorFlow [3], Apache MXNet [5] and PyTorch [25], have adopted distributed learning following one or more of these architectures.

While distributed learning enables faster deep neural network training, each distributed approach introduces new issues that might harm either the training speed or the quality of the resulting model. For example, all-reduce approaches introduce synchronization overheads on each training step. The parameter server architecture avoids such overheads, since workers usually operate asynchronously [18], but stale gradient effects may then occur, which can delay model convergence or cause the training loss to diverge [23]. However, we believe that if the data are used wisely, such phenomena can be overcome.

In this PhD research, we focus on how the distributed deep learning process could benefit from the distribution of the training data, especially under the parameter server architecture. Data preprocessing techniques can be used to obtain a-priori knowledge of the data domain, which could benefit the training process. We aim to study and propose systematic ways of assigning data to the available training worker nodes. Furthermore, we will study whether random or algorithmic access patterns over the data are preferable during training, focusing on the distributed case. With such techniques, we aim to make training less sensitive to the undesirable effects that appear in asynchronous distributed learning setups.

The rest of this paper is organized in four sections. Section 2 presents the background knowledge necessary to understand the paper. Section 3 follows with a discussion of how prior knowledge of the data distribution could benefit the learning process, especially under a parameter server setup. Finally, Section 4 outlines the steps that will be followed to complete this research.

2 BACKGROUND

2.1 Optimization Related Preliminaries
In the context of classification problems [12], a neural network with weights represented by a vector w is considered to approximate a function f : R^n → R that, given an input feature vector x, can be used to classify it to a category y, i.e., y = f(x; w). Given a set of feature vectors x_1, x_2, ..., x_n and their corresponding category labels y_1, y_2, ..., y_n, the function f describing the neural network can be identified by finding an appropriate vector w as the solution of an optimization problem on a loss function L.
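For concreteness, this optimization problem and the update rules discussed next can be written in the following standard form; the per-example loss ℓ and the learning rate η are standard notation, not symbols introduced by the paper.

```latex
% Empirical risk over the weight vector w (standard formulation)
w^{*} = \arg\min_{w} L(w), \qquad
L(w) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i; w),\, y_i\big)

% Gradient Descent step over the full data set, with learning rate \eta
w_{t+1} = w_t - \eta \, \nabla_w L(w_t)

% Mini-batch SGD step over a mini-batch B_t of B examples
w_{t+1} = w_t - \frac{\eta}{B} \sum_{i \in B_t} \nabla_w \, \ell\big(f(x_i; w_t),\, y_i\big)
```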
Gradient Descent is considered among the most popular algorithms for such optimization problems [27]. However, considering both the size of deep neural networks and the vast amount of data usually available, Gradient Descent is not preferred, since each iteration would be too slow. Mini-Batch Stochastic Gradient Descent (mini-batch SGD) is used as an alternative, which uses only a subset of the training examples in each training iteration.

While an iteration of mini-batch SGD is faster than one of Gradient Descent, it usually needs more iterations to converge. Figure 1 presents a contour plot that outlines how the gradients move the weights towards the optimization point. Gradient Descent takes the whole data distribution into account in each training step and thus moves continuously towards the optimization point. Mini-batch SGD, however, computes the gradient of a training step with only B examples, directing the weights in various directions before converging. Neither algorithm can guarantee convergence to the global minimization point, as the optimization functions in neural network training are non-convex and the weights may get stuck in local minima. Alternatives such as Adam and AdaGrad have been proposed as less vulnerable to such phenomena [27].

Figure 1: Contour plot outlining Gradient Descent and Mini-Batch Gradient Descent while converging to a local minimum point.

2.2 Parameter Server Architecture
Following a data parallel scheme, the parameter server architecture [15, 30] introduces two different entities in the learning process: the workers and the parameter servers. Parameter servers store the neural network parameters in a distributed fashion. Workers use local copies of the network and a local part of the data to compute gradients based on some variant of the mini-batch SGD algorithm. While gradients can be aggregated under various synchronization schemes, the parameter server usually follows an asynchronous parallel approach for updating the global model in the servers. The steps of training a model under this setup are depicted in Figure 2: each worker (1) receives the model parameters from the parameter servers, (2) extracts a local mini-batch from its local data shard, (3) computes local gradients, and (4) sends the local gradients to the parameter servers, which update the global model with the received gradients.

Figure 2: Training under the Parameter Server Architecture.
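The four worker steps of Figure 2 can be illustrated with a minimal single-process simulation; the NumPy toy model and all function names below are ours and only stand in for a real distributed implementation.

```python
# Minimal sketch of the worker/server steps of Figure 2, simulated in one
# process with NumPy on a toy linear model (illustrative assumption, not one
# of the systems discussed in the paper).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))            # local data shard of one worker
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000)

global_w = np.zeros(20)                    # state held by the "parameter server"

def pull_parameters():                     # step 1: receive model parameters
    return global_w.copy()

def local_gradient(w, batch_size=32):      # steps 2-3: sample mini-batch, compute gradient
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    residual = xb @ w - yb                 # gradient of the mean squared error
    return xb.T @ residual / batch_size

def push_gradient(grad, lr=0.05):          # step 4 + server side: apply the update
    global global_w
    global_w -= lr * grad

for step in range(200):                    # one worker's training loop
    w_local = pull_parameters()
    g = local_gradient(w_local)
    push_gradient(g)
```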
3 EXPLOITING DATA DISTRIBUTION IN LEARNING
In this section, we present how we could exploit distribution related traits that emerge from data preprocessing for the learning process, especially in the distributed case.

3.1 Exploiting data distribution in single node training
As mentioned in Section 2.1, mini-batch SGD does not move the weights of the neural network directly towards the minimization point, due to the restricted view it has of the data in each iteration, whereas Gradient Descent is able to move directly towards the optimization point. A usual approach to mitigate this when training deep learning models is to randomly shuffle the data before each mini-batch extraction, in order to obtain a mini-batch with less correlated data [12].

However, it is reasonable to ask what will happen if the mini-batch is chosen such that it is actually representative of the whole dataset. Will this result in a more accurate model or a faster training process compared to the random sampling techniques commonly used? Is it worth preprocessing the training data in order to understand their structure and guide the sampling process during mini-batch selection?

Bengio et al. proposed Curriculum Learning [4] as an approach in this direction and showed that training converges to better local minima when traits of the data are used to guide the network training process. For instance, in an image classification case, only easily distinguished examples are used at first, and more complex images are included later.

In this PhD research, we aim to focus on how to select the mini-batch of each iteration so that it is representative of the whole data set and boosts network training. As a proof of concept, we clustered each class of the CIFAR-10 [19] dataset into two sub-clusters and chose each mini-batch to include data from every resulting sub-cluster of all classes. In this example, we observed a 5% improvement in the validation loss and a 2% improvement in the validation error. We further aim to examine whether we could benefit from real-time training metrics in order to select the training examples for each upcoming mini-batch.
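The proof of concept above can be sketched as follows. The synthetic stand-in data and the helper names are ours; in the actual experiment, CIFAR-10 features take the place of the stand-in arrays.

```python
# Sketch of sub-cluster-stratified mini-batch construction: each class is
# split into two sub-clusters with k-means, and every mini-batch draws
# examples from every (class, sub-cluster) group.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 64))              # stand-in for flattened images
y = rng.integers(0, 10, size=5000)           # ten classes, as in CIFAR-10

def build_groups(X, y, subclusters_per_class=2):
    """Split each class into sub-clusters and return index arrays per group."""
    groups = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        labels = KMeans(n_clusters=subclusters_per_class, n_init=10,
                        random_state=0).fit_predict(X[idx])
        for s in range(subclusters_per_class):
            groups.append(idx[labels == s])
    return groups

def representative_minibatch(groups, batch_size=100):
    """Draw (roughly) equally from every sub-cluster of every class."""
    per_group = max(1, batch_size // len(groups))
    picks = [rng.choice(g, size=min(per_group, len(g)), replace=False)
             for g in groups]
    return np.concatenate(picks)

groups = build_groups(X, y)                   # 10 classes x 2 sub-clusters = 20 groups
batch_idx = representative_minibatch(groups)  # indices of one representative mini-batch
```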
3.2 Exploiting data distribution in parameter server training
Stale Gradients Effect. As stated in Section 1, parameter server training is usually harmed by the stale gradients effect. Stale gradients occur when a worker computes a gradient update using old model parameters. Dutta et al. approached the staleness problem by proposing an appropriate variable learning rate [10]. In 2017, Jiang et al. also addressed staleness with learning rate techniques in a heterogeneous environment [16]. Moreover, in 2018, Huang et al. proposed FlexPS [15], which uses a staleness parameter controlling the aging of the parameters to avoid staleness effects.
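To illustrate the kind of mitigation these works pursue, the following sketch damps the learning rate by the staleness of each incoming gradient. This is a simple illustrative scheme in the spirit of the cited learning-rate approaches, not a reimplementation of any of them.

```python
# Illustrative staleness-aware update on the server side: the learning rate is
# damped by the gradient's staleness, i.e. the number of global updates applied
# since the worker pulled its parameters.
import numpy as np

global_w = np.zeros(20)
global_version = 0
base_lr = 0.05

def apply_gradient(grad, worker_version):
    """Apply a worker gradient, scaling the step by its staleness."""
    global global_w, global_version
    staleness = global_version - worker_version   # 0 if the gradient is fresh
    lr = base_lr / (1.0 + staleness)               # older gradients take smaller steps
    global_w -= lr * grad
    global_version += 1
    return lr

lr1 = apply_gradient(np.ones(20), worker_version=0)  # staleness 0 -> full base_lr
lr2 = apply_gradient(np.ones(20), worker_version=0)  # staleness 1 -> base_lr / 2
```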
Better data assignment to workers. In the research works stated above, algorithmic solutions at the parameter server or the learning rate level are proposed to smooth the staleness effect. However, we believe that if the part of the data each worker can access is not representative of the whole dataset, the staleness effect may be further aggravated, since a common approach is to randomly shard the data across workers. For instance, TensorFlow uses a modular sharding approach based on the training example index to assign examples to workers.
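The index-based behaviour described above can be observed with tf.data's shard transformation; the worker count and toy dataset below are illustrative, and real pipelines may configure sharding differently (e.g. through auto-sharding policies).

```python
# Index-based (modular) sharding with tf.data: example i goes to worker
# i mod num_workers, regardless of its class.
import tensorflow as tf

num_workers = 4
dataset = tf.data.Dataset.range(20)                 # stand-in for an indexed training set

for worker_index in range(num_workers):
    shard = dataset.shard(num_shards=num_workers, index=worker_index)
    print(worker_index, list(shard.as_numpy_iterator()))
# Worker 0 gets elements 0, 4, 8, ...; worker 1 gets 1, 5, 9, ...; and so on.
# Nothing in this assignment looks at class labels, so per-worker class balance
# is left to chance.
```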
To further support this claim, we discuss an example based on an ImageNet subset with images from Flickr (approximately 60GB in size). Figure 3 presents a histogram of the image population per class. We can observe that most of the classes consist of approximately 100 images, while some of them contain more than 1000 (and even more than 1500) images. Thus, it is possible that a random data assignment could leave a worker without data from some of the less populated classes, or bias another worker towards a highly populated class (data skew on some workers). Having trained on a stale parameter set in some iteration, such a worker could direct the weights not towards the true optimization point, but possibly towards one that better optimizes its own part of the data, due to its lack of knowledge of the rest of the data space. In this research, we aim to study whether stratification in data sharding to workers and in mini-batch selection per worker (at the class level or at a hidden level according to the distribution) can be used to smooth the effects of staleness.

Figure 3: Class population histogram of the ImageNet data obtained from Flickr (number of classes per number of images in a class).
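In contrast to index-based sharding, a class-stratified assignment can be sketched as follows; the function and the skewed toy labels are ours.

```python
# Sketch of class-stratified sharding: examples of every class are dealt
# round-robin across workers, so each shard keeps roughly the global class
# proportions even for rare classes.
import numpy as np

def stratified_shards(labels, num_workers, seed=0):
    """Return, for each worker, the array of example indices assigned to it."""
    rng = np.random.default_rng(seed)
    shards = [[] for _ in range(num_workers)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        for position, example in enumerate(idx):
            shards[position % num_workers].append(example)
    return [np.array(s) for s in shards]

# Skewed toy label distribution: class 0 is rare, class 2 is very common.
labels = np.repeat([0, 1, 2], [30, 300, 1500])
for w, shard in enumerate(stratified_shards(labels, num_workers=4)):
    counts = np.bincount(labels[shard], minlength=3)
    print(f"worker {w}: class counts {counts}")   # every worker sees every class
```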
Stratification is widely used when a computing task cannot view the data in their entirety, as for example in approximate query processing [17]. In the context of learning, it is also used to facilitate learning over heterogeneous databases [26]. Moreover, in a 2020 study [24], hidden stratification appeared to crucially affect the quality of classification models for medical images.

Extracting data distribution related information. Hidden stratification can be used to reveal how the data are organized in the distribution. A common approach to discovering hidden patterns in the data distribution is the use of unsupervised learning techniques, such as clustering. In the big data context, multiple clustering techniques have been proposed. For instance, [28] designs a clustering framework for big data that is able to discover multiple distribution types, while others propose approximate and distributed versions of common clustering algorithms, such as DBSCAN [22]. Apart from clustering, it would also be useful to utilize functions that group similar points together, in the same manner as hash functions do. In this direction, and in the spirit of Locality Sensitive Hashing, Gao et al. proposed Data Sensitive Hashing (DSH) [11], which exploits the data distribution to hash close data points in a high-dimensional space together.
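As an illustration of hashing-based grouping, the following sketch assigns points to buckets with random projections, i.e. plain Locality Sensitive Hashing; DSH [11] goes further by adapting the hash functions to the data distribution, which is not shown here.

```python
# Grouping nearby points with random-projection hashing, in the spirit of
# Locality Sensitive Hashing for Euclidean distance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 128))            # high-dimensional points

num_bits, bucket_width = 8, 2.0
projections = rng.normal(size=(X.shape[1], num_bits))
offsets = rng.uniform(0, bucket_width, size=num_bits)

# Each point gets a small integer code per projection; equal codes mean the
# points fall into the same bucket and are therefore likely to be close.
codes = np.floor((X @ projections + offsets) / bucket_width).astype(np.int64)

buckets = {}
for i, code in enumerate(map(tuple, codes)):
    buckets.setdefault(code, []).append(i)

print(len(buckets), "buckets for", len(X), "points")
```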
4 RESEARCH PLAN
Having presented the concept and the ideas behind this PhD research, we have designed a plan to conduct it, outlined below.
• Measure the effects of mini-batch design in single node training. The first part of our work is to propose and study efficient techniques that take distribution traits into account to systematically construct the mini-batches used for neural network training as representatives of the whole data set. For data sets with numerous classes, such as ImageNet, where representative mini-batches of commonly used sizes cannot be created, we plan to randomly omit different parts of the data from each mini-batch. Having experimented with the distribution traits, we further plan to take the real-time training and validation metrics into account when creating the next mini-batch. For instance, if the model presents large loss values on some examples, we could attempt to provide the next mini-batch with more data following an equivalent distribution (see the sketch after this list).
• Study and evaluate techniques to extract distribution traits from big data sets. A first approach to identify how data points are organized in the multidimensional space is with the help of clustering algorithms. However, since we want to focus on big data and distributed learning, we have to compare various techniques that can efficiently discover distribution related information, such as the DSH approach stated earlier. Having studied existing approaches to this problem, we will attempt to design and propose our own method to efficiently compute any necessary information.
• Apply distribution related information in data sharding. In order to create representative data shards, our first goal is to consider class stratification and identify whether it is able to facilitate the learning process. Moreover, having efficiently discovered any necessary distribution traits, the next part of our research aims to exploit them in the shard creation process. The knowledge of stratification and distribution traits derived from the data is expected to further help parameter server training.
• Design and propose a streaming system for serving mini-batches to workers in the parameter server setup. Our research will conclude with the design and implementation of a system that will collect data information and use the distribution information extraction mechanisms to efficiently prepare mini-batches for the workers to train on in the parameter server setup. This system will be created taking into consideration the observations from the steps described above.
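As a sketch of the loss-driven selection mentioned in the first item above, mini-batch slots can be allocated in proportion to the recent loss of each group; the proportional weighting and all names are illustrative assumptions of ours.

```python
# Bias the next mini-batch towards the groups (e.g. sub-clusters) on which the
# model currently shows high loss; the weighting scheme is illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def loss_aware_minibatch(group_indices, last_losses, batch_size=128):
    """group_indices: list of index arrays, one per (class, sub-cluster) group.
    last_losses: most recent per-example loss values."""
    mean_loss = np.array([last_losses[g].mean() for g in group_indices])
    weights = mean_loss / mean_loss.sum()             # harder groups get more slots
    slots = np.maximum(1, np.round(weights * batch_size).astype(int))
    picks = [rng.choice(g, size=min(s, len(g)), replace=False)
             for g, s in zip(group_indices, slots)]
    return np.concatenate(picks)

# Toy usage: three groups, the second currently has a much larger loss.
groups = [np.arange(0, 400), np.arange(400, 800), np.arange(800, 1200)]
losses = np.concatenate([np.full(400, 0.2), np.full(400, 1.5), np.full(400, 0.3)])
batch = loss_aware_minibatch(groups, losses)
```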
4.1 Technologies
Having presented our research plan, we briefly describe the state-of-the-art technologies we plan to use in order to build the various components.
• Apache Spark [31] is a widely used general purpose big data system. Among other libraries, Spark provides SparkML, which offers clustering algorithms that we could exploit to identify initial data distribution traits. Moreover, Spark can easily be used to compute any other interesting metrics we might need to consider.
• Regarding neural network training, we aim to use Google TensorFlow [3], which also operates under the parameter server architecture.
• For streaming mini-batches we aim to examine the use of Apache Arrow [1], since it organizes data in a columnar format suited to CPU and GPU analytical operations. Moreover, TensorFlow can directly read from Arrow streams (see the sketch after this list).
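As a sketch of how a mini-batch could be packaged as an Arrow stream, the following uses pyarrow's IPC stream format. The column layout is an assumption of ours, and the consumer side is shown with pyarrow itself; feeding such a stream to TensorFlow would go through its Arrow dataset integration, as noted above.

```python
# Serving one mini-batch as an Arrow record-batch stream with pyarrow.
import numpy as np
import pyarrow as pa

rng = np.random.default_rng(0)
features = rng.normal(size=(128, 32)).astype(np.float32)   # one mini-batch
labels = rng.integers(0, 10, size=128)

batch = pa.RecordBatch.from_arrays(
    [pa.array(features.tolist()), pa.array(labels)],        # list column + int column
    names=["features", "label"],
)

sink = pa.BufferOutputStream()                              # producer side
writer = pa.ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()

reader = pa.ipc.open_stream(sink.getvalue())                # consumer side
table = reader.read_all()
print(table.num_rows, table.schema.names)
```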
We aim to design the final system as a layer over the training cluster, encapsulating all the above technologies, so that a user can easily benefit from it.

4.2 Benchmarking Setup
Having implemented each of our components, we aim to benchmark how they affect the asynchronous distributed training process in terms of speed and of the resulting training and validation metrics. As a baseline, we will use state-of-the-art neural networks trained under a parameter server setup with optimal hyperparameters. For instance, we could train ResNet or Inception models on the ImageNet data set in a plain parameter server setup. These baseline models can then be compared with the ones trained taking the distribution traits into account.

ACKNOWLEDGMENTS
This research has been co-financed by the European Union and Greek national funds through the Operational Program "Competitiveness, Entrepreneurship and Innovation", under the call RESEARCH – CREATE – INNOVATE (project code: T1EDK-04605).
REFERENCES
[1] [n.d.]. https://arrow.apache.org/
[2] [n.d.]. DAWNBench. https://dawn.cs.stanford.edu/benchmark/ImageNet/train.html
[3] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, and Michael Isard. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
[4] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning. 41–48.
[5] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
[6] Cody Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris Ré, and Matei Zaharia. 2017. DAWNBench: An end-to-end deep learning benchmark and competition. Training 100, 101 (2017), 102.
[7] Jason Jinquan Dai, Yiheng Wang, Xin Qiu, Ding Ding, Yao Zhang, Yanzhang Wang, Xianyan Jia, Cherry Li Zhang, Yan Wan, Zhichao Li, Jiao Wang, Shengsheng Huang, Zhongyuan Wu, Yang Wang, Yuhao Yang, Bowen She, Dongjie Shi, Qi Lu, Kai Huang, and Guoqiong Song. 2019. BigDL: A Distributed Deep Learning Framework for Big Data. In Proceedings of the ACM Symposium on Cloud Computing (Santa Cruz, CA, USA) (SoCC '19). Association for Computing Machinery, New York, NY, USA, 50–60. https://doi.org/10.1145/3357223.3362707
[8] Jeffrey Dean, Greg S Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V Le, Mark Z Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, et al. 2012. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1. 1223–1231.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
[10] Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and Priya Nagpurkar. 2018. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. In International Conference on Artificial Intelligence and Statistics. PMLR, 803–812.
[11] Jinyang Gao, Hosagrahar Visvesvaraya Jagadish, Wei Lu, and Beng Chin Ooi. 2014. DSH: data sensitive hashing for high-dimensional k-NN search. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 1127–1138.
[12] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT Press.
[13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 6645–6649.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[15] Yuzhen Huang, Tatiana Jin, Yidi Wu, Zhenkun Cai, Xiao Yan, Fan Yang, Jinfeng Li, Yuying Guo, and James Cheng. 2018. FlexPS: Flexible Parallelism Control in Parameter Server Architecture. Proceedings of the VLDB Endowment 11, 5 (Jan. 2018), 566–579. https://doi.org/10.1145/3177732.3177734
[16] Jiawei Jiang, Bin Cui, Ce Zhang, and Lele Yu. 2017. Heterogeneity-aware distributed parameter servers. In Proceedings of the 2017 ACM International Conference on Management of Data. 463–478.
[17] Srikanth Kandula, Kukjin Lee, Surajit Chaudhuri, and Marc Friedman. 2019. Experiences with approximating queries in Microsoft's production big-data clusters. Proceedings of the VLDB Endowment 12, 12 (2019), 2131–2142.
[18] Evdokia Kassela, Nikodimos Provatas, Ioannis Konstantinou, Avrilia Floratou, and Nectarios Koziris. 2019. General-Purpose vs Specialized Data Analytics Systems: A Game of ML & SQL Thrones. In 2019 IEEE International Conference on Big Data (Big Data).
[19] Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Technical Report.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
[21] Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. 2017. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 115–124.
[22] Alessandro Lulli, Matteo Dell'Amico, Pietro Michiardi, and Laura Ricci. 2016. NG-DBSCAN: scalable density-based clustering for arbitrary data. Proceedings of the VLDB Endowment 10, 3 (2016), 157–168.
[23] Supun Nakandala, Yuhao Zhang, and Arun Kumar. 2020. Cerebro: A data system for optimized deep learning model selection. Proceedings of the VLDB Endowment 13, 12 (2020), 2159–2173.
[24] Luke Oakden-Rayner, Jared Dunnmon, Gustavo Carneiro, and Christopher Ré. 2020. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proceedings of the ACM conference on health, inference, and learning. 151–159.
[25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems. 8026–8037.
[26] Jose Picado, Arash Termehchy, and Sudhanshu Pathak. 2018. Learning efficiently over heterogeneous databases. Proceedings of the VLDB Endowment 11, 12 (2018), 2066–2069.
[27] Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016).
[28] Erich Schubert, Alexander Koos, Tobias Emrich, Andreas Züfle, Klaus Arthur Schmid, and Arthur Zimek. 2015. A Framework for Clustering Uncertain Data. Proceedings of the VLDB Endowment 8, 12 (Aug. 2015), 1976–1979. https://doi.org/10.14778/2824032.2824115
[29] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2818–2826. https://doi.org/10.1109/CVPR.2016.308
[30] Eric P. Xing, Qirong Ho, Wei Dai, Jin-Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. 2015. Petuum: A New Platform for Distributed Machine Learning on Big Data. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD '15). Association for Computing Machinery, New York, NY, USA, 1335–1344. https://doi.org/10.1145/2783258.2783323
[31] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: A Unified Engine for Big Data Processing. Commun. ACM 59, 11 (Oct. 2016), 56–65. https://doi.org/10.1145/2934664
[32] Y. Zhang, B. Gong, and M. Shah. 2016. Fast Zero-Shot Image Tagging. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5985–5994. https://doi.org/10.1109/CVPR.2016.644