=Paper=
{{Paper
|id=Vol-1558/paper40
|storemode=property
|title=Operator and Workflow Optimization for High-Performance Analytics
|pdfUrl=https://ceur-ws.org/Vol-1558/paper40.pdf
|volume=Vol-1558
|authors=Hans Vandierendonck,Karen L. Murphy,Mahwish Arif,Jiawen Sun,Dimitrios S. Nikolopoulos
|dblpUrl=https://dblp.org/rec/conf/edbt/VandierendonckM16
}}
==Operator and Workflow Optimization for High-Performance Analytics==
Operator and Workflow Optimization
for High-Performance Analytics
Hans Vandierendonck, Karen L. Murphy, Mahwish Arif,
∗
Jiawen Sun and Dimitrios S. Nikolopoulos
Queen’s University Belfast
Belfast, United Kingdom
{h.vandierendonck,k.l.murphy,m.arif,jsun03,d.nikolopoulos}@qub.ac.uk
ABSTRACT such less familiar to developers. Pitfalls exist that tend to
We make a case for studying the impact of intra-node par- make algorithms on sparse data sets memory-bound where
allelism on the performance of data analytics. We identify they could be more efficient and compute-bound. As such,
four performance optimizations that are enabled by an in- libraries with high-performance implementations of common
creasing number of processing cores on a chip. We discuss operators are provided [3, 6, 13].
the performance impact of these opimizations on two analyt- While individual operators are commonly provided by var-
ics operators and we identify how these optimizations affect ious libraries, a useful computation typically requires a work-
each another. flow that links together multiple operators. Very often, these
operators communicate through data stored on disk, which
induces redundant operations, such as I/O, involving disk
Keywords access, parsing and data conversions. An alternative solu-
Data analytics; high-performance analytics; intra-node par- tion is to create single binaries that encapsulate a complex
allelism workflow [14]. Such a solution has the potential of signifi-
cantly higher processing rates by avoiding unnessecary I/O.
This paper presents techniques for optimizing operators
1. INTRODUCTION and workflows in order to achieve high-performance ana-
In the era of big data, data analytics represent an increas- lytics. Within this area, we identify and characterize four
ing fraction of the processing cycles spent in data centres. widely applicable optimisations. 1. Assuming (highly) par-
While in classical data mangement and analytics, operators allel nodes (Moore’s Law predicts that all future nodes will
could be described within the bounds of the SQL domain- be increasingly more parallel), we demonstrate that utiliz-
specific language, this is no longer the case with data ana- ing parallelism within operators is extremely important to
lytics for big data. In big data, the operators are diverse, as improve performance. The motivation behind using par-
is the data they are operating on, and can involve any algo- allelism are the observations that (i) parallelism allows to
rithm to transform, classify or structure the data at hand. hide I/O latencies, and (ii) many analytics problems are
As the structure of the data and the associated operators compute-bound rather than I/O-bound, an observation that
are ill-defined, so is the scope of data analytics operators. goes against intuition [2, 11].
In order to achieve low processing times, operators require 2. Input and output contribute to a large proportion of
careful design and must be highly optimized. Where efficient the execution time due to the size of data sets and the poten-
SQL statements may, to a large extent, be written by domain tially low amount of computation performed per byte trans-
experts and automatically optimized, this is definitely not ferred. I/O operations benefit, however, also from intra-
the case for analytics operators. Moreover, optimization of node parallelism, allowing on the one hand to read indepen-
the operators is often counter-intuitive as it involves sparse dent files concurrently, and on the other hand overlapping
data sets, which are less deeply covered in text books and as data processing with disk and network access latency.
∗This work is supported by the European Community’s 3. While big data frameworks often steer towards dump-
Seventh Framework Programme (FP7/2007-2013) under ing intermediate data sets to disk, the overhead of I/O and
the ASAP project, grant agreement no. 619706, and storing intermediate data sets are significant. This overhead
by the United Kingdom EPSRC under grant agreement can be avoided by fusing operators in workflows into single
EP/L027402/1. executable images and by feeding the data from one operator
to the next.
4. The choice of internal data structures used in analytics
operators is determining for the performance of the opera-
tor. We demonstrate a 3.4 fold speedup by interchanging
one standardized data structure for another. However, this
⃝
c 2016, Copyright is with the authors. Published in the Workshop Pro- result depends also on the degree of intra-node parallelism
ceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bor-
deaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this utilized. As such the optimization problem is non-trivial.
paper is permitted under the terms of the Creative Commons license CC- The goal of this paper is to demonstrate the importance
by-nc-nd 4.0 of these considerations for implementing analytics queries.
MEDAL ’16 March 15, 2016, Bordeaux, France
8
Table 1: Data set description.
Self-‐Rela(ve
Speedup
7
Input Documents Bytes Distinct words
6
Mix 23432 62.8 MB 184743
5
NSF
abstracts
NSF Abstracts 101483 310.9 MB 267914
4
Mix
3
The remainder of this paper is structured as follows. In 2
Section 2 we discuss our system assumptions and software 1
environment. In Section 3 we analyse two typical operators, 0
one text processing and one numeric operator, and charac- 0
5
10
15
20
terize the impact achievable by the identified performance Number
of
Threads
optimizations. In Section 4 we discuss related work. Figure 1: Self-relative performance scalability of the
K-Means operator.
2. SETUP
our two datasets (Table 1). We use the algorithm to assign
We start our investigation at a small scale, focusing on
documents to one of 8 clusters based on their normalized
the activities on a single node as these allow us to better
TF/IDF scores.
understand the performance of operators and workflows.
The self-relative speedup shows how much performance
Performing analytics on a single node is important as a
is improved by utilizing multiple CPU cores. The speedup
single-node can be built with a large amount of working
obtained is sensitive to the data set operated on: The NSF
memory (up to 16 TB) and many processing cores (over a
Abstracts data set has about 100,000 documents and is sped
100). Such a system could efficiently process many real-
up nearly 8 times using intra-node parallelism. The Mix
world data sets. However, we expect that our conclusions
data set has around 23,000 documents, which is sufficient
remain valid when applied to scale-out systems, as optimiz-
only for a 2.5 x speedup. This effect is easily explained by
ing the performance of nodes in isolation is crucial to opti-
the parallel loops in K-means clustering, which are all loops
mize the system overall.
iterating over the documents. As the number of documents
To test the importance of the identified optimisations, we
grows, so does the parallel scalability.
implement two analytics operators in the Cilkplus exten-
The execution time of our implementation is furthermore
sion of C++, a programming language designed for high-
short in comparison to other implementations. We com-
performance and parallel computing at MIT, first developed
pared the execution time of our K-means clustering imple-
over two decades ago and continuously refined since then.
mentation against WEKA [3] (version 3.6.13). Using the
Cilkplus, now commercialized by Intel, supports the con-
“SimpleKMeans” algorithm, a single-threaded K-Means al-
struction of parallel tasks through language constructs that
gorithm, on the same data sets requires over 2 hours, after
express parallelism and vectorization (SIMDization) in an
which we aborted the execution. In contrast, executing our
easily accessible way. In the Cilkplus model, each thread
implementation sequentially required 3.3s and 40.9s for the
of computation is bound to a processing core. The princi-
Mix and NSF Abstracts data sets respectively. Note that
ples utilized should apply to other languages and parallel
while we did not see the execution of WEKA through to
constructs, e.g., Java streams.
the end, we have verified that our WEKA installation works
We study two operators: term frequency–inverse docu-
correctly on small data sets.
ment frequency (TF/IDF) and K-means clustering. TF/IDF
While our implementation is significantly faster than WEKA,
extracts words from text documents and rates the impor-
this is not automatic. Several key optimisations were re-
tance of a word on the basis of its frequency of occurence
quired to achieve the performance of our algorithm: (i) Us-
within a specific document as well as within the whole set
ing sparse vectors to represent inherently sparse data. (ii) Re-
of documents. K-means clustering is an unsupervized clas-
cycling data structures throughout the K-means iterations
sification technique that allows for the grouping of similar
to avoid redundant data copies and memory pressure. E.g.,
data items described as numeric vectors.
we do not create new objects during the iterations of the
K-means algorithm.
3. ANALYSIS The conclusion of this experiment is thus that (i) intra-
node parallelism is an important opportunity to accelerate
3.1 Intra-Node Parallelism data analytics, especially on larger data sets; (ii) the im-
Many problems in data mining are trivially compute-bound, plementation and the choice of data structures has a huge
especially learning algorithms using neural networks, sup- influence on execution time; (iii) parallelism can be exploited
port vector machines and the like, which utilize computa- without casting the algorithms in map/reduce form.
tionally demanding hyperbolic functions and can require
many iterations to train the model. It should go without 3.2 Parallel Input
saying that algorithms like these can be accelerated using A code that is well-optimized and where CPU is a bot-
high degrees of intra-node parallelism. tleneck can also benefit from parallelizing I/O operations.
K-means clustering is perhaps one of the cheapest un- Under these circumstances, CPU utilization is high and I/O
supervized learning algorithms. As such, we will use K- resources are underutilized, including local disk and network
means clustering to demonstrate that data analytics opera- resources. Intra-node parallelism can thus increase the uti-
tions benefit from intra-node parallelism. Figure 1 shows the lization of disk and network resources.
self-relative speedup of the K-means clustering algorithm on In this section we study the problem of calculating the
8
20
Execu&on
Time
(s)
Self-‐Rela(ve
Speedup
7
15
6
5
10
output
4
5
kmeans
3
NSF
abstracts
2
0
transform
Mix
u-‐map
map
u-‐map
map
u-‐map
map
u-‐map
map
u-‐map
map
1
input+wc
0
0
5
10
15
20
1
4
8
12
16
Number
of
Threads
Number
of
Threads
Figure 2: Self-relative parallel scalability of the
TF/IDF operator. Figure 4: Execution time of the TF/IDF–K-
Means workflow on the Mix input using a
std::unordered_map (u-map) or a std::map.
120
Execu&on
Time
(s)
100
80
output
As pointed out above, I/O is both costly and hard to par-
60
kmeans
allelize. As such, avoiding I/O is always a good optimization.
40
Figure 3 shows the execution time of the TF/IDF–K-Means
20
transform
workflow when executing the TF/IDF and K-Means opera-
0
kmeans-‐input
tors as discrete operators that communicate by storing the
discrete
merged
discrete
merged
discrete
merged
discrete
merged
discrete
merged
intermediate TF/IDF scores on disk, versus a merged opera-
9idf-‐output
tor without storage of the intermediates. The results clearly
1
4
8
12
16
input+wc
demonstrate that dumping data to disk has a high latency.
Number
of
Threads
In this experiment, the data is dumped to a local hard disk.
Both the output of the TF/IDF scores and the subsequent
Figure 3: Execution time of the TF/IDF–K-Means input are executed by a single thread because the file format
workflow when executing the TF/IDF and K-Means utilized (ARFF [3]) does not easily support parallel I/O. In
operators as discrete steps communicating through contrast, transforming the data when it is stored in-memory
file I/O, versus a merged operator with storage of is much faster and parallelizes well.
the TF/IDF scores. Uses the NSF Abstracts input. The presence of intra-node parallelism is an important
differentiator as to whether I/O bears much overhead or
not. On a single-threaded execution, I/O increases execu-
term frequency–inverse document frequency (TF/IDF) [10] tion time by 36.9%. On 16 threads, however, I/O makes the
property of a set of documents. Our implementation col- execution 3.84 times slower because it does not parallelize.
lects term frequencies (word counts) for each of the docu-
ments in the set. Moreover, a list of all unique terms across 3.4 Data Structures
the documents is constructed. This list is annotated with Algorithms use data structures to store input, output and
the number of documents where the word occurs. In a first internal data sets, The choice of these data structures impact
phase, the per-document term frequencies and the overall performance. In the case of TF/IDF, the key data struc-
term-document count properties are collected using dedi- tures are the dictionaries storing unique words and their
cated hash tables, mapping a word to a term frequency or frequencies. Figure 4 shows the execution time of TF/IDF–
an overall document count. In a second phase, we calcu- K-Means workflow on the Mix data set and a varying num-
late for each document the per-term TF/IDF score using ber of threads. Results for the larget NSF Abstracts data
the hash tables described above. For each document, a set are more dramatic.
sparse TF/IDF vector is constructed, sorted by term IDs The results demonstrate that the input and word-count
and written to the output file in Attribute-Relation File For- step (“input+wc” in Figure 4) is faster when using the std::map
mat (ARFF) format [3]. The first phase can be executed in data structure as opposed to the std::unordered_map data
parallel for each of the documents. The main limitation to structure. The first is implemented as a red-black tree, while
obtain speedup here is bandwidth to the storage system. the latter is implemented as a hash table. Moreover, the
The second phase is not parallelized as the ARFF format unordered map is pre-sized to hold 4K items to minimize
does not facilitate parallel output. resizing overhead.
While the TF/IDF problem is mainly concerned with data While reading documents and counting words is faster
input, tokenization and hash table operations, it benefits with a map, the subsequent data transformation step is
strongly from intra-node parallelism (Figure 2). It speeds slower using a map, especially on one thread. This follows as
up by nearly 6-fold for the Mix data set and by 7-fold for the the input and word-count phase is write-intensive, consist-
NSF Abstracts data set. Parallelizing output is important ing of frequent insertion of values in the dictionary. Inser-
as well. However, file formats are often designed in such a tion in the unordered map (a hash table) is inefficient due to
way that parallel I/O becomes hard. (i) resize operations, which requires re-hashing all elements,
(ii) memory pressure, as the array underlying the hash table
3.3 Workflow Fusion is by construction both sparse (to approximate O(1) opera-
tions) and very large (due to the data sets used). In contrast, intra-node parallelism. Moreover, several optimizations, such
the transformation step performs only lookups on the hash as avoiding I/O through workflow fusion and choice of data
table, which are known to be faster on the unordered map structure, are influenced by the presence and degree of intra-
O(1) as opposed to the map O(log n). node parallelism. This paper thus points out a new direction
However, the transformation step scales much better with for realizing high-performance analytics and identifies open
an increasing number of threads when using the map: it challenges.
scales to 6.1 x on 16 threads using the map, while it scales
only to 3.4x using the unordered map data structure. This is 6. REFERENCES
in part due to the memory consumption. In particular, using
[1] R. Chen and H. Chen. Tiled-mapreduce: Efficient and
the Mix data set, main memory consumption is 420MB with
flexible mapreduce processing on multicore with tiling.
the map, while it rises to 12.8 GB using the unordered map.
ACM Trans. Archit. Code Optim., 10(1):3:1–3:30,
Likewise, the output phase performs lookups only on the
Apr. 2013.
dictionaries and thus favours the unordered map. Moreover,
the output phase is hard to parallelize. [2] A. Crotty, A. Galakatos, K. Dursun, T. Kraska,
We conclude that selection of the internal data structures U. Çetintemel, and S. B. Zdonik. Tupleware: Big data,
has a significant impact on execution time. Moreover, dif- big analytics, small clusters. In Conf. on Innovative
ferent steps of a workflow may execute faster using different Data Systems Research (CIDR), page 7, Jan. 2015.
data structures. As such, the choice of internal data struc- [3] M. Hall, E. Frank, G. Holmes, B. Pfahringer,
ture must be taken judiciously, depending on the overall time P. Reutemann, and I. H. Witten. The weka data
taken by each step of the workflow and also on the extent mining software: An update. SIGKDD Explor. Newsl.,
to which each phase can be parallelized. 11(1):10–18, Nov. 2009.
[4] M. Han, K. Daudjee, K. Ammar, M. T. Özsu,
X. Wang, and T. Jin. An experimental comparison of
4. RELATED WORK pregel-like graph processing systems. Proc. VLDB
The performance of data analytics frameworks is an im- Endow., 7(12):1047–1058, Aug. 2014.
portant concern. Pavlo et al compare map/reduce systems [5] A. Kyrola, G. Blelloch, and C. Guestrin. Graphchi:
against distributed DBMSes and find interesting trade-offs Large-scale graph computation on just a pc. In OSDI
in performance between these approaches [8]. They find that pages 31–46, 2012.
map/reduce is easier to setup but in the end the DBMS was [6] Apache mahout: Scalable machine learning and data
more performant. mining. http://mahout.apache.org.
Ousterhout et al analyse real-life peta-scale workloads. [7] K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker,
They find that CPU is more often a bottleneck than I/O and B.-G. Chun. Making sense of performance in data
and that network performance has little impact on job com- analytics frameworks. In NSDI’15, pages 293–307,
pletion time [7]. Moreover, they find that straggler nodes 2015.
can be identified and that in most cases the cause for strag-
[8] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J.
gling can be identified.
DeWitt, S. Madden, and M. Stonebraker. A
Han et al perform a similar analysis for graph analytics
comparison of approaches to large-scale data analysis.
frameworks [4]. They identified several opportunities for
In SIGMOD Intl. Conf. on Management of Data,
improvement in these systems. In a similar study, Satish et
pages 165–178, 2009.
al [11] find that hand-optimized codes can outperform program-
[9] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski,
mer-friendly frameworks by up to 560-fold.
and C. Kozyrakis. Evaluating mapreduce for
Several authors have investigated analytics frameworks for
multi-core and multiprocessor systems. In HPCA,
shared-memory systems (single nodes), covering map/reduce
pages 13–24, 2007.
workloads [9, 1] and graph analytics [12]. Kyrola et al op-
timize graph analytics assuming that the graph fits on disk [10] G. Salton and M. J. McGill, editors. Introduction to
but not in main memory [5] Zhang et al optimize graph Modern Information Retrieval. Mcgraw-Hill, 1983.
analytics for non-uniform memory architectures [15]. [11] N. Satish et al. Navigating the maze of graph analytics
frameworks using massive graph datasets. In SIGMOD
Intl. Conf. on Management of Data, pages 979–990,
5. CONCLUSION 2014.
As data analytics are applied to increasingly larger data [12] J. Shun and G. E. Blelloch. Ligra: A lightweight graph
sets, it is increasingly important to study and optimize the processing framework for shared memory. In ACM
execution time of analytics operators. In this paper, we have PPoPP, pages 135–146, 2013.
studied in particular the impact of parallelism on the perfor- [13] Apache spark mllib.
mance of data analytics, and in particular intra-node paral- http://spark.apache.org/mllib/.
lelism, which presents an important opportunity as Moore’s [14] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma,
Law remains valid. M. McCauley, M. J. Franklin, S. Shenker, and
Through studying one text processing and one numeric I. Stoica. Resilient distributed datasets: A
operator, we identified four optimizations related to intra- fault-tolerant abstraction for in-memory cluster
node parallelism that we expect are widely applicable across computing. In NSDI, 2–2, 2012.
data analytics: intra-node parallel computation, parallel I/O, [15] K. Zhang, R. Chen, and H. Chen. NUMA-Aware
workflow optimization and selection of internal or intermedi- graph-structured analytics. In PPoPP pages 183–193,
ate data structures. We demonstrate that analytics queries 2015.
have strong potential for performance optimization through