Operator and Workflow Optimization
                              for High-Performance Analytics

                               Hans Vandierendonck, Karen L. Murphy, Mahwish Arif,
                                                                            ∗
                                   Jiawen Sun and Dimitrios S. Nikolopoulos
                                                        Queen’s University Belfast
                                                         Belfast, United Kingdom
                  {h.vandierendonck,k.l.murphy,m.arif,jsun03,d.nikolopoulos}@qub.ac.uk

ABSTRACT                                                                 such less familiar to developers. Pitfalls exist that tend to
We make a case for studying the impact of intra-node par-                make algorithms on sparse data sets memory-bound where
allelism on the performance of data analytics. We identify               they could be more eﬃcient and compute-bound. As such,
four performance optimizations that are enabled by an in-                libraries with high-performance implementations of common
creasing number of processing cores on a chip. We discuss                operators are provided [3, 6, 13].
the performance impact of these opimizations on two analyt-                 While individual operators are commonly provided by var-
ics operators and we identify how these optimizations aﬀect              ious libraries, a useful computation typically requires a work-
each another.                                                            flow that links together multiple operators. Very often, these
                                                                         operators communicate through data stored on disk, which
                                                                         induces redundant operations, such as I/O, involving disk
Keywords                                                                 access, parsing and data conversions. An alternative solu-
Data analytics; high-performance analytics; intra-node par-              tion is to create single binaries that encapsulate a complex
allelism                                                                 workflow [14]. Such a solution has the potential of signifi-
                                                                         cantly higher processing rates by avoiding unnessecary I/O.
                                                                            This paper presents techniques for optimizing operators
1. INTRODUCTION                                                          and workflows in order to achieve high-performance ana-
   In the era of big data, data analytics represent an increas-          lytics. Within this area, we identify and characterize four
ing fraction of the processing cycles spent in data centres.             widely applicable optimisations. 1. Assuming (highly) par-
While in classical data mangement and analytics, operators               allel nodes (Moore’s Law predicts that all future nodes will
could be described within the bounds of the SQL domain-                  be increasingly more parallel), we demonstrate that utiliz-
specific language, this is no longer the case with data ana-             ing parallelism within operators is extremely important to
lytics for big data. In big data, the operators are diverse, as          improve performance. The motivation behind using par-
is the data they are operating on, and can involve any algo-             allelism are the observations that (i) parallelism allows to
rithm to transform, classify or structure the data at hand.              hide I/O latencies, and (ii) many analytics problems are
As the structure of the data and the associated operators                compute-bound rather than I/O-bound, an observation that
are ill-defined, so is the scope of data analytics operators.            goes against intuition [2, 11].
   In order to achieve low processing times, operators require              2. Input and output contribute to a large proportion of
careful design and must be highly optimized. Where eﬃcient               the execution time due to the size of data sets and the poten-
SQL statements may, to a large extent, be written by domain              tially low amount of computation performed per byte trans-
experts and automatically optimized, this is definitely not              ferred. I/O operations benefit, however, also from intra-
the case for analytics operators. Moreover, optimization of              node parallelism, allowing on the one hand to read indepen-
the operators is often counter-intuitive as it involves sparse           dent files concurrently, and on the other hand overlapping
data sets, which are less deeply covered in text books and as            data processing with disk and network access latency.
∗This work is supported by the European Community’s                         3. While big data frameworks often steer towards dump-
Seventh Framework Programme (FP7/2007-2013) under                        ing intermediate data sets to disk, the overhead of I/O and
the ASAP project, grant agreement no. 619706, and                        storing intermediate data sets are significant. This overhead
by the United Kingdom EPSRC under grant agreement                        can be avoided by fusing operators in workflows into single
EP/L027402/1.                                                            executable images and by feeding the data from one operator
                                                                         to the next.
                                                                            4. The choice of internal data structures used in analytics
                                                                         operators is determining for the performance of the opera-
                                                                         tor. We demonstrate a 3.4 fold speedup by interchanging
                                                                         one standardized data structure for another. However, this
⃝
c 2016, Copyright is with the authors. Published in the Workshop Pro-    result depends also on the degree of intra-node parallelism
ceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bor-
deaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this     utilized. As such the optimization problem is non-trivial.
paper is permitted under the terms of the Creative Commons license CC-      The goal of this paper is to demonstrate the importance
by-nc-nd 4.0                                                             of these considerations for implementing analytics queries.
MEDAL ’16 March 15, 2016, Bordeaux, France
                                                                                                      8	
  
          Table 1: Data set description.


                                                                      Self-­‐Rela(ve	
  Speedup	
  
                                                                                                      7	
  
 Input          Documents Bytes          Distinct words
                                                                                                      6	
  
 Mix                 23432    62.8 MB             184743
                                                                                                      5	
                                                   NSF	
  abstracts	
  
 NSF Abstracts      101483 310.9 MB               267914
                                                                                                      4	
  
                                                                                                                                                            Mix	
  
                                                                                                      3	
  
The remainder of this paper is structured as follows. In                                              2	
  
Section 2 we discuss our system assumptions and software                                              1	
  
environment. In Section 3 we analyse two typical operators,                                           0	
  
one text processing and one numeric operator, and charac-                                                     0	
     5	
                10	
                 15	
                 20	
  
terize the impact achievable by the identified performance                                                                    Number	
  of	
  Threads	
  
optimizations. In Section 4 we discuss related work.            Figure 1: Self-relative performance scalability of the
                                                                K-Means operator.
2. SETUP
                                                                our two datasets (Table 1). We use the algorithm to assign
   We start our investigation at a small scale, focusing on
                                                                documents to one of 8 clusters based on their normalized
the activities on a single node as these allow us to better
                                                                TF/IDF scores.
understand the performance of operators and workflows.
                                                                   The self-relative speedup shows how much performance
   Performing analytics on a single node is important as a
                                                                is improved by utilizing multiple CPU cores. The speedup
single-node can be built with a large amount of working
                                                                obtained is sensitive to the data set operated on: The NSF
memory (up to 16 TB) and many processing cores (over a
                                                                Abstracts data set has about 100,000 documents and is sped
100). Such a system could eﬃciently process many real-
                                                                up nearly 8 times using intra-node parallelism. The Mix
world data sets. However, we expect that our conclusions
                                                                data set has around 23,000 documents, which is suﬃcient
remain valid when applied to scale-out systems, as optimiz-
                                                                only for a 2.5 x speedup. This eﬀect is easily explained by
ing the performance of nodes in isolation is crucial to opti-
                                                                the parallel loops in K-means clustering, which are all loops
mize the system overall.
                                                                iterating over the documents. As the number of documents
   To test the importance of the identified optimisations, we
                                                                grows, so does the parallel scalability.
implement two analytics operators in the Cilkplus exten-
                                                                   The execution time of our implementation is furthermore
sion of C++, a programming language designed for high-
                                                                short in comparison to other implementations. We com-
performance and parallel computing at MIT, first developed
                                                                pared the execution time of our K-means clustering imple-
over two decades ago and continuously refined since then.
                                                                mentation against WEKA [3] (version 3.6.13). Using the
Cilkplus, now commercialized by Intel, supports the con-
                                                                “SimpleKMeans” algorithm, a single-threaded K-Means al-
struction of parallel tasks through language constructs that
                                                                gorithm, on the same data sets requires over 2 hours, after
express parallelism and vectorization (SIMDization) in an
                                                                which we aborted the execution. In contrast, executing our
easily accessible way. In the Cilkplus model, each thread
                                                                implementation sequentially required 3.3s and 40.9s for the
of computation is bound to a processing core. The princi-
                                                                Mix and NSF Abstracts data sets respectively. Note that
ples utilized should apply to other languages and parallel
                                                                while we did not see the execution of WEKA through to
constructs, e.g., Java streams.
                                                                the end, we have verified that our WEKA installation works
   We study two operators: term frequency–inverse docu-
                                                                correctly on small data sets.
ment frequency (TF/IDF) and K-means clustering. TF/IDF
                                                                   While our implementation is significantly faster than WEKA,
extracts words from text documents and rates the impor-
                                                                this is not automatic. Several key optimisations were re-
tance of a word on the basis of its frequency of occurence
                                                                quired to achieve the performance of our algorithm: (i) Us-
within a specific document as well as within the whole set
                                                                ing sparse vectors to represent inherently sparse data. (ii) Re-
of documents. K-means clustering is an unsupervized clas-
                                                                cycling data structures throughout the K-means iterations
sification technique that allows for the grouping of similar
                                                                to avoid redundant data copies and memory pressure. E.g.,
data items described as numeric vectors.
                                                                we do not create new objects during the iterations of the
                                                                K-means algorithm.
3. ANALYSIS                                                        The conclusion of this experiment is thus that (i) intra-
                                                                node parallelism is an important opportunity to accelerate
3.1 Intra-Node Parallelism                                      data analytics, especially on larger data sets; (ii) the im-
   Many problems in data mining are trivially compute-bound,    plementation and the choice of data structures has a huge
especially learning algorithms using neural networks, sup-      influence on execution time; (iii) parallelism can be exploited
port vector machines and the like, which utilize computa-       without casting the algorithms in map/reduce form.
tionally demanding hyperbolic functions and can require
many iterations to train the model. It should go without        3.2                                   Parallel Input
saying that algorithms like these can be accelerated using         A code that is well-optimized and where CPU is a bot-
high degrees of intra-node parallelism.                         tleneck can also benefit from parallelizing I/O operations.
   K-means clustering is perhaps one of the cheapest un-        Under these circumstances, CPU utilization is high and I/O
supervized learning algorithms. As such, we will use K-         resources are underutilized, including local disk and network
means clustering to demonstrate that data analytics opera-      resources. Intra-node parallelism can thus increase the uti-
tions benefit from intra-node parallelism. Figure 1 shows the   lization of disk and network resources.
self-relative speedup of the K-means clustering algorithm on       In this section we study the problem of calculating the
                                     8	
                                                                                                                                          20	
  

                                                                                                                                                    Execu&on	
  Time	
  (s)	
  
     Self-­‐Rela(ve	
  Speedup	
     7	
  
                                                                                                                                                                                  15	
  
                                     6	
  
                                     5	
                                                                                                                                          10	
                                                                                                                              output	
  
                                     4	
  
                                                                                                                                                                                    5	
                                                                                                                             kmeans	
  
                                     3	
                                                                   NSF	
  abstracts	
  
                                     2	
                                                                                                                                            0	
                                                                                                                             transform	
  
                                                                                                           Mix	
  

                                                                                                                                                                                            u-­‐map	
  
                                                                                                                                                                                                          map	
  
                                                                                                                                                                                                                    u-­‐map	
  
                                                                                                                                                                                                                                  map	
  
                                                                                                                                                                                                                                            u-­‐map	
  
                                                                                                                                                                                                                                                          map	
  
                                                                                                                                                                                                                                                                    u-­‐map	
  
                                                                                                                                                                                                                                                                                  map	
  
                                                                                                                                                                                                                                                                                            u-­‐map	
  
                                                                                                                                                                                                                                                                                                          map	
  
                                     1	
                                                                                                                                                                                                                                                                            input+wc	
  
                                     0	
  
                                             0	
                     5	
                 10	
                    15	
                20	
                                                             1	
                     4	
                     8	
                  12	
                    16	
  
                                                                             Number	
  of	
  Threads	
                                                                                                              Number	
  of	
  Threads	
  
Figure 2: Self-relative parallel scalability of the
TF/IDF operator.                                                                                                                              Figure 4:    Execution time of the TF/IDF–K-
                                                                                                                                              Means workflow on the Mix input using a
                                                                                                                                              std::unordered_map (u-map) or a std::map.
                                     120	
  
    Execu&on	
  Time	
  (s)	
  

                                     100	
  
                                      80	
                                                                           output	
  
                                                                                                                                                 As pointed out above, I/O is both costly and hard to par-
                                      60	
  
                                                                                                                     kmeans	
                 allelize. As such, avoiding I/O is always a good optimization.
                                      40	
  
                                                                                                                                              Figure 3 shows the execution time of the TF/IDF–K-Means
                                      20	
                                                                           transform	
  
                                                                                                                                              workflow when executing the TF/IDF and K-Means opera-
                                       0	
  
                                                                                                                     kmeans-­‐input	
         tors as discrete operators that communicate by storing the
                                                     discrete	
  
                                                     merged	
  
                                                     discrete	
  
                                                     merged	
  
                                                     discrete	
  
                                                     merged	
  
                                                     discrete	
  
                                                     merged	
  
                                                     discrete	
  
                                                     merged	
  

                                                                                                                                              intermediate TF/IDF scores on disk, versus a merged opera-
                                                                                                                     9idf-­‐output	
  
                                                                                                                                              tor without storage of the intermediates. The results clearly
                                                     1	
         4	
          8	
     12	
        16	
               input+wc	
               demonstrate that dumping data to disk has a high latency.
                                                             Number	
  of	
  Threads	
                                                        In this experiment, the data is dumped to a local hard disk.
                                                                                                                                              Both the output of the TF/IDF scores and the subsequent
Figure 3: Execution time of the TF/IDF–K-Means                                                                                                input are executed by a single thread because the file format
workflow when executing the TF/IDF and K-Means                                                                                                utilized (ARFF [3]) does not easily support parallel I/O. In
operators as discrete steps communicating through                                                                                             contrast, transforming the data when it is stored in-memory
file I/O, versus a merged operator with storage of                                                                                            is much faster and parallelizes well.
the TF/IDF scores. Uses the NSF Abstracts input.                                                                                                 The presence of intra-node parallelism is an important
                                                                                                                                              diﬀerentiator as to whether I/O bears much overhead or
                                                                                                                                              not. On a single-threaded execution, I/O increases execu-
term frequency–inverse document frequency (TF/IDF) [10]                                                                                       tion time by 36.9%. On 16 threads, however, I/O makes the
property of a set of documents. Our implementation col-                                                                                       execution 3.84 times slower because it does not parallelize.
lects term frequencies (word counts) for each of the docu-
ments in the set. Moreover, a list of all unique terms across                                                                                 3.4                                 Data Structures
the documents is constructed. This list is annotated with                                                                                        Algorithms use data structures to store input, output and
the number of documents where the word occurs. In a first                                                                                     internal data sets, The choice of these data structures impact
phase, the per-document term frequencies and the overall                                                                                      performance. In the case of TF/IDF, the key data struc-
term-document count properties are collected using dedi-                                                                                      tures are the dictionaries storing unique words and their
cated hash tables, mapping a word to a term frequency or                                                                                      frequencies. Figure 4 shows the execution time of TF/IDF–
an overall document count. In a second phase, we calcu-                                                                                       K-Means workflow on the Mix data set and a varying num-
late for each document the per-term TF/IDF score using                                                                                        ber of threads. Results for the larget NSF Abstracts data
the hash tables described above. For each document, a                                                                                         set are more dramatic.
sparse TF/IDF vector is constructed, sorted by term IDs                                                                                          The results demonstrate that the input and word-count
and written to the output file in Attribute-Relation File For-                                                                                step (“input+wc” in Figure 4) is faster when using the std::map
mat (ARFF) format [3]. The first phase can be executed in                                                                                     data structure as opposed to the std::unordered_map data
parallel for each of the documents. The main limitation to                                                                                    structure. The first is implemented as a red-black tree, while
obtain speedup here is bandwidth to the storage system.                                                                                       the latter is implemented as a hash table. Moreover, the
The second phase is not parallelized as the ARFF format                                                                                       unordered map is pre-sized to hold 4K items to minimize
does not facilitate parallel output.                                                                                                          resizing overhead.
   While the TF/IDF problem is mainly concerned with data                                                                                        While reading documents and counting words is faster
input, tokenization and hash table operations, it benefits                                                                                    with a map, the subsequent data transformation step is
strongly from intra-node parallelism (Figure 2). It speeds                                                                                    slower using a map, especially on one thread. This follows as
up by nearly 6-fold for the Mix data set and by 7-fold for the                                                                                the input and word-count phase is write-intensive, consist-
NSF Abstracts data set. Parallelizing output is important                                                                                     ing of frequent insertion of values in the dictionary. Inser-
as well. However, file formats are often designed in such a                                                                                   tion in the unordered map (a hash table) is ineﬃcient due to
way that parallel I/O becomes hard.                                                                                                           (i) resize operations, which requires re-hashing all elements,
                                                                                                                                              (ii) memory pressure, as the array underlying the hash table
3.3 Workflow Fusion                                                                                                                           is by construction both sparse (to approximate O(1) opera-
tions) and very large (due to the data sets used). In contrast,   intra-node parallelism. Moreover, several optimizations, such
the transformation step performs only lookups on the hash         as avoiding I/O through workflow fusion and choice of data
table, which are known to be faster on the unordered map          structure, are influenced by the presence and degree of intra-
O(1) as opposed to the map O(log n).                              node parallelism. This paper thus points out a new direction
   However, the transformation step scales much better with       for realizing high-performance analytics and identifies open
an increasing number of threads when using the map: it            challenges.
scales to 6.1 x on 16 threads using the map, while it scales
only to 3.4x using the unordered map data structure. This is      6.   REFERENCES
in part due to the memory consumption. In particular, using
                                                                   [1] R. Chen and H. Chen. Tiled-mapreduce: Eﬃcient and
the Mix data set, main memory consumption is 420MB with
                                                                       flexible mapreduce processing on multicore with tiling.
the map, while it rises to 12.8 GB using the unordered map.
                                                                       ACM Trans. Archit. Code Optim., 10(1):3:1–3:30,
   Likewise, the output phase performs lookups only on the
                                                                       Apr. 2013.
dictionaries and thus favours the unordered map. Moreover,
the output phase is hard to parallelize.                           [2] A. Crotty, A. Galakatos, K. Dursun, T. Kraska,
   We conclude that selection of the internal data structures          U. Çetintemel, and S. B. Zdonik. Tupleware: Big data,
has a significant impact on execution time. Moreover, dif-             big analytics, small clusters. In Conf. on Innovative
ferent steps of a workflow may execute faster using diﬀerent           Data Systems Research (CIDR), page 7, Jan. 2015.
data structures. As such, the choice of internal data struc-       [3] M. Hall, E. Frank, G. Holmes, B. Pfahringer,
ture must be taken judiciously, depending on the overall time          P. Reutemann, and I. H. Witten. The weka data
taken by each step of the workflow and also on the extent              mining software: An update. SIGKDD Explor. Newsl.,
to which each phase can be parallelized.                               11(1):10–18, Nov. 2009.
                                                                   [4] M. Han, K. Daudjee, K. Ammar, M. T. Özsu,
                                                                       X. Wang, and T. Jin. An experimental comparison of
4. RELATED WORK                                                        pregel-like graph processing systems. Proc. VLDB
   The performance of data analytics frameworks is an im-              Endow., 7(12):1047–1058, Aug. 2014.
portant concern. Pavlo et al compare map/reduce systems            [5] A. Kyrola, G. Blelloch, and C. Guestrin. Graphchi:
against distributed DBMSes and find interesting trade-oﬀs              Large-scale graph computation on just a pc. In OSDI
in performance between these approaches [8]. They find that            pages 31–46, 2012.
map/reduce is easier to setup but in the end the DBMS was          [6] Apache mahout: Scalable machine learning and data
more performant.                                                       mining. http://mahout.apache.org.
   Ousterhout et al analyse real-life peta-scale workloads.        [7] K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker,
They find that CPU is more often a bottleneck than I/O                 and B.-G. Chun. Making sense of performance in data
and that network performance has little impact on job com-             analytics frameworks. In NSDI’15, pages 293–307,
pletion time [7]. Moreover, they find that straggler nodes             2015.
can be identified and that in most cases the cause for strag-
                                                                   [8] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J.
gling can be identified.
                                                                       DeWitt, S. Madden, and M. Stonebraker. A
   Han et al perform a similar analysis for graph analytics
                                                                       comparison of approaches to large-scale data analysis.
frameworks [4]. They identified several opportunities for
                                                                       In SIGMOD Intl. Conf. on Management of Data,
improvement in these systems. In a similar study, Satish et
                                                                       pages 165–178, 2009.
al [11] find that hand-optimized codes can outperform program-
                                                                   [9] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski,
mer-friendly frameworks by up to 560-fold.
                                                                       and C. Kozyrakis. Evaluating mapreduce for
   Several authors have investigated analytics frameworks for
                                                                       multi-core and multiprocessor systems. In HPCA,
shared-memory systems (single nodes), covering map/reduce
                                                                       pages 13–24, 2007.
workloads [9, 1] and graph analytics [12]. Kyrola et al op-
timize graph analytics assuming that the graph fits on disk       [10] G. Salton and M. J. McGill, editors. Introduction to
but not in main memory [5] Zhang et al optimize graph                  Modern Information Retrieval. Mcgraw-Hill, 1983.
analytics for non-uniform memory architectures [15].              [11] N. Satish et al. Navigating the maze of graph analytics
                                                                       frameworks using massive graph datasets. In SIGMOD
                                                                       Intl. Conf. on Management of Data, pages 979–990,
5. CONCLUSION                                                          2014.
   As data analytics are applied to increasingly larger data      [12] J. Shun and G. E. Blelloch. Ligra: A lightweight graph
sets, it is increasingly important to study and optimize the           processing framework for shared memory. In ACM
execution time of analytics operators. In this paper, we have          PPoPP, pages 135–146, 2013.
studied in particular the impact of parallelism on the perfor-    [13] Apache spark mllib.
mance of data analytics, and in particular intra-node paral-           http://spark.apache.org/mllib/.
lelism, which presents an important opportunity as Moore’s        [14] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma,
Law remains valid.                                                     M. McCauley, M. J. Franklin, S. Shenker, and
   Through studying one text processing and one numeric                I. Stoica. Resilient distributed datasets: A
operator, we identified four optimizations related to intra-           fault-tolerant abstraction for in-memory cluster
node parallelism that we expect are widely applicable across           computing. In NSDI, 2–2, 2012.
data analytics: intra-node parallel computation, parallel I/O,    [15] K. Zhang, R. Chen, and H. Chen. NUMA-Aware
workflow optimization and selection of internal or intermedi-          graph-structured analytics. In PPoPP pages 183–193,
ate data structures. We demonstrate that analytics queries             2015.
have strong potential for performance optimization through