Operator and Workflow Optimization for High-Performance Analytics Hans Vandierendonck, Karen L. Murphy, Mahwish Arif, ∗ Jiawen Sun and Dimitrios S. Nikolopoulos Queen’s University Belfast Belfast, United Kingdom {h.vandierendonck,k.l.murphy,m.arif,jsun03,d.nikolopoulos}@qub.ac.uk ABSTRACT such less familiar to developers. Pitfalls exist that tend to We make a case for studying the impact of intra-node par- make algorithms on sparse data sets memory-bound where allelism on the performance of data analytics. We identify they could be more efficient and compute-bound. As such, four performance optimizations that are enabled by an in- libraries with high-performance implementations of common creasing number of processing cores on a chip. We discuss operators are provided [3, 6, 13]. the performance impact of these opimizations on two analyt- While individual operators are commonly provided by var- ics operators and we identify how these optimizations affect ious libraries, a useful computation typically requires a work- each another. flow that links together multiple operators. Very often, these operators communicate through data stored on disk, which induces redundant operations, such as I/O, involving disk Keywords access, parsing and data conversions. An alternative solu- Data analytics; high-performance analytics; intra-node par- tion is to create single binaries that encapsulate a complex allelism workflow [14]. Such a solution has the potential of signifi- cantly higher processing rates by avoiding unnessecary I/O. This paper presents techniques for optimizing operators 1. INTRODUCTION and workflows in order to achieve high-performance ana- In the era of big data, data analytics represent an increas- lytics. Within this area, we identify and characterize four ing fraction of the processing cycles spent in data centres. widely applicable optimisations. 1. Assuming (highly) par- While in classical data mangement and analytics, operators allel nodes (Moore’s Law predicts that all future nodes will could be described within the bounds of the SQL domain- be increasingly more parallel), we demonstrate that utiliz- specific language, this is no longer the case with data ana- ing parallelism within operators is extremely important to lytics for big data. In big data, the operators are diverse, as improve performance. The motivation behind using par- is the data they are operating on, and can involve any algo- allelism are the observations that (i) parallelism allows to rithm to transform, classify or structure the data at hand. hide I/O latencies, and (ii) many analytics problems are As the structure of the data and the associated operators compute-bound rather than I/O-bound, an observation that are ill-defined, so is the scope of data analytics operators. goes against intuition [2, 11]. In order to achieve low processing times, operators require 2. Input and output contribute to a large proportion of careful design and must be highly optimized. Where efficient the execution time due to the size of data sets and the poten- SQL statements may, to a large extent, be written by domain tially low amount of computation performed per byte trans- experts and automatically optimized, this is definitely not ferred. I/O operations benefit, however, also from intra- the case for analytics operators. Moreover, optimization of node parallelism, allowing on the one hand to read indepen- the operators is often counter-intuitive as it involves sparse dent files concurrently, and on the other hand overlapping data sets, which are less deeply covered in text books and as data processing with disk and network access latency. ∗This work is supported by the European Community’s 3. While big data frameworks often steer towards dump- Seventh Framework Programme (FP7/2007-2013) under ing intermediate data sets to disk, the overhead of I/O and the ASAP project, grant agreement no. 619706, and storing intermediate data sets are significant. This overhead by the United Kingdom EPSRC under grant agreement can be avoided by fusing operators in workflows into single EP/L027402/1. executable images and by feeding the data from one operator to the next. 4. The choice of internal data structures used in analytics operators is determining for the performance of the opera- tor. We demonstrate a 3.4 fold speedup by interchanging one standardized data structure for another. However, this ⃝ c 2016, Copyright is with the authors. Published in the Workshop Pro- result depends also on the degree of intra-node parallelism ceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bor- deaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this utilized. As such the optimization problem is non-trivial. paper is permitted under the terms of the Creative Commons license CC- The goal of this paper is to demonstrate the importance by-nc-nd 4.0 of these considerations for implementing analytics queries. MEDAL ’16 March 15, 2016, Bordeaux, France 8   Table 1: Data set description. Self-­‐Rela(ve  Speedup   7   Input Documents Bytes Distinct words 6   Mix 23432 62.8 MB 184743 5   NSF  abstracts   NSF Abstracts 101483 310.9 MB 267914 4   Mix   3   The remainder of this paper is structured as follows. In 2   Section 2 we discuss our system assumptions and software 1   environment. In Section 3 we analyse two typical operators, 0   one text processing and one numeric operator, and charac- 0   5   10   15   20   terize the impact achievable by the identified performance Number  of  Threads   optimizations. In Section 4 we discuss related work. Figure 1: Self-relative performance scalability of the K-Means operator. 2. SETUP our two datasets (Table 1). We use the algorithm to assign We start our investigation at a small scale, focusing on documents to one of 8 clusters based on their normalized the activities on a single node as these allow us to better TF/IDF scores. understand the performance of operators and workflows. The self-relative speedup shows how much performance Performing analytics on a single node is important as a is improved by utilizing multiple CPU cores. The speedup single-node can be built with a large amount of working obtained is sensitive to the data set operated on: The NSF memory (up to 16 TB) and many processing cores (over a Abstracts data set has about 100,000 documents and is sped 100). Such a system could efficiently process many real- up nearly 8 times using intra-node parallelism. The Mix world data sets. However, we expect that our conclusions data set has around 23,000 documents, which is sufficient remain valid when applied to scale-out systems, as optimiz- only for a 2.5 x speedup. This effect is easily explained by ing the performance of nodes in isolation is crucial to opti- the parallel loops in K-means clustering, which are all loops mize the system overall. iterating over the documents. As the number of documents To test the importance of the identified optimisations, we grows, so does the parallel scalability. implement two analytics operators in the Cilkplus exten- The execution time of our implementation is furthermore sion of C++, a programming language designed for high- short in comparison to other implementations. We com- performance and parallel computing at MIT, first developed pared the execution time of our K-means clustering imple- over two decades ago and continuously refined since then. mentation against WEKA [3] (version 3.6.13). Using the Cilkplus, now commercialized by Intel, supports the con- “SimpleKMeans” algorithm, a single-threaded K-Means al- struction of parallel tasks through language constructs that gorithm, on the same data sets requires over 2 hours, after express parallelism and vectorization (SIMDization) in an which we aborted the execution. In contrast, executing our easily accessible way. In the Cilkplus model, each thread implementation sequentially required 3.3s and 40.9s for the of computation is bound to a processing core. The princi- Mix and NSF Abstracts data sets respectively. Note that ples utilized should apply to other languages and parallel while we did not see the execution of WEKA through to constructs, e.g., Java streams. the end, we have verified that our WEKA installation works We study two operators: term frequency–inverse docu- correctly on small data sets. ment frequency (TF/IDF) and K-means clustering. TF/IDF While our implementation is significantly faster than WEKA, extracts words from text documents and rates the impor- this is not automatic. Several key optimisations were re- tance of a word on the basis of its frequency of occurence quired to achieve the performance of our algorithm: (i) Us- within a specific document as well as within the whole set ing sparse vectors to represent inherently sparse data. (ii) Re- of documents. K-means clustering is an unsupervized clas- cycling data structures throughout the K-means iterations sification technique that allows for the grouping of similar to avoid redundant data copies and memory pressure. E.g., data items described as numeric vectors. we do not create new objects during the iterations of the K-means algorithm. 3. ANALYSIS The conclusion of this experiment is thus that (i) intra- node parallelism is an important opportunity to accelerate 3.1 Intra-Node Parallelism data analytics, especially on larger data sets; (ii) the im- Many problems in data mining are trivially compute-bound, plementation and the choice of data structures has a huge especially learning algorithms using neural networks, sup- influence on execution time; (iii) parallelism can be exploited port vector machines and the like, which utilize computa- without casting the algorithms in map/reduce form. tionally demanding hyperbolic functions and can require many iterations to train the model. It should go without 3.2 Parallel Input saying that algorithms like these can be accelerated using A code that is well-optimized and where CPU is a bot- high degrees of intra-node parallelism. tleneck can also benefit from parallelizing I/O operations. K-means clustering is perhaps one of the cheapest un- Under these circumstances, CPU utilization is high and I/O supervized learning algorithms. As such, we will use K- resources are underutilized, including local disk and network means clustering to demonstrate that data analytics opera- resources. Intra-node parallelism can thus increase the uti- tions benefit from intra-node parallelism. Figure 1 shows the lization of disk and network resources. self-relative speedup of the K-means clustering algorithm on In this section we study the problem of calculating the 8   20   Execu&on  Time  (s)   Self-­‐Rela(ve  Speedup   7   15   6   5   10   output   4   5   kmeans   3   NSF  abstracts   2   0   transform   Mix   u-­‐map   map   u-­‐map   map   u-­‐map   map   u-­‐map   map   u-­‐map   map   1   input+wc   0   0   5   10   15   20   1   4   8   12   16   Number  of  Threads   Number  of  Threads   Figure 2: Self-relative parallel scalability of the TF/IDF operator. Figure 4: Execution time of the TF/IDF–K- Means workflow on the Mix input using a std::unordered_map (u-map) or a std::map. 120   Execu&on  Time  (s)   100   80   output   As pointed out above, I/O is both costly and hard to par- 60   kmeans   allelize. As such, avoiding I/O is always a good optimization. 40   Figure 3 shows the execution time of the TF/IDF–K-Means 20   transform   workflow when executing the TF/IDF and K-Means opera- 0   kmeans-­‐input   tors as discrete operators that communicate by storing the discrete   merged   discrete   merged   discrete   merged   discrete   merged   discrete   merged   intermediate TF/IDF scores on disk, versus a merged opera- 9idf-­‐output   tor without storage of the intermediates. The results clearly 1   4   8   12   16   input+wc   demonstrate that dumping data to disk has a high latency. Number  of  Threads   In this experiment, the data is dumped to a local hard disk. Both the output of the TF/IDF scores and the subsequent Figure 3: Execution time of the TF/IDF–K-Means input are executed by a single thread because the file format workflow when executing the TF/IDF and K-Means utilized (ARFF [3]) does not easily support parallel I/O. In operators as discrete steps communicating through contrast, transforming the data when it is stored in-memory file I/O, versus a merged operator with storage of is much faster and parallelizes well. the TF/IDF scores. Uses the NSF Abstracts input. The presence of intra-node parallelism is an important differentiator as to whether I/O bears much overhead or not. On a single-threaded execution, I/O increases execu- term frequency–inverse document frequency (TF/IDF) [10] tion time by 36.9%. On 16 threads, however, I/O makes the property of a set of documents. Our implementation col- execution 3.84 times slower because it does not parallelize. lects term frequencies (word counts) for each of the docu- ments in the set. Moreover, a list of all unique terms across 3.4 Data Structures the documents is constructed. This list is annotated with Algorithms use data structures to store input, output and the number of documents where the word occurs. In a first internal data sets, The choice of these data structures impact phase, the per-document term frequencies and the overall performance. In the case of TF/IDF, the key data struc- term-document count properties are collected using dedi- tures are the dictionaries storing unique words and their cated hash tables, mapping a word to a term frequency or frequencies. Figure 4 shows the execution time of TF/IDF– an overall document count. In a second phase, we calcu- K-Means workflow on the Mix data set and a varying num- late for each document the per-term TF/IDF score using ber of threads. Results for the larget NSF Abstracts data the hash tables described above. For each document, a set are more dramatic. sparse TF/IDF vector is constructed, sorted by term IDs The results demonstrate that the input and word-count and written to the output file in Attribute-Relation File For- step (“input+wc” in Figure 4) is faster when using the std::map mat (ARFF) format [3]. The first phase can be executed in data structure as opposed to the std::unordered_map data parallel for each of the documents. The main limitation to structure. The first is implemented as a red-black tree, while obtain speedup here is bandwidth to the storage system. the latter is implemented as a hash table. Moreover, the The second phase is not parallelized as the ARFF format unordered map is pre-sized to hold 4K items to minimize does not facilitate parallel output. resizing overhead. While the TF/IDF problem is mainly concerned with data While reading documents and counting words is faster input, tokenization and hash table operations, it benefits with a map, the subsequent data transformation step is strongly from intra-node parallelism (Figure 2). It speeds slower using a map, especially on one thread. This follows as up by nearly 6-fold for the Mix data set and by 7-fold for the the input and word-count phase is write-intensive, consist- NSF Abstracts data set. Parallelizing output is important ing of frequent insertion of values in the dictionary. Inser- as well. However, file formats are often designed in such a tion in the unordered map (a hash table) is inefficient due to way that parallel I/O becomes hard. (i) resize operations, which requires re-hashing all elements, (ii) memory pressure, as the array underlying the hash table 3.3 Workflow Fusion is by construction both sparse (to approximate O(1) opera- tions) and very large (due to the data sets used). In contrast, intra-node parallelism. Moreover, several optimizations, such the transformation step performs only lookups on the hash as avoiding I/O through workflow fusion and choice of data table, which are known to be faster on the unordered map structure, are influenced by the presence and degree of intra- O(1) as opposed to the map O(log n). node parallelism. This paper thus points out a new direction However, the transformation step scales much better with for realizing high-performance analytics and identifies open an increasing number of threads when using the map: it challenges. scales to 6.1 x on 16 threads using the map, while it scales only to 3.4x using the unordered map data structure. This is 6. REFERENCES in part due to the memory consumption. In particular, using [1] R. Chen and H. Chen. Tiled-mapreduce: Efficient and the Mix data set, main memory consumption is 420MB with flexible mapreduce processing on multicore with tiling. the map, while it rises to 12.8 GB using the unordered map. ACM Trans. Archit. Code Optim., 10(1):3:1–3:30, Likewise, the output phase performs lookups only on the Apr. 2013. dictionaries and thus favours the unordered map. Moreover, the output phase is hard to parallelize. [2] A. Crotty, A. Galakatos, K. Dursun, T. Kraska, We conclude that selection of the internal data structures U. Çetintemel, and S. B. Zdonik. Tupleware: Big data, has a significant impact on execution time. Moreover, dif- big analytics, small clusters. In Conf. on Innovative ferent steps of a workflow may execute faster using different Data Systems Research (CIDR), page 7, Jan. 2015. data structures. As such, the choice of internal data struc- [3] M. Hall, E. Frank, G. Holmes, B. Pfahringer, ture must be taken judiciously, depending on the overall time P. Reutemann, and I. H. Witten. The weka data taken by each step of the workflow and also on the extent mining software: An update. SIGKDD Explor. Newsl., to which each phase can be parallelized. 11(1):10–18, Nov. 2009. [4] M. Han, K. Daudjee, K. Ammar, M. T. Özsu, X. Wang, and T. Jin. An experimental comparison of 4. RELATED WORK pregel-like graph processing systems. Proc. VLDB The performance of data analytics frameworks is an im- Endow., 7(12):1047–1058, Aug. 2014. portant concern. Pavlo et al compare map/reduce systems [5] A. Kyrola, G. Blelloch, and C. Guestrin. Graphchi: against distributed DBMSes and find interesting trade-offs Large-scale graph computation on just a pc. In OSDI in performance between these approaches [8]. They find that pages 31–46, 2012. map/reduce is easier to setup but in the end the DBMS was [6] Apache mahout: Scalable machine learning and data more performant. mining. http://mahout.apache.org. Ousterhout et al analyse real-life peta-scale workloads. [7] K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker, They find that CPU is more often a bottleneck than I/O and B.-G. Chun. Making sense of performance in data and that network performance has little impact on job com- analytics frameworks. In NSDI’15, pages 293–307, pletion time [7]. Moreover, they find that straggler nodes 2015. can be identified and that in most cases the cause for strag- [8] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. gling can be identified. DeWitt, S. Madden, and M. Stonebraker. A Han et al perform a similar analysis for graph analytics comparison of approaches to large-scale data analysis. frameworks [4]. They identified several opportunities for In SIGMOD Intl. Conf. on Management of Data, improvement in these systems. In a similar study, Satish et pages 165–178, 2009. al [11] find that hand-optimized codes can outperform program- [9] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, mer-friendly frameworks by up to 560-fold. and C. Kozyrakis. Evaluating mapreduce for Several authors have investigated analytics frameworks for multi-core and multiprocessor systems. In HPCA, shared-memory systems (single nodes), covering map/reduce pages 13–24, 2007. workloads [9, 1] and graph analytics [12]. Kyrola et al op- timize graph analytics assuming that the graph fits on disk [10] G. Salton and M. J. McGill, editors. Introduction to but not in main memory [5] Zhang et al optimize graph Modern Information Retrieval. Mcgraw-Hill, 1983. analytics for non-uniform memory architectures [15]. [11] N. Satish et al. Navigating the maze of graph analytics frameworks using massive graph datasets. In SIGMOD Intl. Conf. on Management of Data, pages 979–990, 5. CONCLUSION 2014. As data analytics are applied to increasingly larger data [12] J. Shun and G. E. Blelloch. Ligra: A lightweight graph sets, it is increasingly important to study and optimize the processing framework for shared memory. In ACM execution time of analytics operators. In this paper, we have PPoPP, pages 135–146, 2013. studied in particular the impact of parallelism on the perfor- [13] Apache spark mllib. mance of data analytics, and in particular intra-node paral- http://spark.apache.org/mllib/. lelism, which presents an important opportunity as Moore’s [14] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, Law remains valid. M. McCauley, M. J. Franklin, S. Shenker, and Through studying one text processing and one numeric I. Stoica. Resilient distributed datasets: A operator, we identified four optimizations related to intra- fault-tolerant abstraction for in-memory cluster node parallelism that we expect are widely applicable across computing. In NSDI, 2–2, 2012. data analytics: intra-node parallel computation, parallel I/O, [15] K. Zhang, R. Chen, and H. Chen. NUMA-Aware workflow optimization and selection of internal or intermedi- graph-structured analytics. In PPoPP pages 183–193, ate data structures. We demonstrate that analytics queries 2015. have strong potential for performance optimization through