=Paper=
{{Paper
|id=Vol-1558/paper44
|storemode=property
|title=Towards an Analytics Query Engine
|pdfUrl=https://ceur-ws.org/Vol-1558/paper44.pdf
|volume=Vol-1558
|authors=Nantia Makrynioti,Vasilis Vassalos
|dblpUrl=https://dblp.org/rec/conf/edbt/MakryniotiV16
}}
==Towards an Analytics Query Engine==
Towards an Analytics Query Engine ∗ Nantia Makrynioti Vasilis Vassalos Athens University of Economics and Business Athens University of Economics and Business Athens, Greece Athens, Greece makriniotik@aueb.gr vassalos@aueb.gr ABSTRACT with black box libraries, evaluating various algorithms for a This vision paper presents new challenges and opportuni- task and tuning their parameters, in order to produce an ef- ties in the area of distributed data analytics, at the core of fective model, is a time-consuming process. Things become which are data mining and machine learning. At first, we even more complicated when we want to leverage paralleliza- provide an overview of the current state of the art in the area tion on clusters of independent computers for processing big and then analyse two aspects of data analytics systems, se- data. Details concerning load balancing, scheduling or fault mantics and optimization. We argue that these aspects will tolerance can be quite overwhelming even for an experienced emerge as important issues for the data management com- software engineer. munity in the next years and propose promising research Research in the data management domain recently started directions for solving them. tackling the above issues by developing systems for large- scale analytics that aim at providing higher-level primitives for building data mining and machine learning algorithms, Keywords as well as hiding low-level details of distributed execution. Data analytics, Declarative machine learning, Distributed MapReduce [12] and Dryad [16] were the first frameworks processing that paved the way. However, these initial efforts suffered from low usability, as they offered expressive but at the same 1. INTRODUCTION time low-level languages to program data analysis tasks. Soon the need for higher-level programming languages on With the rapid growth of world wide web (WWW) and the top of these frameworks became apparent. Systems, such development of social networks, the available amount of data as Hive [23], Pig Latin [20], DryadLINQ [25] and Scope [9], has exploded. This availability has encouraged many com- offer higher-level languages that enable developers to write panies and organizations in recent years to collect and anal- part of their programs in declarative style. Then, these pro- yse data, in order to extract information and gain valuable grams are automatically translated into MapReduce jobs or knowledge. At the same time hardware cost has decreased, Dryad vertices that form a directed acyclic graph (DAG), so storage and processing of big data is not prohibitive even which is optimized for efficient distributed execution. This for smaller companies. Topic classification, sentiment anal- paradigm is adopted by other systems, too. Stratosphere ysis, spam filtering, fraud and anomaly detection are only [5], Tupleware [11] and MLbase [17] also aim at hiding de- a few analytics tasks that gained considerable popularity tails of distributed execution, such as load balancing and over the past few years, along with more traditional ware- scheduling, in order to give the impression to the user that house queries that gather statistics from data. Hence, mak- she develops code for a single machine. ing the deployment of solutions for such tasks less tedious, Apart from the programming model, optimization tech- adds value to the services provided by these companies and niques is another important issue that these systems address. organizations, and encourages more people to enhance their Big data need fast and efficient solutions and as a conse- work using information from data. quence frameworks should leverage any opportunity for code It is clear that the areas of data mining and machine learn- optimization. Query rewriting, exploitation of data locality ing are at the core of data analysis tasks. However, devel- and vectorization are concepts already known and widely oping such algorithms needs not only expertise in software explored in databases and compilers. These techniques are engineering, but also a solid mathematical background in also applied in the aforementioned analytics systems along order to interpret correctly and efficiently the mathematical with more recent ideas. computations into a program. Even when experimenting In this paper, we study systems for large-scale data anal- ∗ This work was co-funded by the European Union and the Gen- ysis in the context of these two directions: semantics and eral Secretariat of Research and Technology, Ministry Of Educa- optimization techniques. We present an overview of the im- tion, Research Religious Affairs under the project Asset Manage- mense development of systems for large scale data analysis ment Optimization & Risk Management Software (AMOR) of the that made their appearance over the past few years and pro- Bilateral R&T Cooperation Program Greece - Israel 2013-2015. pose research topics that we argue will attract the interest c 2016, Copyright is with the authors. Published in the Workshop Pro- of the data management community in the near future. The ceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bor- deaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this rest of the paper is organized as follows. Section 2 describes paper is permitted under the terms of the Creative Commons license CC- classes of systems for distributed data analytics and section by-nc-nd 4.0 3 analyses open challenges concerning the semantics and the ilar to Stratosphere’s Sopremo layer, Jaql and Pig Latin de- optimization methods used in such systems. Finally, section fine each workflow as a sequence of steps, but with each step 4 concludes the paper. performing a high-level transformation, such as SQL and ETL operations. Users are able to integrate custom code in 2. CURRENT STATE OF THE ART their data analysis tasks by writing their own UDFs either in external languages, such as Java and Python, or by us- Current work in the area of big data analytics focuses ing operators of the system. Eventually, Jaql and Pig Latin on proposing programming models for data mining and ma- scripts are automatically compiled to MapReduce jobs. A chine learning tasks, and developing optimizations that re- variant of Pig Latin, called MyriaL, extends its program- sult to efficient execution of users’ programs on a distributed ming model with looping constructs and is used in Myria execution engine. So, large-scale analytics frameworks can [15], a big data management service. Iterative processes are be divided into two main categories: libraries of data min- also a limitation for Jaql and Sopremo, as the developer is ing/machine learning algorithms or sets of primitives to de- not able to define in these programming models that a given velop such algorithms. workflow will be repeated for a number of iterations or as Libraries provide implementations of algorithms commonly long a condition holds. used in data analysis tasks, such as SVMs and K-means, tar- Moving on to the incorporation of relational operators geted to a specific distributed execution platform. The user to a high-level language, we find DryadLINQ that trans- is able to use these algorithms in her code by calling them forms LINQ programs into distributed processes running on as functions, in order to load data from files or databases, an execution engine called Dryad. The LINQ model incor- transform data or use machine learning algorithms to anal- porates constructs for manipulating data items into a host yse them. The programming paradigm is the same as that language, such as C# and other .NET languages. Itera- of a developer writing code for a single machine and call- tion is expressed using loop constructs of the host language. ing functions from a third-party library. The difference is Spark exposes a similar functional programming interface that this code is then automatically compiled by the sys- implemented in Scala, but provides greater support for it- tem in order to be executed on a distributed platform. Such erative processes and expression of shared state among it- libraries are available for all popular distributed execution erations. Finally, Tupleware follows the same approach as engines. Apache Mahout [2] was initially implemented on Stratosphere’s PACT model by proposing operators that are Hadoop [1] and is gradually extended to Spark [26]. A simi- second-order functions and take as argument a user defined lar example is MLlib [19], a scalable machine learning library first-order function. However, Tupleware’s set of operators on top of Spark, whereas MADlib [10] is a library of SQL- is more extended than the one provided by PACT. These op- based machine learning and data analysis algorithms, which erators are used to define workflows inside a host language. run on database engines. They include algorithms for clas- Then, the system transforms user’s code into a distributed sification, clustering, collaborative filtering, dimensionality program, which is deployed and executed on a cluster of reduction and other useful preprocessing tasks. machines. Systems that belong to the second category provide a On the other hand, MLI API [22] and SystemML [13] that set of primitives to the users, in order to simplify the de- run on Spark and MapReduce respectively, aim to imitate velopment of distributed data analysis algorithms. These the style of statistical computing languages, such as R and primitives can be also combined with UDFs (User Defined MATLAB, which are very popular among machine learning Functions) in many cases to allow for custom code in data researchers and help them build software prototypes quickly. analysis tasks. In this class of systems, algorithms are not As a result, MLI includes interfaces for three main concepts, ready to call, but the user can use these primitives to develop Optimizers, Algorithms and Models, as well as APIs for two machine learning or data mining algorithms that run on a data structures MLTable and LocalMatrix, which supports specific distributed execution engine. The provided primi- linear algebra operations. In SystemML, programs are se- tives hide low-level details concerning distribution, such as quences of statements written in DML, a language which load balancing and fault tolerance, for which the system includes constructs for input/output, control structures and provides solutions. Declarative style combined with imper- assignments, as seen in R, on matrices or scalars. Neverthe- ative/procedural programming, is popular in this kind of less, more advanced features of R, such as objects and lists, systems, gaining ground for the application of the DBMS are not currently supported by DML. paradigm in data analysis platforms. Finally, MLbase provides a declarative language above a This trend is already evident in existing platforms, such layer of a library of algorithms, by which the user is able to as Stratosphere, Jaql [6] and Pig Latin that provide a hy- define the type of task she wants to execute, e.g. classifica- brid of procedural and declarative programming, as well as tion, and the data to be used. Then the system tests various DryadLINQ, Tupleware and Spark that attempt to incor- machine learning algorithms from its library and parameter porate relational and other operators to a host language. values on the data provided, and determines an effective Stratosphere provides three programming models organized combination based on quality and time performance. Apart as layers of a larger stack that comes down to an execution from the declarative language, MLbase also offers high-level engine and a data storage system. Sopremo is the top layer primitives, such as gradient and stohastic gradient descent, of the stack and trades expressiveness for declarativity. It that make development of distributed machine learning al- includes a considerable number of high-level operators, such gorithms easier for researchers. as relational or domain-specific operators which offer more Despite the abundance of available systems and the va- advanced functionality (e.g. duplicate detection, named en- riety of approaches that these follow, programmers are still tity recognition). Each Sopremo plan is translated to the expected to write a considerable amount of code in most programming model below Sopremo, PACT, and is finally of these platforms. The idea of providing greater degree executed on Nephele, Stratosphere’s execution engine. Sim- improvements in these layers can increase the efficiency of execution overall. An extension of Datalog could also cover more requirements of data analysis tasks and be exposed to the users as a programmable language. The distinction between the user’s program and the logical layer is also sup- ported by Hyracks [7], which serves as a parallel-platform for compiling higher-level declarative data-processing lan- guages, such as Pig Latin and Hive. We also consider the sets of operations provided by MLI API, MLbase and Spark ML [3] good attempts, as except from relational operators, they are also closer to the seman- tics of machine learning area by providing declarative op- erators for some frequent components of machine learning algorithms, such as linear algebra operations and gradient descent. The Spark ML package is also built around the key Figure 1: The main categories of analytics operators concepts of learning and data transformation algorithms, in in current systems. order to standardize multiple APIs for machine learning on top of a data structure called DataFrame. Thus, a key in- gredient in the development of optimizable, massively scal- of declarativity by extending the set of operators that data able analytics is the creation of an analytics query engine analysis systems support, and minimize any glue code needed supporting a language with clear semantics that incorpo- between operation calls would make the development and rates both relational operators and high-level operators for maintenance of programs much easier. The description of common components of machine learning algorithms, e.g. an algebra that would form the basis of these operators and optimization functions and preprocessing tasks 1 . The pur- model their semantics is a primary goal at this direction. pose of this language is to model various data analysis tasks We explore this open problem in the following section, as as plans consisting solely or mostly of operation calls. An- well as the challenges that are presented by optimizing the other important aspect of an efficient analytics query engine execution of tasks written in this set of operators. is the implementation of optimizations, as described in the next section. 3. OPEN CHALLENGES 3.2 Optimizations in Data Analytics Systems 3.1 Data Analytics Semantics The optimization techniques adopted by large-scale ana- For some time now, the data management community lytics systems are mainly borrowed from two areas: databases compares the current situation in data analytics with the and compilers. beginning of database systems. Currently, data mining and As in database systems, query rewriting is also used in machine learning tasks in distributed platforms are done in data analysis systems, either depending on heuristics or on ad hoc ways and developers have to use various systems, a cost model. Heuristics are rules that are fired based on spe- each targeted to a specific class of tasks. Data models and cific properties, but it is not examined whether these rules operators are different among systems, each exposing its own produce a faster plan in any case. DryadLINQ is currently semantics, and there is no notion of an algebra that could applying heuristics in query rewriting, although in the fu- form a basis for data analytics languages. Figure 1 displays ture the group plans to design and implement a cost model the main categories of operators included in the current sys- for query optimization similar to the one used in DBMS. tems. Stratosphere and Jaql have already moved on to cost-based While relational queries are easily defined with the afore- optimizations, handling also difficulties that appear due to mentioned operators, common concepts in machine learn- the large amount of user-defined code. Given that many op- ing, such as models and optimization algorithms, are not erators of data analysis systems take as input user defined represented in a declarative way in the systems described first-order functions, semantics of these operators cannot be above and their implementation still involves a lot of cus- known. This makes a big difference from traditional query tom code. Hence, apart from the distributed execution, the optimizations. Static code analysis is one way to tackle un- development of machine learning algorithms is not simpli- known semantics. Through code analysis, it is possible to fied compared to more traditional languages such as R and determine which data are read and written from each oper- MATLAB. ator and separate them into read and write sets. Applying In an attempt to present declarative solutions for machine compiler optimizations to UDFs is also useful. Function and learning tasks, recent work [8] proposes Datalog as a suitable variable inlining, as well as SIMD vectorization are the most declarative foundation for analytics systems. As each sys- popular compiler optimization techniques we see in these tem is usually targeted to a specific ML class of tasks (e.g. systems. Jaql and Tupleware already exploit these ideas. graph analytics), Datalog can serve as a logical layer where Concerning optimization in the context of the analytics all these different programming models will be translated. model proposed above, we describe crucial issues that come The purpose is two-folded. First, to avoid developing opti- into the picture. The first decision to be made regards the mizations for each new system and second to separate the 1 Part of these elements are also covered by the PMML for- user’s program from its logical interpretation. The second mat [14], a data exchange standard for sharing predictive argument is very important as it prevents changes in log- models produced by data mining and machine learning al- ical and physical layer from affecting user’s code, whereas gorithms. definition of the form of the logical plan for the programs Proc. VLDB Endow., 2(2):1481–1492, Aug. 2009. expressed with this model. Another important concern is [11] A. Crotty, A. Galakatos, K. Dursun, T. Kraska, the computation of appropriate cost metrics for evaluating U. Çetintemel, and S. B. Zdonik. Tupleware: Redefining these plans. Traditional cost models used in query opti- modern analytics. CoRR, abs/1406.6667, 2014. mization are not designed to cover every dimension of query [12] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. In OSDI’04: Proceedings of performance, such as execution time. Machine learning has the 6th Conference on Symposium on Operating Systems already made its way into prediction of execution latency Design and Implementation. USENIX Association, 2004. and resource usage [4, 18] for queries running on DBMSs [13] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, with relevant papers proposing the training of models on V. Sindhwani, S. Tatikonda, Y. Tian, and previous query instances, whose features are based on cardi- S. Vaithyanathan. Systemml: Declarative machine learning nalities estimated by the optimizer, the count of occurrences on mapreduce. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE ’11, for each operator in the query plan and other statistics. The pages 231–242, 2011. application of similar machine learning techniques for pre- [14] A. Guazzelli, M. Zeller, W.-C. Lin, and G. Williams. dicting performance metrics of data analysis programs is an PMML: An open standard for sharing models. The R interesting direction to pursue, which becomes even more Journal, 1(1):60–65, 2009. challenging if we consider concurrent workloads [24]. Finally, [15] D. Halperin, V. Teixeira de Almeida, L. L. Choo, S. Chu, given the sheer size of data that are analysed nowadays, the P. Koutris, D. Moritz, J. Ortiz, V. Ruamviboonsuk, proposed optimization techniques should address challenges J. Wang, A. Whitaker, S. Xu, M. Balazinska, B. Howe, and that arise in a distributed environment [21] and efficiently D. Suciu. Demonstration of the myria big data management service. In Proceedings of the 2014 ACM translate these logical plans to programs that would run in SIGMOD International Conference on Management of a distributed execution engine, such as Spark. Data, SIGMOD ’14, pages 881–884, 2014. [16] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. 4. CONCLUSION Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the 2Nd ACM In this paper, we presented open challenges in the research SIGOPS/EuroSys European Conference on Computer area of big data analytics and we specifically focused on the Systems 2007, EuroSys ’07, pages 59–72, 2007. aspects of semantics and optimization techniques. Given [17] T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. J. the variety of systems that are used for data analysis, we Franklin, and M. I. Jordan. Mlbase: A distributed believe that the consolidation of semantics in an appropriate machine-learning system. In CIDR. www.cidrdb.org, 2013. algebra and the evolvement of optimization methods applied [18] J. Li, A. C. König, V. Narasayya, and S. Chaudhuri. Robust estimation of resource consumption for sql queries in an analytics query engine are of great importance and will using statistical techniques. In 38th International attract even more interest in the next few years. Conference on Very Large Databases. Very Large Data Bases Endowment Inc., August 2012. 5.[1] Apache REFERENCES hadoop. https://hadoop.apache.org/. [19] X. Meng, J. K. Bradley, B. Yavuz, E. R. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, [2] Apache mahout. http://mahout.apache.org/. M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, [3] Spark ml. R. Zadeh, M. Zaharia, and A. Talwalkar. Mllib: Machine http://spark.apache.org/docs/latest/ml-guide.html. learning in apache spark. CoRR, abs/1505.06807, 2015. [4] M. Akdere, U. Çetintemel, M. Riondato, E. Upfal, and [20] C. Olston, B. Reed, U. Srivastava, R. Kumar, and S. B. Zdonik. Learning-based query performance modeling A. Tomkins. Pig latin: A not-so-foreign language for data and prediction. In Proceedings of the 2012 IEEE 28th processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Data Engineering, ICDE ’12, International Conference on Management of Data, pages 390–401, 2012. SIGMOD ’08, pages 1099–1110, 2008. [5] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, [21] M. T. Ozsu. Principles of Distributed Database Systems. F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, Prentice Hall Press, 3rd edition, 2007. F. Naumann, M. Peters, A. Rheinländer, M. J. Sax, [22] E. R. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, S. Schelter, M. Höger, K. Tzoumas, and D. Warneke. The J. E. Gonzalez, M. J. Franklin, M. I. Jordan, and stratosphere platform for big data analytics. The VLDB T. Kraska. Mli: An api for distributed machine learning. Journal, 23(6):939–964, Dec. 2014. CoRR, abs/1310.5426, 2013. [6] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. [23] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, Eltabakh, C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A scripting language for large scale semistructured data warehousing solution over a map-reduce framework. Proc. analysis. PVLDB, 4(12):1272–1283, 2011. VLDB Endow., 2(2):1626–1629, Aug. 2009. [7] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica. [24] W. Wu, Y. Chi, H. Hacı́gümüş, and J. F. Naughton. Hyracks: A flexible and extensible foundation for Towards predicting query execution time for concurrent data-intensive computing. In Proceedings of the 2011 IEEE and dynamic database workloads. Proc. VLDB Endow., 27th International Conference on Data Engineering, ICDE 6(10):925–936, Aug. 2013. ’11, pages 1151–1162, 2011. [25] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, [8] Y. Bu, V. R. Borkar, M. J. Carey, J. Rosen, N. Polyzotis, P. K. Gunda, and J. Currey. Dryadlinq: A system for T. Condie, M. Weimer, and R. Ramakrishnan. Scaling general-purpose distributed data-parallel computing using a datalog for machine learning on big data. CoRR, high-level language. In Proceedings of the 8th USENIX abs/1203.0160, 2012. Conference on Operating Systems Design and [9] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, Implementation, OSDI’08, pages 1–14, 2008. D. Shakib, S. Weaver, and J. Zhou. Scope: Easy and [26] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and efficient parallel processing of massive data sets. Proc. I. Stoica. Spark: Cluster computing with working sets. In VLDB Endow., 1(2):1265–1276, Aug. 2008. Proceedings of the 2Nd USENIX Conference on Hot Topics [10] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and in Cloud Computing, HotCloud’10, pages 10–10, 2010. C. Welton. Mad skills: New analysis practices for big data.