=Paper= {{Paper |id=Vol-1558/paper44 |storemode=property |title=Towards an Analytics Query Engine |pdfUrl=https://ceur-ws.org/Vol-1558/paper44.pdf |volume=Vol-1558 |authors=Nantia Makrynioti,Vasilis Vassalos |dblpUrl=https://dblp.org/rec/conf/edbt/MakryniotiV16 }} ==Towards an Analytics Query Engine== https://ceur-ws.org/Vol-1558/paper44.pdf
                            Towards an Analytics Query Engine                                                ∗




                          Nantia Makrynioti                                             Vasilis Vassalos
          Athens University of Economics and Business                    Athens University of Economics and Business
                        Athens, Greece                                                 Athens, Greece
                        makriniotik@aueb.gr                                           vassalos@aueb.gr



ABSTRACT                                                                 with black box libraries, evaluating various algorithms for a
This vision paper presents new challenges and opportuni-                 task and tuning their parameters, in order to produce an ef-
ties in the area of distributed data analytics, at the core of           fective model, is a time-consuming process. Things become
which are data mining and machine learning. At first, we                 even more complicated when we want to leverage paralleliza-
provide an overview of the current state of the art in the area          tion on clusters of independent computers for processing big
and then analyse two aspects of data analytics systems, se-              data. Details concerning load balancing, scheduling or fault
mantics and optimization. We argue that these aspects will               tolerance can be quite overwhelming even for an experienced
emerge as important issues for the data management com-                  software engineer.
munity in the next years and propose promising research                     Research in the data management domain recently started
directions for solving them.                                             tackling the above issues by developing systems for large-
                                                                         scale analytics that aim at providing higher-level primitives
                                                                         for building data mining and machine learning algorithms,
Keywords                                                                 as well as hiding low-level details of distributed execution.
Data analytics, Declarative machine learning, Distributed                MapReduce [12] and Dryad [16] were the first frameworks
processing                                                               that paved the way. However, these initial efforts suffered
                                                                         from low usability, as they offered expressive but at the same
1.    INTRODUCTION                                                       time low-level languages to program data analysis tasks.
                                                                         Soon the need for higher-level programming languages on
  With the rapid growth of world wide web (WWW) and the
                                                                         top of these frameworks became apparent. Systems, such
development of social networks, the available amount of data
                                                                         as Hive [23], Pig Latin [20], DryadLINQ [25] and Scope [9],
has exploded. This availability has encouraged many com-
                                                                         offer higher-level languages that enable developers to write
panies and organizations in recent years to collect and anal-
                                                                         part of their programs in declarative style. Then, these pro-
yse data, in order to extract information and gain valuable
                                                                         grams are automatically translated into MapReduce jobs or
knowledge. At the same time hardware cost has decreased,
                                                                         Dryad vertices that form a directed acyclic graph (DAG),
so storage and processing of big data is not prohibitive even
                                                                         which is optimized for efficient distributed execution. This
for smaller companies. Topic classification, sentiment anal-
                                                                         paradigm is adopted by other systems, too. Stratosphere
ysis, spam filtering, fraud and anomaly detection are only
                                                                         [5], Tupleware [11] and MLbase [17] also aim at hiding de-
a few analytics tasks that gained considerable popularity
                                                                         tails of distributed execution, such as load balancing and
over the past few years, along with more traditional ware-
                                                                         scheduling, in order to give the impression to the user that
house queries that gather statistics from data. Hence, mak-
                                                                         she develops code for a single machine.
ing the deployment of solutions for such tasks less tedious,
                                                                            Apart from the programming model, optimization tech-
adds value to the services provided by these companies and
                                                                         niques is another important issue that these systems address.
organizations, and encourages more people to enhance their
                                                                         Big data need fast and efficient solutions and as a conse-
work using information from data.
                                                                         quence frameworks should leverage any opportunity for code
  It is clear that the areas of data mining and machine learn-
                                                                         optimization. Query rewriting, exploitation of data locality
ing are at the core of data analysis tasks. However, devel-
                                                                         and vectorization are concepts already known and widely
oping such algorithms needs not only expertise in software
                                                                         explored in databases and compilers. These techniques are
engineering, but also a solid mathematical background in
                                                                         also applied in the aforementioned analytics systems along
order to interpret correctly and efficiently the mathematical
                                                                         with more recent ideas.
computations into a program. Even when experimenting
                                                                            In this paper, we study systems for large-scale data anal-
∗
  This work was co-funded by the European Union and the Gen-             ysis in the context of these two directions: semantics and
eral Secretariat of Research and Technology, Ministry Of Educa-          optimization techniques. We present an overview of the im-
tion, Research Religious Affairs under the project Asset Manage-         mense development of systems for large scale data analysis
ment Optimization & Risk Management Software (AMOR) of the
                                                                         that made their appearance over the past few years and pro-
Bilateral R&T Cooperation Program Greece - Israel 2013-2015.
                                                                         pose research topics that we argue will attract the interest
 c 2016, Copyright is with the authors. Published in the Workshop Pro-   of the data management community in the near future. The
ceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bor-
deaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this
                                                                         rest of the paper is organized as follows. Section 2 describes
paper is permitted under the terms of the Creative Commons license CC-   classes of systems for distributed data analytics and section
by-nc-nd 4.0
3 analyses open challenges concerning the semantics and the       ilar to Stratosphere’s Sopremo layer, Jaql and Pig Latin de-
optimization methods used in such systems. Finally, section       fine each workflow as a sequence of steps, but with each step
4 concludes the paper.                                            performing a high-level transformation, such as SQL and
                                                                  ETL operations. Users are able to integrate custom code in
2.   CURRENT STATE OF THE ART                                     their data analysis tasks by writing their own UDFs either
                                                                  in external languages, such as Java and Python, or by us-
   Current work in the area of big data analytics focuses
                                                                  ing operators of the system. Eventually, Jaql and Pig Latin
on proposing programming models for data mining and ma-
                                                                  scripts are automatically compiled to MapReduce jobs. A
chine learning tasks, and developing optimizations that re-
                                                                  variant of Pig Latin, called MyriaL, extends its program-
sult to efficient execution of users’ programs on a distributed
                                                                  ming model with looping constructs and is used in Myria
execution engine. So, large-scale analytics frameworks can
                                                                  [15], a big data management service. Iterative processes are
be divided into two main categories: libraries of data min-
                                                                  also a limitation for Jaql and Sopremo, as the developer is
ing/machine learning algorithms or sets of primitives to de-
                                                                  not able to define in these programming models that a given
velop such algorithms.
                                                                  workflow will be repeated for a number of iterations or as
   Libraries provide implementations of algorithms commonly
                                                                  long a condition holds.
used in data analysis tasks, such as SVMs and K-means, tar-
                                                                     Moving on to the incorporation of relational operators
geted to a specific distributed execution platform. The user
                                                                  to a high-level language, we find DryadLINQ that trans-
is able to use these algorithms in her code by calling them
                                                                  forms LINQ programs into distributed processes running on
as functions, in order to load data from files or databases,
                                                                  an execution engine called Dryad. The LINQ model incor-
transform data or use machine learning algorithms to anal-
                                                                  porates constructs for manipulating data items into a host
yse them. The programming paradigm is the same as that
                                                                  language, such as C# and other .NET languages. Itera-
of a developer writing code for a single machine and call-
                                                                  tion is expressed using loop constructs of the host language.
ing functions from a third-party library. The difference is
                                                                  Spark exposes a similar functional programming interface
that this code is then automatically compiled by the sys-
                                                                  implemented in Scala, but provides greater support for it-
tem in order to be executed on a distributed platform. Such
                                                                  erative processes and expression of shared state among it-
libraries are available for all popular distributed execution
                                                                  erations. Finally, Tupleware follows the same approach as
engines. Apache Mahout [2] was initially implemented on
                                                                  Stratosphere’s PACT model by proposing operators that are
Hadoop [1] and is gradually extended to Spark [26]. A simi-
                                                                  second-order functions and take as argument a user defined
lar example is MLlib [19], a scalable machine learning library
                                                                  first-order function. However, Tupleware’s set of operators
on top of Spark, whereas MADlib [10] is a library of SQL-
                                                                  is more extended than the one provided by PACT. These op-
based machine learning and data analysis algorithms, which
                                                                  erators are used to define workflows inside a host language.
run on database engines. They include algorithms for clas-
                                                                  Then, the system transforms user’s code into a distributed
sification, clustering, collaborative filtering, dimensionality
                                                                  program, which is deployed and executed on a cluster of
reduction and other useful preprocessing tasks.
                                                                  machines.
   Systems that belong to the second category provide a
                                                                     On the other hand, MLI API [22] and SystemML [13] that
set of primitives to the users, in order to simplify the de-
                                                                  run on Spark and MapReduce respectively, aim to imitate
velopment of distributed data analysis algorithms. These
                                                                  the style of statistical computing languages, such as R and
primitives can be also combined with UDFs (User Defined
                                                                  MATLAB, which are very popular among machine learning
Functions) in many cases to allow for custom code in data
                                                                  researchers and help them build software prototypes quickly.
analysis tasks. In this class of systems, algorithms are not
                                                                  As a result, MLI includes interfaces for three main concepts,
ready to call, but the user can use these primitives to develop
                                                                  Optimizers, Algorithms and Models, as well as APIs for two
machine learning or data mining algorithms that run on a
                                                                  data structures MLTable and LocalMatrix, which supports
specific distributed execution engine. The provided primi-
                                                                  linear algebra operations. In SystemML, programs are se-
tives hide low-level details concerning distribution, such as
                                                                  quences of statements written in DML, a language which
load balancing and fault tolerance, for which the system
                                                                  includes constructs for input/output, control structures and
provides solutions. Declarative style combined with imper-
                                                                  assignments, as seen in R, on matrices or scalars. Neverthe-
ative/procedural programming, is popular in this kind of
                                                                  less, more advanced features of R, such as objects and lists,
systems, gaining ground for the application of the DBMS
                                                                  are not currently supported by DML.
paradigm in data analysis platforms.
                                                                     Finally, MLbase provides a declarative language above a
   This trend is already evident in existing platforms, such
                                                                  layer of a library of algorithms, by which the user is able to
as Stratosphere, Jaql [6] and Pig Latin that provide a hy-
                                                                  define the type of task she wants to execute, e.g. classifica-
brid of procedural and declarative programming, as well as
                                                                  tion, and the data to be used. Then the system tests various
DryadLINQ, Tupleware and Spark that attempt to incor-
                                                                  machine learning algorithms from its library and parameter
porate relational and other operators to a host language.
                                                                  values on the data provided, and determines an effective
Stratosphere provides three programming models organized
                                                                  combination based on quality and time performance. Apart
as layers of a larger stack that comes down to an execution
                                                                  from the declarative language, MLbase also offers high-level
engine and a data storage system. Sopremo is the top layer
                                                                  primitives, such as gradient and stohastic gradient descent,
of the stack and trades expressiveness for declarativity. It
                                                                  that make development of distributed machine learning al-
includes a considerable number of high-level operators, such
                                                                  gorithms easier for researchers.
as relational or domain-specific operators which offer more
                                                                     Despite the abundance of available systems and the va-
advanced functionality (e.g. duplicate detection, named en-
                                                                  riety of approaches that these follow, programmers are still
tity recognition). Each Sopremo plan is translated to the
                                                                  expected to write a considerable amount of code in most
programming model below Sopremo, PACT, and is finally
                                                                  of these platforms. The idea of providing greater degree
executed on Nephele, Stratosphere’s execution engine. Sim-
                                                                 improvements in these layers can increase the efficiency of
                                                                 execution overall. An extension of Datalog could also cover
                                                                 more requirements of data analysis tasks and be exposed
                                                                 to the users as a programmable language. The distinction
                                                                 between the user’s program and the logical layer is also sup-
                                                                 ported by Hyracks [7], which serves as a parallel-platform
                                                                 for compiling higher-level declarative data-processing lan-
                                                                 guages, such as Pig Latin and Hive.
                                                                    We also consider the sets of operations provided by MLI
                                                                 API, MLbase and Spark ML [3] good attempts, as except
                                                                 from relational operators, they are also closer to the seman-
                                                                 tics of machine learning area by providing declarative op-
                                                                 erators for some frequent components of machine learning
                                                                 algorithms, such as linear algebra operations and gradient
                                                                 descent. The Spark ML package is also built around the key
Figure 1: The main categories of analytics operators             concepts of learning and data transformation algorithms, in
in current systems.                                              order to standardize multiple APIs for machine learning on
                                                                 top of a data structure called DataFrame. Thus, a key in-
                                                                 gredient in the development of optimizable, massively scal-
of declarativity by extending the set of operators that data     able analytics is the creation of an analytics query engine
analysis systems support, and minimize any glue code needed      supporting a language with clear semantics that incorpo-
between operation calls would make the development and           rates both relational operators and high-level operators for
maintenance of programs much easier. The description of          common components of machine learning algorithms, e.g.
an algebra that would form the basis of these operators and      optimization functions and preprocessing tasks 1 . The pur-
model their semantics is a primary goal at this direction.       pose of this language is to model various data analysis tasks
We explore this open problem in the following section, as        as plans consisting solely or mostly of operation calls. An-
well as the challenges that are presented by optimizing the      other important aspect of an efficient analytics query engine
execution of tasks written in this set of operators.             is the implementation of optimizations, as described in the
                                                                 next section.
3.    OPEN CHALLENGES
                                                                 3.2   Optimizations in Data Analytics Systems
3.1   Data Analytics Semantics                                      The optimization techniques adopted by large-scale ana-
   For some time now, the data management community              lytics systems are mainly borrowed from two areas: databases
compares the current situation in data analytics with the        and compilers.
beginning of database systems. Currently, data mining and           As in database systems, query rewriting is also used in
machine learning tasks in distributed platforms are done in      data analysis systems, either depending on heuristics or on
ad hoc ways and developers have to use various systems,          a cost model. Heuristics are rules that are fired based on spe-
each targeted to a specific class of tasks. Data models and      cific properties, but it is not examined whether these rules
operators are different among systems, each exposing its own     produce a faster plan in any case. DryadLINQ is currently
semantics, and there is no notion of an algebra that could       applying heuristics in query rewriting, although in the fu-
form a basis for data analytics languages. Figure 1 displays     ture the group plans to design and implement a cost model
the main categories of operators included in the current sys-    for query optimization similar to the one used in DBMS.
tems.                                                            Stratosphere and Jaql have already moved on to cost-based
   While relational queries are easily defined with the afore-   optimizations, handling also difficulties that appear due to
mentioned operators, common concepts in machine learn-           the large amount of user-defined code. Given that many op-
ing, such as models and optimization algorithms, are not         erators of data analysis systems take as input user defined
represented in a declarative way in the systems described        first-order functions, semantics of these operators cannot be
above and their implementation still involves a lot of cus-      known. This makes a big difference from traditional query
tom code. Hence, apart from the distributed execution, the       optimizations. Static code analysis is one way to tackle un-
development of machine learning algorithms is not simpli-        known semantics. Through code analysis, it is possible to
fied compared to more traditional languages such as R and        determine which data are read and written from each oper-
MATLAB.                                                          ator and separate them into read and write sets. Applying
   In an attempt to present declarative solutions for machine    compiler optimizations to UDFs is also useful. Function and
learning tasks, recent work [8] proposes Datalog as a suitable   variable inlining, as well as SIMD vectorization are the most
declarative foundation for analytics systems. As each sys-       popular compiler optimization techniques we see in these
tem is usually targeted to a specific ML class of tasks (e.g.    systems. Jaql and Tupleware already exploit these ideas.
graph analytics), Datalog can serve as a logical layer where        Concerning optimization in the context of the analytics
all these different programming models will be translated.       model proposed above, we describe crucial issues that come
The purpose is two-folded. First, to avoid developing opti-      into the picture. The first decision to be made regards the
mizations for each new system and second to separate the         1
                                                                   Part of these elements are also covered by the PMML for-
user’s program from its logical interpretation. The second       mat [14], a data exchange standard for sharing predictive
argument is very important as it prevents changes in log-        models produced by data mining and machine learning al-
ical and physical layer from affecting user’s code, whereas      gorithms.
definition of the form of the logical plan for the programs             Proc. VLDB Endow., 2(2):1481–1492, Aug. 2009.
expressed with this model. Another important concern is            [11] A. Crotty, A. Galakatos, K. Dursun, T. Kraska,
the computation of appropriate cost metrics for evaluating              U. Çetintemel, and S. B. Zdonik. Tupleware: Redefining
these plans. Traditional cost models used in query opti-                modern analytics. CoRR, abs/1406.6667, 2014.
mization are not designed to cover every dimension of query        [12] J. Dean and S. Ghemawat. Mapreduce: simplified data
                                                                        processing on large clusters. In OSDI’04: Proceedings of
performance, such as execution time. Machine learning has               the 6th Conference on Symposium on Operating Systems
already made its way into prediction of execution latency               Design and Implementation. USENIX Association, 2004.
and resource usage [4, 18] for queries running on DBMSs            [13] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald,
with relevant papers proposing the training of models on                V. Sindhwani, S. Tatikonda, Y. Tian, and
previous query instances, whose features are based on cardi-            S. Vaithyanathan. Systemml: Declarative machine learning
nalities estimated by the optimizer, the count of occurrences           on mapreduce. In Proceedings of the 2011 IEEE 27th
                                                                        International Conference on Data Engineering, ICDE ’11,
for each operator in the query plan and other statistics. The           pages 231–242, 2011.
application of similar machine learning techniques for pre-        [14] A. Guazzelli, M. Zeller, W.-C. Lin, and G. Williams.
dicting performance metrics of data analysis programs is an             PMML: An open standard for sharing models. The R
interesting direction to pursue, which becomes even more                Journal, 1(1):60–65, 2009.
challenging if we consider concurrent workloads [24]. Finally,     [15] D. Halperin, V. Teixeira de Almeida, L. L. Choo, S. Chu,
given the sheer size of data that are analysed nowadays, the            P. Koutris, D. Moritz, J. Ortiz, V. Ruamviboonsuk,
proposed optimization techniques should address challenges              J. Wang, A. Whitaker, S. Xu, M. Balazinska, B. Howe, and
that arise in a distributed environment [21] and efficiently            D. Suciu. Demonstration of the myria big data
                                                                        management service. In Proceedings of the 2014 ACM
translate these logical plans to programs that would run in             SIGMOD International Conference on Management of
a distributed execution engine, such as Spark.                          Data, SIGMOD ’14, pages 881–884, 2014.
                                                                   [16] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly.
4.   CONCLUSION                                                         Dryad: Distributed data-parallel programs from sequential
                                                                        building blocks. In Proceedings of the 2Nd ACM
   In this paper, we presented open challenges in the research          SIGOPS/EuroSys European Conference on Computer
area of big data analytics and we specifically focused on the           Systems 2007, EuroSys ’07, pages 59–72, 2007.
aspects of semantics and optimization techniques. Given            [17] T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. J.
the variety of systems that are used for data analysis, we              Franklin, and M. I. Jordan. Mlbase: A distributed
believe that the consolidation of semantics in an appropriate           machine-learning system. In CIDR. www.cidrdb.org, 2013.
algebra and the evolvement of optimization methods applied         [18] J. Li, A. C. König, V. Narasayya, and S. Chaudhuri.
                                                                        Robust estimation of resource consumption for sql queries
in an analytics query engine are of great importance and will
                                                                        using statistical techniques. In 38th International
attract even more interest in the next few years.                       Conference on Very Large Databases. Very Large Data
                                                                        Bases Endowment Inc., August 2012.
5.[1] Apache
       REFERENCES
             hadoop. https://hadoop.apache.org/.
                                                                   [19] X. Meng, J. K. Bradley, B. Yavuz, E. R. Sparks,
                                                                        S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai,
 [2] Apache mahout. http://mahout.apache.org/.                          M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin,
 [3] Spark ml.                                                          R. Zadeh, M. Zaharia, and A. Talwalkar. Mllib: Machine
     http://spark.apache.org/docs/latest/ml-guide.html.                 learning in apache spark. CoRR, abs/1505.06807, 2015.
 [4] M. Akdere, U. Çetintemel, M. Riondato, E. Upfal, and         [20] C. Olston, B. Reed, U. Srivastava, R. Kumar, and
     S. B. Zdonik. Learning-based query performance modeling            A. Tomkins. Pig latin: A not-so-foreign language for data
     and prediction. In Proceedings of the 2012 IEEE 28th               processing. In Proceedings of the 2008 ACM SIGMOD
     International Conference on Data Engineering, ICDE ’12,            International Conference on Management of Data,
     pages 390–401, 2012.                                               SIGMOD ’08, pages 1099–1110, 2008.
 [5] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag,           [21] M. T. Ozsu. Principles of Distributed Database Systems.
     F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl,         Prentice Hall Press, 3rd edition, 2007.
     F. Naumann, M. Peters, A. Rheinländer, M. J. Sax,            [22] E. R. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan,
     S. Schelter, M. Höger, K. Tzoumas, and D. Warneke. The            J. E. Gonzalez, M. J. Franklin, M. I. Jordan, and
     stratosphere platform for big data analytics. The VLDB             T. Kraska. Mli: An api for distributed machine learning.
     Journal, 23(6):939–964, Dec. 2014.                                 CoRR, abs/1310.5426, 2013.
 [6] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y.       [23] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka,
     Eltabakh, C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A          S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A
     scripting language for large scale semistructured data             warehousing solution over a map-reduce framework. Proc.
     analysis. PVLDB, 4(12):1272–1283, 2011.                            VLDB Endow., 2(2):1626–1629, Aug. 2009.
 [7] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica.     [24] W. Wu, Y. Chi, H. Hacı́gümüş, and J. F. Naughton.
     Hyracks: A flexible and extensible foundation for                  Towards predicting query execution time for concurrent
     data-intensive computing. In Proceedings of the 2011 IEEE          and dynamic database workloads. Proc. VLDB Endow.,
     27th International Conference on Data Engineering, ICDE            6(10):925–936, Aug. 2013.
     ’11, pages 1151–1162, 2011.                                   [25] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson,
 [8] Y. Bu, V. R. Borkar, M. J. Carey, J. Rosen, N. Polyzotis,          P. K. Gunda, and J. Currey. Dryadlinq: A system for
     T. Condie, M. Weimer, and R. Ramakrishnan. Scaling                 general-purpose distributed data-parallel computing using a
     datalog for machine learning on big data. CoRR,                    high-level language. In Proceedings of the 8th USENIX
     abs/1203.0160, 2012.                                               Conference on Operating Systems Design and
 [9] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey,                   Implementation, OSDI’08, pages 1–14, 2008.
     D. Shakib, S. Weaver, and J. Zhou. Scope: Easy and            [26] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and
     efficient parallel processing of massive data sets. Proc.          I. Stoica. Spark: Cluster computing with working sets. In
     VLDB Endow., 1(2):1265–1276, Aug. 2008.                            Proceedings of the 2Nd USENIX Conference on Hot Topics
[10] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and              in Cloud Computing, HotCloud’10, pages 10–10, 2010.
     C. Welton. Mad skills: New analysis practices for big data.