<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Relational Approach to Complex Dataflows</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Yannis</forename><surname>Chronis</surname></persName>
							<email>i.chronis@di.uoa.gr</email>
							<affiliation key="aff1">
								<orgName type="department">Dept. of Informatics and Telecom</orgName>
								<orgName type="laboratory">MaDgIK Lab</orgName>
								<orgName type="institution">University of Athens</orgName>
								<address>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yannis</forename><surname>Foufoulas</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Vaggelis</forename><surname>Nikolopoulos</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Alexandros</forename><surname>Papadopoulos</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Lefteris</forename><surname>Stamatogiannakis</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Christoforos</forename><surname>Svingos</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Yannis</forename><surname>Ioannidis</surname></persName>
						</author>
						<title level="a" type="main">A Relational Approach to Complex Dataflows</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">6E0038108EBC9EE8255A76EBCC41B98B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T21:39+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Clouds have become an attractive platform for highly scalable processing of Big Data, largely due to the elasticity that characterizes them. Several languages and systems for cloud-based data processing have been proposed in the past, the most popular among them being based on MapReduce [7]. In this paper, we present Exareme, a system for elastic large-scale data processing on the cloud that follows a more general paradigm. Exareme is an open source project [1] 1 . The system offers a declarative language based on SQL with user-defined functions (UDFs), extended with parallelism primitives and an inverted syntax that makes data pipelines easy to express. Exareme is designed to take advantage of clouds by dynamically allocating and deallocating compute resources, offering trade-offs between execution time and monetary cost.</p><p>1 This research is supported in part by the European Commission under the Optique, Human Brain and MD-Paedigree projects.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Modern applications face the need to process large amounts of data using complex functions. Examples include complex analytics <ref type="bibr" target="#b13">[14]</ref>, similarity joins <ref type="bibr" target="#b11">[12]</ref>, and extract-transform-load (ETL) processes <ref type="bibr" target="#b14">[15]</ref>. Such rich tasks are typically expressed using high-level APIs or languages <ref type="bibr" target="#b15">[16]</ref> and are transformed into data-intensive workflows, or simply dataflows. Exareme uses a master-worker architecture. Our language is based on SQL and expresses both intra-worker and inter-worker dataflows. We use UDFs and an inverted syntax to easily express local pipelines and complex computations. Inter-worker dataflows are described with simple parallelism primitives. These abstractions allow users to fine-tune dataflows for different applications. All of the basic components of Exareme are designed to support the elastic properties of cloud infrastructures. We provide comparisons to other state-of-the-art systems.</p><p>The system architecture is shown in Figure <ref type="figure" target="#fig_1">1</ref>. From a user's point of view, the system is used as a traditional database system: create / drop tables or indexes, import external data, issue queries. Queries are expressed in ExaDFL. ExaDFL is transformed into data processing flows (dataflows) represented as directed acyclic graphs (DAGs) that have arbitrary computations (operators) as nodes and producer-consumer interactions as edges between the nodes. The typical queries we target are complex data-intensive transformations that are expensive to execute; queries may run for several minutes or hours.</p><p>Exareme is separated into the following components. The Master is the main entry point to the system, through the gateway, and is responsible for coordinating the rest of the components. The Execution Engine communicates with the resource manager and schedules the operators of the query, respecting their dependencies in the dataflow graph and the available resources. It also monitors the dataflow execution and handles failures. All the information related to the data and the allocated VMs is stored in the Registry. The Resource Manager is responsible for the allocation and deallocation of VMs based on demand. The Optimization Engine translates an ExaDFL query into the distributed machine code of the system (similar to <ref type="bibr" target="#b12">[13]</ref>) and creates the final execution plan by assigning operators to workers (Section 4.1). Finally, the Worker executes operators (relational operators and UDFs) and transfers intermediate results to other workers. Each worker fetches the partitions needed for the execution and caches them on its local disk for subsequent use. Madis, the core engine of the Worker <ref type="bibr" target="#b2">[2]</ref>, is an extension of SQLite based on the APSW wrapper. It executes the computations described by ExaQL (Section 3.1). Madis processes the data in a streaming fashion and performs pipelining whenever possible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Data Model</head><p>Exareme adopts the relational data model and extends it with: Complex Field Types: JSON, CSV, and TSV. Table Partitions: a table is defined as a set of partitions, and a partition is defined as a set of records sharing a particular property, e.g., the hash value of a column. Partitioning: if the database has multiple tables, as happens in data warehouses, the largest tables are partitioned and all others are replicated along with the partitions. Data placement is crucial for performance and elasticity. We use a modification of consistent hashing <ref type="bibr" target="#b10">[11]</ref>, because it offers good theoretical bounds and can be accurately modeled. To increase flexibility and efficiency we use over-partitioning and replication. This way, changing the size of the virtual infrastructure causes only data transfer, not the computation of a new partitioning.</p></div>
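The placement scheme above can be illustrated with a toy consistent-hashing ring that uses virtual nodes for over-partitioning. This is a sketch of the general technique only, not Exareme's actual implementation; the worker names, vnode count, and hash function are assumptions:

```python
import hashlib
from bisect import bisect_right

def _h(key: str) -> int:
    # Stable hash: map a key onto a 2^32 ring position.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class Ring:
    """Consistent-hash ring with virtual nodes (over-partitioning)."""
    def __init__(self, workers, vnodes=16):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (position, worker)
        for w in workers:
            self.add(w)

    def add(self, worker):
        # Each worker owns several small arcs of the ring, so adding a
        # worker steals only a fraction of partitions from the others.
        for i in range(self.vnodes):
            self.ring.append((_h(f"{worker}#{i}"), worker))
        self.ring.sort()

    def lookup(self, partition_key: str) -> str:
        # A partition lives on the first vnode clockwise of its hash.
        pos = _h(partition_key)
        idx = bisect_right(self.ring, (pos, chr(0x10FFFF))) % len(self.ring)
        return self.ring[idx][1]
```

Growing the infrastructure by one worker then moves only the partitions whose owning arc changed, which is the property the text relies on: resizing triggers data transfer, not a full repartitioning.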
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Money/Time Trade-Offs</head><p>Exareme can express money/time trade-offs by examining variations of an execution plan; we refer to this notion as eco-elasticity <ref type="bibr" target="#b8">[9]</ref> [8]. Exareme's scheduler creates different execution plans using the algorithm described in Section 4.1. Along with every query, the user can specify an SLA. Using the SLA, the scheduler chooses the execution plan that matches its time and money requirements.</p></div>
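A minimal sketch of how an SLA might select one plan from a time/money skyline. The tuple format, parameter names, and tie-breaking rules are assumptions for illustration, not Exareme's actual policy:

```python
def choose_plan(skyline, max_time=None, max_money=None):
    """Pick a plan from a skyline of (time, money) pairs under an SLA.

    With a deadline, return the cheapest plan that meets it; with
    only a budget, return the fastest plan within it.
    """
    feasible = [p for p in skyline
                if (max_time is None or p[0] <= max_time)
                and (max_money is None or p[1] <= max_money)]
    if not feasible:
        return None  # no plan satisfies the SLA
    if max_time is not None:
        return min(feasible, key=lambda p: (p[1], p[0]))  # cheapest, then fastest
    return min(feasible, key=lambda p: (p[0], p[1]))      # fastest, then cheapest
```

Because every plan on the skyline is Pareto-optimal, tightening the deadline simply walks the selection toward faster, more expensive plans.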
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">LANGUAGE</head><p>Queries are issued to Exareme using ExaDFL. ExaDFL is a dataflow language that describes DAGs; it is based on SQL extended with UDFs and data parallelism primitives <ref type="bibr" target="#b5">[6]</ref>. ExaDFL allows fine control, but requires an understanding of partitioning and data placement. We are currently working on an optimizer that will produce ExaDFL from UDF-extended SQL by applying classic database optimizations and replacing functions with their distributed versions when necessary. In this section we first present the language that describes intra-worker dataflows, then the data parallelism primitives, and finally the language as a whole. We use the following subset of TPC-H [3]: lineitem(l_orderkey, l_comment), orders(o_orderkey, o_clerk). Both are hash-partitioned into 4 parts on their keys.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">ExaQL</head><p>ExaQL is based on the SQL-92 standard. The relational primitives of SQL are a good way to express relations and data combinations. We use SQL to combine data and process them with UDFs whenever the SQL abstractions are not sufficient or efficient to use. We enhanced the syntax of ExaQL to easily combine virtual table functions (UDTFs) into data pipelines. Suppose we want to find the most frequent words that some clerks use in their comments when they buy or sell products. We have the names of the clerks in a compressed XML file that is accessible via HTTP. In ExaQL, we can express this as follows: the query uses the FILE UDTF to fetch, uncompress, and load the data on-the-fly from the specified HTTP server. There is no need to import the data or create temporary tables; all the details are handled automatically by the system. The output of FILE is given to the XMLPARSE UDTF, which parses the XML content and produces a table with the names of the clerks. The row function STRSPLITV takes a string as input and produces one nested table for each comment by splitting its words into rows. Notice that this behaviour is different from the row functions typically supported by database systems, which produce a single value; Exareme extends row and aggregate functions in this way.</p></div>
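The one-to-many behaviour of STRSPLITV can be sketched in plain Python. This mimics only the semantics of the example query's inner pipeline; the real UDTF runs inside Madis/SQLite, and the sample comments are made up:

```python
from collections import Counter

def strsplitv(text):
    """One-to-many row function: emits one row per word, unlike a
    typical scalar row function that returns a single value."""
    for word in text.split():
        yield word

# Mimic the example query: split each matched comment into word rows,
# then group by word and order by frequency, most frequent first.
comments = ["quick deliver quick", "deliver late quick"]
counts = Counter(w for c in comments for w in strsplitv(c))
top = counts.most_common()
```

The generator is the key point: each input row fans out into many output rows, which is exactly what lets these functions be chained into pipelines.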
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Data Parallelism Primitives</head><p>Simple primitives declaratively express potential data parallelism in the dataflow language itself and let the system decide the actual degree of parallelism at runtime. This is very helpful, since queries are expressed independently of the parallelism used.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1">Input Primitives</head><p>Figure <ref type="figure" target="#fig_5">3</ref> (top) shows the types of combinations supported on two partitioned tables R and S, where a query Q is executed on each partition pair indicated, as well as the type of reduction supported on a single partitioned table. Direct: this combines two (or more) tables that either (a) have both been partitioned in the way required by the combination specified, e.g., a distributed join on tables hash-partitioned on the join attribute, or (b) one has been fully replicated and the other has been partitioned in some fashion, e.g., a join between a small table replicated to the locations of the partitions of a much larger table. Cartesian product: this combines two (or more) tables that have been partitioned in ways unrelated to the combination specified. Tree: this performs a multi-level tree reduction on a single table, generalizing the two-level (combine and reduce) reduction of MapReduce. It is used when Q has aggregate functions that are algebraic or distributive, and has been shown to exhibit very good performance in practice. Two or more queries form a script. Each query has two semantically different parts: parallelism and ExaQL. The first part describes the input and output data parallelism used, and the second the computations that are executed on each input combination. The following ExaDFL dataflow is equivalent to the ExaQL query of our example: the first query is executed to download and parse the XML file. The extern directive declares that the query uses an external source and that only one instance of the query should be created. The result is a table called clerk that is replicated to 4 partitions. The second query combines tables lineitem, orders, and clerk using the direct input combination. Notice that the result of the join is correct, since tables lineitem and orders are partitioned on the join column and table clerk is replicated. Finally, the third query creates table result using a tree aggregation. This is possible because the aggregate function sum is distributive. All temporary tables are deleted automatically at the end of the script.</p></div>
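The tree reduction can be sketched as follows for a distributive aggregate such as sum. This is an illustration of the idea only, with an assumed fan-in of 2; the real primitive runs across workers:

```python
def tree_reduce(partials, combine, fanin=2):
    """Multi-level tree reduction: partial results are combined level
    by level until one value remains, generalizing the two-level
    combine/reduce of MapReduce."""
    level = list(partials)
    while len(level) > 1:
        level = [combine(level[i:i + fanin])
                 for i in range(0, len(level), fanin)]
    return level[0]

# A distributed SUM over 4 partitions: each worker computes a partial
# sum locally, then the partials are reduced in a tree.
partials = [sum(p) for p in [[1, 2], [3], [4, 5], [6]]]
total = tree_reduce(partials, combine=sum)  # same result as a flat sum
```

Correctness rests on the property the text names: for distributive (or algebraic) aggregates, combining partial results in any grouping yields the same answer as aggregating all rows at once.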
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2">Output Primitives</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">ExaDFL</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">QUERY OPTIMIZATION</head><p>In principle, the optimization process could proceed in one giant step, examining all execution plans that could answer the query and choosing the optimal one that satisfies the required constraints. Given the size of the space of alternatives in our setting, this approach is infeasible. Instead, our optimization process proceeds in multiple smaller steps, each operating at some level and making assumptions about the levels below. This is analogous to query optimization in traditional databases, but with the following differences. The operators may represent arbitrary operations and may have performance characteristics that are not known. Furthermore, optimality may be subject to QoS or other constraints and may be based on multiple criteria, e.g., the monetary cost of resources, the quality of data, etc., and not solely on performance. Finally, the resources available for the execution of a dataflow are not fixed a priori but are flexible and reservable on demand.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Sky</head><p>The dataflow scheduler we use takes as input the dataflow DAG and assigns its nodes (operators) to workers. It does so by taking into account two types of constraints: i) dataflow (DAG) constraints, implied by the inter-operator dependencies captured by its edges, and ii) execution environment constraints, due to resource limitations. In that respect, we categorize resources as time-shared and space-shared <ref type="bibr" target="#b9">[10]</ref>. Time-shared resources can be used by multiple operators concurrently at very low overhead, whereas concurrent use of space-shared resources beyond a worker's limits implies high overheads. We consider memory the only space-shared resource, and CPU and network time-shared resources. Constraints are therefore imposed only by space-shared resources: in every worker, at any given moment, memory must be sufficient for the execution of the running operators. The scheduling algorithm we propose is Dynamic Skyline (Sky), shown in Algorithm 1.</p><p>Sky is an iterative algorithm that incrementally computes skylines of schedules (Figure <ref type="figure" target="#fig_3">2</ref>). The algorithm begins by scheduling the operators from producers to consumers, as defined by the DAG. Each operator with no inputs is initially a candidate for assignment; an operator becomes a candidate as soon as all of its inputs are available. The scheduler considers assigning every operator to an existing worker or to a new worker obtained by adding a new VM. The result is a skyline of schedules (Figure <ref type="figure" target="#fig_3">2</ref>). The final execution plan can be selected either manually or automatically based on SLAs <ref type="bibr" target="#b16">[17]</ref>. The scheduler uses the following heuristics regarding data transfers. It transfers only intermediate results and, if possible, does not move original tables. 
Intermediate results are usually smaller than the original tables, because queries with a single input usually contain filters and queries with multiple inputs usually join the tables using equi-joins, which reduce the size of the output table. Some types of queries are executed very efficiently this way, especially when the small tables fit in memory; this is the usual case for OLAP workloads with star or snowflake schemas. Another benefit of this approach is the exploitation of indexes, if they exist on the original tables. In addition, we add gravity operators pinned to the location of the tables, so that moving the original tables out of their initial location becomes an optimization choice.</p></div>
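The skyline pruning at the heart of Sky can be sketched as follows. The cost numbers and the two placement choices (reuse an existing VM vs. allocate a new one) are illustrative assumptions, not the system's cost model:

```python
def dominates(a, b):
    """a dominates b if a is no worse than b on both the (time, money)
    axes and the two schedules differ."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def pareto_skyline(schedules):
    """Keep only the non-dominated (time, money) schedules."""
    return [s for s in schedules
            if not any(dominates(o, s) for o in schedules)]

# One Sky-style step: extend every schedule on the current skyline by
# placing the next operator either on an existing VM (slower, no extra
# money) or on a freshly allocated VM (faster, more money), then prune
# the candidates back down to a skyline.
skyline = [(30, 4), (50, 2)]  # (time, money) of partial schedules
candidates = []
for t, m in skyline:
    candidates.append((t + 20, m))     # reuse an existing VM
    candidates.append((t + 5, m + 2))  # allocate a new VM
skyline = pareto_skyline(candidates)
```

Repeating this step once per operator, as Algorithm 1 does, keeps the set of schedules small while preserving every trade-off a user might pick via an SLA.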
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">EXPERIMENTAL EVALUATION</head><p>Environment: We used up to 64 VMs, each with 1 CPU, 4 GB of memory, and 20 GB of disk, provided by Okeanos. The average network speed measured was 150 Mbps. Datasets: We generated a total of 256 GB of data for the following tables using the TPC-H benchmark <ref type="bibr">[3]</ref> (in parentheses we note the number of partitions and the partitioning key): region(1), partsupp(1, ps_partkey), orders(128, o_orderkey), lineitem(128, l_orderkey), customer(1, c_custkey), part(1, p_partkey), nation(1), and supplier(1). Measurements: We ran each query 4 times and report the average of the last 3 measurements. We compared Exareme with Hive <ref type="bibr" target="#b15">[16]</ref> (with both MR <ref type="bibr" target="#b3">[4]</ref> and Tez <ref type="bibr" target="#b4">[5]</ref>, formerly known as Stinger, as backends) and System X (an industry-leading commercial system). Figure <ref type="figure" target="#fig_8">4</ref> shows the results; to save space we have omitted some results, but only where Exareme is faster. Hive-Stinger was always faster than Hive. The versions of the systems we used are Hive 0.13.1, Hadoop 2.5.1, and Tez 0.5.0 (intermediate results are compressed with Snappy). System X is faster for queries 1 and 6, which involve aggregations on the largest table (lineitem). We were not able to execute queries 8 and 9 on System X because of memory limits (System X is an in-memory system). Overall, we observe that Exareme is faster in most cases than the state-of-the-art systems. Figure <ref type="figure" target="#fig_9">5</ref> shows the profit gained by exploiting eco-elasticity. As a baseline we use three static infrastructures that do not change over time: small with 15 VMs, medium with 30 VMs, and large with 60 VMs. We ran the system using a client that issues Q1 in three phases, each of one hour duration. In the first and third phases, the Poisson parameter λ is set to 60, and in the second phase to 30 (the rate is doubled). The elastic layout allocator produces a better-fitted layout that adapts to the workload changes and yields the highest profit compared to all static choices.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Architecture diagram labels: Compute Cloud, Execution Engine, Resource Manager, Optimization Engine, Registry, Parser, Gateway, Master, Worker (×3), Storage Cloud</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Exareme's architecture</figDesc><graphic coords="1,316.81,430.05,239.10,97.92" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Dynamic infrastructure elasticities</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head></head><label></label><figDesc>select word, count(*) as count from (select STRSPLITV(l_comment) as word from lineitem, orders, (XMLPARSE '["/name"]' FILE 'http://../clerk.xml.gz') as clerk where l_orderkey = o_orderkey and o_clerk = name) as words group by word order by count desc;</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 3 (</head><label>3</label><figDesc>Figure 3 (bottom): Same: the default mode; the number of output partitions is determined by the input. Partition: hash partitioning is used. This requires two</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Input Partitioning (top), Output Partitioning (bottom)</figDesc><graphic coords="3,87.22,53.80,172.26,103.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Algorithm 1</head><label>1</label><figDesc>Dynamic Skyline. Input: G: A dataflow graph. Output: skyline: The skyline schedules. 1: ready ← {operators in G that have no dependencies} 2: op1 ← maxRunningTime(ready) 3: vm1 ← allocateNewVM() 4: schedule1 ← {assign(op1, vm1, −, −)} 5: skyline ← {schedule1} 6: while ready ≠ ∅ do 7: next ← maxRunningTime(ready); S ← ∅ 8: for s ∈ skyline do 9: if next is pinned then 10: S ← S ∪ {s + assign(next, next.pin_loc, −, −)} 11: else 12: for all containers c of s do 13: S ← S ∪ {s + assign(next, c, −, −)} … skyline ← skyline of S 23: ready ← ready − {next} ∪ {operators in G whose dependency constraints no longer exist} 24: end while 25: return skyline</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: TPC-H with 64GB data and 32 VMs on System X, Hive and Exareme</figDesc><graphic coords="4,328.77,53.80,215.19,84.83" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Configuration with eco-elasticity vs. static layouts.</figDesc><graphic coords="4,346.70,181.64,179.32,75.30" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>All of the above compose ExaDFL according to the following grammar:</figDesc><table><row><cell>ExaDFL</cell><cell>:= (&lt;query&gt;)+</cell></row><row><cell>query</cell><cell>:= &lt;parallelism&gt; &lt;ExaQL&gt; ;</cell></row><row><cell>parallelism</cell><cell>:= create distributed [temp]</cell></row><row><cell></cell><cell>table &lt;name&gt; [&lt;output comb&gt;]</cell></row><row><cell></cell><cell>as [&lt;input comb&gt;]</cell></row></table><note>output := [to &lt;number&gt;] [(hash | range)] partition on &lt;name&gt;(,&lt;name&gt;) * input := direct | cprod | tree | extern (the rest is omitted due to space)</note></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">ACKNOWLEDGEMENTS</head><p>The authors would like to thank Herald Kllapi and Manolis Tsangaris.</p></div>
			</div>

			<div type="references">

				<listBibl>


<biblStruct xml:id="b1">
	<monogr>
		<ptr target="http://www.exareme.org/" />
		<title level="m">Exareme</title>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<ptr target="https://github.com/madgik/madis" />
		<title level="m">Madis</title>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<ptr target="http://hadoop.apache.org/" />
		<title level="m">Hadoop</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<ptr target="http://tez.apache.org/" />
		<title level="m">Apache Tez</title>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">A Guide to the SQL Standard</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Date</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Darwen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1997">1997</date>
			<publisher>Addison-Wesley Longman</publisher>
		</imprint>
	</monogr>
	<note>4th Ed</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">MapReduce: Simplified Data Processing on Large Clusters</title>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghemawat</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">OSDI</title>
				<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The cost of doing science on the cloud: the montage example</title>
		<author>
			<persName><forename type="first">E</forename><surname>Deelman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM SC</title>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Schedule optimization for data processing flows on the cloud</title>
		<author>
			<persName><forename type="first">H</forename><surname>Kllapi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGMOD &apos;11</title>
		<imprint>
			<biblScope unit="page">289</biblScope>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Parallel query scheduling and optimization with time- and space-shared resources</title>
		<author>
			<persName><forename type="first">M</forename><surname>Garofalakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ioannidis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">VLDB &apos;97</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Karger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">STOC</title>
		<imprint>
			<biblScope unit="page" from="654" to="663" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Near neighbor join</title>
		<author>
			<persName><forename type="first">H</forename><surname>Kllapi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Harb</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Yu</surname></persName>
		</author>
		<imprint>
			<publisher>ICDE</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Condor -A Hunter of Idle Workstations</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Litzkow</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICDCS</title>
				<imprint>
			<date type="published" when="1988">1988</date>
			<biblScope unit="page" from="104" to="111" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Dremel: Interactive analysis of web-scale datasets</title>
		<author>
			<persName><forename type="first">S</forename><surname>Melnik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="330" to="339" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Modeling and managing etl processes</title>
		<author>
			<persName><forename type="first">A</forename><surname>Simitsis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">VLDB PhD Workshop</title>
				<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Hive -a petabyte scale data warehouse using Hadoop</title>
		<author>
			<persName><forename type="first">A</forename><surname>Thusoo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICDE</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Intermediate Microeconomics : A Modern Approach</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">R</forename><surname>Varian</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
