=Paper=
{{Paper
|id=Vol-1864/paper_6
|storemode=property
|title=Implementing Common Table Expressions for MariaDB
|pdfUrl=https://ceur-ws.org/Vol-1864/paper_6.pdf
|volume=Vol-1864
|authors=Galina Shalygina,Boris Novikov
}}
==Implementing Common Table Expressions for MariaDB==
Implementing Common Table Expressions for MariaDB Galina Shalygina Boris Novikov* Mathematics and Mechanics Department Mathematics and Mechanics Department Saint Petersburg State University Saint Petersburg State University Saint-Petersburg, Russia Saint-Petersburg, Russia galashalygina@gmail.com b.novikov@spbu.ru Abstract—Common Table Expressions (CTEs), introduced in non-linear recursion. A little later MySQL included an the SQL Standard 1999, are similar to subroutines in implementation of CTE in their code as well. programming languages: they can be referenced from multiple places in a query and may refer to themselves, providing As soon as a CTE is defined in a query, this CTE may be recursive queries. Many RDBMSs implemented this feature, yet used several times in the same query, thus the role of CTEs is not in full. Until recently MariaDB did not have this feature similar to that of subroutines in imperative programming either. This article describes CTE overall and looks through languages. some interesting cases that are implemented in MariaDB only, like non-linear recursion and mutual recursion. It also compares The SQL Standard requires that results of the query optimizations for non-recursive CTEs across different RDBMSs. execution are the same as if CTEs were executed only once Finally, the results of experiments comparing computation of during the query processing. This requirement suggests a recursive CTEs for MariaDB and PostgreSQL are presented. straightforward implementation of CTEs that computes a CTE as a separate query and materializes results in a temporary Keywords—common table expressions; optimization; MariaDB; table. However, this implementation might result in a poor performance. For example, consider a CTE defining a large I. INTRODUCTION number of rows, say, extracting a table with millions rows. In spite of the fact that first SQL Standard appeared in Such CTE may be invoked in a query with additional selection 1986, there was lack of standard recursive constructions for criteria restricting the size of output to a single row. If the over a decade from this period. Nevertheless, as it was same table expression was placed directly into FROM clause, essential to write hierarchical queries in some cases, several an optimizer, most likely, would evaluate the condition first, DBMSs introduced their own recursive constructions. One of avoiding calculation and materialization of unneeded rows. the most popular is CONNECT BY presented by Oracle in An efficient implementation of CTE, even for non- 1980's [6]. And still now, even after standard recursive recursive queries, is a non-trivial task. construction common table expression (CTE) was officially introduced, there are many researches on presentation of These days more and more benchmarks include CTE recursive queries, especially on using Object-Relational usage. To compare, TPC-H 1999 Benchmark has no tests Mapping to make recursion [9], [10], [11], [12]. using CTE, while in TPC-H 2011 38 of 100 queries contain CTE [8]. To provide fast and low-cost CTE computation a lot The first attempt of CTE was in the SQL Standard 1999 of attention was given to researches on optimization [3] and from this version of Standard the CTE specification techniques for CTE. Articles [13], [14], [15] provide such didn’t change in the latest versions (see SQL:2008 [4]). The techniques for recursive CTEs. MariaDB also introduced its first implementation of CTEs dates back to 1997, to RDBMS own optimization technique for non-recursive CTEs, that DB2. Later there was a period of stagnation and only in 2003- concerns non-mergeable views and derived tables. 2005 other RDBMSs started implementing CTE [5]. Nowadays CTE are supported by most major RDBMSs [7], In this article we will discuss the implementation of CTE but still none of them support all CTE features that are that takes into account MariaDB Server special characteristics. described in the SQL Standard 1999 [3]. Not so long ago We will also overview different optimization techniques for MariaDB introduced its own implementation of CTE on which non-recursive CTEs. Non-recursive CTEs are handled as one of the authors has worked with other MariaDB developers derived tables, but recursive ones are a much more difficult [1], [2]. This implementation includes features that have never case. Moreover, they are computed in a different way. been introduced by other RDBMSs, like mutual recursion or Firstly, it should be checked if recursion is linear. Secondly, all mutual recursive CTEs are detected, if there are *The work of the second author is supported by Academy of Finland and Russian Foundation for Basic research grant 16-57-48001 any. At the same time all Standard restrictions on CTEs are III. RECURSIVE CTE checked. As regards optimization techniques for non-recursive The greatest advantage that using CTE provides is that it CTEs, we describe most popular of them with proper can reference to itself so a recursive query can be specified. examples and compare different RDBMSs approaches. Each definition of a recursive CTE consists of obligatory The contributions of this work include: WITH RECURSIVE keyword, the CTE name, an optional Techniques for efficient implementation of several column list and seed and recursive parts specifying this CTE. special cases of CTEs Both seed part and recursive part can be defined by several Techniques for implementation of mutual recursion SELECT statements and they should be joined by UNION or of CTEs UNION ALL operations. An implementation of both non-recursive and recursive CTEs in MariaDB WITH RECURSIVE This paper is organized as follows: Section 2 discusses expression_name [( column_name [,...] )] non-recursive CTEs in general and Section 3 describes AS ( [] UNION [ALL] recursive ones. Also Section 3 shows how computation of ) [,...] recursive CTEs in MariaDB goes, looks through different recursion cases and demonstrates how recursion can be stopped. Section 4 compares optimizations in different A. Computation RDBMSs and Section 5 presents the results of experiments on At the first step all the components of the seed part are two RDBMSs MariaDB and PostgreSQL. In Section 6 a computed. At all further steps the recursive part is computing conclusion is made. using the records produced by the previous steps. The Standard requires that only linear recursion can be used. This II. NON-RECURSIVE CTE means that at each step of recursion only those records that are produced by the latest step can be used. The process of CTE can be introduced as a temporary result set that is computation of recursive CTE stops when there are no new defined within the scope of a single SELECT, INSERT, records produced. UPDATE, DELETE or CREATE VIEW statement. In MariaDB CTE can be defined only in SELECT or/and In MariaDB computation of recursive CTE goes according CREATE VIEW statements. to the following scenario: Each definition of non-recursive CTE consists of At the preparatory stage the dependency matrix for all obligatory WITH keyword, the CTE name, an optional column CTEs used in the query is built to detect recursive CTEs. Also list and a query specifying this CTE. at this stage it is checked if there are enough seed parts for recursive CTEs. At the stage of the semantical analysis the structure of WITH expression_name [( column_name [,...] )] temporary tables where results will be saved is defined using AS the seed part of the query. Several temporary tables are created: the table where the final result records are accumulated, the table for the records produced by the latest Non-recursive CTEs can be called 'query-local views' and step and the tables for each reference to the recursive CTE. are similar to derived tables. As well as derived tables they Further all Standard restrictions are checked at this stage. aren't stored in the database. This means that CTEs live only for the duration of the query where they are defined. However, Lastly, at the execution stage CTE is executed in cycle unlike derived tables, they can be referenced multiple times in using temporary tables defined at the previous stage. At the the same query. Instead of redefining the same derived table beginning these temporary tables contain the result of every time CTE can be used with the aim of making query execution of the seed part. On each iteration the content of the more readable. table for new records is used as the entry for the recursive part. The result of the recursive part execution is added to the In MariaDB after CTE identification all references to non- table where the final result is stored and the new produced recursive CTEs are formed as references to derived tables. On records are written in the table for the new records. If there is further phases of semantical analysis, optimization and no data in the table for the new records, the process stops. execution non-recursive CTEs are handled as derived tables. The only thing that should be checked is if there are any As stated before, there is one temporary table for each renamed columns in the CTE definition because derived table reference to the recursive CTE. All these tables are similar to doesn't have such an option. each other, but storing them all is caused by the current limitations of the MariaDB server. Later an optimization which solves this problem will be added to the server. Also it must be mentioned that any recursive CTE is to the next step. MariaDB supports only mutual recursion as computed only once for the query. specified in the Standard. It also must be said that MariaDB is the first RDBMS that implemented mutual recursion and the B. Non-linear recursion only one who did it at the time of writing. As it was said before, a non-linear recursion is forbidden by the Standard. However, MariaDB supports this feature. D. How recursion can be stopped In MariaDB using linear recursion for traversal of a tree or A non-linear recursion can be useful when some a directed acyclic graph the execution is guaranteed to stop. restrictions of the Standard on recursive CTEs have to be However, in many other cases the user has to add some lifted. So, non-linear recursion can be used when: conditions to prevent loops. there is more than one reference to recursive CTEs in When a transitive closure is computed, in the definition of a FROM clause of recursive_part of CTE; recursive CTE only UNION can be used to join recursive and there are some references to recursive CTEs in the seed parts. For instance, when the user needs to find all cities right part of LEFT JOIN or in the left part of RIGHT that he can reach from some place by bus, there can be more JOIN; than one bus route with the same point of arrival. If there are there are some references to recursive CTEs in a such routes and the user applies UNION ALL, some subquery that is defined in a WHERE clause of some destinations will be added repeatedly and the routes will be recursive CTE; added again and again. This will lead to an infinite process. The main difference in computation of non-linear In the case when the paths over the graph with the loops recursion from linear one is that on each iteration not only the need to be computed, for example, all paths consist of cities records produced by the latest recursive step but all records that can be reached by bus from some place, there might be a produced so far for the CTE are used as the entries for the special condition written into WHERE in the recursive part of recursive part. So, in MariaDB implementation of CTEs when CTE definition to stop indefinitely increasing paths the restrictions are lifted, new records are just added to the computing. As well as in the previous case there can be some tables created for the recursive references from the bus routes with the same destination. Adding a city that specification of CTE. The recursion stops when no new already exists in the path will lead to an infinite process, that's records produced. why a condition that checks if the city exists in the path needs Whereas in some cases a query looks cleaner and to be added. executing converges faster, usage of non-linear recursion can Also in MariaDB there is a safety measure – the special be very helpful in many cases. For example, it can be variable @@max_recursive_iterations that controls the count effectively used to stop the recursive process. of the completed iterations during computation of a recursive If the user wants to use non-linear recursion in MariaDB, CTE. The user can change it himself if needed. he can set @@standard_compliant_cte=0 and work with it. IV. OPTIMIZATION TECHNIQUES C. Mutual recursion The basic algorithm of execution of a CTE stores results of Mutual recursion is said to be one of the most interesting CTE in a temporary table. When the query where the CTE was forms of recursion. It is such a form where two or more CTEs defined calls for the CTE results, the result records are taken refer to each other. from this temporary table. Although this algorithm always works, in most cases it is not optimal. Some optimization In MariaDB all mutual recursive CTEs are detected at the techniques on non-recursive CTEs are discussed and a preparatory stage. For every recursive_part of CTE a search is comparison between different RDBMSs approaches is made made for all recursive_parts mutually recursive with the CTE below. where it was defined. It must be said that recursive CTE is mutually recursive with itself. All found mutually recursive A. CTE merging CTEs are linked into a ring chain. Further, it is checked if With this optimization applied the CTE is merged into there are enough seed_parts for a mutually recursive group. parent JOIN so that parts of the CTE definition replace There should be at least one anchor for the mutually recursive corresponding parts of the parent query. There are some group. restrictions on a CTE in order it could be merged: GROUP Mutual recursion is allowed by the Standard, but it BY, DISTINCT, etc. can't be used in CTE definition. required that any mutually recursive group of CTEs can be This optimization technique is the same as transformed into a single recursive CTE. An example of the ALGORITHM=MERGE for views in MySQL. non-restricted case of mutual recursion can be as follows: when there are two recursive CTEs, where on each iteration On the Fig. 1 the example of how this technique works is one CTE waits until the second one ends computation with the shown. The upper listing shows the initial query and the low content of the first CTE as the entry, and only after that goes shows how the optimizer will transform it. WITH engineers AS ( instances than the disjunction of these conditions has to be SELECT * pushed into CTE. FROM employees WHERE dept = 'Development') SELECT * D. Comparison of optimizations in MariaDB, PostgreSQL, FROM engineers E, support_cases SC MS SQL Server, MySQL 8.0.0-labs WHERE E.name = SC.assignee AND MariaDB as MS SQL Server supports merging and SC.created = '2016-09-30' AND E.location = 'Amsterdam' condition pushdown. PostgreSQL supports reuse only. MySQL 8.0.0-labs supports both merging and reuse and it works in such way: it tries merging otherwise makes reuse. SELECT * FROM employees E, support_cases SC TABLE I. EXISTENCE OF OPTIMIZATION TECHNIQUES IN DIFFERENT WHERE E.dept = 'Development' AND RDBMSS E.name = SC.assignee AND SC.created = '2016-09-30' AND Optimization technique exists E.location = ’Amsterdam’ DBMS CTE merge Condition pushdown CTE reuse Fig. 1. Example of CTE merging MariaDB 10.2 yes yes no MS SQL Server yes yes no B. Condition pushdown PostgreSQL no no yes The condition pushdown is used when merging is not MySQL possible, for example when CTE has GROUP BY. Conditions 8.0.0-Labs yes no yes in the WHERE clause of a query that depend only on the columns of the CTE are pushed into the query defining this CTE. In the general case conditions can be pushed only in the V. THE RESULTS OF EXPERIMENTS ON MARIADB AND HAVING clause of the CTE, but on some conditions it makes POSTGRESQL sense to push them into the WHERE clause. As a result, a Some tests have been conducted on the computer with temporary table is made smaller. processor Intel(R) Core(TM) i7-4710HQ CPU, 2.50GHz, 8 Besides CTEs this optimization works for derived tables GB RAM on Opensuse 13.2 operating system. We tested and non-mergeable views. PostgreSQL 9.3 and MariaDB 10.2 database systems, and gave the same amount of 16 Mb memory for temporary tables On the Fig. 2 the example of how this technique works is in both systems. It was relevant to use PostgreSQL 9.3 shown. The upper listing shows the initial query and the low because CTEs implementation didn’t change in further shows how optimizer will transform it. versions. The experiments were made in a database containing the WITH sales_per_year AS ( information about domestic flights in the USA during 2008. SELECT year(order.date) AS years, sum(order.amount) AS sales Database schema consists of the following relations: FROM order GROUP BY year) tab_2008(month, dayofmonth, dep_time, arrtime, SELECT * flightnum, airtime, origin, dest, dist); FROM sales_per_year airports(names); WHERE year IN ('2015','2016') We wanted to find multi-destination itineraries. So, we decided to find the shortest way between the airports of WITH sales_per_year AS ( interest by plane. The table airports shows which airports SELECT year(order.date) AS years, sum(order.amount) AS sales should be visited. None of the airports can be visited twice. FROM order Besides, the plane should leave for the next destination a day WHERE year IN ('2015','2016') or more after the previous plane. GROUP BY year) SELECT * The following query Q1 for MariaDB is shown on Fig. 3. FROM sales_per_year The script for PostgreSQL has a difference in functions cast(origin as char(32)) and locate(tab_2008.dest, Fig. 2. Example of condition pushdown s_planes.path). The analogue of locate function in PostgreSQL is position function and its return type is text, C. CTE reuse that’s why the result of cast function in PostgreSQL in this The main idea of this method is to fill the CTE once and query will be not char(32), but text. then use it multiple times. It works with condition pushdown. This query starts from 'IAD' airport in seed_part and looks Yet if different conditions are to be pushed for different CTE through the table tab_2008 to find flights with 'IAD' as origin and one of the airports from table airports as destination. As database systems, so it was decided to continue experiments the needed destination is found it is checked if it has already on the tables without indexes. been visited on the route to prevent repeats. From the received Although, we decided to make some other experiments data we take only those paths that involve all airports from and minimize the number of searched airports. So, fewer steps table airports and have the smallest overall distance. of recursion were made. We made these experiments only on the table with 1134055 records. The results are shown in WITH RECURSIVE s_planes (path, dest, dayofmonth, TABLE III. dist, it) AS ( SELECT cast(origin as char(30)), origin, When the number of searched airports is 8, MariaDB has dayofmonth, 0, 1 much better results than PostgreSQL. But during the FROM tab_2008 minimization of the number of the airports PostgreSQL results WHERE dayofmonth = 3 AND origin = 'IAD' AND flightnum = 3231 become closer and closer to MariaDB results. When the UNION number of the airports is fewer than 5 PostgreSQL results SELECT become better than MariaDB ones and this trend continues. concat(s_planes.path,',',tab_2008.dest), tab_2008.dest, tab_2008.dayofmonth, We compared query execution plans in both database s_planes.dist+tab_2008.dist, it+1 systems and found that they were absolutely the same. FROM tab_2008, airports, s_planes WHERE However, operation of HASH JOIN made in a recursive part tab_2008.origin = s_planes.dest AND of the query requires 3-4 times longer period in PostgreSQL locate(tab_2008.dest, s_planes.path)=0 AND than in MariaDB on the big number of recursive iterations. tab_2008.dest = airports.name AND We don’t know the reason of this yet, and in our further tab_2008.dayofmonth > s_planes.dayofmonth) SELECT * researches we will focus on this theme. FROM s_planes WHERE it = 8 AND dist = (SELECT min(dist) VI. CONCLUSION FROM s_planes In this paper we presented a number of techniques for WHERE it = 8); execution of CTEs and provided an implementation of these Fig. 3. Query Q1 for MariaDB techniques for MariaDB. We described in details recursive CTE computation and mutual recursive CTE computation. Also we discussed in which cases non-linear recursion can be TABLE II. THE RESULTS OF THE QUERY Q1 (OVERALL RESULT) used. We compared existence of optimization techniques for Records count non-recursive CTE in different databases. DBMS 587130 1134055 6858079 We also performed some experiments using flights table MariaDB 16.72 sec 31.97 sec 3 min 9 sec on PostgreSQL and MariaDB. They showed that PostgreSQL PostgreSQL 60.29 sec 1 min 91 sec 11 min 50 sec has better results only when few steps of recursion are needed. For the long recursive process on a huge amount of data MariaDB is a better choice. TABLE III. THE RESULTS OF THE EXPERIMENTS DURING AIRPORTS COUNT MINIMIZATION (OVERALL RESULT ) The authors want to express gratitude to MariaDB developers Igor Babaev and Sergey Petrunya for their Airports count participation in the work on MariaDB CTE implementation DBMS 4 5 6 7 8 and their help in writing this article. 2.28 3.49 5.93 14.45 MariaDB 31.97 sec sec sec sec sec REFERENCES 1.13 2.97 6.74 38.66 1 min 91 [1] Code of the implementation of non-recursive CTEs for MariaDB, URL: PostgreSQL sec sec sec sec sec https://github.com/MariaDB/server/pull/134 [2] Code of the implementation of recursive CTEs for MariaDB, URL: https://github.com/MariaDB/server/pull/182 [3] SQL/Foundation ISO/IEC 9075-2:1999 We've made a query Q1 on table tab_2008 consists of [4] SQL/Foundation ISO/IEC 9075-2:2008 different number of records: 587130, 1134055 and 6858079. [5] P. Przymus, A. Boniewicz, M. Burza´nska, and K. Stencel. Recursive The results of the experiments are shown in TABLE II. query facilities in relational databases: a survey. In DTA and BSBT, pages 89–99. Springer, 2010 What can be seen from the table is that the results of the tests [6] Oracle Database Online Documentation, 10g Release 2 (10.2) in MariaDB are more than three times better than in [7] D. Stuparu and M. Petrescu. Common Table Expression: Different PostgreSQL. Database Systems Approach. Journal of Communication and Computer, 6(3):9–15, 2009 Also query Q1 was made on the same table with 587130 [8] TPC Benchmark™DS (TPC-DS): The New Decision Support records but with the index on origin column. The results Benchmark Standard improved only for an overall value of 1 second in both [9] Boniewicz, A., Stencel, K., Wiśniewski, P.: Unrolling SQL:1999 Hibernate ORM. In Eder, J., Bielikov´a, M., Tjoa, A.M., eds.: recursive queries. In Kim, T.h., Ma, J., Fang, W.c., Zhang, Y., ADBIS(2). Volume 789 of CEUR Workshop Proceedings., CEUR- Cuzzocrea, A., eds.: Computer Applications for Database, Education, WS.org (2011) 190–199 and Ubiquitous Computing. Volume 352 of Communications in [13] Ghazal, A., Crolotte, A., Seid, D.Y.: Recursive sql query optimization Computer and Information Science. Springer Berlin Heidelberg (2012) with k-iteration lookahead. In Bressan, S., K¨ung, J., Wagner, R., eds.: 345–354 DEXA. Volume 4080 of Lecture Notes in Computer Science., Springer [10] Szumowska, A., Burzańska, M., Wiśniewski, P., Stencel, K.: Efficient (2006) 348–357 Implementation of Recursive Queries in Major Object Relational [14] Ordonez, C.: Optimization of linear recursive queries in sql. IEEE Mapping Systems. In FGIT 2011 78-89 Trans. Knowl. Data Eng. 22 (2010) 264–277 [11] Burzanska, M., Stencel, K., Suchomska, P., Szumowska, A., [15] Burzanska, M., Stencel, K., Wisniewski, P.: Pushing predicates into Wisniewski, P.: Recursive queries using object relational mapping. In recursive sql common table expressions. In Grundspenkis, J., Morzy, T., Kim, T.H., Lee, Y.H., Kang, B.H., Slezak, D., eds.: FGIT. Volume 6485 Vossen, G., eds.: ADBIS. Volume 5739 of Lecture Notes in Computer of Lecture Notes in Computer Science., Springer (2010) 42–50 Science., Springer (2009) 194–205 [12] Wiśniewski, P., Szumowska, A., Burzańska, M., Boniewicz, A.: Hibernate the recursive queries - defining the recursive queries using