=Paper= {{Paper |id=Vol-1864/paper_6 |storemode=property |title=Implementing Common Table Expressions for MariaDB |pdfUrl=https://ceur-ws.org/Vol-1864/paper_6.pdf |volume=Vol-1864 |authors=Galina Shalygina,Boris Novikov }} ==Implementing Common Table Expressions for MariaDB== https://ceur-ws.org/Vol-1864/paper_6.pdf
                Implementing Common Table Expressions
                             for MariaDB

                       Galina Shalygina                                                         Boris Novikov*
          Mathematics and Mechanics Department                                       Mathematics and Mechanics Department
             Saint Petersburg State University                                          Saint Petersburg State University
                 Saint-Petersburg, Russia                                                   Saint-Petersburg, Russia
                galashalygina@gmail.com                                                        b.novikov@spbu.ru



    Abstract—Common Table Expressions (CTEs), introduced in                 non-linear recursion. A little later MySQL included an
the SQL Standard 1999, are similar to subroutines in                        implementation of CTE in their code as well.
programming languages: they can be referenced from multiple
places in a query and may refer to themselves, providing                       As soon as a CTE is defined in a query, this CTE may be
recursive queries. Many RDBMSs implemented this feature, yet                used several times in the same query, thus the role of CTEs is
not in full. Until recently MariaDB did not have this feature               similar to that of subroutines in imperative programming
either. This article describes CTE overall and looks through                languages.
some interesting cases that are implemented in MariaDB only,
like non-linear recursion and mutual recursion. It also compares                The SQL Standard requires that results of the query
optimizations for non-recursive CTEs across different RDBMSs.               execution are the same as if CTEs were executed only once
Finally, the results of experiments comparing computation of                during the query processing. This requirement suggests a
recursive CTEs for MariaDB and PostgreSQL are presented.                    straightforward implementation of CTEs that computes a CTE
                                                                            as a separate query and materializes results in a temporary
    Keywords—common table expressions; optimization; MariaDB;
                                                                            table. However, this implementation might result in a poor
                                                                            performance. For example, consider a CTE defining a large
                         I. INTRODUCTION                                    number of rows, say, extracting a table with millions rows.
    In spite of the fact that first SQL Standard appeared in                Such CTE may be invoked in a query with additional selection
1986, there was lack of standard recursive constructions for                criteria restricting the size of output to a single row. If the
over a decade from this period. Nevertheless, as it was                     same table expression was placed directly into FROM clause,
essential to write hierarchical queries in some cases, several              an optimizer, most likely, would evaluate the condition first,
DBMSs introduced their own recursive constructions. One of                  avoiding calculation and materialization of unneeded rows.
the most popular is CONNECT BY presented by Oracle in                          An efficient implementation of CTE, even for non-
1980's [6]. And still now, even after standard recursive                    recursive queries, is a non-trivial task.
construction common table expression (CTE) was officially
introduced, there are many researches on presentation of                        These days more and more benchmarks include CTE
recursive queries, especially on using Object-Relational                    usage. To compare, TPC-H 1999 Benchmark has no tests
Mapping to make recursion [9], [10], [11], [12].                            using CTE, while in TPC-H 2011 38 of 100 queries contain
                                                                            CTE [8]. To provide fast and low-cost CTE computation a lot
    The first attempt of CTE was in the SQL Standard 1999                   of attention was given to researches on optimization
[3] and from this version of Standard the CTE specification
                                                                            techniques for CTE. Articles [13], [14], [15] provide such
didn’t change in the latest versions (see SQL:2008 [4]). The                techniques for recursive CTEs. MariaDB also introduced its
first implementation of CTEs dates back to 1997, to RDBMS                   own optimization technique for non-recursive CTEs, that
DB2. Later there was a period of stagnation and only in 2003-               concerns non-mergeable views and derived tables.
2005 other RDBMSs started implementing CTE [5].
Nowadays CTE are supported by most major RDBMSs [7],                            In this article we will discuss the implementation of CTE
but still none of them support all CTE features that are                    that takes into account MariaDB Server special characteristics.
described in the SQL Standard 1999 [3]. Not so long ago                     We will also overview different optimization techniques for
MariaDB introduced its own implementation of CTE on which                   non-recursive CTEs. Non-recursive CTEs are handled as
one of the authors has worked with other MariaDB developers                 derived tables, but recursive ones are a much more difficult
[1], [2]. This implementation includes features that have never             case. Moreover, they are computed in a different way.
been introduced by other RDBMSs, like mutual recursion or
                                                                               Firstly, it should be checked if recursion is linear.
                                                                            Secondly, all mutual recursive CTEs are detected, if there are
    *The work of the second author is supported by Academy of Finland and
Russian Foundation for Basic research grant 16-57-48001
any. At the same time all Standard restrictions on CTEs are                           III. RECURSIVE CTE
checked. As regards optimization techniques for non-recursive        The greatest advantage that using CTE provides is that it
CTEs, we describe most popular of them with proper                can reference to itself so a recursive query can be specified.
examples and compare different RDBMSs approaches.
                                                                     Each definition of a recursive CTE consists of obligatory
   The contributions of this work include:                        WITH RECURSIVE keyword, the CTE name, an optional
        Techniques for efficient implementation of several       column list and seed and recursive parts specifying this CTE.
         special cases of CTEs                                    Both seed part and recursive part can be defined by several
        Techniques for implementation of mutual recursion        SELECT statements and they should be joined by UNION or
         of CTEs                                                  UNION ALL operations.
        An implementation of both non-recursive and
         recursive CTEs in MariaDB                                   WITH RECURSIVE
    This paper is organized as follows: Section 2 discusses          expression_name [( column_name [,...] )]
non-recursive CTEs in general and Section 3 describes                AS ( [] UNION [ALL]
recursive ones. Also Section 3 shows how computation of
                                                                            ) [,...]
recursive CTEs in MariaDB goes, looks through different
recursion cases and demonstrates how recursion can be
stopped. Section 4 compares optimizations in different            A. Computation
RDBMSs and Section 5 presents the results of experiments on           At the first step all the components of the seed part are
two RDBMSs MariaDB and PostgreSQL. In Section 6 a                 computed. At all further steps the recursive part is computing
conclusion is made.                                               using the records produced by the previous steps. The
                                                                  Standard requires that only linear recursion can be used. This
                  II. NON-RECURSIVE CTE                           means that at each step of recursion only those records that are
                                                                  produced by the latest step can be used. The process of
    CTE can be introduced as a temporary result set that is       computation of recursive CTE stops when there are no new
defined within the scope of a single SELECT, INSERT,              records produced.
UPDATE, DELETE or CREATE VIEW statement. In
MariaDB CTE can be defined only in SELECT or/and                      In MariaDB computation of recursive CTE goes according
CREATE VIEW statements.                                           to the following scenario:
     Each definition of non-recursive CTE consists of                 At the preparatory stage the dependency matrix for all
obligatory WITH keyword, the CTE name, an optional column         CTEs used in the query is built to detect recursive CTEs. Also
list and a query specifying this CTE.                             at this stage it is checked if there are enough seed parts for
                                                                  recursive CTEs.
                                                                      At the stage of the semantical analysis the structure of
   WITH expression_name [( column_name [,...] )]
                                                                  temporary tables where results will be saved is defined using
   AS                                    the seed part of the query. Several temporary tables are
                                                                  created: the table where the final result records are
                                                                  accumulated, the table for the records produced by the latest
    Non-recursive CTEs can be called 'query-local views' and
                                                                  step and the tables for each reference to the recursive CTE.
are similar to derived tables. As well as derived tables they
                                                                  Further all Standard restrictions are checked at this stage.
aren't stored in the database. This means that CTEs live only
for the duration of the query where they are defined. However,        Lastly, at the execution stage CTE is executed in cycle
unlike derived tables, they can be referenced multiple times in   using temporary tables defined at the previous stage. At the
the same query. Instead of redefining the same derived table      beginning these temporary tables contain the result of
every time CTE can be used with the aim of making query           execution of the seed part. On each iteration the content of the
more readable.                                                    table for new records is used as the entry for the recursive
                                                                  part. The result of the recursive part execution is added to the
    In MariaDB after CTE identification all references to non-
                                                                  table where the final result is stored and the new produced
recursive CTEs are formed as references to derived tables. On
                                                                  records are written in the table for the new records. If there is
further phases of semantical analysis, optimization and
                                                                  no data in the table for the new records, the process stops.
execution non-recursive CTEs are handled as derived tables.
The only thing that should be checked is if there are any             As stated before, there is one temporary table for each
renamed columns in the CTE definition because derived table       reference to the recursive CTE. All these tables are similar to
doesn't have such an option.                                      each other, but storing them all is caused by the current
                                                                  limitations of the MariaDB server. Later an optimization
                                                                  which solves this problem will be added to the server.
   Also it must be mentioned that any recursive CTE is             to the next step. MariaDB supports only mutual recursion as
computed only once for the query.                                  specified in the Standard. It also must be said that MariaDB is
                                                                   the first RDBMS that implemented mutual recursion and the
B. Non-linear recursion                                            only one who did it at the time of writing.
    As it was said before, a non-linear recursion is forbidden
by the Standard. However, MariaDB supports this feature.           D. How recursion can be stopped
                                                                      In MariaDB using linear recursion for traversal of a tree or
     A non-linear recursion can be useful when some
                                                                   a directed acyclic graph the execution is guaranteed to stop.
restrictions of the Standard on recursive CTEs have to be
                                                                   However, in many other cases the user has to add some
lifted. So, non-linear recursion can be used when:
                                                                   conditions to prevent loops.
        there is more than one reference to recursive CTEs in         When a transitive closure is computed, in the definition of
         a FROM clause of recursive_part of CTE;                   recursive CTE only UNION can be used to join recursive and
        there are some references to recursive CTEs in the        seed parts. For instance, when the user needs to find all cities
         right part of LEFT JOIN or in the left part of RIGHT      that he can reach from some place by bus, there can be more
         JOIN;                                                     than one bus route with the same point of arrival. If there are
        there are some references to recursive CTEs in a          such routes and the user applies UNION ALL, some
         subquery that is defined in a WHERE clause of some        destinations will be added repeatedly and the routes will be
         recursive CTE;                                            added again and again. This will lead to an infinite process.
    The main difference in computation of non-linear                   In the case when the paths over the graph with the loops
recursion from linear one is that on each iteration not only the   need to be computed, for example, all paths consist of cities
records produced by the latest recursive step but all records      that can be reached by bus from some place, there might be a
produced so far for the CTE are used as the entries for the        special condition written into WHERE in the recursive part of
recursive part. So, in MariaDB implementation of CTEs when         CTE definition to stop indefinitely increasing paths
the restrictions are lifted, new records are just added to the     computing. As well as in the previous case there can be some
tables created for the recursive references from the               bus routes with the same destination. Adding a city that
specification of CTE. The recursion stops when no new              already exists in the path will lead to an infinite process, that's
records produced.                                                  why a condition that checks if the city exists in the path needs
    Whereas in some cases a query looks cleaner and                to be added.
executing converges faster, usage of non-linear recursion can          Also in MariaDB there is a safety measure – the special
be very helpful in many cases. For example, it can be              variable @@max_recursive_iterations that controls the count
effectively used to stop the recursive process.                    of the completed iterations during computation of a recursive
    If the user wants to use non-linear recursion in MariaDB,      CTE. The user can change it himself if needed.
he can set @@standard_compliant_cte=0 and work with it.
                                                                                  IV. OPTIMIZATION TECHNIQUES
C. Mutual recursion                                                    The basic algorithm of execution of a CTE stores results of
    Mutual recursion is said to be one of the most interesting     CTE in a temporary table. When the query where the CTE was
forms of recursion. It is such a form where two or more CTEs       defined calls for the CTE results, the result records are taken
refer to each other.                                               from this temporary table. Although this algorithm always
                                                                   works, in most cases it is not optimal. Some optimization
    In MariaDB all mutual recursive CTEs are detected at the       techniques on non-recursive CTEs are discussed and a
preparatory stage. For every recursive_part of CTE a search is     comparison between different RDBMSs approaches is made
made for all recursive_parts mutually recursive with the CTE       below.
where it was defined. It must be said that recursive CTE is
mutually recursive with itself. All found mutually recursive       A. CTE merging
CTEs are linked into a ring chain. Further, it is checked if           With this optimization applied the CTE is merged into
there are enough seed_parts for a mutually recursive group.        parent JOIN so that parts of the CTE definition replace
There should be at least one anchor for the mutually recursive     corresponding parts of the parent query. There are some
group.                                                             restrictions on a CTE in order it could be merged: GROUP
    Mutual recursion is allowed by the Standard, but it            BY, DISTINCT, etc. can't be used in CTE definition.
required that any mutually recursive group of CTEs can be            This optimization technique is the                   same     as
transformed into a single recursive CTE. An example of the         ALGORITHM=MERGE for views in MySQL.
non-restricted case of mutual recursion can be as follows:
when there are two recursive CTEs, where on each iteration            On the Fig. 1 the example of how this technique works is
one CTE waits until the second one ends computation with the       shown. The upper listing shows the initial query and the low
content of the first CTE as the entry, and only after that goes    shows how the optimizer will transform it.
 WITH engineers AS (                                             instances than the disjunction of these conditions has to be
  SELECT *                                                       pushed into CTE.
  FROM employees
  WHERE dept = 'Development')
 SELECT *
                                                                 D. Comparison of optimizations in MariaDB, PostgreSQL,
 FROM engineers E, support_cases SC                                 MS SQL Server, MySQL 8.0.0-labs
 WHERE E.name = SC.assignee AND                                     MariaDB as MS SQL Server supports merging and
       SC.created = '2016-09-30' AND
       E.location = 'Amsterdam'                                  condition pushdown. PostgreSQL supports reuse only.
                                                                 MySQL 8.0.0-labs supports both merging and reuse and it
                                                                 works in such way: it tries merging otherwise makes reuse.
 SELECT *
 FROM employees E, support_cases SC                               TABLE I.         EXISTENCE OF OPTIMIZATION TECHNIQUES IN DIFFERENT
 WHERE E.dept = 'Development' AND                                                               RDBMSS
       E.name = SC.assignee AND
       SC.created = '2016-09-30' AND                                                            Optimization technique exists
       E.location = ’Amsterdam’                                          DBMS
                                                                                       CTE merge     Condition pushdown     CTE reuse
Fig. 1. Example of CTE merging                                      MariaDB 10.2          yes                yes                no

                                                                    MS SQL Server         yes                yes                no
B. Condition pushdown
                                                                    PostgreSQL             no                no                 yes
    The condition pushdown is used when merging is not
                                                                    MySQL
possible, for example when CTE has GROUP BY. Conditions             8.0.0-Labs
                                                                                          yes                no                 yes
in the WHERE clause of a query that depend only on the
columns of the CTE are pushed into the query defining this
CTE. In the general case conditions can be pushed only in the       V. THE RESULTS OF EXPERIMENTS ON MARIADB AND
HAVING clause of the CTE, but on some conditions it makes                             POSTGRESQL
sense to push them into the WHERE clause. As a result, a
                                                                    Some tests have been conducted on the computer with
temporary table is made smaller.
                                                                 processor Intel(R) Core(TM) i7-4710HQ CPU, 2.50GHz, 8
   Besides CTEs this optimization works for derived tables       GB RAM on Opensuse 13.2 operating system. We tested
and non-mergeable views.                                         PostgreSQL 9.3 and MariaDB 10.2 database systems, and
                                                                 gave the same amount of 16 Mb memory for temporary tables
   On the Fig. 2 the example of how this technique works is      in both systems. It was relevant to use PostgreSQL 9.3
shown. The upper listing shows the initial query and the low     because CTEs implementation didn’t change in further
shows how optimizer will transform it.                           versions.
                                                                     The experiments were made in a database containing the
 WITH sales_per_year AS (                                        information about domestic flights in the USA during 2008.
  SELECT year(order.date) AS years,
          sum(order.amount) AS sales                             Database schema consists of the following relations:
  FROM order
  GROUP BY year)                                                         tab_2008(month, dayofmonth, dep_time, arrtime,
 SELECT *                                                                 flightnum, airtime, origin, dest, dist);
 FROM sales_per_year                                                     airports(names);
 WHERE year IN ('2015','2016')
                                                                     We wanted to find multi-destination itineraries. So, we
                                                                 decided to find the shortest way between the airports of
 WITH sales_per_year AS (                                        interest by plane. The table airports shows which airports
  SELECT year(order.date) AS years,
          sum(order.amount) AS sales
                                                                 should be visited. None of the airports can be visited twice.
  FROM order                                                     Besides, the plane should leave for the next destination a day
  WHERE year IN ('2015','2016')                                  or more after the previous plane.
  GROUP BY year)
 SELECT *                                                            The following query Q1 for MariaDB is shown on Fig. 3.
 FROM sales_per_year                                             The script for PostgreSQL has a difference in functions
                                                                 cast(origin as char(32))           and  locate(tab_2008.dest,
Fig. 2. Example of condition pushdown
                                                                 s_planes.path). The analogue of locate function in
                                                                 PostgreSQL is position function and its return type is text,
C. CTE reuse                                                     that’s why the result of cast function in PostgreSQL in this
   The main idea of this method is to fill the CTE once and      query will be not char(32), but text.
then use it multiple times. It works with condition pushdown.
                                                                     This query starts from 'IAD' airport in seed_part and looks
Yet if different conditions are to be pushed for different CTE
                                                                 through the table tab_2008 to find flights with 'IAD' as origin
and one of the airports from table airports as destination. As       database systems, so it was decided to continue experiments
the needed destination is found it is checked if it has already      on the tables without indexes.
been visited on the route to prevent repeats. From the received
                                                                         Although, we decided to make some other experiments
data we take only those paths that involve all airports from
                                                                     and minimize the number of searched airports. So, fewer steps
table airports and have the smallest overall distance.
                                                                     of recursion were made. We made these experiments only on
                                                                     the table with 1134055 records. The results are shown in
 WITH RECURSIVE s_planes (path, dest, dayofmonth,                    TABLE III.
 dist, it) AS (
  SELECT cast(origin as char(30)), origin,                              When the number of searched airports is 8, MariaDB has
          dayofmonth, 0, 1                                           much better results than PostgreSQL. But during the
  FROM tab_2008                                                      minimization of the number of the airports PostgreSQL results
  WHERE dayofmonth = 3 AND origin = 'IAD' AND
         flightnum = 3231
                                                                     become closer and closer to MariaDB results. When the
  UNION                                                              number of the airports is fewer than 5 PostgreSQL results
  SELECT                                                             become better than MariaDB ones and this trend continues.
   concat(s_planes.path,',',tab_2008.dest),
   tab_2008.dest, tab_2008.dayofmonth,                                   We compared query execution plans in both database
   s_planes.dist+tab_2008.dist, it+1                                 systems and found that they were absolutely the same.
  FROM tab_2008, airports, s_planes
  WHERE                                                              However, operation of HASH JOIN made in a recursive part
   tab_2008.origin = s_planes.dest AND                               of the query requires 3-4 times longer period in PostgreSQL
   locate(tab_2008.dest, s_planes.path)=0 AND                        than in MariaDB on the big number of recursive iterations.
   tab_2008.dest = airports.name AND                                 We don’t know the reason of this yet, and in our further
   tab_2008.dayofmonth > s_planes.dayofmonth)
 SELECT *                                                            researches we will focus on this theme.
 FROM s_planes
 WHERE it = 8 AND
        dist = (SELECT min(dist)
                                                                                               VI. CONCLUSION
                FROM s_planes                                            In this paper we presented a number of techniques for
                WHERE it = 8);
                                                                     execution of CTEs and provided an implementation of these
Fig. 3. Query Q1 for MariaDB                                         techniques for MariaDB. We described in details recursive
                                                                     CTE computation and mutual recursive CTE computation.
                                                                     Also we discussed in which cases non-linear recursion can be
      TABLE II. THE RESULTS OF THE QUERY Q1 (OVERALL RESULT)         used. We compared existence of optimization techniques for
                                   Records count                     non-recursive CTE in different databases.
   DBMS
                     587130          1134055             6858079
                                                                        We also performed some experiments using flights table
 MariaDB      16.72 sec           31.97 sec          3 min 9 sec
                                                                     on PostgreSQL and MariaDB. They showed that PostgreSQL
 PostgreSQL   60.29 sec           1 min 91 sec       11 min 50 sec   has better results only when few steps of recursion are needed.
                                                                     For the long recursive process on a huge amount of data
                                                                     MariaDB is a better choice.
 TABLE III. THE RESULTS OF THE EXPERIMENTS DURING AIRPORTS COUNT
                   MINIMIZATION (OVERALL RESULT )
                                                                         The authors want to express gratitude to MariaDB
                                                                     developers Igor Babaev and Sergey Petrunya for their
                                   Airports count                    participation in the work on MariaDB CTE implementation
   DBMS
                 4            5       6          7           8       and their help in writing this article.
              2.28        3.49     5.93        14.45
 MariaDB                                                31.97 sec
              sec         sec      sec         sec
                                                                                                  REFERENCES
              1.13        2.97     6.74        38.66    1 min 91     [1]   Code of the implementation of non-recursive CTEs for MariaDB, URL:
 PostgreSQL   sec         sec      sec         sec      sec                https://github.com/MariaDB/server/pull/134
                                                                     [2]   Code of the implementation of recursive CTEs for MariaDB, URL:
                                                                           https://github.com/MariaDB/server/pull/182
                                                                     [3]   SQL/Foundation ISO/IEC 9075-2:1999
    We've made a query Q1 on table tab_2008 consists of
                                                                     [4]   SQL/Foundation ISO/IEC 9075-2:2008
different number of records: 587130, 1134055 and 6858079.
                                                                     [5]   P. Przymus, A. Boniewicz, M. Burza´nska, and K. Stencel. Recursive
The results of the experiments are shown in TABLE II.                      query facilities in relational databases: a survey. In DTA and BSBT,
                                                                           pages 89–99. Springer, 2010
What can be seen from the table is that the results of the tests
                                                                     [6]   Oracle Database Online Documentation, 10g Release 2 (10.2)
in MariaDB are more than three times better than in
                                                                     [7]   D. Stuparu and M. Petrescu. Common Table Expression: Different
PostgreSQL.                                                                Database Systems Approach. Journal of Communication and Computer,
                                                                           6(3):9–15, 2009
   Also query Q1 was made on the same table with 587130
                                                                     [8]   TPC Benchmark™DS (TPC-DS): The New Decision Support
records but with the index on origin column. The results                   Benchmark Standard
improved only for an overall value of 1 second in both
[9]  Boniewicz, A., Stencel, K., Wiśniewski, P.: Unrolling SQL:1999               Hibernate ORM. In Eder, J., Bielikov´a, M., Tjoa, A.M., eds.:
     recursive queries. In Kim, T.h., Ma, J., Fang, W.c., Zhang, Y.,              ADBIS(2). Volume 789 of CEUR Workshop Proceedings., CEUR-
     Cuzzocrea, A., eds.: Computer Applications for Database, Education,          WS.org (2011) 190–199
     and Ubiquitous Computing. Volume 352 of Communications in               [13] Ghazal, A., Crolotte, A., Seid, D.Y.: Recursive sql query optimization
     Computer and Information Science. Springer Berlin Heidelberg (2012)          with k-iteration lookahead. In Bressan, S., K¨ung, J., Wagner, R., eds.:
     345–354                                                                      DEXA. Volume 4080 of Lecture Notes in Computer Science., Springer
[10] Szumowska, A., Burzańska, M., Wiśniewski, P., Stencel, K.: Efficient           (2006) 348–357
     Implementation of Recursive Queries in Major Object Relational          [14] Ordonez, C.: Optimization of linear recursive queries in sql. IEEE
     Mapping Systems. In FGIT 2011 78-89                                          Trans. Knowl. Data Eng. 22 (2010) 264–277
[11] Burzanska, M., Stencel, K., Suchomska, P., Szumowska, A.,               [15] Burzanska, M., Stencel, K., Wisniewski, P.: Pushing predicates into
     Wisniewski, P.: Recursive queries using object relational mapping. In        recursive sql common table expressions. In Grundspenkis, J., Morzy, T.,
     Kim, T.H., Lee, Y.H., Kang, B.H., Slezak, D., eds.: FGIT. Volume 6485        Vossen, G., eds.: ADBIS. Volume 5739 of Lecture Notes in Computer
     of Lecture Notes in Computer Science., Springer (2010) 42–50                 Science., Springer (2009) 194–205
[12] Wiśniewski, P., Szumowska, A., Burzańska, M., Boniewicz, A.:
     Hibernate the recursive queries - defining the recursive queries using