Query Processing and Optimization using Compiler Tools

                 Caetano Sauer                            Karsten Schmidt                     Theo Härder
            University of Kaiserslautern            University of Kaiserslautern       University of Kaiserslautern
                     Germany                                 Germany                            Germany
              csauer@cs.uni-kl.de                   kschmidt@cs.uni-kl.de               haerder@cs.uni-kl.de


ABSTRACT                                                            in some specific way, which includes pattern matching, rule
We propose a rule-based approach for (X)Query compila-              evaluation, and translation. To avoid writing huge amounts
tion that operates on a single query representation—called          of sophisticated code by hand, effective language processing
abstract syntax tree (AST)—throughout the whole trans-              tools called compiler generators (or compiler-compilers) are
lation and transformation process. For this purpose, we             used. In this way, programmers are enabled to create power-
exploit some new features of the ANTLR compiler gener-              ful and complex compilers using just a formal description of
ator such as tree pattern matching. This approach avoids            the language: grammars. The goal of this paper is to intro-
error-prone transformations into different internal query rep-      duce this compiler approach in all stages of query processing,
resentations and allows to specify grammar rules instead of         thereby achieving complexity reduction, improved maintain-
writing code tailor-made for the specific DBMS. This ap-            ability, and a formal specification through grammars, which
proach may further help to develop new query optimizer              in turn provides safety.
rules, to prove correctness of rules, and to evaluate rule sets
in a declarative manner.
                                                                    2.   QUERY PROCESSING PIPELINE
1.   INTRODUCTION                                                   The work in [4] describes all stages of XQuery[1] processing
                                                                    in XTC, a native XML database system [3]. XTC follows
A query processor is part of a DBMS responsible for trans-
                                                                    classical conventions of query processor design and will be
lating declarative statements written in a given query lan-
                                                                    used in this work—without loss of generality and indepen-
guage into a sequence of physical operations to be executed
                                                                    dent of the data model—as a base reference and an example.
by the lower levels of the system. This complex task in-
                                                                    All nine stages illustrated in Figure 1 correspond to opera-
cludes parsing, modifying, and optimizing the initial repre-
                                                                    tions over a distinct representation of a query, and the over-
sentation of the query. Typically, this task is divided into
                                                                    all sequence of stages is here referred to as the pipeline of
smaller and more specific steps thereby introducing differ-
                                                                    query processing. Three subtasks can be identified, namely
ent query representations facilitating the translation step at
                                                                    translation, optimization, and execution, each embracing
hand. Reaching the final executable code for a given query,
                                                                    one or more stages. For this reason, three different inter-
the query representation usually undergoes a metamorpho-
                                                                    nal query representations are used: AST (abstract syntax
sis from a string, via a syntax tree, to multiple equivalent
                                                                    tree), XQGM (XML query graph model, i.e., logical plan),
logical and physical plans.
                                                                    and QEP (query execution plan, i.e., physical plan). Note
                                                                    that the input XQuery string is considered as an external
Despite the different representations, which also introduce
                                                                    representation.
new formalisms, data structures, and significant amounts of
source code, the overall task nevertheless is (from a more
                                                                    The subtask translation parses XQuery statements and
abstract perspective) the translation of a declarative state-
                                                                    produces ASTs, which contain the syntax elements of the
ment into executable code—and this problem is tackled by
                                                                    query. Throughout the subsequent steps of normalization,
classical tools known to all computer scientists, namely com-
                                                                    static typing, and simplification, this syntax is checked and
pilers.
                                                                    reduced to a canonical AST, which is finally translated into
                                                                    a logical plan described in XQGM, i.e., an XML extension
Technically speaking, all stages of query processing are per-
                                                                    of the classical query graph model (QGM)[2]. The logical
forming essentially the same processing steps, that is, repre-
                                                                    plan serves as input to the optimizer, which reorganizes
senting the query as a tree of operations and manipulating it
                                                                    or replaces the algebraic operations. Besides reducing the
                                                                    number of operations and the intermediate result size, it also
                                                                    identifies join options and generates alternatives for logically
                                                                    equivalent XQGMs. The logical plan identified as optimal
                                                                    (i.e., cheapest) one is then converted into a physical plan,
                                                                    i.e., algebraic operations are replaced by physical operations
                                                                    on data storage objects. This physical plan, referred to as
                                                                    QEP, is processed by the query engine and delivers a se-
GvD Workshop’10, 25.-28.05.2010, Bad Helmstedt, Germany
Copyright is held by the author/owner(s).                           quence of XML nodes. Eventually, these nodes are materi-
                                                                    alized to a given result format, such as string or XML nodes.
                   XQuery                                          unit of a language, such as a variable name or keywords like
                                                                   for, where, and return. Every token is associated to an
                    Parser                                         optional string, which carries the exact sequence of char-
                                                                   acters that were matched in the input query. To illustrate
        Abstract Syntax Tree (AST)                                 the structure and the creation of the first AST, we use the
                                                                   sample query in Figure 2a. For this query, the parsing stage
                Normalization                  Translator          builds the AST shown in Figure 2b. Tokens are described
                Static Typing                                      by their type (always starting with an upper case charac-
                                                                   ter), e.g. FunctionCall, and the applicable text, which
                Simplification                                     serves here as a payload and is described between brackets.
              QGM Translation                                      In the function call token, for example, the text payload
                                                                   is [“count“]. The relationship between tokens is expressed
      XML Query Graph Model (XQGM)                                 by the tree hierarchy. The argument of the function count,
                                                                   for instance, is a variable reference represented as a child of
                                                                   FunctionCall.
             Algebraic Rewriting
                                               Optimizer
               Plan Generation                                     3.2      Creating AST Structures from Grammars
                                                                   The core engine of a query processor based on ASTs and
        Query Execution Plan (QEP)                                 grammars is a compiler generator. The input string is tok-
                                                                   enized and processed by a compiler. This compiler produc-
                  Execution                                        ing the AST was generated for a specific language described
                                               Executor
               Materialization                                     by a given grammar. Such a grammar consists of a set of
                                                                   rules used to describe the elements of the language. An
                    Result                                         XQuery parser, for example, would make use of four rules
                                                                   to match the input of Figure 2a. At the topmost level, the
 Figure 1: Query processing pipeline used in XTC                   query contains a FLOWR expression consisting of an assign-
                                                                   ment of a variable to a path expression and a function call
                                                                   with a variable reference as the return expression. These
Although working at different levels of abstraction and ma-        rules are described by Figure 2c, using the syntax of the
nipulating different representations of a query, all stages of     ANTLR compiler generator1 .
the pipeline perform essentially similar tasks, such as scan-
ning trees (note, plans are realized as trees), identifying pat-   In addition to parsing the input, the generated compiler
terns, and performing rewrites. In the following section,          must also execute actions embedded in the rules. One ex-
we present a query processing approach that performs these         ample is the maintenance of variable references in a sym-
general tasks based on a high-level descriptive language on        bol table when a variable assignment is parsed. This ac-
a single internal representation, which is used in all stages      tion appears between curly braces, like the pseudo-function
of the pipeline.                                                   insertV ariableRef erence() in the rule flowrExpr of Fig-
                                                                   ure 2c. Later on, when a variable is referenced, the compiler
                                                                   looks up its value in the symbol table and replaces the ref-
3.    ASTS AND GRAMMARS                                            erence by its value.
Most of the complexity behind the implementation or main-
tenance of a query processor results from the use of different
                                                                   A special kind of actions are the so-called rewrite rules,
levels of abstraction, which are expressed by the three dif-
                                                                   which appear after the -> symbol in the grammar. They
ferent internal representations in Figure 1. Implementing
                                                                   specify how the input matched is translated to an AST,
extensions of the query language, adding new functionali-
                                                                   as occurring in the rule functionCall. Here, the tokens
ties, or even maintaining the code due to API modifications
                                                                   matched are mapped to a tree where the root node is the
become cumbersome tasks for the programmer, particularly
                                                                   FunctionCall token and the arguments of the function are
in research prototypes, where changes frequently happen.
                                                                   the children. The syntax used for rewrite rules is, for ex-
                                                                   ample, ^(a b c), where a is the root node and b and c are
3.1    Representing Queries and Plans as ASTs                      its children. Note, the difference between the token and the
A first step towards complexity reduction can certainly be         rule having apparently the same name. The token is a real
achieved by replacing the three internal data structures (AST,     object living in memory and starting with an upper-case let-
XQGM and QEP) with only the AST. The reason for choos-             ter. Rules, on the other hand, serve as logical units of the
ing a plain syntax tree for a unified query representation         compiler execution, like methods in a Java class, and are
comes from the fact that such trees are the basic unit of work     written using a lower-case letter in the first position. Figure
in language compilers. This chapter demonstrates tech-             2b shows the resulting AST when parsing the sample query.
niques to execute all stages of the pipeline with ASTs while
maintaining the semantics of the query language. Note that         3.3      AST Manipulation
the logical abstraction of the query pipeline model is re-         After the generated parser has processed the input query,
tained, while only the implementation technique is unified.        the first AST instance is constructed using the rules given
                                                                   above. Several ASTs are generated and parsed in different
An AST is a labeled, ordered tree structure, where each node
                                                                   1
is a token. By definition, a token is the minimal syntactic            http://www.antlr.org/
    (a)                                                             (a)
1     for $b in doc("dblp.xml")//article
2     return count($b)                                                                               A

    (b)
                                                                                                B        E
                                   FlowrExpr


                  ForClause                    ReturnClause                                C        D

                           PathExpr
      VariableName[”b”]       (omitted)
                                           FunctionCall[“count“]    (b)

                                          VariableReference[”b”]
                                                                    A DOWN B DOWN C D UP E UP

    (c)                                                             Figure 3: (a) A sample AST and (b) the serialized
                                                                    list of tokens
1  expr:
2    flowrExpr | functionCall | varRef ;
                                                                    processing, however, is to apply rewrite rules under distinct
 3 flowrExpr:
                                                                    conditions, which are independent of the AST structure. As
 4   "for" $ qname "in" expr
                                                                    an example, the compiler may rewrite a node access to an
 5     { insertVariableReference(); }
                                                                    index scan when generating a QEP, given the condition that
 6   "return" expr
                                                                    an index for that access exists. Such conditions can be in-
 7   -> ^(FlowrExpr
                                                                    corporated into the grammar using semantic predicates, as
 8          ^(ForClause VariableName[qname] expr)
                                                                    will be shown later.
 9          ^(ReturnClause expr)
10       );
                                                                    Sometimes, it is cumbersome to write full grammars in or-
11 functionCall:
                                                                    der to perform simple AST manipulation tasks, because only
12   qname "(" expr ("," expr)* ")"
                                                                    a particular subset of the rules is relevant. Since a parser
13   -> ^(FunctionCall[qname] expr+);
                                                                    always recognizes all valid expressions of a language, all nec-
14 qname:
                                                                    essary rules must be specified when generating it. To over-
15   ("a".."z")+ ;
                                                                    come this problem, ANTLR implements the concept of a
                                                                    Tree Pattern Matcher. These are special kinds of parsers
    Figure 2: (a) A sample XQuery expression, (b) the               that recognize any token sequence and trigger embedded ac-
    AST generated for it, and (c) the grammar rules to              tions when a certain pattern contained in the grammar rules
    transform the query into an AST                                 is found. Making use of this feature, a grammar, having only
                                                                    the patterns that require rewrite, is sufficient to generate a
                                                                    tree scanner that performs the desired manipulation.

    stages throughout the pipeline. ANTLR provides powerful         3.4    Applying ASTs in the Pipeline
    extensions to the grammar rules, enabling AST-based query       Table 1 sketches four examples for the application of gram-
    processing and optimization. For instance, the insertion of     mar rules on ASTs. Each rule and the related ASTs are
    actions in arbitrary places of a rule (syntax: {}) and the      reduced to the bare minimum required for showing the AST
    specification of rewrite rules (syntax: ->). In the follow-     rewrite effects.
    ing, additional extensions will be described, and Section 3.4
    shows how to apply these techniques in a grammar-based
    query processor.                                                  1. Normalization for a FLOWR expression is applied
                                                                         to replace multiple for clauses by nesting single fors.
    Parsing an AST is obviously an essential functionality for           Here, the rule’s forClause and returnClause were
    further manipulations. Therefore, the AST is serialized into         omitted for simplicity.
    a sequence of tokens. The serialization method implemented        2. Index Matching for a Path Step generates alterna-
    by ANTLR adds two special tokens to signalize descending             tive plans, when suitable indexes are available. The ex-
    and ascending movements—DOWN and UP—and scans                        ample addresses child nodes of the context item named
    the tree in pre-order. Figure 3a illustrates a sample tree           ’author’. After the transformation, this logical data ac-
    and 3b its serialized token sequence. Note that the serial-          cess is represented by a physical operator, either a doc-
    ization is equivalent to the tree syntax in grammar rules,           ument index scan (ElementDocumentScan) or an
    but using the UP/DOWN tokens instead of parenthesis:                 element index scan (ElementIndexScan)—physical
    ^(A ^(B C D) E)                                                      operators available in our XDBMS. This rule is using
                                                                         a semantic predicate—isIndexAvailable().
    Because tree parsing allows to manipulate an AST by re-
    organizing or replacing entire subtrees, we also use it for       3. Join Reordering rules help to generate alternative
    the application of rewriting rules. A crucial task in query          plans having different join orders. Thus, cost-based
        plan evaluation may aim for the cheapest join order         impact of enabling or disabling certain rules and, thereby,
        when applying cost estimations to all the alternatives.     disabling dependent rules automatically. Furthermore, clas-
        In the example, the nested for clauses of expression 1      sical problems will reappear such as cycle detection and pre-
        and 2 are swapped because their context is indepen-         vention, search space pruning, and various search strategies
        dent (the semantic predicate (condition) isForClau-         (e.g., bottom-up, top-down, etc.).
        seIndependent() has to be checked first).
                                                                    Eventually an AST-represented plan needs to be translated
     4. Predicate Pushdown moves the whereClause of a               into a QEP that is often tailored to a specific (X)DBMS.
        FLOWR expression into a nested FLOWR expression.            Here again, rule-based translation allows to easily translate
        However, again a check isWhereIndependent() is              AST expressions into DMBS-dependent operators or code
        required to verify that the where predicate can be ea-      (see Table 1: example 2). In a perfect solution, it is suf-
        gerly used as a filter. Normally a nested FLOWR ex-         ficient to specify the set of available DBMS operators to
        pression is correlated to the outer one and, thus, a non-   automatically select the set of rules that produce a tailored
        blocking whereClause (e.g., blocking clauses contain        QEP. Thus, a generic query optimization may be used for
        a grouping or ordering expression) is usually specified     different DBMSs without losing expressiveness while reduc-
        for the correlated input visible for the nested one.        ing the efforts of optimizer development to a single instance.

4.     OPPORTUNITIES AND CHALLENGES                                 Because XQuery is a Turing-complete language, its not pos-
Integrating the AST-only query manipulation approach into           sible to cover all language aspects within the AST grammar
the XTC system became possible due to the extension of tree         rules. For this reason, we want to explore the limitations
pattern matching in ANTLR. However, the existing query              and opportunities for a smooth integration.
compilation process, sketched in Section 2, is not only com-
plex but also well-studied, optimized, and supported by a           Once the essential parts of the query processing pipeline
huge amount of test cases. To replace the entire compila-           are specified based on the AST concept, we can start with
tion process, a step-by-step approach seems to be feasible.         other aspects of optimization. For instance, the nice way
As a consequence, the new AST-only version may start with           of specifying query rewrite rules may help to easily identify
a rudimentary set of rules that can be manually handled. In         opportunities for (intra) query parallelization.
this way, the correctness of the rule’s impact can be proved
and a suboptimal QEP can be built and executed. In the              5.   SUMMARY & OUTLOOK
next steps, existing optimization rules (i.e., Java code work-      Based on the current state of XTC query processing—the
ing on the QGM) can be transformed into declarative rules           classical approach sketched in Section 2—, we presented our
for the AST. Eventually, we hope to obtain at least the same        first experience using only language tools. This lightweight
set of powerful rewrite rules and integrate them into our new       description style of query translation and transformation
query processing approach. A major challenge is to express          seems to provide equal expressiveness as hand-written pro-
the flexibility of rules written in Java with the concepts and      gram code. However, up to now, neither all limitations are
features of a language tool while retaining the same seman-         identified nor all opportunities disclosed.
tics.
                                                                    We hope to reduce the complexity of writing query compil-
The AST representation entails certain limitations to the se-       ers and query optimizers while not affecting the quality of
mantic rewriting power, when only tree syntax is exploited          optimization. Furthermore, we focus on proving rule cor-
within the rules. For instance, nested bindings and depen-          rectness, finding optimal rule sets, and identifying DBMS-
dencies cannot always be decoupled, although decoupled              independent approaches.
processing may be semantically correct. For instance, for
the join reorder example of Section 3.4 which depends on            The next steps will deal with XQuery language aspects and
the evaluation of the forClauseIndependent() condition,             physical operator mapping capabilities in XTC, before we
a semantically reordering may also be possible in case of an        want to compare the performance (i.e., overhead) of tailored
existing dependency by simply exchanging axis relationships         program code with our new AST rule-based approach.
(e.g., from child axis to parent axis). However, XQuery se-
mantics for empty sequences has to be observed2 . Although
all necessary conditions are represented within the AST, it
                                                                    6.   REFERENCES
                                                                    [1] D. Chamberlin et al., XQuery 1.1: An XML Query
seems to be extremely challenging to describe generic rules             Language. W3C Working Draft, (2008).
covering all the (hidden) corner cases.                             [2] L. M. Haas, J. C. Freytag, G. M. Lohman, and H. Pirahesh,
                                                                        Extensible query processing in starburst, SIGMOD Rec. 18
Another aspect of rule-based optimization is to control the             (1989), no. 2, 377–388.
set of applicable and meaningful rules. Because not all rules       [3] Michael P. Haustein and Theo Härder, An Efficient
                                                                        Infrastructure for Native Transactional XML Processing,
do always improve a given query but only increase the op-               DATAK 61 (2007), no. 3, 500–523.
timization budget, the selection and usage of rules may be-         [4] Christian Mathis, Storing, Indexing, and Querying XML
come an important aspect. Moreover, we hope to exploit                  Documents in Native Database Management Systems, Ph.D.
the easy way of expressing (new) rules to figure out which              thesis, University of Kaiserslautern, München, 7 2009.
rules maybe applied independently and which not. However,
not only the “fastest” rule chain is interesting but also the
2
  When child nodes do not contribute to the evaluation, an
empty sequence is required.
Rule                                 AST before                                AST after
duplicatedForClause:
  ^(Flowr
      forClause
      forClause
      returnClause                                                                        Flowr
  )
  -> ^(Flowr
          ^(For
                                                   Flowr                        For                 Return
              forClause{1}
          )
          ^(Return
              ^(Flowr                 For           For          Return        expr 1               Flowr
                  ^(For
                      forClause{2}
                  )                  expr 1     expr 2           expr 3                    For                Return
                  ^(Return
                      returnClause
                  )
              )                                                                          expr 2               expr 3
          )
     )
;


elementAccess:
  ^(Step
      contextItem
      Axis                                                                      ElementDocumentScan
      nodeTest                                                                  (’child’,’author’)
  )
  -> { isIndexAvailable }?                         PathStep
     ^(ElementIndexScan[                                                                 correlated
             Axis,                                                                         Input
             nodeTest                 correlated
         ]                              Input         Axis         NodeTest
         contextItem
     )                                                                           ElementIndexScan
  -> ^(ElementDocumentScan[                                        NameTest      (’child’,’author’)
                                                     ’child’
             Axis,                                                  ’author’
             nodeTest                                                                    correlated
         ]                                                                                 Input
         contextItem
     )
;


joinReorder:
  ^(Flowr
      forClause           // 1                Flowr                                        Flowr
      ^(ReturnClause
           ^(Flowr
               forClause // 2
                                      For                 Return                 For                Return
           )
      )
  )
  -> { isForClauseIndependent }?     expr 1               Flowr                 expr 2                Flowr
   ^(Flowr
      forClause{2}
      ^(ReturnClause                           For                  Return                  For               Return
           ^(Flowr                                                              Swap
               forClause{1}
           )
      )                                       expr 2                expr 3                 expr 1              expr 3
   );


predicatePushDown:
  ^(Flowr
      forClause
      whereClause
      ^(ReturnClause                            Flowr                                  Flowr
          ^(Flowr
              forClause
              returnClause
                                      For      Where           Return            For        Return
          )
      )
  )
  -> { isWhereIndependent }?         expr 1     expr 2           Flowr          expr 1       Flowr
  ^(Flowr
      forClause
      ^(ReturnClause                                       For       Return     For          Where        Return
          ^(Flowr
              forClause
              whereClause
              returnClause                                expr 3     expr 4    expr 3        expr 2         expr 4
          )
      )
  );


           Table 1: Examples of AST manipulation for XQuery processing