     Projective Beth Definability and Craig
Interpolation for Relational Query Optimization
 (Material to Accompany an Invited Talk at SOQE 2021)

                          David Toman and Grant Weddell

                Cheriton School of CS, University of Waterloo, Canada
                            {david,gweddell}@uwaterloo.ca




        Abstract. Accessing information using a high-level data model or ontol-
        ogy has been a long-standing objective of research communities in several
        areas. The underlying idea of separating a logical/conceptual view of how
        information is understood by users from a physical view of the layout of
        data in data structures, called physical data independence, has been a
        focus of research for more than fifty years. Here, we explore the issues
        connected with optimizing and executing relational queries and updates
        in this setting. In particular, we consider how to find appropriate refor-
        mulations of user queries over the physical design, and show how these
        ideas naturally relate to first-order definability and interpolation.
        The talk elaborates on how logic-based approaches can be used to capture
        both the high-level conceptual views of information and the low-level
        physical layout of data. Based on such a formalism, we present the design
        of a relational query optimizer based on Craig interpolation that allows
        users to compile both queries and updates to low-level code that operates
        directly over a physical encoding of the data. The ultimate objective of
        this design is to produce low-level code that performs comparably with
        hand-written code in low-level programming languages such as C.



    Copyright © 2021 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).


1     The Problem

Abstraction and high-level approaches to software development have been among
the most significant factors in increasing both the productivity of developers and
the quality of the applications they develop. Efforts along these lines date back
to the late 60s and early 70s when the concepts of data independence [1,2] and
abstract data types (ADTs) [13] were introduced. The concepts share the goal
of enabling application programmers to develop applications with respect to
a purely abstract or conceptual understanding of the application’s data or in-
formation, an understanding that is entirely independent of the concrete data
structures and related algorithms that encode the data on physical storage de-
vices. Indeed, modern file systems are examples: programmers manipulate files
via operations that are entirely devoid of any need to understand low-level disk
layout issues.
    The idea of data independence gained popularity in the 70s, both in the area
of programming languages, e.g., with languages such as SETL [8,14,10], and in
the area of information and database systems mainly due to the development of
the relational model (RM) [5] with accompanying data manipulation language(s)
[6] based on first-order logic. However, fifty years later, approaches that use
lower levels of abstraction, such as C or the various recent NoSQL database
systems, have often displaced approaches that promote data independence,
frequently at the cost of increased development time and/or lower quality of
the deployed systems. The most common reason for this phenomenon is the need
for massive scalability and flexibility, capabilities often missing in systems
with high levels of abstraction such as RM.
    The goal of this presentation is to outline a direction of research that, in
the realm of database and information systems, enables simultaneous high level
abstractions at the user level and extreme flexibility at the physical design level,
that is, in the choice of concrete data structures and their access algorithms.
    Indeed, the ultimate goal of this direction of research is to compete with
hand-written code in low-level languages such as C, while providing the high
level of abstraction promised by the original RM [5], a level not yet fully
realized in existing relational database management systems.


2      Data Independence (through an Example)

We begin by outlining how data independence can be understood more formally in
terms of first-order (relational) signatures and integrity constraints (i.e., first-
order sentences over these signatures).


2.1     The Logical Schema

The logical schema is a first-order signature SL and an accompanying set of
integrity constraints ΣL that are specific to the domain of the application
(and with which users are assumed to be familiar). The situation can be
depicted as follows:


                  ΣL        SL   <--   ϕ         Logical Schema
                                                 and User Queries

The users interacting with the data use queries, in our case open first-order
formulae over SL , to formulate their requests (we will deal with modifying the
data later in Section 4). There are two important observations that follow from
this arrangement:

    1. The user only requires familiarity with SL and ΣL to be able to develop
       applications; and

2. The user can assume that the actual data is a single interpretation on SL
   that is a model of ΣL over which her requests are evaluated (i.e., without
   the need to comprehend subtle issues related to logical entailment and/or
   belief revision). We call such interpretations instances of the schema.

Example 1 (Logical Schema)
We will use the following logical schema formulated in SQL as our running
example.

CREATE TABLE employee (
  num     INTEGER NOT NULL,
  name    CHAR(20),
  worksin INTEGER NOT NULL,
  PRIMARY KEY (num),
  FOREIGN KEY (worksin) REFERENCES department
)

CREATE TABLE department (
  num     INTEGER NOT NULL,
  name    CHAR(50),
  manager INTEGER NOT NULL,
  PRIMARY KEY (num),
  FOREIGN KEY (manager) REFERENCES employee
)

The (instances of) employee and department relation declarations are intu-
itively meant to store information about employee numbers, names and depart-
ments they work in, and about departments, their names and managers. In our
formalism, this is simply syntactic sugar for a signature

                       SL = {employee/3, department/3}

(where “/i” indicates predicate arity) and integrity constraints

    ΣL = { employee(x, y1 , z1 ) ∧ employee(x, y2 , z2 ) → y1 = y2 ∧ z1 = z2 ,
           employee(x, y, z) → ∃u, v.department(z, u, v), . . .                }

stating that employees are identified by their number, that they must work for
exactly one department, and so on.
The ability to specify integrity constraints in ΣL allows one to go beyond what
is available in typical implementations of the relational model, for example:

 – managers are employees that manage a department (a view)
     manager(x, y, z) ↔ employee(x, y, z) ∧ ∃u, v.department(u, v, x)
 – managers work in their own departments (business rule)
     employee(x, y, z) ∧ department(u, v, x) → z = u
 – workers and managers partition employees (partition)
     employee(x, y, z) ↔ (manager(x, y, z) ∨ worker(x, y, z))
     manager(x, y, z) ∧ worker(x, y, z) → ⊥


Observe that this extends the signature of the logical schema with additional
predicate symbols manager/3 and worker/3 that a user can now reference in
queries.
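
For comparison, the first of these (the manager view) can also be expressed in
standard SQL; a sketch of such a view definition over the schema of Example 1
is:

CREATE VIEW manager AS
  SELECT e.num, e.name, e.worksin
  FROM   employee e
  WHERE  EXISTS (SELECT * FROM department d
                 WHERE  d.manager = e.num)

The business rule and the partition constraints, however, have no such direct
counterpart in typical implementations of the relational model.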

2.2      The Physical Schema

We use a similar strategy to define the physical schema, again in terms of
relational signatures and constraints. However, these symbols will now
correspond to actual data structures, called access paths in the database
literature. These access paths correspond to various ways to access data, ranging
from dereferencing a pointer in main memory or extracting a field from a main
memory record (abstracted by binary predicate symbols whose interpretations
are address-value pairs) to using main memory data structures such as linked
lists (again abstracted by appropriate predicate symbols) to reading data from
external storage, and to communicating with other agents. The situation can be
again depicted as follows:

                  ΣL        SL   <--   ϕ         Logical Schema
                                                 and User Queries
                             |
                      ΣLP    |  (compile)
                             v
                  ΣP    SA ⊆ SP  <--   ψ         Physical Schema
                                                 and Query Plans

Note that there can be additional helper predicate symbols in SP in addition
to the access paths in SA .
   There are two issues with this strategy that must be addressed:

    1. Will it suffice to associate access paths (data structures and their associated
       search algorithms) with predicate symbols?
    2. Is it reasonable to also think of the generated code that uses access
       paths as formulae (ψ above)?

To address the first question, we annotate the symbols in SA with so-called
binding patterns [19] indicating which arguments of the particular access path
must be bound to a value before the access path can be executed. We indicate
this by an additional integer in the signature specification; for example,
“pointer-nav/2/1” indicates that the access path representing address-value
pairs in main memory can only be used when we have a value for the first
component (i.e., an address). The implementation then consists of a simple
statement dereferencing this address to produce a value for the second
argument. This observation also leads to restrictions on the form of ψ [16].
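
As a concrete illustration, a minimal C sketch of an implementation of the
pointer-nav/2/1 access path (the function name is our own; the paper fixes
only the access path and its binding pattern) is simply a dereference:

/* pointer-nav/2/1: the first argument (an address) must be bound before
   execution; the code dereferences it to produce a value for the second,
   unbound argument. */
void *pointer_nav(void *bound_address) {
    return *(void **)bound_address;
}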

Example 2 (Physical Schema)
We illustrate the first issue by defining the physical schema for our running
example. Our physical design consists of a linked list of employee records that
use pointers (references) to indicate department records an employee works in.
In a similar fashion, the department records use a pointer to indicate which

employee is a manager. The records in a Pascal-like notation are as follows:

         record emp of                                record dept of
                integer num                                  integer num
                string    name                               string    name
                reference dept                               reference mgr

In our formalism this looks as follows: we define the following predicates to be
associated with access paths (i.e., in SA ):

 – empfile/1/0: set of addresses of emp records; this access path abstracts
   navigating a linked list (of emp records) in main memory.
 – emp-num/2/1: a set of addresses of emp records paired with the emp num-
   bers; this access path corresponds to extracting a field (num in this case)
   from an emp record (given an address of such a record). The access paths
   emp-name/2/1, emp-dept/2/1, dept-num/2/1, dept-name/2/1, and
   dept-mgr/2/1 similarly abstract the extraction of the remaining fields from
   the emp and dept records.

We also use two auxiliary predicates emp/1 and dept/1 to stand for the sets of
addresses of emp and dept records. Integrity constraints (ΣP ∪ΣLP ) then capture
the properties of instances of the physical schema and how they relate to the
logical schema. For example, the fact that records have appropriate fields can be
specified as follows:

emp(e) → ∃d.emp-dept(e, d)                               emp records have a dept field
emp-dept(e, d1 ) ∧ emp-dept(e, d2 ) → d1 = d2            the dept field is functional
emp-dept(e, d) → dept(d)                                 the value of the dept field is
                                                              a pointer to a dept record

For a full listing of the constraints, see Appendix A. This completes our description
of the physical schema for our example.
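
In C, the eventual compilation target, the two record types could be declared
as follows (a sketch: the field names mirror the Pascal-like records above,
the string lengths come from Example 1, and the next field realizing the
linked list behind empfile/1/0 is our own assumption about the design):

struct dept;                      /* forward declaration               */

struct emp {                      /* record emp                        */
    int          num;
    char         name[20];
    struct dept *dept;            /* field behind emp-dept/2/1         */
    struct emp  *next;            /* link realizing empfile/1/0        */
};

struct dept {                     /* record dept                       */
    int          num;
    char         name[50];
    struct emp  *mgr;             /* field behind dept-mgr/2/1         */
};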

2.3   Queries and Plans
Now we are ready to answer our second question: how to interpret formulae
as query plans. This is straightforward: atomic formulae are mapped to
(the code associated with) access paths and logical connectives and quantifiers
to “control flow code fragments” as follows:
 atomic formula           ↦   access path (a get-first / get-next iterator)
 conjunction              ↦   nested loops join
 existential quantifier   ↦   projection (with optional duplicate elimination)
 disjunction              ↦   concatenation
 negation                 ↦   simple complement
For a formula to correspond to a plan (i.e., executable code), it is also necessary
to obey binding patterns [16]. While such a procedural interpretation of atoms

and logical connectives might seem overly simplistic, we discuss in Section 3.2
below how this simple fine-grained interpretation suffices for most of the hard-
coded solutions in other database systems.
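
To make the mapping concrete, the following self-contained C sketch (all names
are our own) shows an atomic formula realized as a get-first / get-next
iterator and a conjunction realized as the corresponding loop:

#include <stdio.h>
#include <stddef.h>

/* emp records chained into a linked list: the access path empfile/1/0 */
struct emp { int num; const char *name; struct emp *next; };

/* the get-first / get-next protocol for empfile */
static struct emp *empfile_first(struct emp *head) { return head; }
static struct emp *empfile_next(struct emp *e)     { return e->next; }

/* emp-num/2/1: the record address must be bound before the call */
static int emp_num(struct emp *e) { return e->num; }

int main(void) {
    struct emp e2 = { 7, "Bea", NULL }, e1 = { 3, "Al", &e2 };
    /* plan for empfile(e) and emp-num(e, x): scan, then field access */
    for (struct emp *e = empfile_first(&e1); e != NULL; e = empfile_next(e))
        printf("%d\n", emp_num(e));
    return 0;
}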
Example 3
We illustrate this framework by worked examples of several user queries together
with possible query plans for these queries over the physical design of our
running example.

Q1: List employee numbers, names, and departments (employee(x, y, z)). We
can show that this user query is logically equivalent under the integrity con-
straints to the following formula over SA :
            ∃e, d.empfile(e) ∧ emp-num(e, x) ∧ emp-name(e, y)
                                  ∧ emp-dept(e, d) ∧ dept-num(d, z)
Assuming our formulas-as-plans mapping, this formula would correspond to the
following C-like code (with trivial simplifications and inlining of the ea-xxx(x, y)
access paths to y := x->xxx):
          for e in empfile do
                    x := e->num; y := e->name;
                    d := e->dept; z := d->num; return (x, y, z);
Note also that the formula above satisfies the binding patterns associated with
the access paths used, as it retrieves the address of an emp record before
attempting to extract the values of its fields.

Q2: List worker numbers and names (∃z.worker(x, y, z)). Again, this query is
equivalent to the following formula over SA :
           ∃e, d.empfile(e) ∧ emp-num(e, x) ∧ emp-name(e, y)
                                 ∧ emp-dept(e, d) ∧ ¬dept-mgr(d, e)
Note that a negation, ¬dept-mgr(d, e), is required, and that neither the query
nor the schema contains a negation that provides any direct clue that it is
needed. (We are not aware of any system that can synthesize this plan other
than those that compile queries using this framework.)
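
In the C-like notation used for Q1, a sketch of the corresponding plan (our
own rendering) is:

          for e in empfile do
                    x := e->num; y := e->name; d := e->dept;
                    if d->mgr != e then return (x, y);

where the negated atom becomes a simple test on the manager field of the
employee's department.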

Q3: List all department numbers and their names (∃z.department(x, y, z)).
Finding a plan for this query is more difficult since we do not have a direct
way to “scan” dept records. However, it is an easy exercise to verify that the
following two formulae over SA are logically equivalent to the query:
∃d, e.empfile(e) ∧ emp-dept(e, d)
         ∧ dept-num(d, x) ∧ dept-name(d, y)
      (relying on the constraint that “departments have at least one employee”)

∃d, e.empfile(e) ∧ emp-dept(e, d)
        ∧ dept-num(d, x) ∧ dept-name(d, y) ∧ dept-mgr(d, e)
    (relying on the constraint that “managers work in their own departments”)

Both correspond to plans. However, while the second might seem to be less
efficient than the first, a query optimizer should prefer it on the grounds that,
in this case, the quantified variables d and e are functionally determined by the
answer variable x. Hence, the final projection generated for the second has no
need to eliminate duplicate answers. This is not the case for the first of these
formulae: without duplicate elimination in the final projection, it would return
a copy of the department information for every employee of the department.
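
In the same notation, a sketch of a plan for the second formula (again our own
rendering) is:

          for e in empfile do
                    d := e->dept;
                    if d->mgr = e then
                              x := d->num; y := d->name; return (x, y);

and, since each dept record has exactly one manager, each department is
emitted at most once without any duplicate elimination.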
Many other problems and issues in physical design and query plans can be
resolved in this framework, including standard RDBMS physical designs (and
more), access to search structures (index access and selection), horizontal par-
titioning/sharding, column store/index-only plans, hash-based access to data
(including hash-joins), multi-level storage (aka disk/remote/distributed files),
materialized views, etc., all without any need for coding in C beyond the need
for the generic specifications of get-first / get-next templates for concrete
data structures [16].


3     Interpolation and Query Optimization
Now we turn our attention to the description of a query compiler/optimizer that,
given the logical and physical schemata and a user query, generates a query plan
that correctly implements the user request.

3.1   What Queries Make Sense? (to users)
However, before we begin, it is important to resolve what queries make sense
to a user who presumes there is a single interpretation of symbols in SL at any
point in time, no matter how it is represented/stored physically. To satisfy
this expectation, the queries that make sense should have the same answer in
every model of the overall physical design Σ in which the interpretation of SA is
fixed, that is, where the stored data is always the same. This arrangement also
guarantees that artifacts facilitating efficient storage and retrieval of information
won’t be leaked in the results of queries (since they do not exist in the logical
view of the data). The consequence of this observation is that either
 1. there are situations in which a seemingly reasonable user query cannot be
    answered (that would be the case for Q3 in Section 2.3, were the constraint
    “departments have at least one employee” absent from the schema), or
 2. queries must adhere to syntactic restrictions in which, e.g., symbols corre-
    sponding to built-in operations cannot be used completely freely, and physi-
    cal designs must also adhere to syntactic restrictions such as so-called stan-
    dard designs (i.e., where an access path exists for every logical table in ΣL ,
    thus guaranteeing that every user query can be answered).
To make the definition of sensible queries more formal, we appeal to a well-known
notion of definability:

Proposition 4 (Projective Beth Definability [4])
Let Σ ∪ {ϕ} be a FO theory over symbols in L and L′ ⊆ L. Then the following
are equivalent:

 1. For all models M1 , M2 of Σ such that M1 |L′ = M2 |L′ , and for all tuples a
    of individuals, it holds that M1 |= ϕ[a] iff M2 |= ϕ[a]; and
 2. ϕ is equivalent under Σ to some formula ψ in L′ .

We say that ϕ is explicitly definable w.r.t. Σ and L′ .
Definability (over SA w.r.t. Σ) formally captures the idea of (physical) data
independence, that is, the illusion presented to the users of a single
interpretation of the logical schema satisfying the integrity constraints, and
therefore provides the means of determining which queries can be answered over
a particular physical design.
    The first question is how to test for definability. The following observation
reduces this test to determining whether a particular formula constructed from
the user query is entailed by a theory constructed from the schema: ϕ is explicitly
definable (w.r.t. Σ and over SA ) if and only if

                                  Σ ∪ Σ′ |= ϕ → ϕ′                                 (1)

where Σ′ and ϕ′ are Σ and ϕ, respectively, in which all symbols not in SA are
primed.
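
For instance, testing the definability of Q1 in our running example amounts to
verifying the entailment

                  Σ ∪ Σ′ |= employee(x, y, z) → employee′(x, y, z)

where the access-path symbols empfile, emp-num, emp-name, emp-dept, dept-num,
etc. remain unprimed and are therefore shared between the two copies of the
schema.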
    The next question is how to find a plan for a given query. Our observations
in Section 2.3 on how formulae can be interpreted as query plans then mostly
reduce query compilation to a search for the formula ψ in Proposition 4(2). To
find ψ, we rely on a variant of the following result [7]:

           If Σ ∪ Σ′ |= ϕ → ϕ′ then there is ψ s.t. Σ ∪ Σ′ |= ϕ → ψ → ϕ′

where L(ψ) ⊆ L(SA ). Here, ψ is called the Craig interpolant. Moreover, we can
extract any such ψ from a tableau proof of (1) in linear time [9].
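
For Q1 of Example 3, for instance, the SA formula given there is such an
interpolant: it is equivalent to the user query under the constraints and
mentions only access-path symbols.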

3.2      Architecture
The above discussion might seem to solve the query compilation problem. How-
ever, there are additional issues that need to be addressed:
    1. The search for interpolants and their implied query plans must consider
       that alternative but logically equivalent plans might have vastly different
       performance characteristics.1 Hudek et al. [11] introduce an approach that
       separates the tableau-based search for interpolants from the cost-based2 ex-
       ploration of alternative query plans.

2. Binding patterns for access paths (see Section 2.2) further restrict the space
   of executable query plans (and in turn of sensible queries). Benedikt et al. [3]
   have shown how the binding patterns can be accommodated in the search
   for interpolants (i.e., in the search for proofs of definability).
3. In addition, during the search for optimal query plans, we consider the im-
   pact of duplicate elimination as illustrated by plans for Q3 in Section 2.3. A
   detailed account for this facet of query compilation can be found in [16,18].

1 This holds even for conjunctive formulae: hence database literature often focuses on
  the so-called join-order problem [15].
2 Cost-based query optimization is the cornerstone of relational systems [15]; advance-
  ments in the area of query plan cost estimation are easily incorporated in this frame-
  work.

Figure 1 sketches an architecture of a query compilation/optimization system
that addresses the above concerns.

[Fig. 1: Compiler Architecture. Query and Schema are each compiled to
bytecode (the schema compilation includes optimization); the bytecode drives a
split tableau VM; the resulting closing sets are explored by a Planner (A*)
guided by postprocessing and cost estimation based on a cost model; a Code
Generator emits C(lang) source that, linked against the Access Path Libraries,
yields the executable code.]

The compiler preprocesses the given schema and the user query into a normal
form and generates bytecode that drives a virtual machine-based (VM) tableau
theorem prover [17]. Unlike standard theorem
provers, including those that can generate interpolants [12], the tableau VM gen-
erates an intermediate representation of a space of equivalent interpolants called
closing sets [11]. Closing sets are then explored by an A∗ -based planner to find
a query plan with the lowest estimated cost. The planner also explores ways to
avoid duplicate elimination in the process. The planner is then followed by a
code generator that produces the ultimate query plans in the form of C source.


4   Updates

In this section, we sketch how the problem of compiling updates on a logical
design can be translated to the problem of compiling queries on a related logical
design, thus enabling the same framework above to also be used to compile
inserts, updates and deletes on logical tables.
    As already mentioned in Section 2.1, user updates are formulated with re-
spect to the logical schema (SL and ΣL ). Moreover, physical data independence
presents the user with an illusion that he is modifying an instance of SL by
adding/removing ground tuples to/from the interpretations of symbols in SL .
This process can be formalized in three parts as follows:

 1. For every symbol R ∈ SL , introduce two additional symbols, R+ and R− ,
    whose (disjoint) interpretations correspond to the ground tuples the user
    wants to add or remove to/from the current instance;
 2. The updated instance is then defined by executing the simultaneous assign-
    ments R := (R ∪ R+ ) − R− for all R ∈ SL ; and
 3. At the end of the assignment the new interpretation must be a model of ΣL .

The symbols R+ and R− are commonly called delta relations, and Part 3 of
this process ensures so-called consistency-preserving transactions.
    To convert the update problem to the problem of synthesizing plans for
queries, consider two copies of the schema Σ, in which all symbols are super-
scripted by o and n , respectively. The intuition is that the o and n symbols
correspond to the interpretations of SL before and after the update. The actual
assignment (Part 2 of the above process) can then be captured as additional
formulae
                        Ro (x) ∨ R− (x) ↔ Rn (x) ∨ R+ (x)
for each R ∈ SL as depicted below.


                              U+ ,U−
               ΣLo     SLo  ========>  SLn     ΣLn

                      ΣLPo  (compile)  ΣLPn

                              A+ ,A−
               ΣPo  SA ⊆ SPo ========> SAn ⊆ SPn   ΣPn
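
Instantiating this constraint for employee in our running example, for instance,
yields employeeo (x, y, z) ∨ employee− (x, y, z) ↔ employeen (x, y, z) ∨
employee+ (x, y, z).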


In the same way, the changes to access paths in SA can be captured by analo-
gous constraints, as depicted in the lower half of the figure. Thus, user inserts,
updates and deletes on logical tables (comprising a transaction) are mapped to
a definability question of the following form:

     Is An (or A+ , A− ) definable in terms of the Aoi (the current access paths)
     and the Uj+ , Uj− (user updates) for every access path A ∈ SA , given the
     instance of all access paths in SA and of all delta relations for SL ?

A positive answer to this question yields an update plan that applies the delta
relations corresponding to the access paths to their current interpretations.
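
To illustrate this last step, the following minimal C sketch (our own code, not
part of the system; it assumes the linked-list realization of empfile from
Section 2.2) applies synthesized delta relations empfile+ and empfile− to the
stored list:

#include <stdlib.h>

struct emp { int num; struct emp *next; };

/* Apply the deltas to the access path empfile/1/0: unlink and free the
   records selected by the empfile- test, then link the empfile+ records
   at the front of the list; returns the new list head. */
struct emp *apply_delta(struct emp *head, struct emp *adds,
                        int (*removed)(const struct emp *)) {
    struct emp **p = &head;
    while (*p != NULL) {
        if (removed(*p)) {                 /* tuple in empfile-  */
            struct emp *dead = *p;
            *p = dead->next;
            free(dead);
        } else {
            p = &(*p)->next;
        }
    }
    while (adds != NULL) {                 /* tuples in empfile+ */
        struct emp *a = adds;
        adds = a->next;
        a->next = head;
        head = a;
    }
    return head;
}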


5    Summary

We have outlined how projective Beth definability can be used in database and
information systems to facilitate physical data independence. Moreover, we have

shown how a variation on Craig interpolation can be used to compile and opti-
mize user queries and user updates that are formulated over a logical schema to
an executable plan over a fine-grained physical design. There are many avenues
for further research and development, including: (1) admitting more powerful
languages for user requests, such as languages with aggregation; (2) enhance-
ments to the tableau provers, as well as alternatives such as superposition-based
provers; and (3) improvements to the planning component of query compilation
responsible for exploring the search space of alternative query plans.


References
 1. Charles W. Bachman. CODASYL data base task group: October 1969 report,
    1969.
 2. Charles W. Bachman. Summary of current work - ANSI/X3/SPARC/Study
    Group–Database Systems. FDT Bull. ACM SIGFIDET SIGMOD, 6(3):16–39,
    1974.
 3. Michael Benedikt, Julien Leblay, Balder ten Cate, and Efthymia Tsamoura. Gen-
    erating Plans from Proofs: The Interpolation-based Approach to Query Reformu-
    lation. Synthesis Lectures on Data Management. Morgan & Claypool Publishers,
    2016.
 4. Evert Willem Beth. On Padoa’s method in the theory of definition. Indagationes
    Mathematicae, 15:330–339, 1953.
 5. E. F. Codd. A relational model of data for large shared data banks. Commun.
    ACM, 13(6):377–387, 1970.
 6. E. F. Codd. Relational completeness of data base sublanguages. IBM Research
    Report, RJ987, 1972.
 7. William Craig. Three uses of the Herbrand-Gentzen theorem in relating model
    theory and proof theory. Journal of Symbolic Logic, 22:269–285, 1957.
 8. Robert B. K. Dewar, Arthur Grand, Ssu-Cheng Liu, Jacob T. Schwartz, and Ed-
    mond Schonberg. Programming by refinement, as exemplified by the SETL repre-
    sentation sublanguage. ACM Trans. Program. Lang. Syst., 1(1):27–49, 1979.
 9. Melvin Fitting. First-Order Logic and Automated Theorem Proving, Second Edi-
    tion. Graduate Texts in Computer Science. Springer Publishers, 1996.
10. Stefan M. Freudenberger, Jacob T. Schwartz, and Micha Sharir. Experience with
    the SETL optimizer. ACM Trans. Program. Lang. Syst., 5(1):26–45, 1983.
11. Alexander K. Hudek, David Toman, and Grant E. Weddell. On enumerating query
    plans using analytic tableau. In Automated Reasoning with Analytic Tableaux and
    Related Methods - 24th International Conference, TABLEAUX 2015, Wroclaw,
    Poland, September 21-24, pages 339–354, 2015.
12. Laura Kovács and Andrei Voronkov. Interpolation and symbol elimination. In
    Renate A. Schmidt, editor, Automated Deduction - CADE-22, 22nd International
    Conference on Automated Deduction, Montreal, Canada, August 2-7, 2009. Pro-
    ceedings, volume 5663 of Lecture Notes in Computer Science, pages 199–213.
    Springer, 2009.
13. Barbara H. Liskov and Stephen N. Zilles. Programming with abstract data types.
    SIGPLAN Notices, 9(4):50–59, 1974.
14. Edmond Schonberg, Jacob T. Schwartz, and Micha Sharir. An automatic technique
    for selection of data structures in SETL programs. ACM Trans. Program. Lang.
    Syst., 3(2):126–143, 1981.

15. Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A.
    Lorie, and Thomas G. Price. Access Path Selection in a Relational Database
    Management System. In ACM SIGMOD International Conference on Management
    of Data, pages 23–34, 1979.
16. David Toman and Grant E. Weddell. Fundamentals of Physical Design and Query
    Compilation. Synthesis Lectures on Data Management. Morgan & Claypool Pub-
    lishers, 2011.
17. David Toman and Grant E. Weddell. An interpolation-based compiler and opti-
    mizer for relational queries (system design report). In IWIL@LPAR 2017 Workshop
    and LPAR-21 Short Presentations, Maun, Botswana, May 7-12, 2017, 2017.
18. David Toman and Grant E. Weddell. Using feature-based description logics to
    avoid duplicate elimination in object-relational query languages. Künstliche Intell.,
    34(3):355–363, 2020.
19. Jeffrey D. Ullman. Implementation of logical query languages for databases. ACM
    Trans. Database Syst., 10(3):289–321, 1985.


A     Constraints for the Running Example
The following listing is a complete specification of constraints needed for our
running example. Note that some of the constraints in Section 2.1 are entailed
by the constraints below (and are thus omitted).
%
% logical schema (entailed constraints omitted)
%
% a (virtual) view for managers
manager(x,y,z) <-> (employee(x,y,z) and ex(n,department(z,n,x))),
%
% disjoint partition of employees to managers and workers
employee(x,y,z) <-> (manager(x,y,z) or worker(x,y,z)),
manager(x,y,z) and worker(x,u,v) -> bot,
%
% business logic: managers work for their own departments
(department(x,y,z) and employee(z,u,w)) -> x=w,
%
% physical schema and mappings
%
% design of emp and dept structs; emp/dept addresses, fields functional
emp(e) -> ex(y,emp_num(e,y)), emp_num(e,y) and emp_num(e,z)-> y=z,
                               emp_num(y,x) and emp_num(z,x)-> y=z,
emp(e) -> ex(y,emp_name(e,y)), emp_name(e,y) and emp_name(e,z)-> y=z,
emp(e) -> ex(y,emp_dept(e,y)), emp_dept(e,y) and emp_dept(e,z)-> y=z,
                               emp_dept(e,d) -> dept(d),
%
dept(d) -> ex(y,dept_num(d,y)), dept_num(d,y) and dept_num(d,z)-> y=z,
                                dept_num(y,x) and dept_num(z,x)-> y=z,
dept(d) -> ex(y,dept_name(d,y)), dept_name(d,y) and dept_name(d,z)-> y=z,
dept(d) -> ex(y,dept_mgr(d,y)), dept_mgr(d,y) and dept_mgr(d,z)-> y=z,
                                dept_mgr(d,e) -> emp(e),

%
% linked list for ea’s and record attributes
%
empfile(x) <-> emp(x),
%
% user predicates and mappings
%
employee(x,y,z) <-> ex(e,baseemployee(e,x,y,z)),
%
emp(e)                              <-> ex([x,y,z],baseemployee(e,x,y,z)),
emp_num(e,x)                        <-> ex([y,z],baseemployee(e,x,y,z)),
emp_name(e,y)                       <-> ex([x,z],baseemployee(e,x,y,z)),
ex(d,emp_dept(e,d) and dept_num(d,z)) <-> ex([x,y],baseemployee(e,x,y,z)),
%
department(x,y,z) <-> ex(d,basedepartment(d,x,y,z)),
%
dept(d)                            <-> ex([x,y,z],basedepartment(d,x,y,z)),
dept_num(d,x)                      <-> ex([y,z],basedepartment(d,x,y,z)),
dept_name(d,y)                     <-> ex([x,z],basedepartment(d,x,y,z)),
ex(e,dept_mgr(d,e) and emp_num(e,z)) <-> ex([x,y],basedepartment(d,x,y,z))