Experiencing Hypothetical Datalog in SQL Puzzles Fernando Sáenz-Pérez1 1 Faculty of Computer Science, Complutense University of Madrid, 28040 Madrid, Spain Abstract This work presents some SQL puzzles making use of the application of Hypothetical Datalog (HD) to implement an SQL system. HD provides a way of declaring SQL Common Table Expressions as used to specify local view definitions and recursive queries. We present the concepts of HD that will be employed to translate SQL queries to HD, providing its syntax, an inference system and translation rules. Finally, several SQL puzzles used during teaching SQL in a database subject are proposed. Keywords Hypothetical Datalog, Recursive SQL, Teaching, Puzzles 1. Introduction SQL is usually understood as a specific-purpose, declarative programming language, intended to insert and extract data from relational databases. However, since the addition of recursive common table expressions (CTE, included for the first time in the ISO/IEC Standard SQL:1999, also known as SQL3) its nature has evolved to become a Turing-complete language. To acknowledge this, a specific class of tag systems, CTS’s (Cyclic Tag Systems) were proposed in [1], and a corollary of its work was that a CTS can emulate a Turing-complete class of tag systems. Since a CTS can be implemented in standard SQL,1 this proves that SQL is Turing complete. This result can be used to pose more advanced problems as puzzles to students with the aim of sharpening their programming skills and promoting open-mindedness. Using puzzles to this end is a well-known technique [2], and in fact they are used all around the world as in the Hour of Code,2 Programming Praxis,3 CodeKata,4 ¡Acepta el reto! 5 and also specific for SQL [3].6 Our experience and student feedback when teaching database concepts to advanced students (typically in the double degree of Computer Science and Mathematics) during last years confirms the usefulness of puzzles as an intellectual challenge posed to students (proposed as extra-curricular activities). This paper collects some of those SQL puzzles, ranging from the easier ones to the more complex and appealing ones. Though SQL can not be considered as an esoteric language [4], expressing these puzzles with this language can be viewed as an esoteric usage. While these puzzles can be solved in most SQL systems, when teaching we use instead the Datalog Educational System (DES)7 [5]. This system, which is capable of solving standard SQL queries, is in our view more adequate for students than others because, in particular, it provides better syntax error explanations and even semantic error warnings [6]. Moreover, students can use its SQL declarative debugger [7] for fixing semantically incorrect views. These last two features are absent in commercial SQL systems. Apart from this, DES extends the SQL language beyond the standard, including the Datalog 2.0 2024: 5th International Workshop on the Resurgence of Datalog in Academia and Industry, October 11-14, 2024, Dallas, Texas, USA $ fernan@sip.ucm.es (F. Sáenz-Pérez)  0000-0001-6075-4398 (F. Sáenz-Pérez) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 1 https://wiki.postgresql.org/index.php?title=Cyclic_Tag_System&oldid=15106 2 https://hourofcode.com 3 https://programmingpraxis.com 4 http://codekata.com 5 Face the challenge! https://www.aceptaelreto.com 6 SQLzoo https://sqlzoo.net, HackerRank https://www.hackerrank.com/domains/sql 7 Desktop applications for different OS’s and Prolog host systems (https://des.sourceforge.io), and an on-line system DESweb (https://desweb.fdi.ucm.es) which includes the last development version are available. CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings 54 Fernando Sáenz-Pérez CEUR Workshop Proceedings 54–68 possibility of making assumptions, both positive (assuming data which is not in the database) and negative (neglecting data which is already in the database) for querying the database. However, since these puzzles are targeted at database students, we restrict ourselves to uses admitted by the SQL standard and leave the extension out of the scope of this paper. DES is built with concepts from intuitionistic logic programming (ILP) [8] dealing to a Hypothetical Datalog (HD) language, which makes possible those positive and negative assumptions. As a conse- quence, it can be used to express temporary relations, similar to what an SQL WITH clause allows [9]. Each SQL query submitted to this system is translated into an HD program and solved by its logic engine. This paper extends [9] by introducing expressions in conditions and meta-predicates which are needed to support the translation of standard SQL queries with expressions in WHERE clauses and aggregates as required to solve those puzzles.8 1.1. Motivation SQL data-retrieval queries usually start with a SELECT keyword, which builds the outcome in terms of its expression list. An exception to this is the WITH clause, which permits local view declarations which can be used in its outcome SQL statement. These declarations are known as Common Table Expressions (CTE’s), and its syntax in recursive WITH queries can be simplified as: WITH RECURSIVE R1 AS SQL1 , -- CTE1 ..., R𝑛 AS SQL𝑛 -- CTE𝑛 SQL -- Outcome SQL where each R𝑖 is a local, temporary view name defined by the SQL query SQL𝑖 , and can be referenced only in the context of each SQL𝑖 and SQL , which is the query that builds the outcome. Each declaration R𝑖 AS SQL𝑖 is a CTE. Whilst a local view declaration is similar to a CREATE VIEW declaration, its scope is restricted as already noted, and its lifetime is the same as for SQL . Though CTE’s are used for a number of reasons (divide-and-conquer, readability, creating views for which the user has no rights, . . . ), we are interested in the use that enables SQL as a Turing complete language: by specifying recursive queries. As an example, the following defines natural numbers: WITH RECURSIVE nat(n) AS (SELECT 0 UNION ALL SELECT n+1 FROM nat) SELECT n FROM nat; where the keyword RECURSIVE is required by the standard to specify a (potentially) recursive CTE, the first SELECT is the base case (simply returning the tuple (0) as a FROM-less statement), which is unionised with the inductive case (second SELECT). This single CTE is used in the outcome SQL (third SELECT) to deliver all the naturals. Query semantics can be understood as finding its fixpoint for the application of the unionised queries on an enlarging relation nat, which is increased with a new number in each fixpoint-search iteration. In this case, the fixpoint would be infinite, though top-𝑁 queries can limit the number of retrieved tuples. Current SQL systems include limitations on recursive queries: mutual and non-linear recursion are not allowed, only one recursive case can be set, and duplicate elimination is not allowed in the union of the base and inductive cases. These limitations can be overcome with other database languages such as Datalog operating over the same relational data, a language capable of expressing mutual and non-linear recursion, and duplicates [10, 11]. However, local (temporary) view definitions as needed to represent the behavior of a WITH clause are not part of a Datalog language, which only allows for non-volatile predicates. 8 This work was supported by the State Research Agency (AEI) of the Spanish Ministry of Science and Innovation under grant PID2019-104735RB-C42 (SAFER). 55 Fernando Sáenz-Pérez CEUR Workshop Proceedings 54–68 Here is where intuitionistic logic programming (ILP) can be of help because it adds embedded implications in the body of clauses. Intuitively, in an embedded implication 𝐴 ⇒ 𝐵, the atom 𝐴 is assumed during proving 𝐵 [12], limiting its scope to this proof. A database language based on ILP is Hypothetical Datalog [8], which is extended in [9, 13, 14] with rules (not only atoms) in such assumptions, a feature that can be leveraged to represent those SQL local view definitions. Next section recalls this feature implemented in the DES system, which also includes essential SQL functionalities such as handling duplicates and aggregates. 2. Intuitionistic Logic Programming for SQL The Hypothetical Datalog language in [9] implements a system based on intuitionistic logic program- ming. Here, its syntax and semantics are extended with conditions in expressions and meta-predicates for supporting common SQL needs: top-𝑁 queries (i.e., at most, 𝑁 solutions to its goal are allowed), aggregates, and expressions in comparison predicates. Next, 𝑣𝑎𝑟𝑠(𝑇 ) is the set of variables in 𝑇 . Definition 2.1 (Syntax). 𝜌 := 𝐴 | 𝐴 ← 𝜑1 ∧ . . . ∧ 𝜑𝑛 𝜑 := 𝐴 | ¬𝐴 | 𝐸1 ◇ 𝐸2 | 𝛿(𝐴) | 𝜏 (𝑁, 𝐴) | 𝒢(𝐴, 𝑋𝑠, 𝜑) | 𝜌1 ∧ . . . ∧ 𝜌𝑛 ⇒ 𝜑 where 𝜌 stands for rule, 𝜑 for goal, 𝐴 for atom (possibly containing variables and constants, but no compound terms), 𝑁 is a natural number, 𝑋𝑠 is a list of variables, 𝐸𝑖 are expressions, ◇ is a comparison operator (=, <, ≤. . . ), symbols ¬, 𝛿, 𝜏 and 𝒢 are meta-predicates, and 𝑣𝑎𝑟𝑠(𝜌𝑖 ) do not occur out of 𝜌𝑖 . Here, the symbol ⇒ denotes the embedded implication (it builds a goal that can occur in the body of a rule), and 𝛿 is a duplicate elimination meta-predicate. The first condition (on compound terms) is needed for ensuring a finite fixpoint in absence of infinite relations, and the second one (on variables of rules 𝜌𝑖 ) ensures that 𝜌𝑖 is not specialized by any substitution out of it along inferencing, so 𝜌𝑖 can be assumed “as is” for any inference step, which is an adequate requirement for modelling local definitions in SQL WITH statements (however, it represents a limitation with respect to [12]). Symbols ¬, 𝛿, 𝜏 and 𝒢 are meta-predicates, standing respectively for negation, duplicate elimination, top-𝑁 queries, and group-by (for solving aggregations as counts and summations, which can occur in expressions as other scalar functions). A group-by goal 𝒢(𝐴, 𝑋𝑠, 𝜑) means that ground instances of 𝐴 are grouped by the list of variables 𝑋𝑠 (grouping criteria) and, for each group, there is at most one instantation such that it fulfils 𝜑. Next, we extend the inference system in [9] with inference rules for comparison operators, top-𝑁 and group-by goals. The well-known concepts of predicate dependency graph (PDG) and stratifiable ¬ database Δ [15] are used here (see also page 298 in [9]). Here, the PDG includes a negative arc 𝑝 ← 𝑞 between the head predicate 𝑝 of a rule and a predicate 𝑞 in its body, such that 𝑞 occurs in an argument ¬ of a meta-predicate ¬, 𝜏 or 𝒢; otherwise, it includes a positive arc 𝑝 ← 𝑞. A negative arc 𝑝 ← 𝑞 imposes that 𝑞 must be solved before 𝑝 for avoiding non-monotonicity [15]. 𝑠𝑡𝑟(Δ, 𝑃 ) is the stratum corresponding to predicate 𝑃 in Δ [9] such that for a positive arc p←q, 𝑠𝑡𝑟(Δ, p) ≥ 𝑠𝑡𝑟(Δ, q), and for ¬ a negative arc p←q, 𝑠𝑡𝑟(Δ, p) > 𝑠𝑡𝑟(Δ, q). Δ ⊢ 𝜓 is an inference expression, where 𝜓 is an identified ground atom 𝑖𝑑 : 𝐴. Such an 𝑖𝑑 is of the form 𝑖𝑑1 · · · · · 𝑖𝑑𝑛 (𝑛 ≥ 1) and helps data provenance for supporting duplicates. 𝑝𝑟𝑒𝑑(𝐴) is the predicate of atom 𝐴. A negative inference expression is of the form Δ ⊢ 𝑖𝑑 : −𝐴. Negative inference expressions play a key role in the SQL extension for negative assumptions, but are out of the scope of this paper. Next definition specifies an inference system where each rule is read as: If the formulas above the line can be inferred, then the ones below the line can be inferred as well. The application of an inference rule is an inference step. Definition 2.2 (Inference system). Given a database Δ and a set of input inference expressions ℰ, the inference system associated to the 𝑠-th stratum is defined as follows, where 𝑑𝑠 (ℰ) is a closure operator that denotes the set of inference expressions derivable in this system: Axioms: 56 Fernando Sáenz-Pérez CEUR Workshop Proceedings 54–68 • Δ ⊢ 𝑖𝑑 : 𝐴 is an axiom for each (ground) atomic formula 𝑖𝑑 : 𝐴 in Δ, where 𝑠𝑡𝑟(Δ, 𝑝𝑟𝑒𝑑(𝐴)) = 𝑠. • Δ ⊢ 𝑐1 ◇ 𝑐2 is an axiom, where ◇ ∈ {=, <, >, ≤, ≥}, 𝑐𝑖 are constants, and 𝑐1 ◇ 𝑐2 holds. • Each expression in ℰ is an axiom. Inference Rules: • For any rule 𝐴 ← 𝜑1 ∧ . . . ∧ 𝜑𝑛 with identifier 𝑖𝑑 in Δ, where 𝑠𝑡𝑟(Δ, 𝑝𝑟𝑒𝑑(𝐴)) = 𝑠 and for any ground substitution 𝜃: Δ ⊢ 𝑖𝑑𝑖 : 𝜑𝑖 𝜃 for each 𝑖 Δ ⊢ 𝑖𝑑′ : 𝐴𝜃 (Clause) where 𝑖𝑑′ is the composition of identifiers 𝑖𝑑 · 𝑖𝑑1 · · · 𝑖𝑑𝑛 . • For any constants 𝑐𝑖 and expressions 𝑒𝑖 : Δ ⊢ 𝑐1 ◇ 𝑐2 Δ ⊢ 𝑒1 𝜃 ◇ 𝑒2 𝜃 (Expression) where ◇ ∈ {=, <, >, ≤, ≥}, and 𝑒𝑖 𝜃 evaluates to 𝑐𝑖 . • For any atoms 𝐴, 𝐴′ : Δ ⊢ 𝑖𝑑 : 𝐴 Δ ⊢ 𝛼 : 𝛿(𝐴′ 𝜃) (Duplicates) where 𝐴 = 𝐴′ 𝜃, 𝛼 is a single, unique identifier. • For any atoms 𝐴𝑖 , 𝐴, natural number 𝑁 , and ground substitutions 𝜃𝑖 : Δ ⊢ 𝑖𝑑𝑖 : 𝐴𝑖 Δ ⊢ 𝛼𝑖 : 𝜏 (𝑁, 𝐴𝜃𝑖 ) (Top) where 𝐴𝑖 = 𝐴𝜃𝑖 , 𝛼𝑖 are at most 𝑁 single, unique identifiers. • For any atoms 𝐴𝑖 , 𝐴, 𝐵𝑗 , 𝐵 and ground substitutions 𝜃𝑖 , 𝜃: Δ ⊢ 𝑖𝑑𝑖 : 𝐴𝑖 Δ ⊢ 𝑖𝑑𝑗 : 𝐵𝑗 Δ ⊢ 𝛼𝑖 : 𝒢(𝐴𝜃𝑖 , 𝑋𝑠𝜃𝑖 , 𝐵𝜃𝑖 ) (Group-By) where each 𝛼𝑖 is a unique identifier for each group 𝑋𝑠𝜃𝑖 , and 𝐴 = 𝐴𝑖 𝜃, 𝐵 = 𝐵𝑗 𝜃. • For any goal 𝜑: Δ ∪ {𝜌1 , . . . , 𝜌𝑛 } ⊢ 𝜑 Δ ⊢ 𝜌1 ∧ . . . ∧ 𝜌𝑛 ⇒ 𝜑 (Assumption) Example 1. Natural numbers can be specified in this language as follows: 𝜑1 ≡ (𝑛𝑎𝑡(0) ∧ (𝑛𝑎𝑡(𝑋) ← 𝑛𝑎𝑡(𝑌 ) ∧ 𝑋 = 𝑌 + 1)) ⇒ 𝑛𝑎𝑡(𝑋) Starting with Δ = ∅, and applying the Assumption rule: {(𝑛𝑎𝑡(0)), (𝑛𝑎𝑡(𝑋) ← 𝑛𝑎𝑡(𝑌 ) ∧ 𝑋 = 𝑌 + 1)} ⊢ 𝑛𝑎𝑡(𝑋) ∅ ⊢ (𝑛𝑎𝑡(0) ∧ (𝑛𝑎𝑡(𝑋) ← 𝑛𝑎𝑡(𝑌 ) ∧ 𝑋 = 𝑌 + 1)) ⇒ 𝑛𝑎𝑡(𝑋) Δ, originally empty, has been enlarged with the two rules in the assumption (let us call it Δ′ ), and each one receives a unique identifier (as any other possible rule in the program). For this example, let 𝑟1 and 𝑟2 be the respective identifiers for 𝑛𝑎𝑡(0) and 𝑛𝑎𝑡(𝑋) ← 𝑛𝑎𝑡(𝑌 ) ∧ 𝑋 = 𝑌 + 1. In a second inference step, we can apply the Clause inference rule, giving: Δ′ ⊢ 𝑟1 : 𝑛𝑎𝑡(𝑋){𝑋 ↦→ 0} ≡ Δ′ ⊢ 𝑟1 : 𝑛𝑎𝑡(0), where 𝜃 = {𝑋 ↦→ 0}, and Δ′ ⊢ 𝑟1 : 𝑛𝑎𝑡(0) represents a first solution of the query. This can be read as: the atom 𝑛𝑎𝑡(0) identified by 𝑟1 has been inferred for the context Δ′ . Next, using 𝑟2 in Clause for some 𝑖𝑑 after Assumption yields: Δ′ ⊢ 𝑖𝑑 : 𝑛𝑎𝑡(𝑌 )𝜃1 Δ′ ⊢ (𝑋 = 𝑌 + 1)𝜃1 Δ′ ⊢ 𝑟2 · 𝑖𝑑 : 𝑛𝑎𝑡(𝑋)𝜃1 57 Fernando Sáenz-Pérez CEUR Workshop Proceedings 54–68 where axioms above the line can be inferred: First, the axiom Δ′ ⊢ 𝑟1 : 𝑛𝑎𝑡(0) (with 𝜃 = {𝑌 ↦→ 0}) as before, and the second one because it becomes a trivial axiom Δ′ ⊢ 1 = 1 after applying Expression (i.e., the evaluated 𝑋 = 0 + 1 for {𝑋 ↦→ 1}). So 𝜃1 = {𝑌 ↦→ 0, 𝑋 ↦→ 1}, and the axiom Δ′ ⊢ 𝑟2 · 𝑟1 : 𝑛𝑎𝑡(1) can be inferred. Proceeding in the same way, the following axioms are inferred with 𝑘 > 0: Δ′ ⊢ 𝑟2 · · · 𝑟2 ·𝑟1 : 𝑛𝑎𝑡(𝑘) ⏟ ⏞ 𝑘 □ Negative information is deduced by applying the closed-world assumption to a set of inference expressions ℰ (written as 𝑐𝑤𝑎(ℰ)) as the union of ℰ and the negative inference expression for Δ ⊢ 𝜑 such that Δ ⊢ 𝜑 ∈ / ℰ. Thus, a stratified bottom-up construction of the unified stratified semantics can be specified by: • ℰ0 = ∅ • ℰ 𝑠+1 = 𝑐𝑤𝑎(𝑑𝑠+1 (ℰ 𝑠 )) for 𝑠 ≥ 0. which builds a set of axioms ℰ that provides a way to assign a meaning to a goal 𝜑 with: 𝑠𝑜𝑙𝑣𝑒(𝜑, ℰ) = {Δ ⊢ 𝑖𝑑 : 𝜓 ∈ ℰ | 𝜑𝜃 = 𝜓} where 𝜃 is a substitution and each axiom in ℰ is mapped to the database Δ it was deduced for, and the inferred fact 𝜓 is labelled with its data source 𝑖𝑑 (for supporting duplicates). We use Δ(ℰ) to denote the multiset of facts 𝜓 so that Δ ⊢ 𝑖𝑑 : 𝜓 ∈ ℰ for any 𝑖𝑑. In Example 1, there is only one stratum, so: 𝑠𝑜𝑙𝑣𝑒(𝜑1 , ℰ) = {Δ′ ⊢ 𝑟2 · · · 𝑟2 ·𝑟1 : 𝑛𝑎𝑡(𝑘) | 𝑘 ≥ 0} = ℰ𝜑1 ⏟ ⏞ 𝑘 and the outcome would be the multiset Δ(ℰ𝜑1 ) = {𝑛𝑎𝑡(𝑘) | 𝑘 ≥ 0}. A top-10 goal for this example would restrict the number of solutions as: Δ(ℰ𝜏 (10,𝜑1 ) ) = {𝑛𝑎𝑡(𝑘) | 0 ≤ 𝑘 ≤ 9}, which corresponds to modifying the outcome SQL query in the WITH statement by SELECT TOP 10 n FROM nat. The stratified construction of inference expressions captures the requirements of SQL clauses such as EXCEPT, NOT IN, and GROUP BY, which are sources of negative arcs in the construction of a predicate dependency graph (PDG) [16]. 3. Translating SQL to Hypothetical Datalog Here, the function SQL_to_DL is defined by extending Definition 6 in [9] with top-𝑁 queries, aggregates and expressions. This function takes a relation name and an SQL statement as input and returns a multiset of Hypothetical Datalog rules providing the same meaning as the SQL relation for a corresponding predicate with the same name as the relation. From here on: Without loss of expressivity, we assume that a group-by statement includes a relation name in its FROM clause; set-related operators and symbols refer to multisets because SQL relations can contain duplicates; each attribute is qualified by the relation name it corresponds to so that they can be uniquely identified; an empty conjunctive goal is 𝑡𝑟𝑢𝑒; where convenient, variables of goals are made explicit; and for the sake of conciseness, Hypothetical Datalog is simply referred to as Datalog. Definition 3.1. The function SQL_to_DL takes a relation name and an SQL statement as input and returns a Datalog program as defined by cases as follows: % Basic SELECT statement SQL_to_DL (𝑟, SELECT 𝐸𝑥𝑝𝑠 FROM 𝑅𝑒𝑙 WHERE 𝐶𝑜𝑛𝑑) = { 𝑟(𝑋) ← 𝐷𝐿𝑅𝑒𝑙(𝑌 ) ∧ 𝐷𝐿𝐸𝑥𝑝𝑠(𝑋) ∧ 𝐷𝐿𝐶𝑜𝑛𝑑(𝑍) } 𝑅𝑒𝑙𝑅𝑢𝑙𝑒𝑠 𝐸𝑥𝑝𝑅𝑢𝑙𝑒𝑠 𝐶𝑜𝑛𝑑𝑅𝑢𝑙𝑒𝑠, ⋃︀ ⋃︀ ⋃︀ where SQLREL_to_DL (𝑅𝑒𝑙) = (𝐷𝐿𝑅𝑒𝑙(𝑌 ), 𝑅𝑒𝑙𝑅𝑢𝑙𝑒𝑠), SQLEXPS_to_DL (𝐸𝑥𝑝𝑠) = 58 Fernando Sáenz-Pérez CEUR Workshop Proceedings 54–68 (𝐷𝐿𝐸𝑥𝑝𝑠(𝑋), 𝑡𝑟𝑢𝑒, 𝐸𝑥𝑝𝑅𝑢𝑙𝑒𝑠), and SQLCOND_to_DL (𝐶𝑜𝑛𝑑) = (𝐷𝐿𝐶𝑜𝑛𝑑(𝑍), 𝐶𝑜𝑛𝑑𝑅𝑢𝑙𝑒𝑠). Here, each E𝑖 in the list 𝐸𝑥𝑝𝑠 is either an expression or an argument name present in the relation 𝑅𝑒𝑙 with corresponding logic variable 𝑋𝑖 ∈ 𝑋. 𝑅𝑒𝑙 is either a single defined relation (table or view), or a join of relations, or an SQL statement. Function SQLREL_to_DL (respectively, SQLCOND_to_DL ) takes an SQL relation (respectively, condition) and returns a goal with as many variables 𝑌 as arguments of 𝑅𝑒𝑙 and, possibly, additional rules which result from the translation (and similarly for SQLEXPS_to_DL (𝐸𝑥𝑝𝑠)). Variables 𝑍 ⊆ 𝑌 come as a result of the translation of the condition 𝐷𝐿𝐶𝑜𝑛𝑑 to a goal. % GROUP BY statement SQL_to_DL (𝑟, SELECT 𝐸𝑥𝑝𝑠 FROM 𝑅𝑒𝑙 GROUP BY 𝐶𝑜𝑙𝑠) = { 𝑟(𝑋) ← 𝐷𝐿𝐸𝑥𝑝𝑠 ∧ 𝒢(𝐷𝐿𝑅𝑒𝑙(𝑌 ), 𝑉 , 𝐷𝐿𝐴𝑔𝑔) } 𝐸𝑥𝑝𝑠𝑅𝑢𝑙𝑒𝑠 𝑅𝑒𝑙𝑅𝑢𝑙𝑒𝑠, ⋃︀ ⋃︀ where SQLEXPS_to_DL (𝐸𝑥𝑝𝑠) = (𝐷𝐿𝐸𝑥𝑝𝑠, 𝐷𝐿𝐴𝑔𝑔, 𝐸𝑥𝑝𝑠𝑅𝑢𝑙𝑒𝑠), SQLREL_to_DL (𝑅𝑒𝑙) = (𝐷𝐿𝑅𝑒𝑙(𝑌 ), 𝑅𝑒𝑙𝑅𝑢𝑙𝑒𝑠). Here, 𝑉 ⊆ 𝑌 are the variables of the predicate 𝐷𝐿𝑅𝑒𝑙 corresponding to the attributes 𝐶𝑜𝑙𝑠 of the relation 𝑅𝑒𝑙. % Top-𝑁 SQL_to_DL (𝑟, SELECT TOP N 𝐸𝑥𝑝𝑠 FROM 𝑅𝑒𝑙 WHERE 𝐶𝑜𝑛𝑑) = { 𝑟(𝑋) ← 𝜏 (𝑁, 𝑟 ′ (𝑋)) } ⋃︀ 𝑅𝑢𝑙𝑒𝑠, where SQL_to_DL (𝑟′ , SELECT 𝐸𝑥𝑝𝑠 FROM 𝑅𝑒𝑙 WHERE 𝐶𝑜𝑛𝑑) = 𝑅𝑢𝑙𝑒𝑠 % Union SQL_to_DL (𝑟, SQL1 UNION ALL SQL2 ) = SQL_to_DL (𝑟, SQL1 ) SQL_to_DL (𝑟, SQL2 ) ⋃︀ % Difference SQL_to_DL (𝑟, SQL1 EXCEPT SQL2 ) = { 𝑟(𝑋) ← 𝑠(𝑋) ∧ ¬𝑡(𝑋) } SQL_to_DL (𝑠, SQL1 ) SQL_to_DL (𝑡, SQL2 ) ⋃︀ ⋃︀ % Intersection SQL_to_DL (𝑟, SQL1 INTERSECT SQL2 ) = { 𝑟(𝑋) ← 𝑠(𝑋) ∧ 𝑡(𝑋) } SQL_to_DL (𝑠, SQL1 ) SQL_to_DL (𝑡, SQL2 ) ⋃︀ ⋃︀ % WITH statement SQL_to_DL (𝑟, WITH 𝑟1 AS SQL1 , ..., 𝑟𝑛 AS SQL𝑛 SQL ) = { 𝑟(𝑋) ← ({SQL_to_DL (𝑟𝑖 , SQL 𝑖 ) | 1 ≤ 𝑖 ≤ 𝑛}) ⇒ 𝑠(𝑋) } SQL_to_DL (𝑠, SQL ) ⋀︀ ⋃︀ where (𝐵𝑎𝑔) denotes 𝐵1 ∧ · · · ∧ 𝐵𝑚 (𝐵𝑖 ∈ 𝐵𝑎𝑔). ⋀︀ Next three definitions specify the functions SQLREL_to_DL , SQLEXPS_to_DL and SQLCOND_to_DL which are used by SQL_to_DL . Definition 3.2. The function SQLREL_to_DL takes an SQL relation (either a relation name, or a join of relations, or an SQL statement) as input and returns both a Datalog goal and program as defined by cases as follows: % Extensional/Intensional relation name SQLREL_to_DL (𝑅𝑒𝑙𝑁 𝑎𝑚𝑒) = (𝑅𝑒𝑙𝑁 𝑎𝑚𝑒(𝑋), {}) where 𝑋 are the 𝑛 variables corresponding to the 𝑛-ary relation 𝑅𝑒𝑙𝑁 𝑎𝑚𝑒. % Join of relations SQLREL_to_DL ((𝑅𝑒𝑙1 , . . . , 𝑅𝑒𝑙𝑛 )) = (𝐷𝐿𝑅𝑒𝑙(𝑋1 )∧· · ·∧𝐷𝐿𝑅𝑒𝑙(𝑋𝑛 ), 𝑅𝑒𝑙𝑅𝑢𝑙𝑒𝑠1 ∪· · ·∪𝑅𝑒𝑙𝑅𝑢𝑙𝑒𝑠𝑛 ) where SQLREL_to_DL (𝑅𝑒𝑙𝑖 ) = (𝐷𝐿𝑅𝑒𝑙(𝑋𝑖 ), 𝑅𝑒𝑙𝑅𝑢𝑙𝑒𝑠𝑖 ). % SQL statement SQLREL_to_DL (𝑆𝑄𝐿) = (𝑅𝑒𝑙𝑁 𝑎𝑚𝑒(𝑋), SQL_to_DL (𝑅𝑒𝑙𝑁 𝑎𝑚𝑒, 𝑆𝑄𝐿)) where 𝑋 are the 𝑛 variables corresponding to the 𝑛-ary statement 𝑆𝑄𝐿, and 𝑅𝑒𝑙𝑁 𝑎𝑚𝑒 is a fresh relation name. 59 Fernando Sáenz-Pérez CEUR Workshop Proceedings 54–68 Definition 3.3. The function SQLEXPS_to_DL takes a sequence of SQL expressions as input and returns a Datalog conjunctive goal for expressions, another for aggregates, and a program, as defined by cases as follows: % Sequence of expressions ⋀︀ ⋀︀ SQLEXPS_to_DL ((𝐸𝑥𝑝1 , . . . , 𝐸𝑥𝑝𝑛 )) = ( {𝐷𝐿𝐸𝑥𝑝𝑖 | 1 ≤ 𝑖 ≤ 𝑛}, {𝐷𝐿𝐴𝑔𝑔𝑖 | 1 ≤ 𝑖 ≤ 𝑛}, 𝐸𝑥𝑝𝑅𝑢𝑙𝑒𝑠1 ∪ · · · ∪ 𝐸𝑥𝑝𝑅𝑢𝑙𝑒𝑠𝑛 ), where SQLEXPS_to_DL (𝐸𝑥𝑝𝑖 ) = (𝐷𝐿𝐸𝑥𝑝𝑖 , 𝐷𝐿𝐴𝑔𝑔𝑖 , 𝐸𝑥𝑝𝑅𝑢𝑙𝑒𝑠𝑖 ). % Expression SQLEXPS_to_DL (𝐸𝑥𝑝) = ((𝑋 = 𝐷𝐿𝐸𝑥𝑝) ∧ 𝐷𝐿𝑄𝑠, 𝐷𝐿𝐴𝑔𝑔, 𝐸𝑥𝑝𝑅𝑢𝑙𝑒𝑠), where 𝑋 is the variable corresponding to the expression 𝐸𝑥𝑝 at position 𝑝 in the list of SQL expressions (𝐸𝑥𝑝1 , . . . , 𝐸𝑥𝑝𝑛 ), and 𝐷𝐿𝐸𝑥𝑝 is the result of: 1) replacing each attribute 𝑎𝑖 of a relation 𝑟 in 𝐸𝑥𝑝 by the respective variable 𝑋𝑖 of the predicate 𝑟; 2) replacing each aggregate function 𝑓 over an attribute 𝑏𝑗 of the relation 𝑠 by a fresh variable 𝑌𝑗 . For each 𝑓 out of 𝑛, 𝐷𝐿𝐴𝑔𝑔 includes a conjunctive goal 𝑌𝑗 = 𝑓 (𝑍𝑗 ), where 𝑍𝑗 is the variable of 𝑠(𝑍) which corresponds to the attribute 𝑏𝑗 ; and 3) replacing each query 𝑄 with alias 𝑎 by a fresh variable 𝑈 such that SQL_to_DL (𝑎, 𝑄) = (𝑄𝑅𝑢𝑙𝑒𝑠), 𝐸𝑥𝑝𝑅𝑢𝑙𝑒𝑠 contains 𝑄𝑅𝑢𝑙𝑒𝑠, and 𝐷𝐿𝑄𝑠 includes a conjunctive goal 𝑎(𝑈 ). Definition 3.4. The function SQLCOND_to_DL takes an SQL condition as input and returns a Datalog conjunctive goal and a program, as defined by cases as follows: % Basic condition SQLCOND_to_DL (𝐸𝑥𝑝1 𝑐𝑜𝑝 𝐸𝑥𝑝2 ) = ((𝑋1 𝑐𝑜𝑝′ 𝑋2 ) {(𝑋𝑖 = 𝐷𝐿𝐸𝑥𝑝𝑖 ) ∧ 𝐷𝐿𝑄𝑠𝑖 | 1 ≤ 𝑖 ≤ ⋀︀ 2}, 𝐸𝑥𝑝𝑅𝑢𝑙𝑒𝑠1 ∪ 𝐸𝑥𝑝𝑅𝑢𝑙𝑒𝑠2 ), where SQLEXPS_to_DL (𝐸𝑥𝑝𝑖 ) = ((𝑋𝑖 = 𝐷𝐿𝐸𝑥𝑝𝑖 ) ∧ 𝐷𝐿𝑄𝑠𝑖 , 𝑡𝑟𝑢𝑒, 𝐸𝑥𝑝𝑅𝑢𝑙𝑒𝑠𝑖 ), 1 ≤ 𝑖 ≤ 2, and 𝑐𝑜𝑝′ is the Datalog corresponding comparison operator to 𝑐𝑜𝑝. % Conjunction SQLCOND_to_DL (𝐶1 AND 𝐶2 ) = (𝐶1′ ∧ 𝐶2′ , 𝐶𝑅𝑢𝑙𝑒𝑠1 ∪ 𝐶𝑅𝑢𝑙𝑒𝑠2 ) where 𝐶𝑖 (1 ≤ 𝑖 ≤ 2) are conditions and SQLCOND_to_DL (𝐶𝑖 )=(𝐶𝑖′ , 𝐶𝑅𝑢𝑙𝑒𝑠𝑖 ). % Disjunction SQLCOND_to_DL (𝐶1 OR 𝐶2 ) = (𝛿(𝑑), {𝑑 ← 𝐶1′ , 𝑑 ← 𝐶2′ } ∪ 𝐶𝑅𝑢𝑙𝑒𝑠1 ∪ 𝐶𝑅𝑢𝑙𝑒𝑠2 ) where 𝐶𝑖 (1 ≤ 𝑖 ≤ 2) are conditions, SQLCOND_to_DL (𝐶𝑖 )=(𝐶𝑖′ , 𝐶𝑅𝑢𝑙𝑒𝑠𝑖 ), and 𝑑 is a fresh atom. % Negated condition SQLCOND_to_DL (NOT 𝐶 ) = (𝐶 ′ , 𝐶𝑅𝑢𝑙𝑒𝑠) where 𝐶 ′ =SQLCOND_to_DL (𝐶)=(𝐶 ′ , 𝐶𝑅𝑢𝑙𝑒𝑠), and 𝐶 represents the logical complement of 𝐶 (𝑋 < 𝑌 = 𝑋 ≥ 𝑌 , 𝐶1 ∧ 𝐶2 = 𝐶1 ∨ 𝐶2 , and so on). Note that disjunction requires duplicate elimination because, otherwise, there can be more tuples in the result because a disjunction in a rule is rewritten as two rules, as the next example illustrates: Example 2. Consider the SQL query SELECT SQRT(2) FROM dual WHERE TRUE OR TRUE defining a relation 𝑟. The relation dual, as originally defined by Oracle, contains a single row and a single column, and it was included as a predefined relation to allow for calculations such as this example. DES includes this as a propositional relation instead (there is no need for an argument because it is not intended to be used). Translating this query using Definition 3.1-Basic SELECT statement, where 𝑅𝑒𝑙 = dual, 𝐸𝑥𝑝𝑠 = SQRT(2) and 𝐶𝑜𝑛𝑑 = TRUE OR TRUE results in: SQL_to_DL (𝑟, SELECT SQRT(2) FROM dual WHERE TRUE OR TRUE) = { 𝑟(𝑋) ← 𝐷𝐿𝑅𝑒𝑙 ∧ 𝐷𝐿𝐸𝑥𝑝𝑠(𝑋) ∧ 𝐷𝐿𝐶𝑜𝑛𝑑 } 𝑅𝑒𝑙𝑅𝑢𝑙𝑒𝑠 𝐸𝑥𝑝𝑅𝑢𝑙𝑒𝑠 𝐶𝑜𝑛𝑑𝑅𝑢𝑙𝑒𝑠 = ⋃︀ ⋃︀ ⋃︀ { 𝑟(𝑋) ← 𝑑𝑢𝑎𝑙 ∧ 𝑋 = 𝑠𝑞𝑟𝑡(2) ∧ 𝛿(𝑑), 𝑑 ← 𝑡𝑟𝑢𝑒, 𝑑 ← 𝑡𝑟𝑢𝑒 } (rules with respective identifiers 𝑟1 , 𝑟2 and 𝑟3 ), because following Definition 3.2-Extensional/Intensional relation name: SQLREL_to_DL (dual) = (𝐷𝐿𝑅𝑒𝑙, 𝑅𝑒𝑙𝑅𝑢𝑙𝑒𝑠) = (𝑑𝑢𝑎𝑙, {}), following Defini- tion 3.3-Expression: SQLEXPS_to_DL (SQRT(2)) = (𝐷𝐿𝐸𝑥𝑝, 𝐷𝐿𝐴𝑔𝑔, 𝐸𝑥𝑝𝑠𝑅𝑢𝑙𝑒𝑠) = ((𝑋 = 60 Fernando Sáenz-Pérez CEUR Workshop Proceedings 54–68 𝑠𝑞𝑟𝑡(2)), 𝑡𝑟𝑢𝑒, {}), and following Definition 3.4-Disjunction: SQLCOND_to_DL (TRUE OR TRUE) = (𝐷𝐿𝐶𝑜𝑛𝑑, 𝐶𝑜𝑛𝑑𝑅𝑢𝑙𝑒𝑠) = (𝛿(𝑑), {𝑑 ← 𝑡𝑟𝑢𝑒, 𝑑 ← 𝑡𝑟𝑢𝑒}). Thus, 𝑠𝑜𝑙𝑣𝑒(𝑟(𝑋), ℰ) = {Δ ⊢ 𝑟1 · 𝛼 : 𝑟(1.4142)} (where 𝛼 is the fresh identifier introduced by Definition 2.2-Duplicates). Without duplicate elimination (goal 𝑑 instead of 𝛿(𝑑) in rule 𝑟1 ): 𝑠𝑜𝑙𝑣𝑒(𝑟(𝑋), ℰ) = {Δ ⊢ 𝑟1 · 𝑟2 : 𝑟(1.4142), Δ ⊢ 𝑟1 · 𝑟3 : 𝑟(1.4142)}, which is not the expected outcome for the SQL query. The actual system applies other techniques to simplify the translated program, such as partial evaluation and folding/unfolding [17], which are not presented here. In this case, the resulting translation is simply {𝑟(1.4142)}, as it can be checked in the system by displaying compilations with the command /show_compilations on. Observe also that the system emits the warning for this example: [Sem] Tautological condition: (true OR true), which is one of the results of its semantic analysis [6], indicating a superfluous condition. □ Next, the translation of a recursive WITH statement is illustrated. Example 3. Consider the SQL query in Subsection 1.1 defining natural numbers. Translating this query follows Definition 3.1-WITH statement, where 𝑟1 = nat, SQL 1 = SELECT 0 UNION ALL SELECT n+1 FROM nat, and SQL = SELECT n FROM nat: SQL_to_DL (𝑟, WITH RECURSIVE nat(n) AS (SELECT 0 UNION ALL SELECT n+1 FROM nat) SELECT n FROM nat ) = { 𝑟(𝑋) ← ({SQL_to_DL (𝑟𝑖 , SQL 𝑖 ) | 1 ≤ 𝑖 ≤ 𝑛}) ⇒ 𝑠(𝑋) } SQL_to_DL (𝑠, SQL ) = ⋀︀ ⋃︀ {(𝑟(𝑋) ← ((𝑛𝑎𝑡(𝑋1 ) ← 𝑑𝑢𝑎𝑙 ∧ 𝑋1 = 0 ∧ 𝑡𝑟𝑢𝑒) ∧ (𝑛𝑎𝑡(𝑋2 ) ← 𝑛𝑎𝑡(𝑌 ) ∧ 𝑋2 = 𝑌 + 1 ∧ 𝑡𝑟𝑢𝑒) ⇒ 𝑠(𝑋))), (𝑠(𝑋3 ) ← 𝑛𝑎𝑡(𝑍) ∧ 𝑋3 = 𝑍 ∧ 𝑡𝑟𝑢𝑒)} because, by Definition 3.1-Union: SQL_to_DL (𝑛𝑎𝑡, SELECT 0 UNION ALL SELECT n+1 FROM nat) = SQL_to_DL (𝑛𝑎𝑡, SELECT 0) ∪ SQL_to_DL (𝑛𝑎𝑡, SELECT n+1 FROM nat) and, by Definition 3.1-Basic SQL statement, where SELECT 0 is an abbreviation for SELECT 0 FROM dual WHERE TRUE, the following is get by performing similar steps to Example 2: SQL_to_DL (𝑛𝑎𝑡, SELECT 0) = { 𝑛𝑎𝑡(𝑋1 ) ← 𝑑𝑢𝑎𝑙 ∧ 𝑋1 = 0 ∧ 𝑡𝑟𝑢𝑒 }, and applying again the same definition to: SQL_to_DL (𝑛𝑎𝑡, SELECT n+1 FROM nat) = { 𝑛𝑎𝑡(𝑋2 ) ← 𝑛𝑎𝑡(𝑌 ) ∧ 𝑋2 = 𝑌 + 1 ∧ 𝑡𝑟𝑢𝑒 }, and: SQL_to_DL (𝑠, SELECT n FROM nat) = { 𝑠(𝑋3 ) ← 𝑛𝑎𝑡(𝑍) ∧ 𝑋3 = 𝑍 ∧ 𝑡𝑟𝑢𝑒 }. Further simplification and unfolding steps done in the system results in a single rule: {(𝑟(𝑋) ← (𝑛𝑎𝑡(0)) ∧ (𝑛𝑎𝑡(𝑌 ) ← 𝑛𝑎𝑡(𝑍) ∧ 𝑌 = 𝑍 + 1) ⇒ 𝑛𝑎𝑡(𝑋))}. □ [9] provides a theorem for the semantic equivalence of an SQL relation and its counterpart Datalog translation for a subset of the language presented here. An SQL relation is translated into an extended relational algebra relation (ERA) definition (with SQL_to_ERA ), for which an interpretation for a com- puted answer in terms of the ERA operators is provided. The Datalog program is interpreted with the unified stratified semantics, which provides the set of axioms ℰ for the original database Δ augmented with the translation of the query 𝑄 to this Datalog program. So, the proof inductively proceeds by strata, and for each stratum by cases, showing the semantic equivalence of the translation from the SQL query 𝑄 into an ERA expression for a relation 𝑟 (with SQL_to_ERA ), and the translation of 𝑄 into a Datalog program (with SQL_to_DL ). 4. SQL Puzzles This section provides several SQL puzzles9 provided to students and solved in DES, mainly with its on-line interface DESweb. We use the concrete SQL syntax that DES supports, allowing for omitting the verbose RECURSIVE modifier which, though promoted and required by the standard for recursive queries, it seems to be an unneeded writing burden (a simple program analysis can classify recursive predicates and apply both specific compilations and optimizations). Also, UNION can be used instead of UNION ALL if duplicates are not required. Even when it might seem that discarding duplicates results in a solving overhead, it turns out to be the opposite for recursive queries [18]. For each puzzle, its problem statement is given, together with an equivalent Datalog formulation, and outcome. A selection of other puzzles can be found in Appendix A. 9 At the time of posing these puzzles and up to the best of our knowledge, most of these puzzles were novel in SQL (we only found a statement for primes), and solutions to them are reported here for the first time, both in SQL and Datalog. 61 Fernando Sáenz-Pérez CEUR Workshop Proceedings 54–68 Here, we use the concrete syntax of DES, where :- stands for the neck of a rule, the comma (,) for goal conjunction, => for embedded implication, /\ for rule assumptions, distinct(𝐴) for 𝛿(𝐴), top(𝑁 ,𝐴) for 𝜏 (𝑁, 𝐴), and group_by(𝐴,𝑋𝑠,𝜑) for 𝒢(𝐴, 𝑋𝑠, 𝜑). Relation attributes are not explicitly qualified with the relation name unless necessary. 4.1. Euler Number Problem Statement: Compute the Euler number 𝑒 as a Taylor series: 1 1 1 1 ∑︁ 1 𝑒= + + + + ··· = 0! 1! 2! 3! 𝑖! 𝑖≥0 As stop condition implicitly use the system precision for numbers. A Solution: A straightforward solution is to define a CTE for the factorials with a schema of two arguments: the index 𝑛 (n) and the value of 𝑛! (factorial). Its base case is (0, 1), and the inductive case just computes (n + 1, n+1*factorial), given an already-computed tuple (n, factorial). Then, define each term in the series as another CTE, say taylor, with a single column (euler) with each term in the series, computed as 1/factorial from the CTE factorials: WITH factorials(n, factorial) AS (SELECT 0, 1 UNION ALL SELECT n+1, (n+1)*factorial FROM factorials) WITH taylor(euler) AS (SELECT 1/factorial FROM factorials) SELECT SUM(euler) FROM (SELECT TOP 18 euler FROM taylor); Note a couple of things: First, there are two nested WITH statements (factorials is only known in the context of the subsequent WITH statement. Alternatively, two consecutive CTE’s are also possible. Second, the outcome SQL needs a top-𝑁 clause to limit recursion. Up to now, main RDBMS’s as PostgreSQL, DB2, SQL Server, and Oracle do not admit nested WITH statements. Datalog Formulation: Next, it is shown the automatically-generated core Datalog10 query and program corresponding to the SQL statement. The program is asserted before solving the query and retracted afterwards. factorials(0,1) /\ (factorials(B,C) :- factorials(D,E), B=D+1, C=(D+1)*E) => answer_1_6(A). answer_1_6(A) :- (taylor(B) :- factorials(_C,D), B=1/D) => answer_1_8(A). answer_1_8(A) :- group_by(answer_1_10(B),[],A=sum(B)). answer_1_10(A) :- top(18,taylor(B)), B=A. This translation folds some relations for the deductive engine to be able to solve the query. Instead, a user would alternative write a query like this: factorials(0,1) /\ ( factorials(D,E) :- factorials(F,G), D=F+1, E=(F+1)*G ) => (taylor(B) :- factorials(_C,D), B=1/D) => group_by(top(18,taylor(B)),[],A=sum(B)). An alternative solution is to define a CTE with an argument for the index 𝑛 of the 𝑛-th term in the series, the value of the term (𝑡𝑒𝑟𝑚), and the current value of the running summation (𝑣𝑎𝑙). Thus, the base case is 𝑛 = 0, 𝑡𝑒𝑟𝑚 = 1, and 𝑣𝑎𝑙 = 1, and the inductive case computes the next term for presolved cases delivering 𝑛 + 1, 𝑡𝑒𝑟𝑚/(𝑛 + 1) (which equates to divide the 𝑛-th term by 𝑛 + 1), and accumulating in 𝑣𝑎𝑙 the 𝑛 + 1-th approximation. The stop condition should test if the last term is strictly greater than 0, since otherwise this means that the system numeric precision is met. 10 Core Datalog is the language directly processed by the deductive engine. (User) Datalog is the language that the user can use at the top-level, and that enables high-level formulations for expressions and some metapredicates. 62 Fernando Sáenz-Pérez CEUR Workshop Proceedings 54–68 WITH re(n,term,val) AS (SELECT 0, 1.0, 1.0 UNION SELECT n+1, term/(n+1), val+term/(n+1) FROM re WHERE term/(n+1) > 0) SELECT MAX(val) AS euler FROM re; Datalog Formulation: re(0,1,1) /\ ( re(B,C,D) :- re(E,F,G), H=F/(E+1), H>0, B=E+1, C=F/(E+1), D=G+F/(E+1) ) => group_by(re(_B,_C,D), [], A=max(D)). By submitting this user Datalog query at the top-level, the system compiles it to an almost equal core Datalog query and program as shown above for the SQL case. Such compilations can be inspected with the command /show_compilations on. Outcome: The query returns in the SICStus implementation: answer(euler:float) -> { answer(2.7182818284590455) } Obviously, the precision of the approximation to the Euler number depends on the underlying system arithmetics. Here, the last digit should read 2 instead of 5. This outcome is available by submitting the command display_answer on. 4.2. All Time Greatest Hits Problem Statement: We are given the table hits(theme string, copies int),11 holding song titles and number of sold copies for each title. Write a recursive SQL query to assign the rank of each song in terms of its sold copies (number 1 is for the most sold song, number 2 for the next and so on). If there are two or more songs with the same number of copies, they will receive the same rank number. A Solution: An idea to solve this puzzle is to build pairs with an integer representing the rank and the number of copies for hits such that there no other hits with a higher number of copies. Then, return the maximum rank for each group of hits with the same number of copies, adding to the output the song titles (known from the table hits and their number of copies). The base case in the recursive part is rank 1 for all hits. The inductive case rec is the next rank in rec such that the copies in hits is strictly less than the copies in rec, i.e., once all copies receive a rank 1, the rank 2 is received to all copies but the one with the maximum number of copies, and so on. Its SQL formulation is as follows: WITH rec(ranking, copies) AS (SELECT 1, copies FROM hits UNION SELECT ranking+1, hits.copies FROM hits, rec WHERE hits.copies < rec.copies) SELECT ranking, theme, hits.copies FROM hits, (SELECT MAX(ranking) AS ranking, copies FROM rec GROUP BY copies) m WHERE m.copies = hits.copies; Datalog Formulation: Observe that a Datalog equivalent query is neater than its SQL counterpart: ( rec(1,D) :- hits(_E,D) ) /\ ( rec(F,G) :- hits(_H,G), rec(I,J), G hits(B,C), group_by(rec(D,C), [C], A=max(D)). Outcome: answer(h1.theme:varchar(50),h1.copies:int,pos:int) -> { answer(1,’White Christmas’,50), answer(2,’In the Summertime’,31), answer(3,’Silent Night’,30), answer(4,’My Heart will Go On’,25), answer(4,’Rock Around the Clock’,25), answer(5,’I Will Always Love You’,20), answer(5,’It\’s Now or Never’,20), answer(5,’We Are the World’,20), answer(6,’If I Didn\’t Care’,19) } Info: 9 tuples computed. 11 https://www.countryliving.com/life/news/a45720/white-christmas-song-history. 63 Fernando Sáenz-Pérez CEUR Workshop Proceedings 54–68 4.3. Prime Numbers Problem Statement: Compute the first 𝑛 prime numbers, where 𝑛 is a natural bound. A Solution: The sieve of Eratosthenes can be expressed in SQL by generating numbers and getting rid of those which can be divided by a lesser number. So, a natural number generator can be defined as a CTE, and then the factors for each number as another CTE. For a number to be a factor of another number it suffices to check that the division remainder is 0. The CTE factor below delivers pairs of 𝑥,𝑓 , where 𝑥 is a number and 𝑓 is one of its factors. The outcome selects each 𝑛 for which there is only one factor. /set bound 100 WITH number(n) AS (SELECT 0 UNION SELECT n+1 FROM number WHERE n < $bound$), factor(x, f) AS (SELECT x.n, f.n FROM number x, number f WHERE x.n > f.n AND f.n > 0 AND x.n mod f.n = 0) SELECT x FROM (SELECT x, COUNT(*) c FROM factor GROUP BY x) WHERE c=1; Note that DES includes user variables which can be used as parameters, such as bound in this example, which are set with the command /set Variable Value. Datalog Formulation: Again, the translation seems to be more compact than the source SQL statement. number(0) /\ ( number(B) :- number(C), C<100, B=C+1 ) /\ ( factor(D,E) :- number(D), number(E), D>E, E>0, 0=D mod E ) => group_by(factor(A,_F), [A], (_G=count,_G=1)). Outcome: answer(factor.x:int) -> { answer(2), answer(3), answer(5), answer(7), ..., answer(97) } Info: 25 tuples computed. 4.4. Graph of an Explicit Function Problem Statement: Plot the sine function as depicted in the figure below. A Solution: Here we define two CTE’s: One working on the real plane (plot) which relates 𝑥 and 𝑦 coordinates via a function (parametrically specified with the user variable f), with square dimensions. The second CTE works on the “pixel" plane (bars) by building a string of spaces with as many spaces as pixels in the ordinates (horizontally represented instead of vertically), which will be truncated to the required length with respect to the value of the function at each 𝑥 coordinate. Observe that strings in the display are quoted and ended with the characters “’),”, so that it visually resembles the plot. Both plot and bars are built in a similar way natural numbers are defined: a base case with the first point in the plane, and with a single space, respectively. The recursive case of plot adds pairs (x+dx, fx) for 𝑥 coordinates up to its bound, whilst for bars, a space is added up to reaching its top length. Once both relations are defined, the outcome query truncates (with the function substr: substring) the longest bar up to the value of the corresponding 𝑦 coordinate for each 𝑥 coordinate found in plot. -- Dimension of the plot (d+1) rows and (d+1) columns: /set d 50 -- Start x: /set x0 0.0 -- End x: /set x1 4*pi -- Delta x: /set dx ($x1$-$x0$)/$d$ -- Function: /set f sin(xi) WITH 64 Fernando Sáenz-Pérez CEUR Workshop Proceedings 54–68 plot(x, fx) AS (SELECT $x0$ xi, $f$ UNION SELECT x+$dx$ xi, $f$ FROM plot WHERE x <= $x1$), bars(bar) AS (SELECT ’ ’ UNION SELECT bar||’ ’ FROM bars WHERE length(bar)<$d$) SELECT substr((select max(bar) bar from bars), 1, round($d$*(fx-(SELECT min(fx) FROM plot))*(SELECT 1/(max(fx)-min(fx)) FROM plot))) FROM plot; Datalog Formulation: The following equivalent Datalog query includes the aggregate predicates max/3 and min/3, which are syntax simplifications for a group_by predicate (e.g., max(R,X,M) ≡ group_by(R,[],M=max(X))). Functions concat(A,B) (concatenation of A and B and written as A||B in standard SQL), length(A) (length of A), and substr(A,P,N) (substring of N characters of A taken from position P) are used. ( plot($x0$,F) :- F=sin($x0$) ) /\ ( plot(B,C) :- plot(D,_E), D=<$x1$, B=D+$dx$, C=sin(D+$dx$) ) /\ bars(’ ’) /\ ( bars(H) :- bars(G), length(G)<$d$, H=concat(G,’ ’) ) => plot(_A,_B), max(bars(_K),_K,_H), min(plot(_L,_M),_M,_C), group_by(plot(_N,_O),[],(_F=1/(max(_O)-min(_O)))), A=substr(_H,1,round($d$*(_B-_C)*_F)). 4.5. Base Conversion Problem Statement: Given the table conversion(c:string, fb:int, tb:int), with natural numbers to be converted (c) from a base (fb) to another base (tb), write an SQL query returning a row for each element in the table conversion, with schema (number_fb, fb, number_tb, tb), such that number_fb is the number in base fb equivalent to the number number_tb in base tb. A Solution: This problem can be subdivided as usually in two subproblems: Convert to base 10, and then to the target base. We can assume a table digits(n INT, d STRING) containing tuples as { (0,’0’), (1,’1’), . . . , (10,’A’), . . . }. For converting to base 10, the approach is to recursively generate tuples with the inspected position 𝑖 (index of the digit from right to left starting with 0) with a value corresponding to the number up to its 𝑖-th position. So, we can think of a schema for this first CTE as to_base_10(c, fb, tb, s, i, n), where original parameters in conversion (c, fb and tb) are kept with their same names, s is the number truncated up to the 𝑖 − 1-th index, i is the index 𝑖, and n is the value (base 10) for the number c up to the 𝑖-th position. The original number is required in this schema because there will be in general several numbers to be converted, so that this first column will discriminate the number for which the conversion is made. A similar requirement applies to the original base fb and the target base fb. For the sake of keeping neat the formulation, the base case for to_base_10 starts at index −1, and the original number is appended (with the || operator) with a void −1-th position, delivering a value 0 for it. Otherwise, we could start at index 0, but calculations in the recursive case should be also included in the base case. This recursive case builds tuples from to_base_10(c, fb, tb, s, i, n) itself as follows: its first 3 columns are the original ones, as found in to_base_10. The number s (recall: a string) is right trimmed in 1 character; the index i is incremented, and its quantity (weighted with its position: fb^(i+1)) is added to n, where the value of the digit is taken from the digits table. Converting a number in base 10 to another base is a somewhat similar problem, but with schema to_base_b(c, fb, tb, n, st): the three original columns (c, fb and tb), the number in base 10 (n) and the most significant digits already processed (st). Here, the base case comes from the results for the relation to_base_10, selecting the value n for the final calculation (the length of the truncated number is 1: the last, most-significant character in the string). In this case, the index is not used because another approach is taken: dividing the current number by the target base and delivering the digit 65 Fernando Sáenz-Pérez CEUR Workshop Proceedings 54–68 corresponding to the remainder of the division as the next digit, from the less to the most significant digits. Then, the recursive case builds the outcome by appending the calculated digit to the left of the former outcome. WITH to_base_10(c, fb, tb, s, i, n) AS ( SELECT c, fb, tb, c || ’ ’, -1, 0 FROM conversion UNION SELECT c, fb, tb, SUBSTR(s,1,LENGTH(s)-1), i+1, integer(n+(SELECT n FROM digits WHERE d=SUBSTR(to_base_10.s,LENGTH(to_base_10.s)-1,1) )*fb^(i+1)) FROM to_base_10 WHERE LENGTH(s)>1 ), to_base_b(c, fb, tb, n, st) AS ( SELECT c, fb, tb, n, ’’ FROM to_base_10 WHERE LENGTH(s)=1 UNION SELECT c, fb, tb, n DIV tb, (SELECT d FROM digits WHERE n=to_base_b.n MOD to_base_b.tb) || st FROM to_base_b WHERE n MOD tb > 0 ) SELECT c AS number_fb, fb AS fb, st AS number_tb, tb AS tb FROM to_base_b WHERE n=0; Datalog Formulation: The Datalog translation is more elaborated than in previous examples (but still, neater than its counterpart SQL formulation). ( to_base_10(E,F,G,H,-1,0) :- conversion(E,F,G), H=concat(E,’ ’) ) /\ ( to_base_10(I,J,K,L,M,N) :- to_base_10(I,J,K,O,P,Q), R=length(O), R>1, T=R-1, L=substr(O,1,T), M=P+1, digits(U,V), X=length(O)-1, V=substr(O,X,1), N=integer(Q+U*J**(P+1))) /\ ( to_base_b(Y,Z,AA,AB,’’) :- to_base_10(Y,Z,AA,AC,_AD,AB), length(AC)=1 ) /\ ( to_base_b(AE,AF,AG,AH,AI) :- to_base_b(AE,AF,AG,AJ,AK), AJ mod AG>0, AH=AJ div AG, digits(AM,AN), AM=AJ mod AG, AI=concat(AN,AK) ) => to_base_b(A,B,D,0,C). 4.6. Universal Turing Machine Problem Statement: Implement a universal Turing machine as a state transition system, following https://aturingmachine.com/. Possible transitions are specified in the input table transitions(state, symbol, newstate, newsymbol, move), and the initial state is given in initial_state(state, tape, pos). Here, state is a numeric value identifying the state, tape is a string containing a sequence of symbols 0’s and 1’s, pos is the position of the scanned symbol (the leftmost character of the tape is always the position 1 even when the tape can can grow indefinitely to the right or to the left), move can be either be ’L’ (to the left) or ’R’ (to the right). Input Data: The two input tables are filled as shown next to implement a binary counting machine. It counts from 0 upwards without stopping. Inspecting its outcome can be done by adding a stop condition on the number of the transitions (steps). INSERT INTO initial_state VALUES(0,’ ’,1); -- (0,B) -> (1,B) Left: When a blank (B) is found we change to state 1: INSERT INTO transitions VALUES(0,’ ’,1,’ ’,’L’); -- (0,0) -> (0,0) Right: This state moves the tape to the right most digit: INSERT INTO transitions VALUES(0,’0’,0,’0’,’R’); -- The following transition is not needed: -- (0,1) -> (0,1) Right: This state moves the tape to the right most digit -- (1,B) -> (0,1) Right: If we change a Blank to a 1 we change back to state 0: INSERT INTO transitions VALUES(1,’ ’,0,’1’,’R’); -- (1,0) -> (0,1) Right: If we change a 0 to a 1 we change back to state 0: 66 Fernando Sáenz-Pérez CEUR Workshop Proceedings 54–68 INSERT INTO transitions VALUES(1,’0’,0,’1’,’R’); -- (1,1) -> (1,0) Left: If we change a 1 to a 0 we keep looking to the left: INSERT INTO transitions VALUES(1,’1’,1,’0’,’L’); A Solution: Implementing a universal Turing machine can be done with a recursive query where the base case contains the initial state, and the recursive case selects the next state in terms of the symbol in the current tape position. This must consider that the tape can grow both the left and to the right. For the last case, the string tape can be concatenated with a new symbol and the index at the first position does not change. For the former case, a new blank symbol is prepended and the index 1 is for this position. WITH m(state,tape,pos,step) AS ( SELECT state,tape,pos,1 FROM initial_state UNION ALL SELECT -- State t.newstate, -- Tape CASE WHEN t.move=’L’ AND pos>1 THEN SUBSTR(tape,1,pos-1)||t.newsymbol||SUBSTR(tape,pos+1) WHEN t.move=’L’ AND pos=1 THEN ’ ’||t.newsymbol||SUBSTR(tape,pos+1) WHEN t.move=’R’ AND LENGTH(tape)=pos THEN SUBSTR(tape,1,pos-1)||t.newsymbol||’ ’ WHEN t.move=’R’ AND LENGTH(tape)>pos THEN SUBSTR(tape,1,pos-1)||t.newsymbol||SUBSTR(tape,pos+1) END, -- Move CASE WHEN t.move=’L’ AND pos>1 THEN pos-1 WHEN t.move=’L’ AND pos=1 THEN 1 WHEN t.move=’R’ THEN pos+1 END, -- Step step+1 FROM transitions t, m WHERE m.state=t.state AND SUBSTR(m.tape,m.pos,1)=t.symbol -- AND step < $max_step$ -- Use this as a stop condition ) SELECT step,state,pos,tape FROM m; Datalog Formulation: (m(E,F,G,1) :- initial_state(E,F,G)) /\ (m(H,I,J,K) :- transitions(L,M,H,N,O), m(L,P,Q,R), ’$substr’(P,Q,1,M), R<40, S=Q-1, ’$substr’(P,1,S,T), ’$concat’(T,N,U), V=Q+1, ’$substr’(P,V,W), ’$concat’(U,W,X), ’$concat’(’ ’,N,Y), Z=Q+1, ’$substr’(P,Z,AA), ’$concat’(Y,AA,AB), AC=Q-1, ’$substr’(P,1,AC,AD), ’$concat’(AD,N,AE), ’$concat’(AE,’ ’,AF), AG=Q-1, ’$substr’(P,1,AG,AH), ’$concat’(AH,N,AI), AJ=Q+1, ’$substr’(P,AJ,AK), ’$concat’(AI,AK,AL), ’$case’([((O=’L’,Q>1),Q-1),((O=’L’,Q=1),1),(O=’R’,Q+1)],null,J), K=R+1, ’$case’([((O=’L’,Q>1),X),((O=’L’,Q=1),AB),((O=’R’,’$length’(P,AM),AM=Q),AF), ((O=’R’,’$length’(P,AN),AN>Q),AL)],null,I)) => m(B,D,C,A). 5. Conclusions and Future Work SQL puzzles for challenging students’ skills have been presented and solved in the DES system, which has been selected because it helps students with syntax and semantic warnings about SQL queries. Since the system is based on intuitionistic logic programming, it allows for neater and simpler SQL 67 Fernando Sáenz-Pérez CEUR Workshop Proceedings 54–68 formulations to solve those puzzles because SQL queries are translated into a Hypothetical Datalog program. Both the syntax and the inference system for the Hypothetical Datalog system in [9] have been extended to support expressions in conditions and other meta-predicates needed for expressing SQL more complex queries. The implementation can be greatly enhanced in several ways; for example, replacing its prototype SQL parser with a more performant one, and supporting the semi-naïve differential optimization [19], which hugely improves performance for recursive problems, and fixing some known issues. Additionally, more puzzles will be developed for the enjoyment of students (and the teacher). References [1] M. Cook, Universality in elementary cellular automata, Complex Systems 15 (2004). [2] A. Levitin, M.-A. Papalaskari, Using puzzles in teaching algorithms, SIGCSE Bull. 34 (2002) 292–296. [3] J. Celko, Joe Celko’s SQL Puzzles and Answers, Second Edition (The Morgan Kaufmann Series in Data Management Systems), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006. [4] S. Wikipedia, Esoteric Programming Languages: Brainfuck, Intercal, Befunge, Esoteric Program- ming Language, Kvikkalkul, One Instruction Set Computer, Unlambda, Malbo, University-Press Org, 2013. URL: https://books.google.es/books?id=4X_hngEACAAJ. [5] F. Sáenz-Pérez, DES: A Deductive Database System, Electronic Notes on Theoretical Computer Science 271 (2011) 63–78. [6] F. Sáenz-Pérez, Applying constraint logic programming to sql semantic analysis, Theory and Practice of Logic Programming 19 (2019) 808–825. doi:10.1017/S1471068419000206. [7] R. Caballero, Y. García-Ruiz, F. Sáenz-Pérez, Declarative Debugging of Wrong and Missing Answers for SQL Views, in: Eleventh International Symposium on Functional and Logic Programming (FLOPS 2012), LNCS 7294, Springer, 2012, pp. 73–87. [8] A. J. Bonner, Hypothetical Datalog: Complexity and Expressibility, Theoretical Computer Science 76 (1990) 3–51. [9] F. Sáenz-Pérez, Intuitionistic Logic Programming for SQL, in: M. V. Hermenegildo, P. Lopez-Garcia (Eds.), Logic-Based Program Synthesis and Transformation, Springer International Publishing, Cham, 2017, pp. 293–308. [10] F. Arni, K. Ong, S. Tsur, H. Wang, C. Zaniolo, The Deductive Database System LDL++, TPLP 3 (2003) 61–94. [11] F. Sáenz-Pérez, Tabling with support for relational features in a deductive database, ECEASST 55 (2012). [12] A. J. Bonner, L. T. McCarty, Adding Negation-as-Failure to Intuitionistic Logic Programming, in: E. L. Lusk, R. A. Overbeek (Eds.), Proceedings of the North American Conference on Logic Programming (NACLP), The MIT Press, 1990, pp. 681–703. [13] F. Sáenz-Pérez, Restricted predicates for hypothetical datalog, Electronic Proceedings in Theoretical Computer Science 200 (2015) 64–79. [14] F. Sáenz-Pérez, Implementing tabled hypothetical datalog, in: Proceedings of the 25th IEEE International Conference on Tools with Artificial Intelligence, ICTAI’13, 2013, pp. 596–601. [15] J. D. Ullman, Principles of Database and Knowledge-Base Systems, Volume II, Computer Science Press, 1988. [16] ISO/IEC, SQL:2016 ISO/IEC 9075-1:2016 Standard, 2016. [17] L. Sterling, E. Shapiro, The Art of Prolog (2Nd Ed.): Advanced Programming Techniques, MIT Press, Cambridge, MA, USA, 1994. [18] S. Nieva, F. Sáenz-Pérez, J. Sánchez, HR-SQL: An SQL Database System with Extended Recursion and Hypothetical Reasoning. Ongoing Work, in: XV Jornadas sobre Programación y Lenguajes, PROLE2015 (SISTEDES), 2015, pp. 1–15. [19] I. Balbin, K. Ramamohanarao, A generalization of the differential approach to recursive query evaluation, The Journal of Logic Programming 4 (1987) 259 – 262. 68