
Paul Peseux

supervised by M. Berar, T. Paquet and V. Nicollet

Litis, Normandie; Lokad, Paris, France

paul.peseux@lokad.com


This work addresses the automatic differentiation of queries in relational databases, so that optimization through gradient descent can be performed directly in these databases. It describes a form of automatic differentiation for a subset of relational queries.

INTRODUCTION

Modern Differentiable Programming applied to Deep Learning concentrates on dense and regular problems such as images [13] [12] and sound [6], or studies ways to project unstructured problems into this framework [11] (e.g. auto-encoders for text data). Its success is partly due to automatic differentiation [21]. In parallel, many business domains have a very well-defined structure, but this structure is relational. For example, supply chain data is organized in relational databases and experts are used to working with them. A canonical example from this domain: items in the product table come from suppliers in the supplier table and are stored in warehouses in the warehouse table. The structure of such a problem is completely different from that of Computer Vision or Natural Language Processing, two hot topics in Machine Learning (ML).

Since the people who understand supply chain complexity work with relational databases, these databases seem to be the adequate place to let them build and optimize their own models. One of the main ways to optimize a model is through gradient-based methods: if the model is written with queries, then these queries must be differentiated to optimize the model. Letting experts build white-box models helps them check the sanity of these models, which is very difficult to do with black-box models such as deep neural networks. This is called Interpretable ML in [19], and it directly applies to the supply chain, where a single company places thousands of orders every day. Furthermore, [20] has shown the advantage of performing the optimization in the database system itself, which limits data transfer costs, over pulling the data out to an external ML-oriented system.

Subsets of many languages have been differentiated: Python [17] [4], C [8], Julia [10], Swift [18], F# [3], ... A more complete list can be found at [7]. However, to our knowledge, there have been only a few attempts at differentiating SQL-like programming languages. We therefore study a way to differentiate relational queries in order to perform optimization through gradient descent.

ADSL

The works [8] [10] [2] [21] [...] propose to differentiate subsets of common programming languages. What these initiatives have in common is the purpose of differentiating a pre-existing language. This is an interesting but complicated task, as those languages were not designed for differentiation. This is especially true for relational programming languages.

We introduce ADSL¹, A Differentiable Sub Language, intended as a target to which relational languages are lowered. It is a simple language where Automatic Differentiation is a first-class citizen. This idea is similar to [1] [9] [14]. ADSL is closed under differentiation: the adjoint, i.e. the derived program, of an ADSL program is also a differentiable ADSL program. This closure gives immediate access to higher-order derivatives, which are sometimes used [15] [5]. ADSL is a simple static single assignment (SSA) language that supports loops and conditionals. Its main specificity is its support for projectors and broadcasts.

An ADSL program is a list of Statements ⟨S⟩, whose grammar is defined by:

⟨S⟩ ::= ⟨v ← e⟩            Variable assignment
      | ⟨Cond(v Ψ Ξ)⟩      Conditional
      | ⟨For(Φ)⟩           Loop
      | ⟨Return v⟩         Output of a program

⟨e⟩ ::= ⟨v⟩                Variable
      | ⟨f⟩                Scalar
      | ⟨b⟩                Boolean
      | ⟨v + w⟩            Variable addition
      | ⟨Call1 op v⟩       Function call
      | ⟨Call2 op v w⟩     Function call (2 parameters)
      | ⟨Param i⟩          Parameter access
      | ⟨Const i⟩          Constant access
      | ⟨v ⊳⟩              Broadcast projector
      | ⟨v ⊲⟩              Aggregation projector
      | ⟨Pred⟩             Predicate

⟨Pred⟩ ::= ⟨And v w⟩ | ⟨Or v w⟩ | ⟨Not v⟩ | ⟨v < w⟩ | ⟨v ≤ w⟩
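To make the statement forms above more concrete, here is a minimal Python sketch of reverse-mode differentiation over a straight-line, SSA-style list of statements. It is an illustration under our own assumptions, not the Adsl library's API: the tuple encoding, the operator tables `CALL1` and `CALL2`, and the functions `forward` and `gradient` are all hypothetical, and loops, conditionals and projectors are left out.

```python
import math

# Hypothetical encoding (not the Adsl API): statement k defines SSA variable k.
#   ("param", i)          parameter access
#   ("const", c)          scalar constant
#   ("add", v, w)         v + w
#   ("call1", op, v)      unary function call
#   ("call2", op, v, w)   binary function call
CALL1 = {  # op -> (function, derivative)
    "square": (lambda x: x * x, lambda x: 2.0 * x),
    "exp": (math.exp, math.exp),
}
CALL2 = {  # op -> (function, partial derivatives w.r.t. both arguments)
    "mul": (lambda x, y: x * y, lambda x, y: (y, x)),
    "sub": (lambda x, y: x - y, lambda x, y: (1.0, -1.0)),
}


def forward(program, params):
    """Evaluate every statement in order and return all SSA values."""
    values = []
    for stmt in program:
        kind = stmt[0]
        if kind == "param":
            values.append(params[stmt[1]])
        elif kind == "const":
            values.append(float(stmt[1]))
        elif kind == "add":
            values.append(values[stmt[1]] + values[stmt[2]])
        elif kind == "call1":
            values.append(CALL1[stmt[1]][0](values[stmt[2]]))
        elif kind == "call2":
            values.append(CALL2[stmt[1]][0](values[stmt[2]], values[stmt[3]]))
        else:
            raise ValueError(f"unknown statement {kind!r}")
    return values


def gradient(program, params):
    """Return (output, d output / d params); the output is the last variable."""
    values = forward(program, params)
    adjoint = [0.0] * len(program)
    adjoint[-1] = 1.0
    grads = [0.0] * len(params)
    # Walk the statements backwards, pushing each adjoint to its operands.
    for idx in reversed(range(len(program))):
        stmt, a = program[idx], adjoint[idx]
        kind = stmt[0]
        if kind == "param":
            grads[stmt[1]] += a
        elif kind == "add":
            adjoint[stmt[1]] += a
            adjoint[stmt[2]] += a
        elif kind == "call1":
            adjoint[stmt[2]] += a * CALL1[stmt[1]][1](values[stmt[2]])
        elif kind == "call2":
            dx, dy = CALL2[stmt[1]][1](values[stmt[2]], values[stmt[3]])
            adjoint[stmt[2]] += a * dx
            adjoint[stmt[3]] += a * dy
    return values[-1], grads


# loss = (a * 3 + b - 5)^2, with parameters a (index 0) and b (index 1)
PROGRAM = [
    ("param", 0),            # v0 = a
    ("param", 1),            # v1 = b
    ("const", 3.0),          # v2 = 3
    ("const", 5.0),          # v3 = 5
    ("call2", "mul", 0, 2),  # v4 = v0 * v2
    ("add", 4, 1),           # v5 = v4 + v1
    ("call2", "sub", 5, 3),  # v6 = v5 - v3
    ("call1", "square", 6),  # v7 = v6^2
]
loss, grads = gradient(PROGRAM, [1.0, 0.0])
```

Because every variable is assigned once and only after its operands, a single backward pass over the list is enough to accumulate all adjoints.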

ADSL is minimal, but it is enough to tackle many real business problems, such as those encountered in the supply chain. Its main characteristic is to be easily and fully differentiable.

¹ The Adsl library can be found at https://github.com/Lokad/Adsl

In this section we describe our approach to differentiating a relational query. This approach is based on compilation.

Let R be a relational query that creates the float column Loss in the table OT. We assume that R involves a float column of parameters in the table PT. In optimization or Machine Learning, such an objective function is often called a loss. OT stands for Observation Table and PT for Parameter Table. It is possible that OT = PT.

Our main goal is to minimize the scalar:

SELECT SUM(Loss) FROM OT

Differentiating a relational query means that we want to create, in PT, the column that is the derivative of OT.Loss with respect to the parameter column of PT. Doing so unlocks optimization through gradient-based methods. As it appears hard to differentiate an arbitrary query, we reduce our scope to a subset of queries that should be wide enough to cover many industrial cases. First, we do not consider queries that involve a Common Table Expression: these have to be inlined in the query, which drastically helps the compilation of the query to ADSL. Second, the parameter column of PT should appear once and only once in the query.

Let T be the set of tables used in R. Then {OT, PT} ⊂ T.

Let us introduce the relation TA → TB, which holds when the primary key of TA is a foreign key in TB. It is said that TA broadcasts into TB.

Here is a simple way to create such a table TA from TB in SQL:

CREATE TABLE TA AS
SELECT foreignKey AS primaryKey FROM TB
GROUP BY foreignKey

(T, →) naturally forms a graph. We can then compile the query R to the pair (T, →) × P, where P is an ADSL program. Evaluating (T, →) × P gives OT.Loss.
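As an illustration of the broadcast relation, the following sketch runs the statement above in an in-memory SQLite database through Python's standard `sqlite3` module. The toy contents of TB and the extra columns `id` and `amount` are assumptions made for the example. It then checks that every line of TB points to exactly one line of TA, which is what allows TA to broadcast into TB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy data (assumed for the example): three lines sharing two keys.
    CREATE TABLE TB (id INTEGER PRIMARY KEY, foreignKey TEXT, amount REAL);
    INSERT INTO TB (foreignKey, amount) VALUES
        ('x', 1.0), ('x', 2.0), ('y', 3.0);

    -- The statement from the text: TA gets one line per distinct key of TB.
    CREATE TABLE TA AS
        SELECT foreignKey AS primaryKey FROM TB GROUP BY foreignKey;
""")

ta_keys = [row[0] for row in
           conn.execute("SELECT primaryKey FROM TA ORDER BY primaryKey")]

# For each line of TB, count the lines of TA it refers to.
matches = conn.execute("""
    SELECT TB.id, COUNT(TA.primaryKey)
    FROM TB LEFT JOIN TA ON TB.foreignKey = TA.primaryKey
    GROUP BY TB.id
    ORDER BY TB.id
""").fetchall()
```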

As P is an ADSL program, it is possible to differentiate it with respect to the input associated with the parameter column of PT; we write the result P′.

We state that, with

$$R' = (T, \rightarrow) \times P'$$

the evaluation of R′ gives the expected derivative column in PT, which is represented in Figure 1.

We believe that this scheme is a good way to differentiate relational queries: if the output is itself a relational query, then we can use all the optimizations and parallelizations developed for regular queries.

4 GRAPH POINT OF VIEW

In this section we introduce notations on graphs that will be applied to the SQL Table tree in order to simplify it and facilitate its compilation to ADSL.

Definition 4.1 (Polytree). A Polytree is a directed acyclic graph whose underlying undirected graph is a tree.

For example, any tree structure of a website is a Polytree.
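Definition 4.1 can be checked mechanically. The sketch below is one possible test, with a hypothetical function name `is_polytree`: it relies on the standard fact that an undirected graph on n nodes is a tree exactly when it has n - 1 edges and is connected. Acyclicity of the directed graph then follows, since any directed cycle would also be a cycle of the underlying undirected graph.

```python
from collections import defaultdict


def is_polytree(nodes, edges):
    """Return True if the directed graph (nodes, edges) is a polytree.

    `edges` is an iterable of (source, destination) pairs.
    """
    nodes = list(nodes)
    edges = list(edges)
    # A tree on n nodes has exactly n - 1 edges ...
    if len(edges) != len(nodes) - 1:
        return False
    # ... and is connected once edge directions are ignored.
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(nodes)


# Two parents sharing one child: a polytree, although not a rooted tree.
v_shape = is_polytree(["A", "B", "C"], [("A", "C"), ("B", "C")])
# A diamond has two undirected paths from A to D, so it is not a polytree.
diamond = is_polytree(
    ["A", "B", "C", "D"],
    [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")],
)
```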

Definition 4.2 (Cross Edge). A cross-edge is a pair of edges in a graph (A → B, C → B) which indicates that B comes from a cross operation between A and C.

Here is a simple way to create such an edge in SQL:

CREATE TABLE B AS SELECT * FROM A CROSS JOIN C
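The cross operation can be tried on toy data. The sketch below uses Python's standard `sqlite3` module; the column names `a_id` and `c_id` and the table contents are assumptions made for the example. Each line of B refers to exactly one line of A and one line of C, which is why both A and C broadcast into B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (a_id INTEGER PRIMARY KEY);
    CREATE TABLE C (c_id INTEGER PRIMARY KEY);
    INSERT INTO A VALUES (1), (2), (3);
    INSERT INTO C VALUES (10), (20);

    -- The statement from the text: B has one line per pair (line of A, line of C).
    CREATE TABLE B AS SELECT * FROM A CROSS JOIN C;
""")

rows_b = conn.execute(
    "SELECT a_id, c_id FROM B ORDER BY a_id, c_id").fetchall()
# Each line of A is referenced by as many lines of B as there are lines in C.
per_a = conn.execute(
    "SELECT a_id, COUNT(*) FROM B GROUP BY a_id ORDER BY a_id").fetchall()
```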

Let G be a Polytree with cross-edges.

Definition 4.3 (PolyStar). Let us define the set of PolyStars ★ as

$$\star = \{\, (G, n) \mid G \text{ a Polytree with cross-edges and } n \text{ a node of } G \,\} \qquad (1)$$

A PolyStar is a Polytree with a special focus on a specific node of the graph.

Let (G, n) ∈ ★. We call
• an upstream node a node u of G such that u → n.
• an upstream cross node a cross node of G such that one of its parents is an upstream node.
• an observation-cross node a cross node of G such that one of its parents is n.
• a downstream node a node d of G such that d is not an observation-cross node and n → d.
• a full node any of the remaining nodes of G.

By construction, there is no path between a full node and n.

4.1 SQL Table tree simplification

A relational query in SQL creates many tables, even though some could be grouped. For example,

SELECT Loss FROM OT

creates another table with a bijection from its index to the index of OT. Thus we introduce a novel join operator that helps us simplify the table tree: TOTAL JOIN. T1 TOTAL JOIN T2 ON ⟨columns⟩ has the same semantics as T1 INNER JOIN T2 ON ⟨columns⟩, with the additional constraint that for each line of T1 there is exactly one corresponding line of T2. For T1 TOTAL JOIN T2 ON ⟨columns⟩ to succeed, it is sufficient, but not necessary, that the columns are a primary key in T2 and a foreign key in T1.
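TOTAL JOIN is not a standard SQL operator, but its constraint can be checked on an existing database. The sketch below, in Python with the standard `sqlite3` module, is one way to do it under our own assumptions: the helper `is_total_join` and the toy Trips and Taxis tables are hypothetical, and the join is restricted to a single column with the same name in both tables.

```python
import sqlite3


def is_total_join(conn, t1, t2, column):
    """True if every line of t1 matches exactly one line of t2 on `column`.

    Table and column names are interpolated into the SQL text, so they must
    be trusted identifiers.
    """
    query = f"""
        SELECT COUNT(*) FROM (
            SELECT {t1}.rowid
            FROM {t1} LEFT JOIN {t2} ON {t1}.{column} = {t2}.{column}
            GROUP BY {t1}.rowid
            HAVING COUNT({t2}.{column}) <> 1
        )
    """
    (violations,) = conn.execute(query).fetchone()
    return violations == 0


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Taxis (taxiId TEXT PRIMARY KEY, a REAL);
    CREATE TABLE Trips (tripId INTEGER PRIMARY KEY, taxiId TEXT, distance REAL);
    INSERT INTO Taxis VALUES ('t1', 1.0), ('t2', 1.0);
    INSERT INTO Trips VALUES (1, 't1', 2.0), (2, 't1', 3.5), (3, 't2', 1.0);
""")

# Each trip has exactly one taxi, so Trips TOTAL JOIN Taxis would succeed.
trips_to_taxis = is_total_join(conn, "Trips", "Taxis", "taxiId")
# Taxi 't1' has two trips, so Taxis TOTAL JOIN Trips would not.
taxis_to_trips = is_total_join(conn, "Taxis", "Trips", "taxiId")
```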

Thanks to this join operator, which is reminiscent of [16], we can gather the tables of the graph that come from this operation. Indeed, creating a new table through a TOTAL JOIN is equivalent to adding a column to the origin table. The compilation to ADSL of any join operator that is not a TOTAL JOIN then gives a projector (a broadcast or an aggregator). If a TOTAL JOIN is used, operations can be performed line by line, so we can compile it to a scalar operation such as +, Call1, Call2, ...
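The two kinds of projectors can be sketched in plain Python. This is only an illustration with hypothetical helper names (`broadcast`, `aggregate_sum`) and toy data, not ADSL itself. It also shows why both projectors are needed for differentiation: when a value is broadcast from a keyed table onto the lines that reference it, the gradient with respect to that value is obtained by summing the per-line contributions back, i.e. by an aggregation.

```python
def broadcast(values_by_key, foreign_keys):
    """Copy a value from the keyed table onto each line that references it."""
    return [values_by_key[key] for key in foreign_keys]


def aggregate_sum(line_values, foreign_keys, keys):
    """Sum per-line values back onto the keyed table."""
    totals = {key: 0.0 for key in keys}
    for key, value in zip(foreign_keys, line_values):
        totals[key] += value
    return totals


# Toy data (assumed): one slope per taxi, three trips referencing the taxis.
slopes = {"t1": 1.0, "t2": 2.0}
trip_taxi = ["t1", "t1", "t2"]
distances = [2.0, 3.0, 1.0]

# Broadcast the slope to the trips, then work line by line.
slope_per_trip = broadcast(slopes, trip_taxi)
estimates = [s * d for s, d in zip(slope_per_trip, distances)]

# Gradient of sum(estimates) w.r.t. each slope: the per-line derivative is
# the distance, and the contributions of a taxi's trips are summed.
grad = aggregate_sum(distances, trip_taxi, slopes)
```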

Remark 2 (Sufficient condition). If the query is written without any Common Table Expression and involves the parameter column of PT only once, then (T, OT) ∈ ★, where T is the set of tables used in R.

4.2 A supply chain example

In this section we take a real case from the supply chain industry to illustrate our previous formalization.

Let us consider a database that contains information on the products that a company sells. The Product table records the products, which are organized by categories. The Orders table records product orders. Assuming that we also have a Week table whose primary key is the week number, we would write:

CREATE TABLE Category AS
SELECT category FROM Product GROUP BY category

CREATE TABLE CategoryWeek AS
SELECT * FROM Category CROSS JOIN Week

CREATE TABLE ProductWeek AS
SELECT * FROM Product CROSS JOIN Week

Then we get Figure 3.

4.3 Why all these notations?

All these notations help us compile the query as easily as possible. While computing a line of the derivative column from (T, →) × P′, an input coming from
• the observation table gives a scalar.
• an upstream table gives a scalar.
• a full table gives a vector of the size of the full table itself.
• an upstream-cross table gives a vector of the size of the left table used in the cross operation.
• a downstream table gives a vector of a certain size.

In summary, we introduce the TOTAL JOIN operator to turn the SQL Table tree into a PolyStar. Once it is a PolyStar, its lowering, i.e. compilation, to ADSL is simplified.

5 EXPERIMENTS

5.1 Dataset

We used the Chicago taxi rides dataset, which is available online. We chose this dataset because it has also been used by [20], which partly motivated our work.

For each ride, we use the taxi identifier, the distance (in miles) and the tip (in dollars). In this example, the Observation table is the Trips table and the Taxis table is an upstream table:

CREATE TABLE Taxis AS
SELECT taxiId, 1 AS a FROM Trips GROUP BY taxiId

5.2 Linear Regression

We use linear regression to predict the tip based on the trip distance.

$$\text{tips} = a \times \text{distance} + \text{intercept} \qquad (2)$$

This is an interesting example for a benchmark but, in our view, it does not illustrate the relational aspect of the dataset. Thus we also used an augmented version of this model where the slope depends on the taxi identifier, while the intercept remains shared among taxis:

$$\text{tips}_{\text{taxiId}} = a_{\text{taxiId}} \times \text{distance} + \text{intercept} \qquad (3)$$

This example illustrates how the relational information between the Trips table and the Taxis table has to be used. With the squared error between the estimated and the observed tips as the loss, the derived query (with respect to the slope a) should be:

SELECT tripId, taxiId,
       2 * distance * (Estimated - Tips) AS Gradient
FROM (
    SELECT Trips.*,
           Taxis.a * Trips.distance + @intercept AS Estimated
    FROM Trips INNER JOIN Taxis
    ON Trips.taxiId = Taxis.taxiId
);

For the sake of notation, slopes are initialized at 1 and the intercept at 0.
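The derivative of a squared-error loss with respect to a taxi's slope can be verified numerically. The sketch below, in Python with the standard `sqlite3` module, is an illustration under our own assumptions and not the prototype used in the experiments: the three toy trips are invented, the intercept is passed as the named parameter `:intercept`, and the per-trip gradients are summed per taxi with a GROUP BY. The result is compared with a central finite difference of the total loss.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy data (assumed): two trips for taxi t1, one trip for taxi t2.
    CREATE TABLE Trips (tripId INTEGER PRIMARY KEY, taxiId TEXT,
                        distance REAL, tips REAL);
    INSERT INTO Trips VALUES
        (1, 't1', 2.0, 1.0), (2, 't1', 4.0, 3.0), (3, 't2', 1.0, 0.5);
    CREATE TABLE Taxis AS SELECT taxiId, 1.0 AS a FROM Trips GROUP BY taxiId;
""")

# Per-taxi gradient of the squared error with respect to the slope a.
DERIVED = """
    SELECT taxiId, SUM(2 * distance * (Estimated - tips)) AS Gradient
    FROM (
        SELECT Trips.taxiId AS taxiId, Trips.distance AS distance,
               Trips.tips AS tips,
               Taxis.a * Trips.distance + :intercept AS Estimated
        FROM Trips INNER JOIN Taxis ON Trips.taxiId = Taxis.taxiId
    )
    GROUP BY taxiId
    ORDER BY taxiId
"""

# Total squared error over all trips.
LOSS = """
    SELECT SUM((Taxis.a * Trips.distance + :intercept - Trips.tips)
             * (Taxis.a * Trips.distance + :intercept - Trips.tips))
    FROM Trips INNER JOIN Taxis ON Trips.taxiId = Taxis.taxiId
"""


def total_loss(intercept=0.0):
    return conn.execute(LOSS, {"intercept": intercept}).fetchone()[0]


def loss_with_slope(taxi, slope):
    """Total loss when `taxi` uses `slope`; the slope is reset to 1 afterwards."""
    conn.execute("UPDATE Taxis SET a = ? WHERE taxiId = ?", (slope, taxi))
    value = total_loss()
    conn.execute("UPDATE Taxis SET a = 1.0 WHERE taxiId = ?", (taxi,))
    return value


gradient = dict(conn.execute(DERIVED, {"intercept": 0.0}).fetchall())

# Central finite difference on the slope of taxi t1.
eps = 1e-6
fd_t1 = (loss_with_slope("t1", 1.0 + eps)
         - loss_with_slope("t1", 1.0 - eps)) / (2 * eps)
```

Since the loss is quadratic in each slope, the finite difference agrees with the query result up to floating-point error.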

Such a model has a straightforward explanation; the model can be white-boxed. A taxi's slope shows its ability to get tips. This is called Interpretable ML [19]. We used linear regression for the sake of simplicity, but giving relational programming languages access to differentiable programming, i.e. to program derivatives, unlocks a wide variety of other models.

All experiments were run on Azure on a "Standard_L8s_v2" instance, an 8 vCPU machine running at 2.557 GHz with a 1.9 TB NVMe disk. Our prototype and experiments were run on a supply chain Domain Specific Language. It is a Python-like implementation of SQL narrowed for supply chain problems. Tests were carried out five times and the average runtimes recorded. In Table 1, we present our results for 10 epochs of gradient descent on the Chicago dataset.

Trips    10^3    10^5    1.95 × 10^7
Taxis    479     1037    9201

In Table 1, Shared Slope is the implementation of equation 2 and Taxi's Slope that of equation 3. We have not reproduced all the experiments from [20], as our focus is to differentiate a relational query that involves table relationships. In our example, this is the relationship between the Trips and Taxis tables.

6 CONCLUSION

In this work we have presented a concrete approach to differentiating relational queries. Our claim is that a derived query should also be a query. We have therefore introduced ADSL, a dedicated programming language that is closed under differentiation. Thanks to the TOTAL JOIN operator and the PolyStar, we can clarify the roles of the different tables in the relational query to differentiate. Our implementation allows us to efficiently tackle real-world problems such as those encountered in a supply chain. We hope that relational programming languages will treat Automatic Differentiation as a first-class citizen in the future; this would strengthen "Query 2.0" [22] and unlock many interesting applications. It would help every engineer working on relational databases to develop efficient white-box models by easily plugging their expertise into them.

ACKNOWLEDGMENTS

This work was supported by the ANRT French program and Lokad.

[1] Martín Abadi and Gordon Plotkin. 2019. A simple differentiable programming language. Proceedings of the ACM on Programming Languages 4 (12 2019), 1-28. https://doi.org/10.1145/3371106

[2] Atilim Baydin, Barak Pearlmutter, Alexey Radul, and Jeffrey Siskind. 2018. Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research 18 (04 2018), 1-43.

[3] Atilim Günes Baydin, Barak A. Pearlmutter, and Jeffrey Siskind. 2016. DiffSharp: An AD Library for .NET Languages. ArXiv abs/1611.03423 (2016).

[4] Olivier Breuleux and Bart van Merriënboer. 2017. Automatic Differentiation in Myia.

[5] Cerezo and Patrick Coles. 2020. Impact of Barren Plateaus on the Hessian and Higher Order Derivatives.

[6] Hoon Chung, Sung Joo Lee, Hyeong Bae Jeon, and Park. 2020. Semi-Supervised Speech Recognition Acoustic Model Training Using Policy Gradient. Applied Sciences 10 (2020), 3542.

[7] Autodiff.org community. 2020. Tools for Automatic Differentiation. http://www.autodiff.org/?module=Tools.

[8] Hascoët and Pascual. 2013. The Tapenade automatic differentiation tool: Principles, model, and specification. ACM Trans. Math. Softw. 39 (2013), 20:1-20:43.

[9] Hu, Anderson, Tzu-Mao Li, Q. Sun, N. Carr, Jonathan Ragan-Kelley, and F. Durand. 2020. DiffTaichi: Differentiable Programming for Physical Simulation. ArXiv abs/1910.00935 (2020).

[10] Michael Innes. 2018. Don't Unroll Adjoint: Differentiating SSA-Form Programs. (10 2018).

[11] Mike Innes, Edelman, Fischer, Rackauckas, Saba, V. B. Shah, and Will Tebbutt. 2019. A Differentiable Programming System to Bridge Machine Learning and Scientific Computing. ArXiv abs/1907.07587 (2019).

[12] Tzu-Mao Li. 2019. Differentiable Visual Computing. ArXiv abs/1904.12228 (2019).

[13] Tzu-Mao Li, Michaël Gharbi, Andrew Adams, Durand, and Jonathan Ragan-Kelley. 2018. Differentiable programming for image processing and deep learning in Halide. ACM Transactions on Graphics (TOG) 37 (2018), 1-13.

[14] Carol Mak and Ong. 2020. A Differential-form Pullback Programming Language for Higher-order Reverse-mode Automatic Differentiation. ArXiv abs/2002.08241 (2020).

[15] Andrea Mari, Thomas Bromley, and Nathan Killoran. 2020. Estimating the gradient and higher-order derivatives on quantum hardware.

[16] Frank McSherry. 2010. Privacy Integrated Queries: An Extensible Platform for Privacy-Preserving Data Analysis. Commun. ACM 53 (09 2010), 89-97. https://doi.org/10.1145/1559845.1559850

[17] B. V. Merrienboer, Moldovan, and Alexander Wiltschko. 2018. Tangent: Automatic differentiation using source-code transformation for dynamically typed array programming. ArXiv abs/1809.09569 (2018).

[18] Marc Rasi, Bart Chrzaszcz, Richard Wei, and Dan Zheng. 2020. Differentiable Programming Manifesto. https://github.com/apple/swift/blob/main/docs/DifferentiableProgramming.md.

[19] Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (05 2019), 206-215. https://doi.org/10.1038/s42256-019-0048-x

[20] Maximilian Schüle, Frédéric Simonis, Thomas Heyenbrock, A. Kemper, Stephan Günnemann, and T. Neumann. 2019. In-Database Machine Learning: Gradient Descent and Tensor Algebra for Main Memory Database Systems. In BTW.

[21] Bart van Merriënboer, Olivier Breuleux, Arnaud Bergeron, and Pascal Lamblin. 2018. Automatic differentiation in ML: Where we are and where we should be going. (10 2018).

[22] Weiyuan Wu, Lampros Flokas, Eugene Wu, and Jiannan Wang. 2020. Complaint-driven Training Data Debugging for Query 2.0. 1317-1334. https://doi.org/10.1145/3318464.3389696