Differential Datalog

Leonid Ryzhyk (VMware Research) and Mihai Budiu (VMware Research)

Many real-world applications based on deductive databases require incrementally updating output relations (tables) in response to changes to input relations. To make such applications easier to implement we have created Differential Datalog (DDlog), a dialect of Datalog that automates incremental computation. A DDlog programmer writes traditional, non-incremental Datalog programs. However, the execution model of DDlog is fully incremental: at runtime DDlog programs receive streams of changes to the input relations (insertions or deletions) and produce streams of corresponding changes to derived relations. The DDlog compiler translates DDlog programs to Differential Dataflow (DD) [17] programs; DD provides an incremental execution engine supporting all the relational operators, including fixed-point. The DDlog language is targeted at system builders. Consequently, the language emphasizes usability by providing a rich type system, a powerful expression language that includes string manipulation and arithmetic, a module system, and integration with C, Rust, and Java. The code is open-source, available under the permissive MIT license [1].

Introduction

Motivation. Many real-world applications must update their output in response to input changes. Consider, for example, a cluster management system such as Kubernetes [11], which configures cluster nodes to execute a user-defined workload. As the workload changes, e.g., container instances are added to or removed from the system, the configuration must change accordingly. In a large cluster, computing the configuration from scratch is prohibitively expensive. Instead, modern cluster management systems, including Kubernetes, apply changes incrementally, updating only the state affected by the change.

As another example, program analysis frameworks like Doop [5] evaluate a set of rules defined over the abstract syntax tree of the program. Such an analyzer can be integrated into an IDE to alert the developer as soon as a potential bug is introduced in the program. This requires re-evaluating the rules after every few keystrokes. To achieve interactive performance when working with very large code bases, the re-evaluation must occur incrementally, preserving as many of the intermediate results computed at earlier iterations as possible.

Incremental algorithms tend to be significantly more complex than their non-incremental versions. An incremental algorithm must propagate input changes to the output via all intermediate computation steps. This, in turn, requires (1) maintaining intermediate computation results for each step, and (2) implementing an incremental version of each operation, which, given an update to its input, computes an update to its output. Incremental computations that operate on relational state are ubiquitous throughout systems management software stacks. The complexity of the incremental algorithms greatly impacts the development cost, feature velocity, maintainability, and performance of the control systems.
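To make requirement (2) concrete, here is a minimal Python sketch of an incremental version of a single operation, an equi-join on the first column. It is an illustration only and is not taken from DDlog or DD: the function names `join`, `apply_delta`, and `join_delta` are hypothetical, and relations are modeled as multisets (a `Counter` mapping each tuple to a signed count, where a negative count in a delta means deletion).

```python
from collections import Counter

def join(a, b):
    """Join two weighted relations of (key, value) tuples on the key."""
    out = Counter()
    for (ka, va), wa in a.items():
        for (kb, vb), wb in b.items():
            if ka == kb:
                out[(ka, va, vb)] += wa * wb
    return out

def apply_delta(rel, delta):
    """Return rel with delta applied, dropping zero-weight entries."""
    out = Counter(rel)
    for t, w in delta.items():
        out[t] += w
    return Counter({t: w for t, w in out.items() if w != 0})

def join_delta(a, da, b, db):
    """Change to join(a, b) caused by input changes da and db:
    d(A join B) = (dA join B) + (A join dB) + (dA join dB)."""
    out = Counter()
    for part in (join(da, b), join(a, db), join(da, db)):
        for t, w in part.items():
            out[t] += w
    return Counter({t: w for t, w in out.items() if w != 0})

height = Counter({(1, 10): 1, (2, 20): 1})
width = Counter({(1, 3): 1})
d_height = Counter({(1, 10): -1})   # delete the height of object 1
d_width = Counter({(2, 4): 1})      # insert a width for object 2

delta = join_delta(height, d_height, width, d_width)
```

Note that `join_delta` needs the old contents of both inputs, which is exactly the maintained intermediate state that requirement (1) refers to.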

We argue that, instead of dealing with the complexity of incremental computation on a case-by-case basis, developers should embrace programming tools that solve the problem once and for all. In this paper we present one such tool, Differential Datalog (DDlog), a programming language that automates incremental computation. A DDlog programmer only has to write a Datalog specification for the original (non-incremental) problem. From this description the DDlog compiler generates an efficient incremental implementation.

Overview. DDlog is a bottom-up, incremental, in-memory, typed Datalog engine for building embedded deductive databases.

Bottom-up: DDlog starts from a set of ground facts (provided by the user) and computes all possible derived facts by following Datalog rules, in a bottom-up fashion. (In contrast, top-down engines are optimized to answer individual user queries without computing all possible facts ahead of time.)

Incremental: whenever presented with changes to the ground facts, DDlog only performs the minimum computation necessary to compute all changes in the derived facts. This has significant performance benefits, and it produces output of minimum size, which also reduces communication requirements. DDlog evaluation is always incremental; non-incremental (traditional) evaluation can be implemented as a special case, starting from empty relations.

In-memory: DDlog stores and processes data in memory (footnote 1). At the moment, DDlog keeps all the data in the memory of a single machine (footnote 2).

Typed: Pure Datalog does not have concepts like data types, arithmetic, strings or functions. To facilitate writing of safe, maintainable, and concise code, DDlog extends Datalog with:

- A powerful type system, including Booleans, unlimited-precision integers, bit-vectors, strings, tuples, and Haskell-style tagged unions.
- Standard integer and bit-vector arithmetic.
- A simple functional language that allows expressing many computations over these data types in DDlog without resorting to external functions.
- String operations, including string concatenation and interpolation.
- The ability to store and manipulate sets, vectors, and maps as first-class values in relations, including performing aggregations.

Embedded: while DDlog programs can be run interactively via a command line interface, the primary use case is to run DDlog in the same address space with an application that requires deductive database functionality. A DDlog program is compiled into a Rust library that can be linked against a Rust, C/C++ or Java program (bindings for other languages can be easily added).

DDlog is an open-source project, hosted on GitHub [1] under the MIT license.

Differential Datalog (DDlog)

A DDlog program operates on typed relations. The programmer defines a set of rules to compute a set of output relations based on input relations (Figure 1). Rules are evaluated incrementally: given a set of changes to the input relations (insertions or deletions), DDlog produces a set of changes to the output relations (expressed also as insertions or deletions).
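The input/output contract described above can be stated as a small Python sketch. It is a specification by recomputation, not how DDlog evaluates rules: the rule, the relation name `person`, and the helper `output_changes` are hypothetical and chosen only for illustration.

```python
def evaluate(person):
    """Non-incremental rule: Adult(name) :- Person(name, age), age >= 18."""
    return {name for (name, age) in person if age >= 18}

def output_changes(person, inserts, deletes):
    """Contract of incremental evaluation: the reported output changes equal
    the difference between the rule's result before and after the input
    changes. An incremental engine yields the same sets without recomputing."""
    before = evaluate(person)
    updated = (person - deletes) | inserts
    after = evaluate(updated)
    return updated, after - before, before - after

person = {("ann", 30), ("bob", 12)}
person, inserted, deleted = output_changes(
    person,
    inserts={("cat", 41), ("dan", 9)},
    deletes={("ann", 30)},
)
```

Here four input changes lead to only two output changes: `cat` is inserted into the output and `ann` is deleted from it, while `dan` never appears.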

Here we give a brief overview of the language; the DDlog language reference [22] and tutorial [23] provide a detailed presentation of language features.

Type system.

DDlog is a statically-checked, strongly-typed language; users specify types for relations, variables, and functions, but DDlog can often infer types from the context. The type system is inspired by Haskell, and supports a rich set of types. Base types include Booleans, bit-vectors (e.g., bit<32>), infinite-precision integers (bigint), and UTF-8 strings. Derived types are tuples, structures, and tagged unions (which generalize enumerated types). We currently do not allow defining recursive types like lists or trees; however, DDlog contains three built-in collection types: maps, sets, and vectors (described in Section 2.4). Figure 2 shows several type declarations.

Generic types are supported; type variables are syntactically distinguished by a tick: 'A. The language contains a built-in reference type Ref<'T>. Unlike in other languages, two references are equal if the objects they refer to are equal; thus references do not alter the nature of Datalog. References can be used to reduce memory consumption when complex objects are stored in multiple relations.

Relations and Rules.

Relations are strongly typed; the value in each column must have a statically determined type. There are three kinds of relations in DDlog:

- Input relations: the content of these relations is provided by the environment, in an incremental way.
- Output relations: these are computed by the DDlog program, and the DDlog runtime will inform the environment of changes in these relations.
- Intermediate relations: these are also computed by the DDlog program, but they are hidden from the environment.

Figure 3 shows three relation declarations. An input relation may declare an optional primary key, which is a set of columns that can be used to delete entries efficiently by specifying only the key.

DDlog rules are composed of standard Datalog operators: joins, antijoins, and unions, illustrated in Figure 3, as well as aggregation, and flatmap, discussed in Section 2.4. DDlog allows recursive rules with stratified negation: intuitively, a DDlog relation cannot recursively depend on its own negation.
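As an illustration of a recursive rule combined with stratified negation, the following Python sketch computes paths that avoid excluded nodes, in the spirit of the caption of Figure 3. The relation names (`Edge`, `Excluded`, `Path`) and the rules in the docstring are assumptions made for this sketch and are not copied from the figure; the loop is a plain semi-naive fixed point, not DDlog's evaluation strategy.

```python
def paths(edges, excluded):
    """Fixed point of the (hypothetical) rules:
         Path(x, y) :- Edge(x, y), not Excluded(x), not Excluded(y).
         Path(x, z) :- Path(x, y), Edge(y, z), not Excluded(z).
    Excluded is an input relation, so it is completely known before Path is
    computed; Path never depends on its own negation (stratified negation)."""
    ok = {(x, y) for (x, y) in edges if x not in excluded and y not in excluded}
    path = set(ok)
    frontier = set(ok)
    while frontier:
        # Semi-naive step: only extend the facts derived in the previous round.
        new = {(x, z) for (x, y) in frontier for (y2, z) in ok if y == y2}
        frontier = new - path
        path |= frontier
    return path

edges = {(1, 2), (2, 3), (3, 4), (2, 5), (5, 4)}
result = paths(edges, excluded={3})
```

Node 4 is still reachable from node 1, but only through node 5, because every edge touching the excluded node 3 is ignored.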

Computations in rules.

Much of DDlog's power stems from its ability to perform complex computation inside rules. For example, the rule in Figure 4 computes an inner join of the Height and Width tables on the object_id column, and then computes the area of the object as the product of its height and width. The DDlog expression language supports arithmetic, string manipulation, control flow constructs, and function calls.

Local variables. Local variables are used to store intermediate results of computations. In DDlog, local variables can be introduced in three different contexts:

(1) variables can be defined directly in the body of a rule, e.g., the area variable in Figure 4; (2) a variable can be defined in a match pattern, as in Figure 5; and (3) finally, a variable can be defined inside an expression, e.g., the res variable in Figure 6. A variable is visible within the syntactic scope where it was defined.

"Imperative" rule syntax. We have also defined an alternative syntax for rules, inspired by the FLWOR syntax of XQuery expressions [4]. The "imperative" fragment offers several statements: skip (does nothing), for, if, match, block statements (enclosed in braces), and variable definitions var...in. An example is shown in Figure 4. This language is essentially a language of monoid comprehensions [7], so it is easily converted to a traditional Datalog representation using a syntax-directed translation in the compiler front-end. Recursive relations cannot be expressed using this syntax.
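Readers unfamiliar with monoid comprehensions may find a Python analogue helpful: a set comprehension has the same "for ... for ... if ... yield" shape as the imperative rule of Figure 4. The sketch below is not DDlog code, and the sample data is invented for illustration.

```python
# Relations as sets of tuples: (object_id, height) and (object_id, width).
height = {(1, 10), (2, 20), (3, 5)}
width = {(1, 3), (2, 4)}

# Analogue of: for o in Height, for o1 in Width with matching ids, emit Area.
area = {
    (oid, w * h)
    for (oid, h) in height
    for (oid2, w) in width
    if oid2 == oid
}
```

Object 3 has no width, so it contributes nothing to `area`, just as an inner join would behave.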

Integers. The integer types (bigint and bit<N>) provide the standard arithmetic operations, as well as bit-wise operations, bit selection v[15:8], shifting, and concatenation.
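For readers who want to check the meaning of bit selection and concatenation, here is a small Python sketch of the same operations on plain integers. It assumes that bit 0 is the least significant bit and that a range such as [15:8] is inclusive on both ends, which is consistent with addr[7:0] yielding the last byte in Figure 5; the helper names `bit_slice` and `concat_bits` are hypothetical.

```python
def bit_slice(v, hi, lo):
    """Bits hi..lo (inclusive) of a non-negative integer, like v[hi:lo]."""
    return (v >> lo) & ((1 << (hi - lo + 1)) - 1)

def concat_bits(a, b, b_width):
    """Concatenate a (high bits) with a b_width-bit value b (low bits)."""
    return (a << b_width) | (b & ((1 << b_width) - 1))

v = 0xABCD
high_byte = bit_slice(v, 15, 8)
low_byte = bit_slice(v, 7, 0)
rebuilt = concat_bits(high_byte, low_byte, 8)
```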

Strings. All primitive types have built-in conversions to strings, and users can implement string conversion functions for user-defined types (like Java's toString() method). Expressions enclosed within ${...} in a string literal are interpolated: they are evaluated at run-time, converted to strings, and substituted. This feature is inspired by JavaScript; for example, "x+y=${x+y}".

Pattern matching. DDlog borrows the match expression from ML and Haskell; a match expression simultaneously performs pattern-matching against type constructors or values, and also can bind values. Figure 5 shows a match expression that uses a nested pattern to extract a byte from a value a with type OptionalIPAddress (this type was defined in Figure 2). For example, the last case binds the addr variable to the value of the ipv6addr field.

Pattern matching can also be used directly in the body of a rule, as in the last line from Figure 5, which extracts only IPv6 addresses from the Host relation and binds their value to the addr variable, which in turn is used in the left-hand side of the rule defining IPv6Addr relation.

Functions. DDlog functions encapsulate pure (side-effect-free) computations. Example functions are lastByte from Figure 5, and concat from Figure 6. Recursive functions are not supported. Users and libraries can declare prototypes of extern functions, which must be implemented outside of DDlog (e.g., in Rust), and linked against the DDlog program at link time. The compiler assumes that extern functions are pure.

Collections

The DDlog standard library contains three built-in generic collection types (implemented natively in Rust): Vec<'T>, Set<'T> and Map<'K, 'V>. Values of these types can be stored as first-class values within relations. Equality for values of these types is defined element-wise. In theory such types are not necessary, since collections within relations can be represented using separate relations. We have introduced them into the language because many practical applications have data models that contain nested collections; by supporting collection-valued columns natively in DDlog we can more easily interface with such applications, without the need to write glue code to convert collections back and forth into separate relations using foreign keys. Figure 6 shows the declaration in DDlog of an external function which splits a string into substrings using a separator; this function returns a vector of strings.

    // declare external function returning a vector of strings
    extern function split(s: string, sep: string): Vec<string>

    // DDlog function to concatenate all elements of a vector
    function concat(s: Vec<string>, sep: string): string = {
        var res = "";
        for (e in s) {
            res = (if (res != "") (res + sep) else res) + e
        };
        res // last value is function evaluation result
    }

    input relation Phrases(p: string)
    relation Words(w: string)

    // Words contains all words that appear in some phrase
    Words(w) :- Phrases(p), var w = FlatMap(split(p, " ")).

    // Shortest path between each pair of points x, y
    // (x, y) is the key for grouping
    // min is the function used to aggregate data in each group
    ShortestPath(x, y, min_cost) :- Path(x, y, cost),
        var min_cost = Aggregate((x, y), min(cost)).

for loops can be used to iterate over elements in collections. Figure 6 shows an implementation of the function concat, the inverse of split, which uses a loop.

The FlatMap operator can be used to flatten a collection into a set of DDlog records, as illustrated in the definition of relation Words in Figure 6.

The Aggregate operator can be used to evaluate the equivalent of SQL group-by aggregate queries. The Aggregate operator has two arguments: a key function and an aggregation function. The aggregation function receives a group of records that share the same key. The ShortestPath relation in Figure 6 is computed using aggregation.
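The behavior of FlatMap and Aggregate can be mimicked in a few lines of Python. This is a non-incremental sketch of what the two rules of Figure 6 compute, not an account of how DDlog implements them; the function names `words` and `shortest_path` and the sample data are hypothetical.

```python
from collections import defaultdict

def words(phrases):
    """Analogue of: Words(w) :- Phrases(p), var w = FlatMap(split(p, " "))."""
    return {w for p in phrases for w in p.split(" ")}

def shortest_path(path):
    """Analogue of the ShortestPath rule: group Path records by the key
    (x, y), then apply min to the costs collected in each group."""
    groups = defaultdict(list)
    for (x, y, cost) in path:
        groups[(x, y)].append(cost)
    return {(x, y, min(costs)) for (x, y), costs in groups.items()}

w = words({"hello world", "hello datalog"})
sp = shortest_path({("a", "b", 5), ("a", "b", 3), ("b", "c", 7)})
```

Because relations are sets, the word "hello" appears once in `w` even though it occurs in two phrases.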

Module system

DDlog offers a simple module system, inspired by Haskell and Python, which allows importing definitions (types, functions, relations) from multiple files. The user can add imported definitions directly into the name space of the importing module or keep them in a separate name space to prevent name collisions. Similar to Java packages, module names are hierarchical and the module name hierarchy must match the paths on the filesystem where modules are stored. The directive import library.module will load the module from file library/module.dl.
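The mapping from a hierarchical module name to a file path can be written down in one line. The Python helper below, `module_path`, is hypothetical and only restates the rule given above; it says nothing about how the DDlog compiler searches for modules.

```python
from pathlib import PurePosixPath

def module_path(module_name):
    """Map a dotted module name to its relative .dl file path."""
    return PurePosixPath(*module_name.split(".")).with_suffix(".dl")

path = module_path("library.module")
```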

The DDlog standard library is a module containing a growing collection of useful functions and data structures: some generic functions and data-types, such as min, string manipulation and conversion to strings, functions to manipulate vectors, sets, maps (insertion, deletion, lookup, etc.).

DDlog Implementation

Compiling DDlog to Differential Dataflow

Differential Dataflow. The core execution engine of DDlog is Differential Dataflow (DD). Differential Dataflow [17] is a streaming big-data processing system which provides incremental (differential) computation. DD is an incremental mapreduce-like system, but supporting a wide set of relational operators, including recursion (fixed-point) and joins. Section 4.3 in [17] describes the core relational operators that are used by our compiler to implement DDlog operators. DD is described in several publications [20,17] and has an open-source implementation with online documentation [16,15].

Compilation. Figure 7 shows how DDlog programs are compiled. The DDlog compiler is written in Haskell. The compiler generates Rust code (as text files); the Rust code is compiled and linked with the open-source Rust version of the DD library [14]. The DD engine operates on multisets, where elements can have positive or negative cardinalities; to get a set semantics we need to apply distinct operators on some multisets, in particular, output collections.
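The role of distinct can be illustrated with a small Python sketch of weighted multisets. It is a toy model under the assumption that each derivation of a tuple contributes weight +1 and each retraction contributes weight -1; the helper names `union` and `distinct` are chosen for this sketch and do not refer to the DD library's API.

```python
from collections import Counter

def union(a, b):
    """Multiset union: weights add up, and zero-weight tuples disappear."""
    out = Counter(a)
    for t, w in b.items():
        out[t] += w
    return Counter({t: w for t, w in out.items() if w != 0})

def distinct(rel):
    """Set semantics: every tuple with positive weight gets weight exactly 1."""
    return Counter({t: 1 for t, w in rel.items() if w > 0})

# The tuple (1, 3) is derived twice, by two different rule instances.
derived = union(Counter({(1, 3): 1}), Counter({(1, 3): 1, (2, 3): 1}))
as_set = distinct(derived)

# Retracting one of the two derivations lowers the weight from 2 to 1,
# so the set seen by the user does not change.
after_retract = union(derived, Counter({(1, 3): -1}))
```

Without distinct, the user would observe a change in cardinality for (1, 3) that has no meaning under set semantics.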

The DDlog compiler performs parsing, type inference, validation, and several optimization steps. In order to compute incremental results the DD runtime has to maintain temporal indexes (indexed by logical time), containing previous versions of relations. Many of our optimizations are geared towards reducing memory consumption; for example, we attempt to share indexes between multiple collections and operators. We use reference counting for large values, but stack-based implementations for small values. The non-linear operators (like distinct) can be very expensive, so we try to minimize their usage. The compiler also attempts to reuse common prefixes of disjoint rules.

The output of the DDlog compiler is a dataflow graph, which may contain cycles (introduced by recursion). The nodes of the graph represent relations; the relations are computed by dataflow relational operators. Edges connect each operator to its input and output relations. DD natively implements the following operators: map, filter, distinct, join, antijoin, groupby, union, aggregation, and flatmap. Each operator has a highly optimized implementation, incorporating temporal indexes that track updates to each relation over time and allow efficient incremental evaluation. The DD library is responsible for executing the emitted dataflow graph across many cores by running a user-defined number of worker threads.

Interacting with DDlog programs

Transactional API. The interaction with a running DDlog program is done through a transactional API. At any time only one transaction can be outstanding. After starting a transaction the user can insert and delete any number of tuples from input relations. When attempting to commit a transaction all updates are applied atomically and changes to all output relations are produced. Users register an upcall to be notified of these changes.

The DDlog API is implemented in Rust, with bindings available for other languages, currently C and Java.
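The shape of such a transactional interaction can be sketched with a toy Python class. Everything in it is hypothetical: `ToyProgram` and its method names are not the DDlog API, and the commit simply recomputes the rule and diffs the results, whereas DDlog computes the changes incrementally.

```python
class ToyProgram:
    """Hypothetical stand-in for a compiled program; this is NOT the DDlog API."""

    def __init__(self, rule, on_change):
        self.rule = rule            # maps a set of input tuples to a set of outputs
        self.on_change = on_change  # upcall receiving (tuple, +1) or (tuple, -1)
        self.inputs = set()
        self.pending = None         # None: no transaction is outstanding

    def transaction_start(self):
        if self.pending is not None:
            raise RuntimeError("only one transaction can be outstanding")
        self.pending = []

    def insert(self, tup):
        self._update(tup, +1)

    def delete(self, tup):
        self._update(tup, -1)

    def _update(self, tup, weight):
        if self.pending is None:
            raise RuntimeError("no transaction in progress")
        self.pending.append((tup, weight))

    def transaction_commit(self):
        if self.pending is None:
            raise RuntimeError("no transaction in progress")
        before = self.rule(self.inputs)
        for tup, weight in self.pending:   # apply all buffered updates at once
            if weight > 0:
                self.inputs.add(tup)
            else:
                self.inputs.discard(tup)
        self.pending = None
        after = self.rule(self.inputs)     # toy: recompute, then diff
        for tup in sorted(after - before):
            self.on_change(tup, +1)
        for tup in sorted(before - after):
            self.on_change(tup, -1)

log = []
prog = ToyProgram(rule=lambda edges: {(b, a) for (a, b) in edges},
                  on_change=lambda tup, weight: log.append((tup, weight)))
prog.transaction_start()
prog.insert((1, 2))
prog.insert((2, 3))
prog.transaction_commit()
prog.transaction_start()
prog.delete((1, 2))
prog.transaction_commit()
```

The upcall fires only at commit time, so the application never observes a partially applied transaction.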

The command-line interface. For every DDlog program the compiler generates a library that exports the transactional API, which can be invoked from Rust or other languages. The compiler also generates a command-line interface program (CLI) that allows users to interact with the DDlog program directly via a command line or a script. The CLI allows users to start transactions, insert or delete tuples in input relations, commit transactions, dump the contents of relations, and get statistics about resource consumption. The CLI is also used for regression testing and debugging.

Applications

Controller for Virtual Networks. The most significant DDlog program we have written so far is a reimplementation of OVN [21], a production-grade virtual network controller used to implement the network substrate for cloud management systems.

OVN translates a set of network management policies into OpenFlow rules that have to be installed on the virtual switches in the network. The logic is very complicated, comprising tens of input, output and intermediate relations.

The original program was written in C, and is not fully incremental. The DDlog implementation has about 6000 lines of code, about the same size as the original code base, but it is fully incremental. With the exception of a small number of library functions imported from C, we were able to implement the entire OVN logic in DDlog. This would not be feasible with a more traditional dialect of Datalog that does not support types and expressions.

Firewall management. We have re-implemented a proprietary network management application in DDlog. The application manages a firewall in a network of switches and virtual machines (VMs). The firewall is driven by a centralized policy; when the policy changes, the local rules have to be updated in all network devices. The core of this program is a graph reachability problem in a directed graph, which is a recursive query written in a few lines of DDlog.

We compare the performance of the DDlog implementation (blue lines) with a production-ready, heavily optimized, hand-written incremental Java program (pink lines). The Java program has several thousand lines of code. The graphs are synthetic network topologies; the average node degree is 2.

Figure 8 shows execution time and memory consumption of the two implementations. In every plot, the X axis shows the graph size in nodes and the Y axis shows the measured quantity (time or memory). The DDlog program performs several times better than the hand-optimized Java implementation.

Related work

A survey of Datalog engines can be found in [13]. Here we focus on incremental evaluation; a survey of incremental evaluation is [9]. Notable algorithms include Delete-Rederive (DRed) [10] and FOIES [6]. Saha [24] provides an algorithm for tabled logic programs. The Backward-Forward algorithm [19] improves DRed under some circumstances. IncA [25,26] is a Datalog dialect for incremental program analysis; it introduces the DRedL algorithm. Another class of algorithms uses provenance to perform incremental computation [12]. Several recent papers describe systems that use incremental evaluation for relational computation models: [2,27].

The only other incremental Datalog engine that we are aware of is LogiQL [8], a commercial product of LogicBlox [3]. Unfortunately, there is no published data about the performance of the LogicBlox incremental engine.

DDlog is built on top of Differential Dataflow [14]; several declarative incremental query engines generalizing Datalog were built on top of Differential Dataflow [20,17]. Some of the DDlog features were inspired by .NET LINQ [18].

Conclusion and future work

DDlog is a young project. The language is evolving quickly, driven by the use cases. We place paramount importance on language usability; this is why we have enhanced Datalog with many non-traditional constructs. Our goal is to reduce as much as possible the need to transition between multiple languages when writing large projects.

Our ongoing work on DDlog focuses on continuous improvement of its performance and memory utilization, as well as use-case-driven evolution of its syntax, features, and libraries. We welcome any contributions or users!

Fig. 1. Incremental evaluation of a Datalog program.

Fig. 2. Type declarations in DDlog. None, Some, IPv6Address are pattern constructors.

Fig. 3. A graph described by two relations and a rule to compute paths in the graph that exclude some nodes.

    function lastByte(a: OptionalIPAddress): bit<8> = {
        match (a) {
            None -> 0,
            Some{IPv4Address{.ipv4addr = addr}} -> addr[7:0],
            Some{IPv6Address{.ipv6addr = addr}} -> addr[7:0]
        }
    }

    relation Host(address: OptionalIPAddress)

    // Rule that performs matching on address structure
    IPv6Addr(addr) :- Host(.address = Some{IPv6Address{addr}}).

Fig. 5. Pattern matching used in a DDlog function and in a rule.

Fig. 6. Operations on collections: iteration, flattening, aggregation.

Fig. 7. DDlog compilation flow.

Fig. 8. DDlog performance as a function of the graph size: (1) the time taken to execute the non-incremental reachability computation (inserting all nodes and edges in a single transaction); (2) the peak memory consumption; (3) the time to insert an additional 12% edges into the graph; and (4) the time to delete 3% of the edges.

    input relation Height(object_id: int, height: int)
    input relation Width(object_id: int, width: int)
    output relation Area(object_id: int, area: int)

    // Compute the area of an object as the product of
    // its height and width.
    Area(oid, area) :- Height(oid, h), Width(oid, w),
                       var area = w * h.

    // Alternative syntax for defining the same relation.
    for o in Height // o is a tuple with fields (object_id, height)
        for o1 in Width if o1.object_id == o.object_id
            Area(o.object_id, o1.width * o.height)

Fig. 4. Rule examples. The first rule uses expressions and variables; var introduces a variable binding. The second rule is equivalent to the first one, but is written using an imperative-style syntax.

Footnote 1. In a typical use case, a DDlog program is used in conjunction with a persistent database; database records are fed to DDlog as inputs and the derived facts computed by DDlog are written back to the database; DDlog does not include a storage engine.

Footnote 2. The core engine of DDlog is Differential Dataflow [14], which supports distributed computation over partitioned data; we may add this capability in the future.

References

[1] Differential Datalog. December 2018.

[2] Y. Ahmad, O. Kennedy, C. Koch, and M. Nikolic. DBToaster: Higher-order delta processing for dynamic, frequently fresh views. Proceedings of the VLDB Endowment 5, 2012.

[3] M. Aref, B. ten Cate, T. J. Green, B. Kimelfeld, D. Olteanu, E. Pasalic, T. L. Veldhuizen, and G. Washburn. Design and implementation of the LogicBlox system. International Conf. on Management of Data (SIGMOD), 2015.

[4] S. Boag, D. Chamberlin, M. F. Fernández, D. Florescu, J. Robie, J. Siméon, and M. Ştefănescu. XQuery 1.0: An XML query language. 2002. March 2019.

[5] M. Bravenboer and Y. Smaragdakis. Strictly declarative specification of sophisticated points-to analyses. Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA), 2009.

[6] G. Dong and J. Su. First-order incremental evaluation of Datalog queries. Database Programming Languages (DBPL), London, UK, 1994.

[7] L. Fegaras and D. Maier. Towards an effective calculus for object query languages. ACM SIGMOD Record 24, ACM, 1995.

[8] T. J. Green. LogiQL: A declarative language for enterprise applications. Symposium on Principles of Database Systems (PODS), 2015.

[9] A. Gupta and I. S. Mumick. Maintenance of materialized views: Problems, techniques, and applications. IEEE Data Eng. Bull. 18(2), 1995.

[10] A. Gupta, I. S. Mumick, and V. S. Subrahmanian. Maintaining views incrementally. International Conf. on Management of Data (SIGMOD), 1993.

[11] Kubernetes: Production-grade container orchestration.

[12] M. Liu, N. E. Taylor, W. Zhou, Z. G. Ives, and B. T. Loo. Recursive computation of regions and connectivity in networks. International Conference on Data Engineering (ICDE), Shanghai, China, 29 March-2 April 2009.

[13] D. Maier, K. T. Tekle, M. Kifer, and D. S. Warren. Datalog: Concepts, history, and outlook. In M. Kifer and Y. A. Liu (eds.), Declarative Logic Programming, 2018.

[14] F. McSherry. Differential Dataflow. January 2019.

[15] F. McSherry. Differential Dataflow API reference. March 2019.

[16] F. McSherry. Differential Dataflow documentation. March 2019.

[17] F. McSherry, D. Murray, R. Isaacs, and M. Isard. Differential Dataflow. Conference on Innovative Data Systems Research (CIDR), January 2013.

[18] E. Meijer, W. Schulte, and G. Bierman. Unifying tables, objects and documents. International Workshop on Declarative Programming in the Context of Object-Oriented Languages (DPCOOL), Uppsala, Sweden, August 25, 2003.

[19] B. Motik, Y. Nenov, R. Piro, and I. Horrocks. Incremental update of Datalog materialisation: The backward/forward algorithm. AAAI Conference on Artificial Intelligence (AAAI), Austin, TX, January 25-30, 2015.

[20] D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: A timely dataflow system. ACM Symposium on Operating Systems Principles (SOSP), Farmington, Pennsylvania, 2013.

[21] OVN: Open Virtual Network architecture. March 2019.

[22] L. Ryzhyk and M. Budiu. Differential Datalog (DDlog) language reference. December 2018.

[23] L. Ryzhyk and M. Budiu. A differential Datalog (DDlog) tutorial. December 2018.

[24] D. Saha and C. R. Ramakrishnan. Incremental evaluation of tabled logic programs. International Conference on Logic Programming (ICLP), 2003.

[25] T. Szabó, G. Bergmann, S. Erdweg, and M. Voelter. Incrementalizing lattice-based program analyses in Datalog. Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), Boston, MA, Oct. 2018.

[26] T. Szabó, S. Erdweg, and M. Voelter. IncA: A DSL for the definition of incremental program analyses. International Conference on Automated Software Engineering (ASE), 2016.

[27] W. Zhao, F. Rusu, B. Dong, K. Wu, and P. Nugent. Incremental view maintenance over array data. ACM International Conference on Management of Data (ICMD), ACM, 2017.