1. Introduction

Italian Symposium on Advanced Database Systems, June

An Overview of Vadalog: a System for Reasoning over Large Knowledge Graphs

Luigi Bellomari n

i DavideBenedetto

EmanuelSallinge

Banca d'Italia

Italy

Workshop Proceedings

0 TU Wien, Faculty of Informatics , Vienna , Austria 1 Università Roma Tre, Department of Computer Science and Engineering , Rome , Italy 2 University of Oxford, Department of Computer Science , Oxford , UK

2022

1 9 22

As a main requirement, a knowledge representation and reasoning language must exhibit a good tradeof between expressive power and computational complexity of reasoning. Warded Datalog+/-, a fragment from the Datalog+/- family, satisfies these requirements, by ofering high expressive power and at the same time preserving tractability of query answering. The Vadalog system, a knowledge graph management system developed by the University of Oxford and Banca d'Italia, adopts Warded Datalog+/as its core language. This enables Vadalog to solve many ontological reasoning tasks including those where complex graph traversal operations are involved. To control graph navigation, Vadalog combines diferent rules prioritization policies and join implementations. In this paper, a short version of our recent contribution appeared in the Journal of Information Systems, we describe Vadalog from an architectural point of view, focusing on the execution model, graph traversal strategies, and join algorithms. We also provide an experimental evaluation of the system.

automated reasoning knowledge representation datalog knowledge graph vadalog

1. Introduction

We are witnessing a renewed attention in the use of logic-based languages for Knowledge Representation and Reasoning (KRR) on Knowledge Graphs (KG) from both the academia and the industry, with many companies such as Linked1In] a[nd Amazon [ 2 ] confirming the adoption of the Datalog languag3e].[In the literature, there is no common agreement on a shared definition for KGs [ 4 ], but the following common trait emerges: a KG is a large network of entities, and instances for those entities, describing real-world objects and their interrelations with specific reference to a domain or to an organizatio5n]. [From a technical standpoint, a knowledge graph is a semistructured data model characterized by an extensional, intensional and derived extensional componen6t][. The intensional component is generally modeled with a KRR formalism, in the form of ontologies, which are evaluated over enterprise data sources (extensional data) to enrich the knowledge graph with new (derived) knowledge. The quest for an ideal KRR language that enables reasoning with large knowledge graphs spawned substantial research in the database community, and many languages have been intensively investigated and proposed under a number of diferent names7][, which we commonly share here as the Datalog± class 8[]. Datalog± extends Datalog with additional features, of which the most relevant is existential quantification (whence the “+”), while resorting to syntactic restrictions (“-”) to guarantee decidability and, in some cases, tractability of query answering. Many Datalog± fragments have been investigated8,[ 9, 10, 11, 12 ] and each of them provides diferent expressive power and computational complexity. Among these, Warded Da±tagluoagrantees tractability1[ 3 ], being a good KRR language candidate, since it exhibits high expressive power, being able to model complex real domains, for instance requiring full recursion and existential quantification, as well as low computational complexity, enabling scalability in practice. The Vadalog system is a state-of-the-art knowledge graph management system (KGMS), that adopts Warded Datalo±gas its core language. It has been widely presented in many wo5r,ks [ 14, 15, 16 ], and largely used for solving many industrial tasks, particula1r7l,y1i8n, 1[ 9 ]. Discussion Topics. In this paper we examine the system under the hood, by analyzing the reasoning algorithms and the architectural choices that make Vadalog a fully fledged KGMS. In particular, we discuss about threeasoning termination strategy based on the isomorphism check; the system architecture of the Vadalog system and its internals, with particular attention to the runtime model, the routing strategies and the cycle and termination management. We also propose anexperimental analysis that evaluates the system in both real and synthetic settings. Overview. The remainder of this paper is organized as follows: in the next section we describe the fundamental notions that address reasoning with Vadalog; in Se3ctwioendiscuss the main architectural choices that make Vadalog a fully fledged KGMS; the experimental discussion is provided in Section4; in Section5 we draw our conclusions.

2. Reasoning with Vadalog

Vadalog is the reasoning language of the Vadalog system. It belongs to the Da±t afalmogily of languages that generalizes Datalog rules by introducing existential quantification in rule heads and enriches it with additional features of practical utility, while posing syntactical restrictions to achieve decidability and data tractability. The logicalVcaodraelofg extends to Warded Datalo±g[ 20 ], which captures plain Datalog as well as SPARQL queries under the entailment regime for OWL 2 QL21[] and is able to perform ontological reasoning tasks. A Vadalog program consists of a seteoxifstential rules, ortuple-generating dependencies of the form∀∀̄ ((̄, ̄)̄ → ∃ (̄, ̄ ))̄ , where (the body) and (the head) are conjunctions of atoms with constants and variables. For brevity, we write this existential r u(l,ē)̄ a→s ∃ (̄, ̄ )̄ and replace∧ with comma to denote conjunction of atoms. Intuitively, the semantics of such a rule is the following: for each f(ac,t̄ ′̄) that occurs in an instanc e,there exists a tuple″̄ of constants and nulls such that the fac(ts,̄ ″̄) are also i n.

The Chase Procedure. The semantics of a set of Datalog rulΣesover a database , denoted Σ() , is usually specified in an operational way via the well known chase proced2u2]r.e [ It expands with facts derived via the application of ruleΣs ofver , into a new database chase(D,Σ), possibly containing fresh nulls as placeholders for the existentially quantified objects. A rule = (, ̄)̄ → ∃ (̄, ̄ )̄ is applicable to Σ() if there is a unifier (i.e a mapping from variables to constants) such that( ̄ , ̄ ) ⊆ Σ() and ( ̄ ) does not belong toΣ() . If is applicable toΣ() with a unifier , then it performs achase step i.e., it generates new facts

( ̄ ′) that are added toΣ() , where ̄ = ̄ ′. The chase performs chase steps until no rule in Σ is applicable.

Example 1. Consider the following set of rules, which provide a simple description of how influence of individuals propagates in company networks. 1 ∶ C() → ∃ 2 ∶ C(, ) → 3 ∶ C(, ),

Ceo(, ).

Influences (, ).

I (, ) →

Influences (, ). 4 ∶ Influences (, ),

Influences (, ), ≠ →

L(, ) .

By Rule 1, for every company , there is a CEO . The CEO of a company exerts influence over it (Rule 2). By Rule 3, if a company controls (i.e., controls the majority of voting rights) a company , then a person exerting influence over also exerts influence over as well. Finally, by Rule 4, companies influenced by the same person are linked to each other.

Let us analyze how the rules are applied on DCo=m{pany(a),Company(b),Ceo(Bob,a),Control(a,b), Influences (Bob,c)}, by considering thcehase procedure. By the application of Ru2lwee obtain the influence of Bob on the company a. Then Rul3e can be applied by unifying with the factsControl(a,b) andInfluences(Bob,a) , thus inferring that Bob influences the company b. Finally, by the application of rul4eon the factsInfluences(Bob,a), Influences(Bob,b) andInfluences(Bob,c) an we can conclude that there is a link between the companies a and b, a and c, b and c. Termination and Recursion Control. Reasoning tasks for Datalog with existential quantifiers are undecidable in general. In fact, the chase procedure for Datalog with existential quantification may in general not terminate due to the possibility of producing unboundedly many labelled nulls. However, we have seen that Warded Da±taislodgecidable and reasoning with it isPTIME [ 13 ]. Intuitively, the key idea to guarantee chase termination is to tame the propagation on the labelled nulls. This is achievable by exploiting the so-called termination strategy, that blocks the generation of facts that have been produced in previous chase steps. The Vadalog system termination strategy relies on the isomorphism termination strategy. We say that two facts are isomorphic if they refer to the same predicate name, they have the same constants in the same positions and there is a bijection between the labelled nulls. Yet, theoretical tractability results are far away from a practical algorithm. Ideally, a fact produced by the chase that is isomorphic to a previous generated one can be skipped since it is not relevant for answering to the reasoning task. To obtain a terminating, correct algorithm it is suficient to reformulate the notion oafpplicable rule in the chase step, introduced before. More details of the algorithm and the theoretical proof about its correctness can be fou1n4d].inW[e redefine the applicability as follows: we sa=y(, ̄)̄ → ∃ (̄, ̄ )̄ is applicable to Σ() if there is a unifier such that( ̄ , ̄ ) ⊆ Σ() where ̄ = ̄ ′ and ′, for each

and there are not facts iΣn() isomorphic t o( ̄ ′, ̄ ′), ∈ ,̄ is a fresh labeled null that does not occuΣr(i)n .

3. The Architecture of the Vadalog System

In this section we describe the architecture of the Vadalog system. Throughout the section we illustrate the main architectural choices that make the system a fully fledged KGMS. We give a description of i) the pipeline architecture, ii) the execution model, that is, how the rules are evaluated at runtime, iii) the reasoning routing strategies, i.e. diferent techniques to control rule prioritization applicability, iv) techniques for handling cycles and memory management. Overall, the Vadalog system is a memory-resident server exposrienagsoaning API with the following interfacer:eason(kg_ref,parameters). A client application issues calls to the reasoning API specifying a reference to a knowledge grkagp_href to activate the reasoning process upon: the Vadalog system handles a repository of knowledge graphs, with unique identifiers kg_ref; technically the KG consists of a set of rules annotated with bindings of predicates to external data sources, a compact representation of the extensional and intensional components. The engine computes the answer to the reasoning task and returns a representation of the output facts. From now on, the terms scan, filter and rule are used interchangeably. Pipeline Architecture. The Vadalog system implementation is based on the pipes and filters architectural style, a pattern used to process streaming data flows. In this pattern, data streams throughout a pipeline of filters, each of those receives data from the input flow, applies the required transformations, generates the output flow and passes it to the next filter through a connector called pipe. In our settings, this pipeline is build up by compiling a set of Vadalog rules where the input filters are represented by input rules (e.g. EDB predicates in a rule body) and the output filters are assumed by queried predicates of the reasoning task. The compilation process is divided in three distinct parts. At the first stage, the compilation process involves multiple rewriting actions, acted b yloagic optimizer. At the second stage the optimized rules received from the parser are transformed into a reasoning access planlobgyicthcoempiler. The logic compiler acts as a planner: it produces a static pipeline in the form of a predicate graph where each node (filter) is represented by an atom from the set of rules, and there is an edge (pipe) from a nodem to a noden if there is a rule that unifies fromm to n. At the last stage, a query compiler converts the logic pipeline into a reasoning query plan, where the nodes are translated into active data scans, connected by intermediate bufers. To enable the data flow consumption and production, input and output record managers are attached to the start and to the end of the pipeline, respectively. In the reasoning pipeline, input record managers are connected with virtual input scans, while output record managers receive data from output scans; in-between these two, we have three diferent scans, that symbolize the behaviour of the linear (one body atom), join TGDs (more than one body atom).

Execution Model. The Vadalog system does not directly adopt the chase procedure, but follows the architecture of traditional relational DBMSs and its execution model is a generalization of the volcano iterator model [23] extended to the entire reasoning query plan pipeline. The reasoning process of the Vadalog system follows a pull based (query driven) approach, where each filter reads facts from respective parent. The pull action starts from sink nodes, which asks for output facts and propagate the request down to its predecessors, triggering the invocations along the pipeline searching for available facts. Filters interact with each others using proimpeinti(v)e,s next(), get(), close(), which respectively open the parent stream, ask for the presence of a fact to fetch, obtain it, and close the communication. The request is opened by the sink filter which issues anext() call, that is propagated to the parent filters until the input filters are reached. These access data from the external sources, converts it to ground facts and contribute to feed the reasoning process with new data. Facts stream through the pipeline up to the requestors and are stored into local bufer caches to be available for multiple consumers. Reasoning Routing Strategies. The common strategy adopted by the chase procedure satisfies the filters applicability in a breadth first fashion. This behaviour is in general a good compromise when a priori order of rule applicability is not defined, since it guarantees a good balance among the filters. In [ 15 ] we studied how the chase can be leveraged in multiple non-deterministic choices, analyzing several scenarios where the breadth-first policy for the applicability of the rules is not always an adequate solution. We formalize this as the rule routing problem, that is the problem of filters having to decide which source to invoke, when multiple choices are possible. For instance, consider the set of Vadalog rules in the exa2m.ple Example 2. Given the following set of rules, the desired output are facts for q 1 ∶ edb1(x,y) → f(x,y) 2 ∶ edb2(x,y) → h(y) 3 ∶ h(x) → ∃z p(x,z) 4 ∶ f(x,y), p(x,k) → q(x,y,k) 5 ∶ q(x,y,z) → f(x,x) 6 ∶ edb3(x,y), p(y,z) → p(x,z) The filters 3, 4 and 5 can fetch data from alternative parent filters for their body predicates. For instance, 3 can fetch facts from 2; 4 can fetch facts from 1 and 5 or 3 and 6; then 6 can fetch facts from 3 and itself. Consider now the first activation of Rule 6, for example triggered by a next call from Rule 4, which contains the desired output q. The filter 6 has to take a non-deterministic routing choice, about issuingnaext() call to alternatively itself (recursively) or 3. It is intuitive that if the first alternative is issued before, no result would be fetched and an extra call to 3 and 6 would be necessary. It is easy to see that if filter 5 would take this sub-optimal sequence of iflter invocations each time it is activated, the system would issue a huge number of useless calls, with clear performance loss. Many routing strategies exist, each implementing a diverse behaviour when issuing rule applicability order. We now briefly present the four strategies mostly relevant for our discussion.

Round-Robin. (RR) It is the default and simplest compile time ordering strategy. For each filter, an arbitrary order of source filters is established and fixed. Intuitively, this strategy fosters a breadth-first expansion of the chase.

Recursive-last. (RL) It a compile time ordering strategy. For each filter, the set of source filters is sorted by ascending the non-recursive ones. Clearly, by recursive filters we mean rules where at least one body atom is mutually recursive with the head. The rest of the rules are the non-recursive ones. This strategy reduces the number of failure calls, since it allows to feed the pipeline with facts available for recursive calls with invocations to base case rules. Shortest EDB Path (SEP). This compile time ordering strategy is a refinement of RR. Let us define the relative depthΣ() of a rul e∶ (, ̄)̄ → ∃ (̄, ̄ )̄ , in a set of Vadalog ruleΣsas follows. Let be the reasoning query plan built froΣm.We define Σ() as max{ 1, … , }, where (a) SynthA (b) SynthB (c) Company Own is the length of the shortest path infrom to the nearest rule′ that produces facts for the atom in , such that ′ contains only EDB in the body; intuitivel y,measures the distance of from the nearest EDB. If contains multiple body atoms, then it considers only the atom having the highest distance to the nearest EDB.

Random (R). This runtime ordering strategy adopts a random policy to select the candidate filter. It is particularly useful as a baseline to evaluate other strategies.

Cycle Management In the runtime execution of a reasoning task, we deal with two kinds of cycles:runtime invocation cycles andnon-terminating sequences. Runtime cycles are recursive invocation paths of thneext() primitive. As an example, let us consider the following sequence of filter invocations ← ← ← ← . If cannot fulfil the recursion with any fact from other recursive or base cases, we are still not allowed to interrupt the iteration, as the required facts for may derive from other recursive cases to be explored first.

We distinguish the absence of facts in cyclic casceyscl(ic miss) from the actual absence of further factsr(eal miss). The first case denotes that a specific routing resulted in a circular invocation of thenext() primitive: it results in a failure, but the source is not discarded, because a following invocation may instead produce new facts for it. On the other hand, a real miss is permanent and denotes that a filter does not have available facts anymore and will not have them in the future, thus must be discarded. In case of cyclic misses, the invoked filter notifies the caller withnaotifyCycle() primitive, that flows along the pipeline from called filters to caller. It is now clear that the specific routing policy is essential to balance the exploration of all the possible recursive and base cases. Non-terminating sequences directly derive from the presence of recursion (and hence cycles) and a recursive pipeline may generate infinite facts. To cope with this, the Vadalog system featuretseramination strategy wrappers, a component that prevents the generation of isomorphic facts, as described in Sec2.tion

4. Experimental Evaluation

The Vadalog system has been largely evaluated in previous w1o4r,k1s5[]. In Section3, we described the many alternative routing strategies for reasoning. This raises the question how these diferent strategies afect the performance of the system. Thus in this section our discussion concentrates on showing how the four routing strategies impacts the performance of Vadalog in synthetic and a real world settings.

Test Setup. We invoked Vadalog via its REST interface and used CSV data to make tests independent of host-side optimizations. We ran each experiment ten times, averaging the elapsed times. All the experimental evaluations are run with a local installation of the Vadalog system in a MacBook Pro i7 — 2.5 GHz and 16 GB of RAM. The engine is compiled with JDK 9 and reasoning tasks are executed without any use of concurrency or distribution techniques. Synthetic Scenarios. We analyzed the behaviour of the four routing strategies in two scenarios coming from the set of synthetic benchmarks proposed i1n5],[created withiWarded [24], a configurable tool for generating Warded Data±loprgograms. Each scenario comprises 100 rules, with an alternating number of recursive and non-recursive SryunltehsA. is made of 70 non-recursive and 30 recursive rulSeys.nthB is made of 55 recursive and 45 non-recursive rules. The results are depicted in figures 1(a) and 1(b). We can observe that in presence of intensive recursion SEP and RL perform better than the other two strategies.

Real Scenario. We analyzed the behaviour of the four routing strategies at scale. We ran a basic reachability task in dense networks of 50, 150, 250, 350, 500 companies, built as scale-free networks with parameters (controlling distribution, density) fitted from the European graphs. Example 3. We used the following set of Vadalog rules to compute reachability.

2 ∶ Pℎ(, ), 1 ∶ O(, , ) →

O(, , ) → 3 ∶ Pℎ(, ) →

Pℎ(, ).

The results in Figure 1(c) confirm the trend of RR and R shown in the synthetic experiments, demonstrating that these two strategies are not ideal in recursive settings. In fact, the rule applicability order adopted by these two strategy is only sub-optimal, because they are not able to prioritize base over the recursive cases thus leading to perfomance loss. In this case SEP enables best performance. This is addressable to the inability of RL to distinguish which recursive filter to opt for, whereas SEP applies a global heuristic.

5. Conclusion

In this paper we gave an overview of the Vadalog system, a modern architecture for automated reasoning in knowledge graphs. We discussed the main problems in implementing a fully lfedged KGMS, with particular attention to the strategies for handling termination and cycle management. We also opened a discussion about the rule routing problem, the problem of iflters having to decide which source to invoke, when multiple choices are possible and we described a set of routing strategies which adopt diverse heuristic to handle recursive cases and we experimented the impact on the performance of these strategies in the Vadalog system. [21] B. Glimm, C. Ogbuji, S. Hawke, I. Herman, B. Parsia, A. Polleres, A. Seaborne, SPARQL 1.1 entailment regimes, 2013. W3C Recommendation 21 March 2013, 2013. [22] R. Fagin, P. Kolaitis, R. Miller, L. Popa, Data exchange: Semantics and query answering, in: ICDT, 2003, pp. 207–224. [23] G. Graefe, W. J. McKenna, The volcano optimizer generator: Extensibility and eficient search, in: ICDE, 1993, pp. 209–218. [24] T. Baldazzi, L. Bellomarini, E. Sallinger, P. Atzeni, iwarded: A system for benchmarking datalog+/-reasoning (technical report), arXiv preprint arXiv:2103.08588 (2021).

[1]

W. E.

Moustafa ,

Papavasileiou ,

Yocum ,

Deutsch , Datalography: Scaling datalog graph analytics on graph processing systems , in: BigData, IEEE, 2016 , pp. 56 - 65 .

[2]

Zhao ,

Subotic ,

Scholz , Debugging large-scale datalog: A scalable provenance evaluation strategy , ACM Trans. Program. Lang. Syst . 42 ( 2020 ) 7: 1 - 7 : 35 .

[3]

Abiteboul ,

Hull ,

Vianu , Foundations of databases, volume 8 , Addison-Wesley

Reading

, 1995 .

[4]

Ehrlinger , W. Wöß, Towards a definition of knowledge graphs, in: SEMANTiCS (Posters , Demos, SuCCESS), 2016 .

[5]

Bellomarini ,

Fakhoury ,

Gottlob , E. Sallinger, Knowledge graphs and enterprise ai: The promise of an enabling technology , in: 2019 IEEE 35th International Conference on Data Engineering (ICDE) , 2019 , pp. 26 - 37 . do1i : 0 .1109/ICDE. 2019 . 00011 .

[6]

Bellomarini ,

Sallinger ,

Vahdati , Knowledge graphs: The layered perspective , in: Knowledge Graphs and Big Data Processing , Springer, Cham, 2020 , pp. 20 - 34 .

[7]

Baget ,

Leclère , M. Mugnier, Walking the decidability line for rules with existential variables , in: KR, AAAI Press, 2010 .

[8]

Calì , G. Gottlob,

Lukasiewicz , A general datalog-based framework for tractable query answering over ontologies , in: PODS , 2009 , pp. 77 - 86 .

[9]

Calì ,

Gottlob ,

Pieris , Towards more expressive ontology languages: The query answering problem , Artificial Intelligence 193 ( 2012 ) 87 - 128 .

[10]

Baget ,

Mugnier ,

Rudolph , M. Thomazo, Walking the complexity lines for generalized guarded existential rules , in: IJCAI , 2011 , pp. 712 - 717 .

[11]

Calì , G. Gottlob, M. Kifer, Taming the infinite chase: Query answering under expressive relational constraints , J. Artif. Intell. Res . 48 ( 2013 ) 115 - 174 .

[12]

Leone ,

Manna , G. Terracina,

Veltri , Eficiently computable data∃lopgrograms , in: G. Brewka,

Eiter , S. A . McIlraith (Eds.), KR 2012 , AAAI Press, 2012 .

[13]

Gottlob ,

Lukasiewicz ,

Marnette ,

Pieris , Datalog+/ -: A family of logical knowledge representation and query languages for new applications ( 2010 ).

[14]

Bellomarini , E. Sallinger, G. Gottlob, The vadalog system: Datalog-based reasoning for knowledge graphs , PVLDB 11 ( 2018 ) 975 - 987 .

[15]

Bellomarini ,

Benedetto ,

Gottlob , E. Sallinger, Vadalog: A modern architecture for automated reasoning with large knowledge graphs , Information Systems ( 2020 ) 101528 .

[16]

Bellomarini ,

Gottlob ,

Pieris , E. Sallinger, Swift logic for big data and knowledge graphs , in: IJCAI , 2017 , pp. 2 - 10 .

[17]

Atzeni ,

Bellomarini ,

Benedetto , E. Sallinger, Traversing knowledge graphs with good old (and new) joins ( 2021 ).

[18]

Atzeni ,

Bellomarini ,

Iezzi ,

Sallinger ,

Vlad , Weaving enterprise knowledge graphs: The case of company ownership graphs ., in: EDBT , 2020 , pp. 555 - 566 .

[19]

Bellomarini ,

Benedetti ,

Ceri ,

Gentili ,

Laurendi ,

Magnanimi ,

Nissl , E. Sallinger, Reasoning on company takeovers during the covid-19 crisis with knowledge graphs, in: RuleML+ RR (Supplement ), 2020 .

[20]

Gottlob ,

Pieris , Beyond SPARQL under OWL 2 QL entailment regime: Rules to the rescue , in: IJCAI , 2015 , pp. 2999 - 3007 .