Traversing Knowledge Graphs with Good Old (and New) Joins Paolo Atzeni1 , Luigi Bellomarini2 , Davide Benedetto1 , and Emanuel Sallinger3 1 Università Roma Tre 2 Banca d’Italia 3 TU Wien & University of Oxford 1 Introduction Knowledge Graphs (KGs) provide a concise and intuitive abstraction for a variety of domains where edges capture the (potentially recursive) relationships between the entities [9]. This is leading to the rise of systems and tools able to facilitate graph data modeling, processing and analysis, with prominent AI companies developing core systems based on the property graph model [2]. In this context, Datalog-based languages are being re-discovered to be ductile to accomplish reasoning tasks over complex property graphs as they provide the essential elements to enable graph navigational operations [3]. The semantics of a Datalog program is usually specificed in an operational way via the chase procedure [7]. It entails multiple nondeterministic choices such as the rule application order and the fact binding order when multiple uni- fication is possible [6]. In state-of-the-art reasoners, chase-based procedures are not directly adopted, but encoded in the form of engineered variations of the volcano iterator model [8] and so essentially within a pipe-and-filters architec- ture, where nodes (filters) are relational algebra operators and edges (pipes) are dependency connections between the rules. Such (potentially cyclic) structures, known as access plans, need to be translated into reasoning plans, where abstract relational algebra operators are transformed into specific project, select and join implementations: many implementations of each operator exist and it is up to the optimizer to choose the best one in terms of execution cost. This paper focuses on cases where the Datalog reasoning process involves a graph traversal task and investigates the connection between reasoning plans and graph traversal strategies. We move from the observation that the nondeter- ministic choices posed by the chase can be leveraged to control graph traversals —allowing to alternate breadth-first and depth-first strategies— and study the link of such choices with the reasoning plans. We conclude that in plans, specific join implementations and rule prioritization policies reflect the nondeterministic choices and exploit them to guide graph traversals in modern reasoners. Specif- ically, we implement our results in the vadalog System [3], a state-of-the-art knowledge graph management systems and conduct experimental evaluation. Copyright c 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2 P. Atzeni, L. Bellomarini, D. Benedetto, E. Sallinger 2 Knowledge Graphs Let us start from some some preliminary notions. A KG can be defined as semi- structured data model characterized by three components: (i) a ground exten- sional component (EDB), i.e., a set of relational constructs for schema and data which can be effectively modeled as a property graph; (ii) an intensional com- ponent (IDB), i.e., a set of inference rules over the constructs of the ground extensional component; (iii) a derived extensional component that can be pro- duced as the result of the application of the inference rules over the ground extensional component (with the so-called reasoning process) [4]. Reasoning in logic-based KGs substantiates in the application of rules (rep- resenting the IDB) to the EDB, in order to generate the derived extensional component by logical inference. This process is commonly known as forward chaining [1], typically applied via chase-based procedures [7]. KGs are particularly suited for the representation of domains with many interconnected entities: EDB is typically modeled as a property graph (PG), while IDB encode the traversal logic. We adopt a relational representation of EDBs and thus of PGs where nodes and edges are encoded as facts over relation symbols that are specific to the domain of interest. 3 Traversing Knowledge Graphs To uncover the relationship between reasoning plans and graph traversals in Datalog, we can start with a basic st-connectivity scenario [11].4 For nodes 𝑠 and 𝑡 of a directed graph, st-connectivity is the decision problem of establishing whether 𝑡 is reachable from 𝑠. Let us consider an example. Example 1. The following set Σ of Datalog rules reasons on st-connectivity be- tween Frankfurt and Zurich. EDB 𝐷 contains facts of the form Linked(𝑥, 𝑦), expressing that a city 𝑦 is directly reachable from 𝑥. The intensional predicate Connected denotes connectivity. Connected(x, y) : − Linked(x, y) (1) Connected(x, z) : − Connected(x, y), Linked(y, z) (2) > : − Connected(source, target), source = Frankfurt, target = Zurich (3) Let us analyze how Datalog rules are applied, by considering the chase procedure. The chase adds new facts to the source database 𝐷 until the final result Σ(𝐷) satisfies all the rules of Σ. Initially Σ(𝐷) = 𝐷. A unifier is a mapping from variables to constants. We say that a rule 𝜌 = 𝜑(¯ 𝑥 , 𝑦¯) → 𝜓(¯ 𝑥 ) is applicable 4 While specialized algorithms exist for st-connectivity [11], here we do not aim at providing new heuristics for the problem, but at showing how Datalog evaluation strategies materialize into different traversal algorithms. Traversing Knowledge Graphs with Joins 3 to Σ(𝐷) if there is a unifier 𝜃 𝜌 such that 𝜑(¯ 𝑥 𝜃 𝜌 , 𝑦¯𝜃 𝜌 ) ⊆ Σ(𝐷) and and 𝜓(¯ 𝑥𝜃𝜌) does not belong to Σ(𝐷). If 𝜌 is applicable to Σ(𝐷) with a unifier 𝜃 𝜌 , then it performs a chase step, i.e., it generates new facts 𝜓(¯ 𝑥 𝜃 𝜌0 ) that are added to Σ(𝐷), 0 where 𝑥¯𝜃 𝜌 = 𝑥¯𝜃 𝜌 . The chase performs chase steps until no rule in Σ is applicable. The chase poses two classes of nondeterministic choices: (i) for an applicable rule, multiple possible unifiers can exist and, (ii) multiple rules can be applicable at the same time. By handle we mean a mechanism by which a specific nonde- terministic choice in the chase can be leveraged to control the resulting graph traversal behaviour. We recognize two of them: – unification anatomy, i.e., controlling the application order of logical unifiers; – unification morphology, i.e., controlling the application order of rules. The unification anatomy induces an implicit order on the bound facts and, in- directly, the order of the EDB; choosing a specific applicable rule prioritizes the application of base vs inductive cases. A combination of the two handles can be used to define specific visits in the graph. Preferring inductive cases to base cases gives rise to depth-first exploration; vice versa, prioritizing base cases produces breadth-first ones. In depth-first traversals, nondeterministic choices of paths are more relevant than in breadh-first and are taken by prioritizing unifiers. The vadalog System does not directly adopt the chase procedure, but fol- lows the architecture of traditional relational DBMSs, encoding Datalog rules in terms of reasoning plans, where specific implementations of relational algebra operators are considered —multiple join versions exist— and a set of so-called routing strategies are used to decide on rule application priority. These two de- grees of freedom allow to act on the anatomy/morphology handles: different join implementations result in different unifiers being applied (anatomy) and the routing strategy is an encoding of the unification morphology. This means that graph traversal strategies in vadalog can be controlled by choosing rout- ing strategies and join implementations, opening the way to the development of graph-based optimizers that compile execution plans into reasoning plans on the basis of specific cost-based heuristics evaluated against the EDB (defining the structure of the graph) and the IDB (encoding the specific problem to be solved). For instance, for our st-connectivity instance, a hybrid depth-/breadth- first approach would pay off, by first optimistically trying multiple direct and deep connectivity paths (even driven by some heuristics, in more sophisticated settings) and eventually resorting to breadth-first search in case of failure. vadalog offers multiple routing strategies, e.g., round-robin (RR) and EDB- first, and join implementations, e.g., the standard nested-loop join (NLJ) and depth-search join (DSJ), an original implementation we present in this work, specifically devised for depth-first traversals. It is intuitive to understand how a combination of RR and NLJ can be used to simulate general purpose breadth- first traversals. In the next section we briefly describe DSJ, which addresses depth-first exploration. 4 P. Atzeni, L. Bellomarini, D. Benedetto, E. Sallinger Algorithm 1: Depth-search Join 1 static LSTACK, static RCUR; match = false; 2 L, RCUR_POS = LSTACK.pop(); ⊲ Skip visited tuples (cycle avoidance) 3 R = RCUR[RCUR_POS]; ⊲ Position is 0 by default 4 while !match do 5 match = tryJoin(L,R); 6 while !match and RCUR_POS