Parallel OWL 2 RL Materialisation in Centralised, Main-Memory RDF Systems Boris Motik, Yavor Nenov, Robert Piro, Ian Horrocks and Dan Olteanu firstname.lastname@cs.ox.ac.uk Department of Computer Science, Oxford University, Oxford, United Kingdom Abstract. We present a novel approach to parallel materialisation (i.e., fixpoint computation) of OWL RL Knowledge Bases in centralised, main-memory, multi- core RDF systems. Our approach comprises a datalog reasoning algorithm that evenly distributes the workload to cores, and an RDF indexing data structure that supports efficient, ‘mostly’ lock-free parallel updates. Our empirical evaluation shows that our approach parallelises computation very well so, with 16 physical cores, materialisation can be up to 13.9 times faster than with just one core. 1 Introduction The OWL 2 RL profile forms a fragment of datalog so that reasoning over OWL 2 RL knowledge bases (KB) can straightforwardly be rendered into answering datalog queries. These datalog queries are either the OWL 2 RL/RDF rules [18, Section 4.3] applied to the KB or the datalog translations [10] of the OWL 2 RL axioms of the KB applied to the data portion (ABox) of the KB. Answering datalog queries can be solved by backward chaining [2, 23], or one can materialise all consequences of the rules and the data so that subsequent queries can be answered without the rules. Mate- rialisation supports efficient querying, so it is commonly used in practice, but it is also very expensive. We show that materialisation can be efficiently parallelised on modern multi-core systems. In addition, main-memory databases have been gaining momen- tum in academia and practice [16] due to the decreasing cost of RAM, so we focus on centralised, main-memory, multi-core RDF systems. We present a new materialisa- tion algorithm that evenly distributes the workload to cores, and an RDF data index- ing scheme that supports efficient ‘mostly’ lock-free data insertion. Our techniques are complementary to the ones for shared-nothing distributed RDF systems with nontrivial communication cost between the nodes [23, 20, 25]: each node can parallelise compu- tation and/or store RDF data using our approach. Materialisation has P-complete data complexity and is thus believed to be inher- ently sequential. Nevertheless, many practical parallelisation techniques have been de- veloped; we discuss these using the following OWL 2 RL example axioms and their translation into datalog rules. AvB A(x, y) → B(x, y) (R1) C ◦E vD C(x, y) ∧ E(y, z) → D(x, z) (R2) D◦E vE D(x, y) ∧ E(y, z) → C(x, z) (R3) Interquery parallelism identifies rules that can be evaluated in parallel. For example, rules (R2) and (R3) must be evaluated jointly since C and D are mutually dependent, but rule (R1) is independent since B is independent from C and D. Such an approach does not guarantee a balanced workload distribution: for example, the evaluation of (R2) and (R3) might be more costly than of (R1); moreover, the number of independent com- ponents (two in our example) limits the degree of parallelism. Intraquery parallelism assigns distinct rule instantiations to threads by constraining variables in rules to domain subsets [7, 21, 9, 29, 27]. For example, with N threads and assuming that all objects are represented as integers, the ith thread can evaluate (R1)–(R3) with (x mod N = i) added to the rules’ antecedents. Such a static partitioning does not guarantee an even workload distribution due to data skew. Systems such as WebPIE [23], Marvin [20], C/MPI [25]; DynamiTE [24], and [13] use variants of these approaches to support OWL 2 RL fragments such as RDFS or pD∗ [22]. In contrast, we handle general, recursive datalog rules using a parallel variant of the seminaı̈ve algorithm [2]. Each thread extracts a fact from the database and matches it to the rules; for example, given a fact E(a, b), a thread will match it to atom E(y, z) in rule (R2) and evaluate subquery C(x, b) to derive the rule’s consequences, and it will handle rule (R3) analogously. We thus obtain independent subqueries, each of which is evalu- ated on a distinct thread. The difference in subquery evaluation times does not matter because the number of queries is proportional to the number of tuples, and threads are fully loaded. We thus partition rule instantiations dynamically (i.e., as threads become free), unlike static partitioning which is predetermined and thus susceptible to skew. To support this idea in practice, an RDF storage scheme is needed that (i) supports efficient evaluation of subqueries, and (ii) can be efficiently updated in parallel. To sat- isfy (i), indexes over RDF data are needed. Hexastore [26] and RDF-3X [19] provide six-fold sorted indexes that support merge joins and allow for a high degree of data compression. Such approaches may be efficient if data is static, but data changes con- tinuously during materialisation so maintaining sorted indexes or re-compressing data can be costly and difficult to parallelise. Storage schemes based on columnar databases [15] with vertical partitioning [1] suffer from similar problems. To satisfy both (i) and (ii), we use hash-based indexes that can efficiently match all RDF atoms (i.e., RDF triples in which some terms are replaced with variables) and thus support the index nested loops join. Hash table access can be easily parallelised, which allows us to support ‘mostly’ lock-free [14] updates: most of the time, at least one thread is guaranteed to make progress regardless of the remaining threads; how- ever, threads do occasionally resort to localised locking. Lock-free data structures are resilient to adverse thread scheduling and thus often parallelise better than lock-based ones. Compared to the sort-merge [3] and hash join [4] algorithms, the index nested loops join with hash indexes exhibits random memory access which is potentially less efficient than sequential access, but our experiments suggest that hyperthreading and a high degree of parallelism can compensate for this drawback. We have implemented our approach in a new system called RDFox and have eval- uated its performance on several synthetic and real-world datasets. Parallelisation was beneficial in all cases, achieving a speedup in materialisation times of up to 13.9 with 16 physical cores, rising up to 19.3 with 32 virtual cores obtained by hyperthreading. Our system also proved competitive with OWLIM-Lite (a commercial RDF system) and our implementation of the seminaı̈ve algorithm without parallelisation on top of PostgreSQL and MonetDB, with the latter systems running on a RAM disk. We did not independently evaluate query answering; however, queries are continuously answered during materialisation, so we believe that our results show that our data indexing scheme also supports efficient query answering over RDF data. 2 Preliminaries A term is a resource (i.e., a constant) or a variable; we denote terms with t, and vari- ables with x, y, and z. An (RDF) atom is a triple hs, p, oi of terms called the subject, predicate, and object, respectively. A fact is a variable-free atom. A rule r has the form (1), where H is the head atom, and B1 , . . . , Bn are body atoms; we let h(r) ··= H and bi (r) ··= Bi . B1 ∧ . . . ∧ Bn → H (1) Rules must be safe: each variable in H must occur in some Bi . A program P is a finite set of possibly recursive rules. The materialisation (i.e., the fixpoint) P ∞ (I) of a finite set of facts I with P , a substitution σ and its application Aσ to an atom A, and the composition τ θ of substitutions τ and θ are defined as usual [2]. Our RDF system does not use the fixed OWL 2 RL/RDF rule set [18, Section 4.3]. Instead, we translate the OWL 2 RL axioms of the given KB into a datalog program; such programs generally contain simpler rules with fewer body atoms and are thus easier to evaluate. Our approach is, however, also applicable to OWL 2 RL/RDF rules. 3 Parallel Datalog Materialisation We now present our algorithm that, given a datalog program P and a finite set of facts I, computes P ∞ (I) on N threads. The data structure storing these facts (which we iden- tify with I) must support several abstract operations: I.add(F ) should check whether I contains a fact F and add it to I if not; moreover, I should provide an iterator factsI where factsI .next returns a not yet returned fact or ε if such a fact does not exist, and factsI .hasNext returns true if I contains a not yet returned fact. These operations need not enjoy the ACID1 properties, but they must be linearisable [14]: each asynchronous sequence of calls should appear to happen in a sequential order, with the effect of each call taking place at an instant between the call’s invocation and response. Accesses to I thus does not require external synchronisation via locks or critical sections. Furthermore, I must support an interface for answering conjunctive queries con- strained to a subset of I; the latter will be used to prevent repeated derivations by a rule. To formalise this, we assume that I can be viewed as a vector; then, for F ∈ I a fact, I