VLog: A Column-Oriented Datalog System for Large Knowledge Graphs*

Jacopo Urbani1, Ceriel Jacobs1, and Markus Krötzsch2
1 Dept. Computer Science, VU University Amsterdam, Amsterdam, The Netherlands
2 Center for Advancing Electronics Dresden (cfaed), TU Dresden, Germany

1 Introduction

Rule-based ontology languages are well-suited for data-intensive applications, and rules therefore remain an important topic in Semantic Web applications and research [7,8,9,10]. The common foundation of many rule languages is Datalog, which has recently been the focus of much renewed interest [4]. Modern Datalog systems such as LogicBlox, SociaLite, or EmptyHeaded now often compete successfully with state-of-the-art techniques in their target areas. In spite of these advances, scalability remains a major challenge. We therefore recently presented a new Datalog system, VLog (Vertical Datalog) [10], which exploits data management techniques from column stores to perform rule-based computation efficiently on large Knowledge Graphs (KGs). Compared to traditional row stores, column-based approaches have shown performance advantages on analytical workloads [6], but have so far been deployed mostly in relational DBMSs. With VLog, we show that Datalog too can benefit from column-store technology. This is not obvious, since the advantages of column stores are offset by a comparatively high cost of updates [2], whereas the iterative computation of Datalog query results can produce large numbers of derived facts that need to be inserted. Indeed, to the best of our knowledge, ours is the first work to successfully use column-based technology for Datalog processing. VLog can process arbitrary Datalog rules on top of a range of RDF stores and relational databases. The heart of VLog is an efficient materialization procedure that computes derivations bottom-up. To improve performance, VLog packs a significant number of carefully engineered optimizations on several levels.
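To make the bottom-up materialization idea concrete, the following is a minimal Python sketch of semi-naive evaluation: each round applies the rules only to facts derived in the previous round (the "delta") until a fixpoint is reached. This is an illustrative sketch only, not VLog's actual C++ implementation; the `tc_rule` example and all names are hypothetical.

```python
# Minimal sketch of semi-naive bottom-up materialization (illustrative only;
# VLog's real implementation is a heavily optimized C++ column store).

def semi_naive(edb, rules):
    """edb: dict predicate -> set of tuples.
    rules: functions that, given all known facts and the last delta,
    return an iterable of (predicate, tuple) derivations."""
    facts = {p: set(ts) for p, ts in edb.items()}
    delta = {p: set(ts) for p, ts in edb.items()}
    while any(delta.values()):
        new_delta = {}
        for rule in rules:
            for pred, tup in rule(facts, delta):
                # keep only genuinely new facts
                if tup not in facts.get(pred, set()):
                    new_delta.setdefault(pred, set()).add(tup)
        for pred, tups in new_delta.items():
            facts.setdefault(pred, set()).update(tups)
        delta = new_delta  # next round only considers fresh derivations
    return facts

# Example: transitive closure  path(X,Z) :- edge(X,Y), path(Y,Z).
def tc_rule(facts, delta):
    out = []
    # base case: every newly seen edge is a path
    for (x, y) in delta.get("edge", set()):
        out.append(("path", (x, y)))
    # recursive case, restricted to the last delta of "path"
    for (x, y) in facts.get("edge", set()):
        for (y2, z) in delta.get("path", set()):
            if y == y2:
                out.append(("path", (x, z)))
    return out

edb = {"edge": {("a", "b"), ("b", "c"), ("c", "d")}}
result = semi_naive(edb, [tc_rule])
# result["path"] contains all 6 reachable pairs over the 3-edge chain
```

Restricting the join to the last delta is what distinguishes semi-naive from naive evaluation: it avoids rederiving the same facts in every round.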
The goal of this demonstration is to provide hands-on insights into the workings of the system. We will show Datalog processing performed live, on a laptop, through a graphical monitoring interface that displays details of the computation. We will use different datasets, and show the impact of several optimizations. A demonstration provides an ideal format for discussing implementation aspects and practical experiences that would be difficult to present in a paper. Moreover, we will show important features that have never been published before, especially VLog's new RDBMS bindings, which were introduced only recently.

* This work was partially funded by COMMIT, the NWO VENI project 639.021.335, and the DFG in grants KR 4381/1-1 and CRC 912 HAEC.

2 VLog: Vertical Datalog

VLog was designed to perform efficient rule-based computations on knowledge graphs (KGs). VLog can evaluate arbitrary (positive) Datalog programs using several computing strategies, as detailed in [10]. The heart of bottom-up evaluation is a modified semi-naive evaluation, aided by precomputed relations that are computed top-down using the QSQ-R algorithm (or, optionally, by a Magic Sets procedure [3]). Essential characteristics of the system include:

1. A variety of database systems can be used as a source for the data that rules are evaluated on. The separation of the underlying database from the actual query computation is natural in Datalog, where one distinguishes EDB relations (extensional DB relations; the given input data) from IDB relations (intensional DB relations; the derived data). VLog completely isolates the IDB layer, which is a column-based in-memory store, from the EDB layer, which can be any database. In addition to an RDF backend3, VLog also natively supports popular relational databases (MySQL, MonetDB) and a generic ODBC interface.

2. Space-efficient in-memory storage of derivations is achieved using column-store technology.
VLog stores IDB tables column-by-column instead of row-by-row. To avoid costly updates, VLog never inserts new tuples into existing tables, but creates new tables instead. This is efficient when processing data set-at-a-time, with many new facts derived and stored in one step. Column-based layouts can save memory through simple yet effective data compression, such as run-length encoding [1], and by sharing whole columns between multiple tables.

3. Bottom-up and top-down computation strategies, combined automatically and guided by dynamic optimizations, can lead to significant speed-ups. While VLog's column-based approach improves memory efficiency, computation can be slowed down when derivations from many tables must be considered. We counter this effect by adding several dynamic optimizations that enable us to disregard, at runtime, the contents of some IDB tables. In this way, we can avoid expensive unions and save significant computation (see [10] for details).

4. Free and open source code (C++) is available at https://github.com/jrbn/vlog. VLog was designed to be easy to use, and offers a command line and a Web interface. Rules can be defined in simple text files.

3 Experiments and Demonstration Setup

To gain insights into the performance of VLog, we report some experiments on reasoning over large knowledge bases. Additional experiments can be found in our other works [10,11]. We selected RDFox [8] as our main competitor, since it is the leading Datalog engine for RDF. Experiments were executed on a MacBook Pro with a 2.2GHz Intel Core i7 CPU, a 512GB SSD, and 16GB RAM, running MacOS Yosemite, which is almost identical to the computer to be used during the demonstration. This means we can re-run the experiments live.

3 Trident, the in-house RDF triple store used in our earlier work.

Our selected experiments use two scenarios: “DBpedia” with 112M triples and 9,396 Datalog rules, and “LUBM” with 17M triples and 66 rules.
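Before turning to the measurements, the run-length encoding mentioned in Section 2 as one source of VLog's memory savings can be illustrated with a small sketch. This is hypothetical illustrative Python, not VLog's C++ data structures; the column values are invented for the example.

```python
# Illustrative run-length encoding for a column, as used in column stores
# to compress sorted or highly repetitive columns (not VLog's actual code).

def rle_encode(column):
    """Compress a column into (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1        # extend the current run
        else:
            runs.append([v, 1])     # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the full column."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

# A sorted subject column of a derived table compresses very well:
col = ["alice"] * 4 + ["bob"] * 2 + ["carol"] * 3
runs = rle_encode(col)   # 9 cells stored as 3 (value, count) pairs
assert rle_decode(runs) == col
```

The gain grows with run length: a column of millions of identical values collapses to a single pair, which is one reason sorted columnar layouts are so memory-efficient for large derivation tables.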
VLog can easily handle much larger datasets, but we selected these two since they are popular and can run on a laptop. The inputs were obtained by taking subsets of DBpedia4 and LUBM [5], together with Datalog encodings of the OWL RL fragments of their native ontologies (see [10,11] for details).

First, we compare VLog to RDFox on DBpedia. VLog can materialize all derivations for this input in 67sec using at most 648MB of RAM, while RDFox needs 177sec and 7,917MB. RDFox is optimized for parallel processing, so it may achieve better runtimes on a more powerful computer, but the effectiveness of VLog's memory savings (a factor of 12 in this case) would remain the same.

In a second experiment, we measure the performance of VLog using different backends on LUBM (a much smaller input). The runtimes we obtain are 16sec (Trident), 459sec (MySQL), 209sec (MonetDB), and 232sec (MonetDB via ODBC). It is not surprising that Trident is much more efficient than the external relational stores, since it is specifically optimized for graph-structured data, while the other systems support more general schemas. Moreover, Trident and VLog are compiled together and can run in a single process, while the others communicate via HTTP calls.

Demo Description  We have developed a GUI that shows the state of the system during and after the computation. In our live demonstration, we can visualize the progress of Datalog derivations as they are computed. We will use a variety of datasets, rule sets, backend DBMSs, and optimization settings to illustrate the working of VLog. When the computation has finished, the GUI offers access to further statistical information collected during the computation, which can be inspected and compared with the results of other runs to understand the dynamics of the processing in detail. The GUI is a dynamic HTML page that can be displayed in any browser.
Figure 1a shows its header, which displays general parameters, the current state of the execution, and memory usage. The GUI also visualizes the progress of the computation, showing both the number of newly derived tuples (see Fig. 1b) and the time taken in each step (a similar bar chart). It is interesting to compare both views in order to see the effects of some optimizations and to detect possible performance bottlenecks. Discussing these differences provides interesting insights and can highlight potential for further optimization.

Demo Walkthrough  For the live demonstration, we will launch several executions and trace the operation of VLog in our monitoring tool. We focus on cases that terminate in a few minutes, since these are most appropriate to show the system at work. The demo setup is suitable to discuss not only the performance and optimizations of VLog, but also the behaviour of various database backends (MySQL, MonetDB, Trident) and the characteristics of the variety of benchmarks we have used. The execution can be customized with different settings, and we can quickly add new rules or disable others. We plan to discuss the results individually and in comparison with other settings to highlight interesting issues. We added several customizations to our interface to allow lively interaction. For instance, the bar chart in Fig. 1b can be changed by restricting the view to particular subsets of iterations or rules.

4 http://www.dbpedia.org

[Fig. 1: Examples of the GUI used during the demo. (a) Process Monitor Interface; (b) Tuples Derived in Each Step.]
A 2-minute screencast of our demo can be found online: https://iccl.inf.tu-dresden.de/web/ISWC16demo/en.

References

1. D. Abadi, S. Madden, and M. Ferreira. Integrating compression and execution in column-oriented database systems. In Proc. of SIGMOD, pages 671–682, 2006.
2. D. Abadi, A. Marcus, S. Madden, and K. Hollenbach. SW-Store: a vertically partitioned DBMS for Semantic Web data management. VLDB J., 18(2):385–406, 2009.
3. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
4. O. de Moor, G. Gottlob, T. Furche, and A. Sellers. Datalog Reloaded, volume 6702 of LNCS. Springer, 2012.
5. Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. J. of Web Semantics, 3:158–182, 2005.
6. S. Idreos, F. Groffen, N. Nes, S. Manegold, K. S. Mullender, and M. L. Kersten. MonetDB: two decades of research in column-oriented database architectures. IEEE Data Eng. Bull., 35(1):40–45, 2012.
7. M. Krötzsch and V. Thost. Ontologies for knowledge graphs: Breaking the rules. In Proc. of ISWC, LNCS. Springer, 2016. To appear.
8. B. Motik, Y. Nenov, R. Piro, I. Horrocks, and D. Olteanu. Parallel materialisation of Datalog programs in centralised, main-memory RDF systems. In Proc. of AAAI, pages 129–137, 2014.
9. Y. Nenov, R. Piro, B. Motik, I. Horrocks, Z. Wu, and J. Banerjee. RDFox: A highly-scalable RDF store. In Proc. of ISWC, pages 3–20, 2015.
10. J. Urbani, C. Jacobs, and M. Krötzsch. Column-oriented Datalog materialization for large knowledge graphs. In Proc. of AAAI, pages 258–264, 2016.
11. J. Urbani, C. Jacobs, and M. Krötzsch. VLog: A column-oriented Datalog reasoner (extended abstract). In Proc. of 39th Annual German Conf. on AI (KI'16), LNAI. Springer, 2016. To appear.