<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>VLog: A Column-Oriented Datalog System for Large Knowledge Graphs?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jacopo Urbani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ceriel Jacobs</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Krotzsch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Advancing Electronics Dresden (cfaed)</institution>
          ,
          <addr-line>TU Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. Computer Science, VU University Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Rule-based ontology languages are well-suited for data-intensive applications, and rules therefore are an important topic in Semantic Web applications and research up to the present day [7,8,9,10]. The common foundation of many rule languages is Datalog, which was recently the focus of much renewed interest [4]. Modern Datalog systems such as LogicBlox, SocialLite, or EmptyHeaded, now often compete successfully with state-of-the-art techniques in their target area. In spite of these advances, scalability remains a big challenge. We therefore have recently presented a new Datalog system, VLog (Vertical Datalog) [10], which exploits data management technologies used in column stores to compute e ciently rule-based computation on large Knowledge Graphs (KGs). Compared to traditional row stores, column-based approaches have shown performance advantages on analytical workloads [6], but were so far deployed mostly in relational DBMSs. With VLog, we show that Datalog too can bene t from column-store technology. This is not obvious, since the advantages of column stores are set o by a comparatively high cost of updates [2], whereas the iterative computation of Datalog query results can produce large numbers of derived facts that need to be inserted. Indeed, as far as we know, ours is the rst work to successfully use column-based technology for Datalog processing. VLog can process arbitrary Datalog rules on top of a range of RDF stores and relational databases. The heart of VLog is an e cient materialization procedure that computes derivations bottom-up. To improve performance, VLog packs a signi cant number of carefully engineered optimizations on several levels. The goal of this demonstration is to provide hands-on insights into the workings of the system. We will show Datalog processing performed live, on a laptop, through a graphical monitoring interface that displays details of the computation. We will use di erent datasets, and show the impact of several optimizations. A demonstration provides an ideal format for discussing implementation aspects and practical experiences that would be di cult to present in a paper. Moreover, we will show important features that have never been published before, especially VLog's new RDBMS bindings, which were introduced only recently.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>? This work was partially funded by COMMIT, the NWO VENI project 639.021.335,
and the DFG in grants KR 4381/1-1 and CRC 912 HAEC.</p>
    </sec>
    <sec id="sec-2">
      <title>VLog: Vertical Datalog</title>
      <p>
        VLog was designed to perform e cient rule-based computations on knowledge
graphs (KGs). VLog can evaluate arbitrary (positive) Datalog programs using
several computing strategies as detailed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The heart of bottom-up
evaluation is a modi ed semi-naive evaluation, aided by precomputed relations that
are computed top-down using the QSQ-R algorithm (or optionally by a Magic
Sets procedure [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). Essential characteristics of the system include:
1. A variety of database systems can be used as a source for the data
that rules are evaluated on. The separation of the underlying database from the
actual query computation is natural in Datalog, where one distinguishes EDB
relations (extensional DB relations; the given input data) from IDB relations
(intensional DB relations; the derived data). VLog completely isolates the IDB
layer, which is a column-based in-memory store, from the EDB layer, which can
be any database. In addition to an RDF backend3, VLog also natively supports
popular relational databases (MySQL, MonetDB) and a generic ODBC interface.
2. Space-e cient in-memory storage of derivations is achieved using
column-store technology. VLog stores IDB tables column-by-column instead of
row-by-row. To avoid costly updates, VLog never inserts new tuples into existing
tables, but creates new tables instead. This is e cient when processing data
setat-a-time, with many new facts derived and stored in one step. Column-based
layouts can safe memory through simple yet e ective data compression, such as
run-length-encoding [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and by sharing whole columns between multiple tables.
3. Bottom-up and top-down computation strategies, combined
automatically and guided by dynamic optimizations, can lead to signi cant speed-ups.
While VLog's column-based approach improves memory e ciency, computation
can be slowed down when derivations from many tables must be considered. We
counter this e ect by adding several dynamic optimizations that enable us to
disregard, at runtime, the contents of some IDB tables. In this way, we can avoid
expensive unions and save signi cant computation (see [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for details).
4. Free and open source code (C++) is available at https://github.com/
jrbn/vlog. VLog was designed to be easy to use, and o ers a command line and
a Web interface. Rules can be de ned in simple text les.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Demonstration Setup</title>
      <p>
        To gain insights into the performance of VLog, we report some experiments on
reasoning on large knowledge bases. Additional experiments are also found in
our other works [
        <xref ref-type="bibr" rid="ref10 ref11">10,11</xref>
        ]. We selected RDFox [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] as our main competitor since it
is the leading Datalog engine for RDF. Experiments have been executed on a
MacBook Pro with a 2.2GHz Intel Core i7 CPU, 512GB SDD, and 16GB RAM
running on MacOS Yosemite, which is almost identical to the computer to be
used during the demonstration. This means we can re-run the experiments live.
      </p>
      <sec id="sec-3-1">
        <title>3 Trident, the in-house RDF triple store used in our earlier work</title>
        <p>
          Our selected experiments use two scenarios: \DBpedia" with 112M triples
and 9,396 Datalog rules, and \LUBM" with 17M triples and 66 rules. VLog
can easily handle much larger datasets, but we selected these two since they
are popular and can run on a laptop. The inputs have been obtained by taking
subsets of DBpedia4 and LUBM [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], together with Datalog encodings of the
OWL RL fragments of their native ontologies (see [
          <xref ref-type="bibr" rid="ref10 ref11">10,11</xref>
          ] for details).
        </p>
        <p>First we compare VLog to RDFox on DBpedia. VLog can materialize all
derivations for this input in 67sec using at most 648MB of RAM, while RDFox
needs 177sec and 7,917MB. RDFox is optimized for parallel processing, so it may
achieve better runtimes on a more powerful computer, but the e ectiveness of
VLog's memory savings (factor 12 in this case) would remain the same.</p>
        <p>In a second experiment, we measure the performance of VLog using
different backends on LUBM (a much smaller input). The runtimes we obtain are
16sec (Trident), 459sec (MySQL), 209sec (MonetDB), and 232sec (MonetDB via
ODBC), respectively. It is not surprising that Trident is much more e cient than
external relational stores, since it is speci cally optimized for graph-structured
data while the other systems support more general schemas. Moreover, Trident
and VLog are compiled together and can run in a single process, while the others
communicate via HTTP calls.</p>
        <p>Demo Description We have developed a GUI that shows the state of the
system during and after the computation. In our live demonstration, we can
visualize the progress of Datalog derivations as they are computed. We will use
a variety of datasets, rule sets, backend DBMS, and optimization settings to
illustrate the working of VLog. When the computation has nished, the GUI
o ers access to further statistical information collected during the computation,
which can be inspected and compared with results of other runs to understand
the dynamics of the processing in detail.</p>
        <p>The GUI is a dynamic HTML page that can be displayed in any browser.
Figure 1a shows its header, which displays general parameters, the current state
of the execution, and memory usage. The GUI also visualizes the progress of the
computation, showing both the number of newly derived tuples (see Fig. 1b) and
the time taken in each step (a similar bar chart). It is interesting to compare
both views in order to see the e ects of some optimizations and to detect
possible performance bottlenecks. Discussing these di erences provides interesting
insights and can highlight potential for further optimization.</p>
        <p>Demo Walkthrough For the live demonstration, we will launch several
executions and trace the operation of VLog in our monitoring tool. We focus on
cases that terminate in a few minutes since these are most appropriate to show
the system at work. The demo setup is suitable to discuss not only the
performance and optimizations of VLog, but also the behaviour of various database
backends (MySQL, MonetDB, Trident) and the characteristics of the variety of
benchmarks we have used. The execution can be customized with di erent
settings, and we can quickly add new rules or disable others. We plan to discuss</p>
      </sec>
      <sec id="sec-3-2">
        <title>4 http://www.dbpedia.org</title>
        <p>8%
Refresh rate (ms): 1000</p>
        <p>Input
Command line arguments:
mat -i ./lubm1000 --rules ./LUBM-rdfox-rules --edb ./edb.conf . .</p>
        <p>
          N. EDB predicates:1
N. IDB predicates:72
N. Rules:170
Statistics
Runtime: 00:00:09.272
Current iteration:160
Current rule:
HEAD=RP44[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]ff(?3,?5) BODY=RP7[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]ff(?5,?3)
        </p>
        <p>Outputs of Rule Executions:
(a) Process Monitor Interface</p>
        <p>(b) Tuples Derived in Each Step
the results individually and in comparison with other settings to highlight
interesting issues. We added several customizations to our interface to allow a lively
interaction. For instance, the bar chart in Fig. 1b can be changed by restricting
the view to particular subsets of iterations or rules. A 2-minute screencast of our
demo can be found online: https://iccl.inf.tu-dresden.de/web/ISWC16demo/en.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>D.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          .
          <article-title>Integrating compression and execution in column-oriented database systems</article-title>
          .
          <source>In Proc. of SIGMOD</source>
          , pages
          <volume>671</volume>
          {
          <fpage>682</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>D.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Hollenbach.</surname>
          </string-name>
          SW-Store:
          <article-title>a vertically partitioned DBMS for Semantic Web data management</article-title>
          .
          <source>VLDB J</source>
          .,
          <volume>18</volume>
          (
          <issue>2</issue>
          ):
          <volume>385</volume>
          {
          <fpage>406</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hull</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Vianu</surname>
          </string-name>
          . Foundations of databases, volume
          <volume>8</volume>
          . AddisonWesley Reading,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>O. de Moor</surname>
            , G. Gottlob,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Furche</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Sellers</surname>
          </string-name>
          .
          <source>Datalog Reloaded</source>
          , volume
          <volume>6702</volume>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pan</surname>
          </string-name>
          , and
          <string-name>
            <surname>J.</surname>
          </string-name>
          <article-title>He in. LUBM: A benchmark for OWL knowledge base systems</article-title>
          .
          <source>J. of Web Semantics</source>
          ,
          <volume>3</volume>
          :
          <fpage>158</fpage>
          {
          <fpage>182</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>S.</given-names>
            <surname>Idreos</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          <article-title>Gro en</article-title>
          , N. Nes,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manegold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Mullender</surname>
          </string-name>
          , and
          <string-name>
            <surname>M. L. Kersten.</surname>
          </string-name>
          <article-title>MonetDB: two decades of research in column-oriented database architectures</article-title>
          .
          <source>IEEE Data Eng. Bull.</source>
          ,
          <volume>35</volume>
          (
          <issue>1</issue>
          ):
          <volume>40</volume>
          {
          <fpage>45</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. M. Krotzsch and
          <string-name>
            <given-names>V.</given-names>
            <surname>Thost</surname>
          </string-name>
          .
          <article-title>Ontologies for knowledge graphs: Breaking the rules</article-title>
          .
          <source>In Proc. of ISWC</source>
          , LNCS. Springer,
          <year>2016</year>
          . To appear.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>B.</given-names>
            <surname>Motik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nenov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Piro</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Horrocks</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Olteanu</surname>
          </string-name>
          .
          <article-title>Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF Systems</article-title>
          .
          <source>In Proc. of AAAI</source>
          , pages
          <volume>129</volume>
          {
          <fpage>137</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nenov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Piro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Motik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <surname>J. Banerjee.</surname>
          </string-name>
          <article-title>RDFox: A highly-scalable RDF store</article-title>
          .
          <source>In Proc. of ISWC</source>
          , pages
          <volume>3</volume>
          {
          <fpage>20</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>J. Urbani</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Jacobs</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <article-title>Krotzsch. Column-oriented datalog materialization for large knowledge graphs</article-title>
          .
          <source>In Proc. of AAAI</source>
          , pages
          <volume>258</volume>
          {
          <fpage>264</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>J. Urbani</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Jacobs</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <article-title>Krotzsch. VLog: A column-oriented datalog reasoner (extended abstract)</article-title>
          .
          <source>In Proc. of 39th Annual German Conf. on AI (KI'16)</source>
          , LNAI. Springer,
          <year>2016</year>
          . To appear.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>