=Paper= {{Paper |id=Vol-2369/short13 |storemode=property |title=Datalog-based Reasoning for Knowledge Graphs |pdfUrl=https://ceur-ws.org/Vol-2369/short13.pdf |volume=Vol-2369 |authors=Luigi Bellomarini,Georg Gottlob,Emanuel Sallinger |dblpUrl=https://dblp.org/rec/conf/amw/BellomariniGS19 }} ==Datalog-based Reasoning for Knowledge Graphs== https://ceur-ws.org/Vol-2369/short13.pdf
      Datalog-based Reasoning for Knowledge Graphs

           Luigi Bellomarini1,2 , Georg Gottlob1,3 , and Emanuel Sallinger1,3
                                 1
                                     University of Oxford
                                      2
                                        Banca d’Italia
                                        3
                                          TU Wien

Abstract. This is a short abstract based on the recently published VLDB 2018 paper
[6] – the main technical paper describing the Vadalog system. It describes the central
motivations and gives useful pointers on the architecture and the main algorithms.
Introduction and motivation. The importance of capitalizing and exploiting corporate
knowledge has been clear to decision makers since the late 1970s, when this idea was
gradually made concrete in the context of expert systems, software frameworks able to
harness such knowledge and provide answers to structured business questions. Through
deductive database systems – database systems that have advanced reasoning capabil-
ities – and in particular the language Datalog, the area became of high interest to the
database community in the 1980s and 1990s. Nevertheless, even though the importance
of harnessing knowledge certainly has grown steadily since then, culminating in today’s
desire of companies to exploit knowledge graphs, the interest in deductive databases has
faltered due to immature technology and overly complicated KRR formalisms.
     Yet, we are currently assisting to a resurgence of Datalog in academia and industry
[1, 4, 7–10]: companies like LogicBlox have proven that a fruitful exchange between
academic research and high-performance industrial applications can be achieved based
on Datalog [1], and companies like LinkedIn have shown that the interest in Datalog
permeates industry [16]. Meanwhile, the recent interest in machine learning brought
renewed visibility to AI, raising interest and triggering investments in thousands of
companies world-wide, which suddenly wish to collect, encapsulate and exploit their
corporate knowledge in the form of a knowledge graph.
     In this context, it has been recognized that to handle the complex knowledge-based
scenarios encountered today, such as reasoning over large knowledge graphs, Data-
log has to be extended with features such as existential quantification. Yet, Datalog-
based reasoning in the presence of existential quantification is in general undecidable.
Many efforts have been made to define decidable fragments. Warded Datalog± is a very
promising one, as it captures PTIME complexity while allowing ontological reasoning.
Yet, so far, no implementation of Warded Datalog± was available.
     We recently introduced the Vadalog system, a Datalog-based system for performing
complex logic reasoning tasks, such as those required in advanced knowledge graphs;
it is Oxford’s contribution to the VADA research programme [18, 13], a joint effort of
the universities of Oxford, Manchester and Edinburgh and around 20 industrial partners
such as Facebook, BP, and the NHS (UK national health system). The Vadalog system
proposes the first implementation of Warded Datalog± , a high-performance Datalog±
system utilising an aggressive termination control strategy. In this paper, we summarise
the key aspects of the Vadalog system, while pointing to the original technical paper for
the details and a comprehensive experimental evaluation.
Reasoning over knowledge graphs. The term knowledge graph (KG) has no standard
definition. It can be seen as referring only to Google’s Knowledge Graph, to triple-based
models, or to multi-attributed graphs, which represent n-ary relations [15, 17]. As shown
by Krötzsch [14], in order to support rule-based reasoning on such data structures, it
is sometimes necessary to use tuples of arity higher than three at least for intermediate
results. In this paper, we adopt a general notion of KGs by allowing relations of arbitrary
arity, to support all of these models and modes of reasoning.

Example 1. An example of a simple knowledge graph reasoning setting is given in [15]:
               Spouse(x, y, start, loc, end) → Spouse(y, x, start, loc, end)
This rule expresses that when a person x is married to a person y at a particular location,
starting date and end date, then the same holds for y and x. That is, the graph of persons
and their marriage relations is symmetric.

As stated in [15], most modern ontology languages are not able to express this exam-
ple. Beyond this simple example, there are numerous requirements for a system that
allows ontological reasoning over KGs. Navigating graphs is impossible without pow-
erful recursion; ontological reasoning is impossible without existential quantification
in rule heads [12]. Yet reasoning with recursive Datalog is undecidable in the presence
of existential quantification, so some tradeoffs have to be accepted. While a compre-
hensive analysis of various requirements was given in [5], let us isolate three concrete
characteristics for a language that supports reasoning over KGs:

1. Recursion over KGs. Should be at least able to express full recursion and joins,
   i.e., should at least encompass Datalog. Full recursion in combination with arbitrary
   joins allows to express complex reasoning tasks over KGs. Navigational capabilities,
   empowered by recursion, are vital for graph-based structures.
2. Ontological Reasoning over KGs. Should at least be able to express SPARQL rea-
   soning under the OWL 2 QL entailment regime and set semantics. OWL 2 QL is one
   of the most adopted profiles of the Web Ontology Language, standardized by W3C.
3. Low Complexity. Reasoning should be tractable in data complexity. This is a mini-
   mal requirement for allowing scalability over large volumes of data.

Beyond these specific requirements, the competition between powerful recursion, pow-
erful existential quantification and low complexity has spawned fruitful research through-
out the community to address the mentioned Datalog undecidability in the presence of
existential quantification. This has been done under a number of different names, but
which we shall here call Datalog± , the “+” referring to the additional features (including
existential quantification), the “-” to restrictions that have to be made to obtain decid-
ability. Many languages within the Datalog± family of languages have been proposed
and intensively investigated [2, 3, 7–10]. Depending on the syntactic restrictions, they
achieve a different balance between expressiveness and computational complexity.
    Figure 1 gives an overview of the main Datalog± languages. In fact, most of these
candidates, including Linear Datalog± , Guarded Datalog± , Sticky Datalog± and Weakly
Sticky Datalog± do not fulfil (1). Datalog itself does not fulfil (2). Warded and Weakly
Frontier Guarded Datalog± satisfy (1) and (2), thus are expressive enough. However,
Fig. 1: Syntactic containment of Datalog± languages. Annotations (non-bold) denote
data complexity. All names that do not explicitly mention Datalog refer to the respective
Datalog± languages. E.g., “Sticky” refers to “Sticky Datalog± ”.



the expressiveness of Weakly Frontier Guarded Datalog± comes at the price of it being
EXPTIME-complete [3]. Thus it does not fulfil (3). Thus, in total, the only known
language that satisfies (1), (2) and (3) is Warded Datalog± . Yet, while Warded Datalog±
has very good theoretical properties, the algorithms presented in [12] are alternating-
time Turing machine algorithms, far away from a practical implementation.
    Before summarising the characteristics of the Vadalog system, let us propose an
example of a Datalog program, which we consider representative of the complexity of
a typical reasoning setting involving KGs. The example is at the same time close to a
classical scenario [11] as well as relevant for an application of the KG of one of our
current industrial partners; in particular, for detecting specific risks related to issuers or
guarantors for funds that are non-eligible due to their financing arrangement: in Europe
there is prohibition on guaranteed debt instruments issued by a “closely-linked entity”.


Example 2. Consider a set of rules about ownership relationships among a large number
of companies. The extensional knowledge is stored in a database in the form of tuples
of a relation Own(comp1 , comp2 , w): company comp1 directly owns a fraction w of
company comp2 , with 0 ≤ w ≤ 1. In addition, with a set of two rules, we intensionally
represent the concept of company control (also studied in [11]): a company x controls
a company y if company x directly owns more than half of the shares of company y or
if x controls a set S of companies that jointly own more than half of y. Note that the
msum operator, i.e., aggregation in the form of monotonic sum, is not a standard feature
of Datalog, but an extension that some Datalog-based systems support [6].
                                                Control(x, x).
                                        Own(x, y, w), w > 0.5 → Control(x, y)
         Control(x, y), Own(y, z, w), v = msum(w, hyi), v > 0.5 → Control(x, z).

Here, for fixed x, the aggregate construct msum(w, hyi) forms the sum over all values w
such that for some company y, Control(x, y) is true, and Own(y, z, w) holds, i.e., com-
pany y directly owns fraction w of company z. Now, with the described knowledge
graph, many questions can be asked, such as: (i) obtain all the pairs (x, y) such that
company x controls company y; (ii) which companies are controlled by x? Which com-
panies control x? (iii) does company x control company y?

The Vadalog system. The Vadalog system is built around the Vadalog language, with
Warded Datalog± as its logical core. It is currently used as the core deductive database
system of the overall Vadalog Knowledge Graph Management System described in [5],
as well as at various industrial partners, including the finance, security, and media in-
telligence industries. The main contributions are:

– A novel analysis of Warded Datalog± , which culminates in the first practical algo-
  rithm for Warded Datalog± . In the Vadalog system, the core principle behind the
  reasoning process is performing an aggressive termination control, which amounts
  to adopting dynamic programming strategies to preempt the generation of already
  explored areas of the KG as well as isomorphic ones, guaranteeing at the same time
  termination (as the fragment is decidable) and efficiency (as all the “informative”
  KG areas can be explored in PTIME). To achieve this goal, it is essential to recog-
  nise these KG areas as early as possible.
  While the naı̈ve solution would imply a full caching of all the generated facts, the
  challenge here is guaranteeing limited memory footprint. We identify a number of
  guide data structures that closely exploit the underpinnings of Warded Datalog± and
  allow to abstract large KG areas by means of single representative facts (or patterns
  thereof). More in particular, we propose structures that play a complementary role for
  exploiting the periodicity of the KG: warded forest, which actually allows vertical
  pruning of isomorphic trees based on root isomorphism, and (lifted) linear forest,
  which allows horizontal pruning of entire pattern-isomorphic trees.
– A system and architecture that implements this algorithm in a relational database-
  inspired operator pipeline architecture. The pipeline’s operators rely on termination
  strategy wrappers which transparently prevent the generation of facts that may lead
  to non-termination while ensuring the correctness of the result. The system is com-
  pleted by a wide range of standard and novel optimisation techniques such as the
  dynamic construction of in-memory indices, stream-optimized “slot machine” joins,
  monotonic aggregation, materialization points, etc.
– The technical paper also presents a full-scale experimental evaluation of the Vada-
  log system on a variety of real-world and synthetic scenarios that thoroughly validate
  the effectiveness of our techniques on Warded Datalog± in absolute terms and com-
  paratively with the top existing systems, which are outperformed by our reasoner.
Acknowledgements. This work is supported by the EPSRC programme grant VADA
EP/M025268/1, the Vienna Science and Technology Fund (WWTF) grant VRG18-013,
and the EU Horizon 2020 grant 809965.


References
 1. M. Aref, B. ten Cate, T. J. Green, B. Kimelfeld, D. Olteanu, E. Pasalic, T. L. Veldhuizen,
    and G. Washburn. Design and implementation of the LogicBlox system. In SIGMOD, pages
    1371–1382, 2015.
 2. J. Baget, M. Leclère, and M. Mugnier. Walking the decidability line for rules with existential
    variables. In KR. AAAI Press, 2010.
 3. J. Baget, M. Mugnier, S. Rudolph, and M. Thomazo. Walking the complexity lines for
    generalized guarded existential rules. In IJCAI, pages 712–717. IJCAI/AAAI, 2011.
 4. P. Barceló and R. Pichler, editors. Datalog in Academia and Industry - Second International
    Workshop, Datalog 2.0, Vienna, Austria, September 11-13, 2012. Proceedings, volume 7494
    of Lecture Notes in Computer Science. Springer, 2012.
 5. L. Bellomarini, G. Gottlob, A. Pieris, and E. Sallinger. Swift logic for big data and knowledge
    graphs. In IJCAI, pages 2–10, 2017.
 6. L. Bellomarini, E. Sallinger, and G. Gottlob. The vadalog system: Datalog-based reasoning
    for knowledge graphs. PVLDB, 11(9):975–987, 2018.
 7. A. Calı̀, G. Gottlob, and M. Kifer. Taming the infinite chase: Query answering under expres-
    sive relational constraints. J. Artif. Intell. Res., 48:115–174, 2013.
 8. A. Calı̀, G. Gottlob, and T. Lukasiewicz. A general Datalog-based framework for tractable
    query answering over ontologies. J. Web Sem., 14:57–83, 2012.
 9. A. Calı̀, G. Gottlob, T. Lukasiewicz, B. Marnette, and A. Pieris. Datalog+/-: A family of
    logical knowledge representation and query languages for new applications. In LICS, pages
    228–242, 2010.
10. A. Calı̀, G. Gottlob, and A. Pieris. Towards more expressive ontology languages: The query
    answering problem. Artificial Intelligence, 193:87–128, 2012.
11. S. Ceri, G. Gottlob, and L. Tanca. Logic programming and databases. Springer, 2012.
12. G. Gottlob and A. Pieris. Beyond SPARQL under OWL 2 QL entailment regime: Rules to
    the rescue. In IJCAI, pages 2999–3007, 2015.
13. N. Konstantinou, M. Koehler, E. Abel, C. Civili, B. Neumayr, E. Sallinger, A. A. A. Fer-
    nandes, G. Gottlob, J. A. Keane, L. Libkin, and N. W. Paton. The VADA architecture for
    cost-effective data wrangling. In SIGMOD Conference, pages 1599–1602. ACM, 2017.
14. M. Krötzsch. Efficient rule-based inferencing for OWL EL. In IJCAI 2011, pages 2668–
    2673, 2011.
15. M. Marx, M. Krötzsch, and V. Thost. Logic on MARS: ontologies for generalised property
    graphs. In IJCAI 2017, pages 1188–1194, 2017.
16. W. E. Moustafa, V. Papavasileiou, K. Yocum, and A. Deutsch. Datalography: Scaling datalog
    graph analytics on graph processing systems. In BigData, pages 56–65. IEEE, 2016.
17. J. Urbani, C. J. H. Jacobs, and M. Krötzsch. Column-oriented datalog materialization for
    large knowledge graphs. In AAAI 2016, pages 258–264, 2016.
18. VADA. Project. http://vada.org.uk/, 2016.