=Paper= {{Paper |id=Vol-3801/short3 |storemode=property |title=Nemo: A Scalable and Versatile Datalog Engine |pdfUrl=https://ceur-ws.org/Vol-3801/short3.pdf |volume=Vol-3801 |authors=Alex Ivliev,Lukas Gerlach,Simon Meusel,Jakob Steinberg,Markus Krötzsch |dblpUrl=https://dblp.org/rec/conf/datalog/Ivliev0MSK24 }} ==Nemo: A Scalable and Versatile Datalog Engine== https://ceur-ws.org/Vol-3801/short3.pdf
                         Nemo: A Scalable and Versatile Datalog Engine
                         Alex Ivliev1,∗ , Lukas Gerlach1 , Simon Meusel1 , Jakob Steinberg1 and Markus Krötzsch1
                         1 Knowledge-Based Systems Group, TU Dresden, Dresden, Germany



                                        Abstract
                                        Nemo is a toolkit for large-scale data analysis that emphasizes robustness and ease of use. Nemo’s core is a scalable
                                        and efficient main-memory reasoner that supports an expressive extension of Datalog with support for data types,
                                        existential rules, aggregates, and (stratified) negation. Built around this core is a versatile system of libraries and
                                        applications for interfacing with several data formats and programming languages, use as a web application, and
                                        IDE integration. In this system description, we present this toolkit giving a high-level overview of the system
                                        architecture as well as its supported language features.

                                        Keywords
                                        Reasoning Engine, Existential Rules, Logic Programming




                         1. Introduction
                         Datalog is an important rule language [1] forming the basis for many more complex formalisms such as
                         existential rules [2, 3] that are of interest in both theory and practice. Accordingly, many rule reasoning
                         systems have been presented, which we can roughly classify in the following types [4]:
                                1. Answer set programming solvers [5, 6] and logic programming systems such as Prolog [7]
                                2. Knowledge graph and database engines such as RDFox [8], VLog [9], Vadalog [10], and Graal [11]
                                3. Data-analytics systems such as Soufflé [12], LogicBlox [13], SocialLite [14], or EmptyHeaded [15]
                                4. Data management frameworks such as Datomic, Google Logica, and CozoDB

                            Despite the wide variety of tools, many systems described in the literature may not be a viable choice
                         in practice, due to discontinuation or access restrictions of closed-source commercial systems. In the field
                         of logic programming, these problems have been overcome, and advanced open-source systems such as
                         Clingo [16] are available to researchers, whereas the situation in other rule system categories is more
                         precarious. Among the mentioned systems of types (2) and (3), the only open-source tool with a release
                         in the past twelve months is Soufflé.
                            In this extended abstract, we summarize our recent paper [17] on our new rule reasoning toolkit
                         Nemo built for applications of type (2) and (3), where the computation of logical entailments (or query
                         results) from a variety of inputs is the main reasoning task. Nemo’s rule language extends Datalog
                         with various datatypes, negation, aggregates (both stratified), and many datatype-specific functions and
                         operators known from query languages like SPARQL while also supporting existential rules with a fast
                         implementation of the restricted (or standard) chase algorithm.
                            Thanks to Nemo’s modular architecture, we can provide command-line clients for all major platforms,
                         a public web application with a built-in rule editor, an IDE plugin for rule editing in VSCode, and
                         APIs for integration in several programming languages. All components of Nemo are free and open
                         source, and their development repositories public. Most of Nemo is written in Rust, a language that
                         emphasizes type and memory safety, and code quality is an explicit concern. The source code is available
                         at https://github.com/knowsys/nemo.
                         Datalog 2.0 2024: 5th International Workshop on the Resurgence of Datalog in Academia and Industry, October 11, 2024, Dallas,
                         Texas, USA
                         ∗ Corresponding author.
                         $ alex.ivliev@tu-dresden.de (A. Ivliev); lukas.gerlach@tu-dresden.de (L. Gerlach); simon.meusel@mailbox.tu-dresden.de
                         (S. Meusel); jakob_maximilian.steinberg@mailbox.tu-dresden.de (J. Steinberg); markus.kroetzsch@tu-dresden.de
                         (M. Krötzsch)
                          0000-0002-1604-6308 (A. Ivliev); 0000-0003-4566-0224 (L. Gerlach); 0000-0002-9172-2601 (M. Krötzsch)
                                     © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings

                                                                                                                43
Alex Ivliev et al. CEUR Workshop Proceedings                                                           43–47




Figure 1: Rule reasoning in the browser with Nemo Web (left) and tracing on the console (right)


2. System Overview
The Nemo toolkit consists of several programs and libraries. This includes a command-line application
called nmo, and a web application1 shown in Figure 1 (left). The web application utilizes Nemo’s
WebAssembly interface, enabling it to run entirely in the browser without any installation. It features
a powerful rule editor with syntax highlighting and auto-complete, which is made possible by Nemo’s
implementation of the open language server protocol [18]. This also enables advanced editor support for
Nemo’s rule language within compatible IDEs, such as VSCode. Nemo further provides a Python and Rust
developer API, which expose import/export, the reasoner, and give access to the internal rule structure.
   The core of Nemo is a fast and scalable reasoner that materializes logical consequences using semi-naive
bottom-up evaluation [1]. For any inference that is computed, Nemo can also provide a computation trace
that encodes a proof tree. This trace is available in machine-readable JSON format or in more visual
representations. Figure 1 (right) shows an example use of this feature on the console.
   Nemo keeps data in-memory, representing tables as hierarchically-sorted, column-based tables, following
the design in VLog [9]. In contrast to VLog, our columns support domain elements of several types:
fixed-size values (e.g., 32bit floats or 64bit signed integers) are stored directly, while variable-sized data
(e.g., strings and IRIs) is first mapped to integer ids through a dictionary. Data in columns is further
compressed using Run-Length-Encoding with increments. Tables in Nemo are accessible via trie iterators
[19], which makes it possible to utilize the leapfrog trie-join [20], a multi-way, worst-case optimal join
algorithm. In this setting, query plans are largely determined by the variable (column) order, which we
heuristically determine by searching for orders that avoid inefficiencies like cross-products and excessive
reordering of tables. The implementation for operations like union or difference also make use of this
trie-based access. Projection and resorting of tables cannot be implemented efficiently in this way, however,
and require row-based temporary tables.
   Nemo’s efficient storing mechanism combined with its use of state-of-the-art join algorithm and
additional optimizations makes its performance competitive with other mature systems, often outperforming
them on our benchmarks [17]. We were surprised that our browser-based Nemo Web often competed for
second place in these evaluations. Our experiments with a version of the Wikidata knowledge graph [21]
show that Nemo can reason with data sets of several billion triples. Further details on these evaluations
can be found in Section 5 of [17].


3. Language Features
Nemo’s rule language is based on Datalog [1], which is extended with many features of modern query and
rule languages.2 The syntax largely follows common logic programming conventions, representing rules
as implications consisting of conjunctions of first-order atoms where variables are universally quantified.
1 The web application is available at: https://tools.iccl.inf.tu-dresden.de/nemo/
2A full documentation can be found here: https://knowsys.github.io/nemo-doc/




                                                                44
    Alex Ivliev et al. CEUR Workshop Proceedings                                                          43–47


    Nemo supports recursion and further allows notation used in the query language SPARQL [22], which
    provides a robust encoding of identifiers, including special characters and data types, as illustrated next:

1   parents(, carla, bob). parents(alice, carla, bob) .
2   nameAndYearOfBirth(alice, "Alice Müller", 2003) .
3
4   child(?C, ?M), child(?C, ?F) :- parents(?C, ?M, ?F) .
5   sibling(?C, ?D) :- child(?C, ?P), child(?D, ?P), ?C != ?D .


    Lines 1–2 provide three facts, illustrating basic syntax for various kinds of data. In gen-
    eral, all syntactic forms available in RDF and SPARQL can be used, e.g., "論理学"@ja and
    "3.1"^^. Lines 4–5 show two simple rules, where
    ? marks variables and != denotes inequality. Nemo also supports non-monotonic negation using “~”:

6   onlyChild(?C) :- child(?C, _), ~sibling(?C, _) .


    As usual in logic programming, anonymous body variables are denoted by _ and treated like different
    variables, i.e. variables _ in Line 6 are distinct. Moreover, Nemo considers variables that occur only in
    negated body atoms to be existentially quantified beneath the negation. Hence, ~sibling(?C,_) requires
    that all sibling facts do not match this pattern, whereas child(?C,_) requires some matching fact. A
    similar convention is used in the answer set programming, e.g., in current versions of Clingo [5].

    Data Import and Export External data can be imported into Nemo using the following syntax:

7   @import parents :- csv { resource = "parents.csv.gz", limit = 100 } .
8   @import nameAndYearOfBirth :- csv { resource = "https://www.data.com/name_year.csv" } .


    When providing an online resource through a URL as shown in Line 8, data will be downloaded upon
    program evaluation. Currently, Nemo is able to process data encoded as CSV (with user-set delimiter)
    or RDF. For RDF, triples (NTriples, Turtle, RDF/XML) and quads (TRiG, NQuads) are supported. All
    formats are available for export using analogous syntax via the @export directive. On the command-line,
    results may also be printed on the console. Various details of the import and export can be controlled
    through optional parameters, and data can be processed with GZip compression.

    Datatypes and Functions Nemo natively supports values of many different data types, including
    integer, string, language-tagged string, single and double precision float, and Boolean. To manipulate
    such values, Nemo offers a wide range of built-in functions, which largely correspond with SPARQL
    FILTER expressions. Such functions include standard arithmetic operators and mathematical functions on
    numbers such as SQRT, functions for string manipulation such as CONCAT, Boolean operators, as well as
    comparison operators such as < or !=. Functions can be nested arbitrarily and may occur anywhere in
    the rule. Bindings for any variable used within a function must be sufficiently determined, similar to the
    standard safety condition used for Datalog rules.
       Nemo is dynamically typed and therefore allows values of any type to be used in any position, without
    requiring a fixed schema. Most functions, such as STRLEN are not defined for all types of inputs, instead
    producing no result when given a value of unsupported type.

    Aggregation Nemo supports common aggregates such as #count, #sum, or #max, as shown next:

9   childCount(?P, #count(?C)) :- child(?C, ?P) .


    Rule 9 counts, for each value of ?P, the number of distinct values of ?C that satisfy the rule body. We
    distinguish three kinds of variables: aggregate variables occur in aggregate functions (in the head),
    group-by variables are the head variables that are not aggregate variables; and body-only variables that
    do not appear in the head. Aggregation is evaluated as follows: (1) find all rule matches, (2) project away



                                                        45
     Alex Ivliev et al. CEUR Workshop Proceedings                                                             43–47


     bindings of body-only variables and remove duplicates that agree on all remaining variables, (3) group the
     set of projected matches by distinct values of group-by variables, and (4) apply the aggregation function
     on each group. Duplicate elimination in (2) means we always aggregate over sets, corresponding to the
     keyword DISTINCT in many query languages. Users control the semantics through variables in the head:

10   sum1(?A, ?B, #sum(?N)) :- p(?A, ?B, ?N) .
11   sum2(?A, #sum(?N, ?B)) :- p(?A, ?B, ?N) .
12   sum3(?A,     #sum(?N)) :- p(?A, ?B, ?N) .


     Rule 10 sums up the numbers N for each pair of As and Bs; rule 11 sums up the Ns from all pairs of Ns
     and Bs for each A (possibly having the same N value); and rule 12 sums up the distinct Ns for each A.

     Existential Rules Existential rules, also known as tuple-generating dependencies, are an important
     formalism used to describe integrity constraints [23] or to formulate dependencies within a data exchange
     setting [24]. Syntactically, they are represented by introducing existentially quantified variables in the rule
     head, denote by ! in Nemo, as shown in the next example:

13   child(?C, !P), child(!P, ?G) :- grandchild(?C, ?G) .


     Nemo supports reasoning over existential rules by implementing the restricted (a.k.a. standard) chase
     [2, 3]. For each match of the body of rule 13, Nemo creates a fresh value for !P, provided that the head is
     not satisfied for any existing value yet. Fresh values are represented in Nemo by named nulls, which we
     treat as a separate type of domain element distinct from other elements. Named nulls are identified with
     RDF blank nodes in data import and export, and denoted accordingly (using notation like _:42).


     4. Conclusion and Future Work
     Nemo is a comprehensive framework for rule reasoning that includes a scalable rule engine, a feature-rich
     rule language, and advanced user interfaces with strong developer support. Further development of Nemo
     will focus on enhancing its expressive power by introducing native support for complex types such function
     terms and sets, and adding user-defined functions to increase flexibility and unlocking additional use
     cases. To further improve performance, we will transition Nemo’s implementation to being able to utilize
     multi-threading. Next steps to improve developer tools include better highlighting of static analysis results
     to the user and better tracing support in the Nemo web application, which will simplify debugging and
     create a more robust and explainable reasoning system.


     Acknowledgments
     This work is in part supported by Deutsche Forschungsgemeinschaft in project number 389792660
     (TRR 248, CPEC), by the Bundesministerium für Bildung und Forschung under European ITEA project
     01IS21084 (InnoSale), in the Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI),
     and by BMBF and DAAD in project 57616814 (SECAI).


     References
      [1] S. Abiteboul, R. Hull, V. Vianu, Foundations of Databases, Addison Wesley, 1994.
      [2] M. Benedikt, G. Konstantinidis, G. Mecca, B. Motik, P. Papotti, D. Santoro, E. Tsamoura, Bench-
          marking the chase, in: Proc. 36th Symp. on Principles of Database Systems (PODS’17), ACM,
          2017, pp. 37–52. doi:10.1145/3034786.3034796.
      [3] M. Mugnier, M. Thomazo, An introduction to ontology-based query answering with existential
          rules, in: Reasoning Web: Reasoning on the Web in the Big Data Era – 10th Int. Summer School,
          volume 8714 of LNCS, Springer, 2014, pp. 245–278. doi:10.1007/978-3-319-10587-1\_6.



                                                           46
Alex Ivliev et al. CEUR Workshop Proceedings                                                         43–47


 [4] A. Ivliev, S. Ellmauthaler, L. Gerlach, M. Marx, M. Meißner, S. Meusel, M. Krötzsch, Nemo: First
     glimpse of a new rule engine, in: Proc. 39th Int. Conf. on Logic Programming (ICLP’23), volume
     385 of EPTCS, 2023, pp. 333–335. doi:10.4204/EPTCS.385.35.
 [5] M. Gebser, B. Kaufmann, T. Schaub, Conflict-driven answer set solving: From theory to practice,
     Artif. Intell. 187 (2012) 52–89. doi:10.1016/j.artint.2012.04.001.
 [6] M. Alviano, F. Calimeri, C. Dodaro, D. Fuscà, N. Leone, S. Perri, F. Ricca, P. Veltri, J. Zangari,
     The ASP system DLV2, in: Proc. 14th Int. Conf. on Logic Programming and Nonmonotonic
     Reasoning (LPNMR’17), volume 10377 of LNCS, Springer, 2017, pp. 215–221. doi:10.1007/
     978-3-319-61660-5_19.
 [7] P. Körner, M. Leuschel, J. Barbosa, V. S. Costa, V. Dahl, M. V. Hermenegildo, J. F. Morales,
     J. Wielemaker, D. Diaz, S. Abreu, Fifty years of Prolog and beyond, Theory Pract. Log. Program.
     22 (2022) 776–858. doi:10.1017/S1471068422000102.
 [8] Y. Nenov, R. Piro, B. Motik, I. Horrocks, Z. Wu, J. Banerjee, RDFox: A highly-scalable RDF store,
     in: Proc. 14th Int. Semantic Web Conf. (ISWC’15), Part II, volume 9367 of LNCS, Springer, 2015,
     pp. 3–20. doi:10.1007/978-3-319-25010-6_1.
 [9] J. Urbani, C. Jacobs, M. Krötzsch, Column-oriented Datalog materialization for large knowledge
     graphs, in: Proc. 30th AAAI Conf. on Artificial Intelligence (AAAI’16), AAAI Press, 2016, pp.
     258–264. doi:10.1609/aaai.v30i1.9993.
[10] L. Bellomarini, E. Sallinger, G. Gottlob, The Vadalog system: Datalog-based reasoning for knowledge
     graphs, Proc. VLDB Endowment 11 (2018) 975–987. doi:10.14778/3213880.3213888.
[11] J. Baget, M. Leclère, M. Mugnier, S. Rocher, C. Sipieter, Graal: A toolkit for query answering with
     existential rules, in: Proc. 9th Int. Web Rule Symposium (RuleML’15), volume 9202 of LNCS,
     Springer, 2015, pp. 328–344. doi:10.1007/978-3-319-21542-6\_21.
[12] H. Jordan, B. Scholz, P. Subotic, Soufflé: On synthesis of program analyzers, in: Proc. 28th Int.
     Conf. on Computer Aided Verification (CAV’16), Part II, volume 9780 of LNCS, Springer, 2016, pp.
     422–430. doi:10.1007/978-3-319-41540-6_23.
[13] M. Aref, B. ten Cate, T. J. Green, B. Kimelfeld, D. Olteanu, E. Pasalic, T. L. Veldhuizen, G. Washburn,
     Design and implementation of the LogicBlox system, in: Proc. 2015 ACM SIGMOD Int. Conf. on
     Mngmt of Data, 2015, pp. 1371–1382. doi:10.1145/2723372.2742796.
[14] J. Seo, S. Guo, M. S. Lam, SociaLite: An efficient graph query language based on datalog, IEEE
     Trans. Knowl. Data Eng. 27 (2015) 1824–1837. doi:10.1109/TKDE.2015.2405562.
[15] C. R. Aberger, S. Tu, K. Olukotun, C. Ré, EmptyHeaded: A relational engine for graph processing,
     in: Proc. 2016 ACM SIGMOD Int. Conf. on Management of Data, ACM, 2016, pp. 431–446.
     doi:10.1145/3129246.
[16] M. Gebser, R. Kaminski, B. Kaufmann, T. Schaub, Multi-shot ASP solving with clingo, Theory
     Pract. Log. Program. 19 (2019) 27–82. doi:doi.org/10.1017/S1471068418000054.
[17] A. Ivliev, L. Gerlach, S. Meusel, J. Steinberg, M. Krötzsch, Nemo: Your friendly and versatile
     rule reasoning toolkit, in: Proc. 21st Int. Conf. on Principles of Knowledge Representation and
     Reasoning (KR’24), 2024.
[18] LSP, Official page for Language Server Protocol, Microsoft, 2024. https://microsoft.github.io/
     language-server-protocol/, accessed May 2024.
[19] E. Fredkin, Trie memory, Commun. ACM 3 (1960) 490–499. doi:10.1145/367390.367400.
[20] T. L. Veldhuizen, Triejoin: A simple, worst-case optimal join algorithm, in: Proc. 17th Int. Conf. on
     Database Theory (ICDT’14), 2014, pp. 96–106. doi:10.5441/002/icdt.2014.13.
[21] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Commun. ACM 57
     (2014) 78–85. doi:10.1145/2629489.
[22] S. Harris, A. Seaborne (Eds.), SPARQL 1.1 Query Language, W3C Recommendation, 21 March
     2013. Available at http://www.w3.org/TR/sparql11-query/.
[23] C. Beeri, M. Y. Vardi, A proof procedure for data dependencies, J. ACM 31 (1984) 718–741.
     doi:10.1145/1634.1636.
[24] R. Fagin, P. G. Kolaitis, R. J. Miller, L. Popa, Data exchange: semantics and query answering,
     Theoretical Computer Science 336 (2005) 89–124. doi:10.1016/j.tcs.2004.10.033.



                                                    47