RUBEN: A Rule Engine Benchmarking Framework
Kevin Angele1,3 , Jürgen Angele2 , Umutcan Şimşek1 and Dieter Fensel1
1 Semantic Technology Institute, University of Innsbruck, Technikerstrasse 21a, 6020 Innsbruck, Austria
2 adesso, Competence Center Artificial Intelligence
3 Onlim GmbH, Weintraubengasse 22, 1020 Vienna, Austria

RuleML+RR’22: 16th International Rule Challenge and 6th Doctoral Consortium, September 26–28, 2022, Virtual
kevin.angele@sti2.at (K. Angele)


Abstract
Knowledge graphs have become an essential technology for powering intelligent applications. Enriching
the knowledge within knowledge graphs based on use case-specific requirements can be achieved using
inference rules. Applying rules on knowledge graphs requires performant and scalable rule engines.
Analyzing rule engines based on test cases covering various characteristics is crucial for identifying the
optimal rule engine for a given use case. To this end, we present RUBEN: A Rule Engine Benchmarking
Framework providing interfaces to benchmark rule engines based on given test cases. Besides a description
of RUBEN’s interfaces, we present a selection of test cases adopted from the OpenRuleBench, and an
evaluation of four rule engines. In the future, we aim to benchmark existing rule engines regularly and
encourage the community to propose new test cases and include other rule engines.

Keywords
RUBEN, Rules, Rule Engines, Benchmark




1. Introduction
Knowledge graphs have become an important technology for powering intelligent applications
integrating data from heterogeneous (often incomplete) sources. Parts of the missing knowledge
can be inferred by using inference rules. Besides, rules can be used for data integration or
information extraction. In recent years, many new rule engines [1, 2, 3] targeting knowledge
graphs were developed. Performance and scalability are essential requirements for rule engines
operating on knowledge graphs to deliver fast responses for intelligent applications. Analyzing
rule engines based on test cases covering various characteristics is crucial for obtaining an
overview of the available engines, their performance, and their scalability.
   In 2009, OpenRuleBench [4], a suite of performance benchmarks for comparing and analyzing
rule engines, was published. OpenRuleBench included systems relying on different technologies,
including Prolog-based systems, deductive databases, production rule systems, triple engines, and
general knowledge bases. The main issue when comparing various academic and commercial
systems is their different syntaxes and supported features. Therefore, the rules for the various
systems had to be generated manually; only the data for those test cases was generated
programmatically. Due to differences in capabilities, not all test cases are applicable to all
systems. OpenRuleBench is freely available and encourages the community to contribute.
Unfortunately, OpenRuleBench is not a benchmarking framework but a collection of rule sets
representing different test cases and corresponding datasets.

Each tested system has its own scripts for running the test cases, which makes running the
evaluation quite cumbersome. Moreover, the last assessment was conducted in 2011, and there
appears to have been no activity since then.
   Therefore, we present RUBEN: A Rule Engine Benchmarking Framework, which provides a
simple interface for including rule engines in the evaluation of a given set of test cases. The main
aim of this framework is to provide an easy way to extend the collection of evaluated engines and
to execute the test cases without running multiple scripts. In the end, the output of all engines is
combined into a single result file.
   As a basis for this framework, we rely on the data provided by the OpenRuleBench [4]. The
test cases were adopted in full, and the test data was adapted to the new versions of the
engines. For the first version of RUBEN, we used a selection of engines from the different
categories:
    • Deductive database - Stardog1 , VLog [3]
    • Production and reactive rule systems - Drools2
    • Rule engines for triples - Jena3
   In the future, we plan to extend this list of engines and also encourage the community to add
new engines.
   In this paper, we give an overview of RUBEN’s implementation (Section 2), introduce the
test cases (Section 3), and present a small subset of the evaluation results (Section 4). Afterward,
related work in this area is presented (Section 5). Finally, Section 6 concludes the paper and gives
an outlook on future work.


2. RUBEN
RUBEN4 is a rule engine benchmarking framework written in Java, bundling the evaluation of
various engines into a single framework that is easy to configure and execute. This section
presents RUBEN’s architecture and the interface to be implemented for rule engines that should
be included in the evaluation.
   Figure 1 presents the components RUBEN is composed of. The main component called
Ruben loads the evaluation configuration and triggers the execution of the test cases via the
BenchmarkExecutor. Rule engines and test cases are configurable for the evaluation.
   Table 1 presents the general configuration options, namely name and testDataPath, for RUBEN.
While the name is used as a label for the result file, the testDataPath specifies the location
of the test data. The test data needs to include the data needed for each test case and the
corresponding rules for each engine. So far, the data and rule files need to be provided in the
format supported by the respective rule engine. In the future, we aim to represent the test cases
in a rule engine-independent format. This independent format then needs to be interpreted by each
rule engine and transformed into its own format for loading the data and rules. The properties
engines and testCases embed the rule engine and test case configurations.
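   As an illustration, such a configuration could be assembled along the following lines, using the configuration classes shown in Figure 1 and the properties from Tables 1, 2, and 3. This is a minimal sketch under assumed names: all setter names and the concrete values are assumptions for illustration and may differ from RUBEN's actual code.

// Hypothetical sketch; class names follow Figure 1, setter names and values are assumptions.
// (java.util.List and java.util.Map are assumed to be imported.)
BenchmarkConfiguration config = new BenchmarkConfiguration();
config.setName("ruben-evaluation");                  // label used for the result file
config.setTestDataPath("/data/ruben/testdata");      // root folder containing the test data

ReasoningEngineConfiguration jena = new ReasoningEngineConfiguration();
jena.setName("Jena");                                // must match the engine folder in the test data
jena.setClasspath("benchmark.engines.JenaEngine");   // class implementing the RuleEngine interface (assumed)
jena.setSettings(Map.of("mode", "backward"));        // optional engine-specific settings

TestCaseConfiguration testCase = new TestCaseConfiguration();
testCase.setTestCategory("large_join");              // test category folder (name assumed)
testCase.setTestName("join1");                       // test folder
testCase.setTestCaseIdentifier("join1_50k");         // file names within the test folder

config.setEngines(List.of(jena));
config.setTestCases(List.of(testCase));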
   1 https://www.stardog.com/
   2 https://www.drools.org/
   3 https://jena.apache.org/
   4 Check https://github.com/kev-ang/RUBEN for the source code.
[Figure 1 shows a UML component diagram of RUBEN: the Configuration component (BenchmarkConfiguration, which uses ReasoningEngineConfiguration and TestCaseConfiguration) is used by the BenchmarkExecution component (Ruben and BenchmarkExecutor), which accesses the rule engines Drools, Jena, Stardog, and VLog through the RuleEngine interface.]

Figure 1: UML-Component diagram of RUBEN


   Each rule engine can be configured by using the properties name, classpath, and settings (see
Table 2). The name of the rule engine is used to identify the corresponding test data within
the test data folder. The classpath refers to the class within the evaluation framework
used to execute the evaluation. The optional settings property can be used to provide rule
engine-specific settings. Including the same rule engine with a different name and settings allows evaluating
multiple configurations for the same rule engine.
   Table 3 presents the configuration of test cases and properties that need to be specified. A
test case consists of a testCategory, testCaseIdentifier, and testName. The testCategory is used to
categorize tests. Within those categories the testName identifies the name of the test. Each test
Table 1
General configuration
 Property                                                 Description
 name                                                     Specify a name for the configuration.
 engines                                                  Engines to be used for the evaluation. For further
                                                          details on how to configure the engines, see Table
                                                          2.
 testCases                                                Configure the test cases to be included in the
                                                          evaluation. For further details on how to config-
                                                          ure the test cases, see Table 3.
 testDataPath                                             Path to the folder containing the data required
                                                          for the evaluation. The structure of the folder
                                                          containing the test data must follow a predefined
                                                          pattern. For each rule engine to be evaluated a
                                                          folder is needed (the folder name must be equal to
                                                           the name field in Table 2). Inside this folder there
                                                          must be a folder for each test category (see test-
                                                          Category in Table 3). Within the category folder
                                                          a folder with the name equal to the testName (see
                                                          Table 3) needs to be included. The files within
                                                          the test folder need to be named according to the
                                                          testCaseIdentifier values (see Table 3).


Table 2
Configuration of a rule engine
 Property                                                 Description
 name                                                     Name of the rule engine to be evaluated. This
                                                          name must be used as name for the folder within
                                                          the test data folder.
 classpath                                                Refers to the implementation of the rule engine
                                                          within the framework.
 settings (optional)                                      Define additional settings for the rule engine.
                                                          Those settings are provided as a map consisting
                                                          of key values.


can have multiple test cases identified by the testCaseIdentifier.
   The path for loading the test data for each test case and rule engine is composed of different
information in the configuration and has the following structure:
{testDataPath}/{engine_name}/{testCase_testCategory}/{testCase_testName}/{testCase_testCaseIdentifier}
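For example, for the Jena engine with a hypothetical test case configuration (category large_join, test name join1, identifier join1_50k; the concrete folder and file names are assumptions), the test data would be loaded from:

{testDataPath}/Jena/large_join/join1/join1_50k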
  Including a rule engine into the evaluation framework requires the implementation of the
RuleEngine interface. Table 4 presents the methods to be implemented for a rule engine.
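   To illustrate, a minimal Java sketch of this interface, derived from the methods in Table 4, could look as follows. The parameter and return types (a String query, a TestCase object, an int result count, a Map for the settings) are assumptions for illustration and may differ from RUBEN's actual source code.

// Sketch of the RuleEngine interface based on Table 4; exact types are assumptions.
import java.util.Map;

public interface RuleEngine {

    // Load the data and rules relevant for the given test case from the test data path.
    void prepare(String testDataPath, TestCase testCase);

    // Execute a single query and return the number of results.
    int executeQuery(String query);

    // Invalidate caches and remove loaded data after a test case has been evaluated.
    void cleanUp();

    // Stop all processes started by the engine and delete temporary data.
    void shutDown();

    // Name of the engine as defined in the configuration.
    String getEngineName();

    void setEngineName(String engineName);

    // Engine-specific settings provided as key-value pairs.
    void setSettings(Map<String, String> settings);
}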
Table 3
Configuration of a test case
 Property                                        Description
 testCategory                                    Category the test belongs to. The value of this
                                                 property is used to distinguish test categories for
                                                 each rule engine.
 testCaseIdentifier                              Unique identifier for the given test case. The
                                                 value of this property needs to be used for naming
                                                 the test case files.
 testName                                        Each test category can have various tests. The
                                                 value of this property is used to distinguish tests
                                                 within a category.


Table 4
Rule Engine Interface
 Method                                          Description
 cleanUp()                                       This method is used to clean up the rule engine
                                                 after the evaluation of a test case. Caches need
                                                 to be invalidated and the data removed from the
                                                 rule engine.
 executeQuery(query)                             Executes a single query. As a return value the
                                                  number of results needs to be returned.
 getEngineName()                                 Returns the name of the engine as defined in the
                                                 configuration.
 prepare(testDataPath, testCase)                 Based on the given test data and the current test
                                                 case, the rule engine needs to load the relevant
                                                 data and prepare everything for the execution of
                                                 queries.
 setEngineName(engineName)                       Allows setting the name of the rule engine.
 setSettings(settings)                           Rule engine specific settings are provided via this
                                                 method. The rule engine gets those settings in a
                                                 map consisting of key-value pairs.
 shutDown()                                      Stop all processes initiated by the rule engine.
                                                 All temporary data that was created during the
                                                 evaluation needs to be cleaned up.


   For an overview of the flow through the framework, we present the steps taken to load the
configuration and execute the test cases. Initially, the main component (Ruben) loads
the evaluation configuration. Afterward, the framework iterates through the provided rule
engine configurations to execute all test cases for each rule engine. To evaluate the test cases,
Ruben forwards the required information about the test data, the rule engine, and the test
cases to the BenchmarkExecutor component. In the first step, the BenchmarkExecutor calls the
prepare method of the current rule engine to load the relevant test data and rules required for
the given test case. Afterward, the queries of the provided test case are executed using the
executeQuery method, and the results in the form of the number of results are stored in a report
together with the execution time. After executing all queries for a given test case, the cleanUp
method is called to prepare for the next test case. Finally, when all test cases are evaluated,
the BenchmarkExecutor calls the shutDown method of the rule engine to stop all processes and
clean up all temporary files. The results are collected, and the framework continues with the
following rule engine.
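   As an illustration, this flow can be sketched roughly as follows. Only the method names come from the RuleEngine interface above; the loop structure, the getQueries() accessor, and the report handling are simplifying assumptions.

// Simplified sketch of the benchmark flow for one rule engine; report handling is assumed.
for (TestCase testCase : testCases) {
    engine.prepare(testDataPath, testCase);              // load data and rules for this test case
    for (String query : testCase.getQueries()) {         // getQueries() is an assumed accessor
        long start = System.currentTimeMillis();
        int numberOfResults = engine.executeQuery(query);
        long duration = System.currentTimeMillis() - start;
        report.add(engine.getEngineName(), testCase, query, numberOfResults, duration);
    }
    engine.cleanUp();                                     // prepare the engine for the next test case
}
engine.shutDown();                                        // stop processes, remove temporary data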


3. Test Cases
For the initial version of RUBEN, we rely on the test cases provided by the OpenRuleBench
[4]. OpenRuleBench’s main aim is to test several tasks that rule engines are known to be good at.
Therefore, the authors in [4] used datasets of different sizes ranging from 50,000 to 1,000,000
facts. Alongside the generated test cases, OpenRuleBench includes four real-world benchmarks:
DBLP database5 , Mondial6 , wine ontology7 , and WordNet8 . The selected tests are representative of
database and knowledge representation problems. Although all the rule engines in the original
OpenRuleBench support actions, such as those in production rule systems and Prolog, action
tests are not included due to the different paradigms of the engines.
   The authors in [4] introduce several test categories with dedicated test cases:

    • Large join tests
    • Datalog recursion
    • Default negation

  The following briefly introduces the tests for the different test categories. The descriptions
are taken from [4].

3.1. Large Join Tests
This test category contains two database join tests, Join1 and Join2, the LUBM-derived tests, and
the Mondial and DBLP tests.

Join1 is a non-recursive tree of binary joins. Those joins are represented by the
following rules9:

        a(X,Y) :- b1(X,Z), b2(Z,Y).
        b1(X,Y) :- c1(X,Z), c2(Z,Y).
        b2(X,Y) :- c3(X,Z), c4(Z,Y).
        c1(X,Y) :- d1(X,Z), d2(Z,Y).

  The facts for the base relations c2, c3, c4, d1, and d2 are randomly generated, resulting in two
datasets with 50,000 facts and 250,000 facts. Based on the derived predicates a, b1, and b2 the
    5 http://www.informatik.uni-trier.de/~ley/db/
    6 http://www.dbis.informatik.uni-goettingen.de/Mondial/
    7 http://www.w3.org/TR/2004/REC-owl-guide-20040210/#WinePortal
    8 http://wordnet.princeton.edu
    9 The rules within this section are represented using the Prolog syntax.
test queries are formulated using different bindings for the variables: free-free, free-bound, and
bound-free.
   Additionally, a test called 5*Join1 was created out of five copies of the previous rule set. Here,
the predicate a(X,Y) was renamed to a1, ..., a5. Then, a new rule unioning the results of the
five copies was introduced.
     a(X,Y) :- a1(X,Y); a2(X,Y); a3(X,Y); a4(X,Y); a5(X,Y).

Join2 consists of patterns of joins borrowed from [5]. Those joins produce a large intermediate
result but a small set of answers.
     ra(A,B,C,D,E) :- p(A),p(B),p(C),p(D),p(E).
     rb(A,B,C,D,E) :- p(A),p(B),p(C),p(D),p(E).
     r(A,B,C,D,E) :- ra(A,B,C,D,E),rb(A,B,C,D,E).
     q(A) :- r(A,_,_,_,_).
     q(B) :- r(_,B,_,_,_).
     q(C) :- r(_,_,C,_,_).
     q(D) :- r(_,_,_,D,_).
     q(E) :- r(_,_,_,_,E).

The query in this test case determines all facts for the predicate q; the content of the base
relation p is p(a0), ..., p(a18).

LUBM-derived Tests include three rule sets adapted from the original Lehigh University
Benchmark, LUBM [6]. LUBM is a generator for a synthetic university dataset consisting of
universities (the number can be specified for the generation), courses, departments, professors, and
students. For the OpenRuleBench, the authors generated two datasets: one for ten universities,
resulting in more than 1,000,000 tuples, and one for 50 universities, resulting in over 6,000,000
tuples. The original LUBM benchmark provides 14 queries, of which three (Query1, Query2,
and Query9) were selected and adapted.
query1(X) :- takesCourse(X,graduateCourse0), graduateStudent(X).

   Query1 joins two relations on an attribute with high selectivity (i.e., each tuple in one relation
joins with a small number of tuples in the other relation).
query2(X,Y,Z) :- graduateStudent(X), memberOf(X,Z), undergraduateDegreeFrom(X,Y),
    university(Y), department(Z), subOrganizationOf_0(Z,Y).
  Query2 joins three unary and three binary relations with high selectivity. Therefore, the final
answer, even for the largest dataset, consisting of 50 universities, contains only around
100 tuples.
query9(X,Y,Z) :- advisor(X,Y), teacherOf(Y,Z), takesCourse(X,Z), student(X),
    faculty(Y), course(Z).
  Query9 joins three binary and three unary relations with a lower selectivity than Query1 and
Query2. The answer to this query is rather large (for the large dataset, around 10,000 tuples).
Mondial is a database consisting of geographical information derived from the CIA Fact-
book10 . It contains around 60,000 facts providing information about cities, provinces, and
countries worldwide. Only one query is used for this test case, delivering statistical informa-
tion about provinces in China. The particularity of the query is the large intermediate result
(1,676,942 facts) compared to the small final result (888 facts).

DBLP is a database representing publications about databases and logic programming. The
database is derived from the Web-based bibliography DBLP11 . The test is built from a single
relation with nearly 2,500,000 facts about more than 200,000 publications.
q(Id,T,A,Y,M) :- att(Id,title,T), att(Id,year,Y), att(Id,author,A), att(Id,month,M).

  The query within this test case is a 4-way join of parts of the same database relation.

3.2. Datalog Recursion
This category includes test cases using recursion, a significant feature for distinguishing rule-
based systems from traditional database management systems. The authors of the Open-
RuleBench intended to evaluate the performance of such queries by providing recursive tests.
The tests in this category include:
    • Classical transitive closure
    • The same-generation siblings problem
    • WordNet (application from natural language processing)
    • The wine ontology in a rule-based representation

Transitive Closure of a binary relation (par) is the smallest transitive relation containing par.
tc(X,Y) :- par(X,Y).
tc(X,Y) :- par(X,Z), tc(Z,Y).

   Deriving the relation tc causes trouble in traditional Prolog systems if the two predicates in the
body of the second rule switch places, as the evaluation then becomes left-recursive. Also, when
the par relation represents a cyclic graph, Prolog might run into an infinite loop. Four datasets
were randomly generated for this test: with cycles (50,000 and 500,000 facts) and without cycles
(also 50,000 and 500,000 facts).

Same-Generation Problem tries to find all siblings of the same generation. Here, the same
generation refers to an equal distance from a common ancestor.
sg(X,Y) :- sib(X,Y).
sg(X,Y) :- par(X,Z), sg(Z,Z1), par(Y,Z1).

   The base relations par and sib were randomly generated, and again two types of datasets
(cyclic and acyclic) are used for this test. The smaller datasets consist of 6000 facts and the
larger ones of 24000 facts.
   10 https://www.cia.gov/the-world-factbook/
   11 http://www.informatik.uni-trier.de/~ley/db/
WordNet includes common queries from natural language processing. The queries of the
provided test case seek to find all:
    • hypernyms - words more general than the given word
    • hyponyms - words more specific than the given word
    • meronyms - words related by the part-of-a-whole semantic relation
    • holonyms - words related by the composed-of relation
    • troponyms - words more precise than the given word
    • same-synset - a word that is in the same set of synonyms as the given word
    • glosses
    • antonyms - a word with the opposite meaning of the given word
    • adjective-clusters

   As the basis for the test data, WordNet Version 3.0 was used. The database consists of around
115,000 synsets containing over 150,000 words in total. Most tests deliver more than 400,000
facts, and some cases exceed 2,000,000 facts. In this paper, we only show the hypernyms test.
For a complete description of the tests, check the GitHub repository.12

hypernyms(W1,W2) :- s(S1,_,W1,_,_,_), hypernymSynsets(S1,S2), s(S2,_,W2,_,_,_).

hypernymSynsets(S1,S2) :- hypernym(S1,S2).

hypernymSynsets(S1,S2) :- hypernym(S1,S3), hypernymSynsets(S3,S2).


Wine Ontology is a rule-based representation of the OWL wine ontology, consisting of 815
rules and 654 facts. A characteristic of this test is the recursive dependency of many predicates
on each other through chains of rules, resulting in large groups of predicates connected via the
depends-on relationship. Those large groups are especially critical for top-down engines.

3.3. Default Negation
In this category, the tests include default negation in the body of the rules. In contrast to the
OpenRuleBench, we focus only on predicate-stratified negation [7] so far. A modified same-
generation problem is used for the predicate-stratified negation.
   The modified same-generation test is as follows:

nonsg(X,Y) :- tc(X,Y).
nonsg(X,Y) :- tc(Y,X).
sg2(X,Y) :- sg(X,Y), not nonsg(X,Y).

  Again, as for the original same-generation problem, the base relations of par and sib were
randomly generated, with cycles in the data. This test is executed for the two data sets of 6000
and 24000 facts.
   12 https://github.com/kev-ang/RUBEN
4. Evaluation and Results
This section presents a subset of the evaluation results produced by RUBEN. We focused on the
test cases and a subset of the tools evaluated in OpenRuleBench [4]. This section first presents
the experimental setup, then the methodology, and finally, the evaluation results.

4.1. Experimental Setup
The evaluation framework was hosted on a server with an Intel Xeon E5 v3 processor (6 cores /
12 threads, 3.50 GHz base frequency, 3.80 GHz max turbo frequency) and 256 GB RAM. Out of the
256 GB RAM, only 32 GB were made available to the framework and, therefore, to the engines.
The operating system on the machine is Debian GNU/Linux 11.
   Similar to the OpenRuleBench, we evaluate engines from different categories. In total, we
evaluate four engines from three categories. The engines to be evaluated are assigned to the
following categories:

    • Deductive Databases
              – Stardog - implemented in Java, providing extensive reasoning capabilities, including
                Datalog evaluation.
              – VLog - implemented in C++, providing an efficient Datalog engine for large knowledge
                graphs supporting RDF, OWL, and SPARQL. The efficiency (memory usage and
                speed) is a result of the combination of a column-based layout with novel optimization
                methods. VLog’s code is open source and freely available.
    • Production rule system
             – Drools - a bottom-up engine based on the Rete algorithm [8]. The Rete algorithm
               combines semi-naive bottom-up computation [9] with a certain heuristic for common
               expression elimination [10].
    • Rule engines for triples
             – Apache Jena - a Java-based framework including two rule engines (bottom-up and
               top-down).

   Apache Jena, Drools, and Stardog are rule engines that do not materialize the rules. Materializing
implies calculating all the implicit facts at load time and storing them. In contrast, VLog
materializes the rules and, for evaluating a query, can directly access the materialized
facts13 . Therefore, the results are not directly comparable.

4.2. Methodology
The used data sets ranging from 50,000 to 1,000,000 facts are generated with the help of data
generators. For running the various tests OpenRuleBench provides, we adopted the test cases.
We provided a data file containing the facts, a rule file, and a file containing the queries to be
evaluated. This allows providing the files in a format optimal for the rule engine. Additionally,
   13 The paper mentions a query-driven reasoning mode, which we did not find so far in the examples.
rules can be optimized for specific test cases and for each engine. Each query is evaluated twice,
and the response time of the second run is taken as the result. The first execution is used to
initialize all indices and fill the caches; the second run then represents the optimized query
response time. The query response time includes the rule evaluation and the result counting.
The results for each test case are collected and returned in a file.
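   A minimal sketch of this two-run measurement, using the executeQuery method from the RuleEngine interface (the timing code itself is an assumption and may differ from RUBEN's implementation), looks as follows:

// The first run warms up indices and caches; only the second run is reported.
engine.executeQuery(query);                           // warm-up execution, result discarded
long start = System.nanoTime();
int numberOfResults = engine.executeQuery(query);     // measured run: rule evaluation + result counting
long responseTimeMs = (System.nanoTime() - start) / 1_000_000;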
   Due to the limited space, we present a small subset of the evaluation results:

    • Large join tests - include database joins (50,000 facts).
    • Datalog recursion - include classical transitive closure and the well-known same-
      generation siblings problem.

  The selection of those test cases was based on the capabilities of the given rule engines. Since
Drools and Jena already participated in the OpenRuleBench, we knew upfront which test cases
they have in common. Those test case candidates were then also checked against the other rule engines.

4.3. Results
Focusing on the number of test cases the engines could evaluate, Drools and VLog supported the
large join and Datalog recursion tests. Jena and Stardog could not evaluate tests like Join2
and the Same Generation test with negation: the former requires predicates like p("abcd0"), and
the latter requires negation, which is not supported by either engine’s rule language.
Tables 5, 6, and 7 present the results of those tests that were supported by the selected engines.
   Special values in cells indicate unexpected behavior: error means that the query evaluation
did not finish as expected; timeout indicates that the evaluation did not finish within 15
minutes. All times within the tables are given in seconds and contain only the rule evaluation and
result counting. Loading time is excluded from those numbers. For VLog, we discuss the results
separately, as it materializes all the rules before evaluating the queries. The materialization
time for the different test cases for VLog is given in parentheses in the table cells.

Table 5
Large joins, join1, no query bindings (time in seconds)
          query                     a(X,Y)                  b1(X,Y)               b2(X,Y)
           Size                     50000                    50000                 50000
         Drools                     error                    error                 error
           Jena                   timeout                    104.9                  2.7
         Stardog                 exception                    37.2                  1.4
          VLog                   0.122 (356)              0.114 (356)           0.110 (356)

   The results in Table 5 show that Stardog is the fastest engine for the large join test among
those that do not materialize the rules upfront. The large join test uses the same dataset for
all three queries. Therefore, the materialization time shown for VLog is relevant only once for
the whole test case. The materialization time for this test case is quite high, at nearly six
minutes. However, the query times are then a tiny fraction of Stardog’s.
Table 6
Datalog recursion, same generation, no query bindings (time in seconds)
     size                6000                  6000                 24000            24000
  Cyclic data             no                    yes                   no               yes
    Drools               error                 error                 error            error
     Jena                53.8                   61                  189.1            239.1
    VLog               0.027 (5)             0.030 (6)            0.030 (21)       0.030 (24)

Table 7
Datalog recursion, transitive closure, no query bindings (time in seconds)
     size               50000                 50000                500000           500000
  Cyclic data             no                    yes                   no               yes
    Drools               error                 error                 error            error
     Jena                 8.9                  28.6                  76.7            346.5
   Stardog                10                   27.8                  66.6            262.5
    VLog               0.058 (1)             0.126 (5)            0.064 (12)       0.124 (51)


   Jena is the fastest non-materializing engine for the Datalog recursion same-generation test in
Table 6, as Drools only throws errors. Here, VLog would be faster than Jena even if the
materialization time were added to the query times. Stardog was not able to execute this test
case and was therefore left out.
   Stardog is the fastest engine for the Datalog recursion transitive closure test in Table 7
among the engines without materialization. As in the test case before, VLog would be faster than
Stardog even if each query’s materialization time were included.
   Drools failed with an OutOfMemoryError in all test cases. When evaluating the rules in Drools,
Java objects are generated; this number of objects proliferates, causing the out-of-memory error.
For the next run (next year), optimizing and adapting the rules will be necessary to avoid
out-of-memory errors.


5. Related Work
This section presents related work about rule engine benchmarking frameworks. Besides, in
the second part of this section, we present some benchmarking datasets used by various rule
engine implementations for the evaluation.

5.1. Benchmarking Frameworks
One well-known framework is OpenRuleBench [4]. OpenRuleBench analyzes the perfor-
mance and scalability of various rule engines by providing a set of test cases covering different
functionalities. OpenRuleBench is open source and welcomes contributions from the com-
munity. However, the last evaluation report dates from 2011. Besides, running OpenRuleBench is
cumbersome due to customized scripts for each supported rule engine. Integrating a new rule
engine into the framework requires writing shell scripts and adapting the given test cases to
introduce the rule engine.
   A more up-to-date framework for benchmarking rule-based inference engines is [11]. The
authors provide a fully automated framework for benchmarking rule-based reasoning engines.
A configuration for a generation tool that generates a meta-model needs to be provided. Then,
the generated meta-model is translated into the different rule models of the rule engines to be
benchmarked. The translated model is then fed into the respective rule engine, and reasoning
time and memory consumption are measured. The different parts of the framework are
implemented in Java. However, when integrating a new rule engine into the framework, a
translation class needs to be implemented to translate the meta-model into the corresponding
rule model. Afterward, a shell script needs to be provided to run the test cases on the new rule
engine. In the end, there is an overall shell script responsible for running the whole benchmark.
This framework supports only generated test cases, which is a drawback compared to, for
example, OpenRuleBench.
   In contrast to OpenRuleBench and the other framework, RUBEN is not based on shell scripts
and is fully implemented in Java. RUBEN provides interfaces to be implemented by rule engines that
should be included in the benchmark, allowing seamless integration.

5.2. Benchmarking Datasets
Two popular rule engines published in recent years are RDFox [2] and VLog [3]. The authors did
not use one of the previously introduced frameworks for their evaluation. They used existing
datasets and adapted them to measure the performance. In the following, we first present
datasets of rules followed by datasets for benchmarking RDF and OWL systems.
   Datasets including rules are, e.g., Manners or WaltzDB14 . These datasets focus on evaluating
pure matching algorithms (selecting rules to be evaluated) or testing a subset of available
reasoning engines. However, they cannot be used for systems other than those for which they
were defined.
   RDFox and VLog rely more or less on the same set of test data15 . One of those datasets is
Claros, a cultural database catalog representing archaeological artifacts. This dataset does not
come with rules. Therefore, the authors added some manually generated rules. Another dataset
is DBpedia, representing structured data extracted from the Wikipedia infoboxes. As for Claros,
the authors extended the dataset with manually generated rules. Both previous datasets are
real-world datasets. In the following, we briefly introduce two synthetic datasets called
LUBM [6] and an extension of LUBM called UOBM [12]. LUBM is widely used for RDF
systems with an ontology describing universities, departments, professors, and students. The
data is generated by specifying the number of universities and comes with 14 well-designed
test queries. Those queries cover the features of traditional reasoning systems. However, rules
are not included, and the authors of VLog and RDFox added them manually. UOBM extends
LUBM by addressing the sparsity of the data. To this end, external links between members of
different university instances are added, resulting in exponential growth of the complexity for
scalability testing.
   14 https://www.cs.utexas.edu/ftp/ops5-benchmark-suite/
   15 http://www.cs.ox.ac.uk/isg/tools/RDFox/2014/AAAI/
6. Conclusion and Future Work
In this paper, we presented a rule engine benchmarking framework called RUBEN, implemented
in Java. RUBEN evaluates the performance and scalability of rule engines, which is essential for
selecting the rule engine best fitting a given use case. In contrast to existing benchmarking
frameworks like OpenRuleBench [4], RUBEN provides a simple interface a rule engine needs to
implement to be included in the benchmark. Unfortunately, the first version of RUBEN requires
adapting the test cases to the newly introduced rule engine. So far, RUBEN provides most of the
tests initially introduced in the OpenRuleBench suite. The rule engines supported by RUBEN are
a subset of the engines introduced by OpenRuleBench. Therefore, we welcome the community
to extend the list of rule engines and test cases.
   In the future, we plan to add a general description of test cases by introducing a common rule
format. Then, each rule engine can include a translation method for converting the common
rule format into the required format, while keeping the possibility to provide optimized rule sets
for specific cases. Besides, the list of rule engines will be extended by including RDFox [2]. Finally,
possibly with the help of the community, we will extend the test cases to fully cover and go
beyond the list of test cases provided by OpenRuleBench. Further, it is planned to repeat the
benchmark every year.


References
 [1] J.-F. Baget, M. Leclère, M.-L. Mugnier, S. Rocher, C. Sipieter, Graal: A toolkit for query
     answering with existential rules, in: International Symposium on Rules and Rule Markup
     Languages for the Semantic Web, Springer, 2015, pp. 328–344.
 [2] Y. Nenov, R. Piro, B. Motik, I. Horrocks, Z. Wu, J. Banerjee, RDFox: A highly-scalable RDF
     store, in: International Semantic Web Conference, Springer, 2015, pp. 3–20.
 [3] D. Carral, I. Dragoste, L. González, C. Jacobs, M. Krötzsch, J. Urbani, VLog: A rule engine
     for knowledge graphs, in: International Semantic Web Conference, Springer, 2019, pp.
     19–35.
 [4] S. Liang, P. Fodor, H. Wan, M. Kifer, OpenRuleBench: An analysis of the performance of
     rule engines, in: Proceedings of the 18th International Conference on World Wide Web,
     2009, pp. 601–610.
 [5] B. Bishop, F. Fischer, IRIS - Integrated Rule Inference System, in: International Workshop on
     Advancing Reasoning on the Web: Scalability and Commonsense (ARea 2008), 2008.
 [6] Y. Guo, Z. Pan, J. Heflin, LUBM: A benchmark for OWL knowledge base systems, Journal of
     Web Semantics 3 (2005) 158–182.
 [7] K. R. Apt, H. A. Blair, A. Walker, Towards a theory of declarative knowledge, in: Founda-
     tions of deductive databases and logic programming, Elsevier, 1988, pp. 89–148.
 [8] C. L. Forgy, Rete: A fast algorithm for the many pattern/many object pattern match
     problem, in: Readings in Artificial Intelligence and Databases, Elsevier, 1989, pp. 547–559.
 [9] J. D. Ullman, Principles of Database and Knowledge-Base Systems – Volume I: Classical
     Database Systems, Computer Science Press, New York, Oxford, 1988. URL: http://dl.acm.
     org/citation.cfm?id=42790. The 8th reprint of the book was published in 1995.
[10] F. Bry, N. Eisinger, T. Eiter, T. Furche, G. Gottlob, C. Ley, B. Linse, R. Pichler, F. Wei,
     Foundations of rule-based query answering, Reasoning Web International Summer School
     (2007) 1–153.
[11] S. Bobek, P. Misiak, Framework for benchmarking rule-based inference engines, in:
     International Conference on Artificial Intelligence and Soft Computing, Springer, 2017, pp.
     399–410.
[12] L. Ma, Y. Yang, Z. Qiu, G. Xie, Y. Pan, S. Liu, Towards a complete OWL ontology benchmark,
     in: European Semantic Web Conference, Springer, 2006, pp. 125–139.