<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CILC</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Toward Executing Datalog on Big Data Platforms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Cuteri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Mazzotta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Ricca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Calabria</institution>
          ,
          <addr-line>Rende</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>40</volume>
      <fpage>25</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>Modern databases must be capable of querying massive volumes of data, as we live in the era of Big Data. At the same time, Datalog, a logic language at the roots of database theory, remains a powerful formalism for expressing complex queries in a declarative manner. In this paper, we take the first steps toward bridging these two worlds in an innovative way. In particular, we introduce an approach that compiles stratified Datalog programs into Spark jobs for execution on Big Data frameworks. Our approach aims at effectively combining the declarative nature of logic rules with the scalability of modern distributed systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Datalog</kwd>
        <kwd>Big Data</kwd>
        <kwd>Spark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Modern databases are increasingly required to manage and process massive volumes of data generated
by diverse applications [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The continuous growth of data, both in size and complexity, has pushed
traditional systems beyond their limits, necessitating new architectural and computational approaches.
To efficiently store, retrieve, and analyze such vast datasets, modern databases often operate over large-scale distributed clusters. This trend has accelerated research into scalable systems capable of handling
complex analytical tasks across distributed environments [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Consequently, designing efficient, fault-tolerant, and highly performant database solutions has become a critical focus in both academia and
industry [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Among the prominent systems developed to address these challenges there is Apache
Spark [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a widely adopted platform for large-scale data analytics. Spark enables efficient in-memory
processing across distributed clusters, significantly accelerating the computation of complex analytics
tasks. Its flexible architecture supports a variety of workloads, ranging from simple queries to advanced
machine learning pipelines. However, implementing analytics in Spark often requires programmers to
manually express the evaluation of queries using imperative programming constructs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        On the other hand, declarative query languages offer higher-level abstractions that allow users to
specify what results they want without detailing how to compute them, leading to improved productivity
and reduced potential for programming errors compared to imperative approaches. In the database
landscape, Datalog has emerged as a foundational declarative language, deeply influencing both the
theoretical underpinnings and practical applications of database systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Datalog allows the
formulation of complex queries using a declarative logic-based formalism [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The limitations of
traditional Datalog evaluation systems in handling Big Data, due to their inability to leverage modern
distributed data management environments [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], pave the way for a new generation of Datalog
engines designed on top of scalable platforms like Spark, possibly unlocking unprecedented potential
for declarative analytics at massive scale.
      </p>
      <p>
        In this paper, we make a preliminary contribution toward this goal by presenting Datalog2Spark,
a system that explores the compilation of Datalog queries into Spark programs for evaluation over
computer clusters. In this first prototype, we focus on non-recursive Datalog programs with stratified
negation and incorporate some useful aggregation functions. Both extensions were conceived by recognizing that (i) the support of data types is fundamental in database systems, and (ii) aggregation has been observed to be crucial for effective data processing and analytics [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For this reason, we first extend the Datalog language with constructs for type definitions, along the lines of previous studies [8]. Furthermore, we support some aggregation functions designed to enhance the expressive power and practical utility of our tool. To assess the viability of our approach, we conducted a preliminary experimental evaluation of our implementation, comparing it with two main-memory engines capable of processing Datalog. The first system, dlv [9], is well known for extending a powerful Datalog engine into a comprehensive logic programming platform for declarative reasoning under the Answer Set Programming (ASP) paradigm. The second system, clingo [10], is an ASP solver that can also evaluate Datalog programs, leveraging its support for the broader ASP language. The obtained results confirm the viability of the approach: they demonstrate its potential to bring the benefits of declarative programming to Big Data analytics while delivering acceptable performance in single-server environments compared to existing systems, enabling seamless scale-out over modern distributed data management platforms, and supporting queries with aggregates that the compared systems cannot handle.
      </p>
      <p>The remainder of this paper is structured as follows. In Section 2 we review the basics of Datalog and Apache Spark. Section 3 presents an extension of the Datalog language as well as the novel compilation-based technique implemented by Datalog2Spark. Section 4 provides an experimental analysis. Finally, Section 5 discusses related work and Section 6 draws the conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <p>In this section we provide basic notions about Apache Spark and Datalog, which are at the core of the contribution of this paper.</p>
      <sec id="sec-2-0">
        <title>2.1. Spark</title>
        <p>Apache Spark is a unified engine allowing for parallel data processing on distributed clusters [11]. Such an engine provides a powerful, yet efficient, programming API, which allows distributed applications to combine (possibly) multiple workflows such as streaming data analysis, machine learning, and structured data analysis, among others. The Spark programming API, available in different programming languages such as Java, Python, Scala, and R, is built on top of Resilient Distributed Datasets (RDDs) [12]. An RDD is an immutable and distributed in-memory dataset which represents the basic data abstraction in Spark. RDDs have since been refined into another data abstraction called the Spark Dataset. A Spark Dataset supports mainly two kinds of operations, namely transformations and actions. Transformations are all those operations that, once applied to a Dataset, generate a new Dataset obtained by applying the transformation to all the elements of the input Dataset; examples include filtering, projection, and sorting, among many others. Actions, instead, are those operations that consume the elements of an input Dataset to produce a specific result; examples include counting or printing.</p>
        <p>One of the most important features of the Spark programming API, and so of Datasets, is that transformations are lazily evaluated. More precisely, Spark keeps track of the sequence of transformations (also known as the lineage) which defines each Dataset. As soon as an action is invoked, Spark identifies the sequence of transformations needed for computing the Dataset on which the action has been invoked, and creates one or more jobs for carrying out the computation. A job is made of stages, which are collections of tasks that can be executed in parallel. The transformations of each job are organized into a directed acyclic graph (DAG) which defines the dependency relations between transformations. Finally, Spark translates the DAG into an optimized execution plan which is executed in order to compute the result of the invoked action. This feature makes Spark applications very efficient, since Datasets are computed only when they are needed, but at the same time also very fault tolerant, since the transformations defining each Dataset are stored and so they can be (partially) re-executed if some errors happen.</p>
        <p>Example 1. Let us consider the data reported in the following table:</p>
        <p>room_id   timestamp   measure
room_1    1           10
room_2    1           12
room_1    2           14
room_1    3           18</p>
      <sec id="sec-2-1">
        <title>Assuming that the table is stored into a CSV file named sensors.csv, then the following spark application</title>
        <p>reads this CSV file and finds all the available measurements for “room_1”.</p>
        <p>S p a r k S e s s i o n s e s s i o n = S p a r k S e s s i o n . b u i l d e r ( ) . g e t O r C r e a t e ( ) ;
D a t a s e t &lt;Row &gt; s e n s o r s D a t a = s e s s i o n . r e a d ( )
. s c h e m a ( " r o o m _ i d s t r i n g , t i m e s t a m p i n t e g e r , m e a s u r e i n t e g e r " )
. c s v ( " s e n s o r s . c s v " ) ;
i n t r o o m 1 M e a s u r e s = s e n s o r s D a t a
. f i l t e r ( s e n s o r s D a t a . c o l ( " r o o m _ i d " ) . e q u a l T o ( " r o o m _ 1 " ) )
. show ( ) ;</p>
      </sec>
      <sec id="sec-2-2">
        <title>The first instruction builds the so called spark session which serves as entry point of all Spark func</title>
        <p>tionalities. By means of this session object, it is possible to read from diferent sources. In our case we
are reading the data from a source whose data follow a schema made of three columns, namely room_id,
timestamp, and measure; and it is represented by a CSV file. The resulting data defines a Spark Dataset
named sensorsData.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Then, sensorsData is used to print all the available measurements for room_1. To this end, first the filter transformation is applied for filtering all the rows having room_1 as value of the column “room_id”, then then the action show prints the result of our query.</title>
        <p>The execution of a Spark application is typically done in parallel. More precisely, there is one driver process and many worker processes. The driver process is responsible for managing the execution of the Spark application and for distributing the work across the workers (or executors). The executors carry out the work that is assigned to them by the driver program. When Spark is executed on a cluster, the executors are scattered across the network, and so they may be spawned on different physical machines. However, Spark also provides a local mode, where executor processes are spawned on the same machine but use different CPU cores.</p>
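        <p>As an illustration, the deployment mode is selected when the Spark session is created. The following is only a minimal sketch (the master URLs and application names are illustrative values, not part of Datalog2Spark): it builds one session that runs locally on 4 CPU cores and one that connects to a standalone cluster manager.</p>
        <p>// Local mode: driver and executors share one machine and use 4 CPU cores.
SparkSession local = SparkSession.builder()
    .master("local[4]")
    .appName("sensors-local")
    .getOrCreate();
// Cluster mode: the master URL points to the cluster manager; executors run on the worker machines.
SparkSession cluster = SparkSession.builder()
    .master("spark://cluster-manager:7077")
    .appName("sensors-cluster")
    .getOrCreate();</p>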
        <sec id="sec-2-3-1">
          <title>2.2. Datalog</title>
          <p>Terms are either variables or constants. Constants are strings starting with a lowercase letter, or integers, while variables are strings starting with an uppercase letter. An atom a is an expression of the form p(t1, ..., tn) where p is a predicate of arity n and t1, ..., tn are terms. An atom is ground if all its terms are constants. A literal is either an atom a or its negation not a, where not denotes negation as failure. A literal of the form a is said to be positive, while a literal of the form not a is said to be negative. Given a literal l = a (resp. l = not a), the complement of l is not a (resp. a). A rule r is an expression of the form h ← b1, ..., bn where h is an atom referred to as the head, and b1, ..., bn, referred to as the body, is a conjunction of literals. Given a rule r, B(r) denotes the set of literals in the body of r, and H(r) the atom appearing in the head of r. Given a set of literals S, we denote by S+ (resp. S−) the set of positive (resp. negative) literals of S. A program Π is a set of rules. The dependency graph of a program Π, G(Π), is a directed labeled graph whose nodes are the predicates appearing in Π, and whose edges contain a positive (resp. negative) edge (p, q) if there exists a rule r ∈ Π such that q is the predicate of H(r) and p appears in B(r)+ (resp. in B(r)−). A program Π is said to be stratified if G(Π) does not contain loops involving negative edges. A program is said to be recursive if G(Π) contains some loop.</p>
          <p>In this paper we consider stratified programs whose dependency graph G(Π) contains no loops at all, i.e., non-recursive programs.</p>
          <p>Example 2. Let us consider the sensors data from Example 1. In order to compute all the available measurements we can write the following Datalog program:
measure(room_1, 1, 10) ←
measure(room_2, 1, 12) ←
measure(room_1, 2, 14) ←
measure(room_1, 3, 18) ←
room1Measure(T, M) ← measure(room_1, T, M)</p>
      <sec id="sec-2-4">
        <title>Here the facts over the  predicate encode the ground truth about sensors measure. Then the rule</title>
        <p>1 (,  ) ← (_1, ,  ) is used to derive all those measures relative to
“room_1” which are encoded by atoms over predicate 1 , that are 1 (1, 10),
1 (2, 14), and 1 (3, 18).</p>
        <p>
          For further details we refer the reader to the dedicated literature [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Compiling Datalog into Spark</title>
      <p>To bridge the Datalog and Spark worlds, we provide a compilation-based approach aimed at compiling Datalog programs into ad-hoc Spark applications which serve as parallel and large-scale evaluators for the compiled programs. To this end, we first propose an extension of the base Datalog language in order to support typed predicates and aggregation functions; we then present our compilation technique.</p>
      <sec id="sec-3-1">
        <title>3.1. Extended Datalog</title>
        <p>Relational databases typically have a well-defined schema in which the columns of a given relation are associated with a specific data type. However, predicates in Datalog programs do not enforce a fixed schema. The lack of such a restriction may introduce unexpected behavior when Datalog predicates are mapped onto Spark Datasets, which impose a fixed schema. A natural way of filling this gap is to extend the Datalog base language with ad-hoc type directives. A type directive is an expression of the form #types pred_name(t1, ..., tn), where pred_name is a predicate name and t1, ..., tn represent the data types of its terms. More precisely, the supported data types are: numeric, for representing integers and floats, and string, for representing strings.</p>
        <sec id="sec-3-1-1">
          <title>Example 3. Let us consider the Datalog program from Example 2, and the following type directives:</title>
          <p># (, , )
# 1 (, )</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>By means of these two type directives we are imposing that, for the ground atoms over  predicate, terms must be respectively of type string, numeric, and numeric. Similarly for ground atoms over predicate 1 , both terms must be of numeric type.</title>
          <p>Type directives are not mandatory, and so for all those predicates for which a type directive does not exist, the terms are either considered as strings (for input predicates) or their types are inferred during compilation (for non-input predicates). By exploiting such a typing mechanism, users can specify predicates' schemas which must be satisfied during query evaluation.</p>
          <p>Another natural language extension is the use of aggregation functions. Aggregates are widely adopted in declarative formalisms and allow one to specify complex queries in a very compact and natural way [13]. To this end, we consider the standard aggregates defined in ASP-Core-2 as well as novel aggregate functions that are also supported by the Spark programming API. More precisely, a symbolic set is a pair of the form T : Conj, where T is a list of terms and Conj is a conjunction of literals. A ground set is a set of pairs c : Conj where c is a list of constants and Conj is a conjunction of ground literals. An aggregate function is an expression of the form f(S) where S is either a symbolic or a ground set and f ∈ {#count, #sum, #max, #min, #avg, #median, #stddev, #variance} is an aggregate function symbol. Note that the #avg, #median, #stddev, and #variance aggregate function symbols stand for descriptive statistics in their usual mathematical sense (i.e., average, median, standard deviation, and variance, respectively). An aggregate atom is an expression of the form f(S) ≺ g, where f(S) is an aggregate function, ≺ ∈ {&lt;, ≤, &gt;, ≥, =} is a comparison operator, and g is a term called the guard. An aggregate atom f(S) ≺ g is ground if S is a ground set and g is a constant.</p>
          <p>In the extended Datalog language we allow the usage of aggregate atoms in rule bodies.</p>
        <sec id="sec-3-1-3">
          <title>Example 4. Let us consider the Datalog program from Example 2. The following rule can used to derive the atom  (i.e., our query) if there are at least two measures available for “room_1”:</title>
          <p>←</p>
          <p>#{,  : 1 (,  )} ≥ 2</p>
          <p>By means of such aggregate atoms it is possible to model more involved queries requiring aggregations, which are very frequent in Big Data analytics contexts.</p>
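          <p>For intuition, such a count aggregate has a direct counterpart in the Spark API. The following minimal sketch, written by hand against the sensorsData Dataset of Example 1 rather than generated by Datalog2Spark, checks whether at least two measures are available for “room_1”:</p>
          <p>// Count the measures available for room_1; the query atom would be derived when the count is at least 2.
long room1Count = sensorsData
    .filter(sensorsData.col("room_id").equalTo("room_1"))
    .count();
boolean query = room1Count &gt;= 2;</p>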
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Compilation</title>
        <p>We now describe how extended Datalog programs (i.e. Datalog programs with type directives and
aggregates) can be compiled into ad-hoc Spark applications. The idea behind our approach consists of
interpreting predicates of a Datalog program Π as Spark datasets and rules of Π as transformations
over such datasets. In particular, each predicate in Π is associated with a Spark dataset which has as
many columns as the arity of the predicate; whereas each rule is evaluated by means of multiple join
transformations between predicates appearing in the body and ad-hoc transformations for aggregate
atoms.</p>
        <p>To compile rules containing aggregates, we first apply a program rewriting technique which normalizes them into rules where the conjunction inside each aggregate atom contains exactly one literal.</p>
        <p>W.l.o.g., we assume each rule of an extended Datalog program contains at most one aggregate atom. Let r be a rule of the form h ← l1, ..., ln, #f{T : Conj} ≻ g; then it is rewritten as follows:
body(X) ← l1, ..., ln
body_conj(Y, T) ← body(X), Conj
h ← body(X), #f{T : body_conj(Y, T)} ≻ g
where X is the union of all the terms appearing in the literals l1, ..., ln, and Y are those terms appearing both in Conj and in X. The following example should better clarify our rewriting.</p>
        <sec id="sec-3-2-1">
          <title>Example 5. Let  be the following rule:</title>
          <p>ℎ(, ) ← (,  ), (, ), #{ : (,  ), (, )} = 
Then our rewriting techniques transform  in the following rules:
(, , ) ←
_(, , ) ←
ℎ(, ) ←
(,  ), (, )
(, , ), (,  ), (, )
(, , ), #{ : _(, , )} =</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Intuitively, the first rule encodes the truth of the literals in the body  without considering the aggregate</title>
          <p>atoms by means of the predicate . Then, the second rule defines the join between literals in the body
of  and the conjunction of literals in the aggregate atom by means of the _ predicate. Finally, the
 and _ predicates are used to reconstruct the original rule.</p>
          <p>After rewriting all rules containing aggregates, the rewritten program is translated into a Spark
application. The resulting application serves as a solver for evaluating arbitrary instances of the
compiled program.</p>
          <p>In order to describe the proposed compilation techniques, we proceed with the following example of extended Datalog program that we refer to as Sensors:</p>
          <p>Example 6.
avgMeasure(T, R, A) ← t(T), r(R), #avg{M : measure(R, T, M)} = A.
interval(T1, T2, K) ← t(T2), t(T1), minPoints(K), D = T2 − T1, D &gt;= K.
abnormal(T1, T2, R) ← interval(T1, T2, K), r(R), comfort(R, Min, Max),
    #count{T : T &gt; T1, T &lt;= T2, avgMeasure(T, R, A), A &gt; Max} &gt;= K.
trash(B) ← abnormal(T1, T2, R), stored(B, R, S1, S2), T1 &gt;= S1, T2 &lt;= S2.</p>
          <p>In this example, program instances are indeed sequences of measures made by sensors that are placed in controlled environments called rooms, encoded by predicates measure/3 and r/1, and batches of products stored into rooms for a given amount of time, encoded by predicate stored/4. Then, we aim at identifying abnormal room conditions according to aggregated sensors' measures. More precisely, a room R experiences an abnormal condition over an interval I if the average measure of the sensors in the room falls outside the comfort range (specified for R) for at least K time points of the interval I. Finally, we determine which batches of products were stocked in a room under abnormal conditions. Such products should be trashed. A batch B of products stored in a room R from time T1 to time T2 should be trashed if the room R experienced some abnormal condition within the time range T1-T2.</p>
          <p>Let us consider the rule</p>
          <p>avgMeasure(T, R, A) ← t(T), r(R), #avg{M : measure(R, T, M)} = A.</p>
          <p>This rule is rewritten as follows:</p>
          <p>tr(T, R) ← t(T), r(R).
avgMeasure(T, R, A) ← tr(T, R), #avg{M : measure(R, T, M)} = A.</p>
          <p>Note that, in this case, the rewriting does not generate the rule defining body_conj, since the conjunction inside the aggregate atom already contains exactly one literal.</p>
          <p>These two rules are then compiled into Spark transformations. The rule tr(T, R) ← t(T), r(R) basically generates the cartesian product between time points and rooms. Thus, it can be compiled as follows: first of all, the compiler prints the instructions for loading the predicates t and r from CSV files into two Spark Datasets; then the compiler prints the instruction that computes the cartesian product between the two Datasets and saves the result of this transformation in a new Dataset tr. The following code snippet reports the code generated by the Compiler for the rule tr(T, R) ← t(T), r(R).</p>
          <p>Dataset&lt;Row&gt; t = session.read()
    .format("csv")
    .schema("t0 float")
    .load(instance_path + "t.csv")
    .dropDuplicates();
Dataset&lt;Row&gt; r = session.read()
    .format("csv")
    .schema("t0 float")
    .load(instance_path + "r.csv")
    .dropDuplicates();</p>
          <p>Dataset&lt;Row&gt; tr = t.crossJoin(r);</p>
          <p>At this point the Dataset tr can be used in the evaluation of the next rule. The rule avgMeasure(T, R, A) ← tr(T, R), #avg{M : measure(R, T, M)} = A computes the average of the measures for each time point T and each room R. For this rule, the compiler first prints the instruction that loads the predicate measure into a new Dataset and then prints the instruction for computing the join between the Datasets tr and measure. More precisely, such a join is defined on the equalities between (i) the first term of tr and the second term of measure and (ii) the second term of tr and the first term of measure. The result of the join defines a new Dataset, namely trmeasure. Then, the compiler prints the transformations for aggregating the measures relative to each time point and each room. More precisely, the trmeasure Dataset is first grouped by the columns representing time and room, and then the measures of each group are aggregated using the average (avg) function. Finally, the resulting Dataset is used to model the avgMeasure predicate which appears in the head of the rule.</p>
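          <p>A minimal sketch of the transformations generated for this rule is reported below; the column names (t0 and r0 for tr, m0, m1 and m2 for measure) and the alias a0 are illustrative assumptions, not the exact identifiers emitted by the compiler.</p>
          <p>Dataset&lt;Row&gt; measure = session.read()
    .format("csv")
    .schema("m0 string, m1 float, m2 float")
    .load(instance_path + "measure.csv")
    .dropDuplicates();
// Join tr(T, R) with measure(R, T, M): the first term of tr equals the second term of measure,
// and the second term of tr equals the first term of measure.
Dataset&lt;Row&gt; trmeasure = tr.join(measure,
    tr.col("t0").equalTo(measure.col("m1"))
      .and(tr.col("r0").equalTo(measure.col("m0"))));
// Group by time point and room, then aggregate the measures of each group with avg.
Dataset&lt;Row&gt; avgMeasure = trmeasure
    .groupBy(tr.col("t0"), tr.col("r0"))
    .agg(org.apache.spark.sql.functions.avg(measure.col("m2")).alias("a0"));</p>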
          <p>All the remaining rules are first rewritten and then compiled by following the same idea described so far. Thus, we avoid reporting the lengthy code that would be generated for all the remaining rules.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Implementation</title>
        <p>We implemented the Extended Datalog language and the compilation-based approach described above
in a novel system named Datalog2Spark, which is able to compile Extended Datalog programs into
Spark applications. The architecture of Datalog2Spark is illustrated in Figure 1.</p>
        <p>Datalog2Spark takes as input an Extended Datalog program, say Π , which is then analyzed and
rewritten by the Rewriter module. More precisely, the module reads all type directives which will be
instrumental in defining the schema of the resulting Spark Datasets. For all those input predicates for
which the user did not specify a type directive, the Rewriter module defines a type directive in which all
terms are assigned to the type string. Once input types have been defined, the Rewriter module proceeds
by inferring the types of remaining predicates and by checking that there are no type mismatches.
Whenever a mismatch occurs, the compilation aborts. The following example should clarify both how type inference works and how mismatches are detected.</p>
        <sec id="sec-3-3-1">
          <title>Example 7. Consider the following program Π :</title>
          <p>(,  ) ←
(), #{,  : (, , )} =  .</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>And the following type directive:</title>
          <p># (, , )</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>Input predicates of Π are  and . However, there is one type directive for the predicate  and so, the</title>
        </sec>
        <sec id="sec-3-3-4">
          <title>Rewriter module introduces a default directive for  of the form # ().</title>
        </sec>
        <sec id="sec-3-3-5">
          <title>At this point, terms of all input predicates have a precise type and so, the Rewriter proceeds by checking that such types do not lead to a mismatch.</title>
        </sec>
        <sec id="sec-3-3-6">
          <title>More precisely, for the predicate , the Rewriter infers the type directive # (, ).</title>
        </sec>
        <sec id="sec-3-3-7">
          <title>Intuitively, the first term of  inherits its type from the first term of  that is , while the second term</title>
          <p>of  inherits its type from the result of the sum aggregation function, which is .</p>
        </sec>
        <sec id="sec-3-3-8">
          <title>Since no type directives have been inferred/specified for  then no conflicts arise.</title>
        </sec>
        <sec id="sec-3-3-9">
          <title>Conversely, a type directive of the form # (, ) would have led to a mismatch on the type of the second term of .</title>
          <p>Then, the Rewriter module applies the rewriting technique described above by replacing each rule containing aggregates with the ones produced by the rewriting. As a result, the Rewriter generates a program, namely the Rewritten Program, denoted by Π′, which is then given to the Compiler module for compiling each rule into ad-hoc Spark transformations. More precisely, the compiler first computes the dependency graph of Π′. Then, it computes the strongly connected components C1, ..., Cn, which give us a topological order of Π′ such that no path exists from Cj to Ci if i &lt; j. By following that order, for each component Ci (which is indeed a set of predicates appearing in Π′), the compiler compiles all the rules having in the head a predicate appearing in Ci.</p>
          <p>Such an order guarantees that each rule r ∈ Π′ is evaluated only when all the predicates appearing in the body of r have been previously evaluated.</p>
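          <p>The compilation loop can be pictured as in the following sketch; the helper types DependencyGraph, Component, and Rule, and all method names, are hypothetical placeholders used only to illustrate the order of operations, not the actual Datalog2Spark API:</p>
          <p>// Hypothetical sketch of the Compiler module's main loop.
DependencyGraph graph = DependencyGraph.of(rewrittenProgram);       // nodes are the predicates of the rewritten program
List&lt;Component&gt; components = graph.stronglyConnectedComponents();   // components returned in topological order
StringBuilder sparkCode = new StringBuilder();
for (Component c : components) {
    // Predicates of earlier components are already compiled, so every body
    // predicate of these rules is available as a Spark Dataset.
    for (Rule r : rewrittenProgram.rulesWithHeadIn(c)) {
        sparkCode.append(compileRuleToTransformations(r));
    }
}</p>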
          <p>As a result, we obtain an ad-hoc Spark application which can be used to evaluate arbitrary instances of the program Π. More precisely, an instance of Π must be a folder made of different predicate-specific CSV files, which represent the input facts.</p>
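          <p>For instance, an instance folder for the Sensors program could look as follows; the folder and file names are purely illustrative:</p>
          <p>sensors_instance/
    t.csv         (facts over t/1, the time points)
    r.csv         (facts over r/1, the rooms)
    measure.csv   (facts over measure/3, the sensor measurements)
    ...           (one CSV file for each remaining input predicate)</p>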
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>To assess the performance of the proposed approach, we conducted an empirical evaluation over computation-intensive benchmarks. In our comparison we considered the clingo and dlv systems and four versions of Datalog2Spark, with the following research goals: (G1) assessing the impact of Datalog2Spark w.r.t. clingo and dlv; (G2) studying the scalability of Datalog2Spark with an increasing number of Spark workers. The four different versions of Datalog2Spark are reported as spark-n, where n represents the number of workers used by Spark.</p>
      <p>Benchmarks. In our evaluation we consider both benchmarks taken from the literature [14] and a benchmark built on top of the Sensors program reported in the previous section. More precisely, the first three benchmarks are formulations of data-intensive join operations, while the last benchmark is useful to assess how Datalog2Spark performs over queries involving aggregates.</p>
      <p>Data-intensive joins fall in the large join category considered by [14]. Each of the joins presents different features. DBLP is a well-known Computer Science bibliography database. The DBLP problem extracts article data by computing a 4-way join over the single input relation att, and selects almost all the columns of the join.</p>
      <p>dblp(Id, T, Y, A, M) ← att(Id, title, T), att(Id, year, Y), att(Id, author, A), att(Id, month, M).</p>
      <p>Instances generated for DBLP include a number of articles between 200000 and 6000000. Each article
could include up to 6 or up to 9 authors, while both month and year of publication are taken at random.
The Join1 problem defines four binary joins between predicates of arity two.</p>
      <p>a(X, Y) ← b1(X, Z), b2(Z, Y).
b1(X, Y) ← c1(X, Z), c2(Z, Y).
b2(X, Y) ← c3(X, Z), c4(Z, Y).
c1(X, Y) ← d1(X, Z), d2(Z, Y).</p>
      <p>Instances for Join1 were generated by varying the domain size of the input predicates (namely d1, d2,
c3, c4, c2) and the number of facts for each predicate. We varied the domain of each term of the input
predicates between 300 and 1000, and we considered either all the tuples in the cartesian product of the
domains or a subset of it (ranging from 40% to 100% of the tuples).</p>
      <p>The LUBM problem defines three independent queries over a more complex database. In this case,
multiple joins between distinct relations are computed and then, except for query1, all columns of the
resulting join are selected.</p>
      <p>query1(X) ← takesCourse(X, graduateCourse0), graduateStudent(X).
query2(X, Y, Z) ← graduateStudent(X), memberOf(X, Z), undergraduateDegreeFrom(X, Y), university(Y), department(Z), subOrganizationOf(Z, Y).
query9(X, Y, Z) ← advisor(X, Y), teacherOf(Y, Z), takesCourse(X, Z), student(X), faculty(Y), course(Z).</p>
      <p>Instances for LUBM are generated by varying many parameters, like the number of students, the number of universities, the number of courses followed by each student, and a few others. For generating instances that are a plausible representation of real-world universities, we decided to fix a parameter configuration and then to multiply all the parameters by a factor, except the number of courses attended by each student. In this way we generated instances ranging from 50000 students split into 12 universities, up to 800000 students split into 225 universities. For each fixed parameter configuration we made the number of courses followed by each student vary from 5 to 20.</p>
      <p>The last benchmark, instead, is an extension of the Sensors program from Example 6. In particular, we added rules encoding further abnormal room conditions detected by measures that are below the minimum comfort range value. More precisely, the following rules are added:
abnormalLow(T1, T2, R) ← interval(T1, T2, K), r(R), comfort(R, Min, Max),
    #count{T : T &gt; T1, T &lt;= T2, avgMeasure(T, R, A), A &lt; Min} &gt;= K.
trash(B) ← abnormalLow(T1, T2, R), stored(B, R, S1, S2), T1 &gt;= S1, T2 &lt;= S2.</p>
      <p>Instances for sensors are generated by varying the total number of rooms, the number of sensors per
room, and the number of time points for which a measure exists. We varied the number of rooms from
100 to 600, the number of time points from 2000 to 4000 and the number of sensors from 4 to 6.
</p>
      <p>Hardware setup. All experiments were executed on an Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz machine, running Debian GNU/Linux 12, with memory limited to 64GB and CPU time (i.e., user+system) limited to 1800s. Executables and benchmarks are available at https://osf.io/jc4wn/?view_only=88e6cf62edec4678b3417217cb9b1252</p>
      <p>Results. The obtained results are reported in the plots in Figures 2-5. We recall that in the cactus plots instances are sorted by solving time, and a point (i, j) of the plot indicates that a system solved the i-th instance within j seconds. Each line in a cactus plot reports the execution times of a given system.</p>
      <p>For assessing the impact of the proposed approach w.r.t. the clingo and dlv systems (G1), we compare them with spark-1, which uses only one worker, for a fair comparison.</p>
      <p>Overall, it is possible to observe that the considered benchmarks highlight both the strengths and the limitations of the proposed approach. For dblp, spark-1 achieves better results w.r.t. clingo and dlv since it pre-filters the predicate att over the specific columns and then computes the 4-way join between the filtered Datasets, which is way more efficient than iterating over the entire predicate set. On the contrary, for the join1 benchmark, spark-1 exhibited some overhead w.r.t. clingo and dlv. Such an overhead is mainly due to the large number of duplicates produced by the join operation. In this case spark-1 is not able to identify such duplicates, and so it first generates duplicated tuples and then drops them, which is detrimental as soon as the number of duplicates starts increasing.</p>
      <p>For the lubm benchmark, spark-1 exhibited performance comparable to that of clingo, while dlv emerged as the best-performing system. In this case, spark-1 applies limited pre-filtering, restricted to the takesCourse predicate. Unlike the dblp benchmark, this filtering has little effect on overall performance; however, spark-1 introduces no significant overhead compared to clingo. By contrast, dlv benefits from advanced rule ordering strategies, allowing it to scale efficiently on smaller instances. Nonetheless, as the instance size grows, dlv does not scale as well as spark. However, it is important to point out that, even though the use of only one worker is not the typical Spark deployment mode, spark-1 proved to be competitive w.r.t. clingo and dlv. Finally, for the sensors encoding there is no line for clingo and dlv, since they do not support the aggregate #avg used in this benchmark.</p>
      <p>We now study the scalability of Datalog2Spark by considering an increasing number of Spark workers
(G2). To this end, we compared the performance of spark-1, spark-4, spark-8, and spark-16.</p>
      <p>Table 1 reports the cumulative PAR-2 score for all the spark versions, and the speedup of each version with respect to spark-1 for each benchmark. We recall that the PAR-2 score is obtained by considering the total runtime for completed instances and 2 times the timeout for timed-out instances. Speedups are then computed by dividing the total execution time of spark-1 by the total execution time of each spark-n.</p>
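      <p>As a purely hypothetical illustration of these metrics: with the 1800s timeout, a configuration that solves 28 instances in 9000s overall and times out on the remaining 2 obtains a PAR-2 score of 9000 + 2 · 2 · 1800 = 16200s; if spark-1 took 27000s in total on the same benchmark, the reported speedup would be 27000 / 9000 = 3.</p>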
      <p>In almost all the considered benchmarks, it is possible to observe a significant improvement introduced by spark-4 w.r.t. spark-1. This suggests that partitioning datasets over different Spark workers is very effective, even if the obtained speedup is not linear. This is probably due to the size of the considered instances, and so, by considering larger instances, the speedup observed for spark-8 and spark-16 is expected to increase. Among the considered benchmarks, lubm and sensors are the ones on which Spark scales the most. More precisely, for the sensors benchmark spark-1 experiences four timed-out instances, while spark-16 allows solving all the instances roughly within 600s (i.e., a third of the considered timeout). This is a very positive result which also highlights the effectiveness of the proposed approach in modeling typical Big Data analyses requiring aggregation.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Related Work</title>
      <p>
        Datalog has long served as a core declarative framework for querying and reasoning over data [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Its
evaluation is typically handled by main-memory systems [14, 15, 10, 9], which struggle to scale in the
realm of Big Data applications. This limitation has led to a growing body of work focused on scalable
engines which aim at parallelizing the evaluation of logic programs both on a single machine and in distributed environments [
        <xref ref-type="bibr" rid="ref6 ref7">16, 17, 6, 7</xref>
        ]. Tachmazidis et al. introduced a MapReduce-based approach for evaluating logic programs under the well-founded semantics, achieving scalability over large datasets. However, their approach is tailored to specific examples of programs, which limits its applicability, and, more importantly, it may not benefit from the advancements introduced by modern Big Data technologies such as the in-memory computation supported by Spark [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this direction, Shkapsky et al. and Rogala et al. proposed approaches for integrating Datalog queries within Spark applications. Both systems allow the user to integrate Datalog queries, expressed in a new query language inspired by Datalog, into complex Spark pipelines. However, both approaches force the user to explicitly embed the query within a Spark application. In contrast, our compilation-based approach automatically composes Spark applications which serve as ad-hoc distributed solvers for Datalog programs.
      </p>
      <p>Related techniques are also those realized for the parallel instantiation of ASP programs, of which Datalog evaluation is a subcase [19]. Compilation-based techniques have recently gained popularity and proved to be highly effective for the evaluation of both ASP and Datalog programs [20, 21, 22, 23]. Notably, Cuteri and Ricca introduced a technique to compile Datalog into an ad-hoc C++ engine. However, the resulting engines still remain limited to sequential main-memory execution, which represents a strong limitation for large-scale programs.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>In this paper, we presented Datalog2Spark, a preliminary system that explores the compilation of
Datalog queries into Spark programs for scalable evaluation over modern distributed data management
platforms. Our prototype focuses on non-recursive Datalog with stratified negation, extended with
type definitions and powerful aggregation functions, addressing key requirements for effective Big
Data analytics. A preliminary experimental evaluation against main-memory engines, dlv and clingo,
demonstrated that Datalog2Spark delivers acceptable performance on single-server environments while
offering seamless scalability across distributed platforms. Additionally, our system supports expressive
aggregation features that are not supported by available engines.</p>
      <p>These encouraging results suggest that Datalog2Spark can be a promising foundation for a scalable
Datalog implementation, and future work will focus on extending the system to support recursive
queries, optimizing compilation strategies, and further assessments of performance on large-scale
distributed environments.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported by the Italian Ministry of Industrial Development (MISE) under project
EITWIN n. F/310168/05/X56 CUP B29J24000680005; and by the Italian Ministry of Research (MUR) under
projects: PNRR FAIR - Spoke 9 - WP 9.1 CUP H23C22000860006, Tech4You CUP H23C22000370006, and
PRIN PINPOINT CUP H23C22000280006.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Harrison</surname>
          </string-name>
          , S. Feuerstein,
          <article-title>MySQL stored procedure programming - building high-performance web applications with PHP, Perl, Python, Java and</article-title>
          .
          <source>NET: covers MySQL 5</source>
          ,
          <string-name>
            <given-names>O</given-names>
            <surname>'Reilly</surname>
          </string-name>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. T.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Cuzzocrea (Eds.),
          <source>Big Data - Algorithms</source>
          , Analytics, and
          <string-name>
            <surname>Applications</surname>
          </string-name>
          , Chapman and Hall/CRC,
          <year>2015</year>
          . URL: https://doi.org/10.1201/b18050. doi:
          <volume>10</volume>
          .1201/B18050.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Apache</given-names>
            <surname>Software</surname>
          </string-name>
          <string-name>
            <surname>Foundation</surname>
          </string-name>
          , Apache spark,
          <year>2025</year>
          . URL: https://spark.apache.org/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          ,
          <article-title>Database systems - the complete book (2</article-title>
          . ed.),
          <source>Pearson Education</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ceri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tanca</surname>
          </string-name>
          ,
          <article-title>Logic Programming</article-title>
          and Databases, Surveys in computer science, Springer,
          <year>1990</year>
          . URL: https://www.worldcat.org/oclc/20595273.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shkapsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Interlandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Condie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zaniolo</surname>
          </string-name>
          ,
          <article-title>Big data analytics with datalog queries on spark</article-title>
          , in: F. Özcan, G. Koutrika, S. Madden (Eds.),
          <source>Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference</source>
          <year>2016</year>
          , San Francisco, CA, USA, June 26 - July 01,
          <year>2016</year>
          , ACM,
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1149</lpage>
          . URL: https://doi.org/10.1145/2882903.2915229. doi:
          <volume>10</volume>
          . 1145/2882903.2915229.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shkapsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zaniolo</surname>
          </string-name>
          ,
          <article-title>Optimizing recursive queries with monotonic aggregates in deals</article-title>
          , in: J.
          <string-name>
            <surname>Gehrke</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Lehner</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Shim</surname>
            ,
            <given-names>S. K.</given-names>
          </string-name>
          <string-name>
            <surname>Cha</surname>
            ,
            <given-names>G. M.</given-names>
          </string-name>
          <string-name>
            <surname>Lohman</surname>
          </string-name>
          (Eds.),
          <source>31st IEEE International Conference on Data Engineering, ICDE</source>
          <year>2015</year>
          , Seoul, South Korea,
          <source>April 13-17</source>
          ,
          <year>2015</year>
          , IEEE Computer Society,
          <year>2015</year>
          , pp.
          <fpage>867</fpage>
          -
          <lpage>878</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>