<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RDF Stream Reasoning via Answer Set Programming on Modern Big Data Platform</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiangnan Ren</string-name>
          <email>xiang-nan.ren@atos.net</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivier Cure´</string-name>
          <email>olivier.cure@u-pem.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hubert Naacke</string-name>
          <email>hubert.naacke@lip6.fr</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guohui Xiao</string-name>
          <email>xiao@inf.unibz.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ATOS</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Free University of Bozen-Bolzano</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LIGM (UMR 8049)</institution>
          ,
          <addr-line>CNRS, UPEM</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Sorbonne Universite ́s, UPMC</institution>
          ,
          <addr-line>Univ Paris 06</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>RDF stream reasoning is gaining more and more attention but current research mainly focuses on logical frameworks which aim to formalize the query semantics and enhance the complexity of reasoning ability. These frameworks are evaluated on prototype systems based on a centralized design and suffer from limited scalability. A common way to enhance system scalability is to adopt a distributed approach. Moreover, the study of applying distributed solution for expressive RDF stream reasoning is still missing. In this paper, we explore the ability of modern Big Data platform to handle highly expressive temporal Datalog/Answer Set Programming(ASP) over RDF data streams. In order to achieve our goal, we first discuss some key features to parallelize Datalog/ASP program, and we associate these features to the two well known distributed stream processing models, namely Bulk Synchronous Processing (BSP) and Record-at-A-Time (RAT). We build a technical demonstrator called BigSR on top of Spark(BSP) and Flink(RAT) to support our evaluations, and identify the pros and cons of each model. Our experiments show that, BigSR achieves high throughput beyond million-triples per second using a rather small cluster of machines.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        In the era of the ever-growing semantic data flood, the challenge of processing
declarative queries and inferences over rich and massive RDF data streams remains of
major issue. On the one hand, stream processing must be efficient enough to ingest data
with throughput and latency constraints which are imposed respectively by the
incoming data streams and underlying applications. On the other hand, the query language
has to be expressive enough to support temporal logic and reasoning that may require
recursion. In order to cope with the first aspect, distributed systems supporting fault
tolerance, automatic task distribution and recovery are generally required. Considering
the second aspect, Datalog [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Answer Set Programming (ASP)[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] programs seem
to fit efficiently since they represent a good balance between expressive power, safety,
performance, and usability. Note that considering such expressiveness permits to
address ontology languages such as OWL2RL. This work demonstrates the feasibility to
design such a system but it also emphasizes that such a solution can be implemented
with open-source, state of the art Big Data technologies, hence being a prototype for a
production-ready system.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Distributed RDF Stream Reasoning</title>
      <p>
        We use LARS [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as the theoretical foundation of our implementation. In addition,
we identify Parallelism Level and Streaming Model as the two main factors which
leverages the scalability of distributed RDF stream reasoning.
      </p>
      <p>
        Parallelism Level. As defined in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], there are three levels to parallelize the
evaluation of a stratified Datalog/ASP program: Component Level, Rules level, and Single
Rule level. Through our work, we designed a series of queries which cover all the three
parallelism level to evaluate the performance impacts.
      </p>
      <p>
        Streaming Models. Two broad classes of streaming models are adopted by modern
distributed streaming engines, i.e., Bulk Synchronous Processing (BSP) and
Record-atA-Time (RAT). In order to evaluate the efficiency of these two streaming models for
RDF stream reasoning, we decide to choose Apache Spark Streaming (i.e., of BSP) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
and Apache Flink (i.e., of RAT) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as the underlying computing frameworks.
Considering a simple LARS program P = T(X ) w(l;d)3(R1(Y ) ^ R2(Y ; X )), Figure 1
compares the differences of program evaluations between BSP and RAT on Spark and
Flink, respectively. Spark launches the continuous query execution synchronously, each
query execution is triggered after the previous computations is completed (Figure 1(a)).
Flink serializes, caches, and pushes forward each record to the next operator eagerly
right after the current computation is done. The asynchronous data processing on Flink
minimizes processing delay (Figure 1(b)).
      </p>
      <p>Fig. 2: BigSR system architecture</p>
      <p>BigSR. To study the feasibility of applying a distributed approach on RDF stream
reasoning and explore the performance impact associated with two above-mentioned
factors, we build BigSR - a reusable prototype for distributed RDF stream reasoning.
BigSR consists of three principal components: (i) Data-feed is built on top of Apache
Kafka (a distributed message queue) for high throughput, fault-tolerant data stream
management; (ii) Sink persists query outputs into a storage component such as Amazon
S3, HDFS or even Kafka; (iii) Computing core compiles LARS program into BigSR’s
logical plan and evaluates the program via Spark/Flink’s native operators.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>In this section, we showcase some evaluation result. Following the standard Yahoo
benchmark tailored for streaming systems, we designed a micro-benchmark to
evaluate BigSR. The benchmark involves 15 queries and 4 datasets (SRBench, CityBench,
Lubm and Waves). We organize the 15 queries into two groups: (1) in the first group, the
queries Q1 to Q11 are designed to evaluate the two main factors of streaming engine,
i.e., system throughput and query latency. In particular, we add recursive operators in
Q9, Q10 and Q11 to study the pros and cons of recursion support on BSP and RAT. (2)
the queries Q12 to Q15 in the second group are designed for the purpose of evaluating
the minimum latency that the system could achieve.</p>
      <p>We evaluate BigSR 1 in a small cluster of 9 nodes (6 nodes for Spark/Flink, 3 nodes
for Kafka and ZooKeeper). Figure 4 and Figure 4 respectively give the engine
throughput and query latency of Q1 to Q11. On both Spark and Flink, BigSR attains throughput
of millions triples per second, and second-level delay. Table 1 compares the query
latency of Q12 to Q15. Since Spark requires to define the size of micro-batch within its
BSP streaming model, we reduce the size of micro-batch to 500 ms and record the
processing delay. In general, limited by the BSP model, Spark retains the latency around
100 ms. In the contrary, RAT model provides Flink the ability to achieve the latency of
sub-millisecond.
1 https://github.com/renxiangnan/bigsr</p>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusion</title>
      <p>Expressive RDF stream reasoning is an emerging area that is in its infancy. The research
for scalable RDF stream reasoning, especially by applying distributed approach, is still
missing. In this paper, we introduce BigSR and some results of experiments. We use
LARS as the theoretical foundations for our implementations, and we build a bridge
between recent stream reasoning theoretical work and modern Big Data technology.
Our evaluation shows that distributed solution enhance the system throughput to million
triples per second with second/sub-second delay. In future work, we plan to concentrate
on the trade-off between the query expressiveness and system scalability, which gives
us a road map to design a production-ready engine.</p>
      <p>Acknowledgements. This work was partially supported by the OBATS project at the
Free University of Bozen-Bolzano.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Serge</surname>
            <given-names>Abiteboul</given-names>
          </string-name>
          , Richard Hull, and
          <string-name>
            <given-names>Victor</given-names>
            <surname>Vianu</surname>
          </string-name>
          .
          <source>Foundations of Databases. AddisonWesley</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Harald</given-names>
            <surname>Beck</surname>
          </string-name>
          , Minh Dao-Tran, Thomas Eiter, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Fink</surname>
          </string-name>
          .
          <article-title>LARS: A logic-based framework for analyzing reasoning over streams</article-title>
          .
          <source>In AAAI</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and
          <string-name>
            <given-names>Kostas</given-names>
            <surname>Tzoumas</surname>
          </string-name>
          .
          <article-title>Apache flinkTM: Stream and batch processing in a single engine</article-title>
          .
          <source>IEEE Data Eng. Bull.</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Eiter</surname>
          </string-name>
          , Giovambattista Ianni, and Thomas Krennwallner.
          <article-title>Answer set programming: A primer</article-title>
          .
          <source>In Reasoning Web</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Simona</given-names>
            <surname>Perri</surname>
          </string-name>
          , Francesco Ricca, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Sirianni</surname>
          </string-name>
          .
          <article-title>Parallel instantiation of ASP programs: techniques and experiments</article-title>
          .
          <source>TPLP</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Matei</given-names>
            <surname>Zaharia</surname>
          </string-name>
          , Mosharaf Chowdhury,
          <string-name>
            <surname>Tathagata Das</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ankur Dave</surname>
          </string-name>
          , Justin Ma,
          <string-name>
            <surname>Murphy</surname>
            <given-names>McCauley</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Michael J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Scott</given-names>
            <surname>Shenker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ion</given-names>
            <surname>Stoica</surname>
          </string-name>
          .
          <article-title>Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing</article-title>
          .
          <source>In NSDI</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>