RDF Stream Reasoning via Answer Set Programming on Modern Big Data Platform

Xiangnan Ren (1,2), Olivier Curé (2), Hubert Naacke (3), and Guohui Xiao (4)
(1) ATOS, France. xiang-nan.ren@atos.net
(2) LIGM (UMR 8049), CNRS, UPEM, France. olivier.cure@u-pem.fr
(3) Sorbonne Universités, UPMC, Univ Paris 06, France. hubert.naacke@lip6.fr
(4) Free University of Bozen-Bolzano, Italy. xiao@inf.unibz.it

Abstract. RDF stream reasoning is gaining more and more attention, but current research mainly focuses on logical frameworks that formalize query semantics and extend the expressiveness of the supported reasoning. These frameworks are evaluated on prototype systems with a centralized design and therefore suffer from limited scalability. A common way to improve scalability is to adopt a distributed approach; however, a study of distributed solutions for expressive RDF stream reasoning is still missing. In this paper, we explore the ability of modern Big Data platforms to handle highly expressive temporal Datalog/Answer Set Programming (ASP) over RDF data streams. To this end, we first discuss the key features required to parallelize a Datalog/ASP program, and we relate these features to the two well-known distributed stream processing models, namely Bulk Synchronous Processing (BSP) and Record-at-A-Time (RAT). We build a technical demonstrator called BigSR on top of Spark (BSP) and Flink (RAT) to support our evaluation, and identify the pros and cons of each model. Our experiments show that BigSR achieves a throughput beyond one million triples per second using a rather small cluster of machines.

1 Introduction

In the era of an ever-growing semantic data flood, processing declarative queries and inferences over rich and massive RDF data streams remains a major challenge. On the one hand, stream processing must be efficient enough to ingest data under the throughput and latency constraints imposed respectively by the incoming data streams and the underlying applications. On the other hand, the query language has to be expressive enough to support temporal logic and reasoning that may require recursion. To cope with the first aspect, distributed systems supporting fault tolerance, automatic task distribution and recovery are generally required. For the second aspect, Datalog [1] and Answer Set Programming (ASP) [4] programs are a good fit, since they strike a good balance between expressive power, safety, performance, and usability. Note that this level of expressiveness makes it possible to address ontology languages such as OWL 2 RL. This work demonstrates the feasibility of designing such a system, and it also emphasizes that such a solution can be implemented with open-source, state-of-the-art Big Data technologies, hence serving as a prototype for a production-ready system.

2 Distributed RDF Stream Reasoning

We use LARS [2] as the theoretical foundation of our implementation. In addition, we identify Parallelism Level and Streaming Model as the two main factors that determine the scalability of distributed RDF stream reasoning.

Parallelism Level. As defined in [5], there are three levels at which the evaluation of a stratified Datalog/ASP program can be parallelized: Component level, Rules level, and Single Rule level. In our work, we designed a series of queries that cover all three parallelism levels in order to evaluate their performance impact.
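To give a concrete flavour of how a recursive Datalog rule can be evaluated in parallel on a BSP engine, the following Scala sketch implements semi-naive evaluation of the classical transitive-closure program over Spark RDDs: the join inside the rule body is evaluated in parallel across partitions (single-rule-level parallelism), while the fixpoint loop is driven synchronously by the driver. This is only an illustrative sketch under assumed conventions (triples reduced to subject/object pairs, the names SemiNaiveReach and reach are ours); it is not BigSR's actual implementation.

```scala
// Sketch: semi-naive evaluation of the recursive Datalog program
//   reach(X, Y) :- edge(X, Y).
//   reach(X, Y) :- reach(X, Z), edge(Z, Y).
// over an RDD of (subject, object) pairs. Illustrative only, not BigSR's code.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SemiNaiveReach {
  def reach(edges: RDD[(String, String)]): RDD[(String, String)] = {
    var total = edges            // all facts derived so far
    var delta = edges            // facts derived in the previous round
    var newCount = 1L
    while (newCount > 0) {
      // reach(X, Z), edge(Z, Y) => reach(X, Y): distributed join on Z
      val derived = delta.map(_.swap)                 // (Z, X)
        .join(edges)                                  // (Z, (X, Y))
        .map { case (_, (x, y)) => (x, y) }
      delta = derived.subtract(total).distinct().cache()
      newCount = delta.count()
      total = total.union(delta).distinct().cache()
    }
    total
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SemiNaiveReach").master("local[*]").getOrCreate()
    val edges = spark.sparkContext.parallelize(Seq(("a", "b"), ("b", "c"), ("c", "d")))
    reach(edges).collect().foreach(println)   // includes derived facts such as (a,d)
    spark.stop()
  }
}
```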
Fig. 1: Blocking and non-blocking query processing.

Streaming Models. Two broad classes of streaming models are adopted by modern distributed streaming engines: Bulk Synchronous Processing (BSP) and Record-at-A-Time (RAT). To evaluate the efficiency of these two streaming models for RDF stream reasoning, we choose Apache Spark Streaming (BSP) [6] and Apache Flink (RAT) [3] as the underlying computing frameworks. Considering a simple LARS program P = T(X) ← ⊞^{w(l,d)} ◇ (R1(Y) ∧ R2(Y, X)), where ⊞^{w(l,d)} is a window operator and ◇ requires the body to hold at some time point within the window, Figure 1 compares how BSP and RAT evaluate the program on Spark and Flink, respectively. Spark launches continuous query executions synchronously: each execution is triggered only after the previous computation has completed (Figure 1(a)). Flink serializes, caches, and eagerly pushes each record to the next operator as soon as the current computation is done; this asynchronous data processing minimizes the processing delay (Figure 1(b)).

Fig. 2: BigSR system architecture.

BigSR. To study the feasibility of a distributed approach to RDF stream reasoning and to explore the performance impact of the two above-mentioned factors, we build BigSR, a reusable prototype for distributed RDF stream reasoning. BigSR consists of three principal components: (i) the Data-feed, built on top of Apache Kafka (a distributed message queue), provides high-throughput, fault-tolerant data stream management; (ii) the Sink persists query outputs into a storage component such as Amazon S3, HDFS, or even Kafka; (iii) the Computing core compiles a LARS program into BigSR's logical plan and evaluates it via Spark/Flink's native operators.

3 Evaluation

In this section, we showcase some evaluation results. Following the standard Yahoo benchmark tailored for streaming systems, we designed a micro-benchmark to evaluate BigSR. The benchmark involves 15 queries and 4 datasets (SRBench, CityBench, LUBM, and Waves). We organize the 15 queries into two groups: (1) the queries Q1 to Q11 of the first group evaluate the two main factors of a streaming engine, i.e., system throughput and query latency; in particular, we add recursive operators in Q9, Q10 and Q11 to study the pros and cons of recursion support on BSP and RAT; (2) the queries Q12 to Q15 of the second group evaluate the minimum latency that the system can achieve.

We evaluate BigSR (available at https://github.com/renxiangnan/bigsr) on a small cluster of 9 nodes (6 nodes for Spark/Flink, 3 nodes for Kafka and ZooKeeper). Figure 3 and Figure 4 respectively give the engine throughput and the query latency of Q1 to Q11. On both Spark and Flink, BigSR attains a throughput of millions of triples per second with second-level delays. Table 1 compares the query latency of Q12 to Q15. Since Spark requires a fixed micro-batch size within its BSP streaming model, we reduce the micro-batch size to 500 ms and record the processing delay. Limited by the BSP model, Spark's latency stays around 100 ms. On the contrary, the RAT model allows Flink to achieve sub-millisecond latency.

Fig. 3: Throughput (triples per second) on Spark and Flink for Q1 to Q11.

Fig. 4: Query latency (milliseconds) on Spark and Flink for Q1 to Q11.

        Q12  Q13  Q14  Q15
Spark   110   96  115   99
Flink    <1   <1   <1   <1

Table 1: Stateless query latency (milliseconds); Spark micro-batch size = 500 ms.
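The latency gap in Table 1 follows directly from the two streaming models. As a hedged illustration (not BigSR's actual implementation), the Scala sketch below shows how the simple LARS program P of Section 2 could be expressed with Flink's record-at-a-time DataStream API: records flow through the windowed join as they arrive, so results can be emitted with sub-millisecond delay, whereas Spark Streaming re-evaluates the same plan once per micro-batch (here 500 ms), which bounds its latency. The socket source, the triple encoding, and all identifiers are assumptions for illustration; in BigSR the stream is consumed from Kafka.

```scala
// Illustrative sketch only (assumed names and sources), not BigSR's code:
// evaluating P = T(X) <- ⊞^{w} ◇ (R1(Y) ∧ R2(Y, X)) with Flink's DataStream API.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class Triple(s: String, p: String, o: String)

object LarsProgramPOnFlink {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Assumed toy source: whitespace-separated triples read from a socket.
    val triples: DataStream[Triple] = env.socketTextStream("localhost", 9999)
      .map { line => val Array(s, p, o) = line.split(" "); Triple(s, p, o) }

    val r1 = triples.filter(_.p == "R1")   // R1(Y): Y is the subject
    val r2 = triples.filter(_.p == "R2")   // R2(Y, X): Y is the subject, X the object

    // The tumbling window plays the role of the ⊞^{w} window operator;
    // every pair matching on the shared variable Y derives a fact T(X).
    val t: DataStream[(String, String)] = r1.join(r2)
      .where(_.s).equalTo(_.s)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
      .apply { (a: Triple, b: Triple) => ("T", b.o) }

    t.print()
    env.execute("LARS program P on Flink (sketch)")
  }
}
```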
4 Conclusion

Expressive RDF stream reasoning is an emerging area that is still in its infancy; research on scalable RDF stream reasoning, in particular through distributed approaches, is largely missing. In this paper, we introduce BigSR and report some experimental results. We use LARS as the theoretical foundation of our implementation, thereby building a bridge between recent theoretical work on stream reasoning and modern Big Data technology. Our evaluation shows that a distributed solution raises the system throughput to millions of triples per second with second/sub-second delays. In future work, we plan to concentrate on the trade-off between query expressiveness and system scalability, which gives us a road map towards a production-ready engine.

Acknowledgements. This work was partially supported by the OBATS project at the Free University of Bozen-Bolzano.

References

1. Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995.
2. Harald Beck, Minh Dao-Tran, Thomas Eiter, and Michael Fink. LARS: A logic-based framework for analyzing reasoning over streams. In AAAI, 2015.
3. Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. Apache Flink: Stream and batch processing in a single engine. IEEE Data Eng. Bull., 2015.
4. Thomas Eiter, Giovambattista Ianni, and Thomas Krennwallner. Answer set programming: A primer. In Reasoning Web, 2009.
5. Simona Perri, Francesco Ricca, and Marco Sirianni. Parallel instantiation of ASP programs: techniques and experiments. TPLP, 2013.
6. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.