SRSPG: A Plugin-based Spark Framework for Large-scale RDF Streams Processing on GPU

Tenglong Ren

Guozheng Rao

Xiaowang Zhang

xiaowangzhang@tju.edu.cn

Zhiyong Feng

College of Intelligence and Computing, Tianjin University, Tianjin, China

Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China

In this paper, we propose a plugin-based Spark framework (SRSPG) for large-scale RDF stream processing on GPU. Within this framework, we convert RDF streams to an RDF graph in a unified and simple way. We can then apply various SPARQL query engines to process continuous queries and use the GPU to accelerate them. The Computation Module provides a Spark-based join algorithm that uses the GPU for parallel joining to obtain the final results. In addition, we provide Compute Resource Management to balance scheduling and task execution between GPU and memory resources. Finally, we evaluate our work, built on gStore and RDF-3X, on the LUBM benchmark. The experimental results show that SRSPG is effective for real-time processing of large-scale RDF streams.

Introduction

In our work, we propose a plugin-based Spark framework for large-scale RDF stream processing on GPU. To make full use of the computing power of the GPU, we add a Query Split module and a Computation Module. Query Split decomposes the SPARQL queries, and the Computation Module computes the intermediate results on the GPU. We evaluate our framework on the LUBM benchmark. The experimental results show that our framework is effective and efficient.

Overview of SRSPG

The SRSPG framework consists of six main parts, as shown in Figure 1: Syntax Translator, Data Transformer, Query Trigger, Query Split, SPARQL API, and Computation Module.

Figure 1: The SRSPG framework. Inputs: RDF Stream (S) and Continuous Query (Q). Components: Data Transformer (Window Selector, RDF Graph), Syntax Translator, Query Trigger, Query Split (SubQ1, ..., SubQm), SPARQL API with SPARQL Engines, and Computation Module (Spark-based Parallel Join, Compute Resource Management) running on GPU 1, GPU 2, ..., GPU k. Output: Results.

Data Transformer. This module captures snapshots of the RDF streams based on the window selector obtained from Query Trigger and converts these snapshots into RDF graphs. The RDF streams are first turned into continuous window data by Esper or another DSMS. Finally, the RDF graphs are sent to the SPARQL API.
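The snapshot step can be illustrated with a minimal in-memory sketch. It does not use Esper or any other DSMS; the `TimedTriple` record, the `snapshots` function, and the `window_size` and `step` parameters are hypothetical names introduced only for this illustration, and the window semantics (half-open time intervals) are an assumption.

```python
from collections import namedtuple

# Hypothetical record type: one RDF triple plus its arrival time in seconds.
TimedTriple = namedtuple("TimedTriple", ["s", "p", "o", "ts"])


def snapshots(stream, window_size, step):
    """Yield (window_start, rdf_graph) pairs.

    rdf_graph is the set of (s, p, o) triples whose timestamp lies in the
    half-open interval [window_start, window_start + window_size).
    """
    triples = sorted(stream, key=lambda t: t.ts)
    if not triples:
        return
    start = triples[0].ts
    last = triples[-1].ts
    while start <= last:
        end = start + window_size
        graph = {(t.s, t.p, t.o) for t in triples if start <= t.ts < end}
        yield start, graph
        start += step


# Toy stream; the data is invented for the example.
stream = [
    TimedTriple("alice", "likes", "bob", 0),
    TimedTriple("bob", "lives", "house1", 3),
    TimedTriple("house1", "located", "tianjin", 6),
]
windows = list(snapshots(stream, window_size=5, step=5))
```

With `step` equal to `window_size` the windows tumble; a smaller `step` would give overlapping sliding windows. Each resulting graph is a plain set of triples that could be handed to any SPARQL engine.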

SPARQL API. SRSPG provides a SPARQL API for users, which makes it possible for both centralized and distributed SPARQL engines to process RDF streams.

Query Split. The Query Split module decomposes a query into subqueries based on the weights of its predicates. We assign weights to predicates when loading RDF data: the more frequently a predicate occurs in the RDF graph, the greater its weight and the greater its impact on query results.

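Figure 2: Example of Query Split. The query and the weight of each triple pattern (TP):

    SELECT ?X WHERE {
      TP1: ?X likes ?Y .      weight 100
      TP2: ?Y lives ?W .      weight 10
      TP3: ?W located ?S . }  weight 30

The figure lists the triple patterns in the order TP2 (10), TP3 (30), TP1 (100), leading to the solution.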
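The weighting idea can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function names `predicate_weights` and `order_patterns` are hypothetical, and sorting by ascending weight is assumed only because it reproduces the order TP2, TP3, TP1 shown in the example; the text does not state the ordering rule explicitly.

```python
from collections import Counter


def predicate_weights(rdf_graph):
    """Weight of a predicate = number of triples in the graph that use it."""
    return Counter(p for (_s, p, _o) in rdf_graph)


def order_patterns(patterns, weights):
    """Sort triple patterns by ascending predicate weight (assumed policy).

    Predicates that never occur in the graph get weight 0.
    """
    return sorted(patterns, key=lambda tp: weights.get(tp[1], 0))


# Weights taken from the example query in the text.
example_weights = {"likes": 100, "lives": 10, "located": 30}
patterns = [
    ("?X", "likes", "?Y"),    # TP1
    ("?Y", "lives", "?W"),    # TP2
    ("?W", "located", "?S"),  # TP3
]
ordered = order_patterns(patterns, example_weights)

# Weights can also be derived from a loaded graph (toy data, invented here).
toy_graph = {("a", "likes", "b"), ("c", "likes", "d"), ("b", "lives", "w")}
toy_weights = predicate_weights(toy_graph)
```

Each ordered triple pattern can then be treated as one subquery whose bindings are joined later.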

Computation Module. This module provides a Spark-based parallel join algorithm that uses the GPU for parallel joining. The tasks can be split into several streams. Spark supports multi-threaded computing, while GPUs usually process work serially, which leads to contention for GPU resources among threads. We therefore present Compute Resource Management to balance scheduling and task execution across GPU and memory resources.
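The interplay between a partitioned join and bounded device access can be sketched in plain Python. This is a CPU-only illustration under assumptions: it uses neither Spark nor a GPU, threads stand in for Spark tasks, and a semaphore with k slots stands in for Compute Resource Management. The names `hash_join`, `DeviceSlots`, and `parallel_join` are hypothetical, and the hash-partitioned join is a generic technique, not necessarily the algorithm used in SRSPG.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def hash_join(left, right):
    """Join two lists of variable bindings (dicts) on their shared variables."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    index = {}
    for row in left:
        index.setdefault(tuple(row[v] for v in shared), []).append(row)
    joined = []
    for row in right:
        for match in index.get(tuple(row[v] for v in shared), []):
            joined.append({**match, **row})
    return joined


class DeviceSlots:
    """Hypothetical resource manager: at most k tasks hold a device at once."""

    def __init__(self, k):
        self._slots = threading.BoundedSemaphore(k)
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0  # highest number of tasks observed running together

    def run(self, task, *args):
        with self._slots:  # blocks while all k slots are taken
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return task(*args)  # a real system would launch GPU work here
            finally:
                with self._lock:
                    self.active -= 1


def parallel_join(left, right, partitions, slots):
    """Hash-partition both inputs on the join key and join partitions concurrently."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))

    def bucket(rows):
        parts = [[] for _ in range(partitions)]
        for row in rows:
            key = tuple(row[v] for v in shared)
            parts[hash(key) % partitions].append(row)
        return parts

    pairs = zip(bucket(left), bucket(right))
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        futures = [pool.submit(slots.run, hash_join, l, r) for l, r in pairs]
        return [row for f in futures for row in f.result()]


# Toy bindings for two triple patterns; the data is invented for the example.
tp1 = [{"?X": "alice", "?Y": "bob"}, {"?X": "carol", "?Y": "dave"}]
tp2 = [
    {"?Y": "bob", "?W": "house1"},
    {"?Y": "dave", "?W": "house2"},
    {"?Y": "erin", "?W": "house3"},
]
slots = DeviceSlots(k=2)
result = parallel_join(tp1, tp2, partitions=4, slots=slots)
```

Rows with the same join key always land in the same partition, so joining partition pairs independently gives the same rows as a single join. The semaphore caps concurrent device use at k regardless of how many worker threads exist, which is the contention problem described above.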

Experiments and Evaluations

Our experiments are run on a server equipped with 4 CPUs with 6 cores, 64GB memory, and an NVIDIA GTX590 GPU, which has 24GB device memory and is clocked at 1.35GHz. The operating system is Ubuntu 14.04. In order to support GPU on YARN, we use version 3.1.2. We obtain RDF-GPU and gStore-GPU by employing RDF-3X and gStore within SRSPG. Our experiments use the LUBM dataset.

The experiments use the standard queries Q1 and Q2 provided by LUBM. Figure 3 shows that, when S = 240s and SET = 235s, the query time on stream data decreases when SRSPG uses the GPU, and gStore is more than twice as fast as RDF-3X. When the data size is small, the GPU acceleration is not obvious, mainly because of data communication and transfer time. The larger the data scale, the better the acceleration effect.

Figure 3: Query time plot (y-axis: Time, with ticks at 10^4 and 10^5).

Conclusions

In this paper, we proposed a plugin-based Spark framework for real-time, large-scale RDF stream processing on GPU in an efficient and simple way. In the future, we will take advantage of more novel computing hardware, such as FPGA, to increase the speed of large-scale RDF stream processing.

Acknowledgments

This work is supported by the National Key Research and Development Program of China (2017YFC0908401) and the National Natural Science Foundation of China (61672377, 61972455). Xiaowang Zhang is supported by the Peiyang Young Scholars in Tianjin University (2019XRX-0032).
