=Paper=
{{Paper
|id=Vol-2456/paper23
|storemode=property
|title=SRSPG: A Plugin-based Spark Framework for Large-scale RDF Streams Processing on GPU
|pdfUrl=https://ceur-ws.org/Vol-2456/paper23.pdf
|volume=Vol-2456
|authors=Tenglong Ren,Guozheng Rao,Xiaowang Zhang,Zhiyong Feng
|dblpUrl=https://dblp.org/rec/conf/semweb/RenRZF19
}}
==SRSPG: A Plugin-based Spark Framework for Large-scale RDF Streams Processing on GPU==
SRSPG: A Plugin-based Spark Framework for Large-scale RDF Streams Processing on GPU Tenglong Ren, Guozheng Rao, Xiaowang Zhang? , and Zhiyong Feng College of Intelligence and Computing, Tianjin University, Tianjin, China Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China ? Corresponding author: xiaowangzhang@tju.edu.cn Abstract. In this paper, we propose a plugin-based Spark framework (SRSPG) for large-scale RDF streams processing on GPU. Within this framework, We convert RDF streams to a RDF graph in a unified and simple way. Then we can apply various SPARQL query engines to process continuous queries and utilize GPU to accelerate queries. Computation Module provides a Spark-based Join algorithm utilizing GPU for par- allel joining, obtaining the final results. Besides, we provide Compute Resource Management to balance the scheduling and task execution be- tween GPU and memory resources. Finally, we evaluate our work bulit on gStore and RDF-3X on the LUBM benchmark. The experimental re- sults show that SRSPG is effective for real-time processing of large-scale RDF streams. 1 Introduction RDF streams, as a new type of dynamic dataset, can model real-time information in traffic monitoring, intelligent city and other fields. Real-time processing of large-scale RDF streams has become an important research topic nowadays. What is more, most of the existing RSP(RDF Steam Processing) systems are centralized, such as C-SPARQL [2], CQELS [4], EP-SPARQL [1]. These engines can not process large-scale RDF streams. A framework PRSP [5, 3] is presented to process continuous queries on large-scale RDF streams by exploiting various SPARQL query engines in a unified way. Apache Spark is a fast computing engine designed for large-scale data pro- cessing. The intermediate results are stored in memory, so that large-scale data can be processed better. Graphics Processing Units(GPUs) is an efficient parallel processing way, which is widely applied and provides a higher level of speedup by executing multiple threads synchronously. However, there are few parallel com- puting systems based on GPU and Spark to process large-scale RDF streams. In addition, the scheduling mechanism of Spark and the task of GPU have to be taken into account. * Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In our work, we propose a plugin-based Spark framework for large-scale RD- F streams processing on GPU. In order to make full use of computing power of GPU, we add Query Split module and Computation Module. The Query S- plit decomposes the SPARQL queries. Computation module is used to compute the intermediate results through on GPU. We evaluate our experiments on the benchmark LUBM. The experimental results show that our framework is effec- tive and efficient. 2 Overview of SRSPG The framework of SRSPG consists of the following six main parts: Syntax Trans- lator, Data Transformer, Query Trigger, Query Split, SPARQL API, and Com- putation Module shown in Figure 1. SPARQL Engines RDF Computation Module Stream(S) Results RDF Graph SPARQL Spark-based Data Parallel Join Transformer API Window Compute Resource Selector Continuous SubQ1 Management Query(Q) Syntax Query Query Split ... ... Translator Trigger GPU GPU GPU SubQm 1 2 k Fig. 1. The framework of SRSPG. Syntax Translator The module of Syntax Translator translates continuous queries into unified queries. The translated query is sent to Query Trigger module. Query Trigger The part of Query Trigger receives the unified queries, splitting them into two parts: ρ(Q) and window selector. ρ(Q), as the core SPARQL query, is pushed into SPARQL API and Query Split module. Window selec- tor is in the form of a 6-tuple which are encapsulated into a management subsets and sent to Data Transformer. Let p = [S, α, β, γ, tf, ρ(Q)] where S is the RDF stream to be processed; α represents the window size; β is the updating time of windows; γ is the updating latency of windows; tf is used to determine whether RDF streams are processed. ρ(Q) is the core SPARQL query. Data Transformer This module transforms RDF streams to capture snap- shots based on window selector obtained from Query Trigger, and converting these snapshots to RDF graphs. We convert RDF streams into continuous window data by Esper or other DSMS. Finally, the RDF graphs are sent to SPARQL API. SPARQL API SRSPG provides SPARQL API for users, which makes it pos- sible for SPARQL engines (centralized and distributed) to process RDF streams. Query Split Query Split module decomposes queries into some subqueries based on the weight of predicates. We assign weights to predicates when loading RDF data. The higher the frequency of predicates in RDF graph, the greater the weight and the greater the impact on query results. SELECT ?X WHERE{ TP2: ?Y lives ?W 10 TP3: ?W located ?S 30 TP1 : ?X likes ?Y. 100 TP2 : ?Y lives ?W. 10 TP3 : ?W located ?S.} 30 TP1: ?X likes ?Y 100 solution Fig. 2. An example of SPARQL query splitting. Computation Module The module of Computation Module proposes Join parallel algorithm based on Spark, utilizing GPU for parallel joining. The tasks can be disassembled into some streams. Spark supports multi-threaded computing, while GPUs are usually serial. The situation leads to contention for GPU resources among threads. So we present Compute Resource Man- agement to balance the scheduling and task execution of GPU, GPU and memory resources. 3 Experiments and Evaluations Our experiments are evaluated on server equipped with a 4 CPUs with 6 cores and 64GB memory, a NVIDIA GTX590 GPU, which has 24GB device memory and is clocked at 1.35GHz. The version of operating system is Ubuntu 14.04. In order to support GPU on YARN, we use version 3.1.2. We use RDF-GPU and gStore-GPU by employing RDF-3X and gStore within SRSPG. Our experiments utilized LUBM dataset. The experiments uses the standard query Q1 and Q2 provided by LUBM. Figure 3 shows that when S = 240s and SET = 235s, SRSPG uses GPU, the query time of stream data decreases, gStore improves the speed by more than two times than RDF-3X. When the data size is small, GPU acceleration is not obvious, mainly due to data communication and transmission time problems. The larger the data scale, the better the acceleration effect. RDF-GPU RDF-3X RDF-GPU RDF-3X gStore gStore-GPU 4.5 gStore gStore-GPU 10 5 10 Times Times 4 10 4 10 200 300 400 500 600 200 300 400 500 600 Data size Data size (a) query response time of Q1 (b) query response time of Q2 Fig. 3. Querying time in different engines under GPU and CPU. 4 Conclusions In this paper, we proposed a plugin-based Spark framework for real-time large- scale RDF streams processing on GPU in an efficient and simply way. In the future, we will take advantage of more novel computing hardware to increase the speed of large-scale RDF streams processing such as FPGA. 5 Acknowledgments This work is supported by the National Key Research and Development Program of China (2017YFC0908401) and the National Natural Science Foundation of China (61672377,61972455). Xiaowang Zhang is supported by the Peiyang Young Scholars in Tianjin University (2019XRX-0032). References 1. Anicic D., Fodor P., Rudolph S., and Stojanovic N.: EP-SPARQL: A unifed language for event processing and stream reasoning. In: Proc. of WWW 2011, pp. 635–644. 2. Barbieri D.F., Braga D., Ceri S., Della Valle E., Grossniklaus M.: Querying RDF streams with C-SPARQL. SIGMOD Rec., 39(1), 20–26 (2010). 3. Fang H., Zhao B., Zhang X., Yang X.: A united framework for large-scale resource description framework stream processing. J. Comput. Sci. Technol., 34(4): 762-774 (2019). 4. Le-Phuoc D., Dao-Tran M., Parreira J. X., Hauswirth M.: A native and adaptive approach for unifed processing of linked streams and linked data. In: Proc. of ISWC 2011, pp.370–388. 5. Li Q., Zhang X., Feng Z.: PRSP: A plugin-based framework for RDF stream pro- cessing. In: Proc. of WWW 2017 (Poster), pp.815–816.