Background


Proactive Replication of Dynamic Linked Data for Scalable RDF Stream Processing

Sejin Chun

Jooik Jung

Xiongnan Jin

Seungjun Yoon

sjyoong@icl.yonsei.ac.kr

Kyong-Ho Lee

khlee89@yonsei.ac.kr

Department of Computer Science, Yonsei University, Seoul, Republic of Korea

2015


In this paper, we propose a scalable method of proactively replicating a subset of remote datasets for RDF Stream Processing (RSP). Our solution achieves fast query processing by keeping the replicated data up to date before query evaluation. To construct the replication process effectively, we present an update estimation model that handles changes in the number of updates over time. With the update estimation model, we re-construct the replication process in response to outdated data. Finally, we run experiments with a real-world dataset to verify our solution.

Background

Solution

[Figure 1: system diagram. Labels recoverable from the image: RDF Stream, Window, SERVICE (sub-)queries, Update estimation model, {tc, tu}, {NR}, Nodes, ET, Replicated RDF Graphs, Materialized View, Result Integrator, (Continuous) Answers.]

Fig. 1. The proposed system

Our solution proactively replicates Linked Data for RSP. It refreshes the replicated data retrieved from a SPARQL endpoint before query evaluation. In other words, we keep the replicated data up to date before joining stream data with remote data. Thus, we achieve fast query processing while maintaining high accuracy, because no endpoint needs to be invoked at query evaluation time.

Figure 1 illustrates the proposed system. Given an RSP query that joins RDF streams with SERVICE patterns, a query manager accepts the query as input and divides it into two kinds of queries: a STREAM query and SERVICE (sub-)queries. The STREAM query is delegated to an RSP engine such as C-SPARQL, and the SERVICE (sub-)queries are delivered to a proactive replication component (PR). The RSP engine registers the STREAM query and evaluates it continuously. Meanwhile, from the SERVICE (sub-)queries, PR constructs a replication process NR, in which each instance invokes a remote service and materializes the result into a materialized view (MV). Lastly, a result integrator combines the results obtained from the RSP engine and PR, and produces answers continuously.
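To illustrate the final step, the sketch below shows one way a result integrator could join stream solution mappings with mappings already held in MV, so that no endpoint is contacted at evaluation time. The function names `compatible` and `integrate`, the dictionary representation of solution mappings, and the sample parking data are assumptions made for illustration; they are not taken from the implementation described here.

```python
def compatible(a, b):
    """Two solution mappings are compatible if they agree on shared variables."""
    return all(a[var] == b[var] for var in a.keys() & b.keys())


def integrate(stream_mappings, materialized_view):
    """Join stream mappings with replicated mappings held locally in MV.

    Hypothetical helper: a plain nested-loop join, chosen for clarity
    rather than speed.
    """
    return [
        {**s, **m}
        for s in stream_mappings
        for m in materialized_view
        if compatible(s, m)
    ]


# Illustrative data only (not from the evaluation).
stream = [{"?lot": "p1", "?free": 3}]
mv = [{"?lot": "p1", "?addr": "A"}, {"?lot": "p2", "?addr": "B"}]
answers = integrate(stream, mv)
```

Only the mapping for `p1` survives the join, because the second MV entry disagrees with the stream mapping on `?lot`.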

Specifically, PR consists of three phases: construction, recomposition, and synchronization. In the construction phase, PR constructs NR with an update estimation model. Each instance of NR is assigned to a node in order to obtain a subset of remote RDF data through a SPARQL endpoint. To model changes in the number of updates over time, our update estimation model is based on the inhomogeneous recurrent piecewise constant process [4]. The underlying assumption of such a process is that the update rate λ repeats every Q time units, in other words, λ(T) = λ(T + Q) for all time periods T. Thus, we construct an initial version of an update process NU by assigning λ to each given time interval.
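The periodicity assumption can be sketched as a small lookup structure. The following is a minimal illustration, assuming the update rate is stored as one constant per segment of a single period of length Q. The class name `PeriodicPiecewiseRate`, its parameters, and the example rates are hypothetical, and the expected-update count uses a simple fixed-step sum rather than any method from [4].

```python
from bisect import bisect_right


class PeriodicPiecewiseRate:
    """Piecewise-constant update rate that repeats every `period` time units."""

    def __init__(self, period, breakpoints, rates):
        # breakpoints[i] is the start of segment i inside one period.
        if not breakpoints or breakpoints[0] != 0 or len(breakpoints) != len(rates):
            raise ValueError("need one rate per segment, first segment starting at 0")
        self.period = period
        self.breakpoints = list(breakpoints)
        self.rates = list(rates)

    def rate(self, t):
        """Rate at absolute time t; by construction rate(t) == rate(t + period)."""
        phase = t % self.period
        return self.rates[bisect_right(self.breakpoints, phase) - 1]

    def expected_updates(self, start, end, step=1.0):
        """Approximate expected number of updates in [start, end).

        Fixed-step sum; exact when segment boundaries fall on step boundaries.
        """
        total, t = 0.0, start
        while t < end:
            width = min(step, end - t)
            total += self.rate(t) * width
            t += width
        return total


# Illustrative daily pattern (hours): quiet night, busy day, moderate evening.
model = PeriodicPiecewiseRate(period=24, breakpoints=[0, 6, 18], rates=[0.5, 4.0, 1.0])
```

A periodic lookup of this kind lets the same small table answer rate queries for any future time, which is what makes planning replication ahead of query evaluation possible.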

With the initial version of NU, we create and deploy the instances of NR based on a set of evaluation times ET at which stream data are selected. Let a time-based sliding window W be defined by (ω, β), where ω is the width of the window and β is the slide, i.e., the gap between the opening time instants of consecutive windows. Given a query q that contains one or more windows W, we compute ET = {τ1, ..., τn} for q, where each τi is the evaluation time of the i-th window Wi of W. Therefore, we determine the number of instances of NR and their positions from NU and ET.
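As a rough illustration of how the evaluation times follow from the window definition, the sketch below enumerates them for one time-based sliding window. It assumes that a window is evaluated at the instant it closes, which the text does not state explicitly; the function name `evaluation_times` and the `start` and `horizon` parameters are hypothetical.

```python
def evaluation_times(width, slide, start, horizon):
    """Closing instants of consecutive sliding windows that end by `horizon`.

    Assumption: window k opens at start + k * slide and is evaluated
    when it closes, i.e. at start + k * slide + width.
    """
    if width <= 0 or slide <= 0:
        raise ValueError("width and slide must be positive")
    times = []
    k = 0
    while start + k * slide + width <= horizon:
        times.append(start + k * slide + width)
        k += 1
    return times


# A 10-unit window sliding every 5 units, observed up to time 30.
et = evaluation_times(width=10, slide=5, start=0, horizon=30)
```

Each entry of `et` is a point by which the replicated data should already be fresh, so the list bounds how many replication instances are worth scheduling.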

Given a time interval Q, the solution mappings Ω of a SERVICE pattern, and the update estimation model Λ = (T, λ) for all time periods T, we define a replication process NR of Ω as follows:

NR(Ω) = (T, λ, r(λ))    (1)

where the vector r(λ) represents the effective replication instances with the value λ for each time interval T, and each r(λ) is composed of half-open intervals of the form [s, f). The start time s is the time at which the SERVICE pattern corresponding to Ω is executed, and the finish time f is the time at which the solution mappings retrieved from the endpoint are replicated.
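The half-open interval form [s, f) can be made concrete with a small data type. This is only a sketch: the names `ReplicationInstance`, `overlaps`, and `schedule_instances`, and the policy of finishing each instance exactly at an evaluation time with a fixed service latency, are assumptions for illustration and are not the scheduling policy of the proposed system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationInstance:
    start: float   # s: time at which the SERVICE pattern is executed
    finish: float  # f: time at which the mappings are replicated (exclusive)

    def covers(self, t):
        """True if t lies in the half-open interval [start, finish)."""
        return self.start <= t < self.finish


def overlaps(a, b):
    """Half-open intervals overlap only if each starts before the other ends."""
    return a.start < b.finish and b.start < a.finish


def schedule_instances(eval_times, latency):
    """Hypothetical policy: one instance per evaluation time, ending at it."""
    return [ReplicationInstance(e - latency, e) for e in eval_times]


instances = schedule_instances([10, 15], latency=1.0)
```

Because the finish time is exclusive, an instance that finishes at time 10 and another that starts at time 10 do not overlap, so back-to-back instances can share a node.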

In the synchronization phase, PR receives the set of solution mappings retrieved by each instance of NR and replicates them into MV. To renew the update estimation model, the information about the replicated data (i.e., whether the data has changed (tc) or not (tu)) is transferred and used to compute a new λ over a time period.

In the recomposition phase, PR re-constructs NR using the new λ and cost metrics at time t, namely M(t) and G(t). In detail, M(t) is defined as the number of updates missing from MV at time t. A larger M(t) deteriorates the freshness of MV and decreases the accuracy of the answer. G(t) is defined as the number of replication instances in [0, t] whose SERVICE pattern result is identical to the data replicated in the prior release. Thus, reducing G(t) improves the performance of maintaining MV in terms of stability.
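The two cost metrics can be illustrated with a short sketch computed from logs. It assumes that the true update times of the remote data and the times at which MV was synchronized are both known after the fact, and that each replication result is stored as a set of mappings. The function names `missed_updates` and `redundant_instances` and the sample logs are hypothetical.

```python
def missed_updates(update_times, sync_times, t):
    """M(t): updates that happened after the last synchronization up to t."""
    last_sync = max((s for s in sync_times if s <= t), default=float("-inf"))
    return sum(1 for u in update_times if last_sync < u <= t)


def redundant_instances(results, t):
    """G(t): instances in [0, t] whose result equals the previous release.

    `results` is a list of (finish_time, frozenset_of_mappings), sorted by time.
    """
    count = 0
    previous = None
    for finish, data in results:
        if finish > t:
            break
        if previous is not None and data == previous:
            count += 1
        previous = data
    return count


# Illustrative logs only.
updates = [1, 4, 6, 9]
syncs = [0, 5]
log = [
    (1, frozenset({"a"})),
    (2, frozenset({"a"})),
    (3, frozenset({"b"})),
    (4, frozenset({"b"})),
]
```

With these logs, the updates at times 6 and 9 are missed at time 10, and two of the four replication instances fetched data identical to the prior release.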

To derive the new λ from irregular invocations to endpoints, we use a maximum-likelihood estimator (MLE) [5]. The MLE computes the λ that has the highest probability of producing the observed set of changes, which are detected from accesses. Since each access to an endpoint can determine whether the requested dataset has been updated (tc) or not (tu), we can estimate the new λ without the complete history of updates.
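A minimal sketch of such an estimator follows, assuming that updates within one interval follow a Poisson process with rate λ, so that an access made after an interval of length d detects a change with probability 1 - exp(-λ d). Under that assumption the log-likelihood is the sum of log(1 - exp(-λ d)) over intervals with a detected change (tc) minus λ times the total length of intervals without one (tu), and its derivative is strictly decreasing in λ, so bisection finds the maximum. The function name `estimate_rate`, the search bounds, and the sample observations are hypothetical; this is not claimed to be the exact procedure of [5].

```python
import math


def estimate_rate(observations, lo=1e-9, hi=1e3, iterations=200):
    """MLE of a Poisson change rate from irregularly spaced accesses.

    observations: list of (interval_length, changed) pairs, where `changed`
    tells whether the access at the end of the interval saw a change.
    """
    changed = [d for d, c in observations if c]
    unchanged_total = sum(d for d, c in observations if not c)
    if not changed:
        return 0.0       # no change ever observed
    if unchanged_total == 0:
        return math.inf  # every access saw a change; the likelihood is unbounded

    def score(lam):
        # Derivative of the log-likelihood with respect to lam.
        total = 0.0
        for d in changed:
            x = lam * d
            total += d / math.expm1(x) if x < 700 else 0.0  # avoid overflow
        return total - unchanged_total

    for _ in range(iterations):
        mid = (lo + hi) / 2
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Ten accesses at unit spacing, three of which detected a change.
regular = [(1.0, True)] * 3 + [(1.0, False)] * 7
rate = estimate_rate(regular)
```

For equally spaced accesses the maximum has the closed form -ln(1 - X/n), with X changes detected in n accesses, which gives a convenient sanity check for the numerical search.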

Evaluation

Experimental Setup. We developed our solution on top of C-SPARQL. To compare with the state-of-the-art work, we implemented the process of maintaining MV described in [3]. In addition, we selected CQELS as a baseline method, which generally performs better than C-SPARQL. We used query Q6 and its related datasets from CityBench (footnote 2). In addition, we extended the query by adding remote services that provide real-world parking information (footnotes 3 and 4). Here, to keep the average response time of a service consistent, e.g., 1 s, we used subqueries of the form <entityURI> ?p ?o. The average result size (about 850 KB) and the number of results (about 5,000 records) are approximately the same at every query evaluation.

Experimental Result. Figure 2 shows the average execution time of processing Q6 while varying the number of SERVICE patterns. On average, our method took five seconds less than the method of [3]. Specifically, the execution time was reduced by 0.5 seconds for 2 services, 1 second for 4 services, 6 seconds for 8 services, and 11 seconds for 16 services. This improvement arises because our solution pulls the replicated data from MV at every query evaluation.

[Figure 2: the proposed method vs. the baseline method; caption partially lost: "... to the number of SERVICE patterns"]

[Second figure; caption partially lost: "... and the number of missing updates"]

Using parking information collected during a week, we checked how many updates were missing from MV. We then measured the accuracy of the replicated data using Jaccard similarity, defined as the size of the intersection of the replicated set and the answer set divided by the size of their union. In most hours, the result has high accuracy and a small number of missing updates (00:00 to 05:00 and 06:00 to 24:00), whereas in some cases the number of missing updates is larger but the accuracy is still high (05:00 to 06:00). In addition, we used the Pearson correlation coefficient to estimate the strength of the relationship between the accuracy and the number of missing updates. The obtained coefficient was -0.234, which indicates that the correlation is weak. From this experiment, we learned that our solution of keeping the replicated data up to date before query evaluation may not have a strong influence on the accuracy of the answer.

2 https://github.com/CityBench/Benchmark
3 https://www.parkwhiz.com
4 http://lod.seoul.go.kr/

Acknowledgement. This work was supported by the ICT R&D program of MSIP/IITP, Republic of Korea [B0101-16-1276, Access Network Control Techniques for Various IoT Services].

1. Barbieri, D. F., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: C-SPARQL: SPARQL for continuous querying. In: WWW, pp. 1061-1062. ACM (2009)
2. Le-Phuoc, D., Dao-Tran, M., Parreira, J. X., Hauswirth, M.: A native and adaptive approach for unified processing of linked streams and linked data. In: ISWC 2011
3. ... Approximate continuous query answering over streams and dynamic linked data ...