PRONA: A Plugin for Well-Designed Approximate Queries in Jena Zhenyu Song1,3 , Xiaowang Zhang1,3,∗ , and Zhiyong Feng2,3 1 School of Computer Science and Technology, Tianjin University, Tianjin, China 2 School of Computer Software, Tianjin University, Tianjin, China 3 Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China ∗ Corresponding author: xiaowangzhang@tju.edu.cn Abstract. The time of answering a SPARQL query with its all exac- t solutions in large scale RDF dataset possibly exceeds users’ tolerable waiting time, especially when it contains the OPT operations. It be- comes essential to make a trade-off between the query response time and solution accuracy. We propose PRONA - an plugin for well-designed approximate queries in Jena, which provides help for users to answer well-designed SPARQL queries by approximate computation. The main features of PRONA comprise SPARQL query engine with approximate queries, as well as various approximate degrees for users to choose. 1 Introduction Resource Description Framework (RDF) is the standard data model in the Se- mantic Web. SPARQL recommended by W3C has become the standard language for querying RDF data since 2008. OPT operation takes an core role in UNION-free well-designed patterns. For simplification, we directly call well-designed patterns instead of UNION-free well- designed patterns. OPT operation aims to extend solutions for users[3]. It may take more time to obtain all exact solutions than only “non-optional” solutions. Removing some “optional” parts of well-designed queries is a natural idea to obtain approximate queries, which contributes to less query response time. For instance, consider a pattern Q as follows: Q = ((?x, rdf:type, artist) OPT ((?x, country, ?y) OPT (?x, company, ?z))). Based on this natural idea, there are two approximate patterns with less OPT operators as follows: – Q1 = (?x, rdf:type, artist); – Q2 = ((?x, rdf:type, artist) OPT (?x, country, ?y)). However, consider Q3 = ((?x, rdf:type, artist) OPT (?xcompany, ?z)), it is not approximate query since (?x, company, ?z) directly depends on (?x, country, ?y). The notion of approximation has been proposed in [1]. However, it did not pro- vide a fine-grained approximation method. Jena[2] is a free and open source Java framework for building semantic web and linked data applications. But it does not provide approximate queries for users to answer well-designed queries. In this paper, we focus on well-designed SPARQL queries, whose “optional” parts are really optional. Moreover, it is maximal among all fragments of LSQ [4]. Compared to our previous work [5], furthermore, we develop a plugin for Jena to answer well-designed SPARQL queries with approximate queries, which combines our approximate method and query process in Jena. 2 Preliminaries OPT Normal Form A UNION-free pattern P is in OPT normal form [3] if P meets one of the following two conditions: – P is constructed by using only the AND and FILTER operators; – P = (P1 OPT P2 ) where P1 and P2 patterns are in OPT normal form. For instance, the pattern Q stated in Section 1 is in OPT normal form. Three rewriting rules[3] can be applied to transform non OPT normal form into OPT normal form: let P, Q, R be patterns and C a constraint, – (P OPT R) FILTER C ≡ (P FILTER C) OPT R; – (P OPT R) AND Q ≡ (P AND Q) OPT R; – P AND (Q OPT R) ≡ (P AND Q) OPT R. Well-Designed Patterns A UNION-free pattern P is well-designed if the followings hold: – P is safe, that is, each subpattern of the form Q FILTER C of P holds the condition: var (C) ⊆ var (Q). – for every subpattern P 0 = (P1 OPT P2 ) of P and for every variable ?x occurring in P , the following condition hold: If ?x occurs both inside P2 and outside P 0 , then it also occurs in P1 . For instance, the pattern Q in Section 1 is a well-designed pattern. Note that the OPT operation provides really optional left-outer join due to the weak monotonicity [3]. 3 Approximate Queries OPT-depth in OPT Normal Form To characterize the different levels of optional patterns, we define OPT-depth of patterns in OPT normal form. Definition 1 (OPT-depth). Let P be a pattern in OPT normal form. We use dep(P ) to denote its OPT-depth as follows: – dep(P ) = 0 if P is an AF -pattern; – dep(P ) = max{dep(P1 ), . . . , dep(Pm )} + 1 if O(P ) = {P1 , . . . , Pm }. For instance, the OPT-depth of the pattern Q stated in Section 1 is 2. Approximate Queries Intuitively, approximate patterns are subpatterns obtained by reducing their OPT-depths. Definition 2 (k-approximation). Let P be a pattern in OPT normal form (P0 OPT P1 OPT . . . OPT Pm ) and k be a natural number. The k-approximate pattern of P (written as P (k) ) can be obtained in the following inductive way: 2 – P (k) = BGP (P ) if k = 0; (k−1) (k−1) – P (k) = P0 OPT P1 OPT . . . OPT Pm if 1 ≤ k ≤ dep(P ) − 1; – P (k) = P if k ≥ dep(P ). P (k) is more closed to P with higher value of k. For instance, in Section 1, Q = Q1 and Q(1) = Q2 . (0) 4 PRONA Plugin PRONA Overview PRONA is written in Java in a 3-tier design shown in Fig- ure 1(a). The bottom layer consists of the Jena framework and the ARQ4 query engine, both used as a black box for evaluating queries. Before answering SPAR- QL queries, the second layer provides the rewriting process and approximation evaluation, which lead to the generation of approximate queries. GUI is shown in Figure 1(b). For single query solution s, we denote its amount of domain as dom(s). For instance, consider solution µ = {?x → a, ?y → b}, dom(µ)=2. Given an approximate query solution S, which contains solution s1 , s2 , · · · , sn . It is notable that the total of domain (dom(s1 ) + dom(s2 ) + · · · + dom(sn )) reflects solution precision. (a) PRONA architecture (b) PRONA GUI Fig. 1. PRONA Overview Experiments The purpose of our experiments is to evaluate (1) the perfor- mance improvement of approximate well-designed SPARQL queries, and (2) the solution precision percentage after approximate queries. In our experiments, LUBM5 is used as dataset. Two 4-approximation well- designed SPARQL queries are designed. Q1 contains 4 OPT operators, 6 triple patterns and 6 variables. Q2 contains 14 OPT operators, 17 triple patterns and 16 variables. Approximate solution precision dividing original query solution precision leads to the solution precision percentage. It has shown in Figure 2 and Figure 3 that both query response time and solution precision reduce with the increment of approximate degree (k value 4 http://jena.sourceforge.net/ARQ 5 http://swat.cse.lehigh.edu/projects/lubm 3 decreases). Solution precision percentage decreases about 10% with 25% query response time decreasing when k changes from 4 to 3 in Figure 3. LUBM1 LUBM5 LUBM10 300 Response Time[ms] 200 k k=0 k=1 k=2 k=3 k=4 LUBM1 11.02 16.53 22.05 82.23 100 100 LUBM5 11.05 16.58 22.10 83.13 100 0 1 2 k 3 4 LUBM10 11.13 16.70 22.26 83.11 100 (a) Performance (b) Solution Precision Percentage(%) ·106 Fig. 2. Q1 over PRONA 4 LUBM1 LUBM5 LUBM10 3 Response Time[ms] 2 k k=0 k=1 k=2 k=3 k=4 1 LUBM1 0.01 5.19 68.74 93.75 100 0 LUBM5 0.01 5.25 68.70 93.59 100 0 1 2 k 3 4 LUBM10 0.01 5.27 68.75 93.68 100 (a) Performance (b) Solution Precision Percentage(%) Fig. 3. Q2 over PRONA 5 Conclusion In this paper, we propose PRONA which helps users answer well-designed SPAR- QL queries by approximate computation. In the future, we are going to handle other non-well-designed patterns and deal with more operations such as UNION. Acknowledgement. This work is supported by the program of the Nation- al Key Research and Development Program of China (2016YFB1000603) and the National Natural Science Foundation of China (NSFC) (61502336). References 1. P. Barcelo, R. Pichler, and S. Skritek. Efficient evaluation and approximation of well-designed pattern trees. In Proc. of PODS 2015, pages 131–144, 2015. 2. J. J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, and K. Wilkinson. Jena: implementing the semantic web recommendations. In Proc. of WWW 2004, pages 74–83, 2004. 3. J. Prez, M. Arenas, and C. Gutierrez. Semantics and complexity of SPARQL. ACM Transactions on Database Systems, 34(3):30–43, 2009. 4. M. Saleem, M. I. Ali, A. Hogan, Q. Mehmood, and A.-C. N. Ngomo. LSQ: The linked SPARQL queries dataset. In Proc. of ISWC 2015, pages 261–269, 2015. 5. Z. Song, X. Zhang, Z. Feng, X. Wang, and G. Rao. LSQ: The linked SPARQL queries dataset. In Proc. of SemiBDMA 2016, to appear, 2016. 4