-

TriAL-QL: Distributed Processing of Navigational Queries?

Martin Przyjaciel-Zablocki

Alexander Schatzle

Adrian Lange

lange@iig.uni-freiburg.de 1 0 Department of Computer Science, University of Freiburg , Germany 1 IIG Telematics, University of Freiburg , Germany

Navigational queries are natural query patterns for RDF data, but yet most existing RDF query languages fail to cover all the varieties inherent to its triple-based model, including SPARQL 1.1 and its derivatives. TriAL* is one of the most expressive approaches but not supported by any existing RDF triplestore. In this paper, we propose TriAL-QL, an easy to write and grasp language for TriAL*, preserving its procedural structure. To demonstrate its feasibility, we provide a proof of concept implementation for Hadoop using Hive as execution layer and give some preliminary experimental results.

Introduction ? A substantially extended version will appear in [ 5 ] TriAL* based on Hadoop. While TriAL* is a neat approach for querying RDF, its algebraic notation is inappropriate for practical usage. Thus, we propose the TriAL* Query Language (TriAL-Ql) that keeps the procedural structure of TriAL* by representing each algebra operation with a SQL-like statement. This way, even complex navigational queries are easy to grasp and write.

Our major contributions can be summarized as follows: (1) We introduce TriAL-QL, a query language for TriAL* with an intuitive mapping between both. (2) Next, we provide an Hadoop-based implementation of TriAL-QL using Hive. Optimizations include a precomputed 1-hop neighborhood, di erent evaluation strategies and a carefully chosen storage layout. (3) Finally, we show some preliminary experiments that demonstrate the scalability and feasibility of our approach. 2

TriAL-QL: A Procedural Query Language for RDF The Triple Algebra with Recursion (TriAL*) [ 4 ] is one of the most expressive RDF query languages with polynomial complexity. In contrast to many other approaches, TriAL* is a closed and hence compositional language, i.e. the output is a set of triples rather than graphs or bindings. It works directly with triples, which allows us to write queries that are not expressible using query languages based on the standard graph model (e.g. regular path queries and nested regular expressions ). TriAL* takes the relational algebra as its basis with some restrictions to guarantee closure. The most important operator is a triple join between two ternary relations E1 and E2 representing sets of triples, de ned as: E1 ./i;;j;k E2, where i; j; k 2 fs1; p1; o1; s2; p2; o2g indicate the implicit projection on three elds to keep the operation closed with s1 referring to the subject of E1, etc. represents the join conditions whereas is a set of conditions between objects and data values. To express paths of arbitrary length, recursion is added with the right (e ./i;;j;k) and lef t (./i;;j;k e) Kleene closure, where e is a TriAL* expression. Thus, a reachability query that asks for pairs (x; z) which follow the connection pattern corresponds to the TriAL* expression (E ./ so11;=ps12; o2 ) with E being a relation name in a triplestore (cf. [ 4 ] for more details).

TriAL-QL. Next, we introduce the notation of TriAL-QL. The basic idea is to atten the algebra expressions of TriAL* to a sequence of interrelated statements. A complete grammar can be found on our project website 1. Table 1 shows the algebra of TriAL* with the corresponding syntax in TriAL-QL. Each algebra operation is represented by exactly one SQL-like statement. Accordingly, we can express the previously discussed reachability query by the following TriAL-QL statement:

SELECT s1; p1; o2 FROM E ON o1 = s2 USING right 1 http://dbis.informatik.uni-freiburg.de/forschung/projekte/DiPoS/

Next, we consider a more complex reachability problem introduced in [ 4 ] asking for pairs (x; z) which follow a connection pattern that requires reasoning capabilities:

TriAL*: e1 = (E ./ sp11;=os22; o1 ) e2 = (e1 ./ so11;=ps12;; op21=p2 )

+ TriAL-QL: e1 = SELECT s1; o2; o1 FROM E

ON p1 = s2 USING lef t e2 = SELECT s1; p1; o2 FROM e1

ON o1 = s2; p1 = p2 USING lef t The inner expression e1 computes the transitive closure of the predicates from E, while e2 computes the transitive closure of this resulting relation. Again, we can see that TriAL-QL stays close to its TriAL* expression illustrating the strength of a compositional language where the result of the rst statement can be used as input for the second. This makes TriAL-QL queries easy to write, understand and modify.

Further, new operators can be added smoothly, meeting our requirements. First, we extend the syntax of TriAL-QL with a STORE operation that enables us to materialize the result of a TriAL* expression e as a new relation in a triplestore. This way, not only the output but also intermediate results can be stored for later processing, if desired. Next, as we focus on processing web-scale RDF data, we have seen the need to introduce more control over the recursion depth of the right and lef t Kleene operator. Therefore, we introduce a scalar that replaces the Kleene Star and limits the number of join compositions. In TriAL-QL this can be formulated within the USING clause by writing, e.g., lef t(4) for applying lef t Kleene four times. 3

A Distributed Execution Engine for TriAL-QL We implemented our execution engine on top of Hadoop using Hive as intermediate layer rather than working directly with MapReduce. This makes us independent of any Hadoop changes (e.g. Yarn) while taking advantage of continuously optimized Hive versions or newer execution engines like Tez that come along with the recent SQL-on-Hadoop trend. Due to very limited space restrictions, we give only a brief introduction to our implementation. In short, a TriAL-QL query is rst parsed to generate an abstract syntax tree and next mapped to TriAL* which is in turn translated into HiveQL queries. We investigated different evaluation strategies based on (1) linear and (2) nonlinear recursion as introduced in [ 1 ] and performed exhaustive experiments with di erent storage formats (e.g. RCFile, ORC, Parquet) and strategies (e.g. indices, partitions) to identify best practices for storing RDF data in Hive. Further, a precomputed 1-hop neighborhood reduces the amount of required joins.

Experiments. Some preliminary results are illustrated in Figure 1. We used the Social Intelligence Benchmark (SIB) data generator2 to create social networks of di erent sizes. The left hand side (a) shows execution times and number of resulting triples for an exemplary query with three joins and one set operation using linear evaluation. Both series exhibit an almost linear scaling behavior. The right hand side (b) compares the linear to the nonlinear evaluation on a more complex query involving the Kleene Star. In this example, the nonlinear evaluation is superior to the linear one with increasing data size as it reduces the amount of required join iterations from 12 (linear) to 8 (nonlinear). Exhaustive experiments that include more advanced evaluation strategies are needed for more comprehensive conclusions and are part of ongoing work [ 5 ]. However, the rst preliminary results already demonstrate the scalability and feasibility of our approach.

1;250 0

Runtimes Results 0 linear nonlinear 200 400 600 1000 2000 200 400 600

Fig. 1. (a) execution times vs. results (linear) (b) linear vs. nonlinear execution 2 http://www.w3.org/wiki/Social_Network_Intelligence_BenchMark

1. Afrati , F.N. , Borkar , V.R. , Carey , M.J. , Polyzotis , N. , Ullman , J.D. : Map-reduce extensions and recursive queries . In: EDBT 2011 , Sweden, March 21 - 24 ( 2011 )

2. Angles , R.: A comparison of current graph database models . In: 28th ICDE Workshops , 2012 , Arlington, VA , USA, April 1- 5 ( 2012 )

3. Arenas , M. , Gottlob , G. , Pieris , A. : Expressive languages for querying the semantic web . In: PODS'14 , Snowbird , UT , USA, June 22-27, 2014 ( 2014 )

4. Libkin , L. , Reutter , J.L. , Vrgoc , D. : Trial for RDF: adapting graph query languages for RDF data . In: PODS 2013 , New York, NY, USA - June 22 - 27, 2013 ( 2013 )

5. Przyjaciel-Zablocki , M. , Schatzle, A. , Lausen , G.: TriAL-QL: Distributed Processing of Navigational Queries . In: 18th WebDB 2015 , Melbourne, Australia ( 2015 )