TriAL-QL: Distributed Processing of Navigational Queries? Martin Przyjaciel-Zablocki1 , Alexander Schätzle1 , and Adrian Lange2 1 Department of Computer Science, University of Freiburg, Germany zablocki|schaetzle@informatik.uni-freiburg.de 2 IIG Telematics, University of Freiburg, Germany lange@iig.uni-freiburg.de Abstract. Navigational queries are natural query patterns for RDF data, but yet most existing RDF query languages fail to cover all the varieties inherent to its triple-based model, including SPARQL 1.1 and its derivatives. TriAL* is one of the most expressive approaches but not supported by any existing RDF triplestore. In this paper, we propose TriAL-QL, an easy to write and grasp language for TriAL*, preserv- ing its procedural structure. To demonstrate its feasibility, we provide a proof of concept implementation for Hadoop using Hive as execution layer and give some preliminary experimental results. 1 Introduction Graph databases and their respective graph query languages are commonly used for RDF data querying. However, in contrast to the standard graph model, an edge label in RDF (predicate) does not come from a finite alphabet and may also appear as a source or destination (subject and object, respectively) of another edge. Consequently, RDF query languages based on typical graph query lan- guages like nested regular expressions (NRE) are not capable of such constructs and lose important features, e.g. reasoning over predicates within a query [2]. To the best of our knowledge, there are only two RDF query languages that enable expressive navigational capabilities with reasoning and can be evaluated in polynomial time, namely Triple Query Language Lite (TriQ-Lite) [3] and Triple Algebra with Recursion (TriAL*) [4]. TriQ-Lite is defined as a Datalog extension that captures SPARQL queries enriched with the OWL 2 QL profile, whereas TriAL* is a closed language that works directly with triples includ- ing recursion over triple joins. Although, the steady growth of Semantic Web data with its high degree of diversity in both, structure and vocabulary, justi- fies such expressive RDF query languages, it also raises the need for solutions that scale with the data size. In recent years, Hadoop has become the de-facto standard for processing Big Data, with a recent trend on scalable, interactive SQL-on-Hadoop solutions. The descent of TriAL* from relational algebra and its inherent compositionality led us to the decision to build an implementation of ? A substantially extended version will appear in [5] TriAL* based on Hadoop. While TriAL* is a neat approach for querying RDF, its algebraic notation is inappropriate for practical usage. Thus, we propose the TriAL* Query Language (TriAL-Ql) that keeps the procedural structure of TriAL* by representing each algebra operation with a SQL-like statement. This way, even complex navigational queries are easy to grasp and write. Our major contributions can be summarized as follows: (1) We introduce TriAL-QL, a query language for TriAL* with an intuitive mapping between both. (2) Next, we provide an Hadoop-based implementation of TriAL-QL using Hive. Optimizations include a precomputed 1-hop neighborhood, different evaluation strategies and a carefully chosen storage layout. (3) Finally, we show some preliminary experiments that demonstrate the scalability and feasibility of our approach. 2 TriAL-QL: A Procedural Query Language for RDF The Triple Algebra with Recursion (TriAL*) [4] is one of the most expressive RDF query languages with polynomial complexity. In contrast to many other approaches, TriAL* is a closed and hence compositional language, i.e. the out- put is a set of triples rather than graphs or bindings. It works directly with triples, which allows us to write queries that are not expressible using query lan- guages based on the standard graph model (e.g. regular path queries and nested regular expressions). TriAL* takes the relational algebra as its basis with some restrictions to guarantee closure. The most important operator is a triple join between two ternary relations E1 and E2 representing sets of triples, defined as: E1 ./i,j,k θ,η E2 , where i, j, k ∈ {s1 , p1 , o1 , s2 , p2 , o2 } indicate the implicit projection on three fields to keep the operation closed with s1 referring to the subject of E1 , etc. θ represents the join conditions whereas η is a set of conditions be- tween objects and data values. To express paths of arbitrary length, recursion is added with the right (e ./i,j,k ∗ i,j,k ∗ θ,η ) and lef t (./θ,η e) Kleene closure, where e is a TriAL* expression. Thus, a reachability query that asks for pairs (x, z) which follow the connection pattern corre- sponds to the TriAL* expression (E ./ os11 ,=sp12, o2 )∗ with E being a relation name in a triplestore (cf. [4] for more details). TriAL-QL. Next, we introduce the notation of TriAL-QL. The basic idea is to flatten the algebra expressions of TriAL* to a sequence of interrelated statements. A complete grammar can be found on our project website 1 . Table 1 shows the algebra of TriAL* with the corresponding syntax in TriAL-QL. Each algebra operation is represented by exactly one SQL-like statement. Ac- cordingly, we can express the previously discussed reachability query by the following TriAL-QL statement: SELECT s1 , p1 , o2 FROM E ON o1 = s2 USING right 1 http://dbis.informatik.uni-freiburg.de/forschung/projekte/DiPoS/ Table 1. TriAL-QL Algebra & Syntax, where e, e1 and e2 correspond to a TriAL* expression. (i, j, k, θ, η as previously defined, cf. [4].) Algebra (TriAL*) Syntax (TriAL-QL) σθ,η (e) SELECT i, j, k FROM e FILTER θ, η e1 ./i,j,k θ,η e2 SELECT i, j, k FROM e1 JOIN e2 ON θ FILTER η e1 ∪ e2 e1 UNION e2 e1 − e2 e1 MINUS e2 (e ./i,j,k θ,η ) ∗ SELECT i, j, k FROM e ON θ FILTER η USING right i,j,k (./θ,η e)∗ SELECT i, j, k FROM e ON θ FILTER η USING lef t Next, we consider a more complex reachability problem introduced in [4] asking for pairs (x, z) which follow a connection pattern that requires reasoning capabilities: TriAL*: e1 = (E ./ sp11,=so22, o1 )∗ e2 = (e1 ./ so11 ,=sp12,, op21 =p2 )∗ ⇓ TriAL-QL: e1 = SELECT s1 , o2 , o1 FROM E ON p1 = s2 USING lef t e2 = SELECT s1 , p1 , o2 FROM e1 ON o1 = s2 , p1 = p2 USING lef t The inner expression e1 computes the transitive closure of the predicates from E, while e2 computes the transitive closure of this resulting relation. Again, we can see that TriAL-QL stays close to its TriAL* expression illustrating the strength of a compositional language where the result of the first statement can be used as input for the second. This makes TriAL-QL queries easy to write, understand and modify. Further, new operators can be added smoothly, meeting our requirements. First, we extend the syntax of TriAL-QL with a STORE operation that enables us to materialize the result of a TriAL* expression e as a new relation in a triplestore. This way, not only the output but also intermediate results can be stored for later processing, if desired. Next, as we focus on processing web-scale RDF data, we have seen the need to introduce more control over the recursion depth of the right and lef t Kleene operator. Therefore, we introduce a scalar that replaces the Kleene Star and limits the number of join compositions. In TriAL-QL this can be formulated within the USING clause by writing, e.g., lef t(4) for applying lef t Kleene four times. 3 A Distributed Execution Engine for TriAL-QL We implemented our execution engine on top of Hadoop using Hive as intermedi- ate layer rather than working directly with MapReduce. This makes us indepen- dent of any Hadoop changes (e.g. Yarn) while taking advantage of continuously optimized Hive versions or newer execution engines like Tez that come along with the recent SQL-on-Hadoop trend. Due to very limited space restrictions, we give only a brief introduction to our implementation. In short, a TriAL-QL query is first parsed to generate an abstract syntax tree and next mapped to TriAL* which is in turn translated into HiveQL queries. We investigated dif- ferent evaluation strategies based on (1) linear and (2) nonlinear recursion as introduced in [1] and performed exhaustive experiments with different storage formats (e.g. RCFile, ORC, Parquet) and strategies (e.g. indices, partitions) to identify best practices for storing RDF data in Hive. Further, a precomputed 1-hop neighborhood reduces the amount of required joins. Experiments. Some preliminary results are illustrated in Figure 1. We used the Social Intelligence Benchmark (SIB) data generator2 to create social networks of different sizes. The left hand side (a) shows execution times and number of resulting triples for an exemplary query with three joins and one set operation using linear evaluation. Both series exhibit an almost linear scaling behavior. The right hand side (b) compares the linear to the nonlinear evaluation on a more complex query involving the Kleene Star. In this example, the nonlinear evaluation is superior to the linear one with increasing data size as it reduces the amount of required join iterations from 12 (linear) to 8 (nonlinear). Exhaustive experiments that include more advanced evaluation strategies are needed for more comprehensive conclusions and are part of ongoing work [5]. However, the first preliminary results already demonstrate the scalability and feasibility of our approach. 100 11.77 15 linear 1,036 Runtimes Results (in M triples) Runtimes (in 1000 s) 71.1 1,250 Results 80 12.5 nonlinear Runtimes (in s) 1,000 7.99 734 10 693 60 6.33 750 540 5.13 7.5 34 40 343 19.5 500 2.49 5 2.17 12.1 250 20 4.8 2.5 0 0 0 0 200 400 600 1000 2000 200 400 600 Fig. 1. (a) execution times vs. results (linear) (b) linear vs. nonlinear execution References 1. Afrati, F.N., Borkar, V.R., Carey, M.J., Polyzotis, N., Ullman, J.D.: Map-reduce extensions and recursive queries. In: EDBT 2011, Sweden, March 21-24 (2011) 2. Angles, R.: A comparison of current graph database models. In: 28th ICDE Work- shops, 2012, Arlington, VA, USA, April 1-5 (2012) 3. Arenas, M., Gottlob, G., Pieris, A.: Expressive languages for querying the semantic web. In: PODS’14, Snowbird, UT, USA, June 22-27, 2014 (2014) 4. Libkin, L., Reutter, J.L., Vrgoc, D.: Trial for RDF: adapting graph query languages for RDF data. In: PODS 2013, New York, NY, USA - June 22 - 27, 2013 (2013) 5. Przyjaciel-Zablocki, M., Schätzle, A., Lausen, G.: TriAL-QL: Distributed Processing of Navigational Queries. In: 18th WebDB 2015, Melbourne, Australia (2015) 2 http://www.w3.org/wiki/Social_Network_Intelligence_BenchMark