LDQL: A Language for Linked Data Queries Olaf Hartig http://olafhartig.de Abstract In this paper, we propose LDQL, that is, a language to query Linked Data on the Web. The novelty of LDQL is that it enables a user to express sepa- rately (i) patterns that describe the expected query result, and (ii) Web navigation paths that select the data sources to be used for computing the result. As a down- side of this expressiveness, we find that for some LDQL queries, a complete exe- cution is not possible in practice. To address this issue, we show a syntactic prop- erty based on which systems can identify queries that do not have this limitation. 1 Introduction In recent years an increasing amount of structured data has been published and in- terlinked on the World Wide Web (WWW) in adherence to the Linked Data princi- ples [2,3]. These principles are based on standard Web technologies. In particular, (i) the Hypertext Transfer Protocol (HTTP) is used to access data, (ii) HTTP-based Uniform Resource Identifiers (URIs) are used as identifiers for entities described in the data, and (iii) the Resource Description Framework (RDF) is used as data model. Then, any HTTP URI in an RDF triple presents a data link that enables software clients to retrieve more data by looking up the URI with an HTTP request. The adoption of these princi- ples has lead to the creation of a globally distributed dataspace: the Web of Linked Data. From a graph database perspective the Web of Linked Data can be conceived of as two graphs of different types: First, the (virtual) union of all RDF triples in the Web of Linked Data represents an RDF graph [6]; this graph is distributed over the documents in the Web of Linked Data. Second, these documents constitute the nodes of a link graph whose edges represent the aforementioned data links that connect the documents. The emergence of the Web of Linked Data makes possible an online execution of declarative queries over up-to-date data from a virtually unbounded set of data sources, each of which is readily accessible without any need for implementing source-specific APIs or wrappers. This possibility has spawned research interest in approaches to query Linked Data on the WWW as if it was a single (distributed) database. For an overview on query execution techniques proposed in this context refer to [12]. While there does not exist a standard language for expressing such queries, a few options have been proposed. In particular, a first strand of research focuses on extending the scope of SPARQL—which is the standard query language for centralized collections of RDF data [9]—such that an evaluation of SPARQL queries over Linked Data on the WWW has a well-defined semantics [4,10,11,14,16]. A second strand of research focuses on navigational languages (e.g., NautiLOD [8]). A commonality of all these proposals is that querying the aforementioned two graphs that comprise the Web of Linked Data (i.e., the link graph and the RDF data graph) is inherently entangled in each of the proposed query formalisms. That is, for any query expressed in such a formalism, the definition of query-relevant regions of the link graph and the definition of query-rel- evant data from the RDF graph within the specified regions depend on one another. In this paper, we study the alternative approach of avoiding such an inherent de- pendency. To this end, we propose a novel query language for Linked Data which we call LDQL. In contrast to the aforementioned query formalisms, LDQL separates query components for selecting regions of the link graph from components for specifying the query result that has to be constructed from the data in the selected regions. For the for- mer, LDQL introduces the notion of a link path expression, which is a form of nested regular expression, and for the latter we use SPARQL. More precisely, the most basic type of LDQL queries consists of a link path expression and a SPARQL graph pattern. Additionally, such queries can be combined using conjunctions and disjunctions, where some subqueries may provide the starting points of navigation for other subqueries. The downside of the expressiveness provided by LDQL are queries for which a complete execution is not possible in practice (because of the lack of a complete, up- to-date data catalog that is inherent to the WWW). To capture this issue formally, we define a notion of Web-safeness for LDQL queries. Then, the obvious question that arises is how to identify those LDQL queries that are Web-safe. Our contribution to answer this question is to show a sufficient syntactic condition for Web-safeness. The paper is structured as follows: Section 2 introduces a data model that provides the basis for defining the semantics of LDQL formally. In Section 3 we define LDQL and show simple algebraic properties. Thereafter, in Section 4 we focus on Web-safe- ness. Section 5 concludes the paper and sketches future work. For proofs of the formal results in this paper we refer to the extended version of this paper [13]. 2 Data Model In this section we introduce a structural data model that captures the concept of a Web of Linked Data formally. As usual [4,8,10,11,14,16], for the definitions and analysis in this paper, we assume that the Web is fixed during the execution of any single query. We use the RDF data model [6] as a basis for our model of a Web of Linked Data. That is, we assume three pairwise disjoint, infinite sets U (URIs), B (blank nodes), and L (literals). Then, an RDF triple is a tuple hs, p, oi ∈ (U ∪ B) × U × (U ∪ B ∪ L). For any RDF triple t = hs, p, oi we write uris(t) to denote the set of all URIs in t. Additionally, we assume another infinite set D that is disjoint from U, B, and L, respectively. We refer to elements in this set as Linked Data documents, or documents for short, and use them to represent the concept of Web documents from which Linked Data can be extracted. Hence, we assume a function, say data, that maps each document d ∈ D to a finite set of RDF triples data(d) ⊆ (U ∪ B) × U × (U ∪ B ∪ L) such that the data of each document uses a unique set of blank nodes; i.e., for any pair of distinct documents d, d0 ∈ D, and any two RDF triples t = hs, p, oi and t0 = hs0, p0, o0 i with t ∈ data(d) and t0 ∈ data(d0 ), it holds that {s, p, o} ∩ {s0, p0, o0 } ∩ B = ∅. Given these preliminaries, we are ready to define a Web of Linked Data. Definition 1. A Web of Linked Data is a tuple W = hD, adoci that consists of a set of documents D ⊂ D and a partial function adoc : U → D that is surjective. Function adoc of a Web of Linked Data W = hD, adoci captures the relationship between the URIs that can be looked up in this Web and the documents that can be retrieved by such lookups. Since not every URI can be looked up, the function is par- tial. For any URI u ∈ U with u ∈ dom(adoc) (i.e., any URI that can be looked up in W ), document d = adoc(u) can be considered the authoritative source of data for u in W (hence, the name adoc). To accommodate for documents that are authoritative for multiple URIs, we do not require injectivity for function adoc. However, we require surjectivity because we conceive documents as irrelevant for a Web of Linked Data if they cannot be retrieved by any URI lookup in this Web. In this paper, we assume that the set of documents D in any Web of Linked Data W = hD, adoci is finite, in which case we say W is finite [11]. Moreover, given a Web of Linked Data W = hD, adoci, we say that a URI u ∈ uris(t) in an RDF triple t = hs, p, oi establishes a data link to a document d ∈ D if adoc(u) = d. As a final concept, our data model formalizes the aforementioned notion of a link graph in which directed edges represent data links between documents. In the following formal definition of this graph, each edge is associated with a label that identifies both the particular RDF triple and the URI in this triple that establishes the corresponding data link. These labels shall provide the basis for defining the navigational component of our query language (cf. Section 3.1). Definition 2. The link graph of a Web of Linked Data W = hD, adoci is a directed, edge-labeled multigraph hD, Ei whose vertices are all documents in W, and which has a directed edge hdsrc , t, u, dtgt i ∈ E from document dsrc ∈ D to document dtgt ∈ D if the data of dsrc contains an RDF triple t with a URI u ∈ uris(t) that establishes a data link to dtgt ; the edge is labeled with both the triple t and the URI u. Hence, the set of edges E ⊆ D × T × U × D is defined as follows:  E = hdsrc , t, u, dtgt i t ∈ data(dsrc ) such that u ∈ uris(t) and adoc(u) = dtgt . Example 1. As a running example for this paper assume a simple Web of Linked Data Wex = hDex , adocex i with three documents, dA , dB and dC (i.e., Dex = {dA , dB , dC }). The data in these documents are the following sets of RDF triples: data(dA ) = {huA , p1 , uB i, data(dB ) = {huB , p1 , uC i}; huB , p2 , uC i}; data(dC ) = {huA , p2 , uC i}; and function adocex is given as follows: adocex (uA ) = dA , adocex (uB ) = dB , and adocex (uC ) = dC (i.e., dom(adocex ) = {uA , uB , uC }). This Web contains 8 data links. For instance, URI uA in the RDF triple in data(dC ) establishes a data link to document dA . Hence, the corresponding edge in the link graph of Wex is dC , huA , p2 , uC i, uA , dA . Figure 1 illustrates this link graph with all 8 edges. 3 Definition of LDQL This section defines our Linked Data query language, LDQL. LDQL queries are meant to be evaluated over a Web of Linked Data and each such query is built from two types Figure 1. The link graph of our example Web of Linked Data Wex . of components: First, there are link path expressions for selecting query-relevant doc- uments of the queried Web of Linked Data; second, there are SPARQL graph patterns for specifying the query result that has to be constructed from the data in the selected documents. For this paper, we assume that the reader is familiar with the definition of SPARQL [9], including the algebraic formalization introduced by Pérez et al. [15]. In the following, we first formalize our notion of link path expressions. Thereafter, we define a syntax and semantics of LDQL queries and show simple algebraic proper- ties of such queries. For brevity, we introduce LDQL also based on an algebraic syntax. 3.1 Link Path Expressions Link Path Expressions (LPEs) are a form of nested regular expressions for navigation over the link graph of a Web of Linked Data. The base case for LPEs is to select single link graph edges in the context of a designated URI. To this end,  we introduce link pat- terns, that is, tuples hs, p, oi ∈ U ∪ { , +} × U ∪ { , +} × U ∪ L ∪ { , +} , where is a special symbol that denotes a wildcard and + is another special symbol that denotes a placeholder for the context URI. Then, a link graph edge hdsrc , t, u, dtgt i with t = hx1 , x2 , x3 i matches a link pattern lp = hy1 , y2 , y3 i in the context of a URI uctx if for each i ∈ {1, 2, 3}, any of the following three properties holds: 1. yi = , or 2. yi = + and xi = uctx , or 3. yi = xi . Example 2. Consider the link pattern lpex = h , p2 , i. The link graph of our example Web Wex (cf. Example 1) contains two edges that match lpex in the context of URI uA , namely, the edges dA , huB , p2 , uC i, uB , dB and dA , huB , p2 , uC i, uC , dC . The rationale for adopting the notion of a designated context URI in our definition of link patterns is to support cases in which link graph navigation has to be focused solely on data links that are authoritative, where we call a data link authoritative if it is established by an RDF triple in a document dsrc such that dsrc is the authoritative document for some URI in the triple. Formally, in terms of link graph edges, an edge hdsrc , t, u, dtgt i ∈ E in the link graph hD, Ei of a Web of Linked Data W = hD, adoci represents an authoritative data link if adoc(u0 ) = dsrc for some URI u0 ∈ uris(t). Example 3. Consider the link pattern lp0ex = h , p2 , +i, which is a more restrictive variation of the link pattern discussed in the previous example. In contrast to lpex , there does not exist any edge in the link graph of Wex that matches lp0ex in the context of URI uA . Any such edge would have to represent an authoritative data link; more pre- cisely, the RDF triple of such a data link must have the context URI (i.e., uA ) in the ob- ject position, which is not the case for the two edges that match the less restrictive link pattern lpex . Notice also that, if we use URI uC as context instead, there exists an edge that matches link pattern lp0ex in the context of uC , namely dC , huA , p2 , uC i, uA , dA . Given the notion of a link pattern, we define LPEs as nested regular expressions over the alphabet that consists of all possible link patterns. Hence, an LPE is an expression defined by the following grammar (where lp is an arbitrary link pattern): lpe = ε | lp | lpe/lpe | lpe|lpe | lpe ∗ | [lpe] Note that our notion of LPEs does not provide an operator for navigating paths in their inverse direction. The reason for omitting such an operator is that traversing arbitrary data links backwards is impossible on the WWW. The semantics of LPEs is based on a designated context URI (similar to our notion of matching edges for link patterns). In particular, the semantics requires that each path of link graph edges that satisfies an LPE starts from the document that is authoritative for the given context URI. Then, the result of evaluating an LPE represents the end nodes of all such paths. More precisely, the result is a set of URIs where, informally, the lookup of each such URI returns the document that constitutes the end node of the corresponding path. The following definition captures the semantics of LPEs formally. Definition 3. Let lpe be an LPE, let W = hD, adoci be a Web of Linked Data with link graph hD, Ei, and let uctx be a URI. The uctx -based evaluation of lpe over W, denoted by [[lpe]]uWctx, is a set of URIs that is empty if uctx ∈ / dom(adoc); if uctx ∈ dom(adoc), then the set is defined recursively as follows (where dctx = adoc(uctx ), lp is a link pattern, and lpe, lpe 1 , lpe 2 are arbitrary LPEs): [[ ε ]]uWctx = {uctx } [[lp]]uWctx = {u | hdctx , t, u, di ∈ E that matches lp in the context of uctx } 0 [[lpe 1 /lpe 2 ]]uWctx = {u ∈ [[lpe 2 ]]uW | u0 ∈ [[lpe 1 ]]uWctx } [[lpe 1 |lpe 2 ]]uWctx = [[lpe 1 ]]uWctx ∪ [[lpe 2 ]]uWctx [[lpe ∗ ]]uWctx = {uctx } ∪ [[lpe]]uWctx ∪ [[lpe/lpe]]uWctx ∪ [[lpe/lpe/lpe]]uWctx ∪ ... [[ [lpe] ]]uWctx = {uctx | [[lpe]]uWctx 6= ∅}. Example 4. Let lpe ex be the LPE h , p2 , i (i.e., the link pattern discussed in Ex- ample 2). For our example Web Wex and context URI uA , we know by Example 2 that the link graph edges dA , huB , p2 , uC i, uB , dB and dA , huB , p2 , uC i, uC , dC match the pattern in the context of URI uA . Hence, the LPE selects documents dB = adocex (uB ) and dC = adocex (uC ). More precisely, we have [[lpe ex ]]uWAex = {uB , uC }. Example 5. As another example consider the slightly more complex LPE lpe 0ex which is of the form h , p1 , i∗ /[h , p2 , i]. This LPE selects documents that can be reached via arbitrarily long paths of data links with predicate p1 and, additionally, have some outgoing data link with predicate p2 . For our example Web Wex and context URI uA , all three documents, dA , dB and dC , can be reached via a p1 -path from URI uA , but only dA and dC pass the p2 -related test. Hence, we have [[lpe 0ex ]]uWAex = {uA , uC }. While LPEs allows users to select documents from the queried Web of Linked Data, in the context of LDQL these documents are used to form an RDF dataset for evaluating a given SPARQL graph pattern. Formally, we specify this dataset as follows. Definition 4. Let uctx be a URI, let lpe be an LPE, and let W = hD, adoci be a Web of Linked Data. The uctx -lpe-selected dataset over W is an RDF dataset D = hG0 , N i (as per [9,1]) whose default graph G0 is the union of all Named Graphs of the dataset, and that contains a Named Graph hu, Gi ∈ N for every URI u ∈ [[lpe]]uWctx whose lookup (in W ) results in retrieving a document; hence, S G0 = hu,Gi∈N G, and uctx N = {hu, Gi | u ∈ [[lpe]]W and u ∈ dom(adoc) and G = data(adoc(u))}. Example 6. Consider context URI uA and the aforementioned example LPE lpe 0ex for which we have [[lpe 0ex ]]uWAex = {uA , uC } (cf. Example 5). Then, the uA -lpe 0ex -selected dataset over Wex is Dex = hGex , Nex i with Nex = {huA , data(dA )i, huC , data(dC )i} and, thus, Gex = {huA , p1 , uB i, huB , p2 , uC i, huA , p2 , uC i} (cf. Example 1). 3.2 LDQL Queries We are now ready to define LDQL queries formally. Definition 5. An LDQL query is defined recursively as follows: 1. Any tuple q = hu, lpe, P i consisting of a URI u, an LPE lpe, and a SPARQL graph pattern P is an LDQL query—hereafter, referred to as a basic LDQL query; 2. Any tuple q = h?v, lpe, P i consisting of a variable ?v, an LPE lpe, and a SPARQL graph pattern P is an LDQL query; 3. If q1 and q2 are LDQL queries, then (q1 AND q2 ) and (q1 UNION q2 ) are LDQL queries. Before introducing the formal semantics of LDQL queries, we provide some in- tuition about it. As per Definition 5, the most basic type of an LDQL query consists of a URI, an LPE, and a SPARQL graph pattern. Informally, the result of such a query q = hu, lpe, P i is the set of SPARQL solution mappings that can be obtained by evaluating the SPARQL pattern P over the u-lpe-selected dataset. Example 7. Consider a basic LDQL query huA , lpe 0ex , Bex i whose LPE is the afore- mentioned example LPE lpe 0ex (cf. Example 5) and whose SPARQL graph pattern is a basic graph pattern that contains two triple patterns, Bex = {h?x, p1 , ?yi, h?x, p2 , ?zi}. According to the semantics outlined above (and defined formally below), the result of this query over our example Web Wex consists of a single solution mapping, namely µ = {?x 7→ uA , ?y 7→ uB , ?z 7→ uC }. To see why we obtain this result, recall from Example 6 that the default graph of the uA -lpe 0ex -selected dataset over Wex is Gex = {huA , p1 , uB i, huB , p2 , uC i, huA , p2 , uC i}. Then, by the standard (set) semantics of SPARQL [1], the result of evaluating basic graph pattern Bex over this graph Gex is a singleton set that contains solution mapping µ. The other types of LDQL queries also have a set of solution mappings as their result: Operators AND and UNION represent conjunctions and disjunctions, respectively. The sec- ond base type of LDQL queries, tuples with a variable instead of a fixed (context) URI, is similar to basic LDQL queries with the difference that the variable can range over al- ternative context URIs; this type of query is meant to be used in conjunctions in which the other conjunct is used to select context URIs at query execution time (e.g., see query 00 qex in Example 8 below). The following definition formalizes the query semantics. Definition 6. The evaluation of an LDQL query q over a Web of Linked Data W, denoted by [[q]]W , is defined recursively as follows (where u is a URI, ?v is a variable, lpe is an LPE, P is a SPARQL graph pattern, [[P ]]D G0 denotes the evaluation of P over an RDF graph G0 in an RDF dataset D = hG0 , N i [1, Definition 13.3], and q1 and q2 are arbitrary LDQL queries): [[hu, lpe, P i]]W = [[P ]]D G0 ( D = hG0 , N i is the u-lpe-selected dataset over W ), S  [[h?v, lpe, P i]]W = u∈U [[hu, lpe, P i]]W o n {µu } ( where µu = {?v 7→ u} ), [[(q1 AND q2 )]]W = [[q1 ]]W o n [[q2 ]]W , [[(q1 UNION q2 )]]W = [[q1 ]]W ∪ [[q2 ]]W . Example 8. Consider an LDQL query qex = ?x, ε, h?x, p1 , ?zi , which is of the sec- ond base type in Definition 5 (with the SPARQL graph pattern being a single triple pattern). Additionally, let qex 0 = uA , lpe 0ex , {h?x, p1 , ?yi, h?x, p2 , ?zi} be the basic 00 LDQL query introduced in Example 7, and let qex be the conjunction of these two 00 0 0 queries; i.e., qex = (qex AND qex ). By Example 7 we know that [[qex ]]Wex = {µ} (with solution mapping µ as given in Example 7). Furthermore, based on the data given in Example 1, it is easy to see that [[qex ]]Wex = {µ1 , µ2 } with µ1 = {?x 7→ uA , ?z 7→ uB } 00 and µ2 = {?x 7→ uB , ?z 7→ uC }. For the evaluation of query qex over Wex , the result 0 sets [[qex ]]Wex and [[qex ]]Wex have to be joined. While for µ1 ∈ [[qex ]]Wex , there does not 0 exist a compatible join candidate in [[qex ]]Wex , solution mapping µ2 ∈ [[qex ]]Wex is com- 0 00 patible with µ ∈ [[qex ]]Wex . Consequently, we have [[qex ]]Wex = {µ0 } with µ0 = µ ∪ µ2 . 3.3 Algebraic Properties of LDQL Queries As a basis for the discussion in the next section, we show simple algebraic properties. Lemma 1. The operators AND and UNION are associative and commutative, respec- tively, and the operator AND distributes over UNION . That is, if q1 , q2 , and q3 are LDQL queries, then the following semantic equivalences hold: – (q1 AND q2 ) ≡ (q2 AND q1 ); – (q1 UNION q2 ) ≡ (q2UNION q1 );  – q1 AND (q2 AND q3 ) ≡ (q1 AND q2 ) AND q3 ;   – q1 UNION (q2 UNION q3 ) ≡ (q1 UNION q2 ) UNION q3 ;   – q1 AND (q2 UNION q3 ) ≡ (q1 AND q2 ) UNION (q1 AND q3 ) . Proof. Since the definition of LDQL operators AND and UNION is equivalent to their SPARQL counterparts, the semantic equivalences in Lemma 1 follow from correspond- ing equivalences for SPARQL graph patterns as shown by Pérez et al. [15, Lemma 2.5]. Lemma 1 allows us to write sequences of either AND or UNION without parentheses. Hereafter, we assume that every UNION-free LDQL query is represented as an LDQL query of the form (q1 AND q2 AND ... AND qm ) where each subquery qi (1 ≤ i ≤ m) is either of the form hu, lpe, P i (i.e., a basic LDQL query) or of the form h?v, lpe, P i. Moreover, we say that an LDQL query is in UNION normal form if it is of the form (q1 UNION q2 UNION ... UNION qn ) such that each subquery qi (1 ≤ i ≤ n) is a UNION-free LDQL query. The following result is an immediate consequence of Lemma 1. Corollary 1. For every LDQL query, there exists a semantically equivalent LDQL query that is in UNION normal form. 4 Web-Safeness of LDQL Queries In this section we study the “Web-safeness” of LDQL queries, where, informally, we call a query Web-safe if a complete execution of the query over the WWW is possible in practice (which is not the case for all LDQL queries as we shall see). To provide a more formal definition of this notion of Web-safeness we make the fol- lowing observations. While the mathematical structures introduced by our data model capture the notion of Linked Data on the WWW formally (and, thus, allow us to provide a formal semantics for LDQL queries), in practice, these structures are not available completely for the WWW. For instance, given that an infinite number of strings can be used as HTTP URIs [7], we cannot assume complete information about which URIs are in the domain of the partial function adoc (i.e., can be looked up to retrieve some doc- ument) and which are not; in fact, disclosing this information would require a process that systematically tries to look up every possible HTTP URI and, thus, would never terminate because of the aforementioned infiniteness. Therefore, it is also impossible to guarantee the discovery of every document in the set D (without looking up an infinite number of URIs). Consequently, any query whose execution requires a complete enu- meration of this set is not feasible in practice. Based on these observations, we define Web-safeness of LDQL queries as follows. Definition 7. An LDQL query q is Web-safe if there exists an algorithm that, for any finite Web of Linked Data W = hD, adoci, computes [[q]]W by looking up only a finite number of URIs without assuming an a priori availability of any information about the sets D and dom(adoc). 0 00 Example 9. Recall our example queries qex , qex , and qex (cf. Example 8). For query qex = ?x, ε, h?x, p1 , ?zi , any URI u ∈ U may be used to obtain a nonempty subset of the query result as long as a lookup of u retrieves a document whose data includes RDF triples that match hu, p1 , ?zi. Therefore, without access to D or dom(adoc) of the queried Web W = hD, adoci, the completeness of the computed query result can be guaranteed only by checking each of the infinitely many possible HTTP URIs. Hence, query qex is not Web-safe. In contrast, although it contains qex as a subquery, query 00 qex = (qex AND qex 0 ) is Web-safe, and so is qex0 = huA , lpe 0ex , Bex i. A possible execution 0 0 uA algorithm for qex may first compute [[lpe ex ]]W by traversing the queried Web W based on the given LPE lpe 0ex . Thereafter, the algorithm retrieves documents by looking up all URIs u ∈ [[lpe 0ex ]]uWA (or simply keeps these documents after the traversal); and, finally, the algorithm evaluates pattern Bex over the union of the RDF triples in the retrieved documents. If W is finite (i.e., contains a finite number of documents) the traversal process requires a finite number of URI lookups only, and so does the retrieval of doc- 00 uments in the second step; the final step does not look up any URI. To see that qex is 0 also Web-safe we note that after executing subquery qex (e.g., by using the algorithm as outlined before), the execution of the other (non-Web-safe) subquery qex can be reduced to a finite number of URI lookups, namely the URIs bound to variable ?x in solution 0 mappings obtained for subquery qex . Although any other URI may also be used to ob- tain solution mappings for qex , such solution mappings cannot be joined with any of the 0 00 solution mappings for qex and, thus, are irrelevant for the result of qex . The example illustrates that there exists an LDQL query that is not Web-safe. In fact, it is not difficult to see that the argument for the non-Web-safeness of query qex as made in the example can be applied to any LDQL query of the form h?v, lpe, P i; that is, none of these queries is Web-safe. However, the example also shows that more complex queries that contain such non-Web-safe subqueries may still be Web-safe. Therefore, we now introduce properties to identify LDQL queries that are Web-safe (even if some of their subqueries are not). As a basis for the discussion, we first observe that the Web- safeness of an LDQL query carries over to any semantically equivalent LDQL query. Fact 1. An LDQL query q is Web-safe if there exists an LDQL query q 0 such that q and q 0 are semantically equivalent and q 0 is Web-safe. In conjunction with Corollary 1, Fact 1 allows us to focus our discussion on LDQL queries in UNION normal form without losing generality. It is easy to show that such queries are Web-safe if all of their (top-level) subqueries are Web-safe: Proposition 1. An LDQL query of the form (q1 UNION q2 UNION ... UNION qn ) is Web- safe if each subquery qi (1 ≤ i ≤ n) is Web-safe. Proof. Assume each subquery qi is Web-safe. Hence, for each qi , there exists an al- gorithm that satisfies the conditions in Definition 7. Then, the Web-safeness of LDQL query (q1 UNION q2 UNION ... UNION qn ) is easily shown by specifying another algorithm that calls the algorithms of the subqueries sequentially and unions their results. By Proposition 1, we can show that an LDQL query in UNION normal form is Web- safe by showing that each of its UNION-free subqueries is Web-safe. Therefore, in the remainder of this section we focus the Web-safeness of UNION-free LDQL queries. 00 0 Our discussion of the ( UNION-free) query qex = (qex AND qex ) in Example 9 suggests that UNION -free LDQL queries can be shown to be Web-safe if it is possible to execute any non-Web-safe subquery by using variable bindings obtained from other subqueries. A necessary condition for such an execution strategy is that the variable in question (i.e., variable ?x in the case of non-Web-safe subquery qex = ?x, ε, h?x, p1 , ?zi ) is guaran- teed to be bound in every possible solution mapping obtained from the other subqueries. To allow for an automated verification of this condition we adopt Buil-Aranda et al.’s notion of strongly bound variables [5].1 To this end, for any SPARQL graph pat- tern P , let sbvars(P ) denote the set of strongly bound variables in P as defined by Buil-Aranda et al. [5]. For the sake of space, we do not repeat the definition here. How- ever, we emphasize that sbvars(P ) can be constructed recursively, and each variable in sbvars(P ) is guaranteed to be bound in every possible solution for P [5, Proposition 1]. To carry over these properties to LDQL queries, we use the notion of strongly bound variables in SPARQL patterns to define the following notion of strongly bound variables in LDQL queries; thereafter, in Lemma 2, we show the desired boundedness guarantee. Definition 8. The set of strongly bound variables in a LDQL query q, denoted by sbvars(q), is defined recursively as follows: 1. If q is of the form hu, lpe, P i, then sbvars(q) = sbvars(P ). 2. If q is of the form h?v, lpe, P i, then sbvars(q) = sbvars(P ) ∪ {?v}. 3. If q is of the form (q1 AND q2 ), then sbvars(q) = sbvars(q1 ) ∪ sbvars(q2 ). 4. If q is of the form (q1 UNION q2 ), then sbvars(q) = sbvars(q1 ) ∩ sbvars(q2 ). Lemma 2. Let q be an LDQL query. For every Web of Linked Data W and every solu- tion mapping µ ∈ [[q]]W , it holds that sbvars(q) ⊆ dom(µ). Proof. Lemma 2 follows trivially from Definition 8 and [5, Proposition 1]. We are now ready to show the following result. Theorem 1. A UNION-free LDQL query (q1 AND q2 AND ... AND qm ) is Web-safe if there exists a total order ≺ over the set of subqueries {q1 , q2 , ... , qm } such that for each subquery qi (1 ≤ i ≤ m), it holdsSthat either (i) qi is a basic LDQL query or (ii) qi is of the form h?v, lpe, P i and ?v ∈ qj ≺qi sbvars(qj ). Proof (Sketch). We prove Theorem 1 based on an iterative algorithm that generalizes 00 the execution strategy outlined for query qex in Example 9. That is, the algorithm exe- cutes the subqueries q1 , q2 , ... , qm sequentially in the order ≺ such that each iteration step executes one of the subqueries by using the solution mappings computed during the previous step. By the conditions in Theorem 1, the first subquery (according to ≺) 1 While we may also adopt Buil-Aranda et al.’s notion of bound variables (not to be confused with their notion of strongly bound variables), a definition of bound variables in LDQL queries would be based directly on the boundedness of variables in SPARQL patterns. Then, it is not difficult to see that the undecidability of verifying whether a given variable is bound in a given SPARQL pattern [5] would also carry over to LDQL queries. Due to space limitations, we omit discussing boundedness and use directly the decidable alternative (i.e., strong boundedness). cannot be of the form h?v, lpe, P i; hence, the first step of the iteration is guaranteed to execute a basic LDQL query. This execution resembles the execution strategy as out- 0 lined for the (basic) query qex in Example 9. For any subsequent iteration step, if the subquery executed by that step is of the form h?v, lpe, P i, then the corresponding con- dition in Theorem 1 (in conjunction with Lemma 2) guarantees that ?v ∈ dom(µ) for every solution mapping µ obtained from the previous step. These mappings are finitely many (for a finite Web of Linked Data). Then, the algorithm uses the URIs bound to variable ?v in these mappings as the only possible context URIs for the execution of the subquery (instead of using all URIs), which is sufficient because solution mappings computed for the subquery by using some other context URI would not be join compati- ble with any of the solution mappings obtained from the previous step. For a full formal proof and the complete algorithm we refer to the extended version of this paper [13]. To recapitulate, we summarize our proposed procedure to test a given LDQL query q for Web-safeness based on our results in this paper: First, by using the algebraic prop- erties in Lemma 1, the query has to be rewritten into a semantically equivalent LDQL query qnf = (q1 UNION q2 UNION ... UNION qn ) that is in UNION normal form, which is pos- sible for any LDQL query (cf. Corollary 1). Thereafter, the following Web-safeness test has to repeated for every subquery qi (1 ≤ i ≤ n); recall that each of these subqueries is a UNION-free LDQL query qi = (q1i AND q2i AND ... AND qm i i ). The test is to find an or- i i i der for their subqueries q1 , q2 , ... , qmi that satisfies the conditions in Theorem 1. Every top-level subquery qi (1 ≤ i ≤ n) for which such an order exists, is Web-safe (cf. The- orem 1). If all top-level subqueries are identified to be Web-safe by this test, then qnf is Web-safe (cf. Proposition 1), and so is q (cf. Fact 1). We conclude the section by pointing out the following limitation of our results: Even if the given conditions are sufficient to show that an LDQL query is Web-safe, they are not sufficient for showing the opposite. It remains an open question whether there exists a (decidable) property of all Web-safe LDQL queries that is sufficient and necessary. 5 Concluding Remarks and Future Work LDQL, the query language that we introduce in this paper, allows users to express queries over Linked Data on the WWW. We defined LDQL such that navigational fea- tures for selecting the query-relevant documents on the Web are separate from patterns that are meant to be evaluated over the data in the selected documents. This separation distinguishes LDQL from other approaches to express queries over Linked Data. For instance, neither any of the existing SPARQL-based approaches [4,10,11,14,16] nor 0 00 NautiLOD [8] can be used to express queries such as our example queries qex and qex . Given that our analysis of LDQL in this paper focuses primarily on Web-safeness, in future work the language may be studied with respect to the complexity of query evaluation and the problem of query containment. A formal study of the expressive power of LDQL would also be interesting. A more practical direction for future research on LDQL is the development of approaches to execute LDQL queries efficiently. References 1. M. Arenas, C. Gutierrez, and J. Pérez. On the Semantics of SPARQL. In Semantic Web Information Management - A Model-Based Perspective, chapter 13. Springer, 2009. 2. T. Berners-Lee. Linked Data. Online at http://www.w3.org/DesignIssues/ LinkedData.html, 2006. 3. C. Bizer, T. Heath, and T. Berners-Lee. Linked Data – The Story So Far. Semantic Web and Information Systems, 5(3):1–22, 2009. 4. P. Bouquet, C. Ghidini, and L. Serafini. Querying The Web Of Data: A Formal Approach. In Proceedings of the 4th Asian Semantic Web Conference, 2009. 5. C. Buil-Aranda, M. Arenas, and O. Corcho. Semantics and Optimization of the SPARQL 1.1 Federation Extension. In Proceedings of the 8th Extended Semantic Web Conference, 2011. 6. R. Cyganiak, D. Wood, M. Lanthaler, G. Klyne, J. J. Carroll, and B. McBride. RDF 1.1 Con- cepts and Abstract Syntax. W3C Recommendation, Online at http://www.w3.org/ TR/rdf11-concepts/, Feb. 2014. 7. R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, L. Masinter, P. J. Leach, and T. Berners-Lee. Hypertext Transfer Protocol – HTTP/1.1. RFC 2616, Online at http://tools.ietf. org/html/rfc2616, June 1999. 8. V. Fionda, G. Pirrò, and C. Gutierrez. NautiLOD: A Formal Language for the Web of Data Graph. ACM Transactions on the Web, 9(1):1–43, 2015. 9. S. Harris, A. Seaborne, and E. Prud’hommeaux. SPARQL 1.1 Query Language. W3C Rec- ommendation, Online at http://www.w3.org/TR/sparql11-query/, Mar. 2013. 10. A. Harth and S. Speiser. On Completeness Classes for Query Evaluation on Linked Data. In Proceedings of the 26th AAAI Conference, 2012. 11. O. Hartig. SPARQL for a Web of Linked Data: Semantics and Computability. In Proceedings of the 9th Extended Semantic Web Conference, 2012. 12. O. Hartig. An Overview on Execution Strategies for Linked Data Queries. Datenbank- Spektrum, 13(2), 2013. 13. O. Hartig. LDQL: A Language for Linked Data Queries (Extended Version). Online at http://olafhartig.de/files/LDQL-ext.pdf, 2015. 14. O. Hartig and G. Pirrò. A Context-Based Semantics for SPARQL Property Paths over the Web. In Proceedings of the 12th Extended Semantic Web Conference, 2015. 15. J. Pérez, M. Arenas, and C. Gutierrez. Semantics and Complexity of SPARQL. ACM Trans- actions on Database Systems, 34, 2009. 16. J. Umbrich, A. Hogan, A. Polleres, and S. Decker. Link Traversal Querying for a Diverse Web of Data. Semantic Web Journal, 2014.