LDQL: A Language for Linked Data Queries

                                          Olaf Hartig

                                 http://olafhartig.de


       Abstract In this paper, we propose LDQL, that is, a language to query Linked
       Data on the Web. The novelty of LDQL is that it enables a user to express sepa-
       rately (i) patterns that describe the expected query result, and (ii) Web navigation
       paths that select the data sources to be used for computing the result. As a down-
       side of this expressiveness, we find that for some LDQL queries, a complete exe-
       cution is not possible in practice. To address this issue, we show a syntactic prop-
       erty based on which systems can identify queries that do not have this limitation.


1   Introduction
In recent years an increasing amount of structured data has been published and in-
terlinked on the World Wide Web (WWW) in adherence to the Linked Data princi-
ples [2,3]. These principles are based on standard Web technologies. In particular, (i) the
Hypertext Transfer Protocol (HTTP) is used to access data, (ii) HTTP-based Uniform
Resource Identifiers (URIs) are used as identifiers for entities described in the data,
and (iii) the Resource Description Framework (RDF) is used as data model. Then, any
HTTP URI in an RDF triple presents a data link that enables software clients to retrieve
more data by looking up the URI with an HTTP request. The adoption of these princi-
ples has lead to the creation of a globally distributed dataspace: the Web of Linked Data.
     From a graph database perspective the Web of Linked Data can be conceived of as
two graphs of different types: First, the (virtual) union of all RDF triples in the Web of
Linked Data represents an RDF graph [6]; this graph is distributed over the documents
in the Web of Linked Data. Second, these documents constitute the nodes of a link graph
whose edges represent the aforementioned data links that connect the documents.
     The emergence of the Web of Linked Data makes possible an online execution of
declarative queries over up-to-date data from a virtually unbounded set of data sources,
each of which is readily accessible without any need for implementing source-specific
APIs or wrappers. This possibility has spawned research interest in approaches to query
Linked Data on the WWW as if it was a single (distributed) database. For an overview
on query execution techniques proposed in this context refer to [12].
     While there does not exist a standard language for expressing such queries, a few
options have been proposed. In particular, a first strand of research focuses on extending
the scope of SPARQL—which is the standard query language for centralized collections
of RDF data [9]—such that an evaluation of SPARQL queries over Linked Data on
the WWW has a well-defined semantics [4,10,11,14,16]. A second strand of research
focuses on navigational languages (e.g., NautiLOD [8]). A commonality of all these
proposals is that querying the aforementioned two graphs that comprise the Web of
Linked Data (i.e., the link graph and the RDF data graph) is inherently entangled in each
of the proposed query formalisms. That is, for any query expressed in such a formalism,
the definition of query-relevant regions of the link graph and the definition of query-rel-
evant data from the RDF graph within the specified regions depend on one another.
    In this paper, we study the alternative approach of avoiding such an inherent de-
pendency. To this end, we propose a novel query language for Linked Data which we
call LDQL. In contrast to the aforementioned query formalisms, LDQL separates query
components for selecting regions of the link graph from components for specifying the
query result that has to be constructed from the data in the selected regions. For the for-
mer, LDQL introduces the notion of a link path expression, which is a form of nested
regular expression, and for the latter we use SPARQL. More precisely, the most basic
type of LDQL queries consists of a link path expression and a SPARQL graph pattern.
Additionally, such queries can be combined using conjunctions and disjunctions, where
some subqueries may provide the starting points of navigation for other subqueries.
    The downside of the expressiveness provided by LDQL are queries for which a
complete execution is not possible in practice (because of the lack of a complete, up-
to-date data catalog that is inherent to the WWW). To capture this issue formally, we
define a notion of Web-safeness for LDQL queries. Then, the obvious question that
arises is how to identify those LDQL queries that are Web-safe. Our contribution to
answer this question is to show a sufficient syntactic condition for Web-safeness.
    The paper is structured as follows: Section 2 introduces a data model that provides
the basis for defining the semantics of LDQL formally. In Section 3 we define LDQL
and show simple algebraic properties. Thereafter, in Section 4 we focus on Web-safe-
ness. Section 5 concludes the paper and sketches future work. For proofs of the formal
results in this paper we refer to the extended version of this paper [13].


2   Data Model
In this section we introduce a structural data model that captures the concept of a Web
of Linked Data formally. As usual [4,8,10,11,14,16], for the definitions and analysis in
this paper, we assume that the Web is fixed during the execution of any single query.
    We use the RDF data model [6] as a basis for our model of a Web of Linked Data.
That is, we assume three pairwise disjoint, infinite sets U (URIs), B (blank nodes), and
L (literals). Then, an RDF triple is a tuple hs, p, oi ∈ (U ∪ B) × U × (U ∪ B ∪ L). For
any RDF triple t = hs, p, oi we write uris(t) to denote the set of all URIs in t.
    Additionally, we assume another infinite set D that is disjoint from U, B, and L,
respectively. We refer to elements in this set as Linked Data documents, or documents
for short, and use them to represent the concept of Web documents from which Linked
Data can be extracted. Hence, we assume a function, say data, that maps each document
d ∈ D to a finite set of RDF triples data(d) ⊆ (U ∪ B) × U × (U ∪ B ∪ L) such that
the data of each document uses a unique set of blank nodes; i.e., for any pair of distinct
documents d, d0 ∈ D, and any two RDF triples t = hs, p, oi and t0 = hs0, p0, o0 i with
t ∈ data(d) and t0 ∈ data(d0 ), it holds that {s, p, o} ∩ {s0, p0, o0 } ∩ B = ∅.
    Given these preliminaries, we are ready to define a Web of Linked Data.
Definition 1. A Web of Linked Data is a tuple W = hD, adoci that consists of a set
of documents D ⊂ D and a partial function adoc : U → D that is surjective.
     Function adoc of a Web of Linked Data W = hD, adoci captures the relationship
between the URIs that can be looked up in this Web and the documents that can be
retrieved by such lookups. Since not every URI can be looked up, the function is par-
tial. For any URI u ∈ U with u ∈ dom(adoc) (i.e., any URI that can be looked up
in W ), document d = adoc(u) can be considered the authoritative source of data for u
in W (hence, the name adoc). To accommodate for documents that are authoritative
for multiple URIs, we do not require injectivity for function adoc. However, we require
surjectivity because we conceive documents as irrelevant for a Web of Linked Data if
they cannot be retrieved by any URI lookup in this Web.
     In this paper, we assume that the set of documents D in any Web of Linked Data
W = hD, adoci is finite, in which case we say W is finite [11]. Moreover, given a
Web of Linked Data W = hD, adoci, we say that a URI u ∈ uris(t) in an RDF triple
t = hs, p, oi establishes a data link to a document d ∈ D if adoc(u) = d.
     As a final concept, our data model formalizes the aforementioned notion of a link
graph in which directed edges represent data links between documents. In the following
formal definition of this graph, each edge is associated with a label that identifies both
the particular RDF triple and the URI in this triple that establishes the corresponding
data link. These labels shall provide the basis for defining the navigational component
of our query language (cf. Section 3.1).
Definition 2. The link graph of a Web of Linked Data W = hD, adoci is a directed,
edge-labeled multigraph hD, Ei whose vertices are all documents in W, and which has
a directed edge hdsrc , t, u, dtgt i ∈ E from document dsrc ∈ D to document dtgt ∈ D if
the data of dsrc contains an RDF triple t with a URI u ∈ uris(t) that establishes a data
link to dtgt ; the edge is labeled with both the triple t and the URI u. Hence, the set of
edges E ⊆ D × T × U × D is defined as follows:
        
   E = hdsrc , t, u, dtgt i t ∈ data(dsrc ) such that u ∈ uris(t) and adoc(u) = dtgt .

Example 1. As a running example for this paper assume a simple Web of Linked Data
Wex = hDex , adocex i with three documents, dA , dB and dC (i.e., Dex = {dA , dB , dC }).
The data in these documents are the following sets of RDF triples:

           data(dA ) = {huA , p1 , uB i,            data(dB ) = {huB , p1 , uC i};
                          huB , p2 , uC i};         data(dC ) = {huA , p2 , uC i};

and function adocex is given as follows: adocex (uA ) = dA , adocex (uB ) = dB , and
adocex (uC ) = dC (i.e., dom(adocex ) = {uA , uB , uC }). This Web contains 8 data links.
For instance, URI uA in the RDF triple in data(dC ) establishes a data link to document
dA . Hence, the corresponding edge in the link graph of Wex is dC , huA , p2 , uC i, uA , dA .
Figure 1 illustrates this link graph with all 8 edges.


3   Definition of LDQL
This section defines our Linked Data query language, LDQL. LDQL queries are meant
to be evaluated over a Web of Linked Data and each such query is built from two types
              Figure 1. The link graph of our example Web of Linked Data Wex .


of components: First, there are link path expressions for selecting query-relevant doc-
uments of the queried Web of Linked Data; second, there are SPARQL graph patterns
for specifying the query result that has to be constructed from the data in the selected
documents. For this paper, we assume that the reader is familiar with the definition of
SPARQL [9], including the algebraic formalization introduced by Pérez et al. [15].
    In the following, we first formalize our notion of link path expressions. Thereafter,
we define a syntax and semantics of LDQL queries and show simple algebraic proper-
ties of such queries. For brevity, we introduce LDQL also based on an algebraic syntax.


3.1   Link Path Expressions

Link Path Expressions (LPEs) are a form of nested regular expressions for navigation
over the link graph of a Web of Linked Data. The base case for LPEs is to select single
link graph edges in the context of a designated URI. To this end,
                                                                  we introduce link pat-
terns, that is, tuples hs, p, oi ∈ U ∪ { , +} × U ∪ { , +} × U ∪ L ∪ { , +} ,
where is a special symbol that denotes a wildcard and + is another special symbol
that denotes a placeholder for the context URI. Then, a link graph edge hdsrc , t, u, dtgt i
with t = hx1 , x2 , x3 i matches a link pattern lp = hy1 , y2 , y3 i in the context of a
URI uctx if for each i ∈ {1, 2, 3}, any of the following three properties holds:

 1. yi = , or
 2. yi = + and xi = uctx , or
 3. yi = xi .

Example 2. Consider the link pattern lpex = h , p2 , i. The link graph of our example
Web Wex (cf. Example 1) contains two edges that match lpex in the context of URI uA ,
namely, the edges dA , huB , p2 , uC i, uB , dB and dA , huB , p2 , uC i, uC , dC .

    The rationale for adopting the notion of a designated context URI in our definition
of link patterns is to support cases in which link graph navigation has to be focused
solely on data links that are authoritative, where we call a data link authoritative if it
is established by an RDF triple in a document dsrc such that dsrc is the authoritative
document for some URI in the triple. Formally, in terms of link graph edges, an edge
hdsrc , t, u, dtgt i ∈ E in the link graph hD, Ei of a Web of Linked Data W = hD, adoci
represents an authoritative data link if adoc(u0 ) = dsrc for some URI u0 ∈ uris(t).
Example 3. Consider the link pattern lp0ex = h , p2 , +i, which is a more restrictive
variation of the link pattern discussed in the previous example. In contrast to lpex , there
does not exist any edge in the link graph of Wex that matches lp0ex in the context of
URI uA . Any such edge would have to represent an authoritative data link; more pre-
cisely, the RDF triple of such a data link must have the context URI (i.e., uA ) in the ob-
ject position, which is not the case for the two edges that match the less restrictive link
pattern lpex . Notice also that, if we use URI uC as context instead, there exists an edge
that matches link pattern lp0ex in the context of uC , namely dC , huA , p2 , uC i, uA , dA .
    Given the notion of a link pattern, we define LPEs as nested regular expressions over
the alphabet that consists of all possible link patterns. Hence, an LPE is an expression
defined by the following grammar (where lp is an arbitrary link pattern):

                           lpe = ε | lp | lpe/lpe | lpe|lpe | lpe ∗ | [lpe]

    Note that our notion of LPEs does not provide an operator for navigating paths
in their inverse direction. The reason for omitting such an operator is that traversing
arbitrary data links backwards is impossible on the WWW.
    The semantics of LPEs is based on a designated context URI (similar to our notion
of matching edges for link patterns). In particular, the semantics requires that each path
of link graph edges that satisfies an LPE starts from the document that is authoritative
for the given context URI. Then, the result of evaluating an LPE represents the end
nodes of all such paths. More precisely, the result is a set of URIs where, informally,
the lookup of each such URI returns the document that constitutes the end node of the
corresponding path. The following definition captures the semantics of LPEs formally.
Definition 3. Let lpe be an LPE, let W = hD, adoci be a Web of Linked Data with link
graph hD, Ei, and let uctx be a URI. The uctx -based evaluation of lpe over W, denoted
by [[lpe]]uWctx, is a set of URIs that is empty if uctx ∈
                                                        / dom(adoc); if uctx ∈ dom(adoc),
then the set is defined recursively as follows (where dctx = adoc(uctx ), lp is a link
pattern, and lpe, lpe 1 , lpe 2 are arbitrary LPEs):

               [[ ε ]]uWctx = {uctx }
               [[lp]]uWctx = {u | hdctx , t, u, di ∈ E that matches lp in the context of uctx }
                                                  0
    [[lpe 1 /lpe 2 ]]uWctx = {u ∈ [[lpe 2 ]]uW | u0 ∈ [[lpe 1 ]]uWctx }
     [[lpe 1 |lpe 2 ]]uWctx = [[lpe 1 ]]uWctx ∪ [[lpe 2 ]]uWctx
            [[lpe ∗ ]]uWctx = {uctx } ∪ [[lpe]]uWctx ∪ [[lpe/lpe]]uWctx ∪ [[lpe/lpe/lpe]]uWctx ∪ ...
          [[ [lpe] ]]uWctx = {uctx | [[lpe]]uWctx 6= ∅}.

Example 4. Let lpe ex be the LPE h , p2 , i (i.e., the link pattern discussed in Ex-
ample 2). For our example Web Wex and context URI uA , we know by Example 2 that
the link graph edges dA , huB , p2 , uC i, uB , dB and dA , huB , p2 , uC i, uC , dC match the
pattern in the context of URI uA . Hence, the LPE selects documents dB = adocex (uB )
and dC = adocex (uC ). More precisely, we have [[lpe ex ]]uWAex = {uB , uC }.
Example 5. As another example consider the slightly more complex LPE lpe 0ex which is
of the form h , p1 , i∗ /[h , p2 , i]. This LPE selects documents that can be reached
via arbitrarily long paths of data links with predicate p1 and, additionally, have some
outgoing data link with predicate p2 . For our example Web Wex and context URI uA ,
all three documents, dA , dB and dC , can be reached via a p1 -path from URI uA , but only
dA and dC pass the p2 -related test. Hence, we have [[lpe 0ex ]]uWAex = {uA , uC }.

    While LPEs allows users to select documents from the queried Web of Linked Data,
in the context of LDQL these documents are used to form an RDF dataset for evaluating
a given SPARQL graph pattern. Formally, we specify this dataset as follows.

Definition 4. Let uctx be a URI, let lpe be an LPE, and let W = hD, adoci be a Web of
Linked Data. The uctx -lpe-selected dataset over W is an RDF dataset D = hG0 , N i (as
per [9,1]) whose default graph G0 is the union of all Named Graphs of the dataset, and
that contains a Named Graph hu, Gi ∈ N for every URI u ∈ [[lpe]]uWctx whose lookup (in
W ) results in retrieving a document; hence,
            S
      G0 = hu,Gi∈N G, and
                               uctx
      N = {hu, Gi | u ∈ [[lpe]]W    and u ∈ dom(adoc) and G = data(adoc(u))}.

Example 6. Consider context URI uA and the aforementioned example LPE lpe 0ex for
which we have [[lpe 0ex ]]uWAex = {uA , uC } (cf. Example 5). Then, the uA -lpe 0ex -selected
dataset over Wex is Dex = hGex , Nex i with Nex = {huA , data(dA )i, huC , data(dC )i}
and, thus, Gex = {huA , p1 , uB i, huB , p2 , uC i, huA , p2 , uC i} (cf. Example 1).


3.2   LDQL Queries

We are now ready to define LDQL queries formally.

Definition 5. An LDQL query is defined recursively as follows:

 1. Any tuple q = hu, lpe, P i consisting of a URI u, an LPE lpe, and a SPARQL graph
    pattern P is an LDQL query—hereafter, referred to as a basic LDQL query;
 2. Any tuple q = h?v, lpe, P i consisting of a variable ?v, an LPE lpe, and a SPARQL
    graph pattern P is an LDQL query;
 3. If q1 and q2 are LDQL queries, then (q1 AND q2 ) and (q1 UNION q2 ) are LDQL queries.

     Before introducing the formal semantics of LDQL queries, we provide some in-
tuition about it. As per Definition 5, the most basic type of an LDQL query consists
of a URI, an LPE, and a SPARQL graph pattern. Informally, the result of such a
query q = hu, lpe, P i is the set of SPARQL solution mappings that can be obtained
by evaluating the SPARQL pattern P over the u-lpe-selected dataset.

Example 7. Consider a basic LDQL query huA , lpe 0ex , Bex i whose LPE is the afore-
mentioned example LPE lpe 0ex (cf. Example 5) and whose SPARQL graph pattern is a
basic graph pattern that contains two triple patterns, Bex = {h?x, p1 , ?yi, h?x, p2 , ?zi}.
According to the semantics outlined above (and defined formally below), the result of
this query over our example Web Wex consists of a single solution mapping, namely
µ = {?x 7→ uA , ?y 7→ uB , ?z 7→ uC }. To see why we obtain this result, recall
from Example 6 that the default graph of the uA -lpe 0ex -selected dataset over Wex is
Gex = {huA , p1 , uB i, huB , p2 , uC i, huA , p2 , uC i}. Then, by the standard (set) semantics
of SPARQL [1], the result of evaluating basic graph pattern Bex over this graph Gex is
a singleton set that contains solution mapping µ.
     The other types of LDQL queries also have a set of solution mappings as their result:
Operators AND and UNION represent conjunctions and disjunctions, respectively. The sec-
ond base type of LDQL queries, tuples with a variable instead of a fixed (context) URI,
is similar to basic LDQL queries with the difference that the variable can range over al-
ternative context URIs; this type of query is meant to be used in conjunctions in which
the other conjunct is used to select context URIs at query execution time (e.g., see query
  00
qex  in Example 8 below). The following definition formalizes the query semantics.
Definition 6. The evaluation of an LDQL query q over a Web of Linked Data W,
denoted by [[q]]W , is defined recursively as follows (where u is a URI, ?v is a variable,
lpe is an LPE, P is a SPARQL graph pattern, [[P ]]D  G0 denotes the evaluation of P over
an RDF graph G0 in an RDF dataset D = hG0 , N i [1, Definition 13.3], and q1 and q2
are arbitrary LDQL queries):

    [[hu, lpe, P i]]W = [[P ]]D
                              G0 ( D = hG0 , N i is the u-lpe-selected dataset over W ),
                        S                             
  [[h?v, lpe, P i]]W = u∈U [[hu, lpe, P i]]W o
                                             n {µu }         ( where µu = {?v 7→ u} ),
   [[(q1 AND q2 )]]W = [[q1 ]]W o
                                n [[q2 ]]W ,
 [[(q1 UNION q2 )]]W = [[q1 ]]W ∪ [[q2 ]]W .

Example 8. Consider an LDQL query qex = ?x, ε, h?x, p1 , ?zi , which is of the sec-
ond base type in Definition 5 (with the SPARQL graph pattern being a single triple
pattern). Additionally, let qex   0
                                      = uA , lpe 0ex , {h?x, p1 , ?yi, h?x, p2 , ?zi} be the basic
                                                            00
LDQL query introduced in Example 7, and let qex                   be the conjunction of these two
                  00                0                                          0
queries; i.e., qex = (qex AND qex ). By Example 7 we know that [[qex              ]]Wex = {µ} (with
solution mapping µ as given in Example 7). Furthermore, based on the data given in
Example 1, it is easy to see that [[qex ]]Wex = {µ1 , µ2 } with µ1 = {?x 7→ uA , ?z 7→ uB }
                                                                            00
and µ2 = {?x 7→ uB , ?z 7→ uC }. For the evaluation of query qex                over Wex , the result
                         0
sets [[qex ]]Wex and [[qex ]]Wex have to be joined. While for µ1 ∈ [[qex ]]Wex , there does not
                                             0
exist a compatible join candidate in [[qex     ]]Wex , solution mapping µ2 ∈ [[qex ]]Wex is com-
                       0                                       00
patible with µ ∈ [[qex ]]Wex . Consequently, we have [[qex        ]]Wex = {µ0 } with µ0 = µ ∪ µ2 .

3.3   Algebraic Properties of LDQL Queries
As a basis for the discussion in the next section, we show simple algebraic properties.
Lemma 1. The operators AND and UNION are associative and commutative, respec-
tively, and the operator AND distributes over UNION . That is, if q1 , q2 , and q3 are LDQL
queries, then the following semantic equivalences hold:
  – (q1 AND q2 ) ≡ (q2 AND q1 );
    – (q1 UNION q2 ) ≡ (q2UNION q1 );         
    – q1 AND (q2 AND q3 ) ≡ (q1 AND q2 ) AND q3 ;
                                                      
    – q1 UNION (q2 UNION q3 ) ≡ (q1 UNION q2 ) UNION q3 ;
                                                            
    – q1 AND (q2 UNION q3 ) ≡ (q1 AND q2 ) UNION (q1 AND q3 ) .

Proof. Since the definition of LDQL operators AND and UNION is equivalent to their
SPARQL counterparts, the semantic equivalences in Lemma 1 follow from correspond-
ing equivalences for SPARQL graph patterns as shown by Pérez et al. [15, Lemma 2.5].

    Lemma 1 allows us to write sequences of either AND or UNION without parentheses.
Hereafter, we assume that every UNION-free LDQL query is represented as an LDQL
query of the form (q1 AND q2 AND ... AND qm ) where each subquery qi (1 ≤ i ≤ m) is
either of the form hu, lpe, P i (i.e., a basic LDQL query) or of the form h?v, lpe, P i.
Moreover, we say that an LDQL query is in UNION normal form if it is of the form
(q1 UNION q2 UNION ... UNION qn ) such that each subquery qi (1 ≤ i ≤ n) is a UNION-free
LDQL query. The following result is an immediate consequence of Lemma 1.

Corollary 1. For every LDQL query, there exists a semantically equivalent LDQL query
that is in UNION normal form.


4     Web-Safeness of LDQL Queries

In this section we study the “Web-safeness” of LDQL queries, where, informally, we
call a query Web-safe if a complete execution of the query over the WWW is possible
in practice (which is not the case for all LDQL queries as we shall see).
    To provide a more formal definition of this notion of Web-safeness we make the fol-
lowing observations. While the mathematical structures introduced by our data model
capture the notion of Linked Data on the WWW formally (and, thus, allow us to provide
a formal semantics for LDQL queries), in practice, these structures are not available
completely for the WWW. For instance, given that an infinite number of strings can be
used as HTTP URIs [7], we cannot assume complete information about which URIs are
in the domain of the partial function adoc (i.e., can be looked up to retrieve some doc-
ument) and which are not; in fact, disclosing this information would require a process
that systematically tries to look up every possible HTTP URI and, thus, would never
terminate because of the aforementioned infiniteness. Therefore, it is also impossible to
guarantee the discovery of every document in the set D (without looking up an infinite
number of URIs). Consequently, any query whose execution requires a complete enu-
meration of this set is not feasible in practice. Based on these observations, we define
Web-safeness of LDQL queries as follows.

Definition 7. An LDQL query q is Web-safe if there exists an algorithm that, for any
finite Web of Linked Data W = hD, adoci, computes [[q]]W by looking up only a finite
number of URIs without assuming an a priori availability of any information about the
sets D and dom(adoc).
                                                      0         00
Example 9. Recall our example queries qex , qex         , and qex    (cf. Example 8). For query
qex = ?x, ε, h?x, p1 , ?zi , any URI u ∈ U may be used to obtain a nonempty subset
of the query result as long as a lookup of u retrieves a document whose data includes
RDF triples that match hu, p1 , ?zi. Therefore, without access to D or dom(adoc) of the
queried Web W = hD, adoci, the completeness of the computed query result can be
guaranteed only by checking each of the infinitely many possible HTTP URIs. Hence,
query qex is not Web-safe. In contrast, although it contains qex as a subquery, query
 00
qex = (qex AND qex 0
                      ) is Web-safe, and so is qex0
                                                     = huA , lpe 0ex , Bex i. A possible execution
                 0                             0 uA
algorithm for qex may first compute [[lpe ex ]]W by traversing the queried Web W based
on the given LPE lpe 0ex . Thereafter, the algorithm retrieves documents by looking up all
URIs u ∈ [[lpe 0ex ]]uWA (or simply keeps these documents after the traversal); and, finally,
the algorithm evaluates pattern Bex over the union of the RDF triples in the retrieved
documents. If W is finite (i.e., contains a finite number of documents) the traversal
process requires a finite number of URI lookups only, and so does the retrieval of doc-
                                                                                              00
uments in the second step; the final step does not look up any URI. To see that qex              is
                                                             0
also Web-safe we note that after executing subquery qex (e.g., by using the algorithm as
outlined before), the execution of the other (non-Web-safe) subquery qex can be reduced
to a finite number of URI lookups, namely the URIs bound to variable ?x in solution
                                       0
mappings obtained for subquery qex        . Although any other URI may also be used to ob-
tain solution mappings for qex , such solution mappings cannot be joined with any of the
                            0                                                  00
solution mappings for qex      and, thus, are irrelevant for the result of qex    .
    The example illustrates that there exists an LDQL query that is not Web-safe. In
fact, it is not difficult to see that the argument for the non-Web-safeness of query qex as
made in the example can be applied to any LDQL query of the form h?v, lpe, P i; that is,
none of these queries is Web-safe. However, the example also shows that more complex
queries that contain such non-Web-safe subqueries may still be Web-safe. Therefore, we
now introduce properties to identify LDQL queries that are Web-safe (even if some of
their subqueries are not). As a basis for the discussion, we first observe that the Web-
safeness of an LDQL query carries over to any semantically equivalent LDQL query.
Fact 1. An LDQL query q is Web-safe if there exists an LDQL query q 0 such that q and
q 0 are semantically equivalent and q 0 is Web-safe.
   In conjunction with Corollary 1, Fact 1 allows us to focus our discussion on LDQL
queries in UNION normal form without losing generality. It is easy to show that such
queries are Web-safe if all of their (top-level) subqueries are Web-safe:
Proposition 1. An LDQL query of the form (q1 UNION q2 UNION ... UNION qn ) is Web-
safe if each subquery qi (1 ≤ i ≤ n) is Web-safe.
Proof. Assume each subquery qi is Web-safe. Hence, for each qi , there exists an al-
gorithm that satisfies the conditions in Definition 7. Then, the Web-safeness of LDQL
query (q1 UNION q2 UNION ... UNION qn ) is easily shown by specifying another algorithm
that calls the algorithms of the subqueries sequentially and unions their results.
    By Proposition 1, we can show that an LDQL query in UNION normal form is Web-
safe by showing that each of its UNION-free subqueries is Web-safe. Therefore, in the
remainder of this section we focus the Web-safeness of UNION-free LDQL queries.
                                                00             0
     Our discussion of the ( UNION-free) query qex = (qex AND qex ) in Example 9 suggests
that UNION -free LDQL queries can be shown to be Web-safe if it is possible to execute
any non-Web-safe subquery by using variable bindings obtained from other subqueries.
A necessary condition for such an execution strategy is that the variable in question (i.e.,
variable ?x in the case of non-Web-safe subquery qex = ?x, ε, h?x, p1 , ?zi ) is guaran-
teed to be bound in every possible solution mapping obtained from the other subqueries.
     To allow for an automated verification of this condition we adopt Buil-Aranda et
al.’s notion of strongly bound variables [5].1 To this end, for any SPARQL graph pat-
tern P , let sbvars(P ) denote the set of strongly bound variables in P as defined by
Buil-Aranda et al. [5]. For the sake of space, we do not repeat the definition here. How-
ever, we emphasize that sbvars(P ) can be constructed recursively, and each variable in
sbvars(P ) is guaranteed to be bound in every possible solution for P [5, Proposition 1].
To carry over these properties to LDQL queries, we use the notion of strongly bound
variables in SPARQL patterns to define the following notion of strongly bound variables
in LDQL queries; thereafter, in Lemma 2, we show the desired boundedness guarantee.

Definition 8. The set of strongly bound variables in a LDQL query q, denoted by
sbvars(q), is defined recursively as follows:

 1. If q is of the form hu, lpe, P i, then sbvars(q) = sbvars(P ).
 2. If q is of the form h?v, lpe, P i, then sbvars(q) = sbvars(P ) ∪ {?v}.
 3. If q is of the form (q1 AND q2 ), then sbvars(q) = sbvars(q1 ) ∪ sbvars(q2 ).
 4. If q is of the form (q1 UNION q2 ), then sbvars(q) = sbvars(q1 ) ∩ sbvars(q2 ).

Lemma 2. Let q be an LDQL query. For every Web of Linked Data W and every solu-
tion mapping µ ∈ [[q]]W , it holds that sbvars(q) ⊆ dom(µ).

Proof. Lemma 2 follows trivially from Definition 8 and [5, Proposition 1].

      We are now ready to show the following result.

Theorem 1. A UNION-free LDQL query (q1 AND q2 AND ... AND qm ) is Web-safe if there
exists a total order ≺ over the set of subqueries {q1 , q2 , ... , qm } such that for each
subquery qi (1 ≤ i ≤ m), it holdsSthat either (i) qi is a basic LDQL query or (ii) qi is
of the form h?v, lpe, P i and ?v ∈ qj ≺qi sbvars(qj ).

Proof (Sketch). We prove Theorem 1 based on an iterative algorithm that generalizes
                                               00
the execution strategy outlined for query qex     in Example 9. That is, the algorithm exe-
cutes the subqueries q1 , q2 , ... , qm sequentially in the order ≺ such that each iteration
step executes one of the subqueries by using the solution mappings computed during
the previous step. By the conditions in Theorem 1, the first subquery (according to ≺)
 1
     While we may also adopt Buil-Aranda et al.’s notion of bound variables (not to be confused
     with their notion of strongly bound variables), a definition of bound variables in LDQL queries
     would be based directly on the boundedness of variables in SPARQL patterns. Then, it is not
     difficult to see that the undecidability of verifying whether a given variable is bound in a given
     SPARQL pattern [5] would also carry over to LDQL queries. Due to space limitations, we omit
     discussing boundedness and use directly the decidable alternative (i.e., strong boundedness).
cannot be of the form h?v, lpe, P i; hence, the first step of the iteration is guaranteed to
execute a basic LDQL query. This execution resembles the execution strategy as out-
                              0
lined for the (basic) query qex in Example 9. For any subsequent iteration step, if the
subquery executed by that step is of the form h?v, lpe, P i, then the corresponding con-
dition in Theorem 1 (in conjunction with Lemma 2) guarantees that ?v ∈ dom(µ) for
every solution mapping µ obtained from the previous step. These mappings are finitely
many (for a finite Web of Linked Data). Then, the algorithm uses the URIs bound to
variable ?v in these mappings as the only possible context URIs for the execution of
the subquery (instead of using all URIs), which is sufficient because solution mappings
computed for the subquery by using some other context URI would not be join compati-
ble with any of the solution mappings obtained from the previous step. For a full formal
proof and the complete algorithm we refer to the extended version of this paper [13].

     To recapitulate, we summarize our proposed procedure to test a given LDQL query q
for Web-safeness based on our results in this paper: First, by using the algebraic prop-
erties in Lemma 1, the query has to be rewritten into a semantically equivalent LDQL
query qnf = (q1 UNION q2 UNION ... UNION qn ) that is in UNION normal form, which is pos-
sible for any LDQL query (cf. Corollary 1). Thereafter, the following Web-safeness test
has to repeated for every subquery qi (1 ≤ i ≤ n); recall that each of these subqueries
is a UNION-free LDQL query qi = (q1i AND q2i AND ... AND qm       i
                                                                    i
                                                                      ). The test is to find an or-
                           i   i          i
der for their subqueries q1 , q2 , ... , qmi that satisfies the conditions in Theorem 1. Every
top-level subquery qi (1 ≤ i ≤ n) for which such an order exists, is Web-safe (cf. The-
orem 1). If all top-level subqueries are identified to be Web-safe by this test, then qnf is
Web-safe (cf. Proposition 1), and so is q (cf. Fact 1).
     We conclude the section by pointing out the following limitation of our results: Even
if the given conditions are sufficient to show that an LDQL query is Web-safe, they are
not sufficient for showing the opposite. It remains an open question whether there exists
a (decidable) property of all Web-safe LDQL queries that is sufficient and necessary.


5    Concluding Remarks and Future Work

LDQL, the query language that we introduce in this paper, allows users to express
queries over Linked Data on the WWW. We defined LDQL such that navigational fea-
tures for selecting the query-relevant documents on the Web are separate from patterns
that are meant to be evaluated over the data in the selected documents. This separation
distinguishes LDQL from other approaches to express queries over Linked Data. For
instance, neither any of the existing SPARQL-based approaches [4,10,11,14,16] nor
                                                                            0       00
NautiLOD [8] can be used to express queries such as our example queries qex   and qex  .
    Given that our analysis of LDQL in this paper focuses primarily on Web-safeness,
in future work the language may be studied with respect to the complexity of query
evaluation and the problem of query containment. A formal study of the expressive
power of LDQL would also be interesting. A more practical direction for future research
on LDQL is the development of approaches to execute LDQL queries efficiently.
References
 1. M. Arenas, C. Gutierrez, and J. Pérez. On the Semantics of SPARQL. In Semantic Web
    Information Management - A Model-Based Perspective, chapter 13. Springer, 2009.
 2. T. Berners-Lee. Linked Data. Online at http://www.w3.org/DesignIssues/
    LinkedData.html, 2006.
 3. C. Bizer, T. Heath, and T. Berners-Lee. Linked Data – The Story So Far. Semantic Web and
    Information Systems, 5(3):1–22, 2009.
 4. P. Bouquet, C. Ghidini, and L. Serafini. Querying The Web Of Data: A Formal Approach.
    In Proceedings of the 4th Asian Semantic Web Conference, 2009.
 5. C. Buil-Aranda, M. Arenas, and O. Corcho. Semantics and Optimization of the SPARQL 1.1
    Federation Extension. In Proceedings of the 8th Extended Semantic Web Conference, 2011.
 6. R. Cyganiak, D. Wood, M. Lanthaler, G. Klyne, J. J. Carroll, and B. McBride. RDF 1.1 Con-
    cepts and Abstract Syntax. W3C Recommendation, Online at http://www.w3.org/
    TR/rdf11-concepts/, Feb. 2014.
 7. R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, L. Masinter, P. J. Leach, and T. Berners-Lee.
    Hypertext Transfer Protocol – HTTP/1.1. RFC 2616, Online at http://tools.ietf.
    org/html/rfc2616, June 1999.
 8. V. Fionda, G. Pirrò, and C. Gutierrez. NautiLOD: A Formal Language for the Web of Data
    Graph. ACM Transactions on the Web, 9(1):1–43, 2015.
 9. S. Harris, A. Seaborne, and E. Prud’hommeaux. SPARQL 1.1 Query Language. W3C Rec-
    ommendation, Online at http://www.w3.org/TR/sparql11-query/, Mar. 2013.
10. A. Harth and S. Speiser. On Completeness Classes for Query Evaluation on Linked Data. In
    Proceedings of the 26th AAAI Conference, 2012.
11. O. Hartig. SPARQL for a Web of Linked Data: Semantics and Computability. In Proceedings
    of the 9th Extended Semantic Web Conference, 2012.
12. O. Hartig. An Overview on Execution Strategies for Linked Data Queries. Datenbank-
    Spektrum, 13(2), 2013.
13. O. Hartig. LDQL: A Language for Linked Data Queries (Extended Version). Online at
    http://olafhartig.de/files/LDQL-ext.pdf, 2015.
14. O. Hartig and G. Pirrò. A Context-Based Semantics for SPARQL Property Paths over the
    Web. In Proceedings of the 12th Extended Semantic Web Conference, 2015.
15. J. Pérez, M. Arenas, and C. Gutierrez. Semantics and Complexity of SPARQL. ACM Trans-
    actions on Database Systems, 34, 2009.
16. J. Umbrich, A. Hogan, A. Polleres, and S. Decker. Link Traversal Querying for a Diverse
    Web of Data. Semantic Web Journal, 2014.